WO2022116442A1 - Geometry-based method and apparatus for screening speech samples, computer device and storage medium - Google Patents

Geometry-based method and apparatus for screening speech samples, computer device and storage medium Download PDF

Info

Publication number
WO2022116442A1
WO2022116442A1 PCT/CN2021/083934 CN2021083934W WO2022116442A1 WO 2022116442 A1 WO2022116442 A1 WO 2022116442A1 CN 2021083934 W CN2021083934 W CN 2021083934W WO 2022116442 A1 WO2022116442 A1 WO 2022116442A1
Authority
WO
WIPO (PCT)
Prior art keywords
voice
speech
feature
sample
initial
Prior art date
Application number
PCT/CN2021/083934
Other languages
English (en)
Chinese (zh)
Inventor
罗剑
王健宗
程宁
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Publication of WO2022116442A1

Links

Images

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/26 - Speech to text systems
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum

Definitions

  • The present application relates to the technical field of artificial intelligence speech semantics, and in particular to a geometry-based method, apparatus, computer device and storage medium for screening speech samples.
  • DNN: Deep Neural Network.
  • Active learning is a branch of machine learning that allows the model to choose the data to learn on its own.
  • the idea of active learning comes from the assumption that a machine learning algorithm, if it can choose the data it wants to learn from, will perform better with less training data.
  • the most widely used active learning query strategy is called uncertainty sampling (Uncertainty Sampling), in which the model will select the most uncertain samples predicted by the model for labeling.
  • The inventors realized that this technique achieves good results with a small number of selected samples. However, when a deep neural network is used as the training model, the model requires a large amount of training data; as the number of selected labeled samples grows, the uncertain samples predicted by the model become redundant and overlapping, and similar samples are more likely to be selected. Selecting these similar samples is of limited help in model training.
  • voice data is different from non-sequential data such as pictures.
  • Voice data has the characteristics of variable length and rich structured information, which makes it more difficult to process and select voice data.
  • The embodiments of the present application provide a geometry-based voice sample screening method, apparatus, computer device and storage medium, aiming to solve the following problems that arise in the prior art when uncertainty sampling is used to train a neural network for speech recognition:
  • the uncertain samples predicted by the model are redundant and overlapping, and these similar samples are of limited help for model training; and
  • owing to the complex structure of speech, it is difficult for uncertainty sampling techniques to select speech samples.
  • an embodiment of the present application provides a method for screening speech samples based on geometry, which includes:
  • Obtaining an initial voice sample set, and extracting the voice feature corresponding to each piece of initial voice sample data in the initial voice sample set to form a voice feature set; wherein the initial voice sample set includes multiple pieces of initial voice sample data;
  • wherein the to-be-trained speech recognition model includes a link timing classification (CTC) sub-model and an attention-mechanism-based sub-model;
  • the voice feature corresponding to the current to-be-recognized voice data is input into the voice recognition model for operation, and the current voice recognition result is obtained and sent to the client.
  • an embodiment of the present application provides a geometry-based voice sample screening device, which includes:
  • a voice feature extraction unit, configured to obtain an initial voice sample set, and extract the voice feature corresponding to each piece of initial voice sample data in the initial voice sample set to form a voice feature set; wherein the initial voice sample set includes a plurality of pieces of initial voice sample data;
  • a voice feature clustering unit configured to obtain the Euclidean distance between each voice feature in the voice feature set through a dynamic time warping algorithm, and perform K-means clustering according to the Euclidean distance between each voice feature to obtain a clustering result;
  • a clustering result screening unit configured to invoke preset sample subset screening conditions, and obtain clusters in the clustering results that satisfy the sample subset screening conditions, so as to form a target cluster set;
  • a label value obtaining unit configured to obtain a label value corresponding to each voice feature in the target cluster set, so as to obtain a current voice sample set corresponding to the target cluster set;
  • a speech recognition model training unit, configured to use each speech feature in the current speech sample set as the input of the to-be-trained speech recognition model and the label value corresponding to each speech feature as the output of the to-be-trained speech recognition model, so as to train the to-be-trained speech recognition model and obtain a speech recognition model; wherein the to-be-trained speech recognition model includes a link timing classification (CTC) sub-model and an attention-mechanism-based sub-model; and
  • a speech recognition result sending unit, used for, if currently to-be-recognized speech data uploaded by a user terminal is detected, inputting the speech feature corresponding to the currently to-be-recognized speech data into the speech recognition model for computation, and obtaining and sending the current speech recognition result to the user terminal.
  • An embodiment of the present application further provides a computer device, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the following steps:
  • Obtaining an initial voice sample set, and extracting the voice feature corresponding to each piece of initial voice sample data in the initial voice sample set to form a voice feature set; wherein the initial voice sample set includes multiple pieces of initial voice sample data;
  • wherein the to-be-trained speech recognition model includes a link timing classification (CTC) sub-model and an attention-mechanism-based sub-model;
  • the voice feature corresponding to the current to-be-recognized voice data is input into the voice recognition model for operation, and the current voice recognition result is obtained and sent to the client.
  • Embodiments of the present application further provide a computer-readable storage medium, wherein the computer-readable storage medium stores a computer program, and the computer program, when executed by a processor, causes the processor to perform the following operations:
  • Obtaining an initial voice sample set, and extracting the voice feature corresponding to each piece of initial voice sample data in the initial voice sample set to form a voice feature set; wherein the initial voice sample set includes multiple pieces of initial voice sample data;
  • wherein the to-be-trained speech recognition model includes a link timing classification (CTC) sub-model and an attention-mechanism-based sub-model;
  • the voice feature corresponding to the current to-be-recognized voice data is input into the voice recognition model for operation, and the current voice recognition result is obtained and sent to the client.
  • The embodiments of the present application provide a geometry-based method, apparatus, computer device and storage medium for screening voice samples, which realize automatic selection of samples with less redundancy to train a voice recognition model, reduce the labeling cost of voice recognition tasks in the context of deep learning, and improve the training speed of the voice recognition model.
  • FIG. 1 is a schematic diagram of an application scenario of the geometry-based voice sample screening method provided by an embodiment of the present application
  • FIG. 2 is a schematic flowchart of a geometry-based voice sample screening method provided by an embodiment of the present application
  • FIG. 3 is a schematic block diagram of a geometry-based voice sample screening apparatus provided by an embodiment of the present application.
  • FIG. 4 is a schematic block diagram of a computer device provided by an embodiment of the present application.
  • FIG. 1 is a schematic diagram of an application scenario of the geometry-based voice sample screening method provided by an embodiment of the present application, and FIG. 2 is a schematic flowchart of the geometry-based voice sample screening method provided by an embodiment of the present application. The geometry-based voice sample screening method is applied in a server, and the method is executed by application software installed in the server.
  • the method includes steps S110-S160.
  • Specifically, data preprocessing and feature extraction may be performed on the initial speech sample set to obtain the speech feature corresponding to each piece of initial speech sample data and form a speech feature set.
  • Data preprocessing includes pre-emphasis, framing, windowing and other operations. The purpose of these operations is to eliminate the influence on speech quality of aliasing, higher-order harmonic distortion, high-frequency components and other factors caused by the imperfections of the human vocal organs and of the acquisition equipment, so that the obtained signal is as uniform and smooth as possible.
  • step S110 includes:
  • Mel-frequency cepstral coefficient or filter-bank feature extraction is then performed on the preprocessed speech data corresponding to each piece of initial speech sample data, to obtain the speech feature corresponding to each piece of initial speech sample data and form a speech feature set.
  • Specifically, the initial speech sample data (denoted as s(t)) is first sampled with a sampling period T and discretized into s(n).
  • The first-order FIR high-pass digital filter is a first-order non-recursive high-pass digital filter whose transfer function has the form of formula (1): H(z) = 1 - a·z^(-1), where a is the pre-emphasis coefficient, typically close to 1.
  • Let the time-domain signal corresponding to the windowed speech data be x(l), and let the n-th frame of speech data in the windowed and framed preprocessed speech data be xn(m); then xn(m) satisfies formula (3): xn(m) = ω(m)·x(nT + m), 0 ≤ m ≤ N - 1, where T is the frame shift, N is the frame length, and ω(·) is the Hamming window function.
  • By preprocessing the initial speech sample data in this way, it can be effectively used for subsequent acoustic parameter extraction, such as extracting Mel-frequency cepstral coefficients (MFCC) or filter-bank (Filter-Bank) features.
  • the speech feature corresponding to each piece of initial speech sample data can be obtained to form a speech feature set.
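  • To make the preprocessing and feature-extraction step concrete, the following is a minimal Python sketch of the operations described above (sampling, pre-emphasis with a first-order FIR high-pass filter, Hamming windowing, framing with a frame shift, and MFCC extraction). It is an illustration only: the pre-emphasis coefficient 0.97, the 16 kHz sampling rate, the frame parameters and the use of the librosa library are assumptions, not values taken from this publication.

```python
# Illustrative sketch of preprocessing and feature extraction (step S110).
# Sampling rate, frame parameters, pre-emphasis coefficient and the use of
# librosa are assumptions made for illustration, not values from this text.
import numpy as np
import librosa


def preprocess_and_extract(wav_path, sr=16000, frame_len=400, frame_shift=160, n_mfcc=13):
    s, _ = librosa.load(wav_path, sr=sr)                 # discretized signal s(n)
    s = np.append(s[0], s[1:] - 0.97 * s[:-1])           # pre-emphasis: s(n) - a*s(n-1), a = 0.97
    window = np.hamming(frame_len)                       # Hamming window
    n_frames = 1 + max(0, (len(s) - frame_len) // frame_shift)
    frames = np.stack([s[i * frame_shift: i * frame_shift + frame_len] * window
                       for i in range(n_frames)])        # windowed frames, frame shift = 160 samples
    # MFCC features per frame; filter-bank features could be used instead
    mfcc = librosa.feature.mfcc(y=s, sr=sr, n_mfcc=n_mfcc,
                                n_fft=frame_len, hop_length=frame_shift)
    return frames, mfcc.T                                # (n_frames, frame_len), (n_frames, n_mfcc)
```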
  • The quantification can be carried out by calculating the Euclidean distance between the speech features corresponding to any two pieces of initial speech sample data.
  • Step S120 of obtaining the Euclidean distance between the speech features in the speech feature set through a dynamic time warping algorithm includes:
  • obtaining the i-th speech feature and the j-th speech feature in the speech feature set; wherein the speech feature set includes N speech features, the value ranges of i and j are both [1, N], and i is not equal to j;
  • judging whether the number of frames of the first speech sequence corresponding to the i-th speech feature is equal to the number of frames of the second speech sequence corresponding to the j-th speech feature; and
  • if they are not equal, constructing an n*m distance matrix D and obtaining the minimum value (i.e., the length of the shortest warping path from d(0,0) to d(n,m)) as the Euclidean distance between the i-th speech feature and the j-th speech feature; wherein n is equal to the number of frames of the first speech sequence, m is equal to the number of frames of the second speech sequence, and d(x, y) in the distance matrix D indicates the Euclidean distance between the x-th frame of the i-th speech feature and the y-th frame of the j-th speech feature.
  • the method for calculating the Euclidean distance between any two voice features in the voice feature set is described by taking the calculation of the ith voice feature and the jth voice feature in the voice feature set as an example.
  • the above calculation process can be stopped after the Euclidean distance between the speech features in the speech feature set is obtained.
  • To calculate the Euclidean distance between any two speech features, it is first necessary to determine whether the numbers of frames of their speech sequences are equal (for example, to determine whether the number of frames of the first speech sequence corresponding to the i-th speech feature is equal to the number of frames of the second speech sequence corresponding to the j-th speech feature). If the numbers of frames are not equal, an n*m distance matrix D needs to be constructed, and the minimum cumulative value over the matrix elements of D is taken as the Euclidean distance between the i-th speech feature and the j-th speech feature, where n is equal to the number of frames of the first speech sequence, m is equal to the number of frames of the second speech sequence, and d(x, y) in the distance matrix D represents the Euclidean distance between the x-th frame of the i-th speech feature and the y-th frame of the j-th speech feature.
  • With the matrix element d(x, y) representing the Euclidean distance between the x-th frame of the i-th speech feature and the y-th frame of the j-th speech feature, a path is found from d(0,0) to d(n,m), and the length of this path is taken as the distance between the i-th speech feature and the j-th speech feature; the path satisfies continuity and time monotonicity (no backtracking).
  • the above calculation process adopts the dynamic time warping algorithm (the full name is Dynamic Time Warping, abbreviated as DTW).
  • After the step of judging whether the number of frames of the first speech sequence corresponding to the i-th speech feature is equal to the number of frames of the second speech sequence corresponding to the j-th speech feature, step S120 further includes:
  • if the numbers of frames are equal, directly calculating the Euclidean distance between the i-th speech feature and the j-th speech feature.
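  • The distance computation described above can be sketched as the following generic dynamic time warping routine over two variable-length feature sequences; it is not code from the publication, and the equal-length shortcut mirrors the branch in which the Euclidean distance is calculated directly.

```python
# Generic DTW sketch: cumulative cost of the optimal monotone, continuous path.
import numpy as np


def dtw_distance(feat_i, feat_j):
    """DTW-based distance between two feature sequences of shapes (n, d) and (m, d)."""
    n, m = len(feat_i), len(feat_j)
    if n == m:
        # equal frame counts: frame-wise Euclidean distance, no warping needed
        return float(np.linalg.norm(feat_i - feat_j, axis=1).sum())
    d = np.linalg.norm(feat_i[:, None, :] - feat_j[None, :, :], axis=2)  # distance matrix D
    acc = np.full((n, m), np.inf)
    acc[0, 0] = d[0, 0]
    for x in range(n):
        for y in range(m):
            if x == 0 and y == 0:
                continue
            prev = min(acc[x - 1, y] if x > 0 else np.inf,               # monotone, continuous moves
                       acc[x, y - 1] if y > 0 else np.inf,
                       acc[x - 1, y - 1] if x > 0 and y > 0 else np.inf)
            acc[x, y] = d[x, y] + prev
    return float(acc[-1, -1])                                            # cost of the optimal warping path
```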
  • Performing K-means clustering according to the Euclidean distance between the speech features in step S120 to obtain a clustering result includes:
  • selecting, from the speech feature set, the same number of speech features as the preset number of clusters and using the selected speech features as the initial cluster centers of the clusters;
  • dividing the speech feature set according to the Euclidean distance between each speech feature in the speech feature set and each initial cluster center to obtain an initial clustering result;
  • obtaining the adjusted cluster center of each cluster according to the initial clustering result; and
  • dividing the speech feature set again according to the Euclidean distance to the adjusted cluster centers, until the clustering result remains unchanged for more than a preset number of iterations, so as to obtain the clusters corresponding to the preset number of clusters and form the clustering result.
  • the speech feature set can be clustered by the K-means clustering method at this time, and the specific process is as follows:
  • a) Arbitrarily select N2 speech features from the speech feature set containing N1 speech features and use them as the initial cluster centers of N2 clusters; wherein the initial total number of speech features in the speech feature set is N1, N2 ≤ N1, and N2 is the preset number of clusters, that is, the number of expected clusters.
  • b) Assign each speech feature in the speech feature set to the cluster whose center is nearest according to the Euclidean distance, obtaining an initial clustering result. c) Obtain the adjusted cluster center of each cluster according to the current clustering result. d) Repeat steps b) and c) until the clustering result no longer changes, and the clustering result corresponding to the preset number of clusters is obtained.
  • the speech feature set can be quickly grouped to obtain multiple clusters.
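  • Because the speech features have variable lengths and only pairwise DTW-based distances are available, a mean vector cannot be formed directly; the sketch below therefore uses a medoid-style adaptation of K-means over a precomputed distance matrix. The medoid choice of the adjusted cluster center and the stopping rule are assumptions made for illustration.

```python
# Medoid-style K-means sketch over a precomputed pairwise distance matrix.
import numpy as np


def kmeans_on_distances(dist, n_clusters, max_iter=100, seed=0):
    """dist: (N, N) symmetric matrix of DTW-based distances; returns N cluster labels."""
    rng = np.random.default_rng(seed)
    n = dist.shape[0]
    centers = rng.choice(n, size=n_clusters, replace=False)      # initial cluster centers
    labels = np.zeros(n, dtype=int)
    for _ in range(max_iter):
        labels = np.argmin(dist[:, centers], axis=1)             # assign each feature to nearest center
        new_centers = centers.copy()
        for k in range(n_clusters):
            members = np.where(labels == k)[0]
            if len(members) == 0:
                continue
            # adjusted center: the member minimizing total distance within the cluster
            within = dist[np.ix_(members, members)].sum(axis=1)
            new_centers[k] = members[np.argmin(within)]
        if np.array_equal(new_centers, centers):                  # clustering result unchanged: stop
            break
        centers = new_centers
    return labels
```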
  • the server can select clusters that meet the conditions from multiple clusters as training samples and label them.
  • the sample subset screening condition may be set such that the sample redundancy is the minimum value among the multiple sample subsets, so that target clusters can be filtered out to form a target cluster set.
  • Regarding the sample redundancy of a cluster, it measures the degree of data repetition. For example, if the total number of data items in a certain sample subset is Y1 and the total number of repeated data items is Y2, the sample redundancy of this sample subset is Y2/Y1. Selecting a subset of samples with less redundancy can significantly reduce the labeling cost of speech recognition tasks in the context of deep learning.
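  • Once a criterion for counting repeated data is fixed, the redundancy Y2/Y1 and the screening of the minimum-redundancy cluster can be computed as in the sketch below; treating two samples as repeated when their pairwise distance falls below a threshold is an assumption made only to give the formula a concrete form.

```python
# Redundancy-based screening sketch; the duplicate criterion is an assumption.
import numpy as np


def sample_redundancy(member_idx, dist, tol=1e-3):
    """Redundancy Y2/Y1 of one cluster: Y1 = cluster size, Y2 = near-duplicate samples."""
    y1 = len(member_idx)
    y2 = 0
    for a, i in enumerate(member_idx):
        for j in member_idx[:a]:
            if dist[i, j] < tol:          # near-zero distance counted as a repeated sample
                y2 += 1
                break
    return y2 / y1 if y1 else 0.0


def select_target_clusters(labels, dist, n_clusters):
    """Pick the cluster(s) with minimum redundancy as the target cluster set."""
    scores = [sample_redundancy(np.where(labels == k)[0], dist) for k in range(n_clusters)]
    best = min(scores)
    return [k for k, s in enumerate(scores) if s == best]
```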
  • Since the target cluster set has been selected, only a small number of samples need to be labeled at this time; that is, the current speech sample set corresponding to the target cluster set is obtained. Using less labeled data can significantly improve the training speed of the speech recognition model and reduce the computational pressure on the speech processing system.
  • the speech recognition model to be trained includes a link time series classification sub-model and an attention-based mechanism sub-model.
  • a hybrid CTC (Connectionist Temporal Classification, that is, link timing classification) model and an Attention model (that is, a model based on an attention mechanism) can be used.
  • CTC decoding recognizes speech by predicting the output of each frame. The implementation of the algorithm is based on the assumption that the decoding of each frame remains independent, so it lacks the connection between the speech features before and after the decoding process, and relies on the correction of the language model.
  • the Attention decoding process has nothing to do with the frame order of the input speech.
  • Each decoding unit generates the current result through the decoding result of the previous unit and the overall speech characteristics.
  • The decoding process ignores the monotonic timing of the speech, so a hybrid model can be used, taking into account the advantages of both.
  • the link time series classification sub-model is set closer to the input for preliminary processing, and the attention-based sub-model is set closer to the output for subsequent processing.
  • the network structure of the speech recognition model to be trained adopts LSTM/CNN/GRU and other structures, and the two decoders jointly output the recognition result.
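  • A minimal PyTorch sketch of such a hybrid architecture is given below: a shared recurrent encoder, a CTC head on the input side, and an attention-based decoder on the output side, trained with a weighted sum of the two losses. The layer sizes, the 0.3/0.7 loss weights, the use of nn.MultiheadAttention and the assumption that token index 0 is a start symbol are illustrative choices; the publication only specifies the general LSTM/CNN/GRU family and the joint use of the two decoders.

```python
# Illustrative hybrid CTC/attention model; sizes and weights are assumptions.
import torch
import torch.nn as nn


class HybridCTCAttention(nn.Module):
    def __init__(self, feat_dim=13, hidden=256, vocab=1000):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, num_layers=2,
                               batch_first=True, bidirectional=True)
        self.ctc_head = nn.Linear(2 * hidden, vocab + 1)        # extra class = CTC blank
        self.embed = nn.Embedding(vocab, 2 * hidden)
        self.att = nn.MultiheadAttention(2 * hidden, num_heads=4, batch_first=True)
        self.dec_rnn = nn.GRU(2 * hidden, 2 * hidden, batch_first=True)
        self.out = nn.Linear(2 * hidden, vocab)
        self.ctc_loss = nn.CTCLoss(blank=vocab, zero_infinity=True)

    def forward(self, feats, feat_lens, tokens, token_lens):
        enc, _ = self.encoder(feats)                             # (B, T, 2H) shared encoder states
        ctc_logp = self.ctc_head(enc).log_softmax(-1)            # CTC branch near the input side
        loss_ctc = self.ctc_loss(ctc_logp.transpose(0, 1), tokens, feat_lens, token_lens)

        sos = torch.zeros_like(tokens[:, :1])                    # assume index 0 is a start symbol
        dec_in = self.embed(torch.cat([sos, tokens[:, :-1]], dim=1))
        ctx, _ = self.att(dec_in, enc, enc)                      # attend over all encoder states
        dec_out, _ = self.dec_rnn(ctx)
        logits = self.out(dec_out)                               # attention branch near the output side
        loss_att = nn.functional.cross_entropy(
            logits.reshape(-1, logits.size(-1)), tokens.reshape(-1))
        return 0.3 * loss_ctc + 0.7 * loss_att                   # joint training objective
```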
  • After step S150, the method further includes: uploading the first model parameter set corresponding to the link timing classification sub-model and the second model parameter set corresponding to the attention-mechanism-based sub-model in the speech recognition model to a blockchain network.
  • Specifically, the corresponding summary information is obtained based on the first model parameter set and the second model parameter set.
  • The summary information is obtained by hashing the first model parameter set and the second model parameter set, for example, by using the SHA-256 algorithm.
  • Uploading summary information to the blockchain ensures its security and fairness and transparency to users.
  • the user equipment can download the summary information from the blockchain, so as to verify whether the first model parameter set and the second model parameter set have been tampered with.
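  • A hashed summary of the two parameter sets can be produced with Python's standard hashlib module, as sketched below; how the parameters are serialized before hashing is an implementation choice, so the JSON serialization used here is an assumption.

```python
# Illustrative SHA-256 summary of the two model parameter sets.
import hashlib
import json


def parameter_digest(first_params, second_params):
    """Digest of the CTC and attention parameter sets.

    The parameters are assumed to be plain Python structures (e.g. tensors
    already converted to nested lists); sorting keys makes the digest stable.
    """
    payload = json.dumps({"ctc": first_params, "attention": second_params},
                         sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# A verifier that downloads the digest from the chain can recompute and compare:
# assert parameter_digest(ctc_params, att_params) == digest_from_blockchain
```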
  • the blockchain referred to in this example is a new application mode of computer technology such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm.
  • Blockchain is essentially a decentralized database; it is a chain of data blocks associated with one another by cryptographic methods, and each data block contains a batch of network transaction information, which is used to verify the validity (anti-counterfeiting) of the information and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.
  • After the training of the speech recognition model is completed in the server, it can be applied to speech recognition. That is, once the server receives the current voice data to be recognized uploaded by the user terminal, it inputs the voice feature corresponding to the current voice data to be recognized into the voice recognition model for computation, and obtains and sends the current voice recognition result to the user terminal. In this way, the current speech recognition result can be quickly fed back.
  • step S160 includes:
  • the first recognition sequence is input into the attention-mechanism-based sub-model for computation, and the current speech recognition result is obtained and sent to the user terminal.
  • Since the link sequence classification sub-model is set closer to the input end and the attention-based sub-model is set closer to the output end, the currently to-be-recognized speech data is first input into the link sequence classification sub-model for computation to obtain a first recognition sequence, and then the first recognition sequence is input into the attention-mechanism-based sub-model for computation to obtain the current speech recognition result.
  • In this way, the relationship between the speech features before and after the decoding process is fully considered, the monotonic timing of speech is also taken into account, and the results obtained by this model are more accurate.
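  • The decoding order described above (link timing classification sub-model first, attention-based sub-model second) could be wired as in the following sketch, which reuses the modules of the training sketch given earlier; the greedy CTC collapse and the batch-of-one handling are illustrative assumptions.

```python
# Illustrative two-stage decoding: CTC first pass, then attention refinement.
import torch


@torch.no_grad()
def recognize(model, feats):
    """`model` is assumed to expose the sub-modules of the training sketch above;
    `feats` is a (1, T, feat_dim) tensor for a single utterance."""
    enc, _ = model.encoder(feats)                               # shared encoder states
    ctc_ids = model.ctc_head(enc).argmax(-1).squeeze(0)         # frame-wise CTC labels
    blank = model.ctc_head.out_features - 1
    first_seq = [int(t) for i, t in enumerate(ctc_ids)          # collapse repeats, drop blanks
                 if t != blank and (i == 0 or t != ctc_ids[i - 1])]

    dec_in = model.embed(torch.tensor([first_seq]))             # feed the first recognition sequence
    ctx, _ = model.att(dec_in, enc, enc)                        # attend over the whole utterance
    dec_out, _ = model.dec_rnn(ctx)
    return model.out(dec_out).argmax(-1).squeeze(0).tolist()    # current speech recognition result
```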
  • This method realizes the automatic selection of samples with less redundancy to train the speech recognition model, reduces the labeling cost of speech recognition tasks in the context of deep learning, and improves the training speed of the speech recognition model.
  • An embodiment of the present application further provides a geometry-based voice sample screening apparatus, which is used to perform any of the foregoing embodiments of the geometry-based voice sample screening method.
  • FIG. 3 is a schematic block diagram of the apparatus for screening speech samples based on geometry provided by an embodiment of the present application.
  • the geometry-based voice sample screening apparatus 100 may be configured in a server.
  • The geometry-based voice sample screening device 100 includes: a voice feature extraction unit 110, a voice feature clustering unit 120, a clustering result screening unit 130, a label value obtaining unit 140, a voice recognition model training unit 150, and a speech recognition result sending unit 160.
  • The voice feature extraction unit 110 is configured to obtain an initial voice sample set, and extract the voice feature corresponding to each piece of initial voice sample data in the initial voice sample set to form a voice feature set; wherein the initial voice sample set includes a plurality of pieces of initial voice sample data.
  • Specifically, data preprocessing and feature extraction may be performed on the initial speech sample set to obtain the speech feature corresponding to each piece of initial speech sample data and form a speech feature set.
  • Data preprocessing includes pre-emphasis, framing, windowing and other operations. The purpose of these operations is to eliminate the influence on speech quality of aliasing, higher-order harmonic distortion, high-frequency components and other factors caused by the imperfections of the human vocal organs and of the acquisition equipment, so that the obtained signal is as uniform and smooth as possible.
  • the speech feature extraction unit 110 includes:
  • a discrete sampling unit configured to call a pre-stored sampling period to sample each piece of initial voice sample data in the initial voice sample set, to obtain a current discrete voice signal corresponding to each piece of initial voice sample data;
  • the pre-emphasis unit is used to call the pre-stored first-order FIR high-pass digital filter to pre-emphasize the current discrete voice signal corresponding to each initial voice sample data, and obtain the current pre-emphasized voice signal corresponding to each initial voice sample data ;
  • the windowing unit is used to call the pre-stored Hamming window to add a window to the current pre-emphasized voice signal corresponding to each initial voice sample data, to obtain the windowed voice data corresponding to each initial voice sample data;
  • the framing unit is used to call the pre-stored frame shift and frame length to divide the windowed voice data corresponding to each initial voice sample data into frames, and obtain the preprocessed voice data corresponding to each initial voice sample data. ;
  • a feature extraction unit, used to perform Mel-frequency cepstral coefficient or filter-bank feature extraction on the preprocessed speech data corresponding to each piece of initial speech sample data, to obtain the speech feature corresponding to each piece of initial speech sample data and form a speech feature set.
  • Specifically, the initial speech sample data (denoted as s(t)) is first sampled with a sampling period T and discretized into s(n).
  • The first-order FIR high-pass digital filter is a first-order non-recursive high-pass digital filter, and its transfer function is as shown in the above formula (1).
  • Let the time-domain signal corresponding to the windowed speech data be x(l), and let the n-th frame of speech data in the windowed and framed preprocessed speech data be xn(m); then xn(m) satisfies the above formula (3).
  • By preprocessing the initial speech sample data in this way, it can be effectively used for subsequent acoustic parameter extraction, such as extracting Mel-frequency cepstral coefficients (MFCC) or filter-bank (Filter-Bank) features.
  • the speech feature corresponding to each piece of initial speech sample data can be obtained to form a speech feature set.
  • the voice feature clustering unit 120 is used to obtain the Euclidean distance between each voice feature in the voice feature set through a dynamic time warping algorithm, and perform K-means clustering according to the Euclidean distance between each voice feature to obtain a clustering result .
  • The quantification can be carried out by calculating the Euclidean distance between the speech features corresponding to any two pieces of initial speech sample data.
  • the speech feature clustering unit 120 includes:
  • the voice feature selection unit is used to obtain the ith voice feature and the jth voice feature in the voice feature set; wherein, the voice feature set includes N voice features, and the value ranges of i and j are both [1, N], and i and j are not equal;
  • a voice sequence frame number comparison unit used for judging whether the first voice sequence frame number corresponding to the ith voice feature is equal to the second voice sequence frame number corresponding to the jth voice feature;
  • a first calculation unit, used for, if the number of frames of the first speech sequence corresponding to the i-th speech feature is not equal to the number of frames of the second speech sequence corresponding to the j-th speech feature, constructing an n*m distance matrix D and taking the minimum cumulative value over the matrix elements of the distance matrix D as the Euclidean distance between the i-th speech feature and the j-th speech feature; wherein n is equal to the number of frames of the first speech sequence, m is equal to the number of frames of the second speech sequence, and d(x, y) in the distance matrix D represents the Euclidean distance between the x-th frame of the i-th speech feature and the y-th frame of the j-th speech feature.
  • the method for calculating the Euclidean distance between any two voice features in the voice feature set is described by taking the calculation of the ith voice feature and the jth voice feature in the voice feature set as an example.
  • the above calculation process can be stopped after the Euclidean distance between the speech features in the speech feature set is obtained.
  • To calculate the Euclidean distance between any two speech features, it is first necessary to determine whether the numbers of frames of their speech sequences are equal (for example, to determine whether the number of frames of the first speech sequence corresponding to the i-th speech feature is equal to the number of frames of the second speech sequence corresponding to the j-th speech feature). If the numbers of frames are not equal, an n*m distance matrix D needs to be constructed, and the minimum cumulative value over the matrix elements of D is taken as the Euclidean distance between the i-th speech feature and the j-th speech feature, where n is equal to the number of frames of the first speech sequence, m is equal to the number of frames of the second speech sequence, and d(x, y) in the distance matrix D represents the Euclidean distance between the x-th frame of the i-th speech feature and the y-th frame of the j-th speech feature.
  • With the matrix element d(x, y) representing the Euclidean distance between the x-th frame of the i-th speech feature and the y-th frame of the j-th speech feature, a path is found from d(0,0) to d(n,m), and the length of this path is taken as the distance between the i-th speech feature and the j-th speech feature; the path satisfies continuity and time monotonicity (no backtracking).
  • the above calculation process adopts the dynamic time warping algorithm (the full name is Dynamic Time Warping, abbreviated as DTW).
  • the voice feature clustering unit 120 further includes:
  • a second calculation unit, configured to calculate the Euclidean distance between the i-th speech feature and the j-th speech feature directly if the number of frames of the first speech sequence corresponding to the i-th speech feature is equal to the number of frames of the second speech sequence corresponding to the j-th speech feature.
  • the speech feature clustering unit 120 includes:
  • the initial cluster center acquisition unit is used to select the same number of voice features as the preset number of clusters in the voice feature set, and use the selected voice feature as the initial cluster center of each cluster;
  • the initial clustering unit is used to divide the voice feature set according to the Euclidean distance between each voice feature in the voice feature set and each initial cluster center to obtain an initial clustering result
  • the cluster center adjustment unit is used to obtain the adjusted cluster center of each cluster according to the initial clustering result
  • a clustering adjustment unit, used to re-divide the speech feature set according to the Euclidean distance to the adjusted cluster centers, until the clustering result remains unchanged for more than a preset number of iterations, so as to obtain the clusters corresponding to the preset number of clusters and form the clustering result.
  • the speech feature set can be clustered by the K-means clustering method at this time. After the cluster classification is completed, the speech feature set can be quickly grouped to obtain multiple clusters. After that, the server can select clusters that meet the conditions from multiple clusters as training samples and label them.
  • the clustering result screening unit 130 is configured to invoke preset sample subset screening conditions, and acquire clusters in the clustering results that satisfy the sample subset screening conditions, so as to form a target cluster set.
  • the sample subset screening condition may be set such that the sample redundancy is the minimum value among the multiple sample subsets, so that target clusters can be filtered out to form a target cluster set.
  • Regarding the sample redundancy of a cluster, it measures the degree of data repetition. For example, if the total number of data items in a certain sample subset is Y1 and the total number of repeated data items is Y2, the sample redundancy of this sample subset is Y2/Y1. Selecting a subset of samples with less redundancy can significantly reduce the labeling cost of speech recognition tasks in the context of deep learning.
  • the label value obtaining unit 140 is configured to obtain label values corresponding to each speech feature in the target cluster set, so as to obtain a current speech sample set corresponding to the target cluster set.
  • Since the target cluster set has been selected, only a small number of samples need to be labeled at this time; that is, the current speech sample set corresponding to the target cluster set is obtained. Using less labeled data can significantly improve the training speed of the speech recognition model and reduce the computational pressure on the speech processing system.
  • The speech recognition model training unit 150 is configured to use each speech feature in the current speech sample set as the input of the to-be-trained speech recognition model and the label value corresponding to each speech feature as the output of the to-be-trained speech recognition model, so as to train the to-be-trained speech recognition model and obtain a speech recognition model; wherein the to-be-trained speech recognition model includes a link time sequence classification (CTC) sub-model and an attention-mechanism-based sub-model.
  • a hybrid CTC (Connectionist Temporal Classification, that is, link timing classification) model and an Attention model (that is, a model based on an attention mechanism) can be used.
  • CTC decoding recognizes speech by predicting the output of each frame. The implementation of the algorithm is based on the assumption that the decoding of each frame remains independent, so it lacks the connection between the speech features before and after the decoding process, and relies on the correction of the language model.
  • the Attention decoding process has nothing to do with the frame order of the input speech.
  • Each decoding unit generates the current result through the decoding result of the previous unit and the overall speech characteristics.
  • The decoding process ignores the monotonic timing of the speech, so a hybrid model can be used, taking into account the advantages of both.
  • the link time series classification sub-model is set closer to the input for preliminary processing, and the attention-based sub-model is set closer to the output for subsequent processing.
  • the network structure of the speech recognition model to be trained adopts LSTM/CNN/GRU and other structures, and the two decoders jointly output the recognition result.
  • the geometry-based voice sample screening apparatus 100 further includes:
  • the data uploading unit is used for uploading the first model parameter set corresponding to the link time series classification sub-model and the second model parameter set corresponding to the attention mechanism-based sub-model in the speech recognition model to the blockchain network.
  • the corresponding summary information is obtained based on the first model parameter set and the second model parameter set.
  • The summary information is obtained by hashing the first model parameter set and the second model parameter set, for example, by using the SHA-256 algorithm.
  • Uploading summary information to the blockchain ensures its security and fairness and transparency to users.
  • the user equipment can download the summary information from the blockchain, so as to verify whether the first model parameter set and the second model parameter set have been tampered with.
  • the blockchain referred to in this example is a new application mode of computer technology such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm.
  • Blockchain is essentially a decentralized database; it is a chain of data blocks associated with one another by cryptographic methods, and each data block contains a batch of network transaction information, which is used to verify the validity (anti-counterfeiting) of the information and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.
  • The speech recognition result sending unit 160 is configured to, if currently to-be-recognized speech data uploaded by the user terminal is detected, input the speech feature corresponding to the currently to-be-recognized speech data into the speech recognition model for computation, and obtain and send the current speech recognition result to the user terminal.
  • After the training of the speech recognition model is completed in the server, it can be applied to speech recognition. That is, once the server receives the current voice data to be recognized uploaded by the user terminal, it inputs the voice feature corresponding to the current voice data to be recognized into the voice recognition model for computation, and obtains and sends the current voice recognition result to the user terminal. In this way, the current speech recognition result can be quickly fed back.
  • the speech recognition result sending unit 160 includes:
  • the first decoding unit is used for inputting the speech feature corresponding to the currently to-be-recognized speech data into the link sequence classification sub-model for operation to obtain a first recognition sequence;
  • the second decoding unit is configured to input the first recognition sequence into the sub-model based on the attention mechanism for operation, obtain and send the current speech recognition result to the user terminal.
  • Since the link sequence classification sub-model is set closer to the input end and the attention-based sub-model is set closer to the output end, the currently to-be-recognized speech data is first input into the link sequence classification sub-model for computation to obtain a first recognition sequence, and then the first recognition sequence is input into the attention-mechanism-based sub-model for computation to obtain the current speech recognition result.
  • In this way, the relationship between the speech features before and after the decoding process is fully considered, the monotonic timing of speech is also taken into account, and the results obtained by this model are more accurate.
  • the device realizes the automatic selection of samples with less redundancy to train the speech recognition model, reduces the labeling cost of speech recognition tasks in the context of deep learning, and improves the training speed of the speech recognition model.
  • the above-mentioned apparatus for screening speech samples based on geometry can be implemented in the form of a computer program, and the computer program can be executed on a computer device as shown in FIG. 4 .
  • FIG. 4 is a schematic block diagram of a computer device provided by an embodiment of the present application.
  • the computer device 500 is a server, and the server may be an independent server or a server cluster composed of multiple servers.
  • the computer device 500 includes a processor 502 , a memory and a network interface 505 connected through a system bus 501 , wherein the memory may include a non-volatile storage medium 503 and an internal memory 504 .
  • the nonvolatile storage medium 503 can store an operating system 5031 and a computer program 5032 .
  • the computer program 5032 when executed, can cause the processor 502 to perform a geometry-based voice sample screening method.
  • the processor 502 is used to provide computing and control capabilities to support the operation of the entire computer device 500 .
  • The internal memory 504 provides an environment for the execution of the computer program 5032 in the non-volatile storage medium 503; when executed by the processor 502, the computer program 5032 can cause the processor 502 to perform a geometry-based voice sample screening method.
  • the network interface 505 is used for network communication, such as providing transmission of data information.
  • FIG. 4 is only a block diagram of a partial structure related to the solution of the present application, and does not constitute a limitation on the computer device 500 to which the solution of the present application is applied.
  • the specific computer device 500 may include more or fewer components than shown, or combine certain components, or have a different arrangement of components.
  • the processor 502 is configured to run the computer program 5032 stored in the memory to implement the geometry-based voice sample screening method disclosed in the embodiment of the present application.
  • The embodiment of the computer device shown in FIG. 4 does not constitute a limitation on the specific structure of the computer device; in other embodiments, the computer device may include more or fewer components than shown, or combine certain components, or have a different arrangement of components.
  • the computer device may only include a memory and a processor. In such an embodiment, the structures and functions of the memory and the processor are the same as those of the embodiment shown in FIG. 4 , and details are not repeated here.
  • the processor 502 may be a central processing unit (Central Processing Unit, CPU), and the processor 502 may also be other general-purpose processors, digital signal processors (Digital Signal Processor, DSP), Application Specific Integrated Circuit (ASIC), Field-Programmable Gate Array (FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc.
  • the general-purpose processor can be a microprocessor or the processor can also be any conventional processor or the like.
  • a computer-readable storage medium may be a non-volatile computer-readable storage medium, or a volatile computer-readable storage medium.
  • the computer-readable storage medium stores a computer program, where the computer program implements the geometry-based voice sample screening method disclosed in the embodiments of the present application when the computer program is executed by the processor.
  • The disclosed apparatus and method may be implemented in other manners.
  • the apparatus embodiments described above are only illustrative.
  • The division of the units is only a logical functional division; in actual implementation there may be other division methods, or units with the same function may be grouped into one unit. For example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented.
  • the shown or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may also be electrical, mechanical or other forms of connection.
  • the units described as separate components may or may not be physically separated, and components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solutions of the embodiments of the present application.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units may be implemented in the form of hardware, or may be implemented in the form of software functional units.
  • the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a storage medium.
  • The technical solutions of the present application, in essence, or the part that contributes to the prior art, or all or part of the technical solutions, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present application.
  • the aforementioned storage medium includes: a U disk, a removable hard disk, a read-only memory (ROM, Read-Only Memory), a magnetic disk or an optical disk and other media that can store program codes.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Signal Processing (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Telephonic Communication Services (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to a geometry-based method and apparatus (100) for screening speech samples, as well as a computer device (500) and a storage medium, relating to artificial intelligence technology. The method comprises: acquiring an initial speech sample set, and extracting a speech feature corresponding to each piece of initial speech sample data in the initial speech sample set, so as to form a speech feature set (S110); acquiring the Euclidean distance between speech features in the speech feature set by means of a dynamic time warping algorithm, and performing K-means clustering accordingly to obtain a clustering result (S120); invoking a preset sample subset screening condition, and acquiring, from the clustering result, clusters that satisfy the sample subset screening condition, so as to form a target cluster set (S130); and acquiring the label value corresponding to each speech feature in the target cluster set, so as to obtain a current speech sample set corresponding to the target cluster set (S140). Samples with relatively low redundancy are automatically selected for training a speech recognition model, thereby reducing the labeling cost of a speech recognition task in a deep learning context and improving the training speed of the speech recognition model.
PCT/CN2021/083934 2020-12-01 2021-03-30 Procédé et appareil, basés sur la géométrie, pour le criblage d'échantillons de parole, dispositif informatique et support de stockage WO2022116442A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011387398.0 2020-12-01
CN202011387398.0A CN112530409B (zh) 2020-12-01 2020-12-01 基于几何学的语音样本筛选方法、装置及计算机设备

Publications (1)

Publication Number Publication Date
WO2022116442A1 true WO2022116442A1 (fr) 2022-06-09

Family

ID=74996045

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/083934 WO2022116442A1 (fr) 2020-12-01 2021-03-30 Procédé et appareil, basés sur la géométrie, pour le criblage d'échantillons de parole, dispositif informatique et support de stockage

Country Status (2)

Country Link
CN (1) CN112530409B (fr)
WO (1) WO2022116442A1 (fr)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116825169A (zh) * 2023-08-31 2023-09-29 悦芯科技股份有限公司 一种基于测试设备的异常存储芯片检测方法
CN117334186A (zh) * 2023-10-13 2024-01-02 武汉赛思云科技有限公司 一种基于机器学习的语音识别方法及nlp平台

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112530409B (zh) * 2020-12-01 2024-01-23 平安科技(深圳)有限公司 基于几何学的语音样本筛选方法、装置及计算机设备
CN113345424B (zh) * 2021-05-31 2024-02-27 平安科技(深圳)有限公司 一种语音特征提取方法、装置、设备及存储介质
CN115101058A (zh) * 2022-06-17 2022-09-23 科大讯飞股份有限公司 一种语音数据处理方法、装置、存储介质及设备
CN115146716B (zh) * 2022-06-22 2024-06-14 腾讯科技(深圳)有限公司 标注方法、装置、设备、存储介质及程序产品
CN114863939B (zh) * 2022-07-07 2022-09-13 四川大学 一种基于声音的大熊猫属性识别方法及系统

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110931043A (zh) * 2019-12-06 2020-03-27 湖北文理学院 集成语音情感识别方法、装置、设备及存储介质
US10699719B1 (en) * 2011-12-31 2020-06-30 Reality Analytics, Inc. System and method for taxonomically distinguishing unconstrained signal data segments
CN111813905A (zh) * 2020-06-17 2020-10-23 平安科技(深圳)有限公司 语料生成方法、装置、计算机设备及存储介质
CN111966798A (zh) * 2020-07-24 2020-11-20 北京奇保信安科技有限公司 一种基于多轮K-means算法的意图识别方法、装置和电子设备
CN112530409A (zh) * 2020-12-01 2021-03-19 平安科技(深圳)有限公司 基于几何学的语音样本筛选方法、装置及计算机设备

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109065028B (zh) * 2018-06-11 2022-12-30 平安科技(深圳)有限公司 说话人聚类方法、装置、计算机设备及存储介质
CN109657711A (zh) * 2018-12-10 2019-04-19 广东浪潮大数据研究有限公司 一种图像分类方法、装置、设备及可读存储介质
CN110648671A (zh) * 2019-08-21 2020-01-03 广州国音智能科技有限公司 声纹模型重建方法、终端、装置及可读存储介质
CN110929771B (zh) * 2019-11-15 2020-11-20 北京达佳互联信息技术有限公司 图像样本分类方法及装置、电子设备、可读存储介质
CN111179914B (zh) * 2019-12-04 2022-12-16 华南理工大学 一种基于改进动态时间规整算法的语音样本筛选方法
CN111046947B (zh) * 2019-12-10 2023-06-30 成都数联铭品科技有限公司 分类器的训练系统及方法、异常样本的识别方法
CN111554270B (zh) * 2020-04-29 2023-04-18 北京声智科技有限公司 训练样本筛选方法及电子设备
CN111950294A (zh) * 2020-07-24 2020-11-17 北京奇保信安科技有限公司 一种基于多参数K-means算法的意图识别方法、装置和电子设备

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10699719B1 (en) * 2011-12-31 2020-06-30 Reality Analytics, Inc. System and method for taxonomically distinguishing unconstrained signal data segments
CN110931043A (zh) * 2019-12-06 2020-03-27 湖北文理学院 集成语音情感识别方法、装置、设备及存储介质
CN111813905A (zh) * 2020-06-17 2020-10-23 平安科技(深圳)有限公司 语料生成方法、装置、计算机设备及存储介质
CN111966798A (zh) * 2020-07-24 2020-11-20 北京奇保信安科技有限公司 一种基于多轮K-means算法的意图识别方法、装置和电子设备
CN112530409A (zh) * 2020-12-01 2021-03-19 平安科技(深圳)有限公司 基于几何学的语音样本筛选方法、装置及计算机设备

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116825169A (zh) * 2023-08-31 2023-09-29 悦芯科技股份有限公司 一种基于测试设备的异常存储芯片检测方法
CN116825169B (zh) * 2023-08-31 2023-11-24 悦芯科技股份有限公司 一种基于测试设备的异常存储芯片检测方法
CN117334186A (zh) * 2023-10-13 2024-01-02 武汉赛思云科技有限公司 一种基于机器学习的语音识别方法及nlp平台
CN117334186B (zh) * 2023-10-13 2024-04-30 北京智诚鹏展科技有限公司 一种基于机器学习的语音识别方法及nlp平台

Also Published As

Publication number Publication date
CN112530409B (zh) 2024-01-23
CN112530409A (zh) 2021-03-19

Similar Documents

Publication Publication Date Title
WO2022116442A1 (fr) Procédé et appareil, basés sur la géométrie, pour le criblage d'échantillons de parole, dispositif informatique et support de stockage
WO2021093449A1 (fr) Procédé et appareil de détection de mot de réveil employant l'intelligence artificielle, dispositif, et support
JP7508533B2 (ja) 話者埋め込みと訓練された生成モデルとを使用する話者ダイアライゼーション
US10854193B2 (en) Methods, devices and computer-readable storage media for real-time speech recognition
US11646032B2 (en) Systems and methods for audio processing
WO2022121257A1 (fr) Procédé et appareil d'entraînement de modèle, procédé et appareil de reconnaissance de la parole, dispositif et support de stockage
CN110534099A (zh) 语音唤醒处理方法、装置、存储介质及电子设备
WO2021082420A1 (fr) Procédé et dispositif d'authentification d'empreinte vocale, support et dispositif électronique
CN111694940B (zh) 一种用户报告的生成方法及终端设备
WO2022178969A1 (fr) Procédé et appareil de traitement de données vocales de conversation, dispositif informatique et support de stockage
WO2022227190A1 (fr) Procédé et appareil de synthèse vocale, dispositif électronique et support de stockage
US11495212B2 (en) Dynamic vocabulary customization in automated voice systems
WO2021218136A1 (fr) Procédé et appareil de reconnaissance de sexe et d'âge d'utilisateur à base de voix, dispositif informatique et support de stockage
JP7230806B2 (ja) 情報処理装置、及び情報処理方法
WO2020052069A1 (fr) Procédé et appareil de segmentation en mots
CN113178201B (zh) 基于无监督的语音转换方法、装置、设备及介质
WO2020220824A1 (fr) Procédé et dispositif de reconnaissance vocale
CN111710337A (zh) 语音数据的处理方法、装置、计算机可读介质及电子设备
WO2022057759A1 (fr) Procédé de conversion de voix et dispositif associé
US10923113B1 (en) Speechlet recommendation based on updating a confidence value
WO2022121185A1 (fr) Procédé et appareil d'entraînement de modèle, procédé et appareil de reconnaissance de dialecte, et serveur et support d'enregistrement
KR20240122776A (ko) 뉴럴 음성 합성의 적응 및 학습
CN113823257B (zh) 语音合成器的构建方法、语音合成方法及装置
CN114373443A (zh) 语音合成方法和装置、计算设备、存储介质及程序产品
CN114495981A (zh) 语音端点的判定方法、装置、设备、存储介质及产品

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21899486

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21899486

Country of ref document: EP

Kind code of ref document: A1