WO2021189980A1 - 语音数据生成方法、装置、计算机设备及存储介质 - Google Patents

语音数据生成方法、装置、计算机设备及存储介质 (Voice data generation method, apparatus, computer device, and storage medium)

Info

Publication number
WO2021189980A1
Authority
WO
WIPO (PCT)
Prior art keywords
voice data
sample
voice
speech recognition
preset
Prior art date
Application number
PCT/CN2020/136366
Other languages
English (en)
French (fr)
Inventor
罗剑
王健宗
程宁
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2021189980A1 publication Critical patent/WO2021189980A1/zh

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/02Feature extraction for speech recognition; Selection of recognition unit
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/06Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063Training
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/16Speech classification or search using artificial neural networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L19/26Pre-filtering or post-filtering
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum

Definitions

  • This application relates to the field of artificial intelligence technology, and in particular to a method, device, computer equipment and storage medium for generating voice data.
  • At present, for imbalanced sample voice data, under-sampling is usually used to discard part of the majority-sample voice data, or voice data of the minority-sample users is collected manually to supplement the data.
  • However, the inventor realized that if part of the majority-sample voice data is discarded, valuable user information is likely to be lost, which in turn reduces the accuracy of voice recognition for those users.
  • In addition, the data-supplementation approach is limited by user privacy and security constraints: it is difficult to obtain a large amount of voice data from the minority-sample users, and manually collecting voice data is inconvenient to operate.
  • This application provides a voice data generation method, device, computer equipment, and storage medium, mainly to solve the problem of how to balance the voice data of different users in the sample library while avoiding the loss of valuable user information.
  • According to a first aspect, a voice data generation method is provided, including: obtaining sample voice data of a target user; performing feature extraction on the sample voice data to obtain the voice features corresponding to the sample voice data; calculating the attention score corresponding to the sample voice data according to the voice features corresponding to the sample voice data and a pre-built embedding matrix, where the embedding matrix is obtained by training the sample voice data; and determining, based on the attention score, verification voice data other than the sample voice data of the target user.
  • According to a second aspect, a voice data generation apparatus is provided, including:
  • an acquiring unit, configured to acquire sample voice data of a target user;
  • an extraction unit, configured to perform feature extraction on the sample voice data to obtain the voice features corresponding to the sample voice data;
  • a first determining unit, configured to calculate the attention score corresponding to the sample voice data according to the voice features corresponding to the sample voice data and a pre-built embedding matrix, the embedding matrix being obtained by training the sample voice data;
  • a second determining unit, configured to determine, based on the attention score, verification voice data other than the sample voice data of the target user.
  • According to a third aspect, a computer-readable storage medium is provided, on which a computer program is stored; when the program is executed by a processor, the voice data generation method is implemented;
  • the steps of the voice data generation method include: obtaining sample voice data of a target user; performing feature extraction on the sample voice data to obtain the voice features corresponding to the sample voice data; calculating the attention score corresponding to the sample voice data according to the voice features corresponding to the sample voice data and a pre-built embedding matrix, where the embedding matrix is obtained by training the sample voice data; and determining, based on the attention score, verification voice data other than the sample voice data of the target user.
  • According to a fourth aspect, a computer device is provided, including a memory, a processor, and a computer program stored in the memory and runnable on the processor, where the processor implements the voice data generation method when executing the program;
  • the steps of the voice data generation method include: obtaining sample voice data of a target user; performing feature extraction on the sample voice data to obtain the voice features corresponding to the sample voice data; calculating the attention score corresponding to the sample voice data according to the voice features corresponding to the sample voice data and a pre-built embedding matrix, where the embedding matrix is obtained by training the sample voice data; and determining, based on the attention score, verification voice data other than the sample voice data of the target user.
  • The voice data generation method, apparatus, computer device, and storage medium provided in the present application extract the voice features of a target user whose sample data is scarce, calculate the attention score corresponding to the sample voice data, and generate verification voice data of the target user according to the attention score, so that more voice data can be generated from the target user's few sample voice data, the sample voice data of different users is balanced, and the loss of valuable user information caused by under-sampling is avoided.
  • At the same time, the speech recognition accuracy of the preset speech recognition model trained on the sample voice data is also improved.
  • Fig. 1 shows a flowchart of a voice data generation method provided by an embodiment of the present application;
  • Fig. 2 shows a flowchart of another voice data generation method provided by an embodiment of the present application;
  • Fig. 3 shows a schematic structural diagram of a voice data generation apparatus provided by an embodiment of the present application;
  • Fig. 4 shows a schematic structural diagram of another voice data generation apparatus provided by an embodiment of the present application;
  • Fig. 5 shows a schematic diagram of the physical structure of a computer device provided by an embodiment of the present application.
  • At present, for imbalanced sample voice data, under-sampling is usually used to discard part of the majority-sample voice data, or voice data of the minority-sample users is collected manually to supplement the data.
  • However, if part of the majority-sample voice data is discarded, valuable user information is likely to be lost, which in turn reduces the accuracy of voice recognition for those users.
  • In addition, the data-supplementation approach is limited by user privacy and security constraints: it is difficult to obtain a large amount of voice data from the minority-sample users, and manually collecting voice data is inconvenient to operate.
  • To solve the above problems, an embodiment of the present application provides a method for generating voice data. As shown in Fig. 1, the method includes:
  • The target user is a user who lacks sample voice data; the data volume of the target user's sample voice data is less than the preset data volume, and the target user's sample voice data is voice data that already exists in the preset sample library.
  • Specifically, the preset sample library can be constructed by collecting voice data of different users. During voice collection, some users may end up with less sample voice data than other users because of privacy and security constraints. To ensure the speech recognition accuracy of the trained speech recognition model, the voice data of different users in the preset sample library needs to be balanced, so the target user's existing sample voice data is used to generate more sample voice data and thereby balance the sample voice data of different users in the preset sample library.
  • The embodiments of this application are mainly applicable to the generation of voice data.
  • The execution body of the embodiments of this application is an apparatus or device capable of generating target voice data, which may specifically be deployed on the client side or the server side.
  • For the embodiments of this application, in order to screen out the target user in the preset sample library, the voice data corresponding to each user in the preset sample library is first determined, and the voice data volume of each user is counted on that basis. The average voice data volume of the preset sample library is then calculated from the per-user volumes and determined as the preset data volume, after which each user's voice data volume is compared with the preset data volume, and users whose voice data volume is less than the preset data volume are determined as target users.
  • A voice data volume below the preset data volume indicates that the user has little voice data compared with the other users in the preset sample library. The voice data of the target user in the preset sample library is therefore determined as the target user's sample voice data, so that new sample data beyond the existing sample voice data can be generated from it to balance the sample voice data. A minimal selection sketch follows.
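  • As an illustration only (assuming the sample library is a mapping from user id to that user's voice samples and that data volume is measured by sample count), the selection of target users could look like this:

```python
# Hedged sketch: users whose voice data volume is below the library's mean are the target users.
from statistics import mean

def select_target_users(sample_library: dict) -> dict:
    """sample_library maps a user id to the list of that user's voice samples."""
    volumes = {user: len(samples) for user, samples in sample_library.items()}
    preset_volume = mean(volumes.values())   # the mean data volume serves as the preset data volume
    return {user: sample_library[user]
            for user, volume in volumes.items() if volume < preset_volume}
```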
  • The Mel cepstral coefficients corresponding to the sample voice data can be used as the voice features corresponding to the sample voice data.
  • Specifically, the sample voice data needs to be preprocessed before feature extraction. The preprocessing includes pre-emphasis, framing, and windowing, so that the target user's sample voice data becomes flat; that is, every N sampling points of the sample voice data are combined into one observation unit (a frame), and adjacent frames remain continuous at their boundaries. A preprocessing sketch is given below.
  • After the target user's sample voice data has been preprocessed, a fast Fourier transform is applied to the preprocessed sample voice data to obtain transformed voice data, which is then input into the Mel filter bank to calculate the voice energy of the transformed voice data after the Mel filters. The Mel cepstral coefficients corresponding to the sample voice data are then calculated from this voice energy and determined as the voice features corresponding to the target user's sample voice data, so that more voice data of the target user can be generated from these features and the sample voice data in the preset sample library can be balanced.
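  • The following is an illustrative preprocessing sketch (pre-emphasis, framing, Hamming windowing); the frame length, hop size, and pre-emphasis coefficient are assumed values, not taken from the patent.

```python
# Hedged sketch of the preprocessing stage before the FFT / Mel-filtering steps.
import numpy as np

def preprocess(signal: np.ndarray, frame_len: int = 400, hop: int = 160,
               pre_emphasis: float = 0.97) -> np.ndarray:
    # Pre-emphasis flattens the spectrum: y[n] = x[n] - a * x[n-1]
    emphasized = np.append(signal[0], signal[1:] - pre_emphasis * signal[:-1])
    # Combine every N sampling points into one observation unit (frame); assumes len(signal) >= frame_len
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    frames = np.stack([emphasized[i * hop: i * hop + frame_len] for i in range(n_frames)])
    # A Hamming window keeps the left and right ends of each frame continuous
    return frames * np.hamming(frame_len)
```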
  • The embedding matrix is obtained by training the sample voice data. For the embodiments of this application, in order to generate verification voice data beyond the target user's sample voice data, a preset voice data generation model can be used to generate more voice data of the target user.
  • Because the target user's voice data is time-series data, which the GPT-2 model handles well, the preset voice data generation model may specifically be a preset GPT-2 model.
  • The preset GPT-2 model includes an attention layer and a neural network layer. Specifically, the voice features extracted from the sample voice data are input into the preset GPT-2 model for voice data generation.
  • In the generation process, the voice features corresponding to the sample voice data are first input to the attention layer, which calculates the attention scores corresponding to the existing voice features. When computing the attention score of a voice feature, the embedding matrix of the trained GPT-2 model is obtained, the query vector, key vector, and value vector corresponding to the voice feature are calculated from the embedding matrix, and the attention score corresponding to the voice feature is then calculated from these vectors.
  • The calculated attention scores corresponding to the voice features are input into the neural network layer to generate voice data.
  • The target user usually has multiple voice features. After the attention scores corresponding to these voice features are determined, the scores are input to the neural network layer, which selects the voice features with higher attention scores: the higher the attention score of a voice feature, the stronger its relevance to the voice data to be generated, so the features with higher attention scores are used to generate the target user's voice data.
  • Compared with the current practice of discarding majority-sample voice data by under-sampling, the voice data generation method provided by this embodiment of the application obtains the sample voice data of the target user; performs feature extraction on the sample voice data to obtain the corresponding voice features; calculates the attention score corresponding to the sample voice data according to those voice features and a pre-built embedding matrix, where the embedding matrix is obtained by training the sample voice data; and determines, based on the attention score, verification voice data other than the target user's sample voice data.
  • By extracting the voice features of a target user whose sample data is scarce, the attention score corresponding to the sample voice data can be calculated and the target user's verification voice data generated from it, so that more voice data is produced from the target user's few sample voice data, the sample voice data of different users is balanced, and the loss of valuable user information caused by under-sampling is avoided.
  • At the same time, the speech recognition accuracy of the preset speech recognition model trained on this sample voice data is also improved.
  • Further, to better describe the above voice data generation process, as a refinement and extension of the above embodiment, an embodiment of the present application provides another method for generating voice data. As shown in Fig. 2, the method includes:
  • the sample voice data is voice data that already exists in the preset sample library
  • the target user is a user who lacks sample voice data in the preset sample library.
  • For the embodiments of this application, in order to determine the target user, a preset voice data volume may be set in advance.
  • The preset voice data volume can be determined according to the amount of training samples required to construct the preset voice data generation model. The voice data volume corresponding to each user in the preset sample library is then determined and compared with the preset voice data volume, and the target users are screened according to the comparison results; specifically, users whose voice data volume is less than the preset voice data volume can be determined as target users.
  • Alternatively, the average voice data volume of the preset sample library can be calculated from the per-user voice data volumes, each user's voice data volume compared with this average, and the target users screened according to the comparison results; specifically, users whose voice data volume is less than the average can be determined as target users. In this way, the target users with scarce data in the preset sample library can be identified, so that more voice data of each target user can be generated from that target user's sample voice data to balance the voice data in the preset sample library.
  • The voice features corresponding to the sample voice data may specifically be the Mel cepstral coefficients corresponding to the sample voice data.
  • For the embodiments of this application, in order to extract the voice features corresponding to the sample voice data, step 202 specifically includes: filtering the sample voice data to obtain the voice energy corresponding to the sample voice data; and performing a discrete cosine transform on the voice energy to obtain the voice features corresponding to the sample voice data.
  • Specifically, the sample voice data needs to be preprocessed before feature extraction. The preprocessing includes pre-emphasis, framing, and windowing, so that the target user's sample voice data becomes flat; that is, every N sampling points of the sample voice data are combined into one observation unit (a frame), and adjacent frames remain continuous at their boundaries.
  • After the target user's sample voice data has been preprocessed, a fast Fourier transform is applied to the preprocessed sample voice data to obtain transformed voice data, which is then input into the Mel filter bank, and the voice energy of the transformed voice data after the Mel filters is calculated.
  • The Mel cepstral coefficients corresponding to the sample voice data are then calculated from this voice energy and determined as the voice features corresponding to the target user's sample voice data.
  • The specific calculation formula of the Mel cepstral coefficients is as follows:
  • C(n) = Σ_{m=1}^{M} s(m)·cos(πn(m − 0.5)/M), n = 1, 2, …, L
  • where s(m) represents the voice energy output after the voice data passes through the m-th filter, M is the total number of filters, C(n) is the Mel cepstral coefficient, n represents the order of the Mel cepstral coefficient, and L is usually taken as 12–16. The specific calculation formula of the voice energy s(m) is as follows:
  • s(m) = Σ_{k=0}^{K−1} |X(k)|²·H_m(k)
  • where |X(k)|² is the power spectrum obtained by taking the squared modulus of the spectrum of the voice data, H_m(k) is the frequency response of the m-th filter, and K is the number of Fourier transform points. According to the above formulas, the Mel cepstral coefficients corresponding to the target user's sample voice data can be calculated and determined as the voice features corresponding to the sample voice data, so that new sample data beyond the existing sample voice data can be generated for the target user and the sample voice data can be balanced. A short computational sketch follows.
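  • The sketch below walks through the same pipeline (FFT, Mel filter bank, discrete cosine transform); librosa and scipy are assumed dependencies, the frame and filter parameters are illustrative, and the log applied before the DCT is standard MFCC practice rather than part of the patent's formula.

```python
# Hedged sketch of extracting Mel cepstral coefficients as the voice features.
import numpy as np
import librosa
from scipy.fftpack import dct

def mfcc_features(signal: np.ndarray, sr: int = 16000, n_mels: int = 26, L: int = 13) -> np.ndarray:
    # FFT -> power spectrum, then the Mel filter bank gives the energies s(m) per frame
    power = np.abs(librosa.stft(signal, n_fft=512, hop_length=160, win_length=400)) ** 2
    mel_energy = librosa.feature.melspectrogram(S=power, sr=sr, n_mels=n_mels)
    # Discrete cosine transform over the (log) filter-bank energies gives C(n); keep the first L orders
    return dct(np.log(mel_energy + 1e-10), axis=0, norm='ortho')[:L]
```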
  • The embedding matrix is obtained by training the sample voice data.
  • For the embodiments of this application, in order to obtain more voice data of the target user and balance the sample voice data, the voice features corresponding to the sample voice data are input into the preset voice data generation model for data generation, yielding verification voice data other than the target user's sample voice data. The preset voice data generation model may specifically be a trained GPT-2 model.
  • When the GPT-2 model is used to generate more voice data of the target user, step 203 specifically includes: determining the query vector, key vector, and value vector corresponding to the voice feature according to the embedding matrix; multiplying the query vector corresponding to the voice feature by its corresponding key vector to obtain the weight value corresponding to the voice feature; and calculating the attention score corresponding to the voice feature according to the weight value and the value vector corresponding to the voice feature.
  • Specifically, the preset embedding matrix is determined by the trained GPT-2 model, that is, the preset embedding matrix can be obtained by training the GPT-2 model. The query vector, key vector, and value vector corresponding to the voice feature are determined from this embedding matrix; the attention layer in the GPT-2 model then calculates the weight value corresponding to the voice feature from the query vector and key vector, and the attention score corresponding to the voice feature is calculated from the weight value and the value vector and output. The specific calculation formula of the attention score is as follows:
  • Attention(Q, K, V) = softmax(QK^T / √d_k)·V
  • where Attention(Q, K, V) is the attention score corresponding to the existing features, Q is the query vector, K is the key vector, V is the value vector, and d_k is the dimension of the key vector, usually taken as 64. The attention score corresponding to the voice feature is thus obtained; a numerical sketch of this computation is given below.
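  • The following is a minimal numerical sketch of the scaled dot-product attention described above; the projection matrices w_q, w_k, and w_v (standing in for the pre-built embedding matrix of the trained GPT-2 model) are assumed inputs rather than values taken from the patent.

```python
# Hedged sketch: compute attention scores from query/key/value vectors.
import numpy as np

def attention_scores(features: np.ndarray, w_q: np.ndarray, w_k: np.ndarray,
                     w_v: np.ndarray, d_k: int = 64) -> np.ndarray:
    q, k, v = features @ w_q, features @ w_k, features @ w_v   # query, key and value vectors
    weights = q @ k.T / np.sqrt(d_k)                           # weight values (query · key)
    weights = np.exp(weights - weights.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)             # softmax over the weight values
    return weights @ v                                         # Attention(Q, K, V)
```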
  • For the embodiments of this application, the attention score is input to the neural network layer in the GPT-2 model to generate verification voice data of the target user beyond the sample voice data, so as to ensure that the amount of voice data of different users in the sample library is balanced.
  • other users are users who are not short of voice data, that is, the amount of voice data corresponding to other users is greater than the preset data amount.
  • For this embodiment, after more voice data of the target user has been generated, the amount of voice data of different users in the sample library is balanced.
  • The voice data in the sample library can then be used as training samples to construct a preset speech recognition model. Specifically, the sample voice data and verification voice data of the target user, together with the sample voice data of other users in the preset sample library, are jointly used as the first training sample, so that the preset speech recognition model can be constructed from the first training sample.
  • The preset speech recognition model may specifically be a preset neural network model that includes a plurality of hidden layers. Initial parameters of the preset neural network model are given, the first training sample is input into the preset neural network model for training, and the initial parameters are adjusted to construct the preset speech recognition model, as sketched below.
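  • As an illustration only, the sketch below trains a multi-hidden-layer classifier on the combined first training sample; scikit-learn's MLPClassifier stands in for the unspecified preset neural network model, and the layer sizes are assumed values.

```python
# Hedged sketch: build the preset speech recognition model from the first training sample.
import numpy as np
from sklearn.neural_network import MLPClassifier

def build_recognition_model(features: np.ndarray, speaker_labels: np.ndarray) -> MLPClassifier:
    """features: voice features of the target user and other users; speaker_labels: user ids."""
    model = MLPClassifier(hidden_layer_sizes=(256, 128), max_iter=200)  # several hidden layers
    model.fit(features, speaker_labels)  # training adjusts the initial parameters
    return model
```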
  • Further, to ensure that the preset speech recognition model has a better recognition effect on real voice data, the constructed model can also be adjusted using the sample voice data of the target user and the voice data of other users in the preset sample library. Based on this, the method further includes: determining the sample voice data of the target user and the sample voice data of the other users as a second training sample; and adjusting the preset speech recognition model using the second training sample to obtain an adjusted preset speech recognition model.
  • The adjusted preset speech recognition model can thus achieve a better recognition effect on real voice data.
  • Further, to guarantee the recognition accuracy of the adjusted preset speech recognition model, the method further includes: testing the adjusted preset speech recognition model with test samples to obtain test results corresponding to the adjusted preset speech recognition model; determining, according to the test results, the speech recognition accuracy rate corresponding to the adjusted preset speech recognition model; and, if the speech recognition accuracy rate is less than the preset speech recognition accuracy rate, adjusting the parameters in the adjusted preset speech recognition model until its speech recognition accuracy rate reaches the preset speech recognition accuracy rate.
  • Specifically, test samples of multiple users are obtained and input into the adjusted preset speech recognition model for testing, yielding the test results of the adjusted model.
  • From the test results, the number of correctly recognized samples and the total number of samples are counted, and the speech recognition accuracy rate of the adjusted model is calculated from them (a short sketch follows). If the calculated accuracy rate does not reach the preset speech recognition accuracy rate, it is determined that the recognition accuracy of the adjusted model does not meet the requirement, speech recognition cannot be performed, and training needs to continue; if it does reach the preset rate, the recognition accuracy of the adjusted model meets the requirement and the model can be used for speech recognition. Based on this, the method further includes: obtaining voice data of a user to be recognized; and inputting the voice data of the user to be recognized into the adjusted preset speech recognition model for speech recognition to determine the speech recognition result corresponding to that user.
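  • The following is a hedged sketch (not from the patent) of the accuracy check described above: the accuracy rate is the share of test samples whose recognized label matches the true label, compared against a preset threshold.

```python
# Illustrative only; the preset accuracy threshold of 0.95 is an assumed value.
import numpy as np

def recognition_accuracy(model, test_features: np.ndarray, true_labels: np.ndarray,
                         preset_accuracy: float = 0.95):
    predictions = model.predict(test_features)              # recognition results on the test samples
    accuracy = float(np.mean(predictions == true_labels))   # correct samples / total samples
    return accuracy, accuracy >= preset_accuracy            # False means keep adjusting/training
```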
  • Specifically, the voice data of the user to be recognized is input into the adjusted preset speech recognition model for speech recognition; the hidden layers in the adjusted model extract the voice features corresponding to that voice data.
  • The voice features of the user to be recognized are then compared with the voice features of other users in the preset feature library, and the speech recognition result corresponding to the user to be recognized is output according to the comparison result; that is, the adjusted preset speech recognition model can be used to identify the identity of the user to be recognized. A schematic comparison sketch follows.
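  • The patent does not specify how the comparison against the preset feature library is performed; the sketch below assumes a simple cosine-similarity match between the extracted feature vector and each enrolled user's stored feature vector.

```python
# Assumed comparison scheme (cosine similarity); feature_library maps user id -> feature vector.
import numpy as np

def identify_user(query_feature: np.ndarray, feature_library: dict) -> str:
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-10))
    # Return the enrolled user whose stored features are most similar to the query
    return max(feature_library, key=lambda user: cosine(query_feature, feature_library[user]))
```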
  • Through this embodiment, the present application can obtain the sample voice data of the target user; perform feature extraction on the sample voice data to obtain the corresponding voice features; calculate the attention score corresponding to the sample voice data according to those voice features and a pre-built embedding matrix, where the embedding matrix is obtained by training the sample voice data; and determine, based on the attention score, verification voice data other than the target user's sample voice data.
  • By extracting the voice features of a target user whose sample data is scarce, the attention score corresponding to the sample voice data can be calculated and the target user's verification voice data generated from it, so that more voice data is produced from the target user's few sample voice data, the sample voice data of different users is balanced, and the loss of valuable user information caused by under-sampling is avoided.
  • At the same time, the speech recognition accuracy of the preset speech recognition model trained on this sample voice data is also improved.
  • an embodiment of the present application provides a voice data generating device.
  • As shown in Fig. 3, the device includes: an acquiring unit 31, an extracting unit 32, a first determining unit 33, and a second determining unit 34.
  • the acquiring unit 31 may be used to acquire sample voice data of the target user.
  • the acquiring unit 31 is the main functional module of the device for acquiring sample voice data of the target user.
  • the extraction unit 32 may be used to perform feature extraction on the sample voice data to obtain voice features corresponding to the sample voice data.
  • the extraction unit 32 is a main functional module that performs feature extraction on the sample voice data in the device to obtain the voice features corresponding to the sample voice data, and is also a core module.
  • the first determining unit 33 may be configured to calculate the attention score corresponding to the sample voice data according to the voice feature corresponding to the sample voice data and a pre-built embedding matrix, and the embedding matrix is obtained by comparing the Sample voice data is obtained through training.
  • The first determining unit 33 is the main functional module of the device that calculates the attention score corresponding to the sample voice data according to the voice features corresponding to the sample voice data and the pre-built embedding matrix, and is also a core module.
  • the second determining unit 34 may be configured to determine verification voice data other than the target user sample voice data based on the attention score.
  • The second determining unit 34 is the main functional module of the device for determining, based on the attention score, verification voice data other than the sample voice data of the target user, and is also a core module.
  • the first determining unit 33 includes: a determining module 331, a multiplying module 332, and a calculating module 333.
  • the determining module 331 may be used to determine the embedding matrix corresponding to the voice feature, and determine the query vector, key vector, and value vector corresponding to the voice feature according to the embedding matrix.
  • the multiplication module 332 may be used to respectively multiply the query vector corresponding to the voice feature and the key vector corresponding to the voice feature to obtain the weight value corresponding to the voice feature.
  • the calculation module 333 may be used to calculate the attention score corresponding to the voice feature according to the weight value and the value vector corresponding to the voice feature.
  • the extraction unit 32 includes: a filtering module 321 and a discrete module 322.
  • the filtering module 321 may be used to perform filtering processing on the sample voice data to obtain the voice energy corresponding to the sample voice data.
  • The discrete module 322 may be used to perform discrete cosine processing on the voice energy to obtain the voice features corresponding to the sample voice data.
  • the device further includes a construction unit 35.
  • the first determining unit 33 may also be used to determine the sample voice data and verification voice data of the target user, and the sample voice data of other users in the preset sample library as the first training sample.
  • the construction unit 35 may be used to train the first training sample by using a preset neural network algorithm to construct a preset speech recognition model.
  • the device further includes: an adjustment unit 36.
  • the first determining unit 33 may also be used to determine the sample voice data of the target user and the sample voice data of the other users as the second training sample.
  • the adjustment unit 36 may be configured to adjust the preset speech recognition model by using the second training sample to obtain an adjusted preset speech recognition model.
  • the device further includes a test unit 37.
  • the testing unit 37 may be used to test the adjusted preset speech recognition model by using test samples to obtain the test result corresponding to the adjusted preset speech recognition model.
  • the first determining unit 33 may also be configured to determine the speech recognition accuracy rate corresponding to the adjusted preset speech recognition model according to the test result.
  • The adjustment unit 36 may also be used to adjust the parameters in the adjusted preset speech recognition model if the speech recognition accuracy rate is less than the preset speech recognition accuracy rate, until the speech recognition accuracy rate corresponding to the adjusted preset speech recognition model reaches the preset speech recognition accuracy rate.
  • the device further includes: a recognition unit 38.
  • the acquiring unit 31 may also be used to acquire voice data of the user to be recognized.
  • the recognition unit 38 may be configured to input the voice data of the user to be recognized into the adjusted preset voice recognition model for voice recognition, and determine the voice recognition result corresponding to the user to be recognized.
  • an embodiment of the present application also provides a computer-readable storage medium.
  • The computer-readable storage medium may be non-volatile or volatile, and a computer program is stored thereon; when the program is executed by a processor, the voice data generation method is implemented.
  • The steps of the voice data generation method include: obtaining sample voice data of a target user; performing feature extraction on the sample voice data to obtain the voice features corresponding to the sample voice data; calculating the attention score corresponding to the sample voice data according to the voice features corresponding to the sample voice data and a pre-built embedding matrix, where the embedding matrix is obtained by training the sample voice data; and determining, based on the attention score, verification voice data other than the sample voice data of the target user.
  • Further, calculating the attention score corresponding to the sample voice data according to the voice features corresponding to the sample voice data and the pre-built embedding matrix includes: determining the query vector, key vector, and value vector corresponding to the voice feature according to the embedding matrix; multiplying the query vector corresponding to the voice feature by the key vector to obtain the weight value corresponding to the voice feature; and calculating the attention score corresponding to the voice feature according to the weight value and the value vector.
  • Further, performing feature extraction on the sample voice data to obtain the voice features corresponding to the sample voice data includes: filtering the sample voice data to obtain the voice energy corresponding to the sample voice data; and performing a discrete cosine transform on the voice energy to obtain the voice features corresponding to the sample voice data.
  • Further, after the verification voice data is determined, the method further includes: determining the sample voice data and verification voice data of the target user, together with the sample voice data of other users in the preset sample library, as a first training sample; and training the first training sample with a preset neural network algorithm to construct a preset speech recognition model.
  • Further, the method further includes: determining the sample voice data of the target user and the sample voice data of the other users as a second training sample; and adjusting the preset speech recognition model with the second training sample to obtain an adjusted preset speech recognition model.
  • Further, the method further includes: testing the adjusted preset speech recognition model with test samples to obtain test results corresponding to the adjusted preset speech recognition model; determining, according to the test results, the speech recognition accuracy rate corresponding to the adjusted preset speech recognition model; and, if the speech recognition accuracy rate is less than the preset speech recognition accuracy rate, adjusting the parameters in the adjusted preset speech recognition model until its speech recognition accuracy rate reaches the preset speech recognition accuracy rate.
  • The computer device includes a processor 41, a memory 42, and a computer program stored on the memory 42 and runnable on the processor, where the memory 42 and the processor 41 are both arranged on a bus 43, and the processor 41 implements the voice data generation method when executing the program.
  • The steps of the voice data generation method include: obtaining sample voice data of a target user; performing feature extraction on the sample voice data to obtain the voice features corresponding to the sample voice data; calculating the attention score corresponding to the sample voice data according to the voice features corresponding to the sample voice data and a pre-built embedding matrix, where the embedding matrix is obtained by training the sample voice data; and determining, based on the attention score, verification voice data other than the sample voice data of the target user.
  • Further, calculating the attention score corresponding to the sample voice data according to the voice features corresponding to the sample voice data and the pre-built embedding matrix includes: determining the query vector, key vector, and value vector corresponding to the voice feature according to the embedding matrix; multiplying the query vector corresponding to the voice feature by the key vector to obtain the weight value corresponding to the voice feature; and calculating the attention score corresponding to the voice feature according to the weight value and the value vector.
  • Further, performing feature extraction on the sample voice data to obtain the voice features corresponding to the sample voice data includes: filtering the sample voice data to obtain the voice energy corresponding to the sample voice data; and performing a discrete cosine transform on the voice energy to obtain the voice features corresponding to the sample voice data.
  • Further, after the verification voice data is determined, the method further includes: determining the sample voice data and verification voice data of the target user, together with the sample voice data of other users in the preset sample library, as a first training sample; and training the first training sample with a preset neural network algorithm to construct a preset speech recognition model.
  • Further, the method further includes: determining the sample voice data of the target user and the sample voice data of the other users as a second training sample; and adjusting the preset speech recognition model with the second training sample to obtain an adjusted preset speech recognition model.
  • Further, the method further includes: testing the adjusted preset speech recognition model with test samples to obtain test results corresponding to the adjusted preset speech recognition model; determining, according to the test results, the speech recognition accuracy rate corresponding to the adjusted preset speech recognition model; and, if the speech recognition accuracy rate is less than the preset speech recognition accuracy rate, adjusting the parameters in the adjusted preset speech recognition model until its speech recognition accuracy rate reaches the preset speech recognition accuracy rate.
  • Through the technical solution of this application, sample voice data of the target user can be obtained; feature extraction is performed on the sample voice data to obtain the corresponding voice features; the attention score corresponding to the sample voice data is calculated according to those voice features and a pre-built embedding matrix, where the embedding matrix is obtained by training the sample voice data; and, based on the attention score, verification voice data other than the target user's sample voice data is determined.
  • By extracting the voice features of a target user whose sample data is scarce, the attention score corresponding to the sample voice data can be calculated and the target user's verification voice data generated from it, so that more voice data can be generated from the target user's few sample voice data, the sample voice data of different users is balanced, and the loss of valuable user information caused by under-sampling is avoided.
  • At the same time, the speech recognition accuracy of the preset speech recognition model trained on this sample voice data is also improved.
  • Obviously, those skilled in the art should understand that the above modules or steps of this application can be implemented by a general-purpose computing device; they can be concentrated on a single computing device or distributed over a network composed of multiple computing devices.
  • Optionally, they can be implemented with program code executable by the computing device, so that they can be stored in a storage device and executed by the computing device; in some cases, the steps shown or described can be executed in a different order from that given here, or they can be made into individual integrated circuit modules, or multiple modules or steps among them can be made into a single integrated circuit module. In this way, this application is not limited to any specific combination of hardware and software.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Telephonic Communication Services (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A voice data generation method and apparatus, a computer device, and a storage medium, relating to the field of artificial intelligence technology, and mainly capable of generating more voice data for minority-sample users on the basis of their sample voice data, so that the voice data of different users in a sample library is balanced. The method includes: obtaining sample voice data of a target user (101); performing feature extraction on the sample voice data to obtain voice features corresponding to the sample voice data (102); calculating an attention score corresponding to the sample voice data according to the voice features corresponding to the sample voice data and a pre-built embedding matrix, the embedding matrix being obtained by training the sample voice data (103); and determining, based on the attention score, verification voice data other than the sample voice data of the target user (104). The method uses machine learning technology and is mainly applicable to the generation of voice data.

Description

语音数据生成方法、装置、计算机设备及存储介质
本申请要求于2020年10月26日提交中国专利局、申请号为202011153538.8,发明名称为“语音数据生成方法、装置、计算机设备及存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及人工智能技术领域,尤其是涉及一种语音数据生成方法、装置、计算机设备及存储介质。
背景技术
在语音识别技术中,对于分类器来说,如果样本库中不同用户的语音数据量差距过大,分类器很难有良好的表现,会影响分类器识别的准确率,因此,为了确保语音识别的准确率,需要保证样本语音数据的平衡。
目前,对于不平衡的样本语音数据,通常采用欠采样的方式对多数样本语音数据进行数据消除,或者人为采集少数样本用户的语音数据进行数据补充。然而,发明人意识到如果消除多数样本语音数据中的部分数据,很可能会丢失有价值的用户信息,进而影响用户的语音识别精度,此外,对于补充语音数据的方式,由于受到用户隐私和安全因素的限制,很难获得少数样本用户的大量语音数据,且这种人为采集语音数据的方式,操作较为不便。
技术问题
本申请提供了一种语音数据生成方法、装置、计算机设备及存储介质,主要在于解决如何使得样本库中的不同用户的语音数据达到平衡,同时能够避免丢失有价值的用户信息的问题。
技术解决方案
根据本申请的第一个方面,提供一种语音数据生成方法,包括:
获取目标用户的样本语音数据;
对所述样本语音数据进行特征提取,得到所述样本语音数据对应的语音特征;
根据所述样本语音数据对应的语音特征和预先构建的嵌入矩阵,计算所述样本语音数据对应的注意力分值,所述嵌入矩阵是通过对所述样本语音数据进行训练得到的;
基于所述注意力分值,确定所述目标用户样本语音数据之外的验证语音数据。
根据本申请的第二个方面,提供一种语音数据生成装置,包括:
获取单元,用于获取目标用户的样本语音数据;
提取单元,用于对所述样本语音数据进行特征提取,得到所述样本语音数据对应的语音特征;
第一确定单元,用于根据所述样本语音数据对应的语音特征和预先构建的嵌入矩阵,计算所述样本语音数据对应的注意力分值,所述嵌入矩阵是通过对所述样本语音数据进行训练得到的,
第二确定单元,用于基于所述注意力分值,确定所述目标用户样本语音数据之外的验证语音数据。
根据本申请的第三个方面,提供一种计算机可读存储介质,其上存储有计算机程序,该程序被处理器执行时实现语音数据生成方法;
其中,所述语音数据生成方法的步骤包括:
获取目标用户的样本语音数据;
对所述样本语音数据进行特征提取,得到所述样本语音数据对应的语音特征;
根据所述样本语音数据对应的语音特征和预先构建的嵌入矩阵,计算所述样本语音数据对应的注意力分值,所述嵌入矩阵是通过对所述样本语音数据进行训练得到的;
基于所述注意力分值,确定所述目标用户样本语音数据之外的验证语音数据。
根据本申请的第四个方面,提供一种计算机设备,包括存储器、处理器及存储在存储器上并可在处理器上运行的计算机程序,所述处理器执行所述程序时实现语音数据生成方法;
其中,所述语音数据生成方法的步骤包括:
获取目标用户的样本语音数据;
对所述样本语音数据进行特征提取,得到所述样本语音数据对应的语音特征;
根据所述样本语音数据对应的语音特征和预先构建的嵌入矩阵,计算所述样本语音数据对应的注意力分值,所述嵌入矩阵是通过对所述样本语音数据进行训练得到的;
基于所述注意力分值,确定所述目标用户样本语音数据之外的验证语音数据。
有益效果
本申请提供的一种语音数据生成方法、装置、计算机设备及存储介质,通过提取样本数据量匮乏的目标用户的语音特征,能够计算样本语音数据对应的注意力分值,并依据该注意力分值生成目标用户的验证语音数据,从而能够根据目标用户的少数样本语音数据,生成更多的语音数据,使不同用户的样本语音数据达到平衡,避免采用欠采用的方式丢失掉有价值的用户信息,同时依据该样本语音数据训练的预设语音识别模型的语音识别精度也得到了提高。
附图说明
此处所说明的附图用来提供对本申请的进一步理解,构成本申请的一部分,本申请的示意性实施例及其说明用于解释本申请,并不构成对本申请的不当限定。在附图中:
图1示出了本申请实施例提供的一种语音数据生成方法流程图;
图2示出了本申请实施例提供的另一种语音数据生成方法流程图;
图3示出了本申请实施例提供的一种语音数据生成装置的结构示意图;
图4示出了本申请实施例提供的另一种语音数据生成装置的结构示意图;
图5示出了本申请实施例提供的一种计算机设备的实体结构示意图。
本发明的最佳实施方式
下文中将参考附图并结合实施例来详细说明本申请。需要说明的是,在不冲突的情况下,本申请中的实施例及实施例中的特征可以相互组合。
目前,对于不平衡的样本语音数据,通常采用欠采样的方式对多数样本语音数据进行数据消除,或者人为采集少数样本用户的语音数据进行数据补充。然而,如果消除多数样本语音数据中的部分数据,很可能会丢失有价值的用户信息,进而影响用户的语音识别精度,此外,对于补充语音数据的方式,由于受到用户隐私和安全因素的限制,很难获得少数样本用户的大量语音数据,且这种人为采集语音数据的方式,操作较为不便。
为了解决上述问题,本申请实施例提供了一种语音数据生成方法,如图1所示,所述方法包括:
101、获取目标用户的样本语音数据。
其中,目标用户为样本语音数据匮乏的用户,该目标用户的样本语音数据的数据量小 于预设数据量,目标用户的样本语音数据为预设样本库中已经存在的语音数据,具体可以通过搜集不同用户的语音数据,构建预设样本库,在语音搜集的过程中,可能由于隐私和安全因素的限制,有些用户的样本语音数据量相比其他用户较少,为了确保训练的语音识别模型的语音识别精度,需要保证预设样本库不同用户的语音数据达到平衡,因此需要利用目标用户已有的样本语音数据,生成更多的样本语音数据,以达到预设样本库中不同用户的样本语音数据的平衡,确保后续的预设语音识别模型的语音识别精度,本申请实施例主要适用于语音数据的生成,本申请实施例的执行主体为能够生成目语音数据的装置或设备,具体可以设置在客户端或者服务器一侧。
对于本申请实施例,为了筛选预设样本库中的目标用户,首选确定预设样本库各个用户对应的语音数据,基于各个用户对应的语音数据,统计各个用户对应的语音数据量,之后根据各个用户对应的语音数据量,统计预设样本库的语音数据量均值,将该语音数据量均值确定为预设数据量,接着分别将各个用户对应的语音数据量与预设数据量进行比对,并将语音数据量小于预设数据量的用户确定为目标用户,若用户的语音数据量小于预设数据量说明,该用户的语音数据量与预设样本库中其他用户的语音数据量相比较少,因此将语音数据量小于预设数据量的用户确定为目标用户,同时将预设样本库中目标用户的语音数据确定为目标用户的样本语音数据,以便根据该样本语音数据,生成目标用户除样本语音数据之后的新增样本数据,以到达样本语音数据的平衡。
102、对所述样本语音数据进行特征提取,得到所述样本语音数据对应的语音特征。
其中,可以将样本语音数据对应的梅尔倒谱系数作为样本语音数据对应的语音特征,具体地,在对样本语音数据进行特征提取之前需要对样本语音数据进行预处理,该预处理过程具体包括预加重、分帧和加窗函数处理,从而使得目标用户的样本语音数据变得平坦,即将样本语音数据的每N个采用点合成一个观测单位(帧),帧的左右端具有连续性,在对目标用户的样本语音数据进行预处理之后,需要对预处理后的样本语音数据进行快速傅里叶转化,得到转换后的语音数据,之后将转换后的语音数据输入Mel滤波器,计算转换后的语音数据通过Mel滤波器后的语音能量,接着根据样本语音数据对应的语音能量,计算样本语音数据对应的梅尔倒谱系数,并将该梅尔倒谱系数确定为目标用户的样本语音数据对应的语音特征,以便依据样本语音数据对应的语音特征,生成目标用户更多的语音数据,以便到预设样本库中样本语音数据的平衡。
103、根据所述样本语音数据对应的语音特征和预先构建的嵌入矩阵,计算所述样本语音数据对应的注意力分值。
其中,所述嵌入矩阵是通过对所述样本语音数据进行训练得到的,对于本申请实施例,为了基于目标用户的样本语音数据,生成目标用户样本语音数据之外的验证语音数据,可以利用预设语音数据生成模型来生成更多目标用户的语音数据,由于目标用户的语音数据为时序数据,而GPT-2模型能够很好的处理时序数据,因此预设语音数据生成模型具体可以为预设GPT-2模型,该预设GPT-2模型中包括注意力层和神经网络层,具体地,将提取的样本语音数据对应的语音特征输入至预设GPT-2模型中进行语音数据生成,在预设GPT-2模型进行语音数据生成的过程中,首选将样本语音数据对应的语音特征输入至注意力层,利用该注意力层计算已有语音特征对应的注意力分值,具注意力层具体计算语音特征对应的注意力分值时,可以获取训练好的GPT-2模型中的嵌入矩阵,之后根据该嵌入矩阵计算该语音特征对应的查询向量、键向量和值向量,接着根据计算的查询向量、键向量和值向量,计算语音特征对应的注意力分值。
104、基于所述注意力分值,确定所述目标用户样本语音数据之外的验证语音数据。
对于本申请实施例,将计算出的语音特征对应的注意力分值输入至神经网络层进行语音数据的生成,具体地,目标用户通常具有多个语音特征,在确定多个语音特征对应的注意力分值之后,将注意力分值输入至神经网络层,该神经网络层会筛选注意力分值较高的 语音特征,语音特征的注意力分值越高,说明该语音特征与待生成语音数据的关联性越高,进而利用注意力分值较高的语音特征来生成目标用户的语音数据。
本申请实施例提供的一种语音数据生成方法,与目前采用欠采样的方式对多数样本语音数据进行数据消除的方式相比,本申请能够获取目标用户的样本语音数据;并对所述样本语音数据进行特征提取,得到所述样本语音数据对应的语音特征;与此同时,根据所述样本语音数据对应的语音特征和预先构建的嵌入矩阵,计算所述样本语音数据对应的注意力分值,所述嵌入矩阵是通过对所述样本语音数据进行训练得到的;并基于所述注意力分值,确定所述目标用户样本语音数据之外的验证语音数据。由此通过提取样本数据量匮乏的目标用户的语音特征,能够计算样本语音数据对应的注意力分值,并依据该注意力分值生成目标用户的验证语音数据,从而能够根据目标用户的少数样本语音数据,生成更多的语音数据,使不同用户的样本语音数据达到平衡,避免采用欠采用的方式丢失掉有价值的用户信息,同时依据该样本语音数据训练的预设语音识别模型的语音识别精度也得到了提高。
进一步的,为了更好的说明上述语音数据的生成过程,作为对上述实施例的细化和扩展,本申请实施例提供了另一种语音数据生成方法,如图2所示,所述方法包括:
201、获取目标用户的样本语音数据。
其中,样本语音数据为预设样本库中已经存在的语音数据,目标用户为预设样本库中样本语音数据匮乏的用户,对于本申请实施例,为了确定目标用户,可以预先设定预设语音数据量,该预设语音数据量具体可以根据构建预设语音数据生成模型所需的训练样本量进行确定,之后确定预设样本库中各个用户对应的语音数据量,并将各个用户对应的语音数据量分别与预设语音数据量进行对比,根据对比结果筛选目标用户,具体可以将语音数据量小于预设语音数据量的用户确定为目标用户,此外,还可以根据各个用户对应的语音数据量,计算预设样本库的语音数据量均值,并将各个用户对应的语音数据量分别与语音数据量均值进行对比,根据对比结果筛选目标用户,具体可以将语音数据量小于语音数据量均值的用户确定为目标用户,由此能够确定预设样本库中数据量匮乏的目标用户,以便根据目标用户的样本语音数据,生成目标用户更多的语音数据,以达到预设样本库中语音数据的平衡。
202、对所述样本语音数据进行特征提取,得到所述样本语音数据对应的语音特征。
其中,样本语音数据对应的语音特征具体可以为样本语音数据对应的梅尔倒谱系数,对于本申请实施例,为了提取样本语音数据对应的语音特征,步骤202具体包括:对所述样本语音数据进行滤波处理,得到所述样本语音数据对应的语音能量;对所述语音能量进行离散余弦化处理,得到所述样本语音数据对应的语音特征。
具体地,在对样本语音数据进行特征提取之前需要对样本语音数据进行预处理,该预处理过程具体包括预加重、分帧和加窗函数处理,从而使得目标用户的样本语音数据变得平坦,即将样本语音数据的每N个采用点合成一个观测单位(帧),帧的左右端具有连续性,在对目标用户的样本语音数据进行预处理之后,需要对预处理后的样本语音数据进行快速傅里叶转化,得到转换后的语音数据,之后将转换后的语音数据输入Mel滤波器,计算转换后的语音数据通过Mel滤波器后的语音能量,接着根据样本语音数据对应的语音能量,计算样本语音数据对应的梅尔倒谱系数,并将该梅尔倒谱系数确定为目标用户的样本语音数据对应的语音特征,梅尔倒谱系数的具体计算公式如下:
C(n) = Σ_{m=1}^{M} s(m)·cos(πn(m − 0.5)/M), n = 1, 2, …, L
其中,s(m)代表语音数据经过第m个滤波器后输出的语音能量,M为滤波器的总个数,C(n)为梅尔倒谱系数,n代表梅尔倒谱系数的阶数,L通常可取12-16,s(m)语音能量的具 体计算公式如下:
s(m) = Σ_{k=0}^{K−1} |X(k)|²·H_m(k)
其中,
|X(k)|²
为对语音数据的频谱取模平方得到语音数据的功率谱,H m(k)为滤波器的频率,K为傅里叶变换的点数。由此按照上述公式,能够计算出目标用户样本语音数据对应的梅尔倒谱系数,并将其确定为样本语音数据对应的语音特征,以便根据该样本语音数据,生成目标用户除样本语音数据之外的新增样本数据,以到达样本语音数据的平衡。
203、根据所述样本语音数据对应的语音特征和预先构建的嵌入矩阵,计算所述样本语音数据对应的注意力分值。
其中,所述嵌入矩阵是通过对所述样本语音数据进行训练得到的,对于本申请实施例,为了获取目标用户更多的语音数据,以达到样本语音数据的平衡,将样本语音数据对应的语音特征输入至预设语音数据生成模型进行数据生成,得到目标用户样本语音数据之外的验证语音数据,其中,预设语音数据生成模型具体可以为已经训练好的GPT-2模型,具体利用GPT-2模型生成目标用户更多的语音数据时,步骤203具体包括:根据所述嵌入矩阵确定所述语音特征对应的查询向量、键向量和值向量;将所述语音特征对应的查询向量和与其对应的键向量相乘,得到所述语音特征对应的权重值;根据所述语音特征对应的权重值和值向量,计算所述语音特征对应的注意力分值。
具体地,预设嵌入矩阵是由训练好的GPT-2模型确定的,即通过训练GPT-2模型能够得到预设嵌入矩阵,之后根据该预设嵌入矩阵能够确定语音特征对应的查询向量、键向量和值向量,之后在GPT-2模型中的注意力层根据语音特征对应的查询向量和键向量,计算语音特征对应的权重值,接着根据语音特征对应的权重值和值向量,计算语音特征对应的注意力分值并输出,该注意力分值的具体计算公式如下:
Attention(Q, K, V) = softmax(QK^T / √d_k)·V
其中,Attention(Q,K,V)为已有特征对应的注意力分值,Q为查询向量,K为键向量,V为值向量,dK为为键向量的维数,通常取64。由此能够得到语音特征对应的注意力得分,
204、基于所述注意力分值,确定所述目标用户样本语音数据之外的验证语音数据。
对于本申请实施例,将该注意力得分输入至GPT-2模型中的神经网络层,生成目标用户除样本语音数据之外的验证语音数据,以确保样本库中不同用户的语音数据量达到平衡。
205、将所述目标用户的样本语音数据和验证语音数据,以及预设样本库中其他用户的样本语音数据确定为第一训练样本。
其中,其他用户为语音数据量不匮乏的用户,即其他用户对应的语音数据量大于预设数据量,对于本方实施例,生成目标用户更多的语音数据后,样本库中不同用户的语音数据量达到平衡,可以将样本库中的语音数据作为训练样本,构建预设语音识别模型,具体地,将目标用户的样本语音数据和验证语音数据,以及预设样本库中其他用户的样本语音数据共同作为第一训练样本,以便根据该第一训练样本构建预设语音室识别模型。
206、利用预设神经网络算法对所述第一训练样本进行训练,构建预设语音识别模型。
对于本申请实施例,预设语音识别模型具体可以为预设神经网络模型,该预设神经网络模型包括多个隐藏层,给定预设神经网络模型的初始参数,之后将第一训练样本输入至预设神经网络模型中进行训练,即对预设神经网络模型中的初始参数进行调整,构建预设语音识别模型。
进一步地,为了确保预设语音识别模型能够对真实的语音数据有更好的识别效果,还 可以利用目标用户的样本语音数据以及预设样本库中其他用户的语音数据对构建的预设语音识别模型进行调整,基于此,所述方法还包括:
将所述目标用户的样本语音数据和所述其他用户的样本语音数据确定为第二训练样本;利用所述第二训练样本对所述预设语音识别模型进行调整,得到调整后的预设语音识别模型。由此调整后的预设语音识别模型能够对真实的语音数据有更好的识别效果。
进一步地,为了保证调整的预设语音识别模型的识别精度,所述方法还包括:利用测试样本对所述调整后的预设语音识别模型进行测试,得到所述调整后的预设语音识别模型对应的测试结果;根据所述测试结果,确定所述调整后的预设语音识别模型对应的语音识别准确率;若所述语音识别准确率小于预设语音识别准确率,对所述整后的预设语音识别模型中的参数进行调整,直至所述调整后的预设语音识别模型对应的语音识别准确率达到预设语音识别准确率。具体地,获取多个用户的测试样本,将测试样本输入至调整后的预设语音识别模型进行测试,能够得到调整后的预设语音识别模型的测试结果,根据该测试结果,统计测试样本中识别结果正确的样本数量和样本总数,并根据识别结果正确的样本数量和样本总数,计算调整后的预设语音识别模型对应的语音识别准确率,如果计算的语音识别准确率未达到预设语音识别准确率,则确定调整后的预设语音识别模型的识别精度未达到要求,不可以进行语音识别,需要继续进行训练;如果计算的语音识别准确率达到预设语音识别准确率,则确定调整后的预设语音识别模型的识别精度得到要求,可以用来进行语音识别,基于此,所述方法还包括:获取待识别用户的语音数据;将所述待识别用户的语音数据输入至调整后的预设语音识别模型进行语音识别,确定所述待识别用户对应的语音识别结果。
具体地,将待识别用户的语音数据输入至调整后的预设语音识别模型进行语音识别,该调整后的预设语音识别模型中的隐藏层会提取待识别用户的语音数据对应的语音特征,并将待识别用户对应的语音特征与预设特征库中其他用户对应的语音特征进行比对,根据比对结果输出待识别用户对应的语音识别结果,即利用调整后的预设语音识别模型能够对待识别用户的身份进行识别。
本申请实施例提供的另一种语音数据生成方法,与目前采用欠采样的方式对多数样本语音数据进行数据消除的方式相比,本申请能够获取目标用户的样本语音数据;并对所述样本语音数据进行特征提取,得到所述样本语音数据对应的语音特征;与此同时,根据所述样本语音数据对应的语音特征和预先构建的嵌入矩阵,计算所述样本语音数据对应的注意力分值,所述嵌入矩阵是通过对所述样本语音数据进行训练得到的,并基于所述注意力分值,确定所述目标用户样本语音数据之外的验证语音数据。由此通过提取样本数据量匮乏的目标用户的语音特征,能够计算样本语音数据对应的注意力分值,并依据该注意力分值生成目标用户的验证语音数据,从而能够根据目标用户的少数样本语音数据,生成更多的语音数据,使不同用户的样本语音数据达到平衡,避免采用欠采用的方式丢失掉有价值的用户信息,同时依据该样本语音数据训练的预设语音识别模型的语音识别精度也得到了提高。
进一步地,作为图1的具体实现,本申请实施例提供了一种语音数据生成装置,如图3所示,所述装置包括:获取单元31、提取单元32、第一确定单元33和第二确定单元34。
所述获取单元31,可以用于获取目标用户的样本语音数据。所述获取单元31是本装置中获取目标用户的样本语音数据的主要功能模块。
所述提取单元32,可以用于对所述样本语音数据进行特征提取,得到所述样本语音数据对应的语音特征。所述提取单元32是本装置中对所述样本语音数据进行特征提取,得到所述样本语音数据对应的语音特征的主要功能模块,也是核心模块。
所述第一确定单元33,可以用于根据所述样本语音数据对应的语音特征和预先构建的嵌入矩阵,计算所述样本语音数据对应的注意力分值,所述嵌入矩阵是通过对所述样本语 音数据进行训练得到的。所述确定单元33是本装置中根据所述样本语音数据对应的语音特征和预先构建的嵌入矩阵,计算所述样本语音数据对应的注意力分值的主要功能模块,也是核心模块。
所述第二确定单元34,可以用于基于所述注意力分值,确定所述目标用户样本语音数据之外的验证语音数据。所述第二确定单元是本装置中基于所述注意力分值,确定所述目标用户样本语音数据之外的验证语音数据的主要功能模块,也是核心模块。
进一步地,为了计算所述样本语音数据对应的注意力分值,如图4所示,所述第一确定单元33,包括:确定模块331、相乘模块332和计算模块333。
所述确定模块331,可以用于确定所述语音特征对应的嵌入矩阵,并根据所述嵌入矩阵确定所述语音特征对应的查询向量、键向量和值向量。
所述相乘模块332,可以用于分别将所述语音特征对应的查询向量和与其对应的键向量相乘,得到所述语音特征对应的权重值。
所述计算模块333,可以用于根据所述语音特征对应的权重值和值向量,计算所述语音特征对应的注意力分值。
进一步地,为了提取样本语音数据对应的语音特征,所述提取单元32,包括:滤波模块321和离散模块322。
所述滤波模块321,可以用于对所述样本语音数据进行滤波处理,得到所述样本语音数据对应的语音能量。
所述离散模块332,可以用于对所述语音能量进行离散余弦化处理,得到所述样本语音数据对应的语音特征。
进一步地,为了构建预设语音识别模型,所述装置还包括构建单元35。
所述第一确定单元33,还可以用于将所述目标用户的样本语音数据和验证语音数据,以及预设样本库中其他用户的样本语音数据确定为第一训练样本。
所述构建单元35,可以用于利用预设神经网络算法对所述第一训练样本进行训练,构建预设语音识别模型。
进一步地,为了提高预设语音识别模型的识别精度,所述装置还包括:调整单元36。
所述第一确定单元33,还可以用于将所述目标用户的样本语音数据和所述其他用户的样本语音数据确定为第二训练样本。
所述调整单元36,可以用于利用所述第二训练样本对所述预设语音识别模型进行调整,得到调整后的预设语音识别模型。
进一步地,为了对调整后的预设语音识别模型进行测试,所述装置还包括测试单元37。
所述测试单元37,可以用于利用测试样本对所述调整后的预设语音识别模型进行测试,得到所述调整后的预设语音识别模型对应的测试结果。
所述第一确定单元33,还可以用于根据所述测试结果,确定所述调整后的预设语音识别模型对应的语音识别准确率。
所述调整单元36,还可以用于若所述语音识别准确率小于预设语音识别准确率,对所述整后的预设语音识别模型中的参数进行调整,直至所述调整后的预设语音识别模型对应的语音识别准确率达到预设语音识别准确率。
进一步地,为了对待识别用户进行语音识别,所述装置还包括:识别单元38。
所述获取单元31,还可以用于获取待识别用户的语音数据。
所述识别单元38,可以用于将所述待识别用户的语音数据输入至调整后的预设语音识别模型进行语音识别,确定所述待识别用户对应的语音识别结果。
需要说明的是,本申请实施例提供的一种语音数据生成装置所涉及各功能模块的其他相应描述,可以参考图1所示方法的对应描述,在此不再赘述。
基于上述如图1所示方法,相应的,本申请实施例还提供了一种计算机可读存储介质, 所述计算机可读存储介质可以是非易失性,也可以是易失性,其上存储有计算机程序,该程序被处理器执行时实现语音数据生成方法;其中,所述语音数据生成方法的步骤包括::获取目标用户的样本语音数据;对所述样本语音数据进行特征提取,得到所述样本语音数据对应的语音特征;根据所述样本语音数据对应的语音特征和预先构建的嵌入矩阵,计算所述样本语音数据对应的注意力分值,所述嵌入矩阵是通过对所述样本语音数据进行训练得到的;基于所述注意力分值,确定所述目标用户样本语音数据之外的验证语音数据。
进一步地,所述根据所述样本语音数据对应的语音特征和预先构建的嵌入矩阵,计算所述样本语音数据对应的注意力分值,包括:
根据所述嵌入矩阵确定所述语音特征对应的查询向量、键向量和值向量;
将所述语音特征对应的查询向量和键向量相乘,得到所述语音特征对应的权重值;
根据所述语音特征对应的权重值和值向量,计算所述语音特征对应的注意力分值。
进一步地,所述对所述样本语音数据进行特征提取,得到所述样本语音数据对应的语音特征,包括:
对所述样本语音数据进行滤波处理,得到所述样本语音数据对应的语音能量;
对所述语音能量进行离散余弦化处理,得到所述样本语音数据对应的语音特征。
进一步地,在所述基于所述注意力分值,确定所述目标用户样本语音数据之外的验证语音数据之后,所述方法还包括:
将所述目标用户的样本语音数据和验证语音数据,以及预设样本库中其他用户的样本语音数据确定为第一训练样本;
利用预设神经网络算法对所述第一训练样本进行训练,构建预设语音识别模型。
进一步地,在所述利用预设神经网络算法对所述第一训练样本进行训练,构建预设语音识别模型之后,所述方法还包括:
将所述目标用户的样本语音数据和所述其他用户的样本语音数据确定为第二训练样本;
利用所述第二训练样本对所述预设语音识别模型进行调整,得到调整后的预设语音识别模型。
进一步地,所述方法还包括:
利用测试样本对所述调整后的预设语音识别模型进行测试,得到所述调整后的预设语音识别模型对应的测试结果;
根据所述测试结果,确定所述调整后的预设语音识别模型对应的语音识别准确率;
若所述语音识别准确率小于预设语音识别准确率,对所述整后的预设语音识别模型中的参数进行调整,直至所述调整后的预设语音识别模型对应的语音识别准确率达到预设语音识别准确率。
基于上述如图1所示方法和如图3所示装置的实施例,本申请实施例还提供了一种计算机设备的实体结构图,如图5所示,该计算机设备包括:处理器41、存储器42、及存储在存储器42上并可在处理器上运行的计算机程序,其中存储器42和处理器41均设置在总线43上所述处理器41执行所述程序时实现语音数据生成方法;其中,所述语音数据生成方法的步骤包括:获取目标用户的样本语音数据;对所述样本语音数据进行特征提取,得到所述样本语音数据对应的语音特征;根据所述样本语音数据对应的语音特征和预先构建的嵌入矩阵,计算所述样本语音数据对应的注意力分值,所述嵌入矩阵是通过对所述样本语音数据进行训练得到的;基于所述注意力分值,确定所述目标用户样本语音数据之外的验证语音数据。
进一步地,所述根据所述样本语音数据对应的语音特征和预先构建的嵌入矩阵,计算所述样本语音数据对应的注意力分值,包括:
根据所述嵌入矩阵确定所述语音特征对应的查询向量、键向量和值向量;
将所述语音特征对应的查询向量和键向量相乘,得到所述语音特征对应的权重值;
根据所述语音特征对应的权重值和值向量,计算所述语音特征对应的注意力分值。
进一步地,所述对所述样本语音数据进行特征提取,得到所述样本语音数据对应的语音特征,包括:
对所述样本语音数据进行滤波处理,得到所述样本语音数据对应的语音能量;
对所述语音能量进行离散余弦化处理,得到所述样本语音数据对应的语音特征。
进一步地,在所述基于所述注意力分值,确定所述目标用户样本语音数据之外的验证语音数据之后,所述方法还包括:
将所述目标用户的样本语音数据和验证语音数据,以及预设样本库中其他用户的样本语音数据确定为第一训练样本;
利用预设神经网络算法对所述第一训练样本进行训练,构建预设语音识别模型。
进一步地,在所述利用预设神经网络算法对所述第一训练样本进行训练,构建预设语音识别模型之后,所述方法还包括:
将所述目标用户的样本语音数据和所述其他用户的样本语音数据确定为第二训练样本;
利用所述第二训练样本对所述预设语音识别模型进行调整,得到调整后的预设语音识别模型。
进一步地,所述方法还包括:
利用测试样本对所述调整后的预设语音识别模型进行测试,得到所述调整后的预设语音识别模型对应的测试结果;
根据所述测试结果,确定所述调整后的预设语音识别模型对应的语音识别准确率;
若所述语音识别准确率小于预设语音识别准确率,对所述整后的预设语音识别模型中的参数进行调整,直至所述调整后的预设语音识别模型对应的语音识别准确率达到预设语音识别准确率。
通过本申请的技术方案,本申请能够获取目标用户的样本语音数据;并对所述样本语音数据进行特征提取,得到所述样本语音数据对应的语音特征;与此同时,根据所述样本语音数据对应的语音特征和预先构建的嵌入矩阵,计算所述样本语音数据对应的注意力分值,所述嵌入矩阵是通过对所述样本语音数据进行训练得到的;并基于所述注意力分值,确定所述目标用户样本语音数据之外的验证语音数据,由此通过提取样本数据量匮乏的目标用户的语音特征,能够计算样本语音数据对应的注意力分值,并依据该注意力分值生成目标用户的验证语音数据,从而能够根据目标用户的少数样本语音数据,生成更多的语音数据,使不同用户的样本语音数据达到平衡,避免采用欠采用的方式丢失掉有价值的用户信息,同时依据该样本语音数据训练的预设语音识别模型的语音识别精度也得到了提高。
显然,本领域的技术人员应该明白,上述的本申请的各模块或各步骤可以用通用的计算装置来实现,它们可以集中在单个的计算装置上,或者分布在多个计算装置所组成的网络上,可选地,它们可以用计算装置可执行的程序代码来实现,从而,可以将它们存储在存储装置中由计算装置来执行,并且在某些情况下,可以以不同于此处的顺序执行所示出或描述的步骤,或者将它们分别制作成各个集成电路模块,或者将它们中的多个模块或步骤制作成单个集成电路模块来实现。这样,本申请不限制于任何特定的硬件和软件结合。
以上所述仅为本申请的优选实施例而已,并不用于限制本申请,对于本领域的技术人员来说,本申请可以有各种更改和变化。凡在本申请的精神和原则之内,所作的任何修改、等同替换、改进等,均应包括在本申请的保护范围之内。

Claims (20)

  1. 一种语音数据生成方法,其中,包括:
    获取目标用户的样本语音数据;
    对所述样本语音数据进行特征提取,得到所述样本语音数据对应的语音特征;
    根据所述样本语音数据对应的语音特征和预先构建的嵌入矩阵,计算所述样本语音数据对应的注意力分值,所述嵌入矩阵是通过对所述样本语音数据进行训练得到的;
    基于所述注意力分值,确定所述目标用户样本语音数据之外的验证语音数据。
  2. 根据权利要求1所述的方法,其中,所述根据所述样本语音数据对应的语音特征和预先构建的嵌入矩阵,计算所述样本语音数据对应的注意力分值,包括:
    根据所述嵌入矩阵确定所述语音特征对应的查询向量、键向量和值向量;
    将所述语音特征对应的查询向量和键向量相乘,得到所述语音特征对应的权重值;
    根据所述语音特征对应的权重值和值向量,计算所述语音特征对应的注意力分值。
  3. 根据权利要求1所述的方法,其中,所述对所述样本语音数据进行特征提取,得到所述样本语音数据对应的语音特征,包括:
    对所述样本语音数据进行滤波处理,得到所述样本语音数据对应的语音能量;
    对所述语音能量进行离散余弦化处理,得到所述样本语音数据对应的语音特征。
  4. 根据权利要求1所述的方法,其中,在所述基于所述注意力分值,确定所述目标用户样本语音数据之外的验证语音数据之后,所述方法还包括:
    将所述目标用户的样本语音数据和验证语音数据,以及预设样本库中其他用户的样本语音数据确定为第一训练样本;
    利用预设神经网络算法对所述第一训练样本进行训练,构建预设语音识别模型。
  5. 根据权利要求4所述的方法,其中,在所述利用预设神经网络算法对所述第一训练样本进行训练,构建预设语音识别模型之后,所述方法还包括:
    将所述目标用户的样本语音数据和所述其他用户的样本语音数据确定为第二训练样本;
    利用所述第二训练样本对所述预设语音识别模型进行调整,得到调整后的预设语音识别模型。
  6. 根据权利要求5所述的方法,其中,所述方法还包括:
    利用测试样本对所述调整后的预设语音识别模型进行测试,得到所述调整后的预设语音识别模型对应的测试结果;
    根据所述测试结果,确定所述调整后的预设语音识别模型对应的语音识别准确率;
    若所述语音识别准确率小于预设语音识别准确率,对所述整后的预设语音识别模型中的参数进行调整,直至所述调整后的预设语音识别模型对应的语音识别准确率达到预设语音识别准确率。
  7. 根据权利要求5所述的方法,其中,所述方法还包括:
    获取待识别用户的语音数据;
    将所述待识别用户的语音数据输入至调整后的预设语音识别模型进行语音识别,确定所述待识别用户对应的语音识别结果。
  8. 一种语音数据生成装置,其中,包括:
    获取单元,用于获取目标用户的样本语音数据;
    提取单元,用于对所述样本语音数据进行特征提取,得到所述样本语音数据对应的语音特征;
    第一确定单元,用于根据所述样本语音数据对应的语音特征和预先构建的嵌入矩阵, 计算所述样本语音数据对应的注意力分值,所述嵌入矩阵是通过对所述样本语音数据进行训练得到的;
    第二确定单元,用于基于所述注意力分值,确定所述目标用户样本语音数据之外的验证语音数据。
  9. 一种计算机可读存储介质,其上存储有计算机程序,其中,所述计算机程序被处理器执行时实现语音数据生成方法;
    其中,所述语音数据生成方法的步骤包括:
    获取目标用户的样本语音数据;
    对所述样本语音数据进行特征提取,得到所述样本语音数据对应的语音特征;
    根据所述样本语音数据对应的语音特征和预先构建的嵌入矩阵,计算所述样本语音数据对应的注意力分值,所述嵌入矩阵是通过对所述样本语音数据进行训练得到的;
    基于所述注意力分值,确定所述目标用户样本语音数据之外的验证语音数据。
  10. 根据权利要求9所述的计算机可读存储介质,其中,所述根据所述样本语音数据对应的语音特征和预先构建的嵌入矩阵,计算所述样本语音数据对应的注意力分值,包括:
    根据所述嵌入矩阵确定所述语音特征对应的查询向量、键向量和值向量;
    将所述语音特征对应的查询向量和键向量相乘,得到所述语音特征对应的权重值;
    根据所述语音特征对应的权重值和值向量,计算所述语音特征对应的注意力分值。
  11. 根据权利要求9所述的计算机可读存储介质,其中,所述对所述样本语音数据进行特征提取,得到所述样本语音数据对应的语音特征,包括:
    对所述样本语音数据进行滤波处理,得到所述样本语音数据对应的语音能量;
    对所述语音能量进行离散余弦化处理,得到所述样本语音数据对应的语音特征。
  12. 根据权利要求9所述的计算机可读存储介质,其中,在所述基于所述注意力分值,确定所述目标用户样本语音数据之外的验证语音数据之后,所述方法还包括:
    将所述目标用户的样本语音数据和验证语音数据,以及预设样本库中其他用户的样本语音数据确定为第一训练样本;
    利用预设神经网络算法对所述第一训练样本进行训练,构建预设语音识别模型。
  13. 根据权利要求12所述的计算机可读存储介质,其中,在所述利用预设神经网络算法对所述第一训练样本进行训练,构建预设语音识别模型之后,所述方法还包括:
    将所述目标用户的样本语音数据和所述其他用户的样本语音数据确定为第二训练样本;
    利用所述第二训练样本对所述预设语音识别模型进行调整,得到调整后的预设语音识别模型。
  14. 根据权利要求13所述的计算机可读存储介质,其中,所述方法还包括:
    利用测试样本对所述调整后的预设语音识别模型进行测试,得到所述调整后的预设语音识别模型对应的测试结果;
    根据所述测试结果,确定所述调整后的预设语音识别模型对应的语音识别准确率;
    若所述语音识别准确率小于预设语音识别准确率,对所述整后的预设语音识别模型中的参数进行调整,直至所述调整后的预设语音识别模型对应的语音识别准确率达到预设语音识别准确率。
  15. 一种计算机设备,包括存储器、处理器及存储在存储器上并可在处理器上运行的计算机程序,其中,所述计算机程序被处理器执行时实现语音数据生成方法;
    其中,所述语音数据生成方法的步骤包括:
    获取目标用户的样本语音数据;
    对所述样本语音数据进行特征提取,得到所述样本语音数据对应的语音特征;
    根据所述样本语音数据对应的语音特征和预先构建的嵌入矩阵,计算所述样本语音数 据对应的注意力分值,所述嵌入矩阵是通过对所述样本语音数据进行训练得到的;
    基于所述注意力分值,确定所述目标用户样本语音数据之外的验证语音数据。
  16. 根据权利要求15所述的计算机设备,其中,所述根据所述样本语音数据对应的语音特征和预先构建的嵌入矩阵,计算所述样本语音数据对应的注意力分值,包括:
    根据所述嵌入矩阵确定所述语音特征对应的查询向量、键向量和值向量;
    将所述语音特征对应的查询向量和键向量相乘,得到所述语音特征对应的权重值;
    根据所述语音特征对应的权重值和值向量,计算所述语音特征对应的注意力分值。
  17. 根据权利要求15所述的计算机设备,其中,所述对所述样本语音数据进行特征提取,得到所述样本语音数据对应的语音特征,包括:
    对所述样本语音数据进行滤波处理,得到所述样本语音数据对应的语音能量;
    对所述语音能量进行离散余弦化处理,得到所述样本语音数据对应的语音特征。
  18. 根据权利要求15所述的计算机设备,其中,在所述基于所述注意力分值,确定所述目标用户样本语音数据之外的验证语音数据之后,所述方法还包括:
    将所述目标用户的样本语音数据和验证语音数据,以及预设样本库中其他用户的样本语音数据确定为第一训练样本;
    利用预设神经网络算法对所述第一训练样本进行训练,构建预设语音识别模型。
  19. 根据权利要求18所述的计算机设备,其中,在所述利用预设神经网络算法对所述第一训练样本进行训练,构建预设语音识别模型之后,所述方法还包括:
    将所述目标用户的样本语音数据和所述其他用户的样本语音数据确定为第二训练样本;
    利用所述第二训练样本对所述预设语音识别模型进行调整,得到调整后的预设语音识别模型。
  20. 根据权利要求19所述的计算机设备,其中,所述方法还包括:
    利用测试样本对所述调整后的预设语音识别模型进行测试,得到所述调整后的预设语音识别模型对应的测试结果;
    根据所述测试结果,确定所述调整后的预设语音识别模型对应的语音识别准确率;
    若所述语音识别准确率小于预设语音识别准确率,对所述整后的预设语音识别模型中的参数进行调整,直至所述调整后的预设语音识别模型对应的语音识别准确率达到预设语音识别准确率。
PCT/CN2020/136366 2020-10-26 2020-12-15 语音数据生成方法、装置、计算机设备及存储介质 WO2021189980A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011153538.8 2020-10-26
CN202011153538.8A CN112331182A (zh) 2020-10-26 2020-10-26 语音数据生成方法、装置、计算机设备及存储介质

Publications (1)

Publication Number Publication Date
WO2021189980A1 true WO2021189980A1 (zh) 2021-09-30

Family

ID=74311673

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/136366 WO2021189980A1 (zh) 2020-10-26 2020-12-15 语音数据生成方法、装置、计算机设备及存储介质

Country Status (2)

Country Link
CN (1) CN112331182A (zh)
WO (1) WO2021189980A1 (zh)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109817246A (zh) * 2019-02-27 2019-05-28 平安科技(深圳)有限公司 情感识别模型的训练方法、情感识别方法、装置、设备及存储介质
US20190189112A1 (en) * 2016-07-22 2019-06-20 Baidu Online Network Technology (Beijing) Co., Ltd. Voice recognition processing method, device and computer storage medium
CN110992938A (zh) * 2019-12-10 2020-04-10 同盾控股有限公司 语音数据处理方法、装置、电子设备及计算机可读介质
CN111429938A (zh) * 2020-03-06 2020-07-17 江苏大学 一种单通道语音分离方法、装置及电子设备
CN111798874A (zh) * 2020-06-24 2020-10-20 西北师范大学 一种语音情绪识别方法及系统

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3893763B2 (ja) * 1998-08-17 2007-03-14 富士ゼロックス株式会社 音声検出装置
CN111145718B (zh) * 2019-12-30 2022-06-07 中国科学院声学研究所 一种基于自注意力机制的中文普通话字音转换方法

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190189112A1 (en) * 2016-07-22 2019-06-20 Baidu Online Network Technology (Beijing) Co., Ltd. Voice recognition processing method, device and computer storage medium
CN109817246A (zh) * 2019-02-27 2019-05-28 平安科技(深圳)有限公司 情感识别模型的训练方法、情感识别方法、装置、设备及存储介质
CN110992938A (zh) * 2019-12-10 2020-04-10 同盾控股有限公司 语音数据处理方法、装置、电子设备及计算机可读介质
CN111429938A (zh) * 2020-03-06 2020-07-17 江苏大学 一种单通道语音分离方法、装置及电子设备
CN111798874A (zh) * 2020-06-24 2020-10-20 西北师范大学 一种语音情绪识别方法及系统

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
HU TINGTING, FENG YAQIN, SHEN LINGJIE, WANG WEI: "The Salient Feature Selection by Attention Mechanism Based LSTM in Speech Emotion Recognition", SHENGXUE-JISHU : JIKAN = TECHNICAL ACOUSTICS, SHENG XUE JI SHU BIAN JI BU, CN, vol. 38, no. 4, 31 August 2019 (2019-08-31), CN, pages 414 - 421, XP055852794, ISSN: 1000-3630, DOI: 10.16300/j.cnki.1000-3630.2019.04.010 *

Also Published As

Publication number Publication date
CN112331182A (zh) 2021-02-05

Similar Documents

Publication Publication Date Title
TWI641965B (zh) 基於聲紋識別的身份驗證的方法及系統
CN107492382B (zh) 基于神经网络的声纹信息提取方法及装置
WO2017215558A1 (zh) 一种声纹识别方法和装置
WO2019119505A1 (zh) 人脸识别的方法和装置、计算机装置及存储介质
CN104732978B (zh) 基于联合深度学习的文本相关的说话人识别方法
EP3486903B1 (en) Identity vector generating method, computer apparatus and computer readable storage medium
CN105023573B (zh) 使用听觉注意力线索的语音音节/元音/音素边界检测
WO2020181824A1 (zh) 声纹识别方法、装置、设备以及计算机可读存储介质
CN109801634B (zh) 一种声纹特征的融合方法及装置
WO2019019256A1 (zh) 电子装置、身份验证的方法、系统及计算机可读存储介质
CN105989849B (zh) 一种语音增强方法、语音识别方法、聚类方法及装置
WO2018223727A1 (zh) 识别声纹的方法、装置、设备及介质
US20230087657A1 (en) Assessing face image quality for application of facial recognition
CN112927707A (zh) 语音增强模型的训练方法和装置及语音增强方法和装置
WO2020073518A1 (zh) 声纹验证的方法、装置、计算机设备和存储介质
CN110120230B (zh) 一种声学事件检测方法及装置
Zhang et al. I-vector based physical task stress detection with different fusion strategies
WO2021189980A1 (zh) 语音数据生成方法、装置、计算机设备及存储介质
WO2021189979A1 (zh) 语音增强方法、装置、计算机设备及存储介质
CN110991228A (zh) 一种抗光照影响的改进pca人脸识别算法
WO2019218515A1 (zh) 服务器、基于声纹的身份验证方法及存储介质
WO2022205249A1 (zh) 音频特征补偿方法、音频识别方法及相关产品
CN112466311B (zh) 声纹识别方法、装置、存储介质及计算机设备
Zhipeng et al. Voiceprint recognition based on BP Neural Network and CNN
CN111310836A (zh) 一种基于声谱图的声纹识别集成模型的防御方法及防御装置

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20928007

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20928007

Country of ref document: EP

Kind code of ref document: A1