CN113593524A - Method and device for training acoustic model for accent recognition, and storage medium


Info

Publication number: CN113593524A
Authority: CN (China)
Prior art keywords: initial, accent, basic, features, phoneme
Legal status: Pending
Application number: CN202110104567.3A
Other languages: Chinese (zh)
Inventors: 曹松军 (Cao Songjun), 马龙 (Ma Long)
Current assignee: Tencent Technology Shenzhen Co Ltd
Original assignee: Tencent Technology Shenzhen Co Ltd
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202110104567.3A
Publication of CN113593524A

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/005: Language recognition
    • G10L15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063: Training
    • G10L15/26: Speech to text systems
    • G10L2015/025: Phonemes, fenemes or fenones being the recognition units

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)

Abstract

The application relates to a method and a device for training an acoustic model for accent recognition, computer equipment and a storage medium. The method comprises the following steps: acquiring training data; extracting acoustic features corresponding to training voices; inputting the acoustic features and the accent region features into an initial accent recognition acoustic model, transforming the accent region features by the initial accent recognition acoustic model to obtain initial transformation features, extracting voice features of the acoustic features to obtain initial voice features, combining the initial transformation features and the initial voice features to obtain initial combination features, and performing voice phoneme recognition on the initial combination features to obtain initial voice phoneme information; and calculating loss information based on the initial voice phoneme information and the corresponding phoneme label, updating the initial accent recognition acoustic model based on the loss information, and iterating in this way until training is finished to obtain the target accent recognition acoustic model. By adopting the method, the accuracy of accent recognition can be improved.

Description

Method and device for training acoustic model for accent recognition, and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method and an apparatus for training an acoustic model for accent recognition, an accent recognition method and an apparatus, a computer device, and a storage medium.
Background
With the development of artificial intelligence technology, speech recognition technology has emerged. Speech recognition technology can further be split into four major parts:
1. Front-end processing: includes noise reduction, sound source localization, echo cancellation and other processing of the voice signal.
2. Acoustic model: models the mapping from the voice signal to the corresponding pronunciation units.
3. Language model and dictionary: models the mapping from pronunciation units to Chinese characters.
4. Decoder: performs the whole search process from voice to words by combining the acoustic model, the language model and the dictionary.
In the conventional technology, an acoustic model is usually used to recognize a speech signal to obtain the corresponding pronunciation units. However, while current acoustic models can ensure the accuracy of the recognition result for speech that carries no accent, their accuracy drops greatly when they recognize pronunciation units in speech that carries an accent, so the accuracy of the accent speech recognition result is greatly reduced.
Disclosure of Invention
In view of the above, it is desirable to provide an accent recognition acoustic model training method, an accent recognition method, an apparatus, a computer device, and a storage medium, which can improve the accuracy of pronunciation units obtained by accent recognition, thereby improving the accent speech recognition result.
A method of acoustic model training for accent recognition, the method comprising:
acquiring training data, wherein the training data comprises training voice, accent region characteristics corresponding to the training voice and phoneme labels;
extracting acoustic features corresponding to training voices;
inputting the acoustic features and the accent region features into an initial accent recognition acoustic model, transforming the accent region features by the initial accent recognition acoustic model to obtain initial transformation features, extracting voice features of the acoustic features to obtain initial voice features, combining the initial transformation features and the initial voice features to obtain initial combination features, and performing voice phoneme recognition on the initial combination features to obtain initial voice phoneme information;
and calculating loss information based on the initial voice phoneme information and the corresponding phoneme label, updating the initial accent recognition acoustic model based on the loss information, and returning to the step of inputting the acoustic features and the accent region features into the initial accent recognition acoustic model for iterative execution until training is completed to obtain the target accent recognition acoustic model.
In one embodiment, the inputting the acoustic feature into the initial feature extraction network to perform speech feature extraction to obtain the initial speech feature includes:
and transforming the acoustic features through spectrum enhancement to obtain acoustic enhancement features, and inputting the acoustic enhancement features into the initial feature extraction network to extract voice features to obtain the initial voice features.
In one embodiment, the initial phoneme recognition network comprises an initial speech phoneme feature extraction network and an initial accent phoneme recognition network;
inputting the initial voice feature into an initial phoneme recognition network for voice phoneme recognition to obtain initial voice phoneme information, including:
inputting the initial voice features into the initial voice phoneme feature extraction network for voice phoneme feature extraction to obtain initial voice phoneme features, and inputting the initial voice phoneme features into an initial accent phoneme recognition network for phoneme recognition to obtain initial voice phoneme information.
In one embodiment, the initial speech phoneme feature extraction network comprises at least one initial time-delay neural network and at least one initial gating cycle network, and the initial time-delay neural network and the initial gating cycle network are arranged in an alternating network structure;
inputting the initial voice feature into the initial voice phoneme feature extraction network for voice phoneme feature extraction to obtain an initial voice phoneme feature, including:
inputting the initial voice feature into the initial time delay neural network for calculation to obtain an initial time delay feature, and inputting the initial time delay feature into the initial gating circulation network for calculation to obtain the initial voice phoneme feature.
In one embodiment, the building the basic accent recognition acoustic model based on the trained initial accent recognition acoustic model includes:
and taking the trained initial feature extraction network as the basic feature extraction network, taking the trained initial time delay neural network as the basic time delay neural network, taking the trained initial gated circulation network as the basic gated circulation network, taking the trained initial accent phoneme recognition network as the basic accent phoneme recognition network, and establishing a parameter initialized conversion network to obtain the basic accent recognition acoustic model.
In one embodiment, after the steps of calculating loss information based on the initial speech phoneme information and the corresponding phoneme label, updating the initial accent recognition acoustic model based on the loss information, and returning to input the acoustic features and the accent region features into the initial accent recognition acoustic model are iteratively performed until a target accent recognition acoustic model is obtained when training is completed, the method further includes:
acquiring target accent data corresponding to a target area, wherein the target accent data comprises target area voice and corresponding target area voice phoneme labels;
acquiring target area characteristics corresponding to the target area and extracting target area voice acoustic characteristics corresponding to the target area voice;
inputting the target region voice acoustic features and the target region features into a target accent recognition acoustic model, converting the target region features by the target accent recognition acoustic model to obtain target region conversion features, extracting voice features based on the target region acoustic features to obtain target region voice features, combining the target region conversion features and the target region voice features to obtain target region combination features, and performing voice phoneme recognition based on the target region combination features to obtain target region voice phoneme information;
calculating target region voice loss information based on the target region voice phoneme information and the corresponding target region voice phoneme label, updating a phoneme recognition network corresponding to the target region in the target accent recognition acoustic model based on the target region voice loss information, and returning to the step of inputting the target region voice acoustic feature and the target region feature into the target accent recognition acoustic model for iterative execution until the target training is completed to obtain an optimized accent recognition acoustic model.
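To make the target-region optimization step above concrete, the following is a minimal PyTorch sketch of fine-tuning only one region's phoneme recognition branch while keeping the rest of the model fixed. It is a sketch under stated assumptions, not the patent's implementation: the model is assumed to expose its per-region branches through a dict-like `region_heads` attribute, the region name "sichuan" is a made-up example, and frame-level cross-entropy stands in for the actual training criterion.

```python
# Hedged sketch: fine-tune only the target region's phoneme recognition branch.
# Assumes model.region_heads is a dict-like container of per-region output networks.
import torch

def finetune_target_region(model, target_loader, region="sichuan", lr=1e-4, epochs=3):
    for p in model.parameters():
        p.requires_grad = False                          # freeze the shared networks
    for p in model.region_heads[region].parameters():
        p.requires_grad = True                           # only the target-region branch trains

    optimizer = torch.optim.Adam(model.region_heads[region].parameters(), lr=lr)
    criterion = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for feats, region_feats, labels in target_loader:
            logits = model(feats, region_feats)          # (batch, frames, num_phonemes)
            loss = criterion(logits.flatten(0, 1), labels.flatten())
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```

Because gradients flow only into the selected branch, the shared feature extraction layers and the branches of the other regions remain unchanged, which mirrors the selective update described above.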
An acoustic model training apparatus for accent recognition, the apparatus comprising:
the data acquisition module is used for acquiring training data, and the training data comprises training voice, accent region characteristics corresponding to the training voice and phoneme labels;
the feature extraction module is used for extracting acoustic features corresponding to the training voice;
the model training module is used for inputting the acoustic features and the accent region features into an initial accent recognition acoustic model, the initial accent recognition acoustic model transforms the accent region features to obtain initial transformation features, performs voice feature extraction on the acoustic features to obtain initial voice features, combines the initial transformation features and the initial voice features to obtain initial combination features, and performs voice phoneme recognition on the initial combination features to obtain initial voice phoneme information;
and the loop iteration module is used for calculating loss information based on the initial voice phoneme information and the corresponding phoneme label, updating the initial accent recognition acoustic model based on the loss information, and returning to the step of inputting the acoustic characteristics and the accent region characteristics into the initial accent recognition acoustic model for iteration execution until training is finished to obtain the target accent recognition acoustic model.
A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:
acquiring training data, wherein the training data comprises training voice, accent region characteristics corresponding to the training voice and phoneme labels;
extracting acoustic features corresponding to training voices;
inputting the acoustic features and the accent region features into an initial accent recognition acoustic model, transforming the accent region features by the initial accent recognition acoustic model to obtain initial transformation features, extracting voice features of the acoustic features to obtain initial voice features, combining the initial transformation features and the initial voice features to obtain initial combination features, and performing voice phoneme recognition on the initial combination features to obtain initial voice phoneme information;
and calculating loss information based on the initial voice phoneme information and the corresponding phoneme label, updating the initial accent recognition acoustic model based on the loss information, and returning to the step of inputting the acoustic features and the accent region features into the initial accent recognition acoustic model for iterative execution until training is completed to obtain the target accent recognition acoustic model.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:
acquiring training data, wherein the training data comprises training voice, accent region characteristics corresponding to the training voice and phoneme labels;
extracting acoustic features corresponding to training voices;
inputting the acoustic features and the accent region features into an initial accent recognition acoustic model, transforming the accent region features by the initial accent recognition acoustic model to obtain initial transformation features, extracting voice features of the acoustic features to obtain initial voice features, combining the initial transformation features and the initial voice features to obtain initial combination features, and performing voice phoneme recognition on the initial combination features to obtain initial voice phoneme information;
and calculating loss information based on the initial voice phoneme information and the corresponding phoneme label, updating the initial accent recognition acoustic model based on the loss information, and returning to the step of inputting the acoustic features and the accent region features into the initial accent recognition acoustic model for iterative execution until training is completed to obtain the target accent recognition acoustic model.
According to the method, the device, the computer equipment and the storage medium for training the acoustic model for accent recognition, acoustic features corresponding to training voice are extracted, the acoustic features and accent region features are input into the acoustic model for initial accent recognition, the acoustic model for initial accent recognition transforms the accent region features to obtain initial transformation features, voice feature extraction is carried out on the acoustic features to obtain initial voice features, the initial transformation features and the initial voice features are combined to obtain initial combination features, and voice phoneme recognition is carried out on the initial combination features to obtain initial voice phoneme information; and calculating loss information based on the initial voice phoneme information and the corresponding phoneme label, updating the initial accent recognition acoustic model based on the loss information, and returning to the step of inputting the acoustic features and the accent region features into the initial accent recognition acoustic model for iterative execution until training is completed to obtain the target accent recognition acoustic model. The target accent recognition acoustic model is obtained by training through the combined action of the accent region characteristics and the acoustic characteristics, so that the target accent recognition acoustic model can learn richer information, the recognition accuracy is improved when the accent voice recognition is carried out on the target accent recognition acoustic model, and the accuracy of the accent voice recognition result is improved.
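As a concrete illustration of the training procedure summarized above, the following is a minimal PyTorch sketch of the outer training loop. It is a sketch under stated assumptions rather than the patent's implementation: `AccentAcousticModel` is a hypothetical module that takes acoustic features and one-hot accent region features and returns frame-level phoneme posteriors, and frame-level cross-entropy is used in place of the sequence-level criterion mentioned later in the description.

```python
# Hedged sketch of the training loop: extract features, forward through the
# accent recognition acoustic model, compute loss against phoneme labels, update.
import torch
import torch.nn as nn

def train_accent_acoustic_model(model, data_loader, num_epochs=10, lr=1e-3, device="cpu"):
    model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()        # stands in for the patent's training criterion

    for epoch in range(num_epochs):
        for acoustic_feats, region_feats, phoneme_labels in data_loader:
            # acoustic_feats: (batch, frames, feat_dim)  features of the training voices
            # region_feats:   (batch, num_regions)       one-hot accent region features
            # phoneme_labels: (batch, frames)            frame-level phoneme label indices
            acoustic_feats = acoustic_feats.to(device)
            region_feats = region_feats.to(device)
            phoneme_labels = phoneme_labels.to(device)

            # the model internally transforms the region features, extracts voice
            # features, merges them and performs voice phoneme recognition
            phoneme_logits = model(acoustic_feats, region_feats)   # (batch, frames, num_phonemes)

            loss = criterion(phoneme_logits.flatten(0, 1), phoneme_labels.flatten())
            optimizer.zero_grad()
            loss.backward()        # update the initial model based on the loss information
            optimizer.step()
    return model                   # the trained (target) accent recognition acoustic model
```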
A method of accent recognition, the method comprising:
acquiring accent voice to be recognized and corresponding information of a region to be recognized;
extracting acoustic features to be recognized corresponding to accent voice to be recognized and acquiring the features of the area to be recognized corresponding to the area information to be recognized;
inputting the acoustic feature to be recognized and the feature of the area to be recognized into a target accent recognition acoustic model, converting the feature of the area to be recognized by the target accent recognition acoustic model to obtain a conversion feature to be recognized, extracting the voice feature of the acoustic feature to be recognized to obtain a voice feature to be recognized, combining the conversion feature to be recognized and the voice feature to be recognized to obtain a combined feature to be recognized, and performing voice phoneme recognition on the combined feature to be recognized to obtain voice phoneme information corresponding to the accent voice to be recognized;
and carrying out text recognition based on the accent phoneme information to obtain a target text corresponding to the accent voice to be recognized.
An accent recognition apparatus, the apparatus comprising:
the voice to be recognized acquisition module is used for acquiring accent voice to be recognized and corresponding information of an area to be recognized;
the to-be-recognized feature extraction module is used for extracting to-be-recognized acoustic features corresponding to the accent voice to be recognized and acquiring to-be-recognized region features corresponding to the to-be-recognized region information;
the model recognition module is used for inputting the acoustic features to be recognized and the regional features to be recognized into a target accent recognition acoustic model, the target accent recognition acoustic model transforms the regional features to be recognized to obtain transformed features to be recognized, the acoustic features to be recognized are subjected to voice feature extraction to obtain voice features to be recognized, the transformed features to be recognized and the voice features to be recognized are combined to obtain combined features to be recognized, and voice phoneme recognition is performed on the combined features to be recognized to obtain voice phoneme information corresponding to accent voice to be recognized;
and the text obtaining module is used for carrying out text recognition based on the accent phoneme information to obtain a target text corresponding to the accent voice to be recognized.
A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:
acquiring accent voice to be recognized and corresponding information of a region to be recognized;
extracting acoustic features to be recognized corresponding to accent voice to be recognized and acquiring the features of the area to be recognized corresponding to the area information to be recognized;
inputting the acoustic feature to be recognized and the feature of the area to be recognized into a target accent recognition acoustic model, converting the feature of the area to be recognized by the target accent recognition acoustic model to obtain a conversion feature to be recognized, extracting the voice feature of the acoustic feature to be recognized to obtain a voice feature to be recognized, combining the conversion feature to be recognized and the voice feature to be recognized to obtain a combined feature to be recognized, and performing voice phoneme recognition on the combined feature to be recognized to obtain voice phoneme information corresponding to the accent voice to be recognized;
and carrying out text recognition based on the accent phoneme information to obtain a target text corresponding to the accent voice to be recognized.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:
acquiring accent voice to be recognized and corresponding information of a region to be recognized;
extracting acoustic features to be recognized corresponding to accent voice to be recognized and acquiring the features of the area to be recognized corresponding to the area information to be recognized;
inputting the acoustic feature to be recognized and the feature of the area to be recognized into a target accent recognition acoustic model, converting the feature of the area to be recognized by the target accent recognition acoustic model to obtain a conversion feature to be recognized, extracting the voice feature of the acoustic feature to be recognized to obtain a voice feature to be recognized, combining the conversion feature to be recognized and the voice feature to be recognized to obtain a combined feature to be recognized, and performing voice phoneme recognition on the combined feature to be recognized to obtain voice phoneme information corresponding to the accent voice to be recognized;
and carrying out text recognition based on the accent phoneme information to obtain a target text corresponding to the accent voice to be recognized.
According to the accent recognition method, the accent recognition apparatus, the computer equipment and the storage medium, the acoustic features to be recognized corresponding to the accent voice to be recognized are extracted, the regional features to be recognized corresponding to the regional information to be recognized are obtained, the acoustic features to be recognized and the regional features to be recognized are input into the target accent recognition acoustic model, the regional features to be recognized are transformed by the target accent recognition acoustic model to obtain the transformed features to be recognized, voice feature extraction is performed on the acoustic features to be recognized to obtain the voice features to be recognized, the transformed features to be recognized and the voice features to be recognized are combined to obtain the combined features to be recognized, and voice phoneme recognition is performed on the combined features to be recognized to obtain the voice phoneme information corresponding to the accent voice to be recognized. Text recognition is then carried out based on the phoneme information to obtain the target text corresponding to the accent voice to be recognized. Because the target accent recognition acoustic model uses both the features of the region to be recognized and the acoustic features to be recognized when recognizing the voice phoneme information corresponding to the accent voice to be recognized, the recognized voice phoneme information is more accurate, and the accuracy of the target text obtained through recognition is improved.
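The recognition flow just described can likewise be sketched briefly. The sketch below assumes a trained model with the same hypothetical interface as in the training sketch above (acoustic features plus a one-hot region feature in, frame-level phoneme posteriors out); the decoding of phonemes into text with a language model and dictionary is only indicated by a comment.

```python
# Hedged sketch of accent recognition inference for a single utterance.
import torch

@torch.no_grad()
def recognize_accented_utterance(model, acoustic_feats, region_onehot):
    """acoustic_feats: (frames, feat_dim); region_onehot: (num_regions,)."""
    model.eval()
    logits = model(acoustic_feats.unsqueeze(0), region_onehot.unsqueeze(0))
    phoneme_ids = logits.argmax(dim=-1).squeeze(0)    # most likely phoneme per frame
    # In a full system the phoneme information would be passed to a decoder
    # (language model + dictionary) to obtain the target text.
    return phoneme_ids
```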
Drawings
FIG. 1 is a diagram illustrating an exemplary embodiment of a method for training an acoustic model for accent recognition;
FIG. 2 is a schematic flow chart illustrating a method for training an acoustic model for accent recognition according to an embodiment;
FIG. 3 is a flow diagram illustrating a process for obtaining initial phonetic phoneme information in one embodiment;
FIG. 4 is a flowchart illustrating the process of obtaining initial phonetic phoneme information in another embodiment;
FIG. 5 is a flow diagram illustrating the process of obtaining initial phonetic phone features in one embodiment;
FIG. 6 is a block diagram of an initial accent recognition acoustic model in an embodiment;
FIG. 7 is a schematic flow chart illustrating obtaining a target accent recognition acoustic model in one embodiment;
FIG. 8 is a schematic flow diagram illustrating the establishment of an initial accent recognition acoustic model, according to one embodiment;
FIG. 9 is a flow diagram that illustrates obtaining basic phonetic phoneme information, in one embodiment;
FIG. 10 is a schematic flow chart illustrating the process of obtaining an initial accent recognition acoustic model in one embodiment;
FIG. 11 is a flowchart illustrating the process of obtaining basic phonetic phone features in one embodiment;
FIG. 12 is a block diagram of an acoustic model for basic accent recognition in one embodiment;
FIG. 13 is a schematic flow chart illustrating the process of creating a basic accent recognition acoustic model in one embodiment;
FIG. 14 is a block diagram of an initial accent recognition acoustic model in one embodiment;
FIG. 15 is a schematic flow chart illustrating a method for obtaining an optimized accent recognition acoustic model in one embodiment;
FIG. 16 is a flowchart illustrating an accent recognition method according to one embodiment;
FIG. 17 is a schematic flow chart diagram illustrating a method for training an acoustic model for accent recognition in an exemplary embodiment;
FIG. 18 is a diagram illustrating the correspondence between regional features and regions in an exemplary embodiment;
FIG. 19 is a diagram illustrating an exemplary implementation of accent recognition;
FIG. 20 is a block diagram showing the structure of an acoustic model training apparatus for accent recognition according to an embodiment;
FIG. 21 is a block diagram showing the structure of an accent recognition apparatus according to an embodiment;
FIG. 22 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The key technologies of speech technology are automatic speech recognition (ASR), text-to-speech synthesis (TTS) and voiceprint recognition. Enabling computers to listen, see, speak and feel is the development direction of future human-computer interaction, and voice is expected to become one of the most promising human-computer interaction modes in the future.
The scheme provided by the embodiment of the application relates to an artificial intelligence voice recognition technology, and is specifically explained by the following embodiment:
the accent recognition acoustic model training method provided by the application can be applied to the application environment shown in fig. 1. Wherein the terminal 102 communicates with the server 104 via a network. The method comprises the steps that a terminal 102 sends a model training instruction to a server, the server 104 receives the model training instruction sent by the terminal and obtains training data from a database, and the training data comprise training voice, accent region characteristics corresponding to the training voice and phoneme labels; the server 104 extracts acoustic features corresponding to the training speech; the server 104 inputs the acoustic features and the accent region features into an initial accent recognition acoustic model, the initial accent recognition acoustic model transforms the accent region features to obtain initial transformation features, performs voice feature extraction on the acoustic features to obtain initial voice features, combines the initial transformation features and the initial voice features to obtain initial combination features, and performs voice phoneme recognition on the initial combination features to obtain initial voice phoneme information; the server 104 calculates loss information based on the initial speech phoneme information and the corresponding phoneme label, updates the initial accent recognition acoustic model based on the loss information, and returns to the step of inputting the acoustic features and the accent region features into the initial accent recognition acoustic model for iterative execution until training is completed, so as to obtain the target accent recognition acoustic model. A reminder that the model training is complete may then be returned to the terminal 102. The terminal 102 may be, but not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices, and the server 104 may be implemented by an independent server or a server cluster formed by a plurality of servers.
In one embodiment, as shown in fig. 2, a method for training an acoustic model for accent recognition is provided, which is described by taking the method as an example applied to the server in fig. 1, and includes the following steps:
step 202, training data is obtained, wherein the training data comprises training voice, accent region characteristics corresponding to the training voice and phoneme labels.
Wherein the training data refers to the data used in training the initial accent recognition acoustic model. Training speech refers to the speech used in training the initial accent recognition acoustic model, and may include speech with different accents. Each training voice has corresponding accent region features and phoneme labels. The accent region features characterize the accent region corresponding to the training voice, and voices in different regions have different accents. Different regions correspond to different accent region features, and the accent region features can be preset or obtained by extracting features from the regions. The phoneme label refers to the label of the phonemes corresponding to the training speech. Phonemes are the smallest phonetic units divided according to the natural attributes of speech; they are analyzed according to the pronunciation actions within a syllable, and one action constitutes one phoneme.
Specifically, the server may directly obtain training data from the database, where the training data includes training voices, accent region features corresponding to the training voices, and phoneme labels. The server may also obtain training data from a service provider providing data services, or the server may collect training data from the internet. In one embodiment, the server obtains training voice and an accent region corresponding to the training voice, and then obtains accent region characteristics corresponding to the region to obtain training data.
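Purely for illustration, a training sample and its one-hot accent region feature might be organized as follows; the region names and fields are assumptions, not part of the patent.

```python
# Hedged sketch of one training-data item with a 1-hot accent region feature.
from dataclasses import dataclass
from typing import List
import numpy as np

REGIONS = ["beijing", "guangdong", "sichuan", "shanghai"]   # made-up region inventory

def region_onehot(region: str) -> np.ndarray:
    vec = np.zeros(len(REGIONS), dtype=np.float32)
    vec[REGIONS.index(region)] = 1.0                         # preset accent region feature
    return vec

@dataclass
class TrainingSample:
    waveform: np.ndarray         # training speech samples
    sample_rate: int
    region: str                  # accent region of the speaker, e.g. "sichuan"
    phoneme_labels: List[str]    # phoneme label sequence for the utterance
```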
And step 204, extracting acoustic features corresponding to the training voice.
The acoustic features refer to features representing the acoustic characteristics of speech, and include at least one of a fundamental frequency feature, a formant feature, an MFCC (Mel-Frequency Cepstral Coefficient) feature, a PNCC (Power-Normalized Cepstral Coefficient) feature, and an i-vector (voiceprint) feature; preferably, the PNCC feature and the i-vector feature can be extracted and used as the acoustic features corresponding to the training speech.
Specifically, the server may acquire a training speech corresponding to any accent region feature from the training data, and then may extract the corresponding acoustic features from the training speech by using signal processing techniques. For example, the PNCC feature may be extracted by using the PNCC algorithm, and the i-vector feature may be extracted by using an i-vector extraction algorithm. The fundamental frequency feature can be obtained by detecting the pitch period in the voice, and the formant feature can be extracted using a cepstrum method. MFCC features can be extracted using an MFCC extraction algorithm.
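As one concrete possibility for this feature extraction step, MFCC features can be computed with an off-the-shelf library such as librosa, as sketched below; PNCC and i-vector extraction usually rely on dedicated speech toolkits and are not shown. The sampling rate and number of coefficients are illustrative assumptions.

```python
# Hedged sketch: extract frame-level MFCC features from a training utterance.
import librosa
import numpy as np

def extract_mfcc(path: str, n_mfcc: int = 40) -> np.ndarray:
    waveform, sr = librosa.load(path, sr=16000)              # load and resample the speech
    mfcc = librosa.feature.mfcc(y=waveform, sr=sr, n_mfcc=n_mfcc)
    return mfcc.T                                            # (frames, n_mfcc)
```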
Step 206, inputting the acoustic features and the accent region features into an initial accent recognition acoustic model, transforming the accent region features by the initial accent recognition acoustic model to obtain initial transformation features, performing voice feature extraction on the acoustic features to obtain initial voice features, combining the initial transformation features and the initial voice features to obtain initial combination features, and performing voice phoneme recognition on the initial combination features to obtain initial voice phoneme information.
The initial accent recognition acoustic model refers to an untrained multitask accent recognition acoustic model, that is, parameters in the initial accent recognition acoustic model may be obtained by initialization, for example, all-zero initialization, random initialization, or the like. The multitask is an accent recognition task corresponding to a plurality of region features. The initial transformation characteristic is a characteristic obtained by performing initial transformation on the characteristic of the accent area, and the transformation may be linear transformation or nonlinear transformation. The initial voice feature is obtained by further performing voice feature extraction on the acoustic feature, so that the obtained initial voice feature has higher robustness. The initial merging feature is a feature obtained by merging the initial transformation feature and the initial voice feature. The initial speech phoneme information refers to initial speech phoneme information obtained by the recognition of an initial accent recognition acoustic model, and the speech phoneme information may be phonemes corresponding to training speech or probability distribution of states corresponding to the phonemes.
Specifically, the server establishes an initial accent recognition acoustic model in advance. During training, it inputs the acoustic features and the accent region features into the initial accent recognition acoustic model. The initial accent recognition acoustic model transforms the accent region features using initial transformation parameters to obtain initial transformation features, performs voice feature extraction on the acoustic features using initial voice feature extraction parameters to obtain initial voice features, and merges the initial transformation features and the initial voice features to obtain initial merged features. The merging may be performed by directly splicing the initial transformation features and the initial voice features, or by performing vector operations on them, where the vector operations may include vector sum operations, vector (element-wise) product operations, scalar (dot) product operations, and the like. Voice phoneme recognition is then performed on the initial merged features using the voice phoneme recognition parameters corresponding to the region features, to obtain the initial voice phoneme information.
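The two merging options mentioned above (direct splicing versus a vector operation such as a sum) can be written as follows; the tensor shapes are assumptions chosen so that the utterance-level transformation feature is broadcast across frames.

```python
# Hedged sketch of merging the transformed region feature with the voice features.
import torch

def merge_features(transform_feat: torch.Tensor, speech_feat: torch.Tensor,
                   mode: str = "sum") -> torch.Tensor:
    # transform_feat: (batch, dim)          one vector per utterance
    # speech_feat:    (batch, frames, dim)  one vector per frame
    expanded = transform_feat.unsqueeze(1).expand(-1, speech_feat.size(1), -1)
    if mode == "sum":                       # vector-sum merging
        return speech_feat + expanded
    if mode == "concat":                    # direct splicing
        return torch.cat([speech_feat, expanded], dim=-1)
    raise ValueError(f"unknown merge mode: {mode}")
```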
Step 208, loss information is calculated based on the initial speech phoneme information and the corresponding phoneme label, and the initial accent recognition acoustic model is updated based on the loss information.
Wherein the loss information is used to represent an error between the initial speech phoneme information and the corresponding phoneme label.
Specifically, the server calculates an error between the initial speech phoneme information and the corresponding phoneme label by using a loss function to obtain loss information, and then reversely updates parameters in the initial accent recognition acoustic model based on a gradient descent algorithm by using the loss information.
And step 210, judging whether the training is finished, and returning to the step of inputting the acoustic characteristics and the accent region characteristics into the initial accent recognition acoustic model for iterative execution when the training is not finished. When training is complete, step 212 is performed.
And step 212, obtaining a target accent recognition acoustic model.
And judging whether the training is finished or not, namely judging whether the training reaches a training finishing condition or not, namely judging whether the model is converged or not, wherein the training finishing condition comprises at least one of the condition that the training iteration number reaches the maximum iteration number, the loss information obtained by the training is smaller than a preset loss threshold value and the model parameter obtained by the training does not change any more. The target accent recognition acoustic model is a trained accent recognition acoustic model and is used for recognizing accent voices in different areas.
Specifically, the server judges whether the training is finished or not, and when the training is not finished, the step of inputting the acoustic features and the accent region features into the initial accent recognition acoustic model is returned for iterative execution. And when the training is finished, obtaining a target accent recognition acoustic model.
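The training completion judgement can be expressed as a small helper like the one below; the concrete thresholds are assumptions for illustration only.

```python
# Hedged sketch of the training completion conditions described above.
def training_finished(iteration, loss, params, prev_params,
                      max_iters=100_000, loss_threshold=1e-3, param_eps=1e-8):
    if iteration >= max_iters:                      # maximum number of iterations reached
        return True
    if loss < loss_threshold:                       # loss below the preset threshold
        return True
    if prev_params is not None and all(
            (p - q).abs().max() < param_eps for p, q in zip(params, prev_params)):
        return True                                 # model parameters no longer change
    return False
```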
In the method for training the acoustic model for accent recognition, acoustic features and accent region features are input into an initial accent recognition acoustic model by extracting the acoustic features corresponding to training voices, the initial accent recognition acoustic model transforms the accent region features to obtain initial transformation features, voice feature extraction is carried out on the acoustic features to obtain initial voice features, the initial transformation features and the initial voice features are combined to obtain initial combination features, and voice phoneme recognition is carried out on the initial combination features to obtain initial voice phoneme information; and calculating loss information based on the initial voice phoneme information and the corresponding phoneme label, updating the initial accent recognition acoustic model based on the loss information, and returning to the step of inputting the acoustic features and the accent region features into the initial accent recognition acoustic model for iterative execution until training is completed to obtain the target accent recognition acoustic model. The target accent recognition acoustic model is obtained by training through the combined action of the accent region characteristics and the acoustic characteristics, so that the target accent recognition acoustic model can learn richer information, the recognition accuracy is improved when the accent voice recognition is carried out on the target accent recognition acoustic model, and the accuracy of the accent voice recognition result is improved.
In one embodiment, the initial accent recognition acoustic model comprises: the method comprises the following steps of (1) an initial conversion network, an initial feature extraction network and an initial phoneme recognition network;
as shown in fig. 3, step 206, inputting the acoustic features and the features of the accent regions into the initial accent recognition acoustic model, includes:
step 302, inputting the characteristics of the accent area into an initial conversion network for conversion, so as to obtain initial conversion characteristics.
Wherein the initial conversion network is a network for performing feature conversion. The network parameters in the initial transition network may be obtained through initialization or through pre-training.
Specifically, the server inputs the characteristics of the accent area into the initial conversion network for conversion, so as to obtain the initial conversion features. In one embodiment, the initial conversion network may be an initial linear conversion network, which performs a linear conversion, or an initial nonlinear conversion network, which performs a nonlinear conversion.
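A conversion network of this kind can be as simple as an affine layer over the one-hot region feature, optionally followed by a non-linearity; the sketch below is illustrative and the dimensions are assumptions.

```python
# Hedged sketch of an initial conversion network for the accent region feature.
import torch.nn as nn

class RegionConversionNet(nn.Module):
    def __init__(self, num_regions: int, out_dim: int, nonlinear: bool = False):
        super().__init__()
        self.affine = nn.Linear(num_regions, out_dim)            # linear conversion
        self.act = nn.ReLU() if nonlinear else nn.Identity()     # optional nonlinear conversion

    def forward(self, region_onehot):
        return self.act(self.affine(region_onehot))              # initial conversion feature
```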
And step 304, inputting the acoustic features into an initial feature extraction network for voice feature extraction to obtain initial voice features.
Specifically, the initial feature extraction network is a network for performing speech feature extraction, and may be, for example, a CNN (Convolutional Neural Network) or an RNN (Recurrent Neural Network). The network parameters in the initial feature extraction network may be obtained through initialization or through pre-training. The server inputs the acoustic features into the initial feature extraction network to extract the voice features, so as to obtain the initial voice features. The initial feature extraction network can extract features which are more representative of the voice, thereby improving robustness.
And step 306, combining the initial conversion feature and the initial voice feature to obtain an initial combination feature.
Step 308, inputting the initial merging features into an initial phoneme recognition network for speech phoneme recognition, so as to obtain initial speech phoneme information.
The initial phoneme recognition network refers to a network for performing speech phoneme recognition, and the initial phoneme recognition network may be a multitasking network, and each region has a corresponding regional speech phoneme recognition network.
Specifically, the server calculates the sum of the vectors corresponding to the initial conversion features and the vectors corresponding to the initial speech features to obtain initial combination features, and then inputs the initial combination features into a phoneme recognition network corresponding to the region features in the initial phoneme recognition network for recognition to obtain output initial speech phoneme information.
In the above embodiment, the initial conversion features are obtained by using the initial conversion network, the initial feature extraction network is used to extract the initial speech features, then the initial conversion features and the initial speech features are combined, and the initial phoneme recognition network is used to perform the speech phoneme recognition on the initial combination features to obtain the initial speech phoneme information.
In one embodiment, step 304, inputting the acoustic features into an initial feature extraction network for speech feature extraction, to obtain initial speech features, includes:
and the acoustic features are deformed through frequency spectrum enhancement to obtain acoustic enhancement features, and the acoustic enhancement features are input into an initial feature extraction network to extract voice features to obtain initial voice features.
The spectrum enhancement is a masking operation in the time-frequency domain performed by SpecAugment (A Simple Data Augmentation Method for Automatic Speech Recognition). The acoustic enhancement features refer to the features obtained after spectral enhancement.
Specifically, the server deforms the acoustic features through spectrum enhancement to obtain the acoustic enhancement features, where the acoustic features can be deformed through time warping, frequency masking and time masking operations to obtain the acoustic enhancement features, and the acoustic enhancement features are input into the initial feature extraction network for voice feature extraction to obtain the initial voice features. By also using the acoustic enhancement features when training the accent recognition acoustic model, the robustness of the trained model can be improved.
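A simple version of the masking part of SpecAugment is sketched below (time warping is omitted); the mask widths are assumptions and the operation is applied to a (frames x feature) matrix.

```python
# Hedged sketch of SpecAugment-style frequency and time masking.
import torch

def spec_augment(feats: torch.Tensor, max_freq_mask: int = 8,
                 max_time_mask: int = 20, num_masks: int = 2) -> torch.Tensor:
    feats = feats.clone()                    # feats: (frames, feat_dim)
    n_frames, n_feats = feats.shape
    for _ in range(num_masks):
        f = int(torch.randint(0, max_freq_mask + 1, (1,)))
        f0 = int(torch.randint(0, max(1, n_feats - f), (1,)))
        feats[:, f0:f0 + f] = 0.0            # frequency (spectrum) masking
        t = int(torch.randint(0, max_time_mask + 1, (1,)))
        t0 = int(torch.randint(0, max(1, n_frames - t), (1,)))
        feats[t0:t0 + t, :] = 0.0            # time masking
    return feats                             # acoustic enhancement features
```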
In one embodiment, the initial phoneme recognition network comprises an initial speech phoneme feature extraction network, an initial target conversion network and an initial accent phoneme recognition network corresponding to at least two different accent region features;
as shown in fig. 4, step 308, inputting the initial merging features into the initial phoneme recognition network for speech phoneme recognition to obtain initial speech phoneme information, includes:
step 402, inputting the initial merging feature into an initial speech phoneme feature extraction network to perform speech phoneme feature extraction, so as to obtain an initial speech phoneme feature.
Wherein the initial speech phoneme feature extraction network is a network for extracting accent speech phoneme features, and may be a deep neural network. The initial speech phoneme features refer to the extracted features of the accent speech phonemes.
Specifically, the server inputs the initial merging features into an initial speech phoneme feature extraction network for calculation, and the output initial speech phoneme features are obtained.
Step 404, inputting the characteristics of the accent area into the initial target conversion network for conversion, so as to obtain the initial target conversion characteristics.
The initial target conversion network is a network for converting the characteristics of the accent area, and the network parameters of the trained target conversion network and the trained conversion network are different. The conversion may be linear or nonlinear. The initial target conversion feature refers to a feature obtained by conversion through an initial target conversion network.
Specifically, the server may input the accent region features into an initial target linear transformation network for linear transformation to obtain initial target linear transformation features, or input the accent region features into an initial target nonlinear transformation network for nonlinear transformation to obtain initial target nonlinear transformation features.
Step 405, merging the initial speech phoneme features and the initial target conversion features to obtain target merging features, inputting the target merging features into an initial accent phoneme recognition network corresponding to the accent region features for phoneme recognition to obtain initial speech phoneme information.
The target merging feature is a feature obtained by merging the initial speech phoneme feature and the feature output by the target conversion network.
Specifically, the initial phoneme recognition network includes initial accent phoneme recognition networks corresponding to at least two different accent region features. That is, there are a plurality of initial accent phoneme recognition networks, and different initial accent phoneme recognition networks are used for recognizing the accent phonemes of the corresponding regions. For example, spoken English phoneme recognition networks may correspond to different country regions, such as the China region, the United States region, the United Kingdom region, the Japan region, and the like. When the regional feature corresponding to the training speech is the China regional feature, the server inputs the target merged feature into the initial Chinese-accented English phoneme recognition network corresponding to the China regional feature for phoneme recognition, so as to obtain initial Chinese-accented English phoneme information. As another example, Chinese has different dialects in different regions, and dialect speech is input into the accent phoneme recognition network of the corresponding dialect region for phoneme recognition to obtain the initial speech phoneme information. Then, when the parameters are updated with the loss information calculated from the initial voice phoneme information, only the network parameters of the initial accent phoneme recognition network corresponding to the regional feature are updated, while the network parameters of the initial accent phoneme recognition networks corresponding to the other regional features remain unchanged.
In the above embodiment, during training, the initial speech phoneme features and the initial target conversion features are combined to obtain target combined features, and the target combined features are input into the initial accent phoneme recognition network corresponding to the accent region features to perform phoneme recognition, so as to obtain initial speech phoneme information.
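One way to realize per-region accent phoneme recognition branches is a dictionary of output layers keyed by region, as sketched below; only the branch selected for the utterance's region takes part in the forward pass, so back-propagation leaves the other branches untouched. The structure and layer type are assumptions for illustration, not the patent's exact network.

```python
# Hedged sketch of region-specific accent phoneme recognition branches.
import torch.nn as nn

class MultiRegionPhonemeHeads(nn.Module):
    def __init__(self, regions, in_dim: int, num_phonemes: int):
        super().__init__()
        self.heads = nn.ModuleDict(
            {r: nn.Linear(in_dim, num_phonemes) for r in regions})

    def forward(self, merged_feat, region: str):
        # merged_feat: (batch, frames, in_dim) target merged feature
        return self.heads[region](merged_feat)   # phoneme logits for that region only
```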
In one embodiment, the initial speech phoneme feature extraction network comprises at least one initial time-delay neural network, at least one initial gating cycle network and at least one initial intermediate conversion network, wherein the initial time-delay neural network and the initial gating cycle network are arranged in an alternating network structure;
as shown in fig. 5, step 402, inputting the initial merging feature into an initial speech phoneme feature extraction network to perform speech phoneme feature extraction, so as to obtain an initial speech phoneme feature, includes:
and 502, inputting the initial merging characteristics into an initial delay neural network for calculation to obtain initial delay characteristics.
The time-delay neural network is a TDNN (Time Delay Neural Network), a convolutional neural network applied to the speech recognition problem; a voice signal preprocessed by FFT is used as its input, and its hidden layer is composed of 2 one-dimensional convolution kernels so as to extract translation-invariant features in the frequency domain. The initial time-delay neural network refers to a parameter-initialized time-delay neural network in the initial accent recognition acoustic model. The initial delay features refer to the translation-invariant features in the frequency domain extracted using the initial time-delay neural network. The alternating network structure means that each initial time-delay neural network is connected to the next initial gated-cycle network, and each initial gated-cycle network is connected to the next initial time-delay neural network, i.e. the initial time-delay neural networks and the initial gated-cycle networks are connected alternately.
Specifically, the server inputs the initial merging characteristic into an initial delay neural network, extracts the translation invariant characteristic on the frequency domain, and obtains the initial delay characteristic. In an embodiment, a plurality of initial delay neural networks may be used to extract initial merging features, when the output of each initial delay neural network is to be input into the next initial delay neural network, the initial merging features need to be merged with the features obtained by conversion by the conversion network to obtain merged features, and then the merged features are input into the next initial delay neural network for calculation, so that the calculated initial delay features contain regional information, and the accuracy of subsequent phoneme recognition can be improved.
Step 504, inputting the characteristics of the accent area into an initial intermediate conversion network for conversion, so as to obtain initial intermediate conversion characteristics.
The intermediate conversion network is a conversion network between the initial delay neural network and the initial gating circulation network, and is used for converting the characteristics of the vocal tract, and can be linear conversion or nonlinear conversion. The initial intermediate conversion feature is a feature obtained by converting the accent region feature using an initial intermediate conversion network. That is, the initial delay characteristic needs to be combined with the initial intermediate conversion characteristic before being input into the initial gated-loop network for calculation.
And 506, combining the initial time delay characteristic and the initial intermediate conversion characteristic to obtain an initial intermediate combination characteristic, and inputting the initial intermediate combination characteristic into an initial gating circulating network for calculation to obtain an initial voice phoneme characteristic.
The initial intermediate merging feature refers to the feature obtained by merging the initial delay feature and the initial intermediate conversion feature. The gated recurrent network refers to a GRU (Gated Recurrent Unit), which does not use a cell state but uses the hidden state to transmit information. It has only two gates, a reset gate and an update gate, and can preserve information in long sequences without the information being cleared over time or removed because it is irrelevant to the prediction, thereby avoiding the vanishing gradient problem. The gated recurrent network is a variant of the standard recurrent neural network. The initial gated recurrent network is used for extracting the initial speech phoneme features.
Specifically, the server merges the initial time delay feature and the initial intermediate conversion feature to obtain an initial intermediate merging feature, and inputs the initial intermediate merging feature into an initial gating cycle network for calculation to obtain an initial speech phoneme feature.
In the above embodiment, the initial delay feature is obtained by using the time-delay neural network, the initial intermediate conversion feature is obtained by using the initial intermediate conversion network, the initial delay feature and the initial intermediate conversion feature are combined to obtain the initial intermediate combination feature, and the initial intermediate combination feature is input into the initial gated recurrent network for calculation to obtain the initial speech phoneme feature. That is, the speech time-sequence signal is modeled by a network formed by alternating time-delay neural networks and gated recurrent networks, which improves the efficiency of model training; and because the combination features obtained through the conversion network are used in the calculation, the accuracy of the trained model is also improved.
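One such alternating TDNN/GRU unit with the intermediate region-feature merge can be sketched as below, using a 1-D convolution over a (-1, 0, 1) frame context as the TDNN layer and a standard GRU; both choices are stand-ins for illustration, not the patent's exact layers.

```python
# Hedged sketch of one alternating time-delay / gated-recurrent unit with the
# intermediate conversion of the accent region feature merged in between.
import torch.nn as nn

class TdnnGruBlock(nn.Module):
    def __init__(self, in_dim: int, hidden_dim: int, num_regions: int):
        super().__init__()
        self.tdnn = nn.Conv1d(in_dim, hidden_dim, kernel_size=3, padding=1)  # (-1, 0, 1) context
        self.region_affine = nn.Linear(num_regions, hidden_dim)              # intermediate conversion
        self.gru = nn.GRU(hidden_dim, hidden_dim, batch_first=True)

    def forward(self, feats, region_onehot):
        # feats: (batch, frames, in_dim); region_onehot: (batch, num_regions)
        delay_feat = self.tdnn(feats.transpose(1, 2)).transpose(1, 2)   # initial delay feature
        mid_conv = self.region_affine(region_onehot).unsqueeze(1)       # intermediate conversion feature
        merged = delay_feat + mid_conv                                  # intermediate merged feature
        phone_feat, _ = self.gru(merged)                                # speech phoneme feature
        return phone_feat
```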
In a specific embodiment, as shown in fig. 6, an architecture diagram of an acoustic model for initial accent recognition is provided, specifically:
The server obtains training data comprising training dialect speech, accent region features, and phoneme labels corresponding to the training dialect speech, extracts the acoustic features corresponding to the training speech, and inputs the acoustic features and the accent region features into the initial accent recognition acoustic model. The initial accent recognition acoustic model performs data enhancement on the acoustic features through SpecAugment to obtain acoustic enhancement features, and inputs the acoustic enhancement features into the CNN1 and CNN2 networks for speech feature extraction to obtain initial speech features; the two CNN layers extract more abstract and robust features. The initial speech features are then fed into a network part formed by interleaving several layers of TDNN (time-delay neural network) and GRU (Gated Recurrent Unit), whose main function is to model the time-sequence signal.
Specifically, the initial speech features are input into the Tdnn1 network for calculation to obtain an output; the region features (a 1-hot vector) are linearly transformed by the Affine1 (affine transformation) network to obtain initial conversion features, and the initial conversion features are merged with the output to obtain initial merging features. The initial merging features are input into the Tdnn2 (-1, 0, 1) network for calculation to obtain an output, the region features are linearly transformed by the Affine2 network to obtain conversion features, which are merged with the output to obtain merging features, and so on in sequence: two Tdnn layers and one Opgru layer alternate in the calculation, and the output of each network is merged with the conversion features calculated by the corresponding Affine network before being used as the input of the next network. Finally, the output of the Opgru3 network is merged with the conversion features obtained by passing the region features through the Affine10 network to obtain merging features; these are input into the Affine network corresponding to the dialect's region features in the dialect phoneme recognition layer for phoneme recognition, yielding the initial accent phoneme information output by the output layer. Loss information is then calculated, the initial accent recognition acoustic model is updated according to the loss information, and when training is finished the target accent recognition acoustic model is obtained.
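For readers who prefer a concrete illustration, the interleaved structure described above can be sketched roughly as follows. This is a simplified, assumption-laden PyTorch-style sketch and not the implementation of the application: the number of blocks, layer sizes, and names such as RegionConditionedEncoder are chosen for illustration only, and the per-region output heads stand in for the dialect phoneme recognition layer.

```python
# Sketch of the Fig. 6 style encoder: CNN front-end, alternating TDNN/GRU blocks,
# an affine-transformed region 1-hot vector merged into the input of every block,
# and one phoneme recognition head per accent region. Names and sizes are illustrative.
import torch
import torch.nn as nn

class RegionConditionedEncoder(nn.Module):           # hypothetical class name
    def __init__(self, feat_dim=40, hidden=256, num_regions=10, num_phones=8464):
        super().__init__()
        self.cnn = nn.Sequential(                     # stands in for CNN1 / CNN2
            nn.Conv1d(feat_dim, hidden, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.blocks = nn.ModuleList()                 # Tdnn / Opgru layers
        self.region_affines = nn.ModuleList()         # Affine1 ... AffineN
        for _ in range(3):                            # two Tdnn layers, then one GRU layer
            for _ in range(2):
                self.blocks.append(nn.Conv1d(hidden, hidden, kernel_size=3, padding=1))
                self.region_affines.append(nn.Linear(num_regions, hidden))
            self.blocks.append(nn.GRU(hidden, hidden, batch_first=True))
            self.region_affines.append(nn.Linear(num_regions, hidden))
        # Multi-task output: one phoneme recognition head per accent region feature.
        self.heads = nn.ModuleList([nn.Linear(hidden, num_phones) for _ in range(num_regions)])

    def forward(self, feats, region_onehot, region_id):
        # feats: (batch, frames, feat_dim); region_onehot: (batch, num_regions)
        x = self.cnn(feats.transpose(1, 2)).transpose(1, 2)
        for block, affine in zip(self.blocks, self.region_affines):
            x = x + affine(region_onehot).unsqueeze(1)    # merge region information
            if isinstance(block, nn.GRU):
                x, _ = block(x)
            else:
                x = torch.relu(block(x.transpose(1, 2)).transpose(1, 2))
        return self.heads[region_id](x)               # per-frame phoneme scores for this region
```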
In one embodiment, as shown in fig. 7, calculating loss information based on the initial speech phoneme information and the corresponding phoneme label, updating the initial accent recognition acoustic model based on the loss information, and returning to the step of inputting the acoustic features and accent region features into the initial accent recognition acoustic model for iterative execution until the training is completed, so as to obtain the target accent recognition acoustic model, including:
and step 702, calculating loss information by using the initial voice phoneme information and the corresponding phoneme label through a maximum mutual information optimization function, and reversely updating parameters of the initial accent recognition acoustic model based on the loss information to obtain an updated accent recognition acoustic model when the loss information does not meet the training completion condition.
And step 704, taking the updated acoustic model for accent recognition as an initial accent recognition acoustic model, and iteratively executing the step of inputting the acoustic characteristics and the accent region characteristics into the initial accent recognition acoustic model until the loss information meets the training completion conditions, and taking the initial accent recognition acoustic model meeting the training completion conditions as a target accent recognition acoustic model.
Specifically, the training criterion is LF-MMI (Lattice-Free Maximum Mutual Information), and training is performed with a gradient descent method. The server calculates the error between the initial speech phoneme information and the corresponding phoneme label using the maximum mutual information optimization function to obtain the loss information, and judges whether the loss information reaches a preset loss threshold. When the loss information does not reach the loss threshold, the server back-propagates the loss information to update the parameters of the initial accent recognition acoustic model, obtaining an updated accent recognition acoustic model; the updated accent recognition acoustic model is then used as the initial accent recognition acoustic model, and the step of inputting the acoustic features and accent region features into the initial accent recognition acoustic model is executed iteratively. When the loss information meets the training completion condition, the initial accent recognition acoustic model meeting the training completion condition is used as the target accent recognition acoustic model. In one embodiment, when the obtained training data is the training speech corresponding to a first accent region feature, training with this speech yields an updated accent recognition acoustic model in which only the parameters of the phoneme recognition network corresponding to the first accent region feature are updated, while the parameters of the phoneme recognition networks corresponding to the other accent region features remain unchanged; the updated accent recognition acoustic model is then used as the initial accent recognition acoustic model and the step of obtaining training data, which may be training speech corresponding to other accent region features, is executed, until the loss information meets the training completion condition, at which point the initial accent recognition acoustic model meeting the training completion condition is used as the target accent recognition acoustic model.
In this embodiment, the maximum mutual information optimization function is used in the model training process, so that the trained target accent recognition acoustic model can improve the accuracy of accent recognition.
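As a rough illustration of this training procedure, the sketch below shows a generic gradient-descent loop with a training-completion check. It is an assumption-laden sketch rather than the implementation of the application; in particular, lfmmi_loss is only a placeholder, since a real LF-MMI objective additionally requires denominator-graph machinery such as that provided by Kaldi or PyChain.

```python
# Schematic gradient-descent training loop for the procedure of Fig. 7 (a sketch).
def train_until_converged(model, optimizer, data_loader, lfmmi_loss,
                          loss_threshold=0.1, max_epochs=50):
    for _ in range(max_epochs):
        total_loss, batches = 0.0, 0
        for acoustic_feats, region_onehot, region_id, phoneme_labels in data_loader:
            posteriors = model(acoustic_feats, region_onehot, region_id)
            loss = lfmmi_loss(posteriors, phoneme_labels)   # maximum mutual information criterion
            optimizer.zero_grad()
            loss.backward()                                 # back-propagate the loss information
            optimizer.step()                                # update the model parameters
            total_loss, batches = total_loss + loss.item(), batches + 1
        if total_loss / max(batches, 1) < loss_threshold:   # training-completion condition
            break
    return model
```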
In one embodiment, as shown in fig. 8, before step 202, i.e. before acquiring the training data, the method further includes:
step 802, basic training data is obtained, wherein the basic training data comprises basic training speech, basic accent region features corresponding to the basic training speech and basic phoneme labels.
The basic training data refers to data used in training of a basic accent recognition acoustic model, the basic accent recognition acoustic model refers to an untrained single-task accent recognition acoustic model, and the single task refers to an accent recognition task. The basic accent recognition acoustic model may be a pre-trained model of the initial accent recognition acoustic model, that is, parameters of the trained basic accent recognition acoustic model are used as initial parameters of the initial accent recognition acoustic model. The base training data may be the same as or different from the training data, or may be a part of the training data. For example, the initial training data may be 20000 hours of speech data, the training data may be 4000 hours of speech data with region information selected from 20000 hours of speech data, and the basic training data may also be 4000 hours of speech data with region information selected from 20000 hours of speech data. Basic training speech refers to speech in the basic training data. The basic accent region feature refers to a feature of a region corresponding to the basic training speech, that is, the basic training speech is an accent speech of the region. The basic phoneme label refers to a label of a phoneme corresponding to the basic training speech.
Specifically, the server may obtain the basic training data from the database, or may select the basic training data with the area information from the training data. The server may also collect the resulting basic training data from the internet.
And step 804, extracting basic acoustic features corresponding to the basic training voice.
Specifically, the basic acoustic features are the acoustic features corresponding to the basic training speech. The server may extract the corresponding acoustic features from the basic training speech using signal processing techniques; for example, PNCC features may be extracted using a PNCC algorithm, i-vector features may be extracted using an i-vector extractor, fundamental frequency features may be obtained by detecting the pitch period in the speech, formant features may be extracted using a cepstrum method, and MFCC features may be extracted using an MFCC extraction algorithm.
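As a small illustration of one possible front-end, the sketch below extracts MFCC features with the librosa library. It is only an example under assumed parameters (16 kHz audio, 25 ms windows, 10 ms shift) and does not cover the PNCC, i-vector, pitch, or formant features mentioned above.

```python
# Illustrative MFCC front-end using librosa (an assumed dependency).
import librosa

def extract_basic_acoustic_features(wav_path, sr=16000, n_mfcc=40):
    audio, _ = librosa.load(wav_path, sr=sr)
    # 25 ms analysis windows with a 10 ms shift, a common ASR front-end configuration
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc,
                                n_fft=int(0.025 * sr), hop_length=int(0.010 * sr))
    return mfcc.T                                    # shape: (frames, n_mfcc)
```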
Step 806, inputting the basic acoustic features and the basic accent region features into a basic accent recognition acoustic model, inputting the basic accent region features into a basic conversion network for conversion by the basic accent recognition acoustic model to obtain basic conversion features, inputting the basic acoustic features into a basic feature extraction network for voice feature extraction to obtain basic voice features, merging the basic conversion features and the basic voice features to obtain basic merging features, and inputting the basic merging features into a basic phoneme recognition network for voice phoneme recognition to obtain basic voice phoneme information.
The basic accent recognition acoustic model is used as a pre-training model of the initial accent recognition acoustic model, and comprises a basic conversion network, a basic feature extraction network and a basic phoneme recognition network. The basic conversion network is used to convert the basic accent region features, and the conversion may be linear or nonlinear. The basic conversion features are the conversion features corresponding to the basic accent region features. The basic feature extraction network is used to extract more abstract and robust features. The basic speech features are the features corresponding to the basic training speech. The basic merging features are obtained by merging the basic conversion features and the basic speech features, and the basic phoneme recognition network is used to recognize the basic speech phoneme information corresponding to the basic training speech. The basic speech phoneme information is the speech phoneme information corresponding to the basic training speech; it may be the phonemes corresponding to the basic training speech, or a probability distribution over the states corresponding to the phonemes.
Specifically, the server inputs the basic acoustic features and basic accent region features into the basic accent recognition acoustic model. The basic accent recognition acoustic model receives the input basic acoustic features and basic accent region features, inputs the basic accent region features into the basic conversion network for conversion to obtain the basic conversion features, inputs the basic acoustic features into the basic feature extraction network for speech feature extraction to obtain the basic speech features, and merges the basic conversion features and the basic speech features to obtain the basic merging features, where the merging may be direct concatenation or a vector operation. The basic merging features are then input into the basic phoneme recognition network for speech phoneme recognition to obtain the basic speech phoneme information. In one embodiment, the basic acoustic features may also be deformed through spectrum enhancement to obtain basic acoustic enhancement features, and the basic acoustic enhancement features are input into the basic feature extraction network for speech feature extraction to obtain the basic speech features; that is, a masking operation is performed in the time-frequency domain through the SpecAugment technique to obtain the basic acoustic enhancement features.
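A minimal sketch of such a time-frequency masking operation, in the spirit of SpecAugment, is given below; it assumes a (frames, feature-dimension) numpy matrix and illustrative mask widths, and is not the data-enhancement code of the application.

```python
# Minimal time/frequency masking in the spirit of SpecAugment (illustrative sketch).
import numpy as np

def spec_augment(features, max_freq_mask=8, max_time_mask=20, rng=None):
    rng = rng or np.random.default_rng()
    augmented = features.copy()
    frames, dims = augmented.shape
    f = rng.integers(0, max_freq_mask + 1)           # width of the frequency mask
    f0 = rng.integers(0, max(dims - f, 1))
    augmented[:, f0:f0 + f] = 0.0                    # mask a band of feature dimensions
    t = rng.integers(0, max_time_mask + 1)           # width of the time mask
    t0 = rng.integers(0, max(frames - t, 1))
    augmented[t0:t0 + t, :] = 0.0                    # mask a span of frames
    return augmented
```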
And 808, calculating basic loss information based on the basic speech phoneme information and the corresponding basic phoneme label, and updating the basic accent recognition acoustic model based on the basic loss information.
And 810, judging whether the basic training is finished or not, executing step 812 when the basic training is finished, and returning to step 806 to continue iterative execution when the basic training is not finished.
The basic loss information refers to loss information obtained during training of the basic accent recognition acoustic model.
Specifically, the server calculates the error between the basic speech phoneme information and the corresponding basic phoneme label using a preset loss function to obtain the basic loss information, and back-propagates the basic loss information through a gradient descent algorithm to update the parameters of the basic conversion network, the basic feature extraction network, and the basic phoneme recognition network in the basic accent recognition acoustic model, thereby obtaining an updated basic accent recognition acoustic model. The server then judges whether the basic training is completed, which may be judged by whether the number of training iterations reaches the maximum number of iterations, whether the loss information reaches a preset loss threshold, or whether the updated parameters of the basic accent recognition acoustic model no longer change. When the basic training is judged not to be completed, the updated basic accent recognition acoustic model is used as the basic accent recognition acoustic model, and the step of inputting the basic acoustic features and the basic accent region features into the basic accent recognition acoustic model is executed iteratively. In one embodiment, it is also possible to return to the step of obtaining the basic training data for iterative execution, that is, to obtain any one basic training speech from the basic training data for iterative execution. When the basic training is judged to be completed, the updated basic accent recognition acoustic model is used as the trained basic accent recognition acoustic model.
And step 812, obtaining a trained basic accent recognition acoustic model, and establishing an initial accent recognition acoustic model based on the trained basic accent recognition acoustic model.
Specifically, the server establishes an initial accent recognition acoustic model by using the trained basic accent recognition acoustic model, and parameters of a target basic conversion network, a target basic feature extraction network and a target basic phoneme recognition network in the basic accent recognition acoustic model can be used as initial parameters in the initial accent recognition acoustic model to obtain the initial accent recognition acoustic model.
In the above embodiment, a basic accent recognition acoustic model is first obtained by training with the basic training data, the trained basic accent recognition acoustic model is then used to establish the initial accent recognition acoustic model, and the initial accent recognition acoustic model is finally trained to obtain the target accent recognition acoustic model.
In one embodiment, as shown in FIG. 9, the basic phoneme recognition network includes a basic speech phoneme feature extraction network, a basic target conversion network and a basic accent phoneme recognition network;
step 806, inputting the basic merging features into a basic phoneme recognition network for speech phoneme recognition, and obtaining basic speech phoneme information, including:
step 902, inputting the basic merging features into a basic speech phoneme feature extraction network for speech phoneme feature extraction, so as to obtain basic speech phoneme features.
The basic speech phoneme feature extraction network is a network for extracting basic training speech phoneme features, and may be a deep neural network, such as a TDNN network, an RNN network, or the like. The basic speech phoneme features refer to features of the extracted basic training speech phonemes.
Specifically, the server inputs the basic merging features into the basic speech phoneme feature extraction network for speech phoneme feature extraction to obtain the basic speech phoneme features.
And 904, inputting the basic accent region characteristics into a basic target conversion network for conversion to obtain basic target conversion characteristics.
The basic target conversion network is a network for converting the basic accent regional characteristics, and the network parameters of the basic target conversion network and the basic conversion network are different, and linear conversion or nonlinear conversion can be performed. The basic target conversion feature refers to a feature obtained by conversion through a basic target conversion network.
Specifically, the server may input the basic accent region features into the basic target conversion network for linear conversion to obtain basic target linear conversion features, or may perform nonlinear conversion to obtain basic target nonlinear conversion features. The linear conversion may be an affine transformation.
Step 906, merging the basic speech phoneme characteristics and the basic target conversion characteristics to obtain basic target merging characteristics, and inputting the basic target merging characteristics into a basic accent phoneme recognition network for phoneme recognition to obtain basic speech phoneme information.
The basic target merging feature is a feature obtained by merging the basic speech phoneme feature and the basic target conversion feature. The basic accent phoneme recognition network is used to recognize the corresponding accent phonemes of the basic training speech, for example, the basic accent phoneme recognition network may be a linear transformation network. The basic speech phoneme information refers to accent phoneme information corresponding to the basic training speech.
Specifically, the server calculates the vector sum of the basic speech phoneme characteristics and the basic target conversion characteristics to obtain basic target combination characteristics, and inputs the basic target combination characteristics into a basic accent phoneme recognition network for phoneme recognition to obtain basic speech phoneme information.
In the above embodiment, when performing the accent phoneme recognition, the feature obtained by combining the basic speech phoneme feature and the basic target conversion feature is used for the recognition, so that the accuracy of the basic speech phoneme information obtained by the recognition can be improved.
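A toy sketch of this merging step is shown below: the basic speech phoneme features and the basic target conversion features are combined by an element-wise vector sum and passed to a linear accent-phoneme recognition layer. The tensor shapes and the output size are illustrative assumptions, not values from the application.

```python
# Toy illustration of step 906: merging by vector sum, then a linear recognition layer.
import torch
import torch.nn as nn

phone_feats = torch.randn(1, 100, 256)          # basic speech phoneme features (1 utt, 100 frames)
target_conv_feats = torch.randn(1, 1, 256)      # basic target conversion features (per utterance)
merged = phone_feats + target_conv_feats        # basic target merging features (vector sum)
accent_head = nn.Linear(256, 8464)              # basic accent phoneme recognition network
phone_info = accent_head(merged)                # basic speech phoneme information (per-frame scores)
```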
In one embodiment, as shown in fig. 10, step 812, building an initial accent recognition acoustic model based on the trained basic accent recognition acoustic model, includes:
and step 1002, taking the basic conversion network in the trained basic accent recognition acoustic model as an initial conversion network in the initial accent recognition acoustic model.
And 1004, taking the basic feature extraction network in the trained basic accent recognition acoustic model as an initial feature extraction network in the initial accent recognition acoustic model.
Step 1006, taking the basic speech phoneme feature extraction network in the trained basic accent recognition acoustic model as an initial speech phoneme feature extraction network in the initial accent recognition acoustic model.
And step 1008, taking the basic target conversion network in the trained basic accent recognition acoustic model as an initial target conversion network in the initial accent recognition acoustic model.
Specifically, the server takes the basic conversion network, the basic feature extraction network, the basic speech phoneme feature extraction network, and the basic target conversion network in the trained basic accent recognition acoustic model as the initial conversion network, the initial feature extraction network, the initial speech phoneme feature extraction network, and the initial target conversion network in the initial accent recognition acoustic model; that is, the network parameters in the basic accent recognition acoustic model are shared with the corresponding networks in the initial accent recognition acoustic model.
Step 1010, establishing an initial accent phoneme recognition network corresponding to at least two different accent region characteristics to obtain an initial accent recognition acoustic model.
The initial accent phoneme recognition network refers to an accent phoneme recognition network initialized by network parameters.
Specifically, the server establishes an initial accent phoneme recognition network corresponding to each accent region feature according to the number of accent region features, and the initial accent phoneme recognition network corresponding to each accent region feature is used to recognize the accent phonemes of the speech corresponding to that accent region feature. The server then obtains the initial accent recognition acoustic model from the basic conversion network, the basic feature extraction network, the basic speech phoneme feature extraction network, the basic target conversion network, and the initial accent phoneme recognition networks corresponding to the at least two different accent region features.
In the above embodiment, the training of the initial accent recognition acoustic model can be made to improve efficiency and accuracy by establishing the initial accent recognition acoustic model using the trained basic accent recognition acoustic model.
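Steps 1002 to 1010 can be pictured with the following hedged sketch: the trained base networks are copied into the initial model and one freshly initialised recognition head is created per accent region. The function and attribute names (build_initial_model, region_heads) are illustrative assumptions and not the API of the application.

```python
# Hedged sketch of steps 1002-1010: reuse trained base networks, add per-region heads.
import copy
import torch.nn as nn

def build_initial_model(trained_base_model, num_regions, hidden=256, num_phones=8464):
    # The conversion, feature-extraction, phoneme-feature-extraction and target
    # conversion networks keep their trained parameters.
    initial_model = copy.deepcopy(trained_base_model)
    # New, parameter-initialised accent phoneme recognition head for every region feature.
    initial_model.region_heads = nn.ModuleList(
        [nn.Linear(hidden, num_phones) for _ in range(num_regions)]
    )
    return initial_model
```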
In one embodiment, the basic speech phoneme feature extraction network comprises at least one basic time-delay neural network, at least one basic gating cycle network and at least one basic intermediate conversion network, wherein the basic time-delay neural network and the basic gating cycle network are of an alternative network structure;
as shown in fig. 11, step 902, inputting the basic merged feature into the basic speech/phoneme feature extraction network for speech/phoneme feature extraction, so as to obtain a basic speech/phoneme feature, includes:
step 1102, inputting the basic merging characteristics into a basic delay neural network for calculation to obtain basic delay characteristics.
And 1104, inputting the basic accent region characteristics into a basic intermediate conversion network for conversion to obtain basic intermediate conversion characteristics.
The basic time delay neural network is a time delay neural network in a basic accent recognition acoustic model and is used for extracting translation invariant features on an input feature extraction frequency domain. The basic time delay features refer to features extracted by a basic time delay neural network. The basic intermediate conversion network is a conversion network between a basic time delay neural network and a basic gating circulation network, is used for converting the characteristics of a basic accent region, and can be nonlinear conversion or linear conversion.
Specifically, the server inputs the basic merging features into the basic time-delay neural network for calculation to obtain the output basic time-delay features, and simultaneously inputs the basic accent region features into the basic intermediate conversion network for conversion to obtain the basic intermediate conversion features. The basic intermediate conversion features are the conversion features obtained through the basic intermediate conversion network.
And step 1106, combining the basic time delay characteristic and the basic intermediate conversion characteristic to obtain a basic intermediate combination characteristic, and inputting the basic intermediate combination characteristic into a basic gating cycle network for calculation to obtain a basic speech phoneme characteristic.
The basic gating cycle network is a network for extracting basic speech phoneme features through basic intermediate merging features.
Specifically, when the server needs to input the basic delay feature into the basic gated cyclic network, the basic delay feature and the basic intermediate conversion feature need to be merged first to obtain a basic intermediate merged feature, and then the basic intermediate merged feature is input into the basic gated cyclic network to be calculated to obtain a basic speech phoneme feature.
In the above embodiments, the speech time-sequence signal is modeled by the basic time-delay neural network and the basic gated recurrent network when extracting the basic speech phoneme features, so that the accuracy of the obtained basic speech phoneme features can be improved.
In a specific embodiment, as shown in fig. 12, an architecture diagram of the basic accent recognition acoustic model is provided; apart from the speech phoneme recognition layer, its architecture is substantially consistent with that of the initial accent recognition acoustic model. Specifically, the method comprises the following steps:
The server inputs the basic acoustic features and the basic accent region features into the basic accent recognition acoustic model. The basic accent recognition acoustic model performs data enhancement on the basic acoustic features and carries out feature extraction through the CNN network to obtain the basic speech features, and the basic speech features are input into the Tdnn1 network to obtain output features. At the same time, the basic accent region features are input into the affine network for linear conversion to obtain conversion features, and the conversion features are merged with the output of the Tdnn1 network to obtain the basic merging features. The basic merging features are then fed into the alternating network of Tdnn and GRU layers to obtain the output basic speech phoneme features. In this alternating network, if the output of the L-th layer is H_L with dimension h x 1, the vector corresponding to the region feature is V with dimension v x 1, and the parameter matrix of the affine network is W_L with dimension h x v, then the input of the (L+1)-th layer (which can also be understood as the modified output of the L-th layer) is expressed by the following formula (1):
H'_L = H_L + W_L V    Formula (1)
where H'_L is the modified output of the L-th layer, which is simultaneously used as the input of the (L+1)-th layer.
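A small numerical illustration of formula (1) is given below, under the dimension conventions just stated (H_L of size h x 1, V of size v x 1, W_L of size h x v); the concrete sizes are arbitrary and chosen only to show that the shapes are compatible.

```python
# Numerical illustration of formula (1); sizes are arbitrary.
import numpy as np

h, v = 4, 10                                   # hidden size and number of regions (illustrative)
H_L = np.random.randn(h, 1)                    # output of the L-th layer, h x 1
V = np.zeros((v, 1)); V[2] = 1.0               # 1-hot region feature vector, v x 1
W_L = np.random.randn(h, v)                    # parameter matrix of the affine network, h x v
H_L_prime = H_L + W_L @ V                      # formula (1): input of the (L+1)-th layer
print(H_L_prime.shape)                         # (4, 1)
```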
The output basic speech phoneme features are then merged with the conversion features obtained by inputting the region features into the Affine10 network, and the merged features are input into the speech phoneme recognition layer for recognition to obtain the output basic speech phoneme information. Loss information is then calculated; when the loss information is not less than a preset threshold, the iteration continues until the loss information is less than the preset threshold, at which point the trained basic accent recognition acoustic model is obtained, and the initial accent recognition acoustic model is established according to the trained basic accent recognition acoustic model.
In one embodiment, as shown in fig. 13, before step 802, i.e. before acquiring the basic training data, the method further includes:
step 1302, obtain initial training data, where the initial training data includes an initial training speech and a corresponding initial phoneme label.
The initial training data is used to train the initial accent recognition acoustic model, which serves as a pre-training model of the basic accent recognition acoustic model; that is, the parameters of the trained initial accent recognition acoustic model are used as parameters in the basic accent recognition acoustic model. The network architecture of the initial accent recognition acoustic model is consistent with the network architecture of the basic accent recognition acoustic model except that it does not include the conversion networks. The trained initial accent recognition acoustic model is used to recognize the phonemes corresponding to accent speech. The model parameters in the initial accent recognition acoustic model are initialized, which may be random initialization, zero initialization, and so on. The initial training speech is the speech in the initial training data and may be speech with different accents. The initial phoneme label refers to the label of the phoneme corresponding to the initial training speech.
Specifically, the server may obtain the initial training data directly from a database, may acquire the initial training data from the internet, or may obtain it from a service provider that provides data services.
Step 1304, extracting an initial acoustic feature corresponding to the initial training speech, inputting the initial acoustic feature into an initial accent recognition acoustic model initialized by parameters, inputting the initial acoustic feature into an initial feature extraction network by the initial accent recognition acoustic model for feature extraction to obtain an initial speech feature, and inputting the initial speech feature into an initial phoneme recognition network for speech phoneme recognition to obtain initial speech phoneme information.
The initial acoustic features are the acoustic features corresponding to the initial training speech, and the server may extract the corresponding acoustic features from the initial training speech using signal processing techniques; for example, PNCC features may be extracted using a PNCC algorithm, i-vector features may be extracted using an i-vector extractor, fundamental frequency features may be obtained by detecting the pitch period in the speech, formant features may be extracted using a cepstrum method, MFCC features may be extracted using an MFCC extraction algorithm, and so on. The initial feature extraction network is the network used to extract features from the initial acoustic features and can extract more abstract and robust features. The initial speech features are the speech features corresponding to the initial training speech. The initial speech phoneme information is the speech phoneme information corresponding to the initial training speech.
Specifically, when training an initial accent recognition acoustic model, a server extracts initial acoustic features corresponding to initial training speech, inputs the initial acoustic features into the initial accent recognition acoustic model with initialized parameters, inputs the initial acoustic features into an initial feature extraction network for feature extraction by the initial accent recognition acoustic model to obtain initial speech features, and inputs the initial speech features into an initial phoneme recognition network for speech phoneme recognition to obtain initial speech phoneme information. In one embodiment, the initial acoustic features are masked in a time-frequency domain by a SpecAugmen technology before being input into an initial feature extraction network to obtain initial acoustic enhancement features, and then the initial acoustic enhancement features are input into the initial feature extraction network to perform feature extraction to obtain initial voice features.
Step 1306, calculating initial loss information based on the initial phonetic phoneme information and the corresponding initial phonetic phoneme label, and updating parameters in the initial accent recognition acoustic model based on the initial loss information.
The initial loss information refers to loss information obtained when an initial accent recognition acoustic model is trained.
Specifically, the server calculates an error between the initial speech phoneme information and the corresponding initial speech phoneme label by using a loss function to obtain loss information, and then updates parameters in the initial accent recognition acoustic model by using the initial loss information, namely, parameters of the initial feature extraction network and parameters in the initial phoneme recognition network.
Step 1308, determining whether the initial training is completed, executing step 1310 when the training is completed, and returning to step 1304 when the training is not completed.
Step 1310, obtaining a trained initial accent recognition acoustic model, and establishing a basic accent recognition acoustic model based on the trained initial accent recognition acoustic model.
Wherein, the completion of the initial training means that the initial training reaches the initial training completion condition. The initial training completion condition may be that the training number reaches the maximum iteration number, or the initial loss information reaches a preset threshold, or the updated parameter is not changed any more, and the like.
Specifically, the server judges whether the initial training completion condition is reached. When the initial training completion condition is not reached, the server returns to the step of inputting the initial acoustic features into the parameter-initialized initial accent recognition acoustic model for iterative execution; when the initial training is completed, the trained initial accent recognition acoustic model is obtained, and the basic accent recognition acoustic model is established using the trained initial feature extraction network and the trained initial phoneme recognition network.
In the above embodiment, the base accent recognition acoustic model is established by the trained initial accent recognition acoustic model, so that the base accent recognition acoustic model can converge faster during training and the trained base accent recognition acoustic model has higher accuracy.
In one embodiment, the initial phoneme recognition network includes an initial speech phoneme feature extraction network and an initial accent phoneme recognition network;
step 1304, inputting the initial speech features into an initial phoneme recognition network for speech phoneme recognition to obtain initial speech phoneme information, including the steps of:
inputting the initial voice features into an initial voice phoneme feature extraction network for voice phoneme feature extraction to obtain initial voice phoneme features, and inputting the initial voice phoneme features into an initial accent phoneme recognition network for phoneme recognition to obtain initial voice phoneme information.
Specifically, the initial speech phoneme feature extraction network is used for performing phoneme feature extraction on the initial training speech. The initial speech phoneme feature refers to a phoneme feature corresponding to the initial training speech. The initial accent phoneme recognition network is used for recognizing phonemes corresponding to the initial training speech. The initial speech phoneme information refers to phoneme information corresponding to the initial training speech. The server inputs the initial voice features into an initial voice phoneme feature extraction network to extract voice phoneme features to obtain initial voice phoneme features, and then inputs the initial voice phoneme features into an initial accent phoneme recognition network to perform phoneme recognition to obtain initial voice phoneme information.
In one embodiment, the initial speech phoneme feature extraction network comprises at least one initial delay neural network and at least one initial gating cycle network, and the initial delay neural network and the initial gating cycle network are of an alternative network structure;
step 1304, inputting the initial speech feature into an initial speech phoneme feature extraction network to perform speech phoneme feature extraction, so as to obtain an initial speech phoneme feature, including the steps of:
and inputting the initial voice feature into an initial time delay neural network for calculation to obtain an initial time delay feature, and inputting the initial time delay feature into an initial gating cyclic network for calculation to obtain an initial voice phoneme feature.
The initial time-delay neural network refers to the time-delay neural network in the initial accent recognition acoustic model, and the initial time-delay features refer to the time-delay features it extracts. The initial gated recurrent network refers to the gated recurrent network in the initial accent recognition acoustic model. The initial speech phoneme features refer to the phoneme features extracted for the initial training speech. The server inputs the initial speech features into the initial time-delay neural network to extract the translation-invariant features in the frequency domain, obtaining the initial time-delay features, and then inputs the initial time-delay features into the initial gated recurrent network for calculation to obtain the initial speech phoneme features. In one embodiment, several time-delay neural networks are followed by a gated recurrent network, then by time-delay neural networks again, so that the two kinds of networks are connected alternately; in this way the extracted initial speech phoneme features can be more accurate.
In one embodiment, the method for establishing a basic accent recognition acoustic model based on a trained initial accent recognition acoustic model comprises the following steps:
and taking the trained initial feature extraction network as a basic feature extraction network, taking the trained initial time delay neural network as a basic time delay neural network, taking the trained initial gated circulation network as a basic gated circulation network, taking the trained initial accent phoneme recognition network as a basic accent phoneme recognition network, and establishing a parameter initialized conversion network to obtain a basic accent recognition acoustic model.
Specifically, the server takes parameters in the trained initial accent recognition acoustic model as parameters in the basic accent recognition acoustic model, establishes a conversion network with initialized parameters, and the conversion network with initialized parameters is used for converting the regional characteristics to obtain conversion characteristics, and then combines the conversion characteristics with outputs in the basic time delay neural network, the basic gating cyclic network and the basic accent phoneme recognition network to obtain basic combination characteristics. And obtaining a basic accent recognition acoustic model according to the obtained basic feature extraction network, the basic time delay neural network, the basic gating circulation network, the basic accent phoneme recognition network and the parameter initialized conversion network. Namely, the base accent recognition acoustic model is established by using the trained initial accent recognition acoustic model, so that the efficiency and the accuracy of training the base accent recognition acoustic model can be improved.
In a specific embodiment, as shown in fig. 14, which is an architectural schematic diagram of the initial accent recognition acoustic model, initial training data is obtained when training the initial accent recognition acoustic model; the initial training data may be accent speech with region information or accent speech without region information. Specifically, the server inputs the acoustic features corresponding to the initial training speech into the initial accent recognition acoustic model, where the acoustic features may be 40-dimensional PNCC features and 200-dimensional i-vector features; recognition is performed through the architecture shown in the diagram to obtain 8464 output phoneme states, loss information is calculated, and the initialized parameters in the initial accent recognition acoustic model are updated using the loss information to obtain an updated accent recognition acoustic model. The updated accent recognition acoustic model is then used as the initial accent recognition acoustic model and the loop iteration continues; when training is determined to be completed, the trained initial accent recognition acoustic model is obtained, and it is then used as part of the basic accent recognition acoustic model to establish the basic accent recognition acoustic model.
In one embodiment, as shown in fig. 15, after the steps of calculating loss information based on the initial speech phoneme information and the corresponding phoneme label, updating the initial accent recognition acoustic model based on the loss information, and returning to input the acoustic features and accent region features into the initial accent recognition acoustic model are iteratively performed until the training is completed to obtain the target accent recognition acoustic model, the method further includes:
step 1502, obtaining target accent data corresponding to the target area, where the target accent data includes target area speech and corresponding target area speech phoneme labels.
The target region is a region in which the accent recognition effect needs to be optimized. The target accent data is accent voice data corresponding to the target region. The target area voice is an accent voice corresponding to the target area. The target region phonetic phoneme label refers to a label of the target region phonetic phoneme.
Specifically, when the accent recognition effect of any one region needs to be optimized, that is, the accent recognition accuracy needs to be improved, at this time, the server may obtain the target accent data corresponding to the target region from the database, or may acquire the target accent data corresponding to the target region from the internet.
Step 1504, obtaining the target area characteristics corresponding to the target area and extracting the target area voice acoustic characteristics corresponding to the target area voice.
Specifically, the target area feature refers to a feature corresponding to the target area, and the target area feature corresponding to the target area may be obtained according to a preset correspondence between the area and the area feature. The target area voice acoustic feature refers to an acoustic feature corresponding to the target area voice, that is, the server uses a signal processing technology to extract the target area voice acoustic feature corresponding to the target area voice.
Step 1506, inputting the target region voice acoustic features and the target region features into a target accent recognition acoustic model, transforming the target region features by the target accent recognition acoustic model to obtain target region transformation features, extracting the voice features based on the target region acoustic features to obtain target region voice features, merging the target region transformation features and the target region voice features to obtain target region merging features, and performing voice phoneme recognition based on the target region merging features to obtain target region voice phoneme information.
The target region transformation feature is a feature obtained by transforming the target region feature, and may be a linear feature or a nonlinear feature. The target area voice feature refers to a voice feature corresponding to the target area voice. The target area merging feature is a feature obtained by merging the target area conversion feature and the target area voice feature.
Specifically, the server optimizes the target accent recognition acoustic model using the target accent data corresponding to the target region; that is, the server inputs the target region speech acoustic features and the target region features into the target accent recognition acoustic model to obtain the output target region speech phoneme information.
step 1508, calculating target region speech loss information based on the target region speech phoneme information and the corresponding target region speech phoneme label, updating a phoneme recognition network corresponding to the target region in the target accent recognition acoustic model based on the target region speech loss information, and returning to the step of inputting the target region speech acoustic features and the target region features into the target accent recognition acoustic model for iterative execution until the target training is completed, so as to obtain an optimized accent recognition acoustic model.
Specifically, the phoneme recognition network corresponding to the target region refers to a network corresponding to the target region in the multitask phoneme recognition network. At this time, the server calculates an error between the target region speech phoneme information and the corresponding target region speech phoneme label by using a preset loss function, so as to obtain the target region speech loss information. And only updating the phoneme recognition network corresponding to the target region by using the voice loss information of the target region, and keeping other parameters in the target accent recognition acoustic model unchanged. And then returning to the step of inputting the target region voice acoustic features and the target region features into the target accent recognition acoustic model for iterative execution until the target training is finished to obtain the optimized accent recognition acoustic model.
In the above embodiment, the target accent recognition acoustic model is optimized by using the target accent data corresponding to the target region, so that the accuracy of accent recognition corresponding to the target region can be further improved on the basis of ensuring the accuracy of accent recognition of other regions.
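The selective updating described in steps 1502 to 1508 can be sketched as below. It is a hedged illustration that assumes the model exposes its per-region phoneme recognition heads as model.region_heads (an assumed attribute name), so that only the head of the target region receives gradient updates while all shared parameters stay fixed.

```python
# Hedged sketch of the region-specific optimisation in steps 1502-1508.
import torch

def freeze_except_target_head(model, target_region_id, lr=1e-3):
    for param in model.parameters():
        param.requires_grad = False                        # keep shared parameters unchanged
    for param in model.region_heads[target_region_id].parameters():
        param.requires_grad = True                         # update only the target-region head
    return torch.optim.SGD(
        (p for p in model.parameters() if p.requires_grad), lr=lr)
```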
In one embodiment, as shown in fig. 16, a method for identifying a mouth-sound is provided, which is described by taking the method as an example applied to the server in fig. 1, and includes the following steps:
step 1602, obtain the accent voice to be recognized and the corresponding region information to be recognized.
The accent speech to be recognized refers to the accent speech that needs to be recognized. The region information to be recognized refers to the region corresponding to the accent speech to be recognized; different accent speech corresponds to different region information.
Specifically, the server may obtain the accent speech to be recognized and the corresponding region information to be recognized from the database. The server can also acquire the accent voice to be recognized acquired by the microphone and then acquire the area information to be recognized corresponding to the accent voice to be recognized. The server can also acquire the accent voice to be recognized and the corresponding regional information to be recognized, which are uploaded by the terminal.
Step 1604, extracting the acoustic features to be recognized corresponding to the accent speech to be recognized and obtaining the region features to be recognized corresponding to the region information to be recognized.
The acoustic feature to be recognized refers to an acoustic feature corresponding to the accent voice to be recognized, and the region feature to be recognized refers to a feature of a region corresponding to the accent voice to be recognized.
Specifically, the server extracts the acoustic feature to be recognized corresponding to the accent voice to be recognized and then acquires the feature of the area to be recognized corresponding to the information of the area to be recognized. In one embodiment, the region text corresponding to the region information to be recognized may be vectorized to obtain the region feature to be recognized.
Step 1606, inputting the acoustic features to be recognized and the region features to be recognized into a target accent recognition acoustic model, transforming the region features to be recognized by the target accent recognition acoustic model to obtain transformation features to be recognized, performing speech feature extraction on the acoustic features to be recognized to obtain speech features to be recognized, merging the transformation features to be recognized and the speech features to be recognized to obtain merging features to be recognized, and performing speech phoneme recognition on the merging features to be recognized to obtain the speech phoneme information corresponding to the accent speech to be recognized.
The feature to be identified is obtained by converting the feature of the region to be identified. The speech feature to be recognized refers to a feature corresponding to the speech to be recognized. The merging feature to be recognized refers to a feature obtained by merging the transformation feature to be recognized and the voice feature to be recognized.
Specifically, the server inputs the acoustic feature to be recognized and the feature of the region to be recognized into a target accent recognition acoustic model, the target accent recognition acoustic model transforms the feature of the region to be recognized to obtain a transformed feature to be recognized, the acoustic feature to be recognized is subjected to voice feature extraction to obtain a voice feature to be recognized, the transformed feature to be recognized and the voice feature to be recognized are combined to obtain a combined feature to be recognized, voice phoneme recognition is carried out on the combined feature to be recognized to obtain voice phoneme information corresponding to the accent voice to be recognized and output by the target accent recognition acoustic model. In one embodiment, the target accent recognition acoustic model may be an accent recognition acoustic model trained in any of the embodiments of the accent recognition acoustic model training method described above. In a specific embodiment, the server may use the acoustic model for accent recognition as shown in fig. 6 to recognize the speech to be recognized, and obtain the phonetic phoneme information corresponding to the accent speech to be recognized.
Step 1608, performing text recognition based on the speech phoneme information to obtain a target text corresponding to the accent speech to be recognized.
The target text refers to a text corresponding to the accent voice to be recognized.
Specifically, the server performs text recognition using the speech phoneme information; that is, the server may perform text recognition on the speech phoneme information using a dictionary and a language model to obtain the target text corresponding to the accent speech to be recognized.
According to the accent recognition method, the acoustic features to be recognized corresponding to the accent speech to be recognized are extracted and the region features to be recognized corresponding to the region information to be recognized are obtained; the acoustic features to be recognized and the region features to be recognized are input into the target accent recognition acoustic model, which transforms the region features to be recognized to obtain transformation features to be recognized, performs speech feature extraction on the acoustic features to be recognized to obtain speech features to be recognized, merges the transformation features to be recognized and the speech features to be recognized to obtain merging features to be recognized, and performs speech phoneme recognition on the merging features to be recognized to obtain the speech phoneme information corresponding to the accent speech to be recognized. Text recognition is then performed based on the speech phoneme information to obtain the target text corresponding to the accent speech to be recognized. Because the target accent recognition acoustic model uses the region features to be recognized together with the acoustic features to be recognized when recognizing the speech phoneme information corresponding to the accent speech to be recognized, the speech phoneme information obtained by recognition is more accurate, which in turn improves the accuracy of the target text obtained by recognition.
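The whole recognition flow of steps 1602 to 1608 can be sketched end to end as follows. This is an illustrative composition under stated assumptions: the model follows the interface of the earlier encoder sketch, extract_basic_acoustic_features is the feature-extraction helper sketched earlier, REGION_TO_INDEX is a hypothetical mapping from region information to a 1-hot index, and decode_with_lexicon_and_lm stands in for the dictionary-plus-language-model decoding, which is not shown.

```python
# End-to-end sketch of steps 1602-1608 under stated assumptions.
import torch

REGION_TO_INDEX = {"region_1": 0, "region_2": 1}           # illustrative mapping

def recognize_accented_speech(model, wav_path, region_name, num_regions=10):
    region_id = REGION_TO_INDEX[region_name]
    region_onehot = torch.zeros(1, num_regions)
    region_onehot[0, region_id] = 1.0                       # region feature to be recognized
    feats = torch.tensor(extract_basic_acoustic_features(wav_path)).unsqueeze(0).float()
    with torch.no_grad():
        phoneme_scores = model(feats, region_onehot, region_id)
    return decode_with_lexicon_and_lm(phoneme_scores)       # target text (assumed decoder helper)
```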
In a specific embodiment, as shown in fig. 17, a method for training a mouth-sound recognition acoustic model is provided, which specifically includes the following steps:
step 1702, obtain initial training data, where the initial training data includes an initial training speech and a corresponding initial phoneme label, and extract an initial acoustic feature corresponding to the initial training speech.
Step 1704, inputting the initial acoustic features into an initial accent recognition acoustic model initialized by parameters, inputting the initial acoustic features into an initial feature extraction network by the initial accent recognition acoustic model for feature extraction to obtain initial voice features, and inputting the initial voice features into an initial phoneme recognition network for voice phoneme recognition to obtain initial voice phoneme information.
Step 1706, calculating initial loss information based on the initial speech phoneme information and the corresponding initial speech phoneme label, updating parameters in the initial accent recognition acoustic model based on the initial loss information, and returning to the step of inputting the initial acoustic features into the initial accent recognition acoustic model initialized by the parameters for iterative execution until the initial training is completed, thereby obtaining a trained initial accent recognition acoustic model.
And 1708, establishing a basic accent recognition acoustic model based on the trained initial accent recognition acoustic model, and acquiring basic training data from the initial training data, wherein the basic training data comprises basic training speech, basic accent region features and basic phoneme labels corresponding to the basic training speech, and extracting basic acoustic features corresponding to the basic training speech.
Step 1710, inputting the basic acoustic features and the basic accent region features into a basic accent recognition acoustic model, inputting the basic accent region features into a basic conversion network for conversion by the basic accent recognition acoustic model to obtain basic conversion features, inputting the basic acoustic features into a basic feature extraction network for voice feature extraction to obtain basic voice features, merging the basic conversion features and the basic voice features to obtain basic merging features, and inputting the basic merging features into a basic phoneme recognition network for voice phoneme recognition to obtain basic voice phoneme information.
And step 1712, calculating basic loss information based on the basic speech phoneme information and the corresponding basic phoneme label, updating the basic accent recognition acoustic model based on the basic loss information, and returning to the step of inputting the basic acoustic features and the basic accent region features into the basic accent recognition acoustic model for iterative execution until basic training is finished to obtain the trained basic accent recognition acoustic model.
And step 1714, establishing an initial accent recognition acoustic model based on the trained basic accent recognition acoustic model. And taking the basic training data as training data, wherein the training data comprises training voice, accent region characteristics corresponding to the training voice and phoneme labels, and extracting acoustic characteristics corresponding to the training voice.
Step 1716, inputting the acoustic features and the accent region features into an initial accent recognition acoustic model, transforming the accent region features by the initial accent recognition acoustic model to obtain initial transformation features, performing voice feature extraction on the acoustic features to obtain initial voice features, combining the initial transformation features and the initial voice features to obtain initial combination features, and performing voice phoneme recognition on the initial combination features to obtain initial voice phoneme information.
Step 1718: loss information is calculated based on the initial speech phoneme information and the corresponding phoneme labels, the initial accent recognition acoustic model is updated based on the loss information, and the step of inputting the acoustic features and the accent region features into the initial accent recognition acoustic model is returned to and executed iteratively until training is completed, so as to obtain the target accent recognition acoustic model.
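Likewise, the model built in steps 1714-1718 replaces the single shared phoneme recognition network with one accent phoneme recognition head per region, selected according to the accent region feature. The sketch below illustrates that idea; selecting the head by an integer region index, using a GRU as the speech phoneme feature extractor, and all dimensions are assumptions for the example rather than details taken from this application.

    import torch
    import torch.nn as nn

    class TargetAccentModel(nn.Module):
        """Illustrative stage-three model with per-region accent phoneme recognition heads."""
        def __init__(self, feat_dim=80, region_dim=10, hidden_dim=512,
                     num_phonemes=200, num_regions=10):
            super().__init__()
            self.conversion_net = nn.Linear(region_dim, hidden_dim)          # initial conversion network
            self.feature_extractor = nn.Sequential(                          # initial feature extraction network
                nn.Linear(feat_dim, hidden_dim), nn.ReLU())
            self.phoneme_feature_net = nn.GRU(2 * hidden_dim, hidden_dim,    # speech phoneme feature extractor
                                              batch_first=True)
            self.target_conversion_net = nn.Linear(region_dim, hidden_dim)   # initial target conversion network
            self.accent_heads = nn.ModuleList(                               # one head per accent region
                nn.Linear(2 * hidden_dim, num_phonemes) for _ in range(num_regions))

        def forward(self, acoustic_feats, region_feats, region_id):
            conv = self.conversion_net(region_feats).unsqueeze(1)            # initial conversion features
            speech = self.feature_extractor(acoustic_feats)                  # initial speech features
            merged = torch.cat([speech, conv.expand(-1, speech.size(1), -1)], dim=-1)
            phoneme_feats, _ = self.phoneme_feature_net(merged)              # initial speech phoneme features
            target_conv = self.target_conversion_net(region_feats).unsqueeze(1)
            target_merged = torch.cat(
                [phoneme_feats, target_conv.expand(-1, phoneme_feats.size(1), -1)], dim=-1)
            return self.accent_heads[region_id](target_merged)               # logits from the region's head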
In a specific embodiment, the target accent recognition acoustic model is tested, specifically:
Chinese dialect data is used to carry out a comparative test on each accent recognition acoustic model trained in the present application. China is divided into ten regions, and a region feature is set for each region; the division of the region features and the training data and test data used are shown in Table 1 below.
Table 1 Data set table
(Table 1 is provided only as figures in the original publication; it lists, for each of the ten regions, the corresponding region feature and the training and test data used.)
Each region has a corresponding region feature, and a specific correspondence between region features and regions may be as shown in fig. 18. The trained models are then tested with the test data, and the word error rate of the recognized dialect speech is calculated. The word error rate comparison is shown in Table 2 below; the smaller the word error rate, the higher the recognition accuracy of the model and the better the recognition effect.
Table 2 Word error rate comparison of test results

Region     A0      A1      A2
1          4.93    4.73    4.63
2          6.41    5.96    5.60
3          5.23    4.93    4.80
4          4.64    3.96    3.89
5          4.98    4.91    4.86
6          6.07    6.05    5.60
7          6.21    6.03    5.71
8          6.61    6.69    6.56
9          4.01    3.79    3.73
10         5.75    5.73    5.87
Overall    5.37    5.14    5.02
Here, A0 denotes the trained initial accent recognition acoustic model, A1 denotes the trained basic accent recognition acoustic model, and A2 denotes the target accent recognition acoustic model. The recognition effect of the A1 and A2 models is clearly improved over that of the A0 model; that is, the accuracy of the A1 and A2 models is noticeably higher, and the word error rates obtained are noticeably lower.
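For reference, the word error rate reported in Tables 2 and 3 is conventionally computed as the edit distance between the recognized text and the reference text divided by the reference length. A generic Python sketch of that computation (not the scoring tool actually used for these tests) is:

    def word_error_rate(reference, hypothesis):
        """Edit distance (substitutions + deletions + insertions) / reference length, in percent."""
        ref, hyp = list(reference), list(hypothesis)   # character-level units, common for Chinese text
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution
        return 100.0 * d[len(ref)][len(hyp)] / max(len(ref), 1)

    # One substituted character in a four-character reference gives 25.0
    print(word_error_rate("今天天气", "今天天汽"))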
Further, the A1 model is retrained with an additional 270 hours of data from region 1, and when this training is completed a new model A1+ is obtained. The A2 model is likewise retrained with the same additional 270 hours of region 1 data, and when that training is completed a new model A2+ is obtained. The A1+ and A2+ models are then tested with the test data, and the word error rates of the test results are shown in Table 3 below. The smaller the word error rate, the higher the recognition accuracy of the model and the better the recognition effect.
Table 3 Word error rate comparison of test results

Region     A1      A1+     A2      A2+
1          4.73    4.59    4.63    4.45
2          5.96    6.03    5.60    5.60
3          4.93    5.15    4.80    4.80
4          3.96    4.05    3.89    3.89
5          4.91    4.90    4.86    4.86
6          6.05    5.89    5.60    5.60
7          6.03    6.14    5.71    5.71
8          6.69    7.30    6.56    6.56
9          3.79    3.76    3.73    3.73
10         5.73    5.74    5.87    5.87
Overall    5.14    5.20    5.02    4.99
It is clear that, compared with the A2 model, the A2+ model improves the recognition accuracy for dialect speech of region 1 without affecting the recognition of dialect speech from the other regions. In other words, the A2 model can be optimized conveniently and flexibly for speech of a specified accent region type without causing any loss in accent recognition for the other region types, which improves the stability and flexibility of recognition.
The present application also provides an application scenario in which the above accent recognition method is applied. Specifically, the accent recognition method is applied in this application scenario as follows:
Fig. 19 is a schematic view of a specific application scenario of the accent recognition method. The target accent recognition acoustic model is obtained through pre-training and deployed to a cloud server, and a language model is likewise trained in advance and deployed to the cloud server. Specifically, the method includes the following steps:
Speech to be recognized is collected through a microphone array and processed by an acoustic front-end algorithm, for example noise suppression, dereverberation, echo cancellation and sound source localization, to obtain processed speech to be recognized. The processed speech to be recognized is sent to the cloud server and recognized by a cloud recognition algorithm, for example using the target accent recognition acoustic model and the language model, to obtain a cloud recognition result corresponding to the speech to be recognized; the processed speech to be recognized is also recognized offline by an offline recognition algorithm. Finally, the cloud recognition result and the offline recognition result are fused with offline and cloud semantic information by a fusion algorithm to obtain the recognition result corresponding to the speech to be recognized, and the recognition result is displayed.
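This deployment flow can be summarized, at a very high level, with the following sketch. Every function here is a placeholder stub standing in for a component named above (front-end processing, cloud recognition, offline recognition, fusion); none of these names correspond to a real API.

    def acoustic_front_end(audio):
        # Stand-in for noise suppression, dereverberation, echo cancellation and
        # sound source localization applied to the collected speech.
        return audio

    def cloud_recognize(audio, region_feature):
        # Stand-in for the cloud recognition algorithm using the target accent
        # recognition acoustic model and the language model.
        return {"text": "cloud hypothesis", "score": 0.9}

    def offline_recognize(audio):
        # Stand-in for the local (offline) recognition algorithm.
        return {"text": "offline hypothesis", "score": 0.6}

    def fuse(cloud_result, offline_result):
        # Stand-in for the fusion algorithm combining both results with
        # offline and cloud semantic information.
        return cloud_result if cloud_result["score"] >= offline_result["score"] else offline_result

    def recognize(raw_audio, region_feature):
        processed = acoustic_front_end(raw_audio)
        return fuse(cloud_recognize(processed, region_feature), offline_recognize(processed))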
It should be understood that, although the steps in the flowcharts of figs. 2-17 are shown in the sequence indicated by the arrows, they are not necessarily performed in that sequence. Unless explicitly stated otherwise, the execution of these steps is not strictly limited to the order shown, and they may be performed in other orders. Moreover, at least some of the steps in figs. 2-17 may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different moments, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 20, an accent recognition acoustic model training apparatus 2000 is provided, which may be a part of a computer device implemented as software modules, hardware modules, or a combination of the two, and specifically includes: a data acquisition module 2002, a feature extraction module 2004, a model training module 2006, and a loop iteration module 2008, wherein:
a data obtaining module 2002, configured to obtain training data, where the training data includes training voices, accent region features corresponding to the training voices, and phoneme labels;
a feature extraction module 2004, configured to extract acoustic features corresponding to the training speech;
the model training module 2006 is configured to input the acoustic features and the accent region features into an initial accent recognition acoustic model, where the initial accent recognition acoustic model transforms the accent region features to obtain initial transformation features, performs speech feature extraction on the acoustic features to obtain initial speech features, merges the initial transformation features and the initial speech features to obtain initial merging features, and performs speech phoneme recognition on the initial merging features to obtain initial speech phoneme information;
and a loop iteration module 2008, configured to calculate loss information based on the initial speech phoneme information and the corresponding phoneme label, update the initial accent recognition acoustic model based on the loss information, and return to the step of inputting the acoustic features and the accent region features into the initial accent recognition acoustic model for iteration execution until training is completed, so as to obtain the target accent recognition acoustic model.
In one embodiment, the initial accent recognition acoustic model comprises an initial conversion network, an initial feature extraction network and an initial phoneme recognition network; the model training module 2006 includes:
the conversion unit is used for inputting the characteristics of the accent area into an initial conversion network for conversion to obtain initial conversion characteristics;
the voice feature extraction unit is used for inputting the acoustic features into an initial feature extraction network to extract the voice features so as to obtain initial voice features;
the initial merging unit is used for merging the initial conversion characteristic and the initial voice characteristic to obtain an initial merging characteristic;
and the recognition unit is used for inputting the initial merging characteristics into an initial phoneme recognition network for speech phoneme recognition to obtain initial speech phoneme information.
In an embodiment, the speech feature extraction unit is further configured to transform the acoustic features through spectral augmentation to obtain acoustic enhancement features, and to input the acoustic enhancement features into the initial feature extraction network for speech feature extraction to obtain the initial speech features.
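The spectral-augmentation deformation mentioned here can be illustrated with SpecAugment-style time and frequency masking. In the sketch below the mask counts and widths are arbitrary illustrative values, not parameters taken from this application.

    import torch

    def spec_augment(features, num_freq_masks=2, max_freq_width=10,
                     num_time_masks=2, max_time_width=20):
        """features: (frames, freq_bins) acoustic features; returns a masked copy."""
        feats = features.clone()
        frames, bins = feats.shape
        for _ in range(num_freq_masks):                       # mask random frequency bands
            width = int(torch.randint(0, max_freq_width + 1, (1,)))
            start = int(torch.randint(0, max(bins - width, 1), (1,)))
            feats[:, start:start + width] = 0.0
        for _ in range(num_time_masks):                       # mask random time spans
            width = int(torch.randint(0, max_time_width + 1, (1,)))
            start = int(torch.randint(0, max(frames - width, 1), (1,)))
            feats[start:start + width, :] = 0.0
        return feats                                          # acoustic enhancement features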
In one embodiment, the initial phoneme recognition network comprises an initial speech phoneme feature extraction network, an initial target conversion network and an initial accent phoneme recognition network corresponding to at least two different accent region features;
an identification unit comprising:
a phoneme feature extraction subunit, configured to input the initial merging feature into an initial speech phoneme feature extraction network to perform speech phoneme feature extraction, so as to obtain an initial speech phoneme feature;
the target conversion subunit is used for inputting the characteristics of the accent regions into an initial target conversion network for conversion to obtain initial target conversion characteristics;
and the phoneme recognition subunit is used for merging the initial voice phoneme characteristics and the initial target conversion characteristics to obtain target merging characteristics, and inputting the target merging characteristics into an initial accent phoneme recognition network corresponding to the accent region characteristics to perform phoneme recognition to obtain initial voice phoneme information.
In one embodiment, the initial speech phoneme feature extraction network comprises at least one initial time-delay neural network, at least one initial gating cycle network and at least one initial intermediate conversion network, wherein the initial time-delay neural network and the initial gating cycle network are of an alternating network structure;
the phoneme feature extraction subunit is also used for inputting the initial merging feature into an initial time delay neural network for calculation to obtain an initial time delay feature; inputting the characteristics of the accent area into an initial intermediate conversion network for conversion to obtain initial intermediate conversion characteristics; and combining the initial time delay characteristic and the initial intermediate conversion characteristic to obtain an initial intermediate combination characteristic, and inputting the initial intermediate combination characteristic into an initial gating circulation network for calculation to obtain an initial voice phoneme characteristic.
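One way to realize the alternating time-delay / gated-recurrent structure with an intermediate conversion of the accent region feature is sketched below, using a dilated 1-D convolution as the time-delay layer; only one TDNN layer and one GRU layer are shown, and every dimension is an assumption made for the example.

    import torch
    import torch.nn as nn

    class PhonemeFeatureExtractor(nn.Module):
        """Illustrative alternating TDNN/GRU phoneme feature extractor."""
        def __init__(self, in_dim=1024, hidden_dim=512, region_dim=10):
            super().__init__()
            # time-delay layer: dilated 1-D convolution over frames (length-preserving)
            self.tdnn = nn.Conv1d(in_dim, hidden_dim, kernel_size=3, dilation=2, padding=2)
            # intermediate conversion network for the accent region feature
            self.mid_conversion = nn.Linear(region_dim, hidden_dim)
            # gated recurrent layer following the time-delay layer
            self.gru = nn.GRU(2 * hidden_dim, hidden_dim, batch_first=True)

        def forward(self, merged_feats, region_feats):
            # merged_feats: (batch, frames, in_dim); region_feats: (batch, region_dim)
            delay = self.tdnn(merged_feats.transpose(1, 2)).transpose(1, 2)   # initial time-delay features
            mid = self.mid_conversion(region_feats).unsqueeze(1)              # initial intermediate conversion features
            mid = mid.expand(-1, delay.size(1), -1)
            mid_merged = torch.cat([delay, mid], dim=-1)                      # initial intermediate merged features
            phoneme_feats, _ = self.gru(mid_merged)                           # initial speech phoneme features
            return phoneme_feats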
In one embodiment, the loop iteration module 2008 is further configured to calculate the loss information for the initial speech phoneme information and the corresponding phoneme labels using a maximum mutual information optimization function and, when the loss information does not meet the training completion condition, to back-propagate the loss information to update the parameters of the initial accent recognition acoustic model and obtain an updated accent recognition acoustic model; the updated accent recognition acoustic model is taken as the initial accent recognition acoustic model, the step of inputting the acoustic features and the accent region features into the initial accent recognition acoustic model is executed iteratively until the loss information meets the training completion condition, and the initial accent recognition acoustic model meeting the training completion condition is taken as the target accent recognition acoustic model.
In one embodiment, the accent recognition acoustic model training apparatus 2000 further comprises:
the basic training data acquisition module is used for acquiring basic training data, and the basic training data comprises basic training voices, basic accent region characteristics corresponding to the basic training voices and basic phoneme labels;
the basic feature extraction module is used for extracting basic acoustic features corresponding to basic training voices;
the basic model training module is used for inputting basic acoustic features and basic accent region features into a basic accent recognition acoustic model, inputting the basic accent region features into a basic conversion network for conversion by the basic accent recognition acoustic model to obtain basic conversion features, inputting the basic acoustic features into a basic feature extraction network for voice feature extraction to obtain basic voice features, merging the basic conversion features and the basic voice features to obtain basic merging features, and inputting the basic merging features into a basic phoneme recognition network for voice phoneme recognition to obtain basic voice phoneme information;
the basic circulation module is used for calculating basic loss information based on the basic speech phoneme information and the corresponding basic phoneme label, updating the basic accent recognition acoustic model based on the basic loss information, returning to the step of inputting the basic acoustic characteristics and the basic accent region characteristics into the basic accent recognition acoustic model for iterative execution, and obtaining a trained basic accent recognition acoustic model until basic training is completed;
and the initial model establishing module is used for establishing an initial accent recognition acoustic model based on the trained basic accent recognition acoustic model.
In one embodiment, the basic phoneme recognition network comprises a basic speech phoneme feature extraction network, a basic target conversion network and a basic accent phoneme recognition network; the basic model training module is also used for inputting the basic merging characteristics into a basic voice phoneme characteristic extraction network to extract the voice phoneme characteristics so as to obtain basic voice phoneme characteristics; inputting the basic accent region characteristics into a basic target conversion network for conversion to obtain basic target conversion characteristics; and merging the basic speech phoneme characteristics and the basic target conversion characteristics to obtain basic target merging characteristics, and inputting the basic target merging characteristics into a basic accent phoneme recognition network for phoneme recognition to obtain basic speech phoneme information.
In one embodiment, the initial model building module is further configured to use a basic conversion network in the trained basic accent recognition acoustic model as an initial conversion network in the initial accent recognition acoustic model; taking a basic feature extraction network in the trained basic accent recognition acoustic model as an initial feature extraction network in the initial accent recognition acoustic model; taking a basic speech phoneme feature extraction network in the trained basic accent recognition acoustic model as an initial speech phoneme feature extraction network in the initial accent recognition acoustic model; taking a basic target conversion network in the trained basic accent recognition acoustic model as an initial target conversion network in the initial accent recognition acoustic model; and establishing an initial accent phoneme recognition network corresponding to at least two different accent regional characteristics to obtain an initial accent recognition acoustic model.
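The knowledge transfer described by this module amounts to copying the trained shared subnetworks of the basic model into the stage-three model while leaving the newly created per-region accent phoneme recognition heads randomly initialized. A minimal sketch, assuming both models expose the same (hypothetical) attribute names for these subnetworks:

    def build_initial_from_basic(basic_model, initial_model):
        """Copy trained shared subnetworks; keep the fresh per-region accent heads untouched."""
        shared = ["conversion_net", "feature_extractor",
                  "phoneme_feature_net", "target_conversion_net"]   # hypothetical attribute names
        for name in shared:
            if hasattr(basic_model, name) and hasattr(initial_model, name):
                getattr(initial_model, name).load_state_dict(
                    getattr(basic_model, name).state_dict())
        return initial_model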
In one embodiment, the basic speech phoneme feature extraction network comprises at least one basic time-delay neural network, at least one basic gating cycle network and at least one basic intermediate conversion network, wherein the basic time-delay neural network and the basic gating cycle network are of an alternating network structure; the basic model training module is also used for inputting the basic merging characteristics into a basic time delay neural network for calculation to obtain basic time delay characteristics; inputting the basic accent region characteristics into a basic intermediate conversion network for conversion to obtain basic intermediate conversion characteristics; and combining the basic time delay characteristic and the basic intermediate conversion characteristic to obtain a basic intermediate combination characteristic, and inputting the basic intermediate combination characteristic into a basic gating cycle network for calculation to obtain a basic voice phoneme characteristic.
In one embodiment, the accent recognition acoustic model training apparatus 2000 further comprises:
the initial training data acquisition module is used for acquiring initial training data, and the initial training data comprises initial training voice and corresponding initial phoneme labels;
the initial model training module is used for extracting initial acoustic features corresponding to initial training voice, inputting the initial acoustic features into an initial accent recognition acoustic model initialized by parameters, inputting the initial acoustic features into an initial feature extraction network by the initial accent recognition acoustic model for feature extraction to obtain initial voice features, and inputting the initial voice features into an initial phoneme recognition network for voice phoneme recognition to obtain initial voice phoneme information;
the initial model circulation module is used for calculating initial loss information based on the initial voice phoneme information and the corresponding initial phoneme label, updating parameters in the initial accent recognition acoustic model based on the initial loss information, and returning to the step of inputting the initial acoustic features into the initial accent recognition acoustic model initialized by the parameters for iterative execution until the initial training is finished to obtain the trained initial accent recognition acoustic model;
and the basic model establishing module is used for establishing a basic accent recognition acoustic model based on the trained initial accent recognition acoustic model.
In one embodiment, the initial phoneme recognition network includes an initial speech phoneme feature extraction network and an initial accent phoneme recognition network; the initial model training module is further used for inputting the initial voice features into an initial voice phoneme feature extraction network for voice phoneme feature extraction to obtain initial voice phoneme features, and inputting the initial voice phoneme features into an initial accent phoneme recognition network for phoneme recognition to obtain initial voice phoneme information.
In one embodiment, the initial speech phoneme feature extraction network comprises at least one initial delay neural network and at least one initial gating cycle network, and the initial delay neural network and the initial gating cycle network are of an alternating network structure; the initial model training module is further used for inputting the initial voice features into an initial time delay neural network for calculation to obtain initial time delay features, and inputting the initial time delay features into an initial gating circulation network for calculation to obtain initial voice phoneme features.
In an embodiment, the basic model establishing module is further configured to use the trained initial feature extraction network as a basic feature extraction network, use the trained initial delay neural network as a basic delay neural network, use the trained initial gating cycle network as a basic gating cycle network, use the trained initial accent phoneme recognition network as a basic accent phoneme recognition network, and establish a parameter-initialized transformation network to obtain the basic accent recognition acoustic model.
In one embodiment, the accent recognition acoustic model training apparatus 2000 further comprises:
the target area optimization module is used for acquiring target accent data corresponding to a target area, and the target accent data comprises target area voice and corresponding target area voice phoneme labels; acquiring target area characteristics corresponding to a target area and extracting target area voice acoustic characteristics corresponding to target area voice; inputting the target region voice acoustic features and the target region features into a target accent recognition acoustic model, converting the target region features by the target accent recognition acoustic model to obtain target region conversion features, extracting the voice features based on the target region acoustic features to obtain target region voice features, combining the target region conversion features and the target region voice features to obtain target region combination features, and performing voice phoneme recognition based on the target region combination features to obtain target region voice phoneme information; and calculating target region voice loss information based on the target region voice phoneme information and the corresponding target region voice phoneme label, updating a phoneme recognition network corresponding to the target region in the target accent recognition acoustic model based on the target region voice loss information, and returning to the step of inputting the target region voice acoustic characteristics and the target region characteristics into the target accent recognition acoustic model for iterative execution until the target training is completed to obtain the optimized accent recognition acoustic model.
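Because each accent region has its own phoneme recognition head, the target-region optimization described here only needs to update that one head while the shared networks stay frozen, which is why recognition for the other regions is unaffected. A sketch of that selective fine-tuning, reusing the hypothetical TargetAccentModel structure sketched earlier (loss and learning rate are illustrative assumptions):

    import torch
    import torch.nn as nn

    def optimize_for_region(model, region_loader, region_id, region_feature, epochs=5, lr=1e-4):
        """region_feature: 1-D feature (e.g. one-hot) of the target accent region."""
        for p in model.parameters():                          # freeze every parameter ...
            p.requires_grad = False
        for p in model.accent_heads[region_id].parameters():  # ... except the target region's head
            p.requires_grad = True

        criterion = nn.CrossEntropyLoss()                     # stand-in loss
        optimizer = torch.optim.Adam(model.accent_heads[region_id].parameters(), lr=lr)
        for _ in range(epochs):
            for acoustic_feats, phoneme_labels in region_loader:
                region_feats = region_feature.expand(acoustic_feats.size(0), -1)
                logits = model(acoustic_feats, region_feats, region_id)
                loss = criterion(logits.flatten(0, 1), phoneme_labels.flatten())
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
        return model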
In one embodiment, as shown in fig. 21, an accent recognition apparatus 2100 is provided, which may be a part of a computer device implemented as software modules, hardware modules, or a combination of the two, and specifically includes: a to-be-recognized speech acquisition module 2102, a to-be-recognized feature extraction module 2104, a model recognition module 2106, and a text obtaining module 2108, wherein:
a to-be-recognized voice acquiring module 2102, configured to acquire an accent voice to be recognized and corresponding to-be-recognized area information;
the to-be-recognized feature extraction module 2104 is used for extracting to-be-recognized acoustic features corresponding to the accent voice to be recognized and acquiring to-be-recognized region features corresponding to the to-be-recognized region information;
the model recognition module 2106 is configured to input the acoustic feature to be recognized and the feature of the region to be recognized into the target accent recognition acoustic model, where the target accent recognition acoustic model transforms the feature of the region to be recognized to obtain a transformed feature to be recognized, performs speech feature extraction on the acoustic feature to be recognized to obtain a speech feature to be recognized, merges the transformed feature to be recognized and the speech feature to be recognized to obtain a merged feature to be recognized, and performs speech phoneme recognition on the merged feature to be recognized to obtain speech phoneme information corresponding to the accent speech to be recognized;
a text obtaining module 2108, configured to perform text recognition based on the speech phoneme information to obtain a target text corresponding to the accent speech to be recognized.
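At inference time these modules cooperate roughly as in the following sketch, again reusing the hypothetical TargetAccentModel from above; the greedy frame-level argmax stands in for the speech phoneme recognition step, and mapping the phoneme sequence to the target text would in practice be done by a language model or decoder as described in the method.

    import torch

    def recognize_accented_speech(model, acoustic_feats, region_feats, region_id, id_to_phoneme):
        """acoustic_feats: (1, frames, feat_dim); region_feats: (1, region_dim) region feature."""
        model.eval()
        with torch.no_grad():
            logits = model(acoustic_feats, region_feats, region_id)   # (1, frames, num_phonemes)
            phoneme_ids = logits.argmax(dim=-1).squeeze(0).tolist()   # frame-level phoneme decisions
        phonemes = [id_to_phoneme[i] for i in phoneme_ids]
        # A language model / decoder would convert the phoneme sequence into the target text;
        # here the phoneme sequence itself is returned.
        return phonemes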
For specific limitations of the accent recognition acoustic model training apparatus and the accent recognition apparatus, reference may be made to the above limitations of the accent recognition acoustic model training method and the accent recognition method, which are not repeated here. Each module in the above accent recognition acoustic model training apparatus and accent recognition apparatus may be implemented wholly or partially by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor of the computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can invoke and execute the operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 22. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used to store various training data. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement an accent recognition acoustic model training method and an accent recognition method.
Those skilled in the art will appreciate that the architecture shown in fig. 22 is merely a block diagram of part of the structure related to the present solution and does not limit the computer devices to which the present solution is applied; a particular computer device may include more or fewer components than those shown, combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is further provided, which includes a memory and a processor, the memory stores a computer program, and the processor implements the steps of the above method embodiments when executing the computer program.
In an embodiment, a computer-readable storage medium is provided, in which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method embodiments.
In one embodiment, a computer program product or computer program is provided that includes computer instructions stored in a computer-readable storage medium. The computer instructions are read by a processor of a computer device from a computer-readable storage medium, and the computer instructions are executed by the processor to cause the computer device to perform the steps in the above-mentioned method embodiments.
It will be understood by those skilled in the art that all or part of the processes of the above method embodiments may be implemented by instructing relevant hardware through a computer program, which may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the above method embodiments. Any reference to memory, storage, database or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. Non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical storage, or the like. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM may take many forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM).
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; nevertheless, any such combination should be considered within the scope of this specification as long as it contains no contradiction.
The above embodiments express only several implementations of the present application, and although they are described in relative detail, they should not be construed as limiting the scope of the patent. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (15)

1. A method for training an acoustic model for accent recognition, the method comprising:
acquiring training data, wherein the training data comprises training voice, accent region characteristics corresponding to the training voice and phoneme labels;
extracting acoustic features corresponding to the training voice;
inputting the acoustic features and the accent region features into an initial accent recognition acoustic model, transforming the accent region features by the initial accent recognition acoustic model to obtain initial transformation features, performing voice feature extraction on the acoustic features to obtain initial voice features, merging the initial transformation features and the initial voice features to obtain initial merging features, and performing voice phoneme recognition on the initial merging features to obtain initial voice phoneme information;
and calculating loss information based on the initial voice phoneme information and the corresponding phoneme label, updating the initial accent recognition acoustic model based on the loss information, and returning to the step of inputting the acoustic features and the accent region features into the initial accent recognition acoustic model for iterative execution until training is completed to obtain a target accent recognition acoustic model.
2. The method of claim 1, wherein the initial accent recognition acoustic model comprises an initial conversion network, an initial feature extraction network and an initial phoneme recognition network;
inputting the acoustic features and the accent region features into an initial accent recognition acoustic model, comprising:
inputting the characteristics of the accent regions into the initial conversion network for conversion to obtain the initial conversion characteristics;
inputting the acoustic features into the initial feature extraction network for voice feature extraction to obtain the initial voice features;
merging the initial conversion feature and the initial voice feature to obtain an initial merging feature;
and inputting the initial merging characteristics into the initial phoneme recognition network for voice phoneme recognition to obtain the initial voice phoneme information.
3. The method of claim 2, wherein the initial phoneme recognition network comprises an initial speech phoneme feature extraction network, an initial target conversion network, and an initial accent phoneme recognition network corresponding to at least two different accent region features;
the inputting the initial merging feature into the initial phoneme recognition network for speech phoneme recognition to obtain the initial speech phoneme information includes:
inputting the initial merging features into the initial voice phoneme feature extraction network for voice phoneme feature extraction to obtain initial voice phoneme features;
inputting the accent region characteristics into the initial target conversion network for conversion to obtain the initial target conversion characteristics;
and merging the initial voice phoneme characteristics and the initial target conversion characteristics to obtain target merging characteristics, and inputting the target merging characteristics into an initial accent phoneme recognition network corresponding to the accent region characteristics to perform phoneme recognition to obtain initial voice phoneme information.
4. The method of claim 3, wherein the initial phonetic phoneme feature extraction network comprises at least one initial time-delay neural network, at least one initial gated round-robin network and at least one initial intermediate conversion network, and wherein the initial time-delay neural network and the initial gated round-robin network are of an alternating network structure;
the inputting the initial merging feature into the initial speech phoneme feature extraction network for speech phoneme feature extraction to obtain an initial speech phoneme feature includes:
inputting the initial merging characteristics into the initial time delay neural network for calculation to obtain initial time delay characteristics;
inputting the characteristics of the accent regions into the initial intermediate conversion network for conversion to obtain initial intermediate conversion characteristics;
and combining the initial time delay characteristic and the initial intermediate conversion characteristic to obtain an initial intermediate combination characteristic, and inputting the initial intermediate combination characteristic into the initial gating circulation network for calculation to obtain the initial voice phoneme characteristic.
5. The method according to any one of claims 1 to 4, wherein the calculating loss information based on the initial speech phoneme information and the corresponding phoneme label, updating the initial accent recognition acoustic model based on the loss information, and returning to the step of inputting the acoustic features and the accent region features into the initial accent recognition acoustic model for iterative execution until training is completed to obtain a target accent recognition acoustic model comprises:
calculating loss information by using a maximum mutual information optimization function for the initial voice phoneme information and the corresponding phoneme label, and reversely updating parameters of the initial accent recognition acoustic model based on the loss information to obtain an updated accent recognition acoustic model when the loss information does not meet training completion conditions;
and taking the updated acoustic model for accent recognition as an initial accent recognition acoustic model, and iteratively executing the step of inputting the acoustic characteristics and the accent region characteristics into the initial accent recognition acoustic model until loss information meets training completion conditions, and taking the initial accent recognition acoustic model meeting the training completion conditions as a target accent recognition acoustic model.
6. The method of claim 1, further comprising, prior to said obtaining training data:
acquiring basic training data, wherein the basic training data comprises basic training voice, basic accent region characteristics corresponding to the basic training voice and basic phoneme labels;
extracting basic acoustic features corresponding to the basic training speech;
inputting the basic acoustic features and the basic accent region features into a basic accent recognition acoustic model, inputting the basic accent region features into a basic conversion network for conversion by the basic accent recognition acoustic model to obtain basic conversion features, inputting the basic acoustic features into a basic feature extraction network for voice feature extraction to obtain basic voice features, merging the basic conversion features and the basic voice features to obtain basic merging features, and inputting the basic merging features into a basic phoneme recognition network for voice phoneme recognition to obtain basic voice phoneme information;
calculating basic loss information based on the basic speech phoneme information and the corresponding basic phoneme label, updating the basic accent recognition acoustic model based on the basic loss information, and returning to the step of inputting the basic acoustic features and the basic accent region features into the basic accent recognition acoustic model for iterative execution until basic training is completed to obtain a trained basic accent recognition acoustic model;
and establishing the initial accent recognition acoustic model based on the trained basic accent recognition acoustic model.
7. The method of claim 6 wherein said basic phoneme recognition network comprises a basic speech phoneme feature extraction network, a basic target conversion network and a basic accent phoneme recognition network;
the inputting the basic merging features into a basic phoneme recognition network for speech phoneme recognition to obtain basic speech phoneme information includes:
inputting the basic merging features into the basic speech phoneme feature extraction network for speech phoneme feature extraction to obtain basic speech phoneme features;
inputting the basic accent region characteristics into the basic target conversion network for conversion to obtain the basic target conversion characteristics;
and combining the basic speech phoneme characteristics and the basic target conversion characteristics to obtain basic target combination characteristics, and inputting the basic target combination characteristics into the basic accent phoneme recognition network for phoneme recognition to obtain the basic speech phoneme information.
8. The method of claim 6, wherein the building the initial accent recognition acoustic model based on the trained base accent recognition acoustic model comprises:
taking a basic conversion network in the trained basic accent recognition acoustic model as an initial conversion network in the initial accent recognition acoustic model;
taking a basic feature extraction network in the trained basic accent recognition acoustic model as an initial feature extraction network in the initial accent recognition acoustic model;
taking a basic speech phoneme feature extraction network in the trained basic accent recognition acoustic model as an initial speech phoneme feature extraction network in the initial accent recognition acoustic model;
taking a basic target conversion network in the trained basic accent recognition acoustic model as an initial target conversion network in the initial accent recognition acoustic model;
and establishing an initial accent phoneme recognition network corresponding to at least two different accent regional characteristics to obtain the initial accent recognition acoustic model.
9. The method of claim 7, wherein the basic speech phoneme feature extraction network comprises at least one basic time-delay neural network, at least one basic gating cyclic network and at least one basic intermediate conversion network, and wherein the basic time-delay neural network and the basic gating cyclic network are of an alternating network structure;
inputting the basic merging features into the basic speech phoneme feature extraction network for speech phoneme feature extraction to obtain basic speech phoneme features, wherein the basic speech phoneme features comprise:
inputting the basic merging characteristics into the basic time delay neural network for calculation to obtain basic time delay characteristics;
inputting the basic accent region characteristics into the basic intermediate conversion network for conversion to obtain basic intermediate conversion characteristics;
and combining the basic time delay characteristic and the basic intermediate conversion characteristic to obtain a basic intermediate combination characteristic, and inputting the basic intermediate combination characteristic into the basic gating cycle network for calculation to obtain the basic speech phoneme characteristic.
10. The method of claim 6, further comprising, prior to said obtaining base training data:
acquiring initial training data, wherein the initial training data comprises initial training voice and a corresponding initial phoneme label;
extracting initial acoustic features corresponding to the initial training voice, inputting the initial acoustic features into an initial accent recognition acoustic model initialized by the parameters, inputting the initial acoustic features into an initial feature extraction network by the initial accent recognition acoustic model for feature extraction to obtain initial voice features, and inputting the initial voice features into an initial phoneme recognition network for voice phoneme recognition to obtain initial voice phoneme information;
calculating initial loss information based on the initial voice phoneme information and the corresponding initial voice phoneme label, updating parameters in the initial accent recognition acoustic model based on the initial loss information, and returning to the step of inputting the initial acoustic features into the initial accent recognition acoustic model initialized by the parameters for iterative execution until the initial training is finished, so as to obtain a trained initial accent recognition acoustic model;
and establishing the basic accent recognition acoustic model based on the trained initial accent recognition acoustic model.
11. An accent recognition method, the method comprising:
acquiring accent voice to be recognized and corresponding information of a region to be recognized;
extracting acoustic features to be recognized corresponding to the accent voice to be recognized and acquiring the features of the area to be recognized corresponding to the area information to be recognized;
inputting the acoustic feature to be recognized and the regional feature to be recognized into a target accent recognition acoustic model, transforming the regional feature to be recognized by the target accent recognition acoustic model to obtain a transformed feature to be recognized, extracting the voice feature of the acoustic feature to be recognized to obtain a voice feature to be recognized, combining the transformed feature to be recognized and the voice feature to be recognized to obtain a combined feature to be recognized, and performing voice phoneme recognition on the combined feature to be recognized to obtain voice phoneme information corresponding to the accent voice to be recognized;
and performing text recognition based on the accent phoneme information to obtain a target text corresponding to the accent voice to be recognized.
12. An acoustic model training apparatus for accent recognition, the apparatus comprising:
the data acquisition module is used for acquiring training data, wherein the training data comprises training voice, accent region characteristics corresponding to the training voice and phoneme labels;
the feature extraction module is used for extracting acoustic features corresponding to the training voice;
the model training module is used for inputting the acoustic features and the accent region features into an initial accent recognition acoustic model, the initial accent recognition acoustic model transforms the accent region features to obtain initial transformation features, voice feature extraction is carried out on the acoustic features to obtain initial voice features, the initial transformation features and the initial voice features are combined to obtain initial combination features, and voice phoneme recognition is carried out on the initial combination features to obtain initial voice phoneme information;
and the loop iteration module is used for calculating loss information based on the initial voice phoneme information and the corresponding phoneme label, updating the initial accent recognition acoustic model based on the loss information, and returning to the step of inputting the acoustic features and the accent region features into the initial accent recognition acoustic model for iteration execution until training is finished to obtain the target accent recognition acoustic model.
13. An accent recognition apparatus, the apparatus comprising:
the voice to be recognized acquisition module is used for acquiring accent voice to be recognized and corresponding information of an area to be recognized;
the to-be-recognized feature extraction module is used for extracting to-be-recognized acoustic features corresponding to the to-be-recognized accent voice and acquiring to-be-recognized region features corresponding to the to-be-recognized region information;
the model recognition module is used for inputting the acoustic features to be recognized and the regional features to be recognized into a target accent recognition acoustic model, the target accent recognition acoustic model transforms the regional features to be recognized to obtain transformed features to be recognized, voice feature extraction is carried out on the acoustic features to be recognized to obtain voice features to be recognized, the transformed features to be recognized and the voice features to be recognized are combined to obtain combined features to be recognized, and voice phoneme recognition is carried out on the combined features to be recognized to obtain voice phoneme information corresponding to the accent voice to be recognized;
and the text obtaining module is used for carrying out text recognition based on the accent phoneme information to obtain a target text corresponding to the accent voice to be recognized.
14. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor realizes the steps of the method of any one of claims 1 to 11 when executing the computer program.
15. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 11.
CN202110104567.3A 2021-01-26 2021-01-26 Method and device for training acoustic model for accent recognition, and storage medium Pending CN113593524A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110104567.3A CN113593524A (en) 2021-01-26 2021-01-26 Method and device for training acoustic model for accent recognition, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110104567.3A CN113593524A (en) 2021-01-26 2021-01-26 Method and device for training acoustic model for accent recognition, and storage medium

Publications (1)

Publication Number Publication Date
CN113593524A true CN113593524A (en) 2021-11-02

Family

ID=78238126

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110104567.3A Pending CN113593524A (en) 2021-01-26 2021-01-26 Method and device for training acoustic model for accent recognition, and storage medium

Country Status (1)

Country Link
CN (1) CN113593524A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB201405253D0 (en) * 2014-03-24 2014-05-07 Toshiba Res Europ Ltd Speech synthesis
CN110930982A (en) * 2019-10-31 2020-03-27 国家计算机网络与信息安全管理中心 Multi-accent acoustic model and multi-accent voice recognition method
US20200327883A1 (en) * 2019-04-15 2020-10-15 Beijing Baidu Netcom Science And Techology Co., Ltd. Modeling method for speech recognition, apparatus and device

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB201405253D0 (en) * 2014-03-24 2014-05-07 Toshiba Res Europ Ltd Speech synthesis
US20200327883A1 (en) * 2019-04-15 2020-10-15 Beijing Baidu Netcom Science And Techology Co., Ltd. Modeling method for speech recognition, apparatus and device
CN110930982A (en) * 2019-10-31 2020-03-27 国家计算机网络与信息安全管理中心 Multi-accent acoustic model and multi-accent voice recognition method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
SONGJUN CAO et al.: "Improving speech recognition accuracy of local poi using geographical models", 2021 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP, 22 January 2021 (2021-01-22), pages 180 - 184 *
XU Wenying: "Research and implementation of multi-accent Mandarin speech recognition based on neural networks", China Master's Theses Full-text Database, 15 May 2019 (2019-05-15), page 33 *

Similar Documents

Publication Publication Date Title
US10176811B2 (en) Neural network-based voiceprint information extraction method and apparatus
US10249294B2 (en) Speech recognition system and method
US10074363B2 (en) Method and apparatus for keyword speech recognition
Ghai et al. Literature review on automatic speech recognition
CN107093422B (en) Voice recognition method and voice recognition system
CN113707125B (en) Training method and device for multi-language speech synthesis model
US11450320B2 (en) Dialogue system, dialogue processing method and electronic apparatus
CN111081230A (en) Speech recognition method and apparatus
Vadwala et al. Survey paper on different speech recognition algorithm: challenges and techniques
US11961515B2 (en) Contrastive Siamese network for semi-supervised speech recognition
US20210056958A1 (en) System and method for tone recognition in spoken languages
US20220180864A1 (en) Dialogue system, dialogue processing method, translating apparatus, and method of translation
CN112017648A (en) Weighted finite state converter construction method, speech recognition method and device
Ranjan et al. Isolated word recognition using HMM for Maithili dialect
Vimala et al. Isolated speech recognition system for Tamil language using statistical pattern matching and machine learning techniques
CN116090474A (en) Dialogue emotion analysis method, dialogue emotion analysis device and computer-readable storage medium
CN112216270A (en) Method and system for recognizing speech phonemes, electronic equipment and storage medium
CN112951277B (en) Method and device for evaluating speech
Tailor et al. Deep learning approach for spoken digit recognition in Gujarati language
CN113593524A (en) Method and device for training acoustic model for accent recognition, and storage medium
JP7170594B2 (en) A program, apparatus and method for constructing a learning model that integrates different media data generated chronologically for the same event
Ankit et al. Acoustic speech recognition for Marathi language using sphinx
Lin et al. Speech recognition for people with dysphasia using convolutional neural network
Gupta et al. Voice Identification in Python Using Hidden Markov Model
Bozorg et al. Autoregressive articulatory wavenet flow for speaker-independent acoustic-to-articulatory inversion

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40055304

Country of ref document: HK

SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination