CN118197292A - Dialect variant speech recognition model training method and system based on context information - Google Patents


Publication number: CN118197292A
Authority: CN (China)
Prior art keywords: dialect, variant, model, context information, training
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Application number: CN202410368744.2A
Other languages: Chinese (zh)
Inventors: 苏立伟, 刘振华, 谭火超, 陈海燕, 吴石松, 梁飞令, 毛莉萍, 李兰芳, 杨英勃, 曾晓锋, 简冬琳, 冼文祥, 石世玉, 彭若馨, 李静
Current Assignee: Guangdong Power Grid Co Ltd; Customer Service Center of Guangdong Power Grid Co Ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original Assignee: Guangdong Power Grid Co Ltd; Customer Service Center of Guangdong Power Grid Co Ltd
Application filed by Guangdong Power Grid Co Ltd and Customer Service Center of Guangdong Power Grid Co Ltd
Priority to CN202410368744.2A
Publication of CN118197292A

Classifications

    • G10L 15/063 — Training of speech recognition systems (creation of reference templates; adaptation to the characteristics of the speaker's voice)
    • G10L 15/005 — Language recognition
    • G10L 15/16 — Speech classification or search using artificial neural networks
    • G10L 15/183 — Speech classification or search using natural language modelling with context dependencies, e.g. language models
    • G10L 2015/0631 — Creating reference templates; clustering
    • Y02T 10/40 — Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to the technical field of speech recognition and discloses a method and system for training a dialect variant speech recognition model based on context information. The method comprises: obtaining a plurality of labeled dialect voice samples and splicing the labels in time order to obtain context information; training an acoustic model, a dialect prediction model and a lexical model through the context information and the dialect voice samples respectively; constructing dialect variants and corresponding variant constraint paths, and adding the variant constraint paths to the decoding space of the trained dialect prediction model to obtain a dialect variant prediction model; and, based on an attention mechanism, performing feature fusion on the trained acoustic model, the dialect variant prediction model and the trained lexical model to obtain the dialect variant speech recognition model. The method and system make full use of context information and assist recognition by generating candidate dialect variants, improving the speech recognition accuracy for dialect variants.

Description

Dialect variant speech recognition model training method and system based on context information
Technical Field
The invention relates to the technical field of speech recognition, and in particular to a method and system for training a dialect variant speech recognition model based on context information.
Background
Speech recognition is a natural language processing technology that has been widely applied in recent years, allowing computers to recognize and understand spoken human language. However, automatic speech recognition (Automatic Speech Recognition, ASR) systems face unique challenges with dialects, which differ significantly from the widely used Mandarin in pronunciation, vocabulary and grammar. Conventional ASR systems therefore often struggle to accurately recognize the many dialects, and the prior art leaves room for improvement.
Disclosure of Invention
The invention provides a method and system for training a dialect variant speech recognition model based on context information, which address the difficulty conventional ASR systems have in accurately recognizing various dialects.
To solve the above technical problem, a first aspect of the present invention provides a method for training a dialect variant speech recognition model based on context information, including:
Acquiring a plurality of dialect voice samples marked with labels, and splicing the labels according to time sequence to obtain context information; the labels comprise text content labels and word class labels corresponding to texts; the context information comprises text context information and word class context information;
Training an acoustic model, a dialect prediction model and a lexical model through the context information and the dialect voice samples respectively; the acoustic model is used for constructing a pronunciation association relation between dialects and Mandarin; the dialect prediction model is used for predicting vocabulary in dialects; the lexical model is used for constructing vocabulary association relations between dialects and Mandarin;
Constructing a dialect variant and a variant constraint path corresponding to the variant, and adding the variant constraint path into a decoding space of the trained dialect prediction model to obtain a dialect variant prediction model;
Based on the attention mechanism, feature fusion is carried out on the trained acoustic model, the dialect variant prediction model and the trained lexical model, so that a dialect variant speech recognition model is obtained.
Further, the training the acoustic model, the dialect prediction model and the lexical model through the context information and the dialect voice sample respectively includes:
Performing supervised training on the acoustic model through the dialect voice sample and the text content label corresponding to the dialect voice sample to obtain a trained acoustic model;
Performing autoregressive prediction training on the dialect prediction model through the text context information to obtain a trained dialect prediction model;
And training the lexical model through the word class context information, the dialect voice sample and the word class labels corresponding to the word class context information, so as to obtain the trained lexical model.
Further, the training criterion for the supervised training is represented by the following formula:

L_supervised = −E_{x∼p(x)}[log P(y | A(x); θ_t)]

wherein L_supervised is the loss function of the acoustic model; x is a dialect voice sample; y is a text content label; p(x) is the distribution function of the dialect voice samples; θ_t is the acoustic model parameter in the t-th iteration of training; P(y | A(x); θ_t) is the output probability of the text content label; and A(·) is used to enhance the spectral data in the dialect voice samples.
Further, the training criterion of the dialect prediction model is represented by the following formula:

L_pred = −E_{y∼q(y)}[Σ_i log P(y_i | y_{<i}; φ_t)]

wherein L_pred is the loss function of the dialect prediction model; q(y) is the distribution function of the text content tags; y_i is the text content tag at the i-th position of the dialect voice sample; y_{<i} is the text context information; φ_t is the dialect prediction model parameter in the t-th iteration of training; and P(y_i | y_{<i}; φ_t) is the predicted probability of the next vocabulary word.
Further, the training criterion of the lexical model is represented by the following formula:

L_lex = −E_{z∼q(z)}[Σ_i log P(z_i | z_{<i}, x; φ_t)]

wherein L_lex is the loss function of the lexical model; q(z) is the distribution function of the word class labels; z_i is the part-of-speech tag of the text at the i-th position of the dialect voice sample; z_{<i} is the word class context information; φ_t is the lexical model parameter in the t-th iteration of training; and P(z_i | z_{<i}, x; φ_t) is the part-of-speech prediction probability of the next word.
Further, the constructing a dialect variant and a variant constraint path corresponding to the dialect variant, and adding the variant constraint path to a decoding space of the trained dialect prediction model to obtain a dialect variant prediction model, including:
Establishing a synonymous substitution list of the dialect variants based on prior knowledge of the dialect variants, and constructing variant constraint paths according to the synonymous substitution list;
Adding the variant constraint path into a decoding space of the trained dialect prediction model to obtain a constraint dialect prediction model;
Sampling the constraint dialect prediction model, generating prediction results of a plurality of dialect variants and updating the text context information;
training the constraint dialect prediction model through the updated text context information to obtain the dialect variant prediction model.
Further, based on the attention mechanism, feature fusion is performed on the trained acoustic model, the dialect variant prediction model and the trained lexical model to obtain a dialect variant speech recognition model, which comprises the following steps:
Adding a cross-attention module at each network layer of the trained acoustic model, feeding the output layer features of the dialect variant prediction model and the trained lexical model into the attention module of each layer, and adjusting the fusion through the loss function of the acoustic model, so as to obtain the fused dialect variant speech recognition model.
A second aspect of the present invention provides a dialect variant speech recognition model training system based on context information, comprising:
The data acquisition module is used for acquiring a plurality of dialect voice samples marked with labels, and splicing the labels according to time sequence to obtain context information; the labels comprise text content labels and word class labels corresponding to texts; the context information comprises text context information and word class context information;
The model training module is used for training an acoustic model, a dialect prediction model and a lexical model through the context information and the dialect voice samples respectively; the acoustic model is used for constructing a pronunciation association relation between dialects and Mandarin; the dialect prediction model is used for predicting vocabulary in dialects; the lexical model is used for constructing vocabulary association relations between dialects and Mandarin;
The path constraint module is used for constructing a dialect variant and a variant constraint path corresponding to the dialect variant, and adding the variant constraint path into a decoding space of the trained dialect prediction model to obtain a dialect variant prediction model;
And the feature fusion module is used for carrying out feature fusion on the trained acoustic model, the dialect variant prediction model and the trained lexical model based on the attention mechanism to obtain a dialect variant speech recognition model.
A third aspect of the present invention provides an electronic device comprising a processor, a memory and a computer program stored in the memory and configured to be executed by the processor, the processor implementing the method of dialect variant speech recognition model training based on context information as described in any of the first aspects above when the computer program is executed.
A fourth aspect of the present invention provides a computer readable storage medium comprising a stored computer program, wherein the computer program when run controls a device in which the computer readable storage medium is located to perform the dialect variant speech recognition model training method based on context information as described in any of the first aspects above.
Compared with the prior art, the embodiment of the invention has the beneficial effects that:
The invention provides a method and system for training a dialect variant speech recognition model based on context information. The method comprises: obtaining a plurality of labeled dialect voice samples and splicing the labels in time order to obtain context information; training an acoustic model, a dialect prediction model and a lexical model through the context information and the dialect voice samples respectively; constructing dialect variants and corresponding variant constraint paths, and adding the variant constraint paths to the decoding space of the trained dialect prediction model to obtain a dialect variant prediction model; and, based on an attention mechanism, performing feature fusion on the trained acoustic model, the dialect variant prediction model and the trained lexical model to obtain the dialect variant speech recognition model. By using context information, the ASR system can better understand phonetic changes and contextual cues in dialect variants, improving the speech recognition accuracy for dialect variants and providing a more reliable and efficient solution for dialect speech recognition applications.
Drawings
In order to more clearly illustrate the technical solutions of the present invention, the drawings needed in the embodiments are briefly described below. The drawings in the following description are only some embodiments of the present invention; a person skilled in the art may obtain other drawings from these drawings without inventive effort.
FIG. 1 is a flow chart of a method for training a dialect variant speech recognition model based on context information according to an embodiment of the present invention;
FIG. 2 is a flowchart of step S2 provided in an embodiment of the present invention;
FIG. 3 is a flowchart of step S3 provided in an embodiment of the present invention;
FIG. 4 is a block diagram of a feature fusion architecture provided in accordance with an embodiment of the present invention;
FIG. 5 is a device diagram of a dialect variant speech recognition model training system based on context information, according to an embodiment of the present invention;
fig. 6 is a block diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings and examples, in which it is evident that the embodiments described are only some, but not all, of the embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It should be understood that the step numbers used herein are for convenience of description only and are not limiting as to the order in which the steps are performed.
It is to be understood that the terminology used in the description of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
The terms "comprises" and "comprising" indicate the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
The term "and/or" refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
As a dialect with rich language variants, Cantonese presents some unique challenges to automatic speech recognition (ASR) systems. Its spoken variants mainly include Standard Cantonese, Guangfu, Chaoshan and others, which differ markedly in pronunciation, vocabulary and grammar. Conventional ASR systems therefore often have difficulty accurately recognizing dialects, particularly the various variants of Cantonese. To improve the speech recognition accuracy for dialect variants, the following key problems need to be solved:
1. Pronunciation differences: different dialect variants differ significantly in pronunciation, including changes in pitch, phonemes and syllables, which increase the difficulty of recognition;
2. Vocabulary differences: different dialect variants use different words and phrases, and the same word may even carry different meanings;
3. Conventional ASR systems are typically based on recognizing individual speech segments and make little use of context information.
Therefore, starting from these differences in dialect pronunciation, vocabulary and context, the present application uses machine learning to train the model's ability to recognize differently pronounced words, phonetic changes, grammatical structures and other aspects, thereby improving the accuracy of dialect variant speech recognition.
In an embodiment, as shown in fig. 1, a first aspect of the present invention provides a dialect variant speech recognition model training method based on context information, including:
S1, acquiring a plurality of dialect voice samples marked with labels, and splicing the labels in time order to obtain context information; the labels comprise text content labels and word class labels corresponding to the texts; the context information comprises text context information and word class context information. The model is trained with labeled dialect voice samples: a text content label is the textual content of the dialect utterance, and a word class label records the vocabulary in the text and information such as the part of speech of each word. The present application uses the labels of the dialect samples, spliced in time order into context information, to train the dialect variant speech recognition model and improve its accuracy.
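As a minimal illustration of the time-ordered splicing in step S1 (the data layout, field names and tag names here are assumed for the sketch, not specified by the patent):

```python
def build_context(samples):
    # Each sample carries a start time, its text-content label and its
    # word-class (part-of-speech) labels; field names are illustrative.
    ordered = sorted(samples, key=lambda s: s["start_time"])
    text_context = [s["text"] for s in ordered]
    word_class_context = [tag for s in ordered for tag in s["pos_tags"]]
    return text_context, word_class_context

samples = [
    {"start_time": 2.0, "text": "查电费", "pos_tags": ["v", "n"]},
    {"start_time": 0.0, "text": "你好", "pos_tags": ["intj"]},
]
text_ctx, pos_ctx = build_context(samples)
# text_ctx == ["你好", "查电费"]; pos_ctx == ["intj", "v", "n"]
```

Sorting by start time before concatenation is what makes the spliced labels usable as y_{<i} and z_{<i} context in the later training criteria.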
S2, training an acoustic model, a dialect prediction model and a lexical model through the context information and the dialect voice samples respectively; the acoustic model is used for constructing a pronunciation association relation between dialects and Mandarin; the dialect prediction model is used for predicting vocabulary in dialects; the lexical model is used for constructing vocabulary association relations between dialects and Mandarin;
In one embodiment, step S2, as shown in fig. 2, includes:
S21, performing supervised training on the acoustic model through the dialect voice samples and the corresponding text content labels to obtain a trained acoustic model;
wherein the training criterion for the supervised training is represented by:

L_supervised = −E_{x∼p(x)}[log P(y | A(x); θ_t)]

wherein L_supervised is the loss function of the acoustic model; x is a dialect voice sample; y is a text content label; p(x) is the distribution function of the dialect voice samples; θ_t is the acoustic model parameter in the t-th iteration of training; P(y | A(x); θ_t) is the output probability of the text content label; and A(·) is used to enhance the spectral data in the dialect voice samples.
Specifically, the present application trains the acoustic model in a supervised learning manner; the training procedure is not repeated here. A(·) can also be regarded as a method for enhancing the spectral data of the dialect voice samples. A large number of labeled dialect voice samples are taken as input, and the acoustic model is trained with gradient-descent optimization to obtain the output probabilities of the text content labels. In this training process, the text content labels are the words contained in the speech; each word is a class, and the classification is trained with the loss function above. Supervised training of the acoustic model is beneficial for constructing the pronunciation association relation between dialects and Mandarin.
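A sketch of the supervised criterion in code, assuming the model's output distributions over label classes are already available (a Monte-Carlo estimate over a toy batch; the probability values are illustrative, not from the patent):

```python
import math

def supervised_loss(output_probs, label_ids):
    # Batch estimate of L_supervised = -E[log P(y | A(x); theta_t)]:
    # average negative log-probability that the model assigns to the
    # ground-truth text-content label of each (augmented) sample.
    return -sum(math.log(p[y]) for p, y in zip(output_probs, label_ids)) / len(label_ids)

# toy batch: model output distributions over 3 label classes for 2 samples
probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
labels = [0, 1]
loss = supervised_loss(probs, labels)
```

Gradient descent on this quantity with respect to θ_t is the standard cross-entropy training the passage describes.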
S22, performing autoregressive prediction training on the dialect prediction model through the text context information to obtain a trained dialect prediction model;
Wherein the training criterion of the dialect prediction model is represented by the following formula:

L_pred = −E_{y∼q(y)}[Σ_i log P(y_i | y_{<i}; φ_t)]

wherein L_pred is the loss function of the dialect prediction model; q(y) is the distribution function of the text content tags; y_i is the text content tag at the i-th position of the dialect voice sample; y_{<i} is the text context information; φ_t is the dialect prediction model parameter in the t-th iteration of training; and P(y_i | y_{<i}; φ_t) is the predicted probability of the next vocabulary word.
Specifically, the present application trains the dialect prediction model in an autoregressive prediction manner; the training procedure is not detailed here. The text context information, the dialect voice samples and their labels are taken as input, and the predicted probability of the next word is output; this predicted probability is used only during training of the dialect prediction model, guiding the model to learn part-of-speech knowledge.
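As an illustrative stand-in for the autoregressive criterion (a count-based model that truncates the context y_{<i} to the single previous token for brevity; the patent does not prescribe this form):

```python
from collections import Counter, defaultdict

def fit_next_word(corpus):
    # Estimate P(y_i | y_{<i}) from counts, with the context y_{<i}
    # truncated to the previous token; "<s>" marks sentence start.
    counts = defaultdict(Counter)
    for sentence in corpus:
        prev = "<s>"
        for token in sentence:
            counts[prev][token] += 1
            prev = token
    return {ctx: {tok: n / sum(ctr.values()) for tok, n in ctr.items()}
            for ctx, ctr in counts.items()}

model = fit_next_word([["饮", "茶"], ["饮", "水"], ["饮", "茶"]])
# after "饮" the model predicts "茶" with probability 2/3
```

A neural autoregressive model (Transformer, LSTM) would replace the count table with learned parameters φ_t, but the predicted quantity is the same: a distribution over the next word given the preceding context.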
S23, training the lexical model through the word class context information, the dialect voice sample and the word class labels corresponding to the word class context information, and obtaining a trained lexical model;
Wherein the training criterion of the lexical model is represented by the following formula:

L_lex = −E_{z∼q(z)}[Σ_i log P(z_i | z_{<i}, x; φ_t)]

wherein L_lex is the loss function of the lexical model; q(z) is the distribution function of the word class labels; z_i is the part-of-speech tag of the text at the i-th position of the dialect voice sample; z_{<i} is the word class context information; φ_t is the lexical model parameter in the t-th iteration of training; and P(z_i | z_{<i}, x; φ_t) is the part-of-speech prediction probability of the next word.
Specifically, the present application does not limit the specific method used to train the lexical model. The word class context information, the dialect voice samples and the corresponding word class labels are taken as input, and the part-of-speech prediction probability of the next word is output, so that the vocabulary association relation between dialects and Mandarin can be learned effectively.
S3, constructing a dialect variant and a variant constraint path corresponding to the dialect variant, and adding the variant constraint path into a decoding space of the trained dialect prediction model to obtain a dialect variant prediction model;
In one embodiment, step S3, as shown in fig. 3, includes:
S31, establishing a synonymous substitution list of the dialect variants based on prior knowledge of the dialect variants, and constructing variant constraint paths according to the synonymous substitution list;
S32, adding the variant constraint path into a decoding space of the trained dialect prediction model to obtain a constraint dialect prediction model;
S33, sampling the constraint dialect prediction model, generating prediction results of a plurality of dialect variants and updating the text context information;
S34, training the constraint dialect prediction model through the updated text context information to obtain the dialect variant prediction model.
Specifically, the present application establishes a synonymous substitution list of dialect variants based on linguistic prior knowledge of the dialect variants; additional variant constraint paths are constructed from the synonymous substitution list and added to the decoding space of the dialect prediction model to obtain the constraint dialect prediction model, which provides more possibilities for decoding and supplies the final recognition model with additional knowledge of dialect variants. The constraint dialect prediction model is then sampled, the resulting multiple prediction results are used to update the text context information, and the model is retrained with the updated information. During training, words in the text are randomly replaced with variants according to the dialect variant rules, increasing the richness of the generated variants.
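A sketch of the synonymous-substitution machinery under stated assumptions (the substitution pairs and the single-token replacement rule are illustrative, not taken from the patent):

```python
import random

def build_variant_paths(subst_list):
    # Expand the synonymous substitution list into constraint paths:
    # each canonical word maps to itself plus its dialect variants,
    # i.e. the extra arcs allowed in the decoding space.
    return {word: [word] + variants for word, variants in subst_list.items()}

def randomly_substitute(tokens, paths, rng):
    # Random variant replacement used to enrich the generated variants:
    # any word with a constraint path may be swapped for one of its forms.
    return [rng.choice(paths[t]) if t in paths else t for t in tokens]

subst = {"看": ["睇"], "喝": ["饮"]}       # Cantonese-style pairs, illustrative
paths = build_variant_paths(subst)
rng = random.Random(0)                      # seeded for reproducibility
aug = randomly_substitute(["我", "看", "电视"], paths, rng)
```

In a weighted-FST or beam-search decoder the same table would be compiled into parallel arcs rather than applied token-by-token; the dictionary form above only shows the information the paths carry.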
It should be noted that the architectures of the acoustic model, the dialect prediction model, the lexical model and so on can be chosen flexibly (Transformer, LSTM and other structures are all applicable), and are not described further here.
S4, based on an attention mechanism, performing feature fusion on the trained acoustic model, the dialect variant prediction model and the trained lexical model to obtain the dialect variant speech recognition model. A block diagram of the feature fusion structure is shown in fig. 4: a cross-attention module is added at each network layer of the trained acoustic model, the output layer features of the dialect variant prediction model and the trained lexical model are fed into the attention module of each layer, and the fusion is adjusted through the loss function of the acoustic model, so that the fused dialect variant speech recognition model is obtained.
Specifically, the feature fusion method adopted by the present application is a cross-modal cross-attention mechanism. The output layer features refer to the output probabilities of each model, such as the output probability of the text content labels from the acoustic model, the predicted probability of the next word from the dialect prediction model, and the part-of-speech prediction probability of the next word from the lexical model. The output layer features contain rich information related to the prediction results; these features are concatenated into a higher-dimensional representation and then projected, so that a comprehensive prediction can be output after the information of the several models is fused, yielding the dialect variant speech recognition model.
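A minimal single-head, scaled dot-product sketch of the cross-attention used for fusion (dimensions and inputs are toy values; a real module would attend over learned projections of the layer features rather than raw vectors):

```python
import math

def cross_attention(query, keys, values):
    # The acoustic-layer feature (query) attends over the output-layer
    # features of the other models (keys/values); softmax of scaled
    # dot-products gives the fusion weights.
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]

fused = cross_attention([1.0, 0.0],
                        [[1.0, 0.0], [0.0, 1.0]],
                        [[2.0, 0.0], [0.0, 2.0]])
# the query is closer to the first key, so fused[0] > fused[1]
```

Inserting one such module per acoustic-model layer, with the fused output trained end-to-end under the acoustic loss, matches the arrangement the passage describes.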
To address the difficulty conventional ASR systems have in accurately recognizing various dialects, the embodiment of the present application designs a dialect variant speech recognition model training method based on context information. The method obtains a plurality of labeled dialect voice samples and splices the labels in time order to obtain context information; trains an acoustic model, a dialect prediction model and a lexical model through the context information and the dialect voice samples respectively; constructs dialect variants and corresponding variant constraint paths, and adds the variant constraint paths to the decoding space of the trained dialect prediction model to obtain a dialect variant prediction model; and, based on an attention mechanism, performs feature fusion on the trained acoustic model, the dialect variant prediction model and the trained lexical model to obtain the dialect variant speech recognition model. Context information is fully utilized, and recognition is assisted by generating candidate dialect variants, improving the speech recognition accuracy for dialect variants.
Although the steps in the flowcharts described above are shown in order as indicated by arrows, these steps are not necessarily executed in order as indicated by the arrows. The steps are not strictly limited to the order of execution unless explicitly recited herein, and the steps may be executed in other orders.
In another embodiment, as shown in fig. 5, a second aspect of the present invention provides a dialect variant speech recognition model training system based on context information, comprising:
The data acquisition module 10 is used for acquiring a plurality of dialect voice samples marked with labels, and splicing the labels according to time sequence to obtain context information; the labels comprise text content labels and word class labels corresponding to texts; the context information comprises text context information and word class context information;
A model training module 20 for training an acoustic model, a dialect prediction model and a lexical model through the context information and the dialect voice samples respectively; the acoustic model is used for constructing a pronunciation association relation between dialects and Mandarin; the dialect prediction model is used for predicting vocabulary in dialects; the lexical model is used for constructing vocabulary association relations between dialects and Mandarin;
the path constraint module 30 is configured to construct a dialect variant and a variant constraint path corresponding to the dialect variant, and add the variant constraint path to a decoding space of the trained dialect prediction model to obtain a dialect variant prediction model;
The feature fusion module 40 is configured to perform feature fusion on the trained acoustic model, the dialect variant prediction model and the trained lexical model based on the attention mechanism, so as to obtain a dialect variant speech recognition model.
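As an illustration only (the module interfaces and names are assumptions, not the disclosed implementation), the data flow through the four modules above could be sketched as:

```python
class DialectVariantTrainingSystem:
    """Sketch of the four-module training system: data acquisition,
    model training, path constraint, and feature fusion."""

    def __init__(self, data_module, training_module, path_module, fusion_module):
        self.data_module = data_module          # acquires samples, builds context info
        self.training_module = training_module  # trains acoustic/dialect/lexical models
        self.path_module = path_module          # adds variant constraint paths to decoding space
        self.fusion_module = fusion_module      # attention-based feature fusion

    def run(self, raw_samples):
        samples, context = self.data_module(raw_samples)
        acoustic, dialect, lexical = self.training_module(context, samples)
        variant_model = self.path_module(dialect)   # dialect variant prediction model
        return self.fusion_module(acoustic, variant_model, lexical)

# Wiring with trivial stand-in modules just to show the data flow:
system = DialectVariantTrainingSystem(
    data_module=lambda raw: (raw, "context"),
    training_module=lambda ctx, s: ("acoustic", "dialect", "lexical"),
    path_module=lambda dialect: dialect + "+variants",
    fusion_module=lambda a, v, l: (a, v, l),
)
result = system.run(["sample1"])
# result == ("acoustic", "dialect+variants", "lexical")
```

The stand-in lambdas only demonstrate how each module's output feeds the next; real modules would carry trained model states rather than strings.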
It should be noted that each module in the dialect variant speech recognition model training system based on context information may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in, or independent of, a processor in the computer device in hardware form, or may be stored in a memory of the computer device in software form, so that the processor can invoke and execute the operations corresponding to the above modules. For specific limitations on the context-based dialect variant speech recognition model training system, see the above description of the context-based dialect variant speech recognition model training method; the two have the same functions and effects and are not described in detail here.
A third aspect of the present invention provides an electronic device comprising:
a processor, a memory, and a bus;
The bus is used for connecting the processor and the memory;
The memory is used for storing operation instructions;
the processor is configured to invoke the operation instructions so as to perform the operations corresponding to the dialect variant speech recognition model training method based on context information as illustrated in the first aspect of the present application.
In an alternative embodiment, an electronic device is provided. As shown in fig. 6, the electronic device 5000 includes a processor 5001 and a memory 5003. The processor 5001 is coupled to the memory 5003, for example via a bus 5002. Optionally, the electronic device 5000 may also include a transceiver 5004. It should be noted that, in practical applications, the number of transceivers 5004 is not limited to one, and the structure of the electronic device 5000 does not constitute a limitation on the embodiments of the present application.
The processor 5001 may be a CPU, general-purpose processor, DSP, ASIC, FPGA or other programmable logic device, transistor logic device, hardware component, or any combination thereof. It may implement or perform the various exemplary logic blocks, modules and circuits described in connection with this disclosure. The processor 5001 may also be a combination of computing devices, e.g., a combination of one or more microprocessors, or a combination of a DSP and a microprocessor, etc.
Bus 5002 may include a path for transferring information between the aforementioned components. Bus 5002 may be a PCI bus, an EISA bus, or the like. The bus 5002 may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in fig. 6, but this does not mean that there is only one bus or only one type of bus.
The memory 5003 may be, but is not limited to, a ROM or other type of static storage device capable of storing static information and instructions, a RAM or other type of dynamic storage device capable of storing information and instructions, an EEPROM, a CD-ROM or other optical disc storage (including compact discs, laser discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage or other magnetic storage devices, or any other medium capable of carrying or storing desired program code in the form of instructions or data structures and accessible by a computer.
The memory 5003 is used for storing application program codes for implementing the inventive arrangements and is controlled to be executed by the processor 5001. The processor 5001 is operative to execute application code stored in the memory 5003 to implement what has been shown in any of the method embodiments described previously.
The electronic devices include, but are not limited to: mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), in-vehicle terminals (e.g., in-vehicle navigation terminals), and the like, as well as stationary terminals such as digital TVs, desktop computers, and the like.
A fourth aspect of the present application provides a computer readable storage medium having stored thereon a computer program which when executed by a processor implements a method for training a dialect variant speech recognition model based on context information as described in the first aspect of the present application.
Yet another embodiment of the present application provides a computer-readable storage medium having stored thereon a computer program which, when run on a computer, causes the computer to perform the corresponding content of the foregoing method embodiments.
Furthermore, an embodiment of the present invention proposes a computer-readable storage medium, on which a computer program is stored, which program, when being executed by a processor, implements the steps of the above-mentioned method.
In summary, the invention relates to the technical field of speech recognition and discloses a dialect variant speech recognition model training method and system based on context information. The method comprises: acquiring a plurality of labeled dialect speech samples, and splicing the labels in time order to obtain context information; training an acoustic model, a dialect prediction model and a lexical model respectively through the context information and the dialect speech samples; constructing dialect variants and their corresponding variant constraint paths, and adding the variant constraint paths to the decoding space of the trained dialect prediction model to obtain a dialect variant prediction model; and performing feature fusion on the trained acoustic model, the dialect variant prediction model and the trained lexical model based on the attention mechanism to obtain a dialect variant speech recognition model. The method and system make full use of the context information, assist recognition by generating candidate dialect variants, and improve the speech recognition accuracy for dialect variants.
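As a hedged sketch of the attention-based feature fusion step (the feature dimensions and the use of plain scaled dot-product cross-attention are assumptions, not the disclosed implementation), the acoustic-model layer features could attend to the output-layer features of the dialect variant prediction model as follows:

```python
import numpy as np

def cross_attention(query, key, value):
    """Scaled dot-product cross-attention: acoustic-model features (queries)
    attend to features from another model (keys/values)."""
    d_k = query.shape[-1]
    scores = query @ key.T / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over key positions
    return weights @ value

rng = np.random.default_rng(0)
acoustic = rng.normal(size=(4, 8))   # 4 acoustic frames, feature dim 8 (assumed)
variant = rng.normal(size=(6, 8))    # 6 dialect-variant output features (assumed)
fused = cross_attention(acoustic, variant, variant)
# fused keeps the acoustic feature shape: (4, 8)
```

In the described system, such a module would be added at each acoustic-model layer, with the lexical model's output features fed in the same way, and the whole fusion tuned through the acoustic model's loss function.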
In this specification, each embodiment is described in a progressive manner; for identical or similar parts among the embodiments, reference may be made to one another, and each embodiment focuses on its differences from the other embodiments. In particular, since the system embodiments are substantially similar to the method embodiments, their description is relatively brief; for relevant details, see the corresponding parts of the description of the method embodiments. It should be noted that the technical features of the foregoing embodiments may be combined arbitrarily; for brevity, not all possible combinations are described, but as long as a combination of technical features contains no contradiction, it should be considered within the scope of this specification.
The foregoing embodiments represent only a few preferred implementations of the present application; their description is relatively specific and detailed, but they are not to be construed as limiting the scope of the application. It should be noted that those skilled in the art can make modifications and substitutions without departing from the technical principles of the present application, and such modifications and substitutions should also be considered within the protection scope of the present application. Therefore, the protection scope of this patent application is subject to the protection scope of the claims.

Claims (10)

1. A dialect variant speech recognition model training method based on context information, comprising:
Acquiring a plurality of dialect voice samples marked with labels, and splicing the labels according to time sequence to obtain context information; the labels comprise text content labels and word class labels corresponding to texts; the context information comprises text context information and word class context information;
Training an acoustic model, a dialect prediction model and a lexical model through the context information and the dialect speech samples, respectively; the acoustic model is used for constructing a pronunciation association relation between dialects and Mandarin; the dialect prediction model is used for predicting vocabulary in dialects; the lexical model is used for constructing vocabulary association relations between dialects and Mandarin;
Constructing a dialect variant and a variant constraint path corresponding to the variant, and adding the variant constraint path into a decoding space of the trained dialect prediction model to obtain a dialect variant prediction model;
Based on the attention mechanism, feature fusion is carried out on the trained acoustic model, the dialect variant prediction model and the trained lexical model, so that a dialect variant speech recognition model is obtained.
2. The method for training a dialect variant speech recognition model based on context information as recited in claim 1, wherein the training of the acoustic model, the dialect prediction model and the lexical model by the context information and the dialect speech samples, respectively, comprises:
Performing supervised training on the acoustic model through the dialect voice sample and the text content label corresponding to the dialect voice sample to obtain a trained acoustic model;
Performing autoregressive prediction training on the dialect prediction model through the text context information to obtain a trained dialect prediction model;
And training the lexical model through the word class context information, the dialect voice sample and the word class labels corresponding to the word class context information, so as to obtain the trained lexical model.
3. A method of training a dialect variant speech recognition model based on context information as recited in claim 2, wherein the training criterion for the supervised training is represented by the following equation:

$$L_{\mathrm{supervised}} = -\,\mathbb{E}_{(x,y)\sim p(x)}\left[\log p_{\theta_t}\!\left(y \mid A(x)\right)\right]$$

wherein $L_{\mathrm{supervised}}$ is the loss function of the acoustic model; $x$ is a dialect speech sample; $y$ is a text content label; $p(x)$ is the distribution function of the dialect speech samples; $\theta_t$ is the acoustic model parameter in the $t$-th iteration of training; $p_{\theta_t}(y \mid A(x))$ is the output probability of the text content label; and $A(\cdot)$ is used to augment the spectral data in the dialect speech samples.
4. A method of training a dialect variant speech recognition model based on context information as recited in claim 3, wherein the training criterion of the dialect prediction model is represented by the following formula:

$$L_{\mathrm{dialect}} = -\,\mathbb{E}_{y\sim q(y)}\left[\sum_{i}\log p_{\phi_t}\!\left(y_i \mid y_{<i}\right)\right]$$

wherein $L_{\mathrm{dialect}}$ is the loss function of the dialect prediction model; $q(y)$ is the distribution function of the text content labels; $y_i$ is the text content label at the $i$-th position of the dialect speech sample; $y_{<i}$ is the text context information; $\phi_t$ is the dialect prediction model parameter in the $t$-th iteration of training; and $p_{\phi_t}(y_i \mid y_{<i})$ is the predicted probability of the next vocabulary item.
5. The method of claim 4, wherein the training criterion of the lexical model is represented by the following formula:

$$L_{\mathrm{lexical}} = -\,\mathbb{E}_{z\sim q(z)}\left[\sum_{i}\log p_{\psi_t}\!\left(z_i \mid z_{<i}\right)\right]$$

wherein $L_{\mathrm{lexical}}$ is the loss function of the lexical model; $q(z)$ is the distribution function of the word-class labels; $z_i$ is the word-class label of the text at the $i$-th position of the dialect speech sample; $z_{<i}$ is the word-class context information; $\psi_t$ is the lexical model parameter in the $t$-th iteration of training; and $p_{\psi_t}(z_i \mid z_{<i})$ is the predicted probability of the word class of the next vocabulary item.
6. The method for training a dialect variant speech recognition model based on context information as set forth in claim 2, wherein the constructing the dialect variant and the variant constraint path corresponding thereto, and adding the variant constraint path to a decoding space of the trained dialect prediction model, to obtain the dialect variant prediction model, comprises:
Establishing a synonymous substitution list of the dialect variant based on priori knowledge of the dialect variant, and constructing a variant constraint path according to the synonymous substitution list;
Adding the variant constraint path into a decoding space of the trained dialect prediction model to obtain a constraint dialect prediction model;
Sampling the constraint dialect prediction model, generating prediction results of a plurality of dialect variants and updating the text context information;
training the constraint dialect prediction model through the updated text context information to obtain the dialect variant prediction model.
7. A method for training a dialect variant speech recognition model based on context information as set forth in claim 3, wherein the feature fusion is performed on the trained acoustic model, the dialect variant prediction model and the trained lexical model based on the attention mechanism to obtain the dialect variant speech recognition model, and the method comprises:
Adding a cross-attention mechanism module to each network layer of the trained acoustic model, feeding the output-layer features of the dialect variant prediction model and of the trained lexical model into the attention mechanism module of each layer, and adjusting the output-layer features through the loss function of the acoustic model, so as to obtain the fused dialect variant speech recognition model.
8. A dialect variant speech recognition model training system based on context information, comprising:
The data acquisition module is used for acquiring a plurality of dialect voice samples marked with labels, and splicing the labels according to time sequence to obtain context information; the labels comprise text content labels and word class labels corresponding to texts; the context information comprises text context information and word class context information;
The model training module is used for training an acoustic model, a dialect prediction model and a lexical model through the context information and the dialect speech samples, respectively; the acoustic model is used for constructing a pronunciation association relation between dialects and Mandarin; the dialect prediction model is used for predicting vocabulary in dialects; the lexical model is used for constructing vocabulary association relations between dialects and Mandarin;
The path constraint module is used for constructing a dialect variant and a variant constraint path corresponding to the dialect variant, and adding the variant constraint path into a decoding space of the trained dialect prediction model to obtain a dialect variant prediction model;
And the feature fusion module is used for carrying out feature fusion on the trained acoustic model, the dialect variant prediction model and the trained lexical model based on the attention mechanism to obtain a dialect variant speech recognition model.
9. An electronic device comprising a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, the processor implementing the context information based dialect variant speech recognition model training method of any of claims 1 to 7 when the computer program is executed.
10. A computer readable storage medium, characterized in that the computer readable storage medium comprises a stored computer program, wherein the computer program, when run, controls a device in which the computer readable storage medium is located to perform the dialect variant speech recognition model training method based on context information as claimed in any of claims 1 to 7.
CN202410368744.2A 2024-03-28 2024-03-28 Dialect variant speech recognition model training method and system based on context information Pending CN118197292A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410368744.2A CN118197292A (en) 2024-03-28 2024-03-28 Dialect variant speech recognition model training method and system based on context information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410368744.2A CN118197292A (en) 2024-03-28 2024-03-28 Dialect variant speech recognition model training method and system based on context information

Publications (1)

Publication Number Publication Date
CN118197292A true CN118197292A (en) 2024-06-14

Family

ID=91412070

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410368744.2A Pending CN118197292A (en) 2024-03-28 2024-03-28 Dialect variant speech recognition model training method and system based on context information

Country Status (1)

Country Link
CN (1) CN118197292A (en)

Similar Documents

Publication Publication Date Title
CN111862977B (en) Voice conversation processing method and system
US11238845B2 (en) Multi-dialect and multilingual speech recognition
CN112002308B (en) Voice recognition method and device
CN107077841B (en) Superstructure recurrent neural network for text-to-speech
KR102375115B1 (en) Phoneme-Based Contextualization for Cross-Language Speech Recognition in End-to-End Models
US9501470B2 (en) System and method for enriching spoken language translation with dialog acts
CN111062217B (en) Language information processing method and device, storage medium and electronic equipment
CN112309366B (en) Speech synthesis method, speech synthesis device, storage medium and electronic equipment
CN109754809A (en) Audio recognition method, device, electronic equipment and storage medium
CN112037773B (en) N-optimal spoken language semantic recognition method and device and electronic equipment
Hori et al. Dialog state tracking with attention-based sequence-to-sequence learning
CN111414745A (en) Text punctuation determination method and device, storage medium and electronic equipment
CN112309367B (en) Speech synthesis method, speech synthesis device, storage medium and electronic equipment
CN111508497B (en) Speech recognition method, device, electronic equipment and storage medium
CN114333838A (en) Method and system for correcting voice recognition text
CN113836945A (en) Intention recognition method and device, electronic equipment and storage medium
CN116580698A (en) Speech synthesis method, device, computer equipment and medium based on artificial intelligence
CN118197292A (en) Dialect variant speech recognition model training method and system based on context information
CN111489742B (en) Acoustic model training method, voice recognition device and electronic equipment
Carson-Berndsen Multilingual time maps: portable phonotactic models for speech technology
CN114398876B (en) Text error correction method and device based on finite state converter
CN116229994B (en) Construction method and device of label prediction model of Arabic language
US20220398474A1 (en) System and Method for Contextual Density Ratio-based Biasing of Sequence-to-Sequence Processing Systems
KR20240044019A (en) Method for extracting linguistic features based on multilingual model and computing apparatus using the same
Sun Unsupervised Feature Representation Learning using Sequence-to-sequence Autoencoder Architecture for Low-resource Language

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination