CN112489634A - Language acoustic model training method and device, electronic equipment and computer medium - Google Patents


Info

Publication number
CN112489634A
CN112489634A (application number CN202011287317.XA)
Authority
CN
China
Prior art keywords
target language
target
sentences
dialect
language
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011287317.XA
Other languages
Chinese (zh)
Inventor
颜京豪
黄申
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202011287317.XA priority Critical patent/CN112489634A/en
Publication of CN112489634A publication Critical patent/CN112489634A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 - Training
    • G10L 15/005 - Language recognition
    • G10L 15/08 - Speech classification or search
    • G10L 15/18 - Speech classification or search using natural language modelling
    • G10L 15/183 - Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L 15/187 - Phonemic context, e.g. pronunciation rules, phonotactical constraints or phoneme n-grams

Abstract

The application provides a language acoustic model training method and apparatus, an electronic device and a computer-readable storage medium, and relates to the field of speech recognition. The method comprises the following steps: Latinizing a text set of a target language to obtain a pronunciation dictionary of the target language; generating a target language corpus based on the pronunciation dictionary and the text set; training a speech recognition model for recognizing the target language according to the target language corpus; and retraining the speech recognition model based on dialect corpora respectively corresponding to various dialects of the target language, to obtain dialect speech recognition models for recognizing the various dialects. The method and apparatus require no manual labeling, which saves a large amount of labor cost and time cost; moreover, for dialects of different branches of the target language, a dialect acoustic model corresponding to each dialect is obtained through training on a preset multi-dialect corpus, which improves the recognition rate for each dialect.

Description

Language acoustic model training method and device, electronic equipment and computer medium
Technical Field
The present application relates to the field of speech recognition technologies, and in particular, to a method and an apparatus for training an acoustic model of a language, an electronic device, and a computer-readable storage medium.
Background
Automatic Speech Recognition (ASR) is an active research topic in the field of artificial intelligence. The purpose of speech recognition is to convert a speech signal into its corresponding text representation; the basic framework is shown in fig. 1. Acoustic features are first extracted from the speech signal, greatly compressing the information and converting it into a form that a machine can discriminate more easily; the features are then sent to a decoder, which decodes them into a recognition result. The decoder relies on the combined action of an acoustic model, a language model and a pronunciation dictionary to score the features and obtain the final decoded sequence.
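As a rough sketch of how a decoder combines these scores, the toy example below picks, among candidate hypotheses, the one maximizing the acoustic log-probability plus a weighted language-model log-probability. All numbers, the `lm_weight` value and the two-candidate list are illustrative assumptions, not the patent's implementation.

```python
import math

# Toy decoder scoring: combined score = acoustic log-prob + lm_weight * LM log-prob.
# Candidates, probabilities and lm_weight are illustrative assumptions.
def decode(candidates, lm_weight=0.8):
    best, best_score = None, -math.inf
    for text, am_logp, lm_logp in candidates:
        score = am_logp + lm_weight * lm_logp
        if score > best_score:
            best, best_score = text, score
    return best

candidates = [
    ("hello world", -12.0, -3.0),   # (hypothesis, acoustic log-prob, LM log-prob)
    ("hollow word", -11.5, -6.0),
]
print(decode(candidates))  # the LM score tips the choice toward "hello world"
```

Note that the acoustic model alone slightly prefers the second hypothesis; the language model's contribution is what selects the fluent one, which is why all three knowledge sources must act together.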
Training the acoustic model is vital, and one key task is selecting a suitable corpus for training: the selected corpus should cover the pronunciation phenomena of the language as completely as possible without making the data sparse. The pronunciation dictionary stores the mapping from words to pronunciations and is also the bridge connecting the acoustic model and the language model in traditional modeling methods. For languages with sufficient data resources, large amounts of training corpora and pronunciation dictionaries are usually obtained through manual labeling, or an end-to-end modeling method is adopted that models the acoustics directly with units such as characters, thereby removing the need for a pronunciation dictionary.
Although there is considerable research on speech recognition technology in China and abroad, the related work focuses mainly on widely used languages with abundant data resources, such as Chinese and English, for which the amount of speech data has gradually grown past tens of thousands or even hundreds of thousands of hours. In contrast, speech recognition for a low-resource target language (such as Tibetan) is much less studied; the scarcity of data resources also makes a pronunciation dictionary difficult to construct for such a language, so the threshold for related research is high, and existing work concentrates on a single dialect.
In existing target language speech recognition methods, the pronunciation dictionary is mostly constructed through manual labeling, and directly modeling acoustics on syllables or words with an end-to-end method rarely achieves the best performance on a small data set, so constructing a pronunciation dictionary remains important work. On the other hand, because corpus resources for the target language are scarce, it is difficult to record target language speech data in large quantities; the corpus scale is therefore small, its coverage of pronunciation phenomena is low, its balance is poor, and the recognition rate of an acoustic model trained on such a corpus is also low.
Disclosure of Invention
The application provides a method and a device for training an acoustic model of a language, electronic equipment and a computer-readable storage medium, which can solve the problems. The technical scheme is as follows:
in one aspect, a method for training an acoustic model of a language is provided, the method comprising:
Latinizing a text set of a target language to obtain a pronunciation dictionary of the target language;
generating a target language corpus based on the pronunciation dictionary and the text set;
the target language corpus comprises a voice corpus corresponding to the text set;
training a speech recognition model for recognizing the target language according to the target language corpus;
and retraining the speech recognition model based on dialect corpora respectively corresponding to various dialects of the target language, to obtain dialect speech recognition models for recognizing the various dialects.
Preferably, the text set comprises at least two target language texts;
the Latinizing of the preset text set of the target language to obtain the pronunciation dictionary of the target language includes:
segmenting the at least two target language texts at syllable delimiters to obtain at least two syllables;
counting the occurrence frequency of the at least two syllables in the text set, and taking a first preset number of syllables with the highest occurrence frequency as target syllables;
and Latinizing each target syllable to obtain a pronunciation sequence corresponding to each target syllable, and taking the set of the pronunciation sequences as the pronunciation dictionary of the target language.
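The dictionary-building steps above can be sketched as follows. The syllable delimiter, the stand-in `latinize` function and the sample texts are assumptions for illustration only; a real system would use the target language's actual delimiter and transcription rules.

```python
from collections import Counter

# Sketch of building a pronunciation dictionary: split texts into syllables at a
# delimiter, keep the N most frequent syllables, and Latinize each one with a
# (here, stand-in) transliteration function.
def build_pronunciation_dict(texts, delimiter="·", top_n=3, latinize=lambda s: s.lower()):
    counts = Counter()
    for text in texts:
        counts.update(syl for syl in text.split(delimiter) if syl)
    # first preset number (top_n) of syllables with the highest occurrence frequency
    target_syllables = [syl for syl, _ in counts.most_common(top_n)]
    return {syl: latinize(syl) for syl in target_syllables}

texts = ["KA·KHA·GA", "KA·GA·GA", "KHA·KA"]
print(build_pronunciation_dict(texts))
```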
Preferably, the generating a target language corpus based on the pronunciation dictionary and the text set comprises:
determining at least two target language sentences from the at least two target language texts; wherein any target language text comprises at least one target language sentence;
determining at least two target language sentences based on the pronunciation dictionary and the at least two target language sentences;
and generating a target language corpus based on the at least two target language sentences.
Preferably, the determining at least two target language sentences from the at least two target language texts includes:
de-duplicating the at least two target language texts to obtain at least two first target language texts;
regularizing the at least two first target language texts to obtain at least two second target language texts after regularization;
performing sentence segmentation on the at least two second target language texts to obtain at least two target language sentences;
and determining the target language sentences of which the syllable number exceeds a first syllable number threshold value and does not exceed a second syllable number threshold value in the at least two target language sentences.
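A minimal sketch of this screening pipeline, assuming placeholder sentence and syllable delimiters and toy syllable-count thresholds (the patent does not fix concrete values):

```python
import re

# De-duplicate texts, normalize whitespace, split into sentences, and keep only
# sentences whose syllable count lies between two thresholds.
def screen_sentences(texts, min_syllables=2, max_syllables=6,
                     sentence_delim="/", syllable_delim="·"):
    unique_texts = list(dict.fromkeys(texts))                  # de-duplication
    normalized = [re.sub(r"\s+", " ", t).strip() for t in unique_texts]
    sentences = [s for t in normalized for s in t.split(sentence_delim) if s]
    def n_syllables(s):
        return len([x for x in s.split(syllable_delim) if x])
    return [s for s in sentences if min_syllables <= n_syllables(s) <= max_syllables]

texts = ["KA·KHA/GA·NGA·CA·CHA·JA·NYA·TA",  # second sentence too long (7 syllables)
         "KA·KHA/GA·NGA·CA·CHA·JA·NYA·TA",  # duplicate text, removed
         "KA"]                              # too short (1 syllable)
print(screen_sentences(texts))
```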
Preferably, the determining at least two target language sentences based on the pronunciation dictionary and the at least two target language sentences includes:
performing triphone conversion on the at least two target language sentences by using the pronunciation dictionary, to obtain triphone sequences respectively corresponding to the at least two target language sentences;
calculating the information entropy of each triphone sequence, and taking the target language sentence corresponding to the triphone sequence with the largest information entropy as a selected target language sentence;
and, for the target language sentences other than the selected ones, repeating the steps of performing triphone conversion with the pronunciation dictionary, calculating the information entropy of each triphone sequence, and selecting the target language sentence whose triphone sequence has the largest information entropy, until the number of selected target language sentences reaches a second preset number.
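The entropy-based greedy selection can be sketched as below. Each "sentence" is represented directly by its triphone sequence (the pronunciation-dictionary conversion step is elided), and all data are illustrative.

```python
import math
from collections import Counter

# Shannon entropy of a triphone sequence, in bits.
def entropy(triphones):
    counts = Counter(triphones)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Greedily pick the sentence whose triphone sequence has the largest entropy,
# repeating on the remaining sentences until target_count sentences are chosen.
def select_sentences(triphone_seqs, target_count):
    remaining = list(triphone_seqs)
    selected = []
    while remaining and len(selected) < target_count:
        best = max(remaining, key=lambda seq: entropy(seq[1]))
        selected.append(best[0])
        remaining.remove(best)
    return selected

seqs = [
    ("sent A", ["a-b-c", "b-c-d", "a-b-c"]),           # 2 distinct triphones
    ("sent B", ["a-b-c", "b-c-d", "c-d-e", "d-e-f"]),  # 4 distinct triphones
]
print(select_sentences(seqs, target_count=1))  # sent B has the higher entropy
```

Sentences with more varied triphone coverage score higher, which matches the goal of maximizing pronunciation-phenomenon coverage with a small recording budget.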
Preferably, the generating a target language corpus based on the at least two target language sentences comprises:
carrying out audio recording on the at least two target language sentences to obtain audio data corresponding to the at least two target language sentences respectively;
and storing the at least two target language sentences and the audio data corresponding to the at least two target language sentences to obtain a target language corpus.
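A hypothetical shape for one corpus entry pairing a sentence with its recorded audio; the field names and the audio path below are assumptions, since the patent only specifies that the sentences and their audio data are stored together.

```python
# One corpus entry: a target language sentence plus a path to its recording.
corpus = [
    {
        "sentence": "KA·KHA",                 # placeholder target language sentence
        "audio_path": "recordings/0001.wav",  # placeholder path to the recording
    },
]

def lookup_audio(corpus, sentence):
    """Return the audio path stored for a given sentence, or None."""
    for entry in corpus:
        if entry["sentence"] == sentence:
            return entry["audio_path"]
    return None

print(lookup_audio(corpus, "KA·KHA"))  # recordings/0001.wav
```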
Preferably, the training of the speech recognition model for recognizing the target language according to the target language corpus comprises:
extracting, from each piece of audio data in the target language corpus, 40-dimensional Mel-frequency cepstral coefficient (MFCC) features and 100-dimensional identity vector (i-vector) features as acoustic features;
and training a preset Gaussian mixture model by adopting the acoustic features and each target language sentence in the target language corpus to obtain a speech recognition model of the target language.
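One common way to combine a per-frame 40-dimensional MFCC vector with an utterance-level 100-dimensional i-vector is simple concatenation, giving a 140-dimensional input per frame. The patent does not spell out the combination, so the sketch below is an assumption, with placeholder vectors standing in for real features.

```python
# Concatenate each frame's 40-dim MFCC vector with the single 100-dim
# utterance-level i-vector, yielding one 140-dim feature vector per frame.
def assemble_features(mfcc_frames, ivector):
    assert all(len(f) == 40 for f in mfcc_frames), "expected 40-dim MFCC frames"
    assert len(ivector) == 100, "expected a 100-dim i-vector"
    return [list(frame) + list(ivector) for frame in mfcc_frames]

mfcc_frames = [[0.0] * 40 for _ in range(3)]  # 3 placeholder frames
ivector = [0.1] * 100                         # placeholder i-vector
features = assemble_features(mfcc_frames, ivector)
print(len(features), len(features[0]))  # 3 frames, 140 dims each
```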
Preferably, the retraining the speech recognition model based on the dialect corpora respectively corresponding to the dialects of the target language, to obtain the dialect speech recognition models for recognizing the dialects, includes:
performing transfer learning on the speech recognition model by using the dialect corpora respectively corresponding to the dialects of the target language, to obtain dialect acoustic models that respectively correspond to the dialect corpora and recognize the dialects.
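The transfer-learning step can be illustrated minimally: each dialect model starts from a copy of the base model's parameters and is then updated on that dialect's corpus. The parameter dictionary and the stand-in update rule below are illustrative assumptions, not a real optimizer.

```python
import copy

# Start each dialect model from a copy of the base model's parameters and apply
# a stand-in per-sample update; a real system would run gradient-based training.
def fine_tune(base_params, dialect_corpus, lr=0.1):
    params = copy.deepcopy(base_params)   # base model stays untouched
    for sample in dialect_corpus:
        params["output_bias"] += lr * sample
    return params

base = {"output_bias": 0.0, "shared_weight": 1.0}
dialect_models = {
    name: fine_tune(base, corpus)
    for name, corpus in {"dialect_A": [1.0, 1.0], "dialect_B": [-1.0]}.items()
}
print(dialect_models["dialect_A"]["output_bias"])  # 0.2
print(base["output_bias"])                         # base model unchanged: 0.0
```

Deep-copying before updating is the key point: one base model yields an independent acoustic model per dialect, mirroring the patent's one-model-per-dialect outcome.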
Preferably, the method further comprises the following steps:
acquiring audio to be processed of the target language;
and performing speech recognition on the audio to be processed by using the at least one dialect speech recognition model, to obtain a corresponding target language text.
In another aspect, an apparatus for training an acoustic model of a language is provided, the apparatus including:
the first processing module is used for Latinizing a text set of a target language to obtain a pronunciation dictionary of the target language;
a second processing module for generating a target language corpus based on the pronunciation dictionary and the text set; the target language corpus comprises a voice corpus corresponding to the text set;
the third processing module is used for training a speech recognition model for recognizing the target language according to the target language corpus;
and the fourth processing module is used for retraining the speech recognition model based on dialect corpora respectively corresponding to various dialects of the target language, to obtain dialect speech recognition models for recognizing the various dialects.
Preferably, the text set comprises at least two target language texts;
the first processing module comprises:
the segmentation submodule is used for segmenting the at least two target language texts at syllable delimiters to obtain at least two syllables;
the statistics submodule is used for counting the occurrence frequency of the at least two syllables in the text set, and taking a first preset number of syllables with the highest occurrence frequency as target syllables;
and the conversion submodule is used for Latinizing each target syllable to obtain a pronunciation sequence corresponding to each target syllable, and taking the set of the pronunciation sequences as the pronunciation dictionary of the target language.
Preferably, the second processing module includes:
the first determining submodule is used for determining at least two target language sentences from the at least two target language texts; wherein any target language text comprises at least one target language sentence;
a second determining submodule for determining at least two target language sentences based on the pronunciation dictionary and the at least two target language sentences;
and the generation submodule is used for generating a target language corpus based on the at least two target language sentences.
Preferably, the first determination submodule includes:
the first filtering unit is used for carrying out duplication removal on the at least two target language texts to obtain at least two remaining first target language texts;
the regularization unit is used for regularizing the at least two first target language texts to obtain at least two second target language texts after regularization;
the segmentation unit is used for carrying out sentence segmentation on the at least two second target language texts to obtain at least two target language sentences;
and the second filtering unit is used for determining the target language sentences of which the syllable number exceeds the first syllable number threshold value and does not exceed the second syllable number threshold value in the at least two target language sentences.
Preferably, the second determination submodule includes:
the conversion unit is used for performing triphone conversion on the at least two target language sentences by using the pronunciation dictionary, to obtain triphone sequences respectively corresponding to the at least two target language sentences;
the calculation unit is used for calculating the information entropy of each triphone sequence, and taking the target language sentence corresponding to the triphone sequence with the largest information entropy as a selected target language sentence;
and repeatedly calling the conversion unit and the calculation unit aiming at other target language sentences except the target language sentences in the at least two target language sentences until the number of the target language sentences reaches a second preset number.
Preferably, the generating sub-module comprises:
the recording unit is used for recording the audio frequency of the at least two target language sentences to obtain audio data corresponding to the at least two target language sentences respectively;
and the storage unit is used for storing the at least two target language sentences and the audio data corresponding to the at least two target language sentences to obtain a target language corpus.
Preferably, the third processing module comprises:
the extraction submodule is used for extracting, from each piece of audio data in the target language corpus, 40-dimensional Mel-frequency cepstral coefficient (MFCC) features and 100-dimensional identity vector (i-vector) features as acoustic features;
and the training submodule is used for training a preset Gaussian mixture model with the acoustic features and each target language sentence in the target language corpus, to obtain a speech recognition model of the target language.
Preferably, the fourth processing module is specifically configured to: perform transfer learning on the speech recognition model by using the dialect corpora respectively corresponding to the dialects of the target language, to obtain dialect acoustic models that respectively correspond to the dialect corpora and recognize the dialects.
Preferably, the method further comprises the following steps:
the acquisition module is used for acquiring the audio to be processed of the target language;
and the recognition module is used for performing speech recognition on the audio to be processed by using the at least one dialect speech recognition model, to obtain a corresponding target language text.
In another aspect, an electronic device is provided, including:
a processor, a memory, and a bus;
the bus is used for connecting the processor and the memory;
the memory is used for storing operation instructions;
the processor is configured to call the operation instruction, and executing the instruction causes the processor to perform an operation corresponding to the language acoustic model training method shown in the first aspect of the present application.
In another aspect, a computer-readable storage medium is provided, which has a computer program stored thereon, and when the program is executed by a processor, the program implements the method for training an acoustic model of a language as shown in the first aspect of the present application.
The technical solutions provided by the present application bring the following beneficial effects:
in the embodiment of the invention, the text set of the target language is subjected to Latin to obtain a pronunciation dictionary of the target language, and then a target language corpus is generated based on the pronunciation dictionary and the text set; the target language corpus comprises a voice corpus corresponding to the text set; and training a voice recognition model for recognizing the target language according to the target language corpus, and respectively training the voice recognition model again on the basis of dialect corpora respectively corresponding to various dialects of the target language to obtain dialect voice recognition models for recognizing various dialects. Therefore, the pronunciation dictionary of the target language can be constructed through latin, manual marking is not needed, and a large amount of labor cost and time cost are saved; moreover, for dialects of different branches of the target language, dialect acoustic models corresponding to each dialect are obtained through preset multi-dialect corpus training, and the recognition rate of each dialect is improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the description of the embodiments of the present application will be briefly described below.
FIG. 1 is a diagram of a prior art speech recognition framework;
FIG. 2 is a schematic flow chart illustrating a method for training an acoustic model of a language according to an embodiment of the present application;
FIG. 3 is a schematic diagram of the structure of a Tibetan word according to the present application;
FIG. 4 is a diagram illustrating the effect of syllabification according to the present application;
FIG. 5 is a diagram illustrating transfer learning of an acoustic model of the present application;
FIG. 6 is a flowchart illustrating a method for processing languages based on various dialect speech recognition models according to another embodiment of the present application;
FIG. 7 is a schematic structural diagram of an acoustic model training apparatus for a language according to another embodiment of the present application;
fig. 8 is a schematic structural diagram of an electronic device for training an acoustic model of a language according to another embodiment of the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or to elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary, serve only to explain the present application, and are not to be construed as limiting the present application.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes all or any element and all combinations of one or more of the associated listed items.
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
The terms referred to in this application will first be introduced and explained:
artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
The artificial intelligence technology is a comprehensive subject and relates to the field of extensive technology, namely the technology of a hardware level and the technology of a software level. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.
The key technologies of Speech Technology are Automatic Speech Recognition (ASR), speech synthesis (TTS) and voiceprint recognition. Enabling computers to listen, see, speak and feel is the development direction of future human-computer interaction, and speech is expected to become one of the most promising human-computer interaction modes.
Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science integrating linguistics, computer science and mathematics; research in this field involves natural language, i.e. the language people use every day, so it is closely related to linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graphs, and the like.
The application provides a method, an apparatus, an electronic device and a computer readable storage medium for training and processing an acoustic model of a language, which aim to solve the above technical problems in the prior art.
The following describes the technical solutions of the present application and how to solve the above technical problems with specific embodiments. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
In one embodiment, a method for training an acoustic model of a language is provided, as shown in fig. 2, the method comprising:
step S201, performing Latin on a text set of a target language to obtain a pronunciation dictionary of the target language;
in practical applications, a world language is a language that serves as a medium of conversation in international communication, rather than a language unified throughout the world. In embodiments of the present invention, the target language may be any language other than the world languages. Furthermore, some widely used languages include local varieties (hereinafter "dialects") or different branch languages belonging to the same language family, and these dialects and branch languages likewise have little universal applicability; the target language in the embodiment of the present invention may therefore also include such dialects and branch languages, for example Cantonese and other dialects of Chinese, and Tibetan, a branch language of the Sino-Tibetan language family (Chinese also belongs to the Sino-Tibetan family).
Further, Latinization (or romanization) is a linguistic term referring to the process of converting a phonetic writing system that does not use Latin (Roman) letters into a Latin-letter writing system.
Wherein the text set may include at least two target language texts. After the text set is obtained, latin conversion can be performed on each target language text of the text set, so as to obtain a pronunciation dictionary of the target language.
Step S202, generating a target language corpus based on the pronunciation dictionary and the text set; the target language corpus comprises voice corpora corresponding to the text set;
after the pronunciation dictionary of the target language is constructed, a target language corpus can be generated based on the pronunciation dictionary and the acquired text set of the target language. The target language corpus comprises at least one target language corpus, and one target language corpus comprises a target language text and pronunciations corresponding to the entry slogan text.
Step S203, training a voice recognition model for recognizing the target language according to the target language corpus;
and step S204, retraining the speech recognition model based on dialect corpora respectively corresponding to various dialects of the target language, to obtain dialect speech recognition models for recognizing the various dialects.
After the target language corpus is constructed, the target language corpus and at least one preset dialect corpus can be adopted to train to obtain dialect acoustic models corresponding to each dialect corpus. Wherein any dialect is a branch language of the target language.
In the embodiment of the invention, a text set of a target language is Latinized to obtain a pronunciation dictionary of the target language, and a target language corpus is then generated based on the pronunciation dictionary and the text set; the target language corpus comprises speech corpora corresponding to the text set; a speech recognition model for recognizing the target language is trained on the target language corpus, and the speech recognition model is then retrained on dialect corpora corresponding to various dialects of the target language, to obtain dialect speech recognition models for recognizing the various dialects. Thus, the pronunciation dictionary of the target language can be constructed through Latinization without manual labeling, which saves a large amount of labor cost and time cost; moreover, for dialects of different branches of the target language, a dialect acoustic model corresponding to each dialect is obtained through training on a preset multi-dialect corpus, which improves the recognition rate for each dialect.
In another embodiment, the method for training an acoustic model of a language as shown in FIG. 2 is described in detail.
Step S201, Latinizing a text set of a target language to obtain a pronunciation dictionary of the target language;
in practical applications, world languages are languages that serve as a medium of conversation in international communication, not languages used uniformly throughout the world. No fully universal language exists in today's world, only languages with varying degrees of currency. English serves as a conversational medium to a large extent, yet a considerable amount of international interaction does not use English as its medium, so even English does not hold an absolute position. Other widely used languages include French, Chinese, Spanish, Russian and Arabic; Chinese has the largest number of users, but its main range of use is China, Singapore, Malaysia and Brunei, so its worldwide currency is comparatively small.
In embodiments of the present invention, the target language may be any language other than these world languages. Furthermore, some widely used languages include local variants (hereinafter "dialects") or distinct branch languages belonging to the same language family, and these dialects and branch languages likewise have limited currency, so the target language in the embodiments of the present invention may also include such dialects and branch languages, for example Cantonese and the other dialects of Chinese, or Tibetan, a branch language of the Sino-Tibetan language family (Chinese is also a branch of the Sino-Tibetan family).
For convenience of description, the embodiments of the present invention are described in detail with Tibetan as the target language.
Furthermore, Latinization (or romanization) is a linguistic term referring to the process of converting a writing system not based on Latin (Roman) letters into a Latin-letter system; that is, the non-Latin characters of the source system are faithfully transliterated, one by one, into Latin characters (including diacritics and digraphs) according to the rules and transcription tables of the transcription scheme.
Tibetan is an alphabetic script consisting of 30 consonant letters and 4 vowel letters. Fig. 3 shows a Tibetan syllable, the basic ideographic unit of Tibetan. A Tibetan syllable is built around a root letter: a prefix letter, a suffix letter and a post-suffix letter may appear before and after the root, and a superscript letter, an upper vowel sign, a subscript letter and a lower vowel sign may appear above and below it; the added letters are all consonants. Syllables are separated by the tsheg mark, and a word is composed of one or more syllables, i.e., a word may comprise at least one syllable.
If a word-level pronunciation dictionary is to be constructed, Tibetan word segmentation is needed; but with Tibetan being a low-resource language, collecting a large amount of Tibetan text and then segmenting and labeling it is difficult. Since a Tibetan word consists of at least one syllable and the syllables are clearly delimited by tsheg marks, the collected Tibetan text can instead be segmented into syllables; this keeps the number of common Tibetan syllables manageable and, unlike word segmentation, requires no word-boundary detection.
For ease of understanding, consider an example with Chinese characters. For the Chinese text "今天天气晴朗" ("the weather is fine today"), word segmentation yields the three words "今天" (today), "天气" (weather) and "晴朗" (fine), whereas character segmentation simply yields the individual characters one by one. Word segmentation clearly requires a large amount of computation, which character segmentation avoids.
In Tibetan, one syllable is analogous to one Chinese character, and one word to one Chinese word; therefore, segmenting Tibetan into syllables likewise omits the word segmentation step.
The text set may include at least two target language texts, such as a large number of Tibetan texts. The text set may be obtained by collecting Tibetan news, forum articles and the like from the network, by having users input Tibetan text, or in other ways; in practical applications it may be assembled according to actual conditions, which the embodiments of the present invention do not limit.
After the text set is obtained, Latinization can be performed on each target language text in the text set to obtain a pronunciation dictionary of the target language. For example, each Tibetan text in the Tibetan text set is Latinized on a syllable basis to obtain a Tibetan pronunciation dictionary.
In a preferred embodiment of the present invention, Latinizing the preset text set of the target language to obtain the pronunciation dictionary of the target language includes:
segmenting the at least two target language texts based on the tsheg mark to obtain at least two syllables;
counting the occurrence frequency of the at least two syllables in the text set, and taking a first preset number of the most frequent syllables as target syllables;
and Latinizing each target syllable to obtain a pronunciation sequence corresponding to each target syllable, and taking the set of pronunciation sequences as the pronunciation dictionary of the target language.
Specifically, each target language text in the text set may be segmented on the tsheg marks to obtain all syllables occurring in the text set; the occurrence frequency of each syllable in the text set is then counted, and a first preset number of the most frequent syllables are taken as the target syllables.
For example, if the text set includes 30,000 Tibetan texts, segmenting them on the tsheg marks yields 8,000 distinct syllables; the occurrence frequencies of these 8,000 syllables across the 30,000 texts are counted, and the 6,000 most frequent syllables are taken as the target syllables.
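The tsheg-based syllable counting described above can be sketched in a few lines of Python. This is an illustrative sketch, not the patent's implementation; the ASCII "syllables" below are placeholders standing in for real Tibetan syllables, joined by the actual tsheg character (U+0F0B):

```python
from collections import Counter

TSHEG = "\u0F0B"  # Tibetan intersyllabic mark (tsheg)

def top_syllables(texts, top_n):
    """Split each text on the tsheg mark and keep the top_n most frequent syllables."""
    counts = Counter()
    for text in texts:
        counts.update(s for s in text.split(TSHEG) if s)
    return [syl for syl, _ in counts.most_common(top_n)]

# Toy demonstration with placeholder ASCII "syllables":
texts = ["ka\u0F0Bkha\u0F0Bga", "ka\u0F0Bga", "ka\u0F0Bnga"]
print(top_syllables(texts, 2))  # -> ['ka', 'ga']
```

In a real pipeline, `texts` would hold the 30,000 collected Tibetan texts and `top_n` would be the first preset number (e.g., 6,000).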
After the target syllables are determined, each target syllable is Latinized to obtain its pronunciation sequence, and the set of pronunciation sequences is taken as the pronunciation dictionary of the target language. Although Tibetan is an alphabetic script, its written form and its modern pronunciation diverge considerably; the industry-accepted Wylie transliteration scheme is designed to transcribe Tibetan spelling accurately rather than its modern pronunciation, so a pronunciation dictionary is difficult to build from it directly. The embodiment of the invention therefore adopts the newer THL (Tibetan and Himalayan Library) Tibetan Latinization scheme, which refines the pronunciation handling of the Wylie scheme and, through a set of special rules, makes the Latinized text closer to modern Tibetan pronunciation. In the embodiment of the invention, the THL scheme is used to generate a pronunciation sequence for each collected target syllable, e.g. for the 6,000 syllables above, thereby constructing a Tibetan pronunciation dictionary of size 6,000, in which each Tibetan syllable has a corresponding sequence of phones.
For example, suppose 3 target syllables are obtained from a certain Tibetan text set by syllable segmentation and frequency statistics; the 3 target syllables are then Latinized using the THL scheme to obtain a pronunciation sequence for each, as shown in fig. 4, thereby constructing a Tibetan pronunciation dictionary.
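The dictionary construction can be pictured as a lookup over a transliteration table. The three-entry table below is a hypothetical fragment for illustration only; the real THL scheme involves many context-dependent rules (prefixes, superscripts, vowel signs) that a simple per-letter lookup does not capture:

```python
# Hypothetical mini transliteration table (real THL has many more rules).
THL_TABLE = {
    "\u0F40": "ka",   # Tibetan letter KA
    "\u0F41": "kha",  # Tibetan letter KHA
    "\u0F42": "ga",   # Tibetan letter GA
}

def latinize_syllable(syllable, table=THL_TABLE):
    """Transliterate a syllable letter by letter; unknown letters map to '?'."""
    return "-".join(table.get(ch, "?") for ch in syllable)

def build_pronunciation_dict(target_syllables):
    """Map each target syllable to its Latinized pronunciation sequence."""
    return {syl: latinize_syllable(syl) for syl in target_syllables}

print(build_pronunciation_dict(["\u0F40", "\u0F41\u0F42"]))
```

The returned mapping plays the role of the pronunciation dictionary: each syllable keys a phone sequence.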
Step S202, generating a target language corpus based on the pronunciation dictionary and the text set; the target language corpus comprises speech corpora corresponding to the text set;
after the pronunciation dictionary of the target language is constructed, a target language corpus can be generated based on the pronunciation dictionary and the acquired text set of the target language. The target language corpus comprises at least one target language corpus item, and each corpus item comprises a target language text and the pronunciation corresponding to that text. For ease of understanding, consider a Chinese example: a Chinese corpus item would include the Chinese text "hello" and the pronunciation of "hello".
In a preferred embodiment of the present invention, generating the target language corpus based on the pronunciation dictionary and the text set includes:
determining at least two target language sentences from the at least two target language texts; wherein any target language text comprises at least one target language sentence;
selecting at least two target language sentences from the determined sentences based on the pronunciation dictionary;
a target language corpus is generated based on the at least two target language sentences.
When generating the target language corpus, at least two target language sentences can first be determined from the target language texts in the text set, wherein any target language text comprises at least one target language sentence. For example, the Tibetan text set includes 30,000 Tibetan texts, which together contain 40,000 Tibetan sentences.
At least two target language sentences are then selected from these sentences based on the pronunciation dictionary, and the target language corpus is generated from the selected sentences. For example, 20,000 Tibetan sentences are selected from the 40,000 Tibetan sentences based on the pronunciation dictionary, and the Tibetan corpus is generated from those 20,000 sentences.
In a preferred embodiment of the present invention, determining at least two target language sentences from the at least two target language texts comprises:
de-duplicating the at least two target language texts to obtain at least two remaining first target language texts;
regularizing the at least two first target language texts to obtain at least two second target language texts after regularization;
performing sentence segmentation on the at least two second target language texts to obtain at least two target language sentences;
and retaining those target language sentences, among the at least two target language sentences, whose syllable count exceeds a first syllable number threshold and does not exceed a second syllable number threshold.
Specifically, since the obtained target language texts come from different sources, duplicates may exist; each target language text in the text set may therefore be de-duplicated first, keeping one copy of each repeated text, to obtain the at least two remaining first target language texts. The at least two first target language texts are then regularized to obtain at least two regularized second target language texts.
Sentence segmentation is then performed on the at least two second target language texts to obtain at least two target language sentences. In the embodiment of the invention, Tibetan text can be segmented into sentences on the single shad and double shad marks, which are roughly equivalent to punctuation marks in Chinese.
The number of syllables in each target language sentence is then determined, and the target language sentences whose syllable count exceeds the first syllable number threshold and does not exceed the second syllable number threshold are retained as the final target language sentences; that is, sentences whose syllable count does not exceed the first threshold or exceeds the second threshold are deleted. In the embodiment of the present invention, since one Tibetan word includes at least one syllable, one Tibetan sentence may include many syllables.
For example, Tibetan sentence A includes 3 words of 3 syllables each, Tibetan sentence B includes 4 words of 2 syllables each, and Tibetan sentence C includes 5 words of 5 syllables each; thus sentence A includes 9 syllables, sentence B includes 8 syllables, and sentence C includes 25 syllables. With a first syllable number threshold of 4 and a second syllable number threshold of 20, sentence C is deleted because its syllable count exceeds 20, while sentences A and B are retained.
This ensures that the selected target language sentences are of moderate length, neither too long nor too short. Of course, in practical applications the first and second thresholds may be set according to actual requirements, which the embodiment of the present invention does not limit.
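The screening steps above (de-duplication, regularization, sentence segmentation on the shad, and syllable-count filtering) can be sketched as one small pipeline. This is an illustrative sketch under the assumption that regularization means whitespace normalization; ASCII placeholders again stand in for Tibetan syllables:

```python
import re

SHAD = "\u0F0D"   # single shad, Tibetan sentence delimiter
TSHEG = "\u0F0B"  # intersyllabic mark

def select_sentences(texts, min_syllables, max_syllables):
    """Deduplicate texts, normalize whitespace, split into sentences on the
    shad, and keep sentences whose syllable count n satisfies min < n <= max."""
    seen, sentences = set(), []
    for text in texts:
        if text in seen:                            # step 1: de-duplication
            continue
        seen.add(text)
        text = re.sub(r"\s+", " ", text).strip()    # step 2: regularization
        for sent in text.split(SHAD):               # step 3: sentence split
            sent = sent.strip()
            n = len([s for s in sent.split(TSHEG) if s])
            if min_syllables < n <= max_syllables:  # step 4: length filter
                sentences.append(sent)
    return sentences
```

With the thresholds of the example (4 and 20), a 25-syllable sentence would be dropped while 8- and 9-syllable sentences are kept.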
In a preferred embodiment of the present invention, determining at least two target language sentences based on the pronunciation dictionary and the at least two target language sentences comprises:
performing triphone conversion on the at least two target language sentences using the pronunciation dictionary, to obtain triphone sequences respectively corresponding to the at least two target language sentences;
calculating the information entropy of each triphone sequence, and taking the target language sentence corresponding to the triphone sequence with the largest information entropy as a selected target language sentence;
and, for the remaining target language sentences other than the selected one, repeating the steps of performing triphone conversion using the pronunciation dictionary, calculating the information entropy of each triphone sequence, and taking the sentence corresponding to the triphone sequence with the largest information entropy as a selected target language sentence, until the number of selected target language sentences reaches a second preset number.
Specifically, after the final target language sentences are determined, the constructed pronunciation dictionary is used to perform triphone conversion on each of them, obtaining the triphone sequences corresponding to each target language sentence. Each target language sentence may correspond to one or more triphone subsequences.
A "phoneme" is the smallest segment that can be extracted from the continuous stream of speech; a phoneme can usually be regarded as the minimum unit of continuous speech. For example, in the pronunciation sequence in fig. 4, "k", "a" and "ng" are phonemes.
During continuous speech production, the movement of the articulatory organs causes coarticulation effects: a given phoneme in continuous speech is influenced by the M preceding and N following phonemes, so a phoneme pronounced in isolation differs noticeably from the same phoneme in continuous speech. Researchers in the speech recognition field have found that recognition systems using the triphone as the minimum unit perform well.
For ease of understanding, consider Chinese pinyin: when the phone "a" lies between "h" and "o", the resulting pronunciation is "hao"; when "a" lies between "d" and "o", it is "dao".
Further, triphone conversion can be performed by sliding a window over the phone sequence in order. Again taking Chinese pinyin as an example: the pronunciation sequence of "the present application" is "benshenqing", and triphone conversion yields the triphone sequences "ben", "ens", "nsh", "she", "hen" … "ing".
That is, assuming there are 50 phones in total, there are theoretically 50³ = 125,000 possible triphones; in practice, however, the occurrence probabilities of these triphones differ greatly, and some triphones never occur at all. If the triphones are unevenly distributed in the corpus, an acoustic model trained on that corpus will have a low recognition rate. The present application therefore proposes to balance the triphone distribution based on information entropy.
Information entropy describes the uncertainty of information, measured by the probability of occurrence: low-probability events occur rarely, carry high uncertainty, and thus convey a large amount of information, and vice versa. The information entropy measures the uncertainty of a random variable: let the random variable be X, with each value x occurring with probability p(x); the information entropy is then defined as:
H(X) = -∑_{x∈X} p(x) · log p(x)
where X is a target language sentence and x ranges over the triphone sequences in that sentence. For example, in the above example, X is "benshenqing" and x ranges over the triphone sequences "ben", "ens", "nsh", "she", "hen" … "ing".
After at least one triphone subsequence is obtained for each target language sentence, the occurrence frequency of each triphone subsequence in the text set is counted, the information entropy of each sentence's triphone sequence is calculated from these frequencies, and the target language sentence corresponding to the triphone sequence with the largest information entropy is selected.
That selected sentence is then removed from the final target language sentences, and for the remaining sentences the steps of performing triphone conversion with the pronunciation dictionary, calculating the information entropy of each triphone sequence, and selecting the sentence with the largest information entropy are repeated, until the number of selected target language sentences reaches the second preset number.
For example, assume the second preset number is 30,000. After 60,000 final Tibetan sentences are obtained, triphone conversion is performed on each sentence and the information entropy is calculated; the Tibetan sentence whose triphone sequence has the largest entropy becomes the first selected Tibetan sentence. That sentence is then removed from the 60,000, leaving 59,999, and the steps are repeated to determine the second, third and further selected sentences, until 30,000 target Tibetan sentences have been determined.
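The entropy-based greedy selection can be sketched as follows. This is an illustrative reading of the procedure, assuming each sentence is given as a phone list and that p(x) is estimated from triphone frequencies over the currently remaining sentences:

```python
import math
from collections import Counter

def triphones(phones):
    """All length-3 sliding windows over a phone sequence."""
    return [tuple(phones[i:i + 3]) for i in range(len(phones) - 2)]

def sentence_entropy(tri_seq, tri_counts, total):
    """H = -sum p(x) log p(x) over the distinct triphones of one sentence,
    with p(x) estimated from triphone frequencies in the whole pool."""
    return -sum((tri_counts[t] / total) * math.log(tri_counts[t] / total)
                for t in set(tri_seq))

def greedy_select(sentences, target_count):
    """sentences: list of (text, phone_list). Repeatedly pick the sentence
    whose triphone sequence has the largest entropy, then remove it."""
    remaining = list(sentences)
    selected = []
    while remaining and len(selected) < target_count:
        tri_counts = Counter(t for _, p in remaining for t in triphones(p))
        total = sum(tri_counts.values())
        best = max(remaining,
                   key=lambda s: sentence_entropy(triphones(s[1]), tri_counts, total))
        selected.append(best[0])
        remaining.remove(best)
    return selected
```

In the Tibetan example, `sentences` would hold the 60,000 filtered sentences with pronunciations from the dictionary, and `target_count` would be 30,000.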
In a preferred embodiment of the present invention, generating a target language corpus based on at least two target language sentences comprises:
carrying out audio recording on at least two target language sentences to obtain audio data corresponding to the at least two target language sentences respectively;
and storing at least two target language sentences and the audio data corresponding to the target language sentences to obtain a target language corpus.
Specifically, after the target language sentences are obtained, an audio recording of each sentence can be made to obtain the corresponding audio data. The recordings may be made manually, drawn from a sound source library, or obtained in other ways, set according to actual requirements in practical applications; the embodiment of the present invention does not limit this.
After recording, each target language sentence and its corresponding audio data are stored as one corpus item, and the resulting corpus items together form the target language corpus.
For example, for the obtained Tibetan text set: a number of Tibetan sentences are determined from the Tibetan texts; each sentence is converted into its triphone subsequences; the information entropy of each triphone sequence is calculated; the Tibetan sentence whose triphone sequence has the largest entropy is selected as a target Tibetan sentence and removed from the pool; and these steps are repeated on the remaining sentences until the preset number of target Tibetan sentences is reached. Audio is then recorded for each target Tibetan sentence, and each sentence together with its audio data is stored as one corpus item, yielding the Tibetan corpus.
Step S203, training a speech recognition model for recognizing the target language according to the target language corpus;
after the target language corpus is constructed, a speech recognition model for recognizing the target language can be trained on it.
In a preferred embodiment of the present invention, training the speech recognition model for recognizing the target language according to the target language corpus comprises:
extracting 40-dimensional Mel-frequency cepstral coefficient features and 100-dimensional identity vector features from each piece of audio data in the target language corpus as acoustic features;
and training a preset Gaussian mixture model using the acoustic features and each target language sentence in the target language corpus, to obtain the speech recognition model of the target language.
Specifically, 40-dimensional Mel-Frequency Cepstral Coefficient (MFCC) features and 100-dimensional identity vector (i-vector) features may be extracted from each piece of audio data in the target language corpus. A preset Gaussian Mixture Model (GMM) is then trained on these acoustic features and the target language sentences; the GMM is used to align the acoustic features, and further training yields a deep neural network baseline acoustic model of the target language, which serves as the speech recognition model of the target language.
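As a rough illustration of the GMM stage, the sketch below fits a diagonal-covariance Gaussian mixture with EM and hard-assigns each feature frame to a component, standing in for the alignment step. This is a toy sketch on synthetic 2-D features; a real system would use a full MFCC/i-vector front end and far larger models:

```python
import numpy as np

def fit_diag_gmm(X, k, iters=20):
    """Minimal EM for a diagonal-covariance GMM over feature frames X (n, d)."""
    n, d = X.shape
    means = X[np.linspace(0, n - 1, k).astype(int)]  # spread initial means
    var = np.ones((k, d))
    w = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: log responsibilities under each diagonal Gaussian
        log_p = (-0.5 * (((X[:, None, :] - means) ** 2) / var
                         + np.log(2 * np.pi * var)).sum(-1) + np.log(w))
        log_p -= log_p.max(1, keepdims=True)
        r = np.exp(log_p)
        r /= r.sum(1, keepdims=True)
        # M-step: re-estimate weights, means and variances
        nk = r.sum(0)
        w = nk / n
        means = (r.T @ X) / nk[:, None]
        var = (r.T @ (X ** 2)) / nk[:, None] - means ** 2 + 1e-6
    return w, means, var

def align(X, means, var, w):
    """Hard-assign each frame to its most likely Gaussian (a crude alignment)."""
    log_p = (-0.5 * (((X[:, None, :] - means) ** 2) / var
                     + np.log(2 * np.pi * var)).sum(-1) + np.log(w))
    return log_p.argmax(1)
```

The frame-to-component assignments produced by `align` play the role of the alignments that supervise the subsequent deep neural network training.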
Step S204, retraining the speech recognition model based on the dialect corpora respectively corresponding to the various dialects of the target language, to obtain dialect speech recognition models for recognizing the various dialects.
After the speech recognition model is trained, the dialect speech recognition model corresponding to each dialect corpus can be obtained by training with at least one dialect corpus, where each dialect is a branch language of the target language.
For example, Tibetan mainly comprises three dialect branches: the Ü-Tsang (Weizang) dialect, the Kham (Kangba) dialect and the Amdo (Anduo) dialect. Because the pronunciations of these three dialects differ substantially, users in a given dialect area mostly care only about the recognition accuracy for their own dialect; corresponding dialect acoustic models therefore need to be trained separately for the three main dialects.
In a preferred embodiment of the present invention, retraining the speech recognition model based on the dialect corpora respectively corresponding to the various dialects of the target language to obtain the dialect speech recognition models comprises:
performing transfer learning on the speech recognition model with the dialect corpus of each dialect of the target language, to obtain, for each dialect corpus, a dialect acoustic model that recognizes that dialect.
Specifically, transfer learning is performed on the speech recognition model with a given dialect corpus while the learning rate is reduced to a preset proportion, yielding the dialect speech recognition model for that dialect; repeating this for each dialect yields a speech recognition model for every dialect.
During transfer learning, the output layer of the target language speech recognition model is replaced by an output layer for the dialect, while the hidden-layer weight matrices of the deep neural network are shared, as shown in fig. 5.
For example, for Tibetan, a deep neural network is first trained on the Tibetan corpus to obtain the Tibetan speech recognition model; the Tibetan model is then trained with the Ü-Tsang dialect corpus to obtain the Ü-Tsang dialect speech recognition model, with the Kham dialect corpus to obtain the Kham dialect speech recognition model, and with the Amdo dialect corpus to obtain the Amdo dialect speech recognition model.
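The output-layer swap can be pictured with a toy network: the hidden-layer weights learned on the target language are copied into the dialect model, and only the output layer is freshly initialized before fine-tuning at a reduced learning rate. The layer sizes below are hypothetical placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

class TinyAcousticNet:
    """Toy one-hidden-layer network standing in for the DNN acoustic model."""
    def __init__(self, n_in, n_hidden, n_out):
        self.W1 = rng.normal(0, 0.1, (n_in, n_hidden))   # shared hidden layer
        self.W2 = rng.normal(0, 0.1, (n_hidden, n_out))  # task-specific output layer

    def hidden(self, x):
        return np.tanh(x @ self.W1)

def transfer(base, n_dialect_out):
    """Keep the hidden-layer weights of the base (target language) model and
    replace the output layer with a freshly initialized dialect output layer."""
    dialect = TinyAcousticNet(base.W1.shape[0], base.W1.shape[1], n_dialect_out)
    dialect.W1 = base.W1.copy()          # shared hidden representation
    return dialect

base = TinyAcousticNet(40, 64, 300)      # e.g. 40-dim MFCC input (toy sizes)
dialect_model = transfer(base, 250)      # hypothetical dialect output size
```

After the swap, `dialect_model` would be fine-tuned on the dialect corpus with the learning rate reduced to the preset proportion, so that the shared hidden layers change only slightly.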
Of course, in practical application, the extracted features, the dimensions of the features, the types of the acoustic models, and the training method of the acoustic models may be set according to actual needs, which is not limited in the embodiment of the present invention.
In the embodiment of the invention, a text set of a target language is Latinized to obtain a pronunciation dictionary of the target language, and a target language corpus is then generated based on the pronunciation dictionary and the text set; the target language corpus comprises speech corpora corresponding to the text set. A speech recognition model for recognizing the target language is trained on the target language corpus, and the speech recognition model is then retrained on dialect corpora corresponding to the various dialects of the target language, yielding dialect speech recognition models for recognizing the various dialects. In this way, the pronunciation dictionary of the target language can be constructed through Latinization without manual labeling, saving a large amount of labor and time; moreover, for the dialects of the different branches of the target language, a dialect speech recognition model corresponding to each dialect is obtained by training on the preset multi-dialect corpora, improving the recognition rate for each dialect.
Furthermore, when the target language corpus is constructed, converting sentences into triphone subsequences and calculating their information entropy ensures balanced coverage of pronunciation phenomena even when the amount of target language text is small, in particular increasing the number of infrequent pronunciation phenomena. This greatly improves the accuracy of the speech recognition model trained on the corpus and raises the recognition rate.
In another embodiment, a language processing method based on various dialect speech recognition models is provided, as shown in fig. 6, the method comprising:
step S601, acquiring audio to be processed of a target language;
specifically, the audio to be processed in the target language may be obtained through an application client, an applet or another type of program, and may be obtained by converting sound collected by an audio processing device. Such programs can be installed on a terminal, which may have the following characteristics:
(1) on a hardware architecture, a device has a central processing unit, a memory, an input unit and an output unit, that is, the device is often a microcomputer device having a communication function. In addition, various input modes such as a keyboard, a mouse, a touch screen, a microphone, a camera and the like can be provided, and input can be adjusted as required. Meanwhile, the equipment often has a plurality of output modes, such as a telephone receiver, a display screen and the like, and can be adjusted according to needs;
(2) on a software system, the device must have an operating system, such as Windows Mobile, Symbian, Palm, Android, iOS, and the like. Meanwhile, the operating systems are more and more open, and personalized application programs developed based on the open operating system platforms are infinite, such as a communication book, a schedule, a notebook, a calculator, various games and the like, so that the requirements of personalized users are met to a great extent;
(3) in terms of communication capacity, the device has flexible access modes and high-bandwidth communication performance, and can automatically adjust the selected communication mode according to the selected service and the environment, making it convenient for users. The device can support GSM (Global System for Mobile Communication), WCDMA (Wideband Code Division Multiple Access), CDMA2000 (Code Division Multiple Access 2000), TD-SCDMA (Time Division-Synchronous Code Division Multiple Access), Wi-Fi (Wireless Fidelity), WiMAX (Worldwide Interoperability for Microwave Access) and the like, and is thus suited to various types of networks, supporting not only voice services but also various wireless data services;
(4) In function, the device emphasizes usability, personalization, and multi-functionality. With the development of computer technology, devices have moved from a device-centered to a human-centered model, integrating embedded computing, control technology, artificial intelligence, and biometric authentication, fully embodying a human-oriented design. Thanks to advances in software technology, the device can be configured to individual needs and is increasingly personalized; at the same time, its integrated software and hardware make it increasingly powerful.
Step S602: performing speech recognition on the audio to be processed using at least one dialect speech recognition model to obtain the corresponding target language text; the at least one dialect speech recognition model is trained by the acoustic model training method of steps S201 to S203.
Specifically, after acquiring the audio to be processed, the program may perform speech recognition on it using at least one preset dialect speech recognition model to obtain the target language text corresponding to the audio. For example, after acquiring a piece of Tibetan audio, the program may perform speech recognition on it with a Ü-Tsang dialect speech recognition model, a Kham dialect speech recognition model, and an Amdo dialect speech recognition model respectively, thereby obtaining the corresponding Tibetan text.
The at least one dialect speech recognition model is obtained through the training of steps S101 to S103, which is not repeated here.
In this embodiment of the invention, when audio to be processed in the target language is acquired, speech recognition is performed on it using at least one preset dialect speech recognition model to obtain the corresponding target language text. Thus, for audio in the target language, a dialect speech recognition model corresponding to each dialect is obtained through training on the preset multi-dialect corpora, improving the recognition rate for the target language and, in particular, for each of its dialects.
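As an illustration of the decoding step above, the following is a minimal Python sketch: the audio is decoded by every dialect model and the highest-confidence hypothesis is kept. The DialectModel class, its recognize() interface, and the confidence scores are hypothetical stand-ins for illustration, not an API defined by this application.

```python
class DialectModel:
    """Illustrative stand-in for a trained dialect speech recognition model."""

    def __init__(self, dialect, transcript, confidence):
        self.dialect = dialect
        self._transcript = transcript
        self._confidence = confidence

    def recognize(self, audio):
        # A real model would decode the audio; here a fixed
        # (transcript, confidence) pair is returned for illustration.
        return self._transcript, self._confidence


def recognize_with_dialect_models(audio, models):
    """Decode with every dialect model and keep the highest-confidence hypothesis."""
    hypotheses = [(m.dialect, *m.recognize(audio)) for m in models]
    dialect, text, _score = max(hypotheses, key=lambda h: h[2])
    return dialect, text


# Hypothetical models for the three Tibetan dialect branches.
models = [
    DialectModel("u-tsang", "text-a", 0.72),
    DialectModel("kham", "text-b", 0.91),
    DialectModel("amdo", "text-c", 0.65),
]
result = recognize_with_dialect_models(b"raw-audio-bytes", models)
```

In practice each model would emit its own transcript, and a confidence- or language-model-based rescoring step would pick the final hypothesis; the fixed scores above only illustrate the dispatch.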
Fig. 7 is a schematic structural diagram of an acoustic model training apparatus for a language according to another embodiment of the present application. As shown in Fig. 7, the apparatus of this embodiment may include:
a first processing module 701, configured to perform latinization on a text set of a target language to obtain a pronunciation dictionary of the target language;
a second processing module 702, configured to generate a target language corpus based on the pronunciation dictionary and the text set;
a third processing module 703, configured to train a speech recognition model for recognizing the target language according to the target language corpus;
and a fourth processing module 704, configured to retrain the speech recognition model separately on the dialect corpora corresponding to each dialect of the target language, obtaining a dialect speech recognition model for recognizing each dialect.
In a preferred embodiment of the present invention, the text set comprises at least two target language texts;
the first processing module includes:
a segmentation submodule, configured to segment the at least two target language texts based on a syllable delimiter to obtain at least two syllables;
a statistics submodule, configured to count the frequency of occurrence of each of the at least two syllables in the text set, and to take the first preset number of syllables with the highest frequency among them as target syllables;
and a conversion submodule, configured to perform latinization on each target syllable to obtain a pronunciation sequence corresponding to each target syllable, and to take the set of pronunciation sequences as the pronunciation dictionary of the target language.
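The submodules above can be sketched as follows, under stated assumptions: texts are split on the Tibetan syllable delimiter (the tsheg, U+0F0B), syllable frequencies are counted, the most frequent syllables are kept, and each is mapped to a latinized pronunciation sequence. The toy_romanize() mapping is a made-up stand-in for a real Wylie-style transliteration scheme.

```python
from collections import Counter

TSHEG = "\u0f0b"  # Tibetan syllable delimiter (tsheg)


def build_pronunciation_dict(texts, top_n, romanize):
    """Segment texts into syllables, keep the top_n most frequent,
    and map each target syllable to its latinized pronunciation sequence."""
    counts = Counter()
    for text in texts:
        counts.update(s for s in text.split(TSHEG) if s)
    target_syllables = [syl for syl, _ in counts.most_common(top_n)]
    return {syl: romanize(syl) for syl in target_syllables}


def toy_romanize(syllable):
    # Illustrative stand-in: one latin token per character code point.
    return " ".join(f"u{ord(ch):04x}" for ch in syllable)


# ASCII stand-ins for Tibetan syllables, joined by the tsheg.
texts = [TSHEG.join(["ka", "kha", "ka"]), TSHEG.join(["ka", "ga"])]
pron_dict = build_pronunciation_dict(texts, top_n=2, romanize=toy_romanize)
```

A real pipeline would substitute an actual transliteration table for toy_romanize(); the dictionary-building logic (segment, count, truncate, romanize) is the part described by the patent.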
In a preferred embodiment of the present invention, the second processing module includes:
a first determining submodule, configured to determine at least two candidate target language sentences from the at least two target language texts, wherein any target language text comprises at least one target language sentence;
a second determining submodule, configured to select, based on the pronunciation dictionary, at least two target language sentences from the candidate sentences;
and a generation submodule, configured to generate the target language corpus based on the at least two selected target language sentences.
In a preferred embodiment of the present invention, the first determination submodule includes:
a first filtering unit, configured to de-duplicate the at least two target language texts to obtain at least two remaining first target language texts;
a regularization unit, configured to regularize the at least two first target language texts to obtain at least two regularized second target language texts;
a segmentation unit, configured to segment the at least two second target language texts into sentences to obtain at least two target language sentences;
and a second filtering unit, configured to retain, from the at least two target language sentences, those whose syllable count exceeds a first syllable-count threshold and does not exceed a second syllable-count threshold.
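A minimal sketch of this four-stage filtering pipeline, assuming the Tibetan tsheg (U+0F0B) as syllable delimiter and shad (U+0F0D) as sentence delimiter; the regularization step is reduced to whitespace normalization here, and the thresholds are illustrative:

```python
import re

TSHEG, SHAD = "\u0f0b", "\u0f0d"  # Tibetan syllable and sentence delimiters


def select_sentences(texts, min_syllables=3, max_syllables=30):
    """De-duplicate, normalize, sentence-split, and length-filter texts."""
    deduped = list(dict.fromkeys(texts))                      # (1) remove duplicates
    normalized = [re.sub(r"\s+", " ", t).strip() for t in deduped]  # (2) regularize
    sentences = []
    for text in normalized:                                    # (3) split into sentences
        sentences.extend(s for s in text.split(SHAD) if s)
    kept = []
    for s in sentences:                                        # (4) syllable-count filter
        n = len([syl for syl in s.split(TSHEG) if syl])
        if min_syllables < n <= max_syllables:
            kept.append(s)
    return kept


# ASCII stand-ins: two identical texts, each with one long and one short sentence.
sample = [TSHEG.join("abcd") + SHAD + "e"] * 2
kept_sentences = select_sentences(sample)
```

With the illustrative thresholds, the duplicate text is collapsed, the four-syllable sentence survives, and the one-syllable sentence is discarded.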
In a preferred embodiment of the present invention, the second determination submodule includes:
a conversion unit, configured to perform triphone conversion on the at least two target language sentences using the pronunciation dictionary, obtaining a triphone sequence corresponding to each sentence;
a calculation unit, configured to calculate the information entropy of each triphone sequence and to take the sentence corresponding to the triphone sequence with the largest information entropy as a target language sentence;
and the conversion unit and the calculation unit are invoked repeatedly on the sentences not yet selected, until the number of selected target language sentences reaches a second preset number.
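The entropy-driven selection can be sketched as a greedy loop: each candidate sentence is expanded into triphones (with a silence padding context), the Shannon entropy of its triphone sequence is computed, and the maximum-entropy sentence is selected repeatedly until the preset count is reached. The phone sequences below are illustrative stand-ins for dictionary-derived pronunciations.

```python
from collections import Counter
from math import log2


def triphones(phones):
    """Expand a phone sequence into overlapping triphones with silence padding."""
    padded = ["sil"] + list(phones) + ["sil"]
    return [tuple(padded[i:i + 3]) for i in range(len(padded) - 2)]


def entropy(seq):
    """Shannon entropy (bits) of the empirical distribution of seq."""
    counts = Counter(seq)
    total = sum(counts.values())
    return -sum(c / total * log2(c / total) for c in counts.values())


def select_by_entropy(candidates, n):
    """candidates: list of (sentence, phone_sequence) pairs.
    Greedily pick the n sentences with maximum triphone-sequence entropy."""
    pool = list(candidates)
    selected = []
    while pool and len(selected) < n:
        best = max(pool, key=lambda c: entropy(triphones(c[1])))
        selected.append(best[0])
        pool.remove(best)
    return selected


cands = [
    ("s1", ["a", "a", "a", "a"]),   # repetitive: few distinct triphones
    ("s2", ["a", "b", "c", "d"]),   # varied: all triphones distinct
]
selected = select_by_entropy(cands, 1)
```

The varied sentence wins because its triphone distribution is more uniform, which is exactly the coverage-balancing effect the patent attributes to this step.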
In a preferred embodiment of the present invention, the generating sub-module includes:
a recording unit, configured to record audio for the at least two target language sentences, obtaining audio data corresponding to each of them;
and a storage unit, configured to store the at least two target language sentences together with their corresponding audio data, obtaining the target language corpus.
In a preferred embodiment of the present invention, the third processing module includes:
an extraction submodule, configured to extract, from each piece of audio data in the target language corpus, 40-dimensional Mel-frequency cepstral coefficient (MFCC) features and 100-dimensional i-vector (identity vector) features as acoustic features;
and a training submodule, configured to train a preset Gaussian mixture model using the acoustic features and each target language sentence in the target language corpus, obtaining the speech recognition model of the target language.
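The training step above fits a Gaussian mixture model on acoustic features (40-dimensional MFCCs plus 100-dimensional i-vectors in this application). As a hedged, much-simplified stand-in, the following fits a two-component one-dimensional GMM by expectation-maximization on synthetic features; a real system would use a GMM-HMM toolkit over the full feature vectors.

```python
import numpy as np


def fit_gmm_1d(x, n_iter=50):
    """Fit a two-component 1-D Gaussian mixture with plain EM."""
    # Deterministic initialization: one component at each extreme of the data.
    mu = np.array([x.min(), x.max()])
    var = np.array([x.var(), x.var()])
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibility of each component for each sample.
        diff = x[:, None] - mu[None, :]                                # (n, 2)
        lik = pi * np.exp(-0.5 * diff**2 / var) / np.sqrt(2 * np.pi * var)
        resp = lik / lik.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and variances.
        nk = resp.sum(axis=0)
        pi = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu[None, :])**2).sum(axis=0) / nk
    return pi, mu, var


# Synthetic "features": two well-separated Gaussian clusters.
rng = np.random.default_rng(1)
features = np.concatenate([rng.normal(-3.0, 0.5, 500), rng.normal(3.0, 0.5, 500)])
weights, means, variances = fit_gmm_1d(features)
```

The recovered means land near the true cluster centers; the same EM principle, extended to many components over high-dimensional MFCC/i-vector features and tied to HMM states, underlies the acoustic model described here.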
Preferably, the fourth processing module is specifically configured to perform transfer learning on the speech recognition model using the dialect corpus of each dialect of the target language, obtaining, for each dialect corpus, a dialect acoustic model that recognizes the corresponding dialect.
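The transfer-learning step can be illustrated schematically: a base model trained on the whole target language is copied and fine-tuned separately on each dialect's data, yielding one model per dialect. The linear least-squares "model", the dialect names, and the synthetic corpora below are all illustrative assumptions, not the patent's actual acoustic model.

```python
import numpy as np


def train(X, y, w_init, lr=0.1, steps=300):
    """Gradient descent on mean-squared error, starting from w_init
    (warm-starting from a base model is the transfer-learning idea)."""
    w = w_init.copy()
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))          # synthetic "acoustic features"
w_base = np.array([1.0, -2.0, 0.5])    # ground-truth base-language mapping
base_model = train(X, X @ w_base, np.zeros(3))   # model for the whole language

# Fine-tune the base model separately on each dialect's (shifted) data.
dialect_models = {}
for name, shift in {"u-tsang": 0.2, "kham": -0.3, "amdo": 0.1}.items():
    y_dialect = X @ (w_base + shift)             # hypothetical dialect corpus
    dialect_models[name] = train(X, y_dialect, base_model, steps=60)
```

Because each dialect model starts from the base parameters rather than from scratch, far fewer fine-tuning steps suffice, which mirrors why transfer learning helps when per-dialect data is scarce.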
Preferably, the apparatus further comprises:
an acquisition module, configured to acquire audio to be processed in the target language;
and a recognition module, configured to perform speech recognition on the audio to be processed using at least one dialect speech recognition model to obtain the corresponding target language text.
The acoustic model training apparatus of this embodiment can execute the acoustic model training method shown in the first embodiment of this application; their implementation principles are similar and are not repeated here.
In this embodiment of the invention, latinization is performed on a text set of the target language to obtain a pronunciation dictionary of the target language, and a target language corpus, comprising speech corpora corresponding to the text set, is then generated based on the pronunciation dictionary and the text set. A speech recognition model for recognizing the target language is trained on this corpus, and the model is then retrained separately on the dialect corpora corresponding to each dialect of the target language, yielding a dialect speech recognition model for each dialect. The pronunciation dictionary can therefore be constructed through latinization without manual annotation, saving substantial labor and time; moreover, for the different dialect branches of the target language, training on the preset multi-dialect corpora produces a dialect acoustic model for each dialect and improves its recognition rate.
Furthermore, when the target language corpus is constructed, converting sentences into triphone sequences and computing the information entropy of those sequences guarantees balanced coverage of pronunciation phenomena even when the amount of target language text is small, and in particular increases the representation of infrequent pronunciation phenomena; this greatly improves the accuracy of the acoustic model trained on the corpus and raises the recognition rate.
In another embodiment of the present application, an electronic device is provided, comprising a memory and a processor, the memory storing at least one program for execution by the processor. When executed by the processor, the program implements the following: performing latinization on a text set of a target language to obtain a pronunciation dictionary of the target language; generating, based on the pronunciation dictionary and the text set, a target language corpus comprising speech corpora corresponding to the text set; training, according to the target language corpus, a speech recognition model for recognizing the target language; and retraining the speech recognition model separately on the dialect corpora corresponding to each dialect of the target language to obtain a dialect speech recognition model for each dialect. The pronunciation dictionary can therefore be constructed through latinization without manual annotation, saving substantial labor and time; moreover, for the different dialect branches of the target language, training on the preset multi-dialect corpora produces a dialect acoustic model for each dialect and improves its recognition rate.
In an alternative embodiment, an electronic device is provided. As shown in Fig. 8, the electronic device 8000 includes a processor 8001 and a memory 8003, the processor 8001 being coupled to the memory 8003, for example via a bus 8002. Optionally, the electronic device 8000 may also include a transceiver 8004. In practice, the number of transceivers 8004 is not limited to one, and the structure of the electronic device 8000 does not limit the embodiments of the present application.
Processor 8001 may be a CPU, a general-purpose processor, a DSP, an ASIC, an FPGA or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof, and may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with this disclosure. Processor 8001 may also be a combination providing computing functionality, for example one or more microprocessors, or a DSP combined with a microprocessor.
Bus 8002 may include a path to transfer information between the aforementioned components. The bus 8002 may be a PCI bus or an EISA bus, etc. The bus 8002 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 8, but this is not intended to represent only one bus or type of bus.
Memory 8003 may be, but is not limited to, ROM or another type of static storage device that can store static information and instructions, RAM or another type of dynamic storage device that can store information and instructions, EEPROM, CD-ROM or other optical disc storage (including compact disc, laser disc, digital versatile disc, Blu-ray disc, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer.
The memory 8003 stores the application program code for executing the solution of the present application, and execution is controlled by the processor 8001. Processor 8001 is configured to execute the application program code stored in memory 8003 to implement the content shown in any of the foregoing method embodiments.
Electronic devices include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), and in-vehicle terminals (e.g., in-vehicle navigation terminals), as well as fixed terminals such as digital TVs and desktop computers.
Yet another embodiment of the present application provides a computer-readable storage medium storing a computer program which, when run on a computer, enables the computer to perform the corresponding content of the foregoing method embodiments. Compared with the prior art, the method performs latinization on a text set of a target language to obtain a pronunciation dictionary of the target language, and then generates, based on the pronunciation dictionary and the text set, a target language corpus comprising speech corpora corresponding to the text set; a speech recognition model for recognizing the target language is trained according to the target language corpus, and the model is then retrained separately on the dialect corpora corresponding to each dialect of the target language to obtain a dialect speech recognition model for each dialect. The pronunciation dictionary can therefore be constructed through latinization without manual annotation, saving substantial labor and time; moreover, for the different dialect branches of the target language, training on the preset multi-dialect corpora produces a dialect acoustic model for each dialect and improves its recognition rate.
It should be understood that, although the steps in the flowcharts of the figures are shown in an order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the order of execution is not strictly limited, and the steps may be performed in other orders. Moreover, at least some of the steps in the flowcharts may comprise multiple sub-steps or stages, which are not necessarily completed at the same time but may be executed at different times, and whose order of execution is not necessarily sequential; they may be performed in turn or in alternation with other steps, or with at least some of the sub-steps or stages of other steps.
The foregoing describes only some embodiments of the present invention. It should be noted that those skilled in the art can make various improvements and refinements without departing from the principles of the present invention, and such improvements and refinements shall also fall within the protection scope of the present invention.
Embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, so that the computer device implements the following:
performing latinization on the text set of the target language to obtain a pronunciation dictionary of the target language;
generating a target language corpus based on the pronunciation dictionary and the text set; the target language corpus comprising a speech corpus corresponding to the text set;
training, according to the target language corpus, a speech recognition model for recognizing the target language;
and retraining the speech recognition model separately on the dialect corpora corresponding to each dialect of the target language to obtain dialect speech recognition models for recognizing the dialects.

Claims (12)

1. A method for training an acoustic model of a language, comprising:
performing latinization on a text set of a target language to obtain a pronunciation dictionary of the target language;
generating a target language corpus based on the pronunciation dictionary and the text set; the target language corpus comprising a speech corpus corresponding to the text set;
training a speech recognition model for recognizing the target language according to the target language corpus;
and retraining the speech recognition model separately based on the dialect corpora respectively corresponding to the dialects of the target language, obtaining dialect speech recognition models for recognizing the dialects.
2. The method of acoustic model training of a language of claim 1, wherein the set of text comprises at least two target language texts;
the performing latinization on the text set of the target language to obtain the pronunciation dictionary of the target language comprises:
segmenting the at least two target language texts based on a syllable delimiter to obtain at least two syllables;
counting the frequency of occurrence of each of the at least two syllables in the text set, and taking the first preset number of syllables with the highest frequency among the at least two syllables as target syllables;
and performing latinization on each target syllable to obtain a pronunciation sequence corresponding to each target syllable, and taking the set of the pronunciation sequences as the pronunciation dictionary of the target language.
3. The method for training an acoustic model of a language according to claim 1 or 2, wherein the generating a target language corpus based on the pronunciation dictionary and the text set comprises:
determining at least two target language sentences from the at least two target language texts; wherein any target language text comprises at least one target language sentence;
determining at least two target language sentences based on the pronunciation dictionary and the at least two target language sentences;
and generating a target language corpus based on the at least two target language sentences.
4. A method for training an acoustic model of a language according to claim 3, wherein said determining at least two target language sentences from the at least two target language texts comprises:
de-duplicating the at least two target language texts to obtain at least two remaining first target language texts;
regularizing the at least two first target language texts to obtain at least two second target language texts after regularization;
performing sentence segmentation on the at least two second target language texts to obtain at least two target language sentences;
and determining the target language sentences of which the syllable number exceeds a first syllable number threshold value and does not exceed a second syllable number threshold value in the at least two target language sentences.
5. The method for training an acoustic model of a language according to claim 3, wherein said determining at least two target language sentences based on the pronunciation dictionary and the at least two target language sentences comprises:
performing triphone conversion on the at least two target language sentences by using the pronunciation dictionary to obtain triphone sequences corresponding to the at least two target language sentences respectively;
calculating the information entropy of each triphone sequence, and taking the target language sentence corresponding to the triphone sequence with the largest information entropy as a target language sentence;
and, for the target language sentences other than those already selected, repeatedly performing the steps of performing triphone conversion on the sentences using the pronunciation dictionary to obtain their corresponding triphone sequences, calculating the information entropy of each triphone sequence, and taking the sentence corresponding to the triphone sequence with the largest information entropy as a target language sentence, until the number of target language sentences reaches a second preset number.
6. The method of acoustic model training of a language of claim 3, wherein the generating a target language corpus based on the at least two target language sentences comprises:
carrying out audio recording on the at least two target language sentences to obtain audio data corresponding to the at least two target language sentences respectively;
and storing the at least two target language sentences and the audio data corresponding to the at least two target language sentences to obtain a target language corpus.
7. The method for training an acoustic model of a language according to claim 1, wherein the training a speech recognition model for recognizing the target language according to the target language corpus comprises:
extracting, from each piece of audio data in the target language corpus, 40-dimensional Mel-frequency cepstral coefficient (MFCC) features and 100-dimensional i-vector (identity vector) features as acoustic features;
and training a preset Gaussian mixture model by adopting the acoustic features and each target language sentence in the target language corpus to obtain a speech recognition model of the target language.
8. The method according to claim 1, wherein the training the speech recognition model again based on dialect corpora corresponding to the dialects of the target language to obtain the dialect speech recognition model for recognizing the dialects comprises:
and performing transfer learning on the speech recognition model using the dialect corpora respectively corresponding to the dialects of the target language, obtaining, for each dialect corpus, a dialect acoustic model for recognizing the corresponding dialect.
9. The method for training an acoustic model of a language according to claim 1, further comprising:
acquiring audio to be processed of the target language;
and performing speech recognition on the audio to be processed using the at least one dialect acoustic model to obtain the corresponding target language text.
10. An apparatus for training an acoustic model of a language, comprising:
a first processing module, configured to perform latinization on a text set of a target language to obtain a pronunciation dictionary of the target language;
a second processing module, configured to generate a target language corpus based on the pronunciation dictionary and the text set; the target language corpus comprising a speech corpus corresponding to the text set;
a third processing module, configured to train a speech recognition model for recognizing the target language according to the target language corpus;
and a fourth processing module, configured to retrain the speech recognition model separately based on the dialect corpora respectively corresponding to the dialects of the target language, obtaining dialect speech recognition models for recognizing the dialects.
11. An electronic device, comprising:
a processor, a memory, and a bus;
the bus is used for connecting the processor and the memory;
the memory is used for storing operation instructions;
the processor is configured to execute the method for training an acoustic model of a language according to any one of claims 1 to 9 by calling the operation instruction.
12. A computer-readable storage medium for storing computer instructions which, when executed on a computer, cause the computer to perform the method of acoustic model training of a language according to any of claims 1 to 9.
CN202011287317.XA 2020-11-17 2020-11-17 Language acoustic model training method and device, electronic equipment and computer medium Pending CN112489634A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011287317.XA CN112489634A (en) 2020-11-17 2020-11-17 Language acoustic model training method and device, electronic equipment and computer medium


Publications (1)

Publication Number Publication Date
CN112489634A true CN112489634A (en) 2021-03-12

Family

ID=74931001

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011287317.XA Pending CN112489634A (en) 2020-11-17 2020-11-17 Language acoustic model training method and device, electronic equipment and computer medium

Country Status (1)

Country Link
CN (1) CN112489634A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112885351A (en) * 2021-04-30 2021-06-01 浙江非线数联科技股份有限公司 Dialect voice recognition method and device based on transfer learning
CN113345431A (en) * 2021-05-31 2021-09-03 平安科技(深圳)有限公司 Cross-language voice conversion method, device, equipment and medium
CN113380225A (en) * 2021-06-18 2021-09-10 广州虎牙科技有限公司 Language model training method, speech recognition method and related device

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101958118A (en) * 2003-03-31 2011-01-26 索尼电子有限公司 Implement the system and method for speech recognition dictionary effectively
CN103578464A (en) * 2013-10-18 2014-02-12 威盛电子股份有限公司 Language model establishing method, speech recognition method and electronic device
US20150287405A1 (en) * 2012-07-18 2015-10-08 International Business Machines Corporation Dialect-specific acoustic language modeling and speech recognition
US20170025117A1 (en) * 2015-07-23 2017-01-26 Samsung Electronics Co., Ltd. Speech recognition apparatus and method
CN108877782A (en) * 2018-07-04 2018-11-23 百度在线网络技术(北京)有限公司 Audio recognition method and device
CN109949796A (en) * 2019-02-28 2019-06-28 天津大学 A kind of end-to-end framework Lhasa dialect phonetic recognition methods based on Tibetan language component
CN110675855A (en) * 2019-10-09 2020-01-10 出门问问信息科技有限公司 Voice recognition method, electronic equipment and computer readable storage medium
CN111369974A (en) * 2020-03-11 2020-07-03 北京声智科技有限公司 Dialect pronunciation labeling method, language identification method and related device
CN111489735A (en) * 2020-04-22 2020-08-04 北京声智科技有限公司 Speech recognition model training method and device


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
SUN Jingwen: "Research on Speech Recognition of the Tibetan Amdo Dialect Based on Deep Learning", Wanfang Database *
JIN Huimin et al.: "Research on a Computer-Aided System for Tibetan Dialects", Science & Technology Information *


Similar Documents

Publication Publication Date Title
CN105957518B (en) A kind of method of Mongol large vocabulary continuous speech recognition
CN109065032B (en) External corpus speech recognition method based on deep convolutional neural network
CN111292720A (en) Speech synthesis method, speech synthesis device, computer readable medium and electronic equipment
CN112489634A (en) Language acoustic model training method and device, electronic equipment and computer medium
Adel et al. Features for factored language models for code-Switching speech.
CN112397056B (en) Voice evaluation method and computer storage medium
JP2014232268A (en) System, method and program for improving reading accuracy in speech recognition
CN111833845A (en) Multi-language speech recognition model training method, device, equipment and storage medium
CN110808032A (en) Voice recognition method and device, computer equipment and storage medium
CN110853629A (en) Speech recognition digital method based on deep learning
Nasr et al. End-to-end speech recognition for arabic dialects
CN116343747A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
Rasipuram et al. Grapheme and multilingual posterior features for under-resourced speech recognition: a study on scottish gaelic
CN107251137A (en) Improve method, device and the computer readable recording medium storing program for performing of the set of at least one semantic primitive using voice
Mittal et al. Speaker-independent automatic speech recognition system for mobile phone applications in Punjabi
Reddy et al. Indian sign language generation from live audio or text for tamil
CN115019787A (en) Interactive homophonic and heteronym word disambiguation method, system, electronic equipment and storage medium
Sitaram et al. Universal grapheme-based speech synthesis
Coto‐Solano Computational sociophonetics using automatic speech recognition
Amoolya et al. Automatic speech recognition for tulu language using gmm-hmm and dnn-hmm techniques
Ghorpade et al. ITTS model: speech generation for image captioning using feature extraction for end-to-end synthesis
Nga et al. A Survey of Vietnamese Automatic Speech Recognition
Tun et al. A speech recognition system for Myanmar digits
CN111090720A (en) Hot word adding method and device
Youa et al. Research on dialect speech recognition based on DenseNet-CTC

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40043383

Country of ref document: HK

AD01 Patent right deemed abandoned
AD01 Patent right deemed abandoned

Effective date of abandoning: 20230516