CN115512689A - Multi-language phoneme recognition method based on phoneme pair iterative fusion - Google Patents

Multi-language phoneme recognition method based on phoneme pair iterative fusion

Info

Publication number
CN115512689A
CN202211106527.3A CN115512689A
Authority
CN
China
Prior art keywords
phoneme
language
phonemes
fusion
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211106527.3A
Other languages
Chinese (zh)
Inventor
Long Hua
Su Shumeng
Shao Yubin
Du Qingzhi
Huang Zhangheng
Duan Yun
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunming University of Science and Technology filed Critical Kunming University of Science and Technology
Priority to CN202211106527.3A priority Critical patent/CN115512689A/en
Publication of CN115512689A publication Critical patent/CN115512689A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L2015/025 Phonemes, fenemes or fenones being the recognition units
    • G10L19/02 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders, using spectral analysis, e.g. transform vocoders or subband vocoders
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/18 Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band
    • G10L25/27 Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/93 Discriminating between voiced and unvoiced parts of speech signals

Abstract

The invention relates to a multi-language phoneme recognition method based on iterative fusion of phoneme pairs, belonging to the technical field of audio signal processing. Multilingual phoneme resources of different resource levels are acquired and effectively exploited through conversion to International Phonetic Alphabet (IPA) symbols; after the data sets are uniformly mapped onto the directly acquired language and the non-mappable IPA phonemes are expanded into language-tagged clusters, the reconstructed data set replaces the traditional single universal data set for fusing the phoneme sets. Taking human vocal-tract articulation as a constraint, a phonetic feature is built from the high-order linear-prediction peak bands constrained by vocal resonance, and this phonetic feature is combined with the acoustic MFCC feature into a novel phoneme-discriminative feature. Phoneme sets are reduced by iterating over minimum co-occurrence phoneme pairs or over phoneme pairs with maximal cosine similarity of their features. Finally, multi-language phoneme labels are constructed on the fused complete phoneme set, the novel discriminative features are extracted for the training and test sets of the labeled languages, and automatically aligned phoneme recognition of variable-length speech is realized.

Description

Multi-language phoneme recognition method based on phoneme pair iterative fusion
Technical Field
The invention relates to a multi-language phoneme recognition method based on phoneme pair iterative fusion, belonging to the technical field of audio signal processing.
Background
Deep learning neural network techniques are widely used thanks to performance superior to traditional continuous-time recognition models such as hidden Markov models; in particular, connectionist temporal classification (CTC) eliminates the need for forced alignment between speech sequences and labels and yields better phoneme recognition results. Phoneme recognition constructs an automatic mapping between phoneme labels and phoneme feature vectors; using phonetic features alone as input has never achieved satisfactory recognition, so novel discriminative features are needed to improve phoneme classification and discrimination. Unlike speech recognition, phoneme recognition does not focus on the semantics of speech; unlike speaker recognition, it correlates little with a speaker's pronunciation habits, voice quality or timbre. The accuracy and extensibility of a phoneme recognition system play an important role in the development of automatic speech recognition, language identification and other systems, yet a plain recognition model faces a series of problems in practical application, notably the resource problem of training the model, the extensibility problem of cross-language application, and the recognition accuracy problem. Clustering phonemes and constructing a fused multi-language phoneme recognition model can effectively exploit phoneme resources, strengthen model extensibility, and improve phoneme recognition for low-resource languages. Unlike the number of characters or words in a language, the total number of phonemes in a single language is quite limited and enumerable; constrained by the shared human vocal tract yet shaped by differing pronunciation habits, the phoneme sets of different languages intersect heavily while their language-specific subsets remain small (the overlap is larger among standard languages of the same family and smaller among non-standard languages, i.e. dialects). This is an exploitable property for building a complete multi-language recognition model over a combined phoneme set and provides naturally favorable conditions for flexible multi-level fusion across languages. Combined with the alignment-free property of the CTC network, a phoneme recognition model for variable-length speech with superior extensibility and accuracy is produced.
Disclosure of Invention
The technical problem to be solved by the present invention is to provide a multi-language phoneme recognition method based on phoneme pair iterative fusion, so as to solve the above problems.
The invention constructs an extensible phoneme recognizer that can be trained with fewer training resources while guaranteeing accuracy. The main technical problems solved are: providing a method for acquiring and mapping multilingual phoneme sets usable for fusion; flexibly and error-freely fusing several language phoneme sets in a multi-level manner; constructing language phoneme labels under the complete fused phoneme set; constructing accurate fused features for discriminating phonemes; and feeding the phoneme labels and discriminative features into a connectionist temporal classification (CTC) network to build an accurate and effective multilingual phoneme recognizer. When acquiring phoneme sets, the relatedness of different languages' phonemes is effectively exploited to guide the mapping conversion: the phoneme sets of different languages are mapped, at the level of phoneme linguistics, onto a chosen data set, and the mapped set is then expanded into a reconstructed data set that replaces the traditional single universal data set, enhancing adaptability to different languages. The unified mapping of phoneme sets is independent of phoneme set fusion; it is a preprocessing step before fusion, in which phonemes acquire linguistic-phonetic knowledge on a direct basis. During fusion, on the premise of preserving recognition accuracy, the performance loss caused by competition among many highly similar phoneme classes in an oversized fused set is avoided by reducing the phoneme set size. For reduction, in applications of language-phoneme sequence statistics where a few phoneme errors are tolerated for optimal overall recognition accuracy, a small-probability phoneme-pair iterative reduction of the language phoneme sets is adopted; for grapheme-encoded language phonemes, phoneme similarity is represented by the cosine distance between phonemes under the novel discriminative features, and the most similar pair is merged and reduced. Considering the statistical differences among the fused phoneme-set models, and to avoid contradictory phonetic-knowledge attributes arising when several mapped phoneme sets are fused directly, the number of reduction iterations is determined by maximizing mutual information, and the phoneme sets are fused in multiple stages. After fusion, CTC-supervised learning of phoneme labels and phoneme features under the complete fused set is constructed to obtain a multilingual phoneme recognizer with higher recognition accuracy.
The technical scheme of the invention is as follows: a multi-language phoneme recognition method based on phoneme pair iterative fusion specifically comprises the following steps:
step1: phoneme corpora of several different languages with different resource levels are acquired. The construction of the phoneme recognizer is limited by the corpus resources available for the recognition model; the richness of the labeled corpus used for supervised learning represents the resource level of the corpus. A corpus of higher resource level is acquired directly and serves as the main-language phoneme set for training the first language; corpora acquired indirectly through non-IPA phoneme dictionary coding serve as extended-language phoneme sets for training the second or further languages; and corpora acquired indirectly through grapheme-IPA phoneme dictionary coding serve as extended-language phoneme sets for training the third or further languages.
Three phoneme-corpus acquisition methods are combined to gather different types of corpus resources. For a language whose phoneme corpus is of high precision and rich content, the speech set and phoneme labels are acquired directly as the language's phoneme corpus. For a language whose phoneme corpus is sufficiently precise and rich but represented in a language-specific phoneme standard, a dictionary coding from that standard representation to the International Phonetic Alphabet (IPA) representation is constructed, and the speech set with IPA phoneme labels is acquired through this phoneme-phoneme dictionary coding. For languages without better phoneme data but with a definite mapping between speech phonemes and graphemes, an IPA transcription dictionary over the language's graphemes is constructed, and the phoneme set with IPA labels is acquired through this grapheme-phoneme dictionary coding. Notably, in the proposed joint multilingual acquisition method, when the acquired language is an isolated-word language (as are most Sino-Tibetan languages), each isolated-word phoneme must be disassembled into several continuous word-domain IPA phonemes before the language phoneme set is reconstructed.
Step2: the acquired corpus phoneme sets of different languages are uniformly mapped: based on the phoneme sets obtained in Step1, the corpus phoneme labels acquired through non-IPA phoneme dictionary coding and those acquired through grapheme-IPA phoneme dictionary coding are uniformly mapped, using linguistic knowledge, onto the phoneme labels of the first language's main phoneme set for representation.
In Step2, the unified mapping module for the phoneme sets of different languages is separate and independent: an independent phoneme-set mapping module is built on the different articulation modes of phonemes, including tone, trill, glide, nasality and the like, so that phoneme-set mapping is independent of phoneme-set acquisition and separated from the phoneme-set fusion module. The modularized mapping is connected in parallel to the phoneme-set acquisition module so as to adapt to several single-language phoneme sets acquired in different ways.
Unified mapping scheme for the phoneme sets of different languages: the directly acquired main-language phoneme set is selected as the base phoneme set of the mapping, and the indirectly acquired extended-language phoneme sets are selected as the superimposed, multi-stage fusion phoneme sets. The IPA-format complete phoneme symbol set of an extended language acquired through phoneme-phoneme coding is mapped onto the directly acquired main-language complete phoneme symbol set, as is the IPA-format complete phoneme symbol set of an extended language acquired through grapheme-phoneme coding.
The specific implementation of the unified mapping of the phoneme sets of different languages is as follows: the number of phoneme symbols in a language's phoneme set is quite limited and enumerable. According to the unity principle of human articulatory organs and based on phoneme-linguistic knowledge, the indirectly acquired language IPA phonemes are mapped onto the directly acquired phoneme set of a highly similar language; according to the reciprocity principle of speakers' articulatory organs and considering language regionality, indirectly acquired phonemes that cannot be mapped are given language tags to form an unmapped IPA phoneme cluster. The mapping relation to ARPAbet is expressed as: 'a': 'ah', 'aa': 'ah', 'ei': 'eh', 'ai': 'ay', 'an': 'ah en', 'en': 'ax en', 'ang': 'ah nx', 'ao': 'aw', 'eng': 'ax nx', 'b': 'b', 'c': 'C_1', 'er': 'axr', 'ch': 'ch', 'd': 'd', 'f': 'f', 'e': 'ax', 'ee': 'ax', 'g': 'g', 'h': 'hh', 'i': 'iy', 'ia': 'C_2', 'ian': 'iy ah en', 'iang': 'iy ah nx', 'iao': 'iy aw', 'ie': 'y eh', 'ii': 'iy', 'in': 'iy nx', 'ing': 'iy nx', 'iong': 'iy oy nx', 'iu': 'uw', 'ix': 'C_3', 'iy': 'C_4', 'iz': 'C_5', 'j': 'jh', 'k': 'k', 'l': 'l', 'm': 'm', 'n': 'n', 'o': 'oy', 'ong': 'oy nx', 'oo': 'oy', 'ou': 'ow', 'p': 'p', 'q': 'ch', 'r': 'r', 's': 's', 'sh': 'sh', 't': 't', 'u': 'uh', 'ua': 'uh ah', 'uai': 'wh aa ih', 'uan': 'uh ah en', 'uang': 'uh ah nx', 'ueng': 'uh ax nx', 'ui': 'wh eh ih', 'un': 'wh eh en', 'uo': 'C_6', 'uu': 'uh', 'v': 'v', 'van': 'v ah en', 've': 'v ax', 'vn': 'v en', 'vv': 'v', 'x': 'C_7', 'z': 'C_8', 'zh': 'zh'. The Spanish-to-English linguistic mapping to ARPAbet is given as: 'J': 'iy', 'L': 'jh', 'RR': 'S_2', 'S': 'S', 'S-1': 'S_1', 'S-2': 'S_3', 'T': 'th', 'T/': 'ch', 'X': 'hh', 'a': 'aa', 'b': 'v', 'd': 'en jh', 'e': 'eh', 'f': 'f', 'g': 'g', 'gs': 's', 'h': 'h#', 'i': 'ih', 'j': 'ih', 'k': 'k', 'l': 'l', 'm': 'm', 'n': 'en', 'o': 'ow', 'p': 'p', 'q': 'k', 'r': 'S_4', 't': 't', 'u': 'uw', 'v': 'b', 'w': 'w', wherein 'S-2' is the phoneme corresponding to the special grapheme 'u' and 'S-1' is the phoneme corresponding to the special grapheme '″'. Spanish exhibits a multi-pronunciation phenomenon, so a more complete multi-layer pronunciation mapping can be constructed accordingly to improve the completeness of the model's phoneme dictionary.
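As an illustration of this dictionary-coding step, the following Python sketch reproduces a few entries of the table above and applies them to a label sequence; the helper name and the fallback handling for unseen symbols are assumptions for the example, not part of the patent:

```python
# Sketch of the pinyin -> ARPAbet dictionary coding described above.
# Only a few entries of the published table are reproduced; phonemes with
# no IPA/ARPAbet counterpart keep a Chinese-tagged cluster symbol ('C_x').
PINYIN_TO_ARPABET = {
    'a': ['ah'], 'ai': ['ay'], 'an': ['ah', 'en'], 'ang': ['ah', 'nx'],
    'b': ['b'], 'ch': ['ch'], 'e': ['ax'], 'i': ['iy'],
    'ian': ['iy', 'ah', 'en'], 'zh': ['zh'],
    'c': ['C_1'], 'ia': ['C_2'], 'x': ['C_7'], 'z': ['C_8'],
}

def map_label_sequence(pinyin_labels):
    """Map a pinyin label sequence onto the main-language (ARPAbet) symbol
    set; isolated-word units expand into several continuous-domain phonemes."""
    mapped = []
    for p in pinyin_labels:
        # Unlisted symbols would join the language-tagged unmapped cluster.
        mapped.extend(PINYIN_TO_ARPABET.get(p, ['C_' + p]))
    return mapped

print(map_label_sequence(['b', 'an', 'zh', 'i']))
# -> ['b', 'ah', 'en', 'zh', 'iy']
```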
Step3: novel phoneme-discriminative features are constructed with the human vocal tract as constraint, specifically: according to the speech production mechanism, based on the multilingual phoneme set obtained in Step2 and mapped to the first-language IPA phoneme labels, all phonemes are refined into unvoiced phonemes, nasal voiced phonemes and non-nasal voiced phonemes, and a new type of feature with phoneme-discriminative properties is constructed according to the production characteristics of the different speech categories.
Step4: based on the multilingual phoneme set obtained in Step2, the IPA phoneme symbol sets mapped within each language are reduced, shrinking the sizes of the total phoneme symbol sets of the mapped main-language, second-language and third-language phoneme sets respectively.
Step5: based on the reduced multi-language phoneme sets obtained in Step4, the directly acquired main phoneme set is used as the initial set and undergoes a first-iteration, first-stage fusion with the indirectly acquired second-language phoneme set; the new set formed after fusion undergoes a second-iteration, second-stage fusion with the indirectly acquired third-language phoneme set; and so on, fusing the phoneme sets of further languages.
Multi-stage error-free fusion of multiple language phoneme sets: first, the phoneme sets are fused without error to avoid vocabulary confusion among them, under the reduction principle that phonemes within a reduced single-language set are not merged and phonemes of unmapped clusters with different language tags are not fused. Second, the multi-stage fusion extends flexibly with the number of fused languages: the directly acquired reduced phoneme set is taken as the initial set and undergoes first-iteration, first-stage fusion with the first indirectly acquired reduced phoneme set; the new set formed by that fusion undergoes second-iteration, second-stage fusion with the remaining second reduced phoneme set; the set after second-stage fusion undergoes third-iteration, third-stage fusion with the remaining third phoneme set; the number of fused phoneme sets equals the number of iterations (the number of fusion stages) plus one.
Step6: a phoneme recognition network is constructed with a connectionist temporal classification (CTC) network, realizing automatically aligned phoneme-sequence recognition of variable-length multilingual speech.
The Step3 is specifically as follows: discriminative feature quantities characterizing single phonemes are constructed. Phonemes are divided into the three categories of unvoiced, nasal voiced and non-nasal voiced, and the acoustic MFCC feature is combined with the vocal-resonance-constrained phoneme high-order linear-prediction peak-band feature to construct the novel phoneme-discriminative feature. T denotes the number of speech frames; the static discriminative feature of the phoneme in the n-th frame is

f_n = [F_{n,0}, F_{n,1}, F_{n,2}, F_{n,3}, W_{n,1}, W_{n,2}, W_{n,3}, M_n]

where n = 1, 2, …, T; F_{n,0} denotes the fundamental frequency of the n-th frame; F_{n,1}, F_{n,2}, F_{n,3} denote the vocal-constraint formants of the n-th frame; W_{n,1}, W_{n,2}, W_{n,3} denote the bandwidths corresponding to those formants; and M_n denotes the 13-dimensional MFCC of the n-th frame. When the frame is nasal or unvoiced, the zero-pole frequencies and their bandwidth compensation are adopted in place of the first three formant frequencies and bandwidths of non-nasal voiced speech. The segment-level static phoneme-discriminative feature is F_s = [f_1, f_2, …, f_T], the segment-level dynamic phoneme-discriminative feature is the first-order difference ΔF_s = [Δf_1, Δf_2, …, Δf_T], and the segment-level phoneme-discriminative feature is F = [F_s, ΔF_s].
The Step4 specifically comprises the following: for the application target of the constructed phoneme recognition model, applied to a language-phoneme alignment system that tolerates a few phoneme errors to achieve the best overall global phoneme recognition accuracy, the contribution of a few phonemes to speech recognition in the multilingual phoneme symbol set is sacrificed. A similarity is defined and the probabilities of phoneme pairs in the phoneme set are computed: when two phonemes occur simultaneously in the same language's phoneme driving set with small and similar probabilities, or occur simultaneously in different languages' phoneme driving sets each with small probability, their probabilities are combined and the phoneme with the smaller probability is reduced. For grapheme-encoded language phonemes, phoneme linguistic description precision is sacrificed: the phoneme-discriminative feature F^{(i,o)} of phoneme o in language i, i = 0, 1, 2, is computed; the cosine distance between phonemes based on the novel discriminative feature F^{(i,o)} represents phoneme similarity, the pair with maximal similarity is merged and reduced, and upon reduction the merged pair is represented by the phoneme symbol with the higher occurrence probability in the training set.
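A minimal sketch of one such cosine-similarity reduction iteration, assuming the per-phoneme mean discriminative feature vectors F^{(i,o)} and training-set probabilities have already been computed (the container layout is illustrative):

```python
import numpy as np

def merge_most_similar_pair(symbols, probs, features):
    """One iteration of cosine-similarity phoneme-pair reduction: find the
    pair with minimal cosine distance (maximal similarity), combine their
    probabilities, and keep the symbol that is more frequent in training."""
    best_pair, best_sim = None, -1.0
    for i in range(len(symbols)):
        for j in range(i + 1, len(symbols)):
            a, b = features[symbols[i]], features[symbols[j]]
            sim = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
            if sim > best_sim:
                best_sim, best_pair = sim, (symbols[i], symbols[j])
    keep, drop = sorted(best_pair, key=lambda s: probs[s], reverse=True)
    probs[keep] += probs.pop(drop)   # combine the two phoneme probabilities
    symbols.remove(drop)             # the less probable phoneme is reduced
    return keep, drop
```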
The Step5 is specifically as follows:
In Step5, the fusion mode of the multiple language phoneme sets: the multi-stage fusion of phoneme sets is built on a maximum-mutual-information multi-stage fusion algorithm over the complete phoneme sets. A language phoneme set and the language phoneme symbol set composing it are first converted into random variables; at each fusion stage, the mutual information of the two language phoneme sets fused at that stage is expressed as a function of the number of reduction iterations, and the iteration count maximizing that expression is used for the multi-stage fusion of the phoneme sets. The specific steps are:
Step5.1: compute the mutual information between the random variables X_i, i = 1, 2, formed by the iteratively reduced phoneme sets and X_0:

I(X_i; X_0) = \sum_{x \in X_i} \sum_{y \in X_0} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}

Step5.2: compute the iteration count E_i at which the mutual information of the two languages' phoneme symbol sets is maximal.
Step5.3: take the minimum of the mutual-information-maximizing E_i, i = 2, …, L, over all L languages as the optimal number of reduction iterations for all phoneme sets.
Step5.4: merge the iteratively reduced phoneme symbol sets at the optimal iteration count to determine the fused multi-language complete phoneme symbol set under the fused i-language phoneme symbol set and fused random variable.
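In Python, the mutual-information computation and the choice of the iteration count E_i might look as follows; since the patent does not spell out how the joint distribution of the two symbol sets is estimated, a precomputed joint co-occurrence matrix (one per reduction count e) is assumed here:

```python
import numpy as np

def mutual_information(joint):
    """I(X_i; X_0) for a joint distribution over the two phoneme symbol
    sets (rows index X_i, columns index X_0)."""
    joint = joint / joint.sum()
    px = joint.sum(axis=1, keepdims=True)   # marginal of X_i
    py = joint.sum(axis=0, keepdims=True)   # marginal of X_0
    nz = joint > 0
    return float((joint[nz] * np.log2(joint[nz] / (px @ py)[nz])).sum())

def optimal_iteration_count(joints_per_e):
    """E_i = argmax_e I(X_i; X_0); joints_per_e[e] is the joint
    co-occurrence matrix after e reduction iterations."""
    return int(np.argmax([mutual_information(j) for j in joints_per_e]))
```

Taking the minimum of these E_i across the L languages, as in Step5.3, then fixes one reduction depth for all phoneme sets before the sets are merged.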
The invention has the beneficial effects that: three phoneme-corpus acquisition methods are combined to jointly gather different types of corpus resources, the isolation of heterogeneous low-resource phoneme resources acquired indirectly is reduced, and the shortage of phoneme resources across languages under existing acquisition modes is compensated; the phoneme-set mapping is modularized, so that the unified phoneme mapping effectively adapts to several single-language phoneme sets acquired in different ways, and the independent mapping module eliminates the loss of phoneme interaction among phoneme sets caused by differing pronunciation modes within a language's phoneme set; the proposed mapping scheme fully exploits the precision of the speech and phoneme labels of the directly acquired main-language phoneme set and harmonizes the fit of supervised phoneme-label learning for the indirectly acquired language phoneme sets; phonemes are divided into the unvoiced, nasal voiced and non-nasal voiced categories to build discriminative feature quantities for single phonemes, and the acoustic MFCC feature is combined with the vocal-resonance-constrained high-order linear-prediction peak-band feature into a novel phoneme-discriminative feature that inherits segment-level acoustic characteristics while discriminating speech, making phoneme classes more separable; the multi-level fusion of overall phoneme recognition increases the adaptability of supervised learning to corpora of different languages and builds a single multilingual phoneme recognizer, effectively replacing the traditional cumbersome parallel deployment of several single-language recognizers for multilingual applications, and the introduced novel phoneme-discriminative feature greatly improves recognition accuracy.
Drawings
FIG. 1 is a general flow diagram of the present invention;
FIG. 2 is a flow chart of the phonemic phonetic features of the present invention;
FIG. 3 is a flow chart of the multilingual phone set fusion token of the present invention;
FIG. 4 is a diagram of a multilingual phoneme recognition supervised learning network of the present invention.
Detailed Description
The invention is further described with reference to the following drawings and detailed description.
Example 1: as shown in fig. 1, a multi-lingual phoneme recognition method based on phoneme pair iterative fusion includes the following specific steps:
step1: acquiring a multi-language phoneme set;
the TIMIT English phoneme corpus with higher resource degree contains audio 16kHz sampling rate voice sequence sampling level artificial segmentation labels ARPAbet format phoneme labels, the voice phoneme labels with higher precision are directly obtained to be used as English phoneme sets, and the corresponding phoneme symbol sets are C 0 (ii) a The Thchs30 corpus with higher resource degree comprises phoneme labels marked by Chinese initial and final formats of the speech segment level, a complete dictionary from the initial and final to the international phonetic symbol IPA is constructed, the phoneme IPA labels at the speech segment level are obtained through dictionary coding from the initial and final to the international phonetic symbol as a Chinese phoneme set, and the corresponding phoneme symbol set is C 1 (ii) a The TSC corpus with low resource degree comprises speech grapheme labels at speech segment level, the corresponding Spanish language has self-defined phoneme mapping relation, IPA phoneme labels at Spanish speech segment level are obtained by utilizing the mapping relation from the language grapheme to the phoneme to serve as a Spanish phoneme set, and the corresponding phoneme symbol set is C 2
Step2: linguistic-knowledge mapping of the language phoneme sets
The English phoneme set TIMIT and the Spanish phoneme set TSC are phoneme sets in the continuous-word recognition domain, while the Chinese phoneme set Thchs30 belongs to the isolated-word recognition domain; it is therefore necessary to disassemble some phonemes in Thchs30 into several continuous word-domain phonemes.
The language phoneme symbol set composed of isolated-word initials and finals is transcribed into International Phonetic Alphabet format and then mapped onto the TIMIT phoneme set, with the IPA symbol set expanded by the special pronunciations of Chinese; the mapped Thchs30 phoneme symbol set C'_1 contains a mappable phoneme cluster C'_{1,IPA} and a non-mappable unmapped cluster C'_{1,C} (unmapped IPA phonemes carrying the language tag "C").
The ARPAbet mapping relation by which the indirectly acquired Thchs30 initials and finals are mapped, through international phonetic transcription and linguistic knowledge, onto the directly acquired TIMIT set is: 'a': 'ah', 'aa': 'ah', 'ei': 'eh', 'ai': 'ay', 'an': 'ah en', 'en': 'ax en', 'ang': 'ah nx', 'ao': 'aw', 'eng': 'ax nx', 'b': 'b', 'c': 'C_1', 'er': 'axr', 'ch': 'ch', 'd': 'd', 'f': 'f', 'e': 'ax', 'ee': 'ax', 'g': 'g', 'h': 'hh', 'i': 'iy', 'ia': 'C_2', 'ian': 'iy ah en', 'iang': 'iy ah nx', 'iao': 'iy aw', 'ie': 'y eh', 'ii': 'iy', 'in': 'iy nx', 'ing': 'iy nx', 'iong': 'iy oy nx', 'iu': 'uw', 'ix': 'C_3', 'iy': 'C_4', 'iz': 'C_5', 'j': 'jh', 'k': 'k', 'l': 'l', 'm': 'm', 'n': 'n', 'o': 'oy', 'ong': 'oy nx', 'oo': 'oy', 'ou': 'ow', 'p': 'p', 'q': 'ch', 'r': 'r', 's': 's', 'sh': 'sh', 't': 't', 'u': 'uh', 'ua': 'uh ah', 'uai': 'wh aa ih', 'uan': 'uh ah en', 'uang': 'uh ah nx', 'ueng': 'uh ax nx', 'ui': 'wh eh ih', 'un': 'wh eh en', 'uo': 'C_6', 'uu': 'uh', 'v': 'v', 'van': 'v ah en', 've': 'v ax', 'vn': 'v en', 'vv': 'v', 'x': 'C_7', 'z': 'C_8', 'zh': 'zh'.
Similarly, according to linguistic knowledge, the grapheme-phoneme-transcribed TSC phoneme symbol set is mapped onto the TIMIT phoneme set and expanded with the special pronunciations of Spanish; the mapped TSC phoneme symbol set C'_2 contains a mappable phoneme cluster C'_{2,IPA} and a non-mappable unmapped cluster C'_{2,S} (unmapped IPA phonemes carrying the language tag "S").
The ARPAbet mapping relation by which the indirectly acquired TSC international phonetic symbols are mapped, through linguistic knowledge, onto the directly acquired TIMIT set is: 'J': 'iy', 'L': 'jh', 'RR': 'S_2', 'S': 'S', 'S-1': 'S_1', 'S-2': 'S_3', 'T': 'th', 'T/': 'ch', 'X': 'hh', 'a': 'aa', 'b': 'v', 'd': 'en jh', 'e': 'eh', 'f': 'f', 'g': 'g', 'gs': 's', 'h': 'h#', 'i': 'ih', 'j': 'ih', 'k': 'k', 'l': 'l', 'm': 'm', 'n': 'en', 'o': 'ow', 'p': 'p', 'q': 'k', 'r': 'S_4', 't': 't', 'u': 'uw', 'v': 'b', 'w': 'w', wherein 'S-2' is the phoneme corresponding to the special grapheme 'u' and 'S-1' is the phoneme corresponding to the special grapheme '″'. Spanish exhibits a multi-pronunciation phenomenon, so a more complete multi-layer pronunciation mapping can be constructed accordingly to improve the completeness of the model's phoneme dictionary.
Step3: a phoneme distinguishing characteristic structure;
constructing the phoneme discriminative feature comprises constructing a phoneme phonetic feature and splicing the phoneme acoustic feature.
Step3.1: and (3) constructing phonetic features:
according to a speech system mechanism, based on peak frequency band characteristics of human body speech constraint, a method of combining a speech short-time spectrum and a vocal tract modeling is adopted to efficiently and accurately determine fundamental tone frequency of voiced phonemes of a speech, zero-pole frequency of unvoiced and voiced phonemes of a nasal voiced phoneme, first three formants of non-nasal voiced phoneme, first-order difference of the first three formants and the like to represent different phonemes. When extracting parameters, a high-order optimal LPC peak estimation method based on human body pronunciation digital resonance constraint is adopted to construct a phonetic feature construction mode as shown in fig. 2.
Step3.2: fusing the phoneme acoustic features and phoneme phonetic features
The acoustic MFCC feature is combined with the vocal-resonance-constrained phoneme high-order linear-prediction peak-band feature to construct the novel phoneme-discriminative feature.
T denotes the number of speech frames; the static discriminative feature of the phoneme in the n-th frame is

f_n = [F_{n,0}, F_{n,1}, F_{n,2}, F_{n,3}, W_{n,1}, W_{n,2}, W_{n,3}, M_n]

where n = 1, 2, …, T; F_{n,0} denotes the fundamental frequency of the n-th frame; F_{n,1}, F_{n,2}, F_{n,3} denote the vocal-constraint formants of the n-th frame; W_{n,1}, W_{n,2}, W_{n,3} denote the bandwidths corresponding to those formants; and M_n denotes the 13-dimensional MFCC of the n-th frame. When the frame is nasal or unvoiced, the zero-pole frequencies and their bandwidth compensation replace the first three formant frequencies and bandwidths of non-nasal voiced speech. The segment-level static phoneme-discriminative feature is F_s = [f_1, f_2, …, f_T], the segment-level dynamic phoneme-discriminative feature is ΔF_s = [Δf_1, Δf_2, …, Δf_T], and the segment-level phoneme-discriminative feature is F = [F_s, ΔF_s].
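Putting the pieces together, a sketch of the segment-level feature assembly follows; it reuses the lpc_formants helper from the earlier sketch, and the pitch tracker, frame sizes and the omission of the zero-pole substitution for nasal/unvoiced frames are simplifying assumptions:

```python
import numpy as np
import librosa

def phoneme_features(y, sr, frame_len=400, hop=160):
    """Per-frame static feature f_n = [F0, F1..F3, W1..W3, 13-dim MFCC]
    (20-dim), its first-order difference as the dynamic feature, and the
    segment-level feature F as their concatenation, shape (T, 40)."""
    f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr,
                     frame_length=frame_len, hop_length=hop)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=frame_len, hop_length=hop)
    frames = librosa.util.frame(y, frame_length=frame_len, hop_length=hop)
    T = min(len(f0), mfcc.shape[1], frames.shape[1])
    static = []
    for n in range(T):
        F, W = lpc_formants(frames[:, n], sr)        # earlier sketch
        F = (F + [0.0, 0.0, 0.0])[:3]                # pad missing peaks
        W = (W + [0.0, 0.0, 0.0])[:3]
        static.append(np.concatenate([[f0[n]], F, W, mfcc[:, n]]))
    static = np.asarray(static)                      # (T, 20)
    delta = np.diff(static, axis=0, prepend=static[:1])  # dynamic features
    return np.concatenate([static, delta], axis=1)   # F, shape (T, 40)
```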
Step4: multilingual complete phoneme set
Step4.1: phoneme set reduction
For the mappable IPA phoneme clusters, the phoneme set is reduced to shrink its size. Before reduction, the language phoneme symbol sets C_i, i = 0, 1, 2, their mapped phoneme symbol sets C'_i, and the mappable IPA clusters C'_{i,IPA}, i = 0, 1, 2, therein constitute random variables X_{i,IPA}, i = 0, 1, 2, with the probability space

[X_{i,IPA}; P] = [c'_{i,IPA}(1), …, c'_{i,IPA}(m_{i,IPA}); p(c'_{i,IPA}(1)), …, p(c'_{i,IPA}(m_{i,IPA}))]

where p(c'_{i,IPA}(m_{i,IPA})) denotes the occurrence probability of the element symbol c'_{i,IPA}(m_{i,IPA}) in the random variable X_{i,IPA}, and m_{i,IPA} denotes the number of phonemes in the symbol set C'_{i,IPA}, i = 0, 1, 2.

Phoneme iterative reduction is performed in a data-driven manner: after e reduction iterations, e phonemes of the symbol set of X_{i,IPA}, i = 0, 1, 2, have been reduced, and after each reduction the random variable and the probabilities of its probability-space expression are updated.
The phoneme-discriminative feature F^{(i,o)} of phoneme o in language i, i = 0, 1, 2, is obtained.
If, from the phoneme-characterization perspective, a low phoneme recognition error rate is most required, iteration proceeds over cosine-similar phoneme pairs: the cosine distances between the features of different phonemes are computed; the pair with the minimal feature cosine distance has the highest similarity, so their probabilities are combined and the phoneme with the smaller probability is reduced.
If instead the application of sequence-feature statistics for phoneme recognition seeks maximal global recognition accuracy, iteration proceeds over small-probability phoneme pairs: a similarity is defined and the probabilities of the phonemes in the driving data set are computed; when two phonemes occur simultaneously in the same language's phoneme driving set with small and similar probabilities, or occur simultaneously in different languages' phoneme driving sets each with small probability, their probabilities are combined and the phoneme with the smaller probability is reduced.
The above iterative process is repeated e times, and the reduced random variable probability space is updated each time.
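Under the same assumptions as above (unigram probabilities and candidate co-occurrence pairs precomputed from the driving data), one small-probability iteration might be sketched as:

```python
def reduce_small_probability_pair(probs, candidate_pairs):
    """One iteration of minimum co-occurrence phoneme-pair reduction: among
    the candidate pairs (same-language pairs with small, similar
    probabilities, or cross-language pairs both of small probability),
    merge the pair with the smallest combined probability."""
    pair = min(candidate_pairs, key=lambda p: probs[p[0]] + probs[p[1]])
    keep, drop = sorted(pair, key=lambda s: probs[s], reverse=True)
    probs[keep] += probs.pop(drop)   # combine, reduce the rarer phoneme
    return keep, drop
```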
Step4.2: phoneme set fusion
From the perspective of learners of different languages, as correlation grows from phonemes to phrases, sentences and paragraphs, a speaker's attention to individual phonemes weakens, and universally transferable characteristics arise among the phonemes of different languages, which is why foreigners can produce native-sounding phones in daily life. Comparative analysis of the difference types among languages shows that the larger the difference, the higher the transfer degree of the first language and the smaller the mutual information of the speech phoneme sets. The vocal-tract model parameters of the speech production principle classify phonemes effectively, and different languages' phoneme sets are responses of similar vocal-tract parameters, with high overlap and compatibility. This embodiment fuses the Chinese, English and Spanish phoneme sets. To avoid vocabulary confusion, multi-stage error-free fusion of the phoneme sets under the maximum-mutual-information criterion is realized; the fusion model is shown in fig. 3.
When merging on the updated phoneme sets obtained from the reduction, phonemes within a single-language phoneme set are not merged, and phonemes of non-mappable IPA clusters with different language tags are not merged with each other.
The fusion steps are as follows:
step4.2.1: respectively solving random variables X formed by the phoneme set after iterative reduction i I =1,2 and X 0 In a mobile communication systemInformation quantity
Figure BDA0003841873180000101
Step4.2.2: calculating the iteration times E when the mutual information quantity of the two-language phoneme symbol set is maximum i
Step4.2.3: solving the minimum E based on the maximum mutual information quantity in all L languages i I =2, \ 8230, L value, which is taken as the optimal iterative reduction number for all phone sets.
Step4.2.4: and merging the iteratively reduced phoneme symbol sets under the optimal iterative reduction times to determine a fused multilingual complete phoneme symbol set under the fusion random variables of the fused i-language phoneme symbol set.
Step5: constructing CTC supervised learning:
step5.1: constructing a feature label of the CTC supervised learning phoneme;
and (3) coding all the obtained TIMIT, TSC and Thchs30 original linguistic data based on the fused multilingual complete phoneme symbol set symbols, and constructing language training labels for training all languages uniformly under the fused multilingual complete phoneme symbol set.
The voice training label is a label sequence of a language segment level, and the phoneme distinguishing characteristic F of the voice segment corresponding to the voice label sequence is further solved.
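A minimal sketch of this re-encoding over the fused complete symbol set; reserving index 0 for the CTC blank is an assumption consistent with common CTC practice rather than something the patent states:

```python
def encode_labels(corpus_label_seqs, fused_symbols):
    """Re-encode segment-level phoneme label sequences from all corpora
    (TIMIT, Thchs30, TSC) as integer IDs over the fused complete phoneme
    symbol set; ID 0 is reserved for the CTC blank."""
    sym2id = {s: i + 1 for i, s in enumerate(sorted(fused_symbols))}
    return [[sym2id[s] for s in seq] for seq in corpus_label_seqs]
```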
Step5.2: building a CTC supervised learning network
Connectionist temporal classification (CTC) defines its supervised-learning objective by means of dynamic programming, realizes alignment between frame-level speech representations and phonemes, and maximizes the posterior probability of the output phoneme sequence with respect to the input labels, which is then obtained with a greedy algorithm.
The constructed connection time classification network is shown in fig. 4.
The supervised learning of the CTC network firstly obtains the phoneme distinguishing characteristics of the speech segment level corresponding to the speech time sequence, and inputs the characteristics into the CTC training network.
The phoneme-discriminative feature F is a feature vector with time-step information. The RNN constitutes a long short-term memory temporal network that computes, from the phoneme-discriminative feature F_n, the context-characterized phoneme-label posterior probability vector h_n; h_n contains all posterior-probability observations of F_n with respect to the labels.
When training the model, the loss function maximizes through h_n the probability of correctly outputting phonemes, and updating yields an optimal coefficient matrix, where a_{n,t} denotes the phoneme classification score coefficient (connected with the context) of the phoneme posterior probability vector h_n at time t; the optimal solution is predicted from the coefficient matrix.
After greedy decoding of the phoneme labels from the observation vector h_n, the most probable label sequence of F is g_n. When aligning the phoneme sequence of a speech segment, the phoneme vector g_n corresponding to the speech features and its context (…, g_{n-2}, g_{n-1}, g_n, g_{n+1}, g_{n+2}, …) are assigned to state S_n; the blank symbol distinguishes pauses in the speech and consecutive repeated phonemes; combining the current state with the previous state S_{n-1} and its corresponding phoneme symbol c_{n-1}, the phoneme symbol c_n corresponding to the current state S_n is determined.
CTC learns the possible optimal distribution of all phoneme sequences output from speech segments, obtains an accurate mapping from speech segments to phoneme sequences, and outputs the phoneme sequence corresponding to the whole segment; by inputting the frame-level phoneme-discriminative features into the model, a phoneme recognition model capable of recognizing multiple languages is constructed.
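As a sketch of such a network (fig. 4), a bidirectional LSTM over the 40-dim discriminative features with a CTC loss can be written in PyTorch as follows; the layer sizes, symbol count and batch shapes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class PhonemeCTC(nn.Module):
    """BiLSTM encoder + linear output layer trained with CTC (sketch)."""
    def __init__(self, feat_dim=40, hidden=256, n_symbols=100):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden, num_layers=2,
                           bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, n_symbols + 1)  # +1 for CTC blank

    def forward(self, x):                    # x: (batch, T, feat_dim)
        h, _ = self.rnn(x)                   # context-characterized h_n
        return self.out(h).log_softmax(-1)   # label posteriors per frame

model = PhonemeCTC()
ctc = nn.CTCLoss(blank=0, zero_infinity=True)
x = torch.randn(2, 120, 40)                  # two feature sequences F
targets = torch.randint(1, 101, (2, 30))     # fused-set phoneme label IDs
log_probs = model(x).transpose(0, 1)         # CTCLoss expects (T, batch, C)
loss = ctc(log_probs, targets,
           input_lengths=torch.full((2,), 120),
           target_lengths=torch.full((2,), 30))
loss.backward()
```

Greedy decoding then takes the per-frame argmax of the label posteriors and collapses repeated symbols and blanks, matching the greedy phoneme-label decoding described above.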
While the present invention has been described in detail with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, and various changes can be made without departing from the spirit and scope of the present invention.

Claims (4)

1. A multi-language phoneme recognition method based on phoneme pair iterative fusion, characterized in that:
step1: phoneme corpora of several different languages with different resource levels are acquired; a corpus of higher resource level is acquired directly as the main-language phoneme set for training the first language, corpora are acquired indirectly through non-IPA phoneme dictionary coding as extended-language phoneme sets for training the second or further languages, and corpora are acquired indirectly through grapheme-IPA phoneme dictionary coding as extended-language phoneme sets for training the third or further languages;
step2: based on the phoneme sets obtained in Step1, the corpus phoneme labels acquired through non-IPA phoneme dictionary coding and those acquired through grapheme-IPA phoneme dictionary coding are uniformly mapped, using linguistic knowledge, onto the phoneme labels of the first language's main phoneme set for representation;
step3: novel phoneme-discriminative features are constructed with the human vocal tract as constraint, specifically: according to the speech production mechanism, based on the multilingual phoneme set obtained in Step2 and mapped to the first-language IPA phoneme labels, all phonemes are refined into unvoiced phonemes, nasal voiced phonemes and non-nasal voiced phonemes according to the production characteristics of the different speech categories, and a new type of feature with phoneme-discriminative properties is constructed;
step4: based on the multilingual phoneme set obtained in Step2, the IPA phoneme symbol sets mapped within each language are reduced, shrinking the sizes of the total phoneme symbol sets of the mapped main-language, second-language and third-language phoneme sets respectively;
step5: based on the reduced multi-language phoneme sets obtained in Step4, the directly acquired main phoneme set is used as the initial set and undergoes a first-iteration, first-stage fusion with the indirectly acquired second-language phoneme set; the new set formed after fusion undergoes a second-iteration, second-stage fusion with the indirectly acquired third-language phoneme set; and so on, fusing the phoneme sets of further languages;
step6: a phoneme recognition network is constructed with a connectionist temporal classification (CTC) network, realizing automatically aligned phoneme-sequence recognition of variable-length multilingual speech.
2. The multi-language phoneme recognition method based on phoneme pair iterative fusion of claim 1, wherein Step3 is specifically: discriminative feature quantities characterizing single phonemes are constructed; phonemes are divided into the three categories of unvoiced, nasal voiced and non-nasal voiced; the acoustic MFCC feature and the vocal-resonance-constrained phoneme high-order linear-prediction peak-band feature are combined into the novel phoneme-discriminative feature; T denotes the number of speech frames, and the static discriminative feature of the phoneme in the n-th frame is

f_n = [F_{n,0}, F_{n,1}, F_{n,2}, F_{n,3}, W_{n,1}, W_{n,2}, W_{n,3}, M_n]

where n = 1, 2, …, T; F_{n,0} denotes the fundamental frequency of the n-th frame; F_{n,1}, F_{n,2}, F_{n,3} denote the vocal-constraint formants of the n-th frame; W_{n,1}, W_{n,2}, W_{n,3} denote the bandwidths corresponding to those formants; and M_n denotes the 13-dimensional MFCC of the n-th frame; when the frame is nasal or unvoiced, the zero-pole frequencies and their bandwidth compensation replace the first three formant frequencies and bandwidths of non-nasal voiced speech; the segment-level static phoneme-discriminative feature is F_s = [f_1, f_2, …, f_T], the segment-level dynamic phoneme-discriminative feature is ΔF_s = [Δf_1, Δf_2, …, Δf_T], and the segment-level phoneme-discriminative feature is F = [F_s, ΔF_s].
3. The multi-language phoneme recognition method based on phoneme pair iterative fusion of claim 1, wherein Step4 is specifically: a similarity is defined and the probabilities of phoneme pairs in the phoneme set are computed; when two phonemes occur simultaneously in the same language's phoneme driving set with low and similar probabilities, or occur simultaneously in different languages' phoneme driving sets each with low probability, their probabilities are combined and the phoneme with the lower probability is reduced; for grapheme-encoded language phonemes, phoneme linguistic description precision is sacrificed: the phoneme-discriminative feature F^{(i,o)} of phoneme o in language i, i = 0, 1, 2, is computed, the cosine distance between phonemes based on the novel discriminative feature F^{(i,o)} represents phoneme similarity, the pair with maximal similarity is merged and reduced, and upon reduction the merged pair is represented by the phoneme symbol with the higher occurrence probability in the training set.
4. The multi-language phoneme recognition method based on phoneme pair iterative fusion of claim 1, wherein Step5 is specifically: a language phoneme set and the language phoneme symbol set composing it are first converted into random variables; at each fusion stage, the mutual information of the two language phoneme sets fused at that stage is expressed as a function of the number of reduction iterations, and the iteration count maximizing that expression is used for the multi-stage fusion of the phoneme sets, with the specific steps:
step5.1: compute the mutual information between the random variables X_i, i = 1, 2, formed by the iteratively reduced phoneme sets and X_0:

I(X_i; X_0) = \sum_{x \in X_i} \sum_{y \in X_0} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}

step5.2: compute the iteration count E_i at which the mutual information of the two languages' phoneme symbol sets is maximal;
step5.3: take the minimum of the mutual-information-maximizing E_i, i = 2, …, L, over all L languages as the optimal number of reduction iterations for all phoneme sets;
step5.4: merge the iteratively reduced phoneme symbol sets at the optimal iteration count to determine the fused multi-language complete phoneme symbol set under the fused i-language phoneme symbol set and fused random variable.
CN202211106527.3A 2022-09-12 2022-09-12 Multi-language phoneme recognition method based on phoneme pair iterative fusion Pending CN115512689A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211106527.3A CN115512689A (en) 2022-09-12 2022-09-12 Multi-language phoneme recognition method based on phoneme pair iterative fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211106527.3A CN115512689A (en) 2022-09-12 2022-09-12 Multi-language phoneme recognition method based on phoneme pair iterative fusion

Publications (1)

Publication Number Publication Date
CN115512689A true CN115512689A (en) 2022-12-23

Family

ID=84504527

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211106527.3A Pending CN115512689A (en) 2022-09-12 2022-09-12 Multi-language phoneme recognition method based on phoneme pair iterative fusion

Country Status (1)

Country Link
CN (1) CN115512689A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220391588A1 (en) * 2021-06-04 2022-12-08 Google Llc Systems and methods for generating locale-specific phonetic spelling variations
US11893349B2 (en) * 2021-06-04 2024-02-06 Google Llc Systems and methods for generating locale-specific phonetic spelling variations


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination