WO2020211350A1 - Speech corpus training method, apparatus, computer device and storage medium - Google Patents

Speech corpus training method, apparatus, computer device and storage medium

Info

Publication number
WO2020211350A1
WO2020211350A1 PCT/CN2019/117718 CN2019117718W WO2020211350A1 WO 2020211350 A1 WO2020211350 A1 WO 2020211350A1 CN 2019117718 W CN2019117718 W CN 2019117718W WO 2020211350 A1 WO2020211350 A1 WO 2020211350A1
Authority
WO
WIPO (PCT)
Prior art keywords
pronunciation
word
threshold
speech corpus
speech
Prior art date
Application number
PCT/CN2019/117718
Other languages
English (en)
French (fr)
Inventor
杨承勇
肖玉宾
敬大彦
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2020211350A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Definitions

  • This application relates to the field of computer technology, in particular to speech corpus training methods, devices, computer equipment and storage media.
  • The acoustic model is one of the most important parts of a speech recognition system. Through the acoustic model, speech can be converted into text.
  • Speech corpora can be collected on a large scale for training acoustic models. In existing practice, the frequency with which words occur in the speech corpus is not counted. The inventor found that, under normal circumstances, the more frequently a word occurs in the training data, the more accurate the conversion between speech and text based on the trained acoustic model. Because existing implementations ignore this, their conversion accuracy is usually low.
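  • A speech corpus training method, including: determining at least one pre-collected general word and at least one pre-collected pronunciation region; determining at least one first threshold according to a preset threshold determination method, each first threshold corresponding to one general word and one pronunciation region, wherein the threshold determination method determines the first threshold corresponding to a general word and a pronunciation region according to how close the pronunciation of the general word in that region is to its standard pronunciation in Mandarin; determining a second threshold corresponding to each general word according to the predetermined frequency of use of the general word; determining a preset speech corpus that includes at least one speech corpus entry, wherein each entry corresponds to a pronunciation region and is pronounced in the pronunciation of that region; taking each first threshold in turn as the current first threshold and executing: for the first general word and the first pronunciation region corresponding to the current first threshold, when the number of occurrences of the pronunciation of the first general word in all first speech corpora is less than the current first threshold, adding speech corpora to the speech corpus, wherein the first speech corpora are the entries in the speech corpus that correspond to the first pronunciation region; when execution is completed for every first threshold, taking each second threshold in turn as the current second threshold and executing: for the second general word corresponding to the current second threshold, when the number of occurrences of the pronunciation of the second general word in all speech corpora of the speech corpus is less than the current second threshold, adding speech corpora to the speech corpus; and, when execution is completed for every second threshold, training an acoustic model of the at least one general word according to the speech corpus.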
  • a speech corpus training device including:
  • the first determining unit is used to determine at least one pre-collected general word and at least one pre-collected pronunciation region;
  • the second determining unit is configured to determine at least one first threshold according to a preset threshold determination method, each first threshold corresponding to one general word and one pronunciation region, wherein the threshold determination method determines the first threshold corresponding to a general word and a pronunciation region according to how close the pronunciation of the general word in that region is to its standard pronunciation in Mandarin; and to determine the second threshold corresponding to each general word according to the predetermined frequency of use of the general word;
  • the third determining unit is configured to determine a preset speech corpus that includes at least one speech corpus entry, wherein each entry corresponds to a pronunciation region and is pronounced in the pronunciation of that region;
  • the processing unit is configured to take each first threshold in turn as the current first threshold and execute: for the first general word and the first pronunciation region corresponding to the current first threshold, when the number of occurrences of the pronunciation of the first general word in all first speech corpora is less than the current first threshold, add speech corpora to the speech corpus, wherein the first speech corpora are the entries in the speech corpus that correspond to the first pronunciation region; and, when execution is completed for every first threshold, to take each second threshold in turn as the current second threshold and execute: for the second general word corresponding to the current second threshold, when the number of occurrences of the pronunciation of the second general word in all speech corpora of the speech corpus is less than the current second threshold, add speech corpora to the speech corpus;
  • the training unit is configured to train an acoustic model of the at least one general word according to the speech corpus when the execution of each of the second thresholds is completed.
  • a computer device including a memory and a processor, the memory storing computer-readable instructions which, when executed by the processor, cause the processor to execute the steps of any of the above speech corpus training methods.
  • a non-volatile readable storage medium storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to execute the steps of any of the above speech corpus training methods.
  • This application provides speech corpus training methods, devices, computer equipment and storage media. The method determines a number of general words and a number of pronunciation regions; determines a number of first thresholds, where different first thresholds correspond to different general words and/or pronunciation regions; determines the second threshold corresponding to each general word; determines the speech corpus, in which each speech corpus entry corresponds to a pronunciation region; supplements the speech corpus as needed, so that, for all speech corpora corresponding to a pronunciation region, the number of occurrences of the pronunciation of a general word in them is not less than the first threshold corresponding to that general word and that pronunciation region, and, for all speech corpora in the speech corpus, the number of occurrences of the pronunciation of a general word in them is not less than the second threshold corresponding to that general word; and trains the acoustic model according to the speech corpus. In this way, the accuracy of the conversion between speech and text can be improved.
  • Fig. 1 is a flowchart of a speech corpus training method provided in an embodiment
  • Figure 2 is a flowchart of a voice corpus training method provided in another embodiment
  • Fig. 3 is a schematic diagram of a speech corpus training device provided in an embodiment.
  • Step 101 Determine at least one pre-collected general word, and determine at least one pre-collected pronunciation region.
  • Step 102 Determine at least one first threshold according to a preset threshold determination method, each first threshold corresponding to one general word and one pronunciation region, wherein the threshold determination method determines the first threshold corresponding to a general word and a pronunciation region according to how close the pronunciation of the general word in that region is to its standard pronunciation in Mandarin.
  • Step 103 Determine a second threshold corresponding to each general word according to the predetermined frequency of use of the general word.
  • Step 104 Determine a preset speech corpus that includes at least one speech corpus entry, wherein each entry corresponds to a pronunciation region and is pronounced in the pronunciation of that region.
  • Step 105 Take each first threshold in turn as the current first threshold, and execute: for the first general word and the first pronunciation region corresponding to the current first threshold, when the number of occurrences of the pronunciation of the first general word in all first speech corpora is less than the current first threshold, add speech corpora to the speech corpus, wherein the first speech corpora are the entries in the speech corpus that correspond to the first pronunciation region.
  • Step 106 When execution is completed for every first threshold, take each second threshold in turn as the current second threshold, and execute: for the second general word corresponding to the current second threshold, when the number of occurrences of the pronunciation of the second general word in all speech corpora of the speech corpus is less than the current second threshold, add speech corpora to the speech corpus.
  • Step 107 When the execution of each of the second thresholds is completed, train an acoustic model of the at least one general word according to the speech corpus.
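  • The ordering of steps 105 to 107 can be sketched in Python as follows. This is a minimal illustration and not the patent's implementation: the `(region, transcript)` representation of an entry, counting occurrences on transcripts rather than audio, and the `fetch` callback that supplies new entries are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Optional, Tuple

Entry = Tuple[str, str]  # (pronunciation region, transcript): hypothetical representation


def occurrences(word: str, entries: List[Entry], region: Optional[str] = None) -> int:
    """Count occurrences of `word`, optionally restricted to one pronunciation region."""
    return sum(text.count(word) for reg, text in entries
               if region is None or reg == region)


def supplement_corpus(
    corpus: List[Entry],
    first_thresholds: Dict[Tuple[str, str], int],
    second_thresholds: Dict[str, int],
    fetch: Callable[[str, Optional[str], int], List[Entry]],
) -> List[Entry]:
    """Satisfy every first threshold (step 105), then every second threshold (step 106)."""
    corpus = list(corpus)
    # Step 105: one check per (general word, pronunciation region) pair.
    for (word, region), threshold in first_thresholds.items():
        missing = threshold - occurrences(word, corpus, region)
        if missing > 0:
            corpus.extend(fetch(word, region, missing))
    # Step 106: one check per general word, across all regions.
    for word, threshold in second_thresholds.items():
        missing = threshold - occurrences(word, corpus)
        if missing > 0:
            corpus.extend(fetch(word, None, missing))
    return corpus  # Step 107 would train the acoustic model on this corpus.


def fake_fetch(word: str, region: Optional[str], missing: int) -> List[Entry]:
    # Stand-in for a real recording source: one single-word entry per missing occurrence.
    return [(region or "Beijing", word)] * missing


corpus = supplement_corpus(
    [("Sichuan", "我爱我的祖国")],
    first_thresholds={("我", "Sichuan"): 3, ("我", "Beijing"): 1},
    second_thresholds={"我": 6},
    fetch=fake_fetch,
)
```

  • Running the region-specific checks first means that any entries they add also count toward the word-level second thresholds, so the second pass only has to cover what is still missing.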
  • the embodiment of the application provides a speech corpus training method.
  • The method includes: determining a number of general words and a number of pronunciation regions; determining a number of first thresholds, where different first thresholds correspond to different general words and/or pronunciation regions; determining the second threshold corresponding to each general word; determining the speech corpus, in which each speech corpus entry corresponds to a pronunciation region; supplementing the speech corpus as needed, so that, for all speech corpora corresponding to a pronunciation region, the number of occurrences of the pronunciation of a general word in them is not less than the first threshold corresponding to that general word and that pronunciation region, and, for all speech corpora in the speech corpus, the number of occurrences of the pronunciation of a general word in them is not less than the second threshold corresponding to that general word; and training the acoustic model according to the speech corpus. In this way, the accuracy of the conversion between speech and text can be improved.
  • To ensure conversion accuracy, the speech corpus used to train the acoustic model should include some speech corpora with Sichuan accents, and these should contain the pronunciation of words such as "I", "Yes", and "Chinese". Therefore, general words such as "I", "Yes", and "Chinese" must first be collected, and pronunciation regions such as "Sichuan" must be determined.
  • Example 1: Assume there are three pre-collected general words, namely "I", "Yes", and "Chinese", and two pre-collected pronunciation regions, namely Beijing and Sichuan.
  • the difference in pronunciation of different words in different regions can be large or small.
  • the first threshold corresponding to the universal word and the pronunciation area can be determined according to how close the pronunciation of a universal word in a pronunciation region is to the standard pronunciation of the universal word in Mandarin.
  • For example, the first threshold corresponding to the word "I" and the pronunciation region "Sichuan" is usually greater than the first threshold corresponding to the word "Chinese" and the pronunciation region "Sichuan". That is, the speech corpus should include more recordings of "I" said in a Sichuan accent, and relatively fewer recordings of "Chinese" said in a Sichuan accent.
  • Likewise, the first threshold corresponding to the word "I" and the pronunciation region "Sichuan" is usually greater than the first threshold corresponding to the word "I" and the pronunciation region "Beijing". That is, the speech corpus should include more recordings of "I" said in a Sichuan accent, and relatively fewer recordings of "I" said in a Beijing accent.
  • Based on Example 1, six first thresholds can be determined, namely: the first threshold Q1 corresponding to "I" and "Sichuan", the first threshold Q2 corresponding to "Yes" and "Sichuan", the first threshold Q3 corresponding to "Chinese" and "Sichuan", the first threshold Q4 corresponding to "I" and "Beijing", the first threshold Q5 corresponding to "Yes" and "Beijing", and the first threshold Q6 corresponding to "Chinese" and "Beijing".
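  • The pairing of general words with pronunciation regions in Example 1 can be enumerated with a short Python sketch. The labels Q1 to Q6 and their order follow the text; the variable names are illustrative only.

```python
from itertools import product

words = ["I", "Yes", "Chinese"]
regions = ["Sichuan", "Beijing"]

# One first threshold per (general word, pronunciation region) pair.
# Iterating regions in the outer position reproduces the Q1..Q6 order used in the text.
first_threshold_keys = {
    f"Q{i}": (word, region)
    for i, (region, word) in enumerate(product(regions, words), start=1)
}

for label, pair in first_threshold_keys.items():
    print(label, pair)
```

  • With three general words and two pronunciation regions this yields 3 × 2 = 6 first thresholds.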
  • the frequency of use of different words is different.
  • The second threshold corresponding to each general word can be determined according to the predetermined frequency of use of the general word. For example, the word "I" is usually used more often in speech than the word "Chinese", so the second threshold corresponding to "I" is usually greater than the second threshold corresponding to "Chinese". That is, the speech corpus should include more recordings containing the word "I" and relatively fewer containing the word "Chinese".
  • Based on Example 1, three second thresholds can be determined, namely: the second threshold P1 corresponding to "I", the second threshold P2 corresponding to "Yes", and the second threshold P3 corresponding to "Chinese".
  • In step 104, training the acoustic model requires a speech corpus that meets the above first and second thresholds.
  • the speech corpus includes several speech corpora.
  • the voice corpus here can be recording fragments of daily conversations, recording fragments of reading specific articles, etc.
  • the pronunciation of the same voice corpus is consistent, so it can be considered that each voice corpus corresponds to a pronunciation region, and the pronunciation of each voice corpus is the pronunciation of the corresponding pronunciation region.
  • The existing speech corpus usually does not fully meet these constraints, so it is supplemented until it does.
  • This supplementary operation can be divided into two major steps. The first step supplements the speech corpus for each first threshold; after the first step is completed, the second step supplements it for each second threshold.
  • The first step is to supplement on demand for each first threshold.
  • Q1 to Q6 above are each taken in turn as the current first threshold for analysis.
  • Taking the analysis of Q1 as an example: since Q1 corresponds to "I" and "Sichuan", the first speech corpora are the speech corpora in the speech corpus that are pronounced in the Sichuan accent.
  • The number of occurrences of the pronunciation of "I" in these Sichuan-accent speech corpora is then checked. If the number is less than Q1, the speech corpus needs to be supplemented; otherwise it does not. Assume that the speech corpus contains 4 speech corpora at this time:
  • Speech corpus 1: "I love my motherland" pronounced in a Sichuan accent;
  • Speech corpus 2: "I love my motherland" pronounced in a Beijing accent;
  • Speech corpus 3: "I love my home" pronounced in a Sichuan accent;
  • Speech corpus 4: "I love my home" pronounced in a Beijing accent.
  • the aforementioned speech corpus 1 and speech corpus 3 are all the first speech corpora at this time.
  • the number of occurrences of the pronunciation of "I" in all the first speech corpus is 4 times.
  • Q2 to Q6 are then analyzed in turn, and the speech corpus is supplemented as needed, so that the supplemented speech corpus meets every first threshold.
  • The second step is then executed, which is to supplement on demand for each second threshold.
  • In step 106, that is, in the second step, P1 to P3 are analyzed in turn. Taking the analysis of P1 as an example: since P1 corresponds to "I", the number of occurrences of the pronunciation of "I" across all the speech corpora in the speech corpus is checked. If the number is less than P1, the speech corpus needs to be supplemented; otherwise it does not.
  • the speech corpus at this time has the following 6 speech corpora: the aforementioned speech corpus 1 to 4, and the following speech corpus 5 and 6.
  • Speech corpus 5 "Sit down, please” pronounced in Sichuan accent
  • Speech corpus 6 "Please drink tea” pronounced in Sichuan accent.
  • the number of occurrences of the pronunciation of "I" is 8.
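  • The two counts in this example (4 for the Sichuan-accent entries among speech corpora 1 to 4, and 8 across all six entries) can be reproduced with the sketch below. The Chinese transcripts are back-translations assumed for this illustration, since the word "I" (我) occurs twice in each of the first four sentences only in Chinese; counting on transcripts rather than audio is likewise an assumption.

```python
from typing import List, Optional, Tuple

Entry = Tuple[str, str]  # (pronunciation region, transcript): hypothetical representation

corpus: List[Entry] = [
    ("Sichuan", "我爱我的祖国"),  # speech corpus 1: "I love my motherland"
    ("Beijing", "我爱我的祖国"),  # speech corpus 2
    ("Sichuan", "我爱我的家"),    # speech corpus 3: "I love my home"
    ("Beijing", "我爱我的家"),    # speech corpus 4
    ("Sichuan", "请坐"),          # speech corpus 5: "Sit down, please"
    ("Sichuan", "请喝茶"),        # speech corpus 6: "Please drink tea"
]


def occurrences(word: str, entries: List[Entry], region: Optional[str] = None) -> int:
    """Count occurrences of `word`, optionally restricted to one pronunciation region."""
    return sum(text.count(word) for reg, text in entries
               if region is None or reg == region)


# First-threshold check for Q1 ("I", "Sichuan") on speech corpora 1 to 4.
sichuan_count = occurrences("我", corpus[:4], region="Sichuan")
# Second-threshold check for P1 ("I") on all six speech corpora.
total_count = occurrences("我", corpus)
print(sichuan_count, total_count)
```

  • Each count would then be compared with Q1 or P1 respectively to decide whether supplementation is needed.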
  • P2 and P3 are analyzed in turn, and the speech corpus is supplemented as needed, so that the supplemented speech corpus can meet each second threshold.
  • In step 107, when execution is completed for every first threshold and every second threshold, the number and variety of speech corpora in the latest speech corpus can be considered rich enough to ensure the accuracy of conversion between speech and text. An acoustic model for the preset general words can then be trained on the latest speech corpus.
  • The first and second thresholds above can be set, and the general words and/or pronunciation regions corresponding to different first thresholds are different.
  • The standard value can be an empirical value, usually the maximum number of speech corpora to be supplemented.
  • The closer the regional pronunciation is to the standard Mandarin pronunciation, the smaller the weight and the smaller the corresponding supplement; the farther it is from the standard Mandarin pronunciation, the greater the weight and the larger the corresponding supplement. For example, "I" said in a Sichuan accent differs greatly from the standard Mandarin pronunciation, so the weight corresponding to "I" and "Sichuan" can be 0.9. If the first standard value is 10000, Q1 above equals 9000. By contrast, "Chinese" said in a Sichuan accent differs only slightly from the standard Mandarin pronunciation, so the weight corresponding to "Chinese" and "Sichuan" can be 0.3. With the same first standard value of 10000, Q3 above equals 3000.
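  • The arithmetic in this example is consistent with a first threshold equal to the first standard value multiplied by the weight. Formula 1 itself is not reproduced in this text, so the product form below is an inference from the two worked values (0.9 × 10000 = 9000 and 0.3 × 10000 = 3000), and the function and variable names are hypothetical.

```python
FIRST_STANDARD_VALUE = 10000  # empirical value taken from the example in the text

# Weights from the example; each lies in the open interval (0, 1).
weights = {
    ("I", "Sichuan"): 0.9,        # far from standard Mandarin -> large weight
    ("Chinese", "Sichuan"): 0.3,  # close to standard Mandarin -> small weight
}


def first_threshold(weight: float, standard_value: int = FIRST_STANDARD_VALUE) -> int:
    """Assumed form of Formula 1: threshold = standard value * weight."""
    if not 0 < weight < 1:
        raise ValueError("weight must lie strictly between 0 and 1")
    return round(standard_value * weight)


first_thresholds = {pair: first_threshold(w) for pair, w in weights.items()}
print(first_thresholds)
```

  • Because every weight is below 1, no first threshold can exceed the first standard value, which matches the description of the standard value as the maximum number to be supplemented.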
  • Each first threshold can thus be set according to how hard each region's accent is to recognize and how varied the accents are, so that speech corpora are supplemented only as needed and the added data-processing load of useless or inefficient supplementation is avoided.
  • the above-mentioned second threshold may be set in consideration of the difference in frequency of use of different general words.
  • Determining the second threshold corresponding to each general word according to the predetermined frequency of use of the general word includes: setting a second standard value; determining a preset text set that includes each general word; counting the number of occurrences of each general word in the text set; and calculating the second threshold corresponding to each general word according to Formula 2.
  • In Formula 2, y_j is the second threshold corresponding to the j-th general word of the at least one general word, X_2 is the second standard value, m is the number of general words, and n_j is the number of occurrences of the j-th general word in the text set.
  • the text in the text collection may be an article, a piece of news text report, or a piece of text after voice recognition.
  • For example, the word "I" occurs 200 times and the word "Chinese" occurs 5 times in the text set; with a second standard value of 50000, P1 above equals 1000 and P3 above equals 25.
  • It can be seen that in the embodiment of the present application, the specific value of each second threshold can be set according to the frequency of use of different words, so that speech corpora are supplemented only as needed and the added data-processing load of useless or inefficient supplementation is avoided.
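  • The worked values can be reproduced if the second threshold is taken to be proportional to a word's share of all general-word occurrences. Formula 2 is not reproduced in this text, so the form y_j = X_2 · n_j / (n_1 + … + n_m) used below is an assumption inferred from the variable definitions. The third count (9795) is a hypothetical filler chosen only so that the total is 10000, the value implied by P1 = 1000 and P3 = 25.

```python
from typing import Dict

SECOND_STANDARD_VALUE = 50000  # second standard value from the example in the text


def second_thresholds(counts: Dict[str, int],
                      standard_value: int = SECOND_STANDARD_VALUE) -> Dict[str, float]:
    """Assumed form of Formula 2: y_j = X_2 * n_j / sum of n_k over all general words."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("the text set contains none of the general words")
    return {word: standard_value * n / total for word, n in counts.items()}


# Occurrence counts in the text set. "I" and "Chinese" come from the example;
# the remaining count is hypothetical and only makes the total equal 10000.
counts = {"I": 200, "Chinese": 5, "(other general words)": 9795}

thresholds = second_thresholds(counts)
print(thresholds["I"], thresholds["Chinese"])
```

  • Under this assumed form the second thresholds always sum to the second standard value, so that value acts as an overall budget shared among the general words.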
  • the at least one general word includes: part or all of the general words in the general dictionary, and/or part or all of the general words in the general dictionary.
  • the practicability of the collected general words can be guaranteed, and then the practicability of the trained acoustic model can be guaranteed.
  • each of the universal words and each of the speech corpus relates to a preset technical field, so that the acoustic model is an acoustic model for the preset technical field.
  • this specific field can be the medical field, the game competition field, etc.
  • general words can be collected in a targeted manner for a specific field, so that a targeted acoustic model can be trained.
  • the conversion accuracy between speech and text in a specific domain is better based on the acoustic model for the specific domain.
  • The first threshold is checked, that is, for all the speech corpora in the speech corpus that correspond to the first pronunciation region, whether the number of occurrences of the pronunciation of the first general word in them is less than the first threshold corresponding to the first general word and the first pronunciation region. If the result is yes, speech corpora need to be added to the speech corpus.
  • A negative result means the opposite: when the existing speech corpus is used to train the acoustic model, conversion between speech and text that involves the first general word pronounced in the first pronunciation region is usually already fairly accurate, so there is no need to add speech corpora to the speech corpus.
  • Through this supplementary operation, an acoustic model trained on the supplemented speech corpus can convert accurately between speech and text for the same general word pronounced in different pronunciation regions.
  • The supplemented speech corpora are those that contain the first general word and correspond to the first pronunciation region. That is, only speech corpora containing the first general word pronounced in the first pronunciation region are added; speech corpora containing other general words, or pronounced in other pronunciation regions, are not. Because the supplement is targeted in this way, the first threshold check yields a negative result when repeated on the supplemented speech corpus, and the amount of computation in subsequent operations is kept as small as possible.
  • Besides targeting the content of the supplement, the number of supplemented speech corpora can also be controlled: it should be as small as possible while still ensuring that the first threshold check, repeated on the supplemented speech corpus, yields a negative result. This keeps the computation for subsequent checks as small as possible. That is, the number of supplements is the minimum number that satisfies the following condition: for all the speech corpora in the speech corpus that correspond to the first pronunciation region, the number of occurrences of the first general word in them is not less than the first threshold corresponding to the first general word and the first pronunciation region.
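  • One way to realize a "minimum number of supplements" is a greedy selection that prefers candidates containing the target word most often. The sketch below is an illustration under assumptions, not the patent's procedure: the pool of candidate transcripts (already restricted to the first pronunciation region), the transcript-based counting, and the function name are all hypothetical.

```python
from typing import List


def pick_minimal_supplement(word: str, current_count: int, threshold: int,
                            candidates: List[str]) -> List[str]:
    """Choose the fewest candidate transcripts that lift the count to the threshold.

    Taking candidates in descending order of occurrences minimizes how many are needed.
    """
    deficit = threshold - current_count
    chosen: List[str] = []
    for text in sorted(candidates, key=lambda t: t.count(word), reverse=True):
        if deficit <= 0:
            break
        hits = text.count(word)
        if hits == 0:
            break  # the remaining candidates do not contain the word at all
        chosen.append(text)
        deficit -= hits
    if deficit > 0:
        raise ValueError("candidate pool cannot satisfy the threshold")
    return chosen


# Hypothetical pool of Sichuan-accent transcripts; "我" means "I".
pool = ["我爱我的家", "我来了", "请坐", "我想我是我"]
chosen = pick_minimal_supplement("我", current_count=4, threshold=7, candidates=pool)
print(chosen)
```

  • Here a deficit of 3 occurrences is covered by a single candidate that contains the word three times, so only one entry is added.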
  • The second threshold is then checked, that is, for all the speech corpora in the speech corpus, whether the number of occurrences of the second general word in them is less than the second threshold corresponding to the second general word. If the result is yes, speech corpora need to be added to the speech corpus.
  • the number ratio of various types of speech corpora corresponding to different pronunciation regions in the supplemented speech corpus is within a preset number ratio range.
  • the accent of Sichuan dialect is heavier than the accent of Northeast dialect, and the number of speech corpora corresponding to Sichuan is preferably greater than the number of speech corpora corresponding to Northeast. In this way, the acoustic model trained based on the speech corpus can have a better conversion effect in terms of pronunciation area.
  • Besides targeting the content of the supplement, the number of supplemented speech corpora can also be controlled: it should be as small as possible while still ensuring that the second threshold check, repeated on the supplemented speech corpus, yields a negative result. This keeps the computation for subsequent checks as small as possible. That is, the number of supplements is the minimum number that satisfies the following condition: for all the speech corpora in the speech corpus, the number of occurrences of the second general word in them is not less than the second threshold corresponding to the second general word.
  • Training the acoustic model of the at least one general word according to the speech corpus includes: determining an initial acoustic model; obtaining at least two sub-speech corpora, every speech corpus entry of which belongs to the speech corpus; for each sub-speech corpus, optimizing the initial acoustic model based on the current sub-speech corpus to obtain an optimized acoustic model; fusing all the optimized acoustic models obtained, to obtain a target acoustic model that meets a preset convergence condition; and determining that the target acoustic model is the acoustic model of the at least one general word.
  • an embodiment of the present application provides another voice corpus training method, which may include the following steps:
  • Step 201 Collect at least one general word and determine at least one pronunciation region.
  • the at least one general word includes: part or all of the general words in the general dictionary, and/or part or all of the general words in the general dictionary.
  • Step 202 Set the first standard value and the second standard value.
  • Step 203 Determine at least one weight, where each weight corresponds to a general word and a pronunciation region, and different weights correspond to different general words and/or pronunciation regions.
  • The value range of the weight is (0, 1).
  • For the target weight corresponding to a target general word and a target pronunciation region, the closer the pronunciation of the target general word in the target pronunciation region is to its standard pronunciation in Mandarin, the smaller the value of the target weight.
  • Step 204 Calculate the first threshold corresponding to each weight.
  • each first threshold corresponds to a general word and a pronunciation region, and different first thresholds correspond to different general words and/or pronunciation regions.
  • each first threshold can be calculated according to Formula 1.
  • Step 205 Determine a text set, which includes every common word.
  • Step 206 Count the number of occurrences of each general word in the text collection.
  • Step 207 Calculate the second threshold corresponding to each general word.
  • each second threshold can be calculated according to Formula 2.
  • Step 208 Determine a speech corpus including at least one speech corpus, and each speech corpus corresponds to a pronunciation region.
  • Step 209 Use each first threshold as the current first threshold, and execute: For the first universal word and the first pronunciation region corresponding to the current first threshold, the pronunciation of the first universal word is in all the first speech corpus When the number of occurrences in is less than the current first threshold, the speech corpus is added to the speech corpus, where the first speech corpus is the speech corpus corresponding to the first pronunciation region in the speech corpus.
  • Step 210 When the execution for each first threshold is completed, use each second threshold as the current second threshold, and execute: for the second general word corresponding to the current second threshold, the pronunciation of the second general word When the number of appearances in all the speech corpora of the speech corpus is less than the current second threshold, the speech corpus is supplemented with the speech corpus.
  • Step 211 When execution is completed for each general word, determine the initial acoustic model and obtain at least two sub-speech corpora, every speech corpus entry of which belongs to the speech corpus. Any two sub-speech corpora contain the same total number of speech corpora, and that total is within a preset numerical range.
  • Step 212 Execute for each sub-speech corpus: optimize the initial acoustic model based on the current sub-speech corpus to obtain an optimized acoustic model.
  • Step 213 Fuse all the obtained optimized acoustic models to obtain a target acoustic model that meets the preset convergence condition.
  • Step 214 Determine that the target acoustic model is an acoustic model of at least one general word.
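  • Steps 211 to 213 can be sketched as follows. The text does not say how the sub-speech corpora are drawn or how the optimized models are fused, so the contiguous equal-size split, the parameter averaging, and the toy `optimize` function are all assumptions made for illustration; the preset convergence condition is not modeled.

```python
from typing import Callable, List, Sequence

Model = List[float]  # hypothetical stand-in for acoustic model parameters


def split_equal(entries: Sequence[str], parts: int) -> List[List[str]]:
    """Split entries into `parts` sub-corpora of equal size, dropping any remainder."""
    size = len(entries) // parts
    if parts < 2 or size == 0:
        raise ValueError("need at least two non-empty sub-corpora")
    return [list(entries[i * size:(i + 1) * size]) for i in range(parts)]


def fuse_by_averaging(models: List[Model]) -> Model:
    """Fuse optimized models by averaging each parameter (an assumed fusion rule)."""
    return [sum(values) / len(models) for values in zip(*models)]


def train(entries: Sequence[str], initial_model: Model, parts: int,
          optimize: Callable[[Model, List[str]], Model]) -> Model:
    sub_corpora = split_equal(entries, parts)                                   # step 211
    optimized = [optimize(list(initial_model), sub) for sub in sub_corpora]    # step 212
    return fuse_by_averaging(optimized)                                         # step 213


def toy_optimize(model: Model, sub_corpus: List[str]) -> Model:
    # Placeholder for real training: shift parameters by the mean transcript length.
    step = sum(len(text) for text in sub_corpus) / len(sub_corpus)
    return [p + step for p in model]


target_model = train(["我爱我的祖国", "我爱我的家", "请坐", "请喝茶"],
                     initial_model=[0.0, 1.0], parts=2, optimize=toy_optimize)
print(target_model)
```

  • Each sub-corpus starts from a copy of the same initial model, so the optimizations are independent of one another and could run in parallel before fusion.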
  • a voice corpus training device, which may include: a first determining unit 301, configured to determine at least one pre-collected general word and at least one pre-collected pronunciation region;
  • the second determining unit 302 is configured to determine at least one first threshold according to a preset threshold determination method, each first threshold corresponding to one general word and one pronunciation region, wherein the threshold determination method determines the first threshold corresponding to a general word and a pronunciation region according to how close the pronunciation of the general word in that region is to its standard pronunciation in Mandarin; and to determine the second threshold corresponding to each general word according to the predetermined frequency of use of the general word;
  • the third determining unit 303 is configured to determine a preset speech corpus that includes at least one speech corpus entry, wherein each entry corresponds to a pronunciation region and is pronounced in the pronunciation of that region;
  • the processing unit 304 is configured to take each first threshold in turn as the current first threshold and execute: for the first general word and the first pronunciation region corresponding to the current first threshold, when the number of occurrences of the pronunciation of the first general word in all first speech corpora is less than the current first threshold, add speech corpora to the speech corpus, wherein the first speech corpora are the entries in the speech corpus that correspond to the first pronunciation region.
  • the second determining unit 302 is configured to set a first standard value; determine at least one weight, wherein each weight corresponds to one general word and one pronunciation region and takes a value in (0, 1], and, for a target weight corresponding to a target general word and a target pronunciation region, the closer the pronunciation of the target general word in the target pronunciation region is to its standard Mandarin pronunciation, the smaller the target weight; and calculate the first threshold corresponding to each weight according to Formula 1 above.
  • the second determining unit 302 is configured to set a second standard value; determine a preset text collection that contains every general word; count the number of occurrences of each general word in the text collection; and calculate the second threshold for each general word according to Formula 2 above.
  • the training unit 305 is configured to determine an initial acoustic model; obtain at least two sub-speech corpora, the speech corpus containing every speech sample of every sub-speech corpus; for each sub-speech corpus, optimize the initial acoustic model on the current sub-speech corpus to obtain an optimized acoustic model; fuse all of the optimized acoustic models to obtain a target acoustic model that satisfies a preset convergence condition; and determine the target acoustic model to be the acoustic model of the at least one general word.
  • An embodiment of the present application also provides a computer device, including a memory and a processor. The memory stores computer-readable instructions which, when executed by the processor, cause the processor to perform the steps of any of the above speech corpus training methods.
  • An embodiment of the present application also provides a non-volatile readable storage medium storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of any of the above speech corpus training methods.
  • the computer program can be stored in a computer-readable storage medium and, when executed, may include the processes of the above method embodiments.
  • the aforementioned storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disc, or a read-only memory (ROM), or a random access memory (RAM), and so on.
  • the technical features of the above embodiments can be combined arbitrarily. For brevity, not every possible combination of these technical features is described.

Abstract

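This application provides a speech corpus training method and apparatus, a computer device, and a storage medium. A number of general words and a number of pronunciation regions are determined; a number of first thresholds are determined, where different first thresholds correspond to different general words and/or pronunciation regions, and a second threshold is determined for each general word; a speech corpus is determined in which every speech sample corresponds to one pronunciation region; speech samples are added to the speech corpus as needed so that (i) among all speech samples in the corpus that correspond to a given pronunciation region, the pronunciation of a given general word occurs no fewer times than the first threshold for that word and that region, and (ii) among all speech samples in the corpus, the pronunciation of a given general word occurs no fewer times than the second threshold for that word; an acoustic model is then trained on the speech corpus. This improves the accuracy of conversion between speech and text.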

Description

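Speech corpus training method and apparatus, computer device, and storage medium
This application claims priority to the Chinese patent application filed with the China Patent Office on April 19, 2019, with application number 201910320221X and titled "Speech corpus training method and apparatus, computer device, and storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of computer technology, and in particular to a speech corpus training method and apparatus, a computer device, and a storage medium.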
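Background
The acoustic model is one of the most important parts of a speech recognition system; it is what allows speech to be converted into text. At present, speech samples can be collected on a large scale to train acoustic models. In that process, however, no statistics are kept on how often words occur in the speech samples. The inventors found that, in general, the more frequently a word occurs, the more accurately the trained acoustic model converts between speech and text. Existing implementations therefore tend to have low conversion accuracy.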
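Summary
In view of this, it is necessary to provide a speech corpus training method and apparatus, a computer device, and a storage medium that address the problem of generally low conversion accuracy. According to one aspect of this application, a speech corpus training method is provided, comprising: determining at least one pre-collected general word and at least one pre-collected pronunciation region; determining at least one first threshold according to a preset threshold determination method, each first threshold corresponding to one general word and one pronunciation region, wherein the threshold determination method determines the first threshold for a general word and a pronunciation region according to how close the pronunciation of that general word in that region is to its standard Mandarin pronunciation; determining a second threshold for each general word according to a predetermined usage frequency of the general words; determining a preset speech corpus that includes at least one speech sample, wherein each speech sample corresponds to one pronunciation region and is pronounced with the pronunciation of that region; taking each first threshold in turn as the current first threshold and executing: for the first general word and first pronunciation region corresponding to the current first threshold, when the number of occurrences of the pronunciation of the first general word in all first speech samples is less than the current first threshold, supplementing the speech corpus with speech samples, wherein the first speech samples are the speech samples in the speech corpus that correspond to the first pronunciation region; once every first threshold has been processed, taking each second threshold in turn as the current second threshold and executing: for the second general word corresponding to the current second threshold, when the number of occurrences of the pronunciation of the second general word in all speech samples of the speech corpus is less than the current second threshold, supplementing the speech corpus with speech samples; and, once every second threshold has been processed, training an acoustic model of the at least one general word according to the speech corpus.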
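According to another aspect of this application, a speech corpus training apparatus is provided, comprising:
a first determining unit, configured to determine at least one pre-collected general word and at least one pre-collected pronunciation region;
a second determining unit, configured to determine at least one first threshold according to a preset threshold determination method, each first threshold corresponding to one general word and one pronunciation region, wherein the threshold determination method determines the first threshold for a general word and a pronunciation region according to how close the pronunciation of that general word in that region is to its standard Mandarin pronunciation; and to determine a second threshold for each general word according to a predetermined usage frequency of the general words;
a third determining unit, configured to determine a preset speech corpus that includes at least one speech sample, wherein each speech sample corresponds to one pronunciation region and is pronounced with the pronunciation of that region;
a processing unit, configured to take each first threshold in turn as the current first threshold and execute: for the first general word and first pronunciation region corresponding to the current first threshold, when the number of occurrences of the pronunciation of the first general word in all first speech samples is less than the current first threshold, supplement the speech corpus with speech samples, wherein the first speech samples are the speech samples in the speech corpus that correspond to the first pronunciation region; and, once every first threshold has been processed, to take each second threshold in turn as the current second threshold and execute: for the second general word corresponding to the current second threshold, when the number of occurrences of the pronunciation of the second general word in all speech samples of the speech corpus is less than the current second threshold, supplement the speech corpus with speech samples;
a training unit, configured to train an acoustic model of the at least one general word according to the speech corpus once every second threshold has been processed.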
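According to yet another aspect of this application, a computer device is provided, including a memory and a processor. The memory stores computer-readable instructions which, when executed by the processor, cause the processor to perform the steps of any of the above speech corpus training methods.
According to a further aspect of this application, a non-volatile readable storage medium storing computer-readable instructions is provided. When executed by one or more processors, the computer-readable instructions cause the one or more processors to perform the steps of any of the above speech corpus training methods.
This application provides a speech corpus training method and apparatus, a computer device, and a storage medium. A number of general words and a number of pronunciation regions are determined; a number of first thresholds are determined, where different first thresholds correspond to different general words and/or pronunciation regions, and a second threshold is determined for each general word; a speech corpus is determined in which every speech sample corresponds to one pronunciation region; speech samples are added to the speech corpus as needed so that (i) among all speech samples in the corpus that correspond to a given pronunciation region, the pronunciation of a given general word occurs no fewer times than the first threshold for that word and that region, and (ii) among all speech samples in the corpus, the pronunciation of a given general word occurs no fewer times than the second threshold for that word; an acoustic model is then trained on the speech corpus. This improves the accuracy of conversion between speech and text.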
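Brief Description of the Drawings
Fig. 1 is a flowchart of a speech corpus training method provided in one embodiment;
Fig. 2 is a flowchart of a speech corpus training method provided in another embodiment;
Fig. 3 is a schematic diagram of a speech corpus training apparatus provided in one embodiment.
Detailed Description
To make the purpose, technical solutions, and advantages of this application clearer, the application is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain this application and do not limit it. It will be understood that the terms "first", "second", and so on may be used herein to describe various elements, but the elements are not limited by these terms; the terms serve only to distinguish one element from another. Referring to Fig. 1, an embodiment of this application provides a speech corpus training method, which may include the following steps: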
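Step 101: Determine at least one pre-collected general word and at least one pre-collected pronunciation region.
Step 102: Determine at least one first threshold according to a preset threshold determination method, each first threshold corresponding to one general word and one pronunciation region, wherein the threshold determination method determines the first threshold for a general word and a pronunciation region according to how close the pronunciation of that general word in that region is to its standard Mandarin pronunciation.
Step 103: Determine a second threshold for each general word according to a predetermined usage frequency of the general words.
Step 104: Determine a preset speech corpus that includes at least one speech sample, wherein each speech sample corresponds to one pronunciation region and is pronounced with the pronunciation of that region.
Step 105: Take each first threshold in turn as the current first threshold and execute: for the first general word and first pronunciation region corresponding to the current first threshold, when the number of occurrences of the pronunciation of the first general word in all first speech samples is less than the current first threshold, supplement the speech corpus with speech samples, wherein the first speech samples are the speech samples in the speech corpus that correspond to the first pronunciation region.
Step 106: Once every first threshold has been processed, take each second threshold in turn as the current second threshold and execute: for the second general word corresponding to the current second threshold, when the number of occurrences of the pronunciation of the second general word in all speech samples of the speech corpus is less than the current second threshold, supplement the speech corpus with speech samples.
Step 107: Once every second threshold has been processed, train an acoustic model of the at least one general word according to the speech corpus.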
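The embodiment of this application provides a speech corpus training method that includes: determining a number of general words and a number of pronunciation regions; determining a number of first thresholds, where different first thresholds correspond to different general words and/or pronunciation regions, and determining a second threshold for each general word; determining a speech corpus in which every speech sample corresponds to one pronunciation region; adding speech samples to the speech corpus as needed so that (i) among all speech samples in the corpus that correspond to a given pronunciation region, the pronunciation of a given general word occurs no fewer times than the first threshold for that word and that region, and (ii) among all speech samples in the corpus, the pronunciation of a given general word occurs no fewer times than the second threshold for that word; and training an acoustic model on the speech corpus. This improves the accuracy of conversion between speech and text. To guarantee that accuracy, the speech samples in the speech corpus should be sufficiently rich in both number and variety.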
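The two-stage supplementing of steps 105 and 106 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the application: it assumes each speech sample carries a region label and a text transcript, that occurrences of a word's pronunciation can be approximated by counting the word in the transcript, and that a hypothetical `fetch_more` callback supplies additional samples. The names `SpeechSample`, `count_word`, `balance_corpus`, and `fetch_more` are all introduced here for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class SpeechSample:
    region: str      # pronunciation region label (assumed metadata)
    transcript: str  # text of what is spoken (assumed to be available)


def count_word(samples: List[SpeechSample], word: str,
               region: Optional[str] = None) -> int:
    """Count occurrences of `word`, optionally only in samples of one region."""
    return sum(s.transcript.count(word) for s in samples
               if region is None or s.region == region)


# Hypothetical callback: (word, region or None, shortfall) -> new samples.
FetchMore = Callable[[str, Optional[str], int], List[SpeechSample]]


def balance_corpus(corpus: List[SpeechSample],
                   first_thresholds: Dict[Tuple[str, str], int],
                   second_thresholds: Dict[str, int],
                   fetch_more: FetchMore) -> List[SpeechSample]:
    """Return a supplemented copy of `corpus`.

    The thresholds are met only if `fetch_more` actually delivers samples
    that cover the requested shortfall.
    """
    corpus = list(corpus)
    # Step 105: check each (word, region) pair against its first threshold.
    for (word, region), threshold in first_thresholds.items():
        shortfall = threshold - count_word(corpus, word, region)
        if shortfall > 0:
            corpus.extend(fetch_more(word, region, shortfall))
    # Step 106: only afterwards, check each word against its second threshold.
    for word, threshold in second_thresholds.items():
        shortfall = threshold - count_word(corpus, word)
        if shortfall > 0:
            corpus.extend(fetch_more(word, None, shortfall))
    return corpus
```

Running the second loop only after the first mirrors the ordering in the method: samples added for a region also count toward the corpus-wide total, so the second stage sees the already supplemented corpus.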
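Regarding step 101: for example, for a trained acoustic model to convert a Sichuan-accented utterance of "我是中国人" ("I am Chinese") into the corresponding text, the speech corpus used to train the model should include some Sichuan-accented speech samples, and those samples should contain the pronunciations of words such as "我" (I), "是" (am), and "中国人" (Chinese person). The first task is therefore to collect general words such as "我", "是", and "中国人", and to determine pronunciation regions such as "Sichuan". Example 1: suppose three general words have been collected in advance, namely 我, 是, and 中国人, and two pronunciation regions have been collected in advance, namely Beijing and Sichuan.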
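Regarding step 102: in detail, compared with standard Mandarin pronunciation, how much a region's pronunciation of a given word differs can be large or small. The first threshold for a general word and a pronunciation region can therefore be determined by how close that word's pronunciation in that region is to its standard Mandarin pronunciation. On the one hand, take the words "我" and "中国人": in a Sichuan accent there are relatively many ways of saying "我" but relatively few ways of saying "中国人", so the first threshold for the word "我" and the region "Sichuan" is usually greater than the first threshold for the word "中国人" and the region "Sichuan". That is, the speech corpus should contain more speech in which "我" is said in a Sichuan accent, and comparatively less in which "中国人" is said in a Sichuan accent.
On the other hand, take the regions "Sichuan" and "Beijing": in a Sichuan accent there are relatively many ways of saying "我", whereas in a Beijing accent there are usually relatively few, so the first threshold for the word "我" and the region "Sichuan" is usually greater than the first threshold for the word "我" and the region "Beijing". That is, the speech corpus should contain more speech in which "我" is said in a Sichuan accent, and comparatively less in which "我" is said in a Beijing accent. Thus, based on Example 1, six first thresholds can be determined: Q1 for "我" and "Sichuan", Q2 for "是" and "Sichuan", Q3 for "中国人" and "Sichuan", Q4 for "我" and "Beijing", Q5 for "是" and "Beijing", and Q6 for "中国人" and "Beijing".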
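Regarding step 103: in detail, different words are used with different frequencies. A second threshold for each general word can therefore be determined according to a predetermined usage frequency of the general words. For example, the probability of using the word "我" when speaking is usually greater than the probability of using the word "中国人", so the second threshold for "我" is usually greater than the second threshold for "中国人". That is, the speech corpus should contain more speech in which the word "我" is said, and comparatively less in which the word "中国人" is said. Thus, based on Example 1, three second thresholds can be determined: P1 for "我", P2 for "是", and P3 for "中国人".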
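Regarding step 104: to train the acoustic model, a speech corpus that satisfies all of the first and second thresholds is required. Normally a speech corpus containing a number of speech samples is set up in advance. A speech sample here may be a recorded excerpt of everyday conversation, a recording of someone reading a particular article, and so on. Since the pronunciation within a single speech sample is normally consistent, each speech sample can be regarded as corresponding to one pronunciation region, and each sample is pronounced with the pronunciation of its region. Existing speech corpora usually do not fully satisfy the constraints imposed by the first and second thresholds, so speech samples must be added on the basis of those constraints to enrich the corpus; the supplemented corpus should, of course, satisfy them. The supplementing operation can generally be divided into two stages: the first stage supplements against each first threshold, and once it is complete, the second stage supplements against each second threshold.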
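Regarding step 105: in the first stage, samples are added as needed against each first threshold. Based on Example 1, Q1 to Q6 can be taken in turn as the current first threshold and analyzed. Take Q1: since Q1 corresponds to "我" and "Sichuan", all first speech samples in the speech corpus can be identified; here the first speech samples are the speech samples in the corpus with Sichuan pronunciation. The number of times the pronunciation of "我" occurs in these Sichuan samples is then determined. If that number is less than Q1, supplementing is needed; otherwise it is not. Suppose the speech corpus currently contains the following four speech samples:
Speech sample 1: "我爱我的祖国" ("I love my motherland") spoken in a Sichuan accent; speech sample 2: "我爱我的祖国" spoken in a Beijing accent; speech sample 3: "我爱我家" ("I love my home") spoken in a Sichuan accent; speech sample 4: "我爱我家" spoken in a Beijing accent.
The corpus thus contains two speech samples with Sichuan pronunciation, so speech samples 1 and 3 are all of the first speech samples at this point, and the pronunciation of "我" occurs 4 times in them. On the same principle, Q2 to Q6 are analyzed in turn and speech samples are added as needed, so that the supplemented corpus satisfies every first threshold. Once the first stage is complete, the second stage is carried out, that is, samples are added as needed against each second threshold.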
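Regarding step 106: in the second stage, P1 to P3 are analyzed in turn. Take P1: since P1 corresponds to "我", the number of times the pronunciation of "我" occurs across all speech samples in the corpus is determined. If that number is less than P1, supplementing is needed; otherwise it is not. For example, suppose the corpus now contains the following six speech samples: speech samples 1 to 4 above, plus speech samples 5 and 6 below.
Speech sample 5: "您请坐" ("Please have a seat") spoken in a Sichuan accent; speech sample 6: "您请喝茶" ("Please have some tea") spoken in a Sichuan accent.
Thus, across all speech samples in the corpus, namely speech samples 1 to 6, the pronunciation of "我" occurs 8 times. On the same principle, P2 and P3 are analyzed in turn and speech samples are added as needed, so that the supplemented corpus satisfies every second threshold.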
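The two counts in the worked examples (4 occurrences of "我" in the Sichuan samples among samples 1 to 4, and 8 occurrences across samples 1 to 6) can be reproduced with a few lines of Python. The sketch assumes that each speech sample is represented by a (region, transcript) pair and that counting a word in the transcript stands in for counting its pronunciation in the audio; both are illustrative assumptions, as is the helper name `occurrences`.

```python
# Each speech sample is modelled as (region, transcript); the audio itself is
# not represented, which is an assumption made for this illustration.
corpus = [
    ("Sichuan", "我爱我的祖国"),  # speech sample 1
    ("Beijing", "我爱我的祖国"),  # speech sample 2
    ("Sichuan", "我爱我家"),      # speech sample 3
    ("Beijing", "我爱我家"),      # speech sample 4
    ("Sichuan", "您请坐"),        # speech sample 5
    ("Sichuan", "您请喝茶"),      # speech sample 6
]


def occurrences(samples, word, region=None):
    """Count `word` in all transcripts, or only in those of `region`."""
    return sum(text.count(word) for r, text in samples
               if region is None or r == region)


# Step 105 example: only samples 1 to 4 exist, restricted to Sichuan.
sichuan_count = occurrences(corpus[:4], "我", region="Sichuan")
# Step 106 example: all six samples, no region restriction.
total_count = occurrences(corpus, "我")
print(sichuan_count, total_count)  # 4 8
```

Each count would then be compared with Q1 or P1 respectively; supplementing is needed only when the count is smaller than the threshold.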
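Regarding step 107: once every first threshold and every second threshold has been processed, the speech samples in the latest speech corpus can be regarded as sufficiently rich in number and variety to guarantee the accuracy of conversion between speech and text. An acoustic model for the preset general words can then be trained on the latest speech corpus. In this embodiment, the first and second thresholds can be set to account for differences between pronunciation regions in accent distinctiveness and accent diversity, and for differences in how often general words are used; different first thresholds correspond to different general words and/or pronunciation regions.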
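In one embodiment of this application, determining at least one first threshold according to the preset threshold determination method comprises: setting a first standard value; determining at least one weight, wherein each weight corresponds to one general word and one pronunciation region and takes a value in (0, 1], and, for a target weight corresponding to a target general word and a target pronunciation region, the closer the pronunciation of the target general word in the target pronunciation region is to its standard Mandarin pronunciation, the smaller the target weight; and calculating the first threshold corresponding to each weight according to Formula 1. Formula 1 is: $Y_i = k_i \times X_1$, where $Y_i$ is the first threshold corresponding to the i-th of the at least one weight, $k_i$ is the i-th weight, and $X_1$ is the first standard value. In detail, the standard value may be an empirical value, and can usually be the maximum number of speech samples to be supplemented. The closer the pronunciation is to standard Mandarin, the smaller the weight and the smaller the corresponding amount to supplement; the further it is from standard Mandarin, the larger the weight and the larger the corresponding amount. For example, "我" spoken in a Sichuan accent differs considerably from the standard Mandarin pronunciation, so the weight for "我" and "Sichuan" may be 0.9; if the first standard value is 10000, Q1 above equals 9000. Likewise, "中国人" spoken in a Sichuan accent differs little from the standard Mandarin pronunciation, so the weight for "中国人" and "Sichuan" may be 0.3; with the first standard value at 10000, Q3 above equals 3000.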
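Formula 1 is a single multiplication per (word, region) pair, which the following Python sketch applies to the two worked values. The function name `first_thresholds`, the dictionary layout keyed by (word, region), and the rounding to an integer count are assumptions made here for illustration; the weights 0.9 and 0.3 and the standard value 10000 are the ones used in the example.

```python
def first_thresholds(weights, first_standard_value):
    """Apply Formula 1, Y_i = k_i * X_1, to every (word, region) weight."""
    thresholds = {}
    for key, k in weights.items():
        # The weight must lie in the half-open interval (0, 1].
        if not 0 < k <= 1:
            raise ValueError(f"weight for {key} must lie in (0, 1], got {k}")
        # Rounding to a whole count is an assumption of this sketch.
        thresholds[key] = round(k * first_standard_value)
    return thresholds


weights = {("我", "Sichuan"): 0.9, ("中国人", "Sichuan"): 0.3}
q = first_thresholds(weights, 10000)
print(q[("我", "Sichuan")], q[("中国人", "Sichuan")])  # 9000 3000
```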
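As can be seen, in this embodiment the specific value of each first threshold can be set according to differences between pronunciation regions in accent distinctiveness and accent diversity, so that each kind of speech sample is supplemented as needed and the data-processing load is not increased by adding useless or inefficient speech samples. In this embodiment, the second thresholds can be set to account for differences in how often general words are used.
In one embodiment of this application, determining a second threshold for each general word according to the predetermined usage frequency of the general words comprises: setting a second standard value; determining a preset text collection that contains every general word; counting the number of occurrences of each general word in the text collection; and calculating the second threshold for each general word according to Formula 2. Formula 2 is: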
Figure PCTCN2019117718-appb-000001
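where $y_j$ is the second threshold corresponding to the j-th of the at least one general word, $X_2$ is the second standard value, $m$ is the number of the at least one general word, and $n_j$ is the number of occurrences of the j-th general word in the text collection.
In detail, the text in the text collection may be an article, a written news report, a passage of text produced by speech recognition, and so on. Suppose the text collection contains 10000 words in total, in which the word "我" occurs 200 times and the word "中国人" occurs 5 times; then, if the second standard value is 50000, P1 above equals 1000 and P3 above equals 25. As can be seen, in this embodiment the specific value of each second threshold can be set according to differences in word usage frequency, so that each kind of speech sample is supplemented as needed and the data-processing load is not increased by adding useless or inefficient speech samples.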
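Formula 2 survives in this text only as an image reference, so its exact form is not shown. A form that is consistent with the symbol definitions and with the worked numbers (200 and 5 occurrences out of 10000 words, standard value 50000, thresholds 1000 and 25) is $y_j = X_2 \times n_j / \sum_{l=1}^{m} n_l$, that is, the standard value scaled by the word's relative frequency. The Python sketch below uses that assumed form; the function name `second_thresholds`, the optional `total` argument, and the rounding are likewise assumptions for illustration.

```python
def second_thresholds(counts, second_standard_value, total=None):
    """Assumed form of Formula 2: y_j = X_2 * n_j / N.

    N defaults to the sum of all n_j; `total` lets the caller pass the size
    of the text collection directly, as in the worked example.
    """
    n_total = sum(counts.values()) if total is None else total
    if n_total <= 0:
        raise ValueError("the text collection must contain at least one word")
    # Rounding to a whole count is an assumption of this sketch.
    return {word: round(second_standard_value * n / n_total)
            for word, n in counts.items()}


p = second_thresholds({"我": 200, "中国人": 5}, 50000, total=10000)
print(p["我"], p["中国人"])  # 1000 25
```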
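In one embodiment of this application, the at least one general word includes some or all of the general characters in a general character dictionary, and/or some or all of the general words in a general word dictionary. In detail, collecting general words from general character and word dictionaries ensures that the collected general words are practical, and in turn that the trained acoustic model is practical.
In one embodiment of this application, each general word and each speech sample relates to a preset technical field, so that the acoustic model is an acoustic model for that preset technical field. For example, the specific field may be medicine, competitive gaming, and so on. In this embodiment, general words can be collected specifically for a particular field, so that a targeted acoustic model can be trained. Compared with an acoustic model suited to the ordinary, or general-public, domain, an acoustic model for a specific field yields better accuracy when converting between speech and text in that field.
In step 105 above, each first threshold is checked first: for all speech samples in the speech corpus that correspond to the first pronunciation region, it is determined whether the number of occurrences of the pronunciation of the first general word is less than the first threshold for the first general word and the first pronunciation region. If so, speech samples must be added to the speech corpus. If not, then when an acoustic model is trained on the existing speech corpus, conversions between speech and text that involve the first general word pronounced with the first pronunciation region's pronunciation will usually already be accurate, so no speech samples need to be added. Through this supplementing operation, an acoustic model trained on the supplemented speech corpus can accurately convert between speech and text for the same general word pronounced in different pronunciation regions.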
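In detail, regarding what is supplemented: normally, every speech sample added is one that contains the first general word and corresponds to the first pronunciation region. That is, at this point only speech samples containing the first general word pronounced with the first pronunciation region's pronunciation are added, and speech samples containing other general words pronounced with other regions' pronunciations are not. Because the added samples are targeted in content, when the first-threshold check is repeated on the supplemented speech corpus the result can be "no", and the computation required by subsequent operations is kept as small as possible.
In detail, regarding how much is supplemented: in this embodiment, in addition to the content-targeted supplementing described above, the number of samples added should be as small as possible, provided that repeating the first-threshold check on the supplemented speech corpus gives the result "no". This keeps the computation of the other subsequent checks as small as possible. That is, the number added is the minimum number that makes the following condition hold: among all speech samples in the speech corpus that correspond to the first pronunciation region, the number of occurrences of the first general word is not less than the first threshold for the first general word and the first pronunciation region.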
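One way to keep the number of added samples minimal, given a pool of candidate samples, is to take the candidates that contain the target word most often first: for reaching a required total with the fewest items, picking the largest contributions first is optimal. The Python sketch below illustrates this. It is only one possible reading of the minimality requirement; the candidate pool, the transcript-based counting, and the function name `pick_minimal_supplement` are assumptions, and all candidates are assumed to belong to the first pronunciation region already.

```python
def pick_minimal_supplement(candidates, word, shortfall):
    """Pick the fewest candidate transcripts whose combined count of `word`
    reaches `shortfall`, taking the richest candidates first."""
    if shortfall <= 0:
        return []  # threshold already met, nothing to add
    ranked = sorted(candidates, key=lambda text: text.count(word), reverse=True)
    picked, covered = [], 0
    for text in ranked:
        hits = text.count(word)
        if hits == 0:
            break  # the remaining candidates do not contain the word at all
        picked.append(text)
        covered += hits
        if covered >= shortfall:
            return picked
    raise ValueError("the candidates cannot cover the shortfall")


pool = ["我爱我家", "我", "您请坐", "我爱我的祖国我"]
chosen = pick_minimal_supplement(pool, "我", shortfall=4)
print(len(chosen))  # 2
```

Here the shortfall would be the first threshold minus the current count for the (word, region) pair.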
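In step 106 above, each second threshold is then checked: for all speech samples in the speech corpus, it is determined whether the number of occurrences of the second general word is less than the second threshold for the second general word. If so, speech samples must be added to the speech corpus. If not, then when an acoustic model is trained on the existing speech corpus, conversions between speech and text that involve the second general word will usually already be accurate, so no speech samples need to be added. Through this supplementing operation, an acoustic model trained on the supplemented speech corpus can accurately convert between speech and text for different general words.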
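In detail, regarding what is supplemented: in one embodiment of this application, among the speech samples added, the ratio between the numbers of samples for different pronunciation regions lies within a preset ratio range. For example, if the Sichuan accent is heavier than the Northeastern accent, the number of added samples for Sichuan is preferably greater than the number added for the Northeast. An acoustic model trained on the speech corpus can then convert better with respect to pronunciation regions.
In detail, regarding how much is supplemented: in this embodiment, in addition to the content-targeted supplementing described above, the number of samples added should be as small as possible, provided that repeating the second-threshold check on the supplemented speech corpus gives the result "no". This keeps the computation of the other subsequent checks as small as possible. That is, the number added is the minimum number that makes the following condition hold: among all speech samples in the speech corpus, the number of occurrences of the second general word is not less than the second threshold for the second general word.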
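The per-region ratio constraint can be illustrated by splitting a supplement budget across regions in proportion to region weights and then checking the resulting ratio against a preset range. The text does not say how the split is computed, so the proportional split with largest-remainder rounding below is an assumption, as are the function names, the weights 3 and 2, and the ratio bounds.

```python
def allocate_by_region(total, region_weights):
    """Split `total` supplementary samples across regions in proportion to
    positive `region_weights`, using largest-remainder rounding."""
    weight_sum = sum(region_weights.values())
    exact = {r: total * w / weight_sum for r, w in region_weights.items()}
    alloc = {r: int(x) for r, x in exact.items()}  # floor of each share
    leftover = total - sum(alloc.values())
    # Hand the remaining samples to the regions with the largest remainders.
    by_remainder = sorted(exact, key=lambda r: exact[r] - alloc[r], reverse=True)
    for r in by_remainder[:leftover]:
        alloc[r] += 1
    return alloc


def ratio_in_range(alloc, region_a, region_b, low, high):
    """Check that count(region_a) / count(region_b) lies in [low, high]."""
    return low <= alloc[region_a] / alloc[region_b] <= high


alloc = allocate_by_region(10, {"Sichuan": 3, "Northeast": 2})
print(alloc, ratio_in_range(alloc, "Sichuan", "Northeast", 1.2, 2.0))
```

With these illustrative weights, more samples go to Sichuan than to the Northeast, in line with the example of the heavier accent receiving the larger share.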
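In one embodiment of this application, training the acoustic model of the at least one general word according to the speech corpus comprises: determining an initial acoustic model; obtaining at least two sub-speech corpora, the speech corpus containing every speech sample of every sub-speech corpus; for each sub-speech corpus, optimizing the initial acoustic model on the current sub-speech corpus to obtain an optimized acoustic model; fusing all of the optimized acoustic models to obtain a target acoustic model that satisfies a preset convergence condition; and determining the target acoustic model to be the acoustic model of the at least one general word.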
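The split, optimize, and fuse structure of this training embodiment can be sketched as below. The text does not specify the optimizer or the fusion rule, so this Python sketch makes several assumptions that should not be read as the method of the application: a model is a plain list of parameters, `optimize` is a caller-supplied hypothetical function, fusion is an element-wise average, sub-corpora are cut to equal size (as the Fig. 2 embodiment requires), and the convergence check is left out.

```python
def split_equal(corpus, size):
    """Cut `corpus` into consecutive sub-corpora of exactly `size` samples;
    a trailing remainder smaller than `size` is left out."""
    subs = [corpus[i:i + size] for i in range(0, len(corpus) - size + 1, size)]
    if len(subs) < 2:
        raise ValueError("need at least two sub-corpora")
    return subs


def fuse_by_averaging(models):
    """Element-wise mean of parameter lists (one assumed fusion rule)."""
    return [sum(values) / len(models) for values in zip(*models)]


def train(initial_model, corpus, size, optimize):
    """Optimize the same initial model on each sub-corpus, then fuse."""
    optimized = [optimize(initial_model, sub)
                 for sub in split_equal(corpus, size)]
    return fuse_by_averaging(optimized)


def toy_optimize(model, sub):
    # Toy stand-in for real training: shift every parameter by the mean
    # of the numeric "samples" in the sub-corpus.
    shift = sum(sub) / len(sub)
    return [p + shift for p in model]


print(train([0.0, 10.0], [1, 2, 3, 4], size=2, optimize=toy_optimize))  # [2.5, 12.5]
```

Because every sub-corpus starts from the same initial model, the per-sub-corpus optimizations are independent of one another and could be run in parallel before the fusion step.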
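Referring to Fig. 2, an embodiment of this application provides another speech corpus training method, which may include the following steps:
Step 201: Collect at least one general word and determine at least one pronunciation region. In detail, the at least one general word includes some or all of the general characters in a general character dictionary, and/or some or all of the general words in a general word dictionary.
Step 202: Set a first standard value and a second standard value.
Step 203: Determine at least one weight, wherein each weight corresponds to one general word and one pronunciation region, and different weights correspond to different general words and/or pronunciation regions. In detail, the weight takes a value in (0, 1], and, for a target weight corresponding to a target general word and a target pronunciation region, the closer the pronunciation of the target general word in the target pronunciation region is to its standard Mandarin pronunciation, the smaller the target weight.
Step 204: Calculate the first threshold corresponding to each weight. In detail, each first threshold corresponds to one general word and one pronunciation region, and different first thresholds correspond to different general words and/or pronunciation regions. The first thresholds can be calculated according to Formula 1.
Step 205: Determine a text collection that contains every general word.
Step 206: Count the number of occurrences of each general word in the text collection.
Step 207: Calculate the second threshold for each general word. In detail, the second thresholds can be calculated according to Formula 2.
Step 208: Determine a speech corpus that includes at least one speech sample, each speech sample corresponding to one pronunciation region.
Step 209: Take each first threshold in turn as the current first threshold and execute: for the first general word and first pronunciation region corresponding to the current first threshold, when the number of occurrences of the pronunciation of the first general word in all first speech samples is less than the current first threshold, supplement the speech corpus with speech samples, wherein the first speech samples are the speech samples in the speech corpus that correspond to the first pronunciation region.
Step 210: Once every first threshold has been processed, take each second threshold in turn as the current second threshold and execute: for the second general word corresponding to the current second threshold, when the number of occurrences of the pronunciation of the second general word in all speech samples of the speech corpus is less than the current second threshold, supplement the speech corpus with speech samples.
Step 211: Once every general word has been processed, determine an initial acoustic model and obtain at least two sub-speech corpora, where the speech corpus contains every speech sample of every sub-speech corpus. Any two sub-speech corpora contain the same total number of speech samples, and that total lies within a preset numerical range.
Step 212: For each sub-speech corpus, execute: optimize the initial acoustic model on the current sub-speech corpus to obtain an optimized acoustic model.
Step 213: Fuse all of the optimized acoustic models to obtain a target acoustic model that satisfies a preset convergence condition.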
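Step 214: Determine the target acoustic model to be the acoustic model of the at least one general word.
Referring to Fig. 3, an embodiment of this application provides a speech corpus training apparatus, which may include: a first determining unit 301, configured to determine at least one pre-collected general word and at least one pre-collected pronunciation region; a second determining unit 302, configured to determine at least one first threshold according to a preset threshold determination method, each first threshold corresponding to one general word and one pronunciation region, wherein the threshold determination method determines the first threshold for a general word and a pronunciation region according to how close the pronunciation of that general word in that region is to its standard Mandarin pronunciation, and to determine a second threshold for each general word according to a predetermined usage frequency of the general words; a third determining unit 303, configured to determine a preset speech corpus that includes at least one speech sample, wherein each speech sample corresponds to one pronunciation region and is pronounced with the pronunciation of that region;
a processing unit 304, configured to take each first threshold in turn as the current first threshold and execute: for the first general word and first pronunciation region corresponding to the current first threshold, when the number of occurrences of the pronunciation of the first general word in all first speech samples is less than the current first threshold, supplement the speech corpus with speech samples, wherein the first speech samples are the speech samples in the speech corpus that correspond to the first pronunciation region; and, once every first threshold has been processed, to take each second threshold in turn as the current second threshold and execute: for the second general word corresponding to the current second threshold, when the number of occurrences of the pronunciation of the second general word in all speech samples of the speech corpus is less than the current second threshold, supplement the speech corpus with speech samples; and a training unit 305, configured to train an acoustic model of the at least one general word according to the speech corpus once every second threshold has been processed.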
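In one embodiment of this application, the second determining unit 302 is configured to set a first standard value; determine at least one weight, wherein each weight corresponds to one general word and one pronunciation region and takes a value in (0, 1], and, for a target weight corresponding to a target general word and a target pronunciation region, the closer the pronunciation of the target general word in the target pronunciation region is to its standard Mandarin pronunciation, the smaller the target weight; and calculate the first threshold corresponding to each weight according to Formula 1 above. In one embodiment of this application, the second determining unit 302 is configured to set a second standard value; determine a preset text collection that contains every general word; count the number of occurrences of each general word in the text collection; and calculate the second threshold for each general word according to Formula 2 above. In one embodiment of this application, the training unit 305 is configured to determine an initial acoustic model; obtain at least two sub-speech corpora, the speech corpus containing every speech sample of every sub-speech corpus; for each sub-speech corpus, optimize the initial acoustic model on the current sub-speech corpus to obtain an optimized acoustic model; fuse all of the optimized acoustic models to obtain a target acoustic model that satisfies a preset convergence condition; and determine the target acoustic model to be the acoustic model of the at least one general word.
The information exchange between the units of the above apparatus, their execution processes, and so on are based on the same concept as the method embodiments of this application; for details, refer to the description of the method embodiments, which is not repeated here.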
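An embodiment of this application also provides a computer device, including a memory and a processor. The memory stores computer-readable instructions which, when executed by the processor, cause the processor to perform the steps of any of the above speech corpus training methods. An embodiment of this application also provides a non-volatile readable storage medium storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of any of the above speech corpus training methods.
In summary, the speech corpus training method, apparatus, computer device, and storage medium provided by the embodiments of this application make it possible to judge model performance in advance and so avoid repeated model training, and offer benefits such as good recognition of phrases and frequently used characters, fast transfer learning for specific application scenarios, and convenient evaluation of how well the model adapts to dialects.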
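Those of ordinary skill in the art will understand that all or part of the processes in the above method embodiments can be carried out by a computer program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, may include the processes of the method embodiments above. The storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disc, or a read-only memory (ROM), or a random access memory (RAM), and so on. The technical features of the above embodiments can be combined arbitrarily. For brevity, not every possible combination of these technical features is described; however, any combination that involves no contradiction should be regarded as within the scope of this specification. The above embodiments express only several implementations of this application, and although they are described specifically and in detail, they should not be understood as limiting the scope of the patent. It should be noted that those of ordinary skill in the art can make further variations and improvements without departing from the concept of this application, and these all fall within its scope of protection. The scope of protection of this patent shall therefore be determined by the appended claims.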

Claims (16)

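  1. A speech corpus training method, comprising:
    determining at least one pre-collected general word and at least one pre-collected pronunciation region;
    determining at least one first threshold according to a preset threshold determination method, each first threshold corresponding to one general word and one pronunciation region, wherein the threshold determination method determines the first threshold for a general word and a pronunciation region according to how close the pronunciation of that general word in that region is to its standard Mandarin pronunciation;
    determining a second threshold for each general word according to a predetermined usage frequency of the general words;
    determining a preset speech corpus that includes at least one speech sample, wherein each speech sample corresponds to one pronunciation region and is pronounced with the pronunciation of that region;
    taking each first threshold in turn as the current first threshold and executing: for the first general word and first pronunciation region corresponding to the current first threshold, when the number of occurrences of the pronunciation of the first general word in all first speech samples is less than the current first threshold, supplementing the speech corpus with speech samples, wherein the first speech samples are the speech samples in the speech corpus that correspond to the first pronunciation region;
    once every first threshold has been processed, taking each second threshold in turn as the current second threshold and executing: for the second general word corresponding to the current second threshold, when the number of occurrences of the pronunciation of the second general word in all speech samples of the speech corpus is less than the current second threshold, supplementing the speech corpus with speech samples;
    once every second threshold has been processed, training an acoustic model of the at least one general word according to the speech corpus.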
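  2. The speech corpus training method of claim 1, wherein determining at least one first threshold according to the preset threshold determination method comprises: setting a first standard value; determining at least one weight, wherein each weight corresponds to one general word and one pronunciation region and takes a value in (0, 1], and, for a target weight corresponding to a target general word and a target pronunciation region, the closer the pronunciation of the target general word in the target pronunciation region is to its standard Mandarin pronunciation, the smaller the target weight; and calculating the first threshold corresponding to each weight according to Formula 1; Formula 1 being: $Y_i = k_i \times X_1$, where $Y_i$ is the first threshold corresponding to the i-th of the at least one weight, $k_i$ is the i-th weight, and $X_1$ is the first standard value.
  3. The speech corpus training method of claim 1, wherein determining a second threshold for each general word according to the predetermined usage frequency of the general words comprises: setting a second standard value; determining a preset text collection that contains every general word; counting the number of occurrences of each general word in the text collection; and calculating the second threshold for each general word according to Formula 2; Formula 2 being: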
    Figure PCTCN2019117718-appb-100001
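    where $y_j$ is the second threshold corresponding to the j-th of the at least one general word, $X_2$ is the second standard value, $m$ is the number of the at least one general word, and $n_j$ is the number of occurrences of the j-th general word in the text collection.
  4. The speech corpus training method of claim 1, wherein training the acoustic model of the at least one general word according to the speech corpus comprises: determining an initial acoustic model; obtaining at least two sub-speech corpora, the speech corpus containing every speech sample of every sub-speech corpus; for each sub-speech corpus, optimizing the initial acoustic model on the current sub-speech corpus to obtain an optimized acoustic model; fusing all of the optimized acoustic models to obtain a target acoustic model that satisfies a preset convergence condition; and determining the target acoustic model to be the acoustic model of the at least one general word.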
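  5. A speech corpus training apparatus, comprising:
    a first determining unit, configured to determine at least one pre-collected general word and at least one pre-collected pronunciation region;
    a second determining unit, configured to determine at least one first threshold according to a preset threshold determination method, each first threshold corresponding to one general word and one pronunciation region, wherein the threshold determination method determines the first threshold for a general word and a pronunciation region according to how close the pronunciation of that general word in that region is to its standard Mandarin pronunciation; and to determine a second threshold for each general word according to a predetermined usage frequency of the general words;
    a third determining unit, configured to determine a preset speech corpus that includes at least one speech sample, wherein each speech sample corresponds to one pronunciation region and is pronounced with the pronunciation of that region;
    a processing unit, configured to take each first threshold in turn as the current first threshold and execute: for the first general word and first pronunciation region corresponding to the current first threshold, when the number of occurrences of the pronunciation of the first general word in all first speech samples is less than the current first threshold, supplement the speech corpus with speech samples, wherein the first speech samples are the speech samples in the speech corpus that correspond to the first pronunciation region; and, once every first threshold has been processed, to take each second threshold in turn as the current second threshold and execute: for the second general word corresponding to the current second threshold, when the number of occurrences of the pronunciation of the second general word in all speech samples of the speech corpus is less than the current second threshold, supplement the speech corpus with speech samples;
    a training unit, configured to train an acoustic model of the at least one general word according to the speech corpus once every second threshold has been processed.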
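  6. The speech corpus training apparatus of claim 5, wherein the second determining unit is configured to set a first standard value; determine at least one weight, wherein each weight corresponds to one general word and one pronunciation region and takes a value in (0, 1], and, for a target weight corresponding to a target general word and a target pronunciation region, the closer the pronunciation of the target general word in the target pronunciation region is to its standard Mandarin pronunciation, the smaller the target weight; and calculate the first threshold corresponding to each weight according to Formula 1; Formula 1 being: $Y_i = k_i \times X_1$, where $Y_i$ is the first threshold corresponding to the i-th of the at least one weight, $k_i$ is the i-th weight, and $X_1$ is the first standard value.
  7. The speech corpus training apparatus of claim 5, wherein the second determining unit is configured to set a second standard value; determine a preset text collection that contains every general word; count the number of occurrences of each general word in the text collection; and calculate the second threshold for each general word according to Formula 2; Formula 2 being: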
    Figure PCTCN2019117718-appb-100002
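    where $y_j$ is the second threshold corresponding to the j-th of the at least one general word, $X_2$ is the second standard value, $m$ is the number of the at least one general word, and $n_j$ is the number of occurrences of the j-th general word in the text collection.
  8. The speech corpus training apparatus of claim 5, wherein
    the training unit is configured to determine an initial acoustic model; obtain at least two sub-speech corpora, the speech corpus containing every speech sample of every sub-speech corpus; for each sub-speech corpus, optimize the initial acoustic model on the current sub-speech corpus to obtain an optimized acoustic model; fuse all of the optimized acoustic models to obtain a target acoustic model that satisfies a preset convergence condition; and determine the target acoustic model to be the acoustic model of the at least one general word.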
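  9. A computer device, comprising a memory and a processor, the memory storing computer-readable instructions which, when executed by the processor, cause the processor to perform a speech corpus training method comprising: determining at least one pre-collected general word and at least one pre-collected pronunciation region; determining at least one first threshold according to a preset threshold determination method, each first threshold corresponding to one general word and one pronunciation region, wherein the threshold determination method determines the first threshold for a general word and a pronunciation region according to how close the pronunciation of that general word in that region is to its standard Mandarin pronunciation; determining a second threshold for each general word according to a predetermined usage frequency of the general words; determining a preset speech corpus that includes at least one speech sample, wherein each speech sample corresponds to one pronunciation region and is pronounced with the pronunciation of that region; taking each first threshold in turn as the current first threshold and executing: for the first general word and first pronunciation region corresponding to the current first threshold, when the number of occurrences of the pronunciation of the first general word in all first speech samples is less than the current first threshold, supplementing the speech corpus with speech samples, wherein the first speech samples are the speech samples in the speech corpus that correspond to the first pronunciation region; once every first threshold has been processed, taking each second threshold in turn as the current second threshold and executing: for the second general word corresponding to the current second threshold, when the number of occurrences of the pronunciation of the second general word in all speech samples of the speech corpus is less than the current second threshold, supplementing the speech corpus with speech samples; and, once every second threshold has been processed, training an acoustic model of the at least one general word according to the speech corpus.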
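  10. The computer device of claim 9, wherein the computer-readable instructions, when executed by the processor, cause the processor to perform the determining of at least one first threshold according to the preset threshold determination method by: setting a first standard value; determining at least one weight, wherein each weight corresponds to one general word and one pronunciation region and takes a value in (0, 1], and, for a target weight corresponding to a target general word and a target pronunciation region, the closer the pronunciation of the target general word in the target pronunciation region is to its standard Mandarin pronunciation, the smaller the target weight; and calculating the first threshold corresponding to each weight according to Formula 1; Formula 1 being: $Y_i = k_i \times X_1$, where $Y_i$ is the first threshold corresponding to the i-th of the at least one weight, $k_i$ is the i-th weight, and $X_1$ is the first standard value.
  11. The computer device of claim 9, wherein the computer-readable instructions, when executed by the processor, cause the processor to perform the determining of a second threshold for each general word according to the predetermined usage frequency of the general words by: setting a second standard value; determining a preset text collection that contains every general word; counting the number of occurrences of each general word in the text collection; and calculating the second threshold for each general word according to Formula 2; Formula 2 being: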
    Figure PCTCN2019117718-appb-100003
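    where $y_j$ is the second threshold corresponding to the j-th of the at least one general word, $X_2$ is the second standard value, $m$ is the number of the at least one general word, and $n_j$ is the number of occurrences of the j-th general word in the text collection.
  12. The computer device of claim 9, wherein the computer-readable instructions, when executed by the processor, cause the processor to perform the training of the acoustic model of the at least one general word according to the speech corpus by: determining an initial acoustic model; obtaining at least two sub-speech corpora, the speech corpus containing every speech sample of every sub-speech corpus; for each sub-speech corpus, optimizing the initial acoustic model on the current sub-speech corpus to obtain an optimized acoustic model; fusing all of the optimized acoustic models to obtain a target acoustic model that satisfies a preset convergence condition; and determining the target acoustic model to be the acoustic model of the at least one general word.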
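  13. A non-volatile readable storage medium storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform a speech corpus training method comprising: determining at least one pre-collected general word and at least one pre-collected pronunciation region; determining at least one first threshold according to a preset threshold determination method, each first threshold corresponding to one general word and one pronunciation region, wherein the threshold determination method determines the first threshold for a general word and a pronunciation region according to how close the pronunciation of that general word in that region is to its standard Mandarin pronunciation; determining a second threshold for each general word according to a predetermined usage frequency of the general words; determining a preset speech corpus that includes at least one speech sample, wherein each speech sample corresponds to one pronunciation region and is pronounced with the pronunciation of that region; taking each first threshold in turn as the current first threshold and executing: for the first general word and first pronunciation region corresponding to the current first threshold, when the number of occurrences of the pronunciation of the first general word in all first speech samples is less than the current first threshold, supplementing the speech corpus with speech samples, wherein the first speech samples are the speech samples in the speech corpus that correspond to the first pronunciation region; once every first threshold has been processed, taking each second threshold in turn as the current second threshold and executing: for the second general word corresponding to the current second threshold, when the number of occurrences of the pronunciation of the second general word in all speech samples of the speech corpus is less than the current second threshold, supplementing the speech corpus with speech samples; and, once every second threshold has been processed, training an acoustic model of the at least one general word according to the speech corpus.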
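  14. The storage medium of claim 13, wherein the computer-readable instructions, when executed by one or more processors, cause the one or more processors to perform the determining of at least one first threshold according to the preset threshold determination method by: setting a first standard value; determining at least one weight, wherein each weight corresponds to one general word and one pronunciation region and takes a value in (0, 1], and, for a target weight corresponding to a target general word and a target pronunciation region, the closer the pronunciation of the target general word in the target pronunciation region is to its standard Mandarin pronunciation, the smaller the target weight; and calculating the first threshold corresponding to each weight according to Formula 1; Formula 1 being: $Y_i = k_i \times X_1$, where $Y_i$ is the first threshold corresponding to the i-th of the at least one weight, $k_i$ is the i-th weight, and $X_1$ is the first standard value.
  15. The storage medium of claim 13, wherein the computer-readable instructions, when executed by one or more processors, cause the one or more processors to perform the determining of a second threshold for each general word according to the predetermined usage frequency of the general words by: setting a second standard value; determining a preset text collection that contains every general word; counting the number of occurrences of each general word in the text collection; and calculating the second threshold for each general word according to Formula 2; Formula 2 being: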
    Figure PCTCN2019117718-appb-100004
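    where $y_j$ is the second threshold corresponding to the j-th of the at least one general word, $X_2$ is the second standard value, $m$ is the number of the at least one general word, and $n_j$ is the number of occurrences of the j-th general word in the text collection.
  16. The storage medium of claim 13, wherein the computer-readable instructions, when executed by one or more processors, cause the one or more processors to perform the training of the acoustic model of the at least one general word according to the speech corpus by: determining an initial acoustic model; obtaining at least two sub-speech corpora, the speech corpus containing every speech sample of every sub-speech corpus; for each sub-speech corpus, optimizing the initial acoustic model on the current sub-speech corpus to obtain an optimized acoustic model; fusing all of the optimized acoustic models to obtain a target acoustic model that satisfies a preset convergence condition; and determining the target acoustic model to be the acoustic model of the at least one general word.
PCT/CN2019/117718 2019-04-19 2019-11-12 Speech corpus training method and apparatus, computer device, and storage medium WO2020211350A1 (zh)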

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910320221.XA CN110223674B (zh) 2019-04-19 2019-04-19 Speech corpus training method and apparatus, computer device, and storage medium
CN201910320221.X 2019-04-19

Publications (1)

Publication Number Publication Date
WO2020211350A1 true WO2020211350A1 (zh) 2020-10-22

Family

ID=67819892

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/117718 WO2020211350A1 (zh) 2019-04-19 2019-11-12 Speech corpus training method and apparatus, computer device, and storage medium

Country Status (2)

Country Link
CN (1) CN110223674B (zh)
WO (1) WO2020211350A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
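CN110223674B (zh) * 2019-04-19 2023-05-26 平安科技(深圳)有限公司 Speech corpus training method and apparatus, computer device, and storage medium
CN111209363B (zh) * 2019-12-25 2024-02-09 华为技术有限公司 Corpus data processing method, apparatus, server, and storage medium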

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101593518A (zh) * 2008-05-28 2009-12-02 中国科学院自动化研究所 Method for balancing real-scenario corpora and finite-state-network corpora
CN105760361A (zh) * 2016-01-26 2016-07-13 北京云知声信息技术有限公司 A language model building method and apparatus
US20160336006A1 (en) * 2015-05-13 2016-11-17 Microsoft Technology Licensing, Llc Discriminative data selection for language modeling
CN106251859A (zh) * 2016-07-22 2016-12-21 百度在线网络技术(北京)有限公司 Speech recognition processing method and apparatus
WO2018153213A1 (zh) * 2017-02-24 2018-08-30 芋头科技(杭州)有限公司 A multilingual mixed speech recognition method
CN110223674A (zh) * 2019-04-19 2019-09-10 平安科技(深圳)有限公司 Speech corpus training method and apparatus, computer device, and storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107204184B (zh) * 2017-05-10 2018-08-03 平安科技(深圳)有限公司 Speech recognition method and system
CN107705787A (zh) * 2017-09-25 2018-02-16 北京捷通华声科技股份有限公司 A speech recognition method and apparatus
CN109213996A (zh) * 2018-08-08 2019-01-15 厦门快商通信息技术有限公司 A corpus training method and system

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101593518A (zh) * 2008-05-28 2009-12-02 中国科学院自动化研究所 Method for balancing real-scenario corpora and finite-state-network corpora
US20160336006A1 (en) * 2015-05-13 2016-11-17 Microsoft Technology Licensing, Llc Discriminative data selection for language modeling
CN105760361A (zh) * 2016-01-26 2016-07-13 北京云知声信息技术有限公司 A language model building method and apparatus
CN106251859A (zh) * 2016-07-22 2016-12-21 百度在线网络技术(北京)有限公司 Speech recognition processing method and apparatus
WO2018153213A1 (zh) * 2017-02-24 2018-08-30 芋头科技(杭州)有限公司 A multilingual mixed speech recognition method
CN110223674A (zh) * 2019-04-19 2019-09-10 平安科技(深圳)有限公司 Speech corpus training method and apparatus, computer device, and storage medium

Also Published As

Publication number Publication date
CN110223674A (zh) 2019-09-10
CN110223674B (zh) 2023-05-26

Similar Documents

Publication Publication Date Title
WO2021104099A1 (zh) A context-aware multimodal depression detection method and system
US9672817B2 (en) Method and apparatus for optimizing a speech recognition result
US9711139B2 (en) Method for building language model, speech recognition method and electronic apparatus
WO2017133165A1 (zh) Method, apparatus, device, and computer storage medium for automatic satisfaction evaluation
US8972260B2 (en) Speech recognition using multiple language models
US9613621B2 (en) Speech recognition method and electronic apparatus
WO2018153213A1 (zh) A multilingual mixed speech recognition method
TWI536364B (zh) Automatic speech recognition method and system
US9190054B1 (en) Natural language refinement of voice and text entry
US11380300B2 (en) Automatically generating speech markup language tags for text
US20150112674A1 (en) Method for building acoustic model, speech recognition method and electronic apparatus
WO2017084334A1 (zh) Language identification method, apparatus, device, and computer storage medium
CN110517664A (zh) Multi-dialect recognition method, apparatus, device, and readable storage medium
CN103488623A (zh) Multilingual text data classification processing method
JP5932869B2 (ja) Unsupervised learning method, learning device, and learning program for N-gram language models
WO2018192186A1 (zh) Speech recognition method and apparatus
Mustafa et al. Exploring the influence of general and specific factors on the recognition accuracy of an ASR system for dysarthric speaker
Glasser Automatic speech recognition services: Deaf and hard-of-hearing usability
WO2021063101A1 (zh) Artificial-intelligence-based speech breakpoint detection method, apparatus, and device
WO2020211350A1 (zh) Speech corpus training method and apparatus, computer device, and storage medium
CN103336803B (zh) A computer generation method for Spring Festival couplets with embedded names
WO2015043071A1 (zh) A translation checking method and system
US10867525B1 (en) Systems and methods for generating recitation items
TWI659411B (zh) A multilingual mixed speech recognition method
CN108009157B (zh) A sentence classification method and apparatus

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19925469

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19925469

Country of ref document: EP

Kind code of ref document: A1