WO2023087767A1 - Training data generation method and device suitable for speech recognition models - Google Patents

Training data generation method and device suitable for speech recognition models

Info

Publication number
WO2023087767A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
speech
text
original
evaluation result
Prior art date
Application number
PCT/CN2022/107228
Other languages
English (en)
French (fr)
Inventor
蒋成林
Original Assignee
北京优幕科技有限责任公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京优幕科技有限责任公司
Publication of WO2023087767A1

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/01 - Assessment or evaluation of speech recognition systems
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training
    • G10L15/26 - Speech to text systems

Definitions

  • The invention relates to the field of speech analysis and synthesis, and in particular to a method and device for generating training data suitable for a speech recognition model.
  • The general-purpose speech recognition of some vendors has reached a rather excellent level: the word error rate (WER) is below 3%, surpassing manual transcription and reaching commercial quality.
  • Nevertheless, many enterprises do not want to call the interfaces of speech recognition service providers directly and prefer to have their own speech recognition models; possible considerations include data security, cost, unsatisfactory performance in their actual business, and so on.
  • The conventional way to build such a system is to collect speech samples, label the data, and then train a model; this approach is very cost-inefficient.
  • In view of this, the present application provides a method for generating training data suitable for speech recognition models.
  • Obtaining the target text data from the plurality of text data includes: building graph data of the text, having the speech evaluation model evaluate the graph data against the speech data, and obtaining the target text from the path in the graph data corresponding to the best evaluation result.
  • If the pronunciations of a parallel part are identical, the parallel part is screened against a preset vocabulary to exclude words not relevant to the application scenario.
  • The present invention provides another method for generating training data suitable for speech recognition models.
  • Obtaining the first target text data from the original text data specifically includes: the speech evaluation model evaluates the graph data of the original text against the original speech data, and the first target text is obtained from the path in that graph data corresponding to the best evaluation result.
  • Obtaining the second target text data from the plurality of transformed text data specifically includes: the speech evaluation model evaluates the graph data of the transformed text against the transformed speech data, and the second target text is obtained from the path with the best evaluation result in that graph data.
  • There may be a plurality of pieces of transformed speech data, each with corresponding second target text data and a corresponding second evaluation result; in the step of comparing the first evaluation result with the second evaluation result, the best of the plurality of second evaluation results is selected for comparison with the first evaluation result.
  • Transforming the frequency of the original speech specifically includes raising and/or lowering the fundamental frequency of the sound signal.
  • The transformed speech data is lowered by at least one semitone compared to the original speech data.
  • The present invention also provides a training data generation device suitable for speech recognition models, including: at least one processor; and a memory connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor performs the training data generation method above.
  • In the first method, speech recognition models provided by other service providers recognize the unlabeled speech to obtain preliminary recognized text; a speech evaluation model then evaluates the text against the speech to obtain the degree of match between the pronunciation and the text; finally, sample quality is judged from the evaluation result, and the better-quality speech samples with their text labels are used as training data. This realizes automatic labeling of speech, yields high-quality training data, improves model training efficiency, and achieves a better training effect.
  • In the second method, frequency-transformed speech data is obtained by processing the original speech data; the original and transformed speech are then recognized separately by speech recognition models provided by other service providers to obtain the corresponding recognized texts; the speech evaluation model further evaluates text against speech, giving the match between the original pronunciation and its text and the match between the transformed speech and its text. Finally the two are compared: if the evaluation result of the transformed speech is better, the frequency-transformed speech matches the corresponding text more closely, so the original speech paired with the text of the transformed speech can be used as training data, again realizing automatic labeling of speech, yielding high-quality training data, improving training efficiency, and achieving a better training effect.
  • Fig. 1 is a schematic diagram of a training data generation method in an embodiment of the present invention;
  • Fig. 2 is a schematic diagram of graph data of text in an embodiment of the present invention;
  • Fig. 3 is a schematic data-flow diagram of a training data generation method in an embodiment of the present invention.
  • The first application scenario of the present invention is obtaining speech samples of a specific domain as training data.
  • A specific domain in this application means a highly specialized field such as chemical engineering or medicine; the general domain means ordinary daily life.
  • Speech content in a specific domain differs considerably from general speech data: a speech recognition model trained on general speech data is hard to apply to domain-specific recognition tasks, and domain-specific speech samples are relatively scarce.
  • This embodiment provides a method for generating training data suitable for a speech recognition model.
  • The method can be executed by electronic devices such as computers and servers.
  • The method includes the following operations:
  • The speech data in this embodiment may be speech from any specific domain or from the general domain, such as a recording of a sentence or a paragraph, whose corresponding text is unknown.
  • This solution does not restrict the language; Chinese, English, or any other language is feasible.
  • A speech recognition model here means a trained neural network model based on deep learning algorithms, such as a model provided by another service provider; it has a certain text recognition capability but may not suit a specific domain.
  • The first speech recognition model 11 and the second speech recognition model 12 each perform speech-to-text recognition on the speech data audio; the recognition result output by the first speech recognition model 11 is transcription-1, and the recognition result output by the second speech recognition model 12 is transcription-2.
  • Models with a hot-word function can be given several user-provided words in advance; these words are consulted when recognizing the speech data, making the result more accurate. For example, if the actual text contains 氨基酸 ("amino acid"), a typical model may output the identically pronounced but wrong 按计算 ("according to calculation"), whereas a model given the word in advance outputs the correct recognition result.
  • S3A: obtain target text data from the multiple text data. Because the models may differ in performance, training algorithm, training samples, and other factors, their recognition results may disagree in many places, or they may be completely identical. Several situations can therefore arise in this step; for example, if the multiple recognition results (text data) are completely identical, that result is the target text data.
  • An optional processing method is to establish a vocabulary table in advance according to the application scenario, storing all the words that may appear in that scenario; when the models' outputs are inconsistent, the inconsistent words are searched in this table, and the words present in the table or occurring more frequently are kept.
  • The target text data finally obtained through this search is "这个国家的商务发展得很好" ("business in this country is developing very well").
  • Another optional processing method is to introduce a speech evaluation model to find, among the multiple recognition results, the text that best matches the speech data.
  • Text graph data (Graph) is obtained from the multiple text data: the identical parts of the text data are kept, and the differing parts are arranged in a parallel relationship.
  • For the example above, the generated graph can be expressed as "这个[郭嘉|国家]的[上路|商务]发展得很好", where the bracketed pairs are the parallel branches.
  • The graph data is used as the evaluation object of the speech evaluation model.
  • This embodiment uses a model that can analyze the degree of match between the pronunciation of the speech and the text.
  • The model outputs a percentile score: the higher the score, the better the pronunciation matches the text, which can also be understood as better pronunciation quality.
  • The speech evaluation model of this embodiment searches the graph data above: it matches every path in the graph data (Graph) against the speech data, evaluating the match, or quality, of the speech data for each path separately, and thereby determines the best path as the target text.
  • The two methods above can also be used in combination.
  • Using the evaluation model alone raises two problems: first, too many paths in the graph data increase the model's computation and reduce search efficiency; second, the speech evaluation model may lack semantic analysis capability, so for identically pronounced words, or words differing only in tone, the evaluation results of the paths containing them may be exactly the same.
  • The first method is therefore used to pre-screen the branch nodes in the graph data (Graph) and discard words not relevant to the application scenario, which reduces the branch nodes and hence the number of paths, avoiding the problems above.
  • Step S4A: obtain the evaluation result of the speech evaluation model for the target text data and the speech data.
  • The speech evaluation model 13 evaluates the target text data transcription against audio to obtain the evaluation result score. If the evaluation model was used to search for the best path in step S3A, it can take the text of that best path as the reference in this step and compute a score or another form of evaluation result.
  • If the speech evaluation model was not used to obtain the target text in step S3A and the target text was generated in another of the ways above, the evaluation model is used for the first time in this step.
  • A score threshold can be preset: if the evaluation result for the target text data and speech data is above the threshold, the speech matches the target text well enough, and this pair of data can be used to train the speech recognition model; it is a high-quality sample. Otherwise the match is poor, and the pair is unsuitable for training the model.
  • The training data generation method of this embodiment uses speech recognition models provided by other service providers to recognize unlabeled speech and obtain preliminary recognized text, then uses a speech evaluation model to evaluate the text against the speech and obtain the match between pronunciation and text; finally it judges sample quality from the evaluation result and takes the high-quality speech samples with their text labels as training data, thereby realizing automatic labeling of speech, obtaining high-quality training data, improving model training efficiency, and achieving a better training effect.
  • The second application scenario is obtaining the speech of a specific population as training data.
  • Consider a business scenario involving young children: because children's vocal organs differ considerably from adults', children's voices differ greatly from adult voices.
  • In reality, children's speech samples are scarcer than ordinary adults'. A speech recognition model trained directly on adult speech recognizes children's speech very poorly; children's speech recognition has always been a hard problem in speech recognition.
  • The usual practice at present is data augmentation, for example digitally shifting the whole spectrum of adult speech upward so that it sounds sharper, resembling children's voices.
  • This can bring some gains in speech recognition training, but it is very inefficient and the final effect is not ideal.
  • The root cause is that the approach itself is untargeted and the resulting data is not realistic enough.
  • Another embodiment of the present application provides a method for generating training data suitable for a speech recognition model.
  • The method can be executed by electronic devices such as computers and servers.
  • The method includes the following operations:
  • As an example, the original speech is the speech of a child.
  • Specific transformation manners include, but are not limited to, changing the frequency of the sound signal.
  • This embodiment lowers the fundamental frequency of the sound signal so that the transformed speech data is at least one semitone lower than the original speech data.
  • If the original data is instead a very deep male voice or a very shrill female voice, the frequency can be processed or transformed in a corresponding manner.
  • The original speech data is denoted audio-0.
  • The original speech data can be transformed several times, lowering it by one additional semitone each time to obtain more transformed speech data: the speech lowered by one semitone is denoted audio-1, by two semitones audio-2, ..., by n semitones audio-n.
  • A plurality of speech recognition models each recognize the original speech data and output original text data.
  • This step recognizes audio-0 to obtain a plurality of original text data.
  • Step S4B: obtain first target text data from the plurality of original text data, and obtain the first evaluation result of the speech evaluation model for the first target text data and the original speech data.
  • The first target text data is denoted transcription-0 here.
  • This embodiment preferably adopts the third implementation provided in the embodiments above: the speech evaluation model searches for transcription-0 from audio-0 and the graph data of the original text data.
  • The speech evaluation model evaluates audio-0 against transcription-0 to obtain the first evaluation result score-0.
  • A plurality of speech recognition models each recognize the transformed speech data and output transformed text data; refer again to S2A in the embodiments above.
  • The models recognize audio-1 to obtain the corresponding transformed text data, recognize audio-2 to obtain the corresponding transformed text data, ..., and recognize audio-n to obtain the corresponding transformed text data.
  • S6B: obtain second target text data from the plurality of transformed text data, and obtain the second evaluation result of the speech evaluation model for the second target text data and the transformed speech data; refer again to S3A in the embodiment above.
  • This yields the second target text data transcription-1 corresponding to audio-1, transcription-2 corresponding to audio-2, ..., and transcription-n corresponding to audio-n.
  • The speech evaluation model evaluates audio-1 against transcription-1 to obtain the second evaluation result score-1, audio-2 against transcription-2 to obtain score-2, ..., and audio-n against transcription-n to obtain score-n.
  • Steps S3B and S4B may be executed in parallel with steps S5B and S6B; this solution does not limit the execution order of the steps.
  • Step S7B: compare the first evaluation result with the second evaluation result; if the second evaluation result is better, combine the original speech data and the second target text as training data. If several pieces of transformed speech data were obtained in step S2B, this step first determines the best of the second evaluation results: say the highest of score-1 ... score-n in the example above is score-x, then score-x is compared with score-0.
  • If score-x is better than score-0, the original speech audio-0 and the target text transcription-x corresponding to score-x are combined as training data; this means the original speech matches that target text well enough, and the pair can be used to train the speech recognition model as a high-quality sample. Otherwise the match is poor, and the pair is unsuitable for training the model.
  • In a preferred embodiment, the comparison in step S7B is configured so that, when the second evaluation result is better than the first, it is further judged whether the advantage of the second evaluation result over the first meets expectations; only when the advantage is large enough (meets expectations) are the original speech data and the second target text combined as training data. With score-form results, if score-x exceeds score-0, it is further judged whether the excess exceeds a preset threshold, and only then are audio-0 and transcription-x combined as training data. Similar judgments can be made for other forms of evaluation results, such as classification results. Training data obtained this way is more targeted, and the effect when training the model is more pronounced.
  • In summary, the method processes the speech data to obtain frequency-transformed speech data, then uses speech recognition models provided by other service providers to recognize the original and transformed speech separately and obtain the corresponding recognized texts; the speech evaluation model further evaluates text against speech, giving the match between the original pronunciation and its text and between the transformed speech and its text; finally the two are compared.
  • If the evaluation result of the transformed speech is better, the transformed speech matches its text more closely, so the original speech paired with the text of the transformed speech can be used as training data, realizing automatic labeling of speech, yielding high-quality training data, improving training efficiency, and achieving a better training effect.
  • The embodiments of the present invention may be provided as methods, systems, or computer program products. Accordingly, the present invention can take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) having computer-usable program code embodied therein.
  • These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing apparatus to operate in a specific manner, such that the instructions stored in the computer-readable memory produce an article of manufacture comprising instruction means that implement the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)

Abstract

A training data generation method and device suitable for speech recognition models. The method includes: acquiring speech data; recognizing the speech data with a plurality of speech recognition models, each outputting text data; obtaining target text data from the plurality of text data; obtaining an evaluation result of a speech evaluation model (13) for the target text data and the speech data; and judging the evaluation result, and if the evaluation result meets expectations, combining the target text and the speech data as training data.

Description

Training data generation method and device suitable for speech recognition models
Technical Field
The present invention relates to the field of speech analysis and synthesis, and in particular to a training data generation method and device suitable for speech recognition models.
Background Art
Speech recognition technology has evolved from the original GMM-HMM approach, through hybrid modeling based on an HMM topology plus neural networks, to today's end-to-end CTC/RNNT/LAS modeling based on transformer/conformer architectures. Modeling capability keeps improving, but with it the demand for training data has grown exponentially: end-to-end systems have gone from data-sparse to data-hungry. The amount of data labeled by speech recognition vendors has mostly reached the order of 100,000 hours. The data here means <audio, transcription> pairs, i.e., speech with its text label, so the labor and financial costs of data annotation are very high.
At present, the general-purpose speech recognition of some vendors has reached a rather excellent level: the word error rate (WER) is below 3%, surpassing manual transcription and reaching commercial quality. For various reasons, however, many enterprises do not want to call the interfaces of speech recognition service providers directly and would rather have their own speech recognition models; possible considerations include data security, cost, unsatisfactory performance in their actual business, and so on. In that case, the conventional way for an enterprise to build a commercially usable speech recognition system, collecting speech samples, annotating the data, and then training a model, is very cost-inefficient.
Summary of the Invention
In view of this, the present application provides a training data generation method suitable for speech recognition models, including:
acquiring speech data;
recognizing the speech data with a plurality of speech recognition models, each outputting text data;
obtaining target text data from the plurality of text data;
obtaining an evaluation result of a speech evaluation model for the target text data and the speech data;
judging the evaluation result, and if the evaluation result meets expectations, combining the target text and the speech data as training data.
Optionally, obtaining target text data from the plurality of text data includes:
obtaining graph data of the text from the plurality of text data, wherein identical parts of the text data are kept and differing parts are arranged in a parallel relationship;
evaluating the graph data of the text and the speech data with a speech evaluation model, and obtaining the target text from the path in the graph data corresponding to the best evaluation result.
Optionally, when obtaining the graph data of the text from the plurality of text data, the method further includes:
judging whether the pronunciations of a parallel part are identical;
if the pronunciations are identical, screening the parallel part against a preset vocabulary to exclude words not relevant to the application scenario.
The present invention provides another training data generation method suitable for speech recognition models, including:
acquiring original speech data;
transforming the frequency of the original speech to obtain at least one piece of transformed speech data;
recognizing the original speech data with a plurality of speech recognition models, each outputting original text data;
obtaining first target text data from the plurality of original text data, and obtaining a first evaluation result of a speech evaluation model for the first target text data and the original speech data;
recognizing the transformed speech data with the plurality of speech recognition models, each outputting transformed text data;
obtaining second target text data from the plurality of transformed text data, and obtaining a second evaluation result of the speech evaluation model for the second target text data and the transformed speech data;
comparing the first evaluation result with the second evaluation result, and if the second evaluation result is better than the first evaluation result, combining the original speech data and the second target text as training data.
Optionally, obtaining the first target text data from the original text data specifically includes:
obtaining graph data of the original text from the plurality of original text data, wherein identical parts of the original text data are kept and differing parts are arranged in a parallel relationship;
evaluating the graph data of the original text and the original speech data with a speech evaluation model, and obtaining the first target text from the path in the graph data of the original text corresponding to the best evaluation result.
Optionally, obtaining second target text data from the plurality of transformed text data specifically includes:
obtaining graph data of the transformed text from the plurality of transformed text data, wherein identical parts of the transformed text data are kept and differing parts are arranged in a parallel relationship;
evaluating the graph data of the transformed text and the transformed speech data with a speech evaluation model, and obtaining the second target text from the path with the best evaluation result in the graph data of the transformed text.
Optionally, there are a plurality of pieces of transformed speech data, each with corresponding second target text data and a corresponding second evaluation result; in the step of comparing the first evaluation result with the second evaluation result, the best of the plurality of second evaluation results is selected for comparison with the first evaluation result.
Optionally, transforming the frequency of the original speech specifically includes raising and/or lowering the fundamental frequency of the sound signal.
Optionally, the transformed speech data is lowered by at least one semitone relative to the original speech data.
Correspondingly, the present invention provides a training data generation device suitable for speech recognition models, including: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor performs the training data generation method described above.
According to the training data generation method and device provided by the present invention, unlabeled speech is recognized with speech recognition models provided by other service providers to obtain preliminary recognized text; a speech evaluation model then evaluates the text against the speech to obtain the degree of match between the pronunciation and the text; finally, sample quality is judged from the evaluation result, and the better-quality speech samples together with their text labels are used as training data. Speech is thereby labeled automatically, high-quality training data is obtained, model training efficiency is improved, and a better training effect can be achieved.
According to the training data generation method and device provided by the present invention, the speech data is processed to obtain frequency-transformed speech data; the original speech and the transformed speech are then recognized separately with speech recognition models provided by other service providers to obtain the corresponding recognized texts; a speech evaluation model further evaluates text against speech to obtain the degree of match between the original pronunciation and its text and between the transformed speech and its text; finally the two are compared, and if the evaluation result of the transformed speech is better, the frequency-transformed speech matches the corresponding text more closely, so the original speech together with the text corresponding to the transformed speech can be used as training data. This realizes automatic labeling of speech, yields high-quality training data, improves model training efficiency, and can achieve a better training effect.
Brief Description of the Drawings
In order to explain the specific embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in describing the specific embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show some embodiments of the present invention, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic diagram of a training data generation method in an embodiment of the present invention;
Fig. 2 is a schematic diagram of graph data of text in an embodiment of the present invention;
Fig. 3 is a schematic data-flow diagram of a training data generation method in an embodiment of the present invention.
Detailed Description
The technical solutions of the present invention will be described clearly and completely below with reference to the drawings. Obviously, the described embodiments are some rather than all of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by a person of ordinary skill in the art without creative effort fall within the protection scope of the present invention.
In addition, the technical features involved in the different embodiments of the present invention described below can be combined with each other as long as they do not conflict.
The first application scenario of the present invention is obtaining speech samples of a specific domain as training data. A specific domain in this application means a highly specialized field such as chemical engineering or medicine, while the general domain means ordinary daily life. Speech content related to a specific domain differs considerably from general speech data, so a speech recognition model trained on general speech data is difficult to apply to recognition tasks in the specific domain, and speech samples in specific domains are relatively scarce.
For this application scenario, in order to obtain usable training data, this embodiment provides a training data generation method suitable for speech recognition models. The method can be executed by an electronic device such as a computer or a server and includes the following operations:
S1A, acquiring speech data. The speech data in this embodiment may be speech from any specific domain or from the general domain, for example a recording of a sentence or a paragraph whose corresponding text is unknown. This solution does not restrict the language; Chinese, English, or any other language is feasible.
S2A, recognizing the speech data with a plurality of speech recognition models, each outputting text data. A speech recognition model here means a trained neural network model based on deep learning algorithms, for example a model provided by another service provider; such a model has a certain text recognition capability but may not suit a specific domain.
As shown in Fig. 1, taking two speech recognition models as an example, the first speech recognition model 11 and the second speech recognition model 12 each perform speech-to-text recognition on the speech data audio; the recognition result output by the first speech recognition model 11 is transcription-1, and the recognition result output by the second speech recognition model 12 is transcription-2.
In addition, a speech recognition model with a hot-word function is preferably used in this step. Specifically, such a model can be given a number of user-provided words in advance; these words are consulted when recognizing the speech data, making the recognition result more accurate. For example, suppose the actual text contains the word 氨基酸 ("amino acid"); an ordinary model might output the identically pronounced but wrong result 按计算 ("according to calculation"), whereas a model given the word in advance outputs the correct recognition result. This step is sketched below.
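The following is a minimal editorial sketch of step S2A in Python. The Recognizer protocol and its recognize(audio_path, hot_words=...) signature are assumptions standing in for whatever vendor SDKs are actually used; only the flow (several models, one shared hot-word list) comes from the embodiment.

```python
from typing import Protocol

class Recognizer(Protocol):
    # Hypothetical vendor interface; real SDKs differ.
    def recognize(self, audio_path: str, hot_words: list[str]) -> str: ...

HOT_WORDS = ["氨基酸"]  # domain words that disambiguate homophones such as 按计算

def transcribe_all(audio_path: str, recognizers: list[Recognizer]) -> list[str]:
    """Run every available recognition model on the same audio (step S2A).

    Disagreements among the returned transcripts are resolved in step S3A.
    """
    return [r.recognize(audio_path, hot_words=HOT_WORDS) for r in recognizers]
```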
S3A, obtaining target text data from the plurality of text data. Because the performance of each speech recognition model, the training algorithm used, the training samples, and other factors may differ, the recognition results may be inconsistent in many places, or they may be completely identical. Several situations can therefore arise in this step; for example, if the multiple recognition results (text data) are completely identical, that result is the target text data.
When the recognition results are inconsistent, there are several optional ways to obtain the target text data. Suppose the first speech recognition model 11 outputs "这个郭嘉的商务发展得很好" and the second speech recognition model 12 outputs "这个国家的上路发展得很好": the corresponding words 郭嘉 (a personal name) and 国家 ("country") are inconsistent, as are 商务 ("business") and 上路 ("hit the road").
Facing this situation, one optional approach is to build a vocabulary table for the application scenario in advance, storing all the words that may appear in the scenario; when the outputs of the models are inconsistent, the inconsistent words can be searched in this table, keeping the words that are present in the table or that occur with higher frequency.
For the example above, the pre-built vocabulary table stores the two words 国家 and 商务, so the target text data finally obtained through the search is "这个国家的商务发展得很好" ("business in this country is developing very well"). This lookup is sketched below.
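A minimal sketch of the vocabulary-table option, assuming the transcripts have already been aligned word by word (the alignment itself is not shown); the table contents are taken from the running example:

```python
SCENARIO_VOCAB = {"国家", "商务"}  # preset table of words expected in the scenario

def resolve(candidates: list[str]) -> str:
    """Where aligned transcripts disagree, keep the candidate found in the table."""
    for word in candidates:
        if word in SCENARIO_VOCAB:
            return word
    return candidates[0]  # fallback when no candidate is in the table

print(resolve(["郭嘉", "国家"]))  # -> 国家
print(resolve(["上路", "商务"]))  # -> 商务
```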
Another optional approach is to introduce a speech evaluation model to find, among the multiple recognition results, the text that best matches the speech data. Specifically, graph data (Graph) of the text is obtained from the plurality of text data, in which identical parts of the text data are kept and differing parts are arranged in a parallel relationship. For the two recognition results above, the generated graph can be expressed as "这个[郭嘉|国家]的[上路|商务]发展得很好". As shown in Fig. 2, the identical parts 这个, 的, 发展, and so on form single nodes, while [郭嘉|国家] indicates that these two words are parallel: they are parallel nodes, also called branch nodes, in the graph. The graph data shown in Fig. 2 therefore has 4 paths, each consisting of a sequence of words. Building such a graph is sketched below.
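A minimal sketch of the graph construction, assuming the two transcripts are already tokenized into words; difflib is used here as a stand-in for a proper aligner:

```python
from difflib import SequenceMatcher

def build_graph(a: list[str], b: list[str]) -> list[list[str]]:
    """Each element is the list of alternatives for one node of the Graph."""
    graph: list[list[str]] = []
    for op, i1, i2, j1, j2 in SequenceMatcher(a=a, b=b).get_opcodes():
        if op == "equal":
            graph.extend([tok] for tok in a[i1:i2])  # shared single node
        else:
            # differing runs become one parallel (branch) node
            graph.append(["".join(a[i1:i2]), "".join(b[j1:j2])])
    return graph

t1 = ["这个", "郭嘉", "的", "商务", "发展", "得", "很", "好"]
t2 = ["这个", "国家", "的", "上路", "发展", "得", "很", "好"]
print(build_graph(t1, t2))
# [['这个'], ['郭嘉', '国家'], ['的'], ['商务', '上路'], ['发展'], ['得'], ['很'], ['好']]
```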
The graph data serves as the evaluation object of the speech evaluation model. This embodiment uses a model that can analyze the degree of match between the pronunciation of the speech and the text; the model outputs a percentile score, and the higher the score, the better the pronunciation matches the text, which can also be understood as better pronunciation quality. The speech evaluation model can be implemented in many ways; see, for example, the speech evaluation method and related apparatus disclosed in Chinese patent document CN110797049A. It should be noted that this solution is not limited to that evaluation scheme: other evaluation models outputting other forms of results may also be used, and any algorithm that can evaluate the graph data (Graph) against the corresponding speech data is feasible. The evaluation result may, for example, be a classification result that grades the match between pronunciation and path as "excellent", "fair", "poor", and so on; anything that supports comparison is feasible.
The speech evaluation model of this embodiment searches the graph data above: it matches every path in the graph data (Graph) against the speech data, that is, it evaluates the degree of match, or quality, of the speech data for each path separately, and thereby determines the best path as the target text, as in the sketch below.
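Continuing the sketch above, the path search can be expressed as follows; evaluate(audio, text) is a placeholder for the real evaluation model, not an actual API:

```python
from itertools import product

def best_path(graph: list[list[str]], audio, evaluate) -> tuple[str, float]:
    """Score every path of the Graph against the audio and keep the best one."""
    best_text, best_score = "", float("-inf")
    for choice in product(*graph):        # one alternative per node
        text = "".join(choice)            # the candidate transcript for this path
        score = evaluate(audio, text)     # match between pronunciation and text
        if score > best_score:
            best_text, best_score = text, score
    return best_text, best_score
```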
It should also be noted that the two approaches above can be combined. For example, when more speech recognition models are used and more recognition results are obtained, directly searching for the best path in the third manner above raises two problems: first, too many paths in the graph data increase the model's computation and reduce search efficiency; second, the speech evaluation model may not have semantic analysis capability, so for words with exactly the same pronunciation, or words differing only in tone, the evaluation results of the paths containing them may be exactly the same. In a preferred embodiment, the first approach is used to pre-screen the branch nodes in the graph data (Graph) and exclude words irrelevant to the application scenario, which reduces the branch nodes and hence the number of paths, avoiding the problems above.
S4A, obtaining the evaluation result of the speech evaluation model for the target text data and the speech data. Returning to Fig. 1, the speech evaluation model 13 evaluates the target text data transcription against audio to obtain the evaluation result score. If the speech evaluation model was used to search for the best path in step S3A, then in this step the model can take the text of that best path as the reference and compute a score or another form of evaluation result. Of course, if the speech evaluation model was not used to obtain the target text in step S3A and the target text was generated in one of the other ways above, the speech evaluation model is used for the first time in this step.
S5A, judging the evaluation result; if the evaluation result meets expectations, the target text and the speech data are combined as training data. Taking scores as an example, a score threshold can be preset: if the evaluation result for the target text data and the speech data is above the threshold, the speech matches the target text well enough, and this pair of data can be used to train the speech recognition model; it is a high-quality sample. Otherwise the match is poor, and the pair is unsuitable for training the model. A sketch of this filter follows.
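A minimal sketch of the S5A filter; the threshold value is an illustrative assumption, not a number given in the application:

```python
SCORE_THRESHOLD = 85.0  # assumed cutoff on the percentile-style score

def maybe_keep(audio_path: str, target_text: str, score: float,
               dataset: list[tuple[str, str]]) -> None:
    """Keep the <audio, transcription> pair only if the match is good enough."""
    if score >= SCORE_THRESHOLD:
        dataset.append((audio_path, target_text))
```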
The training data generation method provided by this embodiment of the present invention uses speech recognition models provided by other service providers to recognize unlabeled speech and obtain preliminary recognized text, then uses a speech evaluation model to evaluate the text against the speech and obtain the degree of match between pronunciation and text, and finally judges sample quality from the evaluation result, taking the better-quality speech samples and their text labels as training data. Speech is thereby labeled automatically, high-quality training data is obtained, model training efficiency is improved, and a better training effect can be achieved.
The second application scenario is obtaining the speech of a specific population as training data. Take a business scenario involving young children as an example: because children's vocal organs differ considerably from adults', children's voices differ greatly from adult voices. In reality, children's speech samples are scarcer than ordinary adults'; if a speech recognition model is trained directly on adult speech and then used to recognize children's speech, the recognition results will be very poor. Children's speech recognition has always been a hard problem in speech recognition.
The usual practice at present is data augmentation, for example digitally shifting the whole spectrum of adult speech upward so that the voice sounds sharper, similar to the way children speak. This approach can bring some gains in speech recognition training, but it is very inefficient and the final effect is not ideal; the root cause is that the approach itself is untargeted and the resulting data is not realistic enough.
For this application scenario, another embodiment of the present application provides a training data generation method suitable for speech recognition models. The method can be executed by an electronic device such as a computer or a server and includes the following operations:
S1B, acquiring original speech data; as an example, the original speech is a child's speech.
S2B, transforming the frequency of the original speech to obtain at least one piece of transformed speech data. Specific transformation manners include, but are not limited to, changing the frequency of the sound signal. For children's speech, this embodiment lowers the fundamental frequency of the sound signal, so that the transformed speech data is at least one semitone lower than the original speech data.
For other similar application scenarios, for example when the original data is a very deep male voice or a very shrill female voice, the frequency can be processed or transformed in a corresponding manner.
As shown in Fig. 3, the original speech data is denoted audio-0. To obtain better training data, the original speech data can be transformed several times in this step, lowering it by one additional semitone each time to obtain more transformed speech data: the transformed speech lowered by one semitone is denoted audio-1, that lowered by two semitones audio-2, ..., and that lowered by n semitones audio-n. One possible implementation is sketched below.
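One possible implementation of step S2B; the application does not name a library, so librosa and soundfile are assumptions of this sketch:

```python
import librosa
import soundfile as sf

def make_transformed_speech(path: str, n: int = 5) -> list[str]:
    """Produce audio-1 ... audio-n, each lowered by one more semitone."""
    y, sr = librosa.load(path, sr=None)  # audio-0 at its native sample rate
    outputs = []
    for k in range(1, n + 1):
        shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=-k)  # down k semitones
        out = f"audio-{k}.wav"
        sf.write(out, shifted, sr)
        outputs.append(out)
    return outputs
```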
Verification with practical embodiments shows that obtaining transformed speech data successively lowered by 3 to 5 semitones in this step already yields good recognition results with high computational efficiency.
S3B, recognizing the original speech data with a plurality of speech recognition models, each outputting original text data. Refer to S2A in the embodiment above; this step recognizes audio-0 to obtain a plurality of original text data.
S4B, obtaining first target text data from the plurality of original text data, and obtaining the first evaluation result of the speech evaluation model for the first target text data and the original speech data. Refer to step S3A in the embodiment above; for ease of description, the first target text data is denoted transcription-0 here. This embodiment preferably adopts the third implementation provided in the embodiment above, i.e., the speech evaluation model searches for transcription-0 from audio-0 and the graph data of the original text data.
Further, the speech evaluation model evaluates audio-0 against transcription-0 to obtain the first evaluation result score-0.
S5B, recognizing the transformed speech data with the plurality of speech recognition models, each outputting transformed text data; again refer to S2A above. When there are several pieces of transformed speech data, audio-1 ... audio-n as above, the models recognize audio-1 to obtain the corresponding transformed text data, recognize audio-2 to obtain the corresponding transformed text data, ..., and recognize audio-n to obtain the corresponding transformed text data.
S6B, obtaining second target text data from the plurality of transformed text data, and obtaining the second evaluation result of the speech evaluation model for the second target text data and the transformed speech data; again refer to S3A above.
It should be noted that when there are several pieces of transformed speech data and correspondingly several pieces of transformed text data, this yields the second target text data transcription-1 corresponding to audio-1, transcription-2 corresponding to audio-2, ..., and transcription-n corresponding to audio-n.
Further, the speech evaluation model evaluates audio-1 against transcription-1 to obtain the second evaluation result score-1, evaluates audio-2 against transcription-2 to obtain the second evaluation result score-2, ..., and evaluates audio-n against transcription-n to obtain the second evaluation result score-n.
In addition, steps S3B and S4B may be executed in parallel with steps S5B and S6B; this solution does not restrict the execution order of the steps.
S7B, comparing the first evaluation result with the second evaluation result; if the second evaluation result is better than the first evaluation result, the original speech data and the second target text are combined as training data. If several pieces of transformed speech data were obtained in step S2B, this step first determines the best of the second evaluation results: say the highest score among score-1 ... score-n in the example above is score-x, then score-x is compared with score-0 in this step. If score-x is better than score-0, the original speech audio-0 and the target text transcription-x corresponding to score-x are combined as training data, which means the original speech matches that target text well enough; this pair of data can be used to train the speech recognition model and is a high-quality sample. Otherwise the match is poor, and the pair is unsuitable for training the model.
In a preferred embodiment, the comparison in step S7B is configured as follows: when the second evaluation result is better than the first evaluation result, it is further judged whether the advantage of the second evaluation result over the first meets expectations, and only when the advantage is large enough (meets expectations) are the original speech data and the second target text combined as training data. Taking the score-form evaluation results above as an example, if score-x is greater than score-0, it is further judged whether the excess exceeds a preset threshold; only if it does are audio-0 and transcription-x combined as training data. Similar judgments can be made for other forms of evaluation results, such as classification results. The training data obtained this way is more targeted, and the effect when training the model is more pronounced. A sketch of this comparison follows.
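A minimal sketch of S7B with the preferred margin check; the margin value is an illustrative assumption:

```python
MARGIN = 5.0  # assumed required advantage of the best transformed score over score-0

def select_training_pair(audio0: str, score0: float,
                         transcripts: list[str],
                         scores: list[float]) -> tuple[str, str] | None:
    """transcripts[i] and scores[i] belong to audio-(i+1)."""
    x = max(range(len(scores)), key=scores.__getitem__)  # best second result
    if scores[x] - score0 > MARGIN:
        return (audio0, transcripts[x])  # (audio-0, transcription-x)
    return None  # advantage too small; discard this sample
```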
The training data generation method provided by this embodiment of the present invention processes the speech data to obtain frequency-transformed speech data, then uses speech recognition models provided by other service providers to recognize the original speech and the transformed speech separately and obtain the corresponding recognized texts; it further uses the speech evaluation model to evaluate text against speech, obtaining the degree of match between the original pronunciation and its text and between the transformed speech and its text; finally the two are compared, and if the evaluation result of the transformed speech is better, the text corresponding to the transformed speech matches the original speech better, so the original speech and the text corresponding to the transformed speech can be used as training data. This realizes automatic labeling of speech, yields high-quality training data, improves model training efficiency, and can achieve a better training effect.
Those skilled in the art should understand that the embodiments of the present invention may be provided as methods, systems, or computer program products. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or another programmable data processing device to operate in a specific manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus that implements the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.
These computer program instructions may also be loaded onto a computer or another programmable data processing device, so that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, and the instructions executed on the computer or other programmable device thereby provide steps for implementing the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.
Obviously, the embodiments above are merely examples given for clarity of description and are not a limitation on the implementations. A person of ordinary skill in the art can make other changes or variations in different forms on the basis of the description above. It is neither necessary nor possible to enumerate all implementations here, and obvious changes or variations derived therefrom remain within the protection scope of the present invention.

Claims (10)

  1. A training data generation method suitable for speech recognition models, characterized by comprising:
    acquiring speech data;
    recognizing the speech data with a plurality of speech recognition models, each outputting text data;
    obtaining target text data from the plurality of text data;
    obtaining an evaluation result of a speech evaluation model for the target text data and the speech data;
    judging the evaluation result, and if the evaluation result meets expectations, combining the target text and the speech data as training data.
  2. The method according to claim 1, characterized in that obtaining target text data from the plurality of text data comprises:
    obtaining graph data of the text from the plurality of text data, wherein identical parts of the text data are kept and differing parts are arranged in a parallel relationship;
    evaluating the graph data of the text and the speech data with a speech evaluation model, and obtaining the target text from the path in the graph data corresponding to the best evaluation result.
  3. The method according to claim 2, characterized in that, when obtaining the graph data of the text from the plurality of text data, the method further comprises:
    judging whether the pronunciations of a parallel part are identical;
    if the pronunciations are identical, screening the parallel part against a preset vocabulary to exclude words not relevant to the application scenario.
  4. A training data generation method suitable for speech recognition models, characterized by comprising:
    acquiring original speech data;
    transforming the frequency of the original speech to obtain at least one piece of transformed speech data;
    recognizing the original speech data with a plurality of speech recognition models, each outputting original text data;
    obtaining first target text data from the plurality of original text data, and obtaining a first evaluation result of a speech evaluation model for the first target text data and the original speech data;
    recognizing the transformed speech data with the plurality of speech recognition models, each outputting transformed text data;
    obtaining second target text data from the plurality of transformed text data, and obtaining a second evaluation result of the speech evaluation model for the second target text data and the transformed speech data;
    comparing the first evaluation result with the second evaluation result, and if the second evaluation result is better than the first evaluation result, combining the original speech data and the second target text as training data.
  5. The method according to claim 4, characterized in that obtaining the first target text data from the original text data specifically comprises:
    obtaining graph data of the original text from the plurality of original text data, wherein identical parts of the original text data are kept and differing parts are arranged in a parallel relationship;
    evaluating the graph data of the original text and the original speech data with a speech evaluation model, and obtaining the first target text from the path in the graph data of the original text corresponding to the best evaluation result.
  6. The method according to claim 4, characterized in that obtaining second target text data from the plurality of transformed text data specifically comprises:
    obtaining graph data of the transformed text from the plurality of transformed text data, wherein identical parts of the transformed text data are kept and differing parts are arranged in a parallel relationship;
    evaluating the graph data of the transformed text and the transformed speech data with a speech evaluation model, and obtaining the second target text from the path with the best evaluation result in the graph data of the transformed text.
  7. The method according to any one of claims 4-6, characterized in that there are a plurality of pieces of transformed speech data, each with corresponding second target text data and a corresponding second evaluation result, and in the step of comparing the first evaluation result with the second evaluation result, the best of the plurality of second evaluation results is selected for comparison with the first evaluation result.
  8. The method according to any one of claims 4-6, characterized in that transforming the frequency of the original speech specifically comprises raising and/or lowering the fundamental frequency of the sound signal.
  9. The method according to claim 8, characterized in that the transformed speech data is lowered by at least one semitone relative to the original speech data.
  10. A training data generation device suitable for speech recognition models, characterized by comprising: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor, so that the at least one processor performs the training data generation method according to any one of claims 1-9.
PCT/CN2022/107228 2021-11-18 2022-07-22 Training data generation method and device suitable for speech recognition models WO2023087767A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111368649.5A CN113793593B (zh) 2021-11-18 2021-11-18 Training data generation method and device suitable for speech recognition models
CN202111368649.5 2021-11-18

Publications (1)

Publication Number Publication Date
WO2023087767A1 true WO2023087767A1 (zh) 2023-05-25

Family

ID=78955398

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/107228 WO2023087767A1 (zh) 2021-11-18 2022-07-22 Training data generation method and device suitable for speech recognition models

Country Status (2)

Country Link
CN (1) CN113793593B (zh)
WO (1) WO2023087767A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117174084A (zh) * 2023-11-02 2023-12-05 摩尔线程智能科技(北京)有限责任公司 Training data construction method and apparatus, electronic device, and storage medium
CN117174084B (zh) * 2023-11-02 2024-05-31 摩尔线程智能科技(北京)有限责任公司 Training data construction method and apparatus, electronic device, and storage medium

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113793593B (zh) * 2021-11-18 2022-03-18 北京优幕科技有限责任公司 Training data generation method and device suitable for speech recognition models
CN114898733A (zh) * 2022-05-06 2022-08-12 深圳妙月科技有限公司 Method and system for analyzing and processing AI speech data
CN115798519B (zh) * 2023-02-10 2023-05-05 山东山大鸥玛软件股份有限公司 Method and system for evaluating spoken English pronunciation across multiple question types

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180366124A1 (en) * 2017-06-19 2018-12-20 Intel Corporation Context-aware enrollment for text independent speaker recognition
CN110797049A * 2019-10-17 2020-02-14 科大讯飞股份有限公司 Speech evaluation method and related device
CN110808034A * 2019-10-31 2020-02-18 北京大米科技有限公司 Voice conversion method and apparatus, storage medium, and electronic device
CN111048070A * 2019-12-24 2020-04-21 苏州思必驰信息科技有限公司 Speech data screening method and apparatus, electronic device, and storage medium
CN111402895A * 2020-06-08 2020-07-10 腾讯科技(深圳)有限公司 Speech processing and speech evaluation method and apparatus, computer device, and storage medium
CN111883110A * 2020-07-30 2020-11-03 上海携旅信息技术有限公司 Acoustic model training method, system, device, and medium for speech recognition
CN113205814A * 2021-04-28 2021-08-03 平安科技(深圳)有限公司 Speech data labeling method and apparatus, electronic device, and storage medium
CN113393841A * 2020-10-16 2021-09-14 腾讯科技(深圳)有限公司 Training method, apparatus, and device for a speech recognition model, and storage medium
CN113535939A * 2020-04-17 2021-10-22 阿里巴巴集团控股有限公司 Text processing method and apparatus, electronic device, and computer-readable storage medium
CN113793593A * 2021-11-18 2021-12-14 北京优幕科技有限责任公司 Training data generation method and device suitable for speech recognition models

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105185372B (zh) * 2015-10-20 2017-03-22 百度在线网络技术(北京)有限公司 Training method for personalized multiple acoustic models, speech synthesis method, and apparatus
CN108766437B (zh) * 2018-05-31 2020-06-23 平安科技(深圳)有限公司 Speech recognition method and apparatus, computer device, and storage medium
CN111179916B (zh) * 2019-12-31 2023-10-13 广州市百果园信息技术有限公司 Rescoring model training method, speech recognition method, and related apparatus
CN111339765B (zh) * 2020-02-18 2023-11-03 腾讯科技(深圳)有限公司 Text quality evaluation method, text recommendation method and apparatus, medium, and device
US11735169B2 (en) * 2020-03-20 2023-08-22 International Business Machines Corporation Speech recognition and training for data inputs
CN112397056B (zh) * 2021-01-20 2021-04-09 北京世纪好未来教育科技有限公司 Speech evaluation method and computer storage medium

Also Published As

Publication number Publication date
CN113793593B (zh) 2022-03-18
CN113793593A (zh) 2021-12-14

Similar Documents

Publication Publication Date Title
WO2023087767A1 (zh) Training data generation method and device suitable for speech recognition models
WO2021174757A1 (zh) Speech emotion recognition method and apparatus, electronic device, and computer-readable storage medium
CN108766414B (zh) Method, apparatus, device, and computer-readable storage medium for speech translation
WO2020224119A1 (zh) Audio corpus screening method and apparatus for speech recognition, and computer device
CN106782603B (zh) Intelligent speech evaluation method and system
US9613638B2 (en) Computer-implemented systems and methods for determining an intelligibility score for speech
WO2020036178A1 (ja) Speech conversion learning device, speech conversion device, method, and program
US8880399B2 (en) Utterance verification and pronunciation scoring by lattice transduction
US20140039896A1 (en) Methods and System for Grammar Fitness Evaluation as Speech Recognition Error Predictor
JP7266683B2 (ja) Information verification method, apparatus, device, computer storage medium, and computer program based on voice dialogue
CN101551947A (zh) Computer system for assisting spoken language learning
JP6440967B2 (ja) Sentence-final symbol estimation device, method, and program
JP2013206253A (ja) Machine translation device, method, and program
US11151996B2 (en) Vocal recognition using generally available speech-to-text systems and user-defined vocal training
Kopparapu Non-linguistic analysis of call center conversations
JP2015201215A (ja) Machine translation device, method, and program
Liu et al. Neural acoustic-phonetic approach for speaker verification with phonetic attention mask
Baljekar Speech synthesis from found data
Ling An acoustic model for English speech recognition based on deep learning
CN117349427A (zh) Artificial intelligence multimodal content generation system for public opinion event response
CN113035236B (zh) Quality inspection method and apparatus for speech synthesis data
Lavechin et al. Modeling early phonetic acquisition from child-centered audio data
Govender et al. Objective measures to improve the selection of training speakers in HMM-based child speech synthesis
Vidal et al. Mispronunciation detection using self-supervised speech representations
Ngoc et al. Adapt-Tts: High-Quality Zero-Shot Multi-Speaker Text-to-Speech Adaptive-Based for Vietnamese

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 22894310

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2023578196

Country of ref document: JP

Kind code of ref document: A