CN111768763A - Acoustic model training method and device, electronic equipment and storage medium - Google Patents

Acoustic model training method and device, electronic equipment and storage medium

Info

Publication number
CN111768763A
Authority
CN
China
Prior art keywords: sample, text, training, mute, unit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202010537885.4A
Other languages
Chinese (zh)
Inventor
丁科
黄辰
万广鲁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sankuai Online Technology Co Ltd
Original Assignee
Beijing Sankuai Online Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sankuai Online Technology Co Ltd
Priority to CN202010537885.4A
Publication of CN111768763A
Legal status: Withdrawn

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/26 - Speech to text systems
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L2013/083 - Special characters, e.g. punctuation marks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The embodiments of the application provide an acoustic model training method and device, electronic equipment, and a storage medium. The method includes: labeling the sample text in each training sample of the acoustic model to form a labeled text corresponding to the sample text; and training the acoustic model using the alignment information of the training sample, which is obtained by forcibly aligning the labeled text corresponding to the sample text with the sample audio in the training sample. This avoids the problem that the acoustic model is difficult to train accurately because the silence segments at the beginning and the end of the sample audio are not aligned with silence units but are mistakenly aligned with pronunciation units, and thus improves training accuracy. The method and device also ensure that the silence segments at the beginning and the end of the sample audio participate in the training of the acoustic model, so that a sufficient number of silence segments take part in training and the accuracy of training is further improved.

Description

Acoustic model training method and device, electronic equipment and storage medium
Technical Field
The application relates to the field of voice recognition, in particular to an acoustic model training method and device, electronic equipment and a storage medium.
Background
Before training the acoustic model, the annotation text corresponding to the sample text needs to be forcibly aligned with the sample audio, so as to align the pronunciation unit with the corresponding speech segment in the sample audio and align the reserved silence unit with the corresponding silence segment in the sample audio.
At present, the commonly adopted approach is as follows: optional mute units are added before the first character, between every two adjacent characters, and after the last character of the sample text to obtain the labeled text corresponding to the sample text. Then, a forced alignment algorithm, such as the Viterbi algorithm, is used to forcibly align the labeled text corresponding to the sample text with the sample audio to obtain alignment information, and the acoustic model is trained using the alignment information. For each optional mute unit in the labeled text corresponding to the sample text, the forced alignment algorithm determines whether to retain the optional mute unit; when the forced alignment algorithm determines to retain an optional mute unit, the mute segment aligned with that optional mute unit in the sample audio is used for training the acoustic model.
However, the forced alignment algorithm itself has alignment errors. When the forced alignment algorithm determines not to retain the optional mute unit before the first word of the labeled text corresponding to the sample text and/or the optional mute unit after the last word, the silence segment at the beginning of the sample audio that should be aligned with a mute unit is instead aligned with the pronunciation unit of the first word, and the silence segment at the end of the sample audio that should be aligned with a mute unit is instead aligned with the pronunciation unit of the last word.
On one hand, if the forced alignment algorithm determines not to retain the optional mute unit before the first word of the labeled text corresponding to the sample text and/or the optional mute unit after the last word, the mute frames in the mute segment at the beginning of the sample audio and/or the mute frames in the mute segment at the end of the sample audio are mistakenly used as speech frames of the speech segments aligned with pronunciation units during training, which makes it difficult to train the acoustic model accurately.
On the other hand, the acoustic model is trained iteratively using a large number of training samples. In many of these samples, the silence segment at the beginning of the sample audio and/or the silence segment at the end of the sample audio may be erroneously aligned with a pronunciation unit, so that a large number of silence segments that should participate in the training of the acoustic model are not used as silence segments; this is the silence absorption problem. As a result, the number of silence segments used for training the acoustic model is small, which makes it difficult to train the acoustic model accurately.
Disclosure of Invention
In order to overcome the problems in the prior art, the application provides an acoustic model training method, an acoustic model training device, electronic equipment and a storage medium.
According to a first aspect of embodiments of the present application, there is provided an acoustic model training method, including:
obtaining a plurality of training samples of an acoustic model, the training samples comprising: a sample audio and a sample text corresponding to the sample audio;
for each training sample in the plurality of training samples, labeling a sample text in the training sample to form a labeled text; wherein the labeling of the sample text in the training sample comprises: adding a beginning mute unit before the first character in the sample text; adding a tail mute unit after the last word in the sample text; adding an optional mute unit between adjacent words in the sample text; forcibly aligning the sample audio and the marked text to obtain the alignment information of the training sample;
and training the acoustic model by using the alignment information of each training sample.
According to a second aspect of embodiments of the present application, there is provided an acoustic model training apparatus, including:
an acquisition unit configured to acquire a plurality of training samples of an acoustic model, the training samples including: a sample audio and a sample text corresponding to the sample audio;
the alignment unit is configured to label the sample texts in the training samples to form labeled texts for each training sample in the plurality of training samples; wherein the labeling of the sample text in the training sample comprises: adding a beginning mute unit before the first character in the sample text; adding a tail mute unit after the last word in the sample text; adding an optional mute unit between adjacent words in the sample text; forcibly aligning the sample audio and the marked text to obtain the alignment information of the training sample;
a training unit configured to train the acoustic model using the alignment information of each training sample.
The technical scheme provided by the embodiment of the application can have the following beneficial effects:
Since the labeled text corresponding to the sample text includes the beginning mute unit and the ending mute unit, the labeled text can indicate that both the beginning mute unit and the ending mute unit are mute units that must be retained, and neither of them is an optional mute unit. Therefore, when the labeled text corresponding to the sample text is forcibly aligned with the sample audio corresponding to the sample text, it is ensured that the beginning silence segment of the sample audio is aligned with the beginning mute unit and the ending silence segment of the sample audio is aligned with the ending mute unit, so that the beginning silence segment participates in the training of the acoustic model together with the beginning mute unit and the ending silence segment participates in the training of the acoustic model together with the ending mute unit.
On one hand, this solves the problem that the acoustic model is difficult to train accurately because the mute segment at the beginning and/or the end of the sample audio is not aligned with a mute unit and its mute frames are mistakenly used as speech frames of the speech segments aligned with pronunciation units during training, thereby improving the accuracy of acoustic model training.
On the other hand, the mute segment at the beginning of the sample audio participates in the training of the acoustic model together with the beginning mute unit, and the mute segment at the end of the sample audio participates in the training together with the ending mute unit. This solves the problem of having too few mute segments for training, ensures that a sufficient number of mute segments participate in the training of the acoustic model, and improves the accuracy of acoustic model training.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.
FIG. 1 is a flow chart of one of the acoustic model training methods provided by embodiments of the present application;
FIG. 2 is a diagram illustrating an effect of forced alignment of sample text and sample audio in the prior art;
FIG. 3 is a diagram illustrating an effect of forced alignment of sample text and sample audio in the present application;
fig. 4 shows a schematic structural diagram of an acoustic model training apparatus provided in an embodiment of the present application.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
Fig. 1 is a flowchart of an acoustic model training method provided in an embodiment of the present application. The method comprises the following steps:
step 101, obtaining a plurality of training samples of an acoustic model.
In the present application, the training samples of the acoustic model include: sample audio, sample text. The sample text in each training sample is different. The sample audio in each training sample is different.
For each training sample, the sample text in the training sample may be referred to as the sample text corresponding to the sample audio in the training sample. Accordingly, for each training sample, the sample audio in the training sample may be referred to as the sample audio corresponding to the sample text in the training sample.
In the present application, for each of the plurality of training samples of the acoustic model, the sample audio in the training sample is the audio recorded while a person participating in generating the training sample reads the sample text in the training sample aloud.
For example, for the sample text "美团点评" (Meituan Dianping), a person who participates in generating the training sample reads the sample text aloud, the audio of the reading is collected during that period, the collected audio is used as the sample audio corresponding to the sample text "美团点评", and this sample audio together with the sample text "美团点评" constitutes a training sample.
And 102, labeling the sample text in each training sample to form a labeled text, and forcibly aligning the sample audio and the labeled text to obtain the alignment information of the training samples.
In the application, for each training sample in the plurality of training samples, the sample text in the training sample is labeled to form a labeled text corresponding to the sample text. Accordingly, for each training sample, the labeled text corresponding to the sample text includes: each character in the sample text, a beginning mute unit, an ending mute unit, and at least one optional mute unit, wherein the beginning mute unit is located before the first character in the labeled text, the ending mute unit is located after the last character in the labeled text, and each optional mute unit is located between two adjacent characters in the labeled text.
In this application, for each sample text, the order of the words in the label text corresponding to the sample text is consistent with the order of the words in the sample text.
The mute units are aligned with the silence segments in the sample audio, which are the audio segments of the sample audio in which no speech is present.
For each sample text, when the sample text is labeled to form a labeled text corresponding to the sample text, a beginning mute unit may be added before the first word of the sample text, an ending mute unit may be added after the last word of the sample text, and an optional mute unit may be added between two adjacent words in the sample text.
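By way of illustration only, the labeling step described above can be sketched as follows; the token names SIL_BEGIN, SIL_OPT and SIL_END are hypothetical and merely show where the mandatory and optional mute units are inserted.

```python
# A minimal sketch of the labeling step, not taken from the patent.
# SIL_BEGIN, SIL_OPT and SIL_END are hypothetical token names.
SIL_BEGIN = "<sil_begin>"  # mandatory mute unit before the first character
SIL_OPT = "<sil_opt>"      # optional mute unit between adjacent characters
SIL_END = "<sil_end>"      # mandatory mute unit after the last character

def label_sample_text(sample_text: str) -> list:
    """Return the labeled token sequence for a sample text."""
    labeled = [SIL_BEGIN]
    chars = list(sample_text)
    for i, ch in enumerate(chars):
        labeled.append(ch)
        if i < len(chars) - 1:  # only between adjacent characters
            labeled.append(SIL_OPT)
    labeled.append(SIL_END)
    return labeled

# label_sample_text("美团点评") ->
# ['<sil_begin>', '美', '<sil_opt>', '团', '<sil_opt>', '点', '<sil_opt>', '评', '<sil_end>']
```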
For the optional mute unit, when the forced alignment is performed, whether the optional mute unit is reserved is determined by a preset forced alignment algorithm, such as a viterbi algorithm.
For each selectable mute unit, when the preset forced alignment algorithm is used to determine that the selectable mute unit is reserved, the mute section aligned with the selectable mute unit participates in the training of the acoustic model.
The beginning mute unit and the ending mute unit must be retained. If, after preliminary alignment, the preset forced alignment algorithm determines that the beginning mute unit and/or the ending mute unit is not retained, it can be concluded that the preliminary alignment contains an error, and the mute segment aligned with the beginning mute unit and/or the mute segment aligned with the ending mute unit in the sample audio is then determined.
In the application, for each training sample, the labeled text corresponding to the sample text in the training sample and the sample audio in the training sample may be forcibly aligned to obtain the alignment information of the training sample.
In this application, because the label text corresponding to the sample text includes the beginning mute unit and the ending mute unit, the label text corresponding to the sample text may indicate that the beginning mute unit and the ending mute unit are both the mute units that must be reserved, and neither the beginning mute unit nor the ending mute unit is a selectable mute unit.
Therefore, when the labeled text corresponding to the sample text is forcibly aligned with the sample audio corresponding to the sample text, it is ensured that the beginning silence segment of the sample audio is aligned with the beginning mute unit in the labeled text and the ending silence segment of the sample audio is aligned with the ending mute unit in the labeled text. It is thereby ensured that the beginning silence segment of the sample audio participates in the training of the acoustic model together with the beginning mute unit, and the ending silence segment participates in the training of the acoustic model together with the ending mute unit.
Although the labeled text corresponding to the sample text in the training sample includes a beginning mute unit and an ending mute unit, the labeling accuracy of the labeled text remains high, because for normally recorded audio there are silence segments of a certain duration both at the beginning and at the end of the audio.
In this application, for each training sample, when the label text corresponding to the sample text in the training sample and the sample audio corresponding to the sample text in the training sample are forcibly aligned to obtain the alignment information of the training sample, a preset forced alignment algorithm, such as a viterbi algorithm, may be first used to initially align the label text corresponding to the sample text in the training sample and the sample audio corresponding to the sample text in the training sample to obtain the alignment reference information of the training sample. Then, based on the alignment reference information of the training sample, the alignment information of the training sample is obtained.
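A minimal sketch of this two-stage procedure is given below. The names viterbi_align (standing for the preset forced alignment algorithm) and derive_final_positions (standing for the derivation of final positions from reference positions described in the following paragraphs) are hypothetical and are passed in as callables, since the patent does not prescribe concrete implementations.

```python
# A minimal sketch of the two-stage alignment; both helpers are hypothetical.
def align_training_sample(labeled_text, audio_frames,
                          viterbi_align, derive_final_positions):
    # Stage 1: preliminary alignment with the preset forced alignment
    # algorithm, yielding the alignment reference information
    # (reference positions of the first/last frames of each segment).
    reference_info = viterbi_align(labeled_text, audio_frames)
    # Stage 2: turn reference positions into final positions, keeping the
    # beginning and ending mute units in all cases.
    alignment_info = derive_final_positions(reference_info, audio_frames)
    return alignment_info
```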
For a training sample, after preliminarily aligning the labeled text corresponding to the sample text in the training sample with the sample audio corresponding to the sample text by a preset forced alignment algorithm, determining the reference position of the speech segment aligned with the pronunciation unit of the character in the labeled text corresponding to the sample text in the sample audio corresponding to the sample text.
The alignment reference information of the training sample obtained by the preset forced alignment algorithm includes: reference position information of a speech segment aligned with a pronunciation unit of a word in a label text corresponding to a sample text in the training sample.
For each pronunciation unit of the characters in the annotation text corresponding to the sample text in the training sample, the reference position information of the speech segment aligned with the pronunciation unit includes: the reference position of the first frame of the speech segment aligned with the pronunciation unit in the sample audio corresponding to the sample text, and the reference position of the last frame of the speech segment aligned with the pronunciation unit in the sample audio corresponding to the sample text.
The reference position of the first frame of the speech segment aligned with the pronunciation unit indicates which frame of the sample audio corresponding to the sample text that first frame is. Likewise, the reference position of the last frame of the speech segment aligned with the pronunciation unit indicates which frame of the sample audio corresponding to the sample text that last frame is.
For each pronunciation unit, the reference position of the first frame of the speech segment aligned with the pronunciation unit in the sample audio corresponding to the sample text can be directly used as the final position of the first frame of the speech segment aligned with the pronunciation unit in the sample audio corresponding to the sample text, and the reference position of the last frame of the speech segment aligned with the pronunciation unit in the sample audio corresponding to the sample text can be directly used as the final position of the last frame of the speech segment aligned with the pronunciation unit in the sample audio corresponding to the sample text.
The alignment information of the training sample includes: the final positions, in the sample audio corresponding to the sample text, of the first frame and the last frame of the speech segment aligned with each pronunciation unit of the characters in the labeled text corresponding to the sample text in each training sample.
For a training sample, after preliminarily aligning a label text corresponding to a sample text in the training sample with a sample audio corresponding to the sample text by a preset forced alignment algorithm, if it is determined that a beginning silent unit before a first word in the label text corresponding to the sample text in the training sample is reserved, the alignment reference information of the training sample obtained by the preset forced alignment algorithm includes: the reference position of the first frame of the mute segment aligned with the beginning mute unit in the sample audio corresponding to the sample text, and the reference position of the last frame of the mute segment aligned with the beginning mute unit in the sample audio corresponding to the sample text.
The reference position of the first frame of the silence segment aligned with the beginning silence unit in the sample audio corresponding to the sample text indicates that the first frame of the silence segment aligned with the beginning silence unit is the first frame of the sample audio corresponding to the sample text.
The reference position of the last frame of the mute segment aligned with the beginning mute unit in the sample audio corresponding to the sample text indicates which frame of the sample audio corresponding to the sample text that last frame is.
The reference position of the first frame of the silence segment aligned with the beginning silence unit in the sample audio corresponding to the sample text can be directly used as the final position of the first frame of that silence segment, and the reference position of the last frame of the silence segment aligned with the beginning silence unit can be directly used as the final position of the last frame of that silence segment in the sample audio corresponding to the sample text.
The alignment information of the training samples includes: the final position of the first frame of the mute segment aligned with the beginning mute unit in the sample audio corresponding to the sample text, and the final position of the last frame of the mute segment aligned with the beginning mute unit in the sample audio corresponding to the sample text.
For a training sample, after a labeled text corresponding to a sample text in the training sample is preliminarily aligned with a sample audio corresponding to the sample text by a preset forced alignment algorithm, if it is determined by the preset forced alignment algorithm that an initial mute unit before a first word in the labeled text corresponding to the sample text in the training sample is not reserved, a final position of a last frame of a mute segment aligned with the initial mute unit in the labeled text corresponding to the sample text in the sample audio corresponding to the sample text can be determined. Since the first frame of the mute segment aligned with the beginning mute unit is the first frame in the sample audio corresponding to the sample text, the final position of the first frame of the mute segment aligned with the beginning mute unit in the sample audio corresponding to the sample text can be directly obtained. The final position of the first frame of the silence segment aligned with the beginning silence unit in the sample audio corresponding to the sample text indicates that the first frame of the silence segment aligned with the beginning silence unit is the first frame in the sample audio corresponding to the sample text.
When determining the final position of the last frame of the silence segment aligned with the beginning silence unit in the sample audio corresponding to the sample text, the position of the speech frame closest to the first frame of the sample audio corresponding to the sample text may be determined according to the acoustic features with discrimination between the silence frame and the speech frame, and the frame located before the speech frame is the last frame of the silence segment aligned with the beginning silence unit, thereby determining the final position of the last frame of the silence segment aligned with the beginning silence unit in the sample audio corresponding to the sample text.
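As one possible realization of the boundary search described above, a short-time energy criterion could be used to find the speech frame closest to the first frame. The use of log energy as the discriminative acoustic feature and the threshold value are assumptions for illustration and are not specified in the patent.

```python
# Illustrative only: log energy is assumed to be the acoustic feature that
# discriminates silence frames from speech frames; the threshold is arbitrary.
def last_frame_of_beginning_silence(frame_energies, threshold_db=-40.0):
    """Return the index of the last frame of the beginning silence segment,
    i.e. the frame just before the speech frame closest to the first frame."""
    for i, energy in enumerate(frame_energies):
        if energy > threshold_db:  # first frame judged to be speech
            return i - 1
    return len(frame_energies) - 1  # no speech found: whole audio is silence
```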
For a training sample, after preliminarily aligning a label text corresponding to a sample text in the training sample with a sample audio corresponding to the sample text by a preset forced alignment algorithm, if a termination mute unit after a last word in the label text corresponding to the sample text in the training sample is determined to be reserved, the alignment reference information of the training sample obtained by the preset forced alignment algorithm includes: the reference position of the first frame of the mute segment aligned with the end mute unit in the sample audio corresponding to the sample text, and the reference position of the last frame of the mute segment aligned with the end mute unit in the sample audio corresponding to the sample text.
The reference position of the first frame of the silence segment aligned with the end silence unit in the sample audio corresponding to the sample text indicates which frame of the sample audio corresponding to the sample text that first frame is.
The reference position of the last frame of the silence segment aligned with the end silence unit in the sample audio corresponding to the sample text indicates that the last frame of the silence segment aligned with the end silence unit is the last frame in the sample audio corresponding to the sample text.
The reference position of the first frame of the silence segment aligned with the end silence unit in the sample audio corresponding to the sample text can be directly used as the final position of the first frame of the silence segment aligned with the end silence unit in the sample audio corresponding to the sample text, and the reference position of the last frame of the silence segment aligned with the end silence unit in the sample audio corresponding to the sample text can be directly used as the final position of the last frame of the silence segment aligned with the end silence unit in the sample audio corresponding to the sample text.
The alignment information of the training samples includes: the final position of the first frame of the silence segment aligned with the end silence unit in the sample audio corresponding to the sample text, and the final position of the last frame of the silence segment aligned with the end silence unit in the sample audio corresponding to the sample text.
For a training sample, after a label text corresponding to a sample text in the training sample is preliminarily aligned with a sample audio corresponding to the sample text by a preset forced alignment algorithm, if a tail mute unit after a last word in the label text corresponding to the sample text in the training sample is determined not to be retained by the preset forced alignment algorithm, a final position of a first frame of a mute segment aligned with the tail mute unit in the label text corresponding to the sample text in the sample audio corresponding to the sample text can be determined. Since the last frame of the mute segment aligned with the end mute unit is the last frame in the sample audio corresponding to the sample text, the final position of the last frame of the mute segment aligned with the end mute unit in the sample audio corresponding to the sample text can be directly obtained. The final position of the last frame of the silence segment aligned with the end silence unit in the sample audio corresponding to the sample text indicates that the last frame of the silence segment aligned with the end silence unit is the last frame in the sample audio corresponding to the sample text.
When determining the final position of the first frame of the silence segment aligned with the ending silence unit in the sample audio corresponding to the sample text, the position of the speech frame closest to the last frame of the sample audio corresponding to the sample text may be determined according to the acoustic features with discrimination between the silence frame and the speech frame, and the frame located after the speech frame is the first frame of the silence segment aligned with the ending silence unit, thereby determining the final position of the first frame of the silence segment aligned with the ending silence unit in the sample audio corresponding to the sample text.
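Symmetrically, a hedged sketch of locating the first frame of the ending silence segment, again assuming an energy-based silence/speech decision that the patent does not prescribe:

```python
# Illustrative only: search backwards from the last frame for the speech frame
# closest to it; the frame just after that speech frame starts the end silence.
def first_frame_of_end_silence(frame_energies, threshold_db=-40.0):
    for i in range(len(frame_energies) - 1, -1, -1):
        if frame_energies[i] > threshold_db:  # speech frame closest to the end
            return i + 1
    return 0  # no speech found: the end silence starts at the first frame
```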
For a training sample, the silence segments of the sample audio in the training sample except the silence segment aligned with the beginning silence element and the silence segment aligned with the end silence element can be referred to as other silence segments.
Each other silence segment in the sample audio is aligned with one of the optional mute units in the labeled text corresponding to the sample text in the training sample that was determined to be retained.
For a training sample, after preliminarily aligning the label text corresponding to the sample text in the training sample with the sample audio corresponding to the sample text by a preset forced alignment algorithm, each reserved optional mute unit can be determined.
The alignment reference information of the training sample obtained by the preset forced alignment algorithm includes: the first frame and the last frame of each other silence segment are at reference positions in the sample audio corresponding to the sample text.
For each other silent section, the reference position of the first frame of the other silent section in the sample audio corresponding to the sample text can be directly used as the final position of the first frame of the other silent section in the sample audio, and the reference position of the last frame of the other silent section in the sample audio corresponding to the sample text can be directly used as the final position of the last frame of the other silent section in the sample audio corresponding to the sample text.
The alignment information of the training samples includes: the first frame and the last frame of each other silence segment are at the final positions in the sample audio corresponding to the sample text.
The differences between the present application and the prior art are described below with reference to the drawings:
referring to fig. 2, a diagram illustrating an effect of forced alignment of sample text and sample audio in the prior art is shown.
A training sample includes: the sample audio of the sample text "美团点评", and the sample text "美团点评". The sample text "美团点评" includes the characters "美", "团", "点" and "评".
The arcs shown in fig. 2 indicate that there may be mute units, and in the prior art, each mute unit in the label text corresponding to the sample text is an optional mute unit.
In fig. 2, optional mute unit 201, optional mute unit 202, optional mute unit 203, optional mute unit 204, optional mute unit 205 are shown.
In the prior art, the labeled text corresponding to the sample text "美团点评" includes: the optional mute unit 201, the character "美", the optional mute unit 202, the character "团", the optional mute unit 203, the character "点", the optional mute unit 204, the character "评", and the optional mute unit 205.
In fig. 2, only the speech segments aligned with the pronunciation units of the characters in the sample audio of the sample text "美团点评", i.e. the speech segments pointed to by the arrows starting from the characters, are shown by way of example. During alignment, the pronunciation unit of each character is aligned with the corresponding speech segment. Whether a pronunciation unit of the acoustic model is a phoneme or a syllable depends on the modeling unit: when the pronunciation unit is a syllable, the pronunciation unit of a character is the syllable of that character; when the pronunciation unit is a phoneme, each character corresponds to a plurality of pronunciation units.
In the prior art, for each optional mute unit, a forced alignment algorithm is used to determine whether to reserve the optional mute unit.
The forced alignment algorithm determines that the optional mute unit 203, the optional mute unit 204, and the optional mute unit 205 are retained.
The forced alignment algorithm determines that the optional mute unit 201 and the optional mute unit 202 are not retained.
The audio segment at the beginning of the sample audio corresponding to the sample text "美团点评" is a mute segment. Because the forced alignment algorithm determines that the optional mute unit 201 is not retained, this beginning audio segment cannot be aligned with the optional mute unit 201 with which it should be aligned, and is instead aligned with the pronunciation unit of the first character "美".
In the prior art, the beginning audio segment of the sample audio corresponding to the sample text "美团点评" is aligned with the pronunciation unit of the first character "美". As a result, when the acoustic model is trained, the mute frames in this beginning audio segment, i.e. the beginning silence segment, are mistakenly used as speech frames of the speech segment aligned with the pronunciation unit of the first character "美", which makes it difficult to train the acoustic model accurately.
The acoustic model is trained iteratively using a large number of training samples. The above situation may occur in a large number of training samples, so that a large number of silence segments that should participate in the training of the acoustic model do not participate as silence segments, the silence absorption problem occurs, the number of silence segments used for training is small, and it becomes difficult to train the acoustic model sufficiently.
Please refer to fig. 3, which shows a schematic diagram of an effect of forced alignment of sample text and sample audio in the present application.
Fig. 3 shows the sample audio of the sample text "美团点评" and the sample text "美团点评".
When the sample text "美团点评" is labeled, the beginning mute unit 301 is added before the first character "美" in the sample text, and the ending mute unit 305 is added after the last character "评" in the sample text.
In the present application, the labeled text corresponding to the sample text "美团点评" includes: the beginning mute unit 301, the character "美", the optional mute unit 302, the character "团", the optional mute unit 303, the character "点", the optional mute unit 304, the character "评", and the ending mute unit 305.
The arcs shown in fig. 3 indicate mute units that may be skipped. The beginning mute unit 301 and the ending mute unit 305 are mute units that must be retained; compared with the prior art, the arc associated with the beginning mute unit 301 and the arc associated with the ending mute unit 305 are removed.
In fig. 3, only the speech segments aligned with the pronunciation units of the characters in the sample audio of the sample text "美团点评", i.e. the speech segments pointed to by the arrows starting from the characters, are shown by way of example. During alignment, the pronunciation unit of each character is aligned with the corresponding speech segment; whether the pronunciation unit is a phoneme or a syllable depends on the modeling unit, as described above for fig. 2.
In the present application, the labeled text corresponding to the sample text "美团点评" includes the beginning mute unit 301 and the ending mute unit 305 and thus indicates that both are mute units that must be retained. Therefore, when the labeled text corresponding to the sample text "美团点评" is forcibly aligned with the corresponding sample audio, it is ensured that the mute segment at the beginning of the sample audio is aligned with the beginning mute unit 301 and the mute segment at the end of the sample audio is aligned with the ending mute unit 305, so that the beginning mute segment participates in the training of the acoustic model together with the beginning mute unit and the ending mute segment participates in the training of the acoustic model together with the ending mute unit.
In some embodiments, for each training sample, a final position of a silence segment aligned with a beginning silence element in a sample audio corresponding to the sample text may be determined based on a reference position of the silence segment aligned with the beginning silence element in an annotation text corresponding to the sample text in the training sample in the sample audio corresponding to the sample text; and determining the final position of the mute segment aligned with the end mute unit in the sample audio corresponding to the sample text based on the reference position of the mute segment aligned with the end mute unit in the label text corresponding to the sample text in the training sample in the sample audio corresponding to the sample text.
For a training sample, after a sample text in the training sample and a sample audio corresponding to the sample text are preliminarily aligned through a preset forced alignment algorithm, if a beginning silent unit before a first character in a label text corresponding to the sample text in the training sample is reserved through the preset forced alignment algorithm, the alignment reference information of the training sample obtained through the preset forced alignment algorithm includes: the reference position of the first frame of the mute segment aligned with the beginning mute unit in the sample audio corresponding to the sample text, and the reference position of the last frame of the mute segment aligned with the beginning mute unit in the sample audio corresponding to the sample text.
The first frame of the silence segment aligned with the beginning silence unit is the first frame of the sample audio corresponding to the sample text, and the reference position of the first frame of the silence segment aligned with the beginning silence unit in the sample audio corresponding to the sample text can be directly used as the final position of the first frame of the silence segment aligned with the beginning silence unit in the sample audio corresponding to the sample text.
For the last frame of the silence segment aligned with the beginning silence unit, multiple frames near the reference position of the last frame of the silence segment aligned with the beginning silence unit in the sample audio corresponding to the sample text may be selected, and it is respectively determined whether each of the selected multiple frames is a silence frame, thereby determining the final position of the last frame of the silence segment aligned with the beginning silence unit in the sample audio corresponding to the sample text.
After determining the final position of the first frame of the silence segment aligned with the beginning silence unit in the sample audio corresponding to the sample text and the final position of the last frame of the silence segment aligned with the beginning silence unit in the sample audio corresponding to the sample text, the final position of the silence segment aligned with the beginning silence unit in the sample audio corresponding to the sample text may be determined.
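A hedged sketch of this refinement is given below; the window size and the helper is_silence_frame (for example an energy-based decision) are assumptions for illustration, not part of the patent.

```python
# Illustrative refinement of the last frame of the beginning silence segment:
# inspect a small window of frames around the reference position and keep the
# largest index that is still judged to be a silence frame.
def refine_last_silence_frame(frames, reference_index, is_silence_frame, window=5):
    refined = reference_index
    lo = max(0, reference_index - window)
    hi = min(len(frames) - 1, reference_index + window)
    for i in range(lo, hi + 1):
        if is_silence_frame(frames[i]):
            refined = i  # latest frame in the window judged to be silence
    return refined
```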
The alignment information of the training samples includes: the final position of the first frame of the mute segment aligned with the beginning mute unit in the sample audio corresponding to the sample text, and the final position of the last frame of the mute segment aligned with the beginning mute unit in the sample audio corresponding to the sample text.
For a training sample, after a sample text in the training sample and a sample audio corresponding to the sample text are preliminarily aligned by a preset forced alignment algorithm, if a tail mute unit after a last word in a label text corresponding to the sample text in the training sample is determined to be reserved, the alignment reference information of the training sample obtained by the preset forced alignment algorithm includes: the reference position of the first frame of the mute segment aligned with the end mute unit in the sample audio corresponding to the sample text, and the reference position of the last frame of the mute segment aligned with the end mute unit in the sample audio corresponding to the sample text.
The last frame of the mute section aligned with the end mute unit is the last frame of the sample audio corresponding to the sample text, and the reference position of the last frame of the mute section aligned with the end mute unit in the sample audio corresponding to the sample text can be directly used as the final position of the last frame of the mute section aligned with the end mute unit in the sample audio corresponding to the sample text.
For the first frame of the silence segment aligned with the end silence unit, multiple frames near the reference position of that first frame in the sample audio corresponding to the sample text may be selected, and it is determined for each selected frame whether it is a silence frame, thereby determining the final position of the first frame of the silence segment aligned with the end silence unit in the sample audio corresponding to the sample text.
After determining the final position of the first frame of the silence segment aligned with the end silence unit in the sample audio corresponding to the sample text and the final position of the last frame of the silence segment aligned with the end silence unit in the sample audio corresponding to the sample text, the final position of the silence segment aligned with the end silence unit in the sample audio corresponding to the sample text may be determined.
The alignment information of the training samples includes: the final position of the first frame of the silence segment aligned with the end silence unit in the sample audio corresponding to the sample text, and the final position of the last frame of the silence segment aligned with the end silence unit in the sample audio corresponding to the sample text.
In some embodiments, for each training sample, a final position of the first frame of the other silent section in the sample audio corresponding to the sample text may be determined based on a reference position of the first frame of the other silent section in the sample audio corresponding to the sample text for each other silent section in the sample audio corresponding to the sample text in the training sample; and determining the final positions of the last frames of the other silent sections in the sample audio corresponding to the sample text based on the reference positions of the last frames of the other silent sections in the sample audio corresponding to the sample text.
For a training sample, the silence segments in the sample audio corresponding to the sample text in the training sample except the silence segment aligned with the beginning silence element and the silence segment aligned with the end silence element can be referred to as other silence segments.
For each other silent segment in the sample audio corresponding to the sample text, the other silent segment is aligned with the optional silent cell determined to be reserved in one of the annotation texts corresponding to the sample text.
For a training sample, after preliminarily aligning the label text corresponding to the sample text in the training sample with the sample audio corresponding to the sample text by a preset forced alignment algorithm, each reserved optional mute unit can be determined.
The alignment reference information of the training sample obtained by the preset forced alignment algorithm includes: each other silence segment in the sample audio corresponding to the sample text is at a reference position in the sample audio corresponding to the sample text.
For each other silent section in the sample audio corresponding to the sample text, multiple frames of the first frame of the other silent section near the reference position in the sample audio corresponding to the sample text may be selected, and whether each frame in the selected multiple frames is a silent frame is respectively determined, so as to determine the final position of the first frame of the other silent section in the sample audio corresponding to the sample text.
For each other silent section in the sample audio corresponding to the sample text, multiple frames of the last frame of the other silent section near the reference position in the sample audio corresponding to the sample text may be selected, and whether each frame in the selected multiple frames is a silent frame is respectively determined, so as to determine the final position of the last frame of the other silent section in the sample audio corresponding to the sample text.
For each other silence segment in the sample audio corresponding to the sample text, after the final positions of its first frame and last frame in the sample audio have been determined, the final position of that silence segment in the sample audio can be determined. At the same time, the final position of the first frame or last frame of the adjacent speech segment aligned with a pronunciation unit can be determined, and thus the final position of that adjacent speech segment in the sample audio corresponding to the sample text.
And 103, training the acoustic model by using the alignment information of each training sample.
In this application, after obtaining the alignment information of each training sample, the acoustic model may be trained by using the alignment information of each training sample.
In this application, since the label text corresponding to the sample text in the training sample includes the beginning silent unit and the ending silent unit, the label text corresponding to the sample text in the training sample can indicate that the beginning silent unit and the ending silent unit are both the silent units that must be reserved, and neither the beginning silent unit nor the ending silent unit is a selectable silent unit.
Therefore, when the labeled text corresponding to the sample text is forcibly aligned with the sample audio corresponding to the sample text, it can be ensured that the beginning silence segment of the sample audio is aligned with the beginning mute unit and the ending silence segment is aligned with the ending mute unit, so that the beginning silence segment participates in the training of the acoustic model together with the beginning mute unit and the ending silence segment participates in the training together with the ending mute unit.
For each training sample, the alignment information of the training sample comprises: the final positions, in the sample audio corresponding to the sample text, of the first frame and the last frame of the mute segment aligned with the beginning mute unit in the labeled text corresponding to the sample text, and the final positions, in the sample audio corresponding to the sample text, of the first frame and the last frame of the mute segment aligned with the ending mute unit in the labeled text corresponding to the sample text.
For example, one training sample includes: the sample audio corresponding to the sample text "美团点评", and the sample text "美团点评".
The alignment information of the training sample includes: the final positions, in the sample audio corresponding to the sample text "美团点评", of the first frame and the last frame of the audio segment aligned with the beginning mute unit before the first character "美" in the labeled text corresponding to the sample text, and the final positions, in the sample audio corresponding to the sample text "美团点评", of the first frame and the last frame of the audio segment aligned with the ending mute unit after the last character "评" in the labeled text corresponding to the sample text.
The alignment information of the training sample further includes: the final positions, in the sample audio corresponding to the sample text "美团点评", of the first frame and the last frame of the audio segments aligned with the pronunciation units of the characters "美", "团", "点" and "评" in the labeled text corresponding to the sample text.
Assume that after the preliminary forced alignment is performed by the preset forced alignment algorithm, it is determined that the optional mute unit between the characters "美" and "团" is not retained, the optional mute unit between the characters "团" and "点" is retained, and the optional mute unit between the characters "点" and "评" is retained.
The alignment information of the training sample further includes: the final positions, in the sample audio corresponding to the sample text "美团点评", of the first frame and the last frame of the silence segment aligned with the optional mute unit between the characters "团" and "点", and the final positions of the first frame and the last frame of the silence segment aligned with the optional mute unit between the characters "点" and "评".
In the present application, for each training sample, the alignment information of the training sample gives the final positions, in the sample audio of the training sample, of the beginning silence segment aligned with the beginning mute unit, the ending silence segment aligned with the ending mute unit, the speech segments aligned with the pronunciation units, and the silence segments aligned with the retained optional mute units. The acoustic model can then be trained: one end of the training pair is the sequence consisting of the pronunciation units of the characters in the sample text and all the retained mute units, and the other end is the acoustic feature vector sequence consisting of the acoustic feature vector of each frame of the sample audio in the training sample.
In some embodiments, when the acoustic model is trained using the alignment information of each training sample, the acoustic model may be trained with a cross-entropy loss function or with the Connectionist Temporal Classification (CTC) criterion.
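As a sketch of the cross-entropy variant only, the alignment information can be expanded into one target per acoustic frame, after which a standard supervised update applies. The code below uses PyTorch and assumes hypothetical helpers (per-unit frame ranges such as those sketched above and a unit_to_id vocabulary); it is illustrative, not the implementation of the embodiment.

import torch
import torch.nn as nn

def frame_targets(alignment_info, unit_to_id, num_frames):
    """Expand per-unit frame ranges into one target id per audio frame."""
    targets = torch.zeros(num_frames, dtype=torch.long)
    for u in alignment_info:
        targets[u.first_frame:u.last_frame + 1] = unit_to_id[u.unit]
    return targets

def train_step(acoustic_model, optimizer, features, alignment_info, unit_to_id):
    """One cross-entropy training step on a single training sample."""
    # features: (num_frames, feature_dim) acoustic feature vector sequence;
    # the model is assumed to output (num_frames, num_units) scores.
    targets = frame_targets(alignment_info, unit_to_id, features.size(0))
    logits = acoustic_model(features)
    loss = nn.functional.cross_entropy(logits, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()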
In some embodiments, the pronunciation unit of the acoustic model is one of: phoneme, syllable, single word.
In the present application, the modeling unit of the acoustic model may be a phoneme. A phoneme may be, for example, a Chinese initial, a Chinese final, or an English phoneme. When the alignment information of each training sample is used to train the acoustic model, first, for each character in the labeled text corresponding to the sample text in the training sample, the phonemes of the character are determined and a phoneme sequence including the phonemes of every character is generated; at the same time, the acoustic feature vector of each frame of the sample audio in the training sample is generated to obtain an acoustic feature vector sequence. Then, based on the alignment information of the training sample, the acoustic model is trained using the acoustic feature vector sequence and the phoneme sequence.
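A minimal sketch of how such a phoneme sequence could be built from the labeled text is given below, assuming a hypothetical character-to-phoneme dictionary (initial plus final per character); the dictionary contents and label names are assumptions for illustration only. The syllable case described next is analogous, with one syllable per character instead of a phoneme pair.

# Hypothetical character-to-phoneme dictionary (Chinese initial + final),
# keyed by the translated character names used in the example above.
PHONEME_DICT = {
    "beauty": ["m", "ei"],
    "group": ["t", "uan"],
    "point": ["d", "ian"],
    "comment": ["p", "ing"],
}

def labeled_text_to_phonemes(labeled_text):
    """Map each unit of the labeled text to phonemes; mute units pass through."""
    phoneme_sequence = []
    for unit in labeled_text:
        if unit.startswith("sil"):  # beginning / end / optional mute unit
            phoneme_sequence.append(unit)
        else:
            phoneme_sequence.extend(PHONEME_DICT[unit])
    return phoneme_sequence

# Labeled text for the sample text "beauty group comment", with an optional
# mute unit between every pair of adjacent characters (before forced alignment
# decides which optional mute units are reserved).
labeled = ["sil_begin", "beauty", "sil_opt", "group", "sil_opt",
           "point", "sil_opt", "comment", "sil_end"]
print(labeled_text_to_phonemes(labeled))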
In the present application, the modeling unit of the acoustic model may also be a syllable. When the alignment information of each training sample is used to train the acoustic model, first, for each character in the labeled text corresponding to the sample text in the training sample, the syllable of the character is determined and a syllable sequence including the syllable of every character is generated; at the same time, the acoustic feature vector of each frame of the sample audio in the training sample is generated to obtain an acoustic feature vector sequence. Then, based on the alignment information of the training sample, the acoustic model is trained using the acoustic feature vector sequence and the syllable sequence.
Please refer to fig. 4, which shows a schematic structural diagram of an acoustic model training apparatus according to an embodiment of the present application. For the specific implementation of the operations that each unit of the apparatus is configured to perform, reference may be made to the corresponding operations described in the method embodiments.
As shown in fig. 4, the acoustic model training apparatus includes: an acquisition unit 401, an alignment unit 402, and a training unit 403.
The obtaining unit 401 is configured to obtain a plurality of training samples of the acoustic model, the training samples including: a sample audio and a sample text corresponding to the sample audio;
the aligning unit 402 is configured to, for each training sample in the plurality of training samples, label the sample text in the training sample to form a labeled text, wherein the labeling of the sample text in the training sample comprises: adding a beginning mute unit before the first character in the sample text; adding a tail mute unit after the last character in the sample text; adding an optional mute unit between adjacent characters in the sample text; and forcibly aligning the sample audio in the training sample with the labeled text to obtain the alignment information of the training sample;
the training unit 403 is configured to train the acoustic model using the alignment information of each training sample.
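Purely as an illustrative sketch, the three units of fig. 4 can be thought of as the skeleton below; the class and method names, the injected forced-alignment and update callables, and the tiny example sample list are assumptions introduced for the example, not the apparatus itself.

class AcousticModelTrainingApparatus:
    """Illustrative skeleton mirroring the units of fig. 4."""

    def __init__(self, forced_align, train_step):
        # forced_align: a preset forced alignment algorithm (callable).
        # train_step: one model update, e.g. a cross-entropy or CTC step.
        self.forced_align = forced_align
        self.train_step = train_step

    def acquire_training_samples(self):                    # obtaining unit 401
        # Each training sample pairs a sample audio with its sample text.
        return [("audio_0001.wav", ["beauty", "group", "point", "comment"])]

    def label_and_align(self, sample_audio, sample_text):  # aligning unit 402
        labeled_text = ["sil_begin"]
        for i, character in enumerate(sample_text):
            labeled_text.append(character)
            if i < len(sample_text) - 1:
                labeled_text.append("sil_opt")             # optional mute unit
        labeled_text.append("sil_end")
        return self.forced_align(sample_audio, labeled_text)

    def train(self, acoustic_model, training_samples):     # training unit 403
        for sample_audio, sample_text in training_samples:
            alignment_info = self.label_and_align(sample_audio, sample_text)
            self.train_step(acoustic_model, alignment_info)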
In some embodiments, the acoustic model training apparatus further comprises:
a first position determination unit configured to determine the final position, in the sample audio, of the mute segment aligned with the beginning mute unit based on the reference position of that mute segment in the sample audio, and to determine the final position, in the sample audio, of the mute segment aligned with the end mute unit based on the reference position of that mute segment in the sample audio, wherein the reference positions of the mute segments aligned with the beginning mute unit and the end mute unit in the sample audio are determined by preliminarily aligning the annotation text and the sample audio using a preset forced alignment algorithm.
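The rule by which a final position is derived from a reference position is defined elsewhere in this application; purely for illustration, the sketch below assumes one plausible rule, namely extending each boundary mute segment to the nearest edge of the sample audio so that the leading and trailing silence is fully covered. Both the rule and the function names are assumptions, not the embodiment.

def beginning_silence_final_position(ref_last_frame):
    # Assumed rule: extend the beginning mute segment back to frame 0.
    return 0, ref_last_frame

def end_silence_final_position(ref_first_frame, total_frames):
    # Assumed rule: extend the end mute segment forward to the last frame.
    return ref_first_frame, total_frames - 1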
In some embodiments, the acoustic model training apparatus further comprises:
a second position determination unit configured to determine, for each other mute segment in the sample audio, a final position of the other mute segment in the sample audio based on a reference position of the other mute segment in the sample audio, wherein the other mute segment is aligned with an optional mute unit determined to be reserved, and the reference position of the other mute segment in the sample audio is determined by preliminarily aligning the annotation text and the sample audio using a preset forced alignment algorithm.
In some embodiments, the pronunciation unit of the acoustic model is one of: phonemes, syllables.
In some embodiments, training the acoustic model with the alignment information for each training sample comprises:
training the acoustic model with the alignment information of each training sample using the Connectionist Temporal Classification (CTC) criterion.
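For the CTC variant, a minimal sketch with PyTorch's built-in CTC loss might look as follows; the tensor sizes, the stand-in linear model, and the label ids are all invented purely to keep the example self-contained.

import torch
import torch.nn as nn

# Hypothetical sizes: 40-dim features, 10 output units, id 0 reserved for the CTC blank.
num_frames, feature_dim, num_units = 200, 40, 10

acoustic_model = nn.Linear(feature_dim, num_units)    # stand-in for a real acoustic model
ctc_loss = nn.CTCLoss(blank=0)

features = torch.randn(num_frames, feature_dim)       # acoustic feature vector sequence
log_probs = acoustic_model(features).log_softmax(-1)  # (num_frames, num_units)
log_probs = log_probs.unsqueeze(1)                    # (num_frames, batch=1, num_units)

# Label sequence: pronunciation units and mute units of the labeled text, as ids.
labels = torch.tensor([[1, 4, 2, 5, 3, 5, 6, 1]])     # invented ids, for illustration
input_lengths = torch.tensor([num_frames])
label_lengths = torch.tensor([labels.size(1)])

loss = ctc_loss(log_probs, labels, input_lengths, label_lengths)
loss.backward()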
The present application further provides an electronic device, which may include one or more processors and a memory for storing one or more programs. The one or more programs may include instructions for performing the operations described in the above embodiments, and when executed by the one or more processors, cause the one or more processors to perform those operations.
The present application also provides a computer readable medium, which may be included in an electronic device or may exist separately without being assembled into the electronic device. The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to perform the operations described in the above embodiments.
It should be noted that the computer readable medium described herein can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may include, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or by combinations of special purpose hardware and computer instructions.
The above description is only a preferred embodiment of the present application and is illustrative of the principles of the technology employed. It will be understood by those skilled in the art that the scope of the invention referred to herein is not limited to technical solutions formed by the specific combination of the above technical features, and also encompasses other technical solutions formed by any combination of the above technical features or their equivalents without departing from the inventive concept, for example, technical solutions formed by replacing the above features with (but not limited to) technical features having similar functions disclosed in the present application.

Claims (10)

1. A method of acoustic model training, the method comprising:
obtaining a plurality of training samples of an acoustic model, the training samples comprising: a sample audio and a sample text corresponding to the sample audio;
for each training sample in the plurality of training samples, labeling the sample text in the training sample to form a labeled text corresponding to the sample text, wherein the labeling of the sample text in the training sample comprises: adding a beginning mute unit before the first character in the sample text; adding a tail mute unit after the last character in the sample text; adding an optional mute unit between adjacent characters in the sample text; and forcibly aligning the sample audio in the training sample with the labeled text to obtain the alignment information of the training sample;
and training the acoustic model by using the alignment information of each training sample.
2. The method of claim 1, further comprising:
determining a final position of a silence segment aligned with the beginning silence unit in the sample audio based on a reference position of the silence segment aligned with the beginning silence unit in the sample audio;
and determining the final position of the mute segment aligned with the tail mute unit in the sample audio based on the reference position of the mute segment aligned with the tail mute unit in the sample audio, wherein the reference position of the mute segment aligned with the beginning mute unit in the sample audio and the reference position of the mute segment aligned with the tail mute unit in the sample audio are determined by preliminarily aligning the labeled text and the sample audio using a preset forced alignment algorithm.
3. The method of claim 2, further comprising:
for each other mute segment in the sample audio, determining a final position of the other mute segment in the sample audio based on a reference position of the other mute segment in the sample audio, wherein the other mute segment is aligned with an optional mute unit determined to be reserved, and the reference position of the other mute segment in the sample audio is determined by preliminarily aligning the labeled text and the sample audio using a preset forced alignment algorithm.
4. The method according to one of claims 1 to 3, wherein the pronunciation unit of the acoustic model is one of: phonemes, syllables.
5. The method of claim 4, wherein training the acoustic model using the alignment information for each training sample comprises:
training the acoustic model with the alignment information of each training sample using a cross entropy loss function or a Connectionist Temporal Classification (CTC) criterion.
6. An acoustic model training apparatus, characterized in that the apparatus comprises:
an acquisition unit configured to acquire a plurality of training samples of an acoustic model, the training samples including: a sample audio and a sample text corresponding to the sample audio;
an aligning unit configured to, for each training sample in the plurality of training samples, label the sample text in the training sample to obtain a labeled text corresponding to the sample text, wherein the labeling of the sample text in the training sample comprises: adding a beginning mute unit before the first character in the sample text; adding a tail mute unit after the last character in the sample text; adding an optional mute unit between adjacent characters in the sample text; and forcibly aligning the sample audio in the training sample with the labeled text to obtain the alignment information of the training sample;
a training unit configured to train the acoustic model using the alignment information of each training sample.
7. The apparatus of claim 6, further comprising:
a first position determination unit configured to determine a final position of a mute segment aligned with the beginning mute unit in the sample audio based on a reference position of the mute segment aligned with the beginning mute unit in the sample audio, and to determine a final position of a mute segment aligned with the tail mute unit in the sample audio based on a reference position of the mute segment aligned with the tail mute unit in the sample audio, wherein the reference position of the mute segment aligned with the beginning mute unit in the sample audio and the reference position of the mute segment aligned with the tail mute unit in the sample audio are determined by preliminarily aligning the labeled text and the sample audio using a preset forced alignment algorithm.
8. The apparatus of claim 7, further comprising:
a second position determination unit configured to determine, for each other mute segment in the sample audio, a final position of the other mute segment in the sample audio based on a reference position of the other mute segment in the sample audio, wherein the other mute segment is aligned with an optional mute unit determined to be reserved, and the reference position of the other mute segment in the sample audio is determined by preliminarily aligning the labeled text and the sample audio using a preset forced alignment algorithm.
9. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the method of any one of claims 1 to 5.
10. A storage medium in which instructions, when executed by a processor of an electronic device, enable the electronic device to perform the method of any of claims 1 to 5.
CN202010537885.4A 2020-06-12 2020-06-12 Acoustic model training method and device, electronic equipment and storage medium Withdrawn CN111768763A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010537885.4A CN111768763A (en) 2020-06-12 2020-06-12 Acoustic model training method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN111768763A true CN111768763A (en) 2020-10-13

Family

ID=72720887

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010537885.4A Withdrawn CN111768763A (en) 2020-06-12 2020-06-12 Acoustic model training method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111768763A (en)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6263308B1 (en) * 2000-03-20 2001-07-17 Microsoft Corporation Methods and apparatus for performing speech recognition using acoustic models which are improved through an interactive process
CN102937972A (en) * 2012-10-15 2013-02-20 上海外教社信息技术有限公司 Audiovisual subtitle making system and method
CN103985392A (en) * 2014-04-16 2014-08-13 柳超 Phoneme-level low-power consumption spoken language assessment and defect diagnosis method
CN107039034A (en) * 2016-02-04 2017-08-11 科大讯飞股份有限公司 A kind of prosody prediction method and system
CN105632484A (en) * 2016-02-19 2016-06-01 上海语知义信息技术有限公司 Voice synthesis database pause information automatic marking method and system
CN108711421A (en) * 2017-04-10 2018-10-26 北京猎户星空科技有限公司 A kind of voice recognition acoustic model method for building up and device and electronic equipment
CN110197658A (en) * 2019-05-30 2019-09-03 百度在线网络技术(北京)有限公司 Method of speech processing, device and electronic equipment

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112542159A (en) * 2020-12-01 2021-03-23 腾讯音乐娱乐科技(深圳)有限公司 Data processing method and equipment
CN112542159B (en) * 2020-12-01 2024-04-09 腾讯音乐娱乐科技(深圳)有限公司 Data processing method and device
CN112241467A (en) * 2020-12-18 2021-01-19 北京爱数智慧科技有限公司 Audio duplicate checking method and device
CN113011127A (en) * 2021-02-08 2021-06-22 杭州网易云音乐科技有限公司 Text phonetic notation method and device, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication
Application publication date: 20201013