CN114267376B - Phoneme detection method and device, training method and device, equipment and medium


Info

Publication number: CN114267376B
Authority: CN (China)
Prior art keywords: phoneme, audio, phonemes, prediction box, prediction
Legal status: Active (granted)
Application number: CN202111404820.3A
Other languages: Chinese (zh)
Other versions: CN114267376A (en)
Inventor: 杨少雄
Assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202111404820.3A
Publication of CN114267376A
Application granted
Publication of CN114267376B


Abstract

The disclosure provides a phoneme detection method and device, a training method and device, equipment and a medium, and relates to the field of artificial intelligence, in particular to the technical fields of deep learning, speech synthesis, computer vision, virtual/augmented reality and natural language processing. The scheme is as follows: acquiring a category coding sequence of phonemes contained in a target text segment corresponding to an audio segment; performing phoneme detection on an audio spectrogram corresponding to the audio segment based on the category coding sequence by adopting a phoneme detection model, so as to determine at least one phoneme prediction box from the audio spectrogram and to determine, from the phonemes indicated by the category coding sequence, a plurality of candidate phonemes corresponding to the spectrum fragment in the at least one phoneme prediction box; and determining, from the plurality of candidate phonemes corresponding to the spectrum fragment in the at least one phoneme prediction box, a target phoneme to which that spectrum fragment belongs. Therefore, the model performs phoneme detection based on prior information about the phonemes contained in the text, which can improve the accuracy of phoneme detection.

Description

Phoneme detection method and device, training method and device, equipment and medium
Technical Field
The present disclosure relates to the field of artificial intelligence, and in particular to the field of deep learning, speech synthesis, computer vision and virtual/augmented reality, and natural language processing, and more particularly to a phoneme detection method and apparatus, a training method and apparatus, a device, and a medium.
Background
With the continuous progress of computer animation technology, audio-driven facial expression animation of virtual humans has developed and can be applied to different fields. By inputting audio, the facial expression (including mouth shape) animation of an avatar corresponding to each phoneme object in the audio stream can be generated, thereby completing three-dimensional (3D) avatar audio driving. Therefore, how to identify each phoneme object from the audio stream is important.
Disclosure of Invention
The disclosure provides a phoneme detection method and device, a training method and device, equipment and a medium.
According to an aspect of the present disclosure, there is provided a phoneme detection method, including:
acquiring an audio spectrogram corresponding to at least one audio segment;
acquiring a category coding sequence of a target text segment corresponding to the audio segment, wherein the category coding sequence is used for indicating phonemes contained in the target text segment;
performing phoneme detection on the audio spectrogram based on the category coding sequence by adopting a phoneme detection model, so as to determine at least one phoneme prediction box from the audio spectrogram, and determining a plurality of candidate phonemes corresponding to the spectrum fragment in the at least one phoneme prediction box from the phonemes indicated by the category coding sequence;
and determining a target phoneme to which the spectrum fragment in the at least one phoneme prediction box belongs from a plurality of candidate phonemes corresponding to the spectrum fragment in the at least one phoneme prediction box.
According to another aspect of the present disclosure, there is provided a method for training a phoneme detection model, including:
acquiring an audio spectrogram corresponding to a sample audio;
acquiring a category coding sequence corresponding to the sample audio, wherein the category coding sequence is used for indicating phonemes contained in a text corresponding to the sample audio;
performing phoneme detection on the audio spectrogram based on the category coding sequence by adopting a phoneme detection model, so as to determine the position of at least one phoneme prediction box and a target phoneme to which a spectrum fragment in the at least one phoneme prediction box belongs from the audio spectrogram;
and training the phoneme detection model according to a first difference between the position of the at least one phoneme prediction box and the position of the at least one phoneme detection box labeled on the sample audio, and/or according to a second difference between a target phoneme to which a spectrum fragment in the at least one phoneme prediction box belongs and a labeled phoneme in the at least one phoneme detection box labeled on the sample audio.
According to still another aspect of the present disclosure, there is provided a phoneme detecting apparatus including:
the first acquisition module is used for acquiring an audio spectrogram corresponding to at least one audio segment;
the second obtaining module is configured to obtain a category coding sequence of a target text segment corresponding to the audio segment, where the category coding sequence is used to indicate phonemes included in the target text segment;
a detection module, configured to perform phoneme detection on the audio spectrogram based on the category coding sequence by using a phoneme detection model, so as to determine at least one phoneme prediction box from the audio spectrogram, and determine a plurality of candidate phonemes corresponding to a spectrum segment in the at least one phoneme prediction box from phonemes indicated by the category coding sequence;
and the determining module is used for determining a target phoneme to which the spectrum fragment in the at least one phoneme prediction box belongs from a plurality of candidate phonemes corresponding to the spectrum fragment in the at least one phoneme prediction box.
According to still another aspect of the present disclosure, there is provided a training apparatus for a phoneme detection model, including:
the first acquisition module is used for acquiring an audio spectrogram corresponding to the sample audio;
the second obtaining module is configured to obtain a category coding sequence corresponding to the sample audio, where the category coding sequence is used to indicate phonemes included in a text corresponding to the sample audio;
a detection module, configured to perform phoneme detection on the audio spectrogram based on the category coding sequence by using a phoneme detection model, so as to determine, from the audio spectrogram, a position of at least one phoneme prediction box and a target phoneme to which a spectrum fragment in the at least one phoneme prediction box belongs;
and the training module is used for training the phoneme detection model according to a first difference between the position of the at least one phoneme prediction box and the position of the at least one phoneme detection box labeled on the sample audio and/or according to a second difference between a target phoneme to which the spectrum fragment in the at least one phoneme prediction box belongs and a labeled phoneme in the at least one phoneme detection box labeled on the sample audio.
According to still another aspect of the present disclosure, there is provided an electronic device including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method for phoneme detection as set forth in one aspect of the disclosure above or a method for training as set forth in another aspect of the disclosure above.
According to still another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the phoneme detection method set forth in the above-described aspect of the present disclosure or to perform the training method set forth in the above-described aspect of the present disclosure.
According to yet another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the phoneme detection method set forth in the above-mentioned aspect of the present disclosure, or implements the training method set forth in the above-mentioned aspect of the present disclosure.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
fig. 1 is a schematic flowchart of a phoneme detection method according to a first embodiment of the disclosure;
fig. 2 is a schematic flowchart of a phoneme detection method according to a second embodiment of the disclosure;
fig. 3 is a schematic flowchart of a phoneme detection method according to a third embodiment of the present disclosure;
fig. 4 is a schematic flowchart of a phoneme detection method according to a fourth embodiment of the present disclosure;
FIG. 5 is a schematic diagram illustrating a phoneme detection principle in an embodiment of the present disclosure;
fig. 6 is a schematic flowchart of a phoneme detection method according to a fifth embodiment of the present disclosure;
fig. 7 is a flowchart illustrating a training method of a phoneme detection model according to a sixth embodiment of the present disclosure;
fig. 8 is a flowchart illustrating a training method of a phoneme detection model according to a seventh embodiment of the disclosure;
fig. 9 is a schematic structural diagram of a phoneme detection apparatus according to an eighth embodiment of the present disclosure;
fig. 10 is a schematic structural diagram of a training apparatus for a phoneme detection model according to a ninth embodiment of the present disclosure;
FIG. 11 shows a schematic block diagram of an example electronic device that may be used to implement any embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In the related art, an input Text may be obtained, speech synthesis is performed on the input Text by using a TTS (Text to Speech) technology to obtain an audio, a spectrogram corresponding to the audio is directly input to a phoneme detection model, and a phoneme object detection is performed on the audio spectrogram by the phoneme detection model to obtain each phoneme object in the audio spectrogram.
Specifically, the input of the phoneme detection model is the audio spectrogram, and the output of the phoneme detection model is: the position of the detection box, the category (one of 409 phonemes, which can be understood as pinyin), and the confidence corresponding to the category, for each phoneme object in the audio spectrogram.
However, since different phoneme objects can be highly similar (such as "en" and "eng"), the above method of directly detecting phoneme objects on an audio spectrogram with a model often misrecognizes phoneme objects.
In view of the above problems, the present disclosure provides a phoneme detection method and apparatus, a training method and apparatus, a device, and a medium.
A phoneme detection method and apparatus, a training method and apparatus, a device, and a medium according to embodiments of the present disclosure are described below with reference to the drawings.
Fig. 1 is a flowchart illustrating a phoneme detection method according to a first embodiment of the disclosure.
The embodiments of the present disclosure are described by taking the case that the phoneme detection method is configured in a phoneme detection apparatus as an example, and the phoneme detection apparatus can be applied to any electronic device, so that the electronic device can perform the phoneme detection function.
The electronic device may be any device with computing capability, for example, a personal computer, a mobile terminal, a server, and the like, and the mobile terminal may be a hardware device with various operating systems, touch screens, and/or display screens, such as a mobile phone, a tablet computer, a personal digital assistant, a wearable device, and the like.
As shown in fig. 1, the phoneme detection method may include the steps of:
step 101, obtaining an audio frequency spectrogram corresponding to at least one audio frequency band.
In the embodiment of the present disclosure, at least one audio clip may be obtained, and the audio clip is subjected to spectrum feature extraction to obtain an audio spectrogram.
In the embodiment of the present disclosure, the obtaining manner of the audio clip is not limited, for example, the audio clip may be collected online, for example, online through a web crawler technology, or offline, or synthesized manually, and the like, which is not limited in the embodiment of the present disclosure.
Step 102, a category coding sequence of the target text segment corresponding to the audio segment is obtained, wherein the category coding sequence is used for indicating phonemes contained in the target text segment.
In the embodiment of the present disclosure, the category encoding sequence is generated according to phonemes of characters included in a target text segment corresponding to the audio segment, and is used for indicating the phonemes of the characters included in the target text segment.
For example, when the target text segment corresponding to the audio segment is "nice scenery", the category coding sequence is used to indicate the five phonemes "hao mei de jing se".
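For illustration only, the following is a minimal Python sketch of building such a category coding sequence; the patent works with a 409-phoneme inventory that it does not enumerate, so the truncated inventory list and the function name here are assumptions.

```python
# A minimal sketch, assuming a hypothetical phoneme inventory; the patent's
# actual inventory has 409 entries and is not listed in the text.
PHONEME_INVENTORY = ["a", "ai", "ba", "de", "en", "hao", "jing", "mei", "se", "wo"]  # truncated

def build_category_code(target_phonemes):
    """Return a 0/1 vector marking which inventory phonemes occur in the target text segment."""
    present = set(target_phonemes)
    return [1 if p in present else 0 for p in PHONEME_INVENTORY]

code = build_category_code(["hao", "mei", "de", "jing", "se"])
# code[i] == 1 exactly for the phonemes contained in the target text segment
```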
Step 103, performing phoneme detection on the audio spectrogram based on the category coding sequence by adopting a phoneme detection model, so as to determine at least one phoneme prediction box from the audio spectrogram, and determining a plurality of candidate phonemes corresponding to the spectrum fragment in the at least one phoneme prediction box from the phonemes indicated by the category coding sequence.
In the embodiment of the present disclosure, the audio spectrogram may be input into a phoneme detection model, and the phoneme detection model performs phoneme detection on the audio spectrogram based on the category coding sequence to determine at least one phoneme prediction box from the audio spectrogram and to determine, from the phonemes indicated by the category coding sequence, a plurality of candidate phonemes corresponding to the spectrum fragment in the at least one phoneme prediction box.
For example, in a case that the audio segment contains one character, the phoneme detection model performs phoneme detection on the audio spectrogram corresponding to the audio segment based on the category coding sequence, so as to obtain one phoneme prediction box and a plurality of candidate phonemes corresponding to the spectrum fragment in the phoneme prediction box. In a case that the audio segment contains a plurality of characters, the phoneme detection model performs phoneme detection on the audio spectrogram corresponding to the audio segment based on the category coding sequence, so as to obtain a plurality of phoneme prediction boxes and a plurality of candidate phonemes corresponding to the spectrum fragment in each phoneme prediction box.
That is to say, in the present disclosure, the phoneme detection model performs phoneme detection on the audio spectrogram based on the prior information of the phonemes contained in the target text segment corresponding to the audio segment. Since the phoneme detection model performs phoneme prediction only among the phonemes contained in the target text segment, the situation that similar phonemes cause the model to misrecognize can be avoided, thereby improving the accuracy and reliability of the phoneme detection result.
Step 104, determining a target phoneme to which the spectrum fragment in the at least one phoneme prediction box belongs from a plurality of candidate phonemes corresponding to the spectrum fragment in the at least one phoneme prediction box.
In the embodiment of the present disclosure, the target phoneme to which the spectrum fragment in the at least one phoneme prediction box belongs may be determined from the plurality of candidate phonemes corresponding to that spectrum fragment. For example, the target phoneme to which the spectrum fragment in each phoneme prediction box belongs may be determined according to the confidences of the plurality of candidate phonemes corresponding to the spectrum fragment in that phoneme prediction box, where the confidence of the target phoneme is greater than the confidences of the other candidate phonemes.
As an example, for each phoneme prediction box, a softmax layer in a phoneme detection model may be used to perform confidence prediction on a plurality of candidate phonemes corresponding to a spectrum segment in the phoneme prediction box, so as to obtain a confidence corresponding to each candidate phoneme, so that the candidate phoneme with the highest confidence may be used as the target phoneme to which the spectrum segment in the phoneme prediction box belongs.
According to the phoneme detection method of the embodiment of the disclosure, a category coding sequence of a target text segment corresponding to at least one audio segment is obtained, wherein the category coding sequence is used for indicating phonemes contained in the target text segment; phoneme detection is performed on an audio spectrogram corresponding to the audio segment based on the category coding sequence by using a phoneme detection model, so as to determine at least one phoneme prediction box from the audio spectrogram and to determine a plurality of candidate phonemes corresponding to the spectrum fragment in the at least one phoneme prediction box from the phonemes indicated by the category coding sequence; and then a target phoneme to which the spectrum fragment in the at least one phoneme prediction box belongs is determined from the plurality of candidate phonemes. Therefore, in the model prediction stage, the phoneme detection model performs phoneme detection on the audio spectrogram based on the prior information of the phonemes contained in the target text segment corresponding to the audio segment, and since the phoneme detection model performs phoneme prediction only among the phonemes contained in the target text segment, the situation of model misrecognition caused by similar phonemes can be avoided, thereby improving the accuracy and the reliability of the phoneme detection result.
It should be noted that, in the technical solution of the present disclosure, the processes of collecting, storing, using, processing, transmitting, providing, disclosing and the like of the personal information of the user (such as audio segments, target text segments and the like) are all performed under the premise of obtaining the consent of the user, all comply with the relevant laws and regulations, and do not violate public order and good morals.
In order to clearly illustrate how the phoneme detection model is used to perform phoneme detection on the audio spectrogram in the above embodiments of the present disclosure, the present disclosure further provides a phoneme detection method.
Fig. 2 is a flowchart illustrating a phoneme detection method according to a second embodiment of the disclosure.
As shown in fig. 2, the phoneme detection method may include the steps of:
step 201, an audio frequency spectrogram corresponding to at least one audio frequency band is obtained.
Step 202, a category coding sequence of a target text segment corresponding to the audio segment is obtained, wherein the category coding sequence is used for indicating phonemes contained in the target text segment.
The execution process of steps 201 to 202 may refer to the execution process of any embodiment of the present disclosure, and is not described herein again.
Step 203, performing phoneme detection on the audio spectrogram by using a phoneme detection model to obtain a position of at least one phoneme prediction box and a plurality of predicted phonemes corresponding to the spectrum fragment in the at least one phoneme prediction box.
In the embodiment of the present disclosure, a phoneme detection model may be used to perform phoneme detection on an audio spectrogram, so as to obtain a position of at least one phoneme prediction box and a plurality of predicted phonemes corresponding to a spectrum segment in the at least one phoneme prediction box. For example, the number of predicted phonemes may be 409.
For example, in a case that the audio segment contains one character, the phoneme detection model performs phoneme detection on the audio spectrogram corresponding to the audio segment, so as to obtain the position of one phoneme prediction box and a plurality of predicted phonemes corresponding to the spectrum fragment in the phoneme prediction box. In a case that the audio segment contains a plurality of characters, the phoneme detection model performs phoneme detection on the audio spectrogram corresponding to the audio segment, so as to obtain the positions of a plurality of phoneme prediction boxes and a plurality of predicted phonemes corresponding to the spectrum fragment in each phoneme prediction box.
As a possible implementation manner, a single prediction branch of the phoneme detection model may be adopted to simultaneously perform regression prediction of phonemes on the audio spectrogram to obtain the position of the at least one phoneme prediction box, and perform category prediction of phonemes on the audio spectrogram to obtain a plurality of predicted phonemes (or the categories to which the plurality of predicted phonemes belong) corresponding to the spectrum fragment in the at least one phoneme prediction box.
As another possible implementation manner, classification and regression may be decoupled: a first prediction branch performs the regression prediction of phonemes to obtain the position of the at least one phoneme prediction box, and a second prediction branch performs the category prediction of phonemes to obtain the plurality of predicted phonemes, so that the model focuses separately on the feature expression capability of classification and of regression. That is, the feature expression capability of the model is enhanced, so as to improve the phoneme detection effect. The first prediction branch is different from the second prediction branch. A decoupled head of this kind is sketched below.
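For illustration, a minimal PyTorch sketch of such a decoupled head follows; the feature channel count, anchor count, and module names are assumptions, with only the 409-phoneme inventory and the two separate branches taken from the text above.

```python
# A minimal sketch (not the patent's architecture) of decoupled regression and
# classification branches over spectrogram features.
import torch.nn as nn

class PhonemeDetectionHead(nn.Module):
    def __init__(self, in_channels=256, num_phonemes=409, num_anchors=5):
        super().__init__()
        # First prediction branch: regresses box position (center abscissa, width) per anchor.
        self.reg_branch = nn.Conv1d(in_channels, num_anchors * 2, kernel_size=1)
        # Second prediction branch: per-anchor scores over the phoneme inventory.
        self.cls_branch = nn.Conv1d(in_channels, num_anchors * num_phonemes, kernel_size=1)

    def forward(self, feats):            # feats: (batch, channels, time)
        boxes = self.reg_branch(feats)   # (batch, num_anchors * 2, time)
        scores = self.cls_branch(feats)  # (batch, num_anchors * 409, time)
        return boxes, scores
```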
Step 204, according to the phonemes indicated by the category coding sequence, screening a plurality of predicted phonemes corresponding to the spectrum fragment in the at least one phoneme prediction box so as to reserve candidate phonemes matched with the phonemes indicated by the category coding sequence.
In the embodiment of the present disclosure, a plurality of predicted phonemes corresponding to the spectral slices in the at least one phoneme prediction box may be filtered according to the phonemes indicated by the category coding sequence, so as to retain candidate phonemes matching with the phonemes indicated by the category coding sequence.
For example, assuming that the category coding sequence is used to indicate the five phonemes "hao mei de jing se", then for the 409 predicted phonemes corresponding to the spectrum fragment in each phoneme prediction box, only 5 of the 409 predicted phonemes (i.e., "hao", "mei", "de", "jing", "se") may be reserved as candidate phonemes.
Step 205, determining a target phoneme to which the spectrum fragment in the at least one phoneme prediction box belongs from a plurality of candidate phonemes corresponding to the spectrum fragment in the at least one phoneme prediction box.
The execution process of step 205 may refer to the execution process of any embodiment of the present disclosure, and is not described herein again.
According to the phoneme detection method of the embodiment of the disclosure, phoneme detection is performed on an audio spectrogram by adopting a phoneme detection model to obtain the position of at least one phoneme prediction box and a plurality of predicted phonemes corresponding to the spectrum fragment in the at least one phoneme prediction box; and the plurality of predicted phonemes corresponding to the spectrum fragment in the at least one phoneme prediction box are screened according to the phonemes indicated by the category coding sequence, so as to reserve the candidate phonemes matched with the phonemes indicated by the category coding sequence. Therefore, only the candidate phonemes matched with the phonemes indicated by the category coding sequence are reserved for each phoneme prediction box, so that phoneme prediction is performed only among the reserved candidate phonemes, the situation that similar phonemes cause model misrecognition can be avoided, and the accuracy and the reliability of the phoneme detection result are improved.
In order to clearly illustrate how the audio spectrogram corresponding to the audio segment is obtained in the above embodiments of the present disclosure, the present disclosure further provides a phoneme detection method.
Fig. 3 is a flowchart illustrating a phoneme detection method according to a third embodiment of the disclosure.
As shown in fig. 3, the phoneme detection method may include the steps of:
step 301, acquiring an input text, and performing speech synthesis on the input text to obtain an audio stream.
In the embodiment of the present disclosure, the obtaining manner of the input text is not limited, for example, the input text may be text information input by a user, or may also be text information acquired online, for example, the input text may be acquired online by using a web crawler technology, and the like, which is not limited by the present disclosure.
In the embodiment of the present disclosure, the input text may be subjected to speech synthesis based on a speech synthesis technology, so as to obtain an audio stream.
Step 302, segmenting the audio stream according to a set time interval to obtain at least one audio segment.
In the embodiment of the present disclosure, the set time interval is a preset time interval, for example, the set time interval may be 1 second.
It can be understood that directly performing phoneme detection on a long audio stream has high complexity. Therefore, in the present disclosure, in order to reduce the complexity of phoneme detection on the audio stream, the audio stream may be segmented to obtain a plurality of audio segments. That is, the audio stream may be segmented according to a set time interval to obtain at least one audio segment. For example, assuming that the audio stream is 20 seconds of speech, the audio stream may be sliced into 20 one-second audio segments.
Step 303, performing spectrum feature extraction on the audio segment to obtain an audio spectrogram.
In the embodiment of the present disclosure, after the audio stream is segmented to obtain each audio segment, spectral feature extraction may be performed on each audio segment to obtain an audio spectrogram corresponding to each audio segment.
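As an illustrative sketch of steps 302 and 303, assuming librosa, a 16 kHz sampling rate, a 1-second set time interval, and a mel spectrogram as the spectral feature (the patent specifies none of these):

```python
# A minimal sketch of segmenting an audio stream and extracting one
# spectrogram per segment; parameter values are illustrative assumptions.
import librosa

def split_and_extract(wav_path, interval_s=1.0, sr=16000):
    audio, sr = librosa.load(wav_path, sr=sr)          # the audio stream
    hop = int(interval_s * sr)                         # samples per audio segment
    segments = [audio[i:i + hop] for i in range(0, len(audio), hop)]
    # One spectrogram per segment (the tail segment may be shorter).
    return [librosa.feature.melspectrogram(y=seg, sr=sr) for seg in segments]
```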
Step 304, a category coding sequence of the target text segment corresponding to the audio segment is obtained, wherein the category coding sequence is used for indicating phonemes contained in the target text segment.
Step 305, performing phoneme detection on the audio spectrogram based on the category coding sequence by using a phoneme detection model to determine at least one phoneme prediction box from the audio spectrogram, and determining a plurality of candidate phonemes corresponding to the spectrum fragment in the at least one phoneme prediction box from the phonemes indicated by the category coding sequence.
Step 306, determining a target phoneme to which the at least one spectrum fragment in the phoneme prediction box belongs from a plurality of candidate phonemes corresponding to the at least one spectrum fragment in the phoneme prediction box.
The execution process of steps 304 to 306 may refer to the execution process of any embodiment of the present disclosure, and is not described herein again.
The phoneme detection method of the embodiment of the disclosure obtains an audio stream by obtaining an input text and performing speech synthesis on the input text; segmenting the audio stream according to a set time interval to obtain at least one audio segment; and extracting the spectral characteristics of the audio clip to obtain an audio frequency spectrogram. Therefore, the audio stream is segmented to obtain an audio segment with the voice duration less than or equal to the set time interval, and the spectral feature extraction is performed on the audio segment, so that the extracted audio spectrogram can meet the input requirement of a phoneme detection model, the effective detection of phonemes is ensured, and the complexity of phoneme detection can be reduced.
In order to clearly illustrate how to obtain the category coding sequence of the target text segment corresponding to the audio segment in any of the above embodiments of the present disclosure, the present disclosure further provides a phoneme detection method.
Fig. 4 is a flowchart illustrating a phoneme detection method according to a fourth embodiment of the disclosure.
As shown in fig. 4, the phoneme detecting method may include the steps of:
step 401, acquiring an input text, and performing speech synthesis on the input text to obtain an audio stream.
Step 402, segmenting the audio stream according to a set time interval to obtain at least one audio segment.
Step 403, extracting the spectral features of the audio segments to obtain an audio spectrogram.
The execution process of steps 401 to 403 may refer to the execution process of the above embodiment, which is not described herein again.
Step 404, determining a truncation length according to the number of the audio segments and the number of characters contained in the input text.
In the embodiment of the present disclosure, the truncation length may be determined according to the number of audio segments and the number of characters included in the input text.
As an example, the number of characters included in the input text may be divided by the number of audio segments, and the truncation length may be determined according to the obtained ratio. For example, if the input text contains 100 characters and the number of audio pieces is 20, the truncation length may be 100/20=5 characters.
As another example, considering that the speech rates of different users may differ and that the speech rate of the same user may vary over time (for example, a user may say 7 words in 1 second at one moment and 3 words in 1 second at another), the truncation length may be greater than or equal to the above ratio in order to improve the accuracy of the phoneme detection result and avoid missed detection of phonemes. For example, the truncation length may be 2 times the ratio; still in the above example, the truncation length would then be 10 characters.
Step 405, truncating the input text into at least one text segment according to the truncation length, wherein the number of the text segments is the same as the number of the audio segments.
In the embodiment of the present disclosure, the input text may be truncated into at least one text segment according to the truncation length, where the number of the text segments is the same as the number of the audio segments.
As an example, when the truncation length is the ratio of the number of characters contained in the input text divided by the number of audio segments, assuming that the audio stream is 20 seconds, the input text contains 100 characters, and 5 phonemes need to be predicted per second, the first text segment may be the 1st to 5th characters in the input text, the second text segment may be the 6th to 10th characters, the third text segment may be the 11th to 15th characters, …, and the twentieth text segment may be the 96th to 100th characters.
As another example, still assuming that the audio stream is 20 seconds, the input text contains 100 characters, and 5 phonemes need to be predicted per second, the truncation length may be greater than or equal to 5. For example, when the truncation length is 10 characters, the first text segment may be the 1st to 10th characters in the input text, the second text segment may be the 6th to 15th characters, the third text segment may be the 11th to 20th characters, the fourth text segment may be the 16th to 25th characters, …, and the twentieth text segment may be the 96th to 100th characters. Alternatively, the first text segment may be the 1st to 10th characters, the second text segment may be the 3rd to 12th characters, the third text segment may be the 11th to 20th characters, the fourth text segment may be the 13th to 22nd characters, …, and the twentieth text segment may be the 93rd to 100th characters. A sketch of the overlapping case follows.
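Purely as an illustration of the overlapping layout in the second example above (truncation length equal to twice the characters-per-segment ratio, with a stride of one ratio), the following Python sketch uses assumed names and is not the patent's prescribed algorithm:

```python
# A minimal sketch of truncating the input text into overlapping text segments.
def truncate_text(text, num_segments):
    ratio = len(text) // num_segments        # characters per audio segment
    length = 2 * ratio                       # truncation length (2x the ratio)
    return [text[i * ratio : i * ratio + length] for i in range(num_segments)]

segments = truncate_text("x" * 100, 20)      # 20 text segments of up to 10 characters
```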
Step 406, determining a target text segment matched with the position from the at least one text segment according to the position of the audio segment in the audio stream.
In the embodiment of the present disclosure, the target text segment matched with the position may be determined from the at least one text segment according to the position of the audio segment in the audio stream. For example, assuming that the set time interval is 1 second and the audio segment occupies seconds 1-2 of the audio stream, the target text segment may be the second text segment in the input text.
Step 407, generating a category coding sequence according to the phoneme of each character in the target text segment.
In the embodiment of the present disclosure, the category encoding sequence may be generated according to the phoneme of each character in the target text segment. And the category coding sequence is used for indicating the phoneme of each character in the target text segment.
Step 408, performing phoneme detection on the audio spectrogram based on the category coding sequence by using a phoneme detection model to determine at least one phoneme prediction box from the audio spectrogram, and determining a plurality of candidate phonemes corresponding to the spectrum fragment in the at least one phoneme prediction box from the phonemes indicated by the category coding sequence.
Step 409, determining a target phoneme to which the spectrum fragment in the at least one phoneme prediction box belongs from a plurality of candidate phonemes corresponding to the spectrum fragment in the at least one phoneme prediction box.
The execution process of steps 408 to 409 may refer to the execution process of any embodiment of the present disclosure, and is not described herein again.
As an example, taking a set time interval of 1 second, the audio spectrogram may be a 1 × 160 × 101 image, and 1 second of audio generally contains 2-5 phonemes (considering that some phonemes may belong to the same category, there are at most 2-5 phoneme categories in 1 second of audio).
The phoneme detection principle can be shown in fig. 5, where a 1 in the category coding sequence indicates that the target text segment corresponding to the audio segment contains the phoneme of that category, and a 0 indicates that it does not. Assuming that the target text segment is "I love my dad" ("wo ai wo de ba"), the 4 phoneme categories "wo ai de ba" in the category coding sequence are coded as 1, and the rest are coded as 0.
The first branch (i.e., the position branch) in the phoneme detection model is configured to output the position of each phoneme prediction box in the audio spectrogram corresponding to the audio segment. The second branch (i.e., the category branch or classification branch) is configured to output the 409 phoneme category scores corresponding to the spectrum fragment in each phoneme prediction box. A dot product operation (multiplication of corresponding elements) may be performed on the output of the second branch (i.e., the 409 phoneme category scores corresponding to each phoneme prediction box) and the category coding sequence; confidence prediction may then be performed on the phoneme categories with nonzero values after the dot product operation through a softmax layer in the phoneme detection model, and the phoneme category with the maximum confidence is used as the category to which the spectrum fragment in the phoneme prediction box finally belongs.
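A minimal PyTorch sketch of this masked classification step follows; the shapes and the function name are illustrative assumptions:

```python
# A minimal sketch: mask the category-branch scores with the 0/1 category
# coding sequence, then softmax only over the surviving (nonzero) categories.
import torch

def pick_target_phoneme(cls_scores, category_code):
    # cls_scores: (409,) scores for one prediction box
    # category_code: (409,) float tensor of 0s and 1s
    masked = cls_scores * category_code                  # element-wise dot product masking
    candidate_idx = category_code.nonzero().squeeze(-1)  # phonemes present in the text
    conf = torch.softmax(masked[candidate_idx], dim=0)   # confidence over candidates only
    return candidate_idx[conf.argmax()].item()           # index of the target phoneme
```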
Therefore, in the model training stage, the category coding sequence is added for training, so that the phoneme detection model pays more attention to the phonemes indicated by the category coding sequence and less attention to the phonemes not indicated by it, which reduces the difficulty of model classification and identification and improves the phoneme identification precision.
According to the phoneme detection method of the embodiment of the disclosure, the truncation length is determined according to the number of the audio segments and the number of characters contained in the input text; the input text is truncated into at least one text segment according to the truncation length, the number of the text segments being the same as the number of the audio segments; a target text segment matched with the position of the audio segment in the audio stream is determined from the at least one text segment; and the category coding sequence is generated according to the phoneme of each character in the target text segment. Therefore, the target text segment corresponding to the audio segment can be accurately located in the input text, so that the category coding sequence effectively indicates each phoneme in the audio segment. In the prediction stage, the phoneme detection model can then predict phonemes only among the phonemes indicated by the category coding sequence, which avoids model misrecognition caused by similar phonemes and improves the accuracy and reliability of the phoneme detection result.
Corresponding to the foregoing embodiments of the phoneme detection method, the present disclosure further provides a method for generating an animation video based on the identified target phonemes.
Fig. 6 is a flowchart illustrating a phoneme detection method according to a fifth embodiment of the disclosure.
As shown in fig. 6, on the basis of any of the above embodiments, when the number of the audio segments is multiple, the phoneme detection method may further include the following steps:
step 601, generating a phoneme information sequence according to the position of at least one phoneme prediction box in the plurality of audio segments and the target phoneme to which the spectrum segment in the at least one phoneme prediction box belongs, wherein phoneme information in the phoneme information sequence comprises: each target phoneme and a corresponding pronunciation time period.
In the embodiment of the present disclosure, the position of each target phoneme in the phoneme information sequence and the corresponding pronunciation time period may be determined according to the position of each target phoneme in the corresponding audio segment and the position of the corresponding audio segment in the audio stream. The pronunciation time period may include a pronunciation start time and a pronunciation end time.
For example, assume that the audio stream is 2 seconds, the audio spectrogram corresponding to the first audio segment (0-1 seconds) has 3 phoneme prediction boxes, which are box 1, box 2, and box 3, respectively, the target phoneme to which the spectrum segment in box 1 belongs is phoneme 1, the target phoneme to which the spectrum segment in box 2 belongs is phoneme 2, and the target phoneme to which the spectrum segment in box 3 belongs is phoneme 3, and assume that the position of box 1 < the position of box 2 < the position of box 3. Furthermore, the audio spectrogram corresponding to the second audio segment (1-2 seconds) also has 3 phoneme prediction boxes, which are respectively box 4, box 5 and box 6, the target phoneme to which the spectrum segment in box 4 belongs is phoneme 4, the target phoneme to which the spectrum segment in box 5 belongs is phoneme 5, and the target phoneme to which the spectrum segment in box 6 belongs is phoneme 6, assuming that the position of box 4 < the position of box 5 < the position of box 6. The target phonemes arranged in order in the phoneme information sequence are phoneme 1, phoneme 2, phoneme 3, phoneme 4, phoneme 5 and phoneme 6, respectively.
Assuming that the position of box 1 is at 0.2 ms in the first audio segment, the pronunciation time period of phoneme 1 may be "0.1 ms to 0.3 ms"; the position of box 2 is at the 300th millisecond in the first audio segment, so the pronunciation time period of phoneme 2 may be "299.9 ms to 300.1 ms"; and the position of box 3 is at the 600th millisecond in the first audio segment, so the pronunciation time period of phoneme 3 may be "599.9 ms to 600.1 ms". Further, assuming that the position of box 4 is at 0.2 ms in the second audio segment, the pronunciation time period of phoneme 4 may be "1000.1 ms to 1000.3 ms"; the position of box 5 is at the 300th millisecond in the second audio segment, so the pronunciation time period of phoneme 5 may be "1299.9 ms to 1300.1 ms"; and the position of box 6 is at the 600th millisecond in the second audio segment, so the pronunciation time period of phoneme 6 may be "1599.9 ms to 1600.1 ms".
That is to say, in order to improve the accuracy of the phoneme information sequence generation result, for each audio segment, a phoneme information subsequence may be generated according to the position of at least one phoneme prediction box in the audio segment and the target phoneme to which the spectrum segment in at least one phoneme prediction box belongs, so that the phoneme information subsequences may be merged according to the position of each audio segment in the audio stream to obtain a phoneme information sequence.
For example, the pronunciation time period in each phoneme information subsequence may be adjusted according to the position of each audio segment in the audio stream, so as to obtain an adjusted phoneme information subsequence; and merging the plurality of adjusted phoneme information subsequences to obtain a phoneme information sequence. That is to say, in order to improve the accuracy of the phoneme information sequence generation result, the pronunciation time periods in the multiple phoneme information subsequences may be adjusted to the time period information in the audio stream according to the time period information of the multiple audio segments in the audio stream, and the adjusted phoneme information subsequences may be subjected to splicing processing to obtain the phoneme information sequence.
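As an illustration of this merging, here is a minimal Python sketch, assuming a 1-second set time interval and a simple (phoneme, start_ms, end_ms) tuple representation not prescribed by the patent:

```python
# A minimal sketch of merging per-segment phoneme information subsequences by
# offsetting each pronunciation time period by its segment's start time.
def merge_subsequences(subsequences, interval_ms=1000):
    sequence = []
    for seg_idx, sub in enumerate(subsequences):
        offset = seg_idx * interval_ms       # segment position in the audio stream
        for phoneme, start, end in sub:
            sequence.append((phoneme, start + offset, end + offset))
    return sequence

seq = merge_subsequences([[("wo", 0.1, 0.3)], [("ai", 299.9, 300.1)]])
# -> [("wo", 0.1, 0.3), ("ai", 1299.9, 1300.1)]
```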
Step 602, obtaining a syllable sequence, wherein the syllable sequence corresponds to the same text as the audio stream.
In the embodiment of the present disclosure, the text corresponding to the audio stream is an input text, and the syllables corresponding to each character in the input text can be obtained, and the syllables corresponding to each character are spliced to obtain the syllable sequence corresponding to the input text. Wherein, the syllable corresponding to the character can be the pinyin of the character.
Step 603, determining a pronunciation time period corresponding to the syllable in the syllable sequence according to the syllable sequence, each target phoneme in the phoneme information sequence and the corresponding pronunciation time period.
In the embodiment of the present disclosure, the syllables in the syllable sequence have a corresponding relationship with the target phoneme in the phoneme information sequence, for example, the syllable "wo" in the syllable sequence has a corresponding relationship with the target phoneme "wo" in the phoneme information sequence, and therefore, for each syllable in the syllable sequence, the pronunciation time period corresponding to the syllable may be determined according to the pronunciation time period of the target phoneme corresponding to the syllable. The step of determining the pronunciation time period may be performed for each syllable in the syllable sequence, so as to obtain a pronunciation time period corresponding to each syllable in the syllable sequence.
In a possible implementation manner of the embodiment of the present disclosure, in order to improve the accuracy of the result of determining the pronunciation time period corresponding to a syllable, the correspondence between syllables and phoneme information in the phoneme information sequence may be determined according to the order of the syllables and the correspondence between syllables and phonemes; that is, the target phoneme in a piece of phoneme information in the phoneme information sequence corresponds to a syllable in the syllable sequence. Furthermore, the pronunciation time period corresponding to the syllable may be determined based on the pronunciation time period of the target phoneme in the phoneme information corresponding to that syllable.
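For illustration, a minimal sketch of this alignment, assuming syllables and target phonemes correspond one-to-one in pronunciation order (the names are assumptions):

```python
# A minimal sketch: each syllable inherits the pronunciation time period of
# its corresponding target phoneme.
def align_syllables(syllables, phoneme_sequence):
    # phoneme_sequence: [(phoneme, start_ms, end_ms), ...] in pronunciation order
    return [(syl, start, end)
            for syl, (_phoneme, start, end) in zip(syllables, phoneme_sequence)]
```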
Step 604, generating an animation video corresponding to the audio stream according to the pronunciation time period corresponding to the syllable in the syllable sequence and the animation frame sequence corresponding to the syllable.
In the embodiment of the present disclosure, since the pronunciation time period in the syllable sequence is determined according to the pronunciation time period corresponding to the target phoneme in the phoneme information sequence, the duration of the pronunciation time period corresponding to the syllable can be determined according to the pronunciation time period corresponding to the syllable in the syllable sequence, and the animation frame sequence corresponding to the syllable can be processed according to the duration to generate the animation video corresponding to the audio stream.
In a possible implementation manner of the embodiment of the present disclosure, the animation frame sequence corresponding to the syllable may be interpolated according to the duration of the pronunciation time period corresponding to the syllable, so as to obtain the processed animation frame sequence having the duration. For example, for a syllable in a syllable sequence, the animation dictionary may be queried to obtain an animation frame sequence corresponding to the syllable, and the animation frame sequence corresponding to the syllable is interpolated (e.g., compressed) according to the duration of the pronunciation time period corresponding to the syllable, so as to obtain an animation frame sequence corresponding to the duration.
It should be noted that the above step of interpolation processing may be performed for each syllable in the syllable sequence or only for part of the syllables, which is not limited by the present disclosure. Taking each syllable as an example, the step of interpolation processing may be performed separately for each syllable in the syllable sequence, so as to obtain a processed animation frame sequence corresponding to each syllable in the syllable sequence. Therefore, the animation video can be generated according to the processed animation frame sequence corresponding to each syllable in the syllable sequence.
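As an illustrative sketch of the interpolation processing, assuming a fixed frame rate and nearest-frame resampling (the patent does not specify the interpolation scheme):

```python
# A minimal sketch of compressing/stretching an animation frame sequence to a
# syllable's pronunciation duration by resampling frame indices.
import numpy as np

def resample_frames(frames, duration_ms, fps=25):
    target_n = max(1, round(duration_ms / 1000 * fps))  # frames needed for the duration
    idx = np.linspace(0, len(frames) - 1, target_n)     # interpolated frame positions
    return [frames[int(round(i))] for i in idx]         # nearest-frame resampling
```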
The phoneme detection method of the embodiment of the disclosure generates a phoneme information sequence according to the position of at least one phoneme prediction box in a plurality of audio fragments and a target phoneme to which a spectrum fragment in the at least one phoneme prediction box belongs; obtaining a syllable sequence, wherein the syllable sequence corresponds to the same text as the audio stream; determining pronunciation time periods corresponding to the syllables in the syllable sequence according to the syllable sequence, each target phoneme in the phoneme information sequence and the corresponding pronunciation time period; and generating an animation video corresponding to the audio stream according to the pronunciation time period corresponding to the syllable in the syllable sequence and the animation frame sequence corresponding to the syllable. Therefore, the animation video can be automatically generated according to the input text, and the actual application requirements can be met. Moreover, the animation video and the audio stream have strong consistency, the problem of inter-frame jitter does not exist, and the reality and the generalization capability of the animation video are further improved.
Corresponding to the above embodiments of the application method of the phoneme detection model (i.e., the phoneme detection method), the present disclosure also provides a training method for the phoneme detection model.
Fig. 7 is a flowchart illustrating a training method of a phoneme detection model according to a sixth embodiment of the present disclosure.
As shown in fig. 7, the training method of the phoneme detection model may include the following steps:
step 701, obtaining an audio frequency spectrogram corresponding to a sample audio.
In the embodiment of the present disclosure, a sample audio may be obtained, and a spectral feature of the sample audio is extracted to obtain an audio frequency spectrogram.
The method for obtaining the sample audio is not limited, and for example, the sample audio may be obtained from an existing training set, or may be input by manual speech, or may be obtained by performing speech synthesis according to a text input by a user, and the like.
Step 702, a category coding sequence corresponding to the sample audio is obtained, where the category coding sequence is used to indicate phonemes included in a text corresponding to the sample audio.
In the embodiment of the present disclosure, a text corresponding to a sample audio may be obtained, and a category coding sequence is determined according to phonemes of characters included in the text. For example, the sample audio may be an audio with a speech duration of 1 second, a text corresponding to the sample audio may be identified, and the category encoding sequence may be determined according to phonemes of characters included in the text.
For example, when the text corresponding to the sample audio is "nice scenery", the category coding sequence is used to indicate five phonemes "haomei de jing se".
Step 703, performing phoneme detection on the audio spectrogram based on the category coding sequence by using a phoneme detection model, so as to determine a position of at least one phoneme prediction box and a target phoneme to which a spectrum fragment in the at least one phoneme prediction box belongs from the audio spectrogram.
In the embodiment of the disclosure, the audio spectrogram may be input into the phoneme detection model, and the phoneme detection model performs phoneme detection on the audio spectrogram based on the category coding sequence to determine the position of at least one phoneme prediction box from the audio spectrogram and to determine, from the phonemes indicated by the category coding sequence, the target phoneme to which the spectrum fragment in the at least one phoneme prediction box belongs.
Step 704, training a phoneme detection model according to a first difference between a position of the at least one phoneme prediction box and a position of the at least one phoneme detection box labeled on the sample audio, and/or according to a second difference between a target phoneme to which the spectrum fragment in the at least one phoneme prediction box belongs and a labeled phoneme in the at least one phoneme detection box labeled on the sample audio.
In one possible implementation of the embodiment of the present disclosure, the phoneme detection model may be trained according to a first difference between a position of the at least one phoneme prediction box and a position of the at least one phoneme detection box labeled on the sample audio.
For example, a position loss function may be generated according to a first difference between a position of at least one phoneme prediction box and a position of at least one phoneme detection box labeled on the sample audio, and a phoneme detection model may be trained according to the position loss function, so as to minimize a value of the position loss function.
As an example, the position loss function may be an MSE (Mean Squared Error) loss function, i.e., the average of the squares of the differences between the predicted values and the true values. For example, the position loss function MSE may be determined according to equation (1):
$$\mathrm{MSE}=\frac{1}{n}\sum_{i=1}^{n}\bigl(f(x_i)-y_i\bigr)^2 \tag{1}$$
where $n$ may be the number of phoneme prediction boxes or phoneme detection boxes, $f(x)$ may include the position (i.e., the abscissa of the center position) and/or the width of the phoneme prediction box, and $y$ may include the position (i.e., the abscissa of the center position) and/or the width of the phoneme detection box; for example, $f(x_i)$ may indicate the position of the i-th phoneme prediction box, and $y_i$ may indicate the position of the i-th phoneme detection box.
In another possible implementation manner of the embodiment of the present disclosure, the phoneme detection model may be trained according to a second difference between the target phoneme to which the spectrum segment in the at least one phoneme prediction box belongs and the labeled phoneme in the at least one phoneme detection box labeled on the sample audio.
For example, a classification loss function may be generated according to a second difference between a target phoneme to which a spectrum fragment in the at least one phoneme prediction box belongs and a labeled phoneme in the at least one phoneme detection box labeled on the sample audio, and the phoneme detection model may be trained according to the classification loss function, so as to minimize a value of the classification loss function.
As an example, the classification loss function may be a CEL (Cross Entropy Loss) function, and the phoneme detection model may predict a plurality of (for example, 409) predicted phonemes corresponding to the spectrum fragment in each phoneme prediction box when performing phoneme detection, and determine the target phoneme to which the spectrum fragment in the phoneme prediction box belongs from the plurality of predicted phonemes. The classification loss function may thus be generated based on the confidences of the plurality of (e.g., 409) predicted phonemes corresponding to the spectrum fragment within the at least one phoneme prediction box predicted by the phoneme detection model and the confidence of the at least one labeled phoneme. For example, the loss function for each labeled phoneme can be determined according to the following equation (2):
loss(x, class) = -\log \frac{\exp(x[\mathrm{class}])}{\sum_{j} \exp(x[j])} = -x[\mathrm{class}] + \log \sum_{j} \exp(x[j])        (2)
where class refers to the labeled phoneme (or the category to which the labeled phoneme belongs), x refers to the scores output by the phoneme detection model over the predicted phonemes of a prediction box, x[class] refers to the confidence corresponding to the labeled phoneme among the 409 predicted phonemes, x[j] refers to the confidence corresponding to the j-th predicted phoneme, and j ranges over all 409 predicted phonemes (including the labeled phoneme). That is, the CEL loss function may be calculated based on the confidences of the 408 predicted phonemes that do not correspond to the labeled phoneme and the confidence of the 1 predicted phoneme that does.
When the number of labeled phonemes is one, the loss in equation (2) may be used as the classification loss function; when there are a plurality of labeled phonemes, the losses of the labeled phonemes obtained from equation (2) may be weighted and summed to obtain the classification loss function, as in the sketch below.
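A minimal sketch of the classification loss of equation (2), assuming the model outputs a 409-dimensional score vector per prediction box; the weighting scheme over several labeled phonemes follows the description above and the weights themselves are assumptions:

import torch
import torch.nn.functional as F

def phoneme_cel(scores: torch.Tensor, labeled_idx: int) -> torch.Tensor:
    # scores: shape (409,). Equation (2) equals
    # -scores[labeled_idx] + log(sum_j exp(scores[j])),
    # i.e. cross entropy over a single sample.
    return F.cross_entropy(scores.unsqueeze(0), torch.tensor([labeled_idx]))

def classification_loss(per_box_scores, labeled_indices, weights):
    # Weighted sum of the per-labeled-phoneme losses from equation (2).
    return sum(w * phoneme_cel(s, c)
               for s, c, w in zip(per_box_scores, labeled_indices, weights))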
In yet another possible implementation manner of the embodiment of the present disclosure, the phoneme detection model may be trained according to the first difference and the second difference at the same time.
For example, a position loss function may be generated according to the first difference, a classification loss function may be generated according to the second difference, the position loss function and the classification loss function are weighted and summed to obtain a target loss function, and the phoneme detection model is trained according to the target loss function to minimize a value of the target loss function.
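For illustration, a minimal sketch of this weighted sum; the weights alpha and beta are assumptions, since the disclosure does not fix their values:

def target_loss(pos_loss, cls_loss, alpha=1.0, beta=1.0):
    # Weighted sum of the position loss and the classification loss.
    return alpha * pos_loss + beta * cls_loss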
The method for training the phoneme detection model according to the embodiment of the present disclosure includes obtaining a category coding sequence corresponding to a sample audio, where the category coding sequence is used to indicate phonemes included in a text corresponding to the sample audio, and performing phoneme detection on an audio spectrogram corresponding to the sample audio based on the category coding sequence by using a phoneme detection model to determine a position of at least one phoneme prediction box and a target phoneme to which a spectrum fragment in the phoneme prediction box belongs from the audio spectrogram, and then training the phoneme detection model according to a first difference between the position of the at least one phoneme prediction box and a position of at least one phoneme detection box labeled on the sample audio, and/or according to a second difference between the target phoneme to which the spectrum fragment in the phoneme prediction box belongs and a labeled phoneme in the at least one phoneme detection box labeled on the sample audio. Therefore, in the model training stage, the class coding sequence is added for training, so that the phoneme detection model can pay more attention to the phonemes indicated by the class coding sequence, the attention degree to the phonemes not indicated by the class coding sequence is reduced, the phoneme detection model can only carry out phoneme prediction in the phonemes indicated by the class coding sequence, the condition that similar phonemes cause model misrecognition can be avoided, and the accuracy and the reliability of the phoneme detection result are improved.
In order to clearly illustrate how the phoneme detection is performed on the audio frequency spectrogram in the embodiment of the present disclosure, the present disclosure further provides a training method of a phoneme detection model.
Fig. 8 is a flowchart illustrating a training method of a phoneme detection model according to a seventh embodiment of the present disclosure.
As shown in fig. 8, the training method of the phoneme detection model may include the following steps:
Step 801, an audio frequency spectrogram corresponding to a sample audio is obtained.
Step 802, a category coding sequence corresponding to the sample audio is obtained, where the category coding sequence is used to indicate phonemes included in a text corresponding to the sample audio.
The execution process of steps 801 to 802 may refer to the execution process of any embodiment of the present disclosure, and is not described herein again.
Step 803, performing phoneme detection on the audio spectrogram based on the category coding sequence by using the first prediction layer in the phoneme detection model to determine a position of at least one phoneme prediction box from the audio spectrogram, and determining a plurality of candidate phonemes corresponding to a spectrum segment in the at least one phoneme prediction box from phonemes indicated by the category coding sequence.
In the embodiment of the present disclosure, an audio spectrogram may be input into a first prediction layer in a phoneme detection model, the audio spectrogram is subjected to phoneme detection by the first prediction layer to determine a position of at least one phoneme prediction box from the audio spectrogram, and a plurality of candidate phonemes corresponding to a spectrum segment in the at least one phoneme prediction box are determined from phonemes indicated by a category coding sequence.
In a possible implementation manner of the embodiment of the present disclosure, in order to improve accuracy of a determination result of a plurality of candidate phonemes corresponding to a spectrum segment in each phoneme prediction box, a first prediction layer may be used to perform phoneme detection on an audio spectrogram, so as to obtain a position of at least one phoneme prediction box and a plurality of (for example, 409) predicted phonemes corresponding to a spectrum segment in the at least one phoneme prediction box.
As a possible implementation manner, a single prediction branch in the first prediction layer may be adopted to simultaneously perform regression prediction of phonemes on the audio spectrogram to obtain the position of at least one phoneme prediction box, and category prediction of phonemes on the audio spectrogram to obtain a plurality of predicted phonemes (or the categories to which the plurality of predicted phonemes belong) corresponding to the spectrum segment in the at least one phoneme prediction box.
As another possible implementation manner, classification and regression may be decoupled so that the model can focus on the feature expression of each task, thereby enhancing the feature expression capability of the model. In the present disclosure, a first prediction branch in the first prediction layer may be adopted to perform regression prediction of phonemes on the audio spectrogram to obtain the position of at least one phoneme prediction box, and a second prediction branch in the first prediction layer may be adopted to perform category prediction of phonemes on the audio spectrogram to obtain a plurality of predicted phonemes (or the categories to which the plurality of predicted phonemes belong) corresponding to the spectrum segment in the at least one phoneme prediction box, where the first prediction branch is different from the second prediction branch, as in the sketch below.
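A sketch of such a decoupled first prediction layer, assuming simple linear heads over a shared feature; the layer sizes and the number of boxes are assumptions for illustration, and only the 409 phoneme categories follow the example above:

import torch.nn as nn

class FirstPredictionLayer(nn.Module):
    def __init__(self, feat_dim=256, num_phonemes=409, num_boxes=8):
        super().__init__()
        # First prediction branch: regression of (center_x, width) per box.
        self.regression = nn.Linear(feat_dim, num_boxes * 2)
        # Second prediction branch: phoneme category scores per box.
        self.classification = nn.Linear(feat_dim, num_boxes * num_phonemes)

    def forward(self, features):
        return self.regression(features), self.classification(features)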
Therefore, the plurality of predicted phonemes corresponding to the spectrum fragment in the at least one phoneme prediction box can be screened according to the phonemes indicated by the category coding sequence, so as to retain the candidate phonemes matching the phonemes indicated by the category coding sequence.
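The screening step itself reduces to an intersection with the prior; a sketch, assuming phonemes are already represented by their category IDs:

def screen_candidates(predicted_phonemes, category_coding_sequence):
    # Retain only predicted phonemes indicated by the category coding
    # sequence, i.e. the phonemes contained in the text.
    allowed = set(category_coding_sequence)
    return [p for p in predicted_phonemes if p in allowed]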
Step 804, performing confidence prediction on a plurality of candidate phonemes corresponding to the spectrum segment in the at least one phoneme prediction box by using a second prediction layer in the phoneme detection model to obtain confidence degrees of the plurality of candidate phonemes corresponding to the at least one phoneme prediction box.
In this embodiment of the present disclosure, for each phoneme prediction box, a second prediction layer (for example, softmax layer) in the phoneme detection model may be used to perform confidence prediction on a plurality of candidate phonemes corresponding to a spectrum segment in the phoneme prediction box, so as to obtain a confidence corresponding to each candidate phoneme.
Step 805, according to the confidence degrees of a plurality of candidate phonemes corresponding to at least one phoneme prediction box, screening out a target phoneme to which a spectrum fragment in at least one phoneme prediction box belongs from each candidate phoneme.
In the embodiment of the present disclosure, for each phoneme prediction box, a target phoneme may be screened from each candidate phoneme according to the confidence degree corresponding to each corresponding candidate phoneme, for example, the confidence degree of the target phoneme may be greater than the confidence degrees of other candidate phonemes, so that the target phoneme may be used as a phoneme to which a spectrum segment in the phoneme prediction box belongs.
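A sketch of steps 804 and 805 together, assuming the second prediction layer is a softmax over the candidate phonemes of one prediction box:

import torch

def pick_target_phoneme(candidate_scores: torch.Tensor, candidate_ids):
    # Step 804: confidence prediction via softmax (second prediction layer).
    confidences = torch.softmax(candidate_scores, dim=-1)
    # Step 805: the target phoneme is the candidate with the highest confidence.
    best = int(torch.argmax(confidences))
    return candidate_ids[best], float(confidences[best])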
Step 806, training a phoneme detection model according to a first difference between a position of the at least one phoneme prediction box and a position of the at least one phoneme detection box labeled on the sample audio, and/or according to a second difference between a target phoneme to which the spectrum fragment in the at least one phoneme prediction box belongs and a labeled phoneme in the at least one phoneme detection box labeled on the sample audio.
The execution process of step 806 may refer to the execution process of the above embodiments, and is not described herein again.
According to the training method of the phoneme detection model in the embodiment of the present disclosure, the phoneme detection model performs phoneme detection on the audio spectrogram based on the prior information of the phonemes contained in the text corresponding to the sample audio; since the phoneme detection model only performs phoneme prediction among the phonemes contained in the text, model misrecognition caused by similar phonemes can be avoided, thereby improving the accuracy and reliability of the phoneme detection result. In addition, the target phoneme to which the spectrum fragment in each phoneme prediction box belongs is screened from the candidate phonemes according to the confidence corresponding to each candidate phoneme, which can improve the accuracy and reasonableness of the target phoneme determination result.
Corresponding to the phoneme detection method provided in the embodiments of fig. 1 to 6, the present disclosure also provides a phoneme detection apparatus, and since the phoneme detection apparatus provided in the embodiments of the present disclosure corresponds to the phoneme detection method provided in the embodiments of fig. 1 to 6, the embodiments of the phoneme detection method are also applicable to the phoneme detection apparatus provided in the embodiments of the present disclosure, and will not be described in detail in the embodiments of the present disclosure.
Fig. 9 is a schematic structural diagram of a phoneme detection apparatus according to an eighth embodiment of the present disclosure.
As shown in fig. 9, the phoneme detecting apparatus 900 may include: a first obtaining module 910, a second obtaining module 920, a detecting module 930, and a determining module 940.
The first obtaining module 910 is configured to obtain an audio frequency spectrogram corresponding to at least one audio segment.
The second obtaining module 920 is configured to obtain a category coding sequence of the target text segment corresponding to the audio segment, where the category coding sequence is used to indicate phonemes included in the target text segment.
A detecting module 930, configured to perform phoneme detection on the audio spectrogram based on the category coding sequence by using a phoneme detection model, so as to determine at least one phoneme prediction box from the audio spectrogram, and determine a plurality of candidate phonemes corresponding to the spectrum segment in the at least one phoneme prediction box from the phonemes indicated by the category coding sequence.
The determining module 940 is configured to determine, from a plurality of candidate phonemes corresponding to the spectrum segment in the at least one phoneme prediction box, a target phoneme to which the spectrum segment in the at least one phoneme prediction box belongs.
In a possible implementation manner of the embodiment of the present disclosure, the detecting module 930 is specifically configured to: performing phoneme detection on the audio frequency spectrogram by adopting a phoneme detection model to obtain the position of at least one phoneme prediction box and a plurality of predicted phonemes corresponding to the spectrum fragment in the at least one phoneme prediction box; and screening a plurality of predicted phonemes corresponding to the spectrum fragment in the at least one phoneme prediction box according to the phonemes indicated by the category coding sequence so as to retain candidate phonemes matched with the phonemes indicated by the category coding sequence.
In a possible implementation manner of the embodiment of the present disclosure, the first obtaining module 910 is specifically configured to: acquiring an input text, and performing voice synthesis on the input text to obtain an audio stream; segmenting the audio stream according to a set time interval to obtain at least one audio segment; and extracting the spectral characteristics of the audio clip to obtain an audio frequency spectrogram.
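A sketch of this pipeline, assuming a fixed time interval and a mel spectrogram via librosa; the disclosure does not name a feature extractor, so the library choice is an assumption:

import librosa

def audio_spectrograms(audio, sr, segment_seconds=1.0):
    # Segment the synthesized audio stream at a set time interval, then
    # extract spectral features per segment to obtain its spectrogram.
    step = int(segment_seconds * sr)
    segments = [audio[i:i + step] for i in range(0, len(audio), step)]
    return [librosa.feature.melspectrogram(y=seg, sr=sr) for seg in segments]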
In a possible implementation manner of the embodiment of the present disclosure, the second obtaining module 920 is specifically configured to: determining the interception length according to the number of the audio segments and the number of characters contained in the input text; intercepting the input text into at least one text segment according to the interception length, where the number of the text segments is the same as that of the audio segments; determining, according to the position of the audio segment in the audio stream, a target text segment matching the position from the at least one text segment; and generating a category coding sequence according to the phoneme of each character in the target text segment.
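A sketch of the interception logic, assuming the interception length is the character count divided by the number of audio segments, rounded up (the rounding rule is an assumption; short texts may yield fewer segments):

import math

def intercept_text(input_text, num_audio_segments):
    # Interception length from the character count and the segment count.
    length = math.ceil(len(input_text) / num_audio_segments)
    return [input_text[i:i + length] for i in range(0, len(input_text), length)]

def target_text_segment(text_segments, audio_position):
    # The audio segment at position i in the stream matches text segment i.
    return text_segments[audio_position]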
In a possible implementation manner of the embodiment of the present disclosure, there are a plurality of audio segments, and the phoneme detecting apparatus 900 may further include:
a generating module, configured to generate a phoneme information sequence according to a position of at least one phoneme prediction box in the multiple audio segments and a target phoneme to which the spectrum segment in the at least one phoneme prediction box belongs, where phoneme information in the phoneme information sequence includes: each target phoneme and a corresponding pronunciation time segment.
A third obtaining module is configured to obtain a syllable sequence, where the syllable sequence corresponds to the same text as the audio stream.
The determining module 940 is further configured to determine a pronunciation time period corresponding to a syllable in the syllable sequence according to the syllable sequence, each target phoneme in the phoneme information sequence, and the corresponding pronunciation time period.
The generating module is further configured to generate the animation video corresponding to the audio stream according to the pronunciation time period corresponding to each syllable in the syllable sequence and the animation frame sequence corresponding to the syllable.
In a possible implementation manner of the embodiment of the present disclosure, the generating module is specifically configured to: aiming at each audio fragment, generating a phoneme information subsequence according to the position of at least one phoneme prediction box in the audio fragment and a target phoneme to which a spectrum fragment in the at least one phoneme prediction box belongs; and combining the phoneme information subsequences according to the positions of the audio fragments in the audio stream to obtain a phoneme information sequence.
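A sketch of the combination step, assuming each phoneme information subsequence is a list of (target phoneme, pronunciation time period) pairs keyed by its audio segment's position in the stream:

def merge_phoneme_subsequences(subsequences_by_position):
    # Concatenate subsequences in the order of their audio segments'
    # positions in the audio stream to form the phoneme information sequence.
    sequence = []
    for _, sub in sorted(subsequences_by_position.items()):
        sequence.extend(sub)
    return sequence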
The phoneme detection device of the embodiment of the disclosure obtains a category coding sequence of a target text segment corresponding to at least one audio segment, wherein the category coding sequence is used for indicating phonemes contained in the target text segment, and performs phoneme detection on an audio spectrogram corresponding to an audio frequency band based on the category coding sequence by using a phoneme detection model to determine at least one phoneme prediction box from the audio spectrogram, and determine a plurality of candidate phonemes corresponding to a spectrum segment in the at least one phoneme prediction box from the phonemes indicated by the category coding sequence, and then determine a target phoneme to which the spectrum segment in the at least one phoneme prediction box belongs from the plurality of candidate phonemes corresponding to the spectrum segment in the at least one phoneme prediction box. Therefore, in the model prediction stage, the phoneme detection model carries out phoneme detection on the audio frequency spectrogram based on the prior information of the phonemes contained in the target text segment corresponding to the audio segment, and because the phoneme detection model carries out phoneme prediction only in the phonemes contained in the target text segment, the condition of model misrecognition caused by similar phonemes can be avoided, and thus the accuracy and reliability of the phoneme detection result are improved.
Since the training apparatus of the phoneme detection model provided in the embodiments of the present disclosure corresponds to the training method of the phoneme detection model provided in the embodiments of fig. 7 to 8, the embodiments of the training method of the phoneme detection model are also applicable to the training apparatus of the phoneme detection model provided in the embodiments of the present disclosure, and will not be described in detail in the embodiments of the present disclosure.
Fig. 10 is a schematic structural diagram of a training apparatus for a phoneme detection model according to a ninth embodiment of the present disclosure.
As shown in fig. 10, the training apparatus 1000 for a phoneme detection model may include: a first obtaining module 1010, a second obtaining module 1020, a detecting module 1030, and a training module 1040.
The first obtaining module 1010 is configured to obtain an audio frequency spectrogram corresponding to a sample audio.
The second obtaining module 1020 is configured to obtain a category coding sequence corresponding to the sample audio, where the category coding sequence is used to indicate phonemes included in a text corresponding to the sample audio.
A detecting module 1030, configured to perform phoneme detection on the audio spectrogram based on the category coding sequence by using a phoneme detection model, so as to determine, from the audio spectrogram, a position of the at least one phoneme prediction box and a target phoneme to which the spectrum fragment in the at least one phoneme prediction box belongs.
The training module 1040 is configured to train the phoneme detection model according to a first difference between a position of the at least one phoneme prediction box and a position of the at least one phoneme detection box labeled on the sample audio, and/or according to a second difference between a target phoneme to which the spectrum fragment in the at least one phoneme prediction box belongs and a labeled phoneme in the at least one phoneme detection box labeled on the sample audio.
In a possible implementation manner of the embodiment of the present disclosure, the detecting module 1030 is specifically configured to: performing phoneme detection on the audio frequency spectrogram based on the category coding sequence by adopting a first prediction layer in a phoneme detection model so as to determine the position of at least one phoneme prediction box in the audio frequency spectrogram and determine a plurality of candidate phonemes corresponding to the spectrum fragment in the at least one phoneme prediction box from the phonemes indicated by the category coding sequence; performing confidence prediction on a plurality of candidate phonemes corresponding to the spectrum fragment in at least one phoneme prediction box by adopting a second prediction layer in the phoneme detection model to obtain the confidence of the plurality of candidate phonemes corresponding to the at least one phoneme prediction box; and screening a target phoneme to which the spectrum fragment in the at least one phoneme prediction box belongs from each candidate phoneme according to the confidence degrees of the plurality of candidate phonemes corresponding to the at least one phoneme prediction box.
In a possible implementation manner of the embodiment of the present disclosure, the detecting module 1030 is specifically configured to: performing phoneme detection on the audio frequency spectrogram by adopting a first prediction layer to obtain the position of at least one phoneme prediction box and a plurality of predicted phonemes corresponding to the spectrum fragment in the at least one phoneme prediction box; and screening a plurality of predicted phonemes corresponding to the spectrum fragment in the at least one phoneme prediction box according to the phonemes indicated by the category coding sequence so as to retain candidate phonemes matched with the phonemes indicated by the category coding sequence.
The training device of the phoneme detection model according to the embodiment of the present disclosure obtains a category coding sequence corresponding to a sample audio, where the category coding sequence is used to indicate phonemes included in a text corresponding to the sample audio, and performs phoneme detection on an audio spectrogram corresponding to the sample audio based on the category coding sequence by using a phoneme detection model to determine a position of at least one phoneme prediction box and a target phoneme to which a spectrum fragment in the phoneme prediction box belongs from the audio spectrogram, and then trains the phoneme detection model according to a first difference between the position of the at least one phoneme prediction box and a position of at least one phoneme detection box labeled on the sample audio, and/or according to a second difference between the target phoneme to which the spectrum fragment in the phoneme prediction box belongs and a labeled phoneme in the at least one phoneme detection box labeled on the sample audio. Therefore, in the model training stage, the class coding sequence is added for training, so that the phoneme detection model can pay more attention to the phonemes indicated by the class coding sequence, the attention degree to the phonemes not indicated by the class coding sequence is reduced, the phoneme detection model can only carry out phoneme prediction in the phonemes indicated by the class coding sequence, the condition that similar phonemes cause model misrecognition can be avoided, and the accuracy and the reliability of the phoneme detection result are improved.
To implement the above embodiments, the present disclosure also provides an electronic device, which may include at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a phoneme detection method or a training method set forth in any of the above embodiments of the present disclosure.
To achieve the above embodiments, the present disclosure also provides a non-transitory computer-readable storage medium storing computer instructions for causing a computer to execute a phoneme detection method or a training method proposed by any of the above embodiments of the present disclosure.
In order to implement the above embodiments, the present disclosure also provides a computer program product, which includes a computer program, and when the computer program is executed by a processor, the computer program implements the phoneme detection method or the training method proposed by any of the above embodiments of the present disclosure.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 11 shows a schematic block diagram of an example electronic device that may be used to implement any of the embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not intended to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 11, the device 1100 includes a computing unit 1101, which can perform various appropriate actions and processes in accordance with a computer program stored in a ROM (Read-Only Memory) 1102 or a computer program loaded from a storage unit 1108 into a RAM (Random Access Memory) 1103. In the RAM 1103, various programs and data necessary for the operation of the device 1100 may also be stored. The computing unit 1101, the ROM 1102, and the RAM 1103 are connected to each other by a bus 1104. An I/O (Input/Output) interface 1105 is also connected to the bus 1104.
A number of components in the device 1100 are connected to the I/O interface 1105, including: an input unit 1106 such as a keyboard, mouse, or the like; an output unit 1107 such as various types of displays, speakers, and the like; a storage unit 1108 such as a magnetic disk, optical disk, or the like; and a communication unit 1109 such as a network card, a modem, a wireless communication transceiver, and the like. The communication unit 1109 allows the device 1100 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 1101 can be a variety of general purpose and/or special purpose processing components having processing and computing capabilities. Some examples of the computing Unit 1101 include, but are not limited to, a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), various dedicated AI (Artificial Intelligence) computing chips, various computing units running machine learning model algorithms, a DSP (Digital Signal Processor), and any suitable processor, controller, microcontroller, and the like. The computing unit 1101 performs the respective methods and processes described above, such as the phoneme detection method or the training method described above. For example, in some embodiments, the phoneme detection method or training method described above may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 1108. In some embodiments, part or all of the computer program may be loaded and/or installed onto device 1100 via ROM 1102 and/or communication unit 1109. When the computer program is loaded into RAM 1103 and executed by the computing unit 1101, one or more steps of the phoneme detection method or training method described above may be performed. Alternatively, in other embodiments, the computing unit 1101 may be configured to perform the above-described phoneme detection method or training method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuitry, FPGAs (Field Programmable Gate Arrays), ASICs (Application-Specific Integrated Circuits), ASSPs (Application Specific Standard Products), SOCs (System On Chip), CPLDs (Complex Programmable Logic Devices), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a RAM, a ROM, an EPROM (erasable Programmable Read-Only-Memory) or flash Memory, an optical fiber, a CD-ROM (Compact Disc Read-Only-Memory), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a Display device (e.g., a CRT (Cathode Ray Tube) or LCD (Liquid Crystal Display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: LAN (Local Area Network), WAN (Wide Area Network), internet, and blockchain Network.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The Server may be a cloud Server, which is also called a cloud computing Server or a cloud host, and is a host product in a cloud computing service system, so as to solve the defects of high management difficulty and weak service expansibility in a conventional physical host and a VPS (Virtual Private Server). The server may also be a server of a distributed system, or a server incorporating a blockchain.
It should be noted that artificial intelligence is the discipline of studying how to make a computer simulate certain human thinking processes and intelligent behaviors (such as learning, reasoning, thinking, planning, etc.), and it has both hardware-level and software-level technologies. Artificial intelligence hardware technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, and the like; artificial intelligence software technologies mainly include computer vision technology, speech recognition technology, natural language processing technology, machine learning/deep learning technology, big data processing technology, knowledge graph technology, and the like.
According to the technical scheme of the embodiment of the disclosure, a category coding sequence of a target text segment corresponding to at least one audio segment is obtained, wherein the category coding sequence is used for indicating phonemes contained in the target text segment, a phoneme detection model is adopted, phoneme detection is carried out on an audio spectrogram corresponding to the audio frequency segment based on the category coding sequence, so as to determine at least one phoneme prediction box from the audio spectrogram, a plurality of candidate phonemes corresponding to the spectrum segment in the at least one phoneme prediction box are determined from the phonemes indicated by the category coding sequence, and then a target phoneme to which the spectrum segment in the at least one phoneme prediction box belongs is determined from a plurality of candidate phonemes corresponding to the spectrum segment in the at least one phoneme prediction box. Therefore, in the model prediction stage, the phoneme detection model performs phoneme detection on the audio frequency spectrogram based on the prior information of the phonemes contained in the target text segment corresponding to the audio segment, and because the phoneme detection model performs phoneme prediction only in the phonemes contained in the target text segment, the situation of model misrecognition caused by similar phonemes can be avoided, so that the accuracy and the reliability of a phoneme detection result are improved.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel or sequentially or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (20)

1. A method of phoneme detection, the method comprising:
acquiring an audio frequency spectrogram corresponding to at least one audio segment;
acquiring a category coding sequence of a target text segment corresponding to an audio segment, wherein the category coding sequence is used for indicating phonemes contained in the target text segment;
performing phoneme detection on the audio frequency spectrogram based on the category coding sequence by adopting a phoneme detection model so as to determine at least one phoneme prediction box from the audio frequency spectrogram, and determining a plurality of candidate phonemes corresponding to a spectrum fragment in the at least one phoneme prediction box from the phonemes indicated by the category coding sequence;
and determining, from a plurality of candidate phonemes corresponding to the spectrum fragment in the at least one phoneme prediction box, a target phoneme to which the spectrum fragment in the at least one phoneme prediction box belongs.
2. The method of claim 1, wherein the performing phoneme detection on the audio spectrogram based on the category coding sequence by using a phoneme detection model to determine at least one phoneme prediction box from the audio spectrogram, and determining a plurality of candidate phonemes corresponding to a spectrum fragment in the at least one phoneme prediction box from phonemes indicated by the category coding sequence comprises:
performing phoneme detection on the audio frequency spectrogram by using the phoneme detection model to obtain the position of at least one phoneme prediction box and a plurality of predicted phonemes corresponding to the spectrum fragment in the at least one phoneme prediction box;
and screening a plurality of predicted phonemes corresponding to the spectrum fragment in the at least one phoneme prediction box according to the phonemes indicated by the category coding sequence so as to retain candidate phonemes matched with the phonemes indicated by the category coding sequence.
3. The method of claim 1, wherein the obtaining an audio spectrogram corresponding to at least one audio segment comprises:
acquiring an input text, and performing voice synthesis on the input text to obtain an audio stream;
segmenting the audio stream according to a set time interval to obtain at least one audio segment;
and extracting the spectral characteristics of the audio frequency segments to obtain the audio frequency spectrogram.
4. The method according to claim 3, wherein the obtaining of the category coding sequence of the target text segment corresponding to the audio segment includes:
determining an interception length according to the number of the audio segments and the number of characters contained in the input text;
intercepting the input text into at least one text segment according to the interception length, wherein the number of the text segments is the same as that of the audio segments;
according to the position of the audio segment in the audio stream, determining a target text segment matched with the position from at least one text segment;
and generating a category coding sequence according to the phoneme of each character in the target text fragment.
5. The method of claim 3 or 4, wherein there are a plurality of audio segments, the method further comprising:
generating a phoneme information sequence according to the position of at least one phoneme prediction box in the plurality of audio fragments and a target phoneme to which the spectrum fragment in the at least one phoneme prediction box belongs, wherein phoneme information in the phoneme information sequence comprises: each target phoneme and a corresponding pronunciation time period;
obtaining a syllable sequence, wherein the syllable sequence corresponds to the same text as the audio stream;
determining pronunciation time periods corresponding to syllables in the syllable sequence according to the syllable sequence, each target phoneme in the phoneme information sequence and the corresponding pronunciation time period;
and generating an animation video corresponding to the audio stream according to the pronunciation time period corresponding to the syllable in the syllable sequence and the animation frame sequence corresponding to the syllable.
6. The method of claim 5, wherein the generating a phoneme information sequence according to the position of at least one phoneme prediction box in the plurality of audio segments and the target phoneme to which the spectral fragment belongs in the at least one phoneme prediction box comprises:
generating a phoneme information subsequence according to the position of at least one phoneme prediction box in each audio fragment and a target phoneme to which a spectrum fragment in the at least one phoneme prediction box belongs;
and combining the phoneme information subsequences according to the positions of the audio fragments in the audio stream to obtain the phoneme information sequence.
7. A method of training a phoneme detection model, the method comprising:
acquiring an audio frequency spectrogram corresponding to a sample audio;
acquiring a category coding sequence corresponding to the sample audio, wherein the category coding sequence is used for indicating phonemes contained in a text corresponding to the sample audio;
performing phoneme detection on the audio frequency spectrogram based on the category coding sequence by adopting a phoneme detection model so as to determine the position of at least one phoneme prediction box and a target phoneme to which a spectrum fragment in the at least one phoneme prediction box belongs from the audio frequency spectrogram;
and training the phoneme detection model according to a first difference between the position of the at least one phoneme prediction box and the position of the at least one phoneme detection box labeled on the sample audio, and/or according to a second difference between a target phoneme to which a spectrum fragment in the at least one phoneme prediction box belongs and a labeled phoneme in the at least one phoneme detection box labeled on the sample audio.
8. The method of claim 7, wherein the performing phoneme detection on the audio spectrogram based on the category-coded sequence by using a phoneme detection model to determine a position of at least one phoneme prediction box and a target phoneme to which a spectrum fragment in the at least one phoneme prediction box belongs from the audio spectrogram comprises:
performing phoneme detection on the audio frequency spectrogram based on the category coding sequence by adopting a first prediction layer in the phoneme detection model so as to determine the position of at least one phoneme prediction box in the audio frequency spectrogram and determine a plurality of candidate phonemes corresponding to the spectrum fragments in the at least one phoneme prediction box from the phonemes indicated by the category coding sequence;
performing confidence prediction on a plurality of candidate phonemes corresponding to the spectrum fragment in the at least one phoneme prediction box by using a second prediction layer in the phoneme detection model to obtain confidence degrees of the plurality of candidate phonemes corresponding to the at least one phoneme prediction box;
and screening out a target phoneme to which a spectrum fragment in the at least one phoneme prediction box belongs from each candidate phoneme according to the confidence degrees of the candidate phonemes corresponding to the at least one phoneme prediction box.
9. The method of claim 8, wherein the performing phoneme detection on the audio spectrogram based on the category coding sequence by using the first prediction layer in the phoneme detection model to determine a position of at least one phoneme prediction box from the audio spectrogram, and determining a plurality of candidate phonemes corresponding to a spectrum fragment in the at least one phoneme prediction box from the phonemes indicated by the category coding sequence comprises:
performing phoneme detection on the audio frequency spectrogram by using the first prediction layer to obtain the position of the at least one phoneme prediction box and a plurality of predicted phonemes corresponding to the spectrum fragment in the at least one phoneme prediction box;
and screening a plurality of predicted phonemes corresponding to the spectrum fragment in the at least one phoneme prediction box according to the phonemes indicated by the category coding sequence so as to retain candidate phonemes matched with the phonemes indicated by the category coding sequence.
10. A phoneme detection apparatus, the apparatus comprising:
the first acquisition module is used for acquiring an audio frequency spectrogram corresponding to at least one audio segment;
the second obtaining module is used for obtaining a category coding sequence of a target text segment corresponding to the audio segment, wherein the category coding sequence is used for indicating phonemes contained in the target text segment;
a detection module, configured to perform phoneme detection on the audio spectrogram based on the category coding sequence by using a phoneme detection model, so as to determine at least one phoneme prediction box from the audio spectrogram, and determine a plurality of candidate phonemes corresponding to a spectrum segment in the at least one phoneme prediction box from phonemes indicated by the category coding sequence;
and the determining module is used for determining a target phoneme to which the spectrum fragment in the at least one phoneme prediction box belongs from a plurality of candidate phonemes corresponding to the spectrum fragment in the at least one phoneme prediction box.
11. The apparatus according to claim 10, wherein the detection module is specifically configured to:
performing phoneme detection on the audio frequency spectrogram by using the phoneme detection model to obtain the position of at least one phoneme prediction box and a plurality of predicted phonemes corresponding to the spectrum fragment in the at least one phoneme prediction box;
and screening a plurality of predicted phonemes corresponding to the spectrum fragment in the at least one phoneme prediction box according to the phonemes indicated by the category coding sequence so as to retain candidate phonemes matched with the phonemes indicated by the category coding sequence.
12. The apparatus according to claim 10, wherein the first obtaining module is specifically configured to:
acquiring an input text, and performing voice synthesis on the input text to obtain an audio stream;
segmenting the audio stream according to a set time interval to obtain at least one audio segment;
and extracting the spectral characteristics of the audio frequency segments to obtain the audio frequency spectrogram.
13. The apparatus according to claim 12, wherein the second obtaining module is specifically configured to:
determining an interception length according to the number of the audio segments and the number of characters contained in the input text;
intercepting the input text into at least one text segment according to the interception length, wherein the number of the text segments is the same as that of the audio segments;
according to the position of the audio segment in the audio stream, determining a target text segment matched with the position from at least one text segment;
and generating a category coding sequence according to the phoneme of each character in the target text segment.
14. The apparatus according to claim 12 or 13, wherein there are a plurality of audio segments, the apparatus further comprising:
a generating module, configured to generate a phoneme information sequence according to a position of at least one phoneme prediction box in the plurality of audio segments and a target phoneme to which a spectrum segment in the at least one phoneme prediction box belongs, where phoneme information in the phoneme information sequence includes: each target phoneme and a corresponding pronunciation time period;
a third obtaining module, configured to obtain a syllable sequence, where the syllable sequence corresponds to the same text as the audio stream;
the determining module is further configured to determine a pronunciation time period corresponding to a syllable in the syllable sequence according to the syllable sequence, each target phoneme in the phoneme information sequence, and the corresponding pronunciation time period;
the generating module is further configured to generate an animation video corresponding to the audio stream according to the pronunciation time period corresponding to the syllable in the syllable sequence and the animation frame sequence corresponding to the syllable.
15. The apparatus according to claim 14, wherein the generating module is specifically configured to:
generating a phoneme information subsequence according to the position of at least one phoneme prediction box in each audio fragment and a target phoneme to which a spectrum fragment in the at least one phoneme prediction box belongs;
and combining the phoneme information subsequences according to the positions of the audio fragments in the audio stream to obtain the phoneme information sequence.
16. An apparatus for training a phoneme detection model, the apparatus comprising:
the first acquisition module is used for acquiring an audio frequency spectrogram corresponding to the sample audio;
the second obtaining module is configured to obtain a category coding sequence corresponding to the sample audio, where the category coding sequence is used to indicate phonemes included in a text corresponding to the sample audio;
a detection module, configured to perform phoneme detection on the audio spectrogram based on the category coding sequence by using a phoneme detection model, so as to determine, from the audio spectrogram, a position of at least one phoneme prediction box and a target phoneme to which a spectrum segment in the at least one phoneme prediction box belongs;
and the training module is used for training the phoneme detection model according to a first difference between the position of the at least one phoneme prediction box and the position of the at least one phoneme detection box labeled on the sample audio and/or according to a second difference between a target phoneme to which the spectrum fragment in the at least one phoneme prediction box belongs and a labeled phoneme in the at least one phoneme detection box labeled on the sample audio.
17. The apparatus of claim 16, wherein the detection module is specifically configured to:
performing phoneme detection on the audio frequency spectrogram based on the category coding sequence by adopting a first prediction layer in the phoneme detection model so as to determine the position of at least one phoneme prediction box in the audio frequency spectrogram and determine a plurality of candidate phonemes corresponding to the spectrum fragments in the at least one phoneme prediction box from the phonemes indicated by the category coding sequence;
performing confidence prediction on the candidate phonemes corresponding to the spectrum fragment in the at least one phoneme prediction box by using a second prediction layer in the phoneme detection model to obtain confidence degrees of the candidate phonemes corresponding to the at least one phoneme prediction box;
and screening out a target phoneme to which a spectrum fragment in the at least one phoneme prediction box belongs from each candidate phoneme according to the confidence degrees of the candidate phonemes corresponding to the at least one phoneme prediction box.
18. The apparatus according to claim 17, wherein the detection module is specifically configured to:
performing phoneme detection on the audio frequency spectrogram by using the first prediction layer to obtain the position of the at least one phoneme prediction box and a plurality of predicted phonemes corresponding to the spectrum fragment in the at least one phoneme prediction box;
and screening a plurality of predicted phonemes corresponding to the spectrum fragment in the at least one phoneme prediction box according to the phonemes indicated by the category coding sequence so as to retain candidate phonemes matched with the phonemes indicated by the category coding sequence.
19. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the phoneme detection method of any one of claims 1 to 6 or the training method of any one of claims 7 to 9.
20. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the phoneme detection method of any one of claims 1 to 6 or the training method of any one of claims 7 to 9.
CN202111404820.3A 2021-11-24 2021-11-24 Phoneme detection method and device, training method and device, equipment and medium Active CN114267376B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111404820.3A CN114267376B (en) 2021-11-24 2021-11-24 Phoneme detection method and device, training method and device, equipment and medium


Publications (2)

Publication Number Publication Date
CN114267376A (en) 2022-04-01
CN114267376B (en) 2022-10-18

Family

ID=80825457

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111404820.3A Active CN114267376B (en) 2021-11-24 2021-11-24 Phoneme detection method and device, training method and device, equipment and medium

Country Status (1)

Country Link
CN (1) CN114267376B (en)

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3266819B2 (en) * 1996-07-30 2002-03-18 株式会社エイ・ティ・アール人間情報通信研究所 Periodic signal conversion method, sound conversion method, and signal analysis method
JP2002358092A (en) * 2001-06-01 2002-12-13 Sony Corp Voice synthesizing system
CN111798832A (en) * 2019-04-03 2020-10-20 北京京东尚科信息技术有限公司 Speech synthesis method, apparatus and computer-readable storage medium
CN111159464B (en) * 2019-12-26 2023-12-15 腾讯科技(深圳)有限公司 Audio clip detection method and related equipment
CN112420016B (en) * 2020-11-20 2022-06-03 四川长虹电器股份有限公司 Method and device for aligning synthesized voice and text and computer storage medium
CN112887789B (en) * 2021-01-22 2023-02-21 北京百度网讯科技有限公司 Video generation model construction method, video generation device, video generation equipment and video generation medium
CN113450765A (en) * 2021-07-29 2021-09-28 平安科技(深圳)有限公司 Speech synthesis method, apparatus, device and storage medium

Also Published As

Publication number Publication date
CN114267376A (en) 2022-04-01

Similar Documents

Publication Publication Date Title
CN112466288B (en) Voice recognition method and device, electronic equipment and storage medium
CN113129870B (en) Training method, device, equipment and storage medium of speech recognition model
CN113033534A (en) Method and device for establishing bill type identification model and identifying bill type
CN113361578B (en) Training method and device for image processing model, electronic equipment and storage medium
CN112887789B (en) Video generation model construction method, video generation device, video generation equipment and video generation medium
CN114360557B (en) Voice tone conversion method, model training method, device, equipment and medium
CN114895817B (en) Interactive information processing method, network model training method and device
CN114863437B (en) Text recognition method and device, electronic equipment and storage medium
CN114282670A (en) Neural network model compression method, device and storage medium
CN114141228B (en) Training method of speech synthesis model, speech synthesis method and device
CN113706669B (en) Animation synthesis method and device, electronic equipment and storage medium
CN113827240B (en) Emotion classification method, training device and training equipment for emotion classification model
CN114663556A (en) Data interaction method, device, equipment, storage medium and program product
CN114267375B (en) Phoneme detection method and device, training method and device, equipment and medium
CN114267376B (en) Phoneme detection method and device, training method and device, equipment and medium
CN115510860A (en) Text sentiment analysis method and device, electronic equipment and storage medium
CN115359323A (en) Image text information generation method and deep learning model training method
CN114220415A (en) Audio synthesis method and device, electronic equipment and storage medium
CN114067805A (en) Method and device for training voiceprint recognition model and voiceprint recognition
CN113051926A (en) Text extraction method, equipment and storage medium
CN111862967A (en) Voice recognition method and device, electronic equipment and storage medium
CN113838450B (en) Audio synthesis and corresponding model training method, device, equipment and storage medium
CN113793598B (en) Training method of voice processing model, data enhancement method, device and equipment
CN115482809B (en) Keyword retrieval method, keyword retrieval device, electronic equipment and storage medium
CN114758649B (en) Voice recognition method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant