CN113011127A - Text phonetic notation method and device, storage medium and electronic equipment - Google Patents

Text phonetic notation method and device, storage medium and electronic equipment

Info

Publication number
CN113011127A
Authority
CN
China
Prior art keywords: candidate, characters, character, phonetic, annotated
Prior art date
Legal status
Pending
Application number
CN202110172476.3A
Other languages
Chinese (zh)
Inventor
金强
朱一闻
曹偲
刘华平
Current Assignee
Hangzhou Netease Cloud Music Technology Co Ltd
Original Assignee
Hangzhou Netease Cloud Music Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Hangzhou Netease Cloud Music Technology Co Ltd
Priority to CN202110172476.3A
Publication of CN113011127A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/103 Formatting, i.e. changing of presentation of documents
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/15 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being formant information
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Theoretical Computer Science (AREA)
  • Signal Processing (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

The embodiment of the invention provides a text phonetic notation method and device, a storage medium and electronic equipment. The text phonetic notation method comprises the following steps: extracting a character to be annotated from a text file, and determining candidate sounds corresponding to the character to be annotated, wherein each candidate sound is a character sequence composed of one or more phonetic characters; if the number of the candidate sounds is greater than 1, extracting the human voice audio segment corresponding to the character to be annotated from human voice audio, wherein the human voice audio matches the text file; calculating, based on the human voice audio segment, the matching probability between the character sequence corresponding to each candidate sound and the character to be annotated; and determining the candidate sound with the maximum matching probability as the annotation sound of the character to be annotated. The technical scheme of the embodiment of the invention can improve the accuracy of the phonetic notation of polyphonic characters.

Description

Text phonetic notation method and device, storage medium and electronic equipment
Technical Field
Embodiments of the present invention relate to the field of information processing, and more particularly, to a text phonetic notation method, a text phonetic notation apparatus, a storage medium, and an electronic device.
Background
This section is intended to provide a background or context to the embodiments of the invention that are recited in the claims. The description herein is not admitted to be prior art by inclusion in this section.
With the increasing popularity of foreign-language songs, a growing number of users want to learn to sing them. To help users learn, the lyrics of a foreign-language song can be annotated with Roman sounds, which work much like Chinese pinyin; the user can then easily spell out the pronunciation of the lyrics from the annotation and thereby learn to sing the song.
When transliterating polyphonic characters, existing Roman annotation methods generally use a pre-trained language model and perform probability calculations on the polyphonic character using its context information, thereby determining the Roman pronunciation corresponding to the polyphonic character.
Disclosure of Invention
However, in the prior art, some very short lyric lines contain only one or two words and therefore lack context information, so the results derived by the language model are often inaccurate.
Therefore, a new text annotation method is needed to improve the accuracy of the phonetic notation of polyphonic characters.
In this context, embodiments of the present invention are intended to provide a text ZhuYin method, a text ZhuYin apparatus, a storage medium, and an electronic device.
In a first aspect of embodiments of the present invention, there is provided a text ZhuYin method, including: extracting a character to be annotated from a text file, and determining candidate sounds corresponding to the character to be annotated; wherein each candidate sound is a character sequence consisting of one or more ZhuYin characters; if the number of the candidate sounds is greater than 1, extracting a human voice audio segment corresponding to the character to be annotated from human voice audio; wherein the human voice audio matches the text file; calculating, on the basis of the human voice audio segment, the matching probability between the character sequence corresponding to each candidate sound and the character to be annotated; and determining the candidate sound with the maximum matching probability as the annotation sound of the character to be annotated.
In some embodiments of the present invention, calculating, based on the human voice audio clip, a matching probability between the character sequence corresponding to each candidate voice and the text to be annotated includes: calculating the sequencing probability of all phonetic notation characters corresponding to the human voice audio clip; and calculating the matching probability of the character sequence corresponding to each candidate tone according to the sorting probability.
In some embodiments of the present invention, calculating the match probability for the character sequence corresponding to each of the candidate tones according to the ranking probabilities comprises: determining all arrangement modes of all the ZhuYin characters corresponding to the candidate tones on the time sequence; obtaining the probability corresponding to each phonetic symbol in the arrangement mode from the sequencing probability; and determining the matching probability of the character sequence corresponding to the candidate tone according to the probability corresponding to each ZhuYin character.
In some embodiments of the invention, the ZhuYin characters include pronunciation characters and blank characters; determining all arrangement modes of all the ZhuYin characters corresponding to the candidate tones on the time sequence comprises the following steps: and determining all arrangement modes of the blank characters and the pronunciation characters corresponding to the candidate voices on the time sequence.
In some embodiments of the present invention, determining the matching probability of the character sequence corresponding to the candidate tone according to the probability corresponding to each of the ZhuYin characters comprises: multiplying the probabilities corresponding to the phonetic characters to obtain the path probabilities corresponding to the arrangement modes; and adding the path probabilities corresponding to all the arrangement modes to obtain the matching probability of the character sequence corresponding to the candidate tone.
In some embodiments of the present invention, calculating the ranking probability of all ZhuYin characters corresponding to the human voice audio clip includes: dividing the human voice audio clip into a plurality of frames of sub-audio, and extracting the acoustic characteristics of each frame of sub-audio; inputting the acoustic features of the sub-audios of the plurality of frames into an acoustic model frame by frame to obtain the sequencing probability of all phonetic notation characters corresponding to the human voice audio clip.
In some embodiments of the invention, dividing the segment of human voice audio into a plurality of sub-audios comprises: and dividing the voice audio clip into a plurality of frames of the sub audio according to a preset window length and a preset step length.
In some embodiments of the invention, the step size is less than or equal to the window length.
In some embodiments of the invention, the acoustic features include mel-frequency cepstral coefficients, fundamental frequency features and formant features.
In some embodiments of the present invention, extracting a human voice audio segment corresponding to the character to be annotated from the human voice audio includes: separating the human voice audio segment corresponding to the character to be annotated from the human voice audio according to the time position of the character to be annotated in the human voice audio.
In some embodiments of the invention, further comprising: and if the number of the candidate tones is equal to 1, determining the candidate tones as the phonetic notation tones of the character to be phonetic annotated.
In some embodiments of the invention, the method further comprises: and marking the character to be marked with the phonetic notation by using the phonetic notation.
In some embodiments of the present invention, determining the candidate sound corresponding to the text to be annotated includes: determining candidate voices corresponding to the characters to be annotated from a transliteration dictionary; and the language corresponding to the transliteration dictionary is the same as the language corresponding to the character to be annotated.
In some embodiments of the present invention, the candidate sounds are roman sounds and the transliteration dictionary includes each letter of the language and the roman sound to which each letter corresponds.
In some embodiments of the present invention, before extracting a voice audio segment corresponding to the text to be annotated from the voice audio, the method further includes: separating human audio from audio files.
In some embodiments of the invention, separating the human voice audio from the audio file comprises: acquiring a spectrogram of the audio file, and determining a phase diagram and a human voice spectrogram of the audio file according to the spectrogram; and taking the human voice spectrogram and the phase diagram as a human voice frequency-domain signal, and carrying out an inverse Fourier transform on the frequency-domain signal to obtain the human voice audio in the time domain.
In some embodiments of the invention, determining the phase map of the audio file from the spectrogram comprises: and obtaining the phase map by taking the phase of the spectrogram.
In some embodiments of the present invention, determining the human voice spectrogram of the audio file from the spectrogram comprises: obtaining a magnitude spectrogram of the audio file by taking the modulus of the spectrogram, and obtaining a human voice mask according to the magnitude spectrogram; and performing dot multiplication on the human voice mask and the magnitude spectrogram to obtain the human voice spectrogram.
In some embodiments of the invention, obtaining the human voice mask from the magnitude spectrogram comprises: inputting the magnitude spectrogram into a convolutional neural network to obtain the human voice mask.
In some embodiments of the present invention, obtaining the spectrogram of the audio file comprises: and performing discrete Fourier transform on the audio file to obtain the spectrogram.
In a second aspect of embodiments of the present invention, there is provided a text phonetic notation device comprising: a candidate sound determining module, used for extracting a character to be annotated from a text file and determining candidate sounds corresponding to the character to be annotated, wherein each candidate sound is a ZhuYin character sequence composed of one or more ZhuYin characters; an audio extraction module, used for extracting a human voice audio segment corresponding to the character to be annotated from human voice audio if the number of the candidate sounds is greater than 1, wherein the human voice audio matches the text file; a probability calculation module, used for calculating, on the basis of the human voice audio segment, the matching probability between the character sequence corresponding to each candidate sound and the character to be annotated; and a phonetic notation determining module, used for determining the candidate sound with the maximum matching probability as the annotation sound of the character to be annotated.
In a third aspect of embodiments of the present invention, there is provided a storage medium having stored thereon a computer program which, when executed by a processor, implements a text ZhuYin method according to any one of the above-described embodiments.
In a fourth aspect of embodiments of the present invention, there is provided an electronic apparatus comprising: a processor; a memory for storing executable instructions of the processor; wherein the processor is configured to perform the text ZhuYin method of any of the above embodiments via execution of the executable instructions.
According to the text phonetic notation method, the text phonetic notation device, the storage medium and the electronic equipment, on one hand, the text to be phonetic-annotated is annotated by combining the text file and the audio file, so that the conversion success rate of text annotation is improved. On the other hand, the acoustic features in the audio are used for determining the matching probability of the character sequence corresponding to the candidate sound, and the candidate sound with the maximum matching probability is determined as the phonetic notation sound, so that the accuracy of the phonetic notation of the polyphonic characters is improved. On the other hand, the text phonetic notation method is not only suitable for phonetic notation of conventional audio texts, but also suitable for extremely short lyric texts lacking context information in lyric texts, or phonetic notation of lyric texts with misuse of word classes adopted for achieving artistic effects, so that the application range of the text phonetic notation method is expanded, and the user experience is improved.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present invention will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which:
FIG. 1 schematically illustrates a flow diagram of a text ZhuYin method according to an exemplary embodiment of the present invention;
FIG. 2 schematically illustrates a multi-frame sub-audio division according to an exemplary embodiment of the present invention;
fig. 3 schematically shows a diagram of acoustic features obtained from an audio signal corresponding to a human voice audio piece according to the invention;
FIG. 4 schematically illustrates a ZhuYin character probability matrix corresponding to a human voice audio clip according to the present invention;
FIG. 5 schematically illustrates an operational flow diagram of a text ZhuYin method according to an exemplary embodiment of the present invention;
FIG. 6 schematically illustrates a block diagram of a text ZhuYin apparatus according to an exemplary embodiment of the present invention;
fig. 7 schematically shows a block diagram of an electronic device according to an exemplary embodiment of the present invention.
In the drawings, the same or corresponding reference numerals indicate the same or corresponding parts.
Detailed Description
The principles and spirit of the present invention will be described with reference to a number of exemplary embodiments. It is understood that these embodiments are given solely for the purpose of enabling those skilled in the art to better understand and to practice the invention, and are not intended to limit the scope of the invention in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
As will be appreciated by one skilled in the art, embodiments of the present invention may be embodied as a system, apparatus, device, method, or computer program product. Accordingly, the present disclosure may be embodied in the form of: entirely hardware, entirely software (including firmware, resident software, micro-code, etc.), or a combination of hardware and software.
According to the embodiment of the invention, a text phonetic notation method and a text phonetic notation device are provided.
In this document, any number of elements in the drawings is by way of example and not by way of limitation, and any nomenclature is used solely for differentiation and not by way of limitation.
The principles and spirit of the present invention are explained in detail below with reference to several representative embodiments of the invention.
Summary of The Invention
The inventor has found that the prior art offers no satisfactory scheme for marking the Roman sounds of polyphonic characters in foreign-language songs, so the accuracy of Roman sound annotation for polyphonic characters is poor.
Based on the above, the basic idea of the invention is: for polyphone characters in the text file, acoustic feature extraction is carried out on the audio file corresponding to the text file, the matching probability of the character sequence corresponding to each candidate sound is obtained through the acoustic feature, and the phonetic notation of the characters to be phonetic annotated is determined according to the matching probability.
Having described the general principles of the invention, various non-limiting embodiments of the invention are described in detail below.
Exemplary method
A text ZhuYin method according to an exemplary embodiment of the present invention is described below with reference to fig. 1.
Fig. 1 schematically shows a flow chart of a text ZhuYin method according to an exemplary embodiment of the present invention. Referring to fig. 1, a text ZhuYin method according to an exemplary embodiment of the present invention may include the steps of:
s12, extracting characters to be annotated from the text file, and determining candidate sounds corresponding to the characters to be annotated; wherein the candidate sound is a character sequence consisting of one or more ZhuYin characters.
The text file may be a lyric file of a song, and the song may be, for example, a Japanese or Korean song. The exemplary embodiment of the present invention aims to annotate the lyrics of the song, for example with Roman sounds, so that the user can spell out and sing the lyrics according to the annotation.
In the process of determining the candidate sounds corresponding to the character to be annotated, characters can be read from the text file one by one in sequence; each time a character is read, it is taken as the character to be annotated, and the candidate sounds corresponding to it can then be determined from a transliteration dictionary.
The language corresponding to the transliteration dictionary is the same as the language corresponding to the character to be annotated, and the transliteration dictionary at least comprises each character belonging to the language and the annotation sound corresponding to each character, such as Roman sound.
In addition, Roman sound is a method of representing the pronunciation of Japanese or Korean with Latin letters. Like an international phonetic notation for people whose mother tongue is not Japanese or Korean, it helps learners grasp the pronunciation and learn more easily, serving a function similar to that of Chinese pinyin.
Taking the transliteration dictionary as a Japanese Roman dictionary as an example, as shown in Table 1, the Japanese Roman dictionary contains all Japanese characters and all possible Roman sounds of each Japanese character, which are the candidate sounds of that character.
TABLE 1
[Table 1 is provided as an image in the original publication and is not reproduced here; it lists Japanese characters together with their candidate Roman sounds.]
As can be seen from Table 1, the candidate sound corresponding to the Japanese character "な" is the character sequence "na", composed of the ZhuYin characters n and a, while the Japanese character translated in Table 1 as "public" has 3 candidate sounds.
It should be noted that the text phonetic notation method according to the exemplary embodiment of the present invention is not only applicable to marking roman sounds, but also applicable to other transliteration marks, such as chinese pinyin, etc.
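As an illustrative, non-limiting sketch of the dictionary lookup in step S12 (in Python): the dictionary entries below are taken from the examples in this description, while the data structure and function names are assumptions rather than part of the patent.

    from typing import Dict, List

    # Transliteration dictionary: each character of the language maps to its candidate sounds.
    ROMAN_DICT: Dict[str, List[str]] = {
        "な": ["na"],            # single candidate: annotate directly
        "君": ["kunn", "kimi"],  # polyphonic character: needs the audio-based judgment of S14 to S18
    }

    def candidate_sounds(char: str) -> List[str]:
        """Return the candidate sounds of one character to be annotated."""
        return ROMAN_DICT.get(char, [])

    for char in "な君":
        cands = candidate_sounds(char)
        if len(cands) == 1:
            print(char, "-> annotation sound:", cands[0])
        else:
            print(char, "-> needs disambiguation, candidates:", cands)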
S14, if the number of the candidate sounds is greater than 1, extracting the human voice audio segment corresponding to the character to be annotated from the human voice audio; wherein the human voice audio matches the text file.
In the process of preliminarily determining the candidate sound corresponding to the character to be annotated through the transliteration dictionary, if the number of the candidate sounds of the character to be annotated is equal to 1, directly determining the candidate sound as the annotation sound of the character to be annotated, and annotating the character to be annotated by using the annotation sound, thereby completing the annotating process of the character to be annotated.
However, if the number of candidate sounds of the character to be annotated is greater than 1, the human voice audio needs to be separated from the audio file; since the human voice audio matches the text file, the human voice audio segment corresponding to the character to be annotated, that is, the segment in which that character is sung, can then be extracted from the human voice audio. The audio file may be the audio file of a song, or any other audio file containing a human voice, which is not limited in this exemplary embodiment.
In practical applications, an audio file generally contains both vocal audio and accompaniment audio, and the exemplary embodiment of the present invention needs to separate the two to obtain the vocal audio. There are various ways to separate the human voice audio from an audio file; for example, the separation can be performed by using the characteristics of stereo music and the principle that the left and right channels cancel each other out. Alternatively, the frequency-domain features of the audio file can be used for the separation. Any scheme that can separate the human voice audio falls within the scope of the exemplary embodiments of the present invention.
In an exemplary embodiment of the present invention, separation using the frequency-domain features of the audio file is described as an example. Firstly, a discrete Fourier transform is performed on the audio file to obtain its spectrogram, and a phase diagram and a human voice spectrogram of the audio file are determined from the spectrogram. The phase diagram is obtained by taking the phase of the spectrogram, and the human voice spectrogram is obtained as follows: the modulus of the spectrogram is taken to obtain a magnitude spectrogram of the audio file, the magnitude spectrogram is input into a convolutional neural network to obtain a human voice mask, and the human voice mask is dot-multiplied with the magnitude spectrogram to obtain the human voice spectrogram. Secondly, the obtained human voice spectrogram and phase diagram are taken together as the human voice frequency-domain signal, and an inverse Fourier transform is performed on this frequency-domain signal to obtain the required time-domain human voice audio.
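A minimal sketch of this frequency-domain separation is given below. It assumes some pre-trained convolutional network mask_model that maps a magnitude spectrogram to a human voice mask with values in [0, 1]; the patent does not specify the network, and the STFT parameters here are illustrative.

    import numpy as np
    from scipy.signal import stft, istft

    def separate_vocals(audio: np.ndarray, sr: int, mask_model) -> np.ndarray:
        # Spectrogram of the audio file (short-time discrete Fourier transform).
        _, _, spec = stft(audio, fs=sr, nperseg=2048)
        phase = np.angle(spec)              # phase diagram
        magnitude = np.abs(spec)            # magnitude spectrogram (modulus of the spectrum)
        vocal_mask = mask_model(magnitude)  # convolutional network predicts the human voice mask
        vocal_mag = vocal_mask * magnitude  # dot multiplication -> human voice spectrogram
        # Recombine the human voice magnitude with the original phase and invert to the time domain.
        vocal_spec = vocal_mag * np.exp(1j * phase)
        _, vocals = istft(vocal_spec, fs=sr, nperseg=2048)
        return vocals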
In an exemplary embodiment of the present invention, after separating the human voice audio from the audio file, a human voice audio segment corresponding to the text to be annotated needs to be extracted from the human voice audio.
Specifically, extracting the human voice audio segment from the human voice audio may include: separating the human voice audio segment corresponding to the character to be annotated from the human voice audio according to the time position of the character to be annotated in the human voice audio.
In practical applications, the text file of a song usually carries lyric time information in order to facilitate the display of the lyrics on the player following the playing of the audio. Common lyric types are word-by-word lyrics and line-by-line lyrics, wherein the text form of the word-by-word lyrics is as follows:
[st, d] This(st1, d1) is(st2, d2) an(st3, d3) ex(st4, d4)am(st5, d5)ple(st6, d6)
[st, d] Only(st1, d1) for(st2, d2) refer(st3, d3)ence(st4, d4)
Here st denotes a start time and d a duration: the values in square brackets give the start time and duration of the whole line, and the values in parentheses give the start time and duration of each individual character. As can be seen from the word-by-word lyric form shown above, word-by-word lyrics directly provide the start time of each character, so the time position of the character to be annotated can be obtained easily, and the corresponding human voice audio segment can then be obtained accurately from that time position.
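As a small illustrative sketch, the (st, d) timing of a character in word-by-word lyrics can be used to cut its human voice audio segment out of the separated vocal track; the millisecond unit and the function name below are assumptions.

    import numpy as np

    def slice_character_audio(vocals: np.ndarray, sr: int,
                              start_ms: float, duration_ms: float) -> np.ndarray:
        """Cut the human voice audio segment of one character from its (st, d) timing."""
        begin = int(start_ms / 1000 * sr)
        end = int((start_ms + duration_ms) / 1000 * sr)
        return vocals[begin:end]

    # e.g. a character starting at st = 15880 ms with duration d = 880 ms:
    # segment = slice_character_audio(vocals, sr, 15880, 880)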
While the textual form of the line-by-line lyrics is as follows:
xxx is an example
Xxx is for reference only
As can be seen from the above example, such lyrics usually carry only the start time of each line, so the line-by-line lyrics need to be converted into word-by-word lyrics in order to obtain the time position of the character to be annotated.
In practical applications, there are various methods for converting the lyrics from line to word, for example, a forced alignment technique in a speech recognition technique is used to perform an alignment process on separated voice audio and text characters to obtain time information corresponding to each text character. The exemplary embodiments of the present invention are not particularly limited with respect to a specific method of converting the line-by-line lyrics into the word-by-word lyrics.
S16, calculating, based on the human voice audio segment, the matching probability between the character sequence corresponding to each candidate sound and the character to be annotated.
For a word to be annotated, whether roman or pinyin is annotated, the candidate sound corresponding to the word to be annotated is usually a character sequence consisting of one or more annotated characters.
In an exemplary embodiment of the present invention, the sequencing probability of all ZhuYin characters corresponding to the above human voice audio segment must first be calculated; here "all ZhuYin characters" means all pronunciation characters plus the blank character, where the pronunciation characters are, for example, all the Latin letters used in Roman sounds or the 26 letters used in pinyin. Then, according to the sequencing probability, the matching probability between the character sequence corresponding to each candidate sound and the character to be annotated is calculated.
Specifically, the process of calculating the sequencing probability of all ZhuYin characters corresponding to the human voice audio segment includes: dividing the human voice audio segment into multiple frames of sub-audio. As shown in Fig. 2, in the dividing process the human voice audio segment may be divided into multiple frames of sub-audio according to a preset window length L and a preset step length D, giving T frames of sub-audio such as the i-th audio frame, the (i+1)-th audio frame and so on, where the preset step length D is less than or equal to the preset window length L.
After the multiple frames of sub-audio are obtained, the acoustic features of each frame of sub-audio are extracted. Acoustic features are physical quantities that represent the acoustic properties of speech, a general term for the acoustic representation of speech elements; they include, for example, the energy concentration region characterizing timbre, Mel-frequency cepstral coefficients, fundamental frequency features, formant features, and features characterizing speech rhythm such as duration, fundamental frequency and average speech power. The present exemplary embodiment places no particular limitation on the specific acoustic features used.
Referring to fig. 3, an acoustic feature map 301 of dimension N × T obtained from an audio signal corresponding to a human voice audio segment is shown, where N represents N acoustic features, and the number of N may be determined according to actual situations, for example, N is 40, which is not limited in this exemplary embodiment. In addition, a small box in fig. 3 represents one frame of sub-audio 302, for a total of T frames of sub-audio 302.
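The framing of Fig. 2 and the N × T feature map of Fig. 3 can be sketched as follows; window length L and step D are given in samples here, and extract_features stands in for whatever MFCC, fundamental frequency or formant extractor is chosen, which the patent leaves open.

    import numpy as np

    def frame_signal(segment: np.ndarray, L: int, D: int) -> np.ndarray:
        """Split the human voice audio segment into T frames of length L with step D (D <= L)."""
        frames = [segment[i:i + L] for i in range(0, len(segment) - L + 1, D)]
        return np.stack(frames)                                          # shape (T, L)

    def feature_matrix(segment: np.ndarray, L: int, D: int, extract_features) -> np.ndarray:
        """Build the N x T acoustic feature map of Fig. 3, one column per frame."""
        frames = frame_signal(segment, L, D)
        return np.stack([extract_features(f) for f in frames], axis=1)   # shape (N, T)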
After the acoustic features of each frame of sub-audio are obtained, the acoustic features of the multiple frames of sub-audio are input into an acoustic model frame by frame, so as to obtain the sequencing probability of all ZhuYin characters corresponding to the human voice audio segment. Referring to Fig. 4, an M × T-dimensional ZhuYin character probability matrix 401 corresponding to the human voice audio segment is shown, where M is the number of all ZhuYin characters, for example M = 27, covering all pronunciation characters plus the blank character. The ZhuYin character probability matrix 401 contains the matching probability of every ZhuYin character for each frame of sub-audio 302, that is, the sequencing probability of all ZhuYin characters corresponding to the human voice audio segment. In Fig. 4, "blank" denotes the blank character, and a, i, u, e, o, ... denote pronunciation characters.
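A sketch of this acoustic-model step is given below: each frame's N-dimensional feature vector is mapped to a probability distribution over the M ZhuYin characters, yielding the M × T matrix of Fig. 4. The small fully connected network is purely a placeholder; the patent does not prescribe a model architecture.

    import torch
    import torch.nn as nn

    N, M = 40, 27   # e.g. 40 acoustic features per frame; 26 pronunciation characters + 1 blank

    acoustic_model = nn.Sequential(nn.Linear(N, 128), nn.ReLU(), nn.Linear(128, M))

    def zhuyin_probability_matrix(features: torch.Tensor) -> torch.Tensor:
        """features: (N, T) float feature matrix -> (M, T) ZhuYin character probability matrix."""
        logits = acoustic_model(features.T)        # (T, M): one logit row per frame
        return torch.softmax(logits, dim=-1).T     # per-frame probabilities, shape (M, T)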
After the sequencing probability of all ZhuYin characters corresponding to the human voice audio segment is obtained, the matching probability between the character sequence corresponding to each candidate sound and the character to be annotated can be calculated from it. Suppose the character to be annotated has M candidate sounds {E1, E2, ..., EM}; the matching probabilities P(E1), P(E2), ..., P(EM) of their character sequences are to be calculated.
Specifically, calculating the matching probability between the character sequence corresponding to each candidate sound and the character to be annotated may include: firstly, determining all arrangements on the time sequence of all the ZhuYin characters corresponding to each candidate sound, including all arrangements on the time sequence of the blank character and the pronunciation characters corresponding to the candidate sound. Suppose candidate sound E1 corresponds to the pronunciation characters C1 and C2, and the time sequence of the human voice audio segment consists of three time frames t1, t2 and t3; then all arrangements on the time sequence of the ZhuYin characters corresponding to E1 are the five arrangements _C1C2, C1C1C2, C1C2C2, C1_C2 and C1C2_, where "_" denotes the blank character.
The probability of each ZhuYin character in each arrangement is then obtained from the sequencing probability; for example, the probabilities of C1, C2 and the blank character at the time frames t1, t2 and t3 can be read from the sequencing probability, namely Pt1(C1), Pt1(C2), Pt1(_), Pt2(C1), Pt2(C2), Pt2(_), Pt3(C1), Pt3(C2) and Pt3(_).
The matching probability of the character sequence corresponding to the candidate sound is then determined from the probabilities of the individual ZhuYin characters. Specifically, the probabilities of the ZhuYin characters in an arrangement may be multiplied to obtain the path probability of that arrangement. Taking candidate sound E1 as an example, the path probabilities of its five arrangements are calculated as follows:
P(_C1C2) = Pt1(_) × Pt2(C1) × Pt3(C2);
P(C1C1C2) = Pt1(C1) × Pt2(C1) × Pt3(C2);
P(C1C2C2) = Pt1(C1) × Pt2(C2) × Pt3(C2);
P(C1_C2) = Pt1(C1) × Pt2(_) × Pt3(C2);
P(C1C2_) = Pt1(C1) × Pt2(C2) × Pt3(_).
The path probabilities of all the arrangements are then added to obtain the matching probability of the character sequence corresponding to the candidate sound, i.e. the matching probability of candidate sound E1 is:
P(E1) = P(_C1C2) + P(C1C1C2) + P(C1C2C2) + P(C1_C2) + P(C1C2_).
In the same way, the matching probabilities P(E2), ..., P(EM) corresponding to the candidate sounds E2, ..., EM can be obtained.
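The computation above amounts to summing over all time alignments of the candidate's pronunciation characters. A brute-force sketch follows; it reproduces exactly the five arrangements listed for E1 when T = 3, the data structures are illustrative assumptions, candidates whose pronunciation characters repeat consecutively are not handled, and for realistic T a dynamic-programming forward pass would be used instead of enumeration.

    from itertools import groupby, product

    BLANK = "_"

    def collapse(path):
        # Merge consecutive repeats, then drop blanks: ("C1", "C1", "C2") -> ("C1", "C2").
        merged = [key for key, _ in groupby(path)]
        return tuple(c for c in merged if c != BLANK)

    def matching_probability(candidate, frame_probs):
        """candidate: tuple of pronunciation characters, e.g. ("C1", "C2").
        frame_probs: list of dicts, frame_probs[t][char] = sequencing probability of char at frame t."""
        alphabet = set(candidate) | {BLANK}
        T = len(frame_probs)
        total = 0.0
        for path in product(alphabet, repeat=T):       # every arrangement on the time sequence
            if collapse(path) != candidate:
                continue
            path_prob = 1.0
            for t, char in enumerate(path):            # multiply the per-frame probabilities
                path_prob *= frame_probs[t][char]
            total += path_prob                         # add the path probabilities
        return total

With frame_probs[t] holding the sequencing probabilities of frame t, matching_probability(("C1", "C2"), frame_probs) for T = 3 sums exactly the five path probabilities P(_C1C2), P(C1C1C2), P(C1C2C2), P(C1_C2) and P(C1C2_) given above.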
S18, determining the candidate sound with the maximum matching probability as the annotation sound of the character to be annotated.
The maximum value is selected from the matching probabilities P(E1), P(E2), ..., P(EM) calculated above, and the candidate sound corresponding to this maximum value is determined as the annotation sound of the character to be annotated. After the annotation sound is determined, the character to be annotated is labeled with it.
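Step S18 then reduces to an argmax over the candidate sounds; as a one-line sketch reusing the hypothetical matching_probability above:

    def annotation_sound(candidates, frame_probs):
        """Return the candidate sound whose character sequence has the largest matching probability."""
        return max(candidates, key=lambda cand: matching_probability(cand, frame_probs))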
The following describes an operation flow of the text ZhuYin method according to an exemplary embodiment of the present invention with reference to FIG. 5:
in step S501, a character to be annotated is extracted from a text file; in step S502, the candidate sounds corresponding to the character to be annotated are determined; in step S503, a judgment is made on the number of candidate sounds; if the number of candidate sounds is 1, step S504 is executed: the candidate sound is determined as the annotation sound of the character to be annotated, and the character to be annotated is annotated with it; if the number of candidate sounds is greater than 1, step S505 is executed to separate the human voice audio; in step S506, the human voice audio segment corresponding to the character to be annotated is extracted from the human voice audio; in step S507, the human voice audio segment is divided into multiple frames of sub-audio; in step S508, the acoustic features of each frame of sub-audio are extracted; in step S509, the acoustic features of the multiple frames of sub-audio are input into the acoustic model frame by frame to obtain the sequencing probability of all ZhuYin characters corresponding to the human voice audio segment; in step S510, after the sequencing probability is obtained, all arrangements on the time sequence of all the ZhuYin characters corresponding to each candidate sound can be determined; in step S511, the probability of each ZhuYin character in every arrangement is obtained from the sequencing probability; in step S512, the matching probability of the character sequence corresponding to each candidate sound is determined from the probabilities of the individual ZhuYin characters; in step S513, the candidate sound with the maximum matching probability is determined as the annotation sound of the character to be annotated; in step S514, the character to be annotated is annotated with the annotation sound.
The following takes specific words in the lyric text as an example to illustrate the specific operation process of the text phonetic notation method provided by the exemplary embodiment of the present invention:
Taking the lyric line "君と話そうか" in the Japanese song "PLANET" as an example, the first character to be annotated extracted from the text file is "君"; after looking up the transliteration dictionary, two corresponding candidate sounds, "kunn" and "kimi", are obtained, i.e. M is equal to 2, so further judgment is needed. The human voice audio is obtained by separating the audio file of the song "PLANET". According to the word-by-word lyric information of "PLANET", the segment of the song corresponding to "君" runs from 00:15:88 to 00:16:76, and the corresponding human voice audio segment is intercepted. Based on this human voice audio segment, the corresponding ZhuYin character probability matrix is calculated; the matching probability of the character sequence corresponding to the candidate sound "kunn" is calculated to be P(kunn) = 0.01, and that of the candidate sound "kimi" is P(kimi) = 0.80. Since P(kimi) is larger, the annotation sound of "君" is "kimi", and "君" is annotated with "kimi".
The second character to be annotated, "と", is then extracted; after looking up the transliteration dictionary, its only candidate sound is "to", i.e. M is equal to 1, so the annotation sound of "と" is directly recorded as "to".
The third character to be annotated, "話", is then extracted; after looking up the transliteration dictionary, its only candidate sound is "hanashi", i.e. M is equal to 1, so the annotation sound of "話" is directly recorded as "hanashi".
The fourth character to be annotated, "そ", is then extracted; after looking up the transliteration dictionary, its only candidate sound is "so", i.e. M is equal to 1, so the annotation sound of "そ" is directly recorded as "so".
The fifth character to be annotated, "う", is then extracted; after looking up the transliteration dictionary, its only candidate sound is "u", i.e. M is equal to 1, so the annotation sound of "う" is directly recorded as "u".
The sixth character to be annotated, "か", is then extracted; after looking up the transliteration dictionary, its only candidate sound is "ka", i.e. M is equal to 1, so the annotation sound of "か" is directly recorded as "ka".
Therefore, the Roman lyrics corresponding to the final lyric line "君と話そうか" are "kimi-to-hanashi-so-u-ka".
The technical scheme of the embodiment of the invention is based on the text file and the audio file, and particularly for the phonetic notation of polyphone characters in the characters to be annotated, the human voice audio is separated from the audio file, and the human voice audio segment corresponding to the characters to be annotated is divided into multi-frame sub-audio; acquiring the sequencing probability of all phonetic notation characters corresponding to the human voice audio clip according to the acoustic characteristics of each frame of sub audio; determining the matching probability of the character sequence corresponding to each candidate tone according to the sequencing probability; and determining the candidate sound with the maximum matching probability as the phonetic notation sound of the character to be phonetic annotated. On one hand, the text file and the audio file are combined to label the words to be labeled, so that the conversion success rate of text labeling is improved. On the other hand, acoustic features in the audio are used for determining the matching probability of the character sequence corresponding to the candidate sound, and the candidate sound with the highest matching probability is determined as the phonetic notation sound, so that the accuracy of candidate sound identification is increased. On the other hand, the text phonetic notation method is not only suitable for phonetic notation of conventional audio texts, but also suitable for extremely short lyric texts lacking context information in lyric texts, or phonetic notation of lyric texts with misuse of word classes adopted for achieving artistic effects, so that the application range of the text phonetic notation method is expanded, and the user experience is improved.
Exemplary devices
Having described the text ZhuYin method according to an exemplary embodiment of the present invention, a text ZhuYin apparatus according to an exemplary embodiment of the present invention will be described with reference to FIG. 6.
Referring to fig. 6, the text phonetic notation device 6 according to an exemplary embodiment of the present invention may include a candidate tone determination module 61, an audio extraction module 63, a probability calculation module 65, and a notation tone determination module 67.
Specifically, the candidate sound determining module 61 may be configured to extract characters to be annotated from the text file, and determine candidate sounds corresponding to the characters to be annotated; wherein, the candidate sound is a phonetic character sequence composed of one or more phonetic characters; the audio extraction module 63 may be configured to extract a voice audio segment corresponding to the text to be annotated from the voice audio if the number of the candidate voices is greater than 1; wherein, the voice audio is matched with the text file; the probability calculation module 65 may be configured to calculate, based on the human voice audio segment, a matching probability between the character sequence corresponding to each candidate voice and the text to be annotated; the phonetic notation determining module 67 may be configured to determine the candidate sound with the highest matching probability as the phonetic notation of the text to be phonetic annotated.
In some embodiments of the present invention, the probability calculation module 65 may be configured to divide the human voice audio clip into multiple frames of sub-audios, and extract the acoustic features of each frame of sub-audio; inputting the acoustic characteristics of the multi-frame sub-audio into an acoustic model frame by frame to obtain the sequencing probability of all phonetic notation characters corresponding to the human voice audio clip; determining all arrangement modes of all phonetic notation characters corresponding to the candidate tones on the time sequence; acquiring the probability corresponding to each phonetic symbol in the arrangement mode from the sequencing probability; and determining the matching probability of the character sequence corresponding to the candidate tone according to the probability corresponding to each ZhuYin character.
Since each functional module of the text phonetic notation device according to the embodiment of the present invention corresponds to the steps of the method embodiment described above, it is not described herein again.
Exemplary device
Having described the text phonetic notation method and the text phonetic notation device of the exemplary embodiments of the present invention, the electronic device of the exemplary embodiments of the present invention will be described next. The electronic equipment of the exemplary embodiment of the invention comprises one of the text phonetic notation devices.
As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method or program product. Thus, various aspects of the invention may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.) or an embodiment combining hardware and software aspects that may all generally be referred to herein as a "circuit," module "or" system.
In some possible embodiments, an electronic device according to the invention may comprise at least one processing unit, and at least one memory unit. Wherein the storage unit stores program code which, when executed by the processing unit, causes the processing unit to perform the steps in the text ZhuYin method according to various exemplary embodiments of the present invention described in the above-mentioned "methods" section of this specification. For example, the processing unit may perform steps S12 to S18 as described in Fig. 1.
An electronic device 700 according to this embodiment of the invention is described below with reference to fig. 7. The electronic device 700 shown in fig. 7 is only an example and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.
As shown in fig. 7, electronic device 700 is embodied in the form of a general purpose computing device. The components of the electronic device 700 may include, but are not limited to: the at least one processing unit 710, the at least one memory unit 720, a bus 730 connecting different system components (including the memory unit 720 and the processing unit 710), and a display unit 740.
Wherein the storage unit stores program code that is executable by the processing unit 710 such that the processing unit 710 performs the steps according to various exemplary embodiments of the present invention as described in the above section "exemplary method" of the present specification. For example, the processing unit 710 may execute step S12 as shown in fig. 1: extracting characters to be annotated from the text file, and determining candidate sounds corresponding to the characters to be annotated; wherein, the candidate sound is a character sequence composed of one or more phonetic characters; step S14: if the number of the candidate sounds is larger than 1, extracting a voice audio segment corresponding to the character to be noted from the voice audio; wherein, the voice audio is matched with the text file; step S16: based on the voice audio clip, calculating the matching probability of the character sequence corresponding to each candidate voice and the character to be annotated; step S18: and determining the candidate sound with the maximum matching probability as the phonetic notation sound of the character to be phonetic annotated.
The storage unit 720 may include readable media in the form of volatile memory units, such as a random access memory unit (RAM)7201 and/or a cache memory unit 7202, and may further include a read only memory unit (ROM) 7203.
The storage unit 720 may also include a program/utility 7204 having a set (at least one) of program modules 7205, such program modules 7205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
Bus 730 may be any representation of one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 700 may also communicate with one or more external devices 770 (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 700, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device 700 to communicate with one or more other computing devices. Such communication may occur via an input/output (I/O) interface 750. Also, the electronic device 700 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the internet) via the network adapter 760. As shown, the network adapter 760 communicates with the other modules of the electronic device 700 via the bus 730. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device 700, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
Exemplary program product
In some possible embodiments, aspects of the present invention may also be implemented in the form of a program product comprising program code for causing a terminal device to perform the steps in the text ZhuYin method according to various exemplary embodiments of the present invention described in the above-mentioned "method" section of this specification, when the program product is run on the terminal device; for example, the terminal device may perform steps S12 to S18 as described in FIG. 1.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical disk, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In addition, as technology advances, readable storage media should also be interpreted accordingly.
Program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user computing device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device over any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., over the internet using an internet service provider).
It should be noted that although several modules or sub-modules of the text phonetic device are mentioned in the above detailed description, such division is merely exemplary and not mandatory. Indeed, the features and functionality of two or more of the modules described above may be embodied in one module according to embodiments of the invention. Conversely, the features and functions of one module described above may be further divided into embodiments by a plurality of modules.
Moreover, while the operations of the method of the invention are depicted in the drawings in a particular order, this does not require or imply that the operations must be performed in this particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions.
While the spirit and principles of the invention have been described with reference to several particular embodiments, it is to be understood that the invention is not limited to the disclosed embodiments, nor is the division of aspects, which is for convenience only as the features in such aspects may not be combined to benefit. The invention is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims (10)

1. A method for annotating text, comprising:
extracting characters to be annotated from a text file, and determining candidate sounds corresponding to the characters to be annotated; wherein the candidate tone is a character sequence consisting of one or more ZhuYin characters;
if the number of the candidate sounds is greater than 1, extracting a human voice audio segment corresponding to the character to be annotated from human voice audio; wherein the human voice audio matches the text file;
calculating the matching probability of the character sequence corresponding to each candidate voice and the character to be annotated on the basis of the voice audio clip;
and determining the candidate sound with the maximum matching probability as the phonetic notation sound of the character to be phonetic annotated.
2. The method of claim 1, wherein calculating the matching probability between the character sequence corresponding to each candidate voice and the word to be annotated based on the human voice audio segment comprises:
calculating the sequencing probability of all phonetic notation characters corresponding to the human voice audio clip;
and calculating the matching probability of the character sequence corresponding to each candidate tone according to the sorting probability.
3. The text phonetic notation method of claim 2, wherein calculating the matching probability of the character sequence corresponding to each candidate pronunciation according to the ranking probabilities comprises:
determining all arrangements, over the time sequence, of the phonetic notation characters corresponding to the candidate pronunciation;
obtaining, from the ranking probabilities, the probability corresponding to each phonetic notation character in each arrangement;
and determining the matching probability of the character sequence corresponding to the candidate pronunciation according to the probability corresponding to each phonetic notation character.
4. The text phonetic notation method of claim 3, wherein the phonetic notation characters include pronunciation characters and blank characters;
and determining all arrangements, over the time sequence, of the phonetic notation characters corresponding to the candidate pronunciation comprises: determining all arrangements, over the time sequence, of the blank characters and the pronunciation characters corresponding to the candidate pronunciation.
5. The text phonetic notation method of claim 3, wherein determining the matching probability of the character sequence corresponding to the candidate pronunciation according to the probability corresponding to each phonetic notation character comprises:
multiplying the probabilities corresponding to the phonetic notation characters to obtain a path probability corresponding to each arrangement;
and adding the path probabilities corresponding to all the arrangements to obtain the matching probability of the character sequence corresponding to the candidate pronunciation.
6. The text phonetic notation method of claim 2, wherein calculating the ranking probabilities of all phonetic notation characters corresponding to the vocal audio segment comprises:
dividing the vocal audio segment into a plurality of frames of sub-audio, and extracting an acoustic feature of each frame of sub-audio;
and inputting the acoustic features of the plurality of frames of sub-audio, frame by frame, into an acoustic model to obtain the ranking probabilities of all phonetic notation characters corresponding to the vocal audio segment.
7. The text phonetic notation method of any one of claims 1-6, further comprising:
annotating the character to be annotated with the phonetic notation.
8. A text phonetic notation device, comprising:
a candidate pronunciation determining module, configured to extract a character to be annotated from a text file and determine candidate pronunciations corresponding to the character to be annotated; wherein each candidate pronunciation is a character sequence consisting of one or more phonetic notation characters;
an audio extraction module, configured to extract, if the number of candidate pronunciations is greater than 1, a vocal audio segment corresponding to the character to be annotated from vocal audio; wherein the vocal audio matches the text file;
a probability calculation module, configured to calculate, based on the vocal audio segment, a matching probability between the character sequence corresponding to each candidate pronunciation and the character to be annotated;
and a phonetic notation determining module, configured to determine the candidate pronunciation with the maximum matching probability as the phonetic notation of the character to be annotated.
9. A storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the text phonetic notation method of any one of claims 1 to 7.
10. An electronic device, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the text phonetic notation method of any one of claims 1-7 via execution of the executable instructions.
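To make the selection flow of claim 1 concrete, the following is a minimal sketch in Python, assuming a hypothetical polyphone dictionary (CANDIDATES), a caller-supplied scoring function (score_candidate), and an already-extracted vocal audio segment; none of these names or structures are fixed by the patent.

from typing import Dict, List

# Hypothetical polyphone dictionary: character -> candidate pronunciations.
CANDIDATES: Dict[str, List[str]] = {
    "乐": ["le4", "yue4"],   # a common Chinese polyphone, used only for illustration
}

def annotate(char: str, vocal_segment, score_candidate) -> str:
    """Pick the pronunciation of `char` whose phonetic-character sequence
    best matches the vocal audio segment aligned with that character."""
    candidates = CANDIDATES.get(char, [])
    if len(candidates) <= 1:
        # Only one reading (or none): no audio evidence is needed.
        return candidates[0] if candidates else ""
    # More than one candidate: score each one against the aligned vocal audio
    # and keep the candidate with the maximum matching probability.
    scores = {c: score_candidate(c, vocal_segment) for c in candidates}
    return max(scores, key=scores.get)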
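Claims 3-5 describe summing, over every time-ordered arrangement of blank and pronunciation characters that corresponds to the candidate, the product of the per-frame probabilities. The sketch below enumerates those arrangements by brute force, which is only tractable for a handful of frames; the collapse rule (a CTC-style "merge repeats, drop blanks") and the frame_probs structure are assumptions, and a practical implementation would compute the same sum with the dynamic-programming forward algorithm.

from itertools import product
from typing import Dict, List, Sequence

BLANK = "<blank>"

def collapse(path: Sequence[str]) -> List[str]:
    """Merge consecutive repeats, then drop blanks (CTC-style collapse)."""
    out: List[str] = []
    prev = None
    for sym in path:
        if sym != prev and sym != BLANK:
            out.append(sym)
        prev = sym
    return out

def match_probability(candidate: List[str],
                      frame_probs: List[Dict[str, float]]) -> float:
    """frame_probs[t][symbol] is the ranking probability of `symbol` at frame t."""
    # Arrangements use blanks plus the candidate's pronunciation characters
    # (deduplicated in case a character repeats within the candidate).
    symbols = [BLANK] + list(dict.fromkeys(candidate))
    total = 0.0
    for path in product(symbols, repeat=len(frame_probs)):  # every arrangement over time
        if collapse(path) != candidate:
            continue
        p = 1.0
        for t, sym in enumerate(path):       # path probability: product over frames
            p *= frame_probs[t].get(sym, 0.0)
        total += p                           # matching probability: sum over paths
    return total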
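For claim 6, one plausible realization is sketched below: frame the vocal audio segment, extract an acoustic feature vector per frame (MFCCs via librosa are used here purely as an example), and feed each frame through an acoustic model to obtain per-frame ranking probabilities over all phonetic notation characters. The acoustic_model callable, the MFCC choice, and the 25 ms window / 10 ms hop are assumptions, not details fixed by the patent.

import numpy as np
import librosa

def frame_probabilities(segment: np.ndarray, sr: int, acoustic_model) -> np.ndarray:
    """Return a (frames, symbols) matrix of ranking probabilities."""
    # Divide the segment into frames and extract 13 MFCCs per frame
    # (25 ms analysis window, 10 ms hop - an assumed, conventional setting).
    mfcc = librosa.feature.mfcc(y=segment, sr=sr, n_mfcc=13,
                                n_fft=int(0.025 * sr), hop_length=int(0.010 * sr))
    features = mfcc.T                     # one feature vector per frame
    probs = []
    for frame in features:                # frame-by-frame into the acoustic model
        logits = acoustic_model(frame)    # hypothetical model: feature vector -> scores
        e = np.exp(logits - np.max(logits))
        probs.append(e / e.sum())         # softmax -> ranking probability per symbol
    return np.vstack(probs)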
CN202110172476.3A 2021-02-08 2021-02-08 Text phonetic notation method and device, storage medium and electronic equipment Pending CN113011127A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110172476.3A CN113011127A (en) 2021-02-08 2021-02-08 Text phonetic notation method and device, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN113011127A true CN113011127A (en) 2021-06-22

Family

ID=76383856

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110172476.3A Pending CN113011127A (en) 2021-02-08 2021-02-08 Text phonetic notation method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN113011127A (en)

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060116997A1 (en) * 2004-11-29 2006-06-01 Microsoft Corporation Vocabulary-independent search of spontaneous speech
CN103531196A (en) * 2013-10-15 2014-01-22 中国科学院自动化研究所 Sound selection method for waveform concatenation speech synthesis
CN103578467A (en) * 2013-10-18 2014-02-12 威盛电子股份有限公司 Acoustic model building method, voice recognition method and electronic device
CN103578464A (en) * 2013-10-18 2014-02-12 威盛电子股份有限公司 Language model establishing method, speech recognition method and electronic device
CN106570001A (en) * 2016-10-24 2017-04-19 广州酷狗计算机科技有限公司 Method and device for transliterating characters
CN108829751A (en) * 2018-05-25 2018-11-16 腾讯音乐娱乐科技(深圳)有限公司 Method, apparatus, electronic equipment and the storage medium for generating the lyrics, showing the lyrics
CN111194463A (en) * 2018-08-27 2020-05-22 北京嘀嘀无限科技发展有限公司 Artificial intelligence system and method for displaying a destination on a mobile device
CN110164470A (en) * 2019-06-12 2019-08-23 成都嗨翻屋科技有限公司 Voice separation method, device, user terminal and storage medium
CN110619112A (en) * 2019-08-08 2019-12-27 北京金山安全软件有限公司 Pronunciation marking method and device for Chinese characters, electronic equipment and storage medium
CN111063336A (en) * 2019-12-30 2020-04-24 天津中科智能识别产业技术研究院有限公司 End-to-end voice recognition system based on deep learning
CN111768763A (en) * 2020-06-12 2020-10-13 北京三快在线科技有限公司 Acoustic model training method and device, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
CHEN YU: "Research on Spoken English Learning and Evaluation Oriented to SELL-Corpus and VR Environments", China Master's Theses Full-text Database (Electronic Journal), 15 September 2019 (2019-09-15), pages 085-677 *

Similar Documents

Publication Publication Date Title
US5787230A (en) System and method of intelligent Mandarin speech input for Chinese computers
JP4791984B2 (en) Apparatus, method and program for processing input voice
Le et al. Automatic speech recognition for under-resourced languages: application to Vietnamese language
US20100125459A1 (en) Stochastic phoneme and accent generation using accent class
CN110870004A (en) Syllable-based automatic speech recognition
CN112466279B (en) Automatic correction method and device for spoken English pronunciation
El Ouahabi et al. Toward an automatic speech recognition system for amazigh-tarifit language
Adiga et al. Automatic speech recognition in Sanskrit: A new speech corpus and modelling insights
CN109102800A (en) A kind of method and apparatus that the determining lyrics show data
Lounnas et al. CLIASR: a combined automatic speech recognition and language identification system
Oo et al. Burmese speech corpus, finite-state text normalization and pronunciation grammars with an application to text-to-speech
Mittal et al. Development and analysis of Punjabi ASR system for mobile phones under different acoustic models
Yoo et al. The performance evaluation of continuous speech recognition based on Korean phonological rules of cloud-based speech recognition open API
US6963832B2 (en) Meaning token dictionary for automatic speech recognition
Pellegrini et al. Automatic word decompounding for asr in a morphologically rich language: Application to amharic
CN111508522A (en) Statement analysis processing method and system
CN113011127A (en) Text phonetic notation method and device, storage medium and electronic equipment
CN114267325A (en) Method, system, electronic device and storage medium for training speech synthesis model
Unnibhavi et al. Development of Kannada speech corpus for continuous speech recognition
JP6849977B2 (en) Synchronous information generator and method for text display and voice recognition device and method
Bristy et al. Bangla speech to text conversion using CMU sphinx
Cherifi et al. Arabic grapheme-to-phoneme conversion based on joint multi-gram model
Dodiya et al. Speech Recognition System for Medical Domain
Hendessi et al. A speech synthesizer for Persian text using a neural network with a smooth ergodic HMM
Lyes et al. Building a pronunciation dictionary for the Kabyle language

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination