CN109326281A - Prosodic labeling method, apparatus and equipment - Google Patents
- Publication number
- CN109326281A (application CN201810988973.9A)
- Authority
- CN
- China
- Prior art keywords
- prosodic
- text
- marked
- voice data
- information
- Prior art date: 2018-08-28
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/103—Formatting, i.e. changing of presentation of documents
- G06F40/117—Tagging; Marking up; Designating a block; Setting of attributes
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/04—Segmentation; Word boundary detection
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/1807—Speech classification or search using natural language modelling using prosody or stress
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G10L2015/025—Phonemes, fenemes or fenones being the recognition units
Abstract
The present invention provides a prosodic labeling method, apparatus, and equipment. The prosodic labeling method includes: obtaining voice data of a text to be labeled; determining, according to the voice data, prosodic information in the voice data, the prosodic information being used to indicate pause durations in the voice data; and labeling the text to be labeled with prosodic symbols according to the prosodic information in the voice data. The prosodic labeling method provided by the present invention improves the efficiency and accuracy of prosodic labeling.
Description
Technical field
The present invention relates to the technical field of prosodic labeling, and in particular to a prosodic labeling method, apparatus, and equipment.
Background art
Prosody, also known as suprasegmental features, generally includes rhythm, stress, intonation, and the like. Prosodic information is an essential means by which people express thoughts and emotions: the same text can convey entirely different meanings when spoken with different tones and rhythms. Prosodic information therefore plays a highly important role in speech synthesis systems.
At present, prosodic labeling in speech synthesis systems is generally performed by predicting prosody from text information. For a given text, prosody prediction is carried out on the text information alone, and the prediction result is typically determined from information such as initials, finals, words, phrases, and paragraphs. Professional annotators then complete the prosodic labeling according to the prediction result.
However, language expression is rich and varied. When prosodic labeling is done manually from text information only, the prosodic information cannot be correctly predicted at positions in the text that require a marked pause or an obvious silence. Annotators consequently have many places to correct, which makes prosodic labeling inefficient and inaccurate.
Summary of the invention
The present invention provides a prosodic labeling method, apparatus, and equipment that improve the efficiency and accuracy of prosodic labeling.
In a first aspect, the present invention provides a prosodic labeling method, comprising:
obtaining voice data of a text to be labeled;
determining, according to the voice data, prosodic information in the voice data, the prosodic information being used to indicate pause durations in the voice data;
labeling the text to be labeled with prosodic symbols according to the prosodic information in the voice data.
Optionally, in a possible embodiment, the method further includes:
obtaining prosodic information in text data of the text to be labeled.
Optionally, in a possible embodiment, labeling the text to be labeled with prosodic symbols according to the prosodic information in the voice data comprises:
labeling the text to be labeled with prosodic symbols according to the prosodic information in the voice data and the prosodic information in the text data.
Optionally, in a possible embodiment, labeling the text to be labeled with prosodic symbols according to the prosodic information in the voice data and the prosodic information in the text data comprises:
labeling the text to be labeled with prosodic symbols according to the prosodic information in the voice data;
updating the prosodic symbols already labeled in the text to be labeled according to the prosodic information in the text data.
Optionally, in a possible embodiment, updating the prosodic symbols already labeled in the text to be labeled according to the prosodic information in the text data comprises:
if the prosodic information in the text data indicates that no prosodic symbol should be labeled at the position of at least one prosodic symbol already labeled in the text to be labeled, deleting the at least one labeled prosodic symbol.
Optionally, in a possible embodiment, determining the prosodic information in the voice data according to the voice data comprises:
obtaining at least one silent segment in the voice data according to the voice data;
for each silent segment, determining, according to the silent segment, the prosodic information corresponding to that silent segment in the voice data.
Optionally, in a possible embodiment, obtaining the at least one silent segment in the voice data according to the voice data comprises:
performing phoneme segmentation on the text data of the text to be labeled to obtain a voice annotation sequence;
performing phoneme segmentation on the voice data according to the voice annotation sequence, the voice data, and a preset acoustic model, to obtain the at least one silent segment in the voice data, wherein the preset acoustic model is used to represent the speech features corresponding to different phonemes.
In a second aspect, the present invention provides a prosodic labeling apparatus, comprising:
a first obtaining module, configured to obtain voice data of a text to be labeled;
a prosodic information determining module, configured to determine, according to the voice data, prosodic information in the voice data, the prosodic information being used to indicate pause durations in the voice data;
a labeling module, configured to label the text to be labeled with prosodic symbols according to the prosodic information in the voice data.
Optionally, in a possible embodiment, the apparatus further includes a second obtaining module;
the second obtaining module is configured to obtain prosodic information in text data of the text to be labeled;
the labeling module is then specifically configured to:
label the text to be labeled with prosodic symbols according to the prosodic information in the voice data and the prosodic information in the text data.
Optionally, in a possible embodiment, the labeling module is specifically configured to:
label the text to be labeled with prosodic symbols according to the prosodic information in the voice data;
update the prosodic symbols already labeled in the text to be labeled according to the prosodic information in the text data.
In a third aspect, the present invention provides a prosodic labeling equipment comprising a processor and a memory. The memory is configured to store instructions, and the processor is configured to execute the instructions stored in the memory, so that the prosodic labeling equipment performs the prosodic labeling method provided by any embodiment of the first aspect of the present invention.
In a fourth aspect, the present invention provides a storage medium, comprising a readable storage medium and a computer program, the computer program being used to implement the prosodic labeling method provided by any embodiment of the first aspect of the present invention.
The present invention thus provides a prosodic labeling method, apparatus, and equipment that label a text to be labeled with prosodic symbols according to the voice data of that text. This takes into account the richness of language expression, in particular marked pauses and obvious silent segments in the speech, thereby improving the efficiency and accuracy of prosodic labeling and reducing the cost of prosodic labeling.
Brief description of the drawings
In order to explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of the prosodic labeling method provided by an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of the prosodic labeling apparatus provided by an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of the prosodic labeling equipment provided by an embodiment of the present invention.
Detailed description of the embodiments
In order to make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
Fig. 1 is a flowchart of the prosodic labeling method provided by an embodiment of the present invention. The method may be executed by a prosodic labeling apparatus or by a prosodic labeling equipment. As shown in Fig. 1, the prosodic labeling method provided by this embodiment may include:
S101: obtain voice data of a text to be labeled.
S102: determine, according to the voice data, prosodic information in the voice data.
The prosodic information is used to indicate pause durations in the voice data.
S103: label the text to be labeled with prosodic symbols according to the prosodic information in the voice data.
Specifically, in this embodiment, the text that needs prosodic labeling is called the text to be labeled. The voice data of the text to be labeled is the speech produced when a reader reads the text aloud; this embodiment places no restriction on the reader. The prosodic information in the voice data, which indicates the pause durations in the speech, can be determined from the voice data. In turn, the text to be labeled can be labeled with prosodic symbols according to those pause durations.
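The following minimal sketch illustrates the S101 to S103 flow. The character-indexed pause representation, the helper names, and the duration thresholds are all illustrative assumptions; the patent leaves the symbol set and the thresholds t3 to t6 to be configured.

```python
# Illustrative sketch of S101-S103; names and thresholds are assumptions.
# Pauses are given as (character_index, duration_ms) pairs obtained from
# the voice data (the alignment step is described later).

def duration_to_symbol(duration_ms: float) -> str:
    """Map a pause duration to a prosodic symbol (assumed thresholds)."""
    if duration_ms < 30:       # below t3 (assumed)
        return "#1"
    if duration_ms < 90:       # [t3, t4): matches the 60 ms -> "#2" example
        return "#2"
    if duration_ms < 300:      # up to an assumed t6 = 300 ms
        return "#3"
    return "#4"                # longest pauses

def annotate(text: str, pauses: list[tuple[int, float]]) -> str:
    """Insert a prosodic symbol after each character that a pause follows."""
    pause_after = dict(pauses)
    out = []
    for i, ch in enumerate(text):
        out.append(ch)
        if i in pause_after:
            out.append(duration_to_symbol(pause_after[i]))
    return "".join(out)

print(annotate("xxxxxxx", [(3, 60.0), (6, 200.0)]))  # -> xxxx#2xxx#3
```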
The prosodic labeling method provided by this embodiment labels the text to be labeled with prosodic symbols according to the voice data of that text, and thus takes into account the richness of language expression. Because it works from the speech a reader actually produces, it fully accounts for marked pauses and obvious silent segments in the speech. Compared with manual prosodic labeling based on the text alone, this improves the accuracy of prosodic labeling; and because fewer positions need to be corrected, it also improves the efficiency of prosodic labeling and reduces the cost of prosodic labeling.
It should be noted that this embodiment places no restriction on how the prosodic symbols are implemented; they can be configured as needed. The pause duration range corresponding to each prosodic symbol can be preset, and this embodiment places no restriction on the specific values of those ranges.
For example, the prosodic symbols may include #1, #2, #3, and #4, in which case the pause durations in the voice data fall into four kinds. This is illustrated by the following example.
Table 1 shows the correspondence between each prosodic symbol, its meaning, and its pause duration range. The pauses corresponding to #1 and #2 are subtle to the ear and subjectively perceived, so in this embodiment their pause duration ranges may be left undefined; of course, they may also be defined, and this embodiment does not limit this. The thresholds satisfy t3 < t4 ≤ t5 < t6, and this embodiment places no restriction on their specific values; for example, t4 = t5 = 90 ms. Suppose an example of the text to be labeled is "xxxxxxx,xxxxxxxx". After prosodic symbol labeling, the text to be labeled may read "xxxx#2xxx#3,xxx#2xxxxx#4".
Table 1
Optionally, the prosodic labeling method provided by this embodiment may further include:
obtaining prosodic information in the text data of the text to be labeled.
In that case, S103 (labeling the text to be labeled with prosodic symbols according to the prosodic information in the voice data) may include:
labeling the text to be labeled with prosodic symbols according to the prosodic information in the voice data and the prosodic information in the text data.
Specifically, the prosodic information in the text data of the text to be labeled indicates the pause durations in that text data. It should be noted that this embodiment places no restriction on how the prosodic information in the text data is obtained; any existing method of prosody prediction based on text information may be used.
Labeling the text to be labeled according to both the prosodic information in the voice data and the prosodic information in the text data takes both the text-based prosody prediction result and the speech-based prosody analysis result into account, which further improves the efficiency and accuracy of prosodic labeling.
Optionally, labeling the text to be labeled with prosodic symbols according to the prosodic information in the voice data and the prosodic information in the text data may include:
labeling the text to be labeled with prosodic symbols according to the prosodic information in the voice data;
updating the prosodic symbols already labeled in the text to be labeled according to the prosodic information in the text data.
By first labeling the text on the basis of the prosodic information in the voice data and then updating the labels according to the prosodic information in the text data, the text-based prosody prediction result is applied on top of the speech-based prosody analysis result, which further improves the efficiency and accuracy of prosodic labeling.
Optionally, updating the prosodic symbols already labeled in the text to be labeled according to the prosodic information in the text data may include:
if the prosodic information in the text data indicates that no prosodic symbol should be labeled at the position of at least one prosodic symbol already labeled in the text to be labeled, deleting the at least one labeled prosodic symbol.
Specifically, the prosodic information in the text data is the text-based prosody prediction result determined from the text data of the text to be labeled. It usually reflects the pause durations at positions where a pause is grammatically possible, and it also identifies positions where no pause may occur. In some scenarios, the prosodic information in the text data indicates that no prosodic symbol should appear at the position of at least one symbol already labeled in the text to be labeled. For example, there is usually no pause inside a fixed expression such as a set phrase, an idiom, or a common saying. In such cases, the at least one already-labeled prosodic symbol can be deleted according to the prosodic information in the text data, which further improves the accuracy of prosodic labeling.
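A minimal sketch of this deletion step, assuming the text-based prediction is delivered as a set of character offsets (in the unannotated text) after which no pause may be marked; this representation is an illustrative choice, not something the patent specifies.

```python
import re

def remove_forbidden(annotated: str, forbidden: set[int]) -> str:
    """Drop prosodic symbols ("#1".."#4") that follow a character offset
    listed in `forbidden`; keep all other symbols and text unchanged."""
    out: list[str] = []
    plain_idx = -1                      # offset of the last plain character
    i = 0
    while i < len(annotated):
        m = re.match(r"#[1-4]", annotated[i:])
        if m:
            if plain_idx not in forbidden:
                out.append(m.group())   # symbol is allowed at this position
            i += 2                      # skip the 2-character symbol
        else:
            out.append(annotated[i])
            plain_idx += 1
            i += 1
    return "".join(out)

# A symbol inside a fixed expression (after offset 1 here) is removed:
print(remove_forbidden("xx#2xx#3", {1}))  # -> xxxx#3
```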
Optionally, S102 (determining the prosodic information in the voice data according to the voice data) may include:
obtaining at least one silent segment in the voice data according to the voice data;
for each silent segment, determining, according to the silent segment, the prosodic information corresponding to that silent segment in the voice data.
Specifically, at least one silent segment is obtained from the voice data; the duration of a silent segment is a pause duration in the voice data.
Optionally, for each silent segment, determining the prosodic information corresponding to that silent segment may include:
obtaining the duration of the silent segment from its start time and end time in the voice data.
This is illustrated by the following example.
Suppose a silent segment starts at 00:22:07:300 and ends at 00:22:07:360, so its duration is 60 ms. Referring to Table 1, and assuming t3 = 30 ms and t4 = 90 ms, the prosodic symbol labeled in the text to be labeled according to this silent segment's duration is #2.
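A small worked check of this arithmetic, assuming the hh:mm:ss:ms timestamp format implied by the example:

```python
def to_ms(ts: str) -> int:
    """Convert an hh:mm:ss:ms timestamp (assumed format) to milliseconds."""
    h, m, s, ms = (int(x) for x in ts.split(":"))
    return ((h * 60 + m) * 60 + s) * 1000 + ms

duration = to_ms("00:22:07:360") - to_ms("00:22:07:300")
print(duration)  # 60 -> falls in [t3, t4) with t3 = 30 ms, t4 = 90 ms, hence "#2"
```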
Optionally, obtaining the at least one silent segment in the voice data according to the voice data may include:
performing phoneme segmentation on the text data of the text to be labeled to obtain a voice annotation sequence;
performing phoneme segmentation on the voice data according to the voice annotation sequence, the voice data, and a preset acoustic model, to obtain the at least one silent segment in the voice data, wherein the preset acoustic model is used to represent the speech features corresponding to different phonemes.
Specifically, a phoneme is the smallest speech unit obtained by dividing speech according to sound quality. Performing phoneme segmentation on the text data of the text to be labeled divides the text data into a series of temporally adjacent segments corresponding to phonemes; this sequence of segments is called the voice annotation sequence. The preset acoustic model describes the speech features corresponding to different phonemes. According to the voice annotation sequence, the voice data, and the preset acoustic model, the voice data can be segmented into phonemes, yielding the at least one silent segment in the voice data.
It should be noted that this embodiment places no restriction on the phoneme segmentation method; any existing method may be used, for example an automatic speech segmentation algorithm based on the hidden Markov model (HMM). In such an algorithm, an HMM-based model can, for a given annotation sequence, use the Viterbi algorithm to force-align the speech signal with the HMM sequence corresponding to the phonetic annotation units (phonemes).
It should also be noted that this embodiment places no restriction on the type of the preset acoustic model or on how it is obtained. For example, the preset acoustic model may be trained with the open-source toolkit Kaldi, using the voice data whose prosody is to be predicted together with the corresponding text. As another example, the preset acoustic model may be obtained with a deep neural network (DNN) algorithm. Optionally, when the amount of voice data is small, the preset acoustic model may be a GMM-HMM acoustic model; when the amount of voice data is large, the preset acoustic model may be a DNN-HMM model.
Optionally, performing phoneme segmentation on the text data of the text to be labeled to obtain the voice annotation sequence may include:
performing phoneme segmentation on the text data of the text to be labeled, and inserting a pause symbol between every two adjacent words in the text to be labeled, to obtain the voice annotation sequence.
This is illustrated by the following example.
Suppose the phonemes consist of initials and finals, and the text to be labeled is "Hello, dear motherland." in Chinese, whose text data in pinyin is "ni hao, qin ai de zu guo". The voice annotation sequence can then be "n i sp h ao sp q in sp ai sp d e sp z u sp g uo", where sp denotes the pause symbol.
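A minimal sketch of building such a sequence from the pinyin syllables. The initial/final split table covers only this example and is an assumption standing in for a full pinyin lexicon.

```python
# Initial/final splits for this example only; "ai" has a zero initial.
SPLITS = {
    "ni": ("n", "i"), "hao": ("h", "ao"), "qin": ("q", "in"),
    "ai": ("", "ai"), "de": ("d", "e"), "zu": ("z", "u"), "guo": ("g", "uo"),
}

def to_annotation_sequence(syllables: list[str]) -> str:
    """Split each syllable into phonemes and insert the pause symbol "sp"
    between adjacent syllables."""
    units: list[str] = []
    for k, syl in enumerate(syllables):
        if k > 0:
            units.append("sp")
        initial, final = SPLITS[syl]
        units.extend(p for p in (initial, final) if p)
    return " ".join(units)

print(to_annotation_sequence("ni hao qin ai de zu guo".split()))
# -> n i sp h ao sp q in sp ai sp d e sp z u sp g uo
```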
This embodiment provides a prosodic labeling method comprising: obtaining voice data of a text to be labeled; determining prosodic information in the voice data according to the voice data; and labeling the text to be labeled with prosodic symbols according to the prosodic information in the voice data. By labeling the text to be labeled according to its voice data, the prosodic labeling method provided by this embodiment improves the efficiency and accuracy of prosodic labeling.
Fig. 2 is a schematic structural diagram of the prosodic labeling apparatus provided by an embodiment of the present invention. The apparatus provided by this embodiment is configured to perform the prosodic labeling method of the embodiment shown in Fig. 1. As shown in Fig. 2, the prosodic labeling apparatus provided by this embodiment may include:
a first obtaining module 11, configured to obtain voice data of a text to be labeled;
a prosodic information determining module 12, configured to determine, according to the voice data, prosodic information in the voice data, the prosodic information being used to indicate pause durations in the voice data;
a labeling module 13, configured to label the text to be labeled with prosodic symbols according to the prosodic information in the voice data.
Optionally, the apparatus further includes a second obtaining module 14.
The second obtaining module 14 is configured to obtain prosodic information in the text data of the text to be labeled.
The labeling module 13 is then specifically configured to:
label the text to be labeled with prosodic symbols according to the prosodic information in the voice data and the prosodic information in the text data.
Optionally, the labeling module 13 is specifically configured to:
label the text to be labeled with prosodic symbols according to the prosodic information in the voice data;
update the prosodic symbols already labeled in the text to be labeled according to the prosodic information in the text data.
Optionally, the labeling module 13 is specifically configured to:
if the prosodic information in the text data indicates that no prosodic symbol should be labeled at the position of at least one prosodic symbol already labeled in the text to be labeled, delete the at least one labeled prosodic symbol.
Optionally, the prosodic information determining module 12 is specifically configured to:
obtain at least one silent segment in the voice data according to the voice data;
for each silent segment, determine, according to the silent segment, the prosodic information corresponding to that silent segment in the voice data.
Optionally, the prosodic information determining module 12 is specifically configured to:
perform phoneme segmentation on the text data of the text to be labeled to obtain a voice annotation sequence;
perform phoneme segmentation on the voice data according to the voice annotation sequence, the voice data, and a preset acoustic model, to obtain the at least one silent segment in the voice data, wherein the preset acoustic model is used to represent the speech features corresponding to different phonemes.
The prosodic labeling apparatus provided by this embodiment is configured to perform the prosodic labeling method of the embodiment shown in Fig. 1; its principle and technical effects are similar and are not repeated here.
Fig. 3 is a schematic structural diagram of the prosodic labeling equipment provided by an embodiment of the present invention. The equipment provided by this embodiment is configured to perform the prosodic labeling method of the embodiment shown in Fig. 1.
As shown in Fig. 3, the prosodic labeling equipment may include a processor 21 and a memory 22. The memory 22 is configured to store instructions, and the processor 21 is configured to execute the instructions stored in the memory 22, so that the prosodic labeling equipment performs the prosodic labeling method provided by the embodiment shown in Fig. 1; the specific implementation and technical effects are similar and are not repeated here.
An embodiment of the present invention also provides a storage medium storing instructions that, when run on a computer, cause the computer to perform the prosodic labeling method of the embodiment shown in Fig. 1.
An embodiment of the present invention also provides a program product comprising a computer program stored in a storage medium. At least one processor can read the computer program from the storage medium, and the at least one processor, when executing the computer program, can implement the prosodic labeling method of the embodiment shown in Fig. 1.
Those of ordinary skill in the art will appreciate that all or part of the steps of the above method embodiments can be completed by hardware related to program instructions. The aforementioned program may be stored in a computer-readable storage medium; when executed, the program performs the steps of the above method embodiments. The aforementioned storage medium includes read-only memory (ROM), random access memory (RAM), magnetic disks, optical disks, and other media that can store program code.
Finally, it should be noted that the above embodiments are only intended to illustrate, not to limit, the technical solutions of the present invention. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that the technical solutions described in the foregoing embodiments can still be modified, or some or all of their technical features can be replaced with equivalents, and that such modifications or replacements do not remove the essence of the corresponding technical solutions from the scope of the technical solutions of the embodiments of the present invention.
Claims (10)
1. A prosodic labeling method, characterized by comprising:
obtaining voice data of a text to be labeled;
determining, according to the voice data, prosodic information in the voice data, the prosodic information being used to indicate pause durations in the voice data;
labeling the text to be labeled with prosodic symbols according to the prosodic information in the voice data.
2. The method according to claim 1, characterized by further comprising:
obtaining prosodic information in text data of the text to be labeled;
wherein labeling the text to be labeled with prosodic symbols according to the prosodic information in the voice data comprises:
labeling the text to be labeled with prosodic symbols according to the prosodic information in the voice data and the prosodic information in the text data.
3. The method according to claim 2, characterized in that labeling the text to be labeled with prosodic symbols according to the prosodic information in the voice data and the prosodic information in the text data comprises:
labeling the text to be labeled with prosodic symbols according to the prosodic information in the voice data;
updating the prosodic symbols already labeled in the text to be labeled according to the prosodic information in the text data.
4. The method according to claim 3, characterized in that updating the prosodic symbols already labeled in the text to be labeled according to the prosodic information in the text data comprises:
if the prosodic information in the text data indicates that no prosodic symbol should be labeled at the position of at least one prosodic symbol already labeled in the text to be labeled, deleting the at least one labeled prosodic symbol.
5. The method according to any one of claims 1 to 4, characterized in that determining the prosodic information in the voice data according to the voice data comprises:
obtaining at least one silent segment in the voice data according to the voice data;
for each silent segment, determining, according to the silent segment, the prosodic information corresponding to that silent segment in the voice data.
6. The method according to claim 5, characterized in that obtaining the at least one silent segment in the voice data according to the voice data comprises:
performing phoneme segmentation on the text data of the text to be labeled to obtain a voice annotation sequence;
performing phoneme segmentation on the voice data according to the voice annotation sequence, the voice data, and a preset acoustic model, to obtain the at least one silent segment in the voice data, wherein the preset acoustic model is used to represent speech features corresponding to different phonemes.
7. A prosodic labeling apparatus, characterized by comprising:
a first obtaining module, configured to obtain voice data of a text to be labeled;
a prosodic information determining module, configured to determine, according to the voice data, prosodic information in the voice data, the prosodic information being used to indicate pause durations in the voice data;
a labeling module, configured to label the text to be labeled with prosodic symbols according to the prosodic information in the voice data.
8. The apparatus according to claim 7, characterized by further comprising a second obtaining module;
the second obtaining module being configured to obtain prosodic information in text data of the text to be labeled;
the labeling module being specifically configured to:
label the text to be labeled with prosodic symbols according to the prosodic information in the voice data and the prosodic information in the text data.
9. The apparatus according to claim 8, characterized in that the labeling module is specifically configured to:
label the text to be labeled with prosodic symbols according to the prosodic information in the voice data;
update the prosodic symbols already labeled in the text to be labeled according to the prosodic information in the text data.
10. A prosodic labeling equipment, characterized by comprising a memory and a processor;
the memory being configured to store program instructions;
the processor being configured to call the program instructions stored in the memory to implement the prosodic labeling method according to any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810988973.9A CN109326281B (en) | 2018-08-28 | 2018-08-28 | Rhythm labeling method, device and equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109326281A (en) | 2019-02-12 |
CN109326281B CN109326281B (en) | 2020-01-07 |
Family
ID=65263729
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810988973.9A Active CN109326281B (en) | 2018-08-28 | 2018-08-28 | Rhythm labeling method, device and equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109326281B (en) |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20050267758A1 (en) * | 2004-05-31 | 2005-12-01 | International Business Machines Corporation | Converting text-to-speech and adjusting corpus |
CN104916284A (en) * | 2015-06-10 | 2015-09-16 | 百度在线网络技术(北京)有限公司 | Prosody and acoustics joint modeling method and device for voice synthesis system |
CN105244020A (en) * | 2015-09-24 | 2016-01-13 | 百度在线网络技术(北京)有限公司 | Prosodic hierarchy model training method, text-to-speech method and text-to-speech device |
CN105355193A (en) * | 2015-10-30 | 2016-02-24 | 百度在线网络技术(北京)有限公司 | Speech synthesis method and device |
CN106601228A (en) * | 2016-12-09 | 2017-04-26 | 百度在线网络技术(北京)有限公司 | Sample marking method and device based on artificial intelligence prosody prediction |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111105785A (en) * | 2019-12-17 | 2020-05-05 | 广州多益网络股份有限公司 | Text prosodic boundary identification method and device |
CN111161725A (en) * | 2019-12-17 | 2020-05-15 | 珠海格力电器股份有限公司 | Voice interaction method and device, computing equipment and storage medium |
CN111161725B (en) * | 2019-12-17 | 2022-09-27 | 珠海格力电器股份有限公司 | Voice interaction method and device, computing equipment and storage medium |
CN111754978A (en) * | 2020-06-15 | 2020-10-09 | 北京百度网讯科技有限公司 | Rhythm hierarchy marking method, device, equipment and storage medium |
CN111754978B (en) * | 2020-06-15 | 2023-04-18 | 北京百度网讯科技有限公司 | Prosodic hierarchy labeling method, device, equipment and storage medium |
WO2023045433A1 (en) * | 2021-09-24 | 2023-03-30 | 华为云计算技术有限公司 | Prosodic information labeling method and related device |
CN115116427A (en) * | 2022-06-22 | 2022-09-27 | 马上消费金融股份有限公司 | Labeling method, voice synthesis method, training method and device |
CN115116427B (en) * | 2022-06-22 | 2023-11-14 | 马上消费金融股份有限公司 | Labeling method, voice synthesis method, training method and training device |
CN116030789A (en) * | 2022-12-28 | 2023-04-28 | 南京硅基智能科技有限公司 | Method and device for generating speech synthesis training data |
CN116030789B (en) * | 2022-12-28 | 2024-01-26 | 南京硅基智能科技有限公司 | Method and device for generating speech synthesis training data |
Also Published As
Publication number | Publication date |
---|---|
CN109326281B (en) | 2020-01-07 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||