CN109087627A - Method and apparatus for generating information - Google Patents
- Publication number: CN109087627A
- Application number: CN201811202290.2A
- Authority
- CN
- China
- Prior art keywords
- voice
- syllable
- fundamental frequency
- stress
- information
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
Abstract
Embodiments of the present application disclose a method and apparatus for generating information. One specific embodiment of the method includes: for each syllable of the speech in a preset speech set, extracting the fundamental frequency corresponding to the syllable to obtain a fundamental frequency sequence for the syllable, and obtaining fundamental frequency feature information for the syllable from the fundamental frequency sequence; counting the fundamental frequency feature information of the syllables of the speech in the speech set to obtain a statistical result; and generating, according to the statistical result, stress information for the syllables of the speech in the speech set. This embodiment realizes automatic generation of stress information for the syllables of the speech in a speech set.
Description
Technical field
Embodiments of the present application relate to the field of computer technology, and in particular to a method and apparatus for generating information.
Background art
With the development of speech synthesis technology, synthesized speech has made great progress in both intelligibility and naturalness. However, current synthesized speech still sounds rather flat and monotonous. Unlike synthesized speech, people from different regions speak with different pronunciation habits, and stress (accent) can describe the differences between these habits. Therefore, adding stress information when synthesizing speech can produce speech that sounds more "human". At present, a speech synthesis system generally includes multiple models, for example a text model, a prediction model, an acoustic model, and so on. To synthesize speech with stress, the models in the speech synthesis system need to be trained on training data annotated with stress. Annotating the stress in the training data manually not only requires the annotators to be very familiar with the speaker's pronunciation habits, but also consumes considerable manpower and money.
Summary of the invention
Embodiments of the present application propose a method and apparatus for generating information.
In a first aspect, an embodiment of the present application provides a method for generating information, the method comprising: for each syllable of the speech in a preset speech set, extracting the fundamental frequency corresponding to the syllable to obtain a fundamental frequency sequence for the syllable, and obtaining fundamental frequency feature information for the syllable from the fundamental frequency sequence; counting the fundamental frequency feature information of the syllables of the speech in the speech set to obtain a statistical result; and generating, according to the statistical result, stress information for the syllables of the speech in the speech set.
In some embodiments, the method further includes: performing stress annotation on the syllables of the speech in the speech set using the generated stress information; and training a stress prediction model based on the stress-annotated speech set, wherein the stress prediction model is used to predict stress information for the syllables of the speech corresponding to a text.
In some embodiments, training the stress prediction model based on the stress-annotated speech set comprises: obtaining text feature information corresponding to the speech in the stress-annotated speech set; and training the stress prediction model by taking the text feature information of the speech in the stress-annotated speech set as input and the stress annotation result of the speech corresponding to the input text feature information as the desired output.
In some embodiments, the fundamental frequency feature information of a syllable includes the fundamental frequency amplitude of the syllable; and generating, according to the statistical result, the stress information for the syllables of the speech in the speech set comprises: determining, according to the statistical result, a fundamental frequency amplitude threshold for the syllables of the speech in the speech set; and generating the stress information for the syllables of the speech in the speech set according to the determined fundamental frequency amplitude threshold.
In some embodiments, the fundamental frequency feature information of a syllable includes the fundamental frequency duration of the syllable; and generating, according to the statistical result, the stress information for the syllables of the speech in the speech set comprises: determining, according to the statistical result, a fundamental frequency duration threshold for the syllables of the speech in the speech set; and generating the stress information for the syllables of the speech in the speech set according to the determined fundamental frequency duration threshold.
In a second aspect, an embodiment of the present application provides an apparatus for generating information, the apparatus comprising: an extraction unit, configured to, for each syllable of the speech in a preset speech set, extract the fundamental frequency corresponding to the syllable to obtain a fundamental frequency sequence for the syllable, and obtain fundamental frequency feature information for the syllable from the fundamental frequency sequence; a statistics unit, configured to count the fundamental frequency feature information of the syllables of the speech in the speech set to obtain a statistical result; and a generation unit, configured to generate, according to the statistical result, stress information for the syllables of the speech in the speech set.
In some embodiments, the apparatus further includes: an annotation unit, configured to perform stress annotation on the syllables of the speech in the speech set using the generated stress information; and a training unit, configured to train a stress prediction model based on the stress-annotated speech set, wherein the stress prediction model is used to predict stress information for the syllables of the speech corresponding to a text.
In some embodiments, the training unit is further configured to: obtain text feature information corresponding to the speech in the stress-annotated speech set; and train the stress prediction model by taking the text feature information of the speech in the stress-annotated speech set as input and the stress annotation result of the speech corresponding to the input text feature information as the desired output.
In some embodiments, the fundamental frequency feature information of a syllable includes the fundamental frequency amplitude of the syllable; and the generation unit is further configured to: determine, according to the statistical result, a fundamental frequency amplitude threshold for the syllables of the speech in the speech set; and generate the stress information for the syllables of the speech in the speech set according to the determined fundamental frequency amplitude threshold.
In some embodiments, the fundamental frequency feature information of a syllable includes the fundamental frequency duration of the syllable; and the generation unit is further configured to: determine, according to the statistical result, a fundamental frequency duration threshold for the syllables of the speech in the speech set; and generate the stress information for the syllables of the speech in the speech set according to the determined fundamental frequency duration threshold.
In a third aspect, an embodiment of the present application provides a device, the device comprising: one or more processors; and a storage apparatus storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method described in any implementation of the first aspect.
In a fourth aspect, an embodiment of the present application provides a computer-readable medium storing a computer program which, when executed by a processor, implements the method described in any implementation of the first aspect.
According to the method and apparatus for generating information provided by the embodiments of the present application, for each syllable of the speech in a speech set, the fundamental frequency corresponding to the syllable is extracted to obtain a fundamental frequency sequence for the syllable, and fundamental frequency feature information for the syllable is obtained from the fundamental frequency sequence; the fundamental frequency feature information of the syllables of the speech in the speech set is then counted to obtain a statistical result; and finally, stress information for the syllables of the speech in the speech set is generated according to the statistical result. Stress information for the syllables of the speech in the speech set is thus generated automatically, which, compared with manual annotation, improves the efficiency of information generation, avoids the influence of human factors on the result, and reduces cost.
Brief description of the drawings
Other features, objects, and advantages of the present application will become more apparent upon reading the following detailed description of non-restrictive embodiments with reference to the accompanying drawings:
Fig. 1 is an exemplary system architecture diagram to which an embodiment of the present application may be applied;
Fig. 2 is a flowchart of one embodiment of the method for generating information according to the present application;
Fig. 3 is a schematic distribution diagram of exemplary fundamental frequency amplitudes according to the present application;
Fig. 4 is a schematic diagram of an application scenario of the method for generating information according to the present application;
Fig. 5 is a flowchart of another embodiment of the method for generating information according to the present application;
Fig. 6 is a structural schematic diagram of one embodiment of the apparatus for generating information according to the present application;
Fig. 7 is a structural schematic diagram of a computer system adapted for implementing a device of an embodiment of the present application.
Detailed description of embodiments
The present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the related invention, rather than to limit the invention. It should also be noted that, for ease of description, only the parts relevant to the related invention are shown in the drawings.
It should be noted that, in the absence of conflict, the embodiments of the present application and the features in the embodiments may be combined with each other. The present application is described in detail below with reference to the accompanying drawings and in conjunction with the embodiments.
Fig. 1 shows an exemplary system architecture 100 to which the method for generating information or the apparatus for generating information of an embodiment of the present application may be applied.
As shown in Fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired or wireless communication links, or fiber optic cables.
A user may use the terminal devices 101, 102, 103 to interact with the server 105 through the network 104 to receive or send messages and the like. Various communication client applications may be installed on the terminal devices 101, 102, 103, such as speech synthesis applications, shopping applications, search applications, instant messaging tools, email clients, social platform software, and the like.
The terminal devices 101, 102, 103 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices capable of processing speech, including but not limited to smartphones, tablet computers, e-book readers, MP3 players (Moving Picture Experts Group Audio Layer III), MP4 players (Moving Picture Experts Group Audio Layer IV), laptop portable computers, desktop computers, and the like. When the terminal devices 101, 102, 103 are software, they may be installed in the electronic devices listed above. They may be implemented as multiple pieces of software or software modules (for example, for providing distributed services), or as a single piece of software or software module. No specific limitation is made here.
The server 105 may be a server that provides various services, for example a background server that supports information displayed on the terminal devices 101, 102, 103. The background server may analyze and otherwise process data such as a received speech set, and feed the processing result (for example, stress information) back to the terminal devices.
It should be noted that the server may be hardware or software. When the server is hardware, it may be implemented as a distributed server cluster composed of multiple servers, or as a single server. When the server is software, it may be implemented as multiple pieces of software or software modules (for example, for providing distributed services), or as a single piece of software or software module. No specific limitation is made here.
It should be noted that the method for generating information provided by the embodiments of the present application may be executed by the terminal devices 101, 102, 103, or by the server 105. Correspondingly, the apparatus for generating information may be provided in the terminal devices 101, 102, 103, or in the server 105. The present application does not limit this.
It should be understood that the numbers of terminal devices, networks, and servers in Fig. 1 are merely illustrative. There may be any number of terminal devices, networks, and servers according to implementation needs.
With continued reference to Fig. 2, a flow 200 of one embodiment of the method for generating information according to the present application is shown. The method for generating information comprises the following steps:
Step 201: for each syllable of the speech in a preset speech set, extract the fundamental frequency corresponding to the syllable to obtain a fundamental frequency sequence for the syllable, and obtain fundamental frequency feature information for the syllable from the fundamental frequency sequence.
In this embodiment, the execution body of the method for generating information (for example, the terminal devices 101, 102, 103 or the server 105 shown in Fig. 1) may locally pre-store a speech set, which may serve as a repository of stored speech. As an example, the speech in the speech set may be recordings of a single person, or recordings of a group of people, for example people from the same region who share the same pronunciation habits. The speech in the speech set may be in various languages, for example English, Chinese, and so on.
For each syllable of each piece of speech in the speech set, the execution body may first extract the fundamental frequency corresponding to the syllable to obtain a fundamental frequency sequence for the syllable. Here, a syllable is the most natural structural unit of speech, and the fundamental frequency is the frequency of the fundamental tone. A sound is generally a compound of a series of vibrations of different frequencies and amplitudes emitted by a sounding body; among these vibrations, the one with the lowest frequency produces the fundamental tone, and the rest produce overtones. As an example, the execution body may segment the speech into multiple syllables. For each of the multiple syllables, the execution body may extract the fundamental frequency corresponding to the syllable, thereby obtaining a fundamental frequency sequence for the syllable. For example, the execution body may extract the fundamental frequency of the syllable once every set interval (for example, 5 milliseconds) to obtain the fundamental frequency sequence of the syllable. Afterwards, the execution body may obtain the fundamental frequency feature information for the syllable from the fundamental frequency sequence. As an example, the fundamental frequency feature information may include, but is not limited to, the fundamental frequency maximum, the fundamental frequency minimum, the fundamental frequency amplitude, the fundamental frequency duration, and the like, where the fundamental frequency amplitude may refer to the difference between the fundamental frequency maximum and the fundamental frequency minimum, and the fundamental frequency duration may refer to the duration occupied by non-zero fundamental frequencies in the fundamental frequency sequence. It should be noted that speech segmentation and fundamental frequency extraction are well-known techniques that are currently widely studied and applied, and will not be described again here.
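As a concrete illustration of the feature extraction above, the following sketch computes the fundamental frequency maximum, minimum, amplitude, and duration from one syllable's frame-level fundamental frequency sequence. The function name, the assumption of one value per 5-millisecond frame with 0 marking unvoiced frames, and the sample values are illustrative, not part of the patent.

```python
FRAME_MS = 5  # assumed extraction interval, from the 5 ms example in the text

def f0_features(f0_sequence):
    """Compute max, min, amplitude, and voiced duration of one syllable's F0 sequence."""
    voiced = [f for f in f0_sequence if f > 0]  # keep non-zero (voiced) F0 frames only
    if not voiced:
        return {"f0_max": 0.0, "f0_min": 0.0, "amplitude": 0.0, "duration_ms": 0}
    f0_max, f0_min = max(voiced), min(voiced)
    return {
        "f0_max": f0_max,
        "f0_min": f0_min,
        # amplitude = maximum minus minimum, per the definition in the description
        "amplitude": f0_max - f0_min,
        # duration = time covered by non-zero F0 frames
        "duration_ms": len(voiced) * FRAME_MS,
    }

feats = f0_features([0, 180.0, 195.0, 210.0, 205.0, 0])
print(feats)  # amplitude 30.0 Hz over 20 ms of voiced frames
```

In practice the fundamental frequency sequence would come from a pitch tracker run over the segmented syllable; here it is supplied directly for clarity.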
Step 202: count the fundamental frequency feature information of the syllables of the speech in the speech set to obtain a statistical result.
In this embodiment, the execution body may count the fundamental frequency feature information of all (or some of) the syllables of the speech in the speech set to obtain a statistical result. Taking the case where the fundamental frequency feature information includes the fundamental frequency amplitude as an example, the execution body may sort the syllables in the speech set in descending order of fundamental frequency amplitude and count the number of syllables at each fundamental frequency amplitude, obtaining a fundamental frequency amplitude distribution diagram as shown in Fig. 3, in which the abscissa represents the logarithm of the fundamental frequency amplitude and the ordinate represents the number of syllables. As an example, the execution body may also compute statistics such as the mean, variance, and quartiles of the fundamental frequency feature information of the syllables of at least one piece of speech in the speech set to obtain the statistical result.
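The statistics described in this step can be sketched as follows, assuming per-syllable amplitudes have already been extracted; the helper name, bucket width, and sample amplitudes are illustrative assumptions.

```python
import math
import statistics

def amplitude_statistics(amplitudes):
    """Mean, variance, quartiles, and a log-amplitude histogram like Fig. 3."""
    quartiles = statistics.quantiles(amplitudes, n=4)  # Q1, Q2 (median), Q3
    # histogram over log-amplitude, mirroring the abscissa of the distribution diagram
    hist = {}
    for a in amplitudes:
        bucket = round(math.log(a), 1)  # 0.1-wide buckets in log space (assumed)
        hist[bucket] = hist.get(bucket, 0) + 1
    return {
        "mean": statistics.mean(amplitudes),
        "variance": statistics.pvariance(amplitudes),
        "quartiles": quartiles,
        "log_histogram": hist,
    }

stats = amplitude_statistics([12.0, 25.0, 30.0, 41.0, 55.0, 80.0])
print(stats["mean"], stats["quartiles"])
```

The resulting summary (or the histogram) is the "statistical result" consumed by step 203 below.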
Step 203: generate, according to the statistical result, stress information for the syllables of the speech in the speech set.
In this embodiment, the execution body may generate, according to the statistical result of step 202, the stress information for the syllables of the speech in the speech set. Here, the stress information of a syllable may be of two types, "yes" and "no", where "yes" indicates that the syllable needs to be stressed and "no" indicates that it does not. As an example, the stress information may be represented by 1 and 0, with 1 indicating "yes" and 0 indicating "no".
In some optional implementations of this embodiment, the fundamental frequency feature information of a syllable may include the fundamental frequency amplitude of the syllable, and step 203 may be carried out as follows:
First, the execution body may determine, according to the statistical result, a fundamental frequency amplitude threshold for the syllables of the speech in the speech set. As an example, the execution body may sort the syllables of the speech in the speech set in descending order of fundamental frequency amplitude, take the syllables ranked in the top 20% as target syllables, determine the target syllable with the smallest fundamental frequency amplitude among the target syllables, and use the fundamental frequency amplitude of that target syllable as the fundamental frequency amplitude threshold.
Afterwards, the execution body may generate the stress information for the syllables of the speech in the speech set according to the determined fundamental frequency amplitude threshold. As an example, the execution body may compare the fundamental frequency amplitude of each syllable of each piece of speech in the speech set with the fundamental frequency amplitude threshold; if the fundamental frequency amplitude of the syllable is greater than the threshold, the stress information "yes" is generated, and if it is less than the threshold, the stress information "no" is generated. In practice, the fundamental frequency amplitude of a stressed syllable is generally higher than that of an unstressed syllable. Therefore, determining the fundamental frequency amplitude threshold from the statistical result and generating the stress information based on that threshold can make the generated stress information more accurate.
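The top-20% thresholding example above can be sketched as follows; the function names and sample amplitudes are illustrative assumptions, not part of the patent.

```python
def amplitude_threshold(amplitudes, top_fraction=0.2):
    """Return the smallest amplitude among the top `top_fraction` of syllables."""
    ranked = sorted(amplitudes, reverse=True)          # descending by amplitude
    cutoff = max(1, int(len(ranked) * top_fraction))   # size of the top group
    return ranked[cutoff - 1]                          # smallest amplitude in that group

def stress_info(amplitude, threshold):
    """'yes' if the syllable's amplitude exceeds the threshold, else 'no'."""
    return "yes" if amplitude > threshold else "no"

amps = [12.0, 25.0, 30.0, 41.0, 55.0, 80.0, 14.0, 22.0, 63.0, 9.0]
thr = amplitude_threshold(amps)  # top 20% of 10 syllables -> 2 syllables
print(thr, stress_info(80.0, thr), stress_info(12.0, thr))
```

Other cutoffs or rounding conventions for the top group are equally plausible readings of "the top 20%"; the sketch fixes one for concreteness.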
In some optional implementations of this embodiment, the fundamental frequency feature information of a syllable may include the fundamental frequency duration of the syllable, and step 203 may be carried out as follows:
First, the execution body may determine, according to the statistical result, a fundamental frequency duration threshold for the syllables of the speech in the speech set. As an example, the execution body may sort the syllables of the speech in the speech set in descending order of fundamental frequency duration, take the syllables ranked in the top 20% as target syllables, determine the target syllable with the smallest fundamental frequency duration among the target syllables, and use the fundamental frequency duration of that target syllable as the fundamental frequency duration threshold.
Afterwards, the execution body may generate the stress information for the syllables of the speech in the speech set according to the determined fundamental frequency duration threshold. As an example, the execution body may compare the fundamental frequency duration of each syllable of each piece of speech in the speech set with the fundamental frequency duration threshold; if the fundamental frequency duration of the syllable is greater than the threshold, the stress information "yes" is generated, and if it is less than the threshold, the stress information "no" is generated. In practice, the fundamental frequency duration of a stressed syllable is generally longer than that of an unstressed syllable. Therefore, determining the fundamental frequency duration threshold from the statistical result and generating the stress information based on that threshold can make the generated stress information more accurate.
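The duration-based variant mirrors the amplitude-based one; this sketch applies it end-to-end to a toy list of per-syllable voiced durations. The function name and all values are illustrative assumptions.

```python
def duration_threshold(durations_ms, top_fraction=0.2):
    """Smallest F0 duration among the top `top_fraction` of syllables."""
    ranked = sorted(durations_ms, reverse=True)        # descending by duration
    cutoff = max(1, int(len(ranked) * top_fraction))   # size of the top group
    return ranked[cutoff - 1]

durations = [40, 55, 120, 65, 150, 70, 45, 90, 60, 50]  # ms of voiced F0 per syllable
thr = duration_threshold(durations)
labels = ["yes" if d > thr else "no" for d in durations]
print(thr, labels)
```

A system could also combine the two implementations, e.g. by requiring both thresholds to be exceeded, though the description presents them separately.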
Continuing to refer to Fig. 4, Fig. 4 is a schematic diagram of an application scenario of the method for generating information according to this embodiment. In the application scenario of Fig. 4, multiple people from the same region, with the same pronunciation habits, record a number of pieces of speech in advance, forming a speech set A. For each syllable of each piece of speech in speech set A, the terminal device 401 may first extract the fundamental frequency corresponding to the syllable to obtain a fundamental frequency sequence for the syllable, and obtain fundamental frequency feature information for the syllable from the fundamental frequency sequence. Afterwards, the terminal device 401 may count the fundamental frequency feature information of each syllable of each piece of speech in speech set A to obtain a statistical result. Finally, the terminal device 401 generates, according to the statistical result, the stress information for each syllable of each piece of speech in speech set A.
The method provided by the above embodiment of the present application realizes automatic generation of the stress information for the syllables of the speech in a speech set. Compared with manual annotation, it improves the efficiency of information generation, avoids the influence of human factors on the result, and reduces cost.
With further reference to Fig. 5, a flow 500 of another embodiment of the method for generating information is shown. The flow 500 of the method for generating information comprises the following steps:
Step 501: for each syllable of the speech in a preset speech set, extract the fundamental frequency corresponding to the syllable to obtain a fundamental frequency sequence for the syllable, and obtain fundamental frequency feature information for the syllable from the fundamental frequency sequence.
In this embodiment, step 501 is similar in principle to step 201 in the embodiment shown in Fig. 2, and will not be described again here.
Step 502: count the fundamental frequency feature information of the syllables of the speech in the speech set to obtain a statistical result.
In this embodiment, step 502 is similar in principle to step 202 in the embodiment shown in Fig. 2, and will not be described again here.
Step 503: generate, according to the statistical result, stress information for the syllables of the speech in the speech set.
In this embodiment, step 503 is similar in principle to step 203 in the embodiment shown in Fig. 2, and will not be described again here.
Step 504: perform stress annotation on the syllables of the speech in the speech set using the generated stress information.
In this embodiment, the execution body may use the stress information generated in step 503 to perform stress annotation on the syllables of the speech in the speech set. As an example, for the speech in the speech set, the execution body may annotate syllables whose stress information is "yes" as "1" and syllables whose stress information is "no" as "0".
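The annotation in step 504 amounts to mapping "yes"/"no" stress information to "1"/"0" labels on each syllable; the data layout below (a list of syllable/stress pairs) is an illustrative assumption.

```python
def annotate(syllables_with_stress):
    """Map each syllable's stress information to a '1'/'0' annotation."""
    return [(syllable, "1" if stress == "yes" else "0")
            for syllable, stress in syllables_with_stress]

annotated = annotate([("syl_1", "yes"), ("syl_2", "no"), ("syl_3", "yes")])
print(annotated)  # [('syl_1', '1'), ('syl_2', '0'), ('syl_3', '1')]
```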
Step 505: train a stress prediction model based on the stress-annotated speech set.
In this embodiment, the execution body may train the stress prediction model based on the stress-annotated speech set. Here, the stress prediction model may be used to predict the stress information of the syllables of the speech corresponding to a text. As an example, the stress prediction model may be a machine learning model, which may include but is not limited to a DNN (Deep Neural Network), an SVM (Support Vector Machine), an LSTM (Long Short-Term Memory) network, a CRF (Conditional Random Field), an attention mechanism, WaveNet, and the like. In some usage scenarios, an acoustic model may also be trained based on the stress-annotated speech set; the acoustic model may be used to characterize the correspondence between text information (for example, text length, number of words, number of syllables, syllable positions, and the like) and acoustic parameters. As an example, the acoustic model may include but is not limited to an HMM (Hidden Markov Model), a DNN, an LSTM, an attention mechanism, WaveNet, and the like.
In some optional implementations of the present embodiment, the above step 505 may specifically be carried out as follows:
First, the above execution subject may obtain the text feature information corresponding to the speech in the stress-labeled speech set. As an example, the execution subject may recognize each piece of speech in the stress-labeled speech set as text, and obtain the text feature information corresponding to that text using various existing approaches. Here, the text feature information may include but is not limited to word vectors, part of speech, capitalization features, syllable count, etc.
Then, the above execution subject may take the text feature information of the speech in the stress-labeled speech set as input, take the stress labeling result of the speech corresponding to the input text feature information as the desired output, and train to obtain the stress prediction model. As an example, during training the output of the stress prediction model may be compared with the desired output; if the error between the two is less than a preset threshold, training is complete and stops. If the error between the two is not less than the preset threshold, the parameters of the stress prediction model may be adjusted using the back-propagation algorithm (BP algorithm) and gradient descent (for example, stochastic gradient descent), and training of the adjusted stress prediction model continues in the above manner.
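The compare-and-adjust training loop above can be illustrated with a deliberately minimal stand-in: a logistic-regression stress predictor trained by per-sample gradient descent, stopping when the mean error falls below a preset threshold. This is a sketch, not the patent's method — the patent names DNN, SVM, LSTM, CRF, etc. as candidate models, and the toy numeric vectors here stand in for real text feature information; all names and hyperparameters are assumptions.

```python
import math
import random

def train_stress_predictor(features, labels, lr=0.5, error_threshold=0.05,
                           max_epochs=5000, seed=0):
    """Train a minimal logistic-regression stress predictor.

    `features` is a list of numeric feature vectors (stand-ins for text
    feature information), `labels` the stress labels (1 = stressed, 0 = not).
    The loop mirrors step 505: compare output with the desired output, and
    if the mean error is not below the threshold, adjust parameters along
    the gradient and continue training.
    """
    rng = random.Random(seed)
    dim = len(features[0])
    w = [rng.uniform(-0.1, 0.1) for _ in range(dim)]
    b = 0.0
    for _ in range(max_epochs):
        total_err = 0.0
        for x, y in zip(features, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))        # model output
            err = p - y                           # output vs. desired output
            total_err += abs(err)
            # adjust parameters (gradient descent step)
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
        if total_err / len(labels) < error_threshold:
            break                                 # error below threshold: done
    return w, b

def predict_stress(w, b, x):
    """Predict "1" (stressed) or "0" (unstressed) for a feature vector."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return "1" if z > 0 else "0"
```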
Compared with the embodiment corresponding to Fig. 2, the process 500 of the method for generating information in the present embodiment highlights the steps of applying stress labels to the syllables of the speech in the speech set and training a stress prediction model based on the stress-labeled speech set. The scheme described in the present embodiment can thereby improve the efficiency of generating stress information and, in turn, shorten the generation cycle of the stress prediction model.
With further reference to Fig. 6, as an implementation of the methods shown in the above figures, the present application provides an embodiment of an apparatus for generating information. The apparatus embodiment corresponds to the method embodiment shown in Fig. 2, and the apparatus may specifically be applied to various electronic devices.
As shown in Fig. 6, the apparatus 600 for generating information of the present embodiment includes an extraction unit 601, a statistics unit 602 and a generation unit 603. The extraction unit 601 is configured to, for each syllable of the speech in a preset speech set, extract the fundamental frequency corresponding to the syllable to obtain a fundamental frequency sequence for the syllable, and obtain fundamental frequency characteristic information for the syllable from the fundamental frequency sequence; the statistics unit 602 is configured to perform statistics on the fundamental frequency characteristic information of the syllables of the speech in the speech set to obtain a statistical result; and the generation unit 603 is configured to generate stress information for the syllables of the speech in the speech set according to the statistical result.
In the present embodiment, for the specific processing of the extraction unit 601, the statistics unit 602 and the generation unit 603 of the apparatus 600 for generating information, and the technical effects brought thereby, reference may be made to the related descriptions of step 201, step 202 and step 203 in the embodiment corresponding to Fig. 2, respectively; details are not repeated here.
In some optional implementations of the present embodiment, the apparatus 600 further includes: a labeling unit (not shown), configured to apply stress labels to the syllables of the speech in the speech set using the generated stress information; and a training unit (not shown), configured to train a stress prediction model based on the stress-labeled speech set, where the stress prediction model is used to predict the stress information of the syllables of the speech corresponding to a text.
In some optional implementations of the present embodiment, the training unit is further configured to: obtain the text feature information corresponding to the speech in the stress-labeled speech set; and take the text feature information of the speech in the stress-labeled speech set as input, take the stress labeling result of the speech corresponding to the input text feature information as the desired output, and train to obtain the stress prediction model.
In some optional implementations of the present embodiment, the fundamental frequency characteristic information of a syllable includes the fundamental frequency amplitude of the syllable; and the generation unit 603 is further configured to: determine, according to the statistical result, a fundamental frequency amplitude threshold for the syllables of the speech in the speech set; and generate the stress information of the syllables of the speech in the speech set according to the determined fundamental frequency amplitude threshold.
In some optional implementations of the present embodiment, the fundamental frequency characteristic information of a syllable includes the fundamental frequency duration of the syllable; and the generation unit 603 is further configured to: determine, according to the statistical result, a fundamental frequency duration threshold for the syllables of the speech in the speech set; and generate the stress information of the syllables of the speech in the speech set according to the determined fundamental frequency duration threshold.
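The two threshold-based implementations above share one shape: derive a threshold from a statistic over the whole speech set, then mark syllables whose feature value exceeds it as stressed. The patent does not fix the rule for determining the threshold, so using the mean as the threshold here is an assumption; the sketch applies equally to F0 amplitude values and F0 duration values.

```python
def generate_stress_by_threshold(feature_values):
    """Generate per-syllable stress information from a set-wide statistic.

    `feature_values` holds one value per syllable (F0 amplitude in Hz, or
    F0 duration in seconds). The threshold is the mean over the speech set
    (an assumed rule); a syllable whose value exceeds the threshold is
    marked stressed ("yes"), otherwise unstressed ("no").
    """
    threshold = sum(feature_values) / len(feature_values)
    return ["yes" if v > threshold else "no" for v in feature_values]
```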
Referring now to Fig. 7, it shows a schematic structural diagram of a computer system 700 suitable for implementing the device of the embodiments of the present application. The device shown in Fig. 7 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the present application.
As shown in Fig. 7, the computer system 700 includes a central processing unit (CPU) 701, which can perform various appropriate actions and processing according to a program stored in a read-only memory (ROM) 702 or a program loaded from a storage section 708 into a random access memory (RAM) 703. The RAM 703 also stores various programs and data required for the operation of the system 700. The CPU 701, the ROM 702 and the RAM 703 are connected to one another through a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.
The following components are connected to the I/O interface 705: an input section 706 including a keyboard, a mouse, etc.; an output section 707 including, for example, a cathode ray tube (CRT) or liquid crystal display (LCD), and a loudspeaker; a storage section 708 including a hard disk, etc.; and a communication section 709 including a network interface card such as a LAN card or a modem. The communication section 709 performs communication processing via a network such as the Internet. A driver 710 is also connected to the I/O interface 705 as needed. A removable medium 711, such as a magnetic disk, an optical disc, a magneto-optical disc or a semiconductor memory, is mounted on the driver 710 as needed, so that a computer program read therefrom can be installed into the storage section 708 as needed.
In particular, according to an embodiment of the present disclosure, the process described above with reference to the flowchart may be implemented as a computer software program. For example, an embodiment of the present disclosure includes a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for executing the method shown in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 709, and/or installed from the removable medium 711. When the computer program is executed by the central processing unit (CPU) 701, the above functions defined in the method of the present application are performed.
It should be noted that the computer-readable medium described herein may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the two. The computer-readable storage medium may be, for example but not limited to, an electric, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any combination of the above. More specific examples of the computer-readable storage medium may include but are not limited to: an electrical connection having one or more conductors, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any appropriate combination of the above. In the present application, a computer-readable storage medium may be any tangible medium that contains or stores a program for use by, or in connection with, an instruction execution system, apparatus or device. In the present application, a computer-readable signal medium may include a data signal propagated in baseband or as a part of a carrier wave, in which computer-readable program code is carried. Such a propagated data signal may take various forms, including but not limited to an electromagnetic signal, an optical signal, or any appropriate combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, and may send, propagate or transmit a program for use by, or in connection with, an instruction execution system, apparatus or device. The program code contained on the computer-readable medium may be transmitted by any suitable medium, including but not limited to wireless, wire, optical cable, RF, etc., or any appropriate combination of the above.
The computer program code for executing the operations of the present application may be written in one or more programming languages or combinations thereof, including object-oriented programming languages such as Java, Smalltalk and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the accompanying drawings illustrate the possible architectures, functions and operations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each box in a flowchart or block diagram may represent a module, a program segment or a portion of code, which contains one or more executable instructions for implementing the specified logical function. It should also be noted that in some alternative implementations, the functions noted in the boxes may occur in an order different from that noted in the drawings. For example, two boxes shown in succession may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each box in a block diagram and/or flowchart, and combinations of boxes in a block diagram and/or flowchart, may be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The units described in the embodiments of the present application may be implemented by software or by hardware. The described units may also be provided in a processor, which may for example be described as: a processor including an extraction unit, a statistics unit and a generation unit. The names of these units do not constitute a limitation on the units themselves under certain circumstances; for example, the extraction unit may also be described as "a unit which, for each syllable of the speech in a preset speech set, extracts the fundamental frequency corresponding to the syllable to obtain a fundamental frequency sequence for the syllable, and obtains fundamental frequency characteristic information for the syllable from the fundamental frequency sequence".
As another aspect, the present application also provides a computer-readable medium, which may be included in the apparatus described in the above embodiments, or may exist separately without being assembled into the apparatus. The computer-readable medium carries one or more programs which, when executed by the apparatus, cause the apparatus to: for each syllable of the speech in a preset speech set, extract the fundamental frequency corresponding to the syllable to obtain a fundamental frequency sequence for the syllable, and obtain fundamental frequency characteristic information for the syllable from the fundamental frequency sequence; perform statistics on the fundamental frequency characteristic information of the syllables of the speech in the speech set to obtain a statistical result; and generate stress information for the syllables of the speech in the speech set according to the statistical result.
The above description is only the preferred embodiments of the present application and an explanation of the technical principles applied. Those skilled in the art should understand that the scope of the invention involved in the present application is not limited to technical solutions formed by the specific combination of the above technical features; without departing from the above inventive concept, it also covers other technical solutions formed by any combination of the above technical features or their equivalent features, for example, technical solutions formed by mutually replacing the above features with (but not limited to) technical features with similar functions disclosed in the present application.
Claims (12)
1. A method for generating information, comprising:
for each syllable of speech in a preset speech set, extracting the fundamental frequency corresponding to the syllable to obtain a fundamental frequency sequence for the syllable, and obtaining fundamental frequency characteristic information for the syllable from the fundamental frequency sequence;
performing statistics on the fundamental frequency characteristic information of the syllables of the speech in the speech set to obtain a statistical result; and
generating stress information for the syllables of the speech in the speech set according to the statistical result.
2. The method according to claim 1, wherein the method further comprises:
applying stress labels to the syllables of the speech in the speech set using the generated stress information; and
training a stress prediction model based on the speech set after stress labeling, wherein the stress prediction model is used to predict the stress information of the syllables of the speech corresponding to a text.
3. The method according to claim 2, wherein the training a stress prediction model based on the speech set after stress labeling comprises:
obtaining text feature information corresponding to the speech in the speech set after stress labeling; and
taking the text feature information of the speech in the speech set after stress labeling as input, taking the stress labeling result of the speech corresponding to the input text feature information as the desired output, and training to obtain the stress prediction model.
4. The method according to claim 1, wherein the fundamental frequency characteristic information of a syllable comprises the fundamental frequency amplitude of the syllable; and the generating stress information for the syllables of the speech in the speech set according to the statistical result comprises:
determining, according to the statistical result, a fundamental frequency amplitude threshold for the syllables of the speech in the speech set; and
generating the stress information of the syllables of the speech in the speech set according to the determined fundamental frequency amplitude threshold.
5. The method according to claim 1, wherein the fundamental frequency characteristic information of a syllable comprises the fundamental frequency duration of the syllable; and the generating stress information for the syllables of the speech in the speech set according to the statistical result comprises:
determining, according to the statistical result, a fundamental frequency duration threshold for the syllables of the speech in the speech set; and
generating the stress information of the syllables of the speech in the speech set according to the determined fundamental frequency duration threshold.
6. An apparatus for generating information, comprising:
an extraction unit, configured to, for each syllable of speech in a preset speech set, extract the fundamental frequency corresponding to the syllable to obtain a fundamental frequency sequence for the syllable, and obtain fundamental frequency characteristic information for the syllable from the fundamental frequency sequence;
a statistics unit, configured to perform statistics on the fundamental frequency characteristic information of the syllables of the speech in the speech set to obtain a statistical result; and
a generation unit, configured to generate stress information for the syllables of the speech in the speech set according to the statistical result.
7. The apparatus according to claim 6, wherein the apparatus further comprises:
a labeling unit, configured to apply stress labels to the syllables of the speech in the speech set using the generated stress information; and
a training unit, configured to train a stress prediction model based on the speech set after stress labeling, wherein the stress prediction model is used to predict the stress information of the syllables of the speech corresponding to a text.
8. The apparatus according to claim 7, wherein the training unit is further configured to:
obtain text feature information corresponding to the speech in the speech set after stress labeling; and
take the text feature information of the speech in the speech set after stress labeling as input, take the stress labeling result of the speech corresponding to the input text feature information as the desired output, and train to obtain the stress prediction model.
9. The apparatus according to claim 6, wherein the fundamental frequency characteristic information of a syllable comprises the fundamental frequency amplitude of the syllable; and the generation unit is further configured to:
determine, according to the statistical result, a fundamental frequency amplitude threshold for the syllables of the speech in the speech set; and
generate the stress information of the syllables of the speech in the speech set according to the determined fundamental frequency amplitude threshold.
10. The apparatus according to claim 6, wherein the fundamental frequency characteristic information of a syllable comprises the fundamental frequency duration of the syllable; and the generation unit is further configured to:
determine, according to the statistical result, a fundamental frequency duration threshold for the syllables of the speech in the speech set; and
generate the stress information of the syllables of the speech in the speech set according to the determined fundamental frequency duration threshold.
11. A device, comprising:
one or more processors; and
a storage apparatus on which one or more programs are stored,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method according to any one of claims 1 to 5.
12. A computer-readable medium on which a computer program is stored, wherein the program, when executed by a processor, implements the method according to any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811202290.2A CN109087627A (en) | 2018-10-16 | 2018-10-16 | Method and apparatus for generating information |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109087627A true CN109087627A (en) | 2018-12-25 |
Family
ID=64843483
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811202290.2A Pending CN109087627A (en) | 2018-10-16 | 2018-10-16 | Method and apparatus for generating information |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109087627A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112309367A (en) * | 2020-11-03 | 2021-02-02 | 北京有竹居网络技术有限公司 | Speech synthesis method, speech synthesis device, storage medium and electronic equipment |
CN113205827A (en) * | 2021-05-05 | 2021-08-03 | 张茜 | High-precision extraction method and device for baby voice fundamental frequency and computer equipment |
CN113516963A (en) * | 2020-04-09 | 2021-10-19 | 菜鸟智能物流控股有限公司 | Audio data generation method and device, server and intelligent loudspeaker box |
WO2022095754A1 (en) * | 2020-11-03 | 2022-05-12 | 北京有竹居网络技术有限公司 | Speech synthesis method and apparatus, storage medium, and electronic device |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6366883B1 (en) * | 1996-05-15 | 2002-04-02 | Atr Interpreting Telecommunications | Concatenation of speech segments by use of a speech synthesizer |
CN101000765A (en) * | 2007-01-09 | 2007-07-18 | 黑龙江大学 | Speech synthetic method based on rhythm character |
CN101751919A (en) * | 2008-12-03 | 2010-06-23 | 中国科学院自动化研究所 | Spoken Chinese stress automatic detection method |
CN102254554A (en) * | 2011-07-18 | 2011-11-23 | 中国科学院自动化研究所 | Method for carrying out hierarchical modeling and predicating on mandarin accent |
CN102436807A (en) * | 2011-09-14 | 2012-05-02 | 苏州思必驰信息科技有限公司 | Method and system for automatically generating voice with stressed syllables |
CN102496363A (en) * | 2011-11-11 | 2012-06-13 | 北京宇音天下科技有限公司 | Correction method for Chinese speech synthesis tone |
CN105895075A (en) * | 2015-01-26 | 2016-08-24 | 科大讯飞股份有限公司 | Method and system for improving synthetic voice rhythm naturalness |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113516963A (en) * | 2020-04-09 | 2021-10-19 | 菜鸟智能物流控股有限公司 | Audio data generation method and device, server and intelligent loudspeaker box |
CN113516963B (en) * | 2020-04-09 | 2023-11-10 | 菜鸟智能物流控股有限公司 | Audio data generation method and device, server and intelligent sound box |
CN112309367A (en) * | 2020-11-03 | 2021-02-02 | 北京有竹居网络技术有限公司 | Speech synthesis method, speech synthesis device, storage medium and electronic equipment |
WO2022095754A1 (en) * | 2020-11-03 | 2022-05-12 | 北京有竹居网络技术有限公司 | Speech synthesis method and apparatus, storage medium, and electronic device |
CN113205827A (en) * | 2021-05-05 | 2021-08-03 | 张茜 | High-precision extraction method and device for baby voice fundamental frequency and computer equipment |
CN113205827B (en) * | 2021-05-05 | 2022-02-15 | 张茜 | High-precision extraction method and device for baby voice fundamental frequency and computer equipment |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20181225 |