CN109087627A - Method and apparatus for generating information - Google Patents

Method and apparatus for generating information

Info

Publication number
CN109087627A (application number CN201811202290.2A)
Authority
CN
China
Prior art keywords
voice
syllable
fundamental frequency
stress
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811202290.2A
Other languages
Chinese (zh)
Inventor
周志平
盖于涛
陈昌滨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Baidu Online Network Technology Beijing Co Ltd
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201811202290.2A
Publication of CN109087627A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/02: Methods for producing synthetic speech; Speech synthesisers
    • G10L13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L15/00: Speech recognition
    • G10L15/26: Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephonic Communication Services (AREA)

Abstract

Embodiments of the present application disclose a method and apparatus for generating information. One specific embodiment of the method includes: for each syllable of the voices in a preset voice set, extracting the fundamental frequency corresponding to the syllable to obtain a fundamental frequency sequence for the syllable, and obtaining fundamental frequency feature information for the syllable from the sequence; computing statistics over the fundamental frequency feature information of the syllables of the voices in the voice set to obtain a statistical result; and generating stress information for the syllables of the voices in the voice set according to the statistical result. This embodiment realizes automatic generation of stress information for the syllables of the voices in a voice set.

Description

Method and apparatus for generating information
Technical field
The invention relates to field of computer technology, and in particular to the method and apparatus for generating information.
Background technique
With the development of speech synthesis technology, synthesized speech has improved considerably in both intelligibility and naturalness. However, current synthesized speech still sounds rather flat and monotonous. Unlike synthesized speech, people from different regions speak with different pronunciation habits, and stress (accent) can describe the differences between those habits; adding stress information during speech synthesis therefore makes it possible to synthesize speech with a more "human" quality. At present, a speech synthesis system generally includes several models, for example a text model, a prediction model, and an acoustic model. To synthesize speech with stress, the models in the speech synthesis system must be trained on data annotated with stress. Annotating the stress in training data manually not only requires the annotators to be very familiar with the speaker's pronunciation habits, but also consumes considerable manpower and money.
Summary of the invention
Embodiments of the present application propose a method and apparatus for generating information.
In a first aspect, an embodiment of the present application provides a method for generating information, the method comprising: for each syllable of the voices in a preset voice set, extracting the fundamental frequency corresponding to the syllable to obtain a fundamental frequency sequence for the syllable, and obtaining fundamental frequency feature information for the syllable from the sequence; computing statistics over the fundamental frequency feature information of the syllables of the voices in the voice set to obtain a statistical result; and generating stress information for the syllables of the voices in the voice set according to the statistical result.
In some embodiments, the method further includes: performing stress annotation on the syllables of the voices in the voice set using the generated stress information; and training a stress prediction model based on the stress-annotated voice set, where the stress prediction model is used to predict stress information for the syllables of the voice corresponding to a text.
In some embodiments, training the stress prediction model based on the stress-annotated voice set includes: obtaining text feature information corresponding to the voices in the stress-annotated voice set; and training the stress prediction model with the text feature information of the voices in the stress-annotated voice set as input and the stress annotation results of the voices corresponding to the input text feature information as the desired output.
In some embodiments, the fundamental frequency feature information of a syllable includes the fundamental frequency amplitude of the syllable; and generating the stress information of the syllables of the voices in the voice set according to the statistical result includes: determining a fundamental frequency amplitude threshold for the syllables of the voices in the voice set according to the statistical result, and generating the stress information of the syllables of the voices in the voice set according to the determined fundamental frequency amplitude threshold.
In some embodiments, the fundamental frequency feature information of a syllable includes the fundamental frequency duration of the syllable; and generating the stress information of the syllables of the voices in the voice set according to the statistical result includes: determining a fundamental frequency duration threshold for the syllables of the voices in the voice set according to the statistical result, and generating the stress information of the syllables of the voices in the voice set according to the determined fundamental frequency duration threshold.
In a second aspect, an embodiment of the present application provides an apparatus for generating information, the apparatus comprising: an extraction unit, configured to, for each syllable of the voices in a preset voice set, extract the fundamental frequency corresponding to the syllable to obtain a fundamental frequency sequence for the syllable, and obtain fundamental frequency feature information for the syllable from the sequence; a statistics unit, configured to compute statistics over the fundamental frequency feature information of the syllables of the voices in the voice set to obtain a statistical result; and a generation unit, configured to generate stress information for the syllables of the voices in the voice set according to the statistical result.
In some embodiments, the apparatus further includes: an annotation unit, configured to perform stress annotation on the syllables of the voices in the voice set using the generated stress information; and a training unit, configured to train a stress prediction model based on the stress-annotated voice set, where the stress prediction model is used to predict stress information for the syllables of the voice corresponding to a text.
In some embodiments, the training unit is further configured to: obtain text feature information corresponding to the voices in the stress-annotated voice set; and train the stress prediction model with the text feature information of the voices in the stress-annotated voice set as input and the stress annotation results of the voices corresponding to the input text feature information as the desired output.
In some embodiments, the fundamental frequency feature information of a syllable includes the fundamental frequency amplitude of the syllable; and the generation unit is further configured to: determine a fundamental frequency amplitude threshold for the syllables of the voices in the voice set according to the statistical result, and generate the stress information of the syllables of the voices in the voice set according to the determined threshold.
In some embodiments, the fundamental frequency feature information of a syllable includes the fundamental frequency duration of the syllable; and the generation unit is further configured to: determine a fundamental frequency duration threshold for the syllables of the voices in the voice set according to the statistical result, and generate the stress information of the syllables of the voices in the voice set according to the determined threshold.
In a third aspect, an embodiment of the present application provides a device comprising: one or more processors; and a storage apparatus storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method described in any implementation of the first aspect.
In a fourth aspect, an embodiment of the present application provides a computer-readable medium storing a computer program which, when executed by a processor, implements the method described in any implementation of the first aspect.
With the method and apparatus for generating information provided by the embodiments of the present application, for each syllable of the voices in a voice set, the fundamental frequency corresponding to the syllable is extracted to obtain a fundamental frequency sequence for the syllable, and fundamental frequency feature information for the syllable is obtained from the sequence; statistics are then computed over the fundamental frequency feature information of the syllables of the voices in the voice set to obtain a statistical result; finally, stress information for the syllables of the voices in the voice set is generated according to the statistical result. Stress information for the syllables of the voices in the voice set is thus generated automatically, which, compared with manual annotation, improves the efficiency of information generation, avoids the influence of human factors on the result, and reduces cost.
Detailed description of the invention
Other features, objects, and advantages of the present application will become more apparent upon reading the following detailed description of non-restrictive embodiments with reference to the accompanying drawings:
Fig. 1 is an exemplary system architecture diagram to which an embodiment of the present application may be applied;
Fig. 2 is a flowchart of one embodiment of the method for generating information according to the present application;
Fig. 3 is a schematic diagram of an exemplary fundamental frequency amplitude distribution according to the present application;
Fig. 4 is a schematic diagram of an application scenario of the method for generating information according to the present application;
Fig. 5 is a flowchart of another embodiment of the method for generating information according to the present application;
Fig. 6 is a structural schematic diagram of one embodiment of the apparatus for generating information according to the present application;
Fig. 7 is a structural schematic diagram of a computer system adapted to implement a device of an embodiment of the present application.
Specific embodiment
The present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are used only to explain the related invention, not to limit it. It should also be noted that, for convenience of description, only the parts relevant to the related invention are shown in the drawings.
It should be noted that, in the absence of conflict, the embodiments of the present application and the features in the embodiments may be combined with each other. The present application is described in detail below with reference to the accompanying drawings and in conjunction with the embodiments.
Fig. 1 shows an exemplary system architecture 100 in which the method for generating information or the apparatus for generating information of an embodiment of the present application may be applied.
As shown in Fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium providing communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired or wireless communication links, or fiber-optic cables.
A user may use the terminal devices 101, 102, 103 to interact with the server 105 through the network 104 to receive or send messages. Various communication client applications may be installed on the terminal devices 101, 102, 103, such as speech synthesis applications, shopping applications, search applications, instant messaging tools, mailbox clients, and social platform software.
The terminal devices 101, 102, 103 may be hardware or software. When they are hardware, they may be various electronic devices capable of processing voice, including but not limited to smartphones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop portable computers, desktop computers, and the like. When the terminal devices 101, 102, 103 are software, they may be installed in the electronic devices listed above, and may be implemented as multiple pieces of software or software modules (for example, for providing distributed services) or as a single piece of software or software module. No specific limitation is made here.
The server 105 may be a server providing various services, for example a background server providing support for information displayed on the terminal devices 101, 102, 103. The background server may analyze and otherwise process received data such as a voice set, and feed the processing result (for example, stress information) back to the terminal devices.
It should be noted that the server may be hardware or software. When the server is hardware, it may be implemented as a distributed server cluster composed of multiple servers, or as a single server. When the server is software, it may be implemented as multiple pieces of software or software modules (for example, for providing distributed services) or as a single piece of software or software module. No specific limitation is made here.
It should be noted that the method for generating information provided by the embodiments of the present application may be executed by the terminal devices 101, 102, 103 or by the server 105. Correspondingly, the apparatus for generating information may be provided in the terminal devices 101, 102, 103 or in the server 105. The present application does not limit this.
It should be understood that the numbers of terminal devices, networks, and servers in Fig. 1 are merely illustrative. There may be any number of terminal devices, networks, and servers according to implementation needs.
With continued reference to Fig. 2, a flow 200 of one embodiment of the method for generating information according to the present application is shown. The method for generating information comprises the following steps:
Step 201: for each syllable of the voices in a preset voice set, extract the fundamental frequency corresponding to the syllable to obtain a fundamental frequency sequence for the syllable, and obtain fundamental frequency feature information for the syllable from the fundamental frequency sequence.
In this embodiment, the executing body of the method for generating information (for example, the terminal devices 101, 102, 103 or the server 105 shown in Fig. 1) may locally store a voice set in advance; the voice set may be a repository storing voices. As an example, the voices in the voice set may be recordings of a single person, or recordings of a group of people, for example people from the same region who share the same pronunciation habits. The voices in the voice set may be in various languages, for example English, Chinese, and so on.
For each syllable of each voice in the voice set, the executing body may first extract the fundamental frequency corresponding to the syllable to obtain a fundamental frequency sequence for the syllable. Here, a syllable is the most natural structural unit of speech, and the fundamental frequency is the frequency of the fundamental tone. An ordinary sound is a composite of a series of vibrations of different frequencies and amplitudes produced by a vibrating body; the vibration with the lowest frequency produces the fundamental tone, and the rest are overtones. As an example, the executing body may segment a voice into multiple syllables. For each of the syllables, the executing body may extract the fundamental frequency corresponding to the syllable to obtain a fundamental frequency sequence for the syllable; for example, it may extract the fundamental frequency of the syllable once every set duration (for example, every 5 milliseconds). Afterwards, the executing body may obtain fundamental frequency feature information for the syllable from the fundamental frequency sequence. As an example, the fundamental frequency feature information may include, but is not limited to, the fundamental frequency maximum, the fundamental frequency minimum, the fundamental frequency amplitude, and the fundamental frequency duration, where the fundamental frequency amplitude may refer to the difference between the fundamental frequency maximum and the fundamental frequency minimum, and the fundamental frequency duration may refer to the duration occupied by non-zero fundamental frequency values in the fundamental frequency sequence. It should be noted that phonetic segmentation and fundamental frequency extraction are well-known techniques that have been widely studied and applied, and are not described again here.
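The per-syllable feature computation described above can be sketched as follows. This is only an illustrative sketch, not the patent's implementation: it assumes the fundamental frequency has already been extracted as a frame-wise sequence sampled every 5 milliseconds, with 0.0 marking unvoiced frames, and the names `f0_features` and `F0Features` are ours.

```python
from dataclasses import dataclass


@dataclass
class F0Features:
    f0_max: float          # fundamental frequency maximum
    f0_min: float          # fundamental frequency minimum
    f0_range: float        # "fundamental frequency amplitude": max minus min
    f0_duration_ms: float  # duration occupied by non-zero (voiced) F0 frames


def f0_features(f0_sequence, frame_shift_ms=5.0):
    """Compute the per-syllable F0 features from a frame-wise F0 sequence
    in which 0.0 marks unvoiced frames."""
    voiced = [f for f in f0_sequence if f > 0.0]
    if not voiced:
        return F0Features(0.0, 0.0, 0.0, 0.0)
    f0_max, f0_min = max(voiced), min(voiced)
    return F0Features(
        f0_max=f0_max,
        f0_min=f0_min,
        f0_range=f0_max - f0_min,
        f0_duration_ms=len(voiced) * frame_shift_ms,
    )
```

For example, a syllable whose 5 ms frames carry F0 values `[0.0, 180.0, 200.0, 210.0, 0.0]` would get an amplitude of 30 Hz and a fundamental frequency duration of 15 ms.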
Step 202: compute statistics over the fundamental frequency feature information of the syllables of the voices in the voice set to obtain a statistical result.
In this embodiment, the executing body may compute statistics over the fundamental frequency feature information of the syllables of all (or some) of the voices in the voice set to obtain a statistical result. Taking fundamental frequency amplitude as an example, the executing body may sort the syllables in the voice set in descending order of fundamental frequency amplitude and count the number of syllables at each fundamental frequency amplitude, obtaining a fundamental frequency amplitude distribution as shown in Fig. 3, where the abscissa is the logarithm of the fundamental frequency amplitude and the ordinate is the number of syllables. As an example, the executing body may also compute information such as the mean, variance, and quartiles of the fundamental frequency feature information of the syllables of at least one voice in the voice set to obtain the statistical result.
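A minimal sketch of this statistics step, under our own assumptions: the function name `amplitude_statistics` and the log-bin width are not from the patent, and the histogram over log-amplitude bins merely mirrors the log-scale abscissa described for Fig. 3.

```python
import math
import statistics
from collections import Counter


def amplitude_statistics(f0_ranges, bin_width=0.25):
    """Summarise the F0-amplitude distribution over all syllables:
    mean, (population) variance, and a histogram over log-amplitude bins."""
    mean = statistics.mean(f0_ranges)
    var = statistics.pvariance(f0_ranges)
    # Bucket each amplitude by its (natural) log, quantised to bin_width.
    bins = Counter(
        round(math.log(r) / bin_width) * bin_width
        for r in f0_ranges if r > 0
    )
    return {"mean": mean, "variance": var, "histogram": dict(bins)}
```

The returned dictionary plays the role of the "statistical result" that the later thresholding step consumes.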
Step 203: generate stress information for the syllables of the voices in the voice set according to the statistical result.
In this embodiment, the executing body may generate stress information for the syllables of the voices in the voice set according to the statistical result of step 202. Here, the stress information of a syllable may be of two types, "yes" and "no": "yes" indicates that the syllable needs to be stressed, and "no" indicates that it does not. As an example, the stress information may be represented by 1 and 0, where 1 indicates "yes" and 0 indicates "no".
In some optional implementations of this embodiment, the fundamental frequency feature information of a syllable may include the fundamental frequency amplitude of the syllable, and step 203 may be carried out as follows:
First, the executing body may determine a fundamental frequency amplitude threshold for the syllables of the voices in the voice set according to the statistical result. As an example, the executing body may sort the syllables of the voices in the voice set in descending order of fundamental frequency amplitude, take the syllables ranked in the top 20% as target syllables, determine the target syllable with the smallest fundamental frequency amplitude among them, and use the fundamental frequency amplitude of that target syllable as the fundamental frequency amplitude threshold.
Afterwards, the executing body may generate stress information for the syllables of the voices in the voice set according to the determined fundamental frequency amplitude threshold. As an example, the executing body may compare the fundamental frequency amplitude of each syllable of each voice in the voice set with the threshold: if the fundamental frequency amplitude of the syllable is greater than the threshold, the stress information "yes" is generated; if it is less than the threshold, the stress information "no" is generated. In practice, the fundamental frequency amplitude of a stressed syllable is generally greater than that of an unstressed syllable. Therefore, determining the fundamental frequency amplitude threshold from the statistical result and generating stress information based on it can make the generated stress information more accurate.
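The top-20% thresholding rule above can be sketched as a short function. This is an illustrative sketch rather than the patent's implementation; in particular, it labels the syllable sitting exactly at the threshold as stressed (the threshold syllable is itself in the top 20%), whereas the patent's text only specifies the strictly-greater and strictly-less cases, and the name `label_stress` is ours.

```python
def label_stress(syllable_ranges, top_fraction=0.2):
    """Label each syllable 1 ("yes", stressed) or 0 ("no", unstressed):
    sort amplitudes in descending order, use the smallest amplitude among
    the top 20% as the threshold, then mark syllables at or above it."""
    ranked = sorted(syllable_ranges, reverse=True)
    n_top = max(1, int(len(ranked) * top_fraction))
    threshold = ranked[n_top - 1]  # smallest amplitude in the top 20%
    return [1 if r >= threshold else 0 for r in syllable_ranges]
```

The fundamental-frequency-duration variant described next is structurally identical: only the feature fed in changes.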
In some optional implementations of this embodiment, the fundamental frequency feature information of a syllable may include the fundamental frequency duration of the syllable, and step 203 may be carried out as follows:
First, the executing body may determine a fundamental frequency duration threshold for the syllables of the voices in the voice set according to the statistical result. As an example, the executing body may sort the syllables of the voices in the voice set in descending order of fundamental frequency duration, take the syllables ranked in the top 20% as target syllables, determine the target syllable with the smallest fundamental frequency duration among them, and use the fundamental frequency duration of that target syllable as the fundamental frequency duration threshold.
Afterwards, the executing body may generate stress information for the syllables of the voices in the voice set according to the determined fundamental frequency duration threshold. As an example, the executing body may compare the fundamental frequency duration of each syllable of each voice in the voice set with the threshold: if the fundamental frequency duration of the syllable is greater than the threshold, the stress information "yes" is generated; if it is less than the threshold, the stress information "no" is generated. In practice, the fundamental frequency duration of a stressed syllable is generally longer than that of an unstressed syllable. Therefore, determining the fundamental frequency duration threshold from the statistical result and generating stress information based on it can make the generated stress information more accurate.
With continued reference to Fig. 4, Fig. 4 is a schematic diagram of an application scenario of the method for generating information according to this embodiment. In the application scenario of Fig. 4, several people from the same region who share the same pronunciation habits record multiple voices in advance, forming a voice set A. For each syllable of each voice in voice set A, the terminal device 401 may first extract the fundamental frequency corresponding to the syllable to obtain a fundamental frequency sequence for the syllable, and obtain fundamental frequency feature information for the syllable from the sequence. Afterwards, the terminal device 401 may compute statistics over the fundamental frequency feature information of each syllable of each voice in voice set A to obtain a statistical result. Finally, the terminal device 401 generates stress information for each syllable of each voice in voice set A according to the statistical result.
The method provided by the above embodiment of the present application realizes automatic generation of stress information for the syllables of the voices in a voice set. Compared with manual annotation, it improves the efficiency of information generation, avoids the influence of human factors on the result, and reduces cost.
With further reference to Fig. 5, a flow 500 of another embodiment of the method for generating information is shown. The flow 500 of the method for generating information comprises the following steps:
Step 501: for each syllable of the voices in a preset voice set, extract the fundamental frequency corresponding to the syllable to obtain a fundamental frequency sequence for the syllable, and obtain fundamental frequency feature information for the syllable from the fundamental frequency sequence.
In this embodiment, step 501 is similar in principle to step 201 in the embodiment shown in Fig. 2 and is not described again here.
Step 502: compute statistics over the fundamental frequency feature information of the syllables of the voices in the voice set to obtain a statistical result.
In this embodiment, step 502 is similar in principle to step 202 in the embodiment shown in Fig. 2 and is not described again here.
Step 503: generate stress information for the syllables of the voices in the voice set according to the statistical result.
In this embodiment, step 503 is similar in principle to step 203 in the embodiment shown in Fig. 2 and is not described again here.
Step 504: perform stress annotation on the syllables of the voices in the voice set using the generated stress information.
In this embodiment, the executing body may use the stress information generated in step 503 to perform stress annotation on the syllables of the voices in the voice set. As an example, in the voices of the voice set, the executing body may annotate the syllables whose stress information is "yes" with "1" and the syllables whose stress information is "no" with "0".
Step 505: train a stress prediction model based on the stress-annotated voice set.
In this embodiment, the executing body may train a stress prediction model based on the stress-annotated voice set. Here, the stress prediction model may be used to predict stress information for the syllables of the voice corresponding to a text. As an example, the stress prediction model may be a machine learning model, which may include but is not limited to a DNN (Deep Neural Network), an SVM (Support Vector Machine), an LSTM (Long Short-Term Memory network), a CRF (Conditional Random Field), an attention mechanism, WaveNet, and so on. In some usage scenarios, an acoustic model may also be trained based on the stress-annotated voice set; the acoustic model may be used to characterize the correspondence between text information (for example, text length, word count, syllable count, syllable position, etc.) and acoustic parameters. As an example, the acoustic model may include but is not limited to an HMM (Hidden Markov Model), a DNN, an LSTM, an attention mechanism, WaveNet, and so on.
In some optional implementations of the present embodiment, above-mentioned steps 505 specific as follows can be carried out:
Firstly, the corresponding text feature letter of voice in voice set after the above-mentioned available stress label of executing subject Breath.As an example, each speech recognition in the voice set after stress label can be text by above-mentioned executing subject, and The corresponding text feature information of this bar text is obtained using existing various modes.Herein, text feature information may include But be not limited to term vector, part of speech, capital and small letter feature, syllable number etc..
Then, the execution body may take the text feature information of the voices in the stress-labeled voice set as input, take the stress labeling results of the voices corresponding to the input text feature information as the desired output, and train to obtain the stress prediction model. As an example, during training, the output of the stress prediction model may be compared with the desired output. If the error between the two is less than a preset threshold, training is complete and stops; if the error is not less than the preset threshold, the parameters of the stress prediction model may be adjusted using a back-propagation algorithm (BP algorithm) and gradient descent (for example, stochastic gradient descent), and training of the model with the adjusted parameters continues in the above manner.
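The training loop just described — compare the model output with the desired output, stop once the error falls below a preset threshold, otherwise adjust the parameters by gradient descent and continue — can be sketched as follows. For brevity this uses a single-layer logistic model rather than the DNN/LSTM models the patent mentions; all names and hyperparameters are illustrative:

```python
import math

def train_stress_predictor(X, y, lr=0.5, tol=0.05, max_iter=5000):
    """Gradient-descent loop in the spirit of the description: per pass,
    compute the model output and its error against the desired output;
    if the mean squared error is below the preset threshold `tol`,
    training is complete; otherwise adjust the parameters and continue."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(max_iter):
        err = 0.0
        gw = [0.0] * len(w)
        gb = 0.0
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))       # model output
            err += (p - yi) ** 2
            for j, xj in enumerate(xi):          # accumulate gradients
                gw[j] += (p - yi) * xj
            gb += p - yi
        if err / len(X) < tol:                   # error below threshold: done
            break
        for j in range(len(w)):                  # gradient-descent update
            w[j] -= lr * gw[j] / len(X)
        b -= lr * gb / len(X)
    return w, b
```

On a toy one-dimensional dataset (feature values near 1 stressed, near 0 unstressed) the loop converges to a weight that separates the two classes.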
Compared with the embodiment corresponding to Fig. 2, the flow 500 of the method for generating information in the present embodiment highlights the steps of performing stress labeling on the syllables of the voices in the voice set and training a stress prediction model based on the stress-labeled voice set. The scheme described in the present embodiment can therefore improve the generation efficiency of stress information and, in turn, shorten the generation cycle of the stress prediction model.
With further reference to Fig. 6, as an implementation of the methods shown in the above figures, the present application provides an embodiment of an apparatus for generating information. The apparatus embodiment corresponds to the method embodiment shown in Fig. 2, and the apparatus may specifically be applied to various electronic devices.
As shown in Fig. 6, the apparatus 600 for generating information in the present embodiment includes: an extraction unit 601, a statistics unit 602, and a generation unit 603. The extraction unit 601 is configured to, for a syllable of a voice in a preset voice set, extract the fundamental frequency corresponding to the syllable to obtain a fundamental frequency sequence for the syllable, and obtain fundamental frequency feature information for the syllable according to the fundamental frequency sequence. The statistics unit 602 is configured to perform statistics on the fundamental frequency feature information of the syllables of the voices in the voice set to obtain a statistical result. The generation unit 603 is configured to generate, according to the statistical result, stress information of the syllables of the voices in the voice set.
In the present embodiment, for the specific processing of the extraction unit 601, the statistics unit 602, and the generation unit 603 of the apparatus 600 for generating information, and the technical effects thereof, reference may be made to the related descriptions of step 201, step 202, and step 203 in the embodiment corresponding to Fig. 2, which are not repeated here.
In some optional implementations of the present embodiment, the apparatus 600 further includes: a labeling unit (not shown in the figure), configured to perform stress labeling on the syllables of the voices in the voice set using the generated stress information; and a training unit (not shown in the figure), configured to train a stress prediction model based on the stress-labeled voice set, where the stress prediction model is used to predict the stress information of the syllables of the voice corresponding to a text.
In some optional implementations of the present embodiment, the training unit is further configured to: obtain the text feature information corresponding to the voices in the stress-labeled voice set; take the text feature information of the voices in the stress-labeled voice set as input, take the stress labeling results of the voices corresponding to the input text feature information as the desired output, and train to obtain the stress prediction model.
In some optional implementations of the present embodiment, the fundamental frequency feature information of a syllable includes the fundamental frequency amplitude of the syllable; and the generation unit 603 is further configured to: determine, according to the statistical result, a fundamental frequency amplitude threshold for the syllables of the voices in the voice set; and generate the stress information of the syllables of the voices in the voice set according to the determined fundamental frequency amplitude threshold.
In some optional implementations of the present embodiment, the fundamental frequency feature information of a syllable includes the fundamental frequency duration of the syllable; and the generation unit 603 is further configured to: determine, according to the statistical result, a fundamental frequency duration threshold for the syllables of the voices in the voice set; and generate the stress information of the syllables of the voices in the voice set according to the determined fundamental frequency duration threshold.
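The two threshold-based implementations (amplitude and duration) can be sketched together. The choice of statistic (the mean) and the rule of flagging a syllable that exceeds either threshold are assumptions made for illustration; the patent leaves both unspecified:

```python
import statistics

def stress_by_thresholds(f0_amplitudes, f0_durations):
    """Determine an F0 amplitude threshold and an F0 duration threshold
    from corpus statistics, then mark a syllable as stressed when it
    exceeds either threshold (OR-combination is an assumption)."""
    amp_thr = statistics.mean(f0_amplitudes)   # statistical result -> threshold
    dur_thr = statistics.mean(f0_durations)
    return [a > amp_thr or d > dur_thr
            for a, d in zip(f0_amplitudes, f0_durations)]
```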
Referring now to Fig. 7, a schematic structural diagram of a computer system 700 suitable for implementing a device of the embodiments of the present application is shown. The device shown in Fig. 7 is only an example, and should not impose any limitation on the functions and scope of use of the embodiments of the present application.
As shown in Fig. 7, the computer system 700 includes a central processing unit (CPU) 701, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 702 or a program loaded into a random access memory (RAM) 703 from a storage portion 708. The RAM 703 also stores various programs and data required for the operation of the system 700. The CPU 701, the ROM 702, and the RAM 703 are connected to one another via a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.
The following components are connected to the I/O interface 705: an input portion 706 including a keyboard, a mouse, etc.; an output portion 707 including a cathode-ray tube (CRT), a liquid crystal display (LCD), a speaker, etc.; a storage portion 708 including a hard disk, etc.; and a communication portion 709 including a network interface card such as a LAN card or a modem. The communication portion 709 performs communication processes via a network such as the Internet. A drive 710 is also connected to the I/O interface 705 as needed. A removable medium 711, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 710 as needed, so that a computer program read therefrom is installed into the storage portion 708 as needed.
In particular, according to embodiments of the present disclosure, the process described above with reference to the flowchart may be implemented as a computer software program. For example, an embodiment of the present disclosure includes a computer program product, comprising a computer program carried on a computer-readable medium, the computer program containing program code for executing the method shown in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication portion 709, and/or installed from the removable medium 711. When the computer program is executed by the central processing unit (CPU) 701, the above-mentioned functions defined in the method of the present application are executed.
It should be noted that the computer-readable medium described herein may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the two. The computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present application, a computer-readable storage medium may be any tangible medium that contains or stores a program, which may be used by or in combination with an instruction execution system, apparatus, or device. In the present application, a computer-readable signal medium may include a data signal propagated in a baseband or as part of a carrier wave, in which computer-readable program code is carried. Such a propagated data signal may take various forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, and may send, propagate, or transmit a program for use by or in combination with an instruction execution system, apparatus, or device. The program code contained on the computer-readable medium may be transmitted by any suitable medium, including but not limited to: wireless, wire, optical cable, RF, etc., or any suitable combination of the above.
The computer program code for executing the operations of the present application may be written in one or more programming languages or a combination thereof. The programming languages include object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may be executed entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the accompanying drawings illustrate the possible architectures, functions, and operations of the systems, methods, and computer program products according to the various embodiments of the present application. In this regard, each block in the flowcharts or block diagrams may represent a module, a program segment, or a portion of code, and the module, program segment, or portion of code contains one or more executable instructions for implementing the specified logical functions. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, may be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The units involved in the embodiments of the present application may be implemented by means of software, or may be implemented by means of hardware. The described units may also be provided in a processor, which may, for example, be described as: a processor including an extraction unit, a statistics unit, and a generation unit. The names of these units do not in some cases constitute a limitation on the units themselves. For example, the extraction unit may also be described as "a unit for extracting, for a syllable of a voice in a preset voice set, the fundamental frequency corresponding to the syllable to obtain a fundamental frequency sequence for the syllable, and obtaining fundamental frequency feature information for the syllable according to the fundamental frequency sequence".
As another aspect, the present application further provides a computer-readable medium. The computer-readable medium may be included in the apparatus described in the above embodiments, or may exist alone without being assembled into the apparatus. The computer-readable medium carries one or more programs, and when the one or more programs are executed by the apparatus, the apparatus is caused to: for a syllable of a voice in a preset voice set, extract the fundamental frequency corresponding to the syllable to obtain a fundamental frequency sequence for the syllable, and obtain fundamental frequency feature information for the syllable according to the fundamental frequency sequence; perform statistics on the fundamental frequency feature information of the syllables of the voices in the voice set to obtain a statistical result; and generate, according to the statistical result, stress information of the syllables of the voices in the voice set.
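The three-step flow restated above can be sketched end to end, assuming each syllable's fundamental frequency has already been extracted as a sequence of F0 values. Using the peak F0 as the per-syllable feature and the mean as the statistic are illustrative choices, not from the patent:

```python
import statistics

def generate_stress_info(syllable_f0_sequences):
    """End-to-end sketch of the claimed flow: reduce each syllable's F0
    sequence to a feature value, compute a statistic over the whole set,
    and emit per-syllable stress information from the comparison."""
    features = [max(seq) for seq in syllable_f0_sequences]  # F0 feature per syllable
    threshold = statistics.mean(features)                   # statistical result
    return [f > threshold for f in features]                # stress information
```

For instance, on three syllables whose F0 peaks are 120, 260, and 95 Hz, only the middle syllable is marked as stressed.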
The above description is only a preferred embodiment of the present application and an explanation of the applied technical principles. Those skilled in the art should understand that the scope of the invention involved in the present application is not limited to technical solutions formed by the specific combination of the above technical features, and should also cover other technical solutions formed by any combination of the above technical features or their equivalent features without departing from the above inventive concept — for example, technical solutions formed by replacing the above features with technical features having similar functions disclosed in (but not limited to) the present application.

Claims (12)

1. A method for generating information, comprising:
for a syllable of a voice in a preset voice set, extracting a fundamental frequency corresponding to the syllable to obtain a fundamental frequency sequence for the syllable, and obtaining fundamental frequency feature information for the syllable according to the fundamental frequency sequence;
performing statistics on the fundamental frequency feature information of the syllables of the voices in the voice set to obtain a statistical result; and
generating, according to the statistical result, stress information of the syllables of the voices in the voice set.
2. The method according to claim 1, wherein the method further comprises:
performing stress labeling on the syllables of the voices in the voice set using the generated stress information; and
training a stress prediction model based on the stress-labeled voice set, wherein the stress prediction model is used to predict stress information of syllables of a voice corresponding to a text.
3. The method according to claim 2, wherein the training a stress prediction model based on the stress-labeled voice set comprises:
obtaining text feature information corresponding to the voices in the stress-labeled voice set; and
taking the text feature information of the voices in the stress-labeled voice set as input, taking stress labeling results of the voices corresponding to the input text feature information as a desired output, and training to obtain the stress prediction model.
4. The method according to claim 1, wherein the fundamental frequency feature information of a syllable comprises a fundamental frequency amplitude of the syllable; and
the generating, according to the statistical result, stress information of the syllables of the voices in the voice set comprises:
determining, according to the statistical result, a fundamental frequency amplitude threshold for the syllables of the voices in the voice set; and
generating the stress information of the syllables of the voices in the voice set according to the determined fundamental frequency amplitude threshold.
5. The method according to claim 1, wherein the fundamental frequency feature information of a syllable comprises a fundamental frequency duration of the syllable; and
the generating, according to the statistical result, stress information of the syllables of the voices in the voice set comprises:
determining, according to the statistical result, a fundamental frequency duration threshold for the syllables of the voices in the voice set; and
generating the stress information of the syllables of the voices in the voice set according to the determined fundamental frequency duration threshold.
6. An apparatus for generating information, comprising:
an extraction unit, configured to, for a syllable of a voice in a preset voice set, extract a fundamental frequency corresponding to the syllable to obtain a fundamental frequency sequence for the syllable, and obtain fundamental frequency feature information for the syllable according to the fundamental frequency sequence;
a statistics unit, configured to perform statistics on the fundamental frequency feature information of the syllables of the voices in the voice set to obtain a statistical result; and
a generation unit, configured to generate, according to the statistical result, stress information of the syllables of the voices in the voice set.
7. The apparatus according to claim 6, wherein the apparatus further comprises:
a labeling unit, configured to perform stress labeling on the syllables of the voices in the voice set using the generated stress information; and
a training unit, configured to train a stress prediction model based on the stress-labeled voice set, wherein the stress prediction model is used to predict stress information of syllables of a voice corresponding to a text.
8. The apparatus according to claim 7, wherein the training unit is further configured to:
obtain text feature information corresponding to the voices in the stress-labeled voice set; and
take the text feature information of the voices in the stress-labeled voice set as input, take stress labeling results of the voices corresponding to the input text feature information as a desired output, and train to obtain the stress prediction model.
9. The apparatus according to claim 6, wherein the fundamental frequency feature information of a syllable comprises a fundamental frequency amplitude of the syllable; and
the generation unit is further configured to:
determine, according to the statistical result, a fundamental frequency amplitude threshold for the syllables of the voices in the voice set; and
generate the stress information of the syllables of the voices in the voice set according to the determined fundamental frequency amplitude threshold.
10. The apparatus according to claim 6, wherein the fundamental frequency feature information of a syllable comprises a fundamental frequency duration of the syllable; and
the generation unit is further configured to:
determine, according to the statistical result, a fundamental frequency duration threshold for the syllables of the voices in the voice set; and
generate the stress information of the syllables of the voices in the voice set according to the determined fundamental frequency duration threshold.
11. A device, comprising:
one or more processors; and
a storage device, on which one or more programs are stored,
wherein, when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the method according to any one of claims 1 to 5.
12. A computer-readable medium, on which a computer program is stored, wherein the program, when executed by a processor, implements the method according to any one of claims 1 to 5.
CN201811202290.2A 2018-10-16 2018-10-16 Method and apparatus for generating information Pending CN109087627A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811202290.2A CN109087627A (en) 2018-10-16 2018-10-16 Method and apparatus for generating information

Publications (1)

Publication Number Publication Date
CN109087627A true CN109087627A (en) 2018-12-25

Family

ID=64843483

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811202290.2A Pending CN109087627A (en) 2018-10-16 2018-10-16 Method and apparatus for generating information

Country Status (1)

Country Link
CN (1) CN109087627A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6366883B1 (en) * 1996-05-15 2002-04-02 Atr Interpreting Telecommunications Concatenation of speech segments by use of a speech synthesizer
CN101000765A (en) * 2007-01-09 2007-07-18 黑龙江大学 Speech synthetic method based on rhythm character
CN101751919A (en) * 2008-12-03 2010-06-23 中国科学院自动化研究所 Spoken Chinese stress automatic detection method
CN102254554A (en) * 2011-07-18 2011-11-23 中国科学院自动化研究所 Method for carrying out hierarchical modeling and predicating on mandarin accent
CN102436807A (en) * 2011-09-14 2012-05-02 苏州思必驰信息科技有限公司 Method and system for automatically generating voice with stressed syllables
CN102496363A (en) * 2011-11-11 2012-06-13 北京宇音天下科技有限公司 Correction method for Chinese speech synthesis tone
CN105895075A (en) * 2015-01-26 2016-08-24 科大讯飞股份有限公司 Method and system for improving synthetic voice rhythm naturalness

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113516963A (en) * 2020-04-09 2021-10-19 菜鸟智能物流控股有限公司 Audio data generation method and device, server and intelligent loudspeaker box
CN113516963B (en) * 2020-04-09 2023-11-10 菜鸟智能物流控股有限公司 Audio data generation method and device, server and intelligent sound box
CN112309367A (en) * 2020-11-03 2021-02-02 北京有竹居网络技术有限公司 Speech synthesis method, speech synthesis device, storage medium and electronic equipment
WO2022095754A1 (en) * 2020-11-03 2022-05-12 北京有竹居网络技术有限公司 Speech synthesis method and apparatus, storage medium, and electronic device
CN113205827A (en) * 2021-05-05 2021-08-03 张茜 High-precision extraction method and device for baby voice fundamental frequency and computer equipment
CN113205827B (en) * 2021-05-05 2022-02-15 张茜 High-precision extraction method and device for baby voice fundamental frequency and computer equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20181225