CN109616094A - Speech synthesis method, device, system and storage medium - Google Patents
Speech synthesis method, device, system and storage medium Download PDF Info
- Publication number
- CN109616094A CN109616094A CN201811648146.1A CN201811648146A CN109616094A CN 109616094 A CN109616094 A CN 109616094A CN 201811648146 A CN201811648146 A CN 201811648146A CN 109616094 A CN109616094 A CN 109616094A
- Authority
- CN
- China
- Prior art keywords
- speaker
- candidate
- information
- scene
- target
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 238000000034 method Methods 0.000 title claims abstract description 57
- 238000003860 storage Methods 0.000 title claims abstract description 26
- 230000002194 synthesizing effect Effects 0.000 title claims abstract description 19
- 230000015572 biosynthetic process Effects 0.000 claims abstract description 43
- 238000003786 synthesis reaction Methods 0.000 claims abstract description 43
- 230000015654 memory Effects 0.000 claims description 25
- 238000004590 computer program Methods 0.000 claims description 17
- 238000013507 mapping Methods 0.000 claims description 12
- 238000012790 confirmation Methods 0.000 claims description 10
- 230000002123 temporal effect Effects 0.000 claims description 8
- 238000009877 rendering Methods 0.000 claims description 5
- 238000012545 processing Methods 0.000 claims description 3
- 230000000694 effects Effects 0.000 abstract description 14
- 238000013461 design Methods 0.000 description 17
- 238000010586 diagram Methods 0.000 description 6
- 238000004891 communication Methods 0.000 description 3
- 238000005516 engineering process Methods 0.000 description 3
- 238000010276 construction Methods 0.000 description 2
- 239000000284 extract Substances 0.000 description 2
- 230000003068 static effect Effects 0.000 description 2
- 230000001360 synchronised effect Effects 0.000 description 2
- 230000005540 biological transmission Effects 0.000 description 1
- 238000006243 chemical reaction Methods 0.000 description 1
- 230000002452 interceptive effect Effects 0.000 description 1
- 230000014759 maintenance of location Effects 0.000 description 1
- 238000004519 manufacturing process Methods 0.000 description 1
- 239000000203 mixture Substances 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 238000006467 substitution reaction Methods 0.000 description 1
- 230000009466 transformation Effects 0.000 description 1
- 230000001755 vocal effect Effects 0.000 description 1
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Machine Translation (AREA)
Abstract
The present invention provides a speech synthesis method, device, system and storage medium. The method comprises: determining current scene information; obtaining all candidate speakers matching the current scene information; ranking the candidate speakers according to a preset rule to obtain a candidate speaker list; determining a target speaker according to the candidate speaker list; and converting text information into target speech according to the voice of the target speaker. A speaker matching the scene is thus automatically selected according to the received text and scene attributes, so that the synthesized speech can switch to the most suitable speaker for different scenes, making the final synthesized speech more realistic, improving the speech synthesis effect and the user experience.
Description
Technical field
The present invention relates to the field of speech processing technologies, and in particular to a speech synthesis method, device, system and storage medium.
Background technique
Speech synthesis (Text to Speech) is one of the important technologies and application directions in the field of artificial-intelligence speech. It is the process of converting text input by a user or product into speech, so that a machine imitates human "speaking" and outputs a human-like voice. It is mainly used in scenes such as audiobook reading, human-machine dialogue, smart speakers and intelligent customer service, and is one of the main ways for people to interact naturally with machines.
At present, in existing speech synthesis, a user (or product) inputs text that is converted to speech; the input text is synthesized with a speaker selected in advance, and the speaker's timbre and style are the only criteria for selecting the speaker. In practice, however, as usage scenes expand, different speakers perform differently in different scenes. For example, a bedtime scene may be better suited to a warm, soothing voice, while a commuting bus or subway scene may be better suited to a brisk, bright voice.
However, existing speech synthesis technology cannot adapt to changes of scene, which affects the final rendering effect of the synthesized speech and results in a poor user experience.
Summary of the invention
The present invention provides a speech synthesis method, device, system and storage medium, which can automatically select a speaker matching the scene according to the received text and scene attributes, so that the synthesized speech can switch to the most suitable speaker for different scenes, improving the speech synthesis effect and the user experience.
In a first aspect, an embodiment of the present invention provides a speech synthesis method, comprising:
determining current scene information;
obtaining all candidate speakers matching the current scene information;
ranking the candidate speakers according to a preset rule to obtain a candidate speaker list;
determining a target speaker according to the candidate speaker list;
converting text information into target speech according to the voice of the target speaker.
In a possible design, the determining of current scene information comprises:
obtaining scene information from the received text information, and taking the obtained scene information as the current scene information; or
determining the current scene information according to preset information, where the preset information comprises current location information, time information, weather information, network information, etc., and one or any combination of the preset information may be chosen to determine the current scene information;
wherein the scene information comprises: a bedtime scene, a night scene, a noon-break scene, a reading scene, a subway scene, a bus scene and an airport scene.
In a possible design, obtaining all candidate speakers matching the current scene information comprises:
obtaining all candidate speakers that match the current scene information from a database in which speaker voice packages and mapping relations between speakers and scene information are stored in advance.
In a possible design, the method further comprises:
updating the speaker voice packages and the mapping relations between speakers and scene information in the database.
In a possible design, ranking the candidate speakers according to a preset rule to obtain a candidate speaker list comprises:
obtaining the scene attribute weights of all candidate speakers corresponding to the current scene information, where a scene attribute weight characterizes how well a speaker matches a scene;
ranking the candidate speakers according to the scene attribute weights to obtain the candidate speaker list.
In a possible design, determining a target speaker according to the candidate speaker list comprises:
displaying the top-N ranked candidate speakers, N being a natural number greater than 0;
if the number of candidate speakers is 1, taking that candidate speaker as the target speaker;
if the number of candidate speakers is greater than 1, determining one candidate speaker from the candidate speaker list as the target speaker according to confirmation information input by the user; and if no confirmation information is received from the user within a preset time period, taking the first-ranked candidate speaker as the target speaker.
In a possible design, converting text information into target speech according to the voice of the target speaker comprises:
synthesizing initial speech from the text information using the voice of the target speaker;
receiving adjustment information for the initial speech to obtain the adjusted target speech, where the adjustment information is used to adjust audio attributes of the initial speech, the audio attributes comprising volume, pitch, speaking rate and background sound;
outputting the target speech.
In a second aspect, an embodiment of the present invention provides a speech synthesis device, comprising:
a first determining module, configured to determine current scene information;
an obtaining module, configured to obtain all candidate speakers matching the current scene information;
a ranking module, configured to rank the candidate speakers according to a preset rule to obtain a candidate speaker list;
a second determining module, configured to determine a target speaker according to the candidate speaker list;
a synthesis module, configured to convert text information into target speech according to the voice of the target speaker.
In a possible design, the first determining module is specifically configured to:
obtain scene information from the received text information, and take the obtained scene information as the current scene information; or
determine the current scene information according to preset information, where the preset information comprises current location information, time information, weather information, network information, etc., and one or any combination of the preset information may be chosen to determine the current scene information;
wherein the scene information comprises: a bedtime scene, a night scene, a noon-break scene, a reading scene, a subway scene, a bus scene and an airport scene.
In a possible design, the obtaining module is specifically configured to:
obtain all candidate speakers that match the current scene information from a database in which speaker voice packages and mapping relations between speakers and scene information are stored in advance.
In a possible design, the device further comprises:
an updating module, configured to update the speaker voice packages and the mapping relations between speakers and scene information in the database.
In a possible design, the ranking module is specifically configured to:
obtain the scene attribute weights of all candidate speakers corresponding to the current scene information, where a scene attribute weight characterizes how well a speaker matches a scene;
rank the candidate speakers according to the scene attribute weights to obtain the candidate speaker list.
In a possible design, the second determining module is specifically configured to:
display the top-N ranked candidate speakers, N being a natural number greater than 0;
if the number of candidate speakers is 1, take that candidate speaker as the target speaker;
if the number of candidate speakers is greater than 1, determine one candidate speaker from the candidate speaker list as the target speaker according to confirmation information input by the user; and if no confirmation information is received from the user within a preset time period, take the first-ranked candidate speaker as the target speaker.
In a possible design, the synthesis module is specifically configured to:
synthesize initial speech from the text information using the voice of the target speaker;
receive adjustment information for the initial speech to obtain the adjusted target speech, where the adjustment information is used to adjust audio attributes of the initial speech, the audio attributes comprising volume, pitch, speaking rate and background sound;
output the target speech.
In a third aspect, an embodiment of the present invention provides a speech synthesis system, comprising a memory and a processor, the memory storing instructions executable by the processor, wherein the processor is configured to execute the executable instructions to perform the speech synthesis method of any one of the first aspect.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium on which a computer program is stored, the program, when executed by a processor, implementing the speech synthesis method of any one of the first aspect.
In a fifth aspect, an embodiment of the present invention provides a program product comprising a computer program stored in a readable storage medium. At least one processor of a server can read the computer program from the readable storage medium, and the at least one processor executes the computer program so that the server performs the speech synthesis method of any one of the first aspect.
The present invention provides a speech synthesis method, device, system and storage medium, which determine current scene information; obtain all candidate speakers matching the current scene information; rank the candidate speakers according to a preset rule to obtain a candidate speaker list; determine a target speaker according to the candidate speaker list; and convert text information into target speech according to the voice of the target speaker. A speaker matching the scene is thus automatically selected according to the received text and scene attributes, so that the synthesized speech can switch to the most suitable speaker for different scenes, making the final synthesized speech more realistic, improving the speech synthesis effect and the user experience.
Detailed description of the invention
To describe the technical solutions in the embodiments of the present invention or the prior art more clearly, the accompanying drawings required for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show some embodiments of the present invention, and those of ordinary skill in the art may derive other drawings from them without creative effort.
Fig. 1 is a schematic diagram of an application scenario of the present invention;
Fig. 2 is a flowchart of the speech synthesis method provided by Embodiment 1 of the present invention;
Fig. 3 is a flowchart of the speech synthesis method provided by Embodiment 2 of the present invention;
Fig. 4 is a schematic structural diagram of the speech synthesis device provided by Embodiment 3 of the present invention;
Fig. 5 is a schematic structural diagram of the speech synthesis device provided by Embodiment 4 of the present invention;
Fig. 6 is a schematic structural diagram of the speech synthesis system provided by Embodiment 5 of the present invention.
The above drawings show specific embodiments of the present disclosure, which are described in more detail below. These drawings and the accompanying text are not intended to limit the scope of the disclosed concept in any way, but rather to illustrate the concept of the disclosure to those skilled in the art by reference to specific embodiments.
Specific embodiment
To make the objectives, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some rather than all of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
The terms "first", "second", "third", "fourth", etc. (if any) in the specification, claims and drawings are used to distinguish similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments of the present invention described herein can, for example, be implemented in orders other than those illustrated or described herein. In addition, the terms "comprising" and "having" and any variants thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product or device comprising a series of steps or units is not necessarily limited to those steps or units expressly listed, but may include other steps or units not expressly listed or inherent to such process, method, product or device.
The technical solution of the present invention is described in detail below with specific embodiments. The following specific embodiments may be combined with each other, and the same or similar concepts or processes may not be repeated in some embodiments.
In existing speech synthesis, a user (or product) inputs text that is converted to speech; the input text is synthesized with a speaker selected in advance, and the speaker's timbre and style are the only criteria for selecting the speaker. The speech synthesis solutions currently available on the market are all based on the text and the speaker's timbre and style, without distinguishing the scene in which the synthesis is used. The same speaker sounds the same under different scenes, so the per-scene synthesis effect is poor. The present invention integrates scene information into speech synthesis and recommends candidate speakers according to the current scene information, so that the synthesized speech is more realistic and the user experience is improved.
Fig. 1 is a schematic diagram of an application scenario of the present invention. As shown in Fig. 1, according to the received text information and scene attributes 11, the speech synthesis device 12 identifies and determines the current scene information, then recommends a suitable speaker according to the scene information, and finally synthesizes the initial speech corresponding to the input text according to the timbre of that speaker; the audio attributes of the initial speech can also be adjusted before outputting the target speech 13. When determining the scene information, semantic recognition can be performed on the input text to extract the scene information.
For example, if the input text is "Hurrying to catch the subway; the subway is so crowded today, I may be late for work", it can be recognized that the current scene is commuting to work, the vehicle taken is the subway, and the background may be noisy, so it is appropriate to determine that the scene information is a subway scene. When determining the scene, the current scene information can also be determined according to preset information, which includes current location information, time information, weather information, network information, etc.; one or any combination of the preset information may be chosen to determine the current scene information. For example, if the input text is "I searched for a while but could not find the book you wanted to borrow" and the current location information indicates a university library, the background should be a quiet atmosphere, so it is appropriate to determine that the scene information is a reading scene.
In a specific application, speech synthesis can be carried out as follows: determine the current scene information; obtain all candidate speakers matching the current scene information; rank the candidate speakers according to a preset rule to obtain a candidate speaker list; determine a target speaker according to the candidate speaker list; and convert text information into target speech according to the voice of the target speaker.
With the above method, a speaker matching the scene can be automatically selected according to the received text and scene attributes, so that the synthesized speech can switch to the most suitable speaker for different scenes, making the final synthesized speech more realistic, improving the speech synthesis effect and the user experience.
How the technical solutions of the present invention and of the present application solve the above technical problems is described in detail below with specific embodiments. The following specific embodiments may be combined with each other, and the same or similar concepts or processes may not be repeated in some embodiments. The embodiments of the present invention are described below with reference to the accompanying drawings.
Fig. 2 is a flowchart of the speech synthesis method provided by Embodiment 1 of the present invention. As shown in Fig. 2, the method in this embodiment may include:
S101: determine current scene information.
In this embodiment, scene information is obtained from the received text information and the obtained scene information is taken as the current scene information; or the current scene information is determined according to preset information, which includes current location information, time information, weather information, network information, etc. One or any combination of the preset information may be chosen to determine the current scene information. The scene information includes: a bedtime scene, a night scene, a noon-break scene, a reading scene, a subway scene, a bus scene, an airport scene, etc.
Optionally, when determining the scene information, semantic recognition can be performed on the input text information to extract the scene information.
Specifically, for example, if the input text is "Hurrying to catch the subway; the subway is so crowded today, I may be late for work", it can be recognized that the current scene is commuting to work, the vehicle taken is the subway, and the background may be noisy, so it is appropriate to determine that the scene information is a subway scene.
Optionally, when determining the scene, the current scene information can also be determined according to the current location information and/or time information. For example, if the input text is "I searched for a while but could not find the book you wanted to borrow" and the current location information indicates a university library, the background should be a quiet atmosphere, so it is appropriate to determine that the scene information is a reading scene. When determining the scene, the semantic information of the input text can also be combined with the current location information and/or time information. For example, if the input text is "Let me tell my little daughter a story" and the current time is 9 o'clock at night, it is difficult to determine a suitable scene from the text information or the time information alone, but combining the two makes it appropriate to determine that the current scene information is a bedtime scene.
It should be noted that this embodiment does not limit the types of scene information; those skilled in the art may add or remove types of scene information according to the actual situation.
S102: obtain all candidate speakers matching the current scene information.
In this embodiment, all candidate speakers matching the current scene information are obtained from a database in which speaker voice packages and mapping relations between speakers and scene information are stored in advance.
Optionally, a speaker voice package is composed of multiple speakers with different timbres and styles, and each speaker has two main groups of attributes: basic attributes and scene attributes. The basic attributes include information such as timbre, style, gender and age. For example, for the speaker Guo Degang, the basic attributes correspond to the deep, mellow voice of a mature male; for the speaker Lin Zhiling, the basic attributes correspond to the sweet voice of a charming female. The scene attributes include the usage scenes for which the speaker is suitable for synthesis and the corresponding scene attribute weights, ranging from 0 to 100; a weight indicates how strongly the speaker is recommended for that scene, and the larger the value, the stronger the recommendation. The main speech synthesis usage scenes include bedtime, night, afternoon, reading, subway, bus, airplane, high-speed rail, noon break, etc.
In this embodiment, all candidate speakers matching the current scene information are obtained from the database in which the speaker voice packages and the mapping relations between speakers and scene information are stored in advance. For example, if the input text is "Honey, come shopping with me tonight!", it can be determined that choosing the charming voice of Lin Zhiling as the corresponding speaker is appropriate, whereas choosing the voice of Guo Degang would break the context and sound unrealistic.
S103: rank the candidate speakers according to a preset rule to obtain a candidate speaker list.
In this embodiment, the scene attribute weights of all candidate speakers corresponding to the current scene information are obtained, where a scene attribute weight characterizes how well a speaker matches a scene; the candidate speakers are then ranked according to the scene attribute weights to obtain the candidate speaker list.
Specifically, based on the user's usage scene obtained in S101 and the suitable scenes matched for each speaker, the speakers suitable for synthesis matched in S102 are arranged in descending order of their scene attribute weights.
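The descending-weight ranking of S103 reduces to a sort over the candidates' scene attribute weights. A minimal sketch, with illustrative weight values:

```python
# Sketch of S103: rank the matched candidates in descending order of their
# scene attribute weight (0-100, higher means more strongly recommended).

def rank_candidates(scene_weights):
    """Return the candidate speaker list, best-matching speaker first.

    scene_weights maps each candidate speaker's name to its weight for
    the current scene.
    """
    return sorted(scene_weights, key=scene_weights.get, reverse=True)

# Illustrative weights for a "subway" scene:
subway_weights = {"speaker_a": 40, "speaker_c": 85}
```

Python's `sorted` is stable, so candidates with equal weights keep their original relative order, which is a reasonable tie-breaking choice here.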
S104: determine a target speaker according to the candidate speaker list.
In this embodiment, the top-N ranked candidate speakers are displayed, N being a natural number greater than 0. If the number of candidate speakers is 1, that candidate speaker is taken as the target speaker. If the number of candidate speakers is greater than 1, one candidate speaker is determined from the candidate speaker list as the target speaker according to confirmation information input by the user; if no confirmation information is received from the user within a preset time period, the first-ranked candidate speaker is taken as the target speaker.
Specifically, after arranging in descending order of weight, the Top-1 speaker is output as the default recommendation. The user is also allowed to designate a synthesis speaker from the speaker list of the input scene.
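The selection logic of S104 can be sketched as below. The `confirm` callback is an illustrative stand-in for real user-interface input; a returned `None` models the "no confirmation within the preset time period" case:

```python
# Sketch of S104: show the top-N candidates and pick the target speaker.
# A single candidate is taken directly; otherwise the user's confirmation
# decides, and on timeout (or invalid input) the Top-1 speaker is the default.

def choose_target(ranked, n=3, confirm=None):
    """ranked: candidate speaker list, best first; confirm: optional callback
    that shows the candidates and returns the user's choice or None."""
    shown = ranked[:n]                              # display the top-N candidates
    if len(shown) == 1:
        return shown[0]                             # only one candidate: take it
    choice = confirm(shown) if confirm else None    # ask the user (may time out)
    return choice if choice in shown else shown[0]  # fall back to Top-1
```

In a real system `confirm` would block on UI input with a timeout; the fallback to `shown[0]` implements the default-recommendation behaviour described above.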
S105: convert text information into target speech according to the voice of the target speaker.
In this embodiment, initial speech is synthesized from the text information using the voice of the target speaker; adjustment information for the initial speech is received to obtain the adjusted target speech, where the adjustment information is used to adjust audio attributes of the initial speech, the audio attributes including volume, pitch, speaking rate and background sound; and the target speech is output.
Specifically, the text information is synthesized into initial speech according to the timbre of the target speaker. The audio attributes can then be adjusted automatically in combination with the scene information, or adjusted manually according to the user's input. For example, a commuting bus or subway scene may be better suited to a brisk, bright voice with a noisier background, while a bedtime scene may be better suited to a warm, soothing voice with a quieter background. Specific background sounds can also be added according to the scene information; for a rainy-day scene, for example, the sound of rain pattering on banana leaves or of a heavy downpour can be added to the background sound.
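The attribute-adjustment step of S105 — scene-dependent defaults overridden by the user's manual adjustments — can be sketched as follows; all attribute names and numeric values are illustrative assumptions:

```python
# Sketch of S105's post-processing: scene-dependent defaults for the audio
# attributes (volume, speaking rate, background sound), optionally
# overridden by the user's manual adjustments.

SCENE_AUDIO_DEFAULTS = {
    "bedtime": {"volume": 0.5, "rate": 0.9, "background": "quiet"},
    "subway":  {"volume": 0.9, "rate": 1.1, "background": "none"},
    "rainy":   {"volume": 0.7, "rate": 1.0, "background": "rain_on_leaves"},
}

def audio_attributes(scene, user_overrides=None):
    """Start from the scene defaults, then apply any manual user adjustments."""
    defaults = SCENE_AUDIO_DEFAULTS.get(
        scene, {"volume": 0.8, "rate": 1.0, "background": "none"})
    attrs = dict(defaults)          # copy so the defaults table is not mutated
    attrs.update(user_overrides or {})
    return attrs
```

The resulting attribute dictionary would then be handed to the synthesis backend that renders the target speech.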
In this embodiment, the current scene information is determined; all candidate speakers matching the current scene information are obtained; the candidate speakers are ranked according to a preset rule to obtain a candidate speaker list; a target speaker is determined according to the candidate speaker list; and text information is converted into target speech according to the voice of the target speaker. A speaker matching the scene is thus automatically selected according to the received text and scene attributes, so that the synthesized speech can switch to the most suitable speaker for different scenes, making the final synthesized speech more realistic, improving the speech synthesis effect and the user experience.
Fig. 3 is a flowchart of the speech synthesis method provided by Embodiment 2 of the present invention. As shown in Fig. 3, the method in this embodiment may include:
S201: update the speaker voice packages and the mapping relations between speakers and scene information in the database.
In this embodiment, the number of speakers in a speaker voice package can be increased or reduced dynamically, and the attributes and weights corresponding to a speaker can also be adjusted according to the user's choices. For example, according to a recent hot topic, a voice package for the speaker Wang Yuan can be added to the voice package set; or, according to the user's habits, the weight of the voice package of the speaker Lin Zhiling can be increased so that it is preferentially recommended.
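The update operations of S201 — adding or removing speakers and adjusting their scene weights — can be sketched as small mutations on the mapping; speaker names, scenes and weight values are illustrative:

```python
# Sketch of S201: dynamically add/remove speakers and adjust scene weights
# in the speaker-to-scene mapping.

def add_speaker(db, name, scene_weights):
    """Register a new speaker with its scene attribute weights."""
    db[name] = dict(scene_weights)

def remove_speaker(db, name):
    """Remove a speaker; missing names are ignored."""
    db.pop(name, None)

def boost_weight(db, name, scene, delta, cap=100):
    """Raise a speaker's weight for a scene (e.g. to reflect user preference),
    keeping it within the 0-100 range described for scene attribute weights."""
    db[name][scene] = min(cap, db[name].get(scene, 0) + delta)

# Illustrative starting state of the mapping:
mapping = {"lin_zhiling": {"bedtime": 90}}
```

Here `db` stands in for the persistent database; a real system would wrap these mutations in transactions against the stored mapping relations.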
S202: determine current scene information.
S203: obtain all candidate speakers matching the current scene information.
S204: rank the candidate speakers according to a preset rule to obtain a candidate speaker list.
S205: determine a target speaker according to the candidate speaker list.
S206: convert text information into target speech according to the voice of the target speaker.
In this embodiment, for the specific implementation process and technical principles of steps S202 to S206, refer to the related description of steps S101 to S105 in the method shown in Fig. 2, which will not be repeated here.
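The S202–S206 flow can be condensed into one small end-to-end sketch. The speaker data, weights, and the bracketed-string stand-in for actual speech synthesis are all assumptions for illustration:

```python
# Illustrative speaker database: speaker -> {scene: scene attribute weight}.
DB = {
    "Speaker A": {"night": 0.9, "subway": 0.2},
    "Speaker B": {"night": 0.5, "subway": 0.8},
}

def pipeline(text: str, scene: str) -> str:
    # S202: the scene is given; S203: keep candidates that match the scene.
    candidates = [s for s, weights in DB.items() if scene in weights]
    # S204: rank by scene attribute weight (the "preset rule").
    ranked = sorted(candidates, key=lambda s: DB[s][scene], reverse=True)
    # S205: with no user confirmation modeled here, the top candidate wins.
    target = ranked[0]
    # S206: convert text with the target speaker's voice (placeholder string).
    return f"[{target}] {text}"
```

Each step is expanded in the module descriptions below; this sketch only shows how they compose.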
In this embodiment, current scene information is determined; all candidate speakers matching the current scene information are obtained; the candidate speakers are ranked according to a preset rule to obtain a candidate speaker list; a target speaker is determined according to the candidate speaker list; and text information is converted into target speech using the target speaker's voice. In this way, a speaker matching the scene is selected automatically based on the received text and scene attributes, allowing the synthesized speech to switch to the most suitable speaker for different scenes, which makes the final synthesized speech more natural, improves the speech synthesis effect, and gives a good user experience.
In addition, this embodiment can update the speaker voice packs in the database and the mapping relations between speakers and scene information, improving the user experience. For example, the user can regularly update the speaker voice packs in the database, or record a voice pack of his or her own voice.
Fig. 4 is a schematic structural diagram of the speech synthesis apparatus provided by Embodiment 3 of the present invention. As shown in Fig. 4, the speech synthesis apparatus of this embodiment may include:
a first determining module 31, configured to determine current scene information;
an obtaining module 32, configured to obtain all candidate speakers matching the current scene information;
a ranking module 33, configured to rank the candidate speakers according to a preset rule to obtain a candidate speaker list;
a second determining module 34, configured to determine a target speaker according to the candidate speaker list; and
a synthesis module 35, configured to convert text information into target speech according to the voice of the target speaker.
In one possible design, the first determining module 31 is specifically configured to:
obtain scene information from the received text information, and use the obtained scene information as the current scene information; or
determine the current scene information according to preset information, where the preset information includes current location information, time information, weather information, network information, and the like, and one or any combination of these items may be used to determine the current scene information;
where the scene information includes: a pre-sleep scene, a night scene, a lunch-break scene, a reading scene, a subway scene, a bus scene, and an airport scene.
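The preset-information branch can be sketched as a rule cascade over location and time. The specific thresholds and the priority order (location before time) are assumptions, not rules stated in the patent:

```python
def scene_from_presets(hour: int, location: str) -> str:
    """Map preset information (time of day, location) to a scene name.

    Hypothetical rules: location cues dominate; otherwise fall back to
    time-of-day windows.
    """
    if location == "subway":
        return "subway scene"
    if location == "airport":
        return "airport scene"
    if 22 <= hour or hour < 6:
        return "pre-sleep scene"
    if 12 <= hour < 14:
        return "lunch-break scene"
    return "reading scene"
```

A fuller version would also weigh weather and network information, and combining several preset items (as the design allows) could be done by scoring each rule rather than taking the first match.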
In one possible design, the obtaining module 32 is specifically configured to:
obtain all candidate speakers that meet the current scene information from a database in which speaker voice packs and the mapping relations between speakers and scene information are stored in advance.
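The candidate lookup is essentially a filter over the stored speaker-to-scene mapping. An in-memory dict stands in for the database here, and the speaker and scene names are invented for illustration:

```python
# Stand-in for the pre-stored mapping between speakers and scene information.
MAPPING = {
    "Speaker A": {"night scene", "reading scene"},
    "Speaker B": {"subway scene"},
    "Speaker C": {"night scene"},
}

def candidates_for(scene: str) -> list[str]:
    """Return all candidate speakers whose scene set contains the scene."""
    return sorted(s for s, scenes in MAPPING.items() if scene in scenes)
```

With a real database this would be a query on a speaker/scene join table rather than a dict comprehension.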
In one possible design, the ranking module 33 is specifically configured to:
obtain the scene attribute weight values of all candidate speakers corresponding to the current scene information, where a scene attribute weight value characterizes the degree of matching between a speaker and a scene; and
rank the candidate speakers according to the scene attribute weight values to obtain the candidate speaker list.
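The ranking step reduces to sorting candidates by their scene attribute weight, with a higher weight meaning a better speaker/scene match. The weight values below are invented for illustration:

```python
def rank_candidates(weights: dict[str, float]) -> list[str]:
    """Sort candidate speakers by scene attribute weight, best match first.

    `weights` maps each candidate speaker to its weight for the current scene.
    """
    return sorted(weights, key=weights.get, reverse=True)
```

Usage: `rank_candidates({"A": 0.3, "B": 0.9, "C": 0.6})` yields the candidate speaker list `["B", "C", "A"]`.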
In one possible design, the second determining module 34 is specifically configured to:
display the top-N ranked candidate speakers, where N is a natural number greater than 0;
if the number of candidate speakers is 1, use that candidate speaker as the target speaker;
if the number of candidate speakers is greater than 1, determine one candidate speaker from the candidate speaker list as the target speaker according to confirmation information input by the user; and if no confirmation information input by the user is received within a preset time period, use the first-ranked candidate speaker as the target speaker.
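The selection logic above can be sketched as follows. Modeling the confirmation timeout as `user_choice=None` is an assumption of this sketch; a real implementation would wait on an input event with a deadline:

```python
from typing import Optional

def choose_target(ranked: list[str], n: int,
                  user_choice: Optional[str] = None) -> str:
    """Pick the target speaker from a ranked candidate list.

    Shows the top-N candidates; a single candidate is chosen directly; a valid
    user confirmation wins; otherwise (timeout) fall back to rank 1.
    """
    shown = ranked[:n]  # display the top-N ranked candidates
    if len(shown) == 1:
        return shown[0]
    if user_choice in shown:
        return user_choice
    return shown[0]  # no confirmation within the preset period: first-ranked wins
```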
In one possible design, the synthesis module 35 is specifically configured to:
synthesize initial speech from the text information using the voice of the target speaker;
receive adjustment information for the initial speech to obtain the adjusted target speech, where the adjustment information is used to adjust audio attributes of the initial speech, and the audio attributes include volume, pitch, speaking rate, and background sound; and
output the target speech.
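The synthesize-then-adjust design can be sketched with an immutable record of audio attributes. The `Speech` dataclass, the string placeholder for synthesis, and the keyword-based adjustment interface are assumptions; real signal processing is out of scope:

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Speech:
    text: str
    volume: float = 1.0
    pitch: float = 1.0
    rate: float = 1.0
    background: str = ""

def synthesize(text: str, speaker: str) -> Speech:
    """Produce the initial speech; the bracketed string stands in for audio."""
    return Speech(text=f"[{speaker}] {text}")

def adjust(speech: Speech, **changes) -> Speech:
    """Apply adjustment information; only the four audio attributes may change."""
    allowed = {"volume", "pitch", "rate", "background"}
    if not set(changes) <= allowed:
        raise ValueError(f"unknown audio attribute(s): {set(changes) - allowed}")
    return replace(speech, **changes)
```

Making `Speech` frozen means each adjustment yields a new target speech while the initial speech stays intact, which matches the initial-speech/target-speech distinction in the design.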
The speech synthesis apparatus of this embodiment can execute the technical solution of the method shown in Fig. 2; for the specific implementation process and technical principles, refer to the related description of the method shown in Fig. 2, which will not be repeated here.
In this embodiment, current scene information is determined; all candidate speakers matching the current scene information are obtained; the candidate speakers are ranked according to a preset rule to obtain a candidate speaker list; a target speaker is determined according to the candidate speaker list; and text information is converted into target speech using the target speaker's voice. In this way, a speaker matching the scene is selected automatically based on the received text and scene attributes, allowing the synthesized speech to switch to the most suitable speaker for different scenes, which makes the final synthesized speech more natural, improves the speech synthesis effect, and gives a good user experience.
Fig. 5 is a schematic structural diagram of the speech synthesis apparatus provided by Embodiment 4 of the present invention. As shown in Fig. 5, on the basis of the apparatus shown in Fig. 4, the speech synthesis apparatus of this embodiment may further include:
an update module 36, configured to update the speaker voice packs in the database and the mapping relations between speakers and scene information.
The speech synthesis apparatus of this embodiment can execute the technical solutions of the methods shown in Fig. 2 and Fig. 3; for the specific implementation process and technical principles, refer to the related descriptions of the methods shown in Fig. 2 and Fig. 3, which will not be repeated here.
In this embodiment, current scene information is determined; all candidate speakers matching the current scene information are obtained; the candidate speakers are ranked according to a preset rule to obtain a candidate speaker list; a target speaker is determined according to the candidate speaker list; and text information is converted into target speech using the target speaker's voice. In this way, a speaker matching the scene is selected automatically based on the received text and scene attributes, allowing the synthesized speech to switch to the most suitable speaker for different scenes, which makes the final synthesized speech more natural, improves the speech synthesis effect, and gives a good user experience.
In addition, this embodiment can update the speaker voice packs in the database and the mapping relations between speakers and scene information, improving the user experience.
Fig. 6 is a schematic structural diagram of the speech synthesis system provided by Embodiment 5 of the present invention. As shown in Fig. 6, the speech synthesis system 40 of this embodiment may include a processor 41 and a memory 42.
The memory 42 is configured to store a program. The memory 42 may include volatile memory, for example random-access memory (RAM), such as static random-access memory (SRAM) or double data rate synchronous dynamic random-access memory (DDR SDRAM); the memory may also include non-volatile memory, such as flash memory. The memory 42 is used to store computer programs (such as the application programs and functional modules implementing the above methods), computer instructions, and the like; these computer programs, computer instructions, data, and the like may be stored in partitions in one or more memories 42 and may be called by the processor 41.
The processor 41 is configured to execute the computer program stored in the memory 42 to implement the steps of the methods in the above embodiments. For details, refer to the related descriptions in the foregoing method embodiments.
The processor 41 and the memory 42 may be separate structures, or may be integrated into a single integrated structure. When the processor 41 and the memory 42 are separate structures, the memory 42 and the processor 41 may be coupled and connected through a bus 43.
In this embodiment, current scene information is determined; all candidate speakers matching the current scene information are obtained; the candidate speakers are ranked according to a preset rule to obtain a candidate speaker list; a target speaker is determined according to the candidate speaker list; and text information is converted into target speech using the target speaker's voice. In this way, a speaker matching the scene is selected automatically based on the received text and scene attributes, allowing the synthesized speech to switch to the most suitable speaker for different scenes, which makes the final synthesized speech more natural, improves the speech synthesis effect, and gives a good user experience.
The server of this embodiment can execute the technical solutions of the methods shown in Fig. 2 and Fig. 3; for the specific implementation process and technical principles, refer to the related descriptions of the methods shown in Fig. 2 and Fig. 3, which will not be repeated here.
In addition, an embodiment of the present application further provides a computer-readable storage medium storing computer-executable instructions. When at least one processor of a user equipment executes the computer-executable instructions, the user equipment performs the above various possible methods.
In this embodiment, current scene information is determined; all candidate speakers matching the current scene information are obtained; the candidate speakers are ranked according to a preset rule to obtain a candidate speaker list; a target speaker is determined according to the candidate speaker list; and text information is converted into target speech using the target speaker's voice. In this way, a speaker matching the scene is selected automatically based on the received text and scene attributes, allowing the synthesized speech to switch to the most suitable speaker for different scenes, which makes the final synthesized speech more natural, improves the speech synthesis effect, and gives a good user experience.
The computer-readable medium includes a computer storage medium and a communication medium, where the communication medium includes any medium that facilitates the transfer of a computer program from one place to another. A storage medium may be any available medium that a general-purpose or special-purpose computer can access. An exemplary storage medium is coupled to the processor so that the processor can read information from, and write information to, the storage medium. Of course, the storage medium may also be an integral part of the processor. The processor and the storage medium may reside in an ASIC, and the ASIC may reside in a user equipment. Alternatively, the processor and the storage medium may exist as discrete components in a communication device.
The present application also provides a program product. The program product includes a computer program stored in a readable storage medium. At least one processor of a server can read the computer program from the readable storage medium, and the at least one processor executes the computer program so that the server implements the speech synthesis method of any embodiment of the present invention.
Those of ordinary skill in the art will appreciate that all or part of the steps of the above method embodiments can be completed by hardware under the control of program instructions. The aforementioned program can be stored in a computer-readable storage medium; when the program is executed, the steps of the above method embodiments are performed. The aforementioned storage medium includes various media that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disc.
Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that they may still modify the technical solutions described in the foregoing embodiments, or make equivalent replacements of some or all of the technical features therein; and such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.
Claims (10)
1. A speech synthesis method, comprising:
determining current scene information;
obtaining all candidate speakers matching the current scene information;
ranking the candidate speakers according to a preset rule to obtain a candidate speaker list;
determining a target speaker according to the candidate speaker list; and
converting text information into target speech according to the voice of the target speaker.
2. The method according to claim 1, wherein the determining of current scene information comprises:
obtaining scene information from the received text information, and using the obtained scene information as the current scene information; or
determining the current scene information according to preset information, wherein the preset information includes: current location information, time information, weather information, and network information;
wherein the scene information includes: a pre-sleep scene, a night scene, a lunch-break scene, a reading scene, a subway scene, a bus scene, and an airport scene.
3. The method according to claim 1, wherein obtaining all candidate speakers matching the current scene information comprises:
obtaining all candidate speakers meeting the current scene information from a database in which speaker voice packs and the mapping relations between speakers and scene information are stored in advance.
4. The method according to claim 3, further comprising:
updating the speaker voice packs and the mapping relations between speakers and scene information in the database.
5. The method according to claim 1, wherein ranking the candidate speakers according to a preset rule to obtain a candidate speaker list comprises:
obtaining scene attribute weight values of all candidate speakers corresponding to the current scene information, wherein a scene attribute weight value characterizes a degree of matching between a speaker and a scene; and
ranking the candidate speakers according to the scene attribute weight values to obtain the candidate speaker list.
6. The method according to claim 1, wherein determining a target speaker according to the candidate speaker list comprises:
displaying the top-N ranked candidate speakers, N being a natural number greater than 0;
if the number of candidate speakers is 1, using that candidate speaker as the target speaker;
if the number of candidate speakers is greater than 1, determining one candidate speaker from the candidate speaker list as the target speaker according to confirmation information input by a user; and if no confirmation information input by the user is received within a preset time period, using the first-ranked candidate speaker as the target speaker.
7. The method according to any one of claims 1 to 6, wherein converting text information into target speech according to the voice of the target speaker comprises:
synthesizing initial speech from the text information using the voice of the target speaker;
receiving adjustment information for the initial speech to obtain the adjusted target speech, wherein the adjustment information is used to adjust audio attributes of the initial speech, and the audio attributes include: volume, pitch, speaking rate, and background sound; and
outputting the target speech.
8. A speech synthesis apparatus, comprising:
a first determining module, configured to determine current scene information;
an obtaining module, configured to obtain all candidate speakers matching the current scene information;
a ranking module, configured to rank the candidate speakers according to a preset rule to obtain a candidate speaker list;
a second determining module, configured to determine a target speaker according to the candidate speaker list; and
a synthesis module, configured to convert text information into target speech according to the voice of the target speaker.
9. A speech synthesis system, comprising a memory and a processor, the memory storing instructions executable by the processor; wherein the processor is configured to perform, via execution of the executable instructions, the speech synthesis method according to any one of claims 1 to 7.
10. A computer-readable storage medium on which a computer program is stored, wherein, when the program is executed by a processor, the speech synthesis method according to any one of claims 1 to 7 is implemented.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811648146.1A CN109616094A (en) | 2018-12-29 | 2018-12-29 | Phoneme synthesizing method, device, system and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109616094A true CN109616094A (en) | 2019-04-12 |
Family
ID=66017285
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811648146.1A Pending CN109616094A (en) | 2018-12-29 | 2018-12-29 | Phoneme synthesizing method, device, system and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109616094A (en) |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104750839A (en) * | 2015-04-03 | 2015-07-01 | 魅族科技(中国)有限公司 | Data recommendation method, terminal and server |
CN105047193A (en) * | 2015-08-27 | 2015-11-11 | 百度在线网络技术(北京)有限公司 | Voice broadcasting method and apparatus |
CN105302908A (en) * | 2015-11-02 | 2016-02-03 | 北京奇虎科技有限公司 | E-book related audio resource recommendation method and apparatus |
CN105550316A (en) * | 2015-12-14 | 2016-05-04 | 广州酷狗计算机科技有限公司 | Pushing method and device of audio list |
CN105609096A (en) * | 2015-12-30 | 2016-05-25 | 小米科技有限责任公司 | Text data output method and device |
CN106610968A (en) * | 2015-10-21 | 2017-05-03 | 广州酷狗计算机科技有限公司 | Song menu list determination method and apparatus, and electronic device |
CN106649644A (en) * | 2016-12-08 | 2017-05-10 | 腾讯音乐娱乐(深圳)有限公司 | Lyric file generation method and device |
CN107731219A (en) * | 2017-09-06 | 2018-02-23 | 百度在线网络技术(北京)有限公司 | Phonetic synthesis processing method, device and equipment |
US9972301B2 (en) * | 2016-10-18 | 2018-05-15 | Mastercard International Incorporated | Systems and methods for correcting text-to-speech pronunciation |
CN108536655A (en) * | 2017-12-21 | 2018-09-14 | 广州市讯飞樽鸿信息技术有限公司 | Audio production method and system are read aloud in a kind of displaying based on hand-held intelligent terminal |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111833857A (en) * | 2019-04-16 | 2020-10-27 | 阿里巴巴集团控股有限公司 | Voice processing method and device and distributed system |
CN111833857B (en) * | 2019-04-16 | 2024-05-24 | 斑马智行网络(香港)有限公司 | Voice processing method, device and distributed system |
CN110264992A (en) * | 2019-06-11 | 2019-09-20 | 百度在线网络技术(北京)有限公司 | Speech synthesis processing method, device, equipment and storage medium |
CN110534131A (en) * | 2019-08-30 | 2019-12-03 | 广州华多网络科技有限公司 | A kind of audio frequency playing method and system |
CN111161737A (en) * | 2019-12-23 | 2020-05-15 | 北京欧珀通信有限公司 | Data processing method and device, electronic equipment and storage medium |
CN111415650A (en) * | 2020-03-25 | 2020-07-14 | 广州酷狗计算机科技有限公司 | Text-to-speech method, device, equipment and storage medium |
CN112181348A (en) * | 2020-08-28 | 2021-01-05 | 星络智能科技有限公司 | Sound style switching method, system, computer equipment and readable storage medium |
CN112185338A (en) * | 2020-09-30 | 2021-01-05 | 北京大米科技有限公司 | Audio processing method and device, readable storage medium and electronic equipment |
CN112185338B (en) * | 2020-09-30 | 2024-01-23 | 北京大米科技有限公司 | Audio processing method, device, readable storage medium and electronic equipment |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109616094A (en) | Phoneme synthesizing method, device, system and storage medium | |
US11922924B2 (en) | Multilingual neural text-to-speech synthesis | |
JP5768093B2 (en) | Speech processing system | |
CN105845125B (en) | Phoneme synthesizing method and speech synthetic device | |
US10140972B2 (en) | Text to speech processing system and method, and an acoustic model training system and method | |
JP6246777B2 (en) | Speech synthesis method, apparatus and program | |
US9472190B2 (en) | Method and system for automatic speech recognition | |
US9269347B2 (en) | Text to speech system | |
US10720157B1 (en) | Voice to voice natural language understanding processing | |
WO2018049979A1 (en) | Animation synthesis method and device | |
CN113892135A (en) | Multi-lingual speech synthesis and cross-lingual voice cloning | |
CN104021784B (en) | Phoneme synthesizing method and device based on Big-corpus | |
US20140210830A1 (en) | Computer generated head | |
CN109523986A (en) | Phoneme synthesizing method, device, equipment and storage medium | |
CN105609097A (en) | Speech synthesis apparatus and control method thereof | |
CN107239547B (en) | Voice error correction method, terminal and storage medium for ordering song by voice | |
US11289082B1 (en) | Speech processing output personalization | |
CN108831437A (en) | A kind of song generation method, device, terminal and storage medium | |
CN101901598A (en) | Humming synthesis method and system | |
CN111223474A (en) | Voice cloning method and system based on multi-neural network | |
GB2510201A (en) | Animating a computer generated head based on information to be output by the head | |
Wilson | Conflicting language ideologies in choral singing in Trinidad | |
Liu et al. | Non-parallel voice conversion with autoregressive conversion model and duration adjustment | |
Luong et al. | Laughnet: synthesizing laughter utterances from waveform silhouettes and a single laughter example | |
Secujski et al. | Speaker/Style-Dependent Neural Network Speech Synthesis Based on Speaker/Style Embedding. |
Legal Events
Date | Code | Title | Description
---|---|---|---
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||