CN109616094A - Speech synthesis method, apparatus, system and storage medium - Google Patents

Speech synthesis method, apparatus, system and storage medium

Info

Publication number
CN109616094A
Authority
CN
China
Prior art keywords
speaker
candidate
information
scene
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811648146.1A
Other languages
Chinese (zh)
Inventor
杨杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Baidu Online Network Technology Beijing Co Ltd
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201811648146.1A
Publication of CN109616094A
Legal status: Pending


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

The present invention provides a speech synthesis method, apparatus, system, and storage medium. The method comprises: determining current scene information; obtaining all candidate speakers matching the current scene information; ranking the candidate speakers according to a preset rule to obtain a candidate speaker list; determining a target speaker according to the candidate speaker list; and converting text information into a target voice according to the voice of the target speaker. A speaker matching the scene is thus automatically selected according to the received text and scene attributes, so that the synthesized speech switches to the most suitable speaker for each scene, the finally synthesized voice is more natural, the speech synthesis effect is improved, and the user experience is good.

Description

Speech synthesis method, apparatus, system and storage medium
Technical field
The present invention relates to the field of speech processing technologies, and in particular, to a speech synthesis method, apparatus, system, and storage medium.
Background technique
Speech synthesis (Text to Speech) is one of the important technologies and application directions in the field of artificial-intelligence speech. It is the process of converting text input by a user or a product into speech, producing a human-like voice by having a machine imitate human speaking. It is mainly used in scenarios such as audio reading, human-computer dialogue, smart speakers, and intelligent customer service, and is one of the main ways for people to interact naturally with machines.
At present, existing speech synthesis converts text input by a user (or product) into speech using a speaker selected in advance, where the speaker's timbre style is the only selection criterion. In practice, as voice scenarios expand, different speakers perform differently in different scenes. For example, a bedtime scene may be better suited to a warm, soothing voice, while a commute scene on a bus or subway may be better suited to a brisk, bright voice.
However, the existing speech synthesis technology cannot adapt to changes of scene, which affects the final presentation effect of the synthesized speech and results in a poor user experience.
Summary of the invention
The present invention provides a speech synthesis method, apparatus, system, and storage medium that can automatically select a speaker matching the scene according to the received text and scene attributes, so that the synthesized speech switches to the most suitable speaker for each scene, improving the speech synthesis effect and the user experience.
In a first aspect, an embodiment of the present invention provides a speech synthesis method, comprising:
determining current scene information;
obtaining all candidate speakers matching the current scene information;
ranking the candidate speakers according to a preset rule to obtain a candidate speaker list;
determining a target speaker according to the candidate speaker list; and
converting text information into a target voice according to the voice of the target speaker.
In a possible design, determining the current scene information comprises:
obtaining scene information from the received text information and using the obtained scene information as the current scene information; or
determining the current scene information according to preset information, the preset information including current location information, time information, weather information, network information, and the like, where one or any combination of the preset information items may be used to determine the current scene information;
wherein the scene information includes: a bedtime scene, a night scene, a noon-break scene, a reading scene, a subway scene, a bus scene, and an airport scene.
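The two determination paths above, extracting scene information from the received text or falling back to preset information, can be sketched as follows. The keyword lists, scene names, and fallback rules are illustrative assumptions, not part of the claimed method:

```python
# Minimal sketch: scene detection by keyword match on the text,
# with a fallback to preset information (location, time of day).
SCENE_KEYWORDS = {
    "subway_scene": ["subway", "metro", "commute"],
    "reading_scene": ["library", "book", "borrow"],
    "bedtime_scene": ["bedtime", "story", "sleep"],
}

def scene_from_text(text):
    """Return the first scene whose keywords appear in the text, else None."""
    lowered = text.lower()
    for scene, words in SCENE_KEYWORDS.items():
        if any(w in lowered for w in words):
            return scene
    return None

def scene_from_presets(location=None, hour=None):
    """Fall back to preset information such as location and time of day."""
    if location and "library" in location.lower():
        return "reading_scene"
    if hour is not None and (hour >= 21 or hour < 6):
        return "bedtime_scene"
    return "default_scene"

def determine_scene(text, location=None, hour=None):
    return scene_from_text(text) or scene_from_presets(location, hour)
```

In this sketch the text takes precedence over the presets; the claim also allows combining several preset items, which would extend `scene_from_presets` accordingly.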
In a possible design, obtaining all candidate speakers matching the current scene information comprises:
obtaining, from a database that pre-stores speaker voice packs and mapping relations between speakers and scene information, all candidate speakers matching the current scene information.
In a possible design, the method further comprises:
updating the speaker voice packs and the mapping relations between speakers and scene information in the database.
In a possible design, ranking the candidate speakers according to a preset rule to obtain a candidate speaker list comprises:
obtaining the scene attribute weight of each candidate speaker corresponding to the current scene information, where the scene attribute weight characterizes the matching degree between a speaker and a scene; and
ranking the candidate speakers according to the scene attribute weights to obtain the candidate speaker list.
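The ranking step can be sketched as follows, assuming each candidate speaker carries a per-scene weight table; the speaker names and weight values are illustrative:

```python
# Sketch of the ranking rule: sort candidate speakers by their
# scene attribute weight for the current scene, descending.
def rank_candidates(candidates, scene):
    """Sort candidate speakers by their weight for the given scene, descending."""
    return sorted(candidates,
                  key=lambda spk: spk["scene_weights"].get(scene, 0),
                  reverse=True)

candidates = [
    {"name": "speaker_a", "scene_weights": {"subway_scene": 40, "bedtime_scene": 90}},
    {"name": "speaker_b", "scene_weights": {"subway_scene": 85}},
    {"name": "speaker_c", "scene_weights": {"subway_scene": 60}},
]

ranked = rank_candidates(candidates, "subway_scene")
```

A speaker with no weight recorded for the scene sorts last (weight 0), which is one reasonable reading of "matching degree"; the patent itself does not specify the missing-weight case.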
In a possible design, determining a target speaker according to the candidate speaker list comprises:
displaying the top N ranked candidate speakers, N being a natural number greater than 0;
if the number of candidate speakers is 1, using that candidate speaker as the target speaker;
if the number of candidate speakers is greater than 1, determining one candidate speaker from the candidate speaker list as the target speaker according to confirmation information input by a user; and if no confirmation information input by the user is received within a preset period of time, using the first-ranked candidate speaker as the target speaker.
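The selection logic above (single candidate, user confirmation, timeout default) can be sketched as follows; modelling the timeout as simply an absent `user_choice` is an assumption:

```python
def choose_target_speaker(ranked_list, user_choice=None):
    """Pick the target speaker from a ranked candidate list.

    A single candidate is used directly; otherwise the user's confirmed
    choice wins, and absent any confirmation (e.g. after the preset
    timeout) the top-ranked candidate is the default.
    """
    if not ranked_list:
        return None
    if len(ranked_list) == 1:
        return ranked_list[0]
    if user_choice in ranked_list:
        return user_choice
    return ranked_list[0]  # timeout / no confirmation: default to rank 1
```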
In a possible design, converting text information into a target voice according to the voice of the target speaker comprises:
synthesizing an initial voice from the text information using the voice of the target speaker;
receiving adjustment information for the initial voice to obtain the adjusted target voice, where the adjustment information is used to adjust audio attributes of the initial voice, the audio attributes including volume, pitch, speaking rate, and background sound; and
outputting the target voice.
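The synthesis-and-adjustment step can be sketched as follows, with a stand-in for the actual TTS engine; the attribute names and default values are illustrative assumptions:

```python
# Sketch: synthesize an initial voice, then apply user adjustments
# to its audio attributes (volume, pitch, speaking rate, background sound).
DEFAULT_ATTRS = {"volume": 50, "pitch": 0, "rate": 1.0, "background": None}

def synthesize_initial(text, speaker):
    """Stand-in for TTS: return a voice record with default attributes."""
    return {"speaker": speaker, "text": text, "attrs": dict(DEFAULT_ATTRS)}

def apply_adjustments(voice, adjustments):
    """Return the target voice with the user's attribute overrides applied."""
    unknown = set(adjustments) - set(DEFAULT_ATTRS)
    if unknown:
        raise ValueError("unsupported audio attributes: %s" % unknown)
    adjusted = dict(voice)
    adjusted["attrs"] = {**voice["attrs"], **adjustments}
    return adjusted
```

The initial voice record is left unmodified, so the pre-adjustment version remains available if the user discards the adjustment.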
In a second aspect, an embodiment of the present invention provides a speech synthesis apparatus, comprising:
a first determining module, configured to determine current scene information;
an obtaining module, configured to obtain all candidate speakers matching the current scene information;
a ranking module, configured to rank the candidate speakers according to a preset rule to obtain a candidate speaker list;
a second determining module, configured to determine a target speaker according to the candidate speaker list; and
a synthesis module, configured to convert text information into a target voice according to the voice of the target speaker.
In a possible design, the first determining module is specifically configured to:
obtain scene information from the received text information and use the obtained scene information as the current scene information; or
determine the current scene information according to preset information, the preset information including current location information, time information, weather information, network information, and the like, where one or any combination of the preset information items may be used to determine the current scene information;
wherein the scene information includes: a bedtime scene, a night scene, a noon-break scene, a reading scene, a subway scene, a bus scene, and an airport scene.
In a possible design, the obtaining module is specifically configured to:
obtain, from a database that pre-stores speaker voice packs and mapping relations between speakers and scene information, all candidate speakers matching the current scene information.
In a possible design, the apparatus further comprises:
an updating module, configured to update the speaker voice packs and the mapping relations between speakers and scene information in the database.
In a possible design, the ranking module is specifically configured to:
obtain the scene attribute weight of each candidate speaker corresponding to the current scene information, where the scene attribute weight characterizes the matching degree between a speaker and a scene; and
rank the candidate speakers according to the scene attribute weights to obtain the candidate speaker list.
In a possible design, the second determining module is specifically configured to:
display the top N ranked candidate speakers, N being a natural number greater than 0;
if the number of candidate speakers is 1, use that candidate speaker as the target speaker;
if the number of candidate speakers is greater than 1, determine one candidate speaker from the candidate speaker list as the target speaker according to confirmation information input by a user; and if no confirmation information input by the user is received within a preset period of time, use the first-ranked candidate speaker as the target speaker.
In a possible design, the synthesis module is specifically configured to:
synthesize an initial voice from the text information using the voice of the target speaker;
receive adjustment information for the initial voice to obtain the adjusted target voice, where the adjustment information is used to adjust audio attributes of the initial voice, the audio attributes including volume, pitch, speaking rate, and background sound; and
output the target voice.
In a third aspect, an embodiment of the present invention provides a speech synthesis system, comprising a memory and a processor, the memory storing instructions executable by the processor, wherein the processor is configured to execute the executable instructions to perform the speech synthesis method according to any one of the first aspect.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the speech synthesis method according to any one of the first aspect.
In a fifth aspect, an embodiment of the present invention provides a program product comprising a computer program stored in a readable storage medium. At least one processor of a server can read the computer program from the readable storage medium, and the at least one processor executes the computer program so that the server performs the speech synthesis method according to any one of the first aspect.
The present invention provides a speech synthesis method, apparatus, system, and storage medium. By determining current scene information, obtaining all candidate speakers matching the current scene information, ranking the candidate speakers according to a preset rule to obtain a candidate speaker list, determining a target speaker according to the candidate speaker list, and converting text information into a target voice according to the voice of the target speaker, a speaker matching the scene is automatically selected according to the received text and scene attributes, so that the synthesized speech switches to the most suitable speaker for each scene. The finally synthesized voice is thus more natural, the speech synthesis effect is improved, and the user experience is good.
Detailed description of the invention
To describe the technical solutions in the embodiments of the present invention or the prior art more clearly, the accompanying drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show some embodiments of the present invention, and those of ordinary skill in the art may derive other drawings from them without creative effort.
Fig. 1 is a schematic diagram of an application scenario of the present invention;
Fig. 2 is a flowchart of the speech synthesis method provided by Embodiment 1 of the present invention;
Fig. 3 is a flowchart of the speech synthesis method provided by Embodiment 2 of the present invention;
Fig. 4 is a schematic structural diagram of the speech synthesis apparatus provided by Embodiment 3 of the present invention;
Fig. 5 is a schematic structural diagram of the speech synthesis apparatus provided by Embodiment 4 of the present invention;
Fig. 6 is a schematic structural diagram of the speech synthesis system provided by Embodiment 5 of the present invention.
The above drawings show specific embodiments of the present disclosure, which are described in more detail below. These drawings and written descriptions are not intended to limit the scope of the disclosed concept in any way, but rather to illustrate the concept of the disclosure to those skilled in the art by reference to specific embodiments.
Specific embodiment
To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some rather than all of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
The terms "first", "second", "third", "fourth", and the like (if any) in the specification, the claims, and the above drawings are used to distinguish similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments of the present invention described herein can be implemented in orders other than those illustrated or described herein. In addition, the terms "comprise" and "have" and any variants thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device comprising a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units not expressly listed or inherent to the process, method, product, or device.
The technical solutions of the present invention are described in detail below with specific embodiments. The following specific embodiments may be combined with each other, and the same or similar concepts or processes may not be repeated in some embodiments.
Existing speech synthesis converts text input by a user (or product) into speech using a speaker selected in advance, where the speaker's timbre style is the only selection criterion. Currently, the speech synthesis solutions available on the market are all based on the text and the speaker's timbre style and do not distinguish the scene in which the synthesis is used: under different scenes, the same speaker produces the same result, and the per-scene synthesis effect is poor. The present invention integrates scene information into the speech synthesis technique and recommends candidate speakers according to the current scene information, so that the voice synthesized in a specific scene is more natural and the user experience is improved.
Fig. 1 is a schematic diagram of an application scenario of the present invention. As shown in Fig. 1, based on the received text information and scene attributes 11, the speech synthesis apparatus 12 identifies and determines the current scene information, then recommends a suitable speaker according to the scene information, synthesizes an initial voice for the input text using the speaker's timbre, optionally adjusts the audio attributes of the initial voice, and outputs the target voice 13. When determining the scene information, semantic recognition may be performed on the input text to extract the scene information.
For example, if the input text is "Rushing for the subway, it's so crowded today, I may be late for work", it can be recognized that the current scene is a commute to work, the vehicle taken is the subway, and the background may be noisy, so determining the scene information as a subway scene is appropriate. When determining the scene, the current scene information may also be determined according to preset information, which includes current location information, time information, weather information, network information, and the like; one or any combination of these items may be used. For example, if the input text is "I've been looking everywhere and can't find the book you want to borrow" and the current location information indicates a college or university library, the background should be a quiet atmosphere, so determining the scene information as a reading scene is appropriate.
In a specific application, speech synthesis may be performed as follows: determine current scene information; obtain all candidate speakers matching the current scene information; rank the candidate speakers according to a preset rule to obtain a candidate speaker list; determine a target speaker according to the candidate speaker list; and convert text information into a target voice according to the voice of the target speaker.
With the above method, a speaker matching the scene can be selected automatically according to the received text and scene attributes, so that the synthesized speech switches to the most suitable speaker for each scene, the finally synthesized voice is more natural, the speech synthesis effect is improved, and the user experience is good.
How the technical solutions of the present invention and of the present application solve the above technical problems is described in detail below with specific embodiments. The following specific embodiments may be combined with each other, and the same or similar concepts or processes may not be repeated in some embodiments. The embodiments of the present invention are described below with reference to the accompanying drawings.
Fig. 2 is a flowchart of the speech synthesis method provided by Embodiment 1 of the present invention. As shown in Fig. 2, the method in this embodiment may include:
S101. Determine current scene information.
In this embodiment, scene information is obtained from the received text information and used as the current scene information; or the current scene information is determined according to preset information, the preset information including current location information, time information, weather information, network information, and the like. One or any combination of the preset information items may be used to determine the current scene information. The scene information includes a bedtime scene, a night scene, a noon-break scene, a reading scene, a subway scene, a bus scene, an airport scene, and the like.
Optionally, when determining the scene information, semantic recognition may be performed on the input text information to extract the scene information.
Specifically, for example, if the input text is "Rushing for the subway, it's so crowded today, I may be late for work", it can be recognized that the current scene is a commute to work, the vehicle taken is the subway, and the background may be noisy, so determining the scene information as a subway scene is appropriate.
Optionally, when determining the scene, the current scene information may also be determined according to current location information and/or time information. For example, if the input text is "I've been looking everywhere and can't find the book you want to borrow" and the current location information indicates a university library, the background should be a quiet atmosphere, so it is appropriate to determine the scene information as a reading scene. When determining the scene, the semantic information of the input text may also be combined with the current location information and/or time information. For example, if the input text is "I'm telling my daughter a story" and the current time is 9 p.m., it is difficult to determine a suitable scene from the text information or the time information alone, but combining the two makes it appropriate to determine the current scene information as a bedtime scene.
It should be noted that this embodiment does not limit the types of scene information; those skilled in the art may add or remove scene types according to actual needs.
S102. Obtain all candidate speakers matching the current scene information.
In this embodiment, all candidate speakers matching the current scene information are obtained from a database that pre-stores speaker voice packs and mapping relations between speakers and scene information.
Optionally, a speaker voice pack consists of multiple speakers of different timbre styles, and each speaker has two major attribute groups: base attributes and scene attributes. The base attributes include information such as timbre, style, gender, and age.
Specifically, for example, the speaker Guo Degang has base attributes corresponding to a mature, honest male voice, and the speaker Lin Zhiling has base attributes corresponding to a charming female voice. The scene attributes include the usage scenes in which the current speaker is suitable for synthesis and the corresponding scene attribute weight, a value from 0 to 100 indicating the degree to which the speaker is recommended in that scene; the larger the value, the stronger the recommendation. The main speech synthesis usage scenes include bedtime, night, afternoon, reading, subway, bus, airplane, high-speed rail, noon break, and the like.
In this embodiment, all candidate speakers matching the current scene information are obtained from the database that pre-stores speaker voice packs and mapping relations between speakers and scene information. For example, if the input text is "Big brother, come window-shopping with me tonight!", it can be determined that selecting Lin Zhiling's charming voice as the corresponding speaker is appropriate, whereas selecting Guo Degang's voice would break the context and sound less real.
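The voice-pack database described above can be sketched as a mapping from each speaker to base attributes and per-scene weights (0 to 100); all entries below are illustrative assumptions:

```python
# Sketch of the voice-pack database: base attributes plus scene
# attributes (scene -> recommendation weight). Candidates for a scene
# are the speakers whose scene attributes cover that scene.
VOICE_PACK = {
    "mature_male": {
        "base": {"gender": "male", "style": "warm"},
        "scenes": {"bedtime_scene": 70, "reading_scene": 60},
    },
    "bright_female": {
        "base": {"gender": "female", "style": "bright"},
        "scenes": {"subway_scene": 90, "bus_scene": 80},
    },
}

def candidates_for_scene(pack, scene):
    """Return names of speakers whose scene attributes cover the scene."""
    return [name for name, spk in pack.items() if scene in spk["scenes"]]
```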
S103. Rank the candidate speakers according to a preset rule to obtain a candidate speaker list.
In this embodiment, the scene attribute weight of each candidate speaker corresponding to the current scene information is obtained, where the scene attribute weight characterizes the matching degree between a speaker and a scene, and the candidate speakers are ranked according to the scene attribute weights to obtain the candidate speaker list.
Specifically, based on the user's usage scene obtained in S101 and the candidate speakers suitable for synthesis matched in S102, the candidate speakers are arranged in descending order of their scene attribute weights.
S104. Determine a target speaker according to the candidate speaker list.
In this embodiment, the top N ranked candidate speakers are displayed, N being a natural number greater than 0.
If the number of candidate speakers is 1, that candidate speaker is used as the target speaker. If the number of candidate speakers is greater than 1, one candidate speaker is determined from the candidate speaker list as the target speaker according to confirmation information input by the user; if no confirmation information input by the user is received within a preset period of time, the first-ranked candidate speaker is used as the target speaker.
Specifically, after the descending ranking by weight, the top-1 speaker is output as the default recommendation. At the same time, the user may also designate a speaker from the scene's candidate speaker list for synthesis.
S105. Convert the text information into a target voice according to the voice of the target speaker.
In this embodiment, an initial voice is synthesized from the text information using the voice of the target speaker; adjustment information for the initial voice is received to obtain the adjusted target voice, where the adjustment information is used to adjust audio attributes of the initial voice, the audio attributes including volume, pitch, speaking rate, and background sound; and the target voice is output.
Specifically, the text information is synthesized into the initial voice according to the timbre characteristics of the target speaker. Then the audio attributes may be adjusted automatically in combination with the scene information, or adjusted manually according to the user's input. For example, a commute scene on a bus or subway may be better suited to a brisk, bright voice with a relatively noisy background, while a bedtime scene may be better suited to a warm, soothing voice with a quieter background. A specific background sound may also be added according to the scene information; for a rainy-day scene, for instance, the sound of raindrops pattering on banana leaves or of a heavy downpour may be added to the background.
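The scene-aware automatic adjustment described above can be sketched as a per-scene preset table merged into the synthesized voice record; the table contents are illustrative assumptions:

```python
# Sketch of scene-aware post-processing: each scene carries a preset
# of audio attributes (speaking rate, volume, background sound) that
# is merged into the voice record before output.
SCENE_PRESETS = {
    "bedtime_scene": {"rate": 0.9, "volume": 40, "background": "quiet"},
    "subway_scene": {"rate": 1.1, "volume": 70, "background": "none"},
    "rainy_scene": {"rate": 1.0, "volume": 55, "background": "raindrops"},
}

def apply_scene_preset(voice, scene):
    """Merge the scene's preset attributes into the voice record."""
    preset = SCENE_PRESETS.get(scene, {})
    return {**voice, **preset}
```

A manual adjustment by the user would then be applied on top of the scene preset, overriding individual attributes.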
In this embodiment, by determining current scene information, obtaining all candidate speakers matching the current scene information, ranking the candidate speakers according to a preset rule to obtain a candidate speaker list, determining a target speaker according to the candidate speaker list, and converting text information into a target voice according to the voice of the target speaker, a speaker matching the scene is automatically selected according to the received text and scene attributes, so that the synthesized speech switches to the most suitable speaker for each scene, the finally synthesized voice is more natural, the speech synthesis effect is improved, and the user experience is good.
Fig. 3 is a flowchart of the speech synthesis method provided by Embodiment 2 of the present invention. As shown in Fig. 3, the method in this embodiment may include:
S201. Update the speaker voice packs and the mapping relations between speakers and scene information in the database.
In this embodiment, the number of speakers in a speaker voice pack can be dynamically increased or decreased, and the attributes and weights of a speaker can be adjusted according to the user's selections. For example, based on a recent trend, the voice of the speaker Wang Yuan may be added to the voice pack; or, based on the user's habits, the weight of the voice pack of the speaker Lin Zhiling may be increased so that it is recommended preferentially.
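The update step can be sketched as follows. The add, remove, and re-weight operations, with weights clamped to the 0-100 range given in Embodiment 1, are one plausible reading; the names and numbers are illustrative:

```python
# Sketch of the database update step: dynamically add or remove
# speakers, and adjust a speaker's per-scene recommendation weight.
def add_speaker(pack, name, scenes):
    """Add a speaker with its scene -> weight mapping to the voice pack."""
    pack[name] = {"scenes": dict(scenes)}

def remove_speaker(pack, name):
    """Remove a speaker from the voice pack if present."""
    pack.pop(name, None)

def boost_weight(pack, name, scene, delta):
    """Raise (or lower) a speaker's recommendation weight, clamped to 0-100."""
    scenes = pack[name]["scenes"]
    scenes[scene] = max(0, min(100, scenes.get(scene, 0) + delta))
```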
S202. Determine current scene information.
S203. Obtain all candidate speakers matching the current scene information.
S204. Rank the candidate speakers according to a preset rule to obtain a candidate speaker list.
S205. Determine a target speaker according to the candidate speaker list.
S206. Convert the text information into a target voice according to the voice of the target speaker.
In this embodiment, for the specific implementation process and technical principle of steps S202 to S206, refer to the related description of steps S101 to S105 in the method shown in Fig. 2, which is not repeated here.
In this embodiment, by determining current scene information, obtaining all candidate speakers matching the current scene information, ranking the candidate speakers according to a preset rule to obtain a candidate speaker list, determining a target speaker according to the candidate speaker list, and converting text information into a target voice according to the voice of the target speaker, a speaker matching the scene is automatically selected according to the received text and scene attributes, so that the synthesized speech switches to the most suitable speaker for each scene, the finally synthesized voice is more natural, the speech synthesis effect is improved, and the user experience is good.
In addition, this embodiment can update the speaker voice packs and the mapping relations between speakers and scene information in the database, improving the user experience. For example, the user may regularly update the speaker voice packs in the database or record a voice pack of his or her own voice.
Fig. 4 is a structural schematic diagram of the speech synthesis apparatus provided by Embodiment 3 of the present invention. As shown in Fig. 4, the speech synthesis apparatus of the present embodiment may include:
a first determining module 31, configured to determine current scene information;
an acquiring module 32, configured to acquire all candidate speakers consistent with the current scene information;
a sorting module 33, configured to sort the candidate speakers according to a preset rule to obtain a candidate speaker list;
a second determining module 34, configured to determine a target speaker according to the candidate speaker list;
and a synthesis module 35, configured to convert text information into target voice according to the sound of the target speaker.
In a possible design, the first determining module 31 is specifically configured to:
acquire scene information from the received text information, and use the acquired scene information as the current scene information; or
determine the current scene information according to preset information, the preset information including current location information, time information, weather information, network information, etc.; any one or any combination of the preset information may be selected to determine the current scene information;
wherein the scene information includes: a before-sleep scene, a night scene, a noon-break scene, a reading scene, a subway scene, a bus scene, and an airport scene.
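As an illustration of how the current scene information might be derived, the sketch below prefers scene information carried by the received text and falls back to preset time information. The time cut-offs and the keyword check are assumptions made for this example only:

```python
from datetime import time

def scene_from_time(now):
    """Map a time of day to a scene label (illustrative cut-offs)."""
    if time(22, 0) <= now or now < time(6, 0):
        return "before_sleep"
    if time(12, 0) <= now < time(14, 0):
        return "noon_break"
    return "daytime"

def current_scene(now, text=None):
    # Prefer scene information carried by the received text; otherwise fall
    # back to preset information such as the current time.
    if text and "subway" in text.lower():
        return "subway"
    return scene_from_time(now)

print(current_scene(time(23, 30)))                           # before_sleep
print(current_scene(time(9, 0), text="On the subway now"))   # subway
```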
In a possible design, the acquiring module 32 is specifically configured to:
acquire all candidate speakers meeting the current scene information from a database pre-storing speaker speech packets and the mapping relationship between speakers and scene information.
In a possible design, the sorting module 33 is specifically configured to:
acquire the scene attribute weight value of each candidate speaker corresponding to the current scene information, wherein the scene attribute weight value is used to characterize the matching degree between a speaker and a scene;
and sort the candidate speakers according to the scene attribute weight values to obtain the candidate speaker list.
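A minimal sketch of this weight-based sorting, assuming a hypothetical table of scene attribute weight values keyed by (speaker, scene); candidates missing from the table default to weight 0:

```python
# Hypothetical scene-attribute weight table: (speaker, scene) -> weight,
# characterizing how well each candidate speaker matches the current scene.
weights = {
    ("anna", "subway"): 0.4,
    ("ben", "subway"): 0.9,
    ("carol", "subway"): 0.7,
}

def candidate_list(candidates, scene):
    # Sort candidates by their weight for this scene, highest first;
    # speakers without an entry default to 0.
    return sorted(candidates,
                  key=lambda s: weights.get((s, scene), 0.0), reverse=True)

print(candidate_list(["anna", "ben", "carol"], "subway"))  # ['ben', 'carol', 'anna']
```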
In a possible design, the second determining module 34 is specifically configured to:
display the top N ranked candidate speakers, N being a natural number greater than 0;
if the number of candidate speakers is 1, take the candidate speaker as the target speaker;
if the number of candidate speakers is greater than 1, determine one candidate speaker from the candidate speaker list as the target speaker according to confirmation information input by the user; and if no confirmation information input by the user is received within a preset time period, take the first-ranked candidate speaker as the target speaker.
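This selection logic can be sketched as follows; `user_choice` is a hypothetical stand-in for the confirmation information, with `None` modelling the case where nothing is received within the preset time period:

```python
def choose_target(candidate_list, n, user_choice=None):
    """Pick the target speaker from the top-N candidates (illustrative)."""
    shown = candidate_list[:n]          # the candidates displayed to the user
    if len(shown) == 1:
        return shown[0]                 # single candidate: take it directly
    if user_choice in shown:
        return user_choice              # user confirmed one of the shown candidates
    return shown[0]                     # timeout: fall back to the first-ranked one

print(choose_target(["ben", "carol", "anna"], n=3, user_choice="carol"))  # carol
print(choose_target(["ben", "carol", "anna"], n=3, user_choice=None))     # ben
```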
In a possible design, the synthesis module 35 is specifically configured to:
synthesize the text information into an initial voice with the sound of the target speaker;
receive adjustment information for the initial voice to obtain the adjusted target voice, wherein the adjustment information is used to adjust audio attributes of the initial voice, and the audio attributes include volume, pitch, speech rate, and background sound;
and output the target voice.
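Applying adjustment information to the initial voice can be sketched with a hypothetical `Voice` record that carries the four audio attributes named above:

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Voice:
    text: str
    volume: float = 1.0       # the four audio attributes the text names:
    pitch: float = 1.0        # volume, pitch, speech rate, background sound
    rate: float = 1.0
    background: str = "none"

def adjust(initial, **changes):
    # Apply the user's adjustment information to the initial voice,
    # yielding the adjusted target voice; unspecified attributes are kept.
    return replace(initial, **changes)

initial = Voice(text="hello")
target = adjust(initial, volume=0.8, background="rain")
print(target.volume, target.background)  # 0.8 rain
```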
The speech synthesis apparatus of the present embodiment can execute the technical solution of the method shown in Fig. 2; its specific implementation process and technical principle are described in the related description of the method shown in Fig. 2, and details are not repeated here.
In the present embodiment, current scene information is determined; all candidate speakers consistent with the current scene information are acquired; the candidate speakers are sorted according to a preset rule to obtain a candidate speaker list; a target speaker is determined according to the candidate speaker list; and text information is converted into target voice according to the sound of the target speaker. A speaker consistent with the scene is thereby selected automatically according to the received text and scene attributes, so that the synthesized voice can switch to the most suitable speaker for different scenes. The finally synthesized voice is therefore more realistic, the speech synthesis effect is improved, and the user experience is good.
Fig. 5 is a structural schematic diagram of the speech synthesis apparatus provided by Embodiment 4 of the present invention. As shown in Fig. 5, on the basis of the apparatus shown in Fig. 4, the speech synthesis apparatus of the present embodiment may further include:
an updating module 36, configured to update the speaker speech packets in the database and the mapping relationship between speakers and scene information.
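What such an updating module might do can be sketched against a hypothetical in-memory stand-in for the database of speaker speech packets and the speaker-to-scene mapping:

```python
# Hypothetical in-memory stand-in for the database: speaker speech packets
# plus the speaker -> scene-information mapping to be kept fresh.
database = {
    "packets": {"anna": b"\x00"},
    "scene_map": {"anna": {"reading"}},
}

def update(db, speaker, packet=None, scenes=None):
    # Add a new speaker, replace a recorded voice packet (e.g. the user's
    # own recording), or remap the speaker's scenes.
    if packet is not None:
        db["packets"][speaker] = packet
    if scenes is not None:
        db["scene_map"][speaker] = set(scenes)

update(database, "user", packet=b"\x01", scenes=["night"])
print(sorted(database["packets"]))    # ['anna', 'user']
print(database["scene_map"]["user"])  # {'night'}
```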
The speech synthesis apparatus of the present embodiment can execute the technical solutions of the methods shown in Fig. 2 and Fig. 3; for the specific implementation process and technical principle, refer to the related description of the methods shown in Fig. 2 and Fig. 3, and details are not repeated here.
In the present embodiment, current scene information is determined; all candidate speakers consistent with the current scene information are acquired; the candidate speakers are sorted according to a preset rule to obtain a candidate speaker list; a target speaker is determined according to the candidate speaker list; and text information is converted into target voice according to the sound of the target speaker. A speaker consistent with the scene is thereby selected automatically according to the received text and scene attributes, so that the synthesized voice can switch to the most suitable speaker for different scenes. The finally synthesized voice is therefore more realistic, the speech synthesis effect is improved, and the user experience is good.
In addition, the present embodiment can update the speaker speech packets in the database and the mapping relationship between speakers and scene information, improving the user experience.
Fig. 6 is a structural schematic diagram of the speech synthesis system provided by Embodiment 5 of the present invention. As shown in Fig. 6, the speech synthesis system 40 of the present embodiment may include: a processor 41 and a memory 42.
The memory 42 is configured to store a program. The memory 42 may include a volatile memory, for example a random-access memory (RAM) such as a static random-access memory (SRAM) or a double data rate synchronous dynamic random-access memory (DDR SDRAM); the memory may also include a non-volatile memory, such as a flash memory. The memory 42 is used to store computer programs (for example, application programs and functional modules implementing the above method), computer instructions, etc.; the above computer programs, computer instructions, data, etc. may be stored in partitions in one or more memories 42 and may be called by the processor 41.
The processor 41 is configured to execute the computer program stored in the memory 42 to implement each step of the method in the above embodiments.
For details, refer to the related description in the foregoing method embodiments.
The processor 41 and the memory 42 may be independent structures, or may be integrated into one structure. When the processor 41 and the memory 42 are independent structures, the memory 42 and the processor 41 may be coupled through a bus 43.
In the present embodiment, current scene information is determined; all candidate speakers consistent with the current scene information are acquired; the candidate speakers are sorted according to a preset rule to obtain a candidate speaker list; a target speaker is determined according to the candidate speaker list; and text information is converted into target voice according to the sound of the target speaker. A speaker consistent with the scene is thereby selected automatically according to the received text and scene attributes, so that the synthesized voice can switch to the most suitable speaker for different scenes. The finally synthesized voice is therefore more realistic, the speech synthesis effect is improved, and the user experience is good.
The server of the present embodiment can execute the technical solutions of the methods shown in Fig. 2 and Fig. 3; for the specific implementation process and technical principle, refer to the related description of the methods shown in Fig. 2 and Fig. 3, and details are not repeated here.
In addition, an embodiment of the present application further provides a computer-readable storage medium storing computer-executable instructions; when at least one processor of a user equipment executes the computer-executable instructions, the user equipment executes the above various possible methods.
In the present embodiment, current scene information is determined; all candidate speakers consistent with the current scene information are acquired; the candidate speakers are sorted according to a preset rule to obtain a candidate speaker list; a target speaker is determined according to the candidate speaker list; and text information is converted into target voice according to the sound of the target speaker. A speaker consistent with the scene is thereby selected automatically according to the received text and scene attributes, so that the synthesized voice can switch to the most suitable speaker for different scenes. The finally synthesized voice is therefore more realistic, the speech synthesis effect is improved, and the user experience is good.
A computer-readable medium includes computer storage media and communication media, wherein a communication medium includes any medium that facilitates the transfer of a computer program from one place to another. A storage medium may be any available medium that can be accessed by a general-purpose or special-purpose computer. An exemplary storage medium is coupled to the processor so that the processor can read information from, and write information to, the storage medium; of course, the storage medium may also be an integral part of the processor. The processor and the storage medium may be located in an ASIC, and the ASIC may be located in the user equipment; of course, the processor and the storage medium may also exist in the communication device as discrete components.
The present application further provides a program product. The program product includes a computer program stored in a readable storage medium; at least one processor of a server can read the computer program from the readable storage medium, and the at least one processor executes the computer program so that the server implements the speech synthesis method of any of the embodiments of the present invention.
Those of ordinary skill in the art will appreciate that all or part of the steps of the above method embodiments may be completed by hardware related to program instructions. The aforementioned program may be stored in a computer-readable storage medium; when the program is executed, the steps of the above method embodiments are executed; and the aforementioned storage medium includes various media that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disc.
Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments, or equivalently replace some or all of the technical features; and these modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A speech synthesis method, characterized by comprising:
determining current scene information;
acquiring all candidate speakers consistent with the current scene information;
sorting the candidate speakers according to a preset rule to obtain a candidate speaker list;
determining a target speaker according to the candidate speaker list;
and converting text information into target voice according to the sound of the target speaker.
2. The method according to claim 1, wherein the determining current scene information comprises:
acquiring scene information from received text information, and using the acquired scene information as the current scene information; or
determining the current scene information according to preset information, the preset information comprising: current location information, time information, weather information, and network information;
wherein the scene information comprises: a before-sleep scene, a night scene, a noon-break scene, a reading scene, a subway scene, a bus scene, and an airport scene.
3. The method according to claim 1, wherein acquiring all candidate speakers consistent with the current scene information comprises:
acquiring all candidate speakers meeting the current scene information from a database pre-storing speaker speech packets and a mapping relationship between speakers and scene information.
4. The method according to claim 3, further comprising:
updating the speaker speech packets and the mapping relationship between speakers and scene information in the database.
5. The method according to claim 1, wherein sorting the candidate speakers according to a preset rule to obtain a candidate speaker list comprises:
acquiring a scene attribute weight value of each candidate speaker corresponding to the current scene information, wherein the scene attribute weight value is used to characterize a matching degree between a speaker and a scene;
and sorting the candidate speakers according to the scene attribute weight values to obtain the candidate speaker list.
6. The method according to claim 1, wherein determining a target speaker according to the candidate speaker list comprises:
displaying the top N ranked candidate speakers, N being a natural number greater than 0;
if the number of candidate speakers is 1, taking the candidate speaker as the target speaker;
if the number of candidate speakers is greater than 1, determining one candidate speaker from the candidate speaker list as the target speaker according to confirmation information input by a user; and if no confirmation information input by the user is received within a preset time period, taking the first-ranked candidate speaker as the target speaker.
7. The method according to any one of claims 1 to 6, wherein converting text information into target voice according to the sound of the target speaker comprises:
synthesizing the text information into an initial voice with the sound of the target speaker;
receiving adjustment information for the initial voice to obtain the adjusted target voice, wherein the adjustment information is used to adjust audio attributes of the initial voice, the audio attributes comprising: volume, pitch, speech rate, and background sound;
and outputting the target voice.
8. A speech synthesis apparatus, characterized by comprising:
a first determining module, configured to determine current scene information;
an acquiring module, configured to acquire all candidate speakers consistent with the current scene information;
a sorting module, configured to sort the candidate speakers according to a preset rule to obtain a candidate speaker list;
a second determining module, configured to determine a target speaker according to the candidate speaker list;
and a synthesis module, configured to convert text information into target voice according to the sound of the target speaker.
9. A speech synthesis system, characterized by comprising: a memory and a processor, the memory storing instructions executable by the processor; wherein the processor is configured to execute the speech synthesis method according to any one of claims 1 to 7 via executing the executable instructions.
10. A computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the speech synthesis method according to any one of claims 1 to 7.
CN201811648146.1A 2018-12-29 2018-12-29 Phoneme synthesizing method, device, system and storage medium Pending CN109616094A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811648146.1A CN109616094A (en) 2018-12-29 2018-12-29 Phoneme synthesizing method, device, system and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811648146.1A CN109616094A (en) 2018-12-29 2018-12-29 Phoneme synthesizing method, device, system and storage medium

Publications (1)

Publication Number Publication Date
CN109616094A true CN109616094A (en) 2019-04-12

Family

ID=66017285

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811648146.1A Pending CN109616094A (en) 2018-12-29 2018-12-29 Phoneme synthesizing method, device, system and storage medium

Country Status (1)

Country Link
CN (1) CN109616094A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110264992A (en) * 2019-06-11 2019-09-20 百度在线网络技术(北京)有限公司 Speech synthesis processing method, device, equipment and storage medium
CN110534131A (en) * 2019-08-30 2019-12-03 广州华多网络科技有限公司 A kind of audio frequency playing method and system
CN111161737A (en) * 2019-12-23 2020-05-15 北京欧珀通信有限公司 Data processing method and device, electronic equipment and storage medium
CN111415650A (en) * 2020-03-25 2020-07-14 广州酷狗计算机科技有限公司 Text-to-speech method, device, equipment and storage medium
CN111833857A (en) * 2019-04-16 2020-10-27 阿里巴巴集团控股有限公司 Voice processing method and device and distributed system
CN112185338A (en) * 2020-09-30 2021-01-05 北京大米科技有限公司 Audio processing method and device, readable storage medium and electronic equipment
CN112181348A (en) * 2020-08-28 2021-01-05 星络智能科技有限公司 Sound style switching method, system, computer equipment and readable storage medium

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104750839A (en) * 2015-04-03 2015-07-01 魅族科技(中国)有限公司 Data recommendation method, terminal and server
CN105047193A (en) * 2015-08-27 2015-11-11 百度在线网络技术(北京)有限公司 Voice broadcasting method and apparatus
CN105302908A (en) * 2015-11-02 2016-02-03 北京奇虎科技有限公司 E-book related audio resource recommendation method and apparatus
CN105550316A (en) * 2015-12-14 2016-05-04 广州酷狗计算机科技有限公司 Pushing method and device of audio list
CN105609096A (en) * 2015-12-30 2016-05-25 小米科技有限责任公司 Text data output method and device
CN106610968A (en) * 2015-10-21 2017-05-03 广州酷狗计算机科技有限公司 Song menu list determination method and apparatus, and electronic device
CN106649644A (en) * 2016-12-08 2017-05-10 腾讯音乐娱乐(深圳)有限公司 Lyric file generation method and device
CN107731219A (en) * 2017-09-06 2018-02-23 百度在线网络技术(北京)有限公司 Phonetic synthesis processing method, device and equipment
US9972301B2 (en) * 2016-10-18 2018-05-15 Mastercard International Incorporated Systems and methods for correcting text-to-speech pronunciation
CN108536655A (en) * 2017-12-21 2018-09-14 广州市讯飞樽鸿信息技术有限公司 Audio production method and system are read aloud in a kind of displaying based on hand-held intelligent terminal

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104750839A (en) * 2015-04-03 2015-07-01 魅族科技(中国)有限公司 Data recommendation method, terminal and server
CN105047193A (en) * 2015-08-27 2015-11-11 百度在线网络技术(北京)有限公司 Voice broadcasting method and apparatus
CN106610968A (en) * 2015-10-21 2017-05-03 广州酷狗计算机科技有限公司 Song menu list determination method and apparatus, and electronic device
CN105302908A (en) * 2015-11-02 2016-02-03 北京奇虎科技有限公司 E-book related audio resource recommendation method and apparatus
CN105550316A (en) * 2015-12-14 2016-05-04 广州酷狗计算机科技有限公司 Pushing method and device of audio list
CN105609096A (en) * 2015-12-30 2016-05-25 小米科技有限责任公司 Text data output method and device
US9972301B2 (en) * 2016-10-18 2018-05-15 Mastercard International Incorporated Systems and methods for correcting text-to-speech pronunciation
CN106649644A (en) * 2016-12-08 2017-05-10 腾讯音乐娱乐(深圳)有限公司 Lyric file generation method and device
CN107731219A (en) * 2017-09-06 2018-02-23 百度在线网络技术(北京)有限公司 Phonetic synthesis processing method, device and equipment
CN108536655A (en) * 2017-12-21 2018-09-14 广州市讯飞樽鸿信息技术有限公司 Audio production method and system are read aloud in a kind of displaying based on hand-held intelligent terminal

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111833857A (en) * 2019-04-16 2020-10-27 阿里巴巴集团控股有限公司 Voice processing method and device and distributed system
CN111833857B (en) * 2019-04-16 2024-05-24 斑马智行网络(香港)有限公司 Voice processing method, device and distributed system
CN110264992A (en) * 2019-06-11 2019-09-20 百度在线网络技术(北京)有限公司 Speech synthesis processing method, device, equipment and storage medium
CN110534131A (en) * 2019-08-30 2019-12-03 广州华多网络科技有限公司 A kind of audio frequency playing method and system
CN111161737A (en) * 2019-12-23 2020-05-15 北京欧珀通信有限公司 Data processing method and device, electronic equipment and storage medium
CN111415650A (en) * 2020-03-25 2020-07-14 广州酷狗计算机科技有限公司 Text-to-speech method, device, equipment and storage medium
CN112181348A (en) * 2020-08-28 2021-01-05 星络智能科技有限公司 Sound style switching method, system, computer equipment and readable storage medium
CN112185338A (en) * 2020-09-30 2021-01-05 北京大米科技有限公司 Audio processing method and device, readable storage medium and electronic equipment
CN112185338B (en) * 2020-09-30 2024-01-23 北京大米科技有限公司 Audio processing method, device, readable storage medium and electronic equipment

Similar Documents

Publication Publication Date Title
CN109616094A (en) Phoneme synthesizing method, device, system and storage medium
US11922924B2 (en) Multilingual neural text-to-speech synthesis
JP5768093B2 (en) Speech processing system
CN105845125B (en) Phoneme synthesizing method and speech synthetic device
US10140972B2 (en) Text to speech processing system and method, and an acoustic model training system and method
JP6246777B2 (en) Speech synthesis method, apparatus and program
US9472190B2 (en) Method and system for automatic speech recognition
US9269347B2 (en) Text to speech system
US10720157B1 (en) Voice to voice natural language understanding processing
WO2018049979A1 (en) Animation synthesis method and device
CN113892135A (en) Multi-lingual speech synthesis and cross-lingual voice cloning
CN104021784B (en) Phoneme synthesizing method and device based on Big-corpus
US20140210830A1 (en) Computer generated head
CN109523986A (en) Phoneme synthesizing method, device, equipment and storage medium
CN105609097A (en) Speech synthesis apparatus and control method thereof
CN107239547B (en) Voice error correction method, terminal and storage medium for ordering song by voice
US11289082B1 (en) Speech processing output personalization
CN108831437A (en) A kind of song generation method, device, terminal and storage medium
CN101901598A (en) Humming synthesis method and system
CN111223474A (en) Voice cloning method and system based on multi-neural network
GB2510201A (en) Animating a computer generated head based on information to be output by the head
Wilson Conflicting language ideologies in choral singing in Trinidad
Liu et al. Non-parallel voice conversion with autoregressive conversion model and duration adjustment
Luong et al. Laughnet: synthesizing laughter utterances from waveform silhouettes and a single laughter example
Secujski et al. Speaker/Style-Dependent Neural Network Speech Synthesis Based on Speaker/Style Embedding.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination