CN109036373A - Speech processing method and electronic device - Google Patents

Speech processing method and electronic device Download PDF

Info

Publication number
CN109036373A
CN109036373A (application CN201810857848.4A)
Authority
CN
China
Prior art keywords
voice
broadcast
audio
speech samples
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810857848.4A
Other languages
Chinese (zh)
Inventor
王丽娜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Microlive Vision Technology Co Ltd
Original Assignee
Beijing Microlive Vision Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Microlive Vision Technology Co Ltd filed Critical Beijing Microlive Vision Technology Co Ltd
Priority to CN201810857848.4A priority Critical patent/CN109036373A/en
Publication of CN109036373A publication Critical patent/CN109036373A/en
Pending legal-status Critical Current

Links

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/033 Voice editing, e.g. manipulating the voice of the synthesiser
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/1822 Parsing for meaning understanding
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/183 Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/183 Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L15/19 Grammatical context, e.g. disambiguation of the recognition hypotheses based on word sequence rules
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L2021/02087 Noise filtering the noise being separate speech, e.g. cocktail party

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

This application discloses a speech processing method and an electronic device. The method comprises: performing semantic recognition on the content of acquired information to be broadcast; generating, according to the recognition result, background audio and voice information corresponding to the information to be broadcast; and synthesizing the background audio and the voice information to generate broadcast audio corresponding to the information to be broadcast. The speech processing method can combine background audio with voice information, so that the background audio is also played while the voice information is broadcast, enhancing the user experience.

Description

Speech processing method and electronic device
Technical field
This application relates to the field of data processing, and in particular to a speech processing method and an electronic device.
Background technique
With the continuing development of information technology, more and more users read information on electronic devices, and sometimes use those devices to perform a voice broadcast while reading, to meet related needs. At present, however, the mechanical voice produced during broadcasting sounds very stiff: it lacks speech features such as an appropriate tone and rhythm, and the broadcast content is monotonous because no background sound matching the content is played, so the user experience is not vivid.
Summary of the invention
The embodiments of the present application aim to provide a speech processing method and an electronic device. The method can synthesize background audio with voice information, so that background audio is also played while the voice information is being broadcast.
To solve the above technical problem, the embodiments of the present application adopt the following technical solution: a speech processing method, comprising:
performing semantic recognition on the content of acquired information to be broadcast;
generating, according to the recognition result, background audio and voice information corresponding to the information to be broadcast;
synthesizing the background audio and the voice information to generate broadcast audio corresponding to the information to be broadcast.
Preferably, after generating the broadcast audio corresponding to the information to be broadcast, the method further comprises:
playing the broadcast audio based on a selected voice bank, wherein the selected voice bank is one voice bank in a preset sound database.
Preferably, the method further comprises:
receiving speech samples, preprocessing the speech samples to establish a custom voice bank, and adding the voice bank to the preset sound database.
Preferably, preprocessing the speech samples to establish a custom voice bank comprises:
removing individual pronunciation differences and/or background noise from the speech samples, and removing redundant information from the speech samples to retain key information;
determining the voice units in the speech samples;
identifying the voice units according to the grammar corresponding to the key information, to obtain the language pattern of the speech samples;
constructing the custom voice bank based on the language pattern.
Preferably, performing semantic recognition on the content of the acquired information to be broadcast comprises:
identifying context-relation data corresponding to the content of the information to be broadcast, and determining the context of the information to be broadcast according to the context-relation data.
Preferably, synthesizing the background audio and the voice information to generate broadcast audio corresponding to the information to be broadcast comprises:
segmenting the voice information and the background audio according to the context of the information to be broadcast;
synthesizing voice information and background audio that share the same context.
The embodiments of the present application also provide an electronic device, comprising a recognition module, a processing module and a synthesis module, wherein:
the recognition module is configured to perform semantic recognition on the content of acquired information to be broadcast;
the processing module is configured to generate, according to the recognition result, background audio and voice information corresponding to the information to be broadcast;
the synthesis module is configured to synthesize the background audio and the voice information to generate broadcast audio corresponding to the information to be broadcast.
Preferably, the electronic device further comprises a playback module configured to play the broadcast audio based on a selected voice bank, wherein the selected voice bank is one voice bank in a preset sound database.
Preferably, the electronic device further comprises a construction module configured to:
receive speech samples, preprocess the speech samples to establish a custom voice bank, and add the voice bank to the preset sound database.
Preferably, the construction module is further configured to:
remove individual pronunciation differences and/or background noise from the speech samples, and remove redundant information from the speech samples to retain key information;
determine the voice units in the speech samples;
identify the voice units according to the grammar corresponding to the key information, to obtain the language pattern of the speech samples;
construct the custom voice bank based on the language pattern.
The beneficial effect of the embodiments of the present application is that the method can synthesize background audio with voice information, so that background audio is also played while the voice information is being broadcast, enhancing the user experience.
Detailed description of the invention
Fig. 1 is a flowchart of the speech processing method of an embodiment of the present application;
Fig. 2 is a flowchart of a specific embodiment of the speech processing method of an embodiment of the present application;
Fig. 3 is a structural schematic diagram of the electronic device of an embodiment of the present application.
Specific embodiment
Various solutions and features of the application are described herein with reference to the accompanying drawings.
It should be understood that various modifications can be made to the embodiments described herein. The above description should therefore not be regarded as limiting, but merely as examples of embodiments. Those skilled in the art will envisage other modifications within the scope and spirit of the present application.
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the application and, together with the general description given above and the detailed description of the embodiments given below, serve to explain the principles of the application.
These and other characteristics of the application will become apparent from the following description of preferred forms of embodiment, given as non-limiting examples, with reference to the accompanying drawings.
It is also to be understood that, although the application is described with reference to some specific examples, those skilled in the art can certainly realize many other equivalents of the application, which have the features set forth in the claims and therefore all fall within the scope of protection defined thereby.
The above and other aspects, features and advantages of the application will become more readily apparent in view of the following detailed description when read in conjunction with the accompanying drawings.
Specific embodiments of the application are described hereinafter with reference to the accompanying drawings; it is to be understood, however, that the disclosed embodiments are merely examples of the application, which may be embodied in various forms. Well-known and/or repeated functions and structures are not described in detail, to avoid obscuring the application with unnecessary or superfluous detail. Therefore, the specific structural and functional details disclosed herein are not intended to be limiting, but merely serve as a basis for the claims and as a representative basis for teaching those skilled in the art to variously employ the application in virtually any appropriately detailed structure.
This specification may use the phrases "in one embodiment", "in another embodiment", "in yet another embodiment" or "in other embodiments", each of which may refer to one or more of the same or different embodiments in accordance with the application.
The speech processing method of an embodiment of the present application can recognize and process text or audio and perform a corresponding voice broadcast. Fig. 1 is a flowchart of the speech processing method of an embodiment of the present application. As shown in Fig. 1, the method comprises the following steps:
S1: perform semantic recognition on the content of the acquired information to be broadcast. The information to be broadcast includes multiple types of information, such as text information, audio information or pre-stored information, and its content has a corresponding semantics. Semantics is the meaning of data, that is, the meaning of the real-world concepts represented by the things the data corresponds to, and the relationships between those meanings; it is the interpretation and logical representation of the data in some domain. In one embodiment, a pre-stored knowledge base can be used during semantic recognition: corresponding domain data can be obtained from the knowledge base according to the recognized information, and semantic recognition of the content of the information to be broadcast is further carried out in combination with that domain data, so that the recognition result is closer to the actual situation.
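The knowledge-base-assisted semantic recognition of step S1 can be illustrated with a deliberately small sketch. The categories, keywords and scoring rule below are invented for the example (the application does not specify how the knowledge base is organized); a real implementation would use far richer domain data than keyword sets.

```python
# Hypothetical sketch: classify the scene of text to be broadcast by
# matching its content against a small, hand-built knowledge base.
# Categories and keywords are illustrative, not from the patent.

KNOWLEDGE_BASE = {
    "lyric_prose": {"moonlight", "gentle", "longing", "river"},
    "horror": {"ghost", "scream", "darkness", "blood"},
    "party": {"cocktail", "toast", "music", "laughter"},
}

def recognize_scene(text: str) -> str:
    """Return the knowledge-base category with the most keyword hits."""
    words = set(text.lower().split())
    scores = {scene: len(words & keys) for scene, keys in KNOWLEDGE_BASE.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"

print(recognize_scene("A scream echoed through the darkness"))  # horror
```

The "unknown" fallback stands in for the case where the knowledge base contributes no usable domain data.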
S2: generate, according to the recognition result, background audio and voice information corresponding to the information to be broadcast. After semantic recognition of the content of the information to be broadcast, its corresponding usage scenario can be known. For example, the information to be broadcast may be information related to a historical event, information related to a literary work (such as a contemporary lyric prose or a horror story), information related to a real-life scene (such as a cocktail party or a sports meeting), or information mixing one or more kinds of scenes. In this embodiment, background audio corresponding to the information to be broadcast can be generated according to the recognition result; the background audio can be adapted to the entire content of the information to be broadcast, or to local key content of it. For example, when the information to be broadcast is a lyric prose, a soothing first background sound corresponding to the lyric prose can be generated; when the information to be broadcast is a horror story, a second background sound corresponding to the horror story can be generated, with heavier scoring at the thrilling points of the story, thereby increasing the terrifying effect when the horror story is broadcast. In addition, in this embodiment, voice information corresponding to the information to be broadcast also needs to be generated according to the recognition result. Since the semantics of the content of the information to be broadcast is already known, the generated voice information can come closer to the content originally meant to be expressed. For example, when the information to be broadcast is a lyric prose, the corresponding voice information can better conform to requirements such as the tone, intonation, rhythm and stress of the lyric prose, and thus come closer to a real human voice.
S3: synthesize the background audio and the voice information to generate broadcast audio corresponding to the information to be broadcast. In one embodiment, the synthesis matches the entire content of the background audio to the entire content of the voice information, so that in the generated broadcast audio the user hears background audio corresponding to the spoken content while hearing the voice broadcast. This enriches the content of the broadcast audio, meets user needs, enriches the audio content and improves the user experience. In another embodiment, the synthesis aligns at least one key audio point in the background audio with the corresponding key voice point in the voice information; that is, in the generated broadcast audio, the key audio point is played at the same time as the key voice point, so that the content at the key voice point can be expressed more emphatically, enriching the expressive effect.
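As a rough sketch of the synthesis in step S3, the voice track and the background track can be overlaid sample by sample, with the background attenuated so the voice remains intelligible. Plain Python lists stand in for real PCM buffers, and the 0.3 background gain is an assumed value, not one given in the application.

```python
# Minimal sketch of the synthesis step: mix a synthesized-voice waveform
# with a background-audio waveform sample by sample, attenuating the
# background so the voice stays in the foreground. The shorter track is
# zero-padded to the length of the longer one.

def synthesize_broadcast(voice, background, bg_gain=0.3):
    """Overlay background audio under the voice track."""
    length = max(len(voice), len(background))
    mixed = []
    for i in range(length):
        v = voice[i] if i < len(voice) else 0.0
        b = background[i] if i < len(background) else 0.0
        mixed.append(v + bg_gain * b)
    return mixed

voice = [0.5, -0.5, 0.25, 0.0]
background = [1.0, 1.0, 1.0, 1.0, 1.0]
print(synthesize_broadcast(voice, background))
```

The key-point alignment of the second embodiment could reuse the same overlay, applied only over the matched key spans instead of the whole track.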
In one embodiment of the application, the following step is included after the step of generating broadcast audio corresponding to the information to be broadcast:
S4: play the broadcast audio based on a selected voice bank, wherein the selected voice bank is one voice bank in a preset sound database. The preset sound database contains at least one voice bank, and different voice banks have different sound characteristics. For example, the preset sound database may include: a male voice bank, which expresses the sound using a standard male voice (e.g. a male voice meeting preset rules); a female voice bank, which expresses the sound using a standard female voice (e.g. a female voice meeting preset rules); a mixed male/female voice bank, which expresses the sound using male and female voices simultaneously or alternately according to the specific content or requirements of the broadcast audio; and a dialect bank, which expresses the sound using a preset dialect. There may additionally be multiple other types of voice banks, such as a fast-speech bank, a slow-speech bank and a child-voice bank. These one or more voice banks can be established according to user needs and stored, so that they can be called up quickly when used.
When the broadcast audio is played, the voice information is played in speech based on the selected voice bank. Specifically, if the user selects the male voice bank, a male voice can be used to play the voice information in the broadcast audio, that is, to speak the content of the voice information (for example, reading a lyric prose aloud), while the background audio in the broadcast audio is also played at the same time. Of course, the voice bank can also be selected according to the specific voice information to be played; for example, for a joke, the dialect bank can be chosen to play the broadcast audio, improving the expressive effect of the sound.
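The preset sound database with selectable voice banks can be modelled, in a very reduced form, as a lookup table with a fallback. The bank names and attributes here are invented for illustration; real banks would hold acoustic models or recorded speech units rather than metadata strings.

```python
# Illustrative sketch of a preset sound database and the selection step
# from S4. Bank names and attributes are assumptions for the example.

VOICE_BANKS = {
    "male_standard":   {"gender": "male",   "rate": "normal"},
    "female_standard": {"gender": "female", "rate": "normal"},
    "child":           {"gender": "child",  "rate": "normal"},
    "fast_speech":     {"gender": "any",    "rate": "fast"},
}

def select_bank(preferred: str, fallback: str = "female_standard") -> str:
    """Return the requested bank if present, else a default one."""
    return preferred if preferred in VOICE_BANKS else fallback

print(select_bank("male_standard"))  # male_standard
print(select_bank("no_such_bank"))   # female_standard
```

A custom bank built from user speech samples (step S5 below) would simply be another entry added to this table.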
In one embodiment of the application, the method further comprises the following step: S5, receive speech samples, preprocess the speech samples to establish a custom voice bank, and add the voice bank to the preset sound database. Each user has his or her own preferences: some like broadcasts in a sweet female voice, some like broadcasts in a male voice with a strong timbre, some like broadcast audio spoken in a child's voice, and others like broadcasts in the voice of a specific person (such as their own relatives and friends). Establishing a custom voice bank is the basis for realizing this purpose. The received speech samples are the samples the user prefers to use; that is, the user wishes the broadcast audio to be played in a voice identical to the speech samples, and the custom voice bank can be established based on those samples.
Preferably, as shown in Fig. 2, which is a flowchart of a specific embodiment of the speech processing method of an embodiment of the present application, establishing the custom voice bank further comprises the following steps:
S51: remove individual pronunciation differences and/or background noise from the speech samples, and remove redundant information from the speech samples to retain key information. The preprocessing that removes individual pronunciation differences and/or background noise makes the speech samples clearer and better defined and the semantic expression more accurate, laying the foundation for further processing of the samples. Spoken language may contain repeated or useless information, such as pet phrases, interjections or words repeated many times; in this embodiment, such redundant information needs to be removed and the key information retained, so that the key information can better improve subsequent recognition accuracy.
S52: determine the voice units in the speech samples. Different spoken and written languages have different voice units: in English, each word can serve as a voice unit, while in Chinese, each independent character or independent word can serve as a voice unit. When determining the voice units in the speech samples, they can be determined according to the actual situation of the samples, for example according to the language of the speech samples (whether Chinese or English) or the language habits in the speech samples.
S53: identify the voice units according to the grammar corresponding to the key information, to obtain the language pattern of the speech samples. Different speech samples correspond to different grammars; for example, Chinese has Chinese grammar and English has English grammar. Since the key information is the more regular and expressive information that remains after hard-to-recognize information has been eliminated, identifying the voice units according to the grammar corresponding to the key information enables accurate recognition of the voice units and thereby yields the language pattern of the speech samples. Different speech samples have their own language patterns, and the pattern contains the various features of the samples; based on those features, other voice information can then be expressed in the same language expression form.
S54: construct the custom voice bank based on the language pattern. The language pattern corresponds to the speech samples provided by the user, so a custom voice bank constructed on the basis of the language pattern can embody the same features as the speech samples, such as the same sound characteristics and language habits, enabling the custom voice bank to better meet the needs of the user.
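Steps S51 through S54 can be caricatured as a text-level pipeline under strong simplifying assumptions: "noise" is a set of filler words to strip, voice units are whitespace tokens, and the "language pattern" is simply the cleaned token sequence. This only sketches the shape of the four steps; a real system would operate on audio signals and genuine grammatical analysis.

```python
# Toy walk-through of S51-S54 for building a custom voice bank.
# FILLERS and the token-based "units" are illustrative assumptions.

FILLERS = {"um", "uh", "like"}  # stands in for redundant information

def build_custom_bank(sample_text: str) -> dict:
    tokens = sample_text.lower().split()                   # S52: voice units
    key_tokens = [t for t in tokens if t not in FILLERS]   # S51: keep key info
    language_pattern = tuple(key_tokens)                   # S53: the pattern
    return {"units": key_tokens, "pattern": language_pattern}  # S54: the bank

bank = build_custom_bank("um hello uh world")
print(bank["units"])  # ['hello', 'world']
```

In practice S51 would also include denoising the waveform itself, which has no analogue in this text-only sketch.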
In one embodiment of the application, the step of performing semantic recognition on the content of the acquired information to be broadcast comprises: identifying context-relation data corresponding to the content of the information to be broadcast, and determining the context of the information to be broadcast according to the context-relation data. Specifically, the context of the information to be broadcast can be known more accurately from its entire content, so as to avoid deviations between results obtained only from local content and the actual information to be broadcast. In this embodiment, considering the entire content, the context of the information to be broadcast can be determined from the context relations of its content; for example, the relations between preceding and following sentences, or between preceding and following paragraphs, are identified so as to determine the context of the information to be broadcast more accurately.
In one embodiment of the application, the step of synthesizing the background audio and the voice information to generate broadcast audio corresponding to the information to be broadcast comprises the following steps: segmenting the voice information and the background audio according to the context of the information to be broadcast; and synthesizing voice information and background audio that share the same context. Specifically, the voice information can be divided into multiple parts according to differences in context, such as a low-ebb part, a climax part and a turning part of the content; likewise, the background audio can be divided into multiple parts according to context, such as a low-ebb part, a climax part and a return part of the sound. Preferably, voice information and background audio with the same context are synthesized so that content and sound are matched, for example synthesizing the content low-ebb part with the sound low-ebb part and the content climax part with the sound climax part, making the generated broadcast audio more vivid. In addition, in this embodiment the synthesis can also be carried out according to the content of at least one key point in the voice information; for example, after synthesis, when a horror story is broadcast, the background audio played at the thrilling points is stressed or in deep bass, giving the user an immersive sense of the frightening effect at those points. Of course, the synthesis can also be carried out according to the entire content of the voice information, so that the background audio is matched to the voice information in its entirety; for example, when a lyric prose is broadcast, a piece of soothing music can be played throughout the broadcast period.
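The context-matched synthesis described above can be sketched as pairing pre-labelled segments of the two tracks. The context labels and segment placeholders are invented for the example; how segments are actually labelled and merged is left open by the application.

```python
# Hedged sketch of context-matched synthesis: both the voice track and
# the background track are pre-split into (label, segment) pairs, and
# only segments sharing a context label are combined.

def align_by_context(voice_segments, background_segments):
    """Pair voice and background segments that share a context label."""
    bg = {label: seg for label, seg in background_segments}
    return [(label, seg, bg[label]) for label, seg in voice_segments if label in bg]

voice = [("calm", "v1"), ("climax", "v2")]
background = [("climax", "b2"), ("calm", "b1")]
print(align_by_context(voice, background))
# [('calm', 'v1', 'b1'), ('climax', 'v2', 'b2')]
```

Each aligned triple would then feed the overlay step, so that, for instance, the content climax is mixed only with the sound climax.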
The embodiments of the present application also provide an electronic device. Fig. 3 is a structural schematic diagram of the electronic device of an embodiment of the present application. As shown in Fig. 3, the device comprises a recognition module, a processing module and a synthesis module.
The recognition module is configured to perform semantic recognition on the content of the acquired information to be broadcast. The information to be broadcast includes multiple types of information, such as text information, audio information or pre-stored information, and its content has a corresponding semantics. Semantics is the meaning of data, that is, the meaning of the real-world concepts represented by the things the data corresponds to, and the relationships between those meanings; it is the interpretation and logical representation of the data in some domain. In one embodiment, the recognition module can use a pre-stored knowledge base during semantic recognition: corresponding domain data can be obtained from the knowledge base according to the recognized information, and semantic recognition of the content of the information to be broadcast is further carried out in combination with that domain data, so that the recognition result is closer to the actual situation.
The processing module is configured to generate, according to the recognition result, background audio and voice information corresponding to the information to be broadcast. After semantic recognition of the content of the information to be broadcast, its corresponding usage scenario can be known; for example, the information to be broadcast may be information related to a historical event, information related to a literary work (such as a contemporary lyric prose or a horror story), information related to a real-life scene (such as a cocktail party or a sports meeting), or information mixing one or more kinds of scenes. In this embodiment, the processing module can generate background audio corresponding to the information to be broadcast according to the recognition result; the background audio can be adapted to the entire content of the information to be broadcast or to local key content of it. For example, when the information to be broadcast is a lyric prose, the processing module can generate a soothing first background sound corresponding to the lyric prose; when the information to be broadcast is a horror story, the processing module can generate a second background sound corresponding to the horror story, with heavier scoring at the thrilling points of the story, thereby increasing the terrifying effect when the horror story is broadcast. In addition, in this embodiment the processing module also needs to generate voice information corresponding to the information to be broadcast according to the recognition result. Since the semantics of the content of the information to be broadcast is already known, the generated voice information can come closer to the content originally meant to be expressed; for example, when the information to be broadcast is a lyric prose, the corresponding voice information can better conform to requirements such as the tone, intonation, rhythm and stress of the lyric prose, and thus come closer to a real human voice.
The synthesis module is configured to synthesize the background audio and the voice information to generate broadcast audio corresponding to the information to be broadcast. In one embodiment, the synthesis module matches the entire content of the background audio to the entire content of the voice information during synthesis, so that in the generated broadcast audio the user hears background audio corresponding to the spoken content while hearing the voice broadcast, which enriches the content of the broadcast audio, meets user needs, enriches the audio content and gives a good user experience. In another embodiment, the synthesis module aligns at least one key audio point in the background audio with the corresponding key voice point in the voice information during synthesis; that is, in the generated broadcast audio, the key audio point is played at the same time as the key voice point, so that the content at the key voice point can be expressed more emphatically, enriching the expressive effect.
In one embodiment of the application, the electronic device further includes a playing module configured to play the broadcast audio based on a selected voice bank, where the selected voice bank is one voice bank in a preset sound database. The preset sound database contains at least one voice bank, and different voice banks have different sound characteristics. For example, the voice banks established by the construction module may include: a male voice bank, which expresses sound using a standard male voice (for example, a male voice meeting a preset standard); a female voice bank, which expresses sound using a standard female voice (for example, a female voice meeting a preset standard); a mixed male/female bank, which, according to the specific content or requirements of the broadcast audio, expresses sound using male and female voices simultaneously or alternately; and a dialect bank, which expresses sound in a preset dialect. There may additionally be voice banks of many other types, such as a fast-speech bank, a slow-speech bank, and a child-voice bank. One or more voice banks can be established and stored according to user needs so that they can be called up quickly when used.
When the broadcast audio is played, the voice information is played using the selected voice bank. For example, if the user selects the male voice bank, a male voice is used to broadcast the content of the voice information in the broadcast audio (for example, to read a lyrical prose piece aloud), while the background audio in the broadcast audio is played at the same time. A voice bank may of course also be chosen according to the specific voice information being played: for a joke, for instance, the dialect bank may be chosen to play the broadcast audio, improving the expressive effect of the sound.
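A minimal sketch of selecting a voice bank from the preset sound database follows; the bank names, attributes, and fallback rule are illustrative assumptions rather than anything specified by the embodiment:

```python
# Hypothetical preset sound database: each voice bank carries the sound
# characteristics that distinguish it (all values are placeholders).
PRESET_BANKS = {
    "male":    {"pitch": "low",  "rate": 1.0},
    "female":  {"pitch": "high", "rate": 1.0},
    "dialect": {"pitch": "mid",  "rate": 0.95},
    "child":   {"pitch": "high", "rate": 1.1},
    "fast":    {"pitch": "mid",  "rate": 1.4},
}

def select_bank(name, database=PRESET_BANKS):
    """Return the requested voice bank; fall back to the male bank if
    the requested name is not present in the database."""
    return database.get(name, database["male"])
```

The playing module would then hand the selected bank's characteristics to whatever renderer actually voices the broadcast audio.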
In one embodiment of the application, the electronic device further includes a construction module configured to receive speech samples, preprocess the speech samples to establish a customized voice bank, and add the customized voice bank to the preset sound database. Every user has their own preferences: some like broadcasts in a sweet female voice, some in a male voice with rich timbre, some prefer the broadcast audio read in a child's voice, and some even want the voice of a specific person (for example, one of their own relatives or friends). Establishing a customized voice bank is the basis for realizing this goal. The received speech samples are samples of the voice the user prefers; that is, the user wishes the broadcast audio to be played in a voice identical to the speech samples, and the construction module can establish the customized voice bank based on those samples.
Preferably, the construction module is further configured to: remove individual pronunciation differences and/or ambient noise from the speech samples, and remove redundant information from the speech samples to retain key information; determine the voice units in the speech samples; recognize the voice units according to a grammar corresponding to the key information, to obtain the language pattern of the speech samples; and construct the customized voice bank based on the language pattern. Removing individual pronunciation differences and/or ambient noise makes the speech samples clearer and their semantics more accurate, laying the foundation for further processing. Spoken language often contains repeated or useless information, such as verbal tics, interjections, or words repeated many times; in this embodiment the construction module removes this redundant information and retains the key information, so that the key information can better improve subsequent recognition accuracy. Different written and spoken languages may have different voice units: in English, each word can serve as a voice unit, while in Chinese, each independent character or independent word can serve as one. When determining the voice units in the speech samples, the construction module can decide according to the actual situation of the samples, for example according to the written language of the samples (Chinese or English) or the language habits exhibited in them. Different speech samples correspond to different grammars: Chinese has Chinese grammar, English has English grammar. Because the key information is the regular, well-formed expression that remains after hard-to-recognize information has been removed, recognizing the voice units according to the grammar corresponding to the key information allows the voice units to be recognized accurately, yielding the language pattern of the speech samples. Different speech samples have their own language patterns; a language pattern contains the various features of its speech samples, and based on those features other voice information can be expressed in the same form. Because the language pattern corresponds to the speech samples the user provided, the customized voice bank constructed from it can embody the same features as those samples, such as the same sound characteristics and language habits, so that the customized voice bank better meets the user's needs.
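The four construction steps above can be sketched very roughly on word-token input. This is a simplification under stated assumptions: the filler list, the word-as-unit choice (as for English), and the frequency-based "language pattern" are all illustrative stand-ins for the real signal processing the embodiment implies:

```python
def build_custom_bank(tokens):
    """Toy pipeline over word tokens mirroring the four steps:
    preprocess -> determine units -> recognize by grammar -> build pattern."""
    # 1. Preprocess: drop fillers (redundant information) to retain key
    #    information; the filler list is a hypothetical placeholder.
    FILLERS = {"um", "uh", "like"}
    key = [t for t in tokens if t not in FILLERS]
    # 2. Determine voice units: here, each word is one unit (English-style).
    units = key
    # 3. "Recognize" units against a grammar: as a stand-in, keep only
    #    purely alphabetic tokens.
    recognized = [u for u in units if u.isalpha()]
    # 4. Derive a language pattern: unit frequencies as the bank's features.
    pattern = {}
    for u in recognized:
        pattern[u] = pattern.get(u, 0) + 1
    return {"pattern": pattern}
```

A real implementation would operate on audio features rather than text tokens, but the staged structure — clean, segment, recognize, model — is the same.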
In one embodiment of the application, the recognition module is further configured to: identify context relation data corresponding to the content of the to-be-broadcast information, and determine the context of the to-be-broadcast information according to the context relation data. Specifically, the context of the to-be-broadcast information can be known more accurately from its entire content, avoiding the deviation that arises when a result obtained from local content alone differs from the actual to-be-broadcast information. In this embodiment the determination is based on the entire content: the context of the to-be-broadcast information is determined from the contextual relationships within its content, for example by having the recognition module identify the relationship between a preceding sentence and a following sentence, or between a preceding paragraph and a following paragraph.
In one embodiment of the application, the synthesis module is further configured to: divide the voice information and the background audio according to the context of the to-be-broadcast information; and perform the synthesis operation on voice information and background audio having the same context. Specifically, the voice information can be divided into multiple parts according to differences in context, such as a content low-ebb part, a content climax part, and a content resolution part; likewise, the background audio can be divided by context into a sound low-ebb part, a sound climax part, a sound resolution part, and so on. Preferably, the synthesis module synthesizes voice information and background audio of the same context, so that content and sound are matched: the content low-ebb part is synthesized with the sound low-ebb part, the content climax part with the sound climax part, making the generated broadcast audio more vivid. In addition, in this embodiment the synthesis module can also perform the synthesis operation according to the content of at least one key point in the voice information: for example, when a horror story is synthesized and broadcast, the background audio played at a thrilling point may be stressed or deep bass, adding a terrifying effect at that point and giving the user a feeling of being there in person. Of course, the synthesis module can also perform the synthesis operation according to the entire content of the voice information, so that the background audio matches the voice information throughout; for example, a soothing piece of music can be played for the entire duration of the broadcast of a lyrical prose piece.
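The same-context pairing described above could be sketched as follows; the segment labels, the mix gain, and the treatment of unmatched segments (played dry) are assumptions, not details from the embodiment:

```python
def synthesize_by_context(voice_segments, bg_segments, bg_gain=0.3):
    """voice_segments / bg_segments: lists of (context_label, samples).
    Pair each voice segment with the background segment sharing the same
    context label and mix them; voice with no matching background plays dry."""
    bg_by_ctx = {ctx: samples for ctx, samples in bg_segments}
    out = []
    for ctx, samples in voice_segments:
        bg = bg_by_ctx.get(ctx, [])
        # Mix overlapping region, then append any voice tail beyond the
        # background's length unchanged.
        mixed = [v + bg_gain * b for v, b in zip(samples, bg)]
        mixed += list(samples[len(bg):])
        out.extend(mixed)
    return out
```

Note that the pairing is by label rather than by position, so a "climax" background segment lands under the "climax" voice segment even if the two inputs list their segments in different orders.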
The above embodiments are merely exemplary embodiments of the application and are not intended to limit it; the protection scope of the application is defined by the claims. Those skilled in the art may make various modifications or equivalent replacements to the application within its spirit and protection scope, and such modifications or equivalent replacements shall also be regarded as falling within the protection scope of the application.

Claims (10)

1. A speech processing method, comprising:
performing semantic recognition on the content of acquired to-be-broadcast information;
generating, according to a recognition result, background audio and voice information corresponding to the to-be-broadcast information; and
performing a synthesis operation on the background audio and the voice information to generate broadcast audio corresponding to the to-be-broadcast information.
2. The method according to claim 1, further comprising, after generating the broadcast audio corresponding to the to-be-broadcast information:
playing the broadcast audio based on a selected voice bank, wherein the selected voice bank is a voice bank in a preset sound database.
3. The method according to claim 2, further comprising:
receiving speech samples, preprocessing the speech samples to establish a customized voice bank, and adding the customized voice bank to the preset sound database.
4. The method according to claim 3, wherein preprocessing the speech samples to establish the customized voice bank comprises:
removing individual pronunciation differences and/or ambient noise from the speech samples, and removing redundant information from the speech samples to retain key information;
determining voice units in the speech samples;
recognizing the voice units according to a grammar corresponding to the key information, to obtain a language pattern of the speech samples; and
constructing the customized voice bank based on the language pattern.
5. The method according to claim 1, wherein performing semantic recognition on the content of the acquired to-be-broadcast information comprises:
identifying context relation data corresponding to the content of the to-be-broadcast information, and determining a context of the to-be-broadcast information according to the context relation data.
6. The method according to claim 5, wherein performing the synthesis operation on the background audio and the voice information to generate the broadcast audio corresponding to the to-be-broadcast information comprises:
dividing the voice information and the background audio according to the context of the to-be-broadcast information; and
performing the synthesis operation on voice information and background audio having the same context.
7. An electronic device, comprising a recognition module, a processing module, and a synthesis module, wherein:
the recognition module is configured to perform semantic recognition on the content of acquired to-be-broadcast information;
the processing module is configured to generate, according to a recognition result, background audio and voice information corresponding to the to-be-broadcast information; and
the synthesis module is configured to perform a synthesis operation on the background audio and the voice information to generate broadcast audio corresponding to the to-be-broadcast information.
8. The electronic device according to claim 7, further comprising a playing module configured to play the broadcast audio based on a selected voice bank, wherein the selected voice bank is a voice bank in a preset sound database.
9. The electronic device according to claim 8, further comprising a construction module configured to:
receive speech samples, preprocess the speech samples to establish a customized voice bank, and add the customized voice bank to the preset sound database.
10. The electronic device according to claim 9, wherein the construction module is further configured to:
remove individual pronunciation differences and/or ambient noise from the speech samples, and remove redundant information from the speech samples to retain key information;
determine voice units in the speech samples;
recognize the voice units according to a grammar corresponding to the key information, to obtain a language pattern of the speech samples; and
construct the customized voice bank based on the language pattern.
CN201810857848.4A 2018-07-31 2018-07-31 A kind of method of speech processing and electronic equipment Pending CN109036373A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810857848.4A CN109036373A (en) 2018-07-31 2018-07-31 A kind of method of speech processing and electronic equipment


Publications (1)

Publication Number Publication Date
CN109036373A true CN109036373A (en) 2018-12-18

Family

ID=64647187

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810857848.4A Pending CN109036373A (en) 2018-07-31 2018-07-31 A kind of method of speech processing and electronic equipment

Country Status (1)

Country Link
CN (1) CN109036373A (en)


Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101000767A (en) * 2006-01-09 2007-07-18 杭州世导科技有限公司 Speech recognition equipment and method
CN101567186A (en) * 2008-04-23 2009-10-28 索尼爱立信移动通信日本株式会社 Speech synthesis apparatus, method, program, system, and portable information terminal
CN103680492A (en) * 2012-09-24 2014-03-26 Lg电子株式会社 Mobile terminal and controlling method thereof
US20150066518A1 (en) * 2013-09-05 2015-03-05 Electronics And Telecommunications Research Institute Audio encoding apparatus and method, audio decoding apparatus and method, and audio reproducing apparatus
CN105096932A (en) * 2015-07-14 2015-11-25 百度在线网络技术(北京)有限公司 Voice synthesis method and apparatus of talking book
CN107464555A (en) * 2016-06-03 2017-12-12 索尼移动通讯有限公司 Background sound is added to the voice data comprising voice
CN108010512A (en) * 2017-12-05 2018-05-08 广东小天才科技有限公司 The acquisition methods and recording terminal of a kind of audio


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109981448A (en) * 2019-03-28 2019-07-05 联想(北京)有限公司 Information processing method and electronic equipment
CN109981448B (en) * 2019-03-28 2022-03-25 联想(北京)有限公司 Information processing method and electronic device
CN112107773A (en) * 2020-09-08 2020-12-22 杭州趣安科技有限公司 Audio processing method and system


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination