CN110164414A - Speech processing method, apparatus and smart device - Google Patents
Speech processing method, apparatus and smart device
- Publication number
- CN110164414A CN110164414A CN201811452305.0A CN201811452305A CN110164414A CN 110164414 A CN110164414 A CN 110164414A CN 201811452305 A CN201811452305 A CN 201811452305A CN 110164414 A CN110164414 A CN 110164414A
- Authority
- CN
- China
- Prior art keywords
- voice
- verification
- user
- user speech
- tone color
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
- 238000000034 method Methods 0.000 title claims abstract description 72
- 238000012545 processing Methods 0.000 title claims abstract description 52
- 238000012795 verification Methods 0.000 claims abstract description 273
- 238000006243 chemical reaction Methods 0.000 claims description 40
- 230000015654 memory Effects 0.000 claims description 22
- 238000004590 computer program Methods 0.000 claims description 12
- 238000012937 correction Methods 0.000 claims description 8
- 238000004519 manufacturing process Methods 0.000 claims description 7
- 238000012360 testing method Methods 0.000 claims description 4
- 238000010586 diagram Methods 0.000 description 9
- 238000012546 transfer Methods 0.000 description 6
- 210000001260 vocal cord Anatomy 0.000 description 5
- 239000000203 mixture Substances 0.000 description 4
- 210000000056 organ Anatomy 0.000 description 3
- 230000009466 transformation Effects 0.000 description 3
- 238000004458 analytical method Methods 0.000 description 2
- 238000005516 engineering process Methods 0.000 description 2
- 210000004072 lung Anatomy 0.000 description 2
- 238000011160 research Methods 0.000 description 2
- 230000015572 biosynthetic process Effects 0.000 description 1
- 238000004364 calculation method Methods 0.000 description 1
- 238000010835 comparative analysis Methods 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 230000000694 effects Effects 0.000 description 1
- 210000004704 glottis Anatomy 0.000 description 1
- 238000013507 mapping Methods 0.000 description 1
- 210000005036 nerve Anatomy 0.000 description 1
- 230000000737 periodic effect Effects 0.000 description 1
- 238000013139 quantization Methods 0.000 description 1
- 239000007787 solid Substances 0.000 description 1
- 238000003786 synthesis reaction Methods 0.000 description 1
- 238000013519 translation Methods 0.000 description 1
- 230000000007 visual effect Effects 0.000 description 1
Classifications
-
- G—PHYSICS
- G09—EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
- G09B—EDUCATIONAL OR DEMONSTRATION APPLIANCES; APPLIANCES FOR TEACHING, OR COMMUNICATING WITH, THE BLIND, DEAF OR MUTE; MODELS; PLANETARIA; GLOBES; MAPS; DIAGRAMS
- G09B19/00—Teaching not covered by other main groups of this subclass
- G09B19/06—Foreign languages
-
- G—PHYSICS
- G09—EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
- G09B—EDUCATIONAL OR DEMONSTRATION APPLIANCES; APPLIANCES FOR TEACHING, OR COMMUNICATING WITH, THE BLIND, DEAF OR MUTE; MODELS; PLANETARIA; GLOBES; MAPS; DIAGRAMS
- G09B5/00—Electrically-operated educational appliances
- G09B5/04—Electrically-operated educational appliances with audible presentation of the material to be studied
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/033—Voice editing, e.g. manipulating the voice of the synthesiser
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Business, Economics & Management (AREA)
- Human Computer Interaction (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Health & Medical Sciences (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Computational Linguistics (AREA)
- Educational Administration (AREA)
- Educational Technology (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Entrepreneurship & Innovation (AREA)
- User Interface Of Digital Computer (AREA)
- Electrically Operated Instructional Devices (AREA)
Abstract
The embodiments of the invention disclose a speech processing method, apparatus and smart device. The method includes: after obtaining user speech information, extracting the timbre parameters and the speech content it contains; looking up a first verification speech that matches the speech content contained in the speech information, and obtaining the timbre parameters of the first verification speech. Further, a reference timbre frequency is determined based on the timbre parameters of the user speech information and the timbre parameters of the first verification speech, and a second verification speech matching the speech content contained in the user speech information is generated according to the reference timbre frequency. With the embodiments of the invention, a verification speech can be generated according to the user's timbre parameters, so that the user can practise spoken language more intuitively.
Description
Technical field
The present invention relates to the field of speech processing technology, and in particular to a speech processing method, apparatus and smart device.
Background technique
In today's world, with economic globalization and the implementation of opening-up policies, international exchanges are increasingly frequent, which has motivated users to learn foreign languages. To converse fluently with foreigners, one must improve one's spoken foreign-language skills.
A common method of oral practice today is for a professional, such as a teacher, to record a spoken model reading of a passage of foreign-language content; the user then practises reading that passage aloud. The user's practice reading is compared with the model reading, and a visual comparison curve is generated for the differences between the two, so that the user can identify the differences from the curve and practise accordingly. It has been found in practice that this approach is not very helpful to foreign-language learners. How to provide users with a more intuitive speech model has therefore become a hot research topic.
Summary of the invention
The embodiments of the present invention provide a speech processing method, apparatus and smart device, which can generate a verification speech according to the user's timbre parameters, so that the user can practise spoken language more intuitively.
In one aspect, an embodiment of the invention provides a speech processing method, comprising:
obtaining user speech information, and obtaining timbre parameters of the user speech information;
looking up a first verification speech matching the speech content contained in the user speech information, and obtaining timbre parameters of the first verification speech;
determining a reference timbre frequency based on the timbre parameters of the user speech information and the timbre parameters of the first verification speech;
generating, based on the reference timbre frequency, a second verification speech matching the speech content contained in the user speech information.
In another aspect, an embodiment of the invention provides a speech processing apparatus, comprising:
an obtaining unit configured to obtain user speech information and obtain timbre parameters of the user speech information;
a processing unit configured to look up a first verification speech matching the speech content contained in the user speech information;
the obtaining unit being further configured to obtain timbre parameters of the first verification speech;
the processing unit being further configured to determine a reference timbre frequency based on the timbre parameters of the user speech information and the timbre parameters of the first verification speech;
the processing unit being further configured to generate, based on the reference timbre frequency, a second verification speech matching the speech content contained in the user speech information.
In a further aspect, an embodiment of the invention provides a smart device comprising a processor and a memory, the memory being configured to store a computer program comprising program instructions, and the processor being configured to invoke the program instructions to perform the above speech processing method.
Correspondingly, an embodiment of the invention also provides a computer storage medium storing computer program instructions which, when executed by a processor, perform the above speech processing method.
In the embodiments of the present invention, after the user speech information and its corresponding first verification speech are obtained, a reference timbre frequency can be determined from the timbre parameters of the user speech information and the timbre parameters of the first verification speech; further, a second verification speech matching the user speech information is generated based on the reference timbre frequency. A verification speech approximating the user's timbre is thus generated, so that the user can correct mispronunciations accurately, improving the efficiency of spoken-language practice.
Detailed description of the invention
To explain the technical solutions of the embodiments of the invention or of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the invention; for a person of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is an architecture diagram of a speech processing system provided by an embodiment of the present invention;
Fig. 2 is a flow diagram of a speech processing method provided by an embodiment of the present invention;
Fig. 3 is a schematic diagram of correction prompt information provided by an embodiment of the present invention;
Fig. 4 is a structural schematic diagram of a speech processing apparatus provided by an embodiment of the present invention;
Fig. 5 is a structural schematic diagram of a smart device provided by an embodiment of the present invention.
Specific embodiment
Research on foreign-language oral practice has found that, with the development of the Internet, oral practice has shifted from offline face-to-face teaching to online oral practice. Offline face-to-face teaching means that the user arrives at a designated place at a designated time and follows a professional in oral practice; under this mode, the learning time is set by the spoken-language professional according to his or her own teaching hours and the time available to most practitioners, and users cannot freely choose when to study. Under the online oral practice mode, users can practise at any time, for example by logging on to an oral-practice website or downloading oral-practice videos through a smart device such as a mobile phone; under this mode, users can schedule practice time according to their own arrangements.
In one embodiment, the online oral practice mode may be as follows: the smart device displays to the user, on a user interface, a passage of spoken foreign-language content; when the user's speech information for this passage is collected, the smart device looks up the verification speech (which may also be called the standard speech) matching the passage, compares the verification speech with the user's speech information, and outputs a comparison result (for example, that the letter e in "hello" is mispronounced, together with the correct pronunciation), so that the user can correct his or her pronunciation according to the comparison result. In other embodiments, after finding the verification speech matching the speech content, the smart device may also play the verification speech repeatedly, so that the user can correct the pronunciation by listening to it. In both of the above embodiments, a verification speech whose timbre is close to the user's own makes it easier for the user to learn intuitively and compare against the standard pronunciation.
The following describes in detail how the embodiments of the present invention generate a second verification speech approximating the user's timbre, so that the user can practise speaking against a standard pronunciation with the same timbre as his or her own.
Referring to Fig. 1, a speech processing system provided by an embodiment of the present invention may include a speech acquisition module 101, a verification speech query module 102, a similarity scoring module 103 and a timbre adjustment module 104.
In one embodiment, for a given oral practice task, the speech acquisition module 101 collects the user speech information; it may do so through a sound sensor such as a microphone. The user speech information contains speech content. When it is detected that the speech acquisition module 101 has collected user speech information, the verification speech query module 102 queries the first verification speech matching the speech content of the user speech information.
Optionally, a first verification speech set may be pre-stored in the speech processing system, the set containing at least one first verification speech, each of which contains speech content. The verification speech query module 102 may query the first verification speech matching the speech content of the user speech information as follows: obtain the speech content contained in the user speech information; search the first verification speech set, according to that speech content, for a target first verification speech whose speech content matches the speech content contained in the user speech information; and take the target first verification speech as the first verification speech matching the speech content contained in the user speech information.
In one embodiment, after the user speech information and the first verification speech are obtained, the similarity scoring module 103 scores the similarity of the user speech information against the first verification speech, to determine the difference between the user's pronunciation and the pronunciation of the first verification speech. If the similarity score is below a similarity threshold, the user's pronunciation accuracy is low, and the timbre adjustment module 104 needs to adjust the timbre of the first verification speech so that the user can practise speaking against a verification speech with the same timbre as his or her own; if the similarity score is above the similarity threshold, the user's pronunciation accuracy is high, and the timbre adjustment module 104 need not adjust the timbre of the first verification speech.
In one embodiment, the timbre adjustment module 104 may adjust the timbre of the first verification speech to the user's timbre; the first verification speech after timbre adjustment serves as the second verification speech. In one embodiment, the speech processing system shown in Fig. 1 may further include a speech playback module 105 for playing the first verification speech or the second verification speech, so that the user practises pronunciation according to the first or the second verification speech.
In summary, in the speech processing system shown in Fig. 1, for a given oral practice task, the speech acquisition module 101 first collects the user speech information, and the verification speech query module 102 looks up the first verification speech corresponding to the user speech information; further, the similarity scoring module 103 scores the similarity of the user speech information against the first verification speech, obtaining a similarity score. When the similarity score is below the similarity threshold, the timbre adjustment module 104 adjusts the timbre of the first verification speech to obtain the second verification speech; finally, the speech playback module 105 can play the second verification speech, so that the user practises pronunciation against a verification speech with the same timbre as his or her own.
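The module flow summarized above can be sketched as follows. This is a minimal illustration with placeholder data structures and a toy word-overlap similarity score; the function names and the dictionary representation of a speech are assumptions, not the patent's implementation, which operates on real audio signals.

```python
# Sketch of the Fig. 1 pipeline: modules 101 (acquisition), 102 (query),
# 103 (similarity scoring) and 104 (timbre adjustment).

def acquire_user_speech():                            # module 101 (stub)
    return {"content": "where are you from",
            "timbre": {"mean": 180.0, "var": 25.0}}

def query_first_verification(content, speech_set):    # module 102
    # Look up the stored verification speech whose content matches.
    return next(s for s in speech_set if s["content"] == content)

def similarity_score(user, verification):             # module 103 (toy metric)
    u = set(user["content"].split())
    v = set(verification["content"].split())
    return len(u & v) / len(u | v)

def adjust_timbre(verification, user_timbre):         # module 104
    # Replace the verification speech's timbre with the user's timbre.
    second = dict(verification)
    second["timbre"] = user_timbre
    return second

def process(speech_set, threshold=0.8):
    user = acquire_user_speech()
    first = query_first_verification(user["content"], speech_set)
    if similarity_score(user, first) < threshold:
        return adjust_timbre(first, user["timbre"])   # second verification speech
    return first                                      # no adjustment needed
```

In the real system, module 105 would then play back the resulting verification speech.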
Referring to Fig. 2, which is a flow diagram of a speech processing method provided by an embodiment of the present invention, the speech processing method shown in Fig. 2 can be applied in the speech processing system shown in Fig. 1 and can be performed by a smart device, specifically by the processor of the smart device. In one embodiment, the smart device may include one or more of a mobile phone, a tablet computer, a notebook computer and similar devices.
In one embodiment, the speech processing method shown in Fig. 2 mainly adjusts the first verification speech according to the user speech information, so as to generate a second verification speech approximating the user's timbre, allowing the user to learn and compare against the standard pronunciation more intuitively. First, in step S201, the smart device obtains user speech information and obtains the timbre parameters of the user speech information. In one embodiment, the user speech information contains speech content: the user speech information is the audio of the user reading a certain passage of speech content aloud. If the user pronounces every word correctly, the speech content contained in the user speech information is that passage. For example, if the speech content is "welcome to QQ", the user speech information is the audio containing that content, and the speech content contained in the user speech information is "welcome to QQ".
In one embodiment, the timbre parameters are parameters that affect timbre. Timbre is determined mainly by the fundamental tone, the overtones and change parameters. For most vibrating objects, not only the whole object vibrates: its various parts also vibrate simultaneously, forming a complex vibration. The sound produced by the vibration of the whole body may be regarded as the fundamental tone, and the sounds produced by the vibrations of the individual parts may be regarded as overtones. For a person, multiple parts likewise vibrate during phonation, so the fundamental tone and overtones can also be distinguished; synthesizing the overtones with the fundamental tone according to the change parameters determines the timbre of the sounding body.
In the embodiments of the present invention, in the process of generating the second verification speech approximating the user's timbre from the first verification speech, the change parameters can remain the same as those of the first verification speech. The timbre parameters described in the embodiments of the present invention are determined mainly from the fundamental frequency and the overtone frequencies of the corresponding speech: the timbre parameters may include only the fundamental-tone parameters of the corresponding speech, or may include both the fundamental-tone parameters and the overtone parameters of the corresponding speech.
In one embodiment, the user speech information obtained by the smart device may be speech information input by the user in real time; the smart device can obtain it in real time through a sound sensor such as a microphone. For example, the smart device displays an oral-practice window to the user on a user interface, in which several passages of speech content available for oral practice are shown; when the user's selection of a passage is detected, the smart device outputs prompt information prompting the user to read that passage aloud; and when it is detected that the user has started reading, the smart device records the user's reading through the microphone as the user speech information.
In other embodiments, the user speech information obtained by the smart device may also be historical speech information input by the user. For example, the user previously recorded his or her own readings during practice and stored each recording in the smart device; when the user wants to correct pronunciation problems through the smart device, one of the stored recordings can be selected and input into the smart device as the user speech information.
In one embodiment, the timbre parameters of the user speech information include a mean and/or a variance: they may be the mean and variance of the fundamental tone of the user speech information, or the mean and variance of the fundamental tone together with the mean and variance of the overtones of the user speech information. The timbre parameters of the user speech information can be determined from the timbre frequencies of the user speech information over certain time periods. A sound is generally a combination of vibrations of different frequencies and amplitudes emitted by the sounding body. Among these vibrations there is one of lowest frequency, and the sound it produces is the fundamental tone; the fundamental frequency refers to the quasi-periodic excitation pulse train produced by the relaxation oscillation of the vocal cords when airflow passes through the glottis during voiced sounds. The fundamental frequency is related to the length, thickness, toughness and stiffness of the vocal cords and to pronunciation habits, so every person's fundamental frequency is different. The overtone frequencies refer to the quasi-periodic excitation pulse trains produced during phonation by the vibration of organs such as the mouth, throat and lungs; since these organs differ from person to person, everyone's overtone frequencies also differ. Based on these person-to-person differences in timbre frequency, that is, in fundamental frequency and overtone frequency, the embodiments of the present invention can adjust the timbre frequency of the first verification speech to a reference timbre frequency related to the user's timbre frequency, thereby generating a second verification speech approximating the user's timbre.
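The mean-and-variance timbre parameters described above can be sketched as follows, assuming per-frame fundamental-frequency estimates are already available (in practice they would come from a pitch tracker; the zero-marks-unvoiced convention is an assumption of this sketch).

```python
# Sketch: timbre parameters as the mean and variance of the fundamental
# frequency over the voiced frames of a speech segment.

def timbre_parameters(f0_frames):
    """f0_frames: per-frame f0 estimates in Hz; 0.0 marks unvoiced frames."""
    voiced = [f for f in f0_frames if f > 0]
    mean = sum(voiced) / len(voiced)
    var = sum((f - mean) ** 2 for f in voiced) / len(voiced)
    return {"mean": mean, "var": var}
```

The same computation applied to overtone-frequency tracks would yield the overtone parameters.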
In one embodiment, the second verification speech approximating the user's timbre may be generated by the smart device adjusting the timbre word by word for each word in the speech content contained in the first verification speech. For example, if the speech content contained in the first verification speech is "where are you from?", the smart device can adjust the timbre separately for the four words where, are, you and from. The smart device may obtain the timbre parameters of the user speech information as follows: after obtaining the user speech information, determine its duration and the duration of each word in the speech content it contains (the duration of each word can be understood as a time period corresponding to that word, the user speech information containing multiple time periods); for a target word among the words, the smart device obtains the target time period and the timbre frequency of the target word, and computes the timbre parameters within the target time period from the target time period and the timbre frequency of the target word. Proceeding in this way, the timbre parameters of the user speech information in each time period can be computed. Specifically, when the timbre frequency includes the fundamental frequency, the fundamental-tone parameters within the target time period are computed from the target time period and fundamental frequency of the target word; when the timbre frequency includes overtone frequencies, the overtone parameters within the target time period are computed from the target time period and overtone frequencies of the target word.
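The per-word computation above can be sketched as follows, where each word's time period selects its own frames and the timbre parameters are computed separately per period. The frame times and word boundaries are illustrative assumptions; real word boundaries would come from forced alignment or speech recognition.

```python
# Sketch: per-word timbre parameters, one (mean, var) pair per time period.

def word_timbre_parameters(frames, word_periods):
    """frames: list of (time_sec, f0_hz) pairs, f0 == 0 meaning unvoiced;
    word_periods: {word: (start_sec, end_sec)} time period per word."""
    params = {}
    for word, (start, end) in word_periods.items():
        f0 = [f for t, f in frames if start <= t < end and f > 0]
        mean = sum(f0) / len(f0)
        var = sum((f - mean) ** 2 for f in f0) / len(f0)
        params[word] = {"mean": mean, "var": var}
    return params
```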
In one embodiment, after obtaining the user speech information, the smart device, in step S202, looks up the first verification speech matching the speech content contained in the user speech information, and obtains the timbre parameters of the first verification speech. The timbre parameters of the first verification speech include a mean and/or a variance: they may be the mean and variance of the fundamental tone of the first verification speech, or the mean and variance of its fundamental tone together with the mean and variance of its overtones. The first verification speech is the standard speech corresponding to the user speech information; the standard speech, also called the model reading, can be recorded by a professional such as a professional teacher, and its pronunciation is entirely correct. In other words, for a passage of speech content used for oral practice, the audio obtained when the user reads the passage aloud is the user speech information, and the audio obtained when a professional reads it aloud is the first verification speech.
In one embodiment, the lookup in step S202 of the first verification speech matching the speech content contained in the user speech information may be implemented by determining the first verification speech from a first verification speech set according to the speech content contained in the user speech information. Specifically, looking up the first verification speech matching the speech content contained in the user speech information comprises: obtaining the speech content contained in the user speech information; searching the first verification speech set, according to that speech content, for a target first verification speech whose speech content matches the speech content contained in the user speech information; and taking the target first verification speech as the first verification speech matching the speech content contained in the user speech information.
Multiple passages of speech content for oral practice and their corresponding first verification speeches can be stored in association in the smart device, the first verification speeches forming the first verification speech set. After obtaining user speech information, the smart device can first obtain the speech content it contains, then search the stored passages for the target speech content matching that content, and finally, according to the association between speech content and first verification speech, look up the target first verification speech corresponding to the target speech content and take it as the first verification speech matching the speech content contained in the user speech information.
In one embodiment, the smart device obtains the speech content contained in the user speech information from the user speech information; because the user's pronunciation may be inaccurate, the speech content so obtained may not be exactly identical to the speech content stored in the smart device. For example, the stored speech content is "I go to bed round 11:00 at night"; when reading aloud, the user pronounces the word night inaccurately, turning the consonant /n/ into /l/, so the speech content the smart device obtains from the user speech information may be "I go to bed round 11:00 at light". Therefore, in one embodiment, to solve this problem, the smart device can preset a matching threshold: if the match between the obtained speech content and a pre-stored target speech content exceeds the matching threshold, the target speech content is determined to match the speech content of the user speech information. It should be noted that the above is one solution to this problem listed by the invention, and the specific method is not limited by the embodiments of the present invention.
In one embodiment, the matching threshold set by the smart device may apply to the words contained in the speech content: for example, if the matching threshold is set to 80%, and 80% or more of the words contained in the two passages of speech content are the same, the two passages can be determined to match. Alternatively, the matching threshold may apply to the meaning of the speech content: if the similarity in meaning between the two passages exceeds 80%, the two passages are determined to match. It should be noted that the above are some possible ways of setting the matching threshold listed in the embodiments of the present invention; the specific setting can be determined according to actual needs and is not specifically limited by the embodiments of the present invention.
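The word-based matching threshold can be sketched as follows. The position-by-position comparison is an assumption, since the patent leaves the exact similarity measure open; it suffices to show why the /n/-to-/l/ example above still matches at an 80% threshold.

```python
# Sketch: two passages of speech content match if the fraction of
# identical words reaches the matching threshold (80% by default).

def contents_match(recognized, stored, threshold=0.8):
    rec = recognized.lower().split()
    sto = stored.lower().split()
    shared = sum(1 for r, s in zip(rec, sto) if r == s)
    return shared / max(len(rec), len(sto)) >= threshold
```

For the "at light" / "at night" example, seven of eight words agree (87.5%), so the stored passage is still selected despite the mispronunciation.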
In one embodiment, the timbre parameters of the first verification speech can be obtained in the same way as described above for obtaining the timbre parameters of the user speech information, which is not repeated here.
In one embodiment, after the smart device has obtained the timbre parameters of the user speech information and the timbre parameters of the first verification speech, in step S203 it determines the reference timbre frequency based on the timbre parameters of the user speech information and the timbre parameters of the first verification speech. The reference timbre frequency is the timbre frequency obtained after adjusting the timbre frequency of the first verification speech according to the timbre parameters of the user speech information; it can also be understood as the timbre frequency needed to generate the second verification speech approximating the user's timbre.
In one embodiment, step S203 may be implemented by determining the conversion relationship between the timbre frequency of the first verification voice and the reference timbre frequency, then determining the conversion coefficient and correction parameter required by that relationship according to the timbre parameters in the user speech information, and finally calculating the reference timbre frequency from the conversion relationship, the conversion coefficient, and the correction parameter. Specifically, determining the reference timbre frequency based on the timbre parameters of the user speech information and of the first verification voice in step S203 may include: determining the timbre frequency of the first verification voice; determining a conversion coefficient and a correction parameter according to the timbre parameters of the user speech information and of the first verification voice; and determining the reference timbre frequency according to the timbre frequency of the first verification voice, the conversion coefficient, the correction parameter, and a timbre frequency conversion rule.
In one embodiment, the timbre frequency of the first verification voice may include the fundamental frequency of the first verification voice, or may include both the fundamental frequency and the overtone frequency of the first verification voice. When the timbre frequency of the first verification voice includes its fundamental frequency, the fundamental frequency is determined as follows: determine the pitch period of the first verification voice, then take the reciprocal of the pitch period to obtain the fundamental frequency. For example, if the pitch period of the first verification voice is 4 ms, its fundamental frequency is 1/4 ms = 250 Hz. When the timbre frequency of the first verification voice includes its overtone frequency, the overtone frequency may be determined by a similar method. For the fundamental frequency, the pitch period refers to the time required for one periodic opening and closing of the vocal cords of the sound-producing body. In one embodiment, common methods for determining the pitch period include time-domain methods, frequency-domain methods, and hybrid methods. A time-domain method estimates the pitch period directly from the speech waveform; a frequency-domain method transforms the speech signal to the frequency domain to estimate the pitch period, a common example being the cepstrum method; a hybrid method first extracts vocal-tract model parameters, then filters the speech signal with those parameters to obtain the excitation (source) sequence, and finally uses the autocorrelation method or the average magnitude difference function to obtain the pitch period. The above lists only some of the methods for determining the pitch period of the first verification voice; the smart device may select one or more of the above methods according to actual needs to determine the pitch period of the first verification voice.
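The reciprocal relation between pitch period and fundamental frequency, together with a time-domain (autocorrelation) pitch estimate, can be sketched as follows; the sampling rate, frame length, and synthetic 250 Hz frame are illustrative assumptions, not values from the patent:

```python
import numpy as np

def pitch_period_autocorr(frame: np.ndarray, fs: int,
                          fmin: float = 60.0, fmax: float = 400.0) -> float:
    """Time-domain method: estimate the pitch period (seconds) of one
    voiced frame via autocorrelation, searching lags between fs/fmax
    and fs/fmin (the plausible pitch range)."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)
    lag = lo + int(np.argmax(ac[lo:hi]))
    return lag / fs

fs = 8000
t = np.arange(int(0.04 * fs)) / fs       # one 40 ms analysis frame
frame = np.sin(2 * np.pi * 250 * t)      # synthetic voiced frame at 250 Hz

period = pitch_period_autocorr(frame, fs)  # expected ~4 ms
f0 = 1.0 / period                          # reciprocal gives F0, ~250 Hz
print(round(period * 1000, 2), round(f0, 1))
```

A frequency-domain (cepstrum) variant would instead locate the peak of the real cepstrum in the same lag range.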
In one embodiment, the timbre frequency conversion rule may be determined by the smart device according to the voice conversion method it uses. Assume the voice conversion method used by the smart device is based on a Gaussian mixture model; taking the fundamental frequency as an example, the timbre frequency conversion rule is expressed by the following formula:

F′0(t) = a · F0(t) + b, Formula 1;

where F′0(t) denotes the reference fundamental frequency, F0(t) denotes the fundamental frequency of the first verification voice, a denotes the conversion coefficient, and b denotes the correction parameter; here the conversion coefficient and correction parameter are determined based on the pitch parameters of the user speech information and of the first verification voice. The voice conversion method used by the smart device may also be based on vector-quantization codebook mapping, an artificial neural network model, a hidden Markov model, and so on; the timbre frequency conversion rules corresponding to these various voice conversion methods are not listed one by one in the embodiments of the present invention. In other embodiments, the timbre frequency conversion rule may also be set by the smart device according to other preset conditions.
The timbre conversion rules used for the fundamental frequency and for the overtone frequency within the timbre frequency may be the same or different. That is, the timbre conversion rule for the overtone frequency may also use the calculation of Formula 1, in which case F′0(t) denotes the reference overtone frequency, F0(t) denotes the overtone frequency of the first verification voice, a denotes the conversion coefficient, and b denotes the correction parameter; here the conversion coefficient and correction parameter are determined based on the overtone parameters of the user speech information and of the first verification voice.
In one embodiment, determining the reference timbre frequency according to the timbre frequency of the first verification voice, the conversion coefficient, the correction parameter, and the timbre frequency conversion rule may be implemented as follows: substitute the timbre frequency of the first verification voice, the conversion coefficient, and the correction parameter into the timbre frequency conversion rule and calculate; the resulting value is the reference timbre frequency.
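As a minimal sketch, Formula 1 can be applied frame by frame once the conversion coefficient and correction parameter are known; the F0 values, `a`, and `b` below are illustrative assumptions rather than values from the patent:

```python
def convert_f0(f0_track, a, b):
    """Formula 1: the reference F0 at each frame is a linear map of the
    first verification voice's F0, i.e. F0'(t) = a * F0(t) + b."""
    return [a * f + b for f in f0_track]

# Hypothetical frame-wise F0 (Hz) of the first verification voice,
# with an illustrative conversion coefficient and correction parameter.
reference_f0 = convert_f0([120.0, 125.0, 118.0], a=1.5, b=-20.0)
print(reference_f0)  # [160.0, 167.5, 157.0]
```

The same function would serve for the overtone frequency, with `a` and `b` computed from the overtone parameters instead.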
In one embodiment, the timbre parameters in the user speech information include a first mean and a first variance determined from the timbre frequency of a target time period in the user speech information. The timbre parameters include pitch parameters, or pitch parameters and overtone parameters; correspondingly, the first mean may include a first fundamental mean, or a first fundamental mean and a first overtone mean, and the first variance may include a first fundamental variance, or a first fundamental variance and a first overtone variance. Specifically, if the timbre parameters in the user speech information include pitch parameters, they include a first fundamental mean and a first fundamental variance determined from the fundamental frequency of the target time period in the user speech information; if the timbre parameters include both pitch parameters and overtone parameters, they include a first fundamental mean, a first fundamental variance, a first overtone mean, and a first overtone variance determined respectively from the fundamental frequency and the overtone frequency of the target time period in the user speech information.
Similarly, the timbre parameters of the first verification voice include a second mean and a second variance determined from the timbre frequency of the target time period of the first verification voice. That is, the timbre parameters of the first verification voice may include a second fundamental mean and a second fundamental variance determined from the fundamental frequency of the target time period of the first verification voice, or may include a second fundamental mean, a second fundamental variance, a second overtone mean, and a second overtone variance determined respectively from the fundamental frequency and the overtone frequency of that target time period.
Then, determining the conversion coefficient and the correction parameter according to the timbre parameters of the user speech information and of the first verification voice may be implemented as follows: determine the conversion coefficient based on the first variance, the second variance, and a preset conversion-coefficient determination rule; determine the correction parameter based on the first mean, the second mean, and a preset correction-parameter determination rule. The preset correction-parameter determination rule and the preset conversion-coefficient determination rule may be determined by the smart device according to the voice conversion method it uses.
For example, assume the voice conversion method used by the smart device is based on a Gaussian mixture model, in which the timbre frequency conversion rule is expressed by Formula 1. Taking timbre parameters that include pitch parameters as an example, the conversion-coefficient determination rule set by the smart device can be expressed as:

a = σt / σs, Formula 2;

where σt² denotes the first fundamental variance included in the pitch parameters of the user speech information, and σs² denotes the second fundamental variance included in the pitch parameters of the first verification voice. The correction-parameter determination rule set by the smart device can be the following formula:

b = μt − a · μs, Formula 3;

where b denotes the correction parameter, μt denotes the first fundamental mean included in the pitch parameters of the user speech information, and μs denotes the second fundamental mean included in the pitch parameters of the first verification voice.
When the timbre parameters include both pitch parameters and overtone parameters, the conversion-coefficient determination rule and the correction-parameter determination rule used by the smart device may be the same or different. That is, if the timbre parameters include overtone parameters, the smart device may also use Formula 2 to calculate the conversion coefficient and Formula 3 to calculate the correction parameter, in which case σt² denotes the first overtone variance included in the overtone parameters of the user speech information, σs² denotes the second overtone variance included in the overtone parameters of the first verification voice, μt denotes the first overtone mean included in the overtone parameters of the user speech information, and μs denotes the second overtone mean included in the overtone parameters of the first verification voice.
In one embodiment, determining the conversion coefficient based on the first variance, the second variance, and the preset conversion-coefficient determination rule works as follows: substitute the first variance and the second variance into the preset conversion-coefficient determination rule and calculate; the result is the conversion coefficient. Similarly, determining the correction parameter based on the first mean, the second mean, and the preset correction-parameter determination rule works as follows: substitute the first mean and the second mean into the preset correction-parameter determination rule and calculate; the result is the correction parameter.
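The conversion-coefficient and correction-parameter determination rules (Formulas 2 and 3) can be sketched as follows, assuming frame-wise F0 tracks as the pitch parameters; the sample values are illustrative, not from the patent:

```python
import statistics

def conversion_params(user_f0, verif_f0):
    """Formulas 2 and 3 of the GMM-style linear F0 conversion: the
    conversion coefficient is the ratio of standard deviations
    (a = sigma_t / sigma_s) and the correction parameter is
    b = mu_t - a * mu_s, where t denotes the user (target) voice and
    s the first verification (source) voice."""
    mu_t, sigma_t = statistics.fmean(user_f0), statistics.pstdev(user_f0)
    mu_s, sigma_s = statistics.fmean(verif_f0), statistics.pstdev(verif_f0)
    a = sigma_t / sigma_s
    b = mu_t - a * mu_s
    return a, b

# Illustrative frame-wise F0 tracks (Hz) for a user voice and a
# first verification voice over the same target time period.
a, b = conversion_params([210.0, 220.0, 230.0], [110.0, 120.0, 130.0])
print(a, b)  # 1.0 100.0
```

With these parameters, Formula 1 maps the verification voice's F0 range onto the user's: the spread is preserved (a = 1.0) and the whole track is shifted up by 100 Hz toward the user's mean.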
In one embodiment, after executing step S202 and before executing step S203, the smart device may also: perform similarity scoring on the user speech information based on the first verification voice to obtain a similarity score result; and, if the similarity score result satisfies a timbre adjustment condition, execute the step of determining the reference timbre frequency based on the timbre parameters of the user speech information and of the first verification voice. That is, after obtaining the user speech information and the first verification voice, the smart device may first determine the difference between the user speech information and the first verification voice, and judge from that difference whether timbre adjustment of the first verification voice is needed.
In one embodiment, performing similarity scoring on the user speech information based on the first verification voice may mean scoring the pronunciation similarity between the user speech information and the first verification voice. If the similarity score between the user speech information and the first verification voice is high, the user's pronunciation is highly similar to that of the first verification voice, i.e., the user's pronunciation is fairly accurate; the user may be able to correct the erroneous parts of his or her pronunciation from the first verification voice alone, so the smart device may skip step S203, saving power. If the similarity score between the user speech information and the first verification voice is low, the user's pronunciation differs substantially from that of the first verification voice, i.e., the user's pronunciation accuracy is low and the user may find it difficult to correct pronunciation errors from the first verification voice alone; in this case the smart device needs to execute step S203.
In other embodiments, performing similarity scoring on the user speech information based on the first verification voice may instead mean scoring the timbre similarity between the user speech information and the first verification voice. If the similarity score between the user speech information and the first verification voice is high, the user's timbre is close to that of the first verification voice; when the first verification voice is played the user hears something like his or her own voice and can practice speaking from the first verification voice directly, so the smart device may skip step S203. If the similarity score between the user speech information and the first verification voice is low, the user's timbre differs considerably from that of the first verification voice, and the smart device needs to execute step S203.
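The gating decision described above can be sketched as a simple threshold check; the threshold value and function name are illustrative assumptions:

```python
def needs_timbre_adjustment(similarity_score: float,
                            threshold: float = 0.6) -> bool:
    """Gate step S203: when the pronunciation or timbre similarity
    between the user speech and the first verification voice is already
    high, skip generating the second verification voice to save power."""
    return similarity_score < threshold

# Hypothetical scores: a close match skips S203, a distant one runs it.
print(needs_timbre_adjustment(0.85))  # False -> play first verification voice
print(needs_timbre_adjustment(0.30))  # True  -> execute steps S203/S204
```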
In one embodiment, after the reference timbre frequency is determined, in step S204 the smart device generates, based on the reference timbre frequency, a second verification voice matching the voice content included in the user speech information. The smart device may generate the second verification voice based on the first verification voice and the reference timbre frequency; step S204 may be implemented as follows: adjust the timbre frequency of the corresponding time period of the first verification voice based on the reference timbre frequency to obtain the second verification voice. Specifically, if the reference timbre frequency includes a reference fundamental frequency, the smart device adjusts the fundamental frequency in the corresponding time period of the first verification voice based on the reference fundamental frequency; if the reference timbre frequency includes a reference overtone frequency, the smart device adjusts the overtone frequency in the corresponding time period of the first verification voice based on the reference overtone frequency.
Factors such as the length, thickness, toughness, and stiffness of the vocal cords and pronunciation habits differ between users, so the pitch parameters of different users also differ. Likewise, the vocal cords of different sound-producing bodies differ, and the resulting vibrations of organs such as the mouth, throat, and lungs also differ, so the overtone parameters of different sound-producing bodies are not identical either. The reference timbre frequency is determined from the timbre parameters of the user speech information and therefore reflects the user's voice. Adjusting the timbre frequency of the corresponding time period of the first verification voice based on the reference timbre frequency can be understood as keeping the voice content of the first verification voice unchanged while shifting its timbre frequency to the reference timbre frequency, so that the voice content included in the first verification voice is played back in the user's voice (the user's timbre). This helps the user listen to a verification voice in his or her own timbre and study and imitate the standard pronunciation more intuitively.
In one embodiment, after generating the second verification voice, the smart device further plays the second verification voice so that the user can correct his or her speech based on it. The second verification voice generated by the smart device through steps S201-S204 is a verification voice with the same timbre as the user; by playing it, the user can hear how a voice in his or her own timbre should pronounce the content, which helps the user correct pronunciation and improves the efficiency of spoken-language practice. In one embodiment, before playing the second verification voice, the smart device may output play prompt information asking the user whether to play the second verification voice, and play the second verification voice when a user confirmation operation is detected; this lets the user choose whether to play it according to his or her own needs and improves the user experience.
In one embodiment, after generating the second verification voice matching the voice content included in the user speech information, the method further includes: obtaining difference information between the second verification voice and the user speech information; and generating correction prompt information based on the difference information so that the user can correct his or her speech based on the correction prompt information. The difference information may include pronunciation differences: the smart device compares and analyzes the pronunciation differences between the user speech information and the second verification voice and generates correction prompt information accordingly. The correction prompt information may include the word to be corrected, the user's pronunciation, and the verification pronunciation, where the word to be corrected is any word included in the voice content, the user's pronunciation is how the user pronounced the word to be corrected, and the verification pronunciation is the standard pronunciation of the word to be corrected.
For example, referring to Fig. 3, an embodiment of the present invention provides a schematic diagram of generating correction prompt information from the difference information between the second verification voice and the user speech information. As shown in Fig. 3, the smart device may display a spoken-language practice window through a display device; the user selects the voice content to practice through the prompts in the spoken-language practice window and inputs user speech information 301 as prompted. After obtaining the user speech information, the smart device generates the second verification voice 302 through steps S201-S204 and may show the second verification voice 302 to the user so that the user can choose whether to play it. In addition, by analyzing the difference information between the second verification voice and the user speech information, the smart device generates correction prompt information.
Assume the voice content selected by the user is "Are you staying at home or going out this weekend?". When analyzing the second verification voice and the user speech, the smart device finds that in the user speech information 301 the letter t in the word "staying" is pronounced as the voiceless consonant /t/; since /t/ after /s/ in a stressed syllable should be voiced to the voiced consonant /d/, in the second verification voice 302 the letter t in "staying" is pronounced as /d/. Based on this difference information, the correction prompt information 303 generated by the smart device may be: correct the pronunciation of "t" in "staying" from /t/ to /d/.
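The structure of a correction prompt (word to be corrected, user pronunciation, verification pronunciation) can be sketched as below, reusing the "staying" example from Fig. 3; the class and method names are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class CorrectionPrompt:
    word: str            # the word to be corrected
    user_phoneme: str    # how the user actually pronounced it
    target_phoneme: str  # the standard (verification) pronunciation

    def message(self) -> str:
        """Render the prompt shown to the user in the practice window."""
        return (f'{self.word} "{self.user_phoneme}" should be '
                f'corrected to "{self.target_phoneme}"')

# The Fig. 3 example: /t/ after /s/ in a stressed syllable is voiced,
# so the prompt asks the user to correct /t/ to /d/ in "staying".
prompt = CorrectionPrompt(word="staying", user_phoneme="/t/", target_phoneme="/d/")
print(prompt.message())  # staying "/t/" should be corrected to "/d/"
```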
In one embodiment, after generating the second verification voice matching the voice content included in the user speech information based on the reference timbre frequency, the method further includes: generating a user speech curve from the user speech information; generating a verification voice curve from the second verification voice; and displaying the user speech curve and the verification voice curve on a user interface so that the user can correct his or her speech based on the user speech curve and the verification voice curve. The smart device may plot the user speech curve and the verification voice curve from the user speech information and the second verification voice respectively, and by comparing the two curves the user can find the problems in his or her reading.
For example, following the assumptions in Fig. 3, after the smart device generates the second verification voice through steps S201-S204, it generates a user speech curve based on the user speech information and a verification voice curve based on the second verification voice and displays them on the user interface. The user can intuitively see from the curves the differences between his or her reading and the verification voice, which helps the user correct pronunciation.
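Building a displayable curve from frame-wise F0 values can be sketched as follows; the hop size, the treatment of unvoiced frames, and the sample F0 tracks are illustrative assumptions:

```python
import numpy as np

def pitch_curve(f0_frames, hop_s=0.01):
    """Build a displayable (time, F0) curve from frame-wise F0 values;
    unvoiced frames (F0 == 0) become NaN so the UI draws a gap there."""
    t = np.arange(len(f0_frames)) * hop_s
    f0 = np.array(f0_frames, dtype=float)
    f0[f0 == 0] = np.nan
    return t, f0

# Hypothetical frame F0 tracks (Hz) for the user speech and the second
# verification voice; the user interface would overlay the two curves.
user_t, user_f0 = pitch_curve([220, 230, 0, 210])
verif_t, verif_f0 = pitch_curve([215, 228, 0, 212])
print(np.isnan(user_f0).tolist())  # [False, False, True, False]
```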
In conclusion the embodiment of the present invention get user speech information and its it is corresponding first verification voice after,
It can be determined according to the tamber parameter in the tamber parameter and the first verification voice in user speech information and refer to tone color frequency, into one
Step, the second verification voice to match with user speech information is generated with reference to tone color frequency based on this, it is approximate to realize generation
The verification voice of user's tone color improves the efficiency of user's spoken language exercise in order to which user accurately corrects incorrect pronunciations.
Based on the description of the above method embodiments, in one embodiment, an embodiment of the present invention further provides a speech processing apparatus whose structure is shown in Fig. 4. As shown in Fig. 4, the speech processing apparatus in the embodiment of the present invention includes an acquiring unit 401 and a processing unit 402; in the embodiments of the present invention, the speech processing apparatus may be arranged in a smart device that needs to process speech.
In one embodiment, the acquiring unit 401 is configured to: obtain user speech information, and obtain the timbre parameters in the user speech information. The processing unit 402 is configured to: find a first verification voice matching the voice content included in the user speech information. The acquiring unit 401 is further configured to: obtain the timbre parameters of the first verification voice. The processing unit 402 is further configured to: determine a reference timbre frequency based on the timbre parameters of the user speech information and of the first verification voice. The processing unit 402 is further configured to: generate, based on the reference timbre frequency, a second verification voice matching the voice content included in the user speech information.
In one embodiment, the timbre parameters include pitch parameters and overtone parameters, and the reference timbre frequency includes a reference fundamental frequency and a reference overtone frequency.
In one embodiment, the processing unit 402 is further configured to: perform similarity scoring on the user speech information based on the first verification voice to obtain a similarity score result; and, if the similarity score result satisfies the timbre adjustment condition, determine the reference timbre frequency based on the timbre parameters of the user speech information and of the first verification voice.
In one embodiment, the processing unit 402 determines the reference timbre frequency based on the timbre parameters of the user speech information and of the first verification voice as follows: determine the timbre frequency of the first verification voice; determine a conversion coefficient and a correction parameter according to the timbre parameters of the user speech information and of the first verification voice; and determine the reference timbre frequency according to the timbre frequency of the first verification voice, the conversion coefficient, the correction parameter, and the timbre frequency conversion rule.
In one embodiment, the timbre parameters in the user speech information include a first mean and a first variance determined from the timbre frequency of a target time period in the user speech information, and the timbre parameters of the first verification voice include a second mean and a second variance determined from the timbre frequency of the target time period of the first verification voice. The processing unit 402 determines the conversion coefficient and the correction parameter according to the timbre parameters of the user speech information and of the first verification voice as follows: determine the conversion coefficient based on the first variance, the second variance, and the preset conversion-coefficient determination rule; determine the correction parameter based on the first mean, the second mean, and the preset correction-parameter determination rule.
In one embodiment, the processing unit 402 finds the first verification voice matching the voice content included in the user speech information as follows: obtain the voice content included in the user speech information; search a first-verification-voice set, according to that voice content, for a target first verification voice whose included voice content matches the voice content included in the user speech information; and use the target first verification voice as the first verification voice matching the voice content included in the user speech information.
In one embodiment, the processing unit 402 is further configured to play the second verification voice so that the user can correct his or her speech based on the second verification voice.
In one embodiment, the processing unit 402 is further configured to: obtain the difference information between the second verification voice and the user speech information; and generate correction prompt information based on the difference information so that the user can correct his or her speech based on the correction prompt information.
In one embodiment, the processing unit 402 is further configured to: generate a user speech curve from the user speech information; generate a verification voice curve from the second verification voice; and display the user speech curve and the verification voice curve on a user interface so that the user can correct his or her speech based on the user speech curve and the verification voice curve.
In one embodiment, the processing unit 402 generates the second verification voice matching the voice content included in the user speech information based on the reference timbre frequency as follows: adjust the timbre frequency of the corresponding time period of the first verification voice based on the reference timbre frequency to obtain the second verification voice.
In the embodiment of the present invention, the acquiring unit 401 obtains the user speech information and the timbre parameters it includes; the processing unit 402 finds the first verification voice corresponding to the user speech information and the timbre parameters it includes; further, the processing unit 402 determines a reference timbre frequency based on the timbre parameters included in the user speech information and in the first verification voice, and then generates, based on the reference timbre frequency, a second verification voice matching the user speech information. This realizes generating a verification voice approximating the user's timbre, helping the user correct wrong pronunciations accurately and improving the efficiency of the user's spoken-language practice.
Referring to Fig. 5, a schematic structural diagram of a smart device provided by an embodiment of the present invention. The smart device shown in Fig. 5 includes one or more processors 501 and one or more memories 502; the processor 501 and the memory 502 are connected through a bus 503. The memory 502 is used to store a computer program including program instructions, and the processor 501 is used to execute the program instructions stored in the memory 502.
The memory 502 may include volatile memory, such as random-access memory (RAM); the memory 502 may also include non-volatile memory, such as flash memory or a solid-state drive (SSD); the memory 502 may also include a combination of the above kinds of memory.
The processor 501 may be a central processing unit (CPU). The processor 501 may further include a hardware chip, which may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or the like; the PLD may be a field-programmable gate array (FPGA), generic array logic (GAL), or the like. The processor 501 may also be a combination of the above structures.
In the embodiment of the present invention, the memory 502 is used to store a computer program including program instructions, and the processor 501 is used to execute the program instructions stored in the memory 502 to realize the steps of the related methods in the above speech processing method embodiments.
In one embodiment, the processor 501 is configured to call the program instructions to: obtain user speech information, and obtain timbre parameters in the user speech information; look up a first verification voice matching the speech content included in the user speech information, and obtain timbre parameters of the first verification voice; determine a reference timbre frequency based on the timbre parameters in the user speech information and the timbre parameters of the first verification voice; and generate, based on the reference timbre frequency, a second verification voice matching the speech content included in the user speech information.
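Read together, the four configured operations form a pipeline. The sketch below is a hypothetical end-to-end skeleton with stubbed stages; none of the function names or bodies are prescribed by the patent, and the stubs only stand in for the real signal processing:

```python
def process_user_speech(user_audio, verification_set):
    """Skeleton of the claimed flow: extract timbre, match content,
    derive a reference timbre frequency, synthesize the second voice.
    Each stage is an illustrative stub, not the patented algorithm."""
    user_timbre = extract_timbre(user_audio)                     # step 1
    first_voice = match_content(user_audio, verification_set)    # step 2
    verif_timbre = extract_timbre(first_voice)
    ref_freq = derive_reference_freq(user_timbre, verif_timbre)  # step 3
    return synthesize(first_voice, ref_freq)                     # step 4

# Illustrative stub implementations so the skeleton runs end to end.
def extract_timbre(audio):
    return {"mean_f0": sum(audio) / len(audio)}

def match_content(audio, vset):
    return vset["matched"]

def derive_reference_freq(u, v):
    return (u["mean_f0"] + v["mean_f0"]) / 2

def synthesize(voice, ref_freq):
    return {"source": voice, "f0": ref_freq}
```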
In one embodiment, the timbre parameters include a fundamental-frequency (pitch) parameter and an overtone parameter, and the reference timbre frequency includes a reference fundamental frequency and a reference overtone frequency.
In one embodiment, before the processor 501 determines the reference timbre frequency based on the timbre parameters in the user speech information and the timbre parameters of the first verification voice, the processor 501 is further configured to call the program instructions to: score the similarity of the user speech information against the first verification voice to obtain a similarity score result; and if the similarity score result meets a timbre adjustment condition, execute the step of determining the reference timbre frequency based on the timbre parameters in the user speech information and the timbre parameters of the first verification voice.
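A minimal sketch of this gating logic, assuming a similarity score in [0, 1] and a hypothetical threshold of 0.6 as the timbre adjustment condition. The patent does not specify a concrete scoring function; text similarity on recognized transcripts is an illustrative stand-in for acoustic comparison:

```python
from difflib import SequenceMatcher

def meets_adjustment_condition(user_text: str, verification_text: str,
                               threshold: float = 0.6) -> bool:
    """Gate the timbre conversion on a similarity score result.

    SequenceMatcher on recognized text is a stand-in scorer; a real
    system would compare acoustic features, not transcripts.
    """
    score = SequenceMatcher(None, user_text, verification_text).ratio()
    return score >= threshold
```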
In one embodiment, when the processor 501 determines the reference timbre frequency based on the timbre parameters in the user speech information and the timbre parameters of the first verification voice, the processor 501 is specifically configured to: determine the timbre frequency of the first verification voice; determine a conversion coefficient and a correction parameter according to the timbre parameters in the user speech information and the timbre parameters of the first verification voice; and determine the reference timbre frequency according to the timbre frequency of the first verification voice, the conversion coefficient, the correction parameter, and a timbre frequency conversion rule.
In one embodiment, the timbre parameters in the user speech information include a first mean and a first variance determined from the timbre frequency of a target time period in the user speech information, and the timbre parameters of the first verification voice include a second mean and a second variance determined from the timbre frequency of the target time period of the first verification voice. When the processor 501 determines the conversion coefficient and the correction parameter according to the timbre parameters in the user speech information and the timbre parameters of the first verification voice, the processor 501 is specifically configured to: determine the conversion coefficient based on the first variance, the second variance, and a preset conversion-coefficient determination rule; and determine the correction parameter based on the first mean, the second mean, and a preset correction-parameter determination rule.
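One common instantiation of such mean/variance rules is Gaussian normalization of the fundamental-frequency contour. This is a hedged illustration, since the patent does not fix the exact determination rules: here the conversion coefficient is assumed to be the ratio of standard deviations and the correction parameter an offset that aligns the means, which is one plausible reading of this embodiment, not the only one.

```python
import statistics

def conversion_params(user_f0, verif_f0):
    """Derive a scale (conversion coefficient) and offset (correction
    parameter) mapping verification-voice F0 toward the user's range.

    Assumes the classic Gaussian-normalization rule:
        f_ref = a * f_verif + b
        a = sigma_user / sigma_verif
        b = mu_user - a * mu_verif
    """
    mu_u, sd_u = statistics.mean(user_f0), statistics.pstdev(user_f0)
    mu_v, sd_v = statistics.mean(verif_f0), statistics.pstdev(verif_f0)
    a = sd_u / sd_v            # conversion coefficient (variance rule)
    b = mu_u - a * mu_v        # correction parameter (mean rule)
    return a, b

def to_reference_frequency(f, a, b):
    """Apply the timbre frequency conversion rule to one frequency."""
    return a * f + b
```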
In one embodiment, when the processor 501 looks up the first verification voice matching the speech content included in the user speech information, the processor 501 is specifically configured to: obtain the speech content included in the user speech information; look up, from a first-verification-voice set according to the speech content included in the user speech information, a target first verification voice whose speech content matches the speech content included in the user speech information; and take the target first verification voice as the first verification voice matching the speech content included in the user speech information.
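A minimal sketch of the lookup, assuming the first-verification-voice set is keyed by recognized text. The patent does not specify the matching mechanism; exact matching on normalized transcripts is an illustrative simplification:

```python
def lookup_first_voice(user_text, verification_set):
    """Return the target first verification voice whose speech content
    matches the user's recognized speech content, or None if no entry
    in the set matches.

    verification_set: dict mapping transcript -> audio reference.
    """
    normalized = user_text.strip().lower()
    return verification_set.get(normalized)
```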
In one embodiment, after the processor 501 generates, based on the reference timbre frequency, the second verification voice matching the speech content included in the user speech information, the processor 501 is further configured to call the program instructions to: play the second verification voice, so that the user can correct the user speech based on the second verification voice.
In one embodiment, after the processor 501 generates, based on the reference timbre frequency, the second verification voice matching the speech content included in the user speech information, the processor 501 is further configured to call the program instructions to: obtain difference information between the second verification voice and the user speech information; and generate a correction prompt based on the difference information, so that the user can correct the user speech based on the correction prompt.
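Frame-wise pitch differences are one plausible form of such difference information. The sketch below is illustrative only: the tolerance value and the prompt wording are invented, and a real system might also compare duration, energy, or formants.

```python
def correction_prompts(user_f0, verif_f0, tolerance_hz=15.0):
    """Compare per-frame F0 contours and emit human-readable prompts.

    Frames where the user deviates by more than `tolerance_hz` from the
    second verification voice produce a 'raise'/'lower' pitch hint.
    """
    prompts = []
    for i, (u, v) in enumerate(zip(user_f0, verif_f0)):
        diff = u - v
        if abs(diff) > tolerance_hz:
            direction = "lower" if diff > 0 else "raise"
            prompts.append(f"frame {i}: {direction} pitch by {abs(diff):.0f} Hz")
    return prompts
```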
In one embodiment, after the processor 501 generates, based on the reference timbre frequency, the second verification voice matching the speech content included in the user speech information, the processor 501 is further configured to call the program instructions to: generate a user speech curve according to the user speech information; generate a verification voice curve according to the second verification voice; and display the user speech curve and the verification voice curve on a user interface, so that the user can correct the user speech based on the user speech curve and the verification voice curve.
In one embodiment, when the processor 501 generates, based on the reference timbre frequency, the second verification voice matching the speech content included in the user speech information, the processor 501 is specifically configured to: adjust the timbre frequency of the corresponding time period of the first verification voice based on the reference timbre frequency, to obtain the second verification voice.
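As an illustrative sketch of substituting the reference timbre frequency into a time period: a real system would use pitch-synchronous overlap-add or phase-vocoder processing on the first verification voice, whereas the pure-tone model below (duration, sample rate, and function name all invented) only demonstrates regenerating a segment at the target frequency.

```python
import math

def resynthesize_segment(reference_freq, duration_s=0.1, sample_rate=16000):
    """Regenerate one time period as a pure tone at the reference
    timbre frequency -- a toy stand-in for real pitch modification
    of the first verification voice."""
    n = int(duration_s * sample_rate)
    return [math.sin(2 * math.pi * reference_freq * i / sample_rate)
            for i in range(n)]
```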
Those of ordinary skill in the art will appreciate that all or part of the processes in the above embodiment methods can be implemented by a computer program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, may include the processes of the above method embodiments. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.
The above disclosure is only a part of the embodiments of the present invention and certainly cannot be used to limit the scope of the claims of the present invention. Therefore, equivalent changes made in accordance with the claims of the present invention still fall within the scope of the present invention.
Claims (13)
1. A speech processing method, comprising:
obtaining user speech information, and obtaining timbre parameters in the user speech information;
looking up a first verification voice matching the speech content included in the user speech information, and obtaining timbre parameters of the first verification voice;
determining a reference timbre frequency based on the timbre parameters in the user speech information and the timbre parameters of the first verification voice; and
generating, based on the reference timbre frequency, a second verification voice matching the speech content included in the user speech information.
2. The method of claim 1, wherein the timbre parameters include a fundamental-frequency (pitch) parameter and an overtone parameter, and the reference timbre frequency includes a reference fundamental frequency and a reference overtone frequency.
3. The method of claim 1, wherein before determining the reference timbre frequency based on the timbre parameters in the user speech information and the timbre parameters of the first verification voice, the method further comprises:
scoring the similarity of the user speech information against the first verification voice to obtain a similarity score result; and
if the similarity score result meets a timbre adjustment condition, triggering execution of the step of determining the reference timbre frequency based on the timbre parameters in the user speech information and the timbre parameters of the first verification voice.
4. The method of claim 1, wherein determining the reference timbre frequency based on the timbre parameters in the user speech information and the timbre parameters of the first verification voice comprises:
determining the timbre frequency of the first verification voice;
determining a conversion coefficient and a correction parameter according to the timbre parameters in the user speech information and the timbre parameters of the first verification voice; and
determining the reference timbre frequency according to the timbre frequency of the first verification voice, the conversion coefficient, the correction parameter, and a timbre frequency conversion rule.
5. The method of claim 4, wherein the timbre parameters in the user speech information include a first mean and a first variance determined from the timbre frequency of a target time period in the user speech information, and the timbre parameters of the first verification voice include a second mean and a second variance determined from the timbre frequency of the target time period of the first verification voice; and
wherein determining the conversion coefficient and the correction parameter according to the timbre parameters in the user speech information and the timbre parameters of the first verification voice comprises:
determining the conversion coefficient based on the first variance, the second variance, and a preset conversion-coefficient determination rule; and
determining the correction parameter based on the first mean, the second mean, and a preset correction-parameter determination rule.
6. The method of claim 1, wherein looking up the first verification voice matching the speech content included in the user speech information comprises:
obtaining the speech content included in the user speech information;
looking up, from a first-verification-voice set according to the speech content included in the user speech information, a target first verification voice whose speech content matches the speech content included in the user speech information; and
taking the target first verification voice as the first verification voice matching the speech content included in the user speech information.
7. The method of claim 1, wherein after generating, based on the reference timbre frequency, the second verification voice matching the speech content included in the user speech information, the method further comprises:
playing the second verification voice, so that the user can correct the user speech based on the second verification voice.
8. The method of claim 1, wherein after generating, based on the reference timbre frequency, the second verification voice matching the speech content included in the user speech information, the method further comprises:
obtaining difference information between the second verification voice and the user speech information; and
generating a correction prompt based on the difference information, so that the user can correct the user speech based on the correction prompt.
9. The method of claim 1, wherein after generating, based on the reference timbre frequency, the second verification voice matching the speech content included in the user speech information, the method further comprises:
generating a user speech curve according to the user speech information;
generating a verification voice curve according to the second verification voice; and
displaying the user speech curve and the verification voice curve on a user interface, so that the user can correct the user speech based on the user speech curve and the verification voice curve.
10. The method of claim 1, wherein generating, based on the reference timbre frequency, the second verification voice matching the speech content included in the user speech information comprises:
adjusting the timbre frequency of the corresponding time period of the first verification voice based on the reference timbre frequency, to obtain the second verification voice.
11. A speech processing apparatus, comprising:
an acquiring unit, configured to obtain user speech information and obtain timbre parameters in the user speech information; and
a processing unit, configured to look up a first verification voice matching the speech content included in the user speech information;
wherein the acquiring unit is further configured to obtain timbre parameters of the first verification voice;
the processing unit is further configured to determine a reference timbre frequency based on the timbre parameters in the user speech information and the timbre parameters of the first verification voice; and
the processing unit is further configured to generate, based on the reference timbre frequency, a second verification voice matching the speech content included in the user speech information.
12. A smart device, comprising a processor and a memory, wherein the memory is configured to store a computer program comprising program instructions, and the processor is configured to call the program instructions to execute the speech processing method of any one of claims 1-10.
13. A computer storage medium storing computer program instructions that, when executed by a processor, cause the processor to perform the speech processing method of any one of claims 1-10.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811452305.0A CN110164414B (en) | 2018-11-30 | 2018-11-30 | Voice processing method and device and intelligent equipment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811452305.0A CN110164414B (en) | 2018-11-30 | 2018-11-30 | Voice processing method and device and intelligent equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110164414A true CN110164414A (en) | 2019-08-23 |
CN110164414B CN110164414B (en) | 2023-02-14 |
Family
ID=67645231
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811452305.0A Active CN110164414B (en) | 2018-11-30 | 2018-11-30 | Voice processing method and device and intelligent equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110164414B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111916106A (en) * | 2020-08-17 | 2020-11-10 | 牡丹江医学院 | Method for improving pronunciation quality in English teaching |
Citations (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
TWI277947B (en) * | 2005-09-14 | 2007-04-01 | Delta Electronics Inc | Interactive speech correcting method |
CN101359473A (en) * | 2007-07-30 | 2009-02-04 | 国际商业机器公司 | Auto speech conversion method and apparatus |
CN103531205A (en) * | 2013-10-09 | 2014-01-22 | 常州工学院 | Asymmetrical voice conversion method based on deep neural network feature mapping |
CN103594087A (en) * | 2013-11-08 | 2014-02-19 | 安徽科大讯飞信息科技股份有限公司 | Method and system for improving oral evaluation performance |
US20140170613A1 (en) * | 2011-05-10 | 2014-06-19 | Cooori Ehf | Language Learning System Adapted to Personalize Language Learning to Individual Users |
CN103886859A (en) * | 2014-02-14 | 2014-06-25 | 河海大学常州校区 | Voice conversion method based on one-to-many codebook mapping |
CN104123933A (en) * | 2014-08-01 | 2014-10-29 | 中国科学院自动化研究所 | Self-adaptive non-parallel training based voice conversion method |
CN105786801A (en) * | 2014-12-22 | 2016-07-20 | 中兴通讯股份有限公司 | Speech translation method, communication method and related device |
CN105845125A (en) * | 2016-05-18 | 2016-08-10 | 百度在线网络技术(北京)有限公司 | Speech synthesis method and speech synthesis device |
CN106128477A (en) * | 2016-06-23 | 2016-11-16 | 南阳理工学院 | A kind of spoken identification correction system |
CN107424450A (en) * | 2017-08-07 | 2017-12-01 | 英华达(南京)科技有限公司 | Pronunciation correction system and method |
CN108198566A (en) * | 2018-01-24 | 2018-06-22 | 咪咕文化科技有限公司 | Information processing method and device, electronic device and storage medium |
CN108510995A (en) * | 2018-02-06 | 2018-09-07 | 杭州电子科技大学 | Identity information hidden method towards voice communication |
CN108806719A (en) * | 2018-06-19 | 2018-11-13 | 合肥凌极西雅电子科技有限公司 | Interacting language learning system and its method |
Patent Citations (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
TWI277947B (en) * | 2005-09-14 | 2007-04-01 | Delta Electronics Inc | Interactive speech correcting method |
CN101359473A (en) * | 2007-07-30 | 2009-02-04 | 国际商业机器公司 | Auto speech conversion method and apparatus |
US20140170613A1 (en) * | 2011-05-10 | 2014-06-19 | Cooori Ehf | Language Learning System Adapted to Personalize Language Learning to Individual Users |
CN103531205A (en) * | 2013-10-09 | 2014-01-22 | 常州工学院 | Asymmetrical voice conversion method based on deep neural network feature mapping |
CN103594087A (en) * | 2013-11-08 | 2014-02-19 | 安徽科大讯飞信息科技股份有限公司 | Method and system for improving oral evaluation performance |
CN103886859A (en) * | 2014-02-14 | 2014-06-25 | 河海大学常州校区 | Voice conversion method based on one-to-many codebook mapping |
CN104123933A (en) * | 2014-08-01 | 2014-10-29 | 中国科学院自动化研究所 | Self-adaptive non-parallel training based voice conversion method |
CN105786801A (en) * | 2014-12-22 | 2016-07-20 | 中兴通讯股份有限公司 | Speech translation method, communication method and related device |
CN105845125A (en) * | 2016-05-18 | 2016-08-10 | 百度在线网络技术(北京)有限公司 | Speech synthesis method and speech synthesis device |
WO2017197809A1 (en) * | 2016-05-18 | 2017-11-23 | 百度在线网络技术(北京)有限公司 | Speech synthesis method and speech synthesis device |
CN106128477A (en) * | 2016-06-23 | 2016-11-16 | 南阳理工学院 | A kind of spoken identification correction system |
CN107424450A (en) * | 2017-08-07 | 2017-12-01 | 英华达(南京)科技有限公司 | Pronunciation correction system and method |
CN108198566A (en) * | 2018-01-24 | 2018-06-22 | 咪咕文化科技有限公司 | Information processing method and device, electronic device and storage medium |
CN108510995A (en) * | 2018-02-06 | 2018-09-07 | 杭州电子科技大学 | Identity information hidden method towards voice communication |
CN108806719A (en) * | 2018-06-19 | 2018-11-13 | 合肥凌极西雅电子科技有限公司 | Interacting language learning system and its method |
Non-Patent Citations (2)
Title |
---|
HSIEN-CHENG LIAO et al.: "pronouciation correction tone", SYSTEM *
ZHAO Bo et al.: "English spoken-language teaching system based on speech recognition technology", 《计算机应用》 (Journal of Computer Applications) *
Also Published As
Publication number | Publication date |
---|---|
CN110164414B (en) | 2023-02-14 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109949783B (en) | Song synthesis method and system | |
CN110148427A (en) | Audio-frequency processing method, device, system, storage medium, terminal and server | |
CN106531185B (en) | voice evaluation method and system based on voice similarity | |
US8392190B2 (en) | Systems and methods for assessment of non-native spontaneous speech | |
US7299188B2 (en) | Method and apparatus for providing an interactive language tutor | |
US20190130894A1 (en) | Text-based insertion and replacement in audio narration | |
US10134300B2 (en) | System and method for computer-assisted instruction of a music language | |
US11335324B2 (en) | Synthesized data augmentation using voice conversion and speech recognition models | |
US8447603B2 (en) | Rating speech naturalness of speech utterances based on a plurality of human testers | |
US20110123965A1 (en) | Speech Processing and Learning | |
Peabody | Methods for pronunciation assessment in computer aided language learning | |
CN104537926B (en) | Listen barrier childrenese training accessory system and method | |
Ahsiah et al. | Tajweed checking system to support recitation | |
CN110246489A (en) | Audio recognition method and system for children | |
CN110598208A (en) | AI/ML enhanced pronunciation course design and personalized exercise planning method | |
CN110164414A (en) | Method of speech processing, device and smart machine | |
JP2017167526A (en) | Multiple stream spectrum expression for synthesis of statistical parametric voice | |
Badenhorst et al. | The limitations of data perturbation for ASR of learner data in under-resourced languages | |
van Doremalen | Developing automatic speech recognition-enabled language learning applications: from theory to practice | |
Mendes et al. | Speaker identification using phonetic segmentation and normalized relative delays of source harmonics | |
CN111179902B (en) | Speech synthesis method, equipment and medium for simulating resonance cavity based on Gaussian model | |
CN112967538B (en) | English pronunciation information acquisition system | |
Raitio | Voice source modelling techniques for statistical parametric speech synthesis | |
Oyo et al. | A preliminary speech learning tool for improvement of African English accents | |
Danis | Developing successful speakers for an automatic speech recognition system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||