CN104361895A - Voice quality evaluation equipment, method and system - Google Patents


Info

Publication number
CN104361895A
Authority
CN
China
Prior art keywords
voice
user
rhythm
rhythm characteristic
statement
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410734839.8A
Other languages
Chinese (zh)
Other versions
CN104361895B (en)
Inventor
林晖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SHANGHAI LIULISHUO INFORMATION TECHNOLOGY Co Ltd
Original Assignee
SHANGHAI LIULISHUO INFORMATION TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SHANGHAI LIULISHUO INFORMATION TECHNOLOGY Co Ltd
Priority to CN201410734839.8A
Publication of CN104361895A
Application granted
Publication of CN104361895B
Legal status: Active
Anticipated expiration


Abstract

The invention provides a rhythm-based voice quality evaluation device, method and system, a data processing device and method, a speech processing device and method, and a mobile terminal, in order to overcome the defect that existing voice technology does not take voice rhythm information into account when evaluating a user's pronunciation. The voice quality evaluation device comprises a storage unit, a user voice receiving unit, a feature acquiring unit and a voice quality calculating unit. The storage unit is adapted to store a predetermined text and a reference rhythm feature corresponding to the predetermined text, the predetermined text comprising one or more sentences and each sentence comprising one or more words; the user voice receiving unit is adapted to receive the user voice recorded by the user according to the predetermined text; the feature acquiring unit is adapted to acquire a user rhythm feature of the user voice; and the voice quality calculating unit is adapted to calculate the quality of the user voice based on the correlation between the reference rhythm feature and the user rhythm feature. The technology provided by the invention can be applied in the technical field of voice.

Description

Voice quality assessment equipment, method and system
Technical field
The present invention relates to the field of voice technology, and in particular to a rhythm-based voice quality assessment device, method and system, a data processing device and method, a speech processing device and method, and a mobile terminal.
Background art
With the development of the internet, internet-based language learning applications have also developed rapidly. In some language learning applications, the application provider delivers learning material to a client over the internet; the user obtains the learning material via the client, operates on the client according to the instructions in the material (for example by entering text, entering speech or making selections), and receives feedback, thereby improving his or her language ability.
In language learning, apart from grammar and vocabulary, an important aspect is the ability to listen to and, especially, to speak the language. Every language has its own speaking rhythm in different situations. In general, when people speak they pause appropriately after certain words in a sentence; the speaking rhythm determines after which words a pause should occur and how long the pause should last. In addition, when a word has more than one syllable, there are also certain pauses between syllables in its pronunciation. Therefore, when learning to speak a language, the user also needs to learn this speaking rhythm and/or pronunciation rhythm.
In existing voice technology, the user records speech through the recording device of a client; the system segments the recorded speech according to the text corresponding to the speech and merely compares the user's speech with an existing acoustic model, so as to give the user word-by-word feedback on whether each word is pronounced correctly. However, existing voice technology does not consider any voice rhythm information when evaluating the user's pronunciation, so it cannot help the learner acquire the speaking rhythm and/or pronunciation rhythm.
Summary of the invention
A brief summary of the present invention is given below in order to provide a basic understanding of some aspects of the invention. It should be understood that this summary is not an exhaustive overview of the invention. It is not intended to identify key or critical parts of the invention, nor to limit its scope. Its sole purpose is to present some concepts in a simplified form as a prelude to the more detailed description that follows.
In view of this, the present invention provides a rhythm-based voice quality assessment device, method and system, a data processing device and method, a speech processing device and method, and a mobile terminal, at least to solve the problem that existing voice technology does not consider voice rhythm information when evaluating a user's pronunciation.
According to an aspect of the present invention, a rhythm-based voice quality assessment device is provided. The device comprises: a storage unit adapted to store a predetermined text and a reference rhythm feature corresponding to the predetermined text, the predetermined text comprising one or more sentences and each sentence comprising one or more words; a user speech receiving unit adapted to receive the user speech recorded by the user for the predetermined text; a feature acquiring unit adapted to acquire a user rhythm feature of the user speech; and a voice quality calculating unit adapted to calculate the voice quality of the user speech based on the correlation between the reference rhythm feature and the user rhythm feature.
According to another aspect of the present invention, a data processing device is also provided. The device is adapted to be executed in a server and comprises: a server storage unit adapted to store a predetermined text and at least one segment of reference speech corresponding to the predetermined text; and a rhythm calculation unit adapted to calculate the rhythm information of the at least one segment of reference speech and save this rhythm information in the server storage unit, or to calculate the reference rhythm feature of the at least one segment of reference speech from this rhythm information and save the reference rhythm feature in the server storage unit.
According to another aspect of the present invention, a speech processing device is also provided. The device is adapted to be executed in a computer and comprises: a reference speech receiving unit adapted to receive, as reference speech, the speech recorded by a specific user for the predetermined text; and a rhythm calculation unit adapted to calculate the rhythm information of the reference speech from the reference speech, so that this rhythm information and the predetermined text are sent to a given server in association with each other, or to calculate the reference rhythm feature of the reference speech from this rhythm information, so that the reference rhythm feature and the predetermined text are sent to the given server in association with each other.
According to another aspect of the present invention, a rhythm-based voice quality assessment method is also provided. The method comprises the steps of: receiving the user speech recorded by the user for a predetermined text, the predetermined text comprising one or more sentences and each sentence comprising one or more words; acquiring a user rhythm feature of the user speech; and calculating the voice quality of the user speech based on the correlation between the reference rhythm feature corresponding to the predetermined text and the user rhythm feature.
According to another aspect of the present invention, a data processing method is also provided. The method is adapted to be executed in a server and comprises the steps of: storing a predetermined text and at least one segment of reference speech corresponding to the predetermined text; and calculating the rhythm information of the at least one segment of reference speech and saving this rhythm information, or calculating the reference rhythm feature of the at least one segment of reference speech from this rhythm information and saving the reference rhythm feature.
According to another aspect of the present invention, a speech processing method is also provided. The method is adapted to be executed in a computer and comprises the steps of: receiving, as reference speech, the speech recorded by a specific user for the predetermined text; and calculating the rhythm information of the reference speech from the reference speech, so that this rhythm information and the predetermined text are sent to a given server in association with each other, or calculating the reference rhythm feature of the reference speech from this rhythm information, so that the reference rhythm feature and the predetermined text are sent to the given server in association with each other.
According to another aspect of the present invention, a mobile terminal is also provided, comprising the rhythm-based voice quality assessment device described above.
According to a further aspect of the present invention, a rhythm-based voice quality assessment system is also provided, comprising the rhythm-based voice quality assessment device described above and the data processing device described above.
The rhythm-based voice quality assessment scheme according to the embodiments of the present invention calculates the voice quality of the user speech based on the correlation between the acquired user rhythm feature of the user speech and the reference rhythm feature, and can thereby obtain at least one of the following benefits: voice rhythm information is taken into account when calculating the voice quality of the user speech, so that the user can learn from the result how accurate the rhythm of the recorded speech is, which helps the user decide whether his or her speaking rhythm and/or pronunciation rhythm needs to be corrected; the calculation and evaluation of the user speech are completed on the client computer or client mobile terminal, so that the user can learn offline; the amount of computation is small; time is saved; the operation is simpler and more convenient; and when the representation of the user rhythm feature changes, the reference rhythm feature calculated from the rhythm information of the reference speech can easily be expressed in the same form as the user rhythm feature, which makes the processing of the voice quality assessment device more flexible, convenient and practical.
These and other advantages of the present invention will become more apparent from the following detailed description of the preferred embodiments of the present invention in conjunction with the accompanying drawings.
Brief description of the drawings
The present invention can be better understood by referring to the description given below in conjunction with the accompanying drawings, in which the same or similar reference numerals are used throughout the figures to denote identical or similar parts. The accompanying drawings, together with the following detailed description, are included in and form part of this specification, and serve to further illustrate the preferred embodiments of the present invention and to explain the principles and advantages of the present invention. In the drawings:
Fig. 1 is a block diagram schematically showing the structure of a mobile terminal 100;
Fig. 2 is a block diagram schematically showing an example structure of a rhythm-based voice quality assessment device 200 according to an embodiment of the present invention;
Fig. 3 is a block diagram schematically showing a possible structure of the feature acquiring unit 230 shown in Fig. 2;
Fig. 4 is a block diagram schematically showing an example structure of a rhythm-based voice quality assessment device 400 according to another embodiment of the present invention;
Fig. 5 is a block diagram schematically showing an example structure of a data processing device 500 according to an embodiment of the present invention;
Fig. 6 is a block diagram schematically showing an example structure of a speech processing device 600 according to an embodiment of the present invention;
Fig. 7 is a flowchart schematically showing an example of the processing of a rhythm-based voice quality assessment method according to an embodiment of the present invention;
Fig. 8 is a flowchart schematically showing an example of the processing of a data processing method according to an embodiment of the present invention;
Fig. 9 is a flowchart schematically showing an example of the processing of a speech processing method according to an embodiment of the present invention; and
Fig. 10 is a flowchart schematically showing another example of the processing of a speech processing method according to an embodiment of the present invention.
Those skilled in the art will appreciate that the elements in the drawings are shown only for simplicity and clarity and are not necessarily drawn to scale. For example, the size of some elements in the drawings may be exaggerated relative to other elements in order to help improve the understanding of the embodiments of the present invention.
Detailed description of embodiments
Exemplary embodiments of the present invention will be described below in conjunction with the accompanying drawings. For clarity and conciseness, not all features of an actual implementation are described in this specification. It should be understood, however, that in developing any such actual implementation many implementation-specific decisions must be made in order to achieve the developer's particular goals, such as compliance with system-related and business-related constraints, which may vary from one implementation to another. Moreover, although such development work might be complex and time-consuming, it is merely a routine undertaking for those skilled in the art having the benefit of this disclosure.
It should also be noted that, in order to avoid obscuring the present invention with unnecessary details, only the device structures and/or processing steps closely related to the solution of the present invention are shown in the drawings, and other details of little relevance to the present invention are omitted.
An embodiment of the present invention provides a rhythm-based voice quality assessment device. The device comprises: a storage unit adapted to store a predetermined text and a reference rhythm feature corresponding to the predetermined text, the predetermined text comprising one or more sentences and each sentence comprising one or more words; a user speech receiving unit adapted to receive the user speech recorded by the user for the predetermined text; a feature acquiring unit adapted to acquire a user rhythm feature of the user speech; and a voice quality calculating unit adapted to calculate the voice quality of the user speech based on the correlation between the reference rhythm feature and the user rhythm feature.
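To make the division of work between these four units concrete, the following minimal Python sketch models them as plain objects. All class and function names and signatures are illustrative assumptions, not identifiers taken from the patent.
```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class StorageUnit:
    predetermined_text: List[str]                 # one entry per sentence
    reference_rhythm_feature: List[List[float]]   # one vector per sentence

class VoiceQualityAssessmentDevice:
    def __init__(self,
                 storage: StorageUnit,
                 acquire_feature: Callable[[bytes, List[str]], List[List[float]]],
                 compute_quality: Callable[[List[List[float]], List[List[float]]], float]):
        self.storage = storage                  # storage unit
        self.acquire_feature = acquire_feature  # feature acquiring unit
        self.compute_quality = compute_quality  # voice quality calculating unit

    def receive_user_speech(self, user_speech: bytes) -> float:
        # user speech receiving unit: accept the recording, then evaluate it
        user_feature = self.acquire_feature(user_speech, self.storage.predetermined_text)
        return self.compute_quality(self.storage.reference_rhythm_feature, user_feature)
```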
The rhythm-based voice quality assessment device according to an embodiment of the present invention may be an application that performs its processing on a traditional desktop or laptop computer (not shown), a client application that performs its processing on a mobile terminal (for example one of the applications 154 in the mobile terminal 100 shown in Fig. 1), or a web application accessed through a browser on such a desktop computer, laptop computer or mobile terminal.
Fig. 1 is a block diagram of the structure of a mobile terminal 100. The mobile terminal 100 with multi-touch capability may comprise a memory interface 102, one or more data processors, image processors and/or central processing units 104, and a peripheral interface 106.
The memory interface 102, the one or more processors 104 and/or the peripheral interface 106 may be discrete components or may be integrated in one or more integrated circuits. In the mobile terminal 100, the various elements may be coupled by one or more communication buses or signal lines. Sensors, devices and subsystems may be coupled to the peripheral interface 106 to help implement a variety of functions. For example, a motion sensor 110, a light sensor 112 and a distance sensor 114 may be coupled to the peripheral interface 106 to facilitate functions such as orientation, illumination and ranging. Other sensors 116, such as a positioning system (e.g. GPS), a temperature sensor, a biometric sensor or other sensing devices, may likewise be connected to the peripheral interface 106 to help implement related functions.
A camera subsystem 120 and an optical sensor 122 may be used to facilitate camera functions such as recording photographs and video clips, where the optical sensor may be, for example, a charge-coupled device (CCD) or a complementary metal-oxide-semiconductor (CMOS) optical sensor.
Communication functions may be implemented with the aid of one or more wireless communication subsystems 124, which may comprise radio-frequency receivers and transmitters and/or optical (e.g. infrared) receivers and transmitters. The particular design and implementation of the wireless communication subsystem 124 may depend on the one or more communication networks supported by the mobile terminal 100. For example, the mobile terminal 100 may comprise a communication subsystem 124 designed to support GSM, GPRS, EDGE, Wi-Fi or WiMax networks and Bluetooth™ networks.
An audio subsystem 126 may be coupled with a loudspeaker 128 and a microphone 130 to help implement voice-enabled functions such as speech recognition, speech reproduction, digital recording and telephony.
An I/O subsystem 140 may comprise a touch screen controller 142 and/or one or more other input controllers 144.
The touch screen controller 142 may be coupled to a touch screen 146. For example, the touch screen 146 and the touch screen controller 142 may detect contact and its movement or interruption using any of a variety of touch-sensing technologies, including but not limited to capacitive, resistive, infrared and surface acoustic wave technologies.
The one or more other input controllers 144 may be coupled to other input/control devices 148, such as one or more buttons, rocker switches, thumb wheels, infrared ports, USB ports, and/or pointing devices such as a stylus. The one or more buttons (not shown) may include up/down buttons for controlling the volume of the loudspeaker 128 and/or the microphone 130.
The memory interface 102 may be coupled with a memory 150. The memory 150 may comprise high-speed random access memory and/or non-volatile memory, such as one or more magnetic disk storage devices, one or more optical storage devices, and/or flash memory (e.g. NAND, NOR).
The memory 150 may store an operating system 152, such as Android, iOS or Windows Phone. The operating system 152 may comprise instructions for handling basic system services and performing hardware-dependent tasks. The memory 150 may also store applications 154. When run, these applications are loaded from the memory 150 into the processor 104 and execute on top of the operating system run by the processor 104, using the interfaces provided by the operating system and the underlying hardware to implement various functions desired by the user, such as instant messaging, web browsing and picture management. The applications may be provided independently of the operating system or may be bundled with it. The applications 154 include the voice quality assessment device 200 according to the present invention.
An example of a rhythm-based voice quality assessment device 200 according to an embodiment of the present invention is described below in conjunction with Fig. 2.
As shown in Fig. 2, the voice quality assessment device 200 comprises a storage unit 210, a user speech receiving unit 220, a feature acquiring unit 230 and a voice quality calculating unit 240.
As shown in Fig. 2, in the voice quality assessment device 200, the storage unit 210 stores a predetermined text and the reference rhythm feature corresponding to this predetermined text. The predetermined text comprises one or more sentences, and each sentence comprises one or more words, where each word usually comprises several letters or at least one character.
According to one implementation, when the language of the predetermined text is a language such as English in which words are formed from letters, the predetermined text may, in addition to the text content consisting of the one or more sentences and the words of each sentence, optionally also comprise information such as the syllables and/or phonemes of each word, and the correspondence between the syllables and/or phonemes of each word and the letters forming that word.
It should be noted that, although the example above concerns the case where the language of the predetermined text is English, the language of the actual predetermined text is not limited to English and may be any language, such as Chinese, French or German.
According to one implementation, the predetermined text and the reference rhythm feature may be downloaded in advance from a given server and stored in the storage unit 210. The given server mentioned here may be, for example, the server on which the data processing device 500 described below in conjunction with Fig. 5 resides. In this mode the amount of computation is small, no extra time is needed to compute the reference rhythm feature, time is saved, and the operation is simpler and more convenient.
According to another implementation, the predetermined text may be downloaded in advance from the given server without downloading the reference rhythm feature. In this implementation, the rhythm information of the reference speech can be downloaded from the given server and the reference rhythm feature can then be calculated from that rhythm information. The downloaded predetermined text and the calculated reference rhythm feature can then be stored in the storage unit 210. In this way, when the representation of the user rhythm feature changes, the reference rhythm feature calculated from the rhythm information of the reference speech can easily be expressed in the same form as the user rhythm feature, making the processing of the voice quality assessment device 200 more flexible, convenient and practical.
It should be noted that the process of calculating the reference rhythm feature from the rhythm information of the reference speech may refer to the processing described below in conjunction with Fig. 5 and is not described in detail here.
Here, the reference speech may be speech recorded in advance for this predetermined text by a specific user (for example a native speaker of the language of the predetermined text, or a professional teacher of that language). The rhythm information may relate to one segment of reference speech or to several segments. The reference rhythm feature of several segments of reference speech may be obtained by averaging the reference rhythm features of the individual segments.
When the user starts the voice quality assessment device 200, the storage unit 210 already holds the predetermined text and the corresponding reference rhythm feature, as described above. The text content corresponding to the speech to be recorded (i.e. the predetermined text) is then presented to the user through a display device such as the touch screen 146 of the mobile terminal 100, and the user is prompted to record the corresponding speech. The user can then record the corresponding speech as the user speech through an input device such as the microphone 130 of the mobile terminal 100, and this user speech is received by the user speech receiving unit 220.
The user speech received by the user speech receiving unit 220 is then passed to the feature acquiring unit 230, which acquires the user rhythm feature of this user speech.
Fig. 3 shows a possible example structure of the feature acquiring unit 230. In this example, the feature acquiring unit 230 may comprise an alignment sub-unit 310 and a feature calculation sub-unit 320.
As shown in Fig. 3, the alignment sub-unit 310 may use a predetermined acoustic model to force-align the user speech with the predetermined text, so as to determine the correspondence between each word in the predetermined text, and/or each syllable in each word, and/or each phoneme of each syllable, and the corresponding part of the user speech.
Generally speaking, an acoustic model is trained on a large number of recordings of native speakers; with the acoustic model, the probability that input speech corresponds to known words can be calculated, and the input speech can then be force-aligned with the known words. Here, the "input speech" may be the user speech or the reference speech mentioned below, and the "known words" may be the predetermined text.
Related techniques for acoustic models can be found, for example, in the material at http://mi.eng.cam.ac.uk/~mjfg/ASRU_talk09.pdf, and related techniques for forced alignment can be found in the material at http://www.isip.piconepress.com/projects/speech/software/tutorials/production/fundamentals/v1.0/section_04/s04_04_p01.html and http://www.phon.ox.ac.uk/jcoleman/BAAP_ASR.pdf; other existing techniques may also be used and are not described in detail here.
In addition, it should be noted that by force-aligning the user speech with the predetermined text, the correspondence between each sentence in the predetermined text and the corresponding part of the user speech (for example a certain speech segment) can be determined; in other words, the speech segment corresponding to each sentence of the predetermined text can be located in the user speech.
In addition, as mentioned above, forced alignment can also provide, as required, any one or more of the following three correspondences: the correspondence between each word in the predetermined text and a part (for example a speech block) of the user speech; the correspondence between each syllable of each word in the predetermined text and a part (for example a speech block) of the user speech; and the correspondence between each phoneme of each syllable of each word in the predetermined text and a part (for example a speech block) of the user speech.
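As a concrete illustration of what such a word-level correspondence can look like, the sketch below shows a purely hypothetical alignment for the sentence "how are you today"; the (start, end) times in seconds match the example used further below, and the data layout is an assumption, not the output format of the patent or of any particular aligner.
```python
# Hypothetical word-level forced-alignment result for one sentence of the
# predetermined text: (word, start_time, end_time), in seconds.
word_alignment = [
    ("how",   0.0, 0.2),
    ("are",   0.5, 0.6),
    ("you",   0.8, 1.0),
    ("today", 1.3, 1.5),
]
# A syllable- or phoneme-level correspondence could be nested in the same way,
# e.g. ("today", 1.3, 1.5, [("to", 1.3, 1.4), ("day", 1.4, 1.5)]).
```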
In this way, based on the correspondence determined by the alignment sub-unit 310, the feature calculation sub-unit 320 can calculate the user rhythm feature of the user speech.
According to one implementation, for each sentence of the predetermined text, the feature calculation sub-unit 320 may obtain the rhythm feature of the speech segment corresponding to this sentence in the user speech from the time interval between the two speech blocks corresponding to each pair of adjacent words in the sentence. The rhythm feature of the whole user speech is then formed from the rhythm features of the speech segments corresponding to the individual sentences of the predetermined text.
In one example, for each sentence in the predetermined text, the feature calculation sub-unit 320 may take the information formed by all the time intervals determined in this sentence as the rhythm feature of the speech segment corresponding to this sentence.
For example, for the sentence "how are you today" in the predetermined text, forced alignment yields the speech segment Use corresponding to this sentence in the user speech, in which the words "how", "are", "you" and "today" correspond in turn to the speech blocks Ub1, Ub2, Ub3 and Ub4. Forced alignment also yields the pause between the two speech blocks corresponding to each pair of adjacent words in this sentence, i.e. the following rhythm information:
(0.2-0.5), (0.6-0.8), (1.0-1.3) (assuming the unit is seconds).
Here, (0.2-0.5) means that the pause between Ub1 and Ub2 lasts from time point 0.2 to time point 0.5, i.e. the time interval is 0.3 seconds; (0.6-0.8) means that the pause between Ub2 and Ub3 lasts from time point 0.6 to time point 0.8, i.e. the time interval is 0.2 seconds; and (1.0-1.3) means that the pause between Ub3 and Ub4 lasts from time point 1.0 to time point 1.3, i.e. the time interval is 0.3 seconds. Note that in this example all pauses are treated as time intervals, regardless of their length.
Thus, in this example, the information formed by the time intervals obtained for the sentence "how are you today" in the user speech can be taken as the rhythm feature of the speech segment Use corresponding to this sentence, where this information may be expressed, for example but not exclusively, in the form of a vector, i.e. (0.3, 0.2, 0.3).
A rhythm feature formed from the time intervals between the speech parts corresponding to the words directly reflects the length of the pauses between words when the user reads the sentence.
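A minimal sketch of this interval-based feature, computed from the word-level alignment illustrated earlier (the function name is an assumption for illustration):
```python
def pause_intervals(word_alignment):
    """Time gaps (seconds) between the speech blocks of adjacent words in one sentence."""
    return [round(next_start - prev_end, 3)
            for (_, _, prev_end), (_, next_start, _) in zip(word_alignment, word_alignment[1:])]

# Using the alignment of "how are you today" sketched above:
word_alignment = [("how", 0.0, 0.2), ("are", 0.5, 0.6), ("you", 0.8, 1.0), ("today", 1.3, 1.5)]
print(pause_intervals(word_alignment))  # [0.3, 0.2, 0.3] -> the vector (0.3, 0.2, 0.3)
```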
In another example, for each sentence in the predetermined text, the feature calculation sub-unit 320 may instead determine the duration of the speech block corresponding to each word of this sentence in the user speech, and take the information formed by the durations of all the words in this sentence as the rhythm feature of the speech segment corresponding to this sentence.
For example, again for the sentence "how are you today", forced alignment yields the duration of the speech block corresponding to each word of this sentence in the user speech, i.e. the following rhythm information:
(0-0.2), (0.5-0.6), (0.8-1.0), (1.3-1.5) (assuming the unit is seconds).
From this, the duration of Ub1 is 0.2 seconds, the duration of Ub2 is 0.1 seconds, the duration of Ub3 is 0.2 seconds, and the duration of Ub4 is also 0.2 seconds.
Thus, in this example, the information formed by the word durations obtained for the sentence "how are you today" can be taken as the rhythm feature of the speech segment Use corresponding to this sentence, where this information may be expressed, for example but not exclusively, in the form of a vector, i.e. (0.2, 0.1, 0.2, 0.2).
A rhythm feature formed from the durations of the speech parts corresponding to the individual words directly reflects how long the user pronounces each word when reading the sentence, and also indirectly reflects the length of the pauses between words.
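A corresponding sketch of the duration-based feature, again using the alignment from the earlier example (the function name is an assumption):
```python
def word_durations(word_alignment):
    """Duration (seconds) of the speech block corresponding to each word of one sentence."""
    return [round(end - start, 3) for _, start, end in word_alignment]

word_alignment = [("how", 0.0, 0.2), ("are", 0.5, 0.6), ("you", 0.8, 1.0), ("today", 1.3, 1.5)]
print(word_durations(word_alignment))  # [0.2, 0.1, 0.2, 0.2] -> the vector (0.2, 0.1, 0.2, 0.2)
```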
In addition, according to another implementation, the information may be determined by comparing each time interval with a predetermined interval threshold. In other words, in this implementation, a time interval greater than or equal to the predetermined interval threshold is set to a first value (for example 1), and a time interval smaller than the predetermined interval threshold is set to a second value (for example 0). The predetermined interval threshold may, for example, be set empirically or determined by testing; this is not described in detail here.
For example, again for the sentence "how are you today", forced alignment yields the pause between the two speech blocks corresponding to each pair of adjacent words in this sentence, i.e. the following rhythm information:
(0.2-0.5), (0.6-0.8), (1.0-1.3) (assuming the unit is seconds).
Assume in this example that the predetermined interval threshold is 0.25 seconds. The pause between Ub1 and Ub2 is 0.3 seconds, which is greater than the threshold, so the attribute value of this time interval is set to 1; the pause between Ub2 and Ub3 is 0.2 seconds, which is less than the threshold, so the attribute value of this time interval is set to 0; and the time interval between Ub3 and Ub4 is 0.3 seconds, which is greater than the threshold, so the attribute value of this time interval is set to 1.
The values "0" and "1" thus represent the attribute of each time interval: "0" means the interval is very short, and "1" means the interval is longer. In this example, the information formed by the attribute values of the time intervals obtained for the sentence "how are you today" can be taken as the rhythm feature of the speech segment Use corresponding to this sentence, where this information may be expressed, for example but not exclusively, in the form of a vector, i.e. (1, 0, 1).
Suppose the predetermined text contains two sentences in total, that these two sentences correspond to the speech segments Use1 and Use2 in the user speech respectively, and that the calculated rhythm feature of Use1 is (1, 0, 1) and that of Use2 is (0, 1, 1); the rhythm feature of this user speech is then {(1, 0, 1), (0, 1, 1)}.
It can be seen that, in this way, the resulting rhythm feature (the vector of 0/1 values described above) reflects the intervals between words more intuitively. By setting the predetermined interval threshold, relatively long pauses (for example greater than or equal to the threshold) between words (and/or between syllables) can be distinguished from relatively short ones (for example less than the threshold), which avoids the influence of short pauses on the resulting rhythm feature and better matches people's speaking and/or pronunciation habits.
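A small sketch of this thresholding step; the 0.25-second threshold follows the example above and is otherwise an assumption:
```python
def binarize_pauses(intervals, threshold=0.25):
    """Map each inter-word pause to 1 (>= threshold) or 0 (< threshold)."""
    return [1 if gap >= threshold else 0 for gap in intervals]

print(binarize_pauses([0.3, 0.2, 0.3], threshold=0.25))  # [1, 0, 1] -> the vector (1, 0, 1)
```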
Several examples of obtaining the user rhythm feature have been given above. In subsequent processing, a reference rhythm feature of the same form is compared with the user rhythm feature (for example by calculating the similarity or distance between them), and the comparison result is provided to the user, so that the user can quickly learn whether his or her speaking rhythm and/or pronunciation rhythm is deficient and can see at once how to improve it (for example, the pause between two particular words should be longer or shorter, the pronunciation of a particular word should last longer, the pause between two syllables of a particular word should have a certain length, and so on). It should be noted that the examples given here obtain the rhythm feature of a sentence from the intervals between words; in other examples, the rhythm feature of each word can likewise be calculated from the intervals between its syllables, using a similar process that is not repeated here.
It should be noted that, in the embodiments of the present invention, the speaking rhythm refers to the pauses between words, while the pronunciation rhythm refers to the pauses between syllables.
In this way, after the feature acquiring unit 230 has acquired the user rhythm feature of the user speech, the voice quality calculating unit 240 can calculate the voice quality of the user speech based on the correlation between the user rhythm feature and the reference rhythm feature.
According to one implementation, the voice quality calculating unit 240 may derive, from the correlation between the user rhythm feature and the reference rhythm feature, a score describing the voice quality of the user speech.
In one example, suppose the user rhythm feature obtained by the feature acquiring unit 230 is {(1, 0, 1), (0, 1, 1)} and the reference rhythm feature is {(1, 0, 0), (1, 1, 1)}. The similarity between {(1, 0, 1), (0, 1, 1)} and {(1, 0, 0), (1, 1, 1)} can then be calculated and used as the score describing the voice quality of this user speech. That is, the higher the similarity between the calculated user rhythm feature and the reference rhythm feature, the higher the voice quality of the user speech.
The similarity between the user rhythm feature {(1, 0, 1), (0, 1, 1)} and the reference rhythm feature {(1, 0, 0), (1, 1, 1)} can be obtained from the similarities between the vectors at corresponding positions: for example, first calculate the vector similarity between (1, 0, 1) and (1, 0, 0) and the vector similarity between (0, 1, 1) and (1, 1, 1), and then take the weighted mean or weighted sum of all the calculated vector similarities as the similarity between the user rhythm feature and the reference rhythm feature. When calculating the weighted mean or weighted sum of the vector similarities, the weight of the sentence of the predetermined text corresponding to each vector can be used as the weight of that vector's similarity; the weight of each sentence in the predetermined text may be set empirically in advance, may be set using the reference rhythm feature (for example, a sentence whose vector in the reference rhythm feature contains more "1" elements may be given a higher weight), or all weights may simply be set to 1, and so on.
In another example, a distance between the user rhythm feature and the reference rhythm feature can also be calculated based on the correlation between them, and the score describing the voice quality of the user speech can be derived from this distance; for example, the reciprocal of the distance can be used as the score. That is, the larger the distance between the calculated user rhythm feature and the reference rhythm feature, the poorer the voice quality of the user speech.
For example, the distance between the user rhythm feature {(1, 0, 1), (0, 1, 1)} and the reference rhythm feature {(1, 0, 0), (1, 1, 1)} can be obtained from the distances between the vectors at corresponding positions: first calculate the distance between (1, 0, 1) and (1, 0, 0) and the distance between (0, 1, 1) and (1, 1, 1), and then take the weighted mean or weighted sum of all the calculated vector distances as the distance between the user rhythm feature and the reference rhythm feature. The weights used here can be set in the same way as the weights used for the weighted mean or weighted sum of the vector similarities, which is not repeated here.
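The patent leaves the concrete similarity and distance measures open; the sketch below uses cosine similarity and Euclidean distance as assumed stand-ins, with the per-sentence weighting described above (all weights set to 1 here). All names are illustrative.
```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def euclidean_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def weighted_correlation(user_feature, reference_feature, weights=None, measure=cosine_similarity):
    """Weighted mean of the per-sentence vector similarities (or distances)."""
    weights = weights or [1.0] * len(reference_feature)
    values = [measure(u, r) for u, r in zip(user_feature, reference_feature)]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

user_feature      = [(1, 0, 1), (0, 1, 1)]
reference_feature = [(1, 0, 0), (1, 1, 1)]
similarity_score = weighted_correlation(user_feature, reference_feature)           # higher is better
distance = weighted_correlation(user_feature, reference_feature,
                                measure=euclidean_distance)                        # lower is better
```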
It should also be noted that, if the reference rhythm feature stored in the storage unit 210 is not expressed in the same form as the user rhythm feature (for example in the form of vectors), it can first be converted into that form, after which the similarity or distance between them is calculated.
It should also be noted that the voice quality calculating unit 240 may calculate the correlation (i.e. similarity or distance) between the user rhythm feature and the reference rhythm feature sentence by sentence and then obtain the quality score of the user speech sentence by sentence (i.e. obtain in turn the quality score of each speech segment of the user speech corresponding to each sentence of the predetermined text). Alternatively, the voice quality calculating unit 240 may calculate the correlation (i.e. similarity or distance) between the user rhythm feature of the whole user speech and the reference rhythm feature and then obtain a quality score describing the whole user speech.
Another example of a rhythm-based voice quality assessment device according to an embodiment of the present invention is described below in conjunction with Fig. 4.
In the example shown in Fig. 4, the voice quality assessment device 400 comprises, in addition to a storage unit 410, a user speech receiving unit 420, a feature acquiring unit 430 and a voice quality calculating unit 440, an output unit 450. The storage unit 410, user speech receiving unit 420, feature acquiring unit 430 and voice quality calculating unit 440 of the voice quality assessment device 400 shown in Fig. 4 may have the same structures and functions as the corresponding units of the voice quality assessment device 200 described above in conjunction with Fig. 2 and can achieve similar technical effects, which are not repeated here.
The output unit 450 can output the calculation result of the voice quality visually, for example by presenting it to the user through a display device such as the touch screen 146 of the mobile terminal 100.
According to one implementation, the output unit 450 may output a score reflecting the voice quality as the calculation result of the voice quality.
For example, the output unit 450 may visually output (for example sentence by sentence) the voice quality score of each speech segment of the user speech corresponding to each sentence of the predetermined text. In this way, the user can learn how accurate the speaking rhythm and/or pronunciation rhythm of each spoken sentence is; in particular, when the score of a certain sentence is low, the user immediately realises that this rhythm needs to be corrected, making the learning more targeted.
As another example, the output unit 450 may visually output a score reflecting the voice quality of the whole user speech. In this way, the user can judge overall whether the rhythm of the spoken passage is accurate.
In other examples, the output unit 450 may also visually output both the voice quality score of each speech segment of the user speech corresponding to each sentence of the predetermined text and the voice quality score of the whole user speech at the same time.
According to another implementation, the output unit 450 may visually output the differences between the user rhythm feature and the reference rhythm feature as the calculation result of the voice quality.
For example, the output unit 450 may display the standard speech and the user speech as two parallel rows, where an apostrophe (') indicates a pause between two words. If the pauses are identical, they can be shown in a normal style, for example as a green apostrophe; if they differ, the pause is highlighted, for example as a bold red apostrophe.
In this way, through the display output of the output unit 450, the user can easily see the differences between his or her own speaking rhythm and/or pronunciation rhythm and those of the standard speech (i.e. the reference speech here), and how large these differences are, and can therefore correct his or her own speaking rhythm and/or pronunciation rhythm in a more targeted and accurate way.
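As a rough textual stand-in for that display (the patent describes a colour-highlighted GUI, which is not reproduced here), the following sketch prints the reference and the user rendition as two parallel rows; the pause vectors are the binarized features from the examples above, and all names are illustrative assumptions.
```python
def pause_rows(words, reference_pauses, user_pauses):
    """Two parallel text rows (reference vs. user). An apostrophe marks a pause
    after a word; an asterisk flags a position where the two rows disagree
    (standing in for the bold/red highlighting described above)."""
    def render(pauses, other):
        parts = [words[0]]
        for word, p, q in zip(words[1:], pauses, other):
            mark = ("'" if p else "") + ("*" if p != q else "")
            parts.append((" " + mark + " " if mark else " ") + word)
        return "".join(parts)
    return render(reference_pauses, user_pauses), render(user_pauses, reference_pauses)

reference_row, user_row = pause_rows(["how", "are", "you", "today"], [0, 0, 1], [1, 0, 1])
print(reference_row)  # how * are you ' today
print(user_row)       # how '* are you ' today
```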
According to other implementations, the output unit 450 may also visually output both a score reflecting the voice quality and the differences between the user rhythm feature and the reference rhythm feature as the calculation result of the voice quality; the details follow from the description of the two implementations above and are not repeated here.
As can be seen from the above description, the rhythm-based voice quality assessment device according to the embodiments of the present invention calculates the voice quality of the user speech based on the correlation between the acquired user rhythm feature of the user speech and the reference rhythm feature. Because the device takes voice rhythm information into account when calculating the voice quality of the user speech, the user can learn from the calculation result how accurate the rhythm of the recorded speech is, which helps the user decide whether his or her speaking rhythm and/or pronunciation rhythm needs to be corrected.
In addition, the rhythm-based voice quality assessment device according to the embodiments of the present invention corresponds to the user client: it completes the calculation and evaluation of the user speech on the client computer or client mobile terminal, whereas existing voice technology usually completes the calculation and evaluation of the user speech on the server side. With the voice quality assessment device of the present invention, the user can therefore learn offline (once the learning material has been downloaded and stored), without having to learn online as in the prior art.
In addition, the embodiments of the present invention also provide a data processing device. The device is adapted to be executed in a server and comprises: a server storage unit adapted to store a predetermined text, and adapted to store at least one segment of reference speech corresponding to the predetermined text or to receive at least one segment of reference speech from outside and store it; and a rhythm calculation unit adapted to calculate the rhythm information of the at least one segment of reference speech and save this rhythm information in the server storage unit, or to calculate the reference rhythm feature of the at least one segment of reference speech from this rhythm information and save the reference rhythm feature in the server storage unit.
Fig. 5 shows an example of a data processing device 500 according to an embodiment of the present invention. As shown in Fig. 5, the data processing device 500 comprises a server storage unit 510 and a rhythm calculation unit 520.
The data processing device 500 may, for example, be implemented as an application resident on a server. The server may, for example, comprise a web server, which can communicate with the user client (for example the voice quality assessment device 200 or 400 described above) using the HTTP protocol, but is not limited to this.
The server storage unit 510 can store the text material of various language learning materials, i.e. the predetermined texts. For each language, the server storage unit 510 can, in addition to the predetermined text, also store at least one segment of reference speech corresponding to the predetermined text, or can receive at least one segment of reference speech from an external device such as the speech processing device 600 described below and store it.
It should be understood that the predetermined text mentioned here is similar to the predetermined text described above: in addition to the text content consisting of one or more sentences and the words of each sentence, it may optionally also comprise information such as the syllables and/or phonemes of each word (for example when the language of the predetermined text is a language such as English in which words are formed from letters), and the correspondence between the syllables and/or phonemes of each word and the letters forming that word.
The rhythm calculation unit 520 can obtain the rhythm information or the reference rhythm feature of the at least one segment of reference speech by calculation and save the result in the server storage unit. The process of obtaining the reference rhythm feature can be similar to the process of obtaining the user rhythm feature described above, as illustrated below; the description of identical parts is omitted.
According to one implementation, the rhythm calculation unit 520 may save the obtained rhythm information of the at least one segment of reference speech in the server storage unit 510. In this implementation, in subsequent processing, the data processing device 500 can provide the stored predetermined text and the rhythm information of the at least one segment of reference speech to the user client (for example the voice quality assessment device 200 or 400 described above).
According to another implementation, the rhythm calculation unit 520 may also obtain the reference rhythm feature of the at least one segment of reference speech from the obtained rhythm information and save the obtained reference rhythm feature in the server storage unit 510. In this implementation, in subsequent processing, the data processing device 500 can provide the stored predetermined text and the reference rhythm feature of the at least one segment of reference speech to the user client (for example the voice quality assessment device 200 or 400 described above).
In one example, suppose the "at least one segment of reference speech" comprises two segments, R1 and R2. For the sentence "how are you today" in the predetermined text and the reference speech R1, forced alignment yields the speech segment R1se1 corresponding to this sentence in the reference speech R1, in which the words "how", "are", "you" and "today" correspond in turn to the speech blocks Rb1, Rb2, Rb3 and Rb4. Forced alignment also yields the rhythm information of the reference speech R1, namely:
(0.2-0.4), (0.5-0.7), (0.9-1.2) (assuming the unit is seconds).
Here, (0.2-0.4) means that the pause between Rb1 and Rb2 lasts from time point 0.2 to time point 0.4; (0.5-0.7) means that the pause between Rb2 and Rb3 lasts from time point 0.5 to time point 0.7; and (0.9-1.2) means that the pause between Rb3 and Rb4 lasts from time point 0.9 to time point 1.2. Thus, in this example, the information formed by the time intervals of the sentence "how are you today" in the reference speech R1 can be taken as the rhythm feature of the speech segment R1se1 corresponding to this sentence, where this information may be expressed, for example but not exclusively, in vector form, i.e. (0.2, 0.2, 0.3).
In another example, the information can be determined by comparing each time interval with a predetermined interval threshold (assumed in this example to be 0.25). From the rhythm information above and the predetermined interval threshold, the rhythm features of the speech segments R1se1 and R1se2 are (0, 0, 1) and (0, 1, 1) respectively, so the rhythm feature of the reference speech R1, {(0, 0, 1), (0, 1, 1)}, is obtained and saved in the server storage unit 510 as the reference rhythm feature.
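The following sketch reconstructs, under the same assumptions as in the earlier sketches, how the reference rhythm feature of R1 could be derived from its alignment, and how features from several reference recordings might be averaged as mentioned earlier; the R2 feature and all names are invented for illustration only.
```python
def sentence_rhythm_feature(word_alignment, threshold=0.25):
    """Binarized inter-word pauses of one sentence (same construction as for the user speech)."""
    gaps = [next_start - prev_end
            for (_, _, prev_end), (_, next_start, _) in zip(word_alignment, word_alignment[1:])]
    return [1 if gap >= threshold else 0 for gap in gaps]

# Alignment of "how are you today" in reference speech R1, matching the
# rhythm information (0.2-0.4), (0.5-0.7), (0.9-1.2) above.
r1_alignment = [("how", 0.0, 0.2), ("are", 0.4, 0.5), ("you", 0.7, 0.9), ("today", 1.2, 1.4)]
print(sentence_rhythm_feature(r1_alignment))  # [0, 0, 1]

def average_reference_features(per_recording_features):
    """Element-wise mean over the per-sentence vectors of several reference recordings."""
    return [[sum(values) / len(values) for values in zip(*sentence_vectors)]
            for sentence_vectors in zip(*per_recording_features)]

feature_r1 = [[0, 0, 1], [0, 1, 1]]   # reference rhythm feature of R1 (two sentences)
feature_r2 = [[0, 1, 1], [0, 1, 1]]   # invented feature for R2, for illustration only
print(average_reference_features([feature_r1, feature_r2]))  # [[0.0, 0.5, 1.0], [0.0, 1.0, 1.0]]
```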
In other examples, the rhythm information of the reference speech may also be formed from the durations of the speech blocks corresponding in the reference speech to the words of each sentence of the predetermined text; the processing in this case is similar to the corresponding processing for the user speech described above and is therefore not repeated.
It should be noted that the processing performed in the data processing device 500 according to an embodiment of the present invention that is identical to that of the rhythm-based voice quality assessment device 200 or 400 described above in conjunction with Fig. 2 or Fig. 4 can achieve similar technical effects, which are not repeated here one by one.
In addition, the embodiments of the present invention also provide a speech processing device. The device is adapted to be executed in a computer and comprises: a reference speech receiving unit adapted to receive, as reference speech, the speech recorded by a specific user for the predetermined text and to send the reference speech to a given server; and/or a rhythm calculation unit adapted to calculate the rhythm information of the reference speech from the reference speech, so that this rhythm information and the predetermined text are sent to the given server in association with each other, or to calculate the reference rhythm feature of the reference speech from this rhythm information, so that the reference rhythm feature and the predetermined text are sent to the given server in association with each other.
Fig. 6 shows an example of a speech processing device 600 according to an embodiment of the present invention. As shown in Fig. 6, the speech processing device 600 comprises a reference speech receiving unit 610 and may comprise a rhythm calculation unit 620.
As shown in Fig. 6, according to one implementation, when the speech processing device 600 only comprises the reference speech receiving unit 610, the reference speech receiving unit 610 can receive, as reference speech, the speech recorded for the predetermined text by a specific user (such as a native speaker of the language of the predetermined text or a professional teacher of that language) and send the reference speech to a given server (such as the server on which the data processing device 500 described above in conjunction with Fig. 5 resides).
According to another implementation, when the speech processing device 600 also comprises the rhythm calculation unit 620, the rhythm information of the reference speech can be calculated from the reference speech received by the reference speech receiving unit 610, so that this rhythm information and the predetermined text are sent to the given server in association with each other; or the reference rhythm feature of the reference speech can be calculated from this rhythm information (this process can refer to the related description above), so that the reference rhythm feature and the predetermined text are sent to the given server in association with each other.
In practical applications, the speech processing device 600 may correspond to a teacher client installed on a computer or other terminal, for example implemented in software.
The user of the teacher client can record standard speech for each sentence of the predetermined text and send it to the corresponding server side as reference speech, with the subsequent processing performed by the server side. In this case, the server can conveniently collect reference speech over the internet without having to take part in the recording process, which saves time and effort.
Alternatively, the teacher client can also analyse the recorded standard speech (i.e. the reference speech) directly on the local machine, generate the parameters corresponding to this standard speech (such as the reference speech feature), and transmit them to the server side for storage together with the predetermined text, thereby reducing the processing load on the server side.
In addition, an embodiment of the invention further provides a mobile terminal comprising the rhythm-based voice quality evaluation device described above. The mobile terminal can have the functions of the rhythm-based voice quality evaluation device 200 or 400 described above and can achieve similar technical effects, which are not described in detail here.
In addition, an embodiment of the invention further provides a rhythm-based voice quality evaluation system comprising the rhythm-based voice quality evaluation device 200 or 400 and the data processing device 500 described above.
According to one implementation, besides the voice quality evaluation device 200 or 400 and the data processing device 500, the voice quality evaluation system may optionally further comprise the speech processing device 600 described above. In this implementation, the voice quality evaluation device 200 or 400 may correspond to a user client installed on a computer or mobile terminal, the data processing device 500 may correspond to the server, and the speech processing device 600 may correspond to a teacher client. In actual processing, the teacher client provides reference speech to the server (and optionally the rhythm information or reference rhythm feature of the reference speech), the server stores this information together with the predetermined text, and the user client downloads this information from the server in order to analyze the user speech input by the user and complete the voice quality evaluation. Details of the processing can be found in the descriptions given above with reference to Figs. 2 or 4, Fig. 5 and Fig. 6, and are not repeated here.
In addition, an embodiment of the invention further provides a rhythm-based voice quality evaluation method comprising the steps of: receiving user speech recorded by a user for a predetermined text, the predetermined text comprising one or more sentences and each sentence comprising one or more words; obtaining a user rhythm feature of the user speech; and calculating the voice quality of the user speech based on the correlation between a reference rhythm feature corresponding to the predetermined text and the user rhythm feature.
An exemplary process of the above rhythm-based voice quality evaluation method is described below with reference to Fig. 7. As shown in Fig. 7, the exemplary processing flow 700 of the rhythm-based voice quality evaluation method according to an embodiment of the invention starts at step S710 and then proceeds to step S720.
In step S720, user speech recorded by the user for the predetermined text is received, the predetermined text comprising one or more sentences and each sentence comprising one or more words. Step S730 is then performed. The processing in step S720 may, for example, be identical to the processing of the user speech receiving unit 220 described above with reference to Fig. 2 and can achieve similar technical effects, which are not repeated here.
According to one implementation, the predetermined text and the reference rhythm feature may be obtained by downloading them in advance from the specific server.
According to another implementation, the predetermined text may be downloaded in advance from the specific server, while the reference rhythm feature is calculated from the rhythm information of at least one segment of reference speech downloaded in advance from the specific server.
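A minimal sketch of the second implementation follows, assuming a hypothetical JSON endpoint that returns the predetermined text together with the per-sentence inter-word pauses of the reference speech; the client then derives the reference rhythm feature locally. URL and field names are illustrative assumptions.

    import json
    from urllib.request import urlopen  # standard library, no extra dependency

    MATERIAL_URL = "https://example.com/api/material/42"  # hypothetical endpoint

    def fetch_and_build_reference(threshold=0.25):
        with urlopen(MATERIAL_URL) as resp:
            material = json.load(resp)  # {"text": [...], "reference_gaps": [[...], ...]}
        text = material["text"]
        reference_feature = [
            tuple(1 if gap >= threshold else 0 for gap in gaps)
            for gaps in material["reference_gaps"]
        ]
        return text, reference_feature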
In step S730, the user rhythm feature of the user speech is obtained. Step S740 is then performed. The processing in step S730 may, for example, be identical to the processing of the feature obtaining unit 230 described above with reference to Fig. 2 and can achieve similar technical effects, which are not repeated here.
According to one implementation, in step S730 the user speech may be forced-aligned with the predetermined text using a predetermined acoustic model, so as to determine the correspondence between each word in the predetermined text (and/or each syllable in each word, and/or each phoneme in each syllable) and a portion of the user speech, and the user rhythm feature of the user speech is then obtained based on this correspondence.
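The forced alignment itself depends on the acoustic model in use; the sketch below only shows one plausible way to represent its output (word-level time spans) and to derive the inter-word pauses and word durations needed by the following steps. The aligner call is a placeholder, not a real API.

    from dataclasses import dataclass
    from typing import List, Tuple

    @dataclass
    class WordSpan:
        word: str
        start: float  # seconds from the beginning of the user speech
        end: float

    def force_align(audio_path: str, sentence: List[str]) -> List[WordSpan]:
        """Placeholder for forced alignment against a predetermined acoustic
        model (e.g. an HMM/DNN aligner); returns one time span per word."""
        raise NotImplementedError

    def gaps_and_durations(spans: List[WordSpan]) -> Tuple[List[float], List[float]]:
        """Derive the pauses between adjacent words and each word's duration."""
        gaps = [b.start - a.end for a, b in zip(spans, spans[1:])]
        durations = [s.end - s.start for s in spans]
        return gaps, durations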
The step of obtaining the user rhythm feature of the user speech based on the correspondence may, for example, be implemented as follows: for each sentence of the predetermined text, the rhythm feature of the speech segment corresponding to that sentence in the user speech is obtained from the time intervals between the two speech blocks corresponding to each pair of adjacent words of the sentence in the user speech; and the rhythm feature of the user speech is formed from the rhythm features of the speech segments corresponding to the respective sentences of the predetermined text in the user speech.
In one example, for each sentence of the predetermined text, the information formed by all the time intervals determined for that sentence is taken as the rhythm feature of the speech segment corresponding to the sentence. Specifically, when the time interval between two adjacent words is greater than or equal to the predetermined interval threshold, a first value is set for that time interval; when the time interval is less than the predetermined interval threshold, a second value is set for that time interval; and the rhythm feature of the speech segment corresponding to the sentence is determined from the first and second values thus obtained.
In another example, for each sentence of the predetermined text, the durations of the speech blocks corresponding to the words of the sentence in the user speech are determined, and the information formed by the durations corresponding to all the words of the sentence is taken as the rhythm feature of the speech segment corresponding to the sentence.
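Both per-sentence variants described above are sketched below under the assumption that the pauses and durations have already been derived from the alignment; a first value of 1 and a second value of 0 are used for the interval-based variant, and all names are illustrative.

    def interval_feature(gaps, threshold=0.25):
        """First value (1) for pauses >= threshold, second value (0) otherwise."""
        return tuple(1 if gap >= threshold else 0 for gap in gaps)

    def duration_feature(durations):
        """Rhythm feature formed directly from the per-word speech-block durations."""
        return tuple(round(d, 3) for d in durations)

    def user_rhythm_feature(per_sentence_gaps, per_sentence_durations, use_durations=False):
        """Combine the per-sentence features into the rhythm feature of the user speech."""
        if use_durations:
            return [duration_feature(d) for d in per_sentence_durations]
        return [interval_feature(g) for g in per_sentence_gaps]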
In step S740, the voice quality of the user speech is calculated based on the correlation between the reference rhythm feature corresponding to the predetermined text and the user rhythm feature. The processing in step S740 may, for example, be identical to the processing of the voice quality calculation unit 240 described above with reference to Fig. 2 and can achieve similar technical effects, which are not repeated here. The processing flow 700 then ends at step S750.
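The disclosure leaves the exact mapping from correlation to a quality score open; the sketch below shows one plausible realization that averages the Pearson correlation of the per-sentence feature vectors and rescales it to a 0–100 score. The rescaling and function names are assumptions, not the claimed method.

    from statistics import mean, pstdev

    def pearson(x, y):
        """Pearson correlation of two equally long feature vectors."""
        mx, my = mean(x), mean(y)
        sx, sy = pstdev(x), pstdev(y)
        if sx == 0 or sy == 0:
            return 1.0 if tuple(x) == tuple(y) else 0.0
        cov = mean((a - mx) * (b - my) for a, b in zip(x, y))
        return cov / (sx * sy)

    def voice_quality(user_feature, reference_feature):
        """Average per-sentence correlation, rescaled from [-1, 1] to [0, 100]."""
        corrs = [pearson(u, r) for u, r in zip(user_feature, reference_feature)]
        return 50.0 * (mean(corrs) + 1.0)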
In addition, according to another implementation, the method may optionally further comprise, after step S740, a step of visually outputting the calculation result of the voice quality.
The calculation result of the voice quality may comprise: a score reflecting the voice quality; and/or the differences between the user rhythm feature and the reference rhythm feature.
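One simple way of assembling this optional output (the score plus the positions at which the user's pauses differ from the reference) is sketched below; the report structure is an assumption for illustration only.

    def evaluation_report(score, user_feature, reference_feature):
        """Collect the score and the per-sentence differences for display."""
        diffs = []
        for idx, (u, r) in enumerate(zip(user_feature, reference_feature), start=1):
            mismatches = [pos + 1 for pos, (a, b) in enumerate(zip(u, r)) if a != b]
            diffs.append({"sentence": idx, "mismatched_pauses": mismatches})
        return {"score": round(score, 1), "differences": diffs}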
As can be seen from the above description, the rhythm-based voice quality evaluation method according to an embodiment of the invention calculates the voice quality of the user speech based on the correlation between the user rhythm feature obtained from the user speech and the reference rhythm feature. Because the method takes speech-rhythm information into account when calculating the voice quality, the user can learn from the calculation result how accurate the rhythm of the recorded speech is, which in turn helps the user decide whether his or her speaking and/or pronunciation rhythm needs to be corrected.
Furthermore, the rhythm-based voice quality evaluation method according to an embodiment of the invention corresponds to a user client: the calculation and evaluation of the user speech are completed on the client computer or client mobile terminal, whereas existing speech technologies usually complete the calculation and evaluation on the server side. The voice quality evaluation method of the invention therefore allows the user to study offline (once the learning material has been downloaded and stored), without having to study online as in the prior art.
In addition, an embodiment of the invention further provides a data processing method that is adapted to be executed in a server and comprises the steps of: storing a predetermined text; storing at least one segment of reference speech corresponding to the predetermined text, or storing at least one segment of reference speech received from outside; and obtaining rhythm information of the at least one segment of reference speech and saving the rhythm information, or obtaining a reference rhythm feature of the at least one segment of reference speech from the rhythm information and saving the reference rhythm feature.
An exemplary process of the above data processing method is described below with reference to Fig. 8. As shown in Fig. 8, the exemplary processing flow 800 of the data processing method according to an embodiment of the invention starts at step S810 and then proceeds to step S820.
In step S820, the predetermined text and at least one segment of reference speech corresponding to the predetermined text are stored, or the predetermined text is stored and at least one segment of reference speech received from outside is stored. Step S830 is then performed. The processing in step S820 may, for example, be identical to the processing of the server storage unit 510 described above with reference to Fig. 5 and can achieve similar technical effects, which are not repeated here.
In step S830, the rhythm information of the at least one segment of reference speech is obtained and saved, or the reference rhythm feature of the at least one segment of reference speech is obtained from the rhythm information and saved. The processing in step S830 may, for example, be identical to the processing of the obtaining unit 520 described above with reference to Fig. 5 and can achieve similar technical effects, which are not repeated here. The processing flow 800 then ends at step S840.
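A minimal, in-memory sketch of steps S820 and S830 follows; the storage layer and the way the rhythm information is derived are placeholders (a real server would persist these records in a database), and all names are assumptions.

    STORE = {}  # text_id -> {"text": ..., "reference_audio": ..., "rhythm": ...}

    def store_material(text_id, text, reference_audio):
        """Step S820: store the predetermined text and the reference speech."""
        STORE[text_id] = {"text": text, "reference_audio": reference_audio}

    def save_rhythm(text_id, per_sentence_gaps, threshold=0.25, as_feature=True):
        """Step S830: save either the raw rhythm information (pause lengths) or
        the reference rhythm feature derived from it."""
        if as_feature:
            STORE[text_id]["rhythm"] = [
                tuple(1 if gap >= threshold else 0 for gap in gaps)
                for gaps in per_sentence_gaps
            ]
        else:
            STORE[text_id]["rhythm"] = per_sentence_gaps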
The data processing method of the above embodiment of the invention can achieve technical effects similar to those of the data processing device 500 described above, and is not described in detail here.
In addition, an embodiment of the invention further provides a speech processing method that is adapted to be executed in a computer and comprises the step of receiving speech recorded by a specific user for the predetermined text as reference speech and sending the reference speech to a specific server. Optionally, the method may further calculate the rhythm information of the reference speech from the reference speech, so that the rhythm information and the predetermined text are sent to the specific server in association with each other, or obtain the reference rhythm feature of the reference speech from the rhythm information, so that the reference rhythm feature and the predetermined text are sent to the specific server in association with each other.
An exemplary process of the above speech processing method is described below with reference to Fig. 9. As shown in Fig. 9, the exemplary processing flow 900 of the speech processing method according to an embodiment of the invention starts at step S910 and then proceeds to step S920.
In step S920, speech recorded by a specific user for the predetermined text is received as reference speech. Step S930 is then performed.
In step S930, the reference speech is sent to the specific server. The processing flow 900 then ends at step S940.
The processing of the flow 900 may, for example, be identical to the processing of the reference speech receiving unit 610 described above with reference to Fig. 6 and can achieve similar technical effects, which are not repeated here.
Fig. 10 shows another exemplary process of the above speech processing method. As shown in Fig. 10, the exemplary processing flow 1000 of the speech processing method according to an embodiment of the invention starts at step S1010 and then proceeds to step S1020.
In step S1020, speech recorded by a specific user for the predetermined text is received as reference speech. Step S1030 is then performed.
According to one implementation, the rhythm information of the reference speech may be obtained in step S1030, so that the rhythm information and the predetermined text are sent to the specific server in association with each other. The processing flow 1000 then ends at step S1040.
According to another implementation, the reference rhythm feature of the reference speech may be obtained from the rhythm information in step S1030, so that the reference rhythm feature and the predetermined text are sent to the specific server in association with each other. The processing flow 1000 then ends at step S1040.
The processing of the flow 1000 may, for example, be identical to the processing of the reference speech receiving unit 610 and the rhythm calculation unit 620 described above with reference to Fig. 6 and can achieve similar technical effects, which are not repeated here.
The speech processing method of the above embodiment of the invention can achieve technical effects similar to those of the speech processing device 600 described above, and is not described in detail here.
A11: A rhythm-based voice quality evaluation method, comprising the steps of: receiving user speech recorded by a user for a predetermined text, the predetermined text comprising one or more sentences and each sentence comprising one or more words; obtaining a user rhythm feature of the user speech; and calculating the voice quality of the user speech based on the correlation between a reference rhythm feature corresponding to the predetermined text and the user rhythm feature.
A12: In the voice quality evaluation method according to A11, the step of obtaining the user rhythm feature of the user speech comprises: forced-aligning the user speech with the predetermined text using a predetermined acoustic model so as to determine the correspondence between each word in the predetermined text and a portion of the user speech, and obtaining the user rhythm feature of the user speech based on the correspondence.
A13: In the voice quality evaluation method according to A12, the step of obtaining the user rhythm feature of the user speech based on the correspondence comprises: for each sentence of the predetermined text, obtaining the rhythm feature of the speech segment corresponding to that sentence in the user speech from the time intervals between the two speech blocks corresponding to each pair of adjacent words of the sentence in the user speech; and forming the rhythm feature of the user speech from the rhythm features of the speech segments corresponding to the respective sentences of the predetermined text in the user speech.
A14: In the voice quality evaluation method according to A13, for each sentence of the predetermined text: the information formed by all the time intervals determined for the sentence is taken as the rhythm feature of the speech segment corresponding to the sentence; or the durations of the speech blocks corresponding to the words of the sentence in the user speech are determined, and the information formed by the durations corresponding to all the words of the sentence is taken as the rhythm feature of the speech segment corresponding to the sentence.
A15: In the voice quality evaluation method according to A14, when the time interval between two adjacent words is greater than or equal to the predetermined interval threshold, a first value is set for that time interval; when the time interval is less than the predetermined interval threshold, a second value is set for that time interval; and the rhythm feature of the speech segment corresponding to the sentence is determined from the first and second values thus obtained.
A16: The voice quality evaluation method according to A11 further comprises: visually outputting the calculation result of the voice quality.
A17: In the voice quality evaluation method according to A16, the calculation result of the voice quality comprises: a score reflecting the voice quality; and/or the differences between the user rhythm feature and the reference rhythm feature.
A18: In the voice quality evaluation method according to A11: the predetermined text and the reference rhythm feature are obtained by downloading them in advance from a specific server; or the predetermined text is downloaded in advance from the specific server, and the reference rhythm feature is calculated from the rhythm information of at least one segment of reference speech downloaded in advance from the specific server.
A19: A data processing method adapted to be executed in a server, comprising the steps of: storing a predetermined text and at least one segment of reference speech corresponding to the predetermined text; and calculating rhythm information of the reference speech from the at least one segment of reference speech and saving the rhythm information, or calculating a reference rhythm feature of the at least one segment of reference speech from the rhythm information and saving the reference rhythm feature.
A20: A speech processing method adapted to be executed in a computer, comprising the steps of: receiving speech recorded by a specific user for a predetermined text as reference speech; and calculating rhythm information of the reference speech from the reference speech, so that the rhythm information and the predetermined text are sent to a specific server in association with each other, or calculating a reference rhythm feature of the reference speech from the rhythm information, so that the reference rhythm feature and the predetermined text are sent to the specific server in association with each other.
A21: A mobile terminal comprising the rhythm-based voice quality evaluation device according to the invention.
A22: A rhythm-based voice quality evaluation system comprising the rhythm-based voice quality evaluation device and the data processing device according to the invention.
A23: A rhythm-based voice quality evaluation system comprising: the rhythm-based voice quality evaluation device according to the invention; a server storing the predetermined text and reference rhythm information and/or a reference rhythm feature; and the speech processing device according to the invention.
Similarly, it should be understood that, in order to streamline the disclosure and to aid understanding of one or more of the various inventive aspects, the features of the invention are sometimes grouped together into a single embodiment, figure or description thereof in the above description of exemplary embodiments of the invention. However, the disclosed method should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, the inventive aspects lie in less than all features of a single foregoing disclosed embodiment. The claims following the detailed description are therefore expressly incorporated into the detailed description, with each claim standing on its own as a separate embodiment of the invention.
Those skilled in the art will understand that the modules, units or components of the devices in the examples disclosed herein may be arranged in the device as described in the embodiment, or may alternatively be located in one or more devices different from the device in the example. The modules in the foregoing examples may be combined into one module or may be divided into a plurality of sub-modules.
Those skilled in the art will further appreciate that the modules of the devices in an embodiment may be adaptively changed and arranged in one or more devices different from that embodiment. The modules, units or components in an embodiment may be combined into one module, unit or component, and may furthermore be divided into a plurality of sub-modules, sub-units or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract and drawings) may be replaced by an alternative feature serving the same, equivalent or similar purpose.
Furthermore, those skilled in the art will understand that, although some embodiments described herein include certain features that are included in other embodiments and not others, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
In addition, some of the described embodiments are described herein as methods, or combinations of method elements, that can be implemented by a processor of a computer system or by other means of carrying out the described functions. A device having a processor with the necessary instructions for carrying out such a method or method element therefore forms a means for carrying out the method or method element. Moreover, an element of a device embodiment described herein is an example of a means for carrying out the function performed by that element for the purpose of carrying out the invention.
As used herein, unless otherwise specified, the use of the ordinal adjectives "first", "second", "third", etc. to describe a common object merely indicates that different instances of like objects are being referred to, and is not intended to imply that the objects so described must be in a given sequence, whether temporally, spatially, in ranking or in any other manner.
Although the invention has been described in terms of a limited number of embodiments, those skilled in the art, having the benefit of the above description, will appreciate that other embodiments can be devised within the scope of the invention described herein. It should also be noted that the language used in this specification has been chosen principally for reasons of readability and instruction, and not to delineate or circumscribe the inventive subject matter. Many modifications and variations will therefore be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. With respect to the scope of the invention, the present disclosure is illustrative and not restrictive, and the scope of the invention is defined by the appended claims.

Claims (10)

1. A rhythm-based voice quality evaluation device, comprising:
a storage unit adapted to store a predetermined text and a reference rhythm feature corresponding to the predetermined text, the predetermined text comprising one or more sentences and each sentence comprising one or more words;
a user speech receiving unit adapted to receive user speech recorded by a user for the predetermined text;
a feature obtaining unit adapted to obtain a user rhythm feature of the user speech; and
a voice quality calculation unit adapted to calculate the voice quality of the user speech based on the correlation between the reference rhythm feature and the user rhythm feature.
2. The voice quality evaluation device according to claim 1, wherein the feature obtaining unit comprises:
an alignment sub-unit adapted to force-align the user speech with the predetermined text using a predetermined acoustic model, so as to determine the correspondence between each word in the predetermined text and a portion of the user speech; and
a feature calculation sub-unit adapted to calculate the user rhythm feature of the user speech based on the correspondence.
3. The voice quality evaluation device according to claim 2, wherein the feature calculation sub-unit is adapted to:
for each sentence of the predetermined text, obtain the rhythm feature of the speech segment corresponding to that sentence in the user speech from the time intervals between the two speech blocks corresponding to each pair of adjacent words of the sentence in the user speech; and
form the rhythm feature of the user speech from the obtained rhythm features of the speech segments corresponding to the respective sentences of the predetermined text in the user speech.
4. The voice quality evaluation device according to claim 3, wherein the feature calculation sub-unit is adapted, for each sentence of the predetermined text, to:
take the information formed by all the time intervals determined for the sentence as the rhythm feature of the speech segment corresponding to the sentence; or
determine the durations of the speech blocks corresponding to the words of the sentence in the user speech, and take the information formed by the durations corresponding to all the words of the sentence as the rhythm feature of the speech segment corresponding to the sentence.
5. The voice quality evaluation device according to claim 4, wherein, when the time interval between two adjacent words is greater than or equal to a predetermined interval threshold, a first value is set for that time interval; when the time interval is less than the predetermined interval threshold, a second value is set for that time interval; and the rhythm feature of the speech segment corresponding to the sentence is determined from the first and second values thus obtained.
6. The voice quality evaluation device according to claim 1, further comprising:
an output unit adapted to visually output the calculation result of the voice quality.
7. The voice quality evaluation device according to claim 6, wherein the output unit is adapted to output the following as the calculation result of the voice quality:
a score reflecting the voice quality; and/or
the differences between the user rhythm feature and the reference rhythm feature.
8. The voice quality evaluation device according to claim 1, wherein:
the storage unit is adapted to download the predetermined text and the reference rhythm feature in advance from a specific server; or
the storage unit is adapted to download the predetermined text and the rhythm information of at least one segment of reference speech in advance from the specific server, and to calculate the reference rhythm feature from the rhythm information of the at least one segment of reference speech.
9. A data processing device adapted to reside in a server, comprising:
a server storage unit adapted to store a predetermined text and at least one segment of reference speech corresponding to the predetermined text; and
a rhythm calculation unit adapted to calculate rhythm information of the reference speech from the at least one segment of reference speech and save the rhythm information in the server storage unit, or to calculate a reference rhythm feature of the at least one segment of reference speech from the rhythm information and save the reference rhythm feature in the server storage unit.
10. A speech processing device adapted to be executed in a computer, comprising:
a reference speech receiving unit adapted to receive speech recorded by a specific user for a predetermined text as reference speech; and
a rhythm calculation unit adapted to calculate rhythm information of the reference speech from the reference speech, so that the rhythm information and the predetermined text are sent to a specific server in association with each other, or to calculate a reference rhythm feature of the reference speech from the rhythm information, so that the reference rhythm feature and the predetermined text are sent to the specific server in association with each other.
CN201410734839.8A 2014-12-04 2014-12-04 Voice quality assessment equipment, method and system Active CN104361895B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410734839.8A CN104361895B (en) 2014-12-04 2014-12-04 Voice quality assessment equipment, method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410734839.8A CN104361895B (en) 2014-12-04 2014-12-04 Voice quality assessment equipment, method and system

Publications (2)

Publication Number Publication Date
CN104361895A true CN104361895A (en) 2015-02-18
CN104361895B CN104361895B (en) 2018-12-18

Family

ID=52529151

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410734839.8A Active CN104361895B (en) 2014-12-04 2014-12-04 Voice quality assessment equipment, method and system

Country Status (1)

Country Link
CN (1) CN104361895B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2006129814A1 (en) * 2005-05-31 2006-12-07 Canon Kabushiki Kaisha Speech synthesis method and apparatus
CN101739870A (en) * 2009-12-03 2010-06-16 深圳先进技术研究院 Interactive language learning system and method
CN102237081A (en) * 2010-04-30 2011-11-09 国际商业机器公司 Method and system for estimating rhythm of voice
CN104050965A (en) * 2013-09-02 2014-09-17 广东外语外贸大学 English phonetic pronunciation quality evaluation system with emotion recognition function and method thereof
CN103928023A (en) * 2014-04-29 2014-07-16 广东外语外贸大学 Voice scoring method and system

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109427327A (en) * 2017-09-05 2019-03-05 中国移动通信有限公司研究院 Voice-frequency telephony appraisal procedure, assessment equipment and computer storage medium
CN109427327B (en) * 2017-09-05 2022-03-08 中国移动通信有限公司研究院 Audio call evaluation method, evaluation device, and computer storage medium
CN107818797A (en) * 2017-12-07 2018-03-20 苏州科达科技股份有限公司 Voice quality assessment method, apparatus and its system
CN107818797B (en) * 2017-12-07 2021-07-06 苏州科达科技股份有限公司 Voice quality evaluation method, device and system
CN109215632A (en) * 2018-09-30 2019-01-15 科大讯飞股份有限公司 A kind of speech evaluating method, device, equipment and readable storage medium storing program for executing
CN109410984A (en) * 2018-12-20 2019-03-01 广东小天才科技有限公司 A kind of method and electronic equipment of bright reading score
CN109410984B (en) * 2018-12-20 2022-12-27 广东小天才科技有限公司 Reading scoring method and electronic equipment
CN113327615A (en) * 2021-08-02 2021-08-31 北京世纪好未来教育科技有限公司 Voice evaluation method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN104361895B (en) 2018-12-18

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant