CN104361895B - Voice quality assessment equipment, method and system - Google Patents

Voice quality assessment equipment, method and system

Info

Publication number
CN104361895B
CN104361895B CN201410734839.8A CN201410734839A CN104361895B CN 104361895 B CN104361895 B CN 104361895B CN 201410734839 A CN201410734839 A CN 201410734839A CN 104361895 B CN104361895 B CN 104361895B
Authority
CN
China
Prior art keywords
voice
rhythm
sentence
user
rhythm characteristic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201410734839.8A
Other languages
Chinese (zh)
Other versions
CN104361895A (en)
Inventor
林晖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SHANGHAI LIULISHUO INFORMATION TECHNOLOGY Co Ltd
Original Assignee
SHANGHAI LIULISHUO INFORMATION TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SHANGHAI LIULISHUO INFORMATION TECHNOLOGY Co Ltd
Priority to CN201410734839.8A
Publication of CN104361895A
Application granted
Publication of CN104361895B
Legal status: Active

Landscapes

  • Telephonic Communication Services (AREA)

Abstract

The present invention provides a rhythm-based voice quality assessment device, method and system, a data processing device and method, a speech processing device and method, and a mobile terminal, to overcome the problem that existing speech technology does not take information about speech rhythm into account when evaluating a user's pronunciation. The voice quality assessment device comprises: a storage unit adapted to store a predetermined text and a reference rhythm feature corresponding to the predetermined text, the predetermined text comprising one or more sentences and each sentence comprising one or more words; a user speech receiving unit adapted to receive user speech recorded by a user for the predetermined text; a feature acquiring unit adapted to obtain a user rhythm feature of the user speech; and a voice quality computing unit adapted to compute the voice quality of the user speech based on the correlation between the reference rhythm feature and the user rhythm feature. The above technique of the present invention can be applied in the field of speech technology.

Description

Voice quality assessment equipment, method and system
Technical field
The present invention relates to the field of speech technology, and more particularly to a rhythm-based voice quality assessment device, method and system, a data processing device and method, a speech processing device and method, and a mobile terminal.
Background Art
With the development of the Internet, Internet-based language learning applications have also developed rapidly. In some language learning applications, the application provider sends learning material to a client over the Internet; the user obtains the learning material via the client, operates on the client according to the instructions of the learning material (for example, entering text, recording speech or making selections), and receives feedback, thereby improving his or her language ability.
For language learning, besides grammar and vocabulary, an important aspect is learning to listen to and, especially, to speak the language. Every language has its own rhythm of speaking, which varies with the scenario. Generally, when speaking, people pause appropriately after certain words in a sentence, and the rhythm reflects after which words a pause occurs and how long each pause lasts. In addition, when a word has more than one syllable, there are also certain pauses between the pronunciations of its syllables. Therefore, when learning to speak a language, a user also needs to learn its speaking rhythm and/or pronunciation rhythm.
In existing speech technology, the user records speech with the recording device of a client; the system splits the recorded speech according to the corresponding text and compares the user's speech with an existing acoustic model word by word, so as to give the user feedback on whether each word is pronounced correctly. However, existing speech technology does not take any information about speech rhythm into account when evaluating the user's pronunciation, and therefore cannot help the learner acquire the rhythm of speaking and/or pronunciation.
Summary of the invention
A brief summary of the invention is given below in order to provide a basic understanding of certain aspects of the invention. It should be understood that this summary is not an exhaustive overview of the invention. It is not intended to identify key or important parts of the invention, nor to limit the scope of the invention. Its sole purpose is to present certain concepts in a simplified form as a prelude to the more detailed description that follows.
In view of this, the present invention provides a rhythm-based voice quality assessment device, method and system, a data processing device and method, a speech processing device and method, and a mobile terminal, at least to solve the problem that existing speech technology does not consider information about speech rhythm when evaluating a user's pronunciation.
According to an aspect of the invention, there is provided a rhythm-based voice quality assessment device, comprising: a storage unit adapted to store a predetermined text and a reference rhythm feature corresponding to the predetermined text, the predetermined text comprising one or more sentences and each sentence comprising one or more words; a user speech receiving unit adapted to receive user speech recorded by a user for the predetermined text; a feature acquiring unit adapted to obtain a user rhythm feature of the user speech; and a voice quality computing unit adapted to compute the voice quality of the user speech based on the correlation between the reference rhythm feature and the user rhythm feature.
According to another aspect of the present invention, there is also provided a data processing device, adapted to be executed in a server and comprising: a server storage unit adapted to store a predetermined text and at least one segment of reference speech corresponding to the predetermined text; and a rhythm calculating unit adapted to calculate the rhythm information of the at least one segment of reference speech and store the rhythm information in the server storage unit, or to calculate a reference rhythm feature of the at least one segment of reference speech from the rhythm information and store the reference rhythm feature in the server storage unit.
According to another aspect of the present invention, there is also provided a speech processing device, adapted to be executed in a computer and comprising: a reference speech receiving unit adapted to receive speech recorded by a specific user for a predetermined text, as reference speech; and a rhythm calculating unit adapted to calculate the rhythm information of the reference speech and send the rhythm information, in association with the predetermined text, to a predetermined server, or to calculate a reference rhythm feature of the reference speech from the rhythm information and send the reference rhythm feature, in association with the predetermined text, to the predetermined server.
According to another aspect of the present invention, there is also provided a rhythm-based voice quality assessment method, comprising the steps of: receiving user speech recorded by a user for a predetermined text, the predetermined text comprising one or more sentences and each sentence comprising one or more words; obtaining a user rhythm feature of the user speech; and calculating the voice quality of the user speech based on the correlation between a reference rhythm feature corresponding to the predetermined text and the user rhythm feature.
According to another aspect of the present invention, there is also provided a data processing method, adapted to be executed in a server and comprising the steps of: storing a predetermined text and at least one segment of reference speech corresponding to the predetermined text; and calculating the rhythm information of the at least one segment of reference speech and saving the rhythm information, or calculating a reference rhythm feature of the at least one segment of reference speech from the rhythm information and saving the reference rhythm feature.
According to another aspect of the present invention, there is also provided a speech processing method, adapted to be executed in a computer and comprising the steps of: receiving speech recorded by a specific user for a predetermined text, as reference speech; and calculating the rhythm information of the reference speech and sending the rhythm information, in association with the predetermined text, to a predetermined server, or calculating a reference rhythm feature of the reference speech from the rhythm information and sending the reference rhythm feature, in association with the predetermined text, to the predetermined server.
According to another aspect of the present invention, there is also provided a mobile terminal comprising the rhythm-based voice quality assessment device described above.
According to a further aspect of the invention, there is also provided a rhythm-based voice quality assessment system comprising the rhythm-based voice quality assessment device described above and the data processing device described above.
The above rhythm-based voice quality assessment scheme according to embodiments of the present invention computes the voice quality of the user speech based on the correlation between the user rhythm feature obtained from the user speech and the reference rhythm feature, and can achieve at least one of the following benefits: information about speech rhythm is taken into account when computing the voice quality of the user speech, so that the user can learn from the result how accurate his or her recorded speech is in terms of rhythm, which helps the user decide whether the speaking rhythm and/or pronunciation rhythm needs to be corrected; the calculation and evaluation of the user speech are completed on a client computer or a client mobile terminal, so that the user can learn offline; the amount of calculation is small; time is saved; the operation is simpler and more convenient; and when the representation of the user rhythm feature changes, the reference rhythm feature calculated from the rhythm information of the reference speech can easily be expressed in the same form as the user rhythm feature, so that the processing of the voice quality assessment device is more flexible, more convenient and more practical.
These and other advantages of the present invention will become more apparent from the following detailed description of preferred embodiments of the invention in conjunction with the accompanying drawings.
Brief Description of the Drawings
The present invention may be better understood by referring to the following description given in conjunction with the accompanying drawings, in which the same or similar reference numerals are used throughout to denote the same or similar components. The drawings, together with the following detailed description, are included in and form part of this specification and serve to further illustrate preferred embodiments of the present invention and to explain the principles and advantages of the present invention. In the drawings:
Fig. 1 is a structural block diagram schematically showing a mobile terminal 100;
Fig. 2 is a block diagram schematically showing an exemplary structure of a rhythm-based voice quality assessment device 200 according to an embodiment of the present invention;
Fig. 3 is a block diagram schematically showing a possible structure of the feature acquiring unit 230 shown in Fig. 2;
Fig. 4 is a block diagram schematically showing an exemplary structure of a rhythm-based voice quality assessment device 400 according to another embodiment of the present invention;
Fig. 5 is a block diagram schematically showing an exemplary structure of a data processing device 500 according to an embodiment of the present invention;
Fig. 6 is a block diagram schematically showing an exemplary structure of a speech processing device 600 according to an embodiment of the present invention;
Fig. 7 is a flowchart schematically showing an exemplary process of the rhythm-based voice quality assessment method according to an embodiment of the present invention;
Fig. 8 is a flowchart schematically showing an exemplary process of the data processing method according to an embodiment of the present invention;
Fig. 9 is a flowchart schematically showing an exemplary process of the speech processing method according to an embodiment of the present invention; and
Fig. 10 is a flowchart schematically showing another exemplary process of the speech processing method according to an embodiment of the present invention.
It will be appreciated by those skilled in the art that the elements in the drawings are shown only for simplicity and clarity and are not necessarily drawn to scale. For example, the sizes of certain elements may be exaggerated relative to other elements in order to help improve the understanding of the embodiments of the present invention.
Detailed Description of Embodiments
Exemplary embodiments of the present invention are described below in conjunction with the accompanying drawings. For clarity and conciseness, not all features of an actual implementation are described in this specification. It should be understood, however, that in developing any such actual embodiment, many implementation-specific decisions must be made in order to achieve the developer's specific goals, for example compliance with system- and business-related constraints, and that these constraints may vary from one implementation to another. Moreover, it should be understood that, although such development work may be extremely complex and time-consuming, it is merely a routine task for those skilled in the art who benefit from this disclosure.
It should also be noted that, in order to avoid obscuring the present invention with unnecessary details, only the device structures and/or processing steps closely related to the solution of the present invention are shown in the drawings, and other details of little relevance to the present invention are omitted.
An embodiment of the present invention provides a rhythm-based voice quality assessment device, comprising: a storage unit adapted to store a predetermined text and a reference rhythm feature corresponding to the predetermined text, the predetermined text comprising one or more sentences and each sentence comprising one or more words; a user speech receiving unit adapted to receive user speech recorded by a user for the predetermined text; a feature acquiring unit adapted to obtain a user rhythm feature of the user speech; and a voice quality computing unit adapted to compute the voice quality of the user speech based on the correlation between the reference rhythm feature and the user rhythm feature.
The above rhythm-based voice quality assessment device according to an embodiment of the present invention may be an application that executes its processing in a conventional desktop or laptop computer (not shown) or the like, a client application that executes its processing in a mobile terminal (as shown in Fig. 1) (one of the applications 154 in the mobile terminal 100 shown in Fig. 1), or a web application accessed through a browser on the above-mentioned conventional desktop or laptop computer or mobile terminal, etc.
Fig. 1 is a structural block diagram of the mobile terminal 100. The mobile terminal 100 with multi-touch capability may comprise a memory interface 102, one or more data processors, image processors and/or central processing units 104, and a peripheral interface 106.
The memory interface 102, the one or more processors 104 and/or the peripheral interface 106 may be discrete components or may be integrated in one or more integrated circuits. In the mobile terminal 100, the various elements may be coupled by one or more communication buses or signal lines. Sensors, devices and subsystems may be coupled to the peripheral interface 106 to facilitate a variety of functions. For example, a motion sensor 110, a light sensor 112 and a distance sensor 114 may be coupled to the peripheral interface 106 to facilitate functions such as orientation, illumination and ranging. Other sensors 116 may likewise be connected to the peripheral interface 106, such as a positioning system (e.g. a GPS receiver), a temperature sensor, a biometric sensor or other sensor devices, thereby helping to implement related functions.
A camera subsystem 120 and an optical sensor 122 may be used to facilitate camera functions such as recording photos and video clips, wherein the optical sensor may be, for example, a charge-coupled device (CCD) or a complementary metal-oxide-semiconductor (CMOS) optical sensor.
Communication functions may be facilitated by one or more wireless communication subsystems 124, which may comprise radio-frequency receivers and transmitters and/or optical (e.g. infrared) receivers and transmitters. The particular design and implementation of the wireless communication subsystem 124 may depend on the one or more communication networks supported by the mobile terminal 100. For example, the mobile terminal 100 may comprise a communication subsystem 124 designed to support a GSM network, a GPRS network, an EDGE network, a Wi-Fi or WiMax network, and a Bluetooth™ network.
An audio subsystem 126 may be coupled with a speaker 128 and a microphone 130 to facilitate voice-enabled functions such as speech recognition, speech reproduction, digital recording and telephony functions.
An I/O subsystem 140 may comprise a touch screen controller 142 and/or one or more other input controllers 144.
The touch screen controller 142 may be coupled to a touch screen 146. For example, the touch screen 146 and the touch screen controller 142 may use any of a variety of touch-sensing technologies to detect contact and movement or pauses made therewith, including but not limited to capacitive, resistive, infrared and surface acoustic wave technologies.
The one or more other input controllers 144 may be coupled to other input/control devices 148, for example one or more buttons, rocker switches, thumb wheels, infrared ports, USB ports, and/or pointer devices such as a stylus. The one or more buttons (not shown) may include an up/down button for controlling the volume of the speaker 128 and/or the microphone 130.
The memory interface 102 may be coupled with a memory 150. The memory 150 may comprise high-speed random access memory and/or non-volatile memory, such as one or more magnetic disk storage devices, one or more optical storage devices, and/or flash memory (e.g. NAND, NOR).
The memory 150 may store an operating system 152, such as Android, iOS or Windows Phone. The operating system 152 may include instructions for handling basic system services and performing hardware-dependent tasks. The memory 150 may also store applications 154. At run time, these applications are loaded from the memory 150 onto the processor 104 and run on top of the operating system via the processor 104, using the interfaces provided by the operating system and the underlying hardware to implement various user-desired functions such as instant messaging, web browsing and picture management. An application may be provided independently of the operating system or may be bundled with the operating system. The applications 154 include the voice quality assessment device 200 according to the present invention.
An example of the rhythm-based voice quality assessment device 200 according to an embodiment of the present invention is described below in conjunction with Fig. 2.
As shown in Fig. 2, the voice quality assessment device 200 comprises a storage unit 210, a user speech receiving unit 220, a feature acquiring unit 230 and a voice quality computing unit 240.
As shown in Fig. 2, in the voice quality assessment device 200, the storage unit 210 stores a predetermined text and a reference rhythm feature corresponding to the predetermined text. The predetermined text comprises one or more sentences, and each sentence comprises one or more words, where each word in a sentence may typically comprise a plurality of letters or at least one character.
According to one implementation, when the language of the predetermined text is a language whose words are composed of letters, such as English, the predetermined text may, in addition to the textual content of the one or more sentences and the one or more words of each sentence, optionally also include information such as the syllables and/or phonemes of each word, and the correspondence between the syllable and/or phoneme information of each word and the letters that make up that word.
It should be noted that, although the language of the predetermined text is English in the above example, the language of the predetermined text is in practice not limited to English and may be any language, such as Chinese, French or German.
According to one implementation, the predetermined text and the reference rhythm feature may be downloaded in advance from a predetermined server and saved in the storage unit 210. The predetermined server mentioned here may be, for example, the server in which the data processing device 500 described below in conjunction with Fig. 5 resides. In this mode, the amount of calculation is small and no extra time is needed to compute the reference rhythm feature, which saves time and makes the operation simpler and more convenient.
According to another implementation, the predetermined text may be downloaded in advance from the predetermined server without downloading the reference rhythm feature. In this implementation, the rhythm information of the reference speech may be downloaded from the predetermined server, and the reference rhythm feature may then be computed from the rhythm information of the reference speech. The downloaded predetermined text and the computed reference rhythm feature may thus be stored in the storage unit 210. In this manner, when the representation of the user rhythm feature changes, the reference rhythm feature computed from the rhythm information of the reference speech can easily be expressed in the same form as the user rhythm feature, so that the processing of the voice quality assessment device 200 is more flexible, more convenient and more practical.
It should be noted that, for the process of computing the reference rhythm feature from the rhythm information of the reference speech, reference may be made to the processing described below in conjunction with Fig. 5; it is not elaborated here.
Here, the reference speech may be speech recorded in advance for the predetermined text by a specific user (for example, a native speaker of the language of the predetermined text, or a professional language teacher of the language of the predetermined text). The rhythm information may relate to one segment of reference speech or to multiple segments of reference speech. The reference rhythm feature of multiple segments of reference speech may be obtained by averaging the reference rhythm features of the individual segments of reference speech.
When the user starts the voice quality assessment device 200, the storage unit 210 already contains, as described above, the predetermined text and the reference rhythm feature corresponding to the predetermined text. The textual content corresponding to the speech to be recorded (i.e. the above predetermined text) is then presented to the user through a display device such as the touch screen 146 of the mobile terminal 100, and the user is prompted to record the corresponding speech. The user may thus record the corresponding speech as user speech through an input device such as the microphone 130 of the mobile terminal 100, and the user speech is received by the user speech receiving unit 220.
The user speech receiving unit 220 then transmits the received user speech to the feature acquiring unit 230, and the feature acquiring unit 230 obtains the user rhythm feature of the user speech.
Fig. 3 shows a possible exemplary structure of the feature acquiring unit 230. In this example, the feature acquiring unit 230 may comprise an alignment subunit 310 and a feature calculating subunit 320.
As shown in Fig. 3, the alignment subunit 310 may use a predetermined acoustic model to perform forced alignment between the user speech and the predetermined text, so as to determine the correspondence between each word in the predetermined text, and/or each syllable of each word, and/or each phoneme of each syllable, and the corresponding parts of the user speech.
In general, an acoustic model is trained on recordings of a large number of native speakers. Using the acoustic model, the likelihood that input speech corresponds to a known text can be computed, and the input speech can then be force-aligned with the known text. Here, the "input speech" may be the user speech or the reference speech mentioned below, and the "known text" may be the predetermined text.
For the relevant techniques of acoustic models, reference may be made to the material at http://mi.eng.cam.ac.uk/~mjfg/ASRU_talk09.pdf; for the relevant techniques of forced alignment, reference may be made to the material at http://www.isip.piconepress.com/projects/speech/software/tutorials/production/fundamentals/v1.0/section_04/s04_04_p01.html and http://www.phon.ox.ac.uk/jcoleman/BAAP_ASR.pdf, or other existing techniques may be used; they are not elaborated here.
In addition, it should be noted that, by performing forced alignment between the user speech and the predetermined text, the correspondence between each sentence in the predetermined text and the corresponding part of the user speech (e.g. a certain speech segment) can be determined; that is, the speech segment in the user speech corresponding to each sentence in the predetermined text can be determined.
Beyond this, as described above, forced alignment can also provide, as needed, any one or more of the following three kinds of correspondence: the correspondence between each word in the predetermined text and a part of the user speech (e.g. a certain speech block); the correspondence between each syllable of each word in the predetermined text and a part of the user speech (e.g. a certain speech block); and the correspondence between each phoneme of each syllable of each word in the predetermined text and a part of the user speech (e.g. a certain speech block).
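To make this concrete, the following Python sketch (illustrative only; the data structures, names and timestamps are assumptions for illustration and are not prescribed by this embodiment) represents a word-level forced-alignment result for one sentence in the form used by the examples below; syllable- and phoneme-level correspondences could be represented analogously.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class AlignedWord:
        """A word of the predetermined text mapped to its speech block in the user speech."""
        word: str
        start: float  # start time in seconds within the utterance
        end: float    # end time in seconds within the utterance

    @dataclass
    class AlignedSentence:
        """One sentence of the predetermined text together with its aligned words."""
        text: str
        words: List[AlignedWord]

    # Hypothetical alignment of the sentence used in the examples below.
    example_alignment = AlignedSentence(
        text="how are you today",
        words=[
            AlignedWord("how", 0.0, 0.2),
            AlignedWord("are", 0.5, 0.6),
            AlignedWord("you", 0.8, 1.0),
            AlignedWord("today", 1.3, 1.5),
        ],
    )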
In this way, based on the correspondence determined by the alignment subunit 310, the feature calculating subunit 320 can compute the user rhythm feature of the user speech.
According to one implementation, for each sentence of the predetermined text, the feature calculating subunit 320 may obtain the rhythm feature of the speech segment corresponding to that sentence in the user speech according to the time interval between the two speech blocks in the user speech corresponding to each pair of adjacent words in the sentence. The rhythm features of the speech segments corresponding to the sentences of the predetermined text then together form the rhythm feature of the whole user speech.
In one example, for each sentence in the predetermined text, the feature calculating subunit 320 may use the information formed by all the time intervals determined for that sentence as the rhythm feature of the speech segment corresponding to that sentence.
For example, for the sentence "how are you today" in the predetermined text, forced alignment yields the speech segment Use in the user speech corresponding to the sentence, in which the words "how", "are", "you" and "today" correspond in turn to the speech blocks Ub1, Ub2, Ub3 and Ub4 in the user speech. Forced alignment also yields the pause duration between the two speech blocks in the user speech corresponding to each pair of adjacent words in the sentence, i.e. the following rhythm information:
(0.2-0.5), (0.6-0.8), (1.0-1.3) (assuming the unit is seconds).
Here, (0.2-0.5) indicates that the pause between Ub1 and Ub2 lasts from time point 0.2 to time point 0.5, i.e. the time interval is 0.3 seconds; (0.6-0.8) indicates that the pause between Ub2 and Ub3 lasts from time point 0.6 to time point 0.8, i.e. the time interval is 0.2 seconds; and (1.0-1.3) indicates that the pause between Ub3 and Ub4 lasts from time point 1.0 to time point 1.3, i.e. the time interval is 0.3 seconds. It should be noted that in this example every pause is taken as a time interval, regardless of its length.
Thus, in this example, the information formed by the time intervals obtained for the sentence "how are you today" in the user speech may be used as the rhythm feature of the corresponding speech segment Use, where the information may be expressed, for example but not exclusively, in the form of a vector, i.e. (0.3, 0.2, 0.3).
A rhythm feature formed from the time intervals between the speech parts corresponding to the words directly reflects the lengths of the pauses between words when the user reads the sentence.
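A minimal sketch of this interval-based feature (the function name and the list-of-spans input are assumptions; the time spans are the same hypothetical ones as in the alignment sketch above):

    def pause_interval_feature(word_spans):
        """Rhythm feature of one sentence: the pause length, in seconds, between
        every pair of adjacent words, given each word's (start, end) time span."""
        return [round(word_spans[i + 1][0] - word_spans[i][1], 3)
                for i in range(len(word_spans) - 1)]

    # Hypothetical spans of Ub1..Ub4 for "how are you today".
    user_spans = [(0.0, 0.2), (0.5, 0.6), (0.8, 1.0), (1.3, 1.5)]
    print(pause_interval_feature(user_spans))  # [0.3, 0.2, 0.3]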
In another example, for each sentence in the predetermined text, the feature calculating subunit 320 may also determine the duration of the speech block in the user speech corresponding to each word in the sentence, and use the information formed by the durations determined for all the words in the sentence as the rhythm feature of the speech segment corresponding to that sentence.
For example, still taking the sentence "how are you today" as an example, forced alignment yields the duration of the speech block in the user speech corresponding to each word in the sentence, i.e. the following rhythm information:
(0-0.2), (0.5-0.6), (0.8-1.0), (1.3-1.5) (assuming the unit is seconds).
That is, the duration of Ub1 is 0.2 seconds, the duration of Ub2 is 0.1 seconds, the duration of Ub3 is 0.2 seconds, and the duration of Ub4 is also 0.2 seconds.
Thus, in this example, the information formed by the word durations obtained for the sentence "how are you today" may be used as the rhythm feature of the corresponding speech segment Use, where the information may be expressed, for example but not exclusively, in the form of a vector, i.e. (0.2, 0.1, 0.2, 0.2).
A rhythm feature formed from the durations of the speech parts corresponding to the words directly reflects the pronunciation duration of each word when the user reads the sentence, and also indirectly reflects the pauses between words.
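A corresponding sketch of the duration-based feature, again using the same hypothetical spans (names are assumptions for illustration):

    def word_duration_feature(word_spans):
        """Rhythm feature of one sentence: the duration of the speech block
        corresponding to each word."""
        return [round(end - start, 3) for start, end in word_spans]

    user_spans = [(0.0, 0.2), (0.5, 0.6), (0.8, 1.0), (1.3, 1.5)]
    print(word_duration_feature(user_spans))  # [0.2, 0.1, 0.2, 0.2]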
In addition, according to another implementation, the information may be determined by comparing the time interval with a predetermined interval threshold. In other words, in this implementation, a time interval greater than or equal to the predetermined interval threshold is set to a first value (e.g. 1), and a time interval less than the predetermined interval threshold is set to a second value (e.g. 0). The predetermined interval threshold may, for example, be set empirically or determined by testing, and is not elaborated here.
For example, still taking the sentence "how are you today" as an example, forced alignment yields the pause duration between the two speech blocks in the user speech corresponding to each pair of adjacent words in the sentence, i.e. the following rhythm information:
(0.2-0.5), (0.6-0.8), (1.0-1.3) (assuming the unit is seconds).
Assume that in this example the predetermined interval threshold is 0.25 seconds. The pause between Ub1 and Ub2 is 0.3 seconds long, which is greater than the predetermined interval threshold, so the attribute value of this time interval is set to 1; the pause between Ub2 and Ub3 is 0.2 seconds long, which is less than the predetermined interval threshold, so the attribute value of this time interval is set to 0; and the time interval between Ub3 and Ub4 is 0.3 seconds, which is greater than the predetermined interval threshold, so the attribute value of this time interval is set to 1.
The attribute value of a time interval is thus expressed with the values "0" and "1", where "0" indicates a very short time interval and "1" indicates a longer time interval. In this example, the information formed by the attribute values of the time intervals obtained for the sentence "how are you today" may be used as the rhythm feature of the corresponding speech segment Use, where the information may be expressed, for example but not exclusively, in the form of a vector, i.e. (1, 0, 1).
Assume that the predetermined text contains two sentences in total, corresponding to the speech segments Use1 and Use2 in the user speech, and that the rhythm feature computed for Use1 is (1, 0, 1) and the rhythm feature of Use2 is (0, 1, 1); the rhythm feature of the user speech is then {(1, 0, 1), (0, 1, 1)}.
It can be seen that, in the above manner, the resulting rhythm feature (the vector of 0/1 values mentioned above) more intuitively reflects the intervals between words. In this case, by setting the predetermined interval threshold, relatively short pauses (less than the predetermined interval threshold) between words (and/or between syllables) can be distinguished from relatively long pauses (greater than or equal to the predetermined interval threshold), which avoids the influence of short pauses on the resulting rhythm feature and better matches people's habits of speaking and/or pronouncing.
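A sketch of this thresholded variant; the 0.25-second threshold is the value assumed in the example above, and the function name is an assumption:

    def binarized_pause_feature(word_spans, threshold=0.25):
        """Set each inter-word pause to 1 (>= threshold) or 0 (< threshold), so
        that short pauses do not influence the rhythm feature."""
        gaps = [word_spans[i + 1][0] - word_spans[i][1]
                for i in range(len(word_spans) - 1)]
        return [1 if gap >= threshold else 0 for gap in gaps]

    user_spans = [(0.0, 0.2), (0.5, 0.6), (0.8, 1.0), (1.3, 1.5)]
    print(binarized_pause_feature(user_spans))  # [1, 0, 1]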
Several examples of obtaining the user rhythm feature have been given above. In subsequent processing, a reference rhythm feature in the same form is compared with the user rhythm feature (e.g. the similarity or distance between the two is computed) and the comparison result is provided to the user, so that the user can quickly find out whether his or her speaking rhythm and/or pronunciation rhythm is deficient and can immediately see how to improve it (for example, the pause between two particular words should be longer or shorter, or the pronunciation of a particular word should be longer, or the pause between two particular syllables of a word should last a certain time, etc.). It should be noted that in the examples above the rhythm feature of a sentence is computed from the intervals between words; in other examples, the rhythm feature of each word may also be computed from the intervals between the syllables within the word, and the process is similar, so it is not repeated here.
It should be noted that, in the embodiments of the present invention, the speaking rhythm refers to the pauses between words, and the pronunciation rhythm refers to the pauses between syllables.
In this way, after the feature acquiring unit 230 obtains the user rhythm feature of the user speech, the voice quality computing unit 240 can compute the voice quality of the user speech based on the correlation between the user rhythm feature and the reference rhythm feature.
According to one implementation, the voice quality computing unit 240 may compute the correlation between the user rhythm feature and the reference rhythm feature, and obtain from the correlation a score describing the voice quality of the user speech.
In one example, assume that the user rhythm feature obtained by the feature acquiring unit 230 is {(1, 0, 1), (0, 1, 1)} and that the reference rhythm feature is {(1, 0, 0), (1, 1, 1)}. The similarity between {(1, 0, 1), (0, 1, 1)} and {(1, 0, 0), (1, 1, 1)} can then be computed and used as the score describing the voice quality of the user speech. That is, the higher the computed similarity between the user rhythm feature and the reference rhythm feature, the higher the voice quality of the user speech.
The similarity between the user rhythm feature {(1, 0, 1), (0, 1, 1)} and the reference rhythm feature {(1, 0, 0), (1, 1, 1)} may be obtained from the similarities between the vectors at corresponding positions, for example by first computing the vector similarity between (1, 0, 1) and (1, 0, 0) and the vector similarity between (0, 1, 1) and (1, 1, 1), and then taking the weighted average or weighted sum of all the computed vector similarities as the similarity between the user rhythm feature and the reference rhythm feature. When computing the weighted average or weighted sum of the vector similarities, the weight of the sentence in the predetermined text corresponding to each vector may be used as the weight of that vector's similarity. The weight of each sentence in the predetermined text may be set empirically in advance, or may be set using the reference rhythm feature (for example, a higher weight may be assigned to a sentence whose corresponding vector in the reference rhythm feature contains more elements equal to "1"), or all weights may simply be set to 1, etc.
In addition, in another example, the distance between the user rhythm feature and the reference rhythm feature may be computed as the correlation between them, and the score describing the voice quality of the user speech may be obtained from this distance; for example, the reciprocal of the distance may be used as the score describing the voice quality of the user speech. That is, the greater the computed distance between the user rhythm feature and the reference rhythm feature, the worse the voice quality of the user speech.
For example, the distance between the user rhythm feature {(1, 0, 1), (0, 1, 1)} and the reference rhythm feature {(1, 0, 0), (1, 1, 1)} may be obtained from the distances between the vectors at corresponding positions: first compute the distance between (1, 0, 1) and (1, 0, 0) and the distance between (0, 1, 1) and (1, 1, 1), and then take the weighted average or weighted sum of all the computed inter-vector distances as the distance between the user rhythm feature and the reference rhythm feature. The weights in this weighted average or weighted sum may be set in the same way as the weights used when computing the weighted average or weighted sum of the vector similarities, so they are not repeated here.
In addition, it should be noted that if the reference rhythm feature stored in the storage unit 210 is not expressed in the same form as the user rhythm feature (e.g. in vector form), it may first be converted into the same form, and the similarity or distance between the two may then be computed.
It should also be noted that the voice quality computing unit 240 may compute the correlation (i.e. the similarity or distance) between the user rhythm feature and the reference rhythm feature sentence by sentence and then obtain the quality score of the user speech sentence by sentence (i.e. obtain in turn the quality score of each speech segment in the user speech corresponding to each sentence of the predetermined text). Alternatively, the voice quality computing unit 240 may compute the correlation (i.e. the similarity or distance) between the user rhythm feature of the whole user speech and the reference rhythm feature and then obtain a quality score describing the whole user speech.
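The following sketch illustrates the weighted sentence-by-sentence comparison described above; cosine similarity is used here only as one possible vector similarity measure, since the embodiment does not prescribe a particular one, and the function names are assumptions:

    import math

    def cosine_similarity(u, v):
        """One possible per-sentence vector similarity measure."""
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

    def voice_quality_score(user_feature, reference_feature, weights=None):
        """Weighted average of the per-sentence similarities between the user
        rhythm feature and the reference rhythm feature."""
        if weights is None:
            weights = [1.0] * len(user_feature)  # all sentence weights set to 1
        total = sum(w * cosine_similarity(u, r)
                    for w, u, r in zip(weights, user_feature, reference_feature))
        return total / sum(weights)

    user_feature = [(1, 0, 1), (0, 1, 1)]
    reference_feature = [(1, 0, 0), (1, 1, 1)]
    print(round(voice_quality_score(user_feature, reference_feature), 3))  # ~0.762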
Another example of the rhythm-based voice quality assessment device according to an embodiment of the present invention is described below with reference to Fig. 4.
In the example shown in Fig. 4, the voice quality assessment device 400 comprises, in addition to a storage unit 410, a user speech receiving unit 420, a feature acquiring unit 430 and a voice quality computing unit 440, an output unit 450. The storage unit 410, the user speech receiving unit 420, the feature acquiring unit 430 and the voice quality computing unit 440 in the voice quality assessment device 400 shown in Fig. 4 may each have the same structure and function as the corresponding unit in the voice quality assessment device 200 described above in conjunction with Fig. 2, and can achieve similar technical effects, so they are not repeated here.
The output unit 450 can output the calculation result of the voice quality in a visual manner, for example by presenting the calculation result of the voice quality to the user through a display device such as the touch screen 146 of the mobile terminal 100.
According to one implementation, the output unit 450 may output a score reflecting the voice quality as the calculation result of the voice quality.
For example, the output unit 450 may visually output (e.g. sentence by sentence) the score of the voice quality of each speech segment in the user speech corresponding to each sentence of the predetermined text. In this way, the user can learn the accuracy of the speaking rhythm and/or pronunciation rhythm of each sentence he or she has spoken; in particular, when the score of a certain sentence is low, the user can immediately recognize that the rhythm of that sentence needs to be corrected, and the learning is more targeted.
As another example, the output unit 450 may visually output a score reflecting the voice quality of the whole user speech. In this way, the user can recognize as a whole whether the rhythm of the segment of speech he or she has spoken is accurate.
In addition, in other examples, the output unit 450 may also visually output, at the same time, the score of the voice quality of each speech segment in the user speech corresponding to each sentence of the predetermined text and the score reflecting the voice quality of the whole user speech.
According to another implementation, the output unit 450 may visually output the difference between the user rhythm feature and the reference rhythm feature as the calculation result of the voice quality.
For example, the output unit 450 may display the standard pronunciation and the user speech in two parallel rows, where a " ' " mark between two words indicates a pause. If the pause is the same as in the standard pronunciation, it may be shown in an ordinary style, for example a green " ' "; if it is different, the pause is highlighted, for example as a bold red " ' ".
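A text-only sketch of such a display (the markers, layout and pause values are assumptions; a graphical client would use colour and weight instead of the "!" flag):

    def render_pause_comparison(words, ref_pauses, user_pauses, threshold=0.25):
        """Show the standard pronunciation and the user speech in two parallel
        rows. A ' between two words marks a pause of at least `threshold`
        seconds; a ! flags a position where the user differs from the
        reference, which a graphical client could highlight in bold red."""
        ref_bits = [gap >= threshold for gap in ref_pauses]
        usr_bits = [gap >= threshold for gap in user_pauses]

        def row(bits, flag_diffs):
            pieces = [words[0]]
            for i, has_pause in enumerate(bits):
                mark = " ' " if has_pause else " "
                if flag_diffs and bits[i] != ref_bits[i]:
                    mark = " !' " if has_pause else " ! "
                pieces.append(mark + words[i + 1])
            return "".join(pieces)

        print("reference:", row(ref_bits, flag_diffs=False))
        print("user     :", row(usr_bits, flag_diffs=True))

    # Hypothetical pause lengths for "how are you today".
    render_pause_comparison(["how", "are", "you", "today"],
                            ref_pauses=[0.3, 0.2, 0.1],
                            user_pauses=[0.3, 0.2, 0.3])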
In this way, through the output of the output unit 450, the user can easily learn the differences between his or her own speaking rhythm and/or pronunciation rhythm and that of the standard pronunciation (i.e. the reference speech here), and how large the differences are, so as to correct his or her speaking rhythm and/or pronunciation rhythm in a more targeted and accurate way.
According to other implementations, the output unit 450 may also visually output, as the calculation result of the voice quality, both the score reflecting the voice quality and the difference between the user rhythm feature and the reference rhythm feature. The details of this implementation can refer to the descriptions of the two implementations above and are not repeated here.
As can be seen from the above description, the rhythm-based voice quality assessment device according to the above embodiments of the present invention computes the voice quality of the user speech based on the correlation between the user rhythm feature obtained from the user speech and the reference rhythm feature. Since the device takes information about speech rhythm into account when computing the voice quality of the user speech, the user can learn from the result how accurate his or her recorded speech is in terms of rhythm, which in turn helps the user decide whether his or her speaking rhythm and/or pronunciation rhythm needs to be corrected.
In addition, the rhythm-based voice quality assessment device according to the above embodiments of the present invention corresponds to a user client, and the calculation and evaluation of the user speech are completed on a client computer or a client mobile terminal, whereas existing speech technology usually completes the calculation and evaluation of user speech on the server side. The voice quality assessment device of the present invention therefore allows the user to learn offline (after the learning material has been downloaded and stored), without having to learn online as in the prior art.
In addition, the embodiments of the present invention also provide a data processing device, adapted to be executed in a server and comprising: a server storage unit adapted to store a predetermined text and either to store at least one segment of reference speech corresponding to the predetermined text or to receive and store at least one segment of reference speech from outside; and a rhythm calculating unit adapted to calculate the rhythm information of the at least one segment of reference speech and store the rhythm information in the server storage unit, or to calculate a reference rhythm feature of the at least one segment of reference speech from the rhythm information and store the reference rhythm feature in the server storage unit.
Fig. 5 shows an example of the data processing device 500 according to an embodiment of the present invention. As shown in Fig. 5, the data processing device 500 comprises a server storage unit 510 and a rhythm calculating unit 520.
The data processing device 500 may, for example, be implemented as an application residing on a server. The server may, for example, comprise a web server and may communicate with a user client (such as the voice quality assessment device 200 or 400 described above) using the HTTP protocol, but is not limited thereto.
The server storage unit 510 can store the text material of various language learning materials, i.e. predetermined texts. For each language, the server storage unit 510 may, in addition to storing the predetermined text, also store at least one segment of reference speech corresponding to the predetermined text, or receive and store at least one segment of reference speech from an external device such as the speech processing device 600 described below.
It should be understood that the predetermined text mentioned here is similar to the predetermined text described above: in addition to the textual content of the one or more sentences and the one or more words of each sentence, it may optionally also include information such as the syllables and/or phonemes of each word (for example, when the language of the predetermined text is a language whose words are composed of letters, such as English) and the correspondence between the syllable and/or phoneme information of each word and the letters that make up that word.
The rhythm calculating unit 520 can obtain, by calculation, the rhythm information or the reference rhythm feature of the at least one segment of reference speech, and store the obtained rhythm information or reference rhythm feature in the server storage unit. The process of obtaining the reference rhythm feature may be similar to the process of obtaining the user rhythm feature described above; examples are given below, and the description of identical parts is omitted.
According to one implementation, the rhythm calculating unit 520 may store the obtained rhythm information of the at least one segment of reference speech in the server storage unit 510. In this implementation, in subsequent processing, the data processing device 500 may provide the stored predetermined text and the rhythm information of the at least one segment of reference speech to a user client (such as the voice quality assessment device 200 or 400 described above).
In addition, according to another implementation, the rhythm calculating unit 520 may also obtain the reference rhythm feature of the at least one segment of reference speech from the obtained rhythm information, and save the obtained reference rhythm feature in the server storage unit 510. In this implementation, in subsequent processing, the data processing device 500 may provide the stored predetermined text and the reference rhythm feature of the at least one segment of reference speech to a user client (such as the voice quality assessment device 200 or 400 described above).
In one example, assume that the "at least one segment of reference speech" comprises two segments of reference speech, R1 and R2. Taking the sentence "how are you today" in the predetermined text and the reference speech R1 as an example, forced alignment yields the speech segment R1se1 in the reference speech R1 corresponding to the sentence "how are you today", in which the words "how", "are", "you" and "today" correspond in turn to the speech blocks Rb1, Rb2, Rb3 and Rb4 in the reference speech. Forced alignment also yields the rhythm information of the reference speech R1, namely:
(0.2-0.4), (0.5-0.7), (0.9-1.2) (assuming the unit is seconds).
Here, (0.2-0.4) indicates that the pause between Rb1 and Rb2 lasts from time point 0.2 to time point 0.4; (0.5-0.7) indicates that the pause between Rb2 and Rb3 lasts from time point 0.5 to time point 0.7; and (0.9-1.2) indicates that the pause between Rb3 and Rb4 lasts from time point 0.9 to time point 1.2. Thus, in this example, the information formed by the time intervals in the reference speech R1 for the sentence "how are you today" may be used as the rhythm feature of the corresponding speech segment R1se1, where the information may be expressed, for example but not exclusively, in the form of a vector, i.e. (0.2, 0.2, 0.3).
In another example, the information may be determined by comparing the time interval with a predetermined interval threshold (assumed to be 0.25 in this example). In this way, from the above rhythm information and the predetermined interval threshold, the rhythm features of the speech segments R1se1 and R1se2 can be obtained as (0, 0, 1) and (0, 1, 1) respectively, and the rhythm feature {(0, 0, 1), (0, 1, 1)} of the reference speech R1 can then be stored in the server storage unit 510 as the reference rhythm feature.
In addition, in other examples, the rhythm information of the reference speech may also be formed by the duration of the speech block in the reference speech corresponding to each word of each sentence in the predetermined text; the processing in this case is similar to the corresponding processing for the user speech described above and is not repeated here.
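Where several reference recordings of the same text are available (as with R1 and R2 above), their rhythm features may be averaged, as noted earlier. A sketch of that averaging step, with purely illustrative numbers and an assumed function name:

    def average_reference_feature(per_reference_features):
        """Average the rhythm features of several reference recordings of the
        same text, sentence by sentence and position by position, to obtain a
        single reference rhythm feature."""
        averaged = []
        for sentence_features in zip(*per_reference_features):
            n = len(sentence_features)
            length = len(sentence_features[0])
            averaged.append(tuple(
                round(sum(f[i] for f in sentence_features) / n, 3)
                for i in range(length)))
        return averaged

    # Two hypothetical reference recordings R1 and R2 of a two-sentence text.
    r1 = [(0.2, 0.2, 0.3), (0.1, 0.3, 0.3)]
    r2 = [(0.2, 0.3, 0.3), (0.2, 0.3, 0.2)]
    print(average_reference_feature([r1, r2]))
    # [(0.2, 0.25, 0.3), (0.15, 0.3, 0.25)]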
It should be noted that the processing performed in the data processing device 500 according to the embodiment of the present invention that is identical to the processing of the rhythm-based voice quality assessment device 200 or 400 described above in conjunction with Fig. 2 or Fig. 4 can achieve similar technical effects, which are not repeated one by one here.
In addition, the embodiments of the present invention also provide a speech processing device, adapted to be executed in a computer and comprising: a reference speech receiving unit adapted to receive speech recorded by a specific user for a predetermined text, as reference speech, and to send the reference speech to a predetermined server; and/or a rhythm calculating unit adapted to calculate the rhythm information of the reference speech from the reference speech and send the rhythm information, in association with the predetermined text, to the predetermined server, or to calculate a reference rhythm feature of the reference speech from the rhythm information and send the reference rhythm feature, in association with the predetermined text, to the predetermined server.
Fig. 6 shows an example of the speech processing device 600 according to an embodiment of the present invention. As shown in Fig. 6, the speech processing device 600 comprises a reference speech receiving unit 610 and may also comprise a rhythm calculating unit 620.
As shown in Fig. 6, according to one implementation, when the speech processing device 600 comprises only the reference speech receiving unit 610, the reference speech receiving unit 610 may receive speech recorded for the predetermined text by a specific user (for example, a native speaker of the language of the predetermined text, or a professional language teacher of that language), as reference speech, and send the reference speech to a predetermined server (such as the server in which the data processing device 500 described above in conjunction with Fig. 5 resides).
In addition, according to another implementation, when the speech processing device 600 further comprises the rhythm calculating unit 620, the rhythm information of the reference speech may be calculated from the reference speech received by the reference speech receiving unit 610, and the rhythm information may be sent, in association with the predetermined text, to the predetermined server; or the reference rhythm feature of the reference speech may be calculated from the rhythm information (the process can refer to the related description above), so that the reference rhythm feature is sent, in association with the predetermined text, to the predetermined server.
In practical applications, the speech processing device 600 may correspond to a teacher client arranged in a computer or another terminal and may, for example, be implemented in software.
The user of the teacher client can record a standard pronunciation for each sentence in the predetermined text and send it as a reference voice to the corresponding server side, where subsequent processing is performed. In this case, the server can conveniently acquire reference voices over the Internet without taking part in the recording of the voices, which saves time and operations.
In addition, the teacher client can also directly process and analyze the recorded standard pronunciation (i.e. the reference voice) locally, generate parameters corresponding to the standard pronunciation (such as reference voice features), and transfer them together with the predetermined text to the server side for storage, thereby reducing the processing load of the server side.
In addition, an embodiment of the present invention also provides a mobile terminal including the rhythm-based voice quality assessment device described above. The mobile terminal can have the functions of the above rhythm-based voice quality assessment device 200 or 400 and can achieve similar technical effects, which will not be elaborated here.
In addition, an embodiment of the present invention also provides a rhythm-based voice quality assessment system, which includes the rhythm-based voice quality assessment device 200 or 400 described above and the data processing device 500 described above.
According to one implementation, in addition to the above voice quality assessment device 200 or 400 and the data processing device 500, the voice quality assessment system may optionally include the speech processing device 600 described above. In this implementation, the voice quality assessment device 200 or 400 in the voice quality assessment system may correspond to a user client installed in a computer or a mobile terminal, the data processing device 500 may correspond to the server side, and the speech processing device 600 may correspond to the teacher client. In actual processing, the teacher client provides the reference voice to the server side (optionally, it may also provide the cadence information or the reference rhythm characteristic of the reference voice), the server stores this information together with the predetermined text, and the user client downloads this information from the server in order to analyze the user speech input by the user and thereby complete the voice quality assessment. For details of the processing, reference may be made to the descriptions given above in conjunction with Fig. 2 or 4, Fig. 5 and Fig. 6, respectively, which are not repeated here.
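As an illustrative sketch of this division of roles (the field names, text identifier and the toy client-side score below are assumptions introduced only for illustration and do not appear in the patent), the data exchanged between the three parties could be organized as follows:

```python
# Hypothetical data-flow sketch; all field names are illustrative assumptions.

# 1. Teacher client: uploads the reference material for one predetermined text.
teacher_upload = {
    "text_id": "lesson_001",
    "predetermined_text": "How are you today?",
    # Either the raw reference voice, its cadence information, or the
    # already-computed reference rhythm characteristic may be provided.
    "reference_rhythm_characteristic": [(0, 0, 1)],
}

# 2. Server side: stores the predetermined text together with the reference data.
server_store = {teacher_upload["text_id"]: teacher_upload}

# 3. User client: downloads the stored material and evaluates a recording offline.
def evaluate_on_client(text_id, user_rhythm_characteristic, store):
    reference = store[text_id]["reference_rhythm_characteristic"]
    matches = sum(r == u for r, u in zip(reference, user_rhythm_characteristic))
    return matches / len(reference)  # a toy per-sentence match ratio

print(evaluate_on_client("lesson_001", [(0, 0, 1)], server_store))  # 1.0
```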
In addition, an embodiment of the present invention also provides a rhythm-based voice quality assessment method, which includes the following steps: receiving the user speech recorded by a user for a predetermined text, the predetermined text including one or more sentences and each sentence including one or more words; obtaining the user rhythm characteristic of the user speech; and calculating the voice quality of the user speech based on the correlation between the reference rhythm characteristic corresponding to the predetermined text and the user rhythm characteristic.
An exemplary processing of the above rhythm-based voice quality assessment method is described below with reference to Fig. 7. As shown in Fig. 7, the exemplary processing flow 700 of the rhythm-based voice quality assessment method according to an embodiment of the present invention starts at step S710, after which step S720 is executed.
In step S720, the user speech recorded by the user for the predetermined text is received, the predetermined text including one or more sentences and each sentence including one or more words. Then step S730 is executed. The processing in step S720 may, for example, be the same as the processing of the user speech receiving unit 220 described above in conjunction with Fig. 2, and can achieve similar technical effects, which are not described again here.
According to one implementation, the predetermined text and the reference rhythm characteristic may be downloaded in advance from a predetermined server.
According to another implementation, the predetermined text may be downloaded in advance from a predetermined server, and the reference rhythm characteristic may be calculated from the cadence information of at least one reference voice downloaded in advance from the predetermined server.
In step S730, the user rhythm characteristic of the user speech is obtained. Then step S740 is executed. The processing in step S730 may, for example, be the same as the processing of the feature acquiring unit 230 described above in conjunction with Fig. 2, and can achieve similar technical effects, which are not described again here.
According to one implementation, in step S730 the user speech can be forcibly aligned with the predetermined text by means of a predetermined acoustic model so as to determine the correspondence between each word in the predetermined text (and/or each syllable in each word and/or each phoneme in each syllable) and the parts of the user speech, and the user rhythm characteristic of the user speech can be obtained based on this correspondence.
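As a sketch of how such a word-level correspondence might be consumed (the aligner itself is not shown; the helper below simply assumes that forced alignment has already produced a list of (word, start_time, end_time) tuples for one sentence, which is an assumption made for illustration):

```python
from typing import List, Tuple

# Assumed output of forced alignment for one sentence: (word, start, end) in seconds.
Alignment = List[Tuple[str, float, float]]

def inter_word_intervals(alignment: Alignment) -> List[float]:
    """Pause length between each pair of adjacent words, derived from the alignment."""
    return [
        alignment[i + 1][1] - alignment[i][2]  # next word's start minus this word's end
        for i in range(len(alignment) - 1)
    ]

example_alignment = [("how", 0.00, 0.30), ("are", 0.40, 0.60), ("you", 0.95, 1.20)]
print(inter_word_intervals(example_alignment))  # [0.10, 0.35]
```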
Here, the step of obtaining the user rhythm characteristic of the user speech based on the correspondence can, for example, be realized as follows: for each sentence of the predetermined text, the rhythm characteristic of the speech segment corresponding to the sentence in the user speech is obtained according to the time interval between the two speech blocks corresponding, in the user speech, to each pair of adjacent words in the sentence; and the rhythm characteristic of the user speech is formed from the obtained rhythm characteristics of the speech segments corresponding, in the user speech, to the individual sentences of the predetermined text.
In one example, for each sentence in the predetermined text, the information constituted by all the time intervals determined for that sentence can be used as the rhythm characteristic of the speech segment corresponding to the sentence. Here, when the time interval between two words is greater than or equal to the predetermined interval threshold, a first value corresponding to that time interval is set; when the time interval is less than the predetermined interval threshold, a second value corresponding to that time interval is set; and the rhythm characteristic of the speech segment corresponding to the sentence is determined from the first and second values thus obtained.
In another example, for each sentence in the predetermined text, the duration of the speech block corresponding to each word of the sentence in the user speech can be determined, and the information constituted by the durations corresponding to all the words of the sentence can be used as the rhythm characteristic of the speech segment corresponding to the sentence. Simplified sketches of both variants are given below.
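The two per-sentence variants described above can be sketched as follows (an illustrative simplification; the binary encoding reuses the thresholding rule from the earlier example, the alignment tuples and their timings are assumptions, and 1/0 are taken as the first and second values):

```python
def interval_rhythm_feature(alignment, threshold=0.25):
    """Variant 1: binary values derived from pauses between adjacent words."""
    gaps = [alignment[i + 1][1] - alignment[i][2] for i in range(len(alignment) - 1)]
    return tuple(1 if gap >= threshold else 0 for gap in gaps)

def duration_rhythm_feature(alignment):
    """Variant 2: the duration of the speech block of every word in the sentence."""
    return tuple(round(end - start, 2) for _word, start, end in alignment)

sentence_alignment = [("how", 0.00, 0.30), ("are", 0.40, 0.60), ("you", 0.95, 1.20)]
print(interval_rhythm_feature(sentence_alignment))  # (0, 1)
print(duration_rhythm_feature(sentence_alignment))  # (0.3, 0.2, 0.25)
```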
In step S740, the voice quality of the user speech is calculated based on the correlation between the reference rhythm characteristic corresponding to the predetermined text and the user rhythm characteristic. The processing in step S740 may, for example, be the same as the processing of the voice quality computing unit 240 described above in conjunction with Fig. 2, and can achieve similar technical effects, which are not described again here. The processing flow 700 then ends at step S750.
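As a simplified sketch of one way such a correlation-based score could be computed for binary rhythm characteristics (the patent does not fix a particular similarity or distance measure; the fraction-of-matching-positions measure below is only an illustrative choice, and the user pattern is assumed):

```python
def rhythm_score(user_feature, reference_feature):
    """Toy quality score: fraction of positions whose binary pause values match
    across all sentence segments (1.0 means identical pausing behaviour)."""
    total, matched = 0, 0
    for user_seg, ref_seg in zip(user_feature, reference_feature):
        for u, r in zip(user_seg, ref_seg):
            total += 1
            matched += (u == r)
    return matched / total if total else 0.0

reference_feature = [(0, 0, 1), (0, 1, 1)]   # reference rhythm characteristic from the earlier example
user_feature      = [(0, 0, 1), (1, 1, 1)]   # assumed user pausing pattern
print(rhythm_score(user_feature, reference_feature))  # 0.833... (5 of 6 positions match)
```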
In addition, according to another implementation, the following step may optionally be included after step S740: visually outputting the calculation result of the voice quality.
Here, the calculation result of the voice quality may include: a score reflecting the voice quality; and/or the difference between the user rhythm characteristic and the reference rhythm characteristic.
As can be seen from the above description, the rhythm-based voice quality assessment method according to the above embodiment of the present invention calculates the voice quality of the user speech based on the correlation between the user rhythm characteristic of the acquired user speech and the reference rhythm characteristic. Since the method takes information about the speech rhythm into account when calculating the voice quality of the user speech, the user can learn from the calculation result how accurate the recorded voice is in terms of rhythm, which in turn helps the user to judge whether the speaking rhythm and/or pronunciation rhythm needs to be corrected.
In addition, the rhythm-based voice quality assessment method according to the above embodiment of the present invention corresponds to the user client, so that the calculation and evaluation of the user speech are completed on the client computer or client mobile terminal, whereas existing voice technology usually completes the calculation and evaluation of the user speech on the server side. The voice quality assessment method of the present invention therefore allows the user to learn offline (once the learning material has been downloaded and stored), without having to learn online as in the prior art.
In addition, an embodiment of the present invention also provides a data processing method. The method is suitable for being executed in a server and includes the following steps: storing a predetermined text; storing at least one reference voice corresponding to the predetermined text, or receiving at least one reference voice from outside and storing it; and obtaining the cadence information of the at least one reference voice and saving the cadence information, or obtaining the reference rhythm characteristic of the at least one reference voice according to the cadence information and saving the reference rhythm characteristic.
An exemplary processing of the above data processing method is described below with reference to Fig. 8. As shown in Fig. 8, the exemplary processing flow 800 of the data processing method according to an embodiment of the present invention starts at step S810, after which step S820 is executed.
In step S820, the predetermined text and at least one reference voice corresponding to the predetermined text are stored, or the predetermined text is stored and at least one reference voice is received from outside and stored. Then step S830 is executed. The processing in step S820 may, for example, be the same as the processing of the server storage unit 510 described above in conjunction with Fig. 5, and can achieve similar technical effects, which are not described again here.
In step S830, the cadence information of the at least one reference voice is obtained and saved, or the reference rhythm characteristic of the at least one reference voice is obtained according to the cadence information and saved. The processing in step S830 may, for example, be the same as the processing of the obtaining unit 520 described above in conjunction with Fig. 5, and can achieve similar technical effects, which are not described again here. The processing flow 800 then ends at step S840.
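A minimal sketch of this server-side step, assuming the reference voice has already been aligned so that per-sentence pause intervals are available (the storage layout, function name and field names are illustrative assumptions):

```python
# Illustrative server-side precomputation; names and layout are assumptions.
server_storage = {}

def store_reference_material(text_id, predetermined_text, per_sentence_intervals,
                             threshold=0.25, save_rhythm_feature=True):
    """Save either the raw cadence information or the derived reference rhythm characteristic."""
    entry = {"text": predetermined_text}
    if save_rhythm_feature:
        entry["reference_rhythm_characteristic"] = [
            tuple(1 if gap >= threshold else 0 for gap in intervals)
            for intervals in per_sentence_intervals
        ]
    else:
        entry["cadence_information"] = per_sentence_intervals
    server_storage[text_id] = entry

store_reference_material("lesson_001", "How are you today?",
                         per_sentence_intervals=[[0.10, 0.20, 0.30]])
print(server_storage["lesson_001"]["reference_rhythm_characteristic"])  # [(0, 0, 1)]
```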
The data processing method of the above embodiment of the present invention can achieve technical effects similar to those of the data processing device 500 described above, which will not be elaborated here.
In addition, an embodiment of the present invention also provides a speech processing method. The method is suitable for being executed in a computer and includes the following steps: receiving the voice recorded by a specific user for the predetermined text as a reference voice, and sending the reference voice to a predetermined server. Optionally, the cadence information of the reference voice may also be calculated from the reference voice and sent to the predetermined server in association with the predetermined text, or the reference rhythm characteristic of the reference voice may be obtained according to the cadence information so that the reference rhythm characteristic and the predetermined text are sent to the predetermined server in association with each other.
An exemplary processing of the above speech processing method is described below with reference to Fig. 9. As shown in Fig. 9, the exemplary processing flow 900 of the speech processing method according to an embodiment of the present invention starts at step S910, after which step S920 is executed.
In step S920, the voice recorded by the specific user for the predetermined text is received as the reference voice. Then step S930 is executed.
In step S930, the reference voice is sent to the predetermined server. The processing flow 900 then ends at step S940.
The processing of the processing flow 900 may, for example, be the same as the processing of the reference voice receiving unit 610 described above in conjunction with Fig. 6, and can achieve similar technical effects, which are not described again here.
In addition, Fig. 10 shows another exemplary processing of the above speech processing method. As shown in Fig. 10, the exemplary processing flow 1000 of the speech processing method according to an embodiment of the present invention starts at step S1010, after which step S1020 is executed.
In step S1020, the voice recorded by the specific user for the predetermined text is received as the reference voice. Then step S1030 is executed.
According to one implementation, in step S1030 the cadence information of the reference voice can be obtained and sent to the predetermined server in association with the predetermined text. The processing flow 1000 then ends at step S1040.
According to another implementation, in step S1030 the reference rhythm characteristic of the reference voice can be obtained according to the cadence information, so that the reference rhythm characteristic and the predetermined text are sent to the predetermined server in association with each other. The processing flow 1000 then ends at step S1040.
The processing of the processing flow 1000 may, for example, be the same as the processing of the tempo calculation unit 620 described above in conjunction with Fig. 6, and can achieve similar technical effects, which are not described again here.
The speech processing method of the above embodiment of the present invention can achieve technical effects similar to those of the speech processing device 600 described above, which will not be elaborated here.
A11: A rhythm-based voice quality assessment method, comprising the steps of: receiving the user speech recorded by a user for a predetermined text, the predetermined text including one or more sentences and each sentence including one or more words; obtaining the user rhythm characteristic of the user speech; and calculating the voice quality of the user speech based on the correlation between the reference rhythm characteristic corresponding to the predetermined text and the user rhythm characteristic.
A12: In the voice quality assessment method according to A11, the step of obtaining the user rhythm characteristic of the user speech comprises: forcibly aligning the user speech with the predetermined text by means of a predetermined acoustic model so as to determine the correspondence between each word in the predetermined text and the parts of the user speech, and obtaining the user rhythm characteristic of the user speech based on the correspondence.
A13: In the voice quality assessment method according to A12, the step of obtaining the user rhythm characteristic of the user speech based on the correspondence comprises: for each sentence of the predetermined text, obtaining the rhythm characteristic of the speech segment corresponding to the sentence in the user speech according to the time interval between the two speech blocks corresponding, in the user speech, to each pair of adjacent words in the sentence; and forming the rhythm characteristic of the user speech from the obtained rhythm characteristics of the speech segments corresponding, in the user speech, to the individual sentences of the predetermined text.
A14: In the voice quality assessment method according to A13, for each sentence in the predetermined text: the information constituted by all the time intervals determined for the sentence is used as the rhythm characteristic of the speech segment corresponding to the sentence; or the duration of the speech block corresponding to each word of the sentence in the user speech is determined, and the information constituted by the durations corresponding to all the words of the sentence is used as the rhythm characteristic of the speech segment corresponding to the sentence.
A15: In the voice quality assessment method according to A14, when the time interval between two words is greater than or equal to a predetermined interval threshold, a first value corresponding to that time interval is set; when the time interval is less than the predetermined interval threshold, a second value corresponding to that time interval is set; and the rhythm characteristic of the speech segment corresponding to the sentence is determined from the first and second values thus obtained.
A16: The voice quality assessment method according to A11 further comprises: visually outputting the calculation result of the voice quality.
A17: In the voice quality assessment method according to A16, the calculation result of the voice quality includes: a score reflecting the voice quality; and/or the difference between the user rhythm characteristic and the reference rhythm characteristic.
A18: In the voice quality assessment method according to A11: the predetermined text and the reference rhythm characteristic are downloaded in advance from a predetermined server; or the predetermined text is downloaded in advance from a predetermined server, and the reference rhythm characteristic is calculated from the cadence information of at least one reference voice downloaded in advance from the predetermined server.
A19: A data processing method, suitable for being executed in a server and comprising the steps of: storing a predetermined text and at least one reference voice corresponding to the predetermined text; and calculating the cadence information of the reference voice according to the at least one reference voice and saving the cadence information, or calculating the reference rhythm characteristic of the at least one reference voice according to the cadence information and saving the reference rhythm characteristic.
A20: A speech processing method, suitable for being executed in a computer and comprising the steps of: receiving the voice recorded by a specific user for a predetermined text as a reference voice; and calculating the cadence information of the reference voice according to the reference voice and sending the cadence information and the predetermined text to a predetermined server in association with each other, or calculating the reference rhythm characteristic of the reference voice according to the cadence information so as to send the reference rhythm characteristic and the predetermined text to the predetermined server in association with each other.
A21: A mobile terminal including the rhythm-based voice quality assessment device according to the present invention.
A22: A rhythm-based voice quality assessment system including the rhythm-based voice quality assessment device and the data processing device according to the present invention.
A23: A rhythm-based voice quality assessment system, comprising: the rhythm-based voice quality assessment device according to the present invention; a server storing a predetermined text and reference cadence information and/or a reference rhythm characteristic; and the speech processing device according to the present invention.
Similarly, it should be understood that, in order to simplify the present disclosure and to aid the understanding of one or more of the various inventive aspects, in the above description of exemplary embodiments of the present invention the features of the invention are sometimes grouped together in a single embodiment, figure or description thereof. However, the disclosed method should not be interpreted as reflecting the intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single disclosed embodiment. The claims following the detailed description are hereby expressly incorporated into the detailed description, with each claim standing on its own as a separate embodiment of the present invention.
Those skilled in the art should understand that the modules, units or components of the devices in the examples disclosed herein may be arranged in a device as described in the embodiment, or may alternatively be located in one or more devices different from the device in the example. The modules in the foregoing examples may be combined into one module or may be divided into multiple sub-modules.
Those skilled in the art will understand that the modules in the devices of an embodiment can be adaptively changed and arranged in one or more devices different from that embodiment. The modules, units or components of an embodiment can be combined into one module, unit or component, and they can furthermore be divided into multiple sub-modules, sub-units or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract and drawings) may be replaced by an alternative feature serving the same, equivalent or similar purpose.
Furthermore, those skilled in the art will appreciate that, although some embodiments described herein include certain features included in other embodiments but not others, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any one of the claimed embodiments may be used in any combination.
In addition, some of the embodiments are described herein as a method or a combination of method elements that can be implemented by a processor of a computer system or by other means performing the function. Therefore, a processor having the instructions necessary for implementing such a method or method element forms a means for implementing the method or method element. Furthermore, an element of a device embodiment described herein is an example of a means for performing the function performed by the element for the purpose of carrying out the invention.
As used herein, unless otherwise specified, the use of the ordinal words "first", "second", "third", etc. to describe ordinary objects merely indicates that different instances of similar objects are referred to, and is not intended to imply that the objects so described must have a given order in time, in space, in ranking or in any other manner.
Although the present invention has been described in terms of a limited number of embodiments, it will be apparent to those skilled in the art, having the benefit of the above description, that other embodiments can be envisaged within the scope of the invention thus described. Additionally, it should be noted that the language used in this specification has been chosen principally for purposes of readability and instruction, rather than to explain or define the subject matter of the invention. Therefore, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. With regard to the scope of the invention, the disclosure made herein is illustrative and not restrictive, and the scope of the invention is defined by the appended claims.

Claims (19)

1. A rhythm-based voice quality assessment device, comprising:
a storage unit, adapted to store a predetermined text and a reference rhythm characteristic corresponding to the predetermined text, the predetermined text including one or more sentences and each sentence including one or more words;
a user speech receiving unit, adapted to receive the user speech recorded by a user for the predetermined text;
a feature acquiring unit, adapted to obtain the user rhythm characteristic of the user speech; and
a voice quality computing unit, adapted to calculate the voice quality of the user speech based on the correlation between the reference rhythm characteristic and the user rhythm characteristic, comprising:
calculating the similarity between the user rhythm characteristic and the reference rhythm characteristic and taking the similarity as a score describing the voice quality of the user speech; or
calculating the distance between the user rhythm characteristic and the reference rhythm characteristic based on the correlation between the two, and obtaining the score describing the voice quality of the user speech from the distance; the acquisition process and the representation of the reference rhythm characteristic and the user rhythm characteristic being the same, and the rhythm characteristic being able to reflect the pause length between the individual words;
wherein the feature acquiring unit comprises:
an alignment sub-unit, adapted to forcibly align the user speech with the predetermined text by means of a predetermined acoustic model so as to determine the correspondence between each word in the predetermined text and the parts of the user speech; and
a feature calculation sub-unit, adapted to calculate the user rhythm characteristic of the user speech based on the correspondence, comprising:
for each sentence of the predetermined text, obtaining the rhythm characteristic of the speech segment corresponding to the sentence in the user speech according to the time interval between the two speech blocks corresponding, in the user speech, to each pair of adjacent words in the sentence; and
forming the rhythm characteristic of the user speech from the obtained rhythm characteristics of the speech segments corresponding, in the user speech, to the individual sentences of the predetermined text.
2. The voice quality assessment device according to claim 1, wherein the feature calculation sub-unit is adapted, for each sentence in the predetermined text, to:
use the information constituted by all the time intervals determined for the sentence as the rhythm characteristic of the speech segment corresponding to the sentence; or
determine the duration of the speech block corresponding to each word of the sentence in the user speech, and use the information constituted by the durations corresponding to all the words of the sentence as the rhythm characteristic of the speech segment corresponding to the sentence.
3. The voice quality assessment device according to claim 2, wherein, when the time interval between two words is greater than or equal to a predetermined interval threshold, a first value corresponding to that time interval is set; when the time interval is less than the predetermined interval threshold, a second value corresponding to that time interval is set; and the rhythm characteristic of the speech segment corresponding to the sentence is determined from the first and second values thus obtained.
4. The voice quality assessment device according to claim 1, further comprising:
an output unit, adapted to visually output the calculation result of the voice quality.
5. The voice quality assessment device according to claim 4, wherein the output unit is adapted to output the following results as the calculation result of the voice quality:
a score reflecting the voice quality; and/or
the difference between the user rhythm characteristic and the reference rhythm characteristic.
6. The voice quality assessment device according to claim 1, wherein:
the storage unit is adapted to download the predetermined text and the reference rhythm characteristic in advance from a predetermined server; or
the storage unit is adapted to download the predetermined text and the cadence information of at least one reference voice in advance from a predetermined server and to obtain the reference rhythm characteristic by calculation from the cadence information of the at least one reference voice.
7. A data processing device, adapted to reside in a server and comprising:
a server storage unit, adapted to store a predetermined text and at least one reference voice corresponding to the predetermined text, the predetermined text including one or more sentences and each sentence including one or more words; and
a tempo calculation unit, adapted to calculate the cadence information of the reference voice according to the at least one reference voice and store the cadence information in the server storage unit, or to calculate the reference rhythm characteristic of the at least one reference voice according to the cadence information and store the reference rhythm characteristic in the server storage unit; wherein the tempo calculation unit is adapted to:
forcibly align the reference voice with the predetermined text by means of a predetermined acoustic model so as to determine the correspondence between each word in the predetermined text and the parts of the reference voice; and
calculate the reference rhythm characteristic of the reference voice based on the correspondence, comprising:
for each sentence of the predetermined text, obtaining the rhythm characteristic of the speech segment corresponding to the sentence in the reference voice according to the time interval between the two speech blocks corresponding, in the reference voice, to each pair of adjacent words in the sentence; and
forming the reference rhythm characteristic of the reference voice from the obtained rhythm characteristics of the speech segments corresponding, in the reference voice, to the individual sentences of the predetermined text.
8. A speech processing device, adapted to be executed in a computer and comprising:
a reference voice receiving unit, adapted to receive the voice recorded by a specific user for a predetermined text as a reference voice, the predetermined text including one or more sentences and each sentence including one or more words; and
a tempo calculation unit, adapted to calculate the cadence information of the reference voice according to the reference voice and send the cadence information and the predetermined text to a predetermined server in association with each other, or to calculate the reference rhythm characteristic of the reference voice according to the cadence information so as to send the reference rhythm characteristic and the predetermined text to the predetermined server in association with each other; wherein the tempo calculation unit is adapted to:
forcibly align the reference voice with the predetermined text by means of a predetermined acoustic model so as to determine the correspondence between each word in the predetermined text and the parts of the reference voice; and
calculate the rhythm characteristic of the reference voice based on the correspondence, comprising:
for each sentence of the predetermined text, obtaining the rhythm characteristic of the speech segment corresponding to the sentence in the reference voice according to the time interval between the two speech blocks corresponding, in the reference voice, to each pair of adjacent words in the sentence; and
forming the reference rhythm characteristic of the reference voice from the obtained rhythm characteristics of the speech segments corresponding, in the reference voice, to the individual sentences of the predetermined text.
9. A rhythm-based voice quality assessment method, comprising the steps of:
receiving the user speech recorded by a user for a predetermined text, the predetermined text including one or more sentences and each sentence including one or more words;
obtaining the user rhythm characteristic of the user speech, comprising:
forcibly aligning the user speech with the predetermined text by means of a predetermined acoustic model so as to determine the correspondence between each word in the predetermined text and the parts of the user speech;
obtaining the user rhythm characteristic of the user speech based on the correspondence, comprising:
for each sentence of the predetermined text, obtaining the rhythm characteristic of the speech segment corresponding to the sentence in the user speech according to the time interval between the two speech blocks corresponding, in the user speech, to each pair of adjacent words in the sentence; and
forming the rhythm characteristic of the user speech from the obtained rhythm characteristics of the speech segments corresponding, in the user speech, to the individual sentences of the predetermined text; and calculating the voice quality of the user speech based on the correlation between the reference rhythm characteristic corresponding to the predetermined text and the user rhythm characteristic, comprising:
calculating the similarity between the user rhythm characteristic and the reference rhythm characteristic and taking the similarity as a score describing the voice quality of the user speech; or
calculating the distance between the user rhythm characteristic and the reference rhythm characteristic based on the correlation between the two, and obtaining the score describing the voice quality of the user speech from the distance; the acquisition process and the representation of the reference rhythm characteristic and the user rhythm characteristic being the same, and the rhythm characteristic being able to reflect the pause length between the individual words.
10. The voice quality assessment method according to claim 9, wherein, for each sentence in the predetermined text:
the information constituted by all the time intervals determined for the sentence is used as the rhythm characteristic of the speech segment corresponding to the sentence; or
the duration of the speech block corresponding to each word of the sentence in the user speech is determined, and the information constituted by the durations corresponding to all the words of the sentence is used as the rhythm characteristic of the speech segment corresponding to the sentence.
11. The voice quality assessment method according to claim 10, wherein, when the time interval between two words is greater than or equal to a predetermined interval threshold, a first value corresponding to that time interval is set; when the time interval is less than the predetermined interval threshold, a second value corresponding to that time interval is set; and the rhythm characteristic of the speech segment corresponding to the sentence is determined from the first and second values thus obtained.
12. The voice quality assessment method according to claim 9, further comprising: visually outputting the calculation result of the voice quality.
13. The voice quality assessment method according to claim 12, wherein the calculation result of the voice quality includes:
a score reflecting the voice quality; and/or
the difference between the user rhythm characteristic and the reference rhythm characteristic.
14. The voice quality assessment method according to claim 9, wherein:
the predetermined text and the reference rhythm characteristic are downloaded in advance from a predetermined server; or
the predetermined text is downloaded in advance from a predetermined server, and the reference rhythm characteristic is obtained by calculation from the cadence information of at least one reference voice downloaded in advance from the predetermined server.
15. A data processing method, suitable for being executed in a server and comprising the steps of:
storing a predetermined text and at least one reference voice corresponding to the predetermined text, the predetermined text including one or more sentences and each sentence including one or more words; and
calculating the cadence information of the reference voice according to the at least one reference voice and saving the cadence information, or calculating the reference rhythm characteristic of the at least one reference voice according to the cadence information and saving the reference rhythm characteristic; wherein
the step of calculating the reference rhythm characteristic of the reference voice comprises:
forcibly aligning the reference voice with the predetermined text by means of a predetermined acoustic model so as to determine the correspondence between each word in the predetermined text and the parts of the reference voice;
obtaining the reference rhythm characteristic of the reference voice based on the correspondence, comprising:
for each sentence of the predetermined text, obtaining the rhythm characteristic of the speech segment corresponding to the sentence in the reference voice according to the time interval between the two speech blocks corresponding, in the reference voice, to each pair of adjacent words in the sentence; and
forming the reference rhythm characteristic of the reference voice from the obtained rhythm characteristics of the speech segments corresponding, in the reference voice, to the individual sentences of the predetermined text.
16. A speech processing method, suitable for being executed in a computer and comprising the steps of:
receiving the voice recorded by a specific user for a predetermined text as a reference voice, the predetermined text including one or more sentences and each sentence including one or more words; and
calculating the cadence information of the reference voice according to the reference voice and sending the cadence information and the predetermined text to a predetermined server in association with each other, or calculating the reference rhythm characteristic of the reference voice according to the cadence information so as to send the reference rhythm characteristic and the predetermined text to the predetermined server in association with each other; wherein
the step of calculating the reference rhythm characteristic of the reference voice comprises:
forcibly aligning the reference voice with the predetermined text by means of a predetermined acoustic model so as to determine the correspondence between each word in the predetermined text and the parts of the reference voice;
obtaining the reference rhythm characteristic of the reference voice based on the correspondence, comprising:
for each sentence of the predetermined text, obtaining the rhythm characteristic of the speech segment corresponding to the sentence in the reference voice according to the time interval between the two speech blocks corresponding, in the reference voice, to each pair of adjacent words in the sentence; and
forming the reference rhythm characteristic of the reference voice from the obtained rhythm characteristics of the speech segments corresponding, in the reference voice, to the individual sentences of the predetermined text.
17. A mobile terminal, including the rhythm-based voice quality assessment device according to any one of claims 1-6.
18. A rhythm-based voice quality assessment system, including the rhythm-based voice quality assessment device according to any one of claims 1-6 and the data processing device according to claim 7.
19. A rhythm-based voice quality assessment system, comprising:
the rhythm-based voice quality assessment device according to any one of claims 1-6;
a server storing a predetermined text and reference cadence information and/or a reference rhythm characteristic; and
the speech processing device according to claim 8.
CN201410734839.8A 2014-12-04 2014-12-04 Voice quality assessment equipment, method and system Active CN104361895B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410734839.8A CN104361895B (en) 2014-12-04 2014-12-04 Voice quality assessment equipment, method and system

Publications (2)

Publication Number Publication Date
CN104361895A CN104361895A (en) 2015-02-18
CN104361895B true CN104361895B (en) 2018-12-18

Family

ID=52529151

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410734839.8A Active CN104361895B (en) 2014-12-04 2014-12-04 Voice quality assessment equipment, method and system

Country Status (1)

Country Link
CN (1) CN104361895B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109427327B (en) * 2017-09-05 2022-03-08 中国移动通信有限公司研究院 Audio call evaluation method, evaluation device, and computer storage medium
CN107818797B (en) * 2017-12-07 2021-07-06 苏州科达科技股份有限公司 Voice quality evaluation method, device and system
CN109215632B (en) * 2018-09-30 2021-10-08 科大讯飞股份有限公司 Voice evaluation method, device and equipment and readable storage medium
CN109410984B (en) * 2018-12-20 2022-12-27 广东小天才科技有限公司 Reading scoring method and electronic equipment
CN113327615B (en) * 2021-08-02 2021-11-16 北京世纪好未来教育科技有限公司 Voice evaluation method, device, equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101739870A (en) * 2009-12-03 2010-06-16 深圳先进技术研究院 Interactive language learning system and method
CN102237081A (en) * 2010-04-30 2011-11-09 国际商业机器公司 Method and system for estimating rhythm of voice
CN103928023A (en) * 2014-04-29 2014-07-16 广东外语外贸大学 Voice scoring method and system
CN104050965A (en) * 2013-09-02 2014-09-17 广东外语外贸大学 English phonetic pronunciation quality evaluation system with emotion recognition function and method thereof

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2006337476A (en) * 2005-05-31 2006-12-14 Canon Inc Voice synthesis method and system

Also Published As

Publication number Publication date
CN104361895A (en) 2015-02-18

Similar Documents

Publication Publication Date Title
CN104485116B (en) Voice quality assessment equipment, method and system
CN104485115B (en) Pronounce valuator device, method and system
KR102433834B1 (en) Method and apparatus for compressing a neural network model, method and apparatus for corpus translation, electronic apparatus, program and recording medium
US11527174B2 (en) System to evaluate dimensions of pronunciation quality
CN104361896B (en) Voice quality assessment equipment, method and system
CN104361895B (en) Voice quality assessment equipment, method and system
US11790912B2 (en) Phoneme recognizer customizable keyword spotting system with keyword adaptation
CN104505103B (en) Voice quality assessment equipment, method and system
CN111369971B (en) Speech synthesis method, device, storage medium and electronic equipment
CN109767765A (en) Talk about art matching process and device, storage medium, computer equipment
CN108364651A (en) Audio recognition method and equipment
US20160372110A1 (en) Adapting voice input processing based on voice input characteristics
US10586528B2 (en) Domain-specific speech recognizers in a digital medium environment
CN112840396A (en) Electronic device for processing user words and control method thereof
CN109977426A (en) A kind of training method of translation model, device and machine readable media
CN112562723B (en) Pronunciation accuracy determination method and device, storage medium and electronic equipment
Wang et al. Speaker recognition using convolutional neural network with minimal training data for smart home solutions
KR20210032875A (en) Voice information processing method, apparatus, program and storage medium
CN108880815A (en) Auth method, device and system
WO2016045468A1 (en) Voice input control method and apparatus, and terminal
KR20200056754A (en) Apparatus and method for generating personalization lip reading model
CN112185186B (en) Pronunciation correction method and device, electronic equipment and storage medium
KR102357313B1 (en) Content indexing method of electronic apparatus for setting index word based on audio data included in video content
JP2024507734A (en) Speech similarity determination method and device, program product
Lee Honkling: In-browser personalization for ubiquitous keyword spotting

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant