CN104464751A

CN104464751A - Method and device for detecting pronunciation rhythm problem

Info

Publication number: CN104464751A
Application number: CN201410674294.6A
Authority: CN
Inventors: 张儒瑞; 赵乾; 潘颂声; 宋碧霄; 吴玲
Original assignee: iFlytek Co Ltd
Current assignee: iFlytek Co Ltd
Priority date: 2014-11-21
Filing date: 2014-11-21
Publication date: 2015-03-25
Anticipated expiration: 2034-11-21
Also published as: CN104464751B

Abstract

The invention provides a method and device for detecting a pronunciation rhythm problem. The method comprises the steps of receiving voice data to be detected; obtaining word boundary information of the voice data to be detected, and extracting rhythm information of the voice data to be detected; generating rhythm annotated information of the voice data to be detected according to the word boundary information and the rhythm information of the voice data to be detected; carrying out comparative analysis on the rhythm annotated information of the voice data to be detected and rhythm annotated information of reference voice data annotated in advance so as to detect whether the pronunciation rhythm problem exists in the voice data to be detected or not. The method for detecting the pronunciation rhythm problem automatically obtains the rhythm annotated information of voice and enables the information to be compared, manual annotation is not needed, application is more flexible and wide, and particularly in voice learning software, the rhythm problem of user pronunciation can be evaluated more efficiently by automatically detecting the rhythm of the voice. Particularly, in the detecting process, a large-capacity database is not needed, the calculated amount is small, and the detection efficiency is improved.

Description

Method and device for detecting pronunciation rhythm problems

Technical field

The present invention relates to the field of speech processing technology, and in particular to a method and a device for detecting pronunciation rhythm problems.

Background technology

With the development of speech recognition technology, speech evaluation technology plays an increasingly important role in speech recognition and its applications. Speech evaluation technology is mainly used to assess the quality of speech data. This includes not only assessing the pronunciation quality of the words in the speech data, but also accurately detecting and assessing the prosody (rhythm) of the speech data. For example, in language learning, a user learns by listening to a standard pronunciation and reading along with it. The user compares the pronunciation and prosody of the read-along speech with those of the standard pronunciation, and makes corrections according to the comparison result to keep improving. Accurately assessing and feeding back the rhythm problems in the user's read-along pronunciation is therefore key to mastering a language quickly. A speech rhythm problem is an error in the prosody of the speech: for example, no liaison where liaison is required, no pause where a pause is required, or no stress on a word that should be stressed. In some other scenarios, such as speech recognition, pronunciation rhythm problems in speech also need to be detected.

At present, the main techniques for detecting rhythm problems are the manual labeling method and the prosodic constraint method.

In the manual labeling method, the correct prosody of the speech is labeled manually in the text corresponding to the speech. Acoustic features related to the prosody are then extracted at the corresponding positions in the speech according to the manually labeled position information, and are used to detect whether the speech has a rhythm problem. For example, for a word labeled as stressed, acoustic features such as the energy and fundamental frequency of the speech for that word are extracted, and whether the word was actually stressed is determined by, for example, judging whether these features exceed a certain threshold.

The prosodic constraint method assesses the prosody of input speech data according to prosodic constraints. A prosodic constraint is obtained as follows: the linguistic or syntactic structure of the input speech data is matched against the structures of the standard speech in a standard corpus, and the prosodic boundary positions that the input speech should have are derived from the prosodic boundary positions of standard speech with a similar structure. Where the standard corpus contains many standard utterances whose structure is similar to that of the input speech, the prosodic boundaries that the input speech data should adopt can be determined according to the statistical probability of the prosodic boundaries.

Both existing prosody assessment techniques require the word boundaries and prosodic boundaries of the speech to be labeled manually; without manual labeling, the prosody of the user's pronunciation cannot be assessed. In addition, the prosodic constraint method requires a large-capacity standard corpus. On the one hand, this occupies a large amount of storage space. On the other hand, the standard speech in the corpus also requires correct manual prosodic labeling, and determining a prosodic constraint requires querying the whole standard corpus and computing the statistical probability of the prosodic boundaries, so the amount of computation is very large.

Summary of the invention

The present invention aims to solve the above technical problems at least to some extent.

To this end, a first object of the present invention is to propose a method for detecting pronunciation rhythm problems that requires no manual labeling, can be applied more flexibly and widely, assesses rhythm problems in a user's pronunciation more effectively, and improves detection efficiency.

A second object of the present invention is to propose a device for detecting pronunciation rhythm problems.

To achieve the above objects, an embodiment of the first aspect of the present invention proposes a method for detecting pronunciation rhythm problems, comprising: receiving speech data to be tested; obtaining word boundary information of the speech data to be tested, and extracting prosodic information of the speech data to be tested; generating prosodic labeling information of the speech data to be tested according to its word boundary information and prosodic information; and comparing the prosodic labeling information of the speech data to be tested with the prosodic labeling information of pre-labeled reference speech data, so as to detect whether the speech data to be tested has a pronunciation rhythm problem.

In the method for detecting pronunciation rhythm problems according to the embodiment of the present invention, the word boundary information of the speech data to be tested is obtained and its prosodic information is extracted, the prosodic labeling information of the speech data to be tested is generated from them, and this is compared with the prosodic labeling information of pre-labeled reference speech data to detect rhythm problems. The prosodic labeling information of the speech is thus obtained and compared automatically, with no manual labeling, so the method can be applied more flexibly and widely. In language learning software in particular, automatically detecting the prosody of speech allows rhythm problems in a user's pronunciation to be assessed more effectively. In addition, the detection process does not require a large-capacity database and involves little computation, which improves detection efficiency.

An embodiment of the second aspect of the present invention provides a device for detecting pronunciation rhythm problems, comprising: a receiving module, configured to receive speech data to be tested; an acquisition module, configured to obtain word boundary information of the speech data to be tested and to extract prosodic information of the speech data to be tested; a generation module, configured to generate prosodic labeling information of the speech data to be tested according to its word boundary information and prosodic information; and a detection module, configured to compare the prosodic labeling information of the speech data to be tested with the prosodic labeling information of pre-labeled reference speech data, so as to detect whether the speech data to be tested has a pronunciation rhythm problem.

In the device for detecting pronunciation rhythm problems according to the embodiment of the present invention, the word boundary information of the speech data to be tested is obtained and its prosodic information is extracted, the prosodic labeling information of the speech data to be tested is generated from them, and this is compared with the prosodic labeling information of pre-labeled reference speech data to detect rhythm problems. The prosodic labeling information of the speech is thus obtained and compared automatically, with no manual labeling, so the device can be applied more flexibly and widely. In language learning software in particular, automatically detecting the prosody of speech allows rhythm problems in a user's pronunciation to be assessed more effectively. In addition, the detection process does not require a large-capacity database and involves little computation, which improves detection efficiency.

Additional aspects and advantages of the present invention will be set forth in part in the following description, and in part will become apparent from the following description or be learned through practice of the present invention.

Brief description of the drawings

The above and/or additional aspects and advantages of the present invention will become apparent and easy to understand from the following description of the embodiments taken in conjunction with the accompanying drawings, in which:

Fig. 1 is a flow chart of a method for detecting pronunciation rhythm problems according to an embodiment of the present invention;

Fig. 2 is a flow chart of a method for labeling reference speech data according to an embodiment of the present invention;

Fig. 3 is a schematic structural diagram of a device for detecting pronunciation rhythm problems according to an embodiment of the present invention;

Fig. 4 is a schematic structural diagram of a device for detecting pronunciation rhythm problems according to a specific embodiment of the present invention;

Fig. 5 is a schematic structural diagram of a device for detecting pronunciation rhythm problems according to another embodiment of the present invention.

Detailed description of the embodiments

Embodiments of the present invention are described in detail below. Examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals denote, throughout, the same or similar elements or elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary; they serve only to explain the present invention and shall not be construed as limiting it.

In the description of the present invention, it should be understood that the term "multiple" means two or more, and that the terms "first" and "second" are used for descriptive purposes only and shall not be construed as indicating or implying relative importance.

The method and device for detecting pronunciation rhythm problems according to embodiments of the present invention are described below with reference to the accompanying drawings.

Fig. 1 is a flow chart of a method for detecting pronunciation rhythm problems according to an embodiment of the present invention. As shown in Fig. 1, the method for detecting pronunciation rhythm problems according to the embodiment of the present invention comprises:

S101: receive speech data to be tested.

For example, the speech data to be tested may be read-along speech recorded by a user following a standard reference speech.

S102: obtain word boundary information of the speech data to be tested, and extract prosodic information of the speech data to be tested.

Specifically, in one embodiment of the present invention, the text content corresponding to the speech data to be tested (for example, the text that the read-along speech follows) may first be obtained, a decoding network may be built from this text content, and the decoding network and an acoustic model may then be passed to a decoder. The acoustic model is the underlying mathematical model of speech recognition; its modeling unit may be a phoneme, a syllable or a word, and the current mainstream approach is hidden Markov modeling. The decoder is one of the core components of a speech recognition system: given input acoustic features, its task is to find, according to the acoustic model and the decoding network, the most probable sequence of language units for those features. The decoding network, also called a grammar network, is a directed graph whose nodes are the phonemes (such as the initials and finals of Chinese characters), syllables or words of the above text content and whose arcs are the connections between these units. The decoding network defines the range of language unit sequences that the decoder can output.
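As a rough illustration of the decoding network described above, the sketch below builds a linear, word-level directed graph from a text. The function name `build_decoding_network`, the node naming scheme and the optional silence node between neighbouring words are assumptions made for this example; the patent does not specify how the network is constructed.

```python
from typing import Dict, List


def build_decoding_network(text: str) -> Dict[str, List[str]]:
    """Build a linear word-level grammar network as an adjacency list.

    Nodes: "<s>" (start), one node per word ("index:word", so repeated words
    stay distinct), an optional silence node between neighbouring words
    (an assumption of this sketch), and "</s>" (end). Each list holds the
    nodes reachable over one arc.
    """
    words = text.split()
    if not words:
        raise ValueError("text must contain at least one word")
    nodes = [f"{i}:{w}" for i, w in enumerate(words)]
    graph: Dict[str, List[str]] = {"<s>": [nodes[0]]}
    for i, node in enumerate(nodes):
        if i + 1 < len(nodes):
            sil = f"sil:{i}"
            # Either go straight to the next word or pass through silence.
            graph[node] = [nodes[i + 1], sil]
            graph[sil] = [nodes[i + 1]]
        else:
            graph[node] = ["</s>"]
    graph["</s>"] = []
    return graph


network = build_decoding_network("not at all")
```

Because the graph only admits the words of the known text in their original order, a decoder constrained by it can only align the speech to that text, which is what makes the word boundaries recoverable.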

Then, the acoustic features of the speech data to be tested are extracted and passed to the decoder for decoding, so that the speech data to be tested is aligned with the corresponding text content. The word boundary information of the speech data to be tested can be obtained from the alignment result. Here, an acoustic feature is a set of values describing the basic short-time characteristics of speech, usually a feature vector of fixed dimension (such as a 39-dimensional MFCC (Mel-frequency cepstral coefficient) feature vector). Word boundary information refers, for each word in the speech to be tested, to the time frame (or instant) at which its pronunciation starts and the time frame (or instant) at which its pronunciation ends. From the word boundary information, the time period taken to read each word in the speech data to be tested, and the time periods between words, can therefore be obtained.
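The step from word boundaries to time periods can be sketched as follows. The `WordBoundary` type, the two helper functions and the frame rate of 100 frames per second (a 10 ms frame shift) are assumptions for illustration, not values taken from the patent.

```python
from typing import List, NamedTuple

FRAMES_PER_SECOND = 100  # assumption: 10 ms frame shift


class WordBoundary(NamedTuple):
    word: str
    start_frame: int  # frame at which the word's pronunciation starts
    end_frame: int    # frame at which it ends (exclusive)


def word_durations(boundaries: List[WordBoundary]) -> List[float]:
    """Seconds spent reading each word."""
    return [(b.end_frame - b.start_frame) / FRAMES_PER_SECOND for b in boundaries]


def inter_word_gaps(boundaries: List[WordBoundary]) -> List[float]:
    """Seconds between the end of each word and the start of the next."""
    return [
        (nxt.start_frame - cur.end_frame) / FRAMES_PER_SECOND
        for cur, nxt in zip(boundaries, boundaries[1:])
    ]


# Made-up alignment result for the text "not at all".
alignment = [
    WordBoundary("not", 0, 20),
    WordBoundary("at", 20, 30),
    WordBoundary("all", 45, 80),
]
durations = word_durations(alignment)
gaps = inter_word_gaps(alignment)
```

In this made-up alignment there is no gap between "not" and "at", and a gap of 15 frames between "at" and "all".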

Finally, the prosodic information of the speech data to be tested can be extracted according to its word boundary information. The prosody of speech mainly includes liaison, sense-group pauses, stress, and rising and falling intonation. Different prosodic features are extracted for detecting different kinds of prosody. For example, to judge liaison, the extracted prosodic features include whether there is silence between two words, whether the fundamental frequency is continuous, and whether the energy shows a trough; to judge pauses, features such as the silence duration between words are extracted; to judge stress, features such as the energy level and fundamental frequency of a word are extracted; to judge rising and falling intonation, features such as the fundamental frequency slope of a word are extracted. The above prosodic features of each word and between words can then be computed in turn according to the word boundary information, and prosodic information such as the stress and rising or falling intonation of each word in the speech to be tested, and the liaison and pauses between words, can be determined according to corresponding decision strategies.
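A minimal sketch of how such features might be computed from frame-level tracks is given below. It assumes that per-frame energy and fundamental frequency (F0) values have already been produced by some front end, that an F0 of 0 marks an unvoiced frame, and that the slope is estimated by an ordinary least-squares line fit; the function names and these conventions are illustrative choices, not part of the patent.

```python
from typing import Sequence


def mean_energy(energy: Sequence[float], start: int, end: int) -> float:
    """Average frame energy over the frames [start, end) of one word."""
    frames = energy[start:end]
    return sum(frames) / len(frames)


def silent_frame_count(energy: Sequence[float], start: int, end: int,
                       silence_threshold: float) -> int:
    """Number of frames in [start, end) whose energy is below the threshold."""
    return sum(1 for e in energy[start:end] if e < silence_threshold)


def f0_slope(f0: Sequence[float], start: int, end: int) -> float:
    """Least-squares slope of F0 (per frame) over voiced frames in [start, end).

    Frames with f0 <= 0 are treated as unvoiced and skipped (an assumption).
    Returns 0.0 when fewer than two voiced frames are available.
    """
    points = [(t, f0[t]) for t in range(start, end) if f0[t] > 0]
    if len(points) < 2:
        return 0.0
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_f = sum(f for _, f in points) / n
    numerator = sum((t - mean_t) * (f - mean_f) for t, f in points)
    denominator = sum((t - mean_t) ** 2 for t, _ in points)
    return numerator / denominator


rising = f0_slope([100.0, 110.0, 120.0, 130.0], 0, 4)
with_unvoiced = f0_slope([100.0, 0.0, 120.0], 0, 3)
quiet = silent_frame_count([0.5, 0.01, 0.02, 0.6], 0, 4, silence_threshold=0.05)
```

A positive slope suggests rising intonation and a negative slope falling intonation; the silent frame count, divided by the frame rate, gives the silence duration used for pause detection.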

For example, if there is no silence between two words and the fundamental frequency is continuous, liaison can be judged to exist between the two words; if the silence duration between two words exceeds a certain time threshold, such as 0.05 seconds, a pause can be judged to exist between the two words; if the energy level of one or more words exceeds a certain energy threshold, the one or more words are judged to be stressed. Similarly, the rising or falling intonation of a word can be judged from its fundamental frequency slope.
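These decision rules can be written down directly. In the sketch below, only the 0.05 second pause threshold comes from the text; the energy threshold, the slope tolerance and all function names are hypothetical placeholders that would have to be tuned for a real energy and F0 scale.

```python
from typing import Optional

PAUSE_THRESHOLD_S = 0.05        # example value given in the text
STRESS_ENERGY_THRESHOLD = 0.6   # hypothetical; depends on the energy scale
SLOPE_TOLERANCE = 0.5           # hypothetical; slopes within +/- this are "level"


def classify_gap(silence_s: float, f0_continuous: bool) -> Optional[str]:
    """Classify what happens between two neighbouring words."""
    if silence_s > PAUSE_THRESHOLD_S:
        return "pause"
    if silence_s == 0.0 and f0_continuous:
        return "liaison"
    return None  # neither liaison nor pause


def is_stressed(word_energy: float) -> bool:
    """A word counts as stressed when its energy exceeds the threshold."""
    return word_energy > STRESS_ENERGY_THRESHOLD


def intonation(f0_slope: float) -> str:
    """Map a fundamental frequency slope to rising, falling or level."""
    if f0_slope > SLOPE_TOLERANCE:
        return "rising"
    if f0_slope < -SLOPE_TOLERANCE:
        return "falling"
    return "level"
```

For instance, `classify_gap(0.15, False)` yields `"pause"`, while a short silence of 0.03 seconds is classified as neither liaison nor pause.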

S103: generate prosodic labeling information of the speech data to be tested according to its word boundary information and prosodic information.

The prosodic labeling information comprises at least one item of prosodic information and position information corresponding to each item of prosodic information, where each item of position information is determined according to the corresponding prosodic boundary information. Prosodic labeling information refers to the position information of the correct prosody labeled in the text corresponding to the speech, that is, labels indicating between which two words in the text there is liaison or a pause, or which word is stressed. Prosodic labeling is an important basis for prosody assessment.

In one embodiment of the present invention, generating the prosodic labeling information of the speech data to be tested according to its word boundary information and prosodic information may specifically comprise: determining prosodic boundary information of the speech data to be tested according to its word boundary information and prosodic information; and labeling the prosodic information of the speech data to be tested according to its prosodic boundary information, so as to generate the prosodic labeling information of the speech data to be tested.

The prosodic boundary information can be determined according to the word boundary information and the prosodic information corresponding to each word, the position information of each item of prosodic information can then be determined, and labeling is performed according to that position information. For example, if there is liaison between words A and B, the start and end time frames of this liaison are the pronunciation start time frame (or instant) of word A and the pronunciation end time frame (or instant) of word B, and the position information corresponding to this liaison is the positions of word A and word B in the text. The corresponding prosodic information can then be labeled at the corresponding position according to the position information of each item of prosody.
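The liaison example above can be sketched as a small data structure. The `ProsodyLabel` type and both helper functions are hypothetical; in particular, giving a stressed word the time span of the word itself is an assumption of this sketch, since the text only spells out the liaison case.

```python
from typing import List, NamedTuple, Tuple

Boundary = Tuple[str, int, int]  # (word, start_frame, end_frame)


class ProsodyLabel(NamedTuple):
    kind: str                        # e.g. "liaison" or "stress"
    word_positions: Tuple[int, ...]  # indices of the affected words in the text
    start_frame: int                 # prosodic boundary: start
    end_frame: int                   # prosodic boundary: end


def label_liaison(boundaries: List[Boundary], i: int) -> ProsodyLabel:
    """Label liaison between word i and word i + 1.

    The prosodic boundary runs from the start of word i to the end of
    word i + 1, and the position is the pair of word indices in the text.
    """
    _, start, _ = boundaries[i]
    _, _, end = boundaries[i + 1]
    return ProsodyLabel("liaison", (i, i + 1), start, end)


def label_stress(boundaries: List[Boundary], i: int) -> ProsodyLabel:
    """Label word i as stressed, spanning the word's own boundaries."""
    _, start, end = boundaries[i]
    return ProsodyLabel("stress", (i,), start, end)


words = [("not", 0, 20), ("at", 20, 30), ("all", 45, 80)]
labels = [label_liaison(words, 0), label_stress(words, 2)]
```

Keeping the text position separate from the time span matters later: a learner and a reference speaker will rarely share frame timings, but they can share positions in the text.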

S104: compare the prosodic labeling information of the speech data to be tested with the prosodic labeling information of pre-labeled reference speech data, so as to detect whether the speech data to be tested has a pronunciation rhythm problem.

Here, the reference speech is the standard speech that the speech to be tested follows.

Specifically, in an embodiment of the present invention, it may be judged whether the prosodic labeling information of the speech data to be tested and the prosodic labeling information of the pre-labeled reference speech data satisfy the following conditions:

All the prosodic information labeled in the prosodic labeling information of the reference speech data is also labeled in the prosodic labeling information of the speech data to be tested, with consistent corresponding position information; and the prosodic information labeled in the prosodic labeling information of the speech data to be tested does not include any prosodic information that is not labeled in the prosodic labeling information of the reference speech data.

If these conditions are not satisfied, the speech data to be tested is judged to have a pronunciation rhythm problem.

That is, the speech data to be tested is judged to have no pronunciation rhythm problem only when it contains all the prosody of the reference speech data (with identical corresponding prosodic boundary information) and contains no prosody that the reference speech data lacks. Otherwise, the speech data to be tested has a pronunciation rhythm problem.
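The two conditions amount to requiring that the two label sets be equal. A minimal sketch follows, assuming each label is reduced to a pair of prosody kind and word positions in the text so that labels from different recordings are comparable; this representation and the function name are assumptions of the example.

```python
from typing import Iterable, Tuple

Label = Tuple[str, Tuple[int, ...]]  # (kind, word positions in the text)


def has_rhythm_problem(test_labels: Iterable[Label],
                       reference_labels: Iterable[Label]) -> bool:
    """Return True when the tested speech deviates from the reference."""
    test, reference = set(test_labels), set(reference_labels)
    covers_reference = reference <= test  # condition 1: nothing is missing
    nothing_extra = test <= reference     # condition 2: nothing was added
    return not (covers_reference and nothing_extra)


reference = [("liaison", (0, 1)), ("stress", (2,))]
same = has_rhythm_problem([("stress", (2,)), ("liaison", (0, 1))], reference)
missing_one = has_rhythm_problem([("stress", (2,))], reference)
extra_one = has_rhythm_problem(reference + [("pause", (1, 2))], reference)
```

Writing the two subset checks separately, rather than a single equality test, keeps the code aligned with the two conditions as they are stated.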

Further, in one embodiment of the present invention, when the speech data to be tested is judged to have a rhythm problem, pronunciation rhythm problem prompt information is generated according to the comparison result and presented to the user. Specifically, the prosody in which the speech data to be tested differs from the reference speech data (which may include missing prosody or extra prosody) can be determined from the comparison result, and the user is prompted about those differences. In this way, the user receives timely prompts and feedback about pronunciation rhythm problems, which helps the user improve and enhances the user experience.
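One possible way to turn the comparison result into prompt information is sketched below. The message wording, the function name `rhythm_feedback` and the label representation (prosody kind plus word positions) are all assumptions; the patent does not prescribe a prompt format.

```python
from typing import Iterable, List, Sequence, Tuple

Label = Tuple[str, Tuple[int, ...]]  # (kind, word positions in the text)


def rhythm_feedback(test_labels: Iterable[Label],
                    reference_labels: Iterable[Label],
                    words: Sequence[str]) -> List[str]:
    """List the prosody that is missing from, or extra in, the tested speech."""
    test, reference = set(test_labels), set(reference_labels)
    messages = []
    for kind, positions in sorted(reference - test):
        where = " + ".join(words[p] for p in positions)
        messages.append(f"Missing {kind} at: {where}")
    for kind, positions in sorted(test - reference):
        where = " + ".join(words[p] for p in positions)
        messages.append(f"Unexpected {kind} at: {where}")
    return messages


text_words = ["not", "at", "all"]
reference = {("liaison", (0, 1)), ("stress", (2,))}
attempt = {("stress", (2,)), ("pause", (1, 2))}
messages = rhythm_feedback(attempt, reference, text_words)
```

An empty list means the attempt matched the reference, so nothing needs to be shown to the user.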

An embodiment of the present invention may further comprise a step of labeling the reference speech data to obtain the prosodic labeling information of the reference speech data. Specifically, as shown in Fig. 2, the method for labeling the reference speech data may comprise the following steps:

S201: decode the reference speech data, and obtain word boundary information of the reference speech data according to the decoding result.

In one embodiment of the present invention, a decoding network may be built from the text content corresponding to the reference speech data, and the decoding network and an acoustic model may be passed to a decoder; the acoustic features of the reference speech data are then extracted and passed to the decoder for decoding, so that the reference speech data is aligned with the corresponding text content. The word boundary information of the reference speech data can be obtained from the alignment result.

S202: extract prosodic information of the reference speech data.

Specifically, it may be judged whether there is silence between the words of the reference speech data and whether the fundamental frequency is continuous, stress judgment may be performed on the reference speech data, and the silence duration, energy level, fundamental frequency slope and so on may be obtained, so as to extract the prosodic features of the reference speech data. Further, prosodic information such as liaison, pauses, stress, and rising and falling intonation in the reference speech data can be determined from these prosodic features according to the corresponding decision strategies.

S203: determine prosodic boundary information of the reference speech data according to the prosodic information and the word boundary information.

For example, if there is liaison between words A and B, the start and end time frames of this liaison are the pronunciation start time frame (or instant) of word A and the pronunciation end time frame (or instant) of word B. The corresponding prosodic information can then be labeled at the corresponding position according to each item of prosodic boundary information.

S204: label the reference speech data according to the prosodic boundary information.

In this way, the prosodic information of the reference speech data can be detected and labeled automatically, which avoids the tedium and errors of manual labeling. Once labeled, the reference speech data can be reused later, making detection convenient and accurate.
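The "label once, reuse later" idea can be sketched with a small cache around steps S201 to S204. The class `ReferenceLabeler`, its injected `align` and `detect` callables and the in-memory dictionary are hypothetical stand-ins; a real system would plug in an actual decoder and prosody detector, and might persist the labels instead.

```python
from typing import Callable, Dict, FrozenSet, List, Tuple

Boundary = Tuple[str, int, int]      # (word, start_frame, end_frame)
Label = Tuple[str, Tuple[int, ...]]  # (kind, word positions in the text)


class ReferenceLabeler:
    """Labels each reference recording once and reuses the result."""

    def __init__(self,
                 align: Callable[[bytes, str], List[Boundary]],
                 detect: Callable[[bytes, List[Boundary]], List[Label]]) -> None:
        self._align = align
        self._detect = detect
        self._cache: Dict[str, FrozenSet[Label]] = {}

    def labels_for(self, ref_id: str, audio: bytes, text: str) -> FrozenSet[Label]:
        if ref_id not in self._cache:
            boundaries = self._align(audio, text)     # S201
            labels = self._detect(audio, boundaries)  # S202 to S204
            self._cache[ref_id] = frozenset(labels)
        return self._cache[ref_id]


calls: List[str] = []


def fake_align(audio: bytes, text: str) -> List[Boundary]:
    calls.append("align")  # record each call so reuse is observable
    return [("not", 0, 20), ("at", 20, 30)]


def fake_detect(audio: bytes, boundaries: List[Boundary]) -> List[Label]:
    return [("liaison", (0, 1))]


labeler = ReferenceLabeler(fake_align, fake_detect)
first = labeler.labels_for("lesson-1", b"", "not at")
second = labeler.labels_for("lesson-1", b"", "not at")
```

The second request for the same reference recording is served from the cache, so the alignment stub runs only once.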

To implement the above embodiments, the present invention further proposes a device for detecting pronunciation rhythm problems.

Fig. 3 is a schematic structural diagram of a device for detecting pronunciation rhythm problems according to an embodiment of the present invention.

As shown in Fig. 3, the device for detecting pronunciation rhythm problems according to the embodiment of the present invention comprises a receiving module 10, an acquisition module 20, a generation module 30 and a detection module 40.

Specifically, the receiving module 10 is configured to receive speech data to be tested. For example, the speech data to be tested may be read-along speech recorded by a user following a standard reference speech.

The acquisition module 20 is configured to obtain word boundary information of the speech data to be tested, and to extract prosodic information of the speech data to be tested.

More specifically, in one embodiment of the present invention, the acquisition module 20 may first obtain the text content corresponding to the speech data to be tested (for example, the text that the read-along speech follows), build a decoding network from this text content, and then pass the decoding network and an acoustic model to a decoder. The acoustic model is the underlying mathematical model of speech recognition; its modeling unit may be a phoneme, a syllable or a word, and the current mainstream approach is hidden Markov modeling. The decoder is one of the core components of a speech recognition system: given input acoustic features, its task is to find, according to the acoustic model and the decoding network, the most probable sequence of language units for those features. The decoding network, also called a grammar network, is a directed graph whose nodes are the phonemes (such as the initials and finals of Chinese characters), syllables or words of the above text content and whose arcs are the connections between these units. The decoding network defines the range of language unit sequences that the decoder can output.

Then, the acquisition module 20 extracts the acoustic features of the speech data to be tested and passes them to the decoder for decoding, so that the speech data to be tested is aligned with the corresponding text content. The word boundary information of the speech data to be tested can be obtained from the alignment result. Here, an acoustic feature is a set of values describing the basic short-time characteristics of speech, usually a feature vector of fixed dimension (such as a 39-dimensional MFCC (Mel-frequency cepstral coefficient) feature vector). Word boundary information refers, for each word in the speech to be tested, to the time frame (or instant) at which its pronunciation starts and the time frame (or instant) at which its pronunciation ends. From the word boundary information, the time period taken to read each word in the speech data to be tested, and the time periods between words, can therefore be obtained.

Finally, the acquisition module 20 can extract the prosodic information of the speech data to be tested according to its word boundary information. The prosody of speech mainly includes liaison, sense-group pauses, stress, and rising and falling intonation. Different prosodic features are extracted for detecting different kinds of prosody. For example, when the acquisition module 20 judges liaison, the extracted prosodic features include whether there is silence between two words, whether the fundamental frequency is continuous, and whether the energy shows a trough; to judge pauses, features such as the silence duration between words are extracted; to judge stress, features such as the energy level and fundamental frequency of a word are extracted; to judge rising and falling intonation, features such as the fundamental frequency slope of a word are extracted. The above prosodic features of each word and between words can then be computed in turn according to the word boundary information, and prosodic information such as the stress and rising or falling intonation of each word in the speech to be tested, and the liaison and pauses between words, can be determined according to corresponding decision strategies.

The generation module 30 is configured to generate prosodic labeling information of the speech data to be tested according to its word boundary information and prosodic information. The prosodic labeling information comprises at least one item of prosodic information and position information corresponding to each item of prosodic information, where each item of position information is determined according to the corresponding prosodic boundary information. Prosodic labeling information refers to the position information of the correct prosody labeled in the text corresponding to the speech, that is, labels indicating between which two words in the text there is liaison or a pause, or which word is stressed. Prosodic labeling is an important basis for prosody assessment.

In one embodiment of the present invention, the generation module 30 is specifically configured to: determine prosodic boundary information of the speech data to be tested according to its word boundary information and prosodic information; and label the prosodic information of the speech data to be tested according to its prosodic boundary information, so as to generate the prosodic labeling information of the speech data to be tested.

The detection module 40 is configured to compare the prosodic labeling information of the speech data to be tested with the prosodic labeling information of pre-labeled reference speech data, so as to detect whether the speech data to be tested has a pronunciation rhythm problem. Here, the reference speech is the standard speech that the speech to be tested follows.

In an embodiment of the present invention, the detection module 40 is specifically configured to judge whether the prosodic labeling information of the speech data to be tested and the prosodic labeling information of the pre-labeled reference speech data satisfy the following conditions: all the prosodic information labeled in the prosodic labeling information of the reference speech data is also labeled in the prosodic labeling information of the speech data to be tested, with consistent corresponding position information; and the prosodic information labeled in the prosodic labeling information of the speech data to be tested does not include any prosodic information that is not labeled in the prosodic labeling information of the reference speech data. If these conditions are not satisfied, the speech data to be tested is judged to have a pronunciation rhythm problem.

Fig. 4 is a schematic structural diagram of a device for detecting pronunciation rhythm problems according to a specific embodiment of the present invention.

As shown in Fig. 4, the device for detecting pronunciation rhythm problems according to the embodiment of the present invention comprises a receiving module 10, an acquisition module 20, a generation module 30, a detection module 40 and a labeling module 50.

Specifically, the labeling module 50 is configured to label the reference speech data to obtain the prosodic labeling information of the reference speech data.

In one embodiment of the present invention, the labeling module 50 may be specifically configured to: decode the reference speech data, and obtain word boundary information of the reference speech data according to the decoding result; extract prosodic information of the reference speech data; determine prosodic boundary information of the reference speech data according to the prosodic information and the word boundary information; and label the reference speech data according to the prosodic boundary information.

More specifically, the labeling module 50 may build a decoding network from the text content corresponding to the reference speech data and pass the decoding network and an acoustic model to a decoder; the acoustic features of the reference speech data are then extracted and passed to the decoder for decoding, so that the reference speech data is aligned with the corresponding text content. The word boundary information of the reference speech data can be obtained from the alignment result.

Then, the labeling module 50 may judge whether there is silence between the words of the reference speech data and whether the fundamental frequency is continuous, perform stress judgment on the reference speech data, and obtain the silence duration, energy level, fundamental frequency slope and so on, so as to extract the prosodic features of the reference speech data. Further, prosodic information such as liaison, pauses, stress, and rising and falling intonation in the reference speech data can be determined from these prosodic features according to the corresponding decision strategies.

As shown in Figure 5, according to the pick-up unit of the pronunciation rhythm problem of the embodiment of the present invention, comprising: receiver module 10, acquisition module 20, generation module 30, detection module 40, labeling module 50 and reminding module 60.

Particularly, reminding module 60, for when judging that speech data to be measured is when existing rhythm problem, generating pronunciation rhythm problem clew information according to comparison result, and pointing out user.More specifically, reminding module 60 is used can according to comparison result, judge that speech data to be measured is relative to the rhythm (can comprise the rhythm lacked or the rhythm had more) not identical in reference voice data, and for the not identical rhythm, user is pointed out.

Thus, the device for detecting a pronunciation rhythm problem according to the embodiment of the present invention can promptly prompt the user about pronunciation rhythm problems and give feedback, which makes it easier for the user to improve and enhances the user experience.
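The comparison behind the prompt information can be sketched as a set difference between two annotation sets. This is a minimal illustration and not the patented implementation: each annotation is assumed to be a (prosody type, position) pair, and the names `compare_prosody` and `prompts` as well as the message wording are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Annotation = Tuple[str, int]  # (prosody type, position index); assumed format


def compare_prosody(test: Set[Annotation], reference: Set[Annotation]) -> Dict[str, List[Annotation]]:
    """Return prosody missing from, and extra in, the tested speech."""
    return {
        "missing": sorted(reference - test),  # annotated in reference only
        "extra": sorted(test - reference),    # annotated in tested speech only
    }


def prompts(diff: Dict[str, List[Annotation]]) -> List[str]:
    """Turn the comparison result into user-facing prompt messages."""
    msgs = [f"missing {kind} at position {pos}" for kind, pos in diff["missing"]]
    msgs += [f"unexpected {kind} at position {pos}" for kind, pos in diff["extra"]]
    return msgs


reference = {("pause", 2), ("stress", 4)}
tested = {("pause", 2), ("liaison", 3)}
diff = compare_prosody(tested, reference)
has_problem = bool(diff["missing"] or diff["extra"])
```

Because the position is part of each pair, a prosodic event of the right type at the wrong position is reported both as missing and as extra, which matches the requirement that positions be consistent.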

Any process or method description in a flowchart or otherwise described herein may be understood as representing a module, segment or portion of code that includes one or more executable instructions for implementing the steps of a specific logical function or process. The scope of the preferred embodiments of the present invention includes other implementations, in which functions may be performed out of the order shown or discussed, including in a substantially simultaneous manner or in the reverse order depending on the functions involved, as should be understood by those skilled in the art to which the embodiments of the present invention belong.

The logic and/or steps represented in a flowchart or otherwise described herein, for example an ordered list of executable instructions for implementing logical functions, may be embodied in any computer-readable medium for use by, or in connection with, an instruction execution system, apparatus or device (such as a computer-based system, a system including a processor, or another system that can fetch instructions from the instruction execution system, apparatus or device and execute them). For the purposes of this specification, a "computer-readable medium" may be any means that can contain, store, communicate, propagate or transport a program for use by, or in connection with, an instruction execution system, apparatus or device. More specific examples (a non-exhaustive list) of computer-readable media include: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). In addition, the computer-readable medium may even be paper or another suitable medium on which the program is printed, since the program can be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting or otherwise processing it in a suitable manner if necessary, and then stored in a computer memory.

It should be understood that each part of the present invention may be implemented in hardware, software, firmware or a combination thereof. In the above embodiments, multiple steps or methods may be implemented in software or firmware that is stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, they may be implemented by any one or a combination of the following techniques well known in the art: a discrete logic circuit having logic gates for implementing logic functions on data signals, an application-specific integrated circuit having suitable combinational logic gates, a programmable gate array (PGA), a field-programmable gate array (FPGA), and the like.

Those of ordinary skill in the art will understand that all or part of the steps carried by the methods of the above embodiments may be completed by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, performs one or a combination of the steps of the method embodiments.

In addition, the functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may exist physically alone, or two or more units may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. If the integrated module is implemented in the form of a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium.

The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like.

In the description of this specification, a description referring to the terms "an embodiment", "some embodiments", "an example", "a specific example" or "some examples" means that a specific feature, structure, material or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic uses of these terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials or characteristics described may be combined in a suitable manner in any one or more embodiments or examples.

Although embodiments of the present invention have been shown and described, those of ordinary skill in the art will understand that various changes, modifications, substitutions and variations may be made to these embodiments without departing from the principle and spirit of the present invention; the scope of the present invention is defined by the claims and their equivalents.

Claims

1. A method for detecting a pronunciation rhythm problem, characterized by comprising:

receiving speech data to be tested;

obtaining word boundary information of the speech data to be tested, and extracting prosodic information of the speech data to be tested;

generating prosodic annotation information of the speech data to be tested according to the word boundary information and the prosodic information of the speech data to be tested; and

comparing the prosodic annotation information of the speech data to be tested with prosodic annotation information of reference voice data annotated in advance, so as to detect whether the speech data to be tested has a pronunciation rhythm problem.

2. The method for detecting a pronunciation rhythm problem according to claim 1, characterized by further comprising:

annotating the reference voice data to obtain the prosodic annotation information of the reference voice data.

3. The method for detecting a pronunciation rhythm problem according to claim 2, characterized in that annotating the reference voice data specifically comprises:

decoding the reference voice data, and obtaining word boundary information of the reference voice data from the decoding result;

extracting prosodic information of the reference voice data;

determining prosodic boundary information of the reference voice data according to the prosodic information and the word boundary information; and

annotating the reference voice data according to the prosodic boundary information.

4. The method for detecting a pronunciation rhythm problem according to claim 1, characterized in that generating the prosodic annotation information of the speech data to be tested according to the word boundary information and the prosodic information of the speech data to be tested specifically comprises:

determining prosodic boundary information of the speech data to be tested according to the word boundary information and the prosodic information of the speech data to be tested; and

annotating the prosodic information of the speech data to be tested according to the prosodic boundary information of the speech data to be tested, so as to generate the prosodic annotation information of the speech data to be tested.

5. The method for detecting a pronunciation rhythm problem according to any one of claims 1 to 4, characterized in that the prosodic annotation information comprises at least one piece of prosodic information and position information respectively corresponding to the at least one piece of prosodic information, wherein each piece of position information is determined according to the corresponding prosodic boundary information.

6. The method for detecting a pronunciation rhythm problem according to claim 5, characterized in that comparing the prosodic annotation information of the speech data to be tested with the prosodic annotation information of the reference voice data annotated in advance specifically comprises:

judging whether the prosodic annotation information of the speech data to be tested and the prosodic annotation information of the reference voice data annotated in advance meet the following conditions:

all of the prosodic information annotated in the prosodic annotation information of the reference voice data is annotated in the prosodic annotation information of the speech data to be tested, and the position information corresponding to the annotated prosodic information is consistent;

and the prosodic information annotated in the prosodic annotation information of the speech data to be tested does not include prosodic information that is not annotated in the prosodic annotation information of the reference voice data; and

if the conditions are not met, judging that the speech data to be tested has a pronunciation rhythm problem.

7. The method for detecting a pronunciation rhythm problem according to claim 1, characterized by further comprising:

when it is judged that the speech data to be tested has a rhythm problem, generating pronunciation rhythm problem prompt information according to the comparison result, and prompting the user.

8. A device for detecting a pronunciation rhythm problem, characterized by comprising:

a receiver module, configured to receive speech data to be tested;

an acquisition module, configured to obtain word boundary information of the speech data to be tested and to extract prosodic information of the speech data to be tested;

a generation module, configured to generate prosodic annotation information of the speech data to be tested according to the word boundary information and the prosodic information of the speech data to be tested; and

a detection module, configured to compare the prosodic annotation information of the speech data to be tested with prosodic annotation information of reference voice data annotated in advance, so as to detect whether the speech data to be tested has a pronunciation rhythm problem.

9. The device for detecting a pronunciation rhythm problem according to claim 8, characterized by further comprising:

a labeling module, configured to annotate the reference voice data to obtain the prosodic annotation information of the reference voice data.

10. The device for detecting a pronunciation rhythm problem according to claim 9, characterized in that the labeling module is specifically configured to:

extract prosodic information of the reference voice data;

11. The device for detecting a pronunciation rhythm problem according to claim 8, characterized in that the generation module is specifically configured to:

12. The device for detecting a pronunciation rhythm problem according to any one of claims 8 to 11, characterized in that the prosodic annotation information comprises at least one piece of prosodic information and position information respectively corresponding to the at least one piece of prosodic information, wherein each piece of position information is determined according to the corresponding prosodic boundary information.

13. The device for detecting a pronunciation rhythm problem according to claim 12, characterized in that the detection module is specifically configured to:

and the prosodic information annotated in the prosodic annotation information of the speech data to be tested does not include prosodic information that is not annotated in the prosodic annotation information of the reference voice data.

14. The device for detecting a pronunciation rhythm problem according to claim 8, characterized by further comprising:

a reminding module, configured to, when it is judged that the speech data to be tested has a rhythm problem, generate pronunciation rhythm problem prompt information according to the comparison result, and prompt the user.