CN104464751A - Method and device for detecting pronunciation rhythm problem - Google Patents

Info

Publication number
CN104464751A
Authority
CN
China
Prior art keywords
information
prosodic
measured
rhythm
speech data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410674294.6A
Other languages
Chinese (zh)
Other versions
CN104464751B (en)
Inventor
张儒瑞
赵乾
潘颂声
宋碧霄
吴玲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd

Landscapes

  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention provides a method and device for detecting a pronunciation rhythm problem. The method comprises the steps of receiving voice data to be detected; obtaining word boundary information of the voice data to be detected, and extracting rhythm information of the voice data to be detected; generating rhythm annotated information of the voice data to be detected according to the word boundary information and the rhythm information of the voice data to be detected; carrying out comparative analysis on the rhythm annotated information of the voice data to be detected and rhythm annotated information of reference voice data annotated in advance so as to detect whether the pronunciation rhythm problem exists in the voice data to be detected or not. The method for detecting the pronunciation rhythm problem automatically obtains the rhythm annotated information of voice and enables the information to be compared, manual annotation is not needed, application is more flexible and wide, and particularly in voice learning software, the rhythm problem of user pronunciation can be evaluated more efficiently by automatically detecting the rhythm of the voice. Particularly, in the detecting process, a large-capacity database is not needed, the calculated amount is small, and the detection efficiency is improved.

Description

Method and device for detecting a pronunciation rhythm problem
Technical field
The present invention relates to the field of speech processing technology, and in particular to a method and device for detecting a pronunciation rhythm problem.
Background art
With the development of speech recognition technology, speech evaluation technology plays an increasingly important role in speech recognition and its applications. Speech evaluation technology is mainly used to assess the quality of speech data. This includes not only assessing the pronunciation quality of the words in the speech data, but also accurately detecting and assessing its rhythm. For example, in language learning, a user learns by listening to a standard recording and reading after it. The user compares the pronunciation and rhythm of the follow-reading with those of the standard recording, and makes corrections according to the comparison result in order to keep improving. Accurately assessing the follow-reading and feeding back the rhythm problems in it is therefore key to mastering a language quickly. A speech rhythm problem is a prosodic error in the speech, for example no liaison where liaison is expected, no pause where a pause is expected, or no stress where stress is expected. In some other scenarios, such as speech recognition, pronunciation rhythm problems in speech also need to be detected.
At present, the main techniques for detecting rhythm problems are the manual annotation method and the prosodic constraint method.
The manual annotation method requires the correct rhythm of the speech to be annotated by hand in the text corresponding to the speech. Acoustic features related to the rhythm are then extracted at the annotated positions in the speech, and these features are used to detect whether the speech has a rhythm problem. For example, for a word annotated as stressed, acoustic features such as the energy and fundamental frequency of the word's speech are extracted, and whether the word was actually stressed is determined by, for instance, checking whether these features exceed a certain threshold.
The prosodic constraint method assesses the rhythm of input speech data according to prosodic constraints. A prosodic constraint is obtained as follows: the linguistic or syntactic structure of the input speech data is matched against the structures of the standard speech in a standard corpus, and the rhythm boundary positions that the input speech should have are derived from the rhythm boundary positions of standard speech with a similar structure. When the standard corpus contains many standard utterances whose structure is similar to that of the input speech, the rhythm boundary to adopt for the input speech data can be determined from the statistical probability of the rhythm boundaries.
Both existing rhythm assessment techniques require the word boundaries and rhythm boundaries of speech to be annotated manually; without manual annotation, the rhythm of a user's pronunciation cannot be assessed. In addition, the prosodic constraint method requires a large-capacity standard corpus. On the one hand, this occupies a large amount of storage space. On the other hand, the standard speech in the corpus must itself be given correct prosodic labels by hand, and determining a prosodic constraint requires querying the whole standard corpus and computing the statistical probability of the rhythm boundaries, so the amount of computation is very large.
Summary of the invention
The present invention aims to solve the above technical problems at least to some extent.
To this end, a first object of the present invention is to propose a method for detecting a pronunciation rhythm problem that requires no manual annotation, can be applied more flexibly and widely, assesses the rhythm problems of a user's pronunciation more effectively, and improves detection efficiency.
A second object of the present invention is to propose a device for detecting a pronunciation rhythm problem.
To achieve the above objects, an embodiment of a first aspect of the present invention proposes a method for detecting a pronunciation rhythm problem, comprising: receiving speech data under test; obtaining word boundary information of the speech data under test, and extracting prosodic information of the speech data under test; generating prosodic labeling information of the speech data under test according to its word boundary information and prosodic information; and comparing the prosodic labeling information of the speech data under test with the prosodic labeling information of reference speech data annotated in advance, so as to detect whether the speech data under test has a pronunciation rhythm problem.
The method for detecting a pronunciation rhythm problem according to the embodiment of the present invention obtains the word boundary information of the speech data under test and extracts its prosodic information, generates the prosodic labeling information of the speech data under test from them, and compares it with the prosodic labeling information of reference speech data annotated in advance to detect rhythm problems. The prosodic labeling information of speech is thus obtained and compared automatically, without manual annotation, so the method can be applied more flexibly and widely. In language learning software in particular, automatically detecting the rhythm of speech allows the rhythm problems of a user's pronunciation to be assessed more effectively. In addition, the detection process does not require a large-capacity database and involves little computation, which improves detection efficiency.
An embodiment of a second aspect of the present invention provides a device for detecting a pronunciation rhythm problem, comprising: a receiving module, configured to receive speech data under test; an acquisition module, configured to obtain word boundary information of the speech data under test and to extract prosodic information of the speech data under test; a generation module, configured to generate prosodic labeling information of the speech data under test according to its word boundary information and prosodic information; and a detection module, configured to compare the prosodic labeling information of the speech data under test with the prosodic labeling information of reference speech data annotated in advance, so as to detect whether the speech data under test has a pronunciation rhythm problem.
The device for detecting a pronunciation rhythm problem according to the embodiment of the present invention obtains the word boundary information of the speech data under test and extracts its prosodic information, generates the prosodic labeling information of the speech data under test from them, and compares it with the prosodic labeling information of reference speech data annotated in advance to detect rhythm problems. The prosodic labeling information of speech is thus obtained and compared automatically, without manual annotation, so the device can be applied more flexibly and widely. In language learning software in particular, automatically detecting the rhythm of speech allows the rhythm problems of a user's pronunciation to be assessed more effectively. In addition, the detection process does not require a large-capacity database and involves little computation, which improves detection efficiency.
Additional aspects and advantages of the present invention will be set forth in part in the following description, and in part will become apparent from the description or be learned through practice of the invention.
Brief description of the drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of the embodiments taken in conjunction with the accompanying drawings, in which:
Fig. 1 is a flowchart of a method for detecting a pronunciation rhythm problem according to an embodiment of the present invention;
Fig. 2 is a flowchart of a method for annotating reference speech data according to an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a device for detecting a pronunciation rhythm problem according to an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of a device for detecting a pronunciation rhythm problem according to a specific embodiment of the present invention;
Fig. 5 is a schematic structural diagram of a device for detecting a pronunciation rhythm problem according to another embodiment of the present invention.
Detailed description of the embodiments
Embodiments of the present invention are described in detail below. Examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals denote, throughout, the same or similar elements or elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary; they serve only to explain the present invention and are not to be construed as limiting it.
In the description of the present invention, it is to be understood that the term "multiple" means two or more, and that the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
The method and device for detecting a pronunciation rhythm problem according to embodiments of the present invention are described below with reference to the accompanying drawings.
Fig. 1 is a flowchart of a method for detecting a pronunciation rhythm problem according to an embodiment of the present invention. As shown in Fig. 1, the method for detecting a pronunciation rhythm problem according to the embodiment of the present invention comprises:
S101, receive speech data under test.
For example, the speech data under test may be follow-reading speech that a user records after a standard reference speech.
S102, obtain the word boundary information of the speech data under test, and extract the prosodic information of the speech data under test.
Specifically, in one embodiment of the present invention, the text content corresponding to the speech data under test (for example, the text being read after in follow-reading speech) may first be obtained, a decoding network may be built from this text content, and the decoding network and an acoustic model may then be passed to a decoder. The acoustic model is the underlying mathematical model of speech recognition; its modeling unit may be a phoneme, a syllable or a word, and the current mainstream approach is hidden Markov modeling. The decoder is one of the core components of a speech recognition system; its task is, given input acoustic features, to find the most probable sequence of language units for those features according to the acoustic model and the decoding network. The decoding network, also called a grammar network, is a directed graph whose nodes are the phonemes (for example, the initials and finals of Chinese characters), syllables or words in the above text content and whose arcs are the connections between them; it defines the range of language-unit sequences that the decoder can output.
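As a rough illustration of the decoding network described above, the sketch below builds a linear directed graph for a known text, with one node per word and one arc between consecutive nodes. The function name `build_decoding_network` and the `<s>` and `</s>` boundary nodes are assumptions made for illustration; the source does not specify the graph layout, and a practical network would more likely use phoneme-level or syllable-level nodes.

```python
def build_decoding_network(words):
    """Return (nodes, arcs) for a linear word-level graph over a known text.

    Hypothetical layout: <s> -> word 0 -> word 1 -> ... -> </s>.
    Each word node carries its index so that repeated words stay distinct.
    """
    nodes = ["<s>"] + [f"{i}:{w}" for i, w in enumerate(words)] + ["</s>"]
    # One outgoing arc per node except the final one, so the decoder may
    # only output the words of the text in their original order.
    arcs = {src: [dst] for src, dst in zip(nodes, nodes[1:])}
    return nodes, arcs


nodes, arcs = build_decoding_network(["good", "morning"])
print(nodes)
print(arcs)
```

Because the graph has a single path, decoding against it amounts to aligning the speech with the known text rather than recognizing free speech.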
Then, the acoustic features of the speech data under test are extracted and passed to the decoder for decoding, which aligns the speech data under test with the corresponding text content. The word boundary information of the speech data under test can be obtained from the alignment result. An acoustic feature is a set of values describing the basic short-time characteristics of speech, usually a feature vector of fixed dimension (such as a 39-dimensional MFCC (Mel-frequency cepstral coefficient) feature vector). Word boundary information refers to the time frame (or instant) at which the pronunciation of each word in the speech under test starts and the time frame (or instant) at which it ends. From the word boundary information, the time taken to read each word in the speech data under test, and the time between words, can therefore be obtained.
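The way word durations and inter-word times follow from word boundary information can be sketched as below. This is a minimal example that assumes the alignment result is already available as start and end times in seconds; the `WordBoundary` type, the function names and the sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class WordBoundary:
    word: str
    start: float  # time at which the word's pronunciation starts, in seconds
    end: float    # time at which the word's pronunciation ends, in seconds


def word_durations(bounds):
    """Time taken to read each word."""
    return [(b.word, b.end - b.start) for b in bounds]


def inter_word_gaps(bounds):
    """Time between the end of each word and the start of the next one."""
    return [(a.word, b.word, b.start - a.end) for a, b in zip(bounds, bounds[1:])]


# Made-up alignment result for the text "how are you".
alignment = [
    WordBoundary("how", 0.0, 0.25),
    WordBoundary("are", 0.25, 0.5),
    WordBoundary("you", 0.75, 1.25),
]
print(word_durations(alignment))
print(inter_word_gaps(alignment))
```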
Finally, the prosodic information of the speech data under test can be extracted according to its word boundary information. The rhythm of speech mainly includes liaison, sense-group pauses, stress, and rising and falling intonation. Different prosodic features are extracted to detect different rhythms. For example, to judge liaison, the extracted prosodic features include whether there is silence between two words, whether the fundamental frequency is continuous, and whether the energy shows a dip; to judge pauses, features such as the duration of silence between words are extracted; to judge stress, features such as the energy level and fundamental frequency of a word are extracted; and to judge rising or falling intonation, features such as the fundamental frequency slope of a word are extracted. According to the word boundary information, these prosodic features can then be computed in turn for each word and for each gap between words, and the prosodic information of the speech under test, such as the stress and rising or falling intonation of each word and the liaison and pauses between words, can be determined with the corresponding decision strategies.
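Two of the per-word features named above, energy level and fundamental frequency slope, could be computed roughly as follows. The sketch assumes that per-frame energies and fundamental frequency values for the frames inside one word are already available, and that frames are 10 ms apart; the function names, the frame shift and the least-squares fit are illustrative choices that the source does not prescribe.

```python
def mean_energy(frame_energies):
    """Average energy over the frames that belong to one word."""
    return sum(frame_energies) / len(frame_energies)


def f0_slope(f0_values, frame_shift_s=0.01):
    """Least-squares slope of the fundamental frequency, in Hz per second.

    frame_shift_s is an assumed frame spacing of 10 ms.
    """
    n = len(f0_values)
    if n < 2:
        return 0.0  # a single frame has no measurable trend
    times = [i * frame_shift_s for i in range(n)]
    mean_t = sum(times) / n
    mean_f = sum(f0_values) / n
    num = sum((t - mean_t) * (f - mean_f) for t, f in zip(times, f0_values))
    den = sum((t - mean_t) ** 2 for t in times)
    return num / den


print(mean_energy([1.0, 2.0, 3.0]))
print(f0_slope([100.0, 110.0, 120.0]))
```

A positive slope suggests a rising contour over the word and a negative slope a falling one.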
For example, if there is no silence between two words and the fundamental frequency is continuous, the two words can be judged to be in liaison; if the silence between two words lasts longer than a certain time threshold, such as 0.05 seconds, a pause can be judged to exist between the two words; and if the energy level of one or more words exceeds a certain energy threshold, the word or words are judged to be stressed. Similarly, the rising or falling intonation of a word can be judged from its fundamental frequency slope.
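These decision rules can be written down as a small rule set. In the sketch below only the 0.05 second pause threshold comes from the text; the energy threshold, the slope threshold and the function names are hypothetical placeholders that would need tuning on real data.

```python
PAUSE_THRESHOLD_S = 0.05   # example value given in the text
ENERGY_THRESHOLD = 0.6     # hypothetical; the text only says "a certain energy threshold"
SLOPE_THRESHOLD = 20.0     # hypothetical, in Hz per second


def classify_gap(silence_s, f0_continuous):
    """Label the gap between two adjacent words, or return None."""
    if silence_s == 0 and f0_continuous:
        return "liaison"
    if silence_s > PAUSE_THRESHOLD_S:
        return "pause"
    return None


def classify_word(energy, f0_slope):
    """Return the prosodic labels that apply to a single word."""
    labels = []
    if energy > ENERGY_THRESHOLD:
        labels.append("stress")
    if f0_slope > SLOPE_THRESHOLD:
        labels.append("rising")
    elif f0_slope < -SLOPE_THRESHOLD:
        labels.append("falling")
    return labels


print(classify_gap(0.0, True))
print(classify_gap(0.2, False))
print(classify_word(0.9, 50.0))
```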
S103, generate the prosodic labeling information of the speech data under test according to its word boundary information and prosodic information.
The prosodic labeling information comprises at least one piece of prosodic information and the position information corresponding to each piece, where each piece of position information is determined from the corresponding rhythm boundary information. Prosodic labeling information marks the positions of the rhythms in the text corresponding to the speech, that is, it marks between which two words in the text there is liaison or a pause, and which words are stressed. Prosodic labeling is an important basis for rhythm assessment.
In one embodiment of the present invention, generating the prosodic labeling information of the speech data under test according to its word boundary information and prosodic information may specifically comprise: determining the rhythm boundary information of the speech data under test according to its word boundary information and prosodic information; and annotating the prosodic information of the speech data under test according to its rhythm boundary information, so as to generate the prosodic labeling information of the speech data under test.
The rhythm boundary information can be determined from the word boundary information and the prosodic information of the corresponding words, the position information of each piece of prosodic information can then be determined, and the annotation is made according to that position information. For example, if words A and B are in liaison, the start and end time frames of the liaison are the time frame (or instant) at which the pronunciation of word A starts and the time frame (or instant) at which the pronunciation of word B ends, and the position information corresponding to the liaison can be determined to be the positions of word A and word B in the text. The corresponding prosodic information can then be annotated at the corresponding position according to the position information of each rhythm.
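The liaison example above can be sketched as a small labeling step. The sketch assumes that words are given as `(word, start, end)` tuples in text order; the `ProsodicLabel` type and both function names are hypothetical, and taking a stressed word's own start and end as its rhythm boundary is an assumption, since the text only spells out the liaison case.

```python
from typing import NamedTuple, Tuple


class ProsodicLabel(NamedTuple):
    kind: str                      # e.g. "liaison" or "stress"
    position: Tuple[int, ...]      # indices of the affected words in the text
    boundary: Tuple[float, float]  # rhythm boundary as (start, end) in seconds


def label_liaison(words, i):
    """Liaison between word i and word i + 1: from A's start to B's end."""
    (_, start, _), (_, _, end) = words[i], words[i + 1]
    return ProsodicLabel("liaison", (i, i + 1), (start, end))


def label_stress(words, i):
    """Stress on word i (assumed boundary: the word's own start and end)."""
    _, start, end = words[i]
    return ProsodicLabel("stress", (i,), (start, end))


words = [("not", 0.0, 0.25), ("at", 0.25, 0.5), ("all", 0.5, 1.0)]
print(label_liaison(words, 0))
print(label_stress(words, 2))
```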
S104, compare the prosodic labeling information of the speech data under test with the prosodic labeling information of the reference speech data annotated in advance, so as to detect whether the speech data under test has a pronunciation rhythm problem.
The reference speech is the standard speech that the speech under test reads after.
Specifically, in an embodiment of the present invention, it may be judged whether the prosodic labeling information of the speech data under test and that of the reference speech data annotated in advance satisfy the following conditions:
All the prosodic information annotated in the prosodic labeling information of the reference speech data is also annotated in the prosodic labeling information of the speech data under test, with consistent corresponding position information; and the prosodic information annotated in the prosodic labeling information of the speech data under test includes no prosodic information that is not annotated in the prosodic labeling information of the reference speech data.
If these conditions are not satisfied, the speech data under test is judged to have a pronunciation rhythm problem.
In other words, the speech data under test is judged to be free of pronunciation rhythm problems only when it contains all the rhythms of the reference speech data (with identical corresponding rhythm boundary information) and contains no rhythm that the reference speech data lacks. Otherwise, the speech data under test has a pronunciation rhythm problem.
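Taken together, the two conditions amount to requiring that the two sets of labels be equal. A minimal sketch, assuming each label is reduced to a hypothetical `(kind, position)` pair in which the position is a tuple of word indices in the text:

```python
def has_rhythm_problem(test_labels, reference_labels):
    """True unless the test labels and the reference labels match exactly.

    Set equality covers both conditions: nothing from the reference is
    missing, and nothing is present that the reference lacks.
    """
    return set(test_labels) != set(reference_labels)


reference = [("liaison", (0, 1)), ("stress", (2,))]
print(has_rhythm_problem([("stress", (2,)), ("liaison", (0, 1))], reference))
print(has_rhythm_problem([("stress", (2,))], reference))
```

Comparing positions in the text rather than raw time stamps is a choice made for this sketch, because a learner and a reference speaker will rarely produce identical timings.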
Further, in one embodiment of the present invention, when the speech data under test is judged to have a rhythm problem, prompt information about the pronunciation rhythm problem is generated from the comparison result and presented to the user. Specifically, the rhythms in which the speech data under test differs from the reference speech data (which may include missing rhythms and extra rhythms) can be determined from the comparison result, and the user is prompted about them. The user thus receives timely prompts and feedback on pronunciation rhythm problems, which makes it easier to improve and enhances the user experience.
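The missing and extra rhythms can be obtained as two set differences, from which prompt messages are built. The sketch below assumes labels are hypothetical `(kind, position)` pairs with positions given as tuples of word indices; the function name and the message wording are illustrative only.

```python
def rhythm_feedback(test_labels, reference_labels, words):
    """Build one prompt per rhythm that differs from the reference."""
    test, ref = set(test_labels), set(reference_labels)
    messages = []
    # Rhythms in the reference that the user did not produce.
    for kind, position in sorted(ref - test):
        where = " ".join(words[i] for i in position)
        messages.append(f"missing {kind} at '{where}'")
    # Rhythms the user produced that the reference does not contain.
    for kind, position in sorted(test - ref):
        where = " ".join(words[i] for i in position)
        messages.append(f"unexpected {kind} at '{where}'")
    return messages


words = ["not", "at", "all"]
reference = [("liaison", (0, 1)), ("stress", (2,))]
attempt = [("stress", (2,)), ("pause", (1, 2))]
for message in rhythm_feedback(attempt, reference, words):
    print(message)
```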
The method for detecting a pronunciation rhythm problem according to the embodiment of the present invention obtains the word boundary information of the speech data under test and extracts its prosodic information, generates the prosodic labeling information of the speech data under test from them, and compares it with the prosodic labeling information of reference speech data annotated in advance to detect rhythm problems. The prosodic labeling information of speech is thus obtained and compared automatically, without manual annotation, so the method can be applied more flexibly and widely. In language learning software in particular, automatically detecting the rhythm of speech allows the rhythm problems of a user's pronunciation to be assessed more effectively. In addition, the detection process does not require a large-capacity database and involves little computation, which improves detection efficiency.
An embodiment of the present invention may further include a step of annotating the reference speech data to obtain its prosodic labeling information. Specifically, as shown in Fig. 2, the method for annotating the reference speech data may comprise the following steps:
S201, decode the reference speech data, and obtain the word boundary information of the reference speech data from the decoding result.
In one embodiment of the present invention, a decoding network can be built from the text content corresponding to the reference speech data and passed, together with an acoustic model, to a decoder; the acoustic features of the reference speech data are then extracted and passed to the decoder for decoding, which aligns the reference speech data with the corresponding text content. The word boundary information of the reference speech data can be obtained from the alignment result.
S202, extract the prosodic information of the reference speech data.
Specifically, the prosodic features of the reference speech data can be extracted by judging whether there is silence between its words and whether the fundamental frequency is continuous, performing stress judgment on the reference speech data, and obtaining the silence duration, the energy level, the fundamental frequency slope and so on. Based on these prosodic features, the prosodic information in the reference speech data, such as liaison, pauses, stress, and rising and falling intonation, can then be determined with the corresponding decision strategies.
S203, determine the rhythm boundary information of the reference speech data according to the prosodic information and the word boundary information.
For example, if words A and B are in liaison, the start and end time frames of the liaison are the time frame (or instant) at which the pronunciation of word A starts and the time frame (or instant) at which the pronunciation of word B ends. The corresponding prosodic information can then be annotated at the corresponding position according to each piece of rhythm boundary information.
S204, annotate the reference speech data according to the rhythm boundary information.
In this way, the prosodic information of the reference speech data can be detected and annotated automatically, which avoids the tedium and errors of manual annotation. Once annotated, the reference speech data can be reused later, which makes detection convenient and accurate.
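The annotate-once, reuse-later behaviour could be sketched as a simple in-memory store. The class name, the `annotate` callable and the reference identifier are all assumptions made for illustration; the source does not say how the annotations are stored.

```python
class ReferenceLabelStore:
    """Annotate each reference utterance once and reuse the result."""

    def __init__(self, annotate):
        # annotate is a hypothetical callable: (audio, text) -> list of labels,
        # standing in for steps S201 to S204.
        self._annotate = annotate
        self._labels = {}

    def get(self, ref_id, audio, text):
        if ref_id not in self._labels:
            self._labels[ref_id] = self._annotate(audio, text)
        return self._labels[ref_id]


calls = []


def fake_annotate(audio, text):
    """Stand-in annotator that records how often it is invoked."""
    calls.append(text)
    return [("stress", (0,))]


store = ReferenceLabelStore(fake_annotate)
first = store.get("lesson-1", b"", "hello world")
second = store.get("lesson-1", b"", "hello world")
print(len(calls))
```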
To implement the above embodiments, the present invention also proposes a device for detecting a pronunciation rhythm problem.
Fig. 3 is a schematic structural diagram of a device for detecting a pronunciation rhythm problem according to an embodiment of the present invention.
As shown in Fig. 3, the device for detecting a pronunciation rhythm problem according to the embodiment of the present invention comprises a receiving module 10, an acquisition module 20, a generation module 30 and a detection module 40.
Specifically, the receiving module 10 is configured to receive speech data under test. For example, the speech data under test may be follow-reading speech that a user records after a standard reference speech.
The acquisition module 20 is configured to obtain the word boundary information of the speech data under test, and to extract the prosodic information of the speech data under test.
More specifically, in one embodiment of the present invention, the acquisition module 20 may first obtain the text content corresponding to the speech data under test (for example, the text being read after in follow-reading speech), build a decoding network from this text content, and then pass the decoding network and an acoustic model to a decoder. The acoustic model is the underlying mathematical model of speech recognition; its modeling unit may be a phoneme, a syllable or a word, and the current mainstream approach is hidden Markov modeling. The decoder is one of the core components of a speech recognition system; its task is, given input acoustic features, to find the most probable sequence of language units for those features according to the acoustic model and the decoding network. The decoding network, also called a grammar network, is a directed graph whose nodes are the phonemes (for example, the initials and finals of Chinese characters), syllables or words in the above text content and whose arcs are the connections between them; it defines the range of language-unit sequences that the decoder can output.
Then, the acquisition module 20 extracts the acoustic features of the speech data under test and passes them to the decoder for decoding, which aligns the speech data under test with the corresponding text content. The word boundary information of the speech data under test can be obtained from the alignment result. An acoustic feature is a set of values describing the basic short-time characteristics of speech, usually a feature vector of fixed dimension (such as a 39-dimensional MFCC (Mel-frequency cepstral coefficient) feature vector). Word boundary information refers to the time frame (or instant) at which the pronunciation of each word in the speech under test starts and the time frame (or instant) at which it ends. From the word boundary information, the time taken to read each word in the speech data under test, and the time between words, can therefore be obtained.
Finally, the acquisition module 20 can extract the prosodic information of the speech data under test according to its word boundary information. The rhythm of speech mainly includes liaison, sense-group pauses, stress, and rising and falling intonation. Different prosodic features are extracted to detect different rhythms. For example, when the acquisition module 20 judges liaison, the extracted prosodic features include whether there is silence between two words, whether the fundamental frequency is continuous, and whether the energy shows a dip; to judge pauses, features such as the duration of silence between words are extracted; to judge stress, features such as the energy level and fundamental frequency of a word are extracted; and to judge rising or falling intonation, features such as the fundamental frequency slope of a word are extracted. According to the word boundary information, these prosodic features can then be computed in turn for each word and for each gap between words, and the prosodic information of the speech under test, such as the stress and rising or falling intonation of each word and the liaison and pauses between words, can be determined with the corresponding decision strategies.
For example, if there is no silence between two words and the fundamental frequency is continuous, the two words can be judged to be in liaison; if the silence between two words lasts longer than a certain time threshold, such as 0.05 seconds, a pause can be judged to exist between the two words; and if the energy level of one or more words exceeds a certain energy threshold, the word or words are judged to be stressed. Similarly, the rising or falling intonation of a word can be judged from its fundamental frequency slope.
The generation module 30 is configured to generate the prosodic labeling information of the speech data under test according to its word boundary information and prosodic information. The prosodic labeling information comprises at least one piece of prosodic information and the position information corresponding to each piece, where each piece of position information is determined from the corresponding rhythm boundary information. Prosodic labeling information marks the positions of the rhythms in the text corresponding to the speech, that is, it marks between which two words in the text there is liaison or a pause, and which words are stressed. Prosodic labeling is an important basis for rhythm assessment.
In one embodiment of the present invention, the generation module 30 is specifically configured to: determine the rhythm boundary information of the speech data under test according to its word boundary information and prosodic information; and annotate the prosodic information of the speech data under test according to its rhythm boundary information, so as to generate the prosodic labeling information of the speech data under test.
The rhythm boundary information can be determined from the word boundary information and the prosodic information of the corresponding words, the position information of each piece of prosodic information can then be determined, and the annotation is made according to that position information. For example, if words A and B are in liaison, the start and end time frames of the liaison are the time frame (or instant) at which the pronunciation of word A starts and the time frame (or instant) at which the pronunciation of word B ends, and the position information corresponding to the liaison can be determined to be the positions of word A and word B in the text. The corresponding prosodic information can then be annotated at the corresponding position according to the position information of each rhythm.
Detection module 40 is configured to compare the prosodic labeling information of the speech data to be measured with the prosodic labeling information of reference voice data labeled in advance, so as to detect whether the speech data to be measured has a pronunciation rhythm problem. Here, the reference voice is a standard recording that reads the same text as the speech to be measured.
In an embodiment of the present invention, detection module 40 is specifically configured to judge whether the prosodic labeling information of the speech data to be measured and the prosodic labeling information of the reference voice data labeled in advance meet the following conditions: all the prosodic information marked in the prosodic labeling information of the reference voice data is also marked in the prosodic labeling information of the speech data to be measured, with consistent positional information for each marked piece of prosodic information; and the prosodic information marked in the prosodic labeling information of the speech data to be measured does not include any prosodic information that is not marked in the prosodic labeling information of the reference voice data. If the conditions are not met, the speech data to be measured is judged to have a pronunciation rhythm problem.
That is, the speech data to be measured is judged to be free of pronunciation rhythm problems only when it contains all the prosody of the reference voice data (with identical rhythm boundary information) and contains no prosody that the reference voice data lacks. Otherwise, the speech data to be measured has a pronunciation rhythm problem.
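The two conditions above amount to requiring that the two label sets be identical. A minimal sketch, assuming each label is reduced to a hashable `(kind, position)` pair (a representation chosen here for illustration, not specified by the source):

```python
def has_rhythm_problem(test_labels, ref_labels):
    """Return True if the test labels differ from the reference labels.

    Each label is assumed to be a (kind, position) tuple, for example
    ("liaison", (0, 1)). Both conditions from the text are checked:
    nothing from the reference may be missing, and nothing extra may appear.
    """
    test_set, ref_set = set(test_labels), set(ref_labels)
    missing = ref_set - test_set  # marked in the reference, absent from the test
    extra = test_set - ref_set    # marked in the test, absent from the reference
    return bool(missing or extra)


# Made-up labels for illustration.
reference = [("liaison", (0, 1)), ("stress", (2,))]
same = [("stress", (2,)), ("liaison", (0, 1))]
no_liaison = [("stress", (2,))]
added_pause = reference + [("pause", (1, 2))]
```

With these sample inputs, `has_rhythm_problem(same, reference)` is `False`, while both `no_liaison` and `added_pause` yield `True`.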
The device for detecting pronunciation rhythm problems according to the embodiment of the present invention obtains the word boundary information of the speech data to be measured and extracts its prosodic information, generates the prosodic labeling information of the speech data to be measured accordingly, and compares it with the prosodic labeling information of reference voice data labeled in advance to detect rhythm problems. The prosodic labeling information of the speech is thus obtained and compared automatically, without manual labeling, which makes the approach more flexible and more widely applicable. In language learning software in particular, automatically detecting the prosody of speech allows the rhythm problems in a user's pronunciation to be assessed more effectively. In addition, the detection process does not require a large-capacity database and involves little computation, which improves detection efficiency.
Fig. 4 is a structural schematic diagram of a device for detecting pronunciation rhythm problems according to a specific embodiment of the present invention.
As shown in Fig. 4, the device for detecting pronunciation rhythm problems according to the embodiment of the present invention comprises: receiver module 10, acquisition module 20, generation module 30, detection module 40, and labeling module 50.
Specifically, labeling module 50 is configured to label the reference voice data to obtain the prosodic labeling information of the reference voice data.
In one embodiment of the invention, labeling module 50 can be specifically configured to: decode the reference voice data and obtain the word boundary information of the reference voice data according to the decoding result; extract the prosodic information of the reference voice data; determine the rhythm boundary information of the reference voice data according to the prosodic information and the word boundary information; and label the reference voice data according to the rhythm boundary information.
More specifically, labeling module 50 can build a decoding network from the text content corresponding to the reference voice data and pass the decoding network and an acoustic model to a decoder, then extract the acoustic features of the reference voice data and pass them to the decoder for decoding, so that the reference voice data is aligned with the corresponding text content. The word boundary information of the reference voice data can be obtained from the alignment result.
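The last step, reading word boundaries off the alignment result, can be sketched as follows. The source does not specify the decoder's output format, so this assumes a hypothetical frame-level alignment in which each frame carries the index of the text word it was aligned to, or `None` for silence.

```python
from itertools import groupby


def word_boundaries(frame_alignment):
    """Convert a frame-level alignment into per-word frame boundaries.

    `frame_alignment` is an assumed format: one entry per frame, holding the
    index of the aligned word in the text, or None for silence frames.
    Returns (word_index, start_frame, end_frame) tuples, end exclusive.
    """
    boundaries = []
    frame = 0
    for word_index, run in groupby(frame_alignment):
        length = sum(1 for _ in run)  # number of consecutive frames in this run
        if word_index is not None:
            boundaries.append((word_index, frame, frame + length))
        frame += length
    return boundaries


# Made-up alignment: word 0 spans 3 frames, 2 silence frames, word 1 spans 2 frames.
alignment = [0, 0, 0, None, None, 1, 1]
bounds = word_boundaries(alignment)
```

Using word indices rather than word strings keeps a repeated word in the text from being merged into a single run. Frame numbers can be converted to times by multiplying by the frame shift of the acoustic front end.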
Then, labeling module 50 can judge whether there is silence between the words of the reference voice data and whether the fundamental frequency is continuous, perform stress detection on the reference voice data, and obtain the silence duration, the energy level, the slope of the fundamental frequency, and so on, so as to extract the prosodic features of the reference voice data. Further, based on these prosodic features, prosodic information such as liaison, pause, stress, and rising or falling intonation in the reference voice data can be determined according to corresponding decision strategies.
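The source names the features but not the decision strategies, so the following is only one possible sketch. The rule that zero silence plus continuous fundamental frequency means liaison, the 20-frame pause threshold, and the function names are all assumptions made for illustration.

```python
def classify_word_gap(silence_frames, f0_continuous, pause_min_frames=20):
    """Classify the boundary between two adjacent words.

    Assumed rule: a long silence is a pause; no silence with a continuous
    fundamental frequency is a liaison; anything else is left unmarked.
    The default threshold of 20 frames is a hypothetical value.
    """
    if silence_frames >= pause_min_frames:
        return "pause"
    if silence_frames == 0 and f0_continuous:
        return "liaison"
    return None


def f0_slope(f0_values):
    """Least-squares slope of fundamental frequency over frame index.

    A positive slope suggests rising intonation and a negative slope falling
    intonation. Requires at least two values.
    """
    n = len(f0_values)
    mean_x = (n - 1) / 2
    mean_y = sum(f0_values) / n
    numerator = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(f0_values))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator


# Made-up feature values for illustration.
gap_kind = classify_word_gap(silence_frames=0, f0_continuous=True)
slope = f0_slope([100.0, 110.0, 120.0])
```

In practice the thresholds would need tuning on real recordings, and stress detection would also draw on the energy level mentioned in the text.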
Thus, the prosodic information of the reference voice data can be detected and labeled automatically, which avoids the tedium and errors of manual labeling. Once labeled, the reference data can be reused in later detection, which is convenient and accurate.
Fig. 5 is a structural schematic diagram of a device for detecting pronunciation rhythm problems according to another embodiment of the present invention.
As shown in Fig. 5, the device for detecting pronunciation rhythm problems according to the embodiment of the present invention comprises: receiver module 10, acquisition module 20, generation module 30, detection module 40, labeling module 50, and reminding module 60.
Specifically, reminding module 60 is configured to generate pronunciation rhythm problem prompt information according to the comparison result and to prompt the user when it is judged that the speech data to be measured has a rhythm problem. More specifically, reminding module 60 can determine, according to the comparison result, the prosody in the speech data to be measured that differs from the reference voice data (which can include missing prosody or extra prosody), and prompt the user about the differing prosody.
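A prompt of this kind can be derived directly from the set difference between the two label sets. The sketch below assumes labels are `(kind, position)` tuples and uses made-up message wording; neither is specified by the source.

```python
def rhythm_feedback(test_labels, ref_labels):
    """Build user-facing messages for missing and extra prosody.

    Labels are assumed to be (kind, position) tuples, where position is a
    tuple of word indices in the text. The message wording is hypothetical.
    """
    test_set, ref_set = set(test_labels), set(ref_labels)
    messages = []
    for kind, position in sorted(ref_set - test_set):
        messages.append(f"Missing {kind} at word position {position}")
    for kind, position in sorted(test_set - ref_set):
        messages.append(f"Unexpected {kind} at word position {position}")
    return messages


# Made-up labels: the learner skipped the liaison and inserted a pause instead.
reference = [("liaison", (0, 1)), ("stress", (2,))]
learner = [("pause", (0, 1)), ("stress", (2,))]
feedback = rhythm_feedback(learner, reference)
```

An empty list means the two label sets match, so no prompt is needed.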
Thus, the device for detecting pronunciation rhythm problems according to the embodiment of the present invention can give the user timely prompts and feedback on pronunciation rhythm problems, which helps the user improve and enhances the user experience.
Any process or method described in a flowchart or otherwise described herein can be understood as representing a module, segment, or portion of code that includes one or more executable instructions for implementing the steps of a specific logical function or process. The scope of the preferred embodiments of the present invention includes other implementations in which functions are performed out of the order shown or discussed, including in a substantially simultaneous manner or in the reverse order, depending on the functions involved. This should be understood by those skilled in the art to which the embodiments of the present invention belong.
The logic and/or steps represented in a flowchart or otherwise described herein can, for example, be regarded as an ordered list of executable instructions for implementing logical functions, and can be embodied in any computer-readable medium for use by, or in conjunction with, an instruction execution system, device, or apparatus (such as a computer-based system, a system including a processor, or another system that can fetch instructions from an instruction execution system, device, or apparatus and execute them). For the purposes of this specification, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transmit a program for use by, or in conjunction with, an instruction execution system, device, or apparatus. More specific examples (a non-exhaustive list) of computer-readable media include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and portable compact disc read-only memory (CD-ROM). In addition, the computer-readable medium can even be paper or another suitable medium on which the program is printed, because the program can be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting, or otherwise processing it in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that each part of the present invention can be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods can be implemented in software or firmware that is stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, they can be implemented by any one or a combination of the following techniques well known in the art: a discrete logic circuit having logic gates for implementing logic functions on data signals, an application-specific integrated circuit having suitable combinational logic gates, a programmable gate array (PGA), a field programmable gate array (FPGA), and so on.
Those of ordinary skill in the art will understand that all or part of the steps carried by the methods of the above embodiments can be completed by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, performs one of the steps of the method embodiments or a combination thereof.
In addition, the functional units in the embodiments of the present invention can be integrated into one processing module, each unit can exist physically on its own, or two or more units can be integrated into one module. The integrated module can be implemented in the form of hardware or in the form of a software functional module. If the integrated module is implemented in the form of a software functional module and sold or used as an independent product, it can also be stored in a computer-readable storage medium.
The storage medium mentioned above can be a read-only memory, a magnetic disk, an optical disc, or the like.
In the description of this specification, reference to the terms "an embodiment", "some embodiments", "an example", "a specific example", or "some examples" means that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic uses of these terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described can be combined in a suitable manner in any one or more embodiments or examples.
Although embodiments of the present invention have been shown and described, those of ordinary skill in the art will understand that various changes, modifications, substitutions, and variations can be made to these embodiments without departing from the principle and spirit of the present invention. The scope of the present invention is defined by the claims and their equivalents.

Claims (14)

1. A method for detecting a pronunciation rhythm problem, characterized by comprising:
Receiving speech data to be measured;
Obtaining word boundary information of the speech data to be measured, and extracting prosodic information of the speech data to be measured;
Generating prosodic labeling information of the speech data to be measured according to the word boundary information and the prosodic information of the speech data to be measured;
Comparing the prosodic labeling information of the speech data to be measured with prosodic labeling information of reference voice data labeled in advance, so as to detect whether the speech data to be measured has a pronunciation rhythm problem.
2. The method for detecting a pronunciation rhythm problem as claimed in claim 1, characterized by further comprising:
Labeling the reference voice data to obtain the prosodic labeling information of the reference voice data.
3. The method for detecting a pronunciation rhythm problem as claimed in claim 2, characterized in that labeling the reference voice data specifically comprises:
Decoding the reference voice data, and obtaining word boundary information of the reference voice data according to the decoding result;
Extracting prosodic information of the reference voice data;
Determining rhythm boundary information of the reference voice data according to the prosodic information and the word boundary information;
Labeling the reference voice data according to the rhythm boundary information.
4. The method for detecting a pronunciation rhythm problem as claimed in claim 1, characterized in that generating the prosodic labeling information of the speech data to be measured according to the word boundary information and the prosodic information of the speech data to be measured specifically comprises:
Determining rhythm boundary information of the speech data to be measured according to the word boundary information and the prosodic information of the speech data to be measured;
Labeling the prosodic information of the speech data to be measured according to the rhythm boundary information of the speech data to be measured, so as to generate the prosodic labeling information of the speech data to be measured.
5. The method for detecting a pronunciation rhythm problem as claimed in any one of claims 1-4, characterized in that the prosodic labeling information comprises at least one piece of prosodic information and positional information corresponding to each of the at least one piece of prosodic information, wherein each piece of positional information is determined according to the corresponding rhythm boundary information.
6. The method for detecting a pronunciation rhythm problem as claimed in claim 5, characterized in that comparing the prosodic labeling information of the speech data to be measured with the prosodic labeling information of the reference voice data labeled in advance specifically comprises:
Judging whether the prosodic labeling information of the speech data to be measured and the prosodic labeling information of the reference voice data labeled in advance meet the following conditions:
All the prosodic information marked in the prosodic labeling information of the reference voice data is also marked in the prosodic labeling information of the speech data to be measured, and the positional information corresponding to the marked prosodic information is consistent;
And the prosodic information marked in the prosodic labeling information of the speech data to be measured does not include prosodic information that is not marked in the prosodic labeling information of the reference voice data;
If the conditions are not met, judging that the speech data to be measured has a pronunciation rhythm problem.
7. The method for detecting a pronunciation rhythm problem as claimed in claim 1, characterized by further comprising:
When it is judged that the speech data to be measured has a rhythm problem, generating pronunciation rhythm problem prompt information according to the comparison result, and prompting the user.
8. A device for detecting a pronunciation rhythm problem, characterized by comprising:
A receiver module, configured to receive speech data to be measured;
An acquisition module, configured to obtain word boundary information of the speech data to be measured, and to extract prosodic information of the speech data to be measured;
A generation module, configured to generate prosodic labeling information of the speech data to be measured according to the word boundary information and the prosodic information of the speech data to be measured;
A detection module, configured to compare the prosodic labeling information of the speech data to be measured with prosodic labeling information of reference voice data labeled in advance, so as to detect whether the speech data to be measured has a pronunciation rhythm problem.
9. The device for detecting a pronunciation rhythm problem as claimed in claim 8, characterized by further comprising:
A labeling module, configured to label the reference voice data to obtain the prosodic labeling information of the reference voice data.
10. The device for detecting a pronunciation rhythm problem as claimed in claim 9, characterized in that the labeling module is specifically configured to:
Decode the reference voice data, and obtain word boundary information of the reference voice data according to the decoding result;
Extract prosodic information of the reference voice data;
Determine rhythm boundary information of the reference voice data according to the prosodic information and the word boundary information;
Label the reference voice data according to the rhythm boundary information.
11. The device for detecting a pronunciation rhythm problem as claimed in claim 8, characterized in that the generation module is specifically configured to:
Determine rhythm boundary information of the speech data to be measured according to the word boundary information and the prosodic information of the speech data to be measured;
Label the prosodic information of the speech data to be measured according to the rhythm boundary information of the speech data to be measured, so as to generate the prosodic labeling information of the speech data to be measured.
12. The device for detecting a pronunciation rhythm problem as claimed in any one of claims 8-11, characterized in that the prosodic labeling information comprises at least one piece of prosodic information and positional information corresponding to each of the at least one piece of prosodic information, wherein each piece of positional information is determined according to the corresponding rhythm boundary information.
13. The device for detecting a pronunciation rhythm problem as claimed in claim 12, characterized in that the detection module is specifically configured to:
Judge whether the prosodic labeling information of the speech data to be measured and the prosodic labeling information of the reference voice data labeled in advance meet the following conditions:
All the prosodic information marked in the prosodic labeling information of the reference voice data is also marked in the prosodic labeling information of the speech data to be measured, and the positional information corresponding to the marked prosodic information is consistent;
And the prosodic information marked in the prosodic labeling information of the speech data to be measured does not include prosodic information that is not marked in the prosodic labeling information of the reference voice data.
14. The device for detecting a pronunciation rhythm problem as claimed in claim 8, characterized by further comprising:
A reminding module, configured to generate pronunciation rhythm problem prompt information according to the comparison result and to prompt the user when it is judged that the speech data to be measured has a rhythm problem.
CN201410674294.6A 2014-11-21 2014-11-21 Method and device for detecting pronunciation rhythm problem Active CN104464751B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410674294.6A CN104464751B (en) 2014-11-21 2014-11-21 The detection method and device for rhythm problem of pronouncing

Publications (2)

Publication Number Publication Date
CN104464751A true CN104464751A (en) 2015-03-25
CN104464751B CN104464751B (en) 2018-01-16

Family

ID=52910695

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410674294.6A Active CN104464751B (en) 2014-11-21 2014-11-21 Method and device for detecting pronunciation rhythm problem

Country Status (1)

Country Link
CN (1) CN104464751B (en)

Also Published As

Publication number Publication date
CN104464751B (en) 2018-01-16

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant