CN102237088B - Device and method for acquiring speech recognition multi-information text - Google Patents

Device and method for acquiring speech recognition multi-information text

Info

Publication number
CN102237088B
CN102237088B
Authority
CN
China
Prior art keywords
information
text
speech recognition
intensity
time
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN2011101651010A
Other languages
Chinese (zh)
Other versions
CN102237088A (en)
Inventor
张峰
黄伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SHANGHAI GEAK ELECTRONICS Co.,Ltd.
Original Assignee
Shengle Information Technology Shanghai Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shengle Information Technology Shanghai Co Ltd
Priority to CN2011101651010A
Publication of CN102237088A
Application granted
Publication of CN102237088B
Legal status: Active
Anticipated expiration

Abstract

The invention provides a device and a method for acquiring a speech recognition multi-information text. After speech audio is converted into plain text by speech recognition, the individual character pronunciation speed, pronunciation intensity, and intonation in the speech audio are integrated, in a given presentation style, into the initially generated plain text to produce multi-information text. The device and method can be widely used on information publishing platforms such as microblogs, short messages, and signature files.

Description

Speech recognition multi-information text acquisition device and method
Technical field
The present invention relates to speech recognition technology in the computer field, and in particular to a speech recognition multi-information text acquisition device and method.
Background technology
Over the past two decades, speech recognition technology has made marked progress and has been applied more and more widely. In the coming ten years, speech recognition technology is expected to enter fields such as industry, household appliances, communications, automotive electronics, medical care, home services, and consumer electronics.
Speech recognition refers to the automatic understanding of human speech by a computer or machine. For example, with speech recognition a computer or machine can act on a person's voice commands, or convert a person's speech into text. The main approach in speech recognition is to extract physical features of the uttered speech, such as its spectrum, and compare them with pre-stored physical feature models of vowels, consonants, or words, finally obtaining textual information that matches the content of the speech. In the prior art, however, the text obtained by speech recognition is usually only plain text. Plain text here refers to text with a uniform character format and size and no special symbols other than punctuation; every mention of plain text in this specification carries this meaning. As a result, much valuable information in the speech, such as the speaker's speech rate, stress, and intonation, cannot be expressed in the plain text produced by speech recognition.
Summary of the invention
The technical problem to be solved by the present invention is to provide a speech recognition multi-information text acquisition device and method, so as to address the problem in the prior art that the text obtained by speech recognition is usually only plain text, and that much valuable information in the speech cannot be expressed in the recognized text.
To solve the above technical problem, the present invention provides a speech recognition multi-information text acquisition device, comprising:
a plain text and individual character time generation module, configured to convert speech audio into plain text by speech recognition, obtain the individual character time in the speech audio at the same time, and determine the individual character speech rate from the length of the individual character time;
a multi-information text generation module, configured to generate multi-information text from the plain text.
Optionally, the device further comprises an individual character intensity calculation module, configured to calculate the individual character pronunciation intensity from the individual character time.
Optionally, the multi-information text generation module is configured to integrate the individual character speech rate and/or the individual character pronunciation intensity into the plain text to generate the multi-information text.
Optionally, the device further comprises an individual character intonation calculation module, configured to calculate the individual character intonation from the individual character time.
Optionally, the multi-information text generation module is configured to integrate the individual character speech rate and/or pronunciation intensity and/or intonation into the plain text to generate the multi-information text.
The present invention also provides a speech recognition multi-information text acquisition method, comprising the following steps:
Step 1: converting speech audio into plain text by speech recognition, obtaining the individual character time in the speech audio at the same time, and then determining the individual character speech rate from the length of the individual character time;
Step 2: generating multi-information text from the plain text.
Optionally, in step 2, the individual character speech rate is integrated into the plain text to generate the multi-information text.
Optionally, between step 1 and step 2, the method further comprises calculating the individual character pronunciation intensity and/or intonation from the individual character time.
Optionally, in step 2, the individual character speech rate and/or pronunciation intensity and/or intonation are integrated into the plain text to generate the multi-information text.
Optionally, the individual character intonation is calculated from the individual character time by a fundamental frequency extraction technique.
Optionally, the individual character pronunciation intensity is obtained by calculating the average pronunciation intensity within the individual character time.
With the speech recognition multi-information text acquisition device and method of the present invention, after speech audio is converted into plain text by speech recognition, the individual character speech rate, pronunciation intensity, and intonation in the speech audio are further integrated, in a given presentation style, into the initially generated plain text to produce multi-information text. The device and method can be widely used on information publishing platforms such as microblogs, short messages, and signature files.
Description of drawings
Fig. 1 is a schematic structural diagram of one embodiment of the speech recognition multi-information text acquisition device of the present invention;
Fig. 2 is a schematic structural diagram of another embodiment of the speech recognition multi-information text acquisition device of the present invention;
Fig. 3 is a schematic flow chart of one embodiment of the speech recognition multi-information text acquisition method of the present invention;
Fig. 4 is a schematic flow chart of another embodiment of the speech recognition multi-information text acquisition method of the present invention;
Fig. 5 is a schematic diagram of one kind of multi-information text according to the present invention;
Fig. 6 is a schematic diagram of another kind of multi-information text according to the present invention.
Detailed description of the embodiments
To make the above objects, features, and advantages of the present invention clearer, specific embodiments of the present invention are described in detail below.
The multi-information text representation of the present invention can be realized in many alternative ways; the following description uses preferred embodiments by way of illustration. The present invention is of course not limited to these specific embodiments, and common substitutions known to those of ordinary skill in the art are undoubtedly covered by the protection scope of the present invention.
The present invention provides a speech recognition multi-information text acquisition device.
Embodiment one
Referring to Fig. 1, which is a schematic structural diagram of one embodiment of the speech recognition multi-information text acquisition device of the present invention. As shown in Fig. 1, the device comprises:
A plain text and individual character time generation module, configured to convert speech audio into plain text by speech recognition and, at the same time, obtain the individual character time in the speech audio, i.e. the start time and end time of each character, and then determine the individual character speech rate from the length of the individual character time. The individual character time is obtained automatically during the speech recognition process, when the speech audio is converted into plain text.
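The individual character times can be kept as simple (character, start time, end time) records, from which the speech rate follows directly from each character's duration. The following is a minimal sketch in Python, assuming such an alignment is already available from the recognizer; the CharTiming type and the example timings are illustrative assumptions, not part of the patent:

```python
from dataclasses import dataclass

@dataclass
class CharTiming:
    char: str        # the recognized character
    start: float     # start time in seconds
    end: float       # end time in seconds

    @property
    def duration(self) -> float:
        # length of the individual character time
        return self.end - self.start

def speech_rate(t: CharTiming) -> float:
    """Characters per second for this character; lower values mean slower speech."""
    return 1.0 / t.duration if t.duration > 0 else float("inf")

# Hypothetical alignment for two characters of an utterance
timings = [CharTiming("S", 0.00, 0.40), CharTiming("o", 0.40, 0.55)]
for t in timings:
    print(t.char, round(t.duration, 2), "s ->", round(speech_rate(t), 1), "chars/s")
```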
A multi-information text generation module, configured to integrate the individual character speech rate into the plain text to generate the multi-information text.
According to the obtained individual character speech rate, the speech rate is represented by changing the character spacing or character width in the plain text, by adding symbols, or by a combination of these methods; a minimal code sketch of this mapping follows the examples below.
For example, the plain text generated by the speech recognition module is: So cool, I won a mobile phone in the lucky draw.
Representing the speech rate by changing the character spacing of the plain text yields multi-information text in which slowly pronounced characters are given wider spacing.
Representing the speech rate by changing the character width of the plain text yields multi-information text in which slowly pronounced characters are rendered wider.
Representing the speech rate by adding symbols to the plain text yields multi-information text such as: So~~ cool, I won~ a mobile phone~~ in the lucky draw.
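As noted above, the mapping from per-character duration to wider spacing or inserted symbols can be sketched as follows. This builds on the CharTiming records from the earlier sketch; the 0.3 s threshold is an arbitrary example value, not a figure taken from the patent:

```python
def render_rate_with_spacing(timings, slow_threshold=0.3):
    """Insert extra spaces after slowly pronounced characters."""
    out = []
    for t in timings:
        out.append(t.char)
        if t.duration > slow_threshold:
            out.append("  ")  # wider spacing marks a slow character
    return "".join(out)

def render_rate_with_symbols(timings, slow_threshold=0.3):
    """Append tildes after slowly pronounced characters, e.g. 'So~~ cool'."""
    out = []
    for t in timings:
        out.append(t.char)
        if t.duration > slow_threshold:
            out.append("~~")
    return "".join(out)
```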
Embodiment two
Referring to Fig. 2, which is a schematic structural diagram of another embodiment of the speech recognition multi-information text acquisition device of the present invention. As shown in Fig. 2, the device comprises:
A plain text and individual character time generation module, configured to convert speech audio into plain text by speech recognition and, at the same time, obtain the individual character time in the speech audio, i.e. the start time and end time of each character, and then determine the individual character speech rate from the length of the individual character time. The individual character time is obtained automatically during the speech recognition process, when the speech audio is converted into plain text.
An individual character intensity calculation module, configured to calculate the individual character pronunciation intensity from the obtained individual character time. Using the obtained individual character time, the average pronunciation intensity within that character's time period is calculated, giving the pronunciation intensity of each character.
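A minimal sketch of this computation, assuming the audio is available as a NumPy array of samples with a known sample rate; the function name and the use of mean absolute amplitude as the intensity measure are illustrative assumptions:

```python
import numpy as np

def char_intensity(samples: np.ndarray, sample_rate: int,
                   start: float, end: float) -> float:
    """Average pronunciation intensity within one character's time window."""
    i0 = int(start * sample_rate)
    i1 = int(end * sample_rate)
    segment = samples[i0:i1]
    if segment.size == 0:
        return 0.0
    # mean absolute amplitude over the character's time period
    return float(np.mean(np.abs(segment)))
```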
An individual character intonation calculation module, configured to calculate the individual character intonation from the obtained individual character time. The individual character intonation is obtained by a fundamental frequency extraction technique, where the fundamental frequency refers to the vibration frequency of the vocal cords when voiced sounds are produced. Various fundamental frequency extraction algorithms exist in the prior art, chiefly time-domain autocorrelation methods and frequency-domain cepstrum methods.
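The time-domain autocorrelation method mentioned above can be sketched roughly as follows. This is a simplified illustration only: practical pitch trackers add windowing, voicing decisions, and smoothing, and the 50–400 Hz search range is an assumption typical for speech rather than a value given in the patent:

```python
import numpy as np

def estimate_f0(segment: np.ndarray, sample_rate: int,
                fmin: float = 50.0, fmax: float = 400.0) -> float:
    """Estimate the fundamental frequency of one character's segment by autocorrelation."""
    segment = segment - np.mean(segment)
    # autocorrelation for non-negative lags
    corr = np.correlate(segment, segment, mode="full")[len(segment) - 1:]
    lag_min = int(sample_rate / fmax)                 # shortest period considered
    lag_max = min(int(sample_rate / fmin), len(corr) - 1)
    if lag_max <= lag_min:
        return 0.0
    best_lag = lag_min + int(np.argmax(corr[lag_min:lag_max]))
    return sample_rate / best_lag
```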
A multi-information text generation module, configured to integrate the individual character speech rate and/or pronunciation intensity and/or intonation into the plain text to generate the multi-information text. The multi-information text is text whose presentation conveys the pronunciation speed and/or intonation and/or intensity.
1) According to the obtained individual character speech rate, the speech rate is represented by changing the character spacing or character width in the plain text, by adding symbols, or by a combination of these methods.
For example, the plain text generated by the speech recognition module is: So cool, I won a mobile phone in the lucky draw.
Representing the speech rate by changing the character spacing of the plain text yields multi-information text in which slowly pronounced characters are given wider spacing.
Representing the speech rate by changing the character width of the plain text yields multi-information text in which slowly pronounced characters are rendered wider.
Representing the speech rate by adding symbols to the plain text yields multi-information text such as: So~~ cool, I won~ a mobile phone~~ in the lucky draw.
2) According to the obtained individual character pronunciation intensity, the intensity is represented by changing the character size, the text color, or the font weight in the plain text, or by a combination of these methods; a sketch of one such mapping follows the examples below.
For example, the plain text obtained after processing by the speech recognition module is: So cool, I won a mobile phone in the lucky draw.
Representing the pronunciation intensity by changing the character size of the plain text yields multi-information text in which strongly pronounced characters are rendered larger.
Representing the pronunciation intensity by changing the text color of the plain text yields multi-information text in which each character is rendered in a color corresponding to its intensity, for example: So (red) cool (blue), I won (brown) a mobile phone (red) in the lucky draw.
Representing the pronunciation intensity by changing the font weight of the plain text yields multi-information text in which strongly pronounced characters are rendered in bold.
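One way to sketch the mapping from per-character intensity to character size and font weight is with HTML-style markup. This is only a hedged illustration: the patent does not prescribe HTML, and the intensity thresholds below are arbitrary example values:

```python
def render_intensity_html(chars, intensities, loud=0.5, very_loud=0.8):
    """Render louder characters larger and bold using simple HTML-style tags."""
    out = []
    for ch, level in zip(chars, intensities):
        if level >= very_loud:
            out.append(f'<b><span style="font-size:150%">{ch}</span></b>')
        elif level >= loud:
            out.append(f'<span style="font-size:120%">{ch}</span>')
        else:
            out.append(ch)
    return "".join(out)

# Example: 'So cool' with the second word pronounced strongly
print(render_intensity_html(list("So cool"), [0.2, 0.2, 0.1, 0.9, 0.9, 0.9, 0.9]))
```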
3) According to the obtained individual character intonation, a curve is added above or below each character in the plain text to represent the pronunciation intonation.
For example, the plain text obtained after processing by the speech recognition module is: So cool, I won a mobile phone in the lucky draw.
Adding a curve representing the pronunciation intonation above or below the characters of the plain text yields the multi-information text shown in Fig. 5.
4) Using methods 1) to 3) simultaneously, the individual character speech rate, pronunciation intensity, and intonation are all integrated into the plain text to generate the multi-information text.
For example, the plain text obtained after processing by the speech recognition module is: So cool, I won a mobile phone in the lucky draw.
The finally generated multi-information text is shown in Fig. 6.
The present invention also provides a speech recognition multi-information text acquisition method.
Embodiment three
Referring to Fig. 3, which is a schematic flow chart of one embodiment of the speech recognition multi-information text acquisition method of the present invention. As shown in Fig. 3, the method comprises the following steps:
Step 1: convert speech audio into plain text by speech recognition and, at the same time, obtain the individual character time in the speech audio, i.e. the start time and end time of each character, and then determine the individual character speech rate from the length of the individual character time. The individual character time is obtained automatically during the speech recognition process, when the speech audio is converted into plain text.
Step 2: integrate the individual character speech rate into the plain text to generate the multi-information text.
Embodiment four
Referring to Fig. 4, which is a schematic flow chart of another embodiment of the speech recognition multi-information text acquisition method of the present invention. As shown in Fig. 4, the method comprises the following steps:
Step 1: convert speech audio into plain text by speech recognition and, at the same time, obtain the individual character time in the speech audio, i.e. the start time and end time of each character, and then determine the individual character speech rate from the length of the individual character time. The individual character time is obtained automatically during the speech recognition process, when the speech audio is converted into plain text.
Step 2: calculate the individual character pronunciation intensity and/or intonation from the obtained individual character time.
When calculating the individual character pronunciation intensity, the obtained individual character time is used to compute the average pronunciation intensity within that character's time period, giving the pronunciation intensity of each character.
The individual character intonation is calculated by a fundamental frequency extraction technique.
Step 3: integrate the individual character speech rate and/or pronunciation intensity and/or intonation into the plain text to generate the multi-information text.
With the speech recognition multi-information text acquisition device and method of the present invention, after speech audio is converted into plain text by speech recognition, the individual character speech rate, pronunciation intensity, and intonation in the speech audio are further integrated, in a given presentation style, into the initially generated plain text to produce multi-information text. The device and method can be widely used on information publishing platforms such as microblogs, short messages, and signature files.
Obviously, those skilled in the art can make various changes and modifications to the present invention without departing from its spirit and scope. If these changes and modifications fall within the scope of the claims of the present invention and their technical equivalents, the present invention is intended to include them as well.

Claims (7)

1. A speech recognition multi-information text acquisition device, characterized in that it comprises:
a plain text and individual character time generation module, configured to convert speech audio into plain text by speech recognition, obtain the individual character time in the speech audio at the same time, and determine the individual character speech rate from the length of the individual character time;
a multi-information text generation module, configured to generate multi-information text from the plain text, namely to integrate the individual character speech rate and/or pronunciation intensity and/or intonation into the plain text to generate the multi-information text;
an individual character intonation calculation module, configured to calculate the individual character intonation from the individual character time.
2. The speech recognition multi-information text acquisition device of claim 1, characterized in that it further comprises an individual character intensity calculation module, configured to calculate the individual character pronunciation intensity from the individual character time.
3. The speech recognition multi-information text acquisition device of claim 2, characterized in that the multi-information text generation module is configured to integrate the individual character speech rate and/or pronunciation intensity into the plain text to generate the multi-information text.
4. A speech recognition multi-information text acquisition method, characterized in that it comprises the following steps:
step 1: converting speech audio into plain text by speech recognition, obtaining the individual character time in the speech audio at the same time, and then determining the individual character speech rate from the length of the individual character time;
step 2: generating multi-information text from the plain text;
wherein, between step 1 and step 2, the method further comprises calculating the individual character pronunciation intensity and/or intonation from the individual character time;
and in step 2, the individual character speech rate and/or pronunciation intensity and/or intonation are integrated into the plain text to generate the multi-information text.
5. The speech recognition multi-information text acquisition method of claim 4, characterized in that, in step 2, the individual character speech rate is integrated into the plain text to generate the multi-information text.
6. The speech recognition multi-information text acquisition method of claim 4, characterized in that the individual character intonation is calculated from the individual character time by a fundamental frequency extraction technique.
7. The speech recognition multi-information text acquisition method of claim 4, characterized in that the individual character pronunciation intensity is obtained by calculating the average pronunciation intensity within the individual character time.
CN2011101651010A 2011-06-17 2011-06-17 Device and method for acquiring speech recognition multi-information text Active CN102237088B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2011101651010A CN102237088B (en) 2011-06-17 2011-06-17 Device and method for acquiring speech recognition multi-information text

Publications (2)

Publication Number Publication Date
CN102237088A CN102237088A (en) 2011-11-09
CN102237088B true CN102237088B (en) 2013-10-23

Family

ID=44887675

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2011101651010A Active CN102237088B (en) 2011-06-17 2011-06-17 Device and method for acquiring speech recognition multi-information text

Country Status (1)

Country Link
CN (1) CN102237088B (en)

Families Citing this family (69)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8677377B2 (en) 2005-09-08 2014-03-18 Apple Inc. Method and apparatus for building an intelligent automated assistant
US9318108B2 (en) 2010-01-18 2016-04-19 Apple Inc. Intelligent automated assistant
US8977255B2 (en) 2007-04-03 2015-03-10 Apple Inc. Method and system for operating a multi-function portable electronic device using voice-activation
US8676904B2 (en) 2008-10-02 2014-03-18 Apple Inc. Electronic devices with voice command and contextual data processing capabilities
US10706373B2 (en) 2011-06-03 2020-07-07 Apple Inc. Performing actions associated with task items that represent tasks to perform
CN101923734B (en) * 2010-07-15 2012-07-04 严皓 Highway vehicle traveling path recognition system based on mobile network and realization method thereof
US10417037B2 (en) 2012-05-15 2019-09-17 Apple Inc. Systems and methods for integrating third party services with a digital assistant
TWI484475B (en) * 2012-06-05 2015-05-11 Quanta Comp Inc Method for displaying words, voice-to-text device and computer program product
JP2016508007A (en) 2013-02-07 2016-03-10 アップル インコーポレイテッド Voice trigger for digital assistant
US10652394B2 (en) 2013-03-14 2020-05-12 Apple Inc. System and method for processing voicemail
US10748529B1 (en) 2013-03-15 2020-08-18 Apple Inc. Voice activated device for use with a voice-based digital assistant
US10176167B2 (en) 2013-06-09 2019-01-08 Apple Inc. System and method for inferring user intent from speech inputs
CN103310273A (en) * 2013-06-26 2013-09-18 南京邮电大学 Method for articulating Chinese vowels with tones and based on DIVA model
CN104518951B (en) * 2013-09-29 2017-04-05 腾讯科技(深圳)有限公司 A kind of method and device for replying social networking application information
US10170123B2 (en) 2014-05-30 2019-01-01 Apple Inc. Intelligent assistant for home automation
US9715875B2 (en) 2014-05-30 2017-07-25 Apple Inc. Reducing the need for manual start/end-pointing and trigger phrases
US9966065B2 (en) 2014-05-30 2018-05-08 Apple Inc. Multi-command single utterance input method
US10186282B2 (en) * 2014-06-19 2019-01-22 Apple Inc. Robust end-pointing of speech signals using speaker recognition
US9338493B2 (en) 2014-06-30 2016-05-10 Apple Inc. Intelligent automated assistant for TV user interactions
US9886953B2 (en) 2015-03-08 2018-02-06 Apple Inc. Virtual assistant activation
US9721566B2 (en) 2015-03-08 2017-08-01 Apple Inc. Competing devices responding to voice triggers
US10460227B2 (en) 2015-05-15 2019-10-29 Apple Inc. Virtual assistant in a communication session
US10200824B2 (en) 2015-05-27 2019-02-05 Apple Inc. Systems and methods for proactively identifying and surfacing relevant content on a touch-sensitive device
US20160378747A1 (en) 2015-06-29 2016-12-29 Apple Inc. Virtual assistant for media playback
US10747498B2 (en) 2015-09-08 2020-08-18 Apple Inc. Zero latency digital assistant
US10671428B2 (en) 2015-09-08 2020-06-02 Apple Inc. Distributed personal assistant
US10740384B2 (en) 2015-09-08 2020-08-11 Apple Inc. Intelligent automated assistant for media search and playback
US10331312B2 (en) 2015-09-08 2019-06-25 Apple Inc. Intelligent automated assistant in a media environment
CN105353957A (en) * 2015-10-28 2016-02-24 深圳市金立通信设备有限公司 Information display method and terminal
US10691473B2 (en) 2015-11-06 2020-06-23 Apple Inc. Intelligent automated assistant in a messaging environment
US10956666B2 (en) 2015-11-09 2021-03-23 Apple Inc. Unconventional virtual assistant interactions
US10223066B2 (en) 2015-12-23 2019-03-05 Apple Inc. Proactive assistance based on dialog communication between devices
US10586535B2 (en) 2016-06-10 2020-03-10 Apple Inc. Intelligent digital assistant in a multi-tasking environment
DK201670540A1 (en) 2016-06-11 2018-01-08 Apple Inc Application integration with a digital assistant
DK179415B1 (en) 2016-06-11 2018-06-14 Apple Inc Intelligent device arbitration and control
US11204787B2 (en) 2017-01-09 2021-12-21 Apple Inc. Application integration with a digital assistant
DK201770383A1 (en) 2017-05-09 2018-12-14 Apple Inc. User interface for correcting recognition errors
US10726832B2 (en) 2017-05-11 2020-07-28 Apple Inc. Maintaining privacy of personal information
DK179496B1 (en) 2017-05-12 2019-01-15 Apple Inc. USER-SPECIFIC Acoustic Models
DK201770429A1 (en) 2017-05-12 2018-12-14 Apple Inc. Low-latency intelligent automated assistant
DK179745B1 (en) 2017-05-12 2019-05-01 Apple Inc. SYNCHRONIZATION AND TASK DELEGATION OF A DIGITAL ASSISTANT
US20180336892A1 (en) 2017-05-16 2018-11-22 Apple Inc. Detecting a trigger of a digital assistant
US10303715B2 (en) 2017-05-16 2019-05-28 Apple Inc. Intelligent automated assistant for media exploration
CN108133706B (en) * 2017-12-21 2020-10-27 深圳市沃特沃德股份有限公司 Semantic recognition method and device
US10818288B2 (en) 2018-03-26 2020-10-27 Apple Inc. Natural assistant interaction
US11145294B2 (en) 2018-05-07 2021-10-12 Apple Inc. Intelligent automated assistant for delivering content from user experiences
US10928918B2 (en) 2018-05-07 2021-02-23 Apple Inc. Raise to speak
DK180639B1 (en) 2018-06-01 2021-11-04 Apple Inc DISABILITY OF ATTENTION-ATTENTIVE VIRTUAL ASSISTANT
DK179822B1 (en) 2018-06-01 2019-07-12 Apple Inc. Voice interaction at a primary device to access call functionality of a companion device
US10892996B2 (en) 2018-06-01 2021-01-12 Apple Inc. Variable latency device coordination
CN110830852B (en) * 2018-08-07 2022-08-12 阿里巴巴(中国)有限公司 Video content processing method and device
US11475898B2 (en) 2018-10-26 2022-10-18 Apple Inc. Low-latency multi-speaker speech recognition
US11638059B2 (en) 2019-01-04 2023-04-25 Apple Inc. Content playback on multiple devices
US11348573B2 (en) 2019-03-18 2022-05-31 Apple Inc. Multimodality in digital assistant systems
US11475884B2 (en) 2019-05-06 2022-10-18 Apple Inc. Reducing digital assistant latency when a language is incorrectly determined
DK201970509A1 (en) 2019-05-06 2021-01-15 Apple Inc Spoken notifications
US11307752B2 (en) 2019-05-06 2022-04-19 Apple Inc. User configurable task triggers
US11423908B2 (en) 2019-05-06 2022-08-23 Apple Inc. Interpreting spoken requests
US11140099B2 (en) 2019-05-21 2021-10-05 Apple Inc. Providing message response suggestions
US11289073B2 (en) 2019-05-31 2022-03-29 Apple Inc. Device text to speech
DK201970511A1 (en) 2019-05-31 2021-02-15 Apple Inc Voice identification in digital assistant systems
US11496600B2 (en) 2019-05-31 2022-11-08 Apple Inc. Remote execution of machine-learned models
DK180129B1 (en) 2019-05-31 2020-06-02 Apple Inc. User activity shortcut suggestions
US11360641B2 (en) 2019-06-01 2022-06-14 Apple Inc. Increasing the relevance of new available information
US11488406B2 (en) 2019-09-25 2022-11-01 Apple Inc. Text detection using global geometry estimators
US11183193B1 (en) 2020-05-11 2021-11-23 Apple Inc. Digital assistant hardware abstraction
US11755276B2 (en) 2020-05-12 2023-09-12 Apple Inc. Reducing description length based on confidence
CN111611208A (en) * 2020-05-27 2020-09-01 北京太极华保科技股份有限公司 File storage and query method and device and storage medium
CN112530213B (en) * 2020-12-25 2022-06-03 方湘 Chinese tone learning method and system

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101462932B1 (en) * 2008-05-28 2014-12-04 엘지전자 주식회사 Mobile terminal and text correction method
CN101727900A (en) * 2009-11-24 2010-06-09 北京中星微电子有限公司 Method and equipment for detecting user pronunciation

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1336634A (en) * 2000-07-28 2002-02-20 国际商业机器公司 Method and device for recognizing acoustic language according to base sound information
US7155391B2 (en) * 2000-07-31 2006-12-26 Micron Technology, Inc. Systems and methods for speech recognition and separate dialect identification
JP2004212665A (en) * 2002-12-27 2004-07-29 Toshiba Corp Apparatus and method for varying speaking speed
JP2011014021A (en) * 2009-07-03 2011-01-20 Nippon Hoso Kyokai <Nhk> Character information presentation control device and program
CN101777347A (en) * 2009-12-07 2010-07-14 中国科学院自动化研究所 Model complementary Chinese accent identification method and system

Also Published As

Publication number Publication date
CN102237088A (en) 2011-11-09

Similar Documents

Publication Publication Date Title
CN102237088B (en) Device and method for acquiring speech recognition multi-information text
WO2022052481A1 (en) Artificial intelligence-based vr interaction method, apparatus, computer device, and medium
CN105096940B (en) Method and apparatus for carrying out speech recognition
KR20140121580A (en) Apparatus and method for automatic translation and interpretation
CN104811559B (en) Noise-reduction method, communication means and mobile terminal
US20180068662A1 (en) Generation of text from an audio speech signal
EP1557821A3 (en) Segmental tonal modeling for tonal languages
EP2851895A3 (en) Speech recognition using variable-length context
CN108766441A (en) A kind of sound control method and device based on offline Application on Voiceprint Recognition and speech recognition
ATE514162T1 (en) DYNAMIC CONTEXT GENERATION FOR LANGUAGE RECOGNITION
CN105210147B (en) Method, apparatus and computer-readable recording medium for improving at least one semantic unit set
Yağanoğlu Real time wearable speech recognition system for deaf persons
WO2004100126A3 (en) Method for statistical language modeling in speech recognition
CN112309365A (en) Training method and device of speech synthesis model, storage medium and electronic equipment
JP2010157081A (en) Response generation device and program
KR102607373B1 (en) Apparatus and method for recognizing emotion in speech
CN105765654A (en) Hearing assistance device with fundamental frequency modification
CN106653002A (en) Literal live broadcasting method and platform
EP1280137A1 (en) Method for speaker identification
CN102411929A (en) Voiceprint authentication system and implementation method thereof
CN113327586A (en) Voice recognition method and device, electronic equipment and storage medium
CN104361787A (en) System and method for converting signals
CN104200807B (en) A kind of ERP sound control methods
CN112489634A (en) Language acoustic model training method and device, electronic equipment and computer medium
CN108831503B (en) Spoken language evaluation method and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
ASS Succession or assignment of patent right

Owner name: SHANGHAI GUOKE ELECTRONIC CO., LTD.

Free format text: FORMER OWNER: SHENGYUE INFORMATION TECHNOLOGY (SHANGHAI) CO., LTD.

Effective date: 20140210

TR01 Transfer of patent right

Effective date of registration: 20140210

Address after: 201203 Shanghai Guo Shou Jing Road, Zhangjiang hi tech Park No. 356 building 3 room 127

Patentee after: Shanghai Guoke Electronic Co., Ltd.

Address before: 201203 Shanghai City, Pudong New Area Shanghai City, Guo Shou Jing Road, Zhangjiang hi tech Park No. 356 building 3 Room 102

Patentee before: Shengle Information Technology (Shanghai) Co., Ltd.

CP03 Change of name, title or address

Address after: Room 127, building 3, 356 GuoShouJing Road, China (Shanghai) pilot Free Trade Zone, Pudong New Area, Shanghai, 200120

Patentee after: SHANGHAI GEAK ELECTRONICS Co.,Ltd.

Address before: Room 127, building 3, 356 GuoShouJing Road, Zhangjiang hi tech park, Shanghai, 201203

Patentee before: Shanghai Nutshell Electronics Co.,Ltd.