CN106504754B - A real-time subtitle generation method based on audio output - Google Patents
A real-time subtitle generation method based on audio output
- Publication number
- CN106504754B (application CN201610863894.6A)
- Authority
- CN
- China
- Prior art keywords
- audio
- voice
- text
- real
- frequency information
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
Abstract
A real-time subtitle generation method based on audio output proceeds as follows. The audio output of an electronic device is monitored and collected in real time by an audio collection module. The collected audio is passed to a voice extraction module, which filters out irrelevant content such as background music and performs noise reduction to obtain a clean voice signal. The voice signal to be converted into text is then input to a speech recognition module, which produces the corresponding text. Finally, a display module shows the converted text on the device screen in real time in the form of subtitles. The advantage of this method is that it helps hearing-impaired people obtain the voice content contained in video, audio or other media, providing them with an effective and convenient way to access voice information, while also offering convenience to ordinary users.
Description
Technical field
The present invention relates to the field of interaction assistance technology for hearing-impaired people, and in particular to a method for automatically generating subtitles in real time from audio.
Background technique
In 2012, the World Health Organization reported that the prevalence of moderate or worse hearing impairment in the world population was 5.3%. Related data also show that 15.84% of the current Chinese population suffers from some degree of hearing impairment; those with disabling hearing impairment, i.e. moderate or worse, account for 5.17% of the total population. With the spread of electronic devices such as PCs and mobile phones, multimedia forms such as video and audio have become important media for obtaining information. For hearing-impaired people, however, obtaining the voice information in multimedia content is very difficult. Text is currently the main way hearing-impaired people obtain information; when a video contains speech but provides no subtitles, they cannot obtain the corresponding information. For example, some news videos contain only a content summary and no full caption track.
For visually impaired users, screen-reading software can convert the text displayed on an electronic device screen into speech in real time, providing an effective way for them to obtain textual content. Hearing-impaired people, however, lack a corresponding tool for converting the speech on a device into text, so the demand for such a tool is urgent. In recent years, speech recognition technology has made marked progress: recognition accuracy keeps improving, the technology has moved from the laboratory to the market, and more and more applications include speech-to-text functions. Methods for displaying subtitles in real time for the speech played on an electronic device (including the voice track of videos), however, remain largely a blank area of application.
Combining existing speech recognition systems to provide real-time subtitles from the audio output of a device would therefore greatly help hearing-impaired people obtain the content of voice information, and better support their life, study and work.
Summary of the invention
To overcome the above shortcomings of the prior art, the present invention proposes a real-time subtitle generation method based on audio output, to help hearing-impaired users obtain, more conveniently and accurately, the text corresponding to the audio output of an electronic device in real time.
The real-time subtitle generation method based on audio output of the present invention comprises the following steps:
1) Audio collection: monitor the audio output of the electronic device in real time and collect it.
2) Voice extraction: process the collected audio, filter out irrelevant content such as background music, and perform noise reduction to obtain a clean voice signal.
3) Speech recognition: once the voice signal to be converted into text is obtained, perform speech recognition on it to obtain the corresponding text.
4) Display: show the converted text on the device screen in the form of subtitles.
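The four steps above can be sketched as a minimal pipeline. All function names, the synthesized test signal, and the dummy recognizer are illustrative assumptions; the patent text does not prescribe an implementation.

```python
import numpy as np

def collect_audio(duration_s=1.0, sr=16000):
    """Step 1 stand-in: instead of capturing the device's real audio output,
    synthesize a 440 Hz tone plus background noise."""
    t = np.arange(int(duration_s * sr)) / sr
    return np.sin(2 * np.pi * 440 * t) + 0.1 * np.random.randn(t.size), sr

def extract_voice(signal, sr):
    """Step 2: keep only the 300 Hz - 3.4 kHz voice band (crude FFT mask)."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(signal.size, 1.0 / sr)
    spectrum[(freqs < 300) | (freqs > 3400)] = 0
    return np.fft.irfft(spectrum, n=signal.size)

def recognize(voice):
    """Step 3 placeholder: a real system would call a speech recognizer."""
    return "recognized text" if np.abs(voice).max() > 0 else ""

def display(text):
    """Step 4: render the text as a subtitle (here, simply format it)."""
    return f"[subtitle] {text}"

signal, sr = collect_audio()
voice = extract_voice(signal, sr)
print(display(recognize(voice)))
```

In a real implementation, `collect_audio` would tap the operating system's audio output and `recognize` would call an actual speech recognition engine; the skeleton only shows how the four stages chain together.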
The audio collection in step 1) is specifically as follows: on an electronic device, any audio file sent to the sound card or to the audio decoder for output may contain voice information. Audio collection therefore consists of monitoring in real time whether there is audio output, and passing the audio signal on for further processing promptly once output is detected.
The voice extraction in step 2) specifically includes:
21) An audio signal is the carrier of the frequency and amplitude variations of regular sound waves such as speech, music and other audio, and the frequency range of the human voice signal is roughly 300 Hz to 3.4 kHz. Since subtitles need to be generated only for voice information, the voice extraction module extracts the voiceprint of the speech in the audio file mainly according to the voice frequency range, for later retrieval of the voice track.
22) Apply a suitable filtering algorithm to the extracted voiceprint for noise reduction, obtaining a cleaner voiceprint and improving recognition accuracy.
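The band-limiting in steps 21)-22) can be sketched with a Butterworth band-pass filter over the 300 Hz - 3.4 kHz voice range. The choice of a Butterworth filter is an assumption; the patent names only "a corresponding filtering algorithm".

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def voice_bandpass(signal, sr, low=300.0, high=3400.0, order=4):
    """Keep only the 300 Hz - 3.4 kHz voice band of the signal."""
    sos = butter(order, [low, high], btype="bandpass", fs=sr, output="sos")
    return sosfiltfilt(sos, signal)  # zero-phase filtering

sr = 16000
t = np.arange(sr) / sr
# a 1 kHz "voice" tone mixed with 50 Hz hum and 6 kHz hiss (both out of band)
mixed = (np.sin(2 * np.pi * 1000 * t)
         + np.sin(2 * np.pi * 50 * t)
         + np.sin(2 * np.pi * 6000 * t))
voice = voice_bandpass(mixed, sr)

def band_amplitude(x, sr, f):
    """Magnitude of the FFT bin at frequency f (1 s signal: bin k = k Hz)."""
    return np.abs(np.fft.rfft(x))[int(f * len(x) / sr)]

# the in-band tone survives; the out-of-band components are strongly attenuated
assert band_amplitude(voice, sr, 1000) > 10 * band_amplitude(voice, sr, 6000)
assert band_amplitude(voice, sr, 1000) > 10 * band_amplitude(voice, sr, 50)
```

The zero-phase `sosfiltfilt` avoids shifting the speech in time, which matters when the text must later be aligned with the playing audio as subtitles.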
The speech recognition in step 3) is specifically as follows: input the obtained voiceprint into the speech recognition module for language identification, feature extraction, retrieval and matching, and perform related processing such as contextual semantic analysis to finally obtain accurate corresponding text.
The display in step 4) is specifically as follows: after the text corresponding to the voice is obtained, display it in real time as subtitles on the user's screen, providing an effective and convenient way for users to read and understand the voice content being played.
The present invention proposes a real-time subtitle generation method based on audio output. Its advantages are: building on existing speech recognition systems, it provides a method for converting audio information into text and displaying it as subtitles; it is applicable to electronic devices such as computers and mobile phones; it removes the barrier that prevents hearing-impaired people from obtaining voice information; and it also offers convenience to ordinary users browsing video, audio and the like.
Detailed description of the invention
Fig. 1 is a flow chart of the method of the present invention.
Specific embodiment
The present invention is further described below with reference to the accompanying drawings:
1. A real-time subtitle generation method based on audio output, implemented in the following steps:
1) Audio collection: monitor the audio output of the electronic device in real time and collect it.
2) Voice extraction: process the collected audio, filter out irrelevant content such as background music, and perform noise reduction to obtain a clean voice signal.
3) Speech recognition: once the voice signal to be converted into text is obtained, perform speech recognition to obtain the corresponding text.
4) Display: show the converted text on the device screen in the form of subtitles.
The audio collection in step 1) is specifically as follows: on an electronic device, any audio file sent to the sound card or to the audio decoder for output may contain voice information. The content of audio collection is to monitor in real time whether there is audio output, and to pass the audio signal on for further processing promptly once output is detected.
The voice extraction in step 2) is specifically:
1) An audio signal is the carrier of the frequency and amplitude variations of regular sound waves such as speech, music and other audio, and the frequency range of the human voice signal is roughly 300 Hz to 3.4 kHz. Since subtitles need to be generated only for voice information, voice extraction consists mainly of extracting the voiceprint of the speech in the audio file according to the voice frequency range, for later retrieval of the voice track.
2) Simulate the masking effect of the human ear with triangular band-pass filters (triangle filters) to perform noise reduction on the extracted voiceprint, obtaining a cleaner voiceprint and improving recognition accuracy.
The speech recognition in step 3) is specifically: input the obtained voiceprint for language identification, feature extraction, retrieval and matching, and perform related processing such as contextual semantic analysis to finally obtain accurate corresponding text. The specific process is:
31) Using a cloud corpus of each dialect of each language, collected and recorded in advance, extract the distinctive phonetic features of the different corpora with MFCC. The concrete operations are: decompose the audio into frames and compute the periodogram power spectrum of each frame; then apply a mel filterbank to the power spectrum and take the logarithm of the filter energies; keep coefficients 2-13 of the DCT of the log energies as the features.
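The MFCC recipe in step 31) — framing, periodogram power spectrum, triangular mel filterbank, log, DCT, coefficients 2-13 — can be sketched as follows. Frame length, hop size, filter count and FFT size are conventional values assumed for illustration; the patent does not fix them.

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr, frame_len=0.025, hop=0.010, n_filters=26, n_fft=512):
    # 1) decompose the audio into overlapping Hamming-windowed frames
    flen, fhop = int(frame_len * sr), int(hop * sr)
    n_frames = 1 + (len(signal) - flen) // fhop
    idx = np.arange(flen)[None, :] + fhop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(flen)
    # 2) periodogram power spectrum of every frame
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # 3) triangular mel filterbank energies, then the logarithm
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        lo, c, hi = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, lo:c] = (np.arange(lo, c) - lo) / max(c - lo, 1)
        fbank[i, c:hi] = (hi - np.arange(c, hi)) / max(hi - c, 1)
    log_e = np.log(power @ fbank.T + 1e-10)
    # 4) DCT of the log energies; keep coefficients 2-13 as the features
    return dct(log_e, type=2, axis=1, norm="ortho")[:, 1:13]

sr = 16000
tone = np.sin(2 * np.pi * 300 * np.arange(sr) / sr)  # 1 s test tone
feats = mfcc(tone, sr)
print(feats.shape)  # (98, 12): 98 frames, 12 coefficients per frame
```

Dropping the first DCT coefficient (overall energy) and keeping coefficients 2-13 is the standard choice that step 31) describes; it makes the features less sensitive to loudness.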
32) Extract phonetic features from the actually collected acoustic information with the same MFCC technique, compare them with the corpus features, and determine the most similar corpus entry according to similarity.
33) Decompose the voice information into multiple consecutive fragments and match each fragment to the corresponding text in the corpus using feature similarity.
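The fragment-to-text matching in steps 32)-33) can be sketched as a nearest-neighbour lookup by feature similarity. The toy corpus vectors, their text labels, and the use of cosine similarity are assumptions for illustration; the patent specifies only "feature similarity".

```python
import numpy as np

def cosine_sim(a, b):
    """Similarity between two feature vectors, in [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "corpus": mean feature vectors labelled with their text (invented data;
# a real corpus would hold MFCC features of recorded speech per dialect)
corpus = {
    "hello": np.array([1.0, 0.2, 0.1]),
    "world": np.array([0.1, 1.0, 0.3]),
}

def match_fragment(features):
    """Return the corpus text whose features are most similar to the fragment."""
    return max(corpus, key=lambda text: cosine_sim(features, corpus[text]))

fragment = np.array([0.9, 0.25, 0.12])   # features close to the "hello" entry
assert match_fragment(fragment) == "hello"
```

A production system would index far larger feature sets (and typically use dynamic time warping or an acoustic model rather than a single vector per entry), but the lookup structure is the same.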
34) After integrating all the text, use a cloud Chinese phrase semantic base to analyze the semantic relatedness of consecutive words, and compute the semantic relatedness between candidate words whose phonetic features are similar to the words before and after. If the semantic relatedness between the current words is weak, replace them with near-homophones of higher semantic relatedness.
35) Summarize and integrate all the text to generate a semantically coherent recognition result.
The display in step 4) is specifically: after the text corresponding to the voice is obtained, display it in real time as subtitles on the user's screen, providing an effective and convenient way for users to read and understand the voice content being played.
The embodiments described in this specification merely illustrate forms of realizing the inventive concept. The protection scope of the present invention should not be construed as limited to the specific forms stated in the embodiments; it also covers equivalent technical means that those skilled in the art can conceive according to the inventive concept.
Claims (1)
1. A real-time subtitle generation method based on audio output, characterized in that:
Step 1) Audio collection: monitor the audio output of the electronic device in real time and collect it; audio collection specifically consists of monitoring in real time whether there is audio output, and passing the audio signal on for further processing promptly once output is detected;
Step 2) Voice extraction: process the collected audio, filter out irrelevant content such as background music, and perform noise reduction to obtain a clean voice signal; voice extraction specifically includes:
21) extracting the voiceprint of the speech in the audio file mainly according to the voice frequency range, for later retrieval of the voice track;
22) applying a suitable filtering algorithm to the extracted voiceprint for noise reduction, obtaining a cleaner voiceprint and improving recognition accuracy;
Step 3) Speech recognition: once the voice information to be converted into text is obtained, perform speech recognition on it to obtain the corresponding text; speech recognition specifically includes: inputting the obtained voiceprint into the speech recognition module for language identification, feature extraction, retrieval and matching, and performing contextual semantic analysis to finally obtain accurate corresponding text; specifically:
31) using a cloud corpus of each dialect of each language, collected and recorded in advance, extract the distinctive phonetic features of the different corpora with MFCC; the concrete operations are: decompose the audio into frames and compute the periodogram power spectrum of each frame; then apply a mel filterbank to the power spectrum and take the logarithm of the filter energies; keep coefficients 2-13 of the DCT of the log energies as the features;
32) extract phonetic features from the actually collected acoustic information with the same MFCC technique, compare them with the corpus features, and determine the most similar corpus entry according to similarity;
33) decompose the voice information into multiple consecutive fragments and match each fragment to the corresponding text in the corpus using feature similarity;
34) after integrating all the text, use a cloud Chinese phrase semantic base to analyze the semantic relatedness of consecutive words, and compute the semantic relatedness between candidate words whose phonetic features are similar; if the semantic relatedness between the current words is weak, replace them with near-homophones of higher semantic relatedness;
35) summarize and integrate all the text to generate a semantically coherent recognition result;
Step 4) Display: show the converted text on the device screen in the form of subtitles.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610863894.6A CN106504754B (en) | 2016-09-29 | 2016-09-29 | A kind of real-time method for generating captions according to audio output |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610863894.6A CN106504754B (en) | 2016-09-29 | 2016-09-29 | A kind of real-time method for generating captions according to audio output |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106504754A CN106504754A (en) | 2017-03-15 |
CN106504754B true CN106504754B (en) | 2019-10-18 |
Family
ID=58291207
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610863894.6A Active CN106504754B (en) | 2016-09-29 | 2016-09-29 | A kind of real-time method for generating captions according to audio output |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106504754B (en) |
Families Citing this family (23)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107301867A (en) * | 2017-08-10 | 2017-10-27 | 安徽声讯信息技术有限公司 | A kind of voice restarts control system |
CN107767871B (en) * | 2017-10-12 | 2021-02-02 | 安徽听见科技有限公司 | Text display method, terminal and server |
CN108389281A (en) * | 2018-03-17 | 2018-08-10 | 广东容祺智能科技有限公司 | A kind of unmanned plane cruising inspection system with voice record function |
US11178465B2 (en) * | 2018-10-02 | 2021-11-16 | Harman International Industries, Incorporated | System and method for automatic subtitle display |
CN109257659A (en) * | 2018-11-16 | 2019-01-22 | 北京微播视界科技有限公司 | Subtitle adding method, device, electronic equipment and computer readable storage medium |
CN109600681B (en) * | 2018-11-29 | 2021-05-25 | 上海华绰信息科技有限公司 | Subtitle display method, device, terminal and storage medium |
CN112567330A (en) * | 2018-11-30 | 2021-03-26 | 华为技术有限公司 | Voice recognition method, device and system |
CN109714608B (en) * | 2018-12-18 | 2023-03-10 | 深圳壹账通智能科技有限公司 | Video data processing method, video data processing device, computer equipment and storage medium |
CN109872714A (en) * | 2019-01-25 | 2019-06-11 | 广州富港万嘉智能科技有限公司 | A kind of method, electronic equipment and storage medium improving accuracy of speech recognition |
CN113692619A (en) * | 2019-05-02 | 2021-11-23 | 谷歌有限责任公司 | Automatically subtitling audible portions of content on a computing device |
CN111968630B (en) * | 2019-05-20 | 2024-03-19 | 北京字节跳动网络技术有限公司 | Information processing method and device and electronic equipment |
CN112135197B (en) * | 2019-06-24 | 2022-12-09 | 腾讯科技(深圳)有限公司 | Subtitle display method and device, storage medium and electronic equipment |
CN112312181A (en) * | 2019-07-26 | 2021-02-02 | 深圳Tcl新技术有限公司 | Smart television voice recognition method, system and readable storage medium |
CN110415706A (en) * | 2019-08-08 | 2019-11-05 | 常州市小先信息技术有限公司 | A kind of technology and its application of superimposed subtitle real-time in video calling |
CN110675117A (en) * | 2019-09-17 | 2020-01-10 | 深圳市天道日新科技有限公司 | Criminal trial network remote litigation method |
CN111107284B (en) * | 2019-12-31 | 2022-09-06 | 洛阳乐往网络科技有限公司 | Real-time generation system and generation method for video subtitles |
CN111768787A (en) * | 2020-06-24 | 2020-10-13 | 中国人民解放军海军航空大学 | Multifunctional auxiliary audio-visual method and system |
CN111836062A (en) * | 2020-06-30 | 2020-10-27 | 北京小米松果电子有限公司 | Video playing method and device and computer readable storage medium |
CN111787380A (en) * | 2020-07-06 | 2020-10-16 | 四川长虹网络科技有限责任公司 | Voice channel switching control method and device and handheld intelligent terminal |
CN112843724B (en) * | 2021-01-18 | 2022-03-22 | 浙江大学 | Game scenario display control method and device, electronic equipment and storage medium |
CN113435198A (en) * | 2021-07-05 | 2021-09-24 | 深圳市鹰硕技术有限公司 | Automatic correction display method and device for caption dialect words |
CN115880737B (en) * | 2021-09-26 | 2024-04-19 | 天翼爱音乐文化科技有限公司 | Subtitle generation method, system, equipment and medium based on noise reduction self-learning |
CN114087725A (en) * | 2021-11-16 | 2022-02-25 | 珠海格力电器股份有限公司 | Method for preventing mistaken awakening of air conditioner by combining WIFI channel state detection |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP4736478B2 (en) * | 2005-03-07 | 2011-07-27 | 日本電気株式会社 | Voice transcription support device, method and program thereof |
CN103685985A (en) * | 2012-09-17 | 2014-03-26 | 联想(北京)有限公司 | Communication method, transmitting device, receiving device, voice processing equipment and terminal equipment |
CN103561217A (en) * | 2013-10-14 | 2014-02-05 | 深圳创维数字技术股份有限公司 | Method and terminal for generating captions |
CN105704538A (en) * | 2016-03-17 | 2016-06-22 | 广东小天才科技有限公司 | Method and system for generating audio and video subtitles |
CN105845129A (en) * | 2016-03-25 | 2016-08-10 | 乐视控股(北京)有限公司 | Method and system for dividing sentences in audio and automatic caption generation method and system for video files |
CN105913845A (en) * | 2016-04-26 | 2016-08-31 | 惠州Tcl移动通信有限公司 | Mobile terminal voice recognition and subtitle generation method and system and mobile terminal |
-
2016
- 2016-09-29 CN CN201610863894.6A patent/CN106504754B/en active Active
Also Published As
Publication number | Publication date |
---|---|
CN106504754A (en) | 2017-03-15 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106504754B (en) | A kind of real-time method for generating captions according to audio output | |
CN103310788B (en) | A kind of voice information identification method and system | |
WO2021208287A1 (en) | Voice activity detection method and apparatus for emotion recognition, electronic device, and storage medium | |
CN102723078B (en) | Emotion speech recognition method based on natural language comprehension | |
CN107945805B (en) | A kind of across language voice identification method for transformation of intelligence | |
CN108399923B (en) | More human hairs call the turn spokesman's recognition methods and device | |
Shaw et al. | Emotion recognition and classification in speech using artificial neural networks | |
CN109215665A (en) | A kind of method for recognizing sound-groove based on 3D convolutional neural networks | |
CN112102850B (en) | Emotion recognition processing method and device, medium and electronic equipment | |
CN107274916A (en) | The method and device operated based on voiceprint to audio/video file | |
Yağanoğlu | Real time wearable speech recognition system for deaf persons | |
CN101794576A (en) | Dirty word detection aid and using method thereof | |
CN104142831B (en) | Application program searching method and device | |
CN106251872A (en) | A kind of case input method and system | |
CN110970036A (en) | Voiceprint recognition method and device, computer storage medium and electronic equipment | |
WO2013052292A9 (en) | Waveform analysis of speech | |
CN106782503A (en) | Automatic speech recognition method based on physiologic information in phonation | |
CN110349565B (en) | Auxiliary pronunciation learning method and system for hearing-impaired people | |
CN105845126A (en) | Method for automatic English subtitle filling of English audio image data | |
CN113782032B (en) | Voiceprint recognition method and related device | |
Chamoli et al. | Detection of emotion in analysis of speech using linear predictive coding techniques (LPC) | |
CN105139866A (en) | Nanyin music recognition method and device | |
CN107251137A (en) | Improve method, device and the computer readable recording medium storing program for performing of the set of at least one semantic primitive using voice | |
Salhi et al. | Robustness of auditory teager energy cepstrum coefficients for classification of pathological and normal voices in noisy environments | |
Bouafif et al. | A speech tool software for signal processing applications |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |