CN106504754B - A kind of real-time method for generating captions according to audio output - Google Patents

A kind of real-time method for generating captions according to audio output

Info

Publication number
CN106504754B
CN106504754B
Authority
CN
China
Prior art keywords
audio
voice
text
real
frequency information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610863894.6A
Other languages
Chinese (zh)
Other versions
CN106504754A (en)
Inventor
卜佳俊
于智
陈静
王灿
王炜
陈纯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN201610863894.6A priority Critical patent/CN106504754B/en
Publication of CN106504754A publication Critical patent/CN106504754A/en
Application granted
Publication of CN106504754B publication Critical patent/CN106504754B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/26 - Speech to text systems
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being the cepstrum

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

A real-time subtitle generation method based on audio output, with the following steps: the audio information output by an electronic device is processed as follows. An audio collection module monitors the audio output of the electronic device in real time and collects it; the collected audio is passed to a voice extraction module, which filters out irrelevant content such as background music and applies noise reduction to obtain clean speech information; the speech to be converted into text is then input to a speech recognition module, which produces the corresponding text information; finally, a display module shows the converted text as subtitles on the device screen in real time. The advantage of the method is that it helps hearing-impaired people obtain the speech content contained in video, audio, and other media, providing them with an effective and convenient way to access voice information, while also offering convenience to ordinary users.

Description

A real-time subtitle generation method based on audio output
Technical field
The present invention relates to the field of assistive interaction technology for hearing-impaired people, and in particular to a method for automatically generating subtitles in real time from audio.
Background technique
In 2012, the World Health Organization reported that the prevalence of moderate or worse hearing impairment in the world population was 5.3%. Related data also show that 15.84% of people in China currently suffer from some degree of hearing impairment; of these, people with disabling hearing impairment, i.e. moderate or worse, account for 5.17% of the total population. With the spread of electronic devices such as PCs and mobile phones, multimedia forms such as video and audio have become important media for obtaining information. For hearing-impaired people, however, obtaining the voice information in multimedia content is very difficult. Text is currently a major way for hearing-impaired people to obtain information; when a video contains speech but provides no subtitles, they cannot obtain the corresponding information. For example, some news videos include only a content summary and no complete subtitles.
For visually impaired users, screen-reading software can convert the text displayed on an electronic device's screen into speech in real time, providing an effective way for them to obtain textual information. Hearing-impaired people, however, lack a corresponding tool for converting the speech on a device into text, so the demand for such a tool is urgent. In recent years, speech recognition technology has made marked progress: recognition accuracy keeps improving, the technology has moved from the laboratory to the market, and more and more applications include speech-to-text functionality. However, methods that display corresponding subtitles for speech played in real time on an electronic device (including the voice track of a video) remain a blank area in practical applications.
Therefore, combining existing speech recognition systems to provide real-time subtitles for the audio output of a device would greatly help hearing-impaired people obtain the content of voice information, and better assist them in their lives, studies, and work.
Summary of the invention
To overcome the above disadvantages of the prior art, the present invention proposes a real-time subtitle generation method based on audio output, to help hearing-impaired users obtain, more conveniently and accurately, the text corresponding to the audio output of an electronic device in real time.
The real-time subtitle generation method based on audio output of the present invention comprises the following steps:
1) Audio collection: monitor the audio information output by the electronic device in real time and collect it;
2) Voice extraction: process the collected audio information, filter out irrelevant content such as background music, and apply noise reduction to obtain clean speech information;
3) Speech recognition: once the speech information to be converted into text is obtained, perform speech recognition on it to obtain the corresponding text information;
4) Display: show the converted text on the device screen in the form of subtitles.
The audio collection in step 1) is specifically as follows: on an electronic device, any audio file that is sent to the sound card or to the audio decoder for sound output may contain voice information. Audio collection therefore monitors in real time whether there is audio output, and processes the audio signal promptly once output is detected.
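The monitoring loop described above can be sketched as follows. Note this is only an illustrative sketch: `read_output_chunk` is a hypothetical stand-in for a platform loopback-capture API (not specified in the patent), and the silence threshold is likewise an assumption.

```python
import time

def monitor_audio(read_output_chunk, on_audio, poll_interval=0.02, max_polls=None):
    """Poll the device's audio output and forward non-silent chunks downstream.

    read_output_chunk: callable returning a list of samples (empty or silent
        when nothing is playing) -- stands in for a loopback-capture API.
    on_audio: callback invoked with each collected chunk (the next stage).
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        chunk = read_output_chunk()
        # Treat near-zero samples as "no output"; hand off promptly otherwise.
        if chunk and any(abs(s) > 1e-4 for s in chunk):
            on_audio(chunk)
        polls += 1
        time.sleep(poll_interval)
```

In use, `on_audio` would feed the voice extraction module; here it could simply append chunks to a list.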
The voice extraction in step 2) specifically includes:
21) An audio signal is the carrier of the frequency and amplitude variations of regular sound waves such as voice, music, and other audio, and the frequency range of a voice signal is 300 Hz to 3.4 kHz. Since subtitles need only be generated for voice information, the voice extraction module extracts the voiceprint information of the voice from the audio file, mainly according to the frequency range of speech, for later voiceprint retrieval;
22) Apply a suitable filtering algorithm to denoise the extracted voiceprint information, obtaining a more accurate voiceprint and improving recognition accuracy.
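The band extraction in step 21) can be sketched as a naive DFT band-pass over the 300 Hz-3.4 kHz speech band. The patent does not name a concrete filter, so this is only an illustrative assumption; a production system would use an FFT or a time-domain filter.

```python
import cmath
import math

def bandpass_voice(samples, sample_rate, lo=300.0, hi=3400.0):
    """Keep only the 300 Hz-3.4 kHz voice band: compute a naive DFT,
    zero every frequency bin outside the band, then invert."""
    n = len(samples)
    spectrum = [sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)) for k in range(n)]
    for k in range(n):
        freq = k * sample_rate / n          # bin frequency
        alias = (n - k) * sample_rate / n   # its negative-frequency mirror
        if not (lo <= freq <= hi or lo <= alias <= hi):
            spectrum[k] = 0.0
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n for t in range(n)]
```

With an 8 kHz sample rate, a 100 Hz hum component is removed while a 1 kHz in-band component passes through unchanged.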
The speech recognition in step 3) is specifically as follows: the obtained voiceprint information is input to the speech recognition module for language identification, feature extraction, retrieval, and matching, followed by related processing such as contextual semantic analysis, to finally obtain accurate corresponding text information.
The display in step 4) is specifically as follows: after the text corresponding to the speech is obtained, it is displayed in real time on the user's screen in the form of subtitles, providing the user with an effective and convenient way to read and understand the voice content being played.
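Steps 1)-4) can be tied together in a minimal pipeline sketch; the three stage callables here are placeholders for the collection, extraction, recognition, and display modules described above, not the patent's actual implementations.

```python
def generate_subtitles(audio_chunks, extract_voice, recognize, display):
    """Minimal four-step pipeline: collect -> extract -> recognize -> display.

    audio_chunks: iterable of collected audio (step 1).
    extract_voice: filters music/noise, returns speech or a falsy value (step 2).
    recognize: converts speech to text (step 3).
    display: shows the text as a subtitle (step 4).
    """
    for chunk in audio_chunks:
        voice = extract_voice(chunk)
        if not voice:          # nothing speech-like in this chunk
            continue
        display(recognize(voice))
```

Each stage can be swapped independently, e.g. replacing `recognize` with a call to an existing speech recognition system, as the summary suggests.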
The present invention proposes a real-time subtitle generation method based on audio output. Its advantages are: based on existing speech recognition systems, it provides a method for converting audio information into text information and displaying it as subtitles; it is suitable for electronic devices such as computers and mobile phones; it solves the problem that hearing-impaired people cannot obtain voice information; and it also provides convenience for ordinary users browsing video, audio, and similar content.
Detailed description of the invention
Fig. 1 is a flow chart of the method of the present invention.
Specific embodiment
With reference to the accompanying drawings, the present invention is further illustrated:
1. A real-time subtitle generation method based on audio output, implemented with the following steps:
1) Audio collection: monitor the audio information output by the electronic device in real time and collect it;
2) Voice extraction: process the collected audio information, filter out irrelevant content such as background music, and apply noise reduction to obtain clean speech information;
3) Speech recognition: once the speech information to be converted into text is obtained, perform speech recognition to obtain the corresponding text information;
4) Display: show the converted text on the device screen in the form of subtitles.
The audio collection in step 1) is specifically: on an electronic device, any audio file that is sent to the sound card or to the audio decoder for sound output may contain voice information. Audio collection monitors in real time whether there is audio output, and processes the audio signal promptly once output is detected.
The voice extraction in step 2) is specifically:
1) An audio signal is the carrier of the frequency and amplitude variations of regular sound waves such as voice, music, and other audio, and the frequency range of a voice signal is 300 Hz to 3.4 kHz. Since subtitles need only be generated for voice information, voice extraction obtains the voiceprint information of the voice from the audio file, mainly according to the frequency range of speech, for later voiceprint retrieval;
2) Use triangular band-pass filters (triangle filters), which simulate the masking effect of the human ear, to denoise the extracted voiceprint information, obtaining a more accurate voiceprint and improving recognition accuracy.
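The triangular band-pass filters mentioned above are the mel-scale filter bank of a standard MFCC front end. A minimal sketch follows; the filter count and FFT size are illustrative assumptions, and the 300 Hz-3.4 kHz edges come from the voice band stated earlier.

```python
import math

def hz_to_mel(f):
    """Standard mel scale used by MFCC front ends."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate, f_lo=300.0, f_hi=3400.0):
    """Triangular filters spaced evenly on the mel scale over the voice band."""
    m_lo, m_hi = hz_to_mel(f_lo), hz_to_mel(f_hi)
    mel_points = [m_lo + i * (m_hi - m_lo) / (n_filters + 1)
                  for i in range(n_filters + 2)]
    bins = [int((n_fft + 1) * mel_to_hz(m) / sample_rate) for m in mel_points]
    bank = []
    for i in range(1, n_filters + 1):
        filt = [0.0] * (n_fft // 2 + 1)
        for k in range(bins[i - 1], bins[i]):        # rising slope
            filt[k] = (k - bins[i - 1]) / max(1, bins[i] - bins[i - 1])
        for k in range(bins[i], bins[i + 1]):        # peak and falling slope
            filt[k] = (bins[i + 1] - k) / max(1, bins[i + 1] - bins[i])
        bank.append(filt)
    return bank

def apply_filterbank(power_spectrum, bank):
    """Log filter energies: one value per triangular filter."""
    return [math.log(max(1e-12, sum(w * p for w, p in zip(filt, power_spectrum))))
            for filt in bank]
```

Applying the bank to a frame's power spectrum yields the log mel energies that the MFCC procedure below transforms with a DCT.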
The speech recognition in step 3) is specifically: the obtained voiceprint information is input for language identification, feature extraction, retrieval, and matching, followed by related processing such as contextual semantic analysis, to finally obtain accurate corresponding text information. The concrete procedure is:
31) Using a cloud corpus of each dialect of each language, collected and recorded in advance, extract the phonetic features specific to each corpus with the MFCC technique. The concrete operations are: decompose the audio into frames and compute the periodogram power spectrum of each frame; then apply a mel filter bank to the power spectrum to compute the filter energies and take their logarithm; apply a DCT to the log energies and retain coefficients 2-13 as the features.
32) Extract phonetic features from the actually collected acoustic information with the same MFCC technique, compare them with the corpus features, and determine the most similar corpus according to similarity.
33) Decompose the voice information into multiple consecutive segments and match each segment to the corresponding text in the corpus using feature similarity.
34) After integrating all the text, use a cloud Chinese phrase semantic base to analyze the semantic correlation between consecutive words, and compute the semantic correlation between candidate near-homophones and the surrounding words. If the semantic correlation of an existing word is weak, replace it with a near-homophone of higher semantic relevance.
35) Aggregate all the text to generate a semantically coherent recognition result.
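The final MFCC step in 31) and the similarity matching in 32)-33) can be sketched as follows. The DCT-II here is unnormalised and coefficients 2-13 (1-indexed) are kept as stated; the corpus-entry structure with `text`/`features` keys and the use of cosine similarity are illustrative assumptions, since the patent does not specify the similarity measure.

```python
import math

def dct2(xs):
    """Unnormalised DCT-II, the final transform of the MFCC procedure."""
    n = len(xs)
    return [sum(x * math.cos(math.pi * k * (2 * t + 1) / (2 * n))
                for t, x in enumerate(xs)) for k in range(n)]

def mfcc_from_log_energies(log_energies):
    """Keep DCT coefficients 2-13 (1-indexed), i.e. indices 1..12."""
    return dct2(log_energies)[1:13]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def best_match(feature, corpus):
    """Steps 32)-33): pick the corpus entry whose features are most similar."""
    return max(corpus,
               key=lambda entry: cosine_similarity(feature, entry["features"]))["text"]
```

As a sanity check, the DCT of a constant log-energy vector has only a DC component, so all retained coefficients are zero, and matching picks the corpus entry closest in feature space.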
The display in step 4) is specifically: after the text corresponding to the speech is obtained, it is displayed in real time on the user's screen in the form of subtitles, providing the user with an effective and convenient way to read and understand the voice content being played.
The content described in the embodiments of this specification is merely an enumeration of the forms in which the inventive concept may be realized. The protection scope of the present invention should not be construed as limited to the specific forms stated in the embodiments; it also covers equivalent technical means that those skilled in the art can conceive according to the inventive concept.

Claims (1)

1. A real-time subtitle generation method based on audio output, characterized in that:
Step 1) Audio collection: monitor the audio information output by the electronic device in real time and collect it; audio collection specifically monitors in real time whether there is audio output, and processes the audio signal promptly once output is detected;
Step 2) Voice extraction: process the collected audio information, filter out irrelevant content such as background music, and apply noise reduction to obtain clean speech information; voice extraction specifically includes:
21) extracting the voiceprint information of the voice from the audio file, mainly according to the frequency range of speech, for later voiceprint retrieval;
22) denoising the extracted voiceprint information with a suitable filtering algorithm to obtain a more accurate voiceprint and improve recognition accuracy;
Step 3) Speech recognition: once the speech information to be converted into text is obtained, perform speech recognition on it to obtain the corresponding text information; speech recognition specifically includes: inputting the obtained voiceprint information to the speech recognition module for language identification, feature extraction, retrieval, and matching, followed by contextual semantic analysis, to finally obtain accurate corresponding text information; specifically:
31) using a cloud corpus of each dialect of each language, collected and recorded in advance, extract the phonetic features specific to each corpus with the MFCC technique; the concrete operations are: decompose the audio into frames and compute the periodogram power spectrum of each frame; then apply a mel filter bank to the power spectrum to compute the filter energies and take their logarithm; apply a DCT to the log energies and retain coefficients 2-13 as the features;
32) extract phonetic features from the actually collected acoustic information with the same MFCC technique, compare them with the corpus features, and determine the most similar corpus according to similarity;
33) decompose the voice information into multiple consecutive segments and match each segment to the corresponding text in the corpus using feature similarity;
34) after integrating all the text, use a cloud Chinese phrase semantic base to analyze the semantic correlation between consecutive words, and compute the semantic correlation between candidate near-homophones and the surrounding words; if the semantic correlation of an existing word is weak, replace it with a near-homophone of higher semantic relevance;
35) aggregate all the text to generate a semantically coherent recognition result;
Step 4) Display: show the converted text on the device screen in the form of subtitles.
CN201610863894.6A 2016-09-29 2016-09-29 A kind of real-time method for generating captions according to audio output Active CN106504754B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610863894.6A CN106504754B (en) 2016-09-29 2016-09-29 A kind of real-time method for generating captions according to audio output

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610863894.6A CN106504754B (en) 2016-09-29 2016-09-29 A kind of real-time method for generating captions according to audio output

Publications (2)

Publication Number Publication Date
CN106504754A CN106504754A (en) 2017-03-15
CN106504754B true CN106504754B (en) 2019-10-18

Family

ID=58291207

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610863894.6A Active CN106504754B (en) 2016-09-29 2016-09-29 A kind of real-time method for generating captions according to audio output

Country Status (1)

Country Link
CN (1) CN106504754B (en)

Families Citing this family (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107301867A (en) * 2017-08-10 2017-10-27 安徽声讯信息技术有限公司 A kind of voice restarts control system
CN107767871B (en) * 2017-10-12 2021-02-02 安徽听见科技有限公司 Text display method, terminal and server
CN108389281A (en) * 2018-03-17 2018-08-10 广东容祺智能科技有限公司 A kind of unmanned plane cruising inspection system with voice record function
US11178465B2 (en) * 2018-10-02 2021-11-16 Harman International Industries, Incorporated System and method for automatic subtitle display
CN109257659A (en) * 2018-11-16 2019-01-22 北京微播视界科技有限公司 Subtitle adding method, device, electronic equipment and computer readable storage medium
CN109600681B (en) * 2018-11-29 2021-05-25 上海华绰信息科技有限公司 Subtitle display method, device, terminal and storage medium
CN112567330A (en) * 2018-11-30 2021-03-26 华为技术有限公司 Voice recognition method, device and system
CN109714608B (en) * 2018-12-18 2023-03-10 深圳壹账通智能科技有限公司 Video data processing method, video data processing device, computer equipment and storage medium
CN109872714A (en) * 2019-01-25 2019-06-11 广州富港万嘉智能科技有限公司 A kind of method, electronic equipment and storage medium improving accuracy of speech recognition
CN113692619A (en) * 2019-05-02 2021-11-23 谷歌有限责任公司 Automatically subtitling audible portions of content on a computing device
CN111968630B (en) * 2019-05-20 2024-03-19 北京字节跳动网络技术有限公司 Information processing method and device and electronic equipment
CN112135197B (en) * 2019-06-24 2022-12-09 腾讯科技(深圳)有限公司 Subtitle display method and device, storage medium and electronic equipment
CN112312181A (en) * 2019-07-26 2021-02-02 深圳Tcl新技术有限公司 Smart television voice recognition method, system and readable storage medium
CN110415706A (en) * 2019-08-08 2019-11-05 常州市小先信息技术有限公司 A kind of technology and its application of superimposed subtitle real-time in video calling
CN110675117A (en) * 2019-09-17 2020-01-10 深圳市天道日新科技有限公司 Criminal trial network remote litigation method
CN111107284B (en) * 2019-12-31 2022-09-06 洛阳乐往网络科技有限公司 Real-time generation system and generation method for video subtitles
CN111768787A (en) * 2020-06-24 2020-10-13 中国人民解放军海军航空大学 Multifunctional auxiliary audio-visual method and system
CN111836062A (en) * 2020-06-30 2020-10-27 北京小米松果电子有限公司 Video playing method and device and computer readable storage medium
CN111787380A (en) * 2020-07-06 2020-10-16 四川长虹网络科技有限责任公司 Voice channel switching control method and device and handheld intelligent terminal
CN112843724B (en) * 2021-01-18 2022-03-22 浙江大学 Game scenario display control method and device, electronic equipment and storage medium
CN113435198A (en) * 2021-07-05 2021-09-24 深圳市鹰硕技术有限公司 Automatic correction display method and device for caption dialect words
CN115880737B (en) * 2021-09-26 2024-04-19 天翼爱音乐文化科技有限公司 Subtitle generation method, system, equipment and medium based on noise reduction self-learning
CN114087725A (en) * 2021-11-16 2022-02-25 珠海格力电器股份有限公司 Method for preventing mistaken awakening of air conditioner by combining WIFI channel state detection

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4736478B2 (en) * 2005-03-07 2011-07-27 日本電気株式会社 Voice transcription support device, method and program thereof
CN103685985A (en) * 2012-09-17 2014-03-26 联想(北京)有限公司 Communication method, transmitting device, receiving device, voice processing equipment and terminal equipment
CN103561217A (en) * 2013-10-14 2014-02-05 深圳创维数字技术股份有限公司 Method and terminal for generating captions
CN105704538A (en) * 2016-03-17 2016-06-22 广东小天才科技有限公司 Method and system for generating audio and video subtitles
CN105845129A (en) * 2016-03-25 2016-08-10 乐视控股(北京)有限公司 Method and system for dividing sentences in audio and automatic caption generation method and system for video files
CN105913845A (en) * 2016-04-26 2016-08-31 惠州Tcl移动通信有限公司 Mobile terminal voice recognition and subtitle generation method and system and mobile terminal

Also Published As

Publication number Publication date
CN106504754A (en) 2017-03-15

Similar Documents

Publication Publication Date Title
CN106504754B (en) A kind of real-time method for generating captions according to audio output
CN103310788B (en) A kind of voice information identification method and system
WO2021208287A1 (en) Voice activity detection method and apparatus for emotion recognition, electronic device, and storage medium
CN102723078B (en) Emotion speech recognition method based on natural language comprehension
CN107945805B (en) A kind of across language voice identification method for transformation of intelligence
CN108399923B (en) More human hairs call the turn spokesman's recognition methods and device
Shaw et al. Emotion recognition and classification in speech using artificial neural networks
CN109215665A (en) A kind of method for recognizing sound-groove based on 3D convolutional neural networks
CN112102850B (en) Emotion recognition processing method and device, medium and electronic equipment
CN107274916A (en) The method and device operated based on voiceprint to audio/video file
Yağanoğlu Real time wearable speech recognition system for deaf persons
CN101794576A (en) Dirty word detection aid and using method thereof
CN104142831B (en) Application program searching method and device
CN106251872A (en) A kind of case input method and system
CN110970036A (en) Voiceprint recognition method and device, computer storage medium and electronic equipment
WO2013052292A9 (en) Waveform analysis of speech
CN106782503A (en) Automatic speech recognition method based on physiologic information in phonation
CN110349565B (en) Auxiliary pronunciation learning method and system for hearing-impaired people
CN105845126A (en) Method for automatic English subtitle filling of English audio image data
CN113782032B (en) Voiceprint recognition method and related device
Chamoli et al. Detection of emotion in analysis of speech using linear predictive coding techniques (LPC)
CN105139866A (en) Nanyin music recognition method and device
CN107251137A (en) Improve method, device and the computer readable recording medium storing program for performing of the set of at least one semantic primitive using voice
Salhi et al. Robustness of auditory teager energy cepstrum coefficients for classification of pathological and normal voices in noisy environments
Bouafif et al. A speech tool software for signal processing applications

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant