CN106504754B - A real-time subtitle generation method based on audio output - Google Patents
A real-time subtitle generation method based on audio output
- Publication number
- CN106504754B (application CN201610863894.6A)
- Authority
- CN
- China
- Prior art keywords
- audio
- voice
- text
- real
- frequency information
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
Abstract
A real-time subtitle generation method based on audio output proceeds as follows. The audio output of an electronic device is monitored and collected in real time by an audio collection module. The collected audio is passed to a voice extraction module, which filters out irrelevant content such as background music and performs noise reduction to obtain a clean voice signal. The voice signal to be converted into text is then input to a speech recognition module, which produces the corresponding text. Finally, a display module shows the converted text on the device screen in real time in the form of subtitles. The advantage of this method is that it helps hearing-impaired people obtain the voice content contained in video, audio or other media, providing them with an effective and convenient way to access voice information, while also offering convenience to ordinary users.
Description
Technical field
The present invention relates to the field of interaction assistance technology for hearing-impaired people, and in particular to a method for automatically generating subtitles in real time from audio.
Background technique
In 2012, the World Health Organization reported that the prevalence of moderate or worse hearing impairment in the world population was 5.3%. Related data also show that 15.84% of the current Chinese population suffers from some degree of hearing impairment; those with disabling hearing impairment, i.e. moderate or worse, account for 5.17% of the total population. With the spread of electronic devices such as PCs and mobile phones, multimedia forms such as video and audio have become important media for obtaining information. For hearing-impaired people, however, obtaining the voice information in multimedia content is very difficult. Text is currently the main way hearing-impaired people obtain information; when a video contains speech but provides no subtitles, they cannot obtain the corresponding information. For example, some news videos contain only a content summary and no full caption track.
For visually impaired users, screen-reading software can convert the text displayed on an electronic device screen into speech in real time, providing an effective way for them to obtain textual content. Hearing-impaired people, however, lack a corresponding tool for converting the speech on a device into text, so the demand for such a tool is urgent. In recent years, speech recognition technology has made marked progress: recognition accuracy keeps improving, the technology has moved from the laboratory to the market, and more and more applications include speech-to-text functions. Methods for displaying subtitles in real time for the speech played on an electronic device (including the voice track of videos), however, remain largely a blank area of application.
Combining existing speech recognition systems to provide real-time subtitles from the audio output of a device would therefore greatly help hearing-impaired people obtain the content of voice information, and better support their life, study and work.
Summary of the invention
To overcome the above shortcomings of the prior art, the present invention proposes a real-time subtitle generation method based on audio output, to help hearing-impaired users obtain, more conveniently and accurately, the text corresponding to the audio output of an electronic device in real time.
The real-time subtitle generation method based on audio output of the present invention comprises the following steps:
1) Audio collection: monitor the audio output of the electronic device in real time and collect it.
2) Voice extraction: process the collected audio, filter out irrelevant content such as background music, and perform noise reduction to obtain a clean voice signal.
3) Speech recognition: once the voice signal to be converted into text is obtained, perform speech recognition on it to obtain the corresponding text.
4) Display: show the converted text on the device screen in the form of subtitles.
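The four steps above can be sketched as a minimal pipeline. All function names, the synthesized test signal, and the dummy recognizer are illustrative assumptions; the patent text does not prescribe an implementation.

```python
import numpy as np

def collect_audio(duration_s=1.0, sr=16000):
    """Step 1 stand-in: instead of capturing the device's real audio output,
    synthesize a 440 Hz tone plus background noise."""
    t = np.arange(int(duration_s * sr)) / sr
    return np.sin(2 * np.pi * 440 * t) + 0.1 * np.random.randn(t.size), sr

def extract_voice(signal, sr):
    """Step 2: keep only the 300 Hz - 3.4 kHz voice band (crude FFT mask)."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(signal.size, 1.0 / sr)
    spectrum[(freqs < 300) | (freqs > 3400)] = 0
    return np.fft.irfft(spectrum, n=signal.size)

def recognize(voice):
    """Step 3 placeholder: a real system would call a speech recognizer."""
    return "recognized text" if np.abs(voice).max() > 0 else ""

def display(text):
    """Step 4: render the text as a subtitle (here, simply format it)."""
    return f"[subtitle] {text}"

signal, sr = collect_audio()
voice = extract_voice(signal, sr)
print(display(recognize(voice)))
```

In a real implementation, `collect_audio` would tap the operating system's audio output and `recognize` would call an actual speech recognition engine; the skeleton only shows how the four stages chain together.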
The audio collection in step 1) is specifically as follows: on an electronic device, any audio file sent to the sound card or to the audio decoder for output may contain voice information. Audio collection therefore consists of monitoring in real time whether there is audio output, and passing the audio signal on for further processing promptly once output is detected.
The voice extraction in step 2) specifically includes:
21) An audio signal is the carrier of the frequency and amplitude variations of regular sound waves such as speech, music and other audio, and the frequency range of the human voice signal is roughly 300 Hz to 3.4 kHz. Since subtitles need to be generated only for voice information, the voice extraction module extracts the voiceprint of the speech in the audio file mainly according to the voice frequency range, for later retrieval of the voice track.
22) Apply a suitable filtering algorithm to the extracted voiceprint for noise reduction, obtaining a cleaner voiceprint and improving recognition accuracy.
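The band-limiting in steps 21)-22) can be sketched with a Butterworth band-pass filter over the 300 Hz - 3.4 kHz voice range. The choice of a Butterworth filter is an assumption; the patent names only "a corresponding filtering algorithm".

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def voice_bandpass(signal, sr, low=300.0, high=3400.0, order=4):
    """Keep only the 300 Hz - 3.4 kHz voice band of the signal."""
    sos = butter(order, [low, high], btype="bandpass", fs=sr, output="sos")
    return sosfiltfilt(sos, signal)  # zero-phase filtering

sr = 16000
t = np.arange(sr) / sr
# a 1 kHz "voice" tone mixed with 50 Hz hum and 6 kHz hiss (both out of band)
mixed = (np.sin(2 * np.pi * 1000 * t)
         + np.sin(2 * np.pi * 50 * t)
         + np.sin(2 * np.pi * 6000 * t))
voice = voice_bandpass(mixed, sr)

def band_amplitude(x, sr, f):
    """Magnitude of the FFT bin at frequency f (1 s signal: bin k = k Hz)."""
    return np.abs(np.fft.rfft(x))[int(f * len(x) / sr)]

# the in-band tone survives; the out-of-band components are strongly attenuated
assert band_amplitude(voice, sr, 1000) > 10 * band_amplitude(voice, sr, 6000)
assert band_amplitude(voice, sr, 1000) > 10 * band_amplitude(voice, sr, 50)
```

The zero-phase `sosfiltfilt` avoids shifting the speech in time, which matters when the text must later be aligned with the playing audio as subtitles.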
The speech recognition in step 3) is specifically as follows: input the obtained voiceprint into the speech recognition module for language identification, feature extraction, retrieval and matching, and perform related processing such as contextual semantic analysis to finally obtain accurate corresponding text.
The display in step 4) is specifically as follows: after the text corresponding to the voice is obtained, display it in real time as subtitles on the user's screen, providing an effective and convenient way for users to read and understand the voice content being played.
The present invention proposes a real-time subtitle generation method based on audio output. Its advantages are: building on existing speech recognition systems, it provides a method for converting audio information into text and displaying it as subtitles; it is applicable to electronic devices such as computers and mobile phones; it removes the barrier that prevents hearing-impaired people from obtaining voice information; and it also offers convenience to ordinary users browsing video, audio and the like.
Detailed description of the invention
Fig. 1 is a flow chart of the method of the present invention.
Specific embodiment
The present invention is further described below with reference to the accompanying drawings:
1. A real-time subtitle generation method based on audio output, implemented in the following steps:
1) Audio collection: monitor the audio output of the electronic device in real time and collect it.
2) Voice extraction: process the collected audio, filter out irrelevant content such as background music, and perform noise reduction to obtain a clean voice signal.
3) Speech recognition: once the voice signal to be converted into text is obtained, perform speech recognition to obtain the corresponding text.
4) Display: show the converted text on the device screen in the form of subtitles.
The audio collection in step 1) is specifically as follows: on an electronic device, any audio file sent to the sound card or to the audio decoder for output may contain voice information. The content of audio collection is to monitor in real time whether there is audio output, and to pass the audio signal on for further processing promptly once output is detected.
The voice extraction in step 2) is specifically:
1) An audio signal is the carrier of the frequency and amplitude variations of regular sound waves such as speech, music and other audio, and the frequency range of the human voice signal is roughly 300 Hz to 3.4 kHz. Since subtitles need to be generated only for voice information, voice extraction consists mainly of extracting the voiceprint of the speech in the audio file according to the voice frequency range, for later retrieval of the voice track.
2) Simulate the masking effect of the human ear with triangular band-pass filters (triangle filters) to perform noise reduction on the extracted voiceprint, obtaining a cleaner voiceprint and improving recognition accuracy.
The speech recognition in step 3) is specifically: input the obtained voiceprint for language identification, feature extraction, retrieval and matching, and perform related processing such as contextual semantic analysis to finally obtain accurate corresponding text. The specific process is:
31) Using a cloud corpus of each dialect of each language, collected and recorded in advance, extract the distinctive phonetic features of the different corpora with MFCC. The concrete operations are: decompose the audio into frames and compute the periodogram power spectrum of each frame; then apply a mel filterbank to the power spectrum and take the logarithm of the filter energies; keep coefficients 2-13 of the DCT of the log energies as the features.
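The MFCC recipe in step 31) — framing, periodogram power spectrum, triangular mel filterbank, log, DCT, coefficients 2-13 — can be sketched as follows. Frame length, hop size, filter count and FFT size are conventional values assumed for illustration; the patent does not fix them.

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr, frame_len=0.025, hop=0.010, n_filters=26, n_fft=512):
    # 1) decompose the audio into overlapping Hamming-windowed frames
    flen, fhop = int(frame_len * sr), int(hop * sr)
    n_frames = 1 + (len(signal) - flen) // fhop
    idx = np.arange(flen)[None, :] + fhop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(flen)
    # 2) periodogram power spectrum of every frame
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # 3) triangular mel filterbank energies, then the logarithm
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        lo, c, hi = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, lo:c] = (np.arange(lo, c) - lo) / max(c - lo, 1)
        fbank[i, c:hi] = (hi - np.arange(c, hi)) / max(hi - c, 1)
    log_e = np.log(power @ fbank.T + 1e-10)
    # 4) DCT of the log energies; keep coefficients 2-13 as the features
    return dct(log_e, type=2, axis=1, norm="ortho")[:, 1:13]

sr = 16000
tone = np.sin(2 * np.pi * 300 * np.arange(sr) / sr)  # 1 s test tone
feats = mfcc(tone, sr)
print(feats.shape)  # (98, 12): 98 frames, 12 coefficients per frame
```

Dropping the first DCT coefficient (overall energy) and keeping coefficients 2-13 is the standard choice that step 31) describes; it makes the features less sensitive to loudness.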
32) Extract phonetic features from the actually collected acoustic information with the same MFCC technique, compare them with the corpus features, and determine the most similar corpus entry according to similarity.
33) Decompose the voice information into multiple consecutive fragments and match each fragment to the corresponding text in the corpus using feature similarity.
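The fragment-to-text matching in steps 32)-33) can be sketched as a nearest-neighbour lookup by feature similarity. The toy corpus vectors, their text labels, and the use of cosine similarity are assumptions for illustration; the patent specifies only "feature similarity".

```python
import numpy as np

def cosine_sim(a, b):
    """Similarity between two feature vectors, in [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "corpus": mean feature vectors labelled with their text (invented data;
# a real corpus would hold MFCC features of recorded speech per dialect)
corpus = {
    "hello": np.array([1.0, 0.2, 0.1]),
    "world": np.array([0.1, 1.0, 0.3]),
}

def match_fragment(features):
    """Return the corpus text whose features are most similar to the fragment."""
    return max(corpus, key=lambda text: cosine_sim(features, corpus[text]))

fragment = np.array([0.9, 0.25, 0.12])   # features close to the "hello" entry
assert match_fragment(fragment) == "hello"
```

A production system would index far larger feature sets (and typically use dynamic time warping or an acoustic model rather than a single vector per entry), but the lookup structure is the same.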
34) After integrating all the text, use a cloud Chinese phrase semantic base to analyze the semantic relatedness of consecutive words, and compute the semantic relatedness between candidate words whose phonetic features are similar to the words before and after. If the semantic relatedness between the current words is weak, replace them with near-homophones of higher semantic relatedness.
35) Summarize and integrate all the text to generate a semantically coherent recognition result.
The display in step 4) is specifically: after the text corresponding to the voice is obtained, display it in real time as subtitles on the user's screen, providing an effective and convenient way for users to read and understand the voice content being played.
The embodiments described in this specification merely illustrate forms of realizing the inventive concept. The protection scope of the present invention should not be construed as limited to the specific forms stated in the embodiments; it also covers equivalent technical means that those skilled in the art can conceive according to the inventive concept.
Claims (1)
1. A real-time subtitle generation method based on audio output, characterized in that:
Step 1) Audio collection: monitor the audio output of the electronic device in real time and collect it; audio collection specifically consists of monitoring in real time whether there is audio output, and passing the audio signal on for further processing promptly once output is detected;
Step 2) Voice extraction: process the collected audio, filter out irrelevant content such as background music, and perform noise reduction to obtain a clean voice signal; voice extraction specifically includes:
21) extracting the voiceprint of the speech in the audio file mainly according to the voice frequency range, for later retrieval of the voice track;
22) applying a suitable filtering algorithm to the extracted voiceprint for noise reduction, obtaining a cleaner voiceprint and improving recognition accuracy;
Step 3) Speech recognition: once the voice information to be converted into text is obtained, perform speech recognition on it to obtain the corresponding text; speech recognition specifically includes: inputting the obtained voiceprint into the speech recognition module for language identification, feature extraction, retrieval and matching, and performing contextual semantic analysis to finally obtain accurate corresponding text; specifically:
31) using a cloud corpus of each dialect of each language, collected and recorded in advance, extract the distinctive phonetic features of the different corpora with MFCC; the concrete operations are: decompose the audio into frames and compute the periodogram power spectrum of each frame; then apply a mel filterbank to the power spectrum and take the logarithm of the filter energies; keep coefficients 2-13 of the DCT of the log energies as the features;
32) extract phonetic features from the actually collected acoustic information with the same MFCC technique, compare them with the corpus features, and determine the most similar corpus entry according to similarity;
33) decompose the voice information into multiple consecutive fragments and match each fragment to the corresponding text in the corpus using feature similarity;
34) after integrating all the text, use a cloud Chinese phrase semantic base to analyze the semantic relatedness of consecutive words, and compute the semantic relatedness between candidate words whose phonetic features are similar; if the semantic relatedness between the current words is weak, replace them with near-homophones of higher semantic relatedness;
35) summarize and integrate all the text to generate a semantically coherent recognition result;
Step 4) Display: show the converted text on the device screen in the form of subtitles.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610863894.6A CN106504754B (en) | 2016-09-29 | 2016-09-29 | A kind of real-time method for generating captions according to audio output |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610863894.6A CN106504754B (en) | 2016-09-29 | 2016-09-29 | A kind of real-time method for generating captions according to audio output |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106504754A CN106504754A (en) | 2017-03-15 |
CN106504754B true CN106504754B (en) | 2019-10-18 |
Family
ID=58291207
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610863894.6A Active CN106504754B (en) | 2016-09-29 | 2016-09-29 | A kind of real-time method for generating captions according to audio output |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106504754B (en) |
Families Citing this family (23)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107301867A (en) * | 2017-08-10 | 2017-10-27 | 安徽声讯信息技术有限公司 | A kind of voice restarts control system |
CN107767871B (en) * | 2017-10-12 | 2021-02-02 | 安徽听见科技有限公司 | Text display method, terminal and server |
CN108389281A (en) * | 2018-03-17 | 2018-08-10 | 广东容祺智能科技有限公司 | A kind of unmanned plane cruising inspection system with voice record function |
US11178465B2 (en) * | 2018-10-02 | 2021-11-16 | Harman International Industries, Incorporated | System and method for automatic subtitle display |
CN109257659A (en) * | 2018-11-16 | 2019-01-22 | 北京微播视界科技有限公司 | Subtitle adding method, device, electronic equipment and computer readable storage medium |
CN109600681B (en) * | 2018-11-29 | 2021-05-25 | 上海华绰信息科技有限公司 | Subtitle display method, device, terminal and storage medium |
CN112567330A (en) * | 2018-11-30 | 2021-03-26 | 华为技术有限公司 | Voice recognition method, device and system |
CN109714608B (en) * | 2018-12-18 | 2023-03-10 | 深圳壹账通智能科技有限公司 | Video data processing method, video data processing device, computer equipment and storage medium |
CN109872714A (en) * | 2019-01-25 | 2019-06-11 | 广州富港万嘉智能科技有限公司 | A kind of method, electronic equipment and storage medium improving accuracy of speech recognition |
CN113692619A (en) * | 2019-05-02 | 2021-11-23 | 谷歌有限责任公司 | Automatically subtitling audible portions of content on a computing device |
CN111968630B (en) * | 2019-05-20 | 2024-03-19 | 北京字节跳动网络技术有限公司 | Information processing method and device and electronic equipment |
CN112135197B (en) * | 2019-06-24 | 2022-12-09 | 腾讯科技(深圳)有限公司 | Subtitle display method and device, storage medium and electronic equipment |
CN112312181A (en) * | 2019-07-26 | 2021-02-02 | 深圳Tcl新技术有限公司 | Smart television voice recognition method, system and readable storage medium |
CN110415706A (en) * | 2019-08-08 | 2019-11-05 | 常州市小先信息技术有限公司 | A kind of technology and its application of superimposed subtitle real-time in video calling |
CN110675117A (en) * | 2019-09-17 | 2020-01-10 | 深圳市天道日新科技有限公司 | Criminal trial network remote litigation method |
CN111107284B (en) * | 2019-12-31 | 2022-09-06 | 洛阳乐往网络科技有限公司 | Real-time generation system and generation method for video subtitles |
CN111768787A (en) * | 2020-06-24 | 2020-10-13 | 中国人民解放军海军航空大学 | Multifunctional auxiliary audio-visual method and system |
CN111836062A (en) * | 2020-06-30 | 2020-10-27 | 北京小米松果电子有限公司 | Video playing method and device and computer readable storage medium |
CN111787380A (en) * | 2020-07-06 | 2020-10-16 | 四川长虹网络科技有限责任公司 | Voice channel switching control method and device and handheld intelligent terminal |
CN112843724B (en) * | 2021-01-18 | 2022-03-22 | 浙江大学 | Game scenario display control method and device, electronic equipment and storage medium |
CN113435198A (en) * | 2021-07-05 | 2021-09-24 | 深圳市鹰硕技术有限公司 | Automatic correction display method and device for caption dialect words |
CN115880737B (en) * | 2021-09-26 | 2024-04-19 | 天翼爱音乐文化科技有限公司 | Subtitle generation method, system, equipment and medium based on noise reduction self-learning |
CN114087725A (en) * | 2021-11-16 | 2022-02-25 | 珠海格力电器股份有限公司 | Method for preventing mistaken awakening of air conditioner by combining WIFI channel state detection |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP4736478B2 (en) * | 2005-03-07 | 2011-07-27 | 日本電気株式会社 | Voice transcription support device, method and program thereof |
CN103685985A (en) * | 2012-09-17 | 2014-03-26 | 联想(北京)有限公司 | Communication method, transmitting device, receiving device, voice processing equipment and terminal equipment |
CN103561217A (en) * | 2013-10-14 | 2014-02-05 | 深圳创维数字技术股份有限公司 | Method and terminal for generating captions |
CN105704538A (en) * | 2016-03-17 | 2016-06-22 | 广东小天才科技有限公司 | Method and system for generating audio and video subtitles |
CN105845129A (en) * | 2016-03-25 | 2016-08-10 | 乐视控股(北京)有限公司 | Method and system for dividing sentences in audio and automatic caption generation method and system for video files |
CN105913845A (en) * | 2016-04-26 | 2016-08-31 | 惠州Tcl移动通信有限公司 | Mobile terminal voice recognition and subtitle generation method and system and mobile terminal |
-
2016
- 2016-09-29 CN CN201610863894.6A patent/CN106504754B/en active Active
Also Published As
Publication number | Publication date |
---|---|
CN106504754A (en) | 2017-03-15 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106504754B (en) | A kind of real-time method for generating captions according to audio output | |
CN103310788B (en) | A kind of voice information identification method and system | |
WO2021208287A1 (en) | Voice activity detection method and apparatus for emotion recognition, electronic device, and storage medium | |
CN102723078B (en) | Emotion speech recognition method based on natural language comprehension | |
CN107945805B (en) | A kind of across language voice identification method for transformation of intelligence | |
CN108399923B (en) | More human hairs call the turn spokesman's recognition methods and device | |
Shaw et al. | Emotion recognition and classification in speech using artificial neural networks | |
CN109215665A (en) | A kind of method for recognizing sound-groove based on 3D convolutional neural networks | |
CN112102850B (en) | Emotion recognition processing method and device, medium and electronic equipment | |
CN107274916A (en) | The method and device operated based on voiceprint to audio/video file | |
Yağanoğlu | Real time wearable speech recognition system for deaf persons | |
CN101794576A (en) | Dirty word detection aid and using method thereof | |
CN104142831B (en) | Application program searching method and device | |
CN106251872A (en) | A kind of case input method and system | |
CN110970036A (en) | Voiceprint recognition method and device, computer storage medium and electronic equipment | |
WO2013052292A9 (en) | Waveform analysis of speech | |
CN106782503A (en) | Automatic speech recognition method based on physiologic information in phonation | |
CN110349565B (en) | Auxiliary pronunciation learning method and system for hearing-impaired people | |
CN105845126A (en) | Method for automatic English subtitle filling of English audio image data | |
CN113782032B (en) | Voiceprint recognition method and related device | |
Chamoli et al. | Detection of emotion in analysis of speech using linear predictive coding techniques (LPC) | |
CN105139866A (en) | Nanyin music recognition method and device | |
CN107251137A (en) | Improve method, device and the computer readable recording medium storing program for performing of the set of at least one semantic primitive using voice | |
Salhi et al. | Robustness of auditory teager energy cepstrum coefficients for classification of pathological and normal voices in noisy environments | |
Bouafif et al. | A speech tool software for signal processing applications |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |