WO2009132871A1 - Procédé et système de conversion de la parole en texte - Google Patents
Procédé et système de conversion de la parole en texte
- Publication number
- WO2009132871A1 (PCT/EP2009/052092; EP2009052092W)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- audio signal
- digital audio
- text
- text data
- markers
- Prior art date
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/47—End-user applications
- H04N21/488—Data services, e.g. news ticker
- H04N21/4884—Data services, e.g. news ticker for displaying subtitles
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/103—Formatting, i.e. changing of presentation of documents
- G06F40/117—Tagging; Marking up; Designating a block; Setting of attributes
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/44—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
- H04N21/4402—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display
- H04N21/440236—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display by media transcoding, e.g. video is transformed into a slideshow of still pictures, audio is converted into text
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N7/00—Television systems
- H04N7/08—Systems for the simultaneous or sequential transmission of more than one television signal, e.g. additional information signals, the signals occupying wholly or partially the same frequency band, e.g. by time division
- H04N7/087—Systems for the simultaneous or sequential transmission of more than one television signal, e.g. additional information signals, the signals occupying wholly or partially the same frequency band, e.g. by time division with signal insertion during the vertical blanking interval only
- H04N7/088—Systems for the simultaneous or sequential transmission of more than one television signal, e.g. additional information signals, the signals occupying wholly or partially the same frequency band, e.g. by time division with signal insertion during the vertical blanking interval only the inserted signal being digital
- H04N7/0884—Systems for the simultaneous or sequential transmission of more than one television signal, e.g. additional information signals, the signals occupying wholly or partially the same frequency band, e.g. by time division with signal insertion during the vertical blanking interval only the inserted signal being digital for the transmission of additional display-information, e.g. menu for programme or channel selection
- H04N7/0885—Systems for the simultaneous or sequential transmission of more than one television signal, e.g. additional information signals, the signals occupying wholly or partially the same frequency band, e.g. by time division with signal insertion during the vertical blanking interval only the inserted signal being digital for the transmission of additional display-information, e.g. menu for programme or channel selection for the transmission of subtitles
Definitions
- The present invention relates to a method for converting speech into text, and in particular to a method which can be employed for generating live subtitles in television broadcasts.
- The present invention also relates to a system for carrying out such a method.
- Known systems for converting speech into text comprise a sampler module which converts an analog audio signal of the speech into a digital audio signal, as well as a voice recognition module which converts the digital audio signal into text data.
- Such systems have some disadvantages when the speech is produced by a speaker, generally called a respeaker, to create in real time television subtitles comprising the text data.
- Each word not contained in the system dictionary must be added manually and trained by the speaker, who pronounces it one or more times so that the system can associate it with the corresponding phonemes.
- This operation can be carried out only in advance, namely not during the normal dictation process, so that if during a broadcast the speaker has to pronounce a new word several times, the system will never interpret it correctly.
- Known systems also convert the speech into text with a certain delay, since they use the context of the dictated sentence to eliminate the ambiguities which inevitably arise during phoneme processing; as a result they generate text data only when the speaker pauses in the dictation, which is rather rare when he tries to follow a broadcast in real time.
- The method and the system according to the present invention allow the desired commands to be inserted automatically into the speech without the speaker being forced to pronounce them, thus also avoiding the training phase for new words.
- Commands can comprise one or more text characters, in particular symbols, characters, words and/or sentences, and/or text formatting commands, in particular colors, sizes and/or fonts.
- The association of the markers with the commands can be modified in real time by a supervisor according to the topic of the speech, without modifying or training new markers.
- The only training, to be carried out once for each speaker, is that required for acquiring the phonemes used as markers.
- The commands associated with the markers inserted into the digital audio signal are compared with the commands associated with the markers found in the text data, allowing possible recognition errors of those markers to be detected.
- Figure 1 shows a first block diagram of the system;
- Figure 2 shows a scheme of the insertion of a marker;
- Figure 3 shows a scheme of the correction of a marker series;
- Figure 4 shows a second block diagram of the system.
- The system according to the present invention comprises, in a known way, at least one sampler module SM which converts an analog audio signal AA into a digital audio signal DS.
- Analog audio signal AA is a speech S of a first speaker S1 picked up by at least one transducer, in particular a microphone MIC.
- Analog audio signal AA can be processed by an audio processor AP, for example comprising equalization, gate and compression stages, before it is sampled by sampler module SM.
- Digital audio signal DS contains at least one sampled waveform SW substantially corresponding to speech S and is transmitted to a voice recognition module VRM which converts digital audio signal DS into a dictated text D substantially corresponding to speech S.
- The system also comprises an audio editor AE suitable for automatically inserting into digital audio signal DS at least one marker Mx comprising a digital waveform stored in at least one digital table DT, which comprises one or more markers M1...Mn associated with one or more commands C1...Cn and with one or more labels L1...Ln.
- Markers M1...Mn comprise one or more phonemes pronounced by first speaker S1 and sampled in advance, for example through the same sampler module SM.
- An input/output interface IO shows first speaker S1 the labels L1...Ln associated with markers M1...Mn.
- First speaker S1 can select the markers M1...Mn to be inserted into digital audio signal DS by pressing buttons associated with labels L1...Ln.
- The input/output interface IO is a touchscreen which shows labels L1...Ln, and a label can be selected by touching the area of the touchscreen which shows it.
- Input/output interface IO can comprise a display, a keyboard, a mouse and/or other input/output devices.
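As a rough illustration of the structure described above, the following Python sketch models the digital table DT as a mapping from marker identifiers to the label, pre-sampled waveform and command they carry; all names and values are invented for the example and do not come from the patent.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class MarkerEntry:
    """One entry of the digital table DT: a label shown to the speaker,
    a pre-sampled phoneme waveform and the associated command."""
    label: str             # label Lx shown on input/output interface IO
    waveform: List[float]  # digital waveform of the pre-recorded phonemes
    command: str           # command Cx: text characters or a formatting tag

# Hypothetical digital table DT, keyed by marker identifier.
digital_table: Dict[str, MarkerEntry] = {
    "M1": MarkerEntry("new sentence", [0.10, -0.20, 0.05], "\n- "),
    "M2": MarkerEntry("red subtitle", [0.20, 0.10, -0.10], "<color=red>"),
}

def on_label_pressed(marker_id: str) -> MarkerEntry:
    """Called when the speaker touches the button carrying label Lx:
    returns the entry whose waveform will be spliced into signal DS."""
    return digital_table[marker_id]
```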
- The marker Mx corresponding to the selected label Lx is immediately inserted into digital audio signal DS by audio editor AE.
- The latter comprises an audio buffer which temporarily stores and shifts forward the rest of sampled waveform SW, so as to make up for the portion of speech S corresponding to the duration of marker Mx.
- Audio editor AE can cancel possible pauses from digital audio signal DS and/or can digitally accelerate digital audio signal DS without varying the pitch of speech S.
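The buffering behaviour of audio editor AE can be pictured with the sketch below; the list-based buffer and the method names are assumptions, and real-time details such as pause removal or pitch-preserving acceleration are only hinted at in comments.

```python
from collections import deque
from typing import Deque, List

class AudioEditor:
    """Minimal sketch of audio editor AE: incoming samples of sampled
    waveform SW are buffered, and a marker waveform can be spliced in at
    the current position, shifting the buffered speech forward."""

    def __init__(self) -> None:
        self.buffer: Deque[float] = deque()

    def feed_speech(self, samples: List[float]) -> None:
        # Rest of sampled waveform SW arriving from sampler module SM.
        self.buffer.extend(samples)

    def insert_marker(self, marker_waveform: List[float]) -> None:
        # The marker is emitted first; the buffered speech is thereby
        # delayed (shifted forward) by len(marker_waveform) samples.
        self.buffer.extendleft(reversed(marker_waveform))

    def next_block(self, size: int) -> List[float]:
        # Samples handed to voice recognition module VRM.  A real editor
        # could also drop pauses or time-compress the speech here to
        # recover the delay introduced by the markers.
        n = min(size, len(self.buffer))
        return [self.buffer.popleft() for _ in range(n)]
```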
- Digital audio signal DS comprising sampled waveform SW and marker Mx is then processed by voice recognition module VRM, which converts digital audio signal DS into text data TD including dictated text D and marker Mx converted into the corresponding phonemes and inserted into dictated text D.
- A text converter TC converts the text of the phonemes corresponding to marker Mx into the command Cx associated with marker Mx in digital table DT.
- Command Cx can consist of one or more text characters, in particular symbols, characters, words and/or sentences, and/or text formatting commands, in particular colors, sizes and/or fonts.
- Text data TD generated by text converter TC then comprise command Cx included in dictated text D.
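A minimal sketch of the text converter TC step, assuming the voice recognition module returns plain text in which the marker phonemes appear as recognizable words; the phoneme spellings used here are invented.

```python
from typing import Dict

# Hypothetical mapping from the text that the recognizer produces for the
# marker phonemes to the command Cx stored in digital table DT; the
# phoneme spellings are invented for this example.
MARKER_TEXT_TO_COMMAND: Dict[str, str] = {
    "kro kro": "<color=red>",
    "tip tip": "\n- ",
}

def convert_markers(dictated_text: str) -> str:
    """Replace every recognized marker with its associated command,
    producing text data TD in which commands are embedded in the dictation."""
    text_data = dictated_text
    for phoneme_text, command in MARKER_TEXT_TO_COMMAND.items():
        text_data = text_data.replace(phoneme_text, command)
    return text_data

# Example: convert_markers("kro kro breaking news from the studio")
# returns "<color=red> breaking news from the studio".
```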
- First speaker S1 can insert a plurality of markers Mx...My into various points of sampled waveform SW in digital audio signal DS, in which case text data TD generated by text converter TC comprise a plurality of commands Cx...Cy included in the same points of the corresponding dictated text D.
- When first speaker S1 selects with input/output interface IO the labels Lx...Ly corresponding to commands Cx...Cy and to markers Mx...My, the selected commands Cx...Cy are also stored in a digital memory DM, so that if a marker Mx...My is not correctly recognized, text converter TC can anyway compare in digital memory DM the sequence of commands Cx...Cy which have been selected with the commands Cx...Cy associated with the markers Mx...My converted into text data TD, so as to obtain text data TD which include these commands Cx...Cy in their correct sequence.
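One way to picture the comparison with digital memory DM is the sketch below: the sequence of commands actually selected by the speaker is taken as the reference, and any command mis-recognized in the text data is corrected to follow that sequence. This is an illustration of the idea, not the patent's own algorithm.

```python
from typing import List

def reconcile_commands(selected: List[str], recognized: List[str]) -> List[str]:
    """Compare the commands stored in digital memory DM when the speaker
    pressed the buttons (selected) with the commands recovered from text
    data TD (recognized); on any mismatch the selected sequence is trusted,
    so the output always carries the commands in their correct order."""
    corrected: List[str] = []
    for i, rec in enumerate(recognized):
        if i < len(selected) and rec != selected[i]:
            corrected.append(selected[i])  # likely recognition error
        else:
            corrected.append(rec)
    # Commands that were selected but never recognized are appended so the
    # full selected sequence is still reflected in the output.
    corrected.extend(selected[len(recognized):])
    return corrected
```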
- Input/output interface IO, sampler module SM and/or digital table DT, as well as digital memory DM, are components and/or peripherals, also of a known kind, of a client computer CC, while audio editor AE, voice recognition module VRM and/or text converter TC, as well as audio processor AP, are programs, also of a known kind, suitable to be executed by client computer CC.
- A plurality of speakers S1...Sm, each provided with a client computer CC1...CCm, can generate with the above mentioned method one or more text data sequences TD11...TD1p...TDm1...TDmq, which are sent through a data network to at least one server computer SC, which combines such sequences in an automatic and/or manual manner for generating at least one text T to be sent to a text generator TG, for example for being displayed in a television broadcast.
- Text T can also contain other text data TDx...TDy which can be created with a method different from the one described above.
- A supervisor SV can manually process the contents and/or the order of text data TD11...TD1p...TDm1...TDmq...TDx...TDy.
- The sequences of text data TD11...TD1p...TDm1...TDmq...TDx...TDy can also be automatically ordered by server computer SC by inserting the first available text data as soon as a pause longer than a determined threshold value is detected in the sequence of the text data which are employed at that time for generating text T.
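The pause-based ordering performed by server computer SC could look roughly like the following sketch, in which each sequence of text data is modelled as a queue and the threshold value is an assumed parameter.

```python
from dataclasses import dataclass
from queue import Queue, Empty

@dataclass
class TextData:
    speaker: str
    text: str

def combine_sequences(active: "Queue[TextData]",
                      waiting: "Queue[TextData]",
                      pause_threshold: float) -> str:
    """Build text T from the currently active sequence of text data; when
    that sequence pauses for longer than pause_threshold seconds, the first
    available text data from another sequence is inserted instead."""
    parts = []
    while True:
        try:
            item = active.get(timeout=pause_threshold)
        except Empty:
            # Pause longer than the threshold: take the first available
            # text data from the other sequences, or stop if there is none.
            try:
                item = waiting.get_nowait()
            except Empty:
                break
        parts.append(item.text)
    return " ".join(parts)
```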
- Supervisor SV can also process with server computer SC, and send through the same data network to client computers CC1...CCm, one or more digital tables DT1...DTz in which markers M1...Mx are associated with particular labels L1...Lx and commands C1...Cx relating to the topic (for example politics, sports, economy, news, etc.) dealt with by speakers S1...Sm, so as to update in real time the commands C1...Cx associated with markers M1...Mx and usable by speakers S1...Sm during the conversion of analog audio signal AA into digital audio signal DS.
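Updating the markers' commands in real time amounts to pushing a new digital table to every client over the data network; the sketch below uses JSON for the payload, a format and table layout chosen purely for illustration.

```python
import json
from typing import Dict, Iterable

# Hypothetical topic-specific digital tables DT1...DTz prepared by the
# supervisor SV: marker identifier -> label and command for that topic.
TOPIC_TABLES: Dict[str, Dict[str, Dict[str, str]]] = {
    "politics": {"M1": {"label": "minister", "command": "the Minister"},
                 "M2": {"label": "parliament", "command": "Parliament"}},
    "sports":   {"M1": {"label": "home team", "command": "the home team"},
                 "M2": {"label": "penalty", "command": "penalty kick"}},
}

def push_table(clients: Iterable, topic: str) -> None:
    """Serialize the table for the current topic and send it to every
    client computer CC1...CCm, updating in real time the commands that
    the speakers can insert through their markers."""
    payload = json.dumps(TOPIC_TABLES[topic]).encode("utf-8")
    for client in clients:
        client.send(payload)  # each client is assumed to expose send()
```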
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Health & Medical Sciences (AREA)
- Signal Processing (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Human Computer Interaction (AREA)
- Acoustics & Sound (AREA)
- Document Processing Apparatus (AREA)
- Machine Translation (AREA)
Abstract
A method for converting speech (S) into text (T), comprising the operating steps of: converting an analog audio signal (AA) of the speech (S) into a digital audio signal (DS); converting the digital audio signal (DS) into text data (TD); wherein one or more markers (Mx...My) consisting of a digital waveform are inserted into the digital audio signal (DS) before the step of converting the digital audio signal (DS) into text data (TD), and the markers (Mx...My) are converted into one or more commands (Cx...Cy) in the text data (TD) after the step of converting the digital audio signal (DS) into text data (TD). A system for carrying out this method is also disclosed.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP09737915A EP2283481A1 (fr) | 2008-04-30 | 2009-02-20 | Procédé et système de conversion de la parole en texte |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
IT000794A ITMI20080794A1 (it) | 2008-04-30 | 2008-04-30 | Metodo e sistema per convertire parlato in testo |
ITMI2008A000794 | 2008-04-30 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2009132871A1 true WO2009132871A1 (fr) | 2009-11-05 |
Family
ID=40297044
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/EP2009/052092 WO2009132871A1 (fr) | 2008-04-30 | 2009-02-20 | Procédé et système de conversion de la parole en texte |
Country Status (3)
Country | Link |
---|---|
EP (1) | EP2283481A1 (fr) |
IT (1) | ITMI20080794A1 (fr) |
WO (1) | WO2009132871A1 (fr) |
-
2008
- 2008-04-30 IT IT000794A patent/ITMI20080794A1/it unknown
-
2009
- 2009-02-20 EP EP09737915A patent/EP2283481A1/fr not_active Withdrawn
- 2009-02-20 WO PCT/EP2009/052092 patent/WO2009132871A1/fr active Application Filing
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5960447A (en) | 1995-11-13 | 1999-09-28 | Holt; Douglas | Word tagging and editing system for speech recognition |
US20050086705A1 (en) * | 2003-08-26 | 2005-04-21 | Jarman Matthew T. | Method and apparatus for controlling play of an audio signal |
WO2005116992A1 (fr) * | 2004-05-27 | 2005-12-08 | Koninklijke Philips Electronics N.V. | Procede et systeme pour modifier des messages |
US20070256016A1 (en) * | 2006-04-26 | 2007-11-01 | Bedingfield James C Sr | Methods, systems, and computer program products for managing video information |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120316882A1 (en) * | 2011-06-10 | 2012-12-13 | Morgan Fiumi | System for generating captions for live video broadcasts |
US8532469B2 (en) | 2011-06-10 | 2013-09-10 | Morgan Fiumi | Distributed digital video processing system |
US8749618B2 (en) | 2011-06-10 | 2014-06-10 | Morgan Fiumi | Distributed three-dimensional video conversion system |
US9026446B2 (en) * | 2011-06-10 | 2015-05-05 | Morgan Fiumi | System for generating captions for live video broadcasts |
Also Published As
Publication number | Publication date |
---|---|
EP2283481A1 (fr) | 2011-02-16 |
ITMI20080794A1 (it) | 2009-11-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US6999933B2 (en) | Editing during synchronous playback | |
US8706495B2 (en) | Synchronise an audio cursor and a text cursor during editing | |
JP6633153B2 (ja) | 情報を抽出する方法及び装置 | |
JP5787780B2 (ja) | 書き起こし支援システムおよび書き起こし支援方法 | |
US20060184369A1 (en) | Voice activated instruction manual | |
EP1611570B1 (fr) | Systeme de correction des resultats de la reconnaissance de la parole a indication du niveau de confiance | |
JP2011191922A (ja) | 翻訳装置、翻訳方法及びコンピュータプログラム | |
US10304457B2 (en) | Transcription support system and transcription support method | |
US20100256972A1 (en) | Automatic simultaneous interpertation system | |
WO2008114453A1 (fr) | Appareil de synthèse vocale, système de synthèse vocale, appareil de traitement du langage, procédé de synthèse vocale et programme informatique | |
JP2012181358A (ja) | テキスト表示時間決定装置、テキスト表示システム、方法およびプログラム | |
US20190005950A1 (en) | Intention estimation device and intention estimation method | |
CN113225612A (zh) | 字幕生成方法、装置、计算机可读存储介质及电子设备 | |
JP4436087B2 (ja) | 文字データ修正装置、文字データ修正方法および文字データ修正プログラム | |
WO2009132871A1 (fr) | Procédé et système de conversion de la parole en texte | |
CN113409761B (zh) | 语音合成方法、装置、电子设备以及计算机可读存储介质 | |
US20120078629A1 (en) | Meeting support apparatus, method and program | |
KR101990019B1 (ko) | 하이브리드 자막 효과 구현 단말 및 방법 | |
CN108682423A (zh) | 一种语音识别方法和装置 | |
JP5818753B2 (ja) | 音声対話システム及び音声対話方法 | |
JP2001282779A (ja) | 電子化テキスト作成システム | |
JP2008243076A (ja) | 翻訳装置、方法及びプログラム | |
JP6387044B2 (ja) | テキスト処理装置、テキスト処理方法およびテキスト処理プログラム | |
JP2004287756A (ja) | 電子メール作成装置及び電子メール作成方法 | |
JPH10136260A (ja) | 字幕スーパー・タイミング発生装置および方法ならびに字幕スーパー処理装置および方法 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 09737915 Country of ref document: EP Kind code of ref document: A1 |
|
WWE | Wipo information: entry into national phase |
Ref document number: 2009737915 Country of ref document: EP |
|
NENP | Non-entry into the national phase |
Ref country code: DE |