WO2009132871A1 - Method and system for converting speech into text - Google Patents

Method and system for converting speech into text

Info

Publication number
WO2009132871A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio signal
digital audio
text
text data
markers
Prior art date
Application number
PCT/EP2009/052092
Other languages
English (en)
French (fr)
Inventor
Giacomo Olgeni
Mattia Scaricabarozzi
Original Assignee
Colby S.R.L.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Colby S.R.L. filed Critical Colby S.R.L.
Priority to EP09737915A priority Critical patent/EP2283481A1/en
Publication of WO2009132871A1 publication Critical patent/WO2009132871A1/en

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • H04N21/488Data services, e.g. news ticker
    • H04N21/4884Data services, e.g. news ticker for displaying subtitles
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/103Formatting, i.e. changing of presentation of documents
    • G06F40/117Tagging; Marking up; Designating a block; Setting of attributes
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/26Speech to text systems
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/4402Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display
    • H04N21/440236Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display by media transcoding, e.g. video is transformed into a slideshow of still pictures, audio is converted into text
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00Television systems
    • H04N7/08Systems for the simultaneous or sequential transmission of more than one television signal, e.g. additional information signals, the signals occupying wholly or partially the same frequency band, e.g. by time division
    • H04N7/087Systems for the simultaneous or sequential transmission of more than one television signal, e.g. additional information signals, the signals occupying wholly or partially the same frequency band, e.g. by time division with signal insertion during the vertical blanking interval only
    • H04N7/088Systems for the simultaneous or sequential transmission of more than one television signal, e.g. additional information signals, the signals occupying wholly or partially the same frequency band, e.g. by time division with signal insertion during the vertical blanking interval only the inserted signal being digital
    • H04N7/0884Systems for the simultaneous or sequential transmission of more than one television signal, e.g. additional information signals, the signals occupying wholly or partially the same frequency band, e.g. by time division with signal insertion during the vertical blanking interval only the inserted signal being digital for the transmission of additional display-information, e.g. menu for programme or channel selection
    • H04N7/0885Systems for the simultaneous or sequential transmission of more than one television signal, e.g. additional information signals, the signals occupying wholly or partially the same frequency band, e.g. by time division with signal insertion during the vertical blanking interval only the inserted signal being digital for the transmission of additional display-information, e.g. menu for programme or channel selection for the transmission of subtitles

Definitions

  • The present invention relates to a method for converting speech into text, and in particular a method which can be employed for generating live subtitles in television transmissions.
  • The present invention also relates to a system for carrying out such a method.
  • Known systems for converting speech into text comprise a sampler module, which converts an analog audio signal of a speech into a digital audio signal, and a voice recognition module, which converts the digital audio signal into text data.
  • Such systems have some disadvantages when the speech is generated by a speaker, generally called a respeaker, for creating in real time television subtitles comprising the text data.
  • Each word not contained in the system dictionary must be added manually and trained by the speaker, who pronounces it one or more times so that the system can associate it with the corresponding phonemes.
  • This operation can be carried out only in advance, namely not during the normal dictation process, so that if the speaker has to pronounce a new word several times during a transmission, the system can never interpret it correctly.
  • Moreover, known systems convert the speech into text with a certain delay, since they use the context of the dictated sentence to resolve the ambiguities which inevitably arise during phoneme processing; as a result, they generate text data only when the speaker pauses in the dictation, which is quite rare when trying to follow a transmission in real time.
  • The method and the system according to the present invention allow the desired commands to be inserted into the speech automatically, without the speaker being forced to pronounce them, thus also avoiding the training phase for new words.
  • Commands can comprise one or more text characters, in particular symbols, characters, words and/or sentences, and/or text formatting commands, in particular colors, sizes and/or fonts.
  • The association of the markers with the commands can be modified in real time by a supervisor according to the topic of the speech, without modifying or training new markers.
  • The only training, to be carried out only once for each speaker, is that required for acquiring the phonemes used as markers.
  • The commands associated with the markers inserted into the digital audio signal are compared with the commands associated with the markers found in the text data, so as to allow the detection of possible recognition errors of those markers.
  • Figure 1 shows a first block scheme of the system;
  • Figure 2 shows a scheme of the insertion of a marker;
  • Figure 3 shows a scheme of the correction of a marker series;
  • Figure 4 shows a second block scheme of the system.
  • The system according to the present invention comprises, in a known way, at least one sampler module SM which converts an analog audio signal AA into a digital audio signal DS.
  • Analog audio signal AA is a speech S of a first speaker S1 picked up by at least one transducer, in particular a microphone MIC.
  • Analog audio signal AA can be processed by an audio processor AP, for example comprising equalization, gate and compression stages, before it is sampled by sampler module SM.
  • Digital audio signal DS contains at least one sampled waveform SW substantially corresponding to speech S and is transmitted to a voice recognition module VRM, which converts digital audio signal DS into a dictated text D substantially corresponding to speech S.
  • The system also comprises an audio editor AE suitable for automatically inserting into digital audio signal DS at least one marker Mx, comprising a digital waveform stored in at least one digital table DT, which comprises one or more markers M1...Mn associated with one or more commands C1...Cn and with one or more labels L1...Ln.
  • Markers M1...Mn comprise one or more phonemes pronounced by first speaker S1 and sampled in advance, for example through the same sampler module SM.
  • An input/output interface IO shows first speaker S1 the labels L1...Ln associated with markers M1...Mn.
  • First speaker S1 can select the markers M1...Mn to be inserted into digital audio signal DS by pressing buttons associated with labels L1...Ln.
  • Input/output interface IO is a touchscreen which shows labels L1...Ln, which can be selected by touching the area of the touchscreen where they are displayed.
  • Input/output interface IO can also comprise a display, a keyboard, a mouse and/or other input/output devices.
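As a concrete illustration (not part of the patent text), the digital table DT just described can be modeled as a plain mapping from markers to their commands and labels; all names and example values below are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class TableEntry:
    """One row of digital table DT."""
    phonemes: str  # phoneme text the recognizer emits for this marker
    command: str   # command Cx substituted into the text data
    label: str     # label Lx shown to the speaker on interface IO

# Hypothetical table for a news broadcast: the speaker sees the labels,
# the viewers ultimately see the commands.
digital_table = {
    "M1": TableEntry("zulu one", "[COLOR:yellow]", "Yellow text"),
    "M2": TableEntry("zulu two", "Prime Minister", "PM"),
}

def labels(table):
    """Labels L1...Ln displayed as buttons on interface IO."""
    return [entry.label for entry in table.values()]
```

Because the table is ordinary data rather than trained vocabulary, a supervisor can swap in a new table for a different topic without retraining anything.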
  • When first speaker S1 selects a label Lx, the marker Mx corresponding to label Lx is immediately inserted into digital audio signal DS by audio editor AE.
  • The latter comprises an audio buffer which temporarily stores the rest of sampled waveform SW and shifts it forward by the duration of marker Mx, so that no portion of speech S is lost.
  • To recover the delay thus introduced, audio editor AE can remove possible pauses from digital audio signal DS and/or can digitally accelerate digital audio signal DS without varying the pitch of speech S.
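The splicing and catch-up steps performed by audio editor AE can be sketched as follows; this is a simplified model operating on lists of samples, with illustrative names only:

```python
def insert_marker(waveform, marker, position):
    """Splice marker waveform Mx into sampled waveform SW at `position`.

    The tail of the waveform is buffered and shifted forward by the
    marker's duration, so no speech samples are discarded.
    """
    return waveform[:position] + marker + waveform[position:]

def drop_pauses(waveform, silence=0):
    """Recover the delay introduced by markers by removing silent samples.

    A real implementation would detect silence with an energy threshold
    and accelerate audio without changing pitch; exact-zero matching is
    just a stand-in for that detection.
    """
    return [s for s in waveform if s != silence]
```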
  • Digital audio signal DS, comprising sampled waveform SW and marker Mx, is then processed by voice recognition module VRM, which converts digital audio signal DS into text data TD including dictated text D and marker Mx converted into the corresponding phonemes and inserted into dictated text D.
  • A text converter TC then converts the text of the phonemes corresponding to marker Mx into the command Cx associated with marker Mx in digital table DT.
  • Command Cx can consist of one or more text characters, in particular symbols, characters, words and/or sentences, and/or text formatting commands, in particular colors, sizes and/or fonts.
  • Text data TD generated by text converter TC thus comprise command Cx included in dictated text D.
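A minimal sketch of the text converter TC step, under the assumption that the voice recognition module writes each marker's phonemes as literal words in the dictated text (the phoneme strings and commands are invented for illustration):

```python
def convert(dictated_text, phoneme_to_command):
    """Replace marker phoneme strings in dictated text D with the
    commands Cx associated with them in digital table DT."""
    for phonemes, command in phoneme_to_command.items():
        dictated_text = dictated_text.replace(phonemes, command)
    return dictated_text
```

For example, `convert("zulu one breaking news", {"zulu one": "[COLOR:yellow]"})` yields `"[COLOR:yellow] breaking news"`: the marker never has to be a real word the recognizer was trained on beyond its fixed phonemes.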
  • First speaker S1 can insert a plurality of markers Mx...My into various points of sampled waveform SW in digital audio signal DS, in which case text data TD generated by text converter TC comprise a plurality of commands Cx...Cy included in the same points of the corresponding dictated text D.
  • When first speaker S1 selects with input/output interface IO the labels Lx...Ly corresponding to commands Cx...Cy and to markers Mx...My, the selected commands Cx...Cy are also stored in a digital memory DM, so that if a marker Mx...My is not correctly recognized, text converter TC can in any case compare the sequence of commands Cx...Cy selected and stored in digital memory DM with the commands Cx...Cy associated with the markers Mx...My transformed into text data TD, so as to obtain text data TD which include these commands Cx...Cy in their correct sequence.
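The error check against digital memory DM can be sketched as a simple position-by-position comparison of the two command sequences (an illustrative simplification; a real system might align sequences more robustly):

```python
def find_recognition_errors(selected, recognized):
    """Compare the command sequence stored in digital memory DM (what the
    speaker actually selected) with the commands recovered from text data TD.

    Returns (index, expected, got) for every position where the voice
    recognition module misread a marker; the expected command can then
    be restored in the text data.
    """
    errors = []
    for i, expected in enumerate(selected):
        got = recognized[i] if i < len(recognized) else None
        if got != expected:
            errors.append((i, expected, got))
    return errors
```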
  • Input/output interface IO, sampler module SM and/or digital table DT, as well as digital memory DM, are components and/or peripherals, also of a known kind, of a client computer CC, while audio editor AE, voice recognition module VRM and/or text converter TC, as well as audio processor AP, are programs, also of a known kind, suitable to be executed by client computer CC.
  • A plurality of speakers S1...Sm, each provided with a client computer CC1...CCm, can generate with the above-mentioned method one or more text data sequences TD11...TD1p...TDm1...TDmq, which are sent through a data network to at least one server computer SC, which combines such sequences in an automatic and/or manual manner for generating at least one text T to be sent to a text generator TG, for example for being displayed in a television transmission.
  • Text T can also contain other text data TDx...TDy which can be created with a method different from the one described above.
  • A supervisor SV can manually process the contents and/or the order of text data TD11...TD1p...TDm1...TDmq...TDx...TDy.
  • The sequences of text data TD11...TD1p...TDm1...TDmq...TDx...TDy can also be ordered automatically by server computer SC, which inserts the first available text data as soon as a pause longer than a determined threshold value is detected in the sequence of the text data employed at that time for generating text T.
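The automatic ordering on server computer SC might look like the following sketch, in which each speaker's sequence is a list of (timestamp, text) pairs and the server switches to another stream only after a long enough pause; the data layout and switching policy are assumptions for illustration:

```python
def merge_streams(streams, pause_threshold):
    """Combine timestamped text-data sequences from several client
    computers into one ordered list for text T.

    While the active stream keeps producing text, other streams are
    ignored; when it pauses for longer than `pause_threshold`, the
    server switches to the first stream with text data available.
    """
    items = sorted(
        (timestamp, stream_id, text)
        for stream_id, stream in enumerate(streams)
        for timestamp, text in stream
    )
    merged, active, last_time = [], None, None
    for timestamp, stream_id, text in items:
        # switch streams only after a long enough pause in the output
        if active is None or timestamp - last_time > pause_threshold:
            active = stream_id
        if stream_id == active:
            merged.append(text)
            last_time = timestamp
    return merged
```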
  • Supervisor SV can also process with server computer SC, and send through the same data network to client computers CC1...CCm, one or more digital tables DT1...DTz in which markers M1...Mx are associated with particular labels L1...Lx and commands C1...Cx relating to the topic (for example politics, sports, economy, news, etc.) dealt with by speakers S1...Sm, so as to update in real time the commands C1...Cx associated with markers M1...Mx and usable by speakers S1...Sm during the conversion of analog audio signal AA into digital audio signal DS.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Document Processing Apparatus (AREA)
  • Machine Translation (AREA)
PCT/EP2009/052092 2008-04-30 2009-02-20 Method and system for converting speech into text WO2009132871A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
EP09737915A EP2283481A1 (en) 2008-04-30 2009-02-20 Method and system for converting speech into text

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
IT000794A ITMI20080794A1 (it) 2008-04-30 2008-04-30 Metodo e sistema per convertire parlato in testo
ITMI2008A000794 2008-04-30

Publications (1)

Publication Number Publication Date
WO2009132871A1 true WO2009132871A1 (en) 2009-11-05

Family

ID=40297044

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2009/052092 WO2009132871A1 (en) 2008-04-30 2009-02-20 Method and system for converting speech into text

Country Status (3)

Country Link
EP (1) EP2283481A1 (it)
IT (1) ITMI20080794A1 (it)
WO (1) WO2009132871A1 (it)


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5960447A (en) 1995-11-13 1999-09-28 Holt; Douglas Word tagging and editing system for speech recognition
US20050086705A1 (en) * 2003-08-26 2005-04-21 Jarman Matthew T. Method and apparatus for controlling play of an audio signal
WO2005116992A1 (en) * 2004-05-27 2005-12-08 Koninklijke Philips Electronics N.V. Method of and system for modifying messages
US20070256016A1 (en) * 2006-04-26 2007-11-01 Bedingfield James C Sr Methods, systems, and computer program products for managing video information


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120316882A1 (en) * 2011-06-10 2012-12-13 Morgan Fiumi System for generating captions for live video broadcasts
US8532469B2 (en) 2011-06-10 2013-09-10 Morgan Fiumi Distributed digital video processing system
US8749618B2 (en) 2011-06-10 2014-06-10 Morgan Fiumi Distributed three-dimensional video conversion system
US9026446B2 (en) * 2011-06-10 2015-05-05 Morgan Fiumi System for generating captions for live video broadcasts

Also Published As

Publication number Publication date
EP2283481A1 (en) 2011-02-16
ITMI20080794A1 (it) 2009-11-01

Similar Documents

Publication Publication Date Title
US8706495B2 (en) Synchronise an audio cursor and a text cursor during editing
US20020143534A1 (en) Editing during synchronous playback
  • JP6633153B2 (ja) Method and device for extracting information
US9263027B2 (en) Broadcast system using text to speech conversion
  • JP5787780B2 (ja) Transcription support system and transcription support method
US20060184369A1 (en) Voice activated instruction manual
US10304457B2 (en) Transcription support system and transcription support method
  • JP2011191922A (ja) Translation device, translation method and computer program
  • US20100256972A1 Automatic simultaneous interpretation system
US20060195318A1 (en) System for correction of speech recognition results with confidence level indication
  • WO2008114453A9 (ja) Speech synthesizer, speech synthesis system, language processing device, speech synthesis method and computer program
  • JP2012181358A (ja) Text display time determination device, text display system, method and program
US20190005950A1 (en) Intention estimation device and intention estimation method
  • JP4100243B2 (ja) Speech recognition device and method using video information
  • CN113225612A (zh) Subtitle generation method and device, computer-readable storage medium and electronic device
US8676578B2 (en) Meeting support apparatus, method and program
  • JP4436087B2 (ja) Character data correction device, character data correction method and character data correction program
WO2009132871A1 (en) Method and system for converting speech into text
  • KR101990019B1 (ko) Terminal and method for implementing hybrid subtitle effects
  • JP2018045675A (ja) Information presentation method, information presentation program and information presentation system
  • JP5818753B2 (ja) Voice dialogue system and voice dialogue method
  • CN113409761B (zh) Speech synthesis method and device, electronic equipment and computer-readable storage medium
  • JP2001282779A (ja) Electronic text creation system
  • CN108962246B (zh) Voice control method and device, and computer-readable storage medium
  • JP2008243076A (ja) Translation device, method and program

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 09737915

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2009737915

Country of ref document: EP

NENP Non-entry into the national phase

Ref country code: DE