EP2283481A1 - Method and system for converting speech into text - Google Patents

Method and system for converting speech into text

Info

Publication number
EP2283481A1
Authority
EP
European Patent Office
Prior art keywords
audio signal
digital audio
text
text data
markers
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP09737915A
Other languages
German (de)
English (en)
French (fr)
Inventor
Giacomo Olgeni
Mattia Scaricabarozzi
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Colby Srl
Original Assignee
Colby Srl
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Colby Srl filed Critical Colby Srl
Publication of EP2283481A1

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • H04N21/488Data services, e.g. news ticker
    • H04N21/4884Data services, e.g. news ticker for displaying subtitles
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/103Formatting, i.e. changing of presentation of documents
    • G06F40/117Tagging; Marking up; Designating a block; Setting of attributes
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/26Speech to text systems
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/4402Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display
    • H04N21/440236Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display by media transcoding, e.g. video is transformed into a slideshow of still pictures, audio is converted into text
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00Television systems
    • H04N7/08Systems for the simultaneous or sequential transmission of more than one television signal, e.g. additional information signals, the signals occupying wholly or partially the same frequency band, e.g. by time division
    • H04N7/087Systems for the simultaneous or sequential transmission of more than one television signal, e.g. additional information signals, the signals occupying wholly or partially the same frequency band, e.g. by time division with signal insertion during the vertical blanking interval only
    • H04N7/088Systems for the simultaneous or sequential transmission of more than one television signal, e.g. additional information signals, the signals occupying wholly or partially the same frequency band, e.g. by time division with signal insertion during the vertical blanking interval only the inserted signal being digital
    • H04N7/0884Systems for the simultaneous or sequential transmission of more than one television signal, e.g. additional information signals, the signals occupying wholly or partially the same frequency band, e.g. by time division with signal insertion during the vertical blanking interval only the inserted signal being digital for the transmission of additional display-information, e.g. menu for programme or channel selection
    • H04N7/0885Systems for the simultaneous or sequential transmission of more than one television signal, e.g. additional information signals, the signals occupying wholly or partially the same frequency band, e.g. by time division with signal insertion during the vertical blanking interval only the inserted signal being digital for the transmission of additional display-information, e.g. menu for programme or channel selection for the transmission of subtitles

Definitions

  • The present invention relates to a method for converting speech into text, and in particular to a method which can be employed for generating live subtitles in television transmissions.
  • The present invention also relates to a system for carrying out such a method.
  • Known systems for converting speech into text comprise a sampler module which converts an analog audio signal of a speech into a digital audio signal, as well as a voice recognition module which converts the digital audio signal into text data.
  • Such systems have some disadvantages when the speech is generated by a speaker, generally called a respeaker, for creating real-time television subtitles comprising the text data.
  • Each word not contained in the system dictionary must be added manually and trained by the speaker, who pronounces it one or more times so that the system can associate it with the corresponding phonemes.
  • This operation can be carried out only in advance, namely not during the normal dictation process, so that if during a transmission the speaker has to pronounce a new word several times, the system can never interpret it correctly.
  • Moreover, the known systems convert speech into text with a certain delay, since they use the context of the dictated sentence to resolve the ambiguities which inevitably arise during phoneme processing; as a result, they generate text data only when the speaker pauses in the dictation, which is quite rare when he tries to follow a transmission in real time.
  • The method and the system according to the present invention allow the desired commands to be inserted into the speech automatically, without the speaker being forced to pronounce them, thus also avoiding the training phase for new words.
  • Commands can comprise one or more text characters, in particular symbols, characters, words and/or sentences, and/or text formatting commands, in particular colors, sizes and/or fonts.
  • The association of the markers with the commands can be modified in real time by a supervisor according to the topic of the speech, without modifying the markers or training new ones.
  • The only training, to be carried out only once for each speaker, is required for acquiring the phonemes used as markers.
  • The commands associated with the markers inserted into the digital audio signal are compared with the commands associated with the markers found in the text data, so as to detect possible recognition errors of the markers themselves.
  • Figure 1 shows a first block scheme of the system;
  • Figure 2 shows a scheme of the insertion of a marker;
  • Figure 3 shows a scheme of the correction of a marker series;
  • Figure 4 shows a second block scheme of the system.
  • The system according to the present invention comprises, in a known way, at least one sampler module SM which converts an analog audio signal AA into a digital audio signal DS.
  • Analog audio signal AA is a speech S of a first speaker S1 picked up by at least one transducer, in particular a microphone MIC.
  • Analog audio signal AA can be processed by an audio processor AP, for example comprising equalization, gate and compression stages, before it is sampled by sampler module SM.
  • Digital audio signal DS contains at least one sampled waveform SW substantially corresponding to speech S and is transmitted to a voice recognition module VRM which converts digital audio signal DS into a dictated text D substantially corresponding to speech S.
  • The system also comprises an audio editor AE suitable for automatically inserting into digital audio signal DS at least one marker Mx comprising a digital waveform stored in at least one digital table DT, which comprises one or more markers M1...Mn associated with one or more commands C1...Cn and with one or more labels L1...Ln.
  • Markers M1...Mn comprise one or more phonemes pronounced by first speaker S1 and sampled in advance, for example through the same sampler module SM.
  • An input/output interface IO shows first speaker S1 the labels L1...Ln associated with markers M1...Mn.
  • First speaker S1 can select the markers M1...Mn to be inserted into digital audio signal DS by pressing buttons associated with labels L1...Ln.
  • In particular, input/output interface IO is a touchscreen which shows labels L1...Ln, which can be selected by touching the area of the touchscreen showing them.
  • Alternatively, input/output interface IO can comprise a display, a keyboard, a mouse and/or other input/output devices.
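As a rough illustration of this arrangement, the digital table DT can be modeled as a mapping from marker identifiers to a label (shown to the speaker) and a command (inserted into the text data). All names, fields and example values below are ours, not the patent's:

```python
# Hypothetical sketch of digital table DT: each marker Mx is associated
# with a label Lx (shown on the input/output interface IO) and a
# command Cx (text or formatting inserted into the text data).
# Names, fields and example values are illustrative.

from dataclasses import dataclass
from typing import Optional

@dataclass
class MarkerEntry:
    label: str       # Lx, shown to the speaker on the touchscreen
    command: str     # Cx, e.g. a word, symbol or formatting command
    waveform: bytes  # pre-sampled phonemes pronounced by the speaker

digital_table = {
    "M1": MarkerEntry("NEW LINE", "\n", b"..."),
    "M2": MarkerEntry("RED", "<color=red>", b"..."),
    "M3": MarkerEntry("FULL STOP", ".", b"..."),
}

def select_label(label: str) -> Optional[str]:
    """Return the marker id whose label the speaker touched, if any."""
    for marker_id, entry in digital_table.items():
        if entry.label == label:
            return marker_id
    return None
```

Touching a label on the interface would then resolve to the marker whose waveform is to be spliced into the audio stream.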
  • When first speaker S1 selects a label Lx, the marker Mx corresponding to label Lx is immediately inserted into digital audio signal DS by audio editor AE.
  • Audio editor AE comprises an audio buffer which temporarily stores and shifts forward the rest of sampled waveform SW, so as to make up for the portion of speech S corresponding to the duration of marker Mx.
  • To recover the delay introduced in this way, audio editor AE can remove possible pauses from digital audio signal DS and/or can digitally accelerate digital audio signal DS without varying the pitch of speech S.
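The insertion and pause-removal steps can be pictured on a simplified list-of-samples model of the signal (a deliberate simplification of ours, not the patent's implementation):

```python
# Simplified model of audio editor AE: the digital audio signal is a
# list of samples; inserting marker Mx splices its waveform in at the
# selection instant and shifts the rest of sampled waveform SW forward.
# A real implementation would also time-stretch the audio without
# changing pitch to recover the added duration.

def insert_marker(signal, position, marker_waveform):
    """Return a new signal with the marker spliced in at `position`."""
    return signal[:position] + marker_waveform + signal[position:]

def drop_pauses(signal, silence=0):
    """Crude pause removal: discard samples at the silence level."""
    return [s for s in signal if s != silence]
```

For example, splicing a two-sample marker into `[1, 2, 0, 0, 3]` at position 2 yields `[1, 2, 9, 9, 0, 0, 3]`, and the buffered tail is shifted forward by the marker's length.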
  • Digital audio signal DS, comprising sampled waveform SW and marker Mx, is then processed by voice recognition module VRM, which converts digital audio signal DS into text data TD including dictated text D and marker Mx converted into the corresponding phonemes and inserted into dictated text D.
  • A text converter TC converts the text of the phonemes corresponding to marker Mx into the command Cx associated with marker Mx in digital table DT.
  • Command Cx can consist of one or more text characters, in particular symbols, characters, words and/or sentences, and/or text formatting commands, in particular colors, sizes and/or fonts.
  • Text data TD generated by text converter TC thus comprise command Cx included in dictated text D.
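A minimal sketch of what text converter TC does, assuming the recognized marker phonemes surface in the dictated text as distinctive pseudo-words (the phoneme strings and commands below are invented):

```python
# Hypothetical sketch of text converter TC: the voice recognition
# module has already rendered marker Mx as its phoneme text (here a
# distinctive pseudo-word); TC replaces each occurrence with the
# command Cx from digital table DT. All entries are illustrative.

marker_phonemes_to_command = {
    "zorp": "\n",           # marker M1 -> new-line command
    "vekk": "<color=red>",  # marker M2 -> formatting command
}

def convert_text(dictated_text: str) -> str:
    """Replace recognized marker words with their associated commands."""
    words = dictated_text.split(" ")
    out = [marker_phonemes_to_command.get(w, w) for w in words]
    return " ".join(out)
```

Ordinary dictated words pass through unchanged; only the marker pseudo-words are rewritten as commands.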
  • First speaker S1 can insert a plurality of markers Mx...My into various points of sampled waveform SW in digital audio signal DS, in which case text data TD generated by text converter TC comprise a plurality of commands Cx...Cy included in the same points of the corresponding dictated text D.
  • When first speaker S1 selects with input/output interface IO the labels Lx...Ly corresponding to commands Cx...Cy and to markers Mx...My, the selected commands Cx...Cy are also stored in a digital memory DM, so that if a marker Mx...My is not recognized correctly, text converter TC can in any case compare in digital memory DM the sequence of commands Cx...Cy which have been selected with the commands Cx...Cy associated with the markers Mx...My transformed into text data TD, so as to obtain text data TD which include these commands Cx...Cy in their correct sequence.
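This check can be sketched as comparing the command sequence actually selected (stored in digital memory DM) with the sequence recovered from the recognized text, treating the stored sequence as authoritative. The repair policy below is our own simplification, not necessarily the patent's:

```python
# Simplified sketch of the check against digital memory DM: the
# commands the speaker actually selected are authoritative, so any
# marker misrecognized (or dropped) by the voice recognition module
# is repaired by restoring the originally selected sequence.

def repair_commands(selected, recognized):
    """Return the recognized command list, fixed to match the
    selected sequence (illustrative policy)."""
    if recognized == selected:
        return recognized
    # Recognition errors detected: fall back to the stored sequence.
    return list(selected)
```

A dropped marker thus reappears in its correct position because the selection order in DM, not the recognizer's output, drives the final command sequence.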
  • Input/output interface IO, sampler module SM and/or digital table DT, as well as digital memory DM, are components and/or peripherals, also of a known kind, of a client computer CC, while audio editor AE, voice recognition module VRM and/or text converter TC, as well as audio processor AP, are programs, also of a known kind, suitable to be executed by client computer CC.
  • A plurality of speakers S1...Sm, each provided with a client computer CC1...CCm, can generate with the above-mentioned method one or more text data sequences TD11...TD1p...TDm1...TDmq, which are sent through a data network to at least one server computer SC, which combines such sequences in an automatic and/or manual manner for generating at least one text T to be sent to a text generator TG, for example for being displayed in a television transmission.
  • Text T can also contain other text data TDx...TDy which can be created with a method different from the one described above.
  • A supervisor SV can manually process the contents and/or the order of text data TD11...TD1p...TDm1...TDmq...TDx...TDy.
  • The sequences of text data TD11...TD1p...TDm1...TDmq...TDx...TDy can also be automatically ordered by server computer SC, which inserts the first available text data as soon as a pause longer than a determined threshold value is detected in the sequence of the text data which are employed at that time for generating text T.
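The automatic ordering rule can be pictured as: the server keeps emitting from the currently active sequence and switches to another available sequence when the active one has been silent longer than a threshold. A toy model of ours (timestamps in seconds; names and the switch policy are assumptions):

```python
# Toy model of server computer SC combining text data sequences: each
# sequence is a list of (timestamp, text) pairs. When the active
# sequence is exhausted, or silent for longer than `threshold`
# seconds, the server switches to the available sequence whose next
# item is earliest. All names are illustrative.

def combine(sequences, threshold=2.0):
    """Merge sequences into one text, switching on long pauses."""
    active, out, last_t = 0, [], 0.0
    pending = [list(s) for s in sequences]
    while any(pending):
        if not pending[active] or (
            pending[active][0][0] - last_t > threshold
        ):
            # Pause too long (or sequence exhausted): take the
            # sequence with the earliest available text data.
            active = min(
                (i for i, s in enumerate(pending) if s),
                key=lambda i: pending[i][0][0],
            )
        t, text = pending[active].pop(0)
        last_t = t
        out.append(text)
    return " ".join(out)
```

With a 2-second threshold, a 10-second gap in the active speaker's stream causes the server to insert the first available text from another speaker before returning to the delayed one.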
  • Supervisor SV can also process with server computer SC, and send through the same data network to client computers CC1...CCm, one or more digital tables DT1...DTz in which markers M1...Mx are associated with particular labels L1...Lx and commands C1...Cx relating to the topic (for example politics, sports, economy, news, etc.) dealt with by speakers S1...Sm, so as to update in real time the commands C1...Cx associated with markers M1...Mx and usable by speakers S1...Sm during the conversion of analog audio signal AA into digital audio signal DS.
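Pushing topic-specific tables amounts to replacing each client's label/command associations while the pre-sampled marker waveforms stay unchanged, which is why no new training is needed. A sketch with invented names and topics:

```python
# Hypothetical sketch of the supervisor pushing topic-specific tables
# DT1...DTz to the clients: only labels and commands change per topic;
# the markers (the speaker's pre-sampled phonemes) are reused as-is.
# All names, topics and entries are illustrative.

tables_by_topic = {
    "sports": {"M1": ("GOAL", "GOAL!"), "M2": ("RED", "<color=red>")},
    "politics": {"M1": ("QUOTE", '"'), "M2": ("RED", "<color=red>")},
}

class Client:
    """Stand-in for a client computer CC holding its current table."""
    def __init__(self):
        self.table = {}

    def receive_table(self, table):
        # Swap the whole label/command association in one step.
        self.table = dict(table)

def push_topic(clients, topic):
    """Supervisor-side update: broadcast the topic's table."""
    for c in clients:
        c.receive_table(tables_by_topic[topic])
```

The same marker M1 then produces "GOAL!" during a sports broadcast and an opening quotation mark during a political one, without the speaker retraining anything.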

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Document Processing Apparatus (AREA)
  • Machine Translation (AREA)
EP09737915A 2008-04-30 2009-02-20 Method and system for converting speech into text Withdrawn EP2283481A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
IT000794A ITMI20080794A1 (it) 2008-04-30 2008-04-30 Method and system for converting speech into text
PCT/EP2009/052092 WO2009132871A1 (en) 2008-04-30 2009-02-20 Method and system for converting speech into text

Publications (1)

Publication Number Publication Date
EP2283481A1 (en) 2011-02-16

Family

ID=40297044

Family Applications (1)

Application Number Title Priority Date Filing Date
EP09737915A Withdrawn EP2283481A1 (en) 2008-04-30 2009-02-20 Method and system for converting speech into text

Country Status (3)

Country Link
EP (1) EP2283481A1 (it)
IT (1) ITMI20080794A1 (it)
WO (1) WO2009132871A1 (it)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8532469B2 (en) 2011-06-10 2013-09-10 Morgan Fiumi Distributed digital video processing system
US8749618B2 (en) 2011-06-10 2014-06-10 Morgan Fiumi Distributed three-dimensional video conversion system
US9026446B2 (en) * 2011-06-10 2015-05-05 Morgan Fiumi System for generating captions for live video broadcasts

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5960447A (en) 1995-11-13 1999-09-28 Holt; Douglas Word tagging and editing system for speech recognition
KR20060123072A (ko) * 2003-08-26 2006-12-01 Clearplay, Inc. Method and apparatus for controlling playback of an audio signal
KR20070020252A (ko) * 2004-05-27 2007-02-20 Koninklijke Philips Electronics N.V. Method and system for modifying a message
US8701005B2 (en) * 2006-04-26 2014-04-15 At&T Intellectual Property I, Lp Methods, systems, and computer program products for managing video information

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See references of WO2009132871A1 *

Also Published As

Publication number Publication date
WO2009132871A1 (en) 2009-11-05
ITMI20080794A1 (it) 2009-11-01

Similar Documents

Publication Publication Date Title
US6999933B2 (en) Editing during synchronous playback
US8706495B2 (en) Synchronise an audio cursor and a text cursor during editing
JP6633153B2 (ja) Method and apparatus for extracting information
JP5787780B2 (ja) Transcription support system and transcription support method
US20060184369A1 (en) Voice activated instruction manual
EP1611570B1 (en) System for correction of speech recognition results with confidence level indication
JP2011191922A (ja) Translation apparatus, translation method and computer program
US10304457B2 (en) Transcription support system and transcription support method
US20100256972A1 Automatic simultaneous interpretation system
WO2008114453A9 (ja) Speech synthesis device, speech synthesis system, language processing device, speech synthesis method and computer program
JP2012181358A (ja) Text display time determination device, text display system, method and program
US20190005950A1 (en) Intention estimation device and intention estimation method
CN113225612A (zh) Subtitle generation method and apparatus, computer-readable storage medium and electronic device
JP4436087B2 (ja) Character data correction device, character data correction method and character data correction program
WO2009132871A1 (en) Method and system for converting speech into text
CN113409761B (zh) Speech synthesis method and apparatus, electronic device and computer-readable storage medium
US20120078629A1 (en) Meeting support apparatus, method and program
KR101990019B1 (ko) Terminal and method for implementing hybrid subtitle effects
CN108682423A (zh) Speech recognition method and device
JP5818753B2 (ja) Spoken dialogue system and spoken dialogue method
JP2001282779A (ja) Electronic text creation system
JP2008243076A (ja) Translation apparatus, method and program
JP6387044B2 (ja) Text processing device, text processing method and text processing program
JP2004287756A (ja) E-mail creation device and e-mail creation method
JPH10136260A (ja) Subtitle superimposition timing generation device and method, and subtitle superimposition processing device and method

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

17P Request for examination filed

Effective date: 20101026

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO SE SI SK TR

AX Request for extension of the european patent

Extension state: AL BA RS

DAX Request for extension of the european patent (deleted)
17Q First examination report despatched

Effective date: 20120214

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN

18D Application deemed to be withdrawn

Effective date: 20130903