TW202318252A - Caption service system for remote speech recognition - Google Patents
- Publication number
- TW202318252A (application TW110139500A)
- Authority
- TW
- Taiwan
- Prior art keywords
- speech recognition
- subtitle
- asr
- speaker
- live broadcast
- Prior art date
Landscapes
- Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
- Telephonic Communication Services (AREA)
Abstract
Description
The present invention relates to a caption service system for remote speech recognition, and in particular to a caption service system that provides remote speech-recognition captions for the hearing impaired via a caption server and a typist.
Owing to the outbreak of the COVID-19 pandemic, remote live broadcasting and distance teaching have become widely adopted. However, typical remote broadcasts and classes currently display no captions, making it impossible for hearing-impaired students to follow the lessons.
Hearing-impaired students also face difficulties in conventional classrooms, because no display presents captions of what the teacher is saying. Likewise, hearing-impaired people cannot take part in lectures and conferences where no display shows captions.
Providing captions that display what the teacher or speaker is saying is therefore a great benefit to the hearing impaired.
Some conferences now employ a typist who types what the speaker says into a computer in real time and presents it as captions on a screen, so that hearing-impaired attendees can follow the proceedings. Listening to the speaker, however, demands intense concentration; when working hours run long, the typist may drop sentences or make typos. A more complete remote-typist solution is therefore needed.
The purpose of the present invention is to propose a caption service system for remote speech recognition that provides remote speech-recognition captions for the hearing impaired. The invention is described as follows.
The system comprises a speaker and live-broadcast equipment at site A, a typist and a computer at site B, a hearing-impaired viewer and a live-broadcast screen at site C, and an automatic speech recognition (ASR) caption server at site D, with a network connecting the broadcast equipment, the computer, the broadcast screen, and the ASR caption server.
The automatic speech recognition (ASR) caption server comprises: a Real-Time Messaging Protocol (RTMP) module for receiving the live stream sent from site A over the network; an open-source speech recognition toolkit for speech recognition and signal processing; a web server that provides a web-page interface delivered over HTTP to the broadcast equipment, the computer, and the broadcast screen; and a recording module that lets the typist replay the audio.
The speaker's voice is sent to the automatic speech recognition (ASR) caption server and converted into text, which is corrected by the typist. The text captions are then delivered, together with the speaker's video and audio, to the hearing-impaired viewer's live-broadcast screen, so that the viewer can read captions of what the speaker said.
1: speaker
2: live-broadcast equipment
3: typist
4: computer
5: hearing-impaired viewer
6: live-broadcast screen
61: caption area
7: automatic speech recognition (ASR) caption server
8: network
9: Real-Time Messaging Protocol (RTMP)
10: Kaldi ASR open-source speech recognition toolkit
11: web server
12: recording module
13: OBS broadcasting software
14: ASR upload interface
15: audio stream
16: audio record
17: video and audio captured by the broadcast equipment
18: caption content captured from the browser
FIG. 1 is a schematic diagram of the basic architecture of the caption service system for remote speech recognition of the present invention.
FIG. 2 is a schematic diagram of the contents of the automatic speech recognition (ASR) caption server of the present invention.
FIG. 3 is a schematic diagram of the contents of the live-broadcast equipment of the present invention.
FIG. 4 is a schematic diagram of the caption generation process of the ASR caption server of the present invention.
FIG. 5 is a schematic diagram of the remote operation of the typist of the present invention.
FIG. 6 is a schematic diagram of merging the speaker's live video with the captions for output according to the present invention.
FIG. 1 illustrates the basic architecture of the caption service system for remote speech recognition of the present invention. Speaker 1 and broadcast equipment 2 are at site A, typist 3 and computer 4 at site B, hearing-impaired viewer 5 and live-broadcast screen 6 at site C, and the automatic speech recognition (ASR) caption server 7 at site D. The four sites are connected by network 8, which may be a local area network or the global Internet. If sites A, B, and C coincide, speaker 1, typist 3, and hearing-impaired viewer 5 are in the same classroom or meeting room.
FIG. 2 illustrates the contents of the automatic speech recognition (ASR) caption server 7 of the present invention. The Real-Time Messaging Protocol (RTMP) 9 is a protocol widely used for live streaming; equipped with it, the ASR caption server 7 can receive the live stream sent from site A over network 8. Instead of RTMP, HTTP Live Streaming (HLS), an HTTP-based streaming media protocol proposed by Apple, can likewise receive the live stream from site A over network 8. The method of the present invention is, however, not limited to RTMP or HLS.
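The patent does not specify how the server demultiplexes the incoming stream. As one possible sketch, a tool such as ffmpeg can ingest either an RTMP URL or an HLS playlist and extract mono 16 kHz PCM audio suitable for an ASR front end; the stream URL and output filename below are placeholders, not values from the patent.

```python
def build_ingest_command(stream_url: str, out_wav: str) -> list:
    """Build an ffmpeg command that pulls a live stream (RTMP or HLS)
    and extracts mono 16 kHz 16-bit PCM audio for the recognizer."""
    return [
        "ffmpeg",
        "-i", stream_url,        # rtmp://... or https://.../playlist.m3u8
        "-vn",                   # drop the video track
        "-acodec", "pcm_s16le",  # 16-bit little-endian PCM
        "-ar", "16000",          # 16 kHz sample rate, common for ASR models
        "-ac", "1",              # mono
        out_wav,
    ]

cmd = build_ingest_command("rtmp://example.test/live/talk", "talk.wav")
print(" ".join(cmd))
```

The same command shape works for both protocols, which matches the document's point that the choice of RTMP versus HLS is not essential to the method.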
For speech recognition, the ASR caption server 7 uses Kaldi ASR 10, an open-source toolkit for speech recognition and signal processing, freely available under the Apache License v2.0.
A web server 11 must be set up on the ASR caption server 7. It provides the web-page interface and delivers it over the HTTP protocol to the clients (generally web browsers), namely the broadcast equipment 2, the computer 4, and the live-broadcast screen 6.
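The patent leaves the page implementation open. As a minimal sketch, the page could mark each caption line with its timing tag so that a client with write permission (the typist, per FIG. 5) can edit it, while read-only clients merely display it. The `data-*` attribute names and the `editable` flag are illustrative assumptions, not part of the patent.

```python
import html

def render_caption_page(segments, editable=False):
    """Render the caption page served over HTTP to the clients.

    Each segment is a dict with 'text', 'start', and 'dur'; the data-*
    attributes carry the tag described in FIG. 4 so a double-click on a
    line can request playback of the matching audio."""
    rows = []
    for seg in segments:
        attr = " contenteditable" if editable else ""
        rows.append(
            f'<p data-start="{seg["start"]}" data-dur="{seg["dur"]}"{attr}>'
            f'{html.escape(seg["text"])}</p>'
        )
    return "<html><body>\n" + "\n".join(rows) + "\n</body></html>"

page = render_caption_page(
    [{"text": "hello class", "start": 0.0, "dur": 1.5}], editable=True)
print(page)
```

Serving the same renderer with `editable=False` to the broadcast equipment would give the read-only view described for FIG. 6.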
The ASR caption server 7 has a recording module 12 that provides the typist with a "playback" function.
Please refer to FIG. 3, which illustrates the contents of the live-broadcast equipment 2 of the present invention. At site A, speaker 1's broadcast equipment 2 captures the speaker's video and audio and splits them into two paths. The first path feeds Open Broadcaster Software (OBS) 13, a free and open-source cross-platform streaming and recording program developed by the OBS Project and commonly used by streamers. The OBS back end can push the stream directly to platforms such as YouTube, Facebook, or Twitch.
The second path carries only speaker 1's audio, which can be packetized through the simple ASR upload interface 14 and streamed to the ASR caption server 7 via RTMP 9 (or HLS).
Please refer to FIG. 4, which illustrates the caption generation process of the automatic speech recognition (ASR) caption server 7. When a stream packet arrives at the RTMP 9 (or HLS) receiving module of the ASR caption server 7, it is unpacked into an audio stream 15 and passed to both Kaldi ASR 10 and the recording module 12. As time progresses, the recording module 12 writes the audio, indexed by time, into an audio record 16. On receiving the audio stream 15, the Kaldi ASR 10 module progressively converts it into text, and every text segment carries a "tag", as shown in FIG. 4. The tag records the second of the audio record to which the segment corresponds and the segment's duration. The text and tags are displayed on the web server 11 page and delivered over network 8 to the broadcast equipment 2, the computer 4, and the live-broadcast screen 6.
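The tag described above pairs each recognized segment with its offset and duration in the audio record. A minimal sketch of that data structure, with illustrative field names not taken from the patent, could look like this:

```python
from dataclasses import dataclass

@dataclass
class CaptionSegment:
    """One recognized text segment plus its FIG. 4 'tag': the offset
    into the audio record and the duration, both in seconds."""
    text: str
    start_sec: float
    duration_sec: float

# As the recognizer emits text, the server appends tagged segments:
transcript = [
    CaptionSegment("good morning everyone", 0.0, 2.1),
    CaptionSegment("today we discuss captions", 2.1, 3.4),
]

# The tag of any segment locates the exact audio to replay for checking.
seg = transcript[1]
print(f"replay from {seg.start_sec}s for {seg.duration_sec}s")
```

Keeping the tag alongside the text is what later lets the typist jump from a caption line straight to the corresponding audio.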
Please refer to FIG. 5, which illustrates the remote operation of typist 3 of the present invention. At a remote site, typist 3 opens YouTube, Facebook, or Twitch to receive the live video and audio from speaker 1 at site A. Through a web browser, typist 3 also logs into the web server 11 page of the ASR caption server 7 and reads and listens to speaker 1's text and audio on that page.
Typist 3 is granted read-write permission on the ASR caption server 7 and can therefore edit the text produced by Kaldi ASR 10 on the web server 11 page. Every text segment on the page carries a tag attribute: for example, when typist 3 double-clicks text segment C, the web server 11 page follows the tag and requests the segment of audio record 16 starting at second N3 and lasting Z seconds for playback. Typist 3 can thus verify what speaker 1 actually said and correct the text.
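The playback request above amounts to slicing the audio record at the position named by the tag. Assuming the recording module stores 16-bit mono PCM at 16 kHz (an assumption; the patent does not fix the format), the byte range for a tagged segment can be computed directly:

```python
SAMPLE_RATE = 16000   # Hz, assumed capture rate of the recording module
BYTES_PER_SAMPLE = 2  # 16-bit mono PCM (assumed format)

def slice_recording(pcm: bytes, start_sec: float, duration_sec: float) -> bytes:
    """Return the raw PCM bytes for one tagged segment, so the typist
    can replay exactly the audio behind a single caption line."""
    bytes_per_second = SAMPLE_RATE * BYTES_PER_SAMPLE
    begin = int(start_sec * bytes_per_second)
    end = begin + int(duration_sec * bytes_per_second)
    return pcm[begin:end]

# Ten seconds of silence stand in for the audio record:
record = bytes(10 * SAMPLE_RATE * BYTES_PER_SAMPLE)
clip = slice_recording(record, 3.0, 2.0)
print(len(clip))  # 2 s * 16000 samples/s * 2 bytes = 64000
```

Because the tag stores both the start second and the duration, the server never needs to search the recording; the lookup is pure arithmetic.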
Please refer to FIG. 6, which illustrates how speaker 1 at site A merges the live-broadcast screen 6 with the captions for output. Speaker 1's broadcast equipment 2 can log into the web server 11 page of the ASR caption server 7 through a web browser, but with read-only permission. In other words, the broadcast equipment 2 can only display the text transcribed by the ASR caption server 7 and the text as corrected by typist 3.
The OBS software 13 can overlay layers. On the broadcast equipment 2, speaker 1 selects and captures the caption content from the web server 11 page of the ASR caption server 7; the video and audio 17 captured by the broadcast equipment 2 are overlaid with the caption content 18 captured from the browser, and the OBS software 13 outputs a live-broadcast screen 6 containing the captions produced by the ASR caption server 7. The OBS software 13 then pushes the stream to a platform such as YouTube, Facebook, or Twitch, so that the hearing-impaired viewer 5 at site C sees the captions in the caption area 61 of the live-broadcast screen 6.
The spirit and scope of the present invention are determined by the following claims and are not limited to the embodiments described above.
Claims (10)
Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
---|---|---|---
TW110139500A | 2021-10-25 | 2021-10-25 | Caption service system for remote speech recognition
Publications (1)

Publication Number | Publication Date
---|---
TW202318252A | 2023-05-01
Family
ID=87378808