TW202318252A - Caption service system for remote speech recognition - Google Patents
- Publication number
- TW202318252A (application TW110139500A)
- Authority
- TW
- Taiwan
- Prior art keywords
- speech recognition
- subtitle
- asr
- speaker
- live broadcast
- Prior art date
Landscapes
- Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
- Telephonic Communication Services (AREA)
Abstract
Description
The present invention relates to a caption service system for remote speech recognition, and in particular to a caption service system that provides remote speech-recognition captions for the hearing impaired via a caption server and a typist.
Owing to the outbreak of the COVID-19 pandemic, remote live broadcasting and distance teaching have become widely adopted. However, typical remote broadcasts and classes currently display no captions, making it impossible for hearing-impaired students to follow the lessons.
Hearing-impaired students also face difficulties in conventional classrooms, because no display presents captions of what the teacher is saying. Likewise, hearing-impaired people cannot take part in lectures and conferences where no display shows captions.
Providing captions that display what the teacher or speaker is saying is therefore a great benefit to the hearing impaired.
Some conferences now employ a typist who types what the speaker says into a computer in real time and presents it as captions on a screen, so that hearing-impaired attendees can follow the proceedings. Listening to the speaker, however, demands intense concentration; when working hours run long, the typist may drop sentences or make typos. A more complete remote-typist solution is therefore needed.
The purpose of the present invention is to propose a caption service system for remote speech recognition that provides remote speech-recognition captions for the hearing impaired. The invention is described as follows.
The system comprises a speaker and live-broadcast equipment at site A, a typist and a computer at site B, a hearing-impaired viewer and a live-broadcast screen at site C, and an automatic speech recognition (ASR) caption server at site D, with a network connecting the broadcast equipment, the computer, the broadcast screen, and the ASR caption server.
The automatic speech recognition (ASR) caption server comprises: a Real-Time Messaging Protocol (RTMP) module for receiving the live stream sent from site A over the network; an open-source speech recognition toolkit for speech recognition and signal processing; a web server that provides a web-page interface delivered over HTTP to the broadcast equipment, the computer, and the broadcast screen; and a recording module that lets the typist replay the audio.
The speaker's voice is sent to the automatic speech recognition (ASR) caption server and converted into text, which is corrected by the typist. The text captions are then delivered, together with the speaker's video and audio, to the hearing-impaired viewer's live-broadcast screen, so that the viewer can read captions of what the speaker said.
1: speaker
2: live-broadcast equipment
3: typist
4: computer
5: hearing-impaired viewer
6: live-broadcast screen
61: caption area
7: automatic speech recognition (ASR) caption server
8: network
9: Real-Time Messaging Protocol (RTMP)
10: Kaldi ASR open-source speech recognition toolkit
11: web server
12: recording module
13: OBS broadcasting software
14: ASR upload interface
15: audio stream
16: audio record
17: video and audio captured by the broadcast equipment
18: caption content captured from the browser
FIG. 1 is a schematic diagram of the basic architecture of the caption service system for remote speech recognition of the present invention.
FIG. 2 is a schematic diagram of the contents of the automatic speech recognition (ASR) caption server of the present invention.
FIG. 3 is a schematic diagram of the contents of the live-broadcast equipment of the present invention.
FIG. 4 is a schematic diagram of the caption generation process of the ASR caption server of the present invention.
FIG. 5 is a schematic diagram of the remote operation of the typist of the present invention.
FIG. 6 is a schematic diagram of merging the speaker's live video with the captions for output according to the present invention.
FIG. 1 illustrates the basic architecture of the caption service system for remote speech recognition of the present invention. Speaker 1 and broadcast equipment 2 are at site A, typist 3 and computer 4 at site B, hearing-impaired viewer 5 and live-broadcast screen 6 at site C, and the automatic speech recognition (ASR) caption server 7 at site D. The four sites are connected by network 8, which may be a local area network or the global Internet. If sites A, B, and C coincide, speaker 1, typist 3, and hearing-impaired viewer 5 are in the same classroom or meeting room.
FIG. 2 illustrates the contents of the automatic speech recognition (ASR) caption server 7 of the present invention. The Real-Time Messaging Protocol (RTMP) 9 is a protocol widely used for live streaming; equipped with it, the ASR caption server 7 can receive the live stream sent from site A over network 8. Instead of RTMP, HTTP Live Streaming (HLS), an HTTP-based streaming media protocol proposed by Apple, can likewise receive the live stream from site A over network 8. The method of the present invention is, however, not limited to RTMP or HLS.
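The patent does not specify how the server demultiplexes the incoming stream. As one possible sketch, a tool such as ffmpeg can ingest either an RTMP URL or an HLS playlist and extract mono 16 kHz PCM audio suitable for an ASR front end; the stream URL and output filename below are placeholders, not values from the patent.

```python
def build_ingest_command(stream_url: str, out_wav: str) -> list:
    """Build an ffmpeg command that pulls a live stream (RTMP or HLS)
    and extracts mono 16 kHz 16-bit PCM audio for the recognizer."""
    return [
        "ffmpeg",
        "-i", stream_url,        # rtmp://... or https://.../playlist.m3u8
        "-vn",                   # drop the video track
        "-acodec", "pcm_s16le",  # 16-bit little-endian PCM
        "-ar", "16000",          # 16 kHz sample rate, common for ASR models
        "-ac", "1",              # mono
        out_wav,
    ]

cmd = build_ingest_command("rtmp://example.test/live/talk", "talk.wav")
print(" ".join(cmd))
```

The same command shape works for both protocols, which matches the document's point that the choice of RTMP versus HLS is not essential to the method.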
For speech recognition, the ASR caption server 7 uses Kaldi ASR 10, an open-source toolkit for speech recognition and signal processing, freely available under the Apache License v2.0.
A web server 11 must be set up on the ASR caption server 7. It provides the web-page interface and delivers it over the HTTP protocol to the clients (generally web browsers), namely the broadcast equipment 2, the computer 4, and the live-broadcast screen 6.
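The patent leaves the page implementation open. As a minimal sketch, the page could mark each caption line with its timing tag so that a client with write permission (the typist, per FIG. 5) can edit it, while read-only clients merely display it. The `data-*` attribute names and the `editable` flag are illustrative assumptions, not part of the patent.

```python
import html

def render_caption_page(segments, editable=False):
    """Render the caption page served over HTTP to the clients.

    Each segment is a dict with 'text', 'start', and 'dur'; the data-*
    attributes carry the tag described in FIG. 4 so a double-click on a
    line can request playback of the matching audio."""
    rows = []
    for seg in segments:
        attr = " contenteditable" if editable else ""
        rows.append(
            f'<p data-start="{seg["start"]}" data-dur="{seg["dur"]}"{attr}>'
            f'{html.escape(seg["text"])}</p>'
        )
    return "<html><body>\n" + "\n".join(rows) + "\n</body></html>"

page = render_caption_page(
    [{"text": "hello class", "start": 0.0, "dur": 1.5}], editable=True)
print(page)
```

Serving the same renderer with `editable=False` to the broadcast equipment would give the read-only view described for FIG. 6.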
The ASR caption server 7 has a recording module 12 that provides the typist with a "playback" function.
Please refer to FIG. 3, which illustrates the contents of the live-broadcast equipment 2 of the present invention. At site A, speaker 1's broadcast equipment 2 captures the speaker's video and audio and splits them into two paths. The first path feeds Open Broadcaster Software (OBS) 13, a free and open-source cross-platform streaming and recording program developed by the OBS Project and commonly used by streamers. The OBS back end can push the stream directly to platforms such as YouTube, Facebook, or Twitch.
The second path carries only speaker 1's audio, which can be packetized through the simple ASR upload interface 14 and streamed to the ASR caption server 7 via RTMP 9 (or HLS).
Please refer to FIG. 4, which illustrates the caption generation process of the automatic speech recognition (ASR) caption server 7. When a stream packet arrives at the RTMP 9 (or HLS) receiving module of the ASR caption server 7, it is unpacked into an audio stream 15 and passed to both Kaldi ASR 10 and the recording module 12. As time progresses, the recording module 12 writes the audio, indexed by time, into an audio record 16. On receiving the audio stream 15, the Kaldi ASR 10 module progressively converts it into text, and every text segment carries a "tag", as shown in FIG. 4. The tag records the second of the audio record to which the segment corresponds and the segment's duration. The text and tags are displayed on the web server 11 page and delivered over network 8 to the broadcast equipment 2, the computer 4, and the live-broadcast screen 6.
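The tag described above pairs each recognized segment with its offset and duration in the audio record. A minimal sketch of that data structure, with illustrative field names not taken from the patent, could look like this:

```python
from dataclasses import dataclass

@dataclass
class CaptionSegment:
    """One recognized text segment plus its FIG. 4 'tag': the offset
    into the audio record and the duration, both in seconds."""
    text: str
    start_sec: float
    duration_sec: float

# As the recognizer emits text, the server appends tagged segments:
transcript = [
    CaptionSegment("good morning everyone", 0.0, 2.1),
    CaptionSegment("today we discuss captions", 2.1, 3.4),
]

# The tag of any segment locates the exact audio to replay for checking.
seg = transcript[1]
print(f"replay from {seg.start_sec}s for {seg.duration_sec}s")
```

Keeping the tag alongside the text is what later lets the typist jump from a caption line straight to the corresponding audio.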
Please refer to FIG. 5, which illustrates the remote operation of typist 3 of the present invention. At a remote site, typist 3 opens YouTube, Facebook, or Twitch to receive the live video and audio from speaker 1 at site A. Through a web browser, typist 3 also logs into the web server 11 page of the ASR caption server 7 and reads and listens to speaker 1's text and audio on that page.
Typist 3 is granted read-write permission on the ASR caption server 7 and can therefore edit the text produced by Kaldi ASR 10 on the web server 11 page. Every text segment on the page carries a tag attribute: for example, when typist 3 double-clicks text segment C, the web server 11 page follows the tag and requests the segment of audio record 16 starting at second N3 and lasting Z seconds for playback. Typist 3 can thus verify what speaker 1 actually said and correct the text.
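The playback request above amounts to slicing the audio record at the position named by the tag. Assuming the recording module stores 16-bit mono PCM at 16 kHz (an assumption; the patent does not fix the format), the byte range for a tagged segment can be computed directly:

```python
SAMPLE_RATE = 16000   # Hz, assumed capture rate of the recording module
BYTES_PER_SAMPLE = 2  # 16-bit mono PCM (assumed format)

def slice_recording(pcm: bytes, start_sec: float, duration_sec: float) -> bytes:
    """Return the raw PCM bytes for one tagged segment, so the typist
    can replay exactly the audio behind a single caption line."""
    bytes_per_second = SAMPLE_RATE * BYTES_PER_SAMPLE
    begin = int(start_sec * bytes_per_second)
    end = begin + int(duration_sec * bytes_per_second)
    return pcm[begin:end]

# Ten seconds of silence stand in for the audio record:
record = bytes(10 * SAMPLE_RATE * BYTES_PER_SAMPLE)
clip = slice_recording(record, 3.0, 2.0)
print(len(clip))  # 2 s * 16000 samples/s * 2 bytes = 64000
```

Because the tag stores both the start second and the duration, the server never needs to search the recording; the lookup is pure arithmetic.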
Please refer to FIG. 6, which illustrates how speaker 1 at site A merges the live-broadcast screen 6 with the captions for output. Speaker 1's broadcast equipment 2 can log into the web server 11 page of the ASR caption server 7 through a web browser, but with read-only permission. In other words, the broadcast equipment 2 can only display the text transcribed by the ASR caption server 7 and the text as corrected by typist 3.
The OBS software 13 can overlay layers. On the broadcast equipment 2, speaker 1 selects and captures the caption content from the web server 11 page of the ASR caption server 7; the video and audio 17 captured by the broadcast equipment 2 are overlaid with the caption content 18 captured from the browser, and the OBS software 13 outputs a live-broadcast screen 6 containing the captions produced by the ASR caption server 7. The OBS software 13 then pushes the stream to a platform such as YouTube, Facebook, or Twitch, so that the hearing-impaired viewer 5 at site C sees the captions in the caption area 61 of the live-broadcast screen 6.
The spirit and scope of the present invention are determined by the following claims and are not limited to the embodiments described above.
Claims (10)
Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
---|---|---|---
TW110139500A | 2021-10-25 | 2021-10-25 | Caption service system for remote speech recognition
Publications (1)

Publication Number | Publication Date
---|---
TW202318252A | 2023-05-01
Family
ID=87378808