KR20000003717A

KR20000003717A - Voice communicating method using voice recognition/synthesis and apparatus performing the same

Info

Publication number: KR20000003717A
Application number: KR1019980024987A
Authority: KR
Inventors: 홍상진
Original assignee: 윤종용; 삼성전자 주식회사
Priority date: 1998-06-29
Filing date: 1998-06-29
Publication date: 2000-01-25

Abstract

PURPOSE: A voice communication method is provided that enables a user at the receiving side to understand a voice signal even when traffic is heavy. CONSTITUTION: The voice communication method using voice recognition/synthesis comprises the steps of: transmitting, from a transmitting side to a data network, a voice data packet having a header in which at least a transmission frame number is written (500); comparing, at a receiving side, the transmission frame number with a predetermined transmission frame number to calculate a frame loss ratio (504 to 510); transmitting a header in which the calculated frame loss ratio is written back to the transmitting side; and, when the frame loss ratio exceeds a predetermined threshold value, transmitting from the transmitting side a text data packet produced using voice recognition/synthesis instead of the voice data packet (512).

Description

Voice communication method and apparatus using voice recognition/synthesis

The present invention relates to multimedia communication, and more particularly to a voice communication apparatus and method that use voice recognition/synthesis when traffic is heavy.

FIG. 1 is a block diagram illustrating a conventional multimedia communication environment. A system that integrates video and voice communication devices for video and voice communication broadly comprises video input/output devices (100, 102), voice input/output devices (110, 112), a video encoder/decoder (120, 122), a voice encoder/decoder (130, 132), and a multiplexer (MUX)/demultiplexer (DEMUX) (150).

Referring to FIG. 1, consider first the transmitting part of a conventional system for sending video and voice. The video and voice input through the video input device (100) and the voice input device (110) are encoded by the video encoder (120) and the voice encoder (130), respectively. The MUX/DEMUX (150) generates a data stream, taking into account matters such as the synchronization of the two encoded signals, and transmits it to the target system over a given communication network. In a network that carries video and voice, traffic generally changes from moment to moment, so the amount of data in the stream is adjusted to suit the current conditions before it is sent. Conversely, a signal received by the MUX/DEMUX (150) is separated into a video data stream and a voice data stream, and the original video and the original voice are recovered through the video decoder (122) and the voice decoder (132), respectively.

Techniques for handling network traffic have matured as multimedia communication has come into the spotlight. When traffic is heavy, however, communication quality inevitably degrades: either the receiving side fails to receive all of the data sent by the transmitting side, or the transmitting side adapts to the traffic and sends only a small amount of data. If some scenes are lost from the received video in such a case, the user sees unnatural footage but can generally still grasp the overall information. If a certain amount of data is lost from the received voice, however, the user can no longer understand the information and is inconvenienced more than by the loss of video.

A technical object of the present invention is to provide a voice communication method using voice recognition/synthesis that checks the frame loss ratio during data transmission and, when traffic is heavy, uses voice recognition and voice synthesis so that the user at the receiving side can understand the voice signal.

Another technical object of the present invention is to provide a voice communication apparatus using voice recognition/synthesis that performs the above voice communication method.

FIG. 1 is a block diagram illustrating a conventional multimedia communication environment.

FIG. 2 is a diagram conceptually illustrating data transmission between two systems to which the present invention is applied.

FIG. 3 is a block diagram illustrating a voice communication apparatus according to the present invention.

FIGS. 4(a) to 4(c) are diagrams illustrating the transmission data structure used in the present invention.

FIG. 5 is a flowchart illustrating a voice communication method according to the present invention.

To achieve the above object, a voice communication method according to the present invention, which uses voice recognition/synthesis when traffic is heavy, comprises the steps of: transmitting, at a transmitting side, a voice data packet having a header in which at least a transmission frame number is specified to a data communication network; calculating, at a receiving side, a frame loss ratio by comparing the transmission frame number of the received voice data packet with a predetermined transmission frame number; transmitting a header in which the calculated frame loss ratio is specified back to the transmitting side; and, if the frame loss ratio exceeds a predetermined threshold, thereafter transmitting from the transmitting side a text data packet produced using voice recognition/synthesis in place of the voice data packet.

To achieve the other object, a voice communication apparatus according to the present invention, which uses voice recognition/synthesis when traffic is heavy, comprises: transmitting-side and receiving-side mode switching units that select, in response to a control signal, whether voice data to be transmitted is processed in a voice mode or in a voice recognition/synthesis mode; a voice encoder/decoder that encodes/decodes the voice data when the mode switching units select the voice mode; a voice recognition/synthesis unit that, when the mode switching units select the voice recognition/synthesis mode, converts the voice data into text data for transmission when sending and performs the reverse conversion when receiving; and a frame loss ratio calculation and comparison unit that calculates a frame loss ratio from the transmission result of previously transmitted voice data and generates the control signal when the calculated value exceeds a predetermined threshold.

Hereinafter, a voice communication method using voice recognition/synthesis according to the present invention, and an apparatus therefor, are described with reference to the accompanying drawings.

FIG. 2 is a diagram conceptually illustrating data communication between two systems to which the present invention is applied.

Referring to FIG. 2, when data communication takes place between, for example, a first system (200) and a second system (210), video/voice/text data packets are transmitted in the present invention, and a frame loss ratio (FLR) is transmitted in addition. The FLR is determined by the traffic on the data communication network; when the FLR is greater than a predetermined threshold, text is transmitted in place of voice. That is, depending on the FLR, video and text data packets are generated instead of video and voice data packets.

FIG. 3 is a block diagram illustrating a voice communication apparatus according to the present invention. In a system that integrates video and voice communication devices for video and voice communication, the video communication device comprises video input/output devices (300, 302), a video encoder/decoder (310, 312), and a multiplexer (MUX)/demultiplexer (DEMUX) (360). The voice communication apparatus that characterizes the present invention comprises voice input/output devices (320, 322), first and second mode switching units (330, 332), a voice encoder/decoder (340, 342), a voice recognition module/TTS (text-to-speech) unit (350), a word list (352), the multiplexer (MUX)/demultiplexer (DEMUX) (360), and an FLR calculation and comparison unit (370). The system shown in FIG. 3 represents the system on one side of FIG. 2 and differs from the conventional apparatus in its voice communication apparatus.

Referring to FIG. 3, the video and voice input through the video input device (300) and the voice input device (320) are normally encoded by the video encoder (310) and the voice encoder (340), respectively. The MUX/DEMUX (360) generates a data stream, taking into account matters such as the synchronization of the two encoded signals, and transmits it to the target system over a given communication network.

When traffic is heavy, however, the voice input through the voice input device (320) is routed by the first mode switching unit (330) to the voice recognition module/TTS unit (350) instead of the voice encoder (340). Under the control of the FLR calculation and comparison unit (370), the first mode switching unit (330) determines whether the input voice is processed in the normal voice mode or in the voice recognition mode for conversion to text. The FLR calculation and comparison unit (370) calculates the FLR from the result of the previous data transmission, conveyed through the MUX/DEMUX (360), and generates a control signal for switching to the voice recognition mode when the calculated value exceeds a predetermined threshold. In response to this control signal, the first mode switching unit (330) on the transmitting side and the second mode switching unit (332) on the receiving side switch the transmission path to the voice recognition module/TTS unit (350).
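
The routing described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the `Mode` enum, the function names, and the string labels for the destination blocks are hypothetical, and the strict "greater than" comparison follows the wording "exceeds a predetermined threshold".

```python
from enum import Enum


class Mode(Enum):
    VOICE = 0            # normal voice mode (encoder/decoder path)
    RECOGNITION_TTS = 1  # voice recognition/TTS mode (text path)


def control_signal(flr: float, threshold: float) -> Mode:
    """Mimic the FLR calculation and comparison unit (370): pick a mode."""
    # The text path is selected only when the FLR exceeds the threshold.
    return Mode.RECOGNITION_TTS if flr > threshold else Mode.VOICE


def first_switch(mode: Mode) -> str:
    """Transmitting-side switch (330): where the input voice is routed."""
    if mode is Mode.RECOGNITION_TTS:
        return "voice recognition module (350)"
    return "voice encoder (340)"


def second_switch(mode: Mode) -> str:
    """Receiving-side switch (332): where the received payload is routed."""
    if mode is Mode.RECOGNITION_TTS:
        return "TTS (350)"
    return "voice decoder (342)"


# Hypothetical values: an FLR of 0.3 against an assumed threshold of 0.2.
mode = control_signal(flr=0.3, threshold=0.2)
print(first_switch(mode), "/", second_switch(mode))
```

The source does not specify a threshold value; 0.2 here is only a placeholder.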

When voice communication is carried out using the voice recognition module/TTS unit (350), the transmitting side converts the input voice into text and transmits it; specifically, it converts those words in the user's spoken phrases that can be found in the word list (352). In general, transmitting in this way requires less data than a compressed voice signal. The word list (352) is consulted because a 100% recognition rate cannot be guaranteed. The user can therefore add a suitable number of recognizable phrases to the word list (352) so that they can be transmitted. On the receiving side, the text received in place of the voice data stream is passed to the voice recognition module/TTS unit (350) selected by the second mode switching unit (332), where it is either converted from text to voice by voice synthesis and played to the user, or shown to the user as text. The receiving side thus receives a small amount of data and, particularly when the information arrives as text, can accurately understand what the other party sent.
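
As a rough sketch of the word-list lookup described above, the following assumes that a recognizer has already produced a list of candidate words. The function names, the case-insensitive matching, and the space-separated UTF-8 payload are illustrative assumptions; the source does not specify any of them.

```python
from typing import Iterable, List


def words_to_transmit(recognized: Iterable[str], word_list: Iterable[str]) -> List[str]:
    """Keep only the recognized words that are registered in the word list."""
    # Assumption: matching ignores letter case.
    allowed = {word.lower() for word in word_list}
    return [word for word in recognized if word.lower() in allowed]


def encode_text_payload(words: List[str]) -> bytes:
    """Serialize the kept words as a text payload (hypothetical encoding)."""
    return " ".join(words).encode("utf-8")


# Hypothetical recognizer output and user-maintained word list.
spoken = ["please", "call", "me", "tomorrow"]
word_list = ["call", "tomorrow", "meeting"]

payload = encode_text_payload(words_to_transmit(spoken, word_list))
print(payload)  # b'call tomorrow'
```

Words outside the list are simply dropped in this sketch, which reflects the point that only phrases registered in the word list are converted and transmitted.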

FIGS. 4(a) to 4(c) are diagrams illustrating the transmission data structure used in the present invention: FIG. 4(a) shows the video/voice packet structure, FIG. 4(b) shows the video/text packet structure, and FIG. 4(c) shows the details of the header.

Referring to FIG. 4, when the first and second mode switching units (330, 332) select the voice mode, as is normally the case, the data stream transmitted through the MUX/DEMUX (360) has the video/voice packet structure shown in FIG. 4(a). When traffic is heavy and the first and second mode switching units (330, 332) select the voice recognition/TTS mode, the data stream transmitted through the MUX/DEMUX (360) has the video/text packet structure shown in FIG. 4(b). A data packet includes a header in front of the frames to carry information about the frames being transmitted. In a data packet according to the present invention, the header includes at least a mode flag indicating whether the packet is in video/voice mode or video/voice recognition/TTS mode, a frame number, and the FLR.
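
The three header fields named above can be illustrated with a small packing sketch. The field widths, byte order, and flag values are assumptions made only for the example; the source states which fields the header contains but not how they are laid out.

```python
import struct
from dataclasses import dataclass

# Hypothetical layout: 1-byte mode flag, 2-byte frame number, 4-byte float FLR,
# in network byte order with no padding (7 bytes in total).
HEADER_FORMAT = "!BHf"
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)

MODE_VIDEO_VOICE = 0  # assumed flag value for video/voice mode
MODE_VIDEO_TEXT = 1   # assumed flag value for video/voice recognition/TTS mode


@dataclass
class Header:
    mode_flag: int
    frame_number: int
    flr: float

    def pack(self) -> bytes:
        """Serialize the header so it can be placed in front of the frames."""
        return struct.pack(HEADER_FORMAT, self.mode_flag, self.frame_number, self.flr)

    @classmethod
    def unpack(cls, packet: bytes) -> "Header":
        """Read the header back from the start of a received packet."""
        return cls(*struct.unpack(HEADER_FORMAT, packet[:HEADER_SIZE]))


# Build a packet: header followed by a placeholder payload.
packet = Header(MODE_VIDEO_TEXT, 30, 0.25).pack() + b"frames..."
print(Header.unpack(packet))
```

A fixed-size binary header keeps parsing trivial; a real system would also need to delimit the video and voice/text frames that follow, which this sketch leaves out.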

The voice communication method between two systems, each configured as shown in FIG. 3, is now described in detail with reference to FIGS. 4 and 5. First, in an environment where the traffic condition has not yet been determined, the first system transmits to the second system a packet 1 whose packet header specifies the transmission frame number (step 500). At this point the transmission data has the video/voice packet structure shown in FIG. 4(a). The second system receives packet 1 (step 502). It then checks the frame number specified in the packet header of the received packet 1 and calculates the FLR (step 504). That is, the second system, as the receiving side, calculates the FLR by comparing the frame number contained in the header of the received data packet with the frame number that should actually have been received.
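
The source does not give a formula for the FLR, so the following is only one plausible reading of step 504: the header value is taken as the number of frames the sender transmitted, the receiver counts the frames that actually arrived, and the FLR is the lost fraction. The function and parameter names are hypothetical.

```python
def frame_loss_ratio(frames_sent: int, frames_received: int) -> float:
    """Return the fraction of frames lost, under the assumptions stated above."""
    if frames_sent <= 0:
        raise ValueError("frames_sent must be positive")
    # Clamp at zero in case duplicates make the received count exceed the sent count.
    lost = max(frames_sent - frames_received, 0)
    return lost / frames_sent


# Hypothetical numbers: the header announces 30 frames, 24 arrive.
print(frame_loss_ratio(frames_sent=30, frames_received=24))  # 0.2
```

Under this reading an FLR of 0 means nothing was lost and an FLR of 1 means nothing arrived.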

After the FLR has been calculated, when the second system sends a packet to the first system, it transmits a packet 2 whose packet header specifies the FLR (step 506). The first system receives packet 2 (step 508). The first system, as the transmitting side, compares the FLR contained in the header of the data packet with a predetermined threshold, namely the loss ratio threshold, and determines whether the threshold is exceeded (step 510). If the threshold is exceeded, the system switches to the voice recognition/TTS mode, and data transmitted from then on has the video/text packet structure shown in FIG. 4(b); if the threshold is not exceeded, the system remains in voice mode as before.
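
Steps 500 to 510 can be walked through with a small in-memory simulation. The class and method names, the threshold of 0.2, and the FLR formula (lost frames divided by announced frames) are all assumptions made for illustration; no real network transport is involved.

```python
from dataclasses import dataclass


@dataclass
class Header:
    mode: str           # "voice" or "text" (stands in for the mode flag)
    frame_number: int   # frame number written by the sender
    flr: float = 0.0    # frame loss ratio reported by the peer


class System:
    """One endpoint of the exchange; the threshold value is an assumption."""

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold
        self.mode = "voice"

    def send_voice(self, frame_number: int) -> Header:
        # Step 500: send packet 1 with the transmission frame number in the header.
        return Header(mode=self.mode, frame_number=frame_number)

    def report_loss(self, received: Header, frames_received: int) -> Header:
        # Steps 502 to 506: receive packet 1, compute the FLR, return it in packet 2.
        lost = max(received.frame_number - frames_received, 0)
        return Header(mode=self.mode, frame_number=0, flr=lost / received.frame_number)

    def apply_report(self, report: Header) -> str:
        # Steps 508 to 510: receive packet 2 and compare the FLR with the threshold.
        self.mode = "text" if report.flr > self.threshold else "voice"
        return self.mode


first, second = System(threshold=0.2), System(threshold=0.2)
packet_1 = first.send_voice(frame_number=30)
packet_2 = second.report_loss(packet_1, frames_received=21)  # 9 of 30 frames lost
print(first.apply_report(packet_2))  # text
```

In this sketch the first system would send video/text packets from this point on; a report with an FLR at or below the threshold would leave it in voice mode.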

As described above, the voice communication method using voice recognition/synthesis according to the present invention, and the apparatus therefor, use voice recognition and voice synthesis when traffic is heavy, and thereby have the advantage of enabling the user at the receiving side to understand the voice signal.

Claims

1. A voice communication method using voice recognition/synthesis when traffic is heavy, the method comprising:

transmitting, at a transmitting side, a voice data packet having a header in which at least a transmission frame number is specified to a data communication network;

calculating, at a receiving side, a frame loss ratio by comparing the transmission frame number of the received voice data packet with a predetermined transmission frame number;

transmitting a header in which the calculated frame loss ratio is specified back to the transmitting side; and

when the frame loss ratio exceeds a predetermined threshold, thereafter transmitting, from the transmitting side, a text data packet produced using voice recognition/synthesis in place of the voice data packet.

2. A voice communication apparatus using voice recognition/synthesis when traffic is heavy, the apparatus comprising:

transmitting-side and receiving-side mode switching units for selecting, in response to a control signal, whether voice data to be transmitted is processed in a voice mode or in a voice recognition/synthesis mode;

a voice encoder/decoder for encoding/decoding the voice data when the mode switching units select the voice mode;

a voice recognition/synthesis unit for, when the mode switching units select the voice recognition/synthesis mode, converting the voice data into text data for transmission when sending and performing the reverse conversion when receiving; and

a frame loss ratio calculation and comparison unit for calculating a frame loss ratio from a transmission result of previously transmitted voice data and generating the control signal when the calculated value exceeds a predetermined threshold.

3. The apparatus of claim 2, wherein the voice recognition/synthesis unit

uses a word list trained on predetermined phrases that can be recognized, and converts into text those words, among the phrases of the voice data to be transmitted, that are found in the word list.