KR102136393B1

KR102136393B1 - Apparatus and Method for managing text changed from voice in call

Info

Publication number: KR102136393B1
Application number: KR1020180084074A
Authority: KR
Inventors: 안영수; 이상훈; 이은동; 신동진; 임치완; 권기재; 백민석; 전준용
Original assignee: 주식회사 케이티
Priority date: 2018-07-19
Filing date: 2018-07-19
Publication date: 2020-07-21
Also published as: KR20200009556A

Abstract

An apparatus and method for converting voice during a call into text and managing the text are disclosed. According to one aspect, an apparatus for converting a service subscriber's call voice into text and managing the converted text includes: a receiving unit that receives voice data of a call in which the service subscriber is the caller or the recipient; a classification unit that classifies the received voice data into outgoing voice data and received voice data; a conversion unit that converts the classified voice data into outgoing text data and received text data, respectively; a call text unit that distinguishes the caller's outgoing text from the recipient's received text and arranges them in chronological order to generate a call text; and a providing unit that, at the service subscriber's request, retrieves the generated call text and provides it to the service subscriber's terminal.

Description

Apparatus and Method for managing text changed from voice in call

The present invention relates to voice call technology, and more particularly to an apparatus and method for converting the conversational voice of a caller and a recipient into text during a call and managing the converted text.

Previously, a call recording service was the only way to store the content of a call. The user installed a recording application on a mobile terminal and, during a mobile call, pressed the application's record button to save the call content as a voice file.

To check the recorded content, the user has to listen to the recorded voice file from beginning to end. If there are several recorded voice files, the user has to listen to all of them until the desired content is found.

Therefore, with a plurality of voice files, it is difficult to determine which voice file contains the desired content and where in that file it is located.

Korean Registered Patent No. 10-0935524 (Dec. 28, 2009)

The present invention is intended to solve the above problem, and its object is to provide an apparatus and method for converting the call voice between a caller and a recipient into text and managing the converted text.

According to one aspect, an apparatus for converting a service subscriber's call voice into text and managing the converted text includes: a receiving unit that receives voice data of a call in which the service subscriber is the caller or the recipient; a classification unit that classifies the received voice data into outgoing voice data and received voice data; a conversion unit that converts the classified voice data into outgoing text data and received text data, respectively; a call text unit that distinguishes the converted outgoing text of the caller from the received text of the recipient and arranges them in chronological order to generate a call text; and a providing unit that, at the service subscriber's request, retrieves the generated call text and provides it to the service subscriber's terminal.

The apparatus further includes a DTMF unit that receives a DTMF (Dual-Tone Multi-Frequency) signal requesting text conversion from the call terminal of the service subscriber during the call, and the classification unit performs the classification in response to the received DTMF signal.

The classification unit classifies the outgoing and received voice data by referring to the source IP and port and the destination IP and port of the SIP (Session Initiation Protocol) message, together with the synchronization source identifier of the RTP (Real-time Transport Protocol) packets.

The apparatus further includes a silence removal unit that removes SID (Silence Indicator) packets, which correspond to silence, from the RTP packets of the classified outgoing and received voice data so that only voice packets remain, and the conversion unit converts the remaining voice packets into text using an STT (Speech To Text) engine.

The apparatus further includes a timestamp unit that calculates a voice occurrence time, as timestamp information of the converted text, using the time duration of an RTP packet obtained from the ptime value of the SDP (Session Description Protocol) of the SIP message.

The timestamp unit identifies the codec information of the RTP packets, determines from the sampling rate of the identified codec how much the timestamp increases per second, calculates the time duration of an SID packet from how much the SID packet's timestamp value has increased relative to the immediately preceding packet, and calculates the voice occurrence time using the time duration of the RTP packet and the calculated time duration of the SID packet.

The call text unit stores the call text, which includes a calling phone number, a receiving phone number, a total call time, at least one set of outgoing text data and voice occurrence time, and at least one set of received text data and voice occurrence time.

The apparatus further includes a mixing unit that sets the start position of the later-arriving outgoing or received voice data by moving back, from the start position of whichever of the outgoing and received voice data arrived first, by the time of the RTP packets of the first-arrived voice data, and that uses the set start position to mix the voice streams of the outgoing voice data and the received voice data into a single integrated voice data stream; the providing unit provides the integrated voice data.

The providing unit provides the call text to the service subscriber's terminal using at least one of a text message, e-mail, an SNS (Social Network Service), and a web page.

According to another aspect, a method by which an apparatus converts a service subscriber's call voice into text and manages the converted text includes: receiving voice data of a call in which the service subscriber is the caller or the recipient; classifying the received voice data into outgoing voice data and received voice data; converting the classified voice data into outgoing text data and received text data, respectively; generating a call text by distinguishing the converted outgoing text of the caller from the received text of the recipient and arranging them in chronological order; and, at the service subscriber's request, retrieving the generated call text and providing it to the service subscriber's terminal.

According to one aspect of the present invention, call content is managed as text, so that the user can look up and search the text of the call content as needed.

When call content is converted into text, silent packets are removed and only packets containing call voice are converted, which reduces the processing load on the STT engine.

When the call content of the caller and the recipient is stored as a single integrated voice data, the start position of the later-arriving voice packets of the calling or receiving side is set back by the time duration of the packets of the side that arrived first, so that a single integrated voice data from which crosstalk has been removed can be stored.

The following drawings attached to this specification illustrate preferred embodiments of the present invention and, together with the detailed description below, serve to further the understanding of the technical idea of the present invention; the present invention should therefore not be construed as limited to the matters shown in those drawings.
FIG. 1 is a schematic configuration diagram of a call management system according to an embodiment of the present invention.
FIG. 2 is a schematic configuration diagram of the text management server of FIG. 1.
FIG. 3 is an exemplary diagram in which the text management server of FIG. 2 classifies Tx and Rx voice data.
FIGS. 4A and 4B are exemplary diagrams in which the text management server of FIG. 2 classifies call voice packets into outgoing voice packets and received voice packets.
FIG. 5 is an exemplary diagram of an SID packet among the call voice packets received by the text management server of FIG. 2.
FIG. 6 is an exemplary diagram in which the text management server of FIG. 2 calculates the time of the timestamp of the SID packet of FIG. 5.
FIG. 7 is an exemplary diagram in which the text management server of FIG. 1 mixes outgoing voice data and received voice data into one integrated voice stream according to another embodiment of the present invention.
FIGS. 8A and 8B are exemplary diagrams of delaying the start position of the later-arriving voice data of the calling or receiving side in the mixing of FIG. 7.
FIG. 9 is a schematic flowchart of a method for managing the text of call voice according to an embodiment of the present invention.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings. Before that, the terms and words used in this specification and the claims should not be construed as limited to their ordinary or dictionary meanings; rather, on the principle that an inventor may appropriately define the concepts of terms in order to describe the invention in the best way, they should be interpreted with meanings and concepts consistent with the technical idea of the present invention. Accordingly, the embodiments described in this specification and the configurations shown in the drawings are merely the most preferred embodiments of the present invention and do not represent all of its technical ideas, so it should be understood that various equivalents and modifications that could replace them may exist at the time of filing this application.

FIG. 1 is a schematic configuration diagram of a call management system 100 according to an embodiment of the present invention.

The call management system 100 according to an embodiment of the present invention includes a caller call terminal 110, a text management server 130, a TAS (Telephony Application Server) 140, a CSCF (Call Session Control Function) 150, and a recipient call terminal 170.

When the caller call terminal 110 places a call to the recipient call terminal 170, the call data is relayed through the carrier's servers, and when the recipient call terminal 170 accepts the incoming call, the call is connected and the voice call between the caller and the recipient begins. The call data is divided into signaling data (e.g., SIP messages) and voice data (e.g., RTP packets), and each kind of data has its own processing and transmission path.

In a VoLTE environment, the carrier's servers include the TAS 140 and the CSCF 150 for processing the signaling data. The CSCF 150 relays the SIP messages of the call data from the call processing server. The TAS 140 is a server that provides call-based multimedia supplementary services (e.g., caller ID service, ringback tone service, call waiting service) to the subscribers of those services. When a call service subscriber subscribes to the call management service of the present invention, the service subscription information is managed by the TAS 140. The voice data of the call data is relayed to the caller call terminal 110 and the recipient call terminal 170 through the carrier's call processing server.

In the present invention, this call processing server is implemented as the text management server 130. While relaying the voice data of the call data between the caller call terminal 110 and the recipient call terminal 170, the text management server 130 provides subscribers of the call management service with a service that converts call content into text. That is, at the service subscriber's request, the call management service can convert the content of a voice call into text, manage the converted text, and provide the subscriber with lookup and search services.

FIG. 2 is a schematic configuration diagram of the text management server 130 of FIG. 1.

The text management server 130 according to an embodiment of the present invention includes a receiving unit 231, a DTMF unit 232, a classification unit 233, a silence removal unit 234, a conversion unit 235, a timestamp unit 236, a call text unit 237, and a providing unit 239.

Assuming that the text management server 130 is a computer terminal composed of a memory and a processor, each of the components 231 to 239 may be loaded into the memory in the form of a program and executed by the processor. For example, the components 231 to 239 may be built as a text management program and then executed by the processor of the text management server 130 to provide the text management service using the text converted from the voice of the call.

The receiving unit 231 receives, as RTP packets, the voice data of the caller and of the recipient generated during the call. The outgoing voice data is relayed through the text management server 130 to the recipient call terminal 170 as its destination, and the received voice data is relayed through the text management server 130 to the caller call terminal 110 as its destination.

The DTMF unit 232 receives, during a call of a subscriber to the call management service of the present invention, a DTMF signal from the subscriber's call terminal requesting text conversion of the call voice. That is, to request text conversion of the call voice during a call, the subscriber presses a specific key (e.g., the * key) to generate the DTMF signal.

When the DTMF unit 232 detects the DTMF signal for the text management service, the classification unit 233 classifies the voice data received by the receiving unit 231 into outgoing voice data and received voice data. The classification of the voice data is described below with reference to FIGS. 3 to 4B.

The silence removal unit 234 removes silent RTP packets from the outgoing and received voice data classified by the classification unit 233, because silence does not need to be converted into text. After the removal, only RTP packets containing call voice remain. The silent RTP packets are described below with reference to FIGS. 5 and 6.

The conversion unit 235 converts the voice of the outgoing RTP packets into outgoing text and the voice of the received RTP packets into received text. The conversion unit 235 consists of an STT engine and may be built as a separate STT server.

Because the conversion unit 235 receives only the call-voice RTP packets that remain after the silence removal unit 234 has removed the silent RTP packets, and performs STT-based text conversion on those alone, its data processing load is reduced.

The timestamp unit 236 converts the timestamp value of an RTP packet into a time and sets the calculated time as the timestamp information corresponding to the call time of the converted text. For example, the text converted from the call voice is stamped with its occurrence time, like a message.

The timestamp unit 236 calculates the time because the clocks of the text management server 130, the caller call terminal 110, and the recipient call terminal 170 differ, and because RTP packets take different paths with different transit times. In addition, the order in which RTP packets arrive at the text management server 130 does not necessarily match the order in which the voice occurred, so the arrival time must not be used as the voice occurrence time. Taking this into account, the timestamp unit 236 sets the call start time as a reference and then calculates each occurrence time from the timestamp value contained in the corresponding RTP packet. As a result, the voice occurrence time of the text converted from voice is accurate. The call start time serving as the reference may be the arrival time of the first packet. The calculation is described below with reference to FIG. 6.
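The reference-plus-offset calculation described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `occurrence_time` and its parameters are hypothetical, the first packet's RTP timestamp is taken as the reference point, and the wraparound handling reflects the 32-bit size of the RTP timestamp field rather than anything stated in this description.

```python
from datetime import datetime, timedelta

# The RTP timestamp field is 32 bits wide, so differences are taken modulo 2**32.
RTP_TS_MODULO = 2 ** 32


def occurrence_time(call_start: datetime, first_rtp_ts: int,
                    rtp_ts: int, sample_rate_hz: int) -> datetime:
    """Map an RTP timestamp to a wall-clock voice occurrence time.

    Hypothetical helper: `call_start` is the reference time (for example the
    arrival time of the first packet) and `first_rtp_ts` is that packet's RTP
    timestamp. Assumes no packet carries a timestamp earlier than the first.
    """
    elapsed_ticks = (rtp_ts - first_rtp_ts) % RTP_TS_MODULO
    # Integer arithmetic avoids floating-point rounding in the offset.
    elapsed_us = elapsed_ticks * 1_000_000 // sample_rate_hz
    return call_start + timedelta(microseconds=elapsed_us)
```

Because the result depends only on the timestamps carried inside the packets, a packet that arrives late or out of order is still assigned the time at which its audio was produced.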

The call text unit 237 distinguishes the caller's outgoing text from the recipient's received text, both converted by the conversion unit 235, arranges them in the chronological order calculated by the timestamp unit 236 to generate a call text, and stores the generated call text.

So that it can be provided to the service subscriber, the stored call text information includes a calling phone number, a receiving phone number, a total call time, at least one set of outgoing text data and voice occurrence time, and at least one set of received text data and voice occurrence time. That is, each set consists of a text sentence and its voice occurrence time. The screen UI on which the call text information is displayed may be laid out like a messenger screen and is not subject to any particular restriction.
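One possible shape for such a stored record is sketched below in Python. The class and field names (`Utterance`, `CallText`, `build_call_text`) and the millisecond time unit are assumptions chosen for illustration; the description only specifies which pieces of information the call text contains.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Utterance:
    speaker: str   # "caller" (outgoing text) or "recipient" (received text)
    time_ms: int   # voice occurrence time, here as an offset from call start
    text: str      # one converted text sentence


@dataclass
class CallText:
    caller_number: str
    recipient_number: str
    total_duration_ms: int
    utterances: List[Utterance] = field(default_factory=list)


def build_call_text(caller_number: str, recipient_number: str,
                    total_duration_ms: int,
                    outgoing: List[Tuple[int, str]],
                    received: List[Tuple[int, str]]) -> CallText:
    """Merge (time_ms, text) sets from both sides into chronological order."""
    merged = [Utterance("caller", t, s) for t, s in outgoing]
    merged += [Utterance("recipient", t, s) for t, s in received]
    merged.sort(key=lambda u: u.time_ms)  # stable: ties keep the caller first
    return CallText(caller_number, recipient_number, total_duration_ms, merged)
```

Keeping the speaker label on every entry is what lets a messenger-style screen place each sentence on the caller's or the recipient's side.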

At the service subscriber's request, the providing unit 239 retrieves the call text managed by the call text unit 237 and provides it to the service subscriber's terminal.

The call text information may be provided to the service subscriber's terminal using at least one of a text message, e-mail, an SNS (Social Network Service), and a web page. For example, a web page in which the call content of the caller and the recipient is listed as text, like the screen UI of KakaoTalk, may be provided to the service subscriber's terminal.

FIG. 3 is an exemplary diagram in which the text management server 130 of FIG. 2 classifies Tx and Rx voice data.

When service subscriber A is the caller and places a call to the recipient, the Tx stream of the outgoing data is A's own voice, and the Rx stream of the received data is the recipient's voice that caller A hears.

The classification unit 233 of the text management server 130 classifies the RTP packets of the call data into a Tx stream of outgoing voice data and an Rx stream of received voice data by referring to the source IP and port (src IP/Port) and the destination IP and port (dst IP/Port) contained in the SIP (Session Initiation Protocol) messages of the call data, together with the synchronization source identifier (SSRC) of the RTP packets. The SSRC identifies the source of the call audio whose packets are joined into a stream by synchronization.

For example, when service subscriber A is the caller and places a call to the recipient, the RTP packets whose source IP/Port is A's call terminal and whose destination IP/Port is the text management server 130 are classified as the Tx stream of outgoing voice data, that is, what A says. The RTP packets whose source IP/Port is the text management server 130 and whose destination IP/Port is A's call terminal are classified as the Rx stream of received voice data, that is, what the other party says and A hears.
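The direction test in this example can be sketched in Python as follows. This is a minimal illustration, not the server's actual implementation: `RtpPacket`, `classify_streams`, and the endpoint tuples are hypothetical names, and the sketch assumes the subscriber's and the server's IP/port pairs have already been read from the SIP signaling. The SSRC is carried on each packet so that a caller of the function could additionally separate streams by source.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple

Endpoint = Tuple[str, int]  # (IP address, port)


@dataclass(frozen=True)
class RtpPacket:
    src: Endpoint
    dst: Endpoint
    ssrc: int            # synchronization source identifier of the stream
    payload: bytes = b""


def classify_streams(packets: Iterable[RtpPacket],
                     subscriber: Endpoint,
                     server: Endpoint) -> Tuple[List[RtpPacket], List[RtpPacket]]:
    """Split packets into (tx, rx) as seen from the service subscriber."""
    tx: List[RtpPacket] = []
    rx: List[RtpPacket] = []
    for p in packets:
        if p.src == subscriber and p.dst == server:
            tx.append(p)   # what the subscriber says
        elif p.src == server and p.dst == subscriber:
            rx.append(p)   # what the subscriber hears
        # packets matching neither direction are ignored in this sketch
    return tx, rx
```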

FIGS. 4A and 4B are exemplary diagrams in which the text management server 130 of FIG. 2 classifies call voice packets into outgoing voice packets and received voice packets.

FIG. 4A shows the voice RTP packets of the caller and the recipient arriving at the text management server 130. Assuming that RTP packet 1 is received voice and RTP packet 2 is outgoing voice, three packets of the receiving side's RTP packet 1 arrive first at the text management server 130, followed by two packets of the calling side's RTP packet 2. As the conversation proceeds during the call, further RTP packets arrive at the text management server 130.

If each RTP packet carries 20 ms of voice data, this can be interpreted as the recipient (RTP packet 1) speaking first for 60 ms, followed by the caller (RTP packet 2).

Referring to FIG. 4B, the classification unit 233 separates the voice packets into the receiving side's RTP packet 1 and the calling side's RTP packet 2 by referring to the IP/Port values and the SSRC value as in FIG. 3.

FIG. 5 is an exemplary diagram of an SID packet 500 among the call voice packets received by the text management server of FIG. 2.

The RTP packets of the voice data are divided into voice packets, which contain actual human speech, and silent SID packets 500. An SID packet 500 is a silent RTP packet generated when the user stops speaking and listens.

In an actual call, while the caller or the recipient is speaking, the other party is listening. When the voice data is separated into calling-side Tx data and receiving-side Rx data, roughly half of each separated stream is therefore silence. That is, about half of the RTP packets of the call data may be SID packets 500 corresponding to that silence.

A silent SID packet 500 can be identified by checking the "FT" field defined in the RTP payload against the value defined for each codec. For the AMR-NB codec of FIG. 5, an "FT" value of 8 means that the RTP packet is an SID packet 500 in which the user's voice is silent. For the AMR-WB codec, an "FT" value of 9 indicates an SID packet 500. How the codec information is determined is described below with reference to FIG. 6. The silence removal unit 234 checks the value of the "FT" field in each RTP packet, filters out the SID packets 500, and keeps only the voice packets. When only the remaining voice packets are sent to the conversion unit 235 of the STT server for text conversion, the load on the STT server is reduced.
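The FT check can be illustrated with the following Python sketch. It rests on assumptions that this description does not state: the payload is laid out as in the AMR/AMR-WB RTP payload format (a 4-bit CMR followed by a table-of-contents entry holding the F, FT, and Q bits), each packet carries a single frame, and whether the octet-aligned or the bandwidth-efficient layout is in use is known from the SDP. The function names are hypothetical.

```python
from typing import Iterable, List

# FT values that mark an SID (silence) frame, per codec, as given in the text.
SID_FRAME_TYPE = {"AMR-NB": 8, "AMR-WB": 9}


def frame_type(payload: bytes, octet_aligned: bool) -> int:
    """Return the FT field of the first table-of-contents entry."""
    if octet_aligned:
        # byte 0: CMR (4 bits) + 4 reserved bits
        # byte 1: F (1 bit), FT (4 bits), Q (1 bit), 2 padding bits
        return (payload[1] >> 3) & 0x0F
    # bandwidth-efficient: CMR (4), F (1), FT (4), Q (1) packed back to back,
    # so FT spans the low 3 bits of byte 0 and the top bit of byte 1
    return ((payload[0] & 0x07) << 1) | (payload[1] >> 7)


def is_sid(payload: bytes, codec: str, octet_aligned: bool = True) -> bool:
    return frame_type(payload, octet_aligned) == SID_FRAME_TYPE[codec]


def drop_silence(payloads: Iterable[bytes], codec: str,
                 octet_aligned: bool = True) -> List[bytes]:
    """Keep only the payloads that are not SID frames."""
    return [p for p in payloads if not is_sid(p, codec, octet_aligned)]
```

A packet that bundles several frames would need every table-of-contents entry inspected; the sketch deliberately looks at the first entry only.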

FIG. 6 is an exemplary diagram in which the text management server 130 of FIG. 2 calculates the time of the timestamp of the SID packet 500 of FIG. 5.

Assume that, in order of speech, an (n-1)-th RTP packet, an n-th RTP packet 600, an (n+1)-th SID packet 500, and an (n+2)-th RTP packet are lined up. Each RTP packet contains a timestamp value. RTP packets containing voice all have the same time duration, but the time duration of an SID packet 500 differs from codec to codec, so it must be calculated from the codec information.

For an RTP packet that carries voice, the timestamp unit 236 of the text management server 130 defines the packet's duration as the ptime value (e.g., 20 ms) in the SDP of the SIP message.

For an RTP packet that is a SID packet 500, the timestamp unit 236 calculates the duration by referring to the codec information. The timestamp unit 236 can identify the codec by checking the codec header in the RTP payload or from the length of the payload. For reference, VoLTE terminals use the EVS, AMR-WB, and AMR-NB codecs. With the EVS codec, for example, a SID packet 500 is a silence packet whose payload is shorter than that of a normal voice packet. Normal voice packets are transmitted every 20 ms, whereas SID packets 500 may be transmitted every 20 ms or at longer intervals (multiples of 20 ms). The duration corresponding to a SID packet 500 therefore has to be calculated, and it can be calculated from the timestamp values of the RTP packets.

For example, if the timestamp of the nth RTP packet 600 is 1,000 and the timestamp of the (n+1)th SID packet 500 is 2,600, the timestamp unit 236 calculates the timestamp difference as 1,600. The timestamp field of the RTP packets 500 and 600 is a relative value tied to the sampling rate of the call codec. To convert RTP timestamps into an absolute duration, the sampling rate of the call codec must therefore be known. The sampling rate can be obtained from the SDP of the SIP message exchanged when the call is established. Assuming the call is established with the AMR-WB codec, the timestamp unit 236 reads the AMR-WB sampling rate of 16 kHz from the SDP. Since 16 kHz means 16,000 samples per second, 16 samples are taken per 1 ms, and the RTP timestamp increases by 16 per 1 ms. A voice RTP packet 600 is transmitted every 20 ms, so its timestamp increases by 320 (16*20) per packet.
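As a rough sketch of how the sampling rate and ptime might be read from an SDP body, the snippet below scans the `a=rtpmap` and `a=ptime` attribute lines. The function names and the sample SDP fragment are illustrative assumptions, not taken from the patent.

```python
import re


def parse_sdp_audio(sdp: str, codec: str):
    """Return (clock_rate_hz, ptime_ms) for the named codec, or None where absent."""
    rate = None
    ptime = None
    for raw in sdp.splitlines():
        line = raw.strip()
        # a=rtpmap:<payload type> <encoding name>/<clock rate>[/<channels>]
        m = re.match(r"a=rtpmap:\d+ ([^/]+)/(\d+)", line)
        if m and m.group(1).upper() == codec.upper():
            rate = int(m.group(2))
        # a=ptime:<packet time in milliseconds>
        m = re.match(r"a=ptime:(\d+)", line)
        if m:
            ptime = int(m.group(1))
    return rate, ptime


def timestamp_ticks_per_packet(clock_rate_hz: int, ptime_ms: int) -> int:
    """Timestamp increase per voice packet, e.g. 16 ticks/ms * 20 ms = 320."""
    return clock_rate_hz // 1000 * ptime_ms


# Hypothetical SDP fragment for an AMR-WB call.
EXAMPLE_SDP = """\
m=audio 49120 RTP/AVP 98
a=rtpmap:98 AMR-WB/16000/1
a=ptime:20
"""
```

With the fragment above, `parse_sdp_audio(EXAMPLE_SDP, "AMR-WB")` yields a 16,000 Hz clock rate and a 20 ms ptime, which reproduces the increment of 320 per voice packet.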

Here, since the timestamp difference between the RTP packet 600 and the SID packet 500 was calculated as 1,600, 1,600/320 = 5 and 5*20 ms = 100 ms. The timestamp unit 236 therefore calculates the duration of the SID packet 500 as 100 ms. The combined duration of the RTP packet 600 and the SID packet 500 is 20 + 100 = 120 ms. Accordingly, for the sequence of the (n-1)th RTP packet, the nth RTP packet 600, the (n+1)th SID packet 500, and the (n+2)th RTP packet, when the timestamp unit 236 has set timestamp information of time t1 on the text of the RTP packet 600, it can calculate the occurrence time t1+120 ms and set it as the timestamp information of the text of the (n+2)th RTP packet.
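The arithmetic above can be written out as a short sketch. The function name and parameters are hypothetical; the modulo handling of 32-bit timestamp wraparound is an addition that the description does not discuss.

```python
def sid_duration_ms(prev_ts: int, sid_ts: int, clock_rate_hz: int, ptime_ms: int) -> int:
    """Duration attributed to a SID packet, following the worked example.

    prev_ts: RTP timestamp of the preceding voice packet
    sid_ts:  RTP timestamp of the SID packet
    """
    ticks_per_packet = clock_rate_hz // 1000 * ptime_ms  # 16 * 20 = 320
    diff = (sid_ts - prev_ts) % 2**32                    # tolerate wraparound (assumption)
    return diff // ticks_per_packet * ptime_ms           # 1600 // 320 * 20 = 100


# Values from the description: AMR-WB at 16 kHz with a ptime of 20 ms.
sid_ms = sid_duration_ms(1000, 2600, 16000, 20)
combined_ms = 20 + sid_ms  # voice packet plus SID packet, added to t1
```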

FIG. 7 illustrates an example in which the text management server 130 of FIG. 1 mixes outgoing voice data and received voice data into a single integrated voice stream according to another embodiment of the present invention.

In another embodiment of the present invention, when a service subscriber receives and reviews the call text from the text management server 130 and then requests the voice content of the call, the mixing unit 238 of FIG. 2 mixes the outgoing call voice and the received call voice into a single integrated voice so that a voice stream of the call can be provided to the subscriber.

Here, assume that the RTP packet 1 stream 710 of the received voice and the RTP packet 2 stream 730 of the outgoing voice are mixed into a single integrated voice stream.

If the received and outgoing RTP packets 400 and 410 are mixed with the same start position, as in FIG. 4B, crosstalk occurs in which the outgoing voice and the received voice overlap. To prevent crosstalk and to produce a mixed voice stream that matches the actual call as closely as possible, the mixing unit 238 delays the later-arriving RTP packet 2 stream 730 of the outgoing voice by the duration of the three packets of the earlier-arriving RTP packet 1 stream of the received voice, that is, 3*20 ms = 60 ms.

FIGS. 8A and 8B illustrate delaying the start position of the later-arriving outgoing or received voice data in the mixing of FIG. 7.

Referring to FIG. 8A, the actual call content of A and B is shown along a time axis. The call begins with A's voice, and the conversation between A and B follows.

Referring to FIG. 8B, if the technique of the present invention is not applied and the start positions of the Tx and Rx RTP packets are set to the same position or are based on packet arrival times, the voices of A and B may overlap in the crosstalk regions 810 and 830.

To eliminate crosstalk, in the present invention the mixing unit 238 places the stream of B's RTP packets at a start position delayed by the duration of A's RTP packets that arrived first, and then mixes the streams. The call voices of A and B are thus mixed into a single integrated call stream free of crosstalk.
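A minimal sketch of this delayed mixing is shown below. It assumes both streams have already been decoded to 16-bit PCM sample lists, which the description does not specify; `mix_streams` and its parameters are hypothetical names.

```python
def mix_streams(first, late, delay_packets, samples_per_packet):
    """Mix two decoded PCM streams, delaying the later-arriving one.

    first: samples of the stream that arrived first
    late:  samples of the stream that arrived later
    delay_packets: number of first-stream packets received before the
                   late stream started (3 in the FIG. 7 example)
    samples_per_packet: samples per packet (320 for 20 ms at 16 kHz)
    """
    offset = delay_packets * samples_per_packet
    mixed = [0] * max(len(first), offset + len(late))
    for i, sample in enumerate(first):
        mixed[i] += sample
    for i, sample in enumerate(late):
        mixed[offset + i] += sample
    # Clamp sums to the signed 16-bit range to avoid overflow on any overlap.
    return [max(-32768, min(32767, s)) for s in mixed]
```

With `delay_packets=3` and `samples_per_packet=320`, the late stream starts 960 samples in, which corresponds to the 60 ms delay of the example at 16 kHz.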

FIG. 9 is a schematic flowchart of a method for managing text of call voice according to an embodiment of the present invention.

When a call is initiated between the caller terminal 110 and the recipient terminal 170, the text management server 130 receives the outgoing and incoming call data (S910). Here, the text management server 130 receives the SIP messages of the signaling data through the TAS 140. The text management server 130 also receives the RTP packets of the voice data.

When the service subscriber presses a key button on the call terminal during the call to request the text conversion service of the present invention, the text management server 130 receives the DTMF signal of the corresponding service key from the subscriber's call terminal (S920).

Upon receiving the DTMF signal, the text management server 130 classifies the RTP packets into those of the outgoing voice data and those of the received voice data (S930). As in FIG. 3, the classification is performed by referring to the caller-side IP and port, the recipient-side IP and port, and the SSRC value. As in FIG. 4B, the packets are classified into receiving-side RTP packets 1 (400) and calling-side RTP packets 2 (410).
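One way to picture the classification step is sketched below. The record layout and the rule of pinning each direction to the first SSRC seen are assumptions made for illustration; the description only states that the IP, port, and SSRC values are referenced.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RtpPacket:
    src_ip: str    # source IP address of the packet
    src_port: int  # source UDP port of the packet
    ssrc: int      # synchronization source identifier from the RTP header
    seq: int       # RTP sequence number


def classify(packets, caller_endpoint, recipient_endpoint):
    """Split packets into (outgoing, incoming) lists.

    caller_endpoint / recipient_endpoint: (ip, port) tuples, assumed to
    have been taken from the SIP/SDP exchange at call setup.
    """
    outgoing, incoming = [], []
    ssrc_of = {}  # direction -> first SSRC seen for that direction
    for p in packets:
        src = (p.src_ip, p.src_port)
        if src == caller_endpoint:
            direction, bucket = "out", outgoing
        elif src == recipient_endpoint:
            direction, bucket = "in", incoming
        else:
            continue  # not part of this call
        # Accept only packets whose SSRC matches the stream already seen.
        if ssrc_of.setdefault(direction, p.ssrc) == p.ssrc:
            bucket.append(p)
    return outgoing, incoming
```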

Once the RTP packets of the caller and the recipient have been classified, the text management server 130 removes the silent SID packets 500 from each classified group, leaving the RTP packets that carry voice (S940).

After the silent SID packets 500 have been removed, the text management server 130 converts the RTP packets that carry voice into text using the STT engine (S950).

Here, for the RTP packets 400 and 410 arranged in order of speech as in FIG. 4B, the text management server 130 sets the reference time to the time at which the call first began and calculates the start time of each packet (S951). As described above with reference to FIG. 6, the duration of a SID packet 500 is calculated by referring to the codec information.

After text conversion and time calculation are complete, the text management server 130 matches each converted text with its calculated voice generation time as one set, and generates and stores call text containing the matched sets so that it can be displayed in a preset screen UI (S960).
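The pairing of converted text with its voice generation time, and the chronological ordering of the two parties' text, might look like the following sketch. The `Utterance` record and the output line format are illustrative assumptions; the patent does not prescribe a data layout or display format.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    speaker: str   # "caller" or "recipient"
    start_ms: int  # voice generation time relative to the start of the call
    text: str      # text produced by the STT engine


def build_call_text(outgoing, incoming):
    """Merge both parties' utterances in chronological order."""
    merged = sorted(outgoing + incoming, key=lambda u: u.start_ms)
    return [f"[{u.start_ms / 1000:.1f}s] {u.speaker}: {u.text}" for u in merged]


# Hypothetical sample data.
outgoing = [Utterance("caller", 0, "Hello"), Utterance("caller", 2500, "Fine")]
incoming = [Utterance("recipient", 1200, "Hi")]
call_text = build_call_text(outgoing, incoming)
```

Each entry keeps the speaker label, so the outgoing and received text stay distinguishable after merging.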

Thereafter, when the service subscriber requests the call text, the text management server 130 retrieves and provides the call text corresponding to the request (S970). That is, at the subscriber's request, the text management server 130 can provide a lookup and search service for the call text converted from the call content.

Although the present invention has been described with reference to limited embodiments and drawings, it is not limited thereto, and those of ordinary skill in the art to which the invention pertains may make various modifications and variations within the technical idea of the invention and the equivalents of the claims set out below.

100: system 110: caller terminal
120: text management server 140: TAS
150: CSCF 170: recipient terminal

Claims

An apparatus for converting call voice of a service subscriber into text and managing the converted text, the apparatus comprising:
a receiver configured to receive voice data of a call in which the service subscriber is a caller or a recipient;
a classification unit configured to classify the received voice data into outgoing voice data and received voice data;
a conversion unit configured to convert the classified voice data into outgoing text data and received text data, respectively;
a call text unit configured to generate call text by distinguishing the converted outgoing text of the caller from the received text of the recipient and arranging them in chronological order;
a provision unit configured to retrieve the generated call text at the request of the service subscriber and provide it to a terminal of the service subscriber; and
a silence removing unit configured to remove Silence Indicator (SID) packets corresponding to silence from the RTP packets of the classified outgoing voice data and received voice data so that only voice packets remain,
wherein the conversion unit converts the remaining voice packets into text using a speech-to-text (STT) engine.

The apparatus of claim 1,
further comprising a DTMF unit configured to receive, during the call, a dual-tone multi-frequency (DTMF) signal requesting text conversion from the call terminal of the service subscriber,
wherein the classification unit performs the classification upon receipt of the DTMF signal.

The apparatus of claim 1,
wherein the classification unit
classifies the outgoing and received voice data by referring to the caller's IP and port and the recipient's IP and port in a Session Initiation Protocol (SIP) message, and to the synchronization source ID of each Real-time Transport Protocol (RTP) packet.

(deleted)

An apparatus for converting call voice of a service subscriber into text and managing the converted text, the apparatus comprising:
a receiver configured to receive voice data of a call in which the service subscriber is a caller or a recipient;
a classification unit configured to classify the received voice data into outgoing voice data and received voice data;
a conversion unit configured to convert the classified voice data into outgoing text data and received text data, respectively;
a call text unit configured to generate call text by distinguishing the converted outgoing text of the caller from the received text of the recipient and arranging them in chronological order;
a provision unit configured to retrieve the generated call text at the request of the service subscriber and provide it to a terminal of the service subscriber; and
a timestamp unit configured to calculate a voice generation time as timestamp information of the converted text, using the duration of an RTP packet obtained from the ptime value of the Session Description Protocol (SDP) of the SIP message.

The apparatus of claim 5,
wherein the timestamp unit
identifies the codec information of the RTP packets, determines the timestamp increment per second from the sampling rate of the identified codec, calculates the duration of a SID packet from the increase of the SID packet's timestamp value over that of the previous packet, and calculates the voice generation time using the duration of the RTP packet and the calculated duration of the SID packet.

The apparatus of claim 1 or 5,
wherein the call text unit
stores the call text comprising an outgoing phone number, an incoming phone number, a total call time, at least one set of outgoing text data and voice generation time, and at least one set of received text data and voice generation time.

An apparatus for converting call voice of a service subscriber into text and managing the converted text, the apparatus comprising:
a receiver configured to receive voice data of a call in which the service subscriber is a caller or a recipient;
a classification unit configured to classify the received voice data into outgoing voice data and received voice data;
a conversion unit configured to convert the classified voice data into outgoing text data and received text data, respectively;
a call text unit configured to generate call text by distinguishing the converted outgoing text of the caller from the received text of the recipient and arranging them in chronological order;
a provision unit configured to retrieve the generated call text at the request of the service subscriber and provide it to a terminal of the service subscriber; and
a mixing unit configured to set the start position of the later-arriving one of the outgoing voice data and the received voice data to a position delayed, from the start position of the first-arriving voice data, by the duration of the RTP packets of the first-arriving voice data, and to mix the voice streams of the outgoing voice data and the received voice data into a single integrated voice data stream using the set start position,
wherein the provision unit provides the integrated voice data.

The apparatus of any one of claims 1, 5, and 8,
wherein the provision unit
provides the call text to the terminal of the service subscriber using at least one of a text message, an email, a social network service (SNS), and a web page.

A method by which an apparatus converts call voice of a service subscriber into text and manages the converted text, the method comprising:
receiving voice data of a call in which the service subscriber is a caller or a recipient;
classifying the received voice data into outgoing voice data and received voice data;
converting the classified voice data into outgoing text data and received text data, respectively;
generating call text by distinguishing the converted outgoing text of the caller from the received text of the recipient and arranging them in chronological order; and
retrieving the generated call text at the request of the service subscriber and providing it to a terminal of the service subscriber,
wherein the method further comprises, after the classifying,
removing Silence Indicator (SID) packets corresponding to silence from the RTP packets of the classified outgoing voice data and received voice data so that only voice packets remain, and
wherein the converting comprises converting the remaining voice packets into text using a speech-to-text (STT) engine.

The method of claim 10,
further comprising, before the classifying,
receiving, during the call, a dual-tone multi-frequency (DTMF) signal requesting text conversion from the call terminal of the service subscriber,
wherein the classifying is performed upon receipt of the DTMF signal.

The method of claim 10,
wherein the classifying
comprises classifying the outgoing and received voice data by referring to the caller's IP and port and the recipient's IP and port in a Session Initiation Protocol (SIP) message, and to the synchronization source ID of each Real-time Transport Protocol (RTP) packet.

(deleted)

A method by which an apparatus converts call voice of a service subscriber into text and manages the converted text, the method comprising:
receiving voice data of a call in which the service subscriber is a caller or a recipient;
classifying the received voice data into outgoing voice data and received voice data;
converting the classified voice data into outgoing text data and received text data, respectively;
generating call text by distinguishing the converted outgoing text of the caller from the received text of the recipient and arranging them in chronological order; and
retrieving the generated call text at the request of the service subscriber and providing it to a terminal of the service subscriber,
wherein the method further comprises, after the converting,
calculating a voice generation time as timestamp information of the converted text, using the duration of an RTP packet obtained from the ptime value of the Session Description Protocol (SDP) of the SIP message.

The method of claim 14,
wherein the calculating
comprises identifying the codec information of the RTP packets, determining the timestamp increment per second from the sampling rate of the identified codec, calculating the duration of a SID packet from the increase of the SID packet's timestamp value over that of the previous packet, and calculating the voice generation time using the duration of the RTP packet and the calculated duration of the SID packet.

The method of claim 10 or 14,
wherein the generating
comprises storing the call text including an outgoing phone number, an incoming phone number, a total call time, at least one set of outgoing text data and timestamp, and at least one set of received text data and timestamp.

A method by which an apparatus converts call voice of a service subscriber into text and manages the converted text, the method comprising:
receiving voice data of a call in which the service subscriber is a caller or a recipient;
classifying the received voice data into outgoing voice data and received voice data;
converting the classified voice data into outgoing text data and received text data, respectively;
generating call text by distinguishing the converted outgoing text of the caller from the received text of the recipient and arranging them in chronological order; and
retrieving the generated call text at the request of the service subscriber and providing it to a terminal of the service subscriber,
wherein the method further comprises, before the providing,
setting the start position of the later-arriving one of the outgoing voice data and the received voice data to a position delayed, from the start position of the first-arriving voice data, by the duration of the RTP packets of the first-arriving voice data, and mixing the voice streams of the outgoing voice data and the received voice data into a single integrated voice data stream using the set start position, and
wherein the providing comprises providing the integrated voice data.

The method of any one of claims 10, 14, and 17,
wherein the providing
comprises providing the call text to the terminal of the service subscriber using at least one of a text message, an email, a social network service (SNS), and a web page.