KR101904817B1

KR101904817B1 - Call conversation Speech to Text converting system

Info

Publication number: KR101904817B1
Application number: KR1020170177343A
Authority: KR
Inventors: 정성준
Original assignee: (주) 미스터멘션
Priority date: 2017-12-21
Filing date: 2017-12-21
Publication date: 2018-10-05

Abstract

The present invention relates to a system that converts call conversation speech to text. More particularly, when a call is made over voice over Internet protocol (VoIP), the system separates the speech of the call and stores the caller's and the receiver's speech separately; splits the collected voice packets of the caller and the receiver into segments at silent intervals and stores them, which improves the accuracy of speech recognition; converts the separated voice packets into text; and, when integrating the text, arranges the sentences in chronological order based on the time information applied as the timestamp of each packet to create a transcript. This facilitates text search by word and sentence, and the repeated accumulation of transcripts makes it possible to obtain big data. The system includes a server, a packet collecting unit, a voice separating unit, a text converting unit, and a text integrating unit.

Description

{Call conversation Speech to text converting system}

The present invention relates to a call-content speech-to-text transcript generation system. More specifically, when a call is made over Internet telephony (VoIP), the system separates the speech of the call and stores the sender's and the receiver's speech separately; it splits each party's collected voice packets into segments at silent intervals and stores them, which improves speech recognition accuracy; and when the separated voice packets are converted to text and integrated, it arranges the sentences in chronological order based on the time information applied as each packet's timestamp to produce a transcript. This makes text search by word or sentence easy and allows big data to be secured through the repeated accumulation of transcripts.

Today, as Internet use has become widespread with advances in technology, VoIP systems are being actively developed as Internet telephony services that allow telephone calls over the web. Such a service uses the Internet instead of the conventional public switched telephone network (PSTN) and enables voice calls by converting analog voice signals into digital signals in packet form. Because Internet telephony runs over a packet-switched network with dynamic routing and provides a "best-effort" service, call quality is difficult to guarantee. Nevertheless, because it allows low-cost long-distance calls and can support many kinds of services, Internet telephony is used in many areas that require large volumes of calls, such as call centers, public institutions, companies (telecommunications carriers), and outsourcing providers.

Meanwhile, an event processing device is a device that interfaces a call center agent's telephone with a computer. One example is a computer telephony integration (CTI) system, which integrates the computer and the telephone and uses a PC to control the exchange.

The event processing device efficiently delivers customer-related information to the agent through the telephone switching network and interactive voice response (IVR), including the customer's telephone number, card number, resident registration number, service code, and other additional key values, and allows telephone consultation work to be managed conveniently by computer.

For example, when a customer calls a company for a consultation, the event processing device displays a large amount of information about the customer on the agent's monitor so that the consultation proceeds smoothly and quickly. It can also automatically organize data on customers whom the agent needs to call and place those calls.

The exchange and the event processing device are connected through a specific link, the event processing device and the database host (DB host) are connected over a network (LAN), and the agent exchanges call progress status information through the event processing device or the DB host. A device that can record telephone calls between a customer and an agent, log the content of the response, and play back the recording provides reliable evidence if a dispute arises later.

However, files that record such responses are large, which limits how many can be stored; searching them later for a specific phrase is difficult; and their later use as secondary material through data analysis is also constrained.

Prior art document: Korean Registered Patent Publication No. 1213514 (published December 18, 2012)

The present invention has been devised to solve the above problems. Its object is to provide a call-content speech-to-text transcript generation system in which call-content text files are easy to search, which supports the use of big data through data accumulation and analysis, which achieves highly accurate speech recognition even when transmitted and received speech are mixed, and which extracts text accurately even from continuous sentences.

To achieve this object, the call-content speech-to-text transcript generation system according to the present invention is intended for an agent who provides a consultation service over Internet telephony (VoIP) and a customer who requests the consultation service, and may include: a server including a DB for storing, managing, and providing the call records of the agent and the customer; a packet collecting unit that collects voice packets, collecting the agent's voice packets and the service requester's voice packets separately; a voice separating unit that stores voice data obtained by dividing the voice packets collected by the packet collecting unit into segments; a text converting unit that converts the segment-level voice data separated by the voice separating unit into text; and a text integrating unit that integrates the text converted by the text converting unit and stores it in the DB.

In addition, when dividing the voice packets collected by the packet collecting unit into segments, the voice separating unit may divide them at the silent intervals in the voice packets.

In addition, the voice separating unit may apply to each separated segment a segment timestamp that indicates the point in time of the data.

According to the present invention, when a call is made over Internet telephony (VoIP), the speech of the call is separated and stored separately for the sender and the receiver, and each party's collected voice packets are split into segments at silent intervals and stored, which has the effect of improving speech recognition accuracy.

In addition, according to the present invention, when the separated voice packets are converted to text and integrated, the sentences are arranged in chronological order based on the time information applied as each packet's timestamp to produce a transcript. This has the effect of making text search by word or sentence easy and of making it possible to secure big data through the repeated accumulation of transcripts.

FIG. 1 is a conceptual diagram of a call-content speech-to-text transcript generation system according to a preferred embodiment of the present invention;
FIG. 2 shows the header structure of an RTP packet in the call-content speech-to-text transcript generation system according to a preferred embodiment of the present invention;
FIG. 3 is a flowchart showing the collection sequence of the packet collecting unit in the call-content speech-to-text transcript generation system according to a preferred embodiment of the present invention; and
FIG. 4 is a flowchart showing the voice segment storage sequence of the voice separating unit in the call-content speech-to-text transcript generation system according to a preferred embodiment of the present invention.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings. In assigning reference numerals to the components in each drawing, the same components are given the same numerals wherever possible, even when they appear in different drawings. In describing the present invention, detailed descriptions of related known configurations or functions are omitted where they could obscure the gist of the invention. Preferred embodiments are described below, but the technical idea of the present invention is not limited or restricted to them and may be modified and implemented in various ways by those skilled in the art.

FIG. 1 is a conceptual diagram of the call-content speech-to-text transcript generation system (100) according to a preferred embodiment of the present invention; FIG. 2 shows the header structure of an RTP packet in the system (100); FIG. 3 is a flowchart showing the collection sequence of the packet collecting unit (20) in the system (100); and FIG. 4 is a flowchart showing the voice segment storage sequence of the voice separating unit (30) in the system (100).

Referring to FIG. 1, the call-content speech-to-text transcript generation system (100) according to a preferred embodiment of the present invention comprises a server (10), a packet collecting unit (20), a voice separating unit (30), a text converting unit (40), and a text integrating unit (50). Using Internet telephony (VoIP), it improves the accuracy of speech recognition for the call content of the sender and the receiver, can produce transcripts, and can secure big data through the accumulation of transcripts.

The components are described in detail below, beginning with the server (10).

The server (10) is intended for an agent who provides a consultation service over Internet telephony (VoIP) and a customer who requests the consultation service, and includes a DB (11) for storing, managing, and providing the call records of the agent and the customer.

The DB (11) stores voice files (MP3, etc.) in which the consultation is recorded, voice packets, voice data, and the like. A standard language is stored in advance for converting speech to text, and words that do not match the standard language are also stored.

In particular, an administrator can connect to the server (10), check the words stored in the DB (11) that do not match the standard language, correct them to standard words, and save them. The administrator can also configure such a word to be converted automatically to the standard word from then on even though it does not match the standard language, and can add words such as buzzwords to the DB (11) as standard words.

The packet collecting unit (20) collects voice packets, collecting the voice packets of the agent and those of the service requester (customer) separately.

More specifically, a VoIP call initiates the voice call through the Session Initiation Protocol (SIP) and sends and receives the actual voice data through the Real-time Transport Protocol (RTP). On the sending side, the voice is digitized by a voice codec and transmitted divided across the payloads of RTP packets; on the receiving side, it is converted back to sound by the voice codec and output through a speaker. In particular, the receiving side can reassemble the continuous voice data (complete message) using the value of the synchronization source identifier (SSRC ID) field in the RTP packet header. The order in which RTP packets are combined is determined by the sequence number and the timestamp. The packets of the sender and the receiver are separated and stored (collected) based on the SSRC ID value of the RTP packets, and the call start time is also stored.

Referring to FIG. 2, the order in which RTP packets are combined is determined by the sequence number and the timestamp. Accordingly, the voice data (voice packets) of the sender and the receiver (the customer and the agent) are separated and stored using the RTP protocol.
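The header fields mentioned here can be read from a raw packet as in the following minimal sketch. It assumes the 12-byte fixed RTP header layout of RFC 3550; the names `RtpHeader` and `parse_rtp_header` are hypothetical and do not come from the patent.

```python
import struct
from typing import NamedTuple


class RtpHeader(NamedTuple):
    version: int
    payload_type: int
    sequence_number: int
    timestamp: int
    ssrc: int


def parse_rtp_header(packet: bytes) -> RtpHeader:
    """Parse the 12-byte fixed RTP header from a raw packet."""
    if len(packet) < 12:
        raise ValueError("RTP packet shorter than the 12-byte fixed header")
    # Network byte order: two flag bytes, 16-bit sequence number,
    # 32-bit timestamp, 32-bit SSRC.
    first, second, seq, ts, ssrc = struct.unpack("!BBHII", packet[:12])
    return RtpHeader(
        version=first >> 6,            # top two bits of the first byte
        payload_type=second & 0x7F,    # low seven bits of the second byte
        sequence_number=seq,
        timestamp=ts,
        ssrc=ssrc,
    )
```

The sequence number and timestamp give the reassembly order, and the SSRC identifies which party sent the packet.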

Referring to FIG. 3, when a VoIP call starts, the server collects the RTP packets; the SSRC ID in the packet header differs between the agent and the customer; and the RTP packets are mirrored (copied). The mirrored packets are separated by SSRC ID into the agent's RTP packets and the customer's RTP packets and stored in the DB. Because mirrored packets of other agents or customers may arrive when RTP packets are mirrored at the server, any packet whose SSRC ID is not that of the customer or the agent being collected is discarded.
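The collection flow of FIG. 3 can be sketched as follows. This is only an illustration under stated assumptions: packets are represented as `(ssrc, sequence_number, payload)` tuples, the two SSRC IDs of the call are already known, and an in-memory dictionary stands in for the DB. The function name `collect_by_ssrc` is hypothetical.

```python
def collect_by_ssrc(packets, agent_ssrc, customer_ssrc):
    """Split mirrored RTP packets into agent and customer streams.

    packets: iterable of (ssrc, sequence_number, payload) tuples.
    Returns the two streams and the number of discarded packets.
    """
    streams = {"agent": [], "customer": []}
    discarded = 0
    for ssrc, seq, payload in packets:
        if ssrc == agent_ssrc:
            streams["agent"].append((seq, payload))
        elif ssrc == customer_ssrc:
            streams["customer"].append((seq, payload))
        else:
            # Mirrored packet from another call: discard it.
            discarded += 1
    return streams, discarded
```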

The voice separating unit (30) divides the voice packets collected by the packet collecting unit (20) into segments at silent intervals and stores the resulting voice data in the DB (11) of the server (10). A segment timestamp indicating the point in time of the data is applied to each separated segment.

When extracting voice data from the payloads of the RTP packets, the voice separating unit (30) splits it into segments at silent intervals, thereby dividing the voice data into sentence units. During this division, a timestamp is generated and applied to each segment of voice data based on the transmission/reception start time and the timestamps of the RTP packets.

When the call audio is divided into segments, a timestamp is recorded for each segment based on the start time of the VoIP call and the timestamp information of the Real-time Transport Protocol (RTP) packets, and this timestamp is later used as the reference value for arranging the call content in chronological order when it is converted into a transcript document.
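The patent does not give a formula for the segment timestamp. One plausible reading, sketched below, adds the elapsed RTP ticks since the first packet to the stored call start time. The 8000 Hz clock rate is an assumption (it is typical for narrowband telephony codecs, but the patent names no codec), and the function name `segment_wallclock` is hypothetical.

```python
from datetime import datetime, timedelta


def segment_wallclock(call_start, first_rtp_ts, segment_rtp_ts, clock_rate=8000):
    """Map a segment's first RTP timestamp to wall-clock time.

    RTP timestamps start at an arbitrary value and count clock ticks,
    so only the difference from the first packet is meaningful.
    """
    # Modulo handles wraparound of the 32-bit RTP timestamp field.
    ticks = (segment_rtp_ts - first_rtp_ts) % 2**32
    return call_start + timedelta(seconds=ticks / clock_rate)
```

For example, a segment that begins 16000 ticks after the first packet of a call started at 09:00:00 would be stamped 09:00:02 under the assumed clock rate.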

More specifically, referring to FIG. 4, the RTP packets collected by the packet collecting unit are input in sequence-number order. When the last packet is reached, the voice segment is stored. For the packets other than the last, the data is split at silent intervals to produce new segment-level voice data, and a timestamp is applied to each segment. If there is no silent interval, the data is treated as running through to the last packet and is stored as a single voice segment.

As an example, suppose that, on the agent's separated side of the conversation, the call audio "Dear customer, your mobile phone bill is unpaid. (silent interval, i.e., the other party speaking) Your unpaid amount is 000 won." has been divided into multiple RTP packets as digital voice data, mirrored (copied), and stored in the DB. These packets are input in the order of the sequence numbers in the RTP packet headers, and the packet payloads (voice data) are reassembled at silent intervals to create segments. When a silent interval (an interval in which the other party is speaking) is found in the packets, the voice data up to that point is stored as a segment, and the subsequent packets form a new segment. Repeating this process of reassembling voice data into segments at silent intervals separates the call content into sentence units. That is, the example is split into the sentences "Dear customer, your mobile phone bill is unpaid." and "Your unpaid amount is 000 won.", which are stored as two segments; the segment timestamps are also calculated and stored at this time, with reference to the timestamp in the RTP packet header.
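The segmentation loop described for FIG. 4 and in this example can be sketched as follows. The patent does not say how silence is detected, so the amplitude threshold used here is an assumption, as are the already-decoded PCM samples and the names `Packet`, `Segment`, `is_silent`, and `split_on_silence`.

```python
from dataclasses import dataclass, field


@dataclass
class Packet:
    seq: int        # RTP sequence number
    rtp_ts: int     # RTP timestamp
    samples: list   # decoded PCM samples (assumed already decoded)


@dataclass
class Segment:
    start_rtp_ts: int
    samples: list = field(default_factory=list)


def is_silent(packet, threshold=500):
    # Assumed rule: a packet is silent if every sample is below the threshold.
    return all(abs(s) < threshold for s in packet.samples)


def split_on_silence(packets, threshold=500):
    """Reassemble packets in sequence order and cut a segment at each silence."""
    segments = []
    current = None
    # Note: sorting ignores 16-bit sequence-number wraparound for brevity.
    for pkt in sorted(packets, key=lambda p: p.seq):
        if is_silent(pkt, threshold):
            if current is not None:
                segments.append(current)   # store the speech so far
                current = None
            continue
        if current is None:
            # The first voiced packet sets the segment's timestamp.
            current = Segment(start_rtp_ts=pkt.rtp_ts)
        current.samples.extend(pkt.samples)
    if current is not None:
        segments.append(current)           # last packet closes the final segment
    return segments
```

With no silent packet in the input, the loop produces a single segment, matching the case in which the whole stream is stored as one voice segment.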

The text converting unit (40) converts the segment-level voice data separated by the voice separating unit (30) into text.

The text converting unit (40) may also comprise a noise removing unit (41) and a data filtering unit (42).

More specifically, to convert to text, the text converting unit (40) reassembles the originally transmitted voice data from the stored sent and received RTP packets, plays back the voice data (MP3) to extract its frequencies, and uses the noise removing unit (41) to remove external noise, excluding the specific range of the frequency domain in which the human voice is detected. In particular, speech and noise intervals can be recognized and noise removed in various ways, for example using spectral subtraction or an adaptive filter based on the LMS algorithm.
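The source names spectral subtraction and an LMS adaptive filter only as options and gives no parameters. As an illustration of the second option, the following is a minimal LMS noise canceller. It assumes that a separate noise reference signal is available, which the source does not state; the function name `lms_filter`, the tap count, and the step size are hypothetical.

```python
def lms_filter(reference, noisy, taps=4, mu=0.01):
    """Least-mean-squares adaptive noise canceller.

    reference: samples correlated with the noise (assumed to be available).
    noisy: samples containing speech plus noise.
    Returns the cleaned samples (the error signal) and the final weights.
    """
    weights = [0.0] * taps
    buf = [0.0] * taps
    cleaned = []
    for x, d in zip(reference, noisy):
        buf = [x] + buf[:-1]                     # shift in the newest reference sample
        estimate = sum(w * b for w, b in zip(weights, buf))
        error = d - estimate                     # what remains after removing estimated noise
        weights = [w + mu * error * b for w, b in zip(weights, buf)]
        cleaned.append(error)
    return cleaned, weights
```

The step size `mu` trades convergence speed against stability; a value that is too large for the signal level makes the weights diverge.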

The noise-free voice data is then separated into segment-level voice data, and each segment is converted to text. The text conversion can use existing technology such as Google's speech recognition engine or the speech recognition function of a mobile device.

In particular, because each party's collected voice packets are split into segments at silent intervals and stored, the text converting unit (40) also works on data divided at silent intervals and converts one interval at a time, which improves speech recognition accuracy.

The data filtering unit (42) of the text converting unit (40) is then used to further improve the accuracy of the text conversion.

More specifically, when converted text containing misrecognized words, buzzwords, or dialect does not match the standard words and sentences in the DB (11) of the server (10), it is flagged and stored in the DB (11).

For text that does not match the standard words and sentences stored in the DB (11), the administrator connects to the server (10) and registers it as a new standard word or sentence, or registers a correction. Text converted afterwards is then automatically converted to the corrected or newly registered word or sentence and filtered, which has the advantage of further improving the accuracy of the recognized sentences.

As an example, suppose the text produced by the text converting unit (40) is '지가' (a dialect form of '제가', "I"), '예약' ("reservation"), and '했습니데이' (a dialect form of '했습니다', "did"), obtained by splitting into segments at silent intervals and converting to text. Suppose that, among the standard words stored in the DB (11) of the server (10), '예약' is matched, while no standard word or sentence matches '지가' or '했습니데이'. The DB (11) then flags and stores '지가' and '했습니데이' so that the administrator, on connecting to the server (10), can correct them or register them as new standard words or sentences; when '지가' or '했습니데이' is converted afterwards, it is automatically converted to the corrected or newly registered word or sentence.
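The filtering behavior in this example can be sketched as a dictionary lookup. This is an illustration only: plain Python sets and dictionaries stand in for the DB, the function name `filter_text` is hypothetical, and the corrections in the usage note are assumed administrator entries rather than mappings given in the source.

```python
def filter_text(words, standard_words, corrections, pending):
    """Normalize recognized words against a standard vocabulary.

    standard_words: set of words already registered as standard.
    corrections: administrator-registered mapping from a non-standard
                 word to its standard form.
    pending: set that collects unmatched words for administrator review.
    """
    result = []
    for word in words:
        if word in standard_words:
            result.append(word)                # matches the standard language
        elif word in corrections:
            result.append(corrections[word])   # automatic conversion
        else:
            pending.add(word)                  # flag and keep for later review
            result.append(word)
    return result
```

On the first pass, `filter_text(['지가', '예약', '했습니데이'], {'예약'}, {}, pending)` leaves the text unchanged and adds the two unmatched words to `pending`. After an administrator registers, say, `{'지가': '제가', '했습니데이': '했습니다'}` as corrections, the same input comes back in standard form.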

The text integrating unit (50) integrates the text converted by the text converting unit (40) and stores it in the DB (11).

The text integrating unit (50) also generates a transcript by arranging the sentences of the customer and the agent, which have been speech-recognized and converted to text segment by segment, in chronological order based on the segment timestamps.

More specifically, when the text integrating unit (50) arranges the sent and received sentences, converted segment by segment, in chronological order based on the segment timestamps, it produces a transcript written in time order and stores it in the DB (11) of the server (10). When the transcript is consulted later, it is easy to search by word; and because a text file is smaller than a voice file, it is faster to handle and more material can be kept in less storage.
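The chronological merge can be sketched as follows, assuming each converted segment is available as a `(timestamp, text)` pair per speaker. The function name `build_transcript`, the speaker labels, and the output line format are hypothetical.

```python
from datetime import datetime


def build_transcript(agent_segments, customer_segments):
    """Merge both speakers' sentences into one time-ordered transcript.

    Each argument is a list of (segment_timestamp, text) pairs.
    """
    tagged = [(ts, "Agent", text) for ts, text in agent_segments]
    tagged += [(ts, "Customer", text) for ts, text in customer_segments]
    tagged.sort(key=lambda item: item[0])      # order by segment timestamp
    return [f"[{ts:%H:%M:%S}] {speaker}: {text}" for ts, speaker, text in tagged]
```

Because the result is plain text, a word search reduces to a substring test over the returned lines.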

The above description merely illustrates the technical idea of the present invention, and those of ordinary skill in the art to which the invention pertains will be able to make various modifications, changes, and substitutions without departing from its essential characteristics. Accordingly, the embodiments disclosed herein and the accompanying drawings are intended to explain, not to limit, the technical idea of the present invention, and the scope of that technical idea is not limited by these embodiments or drawings. The scope of protection of the present invention should be construed according to the claims below, and all technical ideas within a scope equivalent to them should be construed as falling within the scope of rights of the present invention.

10 - Server
11 - DB
20 - Packet collecting unit
30 - Voice separating unit
40 - Text converting unit
41 - Noise removing unit
42 - Data filtering unit
50 - Text integrating unit
100 - Call-content speech-to-text transcript generation system

Claims

A call-content speech-to-text transcript generation system comprising:
a server that is intended for an agent who provides a consultation service over VoIP and a customer who requests the consultation service, and that includes a DB for storing, managing, and providing the call records of the agent and the customer;
a packet collecting unit that collects voice packets, collecting the agent's voice packets and the service requester's voice packets separately;
a voice separating unit that stores voice data obtained by dividing the voice packets collected by the packet collecting unit into segments;
a text converting unit that converts the segment-level voice data separated by the voice separating unit into text; and
a text integrating unit that integrates the text converted by the text converting unit and stores it in the DB,
wherein the voice separating unit, when dividing the voice packets collected by the packet collecting unit into segments, receives the collected packets in sequence-number order, stores the voice segment only in the case of the last packet, divides the voice packets other than the last packet at silent intervals and stores them as voice segments, and, if there is no silent interval, classifies the data as the last packet and stores it as a voice segment,
wherein the voice separating unit applies to each separated segment a segment timestamp indicating the point in time of the data, and
wherein the text converting unit further comprises:
a noise removing unit that plays back the voice data to extract its frequencies and removes noise, excluding the specific range in which a human voice is detected; and
a data filtering unit that, when the converted text does not match the standard words or sentences stored in the DB, allows the word or sentence to be newly registered or corrected by connecting to the server.

delete