KR102608344B1

KR102608344B1 - Speech recognition and speech dna generation system in real time end-to-end

Info

Publication number: KR102608344B1
Application number: KR1020210016252A
Authority: KR
Inventors: 최성집; 최현집
Original assignee: 주식회사 퀀텀에이아이
Priority date: 2021-02-04
Filing date: 2021-02-04
Publication date: 2023-11-29
Also published as: KR20220112560A

Abstract

The present invention relates to a real-time end-to-end speech recognition and voice DNA generation system. A stream integrator uses the time information contained in the header part of voice packet stream data, which is generated by packetizing detected speech, to connect in chronological order the payload parts carrying the voice information of each packet and thereby produce a speech frame. Based on the speech frame provided by the stream integrator, the system generates voice DNA and uses the generated voice DNA to extract text corresponding to the detected speech.

Description

Real-time end-to-end voice recognition and voice DNA generation system {SPEECH RECOGNITION AND SPEECH DNA GENERATION SYSTEM IN REAL TIME END-TO-END}

The present invention relates to a speech recognition system that generates voice DNA from speech frames, which a stream integrator builds from voice packet data derived from detected speech, while simultaneously extracting text corresponding to the detected speech in real time. The system provides a more efficient deep learning mechanism and extracts text, as the speech recognition result, more quickly and accurately.

Speech recognition technology originated in the field of building interfaces that respond to voice signals to control various equipment and its functions without a separate input device such as a keyboard or mouse. Recently it has expanded toward improving the efficiency of specific tasks, such as call center operation and automatic generation of meeting minutes.

Such speech recognition technology requires considerable time and cost, from applying artificial intelligence to building the database needed to train it. It also calls for further technical advances in building the systems that control and manage it and in providing more accurate recognition results.

One example of the composite, multi-module structure used to build existing speech recognition systems is a conventional system with a complex structure in which a DNN-HMM-based acoustic model, a lexicon, and a language model are combined into a single decoding network.

In contrast to such conventional systems, there are also end-to-end systems. These replace the complex approach consisting of a DNN-HMM-based acoustic model, a decoding network using a weighted finite state transducer (WFST), and an N-gram language model with a network built only from the speech signal, or its features, and the corresponding text.

However, existing end-to-end speech recognition systems have also remained difficult to apply to Korean because of the characteristics of its writing system. For example, if the model output is organized in units of Korean syllables, the possible combinations of initial, medial, and final sounds require a total of 11,172 outputs.
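
The figure of 11,172 follows from how Hangul syllables are composed: 19 initial consonants, 21 medial vowels, and 28 final positions (27 final consonants plus the empty final). The following minimal Python sketch is not part of the patent; it only checks that arithmetic against the Unicode Hangul Syllables block (U+AC00 to U+D7A3), and the function names are illustrative.

```python
from typing import Tuple

# Standard Hangul composition counts (general facts, not values from the patent).
N_INITIAL = 19  # choseong: initial consonants
N_MEDIAL = 21   # jungseong: medial vowels
N_FINAL = 28    # jongseong: 27 final consonants plus "no final"

HANGUL_BASE = 0xAC00  # first precomposed syllable
HANGUL_LAST = 0xD7A3  # last precomposed syllable


def syllable_count() -> int:
    """Number of possible initial/medial/final combinations."""
    return N_INITIAL * N_MEDIAL * N_FINAL


def decompose(syllable: str) -> Tuple[int, int, int]:
    """Return the (initial, medial, final) indices of one precomposed syllable."""
    offset = ord(syllable) - HANGUL_BASE
    if not 0 <= offset < syllable_count():
        raise ValueError("not a precomposed Hangul syllable")
    initial, rest = divmod(offset, N_MEDIAL * N_FINAL)
    medial, final = divmod(rest, N_FINAL)
    return initial, medial, final
```

A syllable-level output layer therefore needs one class per code point in that block, which is the scale problem the paragraph above describes.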

Accordingly, various technical efforts are under way to optimize existing speech recognition systems, and the modules, algorithms, and models built into them, for the characteristics of the Korean writing system.

In this regard, one prior document is Korean Patent Laid-Open Publication No. 10-2010-0026028, titled "A study on a method for implementing a speech-to-text conversion device by constructing a score matrix for speech recognition based on phone-like units (PLU) and by optimal-path processing of PLU sequences" (hereinafter, the 'prior art'). It analyzes the input speech signal in phoneme units and then, in the string domain, combines and interprets those phonemes to find the optimal recognition result in an arbitrarily configurable word dictionary. It was devised to improve the recognition rate and performance more effectively than existing speech recognition systems, which process speech recognition in the frequency domain on the basis of syllable, word, or sentence units.

However, systems that provide text through speech recognition, including the prior art, not only have a complex modular configuration but also require a language model that probabilistically calculates word order and a separate predefined pronunciation dictionary, and the functional efficiency of their recognition process is markedly low.

Moreover, such systems, including the prior art, still require considerable time and cost for AI-based speech recognition and its training, fail to take into account the individual characteristics of the speaker, and ultimately fall short of producing accurate text.

The present invention was devised to solve the above problems. An object of the present invention is to provide a speech recognition system that can derive text from a voice packet stream more quickly and efficiently without a complex modular configuration, a language model that probabilistically calculates word order, or a separate predefined pronunciation dictionary.

Another object of the present invention is to provide a speech recognition system that effectively saves time and cost in generating the training data needed for AI-based speech recognition and its training, and that performs speech recognition based on recognizing and distinguishing speakers by taking each speaker's individual characteristics into account, thereby substantially improving the accuracy of the text produced as the recognition result.

To achieve the above objects, the present invention provides a real-time end-to-end speech recognition and voice DNA generation system that generates voice DNA based on a speech frame provided by a stream integrator and simultaneously extracts text corresponding to the detected speech, the stream integrator generating the speech frame by using time information contained in the header part of voice packet stream data, which is generated by packetizing detected speech, to connect in chronological order the payload parts containing the voice information of each item of voice packet stream data. The system includes: a stream feature extraction unit that, based on the speech frame generated by the stream integrator, extracts and generates at least one frequency domain, which represents a frequency feature by vectorizing the change in frequency over time, and at least one time domain, which represents a time feature by vectorizing the change in amplitude over time; and an encoding/decoding unit that generates mutually integrated voice DNA through encoding and decoding based on the frequency domain and the time domain generated by the stream feature extraction unit, and derives predicted text (raw text) using the voice DNA.

The real-time end-to-end speech recognition and voice DNA generation system further includes a tagging unit. Training data, prepared in script form with the start and end sections delimited by silent sections (blanks) in a speech frame tagged frame by frame, is stored in advance, and the tagging unit is equipped with a tagging algorithm that learns from this training data to tag the start and end sections of a speech frame. Using the tagging algorithm, the tagging unit tags the start and end sections delimited by silent sections (blanks) in the signal stream of the speech frame generated by the stream integrator, thereby generating position-tagged feature stream (Positional Tagged Character Stream) information. The tagging algorithm installed in the tagging unit also continues to learn by using the generated position-tagged feature stream information as training data.

In addition, the encoding/decoding unit includes: a first encoding part that divides the frequency domain generated by the stream feature extraction unit into sections by predetermined frequency band, extracts the features of each section, and performs encoding; a second encoding part that divides the time domain generated by the stream feature extraction unit into sections by predetermined amplitude signal strength, extracts the features of each section, and performs encoding; a first decoding part that performs decoding with the frequency domain, based on the per-band features encoded by the first encoding part, as the main domain and the time domain, based on the per-strength features encoded by the second encoding part, as an auxiliary domain that reinforces information related to the main domain; and a second decoding part that performs decoding with the time domain, based on the per-strength features encoded by the second encoding part, as the main domain and the frequency domain, based on the per-band features encoded by the first encoding part, as an auxiliary domain that reinforces information related to the main domain.

Furthermore, the encoding/decoding unit further includes: a voice DNA generation part that integrates the first decoding domain generated by the first decoding part and the second decoding domain generated by the second decoding part to generate the voice DNA as a comprehensive voice feature domain; and a predicted text generation part that analyzes the comprehensive voice feature domain of the voice DNA generated by the voice DNA generation part to derive the predicted text (raw text).

Here, the per-band features extracted by the first encoding part include attention relating to the probability that a specific character corresponds to each frequency band section over a specific time interval within the whole time axis of the frequency domain. The per-strength features extracted by the second encoding part include attention relating to the probability that a specific character corresponds to each amplitude signal strength section over a specific time interval within the whole time axis of the time domain.

The real-time end-to-end speech recognition and voice DNA generation system further includes a sentence enhancer type text conversion unit that is equipped with a natural language processing (NLP) algorithm and that, based on the position-tagged feature stream information generated by the tagging unit, uses the NLP algorithm to convert the predicted text (raw text) generated by the encoding/decoding unit into final text as the speech recognition result.

Here, the sentence enhancer type text conversion unit includes: a first NLP algorithm training part, in the form of an encoder, that analyzes the correlations among the connections between the characters making up the predicted text (raw text) generated by the encoding/decoding unit and generates first NLP training data usable for training the NLP algorithm; and a second NLP algorithm training part, in the form of a decoder, that analyzes the correlation between the first NLP training data generated by the first NLP algorithm training part and the final text converted by the sentence enhancer type text conversion unit and generates second NLP training data usable for training the NLP algorithm.

In addition, the stream feature extraction unit, the encoding/decoding unit, the tagging unit, and the sentence enhancer type text conversion unit are built within a single memory in a mutually interoperable form and perform their functions there.

The present invention provides the following effects.

First, text can be derived from a voice packet stream more quickly and efficiently without a complex modular configuration, a language model that probabilistically calculates word order, or a separate predefined pronunciation dictionary.

Second, time and cost are effectively saved in generating the training data needed for AI-based speech recognition and its training.

Third, speech recognition can be performed based on recognizing and distinguishing speakers, taking into account the individual characteristics of the speaker providing the speech.

Fourth, a speech recognition system can ultimately be provided in which the accuracy of the text produced as the recognition result is substantially improved.

FIG. 1 is a block diagram showing the configuration of the real-time end-to-end speech recognition and voice DNA generation system according to the present invention.

A preferred embodiment of the present invention is described in more detail below with reference to the accompanying drawing; well-known technical matters are omitted or abridged for brevity.

<Description of the real-time end-to-end speech recognition and voice DNA generation system>

First, the present invention relates to a real-time end-to-end speech recognition and voice DNA generation system (100). A stream integrator (SI) uses the time information contained in the header part of voice packet stream data, which is generated by packetizing detected speech, to connect in chronological order the payload parts containing the voice information of each item of voice packet stream data and thereby produce a speech frame. To generate voice DNA based on the speech frame provided by the stream integrator and, at the same time, extract text corresponding to the detected speech in real time, the system includes a stream feature extraction unit (110), an encoding/decoding unit (120), a tagging unit (130), and a sentence enhancer type text conversion unit (140), as shown in FIG. 1.

The real-time end-to-end speech recognition and voice DNA generation system (100) performs end-to-end automatic speech recognition (E2E ASR) and has an artificial intelligence learning structure and configuration for that purpose.

When a speaker talks, the speech is captured and digitized into a file in a format such as WAV or PCM. A packet processing module (not shown) then packetizes it for transport and use over a network, producing voice packet stream data.

Such voice packet stream data basically has a structure in which a header part and a payload part are joined. However, the data does not arrive sequentially in a fixed temporal order, so it must be assembled into a speech frame by the stream integrator (SI).

Specifically, the header part of the voice packet stream data contains time information along with various information about the origin and destination of the signal, and the payload part contains the captured voice information.
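
The reordering role of the stream integrator can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `VoicePacket` layout, the millisecond timestamp field, and the function name `integrate_stream` are hypothetical, since the text only says that the header carries time information and the payload carries voice information.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class VoicePacket:
    """Hypothetical packet layout used only for this illustration."""
    timestamp_ms: int  # time information taken from the header part
    payload: bytes     # voice information taken from the payload part


def integrate_stream(packets: Iterable[VoicePacket]) -> bytes:
    """Order packets by header time and join their payloads into one speech frame."""
    ordered = sorted(packets, key=lambda packet: packet.timestamp_ms)
    return b"".join(packet.payload for packet in ordered)
```

For example, packets that arrive with timestamps 20, 0, 10 are joined in the order 0, 10, 20. A real integrator would also have to handle lost or duplicated packets, which this sketch ignores.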

The stream feature extraction unit (110) extracts and generates, from the speech frame generated by the stream integrator (SI), at least one frequency domain that represents a frequency feature by vectorizing the change in frequency over time, and at least one time domain that represents a time feature by vectorizing the change in amplitude over time.

Here, the frequency domain generated by the stream feature extraction unit (110) is domain information that vectorizes the change in frequency over time, from which the frequency feature of the speech can be identified.

For example, the frequency domain may be provided as vector information in which the x-axis represents time, with each second divided into hundreds or thousands of bit units, and the y-axis represents frequency, divided into bands in Hz; however, it is not limited to this form.

Likewise, the time domain generated by the stream feature extraction unit (110) is domain information that vectorizes the change in amplitude over time, from which the time feature of the speech can be identified.

For example, the time domain may be provided as vector information in which the x-axis represents time, with each second divided into hundreds or thousands of bit units, and the y-axis represents amplitude, with the signal strength divided into sections of a fixed size; however, it is not limited to this form.
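
The two domains can be illustrated with a small pure-Python sketch. It assumes, as one possible reading, that the frequency domain is a per-window magnitude spectrum and the time domain is a per-window RMS amplitude; the patent does not fix the transform, the window size, or these function names.

```python
import cmath
import math
from typing import List, Sequence


def split_frames(samples: Sequence[float], size: int) -> List[Sequence[float]]:
    """Cut the signal into consecutive, non-overlapping windows of `size` samples."""
    return [samples[i:i + size] for i in range(0, len(samples) - size + 1, size)]


def frequency_domain(samples: Sequence[float], size: int) -> List[List[float]]:
    """Rows are time steps; columns are DFT magnitudes per frequency bin."""
    rows = []
    for frame in split_frames(samples, size):
        row = []
        for k in range(size // 2 + 1):
            # Plain DFT for bin k (a sketch; a real system would use an FFT).
            acc = sum(x * cmath.exp(-2j * math.pi * k * n / size)
                      for n, x in enumerate(frame))
            row.append(abs(acc))
        rows.append(row)
    return rows


def time_domain(samples: Sequence[float], size: int) -> List[float]:
    """One RMS amplitude value per time step."""
    return [math.sqrt(sum(x * x for x in frame) / size)
            for frame in split_frames(samples, size)]
```

Both functions share the same windowing, so row `t` of the frequency domain and element `t` of the time domain describe the same time interval, which is what lets later stages combine them.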

The encoding/decoding unit (120) generates mutually integrated voice DNA through encoding and decoding based on the frequency domain and the time domain generated by the stream feature extraction unit (110), and then derives predicted text (raw text) using the generated voice DNA.

To this end, the encoding/decoding unit (120) includes a first encoding part (121), a second encoding part (122), a first decoding part (123), a second decoding part (124), a voice DNA generation part (125), and a predicted text generation part (126).

First, the first encoding part (121) divides the frequency domain generated by the stream feature extraction unit (110) into sections by predetermined frequency band, extracts the features of each section, and performs encoding.

Here, the per-band features extracted by the first encoding part (121) include attention relating to the probability that a specific character corresponds to each frequency band section over a specific time interval within the whole time axis of the frequency domain.

In addition, the per-band features encoded by the first encoding part (121) also include information relating to a recurrent neural network (RNN).

At the same time, the second encoding part (122) divides the time domain generated by the stream feature extraction unit (110) into sections by predetermined amplitude signal strength, extracts the features of each section, and performs encoding.

Here, the per-strength features extracted by the second encoding part (122) include attention relating to the probability that a specific character corresponds to each amplitude signal strength section over a specific time interval within the whole time axis of the time domain.

In addition, the per-strength features encoded by the second encoding part (122) also include information relating to a recurrent neural network (RNN).

Next, the first decoding part (123) performs decoding with the frequency domain, based on the per-band features encoded by the first encoding part (121), as the main domain, and the time domain, based on the per-strength features encoded by the second encoding part (122), as an auxiliary domain that reinforces information related to the main domain.

The resulting decoded domain takes the frequency feature expressed in the frequency domain as its basis and reinforces it with the time feature wherever that is lacking. This yields a domain from which both the frequency feature and the time feature can be read at once and which has a complementary, more systematic vector structure in the form of a mel spectrogram.

Similarly, the second decoding part (124) performs decoding with the time domain, based on the per-strength features encoded by the second encoding part (122), as the main domain, and the frequency domain, based on the per-band features encoded by the first encoding part (121), as an auxiliary domain that reinforces information related to the main domain.

The resulting decoded domain takes the time feature expressed in the time domain as its basis and reinforces it with the frequency feature wherever that is lacking. This yields a domain from which both the frequency feature and the time feature can be read at once and which has a complementary, more systematic vector structure in the form of a mel spectrogram.
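
The main/auxiliary arrangement of the two decoding parts can be sketched with scaled dot-product cross-attention. This is an assumption made for illustration: the text mentions attention but does not specify how the auxiliary domain reinforces the main one. The function names are hypothetical, and the sketch assumes both encoders emit vectors of the same width.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_decode(main: List[Vector], auxiliary: List[Vector]) -> List[Vector]:
    """Append to each main-domain step an attention-weighted summary of the auxiliary domain."""
    dim = len(main[0])
    decoded = []
    for query in main:
        # Similarity of this main-domain step to every auxiliary-domain step.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in auxiliary]
        weights = softmax(scores)
        context = [sum(w * vec[i] for w, vec in zip(weights, auxiliary))
                   for i in range(len(auxiliary[0]))]
        decoded.append(list(query) + context)
    return decoded
```

Under these assumptions, the first decoding part would correspond to `cross_decode(freq_encoded, time_encoded)` and the second to `cross_decode(time_encoded, freq_encoded)`, where both argument names are placeholders for the encoder outputs.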

Next, the voice DNA generation part (125) integrates the first decoding domain generated by the first decoding part (123) and the second decoding domain generated by the second decoding part (124). The result is voice DNA: vectorized information that expresses comprehensive voice features reflecting the voice characteristics of a specific speaker, is distinct for each speaker, and can be used as identification data.

The voice DNA is thus used to distinguish and recognize speakers while performing speech recognition. It also allows the text-derivation algorithms required for AI-based speech recognition, described below, to be trained on features separated by speaker.
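
One way to picture voice DNA as a per-speaker identification vector is shown below. The integration rule (mean pooling over time followed by concatenation) and the cosine-similarity comparison are assumptions chosen for illustration; the patent states only that the two decoded domains are integrated into vectorized information usable for distinguishing speakers.

```python
import math
from typing import List

Vector = List[float]


def mean_pool(domain: List[Vector]) -> Vector:
    """Average a (time x feature) domain over time into one fixed-length vector."""
    steps = len(domain)
    return [sum(row[i] for row in domain) / steps for i in range(len(domain[0]))]


def make_voice_dna(first_decoded: List[Vector], second_decoded: List[Vector]) -> Vector:
    """Hypothetical integration rule: pool each decoded domain, then concatenate."""
    return mean_pool(first_decoded) + mean_pool(second_decoded)


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Similarity of two voice DNA vectors; higher suggests the same speaker."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```

A stored vector per known speaker could then be compared with the vector of an incoming utterance, with the decision threshold chosen empirically.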

The predicted text generation part 126 analyzes the comprehensive voice feature domain of the voice DNA generated by the voice DNA generation part 125 to derive predicted text (Raw Text).

Here, the predicted text (Raw Text) is a preliminary, first-pass text result. It has not yet undergone either the tagging process described below, which uses silent sections in the voice signal to mark the start and end sections of specific characters, words, or sentences, or the natural language processing process.

Specifically, the predicted text generation part 126 may be equipped with a separate algorithm for deriving predicted text (Raw Text) that corresponds to the speaker's voice largely, though not completely. Depending on the implementation, this algorithm may be configured as an independent module or incorporated into the sentence enhancer type text conversion unit 140 described below, but it is not limited to either arrangement.
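The description leaves the raw-text algorithm open. As a stand-in only, the sketch below uses greedy CTC-style decoding: pick the most likely symbol per frame, merge repeats, and drop blanks. The blank symbol, the function name `greedy_raw_text`, and the per-frame probability input are assumptions and are not taken from the source.

```python
BLANK = "_"  # hypothetical symbol for frames that carry no character

def greedy_raw_text(frame_probs, alphabet):
    """Pick the most likely symbol per frame, merge repeats, drop blanks.

    frame_probs: one list of probabilities per frame, aligned with alphabet.
    """
    out = []
    prev = None
    for probs in frame_probs:
        best = alphabet[max(range(len(probs)), key=probs.__getitem__)]
        # Emit a character only when it differs from the previous frame's symbol.
        if best != prev and best != BLANK:
            out.append(best)
        prev = best
    return "".join(out)
```

The result is deliberately rough, which matches the role of predicted text as a first-pass output that later stages refine.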

The tagging (Tagging) unit 130 generates position-tagged feature stream (Positional Tagged Character Stream) information, which is the result of tagging the positions of the start and end sections delimited by silent sections (Blank) in the voice frame.

Here, the tagging unit 130 may include a training data storage space 130M, a separate database that stores in advance training data prepared in script form, in which the start and end sections delimited by silent sections (Blank) in the voice frame are tagged for each data item. Depending on the implementation, a separate database may instead be configured independently elsewhere in the system and interoperate with the tagging unit 130 so that it can perform its function, but the configuration is not limited to this.

In addition, the tagging unit 130 is equipped with a tagging algorithm 130A that is trained, using the training data stored in advance in the training data storage space 130M, to tag the start and end sections in the voice frame.

Accordingly, the tagging unit 130 uses the trained tagging algorithm to tag the start and end sections delimited by silent sections (Blank) in the voice frame generated by the stream integrator SI. The tagging draws on at least one of the following: the frequency domain encoded by the first encoding part 121 on the basis of features for each frequency band; the time domain encoded by the second encoding part 122 on the basis of features for each amplitude signal strength; the first decoding domain generated by the first decoding part 123; the second decoding domain generated by the second decoding part 124; and the predicted text generated by the predicted text generation part 126.

As a result, position-tagged feature stream (Positional Tagged Character Stream) information is generated. This information is fed back into the training process as training data for the tagging algorithm and, depending on the implementation, may also be stored and managed separately in the training data storage space 130M described above.
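The idea of marking start and end sections around silent sections can be illustrated with a simple rule. Note that this replaces the trained tagging algorithm 130A with a fixed energy threshold purely for illustration; the function name `tag_sections`, the per-frame energy input, and the threshold value are assumptions.

```python
def tag_sections(energies, threshold=0.1):
    """Return (start, end) frame index pairs for runs of non-silent frames.

    energies: per-frame energy values; frames at or below threshold count
    as silence (Blank). The end index is inclusive.
    """
    sections = []
    start = None
    for i, e in enumerate(energies):
        voiced = e > threshold
        if voiced and start is None:
            start = i                      # first voiced frame after silence
        elif not voiced and start is not None:
            sections.append((start, i - 1))  # silence closes the section
            start = None
    if start is not None:
        sections.append((start, len(energies) - 1))
    return sections
```

For example, `tag_sections([0.0, 0.5, 0.6, 0.0, 0.0, 0.7, 0.0])` returns `[(1, 2), (5, 5)]`. A learned tagger would produce the same kind of output while also being able to use the encoded and decoded domains as input.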

The sentence enhancer (Sentence Enhancer) type text conversion unit 140 is equipped with its own natural language processing (NLP, Natural Language Processing) algorithm 140A. Based on the position-tagged feature stream information generated by the tagging unit 130, it uses the natural language processing algorithm 140A to convert the predicted text (Raw Text) generated by the encoding/decoding unit 120 into the final text that serves as the voice recognition result.

The final text obtained by passing the predicted text (Raw Text) through natural language processing corresponds more accurately to what the speaker actually said, which further increases the reliability of the function.
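The conversion from predicted text to final text can be pictured as two steps: group the raw characters by tagged section, then correct each group. In the sketch below a plain lookup table stands in for the trained natural language processing algorithm 140A; the data layout (characters paired with frame indices), the function names, and the `corrections` table are hypothetical.

```python
def group_by_sections(chars, sections):
    """Group characters into words using tagged (start, end) frame ranges.

    chars: list of (frame_index, character) pairs; sections: inclusive ranges.
    """
    words = []
    for start, end in sections:
        word = "".join(c for pos, c in chars if start <= pos <= end)
        if word:
            words.append(word)
    return words

def enhance(chars, sections, corrections):
    """Join section-level words, applying a lookup table in place of a trained model."""
    words = group_by_sections(chars, sections)
    return " ".join(corrections.get(w, w) for w in words)
```

With `chars = [(1, "h"), (2, "i"), (5, "t"), (6, "h"), (7, "e"), (8, "r")]`, `sections = [(1, 2), (5, 8)]`, and `corrections = {"ther": "there"}`, `enhance` returns `"hi there"`.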

Moreover, the sentence enhancer type text conversion unit 140 further includes a first natural language processing algorithm learning part 141 and a second natural language processing algorithm learning part 142 so that the natural language processing algorithm 140A can keep improving through training.

First, the first natural language processing algorithm learning part 141, acting as an encoder, analyzes the correlation between the connection states of the characters that make up the predicted text (Raw Text) generated by the encoding/decoding unit 120, and generates first natural language processing training data that can be used to train the natural language processing algorithm 140A.

Next, the second natural language processing algorithm learning part 142, acting as a decoder, analyzes the correlation between the first natural language processing training data generated by the first natural language processing algorithm learning part 141 and the final text converted by the sentence enhancer type text conversion unit 140, and generates second natural language processing training data that can be used to train the natural language processing algorithm.

The first and second natural language processing training data generated in this way can be recorded and stored in a separate database space, and the natural language processing algorithm 140A can use them for training so that deep learning related to natural language processing proceeds.
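As a rough illustration of the two kinds of training data, the sketch below uses adjacent-character counts as a crude proxy for the "correlation between connection states of characters" and then pairs those statistics with the final text. The function names and the record layout are assumptions; the description does not define the form of either data set.

```python
from collections import Counter

def first_learning_data(raw_text):
    """Count adjacent character pairs in the predicted text (assumed proxy)."""
    return Counter(zip(raw_text, raw_text[1:]))

def second_learning_data(raw_text, final_text):
    """Bundle the first-stage statistics with the (raw, final) text pair."""
    return {
        "bigrams": first_learning_data(raw_text),
        "source": raw_text,    # predicted text (Raw Text)
        "target": final_text,  # final text after conversion
    }
```

Records of this shape could be appended to a separate store and replayed when the natural language processing algorithm is retrained.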

The stream feature extraction unit 110, encoding/decoding unit 120, tagging unit 130, and sentence enhancer type text conversion unit 140 described above are built together within a single memory M in an interoperable form and perform their processing there, so voice recognition can be performed more quickly and efficiently.

This also simplifies the configuration and structure built into the voice system and minimizes its complexity, which improves both functional and cost efficiency.

The embodiments disclosed in the present invention are intended to illustrate, not limit, the technical idea of the present invention, and the scope of that technical idea is not limited by these embodiments. The scope of protection should be interpreted according to the claims below, and all technical ideas within an equivalent scope should be interpreted as falling within the scope of rights of the present invention.

100: Real-time end-to-end voice recognition and voice DNA generation system
110: Stream feature extraction unit
120: Encoding/decoding unit
121: First encoding part 122: Second encoding part
123: First decoding part 124: Second decoding part
125: Voice DNA generation part 126: Predicted text generation part
130: Tagging unit
130M: Training data storage space
130A: Tagging algorithm
140: Sentence enhancer type text conversion unit
141: First natural language processing algorithm learning part
142: Second natural language processing algorithm learning part
140A: Natural language processing algorithm
SI: Stream integrator
M: Memory

Claims

In a real-time end-to-end voice recognition and voice DNA generation system that generates voice DNA based on a voice frame provided by a stream integrator (Stream Integrator) and simultaneously extracts, in real time, text corresponding to the detected voice, the stream integrator generating the voice frame by using time information included in the header (Header) part of voice packet stream (Packet Stream) data, generated by packetizing the detected voice, to connect in chronological order the payload (Payload) parts containing the voice information of each voice packet stream data, the system comprising:
a stream feature extraction unit that extracts and generates, based on the voice frame generated by the stream integrator, at least one frequency domain representing frequency features by vectorizing the change in frequency over time and at least one time domain representing time features by vectorizing the change in amplitude over time;
an encoding/decoding (Encoding/Decoding) unit that generates integrated voice DNA through encoding and decoding based on the frequency domain and time domain generated by the stream feature extraction unit, and derives predicted text (Raw Text) using the voice DNA;
a tagging unit that stores in advance training data prepared in script form, in which the start and end sections delimited by silent sections (Blank) in the voice frame are tagged for each frame, and that is equipped with a tagging algorithm trained with the training data to tag the start and end sections in the voice frame, the tagging unit using the tagging algorithm to tag the start and end sections delimited by silent sections (Blank) in the signal stream of the voice frame generated by the stream integrator, thereby generating position-tagged feature stream (Positional Tagged Character Stream) information; and
a sentence enhancer (Sentence Enhancer) type text conversion unit that is equipped with a natural language processing (NLP, Natural Language Processing) algorithm and, based on the position-tagged feature stream information generated by the tagging unit, uses the natural language processing algorithm to convert the predicted text (Raw Text) generated by the encoding/decoding unit into final text as a voice recognition result,
wherein the tagging algorithm provided in the tagging unit is trained using the generated position-tagged feature stream information as training data.
Real-time end-to-end voice recognition and voice DNA generation system.

(deleted)

According to claim 1,
the encoding/decoding unit comprises:
a first encoding part that divides the frequency domain generated by the stream feature extraction unit into sections by preset frequency band, extracts features for each section, and performs encoding;
a second encoding part that divides the time domain generated by the stream feature extraction unit into sections by preset amplitude signal strength, extracts features for each section, and performs encoding;
a first decoding part that performs decoding with the frequency domain, encoded by the first encoding part on the basis of features for each frequency band, as the main domain and with the time domain, encoded by the second encoding part on the basis of features for each amplitude signal strength, as an auxiliary domain for reinforcing information related to the main domain; and
a second decoding part that performs decoding with the time domain, encoded by the second encoding part on the basis of features for each amplitude signal strength, as the main domain and with the frequency domain, encoded by the first encoding part on the basis of features for each frequency band, as an auxiliary domain for reinforcing information related to the main domain.
Real-time end-to-end voice recognition and voice DNA generation system.

According to claim 3,
the encoding/decoding unit further comprises:
a voice DNA generation part that generates the voice DNA as a comprehensive voice feature domain by integrating a first decoding domain generated by the first decoding part and a second decoding domain generated by the second decoding part; and
a predicted text generation part that derives predicted text (Raw Text) by analyzing the comprehensive voice feature domain of the voice DNA generated by the voice DNA generation part.
Real-time end-to-end voice recognition and voice DNA generation system.

According to claim 3,
the features for each frequency band section extracted by the first encoding part include attention regarding the probability that a specific character is located in each frequency band section over a specific time section within the entire time axis of the frequency domain, and
the features for each amplitude signal strength section extracted by the second encoding part include attention regarding the probability that a specific character is located in each amplitude signal strength section over a specific time section within the entire time axis of the time domain.
Real-time end-to-end voice recognition and voice DNA generation system.

(deleted)

According to claim 1,
the sentence enhancer type text conversion unit further comprises:
an encoder-type first natural language processing algorithm learning part that generates first natural language processing training data usable for training the natural language processing algorithm by analyzing the correlation between the connection states of the characters that make up the predicted text (Raw Text) generated by the encoding/decoding unit; and
a decoder-type second natural language processing algorithm learning part that generates second natural language processing training data usable for training the natural language processing algorithm by analyzing the correlation between the first natural language processing training data generated by the first natural language processing algorithm learning part and the final text converted by the sentence enhancer type text conversion unit.
Real-time end-to-end voice recognition and voice DNA generation system.

According to claim 7,
the stream feature extraction unit, encoding/decoding unit, tagging unit, and sentence enhancer type text conversion unit are built together within a single memory in an interoperable form and perform their processing there.
Real-time end-to-end voice recognition and voice DNA generation system.