KR102020773B1

KR102020773B1 - Multimedia Speech Recognition automatic evaluation system based using TTS

Info

Publication number: KR102020773B1
Application number: KR1020190039483A
Authority: KR
Inventors: 이충재; 이창재; 송민규
Original assignee: 미디어젠(주)
Priority date: 2019-04-04
Filing date: 2019-04-04
Publication date: 2019-11-04
Also published as: WO2020204256A1

Abstract

The present invention relates to an automatic multimedia speech recognition evaluation system using a speech synthesis engine. Existing automatic evaluation systems sequentially play pre-recorded speech data to evaluate the recognition rate of a speech recognition apparatus, which requires enormous time and cost to keep building a recorded-speech DB and may fail to detect that other sentences with the same meaning are not recognized. To address these problems, the system uses a TTS engine (speech synthesis engine) to play and evaluate new sentence patterns in real time without recording, thereby outputting accurate and fast performance and function verification results. The automatic multimedia speech recognition evaluation system according to the present invention comprises: an automatic speech recognition test device (100); a TTS driving device (200); and a speech recognition engine device (300).

Description

Multimedia Speech Recognition automatic evaluation system based using TTS

The present invention relates to an automatic multimedia speech recognition evaluation system using a speech synthesis engine. More specifically, conventional automatic speech recognition evaluation systems sequentially play pre-recorded speech data to evaluate the recognition rate of a speech recognition device, which incurs enormous time and cost for continuously building a recorded-speech DB and carries the risk that other sentences with the same meaning are not recognized. To address these problems, the invention provides an automatic evaluation system that uses a TTS engine (speech synthesis engine) to play and evaluate new sentence patterns in real time, at any time, without recording, and can thereby output accurate and fast performance and function verification results.

Speech recognition is a technology in which a computer analyzes a user's voice input through a microphone, extracts its features, recognizes the closest match among pre-registered words or sentences as a command, and performs the operation corresponding to the recognized command.

Conventional speech recognition systems have used two approaches selectively, according to the service: terminal speech recognition, in which the recognition engine is stored inside a terminal such as a vehicle or mobile device, and cloud-based server speech recognition for smartphone internet voice search and various information processing.

In particular, existing automated speech recognition evaluation systems have centered on devices that automatically play a speech DB to measure the recognition rate, or on control devices that adjust the noise environment.

Recently, however, speech recognition requires integrated performance verification of hybrid systems in which terminal speech recognition and cloud-based server speech recognition, whose recognition results follow different specifications, run simultaneously. This calls for algorithms and operating procedures that can analyze results of differing specifications together.

For example, for in-vehicle speech recognition systems, the common practice is to seat native speakers of multiple languages in a real vehicle driven at high speed, instruct them to utter predetermined commands, and have an accompanying inspector check the recognition results manually.

That is, prior inventions on automatic speech recognition testing mostly did not consider the real-vehicle environment. They focused on techniques such as a batch method, in which a speech recognition system is installed on a PC and the target vocabulary is fed in automatically to tally the results, and volume control devices that automatically adjust the ratio of noise to speech when setting up the test environment.

However, the real-vehicle test method faces many practical problems: recruiting native speakers by the hundreds, escorting and managing them at the test site, safety risks from high-speed driving, inefficiency from recording recognition results by hand, excessive time spent cleaning and analyzing the large volume of result data, and the inability to repeat tests. As a result, it is difficult to run enough tests to produce statistically meaningful results, and a technology that solves these problems is needed.

Meanwhile, the prior art related to the present invention is described in more detail below.

Automatically evaluating speech recognition requires human speech, but bringing many people to the test site in person takes a great deal of time and money.

Therefore, the usual approach is to record speech and then play back the recordings for testing.

However, as the number of languages under test grows and the commands under test keep changing, a new sound-source DB must be built each time, which is quite difficult.

In particular, as natural language recognition becomes common and natural, arbitrary sentences must be tested, it is impossible to record every variable sentence. A new kind of automatic speech recognition evaluation system that can handle this is therefore needed.

For example, an automatic speech recognition evaluation system sequentially plays back pre-recorded speech data to evaluate the recognition rate of a speech recognition device.

Long tests can run efficiently without a tester's intervention. However, because only a limited set of recordings is used, adding a new function incurs the cost of building matching recordings, and the likelihood of failing to discover sentences with a high probability of recognition failure rises sharply.

For example, when a speech recognition device that recognized only media- and call-related voice commands is upgraded to also recognize navigation-related voice commands, the automatic evaluation system must build navigation-related recordings.

Moreover, even if every pattern in the recorded data is recognized successfully, other sentences with the same meaning are likely to fail. For example, for the utterance intent '사무실 검색해' ("search for the office"), variants such as '사무실로 안내해줘' ("guide me to the office") and '사무실 찾아줘' ("find the office") may not be recognized.

Therefore, a new system is needed that reduces the cost of building recorded speech data and can test many types of sentences to discover those with a high probability of recognition failure.

In particular, natural language recognition requires a recorded-speech DB to be built continuously, which takes enormous time and cost, so a system that improves on this is urgently needed.

(Prior art document) Korean Patent Laid-Open Publication No. 10-2013-0029635

The present invention has been proposed in view of the problems of the prior art described above. A first object of the present invention is to provide an automatic speech recognition evaluation system that uses a TTS engine (speech synthesis engine) to play and evaluate new sentence patterns in real time, at any time, without recording. This addresses problems such as the enormous time and cost of continuously building a recorded-speech DB for natural language recognition and the possibility that other sentences with the same meaning are not recognized.

A second object of the present invention is to provide, when evaluating natural language recognition, a significant advantage in recognition accuracy over conventional simulation test methods, even in a variable test environment.

To achieve the above objects, the automatic multimedia speech recognition evaluation system using a speech synthesis engine comprises:

an automatic speech recognition test device (100) that, on receiving test script information containing a correct speech recognition value, drives and controls the TTS driving device (200), compares the speech recognition result value provided by the speech recognition engine device (300) with the correct speech recognition value of the test script to determine whether recognition succeeded, and outputs the result;

a TTS driving device (200) that converts the test script provided by the automatic speech recognition test device (100) into speech and outputs the converted speech; and

a speech recognition engine device (300) that acquires the speech output by the TTS driving device, generates a speech recognition result value, and provides it to the automatic speech recognition test device (100).

With the configuration and operation described above, the automatic multimedia speech recognition evaluation system using a speech synthesis engine according to the present invention is particularly useful for evaluating natural language recognition, and offers a considerable advantage over conventional simulation test methods even in a variable test environment.

It also reduces the cost of building recorded speech data and makes it possible to test many types of sentences and discover those with a high probability of recognition failure.

In particular, for natural language recognition, it eliminates the enormous time and cost of continuously building a recorded-speech DB. Because new sentence patterns can be played and evaluated in real time, at any time, without recording, the invention provides immediacy, efficiency, and cost savings at once.

Figure 1 is an overall configuration diagram schematically showing an automatic multimedia speech recognition evaluation system using a speech synthesis engine according to an embodiment of the present invention.
Figure 2 is a block diagram of the automatic speech recognition test device (100) of the automatic multimedia speech recognition evaluation system using a speech synthesis engine according to an embodiment of the present invention.
Figure 3 is a block diagram of the TTS driving device (200) of the automatic multimedia speech recognition evaluation system using a speech synthesis engine according to an embodiment of the present invention.
Figure 4 is a block diagram of the speech recognition engine device (300) of the automatic multimedia speech recognition evaluation system using a speech synthesis engine according to an embodiment of the present invention.

The following merely illustrates the principles of the invention. Those skilled in the art will therefore be able to devise various devices that, although not explicitly described or shown herein, embody the principles of the invention and fall within its concept and scope.

In addition, all conditional terms and embodiments listed herein are, in principle, expressly intended only to aid understanding of the concept of the invention, and should not be understood as limited to the specifically listed embodiments and conditions.

In describing the present invention, terms such as first and second may be used to describe various components, but the components are not limited by these terms.

For example, without departing from the scope of the present invention, a first component may be called a second component, and similarly a second component may be called a first component.

When a component is said to be connected or coupled to another component, it may be directly connected or coupled to that component, but intervening components may also be present.

The terms used herein serve only to describe particular embodiments and are not intended to limit the invention. Singular expressions include plural expressions unless the context clearly indicates otherwise.

In this specification, terms such as "include" or "comprise" specify the presence of the stated features, numbers, steps, operations, components, parts, or combinations thereof, and do not exclude in advance the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

The means for solving the problems of the present invention are as follows.

That is, the automatic multimedia speech recognition evaluation system using a speech synthesis engine according to the present invention

is characterized by including a speech recognition engine device (300) that acquires the speech output by the TTS driving device, generates a speech recognition result value, and provides it to the automatic speech recognition test device (100).

Here, the automatic speech recognition test device (100) comprises:

a test script input unit (110) that receives test script information containing the correct speech recognition value for a speech recognition test;

a script analysis unit (120) that stores the test script information received through the test script input unit in the test information storage unit (130), converts it into a format that the TTS engine unit (220) can process, and stores the converted information in the TTS information storage unit (210);

a test information storage unit (130) that stores the test script, the correct speech recognition value of the test script, and the test number provided through the script analysis unit;

a speech recognition result processing unit (140) that compares the correct speech recognition value of the test script information stored in the test information storage unit with the speech recognition result value provided by the speech recognition engine device (300), determines whether recognition succeeded, and provides the success or failure information to the speech recognition result output unit; and

a speech recognition result output unit (150) that outputs the recognition success or failure result provided by the speech recognition result processing unit.

Here, the script analysis unit (120)

extracts the reference information corresponding to the correct speech recognition value of the test script, the speaker information required for TTS synthesis, and the test number, and stores them in the test information storage unit.

Here, the TTS driving device (200) comprises:

a TTS information storage unit (210) that stores the converted information, provided by the script analysis unit (120), in a format the TTS engine can process;

a TTS engine unit (220) that extracts the converted information stored in the TTS information storage unit (210), converts it into speech (TTS synthesis), and provides it to the test speech output unit (230); and

a test speech output unit (230) that outputs the converted speech provided by the TTS engine unit (220).

Here, the speech recognition engine device (300) comprises:

a speech recognition engine unit (310) that acquires the speech output by the test speech output unit (230), generates a speech recognition result value, and provides it to the speech recognition result acquisition unit;

a speech recognition result acquisition unit (320) that stores the speech recognition result value provided by the speech recognition engine unit and provides it to the communication unit (330); and

a communication unit (330) that transmits the speech recognition result value provided by the speech recognition result acquisition unit to the speech recognition result processing unit (140).

Hereinafter, the automatic multimedia speech recognition evaluation system using a speech synthesis engine according to the present invention is described in detail through an embodiment.

Figure 1 is an overall configuration diagram schematically showing an automatic multimedia speech recognition evaluation system using a speech synthesis engine according to an embodiment of the present invention.

As shown in Figure 1, the system broadly comprises an automatic speech recognition test device (100), a TTS driving device (200), and a speech recognition engine device (300).

Specifically, when the automatic speech recognition test device (100) receives test script information containing a correct speech recognition value, it drives and controls the TTS driving device (200).

To drive and control the TTS driving device (200), the test device converts the information into a format that the TTS driving device can process and provides the converted information to the TTS driving device (200), which then converts it into speech.

Converting text into a format that a TTS driving device can process is a well-known technique, described for example in Korean Patent Laid-Open Publication No. 10-2006-0051151, "System and method for converting text to speech", so a detailed description is omitted.

The test device then compares the speech recognition result value provided by the speech recognition engine device (300) with the correct speech recognition value of the test script, determines whether recognition succeeded, and outputs the result.

For example, if the correct speech recognition value in the test script information is '안녕하세요' ("hello") and the speech recognition result value is also '안녕하세요', the two are identical, so the result 'recognition success' is output. If the speech recognition result value is instead the misrecognized '안뇽하서요', the two are not identical, so the result 'recognition failure' is output.
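The pass/fail judgment in this example can be sketched in a few lines of Python. This is a minimal illustration that assumes exact string equality is the comparison rule, as the example implies; the function name `judge_recognition` and the returned labels are hypothetical and do not come from the patent.

```python
def judge_recognition(correct_value: str, result_value: str) -> str:
    """Compare the correct value with the recognizer output (assumed exact match)."""
    if correct_value == result_value:
        return "recognition success"
    return "recognition failure"


# The two cases from the example: an identical result and a misrecognized one.
success_case = judge_recognition("안녕하세요", "안녕하세요")
failure_case = judge_recognition("안녕하세요", "안뇽하서요")
print(success_case)
print(failure_case)
```

A stricter or looser rule (for example, ignoring whitespace or punctuation) would be a design choice outside what the text specifies.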

The TTS driving device (200) converts the test script provided by the automatic speech recognition test device (100) into speech and outputs the converted speech.

For example, it obtains the test script '안녕하세요' ("hello"), converts it into speech, and outputs the converted speech.

That is, it outputs the spoken utterance '안녕하세요'.

The speech recognition engine device (300) acquires the speech output by the TTS driving device, generates a speech recognition result value, and provides it to the automatic speech recognition test device (100).

For example, it acquires the spoken '안녕하세요', generates '안녕하세요' as the speech recognition result value, and provides it to the automatic speech recognition test device (100).

The automatic speech recognition test device (100) then compares the speech recognition result value '안녕하세요' provided by the speech recognition engine device (300) with the correct speech recognition value '안녕하세요' of the test script to determine whether recognition succeeded. In this example the two match, so the result 'recognition success' is output.
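The round trip between the three devices can be sketched as a small Python loop. This is only an illustration under stated assumptions: `fake_tts` and `fake_asr` are stand-ins for the TTS driving device (200) and the speech recognition engine device (300), not real engines, and `ScriptCase`, `run_tests`, and the recognition-rate calculation are hypothetical names and choices that the patent does not define.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class ScriptCase:
    number: int          # test number
    script: str          # text to be synthesized
    correct_value: str   # correct speech recognition value


def run_tests(
    cases: List[ScriptCase],
    synthesize: Callable[[str], bytes],
    recognize: Callable[[bytes], str],
) -> List[Tuple[int, bool]]:
    """Synthesize each script, recognize the audio, and record pass/fail."""
    results = []
    for case in cases:
        audio = synthesize(case.script)      # role of the TTS driving device
        recognized = recognize(audio)        # role of the recognition engine
        results.append((case.number, recognized == case.correct_value))
    return results


def fake_tts(text: str) -> bytes:
    # Stand-in: a real TTS engine would return audio samples.
    return text.encode("utf-8")


def fake_asr(audio: bytes) -> str:
    text = audio.decode("utf-8")
    # Simulate one misrecognition so that a failure appears in the results.
    return {"사무실 찾아줘": "사무실 차자줘"}.get(text, text)


cases = [
    ScriptCase(1, "안녕하세요", "안녕하세요"),
    ScriptCase(2, "사무실 찾아줘", "사무실 찾아줘"),
]
results = run_tests(cases, fake_tts, fake_asr)
recognition_rate = sum(ok for _, ok in results) / len(results)
print(results, recognition_rate)
```

Passing the synthesizer and recognizer in as callables keeps the comparison logic independent of any particular engine, which fits the separation of devices described in the text.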

The automatic speech recognition test device (100), the TTS driving device (200), and the speech recognition engine device (300) are described in detail below with reference to the drawings.

Figure 2 is a block diagram of the automatic speech recognition test device (100) of the automatic multimedia speech recognition evaluation system using a speech synthesis engine according to an embodiment of the present invention.

As shown in Figure 2, the automatic speech recognition test device (100) comprises a test script input unit (110), a script analysis unit (120), a test information storage unit (130), a speech recognition result processing unit (140), and a speech recognition result output unit (150).

Specifically, the test script input unit (110) receives test script information containing the correct speech recognition value for a speech recognition test.

The test script information may be received from various multimedia devices.

The test script information is a text file that contains the test script, the correct speech recognition value of the test script, and the test number.

For example, the test script information may take the form '#test no-1#test script-love#Speech recognition correct value-love'.
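A line in this format can be split into its fields with a short Python function. The sketch below assumes that fields are delimited by `#` and that the first `-` in each field separates the name from the value; the text shows only this one example, so the general grammar (and the function name `parse_test_script_info`) is an assumption.

```python
from typing import Dict


def parse_test_script_info(line: str) -> Dict[str, str]:
    """Split '#name-value#name-value...' into a dict (assumed grammar)."""
    fields: Dict[str, str] = {}
    for chunk in line.split("#"):
        if not chunk.strip():
            continue  # skip the empty piece before the leading '#'
        name, sep, value = chunk.partition("-")
        if not sep:
            raise ValueError(f"field without '-' separator: {chunk!r}")
        fields[name.strip()] = value.strip()
    return fields


info = parse_test_script_info(
    "#test no-1#test script-love#Speech recognition correct value-love"
)
print(info)
```

Because `partition` splits only on the first `-`, a script value that itself contains hyphens is preserved, but a value containing `#` would need an escaping rule that the text does not describe.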

The script analysis unit (120) stores the test script information received through the test script input unit in the test information storage unit (130), converts it into a format that the TTS engine unit (220) can process, and stores the converted information in the TTS information storage unit (210).

That is, the script analysis unit (120) stores the test script information received through the test script input unit in the test information storage unit (130); for example, it stores '#test no-1#test script-love#Speech recognition correct value-love'.

It also converts the information into a format that the TTS engine unit (220) can process and stores the converted information in the TTS information storage unit (210).

The converted information is in a format on which the TTS engine unit (220) can perform TTS (Text-to-Speech) conversion, and the script analysis unit (120) stores it in the TTS information storage unit (210).

For example, the body of the test script may be parsed before being converted into speech.

The test script may be parsed into portions such as sections, chapters, pages, paragraphs, sentences and/or fragments thereof (for example, based on punctuation and other grammar rules), or words or characters.
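Parsing a script into paragraphs and sentences based on punctuation can be sketched with Python's standard `re` module. This is a simplified illustration: it assumes blank lines separate paragraphs and that `.`, `!`, or `?` followed by whitespace ends a sentence, which ignores abbreviations and other grammar rules the text alludes to. The function names are hypothetical.

```python
import re
from typing import List


def split_paragraphs(text: str) -> List[str]:
    """Split on blank lines (assumed paragraph delimiter)."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def split_sentences(paragraph: str) -> List[str]:
    """Split after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]


script = "Find the office. Guide me to the office!\n\nIs it open today?"
portions = [split_sentences(p) for p in split_paragraphs(script)]
print(portions)
```

Each resulting portion could then be examined individually before being passed on for synthesis.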

각각의 부분은 그것이 문맥(예를 들어, 언어적 문맥)을 나타낼 수 있는 하나 이상의 특정 속성을 갖는지를 결정하기 위해 분석될 수 있다. Each part can be analyzed to determine if it has one or more specific attributes that can represent a context (eg, linguistic context).

예를 들어, 텍스트 부분이 들여쓰기인지, 불릿 포인트가 앞에 나오는지, 이탤릭체인지, 굵은 폰트인지, 밑줄이 있는지, 두줄 밑줄이 있는지, 아래첨자인지, 윗첨자인지, 특정 구두점이 없는지, 특정 구두점을 포함하는지, 텍스트 내의 다른 폰트 크기에 비교하여 특정 폰트 크기를 갖는지, 모두 대문자인지, 타이틀 문자인지, 특정 방식으로 자리맞춤된 것인지(예를 들어, 오른쪽 맞춤, 가운데 맞춤, 왼쪽 맞춤 또는 양쪽 맞춤), 머릿말의 적어도 일부분인지, 머릿말 또는 꼬릿말의 적어도 일부분인지, 목록(table of contents; TOC)의 적어도 일부분인지, 각주의 적어도 일부분인지, 다른 속성을 갖는지, 상술된 속성들 중 임의의 조합을 갖는지가 결정될 수 있다. For example, is the text part indented, bullet point preceded, italic, bold, underlined, double underlined, subscripted, superscripted, no specific punctuation, or include specific punctuation? , Have a specific font size compared to other font sizes in the text, are all uppercase letters, title characters, are justified in a certain way (for example, right-aligned, centered, left-aligned or justified), headings Whether it is at least part of a header, at least part of a header or footer, at least part of a table of contents (TOC), at least part of a footnote, has a different attribute, or has any combination of the attributes described above. have.

텍스트 부분을 음성으로 변환하는 것은 예를 들어, 텍스트에 대한 하나 이상의 변환 매개변수 값을 설정함으로써 이러한 속성에 기초하여 제어될 수 있다. The conversion of the text portion to speech can be controlled based on this attribute, for example, by setting one or more conversion parameter values for the text.

For a given text portion, a value may be set for any of the conversion parameters, such as volume, intonation rate, voice accent, voice modulation, syllable emphasis, a pause before and/or after the portion, other parameters, and any suitable combination thereof.

A value may be set for any of these parameters, and the value is sent to the TTS engine unit together with the given text portion.

For example, a programming call may be made to the standard SAPI (Speech API) for each text portion, including setting values for specific SAPI parameters.
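As an illustration of attribute-driven conversion parameters, the sketch below maps a few text attributes to parameter values and bundles them with the text portion. It does not call SAPI; the class `ConversionParams`, the attribute names, and the specific values are all hypothetical and chosen only to show the idea.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ConversionParams:
    """Hypothetical conversion parameters sent along with a text portion."""
    volume: int = 80          # 0..100
    rate: int = 0             # relative speaking rate
    pause_before_ms: int = 0
    pause_after_ms: int = 0


def params_for(attributes: set) -> ConversionParams:
    """Derive parameter values from the attributes found on a text portion."""
    params = ConversionParams()
    if "bold" in attributes:
        params = replace(params, volume=100)        # emphasize bold text
    if "heading" in attributes:
        params = replace(params, pause_after_ms=500)  # pause after a heading
    if "footnote" in attributes:
        params = replace(params, rate=2)            # read footnotes faster
    return params


def build_request(text: str, attributes: set) -> dict:
    """Bundle a text portion with its parameters, as it would be sent to a TTS engine."""
    return {"text": text, "params": params_for(attributes)}


request = build_request("Chapter 1", {"heading", "bold"})
```

A real integration would translate `ConversionParams` into whatever parameters the chosen speech API actually exposes.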

The text may be selected by the user and may be, for example, an entire digital document such as a word processing (for example, Microsoft® Word) document, a spreadsheet (for example, Excel™) document, a presentation (for example, PowerPoint®) document, an email (for example, Outlook®) message, or another type of document.

Alternatively, the text may be a part of a document, for example a part of any of the documents described above.

The test information storage unit 130 stores the test script, the voice recognition correct answer value of the test script, and the test number contained in the test script information provided through the script analysis unit.

In addition, the script analysis unit 120 may extract reference information corresponding to the voice recognition correct answer value of the test script, speaker information required for TTS synthesis, and the test number, and store them in the test information storage unit.

For example, the reference information may be information on the site or company that provided the voice recognition correct answer value; the speaker information required for TTS synthesis may specify one of various speakers, such as an adult woman, an adult man, celebrity 1, celebrity 2, or a young girl; and the test number may be a unique identifying number used to organize successes and failures at a glance.

The voice recognition result processing unit 140 compares the voice recognition correct answer value of the test script stored in the test information storage unit 130 with the voice recognition result value provided from the voice recognition engine device 300, determines whether recognition succeeded, and provides the recognition success information to the voice recognition result output unit 150.

For example, the voice recognition correct answer value 'love' is extracted from '#test no-1#test script-love#Speech recognition correct value-love' stored in the test information storage unit, and it is compared with 'love' in the voice recognition result value '#speech recognition mode execution signal-speech recognition mode execution signal#test no-1#STT-love' provided from the voice recognition engine device 300.

In this case the two values match, so recognition is judged to have succeeded, and this result is provided to the voice recognition result output unit.
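The comparison above can be sketched as follows. The record strings are taken from the example; reading them as '#'-separated fields of the form `key-value` is an assumption about the format, and `parse_record` and `is_recognized` are hypothetical helper names.

```python
def parse_record(record: str) -> dict[str, str]:
    """Parse '#key-value#key-value...' into a dict, splitting each field at its first '-'."""
    fields = {}
    for part in record.split("#"):
        part = part.strip()
        if not part:
            continue
        key, _, value = part.partition("-")
        fields[key.strip().lower()] = value.strip()
    return fields


def is_recognized(stored_record: str, result_record: str) -> bool:
    """Return True when the STT text equals the correct answer for the same test number."""
    expected = parse_record(stored_record)
    actual = parse_record(result_record)
    same_test = expected["test no"] == actual["test no"]
    return same_test and expected["speech recognition correct value"] == actual["stt"]


stored = "#test no-1#test script-love#Speech recognition correct value-love"
result = ("#speech recognition mode execution signal-"
          "speech recognition mode execution signal#test no-1#STT-love")
success = is_recognized(stored, result)
```

Splitting at the first '-' keeps values that themselves contain a hyphen intact, but it would misread a key containing one; a production format would need an unambiguous delimiter.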

Accordingly, the voice recognition result output unit 150 outputs the recognition success result provided by the voice recognition result processing unit.

For example, it displays on the screen a result such as 'Test number 1, love: voice recognition succeeded.'

Meanwhile, according to an additional aspect, the automatic voice recognition test device 100 may further include a result processing unit that obtains the recognition success information from the voice recognition result processing unit 140 and aggregates the recognition results of the test scripts input during a set time.

For example, the recognition success rate of the test scripts input during one hour can be aggregated: if test number 1 succeeds, test number 2 fails, test number 3 succeeds, and test number 4 fails, the recognition success rate is 50%.
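The aggregation in this example amounts to a simple ratio. A minimal sketch, assuming the outcomes are kept as a mapping from test number to success flag (the function name `success_rate` is hypothetical):

```python
def success_rate(outcomes: dict[int, bool]) -> float:
    """Return the percentage of successful recognitions; 0.0 when there are no tests."""
    if not outcomes:
        return 0.0
    return 100.0 * sum(outcomes.values()) / len(outcomes)


# Test numbers 1 and 3 succeeded, 2 and 4 failed, as in the example above.
outcomes = {1: True, 2: False, 3: True, 4: False}
rate = success_rate(outcomes)  # 50.0
```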

Meanwhile, according to another additional aspect, the automatic voice recognition test device 100 may further include a voice recognition evaluation setting unit for setting the volume value of the output voice and the noise volume value used for voice evaluation.

For example, various conditions of the voice output for the voice recognition test are set through the voice recognition evaluation setting unit.

The set values are provided to the test voice output unit 230 so that the voice is output with those values.

For example, setting values such as evaluation output voice volume - 10 and noise volume - 2 are provided.

The reason for setting these environment variables is that the environment in which voice recognition is actually used is not noise-free; setting values are provided so that the test conditions match the real environment, and voice recognition is evaluated under those conditions in order to increase the voice recognition success rate.

FIG. 3 is a block diagram of the TTS driving device 200 of the automatic multimedia voice recognition evaluation system using a speech synthesis engine according to an embodiment of the present invention.

As shown in FIG. 3, the TTS driving device 200 includes a TTS information storage unit 210, a TTS engine unit 220, and a test voice output unit 230.

Specifically, the TTS information storage unit 210 stores the conversion information provided by the script analysis unit 120, which has been converted into a format that can be driven.

The TTS engine unit 220 extracts the conversion information stored in the TTS information storage unit 210, converts it into voice (TTS synthesis), and provides the voice to the test voice output unit 230.

That is, 'love', test number 1, is converted into voice (TTS-synthesized) and provided to the test voice output unit, and the test voice output unit 230 outputs the converted voice provided from the TTS engine unit 220.

For example, the TTS engine unit 220 converts the voice data it has produced into an audio file in a format that the test voice output unit can output.

Because the audio file formats that can operate for a given function differ depending on the manufacturer and model of the test voice output unit, a detailed description of the audio file format is omitted.

The test voice output unit 230 is an audio playback means, for example a speaker 270.

In addition, when outputting the voice, the test voice output unit 230 transmits a voice recognition mode execution signal to the voice recognition engine device 300.

Accordingly, the voice recognition engine device 300 starts operating automatically at the moment the voice is output and recognizes the output voice.

FIG. 4 is a block diagram of the voice recognition engine device 300 of the automatic multimedia voice recognition evaluation system using a speech synthesis engine according to an embodiment of the present invention.

As shown in FIG. 4, the voice recognition engine device 300 includes a voice recognition engine unit 310, a voice recognition result acquisition unit 320, and a communication unit 330.

Specifically, the voice recognition engine unit 310 acquires the voice output from the test voice output unit 230, generates a voice recognition result value, and provides it to the voice recognition result acquisition unit.

To acquire the voice output from the test voice output unit 230, the voice recognition engine unit may further include a voice acquisition device.

The voice acquisition device is, for example, a microphone 370.

The voice recognition engine unit 310 acquires, through the microphone, the voice output from the test voice output unit 230 and generates the voice recognition result value.

The voice recognition engine is an STT (Speech-to-Text) engine: it converts the voice acquired by the microphone into text, and the converted text is the voice recognition result value.

For example, if the voice acquired through the microphone is 'love', STT conversion by the voice recognition engine unit generates the voice recognition result value 'love'.

The voice recognition result acquisition unit 320 stores the voice recognition result value provided through the voice recognition engine unit and provides it to the communication unit 330.

For example, the voice recognition result value '#speech recognition mode execution signal-speech recognition mode execution signal#test no-1#STT-love' is provided to the communication unit.

Accordingly, the communication unit 330 transmits the voice recognition result value provided through the voice recognition result acquisition unit to the voice recognition result processing unit 140.

That is, the voice recognition result value '#speech recognition mode execution signal-speech recognition mode execution signal#test no-1#STT-love' is transmitted.
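As a rough illustration, a result value in the form shown above could be assembled like this. The layout is inferred from the single example string, and `format_result` is a hypothetical helper, not part of the described system.

```python
def format_result(test_no: int, stt_text: str) -> str:
    """Build a result record in the '#key-value' layout used by the example above."""
    signal = "speech recognition mode execution signal"
    return f"#{signal}-{signal}#test no-{test_no}#STT-{stt_text}"


record = format_result(1, "love")
```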

As described above, the automatic multimedia voice recognition evaluation system using a speech synthesis engine can play and evaluate new sentence patterns in real time at any time without recording, providing immediacy, efficiency, and cost savings at the same time.

That is, it eliminates the enormous time and cost of continuously building a voice recording DB for natural language recognition, which is a problem of the prior art.

According to the present invention, the system is very useful for evaluating natural language recognition and offers a considerable advantage over conventional simulation test methods even in variable test environments.
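The overall flow described in this specification (test script, TTS synthesis, recognition, comparison) can be sketched end to end. This is a minimal sketch only: the `synthesize` and `recognize` callables stand in for the TTS engine unit and the STT engine, and the stubs below simply pass text through as bytes rather than producing or recognizing real audio.

```python
from typing import Callable


def evaluate(
    scripts: dict[int, tuple[str, str]],
    synthesize: Callable[[str], bytes],
    recognize: Callable[[bytes], str],
) -> dict[int, bool]:
    """Run each (script text, correct answer) pair through TTS and STT and compare."""
    results = {}
    for test_no, (text, expected) in scripts.items():
        audio = synthesize(text)       # TTS engine: text to audio
        recognized = recognize(audio)  # STT engine: audio to text
        results[test_no] = recognized == expected
    return results


def stub_tts(text: str) -> bytes:
    """Placeholder for a real TTS engine: encodes the text instead of synthesizing audio."""
    return text.encode("utf-8")


def stub_stt(audio: bytes) -> str:
    """Placeholder for a real STT engine: decodes the bytes produced by stub_tts."""
    return audio.decode("utf-8")


scripts = {1: ("love", "love"), 2: ("hello", "hallo")}
results = evaluate(scripts, stub_tts, stub_stt)
```

In a real setup the audio would be played through a speaker and captured by a microphone, so noise and volume settings would affect the outcome.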

Although preferred embodiments of the present invention have been shown and described above, the present invention is not limited to the specific embodiments described; various modifications can of course be made by those of ordinary skill in the art to which the invention pertains without departing from the gist of the invention as claimed in the claims, and such modifications should not be understood separately from the technical idea or outlook of the present invention.

100: automatic voice recognition test device
200: TTS driving device
300: voice recognition engine device

Claims

An automatic multimedia voice recognition evaluation system using a speech synthesis engine, comprising:

an automatic voice recognition test device 100 that, upon receiving test script information including a voice recognition correct answer value, drives and controls a TTS driving device 200, compares a voice recognition result value provided from a voice recognition engine device 300 with the voice recognition correct answer value of the test script to determine whether recognition succeeded, and outputs the recognition success result;
the TTS driving device 200, which converts the test script provided from the automatic voice recognition test device 100 into voice and outputs the converted voice; and
the voice recognition engine device 300, which acquires the voice output through the TTS driving device, generates the voice recognition result value, and provides it to the automatic voice recognition test device 100,

wherein the automatic voice recognition test device 100 comprises:
a test script input unit 110 for receiving test script information including a voice recognition correct answer value for a voice recognition test;
a script analysis unit 120 for storing the test script information input through the test script input unit in a test information storage unit 130, converting it into a format that can be driven by a TTS engine unit 220, and storing the resulting conversion information in a TTS information storage unit 210;
the test information storage unit 130, which stores the test script provided through the script analysis unit, the voice recognition correct answer value of the test script, and a test number;
a voice recognition result processing unit 140 for comparing the voice recognition correct answer value of the test script information stored in the test information storage unit with the voice recognition result value provided from the voice recognition engine device 300, determining whether recognition succeeded, and providing the recognition success information to a voice recognition result output unit 150; and
the voice recognition result output unit 150, which outputs the recognition success result provided by the voice recognition result processing unit 140.

delete

The system of claim 1,
wherein the script analysis unit 120 extracts reference information corresponding to the voice recognition correct answer value of the test script, speaker information required for TTS synthesis, and the test number, and stores them in the test information storage unit.

The system of claim 1,
wherein the TTS driving device 200 comprises:
a TTS information storage unit 210 for storing the conversion information provided by the script analysis unit 120, which has been converted into a format that can be driven;
a TTS engine unit 220 for extracting the conversion information stored in the TTS information storage unit 210, converting it into voice (TTS synthesis), and providing the voice to a test voice output unit 230; and
the test voice output unit 230, which outputs the converted voice provided from the TTS engine unit 220.

The system of claim 1,
wherein the voice recognition engine device 300 comprises:
a voice recognition engine unit 310 for acquiring the voice output from the test voice output unit 230 of the TTS driving device 200, generating a voice recognition result value, and providing it to a voice recognition result acquisition unit 320;
the voice recognition result acquisition unit 320, which stores the voice recognition result value provided through the voice recognition engine unit 310 and provides it to a communication unit 330; and
the communication unit 330, which transmits the voice recognition result value provided through the voice recognition result acquisition unit 320 to the voice recognition result processing unit 140.