KR101069534B1

KR101069534B1 - Method and apparatus for searching voice data from audio and video data under the circumstances including unregistered words

Info

Publication number: KR101069534B1
Application number: KR1020090039889A
Authority: KR
Inventors: 이동현; 김석환; 이근배; 노형종
Original assignee: 포항공과대학교 산학협력단
Priority date: 2009-05-07
Filing date: 2009-05-07
Publication date: 2011-09-30
Also published as: KR20100120977A

Abstract

The present invention relates to a method and apparatus for searching the voice data of audio and video in an unlimited-vocabulary environment, and consists largely of an indexing unit and a search unit. The indexing unit includes: a voice data extractor that extracts voice data from audio and video; a voice data divider that divides the voice data into segments of suitable length, estimated to correspond to sentences, for processing by a speech recognizer, and stores them as wave files; a speech recognizer that takes a voice wave file as input and outputs the result as a text data file; and an indexer that generates index tables in various units using the lattice-form information produced by speech recognition. The search unit includes: a query expander that, taking into account whether the user's query is an unregistered (out-of-vocabulary) word for the speech recognizer, expands the query into several possible queries so that the index tables can be used; a searcher that performs a search on the index tables using the expanded queries; and a result outputter that effectively presents the retrieved results to the user.

Voice search, video search, unlimited vocabulary, unregistered words

Description

Method and apparatus for searching voice data from audio and video data under the circumstances including unregistered words

The present invention relates to a method and apparatus for searching the voice data of audio and video in an environment that includes unregistered words. More particularly, it relates to a method and apparatus that convert voice data into text information using speech recognition technology, generate index tables with an indexer, and make the spoken content searchable even for words that are unregistered in the speech recognizer.

In recent years, the rapid growth of the Internet has brought a sharp increase in audio and video material. Search technology is essential for managing and using this material efficiently.

However, the manually supplied title and brief description of the content do not allow precise searching, and users often have to go through the audio or video themselves to judge it. Content-based search that uses the voice data, which plays an important role in audio and video, can minimize this inconvenience.

Publication No. 10-2008-0075266, published on August 18, 2008 and titled "System and method for generating indexing information of a multimedia data file using voice data, and system and method for retrieving the indexing information of a multimedia data file," covers voice data retrieval systems and methods in general. Publication No. 10-2008-0068844, published on July 24, 2008 and titled "Method of indexing and retrieving a voice document with text metadata, and computer-readable medium," covers the use of conventional search by referring to the metadata that accompanies a voice document.

However, these conventional techniques do not take unregistered words into account. That is, for a search term not included in the dictionary defined for the speech recognizer, they return no result even when the actual voice document contains that term.

The present invention is intended to solve the problem caused by words unregistered in the speech recognizer. Its object is to provide a method and apparatus for searching audio and video voice data in an environment including unregistered words, by indexing speech recognition results not only at the word level but also at lower levels, and by expanding a query to match the index tables when an unregistered word appears in the search.

This object is achieved by a process that extracts voice data from audio and video, divides it into segments of suitable length estimated to correspond to sentences, stores the segments as wave files, converts them into text data files with a speech recognizer, and builds index tables in various units from that information. A given user query is then expanded so that it can use these index tables, the search is performed, and the results are effectively presented to the user.

More specifically, according to one aspect of the present invention, in a voice data retrieval apparatus including an indexing unit and a search unit,

the indexing unit includes: a voice data extractor (210) that extracts voice data (220) from audio and video; a voice data divider (230) that divides the voice data (220) into segments of suitable length, estimated to correspond to sentences, for processing by a speech recognizer (250), and stores them as wave files (240); a speech recognizer (250) that takes a voice wave file as input and outputs the result as a text data file (260); and an indexer (270) that generates index tables (280) in various units using the lattice-form information obtained as the speech recognition result (260), and

the search unit includes: a query expander (110) that, taking into account whether the user's query is an unregistered word for the speech recognizer, expands the query into several possible queries so that the index tables can be used; a searcher (120) that performs a search on the index tables using the expanded queries (21); and a result outputter (130) that effectively presents the retrieved results (30) to the user. An apparatus for searching audio and video voice data in an environment including unregistered words is thereby provided.

Preferably, to make the system robust to speech recognizer errors, the voice data retrieval apparatus uses a compressed representation that merges the overlapping portions of the lattice as far as possible based on time information, and generates an index table containing information such as the word, time information, and probability value for each corresponding point in the voice document.

Preferably, to solve the unregistered-word problem of the speech recognizer, the voice data retrieval apparatus generates index tables by converting the lattice into syllable and phoneme units, which are lower-level units than combined morphemes.

Preferably, when the keyword the user wants to search for is determined to be an unregistered word for the speech recognizer, the voice data retrieval apparatus expands the query into combinations of several units and uses the corresponding index tables.

Preferably, for faster retrieval, the voice data retrieval apparatus starts from the first combined morpheme unit in the query and proceeds in both directions, performing the search by checking whether the distance between adjacent units falls within a fixed time.

According to another aspect of the present invention, in a voice data retrieval method including an indexing stage performed by an indexing unit and a search stage performed by a search unit,

the indexing stage includes: a voice data extraction step of extracting voice data from audio and video; a voice data division step of dividing the voice data into segments of suitable length, estimated to correspond to sentences, for processing in a speech recognition step, and storing them as wave files; a speech recognition step of taking a voice wave file as input and outputting the result as a text data file; and an indexing step of generating index tables in various units using the lattice-form information produced by speech recognition, and

the search stage includes: a query expansion step of expanding the user's query into several possible queries so that the index tables can be used, taking into account whether the query is an unregistered word for speech recognition; a search execution step of performing a search on the index tables using the expanded queries; and a result output step of effectively presenting the retrieved results to the user. A method for searching audio and video voice data in an environment including unregistered words is thereby provided.

Preferably, to make the system robust to errors in the speech recognition step, the voice data retrieval method uses a compressed representation that merges the overlapping portions of the lattice as far as possible based on time information, and generates an index table containing information such as the word, time information, and probability value for each corresponding point in the voice document.

Preferably, to solve the unregistered-word problem of the speech recognition step, the voice data retrieval method generates index tables by converting the lattice into syllable and phoneme units, which are lower-level units than combined morphemes.

Preferably, when the keyword the user wants to search for is determined to be an unregistered word in the speech recognition step, the voice data retrieval method expands the query into combinations of several units and uses the corresponding index tables.

Preferably, for faster retrieval, the voice data retrieval method starts from the first combined morpheme unit in the query and proceeds in both directions, performing the search by checking whether the distance between adjacent units falls within a fixed time.

As described above, by using indexing and search processes that take unregistered words into account when building a voice data retrieval system, the present invention can search audio and video voice data in an environment that includes unregistered words.

Hereinafter, a method and apparatus for searching the voice data of audio and video in an environment including unregistered words according to an embodiment of the present invention will be described in detail with reference to the accompanying drawings.

FIG. 1 is an overview block diagram illustrating the configuration of the retrieval system.

Referring to FIG. 1, a voice data retrieval system is, in general, a system that searches a large collection of video and audio (60) for a keyword (20) that the user (10) wants to find and effectively presents the points at which the keyword occurs.

The voice data retrieval system according to an embodiment of the present invention consists largely of an indexing unit (200) and a search unit (100).

The indexing unit (200) includes: a voice data extractor (210) that extracts voice data (220) from audio and video; a voice data divider (230) that divides the voice data (220) into segments of suitable length, estimated to correspond to sentences, for processing by the speech recognizer (250), and stores them as wave files (240); a speech recognizer (250) that takes a voice wave file as input and outputs the result as a text data file (260); and an indexer (270) that generates index tables (280) in various units using the lattice-form information obtained as the speech recognition result (260).

The search unit (100) includes: a query expander (110) that, taking into account whether the user's query is an unregistered word for the speech recognizer, expands the query into several possible queries so that the index tables can be used; a searcher (120) that performs a search on the index tables using the expanded queries (21); and a result outputter (130) that effectively presents the retrieved results (30) to the user.

The first thing to consider when building a voice data retrieval system is speech recognition. Speech recognition requires several preprocessing steps on the audio. First, the voice data must be extracted from the video and audio. This is done by the voice data extractor (210), which determines whether the video or audio contains voice data and stores the corresponding content.

Effective speech recognition requires voice data of appropriate length. If a segment is too long, recognition accuracy drops and processing speed also suffers considerably. The voice data from the voice data extractor (210) therefore needs to be divided into segments of suitable length, estimated to correspond to sentences, and stored as wave files; this is done by the voice data divider (230). The voice data divider (230) produces several short wave files from one long stream of voice data. It does so by analyzing the voice data and detecting silences longer than a fixed length.
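The silence-based division described above can be sketched as follows. This is a minimal illustration rather than the divider (230) itself: the function name `split_on_silence`, the amplitude threshold, and the minimum silence length are all assumed values that the source does not specify.

```python
from typing import List, Sequence, Tuple


def split_on_silence(
    samples: Sequence[float],
    sample_rate: int,
    threshold: float = 0.01,       # assumed amplitude below which a sample counts as silent
    min_silence_sec: float = 0.5,  # assumed minimum silence that separates two segments
) -> List[Tuple[int, int]]:
    """Return (start, end) sample ranges of speech separated by long silences."""
    min_silence = int(min_silence_sec * sample_rate)
    segments: List[Tuple[int, int]] = []
    seg_start = None  # first loud sample of the current segment
    last_loud = None  # most recent loud sample seen so far
    for i, sample in enumerate(samples):
        if abs(sample) < threshold:
            continue
        if seg_start is None:
            seg_start = i
        elif i - last_loud > min_silence:
            # The silent gap was long enough: close the previous segment.
            segments.append((seg_start, last_loud + 1))
            seg_start = i
        last_loud = i
    if seg_start is not None:
        segments.append((seg_start, last_loud + 1))
    return segments
```

Each returned range could then be written out as its own wave file, for example with the standard-library `wave` module.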

This completes the preprocessing that prepares the input to the speech recognizer (250). In Korean speech recognition, the choice of recognition unit generally has a large effect on performance. If the eojeol (space-delimited word) is used as the unit, a very large number of words must be registered in the dictionary and unregistered words occur frequently, so performance is poor. If the morpheme is used as the unit, the dictionary contains many short entries of one or two syllables, which hurts recognition performance. It is therefore common to use as the recognition unit the combined morpheme, in which pairs of short, frequently occurring morphemes are merged into one.

Using the speech recognizer (250) requires two main models: a language model (40) and an acoustic model (50). If models trained on a general domain are used as they are, recognition performance is poor, so each model needs to be adapted to the voice data.

For the language model (40), the additional information supplied with the audio and video is used as fully as possible to collect related content; a model trained on that content is suitably mixed with a model trained on a general domain, and the result is used as the final language model. For the acoustic model (50), a technique is used that adapts an existing acoustic model trained on a large amount of data to the characteristics of the voice data at hand. Acoustic model adaptation can be carried out with the commonly used MAP (maximum a posteriori) or MLLR (maximum likelihood linear regression) methods using the HTK Toolkit [The HTK Book, Young, S. et al., http://htk.eng.cam.ac.uk/docs/docs.shtml].
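One common way to mix a domain-specific language model with a general one is linear interpolation. The source only says the two are "suitably mixed," so the interpolation scheme, the weight `lam`, and the function name below are assumptions used purely for illustration, shown on unigram probabilities.

```python
from typing import Dict


def interpolate_unigrams(
    domain: Dict[str, float],
    general: Dict[str, float],
    lam: float = 0.5,  # assumed mixing weight given to the domain model
) -> Dict[str, float]:
    """Mix two unigram distributions: lam * P_domain + (1 - lam) * P_general."""
    vocab = set(domain) | set(general)
    return {
        word: lam * domain.get(word, 0.0) + (1.0 - lam) * general.get(word, 0.0)
        for word in vocab
    }
```

In practice the weight would typically be tuned on held-out text from the target audio or video domain.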

The speech recognizer (250) thus takes a voice wave file as input and outputs the result as a text data file. Typical contents of the text data file are the 1-best hypothesis, the n-best list, and the lattice. In general, no speech recognizer is free from recognition errors. In particular, if only the 1-best output of the recognizer is indexed, the whole system inevitably becomes sensitive to speech recognition errors.

To build a system that is more robust to speech recognition errors, multiple candidate hypotheses from the recognizer must be used; however, if the lattice is indexed as it is, the index table becomes extremely large. A compressed representation is therefore used that merges the overlapping portions of the lattice as far as possible based on time information. This is similar to TMI (Time-based Merging for Indexing) [Word-lattice based spoken-document indexing with standard text indexers, Peng Yu, K. Thambiratnam, SLT 2008], in a slightly modified form. Each entry in the index table consists of a word, a document ID, a start point, an end point, and a probability value.
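The entry layout and the idea of time-based merging can be sketched as below. This is a simplified illustration, not TMI itself and not the modified form used in the invention: it merges hypotheses of the same unit in the same document whose time spans overlap, and the rule for combining probabilities (sum capped at 1.0) is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class IndexEntry:
    unit: str      # word (or syllable/phoneme in lower-level tables)
    doc_id: str    # ID of the voice document
    start: float   # start point in seconds
    end: float     # end point in seconds
    prob: float    # probability value


def merge_overlapping(entries: List[IndexEntry]) -> List[IndexEntry]:
    """Merge time-overlapping hypotheses of the same unit in the same document."""
    merged: List[IndexEntry] = []
    for e in sorted(entries, key=lambda x: (x.doc_id, x.unit, x.start)):
        last = merged[-1] if merged else None
        if last and last.doc_id == e.doc_id and last.unit == e.unit and e.start <= last.end:
            last.end = max(last.end, e.end)
            # Assumed combination rule: add the probabilities, capped at 1.0.
            last.prob = min(1.0, last.prob + e.prob)
        else:
            merged.append(IndexEntry(e.unit, e.doc_id, e.start, e.end, e.prob))
    return merged
```

Merging this way keeps one entry per distinct occurrence of a unit instead of one entry per lattice link, which is what keeps the table size manageable.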

Through the above process, the index table (280) is generated from the combined-morpheme speech recognition results. Even when recognition is performed in combined morpheme units and the combined morpheme index table (283) is used, the unregistered-word problem cannot be avoided. A word that is not in the dictionary of the speech recognizer (250) cannot be recognized and is usually replaced with the most probable similar entry. As a result, even though the keyword being searched for is present in the actual voice document, no result is returned because the keyword is an unregistered word for the speech recognizer (250).

The solution, as shown in FIG. 2, is to generate index tables (281, 282) for syllables or phonemes, which are lower-level units than words or combined morphemes. One option is to use a syllable- or phoneme-level speech recognizer, but its performance is poor, so the lattice obtained from combined-morpheme recognition is instead converted into syllable or phoneme units. The conversion simply splits each lattice link into as many links as it has syllables or phonemes. The probability value is kept unchanged, and the time information is estimated by dividing the link's duration by that number. The syllable- or phoneme-level lattice produced this way then goes through the same indexing process described above to produce the lower-level index tables.
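The link-splitting rule above (equal division of the time span, probability left unchanged) can be sketched for the syllable case as follows. The function name and the tuple layout are illustrative; the sketch also assumes the unit is written in precomposed Hangul, so that each character is one syllable.

```python
from typing import List, Tuple

# (unit, start_sec, end_sec, probability)
Link = Tuple[str, float, float, float]


def split_link_into_syllables(unit: str, start: float, end: float, prob: float) -> List[Link]:
    """Split one combined-morpheme lattice link into equal-duration syllable links."""
    syllables = list(unit)  # one precomposed Hangul character per syllable
    step = (end - start) / len(syllables)
    return [
        (syl, start + i * step, start + (i + 1) * step, prob)  # probability is kept as-is
        for i, syl in enumerate(syllables)
    ]
```

A phoneme-level split would work the same way, with a pronunciation lookup (not shown) in place of `list(unit)`.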

Thus, when the keyword (20) that the user (10) wants to search for is determined to be an unregistered word for the speech recognizer (250), it is converted into lower-level units and the syllable or phoneme index table is used, which remedies the existing problem.

Korean text input is basically spaced in eojeol units, which do not coincide with the combined morpheme units used in speech recognition and indexing. Korean also has many compound words, and spacing is not always consistent. The query entered by the user therefore has to be converted into combined morpheme units. A converter based on a morphological analyzer is one option, but morphological analysis is also likely to fail on queries containing unregistered words.

When a given query is recognized as an unregistered word, the query expander (110) of the search unit (100) first considers all possible combinations of combined morphemes and syllables that contain at least one combined morpheme. Only if no such combination exists does it consider the combination consisting of syllables alone. Within each combination, any syllable that is itself unregistered at the syllable level is converted into phonemes. In this case, the generated queries span three levels: combined morphemes, syllables, and phonemes. In some cases, combinations of combined morphemes and phonemes only may be considered. Each unit uses its corresponding index table during the search.

For example, as shown in FIG. 2, suppose the user query (20) is '대운하사업' ("Grand Canal project"), the combined morpheme index table contains '운하' and '사업', the syllable index table contains '운', '하', '사', and '업', and '대' is an unregistered syllable that can be converted into the phoneme sequence 'T EH'. The query is then finally expanded into three queries (21): (T, EH, 운하, 사, 업), (T, EH, 운, 하, 사업), and (T, EH, 운하, 사업).
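The expansion in this example can be reproduced with a small sketch. The function `expand_query` and its arguments are illustrative names, the vocabularies are passed in as plain sets, and the phoneme lookup is a hypothetical dictionary; the source does not describe how the query expander (110) is actually implemented.

```python
from typing import Dict, List, Set, Tuple


def expand_query(
    query: str,
    morphemes: Set[str],             # units present in the combined morpheme index table
    syllables: Set[str],             # units present in the syllable index table
    phonemes: Dict[str, List[str]],  # hypothetical syllable-to-phoneme lookup
) -> List[Tuple[str, ...]]:
    """Expand a query into combinations of combined morphemes, syllables, and phonemes."""

    def segment(pos: int) -> List[List[Tuple[str, str]]]:
        # Enumerate every way to cover query[pos:] with single syllables or known morphemes.
        if pos == len(query):
            return [[]]
        results = [[("syl", query[pos])] + rest for rest in segment(pos + 1)]
        for end in range(pos + 1, len(query) + 1):
            piece = query[pos:end]
            if piece in morphemes:
                results += [[("morph", piece)] + rest for rest in segment(end)]
        return results

    with_morph = [s for s in segment(0) if any(kind == "morph" for kind, _ in s)]
    # Fall back to the all-syllable combination only when no morpheme matches.
    chosen = with_morph or [[("syl", ch) for ch in query]]

    expanded = []
    for seg in chosen:
        units: List[str] = []
        for kind, text in seg:
            if kind == "syl" and text not in syllables:
                units.extend(phonemes.get(text, [text]))  # unregistered syllable -> phonemes
            else:
                units.append(text)
        expanded.append(tuple(units))
    return expanded


queries = expand_query(
    "대운하사업",
    morphemes={"운하", "사업"},
    syllables={"운", "하", "사", "업"},
    phonemes={"대": ["T", "EH"]},
)
```

With the inputs from the example, `queries` contains exactly the three expanded queries listed above; the all-syllable combination is dropped because combinations with a combined morpheme exist.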

The method above essentially makes maximal use of combined morphemes. The first reason is that the combined morpheme index table (283) gives the most stable performance: the combined morpheme is the speech recognition unit itself, so indexing the lattice involves less estimation than for other units. The second reason is that looking up a combined morpheme in its index table returns a shorter result list than lookups for other units, which means the search runs faster. If only the phoneme index table (281) were used, the number of table lookups would increase and each lookup would return a long result list, so the search would take much longer.

The user's query passes through the query expander (110) and is expanded into several queries (21); the searcher (120) takes the expanded queries (21) as input and performs the search using the various index tables. For each unit in a query, a list of entries is extracted from the corresponding index table. Among the combinations of these lists, the points where all units appear in order and close together are included in the final result shown to the user; this is done by checking whether the distance between adjacent units falls within a fixed time.

Rather than proceeding from the left of the query to the right, it is more effective to start from the first combined morpheme unit in the query and proceed in both directions. Because the list extracted from the index table for a combined morpheme is relatively short compared with other units, this greatly reduces the number of candidate combinations from the early stages of the search.
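The proximity check and the bidirectional traversal can be sketched as below. Several things here are assumptions for illustration only: the single merged `index` dictionary standing in for the three tables, the `max_gap` value, the greedy choice of the most probable neighbouring hit, and the product of probabilities as the match score.

```python
from typing import Dict, List, Sequence, Tuple

# (doc_id, start_sec, end_sec, probability)
Hit = Tuple[str, float, float, float]


def search(
    units: Sequence[str],
    anchor: int,                  # position of the first combined morpheme in `units`
    index: Dict[str, List[Hit]],  # unit -> hits, standing in for the index tables
    max_gap: float = 0.5,         # assumed maximum time between adjacent units
) -> List[Hit]:
    """Match all units in order, starting at the anchor and extending both ways."""
    results: List[Hit] = []
    for doc, start, end, prob in index.get(units[anchor], []):
        ok = True
        for unit in units[anchor + 1:]:  # extend to the right of the anchor
            nxt = [h for h in index.get(unit, []) if h[0] == doc and 0 <= h[1] - end <= max_gap]
            if not nxt:
                ok = False
                break
            best = max(nxt, key=lambda h: h[3])
            end, prob = best[2], prob * best[3]
        if ok:
            for unit in reversed(units[:anchor]):  # extend to the left of the anchor
                prv = [h for h in index.get(unit, []) if h[0] == doc and 0 <= start - h[2] <= max_gap]
                if not prv:
                    ok = False
                    break
                best = max(prv, key=lambda h: h[3])
                start, prob = best[1], prob * best[3]
        if ok:
            results.append((doc, start, end, prob))
    return results
```

Starting from the anchor means the outer loop runs over the short hit list of a combined morpheme, so most candidates are rejected before the longer syllable and phoneme lists are consulted.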

In the search, the score is calculated as follows.

[Equation not reproduced in the source text]

Here, Q denotes the user query, and Q_1, ..., Q_N denote the expanded user queries. N is therefore the number of expanded user queries, and I denotes a specific interval. HScore indicates the probability with which a given query appears in a specific interval; it is obtained from the probability values in the index table.

After the search, the search results (30) are finally presented to the user through the outputter (130). The outputter (130) lists the results in order of score and, rather than simply returning the matching audio or video, provides an interface for jumping quickly to the point where the query term occurs.

FIG. 1 is an overview block diagram illustrating the configuration of the retrieval system.

FIG. 2 is a block diagram showing the index tables of the indexing unit of FIG. 1 and the expanded queries in more detail.

Claims

A voice data retrieval apparatus comprising an indexing unit and a search unit,

wherein the indexing unit comprises: a voice data extractor (210) that extracts voice data (220) from audio and video; a voice data divider (230) that divides the voice data (220) into segments of suitable length, estimated to correspond to sentences, for processing by a speech recognizer (250), and stores them as wave files (240); a speech recognizer (250) that takes a voice wave file as input and outputs the result as a text data file (260); and an indexer (270) that generates index tables (280) in various units using the lattice-form information obtained as the speech recognition result (260),

wherein the search unit comprises: a query expander (110) that, taking into account whether the user's query is an unregistered word for the speech recognizer, expands the query into several possible queries so that the index tables can be used; a searcher (120) that performs a search on the index tables using the expanded queries (21); and a result outputter (130) that effectively presents the retrieved results (30) to the user, and

wherein the indexer (270), in order to solve the unregistered-word problem of the speech recognizer, generates index tables by converting the lattice into syllable and phoneme units, which are lower-level units than combined morphemes, the apparatus searching audio and video voice data in an environment including unregistered words.

The voice data retrieval apparatus according to claim 1, wherein, for a system robust to errors of the speech recognizer, an index table containing information such as the word, time information, and probability value for the corresponding point of the voice document is generated using a compressed representation that merges the overlapping portions of the lattice as far as possible based on time information.

(Deleted)

The voice data retrieval apparatus according to claim 1, wherein, when the keyword the user wants to search for is determined to be an unregistered word for the speech recognizer, the query is expanded into combinations of several units so that the corresponding index tables are used.

The voice data retrieval apparatus according to claim 1, wherein, for faster retrieval, the search unit starts from the first combined morpheme unit in the query, proceeds in both directions, and performs the search by checking whether the distance between adjacent units falls within a fixed time.

A voice data retrieval method comprising an indexing stage performed by an indexing unit and a search stage performed by a search unit,

wherein the indexing stage comprises: a voice data extraction step of extracting voice data from audio and video; a voice data division step of dividing the voice data into segments of suitable length, estimated to correspond to sentences, for processing in a speech recognition step, and storing them as wave files; a speech recognition step of taking a voice wave file as input and outputting the result as a text data file; and an indexing step of generating index tables in various units using the lattice-form information produced by speech recognition,

wherein the search stage comprises: a query expansion step of expanding the user's query into several possible queries so that the index tables can be used, taking into account whether the query is an unregistered word for speech recognition; a search execution step of performing a search on the index tables using the expanded queries; and a result output step of effectively presenting the retrieved results to the user, and

wherein the indexing step, in order to solve the unregistered-word problem of the speech recognition step, generates index tables by converting the lattice into syllable and phoneme units, which are lower-level units than combined morphemes, the method searching audio and video voice data in an environment including unregistered words.

The voice data retrieval method, wherein, for a system robust to errors in the speech recognition step, an index table containing information such as the word, time information, and probability value for the corresponding point of the voice document is generated using a compressed representation that merges the overlapping portions of the lattice as far as possible based on time information.

(Deleted)

The voice data retrieval method according to claim 6, wherein, when the keyword the user wants to search for is determined to be an unregistered word in the speech recognition step, the query is expanded into combinations of several units so that the corresponding index tables are used.

The voice data retrieval method according to claim 6, wherein, for faster retrieval, the search starts from the first combined morpheme unit in the query, proceeds in both directions, and is performed by checking whether the distance between adjacent units falls within a fixed time.