KR100817432B1

KR100817432B1 - A high-speed searching method and system for speech documents using document expansion

Info

Publication number: KR100817432B1
Application number: KR1020070010324A
Authority: KR
Inventors: 오영환; 서민구
Original assignee: 한국과학기술원
Priority date: 2007-01-31
Filing date: 2007-01-31
Publication date: 2008-03-27

Abstract

A method and system for high-speed retrieval of speech data using document expansion are provided. Spoken documents are indexed with phoneme n-grams, each grouping n consecutive phonemes output by a phoneme recognizer, and an efficient document expansion technique greatly reduces the amount of computation, which enables high-speed search. A speech recognition means (30) receives spoken document data and a query input means (40) receives a user query. A main system (10) is connected to both and includes a speech recognition engine (11), which converts speech data or the user query into phoneme n-grams, and an information retrieval engine (12), which retrieves and ranks the spoken documents matching the user query. A spoken document database (20), connected to the main system, stores the spoken document data together with the corresponding phoneme n-grams produced by the speech recognition engine. An output means (50) outputs the results retrieved and ranked by the information retrieval engine. The query input means and the output means are connected to the main system through the Internet.

Description

{A High-Speed Searching Method and System for Speech Documents using Document Expansion}

FIG. 1 shows a general spoken document retrieval procedure.

FIG. 2 shows the general structure of a spoken document retrieval system.

FIG. 3 shows an example of the matching score calculation for a single phoneme in training data according to the present invention.

FIG. 4 shows the matching score distribution for a single phoneme.

FIG. 5 shows the definition of the confidence.

FIG. 6 shows an example of the actual role of the confidence in adjusting the expected matching score.

FIG. 7 shows an example of the calculation of [...].

FIG. 8 shows an example of selective DP using [...] and [...].

FIG. 9 shows the recognition accuracy distribution for the single phonemes [...] and [...].

FIG. 10 compares the search speed improvement with the maximum [...] degradation.

FIG. 11 shows the high-speed speech data retrieval system according to the present invention.

FIG. 12 shows the high-speed speech data retrieval method according to the present invention.

The present invention relates to a high-speed search method for speech data using document expansion. More specifically, it relates to a method that raises search speed and efficiency by keeping speech recognition errors as low as possible while requiring far less computation per search than conventional methods.

With growing computing power and advances in Internet technology, multimedia information such as images, audio, and video is increasing rapidly, and many users now publish or search for audio and video on the Internet. Industry has also shown strong interest in multimedia information retrieval: services such as YouTube and Google Video are available, and newspapers such as The New York Times distribute news as mp3 files under the name Podcasts.

Most of these services currently offer only indexes entered by hand (tags) or indexes built from text supplied with the multimedia data, so automated techniques for indexing and retrieving speech information are needed. Spoken Document Retrieval (SDR) is therefore an active research area: given spoken documents, which are files storing human speech as waveforms, it retrieves the spoken documents that match a user's information need.

As shown in FIG. 1, a general spoken document retrieval procedure consists of a speech recognition engine and an information retrieval (IR) engine. The speech recognition engine runs offline and passes the recognition results for the spoken documents to the IR engine. When a user query arrives, the IR engine uses those recognition results to rank the spoken documents by their similarity to the query and returns them in that order.

Two problems must be solved for spoken document retrieval to work efficiently. The first is handling out-of-vocabulary (OOV) words, that is, words in a spoken document that are not registered in the speech recognizer. Newly coined words and proper nouns in particular, which multiply rapidly as the Internet grows, cannot all be registered in and learned by a large vocabulary continuous speech recognizer (LVCSR) in advance, so an OOV word is misrecognized as a registered word that sounds similar. For example, "Taliban" may be recognized as "Tell a band". A widely used remedy is a sub-word recognizer, which uses sub-words such as phonemes or syllables as the recognition unit. In that case, even when a proper noun or coined word is misrecognized, the error is confined to a few phonemes or syllables.

The second problem is that current speech recognition technology is statistical and takes the most probable result as the answer, so recognition results are not 100% accurate. Two approaches address this problem: query expansion and document expansion.

Query expansion falls into two broad types: expansion using a secondary corpus, and expansion using a confusion matrix, which expresses how likely sub-words are to be confused with one another by the speech recognizer. Query expansion with a secondary corpus finds words related to the query in a secondary corpus that is similar in domain and time to the spoken documents being searched, and adds them to the query. This approach requires the secondary corpus to be prepared in advance, and it is hard to decide which words to add. Query expansion with a confusion matrix finds the words that are likely to appear in place of the query words when they are misrecognized, and adds those to the query. To reach high retrieval accuracy, however, every word likely to be confused with the query must be added, so the number of query words grows excessively.

Document expansion with a secondary corpus finds documents related to a spoken document in a secondary corpus that is similar in domain and time, and then adds words from those documents to the spoken document. As with query expansion, the difficulties of preparing the secondary corpus and of selecting the words to add remain. Document expansion with a confusion matrix, which expresses the likelihood of confusion between sub-words, takes a query and a spoken document and evaluates the similarity between the sub-words of the query and those of the spoken document through an approximate search based on dynamic programming (hereinafter DP). This approach requires a very large amount of computation, so search speed drops considerably.

The following describes previously studied techniques related to spoken document retrieval: an overview of sub-word based spoken document retrieval, the commonly used information retrieval model, and query expansion and document expansion techniques.

Sub-word based spoken document retrieval searches spoken documents using phonemes, syllables, or phoneme n-grams (groups of n consecutive phonemes) as the basic index unit. The general structure of such a retrieval system is shown in FIG. 2. The speech recognizer passes its recognition results to the information retrieval engine as sub-words, and the information retrieval engine performs query expansion or document expansion using a confusion matrix between sub-words. An information retrieval model then ranks the spoken documents and the results are output. Table 1 shows examples of indexes built from phonemes, phoneme n-grams, and syllables. Of these, phoneme n-grams are known to give the best retrieval performance.

Table 1

Sub-word unit | Index entries
Phoneme | [W], [EH], [DH], [ER], [F], [OW], [R], [K], [AE], [S], [T]
Phoneme n-gram (n = 3) | [W, EH, DH], [EH, DH, ER], [DH, ER, F], [ER, F, OW], [F, OW, R], [OW, R, K], [R, K, AE], [K, AE, S], [AE, S, T]
Syllable | [W, EH], [DH, ER], [F, OW, R], [K, AE, S, T]
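The phoneme n-gram row of Table 1 can be reproduced with a sliding window over the phoneme sequence. The following is a minimal sketch; the function name `phoneme_ngrams` is an illustrative assumption and does not come from the patent.

```python
def phoneme_ngrams(phonemes, n):
    """Return every run of n consecutive phonemes as a tuple (sliding window)."""
    return [tuple(phonemes[i:i + n]) for i in range(len(phonemes) - n + 1)]


# Phoneme sequence taken from the "Phoneme" row of Table 1.
phonemes = ["W", "EH", "DH", "ER", "F", "OW", "R", "K", "AE", "S", "T"]
trigrams = phoneme_ngrams(phonemes, 3)
```

With the 11 phonemes of Table 1 and n = 3, the window yields 9 index entries, from [W, EH, DH] to [AE, S, T].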

The TF-IDF based vector space model is the most widely used general information retrieval model. A vector space model converts each document into a vector defined over the index units. Suppose there is a keyword set containing a given number of keywords, and a document set. A weight between each keyword and each document is determined, which yields a document vector for each document. A query vector is computed for the query in the same way. The similarity between the query vector and a document vector is then given by the cosine of the angle between the two vectors, as in Equation 1 below.
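As a minimal sketch of the cosine similarity described above: vectors are plain Python lists, and the function name and the zero-vector convention are illustrative assumptions rather than the patent's own notation.

```python
import math


def cosine_similarity(query_vec, doc_vec):
    """Cosine of the angle between a query vector and a document vector."""
    dot = sum(q * d for q, d in zip(query_vec, doc_vec))
    norm_q = math.sqrt(sum(q * q for q in query_vec))
    norm_d = math.sqrt(sum(d * d for d in doc_vec))
    if norm_q == 0.0 or norm_d == 0.0:
        return 0.0  # assumption: an empty vector is treated as matching nothing
    return dot / (norm_q * norm_d)
```

Because the score depends only on the angle between the vectors, a long document is not favored merely for containing more index units.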

If a query keyword occurs many times in a spoken document, the document can be regarded as highly relevant to that keyword, so the weight is proportional to the term frequency (TF), the number of times the keyword occurs in the document. Conversely, if a keyword occurs very often across the whole document set, it does little to distinguish one document from another, so the weight must also be proportional to the inverse document frequency (IDF), the inverse of the keyword's frequency of occurrence across all documents. The weight can therefore be calculated as in Equation 2 below.
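One common way to combine the two factors is to multiply the term frequency by the logarithm of the inverse document frequency. The sketch below assumes that formulation; the exact form of the patent's Equation 2 may differ, and the function and parameter names are hypothetical.

```python
import math


def tfidf_weight(term_freq, num_docs, doc_freq):
    """Weight of a keyword in a document: TF times log-scaled IDF.

    Assumption: IDF is log(num_docs / doc_freq), a widely used textbook form.
    """
    if term_freq == 0 or doc_freq == 0:
        return 0.0
    return term_freq * math.log(num_docs / doc_freq)
```

A keyword that occurs in every document gets weight 0 under this form, which matches the observation that such a keyword cannot distinguish documents.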

Next, building on the basic idea of sub-word based spoken document retrieval and on the TF-IDF based vector space model described above, the query expansion and document expansion techniques widely used in practice for spoken document retrieval are described.

Query expansion techniques include relevance feedback and query expansion using a confusion matrix.

Relevance feedback is a representative query expansion technique. It starts from the assumption that documents judged relevant to the query contain keywords useful for finding the desired documents, whereas documents judged irrelevant contain keywords that hinder the search. Given a query and the set of documents in the collection that are relevant to it, the optimal query is calculated as in Equation 3 below.
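A minimal sketch of this idea, assuming the textbook form in which the optimal query is the mean of the relevant document vectors minus the mean of the non-relevant ones. The patent's Equation 3 may use a different normalization, and the names below are illustrative.

```python
def optimal_query(doc_vectors, relevant_ids):
    """Mean relevant vector minus mean non-relevant vector (assumed form)."""
    dim = len(next(iter(doc_vectors.values())))
    relevant = [v for d, v in doc_vectors.items() if d in relevant_ids]
    others = [v for d, v in doc_vectors.items() if d not in relevant_ids]

    def mean(vectors):
        if not vectors:
            return [0.0] * dim
        return [sum(column) / len(vectors) for column in zip(*vectors)]

    return [r - o for r, o in zip(mean(relevant), mean(others))]
```

Keywords that are frequent in relevant documents end up with positive weight, and keywords frequent only in irrelevant documents end up with negative weight.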

This method can thus be seen as adding the weights of the documents relevant to the query and subtracting the weights of the irrelevant ones. However, the set of relevant documents cannot be constructed in advance. Pseudo relevance feedback is used to overcome this limitation, and its general procedure is as follows.

(1) Use the query to retrieve documents related to it.

(2) From a fixed number of top-ranked retrieved documents, extract a fixed number of keywords with low inverse document frequency (IDF).

(3) Add the extracted keywords to the query and run the search again.
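The three steps above can be sketched as follows. The ranking function is a deliberately simple stand-in (summed counts of query terms), and `top_docs` and `num_keywords` are hypothetical parameter names for the two counts in step (2). Keyword selection by low IDF follows the text as written.

```python
import math
from collections import Counter


def rank(query_terms, docs):
    """Toy ranking: order document ids by summed counts of the query terms."""
    scores = {}
    for doc_id, terms in docs.items():
        counts = Counter(terms)
        scores[doc_id] = sum(counts[t] for t in set(query_terms))
    return sorted(scores, key=lambda d: (-scores[d], d))


def idf(term, docs):
    """Log-scaled inverse document frequency of a term that occurs in docs."""
    doc_freq = sum(1 for terms in docs.values() if term in terms)
    return math.log(len(docs) / doc_freq)


def expand_query(query_terms, docs, top_docs=2, num_keywords=1):
    top = rank(query_terms, docs)[:top_docs]                    # step (1)
    candidates = {t for d in top for t in docs[d]} - set(query_terms)
    chosen = sorted(candidates, key=lambda t: (idf(t, docs), t))[:num_keywords]  # step (2)
    return list(query_terms) + chosen                           # step (3): search again with this


docs = {
    "d1": ["taliban", "kabul", "news"],
    "d2": ["taliban", "kabul"],
    "d3": ["weather", "forecast"],
}
expanded = expand_query(["taliban"], docs)
```

The caller then runs `rank(expanded, docs)` to obtain the final ordering.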

Unlike query expansion in text retrieval, however, the documents being searched in spoken document retrieval already contain recognition errors, so expanding the query from them is likely to add entirely unrelated keywords. The target document collection itself therefore cannot be used for query expansion. Instead, expansion for spoken documents requires a separate secondary corpus that is free of recognition errors, covers the same domain, and was created at the same time as the documents being searched.

Query expansion using a confusion matrix addresses this problem of applying relevance feedback to documents that contain recognition errors. The method expands the query using the likelihood of confusion between sub-words, estimated from training data. It requires a confusion matrix that stores the probability that a reference phoneme is observed as a hypothesis phoneme. When the query contains one phoneme n-gram a certain number of times but does not contain a second phoneme n-gram, the number of occurrences of the second n-gram, taking speech recognition errors into account, is calculated as follows.

That is, the value is defined as the probability that the first n-gram is confused with the second, normalized by length. This method can compensate for phoneme substitution errors but has the fundamental limitation that it cannot correct insertion and deletion errors. In addition, a large number of phoneme n-grams must be added to the query to reach high retrieval accuracy.

Document expansion techniques include document expansion using a secondary corpus and document expansion using a confusion matrix.

In document expansion using a secondary corpus, keywords related to the spoken document to be searched are extracted from a secondary corpus that is free of recognition errors, and are added to the spoken document. Once the secondary document set is available, document expansion proceeds as follows.

(1) From each document, extract a fixed number of keywords with low inverse document frequency (IDF).

(2) Build a pseudo query from the extracted keywords and use it to search the secondary corpus.

(3) Extract keywords from the retrieved documents and add them to the original document.

The limitation of this method is that it requires a secondary document set that is close in time and similar in domain to the spoken documents being searched.

For spoken document expansion using a confusion matrix, a method has been proposed that extracts phoneme n-grams of various lengths and merges the similarities computed from each length in order to improve retrieval accuracy. In this technique, Equation 1 is redefined as Equation 5 below.

The subscript in Equation 5 denotes the length of the phoneme n-gram. The results retrieved with phoneme n-grams of different lengths are merged as in Equation 6 below.
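A minimal sketch of such a merge, assuming it is a weighted sum of the per-length similarities. The weights used here are hypothetical and are not the values used in the patent.

```python
def merged_similarity(sim_by_length, weights):
    """Weighted sum of the similarities obtained for each n-gram length."""
    return sum(weights[n] * sim for n, sim in sim_by_length.items())


# Hypothetical per-length similarities for one query/document pair.
sims = {2: 0.5, 3: 0.25, 4: 0.5}
# Hypothetical weights that favor 3-grams.
weights = {2: 1.0, 3: 2.0, 4: 1.0}
score = merged_similarity(sims, weights)
```

Giving one length a larger weight lets the ranking lean on the n-gram length that retrieves most accurately while still using evidence from the others.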

In Equation 6, the weight expresses the preference for phoneme n-grams of a particular length. Its values are chosen to reflect that 3-grams (n = 3) give better retrieval accuracy than phoneme n-grams of other lengths. When the number of occurrences of a phoneme n-gram in a spoken document is evaluated during this merging, misrecognized phonemes can be corrected through document expansion. To this end, Equation 5 is redefined as follows, using the probability that a phoneme n-gram in the spoken document was misrecognized as another phoneme n-gram.

In this equation, one term is the probability that one phoneme n-gram is misrecognized as another, and the other term is the document vector obtained from that probability. Given the lengths of the two phoneme n-grams, the misrecognition probability is calculated by the following DP over a matrix whose size is determined by those two lengths.

In this equation, one term is the probability that a reference phoneme is observed as a hypothesis phoneme after speech recognition, and another is the probability for the case where either the reference or the hypothesis phoneme is absent, that is, an insertion or a deletion. The misrecognition probability is read from the matrix once the computation is complete.
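The recurrence itself is not reproduced here, so the following is only a sketch of one plausible DP of this kind: a best-alignment (max-product) recursion over a matrix of size (reference length + 1) by (hypothesis length + 1). The max-product choice, the single shared gap probability, and all names are assumptions.

```python
import math


def confusion_probability(ref, hyp, sub_prob, gap_prob):
    """Probability of the best alignment between a reference and a hypothesis n-gram.

    sub_prob(r, h): probability that reference phoneme r is observed as h.
    gap_prob: probability used when a reference or hypothesis phoneme is absent.
    """
    rows, cols = len(ref) + 1, len(hyp) + 1
    m = [[0.0] * cols for _ in range(rows)]
    m[0][0] = 1.0
    for i in range(rows):
        for j in range(cols):
            if i == 0 and j == 0:
                continue
            best = 0.0
            if i > 0 and j > 0:  # match or substitution
                best = max(best, m[i - 1][j - 1] * sub_prob(ref[i - 1], hyp[j - 1]))
            if i > 0:            # deletion: reference phoneme has no counterpart
                best = max(best, m[i - 1][j] * gap_prob)
            if j > 0:            # insertion: hypothesis phoneme has no counterpart
                best = max(best, m[i][j - 1] * gap_prob)
            m[i][j] = best
    return m[-1][-1]  # value of the last cell after the table is filled


def toy_sub_prob(r, h):
    """Hypothetical confusion matrix: 0.9 on the diagonal, 0.05 elsewhere."""
    return 0.9 if r == h else 0.05
```

Each call fills a table whose size is the product of the two n-gram lengths, which is why running it for every query and document n-gram pair becomes expensive.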

This kind of document expansion compares every phoneme n-gram in the query with every phoneme n-gram in the spoken documents, so it improves retrieval accuracy but sharply increases the amount of computation. Precomputing and storing these values has been proposed to solve the problem, but that approach has not been validated, and the precomputed information is too large to hold in memory.

The present invention was devised to solve these problems of the prior art. Its object is to provide a high-speed search method for speech data using document expansion, in which spoken documents are indexed with phoneme n-grams, each grouping n consecutive phonemes obtained from a phoneme recognizer, and in which document expansion through approximate search is applied efficiently to handle misrecognition, so that the amount of computation is greatly reduced and high-speed search becomes possible. More specifically, conventional methods run DP on every pair of phoneme n-grams from the query and the spoken documents. To reduce the heavy computation that DP entails during approximate search, the present method reduces the number of comparisons between query and document n-grams by running DP only when a phoneme n-gram in the spoken document and a phoneme n-gram in the query could be the same speech, which greatly improves search efficiency.

To achieve this object, the high-speed search method for speech data using document expansion according to the present invention is a method based on document expansion with a confusion matrix, characterized in that an information retrieval (IR) engine used to search spoken documents, together with a speech recognition engine that converts speech data or a user query into phoneme n-grams, evaluates matching scores of phoneme n-grams and performs DP (dynamic programming) selectively according to a calculation based on those matching scores. The calculation uses the expected matching score (EMS), which is the probability that a phoneme n-gram is correctly recognized, and the upper matching score (UMS), which is the upper bound on the similarity between phoneme n-grams in the query and in the spoken documents, and DP is performed selectively according to the result. Specifically, the information retrieval engine performs DP when the upper matching score (UMS) is greater than or equal to the expected matching score (EMS).

The high-speed search method comprises the following steps:

a) converting, by the speech recognition engine, the previously stored spoken documents into sets of phoneme n-grams in advance;

b) converting, by the speech recognition engine, the content of an incoming query into a set of phoneme n-grams;

c) calculating, by the information retrieval engine and the speech recognition engine, the expected matching score (EMS) of the query according to the confidence;

d) calculating, by the information retrieval engine, the upper matching score (UMS), which is the number of phonemes common to a phoneme n-gram of the query and a phoneme n-gram of the spoken documents;

e) by the information retrieval engine, performing DP to compensate for misrecognition if the UMS obtained in step d) is greater than or equal to the EMS obtained in step c), and otherwise setting the misrecognition probability to 0;

f) calculating, by the information retrieval engine, the similarity by merging phoneme n-grams, using either the DP result or the misrecognition probability of 0;

g) outputting, by the information retrieval engine, the spoken documents corresponding to the query as the search result, using the calculated similarity.

The EMS is approximated and calculated using a Gaussian mixture model. In addition, the confidence can be adjusted at the user's convenience.
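Steps d) and e) can be sketched as a gate in front of the DP. Counting "common phonemes" as a multiset intersection is an assumption, since the exact counting rule is given only in FIG. 7. The EMS is passed in as a precomputed number, and `run_dp` stands for whatever DP routine is in use.

```python
from collections import Counter


def upper_matching_score(query_ngram, doc_ngram):
    """Number of phonemes the two n-grams share (assumed: multiset intersection)."""
    common = Counter(query_ngram) & Counter(doc_ngram)
    return sum(common.values())


def misrecognition_probability(query_ngram, doc_ngram, ems, run_dp):
    """Run the DP only when UMS >= EMS; otherwise return probability 0."""
    if upper_matching_score(query_ngram, doc_ngram) >= ems:
        return run_dp(query_ngram, doc_ngram)
    return 0.0
```

The UMS check costs time linear in the n-gram length, whereas the DP it guards costs time proportional to the product of the two lengths, so skipping unpromising pairs is where the speed-up comes from.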

The high-speed search system for speech data using document expansion according to the present invention is a speech data retrieval system comprising: a speech recognition means (30) that receives spoken document data; a query input means (40) that receives a user query; a main system (10), connected to the speech recognition means (30) and the query input means (40), that includes a speech recognition engine (11) for converting speech data or a user query into phoneme n-grams and an information retrieval engine (12) for retrieving and ranking spoken documents for the user query; a spoken document database (20), connected to the main system (10), that stores the spoken document data and the corresponding phoneme n-grams produced by the speech recognition engine (11); and an output means (50) that outputs the results retrieved and ranked by the information retrieval engine (12). The system is characterized in that it uses the search method described above.

The query input means (40) and the output means (50) are preferably connected to the main system (10) through the Internet. The query input means (40) and the output means (50) are included, as hardware or software components, in any one device selected from a server, a personal computer, a portable computer, and a mobile terminal.

The high-speed search method for speech data using document expansion according to the present invention, configured as described above, is now described in detail with reference to the accompanying drawings.

In the existing document expansion method using a confusion matrix, it is not known at retrieval time which phoneme n-gram in a spoken document corresponds to a given phoneme n-gram in the query, so every phoneme n-gram in the query is compared with every phoneme n-gram in the spoken documents using DP. As described above, evaluating the similarity of two unrelated phoneme n-grams triggers unnecessary DP, which inflates the amount of computation, demands an enormous amount of memory, and greatly reduces search speed. The expected matching score based spoken document expansion used in the present invention defines the relatedness between phoneme n-grams and thereby offers a solution to the computational cost of document expansion with a confusion matrix.

To this end, the present invention approximates phoneme recognition accuracy with a Gaussian mixture model and proposes two scores that decide whether DP is performed. The expected matching score (EMS) is the expected number of phonemes that would have been correctly recognized if a phoneme n-gram of the query had also been present in the spoken document. The upper matching score (UMS) is the upper bound on the similarity between phoneme n-grams in the query and in the spoken document. DP is performed only when the UMS between a query n-gram and a document n-gram is greater than or equal to the EMS, that is, only when the phoneme n-gram in the spoken document may have the same reference phoneme n-gram as the phoneme n-gram of the query.

1. Calculation of the Expected Matching Score (EMS)

The Expected Matching Score (EMS) of a phoneme n-gram in the query entered by the user is the expected number of phonemes that would be recognized correctly if that n-gram had occurred in a speech document.

To compute this value, the probability that a single phoneme (monophone) is correctly recognized by the speech recognizer is first approximated with a Gaussian mixture model (GMM). To do so, the occurrences of the monophone in the training data are separated into those recognized correctly and those not, and from these the matching score, the probability that the monophone is correctly recognized, is obtained as shown in FIG. 3. In FIG. 3, one marker denotes an occurrence of the monophone that was recognized correctly and the other an occurrence that was not. Once the correctness of each occurrence has been determined from the training data, the occurrences are grouped into consecutive blocks of a fixed size. FIG. 3 shows a block size of 5: for the first five occurrences in the training data the matching score is 0.8, and for the following five it is 0.6. (FIG. 3 is only one example. The system that actually implements the search method of the present invention used a block size of 10. The block size may be chosen freely by the user, and the present invention is not limited by its value.)
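The block-wise matching score described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `block_matching_scores`, the boolean encoding of recognition outcomes, and the decision to ignore a trailing partial block are all assumptions made here. The sample outcomes are chosen only so that they reproduce the 0.8 and 0.6 values quoted for FIG. 3.

```python
from typing import List


def block_matching_scores(correct: List[bool], block_size: int = 10) -> List[float]:
    """Return the fraction of correctly recognized occurrences in each full block.

    `correct[i]` is True when the i-th occurrence of one monophone in the
    training data was recognized correctly.
    """
    scores = []
    # Assumption: occurrences that do not fill a whole block are ignored.
    for start in range(0, len(correct) - block_size + 1, block_size):
        block = correct[start:start + block_size]
        scores.append(sum(block) / block_size)
    return scores


# Hypothetical outcomes: 4 of the first 5 correct, then 3 of the next 5.
outcomes = [True, True, False, True, True, True, False, True, False, True]
fig3_scores = block_matching_scores(outcomes, block_size=5)
print(fig3_scores)
```

A larger block size gives a finer-grained set of possible scores (steps of 0.1 for a block size of 10) at the cost of fewer samples per monophone.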

The matching scores computed in this way are approximated with a Gaussian mixture model, as shown in FIG. 4. Expressed as a formula, the probability density function of the probability that the monophone is correctly recognized is approximated by Equation 9 below. The parameters of Equation 9 are the total number of mixtures, the mean and standard deviation of each mixture, and the weight of each mixture; the weights are subject to a constraint stated with the equation.
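For orientation, the standard form of a Gaussian mixture density is shown below. It is offered as a sketch of what an approximation of this kind usually looks like, not as a transcription of Equation 9. The symbols $x$ (the matching score), $M$ (number of mixtures), $w_m$, $\mu_m$, and $\sigma_m$ are labels chosen here for illustration and may differ from those used in the equation; the normalization of the weights is the usual constraint for a mixture model and is likewise an assumption.

```latex
f(x) \;\approx\; \sum_{m=1}^{M} w_m \,
    \frac{1}{\sqrt{2\pi}\,\sigma_m}
    \exp\!\left( -\frac{(x-\mu_m)^2}{2\sigma_m^2} \right),
\qquad
\sum_{m=1}^{M} w_m = 1 .
```

With $M = 1$, the setting reported for the experiments, this reduces to a single Gaussian with mean $\mu_1$ and standard deviation $\sigma_1$.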

In addition, a confidence value, which expresses how far the distribution obtained in this way is to be trusted, is proposed as shown in FIG. 5. The function shown in FIG. 5 is the cumulative distribution function of the probability that the monophone is correctly recognized. Therefore, once the confidence has been fixed for the whole system, the probability that the monophone is correctly recognized is determined from this function, as shown in FIG. 5.

The EMS of a phoneme n-gram in the query is calculated, as in Equation 10 below, as the sum over the monophones in the n-gram of the probability, obtained as described above, that each monophone is correctly recognized.
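The EMS computation can be sketched as below, under two loudly stated assumptions. First, each monophone is modeled by a single Gaussian (the experiments fix the number of mixtures at one). Second, the per-phoneme probability is read off as the inverse cumulative distribution function evaluated at the confidence; the text only says that the probability is determined from the cumulative distribution function once the confidence is fixed, so this reading is an interpretation, chosen because it makes the EMS grow with the confidence as the later sections describe. The names `phoneme_probability`, `expected_matching_score`, and the model parameters are hypothetical.

```python
from statistics import NormalDist
from typing import Dict, Sequence, Tuple


def phoneme_probability(mean: float, std: float, confidence: float) -> float:
    """Probability that one monophone is recognized correctly.

    Assumption: the inverse CDF of the fitted Gaussian at `confidence`,
    clipped to the valid probability range [0, 1].
    """
    value = NormalDist(mean, std).inv_cdf(confidence)
    return min(1.0, max(0.0, value))


def expected_matching_score(
    ngram: Sequence[str],
    models: Dict[str, Tuple[float, float]],
    confidence: float,
) -> float:
    """Sum of the per-phoneme probabilities over the phonemes of the n-gram."""
    return sum(phoneme_probability(*models[p], confidence) for p in ngram)


# Hypothetical (mean, standard deviation) pairs for three monophones.
models = {"A": (0.6, 0.1), "B": (0.5, 0.2), "C": (0.7, 0.1)}
ems_mid = expected_matching_score(["A", "B", "C"], models, confidence=0.5)
ems_high = expected_matching_score(["A", "B", "C"], models, confidence=0.9)
print(ems_mid, ems_high)
```

At a confidence of 0.5 the sketch returns the sum of the means; raising the confidence raises every per-phoneme value and therefore the EMS.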

2. Restricting the search range using the confidence

The confidence defined above determines whether the EMS is estimated high or low, reflecting the fact that the same speech does not always yield the same recognition accuracy. FIG. 6 shows the role the confidence plays in speech document retrieval. In the figure, each shape denotes a different reference phoneme n-gram, and the darkness of its color denotes the speech recognition accuracy. For example, the black rectangle of speech document (1) appearing as a hatched rectangle after the phoneme recognizer means that the reference phoneme n-gram information, namely that the speech is a rectangle, was preserved, but that some misrecognition occurred and the recognized phoneme sequence differs somewhat from that of the black rectangle.

In FIG. 6, the reference phoneme n-grams drawn as rectangles in the three speech documents (1), (2), and (3) show different recognition accuracies: black, hatched, and white. In Case #1 in particular, the hexagonal reference phoneme n-gram appears as a white triangle after recognition. This is because, when the degree of recognition error is large, it is hard to determine which reference phoneme n-gram the result originally came from.

To reflect these differences in the degree of misrecognition, the confidence allows the EMS to be scaled up or down. For example, when the confidence is large, as in Case #2, the EMS derived from Equation 9 is calculated to be large. In this case, the black rectangle and black triangle of the query are predicted to be recognized as a black rectangle and black triangle, with almost no recognition error, had those phoneme n-grams been present in a speech document. DP is therefore performed only against (2), which contains a black rectangle and a black triangle, and search speed improves. On the other hand, the reference phoneme n-grams of the black rectangle and triangle contained in the speech of (1) and (3) have had their phoneme sequences altered to the hatched or white level by recognition errors, so they are excluded from the search and retrieval accuracy falls.

Conversely, when the confidence is small, the EMS is calculated to be small, and recognition errors up to the white level are predicted to be possible during speech recognition. DP is therefore performed between the black rectangle and triangle of the query and all of (1), (2), and (3), which contain rectangles and triangles of white or darker shade. As a result, search speed falls while retrieval accuracy improves.

However, unconditionally keeping the confidence small in order to raise accuracy is not a solution. As Case #1 shows, a hexagonal reference phoneme n-gram can, under severe misrecognition, be recognized as a triangle, a completely different reference phoneme n-gram. If the confidence were kept small, that triangle would also be included in the search for triangles, but because the actual reference phoneme n-gram of the triangle in the recognition result was a hexagon, this search time is simply wasted. The confidence must therefore be set so that search time is not wasted while accuracy is not lowered.

3. Calculation of the Upper Matching Score (UMS)

The UMS is a criterion for evaluating the similarity between a phoneme n-gram in the query and a phoneme n-gram in a speech document. It is used to determine whether the two have the same reference phoneme n-gram and, if so, how much misrecognition has occurred. The UMS proposed in the present invention is defined as the number of phonemes common to the two phoneme n-grams and is calculated as follows.

Consider the number of monophones used by the phoneme recognizer, and assume that each monophone is assigned an index from 1 up to that number. A phoneme n-gram in the query can then be represented as a vector with one dimension per monophone, in which each dimension records the number of times the corresponding phoneme occurs. If a phoneme n-gram in a speech document is converted to a vector in the same way, the UMS is calculated as in Equation 11 below.

Because the UMS compares the number of phonemes common to the phoneme n-grams without regard to their order, it is an upper bound on the score that can be obtained when the two phoneme n-grams are compared by DP.

FIG. 7 is an example of UMS calculation. In the figure, the phoneme n-gram [A, B, C], shown as a black rectangle, has three phonemes in common when compared with itself, so its UMS is 3. In contrast, [D, F, G] and [D, E, D] have no phonemes in common with [A, B, C], so their UMS is 0, which agrees with the fact that they are not rectangle phoneme n-grams. When a single recognition error turns the black rectangle [A, B, C] into the hatched rectangle [A, B, D], the two phonemes A and B are common, so the UMS is 2. Likewise, the white rectangle [A, D, G] has a UMS of 1.
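The order-independent count of common phonemes can be sketched with a multiset intersection. This is an illustration under an assumption: Equation 11 is taken to be the sum, over all monophones, of the smaller of the two occurrence counts, which is consistent with the definition in the text and with every value quoted for FIG. 7. The function name `upper_matching_score` is hypothetical.

```python
from collections import Counter
from typing import Sequence


def upper_matching_score(query_ngram: Sequence[str], doc_ngram: Sequence[str]) -> int:
    """Number of phonemes the two n-grams share, ignoring order.

    Counter plays the role of the per-monophone count vector; `&` keeps the
    minimum count of each phoneme present in both n-grams.
    """
    common = Counter(query_ngram) & Counter(doc_ngram)
    return sum(common.values())


reference = ["A", "B", "C"]
for candidate in (["A", "B", "C"], ["A", "B", "D"], ["A", "D", "G"],
                  ["D", "F", "G"], ["D", "E", "D"]):
    print(candidate, upper_matching_score(reference, candidate))
```

Because only counts are compared, the cost is linear in n, which is what makes the UMS cheap enough to serve as a filter ahead of DP.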

4. Speech document expansion through selective DP

The conventional speech document expansion technique using a confusion matrix, that is, the method computed with Equation 2, runs DP for every pair of a phoneme n-gram in the query and a phoneme n-gram in the speech document. This compensates for the case in which the query's phoneme n-gram was present in the speech document but was misrecognized as a different phoneme n-gram during speech recognition. When the document's phoneme n-gram is not a misrecognition of the query's phoneme n-gram, however, this calculation is meaningless. To reduce the computation added by such meaningless calculations, the selective DP of the present invention, which uses the EMS and UMS as described below, regards the document's phoneme n-gram as a misrecognition of the query's phoneme n-gram, and performs DP, only when the UMS is greater than or equal to the EMS.

FIG. 8 is an example of selective DP. When the phoneme n-grams shown as a black rectangle, a hexagon, and a triangle at the upper left are given as the query, the EMS estimator yields EMS values of 2, 1, and 3 for them, respectively. These values mean that if each phoneme n-gram were recognized by the speech recognizer, two, one, and three phonemes, respectively, would be recognized correctly.

Next, the UMS between each phoneme n-gram in the speech document at the lower left and each phoneme n-gram of the query is calculated. For example, the black rectangle in the query has a UMS of 2 against the hatched rectangle in the speech document and a UMS of 3 against the black rectangle. Likewise, the black triangle in the query has a UMS of 2 against the hatched triangle in the speech document and a UMS of 3 against the black triangle.

The actual DP is then carried out only with those phoneme n-grams in the speech document for which the UMS is greater than or equal to the EMS. In the figure, these are the phoneme n-grams whose UMS values are circled. Accordingly, DP is performed between the black rectangle of the query and both the hatched and black rectangles of the speech document, and between the black triangle of the query and the black triangle of the speech document.
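The selection rule can be checked against the FIG. 8 numbers with a few lines. The sketch below encodes only the EMS and UMS values quoted in the text; the dictionary layout and the shape labels are illustrative assumptions, and pairs whose UMS is not stated in the text are simply left out.

```python
# EMS of each query n-gram, as quoted for FIG. 8.
ems = {"rectangle": 2, "hexagon": 1, "triangle": 3}

# UMS of (query n-gram, document n-gram) pairs, as quoted for FIG. 8.
ums = {
    ("rectangle", "hatched rectangle"): 2,
    ("rectangle", "black rectangle"): 3,
    ("triangle", "hatched triangle"): 2,
    ("triangle", "black triangle"): 3,
}

# Selective DP: keep a pair only when its UMS reaches the query n-gram's EMS.
dp_pairs = [pair for pair, score in ums.items() if score >= ems[pair[0]]]
print(dp_pairs)
```

The hatched triangle is dropped because its UMS of 2 falls below the triangle's EMS of 3, which matches the pairs the text lists for DP.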

Table 2 below shows the complete search algorithm of the EMS-based speech document expansion.

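Input: a set of speech documents and a user query
Output: the speech documents sorted by relevance to the query

1. For each speech document in the set, convert the document into phoneme n-grams for every n in the chosen range.
2. Given a user query entered by keyboard:
2.1. Convert the query into phoneme n-grams for every n in the chosen range.
2.2. For each n in the range and each speech document:
A. For each pair of a query phoneme n-gram and a document phoneme n-gram:
(1) Calculate the EMS.
(2) Calculate the UMS.
(3) If the UMS is greater than or equal to the EMS, calculate the DP score; otherwise set it to 0.
(4) Calculate the similarity of Equation 7.
2.3. Calculate the merged similarity of Equation 6 and save the (document, score) pair to the result set.
3. Sort the result set by score and output it.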

The system proposed in the present invention ranks speech documents for a user query by merging n-grams, as described for document expansion with a confusion matrix. Accordingly, in preprocessing step 1, each speech document in the speech document set is converted into phoneme n-grams for each value of n in the range. In the search phase, step 2.1 converts the user query entered by keyboard into phoneme n-grams for every n in the range. Step 2.2 evaluates the similarity between the query and each speech document with the measure of Equation 7, using selective DP between the phoneme n-grams of the query and those of the speech document. To this end, the EMS and UMS are calculated, and DP is performed selectively only when the UMS is greater than or equal to the EMS; otherwise DP is skipped and the corresponding term of Equation 7 is taken to be 0. In step 2.3, since the calculation has been completed for all values of n, the similarity obtained by merging the n-grams is calculated with Equation 6 and stored in the result set together with the identifier of the document. Finally, the result set is sorted by the score that expresses the relevance between each document and the query, and returned.
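The overall flow of steps 1 to 3 can be sketched end to end as follows. Several parts are stand-ins and are assumptions rather than the patent's formulas: `lcs` (longest common subsequence length) replaces the confusion-matrix-based DP of Equation 2, a plain sum of the best DP score per query n-gram replaces the scoring of Equations 6 and 7, and a flat table `p_correct` of per-phoneme probabilities replaces the values derived from the Gaussian model and the confidence. All function names and data are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def ngrams(seq: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    """All contiguous phoneme n-grams of a phoneme sequence."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]


def ums(a: Sequence[str], b: Sequence[str]) -> int:
    """Upper Matching Score: phonemes in common, ignoring order."""
    return sum((Counter(a) & Counter(b)).values())


def lcs(a: Sequence[str], b: Sequence[str]) -> int:
    """Stand-in DP score: longest common subsequence length."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def search(docs: Dict[str, Sequence[str]], query: Sequence[str],
           p_correct: Dict[str, float], n_values=(2, 3)) -> List[Tuple[str, float]]:
    # Step 1: index every document as phoneme n-grams for each n.
    index = {d: {n: ngrams(s, n) for n in n_values} for d, s in docs.items()}
    results = []
    for doc_id, by_n in index.items():
        total = 0.0
        for n in n_values:                      # step 2.2
            for q in ngrams(query, n):          # step 2.1, done inline
                ems = sum(p_correct[p] for p in q)
                best = 0
                for d in by_n[n]:
                    if ums(q, d) >= ems:        # selective DP: skip hopeless pairs
                        best = max(best, lcs(q, d))
                total += best                   # stand-in for Equations 6 and 7
        results.append((doc_id, total))         # step 2.3
    results.sort(key=lambda r: r[1], reverse=True)  # step 3
    return results


docs = {"doc1": list("ABCDE"), "doc2": list("AXCYE"), "doc3": list("XYZXY")}
p_correct = {p: 0.6 for p in "ABCDEXYZ"}
ranked = search(docs, list("ABC"), p_correct)
print(ranked)
```

The UMS test is linear in n while the DP is quadratic, so the saving comes from how many pairs fail the test; in this toy run the pair (A, B) against (B, C) is skipped because its UMS of 1 is below the EMS of 1.2.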

To verify the efficiency of the system according to the present invention, the performance of the EMS-based speech document expansion technique was compared with that of the conventional system.

The experimental environment for the performance comparison was as follows. The phoneme recognizer was built using Sphinx4 and the acoustic model supplied with it. The language model was generated from the phoneme sequences obtained by running FreeTTS on the text of the Hub4 1996/97 English Broadcast News Transcripts. The accuracy of the phoneme recognizer was 53%. The evaluation data was three hours of the 1999 Hub4 Broadcast News Evaluation English Test Material; the wave files were split as marked in the transcripts, and each segment was treated as a separate speech document. The first 30 minutes of the data were used for development, for computing the confusion matrix, and for training the distribution of the probability that a phoneme is correctly recognized, which is needed for the EMS calculation. The remaining 2 hours 30 minutes were used for evaluation. The queries were 20 words that occur frequently in the evaluation data. The answer set for each query was taken to be the top 5, 10, 15, and 20 documents returned by a vector-space-model-based search engine run on the transcripts of the test data; these answer sets are named AnsSet5, AnsSet10, AnsSet15, and AnsSet20.

In addition, the following was done to approximate phoneme recognition accuracy with a Gaussian mixture model. For every 10 occurrences of each phoneme in the training data, the probability that the phoneme was correctly recognized was computed, and from these values the probability that the recognition accuracy is 0.0, 0.1, ..., 1.0 was calculated. FIG. 9 shows the results for the monophones W and Z; the horizontal axis is the recognition accuracy of each monophone, and the vertical axis is the probability of observing that recognition accuracy. Since most monophones showed a single peak in this way, the number of mixtures was fixed at one.
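The fitting step described above can be sketched as follows: per-block accuracies are tallied into the eleven levels 0.0 to 1.0, and a single Gaussian is summarized by the sample mean and standard deviation. Using the sample mean and the population standard deviation as the fit is an assumption made here (it is the maximum-likelihood fit for one Gaussian); the function name and the sample accuracies are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev
from typing import Dict, List


def accuracy_distribution(block_scores: List[float]) -> Dict[float, float]:
    """Empirical probability of observing each accuracy level 0.0, 0.1, ..., 1.0."""
    # Count in integer tenths to avoid comparing floating-point keys.
    counts = Counter(round(score * 10) for score in block_scores)
    total = len(block_scores)
    return {level / 10: counts.get(level, 0) / total for level in range(11)}


# Hypothetical per-block accuracies for one monophone (block size 10).
scores = [0.5, 0.6, 0.6, 0.7, 0.6, 0.5, 0.7, 0.6]
dist = accuracy_distribution(scores)

# Single-mixture fit: one mean and one standard deviation.
mu, sigma = mean(scores), pstdev(scores)
print(dist[0.6], mu, sigma)
```

A distribution with one clear peak, like this one, is the case in which a single mixture is an adequate summary.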

The results of the performance comparison between the speech retrieval system according to the present invention and the conventional speech retrieval system are described below. The main criteria used for the evaluation are Recall, Precision, and Mean Average Precision (MAP). Let Retrieved be the set of documents retrieved for a user query, and let Relevant be the set of answer documents in the whole speech document collection that are actually related to the query. Recall and Precision are then defined as in Equation 12 below.

Mean Average Precision was taken as the mean of the precision values obtained at the 11 standard recall levels (the points at which 0%, 10%, 20%, ..., 100% of the answer documents have been retrieved).
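The evaluation measures can be sketched as below. The set-based recall and precision follow the standard information-retrieval definitions in terms of Retrieved and Relevant, which is what the text around Equation 12 describes; the function names and the small document sets are hypothetical. The eleven precision values are the AnsSet5 column reported for the conventional method in Table 3, used here only to show how the averaged figure is obtained.

```python
from typing import Sequence, Set


def recall(retrieved: Set[str], relevant: Set[str]) -> float:
    """Share of the relevant documents that were retrieved."""
    return len(retrieved & relevant) / len(relevant)


def precision(retrieved: Set[str], relevant: Set[str]) -> float:
    """Share of the retrieved documents that are relevant."""
    return len(retrieved & relevant) / len(retrieved)


def eleven_point_average(precisions: Sequence[float]) -> float:
    """Mean of the precision values at recall 0.0, 0.1, ..., 1.0."""
    if len(precisions) != 11:
        raise ValueError("expected one precision value per standard recall level")
    return sum(precisions) / 11


retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d2", "d4", "d7"}
print(recall(retrieved, relevant), precision(retrieved, relevant))

ansset5_conventional = [0.464, 0.464, 0.464, 0.282, 0.282,
                        0.093, 0.093, 0.037, 0.037, 0.013, 0.013]
map_ansset5 = eleven_point_average(ansset5_conventional)
print(round(map_ansset5, 3))
```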

Tables 3 and 4 compare the performance of the method used by the conventional system with that of the EMS-based speech document expansion proposed in the present invention. The Recall values 0.0, 0.1, ..., 1.0 in the left column denote the points at which 0%, 10%, ..., 100% of the correct answers have been retrieved, and the AnsSet5, AnsSet10, AnsSet15, and AnsSet20 columns give the precision at that recall for each answer set. MAP is the mean of the precision values obtained for each AnsSet, and Time is the average time taken to process a query. Each underlined cell in Table 4 indicates a case in which precision fell relative to the conventional method of Table 3.

In the experiments, precision fell at some of the 11 standard recall levels, but by less than 1%. When MAP is compared with the conventional method, the maximum degradation of MAP (Max. Degradation of MAP) was 0.003, observed for AnsSet20, where the MAP of the conventional method was 0.253 and that of the proposed method was 0.250. In terms of search speed, the conventional method took 925 seconds for the 2 hours 30 minutes of evaluation data, whereas the proposed method took 53 seconds, a speed-up of 17.5 times. This means that the conventional method processes data about 10 times faster than real time, whereas the proposed method can search about 170 times faster than real time.
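The speed figures quoted above follow from simple arithmetic on the reported timings, which the short check below reproduces. Only numbers stated in the text are used; the variable names are illustrative.

```python
evaluation_seconds = 2.5 * 3600   # 2 hours 30 minutes of evaluation data
conventional_seconds = 925        # reported time for the conventional method
proposed_seconds = 53             # reported time for the proposed method

speed_up = conventional_seconds / proposed_seconds
real_time_conventional = evaluation_seconds / conventional_seconds
real_time_proposed = evaluation_seconds / proposed_seconds
map_drop = round(0.253 - 0.250, 3)  # AnsSet20, conventional vs. proposed

print(round(speed_up, 1), round(real_time_conventional, 1),
      round(real_time_proposed, 1), map_drop)
```

The ratios come out at roughly 17.5, 9.7, and 169.8, which the text rounds to 17.5 times, 10 times, and 170 times.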

Table 3. Conventional method

Recall                         AnsSet5   AnsSet10  AnsSet15  AnsSet20
0.0                            0.464     0.545     0.625     0.669
0.1                            0.464     0.545     0.480     0.552
0.2                            0.464     0.437     0.350     0.403
0.3                            0.282     0.229     0.249     0.330
0.4                            0.282     0.178     0.193     0.250
0.5                            0.093     0.148     0.135     0.170
0.6                            0.093     0.126     0.112     0.142
0.7                            0.037     0.078     0.088     0.121
0.8                            0.037     0.037     0.073     0.085
0.9                            0.013     0.025     0.032     0.036
1.0                            0.013     0.019     0.019     0.024
Mean Average Precision (MAP)   0.204     0.215     0.214     0.253
Time (sec.)                    925

Table 4. Proposed method

Recall                         AnsSet5   AnsSet10  AnsSet15  AnsSet20
0.0                            0.464     0.545     0.628     0.670
0.1                            0.464     0.545     0.479     0.549
0.2                            0.464     0.433     0.343     0.402
0.3                            0.281     0.225     0.250     0.320
0.4                            0.281     0.178     0.192     0.250
0.5                            0.095     0.152     0.136     0.173
0.6                            0.095     0.127     0.113     0.139
0.7                            0.039     0.076     0.085     0.116
0.8                            0.039     0.034     0.070     0.082
0.9                            0.011     0.023     0.028     0.031
1.0                            0.011     0.014     0.011     0.014
Mean Average Precision (MAP)   0.204     0.214     0.212     0.250
Time (sec.)                    53
Speed Up                       x17.5
Max. Degradation of MAP        0.003

FIG. 10 compares the maximum degradation of MAP (Max. Degradation of MAP) and the speed-up (Speed Up) as the confidence is varied. In the experiments, the EMS grew as the confidence grew, which narrowed the search range, so the search speed improved steadily. On the other hand, because the search range narrowed, the maximum degradation of MAP rose from 0.002 to 0.003 at a confidence of 0.60, and from 0.003 to 0.033 at a confidence of 0.96.

Retrieval accuracy falls in this way for two reasons. First, as the confidence increases the EMS gradually increases, while the UMS has a fixed value because the query and the document set do not change. Second, DP is performed only for phoneme n-grams for which the UMS is greater than or equal to the EMS. As a result, among the heavily weighted phoneme n-grams with n = 3, the n-grams with a UMS of 1 are excluded from the search at a confidence of 0.60, and the n-grams with a UMS of 1 or 2 are excluded at a confidence of 0.96. In particular, n-grams with a UMS of 2 are rated as more similar under DP than those with a UMS of 1, so the loss of retrieval accuracy was greater at a confidence of 0.96.

FIG. 11 shows the high-speed speech data retrieval system according to the present invention, and FIG. 12 shows the high-speed speech data retrieval method according to the present invention, presenting the algorithm of Table 2 as an operational flowchart. As shown in FIG. 11, the system comprises a main system (10) that includes a speech recognition engine (11) and an information retrieval engine (12), a speech document database (20), a speech recognition means (30), a query input means (40), and an output means (50).

The speech recognition engine (11) is connected to the speech recognition means (30) and the query input means (40) and converts the received speech data or user query into phoneme n-grams, that is, into a form on which the information retrieval engine (12) can operate. The speech documents are stored, together with the corresponding phoneme n-grams produced in this way, in the speech document database (20) connected to the main system (10). When a query arrives through the query input means (40), the information retrieval engine (12) performs a search operation that compares the phoneme n-grams of the user query with the phoneme n-grams stored in the speech document database (20), and sorts the retrieved results. The results are output to the output means (50) connected to the main system (10).

The high-speed speech data retrieval system according to the present invention may of course be a single integrated system. For example, the main system (10) may be implemented as software on a personal computer, the speech document database (20) may be held in part of that computer's storage, and the speech recognition means (30), the query input means (40), and the output means (50) may be peripherals connected to the computer, such as a keyboard and a monitor. The system may equally be provided on a portable computer, a mobile terminal, or a similar device.

Alternatively, as with an Internet server and client, the main system (10) may be provided separately, with the query input means (40) and the output means (50) connected to it via the Internet. In that case these means may be the keyboard and monitor of an ordinary personal computer. That is, the user types a query on the keyboard and sends it over the Internet to the main system (10) (the keyboard being the query input means (40)); the main system (10) sends the search results for the query back to the user's computer over the Internet, and the results are displayed on the user's monitor (the monitor being the output means (50)). Likewise, the query input means (40) and the output means (50) may be the input and output means of a portable computer (laptop, PDA, etc.) or a mobile terminal.

The query input means (40) and the output means (50) may thus be configured in hardware. They may also be configured in software, for example when the main system (10) is connected to another server system. Suppose an integrated search system receives a search request from a user. The integrated search system could search speech documents itself, but because speech documents differ greatly in form from text, image, and video documents, it is preferable to keep the search system for each type separate. The high-speed speech data retrieval system according to the present invention can operate as one component of such an integrated search system. In that case, the system does not receive a user query from an individual. Instead, the integrated search system processes the input it received from the user into a form suitable for speech data retrieval and passes it on, and the system receives this as the user query and performs the search. The query input means (40) is then the algorithm by which the integrated search system hands the query to the high-speed speech data retrieval system, so in this case the query input means (40) is implemented in software. The output means (50) may likewise be implemented in hardware, as the output means of a personal computer, portable computer, or mobile terminal, or in software, as in the example of the query input means (40) in the integrated search system.

The present invention is not limited to the embodiments described above. Its range of application is broad, and anyone of ordinary skill in the art to which the present invention pertains can of course make various modifications without departing from the gist of the present invention as claimed in the claims.

As described above, according to the present invention, using a speech document expansion technique based on the expected matching score when retrieving speech documents with phoneme n-grams resolves the heavy computational load of the conventional document expansion technique that uses a confusion matrix, so the amount of computation is greatly reduced. As the performance comparison above shows, the method of the present invention improves retrieval speed by a factor of 17.5 over the conventional method, while the accompanying loss in retrieval accuracy was observed to be only 0.003 in MAP. Retrieval speed thus improves dramatically with almost no loss of retrieval accuracy.

In addition, according to the present invention, combining the proposed technique with a large-capacity continuous speech recognizer makes it possible to implement a large-capacity continuous speech recognizer with higher retrieval speed and a more accurate recognition rate, and also to implement a high-performance speech document search engine that resolves the OOV and speech misrecognition problems of phoneme recognizers. Accordingly, compared with conventional speech recognizers and search engines, the present invention greatly lowers requirements such as storage capacity and processor speed, so that equipment can be built at markedly lower cost yet with higher recognition and retrieval efficiency, which is an economic benefit as well.

Claims

A method for high-speed retrieval of speech data by document expansion using a confusion matrix, wherein

a matching score of phoneme n-grams is calculated by an information retrieval engine (IR engine) used to retrieve speech documents and a speech recognition engine that converts speech data or user queries into phoneme n-grams, and dynamic programming (DP) is selectively performed according to the result of the calculation based on the matching score.

The method of claim 1, wherein

an expected matching score (EMS), which is the probability that a phoneme n-gram is correctly recognized, and an upper matching score (UMS), which is the upper limit of the similarity between the phoneme n-grams of the query and of the speech document, are calculated by the information retrieval engine and the speech recognition engine, and DP is selectively performed according to the calculation result.

The method of claim 2, wherein

the DP is performed by the information retrieval engine when the calculation result satisfies the condition that the upper matching score (UMS) is greater than or equal to the expected matching score (EMS).

The method of claim 3 for high-speed retrieval of speech data by document expansion, comprising:

a) converting in advance, by the speech recognition engine, prestored speech documents into sets of phoneme n-grams;

b) converting, by the speech recognition engine, the content of a generated query into a set of phoneme n-grams;

c) calculating, by the information retrieval engine and the speech recognition engine, an expected matching score (EMS) of the query according to the reliability of the query;

d) calculating, by the information retrieval engine, an upper matching score (UMS) from the number of phonemes common to the query phoneme n-gram and the phoneme n-gram of the speech document;

e) performing, by the information retrieval engine, DP to compensate for misrecognition if the UMS obtained in step d) is greater than or equal to the EMS obtained in step c), and otherwise setting the misrecognition probability to 0;

f) calculating, by the information retrieval engine, a similarity by merging phoneme n-grams using the result of the DP or the result of setting the misrecognition probability to 0; and

g) outputting, by the information retrieval engine, the speech documents corresponding to the query as a search result using the calculated similarity.
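
The selective DP described in steps a) to g) can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the claimed implementation: the EMS is passed in as a fixed parameter `ems` rather than derived from the query reliability, the UMS is taken to be the fraction of shared phonemes, and the DP is a plain longest-common-subsequence alignment standing in for a confusion-matrix-weighted DP. All function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple

Phonemes = Tuple[str, ...]


def ngrams(phonemes: Sequence[str], n: int) -> List[Phonemes]:
    """Steps a) and b): slide a window over n consecutive phonemes."""
    return [tuple(phonemes[i:i + n]) for i in range(len(phonemes) - n + 1)]


def upper_matching_score(q: Phonemes, d: Phonemes) -> float:
    """Step d) (assumed form): shared phoneme count over n-gram length."""
    common = sum((Counter(q) & Counter(d)).values())
    return common / len(q)


def dp_similarity(q: Phonemes, d: Phonemes) -> float:
    """Simplified DP: longest common subsequence length over n-gram length."""
    prev = [0] * (len(d) + 1)
    for qp in q:
        cur = [0]
        for j, dph in enumerate(d, start=1):
            cur.append(prev[j - 1] + 1 if qp == dph else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1] / len(q)


def pair_similarity(q: Phonemes, d: Phonemes, ems: float) -> float:
    """Step e): run the DP only when the cheap upper bound reaches the EMS."""
    if upper_matching_score(q, d) >= ems:
        return dp_similarity(q, d)
    return 0.0  # misrecognition probability treated as zero, DP skipped


def rank(query: Sequence[str], docs: Dict[str, Sequence[str]],
         n: int = 3, ems: float = 0.6) -> List[Tuple[str, float]]:
    """Steps f) and g): merge n-gram similarities and sort the documents."""
    q_grams = ngrams(query, n)
    scores = {}
    for doc_id, phonemes in docs.items():
        d_grams = ngrams(phonemes, n)
        scores[doc_id] = sum(
            max((pair_similarity(q, d, ems) for d in d_grams), default=0.0)
            for q in q_grams
        )
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

In this sketch the shared-phoneme count can never be smaller than the alignment length, so skipping the DP when the bound falls below `ems` only discards pairs whose similarity would also have been below `ems`. That is the source of the saving: most n-gram pairs share few phonemes and never reach the DP.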

The method of claim 4, wherein

the expected matching score is approximated and calculated using a Gaussian mixture model.

The method of claim 4, wherein

the reliability is adjustable at the user's discretion.

A system for high-speed retrieval of speech data by document expansion, comprising:

speech recognition means (30) for receiving speech document data;

query input means (40) for receiving a user query;

a main system (10) connected to the speech recognition means (30) and the query input means (40), and including a speech recognition engine (11) that converts speech data or the user query into phoneme n-grams and an information retrieval engine (12) that searches for and sorts the speech documents corresponding to the user query;

a speech document database (20) connected to the main system (10), which stores the speech document data and the phoneme n-grams converted from the speech document data by the speech recognition engine (11); and

output means (50) for outputting the results searched and sorted by the information retrieval engine (12),

wherein the system uses the retrieval method according to any one of claims 1 to 6.
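
One way to picture the speech document database (20) is as an inverted index from phoneme n-grams to the documents that contain them. The Python sketch below is a hypothetical in-memory illustration only: the class name, its methods, and the exact-match lookup are assumptions for this example, and the lookup does not by itself compensate for misrecognition.

```python
from collections import defaultdict
from typing import DefaultDict, Dict, List, Sequence, Set, Tuple


class SpeechDocumentDatabase:
    """Hypothetical in-memory stand-in for the speech document database (20)."""

    def __init__(self, n: int = 3) -> None:
        self.n = n
        self.documents: Dict[str, List[str]] = {}
        self.index: DefaultDict[Tuple[str, ...], Set[str]] = defaultdict(set)

    def add(self, doc_id: str, phonemes: Sequence[str]) -> None:
        # Store the recognized phoneme sequence and index each of its n-grams.
        self.documents[doc_id] = list(phonemes)
        for i in range(len(phonemes) - self.n + 1):
            self.index[tuple(phonemes[i:i + self.n])].add(doc_id)

    def candidates(self, query_phonemes: Sequence[str]) -> Set[str]:
        # Collect every document that shares at least one exact n-gram with the query.
        found: Set[str] = set()
        for i in range(len(query_phonemes) - self.n + 1):
            found |= self.index.get(tuple(query_phonemes[i:i + self.n]), set())
        return found
```

A candidate set obtained this way would still need to be scored and sorted by the information retrieval engine (12).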

The system of claim 7, wherein

the query input means (40) and the output means (50) are connected to the main system (10) via the Internet.

The system of claim 8, wherein

the query input means (40) and the output means (50) are included, as hardware or software components, in any one selected from a server, a personal computer, a portable computer, and a mobile terminal.