KR20210049518A

KR20210049518A - Method and apparatus for interpreting intention of query

Info

Publication number: KR20210049518A
Application number: KR1020190133906A
Authority: KR
Inventors: 하은지; 성주원; 장두성
Original assignee: 주식회사 케이티
Priority date: 2019-10-25
Filing date: 2019-10-25
Publication date: 2021-05-06
Also published as: KR102445172B1

Abstract

The method in which a computing device operated by at least one processor interprets a query comprises: a step of receiving queries in which intentions interpreted by an arbitrary device are labeled; a step of classifying the input queries into a reference query and comparison queries expected to be erroneously classified, and analyzing a vocabulary pattern of the reference query; a step of generating training data with the reference query, the comparison queries, and the vocabulary pattern; and a step of using the training data to train a query interpretation model. The comparison queries are queries indicating a frequency of a predetermined value or lower among the input queries.

Description

Query interpretation method and apparatus {METHOD AND APPARATUS FOR INTERPRETING INTENTION OF QUERY}

The present invention relates to a technique for interpreting the intention of a query.

A dialogue system or document analysis system needs a query intention interpretation system that accurately classifies the intention of a user query, and most such systems are trained by machine learning. Training by machine learning requires a large number of training sentences labeled with the user's intention.

Commercial systems generate a large volume of sentence logs every day, and these logs can be used to generate training data. However, when expert labor is used in the process of matching sentences to intentions, the cost and time required are considerable.

To address this drawback, a method that uses logs collected from an actual commercial service system has also appeared. Sentences and intentions are matched in advance, an expert reviews and corrects the results of the actual commercial service, and the corrected content is used again for training to improve system performance.

However, manually checking every entry in a large volume of commercial service logs is difficult and takes an enormous amount of time, so this method is rarely used.

The problem to be solved is to provide a method and apparatus that select a small number of queries expected to have been misclassified from among the queries whose intention was determined by a commercial service, and that generate a query interpretation model using the selected misclassification candidate queries.

According to an embodiment, a method by which a computing device operated by at least one processor interprets a query comprises: receiving queries labeled with intentions interpreted by an arbitrary device; classifying the input queries into a reference query and comparison queries expected to have been misclassified, and analyzing a vocabulary pattern of the reference query; generating training data from the reference query, the comparison queries, and the vocabulary pattern; and training a query interpretation model using the training data, wherein the comparison queries are queries that appear among the input queries with a frequency at or below a predetermined value.

Generating the training data may include: generating a filtering rule that uses the semantic similarity between the reference query and the comparison queries, and selecting misclassification candidates from among the comparison queries according to the filtering rule; and comparing the vocabulary pattern of the reference query with the vocabulary patterns of the misclassification candidates to extract core vocabulary that is present in the reference query and absent from the misclassification candidates.

Selecting the misclassification candidates may include: linguistically analyzing the reference query and the comparison queries to generate an embedding vector for each; calculating the semantic similarity between the embedding vector of the reference query and the embedding vector of each comparison query; and determining, as misclassification candidates, the comparison queries whose calculated semantic similarity is at or above a predetermined value.

The filtering rule may exclude from the misclassification candidates any comparison query determined to have the same intention as the reference query.

After the misclassification candidates are determined, the method may include: receiving from an expert a judgment on whether the intention labeled on each misclassification candidate matches the actual user's intention; and, with reference to that result, selecting the misclassification candidates again when the candidates include a query whose labeled intention is the same as the actual user's intention.

Extracting the core vocabulary may extract, from the vocabulary constituting the reference query, vocabulary that is not recognized as any named entity.
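The core-vocabulary step described above can be read as a set difference with a named-entity exclusion. The following is a minimal sketch under that reading, not the patented implementation: the function name `extract_core_vocabulary`, the pre-tokenized inputs, and the sample tokens are all assumptions made for illustration.

```python
def extract_core_vocabulary(reference_tokens, candidate_token_lists, named_entities):
    """Return reference tokens that appear in no candidate and are not named entities."""
    # Collect every token that occurs in at least one misclassification candidate.
    seen_in_candidates = set()
    for tokens in candidate_token_lists:
        seen_in_candidates.update(tokens)

    core = []
    for token in reference_tokens:
        if token in named_entities:
            continue  # named entities are not treated as core vocabulary
        if token in seen_in_candidates:
            continue  # shared with a candidate, so it does not distinguish the reference
        if token not in core:
            core.append(token)  # keep first-occurrence order, no duplicates
    return core


# Hypothetical, already tokenized queries (not taken from the patent).
reference = ["play", "the", "movie", "Captain Marvel"]
candidates = [["Captain Marvel", "the", "movie"], ["show", "Captain Marvel"]]
core_vocabulary = extract_core_vocabulary(reference, candidates, {"Captain Marvel"})
print(core_vocabulary)
```

In this toy input only "play" survives, because every other reference token either occurs in a candidate or is a named entity.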

After the training, the method may further include: inputting a new query into the trained query interpretation model; determining the intention of the new query from the trained query interpretation model; and, when the intention labeled on the new query differs from the determined intention, changing the intention of the new query to the determined intention.

According to an embodiment, a computing device comprises a memory and at least one processor that executes instructions of a program loaded into the memory, wherein the program includes instructions written to perform: receiving queries labeled with intentions interpreted by an arbitrary device; classifying the input queries into a reference query and comparison queries expected to have been misclassified; generating an embedding vector for each of the reference query and the comparison queries, and calculating the semantic similarity between the embedding vector of the reference query and the embedding vector of each comparison query; selecting, as misclassification candidates, the comparison queries whose calculated semantic similarity is at or above a predetermined value; analyzing the vocabulary patterns of the reference query and the misclassification candidates to extract core vocabulary that is present in the reference query and absent from the misclassification candidates; generating training data from the reference query, the misclassification candidates, and the core vocabulary; and training a query interpretation model using the training data.

The classifying may classify as correctly classified any query, among the input queries, whose intention label performs the same function as the intention label of the reference query.

Extracting the core vocabulary may generate a semantic pattern in which vocabulary constituting the reference query is replaced with vocabulary that is synonymous with it.

According to the present invention, a small number of candidates expected to have been misclassified are efficiently extracted from among the queries already judged by a commercial service, so speed and accuracy can be improved compared with a query interpretation apparatus that targets all misclassified queries.

In addition, according to the present invention, using filtering rules together with a deep-learning-based algorithm makes it possible to interpret the various types of queries obtained from a large volume of commercial service logs.

FIG. 1 is a block diagram of a query interpretation apparatus and its surrounding environment according to an embodiment.
FIG. 2 is a flowchart of a method of selecting candidates expected to be misclassified from among input data according to an embodiment.
FIG. 3 is a flowchart of a method of operating a language analyzer according to an embodiment.
FIG. 4 is an exemplary diagram of a method of correctly interpreting the intention of a misclassified query according to an embodiment.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings so that those of ordinary skill in the art can readily implement them. The present invention may, however, be implemented in many different forms and is not limited to the embodiments described herein. In the drawings, parts unrelated to the description are omitted to describe the present invention clearly, and similar parts are given similar reference numerals throughout the specification.

Throughout the specification, when a part is said to "include" a component, this means that it may further include other components rather than excluding them, unless specifically stated otherwise.

The data used in this specification takes the form, as shown at 410 to 430 in FIG. 4, of a user's query with the query's intention, as analyzed by a commercial system, attached as a label. That is, it combines a query that a user submitted to a commercial system, such as a dialogue system or a document analysis system, with the intention of that query as interpreted by the commercial system itself. Here, the commercial system is a program or device that converses with the user and performs the requested task; it may be, for example, an artificial intelligence speaker, or it may be the query interpretation apparatus (1000) of the present invention.

In addition, a case in which the commercial system interprets the query intention differently from the actual user's intention is called "misclassification", and a case in which it interprets the query in line with the user's intention is called "correct classification".

The following describes how the query interpretation apparatus correctly identifies the intention of a misclassified query.

FIG. 1 is a block diagram of a query interpretation apparatus and its surrounding environment according to an embodiment.

Referring to FIG. 1, the query interpretation apparatus (1000) includes a training data generator (100), a trainer (200) that trains a query interpretation model (210), and a query interpreter (300) that interprets input queries using the trained query interpretation model (210).

For convenience of description they are referred to as the training data generator (100), the trainer (200), and the query interpreter (300), but they are computing devices operated by at least one processor. The training data generator (100), the trainer (200), and the query interpreter (300) may be implemented in a single computing device or distributed across separate computing devices. When distributed across separate computing devices, they may communicate with one another through a communication interface. Any device capable of executing a software program written to carry out the present invention suffices as the computing device; it may be, for example, a server or a laptop computer.

Each of the training data generator (100), the trainer (200), and the query interpreter (300) may be a single artificial intelligence model or may be implemented as a plurality of artificial intelligence models. The query interpretation model (210) may likewise be a single artificial intelligence model or a plurality of them, as may the query interpretation apparatus (1000) as a whole. Accordingly, the one or more artificial intelligence models corresponding to the components described above may be implemented by one or more computing devices.

The training data generator (100) turns part of the data input to the query interpretation apparatus (1000) into training data, and includes a misclassification candidate extractor (110) and a language analyzer (120).

The misclassification candidate extractor (110) extracts, from the large volume of input data, a small number of misclassification candidate queries for which the commercial system is expected to have misinterpreted the user's intention, and uses them as training data for the query interpretation model (210).

The misclassification candidate extractor (110) uses correctly classified queries to generate rules for filtering the large volume of data, and selects the final misclassification candidate queries after an expert judges whether the filtered candidates include any correctly classified query. The operation of the misclassification candidate extractor (110) is described in detail with reference to FIG. 2.

The language analyzer (120) linguistically analyzes the correctly classified query and the misclassified queries and extracts the core vocabulary needed for the misclassified queries to be interpreted correctly. The analysis method and the core vocabulary are described in detail with reference to FIG. 3.

The training data generator (100) obtains from the misclassification candidate extractor (110) the correctly classified query carrying the correct intention label and a small number of misclassification candidates carrying incorrect intention labels, obtains the core vocabulary from the language analyzer (120), and generates training data from them.

The trainer (200) trains the query interpretation model (210) using the generated training data. The deep learning algorithm used here may be ELMo (Embeddings from Language Models)-LSTM, which uses a pretrained language model, but is not limited to any particular algorithm.

The query interpreter (300) uses the trained query interpretation model (210) to determine whether the intention label attached to an input query is correct. That is, it determines whether the intention label attached to the input user query matches the actual user's intention and, if the label is wrong, replaces it with the intention label obtained from the query interpreter (300).

FIG. 1 assumes that queries have already passed through a commercial system's query interpretation process once and arrive with an intention label attached. However, a query that has not passed through a commercial system and therefore has no intention label may also be input to the query interpreter (300). In this case, the query interpreter (300) may newly attach the correct intention label to the query.
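The two behaviours of the query interpreter described above, correcting a wrong label and attaching a label to an unlabeled query, can be sketched as follows. This is only an illustration: `relabel`, `toy_model`, and the intention names `play_movie` and `search_info` are hypothetical, and the stub stands in for the trained query interpretation model (210).

```python
from typing import Callable, Optional, Tuple


def relabel(query: str,
            label: Optional[str],
            predict_intention: Callable[[str], str]) -> Tuple[str, str]:
    """Return (final label, action), where action is 'attached', 'corrected' or 'kept'."""
    predicted = predict_intention(query)
    if label is None:
        return predicted, "attached"   # query arrived without a label
    if label != predicted:
        return predicted, "corrected"  # commercial-system label judged wrong
    return label, "kept"               # existing label agrees with the model


def toy_model(query: str) -> str:
    # Stand-in for the trained model; a real system would run inference here.
    return "play_movie" if "play" in query.lower() else "unknown"


print(relabel("Play Captain Marvel", "search_info", toy_model))
print(relabel("Play Captain Marvel", None, toy_model))
```

Returning the action alongside the label makes it easy to log how often the commercial system's labels are being overridden.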

FIG. 2 is a flowchart of a method of selecting candidates expected to be misclassified from among input data according to an embodiment.

Referring to FIG. 2, the misclassification candidate extractor (110) classifies queries that appear frequently in the large volume of input data as reference queries and queries that appear infrequently as comparison queries (S101). For example, if among 1000 input data items a first query occurs 980 times, a second query 10 times, a third query 5 times, and a fourth query 5 times, the misclassification candidate extractor (110) takes the first query as the reference query and classifies the second, third, and fourth queries as comparison queries.
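Step S101 can be sketched with a simple frequency count. The sketch below reuses the 980/10/5/5 counts from the example; the function name `split_by_frequency`, the placeholder query strings, and the cut-off of 10 occurrences are assumptions, since the text only says the frequency threshold is predetermined.

```python
from collections import Counter


def split_by_frequency(queries, max_comparison_count):
    """Split distinct queries into reference (frequent) and comparison (rare) lists."""
    counts = Counter(queries)
    # Queries at or below the cut-off are treated as comparison queries.
    reference = [q for q, n in counts.items() if n > max_comparison_count]
    comparison = [q for q, n in counts.items() if n <= max_comparison_count]
    return reference, comparison


# 1000 log entries with the counts used in the example above.
log = ["query 1"] * 980 + ["query 2"] * 10 + ["query 3"] * 5 + ["query 4"] * 5
reference, comparison = split_by_frequency(log, max_comparison_count=10)
print(reference)   # ['query 1']
print(comparison)  # ['query 2', 'query 3', 'query 4']
```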

In this specification, a reference query is a query for which the commercial system is expected to have interpreted the query correctly and attached an intention label that matches the user's intention. That is, a reference query is a pattern that users utter frequently, and it can be treated as the standard pattern for that query.

The misclassification candidate extractor (110) extracts lexical and semantic information from the reference query and the comparison queries and generates an embedding vector for each query (S102).

For example, the misclassification candidate extractor (110) extracts the vocabulary, morphological and syntactic analysis information, named entities, information on the situation in which the query occurred, history information, and the like from the first query (the reference query) and from the second, third, and fourth queries (the comparison queries) to generate lexical and semantic information, and generates an embedding vector for each query based on it.

Specifically, the misclassification candidate extractor (110) can classify the nouns, verbs, and so on of a query through morphological analysis, and can identify the structure of the query through syntactic analysis.

In general, natural language processing proceeds in the order of morphological analysis, named entity recognition, and semantic role labeling. Morphological analysis breaks the word units of a sentence into morphemes and attaches part-of-speech tags. A named entity is a word or phrase that carries a specific meaning in a document, and named entity recognition is the process of classifying the categories of the proper nouns extracted by morphological analysis. A semantic role is the role a noun plays in relation to the predicate of a sentence, and semantic role labeling recognizes and classifies the semantic relationship between a predicate in the sentence and the arguments that the predicate governs. Sentence constituents arranged in different structures can have the same semantic roles, and phrases arranged in the same structure can have different semantic roles.

The methods used for morphological analysis, named entity recognition, and semantic role labeling are already well known. Algorithms such as head-tail segmentation, tabular parsing, the Hidden Markov Model (HMM), the Support Vector Machine (SVM), the Conditional Random Field (CRF), Maximum Entropy (ME), and the structural SVM may be used, and the choice is not limited to any one of them.

For example, for the sentence "Play Captain Marvel.", the misclassification candidate extractor (110) analyzes the morphemes as "Captain Marvel / substantive, common noun; play / predicate, verb", and through syntactic analysis identifies the sentence as the combination "substantive (common noun) + predicate (verb)".

Named entity analysis then identifies the meaning of each word or phrase. For the same sentence, "Play Captain Marvel.", the meaning of each phrase can be labeled as "Captain Marvel / movie; play / playback".

Context information is information related to the operation of the commercial system at the time the query was uttered. For example, if the user utters a query asking to play a specific movie while watching TV, the context information may be extracted as "watching TV". History information may include the user's past utterances or information on specific services to which the user has subscribed.
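The kinds of information gathered for one query in this step (morphemes, named entities, context, and history) can be pictured as a single record. The sketch below is only one possible layout: the class name `QueryFeatures` and its field names are hypothetical, and the history entry is invented for illustration. The morpheme tags, entity labels, and context value follow the "Play Captain Marvel." example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class QueryFeatures:
    """Lexical and semantic information extracted from one query before embedding."""
    text: str
    morphemes: List[Tuple[str, str]]   # (surface form, part-of-speech tag)
    entities: Dict[str, str]           # surface form -> meaning label
    context: str = ""                  # what the system was doing at utterance time
    history: List[str] = field(default_factory=list)  # past utterances or subscriptions


features = QueryFeatures(
    text="Play Captain Marvel.",
    morphemes=[("Captain Marvel", "common noun"), ("play", "verb")],
    entities={"Captain Marvel": "movie", "play": "playback"},
    context="watching TV",
    history=["subscribed to a movie service"],  # invented example entry
)
print(features.entities["Captain Marvel"])
```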

The misclassification candidate extractor (110) then generates an embedding vector for each query using the extracted information, such as morphemes, named entities, and semantic roles. An embedding vector is categorical data converted into a continuous vector.

As one example of generating a sentence embedding vector, the sentence2vec model developed by Google may be used. The sentence2vec model extends the CBOW (Continuous Bag of Words) model of word2vec to the sentence level. Table 1 shows an example in which two arbitrary queries are converted into N-dimensional vectors using the sentence2vec model.

Table 1
Query 1: (0.00011 0.09193 -1.93188 … 0.335567)
Query 2: (0.00011 0.11171 -1.31652 … 1.89287)

The misclassification candidate extractor (110) calculates the semantic similarity between the embedding vector of the reference query and the embedding vector of each comparison query (S103).

The method of calculating the semantic similarity between embedding vectors is not limited to any one method. As one example, the cosine similarity of the embedding vectors may be calculated. Cosine similarity is the cosine of the angle between two vectors in an inner product space. In general it takes a value between -1 and 1 (between 0 and 1 when neither vector has negative components), and the closer it is to 1, the more similar the two vectors are.
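As a concrete illustration of the cosine similarity mentioned above, the following is a minimal pure-Python sketch. The function name `cosine_similarity` and the short sample vectors are assumptions; real embedding vectors would have many more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

The three calls cover the same direction, orthogonal vectors, and opposite directions, which correspond to similarities of 1, 0, and -1.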

For example, the misclassification candidate extractor (110) compares the embedding vector of the first query with that of the second query to calculate their semantic similarity, and likewise calculates the semantic similarity between the embedding vectors of the first and third queries and between those of the first and fourth queries.

The misclassification candidate extractor (110) generates filtering rules using the calculated semantic similarity, the intention label of the reference query, and the like (S104). The filtering rules are used to select only a small number of the comparison queries, and more than one rule may be used. For example, a rule that extracts as misclassification candidates the comparison queries whose semantic similarity is at or above a certain value, or a rule that extracts as misclassification candidates the queries carrying a specific intention label, may be used.

As one example of a filtering rule, suppose the semantic similarity between the embedding vectors of the first and second queries is 0.83, that between the first and third queries is 0.81, and that between the first and fourth queries is 0.77. If the user or the administrator of the query interpretation apparatus (1000) has set the reference semantic similarity value to 0.8, only the second and third queries, whose values are 0.8 or higher, are selected as misclassification candidates.
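The similarity rule in this example amounts to a threshold comparison. A minimal sketch using the figures above (0.83, 0.81, 0.77, and a reference value of 0.8) is shown below; the function name `select_by_similarity` and the dictionary layout are assumptions.

```python
def select_by_similarity(similarities, threshold):
    """Keep the comparison queries whose similarity to the reference is at or above threshold."""
    return [query for query, score in similarities.items() if score >= threshold]


# Similarity of each comparison query to the first (reference) query.
similarities = {"query 2": 0.83, "query 3": 0.81, "query 4": 0.77}
candidates = select_by_similarity(similarities, threshold=0.8)
print(candidates)  # ['query 2', 'query 3']
```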

A query whose semantic similarity is below the reference value is regarded as having a different meaning. The fourth query therefore has a different intention from the first query and cannot be used as training data for the query interpretation model (210) to attach the intention label of the first query.

However, the fourth query may still be usable as training data if its semantic similarity to some other reference query is 0.8 or higher.

As another example of a filtering rule, the user or the administrator of the query interpretation apparatus (1000) may define, as a rule, cases in which different intention labels are interpreted as having the same intention, and register the rule in the query interpretation apparatus (1000).

For example, suppose the intention label attached to the first query is A and the label attached to the second query is B. If the two labels differ in form but A and B are in substance regarded as the same intention, a rule stating that A and B are equivalent can be registered. The second query, carrying intention label B, is then considered to have been interpreted in line with the actual user's intention and is excluded from the misclassification candidates.

The misclassification candidate extractor 110 extracts misclassification candidates from the plurality of comparison queries according to the generated filtering rule (S105). In the example above, among the comparison queries (the second, third, and fourth queries), the fourth query is filtered out because its semantic similarity with the first query is less than 0.8, and the second query is filtered out under the rule that the intention labels of the first and second queries are equivalent. Consequently, only the third query is extracted from the three comparison queries as a misclassification candidate.
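The two filtering rules described above can be sketched as follows. This is a minimal illustration rather than the apparatus's actual implementation: the class `ComparisonQuery`, the function `extract_misclassification_candidates`, and the labels C and D for the third and fourth queries are hypothetical names introduced here, and treating a comparison query that carries the reference label itself as correctly classified is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComparisonQuery:
    name: str
    label: str          # intention label attached by the commercial system
    similarity: float   # semantic similarity to the reference query


def extract_misclassification_candidates(reference_label, queries,
                                         threshold=0.8,
                                         equivalent_labels=frozenset()):
    """Keep queries that are similar enough to the reference query
    but whose label is neither equal nor registered as equivalent."""
    candidates = []
    for q in queries:
        if q.similarity < threshold:
            continue  # regarded as a different meaning
        same = q.label == reference_label
        registered = frozenset((reference_label, q.label)) in equivalent_labels
        if same or registered:
            continue  # regarded as correctly classified
        candidates.append(q)
    return candidates


# Values from the example in the text; labels C and D are assumed.
queries = [
    ComparisonQuery("second", "B", 0.83),
    ComparisonQuery("third", "C", 0.81),
    ComparisonQuery("fourth", "D", 0.77),
]
rules = frozenset({frozenset(("A", "B"))})  # registered rule: A and B are equivalent
result = extract_misclassification_candidates("A", queries, 0.8, rules)
print([q.name for q in result])  # ['third']
```

Storing each equivalence as an unordered pair means the rule applies regardless of which of the two labels the reference query carries.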

An expert then reviews the misclassification candidates (S106). The final training data is verified by expert judgment, but because the expert reviews only the misclassification candidates that have already passed the filter, rather than all of the input data, verification time is reduced.

The expert confirms whether each misclassification candidate actually carries an intention label different from the user's intention (S107). If all of them are misclassified, the misclassification candidates are selected as the final misclassification candidates (S108). If the misclassification candidates include a correctly classified query, the filtering rule generated in S104 is revised to reflect the expert's judgment (S109).
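The branch between S108 and S109 can be sketched as a small control-flow function. The text does not specify how the filtering rule is revised, so this sketch only reports which candidates the expert judged correctly classified. The function name `apply_expert_review` and the callback `is_misclassified` are hypothetical.

```python
def apply_expert_review(candidates, is_misclassified):
    """Return the outcome of steps S107 to S109 for a list of candidates.

    is_misclassified is a callback standing in for the expert's judgment.
    """
    correctly_classified = [c for c in candidates if not is_misclassified(c)]
    if correctly_classified:
        # S109: the rule must be revised; the revision itself is not modeled here.
        return {"final_candidates": [], "revise_rule": True,
                "correctly_classified": correctly_classified}
    # S108: every candidate is confirmed as misclassified.
    return {"final_candidates": list(candidates), "revise_rule": False,
            "correctly_classified": []}


# Hypothetical expert verdicts for illustration only.
verdicts = {"third query": True}
outcome = apply_expert_review(["third query"], lambda q: verdicts[q])
print(outcome["final_candidates"])  # ['third query']
```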

The misclassification candidate extractor 110 passes the reference query and the final misclassification candidates produced by the above process to the language analyzer 120. The following describes how the language analyzer 120 analyzes the differences in linguistic characteristics between the reference query and the final misclassification candidates in order to generate the data used to train the query interpretation model 210.

FIG. 3 is a flowchart of a method of operating a language analyzer according to an embodiment.

Referring to FIG. 3, the language analyzer 120 analyzes the vocabulary semantic pattern of the reference query (S201). A vocabulary semantic pattern is a sequence formed by dividing a sentence into vocabulary, parts of speech, and synonyms and adding semantic information. In other words, a sentence with a single meaning can be expressed in various ways, and the pattern is a grammar that defines the syntactic types of those expressions.

Therefore, once a vocabulary semantic pattern is built from the reference query, differently worded sentences that match the pattern can be interpreted as having the same meaning as the reference query.

For example, if the reference query is “Play Captain Marvel” (“캡틴마블 틀어줘”), its vocabulary semantic pattern can be analyzed as 'substantive-common noun-movie + predicate-verb-play' through morphological analysis, syntactic parsing, named entity analysis, and the like, as described with reference to FIG. 2.

In addition, using 'drama', which is in a synonym relation with 'movie', a synonym semantic pattern 'substantive-common noun-drama + predicate-verb-play' can be generated.
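As a rough illustration of the synonym substitution, a pattern can be held as a list of slots, with variants generated by swapping the semantic element for a synonym. The slot layout (category, part of speech, meaning), the English slot names, and the functions `render` and `synonym_patterns` are assumptions made for this sketch. The text does not prescribe a data structure.

```python
def render(pattern):
    """Join slots into the 'a-b-c + d-e-f' notation used in the text."""
    return " + ".join("-".join(slot) for slot in pattern)


def synonym_patterns(pattern, synonyms):
    """Generate one variant per (slot, synonym) pair, replacing only the meaning."""
    variants = []
    for i, (category, pos, meaning) in enumerate(pattern):
        for alternative in synonyms.get(meaning, ()):
            variant = list(pattern)
            variant[i] = (category, pos, alternative)
            variants.append(variant)
    return variants


# Pattern of the reference query, rendered in English for illustration.
reference_pattern = [
    ("substantive", "common noun", "movie"),
    ("predicate", "verb", "play"),
]
synonyms = {"movie": ["drama"]}  # assumed synonym table

for variant in synonym_patterns(reference_pattern, synonyms):
    print(render(variant))  # substantive-common noun-drama + predicate-verb-play
```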

The language analyzer 120 performs morphological analysis and syntactic parsing on the final misclassification candidate queries (S202). Morphological analysis separates a sentence into morphemes, the smallest units of meaning, and syntactic parsing analyzes the role each morpheme plays within the sentence, as described with reference to FIG. 2.

Based on the morphological and syntactic analysis results of the final misclassification candidates, the language analyzer 120 extracts core vocabulary candidates that are present in the reference query but absent from the final misclassification candidates (S203). The method of extracting the core vocabulary is described in detail with reference to FIG. 4.
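One possible reading of step S203 is a comparison of tagged tokens: keep the reference tokens whose tag never appears in the misclassification candidate. The sketch below borrows the tags from the FIG. 4 walk-through later in the description. The function `missing_slots` and the English token renderings are hypothetical. It does not reproduce the splitting of a compound such as "Captain Marvel" into several candidates.

```python
def missing_slots(reference, candidate):
    """Return (token, tag) pairs of the reference whose tag is absent from the candidate."""
    candidate_tags = {tag for _, tag in candidate}
    return [(token, tag) for token, tag in reference if tag not in candidate_tags]


# Tagged tokens, rendered in English for illustration.
reference = [("Captain Marvel", "Noun-content"), ("play", "Play")]
candidate = [("Marvel Captain", "Noun-none"), ("(particle)", "Post"),
             ("play", "Play"), ("wish", "Like")]

print(missing_slots(reference, candidate))  # [('Captain Marvel', 'Noun-content')]
```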

The language analyzer 120 then calculates the reliability of each extracted core vocabulary candidate (S204). The reliability may be calculated with BiLSTM-CRF, which combines a bidirectional long short-term memory (BiLSTM) network with a conditional random field (CRF), or with another pre-trained model.

BiLSTM is a model that uses two LSTMs, the deep learning model most widely used for sequential data, so that each element takes into account not only its left (forward) context but also its right (backward) context. BiLSTM performs well in natural language processing, where both the preceding and the following context must be considered.

Meanwhile, the reliability may be calculated with the types of named entities restricted on the basis of the vocabulary semantic pattern analysis of the reference query.

The language analyzer 120 then extracts the core vocabulary with the highest reliability among the core vocabulary candidates (S205).

The language analyzer 120 generates training data from the extracted core vocabulary, the reference query, and the final misclassification candidates, and passes it to the learner 200. The learner 200 trains the query interpretation model 210. The query interpretation model 210 may be implemented with the aforementioned ELMo-LSTM, but is not limited to any particular model.

Meanwhile, the language analyzer 120 may pass the vocabulary semantic pattern of the reference query to the misclassification candidate extractor 110 so that the pattern is included in the filtering rule generated by the misclassification candidate extractor 110 (S104).

FIG. 4 is an exemplary diagram of a method of correctly interpreting the intention of a misclassified query according to an embodiment.

Referring to FIG. 4, the description assumes the following situation. The query “Play Captain Marvel” (“캡틴마블 틀어줘”, 410) occurred 1000 times, and the queries “Can you find Captain Marvel” (“캡틴마블 찾아줄 수 있니”, 420) and “I wish you would play Marvel Captain” (“마블캡틴을 틀면 좋겠어”, 430) each occurred 5 times. The commercial system labeled the intention of query 410 as SearchContent, that of query 420 as FindContent, and that of query 430 as ILikeContent.

Although FIG. 4 shows the data used for training being input to the query interpreter 300 to produce a result, the training data and the data to be analyzed may of course differ.

The query interpretation apparatus 1000 classifies “Play Captain Marvel” (410), which occurred 1000 times, as the reference query, and classifies “Can you find Captain Marvel” (420) and “I wish you would play Marvel Captain” (430), which each occurred 5 times, as comparison queries. That is, the commercial system is assumed to have classified query 410 correctly and to have misclassified queries 420 and 430.
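The frequency-based split can be sketched with a simple counter over the query log. Two points are assumptions of this sketch and are not stated in this passage: the most frequent query is taken as the reference query, and the cut-off of 10 occurrences is an arbitrary value. The function name `split_by_frequency` is hypothetical.

```python
from collections import Counter


def split_by_frequency(log, max_comparison_count):
    """Pick the most frequent query as reference; rare queries become comparison queries."""
    counts = Counter(log)
    reference, _ = counts.most_common(1)[0]
    comparison = [query for query, n in counts.items()
                  if query != reference and n <= max_comparison_count]
    return reference, comparison


# Log reconstructed from the counts given in the text (English renderings).
log = (["Play Captain Marvel"] * 1000
       + ["Can you find Captain Marvel"] * 5
       + ["I wish you would play Marvel Captain"] * 5)

reference, comparison = split_by_frequency(log, max_comparison_count=10)
print(reference)    # Play Captain Marvel
print(comparison)
```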

The query interpretation apparatus 1000 extracts vocabulary and semantic information from the reference query and the comparison queries, and generates an embedding vector for each of the three queries using sentence2vec.

The query interpretation apparatus 1000 calculates the semantic similarity between the embedding vectors. As a result of the cosine similarity calculation, the semantic similarity between “Play Captain Marvel” (410) and “Can you find Captain Marvel” (420) may be calculated as 0.81, and that between “Play Captain Marvel” (410) and “I wish you would play Marvel Captain” (430) as 0.83.
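For reference, cosine similarity between two embedding vectors is their dot product divided by the product of their norms. The sketch below uses two-dimensional toy vectors chosen for easy arithmetic. They are not sentence2vec outputs and do not reproduce the 0.81 and 0.83 values in the text.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy vectors standing in for query embeddings (assumed values).
reference_vec = [1.0, 0.0]
comparison_vec = [0.8, 0.6]

similarity = cosine_similarity(reference_vec, comparison_vec)
print(round(similarity, 2))     # 0.8
print(similarity >= 0.8 - 1e-9)  # compare against the threshold with a small tolerance
```

Because floating-point rounding can leave a value marginally below the threshold, a small tolerance is used in the comparison here. Whether the apparatus does the same is not stated.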

Meanwhile, SearchContent and FindContent are both labels for a service that finds TV content, and they have the same meaning.

Accordingly, the query interpretation apparatus 1000 may generate a filtering rule from two conditions, namely that the semantic similarity with the reference query is 0.8 or greater and that SearchContent and FindContent are the same intention, and may extract misclassification candidates from the comparison queries.

Both “Can you find Captain Marvel” (420) and “I wish you would play Marvel Captain” (430) have a semantic similarity of 0.8 or greater with the reference query, and the intention label of “Play Captain Marvel” (410) and that of “Can you find Captain Marvel” (420) are regarded as the same. Therefore “I wish you would play Marvel Captain” (430) is extracted as the misclassification candidate. The intention label FindContent of “Can you find Captain Marvel” (420), which is excluded from the misclassification candidates, is left unchanged.

The expert then makes the final determination that the intention label ILikeContent of “I wish you would play Marvel Captain” (430) differs from the actual user's intention.

The query interpretation apparatus 1000 performs language analysis to extract the core vocabulary needed to change the intention label of the final misclassification candidate “I wish you would play Marvel Captain” (430) from ILikeContent to SearchContent, the intention label of the reference query.

First, the vocabulary semantic pattern of the reference query “Play Captain Marvel” (410) is analyzed. The query “Play Captain Marvel” (410) may have the vocabulary semantic patterns shown in Table 2.

Table 2
+ Noun-content + Post + Play
+ Noun-content + Noun-none + Play
+ Noun-content + Post + Play

In Table 2, Noun-content denotes a noun combined with a content name, Post is short for postposition (a particle), Play is a verb meaning playback, and Noun-none denotes a noun for which named entity analysis yields no category information.

The query interpretation apparatus 1000 analyzes the morphemes and syntax of the final misclassification candidate and may generate the semantic pattern 'Noun-none + Post + Play + Like' for “I wish you would play Marvel Captain” (430). The query interpretation apparatus 1000 then extracts core vocabulary that is present in the reference query “Play Captain Marvel” (410) but absent from the final misclassification candidate “I wish you would play Marvel Captain” (430). That is, it extracts candidates that could correspond to Noun-none, as the element an arbitrary query must contain in order to be correctly classified. In this case, three core vocabulary candidates can be extracted, as shown in Table 3.

Table 3
Captain Marvel/Noun-content
Captain/Noun-content + Marvel/Noun-none
Captain/Noun-content

In Table 3, the query interpretation apparatus 1000 calculates the reliability of the three core vocabulary candidates with a language learning embedding model and selects 'Captain/Noun-content' as the core vocabulary. The query interpretation apparatus 1000 passes the selected core vocabulary to the learner 200 to generate the query interpretation model 210.
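Selecting the core vocabulary then reduces to taking the candidate with the highest reliability. In the sketch below the reliability values are invented purely so that the outcome matches the selection stated in the text. The description gives no actual scores, and the function name `select_core_vocabulary` is hypothetical.

```python
def select_core_vocabulary(reliability_by_candidate):
    """Return the candidate with the highest reliability (step S205)."""
    if not reliability_by_candidate:
        raise ValueError("no core vocabulary candidates to choose from")
    return max(reliability_by_candidate, key=reliability_by_candidate.get)


# Candidates from Table 3 with assumed, illustrative reliability values.
reliability = {
    "Captain Marvel/Noun-content": 0.62,
    "Captain/Noun-content + Marvel/Noun-none": 0.55,
    "Captain/Noun-content": 0.91,
}

print(select_core_vocabulary(reliability))  # Captain/Noun-content
```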

Meanwhile, the query interpretation apparatus 1000 may add, to the filtering rule used in the misclassification candidate extraction process, a new rule that a misclassified query must contain 'Captain/Noun-content' in order to be correctly classified.

Through the above process, the query interpretation model 210 is trained with the reference query, the final misclassification candidate, and the core vocabulary as training data.

Using the query interpretation model 210, the query interpreter 300 changes the intention label of the query “I wish you would play Marvel Captain” to SearchContent, as shown at 430a, and consequently interprets “Play Captain Marvel” and “I wish you would play Marvel Captain” as having the same intention.
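The relabeling step can be sketched as a comparison between the logged label and the label predicted by the trained model. The dictionary-backed `predict` function below is only a stand-in for the query interpretation model 210, and `reinterpret` is a hypothetical name.

```python
def reinterpret(query, logged_label, predict):
    """Return (label, changed): the model's label and whether it replaced the logged one."""
    predicted = predict(query)
    return predicted, predicted != logged_label


# Stand-in for the trained model: a fixed lookup used for illustration only.
stub_model = {
    "Play Captain Marvel": "SearchContent",
    "I wish you would play Marvel Captain": "SearchContent",
}

label, changed = reinterpret("I wish you would play Marvel Captain",
                             "ILikeContent", stub_model.__getitem__)
print(label, changed)  # SearchContent True
```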

Thereafter, the query interpretation apparatus 1000 including the trained query interpretation model 210 may be used as part of a commercial dialogue system or document analysis service, and may output higher-performance dialogue interpretation results to the commercial log.

In addition, the output of the query interpretation apparatus 1000 may be fed back as input and used to train the query interpretation model 210.

The embodiments of the present invention described above are not implemented only through an apparatus and a method, but may also be implemented through a program that realizes functions corresponding to the configuration of the embodiments or through a recording medium on which the program is recorded.

Although the embodiments of the present invention have been described in detail above, the scope of the present invention is not limited thereto, and various modifications and improvements made by those skilled in the art using the basic concept of the present invention defined in the following claims also fall within the scope of the present invention.

Claims

A method for interpreting a query by a computing device operated by at least one processor, the method comprising:
receiving queries labeled with intentions interpreted by an arbitrary device;
classifying the received queries into a reference query and comparison queries expected to be misclassified, and analyzing a vocabulary pattern of the reference query;
generating training data from the reference query, the comparison queries, and the vocabulary pattern; and
training a query interpretation model using the training data,
wherein the comparison queries are queries that appear with a frequency less than or equal to a predetermined value among the received queries.

The method of claim 1,
wherein generating the training data comprises:
generating a filtering rule using the semantic similarity between the reference query and the comparison queries, and selecting misclassification candidates from the comparison queries according to the filtering rule; and
comparing the vocabulary pattern of the reference query with the vocabulary patterns of the misclassification candidates, and extracting core vocabulary that is present in the reference query and absent from the misclassification candidates.

The method of claim 2,
wherein selecting the misclassification candidates comprises:
generating an embedding vector for each of the reference query and the comparison queries by linguistically analyzing them;
calculating a semantic similarity between the embedding vector of the reference query and the embedding vector of each comparison query; and
determining, as misclassification candidates, comparison queries whose calculated semantic similarity is equal to or greater than a predetermined value.

The method of claim 3,
wherein the filtering rule
excludes from the misclassification candidates a comparison query judged to have the same intention as the reference query.

The method of claim 3,
further comprising, after determining the misclassification candidates:
receiving from an expert a determination of whether the intention labeled on each misclassification candidate matches the intention of the actual user; and
when the determination indicates that the misclassification candidates include a query whose labeled intention matches the actual user's intention, selecting the misclassification candidates again.

The method of claim 2,
wherein extracting the core vocabulary comprises
extracting, from among the vocabulary constituting the reference query, vocabulary that is not recognized as an arbitrary named entity.

The method of claim 1,
further comprising, after the training:
inputting a new query into the trained query interpretation model;
determining the intention of the new query with the trained query interpretation model; and
when the intention labeled on the new query differs from the determined intention, changing the intention of the new query to the determined intention.

A computing device comprising:
a memory; and
at least one processor that executes instructions of a program loaded in the memory,
wherein the program comprises instructions written to execute:
receiving queries labeled with intentions interpreted by an arbitrary device;
classifying the received queries into a reference query and comparison queries expected to be misclassified;
generating an embedding vector for each of the reference query and the comparison queries, and calculating a semantic similarity between the embedding vector of the reference query and the embedding vector of each comparison query;
selecting, as misclassification candidates, comparison queries whose calculated semantic similarity is equal to or greater than a predetermined value;
analyzing the vocabulary patterns of the reference query and the misclassification candidates to extract core vocabulary that is present in the reference query and absent from the misclassification candidates;
generating training data from the reference query, the misclassification candidates, and the core vocabulary; and
training a query interpretation model using the training data.

The computing device of claim 8,
wherein the classifying comprises
classifying, as correctly classified, a query among the received queries whose intention label performs the same function as the intention label of the reference query.

The computing device of claim 8,
wherein extracting the core vocabulary comprises
generating a semantic pattern in which vocabulary constituting the reference query is replaced with vocabulary that is in a synonym relation with it.