KR20240036280A - Method for voice memo service and apparatus therefor

Info

Publication number: KR20240036280A
Application number: KR1020220114896A
Authority: KR
Inventors: 손단영; 정요원
Original assignee: 주식회사 케이티
Priority date: 2022-09-13
Filing date: 2022-09-13
Publication date: 2024-03-20

Abstract

The present invention relates to a method for a voice memo service and a device therefor. More specifically, the method includes: obtaining information about an intent and information about a named entity for one or more first text sentences; obtaining summary text for the one or more first text sentences based on the information about the intent and the information about the named entity; determining a priority for the one or more first text sentences based on the information about the intent and the information about the named entity; and providing the summary text to a terminal device according to the determined priority. According to the present invention, voice memos are summarized and provided according to priority, so that the recipient can respond to them quickly and effectively.

Description

Method for voice memo service and device therefor {METHOD FOR VOICE MEMO SERVICE AND APPARATUS THEREFOR}

The present invention relates to a voice memo service, and more specifically, to a method for providing summary text for a voice memo based on speech recognition and a device therefor.

Voice memo services, in which a user leaves a spoken memo for another party and the system stores it and later delivers it to that party, have been widely used. A voice memo service receives and stores user voice data from various devices such as mobile phones, telephones, and voice recorders. When the other party requests it, the service either provides the stored voice data as is or converts the voice data into text data using speech recognition technology and provides the text data.

Existing voice memo services simply provide the speech recognition result as text, so the recipient can understand the content only after reading the entire text. For a person who receives many voice memos, reading these texts and understanding their content requires a great deal of time and effort. In particular, small business owners receive voice memos related to many inquiries and requests, and when the results of such a voice memo service are provided in text form, it is difficult for them to check each memo individually, accurately analyze the user's intent, and respond quickly. Therefore, a way to grasp voice memos effectively is required.

In addition, as new types of business emerge every day, related vocabulary and sentences are newly added, so it may be difficult to guarantee speech recognition performance for new vocabulary and sentences no matter how much data has been trained on in advance. Users also leave voice memos using vocabulary and sentences in variously modified forms rather than in a standardized form, which makes it difficult to secure speech recognition performance for every user's voice or utterance. Therefore, a way to improve speech recognition performance for voice memos in response to changing types of business and user preferences is required.

Korean Registered Patent No. 10-2147619

An object of the present invention is to provide a method for effectively providing summary text for a voice memo and a device therefor.

Specifically, an object of the present invention is to provide a method, and a device therefor, that classifies the user's intent and extracts named entities based on the speech recognition result in a voice memo service, generates summary text for the voice memo, and provides it according to the priority of its content, so that the content of the voice memo can be grasped effectively.

Another object of the present invention is to provide a method for improving speech recognition performance for voice memos and a device therefor.

Specifically, another object of the present invention is to provide a method, and a device therefor, that improves speech recognition performance by generating sentence patterns and new sentences from text sentences with low speech recognition confidence and using them for automatic transcription and for training a speech recognition model.

Yet another object of the present invention is to provide a method that can improve the customer response efficiency of small business owners who use a voice memo service, and a device therefor.

Specifically, yet another object of the present invention is to provide a method, and a device therefor, that offers small business owners a convenient call-based, free-speech voice memo service. For a voice memo that a customer leaves for a small business owner during a call, the method classifies the user's intent and extracts named entities (keywords) based on the speech recognition result, shows a brief summary of the memo, and displays the summaries on the screen according to the priority of their content, so that the small business owner can respond to customers efficiently.

The technical problems to be solved by the present invention are not limited to those mentioned above, and other technical problems not mentioned will be clearly understood by those of ordinary skill in the art to which the present invention pertains from the contents described in this specification.

In a first aspect of the present invention, a method for processing voice data in a server device is provided. The method may include: obtaining information about an intent and information about a named entity for one or more first text sentences; obtaining summary text for the one or more first text sentences based on the information about the intent and the information about the named entity; determining a priority for the one or more first text sentences based on the information about the intent and the information about the named entity; and providing the summary text to a terminal device according to the determined priority.

In a second aspect of the present invention, a device including a processor and a memory is provided. The memory includes instructions configured to, when executed by the processor, cause the device to implement specific operations for processing voice data. The specific operations may include: obtaining information about an intent and information about a named entity for one or more first text sentences; obtaining summary text for the one or more first text sentences based on the information about the intent and the information about the named entity; determining a priority for the one or more first text sentences based on the information about the intent and the information about the named entity; and providing the summary text to a terminal device according to the determined priority.

In a third aspect of the present invention, a computer-readable storage medium is provided that stores instructions configured to, when executed by a processor, cause a device including the processor to implement specific operations for processing voice data. The specific operations may include: obtaining information about an intent and information about a named entity for one or more first text sentences; obtaining summary text for the one or more first text sentences based on the information about the intent and the information about the named entity; determining a priority for the one or more first text sentences based on the information about the intent and the information about the named entity; and providing the summary text to a terminal device according to the determined priority.

Preferably, obtaining the summary text may include: obtaining the summary text by combining the information about the intent and the information about the named entity.
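The combination described above can be pictured with a minimal Python sketch. The intent label, the entity types, and the `" / "` separator are illustrative assumptions, not details taken from the specification; the sketch only shows one way that an intent and its named entities might be joined into a short summary line.

```python
def summarize(intent, entities):
    """Join the named-entity values and the intent label into one short line.

    `intent` is a hypothetical intent label (e.g. "reservation") and
    `entities` is a hypothetical mapping of entity type -> recognized text.
    """
    parts = list(entities.values()) + [intent]
    return " / ".join(parts)


# Hypothetical analysis result for a memo such as
# "I'd like to book a table for four tomorrow at 7 pm."
example_entities = {"date": "tomorrow", "time": "7 pm", "party_size": "4 people"}
print(summarize("reservation", example_entities))
```

A template per intent (for example, a different word order for inquiries than for reservations) would be an equally plausible design; the specification only states that the two kinds of information are combined.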

Preferably, determining the priority may include: obtaining, from a priority list, a second rank score for the information about the intent and a third rank score for the information about the named entity; and determining the priority based on a value obtained by applying a weight to each of the second rank score and the third rank score and summing the weighted scores.

Preferably, the method or the specific operations may further include: obtaining information about a business type classification for the one or more first text sentences, wherein determining the priority may include: obtaining, from a priority list, a first rank score for the information about the business type classification, a second rank score for the information about the intent, and a third rank score for the information about the named entity; and determining the priority based on a value obtained by applying a weight to each of the first rank score, the second rank score, and the third rank score and summing the weighted scores.
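As a rough illustration of the weighted sum described above, the following Python sketch looks up three rank scores in a priority list and combines them. All labels, scores, and weights are invented for the example, and the choice to treat a larger sum as a higher priority is an assumption; the text only says that the priority is determined based on the summed value.

```python
# Hypothetical priority list: one table of label -> rank score per information type.
PRIORITY_LIST = {
    "industry": {"restaurant": 3, "hair_salon": 2},
    "intent": {"reservation": 5, "inquiry": 2},
    "entity": {"today": 4, "next_week": 1},
}
# Hypothetical weights for the first, second, and third rank scores.
WEIGHTS = {"industry": 0.2, "intent": 0.5, "entity": 0.3}


def priority_value(industry, intent, entity):
    """Weighted sum of the three rank scores; labels missing from the list score 0."""
    labels = {"industry": industry, "intent": intent, "entity": entity}
    return sum(
        WEIGHTS[kind] * PRIORITY_LIST[kind].get(label, 0)
        for kind, label in labels.items()
    )


def order_memos(memos):
    """Sort memos so that the largest weighted sum comes first (an assumption)."""
    return sorted(memos, key=lambda m: priority_value(*m), reverse=True)


memos = [
    ("hair_salon", "inquiry", "next_week"),
    ("restaurant", "reservation", "today"),
]
for memo in order_memos(memos):
    print(memo, round(priority_value(*memo), 2))
```

Setting the industry weight to zero reduces this to the two-score variant that uses only the intent and named-entity scores.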

Preferably, the method or the specific operations may further include: obtaining the one or more first text sentences from the user's voice data using a first speech recognition model; based on the speech recognition confidence for the one or more first text sentences not satisfying at least one first criterion, obtaining a sentence pattern and at least one named-entity candidate word based on the information about the intent and the information about the named entity, and generating one or more second text sentences from the at least one named-entity candidate word according to the obtained sentence pattern; and based on the speech recognition confidence for the one or more second text sentences satisfying at least one second criterion, determining the second text sentence having the highest speech recognition confidence among the one or more second text sentences as the speech recognition result for the user's voice data.
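The generate-then-select flow above might be sketched in Python as follows. The slot marker `<service>`, the candidate words, the toy confidence table, and the single threshold are all hypothetical; in particular, a real system would obtain the confidence from a speech recognition model rather than from a lookup table.

```python
from typing import Callable, List, Optional


def generate_sentences(pattern: str, slot: str, candidates: List[str]) -> List[str]:
    """Fill the slot of a sentence pattern with each named-entity candidate word."""
    return [pattern.replace(slot, word) for word in candidates]


def select_result(
    sentences: List[str],
    confidence: Callable[[str], float],
    threshold: float,
) -> Optional[str]:
    """Return the highest-confidence sentence if it reaches the threshold, else None."""
    if not sentences:
        return None
    best = max(sentences, key=confidence)
    return best if confidence(best) >= threshold else None


# Toy confidence table standing in for a real recognizer's scores (assumption).
toy_scores = {"I want to reserve a perm": 0.91, "I want to reserve a firm": 0.40}
sentences = generate_sentences(
    "I want to reserve a <service>", "<service>", ["perm", "firm"]
)
print(select_result(sentences, lambda s: toy_scores.get(s, 0.0), threshold=0.8))
```

Returning `None` when no candidate clears the threshold leaves the caller free to fall back to the original recognition result or to skip further processing.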

Preferably, the method or the specific operations may further include: training, using each of the one or more second text sentences, a second speech recognition model for each of the one or more second text sentences, wherein the speech recognition confidence for each of the one or more second text sentences may be obtained based on the second speech recognition model.

Preferably, the speech recognition confidence includes a sentence-level confidence and a word-level confidence, and the speech recognition confidence for the one or more first text sentences not satisfying the at least one first criterion may include the sentence-level confidence of the one or more first text sentences being less than a first threshold, or the word-level confidence of the one or more first text sentences being less than a second threshold.

Preferably, the speech recognition confidence includes a sentence-level confidence and a word-level confidence, and the speech recognition confidence for the one or more second text sentences satisfying the at least one second criterion may include the sentence-level confidence of the one or more second text sentences being greater than or equal to a third threshold and the word-level confidence of the one or more second text sentences being greater than or equal to a fourth threshold.
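The two criteria can be written as a pair of predicates, shown here as a minimal Python sketch. The specification does not say how a single word-level confidence is derived for a sentence with several words, so taking the minimum over the words is an assumption, as are the example threshold values.

```python
def fails_first_criterion(sentence_conf, word_confs, first_threshold, second_threshold):
    """True if the sentence confidence OR the word confidence is below its threshold.

    The word-level confidence is taken as the minimum over all words
    (an assumption); an empty word list is treated as confidence 0.0.
    """
    word_conf = min(word_confs, default=0.0)
    return sentence_conf < first_threshold or word_conf < second_threshold


def meets_second_criterion(sentence_conf, word_confs, third_threshold, fourth_threshold):
    """True if the sentence confidence AND the word confidence reach their thresholds."""
    word_conf = min(word_confs, default=0.0)
    return sentence_conf >= third_threshold and word_conf >= fourth_threshold


# Hypothetical scores: the sentence as a whole looks fine, but one word is doubtful.
print(fails_first_criterion(0.85, [0.95, 0.40, 0.90], 0.7, 0.6))
print(meets_second_criterion(0.92, [0.95, 0.88], 0.9, 0.8))
```

Note the asymmetry: the first criterion fails when either confidence is low, whereas the second criterion is met only when both confidences are high.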

Preferably, the method or the specific operations may further include: based on the speech recognition confidence for the one or more second text sentences satisfying the at least one second criterion, training the first speech recognition model using the second text sentence having the highest speech recognition confidence, wherein, based on the speech recognition confidence for the one or more second text sentences not satisfying the at least one second criterion, the training of the first speech recognition model using the second text sentence having the highest speech recognition confidence may be omitted.

Preferably, the method or the specific operations may further include: based on the speech recognition confidence for the one or more first text sentences satisfying the at least one first criterion, determining the one or more first text sentences as the speech recognition result for the user's voice data.

According to the present invention, summary text for a voice memo can be provided effectively.

Specifically, according to the present invention, the voice memo service classifies the user's intent and extracts named entities based on the speech recognition result, generates summary text for the voice memo, and provides it according to the priority of its content, so that the content of the voice memo can be grasped effectively.

In addition, according to the present invention, speech recognition performance for voice memos can be improved.

Specifically, according to the present invention, speech recognition performance can be improved by generating sentence patterns and new sentences from text sentences with low speech recognition confidence and using them for automatic transcription and for training a speech recognition model.

In addition, according to the present invention, the customer response efficiency of small business owners who use a voice memo service can be improved.

Specifically, according to the present invention, to offer small business owners a convenient call-based, free-speech voice memo service, a voice memo that a customer leaves for a small business owner during a call is processed as follows: the user's intent is classified and named entities are extracted based on the speech recognition result, a brief summary of the memo is shown, and the summaries are displayed on the screen according to the priority of their content, so that the small business owner can respond to customers efficiently.

The effects obtainable from the present invention are not limited to those mentioned above, and other effects not mentioned will be clearly understood by those of ordinary skill in the art to which the present invention pertains from the contents described in this specification.

The accompanying drawings are included as part of the detailed description to aid understanding of the present invention and, together with the detailed description, explain embodiments and technical features of the present invention.
Figure 1 illustrates a functional block diagram of a voice memo service system according to the proposed method of the present invention.
Figure 2 illustrates a flowchart of the proposed method of the present invention.
Figure 3 illustrates obtaining summary text according to the proposed method of the present invention.
Figure 4 illustrates voice memo output according to the proposed method of the present invention.
Figure 5 illustrates a method of building an intent classification model according to the proposed method of the present invention.
Figure 6 illustrates the intent classification results according to the proposed method of the present invention.
Figure 7 illustrates a priority list according to the proposed method of the present invention.
Figure 8 illustrates priority determination according to the proposed method of the present invention.
Figure 9 illustrates a flowchart of the proposed method of the present invention.
Figure 10 illustrates generating a sentence pattern and a new sentence according to the proposed method of the present invention.
Figure 11 illustrates creating a small speech recognition model and generating a new sentence according to the proposed method of the present invention.
Figure 12 illustrates a device to which the proposed method of the present invention can be applied.

The present invention can be modified in various ways and can have various embodiments. Specific embodiments are described in detail below with reference to the accompanying drawings.

The following embodiments are provided to aid a comprehensive understanding of the methods, devices, and/or systems described in this specification. However, they are only examples, and the present invention is not limited to them.

In describing the embodiments of the present invention, a detailed description of known technology related to the present invention is omitted where it is judged that it would unnecessarily obscure the gist of the present invention. The terms described below are defined in consideration of their functions in the present invention and may vary depending on the intention or custom of users or operators. Their definitions should therefore be based on the contents of this specification as a whole. The terminology used in the detailed description is only for describing embodiments of the present invention and should in no way be limiting. Unless clearly used otherwise, singular expressions include the plural meaning. In this description, expressions such as “including” or “having” are intended to indicate certain features, numbers, steps, operations, elements, or parts or combinations thereof, and should not be construed to exclude the presence or possibility of one or more other features, numbers, steps, operations, elements, or parts or combinations thereof beyond those described.

In addition, terms such as first and second may be used to describe various components, but the components are not limited by these terms. The terms are used only to distinguish one component from another.

Artificial Intelligence (AI)

Artificial intelligence refers to implementing artificial intelligence on a computer device based on a mathematical model. Machine learning is a field of artificial intelligence that enables a computer device to solve a specific problem through learning, without explicit programming.

An artificial neural network is a model used in machine learning and refers to a model composed of artificial neurons that form a network through synaptic connections. An artificial neural network may be abbreviated as a neural network. An artificial neural network may include an input layer, an output layer, and optionally one or more hidden layers. Each layer includes one or more neurons, and neurons in different layers may be connected through synapses. A neuron in the input layer may take a feature vector used for learning as its input signal, and a neuron in the output layer may output the value of an activation function applied to the input signals received through its synapses, the synaptic weights, and the bias. A neuron in a hidden layer may take as its input signal a value computed from the signals of the neurons in the input layer or the previous hidden layer and the synaptic weights, and it provides a signal to the next hidden layer or the output layer.
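The neuron computation described above, an activation function applied to the weighted inputs plus a bias, can be illustrated with a short Python sketch. The sigmoid activation and the specific numbers are assumptions chosen for the example; the text does not name a particular activation function.

```python
import math


def sigmoid(z):
    """Logistic activation function (one common choice, assumed here)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(inputs, weights, bias, activation=sigmoid):
    """Activation of the weighted sum of the input signals plus the bias."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)


def layer_output(inputs, weight_rows, biases, activation=sigmoid):
    """Outputs of one layer: each neuron has its own weight row and bias."""
    return [
        neuron_output(inputs, row, b, activation)
        for row, b in zip(weight_rows, biases)
    ]


# A feature vector passes through a hidden layer of two neurons,
# whose signals then feed a single output neuron.
features = [1.0, 2.0]
hidden = layer_output(features, [[0.5, -0.25], [0.1, 0.3]], [0.0, -0.2])
print(neuron_output(hidden, [1.0, -1.0], 0.0))
```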

A model parameter refers to a parameter determined through learning and may include synaptic weights and neuron biases. A hyperparameter refers to a parameter that must be set before learning in a machine learning algorithm and includes the learning rate, the number of iterations, the mini-batch size, the initialization function, and the like. The goal of training an artificial neural network can be seen as determining the model parameters that minimize a loss function. The loss function can be used as an indicator for determining the optimal model parameters during the training of an artificial neural network.
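To make the distinction between model parameters and hyperparameters concrete, the following Python sketch fits a one-input linear model by gradient descent on a mean squared error loss. Gradient descent, the linear model, and the toy data are assumptions made for illustration; the text does not prescribe a particular optimizer or loss. Here `w` and `b` are model parameters determined by learning, while `learning_rate` and `iterations` are hyperparameters fixed beforehand.

```python
def mse(xs, ys, w, b):
    """Mean squared error loss of the linear model y = w * x + b."""
    return sum(((w * x + b) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def train_linear(xs, ys, learning_rate=0.05, iterations=2000):
    """Gradient descent on the MSE loss; returns the learned (w, b)."""
    w, b = 0.0, 0.0  # model parameters, updated by learning
    n = len(xs)
    for _ in range(iterations):
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Toy data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w, b = train_linear(xs, ys)
print(round(w, 3), round(b, 3), mse(xs, ys, w, b))
```

With too large a learning rate the updates would diverge instead of reducing the loss, which is one reason the learning rate has to be chosen before training.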

Machine learning can be classified by learning method into supervised learning, unsupervised learning, and reinforcement learning. Supervised learning refers to a method of training an artificial neural network when labels for the training data are given, where a label may mean the correct answer (or result value) that the artificial neural network must infer when the training data is input to it. Unsupervised learning may refer to a method of training an artificial neural network when no labels for the training data are given. Reinforcement learning may refer to a learning method in which an agent defined within an environment is trained to select the action or sequence of actions that maximizes the cumulative reward in each state.

Among artificial neural networks, machine learning implemented with a deep neural network that includes a plurality of hidden layers may be referred to as deep learning, and deep learning is a field of machine learning.

In this specification, the terms “learning” and “machine learning” may cover not only machine learning and deep learning but also learning based on other artificial intelligence algorithms and rule-based learning. Likewise, “AI”, “learning model”, and “machine learning model” may cover not only artificial neural network models trained by machine learning or deep learning but also artificial intelligence models trained with other artificial intelligence algorithms and rule-based learning models.

Proposed method of the present invention

The present invention proposes a method that processes voice memos left by various users (e.g., customers) to classify user intent, summarizes the voice memos, and provides the summaries and/or the voice memos according to priority, so that the user who receives the voice memos (e.g., a business operator) can effectively understand and respond to them. For example, the proposed method can be used in services such as an artificial intelligence (AI) based contact center or call assistant. Even when a business operator such as a small business owner cannot take calls from various customers right away and therefore receives a large volume of voice memos, the proposed method automatically summarizes the voice memos and provides the summaries to the business operator according to priority, so that the operator can respond efficiently without missing a single customer.

The present invention also proposes a method that uses the intent classification and summary results for voice memos to generate sentence patterns, generates sentences according to those patterns, and improves speech recognition performance based on the generated sentences. Specifically, intent classification and named entity recognition are performed on the speech recognition result for a voice memo, and sentences derived from sentence patterns using the intent classification and named entity recognition results are used as transcription data for speech recognition and as training data for a speech recognition model, thereby improving automatic transcription and speech recognition performance. For example, the proposed method can be used not only in services such as an AI-based contact center or call assistant but also in other services provided on the basis of speech recognition. It makes it easier to secure speech recognition performance even when new types of small business are continuously created and added and user utterances vary widely, so that free-speech recognition is difficult for some types of business.

Speech recognition may be performed using a machine learning model trained on the basis of artificial intelligence (e.g., a deep neural network); in this specification, the machine learning model for speech recognition is referred to as a speech recognition model. A user's speech data may be converted into text data or sentences using the speech recognition model. For example, the speech recognition model may include an acoustic model (AM), which is trained by statistically modeling the acoustic characteristics of speech data, and/or a language model (LM), which is trained to reflect the characteristics of the language expressions used in a specific field. Speech recognition may also be referred to by other terms such as speech-to-text (STT) or automatic speech recognition (ASR).

FIG. 1 illustrates a functional block diagram of a voice memo service system 100 according to the proposed method of the present invention. For example, the voice memo service system 100 may be implemented in a device 1200 that provides a service such as an AI contact center or an AI call assistant. The example of FIG. 1 does not limit the present invention and may be modified, for example by deleting or altering some blocks or by adding new blocks.

The voice memo service system 100 applies speech recognition 102 to a voice memo and performs intent classification 110 on the resulting (text) sentence to determine with what intent the user uttered the voice memo. Intent classification 110 may be performed using an intent classification model based on, for example, an artificial neural network (e.g., a Convolutional Neural Network (CNN) or a Recurrent Neural Network (RNN)) or a language model (e.g., BERT (Bidirectional Encoder Representations from Transformers) or ALBERT (A Lite BERT)).

In addition, the voice memo service system 100 performs named entity recognition 112 on the speech-recognized (text) sentence to extract sentence keywords, and generates or extracts a summary text through sentence summarization 114. The term named entity may be used interchangeably with terms such as keyword, key term, and main vocabulary. Named entity recognition 112 may be built by designing the entities, building a thesaurus for each entity based on the knowledge model 120, and training on utterance sentences tagged with named entities. Named entity recognition 112 may be performed based on, for example, a Hidden Markov Model (HMM), a Support Vector Machine (SVM), a model using a Bi-directional Long Short-Term Memory (Bi-LSTM) with Conditional Random Fields (CRF), or a language model (e.g., KoBERT (Korean BERT), BERT, or ALBERT). Sentence summarization 114 may be performed, for example, by combining the intent classification result and the named entity recognition result, or by using a text summarization or generation technique based on attention (or an attention mechanism).

The voice memo service system 100 determines a priority (116) for the results generated or extracted in this way and manages the voice memo output (118). For example, as described with reference to FIGS. 7 and 8, priority determination 116 may obtain an industry priority score, an intent priority score, and a named entity priority score from a priority list 700, calculate a final priority score from them, and then determine the priority. For example, the voice memo output 118 may be managed in the form of a list and may include at least one of a field 410 indicating the priority, a field 420 indicating whether processing is complete, a field 430 indicating the task to be handled, and a field 440 indicating the voice memo. The voice memo service system 100 delivers the prioritized voice memo output to a terminal and synchronizes with the terminal so that the output can be displayed on the terminal screen.

The voice memo service system 100 may build a knowledge model (120) and use it to build thesauri for intent classification 110 and named entity recognition 112 and to generate new candidate sentences (108) (e.g., see the descriptions related to FIGS. 5 and 9). For example, the voice memo service system 100 may build the knowledge model (120) by extending an ontology-based open small-business model.

Furthermore, to select speech/text sentences for automatic training of the speech recognition model, the voice memo service system 100 may verify the speech recognition confidence (104), generate new sentences (108), train a machine learning model for the new sentences (106), and collect valid speech/text sentences after confidence verification of the new sentences (108) (e.g., see FIG. 9 and the related description). For example, the voice memo service system 100 may extract, from the knowledge model 120, a group of candidate words for new sentence generation 108 based on the relevant industry, intent, and named entity. Because the voice memo service system 100 allows the speech recognition model to be trained automatically on the verified and collected sentence/text data, speech recognition performance can be improved and transcription costs can be reduced.

FIG. 2 illustrates a flowchart for providing a voice memo service according to the proposed method of the present invention. The method illustrated in FIG. 2 may be performed by the device 1200. The example of FIG. 2 is only an example and may be modified so that some elements of FIG. 2 are deleted or altered or new elements not shown in FIG. 2 are added.

Referring to FIG. 2, the device 1200 may obtain information about intent and information about named entities for one or more text sentences (S202). As one example, the device 1200 may perform speech recognition 102 on a user's utterance or speech data (e.g., a voice memo) using the speech recognition model to obtain one or more text sentences; for convenience, a text sentence that the device 1200 obtains through speech recognition 102 may be referred to as a first text sentence. Also for convenience of description, information about the industry, information about the intent, and information about the named entity may be abbreviated as industry information, intent information, and named entity information, respectively.

For example, in S202, the device 1200 may obtain intent information for a first text sentence through intent classification 110 and named entity information through named entity recognition 112. As a more specific example, assuming that the device 1200 has obtained or received a plurality of first text sentences, the device 1200 may perform intent classification 110 and named entity recognition 112 on each of the first text sentences to obtain its intent information and named entity information. Additionally, the device 1200 may obtain industry information for the first text sentence through intent classification 110.

The device 1200 may obtain a summary text for the one or more first text sentences based on the intent information and the named entity information obtained in S202 (S204). For example, the device 1200 may obtain the summary text by performing sentence summarization 114, either by combining the intent classification 110 result (or intent information) and the named entity recognition 112 result (or named entity information) obtained for the first text sentence in S202, or by attention-based text summarization or generation. As a more specific example, assuming that the device 1200 has obtained or received a plurality of first text sentences in S202, the device 1200 may perform sentence summarization 114 on each first text sentence using the intent information and named entity information obtained for it, thereby obtaining a summary text for that sentence. Additionally, the device 1200 may use industry information in addition to the intent information and named entity information to obtain the summary text.

In addition, the device 1200 may determine a priority for the one or more first text sentences based on the intent information and the named entity information obtained in S202 (S206) (e.g., see FIGS. 7 and 8 and the description related to Equation 1). For example, the device 1200 may determine the priority of the one or more first text sentences through priority determination 116 based on the intent information and the named entity information. As a more specific example, assuming that the device 1200 has obtained or received a plurality of first text sentences in S202, the device 1200 may perform priority determination 116 on each first text sentence using the intent information and named entity information obtained for it, obtain a priority score, and determine the priority of each first text sentence according to that score.

The device 1200 may then provide the summary text obtained in S204 to a terminal device according to the priority determined in S206 (S208). For example, the device 1200 synchronizes with the terminal device through voice memo output management 118 so that the prioritized first text output is displayed on the screen of the terminal device exactly as it is managed in the device 1200.
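The flow of S202 to S208 can be sketched as follows. This is only an illustrative outline, not the implementation of the device 1200: the names `MemoResult` and `process_memos` and the three callables passed in are hypothetical, and the sketch assumes that a lower score means a higher priority, which is only one of the two conventions the description allows.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Entities = List[str]


@dataclass
class MemoResult:
    sentence: str   # first text sentence (speech recognition output)
    summary: str    # summary text obtained in S204
    score: float    # priority score obtained in S206


def process_memos(
    sentences: List[str],
    analyze: Callable[[str], Tuple[str, Entities]],   # stands in for 110 and 112
    summarize: Callable[[str, Entities], str],        # stands in for 114
    score: Callable[[str, Entities], float],          # stands in for 116
) -> List[str]:
    """Return summary texts ordered by priority (hypothetical helper)."""
    results = []
    for sentence in sentences:
        intent, entities = analyze(sentence)                      # S202
        summary = summarize(intent, entities)                     # S204
        results.append(MemoResult(sentence, summary,
                                  score(intent, entities)))       # S206
    # Assumption: a lower score means a higher priority.
    results.sort(key=lambda r: r.score)
    return [r.summary for r in results]                           # S208
```

Passing the classifier, summarizer, and scorer in as callables keeps the sketch independent of any particular model choice mentioned in the description.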

FIG. 3 illustrates obtaining a summary text according to the proposed method of the present invention. The example of FIG. 3 is only an illustration to aid understanding of the present invention and does not limit it.

The device 1200 may obtain or receive one or more pieces of first text data 310 through speech recognition 102, perform intent classification 110 on the first text data 310 to obtain intent information 320, and perform named entity recognition 112 to obtain named entity information 330 (e.g., see S202 of FIG. 2). In the example of FIG. 3, the intent information 320 may include phone purchase (폰구매), contact request (연락부탁), seat reservation (자리예약), takeout order (포장주문), and price inquiry (가격문의), and the named entity information 330 may include [케이티](Company), [아이폰십삼](Phone_Model), [구매](Action), [공일공이오이오팔칠공](Phone_No), [유월십오일](date), [창가쪽 자리](Location), [네명](Person_No), [예약](Order_Reservation), [짬뽕](Item), [한 개](Item_No), [짜장면](Item), [포장주문](Action), and [이만원](Price_Inquiry). In English, these surface forms read "KT", "iPhone thirteen", "purchase", "zero one zero two five two five eight seven zero", "June fifteenth", "window seat", "four people", "reservation", "jjamppong", "one", "jajangmyeon", "takeout order", and "twenty thousand won", respectively.

In addition, to improve the readability of the speech recognition result 310, which is written entirely in Hangul, the device 1200 may convert Hangul text into digits, Latin letters, and special symbols, for example by using inverse text normalization, and obtain the final speech recognition result 340. In the example of FIG. 3, the conversions 케이티 -> KT, 아이폰 -> Iphone, 십삼 -> 13, 공일공이오이오팔칠공 -> 010-252-5870, 유월 -> 6월, 십오일 -> 15일, 네명 -> 4명, 한 개 -> 1개, and 이만원 -> 2만원 may be applied, and a question mark (?) may be added. Before the final recognition result 340 is output, complementary candidate sentences are generated with reference to the intent classification result 320 and the named entity recognition result 330, the speech recognition result sentence 310 is compared with the complementary candidate sentences, and the sentence with the higher confidence is provided as the final recognition result 340. The sentence summary result 350 may also be corrected accordingly. This approach may also be used in the proposed method described with reference to FIGS. 9 and 10. For example, for reservations, orders, contact requests, and other cases in which the user's requirement is clear, the sentence summary result 350 may be provided by combining the intent classification result 320 and the named entity recognition result 330; in other cases, an attention (or attention mechanism)-based text summarization or generation technique may be used.
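The conversion of Hangul text into digits and Latin letters described above can be illustrated with a very small rule-based sketch. It is not the inverse text normalization method of the invention: the tables `DIGITS` and `TERMS`, the function names, and the `min_len` cutoff are assumptions made for illustration, only digit-by-digit readings (such as phone numbers) are handled, and positional numerals such as 십삼 (13) or phone-number hyphenation would need additional rules.

```python
# Spoken Sino-Korean digits read one by one, as in a phone number.
DIGITS = {"공": "0", "영": "0", "일": "1", "이": "2", "삼": "3", "사": "4",
          "오": "5", "육": "6", "칠": "7", "팔": "8", "구": "9"}

# Fixed lexical replacements taken from the FIG. 3 example.
TERMS = {"케이티": "KT", "아이폰": "Iphone"}


def spoken_digits_to_number(token: str, min_len: int = 3) -> str:
    """Convert a token that consists only of spoken digits.

    min_len is a hypothetical guard so that short ordinary words that
    happen to look like a digit (e.g. the particle "이") are left alone.
    """
    if len(token) >= min_len and all(ch in DIGITS for ch in token):
        return "".join(DIGITS[ch] for ch in token)
    return token


def normalize(sentence: str) -> str:
    """Apply the lexical table, then the digit rule, token by token."""
    out = []
    for token in sentence.split():
        token = TERMS.get(token, token)
        out.append(spoken_digits_to_number(token))
    return " ".join(out)
```

For instance, `normalize("케이티 아이폰 공일공이오이오팔칠공")` yields the tokens `KT`, `Iphone`, and a ten-digit string; a production system would more likely rely on a trained or grammar-based normalizer than on a lookup table.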

The device 1200 may perform sentence summarization 114 on the one or more first text sentences (or the first text sentences after conversion into digits, Latin letters, and special symbols) based on the intent information 320 and the named entity information 330 to obtain the summary text 350 (e.g., see S204 of FIG. 2). In the example of FIG. 3, for the first text sentence 310 “케이티 죠 아이폰 십삼 구매하고 싶은데 공일공이오이오팔칠공으로 전화부탁드립니다” (roughly, "This is KT, right? I would like to buy an iPhone 13, so please call me at 010-252-5870"), the summary text 350 “[폰구매] Iphone13 [연락부탁] 010-252-5870” ("[phone purchase] Iphone13 [contact request] 010-252-5870") may be obtained.
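One way to read the combination-based summary in this example is that each intent label is followed by the entity values relevant to it. The sketch below reproduces the FIG. 3 summary under that reading; the mapping `INTENT_SLOTS` and the function `combine_summary` are hypothetical and are not specified in the description.

```python
from typing import Dict, List, Tuple

# Hypothetical mapping: which entity types are shown after each intent label.
INTENT_SLOTS: Dict[str, List[str]] = {
    "폰구매": ["Phone_Model"],    # phone purchase
    "연락부탁": ["Phone_No"],     # contact request
}


def combine_summary(intents: List[str],
                    entities: List[Tuple[str, str]]) -> str:
    """Build a summary such as "[intent] value [intent] value".

    entities is a list of (normalized surface text, entity type) pairs.
    """
    parts = []
    for intent in intents:
        wanted = INTENT_SLOTS.get(intent, [])
        values = [text for text, etype in entities if etype in wanted]
        label = f"[{intent}]"
        parts.append(label + " " + " ".join(values) if values else label)
    return " ".join(parts)
```

With the intents 폰구매 and 연락부탁 and the entities ("Iphone13", Phone_Model) and ("010-252-5870", Phone_No), the function returns the summary text shown in the FIG. 3 example.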

FIG. 4 illustrates a voice memo output according to the proposed method of the present invention. The example of FIG. 4 is only an illustration to aid understanding of the present invention and does not limit it.

Referring to FIG. 4, the device 1200 may determine a priority 410 for the one or more first text sentences 310 (or the first text sentences 340 after conversion into digits, Latin letters, and special symbols) through priority determination 116 based on the intent information 320 and the named entity information 330 (e.g., see S206 of FIG. 2, FIGS. 7 and 8, and the description related to Equation 1). The device 1200 synchronizes with the terminal device through voice memo output management 118 so that the voice memo output can be displayed on the screen of the terminal device. For example, the voice memo output 118 may be managed in the form of a list and may include at least one of a field 410 indicating the priority, a field 420 indicating whether processing is complete, a field 430 indicating the task to be handled, and a field 440 indicating the voice memo.

For example, the field 430 indicating the task to be handled may contain the summary text 350, and the field 440 indicating the voice memo may contain the first text sentence 340 produced by the final speech recognition processing (or the first text sentence after conversion into digits, Latin letters, and special symbols). The field 420 indicating whether processing is complete may be set to indicate completion when the terminal user handles the task, or in response to the terminal user's input. The voice memo output may be sorted based on the field 410 indicating the priority. When the field 420 is set to indicate that processing is complete, the device 1200 may adjust the priority of that task (to the lowest) or delete all of its fields (410, 420, 430, 440) from the list.
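The list with fields 410 to 440 and the handling of completed items can be sketched as a small data structure. The class `MemoEntry` and the function `ordered` are hypothetical names, and the sketch assumes that a smaller priority value is more urgent.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MemoEntry:
    priority: float   # field 410: priority
    done: bool        # field 420: whether processing is complete
    todo: str         # field 430: task to be handled (summary text 350)
    memo: str         # field 440: voice memo text (recognition result 340)


def ordered(entries: List[MemoEntry], drop_done: bool = False) -> List[MemoEntry]:
    """Return the entries in display order.

    Completed entries are either removed (drop_done=True) or moved to
    the bottom; the remaining entries are sorted by field 410.
    Assumption: a smaller priority value is more urgent.
    """
    if drop_done:
        entries = [e for e in entries if not e.done]
    # False sorts before True, so unfinished entries come first.
    return sorted(entries, key=lambda e: (e.done, e.priority))
```

Sorting on the pair `(done, priority)` implements the "lowest priority for completed tasks" option without overwriting the stored priority value, so an entry can return to its original position if its completion flag is cleared.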

If the user selects a phone number that requires a callback within the voice memo output, a call can be connected automatically. Simple matters such as a request for parking information may be configured so that the user sets a handling method in advance, for example through a pre-provided administrator menu, and the matter is then handled immediately according to that setting.

FIG. 5 illustrates a method of building an intent classification model according to the proposed method of the present invention. The example of FIG. 5 is only an example and may be modified so that some elements of FIG. 5 are deleted or altered or new elements not shown in FIG. 5 are added.

As described with reference to FIG. 1, intent classification 110 may be performed using an intent classification model based on an artificial neural network (e.g., CNN or RNN) or a language model (e.g., BERT or ALBERT). To build a corpus for intent classification 110, the device 1200 collects sentences from voice memo transcriptions, speech recognition results, and the like (S502). Initially, the device 1200 may classify the sentences automatically (S504) to divide them into categories and then manually label the industry/intent for each category (S508). The device 1200 may continuously collect sentences labeled with industry and intent (S510) and use them for classification training (S512). The device 1200 may then apply the result of the classification training (S512) to classify industry/intent again (S504), automatically label the unlabeled sentences that have been classified appropriately (S506), and send those that need adjustment back to manual labeling (S508). As this process is repeated, intent classification 110 gradually becomes more fine-grained and more precise.
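One pass of the split between automatic labeling (S506) and manual labeling (S508) can be sketched as below. The description does not say how a sentence is judged to be "classified appropriately"; the confidence threshold used here is an assumption, and `triage` and `classify` are hypothetical names.

```python
from typing import Callable, List, Tuple


def triage(
    sentences: List[str],
    classify: Callable[[str], Tuple[str, float]],  # returns (label, confidence)
    threshold: float = 0.9,                        # assumed cutoff, not from the source
) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Split unlabeled sentences into auto-labeled pairs and a manual queue."""
    auto_labeled: List[Tuple[str, str]] = []   # goes to S506
    manual_queue: List[str] = []               # goes to S508
    for sentence in sentences:
        label, confidence = classify(sentence)  # S504 with the current model
        if confidence >= threshold:
            auto_labeled.append((sentence, label))
        else:
            manual_queue.append(sentence)
    return auto_labeled, manual_queue
```

After the manual queue has been labeled, both sets would be added to the labeled corpus (S510) and the classifier retrained (S512) before the next pass.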

FIG. 6 illustrates a result of intent classification 110 according to the proposed method of the present invention. The example of FIG. 6 is only an illustration to aid understanding of the present invention and does not limit it.

Referring to FIG. 6, the device 1200 may divide the collected sentences into categories (612, 614, 616, 618) according to industry and/or intent and manage the collected sentences (622, 624, 626, 628) by labeling the industry and/or intent for each category (612, 614, 616, 618). Initially, the device 1200 may classify the sentences automatically and manage them only by category (612, 614, 616, 618), without industry or intent results, and later label the industry and intent for each category (612, 614, 616, 618) automatically or manually. Sentences that are not classified appropriately can be moved between categories (612, 614, 616, 618), and the sentences finally stored (622, 624, 626, 628) are automatically labeled with their current classification. The device 1200 may also provide a separate management screen for labeling each sentence (622, 624, 626, 628) individually.

FIG. 7 illustrates a priority list 700 according to the proposed method of the present invention. The example of FIG. 7 is only an illustration to aid understanding of the present invention and does not limit it. For example, although FIG. 7 shows 19 industries, 37 intent classes, and 13 named entity classes, the list may be modified to include different numbers and types of industries, intents, and named entities, or to exclude some of them. The priority list 700 may be managed in the form of a list or a table and may also be referred to as a priority score list or table.

Referring to FIG. 7, the priority list 700 may include at least one of a first list 710 for industry priority (or priority score), a second list 720 for intent priority (or priority score), and a third list 730 for named entity priority (or priority score). Additionally or independently, a weight may be set for each of the first list 710, the second list 720, and the third list 730. In the example of FIG. 7, the weights are set to 0.01 for the first list 710, 1 for the second list 720, and 0.1 for the third list 730, but the present invention is not limited to these weights, and they may be set to various values. For convenience of description, the industry priority score may be referred to as a first priority score, the intent priority score as a second priority score, and the named entity priority score as a third priority score.

The device 1200 provides the lists (700, 710, 720, 730) to the user through a user interface and allows the user to add or change industries, intents, and named entities and to adjust the priority scores and/or weights. The lists (700, 710, 720, 730) may be continuously extended and refined as the service expands, usage grows, and utterance patterns increase. In addition, the weights may be set independently for industry, intent, and named entity, and the priority scores and weights may be used in priority determination 116 (e.g., see S206 of FIG. 2).

FIG. 8 illustrates priority determination 116 according to the proposed method of the present invention. The example of FIG. 8 is only an illustration to aid understanding of the present invention and does not limit it. For example, the order of industry classification (S804), intent classification (S806), and named entity recognition (S808) may be changed in various ways.

Referring to FIG. 8, for the speech recognition results (or first text sentences) of three voice memos (S802), the device 1200 may obtain an industry classification result (or information about the industry classification) and obtain an industry priority score by referring to the first list 710 for industry priority (or priority score) (S804). For the speech recognition results (or first text sentences) (S802), the device 1200 may also obtain an intent classification result (or intent information) and obtain an intent priority score by referring to the second list 720 for intent priority (or priority score) (S806), and may obtain a named entity recognition result (or named entity information) and obtain a named entity priority score by referring to the third list 730 for named entity priority (or priority score) (S808) (e.g., see S202 of FIG. 2).

As described with reference to FIG. 3, the device 1200 may convert digits, Latin letters, and special symbols to obtain the final speech recognition result (S810) and obtain a summary result (or summary text) through sentence summarization 114 (S812) (e.g., see S204 of FIG. 2). The device 1200 may calculate a final priority score based on the industry priority score, the intent priority score, and the named entity priority score (S814) and determine the priority (S816). As one example, a lower priority score may correspond to a higher priority (S816). As another example, a higher priority score may correspond to a higher priority (S816).

For example, the device 1200 may obtain the final priority score based on Equation 1 (S814). Specifically, the device 1200 may apply the weights set in the respective lists 710, 720, and 730 (e.g., 0.01, 1, and 0.1) to the industry priority score (or first priority score) obtained in S804, the intent priority score (or second priority score) obtained in S806, and the entity name priority score (or third priority score) obtained in S808, determine the sum of the weighted scores as the final priority score in accordance with Equation 1 (S814), and determine the priority of the voice-recognized text sentence (or first text sentence) (S802) based on the final priority score (S816).

[Equation 1]

Final priority = (industry priority score)*(industry weight) + (intent priority score)*(intent weight) + (entity name priority score)*(entity name weight)
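The weighted sum of Equation 1 and the subsequent ordering (S814, S816) can be sketched as follows. This is a minimal illustration rather than the implementation of the device 1200: the weights are the example values mentioned above (0.01, 1, 0.1), while the names `MemoScores`, `final_priority_score`, and `rank_memos` and the `lower_is_higher` switch are assumptions introduced only for this sketch.

```python
from dataclasses import dataclass

# Example weights quoted in the description for lists 710, 720, and 730.
WEIGHTS = {"industry": 0.01, "intent": 1.0, "entity": 0.1}


@dataclass
class MemoScores:
    """Per-memo priority scores (hypothetical container for S804 to S808)."""
    memo_id: str
    industry: float  # industry priority score (S804)
    intent: float    # intent priority score (S806)
    entity: float    # entity name priority score (S808)


def final_priority_score(scores, weights=WEIGHTS):
    """Equation 1: weighted sum of the three priority scores (S814)."""
    return (scores.industry * weights["industry"]
            + scores.intent * weights["intent"]
            + scores.entity * weights["entity"])


def rank_memos(memos, lower_is_higher=True):
    """Order memos by final score (S816).

    lower_is_higher=True treats a lower score as a higher priority;
    False covers the alternative in which a higher score ranks first.
    """
    return sorted(memos, key=final_priority_score, reverse=not lower_is_higher)
```

With these example weights the intent score dominates the ordering, and the entity name and industry scores mainly act as tie-breakers.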

When a plurality of entity names appear in one text sentence (or first text sentence) (S802), the device 1200 may determine the final priority by applying the (entity name) weight only to the entity name having the highest priority. The purpose is to raise the rank of sentences that contain important keywords such as a date or a time. It is also possible to raise the weight of the actual word corresponding to each entity name in the sentence. For example, in the example of FIG. 7, entity names such as Date and Time have a priority of 2, but if the actual word associated with them is “오늘” (today) or “지금” (now), the priority may be raised by adjusting it to 1.
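The handling of multiple entity names can be sketched as below, under the convention that a smaller number means a higher priority. Only the priority of 2 for Date and Time and the adjustment to 1 for “오늘” and “지금” come from the description; the `Item` entry, the default value 9, and the function names are hypothetical and serve only to make the sketch runnable.

```python
# Entity name priorities (smaller number = higher priority).
# Date/Time = 2 follows the FIG. 7 example; "Item": 3 is a hypothetical entry.
ENTITY_PRIORITY = {"Date": 2, "Time": 2, "Item": 3}

# Words that raise a Date/Time entity to priority 1, per the description.
URGENT_WORDS = {"오늘", "지금"}
BOOSTED_PRIORITY = 1


def entity_priority(label, word, default=9):
    """Priority of one recognized entity, boosted when its word is urgent."""
    base = ENTITY_PRIORITY.get(label, default)  # default is an assumption
    if label in ("Date", "Time") and word in URGENT_WORDS:
        return min(base, BOOSTED_PRIORITY)
    return base


def sentence_entity_priority(entities, default=9):
    """Use only the highest-priority entity in a sentence.

    `entities` is a list of (label, word) pairs from entity name recognition.
    """
    if not entities:
        return default
    return min(entity_priority(label, word, default) for label, word in entities)
```

The value returned by `sentence_entity_priority` would then be the single entity name priority score to which the entity name weight is applied.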

FIG. 9 illustrates a flowchart for automatically generating training data for training a voice recognition model according to the proposed method of the present invention. The method illustrated in FIG. 9 may be performed by the device 1200. The example of FIG. 9 is merely illustrative, and may be modified so that some elements of FIG. 9 are deleted or changed, or so that new elements not illustrated in FIG. 9 are added.

For a text sentence with low voice recognition reliability, the method illustrated in FIG. 9 generates new sentences using a sentence pattern and uses them to trace back the actual utterance content of the user voice data (e.g., a voice memo). This improves voice recognition performance and makes it possible to automatically derive training data for training the voice recognition model.

According to the proposed method of the present invention, when the voice recognition reliability for the words/sentence of a text sentence obtained by performing voice recognition 102 on user voice data (e.g., a voice memo) exceeds a specific threshold, the text sentence is regarded as valid and is collected as-is as training data for training the voice recognition model. If the voice recognition reliability does not exceed the specific threshold, a sentence pattern is generated according to the industry/intent classification result and the entity name recognition result, and new sentences are generated according to the sentence pattern by combining all possible candidate words. A small voice recognition model is then built from the new sentences, voice recognition is performed again on the utterance, and the reliability values for the words/sentence are verified. This step is executed for each voice recognition model built for the new sentences, and the sentence with the highest reliability among the new sentences is selected as the sentence for the final transcription. For example, the reliability may be measured using a method such as Bayes risk. The sentence selected in this way can also be used as the final voice recognition result for the voice memo, because a voice memo service may provide its results with a slight delay rather than in real time.

Because the candidate words are selected from a continuously updated knowledge model, the method can respond effectively to newly released menu items, newly coined words, and the like, and the candidate words can also be used for training the voice recognition model.

Referring to FIG. 9, the device 1200 may obtain one or more text sentences (or first text sentences) from the user's voice data (e.g., a voice memo) using the first voice recognition model (S902). In addition, the device 1200 may verify the voice recognition reliability of each first text sentence by measuring reliability on a word and/or sentence basis for the one or more first text sentences (S904). For example, the reliability measurement may be performed based on a method such as Bayes risk.

The device 1200 may determine whether the voice recognition reliability for the one or more first text sentences satisfies at least one first criterion (S906). For example, the at least one first criterion may include the voice recognition reliability for a sentence and the voice recognition reliability for a word. If the voice recognition reliability for a word/sentence is greater than or equal to a specific threshold, the device 1200 may determine that the at least one first criterion is satisfied, and if it is less than or equal to the specific threshold, the device 1200 may determine that the at least one first criterion is not satisfied (or, conversely, it is also possible to determine that the at least one first criterion is satisfied when the reliability is less than or equal to the specific threshold and not satisfied when it is greater than or equal to the specific threshold). As a more specific example, failing to satisfy the at least one first criterion may include the voice recognition reliability for the sentence being less than a first threshold or the voice recognition reliability for a word being less than a second threshold, and satisfying the at least one first criterion may include the voice recognition reliability for the sentence being greater than or equal to the first threshold and the voice recognition reliability for the word being greater than or equal to the second threshold.
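The more specific form of the first criterion (S906) can be sketched as a simple predicate. This is an illustrative sketch only: the function name `meets_criterion` and its parameters are assumptions, and the same check would apply to the second criterion with the third and fourth thresholds.

```python
def meets_criterion(sentence_conf, word_confs, sentence_threshold, word_threshold):
    """Return True when the sentence-level reliability and every word-level
    reliability are at or above their respective thresholds.

    The criterion fails if the sentence reliability is below
    sentence_threshold or if any word reliability is below word_threshold.
    """
    if sentence_conf < sentence_threshold:
        return False
    return all(conf >= word_threshold for conf in word_confs)
```

A sentence that fails this check would proceed to candidate generation (S908 to S912) instead of being collected directly as training data.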

When the voice recognition reliability for the one or more first text sentences does not satisfy the at least one first criterion (S906), the device 1200 may obtain at least one entity name candidate word from the knowledge model 912 (S910) based on the information about the intent (or intent information) and the information about the entity name (or entity name information) obtained for the one or more first text sentences (e.g., see S202 of FIG. 2) (S908), and may generate one or more text sentences according to a new sentence pattern based on the at least one entity name candidate word (S912). In this specification, a text sentence generated according to a new sentence pattern based on at least one entity name candidate word may be referred to as a second text sentence.

The device 1200 may train a small voice recognition model based on the one or more second text sentences (S914). For convenience, in this specification the voice recognition model used for voice recognition in S902 is referred to as the first voice recognition model, and the small voice recognition model trained on the one or more second text sentences is referred to as the second voice recognition model. For example, the second voice recognition model may reuse the acoustic model (AM) of the first voice recognition model as-is while a new language model (LM) is created and trained using the second text sentences. As a more specific example, a second voice recognition model may be created for each second text sentence, and each second text sentence may be used to train (the language model of) the corresponding second voice recognition model. Alternatively, one second voice recognition model may be created for the one or more second text sentences, and the one or more second text sentences may be used to train (the language model of) that second voice recognition model.

Using the trained second voice recognition model, the device 1200 may verify the voice recognition reliability of each second text sentence by measuring reliability on a word and/or sentence basis for the one or more second text sentences (S916). For example, the reliability measurement may be performed based on a method such as Bayes risk.

The device 1200 may determine whether the voice recognition reliability for the one or more second text sentences satisfies at least one second criterion (S918). For example, the at least one second criterion may include the voice recognition reliability for a sentence and the voice recognition reliability for a word. If the voice recognition reliability for a word/sentence is greater than or equal to a specific threshold, the device 1200 may determine that the at least one second criterion is satisfied, and if it is less than or equal to the specific threshold, the device 1200 may determine that the at least one second criterion is not satisfied (or, conversely, it is also possible to determine that the at least one second criterion is satisfied when the reliability is less than or equal to the specific threshold and not satisfied when it is greater than or equal to the specific threshold). As a more specific example, failing to satisfy the at least one second criterion may include the voice recognition reliability for the sentence being less than a third threshold or the voice recognition reliability for a word being less than a fourth threshold, and satisfying the at least one second criterion may include the voice recognition reliability for the sentence being greater than or equal to the third threshold and the voice recognition reliability for the word being greater than or equal to the fourth threshold.

When the voice recognition reliability for the one or more second text sentences satisfies the at least one second criterion, the device 1200 may collect the text sentence having the highest voice recognition reliability among the one or more second text sentences as training data for the first voice recognition model (S920), and may train the first voice recognition model based on that second text sentence (S922). In addition, when the voice recognition reliability for the one or more second text sentences satisfies the at least one second criterion, the second text sentence having the highest voice recognition reliability among the one or more second text sentences may be determined or used as the final voice recognition result for the user's voice data. If the voice recognition reliability for the one or more second text sentences does not satisfy the at least one second criterion, collecting the one or more second text sentences as training data and training the first voice recognition model based on them may be omitted (S924).

In addition, for example, a newly generated sentence (or second text sentence) whose reliability, as verified using the small voice recognition model (or second voice recognition model), is not higher than the existing voice recognition reliability (i.e., the voice recognition reliability in S904) is not reflected in the final voice recognition result and is not collected as training data for the first voice recognition model. Likewise, when the existing voice recognition reliability (i.e., the voice recognition reliability in S904) does not satisfy the at least one first criterion (S906), the one or more first text sentences are not collected or used as training data for the first voice recognition model.
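One possible reading of the selection logic in S918 to S924 is sketched below. It keeps only candidates that satisfy the second criterion and whose sentence reliability exceeds the reliability measured in S904, then returns the most reliable one. The `Candidate` type, the function `select_final`, and the use of the minimum word reliability as the word-level measure are assumptions made for illustration, not details taken from the description.

```python
from typing import NamedTuple, Optional, Sequence


class Candidate(NamedTuple):
    """A second text sentence with reliabilities measured in S916."""
    text: str
    sentence_conf: float
    min_word_conf: float  # lowest word-level reliability (assumed summary)


def select_final(candidates: Sequence[Candidate],
                 original_sentence_conf: float,
                 sentence_threshold: float,
                 word_threshold: float) -> Optional[Candidate]:
    """Pick the candidate to use as final result and training data.

    Returns None when no candidate qualifies, in which case collection and
    training of the first model would be skipped (S924).
    """
    passing = [
        c for c in candidates
        if c.sentence_conf >= sentence_threshold     # third threshold
        and c.min_word_conf >= word_threshold        # fourth threshold
        and c.sentence_conf > original_sentence_conf  # must beat S904 value
    ]
    if not passing:
        return None
    return max(passing, key=lambda c: c.sentence_conf)
```

Returning `None` rather than a fallback sentence keeps the decision to skip training explicit at the call site.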

Meanwhile, when applying the proposed method of the present invention, if it is desired to have the user confirm the accuracy of word recognition, the word in question may be displayed as a link or the like on the screen of the user terminal so that the user can correct the word directly. This makes the collection of new words, such as new menu items, more effective and useful. The words for which a link is provided are selected by combining the intent, the entity name, and the reliability.

FIG. 10 illustrates generating a sentence pattern and new sentences according to the proposed method of the present invention. The example of FIG. 10 is provided only to aid understanding of the present invention and does not limit the present invention.

Referring to FIG. 10, it is assumed that the device 1200 has obtained information 330 about entity names, including Month, Date, Time, Location, Person_No, and Reservation, as a result of performing entity name recognition 112 on one or more pieces of first text data 310 obtained from the user's voice data (e.g., a voice memo). The device 1200 may generate a new sentence pattern 1010 based on the information 330 about the entity names.

In addition, based on the information about the industry and/or intent, the device 1200 may obtain entity name candidate words 1022, 1024, 1026, 1028, and 1030 that can be substituted into the new sentence pattern 1010 from the previously built knowledge-model-based thesaurus 912, and may generate new sentences (or one or more second text sentences) by combining the entity name candidate words 1022, 1024, 1026, 1028, and 1030 according to the new sentence pattern 1010. The device 1200 may use these combined new sentences to train a small voice recognition model (or second voice recognition model), either one for each new sentence or one for the new sentences as a whole.

To train a voice recognition model (e.g., an acoustic model) with an actual spoken utterance, it must first be verified that the utterance is valid. According to the proposed method of the present invention, the sentence most similar to the user's voice data can be found among the combinations produced from the new sentence pattern. The device 1200 may create a pattern file that defines the form of a sentence composed of keywords, and a category file that holds, by category, the words that can fill each item of that pattern, and may use these files to generate sentence patterns and new sentences.

According to the proposed method of the present invention, new sentences can be generated and provided as training data, which is particularly advantageous in services such as AI contact centers and AI call assistants, where industries and utterance patterns expand rapidly, and which makes it possible to improve voice recognition performance. In addition, a newly generated sentence can also be used as the final voice recognition result, which can further improve the performance of the voice recognition model.

FIG. 11 illustrates building a small voice recognition model and generating new sentences according to the proposed method of the present invention. Specifically, the device 1200 may build a small voice recognition model (or second voice recognition model) based on the voice recognition result and its reliability, verify the reliability, and select the final text sentence (e.g., the second text sentence that satisfies the at least one second criterion and has the highest voice recognition reliability) (e.g., see S908 to S918 of FIG. 9). The example of FIG. 11 is provided only to aid understanding of the present invention and does not limit the present invention.

Referring to FIG. 11, it is assumed that the user voice data contains “오늘 2시 로제떡볶이 4인분 포장이요” (“Four servings of rosé tteokbokki to go at 2 o'clock today”) and that the corresponding first text sentence was recognized as “오늘 2시 매운떡볶이 4인분 포장이요” (“Four servings of spicy tteokbokki to go at 2 o'clock today”) (1110). That is, it is assumed that, contrary to the user voice data, “로제떡볶이” (rosé tteokbokki) was misrecognized as “매운떡볶이” (spicy tteokbokki). It is further assumed that the device 1200 measured the voice recognition reliability for the words/sentence and determined the sentence reliability to be 0.838 and the word reliability of the misrecognized word “매운떡볶이” to be 0.3.

In this case, if the device 1200 has set the sentence reliability threshold (or first threshold) for the first criterion to 0.9 and the word reliability threshold (or second threshold) to 0.5, the voice recognition reliability of the first text sentence does not satisfy the at least one first criterion, because 0.838 is less than 0.9 and 0.3 is less than 0.5. To generate sentences for training the second voice recognition model, the device 1200 may check the industry/intent classification result and the entity name recognition result and generate a new sentence pattern and new sentences according to it. In the example of FIG. 11, a sentence pattern is generated using the entity names <Date>, <Time>, <Item>, <Item_No>, and <Action>, the entity name candidate words “폭탄떡볶이” (bomb tteokbokki) and “로제떡볶이” (rosé tteokbokki) are obtained, and the candidate words are combined according to the sentence pattern to generate the new sentences (or second text sentences) “오늘 2시 폭탄떡볶이 4인분 포장이요” (1120) and “오늘 2시 로제떡볶이 4인분 포장이요” (1130).
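The combination of candidate words according to a sentence pattern can be sketched with a Cartesian product over the pattern slots. The pattern labels and the two <Item> candidates are taken from the FIG. 11 example; the assignment of the remaining words to slots, the dictionary layout, and the function name `generate_sentences` are assumptions made only for this illustration.

```python
from itertools import product

# Sentence pattern from the FIG. 11 example, as an ordered list of slots.
PATTERN = ["Date", "Time", "Item", "Item_No", "Action"]

# Candidate words per slot. Only the two Item candidates are stated in the
# description; the other slot assignments are assumed from the example sentence.
CANDIDATES = {
    "Date": ["오늘"],
    "Time": ["2시"],
    "Item": ["폭탄떡볶이", "로제떡볶이"],
    "Item_No": ["4인분"],
    "Action": ["포장이요"],
}


def generate_sentences(pattern, candidates):
    """Fill every slot of the pattern with every combination of candidates."""
    slots = [candidates[label] for label in pattern]
    return [" ".join(words) for words in product(*slots)]


sentences = generate_sentences(PATTERN, CANDIDATES)
for sentence in sentences:
    print(sentence)
```

Because the number of generated sentences is the product of the slot sizes, a practical system would likely cap or prune the candidate lists per slot.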

In this example, the device 1200 builds a small voice recognition model (or second voice recognition model) based on the three sentences “오늘 2시 매운떡볶이 4인분 포장이요” (1110), “오늘 2시 폭탄떡볶이 4인분 포장이요” (1120), and “오늘 2시 로제떡볶이 4인분 포장이요” (1130), and performs the reliability verification for the words/sentences again. As described above, a small voice recognition model (or second voice recognition model) may be created for each of the three sentences and trained using that sentence, or a single small voice recognition model (or second voice recognition model) may be created for the three sentences and trained using all three. “오늘 2시 폭탄떡볶이 4인분 포장이요” (1120) has the lowest word/sentence reliability and is therefore eliminated, and “오늘 2시 로제떡볶이 4인분 포장이요” (1130), which has the highest reliability, is selected as the sentence for the final transcription and for training the voice recognition model (or first voice recognition model) and/or as the final voice recognition result. The device 1200 may determine the sentence 1130, whose reliability has been verified using the second voice recognition model, as the final voice recognition result, and may collect it as training data for the first voice recognition model and use it to train the first voice recognition model.

As described above, according to the proposed method of the present invention, when a customer leaves a free-form voice memo for a small business owner through a voice memo service such as an AI call assistant, the voice memo is summarized and provided to the small business owner in order of priority. This allows the small business owner to easily tell whether a voice memo requires a quick response or can be answered later, improving the efficiency of customer service.

In addition, according to the proposed method of the present invention, in a voice memo service for small business owners, where industries are continuously updated and user utterances change dynamically as a result, sentence patterns are extracted automatically from the voice recognition result using the intent classified by the system and the summary result, and are used for automatic transcription and for training the voice recognition model. This can improve voice recognition performance and enable a smooth, high-quality service. According to the proposed method of the present invention, to secure the quality of automatic transcription, the validity of newly generated patterns or sentences is verified immediately and the verified sentence corpus is used to train the voice recognition model, so that the recognition rate for new sentences can be improved and maintained. Additionally, a text normalization technique is applied to convert numbers, English letters, and special symbols, providing highly readable text.

According to the proposed method of the present invention, when a customer calls a small business owner and cannot be connected right away, the customer can comfortably leave a spoken voice memo, and the small business owner can respond to customers efficiently based on the content summarized according to the proposed method of the present invention. Depending on the summarized result, simple inquiries or requests such as parking information or menu information may be answered automatically through a predefined process. Urgent requests such as callback requests and delivery confirmations are given a higher priority and provided to the small business owner so that they can be handled quickly.

In addition, according to the proposed method of the present invention, voice recognition results that are output purely in Hangul because of the influence of large-scale training data are converted into the numbers, English letters, and special symbols that users are familiar with. This secures the readability of the voice memos left by customers and also makes it more convenient for small business owners to respond.

In addition, according to the proposed method of the present invention, voice recognition quality can be secured by extracting sentence patterns from the voice recognition result using the classified intent and the summary result, and using them for automatic transcription and for training the voice recognition model.

Device to which the proposed method of the present invention can be applied

FIG. 12 illustrates a device 1200 to which the proposed method of the present invention can be applied.

Referring to FIG. 12, the device 1200 may be configured to implement the proposed method of the present invention. As one example, the device 1200 may be configured to process voice data (e.g., a voice memo). To this end, the device 1200 may be configured to obtain summary text by performing sentence summarization 114 on one or more pieces of first text data according to the proposed method of the present invention, and to provide the summary text to a terminal device according to a priority determined based on at least one of the industry, the user intent, and the entity name. Additionally or alternatively, the device 1200 may be configured to improve voice recognition performance by generating a sentence pattern and new sentences according to at least one of the industry, the user intent, and the entity name according to the proposed method of the present invention, training the second voice recognition model using the generated sentences, obtaining a voice recognition result, and training the first voice recognition model. For example, the device 1200 may include a network device, a server device, or a terminal device.

For example, the device 1200 to which the proposed method of the present invention can be applied may include a network device such as a repeater, hub, bridge, switch, router, or gateway; a computer device such as a desktop computer or workstation; a mobile terminal such as a smartphone; a portable device such as a laptop computer; a home appliance such as a digital TV; or a means of transportation such as a car. As another example, the device 1200 to which the present invention can be applied may be included as part of an application specific integrated circuit (ASIC) implemented in the form of a system on chip (SoC).

The memory 1204 may store programs and/or instructions for the processing and control performed by the processor 1202, and may store data and information used in the present invention, control information required for processing data and information according to the present invention, and temporary data generated while processing data and information. The memory 1204 may be implemented as a storage device such as a read only memory (ROM), random access memory (RAM), erasable programmable read only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), flash memory, static RAM (SRAM), hard disk drive (HDD), or solid state drive (SSD).

The processor 1202 controls the operation of each module in the device 1200. In particular, the processor 1202 may perform various control functions for carrying out the proposed method of the present invention. The processor 1202 may also be referred to as a controller, microcontroller, microprocessor, microcomputer, or the like. The proposed method of the present invention may be implemented by hardware, firmware, software, or a combination thereof. When the present invention is implemented using hardware, an application specific integrated circuit (ASIC), digital signal processor (DSP), digital signal processing device (DSPD), programmable logic device (PLD), field programmable gate array (FPGA), or the like configured to carry out the present invention may be provided in the processor 1202. When the proposed method of the present invention is implemented using firmware or software, the firmware or software may include instructions related to modules, procedures, or functions that perform the functions or operations needed to implement the proposed method. The instructions may be stored in the memory 1204, or in a computer-readable recording medium (not shown) separate from the memory 1204, and may be configured to cause the device 1200 to implement the proposed method of the present invention when executed by the processor 1202.

In addition, the device 1200 may include a network interface module (NIM) 1206. The network interface module 1206 is operatively connected to the processor 1202, and the processor 1202 may control the network interface module 1206 to transmit or receive wireless/wired signals carrying information and/or data, signals, messages, and the like over a wireless/wired network. The network interface module 1206 supports various communication standards, such as the IEEE 802 series, 3GPP LTE(-A), and 3GPP 5G, and may transmit and receive control information and/or data signals according to the corresponding communication standard. The network interface module 1206 may be implemented outside the device 1200 as needed.

The embodiments described above are combinations of the components and features of the present invention in a predetermined form. Each component or feature should be considered optional unless explicitly stated otherwise. Each component or feature may be implemented without being combined with other components or features. It is also possible to configure an embodiment of the present invention by combining some of the components and/or features. The order of the operations described in the embodiments of the present invention may be changed. Some components or features of one embodiment may be included in another embodiment, or may be replaced with corresponding components or features of another embodiment. It is obvious that claims that do not have an explicit citation relationship in the claims may be combined to form an embodiment, or may be included as new claims by amendment after filing.

The present invention can be applied to various devices, such as network devices, server devices, and terminal devices, configured to provide voice-based services, including artificial intelligence contact centers and artificial intelligence call services.

100: Voice memo service system    102: Speech recognition
104: Speech recognition reliability verification
106: Speech recognition model training for new sentences
108: New sentence generation/verification    110: Intent classification
112: Named entity recognition    114: Sentence summarization
116: Priority determination    118: Voice memo output management
120: Knowledge model construction
310: One or more first text data
320: Information about intent    330: Information about named entity
340: Final speech recognition result    350: Summary text
410: Field indicating priority    420: Field indicating whether processing is complete
430: Field indicating a task to be handled    440: Field indicating a voice memo
612, 614, 616, 618: Categories according to industry and/or intent
622, 624, 626, 628: Sentences classified by category
700: Priority list
710: First list for industry priority (or priority score)
720: Second list for intent priority (or priority score)
730: Third list for named entity priority (or priority score)
1010: New sentence pattern
1022, 1024, 1026, 1028, 1030: Named entity candidate words
1110: First text sentence    1120, 1130: Second text sentences
1200: Device    1202: Processor
1204: Memory    1206: Network interface module

Claims

A method of processing voice data in a server device, comprising:
obtaining information about an intent and information about a named entity for one or more first text sentences;
obtaining summary text for the one or more first text sentences based on the information about the intent and the information about the named entity;
determining a priority for the one or more first text sentences based on the information about the intent and the information about the named entity; and
providing the summary text to a terminal device according to the determined priority.

The method of claim 1,
wherein obtaining the summary text comprises:
combining the information about the intent and the information about the named entity to obtain the summary text.

The method of claim 1,
wherein determining the priority comprises:
obtaining, from a priority list, a second priority score for the information about the intent and a third priority score for the information about the named entity; and
determining the priority based on a sum of the second priority score and the third priority score, each with a weight applied.

The method of claim 1,
further comprising obtaining information about an industry classification for the one or more first text sentences,
wherein determining the priority comprises:
obtaining, from a priority list, a first priority score for the information about the industry classification, a second priority score for the information about the intent, and a third priority score for the information about the named entity; and
determining the priority based on a value obtained by applying a weight to each of the first priority score, the second priority score, and the third priority score.

The method of claim 1, further comprising:
obtaining the one or more first text sentences from a user's voice data using a first speech recognition model;
based on the speech recognition reliability of the one or more first text sentences not satisfying at least one first criterion, obtaining a sentence pattern and at least one named entity candidate word based on the information about the intent and the information about the named entity, and generating one or more second text sentences from the at least one named entity candidate word according to the obtained sentence pattern; and
based on the speech recognition reliability of the one or more second text sentences satisfying at least one second criterion, determining the second text sentence having the highest speech recognition reliability among the one or more second text sentences as the speech recognition result for the user's voice data.

The method of claim 5,
further comprising training a second speech recognition model for each of the one or more second text sentences using each of the one or more second text sentences,
wherein the speech recognition reliability of each of the one or more second text sentences is obtained based on the second speech recognition model.

The method of claim 5,
wherein the speech recognition reliability includes a sentence reliability and a word reliability, and
wherein the speech recognition reliability of the one or more first text sentences not satisfying the at least one first criterion includes the sentence reliability of the one or more first text sentences being less than a first threshold, or the reliability of a word in the one or more first text sentences being less than a second threshold.

The method of claim 5,
wherein the speech recognition reliability includes a sentence reliability and a word reliability, and
wherein the speech recognition reliability of the one or more second text sentences satisfying the at least one second criterion includes the sentence reliability of the one or more second text sentences being greater than or equal to a third threshold, and the reliability of a word in the one or more second text sentences being greater than or equal to a fourth threshold.

The method of claim 5,
further comprising, based on the speech recognition reliability of the one or more second text sentences satisfying the at least one second criterion, training the first speech recognition model using the second text sentence having the highest speech recognition reliability,
wherein, based on the speech recognition reliability of the one or more second text sentences not satisfying the at least one second criterion, the training of the first speech recognition model using the second text sentence having the highest speech recognition reliability is omitted.

The method of claim 5,
further comprising determining the one or more first text sentences as the speech recognition result for the user's voice data, based on the speech recognition reliability of the one or more first text sentences satisfying the at least one first criterion.

A device comprising: a processor; and a memory,
wherein the memory includes instructions configured to, when executed by the processor, cause the device to implement specific operations for processing voice data, the specific operations comprising:
obtaining information about an intent and information about a named entity for one or more first text sentences;
obtaining summary text for the one or more first text sentences based on the information about the intent and the information about the named entity;
determining a priority for the one or more first text sentences based on the information about the intent and the information about the named entity; and
providing the summary text to a terminal device according to the determined priority.

The device of claim 11,
wherein the specific operations further comprise:
obtaining the one or more first text sentences from a user's voice data using a first speech recognition model;
based on the speech recognition reliability of the one or more first text sentences not satisfying at least one first criterion, obtaining a sentence pattern and at least one named entity candidate word based on the information about the intent and the information about the named entity, and generating one or more second text sentences from the at least one named entity candidate word according to the obtained sentence pattern; and
based on the speech recognition reliability of the one or more second text sentences satisfying at least one second criterion, determining the second text sentence having the highest speech recognition reliability among the one or more second text sentences as the speech recognition result for the user's voice data.

A computer-readable storage medium storing instructions configured to, when executed by a processor, cause a device including the processor to implement specific operations for processing voice data, the specific operations comprising:
obtaining information about an intent and information about a named entity for one or more first text sentences;
obtaining summary text for the one or more first text sentences based on the information about the intent and the information about the named entity;
determining a priority for the one or more first text sentences based on the information about the intent and the information about the named entity; and
providing the summary text to a terminal device according to the determined priority.

The computer-readable storage medium of claim 13,
wherein the specific operations further comprise:
obtaining the one or more first text sentences from a user's voice data using a first speech recognition model;
based on the speech recognition reliability of the one or more first text sentences not satisfying at least one first criterion, obtaining a sentence pattern and at least one named entity candidate word based on the information about the intent and the information about the named entity, and generating one or more second text sentences from the at least one named entity candidate word according to the obtained sentence pattern; and
based on the speech recognition reliability of the one or more second text sentences satisfying at least one second criterion, determining the second text sentence having the highest speech recognition reliability among the one or more second text sentences as the speech recognition result for the user's voice data.