KR102480872B1

KR102480872B1 - Method for generating data for machine learning and method for scheduling machine learning using the same

Info

Publication number: KR102480872B1
Application number: KR1020200164094A
Authority: KR
Inventors: 정휘웅; 배영준; 김미영
Original assignee: 주식회사 리노스; 토피도 주식회사
Priority date: 2020-11-30
Filing date: 2020-11-30
Publication date: 2022-12-26
Also published as: KR20220075709A

Abstract

A method for generating training data for efficient machine learning of a chatbot and a method for scheduling the machine learning are provided. A method for generating data for machine learning according to an embodiment of the present invention includes a data classification step of classifying data to be learned according to attributes of the data, and a training data generation step of generating training data from the data to be learned according to the attributes of the data to be learned. The attributes of the data may include a generation cycle and/or a validity period of the data. A machine learning scheduling method according to an embodiment of the present invention further includes a machine learning control step of loading the training data generated by the above data generation method into a training data DB at an appropriate time according to the attributes of that training data, and controlling machine learning of the chatbot engine to be executed.

Description

Method for generating data for machine learning and method for scheduling machine learning using the same

The present invention relates to a method for generating data for machine learning and a machine learning scheduling method using the same, and more particularly to a method for generating training data for efficient machine learning of a chatbot and a method for scheduling the machine learning.

A growing number of companies want to adopt a chatbot system that uses artificial intelligence to respond to customer consultations or inquiries made by phone call or messenger. A chatbot receives a voice or text message from a user, extracts the user's intent from it, and outputs a response corresponding to the extracted intent. The performance of a chatbot depends not only on the machine learning algorithm used to train it, but also on the quantity and quality of the training data used for its machine learning (hereinafter referred to as 'data for machine learning').

As artificial intelligence technology has advanced, simple responses by chatbots have become possible, but user satisfaction is still low. This is due not only to the difficulty of extracting the correct intent from complex queries, but also to the fact that chatbots often fail to retrieve appropriate data even for relatively simple questions.

For example, Patent Registration No. 2119468 proposes a method of using a counselor's consultation knowledge as the consultation knowledge of a chatbot by training a consultation chatbot on the content of the counselor's consultations. However, in today's environment, where new information is created and changes quickly, the content of a counselor's consultations may be based on outdated information. Therefore, even if a chatbot is trained in this way, the training may be based on incorrect information.

To address this, machine learning should be executed as often as possible, but when the amount of data is large, machine learning takes a long time, which makes it difficult to obtain the desired result.

The present invention has been made in view of these points, and an object of the present invention is to provide a method for generating data for machine learning that can supply appropriate data to a chatbot system, and a machine learning scheduling method using the same.

Another object of the present invention is to provide a method for generating data for machine learning that can supply appropriate data to a chatbot system while reducing the load of performing machine learning, and a machine learning scheduling method using the same.

A method for generating data for machine learning according to an embodiment of the present invention includes a data classification step of classifying data to be learned according to attributes of the data, and a training data generation step of generating training data from the data to be learned according to the attributes of the data to be learned. The attributes of the data may include a generation cycle and/or a validity period of the data.

The training data may have the same attributes as the data to be learned from which it was generated. Training data may be stored so that it is distinguished from training data with other attributes, for example by being stored in a separate database or by having a different field.

In one embodiment, a template may be used in the training data generation step. The template is composed of a plurality of queries and corresponding responses, and the responses include a part that can be replaced according to the data to be learned. The training data may be generated by replacing the replaceable part according to the data to be learned.

The method for generating data for machine learning according to an embodiment of the present invention further includes a step of deleting training data whose validity period has expired.

A machine learning scheduling method according to an embodiment of the present invention further includes a machine learning control step of loading the training data generated by the above data generation method into a training data DB at an appropriate time according to the attributes of that training data, and controlling machine learning of the chatbot engine to be executed.

In one embodiment, in the machine learning control step, once machine learning has been executed with a piece of training data, control is performed so that learning is not executed again with the same data within the generation cycle of that training data.

In one embodiment, the machine learning scheduling method of the present invention further includes a step of controlling machine learning to be executed with the training data at least once every minimum training pause period, regardless of the generation cycle attribute of the training data.

According to the present invention, training data that allows efficient training is generated from the data to be learned according to the attributes of that data, so that efficient machine learning of the chatbot is possible.

In addition, since the timing of machine learning is controlled according to the generation cycle of the data, there is no need to train with all of the training data every time, which greatly reduces the time required for a single round of machine learning.

FIG. 1 is a block diagram showing an example in which the present invention is applied to a chatbot system.
FIG. 2 is a functional block diagram of a machine learning scheduler according to a preferred embodiment of the present invention.
FIG. 3 is a flowchart showing a process of generating data for machine learning and scheduling machine learning according to a preferred embodiment of the present invention.

Some of the information that users ask about through a chatbot hardly ever changes, while other information changes from moment to moment. For example, the name of an expressway such as 'Gyeongbu Expressway' and the length of its sections hardly change, but the average speed or the weather on a specific section of the Gyeongbu Expressway changes from moment to moment.

In addition, some data changes periodically, while other data changes irregularly. For example, the communication fee paid by a user is updated every month, whereas the basic communication fee may change at irregular intervals.

In the present invention, the variability and cycle specific to each kind of data are taken into account, and data is provided to the machine learning device at a cycle suited to that data, so that appropriate machine learning is performed while the load on the learning system is reduced.

Hereinafter, preferred embodiments of the present invention will be described in detail with reference to the drawings.

FIG. 1 is a block diagram showing an example in which the present invention is applied to a chatbot system.

In a typical chatbot system, a user's speech converted into text, or text entered directly by the user, is input to the chatbot engine 10, and the chatbot engine 10 responds to the user according to the result of its training. A language model such as BERT (Bidirectional Encoder Representations from Transformers) may be used in the chatbot engine 10. BERT is a neural network based on the self-attention technique, and its AI conversational performance in a specific field can be improved through pre-trained embeddings for that field. By performing machine learning on the chatbot engine 10 periodically or as needed using the training data DB 20, the performance of the chatbot engine 10 is improved and the latest information can be reflected.

In the present invention, the machine learning scheduler 100 processes the raw data to be used for training the chatbot engine (hereinafter referred to as 'data to be learned') to generate training data suitable for efficiently training the chatbot engine 10 (hereinafter referred to as 'training data'), and uses the generated training data to perform machine learning on the chatbot engine 10 according to the attributes of the data.

Next, the configuration of the machine learning scheduler 100 according to a preferred embodiment of the present invention will be described with reference to FIG. 2.

The data classification unit 110 classifies the input data to be learned according to its attributes. The attributes of the data may include the generation cycle of the data, the validity period of the data, and the like. The data to be learned may be generated automatically by a computer, a sensor device, or the like, or may be written by a person.

The classification of data by attribute may be determined from the input source of the data, or from words contained in the title or body of the document that contains the data. For example, data from a source that provides daily data may be assigned a daily cycle. Likewise, if the title of a document contains 'annual report', the data may be assigned an annual cycle. Alternatively, an artificial intelligence algorithm may be trained on documents whose attributes have been assigned by a person, and the trained artificial intelligence engine may then assign attributes to subsequently input documents.
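The source-based and title-based rules described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the keyword table, the source name `daily-traffic-feed`, and the function `classify_cycle` are hypothetical and do not appear in the patent, which gives only 'annual report' and a daily data source as examples.

```python
from typing import Optional

# Hypothetical keyword table; only 'annual report' comes from the text.
TITLE_KEYWORDS = {
    "annual report": "annual",
    "quarterly report": "quarterly",
}

# Hypothetical mapping from a known input source to its generation cycle.
SOURCE_CYCLES = {
    "daily-traffic-feed": "daily",
}


def classify_cycle(source: str, title: str) -> Optional[str]:
    """Return the generation cycle, or None if no rule matches."""
    # Rule 1: the input source determines the cycle.
    if source in SOURCE_CYCLES:
        return SOURCE_CYCLES[source]
    # Rule 2: a keyword in the document title determines the cycle.
    lowered = title.lower()
    for keyword, cycle in TITLE_KEYWORDS.items():
        if keyword in lowered:
            return cycle
    # No rule matched; a trained classifier or a person would decide.
    return None
```

Documents for which no rule matches would fall through to the third option in the text, a classifier trained on human-labelled documents, which is not sketched here.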

The generation cycle of data indicates the cycle at which periodically generated data is produced. For example, a company's business report has an annual generation cycle, and a quarterly report has a quarterly one. The daily number of traffic accident deaths, daily precipitation, daily traffic volume, and other figures that are compiled every day have a daily generation cycle.

The validity period of data indicates the period during which the generated data is valid. For example, when tolls are waived during the Chuseok holiday, the toll information for that period is valid only during the Chuseok holiday. When traffic is restricted on some road sections because of construction, that information is likewise valid only during the construction period. Some data has no fixed validity period. For example, a company's corporate registration number or address has no fixed validity period, so its validity period may be set to undetermined.

The data classification unit 110 classifies the attributes of the input data to be learned and either stores the data in different databases 120a, ..., 120n (or files) according to the attributes, or sets an attribute field of the data to be learned. FIG. 2 shows, by way of example, per-cycle databases: a real-time data DB 120a for data generated in real time, a daily data DB 120b for data generated every day, a weekly data DB 120c for data generated every week, and an annual data DB 120n for data generated every year.
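A minimal sketch of the two storage options mentioned above, attribute fields on each record and separate per-cycle stores, might look like this. The class `SourceRecord`, its field names, and `store_by_cycle` are assumptions made for illustration; in-memory lists stand in for the databases 120a to 120n.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date
from typing import Dict, List, Optional


@dataclass
class SourceRecord:
    """One item of data to be learned, with its attribute fields."""
    content: str
    cycle: str                          # e.g. "realtime", "daily", "weekly", "annual"
    valid_until: Optional[date] = None  # None means the validity period is undetermined


def store_by_cycle(records: List[SourceRecord]) -> Dict[str, List[SourceRecord]]:
    """Group records into one store per generation cycle."""
    stores: Dict[str, List[SourceRecord]] = defaultdict(list)
    for record in records:
        stores[record.cycle].append(record)
    return dict(stores)
```

Keeping the cycle as a field on the record means the same structure also covers the alternative in which all data lives in one store and is distinguished only by its attribute field.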

The training data processing unit 130 converts the classified data to be learned into training data according to its attributes. The converted training data may have the same attributes as the data to be learned. The converted training data may be stored separately from training data with other attributes. In addition, the training data processing unit 130 may delete training data whose validity period has expired so that it is not used in the next round of learning.
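The expiry step could be sketched as a simple filter over the stored training items. The tuple layout `(question, answer, valid_until)` and the function name `drop_expired` are hypothetical, and treating the end date as still valid is an assumption, since the text does not say whether the boundary is inclusive.

```python
from datetime import date
from typing import List, Optional, Tuple

# Hypothetical layout: (question, answer, valid_until); None = no fixed validity period.
TrainingItem = Tuple[str, str, Optional[date]]


def drop_expired(items: List[TrainingItem], today: date) -> List[TrainingItem]:
    """Keep items with no validity period or whose validity period has not yet passed."""
    # Assumption: an item is still valid on its valid_until date.
    return [item for item in items if item[2] is None or item[2] >= today]
```

For example, toll-waiver information valid only through the end of a holiday would be removed on the first call made after that date, while a corporate registration number with no validity period would always be kept.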

Table 1 shows an example in which annual data, whose generation cycle is one year, is converted into training data.

Annual data: Number of public holidays: 67 days (2020)
Training data:
Questions:
How many public holidays are there this year?
How many public holidays are there in the current year?
How many public holidays are there in 2020?
What is the number of public holidays this year?
What is the number of public holidays in the current year?
What is the number of public holidays in 2020?
How many days of public holidays are there this year?
How many days of public holidays are there in the current year?
...
Answer: The number of public holidays this year is 67.

In addition, when new annual data is generated, the existing annual data can be converted as shown in Table 2. That is, information that used to correspond to this year now corresponds to last year because the year has changed, so the training data is regenerated accordingly.

Annual data: Number of public holidays: 66 days (2019)
Training data:
Questions:
How many public holidays were there last year?
How many public holidays were there in the previous year?
How many public holidays were there in 2019?
What was the number of public holidays last year?
What was the number of public holidays in the previous year?
What was the number of public holidays in 2019?
How many days of public holidays were there last year?
How many days of public holidays were there in the previous year?
...
Answer: The number of public holidays last year was 66.
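The relabelling from 'this year' to 'last year' shown in Tables 1 and 2 can be sketched as a function of the data's year and the current year. The function names and the exact English phrasings below are illustrative assumptions and are not the wording used in the tables.

```python
from typing import List, Tuple


def year_phrases(data_year: int, current_year: int) -> List[str]:
    """Return the ways of referring to data_year as seen from current_year."""
    explicit = str(data_year)
    if data_year == current_year:
        return ["this year", "the current year", explicit]
    if data_year == current_year - 1:
        return ["last year", "the previous year", explicit]
    # Older data is only addressable by its explicit year.
    return [explicit]


def holiday_pairs(days: int, data_year: int, current_year: int) -> List[Tuple[str, str]]:
    """Build question/answer pairs for an annual public-holiday count."""
    verb_q, verb_a = ("does", "has") if data_year == current_year else ("did", "had")
    pairs = []
    for phrase in year_phrases(data_year, current_year):
        question = f"How many public holidays {verb_q} {phrase} have?"
        answer = f"{phrase[0].upper() + phrase[1:]} {verb_a} {days} public holidays."
        pairs.append((question, answer))
    return pairs
```

Regenerating the pairs for the 2019 figure once the current year becomes 2020 yields the 'last year' variants, which mirrors the conversion from Table 1 to Table 2.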

Table 3 shows an example of processing daily data.

Daily data: Traffic accident deaths: 5 (November 30, 2020)
Training data:
Questions:
How many traffic accident deaths were there yesterday?
How many people died in traffic accidents yesterday?
How many traffic accident fatalities occurred yesterday?
Please tell me the number of traffic accident deaths yesterday.
...
Answer: The number of traffic accident deaths yesterday was 5.

Similarly, when new daily data is generated, the existing daily data can be converted as shown in Table 4.

Daily data: Traffic accident deaths: 3 (November 29, 2020)
Training data:
Questions:
How many traffic accident deaths were there the day before yesterday?
How many people died in traffic accidents the day before yesterday?
How many traffic accident fatalities occurred the day before yesterday?
Please tell me the number of traffic accident deaths the day before yesterday.
...
Answer: The number of traffic accident deaths the day before yesterday was 3.

In one embodiment, the training data processing unit 130 may generate the training data using a template.

For example, a set of queries and responses for annual training data can be generated using a template such as the one illustrated in Table 5. A template is composed of queries and corresponding responses, and the responses include a part that can be replaced according to the data. For example, when data to be learned stating that the number of public holidays this year is 67 is input, annual training data can be generated by replacing the '#' part of Table 5 with the value '67'.

Query | Response
How many public holidays are there this year? | The number of public holidays this year is #.
How many public holidays are there in the current year? | The number of public holidays in the current year is #.
How many public holidays are there in 2020? | The number of public holidays in 2020 is #.
What is the number of public holidays this year? | The number of public holidays this year is #.
What is the number of public holidays in the current year? | The number of public holidays in the current year is #.
What is the number of public holidays in 2020? | The number of public holidays in 2020 is #.
How many days of public holidays are there this year? | The number of public holidays this year is #.
How many days of public holidays are there in the current year? | The number of public holidays in the current year is #.
... | ...
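The substitution of the '#' placeholder can be illustrated with a short sketch. The list `TEMPLATE` holds only three of the query/response rows, and the function name `fill_template` is an assumption; the patent does not specify how the replacement is implemented.

```python
from typing import List, Tuple

# A few rows of the template; '#' marks the replaceable part of each response.
TEMPLATE: List[Tuple[str, str]] = [
    ("How many public holidays are there this year?",
     "The number of public holidays this year is #."),
    ("How many public holidays are there in 2020?",
     "The number of public holidays in 2020 is #."),
    ("What is the number of public holidays this year?",
     "The number of public holidays this year is #."),
]


def fill_template(template: List[Tuple[str, str]], value: str,
                  placeholder: str = "#") -> List[Tuple[str, str]]:
    """Replace the placeholder in every response with the given value."""
    return [(query, response.replace(placeholder, value))
            for query, response in template]


training_pairs = fill_template(TEMPLATE, "67")
```

A single-character placeholder is simple but fragile if a response legitimately contains '#'; a more distinctive token would be a reasonable variation.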

A template may be created with the help of artificial intelligence from existing real question-and-answer data, or written manually, and additional templates may be generated automatically from an existing template using a synonym dictionary, verb conjugation patterns, and the like.

The scheduler 140 loads the training data created by the training data processing unit 130 into the training data DB 20 at an appropriate time according to the attributes of each piece of training data, including its generation cycle (annual, daily, and so on), and executes machine learning of the chatbot engine 10. For example, once annual training data has been generated and learning has been executed with it once, learning is not executed with that data again for one year. For monthly training data, learning is not executed with the same data for one month after learning has been executed. For daily training data, learning is executed every day. Real-time training data is included every time machine learning is performed.
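The per-cycle rule above amounts to a "due" check for each piece of training data. The following is a rough sketch, assuming fixed cycle lengths of 30 and 365 days for monthly and annual data (an approximation not stated in the patent) and a hypothetical function name `is_due`.

```python
from datetime import datetime, timedelta
from typing import Optional

# Assumed cycle lengths; month and year are approximated with fixed day counts.
CYCLE_LENGTH = {
    "daily": timedelta(days=1),
    "weekly": timedelta(weeks=1),
    "monthly": timedelta(days=30),
    "annual": timedelta(days=365),
}


def is_due(cycle: str, last_trained: Optional[datetime], now: datetime) -> bool:
    """Decide whether data with the given cycle should be included in this run."""
    # Real-time data is included in every training run.
    if cycle == "realtime":
        return True
    # Newly generated data that has never been used is always included.
    if last_trained is None:
        return True
    # Otherwise skip the data until one full generation cycle has elapsed.
    return now - last_trained >= CYCLE_LENGTH[cycle]
```

Only the items for which `is_due` returns true would be loaded into the training data DB for a given run, which is what keeps each run smaller than training on the full data set.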

Depending on the embodiment, a minimum training pause period may be set to prevent data from being excluded from training for too long. In this case, training is performed at least once every minimum training pause period regardless of the attributes of the training data. For example, if the minimum training pause period is six months, training is performed on all of the training data at least every six months.
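One way to read the minimum training pause period is as a cap on how long any data can go without being used, applied on top of the generation cycle. The sketch below takes that reading; the function name `is_due_with_pause` and the use of 182 days for six months are assumptions.

```python
from datetime import datetime, timedelta
from typing import Optional


def is_due_with_pause(cycle_length: timedelta, last_trained: Optional[datetime],
                      now: datetime, min_pause: timedelta) -> bool:
    """Due when either the generation cycle or the minimum training pause period has elapsed."""
    if last_trained is None:
        return True
    elapsed = now - last_trained
    # The shorter of the two intervals decides, so long-cycle data
    # is retrained at least once every min_pause.
    return elapsed >= min(cycle_length, min_pause)
```

With an annual cycle and a six-month pause period, annual data becomes due again after about six months rather than a full year, while daily data is unaffected because its cycle is already shorter.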

According to the present invention, since the timing of machine learning is controlled according to the generation cycle of the data, there is no need to train with all of the training data every time, which greatly reduces the time required for a single round of machine learning. In addition, data that has a validity period attribute is deleted once its validity period has elapsed, which prevents unnecessary data from being used for machine learning.

Next, the process of generating data for machine learning and scheduling machine learning according to a preferred embodiment of the present invention will be described with reference to FIG. 3.

When data to be learned is input, the machine learning scheduler 100 classifies it according to the attributes of the data. The attributes of the data may include the generation cycle of the data, the validity period of the data, and the like. The classification of data by attribute may be determined from the input source of the data, or from words contained in the title or body of the document that contains the data. Alternatively, an artificial intelligence algorithm may be trained on documents whose attributes have been assigned by a person, and the trained artificial intelligence engine may then assign attributes to subsequently input documents. The classified data is stored in different databases (or files) according to its attributes, or its attribute field is set.

The machine learning scheduler 100 generates training data from the data to be learned according to the attributes of the data to be learned. The training data may be generated, for example, through the procedure described with reference to Tables 1 to 5.

The machine learning scheduler 100 loads the training data into the training data DB 20 at an appropriate time according to the attributes of each piece of training data, including its cycle (annual, daily, and so on), and controls machine learning of the chatbot engine 10 to be executed. For example, once annual training data has been generated and learning has been executed with it once, learning is not executed with that data again for one year. For monthly training data, learning is not executed with the same data for one month after learning has been executed. For daily training data, learning is executed every day. Depending on the embodiment, a minimum training pause period may be set to prevent data from being excluded from training for too long. In this case, training is performed every minimum training pause period regardless of the attributes of the training data.

The present invention has been described above with reference to several examples. However, the fact that all of the components constituting an embodiment of the present invention have been described as being combined into one or as operating in combination does not mean that the present invention is limited to such an embodiment. That is, within the scope of the object of the present invention, one or more of the components may be selectively combined to operate. In addition, although each of the components may be implemented as an independent piece of hardware, some or all of the components may be selectively combined and implemented as a computer program having program modules that perform some or all of the combined functions on one or more pieces of hardware. The code and code segments constituting such a computer program can be easily inferred by a person skilled in the art to which the present invention pertains. Such a computer program can implement an embodiment of the present invention by being stored in a computer-readable or processor-readable storage medium and being read and executed by a computer or processor.

Terms such as "include", "comprise", or "have" used above mean that the corresponding component may be present unless specifically stated otherwise, and should therefore be construed as allowing further components to be included rather than as excluding other components.

The above description merely illustrates the technical idea of the present invention, and a person of ordinary skill in the art to which the present invention pertains will be able to make various modifications and variations without departing from the essential characteristics of the present invention. Accordingly, the embodiments disclosed herein are intended to explain, not to limit, the technical idea of the present invention, and the scope of the technical idea of the present invention is not limited by these embodiments. The scope of protection of the present invention should be construed according to the claims below, and all technical ideas within the equivalent scope should be construed as falling within the scope of rights of the present invention.

10: chatbot engine
20: training data DB
100: machine learning scheduler
110: data classification unit
120a, ..., 120n: per-cycle databases
130: training data processing unit
140: scheduler

Claims

delete

A machine learning scheduling method performed in a machine learning scheduler for machine learning of a chatbot engine of a chatbot system, the method comprising:
a data classification step in which the machine learning scheduler classifies data to be learned according to attributes of the data, the attributes including a data generation cycle and a data validity period;
a training data generation step in which the machine learning scheduler generates, from the data to be learned and according to the attributes of the data to be learned, training data having the same attributes as the data to be learned; and
a machine learning control step in which the machine learning scheduler loads the training data into a training data DB at a time corresponding to the generation cycle attribute of each piece of training data and controls machine learning of the chatbot engine to be executed,
wherein, in the machine learning control step, the machine learning scheduler controls training data with which learning has been executed once so that learning is not executed again with that data within the data generation cycle given by its generation cycle attribute, controls machine learning to be executed with the training data at least once every minimum training pause period regardless of the generation cycle attribute of the training data, and deletes training data whose validity period has expired according to the validity period attribute, and
wherein the data generation cycle attribute includes a real-time attribute, and training data having the real-time attribute is included each time machine learning is executed.

delete

The machine learning scheduling method of claim 6,
wherein, in the training data generation step, the training data is generated from the data to be learned using a template,
wherein the template is composed of a plurality of queries and corresponding responses, the responses including a part that can be replaced according to the data to be learned, and
wherein the training data is generated by replacing the replaceable part according to the data to be learned.

delete