KR102473854B1

KR102473854B1 - Industrial occupation code classificational system and the method including the same

Info

Publication number: KR102473854B1
Application number: KR1020200081517A
Authority: KR
Inventors: 김영진
Original assignee: 주식회사 에프에스
Priority date: 2020-07-02
Filing date: 2020-07-02
Publication date: 2022-12-02
Also published as: KR20220003819A

Abstract

The present invention relates to an industry and occupation code classification system and a method thereof. The system includes a text receiving unit that receives a plurality of input texts; a code analysis unit that analyzes the received texts and the list of matching codes and provides the matched codes; and a code transmission unit that transmits the classification codes retrieved for the keywords, together with the keywords, to a user terminal over a network. The code analysis unit includes a rule-based conversion module that matches keywords of the text against a dictionary and provides a code list with confidence values; a machine-learning-based conversion module that converts the text into a configurable feature vector, accumulates it, and provides a matching code list with confidence values; and a statistics-based conversion module that analyzes the text according to per-code frequencies and provides a code list with confidence values. The system thereby provides industry and occupation codes with high accuracy and reliability.

Description

Industrial occupation code classification system and its method {INDUSTRIAL OCCUPATION CODE CLASSIFICATIONAL SYSTEM AND THE METHOD INCLUDING THE SAME}

The present invention relates to an industry and occupation code classification system and a method thereof.

Text classification has been studied in support of a variety of services. For example, sentiment and emotion classification has been studied for text on the Internet, and classification of the service domain of a text is being studied for the development of chatbots and dialogue systems, which have recently drawn attention. In the past, such work relied mainly on rules or dictionaries. That approach makes the engineering needed to improve a classification model relatively easy, but the rules or dictionaries become harder to manage as they grow. To overcome this, statistical models were applied to large amounts of data, but their performance was limited because they do not sufficiently capture the semantic and structural characteristics inherent in text. Machine learning models, including the deep learning models that have recently drawn attention, capture those characteristics and achieve high performance, and they are widely used in many fields, for text data and for other kinds of data.

Overfitting or underfitting tends to occur when the complexity of a machine learning model is not balanced against the complexity of the data. In industry and occupation code classification, the text entered by a user is short while the number of codes is very large, so a very large amount of data is needed to reach even a moderate level of classification accuracy. In other words, applying existing machine-learning-based methods unchanged to industry and occupation code classification inevitably yields performance below expectations. Several methods must therefore be combined so that a certain level of performance can be achieved even when data is scarce.

Moreover, industry and occupation code systems change from time to time, and no particular solution to this problem has been devised. Because the large number of codes already calls for a relatively large amount of data, such changes to the code system make the problem even harder.

Korean Laid-open Patent Publication No. 10-2002-0004923 (published 2002-01-16)

The present invention has been devised to solve the above technical problems. Its object is to provide an industry and occupation code classification system and method that apply a specific scheme to searches over the very large set of industry and occupation codes, so that the results for a search input are provided with higher accuracy.

Another object of the present invention is to provide an industry and occupation code classification system and method that automatically convert industry and occupation codes of an old code system into those of a new code system, so that they can be readily used for industry and occupation code classification.

The technical problems addressed by the present invention are not limited to those mentioned above, and other technical problems not mentioned will be clearly understood by those skilled in the art from the description below.

To achieve the above objects, an industry and occupation code classification system according to the present invention may include:

a text receiving unit that receives a plurality of input texts;

a code analysis unit that analyzes the received texts and the list of matching codes and provides the matched codes; and

a code transmission unit that transmits the classification codes retrieved for the keywords, together with the keywords, to a user terminal over a network,

wherein the code analysis unit includes:

a rule-based conversion module that matches keywords of the text against a dictionary and provides a code list with confidence values;

a machine-learning-based conversion module that converts the text into a configurable feature vector, accumulates it, and provides a matching code list with confidence values; and

a statistics-based conversion module that analyzes the text according to per-code frequencies and provides a code list with confidence values.

In one embodiment of the present invention, the system may further include a management server that stores and outputs data, and a user terminal that communicates with the management server over a wired or wireless network.

In one embodiment of the present invention, the system may further include a code system conversion unit that analyzes the code list matched by the code transmission unit and converts it into a new code list.

In one embodiment of the present invention, the rule-based conversion module may be configured to set code priorities, to search the input text for preset prohibited keywords, and to assign a lower priority to the corresponding codes.

In one embodiment of the present invention, the machine-learning-based conversion module may provide, from among the plurality of code lists matched to the text by the trained model, only those whose confidence is greater than a preset threshold.

In one embodiment of the present invention, the feature vector of the machine-learning-based conversion module may include at least one of n-grams, the number of characters, special characters, and digits.

In one embodiment of the present invention, the model of the machine-learning-based conversion module may include at least one of a decision tree, a support vector machine, a random forest, and a deep learning model.

In one embodiment of the present invention, the frequencies used by the statistics-based conversion module may be stored for each character of the input text and expressed as percentages (%), in descending order of matching degree according to frequency.

In one embodiment of the present invention, at least one of the rule-based conversion module, the machine-learning-based conversion module, and the statistics-based conversion module of the code analysis unit may be operated, sequentially or in an arbitrary order, depending on the number of finally matched codes.

The present invention also provides a method of reclassifying industry and occupation codes using the industry and occupation code classification system, the method including:

a text receiving step of receiving a plurality of input texts;

a step of analyzing the received texts and the list of matching codes,

in which at least one of the following processes is performed, sequentially or in an arbitrary order: a rule-based analysis process that matches keywords of the text against a dictionary and provides a code list with confidence values; a machine-learning-based analysis process that converts the text into a configurable feature vector, accumulates it, and provides a matching code list with confidence values; and a statistics-based analysis process that analyzes the text according to per-code frequencies and provides a code list with confidence values; and

a step of comparing the matched code list with the existing codes to determine whether anything has changed, and providing the code list and confidence values as they are if nothing has changed, or providing the changed new code list and confidence values if a change is required.

In one embodiment of the present invention, the rule-based, machine-learning-based, and statistics-based analysis processes of the code list analysis may be performed repeatedly when the number of finally matched codes falls short of a preset number of codes.

In one embodiment of the present invention, the process of changing the code list may include modifying an existing code, modifying the definition of an existing code, defining a new code, and deleting an existing code.

The industry and occupation code classification system and method according to the present invention apply a specific scheme to searches over the very large set of industry and occupation codes, and thereby provide the results for a search input with higher accuracy.

In addition, the industry and occupation code classification system and method according to the present invention automatically convert industry and occupation codes of an old code system into those of a new code system, and can therefore be readily used for industry and occupation code classification.

FIG. 1 is a schematic configuration diagram of an industry and occupation code classification system according to an embodiment of the present invention;
FIG. 2 is an exemplary view of the industry and occupation code classification system according to an embodiment of the present invention;
FIGS. 3 to 5 are flowcharts showing an industry and occupation code classification process according to an embodiment of the present invention; and
FIG. 6 is a flowchart showing an industry and occupation code classification process according to an embodiment of the present invention.

Hereinafter, an embodiment of the industry and occupation code classification system and method according to the present invention is described in detail with reference to the accompanying drawings. In assigning reference numerals to the components in each drawing, the same components are given the same numerals wherever possible, even when they appear in different drawings. In describing the embodiments, a detailed description of a related known configuration or function is omitted where it would hinder understanding of the embodiments.

Terms such as first, second, A, B, (a), and (b) may be used in describing the components of the embodiments. These terms serve only to distinguish one component from another and do not limit the nature, sequence, or order of the components. Unless defined otherwise, all terms used herein, including technical and scientific terms, have the meaning commonly understood by a person of ordinary skill in the art to which the present invention belongs. Terms defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the related art and, unless explicitly defined in this application, should not be interpreted in an idealized or excessively formal sense.

Throughout the specification, when a part is said to "include" or "comprise" a component, this means that it may further include other components, not that other components are excluded, unless specifically stated otherwise.

Conventionally, finding the codes that govern business fields, job types, and the like required reviewing a very large set of search results, which consumed considerable resources and yielded poor accuracy.

The present invention therefore goes beyond simply searching and matching an old code system. By applying rule-based, machine-learning-based, and statistics-based conversion methods so that sufficient search results are obtained with high confidence, it can provide a system and method that deliver codes with high accuracy and reliability.

FIG. 1 is a schematic configuration diagram of an industry and occupation code classification system according to an embodiment of the present invention, and FIG. 2 is an exemplary view of the same system.

Referring to FIGS. 1 and 2, the industry and occupation code classification system 100 according to the present invention matches keywords derived from a plurality of input texts and provides a code list with confidence values. The system may include a text receiving unit 150 that receives the plurality of input texts; a code analysis unit 100 that analyzes the received texts and the list of matching codes and provides the matched codes; and a code transmission unit 160 that transmits the classification codes retrieved for the keywords, together with the keywords, to a user terminal 200 over a network.

Here, the code analysis unit 100 matches and converts old classification codes 10 into new classification codes 20 and provides the result. It may include a rule-based conversion module 110 that matches keywords of the text against a dictionary and provides a code list with confidence values; a machine-learning-based conversion module 120 that converts the text into a configurable feature vector, accumulates it, and provides a matching code list with confidence values; and a statistics-based conversion module 130 that analyzes the text according to per-code frequencies and provides a code list with confidence values. The code analysis unit 100 may also include a code system conversion module 140, which is described below.

The text receiving unit 150 may receive a plurality of texts input over a wired or wireless network. The texts are passed to the code analysis unit 100, which may analyze, match, and convert them character by character. This analysis, matching, and conversion is performed by the rule-based conversion module 110, the machine-learning-based conversion module 120, and the statistics-based conversion module 130 of the code analysis unit 100, described below.

The rule-based conversion module 110 of the code analysis unit 100 checks conditions, including text matching, against the text input through the text receiving unit 150, and thereby finds and matches appropriate codes from the old classification codes 10. The rule-based conversion module 110 may perform keyword matching based on a dictionary: it receives the text input through the text receiving unit 150, checks whether matching text is present, and may provide two or more appropriately matched codes together with confidence values. When several codes are matched through the predefined dictionary, priorities may be set among the keywords, or the codes of all matched keywords may be provided. In a specific example, as shown in FIG. 2, for a keyword entered as the text "벼 농사" (rice farming), the matched code may be provided in a form such as "0111" (confidence: 90%). Any required number of matched codes may be provided, ranked by how promising they are.
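As a non-limiting illustration, dictionary-based keyword matching of this kind could be sketched as follows. The dictionary layout, the function name `rule_based_match`, the substring test, and the second dictionary entry (with its placeholder code "X001") are assumptions made for the example; only the pair "벼 농사" and "0111" (90%) comes from the description.

```python
# Hypothetical keyword dictionary: keyword -> (code, confidence).
# Only "벼 농사" -> "0111" (90%) is taken from the description;
# the second entry and its code "X001" are made-up placeholders.
KEYWORD_DICT = {
    "벼 농사": ("0111", 0.90),
    "농사": ("X001", 0.60),
}


def rule_based_match(text, dictionary=KEYWORD_DICT, top_k=None):
    """Return (code, confidence) pairs for every dictionary keyword found in text.

    Matching is a plain substring test (an assumption). Results are sorted by
    confidence, highest first, and optionally cut to the top_k best codes.
    """
    hits = [(code, conf) for kw, (code, conf) in dictionary.items() if kw in text]
    hits.sort(key=lambda pair: pair[1], reverse=True)
    return hits if top_k is None else hits[:top_k]
```

An input with no dictionary keyword simply yields an empty list, which corresponds to the "zero or more results" behavior of the rule-based method.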

When a plurality of keywords are matched, the rule-based conversion module 110 may give priority to a particular keyword. For example, when "화원면사무소" (Hwawon-myeon office) is input to the text receiving unit 150, "화원" is recognized as code A and "면사무소" (township office) as code B. According to the preset code priorities, code B is given priority over code A, and the output may be (code B: confidence 90%, code A: confidence 10%).
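As a non-limiting illustration, the priority handling could be sketched as follows. The priority table, the function name `rank_by_priority`, and the scoring rule (90% to the top code, the remaining 10% shared by the others) are assumptions chosen to reproduce the example figures; the description leaves the confidence calculation open.

```python
# Hypothetical mappings; the description only names the codes "A" and "B".
KEYWORD_TO_CODE = {"화원": "A", "면사무소": "B"}
CODE_PRIORITY = {"A": 1, "B": 2}  # assumed convention: higher value wins


def rank_by_priority(text):
    """Order matched codes by preset priority and attach confidence values.

    Assumed scoring: a single match gets 1.0; otherwise the top-priority code
    gets 0.9 and the other matched codes share the remaining 0.1 equally.
    """
    matched = {code for kw, code in KEYWORD_TO_CODE.items() if kw in text}
    ordered = sorted(matched, key=lambda c: CODE_PRIORITY[c], reverse=True)
    if len(ordered) <= 1:
        return [(c, 1.0) for c in ordered]
    rest = 0.1 / (len(ordered) - 1)
    return [(ordered[0], 0.9)] + [(c, rest) for c in ordered[1:]]
```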

Furthermore, in addition to matching keywords in the text input from the text receiving unit 150, the rule-based conversion module 110 of the code analysis unit 100 may search, keyword by keyword, for prohibited keywords that must not be present. For example, when "호프 재배" (hop cultivation) is input to the text receiving unit 150, "호프" is recognized as code A and "재배" (cultivation) as code B. According to the preset prohibited keywords, code A is removed, and the output may be (code B: confidence 100%).
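As a non-limiting illustration, the prohibited-keyword check could be sketched as follows. The description does not say how prohibited keywords are stored, so the table `PROHIBITED` (code A is dropped whenever "재배" appears in the input) and the equal split of confidence among surviving codes are assumptions.

```python
# Hypothetical mappings; the description only names the codes "A" and "B".
KEYWORD_TO_CODE = {"호프": "A", "재배": "B"}
# Assumed representation: per code, the keywords that disqualify it.
PROHIBITED = {"A": {"재배"}}


def match_with_prohibited(text):
    """Match keywords, then drop every code whose prohibited keyword is in text.

    Surviving codes share the confidence equally (an assumption).
    """
    codes = [code for kw, code in KEYWORD_TO_CODE.items() if kw in text]
    kept = [c for c in codes
            if not any(bad in text for bad in PROHIBITED.get(c, ()))]
    return [(c, 1.0 / len(kept)) for c in kept]
```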

Beyond these examples, the rule-based method of the rule-based conversion module 110 may check all possible cases with conditional statements (If ~, Then ~, Else ~) and provide an appropriate code list with confidence values. The confidence value of each code may be calculated in any suitable way, for example from the priority of the rules or the number of matched keywords. The rule-based method may also fail to match anything, so it provides zero or more code results.

The machine-learning-based conversion module 120 of the code analysis unit 100 represents the text input through the text receiving unit 150 as a feature vector and applies it to a machine learning model to match appropriate codes. The feature vector may be defined freely as part of the model design of the system; typical features include n-gram features, the number of characters, and whether special characters or digits appear. The n-gram features may be defined over any unit (for example, characters or words). The input text is converted into such a feature vector, and the machine learning model may be trained on the feature vector values.
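As a non-limiting illustration, a feature vector combining the features named above could be built as follows. The function names, the choice of character bigrams, the fixed n-gram vocabulary `vocab`, and the ordering of the vector entries are assumptions for the example.

```python
import re


def char_ngrams(text, n=2):
    """Character n-grams of the text with whitespace removed."""
    s = "".join(text.split())
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def extract_features(text, vocab, n=2):
    """Build a numeric feature vector for one input text.

    Layout (an assumption): one count per n-gram in vocab, followed by the
    number of characters, a special-character flag, and a digit flag.
    """
    grams = char_ngrams(text, n)
    counts = [grams.count(g) for g in vocab]
    n_chars = len(text)
    # Anything that is neither a word character nor whitespace counts as special.
    has_special = int(bool(re.search(r"[^\w\s]", text)))
    has_digit = int(any(ch.isdigit() for ch in text))
    return counts + [n_chars, has_special, has_digit]
```

Vectors of this shape can be fed to any of the model families listed in the description; the sketch deliberately stops before model training.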

Here, the machine learning model may be a decision tree, a support vector machine, a random forest, a deep learning model, or the like. The trained model may provide a plurality of promising codes for the input text together with confidence information. From these results, only the codes whose confidence is higher than a given threshold may be extracted and provided. The machine-learning-based method of the machine-learning-based conversion module 120 may thus produce zero or more code results.
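As a non-limiting illustration, the thresholding step could be sketched independently of the model that produced the scores. The function name `filter_by_threshold`, the default threshold, and the score values used in the usage note are assumptions.

```python
def filter_by_threshold(scores, threshold=0.2):
    """Keep codes whose confidence is strictly greater than threshold.

    scores: {code: confidence}, e.g. class probabilities from a trained model.
    Returns (code, confidence) pairs sorted by confidence, highest first.
    """
    kept = [(code, conf) for code, conf in scores.items() if conf > threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)
```

With hypothetical scores `{"0111": 0.7, "0112": 0.25, "0121": 0.05}` and the default threshold, only "0111" and "0112" are returned; a threshold of 0.9 returns an empty list.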

The statistics-based conversion module 130 of the code analysis unit 100 may analyze and convert codes based on statistics gathered over an arbitrary unit (for example, characters).

When statistics are gathered character by character for the text input through the text receiving unit 150, and N is the number of distinct characters in the training data, each code can be represented as a frequency vector in an N-dimensional space. For example, if the character "농" appears 10 times under code "0111", the code can be represented in a form such as [..., '농'=10, ...].

For such training data, the frequency of each unit (for example, each character) is stored per code. Then, for the text input through the text receiving unit 150, the module accumulates, unit by unit, which codes each unit makes more likely. For example, if the text "농사" (farming) is processed character by character, the code distribution for the character "농" may be obtained as ["0111": 10%, "0112": 8%, ...]; the code distribution for the character "사" is obtained in the same way, and the two are summed. The result is generated from the code distribution summed over all units. As with the other methods, this statistics-based method may apply a threshold and produce zero or more code results.
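As a non-limiting illustration, the character-level statistics could be sketched as follows. The description does not fix how the per-character distributions are normalized, so this sketch assumes that each character's counts are divided by that character's total across codes and that the summed scores are normalized to 1. The function names and the training samples in the usage note are also assumptions.

```python
from collections import Counter, defaultdict


def train_char_counts(samples):
    """samples: iterable of (text, code). Returns {code: Counter(char -> count)}."""
    counts = defaultdict(Counter)
    for text, code in samples:
        counts[code].update(ch for ch in text if not ch.isspace())
    return counts


def stat_classify(text, counts, threshold=0.0):
    """Sum per-character code distributions and return codes above threshold."""
    totals = Counter()
    for ch in (c for c in text if not c.isspace()):
        denom = sum(per_code[ch] for per_code in counts.values())
        if denom == 0:
            continue  # character never seen in the training data
        for code, per_code in counts.items():
            totals[code] += per_code[ch] / denom
    norm = sum(totals.values())
    if norm == 0:
        return []
    scored = [(code, v / norm) for code, v in totals.items() if v / norm > threshold]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

For hypothetical samples `[("벼 농사", "0111"), ("농사", "0111"), ("채소 농업", "0112")]`, the input "농사" ranks "0111" ahead of "0112", because "사" occurs only under "0111" and "농" occurs there twice as often.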

According to the present invention, the text analysis/conversion methods of the rule-based conversion module 110, the machine-learning-based conversion module 120, and the statistics-based conversion module 130 may be applied in any order to produce the analysis result. That is, the methods may be performed sequentially, or the results extracted by the individual conversion modules may be re-ranked by an arbitrary algorithm. Such a re-ranking algorithm may change the final results.

For example, in the sequential scheme, the rule-based conversion module 110, the machine-learning-based conversion module 120, and the statistics-based conversion module 130 may operate in that order, in which case the results produced by the rule-based method may be given higher priority. If a sufficient number of code results has not been produced even after all three methods have been performed, the confidence threshold may be lowered and the sequential processing repeated until enough results are produced.
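As a non-limiting illustration, the sequential scheme with threshold lowering could be sketched as follows. The module interface (a callable taking the text and a threshold), the starting threshold, the step size, and the rule that an earlier module's result wins on duplicate codes are assumptions.

```python
def run_sequential(text, modules, min_results=3, threshold=0.8, step=0.2):
    """Run the modules in order, lowering the threshold until enough codes appear.

    modules: ordered list of callables (text, threshold) -> [(code, confidence)],
    for example rule-based first, then machine learning, then statistics.
    """
    while True:
        results = {}
        for module in modules:
            for code, conf in module(text, threshold):
                # Earlier modules keep priority: do not overwrite their entries.
                results.setdefault(code, conf)
        if len(results) >= min_results or threshold <= 0:
            return list(results.items())
        # Not enough results yet: relax the threshold and run all modules again.
        threshold = max(0.0, round(threshold - step, 10))
```

The loop stops once the threshold reaches zero, so an input that can never yield `min_results` codes still returns whatever was found.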

In the arbitrary-order scheme, the rule-based conversion module 110, the machine-learning-based conversion module 120, and the statistics-based conversion module 130 may be executed in random order. The ranking of all results is then readjusted by an arbitrary algorithm, which may be defined by weights assigned to the individual methods, by priority rules among the results, or the like. Likewise, if a sufficient number of code results has not been produced even after all three methods have been performed, the confidence threshold may be lowered and the processing repeated until enough results are produced.
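As a non-limiting illustration, one possible re-ranking algorithm is a weighted sum of the confidence values reported by each method. The function name `rerank`, the weighted-sum rule, and the weights in the usage note are assumptions; the description allows any weighting or priority rule.

```python
def rerank(outputs, weights):
    """Combine per-module results into one ranking by weighted confidence.

    outputs: {module_name: [(code, confidence), ...]}
    weights: {module_name: weight}; modules without a weight default to 1.0.
    """
    combined = {}
    for name, results in outputs.items():
        w = weights.get(name, 1.0)
        for code, conf in results:
            combined[code] = combined.get(code, 0.0) + w * conf
    return sorted(combined.items(), key=lambda pair: pair[1], reverse=True)
```

Because the combination is a sum, the order in which the modules ran does not affect the final ranking, which suits the arbitrary-order scheme.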

Accordingly, performing the analysis/conversion methods of the rule-based conversion module 110, the machine learning-based conversion module 120, and the statistics-based conversion module 130 as described above makes it possible to provide results for a search input with higher accuracy. In addition, the code analysis unit 100 of the present invention may include a code system conversion module 140. When the old classification code 10 is changed, the code system conversion module 140 may newly collect data and convert it into the new classification code 20.

Here, the operation of the code system conversion module 140 may cover four cases: first, a previously defined code number is changed; second, the definition of a previously defined code is changed; third, a new code is defined; and fourth, an existing code is deleted.

Specifically, a change to an existing code number may be handled by manually or automatically correcting the corresponding codes in the data of the old code system. Next, a change to the definition of an existing code mainly consists of adding keywords or prohibited keywords. Therefore, the old-system data belonging to that code is first extracted and grouped, and is divided into different groups according to whether it matches one or more keywords. For example, let G(m) denote the set of data that matches one or more keywords and G(nm) the set of data that matches no keyword. For G(nm), a newly matching code is searched for under the code definitions of the new code system and assigned; data for which no newly matching code exists is collected as G(u). The G(m) group is matched against the 'prohibited keywords' of the new code system, and any data containing even one prohibited keyword is reassigned to another code found under the new-system code definitions. Data for which no matching code exists in the new code system may be added to the G(u) group.
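The grouping into G(m), G(nm), and G(u) for a code whose definition has changed can be sketched as below. The data layout (each code definition as a dictionary with `keywords` and `forbidden` sets), substring matching, and the function names are assumptions made for illustration; the text does not specify how matching is performed.

```python
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical layout of a new-system code definition.
CodeDef = Dict[str, Set[str]]  # {"keywords": {...}, "forbidden": {...}}


def has_any(text: str, words: Set[str]) -> bool:
    """Assumed matching rule: plain substring containment."""
    return any(w in text for w in words)


def find_new_code(
    text: str, new_system: Dict[str, CodeDef], exclude: str
) -> Optional[str]:
    """Return the first other code whose keywords match and whose
    prohibited keywords do not, or None if there is no such code."""
    for code, definition in new_system.items():
        if code == exclude:
            continue
        if has_any(text, definition["keywords"]) and not has_any(
            text, definition["forbidden"]
        ):
            return code
    return None


def migrate_changed_code(
    code: str, records: List[str], new_system: Dict[str, CodeDef]
) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Reassign old-system records of `code`; returns (assigned, G(u))."""
    definition = new_system[code]
    assigned: List[Tuple[str, str]] = []
    g_u: List[str] = []  # records with no matching code in the new system
    for text in records:
        in_g_m = has_any(text, definition["keywords"])  # G(m) vs G(nm)
        banned = has_any(text, definition["forbidden"])
        if in_g_m and not banned:
            assigned.append((text, code))  # stays under the same code
            continue
        # G(nm) records, and G(m) records with a prohibited keyword,
        # both look for another code under the new definitions.
        new_code = find_new_code(text, new_system, exclude=code)
        if new_code is None:
            g_u.append(text)
        else:
            assigned.append((text, new_code))
    return assigned, g_u


new_system_demo = {
    "C1": {"keywords": {"bakery"}, "forbidden": {"wholesale"}},
    "C2": {"keywords": {"wholesale"}, "forbidden": set()},
}
old_records = ["bakery retail", "bakery wholesale", "cafe", "wholesale goods"]
assigned_demo, g_u_demo = migrate_changed_code("C1", old_records, new_system_demo)
```

Here "bakery wholesale" belongs to G(m) but contains a prohibited keyword, so it moves to C2, while "cafe" matches nothing and ends up in G(u) for manual handling.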

Next, when a new code is defined, data may be generated arbitrarily on the basis of the matching keywords of the new code system described above, while prohibited keywords are excluded. Finally, when an existing code is deleted, all data belonging to that code may be placed in the G(u) group. After these steps are complete, the data in the G(u) group may be tagged manually by a person who searches for appropriate keywords.
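The last two cases can be illustrated with a very small sketch. How data is "generated" for a new code is not detailed in the text; the example below simply assumes that each matching keyword that is not also a prohibited keyword becomes one seed record. The function names and data layout are hypothetical.

```python
from typing import Dict, List, Set


def seed_records_for_new_code(keywords: Set[str], forbidden: Set[str]) -> List[str]:
    """Assumed generation rule: one seed record per matching keyword,
    skipping any keyword that is also a prohibited keyword."""
    return sorted(k for k in keywords if k not in forbidden)


def retire_code(code: str, data_by_code: Dict[str, List[str]], g_u: List[str]) -> None:
    """Move every record of a deleted code into G(u) for manual tagging."""
    g_u.extend(data_by_code.pop(code, []))


seeds = seed_records_for_new_code({"drone", "pilot"}, {"pilot"})

data_demo = {"X": ["record a", "record b"], "Y": ["record c"]}
g_u_pool: List[str] = []
retire_code("X", data_demo, g_u_pool)
```

After `retire_code` runs, the records of the deleted code X sit in the G(u) pool, where a person can tag them with appropriate keywords.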

(deleted)

The steps of the methods or algorithms described in connection with the embodiments presented herein may be implemented as program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the present invention, or may be known to and usable by those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules in order to perform the operations of the present invention, and vice versa.

Although the present invention has been described in detail with reference to preferred embodiments, it is not limited to those embodiments. The technical spirit of the present invention extends to the full range of variations and modifications that a person of ordinary skill in the art to which the invention belongs could make without departing from the gist of the invention as claimed in the following claims.

10: old classification code
20: new classification code
100: code analysis unit
110: rule-based conversion module
120: machine learning-based conversion module
130: statistics-based conversion module
140: code system conversion module
150: text receiving unit
160: code transmission unit
200: user terminal

Claims

An industrial occupation code classification system comprising:
a text receiving unit that receives a plurality of texts input through a wired or wireless network;
a code analysis unit that analyzes the received plurality of texts against a list of codes and provides matching codes; and
a code transmission unit that transmits the retrieved classification code and keyword corresponding to the keyword to a user terminal through a network,
wherein the code analysis unit comprises:
a rule-based conversion module that matches keywords of the text on the basis of a dictionary and provides a preset code list stored in a server;
a machine learning-based conversion module that converts/accumulates the text from the server into at least one feature vector selected from among arbitrarily set characters, words, number of characters, special characters, and numbers, and provides a code list matched therefrom; and
a statistics-based conversion module that analyzes the text according to an arbitrary frequency for each code and provides a matching code list,
wherein the machine learning-based conversion module provides, from among a plurality of code lists matched from the text by a trained model, only those code lists whose match with keywords derived from the plurality of texts is higher than a predetermined threshold,
wherein the model of the machine learning-based conversion module includes at least one of a decision tree, a support vector machine, a random forest, and deep learning,
wherein the rule-based conversion module is configured to set code priorities, search the input text for preset prohibited keywords, and assign them a lower priority, and
wherein at least one of the rule-based conversion module, the machine learning-based conversion module, and the statistics-based conversion module of the code analysis unit is operated sequentially or in random order according to the number of finally matched code lists.

(deleted)

The industrial occupation code classification system according to claim 1, further comprising a code system conversion unit that analyzes the code list matched by the code transmission unit and converts it into a new code list.

(deleted)

The industrial occupation code classification system according to claim 1, wherein the arbitrary frequency of the statistics-based conversion module is stored for each character of the input text and is displayed as a percentage (%) in descending order of matching according to the frequency.

(deleted)