KR102152900B1

KR102152900B1 - Method and apparatus for processing data with ambiguous syllable

Info

Publication number: KR102152900B1
Application number: KR1020200022503A
Authority: KR
Inventors: 황명진; 지창진
Original assignee: 주식회사 엘솔루
Priority date: 2020-02-24
Filing date: 2020-02-24
Publication date: 2020-09-07
Also published as: WO2021172786A1

Abstract

A data processing method of a data processing apparatus according to an embodiment comprises the steps of: receiving recognition target data including ambiguous syllables which can be recognized as numbers or characters; selecting a plurality of pieces of conversion candidate data from the recognition target data, by using an electronic dictionary in which input data including one or more ambiguous syllables and output data corresponding thereto are included as conversion rules; selecting at least one of a plurality of pieces of conversion candidate data as conversion target data; and replacing the selected conversion target data among the recognition target data with the output data of the electronic dictionary and outputting the replaced data. According to the present invention, the method can maximize the readability of address information.

Description

METHOD AND APPARATUS FOR PROCESSING DATA WITH AMBIGUOUS SYLLABLE

The present invention relates to a data processing method and apparatus for recognizing data that includes ambiguous syllables.

In general, writers consider the reader's readability at the time of writing: numbers are usually written in Arabic numerals, and loanwords are often written in Roman letters, Chinese characters, or similar scripts. Readability receives even more attention for ambiguous syllables, which can be recognized either as a number or as a character.

In speech recognition, however, the input is speech and the transcription is performed automatically by a machine, so the reader's readability is not taken into account. Errors can therefore occur when the speech to be recognized contains ambiguous syllables that share the same pronunciation. For example, when the pronunciation is "삼천일동" (samcheonildong), the correct conversion may be "3001동" (building 3001) or it may be "삼천1동" (Samcheon 1-dong). Likewise, the pronunciation "오이사이구요" can be converted to "52429요", "5242구요", "524이구요", "오이사이구요", and so on, but only one of these may be the correct, highly readable result.

Meanwhile, conversion techniques used in speech recognition include dictionary-based conversion, conversion by regular expressions, conversion by probabilities or rules after natural language processing analysis, and sequence-to-sequence conversion using neural networks.

In dictionary-based conversion, when a word or sentence registered in the dictionary is found in the input sentence, it is replaced with the word or sentence designated in the dictionary. The structure is simple and can be designed for high-speed conversion. However, dictionary-based conversion is normally used for 1:1 conversion of a fixed input into a fixed word or sentence, and it cannot be applied when the conversion result must be chosen according to context or when the form of the input varies.
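The limitation described above can be illustrated with a minimal Python sketch. The function name `dictionary_replace` and the single rule are hypothetical and are not taken from the patent; the sketch only shows why a fixed 1:1 rule cannot pick between two readings.

```python
def dictionary_replace(text: str, rules: dict[str, str]) -> str:
    """Replace every registered key found in text with its fixed value."""
    # Try longer keys first so a short key does not pre-empt a longer one.
    for key in sorted(rules, key=len, reverse=True):
        text = text.replace(key, rules[key])
    return text


# Hypothetical rule: the spoken form always maps to the district name.
rules = {"삼천일동": "삼천1동"}

print(dictionary_replace("완산구 삼천일동", rules))  # district context
print(dictionary_replace("아파트 삼천일동", rules))  # apartment-building context
```

Both inputs receive the same replacement, although in the second one the reading "3001동" would arguably be the more readable result. A plain 1:1 dictionary has no way to express that distinction.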

Conversion by regular expressions is conditional pattern conversion: for example, if a numeral form follows "아파트" (apartment), that numeral is converted to Arabic numerals. This approach is slow, and it is labor-intensive because a person must write every rule by hand. Accuracy can be raised with more effort, but the speed drops correspondingly.
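As a rough illustration of such a hand-written rule, the following Python sketch converts a Sino-Korean numeral that follows "아파트" and precedes "동". The pattern, the helper `sino_to_int`, and the restriction to values below ten thousand are assumptions made for this example; digit-by-digit readings such as "일이삼" are not handled.

```python
import re

DIGITS = {"일": 1, "이": 2, "삼": 3, "사": 4, "오": 5,
          "육": 6, "칠": 7, "팔": 8, "구": 9}
UNITS = {"천": 1000, "백": 100, "십": 10}


def sino_to_int(numeral: str) -> int:
    """Convert a positional Sino-Korean numeral (below 10000) to an int."""
    total, current = 0, 0
    for ch in numeral:
        if ch in DIGITS:
            current = DIGITS[ch]
        else:
            # A bare unit such as "백" means one hundred.
            total += (current or 1) * UNITS[ch]
            current = 0
    return total + current


# Hypothetical rule: "아파트" + numeral + "동" -> Arabic numerals.
PATTERN = re.compile(r"(아파트\s*)([일이삼사오육칠팔구십백천]+)(?=동)")


def apply_rule(text: str) -> str:
    return PATTERN.sub(lambda m: m.group(1) + str(sino_to_int(m.group(2))), text)


print(apply_rule("아파트 삼천일동"))  # numeral after "아파트" is converted
print(apply_rule("완산구 삼천일동"))  # no "아파트" prefix, left unchanged
```

Every further context (street names, lot numbers, phone numbers) would need its own pattern, which is the labor cost the paragraph above refers to.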

Conversion by natural language processing determines the ambiguous parts through linguistic analysis, identifies every candidate into which an ambiguous part could be converted, and then selects the most likely candidate and substitutes it. The most likely candidate is selected by rules, probabilities, patterns, and the like, and a conversion optimized for the intended use requires integration with related data. The speed, accuracy, and use-case optimization of this method depend on the underlying natural language processing module. Because such a module cannot be developed in a short time, a previously developed module is used, and in that case it is difficult to achieve speed, accuracy, and use-case optimization all at once.

Sequence-to-sequence conversion uses a neural network to convert one string into another. It is implemented by generating a specific output for a specific input with a conversion model trained on a large set of input-output sentence pairs. Because the process that produces the result is closer to generation than to conversion, results unrelated to the input can occur. In addition, because there is little room for a person to control the training and conversion processes in detail, incorrect results are difficult to control or correct, and the method is also slow.

Korean Patent Registration No. 10-1964514 (registered March 26, 2019)

According to an embodiment, there are provided a data processing method and apparatus that mechanically convert recognition target data, which includes an ambiguous syllable that may be recognized as either a number or a character, so that the result has high readability.

The problems to be solved by the present invention are not limited to those mentioned above, and other problems not mentioned will be clearly understood by those of ordinary skill in the art to which the present invention pertains from the following description.

A data processing method of a data processing apparatus according to a first aspect includes: receiving recognition target data that includes an ambiguous syllable which may be recognized as either a number or a character; selecting a plurality of conversion candidate data pieces from the recognition target data by using an electronic dictionary that contains, as conversion rules, input data including one or more ambiguous syllables and output data corresponding to the input data; selecting at least one of the plurality of conversion candidate data pieces as conversion target data; and replacing the selected conversion target data within the recognition target data with the output data of the electronic dictionary and outputting the result.

A data processing apparatus according to a second aspect includes: an input unit that receives recognition target data including an ambiguous syllable which may be recognized as either a number or a character; a processing unit that processes the recognition target data received through the input unit; and an output unit that outputs the result of processing by the processing unit. The processing unit selects a plurality of conversion candidate data pieces from the recognition target data by using an electronic dictionary that contains, as conversion rules, input data including one or more ambiguous syllables and output data corresponding to the input data, selects at least one of the plurality of conversion candidate data pieces as conversion target data, replaces the selected conversion target data within the recognition target data with the output data of the electronic dictionary, and provides the result to the output unit.

A computer-readable recording medium storing a computer program according to a third aspect is such that the computer program includes instructions which, when executed by a processor, cause the processor to perform a method including: receiving recognition target data that includes an ambiguous syllable which may be recognized as either a number or a character; selecting a plurality of conversion candidate data pieces from the recognition target data by using an electronic dictionary that contains, as conversion rules, input data including one or more ambiguous syllables and output data corresponding to the input data; selecting at least one of the plurality of conversion candidate data pieces as conversion target data; and replacing the selected conversion target data within the recognition target data with the output data of the electronic dictionary and outputting the result.

A computer program stored in a computer-readable recording medium according to a fourth aspect includes instructions which, when executed by a processor, cause the processor to perform a method including: receiving recognition target data that includes an ambiguous syllable which may be recognized as either a number or a character; selecting a plurality of conversion candidate data pieces from the recognition target data by using an electronic dictionary that contains, as conversion rules, input data including one or more ambiguous syllables and output data corresponding to the input data; selecting at least one of the plurality of conversion candidate data pieces as conversion target data; and replacing the selected conversion target data within the recognition target data with the output data of the electronic dictionary and outputting the result.

According to an embodiment, recognition target data that includes an ambiguous syllable which may be recognized as either a number or a character can be mechanically converted and output so that it has high readability. For example, when the recognition target data is address information, the parts that are more readable as characters and the parts that are more readable as numbers are distinguished and converted accordingly, which maximizes the reader's readability of the address information provided as the recognition result.

In addition, when the recognition target data contains whitespace, the whitespace is removed during processing such as data conversion to ensure flexible and fast processing, while the necessary whitespace is restored in the final result so that readability is not degraded by changes in whitespace.

FIG. 1 is a block diagram of a data processing apparatus according to an embodiment.
FIG. 2 is a flowchart illustrating a data processing method according to an embodiment.
FIG. 3 is a flowchart illustrating the process of selecting conversion target data in the data processing method according to an embodiment.

The advantages and features of the present invention, and the methods of achieving them, will become apparent from the embodiments described below together with the accompanying drawings. However, the present invention is not limited to the embodiments disclosed below and may be implemented in various different forms. The embodiments are provided only so that the disclosure of the present invention is complete and so that those of ordinary skill in the art to which the present invention pertains are fully informed of the scope of the invention, and the present invention is defined only by the scope of the claims.

The terms used in this specification are briefly explained, and then the present invention is described in detail.

The terms used in the present invention are, as far as possible, general terms that are currently in wide use, selected in consideration of their functions in the present invention, but they may vary according to the intention of those skilled in the art, precedent, the emergence of new technologies, and the like. In certain cases a term has been arbitrarily chosen by the applicant, and in that case its meaning is described in detail in the corresponding part of the description. Therefore, the terms used in the present invention should be defined based on the meaning of each term and the overall content of the present invention, not simply on the name of the term.

Throughout the specification, when a part is said to "include" a component, this means that it may further include other components rather than excluding them, unless specifically stated otherwise.

In addition, the term "unit" as used in the specification refers to software or to a hardware component such as an FPGA or ASIC, and a "unit" performs certain roles. However, a "unit" is not limited to software or hardware. A "unit" may be configured to reside in an addressable storage medium or to run on one or more processors. Thus, as an example, a "unit" includes components such as software components, object-oriented software components, class components, and task components, as well as processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuitry, data, databases, data structures, tables, arrays, and variables. The functionality provided within components and "units" may be combined into a smaller number of components and "units" or further separated into additional components and "units".

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the present invention pertains can easily practice them. Parts unrelated to the description are omitted from the drawings in order to describe the present invention clearly.

FIG. 1 is a block diagram of a data processing apparatus according to an embodiment.

Referring to FIG. 1, the data processing apparatus 100 includes an input unit 110, a processing unit 120, and an output unit 130, and may further include a storage unit 140. The processing unit 120 may include a data piece selection unit 121, a conversion target selection unit 122, a data conversion unit 123, and a whitespace removal/restoration unit 124.

The input unit 110 receives recognition target data and provides it to the processing unit 120. The recognition target data received by the input unit 110 may include an ambiguous syllable that may be recognized as either a number or a character.

The processing unit 120 processes the recognition target data received through the input unit 110.

The data piece selection unit 121 of the processing unit 120 selects a plurality of conversion candidate data pieces from the recognition target data by using an electronic dictionary that contains, as conversion rules, input data including one or more ambiguous syllables and output data corresponding to the input data. The electronic dictionary used by the data piece selection unit 121 may be stored in advance in the storage unit 140 or in an internal memory (not shown) of the processing unit 120.

The conversion target selection unit 122 of the processing unit 120 selects at least one of the plurality of conversion candidate data pieces as conversion target data. The data piece selection unit 121, the conversion target selection unit 122, the data conversion unit 123, and the whitespace removal/restoration unit 124 that constitute the processing unit 120 may include computing means such as a microprocessor.

The data conversion unit 123 of the processing unit 120 replaces the conversion target data selected by the conversion target selection unit 122 within the recognition target data with the output data of the electronic dictionary and provides the result to the output unit 130.

The whitespace removal/restoration unit 124 of the processing unit 120 controls the whitespace of the recognition target data so that the recognition target data is presented to the electronic dictionary with its whitespace removed, and it restores the removed whitespace before the result is provided to the output unit 130. Here, the whitespace removal/restoration unit 124 may perform whitespace restoration on the portions of the recognition target data that have not been replaced with output data of the electronic dictionary.
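One way to realize this remove-then-restore behavior is to remember, for every character that survives whitespace removal, its index in the original string. The Python sketch below is an illustration under that assumption; `strip_spaces`, `apply_and_restore`, and the replacement tuple format are hypothetical names and conventions, not taken from the patent.

```python
def strip_spaces(text: str) -> tuple[str, list[int]]:
    """Return the whitespace-free text and, per kept character, its original index."""
    kept, origin = [], []
    for i, ch in enumerate(text):
        if not ch.isspace():
            kept.append(ch)
            origin.append(i)
    return "".join(kept), origin


def apply_and_restore(text: str, replacements: list[tuple[int, int, str]]) -> str:
    """Apply replacements given in whitespace-free coordinates.

    Each replacement is (start, end, output) with end exclusive; the list
    must be sorted and non-overlapping. Text outside the replaced spans is
    copied from the original, so its whitespace is preserved.
    """
    _, origin = strip_spaces(text)
    out, cursor = [], 0  # cursor indexes the original text
    for start, end, output in replacements:
        o_start = origin[start]
        o_end = origin[end - 1] + 1
        out.append(text[cursor:o_start])  # untouched part, spaces intact
        out.append(output)
        cursor = o_end
    out.append(text[cursor:])
    return "".join(out)


text = "주소는 삼천일동 천사번지예요"
# Characters 3..8 of the whitespace-free text spell "삼천일동천사".
print(apply_and_restore(text, [(3, 9, "삼천1동 1004")]))
```

The replaced span takes its spacing from the dictionary output, while the rest of the sentence keeps the spacing of the original input.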

The output unit 130 outputs the result of processing by the processing unit 120. For example, the output unit 130 may include an output interface and, under the control of the processing unit 120, output the converted data provided by the processing unit 120 to another electronic device connected to the output interface. Alternatively, the output unit 130 may include a network card and, under the control of the processing unit 120, transmit the converted data provided by the processing unit 120 over a network. Alternatively, the output unit 130 may include a display device capable of showing the result of processing by the processing unit 120 on a screen.

The storage unit 140 may store the electronic dictionary that contains, as conversion rules, input data including one or more ambiguous syllables and output data corresponding to the input data, and it may also store the result of processing by the processing unit 120. For example, the storage unit 140 may be a computer-readable recording medium such as magnetic media (hard disks, floppy disks, and magnetic tape), optical media (CD-ROM and DVD), magneto-optical media (floptical disks), or a hardware device specially configured to store and execute program instructions, such as flash memory.

FIG. 2 is a flowchart illustrating a data processing method according to an embodiment, and FIG. 3 is a flowchart illustrating the process of selecting conversion target data in the data processing method according to an embodiment.

Hereinafter, the data processing method performed by the data processing apparatus 100 according to an embodiment of the present invention is described in detail with reference to FIGS. 1 to 3.

First, an electronic dictionary that contains, as conversion rules, input data including one or more ambiguous syllables and output data corresponding to the input data must be created, and the created electronic dictionary must be stored in advance in the storage unit 140 or in the internal memory (not shown) of the processing unit 120. The electronic dictionary may be created so that it serves a specific function suited to the use or purpose of the data processing apparatus 100. The electronic dictionary may include, as conversion rules, a plurality of data pairs of input data and output data obtained through machine learning.

For example, when the data processing apparatus 100 is used for geographic address conversion, address data that can be expressed in various forms may be used as output data, and the pronunciation of that output data may be used as the corresponding input data. For instance, "완산구 삼천1동", "삼천1동 성지산로 33", "삼천1동 573-2", "삼천1동주민센터", "삼천1동 거마공원", "3001동", "아파트 3001동", and the like may be used as output data of the electronic dictionary, and the pronunciation of each output may be used as the corresponding input data. In addition, once the output data of a conversion rule to be registered has been determined, a plurality of input data may be generated from it. For example, when the output data of the rule is "삼천1동 1004", possible input data such as "삼천일동 천사" and "삼천일동 일공공사" may be generated. Or, when the output data of the rule is "010-1234-5242", possible input data such as "공일공 일이삼사에 오이사이", "공일공 하나둘삼사 다시 오둘사둘", and "공일공 천이백삼십사에 오천이백사십이" may be generated. The "input data → output data" pairs of the conversion rules registered in the electronic dictionary may thus include "삼천일동 천사 → 삼천1동 1004", "삼천일동 일공공사 → 삼천1동 1004", "공일공 일이삼사에 오이사이 → 010-1234-5242", and the like.

Furthermore, input data that contains whitespace can take relatively more variant forms than input data without whitespace, which is a disadvantage in terms of the amount of computation. Accordingly, the whitespace may be removed from the input data of a conversion rule before it is registered in the electronic dictionary.
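The registration step can be sketched in Python as a mapping from whitespace-free input strings to output strings. The function name `build_dictionary` and the use of a plain `dict` are assumptions for illustration; the pairs are the ones quoted in the text above.

```python
def build_dictionary(pairs: list[tuple[str, str]]) -> dict[str, str]:
    """Register (input, output) conversion rules with whitespace removed from the input."""
    dictionary = {}
    for spoken, written in pairs:
        key = "".join(spoken.split())  # drop all whitespace from the input side
        dictionary[key] = written
    return dictionary


pairs = [
    ("삼천일동 천사", "삼천1동 1004"),
    ("삼천일동 일공공사", "삼천1동 1004"),
    ("공일공 일이삼사에 오이사이", "010-1234-5242"),
]
dictionary = build_dictionary(pairs)
print(dictionary["삼천일동천사"])
```

Several spoken variants map to the same written form, and removing whitespace means that differently spaced utterances of one variant share a single key.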

With the electronic dictionary created as described above stored in the storage unit 140 or in the internal memory (not shown) of the processing unit 120, recognition target data may be input to the input unit 110 of the data processing apparatus 100, and the recognition target data may include an ambiguous syllable that may be recognized as either a number or a character. For example, "주소는 삼천일동 천사번지예요 완산구 삼천일동요" (an utterance giving an address) or "제 번호요 공일공 일이 삼사에 오이사이구요" (an utterance giving a phone number) may be input as recognition target data (S210).

Then, when whitespace has been removed from the input data of the conversion rules registered in the previously created electronic dictionary as described above, the whitespace removal/restoration unit 124 of the processing unit 120 may likewise remove the whitespace from the recognition target data input in step S210 and then provide the data to the data piece selection unit 121 of the processing unit 120 (S220).

Next, the data piece selection unit 121 selects a plurality of conversion candidate data pieces from the recognition target data provided through steps S210 and S220, using the electronic dictionary that contains, as conversion rules, input data including one or more ambiguous syllables and output data corresponding to the input data (S230). Selecting conversion candidate data pieces can be viewed as looking up every string fragment that can be formed from the recognition target data. For example, when the recognition target data contains "오 이사구", the fragments "오", "오 이", "오 이사", "오 이사구", "이", "이사", "이사구", "사", "사구", and "구" may all be selected as conversion candidate data pieces. Constraints may be placed on the boundaries of the fragments. For example, if a fragment may not start in the middle of a word, then "사", "사구", and "구" cannot be conversion candidate data pieces. Other possible constraints are that a fragment may not end in the middle of a word, or that a fragment may not start or end with whitespace. Using, for example, a trie data structure and algorithm for the fragment search can reduce the time complexity.
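The fragment enumeration and the word-start constraint from the "오 이사구" example can be sketched as follows. The function `fragments` and its flag are hypothetical; this brute-force version only illustrates which fragments exist, and a practical implementation would look fragments up in the dictionary rather than list them all.

```python
def fragments(text: str, start_at_word_boundary: bool = False) -> list[str]:
    """List every fragment of text that neither starts nor ends with whitespace.

    With start_at_word_boundary=True, a fragment may only start at the
    first character of a whitespace-delimited word.
    """
    result = []
    for i in range(len(text)):
        if text[i].isspace():
            continue
        if start_at_word_boundary and i > 0 and not text[i - 1].isspace():
            continue
        for j in range(i + 1, len(text) + 1):
            if text[j - 1].isspace():
                continue
            result.append(text[i:j])
    return result


print(fragments("오 이사구"))
print(fragments("오 이사구", start_at_word_boundary=True))
```

The unconstrained call yields the ten fragments listed in the text, and the constrained call drops "사", "사구", and "구" because they start in the middle of the word "이사구".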

Then, the conversion target selection unit 122 of the processing unit 120 selects one or more of the plurality of conversion candidate data pieces selected in step S230 as conversion target data (S240).

The selection performed in step S240 on the conversion candidate data pieces is shown in more detail in the flowchart of FIG. 3. Before step S250 is described, this selection process is examined more closely with reference to FIG. 3.

First, the data piece selection unit 121 selects conversion candidate data pieces from the recognition target data based on the input data of the conversion rules registered in the electronic dictionary. For example, if the recognition target data is "주소는 삼천일동 천사번지예요 완산구 삼천일동요", then "[0]주소는 삼천일동", "[4]삼천일동", "[4]삼천일동 천사번지", "[9]천사번지", "[16]완산구 삼천일동", "[20]삼천일동", and so on may be selected as conversion candidate data pieces. That is, the selected conversion candidate data pieces are strings registered as input data in the electronic dictionary. The value in "[]" is the position of the piece within the recognition target data.
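As a rough sketch of how a trie could drive this lookup, the code below scans the example sentence and reports every registered input string together with its position. The dictionary contents are an assumption taken from the example pieces listed above, and the names `build_trie`, `find_candidates`, and `END` are hypothetical.

```python
END = ""  # sentinel key marking the end of a registered input string


def build_trie(keys):
    """Build a nested-dict trie from the dictionary's input strings."""
    root = {}
    for key in keys:
        node = root
        for ch in key:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def find_candidates(text, trie):
    """Return (position, piece) for every registered string found in `text`."""
    found = []
    for start in range(len(text)):
        node = trie
        for end in range(start, len(text)):
            node = node.get(text[end])
            if node is None:
                break  # no registered string continues with this character
            if END in node:
                found.append((start, text[start:end + 1]))
    return found


# Assumed dictionary inputs, taken from the example in the text.
inputs = ["주소는 삼천일동", "삼천일동", "삼천일동 천사번지", "천사번지", "완산구 삼천일동"]
text = "주소는 삼천일동 천사번지예요 완산구 삼천일동요"
for position, piece in find_candidates(text, build_trie(inputs)):
    print(f"[{position}]{piece}")
```

Each start position walks the trie at most as deep as the longest registered input, so the scan avoids testing every substring against the whole dictionary.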

Next, the conversion target selection unit 122 selects one of the plurality of conversion candidate data pieces as conversion target data. Here, the longest of the conversion candidate data pieces may be selected first. For example, "[4]삼천일동 천사번지" may be selected as conversion target data (S310).

Then, the conversion target selection unit 122 identifies, among the plurality of conversion candidate data pieces, those whose position in the recognition target data at least partially overlaps the conversion target data of step S310. For example, "[0]주소는 삼천일동", "[4]삼천일동", and "[9]천사번지" may be identified as overlapping conversion candidate data pieces (S320).

Next, the conversion target selection unit 122 excludes the overlapping conversion candidate data pieces identified in step S320 from the plurality of conversion candidate data pieces. For example, "[0]주소는 삼천일동", "[4]삼천일동", and "[9]천사번지" are removed, so that only "[16]완산구 삼천일동" and "[20]삼천일동" remain as conversion candidate data pieces (S330).

If conversion candidate data pieces still remain after the overlapping pieces have been excluded in step S330, the conversion target selection unit 122 repeats the process from step S310, in which one of the conversion candidate data pieces is selected as conversion target data. That is, the process repeats until step S330 leaves none of the conversion candidate data pieces of step S230 (S340).
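The loop of steps S310 to S340 can be sketched as a greedy, longest-first selection. This is an illustrative reading of the flowchart rather than the authors' code: pieces are modeled as hypothetical `(position, text)` tuples, and ties in length are broken by list order, which the text does not specify.

```python
def overlaps(a, b):
    """True if two (position, text) pieces share at least one character position."""
    a_start, a_end = a[0], a[0] + len(a[1])
    b_start, b_end = b[0], b[0] + len(b[1])
    return a_start < b_end and b_start < a_end


def select_targets(candidates):
    """Pick the longest piece, drop everything overlapping it, and repeat."""
    remaining = list(candidates)
    targets = []
    while remaining:  # S340: repeat while candidates remain
        target = max(remaining, key=lambda c: len(c[1]))  # S310: longest first
        targets.append(target)
        # S320/S330: exclude the target itself and every overlapping piece.
        remaining = [c for c in remaining if not overlaps(c, target)]
    return targets


candidates = [
    (0, "주소는 삼천일동"),
    (4, "삼천일동"),
    (4, "삼천일동 천사번지"),
    (9, "천사번지"),
    (16, "완산구 삼천일동"),
    (20, "삼천일동"),
]
print(select_targets(candidates))
# [(4, '삼천일동 천사번지'), (16, '완산구 삼천일동')]
```

The first pass selects "[4]삼천일동 천사번지" and leaves only the pieces at positions 16 and 20; the second pass selects "[16]완산구 삼천일동", which removes the last remaining piece.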

Referring again to FIG. 2, the data conversion unit 123 of the processing unit 120 replaces the conversion target data selected in step S240 within the recognition target data with the corresponding output data of the conversion rule registered in the electronic dictionary, and outputs the result. For example, if the recognition target data is "주소는 삼천일동 천사번지예요 완산구 삼천일동요" and "[4]삼천일동 천사번지" was selected as conversion target data in step S240, then "[4]삼천일동 천사번지" in the recognition target data may be replaced with "삼천1동 1004번지". Likewise, if the recognition target data is "제 번호요 공일공 일이삼사에 오이사이구요" and "[7]공일공 일이삼사에 오이사이" was selected as conversion target data in step S240, then "[7]공일공 일이삼사에 오이사이" in the recognition target data may be replaced with "010-1234-5242" (S250).
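A minimal sketch of the replacement in step S250, using the address example from the text. The function name `apply_replacements` and the representation of the dictionary as a plain mapping are assumptions made for illustration.

```python
def apply_replacements(text, targets, rules):
    """Replace each selected (position, piece) with its dictionary output.

    Targets are processed from right to left so that the positions of the
    pieces still to be replaced are not shifted by earlier replacements.
    """
    result = text
    for start, piece in sorted(targets, reverse=True):
        end = start + len(piece)
        if result[start:end] != piece:
            raise ValueError(f"piece {piece!r} not found at position {start}")
        result = result[:start] + rules[piece] + result[end:]
    return result


# Conversion rule taken from the example in the text.
rules = {"삼천일동 천사번지": "삼천1동 1004번지"}
text = "주소는 삼천일동 천사번지예요 완산구 삼천일동요"
print(apply_replacements(text, [(4, "삼천일동 천사번지")], rules))
# 주소는 삼천1동 1004번지예요 완산구 삼천일동요
```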

In addition, if spaces were removed from the recognition target data in step S220, the blank removal/restoration unit 124 may restore, in the result of step S250, the spaces removed in step S220. For example, the spaces removed in step S220 may be restored in the portions of the recognition target data that were not replaced in step S250, that is, the portions that are unchanged when the replacement result of step S250 is compared with the original recognition target data input in step S210. This space restoration process is described again below (S260).

The data converted by the processing unit 120 as described above may be output through the output unit 130 and may also be stored in the storage unit 140. For example, under the control of the processing unit 120, the output unit 130 may output the data converted through steps S250 and S260 to another electronic device connected to an output interface, transmit it over a network, or display it on the screen of a display device (S270).

The space removal process of step S220 and the space restoration process of step S260 are now described in more detail.

To restore in step S260 the spaces removed in step S220, the blank removal/restoration unit 124 must store the space removal information produced in step S220.

For example, if the recognition target data input in step S210 is "제 번호요 공일공 일이 삼사 에 오이 사이구요", the space removal information may be stored as in the example of Table 1 below.

Table 1
Character:  제 번 호 요 공 일 공 일 이 삼 사 에 오 이 사 이 구 요
Space info: XO OX XX XO OX XX XO OX XO OX XO OO OX XO OX XX XX XX

For example, as in Table 1, the space removal information may be stored for each character as one of "XO", "XX", "OX", and "OO". The first symbol describes the position before the character and the second the position after it: "X" means there is no space and "O" means there is a space. That is, "XO" means there is a space only after the character, "XX" means there is no space before or after it, "OX" means there is a space only before it, and "OO" means there is a space both before and after it.
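The space removal of step S220, together with the per-character information it must record, might look like the sketch below. The function name `remove_spaces` and the list-of-strings representation are hypothetical; only the "X"/"O" encoding comes from the text.

```python
def remove_spaces(text):
    """Strip spaces from `text` and record, per remaining character,
    whether a space stood before it and after it ("O") or not ("X")."""
    stripped = []
    info = []
    for i, ch in enumerate(text):
        if ch == " ":
            continue
        before = "O" if i > 0 and text[i - 1] == " " else "X"
        after = "O" if i + 1 < len(text) and text[i + 1] == " " else "X"
        stripped.append(ch)
        info.append(before + after)
    return "".join(stripped), info


stripped, info = remove_spaces("제 번호요 공일공 일이 삼사 에 오이 사이구요")
print(stripped)  # 제번호요공일공일이삼사에오이사이구요
for ch, flags in zip(stripped, info):
    print(ch, flags)
```

Recording both sides of every character is redundant, since a space after one character is also the space before the next, but it lets the restoration step work even when one of the two neighbors has been replaced.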

In addition, before restoring the spaces in step S260, the blank removal/restoration unit 124 needs to identify the portion replaced in step S250.

For example, to identify the portion replaced in step S250, the blank removal/restoration unit 124 may use the Levenshtein algorithm or the like to compare the space-stripped recognition target data with the converted data to which the replacement has been applied, and thereby determine which positions were changed. For example, if the space-stripped recognition target data is "제번호요공일공일이삼사에오이사이구요" and the converted data is "제번호요010-1234-5242구요", the replaced portion can be identified as in the example of Table 2 below.

Table 2
Character: 제 번 호 요 0 1 0 - 1 2 3 4 - 5 2 4 2 구 요
Edit mark: =  =  =  =  R R R R R R R R R R R R R =  =

As in the example of Table 2, portions that were not replaced can be marked with "=" and portions that were replaced can be marked with "R".
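The "=" and "R" marking can be sketched with Python's standard `difflib.SequenceMatcher`, used here as a stand-in for the Levenshtein alignment mentioned in the text; it is not the same algorithm, but it yields the same marks for this example. The function name `mark_changes` is hypothetical.

```python
from difflib import SequenceMatcher


def mark_changes(stripped, converted):
    """Return one mark per character of `converted`:
    "=" if it was carried over unchanged, "R" if it belongs to a replaced span."""
    marks = []
    matcher = SequenceMatcher(None, stripped, converted, autojunk=False)
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        # "equal" blocks are unchanged; "replace" and "insert" are marked "R".
        # "delete" blocks contribute nothing because j1 == j2.
        marks.append(("=" if tag == "equal" else "R") * (j2 - j1))
    return "".join(marks)


stripped = "제번호요공일공일이삼사에오이사이구요"
converted = "제번호요010-1234-5242구요"
print(mark_changes(stripped, converted))
# ====RRRRRRRRRRRRR==
```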

Then, in step S260, the blank removal/restoration unit 124 combines Table 1 and Table 2 as in the example of Table 3 below and restores spaces at the positions marked "O" within the portions marked "=", so that "제 번호요 010-1234-5242구요" can finally be output as the converted data.

Table 3
Character:  제 번 호 요 | 0 1 0 - 1 2 3 4 - 5 2 4 2 | 구 요
Edit mark:  =  =  =  =  | (replaced)                | =  =
Space info: XO OX XX XO | (not applied)             | XX XX
Result: 제 번호요 010-1234-5242구요
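Putting the pieces together, the restoration of step S260 could be sketched as below. This is an assumption-laden illustration: `difflib.SequenceMatcher` again stands in for the Levenshtein comparison, and the names `remove_spaces` and `restore_spaces` are hypothetical.

```python
from difflib import SequenceMatcher


def remove_spaces(text):
    """Strip spaces and record before/after space flags ("O" or "X") per character."""
    chars, info = [], []
    for i, ch in enumerate(text):
        if ch != " ":
            before = "O" if i > 0 and text[i - 1] == " " else "X"
            after = "O" if i + 1 < len(text) and text[i + 1] == " " else "X"
            chars.append(ch)
            info.append(before + after)
    return "".join(chars), info


def restore_spaces(original, converted):
    """Re-insert the original spaces into the unchanged parts of `converted`."""
    stripped, info = remove_spaces(original)
    result = ""
    matcher = SequenceMatcher(None, stripped, converted, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            result += converted[j1:j2]  # replaced span: copied as is
            continue
        for k in range(i1, i2):  # unchanged span: apply the recorded flags
            before, after = info[k]
            if before == "O" and result and not result.endswith(" "):
                result += " "
            result += stripped[k]
            if after == "O":
                result += " "
    return result


original = "제 번호요 공일공 일이 삼사 에 오이 사이구요"
converted = "제번호요010-1234-5242구요"
print(restore_spaces(original, converted))
# 제 번호요 010-1234-5242구요
```

The space between "요" and "010" comes from the "after" flag of the unchanged "요"; the flags of the replaced characters are ignored, which is why no spaces appear inside the phone number.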

Meanwhile, each step included in the data processing method according to the embodiment described above may be implemented in a computer-readable recording medium storing a computer program that includes instructions for performing the step.

In addition, each step included in the data processing method according to the embodiment described above may be implemented in the form of a computer program, stored in a computer-readable recording medium, that is programmed to include instructions for performing the step.

Combinations of the steps of each flowchart attached to the present invention may be performed by computer program instructions. Because these computer program instructions may be loaded onto a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing equipment, the instructions executed by the processor of the computer or other programmable data processing equipment create means for performing the functions described in each step of the flowchart. These computer program instructions may also be stored in a computer-usable or computer-readable recording medium that can direct a computer or other programmable data processing equipment to implement a function in a particular manner, so that the instructions stored in the computer-usable or computer-readable recording medium can produce an article of manufacture containing instruction means that perform the functions described in each step of the flowchart. The computer program instructions may also be loaded onto a computer or other programmable data processing equipment, so that a series of operational steps is performed on the computer or other programmable data processing equipment to create a computer-executed process, and the instructions that run on the computer or other programmable data processing equipment can thereby provide steps for executing the functions described in each step of the flowchart.

In addition, each step may represent a module, segment, or portion of code that includes one or more executable instructions for carrying out the specified logical function(s). It should also be noted that in some alternative embodiments the functions mentioned in the steps may occur out of order. For example, two steps shown in succession may in fact be performed substantially simultaneously, or the steps may sometimes be performed in reverse order, depending on the functions involved.

The above description merely illustrates the technical idea of the present invention, and a person of ordinary skill in the art to which the present invention pertains will be able to make various modifications and variations without departing from the essential characteristics of the present invention. Accordingly, the embodiments disclosed herein are intended to describe, not to limit, the technical idea of the present invention, and the scope of the technical idea is not limited by these embodiments. The scope of protection of the present invention should be construed according to the claims below, and all technical ideas within a scope equivalent thereto should be construed as falling within the scope of rights of the present invention.

100: data processing device 110: input unit
120: processing unit 121: data piece selection unit
122: conversion target selection unit 123: data conversion unit
124: blank removal/restoration unit 130: output unit
140: storage unit

Claims

(Deleted)

A data processing method of a data processing device, the method comprising:
receiving recognition target data including an ambiguous syllable that can be recognized as either a number or a character;
selecting a plurality of conversion candidate data pieces from the recognition target data by using an electronic dictionary in which input data including one or more ambiguous syllables and output data corresponding thereto are included as conversion rules;
selecting one or more of the plurality of conversion candidate data pieces as conversion target data; and
replacing the selected conversion target data in the recognition target data with the output data of the electronic dictionary and outputting the result,
wherein the selecting as the conversion target data comprises:
selecting any one of the plurality of conversion candidate data pieces as the conversion target data;
identifying, among the plurality of conversion candidate data pieces, an overlapping conversion candidate data piece at least part of which positionally overlaps the conversion target data on the recognition target data; and
excluding the overlapping conversion candidate data piece from the plurality of conversion candidate data pieces.

The method of claim 2,
wherein the longest of the plurality of conversion candidate data pieces is selected as the conversion target data.

The method of claim 2,
wherein, if a conversion candidate data piece remains after the overlapping conversion candidate data piece is excluded from the plurality of conversion candidate data pieces, the method is performed again from the step of selecting any one of the plurality of conversion candidate data pieces as the conversion target data.

A data processing method of a data processing device, the method comprising:
receiving recognition target data including an ambiguous syllable that can be recognized as either a number or a character;
selecting a plurality of conversion candidate data pieces from the recognition target data by using an electronic dictionary in which input data including one or more ambiguous syllables and output data corresponding thereto are included as conversion rules;
selecting one or more of the plurality of conversion candidate data pieces as conversion target data; and
replacing the selected conversion target data in the recognition target data with the output data of the electronic dictionary and outputting the result,
wherein the input data of the electronic dictionary contains no spaces,
wherein, in the selecting of the plurality of conversion candidate data pieces, spaces in the recognition target data are removed before the data is looked up in the electronic dictionary to select the plurality of conversion candidate data pieces, and
wherein the outputting includes restoring the removed spaces in the recognition target data.

The method of claim 5,
wherein the restoring of the removed spaces is performed on the portion of the recognition target data that has not been replaced with the output data of the electronic dictionary.

A data processing device comprising:
an input unit that receives recognition target data including an ambiguous syllable that can be recognized as either a number or a character;
a processing unit that processes the recognition target data input through the input unit; and
an output unit that outputs a result of the processing by the processing unit,
wherein the processing unit
selects a plurality of conversion candidate data pieces from the recognition target data by using an electronic dictionary in which input data including one or more ambiguous syllables and output data corresponding thereto are included as conversion rules, selects at least one of the plurality of conversion candidate data pieces as conversion target data, replaces the selected conversion target data in the recognition target data with the output data of the electronic dictionary, and provides the result to the output unit, and
wherein, when selecting the conversion target data, the processing unit
selects any one of the plurality of conversion candidate data pieces as the conversion target data, identifies, among the plurality of conversion candidate data pieces, an overlapping conversion candidate data piece at least part of which positionally overlaps the conversion target data on the recognition target data, and excludes the overlapping conversion candidate data piece from the plurality of conversion candidate data pieces.

A computer-readable recording medium storing a computer program,
wherein the computer program comprises instructions that, when executed by a processor, cause the processor to perform a method comprising:
receiving recognition target data including an ambiguous syllable that can be recognized as either a number or a character;
selecting a plurality of conversion candidate data pieces from the recognition target data by using an electronic dictionary in which input data including one or more ambiguous syllables and output data corresponding thereto are included as conversion rules;
selecting one or more of the plurality of conversion candidate data pieces as conversion target data; and
replacing the selected conversion target data in the recognition target data with the output data of the electronic dictionary and outputting the result,
wherein the selecting as the conversion target data comprises:
selecting any one of the plurality of conversion candidate data pieces as the conversion target data;
identifying, among the plurality of conversion candidate data pieces, an overlapping conversion candidate data piece at least part of which positionally overlaps the conversion target data on the recognition target data; and
excluding the overlapping conversion candidate data piece from the plurality of conversion candidate data pieces.

A computer program stored in a computer-readable recording medium,
wherein the computer program comprises instructions that, when executed by a processor, cause the processor to perform a method comprising:
receiving recognition target data including an ambiguous syllable that can be recognized as either a number or a character;
selecting a plurality of conversion candidate data pieces from the recognition target data by using an electronic dictionary in which input data including one or more ambiguous syllables and output data corresponding thereto are included as conversion rules;
selecting one or more of the plurality of conversion candidate data pieces as conversion target data; and
replacing the selected conversion target data in the recognition target data with the output data of the electronic dictionary and outputting the result,
wherein the selecting as the conversion target data comprises:
selecting any one of the plurality of conversion candidate data pieces as the conversion target data;
identifying, among the plurality of conversion candidate data pieces, an overlapping conversion candidate data piece at least part of which positionally overlaps the conversion target data on the recognition target data; and
excluding the overlapping conversion candidate data piece from the plurality of conversion candidate data pieces.