KR20010098422A

KR20010098422A - Method to convert unicode text to mixed codepages

Info

Publication number: KR20010098422A
Application number: KR1020010015424A
Authority: KR
Inventors: Dr. Joachim Manfred Bauer
Original assignee: Jeffrey L. Forman; International Business Machines Corporation
Priority date: 2000-04-26
Filing date: 2001-03-24
Publication date: 2001-11-08
Also published as: DE60131490T2; DE60131490D1; KR100399495B1; JP2001357031A; JP3725443B2

Abstract

The present invention relates to a method and system for converting a source string encoded according to the UNICODE standard into a target string encoded according to mixed codepages.

A predetermined priority is associated with each of the sub-codepages (14, 15, 16, 17), and the characters are converted strictly according to this priority sequence, without using a mapping table to find out in which of the sub-codepages (14, 15, 16, 17) the target character and its encoding are stored. Preferably, the sub-codepage (14) containing the most frequently used characters is associated with the highest priority, and the sub-codepage (17) containing the most rarely used characters is associated with the lowest priority (FIG. 1).

Description

Method for converting a source string to a target string, and computer system and program product therefor {METHOD TO CONVERT UNICODE TEXT TO MIXED CODEPAGES}

The present invention relates to a system and method for converting between character codes associated with computer-readable characters. In particular, it relates to a method and system for converting a source string encoded according to the Unicode standard into a target string encoded according to a mixed codepage.

Computers and other electronic devices typically use text to interact with users. Text is usually displayed on a monitor or some other display device. Because text must be represented in digital form inside the computer or device, a character set encoding must be used. Generally speaking, a character set encoding assigns each character of the character set a unique digital representation. The encoded characters correspond to letters, digits and various text symbols, each of which is assigned a numeric code for use by the computer or device. The most popular character set used in computers and other electronic devices is the American Standard Code for Information Interchange (ASCII), which uses 7-bit sequences for its coding. Other countries use other character sets. In Europe, the dominant character encoding standard is the ISO 8859-X family, in particular ISO 8859-1 (called "Latin-1"), developed by the International Organization for Standardization (ISO). In Japan, the dominant character encoding standard is JIS X0208, where JIS stands for Japanese Industrial Standard; it was developed by the Japanese Standards Association (JSA). Examples of other existing character sets include the Mac OS Standard Roman encoding (by Apple Computer, Inc.), Shift-JIS (Japan) and Big5 (Taiwan).

The character sets described above are stored in so-called codepages, a kind of table that gives the coding of each character of the character set. For each character, the associated numeric code is given such that a unique mapping exists between the two. Most codepages associate a numeric code one byte long with each character, but there are also codepages with longer numeric codes, for example two or three bytes or more. Codepages in which all characters have the same code length are called simple codepages.

To better accommodate the complexity of individual language-specific national requirements, so-called mixed codepages also exist. A mixed codepage consists of at least two sub-codepages whose codings may differ in length. The sub-codepages are also called codesets and are numbered from 0 to 3. For example, the mixed Japanese codepage IBM-33722 comprises codepage IBM-953 (1 byte, codeset 0), IBM-952 (2 bytes, codeset 1), IBM-896 (escape character 8E + 1 byte, codeset 2) and IBM-953 (escape character 8F + 2 bytes, codeset 3).
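The framing of the four codesets described for IBM-33722 can be sketched as follows. This is only an illustration of the byte layout stated above; the names `CODESET_FRAMING` and `frame` are hypothetical, and the payload bytes in the usage lines are arbitrary placeholders rather than entries of a real conversion table.

```python
# Byte layout per codeset as described for IBM-33722 in the text:
# (escape prefix, number of payload bytes that follow).
# The names below are illustrative and not taken from the patent.
CODESET_FRAMING = {
    0: (b"", 1),       # 1 byte
    1: (b"", 2),       # 2 bytes
    2: (b"\x8e", 1),   # escape character 8E + 1 byte
    3: (b"\x8f", 2),   # escape character 8F + 2 bytes
}


def frame(codeset: int, payload: bytes) -> bytes:
    """Prefix a payload with the escape byte of its codeset, if any."""
    prefix, length = CODESET_FRAMING[codeset]
    if len(payload) != length:
        raise ValueError(f"codeset {codeset} expects {length} payload byte(s)")
    return prefix + payload


# Placeholder payloads, for illustration only.
print(frame(2, b"\xb1").hex())      # 8eb1
print(frame(3, b"\xb0\xa1").hex())  # 8fb0a1
```

Because the encoded length differs between codesets, the target string of a mixed codepage generally has no fixed number of bytes per character.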

Because of the steadily increasing globalization of business and networks, and because of the ever-growing influence of the Internet, which connects all countries around the world, any data conversion between computers that use different kinds of codepages should be as fast as possible and, optionally, as simple as possible.

To simplify such code conversion, the so-called Unicode standard was developed and has meanwhile been accepted internationally. Unicode provides a single scheme that represents all existing codesets. The design of the Unicode encoding scheme is independent of the design of basic text processing algorithms, with the exception of directionality. Unicode implementations are assumed to contain suitable text processing and/or rendering algorithms. Every character encoded according to the Unicode standard is represented by a numeric code two bytes long.

The problem now is to obtain a very efficient way of converting from the Unicode standard to the mixed codepages described above. In other words, a source string represented in Unicode should be converted very simply and quickly into a code system comprising a plurality of codepages, for example the four codepages mentioned above.

A prior-art method for converting from Unicode to a plurality of codepages is disclosed in US Pat. No. 5,793,381. That code conversion system maps a single source character or character sequence to a single target character or target character sequence by looking up the position of the associated target character in a mapping table. When a source character is read, the mapping table is accessed to determine which of the sub-codepages can be used for the conversion. That sub-codepage then continues to be used until a source character is found in the input string that cannot be converted with it. In that case, the auxiliary mapping table is accessed again to find the correct sub-codepage. In addition, the prior-art system includes fallback handling that works with mapping tables: when the look-up handler cannot identify one or more characters in the target encoding for a text element, the fallback handling identifies one or more characters in the target encoding that can serve as a fallback mapping for that text element. However, the prior-art approach uses additional lookup tables, which make it slower and more complicated than necessary.

It is therefore an object of the present invention to provide a method and system for code conversion from Unicode text to mixed codepages that runs with better performance.

This object of the invention is achieved by the features stated in the appended independent claims. Further advantageous arrangements and embodiments of the invention are set forth in the respective subclaims.

To summarize the basic concept of the invention: a predetermined priority is associated with each sub-codepage, and the characters are converted strictly according to this priority sequence, without using a mapping table to find out in which of the sub-codepages the target character and its encoding are stored. Preferably, the sub-codepage containing the most frequently used characters is associated with the highest priority, and the sub-codepage containing the most rarely used characters with the lowest priority. In the case of four sub-codepages, a priority sequence among them can thus be established. Each priority is a measure of the probability of finding a particular character in the respective sub-codepage.
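The basic concept can be sketched as a lookup that simply tries the sub-codepages in priority order. This is a minimal sketch, not the patented implementation: the function name `convert_char` and the tiny tables are hypothetical, and the table contents are illustrative sample entries only.

```python
from typing import Dict, List, Optional, Tuple


def convert_char(
    ch: str,
    codesets: Dict[int, Dict[str, bytes]],
    priority: List[int],
) -> Optional[Tuple[int, bytes]]:
    """Try each sub-codepage in priority order; no separate mapping
    table is consulted to learn which sub-codepage holds the character."""
    for cs in priority:
        code = codesets[cs].get(ch)
        if code is not None:
            return cs, code
    return None  # character not convertible


# Hypothetical toy tables with one sample entry each.
CODESETS = {
    1: {"\u3042": b"\xa4\xa2"},
    0: {"A": b"\x41"},
    2: {"\uff71": b"\x8e\xb1"},
    3: {},
}
PRIORITY = [1, 0, 2, 3]  # most frequently used sub-codepage first

print(convert_char("A", CODESETS, PRIORITY))       # (0, b'A')
print(convert_char("\u3042", CODESETS, PRIORITY))  # (1, b'\xa4\xa2')
```

The only ordering information the conversion needs is the priority list itself.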

In addition to this basic approach, it is further proposed that, when a character is not found in a particular sub-codepage, the sub-codepage with the highest priority that has not yet been accessed for that character is accessed next. Applying the inventive approach described above yields the following advantages.

First, performance is significantly increased compared with the prior-art conversion method described above, because there is no separate mapping table that has to be accessed every time a character is not found in the sub-codepage currently in use.

Second, the auxiliary mapping table does not need to be generated at all.

Third, the priority sequence assigned to the sub-codepages can be set so that country-specific knowledge about the language is exploited. The conversion method is therefore easily adaptable to the particularities of a given country-specific codepage system.

Alternatively, from case to case, the priority sequence described above can be changed dynamically from a standard setting to an individual setting before the code conversion is run, in order to adapt the conversion to the specific requirements of the particular text to be converted when it is known in advance that the text is not an average one. The new priority sequence may, for example, be given in the header of the file to be converted.

A further significant advantage is that the invention offers a concept that makes it easy to exploit a particular feature of modern computer systems: hardware instructions that process a plurality of characters at once instead of only one character at a time. Such hardware instructions require a linear table for looking up the target character, without additional checking accesses to any kind of mapping table.

The invention can advantageously be used on the Internet wherever a code conversion is required. In addition, the inventive tool can be employed in database applications when some of the database content is to be converted from Unicode text to mixed codepages.

When the method is applied in a case where the probability of finding a particular character is the same for every sub-codepage, a statistical average of only 2 additional accesses is required with four sub-codepages. This value decreases to 1.5 for three sub-codepages and to 1 for two sub-codepages. For the Japanese EUC table, in which 70% of all characters are found in codeset 1, 30% in codeset 0 and less than 1% in the remaining codesets 2 and 3, the statistical average is only slightly larger than 1.
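As a rough cross-check of the Japanese EUC figure, the expected number of table accesses per character can be estimated under a simplified model: every character is probed in strict priority order, the first probe is counted, and the dependence on the previously used codeset is ignored. The sketch uses the more detailed split given later in the description (about 70%, 29%, 0.6% and 0.4%); the function name is illustrative, and the model is an assumption rather than the calculation behind the figures quoted above.

```python
def expected_accesses(probabilities_in_priority_order):
    """Expected number of probes per character when sub-codepages are
    tried in priority order and each character is treated independently."""
    return sum(
        rank * p
        for rank, p in enumerate(probabilities_in_priority_order, start=1)
    )


# Codeset 1, codeset 0, codeset 2, codeset 3 (priority order).
japanese_euc = [0.70, 0.29, 0.006, 0.004]
print(round(expected_accesses(japanese_euc), 3))  # 1.314
```

Under these assumptions the estimate is about 1.3 accesses per character, which agrees with the statement that the average is only slightly larger than 1.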

Furthermore, the invention can advantageously be realized at least partly as a hardware implementation burnt directly into a hardware chip. Such a chip then contains hardware circuitry that implements and reflects at least some of the steps of the inventive code conversion method. In view of the growing diversity of telecommunication devices and their increasing range of functions, which covers more and more technical features, such chips can be used in a variety of devices. Considering the devices available today, they can advantageously be used in any device that takes part in some form of international communication. Examples are routers in any kind of network such as the Internet, TV or radio receiving devices, in particular set-top boxes for digital TV or radio, mobile phones, and any kind of hand-held computing and/or telecommunication device, all of which have an input interface that may have to process foreign-language data.

The invention is illustrated by way of example and is not limited by the figures of the accompanying drawings, in which:

FIG. 1 is a schematic logic diagram showing the basic elements of the inventive method;

FIG. 2 is a schematic of an arbitrarily chosen example showing, for each of 230 source characters, in which of four sub-codepages the character can be found;

FIG. 3 is a schematic logic diagram showing the codeset access sequence when the method is applied during a code conversion according to a preferred embodiment of the invention.

Explanation of reference numerals for the main parts of the drawings

10: Unicode 12: Priority rules

14, 15, 16, 17: Codesets

Referring to the drawings in general, and now to FIG. 1 in particular, the whole set of Unicode characters that is subject to the inventive conversion method is depicted symbolically in box 10.

According to a preferred embodiment of the method, priority rules (12) are established that set up a well-defined priority sequence among the sub-codepages in use. The term 'codeset n', n being an integer, is used herein synonymously with 'sub-codepage n'. As shown in FIG. 1, four sub-codepages are used: codeset 1 (14), codeset 0 (15), codeset 2 (16) and codeset 3 (17). Box 10 shows four characters selected by way of example, whose encodings are located in different sub-codepages, as shown on the right-hand side of FIG. 1. As indicated in each of the tables 10, 14, 15, 16 and 17, a numeric code is stored for each character.

Referring now to FIGS. 2 and 3, a preferred embodiment of the invention is described in more detail using an exemplary code conversion from Japanese Unicode to the mixed Japanese EUC sub-codepages.

Before the code conversion is started, an estimate of the existing Japanese EUC sub-codepages is used: in this particular case, codeset 1 contains approximately 70% of all occurring source characters, codeset 0 approximately 29%, codeset 2 approximately 0.6% and codeset 3 approximately 0.4%. This probability distribution is reflected in FIG. 1, where the most frequently used codeset (14) is shown first and the most rarely used codeset is shown as the last set (17) of the codeset 'stack'.

The resulting priority sequence is: codeset 1, codeset 0, codeset 2, codeset 3.
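The priority sequence follows directly from sorting the codesets by their estimated frequency. A minimal sketch, using the percentages stated above; the variable names are illustrative:

```python
# Estimated share of source characters per codeset, as stated in the text.
frequencies = {0: 0.29, 1: 0.70, 2: 0.006, 3: 0.004}

# Highest estimated frequency first gives the priority sequence.
priority = sorted(frequencies, key=frequencies.get, reverse=True)
print(priority)  # [1, 0, 2, 3]
```

A different frequency estimate, for example one supplied for a non-average text, would yield a different sequence from the same sort.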

FIG. 2 is a schematic of an arbitrarily chosen example that shows, for each of 230 source characters, in which of the four sub-codepages the character can be found.

All 230 source characters are converted in a single exemplary conversion run. It should be understood that the number 230 was chosen to be very small for the sake of clarity.

The 230 source characters thus make up the input set depicted symbolically by reference numeral 10 in FIG. 1. The new numeric codes to be issued by the method are stored in the four sub-codepages (14, 15, 16, 17) shown on the right-hand side of FIG. 1, as follows:

characters 1 to 171 in codeset 1,

characters 172 and 173 in codeset 0,

characters 174 to 196 again in codeset 1,

character 197, a very rarely used character, in codeset 3,

characters 198 to 210 again in codeset 1,

characters 211 to 215 in codeset 0,

characters 216 and 217 in the very rarely used codeset 2, and

characters 218 to 230 in codeset 1.

The conversion scheme processes the source characters described above sequentially. In a preferred way of applying the method, a hardware instruction capable of processing a plurality of characters at once can be used. An example is the IBM OS/390 hardware instruction "Translate Two to One", abbreviated TRTO, which converts a string of 2-byte characters into an output buffer of 1-byte characters. This hardware instruction takes the following parameters:

the string to be converted,

the target buffer in which the converted string is stored,

a character indicating that a particular input character cannot be converted,

a conversion table that is addressed with the character to be converted and holds the converted character at the addressed position.

For the sake of clarity, however, and in keeping with the purpose of this description, the input character sequence described above is submitted to a single-character conversion process, i.e. a process that handles each character separately.

According to a preferred feature of this embodiment, a set of processing rules derived from the priority sequence described above is established. The processing rules are as follows:

1. Access the highest-priority codeset first.

2. When a particular character is not found in the highest-priority codeset, continue with the codeset of next-lower priority.

3. If a character is not found in a codeset, access the codeset with the highest priority that has not yet been accessed for this character.

Applying these rules yields the scheme given in FIG. 3.
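The access sequences implied by the rules can be generated mechanically: start with the codeset currently in use, then visit the remaining codesets in priority order. The sketch below derives one sequence per starting codeset; it is inferred from the rules and the walk-through in the text, not copied from the figure, and the function name is illustrative.

```python
PRIORITY = [1, 0, 2, 3]  # codeset 1 has the highest priority


def access_sequence(current, priority=PRIORITY):
    """Codesets to probe for one character, given the codeset in which
    the previous character was found."""
    return [current] + [cs for cs in priority if cs != current]


for start in PRIORITY:
    print(start, access_sequence(start))
# 1 [1, 0, 2, 3]
# 0 [0, 1, 2, 3]
# 2 [2, 1, 0, 3]
# 3 [3, 1, 0, 2]
```

Each printed line corresponds to one starting situation: the first entry is the codeset tried first, and the remaining entries are the fallbacks in the order rule 3 prescribes.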

FIG. 3 comprises rows. The first row reflects the sequence of sub-codepages to be accessed subsequently if a particular character is not found when access starts at sub-codepage 1, i.e. the sub-codepage with the highest priority. Thus, when a character is not found in sub-codepage 1, sub-codepage 0 is accessed to find the current character. If the current character is found in sub-codepage 0, the process continues in that sub-codepage with the next character to be converted; for that next character, the second row applies. Otherwise, i.e. if the current character is not found in sub-codepage 0, sub-codepage 2 is accessed for a further search, and the corresponding procedure is followed for sub-codepage 2.

If the current character is found in sub-codepage 2, the associated numeric code, i.e. the converted code, is issued and the next character is searched for according to the third row of FIG. 3. Otherwise, i.e. if the current character is not found in sub-codepage 2, the last sub-codepage, sub-codepage 3, is accessed. The character is found there, and the search proceeds according to the fourth row of FIG. 3.

As can be seen from the above details of this embodiment, the search always continues in the sub-codepage in which the last character was successfully found.

Referring in particular to the second, third and fourth rows, sub-codepage 1, shown as reference numeral 14 in FIG. 1, is generally accessed for the next character to be converted whenever that character is not found in the respective current sub-codepage.

With particular reference to the character string illustrated in FIG. 2, the way in which the different sub-codepages are processed, i.e. accessed, is now described in more detail. The arrows in FIGS. 2 and 3 are labelled A) to G) and indicate the respective changes of access from one codeset to another.

The search begins with an access to sub-codepage 1, because it has the highest priority. Character 1 is found there and is converted by outputting the numeric code stored for it in sub-codepage 1. The conversion process then takes the second character as input, and the same procedure is repeated because the second character is also stored in sub-codepage 1. The same applies to every character up to character 171.

The current character 172 cannot be found in sub-codepage 1. Therefore, as indicated by arrow A), sub-codepage 0 is accessed next, because it is the sub-codepage with the next-highest priority. Character 172 is found in sub-codepage 0, and its numeric code is issued as described above. Processing now continues in sub-codepage 0 with character 173, which, as shown in FIG. 2, is also stored in sub-codepage 0. Character 174 is processed next. This time the character is not found in codeset 0, so the second row of FIG. 3 applies. As indicated by arrow B), codeset 1 is accessed again, because the probability of finding a character is highest in this codeset.

Referring again to FIG. 2, character 174 is indeed found in codeset 1. After its code has been issued, the first row applies again. Characters 175 to 196 are processed as described above without changing the codeset.

Next, character 197, a very rarely used character, is not found in codeset 1. Thus, as indicated by arrow C), codeset 0 of FIG. 3 is accessed and searched without success, then codeset 2 is searched in accordance with the processing rules, again without success, and finally codeset 3 is accessed. There, character 197 is found and its numeric code is issued. The search then continues in codeset 3.

Character 198 is not found in codeset 3. The fourth row of FIG. 3 therefore applies and, as indicated by arrow D), codeset 1 is accessed next. There the search succeeds for character 198 and for all following characters until character 211, which is not found. The first row therefore applies again and, as indicated by arrow E), codeset 0 is accessed next. Characters 211 to 215 are found in codeset 0.

Character 216, however, cannot be found there, so the second row applies and codeset 1 is re-accessed for the search. It is not found there either, so codeset 2 is accessed, as indicated by arrow F). There it is found, and after its code is emitted, character 217 is likewise processed successfully from codeset 2.

Character 218 is processed next. Because it is not found in codeset 2, codeset 1 is re-accessed, as indicated by the third row of FIG. 3. Character 218 and all subsequent characters remaining among the input characters to be converted are again found in codeset 1. They are therefore processed as described above, and the conversion process stops after the last character, 230, has been converted. All source character codes have thus been converted successfully.
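
The walkthrough above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the implementation of the described embodiment: the function name `convert`, the representation of each sub-codepage as a dictionary from Unicode code point to target code, and the priority order `[1, 0, 2, 3]` are hypothetical. The sketch searches the current codeset first and, on a miss, tries the remaining codesets in descending priority, then stays in the codeset where the character was found.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: one dict per sub-codepage,
# mapping a Unicode code point to its target code.
SubCodepage = Dict[int, int]


def convert(source: str,
            pages: List[SubCodepage],
            priority: List[int]) -> List[Tuple[int, int]]:
    """Return (sub-codepage index, target code) pairs for each character.

    `priority` lists sub-codepage indices from highest to lowest priority.
    """
    current = priority[0]  # start in the highest-priority sub-codepage
    out: List[Tuple[int, int]] = []
    for ch in source:
        cp = ord(ch)
        # Current sub-codepage first, then the others by descending priority.
        order = [current] + [p for p in priority if p != current]
        for idx in order:
            if cp in pages[idx]:
                out.append((idx, pages[idx][cp]))
                current = idx  # keep searching here for the next character
                break
        else:
            # Assumption: an unmapped character is reported as an error.
            raise ValueError(f"U+{cp:04X} is not in any sub-codepage")
    return out
```

With four toy sub-codepages and `priority = [1, 0, 2, 3]`, a character missing from codeset 0 is looked up next in codeset 1, and a character missing from codeset 1 is looked up in codesets 0, 2 and 3 in that order, which mirrors the sequence of arrows in the description. No separate table recording which sub-codepage holds which character is consulted.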

In the foregoing detailed description, the invention has been described with reference to specific exemplary embodiments. It will be apparent, however, that various modifications and changes may be made without departing from the spirit and scope of the invention as set forth in the appended claims. The description and drawings are accordingly to be regarded as illustrative rather than restrictive.

For example, the search may proceed differently after a hit in a rarely used codeset. Instead of first trying that same rarely used codeset for the next character, the search may return automatically to the highest-priority codeset. This situation arises, for example, after character 197 has been processed in the description given above. Statistically, a small performance gain can be achieved in this way.
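
A minimal sketch of this variant follows, again under assumptions: the function name `convert_variant`, the `rare` parameter naming the rarely used sub-codepages, and the dictionary representation of a sub-codepage are all hypothetical. The function also counts lookups ("probes") so that the effect of the variant can be observed.

```python
from typing import Dict, List, Set, Tuple

SubCodepage = Dict[int, int]  # hypothetical: code point -> target code


def convert_variant(source: str,
                    pages: List[SubCodepage],
                    priority: List[int],
                    rare: Set[int]) -> Tuple[List[Tuple[int, int]], int]:
    """Convert `source`; return the output pairs and the number of lookups.

    After a hit in a sub-codepage listed in `rare`, the next character is
    searched starting from the highest-priority sub-codepage again.
    """
    current = priority[0]
    out: List[Tuple[int, int]] = []
    probes = 0
    for ch in source:
        cp = ord(ch)
        order = [current] + [p for p in priority if p != current]
        for idx in order:
            probes += 1
            if cp in pages[idx]:
                out.append((idx, pages[idx][cp]))
                # Variant: do not stay in a rarely used sub-codepage.
                current = priority[0] if idx in rare else idx
                break
        else:
            raise ValueError(f"U+{cp:04X} is not in any sub-codepage")
    return out, probes
```

Passing an empty `rare` set reproduces the behaviour of staying in whichever codeset produced the last hit. Whether the variant saves lookups depends on the input: it helps when a rare character is followed by common ones and costs extra lookups when several rare characters occur in a row.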

The present invention can be realized in hardware, software, or a combination of hardware and software. A code conversion tool according to the invention can be realized in a centralized fashion in one computer system, or in a distributed fashion in which different elements are spread across several interconnected computer systems. Any kind of computer system or other apparatus adapted to carry out the methods described herein is suitable. A typical combination of hardware and software is a general-purpose computer system with a computer program that, when loaded and executed, controls the computer system such that it carries out the methods described herein.

The present invention can also be embedded in a computer program product that comprises all the features enabling the implementation of the methods described herein and that, when loaded in a computer system, is able to carry out these methods.

Computer program means or computer program in the present context means any expression, in any language, code or notation, of a set of instructions intended to cause a system having an information processing capability to perform a particular function either directly or after either or both of the following: a) conversion to another language, code or notation; b) reproduction in a different material form.

The present invention has the effect of providing a method and system for code conversion from Unicode text to mixed code pages that can be executed with better performance.

Claims

A method of converting a source string comprising a plurality of source characters into a target string, the source string being encoded according to Unicode code pages and the target string being encoded according to a mixed code page comprising a plurality of sub-codepages (14, 15, 16, 17), the method comprising the steps of:

associating a predetermined processing priority with each sub-codepage (14, 15, 16, 17) to generate a processing priority sequence; and

converting the characters strictly according to the priority sequence.

The method of claim 1,

wherein the priority sequence reflects the probability of finding a source character in one of the sub-codepages (14, 15, 16, 17).

The method of claim 1,

further comprising the step of accessing, if a character is not found in the current sub-codepage, the sub-codepage with the highest priority that has not yet been accessed for that character.

The method of claim 1,

wherein one or more characters are processed by a single hardware instruction.

The method of claim 1,

wherein said priority sequence is dynamically changed from a standard setting to an individual setting before said code conversion is executed.

A computer system having program means installed for carrying out the steps of the method according to any of the preceding claims.

The computer system of claim 6,

wherein the computer system is arranged for use as an Internet server with program means installed for carrying out the steps of the method according to any of the preceding claims.

Chip means comprising hardware circuitry for implementing at least some of the steps of the method according to any of the preceding claims.

A device comprising a chip according to claim 8.

A computer program product for execution in a data processing system, comprising a portion of computer program code for performing the steps of the respective method according to any one of the preceding claims.

The computer program product of claim 10,

wherein the computer program is a browser program.

A computer program product stored on a computer usable medium comprising computer readable program means for causing a computer to perform the method according to any one of claims 1 to 5.