KR20030075594A - The Web Document Transform System based on Unicode involving Korean Ancient Writings and Chinese Characters - Google Patents

The Web Document Transform System based on Unicode involving Korean Ancient Writings and Chinese Characters

Info

Publication number
KR20030075594A
Authority
KR
South Korea
Prior art keywords
unicode
document
web
chinese characters
processor
Prior art date
Application number
KR1020020014893A
Other languages
Korean (ko)
Inventor
임승태
김민수
Original Assignee
주식회사 인터유져
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 주식회사 인터유져 filed Critical 주식회사 인터유져
Priority to KR1020020014893A priority Critical patent/KR20030075594A/en
Publication of KR20030075594A publication Critical patent/KR20030075594A/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/957 Browsing optimisation, e.g. caching or content distillation
    • G06F16/9577 Optimising the visualization of content, e.g. distillation of HTML documents
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/53 Processing of non-Latin text

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Document Processing Apparatus (AREA)

Abstract

PURPOSE: A system for converting Unicode-based web documents that include Old Korean texts and extended Chinese characters is provided. It supports data containing Old Korean texts and extended Chinese characters as text on the web, and it offers a terminal-independent service by converting existing documents into XML (eXtensible Markup Language) format on a standard Unicode basis. CONSTITUTION: Binary documents (50) are received through a document registration processor (40), and a web document processor (30) generates a web document in a markup language. An input binary document is separated into its document components by a binary document processor (31). The text part is divided into regions that use a standard character set and regions that use a non-standard character set. Text converted to Unicode can be converted again into a national character set code by a designated character set converter. Unicode characters outside the range of the designated character set are converted into Unicode scalar values, which are displayed in a web browser to provide the web service.

Description

Unicode-based web document conversion system including Old Korean texts (古文) and extended Chinese characters {The Web Document Transform System based on Unicode involving Korean Ancient Writings and Chinese Characters}

In the past, there was no technology for handling Old Korean texts (Hangul gomun) or extended Chinese characters as text on the web. As a result, Old Korean texts and level-2 Hanja have generally been served on the web as images. On sites that offer traditional Korean poetry, gasa literature, and Korean-studies materials, most content is served as images, so it can be viewed in a web browser but cannot be downloaded or otherwise processed. Korean-studies materials processed as images have therefore been difficult to use as data.

The present invention is based on an analysis of Hangul characters, Unicode, and the CJK codes. It supports a variety of operating systems and databases, and its conversion system is built on analysis of binary document file formats such as HWP, PDF, and MS Word and of their SDKs. Its technical objective is extension to XML and various other markup languages.

In addition, DTD, XML Schema, XSL, and the like are used to convert binary document files into XML format and to generate and edit XML according to the document-form settings.

The target markup languages are XML, HTML, DHTML, XHTML, cHTML, mHTML, and WML.

FIG. 1 is a system configuration diagram.

FIG. 2 is a flow diagram of the web document service including Old Korean texts and extended Chinese characters.

FIG. 3 is a configuration diagram of the Unicode character processor.

<Explanation of symbols for the main parts of the drawings>

FIG. 1 is a schematic configuration diagram of the Unicode-based web document conversion system including Old Korean texts and extended Chinese characters according to the present invention. The reference numerals for the main parts of the figure are as follows.

·10 - Web browser of the client

·20 - Web server running in this system

·30 - The web document conversion system, which runs with web server 20 above or on its own

·31 - Binary document file analysis processor, a component of 30 above

·32 - Document component processor, a component of 30 above

·33 - Unicode character processor, a component of 30 above

·34 - Web document generator based on document components and Unicode, a component of 30 above

·35 - Web document manager, a component of 30 above

·40 - Document registration processor, which handles input to web document conversion system 30 above

·50 - Binary document files created by various word processors

[FIG. 1] is a configuration diagram of the Unicode-based web document conversion system including Old Korean texts and extended Chinese characters. Hangul (HWP), MS-Word (DOC), and other binary documents (50) are received through the document registration processor (40), and the web document processor (30) generates web documents composed in web-serviceable markup languages such as XML, HTML, XHTML, and WML (35).

The input binary document is separated into its document components by the binary document processor (31). Among these, the parts containing text are divided by the character set discriminator of [FIG. 3] into regions that use a standard character set and regions that use a non-standard character set. Characters in a standard character set all fall within the range that Unicode covers and can be converted, but Old Korean texts and extended Chinese characters have no counterpart in the Unicode range. Old Korean texts therefore use the Unicode Private Use area, and extended Chinese characters use the Unicode ideograph area (CJKV Ideographs) and the Private Use area.
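
The region split described above can be sketched in Python. This is only one possible reading, not the patent's discriminator: it assumes the text has already been mapped to Unicode and treats any code point in a Private Use range as "non-standard". The names `is_private_use` and `split_regions` are hypothetical.

```python
from itertools import groupby
from typing import List, Tuple

# Unicode Private Use ranges: the BMP area plus the two supplementary planes.
PRIVATE_USE_RANGES = (
    (0xE000, 0xF8FF),
    (0xF0000, 0xFFFFD),
    (0x100000, 0x10FFFD),
)


def is_private_use(ch: str) -> bool:
    """Return True if the character lies in a Unicode Private Use range."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in PRIVATE_USE_RANGES)


def split_regions(text: str) -> List[Tuple[str, str]]:
    """Split text into consecutive ('standard' | 'private', run) regions."""
    regions = []
    for private, run in groupby(text, key=is_private_use):
        label = "private" if private else "standard"
        regions.append((label, "".join(run)))
    return regions
```

A downstream step could then apply a special font or a different conversion rule only to the regions labeled `private`.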

In Unicode, which uses 2 bytes by default, the Private Use area can assign up to 4 bytes per character, which is more than enough to cover every character set in the world. For a web service, however, arbitrary character code assignments can cause compatibility problems. The use of non-standard areas should therefore be minimized. To address this, the present invention decomposes Hangul into initial, medial, and final sounds and assigns them to the Unicode Private Use area. As a result, 4 to 6 bytes of Unicode are used for one Old Hangul character.
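
The byte count above can be checked with a small Python sketch. The patent does not publish its code table, so the three base code points below are hypothetical placeholders, and the sketch assumes a 2-byte (UTF-16) encoding, as the text implies. A syllable with an initial and a medial sound takes two code units (4 bytes); adding a final sound takes three (6 bytes).

```python
from typing import Optional

# Hypothetical Private Use bases; the actual assignments are not given in the source.
PUA_BASE_INITIAL = 0xE000
PUA_BASE_MEDIAL = 0xE100
PUA_BASE_FINAL = 0xE200


def encode_syllable(initial: int, medial: int, final: Optional[int] = None) -> str:
    """Map jamo indices to a sequence of Private Use code points."""
    chars = [chr(PUA_BASE_INITIAL + initial), chr(PUA_BASE_MEDIAL + medial)]
    if final is not None:
        chars.append(chr(PUA_BASE_FINAL + final))
    return "".join(chars)


def utf16_size(text: str) -> int:
    """Number of bytes the text occupies in UTF-16 without a byte order mark."""
    return len(text.encode("utf-16-be"))
```

For example, `utf16_size(encode_syllable(3, 5))` gives 4 and `utf16_size(encode_syllable(3, 5, 7))` gives 6, matching the 4-to-6-byte range stated above.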

Text converted to Unicode can be reconverted into a national character set code by the designated character set converter of [FIG. 3]. Unicode characters outside the range of the designated character set are converted, in accordance with the W3C XML standard, into Unicode scalar values of the form &#_____;. These are displayed in the user's web browser using the font specified for the web service, which makes it possible to serve Old Korean texts and extended Chinese characters on the web.
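
As a minimal illustration of this fallback, Python's built-in `xmlcharrefreplace` error handler does the same kind of substitution: characters the target character set cannot represent become decimal numeric character references. The choice of EUC-KR as the designated character set and the function names are assumptions made for the example, not details from the patent.

```python
import html


def to_charset_with_ncr(text: str, charset: str = "euc-kr") -> bytes:
    """Encode text; characters outside the charset become &#NNNN; references."""
    return text.encode(charset, errors="xmlcharrefreplace")


def decode_ncr(data: bytes, charset: str = "euc-kr") -> str:
    """Decode bytes and resolve character references, as a browser would.

    Note: html.unescape also resolves named entities such as &amp;.
    """
    return html.unescape(data.decode(charset))
```

Here a Hangul syllable covered by EUC-KR stays a 2-byte EUC-KR sequence, while a Private Use character such as U+E000 is emitted as `&#57344;`. Whether the browser then shows the intended glyph depends on the font supplied with the service.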

In addition, as shown in [FIG. 4], the input binary document can be converted in stages into HTML, XHTML, XML, and so on to provide various web services.

With the explosive growth of web documents, and with adoption of electronic document management systems increasing rapidly, especially among content distributors, a technology that can convert and display even Old Korean texts (古文) and extended Chinese characters in the web's text environment will be highly useful.

A visit to the websites of public libraries, museums, Korean-studies archives, and other institutions that store and serve documents, books, and literature shows that most either do not serve documents containing Old Korean texts or Chinese characters, or serve them as images. The present invention is a technology that can serve even Old Korean texts and extended Chinese characters on the web. Like existing electronic document management systems, it answers the need for a system that can provide effective personalized services and generate new added value.

Ordinarily, the content of a document attached to a web bulletin board can be checked only after downloading it. With a product based on the present invention, full-text search and page-by-page or paragraph-by-paragraph previews are available before download.

The present invention also fully supports existing documents and forms. It further enables XML processing that unifies the wired and wireless Internet: once an existing document has been converted into XML format, it can be reconverted, at low cost and with little labor, not only into HTML for the wired Internet but also into wireless Internet formats such as mHTML, HDML, and WML.
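
The reconversion from one XML source into several delivery formats might look like the following Python sketch. The intermediate `<doc><title/><para/></doc>` schema and the `render` function are invented for illustration; the patent does not describe its XML vocabulary, and a real system would more likely use the DTD, XML Schema, and XSL tooling it mentions.

```python
import xml.etree.ElementTree as ET


def render(doc_xml: str, target: str) -> str:
    """Render a hypothetical <doc> document as minimal HTML or WML markup."""
    doc = ET.fromstring(doc_xml)
    title = doc.findtext("title", default="")
    paragraphs = [p.text or "" for p in doc.findall("para")]

    if target == "html":
        root = ET.Element("html")
        head = ET.SubElement(root, "head")
        ET.SubElement(head, "title").text = title
        body = ET.SubElement(root, "body")
    elif target == "wml":
        root = ET.Element("wml")
        # In WML, a card is the unit of display on the handset.
        body = ET.SubElement(root, "card", {"title": title})
    else:
        raise ValueError(f"unsupported target: {target}")

    for text in paragraphs:
        ET.SubElement(body, "p").text = text
    return ET.tostring(root, encoding="unicode")
```

Because both outputs are produced from the same parsed source, adding another target format only requires another branch rather than a new conversion of the original binary document.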

Claims (7)

1. A technology for distributing and serving, over the Internet, web documents (HTML, XML, WML, etc.) that include Old Korean texts and extended Chinese characters.
2. The analysis processor and component processor, specified in the present invention, for binary document files generated by word processors.
3. The Unicode-based character processor, specified in the present invention, that handles Old Korean texts and extended Chinese characters.
4. With respect to claims 1, 2, and 3, developing and commercializing the present invention as a standalone application, or offering it as a paid or free web-based online service.
5. A technology for searching and extracting Unicode-based web documents that include Old Korean texts and extended Chinese characters.
6. Applying the method described in the present invention to Internet services for binary document files.
7. In connection with the above claims, modularizing each of the relevant technologies to serve Old Korean texts and extended Chinese characters over the Internet.

Priority Applications (1)

Application Number Priority Date Filing Date Title
KR1020020014893A KR20030075594A (en) 2002-03-19 2002-03-19 The Web Document Transform System based on Unicode involving Korean Ancient Writings and Chinese Characters

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
KR1020020014893A KR20030075594A (en) 2002-03-19 2002-03-19 The Web Document Transform System based on Unicode involving Korean Ancient Writings and Chinese Characters

Publications (1)

Publication Number Publication Date
KR20030075594A (en) 2003-09-26

Family

ID=32225402

Family Applications (1)

Application Number Title Priority Date Filing Date
KR1020020014893A KR20030075594A (en) 2002-03-19 2002-03-19 The Web Document Transform System based on Unicode involving Korean Ancient Writings and Chinese Characters

Country Status (1)

Country Link
KR (1) KR20030075594A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100762712B1 (en) * 2005-12-13 2007-10-02 한국과학기술정보연구원 Method for transforming of electronic document based on mapping rule and system thereof
WO2008032962A1 (en) * 2006-09-11 2008-03-20 Ddh, Inc. System and method for transforming electronic document
KR100955077B1 (en) * 2006-09-11 2010-04-28 주식회사 디디에이치 System and method for transforming electronic document

Similar Documents

Publication Publication Date Title
US7770108B2 (en) Apparatus and method for enabling composite style sheet application to multi-part electronic documents
US7770107B2 (en) Methods and systems for extracting and processing translatable and transformable data from XSL files
US20020010725A1 (en) Internet-based font server
US7054952B1 (en) Electronic document delivery system employing distributed document object model (DOM) based transcoding and providing interactive javascript support
US6725424B1 (en) Electronic document delivery system employing distributed document object model (DOM) based transcoding and providing assistive technology support
US6829746B1 (en) Electronic document delivery system employing distributed document object model (DOM) based transcoding
Bos et al. Cascading style sheets level 2 revision 1 (css 2.1) specification
US7356807B1 (en) Transforming server-side processing grammars
US8484552B2 (en) Extensible stylesheet designs using meta-tag information
US6738951B1 (en) Transcoding system for delivering electronic documents to a device having a braille display
US7024415B1 (en) File conversion
US20050235202A1 (en) Automatic graphical layout printing system utilizing parsing and merging of data
GB2382174A (en) Data formatting in a platform independent manner
RU2001128738A (en) Method and device for forming structured documents for various presentations
EP1126380A1 (en) Converting a formatted document into an XML-document
US20030106021A1 (en) Apparatus and method for creating PDF documents
JP2003114882A (en) System and method for formatting contents for publication
US20060179406A1 (en) Methods and systems for rendering electronic data
KR20030075594A (en) The Web Document Transform System based on Unicode involving Korean Ancient Writings and Chinese Characters
Curtin Internationalization of the file transfer protocol
US20050229099A1 (en) Presentation-independent semantic authoring of content
Hardie From legacy encodings to Unicode: the graphical and logical principles in the scripts of South Asia
EP1377917A2 (en) Extensible stylesheet designs using meta-tag information
EP1061456A2 (en) A conversion apparatus for converting an HTML document to an MHEG document
JP4243038B2 (en) System, apparatus and method for converting JSP to PvC format

Legal Events

Date Code Title Description
WITN Withdrawal due to no request for examination