KR100497543B1

KR100497543B1 - Apparatus for converting HTML to VoiceXML having a selecting function of transformation purpose by prior knowledge and communication of user and method for employing as the same

Info

Publication number: KR100497543B1
Application number: KR10-2002-0081539A
Authority: KR
Inventors: 장영건; 최훈일
Original assignee: 장영건
Priority date: 2002-12-20
Filing date: 2002-12-20
Publication date: 2005-07-01
Also published as: KR20040054982A

Abstract

The present invention relates to an HTMLtoVoiceXML converter, and a method of operating it, that converts the content of web documents written in HTML into VoiceXML documents so that the content can be provided through a voice interface, and that selects the conversion target using the user's prior knowledge and interaction with the user. Given the URL of the web page to be converted, the converter automatically selects the desired content group on that page as the user intends, either through user interaction or according to a predefined rule base, regardless of how the original web document is implemented. It then separates the items of that content and generates and stores VoiceXML documents according to a voice interface scenario. In addition, for a series of consecutive web documents in which a content list spans several pages of the same form and structure, the user specifies the text of the links, or the structural features of the graphics, used to reach and connect the pages from the first page, so that all linked pages can be converted with a single access.

The invention lets the user specify the conversion target among the contents of an HTML web document by text, by rules, or by visual presentation, so the target content can be selected accurately regardless of variations in authoring technique. Changes in the character of a web content group are easy to detect, so the converter can reliably decide whether content should still be selected when the page layout changes over time after the user's selection. Because the work that follows the user's selection is automated, the cost and time needed to convert large volumes of identically formatted HTML web documents into VoiceXML documents are greatly reduced.

Description

Application for converting HTML to VoiceXML having a selecting function of transformation purpose by prior knowledge and communication of user and method for employing as the same

The present invention relates to an HTMLtoVoiceXML converter and a method of operating it, and in particular to an HTMLtoVoiceXML converter that converts web information written in HTML into VoiceXML format so that it can be provided through a standardized voice interface, with a function for selecting the conversion target through the user's prior knowledge and interaction, and to a method of operating the same.

The typical way to convert HTML into another markup language is to replace each HTML tag with a tag of the target language that has a similar function. This approach has often been used to convert HTML into markup languages for the wireless Internet. However, because the two markup languages serve different purposes, substituting tags alone produces an awkward information delivery scenario, and the information is not conveyed clearly. VoiceXML, unlike the wireless Internet markup languages, is written to provide information through a voice interface, so its purpose differs sharply from that of HTML tags written for visual presentation. Tag-for-tag substitution is therefore not possible, and the way the information delivery scenario is organized also differs.

Moreover, users generally do not use all the content shown in an HTML document, only the content they need. Conventional methods of selecting the needed content from a web page include structural analysis, for example converting the page into a tree structure and selecting the tree with the most child nodes as the content, and content-based approaches that describe the content and relations of the required information in an ontology and select the content with a high degree of fit. The structural approach may select content other than what the user intended. In addition, depending on the implementation technique, visually identical content can be represented by different trees, so the structure often differs from the content groups that the user perceives from the visual presentation of the page.

The content-based approach is a good way to pick out only what a user needs from the Internet at large. As a way of selecting content from an already chosen web page, however, it requires a complicated description and demands the expertise and effort to analyze the HTML document in detail. Furthermore, because the layout and content of web documents change frequently, the ontology document has to be updated each time. The approach therefore does not suit a converter, which is used precisely to reduce the expertise, complexity, and authoring time required. Finally, conventional converters limit the conversion target to the single web page being accessed, so converting a website that holds a large amount of information of the same nature and structure requires many separate conversion runs.

An object of the present invention is to provide an HTMLtoVoiceXML converter, and a method of operating it, having a function of selecting the conversion target through the user's prior knowledge and interaction. The converter reads a web document written in HTML, converts it into a tree structure and analyzes it to group related content, extracts from the HTML document the URLs of consecutively linked documents that hold content of the same form, presents the resulting content group information to the user, and sets the target to be converted into a VoiceXML document according to the user's selection.

To select content groups in an HTML document reliably and as needed, the converter selects the target content group through interaction with the user in one of three ways: by using text contained in the web document, by separately presenting a visual representation of the web page that corresponds to the tree information generated by the converter's preprocessing unit, or by rules based on the number of nodes and the number of characters in the generated trees. Compared with conventional methods, the selection is reliable and accurately reflects the user's intent regardless of how the page is structured and implemented in HTML, and multiple selections can be made on a single web page.

Working only on the selected content group, the converter separates, extracts, and stores the list of content items in an intelligent, heuristic manner. At the same time it extracts and stores the content reached through the corresponding links, so that a single web page access handles the multiple web pages linked around that content.

To provide information by voice, the information delivery scenario is restructured, and VoiceXML documents are generated on that basis.

In addition, for some content, such as newspapers or job listing sites, the list spans many consecutively linked web pages of the same nature and structure. If the user also needs the content on those other pages, the information cannot be conveyed clearly unless all of it is extracted and reflected in the information delivery scenario.

When content is spread over several documents, each document carries a link to the next consecutive document, so accessing every document individually to extract its content is inefficient.

However, the HTML code alone does not reveal which link leads to the next consecutive document. The converter therefore obtains information about how the consecutive documents are linked through interaction with the user, and uses it to extract the content of multiple documents with a single access.

The present invention is a converter that converts web information written in HTML into VoiceXML so that it can be accessed through a voice interface. It selects the content to be converted using prior knowledge that the user has entered and that is stored in a database, or through interaction with the user, and collects and converts only the information the user needs. This makes a clear information delivery scenario possible, and because unnecessary information is not converted, the generated code is concise and clear. In addition, content spread across several web documents of the same nature can be extracted with a single access, which improves the efficiency of the converter.

Fig. 1 shows the structure of the HTMLtoVoiceXML converter according to the present invention. It consists of three main parts: a list extraction/VoiceXML generation unit (101), which accesses a web page, selects content groups on that page, and separates and extracts the list of content items; a detail extraction/VoiceXML generation unit (102), which extracts the detailed content linked from the content list and generates VoiceXML from it; and a consecutively linked document URL extraction unit (103), which handles consecutive web documents of the same nature and structure.

When the URL (104) of the HTML document to be converted is entered, the list extraction/VoiceXML generation unit (101) accesses the web page at that URL. The list HTML analysis module (105) extracts list content group information from the page's HTML document and passes it to the list content group selection module (106). This module offers three ways to select the list content group to be converted. It can show the user the page divided into list content groups so that the user picks a group on screen. It can let the user copy and paste text from the web document and then automatically select the list content group that contains that string. Or it can select a list content group according to a fixed rule, such as the tree containing the largest number of nodes. The selected list content group information for the page is stored in the list content group information DB (110).
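The text-based and rule-based selection modes described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the group records (dictionaries with `name`, `text`, and `node_count` keys) and the function names are hypothetical and do not come from the patent, and whitespace is normalized before matching so that a pasted string still matches text that wraps across lines in the page.

```python
def normalize(text):
    """Collapse runs of whitespace so pasted text matches wrapped page text."""
    return " ".join(text.split())


def select_by_text(groups, pasted):
    """Return every content group whose text contains the pasted string."""
    needle = normalize(pasted)
    return [g for g in groups if needle and needle in normalize(g["text"])]


def select_by_rule(groups):
    """Example rule from the description: the group with the most nodes."""
    return max(groups, key=lambda g: g["node_count"])


# Hypothetical content groups, as an analysis step might have produced them.
groups = [
    {"name": "menu", "text": "Home  News Sports", "node_count": 7},
    {"name": "headlines",
     "text": "Rain expected\n on Friday Markets close higher",
     "node_count": 24},
]

print([g["name"] for g in select_by_text(groups, "expected on  Friday")])
print(select_by_rule(groups)["name"])
```

Selecting by a string that the user sees on the page does not depend on how the page's markup is nested, which is the property the description relies on.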

When selecting list content groups, the user can also merge two or more content groups into a single content group. Once selection is complete, the list content extraction module (107) extracts the content items from the selected list content group. If an extracted list item carries link information, the content link information extraction module (109) extracts it (link URL and link text) and passes the link URL and link text to the detail extraction/VoiceXML generation unit (102) so that the detailed content can be extracted. The list VoiceXML document generation module (108) generates VoiceXML documents for the list content group from all items extracted by the list content extraction module (107). Depending on the number of extracted list items and the average string length of the items, the list may be split across several VoiceXML documents. The generated VoiceXML documents are stored on the Web Server (119).
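A minimal Python sketch of list VoiceXML generation is given below. The description does not state the exact rule for splitting a list across documents, so the limits used here (`max_items`, `max_chars`) are assumptions, as are the function names and the choice of a VoiceXML 2.0 `<menu>` with one `<choice>` per list item pointing at the item's detail document.

```python
from xml.sax.saxutils import escape, quoteattr


def split_items(items, max_items=9, max_chars=300):
    """Split (text, detail_url) pairs into pages.

    A new page starts when the current one holds max_items entries or when
    adding the next entry would exceed max_chars characters of spoken text.
    Both limits are illustrative assumptions.
    """
    pages, page, used = [], [], 0
    for text, url in items:
        if page and (len(page) >= max_items or used + len(text) > max_chars):
            pages.append(page)
            page, used = [], 0
        page.append((text, url))
        used += len(text)
    if page:
        pages.append(page)
    return pages


def list_vxml(title, page, next_page_url=None):
    """Render one page of list items as a VoiceXML 2.0 menu document."""
    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">',
        "  <menu>",
        f"    <prompt>{escape(title)}. <enumerate/></prompt>",
    ]
    for text, url in page:
        lines.append(f"    <choice next={quoteattr(url)}>{escape(text)}</choice>")
    if next_page_url:
        # Link to the VoiceXML document holding the next part of the list.
        lines.append(f"    <choice next={quoteattr(next_page_url)}>next</choice>")
    lines += ["  </menu>", "</vxml>"]
    return "\n".join(lines)


items = [(f"Item {i}", f"detail{i}.vxml") for i in range(1, 8)]
pages = split_items(items, max_items=3)
print(list_vxml("Job listings", pages[0], next_page_url="list2.vxml"))
```

Keeping each menu short matters more for voice than for a screen, because the listener has to hold the spoken choices in memory.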

The link information passed on by the content link information extraction module (109) is processed as follows. The detail HTML analysis module (111) extracts detail content group information and passes it to the detail content group selection module (112). The detail content group information selected by that module is passed to the detail content extraction module (113), which extracts the detailed content, and the detail VoiceXML generation module (114) generates a VoiceXML document for the detailed content from what was extracted. The detail VoiceXML document is generated as a single document for the detailed content, and the generated document is stored on the Web Server (119).

If the list content is spread over several documents of the same nature, and the content in documents other than the current one is also to be extracted, the prior knowledge input module (118) receives from the user the information on how the consecutive documents are linked and stores it in the prior knowledge DB (117). The HTML document at the initially entered URL is saved by the HTML document storage module (115), and its code is passed to the consecutively linked document URL extraction module (116). That module analyzes the HTML code together with the prior knowledge stored in the prior knowledge DB (117), extracts the URLs of the consecutively linked documents, and passes them to the list extraction/VoiceXML generation unit (101), which extracts the list content.
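The way one entered URL fans out into a chain of consecutively linked documents can be sketched as a simple loop. This is a sketch under assumptions, not the patented implementation: `fetch` and `find_next` are hypothetical callables supplied by the caller (one returns the HTML for a URL, the other applies the stored prior knowledge to find the next document's URL), and `last_document` stands for the user-specified position of the last document to extract.

```python
def collect_document_urls(start_url, fetch, find_next, last_document):
    """Follow next-document links from start_url, up to last_document pages.

    fetch(url) -> HTML string; find_next(html, url, position) -> URL or None.
    Stops early if no next link is found or a URL repeats.
    """
    urls = [start_url]
    seen = {start_url}
    while len(urls) < last_document:
        html = fetch(urls[-1])
        next_url = find_next(html, urls[-1], len(urls))
        if next_url is None or next_url in seen:
            break
        urls.append(next_url)
        seen.add(next_url)
    return urls


# Toy stand-ins for a website: each "page" names its successor.
pages = {"p1": "next=p2", "p2": "next=p3", "p3": "next=p4", "p4": ""}


def toy_find_next(html, url, position):
    return html.split("=")[1] if "=" in html else None


print(collect_document_urls("p1", pages.__getitem__, toy_find_next, 3))
```

Each URL in the result would then be handed to list extraction in turn, so the user enters a URL only once.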

Fig. 2 is a flowchart of the list HTML analysis module (105). When a URL is entered, the module accesses the website (201), retrieves the HTML document (202), converts it into a tree structure through a Document Object (203), and analyzes it sequentially from the root node (204) to extract content group information. First, it obtains each node's tag name and determines whether the node is subject to analysis (205). Nodes whose tag is a comment (!) tag, a SCRIPT tag, or a STYLE tag are excluded, which makes the analysis more efficient. If the node is subject to analysis, the module counts its child nodes and checks whether there are three or more (206). If so, it extracts the subtree rooted at the current node (207) and checks whether the subtree contains any text (208). If it does, the module obtains the names and indices of the current node and of the last node of the subtree (209) and stores them in the content group information array (210). It then moves the node pointer to the next node (211) and repeats from the analysis-target check until the pointer reaches the end of the tree (212), at which point the process ends (213).
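The traversal of Fig. 2 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `Node`, `TreeBuilder`, and `find_list_groups` names are hypothetical, the tree is built with the standard library's `html.parser` instead of a browser Document Object, node indices are positions in a pre-order walk, and the page is passed in as a string instead of being fetched from a URL.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}
SKIPPED_TAGS = {"script", "style"}  # comments never become nodes in this sketch


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []  # element children only
        self.text = ""      # text found directly inside this element


class TreeBuilder(HTMLParser):
    """Builds a simple element tree; tolerant of unclosed or stray tags."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this name, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data


def subtree_text(node):
    """All text under node, ignoring SCRIPT/STYLE (used only to test for text)."""
    if node.tag in SKIPPED_TAGS:
        return ""
    return node.text + "".join(subtree_text(c) for c in node.children)


def preorder(node):
    yield node
    for child in node.children:
        yield from preorder(child)


def find_list_groups(html, min_children=3):
    """Return (tag, index, last_tag, last_index) for each candidate group."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    nodes = list(preorder(builder.root))
    index_of = {id(n): i for i, n in enumerate(nodes)}
    groups = []
    for i, node in enumerate(nodes):
        if node is builder.root or node.tag in SKIPPED_TAGS:
            continue                      # not an analysis target (205)
        if len(node.children) < min_children:
            continue                      # fewer than three children (206)
        if not subtree_text(node).strip():
            continue                      # no text in the subtree (208)
        last = node
        while last.children:              # last node of the subtree (209)
            last = last.children[-1]
        groups.append((node.tag, i, last.tag, index_of[id(last)]))
    return groups


SAMPLE = (
    "<html><head><script>var a=1;</script></head><body>"
    "<ul><li><a href='/1'>One</a></li><li><a href='/2'>Two</a></li>"
    "<li><a href='/3'>Three</a></li></ul>"
    "<div><img><img><img></div></body></html>"
)
print(find_list_groups(SAMPLE))
```

In the sample, the `ul` qualifies because it has three children and contains text, while the `div` of images is skipped because its subtree has no text.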

Fig. 3 is a flowchart of the list content group selection module (106). The module receives the content group information (300) from the list HTML analysis module (105) and shows the web document to the user with a number displayed on each content group (301). The user selects the content groups to be converted by the numbers shown on screen. If two or more content groups are similar, the user selects their numbers as merge targets and merges them (302). A suitable title is then set for the selected content group (303). The selection step is repeated until content group selection is finished (304).
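The numbering, merging, and titling steps of Fig. 3 amount to a small amount of bookkeeping, sketched below in Python. The data layout (each group as a dictionary with an `items` list) and the function names are assumptions made for illustration, and the on-screen display of the numbered groups is left out.

```python
def number_groups(groups):
    """Assign display numbers 1, 2, 3, ... to the content groups (301)."""
    return {number: group for number, group in enumerate(groups, start=1)}


def merge_selected(numbered, numbers, title):
    """Merge the groups chosen by number into one titled group (302, 303)."""
    items = []
    for number in numbers:
        items.extend(numbered[number]["items"])
    return {"title": title, "items": items}


# Hypothetical groups found on a page; groups 1 and 3 hold similar content.
groups = [
    {"items": ["Headline A", "Headline B"]},
    {"items": ["Login", "Sitemap"]},
    {"items": ["Headline C"]},
]
numbered = number_groups(groups)
print(merge_selected(numbered, [1, 3], "Top stories"))
```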

Fig. 4 is a flowchart of the detail HTML analysis module (111). When the link URL and link text (400) extracted by the content link information extraction module (109) of Fig. 1 are received, the module accesses the website at the link URL (401), retrieves the HTML document (402), converts it into a tree structure through a Document Object (403), and analyzes it sequentially from the root node (404) to extract content group information. First, it obtains each node's tag name and determines whether the node is subject to analysis (405). Nodes whose tag is a comment (!) tag, a SCRIPT tag, or a STYLE tag are excluded, which makes the analysis more efficient. If the node is subject to analysis, the module obtains the number of its child nodes and their tag names (406) and determines whether at least half of the child nodes are BR or P tags (407). If so, it extracts the subtree rooted at the current node (408), obtains the current node's name and index and the text size of the subtree (409), and stores them in the content group array (410). It then moves the node pointer to the next node (411) and repeats from the analysis-target check until the pointer reaches the end of the tree (412), at which point the process ends (413).
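The detail-page heuristic of Fig. 4 can be sketched in Python as follows. To keep the example short it makes a simplifying assumption that the patent does not: the page is well-formed XHTML, so the standard library's `xml.etree.ElementTree` can parse it (its default parser also discards comments, which matches the exclusion of comment nodes). The function names are hypothetical, and node indices are positions in document order.

```python
import xml.etree.ElementTree as ET

SKIPPED_TAGS = {"script", "style"}
BREAK_TAGS = {"br", "p"}


def visible_text(elem):
    """Text under elem in document order, ignoring SCRIPT/STYLE content."""
    if elem.tag in SKIPPED_TAGS:
        return ""
    parts = [elem.text or ""]
    for child in elem:
        parts.append(visible_text(child))
        parts.append(child.tail or "")
    return "".join(parts)


def find_detail_groups(xhtml):
    """Return (tag, index, text_size) for nodes that look like body text."""
    root = ET.fromstring(xhtml)
    groups = []
    for index, elem in enumerate(root.iter()):   # document-order walk
        if elem.tag in SKIPPED_TAGS:
            continue                              # not an analysis target (405)
        children = list(elem)
        if not children:
            continue
        breaks = sum(1 for child in children if child.tag in BREAK_TAGS)
        if 2 * breaks >= len(children):           # at least half are BR or P (407)
            groups.append((elem.tag, index, len(visible_text(elem).strip())))
    return groups


SAMPLE = (
    "<html><body>"
    "<div id='nav'><a>Home</a><a>News</a><span>x</span></div>"
    "<div id='article'><p>First paragraph.</p><br/><p>Second.</p><img/></div>"
    "</body></html>"
)
print(find_detail_groups(SAMPLE))
```

The recorded text size gives a later selection step a way to prefer the largest block of running text, for example the article body over a short caption.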

Some content is presented as consecutively linked documents of the same format. To extract all such content at once, the URLs of the consecutively linked documents must be obtained and the content extracted from each of them. However, the HTML code alone does not reveal which link leads to the next consecutive document. The converter therefore asks the user for information such as the form of the link to the consecutive documents and its position relative to the last content item, and extracts the link information on the basis of that input.

Fig. 5 shows the schema through which the user enters prior knowledge about the links between consecutively linked documents. It consists of three stages. Input works like a Windows wizard: the user chooses among options step by step, which makes it easy to collect the information. Stage 1 specifies the form of the link that connects consecutive documents as it is shown in the document (502). There are three forms: text such as “Next” or “Previous”; an image such as an arrow or a triangle marker; and consecutive serial numbers such as 1, 2, 3, and so on. After the user has chosen the link form in stage 1, stage 2 collects the details of the link as it appears in the first (starting) document and in intermediate documents. If the link form chosen in stage 1 is text (503), the link text is entered (506), separately for the starting document (507) and for intermediate documents (508). If the link form is an image (504), the link position relative to the last content item is entered (509), again separately for the starting document (510) and for intermediate documents (511). If the link form is a serial number (505), the user enters whether the numbers run in ascending or descending order and what the starting number is (512). Finally, stage 3 specifies up to which document content is to be extracted, that is, the position of the last document (513).
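The three-stage schema can be represented as a small record, and the text and serial-number link forms can then be resolved against a page's anchors. The Python sketch below is illustrative only: the `LinkKnowledge` field names, the `AnchorCollector` class, and `next_document_url` are hypothetical, the image form is left unimplemented, and treating `start_number` as the number of the starting document is an interpretation of the description, not something it states.

```python
from dataclasses import dataclass
from html.parser import HTMLParser
from typing import Optional
from urllib.parse import urljoin


@dataclass
class LinkKnowledge:
    link_form: str                       # stage 1: "text", "image" or "serial"
    start_text: Optional[str] = None     # stage 2, text form, starting document
    middle_text: Optional[str] = None    # stage 2, text form, intermediate documents
    start_offset: Optional[int] = None   # stage 2, image form (not used below)
    middle_offset: Optional[int] = None
    ascending: bool = True               # stage 2, serial form
    start_number: int = 1
    last_document: int = 1               # stage 3: position of the last document


class AnchorCollector(HTMLParser):
    """Collects (href, link text) for every <a> element."""

    def __init__(self):
        super().__init__()
        self.anchors = []
        self._open = None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._open = [dict(attrs).get("href"), ""]

    def handle_data(self, data):
        if self._open is not None:
            self._open[1] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._open is not None:
            self.anchors.append((self._open[0], self._open[1].strip()))
            self._open = None


def next_document_url(html, base_url, knowledge, position):
    """URL of the document after `position` (1 = starting document), or None."""
    if position >= knowledge.last_document:
        return None
    if knowledge.link_form == "text":
        wanted = knowledge.start_text if position == 1 else knowledge.middle_text
    elif knowledge.link_form == "serial":
        step = 1 if knowledge.ascending else -1
        wanted = str(knowledge.start_number + step * position)
    else:
        raise NotImplementedError("image-form links are not covered by this sketch")
    collector = AnchorCollector()
    collector.feed(html)
    collector.close()
    for href, text in collector.anchors:
        if href and text == wanted:
            return urljoin(base_url, href)
    return None


knowledge = LinkKnowledge("text", start_text="Next", middle_text="Next", last_document=3)
page = '<a href="/jobs?page=1">Previous</a> <a href="/jobs?page=3">Next</a>'
print(next_document_url(page, "http://example.com/jobs?page=2", knowledge, 2))
```

Separate fields for the starting and intermediate documents mirror the schema, since a first page often lacks the “Previous” link that later pages carry.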

As described above, the HTMLtoVoiceXML converter of the present invention, which selects the conversion target through the user's prior knowledge and interaction, and its method of operation convert visual information written in HTML into VoiceXML documents so that it can be provided as spoken information. Because the content to be converted is selected through interaction with the user, only the information the user needs is provided. In addition, because content spread over several documents is extracted with a single access, the inefficiency of accessing each web document separately to extract its content is eliminated.

Fig. 1 shows the structure of the HTMLtoVoiceXML converter according to the present invention.

Fig. 2 is a flowchart of the list HTML analysis module of the present invention.

Fig. 3 is a flowchart of the list content group selection module of the present invention.

Fig. 4 is a flowchart of the detail HTML analysis module of the present invention.

Fig. 5 shows the prior knowledge input schema used to extract the URLs of consecutively linked documents in the present invention.

* Explanation of reference numerals for the main parts of the drawings *

101: list extraction/VoiceXML generation unit

102: detail extraction/VoiceXML generation unit

103: consecutively linked document URL extraction unit

104: URL

105: list HTML analysis module

106: list content group selection module

107: list content extraction module

108: list VoiceXML document generation module

109: content link information extraction module

110: list content group information DB

111: detail HTML analysis module

112: detail content group selection module

113: detail content extraction module

114: detail VoiceXML generation module

115: HTML document storage module

116: consecutively linked document URL extraction module

117: prior knowledge DB

118: prior knowledge input module

Claims

A list extraction/VoiceXML generation unit that, when the URL of a web page to be converted is entered by the user, accesses the web page at that URL, extracts list content group information from the HTML document of the page, stores it according to the user's selection, extracts link information from each content item, and generates a VoiceXML document for the list content corresponding to the selected and stored content information;

a detail extraction/VoiceXML generation unit that uses the link information passed from the list extraction/VoiceXML generation unit to access the linked web page, extracts and selects detail content group information from the HTML document of that page, and generates a VoiceXML document for the detailed content from the extracted content; and

a consecutively linked document URL extraction unit that analyzes the prior knowledge entered by the user and stored in a DB together with the code of the HTML document of the web page at the URL, extracts the URLs of consecutively linked documents, and passes the URL information of the consecutive web documents of the same nature and structure to the list extraction/VoiceXML generation unit so that VoiceXML documents are generated for the consecutive web documents;

an HTMLtoVoiceXML converter having a conversion target selection function through the user's prior knowledge and interaction, characterized by comprising the above units.

A list HTML analysis module for accessing a web page corresponding to the URL and extracting list content group information from an HTML document of the web page as a URL of a web page to be converted by the user is input; Shows the screen divided into the extracted list contents group to the user so that the user can select the conversion target list contents group through the screen divided into the list contents group, or the user copies and pastes the text information included in the web document. It provides a way to automatically select the content group that contains the selected string by selecting it through the insert process, or to select the content group according to a predetermined rule such as the tree with the largest number of nodes. A list content group selection module for storing group information in the list content group information DB; A list content extraction module for extracting content from the selected content group; A list VoiceXML document generation module for generating a VoiceXML document for a list of all extracted contents; And each content link information extraction module for transferring the link URL and the link text information to the following detail content extraction / VoiceXML generation unit to extract link information of each of the extracted list contents and extract detailed contents. VoiceXML generation unit;

A detailed content extraction / VoiceXML generation unit comprising: a detailed content HTML analysis module that extracts detailed content group information from the link information delivered by the per-content link information extraction module; a detailed content group selection module that selects from the extracted detailed content group information; a detailed content extraction module that extracts the detailed content from the selected detailed content group; and a detail VoiceXML generation module that generates a VoiceXML document for the extracted content; and

A continuously linked document URL extraction unit comprising: a prior knowledge input module that assists the user in entering prior knowledge; a prior knowledge DB that receives and stores, as prior knowledge, the link relationship information of the continuously linked documents entered by the user through the prior knowledge input module; an HTML document storage module that stores the HTML document at the first input URL; and a continuously linked document URL extraction module that extracts the URLs of the continuously linked documents by analyzing the prior knowledge stored in the prior knowledge DB and the HTML code transferred from the HTML document storage module, and that transfers them to the list extraction / VoiceXML generation unit so that the list content is extracted.
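The two automatic selection paths named for the list content group selection module (selecting the group that contains a pasted string, and selecting by a rule such as the tree with the largest number of nodes) can be illustrated with a minimal sketch. The `ContentGroup` record and both function names are hypothetical and are not taken from the patent; the sketch assumes each content group already carries its node count and its concatenated text.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ContentGroup:
    """Hypothetical record for one content group found on the page."""
    index: int       # position of the group in the analysed page
    node_count: int  # number of nodes in the group's subtree
    text: str        # concatenated text of the subtree


def _normalize(s: str) -> str:
    # Collapse runs of whitespace so pasted text matches page text.
    return " ".join(s.split())


def select_by_string(groups: List[ContentGroup], pasted: str) -> Optional[ContentGroup]:
    """Return the first group whose text contains the pasted string."""
    needle = _normalize(pasted)
    if not needle:
        return None
    for group in groups:
        if needle in _normalize(group.text):
            return group
    return None


def select_by_largest_tree(groups: List[ContentGroup]) -> Optional[ContentGroup]:
    """Rule-based selection: the group whose subtree has the most nodes."""
    return max(groups, key=lambda g: g.node_count, default=None)
```

Whitespace is normalized on both sides because text copied from a rendered page rarely matches the spacing of the underlying HTML exactly; this is a design choice of the sketch, not something the claim specifies.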

The HTMLtoVoiceXML converter of claim 2, wherein, when the user selects the list content group, two or more content groups selected by the user are merged and displayed as a single content group.

The HTMLtoVoiceXML converter of claim 2, wherein the list VoiceXML is generated as a plurality of VoiceXML documents, the number of which depends on the number of extracted list contents and the average string size of the list contents; the detailed content VoiceXML is generated as a single document; and a web server is provided for storing the generated VoiceXML documents.
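One way to derive the number of list VoiceXML documents from the item count and the average string size is sketched below. The per-document character budget `max_chars_per_doc`, the function names, and the choice of one `<prompt>` per item are assumptions made for illustration; the claim states only that the split depends on those two quantities.

```python
from typing import List
from xml.sax.saxutils import escape


def split_list_items(items: List[str], max_chars_per_doc: int = 200) -> List[List[str]]:
    """Split list items into chunks, one chunk per VoiceXML document.

    The chunk size is a character budget (an assumed parameter) divided
    by the average item length, so longer items mean more documents.
    """
    if not items:
        return []
    avg_len = sum(len(s) for s in items) / len(items)
    per_doc = max(1, int(max_chars_per_doc // max(avg_len, 1)))
    return [items[i:i + per_doc] for i in range(0, len(items), per_doc)]


def render_vxml(items: List[str]) -> str:
    """Render one chunk as a minimal VoiceXML 2.0 document."""
    prompts = "\n".join(f"      <prompt>{escape(t)}</prompt>" for t in items)
    return (
        '<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">\n'
        "  <form>\n"
        "    <block>\n"
        f"{prompts}\n"
        "    </block>\n"
        "  </form>\n"
        "</vxml>\n"
    )
```

Calling `[render_vxml(chunk) for chunk in split_list_items(items)]` would then yield one document string per chunk, ready to be stored on the web server.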

A method of operating the HTMLtoVoiceXML converter having the configuration of claim 2 and a conversion target selection function based on the user's prior knowledge and interaction,

wherein the processing of the list HTML analysis module comprises:

a first step of accessing the web page corresponding to the URL and extracting the HTML document of the web page when the user inputs the URL of the web page to be converted;

a second step of converting the extracted HTML document into a tree structure through a document object;

a third step of extracting content group information by analyzing the nodes sequentially, starting from the root node of the document;

a fourth step of checking the tag name of each node to determine whether the node is an analysis target, proceeding to the next node if it is not, and, if it is, determining whether the current node has three or more child nodes;

a fifth step of proceeding to the next node when the number of child nodes found in the fourth step is less than three, and extracting the subtree rooted at the current node when it is three or more;

a sixth step of determining whether the extracted subtree contains an internal character string, proceeding to the next node if it does not, and, if it does, obtaining and storing the names and indices of the current node and of the last node of the subtree;

a seventh step of moving the current node pointer to the next node; and

an eighth step of determining whether the current node pointer has reached the end of the tree, terminating if it has, and returning to the fourth step if it has not.
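The eight steps above can be sketched with Python's standard `html.parser`. This is a minimal illustration under stated assumptions, not the patented implementation: the `Node` and `TreeBuilder` classes, the function names, and the use of document-order position as the "index" are all hypothetical, and comments are dropped simply because `HTMLParser` ignores them by default.

```python
from html.parser import HTMLParser

VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}
SKIPPED_TAGS = {"script", "style"}  # not analysis targets


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []  # element children only
        self.text = ""      # text directly inside this element


class TreeBuilder(HTMLParser):
    """Step 2: turn the HTML source into a simple element tree."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.cur = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.cur)
        self.cur.children.append(node)
        if tag not in VOID_TAGS:
            self.cur = node

    def handle_endtag(self, tag):
        n = self.cur
        while n is not self.root and n.tag != tag:
            n = n.parent
        if n is not self.root:  # ignore stray end tags
            self.cur = n.parent

    def handle_data(self, data):
        self.cur.text += data


def preorder(node):
    yield node
    for child in node.children:
        yield from preorder(child)


def has_text(node):
    """Step 6 check: does the subtree contain any visible string?"""
    if node.tag in SKIPPED_TAGS:
        return False
    return bool(node.text.strip()) or any(has_text(c) for c in node.children)


def find_list_groups(source, min_children=3):
    builder = TreeBuilder()
    builder.feed(source)
    builder.close()
    order = list(preorder(builder.root))[1:]  # skip the synthetic root
    index = {id(n): i for i, n in enumerate(order)}
    groups = []
    for node in order:                         # steps 3, 7 and 8
        if node.tag in SKIPPED_TAGS:           # step 4: not a target
            continue
        if len(node.children) < min_children:  # steps 4 and 5
            continue
        if not has_text(node):                 # step 6
            continue
        last = node
        while last.children:                   # last node of the subtree
            last = last.children[-1]
        groups.append(((node.tag, index[id(node)]),
                       (last.tag, index[id(last)])))
    return groups
```

Each returned entry pairs the (name, index) of a candidate group's root with the (name, index) of the last node in its subtree. Because traversal continues into a subtree after it has been recorded, nested groups are reported as well, which matches the seventh step moving only to the next node.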

The method of claim 5,

wherein the processing of the list content group selection module comprises:

receiving the content group information from the list HTML analysis module and displaying the web document to the user with a number assigned to each content group;

selecting, by the user, the content group to be converted using the number displayed on the screen;

setting a title for the selected content group; and

determining whether the content group selection operation has ended, and repeating the selection operation until it has.

The method of claim 6,

wherein, in the step of selecting the content group, when two or more content groups are similar, they are merged by selecting the number of each content group as a merge target.
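Merging by displayed number can be sketched as follows. The dictionary layout (displayed number mapped to the group's items) and the function name `merge_content_groups` are hypothetical; the sketch assumes the groups have already been numbered for display and that the user supplies the numbers to merge together with a title.

```python
from typing import Dict, List


def merge_content_groups(groups: Dict[int, List[str]],
                         selected: List[int],
                         title: str) -> Dict[str, object]:
    """Merge the groups whose displayed numbers were selected."""
    unknown = [n for n in selected if n not in groups]
    if unknown:
        raise KeyError(f"unknown content group number(s): {unknown}")
    numbers = sorted(set(selected))  # keep page order, ignore repeats
    merged: List[str] = []
    for n in numbers:
        merged.extend(groups[n])
    return {"title": title, "source_groups": numbers, "items": merged}
```

Sorting the selected numbers keeps the merged items in the order they appear on the page regardless of the order in which the user picked them.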

The method of claim 5,

wherein the processing of the detailed content HTML analysis module comprises:

accessing the web site at the link URL and extracting its HTML document when the link URL and link text information extracted by the per-content link information extraction module are delivered;

converting the extracted HTML document into a tree structure through a document object;

extracting content group information by analyzing the nodes sequentially, starting from the root node of the document;

checking the tag name of each node to determine whether the node is an analysis target, proceeding to the next node if it is not, and, if it is, obtaining the number of child nodes of the current node and the tag names of the child nodes;

determining whether at least half of the child nodes are BR or P tags, proceeding to the next node if fewer than half are, and extracting the subtree rooted at the current node if at least half are;

obtaining the name and index of the current node and the text size of the subtree, and storing them in a content group array;

moving the current node pointer to the next node; and

determining whether the current node pointer has reached the end of the tree, terminating if it has, and returning to the step of determining whether the node is an analysis target if it has not.
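The BR/P heuristic for detailed content can be sketched on an already-built tree. The `Node` dataclass, the function names, and the use of pre-order position as the index are assumptions for illustration; the sketch also assumes that text separated by BR tags is stored as the direct text of the parent node, and it skips childless nodes because "at least half of zero children" carries no information.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple

SKIPPED_TAGS = {"script", "style"}  # not analysis targets
BREAK_TAGS = {"br", "p"}


@dataclass
class Node:
    """Hypothetical tree node: tag name, direct text, element children."""
    tag: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def preorder(node: Node) -> Iterator[Node]:
    yield node
    for child in node.children:
        yield from preorder(child)


def text_size(node: Node) -> int:
    """Number of visible characters in the subtree rooted at node."""
    if node.tag in SKIPPED_TAGS:
        return 0
    return len(node.text.strip()) + sum(text_size(c) for c in node.children)


def find_detail_groups(root: Node) -> List[Tuple[str, int, int]]:
    """Return (tag name, index, text size) for each candidate group."""
    groups = []
    for index, node in enumerate(preorder(root)):
        if node.tag in SKIPPED_TAGS or not node.children:
            continue
        breaks = sum(1 for c in node.children if c.tag in BREAK_TAGS)
        if 2 * breaks >= len(node.children):  # at least half are BR or P
            groups.append((node.tag, index, text_size(node)))
    return groups
```

Storing the text size with each candidate lets a later selection step prefer, for example, the candidate with the most text, although the claim itself does not say how the stored size is used.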

The method according to claim 5 or 8,

wherein a node whose tag is a comment tag, a SCRIPT tag, or a STYLE tag is excluded from the analysis targets in order to increase the analysis speed.

The method of claim 5,

wherein, for content displayed across continuously linked documents of the same format, the URLs of the continuously linked documents are obtained and the content is extracted from each corresponding document, so that the content is extracted all at once.

The method of claim 10,

wherein, in order to distinguish which link is the link to a continuously linked document, the form of the link and its position relative to the last content are received from the user, and the link information is extracted based on the received information.

The method according to claim 5 or 11,

wherein the process of representing, as a schema, the prior knowledge received from the user about the link information of the continuously linked documents comprises:

specifying the form of the link by which the continuously linked documents displayed in the document are connected;

receiving detailed information on the link displayed in the first (start) document and in the intermediate documents; and

specifying the position of the last document, that is, up to which document extraction is to be performed.

The method of claim 12,

wherein, in the step of specifying the link form,

the user selects one of the following link forms: a serial number such as 1, 2, 3; text; or an image such as an arrow or a triangle.

The method of claim 12,

wherein the step of receiving the detailed information comprises: when the selected link form is text, receiving the link text of the first (start) document and the link text of the intermediate documents; when the link form is an image, receiving the position of the link relative to the last content, with the link position shown in the start document and the link position shown in the intermediate documents entered separately; and when the link form is a serial number, receiving whether the numbers are in ascending or descending order and the start number.
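The prior-knowledge schema described in the last three claims, and its use for the serial-number link form, can be sketched as below. Every name here (`LinkType`, `ContinuousLinkKnowledge`, its fields, `serial_page_urls`) is hypothetical, and the sketch assumes that `last_document` is the serial number of the last document to extract and that the page's links are already available as (text, href) pairs.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Tuple


class LinkType(Enum):
    SERIAL = "serial"  # 1, 2, 3, ...
    TEXT = "text"      # a fixed link text
    IMAGE = "image"    # an arrow, a triangle, ...


@dataclass
class ContinuousLinkKnowledge:
    """Hypothetical schema for the user's prior knowledge."""
    link_type: LinkType
    last_document: int                         # where extraction stops
    start_link_text: Optional[str] = None      # TEXT: link in start document
    middle_link_text: Optional[str] = None     # TEXT: link in intermediate documents
    start_link_offset: Optional[int] = None    # IMAGE: position relative to last content
    middle_link_offset: Optional[int] = None   # IMAGE: same, intermediate documents
    ascending: bool = True                     # SERIAL: numbering direction
    start_number: Optional[int] = None         # SERIAL: first number


def serial_page_urls(links: List[Tuple[str, str]],
                     knowledge: ContinuousLinkKnowledge) -> List[str]:
    """Pick the hrefs of numbered links from start_number to last_document."""
    if knowledge.link_type is not LinkType.SERIAL or knowledge.start_number is None:
        raise ValueError("serial-number knowledge with a start number is required")
    step = 1 if knowledge.ascending else -1
    wanted = range(knowledge.start_number, knowledge.last_document + step, step)
    by_number = {}
    for text, href in links:
        label = text.strip()
        if label.isascii() and label.isdigit():
            by_number.setdefault(int(label), href)  # first occurrence wins
    # Numbers with no link on the page (often the current page) are skipped.
    return [by_number[n] for n in wanted if n in by_number]
```

The text and image forms would need their own extraction functions driven by the corresponding fields; only the serial-number case is worked through here.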