WO2006046323A1

WO2006046323A1 - Internet information collection device, program, and method

Info

Publication number: WO2006046323A1
Application number: PCT/JP2005/006919
Authority: WO
Inventors: Masanobu Masuda
Original assignee: Fujitsu Limited
Priority date: 2004-10-28
Filing date: 2005-04-08
Publication date: 2006-05-04
Also published as: JP4507206B2; JPWO2006046323A1

Abstract

An Internet information collection device includes a page analysis unit that analyzes a web page on the Internet opened on the screen by a page read unit and extracts event operation tag statements that dynamically generate link information in accordance with events generated by user operations. An event generation unit generates events for the event operation tag statements extracted by the page analysis unit, and a link information detection unit detects and stores web information on the link destination from the page transition caused by the link information thus generated.

Description

Specification

Internet information collecting apparatus, program and method

Technical field

[0001] The present invention relates to an Internet information collection apparatus, program, and method for collecting link destination web information from a web page that has been opened on a screen.

Background art

[0002] In recent years, there have been plans to construct a web library system that saves web pages that are published only on the Internet and are updated or deleted within a short time, and makes them publicly available. This requires web archiving technology that collects and stores information resources on the Internet.

[0003] In conventional web archiving, a website on the Internet holds link information to other content in its web pages as hyperlinks, and the method taken is to determine the transition to the next page from these hyperlinks and collect the information of the related pages.

[0004] Conventionally, web robots are known as a means of automatically collecting content on the Internet. A web robot collects link information by analyzing the HTML documents of web pages and transitions through web pages hierarchically to collect content, enabling users to search and browse web page information previously published on the Internet.

Patent Document 1: JP 2002-073609

Patent Document 2: Japanese Patent Laid-Open No. 2003-345826

Disclosure of the invention

Problems to be solved by the invention

[0005] Meanwhile, recent Internet content increasingly uses what is known as dynamic HTML, an extended specification of HTML that allows a page to interact with the user, and a growing number of HTML documents dynamically generate external links by means of scripts embedded in the HTML document.

[0006] For example, a selection menu displayed on a web page presents options “1” to “3” to the user, generates a different URL as link destination information according to the option the user selects, and transitions to that web page.

[0007] However, the conventional method of collecting link information by analyzing the HTML document of a web page cannot detect link information generated by user operations in interactive content, so there is a problem that many links are omitted from collection.

[0008] For example, in the case of a web page containing code that generates a link destination URL according to the option selected by the user, the transition destination URL cannot be determined from the code written on the web page alone. There is therefore a problem that the web information of the transition destination cannot be collected.

[0009] Of course, an operator can detect link information by operating buttons or selecting menu items while the web page is open, but this requires laborious manual operation and takes too much time.

[0010] An object of the present invention is to provide an Internet information collection device, program, and method for automatically collecting link information generated by user operations without omission.

Means for solving the problem

[0011] The present invention provides an Internet information collection device. The Internet information collection device of the present invention comprises:

a page browsing unit (browser) that acquires a web page on the Internet and opens it on a screen;

a page analysis unit that analyzes the web page opened by the page browsing unit and extracts event operation tag statements that dynamically generate link information in response to events generated by user operations;

an event generation unit that generates events for the event operation tag statements extracted by the page analysis unit; and

a link information detection unit that detects and stores link destination web information from the page transition caused by the link information generated by the events that the event generation unit generates.

[0012] Here, the link information detection unit further detects and stores link destination web information from the proxy server accessed by the page browsing unit.

[0013] The page analysis unit extracts input statements from the range specified by a form statement among the tag statements that construct the web page, and the event generation unit sequentially generates all the events defined for each input statement, so that link information is generated by the valid events among them.

[0014] In addition, the page analysis unit extracts, from the range specified by a form statement among the tag statements that construct the web page, a select statement that presents options to the user and an input statement that requires user operation, and the event generation unit generates events for the input statement while changing the selected option of the select statement.

[0015] Specifically, from the range specified by the form tag among the tag statements that construct the web page, the page analysis unit extracts an input tag that is a child tag of the form tag, a select tag that is a sibling tag of the input tag and creates a selection list, and multiple option tags that are child tags of the select tag and indicate the contents of the selection list. The event generation unit generates events for the input tag while changing the selected option tag.

[0016] In this case as well, the event generation unit sequentially generates all the events defined for the input tag, and generates link information based on the valid events therein.

[0017] The link information detection unit detects and saves all link information of the web pages transitioned to when events are generated for the event operation tag statements of the currently opened web page, then opens another web page on the screen and repeats the process of acquiring and saving the link information of the web pages transitioned to when events are generated for its event operation tag statements.

[0018] The link information detection unit detects the link information of the destination web page, without performing the page transition, from communication event information notified before communication to the link destination starts.

[0019] The present invention provides an Internet information collection program. The Internet information collection program of the present invention causes a computer to execute:

a page browsing step of acquiring a web page on the Internet and opening it on a screen;

a page analysis step of analyzing the web page opened in the page browsing step and extracting event operation tag statements that dynamically generate link information in response to events generated by user operations;

an event generation step of generating events for the event operation tag statements extracted in the page analysis step; and

a link information detection step of detecting and saving link destination web information from the page transition caused by the link information generated by the events generated in the event generation step.

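[0020] The present invention provides a method for collecting Internet information. The Internet information collection method of the present invention comprises a page browsing step, a page analysis step, an event generation step, and a link information detection step corresponding to those of the program described above.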

[0021] The details of the Internet information collection program and method of the present invention are basically the same as those of the Internet information collection apparatus of the present invention.

Effect of the invention

[0022] According to the present invention, for user operations that require mouse or keyboard operation of buttons or selection lists on a web page opened by the page browsing unit, the page transition to a URL that is automatically generated by executing a script statement or the like in response to the resulting event is realized as a pseudo operation by having an application generate the event. This makes it possible to detect link information reached through such user operations, which cannot be detected by analyzing the HTML document, and to collect the web information.

[0023] For the link destination content, link information is likewise detected by pseudo operations in which the application generates events, and by repeating this it becomes possible to collect all the information published on the Internet.

[0024] Further, for events that cannot be generated by a pseudo operation, such as the mouse passing over an element, the link destination URL information is stored in the proxy server accessed by the browser. By acquiring that URL, web information on the Internet can be collected from the opened web page without omission.

Brief Description of Drawings

[0025] [FIG. 1] Block diagram of an Internet information collection device according to the present invention.

[FIG. 2] Block diagram of the hardware environment of a computer that implements the Internet information collection device of FIG. 1.

[FIG. 3] Explanatory diagram of a web page on which form parts subject to event generation according to the present invention are arranged.

[FIG. 4] Explanatory diagram of the HTML source statement that constructs the web page of FIG. 3.

[FIG. 5] Explanatory diagram of the DOM tree obtained by DOM parsing of the HTML source statement of FIG. 4.

[FIG. 6] Explanatory diagram of the list of events generated for the A tag.

[FIG. 7] Explanatory diagram of HTML describing an event that triggers a script.

[FIG. 8] Explanatory diagram of the event management table of FIG. 1.

[FIG. 9] Explanatory diagram of the HTML source statement of the event execution method provided by Internet Explorer (R).

[FIG. 10] Explanatory diagram of a web page on which a selection list and an operation button subject to event generation according to the present invention are arranged.

[FIG. 11] Explanatory diagram of the HTML source statement that constructs the web page of FIG. 10.

[FIG. 12] Explanatory diagram of the DOM tree obtained by DOM parsing of the HTML source statement of FIG. 11.

[FIG. 13] Explanatory diagram of before-navigation, which is event information including the link destination URL that Internet Explorer (R) notifies before communication.

[FIG. 14] Explanatory diagram of the link list table of FIG. 1.

[FIG. 15] Flowchart of the Internet information collection processing according to the present invention.

FIG. 16 is a flowchart of the link information detection process of FIG.

FIG. 17 is a flowchart of link information detection processing following FIG.

[FIG. 18] Block diagram of another embodiment of the Internet information collection device according to the present invention.

[FIG. 19] Explanatory diagram of URLs that cannot be extracted by the Internet information collection device of FIG. 1.

[FIG. 20] Explanatory diagram of processing operation of the embodiment of FIG.

[FIG. 21] Flowchart of the Internet information collection processing in the embodiment of FIG. 18.

Best mode for carrying out the invention

[0026] FIG. 1 is a block diagram showing an embodiment of the functional configuration of an Internet information collection device according to the present invention. In FIG. 1, the Internet information collection device 10 of the present invention is implemented by, for example, a computer and can connect via the Internet 12 to websites 14-1, 14-2, and 14-3 as information collection destinations.

[0027] The Internet information collection device 10 is provided with a communication control unit 16 and an application execution environment 18. The communication control unit 16 performs communication control for browsing web pages on the websites 14-1 to 14-3 via the Internet 12.

[0028] The application execution environment 18 is realized by executing a program by a computer, and includes a browser 20, a page analysis unit 22, an event generation unit 24, a link information detection unit 26, an event management table 28, a link list table 30, and a content acquisition unit 32.

[0029] The browser 20 provided in the application execution environment 18 of the Internet information collection device 10 functions as the page browsing unit: it acquires a web page of a website, for example the website 14-1, via the Internet 12 and opens it on a screen.

[0030] The page analysis unit 22 analyzes the web page opened by the browser 20, which functions as the page browsing unit, and extracts event operation tag statements that dynamically generate link information in response to events generated by user operations.

[0031] An event operation tag statement is a tag statement, placed in the HTML source statement that constructs the web page, that builds a form part such as a radio button or a selection list requiring mouse or keyboard operation. Specifically, the form statement indicated by the <FORM> tag is extracted.

[0032] The event generation unit 24 generates, for the event operation tag statements extracted by the page analysis unit 22, events that execute the script statements that dynamically generate a link destination URL in response to a user operation. The event management table 28 stores a list of the events generated by the event generation unit 24 in association with the tags for which the events are generated.

[0033] Generating an event for an event operation tag statement by the event generation unit 24 amounts to a pseudo operation that behaves in the same way as when a user operates a form part such as a button or a selection list arranged on the web page with a mouse or keyboard.

[0034] The link information detection unit 26 detects, from the page transition caused by the link information that is generated when an event generated by the event generation unit 24 executes the script statement, the web information of the link destination, that is, the link destination URL, and saves it.

[0035] When the collection of link destination URLs is complete, the content acquisition unit 32 sequentially takes the URLs from the link list table 30, connects to each link destination website, acquires its web page, and saves it to a database.

[0036] The Internet information collection device 10 of the present invention in FIG. 1 is realized by, for example, the hardware resources of a computer as shown in FIG. 2. In the computer of FIG. 2, a RAM 102, a hard disk controller (software) 104, a floppy disk driver (software) 110, a CD-ROM driver (software) 114, a mouse controller 118, a keyboard controller 122, a display controller 126, and a communication board 130 are connected to the bus 101 of the CPU 100.

[0037] The hard disk controller 104 is connected to the hard disk drive 106, on which the Internet information collection program of the present invention is loaded. When the computer starts, the necessary program is read from the hard disk drive 106, expanded in the RAM 102, and executed by the CPU 100.

[0038] A floppy disk drive (hardware) 112 is connected to the floppy disk driver 110 and can read from and write to a floppy disk (R). A CD drive (hardware) 116 is connected to the CD-ROM driver 114 and can read data and programs stored on a CD.

[0039] The mouse controller 118 transmits input operations of the mouse 120 to the CPU 100. The keyboard controller 122 transmits input operations of the keyboard 124 to the CPU 100. The display controller 126 performs display on the display unit 128. The communication board 130 uses a communication line 132, which may be wireless, to communicate with website servers via a network such as the Internet.

FIG. 3 is an explanatory diagram of a web page on which form parts that are subject to event generation in the present invention are arranged. In the web page 34 of FIG. 3, a link URL 36 is arranged, and an operation button 38 and an operation button 40 are arranged below it.

[0041] When the user clicks the link URL 36 on the web page 34 with a mouse, for example, the screen transitions to the web page “a.html”. When the user presses the operation button 38, the screen transitions to the web page “b.html”, and when the user presses the operation button 40, the screen transitions to the web page “c.html”.

FIG. 4 is an explanatory diagram of an HTML source sentence for constructing the web page 34 of FIG. In the HTML source sentence 42 in FIG. 4, the link URL 36 in the web page 34 in FIG. 3 jumps to “a.html” by the function of the A tag on the 11th line. The link destination “a.html” by the A tag on the 11th line of the HTML source sentence 42 can be directly detected by analyzing the HTML source sentence 42 as in the past.

[0043] The operation buttons 38 and 40 on the web page 34 in FIG. 3 are constructed by a form statement in the range enclosed by the <FORM> tags on the 12th to 15th lines of the HTML source statement 42 in FIG. 4. In this form statement, when the user presses the operation button 38 on the web page 34 in FIG. 3, for example, the “onclick” event on the 13th line of the HTML source statement 42 occurs, and the “jump()” function defined there is called.

[0044] In the script statements on the 3rd to 8th lines, this jump function uses the id attribute value of the INPUT tag to create the link destination URL and performs the page transition by changing the location object.

[0045] In this way, for a tag statement in a form statement that dynamically generates the link destination URL by a script statement when an event occurs in response to a user operation, the link destinations “b.html” and “c.html” cannot be detected even if the HTML source statement 42 itself is analyzed.
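The limitation described here can be illustrated with a minimal Python sketch. The page below is a hypothetical reconstruction based only on the description of FIG. 4 (the figure itself is not reproduced here), and `HrefCollector` and `static_links` are illustrative names, not part of the invention. A purely static scan of the source finds the A tag link but none of the URLs that the script would build.

```python
from html.parser import HTMLParser

# Hypothetical page modeled on the description of FIG. 4 (assumed content).
SAMPLE_PAGE = """
<html><head><script>
function jump(id) { location.href = id + ".html"; }
</script></head>
<body>
<a href="a.html">link</a>
<form>
<input type="button" id="b" value="B" onclick="jump(this.id)">
<input type="button" id="c" value="C" onclick="jump(this.id)">
</form>
</body></html>
"""


class HrefCollector(HTMLParser):
    """Collects href values of <a> tags, as static source analysis would."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        if tag == "a" and href:
            self.hrefs.append(href)


def static_links(html):
    collector = HrefCollector()
    collector.feed(html)
    return collector.hrefs


# Only the A tag link is visible; b.html and c.html exist only after the script runs.
print(static_links(SAMPLE_PAGE))
```

The two button destinations never appear as literal strings in the source, which is why an event has to be generated to observe them.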

[0046] Therefore, in the present invention, the page analysis unit 22 provided in the application execution environment 18 of the Internet information collection device 10 in FIG. 1 analyzes the HTML source statement 42 shown in FIG. 4 and constructs the DOM tree 44 shown in FIG. 5, which the event generation unit 24, functioning as an application, can operate on. The event generation unit 24 then directly generates the onclick event for each INPUT tag, the script statement is executed, and page transitions to the link destination URLs “b.html” and “c.html” take place. The link destination URL is detected as the link destination information associated with each page transition.

[0047] Here, the page analysis unit 22 shown in FIG. 1 has an SDK (Software Development Kit) for the browser 20. The SDK is a tool for building software using an application programming interface (hereinafter referred to as “API”).

[0048] Specifically, the SDK has a DOM parser that parses the HTML source statement 42 of the web page opened by the browser 20. Parsing the HTML source statement with this DOM parser generates the document object model (DOM) having the DOM tree 44 shown in FIG. 5. The document object model represented by the DOM tree 44 is an API for accessing HTML tag statements as a collection of tree-structured node objects.

[0049] By generating the document object model as the DOM tree 44 shown in FIG. 5, the event generation unit 24 can, as a program, directly generate the onclick event for an INPUT tag within the form tag, and executing the script statement then generates the link destination URL and causes the page transition.

[0050] That is, the onclick event of the INPUT tags on the 13th and 14th lines of the form statement in the HTML source statement 42 in FIG. 4 is originally an event that occurs when the user presses the operation buttons 38 and 40 shown on the web page 34 in FIG. 3, and it is the mechanism by which the JavaScript function in the script statement is called.

[0051] In the present invention, by contrast, the DOM parser (DOM analysis means) in the software development kit (SDK) provided in the page analysis unit 22 builds the document object model (DOM), an API for accessing a set of node objects having a tree structure as shown in FIG. 5, so that the event generation unit 24 can, as a program, directly generate the onclick event, execute the script statement, and generate “b.html” and “c.html” to cause the page transitions. In other words, the program simulates the button press by the user.

[0052] In this embodiment, “onclick” is generated as the valid event for the INPUT tags on the 13th and 14th lines in FIG. 4, but various other types of events are also used in response to user operations.

FIG. 6 is an explanatory diagram of the A tag occurrence event list 46 showing the types of events defined corresponding to the A tag used for the link setting on the 11th line in FIG.

[0054] As shown in the A tag occurrence event list 46, 17 types of events are defined for the A tag alone. The same types of events are defined for the INPUT tags shown on the 13th and 14th lines of FIG. 4.

[0055] Of the events in the A tag occurrence event list 46, only an event described as in the script activation HTML source statement 48 in FIG. 7 activates a script statement, here when the onclick event is generated. For the other events, even if the event is generated, it is discarded immediately.

[0056] The present invention exploits this mechanism, in which HTML tag statements automatically discard unnecessary events: for each tag extracted from the form statement as a target of event generation, all the events in the list defined for that tag are generated, so that only the events defined as in FIG. 7 are actually executed.

[0057] By generating all the events in the event list defined for the target tag in this way, and noting which event actually executed a script statement, the script statement can be executed without knowing in advance which specific event is valid.

[0058] In addition, the valid events identified by generating all the events and executing the script statements are registered in the event management table 28 in association with the tag name, as shown in FIG. 8. The valid events registered for each tag name in the event management table 28 can be used as statistical information when generating events for subsequent tags; otherwise, all the events corresponding to the tag are processed.
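The "fire every defined event and keep the ones that do something" idea can be sketched in Python as follows. This is a simplified model under stated assumptions: `Element`, `fire_all_events`, and the short `DEFINED_EVENTS` list are hypothetical stand-ins (FIG. 6 lists 17 events for the A tag, which are not reproduced here), and a handler returning a URL stands in for a script statement that causes a page transition.

```python
# Hypothetical subset of the events defined for a tag (assumed for illustration).
DEFINED_EVENTS = ["onclick", "ondblclick", "onfocus", "onblur", "onkeydown", "onmouseover"]


class Element:
    """Minimal stand-in for a DOM node: a tag name plus handlers keyed by event name."""

    def __init__(self, tag, handlers):
        self.tag = tag
        self.handlers = handlers

    def fire_event(self, event):
        """Run the handler if one is defined; otherwise the event is simply discarded."""
        handler = self.handlers.get(event)
        if handler is None:
            return None
        return handler()


def fire_all_events(element, event_table):
    """Fire every defined event and register the ones that produced a URL."""
    urls = []
    for event in DEFINED_EVENTS:
        url = element.fire_event(event)
        if url is not None:
            urls.append(url)
            # Plays the role of the event management table: tag name -> valid events.
            event_table.setdefault(element.tag, set()).add(event)
    return urls


event_table = {}
button = Element("input", {"onclick": lambda: "b.html"})
print(fire_all_events(button, event_table))
print(event_table)
```

Events without a handler cost one dictionary lookup and are dropped, which mirrors the discard behavior the text relies on.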

[0059] Here, when Internet Explorer (R) is used as the browser 20 in FIG. 1, a method called “fireEvent” is provided, as shown in FIG. 9, as a way for a program to generate an event directly.

[0060] As shown in the fireEvent HTML source statement 50 of FIG. 9, the fireEvent method makes it possible to issue events directly to any tag, for example “onfocus”, which sets focus, and “onblur”, which releases it, as on the 3rd and 4th lines. As a result, the link destination URL is generated by executing the script statement, just as in a user operation, and the page transition can be performed.

FIG. 10 is an explanatory diagram of a web page in which a selection list to be an event generation target and operation buttons are arranged in the present invention. In FIG. 10, a map display button 54 is arranged on the web page 52. Corresponding to the map display button 54, a selection list 56 is provided, and the selection list 56 has three choices “Tokyo”, “Kanagawa”, and “Shizuoka”.

FIG. 11 is an explanatory diagram of an HTML source sentence 58 that constructs the web page 52 of FIG. In the web page 52 of FIG. 10, the jump destination when the map display button 54 is pressed is changed depending on the selection location of the selection list 56.

[0063] That is, when the map display button 54 is pressed while “Tokyo” is selected in the selection list 56, a jump is made to “Tokyo.html” as the link destination. If the map display button 54 is pressed while “Kanagawa” is selected, the jump is to “Kanagawa.html”, and if it is pressed while “Shizuoka” is selected, the jump is to “Shizuoka.html”.

[0064] In the HTML source statement 58 of FIG. 11 that constructs this web page 52, form parts such as the map display button 54 and the selection list 56 are built by the form statement enclosed in <FORM> tags on the 13th to 20th lines. This form statement includes a <SELECT> tag on the 14th line and an <INPUT> tag on the 19th line, and these tags are both child tags of the <FORM> tag.

[0065] In this example, when the map display button 54 built by the <INPUT> tag on the 19th line is pressed, the select statement of its sibling <SELECT> tag on the 14th to 18th lines can be in any of three states.

[0066] Therefore, when the <INPUT> tag representing the map display button 54 is detected, the <OPTION> tags on the 15th to 17th lines, which represent the three patterns of the sibling <SELECT> tag, are obtained, and it can be determined that there are three selection patterns.

[0067] Therefore, to generate pseudo events for the <INPUT> tag of the map display button 54, the process loops three times, each time changing the option selected among the <OPTION> tags of the <SELECT> tag and then generating an event for the <INPUT> tag.

[0068] FIG. 12 is an explanatory diagram of the DOM tree 60 obtained by DOM parsing of the HTML source statement of FIG. 11 in the page analysis unit 22 shown in FIG. 1. Inside the <FORM> tag, the <INPUT> tag and the <SELECT> tag are siblings, with the <SELECT> tag constructing the selection list 56. Under the <SELECT> tag, three <OPTION> tags are arranged, corresponding to the three options Tokyo, Kanagawa, and Shizuoka.
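As a rough illustration of this analysis step, the Python sketch below walks a form and records its OPTION values and the event attributes on its INPUT tags. It uses the standard library's streaming `html.parser` rather than a real DOM parser, and `SAMPLE_FORM`, `FormAnalyzer`, and `analyze_form` are hypothetical names; the markup is an assumed reconstruction from the description of FIG. 11, not the figure itself.

```python
from html.parser import HTMLParser

# Hypothetical form modeled on the description of FIG. 11 (assumed content).
SAMPLE_FORM = """
<form>
<select id="area">
<option value="Tokyo">Tokyo</option>
<option value="Kanagawa">Kanagawa</option>
<option value="Shizuoka">Shizuoka</option>
</select>
<input type="button" value="Show map" onclick="jump()">
</form>
"""


class FormAnalyzer(HTMLParser):
    """Records option values and event-bearing inputs found inside a <form>."""

    def __init__(self):
        super().__init__()
        self.in_form = False
        self.options = []
        self.inputs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.in_form = True
        elif self.in_form and tag == "option":
            self.options.append(attrs.get("value"))
        elif self.in_form and tag == "input":
            # Keep only the attribute names that look like event handlers, e.g. onclick.
            self.inputs.append(sorted(name for name in attrs if name.startswith("on")))

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False


def analyze_form(html):
    analyzer = FormAnalyzer()
    analyzer.feed(html)
    return analyzer.options, analyzer.inputs


print(analyze_form(SAMPLE_FORM))
```

The number of collected option values gives the number of selection patterns, here three.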

[0069] That is, the processing basically follows this procedure.

• Scan all the tags in the HTML source statement 58 of FIG. 11.

• Determine whether each tag is a <FORM> tag.

• Scan all child tags within the range of the <FORM> tag, such as <INPUT> and <SELECT>, and check the state of the selection patterns of sibling tags.

• For each of the patterns obtained from <SELECT>, change the state of the sibling tag according to the pattern and then issue an event to the current child tag, <INPUT>, so that the script statement on the 3rd to 10th lines is executed and the link destination URL is generated.
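The last step of this procedure, changing the selection and then issuing the event, can be sketched in Python as below. `SelectList`, `collect_urls_for_all_options`, and `jump` are hypothetical stand-ins: `jump` plays the role of the page's script, which is assumed here to build the URL from the currently selected option.

```python
class SelectList:
    """Minimal stand-in for a <SELECT> element with a current selection."""

    def __init__(self, options):
        self.options = options
        self.selected_index = 0

    @property
    def value(self):
        return self.options[self.selected_index]


def collect_urls_for_all_options(select, on_click):
    """Select each option in turn, fire the button's handler, and collect the URLs."""
    urls = []
    for index in range(len(select.options)):
        select.selected_index = index  # change the state of the sibling <SELECT>
        urls.append(on_click())        # pseudo click on the <INPUT> button
    return urls


select = SelectList(["Tokyo", "Kanagawa", "Shizuoka"])


def jump():
    # Assumed script behavior: the destination depends on the current selection.
    return select.value + ".html"


print(collect_urls_for_all_options(select, jump))
```

One event per selection pattern is enough in this sketch because the handler is known; the text combines this loop with firing every defined event when the valid event is not known in advance.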

[0070] FIG. 13 is an explanatory diagram of the before-navigation 62, event information that includes the link destination URL and is notified before communication when Internet Explorer (R), used by the link information detection unit 26 of FIG. 1, starts communication access to an arbitrary web page.

[0071] That is, in Internet Explorer (R), BeforeNavigate is known as an event that is notified before communication to the website starts when a web page is browsed by specifying a URL.

In this before-navigation 62, as shown in FIG. 13, the link destination URL is set in the argument “url” on the third line. In the link information detection unit 26 of the present invention, the URL of the link destination is detected from the argument “url” in the event information of the before navigation 62.

[0073] Furthermore, if nothing is done after the before-navigation 62 is notified, the transition to the link page proceeds. Since the link destination URL has already been detected at that point, the communication is canceled by setting “True” in “Cancel”, the final parameter shown on the 8th line of the before-navigation 62. As a result, only the link destination URL can be detected and acquired, without a page transition.

[0074] FIG. 14 is an explanatory diagram of the link list table 30 of FIG. 1, in which the link destination URLs detected by the link information detection unit 26 are stored.
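The record-then-cancel pattern can be modeled with a small Python sketch. This is not the Internet Explorer API: `FakeBrowser` and `record_and_cancel` are hypothetical, and the hook's boolean return value is an assumed simplification standing in for the “Cancel” parameter described in the text.

```python
class FakeBrowser:
    """Hypothetical browser stand-in that notifies a hook before each navigation."""

    def __init__(self, before_navigate):
        self.before_navigate = before_navigate
        self.current_url = "start.html"

    def navigate(self, url):
        cancel = self.before_navigate(url)  # the hook returns True to cancel
        if not cancel:
            self.current_url = url


detected = []


def record_and_cancel(url):
    """Save the destination URL, then cancel so that no page transition happens."""
    detected.append(url)
    return True


browser = FakeBrowser(record_and_cancel)
browser.navigate("b.html")
browser.navigate("c.html")
print(detected, browser.current_url)
```

Because every navigation is canceled, the browser stays on the original page and the remaining form parts can still be exercised.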

[0075] Here, link information is collected in the present invention as follows. The browser 20 opens a web page from a given URL, and the page analysis unit 22, the event generation unit 24, and the link information detection unit 26 perform pseudo operations by generating events for all the form parts on that web page that require user operation, causing transitions to web pages and acquiring the link destination URLs. Next, referring to the link list table, a newly acquired link destination web page is opened, and the link destination URLs are acquired by generating events for the form parts on that web page that require user operation. This acquisition is repeated.

[0076] That is, in the present invention, link information is not collected in the hierarchical direction, in which, as soon as a link destination URL is detected from a page transition caused by generating an event for a form part on the currently opened web page, the web page of that newly detected URL is opened and events are generated for its form parts to acquire the next link destination URL. Instead, collection of the next link destination URLs is repeated one web page at a time. If link information were collected in the hierarchical direction, the processing would have to return to the original level after reaching the last web page, which would complicate the processing.

[0077] FIG. 15 is a flowchart of the Internet information collection processing according to the present invention. In FIG. 15, a list of URLs collected by a conventional web robot is acquired in step S1, one URL is selected in step S2, and the browser 20 is launched in step S3 to open the web page.

The opening of this web page by the browser is performed as a background process in the operation of the computer as the Internet information collecting apparatus 10 without actually developing the screen.

[0079] Next, in step S4, the page is analyzed with the DOM parser or the like and the document object model (DOM), an API with which the event generation unit 24 can generate events, is constructed. Then, in step S5, the link information detection processing is executed by pseudo operations through event generation.

[0080] Subsequently, in step S6, it is checked whether any of the URLs read in step S1 remain unprocessed; if so, the process returns to step S2 and the same processing is repeated. When all URLs have been processed in step S6, the process proceeds to step S7 to obtain the list of newly detected link destination URLs, and the link information detection processing from step S2 is repeated until step S8 finds no unprocessed URLs.
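The page-by-page loop of FIG. 15 can be approximated by a queue-based Python sketch. `collect_all`, `detect_links`, and the `SITE` mapping are hypothetical; `detect_links` stands in for steps S3 to S5 on a single page, and skipping already seen URLs is an added assumption that the text does not spell out.

```python
from collections import deque


def collect_all(seed_urls, detect_links):
    """Process URLs one page at a time; newly detected URLs wait in the queue."""
    pending = deque(seed_urls)
    seen = set(seed_urls)
    link_list = []  # plays the role of the link list table
    while pending:
        url = pending.popleft()
        for link in detect_links(url):  # stands in for steps S3 to S5 on one page
            if link not in seen:        # assumed de-duplication, not stated in the text
                seen.add(link)
                link_list.append(link)
                pending.append(link)
    return link_list


# Hypothetical site graph standing in for the browser and event generation.
SITE = {"index.html": ["a.html", "b.html"], "a.html": ["c.html"], "b.html": ["a.html"]}
print(collect_all(["index.html"], lambda url: SITE.get(url, [])))
```

A queue keeps the traversal flat: each page is finished before any newly found page is opened, so there is no need to climb back up a hierarchy.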

FIGS. 16 and 17 are flowcharts of the link information detection processing according to the present invention corresponding to step S5 of FIG.

[0082] In FIG. 16, the link information detection processing scans the tags in the HTML tag statements in step S1 and checks in step S2 whether each tag is a non-event-generating tag. Non-event-generating tags include the <A> tag shown on line 11 of FIG. 4, the <IMG> tag, the <LINK> tag, and so on. If the tag is a non-event-generating tag, the process proceeds to step S3, where the link destination URL is directly detected and saved.

[0083] On the other hand, if it is determined in step S2 that the tag is not a non-event-generating tag, the process proceeds to step S4, where it is determined whether it is a <FORM> tag. If it is a <FORM> tag, the process proceeds to step S5 to check whether the form part is an operation button.
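The branching in steps S2 to S4 can be sketched as a small Python classifier. `NON_EVENT_TAGS` and `classify_tag` are hypothetical names, and the choice of `href` and `src` as the attributes holding the URL follows ordinary HTML usage rather than anything stated in the text.

```python
# Tags whose destination can be read directly, mapped to the attribute assumed to hold it.
NON_EVENT_TAGS = {"a": "href", "img": "src", "link": "href"}


def classify_tag(tag, attrs):
    """Return ('direct', url), ('form', None), or ('skip', None) for one tag."""
    tag = tag.lower()
    if tag in NON_EVENT_TAGS:
        # Step S3: the URL is available without generating any event.
        return ("direct", attrs.get(NON_EVENT_TAGS[tag]))
    if tag == "form":
        # Steps S4 onward: the form's child tags need pseudo operations.
        return ("form", None)
    return ("skip", None)


print(classify_tag("A", {"href": "a.html"}))
print(classify_tag("form", {}))
```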

[0084] If it is an operation button, the process proceeds to step S6, where it is checked whether or not the tag is an <INPUT> tag. If it is an <INPUT> tag, in step S7 one event is selected in order from a list of generated events prepared in advance and is issued; by executing the corresponding script statement, a link destination URL is generated and a page transition is made.

Subsequently, whether or not there is a page transition is checked in step S8. If there is a page transition, the link destination URL is acquired and stored in step S9. Note that the page transition in step S8 is determined by whether before-navigation 62 has been acquired; as shown in FIG., this is event information acquired before communication in the case of Internet Explorer (R). If it has been acquired, the link destination URL is detected and saved.

[0086] The processing from step S7 is repeated until all event generations are completed in step S10. Of all these events, only the events defined in the <INPUT> tag in the HTML statement function as valid events, and the link destination URL is generated by executing the script statement.
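
The idea of issuing every prepared event and letting only the defined ones take effect can be sketched as follows. This Python sketch is a simplification under stated assumptions: `EVENT_LIST` is a hypothetical event list, and instead of executing the script in a browser and watching for before-navigation, it merely searches the script text for an assignment to `location.href`.

```python
import re
from typing import Dict, List

# Hypothetical list of generated events prepared in advance (step S7).
EVENT_LIST = ["onclick", "ondblclick", "onmouseover", "onchange", "onfocus"]

# Assumption: the script statement assigns a quoted URL to location.href.
HREF_PATTERN = re.compile(r"location\.href\s*=\s*['\"]([^'\"]+)['\"]")


def fire_all_events(input_attrs: Dict[str, str]) -> List[str]:
    """Issue each event in order and collect the link destination URLs."""
    urls = []
    for event in EVENT_LIST:                 # steps S7-S10: issue events in order
        script = input_attrs.get(event)
        if script is None:                   # event not defined on the tag
            continue
        match = HREF_PATTERN.search(script)  # stand-in for executing the script
        if match:                            # steps S8-S9: transition detected
            urls.append(match.group(1))
    return urls


attrs = {"type": "button",
         "onclick": "location.href='http://example.com/next.html'"}
found = fire_all_events(attrs)
```

Because undefined events simply produce nothing, the caller does not need to know in advance which events a given <INPUT> tag handles.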

Next, proceeding to step S11 in FIG. 17, it is checked whether or not the form part is a selection list. If it is a selection list, the process proceeds to step S12, where all child tags within the range of the <FORM> tag, such as <INPUT> and <SELECT>, are processed.

[0088] Next, in step S13, the selection patterns of the <SELECT> tag that is a sibling of the <INPUT> tag are analyzed. In the case of FIGS. 10 to 12, there are three such selection patterns. Next, in step S14, the state of the sibling <SELECT> tag is changed according to the selection pattern.

[0089] Subsequently, in step S15, one event is selected and issued for the current child tag <INPUT>, and in step S16, whether or not a page transition has occurred is checked. If there is a page transition, the link destination URL is detected and stored in step S17. Subsequently, in step S18, it is checked whether or not all event generation has ended, and the processing from step S15 is repeated until all event generation ends.

[0090] Next, in step S19, it is checked whether or not all selection patterns have been completed. If the selection patterns have not been completed, the process returns to step S14, the state of the sibling <SELECT> tag is changed to the next selection pattern, and steps S14 to S18 are repeated.
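
The nested loops of steps S14 to S19 can be sketched in Python as below. The sketch assumes that a selection pattern is one combination of option values across the sibling <SELECT> tags; the `fire` callback, which would set the selection state, issue the event, and return the detected URL or `None`, is hypothetical.

```python
from itertools import product
from typing import Callable, Dict, Iterator, List, Optional


def selection_patterns(selects: Dict[str, List[str]]) -> Iterator[Dict[str, str]]:
    """Yield every combination of option values for the sibling SELECT tags."""
    names = list(selects)
    for combo in product(*(selects[n] for n in names)):
        yield dict(zip(names, combo))


def detect_with_selects(selects: Dict[str, List[str]],
                        events: List[str],
                        fire: Callable[[Dict[str, str], str], Optional[str]]
                        ) -> List[str]:
    urls = []
    for pattern in selection_patterns(selects):    # steps S14 and S19
        for event in events:                        # steps S15 and S18
            url = fire(pattern, event)              # hypothetical browser call
            if url is not None and url not in urls:     # steps S16-S17
                urls.append(url)
    return urls


# Usage: one selection list with three options, as a stand-in for a real form.
selects = {"site": ["news", "sports", "weather"]}
fire = lambda pattern, event: (
    f"http://example.com/{pattern['site']}.html" if event == "onclick" else None
)
urls = detect_with_selects(selects, ["onclick", "onfocus"], fire)
```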

[0091] When processing for all the selection patterns is completed in step S19, the process proceeds to step S20, where it is checked whether processing has been completed for all tags. If not, the process returns to step S1 and is performed for the next tag, and thereafter the processing of steps S1 to S20 is repeated until the processing is completed for all the tags.

The present invention also provides an Internet information collection program that is executed by the Internet information collection device 10 constituted by a computer. This program is constructed as a program whose procedures follow the flowcharts of FIGS. 15 to 16 and FIG.

FIG. 18 is a block diagram of another embodiment of the Internet information collecting apparatus according to the present invention. In this embodiment, the link information detection unit 26 provided in the application execution environment 18 has, in addition to the function of the embodiment of FIG. 1 of detecting and saving web information from page transitions based on the link information generated by events that the event generation unit 24 generates, a function of detecting and storing link destination web information from the proxy server 64 accessed by the browser 20 functioning as the page browsing unit.

This solves the problem that there is a URL that cannot be extracted by the function of the Internet information collecting apparatus 10 of FIG.

Here, the following URLs cannot be extracted in the embodiment of FIG.

(1) When static links are updated by user operations.

(2) When a Java program such as a Java applet performs HTTP communication independently.

(3) When a proprietary program such as an ActiveX component performs HTTP communication independently.

(4) When operating on a platform whose software development kit (SDK) does not have the before-navigation function shown in FIG. 13, as in a Unix (R) environment.

FIG. 19 is an explanatory diagram of a URL that cannot be extracted in the embodiment of FIG. 1 because, as in case (1), a static link is updated by a user operation. In FIG. 19, the HTML source 65 contains a script statement 66 on lines 3 to 5 and a script statement 67 on lines 6 to 8.

The script statement 66 changes the image file to “over.gif” when the cursor passes over the image by a mouse operation or the like. This “over.gif” is acquired from the website for the first time by the user's mouse operation, but the acquisition is not a page transition operation. Therefore, in the embodiment of FIG. 1, it is impossible to detect the full-path URL of the website file named “over.gif”.

The next script statement 67 returns the image file to “out.gif” when the cursor moves away from the image. This “out.gif” is also acquired from the website for the first time by the user's mouse operation, and the acquisition is not a page transition operation. Therefore, the before-navigation event in the embodiment of FIG. 1 does not occur, and the URL cannot be acquired.

Therefore, in the present invention, as shown in FIG. 18, when the Internet information collecting apparatus 10 uses the browser 20, the websites 14-1 to 14-3 are always accessed via the proxy server 64. The problem is solved by focusing on the fact that the proxy server 64 stores access information in a file for both the HTTP request to the website and the HTTP response from the website.

[0100] That is, in the present invention, the link information detection unit 26, in addition to detecting page transitions by the before-navigation function, accesses the proxy server 64, obtains the full-path URL of the transition destination from the file information stored there, and saves it in the link list table 30.

FIG. 20 is an explanatory diagram of the processing operation of the embodiment of FIG. 18 for detecting and collecting URLs from the file of the proxy server. In FIG. 20, when, for example, the Internet information collecting device 10 fires an event 68 that moves the cursor onto the image in accordance with the script statement 68 shown in FIG. 19, an HTTP request 72 is sent from the browser 20 to the website 14 via the proxy server 64.

The website 14 that has received this HTTP request 72 returns the web page 74 with the file name “over.gif” to the browser 20 through the proxy server 64 as the HTTP response 78.

[0103] In the proxy server 64, access information 76 is stored in the file 85 when the HTTP request 72 is sent to the website 14, and access information 80 is stored in the file 85 when the HTTP response 78 is sent from the website 14 to the browser 20.

[0104] The first line of the access information 76 saved with the HTTP request 72 stores “over.gif” as the file name, and the third line stores the domain name “domain” of the website 14.

Accordingly, the link information detection unit 26 provided in the Internet information collecting apparatus 10 shown in FIG. 18 refers to the file 85 of the proxy server 64, detects “http://domain/over.gif”, the full path that starts with “http://” and ends with the file name “over.gif”, as the link destination URL 84, and saves it as shown in the record 82 of the link list table 30.
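
Assembling the full-path URL from the stored access information can be sketched in Python as follows. The layout of the file 85 is not given in the text beyond the file name appearing on the first line and the domain name on the third, so the sketch assumes that the access information is recorded as raw HTTP request text with a `Host` header; the function name `url_from_access_info` and the sample record are hypothetical.

```python
from typing import Optional


def url_from_access_info(access_info: str) -> Optional[str]:
    """Build a full-path URL from request-style access information."""
    lines = access_info.strip().splitlines()
    if not lines:
        return None
    parts = lines[0].split()              # e.g. "GET /over.gif HTTP/1.1"
    if len(parts) < 2:
        return None
    path = parts[1]
    host = None
    for line in lines[1:]:                # look for the domain name
        name, _, value = line.partition(":")
        if name.strip().lower() == "host":
            host = value.strip()
            break
    if host is None:
        return None
    if not path.startswith("/"):
        path = "/" + path
    return f"http://{host}{path}"


# Assumed record: file name on line 1, domain name on line 3.
sample = "GET /over.gif HTTP/1.1\nAccept: */*\nHost: domain\n"
url = url_from_access_info(sample)
```

Searching for the `Host` header by name rather than by line position keeps the sketch usable when a proxy records the headers in a different order.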

FIG. 21 is a flowchart of the Internet information collecting apparatus in the embodiment of FIG. In FIG. 21, the processing of steps S1 to S8 is the same as the processing according to the embodiment of FIG. 1 shown in FIG. In the embodiment of FIG. 18, after the processing of steps S1 to S8 is completed, the processing of acquiring the full-path URLs from the proxy server 64 and registering them in the link list table is executed in step S9.

[0107] By acquiring URLs as link information from the proxy server accessed by the browser deployed to collect Internet information in this way, URLs can also be obtained for events, such as those caused by a mouse, that cannot be captured by the pseudo operations, and web information on the Internet can be collected without omission.

Note that the present invention includes appropriate modifications that do not impair the object and advantages thereof, and is not limited by the numerical values shown in the above embodiments.

Claims

The scope of the claims

[1] A page browsing unit that acquires a web page on the Internet and develops it on a screen;

A page analysis unit that analyzes a web page displayed on the page browsing unit and extracts an event operation tag sentence that dynamically generates link information according to an event generated by a user operation;

An event generation unit that generates the event in response to the event operation tag sentence extracted by the page analysis unit;

A link information detection unit that detects and stores link destination web information from page transitions by link information generated by an event generated by the event generation unit;

An Internet information collecting apparatus comprising:

[2] The Internet information collection device according to claim 1, wherein the link information detection unit further detects and stores link destination web information from the proxy server accessed by the page browsing unit.

[3] In the Internet information collecting device according to claim 1,

The page analysis unit extracts an input sentence from a range defined by a form sentence in a tag sentence that constructs the web page,

The Internet information collection device, wherein the event generation unit sequentially generates all events defined for the input sentence, and generates link information according to valid events therein.

[4] In the Internet information collecting device according to claim 1,

The page analysis unit extracts a select sentence indicating an option to the user and an input sentence that requires a user operation from a range specified by the form sentence in the tag sentence constructing the web page,

The Internet information collection device, wherein the event generation unit generates an event for the input sentence while changing options of the select sentence.

[5] In the Internet information collecting device according to claim 4,

The page analysis unit extracts, from a range defined by a form tag in the tag sentence that constructs the web page, an input tag that is a child tag of the form tag, a select tag that is a sibling tag of the input tag and creates a selection list, and a plurality of option tags that are child tags of the select tag and indicate the contents of the selection list;

The Internet information collection device, wherein the event generation unit generates an event of the input tag while changing a plurality of option tags in the select tag.

[6] The Internet information collection device according to claim 5, wherein the event generation unit sequentially generates all the events defined for the input tag and generates link information according to the valid events therein.

[7] The Internet information collection device according to claim 1, wherein the link information detection unit detects and saves all link information of web pages to which the page transitions when events are generated for the event operation tag sentences of the currently developed web page, and then repeats the process of developing another web page on the screen and acquiring and saving the link information of web pages to which the page transitions when events are generated for its event operation tag sentences.

[8] The Internet information collection device according to claim 1, wherein the link information detection unit detects the link information from communication event information notified before communication to the link destination.

[9] An Internet information collection program causing a computer to execute:

A page browsing step of acquiring a web page on the Internet and developing it on a screen;

A page analysis step of analyzing the web page developed in the page browsing step and extracting an event operation tag sentence that dynamically generates link information according to an event generated by a user operation;

An event generation step of generating the event with respect to the event operation tag sentence extracted in the page analysis step; and

A link information detection step of detecting and storing link destination web information from page transitions based on link information generated by an event generated in the event generation step.

[10] The Internet information collection program according to claim 9, wherein the link information detection step further detects and stores link destination web information from the proxy server accessed in the page browsing step.

[11] In the Internet information collecting program according to claim 9,

The page analysis step extracts an input sentence from a range defined by a form sentence in a tag sentence that constructs the web page,

The event generation step sequentially generates all events defined for the input sentence, and generates link information according to valid events therein.

[12] In the Internet information collecting program according to claim 9,

The page analysis step extracts a select sentence indicating an option to the user and an input sentence that requires a user operation from a range defined by the form sentence in the tag sentence that constructs the web page,

The Internet event collection program characterized in that the event generation step generates an event for the input sentence while changing options of the selection sentence.

[13] In the Internet information collecting program according to claim 12,

The page analysis step extracts, from a range defined by a form tag in the tag sentence that constructs the web page, an input tag that is a child tag of the form tag, a select tag that is a sibling tag of the input tag and creates a selection list, and a plurality of option tags that are child tags of the select tag and indicate the contents of the selection list,

The event generation step generates an event of the input tag while changing a plurality of option tags in the select tag.

[14] The Internet information collection program according to claim 13, wherein the event generation step sequentially generates all events defined for the input tag and generates link information according to the valid events therein.

[15] The Internet information collection program according to claim 9, wherein the link information detection step detects and saves all link information of web pages to which the page transitions when events are generated for the event operation tag sentences of the currently developed web page, and then repeats the process of acquiring and saving the link information of web pages to which the page transitions when events are generated for event operation tag sentences.

[16] The Internet information collection program according to claim 9, wherein the link information detection step detects, from communication event information notified before communication to the link destination, the link information of the web page without making the page transition.

[17] A page browsing step of acquiring a web page on the Internet and developing it on a screen;

A page analysis step of analyzing the web page developed in the page browsing step and extracting an event operation tag sentence that dynamically generates link information according to an event generated by a user operation;

An event generation step of generating the event with respect to the event operation tag sentence extracted in the page analysis step;

An Internet information collecting method comprising: a link information detecting step of detecting and storing link destination web information from page transitions based on link information generated by an event generated in the event generating step.

[18] The Internet information collecting method according to claim 17, wherein the link information detection step further detects and stores link destination web information from the proxy server accessed in the page browsing step.

[19] The Internet information collecting method according to claim 17,

The Internet event collection method, wherein the event generation step sequentially generates all events defined for the input sentence and generates link information according to valid events therein.

[20] In the Internet information collecting method according to claim 17,

The Internet event collection method, wherein the event generation step generates an event for the input sentence while changing options of the selection sentence.

[21] In the Internet information collecting method according to claim 20,

The page analysis step extracts, from a range defined by a form tag in the tag sentence that constructs the web page, an input tag that is a child tag of the form tag, a select tag that is a sibling tag of the input tag and creates a selection list, and a plurality of option tags that are child tags of the select tag and indicate the contents of the selection list,

The event generation step generates the event of the input tag while changing a plurality of option tags in the select tag.

[22] The Internet information collecting method according to claim 21, wherein the event generation step sequentially generates all events defined for the input tag and generates link information according to the valid events therein.

[23] The Internet information collecting method according to claim 17, wherein the link information detecting step detects and saves all link information of web pages to which the page transitions when events are generated for the event operation tag sentences of the currently developed web page, and then repeats the process of developing another web page on the screen and acquiring and saving the link information of web pages to which the page transitions when events are generated for its event operation tag sentences.

[24] The Internet information collecting method according to claim 17, wherein the link information detection step detects, from communication event information notified before communication to the link destination, the link information of the web page without making the page transition.