WO2016075829A1 - Data acquisition program, data acquisition method, and data acquisition device - Google Patents
Data acquisition program, data acquisition method, and data acquisition device
- Publication number
- WO2016075829A1 (PCT/JP2014/080268)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- data
- tag
- document
- data acquisition
- extraction target
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/958—Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
Definitions
- the present invention relates to a data acquisition program, a data acquisition method, and a data acquisition device.
- the crawler tool is known as a tool for collecting information published on the Internet.
- The crawler tool visits home pages on the Internet and stores their contents in URL (Uniform Resource Locator) units, that is, in page units. It has also been proposed to detect partial updates: the content of a home page or the like is analyzed to identify a location specified by the user, and when the content is newly received, the corresponding location is extracted and compared with the original data to determine whether a partial update has occurred.
- It has also been proposed to extract tag information of an independent part of a web page (that is, a home page), meaning a part that can be processed independently, and, when the user designates an independent part, to generate a page component containing the contents of the designated part. Furthermore, it has been proposed to generate a new web page based on the generated page component.
- An object of the present invention is to provide a data acquisition program, a data acquisition method, and a data acquisition device that can extract and output data of a target portion even without specific tag information.
- The data acquisition program causes a computer to execute a process of identifying, in a document that is associated with a specific URL and includes tag structure information, the position on the hierarchical structure of the tag contained in the extraction target portion of the document, and permitting registration of that position. The data acquisition program further causes the computer to execute a process of accessing the document associated with the specific URL regularly or irregularly, extracting the data corresponding to the registered position of the tag on the hierarchical structure, and outputting the extracted data.
- FIG. 1 is a block diagram illustrating an example of the configuration of the data acquisition apparatus.
- FIG. 2 is a diagram illustrating an example of the target storage unit.
- FIG. 3 is a diagram illustrating an example of the item storage unit.
- FIG. 4 is a diagram illustrating an example of the page storage unit.
- FIG. 5 is a diagram illustrating an example of the extracted data storage unit.
- FIG. 6 is a diagram illustrating an example of an extraction target part reception screen.
- FIG. 7 is a flowchart illustrating an example of the definition generation process.
- FIG. 8 is a flowchart illustrating an example of the crawl process.
- FIG. 9 is a diagram illustrating an example of a computer that executes a data acquisition program.
- FIG. 1 is a block diagram showing an example of the configuration of the data acquisition device.
- The data acquisition apparatus 100 shown in FIG. 1 is connected to the Internet via the network N, for example. It visits home pages (hereinafter also referred to as sites) on the Internet designated by an administrator, acquires predetermined data, and accumulates the data in a database.
- For example, to acquire tourist information for a certain area, the data acquisition device 100 visits tourist information sites, including those provided by prefectures, and acquires data such as the address, telephone number, and description of each tourist spot.
- The data acquisition device 100 generates a definition of the data items to be acquired in advance, and acquires data from each site based on the definition.
- The data acquisition device 100 identifies, in a document that is associated with a specific URL and includes tag structure information, the position on the hierarchical structure of the tag contained in the extraction target portion, and permits registration of that position. Further, the data acquisition device 100 accesses the document associated with the specific URL regularly or irregularly, extracts the data corresponding to the registered position of the tag on the hierarchical structure, and outputs the extracted data. As a result, the data acquisition apparatus 100 can extract and output the data of the target portion from documents on sites whose data formats differ, without relying on unique tag information.
- examples of the document including the tag structure information include a document described in a markup language, such as an HTML (HyperText Markup Language) document, an XML (Extensible Markup Language) document, and the like.
- the data acquisition apparatus 100 includes an input unit 101, an output unit 102, a communication unit 110, a storage unit 120, and a control unit 130.
- the data acquisition apparatus 100 may include various functional units included in a known computer in addition to the functional units illustrated in FIG.
- the input unit 101 is an input device such as a keyboard or a mouse, for example, and receives input of various information from the administrator of the data acquisition apparatus 100.
- The input unit 101 receives input from the administrator of the data acquisition apparatus 100, such as the URL of the site to be visited and the data items to be acquired, and outputs the input result to the control unit 130.
- The input unit 101 may be a reader/writer for a medium such as an SD (Secure Digital) memory card.
- In this case, the input unit 101 outputs the URL of the site to be visited, the data items to be acquired, and the like read from the SD memory card to the control unit 130.
- The input unit 101 may include both an input device and a reader/writer for a medium such as an SD memory card.
- the output unit 102 is a display device for displaying various information, for example.
- the output unit 102 is realized by, for example, a liquid crystal display as a display device.
- The output unit 102 may be a reader/writer for a medium such as an SD memory card.
- The output unit 102 displays the output data or writes it to the memory card.
- The input unit 101 and the output unit 102 may be integrated.
- For example, a device having both functions, such as a reader/writer for an SD memory card, may be used.
- The output unit 102 may include both a display device and an SD card reader/writer, for example.
- the communication unit 110 is realized by, for example, a NIC (Network Interface Card).
- the communication unit 110 is a communication interface that is connected to the Internet, for example, via the network N in a wired or wireless manner and manages information communication with servers at various sites on the Internet.
- the communication unit 110 receives page contents such as HTML documents and image files from various sites on the Internet.
- the communication unit 110 outputs the received page content to the control unit 130.
- the communication unit 110 transmits a page request or the like input from the control unit 130 to various sites on the Internet.
- the storage unit 120 is realized by, for example, a semiconductor memory element such as a RAM (Random Access Memory) or a flash memory, or a storage device such as a hard disk or an optical disk.
- the storage unit 120 includes a target storage unit 121, an item storage unit 122, a page storage unit 123, and an extracted data storage unit 124.
- the storage unit 120 stores information used for processing in the control unit 130.
- the target storage unit 121 stores a URL of a site that is a target of a crawl process for acquiring data (hereinafter referred to as a target URL) and position specifying information of an extraction target portion in an HTML document in association with each other. That is, the target storage unit 121 stores the definition of the target URL.
- FIG. 2 is a diagram illustrating an example of the target storage unit. As illustrated in FIG. 2, the target storage unit 121 includes items such as “URLID”, “target URL”, and “position specifying information of the extraction target portion”. In addition, the “extraction target portion position specifying information” includes items such as “title” and “address”. Note that the position specifying information of the extraction target portion includes items such as a telephone number, an update date, position information, and an explanatory text, although not illustrated.
- the target storage unit 121 stores, for example, one record for each target URL.
- “URLID” identifies the target URL.
- “Target URL” indicates the URL of an HTML document to be accessed in the crawl process. The target URL is input by the administrator using the input device of the input unit 101, for example.
- the “extraction target part position specifying information” indicates information for specifying the position of the extraction target part in the HTML document of the target URL.
- “Title” indicates the position of the tag in the hierarchical structure of the target HTML document by combining one or more of the tag name, the order of the tag in the document, and the hierarchical structure of the tag.
- “Address” indicates the position of the tag in the hierarchical structure of the target HTML document by combining one or more of the tag name, the order of the tag in the document, and the hierarchical structure of the tag.
- For example, FIG. 2 shows the position specifying information of the title and the address in the HTML document of the target URL “http://aaaaa.bbb.ccc/ddd/eee/001.html”, whose URLID is “1”.
- In the title entry, “Order: 1” indicates the first tag among the tags indicating the title in the HTML document.
- “/title/” indicates the hierarchical structure of the tags indicating the title of the HTML document. Note that the data extracted as a title from the HTML document is the portion enclosed by DIV tags.
- In the address entry, “Order: 1” indicates the first tag among the tags indicating the address in the HTML document.
- “/info/address/” indicates the hierarchical structure of the tags indicating the address of the HTML document. Note that the data extracted from the HTML document as an address is the portion enclosed by DIV tags. Further, the position specifying information of the extraction target portion may be specified using one or more of the tag name, the tag order, and the tag hierarchical structure.
- the tag name may be expressed using a regular expression.
- For example, the name of the tag indicating the address is expressed by the regular expressions “/<DIV.*>(.+)<\/DIV>/” and “/address: (.+)$/”.
- the portion surrounded by the DIV tag or the portion following “Address:” is the data extracted as the address.
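The two regular expressions can be tried out in a short sketch. The Python example below is illustrative only: the helper name `extract_address`, the sample HTML line, and the use of case-insensitive matching for the DIV pattern are assumptions, not part of the specification.

```python
import re
from typing import Optional

# Patterns taken from the text; IGNORECASE is an assumption so that
# lowercase <div> tags in real pages also match.
DIV_BODY = re.compile(r"<DIV.*>(.+)</DIV>", re.IGNORECASE)
ADDRESS_SUFFIX = re.compile(r"address: (.+)$")


def extract_address(line: str) -> Optional[str]:
    """Return the text after "address: " inside a DIV, or the DIV body itself."""
    div = DIV_BODY.search(line)
    if div is None:
        return None
    body = div.group(1)
    labelled = ADDRESS_SUFFIX.search(body)
    # Fall back to the whole DIV body when no "address: " label is present.
    return labelled.group(1) if labelled else body


sample = '<div class="info">address: 1-2-3 Example Town</div>'
print(extract_address(sample))
```

The fallback mirrors the statement that either the portion enclosed by the DIV tag or the portion following the label is taken as the address.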
- the position specifying information of the extraction target part may be a combination of a CSS selector and a regular expression.
- The position specifying information of the extraction target portion may be expressed using a cutout method.
- For example, the position specifying information of the title is expressed as “div#left h2, order: 3, /tps/table/” using a CSS selector.
- The position specifying information of the address is expressed, for example, as “#infoContent @<h3>location<\/h3>\s+?<P>(.+?)<\/P>@is, order: 5, /info/address/” using a CSS selector and a regular expression.
- the item storage unit 122 stores the definition of the data item extracted from the page content of the target URL.
- FIG. 3 is a diagram illustrating an example of the item storage unit. As illustrated in FIG. 3, the item storage unit 122 includes items such as “item ID”, “data name”, “data type”, and “cutout method”. The item storage unit 122 stores, for example, one record for each data name.
- “Item ID” identifies a data item, that is, a data name.
- “Data name” indicates the name of data to be extracted. Examples of the data name include data such as a title, address, telephone number, update date, location information, and explanatory text.
- “Data type” indicates the type of data when the extracted data is stored in the extracted data storage unit 124. Data types include, for example, types such as letters, numbers, dates, and latitude and longitude.
- the “cutout method” indicates a method of cutting out data from the page content of the target URL, that is, a method of extracting the data. Examples of the clipping method include a CSS selector and a regular expression.
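A data item definition of this kind can be modelled as a small record. The following is a minimal sketch under stated assumptions: the class name `ItemDefinition`, the type labels ("text", "number", "date", "latlon"), the method labels, and the date format are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from datetime import date, datetime


@dataclass(frozen=True)
class ItemDefinition:
    item_id: int
    data_name: str
    data_type: str      # assumed labels: "text", "number", "date", "latlon"
    cutout_method: str  # assumed labels: "css", "regex", "css+regex"


def convert(raw: str, data_type: str):
    """Convert an extracted string to the type named in the item definition."""
    text = raw.strip()
    if data_type == "number":
        return int(text)
    if data_type == "date":
        # Assumed input format; real pages would need per-site handling.
        return datetime.strptime(text, "%Y-%m-%d").date()
    if data_type == "latlon":
        lat, lon = (float(part) for part in text.split(","))
        return (lat, lon)
    return text


ITEMS = [
    ItemDefinition(1, "title", "text", "css"),
    ItemDefinition(2, "address", "text", "css+regex"),
    ItemDefinition(3, "update date", "date", "regex"),
]

print(convert(" 2014-11-14 ", ITEMS[2].data_type))
```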
- The page storage unit 123 stores, for the target URL, the page contents acquired by access in the crawl process, that is, an HTML document, image files, and the like.
- FIG. 4 is a diagram illustrating an example of the page storage unit. As illustrated in FIG. 4, the page storage unit 123 includes items such as “URLID”, “target URL”, and “storage area”. For example, the page storage unit 123 stores one record for each target URL.
- URLID identifies the target URL.
- Target URL indicates the URL of the HTML document accessed by the crawl process.
- the “storage area” indicates a storage area that stores the acquired HTML document, image file, and the like.
- The storage area field holds, for example, a file system directory in the storage unit 120, and the HTML document, image files, and the like are stored in that directory. Note that the page storage unit 123 may instead store the acquired HTML document or image file directly in the storage area.
- the extracted data storage unit 124 stores the data of the extraction target portion extracted from the HTML document.
- the extracted data storage unit 124 is a database that stores data collected by the crawl process.
- FIG. 5 is a diagram illustrating an example of the extracted data storage unit. As illustrated in FIG. 5, the extracted data storage unit 124 includes items such as “URLID”, “title”, “address”, “phone number”, “update date”, “location information”, and “description”. The extracted data storage unit 124 stores, for example, one record for each URLID.
- URLID identifies the target URL.
- Title is one of the data items extracted from the HTML document of the target URL, and indicates the title of the HTML document of the target URL.
- Address is one of the data items extracted from the HTML document of the target URL, and indicates an address described in the HTML document of the target URL.
- Telephone number is one of the data items extracted from the HTML document of the target URL, and indicates the telephone number described in the HTML document of the target URL.
- Update date is one of the data items extracted from the HTML document of the target URL, and indicates the update date described in the HTML document of the target URL.
- Position information indicates latitude and longitude.
- the latitude and longitude are acquired by using, for example, an external API (Application Programming Interface) service based on the address extracted from the HTML document of the target URL.
- If the latitude and longitude are described in the HTML document, those values may be used as the position information.
- the “descriptive text” is one of data items extracted from the HTML document of the target URL. For example, if the HTML document of the target URL is a document related to a tourist spot, an explanatory text related to a tourist spot in the document is shown.
- As the address, an address acquired from an external API service using the tourist spot name described in the title may be used, for example.
- The control unit 130 is realized, for example, by a CPU (Central Processing Unit), an MPU (Micro Processing Unit), or the like executing a program stored in an internal storage device, using a RAM as a work area.
- the control unit 130 may be realized by an integrated circuit such as ASIC (Application Specific Integrated Circuit) or FPGA (Field Programmable Gate Array).
- the control unit 130 includes a registration unit 131, a crawl unit 132, an extraction unit 133, and an output control unit 134, and realizes or executes functions and operations of information processing described below. Note that the internal configuration of the control unit 130 is not limited to the configuration illustrated in FIG. 1, and may be another configuration as long as the information processing described below is performed.
- The registration unit 131 registers the definition of the target URL and the definition of the data item. For example, when the administrator operates the input unit 101, the registration unit 131 receives input of a data name, a data type, and a cutout method for the data to be extracted. The registration unit 131 associates the received data name, data type, and cutout method with one another to generate a data item definition, and registers the generated definition by storing it in the item storage unit 122.
- the registration unit 131 outputs the HTML document source corresponding to the target URL to the output unit 102 for display. For example, when the administrator operates the input unit 101, the registration unit 131 accepts selection of an extraction target portion on the source of the HTML document corresponding to the displayed target URL.
- the registration unit 131 may display an HTML document with a target URL and accept selection of an extraction target portion on the HTML document.
- the registration unit 131 identifies the position of the tag corresponding to the accepted extraction target part on the hierarchical structure.
- the registration unit 131 uses the specified position on the hierarchical structure as position specifying information of the extraction target portion.
- the registration unit 131 uses the name of the tag corresponding to the extraction target part and the order of the tags in the document as position specifying information of the extraction target part together with the specified position on the hierarchical structure.
- the registration unit 131 accepts selection of the extraction target portion for each data item in the HTML document of the target URL and specifies the position of the tag in the hierarchical structure.
- When there are a plurality of target URLs, the registration unit 131 similarly identifies, for the HTML document corresponding to each target URL, the position on the hierarchical structure of the tag corresponding to the extraction target portion.
- the registration unit 131 associates the target URL with the position specifying information of the extraction target portion, and generates a definition of the target URL.
- the registration unit 131 stores the generated definition of the target URL in the target storage unit 121. That is, the registration unit 131 registers the definition of the generated target URL in the target storage unit 121.
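How a selected portion might be mapped to a tag name, an order, and a hierarchy path can be sketched with Python's standard `html.parser`. This is an illustration under assumptions, not the method of the specification: the hierarchy label of each tag is taken from its `class` attribute (falling back to the tag name), the order counts elements that share the same path, and the names `PositionFinder` and `locate` as well as the sample page are hypothetical.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # tags without a closing tag


class PositionFinder(HTMLParser):
    """Finds the tag name, order, and hierarchy path of the element holding a text."""

    def __init__(self, selected_text):
        super().__init__()
        self.selected_text = selected_text
        self.open_tags = []  # (tag name, label, order) for each open element
        self.seen = {}       # path -> number of elements seen with that path
        self.result = None

    def _path(self):
        return "/" + "/".join(label for _, label, _ in self.open_tags) + "/"

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        # Assumption: the class attribute names the hierarchy level.
        label = dict(attrs).get("class") or tag
        self.open_tags.append((tag, label, 0))
        path = self._path()
        self.seen[path] = self.seen.get(path, 0) + 1
        self.open_tags[-1] = (tag, label, self.seen[path])

    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1][0] == tag:
            self.open_tags.pop()

    def handle_data(self, data):
        if self.result is None and self.open_tags and self.selected_text in data:
            tag, _, order = self.open_tags[-1]
            self.result = {"tag": tag, "order": order, "path": self._path()}


def locate(html, selected_text):
    finder = PositionFinder(selected_text)
    finder.feed(html)
    finder.close()
    return finder.result


SAMPLE = """
<div class="title">Example Castle</div>
<div class="info"><div class="address">1-2-3 Example Town</div></div>
<div class="info"><div class="address">4-5-6 Sample City</div></div>
"""

print(locate(SAMPLE, "4-5-6 Sample City"))
```

The returned dictionary corresponds to the three pieces of position specifying information (tag name, order, hierarchical structure) that are stored together with the target URL.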
- FIG. 6 is a diagram illustrating an example of an extraction target part reception screen.
- the reception screen 21 has an area 22 for displaying the source of the HTML document and an area 23 for receiving selection of the extraction target portion.
- For example, suppose that an address is selected in the extraction target part selection field of the area 23.
- In this case, the registration unit 131 reads the definition of the data item corresponding to the address from the item storage unit 122 and displays it in the extraction definition field 24.
- The extraction definition field 24 may be displayed as editable text.
- The registration unit 131 displays the portion of the source shown in the area 22 that matches the CSS selector, the regular expression, or both in the extraction definition field 24 as the extraction target portion 25, for example, by coloring its background.
- The registration unit 131 accepts the selection of the extraction target portion 25 when, for example, the administrator confirms the extraction target portion 25 and presses a selection button on a user interface (not shown). Alternatively, the administrator may select the extraction target portion 25 in the area 22 by a mouse operation, and the registration unit 131 may accept the selected extraction target portion 25.
- the registration unit 131 may perform a conversion process for removing unnecessary characters on the extraction target portion 25.
- the registration unit 131 performs conversion processing on the character string of the extraction target portion 25 using the conversion definition in the conversion processing column 26 set by the administrator.
- the registration unit 131 inserts the conversion result 27 under the extraction target portion 25 and displays the background in a color different from that of the extraction target portion 25.
- the registration unit 131 can accept the conversion result 27 as an extraction target portion.
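The conversion process for removing unnecessary characters can be pictured as an ordered list of substitution rules. The sketch below assumes that a conversion definition is a list of (regular expression, replacement) pairs; the function name `apply_conversion` and the concrete rules are hypothetical.

```python
import re


def apply_conversion(text, rules):
    """Apply each (pattern, replacement) rule in order, then trim whitespace."""
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text.strip()


# Assumed conversion definition: strip tags, collapse whitespace, drop a label.
RULES = [
    (r"<[^>]+>", ""),
    (r"\s+", " "),
    (r"^Address:\s*", ""),
]

raw = "<b>Address:</b>  1-2-3\n Example Town "
print(apply_conversion(raw, RULES))
```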
- the crawl unit 132 accesses the home page including the target URL, for example, the top page of a certain tourism information site, with reference to the target storage unit 121. That is, the crawl unit 132 transmits a page request to the server of a certain tourism information site via the communication unit 110 and receives the page content from the server via the communication unit 110.
- the crawl unit 132 accesses the home page including the target URL regularly or irregularly, that is, at an interval specified in advance by the administrator or at an arbitrary timing.
- the designated interval can be an arbitrary interval such as one day, one week, one month, or the like.
- the crawl unit 132 refers to the target storage unit 121 and selects a target URL for acquiring the page contents from all the links in the home page. For example, the crawl unit 132 selects a target URL of a page for each sightseeing spot. The crawl unit 132 acquires page contents from the selected target URL. The crawl unit 132 stores the acquired page content in the page storage unit 123. Further, the crawl unit 132 outputs acquisition completion information indicating that the acquisition of the page content is completed to the extraction unit 133.
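Selecting target URLs from the links of a home page can be sketched without any network access. In the example below the home page URL, its HTML, and the names `LinkCollector` and `select_targets` are assumptions made for illustration; only the `001.html` URL appears in the text.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects absolute URLs of all <a href="..."> links in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


def select_targets(home_url, home_html, registered):
    """Return the links of the home page that are registered target URLs."""
    collector = LinkCollector(home_url)
    collector.feed(home_html)
    collector.close()
    # dict.fromkeys removes duplicate links while keeping page order.
    return [url for url in dict.fromkeys(collector.links) if url in registered]


HOME_URL = "http://aaaaa.bbb.ccc/ddd/index.html"  # hypothetical home page
HOME_HTML = (
    '<a href="eee/001.html">Spot 1</a> '
    '<a href="eee/002.html">Spot 2</a> '
    '<a href="/about.html">About</a>'
)
REGISTERED = {"http://aaaaa.bbb.ccc/ddd/eee/001.html"}

print(select_targets(HOME_URL, HOME_HTML, REGISTERED))
```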
- When the acquisition completion information is input from the crawl unit 132, the extraction unit 133 refers to the position specifying information of the extraction target portion in the target storage unit 121 and extracts the data of each data item of the extraction target portion from the page content of the target URL stored in the page storage unit 123.
- The extraction unit 133 associates the extracted data with the URLID and stores it in the extracted data storage unit 124 according to the data item definitions in the item storage unit 122.
- After storing the extracted data in the extracted data storage unit 124, the extraction unit 133 outputs extraction completion information to the output control unit 134.
- the extraction unit 133 extracts data using the method specified by the cutout method of the item storage unit 122 when extracting data of the data item of the extraction target portion.
- For example, when the tag hierarchy indicating the address is defined as “/info/address/”, the extraction unit 133 extracts the address by using a CSS selector written as “.address”. In this case, the extraction unit 133 can cut out, as an address, an item whose tag includes “address”.
- As another example, the extraction unit 133 extracts an address by using a definition in which “.info” is written on the first line, the regular expression “/<DIV.*>(.+)<\/DIV>/” on the second line, and the regular expression “/address: (.+)$/” on the third line.
- In this case, the extraction unit 133 can extract, as an address, the character string that follows the character string “address:” within the hierarchy contained in a DIV tag whose class is “info”.
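Combining a class-based selection with a regular expression, as in the “.info” example, might look like the following sketch. It uses only the Python standard library rather than a real CSS selector engine, so the class lookup is a simplified stand-in; the names `ClassTextCollector` and `extract_address`, the sample page, and the use of `re.MULTILINE` are assumptions.

```python
import re
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}
ADDRESS = re.compile(r"address: (.+)$", re.MULTILINE)  # MULTILINE is an assumption


class ClassTextCollector(HTMLParser):
    """Collects the text inside every element whose class list contains css_class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.depth = 0    # greater than 0 while inside a matching element
        self.chunks = []
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1
        elif self.css_class in classes:
            self.depth = 1
            self.chunks = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            self.texts.append("".join(self.chunks))

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_address(html, css_class="info"):
    collector = ClassTextCollector(css_class)
    collector.feed(html)
    collector.close()
    for text in collector.texts:
        match = ADDRESS.search(text)
        if match:
            return match.group(1)
    return None


PAGE = """
<div class="info">
<p>address: 1-2-3 Example Town</p>
<p>tel: 000-0000</p>
</div>
"""

print(extract_address(PAGE))
```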
- When storing the extracted data in the extracted data storage unit 124, the extraction unit 133 may output information indicating that the data has changed to the output unit 102 for display if the extracted data differs from the data extracted in the past. That is, when the data corresponding to the registered position of the tag on the hierarchical structure extracted in the past differs from the data corresponding to that position extracted this time, the extraction unit 133 outputs information indicating that the data has changed to the output unit 102. Examples of such information include messages such as “Address has been updated. Please check.” and “Page layout has been changed. Please check.”
- When a plurality of positions are registered, the extraction unit 133 outputs, to the output unit 102, information according to the number or rate of data items, among the data corresponding to the plurality of positions, that match the past data. For example, when six data items are registered for the HTML document and two of them differ from the past data, the extraction unit 133 outputs a message such as “Information on two locations has been updated. Please check.” to the output unit 102.
- When an unknown home page is crawled, the extraction unit 133 may output, to the output unit 102, information according to the number or rate of data items that match the data of an already acquired home page. Such information is, for example, a message such as “Data matching rate with similar pages is 66%. Check for mismatched data items.”
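The comparison with past data and the resulting messages can be sketched as follows. The record layout (a dictionary keyed by data name), the function names, and the sample values are assumptions; the message wording follows the examples in the text, and the rate is truncated to a whole percent.

```python
def compare_records(previous, current):
    """Return the names of changed items and the matching rate in percent."""
    changed = [name for name in current if previous.get(name) != current[name]]
    matched = len(current) - len(changed)
    rate = 100 * matched // len(current) if current else 100
    return changed, rate


def change_message(changed):
    """Build a notification in the style of the examples in the text."""
    if not changed:
        return None
    if len(changed) == 1:
        return f"{changed[0].capitalize()} has been updated. Please check."
    return f"Information on {len(changed)} locations has been updated. Please check."


previous = {
    "title": "Example Castle",
    "address": "1-2-3 Example Town",
    "telephone number": "000-0000",
    "update date": "2014-11-01",
    "location information": "35.0,139.5",
    "description": "A castle.",
}
current = dict(previous, address="4-5-6 Sample City")
current["update date"] = "2014-11-14"

changed, rate = compare_records(previous, current)
print(change_message(changed))
print(f"Data matching rate is {rate}%.")
```

With six items and two differences, four of six items match, which gives the 66% figure used in the example message.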
- the output control unit 134 refers to the extraction data storage unit 124 and outputs the extracted data as output data to the output unit 102 for display.
- The output control unit 134 may change the display, for example the display color, if the data acquired and extracted by a past crawl process differs from the data acquired and extracted by the current crawl process.
- When the output unit 102 is a reader/writer for a medium such as an SD memory card, the output control unit 134 outputs the extracted data as output data to the output unit 102 and stores it on the SD memory card or the like.
- FIG. 7 is a flowchart showing an example of the definition generation process.
- The registration unit 131 receives input of a data name, a data type, and a cutout method for the data to be extracted (step S1).
- The registration unit 131 associates the received data name, data type, and cutout method with one another to generate a data item definition.
- the registration unit 131 registers the generated data item definition in the item storage unit 122 (step S2).
- the registration unit 131 outputs and displays the source of the HTML document corresponding to the target URL on the output unit 102 (step S3). For example, when the administrator operates the input unit 101, the registration unit 131 accepts selection of an extraction target portion on the source of the HTML document corresponding to the displayed target URL (step S4). The registration unit 131 specifies the position on the hierarchical structure of the tag corresponding to the accepted extraction target part (step S5). The registration unit 131 sets the specified position on the hierarchical structure as position specifying information of the extraction target portion (step S6). In addition, the registration unit 131 uses the name of the tag corresponding to the extraction target part and the order of the tags in the document as position specifying information of the extraction target part together with the specified position on the hierarchical structure. When there are a plurality of data items in the HTML document of the target URL, the registration unit 131 accepts selection of the extraction target part and specifies the position of the tag in the hierarchical structure.
- the registration unit 131 associates the target URL with the position specifying information of the extraction target portion and generates a definition of the target URL.
- the registration unit 131 registers the generated definition of the target URL in the target storage unit 121 (step S7). Thereby, the data acquisition apparatus 100 can register the definition of the data item and the definition of the target URL.
- FIG. 8 is a flowchart illustrating an example of the crawl process.
- the crawl unit 132 refers to the target storage unit 121 and accesses a home page including the target URL (step S11).
- the crawl unit 132 refers to the target storage unit 121 and selects a target URL for acquiring page contents from all links in the home page (step S12).
- the crawl unit 132 acquires page contents from the selected target URL (step S13).
- the crawl unit 132 stores the acquired page content in the page storage unit 123. Further, the crawl unit 132 outputs acquisition completion information indicating that the acquisition of the page content is completed to the extraction unit 133.
- the extraction unit 133 refers to the position specifying information of the extraction target portion of the target storage unit 121, and from the page content of the target URL stored in the page storage unit 123, Data of the data item of the extraction target part is extracted (step S14).
- the extraction unit 133 stores the extracted data in the extracted data storage unit 124 in association with the URLID (step S15).
- the extraction unit 133 outputs extraction completion information to the output control unit 134.
- the output control unit 134 refers to the extraction data storage unit 124, and outputs and displays the extracted data on the output unit 102 (step S16).
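Steps S13 to S16 can be summarized as a small loop. The sketch below is a simplification under assumptions: the stores are plain dictionaries, the fetch and extract functions are injected so the example runs without network access, scheduling of regular or irregular runs is omitted, and all names and the canned page are hypothetical.

```python
import re
from typing import Callable, Dict, Tuple


def run_crawl(
    targets: Dict[int, str],
    fetch: Callable[[str], str],
    extract: Callable[[str], dict],
) -> Tuple[Dict[int, str], Dict[int, dict]]:
    """Fetch each target URL, keep the page, and store the extracted record."""
    page_store: Dict[int, str] = {}
    extracted_store: Dict[int, dict] = {}
    for url_id, url in targets.items():
        page_store[url_id] = fetch(url)                          # step S13
        extracted_store[url_id] = extract(page_store[url_id])    # steps S14, S15
    return page_store, extracted_store


# Canned content standing in for a real site (assumption).
FAKE_SITE = {
    "http://aaaaa.bbb.ccc/ddd/eee/001.html": "<div class='title'>Example Castle</div>",
}


def fake_fetch(url: str) -> str:
    return FAKE_SITE[url]


def extract_title(html: str) -> dict:
    match = re.search(r"<div class='title'>(.+?)</div>", html)
    return {"title": match.group(1) if match else None}


pages, records = run_crawl(
    {1: "http://aaaaa.bbb.ccc/ddd/eee/001.html"}, fake_fetch, extract_title
)
for url_id, record in records.items():                           # step S16
    print(url_id, record)
```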
- the data acquisition apparatus 100 identifies and registers the position of the tag in the hierarchical structure, so that the data of the target portion can be extracted from the HTML document and output without the unique tag information.
- the data acquisition apparatus 100 can collect various information from various websites with different formats and construct a database that is unified in a predetermined format.
- the data acquisition device 100 identifies the position on the hierarchical structure of the tag included in the document of the extraction target portion in the document that is associated with the specific URL and includes the tag structure information, Allows registration of location. Further, the data acquisition device 100 accesses a document associated with a specific URL regularly or irregularly, extracts data corresponding to the position of the registered tag in the hierarchical structure, and outputs the extracted data. As a result, the data of the target portion can be extracted and output even without specific tag information.
- the position of the extraction target part is further specified using a combination of the tag name or the order of the tag in the document and the hierarchical structure of the tag. As a result, the data of the target portion can be extracted and output more accurately.
- The data acquisition apparatus 100 outputs information indicating that the data has changed when the data previously extracted for a registered position in the tag hierarchy differs from the data extracted this time for the same position. As a result, it can be easily determined that the document corresponding to the target URL has been updated.
- When a plurality of positions of the extraction target portion are registered for the document, the data acquisition apparatus 100 produces output according to the number or rate of data items that match the past data among the data corresponding to the plurality of positions. As a result, even when a crawling process is performed on an unknown home page, a definition for easily extracting data can be set, and desired data can be extracted and output.
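The comparison between past and current extractions can be sketched as follows. This is a minimal Python example, assuming each crawl result is held as a dictionary from a position label to the extracted text; the function name `detect_changes`, the labels, and the sample values are hypothetical.

```python
def detect_changes(previous, current):
    """Return the registered positions whose extracted data differs between crawls.

    Both arguments map a position label to the text extracted at that position.
    A position missing from `previous` is reported with an old value of None.
    """
    changes = {}
    for position, new_value in current.items():
        old_value = previous.get(position)
        if old_value != new_value:
            changes[position] = (old_value, new_value)
    return changes


# Hypothetical results of two crawls of the same target URL.
previous_crawl = {"address": "1-2-3 Example Street", "hours": "9:00-17:00"}
current_crawl = {"address": "1-2-3 Example Street", "hours": "10:00-18:00"}

for position, (old, new) in detect_changes(previous_crawl, current_crawl).items():
    print(f"changed: {position}: {old!r} -> {new!r}")
```

Only the `hours` entry is reported here, because the address text is identical in both crawls.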
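One way to read "output according to the number or rate of matching data" is sketched below in Python. The text does not say what the output looks like, so the report string, the `threshold` parameter, and the names `match_summary` and `report` are assumptions for illustration only.

```python
def match_summary(previous, current):
    """Return (matched count, match rate) over the registered positions in `current`."""
    total = len(current)
    matched = sum(
        1 for position, value in current.items()
        if position in previous and previous[position] == value
    )
    rate = matched / total if total else 0.0
    return matched, rate


def report(previous, current, threshold=0.5):
    """Build a one-line report; `threshold` is a hypothetical cut-off."""
    matched, rate = match_summary(previous, current)
    status = "mostly unchanged" if rate >= threshold else "mostly changed"
    return f"{matched}/{len(current)} positions match past data ({rate:.0%}): {status}"


# Hypothetical data for four registered positions on one page.
past = {"name": "Castle Park", "address": "1-2-3 Example Street",
        "hours": "9:00-17:00", "fee": "500 yen"}
now = {"name": "Castle Park", "address": "1-2-3 Example Street",
       "hours": "10:00-18:00", "fee": "500 yen"}
```

With this data, `match_summary(past, now)` gives three matches out of four positions, so a caller could choose to treat the page as largely stable and flag only the differing item.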
- the data acquisition apparatus 100 displays the document described in the HTML format or the source of the document, and accepts selection of the extraction target portion included in the displayed document or the source of the document.
- the data acquisition apparatus 100 identifies the tag hierarchy corresponding to the received extraction target part, and registers the identified hierarchy as information for specifying the position of the extraction target part. As a result, data items acquired by the crawl process can be easily set.
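The registration flow above (display, selection, identification of the tag hierarchy, registration) might look like the following Python sketch. It assumes the selection arrives as a text snippet and that the registry is a nested dictionary keyed by URL ID and data item name; `PathFinder`, `register_selection`, and the sample page are hypothetical names, and only the standard-library `html.parser` module is used.

```python
from html.parser import HTMLParser

VOID_TAGS = frozenset(
    {"area", "base", "br", "col", "embed", "hr", "img", "input",
     "link", "meta", "source", "track", "wbr"}
)


class PathFinder(HTMLParser):
    """Finds the tag hierarchy of the first element whose text contains a selection."""

    def __init__(self, selected_text):
        super().__init__()
        self.selected_text = selected_text
        self.stack = []        # (tag, order among same-named siblings)
        self.counters = [{}]   # per-depth count of child tag names
        self.found_path = None

    def handle_starttag(self, tag, attrs):
        index = self.counters[-1].get(tag, 0)
        self.counters[-1][tag] = index + 1
        if tag in VOID_TAGS:
            return
        self.stack.append((tag, index))
        self.counters.append({})

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1][0] == tag:
            self.stack.pop()
            self.counters.pop()

    def handle_data(self, data):
        if self.found_path is None and self.selected_text in data:
            self.found_path = list(self.stack)


def register_selection(registry, url_id, item_name, html_text, selected_text):
    """Identify the hierarchy for the selected text and store it in the registry."""
    finder = PathFinder(selected_text)
    finder.feed(html_text)
    finder.close()
    if finder.found_path is None:
        raise ValueError("selected text not found in document")
    registry.setdefault(url_id, {})[item_name] = finder.found_path
    return finder.found_path


# Hypothetical page shown to the administrator, who selects the address text.
SAMPLE_PAGE = """<html><body>
<div><p>Top banner</p></div>
<div><h1>Castle Park</h1><p>1-2-3 Example Street</p></div>
</body></html>"""
registry = {}
registered = register_selection(registry, "U001", "address", SAMPLE_PAGE, "Example Street")
```

The stored hierarchy can then be reused on later crawls of the same URL without asking the administrator to select the portion again.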
- Although the above embodiment describes the case where homepages about sightseeing spots are crawled, the invention is not limited to this. For example, homepages about disaster prevention information, traffic information, tour product information, job offer information, and the like may be crawled.
- the data acquisition apparatus 100 can construct a database without omission by collecting information on various homepages with different managers and integrating data with the same attribute.
- In the above embodiment, the character string of a tag that includes “address”, or the character string that follows “address”, is acquired using a regular expression, but the present invention is not limited to this.
- For example, other keywords that may be used in address notation, such as “location”, may also be acquired using a regular expression.
- As a result, the data acquisition apparatus 100 can integrate such data into the database as data with the same attribute even when similar but different terms are used.
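The keyword-based acquisition can be sketched with Python's standard `re` module. The keyword list, the optional colon, and the function name `find_address` are assumptions for this example; the actual expressions used in the embodiment are not given in the text.

```python
import re

# Keywords treated as the same "address" attribute (hypothetical list).
ADDRESS_KEYWORDS = ("address", "location")

# Match a keyword, an optional colon, and capture the rest of the line.
_ADDRESS_PATTERN = re.compile(
    r"(?:%s)\s*[:：]?\s*(.+)" % "|".join(re.escape(k) for k in ADDRESS_KEYWORDS),
    re.IGNORECASE,
)


def find_address(text):
    """Return the string that follows an address keyword, or None if no keyword occurs."""
    match = _ADDRESS_PATTERN.search(text)
    return match.group(1).strip() if match else None
```

Because both keywords feed the same capture group, `find_address("Address: 1-2-3 Example Street")` and `find_address("Location 4-5 Sample Road")` both yield a value that can be stored under one attribute.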
- the present invention is not limited to this.
- the data acquisition device 100 may divide each tourist spot using a splitter, and use the divided portion as a substitute for the target URL. Thereby, the data acquisition device 100 can acquire desired data from home pages in various formats.
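A minimal Python sketch of the splitter idea follows, assuming that one page lists several spots separated by a fixed marker and that each portion is keyed by the URL ID plus its index. The marker `<hr>`, the function names, and the sample listing are hypothetical.

```python
def split_spots(page_text, splitter):
    """Split one page into per-spot portions, dropping empty pieces."""
    return [part.strip() for part in page_text.split(splitter) if part.strip()]


def as_targets(url_id, page_text, splitter):
    """Key each portion by (URL ID, portion index) so it can stand in for a target URL."""
    return {
        (url_id, index): part
        for index, part in enumerate(split_spots(page_text, splitter))
    }


# Hypothetical listing page with two spots separated by a horizontal rule.
LISTING = (
    "<h2>Castle Park</h2><p>1-2-3 Example Street</p>"
    "<hr>"
    "<h2>River Museum</h2><p>4-5 Sample Road</p>"
)
targets = as_targets("U001", LISTING, "<hr>")
```

Each entry of `targets` can then be processed like the content of a separate page, so the same registered positions apply within every portion.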
- each component of each part illustrated does not necessarily need to be physically configured as illustrated.
- The specific form of distribution and integration of each part is not limited to the one shown in the figures; all or part of each part can be functionally or physically distributed or integrated in arbitrary units according to various loads and usage conditions.
- the crawl unit 132, the extraction unit 133, and the output control unit 134 may be integrated into an output control unit.
- All or any part of the various processing functions performed in each device may be executed on a CPU (or a microcomputer such as an MPU or MCU (Micro Controller Unit)).
- Needless to say, all or any part of the various processing functions may also be executed by a program that is analyzed and executed by a CPU (or a microcomputer such as an MPU or MCU), or by hardware based on wired logic.
- FIG. 9 is a diagram illustrating an example of a computer that executes a data acquisition program.
- the computer 200 includes a CPU 201 that executes various arithmetic processes, an input device 202 that receives data input, and a monitor 203.
- the computer 200 also includes a medium reading device 204 that reads a program and the like from a storage medium, an interface device 205 for connecting to various devices, and a communication device 206 for connecting to other information processing devices and the like by wire or wirelessly.
- The computer 200 also includes a RAM 207 that temporarily stores various types of information and a hard disk device 208.
- the devices 201 to 208 are connected to a bus 209.
- the hard disk device 208 stores a data acquisition program having the same functions as the processing units of the registration unit 131, the crawl unit 132, the extraction unit 133, and the output control unit 134 illustrated in FIG. Further, the hard disk device 208 stores a target storage unit 121, an item storage unit 122, a page storage unit 123, an extracted data storage unit 124, and various data for realizing a data acquisition program.
- the input device 202 has a function equivalent to that of the input unit 101, and receives input of various information such as a target URL, definition, management information, and the like from an administrator of the computer 200, for example.
- the monitor 203 has a function equivalent to that of the output unit 102, and displays various screens such as a management information screen, a reception screen, and a data display screen to the administrator of the computer 200, for example.
- the interface device 205 is connected to, for example, a printing device.
- the communication device 206 has the same function as the communication unit 110 shown in FIG. 1 and is connected to the network N to exchange various information with a site on the Internet.
- the CPU 201 reads out each program stored in the hard disk device 208, develops it in the RAM 207, and executes it to perform various processes.
- these programs can cause the computer 200 to function as the registration unit 131, the crawl unit 132, the extraction unit 133, and the output control unit 134 illustrated in FIG.
- the above data acquisition program is not necessarily stored in the hard disk device 208.
- the computer 200 may read and execute a program stored in a storage medium readable by the computer 200.
- the storage medium readable by the computer 200 corresponds to, for example, a portable recording medium such as a CD-ROM, a DVD disk, a USB (Universal Serial Bus) memory, a semiconductor memory such as a flash memory, and a hard disk drive.
- the data acquisition program may be stored in a device connected to a public line, the Internet, a LAN, etc., and the computer 200 may read and execute the data acquisition program from these.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Transfer Between Computers (AREA)
Description
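101 Input unit
102 Output unit
110 Communication unit
120 Storage unit
121 Target storage unit
122 Item storage unit
123 Page storage unit
124 Extracted data storage unit
130 Control unit
131 Registration unit
132 Crawl unit
133 Extraction unit
134 Output control unit
N Network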
Claims (15)
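- A data acquisition program that causes a computer to execute a process comprising:
identifying the position of an extraction target portion in the hierarchical structure of the tags included in a document that is associated with a specific URL and includes tag structure information, and allowing the position in the hierarchical structure to be registered; and
regularly or irregularly accessing the document associated with the specific URL, extracting data corresponding to the registered position in the hierarchical structure of the tags, and outputting the extracted data.
- The data acquisition program according to claim 1, wherein the position of the extraction target portion is further specified using a combination of a tag name or the order of a tag within the document and the hierarchical structure of the tags.
- The data acquisition program according to claim 1, wherein information indicating that the data has changed is output when the data previously extracted for the registered position in the hierarchical structure of the tags differs from the data extracted this time for the registered position in the hierarchical structure of the tags.
- The data acquisition program according to claim 1, wherein, when a plurality of positions of the extraction target portion are registered for the document, output is performed according to the number or rate of data items that match past data among the data corresponding to the plurality of positions.
- The data acquisition program according to claim 1, wherein the process further comprises:
displaying the document described in HTML format or the source of the document;
accepting selection of an extraction target portion included in the displayed document or the displayed source of the document;
identifying the tag hierarchy corresponding to the accepted extraction target portion; and
registering the identified hierarchy as information that specifies the position of the extraction target portion.
- A data acquisition method in which a computer executes a process comprising:
identifying the position of an extraction target portion in the hierarchical structure of the tags included in a document that is associated with a specific URL and includes tag structure information, and allowing the position in the hierarchical structure to be registered; and
regularly or irregularly accessing the document associated with the specific URL, extracting data corresponding to the registered position in the hierarchical structure of the tags, and outputting the extracted data.
- The data acquisition method according to claim 6, wherein the position of the extraction target portion is further specified using a combination of a tag name or the order of a tag within the document and the hierarchical structure of the tags.
- The data acquisition method according to claim 6, wherein information indicating that the data has changed is output when the data previously extracted for the registered position in the hierarchical structure of the tags differs from the data extracted this time for the registered position in the hierarchical structure of the tags.
- The data acquisition method according to claim 6, wherein, when a plurality of positions of the extraction target portion are registered for the document, output is performed according to the number or rate of data items that match past data among the data corresponding to the plurality of positions.
- The data acquisition method according to claim 6, wherein the process further comprises:
displaying the document described in HTML format or the source of the document;
accepting selection of an extraction target portion included in the displayed document or the displayed source of the document;
identifying the tag hierarchy corresponding to the accepted extraction target portion; and
registering the identified hierarchy as information that specifies the position of the extraction target portion.
- A data acquisition device comprising:
a registration unit that identifies the position of an extraction target portion in the hierarchical structure of the tags included in a document that is associated with a specific URL and includes tag structure information, and allows the position in the hierarchical structure to be registered; and
an output control unit that regularly or irregularly accesses the document associated with the specific URL, extracts data corresponding to the registered position in the hierarchical structure of the tags, and outputs the extracted data.
- The data acquisition device according to claim 11, wherein the position of the extraction target portion is further specified using a combination of a tag name or the order of a tag within the document and the hierarchical structure of the tags.
- The data acquisition device according to claim 11, wherein the output control unit outputs information indicating that the data has changed when the data previously extracted for the registered position in the hierarchical structure of the tags differs from the data extracted this time for the registered position in the hierarchical structure of the tags.
- The data acquisition device according to claim 11, wherein, when a plurality of positions of the extraction target portion are registered for the document, the output control unit performs output according to the number or rate of data items that match past data among the data corresponding to the plurality of positions.
- The data acquisition device according to claim 11, wherein the registration unit:
displays the document described in HTML format or the source of the document;
accepts selection of an extraction target portion included in the displayed document or the displayed source of the document;
identifies the tag hierarchy corresponding to the accepted extraction target portion; and
registers the identified hierarchy as information that specifies the position of the extraction target portion.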
Priority Applications (5)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
SG11201703830XA SG11201703830XA (en) | 2014-11-14 | 2014-11-14 | Recording medium, data acquisition method, and data acquisition apparatus. |
JP2016558843A JP6500908B2 (ja) | 2014-11-14 | 2014-11-14 | Data acquisition program, data acquisition method, and data acquisition device |
EP14905762.2A EP3220285A4 (en) | 2014-11-14 | 2014-11-14 | Data acquisition program, data acquisition method and data acquisition device |
PCT/JP2014/080268 WO2016075829A1 (ja) | 2014-11-14 | 2014-11-14 | Data acquisition program, data acquisition method, and data acquisition device |
US15/589,150 US10769216B2 (en) | 2014-11-14 | 2017-05-08 | Data acquisition method, data acquisition apparatus, and recording medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/JP2014/080268 WO2016075829A1 (ja) | 2014-11-14 | 2014-11-14 | Data acquisition program, data acquisition method, and data acquisition device |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US15/589,150 Continuation US10769216B2 (en) | 2014-11-14 | 2017-05-08 | Data acquisition method, data acquisition apparatus, and recording medium |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2016075829A1 (ja) | 2016-05-19 |
Family
ID=55953942
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2014/080268 WO2016075829A1 (ja) | Data acquisition program, data acquisition method, and data acquisition device | 2014-11-14 | 2014-11-14 |
Country Status (5)
Country | Link |
---|---|
US (1) | US10769216B2 (ja) |
EP (1) | EP3220285A4 (ja) |
JP (1) | JP6500908B2 (ja) |
SG (1) | SG11201703830XA (ja) |
WO (1) | WO2016075829A1 (ja) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10922366B2 (en) * | 2018-03-27 | 2021-02-16 | International Business Machines Corporation | Self-adaptive web crawling and text extraction |
CN110909123B (zh) * | 2019-10-23 | 2023-08-25 | 深圳价值在线信息科技股份有限公司 | Data extraction method, apparatus, terminal device, and storage medium |
TWI757733B (zh) * | 2020-05-05 | 2022-03-11 | 華碩電腦股份有限公司 | Network data collection method |
US20230229850A1 (en) * | 2022-01-14 | 2023-07-20 | Microsoft Technology Licensing, Llc | Smart tabular paste from a clipboard buffer |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
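JP2011100403A (ja) * | 2009-11-09 | 2011-05-19 | Sony Corp | Information processing apparatus, information extraction method, program, and information processing system |
JP2012103929A (ja) * | 2010-11-11 | 2012-05-31 | Nippon Telegr & Teleph Corp <Ntt> | Information extraction device, information extraction method, and information extraction program |
JP2014522030A (ja) * | 2011-07-22 | 2014-08-28 | アリババ・グループ・ホールディング・リミテッド | Configuring a web crawler to extract web page information |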
Family Cites Families (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP3715444B2 (ja) * | 1998-06-30 | 2005-11-09 | 株式会社東芝 | Structured document storage method and structured document storage device |
JP3946934B2 (ja) | 1999-08-05 | 2007-07-18 | 株式会社東芝 | Web page component integration processing device, web page component integration processing method, and client device |
US6754648B1 (en) * | 1999-09-30 | 2004-06-22 | Software Ag | Method for storing and managing data |
JP2001202283A (ja) | 1999-11-09 | 2001-07-27 | Fujitsu Ltd | Content update status monitoring system |
JP2003248613A (ja) * | 2001-11-20 | 2003-09-05 | Sharp Corp | Information distribution system and distribution information generation device used therein |
JP2006318138A (ja) * | 2005-05-11 | 2006-11-24 | Nec Personal Products Co Ltd | Web system, server computer for web system, and computer program |
US7627571B2 (en) * | 2006-03-31 | 2009-12-01 | Microsoft Corporation | Extraction of anchor explanatory text by mining repeated patterns |
JP2011039766A (ja) * | 2009-08-11 | 2011-02-24 | Ricoh Co Ltd | Information distribution server, information distribution system, information distribution program, and information distribution method |
- 2014
  - 2014-11-14 WO PCT/JP2014/080268 patent/WO2016075829A1/ja active Application Filing
  - 2014-11-14 SG SG11201703830XA patent/SG11201703830XA/en unknown
  - 2014-11-14 JP JP2016558843A patent/JP6500908B2/ja active Active
  - 2014-11-14 EP EP14905762.2A patent/EP3220285A4/en not_active Withdrawn
- 2017
  - 2017-05-08 US US15/589,150 patent/US10769216B2/en active Active
Non-Patent Citations (1)
Title |
---|
See also references of EP3220285A4 * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2020086996A (ja) * | 2018-11-27 | 2020-06-04 | 株式会社クリエイト | Posted information search system |
JP7018202B2 (ja) | 2018-11-27 | 2022-02-10 | 株式会社クリエイト | Posted information search system |
Also Published As
Publication number | Publication date |
---|---|
SG11201703830XA (en) | 2017-06-29 |
US10769216B2 (en) | 2020-09-08 |
US20170300574A1 (en) | 2017-10-19 |
JPWO2016075829A1 (ja) | 2017-08-17 |
EP3220285A1 (en) | 2017-09-20 |
JP6500908B2 (ja) | 2019-04-17 |
EP3220285A4 (en) | 2017-11-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10769216B2 (en) | Data acquisition method, data acquisition apparatus, and recording medium | |
US9910870B2 (en) | System and method for creating data models from complex raw log files | |
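KR20170073693A (ko) | Extraction of similar group elements | |
US20110197133A1 (en) | Methods and apparatuses for identifying and monitoring information in electronic documents over a network | |
JP2019040260A (ja) | Information processing apparatus and program | |
JP2007172482A (ja) | Information display system | |
JP2008123425A (ja) | Web document data providing device, method, and system | |
JP2008071116A (ja) | Information distribution system, information distribution device, information distribution method, and information distribution program | |
JP2018106556A (ja) | Screen information generation device, screen information generation method, and program | |
JP6601412B2 (ja) | Information acquisition program, information acquisition method, and information acquisition device | |
JP5585816B2 (ja) | Portal site generation system, portal site generation method, and computer program | |
JP2018152015A (ja) | Storage control device, storage control program, and storage control method | |
JP6493413B2 (ja) | Data acquisition program, data acquisition method, and data acquisition device | |
JP2006209598A (ja) | Site information collection system | |
JP6915322B2 (ja) | Website comparison processing program, website comparison method, and device for comparing websites | |
JP6528341B1 (ja) | Information processing device, information processing method, and program | |
JP6520955B2 (ja) | Data verification program, data verification method, and data verification device | |
US7222296B2 (en) | Configurable display of web site content | |
JP5247543B2 (ja) | Information providing device, information providing method, and program | |
JP6485462B2 (ja) | Information processing device, information processing method, and information processing program | |
US20150347610A1 (en) | Methods and apparatus for modifying a plurality of markup language files | |
JP2005327157A (ja) | Information integration method and program therefor | |
JP2007086842A (ja) | Input form presentation system and method | |
JP2018180979A (ja) | Log structure visualization device, log structure visualization method, and program | |
WO2013038508A1 (ja) | Computer, computer system, and database construction support method |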
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 14905762; Country of ref document: EP; Kind code of ref document: A1 |
| | ENP | Entry into the national phase | Ref document number: 2016558843; Country of ref document: JP; Kind code of ref document: A |
| | WWE | WIPO information: entry into national phase | Ref document number: 11201703830X; Country of ref document: SG |
| | REEP | Request for entry into the European phase | Ref document number: 2014905762; Country of ref document: EP |
| | NENP | Non-entry into the national phase | Ref country code: DE |