WO2013038519A1

WO2013038519A1 - Web page analysis device and program for analyzing web page

Info

Publication number: WO2013038519A1
Application number: PCT/JP2011/070978
Authority: WO
Inventors: 竜一得上
Original assignee: 株式会社マイニングブラウニー
Priority date: 2011-09-14
Filing date: 2011-09-14
Publication date: 2013-03-21
Also published as: JPWO2013038519A1; JP4959032B1

Abstract

The present invention analyzes the hierarchical structure of tags of a structured document configuring a Web page, imparts a depth point corresponding to depth from a root for each row, and adds a keyword point corresponding to a keyword for a row in which the keyword is included to the depth point. In addition, on the basis of the depth point and the keyword point, a predetermined range, including the row in which the keyword is included and a number of rows before and after the row, is extracted as an object block, and by extracting information satisfying a predetermined condition from within the extracted object block, it is possible to automatically extract desired information, which is often listed at a short distance from the keyword, from the Web page.

Description

Web page analysis apparatus and web page analysis program

The present invention relates to a web page analysis apparatus and a web page analysis program, and more particularly to an apparatus and program for analyzing a structured document of a web page described in HTML (HyperText Markup Language) or the like.

Currently, there are many web pages on the Internet, and a wide variety of information is provided. The user can acquire desired information by browsing these web pages.

Conventionally, programs called search engines have been provided to search for web pages on which desired information is posted. Basically, when a user inputs a keyword related to the desired information on a search site, the search engine built into the search site searches for web pages that include the input keyword, and the retrieved web pages are presented as a list of URLs.

However, since the search engine simply searches for and extracts web pages that include the keyword, the extracted web pages include many pages that the user does not want and that amount to noise. Therefore, the user has to manually access the URLs listed by the search engine and check the contents of the web pages one by one.

Suppose, for example, that a search engine is used to search for web pages of EC (electronic commerce) sites on which products and prices are posted, in order to conduct a price survey on various products. If the user searches by entering the keyword “price”, which is likely to appear on any EC site, many web pages of EC sites are extracted, but web pages other than those of EC sites are also included in the extraction results as noise. In this case, the user needs to open the web pages, including a lot of noise, one by one and check their contents, so that work efficiency deteriorates.

On the other hand, a method for determining the type of a structured document such as a web page has been proposed (see, for example, Patent Documents 1 and 2). If this determination method is used, it is possible to search only the web page of the EC site, for example.
Patent Document 1: JP 2000-29902 A
Patent Document 2: JP 2003-308327 A

In Patent Document 1, for every HTML document, structural features are extracted from features based on tags and keywords, features based on image information, features based on link information, and features based on tag structures, and these are collated with rules to calculate a goodness of fit for classification. Then, using the input keyword and type, the results are narrowed down by deleting documents below a certain fitness level, and the narrowed-down results are displayed.

However, in the technique described in Patent Document 1, it is essential to construct and adjust the structural feature rule base and the adjustment rules. This requires tuning such as selecting the features that serve as the basis of the rules and setting the number of points given to each rule, so there is a problem that a precise determination takes a great deal of time and labor.

The technique described in Patent Document 1 also has a problem that it cannot respond immediately to changes in web pages on the Internet. In other words, the characteristics of web pages change day by day, and to follow these changes the rules must be reconstructed by repeated trial and error while accumulating empirical knowledge, in the same way as when the determination rule base was first created.

To solve such problems, Patent Document 2 discloses a configuration comprising teacher data input means for inputting, as teacher data, the types of a plurality of structured documents collected via a network; a determination rule creating unit that creates, based on the structured documents and the teacher data, a determination rule for determining the types of the plurality of structured documents; and a determination rule executing unit that determines the type of a structured document according to the determination rule created by the determination rule creating unit.

However, the techniques described in Patent Documents 1 and 2 have a problem that, although the type of a web page can be determined, the content of the web page cannot be analyzed. Usually, a lot of information is included in one web page. The information desired by the user is more often present in a part of the web page than in the entire web page. For this reason, even if web pages can be narrowed down to the type desired by the user and extracted, the user still needs to visually confirm the content of each web page.

For example, if it is desired to survey the prices of various products by analyzing the information on web pages as described above, the techniques described in Patent Documents 1 and 2 make it possible to extract only the web pages of EC sites by determining the page type. However, it is impossible to analyze where on the web page the product and price are listed. For this reason, the user needs to manually open the extracted web pages of the EC sites one by one and check their contents, so that work efficiency deteriorates.

The present invention has been made to solve such a problem, and an object thereof is to make it possible to efficiently extract desired information from many web pages on a website on the Internet.

In order to achieve this object, the present invention analyzes the hierarchical structure of the tags of the structured document constituting the web page and assigns, to each line of the structured document, a depth point corresponding to the depth from the root. At the same time, for a line including a keyword, a keyword point corresponding to the keyword is added to the depth point. Then, based on the depth points and the keyword points, a predetermined range including the line that contains the keyword and several lines before and after it is extracted as a target block, and information satisfying a predetermined condition is extracted from the extracted target block.

According to the present invention configured as described above, in a structured document constituting a web page, a set of lines in a predetermined range at a short distance from a line including a keyword is extracted as a target block, and information satisfying a predetermined condition is extracted from the target block. Therefore, if the conditions necessary for extracting the desired information are set as the predetermined condition, the desired information, which is often posted at a short distance from the keyword, can be automatically extracted from the web page. Thereby, desired information can be efficiently extracted from many web pages on websites on the Internet.

FIG. 1 is a diagram showing a configuration example of a web page analysis system including the web page analysis apparatus according to the present embodiment.
FIG. 2 is a block diagram showing a functional configuration example of the web page analysis apparatus according to the present embodiment.
FIG. 3 is a diagram showing an example of a web page to be analyzed by the web page analysis apparatus of the present embodiment.
FIG. 4 is a diagram showing an example of the HTML document of the web page shown in FIG. 3.
FIG. 5 is a diagram showing the points for each line of the HTML document shown in FIG. 4.
FIG. 6 is a diagram for explaining an extraction example of a target block by graphing the points shown in FIG. 5.
FIG. 7 is a flowchart showing an operation example of the web page analysis apparatus according to the present embodiment.

Hereinafter, an embodiment of the present invention will be described with reference to the drawings. FIG. 1 is a diagram illustrating a configuration example of a web page analysis system including a web page analysis apparatus according to the present embodiment. As shown in FIG. 1, the web page analysis system of this embodiment includes a web page collection unit 10, a web page storage unit 20, a web page analysis device 30, a display unit 40, and an operation unit 50.

The web page collection unit 10 collects many web pages from many websites on the Internet. The function of the web page collection unit 10 is realized by, for example, a program of a page collection robot called “crawler”. The web page storage unit 20 stores many web pages collected by the web page collection unit 10.

The web page analysis device 30 analyzes many web pages stored in the web page storage unit 20, extracts information desired by the user from the web pages, and outputs the extracted information. In the present embodiment, desired information extracted by the web page analyzing apparatus 30 is output to the display unit 40 such as a display. However, the output destination is not limited to the display unit 40. For example, it may be a printing unit such as a printer, or a recording medium such as a hard disk or a semiconductor memory.

The operation unit 50 is used when a user inputs a keyword to the web page analyzing apparatus 30, and is configured by a keyboard or a mouse, for example. The operation unit 50 is also used when the user designates an information type indicating what kind of information is desired to be extracted as desired information.

For example, the user designates, as a desired information type, the type of information such as whether he wants to extract product price information, company information, or job information. In addition, the user inputs a word that seems to be related to the specified information type as a keyword. For example, a word such as “tax included” for product price information, “representative” for company information, and “work” for job offer information are input as keywords.

Although an example in which the user inputs both a desired information type and a keyword by operating the operation unit 50 has been described here, the present invention is not limited to this. For example, an information type and a predetermined keyword are stored in association with each other in advance, and when the user operates the operation unit 50 to specify a desired information type, the keyword associated with the information type is automatically input. It may be.

The web page analysis device 30 includes an analysis processing unit 31, a keyword setting unit 32, and a keyword storage unit 33. The analysis processing unit 31 analyzes the many web pages stored in the web page storage unit 20 and executes processing for extracting information desired by the user from the web pages and outputting it. A detailed functional configuration of the analysis processing unit 31 will be described later with reference to FIG. 2.

The keyword setting unit 32 sets a keyword used when the analysis processing unit 31 analyzes a web page. The keyword setting unit 32 also sets the value of keyword points (details will be described later) to be assigned to the set keyword. The keyword storage unit 33 stores the keywords set by the keyword setting unit 32 and the corresponding keyword points.

In the present embodiment, the keyword setting unit 32 sets a word input by the user through the operation of the operation unit 50 as a keyword. In addition, the keyword setting unit 32 additionally sets a word extracted from the web page to be analyzed as a keyword in the course of web page analysis processing by the analysis processing unit 31.

The keyword setting unit 32 sets keyword point values to be assigned to the keywords set in this way, and stores the keywords and keyword points in the keyword storage unit 33 in association with each other.

Note that the keyword setting unit 32 initially sets, for example, a predetermined value as a keyword point at the time of input for the keyword input by the user through the operation of the operation unit 50. On the other hand, for a keyword additionally set from the web page in the course of the analysis process, the keyword setting unit 32 initially sets a value obtained by a predetermined calculation as a keyword point. Even after the keyword points are initially set in this way, the keyword setting unit 32 performs the predetermined calculation and updates the keyword points as needed each time the analysis of the web page is advanced.

The analysis processing unit 31 performs an analysis process on a web page including a keyword stored in the keyword storage unit 33 among web pages stored in the web page storage unit 20. As described above, since the word extracted from the web page in the course of the analysis process is additionally set as a keyword, the keyword stored in the keyword storage unit 33 changes depending on the learning effect. Also, the value of the keyword points given to the keyword is increased or decreased due to the learning effect.

Therefore, at the beginning of the analysis, when no keyword has yet been added, only the keyword input by the user through the operation of the operation unit 50 (for example, the word “tax included” when the information type is product price information) is stored in the keyword storage unit 33, and only web pages including that keyword are subjected to analysis by the analysis processing unit 31. On the other hand, as the analysis processing for a plurality of web pages progresses, keywords extracted from the web pages are additionally stored in the keyword storage unit 33, and web pages that do not include the original keyword “tax included” also come to be included in the analysis target of the analysis processing unit 31.

FIG. 2 is a block diagram illustrating a functional configuration example of the web page analyzing apparatus 30 according to the present embodiment. FIG. 3 is a diagram illustrating an example of a web page to be analyzed by the web page analyzing apparatus according to the present embodiment. FIG. 4 is a diagram showing an example of the structured document (HTML document) of the web page shown in FIG. 3. FIG. 5 is a diagram showing points for each line of the HTML document shown in FIG. 4. FIG. 6 is a diagram for explaining an extraction example of the target block by graphing the points shown in FIG. 5.

Hereinafter, the functional configuration of the web page analyzing apparatus 30 according to the present embodiment will be described with reference to FIGS. 2 to 6. FIGS. 3 and 4 show a web page of an EC site and its HTML document as an example of the web page. FIGS. 5 and 6 show analysis examples of the web page shown in FIGS. 3 and 4.

As shown in FIG. 2, the web page analysis apparatus 30 of the present embodiment includes, as its functional configuration, a depth point assigning unit 34, a keyword point assigning unit 35, a block extracting unit 36, an information extracting unit 37, and a filtering unit 38, in addition to the keyword setting unit 32 and the keyword storage unit 33 described above. The analysis processing unit 31 shown in FIG. 1 is configured by the depth point assigning unit 34, the keyword point assigning unit 35, the block extracting unit 36, the information extracting unit 37, and the filtering unit 38.

The web page analysis device 30 of the present embodiment actually includes a CPU that executes various arithmetic processes, a ROM that stores a web page analysis program, a RAM that is used as a work area for data storage and program execution, and a hard disk. The CPU operates according to the web page analysis program stored in the ROM, thereby executing the processing of the functional blocks 32 to 38.

As a recording medium for storing the web page analysis program, a CD-ROM, a flexible disk, a hard disk, a magnetic tape, an optical disk, a magneto-optical disk, a DVD, a nonvolatile memory card, or the like can be used instead of the ROM. Further, the web page analysis program may be downloaded to a computer via a network such as the Internet.

The depth point assigning unit 34 analyzes the hierarchical structure of the tags of the structured document (HTML document) constituting the web page to be analyzed among the web pages stored in the web page storage unit 20, and assigns to each line of the structured document a depth point corresponding to the depth from the root.

Usually, HTML can express various things on a web page by using a command sentence called a tag surrounded by “<” and “>” symbols. The tag includes a start tag indicating the start location of the instruction and an end tag indicating the end location of the instruction. In principle, the start tag is represented by a combination of “<” and “>” and a command statement. On the other hand, the end tag is represented by a combination of “</”, “>” and a command statement.

Also, HTML documents take the form of a tree structure of tags. Usually, there are a <head> tag and a <body> tag in the lower hierarchy (child hierarchy) of the <html> tag, and a tree structure corresponding to the contents of the web page is expanded in the lower hierarchy of the <body> tag. The start tag and end tag for one command statement are always at the same level, but if the tag of another command statement is inserted between the start tag and the end tag, the tag of that other command statement is one level lower in the hierarchy.

For example, in the HTML document of FIG. 4, which describes the web page shown in FIG. 3, “<html>” on the first line is a start tag and the corresponding end tag is “</html>” on the last line; these two tags belong to the first hierarchy (root hierarchy). Also, “<head>” on the second line is a start tag, and the corresponding end tag is “</head>” on the sixth line. These two tags belong to the second layer.

As described above, the depth point assigning unit 34 assigns a depth point to each line according to the depth of the hierarchy from the root of the HTML document. In the present embodiment, “every line” is synonymous with “every start tag”. That is, even if a plurality of start tags appear on the same physical line of the HTML document, the line is regarded as changing at each start tag, and a depth point is given to each. In addition, since the start tag and the end tag of one command statement are always in the same hierarchy in the HTML document, it is sufficient to give a depth point to the line of the start tag.

For example, in the example of FIG. 4, since the <html> tag on the first line is in the first layer, its depth point is “1” (see FIG. 5; the same applies hereinafter). Since the <head> tag on the second line is in the second layer, its depth point is “2”. The depth points of the <meta> tag on the third line, the <link> tag on the fourth line, and the <title> tag on the fifth line are all “3”. The “</head>” on the following sixth line is an end tag, so the hierarchy moves back up, and the <body> tag on the seventh line therefore belongs to the second layer and has a depth point of “2”.
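The depth point assignment described above can be sketched roughly as follows. This is only an illustrative sketch, not the implementation of the embodiment: it assumes Python's standard `html.parser` module, and the names `DepthPointParser`, `depth_points`, and `VOID_TAGS` are hypothetical. It also assumes that tags without an end tag (such as <meta>, <link>, and <input>) receive a depth point but do not open a new layer, which is consistent with the <meta>, <link>, and <title> tags all receiving “3” in the example.

```python
from html.parser import HTMLParser

# Assumption: HTML void elements never get an end tag, so they do not
# push the hierarchy one level deeper for the tags that follow them.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class DepthPointParser(HTMLParser):
    """Records one (tag, depth point) row per start tag."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # number of currently open non-void tags
        self.rows = []   # list of (tag name, depth point)

    def handle_starttag(self, tag, attrs):
        # The root tag gets depth point 1, its children 2, and so on.
        self.rows.append((tag, self.depth + 1))
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        # An end tag only moves the hierarchy back up; it gets no row.
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1


def depth_points(html):
    parser = DepthPointParser()
    parser.feed(html)
    parser.close()
    return parser.rows
```

Here each start tag is treated as its own row regardless of physical line breaks, matching the statement that “every line” is synonymous with “every start tag”.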

The keyword point giving unit 35 adds a keyword point corresponding to the keyword to the depth point for a line including the keyword in the HTML document. The keywords used here and the corresponding keyword points are set by the keyword setting unit 32 and stored in the keyword storage unit 33.

As described above, the keywords stored in the keyword storage unit 33 are initially only input by the user through the operation of the operation unit 50, but are gradually added by repeated learning. Further, the value of the keyword point stored in the keyword storage unit 33 is updated as needed as the analysis of a plurality of web pages proceeds.

In the example of FIGS. 4 and 5, the keyword “special price” described in the sixth layer, to which the <span> tag on the 24th line belongs, has been additionally set by learning, and “2.31” is set as the keyword point for this keyword. For this reason, the keyword point giving unit 35 adds the keyword point “2.31” set for the keyword “special price” to the depth point “6” of the <span> tag, so that the point of the 24th line, where the <span> tag is located, becomes “8.31”.

Similarly, in the example of FIGS. 4 and 5, the keyword “cart” described in the seventh layer, to which the <input> tag on the 28th line belongs, has been additionally set by learning, and “2.02” is set as the keyword point for this keyword. For this reason, the keyword point giving unit 35 adds the keyword point “2.02” set for the keyword “cart” to the depth point “7” of the <input> tag, so that the point of the 28th line, where the <input> tag is located, becomes “9.02”.
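As a minimal sketch of this addition, assuming each row is available as a pair of its depth point and its text, the per-row points could be computed as below. The function name `add_keyword_points` and the data layout are hypothetical, and the handling of a row that contains more than one keyword (here, the keyword points are simply summed) is an assumption, since the description does not address that case.

```python
def add_keyword_points(rows, keyword_points):
    """Return the per-row total of depth point plus keyword points.

    rows           -- list of (depth_point, text) pairs, one per start tag
    keyword_points -- dict mapping keyword -> keyword point
    """
    totals = []
    for depth, text in rows:
        # Assumption: if several keywords occur in one row, add them all.
        bonus = sum(point for keyword, point in keyword_points.items()
                    if keyword in text)
        totals.append(depth + bonus)
    return totals


# Values taken from the example: depth 6 + 2.31 and depth 7 + 2.02.
example = add_keyword_points(
    [(5, "Widget"), (6, "special price"), (7, "add to cart")],
    {"special price": 2.31, "cart": 2.02},
)
```

Rows without any keyword keep their depth point unchanged, so a difference between the two series appears only on keyword rows.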

A keyword point calculation method set by the keyword setting unit 32 and stored in the keyword storage unit 33 will be described later.

The block extraction unit 36 extracts, as a block from which desired information and keywords are to be extracted (referred to as a target block), a set of lines in a predetermined range that includes the lines where the addition of keyword points produces a difference from the depth points (in the example of FIG. 4, the 24th line, whose <span> tag contains the keyword “special price”, and the 28th line, whose <input> tag contains the keyword “cart”). The target block defines a range in which there is a high possibility that desired information is included.

The end point of the predetermined range that defines the target block is the first line, after the lines where the difference occurs, at which no keyword point is added and the depth point reaches a local minimum. The start point of the predetermined range is a line before the lines where the difference occurs whose depth point has the same value as that of the end point and is likewise a local minimum.

Here, the start point and end point of the target block will be described with reference to the graph shown in FIG. 6. Note that FIG. 6 is a simple line graph of the points given to each line of the HTML document shown in FIG. 4 (the depth points shown in FIG. 5 and the results of adding the keyword points to them). The horizontal axis indicates the line number, and the vertical axis indicates the point value.

In FIG. 6, a broken line graph 61 is a graph of depth points, and a solid line graph 62 is a graph of the addition results of depth points and keyword points. A range surrounded by a broken-line square is the target block 63. Further, when the corresponding portion of the target block 63 is shown on the web page shown in FIG. 3, a range 63 ′ surrounded by a broken-line square corresponds to the target block 63.

As shown in FIG. 6, there is a difference between the graph values at the 24th and 28th lines. Therefore, the block extraction unit 36 sets, as the end point of the predetermined range, the first line after the 28th line at which no keyword point is added (that is, there is no difference from the depth point) and the depth point reaches a local minimum. In the example of FIG. 6, the 29th line is the end point. The value of the depth point at this end point is “3”. On the other hand, the block extraction unit 36 sets, as the start point of the predetermined range, the line preceding the 24th line whose depth point has the same value “3” as the end point and is a local minimum. In the example of FIG. 6, the 15th line is the starting point.
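The start and end point rules can be illustrated with the following rough sketch. It rests on several assumptions that the description leaves open: the function name `extract_target_block` is hypothetical, indices are 0-based, all keyword rows of the page are treated as one cluster (so at most one target block is returned), “local minimum” at the end point is read as “lower than the preceding row and not higher than the following row”, and the start point is simplified to the nearest preceding row whose depth point equals that of the end point.

```python
def extract_target_block(depth, total):
    """Return (start, end) row indices of the target block, or None.

    depth -- depth point of each row
    total -- depth point plus keyword points of each row
    """
    n = len(depth)
    # Rows where adding keyword points produced a difference.
    keyword_rows = [i for i in range(n) if total[i] != depth[i]]
    if not keyword_rows:
        return None
    first_kw, last_kw = keyword_rows[0], keyword_rows[-1]

    # End point: first row after the keyword rows with no keyword point
    # where the depth has just dropped and does not drop further next.
    end = None
    for i in range(last_kw + 1, n):
        no_keyword = total[i] == depth[i]
        dropped = depth[i] < depth[i - 1]
        bottom = i == n - 1 or depth[i] <= depth[i + 1]
        if no_keyword and dropped and bottom:
            end = i
            break
    if end is None:
        return None

    # Start point (simplified): nearest row before the keyword rows
    # whose depth point equals the depth point of the end point.
    for i in range(first_kw - 1, -1, -1):
        if depth[i] == depth[end]:
            return (i, end)
    return None
```

With a depth series such as `[1, 2, 3, 4, 5, 6, 5, 6, 7, 3, 4]` and keyword points added at indices 5 and 8, this sketch returns the range from index 2 to index 9, mirroring the shape of the example in FIG. 6, where the block runs from a row with depth point “3” to the next row with depth point “3”.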

The information extraction unit 37 extracts information satisfying a predetermined condition from the target block extracted by the block extraction unit 36. For example, when the user specifies price information of a product as the type of information desired to be extracted from the web page through the operation of the operation unit 50, the information extraction unit 37 extracts the product name and price as information satisfying the predetermined condition. That is, the information extraction unit 37 extracts the product name and price from the target block 63 set in the web page of the EC site shown in FIG. 3.

Specifically, the information extraction unit 37 extracts a product name by morphological analysis. In general, product names are often composed of unknown words and nouns. Therefore, the information extraction unit 37 performs a morphological analysis on a sentence or a word that may include a product name, and determines that it is a product name if 70% of the morphemes are unknown words or nouns. A price, on the other hand, typically consists of a sequence of digits and commas, often with a character such as “¥” or “yen” before or after the sequence. Therefore, the information extraction unit 37 identifies the price using a regular expression that expresses such a condition.
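The two conditions above might be sketched as follows. This is a hedged illustration only: the regular expression `PRICE_RE` is one possible pattern, not the one used in the embodiment, and `looks_like_product_name` takes a list of part-of-speech labels that would have to come from a morphological analyzer, which is outside this sketch. The labels `"noun"` and `"unknown"` and the reading of “70%” as “at least 70%” are assumptions.

```python
import re

# Assumed pattern: a currency mark followed by digits and commas,
# or digits and commas followed by "円" or "yen".
PRICE_RE = re.compile(r"[¥￥]\s*\d[\d,]*|\d[\d,]*\s*(?:円|yen)")


def find_prices(text):
    """Return all substrings of text that look like a price."""
    return PRICE_RE.findall(text)


def looks_like_product_name(pos_tags, threshold=0.7):
    """Judge a phrase from the part-of-speech labels of its morphemes.

    pos_tags -- one label per morpheme, e.g. "noun", "unknown", "verb"
                (hypothetical labels produced by some external analyzer)
    """
    if not pos_tags:
        return False
    hits = sum(1 for tag in pos_tags if tag in ("noun", "unknown"))
    return hits / len(pos_tags) >= threshold
```

For instance, `find_prices("Special price ¥1,980 (tax included)")` yields the single match `¥1,980`, while a phrase whose morphemes are three nouns or unknown words and one particle passes the 70% test.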

In addition, when company information is specified as the information type, the information extraction unit 37 extracts information such as the location, representative name, capital, telephone number, number of employees, and date of establishment as information that satisfies a predetermined condition. For example, the information extraction unit 37 performs morphological analysis and determines that a part formed by a combination of a place name and a numerical value is a location. Further, it determines that a part consisting of a combination of digits with parentheses or hyphens is a telephone number. If there is a sequence of digits and commas and the character string “capital” appears near that sequence, the number is determined to be the capital. Other information is also determined by morphological analysis, regular expression conditions, and nearby character string conditions.

In addition, when job offer information is specified as the information type, the information extraction unit 37 extracts information such as work hours, salary, allowance, and work location as information that satisfies a predetermined condition. When extracting these pieces of information, the information extraction unit 37 determines whether the information is desired information based on the morphological analysis, the regular expression condition, and the condition of the character string existing nearby.

The filtering unit 38 determines whether the information extracted from the target block by the information extracting unit 37 contains all of the one or more kinds of information determined in advance for the information type, and outputs the information extracted from the target block to the display unit 40 only when all of them are present. For example, when the information type designated by the user is price information of a product, the filtering unit 38 determines whether both a product name and a price are present. If only one of them is present in the information extracted from the target block, the filtering unit 38 does not output the information to the display unit 40.

If company information is specified as the information type, the filtering unit 38 determines whether, for example, the location, representative name, and capital are all present. When job offer information is specified as the information type, the filtering unit 38 determines whether, for example, the salary, allowance, and work location are all present. If not all three pieces of information have been extracted from the target block, the filtering unit 38 does not output the information to the display unit 40.
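The filtering rule amounts to a completeness check per information type, which could be sketched as below. The dictionary `REQUIRED_FIELDS`, the key names, and the function `passes_filter` are hypothetical; only the required combinations themselves are taken from the examples in the description.

```python
# Required kinds of information per information type (from the examples
# in the text); the identifiers themselves are illustrative.
REQUIRED_FIELDS = {
    "product_price": {"product_name", "price"},
    "company": {"location", "representative", "capital"},
    "job_offer": {"salary", "allowance", "work_location"},
}


def passes_filter(info_type, extracted):
    """True only if every required field was extracted and is non-empty.

    extracted -- dict mapping field name -> extracted value
    """
    required = REQUIRED_FIELDS[info_type]
    return all(extracted.get(field) for field in required)
```

A block that yields only a price, for example, would be dropped for the product price type, whereas one that yields both a product name and a price would be passed on for output.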

Here, a method for calculating keyword points by the keyword setting unit 32 will be described. As described above, in the first stage, before the analysis of the web pages stored in the web page storage unit 20 is started, the keyword setting unit 32 sets a word input by the user through the operation of the operation unit 50 as a keyword and stores it in the keyword storage unit 33. In addition, in the course of the web page analysis process, the keyword setting unit 32 extracts words included in the target block extracted by the block extraction unit 36, additionally sets them as keywords, and stores them in the keyword storage unit 33.

For example, when the price information of a product is specified as the information type, the keyword setting unit 32 extracts from the target block the words other than the product name and price, and additionally sets them as new keywords in the keyword storage unit 33. Here, the keyword setting unit 32 learns which product name and price are included in the target block from a notification sent by the information extraction unit 37, which extracted that information. The keyword setting unit 32 then extracts from the target block the words other than the product name and price identified through this notification.

Also, the keyword setting unit 32 calculates and stores keyword points corresponding to the keywords stored in the keyword storage unit 33, including both existing keywords and newly set keywords. For example, the keyword setting unit 32 uses the total number of target blocks extracted by the block extraction unit 36 from the web pages to be analyzed and the number of appearances of a word set as a keyword, calculates a value corresponding to the ratio of the number of appearances of the word to the total number of target blocks, and sets it as the keyword point.

Below, this calculation method will be explained in a little more detail. That is, the block extraction unit 36 extracts zero or one or more target blocks from one web page. If a plurality of web pages are analyzed, the block extraction unit 36 can extract a total of N target blocks from the plurality of web pages. Then, the keyword setting unit 32 extracts various words as keywords from the N target blocks. At this time, the same word can be extracted M times from one or more target blocks. In this case, the keyword setting unit 32 calculates the value of M / N and sets it as a keyword point.

Thus, as the number of appearances M of a word increases, the keyword point given to the word becomes larger. Further, if the number of appearances M is the same, the smaller the total number N of target blocks, the larger the keyword point. In this embodiment, target blocks that are likely to contain product names and prices are identified based on keyword points and depth points, and new words other than product names and prices are extracted from the target blocks as keywords. For this reason, the number of appearances M increases, and the keyword points tend to increase, for words that are often placed close to product names and prices.

In the example of the web page of the EC site shown in FIG. 3, as examples of words that are often placed close to a product name or price, the word “special price” is given 2.31 points and the word “cart” is given 2.02 points as keyword points.

However, this is the keyword point value set at a certain point in time. As analysis of a plurality of web pages proceeds, the total number N of target blocks extracted from the plurality of web pages and the number M of appearances of words extracted from within the target blocks change. Thus, the keyword points also change constantly. Therefore, the keyword setting unit 32 stores the number of appearances M and the total number N of the extracted target blocks in association with each word extracted as a keyword, and uses them for calculating keyword points.
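A minimal sketch of the M / N bookkeeping is shown below. The class name `KeywordPointTable` and its methods are hypothetical. It assumes a single global N shared by all words and that every occurrence of a word within a block counts toward M, which is one reading of the description and would allow values above 1 such as the 2.31 in the example.

```python
from collections import Counter


class KeywordPointTable:
    """Tracks appearances M per word and the total block count N."""

    def __init__(self):
        self.total_blocks = 0         # N: target blocks seen so far
        self.appearances = Counter()  # M: appearances of each word

    def add_block(self, words):
        """Register the candidate words extracted from one target block."""
        self.total_blocks += 1
        self.appearances.update(words)

    def point(self, word):
        """Keyword point M / N for the given word (0.0 before any block)."""
        if self.total_blocks == 0:
            return 0.0
        return self.appearances[word] / self.total_blocks


table = KeywordPointTable()
table.add_block(["special price", "special price", "cart"])
table.add_block(["special price"])
# After two blocks: "special price" has M = 3, "cart" has M = 1, N = 2.
```

Because M and N are kept rather than the ratio alone, the keyword point of every word can be recomputed each time another target block is processed.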

The keyword setting unit 32 stores the calculated keyword points in the keyword storage unit 33 in association with the keywords. Here, for newly set keywords, the newly calculated keyword points are stored in the keyword storage unit 33. For existing keywords, the recalculated keyword points are updated and stored in the keyword storage unit 33.

Next, the operation of the web page analyzing apparatus 30 according to the present embodiment configured as described above will be described. FIG. 7 is a flowchart showing an operation example of the web page analyzing apparatus 30 according to the present embodiment. The flowchart shown in FIG. 7 starts when the user operates the operation unit 50 to give a web page analysis instruction to the web page analysis device 30. It is assumed that a plurality of web pages are already stored in the web page storage unit 20 at the start of the flowchart shown in FIG.

In FIG. 7, first, the user designates an information type indicating what kind of information is desired to be extracted as desired information through the operation of the operation unit 50 (step S1). Here, as an example, it is assumed that product price information is specified as an information type. Further, a word that is considered to be related to the designated information type is input as a keyword by the user through the operation of the operation unit 50 (step S2). Here, it is assumed that the word “tax included” is input. The keyword setting unit 32 sets the input word as a keyword and stores it in the keyword storage unit 33.

Next, the depth point assigning unit 34 acquires any one of the plurality of web pages stored in the web page storage unit 20 (step S3), and determines whether or not a keyword (in this case, “tax included”) is included in the web page (step S4). If no keyword is included, the process proceeds to step S13. As a result, web pages that do not contain any keywords are excluded from the analysis target.

On the other hand, when a keyword is included in the web page, an analysis process described below is executed. That is, first, the depth point assigning unit 34 analyzes the hierarchical structure of the tags of the HTML document constituting the web page currently being analyzed, and calculates the depth point corresponding to the depth from the root for each line (step S5).
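As a rough sketch of step S5, the following assigns a depth point to each line of an HTML document. The scale is an assumption made for illustration: a line receives the deepest nesting level reached on it, with the root element at depth 1. The function name `depth_points`, the regex-based tag scan, and the short list of void elements are likewise hypothetical simplifications, not the method of the depth point assigning unit 34.

```python
import re

# Matches opening, closing, and self-closing tags; comments and doctype are ignored.
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*?(/?)>")
# Elements that never have a closing tag (partial list, for illustration only).
VOID = {"br", "img", "input", "hr", "meta", "link"}


def depth_points(html_lines):
    """Return one depth point per line: the deepest nesting level reached on it."""
    depth = 0
    points = []
    for line in html_lines:
        deepest = depth
        for closing, name, self_closing in TAG.findall(line):
            if closing:
                depth = max(depth - 1, 0)
            elif not self_closing and name.lower() not in VOID:
                depth += 1
                deepest = max(deepest, depth)
        points.append(deepest)
    return points


page = [
    "<html>",
    "<body>",
    "<div>",
    "<p>Tea cup</p>",
    "<span>1,050 yen (tax included)</span>",
    "</div>",
    "</body>",
    "</html>",
]
print(depth_points(page))  # [1, 2, 3, 4, 4, 3, 2, 1]
```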

Further, based on the keywords stored in the keyword storage unit 33 and the corresponding keyword points, the keyword point assigning unit 35 adds, for each line of the HTML document that includes a keyword, the keyword point corresponding to that keyword to the depth point (step S6). In the analysis of the first web page, a keyword point (for example, a predetermined value) corresponding to the keyword is added to the depth point for a line including the keyword “tax included”.

Next, the block extraction unit 36 extracts a target block from the web page currently being analyzed based on the depth point and the keyword point (step S7). Here, the block extraction unit 36 extracts, as a target block, a predetermined range that includes the line in which a difference from the depth point is generated by adding the keyword point, and whose start point and end point are lines before and after that line to which no keyword point is added and at which the depth point is minimal.
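Step S6 can be illustrated with a short sketch. The function name `add_keyword_points`, the sample lines, and the initial keyword point of 1.0 are assumptions; summing the points when several keywords occur on one line is also an assumption, since the text does not say how that case is handled.

```python
def add_keyword_points(lines, depth_pts, keyword_points):
    """Add the point of every keyword found on a line to that line's depth point."""
    totals = []
    for line, depth in zip(lines, depth_pts):
        bonus = sum(pt for kw, pt in keyword_points.items() if kw in line)
        totals.append(depth + bonus)
    return totals


lines = [
    "<div>",
    "<p>Tea cup</p>",
    "<span>1,050 yen (tax included)</span>",
    "</div>",
]
depth_pts = [3, 4, 4, 3]                 # depth points assumed to be known already
keyword_points = {"tax included": 1.0}   # hypothetical initial keyword point
print(add_keyword_points(lines, depth_pts, keyword_points))  # [3, 4, 5.0, 3]
```

Only the third line differs from its depth point, which is the "difference" that the block extraction step looks for.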
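One possible reading of step S7 is sketched below. It is an interpretation, not a confirmed reproduction of the block extraction unit 36. The end point is taken as the first later line that has no keyword point, lies shallower than the keyword line, and is a local minimum (the depth stops falling there). The start point is taken as the nearest earlier line without a keyword point at the same depth as the end point. The function name `extract_blocks` and the sample point values are hypothetical.

```python
def extract_blocks(depth_pts, total_pts):
    """Return (start, end) line index pairs, inclusive, one per keyword line."""
    has_kw = [t != d for d, t in zip(depth_pts, total_pts)]
    last = len(depth_pts) - 1
    blocks = []
    for i, hit in enumerate(has_kw):
        if not hit:
            continue
        # End point: first later line without a keyword point where the depth stops falling.
        end = last
        for j in range(i + 1, last + 1):
            is_valley = j == last or depth_pts[j + 1] >= depth_pts[j]
            if not has_kw[j] and depth_pts[j] < depth_pts[i] and is_valley:
                end = j
                break
        # Start point: nearest earlier line without a keyword point at the end point's depth.
        start = 0
        for j in range(i - 1, -1, -1):
            if not has_kw[j] and depth_pts[j] == depth_pts[end]:
                start = j
                break
        if (start, end) not in blocks:
            blocks.append((start, end))
    return blocks


# Two list items inside a list; only line 3 contains a keyword (point 1.0 added).
depth_pts = [1, 2, 3, 3, 2, 2, 3, 3, 2, 1]
total_pts = [1, 2, 3, 4.0, 2, 2, 3, 3, 2, 1]
print(extract_blocks(depth_pts, total_pts))  # [(1, 4)]
```

With these sample values the block spans the first list item only, from its opening line to its closing line.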

Next, the information extraction unit 37 extracts desired information satisfying a predetermined condition from the target block extracted by the block extraction unit 36 (step S8). Here, since the price information of the product is specified as the information type, the information extraction unit 37 extracts the product name and price from the target block as desired information that satisfies a predetermined condition.
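Step S8 might look like the following for product price information. The description mentions morphological analysis or regular expressions as possible means; this sketch uses a regular expression for the price and, as a loudly simplified stand-in for morphological analysis, takes the first text line without a price as the product name. The price pattern, the function name, and the sample block are all assumptions.

```python
import re

# Hypothetical price pattern: a yen sign before the digits, or "yen" / "円" after them.
PRICE = re.compile(r"[¥￥]\s*\d[\d,]*|\d[\d,]*\s*(?:yen|円)")
TAGS = re.compile(r"<[^>]+>")


def extract_name_and_price(block_lines):
    """Return the first price found and the first non-price text line as the name."""
    name, price = None, None
    for line in block_lines:
        text = TAGS.sub("", line).strip()  # drop the markup, keep the visible text
        if not text:
            continue
        match = PRICE.search(text)
        if match and price is None:
            price = match.group(0)
        elif not match and name is None:
            name = text  # naive stand-in for morphological analysis
    return {"name": name, "price": price}


block = ["<li>", "<p>Tea cup</p>", "<p>1,050 yen (tax included)</p>", "</li>"]
print(extract_name_and_price(block))  # {'name': 'Tea cup', 'price': '1,050 yen'}
```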

Then, the filtering unit 38 determines whether or not the one or more types of information predetermined according to the information type are all available in the desired information extracted from the target block by the information extraction unit 37 (step S9). Here, it is determined whether or not both the product name and the price, which are predetermined according to the information type of product price information, are available.

Here, when the necessary information is not all available in the desired information extracted from the target block, the process proceeds to step S13. On the other hand, if the necessary information is available, the filtering unit 38 outputs the desired information (product name and price) extracted from the target block to the display unit 40 (step S10).

Thereafter, the keyword setting unit 32 extracts words (words other than the product name and price) included in the target block extracted by the block extraction unit 36, additionally sets them as keywords, and stores them in the keyword storage unit 33 (step S11). The keyword setting unit 32 then calculates keyword points corresponding to the keywords stored in the keyword storage unit 33, including the newly set keywords, and stores the keyword points in the keyword storage unit 33 (step S12).
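The check in step S9 amounts to verifying that every required field is present. A minimal sketch follows, assuming the extracted information is held in a dictionary. The table `REQUIRED_FIELDS`, its key, and the field names are hypothetical.

```python
# Hypothetical table: which fields each information type requires.
REQUIRED_FIELDS = {"product price information": ("name", "price")}


def passes_filter(extracted, information_type):
    """Return True only if every required field is present and non-empty."""
    return all(extracted.get(field) for field in REQUIRED_FIELDS[information_type])


complete = {"name": "Tea cup", "price": "1,050 yen"}
partial = {"name": None, "price": "840 yen"}
print(passes_filter(complete, "product price information"))  # True
print(passes_filter(partial, "product price information"))   # False
```

A block for which the check fails would be skipped (the flow proceeds to step S13) rather than output.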

Finally, the depth point assigning unit 34 determines whether or not all the plurality of web pages stored in the web page storage unit 20 have been processed (step S13). When the process is completed for all web pages, the process of the flowchart illustrated in FIG. 7 is terminated. On the other hand, if the processing has not been completed for all the web pages, the process returns to step S3, another web page is acquired, and the same processing as described above is repeated.

If a new keyword has been additionally set in steps S11 and S12 before the process returns to step S3 to acquire another web page, then even if the keyword “tax included” entered by the user is not included in the other web page, the other web page becomes a target of the desired information extraction process (steps S5 to S10) and the keyword learning process (steps S11 to S12) as long as the additionally set keyword is included in it.

In the flowchart shown in FIG. 7, the processing is terminated when the processing of the plurality of web pages stored in the web page storage unit 20 is completed, but the present invention is not limited to this example. For example, the processing of a plurality of web pages stored in the web page storage unit 20 may be performed a plurality of times automatically or through the operation of the operation unit 50 by the user.

As described above, as the analysis processing of a plurality of web pages proceeds, keywords gradually increase due to the learning effect. Therefore, a web page that has not been subjected to analysis processing in the first round (determined that no keyword is included in step S4) may be subject to analysis processing in the second and subsequent rounds. As a result, there is an advantage that the possibility that desired information can be extracted from more web pages is increased. In the second and subsequent rounds, if the analysis is performed only on web pages that have not been subjected to analysis processing, the processing efficiency can be improved.
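The multi-round behavior can be sketched as a loop that re-examines only the pages skipped in step S4. Everything here is illustrative: `run_rounds`, `max_rounds`, and especially `toy_analyze`, which is a hypothetical stand-in for steps S5 to S10 that treats the whole page as the desired information and always learns the word "cart".

```python
def run_rounds(pages, initial_keywords, analyze, max_rounds=3):
    """Analyze pages in rounds, retrying only the pages that had no keyword."""
    keywords = set(initial_keywords)
    pending = list(pages)
    results = []
    for _ in range(max_rounds):
        skipped = []
        for page in pending:
            if not any(kw in page for kw in keywords):  # step S4: no keyword
                skipped.append(page)
                continue
            info, new_words = analyze(page, keywords)   # steps S5 to S10
            if info is not None:
                results.append(info)
                keywords.update(new_words)              # steps S11 to S12
        if len(skipped) == len(pending):
            break  # nothing could be analyzed in this round, so stop
        pending = skipped
    return results, keywords


def toy_analyze(page, keywords):
    # Hypothetical stand-in: the page itself is the "desired information".
    return page, {"cart"}


pages = ["mug 840 yen cart", "tea cup 1,050 yen tax included cart"]
results, learned = run_rounds(pages, {"tax included"}, toy_analyze)
print(results)  # ['tea cup 1,050 yen tax included cart', 'mug 840 yen cart']
```

In this run the first page is skipped in round one because it lacks "tax included", and is picked up in round two once "cart" has been learned from the second page.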

As described above in detail, in this embodiment, the hierarchical structure of the tags of the HTML document constituting the web page is analyzed, a depth point corresponding to the depth from the root is given for each line, and, for a line including a keyword, a keyword point corresponding to the keyword is added to the depth point. Based on the depth point and the keyword point, a predetermined range before and after the line including the keyword is extracted as a target block, and information satisfying a predetermined condition (for example, a product name and price) is extracted from the extracted target block.

According to the present embodiment configured as described above, in the HTML document constituting the web page, a set of lines within a predetermined range at a short distance from the line including the keyword is extracted as a target block, and desired information satisfying a predetermined condition is extracted from the target block. For example, a product name and a price are extracted as desired information that satisfies a predetermined condition using morphological analysis or regular expressions.

Therefore, if the conditions necessary for extracting desired information according to the information type (product price information, company information, job offer information, etc.) are set as the predetermined conditions, desired information, which is often posted at a short distance from the keyword, can be automatically extracted from a web page. Thereby, desired information can be efficiently extracted from many web pages on websites on the Internet.

In the present embodiment, the block extraction unit 36 is provided so that, instead of simply extracting information satisfying a predetermined condition from the whole web page, a target block is first extracted from the web page and information satisfying the predetermined condition is extracted only from within the target block. Therefore, for example, even if a product name and a price both exist in one web page, those that are structurally distant from each other and poorly related to each other can be excluded from extraction as not being the desired information.

If a product name and price are described for one product, they are usually arranged at a short distance as shown in FIG. In the present embodiment, since the product name and price can be extracted only for such a case, extraction of noise that is not desired information can be reduced.

Further, in this embodiment, the filtering unit 38 is provided so that the product name and price extracted from the target block are not output as desired information unless both are available. Accordingly, a case where only one of the product name and the price happens to be in the target block can be excluded as not corresponding to the desired information. Therefore, it is possible to reduce the extraction of noise that is not desired information.

Further, in the present embodiment, the keyword setting unit 32 is provided so that the keyword used for extracting the target block can be variably set by learning. The keyword to be variably set is extracted from the target block. That is, in the present embodiment, an existing keyword included in the target block or a word that is close to desired information can be additionally set as a keyword.

If a keyword is used in a fixed manner, the extraction accuracy of desired information depends largely on the quality of that keyword. On the other hand, according to the present embodiment, although a predetermined keyword must be set at first, as the analysis of web pages proceeds, words suitable for extracting desired information (words actually used in the vicinity of desired information in a plurality of web pages existing on the Internet) are sequentially added as keywords. Thereby, the extraction accuracy of desired information can be increased.

In the above embodiment, as an example of a keyword point calculation method, an example has been described in which a value corresponding to the ratio of the number of appearances of a word to the total number of target blocks is calculated and set as the keyword point. However, the present invention is not limited to this. For example, the keyword points may be calculated in consideration of the structural distance from the desired information to the word. Specifically, a method is conceivable in which the keyword point is multiplied by a coefficient so that a word having a shorter structural distance from the desired information has a larger keyword point. The “structural distance” referred to here may be, for example, a difference in the number of lines or a difference in the number of layers. Alternatively, it may be the degree of kinship obtained when the hierarchical tree structure is viewed as a family tree.

In the above embodiment, an example has been described in which keywords are learned only when the information extracted from the target block by the information extraction unit 37 includes all the information required for the information type. However, keywords may also be learned when the necessary information is not all available. Nevertheless, since the filtering unit 38 performs control so that desired information is output only when the necessary information is available, it is preferable to perform keyword learning only when the necessary information is available.

In the above embodiment, when the keyword point calculated by the keyword setting unit 32 is equal to or less than a threshold value, the keyword point may be set to “0”. Even if the frequency of appearance of a word set as a keyword is very low and its keyword point is small, if the keyword point is used as it is, a slight difference from the depth point is still generated in a line containing that word.
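The distance-weighted variant could take many forms; the text does not specify the coefficient. As one hypothetical choice, the sketch below uses a coefficient that shrinks as the line distance grows, so that nearer words keep more of their point. The function name, the `decay` parameter, and the formula itself are assumptions.

```python
def weighted_keyword_point(base_point, line_distance, decay=0.5):
    """Scale a keyword point by a coefficient that shrinks with structural distance."""
    coefficient = 1.0 / (1.0 + decay * line_distance)  # hypothetical coefficient
    return base_point * coefficient


print(weighted_keyword_point(2.0, 0))  # 2.0 (word on the same line as the desired information)
print(weighted_keyword_point(2.0, 2))  # 1.0 (word two lines away)
```

The same function could be fed a difference in the number of layers instead of lines, since only a non-negative distance is required.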

Therefore, a predetermined range including a line having only a slight difference is also an extraction target of the target block. In this case, since there is a high possibility that desired information is not included in the target block, the extracted target block itself may become noise. On the other hand, if the values are all rounded to “0” when the keyword points are equal to or less than the threshold value, the extraction of the target block that becomes noise can be reduced, and the processing efficiency can be improved.
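The rounding described above is a simple thresholding step. In this sketch the word "banner", its point of 0.05, and the threshold of 0.1 are hypothetical values chosen for illustration; the other two points are the example values given for FIG. 3.

```python
def apply_threshold(keyword_points, threshold):
    """Round every keyword point that is at or below the threshold down to 0."""
    return {
        word: (0.0 if point <= threshold else point)
        for word, point in keyword_points.items()
    }


points = {"special price": 2.31, "cart": 2.02, "banner": 0.05}
print(apply_threshold(points, 0.1))
# {'special price': 2.31, 'cart': 2.02, 'banner': 0.0}
```

A line containing only a zeroed keyword then shows no difference from its depth point, so it no longer triggers block extraction.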

In the above embodiment, the method of extracting the target block by the block extraction unit 36 is shown as an example, but the present invention is not limited to this. For example, a range including the line in which a difference from the depth point is generated by adding keyword points, a predetermined number of lines before that line, and a predetermined number of lines after that line may be extracted as a target block. However, according to the extraction method of the above embodiment, there is a high possibility that the target block can be set to an accurate range without excess or deficiency, so that processing efficiency can be increased while omissions in the extraction of desired information are reduced.
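The fixed-window alternative mentioned here is straightforward to sketch. The function name and the default window sizes are assumptions; a keyword line is detected, as elsewhere in the description, by a difference between the total point and the depth point.

```python
def extract_fixed_window(depth_pts, total_pts, before=2, after=2):
    """Return (start, end) index pairs around each line that received a keyword point."""
    last = len(depth_pts) - 1
    return [
        (max(i - before, 0), min(i + after, last))  # clamp to the document bounds
        for i, (depth, total) in enumerate(zip(depth_pts, total_pts))
        if total != depth
    ]


depth_pts = [1, 2, 3, 3, 2, 2, 3, 3, 2, 1]
total_pts = [1, 2, 3, 4.0, 2, 2, 3, 3, 2, 1]  # keyword point added on line 3
print(extract_fixed_window(depth_pts, total_pts, before=2, after=1))  # [(1, 4)]
```

Unlike the depth-based method, the window size here does not adapt to the page structure, which is the trade-off the paragraph above points out.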

In addition, each of the above-described embodiments is merely an example of implementation in carrying out the present invention, and the technical scope of the present invention should not be construed in a limited manner. That is, the present invention can be implemented in various forms without departing from the gist or the main features thereof.

Claims

A web page analysis apparatus comprising:
a depth point assigning unit that analyzes the hierarchical structure of the tags of the structured document constituting the web page and assigns a depth point corresponding to the depth from the root for each line of the structured document;
a keyword point assigning unit that adds, for a line including a keyword in the structured document, a keyword point corresponding to the keyword to the depth point;
a block extraction unit that extracts, as a target block, a predetermined range including a line in which a difference from the depth point is generated by the addition of the keyword point and several lines before and after that line; and
an information extraction unit that extracts information satisfying a predetermined condition from the target block extracted by the block extraction unit.
The web page analysis apparatus according to claim 1, wherein the block extraction unit extracts, as the target block, a range that includes the line in which a difference from the depth point is generated by the addition of the keyword point, that has as its end point the first line, among the lines after the line in which the difference is generated, to which the keyword point is not added and at which the depth point is minimal, and that has as its start point a line, among the lines before the line in which the difference is generated, whose depth point has the same value as that of the end point and is minimal.
The web page analysis apparatus according to claim 1, further comprising a keyword setting unit that sets a word input by a user as the keyword.
The web page analysis apparatus according to claim 3, wherein the keyword setting unit additionally sets a word included in the target block extracted by the block extraction unit as the keyword.
The web page analysis apparatus according to claim 3, wherein the keyword setting unit uses the total number of the target blocks extracted by the block extraction unit from the web pages to be analyzed and the number of appearances of the word set as the keyword, and sets, as the keyword point for the keyword, a value corresponding to the ratio of the number of appearances of the word to the total number of the target blocks.
The web page analysis apparatus according to claim 1, further comprising a filtering unit that determines, for the information extracted from the target block by the information extraction unit, whether or not one or more predetermined types of information are all available, and outputs the information extracted from the target block only when they are available.
A computer-readable web page analysis program for causing a computer to function as:
depth point assigning means for analyzing the hierarchical structure of the tags of the structured document constituting the web page and assigning a depth point corresponding to the depth from the root for each line of the structured document;
keyword point assigning means for adding, for a line including a keyword in the structured document, a keyword point corresponding to the keyword to the depth point;
block extraction means for extracting, as a target block, a predetermined range including a line in which a difference from the depth point is generated by the addition of the keyword point and several lines before and after that line; and
information extracting means for extracting information satisfying a predetermined condition from within the target block extracted by the block extraction means.