WO2010073592A1 - Information estimation device, information estimation method, and computer-readable recording medium - Google Patents
- Publication number
- WO2010073592A1 (application PCT/JP2009/007072, JP2009007072W)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- document
- transmission time
- specified
- group
- time point
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/955—Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
- G06F16/9558—Details of hyperlinks; Management of linked annotations
Definitions
- The present invention relates to an information estimation device, an information estimation method, and a computer-readable recording medium.
- Because the information provided by web pages is miscellaneous, it is necessary to judge the correctness of the information.
- Information such as the transmission date and transmission time of content such as a web page is useful.
- Patent Document 1 proposes a method of presenting to the user when content was uploaded, even when the creation date of the content is not explicitly written in the web page.
- In the method of Patent Document 1, a user first designates a web page in which information on updated pages is collected in a list. Link information to the updated pages is then acquired from the designated web page. The designated web page is referred to periodically, and its previous version is compared with its current version. If the comparison reveals a new difference in the link information to updated pages, the date on which the comparison was made is taken as the creation date of the linked page.
- Non-Patent Document 1 discloses a method for estimating the transmission date of a web page whose transmission date is unknown by using web pages whose transmission dates are already known. Specifically, document clustering is first performed on web pages with similar timing and content, based on the words in each page, and it is then determined into which cluster the web page with the unknown transmission date should be classified. The transmission date of that web page is then estimated from the transmission dates of the web pages in that cluster.
- Patent Document 1 and Non-Patent Document 1 have the following problems.
- In Non-Patent Document 1, the transmission date of a web page whose transmission date is unknown is estimated using web pages whose transmission dates are known. For this reason, it is not necessary to designate a web page that lists the updated pages.
- However, because Non-Patent Document 1 estimates the transmission date based on the words in the web page, estimation fails if the tendency of word appearance differs between web pages. That is, if the words used in each web page differ, the page cannot be classified into the cluster to which it should belong, and its transmission date cannot be estimated correctly.
- An object of the present invention is to solve the above problems and to provide an information estimation apparatus, an information estimation method, and a computer-readable recording medium that can estimate the transmission time point of content even when neither a transmission date nor a time expression is explicitly described in the document constituting the content.
- An information estimation apparatus according to the present invention estimates the transmission time point of a document whose transmission time point is not specified in a document set to be analyzed, and includes the following units.
- A structure analysis unit that specifies, from the document set, a document having a document structure in which link relations to other documents are shown in table-of-contents form, and that extracts the link relations of the documents included in the document set from the document structure of the specified document.
- A grouping unit that sets a group of documents using the document specified by the structure analysis unit and the link relations extracted by the structure analysis unit.
- An estimation unit that estimates the transmission time point of a document in the group whose transmission time point is not specified, based on the group set by the grouping unit and the transmission time point of a document in the group whose transmission time point is specified.
- The information estimation method of the present invention estimates the transmission time point of a document whose transmission time point is not specified in the document set to be analyzed, and includes the following steps.
- (a) Specifying, from the document set, a document having a document structure in which link relations to other documents are shown in table-of-contents form, and extracting the link relations of the documents included in the document set from the document structure of the specified document.
- (b) Setting a group of documents using the document specified in step (a) and the link relations extracted in step (a).
- (c) Estimating the transmission time point of a document in the group whose transmission time point is not specified, based on the group set in step (b) and the transmission time point of a document in the group whose transmission time point is specified.
- The computer-readable recording medium of the present invention records a program for causing a computer to estimate the transmission time point of a document whose transmission time point is not specified in the document set to be analyzed. The program causes the computer to execute the following steps.
- (a) Specifying, from the document set, a document having a document structure in which link relations to other documents are shown in table-of-contents form, and extracting the link relations of the documents included in the document set from the document structure of the specified document.
- (b) Setting a group of documents using the document specified in step (a) and the link relations extracted in step (a).
- (c) Estimating the transmission time point of a document in the group whose transmission time point is not specified, based on the group set in step (b) and the transmission time point of a document in the group whose transmission time point is specified.
- As described above, according to the information estimation device, the information estimation method, and the computer-readable recording medium of the present invention, the transmission time point of content can be estimated even when neither a transmission date nor a time expression is explicitly described in the document that constitutes the content.
- FIG. 1 is a block diagram showing a schematic configuration of an information estimation apparatus according to an embodiment of the present invention.
- FIG. 2 is a diagram showing the link relationship in the document set to be analyzed.
- FIG. 3 is a flowchart showing a flow of processing in the information estimation method according to the embodiment of the present invention.
- FIG. 4 is a diagram showing a result of determination as to whether or not the transmission time point of each document indicated by the document ID is specified.
- FIG. 5 is a diagram showing a link source and a link destination in the link relationship shown in FIG.
- FIG. 6 is a diagram showing an example of a document structure in which a link relation to another document in an arbitrary document is shown in a table of contents.
- FIG. 7 is a diagram showing an example of a document structure in which a link relation to another document in an arbitrary document is shown in a table of contents.
- FIG. 8 is a diagram illustrating an example of group setting.
- FIG. 9 is a diagram illustrating a result of the estimation process.
- The information estimation apparatus 1 shown in FIG. 1 estimates the transmission time point of a document whose transmission time point is not specified in the document set to be analyzed.
- The information estimation apparatus 1 includes a structure analysis unit 3, a grouping unit 4, and an estimation unit 5.
- In the document set, the transmission time point is specified for some documents.
- The structure analysis unit 3 specifies, from the document set to be analyzed, a document having a document structure in which link relations to other documents are shown in table-of-contents form, and extracts the link relations (see FIG. 2) of the documents included in the document set from the document structure of the specified document.
- The document structure is information describing the logical structure of a document.
- Examples of the logical document structure include structures made up of components such as an outline, title, chapter, and section. If analysis of a document's structure shows that these components reside in other documents, the document can be specified as one whose link relations to other documents are shown in table-of-contents form.
- Using this document structure, the structure analysis unit 3 can extract link relations that are candidates for a group with the same transmission time point.
- Link relations indicating group candidates are extracted from such a document structure for the following reason: if the logical components of a document form a single structure across multiple documents, those documents were very likely transmitted at the same time.
- Using the link relations, it is therefore possible to specify a set of documents transmitted at the same time and to estimate the transmission time point of each document. For example, the logical components of a document may extend over multiple web pages. These web pages are likely to have been transmitted at the same time, so the transmission time point of one web page can be used to estimate that of the others.
- FIG. 2 shows the link relations as a graph structure in which each document is a node and each link is an edge.
- The direction of each arrow indicates that a hyperlink runs from the link source to the link destination.
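A minimal sketch of such a graph, assuming Python and an adjacency map keyed by link-source document ID. The edge list is illustrative sample data chosen to match the document (0) to document (6) example in the text, and the links from document (4) are an assumption. The names `LINKS` and `build_link_graph` are hypothetical.

```python
from collections import defaultdict

# Hypothetical (link source, link destination) pairs for documents (0)..(6).
# The (4, 5) and (4, 6) edges are assumed for illustration.
LINKS = [(0, 1), (0, 4), (1, 2), (1, 3), (4, 5), (4, 6)]


def build_link_graph(links):
    """Return an adjacency map: link-source ID -> list of link-destination IDs."""
    graph = defaultdict(list)
    for source, destination in links:
        graph[source].append(destination)
    return dict(graph)


graph = build_link_graph(LINKS)
```

Documents that are only link destinations, such as document (2), do not appear as keys in the map.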
- The grouping unit 4 sets a group that includes documents whose transmission time points are not specified, using the document specified by the structure analysis unit 3 and the link relations extracted by the structure analysis unit 3.
- The grouping unit 4 may set one group or more than one group.
- The estimation unit 5 estimates the transmission time point of a document in the group whose transmission time point is not specified.
- The information estimation apparatus 1 can therefore estimate when content was transmitted even when neither a transmission date nor a time expression is explicitly described in the document that constitutes the content.
- The reason is that the information estimation apparatus 1 can estimate, based on the link relations from documents whose transmission time points can be specified, a set (group) of documents considered to have been transmitted at the same time.
- The information estimation apparatus 1 in the present embodiment will now be described more specifically.
- The information estimation apparatus 1 in the present embodiment is realized by a computer that operates under program control, as described later.
- The information estimation apparatus 1 further includes a reference time point determination unit 2 and an input receiving unit 6.
- The input receiving unit 6 receives information input from an external input device.
- In the following description, document IDs are written in parentheses, for example document (0) and document (1).
- A storage device 10, an input device 20, and an output device 30 are connected to the information estimation apparatus 1.
- The input device 20 inputs the set of documents to be analyzed and instructions to the information estimation apparatus 1.
- Examples of the input device 20 include input devices such as a keyboard and a mouse, and another computer connected via a network.
- The output device 30 notifies the outside of the estimation result produced by the estimation unit 5. Examples of the output device 30 include a display device and a printing device.
- The "transmission time point" used in this specification is time information about when certain content was transmitted.
- The time information is, for example, date information such as the year, month, and day.
- The transmission time point may be time information on when the content was updated, such as an update date, or on when the content was created, such as a creation date.
- The transmission time point needs to have each element of the date.
- The transmission time point may include elements such as hour, minute, and second in addition to the date.
- The "document" used in this specification includes all information that can be read and stored by a data processing apparatus such as a computer.
- Examples of the document include a web page, a file, and a combination of files.
- "Content" used in this specification means a coherent unit of the information contained in a document. That is, a document may be made up of one content or of a plurality of contents.
- For example, a web page indicated by a certain URL may include a plurality of articles, each with a different transmission date. In this case, the web page can be interpreted as a document and each article as one content.
- The document set received by the input receiving unit 6, that is, the document set to be analyzed, is stored in the document storage unit 11 of the storage device 10.
- A set of documents to be analyzed may also be collected in advance and stored in the document storage unit 11.
- Alternatively, the information estimation apparatus 1 can start processing from part of the document set, determine the link destinations of those documents, collect further documents as necessary, and store the newly collected documents in the document storage unit 11.
- When the document set to be analyzed consists of web pages, it may be limited, for example, to web pages whose URLs belong to a specific domain name or whose URLs contain a specific directory path. The reason is that a set of web pages made up of contents created at the same transmission time point often shares the same domain name or a common directory path. Such a restriction can improve estimation accuracy and shorten processing time by reducing the number of documents to be processed. Processing may also be performed without such a restriction.
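The domain and directory-path restriction could be sketched as follows, using Python's standard `urllib.parse`. The function name `restrict_document_set` and the sample URLs are hypothetical and serve only as an illustration of the idea.

```python
from urllib.parse import urlparse


def restrict_document_set(urls, domain, path_prefix="/"):
    """Keep only URLs on the given host whose path starts with path_prefix."""
    kept = []
    for url in urls:
        parts = urlparse(url)
        if parts.hostname == domain and parts.path.startswith(path_prefix):
            kept.append(url)
    return kept


# Hypothetical sample URLs.
urls = [
    "http://example.com/report/index.html",
    "http://example.com/report/ch1.html",
    "http://example.com/news/today.html",
    "http://other.example.org/report/ch1.html",
]
selected = restrict_document_set(urls, "example.com", "/report/")
```

Here only the two pages under `/report/` on `example.com` remain in the set to be analyzed.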
- When the documents are web pages as described above, the structure analysis unit 3 can specify a document having the above document structure by using hyperlinks together with at least one of the HTML tags and the DOM tree subtrees described in the web page.
- For SGML documents, the structure analysis unit 3 extracts link relations by using the url tag together with at least one of the SGML tags and the tag structure.
- For XML documents, the structure analysis unit 3 extracts link relations by using link information such as XLink together with at least one of the XML tags and the subtrees of the XML DOM tree.
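As one possible illustration for the HTML case, the sketch below collects hyperlinks that appear inside LI or TD elements, using Python's standard `html.parser`. The class name `TocLinkExtractor`, the choice of container tags, and the sample markup are assumptions. The sketch also assumes well-formed markup with explicit closing tags.

```python
from html.parser import HTMLParser


class TocLinkExtractor(HTMLParser):
    """Collect (href, anchor text) for <a> elements nested inside <li> or <td>."""

    CONTAINERS = {"li", "td"}

    def __init__(self):
        super().__init__()
        self.depth = 0            # number of LI/TD containers currently open
        self.current_href = None  # href of the anchor being read, if any
        self.current_text = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag in self.CONTAINERS:
            self.depth += 1
        elif tag == "a" and self.depth > 0:
            self.current_href = dict(attrs).get("href")
            self.current_text = []

    def handle_data(self, data):
        if self.current_href is not None:
            self.current_text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.current_href is not None:
            text = "".join(self.current_text).strip()
            self.links.append((self.current_href, text))
            self.current_href = None
        elif tag in self.CONTAINERS and self.depth > 0:
            self.depth -= 1


# Hypothetical markup: a table-of-contents list plus an unrelated link.
html = """
<ul>
  <li><a href="doc1.html">chapter 1</a></li>
  <li><a href="doc4.html">chapter 2</a></li>
</ul>
<p><a href="about.html">about</a></p>
"""
parser = TocLinkExtractor()
parser.feed(html)
toc_links = parser.links
```

The link in the plain paragraph is ignored because it is not inside a list item or table cell.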
- The grouping unit 4 can set a group by combining a document whose transmission time point is specified with documents that are linked to it and whose transmission time points are not specified. When a document whose transmission time point is not specified is linked to a plurality of documents whose transmission time points are specified, the grouping unit 4 groups it with the document having the older transmission time point. This allows the transmission time point to be estimated more accurately. In general, documents are connected by various kinds of logical relationships, so multiple groups can be set and a document may fall into more than one of them. However, a logical relationship set later is likely to cite a document from the document set of a logical relationship set earlier.
- In the example of FIG. 2, the grouping unit 4 sets one group for the document (0), one group for the document (1), the document (2), and the document (3), and one group for the document (4), the document (5), and the document (6).
- In each group, the estimation unit 5 can take the transmission time point of the document whose transmission time point is specified as the estimate for the documents in the group whose transmission time points are not specified. In the example of FIG. 2, the estimation unit 5 estimates the transmission time points of the document (2) and the document (3) to be that of the document (1). Similarly, it estimates the transmission time points of the document (5) and the document (6) to be that of the document (4).
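The estimation step can be sketched as below, assuming groups are given as lists of document IDs and known transmission time points as a dictionary. The function name and the dates for documents (1) and (4) are hypothetical. Only the February 10, 2000 date for document (0) comes from the text.

```python
def estimate_transmission_times(groups, known_times):
    """Assign each group's known transmission time to its undated members."""
    estimated = {}
    for members in groups.values():
        dated = [doc for doc in members if doc in known_times]
        if not dated:
            continue  # no reference document in this group
        reference = known_times[dated[0]]
        for doc in members:
            if doc not in known_times:
                estimated[doc] = reference
    return estimated


# Group IDs map to member document IDs.
groups = {0: [1, 2, 3], 1: [4, 5, 6], 2: [0]}
# Dates for documents (1) and (4) are made up for illustration.
known_times = {0: "2000-02-10", 1: "2000-01-05", 4: "2000-01-20"}
estimated = estimate_transmission_times(groups, known_times)
```

With this sample data, documents (2) and (3) receive the date of document (1), and documents (5) and (6) receive the date of document (4).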
- FIG. 3 is a flowchart showing a flow of processing in the information estimation method according to the embodiment of the present invention.
- The information estimation method is implemented by operating the information estimation apparatus 1 shown in FIG. 1. Therefore, the flow of processing in the information estimation method is described below together with the operation of the information estimation apparatus 1, with reference to FIGS. 1 and 2 as appropriate.
- First, the reference time point determination unit 2 retrieves the set of documents to be analyzed from the document storage unit 11 and determines, for each document in the set, whether its transmission time point is specified (step A1).
- The reference time point determination unit 2 then inputs information indicating which documents have a specified transmission time point to the structure analysis unit 3 and the grouping unit 4.
- Next, the structure analysis unit 3 specifies, from the document set, a document having a document structure in which link relations to other documents are shown in table-of-contents form, and extracts the link relations (see FIG. 2) of the documents included in the document set from the document structure of the specified document (step A2).
- Next, the grouping unit 4 sets a group of documents, including documents whose transmission time points are not specified, using the document specified in step A2 and the link relations extracted in step A2 (step A3). Specifically, the grouping unit 4 combines a document whose transmission time point is specified with documents that are linked to it and whose transmission time points are not specified.
- Next, the estimation unit 5 estimates the transmission time points of the documents in each group whose transmission time points are not specified (step A4). Specifically, in each group, the estimation unit 5 sets the transmission time point of the document whose transmission time point is specified as that of the documents whose transmission time points are not specified.
- The documents whose transmission time points have been estimated are output to the output device 30 and notified to the user.
- Thus, with the information estimation method in the present embodiment, it is possible to estimate when content was transmitted even when neither a transmission date nor a time expression is explicitly described in the document constituting the content.
- The program in the embodiment of the present invention may be any program that includes instructions causing a computer to execute steps A1 to A4 shown in FIG. 3. When the program is installed in a computer and executed, the information estimation apparatus in the present embodiment is realized and the information estimation method in the present embodiment is carried out.
- a CPU (central processing unit)
- the storage device 10 can also be realized by storing data files constituting these in a storage device such as a hard disk provided in the computer.
- The program according to the embodiment of the present invention is supplied either stored in a computer-readable recording medium, for example an optical disk, a magnetic disk, a magneto-optical disk, a semiconductor memory, or a floppy disk, or via a network.
- the examples described below correspond to the information estimation apparatus, information estimation method, and program in the above-described embodiment.
- a keyboard and a mouse are used as the input device 20.
- the information estimation apparatus 1 is implement
- As the storage device 10, a magnetic disk recording device provided in the above computer is used.
- a display device is used as the output device 30.
- The reference time point determination unit 2 determines, for the content of each document in the document set stored in the storage device 10, whether the transmission time point is known or unknown. If it is known, the reference time point determination unit 2 also identifies the transmission time point. A document determined to be known serves as a reference time point for estimating transmission time points in the subsequent processing.
- If transmission time points have been given to documents in advance, the reference time point determination unit 2 can determine that a document with a given transmission time point is known and that a document without one is unknown. Even if transmission time points have not been given in advance, the reference time point determination unit 2 can try to specify the transmission time point of each document, determining that a document for which it can be specified is known and that any other document is unknown.
- There are various methods, using existing technology, by which the reference time point determination unit 2 can specify the transmission time point.
- One specific method applies when the transmission time point of the content is explicitly described in the document: the transmission time point is specified from the described information.
- Another method specifies the transmission time point from information extracted from a date expression, a time expression, or a similar time-related expression in the document.
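A rough sketch of extracting a date expression from document text, assuming English dates of the form "Month D, YYYY". The pattern, the function name `find_transmission_date`, and the sample sentences are illustrative assumptions. Real documents would need many more date and time formats.

```python
import re
from datetime import date

MONTHS = {name: number for number, name in enumerate(
    ["January", "February", "March", "April", "May", "June", "July",
     "August", "September", "October", "November", "December"], start=1)}

# Matches expressions such as "February 10, 2000".
DATE_PATTERN = re.compile(
    r"\b(" + "|".join(MONTHS) + r")\s+(\d{1,2}),\s*(\d{4})\b")


def find_transmission_date(text):
    """Return the first 'Month D, YYYY' date found in text, or None."""
    match = DATE_PATTERN.search(text)
    if match is None:
        return None
    month, day, year = match.groups()
    try:
        return date(int(year), MONTHS[month], int(day))
    except ValueError:
        return None  # matched text is not a real calendar date
```

A document for which this returns `None` would be treated as having an unknown transmission time point, unless another method succeeds.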
- When feed information such as RSS is obtained separately for the target document, or when RDF (Resource Description Framework) information is described in the document, the reference time point determination unit 2 may specify the transmission time point from that information.
- A feed is a distribution format for websites and web pages, such as RSS (RDF Site Summary, Rich Site Summary, or Really Simple Syndication) and Atom.
- The reference time point determination unit 2 may also specify the transmission time point of a document from archiving-time information acquired when the web page is archived, for example through collection by a crawler, or from the response information returned by the web server hosting the target document.
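As one illustration of using web server response information, the sketch below reads an HTTP `Last-Modified` header with Python's standard `email.utils.parsedate_to_datetime`. Treating `Last-Modified` as the transmission time point is an assumption made here, not something the text prescribes. The function name is hypothetical, and the headers are assumed to be a plain dictionary with that exact key.

```python
from email.utils import parsedate_to_datetime


def transmission_time_from_headers(headers):
    """Return the Last-Modified header as a datetime, or None if unusable."""
    value = headers.get("Last-Modified")
    if value is None:
        return None
    try:
        return parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None  # header present but not a parseable HTTP date


# Hypothetical response headers.
headers = {"Last-Modified": "Thu, 10 Feb 2000 09:00:00 GMT"}
sent_at = transmission_time_from_headers(headers)
```

A missing or malformed header yields `None`, in which case the document would stay in the "unknown" category.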
- the document set to be analyzed includes documents (document (0) to document (8)) having document IDs “0” to “8”.
- the document ID is an identifier for distinguishing each document.
- the document ID may be indicated by a URL or the like.
- FIG. 4 is a diagram showing a result of determination as to whether or not the transmission time point of each document indicated by the document ID is specified. In FIG. 4, when the transmission time is known, the date is shown, and when it is unknown, information indicating unknown is shown.
- For example, the transmission date of the content of the document (0) is specified as "February 10, 2000", which indicates that it is known.
- The transmission date of the content of the document (2) is determined to be unknown, so "u", a flag indicating "unknown", is entered.
- the structure analysis unit 3 identifies a document having a document structure in which a link relation to another document is shown in a table of contents from a set of documents to be analyzed, and extracts the link relation.
- FIG. 5 is a diagram showing a link source and a link destination in the link relationship shown in FIG.
- the link relationship (see FIG. 2) is extracted from the document structure in which the link relationship to other documents in the document set is shown in a table of contents.
- the link relationship is specified by the correspondence between the link source document ID and the link destination document ID.
- FIGS. 6 and 7 each show an example of a document structure in which the link relations from a document to other documents are shown in table-of-contents form.
- The documents to be analyzed are web pages, that is, HTML documents.
- FIG. 6 shows part of the HTML of the document (0).
- FIG. 7 shows part of the HTML of the document (1).
- The document (0) contains a description of an itemized list structure using UL elements.
- The LI elements contain hyperlinks to the document (1) and the document (4), with character strings such as "chapter 1" and "chapter 2", which indicate part of the document's table of contents, as anchor text.
- The document (1) contains a description of a table structure using the TABLE element.
- The TD elements contain hyperlinks to the document (2) and the document (3), with character strings such as "section 1" and "section 2", which indicate part of the document's table of contents, as anchor text.
- One method of specifying the document structure is to check for patterns that are characteristic of the document structure.
- The determination can be made by combining a plurality of such patterns.
- The patterns may be combined to form a rule.
- When the document is data such as HTML or XML, examples of rule conditions include the condition that the document has an anchor element enclosed by a specific tag and the condition that the document has a partial structure indicated by a specific XPath.
- An XPath is written with a syntax such as "/td/a".
- A condition requiring a specific word or character string may also be placed on the anchor text, attribute name, or surrounding text node included in the specific document structure. For example, a character string such as "previous", "next", "last month", "next month", "previous issue", "next issue", ">>", "NEXT", or "Read more" indicates a high likelihood of being a component of a logical document structure.
- Each rule may be given a score or probability value that reflects how likely the matching documents are to be elements of a group at the same transmission time point.
- For example, a large number of patterns that may be characteristic of a document structure in which links to other documents are shown in table-of-contents form are listed as candidates, and a score is given to each pattern. The sum or product of the scores is then used, and when an acceptance condition such as a predetermined score threshold is satisfied, the link relation is determined to indicate a group candidate at the same transmission time point.
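The score-based acceptance described above might look like the following sketch, which uses the sum of scores. The pattern names, score values, and threshold are all hypothetical and would need tuning for a real document set.

```python
# Hypothetical pattern scores. Real values would be tuned per document set.
PATTERN_SCORES = {
    "anchor_in_li": 2.0,
    "anchor_in_td": 2.0,
    "anchor_text_has_chapter": 3.0,
    "anchor_text_has_next": 1.5,
}
SCORE_THRESHOLD = 4.0  # hypothetical acceptance threshold


def is_group_candidate(matched_patterns,
                       scores=PATTERN_SCORES,
                       threshold=SCORE_THRESHOLD):
    """Accept a link relation when the summed pattern scores reach the threshold."""
    total = sum(scores.get(pattern, 0.0) for pattern in matched_patterns)
    return total >= threshold
```

For instance, a link matching both `anchor_in_li` and `anchor_text_has_chapter` scores 5.0 and is accepted, while a link matching only `anchor_text_has_next` scores 1.5 and is rejected.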
- For an HTML document, such feature patterns can be generated comprehensively from arbitrary subtrees of the DOM tree and from the text and element information contained in those subtrees.
- Another method of specifying a document structure in which links to other documents are shown in table-of-contents form is to prepare in advance a training document set in which groups at the same transmission time point are specified. In this method, the link relations between documents in a group and the patterns that characterize the document structure around those links are taken from the training document set, and a known machine learning method is used to determine whether a given structure is such a document structure.
- Let event $C$ be the event that a certain document structure is correct, and let $P(C)$ be the probability of occurrence of event $C$.
- Let $P(X_i \mid C)$ be the conditional probability that a document structure feature pattern $X_i$ exists under the condition that event $C$ occurs. The probability of event $C$ given the feature patterns is then $P(C \mid X_1, \ldots, X_n) = \alpha \, P(C) \prod_{i=1}^{n} P(X_i \mid C)$, where $\alpha$ is a constant that depends on the probability $P(X_i)$ of occurrence of each event $X_i$.
- An event $C2$ in which a certain document structure is incorrect in the training document set can be modeled in the same way, so that $P(C2 \mid X_1, \ldots, X_n)$ is obtained.
- By applying a known maximum a posteriori (MAP) estimation method to $P(C \mid X_1, \ldots, X_n)$ and $P(C2 \mid X_1, \ldots, X_n)$, it is possible to determine whether or not the document structure indicates a group candidate at the same transmission time point. That is, when the document structure is judged more likely to indicate such a group candidate, the link relations of the portion corresponding to the document structure may be extracted as a group candidate at the same transmission time point.
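A minimal sketch of such a MAP decision, assuming a naive Bayes style model in which the posterior for each event is proportional to its prior times the product of the conditional probabilities of the observed patterns. The constant shared by both events cancels in the comparison, so it is omitted. All probabilities and pattern names below are made up, and the function names are hypothetical.

```python
import math


def log_score(prior, likelihoods, observed):
    """log P(event) + sum of log P(X_i | event) over the observed patterns."""
    return math.log(prior) + sum(math.log(likelihoods[x]) for x in observed)


def indicates_group_candidate(observed, model_c, model_c2):
    """MAP decision: True when event C (correct structure) scores above C2."""
    prior_c, likelihoods_c = model_c
    prior_c2, likelihoods_c2 = model_c2
    return (log_score(prior_c, likelihoods_c, observed)
            > log_score(prior_c2, likelihoods_c2, observed))


# Hypothetical probabilities, as if estimated from a training document set.
MODEL_C = (0.3, {"anchor_in_li": 0.8, "has_chapter": 0.7})
MODEL_C2 = (0.7, {"anchor_in_li": 0.3, "has_chapter": 0.05})

decision = indicates_group_candidate(
    ["anchor_in_li", "has_chapter"], MODEL_C, MODEL_C2)
```

Working in log space avoids numerical underflow when many feature patterns are multiplied together. This sketch considers only patterns that are present. A fuller model could also account for absent patterns.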
- The grouping unit 4 sets document groups using the documents whose content transmission time points were specified by the reference time point determination unit 2, in addition to the document specified by the structure analysis unit 3 and the link relations it extracted. In doing so, the grouping unit 4 sets groups of documents estimated to have the same transmission time point so that the content transmission time points do not overlap.
- First, a document specified by the structure analysis unit 3 as having a document structure in which link relations to other documents are shown in table-of-contents form is set as the initial element of a group. Then documents that have a link relation with that document and are candidates for the same transmission time point are extracted and added to the group, and the group is set.
- If a document to be added to the group already has a specified transmission time point, it is not added.
- If a document to be added has an unknown transmission time point and is found to overlap with another group, it is added preferentially to the group with the older transmission time point.
- FIG. 8 is a diagram illustrating an example of group setting.
- Groups sharing the same transmission time point are identified by a group ID.
- Document (1), document (2), and document (3) have the same group ID "0" and therefore form one group. The same applies to group ID "1" and group ID "2".
- A candidate group consists of a link-source document, identified by its link-source document ID, and the set of link-destination documents associated with that link-source document ID.
- The link-source documents are checked, and among those whose transmission time point is known, the following processing is executed in order from the oldest transmission time point.
- The document with the oldest transmission time point shown in FIG. 4 is document (1), so a candidate group containing document (1) is generated first. A candidate group whose link source is document (2), which has the next-oldest transmission time point, is then generated in the same way.
- Document (0) is a link-source document with document (1) and document (4) as link destinations, but because the transmission time points of document (1) and document (4) are known, they are not added to the group of document (0).
- The link-source document IDs shown in the figure are identified, and a group is generated from the link-destination documents identified for each of them.
- In this procedure, a document may also be a candidate for a group with a different transmission time point. When a document would be duplicated across groups, it is included preferentially in the group of documents with the older transmission time point.
- For example, document (1) and document (4) are first considered as group elements based on document (0).
- However, document (1) and document (4) have transmission time points older than that of document (0), and each belongs to a group different from that of document (0). They are therefore not added to the group of document (0).
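The grouping rules above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `build_groups`, the dictionary layout, and the sample document IDs and dates are assumptions and do not reproduce the contents of FIG. 4 or FIG. 8. Link sources with a known transmission time point are processed oldest first, documents with a known time point are never added to another document's group, and a document already claimed by an older group stays there.

```python
def build_groups(links, known_times):
    """Assign a group ID to each document.

    links: {source_id: [destination_id, ...]} extracted from TOC-like structures.
    known_times: {doc_id: time}, with times that sort chronologically
    (ISO date strings are used here).
    Returns {doc_id: group_id}.
    """
    group_of = {}
    # Link sources whose transmission time point is known, oldest first.
    sources = sorted(
        (s for s in links if s in known_times),
        key=lambda s: known_times[s],
    )
    for group_id, source in enumerate(sources):
        group_of[source] = group_id  # the link source is the initial element
        for dest in links[source]:
            if dest in known_times:
                continue  # documents with a known time point are not added
            if dest in group_of:
                continue  # already in an older group, which takes precedence
            group_of[dest] = group_id
    return group_of


# Hypothetical data for illustration only.
links = {0: [1, 4], 1: [3, 5], 2: [5, 6]}
known_times = {0: "2008-12-10", 1: "2008-12-01", 2: "2008-12-05", 4: "2008-12-03"}
groups = build_groups(links, known_times)
```

In this sample, document 5 is linked from both document 1 and document 2 and ends up in the group of document 1, whose transmission time point is older.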
- The estimation unit 5 estimates the transmission time point of documents whose transmission time point is unknown, based on the groups set by the grouping unit 4 and the documents whose transmission time point is known.
- In this example, for each group generated by the grouping unit 4, the estimation unit 5 takes the document in the group whose transmission time point is known and assigns that time point to the documents in the group whose transmission time point is unknown.
- In this case, FIG. 4 is updated as shown in FIG. 9, using the documents with known transmission time points in FIG. 4 and the groups shown in FIG. 8.
- FIG. 9 is a diagram illustrating the result of the estimation process.
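The in-group estimation step can be sketched as below. The function name `estimate_in_groups` and the sample data are hypothetical. Taking the earliest known time point when a group happens to contain several known documents is an assumption made for this sketch; the text describes groups built around a single known document.

```python
def estimate_in_groups(group_of, known_times):
    """Copy each group's known transmission time point to its unknown members.

    group_of: {doc_id: group_id}
    known_times: {doc_id: time}, with times that sort chronologically.
    Returns a new {doc_id: time} mapping containing known and estimated times.
    """
    # Reference time point per group (earliest known one, by assumption).
    group_time = {}
    for doc, group_id in group_of.items():
        if doc in known_times:
            t = known_times[doc]
            if group_id not in group_time or t < group_time[group_id]:
                group_time[group_id] = t

    estimated = dict(known_times)
    for doc, group_id in group_of.items():
        if doc not in estimated and group_id in group_time:
            estimated[doc] = group_time[group_id]
    return estimated


# Hypothetical data for illustration only.
group_of = {1: 0, 3: 0, 5: 0, 2: 1, 6: 1, 7: 9}
known = {1: "2008-12-01", 2: "2008-12-05"}
times = estimate_in_groups(group_of, known)
```

Document 7 sits in a group with no known member, so it is left without an estimate rather than being given a guessed value.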
- The transmission time point of documents not included in any group can be estimated as follows. First, the estimation unit 5 selects groups in order, starting from the group containing the document with the oldest transmission time point. Taking each document in the selected group as a starting point, it follows that document's link relations to documents outside the group. It then repeatedly follows the link relations from each document reached and identifies the link-destination documents. For each identified document, the estimation unit 5 determines whether the transmission time point is known or unknown; when a document with a known transmission time point is reached, the link relations beyond it are not followed.
- When the document reached has an unknown transmission time point, the estimation unit 5 applies the transmission time point of the starting document in the selected group to the document reached and takes it as the estimated transmission time point of that document.
- Links are followed in order from the group with the oldest document because documents whose transmission time point is unknown are often referenced after the fact, as in hyperlink reference relations, so estimating in oldest-first order yields a more accurate transmission time point.
- In the example, link destinations are traced based on the link relations.
- From document (2), no new document can be reached that is outside the groups and has an unknown transmission time point.
- Document (7) can be reached as a new link destination, so the transmission time point of document (3) can be applied to document (7).
- Likewise, document (8) can be newly reached as a link destination, and the transmission time point of document (5) can be applied to document (8).
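The link-following step can be sketched as a breadth-first traversal. The names `propagate_outside_groups`, `links`, and `groups`, and the sample data, are hypothetical. The sketch assumes every group member already has a known or estimated time point before the traversal starts.

```python
from collections import deque


def propagate_outside_groups(links, groups, times):
    """Estimate time points for documents reachable from group members.

    links: {doc_id: [destination_id, ...]}
    groups: {group_id: [doc_id, ...]}
    times: {doc_id: time} for all group members (known or already estimated),
    with times that sort chronologically.
    Returns a new {doc_id: time} mapping.
    """
    estimated = dict(times)

    def oldest(group_id):
        return min(estimated[d] for d in groups[group_id])

    # Process groups oldest first so earlier time points win.
    for group_id in sorted(groups, key=oldest):
        for start in groups[group_id]:
            queue = deque(links.get(start, []))
            while queue:
                doc = queue.popleft()
                if doc in estimated:
                    continue  # time point known: do not follow links beyond it
                estimated[doc] = estimated[start]
                queue.extend(links.get(doc, []))
    return estimated


# Hypothetical data for illustration only.
links = {3: [7], 5: [8], 7: [9], 8: [1], 2: [3]}
groups = {0: [1, 3], 1: [2, 5]}
times = {1: "2008-12-01", 3: "2008-12-01", 2: "2008-12-05", 5: "2008-12-05"}
result = propagate_outside_groups(links, groups, times)
```

Because a document receives a time point the first time it is reached and is skipped afterwards, the traversal terminates even when the link graph contains cycles.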
- The estimation unit 5 can also exclude link relations that can be judged unnecessary.
- An unnecessary link is a link relation that does not belong to a group whose transmission time point is estimated to be the same, or a link relation for which assigning a transmission date is meaningless.
- For example, the URL of a link destination may lie in another, unrelated domain. Reflecting such a link relation in the identification of the transmission time point can be considered unnecessary, and such link relations are preferably excluded as needed.
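One simple way to exclude such links is sketched below. Treating "unrelated domain" as "different host name" is an assumption made for this sketch, not a rule stated in the text, and the function name and URLs are hypothetical.

```python
from urllib.parse import urlparse


def drop_cross_domain_links(link_pairs):
    """Keep only links whose source and destination share a host name.

    link_pairs: iterable of (source_url, destination_url) tuples.
    """
    kept = []
    for source_url, dest_url in link_pairs:
        # Comparing host names is an assumed stand-in for "related domain".
        if urlparse(source_url).hostname == urlparse(dest_url).hostname:
            kept.append((source_url, dest_url))
    return kept


# Hypothetical URLs for illustration only.
pairs = [
    ("http://example.com/index.html", "http://example.com/news/1.html"),
    ("http://example.com/index.html", "http://ads.example.net/banner.html"),
]
filtered = drop_cross_domain_links(pairs)
```

A stricter or looser notion of relatedness, such as comparing registered domains instead of full host names, could be substituted in the comparison.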
- The information estimation apparatus, information estimation method, and computer-readable recording medium according to the present invention have the following characteristics.
- (1) An information estimation apparatus for estimating the transmission time point of a document whose transmission time point is not specified in a document set to be analyzed, comprising: a structure analysis unit that identifies, from the document set, a document having a document structure in which link relations to other documents are presented in table-of-contents form, and extracts, from the document structure of the identified document, the link relations of documents included in the document set; a grouping unit that sets a group of documents using the document identified by the structure analysis unit and the link relations extracted by the structure analysis unit; and an estimation unit that estimates the transmission time point of a document in the group whose transmission time point is not specified, based on the group set by the grouping unit and the transmission time point of a document in the group whose transmission time point is specified.
- (2) The information estimation apparatus according to (1), wherein the grouping unit sets the group by combining a document whose transmission time point is specified with a document that has, with that document, a link relation extracted by the structure analysis unit and whose transmission time point is not specified.
- (4) The information estimation apparatus according to (1), wherein the estimation unit estimates the transmission time point of the document in the group whose transmission time point is specified as the transmission time point of the document in the group whose transmission time point is not specified.
- (5) The information estimation apparatus according to (1), wherein the grouping unit sets a plurality of groups, and the estimation unit selects groups in order from the group having the document with the oldest transmission time point, takes each document in the selected group as a starting point, identifies reachable documents by following link-destination documents in order from the starting point, and, when the transmission time point of an identified document is not specified, estimates the transmission time point of the identified document to be that of the starting document.
- (7) The information estimation apparatus according to (1), wherein the documents included in the document set are web pages, and the structure analysis unit identifies the document having a document structure in which link relations to other documents are presented in table-of-contents form by using the hyperlinks described in the web page and at least one of HTML tags and subtrees of the DOM tree.
- (8) An information estimation method for estimating the transmission time point of a document whose transmission time point is not specified in a document set to be analyzed, comprising: (a) identifying, from the document set, a document having a document structure in which link relations to other documents are presented in table-of-contents form, and extracting, from the document structure of the identified document, the link relations of documents included in the document set; (b) setting a group of documents using the document identified in step (a) and the link relations extracted in step (a); and (c) estimating the transmission time point of a document in the group whose transmission time point is not specified, based on the group set in step (b) and the transmission time point of a document in the group whose transmission time point is specified.
- (9) The information estimation method according to (8), wherein in step (b) the group is set by combining a document whose transmission time point is specified with a document that has, with that document, a link relation extracted in step (a) and whose transmission time point is not specified.
- (10) The information estimation method according to (8), wherein in step (b), when a document whose transmission time point is not specified has links with a plurality of documents whose transmission time points are specified, the group is set by combining that document with the document having the older specified transmission time point.
- (11) The information estimation method according to (8), wherein in step (c) the transmission time point of the document in the group whose transmission time point is specified is estimated as the transmission time point of the document in the group whose transmission time point is not specified.
- (12) The information estimation method according to (8), wherein a plurality of groups are set in step (b), and in step (c) groups are selected in order from the group having the document with the oldest transmission time point, each document in the selected group is taken as a starting point, reachable documents are identified by following link-destination documents in order from the starting point, and, when the transmission time point of an identified document is not specified, the transmission time point of the identified document is estimated to be that of the starting document.
- (14) The information estimation method according to (8), wherein the documents included in the document set are web pages, and in step (a) the document having a document structure in which link relations to other documents are presented in table-of-contents form is identified by using the hyperlinks described in the web page and at least one of HTML tags and subtrees of the DOM tree.
- (15) A computer-readable recording medium storing a program for causing a computer to estimate the transmission time point of a document whose transmission time point is not specified in a document set to be analyzed, the program including instructions that cause the computer to execute: (a) identifying, from the document set, a document having a document structure in which link relations to other documents are presented in table-of-contents form, and extracting, from the document structure of the identified document, the link relations of documents included in the document set; (b) setting a group of documents using the document identified in step (a) and the link relations extracted in step (a); and (c) estimating the transmission time point of a document in the group whose transmission time point is not specified, based on the group set in step (b) and the transmission time point of a document in the group whose transmission time point is specified.
- (16) The computer-readable recording medium according to (15), wherein in step (b) the group is set by combining a document whose transmission time point is specified with a document that has, with that document, a link relation extracted in step (a) and whose transmission time point is not specified.
- (17) The computer-readable recording medium according to (15), wherein in step (b), when a document whose transmission time point is not specified has links with a plurality of documents whose transmission time points are specified, the group is set by combining that document with the document having the older specified transmission time point.
- (18) The computer-readable recording medium according to (15), wherein in step (c) the transmission time point of the document in the group whose transmission time point is specified is estimated as the transmission time point of the document in the group whose transmission time point is not specified.
- (19) The computer-readable recording medium according to (15), wherein a plurality of groups are set in step (b), and in step (c) groups are selected in order from the group having the document with the oldest transmission time point, each document in the selected group is taken as a starting point, reachable documents are identified by following link-destination documents in order from the starting point, and, when the transmission time point of an identified document is not specified, the transmission time point of the identified document is estimated to be that of the starting document.
- (21) The computer-readable recording medium according to (15), wherein the documents included in the document set are web pages, and in step (a) the document having a document structure in which link relations to other documents are presented in table-of-contents form is identified by using the hyperlinks described in the web page and at least one of HTML tags and subtrees of the DOM tree.
- The present invention is useful for creating time-series data for web pages.
- The present invention therefore has industrial applicability.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Description
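The apparatus is characterized by comprising: a structure analysis unit that identifies, from the document set, a document having a document structure in which link relations to other documents are presented in table-of-contents form, and extracts, from the document structure of the identified document, the link relations of documents included in the document set;
a grouping unit that sets a group of documents using the document identified by the structure analysis unit and the link relations extracted by the structure analysis unit; and an estimation unit that estimates the transmission time point of a document in the group whose transmission time point is not specified, based on the group set by the grouping unit and the transmission time point of a document in the group whose transmission time point is specified.
The method is characterized by having: (a) a step of identifying, from the document set, a document having a document structure in which link relations to other documents are presented in table-of-contents form, and extracting, from the document structure of the identified document, the link relations of documents included in the document set;
(b) a step of setting a group of documents using the document identified in step (a) and the link relations extracted in step (a); and (c) a step of estimating the transmission time point of a document in the group whose transmission time point is not specified, based on the group set in step (b) and the transmission time point of a document in the group whose transmission time point is specified.
The recording medium is characterized by recording a program including instructions that cause the computer to execute:
(a) a step of identifying, from the document set, a document having a document structure in which link relations to other documents are presented in table-of-contents form, and extracting, from the document structure of the identified document, the link relations of documents included in the document set;
(b) a step of setting a group of documents using the document identified in step (a) and the link relations extracted in step (a); and (c) a step of estimating the transmission time point of a document in the group whose transmission time point is not specified, based on the group set in step (b) and the transmission time point of a document in the group whose transmission time point is specified.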
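An information estimation apparatus, an information estimation method, and a program according to an embodiment of the present invention are described below with reference to FIGS. 1 to 3. First, the configuration of the information estimation apparatus in this embodiment is described. FIG. 1 is a block diagram showing the schematic configuration of the information estimation apparatus in the embodiment of the present invention. FIG. 2 is a diagram showing the link relations in the document set to be analyzed.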
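In this example, the reference time determination unit 2 (see FIG. 1) determines, for the content of each document included in the document set stored in the storage device 10, whether the transmission time point is known or unknown. When it is known, the reference time determination unit 2 also specifies that transmission time point. A document determined to be known here serves as the reference time point for estimating transmission time points in the subsequent processing.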
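The structure analysis unit 3 identifies, from the document set to be analyzed, documents having a document structure in which link relations to other documents are presented in table-of-contents form, and extracts those link relations. A concrete example is shown in FIG. 5. FIG. 5 is a diagram showing the link sources and link destinations in the link relations shown in FIG. 2. As shown in FIG. 5, the link relations (see FIG. 2) are extracted from document structures in which link relations to other documents in the document set are presented in table-of-contents form. Each link relation is specified by associating the document ID of the link source with the document ID of the link destination.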
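The extraction of link relations from a table-of-contents-like structure can be sketched with Python's standard `html.parser` module. Treating hyperlinks inside `<ul>` or `<ol>` lists as the table-of-contents structure is an assumption made for this sketch; the class name `TocLinkExtractor` and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser


class TocLinkExtractor(HTMLParser):
    """Collect href values of <a> tags that appear inside <ul> or <ol> lists."""

    def __init__(self):
        super().__init__()
        self.list_depth = 0  # how many list elements are currently open
        self.links = []      # link destinations found inside lists

    def handle_starttag(self, tag, attrs):
        if tag in ("ul", "ol"):
            self.list_depth += 1
        elif tag == "a" and self.list_depth > 0:
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("ul", "ol") and self.list_depth > 0:
            self.list_depth -= 1


# Hypothetical page: one inline link and a list of links.
page = (
    '<p>See <a href="about.html">about</a>.</p>'
    '<ul><li><a href="a.html">A</a></li><li><a href="b.html">B</a></li></ul>'
)
extractor = TocLinkExtractor()
extractor.feed(page)
toc_links = extractor.links
```

Pairing the ID of the page being parsed with each collected destination would give the link-source and link-destination associations described above.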
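In this example, the grouping unit 4 sets document groups using not only the documents identified by the structure analysis unit 3 and the link relations extracted by it, but also the documents whose content transmission time point was specified by the reference time determination unit 2. In doing so, the grouping unit 4 sets groups of documents estimated to share the same transmission time point so that the content transmission time points do not overlap.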
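The estimation unit 5 estimates the transmission time point of documents whose transmission time point is unknown, based on the groups set by the grouping unit 4 and the documents whose transmission time point is known. In this example, for each group generated by the grouping unit 4, the estimation unit 5 uses the document in the group whose transmission time point is known and assigns that time point to the documents whose transmission time point is unknown. In this case, FIG. 4 is updated as shown in FIG. 9, using the documents with known transmission time points in FIG. 4 and the groups shown in FIG. 8. FIG. 9 is a diagram showing the result of the estimation process.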
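(1) An information estimation apparatus for estimating the transmission time point of a document whose transmission time point is not specified in a document set to be analyzed, comprising:
a structure analysis unit that identifies, from the document set, a document having a document structure in which link relations to other documents are presented in table-of-contents form, and extracts, from the document structure of the identified document, the link relations of documents included in the document set;
a grouping unit that sets a group of documents using the document identified by the structure analysis unit and the link relations extracted by the structure analysis unit; and
an estimation unit that estimates the transmission time point of a document in the group whose transmission time point is not specified, based on the group set by the grouping unit and the transmission time point of a document in the group whose transmission time point is specified.
(2) The information estimation apparatus according to (1), wherein the grouping unit sets the group by combining a document whose transmission time point is specified with a document that has, with that document, a link relation extracted by the structure analysis unit and whose transmission time point is not specified.
(3) The information estimation apparatus according to (1), wherein, when a document whose transmission time point is not specified has links with a plurality of documents whose transmission time points are specified, the grouping unit sets the group by combining that document with the document having the older specified transmission time point.
(4) The information estimation apparatus according to (1), wherein the estimation unit estimates the transmission time point of the document in the group whose transmission time point is specified as the transmission time point of the document in the group whose transmission time point is not specified.
(5) The information estimation apparatus according to (1), wherein the grouping unit sets a plurality of groups,
the estimation unit selects groups in order from the group having the document with the oldest transmission time point among the plurality of groups,
and, taking each document in the selected group as a starting point, identifies reachable documents by following link-destination documents in order from the starting point, and, when the transmission time point of an identified document is not specified, estimates the transmission time point of the identified document to be that of the starting document.
(6) The information estimation apparatus according to (1), further comprising a reference time determination unit that determines, for each document included in the document set to be analyzed, whether its transmission time point is specified.
(7) The information estimation apparatus according to (1), wherein the documents included in the document set are web pages, and
the structure analysis unit identifies the document having a document structure in which link relations to other documents are presented in table-of-contents form by using the hyperlinks described in the web page and at least one of HTML tags and subtrees of the DOM tree.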
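(8) An information estimation method for estimating the transmission time point of a document whose transmission time point is not specified in a document set to be analyzed, comprising:
(a) a step of identifying, from the document set, a document having a document structure in which link relations to other documents are presented in table-of-contents form, and extracting, from the document structure of the identified document, the link relations of documents included in the document set;
(b) a step of setting a group of documents using the document identified in step (a) and the link relations extracted in step (a); and
(c) a step of estimating the transmission time point of a document in the group whose transmission time point is not specified, based on the group set in step (b) and the transmission time point of a document in the group whose transmission time point is specified.
(9) The information estimation method according to (8), wherein in step (b) the group is set by combining a document whose transmission time point is specified with a document that has, with that document, a link relation extracted in step (a) and whose transmission time point is not specified.
(10) The information estimation method according to (8), wherein in step (b), when a document whose transmission time point is not specified has links with a plurality of documents whose transmission time points are specified, the group is set by combining that document with the document having the older specified transmission time point.
(11) The information estimation method according to (8), wherein in step (c) the transmission time point of the document in the group whose transmission time point is specified is estimated as the transmission time point of the document in the group whose transmission time point is not specified.
(12) The information estimation method according to (8), wherein a plurality of groups are set in step (b),
in step (c), groups are selected in order from the group having the document with the oldest transmission time point among the plurality of groups,
and, taking each document in the selected group as a starting point, reachable documents are identified by following link-destination documents in order from the starting point, and, when the transmission time point of an identified document is not specified, the transmission time point of the identified document is estimated to be that of the starting document.
(13) The information estimation method according to (8), further comprising (d) a step of determining, for each document included in the document set to be analyzed, whether its transmission time point is specified.
(14) The information estimation method according to (8), wherein the documents included in the document set are web pages, and
in step (a), the document having a document structure in which link relations to other documents are presented in table-of-contents form is identified by using the hyperlinks described in the web page and at least one of HTML tags and subtrees of the DOM tree.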
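(15) A computer-readable recording medium storing a program for causing a computer to estimate the transmission time point of a document whose transmission time point is not specified in a document set to be analyzed,
the program including instructions that cause the computer to execute:
(a) a step of identifying, from the document set, a document having a document structure in which link relations to other documents are presented in table-of-contents form, and extracting, from the document structure of the identified document, the link relations of documents included in the document set;
(b) a step of setting a group of documents using the document identified in step (a) and the link relations extracted in step (a); and
(c) a step of estimating the transmission time point of a document in the group whose transmission time point is not specified, based on the group set in step (b) and the transmission time point of a document in the group whose transmission time point is specified.
(16) The computer-readable recording medium according to (15), wherein in step (b) the group is set by combining a document whose transmission time point is specified with a document that has, with that document, a link relation extracted in step (a) and whose transmission time point is not specified.
(17) The computer-readable recording medium according to (15), wherein in step (b), when a document whose transmission time point is not specified has links with a plurality of documents whose transmission time points are specified, the group is set by combining that document with the document having the older specified transmission time point.
(18) The computer-readable recording medium according to (15), wherein in step (c) the transmission time point of the document in the group whose transmission time point is specified is estimated as the transmission time point of the document in the group whose transmission time point is not specified.
(19) The computer-readable recording medium according to (15), wherein a plurality of groups are set in step (b),
in step (c), groups are selected in order from the group having the document with the oldest transmission time point among the plurality of groups,
and, taking each document in the selected group as a starting point, reachable documents are identified by following link-destination documents in order from the starting point, and, when the transmission time point of an identified document is not specified, the transmission time point of the identified document is estimated to be that of the starting document.
(20) The computer-readable recording medium according to (15), wherein the program further causes the computer to execute (d) a step of determining, for each document included in the document set to be analyzed, whether its transmission time point is specified.
(21) The computer-readable recording medium according to (15), wherein the documents included in the document set are web pages, and
in step (a), the document having a document structure in which link relations to other documents are presented in table-of-contents form is identified by using the hyperlinks described in the web page and at least one of HTML tags and subtrees of the DOM tree.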
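2 Reference time determination unit
3 Structure analysis unit
4 Grouping unit
5 Estimation unit
6 Input reception unit
10 Storage device
11 Document storage unit
20 Input device
30 Output device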
Claims (21)
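- An information estimation apparatus for estimating the transmission time point of a document whose transmission time point is not specified in a document set to be analyzed, comprising:
a structure analysis unit that identifies, from the document set, a document having a document structure in which link relations to other documents are presented in table-of-contents form, and extracts, from the document structure of the identified document, the link relations of documents included in the document set;
a grouping unit that sets a group of documents using the document identified by the structure analysis unit and the link relations extracted by the structure analysis unit; and
an estimation unit that estimates the transmission time point of a document in the group whose transmission time point is not specified, based on the group set by the grouping unit and the transmission time point of a document in the group whose transmission time point is specified.
- The information estimation apparatus according to claim 1, wherein the grouping unit sets the group by combining a document whose transmission time point is specified with a document that has, with that document, a link relation extracted by the structure analysis unit and whose transmission time point is not specified.
- The information estimation apparatus according to claim 1 or 2, wherein, when a document whose transmission time point is not specified has links with a plurality of documents whose transmission time points are specified, the grouping unit sets the group by combining that document with the document having the older specified transmission time point.
- The information estimation apparatus according to any one of claims 1 to 3, wherein the estimation unit estimates the transmission time point of the document in the group whose transmission time point is specified as the transmission time point of the document in the group whose transmission time point is not specified.
- The information estimation apparatus according to any one of claims 1 to 4, wherein the grouping unit sets a plurality of groups,
the estimation unit selects groups in order from the group having the document with the oldest transmission time point among the plurality of groups,
and, taking each document in the selected group as a starting point, identifies reachable documents by following link-destination documents in order from the starting point, and, when the transmission time point of an identified document is not specified, estimates the transmission time point of the identified document to be that of the starting document.
- The information estimation apparatus according to any one of claims 1 to 5, further comprising a reference time determination unit that determines, for each document included in the document set to be analyzed, whether its transmission time point is specified.
- The information estimation apparatus according to any one of claims 1 to 6, wherein the documents included in the document set are web pages, and
the structure analysis unit identifies the document having a document structure in which link relations to other documents are presented in table-of-contents form by using the hyperlinks described in the web page and at least one of HTML tags and subtrees of the DOM tree.
- An information estimation method for estimating the transmission time point of a document whose transmission time point is not specified in a document set to be analyzed, comprising:
(a) a step of identifying, from the document set, a document having a document structure in which link relations to other documents are presented in table-of-contents form, and extracting, from the document structure of the identified document, the link relations of documents included in the document set;
(b) a step of setting a group of documents using the document identified in step (a) and the link relations extracted in step (a); and
(c) a step of estimating the transmission time point of a document in the group whose transmission time point is not specified, based on the group set in step (b) and the transmission time point of a document in the group whose transmission time point is specified.
- The information estimation method according to claim 8, wherein in step (b) the group is set by combining a document whose transmission time point is specified with a document that has, with that document, a link relation extracted in step (a) and whose transmission time point is not specified.
- The information estimation method according to claim 8 or 9, wherein in step (b), when a document whose transmission time point is not specified has links with a plurality of documents whose transmission time points are specified, the group is set by combining that document with the document having the older specified transmission time point.
- The information estimation method according to any one of claims 8 to 10, wherein in step (c) the transmission time point of the document in the group whose transmission time point is specified is estimated as the transmission time point of the document in the group whose transmission time point is not specified.
- The information estimation method according to any one of claims 8 to 11, wherein a plurality of groups are set in step (b),
in step (c), groups are selected in order from the group having the document with the oldest transmission time point among the plurality of groups,
and, taking each document in the selected group as a starting point, reachable documents are identified by following link-destination documents in order from the starting point, and, when the transmission time point of an identified document is not specified, the transmission time point of the identified document is estimated to be that of the starting document.
- The information estimation method according to any one of claims 8 to 12, further comprising (d) a step of determining, for each document included in the document set to be analyzed, whether its transmission time point is specified.
- The information estimation method according to any one of claims 8 to 13, wherein the documents included in the document set are web pages, and
in step (a), the document having a document structure in which link relations to other documents are presented in table-of-contents form is identified by using the hyperlinks described in the web page and at least one of HTML tags and subtrees of the DOM tree.
- A computer-readable recording medium storing a program for causing a computer to estimate the transmission time point of a document whose transmission time point is not specified in a document set to be analyzed,
the program including instructions that cause the computer to execute:
(a) a step of identifying, from the document set, a document having a document structure in which link relations to other documents are presented in table-of-contents form, and extracting, from the document structure of the identified document, the link relations of documents included in the document set;
(b) a step of setting a group of documents using the document identified in step (a) and the link relations extracted in step (a); and
(c) a step of estimating the transmission time point of a document in the group whose transmission time point is not specified, based on the group set in step (b) and the transmission time point of a document in the group whose transmission time point is specified.
- The computer-readable recording medium according to claim 15, wherein in step (b) the group is set by combining a document whose transmission time point is specified with a document that has, with that document, a link relation extracted in step (a) and whose transmission time point is not specified.
- The computer-readable recording medium according to claim 15 or 16, wherein in step (b), when a document whose transmission time point is not specified has links with a plurality of documents whose transmission time points are specified, the group is set by combining that document with the document having the older specified transmission time point.
- The computer-readable recording medium according to any one of claims 15 to 17, wherein in step (c) the transmission time point of the document in the group whose transmission time point is specified is estimated as the transmission time point of the document in the group whose transmission time point is not specified.
- The computer-readable recording medium according to any one of claims 15 to 18, wherein a plurality of groups are set in step (b),
in step (c), groups are selected in order from the group having the document with the oldest transmission time point among the plurality of groups,
and, taking each document in the selected group as a starting point, reachable documents are identified by following link-destination documents in order from the starting point, and, when the transmission time point of an identified document is not specified, the transmission time point of the identified document is estimated to be that of the starting document.
- The computer-readable recording medium according to any one of claims 15 to 19, wherein the program further causes the computer to execute (d) a step of determining, for each document included in the document set to be analyzed, whether its transmission time point is specified.
- The computer-readable recording medium according to any one of claims 15 to 20, wherein the documents included in the document set are web pages, and
in step (a), the document having a document structure in which link relations to other documents are presented in table-of-contents form is identified by using the hyperlinks described in the web page and at least one of HTML tags and subtrees of the DOM tree.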
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US13/141,365 US20110320452A1 (en) | 2008-12-26 | 2009-12-21 | Information estimation apparatus, information estimation method, and computer-readable recording medium |
JP2010543841A JP5494978B2 (ja) | 2008-12-26 | 2009-12-21 | 情報推定装置、情報推定方法、及びプログラム |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2008335328 | 2008-12-26 | ||
JP2008-335328 | 2008-12-26 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2010073592A1 (ja) | 2010-07-01 |
Family
ID=42287242
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2009/007072 WO2010073592A1 (ja) | 2008-12-26 | 2009-12-21 | 情報推定装置、情報推定方法、及びコンピュータ読み取り可能な記録媒体 |
Country Status (3)
Country | Link |
---|---|
US (1) | US20110320452A1 (ja) |
JP (1) | JP5494978B2 (ja) |
WO (1) | WO2010073592A1 (ja) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2012203672A (ja) * | 2011-03-25 | 2012-10-22 | Fuji Xerox Co Ltd | プログラムおよび情報処理装置 |
JP5263851B1 (ja) * | 2012-10-09 | 2013-08-14 | 株式会社エスキュービズム | 文書変換方法および文書変換プログラム |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2013165338A1 (en) * | 2012-04-30 | 2013-11-07 | Hewlett-Packard Development Company, L.P. | Print production scheduling |
US9613133B2 (en) * | 2014-11-07 | 2017-04-04 | International Business Machines Corporation | Context based passage retrieval and scoring in a question answering system |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2004220251A (ja) * | 2003-01-14 | 2004-08-05 | Nippon Telegr & Teleph Corp <Ntt> | 情報抽出規則作成システム、情報抽出規則作成方法及び情報抽出規則作成プログラム |
JP2004318506A (ja) * | 2003-04-16 | 2004-11-11 | Nippon Telegr & Teleph Corp <Ntt> | 文書情報検索装置及び文書検索方法並びにそのプログラム |
Family Cites Families (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6205125B1 (en) * | 1998-07-31 | 2001-03-20 | Motorola, Inc. | Method and system for determining an estimate of a transmission time of a packet |
US7054279B2 (en) * | 2000-04-07 | 2006-05-30 | Broadcom Corporation | Method and apparatus for optimizing signal transformation in a frame-based communications network |
JP3773770B2 (ja) * | 2000-09-13 | 2006-05-10 | シャープ株式会社 | ハイパーテキスト表示装置 |
JP4489994B2 (ja) * | 2001-05-11 | 2010-06-23 | 富士通株式会社 | 話題抽出装置、方法、プログラム及びそのプログラムを記録する記録媒体 |
US20040260735A1 (en) * | 2003-06-17 | 2004-12-23 | Martinez Richard Kenneth | Method, system, and program for assigning a timestamp associated with data |
US7702618B1 (en) * | 2004-07-26 | 2010-04-20 | Google Inc. | Information retrieval system for archiving multiple document versions |
US20080097972A1 (en) * | 2005-04-18 | 2008-04-24 | Collage Analytics Llc, | System and method for efficiently tracking and dating content in very large dynamic document spaces |
JP2008537264A (ja) * | 2005-04-18 | 2008-09-11 | コラージュ・アナリティクス・エルエルシー | 非常に大きいダイナミック文書スペース中のコンテンツを効率的に追跡および年代決定するためのシステムおよび方法 |
US9015301B2 (en) * | 2007-01-05 | 2015-04-21 | Digital Doors, Inc. | Information infrastructure management tools with extractor, secure storage, content analysis and classification and method therefor |
-
2009
- 2009-12-21 US US13/141,365 patent/US20110320452A1/en not_active Abandoned
- 2009-12-21 WO PCT/JP2009/007072 patent/WO2010073592A1/ja active Application Filing
- 2009-12-21 JP JP2010543841A patent/JP5494978B2/ja active Active
Also Published As
Publication number | Publication date |
---|---|
US20110320452A1 (en) | 2011-12-29 |
JP5494978B2 (ja) | 2014-05-21 |
JPWO2010073592A1 (ja) | 2012-06-07 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US12038885B2 (en) | Method and system for document versions encoded in a hierarchical representation | |
US7483903B2 (en) | Unsupervised learning tool for feature correction | |
US20080091706A1 (en) | Apparatus, method, and computer program product for processing information | |
US20080306941A1 (en) | System for automatically extracting by-line information | |
WO2013101489A1 (en) | Extracting search-focused key n-grams and/or phrases for relevance rankings in searches | |
JP5494978B2 (ja) | 情報推定装置、情報推定方法、及びプログラム | |
Uzun et al. | An effective and efficient Web content extractor for optimizing the crawling process | |
JP2008090404A (ja) | 文書検索装置、文書検索方法および文書検索プログラム | |
CN113987320B (zh) | 基于智能页面解析的实时资讯爬虫方法、装置及设备 | |
US20210174013A1 (en) | Information processing apparatus and non-transitory computer readable medium storing program | |
JP5063877B2 (ja) | 情報処理装置およびコンピュータプログラム | |
JP2004220251A (ja) | 情報抽出規則作成システム、情報抽出規則作成方法及び情報抽出規則作成プログラム | |
Yu et al. | Web content information extraction based on DOM tree and statistical information | |
US20110252313A1 (en) | Document information selection method and computer program product | |
JP2009140020A (ja) | アノテーションプログラム、アノテーション装置及びアノテーション方法 | |
WO2006046665A1 (ja) | 文書処理装置及び文書処理方法 | |
US7512905B1 (en) | Highlight linked-to document sections for increased readability | |
JP5712496B2 (ja) | アノテーション復元方法、アノテーション付与方法、アノテーション復元プログラム及びアノテーション復元装置 | |
CN105787032B (zh) | 网页快照的生成方法及装置 | |
JP2010272006A (ja) | 関係抽出装置、関係抽出方法、及びプログラム | |
JP5391738B2 (ja) | アノテーションプログラム、アノテーション装置及びアノテーション方法 | |
JP5187064B2 (ja) | Web資源追跡管理プログラム、Web資源追跡管理装置及びWeb資源追跡管理方法 | |
JP5564442B2 (ja) | 文章検索装置 | |
JP2020046805A (ja) | 情報処理装置、情報処理方法、およびプログラム | |
KR100871470B1 (ko) | 색인 데이터를 구축하기 위한 검색 시스템 및 이를 위한 방법 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 09834403 Country of ref document: EP Kind code of ref document: A1 |
|
ENP | Entry into the national phase |
Ref document number: 2010543841 Country of ref document: JP Kind code of ref document: A |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
WWE | Wipo information: entry into national phase |
Ref document number: 13141365 Country of ref document: US |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 09834403 Country of ref document: EP Kind code of ref document: A1 |