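WO2016098739A1 - Information extraction device, information extraction method, and information extraction program - Google Patents
Information extraction device, information extraction method, and information extraction program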
- Publication number
- WO2016098739A1 (PCT/JP2015/084974)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- information
- extraction
- variable
- variable element
- extracted
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/25—Integrating or interfacing systems involving database management systems
- G06F16/254—Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/248—Presentation of query results
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/93—Document management systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/955—Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
- G06F16/9566—URL specific, e.g. using aliases, detecting broken or misspelled links
Definitions
- the present invention relates to an information extraction apparatus, an information extraction method, and an information extraction program for extracting specific information from a structured document.
- A conventional information extraction apparatus detects differences between Web pages having the same structure, identifies each place (tag) where a difference is detected as a difference area, and extracts the information described in the difference area as difference data. The difference area and the difference data are linked and stored as specific information. For example, the "zip code" tag and the actual zip code (for example, 100-1000) are stored in association with each other. With this device, for example, calculating the difference between the Web pages of the English learning records of Mr. A and Mr. B makes it possible to extract the parts that differ for each user (user name, word learning time, grammar learning time, etc.) as personal information.
- Another information extraction device automatically creates an extraction rule for extracting data from a portion common to the tree structures of a plurality of Web pages, and also automatically creates a specific rule for specifying the URLs of the Web pages to which the extraction rule is applied.
- This information extraction apparatus stores the specific rule for specifying the URL of a Web page and the extraction rule for extracting data from that Web page in association with each other.
- When extracting, the apparatus selects the specific rule that specifies the URL of the Web page to be extracted, selects the extraction rule associated with that specific rule, and extracts data (specific information) from the Web page based on the selected extraction rule.
- Patent Document 3 specifies and extracts the portion corresponding to each personal area from a single Web page (such as a bulletin board) in which a plurality of personal areas are mixed, and realizes a function of identifying the information associated with each personal area. For example, in a bulletin board page, the location written by each user is specified, and the written content is extracted for each user.
- Another information extraction device (see Non-Patent Document 1) implements a method of describing rules that extract a specific element with reference to its surrounding information ("contextual clues"), so that the extraction program for that element does not have to be modified when a specification changes during functional testing of a Web application. For example, when extracting the word learning time and the grammar learning time from a Web page of English learning records, contextual clues such as "the word learning time exists near the word 'word'" and "the grammar learning time exists near the word 'grammar'" are used to extract the specific information continuously and robustly.
- Specifications here means, for example, the page design, the arrangement of information in a page, and the tree structure of a page.
- With the above-described conventional information extraction devices, when the specification of a structured document (for example, a Web page) is changed, specific information (for example, personal information) that was extracted before the change cannot be extracted easily and reliably after the change.
- Patent Document 1 does not track extraction information before and after a specification change. Therefore, even if the word learning time and the grammar learning time can be extracted from the Web page of the English learning record at a certain time, it may not be possible to tell whether a piece of information extracted after the specification change is the word learning time or the grammar learning time.
- Patent Document 2 recreates the extraction rule and the specific rule manually or automatically when a change in the structure of the Web page is detected. That is, in the case of Patent Document 2, if there is a specification change in the Web page, it is necessary to recreate the extraction rule and the specific rule.
- the information extracted in Patent Document 2 is limited to a common part of a plurality of Web pages.
- Patent Document 3 does not track extraction information before and after a change when the design or configuration of a Web page changes.
- Non-Patent Document 1 requires the user to select the peripheral information used when extracting the elements to be extracted.
- Moreover, because the peripheral information is limited to specific information (for example, the vicinity of the word "grammar"), the extraction target element cannot be extracted when that peripheral information disappears due to a change in the specification of the Web page.
- As described above, a conventional information extraction apparatus cannot easily and reliably extract, after a specification change, the specific information that was extracted before the change.
- An object of the present invention is to provide an information extraction device, an information extraction method, and an information extraction program that can do so.
- The information extraction apparatus of the present invention includes a control unit and a storage unit. The control unit acquires a plurality of structured documents (specifically, a plurality of documents having the same structure and different contexts), extracts the portions that differ between the acquired documents as variable elements, extracts elements within a predetermined range from each variable element as peripheral information, and sets at least one of the variable elements as an extraction target. The storage unit stores at least the variable element and the peripheral information for the extraction target.
- The control unit acquires the plurality of structured documents again, re-extracts the portions that differ between the newly acquired documents as variable elements, and re-extracts the elements within a predetermined range from each re-extracted variable element as peripheral information.
- The control unit then calculates the similarity between the re-extracted variable elements and peripheral information and the variable elements and peripheral information stored in the storage unit and, based on the calculated similarity, identifies the variable element corresponding to the extraction target from among the re-extracted variable elements.
- The information extraction method of the present invention includes: a step of acquiring a plurality of structured documents; a step of extracting the portions that differ between the acquired documents as variable elements; a step of extracting elements within a predetermined range from each variable element as peripheral information; a step of setting at least one of the variable elements as an extraction target; a step of storing at least the variable element and the peripheral information for the extraction target in a storage unit; a step of acquiring the plurality of structured documents again and re-extracting the variable elements and their peripheral information; a step of calculating the similarity between the re-extracted and the stored variable elements and peripheral information; and a step of identifying, based on the calculated similarity, the variable element corresponding to the extraction target from among the re-extracted variable elements.
- the information extraction program of the present invention causes a computer to execute each step of the information extraction method.
- The information extraction apparatus of the present invention extracts the parts that differ between a plurality of structured documents (for example, personal information such as name, weight, and height) as variable elements, extracts elements within a predetermined range from each variable element (for example, text, HTML tags, and attributes) as peripheral information, sets at least one of the variable elements as an extraction target (specific information), and stores at least the variable element and the peripheral information for the extraction target.
- When the variable elements and their peripheral information are extracted again, the apparatus calculates the similarity between the stored variable elements and peripheral information and the re-extracted variable elements and peripheral information. Based on the result, the variable element corresponding to the extraction target is identified from among the re-extracted variable elements.
- The structured document is, for example, a Web page.
- FIG. 1 is a configuration diagram of an information extraction device according to a first embodiment of the present invention.
- FIG. 2 is a flowchart showing extraction of the variable elements and peripheral information in Embodiment 1 of the present invention.
- FIG. 3 shows specific examples of a Web page according to Embodiment 1 of the present invention, where FIG. 3A is a URL, FIG. 3B is an HTML document, and FIG. 3C is a screen display of the extracted variable elements.
- FIG. 4 is an example of the extraction information stored in memory according to Embodiment 1 of the present invention.
- FIG. 5 is a flowchart showing extraction of the specific information in Embodiment 1 of the present invention.
- The information extraction apparatus extracts, as variable elements, the portions that differ between a plurality of structured documents (in this embodiment, Web pages), and also extracts elements within a predetermined range from each variable element as peripheral information. At least one of the variable elements is set as an extraction target (specific information), and at least the variable element and the peripheral information are stored for the extraction target.
- When the information extraction device extracts the variable elements and their peripheral information again, it calculates the similarity between the stored variable elements and peripheral information and the re-extracted variable elements and peripheral information. Based on the result, the variable element corresponding to the extraction target is identified from among the re-extracted variable elements.
- As a result, the specific information extracted before a specification change can be extracted easily and reliably after the change; that is, the specific information can be tracked before and after the specification change.
- the specific information can be extracted mechanically and constantly by tracking the position of the extraction location before and after the specification change.
- FIG. 1 shows a configuration of an information extracting device according to an embodiment of the present invention.
- the information extraction apparatus 100 can be realized by a personal computer or the like.
- the information extraction device 100 includes an input unit 110 that receives input from a user, a control unit 120 that controls the entire information extraction device 100, a display unit 130, a memory 140, and a communication unit 150.
- the input unit 110 is used, for example, to input information indicating the location of a structured document (in this embodiment, the URL of a Web page).
- the input unit 110 is also used to select at least one of variable elements that are different parts among a plurality of Web pages as specific information (extraction elements) to be extracted.
- the input unit 110 is, for example, a keyboard or a touch panel.
- The control unit 120 includes an extraction unit 121 that extracts the variable elements that differ between a plurality of Web pages and their peripheral information, a storage unit 122 that writes the extracted variable elements and peripheral information to the memory 140, and a tracking unit 123 that tracks the extraction elements using the variable elements and peripheral information written to the memory 140.
- The extraction unit 121 acquires the configuration information (an HTML (Hyper Text Markup Language) document in this embodiment) of each of a plurality of Web pages, including the target Web page, based on the corresponding URLs. Based on the acquired configuration information, it extracts the portions that differ between the Web pages as variable elements.
- variable elements are extracted by calculating differences between a plurality of Web pages.
- the variable element corresponds to, for example, personal information (name, weight, height, etc.).
- the extraction unit 121 extracts elements (text, HTML tags, attributes, and the like) within a predetermined range from all variable elements in the target page as peripheral information from the target page.
- the display unit 130 displays the variable elements extracted by the extraction unit 121.
- the display unit 130 can be realized by a display or the like. The user selects an element to be extracted from the variable elements displayed on the display unit 130 and inputs the selected element to the input unit 110.
- the storage unit 122 records the extracted information as shown in FIG. 4 in the database (DB) 141 in the memory 140.
- the extraction information includes all variable elements in the target page, its peripheral information, and whether or not the user has selected as an extraction target. Further, the storage unit 122 stores the input URL in the memory 140.
- the memory 140 is, for example, a hard disk.
- the memory 140 is not limited to a hard disk, but may be a storage device such as an optical disk, a semiconductor memory element such as a flash memory, or a RAM.
- The tracking unit 123 tracks a variable element (specific information) selected as an extraction target. Specifically, the tracking unit 123 uses the variable elements of the current Web page re-extracted by the extraction unit 121 and their peripheral information, together with the extraction information in the database 141, to restore the correspondence between the variable elements before and after the re-extraction. In the present embodiment, the correspondence is restored by calculating the similarity between the information about each newly extracted variable element and the information about each variable element already recorded in the database 141, and associating the variable elements with high similarity with each other. More specifically, the similarity is calculated by comprehensively judging both the similarity of the variable element itself and the similarity of its peripheral information. In this way, the element that the user previously specified as an extraction target is identified from among the re-extracted variable elements.
- the communication unit 150 is connected to a network such as the Internet.
- the extraction unit 121 acquires an HTML document corresponding to the URL via the communication unit 150.
- the extraction element may be selected by the user via the communication unit 150.
- the tracked extraction element may be output to an external device via the communication unit 150.
- FIG. 2 shows a flowchart of extraction of variable elements and peripheral information by the information extraction device 100.
- FIG. 3A shows an example of a URL.
- FIG. 3B shows an example of an HTML document.
- FIG. 3C shows an example of a screen display of the extracted variable elements.
- the left side of FIG. 3B shows a Web page to be extracted in this embodiment, and the right side shows a Web page having a different context (account, date, etc.) from the Web page to be extracted.
- the HTML document includes four types of information for each user: name, current weight, weight one month ago, and height.
- FIG. 4 shows an example of the extracted information DB 141 stored in the memory 140.
- the input unit 110 inputs URLs of a plurality of Web pages as shown in FIG. 3A (step S201). Specifically, the URL of the Web page to be extracted and the URL of one or more other Web pages having the same layout and structure as the Web page to be extracted and having a different context are input.
- the storage unit 122 stores the input URL in the memory 140.
- the extraction unit 121 acquires configuration information (HTML document) corresponding to the URLs of a plurality of Web pages via the communication unit 150 (step S202).
- The extraction unit 121 extracts, as variable elements, the portions of the Web page to be extracted that differ from the other Web pages, based on the acquired page configuration information (step S203). For example, from a Web page on which personal information is posted as shown in FIG. 3B, the personal information that differs for each user ("55 kg", "54 kg", "171 cm", "Sakamoto") is extracted as variable elements.
- The variable elements are extracted by calculating the differences between the Web page to be extracted and the other Web pages. For the difference calculation, an existing algorithm can be used, for example XDiff (Wang, Yuan, David J. DeWitt, and J.-Y. Cai, pp. 519-530, 2003). Note that the difference calculation is not limited to this algorithm.
- If personal information happens to coincide between pages (for example, if Sakamoto and Sato have the same weight or height), that information cannot be extracted as a variable element. Therefore, by preparing a plurality of other Web pages for comparison with the page to be extracted, the possibility of information coinciding by chance can be sufficiently reduced, and variable elements can be extracted more accurately.
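- The following is a minimal sketch of why comparing against several other pages helps. It does not implement XDiff; it assumes, for illustration only, that pages with the same structure yield text nodes in the same order, and compares them by position. The HTML snippets, the user "Suzuki", and the function names are hypothetical.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects non-empty text nodes in document order."""

    def __init__(self):
        super().__init__()
        self.texts = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.texts.append(text)


def text_nodes(html):
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    return collector.texts


def variable_elements(target_html, other_htmls):
    """Return text nodes of the target page that differ from at least one other page.

    Assumption: all pages share the same structure, so text nodes align by position.
    """
    target = text_nodes(target_html)
    others = [text_nodes(h) for h in other_htmls]
    return [
        text
        for i, text in enumerate(target)
        if any(i >= len(o) or o[i] != text for o in others)
    ]


# Hypothetical pages: Sakamoto and Sato happen to have the same weight.
sakamoto = '<div>Name: <span id="name">Sakamoto</span></div><div>Weight: <span id="bw">55 kg</span></div>'
sato = '<div>Name: <span id="name">Sato</span></div><div>Weight: <span id="bw">55 kg</span></div>'
suzuki = '<div>Name: <span id="name">Suzuki</span></div><div>Weight: <span id="bw">60 kg</span></div>'

print(variable_elements(sakamoto, [sato]))          # weight is missed
print(variable_elements(sakamoto, [sato, suzuki]))  # weight is found
```

- With only one comparison page the coinciding weight is not detected as variable; adding a second comparison page recovers it.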
- The extraction unit 121 extracts peripheral information, that is, the elements within a predetermined range from the variable element (for example, within 100 characters around the variable element), from the configuration information (HTML document) of the Web page (step S204). More specifically, a token string composed of HTML tag names, attribute names, attribute values, and text is extracted as peripheral information. For example, as shown in FIG. 3B and FIG. 4, for the variable element "55 kg", the text ("Your weight is", "."), the HTML tags (div, span), the attribute name (id), and the attribute value ("bw") are extracted (for example, "Your weight is", span, id, "bw", /span, ".").
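- As an illustration of the token string described above, the sketch below flattens an HTML fragment into tag names, attribute names, attribute values, and text, then takes the tokens around a variable element. The source uses a character range (100 characters); the token window used here (`before`, `after`), the sample HTML, and all names are assumptions made for brevity.

```python
from html.parser import HTMLParser


class Tokenizer(HTMLParser):
    """Flattens HTML into tokens: tag names, attribute names/values, and text."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(tag)
        for name, value in attrs:
            self.tokens.append(name)
            if value is not None:
                self.tokens.append(value)

    def handle_endtag(self, tag):
        self.tokens.append("/" + tag)

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.tokens.append(text)


def peripheral_tokens(html, variable, before=4, after=2):
    """Return the tokens surrounding `variable`, excluding the variable itself."""
    tokenizer = Tokenizer()
    tokenizer.feed(html)
    tokenizer.close()
    tokens = tokenizer.tokens
    i = tokens.index(variable)  # raises ValueError if the variable is absent
    return tokens[max(0, i - before):i] + tokens[i + 1:i + 1 + after]


# Hypothetical fragment modeled on the "55 kg" example.
example_html = '<div>Your weight is <span id="bw">55 kg</span>.</div>'
print(peripheral_tokens(example_html, "55 kg"))
```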
- The extraction unit 121 displays the extracted variable elements on the display unit 130 as shown in FIG. 3C (step S205). The user can thus see the variable elements in the target Web page and select the extraction target (the element to be tracked) from among them. For example, the user selects "55 kg (current weight)" from the variable elements shown in FIG. 3C.
- The input unit 110 inputs the selection (step S206). As shown in FIG. 4, the storage unit 122 stores, in the database 141 of the memory 140, extraction information that includes all variable elements in the Web page to be extracted, their peripheral information, and whether each was selected as an extraction target via the input unit 110.
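- One possible in-memory shape for this extraction information is sketched below. The class and field names are hypothetical; the source specifies only that each record holds a variable element, its peripheral information, and whether it was selected as an extraction target, and that the input URLs are stored as well.

```python
from dataclasses import dataclass, field


@dataclass
class ExtractionRecord:
    variable: str            # the variable element, e.g. "55 kg"
    peripheral: list         # token string around the variable element
    selected: bool = False   # True if the user chose it as an extraction target


@dataclass
class ExtractionDB:
    urls: list = field(default_factory=list)
    records: list = field(default_factory=list)

    def targets(self):
        """Return only the records the user selected for tracking."""
        return [r for r in self.records if r.selected]


db = ExtractionDB(urls=["http://example.com/user/sakamoto"])  # placeholder URL
db.records.append(ExtractionRecord("Sakamoto", ["span", "id", "name", "/span"]))
db.records.append(
    ExtractionRecord("55 kg", ["Your weight is", "span", "id", "bw", "/span", "."], selected=True)
)
print([r.variable for r in db.targets()])
```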
- the information recording necessary for tracking the specific information (extraction element) selected as the extraction target is completed.
- the extraction element is tracked using the extraction information recorded in the database 141. This makes it possible to track the extracted element even if the design or configuration changes due to changes in the specifications of the Web page.
- FIG. 5 shows a flowchart of the tracking of the specific information (extraction element) by the information extraction apparatus 100.
- FIG. 6 shows an example of an HTML document before and after the specification change of the Web page.
- FIG. 7 shows the similarity between recorded and re-extracted variable elements.
- The information extraction device 100 tracks the specific information (extraction element) at a predetermined cycle (for example, once a month) or when designated by the user.
- The extraction unit 121 of the information extraction apparatus 100 uses the URLs stored in the memory 140 to acquire the configuration information (HTML documents) again and extracts the variable elements of the current Web page, in the same manner as steps S202 and S203 in FIG. 2 (step S502). For example, as shown in FIG. 6, assume that the specification of the Web page has changed, the month has changed, and the weight has increased by 1 kg.
- In this case, "Sakamoto", "56 kg", "55 kg", and "171 cm" are extracted as variable elements of the target Web page.
- The extraction unit 121 extracts the peripheral information of the variable elements again by the same method as in step S204 of FIG. 2 (step S503).
- Specifically, a token string composed of HTML tag names, attribute names, attribute values, and text is extracted from the 100 characters around the variable element (for example, div, "weight:", span, id, "bw", /span, /div).
- The tracking unit 123 calculates the similarity between variable elements using the re-extracted variable elements and the variable elements already recorded in the database 141 (step S504). The tracking unit 123 also calculates the similarity of the peripheral information using the re-extracted peripheral information and the peripheral information already recorded in the database 141 (step S505). By comprehensively judging the similarity of the variable elements and the similarity of the peripheral information calculated in this way, the combination with the highest similarity is judged to be the same variable element, and the two are associated with each other to restore the correspondence of the variable elements. The extraction element is thereby identified (step S506). That is, the specific information to be extracted is tracked.
- any calculation method can be used as a calculation method of the similarity between the variable element and the peripheral information (the surrounding structured character string).
- the Levenshtein distance can be used in calculating the similarity between the variable element and the surrounding information.
- the similarity is calculated using a real number normalized from 0 to 1.0.
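- As one concrete option, a Levenshtein-based similarity normalized to the range 0 to 1.0 can be sketched as follows. Dividing by the length of the longer string is an assumption; the source does not state how the distance is normalized.

```python
def levenshtein(a, b):
    """Edit distance between sequences a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def levenshtein_similarity(a, b):
    """Map the distance to [0, 1]; 1.0 means identical (assumed normalization)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


print(levenshtein_similarity("55 kg", "56 kg"))
```

- Because the function works on any sequence, it can be applied both to the characters of a variable element and to a list of peripheral tokens.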
- the coefficient A and the coefficient B are parameters, and the accuracy of similarity calculation can be adjusted according to the application destination by changing the value.
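- The formula that uses coefficients A and B is not reproduced in this text. The sketch below assumes, purely for illustration, a weighted sum of the variable-element similarity and the peripheral-information similarity; the function name, the default weights, and the linear form are all assumptions rather than the patent's definition.

```python
def combined_similarity(variable_sim, peripheral_sim, a=0.5, b=0.5):
    """Weighted sum of two similarities, each expected in [0, 1].

    Assumed form: a * variable_sim + b * peripheral_sim.
    With a + b == 1 the result also stays in [0, 1].
    """
    for s in (variable_sim, peripheral_sim):
        if not 0.0 <= s <= 1.0:
            raise ValueError("similarities must be normalized to [0, 1]")
    return a * variable_sim + b * peripheral_sim


# Weighting the peripheral information more heavily (hypothetical values).
print(combined_similarity(0.8, 0.5, a=0.2, b=0.8))
```

- Raising B relative to A would favor stable surrounding markup over the value itself, which may suit pages whose values change often.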
- The similarities of the numeric part and the character part of the variable element are calculated as follows.
- For the similarity (S3) of the numerical part of the variable element, the re-extracted variable elements are first ranked by the absolute value of the difference between their numerical part and the numerical part of the extracted element. If there is no numeric part, the absolute value of the difference is set to infinity. The similarity of the numerical part is then obtained as "similarity = (number of types of absolute value of difference - rank) × 1 / (number of types of absolute value of difference - 1)". For example, this calculation gives the similarity (S3) of the numerical part of each re-extracted variable element with respect to the numerical part "55" of the extracted element "55 kg" in the upper part of FIG. 7.
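- The rank-based formula above can be sketched as follows. The sketch assumes that ranks start at 1 for the smallest difference and that equal differences share a rank; the handling of the degenerate case with only one distinct difference is also an assumption, since the formula would divide by zero there.

```python
import math
import re


def numeric_part(text):
    """Return the first number in text as a float, or None if there is none."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    return float(match.group()) if match else None


def numeric_similarities(recorded, candidates):
    """S3 for each candidate: (n_types - rank) / (n_types - 1)."""
    target = numeric_part(recorded)
    diffs = []
    for candidate in candidates:
        value = numeric_part(candidate)
        if target is None or value is None:
            diffs.append(math.inf)  # no numeric part: difference is infinity
        else:
            diffs.append(abs(target - value))
    distinct = sorted(set(diffs))  # the "types" of absolute difference
    n = len(distinct)
    if n == 1:
        # Assumed fallback: all candidates tie.
        return [0.0 if d == math.inf else 1.0 for d in diffs]
    return [(n - (distinct.index(d) + 1)) / (n - 1) for d in diffs]


candidates = ["Sakamoto", "56 kg", "55 kg", "171 cm"]
print(numeric_similarities("55 kg", candidates))
```

- For the recorded "55 kg", the differences are infinity, 1, 0, and 116, so there are four types of difference and the ranks are 4, 2, 1, and 3 respectively.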
- For the similarity (S4) of the character part (character string) of the variable element, the length of the longest common subsequence (LCS) between the character strings of the variable elements is used.
- The similarity (S4) of the character part of each re-extracted variable element with respect to the character part "kg" of the extracted element "55 kg" is calculated in this way.
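- A minimal sketch of an LCS-based character-part similarity follows. The source states only that the LCS length is used; dividing by the length of the longer string, treating digits and whitespace as the non-character part, and the function names are assumptions.

```python
import re


def lcs_length(a, b):
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def character_part(text):
    """Strip digits and whitespace, leaving e.g. 'kg' from '55 kg'."""
    return re.sub(r"[\d\s]", "", text)


def character_similarity(recorded, candidate):
    """S4 sketch: LCS length normalized by the longer character part (assumed)."""
    a, b = character_part(recorded), character_part(candidate)
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # assumed: two empty character parts count as identical
    return lcs_length(a, b) / longest


for candidate in ["Sakamoto", "56 kg", "55 kg", "171 cm"]:
    print(candidate, character_similarity("55 kg", candidate))
```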
- the similarity of the entire variable element is obtained from the similarity between the numeric part and the character part of the variable element.
- For the similarity of the peripheral information, each token is subjected to morphological analysis and converted to a word string. Tokens such as "id", "name", "/span", and "/div" are unchanged. For example, the tokens "id", "bw", "/span", ". Last month 54 kg!" are converted to "id", "bw", "/span", ".", "Last month", "ha", "54 kg", "!".
- The word strings obtained in this way are as follows, for example, when two tokens before and after the variable element are extracted.
- The word string of the peripheral information of "55 kg" before the specification change is "id", "bw", "/span", ".", "Last month", "ha", "54kg", "!".
- The word strings after the specification change are: peripheral information (1) of "Sakamoto": "id", "name", "/span", "/div"; peripheral information (2) of "56 kg": "id", "bw", "/span", "/div"; peripheral information (3) of "55 kg": "id", "lbw", "/span", "/div"; peripheral information (4) of "171 cm": "id", "height", "/span", "/div".
- The number of words before the specification change is 8, and the number of words after the specification change is 4.
- The number of common words before and after the specification change is counted on both sides, so duplicates are included. For example, for the peripheral information of "55 kg" before the specification change and the peripheral information (1) of "Sakamoto" after the specification change, "id" and "/span" appear on both sides, so the count is "id" × 2 plus "/span" × 2, that is, 4.
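- The counting rule above can be checked with the short sketch below, using the word strings listed for "55 kg" before the change and for the four re-extracted elements. Dividing the common-word count by the total number of words on both sides is an assumption about how the count is turned into a similarity; the source gives the counts but the formula itself is not reproduced in this text.

```python
from collections import Counter


def common_word_count(before, after):
    """Count words shared by both lists, counting occurrences on both sides."""
    cb, ca = Counter(before), Counter(after)
    return sum(cb[w] + ca[w] for w in cb.keys() & ca.keys())


def peripheral_similarity(before, after):
    """Assumed normalization: common count / (words before + words after)."""
    total = len(before) + len(after)
    return common_word_count(before, after) / total if total else 0.0


before = ["id", "bw", "/span", ".", "Last month", "ha", "54kg", "!"]
after = {
    "Sakamoto": ["id", "name", "/span", "/div"],
    "56 kg": ["id", "bw", "/span", "/div"],
    "55 kg": ["id", "lbw", "/span", "/div"],
    "171 cm": ["id", "height", "/span", "/div"],
}

for name, words in after.items():
    print(name, common_word_count(before, words), peripheral_similarity(before, words))
```

- Under this assumed normalization, the re-extracted "56 kg" scores highest on peripheral information because it keeps the "bw" attribute value.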
- The final similarity for the extraction element "55 kg" before the specification change (in this example, the current weight) is obtained by applying the above calculations to each re-extracted variable element.
- Among the re-extracted variable elements, "56 kg" has the highest similarity, 0.4, with respect to the extracted element "55 kg". Therefore, "56 kg" after re-extraction is judged to correspond to "55 kg" recorded as an extraction target. That is, "56 kg" after re-extraction is identified as the extraction element.
- The recorded "54 kg" also has its highest similarity, 0.3, with respect to "56 kg" after re-extraction.
- However, the pair of "55 kg (recorded)" and "56 kg (after re-extraction)" has a similarity of 0.4, while the pair of "54 kg (recorded)" and "56 kg (after re-extraction)" has a similarity of 0.3. The pair with the higher similarity, "55 kg (recorded)" and "56 kg (after re-extraction)", is therefore judged to correspond, and the correspondence is restored.
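- One way to resolve such competing pairs is a greedy assignment in descending order of similarity, sketched below. The source states only that the higher-similarity pair wins; the greedy procedure, the fallback of the losing element to its next-best candidate, and the two lower scores in the example are assumptions for illustration.

```python
def restore_correspondence(similarity):
    """Greedily pair recorded and re-extracted elements by descending score.

    similarity: dict mapping (recorded, re_extracted) -> score in [0, 1].
    Each recorded element and each re-extracted element is used at most once.
    """
    pairs = sorted(similarity.items(), key=lambda item: item[1], reverse=True)
    matched, used = {}, set()
    for (recorded, re_extracted), _score in pairs:
        if recorded in matched or re_extracted in used:
            continue
        matched[recorded] = re_extracted
        used.add(re_extracted)
    return matched


scores = {
    ("55 kg", "56 kg"): 0.4,   # from the example above
    ("54 kg", "56 kg"): 0.3,   # from the example above
    ("55 kg", "55 kg"): 0.25,  # hypothetical value
    ("54 kg", "55 kg"): 0.2,   # hypothetical value
}
print(restore_correspondence(scores))
```

- Here the recorded "55 kg" claims "56 kg" first, so the recorded "54 kg" is paired with the remaining "55 kg".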
- “Sakamoto” and “171 cm” have no change in the text of the variable element before and after the specification change.
- FIG. 7 shows the correspondence between all variable elements (including variable elements other than the extracted elements) in the target page and the variable elements after re-extraction in order to explain the similarity calculation.
- In practice, the similarity may be calculated only for the variable element selected as the extraction target (for example, only "(recorded) 55 kg" at the top of FIG. 7).
- the information extraction apparatus 100 newly acquires configuration information of a target Web page based on stored extraction information (variable elements, peripheral information, and whether or not to select as an extraction target). From this, the specific information to be extracted is extracted.
- The specifications of a Web page, such as its design and structure, are changed frequently. For example, the specifications may be changed as shown in FIG. 6.
- Because the specific information is extracted using the variable element and its peripheral information, the specific information specified by the user can be extracted (tracked) automatically even if the configuration information of the Web page is changed.
- The specific information specified by the user may itself change. For example, as shown in FIG. 6, the numerical value of the specific information (the current month's weight) may be updated.
- Because the specific information is extracted using the stored extraction information, the specific information specified by the user can be extracted (tracked) automatically even if the specific information itself has changed.
- specific information can be automatically extracted (tracked), so that the information extraction apparatus 100 can be used for various services.
- For example, the information extraction apparatus 100 may be used in a goal achievement support system that supports the user in achieving a goal set by the user and rewards or fines the user according to the result.
- Because personal information can be collected automatically, the apparatus is useful for services that use the extracted personal information.
- When the design or configuration of a Web page changes, the extraction mechanism of a conventional information extraction device stops functioning.
- With the information extraction apparatus 100 of the present invention, specific information can continue to be extracted mechanically and regularly from a Web page even when the design or configuration of the Web page changes. It is therefore possible to realize a mechanism that collects personal information from a plurality of life log services and manages the collected information together with the previously collected history. As a result, the costs of aggregating and managing information can be reduced.
- When the aggregated information consists of numerical values such as the number of pages read or English study time, graphs can be generated and the data visualized.
- A mechanism for providing feedback for motivation can also be constructed.
- the present invention is useful when collecting personal information regularly.
- the present invention is useful for a web application having a plurality of web pages.
- the present invention functions effectively in the software industry, mainly in industries that use software that analyzes information sources on the Web.
- In the present embodiment, the similarity (S2) of the peripheral information is calculated by creating a token string that excludes variable elements, but it may instead be calculated by creating a token string that includes variable elements.
- In that case, the number of words before the specification change and the number of words after the specification change may be counted including variable elements (for example, when the two tokens before and after the variable part are extracted as the surrounding character string, the peripheral information (1) of Sakamoto after the specification change, “id”, “name”, “Sakamoto”, “/span”, “/div”, contains five words).
- The information extraction apparatus 100 can be applied not only to a Web page but also to any structured document.
- The variable element extraction method is not limited to difference calculation and may be performed by any method. Likewise, the method of calculating the similarity is not limited to the example of the present embodiment, and any method may be used.
- In the present embodiment, the extraction unit 121 acquires the HTML document corresponding to the URL input to the input unit 110 via the communication unit 150.
- However, the HTML document acquisition method is not limited to this.
- For example, the communication unit 150 may receive the HTML document directly from the user, without a URL being input.
- The HTML document received in this way may be stored in the memory 140.
- In the present embodiment, the information extraction apparatus 100 is realized by one computer, but its functions may be realized by a plurality of devices.
- For example, the input unit 110 and the display unit 130 may be provided in a separate mobile terminal.
- The extraction unit 121, the storage unit 122, and the tracking unit 123 may be implemented as separate parts.
- The information extraction apparatus of the present embodiment extracts only information associated with the target person as variable elements that are candidates for extraction.
- Specifically, the apparatus excludes from the variable elements any portion (the current time in the present embodiment) that changes within a short period (for example, every minute) in the target person's document (a Web page in the present embodiment).
- Excluding from the variable elements, as exclusion elements, the elements that should not be extracted (in the present embodiment, information not associated with the target person, such as the current time) speeds up the extraction of peripheral information and the similarity calculation (for example, step S204 in FIG. 2 and steps S503 to S506 in FIG. 5) and allows only the necessary information to be presented to the user as variable elements (step S205 in FIG. 2). It also improves the accuracy of restoring the correspondence relationship based on the similarity (step S506 in FIG. 5).
- The information extraction apparatus of the present embodiment has the same configuration as that of Embodiment 1 shown in FIG. 1.
- FIG. 8 shows the HTML documents, before and after one minute has elapsed, that correspond to the URL of the Web page to be extracted (the target person's Web page).
- In this example, the current time changes from “11:59” to “12:00”.
- In Embodiment 1, if the current time differs when a plurality of Web pages are compared, the current time is extracted as a variable element.
- However, as shown in FIG. 8, the current time is an element that changes even when the target person is the same.
- In the present embodiment, elements that change even when the target person is the same are excluded from the variable elements.
- FIG. 9 shows a flowchart of exclusion candidate extraction and exclusion in Embodiment 2 of the present invention.
- The processing in FIG. 9 may be performed before the extraction of variable elements (immediately before step S203 in FIG. 2) or after it (immediately after step S203 in FIG. 2).
- The extraction and exclusion of exclusion candidates shown in FIG. 9 may be performed at any time, but is preferably performed before the peripheral information of the variable elements is extracted (step S204 in FIG. 2).
- The following describes the case in which steps S901 to S908 shown in FIG. 9 are performed after the variable elements are extracted and before the peripheral information is extracted (between steps S203 and S204 in FIG. 2).
- The extraction unit 121 of the information extraction apparatus 100 of the present embodiment sets a counter value representing the “frequency of change” to “0” and starts the processing illustrated in FIG. 9.
- The extraction unit 121 determines whether a predetermined time (for example, 1 minute) has elapsed since the page configuration information (the HTML document of the Web page) of the target person was acquired in step S202 (step S901). If the predetermined time has elapsed (Yes in step S901), the extraction unit 121 again acquires the page configuration information corresponding to the URL of the target person via the communication unit 150 (step S902). The extraction unit 121 then compares the page configuration information acquired this time with the page configuration information acquired the previous time (step S903).
- The extraction unit 121 determines whether there is a changed portion (step S904) and, if there is, extracts the changed portion as an exclusion candidate (step S905). In this way, for example, the current time “11:59” and/or “12:00” shown in FIG. 8 is extracted. In step S905, the extraction unit 121 also increments the counter value representing the “frequency of change” by 1.
- The extraction unit 121 determines whether the comparison of the target person's page configuration information (step S903) has been performed a predetermined number of times (for example, 10 times) (step S906). If it has not (No in step S906), the process returns to step S901 and the comparison of the target person's page configuration information is repeated. When the predetermined number of comparisons is complete (Yes in step S906), the extraction unit 121 determines whether the counter value representing the frequency of change of the elements extracted as exclusion candidates is equal to or greater than a predetermined number (for example, 9) (step S907).
- If the counter value representing the frequency of change is equal to or greater than the predetermined number (Yes in step S907), the extraction unit 121 determines that the exclusion candidate is an exclusion element to be excluded from the variable elements and excludes it from the variable elements (step S908). If the counter value is not equal to or greater than the predetermined number (No in step S907), the exclusion candidate is not excluded from the variable elements.
- In this way, the presence or absence of a change in the target person's page configuration information is detected every minute, and a portion that has changed 9 or more times out of 10 (the current time) is judged to be a value that does not depend on the target person (a value that depends on time) and is excluded from the variable elements.
- By comparing the target person's page configuration information acquired a plurality of times and excluding the changed portion (the current time in the present embodiment) from the variable elements, only the information associated with the target person (in the present embodiment, 55 kg, 54 kg, 171 cm, and Sakamoto) can be extracted as variable elements.
- The more candidates there are, the more likely the correspondence relationship is to be restored erroneously. For example, if “Weight”, “Height”, and “Temperature” are variable elements, the “Weight” value of the initially acquired page may be erroneously judged to correspond to the “Temperature” value of the newly acquired current page, in which case the current weight information can no longer be tracked. When the similarity cannot be calculated well (for example, when there are few words around the variable element), a large number of variable element types may cause the restoration of the correspondence relationship to fail. Excluding unnecessary exclusion elements from the variable elements in advance therefore increases the accuracy of restoring the correspondence relationship.
- In step S903, the page configuration information acquired this time is compared with that acquired the previous time (for example, the HTML documents acquired at 12:00 and 11:59 are compared, and then those acquired at 12:01 and 12:00). Alternatively, the page configuration information acquired first (for example, the HTML document acquired at 11:59) may be compared with each newly acquired piece of page configuration information (for example, the HTML documents acquired at 12:00, 12:01, 12:02, 12:03, and so on).
- In the present embodiment, the context that is changed in order to extract exclusion elements is the Web page acquisition time, but this context can be set arbitrarily.
- The extraction unit 121 may set it, or the user may set it through the input unit 110.
- Information that changes only when the context changes can be extracted as a variable element.
- For example, the weather, the access source region, or the like may be set as the context that is changed in order to extract exclusion elements.
- In the present embodiment, the predetermined time in step S901 is 1 minute, the predetermined number of times in step S906 is 10, and the predetermined number in step S907 is 9. That is, an exclusion candidate is excluded from the variable elements when it has changed 9 or more times out of 10.
- However, the predetermined time (determination criterion) in step S901, the predetermined number of times in step S906, and the predetermined number in step S907 can be set arbitrarily.
- The extraction unit 121 may set them, or the user may set them through the input unit 110.
- The predetermined time (determination criterion) in step S901, the predetermined number of times in step S906, and the predetermined number in step S907 may be set as appropriate.
- For example, because an individual's weight, height, and name are unlikely to change every minute, the presence or absence of a change in the target person's page configuration information may be detected every minute, and a portion that has changed 3 times out of 3 may be treated as an exclusion element (the current time). Also, for example, when the context changed in order to extract an exclusion element (an advertising banner) is the “access source region”, the presence or absence of a change in the target person's page configuration information may be detected each time the access source region changes, and a portion that has changed 5 times out of 5 may be treated as an exclusion element. To prevent erroneous determination, it is preferable to perform the comparison a plurality of times; the more comparisons are made, the better erroneous determination can be prevented.
- As another extraction method, an account operated by a machine is prepared, and the machine account and the user whose information is to be extracted are placed in a friend state in which information can be shared. The page is then saved once before the machine account writes to it and saved again immediately after the machine account writes to it. Exclusion elements (variable elements that are unnecessary as extraction targets) are removed by calculating the difference between the pages before and after the machine account's write.
- The access frequency and the number of changes may be set according to the information to be extracted or the information to be excluded.
- For example, the frequency and the number of times are set so that the desired information (variable elements) does not change while the unnecessary information (exclusion elements) does change. This makes it possible to extract and exclude only unnecessary information as exclusion elements with higher accuracy.
- Variable elements may be extracted only from the contents of the BODY tag of the HTML document.
- Variable elements may be extracted only from the menu bar at the top of the Web page.
- The locations from which variable elements are extracted may be narrowed down in this way. Narrowing down the extraction locations prevents unnecessary information from being extracted as variable elements.
- The variable element extraction range may be limited in combination with the exclusion element extraction (FIG. 9) of Embodiment 2.
- Because the information extraction apparatus of the present invention can continue to extract specific information regardless of whether the specification of a structured document has changed, it is useful for services that use specific information extracted on a regular basis.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Business, Economics & Management (AREA)
- General Business, Economics & Management (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Document Processing Apparatus (AREA)
Abstract
Description
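The information extraction apparatus of the present embodiment extracts, as variable elements, the portions that differ among a plurality of structured documents (Web pages in the present embodiment), and extracts, as peripheral information, the elements located within a predetermined range of each variable element. At least one of the variable elements is set as the extraction target (specific information), and the variable element and peripheral information are stored at least for the extraction target. When the apparatus extracts the variable elements and their peripheral information again, it calculates the similarity between the stored variable elements and peripheral information and the re-extracted variable elements and peripheral information and, based on the result, identifies the variable element corresponding to the extraction target from among the re-extracted variable elements. As a result, even when the specification of a structured document changes, the specific information extracted before the change can be extracted easily and reliably after the change; that is, the specific information can be tracked across the specification change. According to the present embodiment, tracking the position of the extraction location before and after a specification change makes it possible to extract the specific information mechanically and on a regular basis. The following description takes as an example the case in which the structured document is a Web page.
FIG. 1 shows the configuration of the information extraction apparatus according to an embodiment of the present invention. The information extraction apparatus 100 can be realized by a personal computer or the like. It includes an input unit 110 that accepts input from the user, a control unit 120 that controls the entire information extraction apparatus 100, a display unit 130, a memory 140, and a communication unit 150.
FIG. 2 shows a flowchart of the extraction of variable elements and peripheral information by the information extraction apparatus 100. FIG. 3(a) shows an example of a URL, FIG. 3(b) an example of HTML documents, and FIG. 3(c) an example of the screen display of the extracted variable elements. The left side of FIG. 3(b) shows the Web page to be extracted in the present embodiment, and the right side shows a Web page whose context (account, date and time, etc.) differs from that of the Web page to be extracted. In the example of FIG. 3(b), the HTML document contains four kinds of information for each user: name, current weight, weight one month ago, and height. FIG. 4 shows an example of the DB 141 of extraction information stored in the memory 140.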
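Similarity = (similarity of variable elements (S1) × coefficient A) + (similarity of peripheral information (S2) × coefficient B)
(where coefficients A and B are real numbers greater than or equal to 0, and coefficient A + coefficient B = 1.0)
Coefficients A and B are parameters; changing their values adjusts the accuracy of the similarity calculation to suit the application.
Similarity of variable elements (S1) = (similarity of numeric parts (S3) × coefficient C) + (similarity of character parts (S4) × coefficient D)
(where coefficients C and D are real numbers greater than or equal to 0, and coefficient C + coefficient D = 1.0)
Accordingly, to obtain the similarity of variable elements, the text of each variable element is first split into a numeric part and a character part. For example, “55kg” → “55” and “kg”, “56kg” → “56” and “kg”, and “171cm” → “171” and “cm”.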
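The split into a numeric part and a character part, and the weighted sum for S1, can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the section does not say how S3 and S4 themselves are computed, so the relative-difference score for numbers and the exact-match score for characters are illustrative choices, as are the function names and the default coefficients C = D = 0.5.

```python
import re


def split_parts(text):
    """Split a variable element into (numeric part, character part)."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    if match is None:
        return "", text
    return match.group(), text[:match.start()] + text[match.end():]


def numeric_similarity(a, b):
    """S3 (assumed): 1.0 for equal numbers, falling toward 0.0 as they diverge."""
    if a == "" or b == "":
        return 1.0 if a == b else 0.0
    x, y = float(a), float(b)
    largest = max(x, y)
    return 1.0 if largest == 0 else 1.0 - abs(x - y) / largest


def character_similarity(a, b):
    """S4 (assumed): exact match only."""
    return 1.0 if a == b else 0.0


def variable_similarity(old, new, c=0.5, d=0.5):
    """S1 = S3 * C + S4 * D, with C, D >= 0 and C + D = 1.0."""
    old_num, old_chars = split_parts(old)
    new_num, new_chars = split_parts(new)
    return (numeric_similarity(old_num, new_num) * c
            + character_similarity(old_chars, new_chars) * d)
```

With these assumed scores, `variable_similarity("55kg", "56kg")` is higher than `variable_similarity("55kg", "171cm")`, because both the number and the unit are closer.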
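The word sequence of the peripheral information of 55kg before the specification change is “id”, “bw”, “/span”, “。”, “先月”, “は”, “54kg”, “!”.
The word sequences after the specification change are:
peripheral information (1) of 坂本 (Sakamoto): “id”, “name”, “/span”, “/div”;
peripheral information (2) of 56kg: “id”, “bw”, “/span”, “/div”;
peripheral information (3) of 55kg: “id”, “lbw”, “/span”, “/div”;
peripheral information (4) of 171cm: “id”, “height”, “/span”, “/div”.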
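To make the comparison of these word sequences concrete, the sketch below scores each post-change sequence against the stored pre-change sequence. The scoring rule is an assumption: this section does not state how S2 is derived from the word sequences, so a Jaccard overlap of the token sets stands in for it, and the name `token_overlap` is hypothetical.

```python
def token_overlap(before, after):
    """Jaccard overlap of two token sequences (order and duplicates ignored)."""
    a, b = set(before), set(after)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Peripheral information of 55kg stored before the specification change.
stored = ["id", "bw", "/span", "。", "先月", "は", "54kg", "!"]

# Peripheral information of each variable element after the change.
candidates = {
    "坂本": ["id", "name", "/span", "/div"],
    "56kg": ["id", "bw", "/span", "/div"],
    "55kg": ["id", "lbw", "/span", "/div"],
    "171cm": ["id", "height", "/span", "/div"],
}

scores = {text: token_overlap(stored, tokens) for text, tokens in candidates.items()}
best = max(scores, key=scores.get)
```

Under this assumed measure, sequence (2) shares “id”, “bw”, and “/span” with the stored sequence, so 56kg scores highest on peripheral information alone.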
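In this way, the information extraction apparatus 100 extracts the specific information to be extracted from the newly acquired configuration information of the target Web page, based on the stored extraction information (the variable elements, the peripheral information, and whether each element has been selected as an extraction target). The specifications of Web pages, such as their design and structure, generally change frequently; for example, the specification may change as shown in FIG. 6. According to the present invention, however, the specific information is extracted using the variable elements and their peripheral information, so the specific information specified by the user can be automatically extracted (tracked) even if the configuration information of the Web page changes. The specific information itself specified by the user may also have changed. For example, as shown in FIG. 6, the numerical value of the specific information (the value of the current month's weight) may have been updated. According to the present invention, however, the specific information is extracted using the stored extraction information, so the specific information specified by the user can be automatically extracted (tracked) even if the specific information itself has changed.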
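The tracking step can be sketched as a weighted sum followed by picking the candidate with the highest similarity, as claim 2 describes. In this minimal sketch the S1 and S2 scores are passed in as given numbers; the values in `candidates` are hypothetical and chosen only for illustration, and the function names are not from the source.

```python
def total_similarity(s1, s2, a=0.5, b=0.5):
    """Similarity = S1 * A + S2 * B, with A, B >= 0 and A + B = 1.0."""
    if a < 0 or b < 0 or abs(a + b - 1.0) > 1e-9:
        raise ValueError("coefficients must be non-negative and sum to 1.0")
    return s1 * a + s2 * b


def track(candidates, a=0.5, b=0.5):
    """Return the re-extracted variable element with the highest similarity."""
    return max(candidates, key=lambda text: total_similarity(*candidates[text], a, b))


# Hypothetical (S1, S2) scores of each re-extracted variable element
# against the stored extraction target "55kg".
candidates = {
    "坂本": (0.0, 0.2),
    "56kg": (0.99, 0.33),
    "55kg": (1.0, 0.2),
    "171cm": (0.16, 0.2),
}
```

With equal weights, `track(candidates)` selects 56kg, the updated current weight. With `a=1.0, b=0.0` the peripheral information is ignored and 55kg (last month's weight) would be selected instead, which illustrates why the coefficients are tuned to the application.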
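In the present embodiment, the similarity (S2) of the peripheral information is calculated by creating a token string that excludes variable elements, but it may instead be calculated by creating a token string that includes variable elements (for example, generating the token string “div”, “名前:”, “span”, “id”, “name”, “坂本”, “/span”, “/div” from `<div>名前:<span id="name">坂本</span></div>`). In that case, the number of words before the specification change and the number of words after the specification change may be counted including variable elements (for example, when the two tokens before and after the variable part are extracted as the surrounding character string, the peripheral information (1) of 坂本 (Sakamoto) after the specification change, “id”, “name”, “坂本”, “/span”, “/div”, contains five words).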
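The token string in this example can be reproduced with a small regular-expression tokenizer. This is a sketch under assumptions, not the tokenizer of the embodiment: text between tags is kept as a single token (no word segmentation is attempted), and the names `tokenize` and `surrounding` are hypothetical.

```python
import re


def tokenize(html):
    """Turn an HTML fragment into tokens: tag names, attribute names and values, text."""
    tokens = []
    for tag, text in re.findall(r"<([^>]+)>|([^<]+)", html):
        if tag:
            # Split the tag body on whitespace, '=', and quote characters.
            tokens.extend(re.findall(r"[^\s=\"'“”]+", tag))
        elif text.strip():
            tokens.append(text.strip())
    return tokens


def surrounding(tokens, index, n=2, include_variable=False):
    """Return the n tokens before and after tokens[index], optionally with the element itself."""
    before = tokens[max(0, index - n):index]
    after = tokens[index + 1:index + 1 + n]
    middle = [tokens[index]] if include_variable else []
    return before + middle + after


tokens = tokenize('<div>名前:<span id="name">坂本</span></div>')
position = tokens.index("坂本")
without_variable = surrounding(tokens, position)
with_variable = surrounding(tokens, position, include_variable=True)
```

Here `without_variable` holds four tokens and `with_variable` holds five, matching the word counts given for peripheral information (1).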
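The information extraction apparatus of the present embodiment makes it possible to extract only information associated with the target person as variable elements that are candidates for extraction. Specifically, the apparatus excludes from the variable elements any portion (the current time in the present embodiment) that changes within a short period (for example, every minute) in the target person's document (a Web page in the present embodiment). Excluding from the variable elements, as exclusion elements, the elements that should not be extracted as variable elements (in the present embodiment, information not associated with the target person, such as the current time) speeds up the extraction of peripheral information and the similarity calculation (for example, step S204 in FIG. 2 and steps S503 to S506 in FIG. 5) and allows only the necessary information to be presented to the user as variable elements (step S205 in FIG. 2). It also improves the accuracy of restoring the correspondence relationship based on the similarity (step S506 in FIG. 5).
The information extraction apparatus of the present embodiment has the same configuration as that of Embodiment 1 shown in FIG. 1.
FIG. 8 shows the HTML documents, before and after one minute has elapsed, that correspond to the URL of the Web page to be extracted (the target person's Web page). In this example, the current time changes from “11:59” to “12:00”. In Embodiment 1, if the current time differs when a plurality of Web pages are compared, that current time is extracted as a variable element. However, as shown in FIG. 8, the current time is an element that changes even when the target person is the same. In the present embodiment, elements that change even when the target person is the same are excluded from the variable elements.
According to the present embodiment, by comparing the target person's page configuration information acquired a plurality of times and excluding the changed portion (the current time in the present embodiment) from the variable elements, only the information associated with the target person (in the present embodiment, 55kg, 54kg, 171cm, and 坂本 (Sakamoto)) can be extracted as variable elements.
In step S903, the page configuration information acquired this time is compared with the page configuration information acquired the previous time (for example, the HTML documents acquired at 12:00 and 11:59 are compared, and then those acquired at 12:01 and 12:00). Alternatively, the page configuration information acquired first (for example, the HTML document acquired at 11:59) may be compared with each newly acquired piece of page configuration information (for example, the HTML documents acquired at 12:00, 12:01, 12:02, 12:03, and so on).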
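The exclusion logic of Embodiment 2 can be sketched as counting, per element, how often its value changes across repeated acquisitions of the same page. This is a simplified sketch under assumptions: each acquisition is represented as a dictionary keyed by a hypothetical element identifier rather than a parsed HTML document, page fetching and waiting are omitted, and the change count is kept per element. The `compare_to_first` flag switches between the two comparison strategies described above.

```python
from collections import Counter


def find_exclusion_elements(snapshots, min_changes, compare_to_first=False):
    """Return the keys whose value changed in at least min_changes comparisons."""
    change_counts = Counter()
    for i in range(1, len(snapshots)):
        # Compare against the previous snapshot, or always against the first one.
        base = snapshots[0] if compare_to_first else snapshots[i - 1]
        current = snapshots[i]
        for key in base.keys() | current.keys():
            if base.get(key) != current.get(key):
                change_counts[key] += 1
    return {key for key, count in change_counts.items() if count >= min_changes}


# One initial acquisition plus 10 re-acquisitions, one minute apart.
snapshots = [
    {"time": f"12:{minute:02d}", "name": "坂本", "weight": "55kg", "height": "171cm"}
    for minute in range(11)
]

excluded = find_exclusion_elements(snapshots, min_changes=9)
variable_elements = {"time", "name", "weight", "height"} - excluded
```

With the thresholds of the embodiment (10 comparisons, 9 or more changes), only the current time is excluded, and the name, weight, and height remain as variable elements.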
110 Input unit
120 Control unit
121 Extraction unit
122 Storage unit
123 Tracking unit
130 Display unit
140 Memory
141 Database (DB)
150 Communication unit
Claims (15)
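- An information extraction apparatus comprising:
a control unit that acquires a plurality of structured documents, extracts portions that differ among the acquired documents as variable elements, and extracts elements within a predetermined range of each variable element as peripheral information; and
a storage unit that, with at least one of the variable elements set as an extraction target, stores the variable element and the peripheral information at least for the extraction target,
wherein the control unit acquires the plurality of structured documents again, re-extracts portions that differ among the re-acquired documents as variable elements, re-extracts elements within a predetermined range of each re-extracted variable element as peripheral information, calculates the similarity of the variable elements and the peripheral information before and after re-extraction based on the re-extracted variable elements and peripheral information and on the variable elements and peripheral information stored in the storage unit, and, based on the calculated similarity, identifies the variable element corresponding to the extraction target from among the re-extracted variable elements.
- The information extraction apparatus according to claim 1, wherein the variable element having the highest similarity to the variable element of the extraction target is identified from among the re-extracted variable elements.
- The information extraction apparatus according to claim 1, wherein the similarity between the re-extracted variable elements and the variable elements stored in the storage unit and the similarity between the re-extracted peripheral information and the peripheral information stored in the storage unit are calculated, and the variable element corresponding to the extraction target is identified from among the re-extracted variable elements based on the similarity between the variable elements and the similarity between the pieces of peripheral information.
- The information extraction apparatus according to claim 1, wherein the numeric part and the character part contained in each of the re-extracted variable elements and the variable elements stored in the storage unit are separated into the numeric part and the character part, and the similarity of the variable elements is determined based on the similarity between the numeric parts and the similarity between the character parts.
- The information extraction apparatus according to claim 1, wherein the variable elements are extracted by calculating the difference between the plurality of structured documents.
- The information extraction apparatus according to claim 1, further comprising:
a display unit that displays the extracted variable elements; and
an input unit through which the extraction target selected by the user from among the displayed variable elements is input.
- The information extraction apparatus according to claim 1, wherein a target document is acquired a plurality of times, and a portion that differed a predetermined number of times among the documents acquired the plurality of times is excluded from the variable elements as an exclusion element.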
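- An information extraction method comprising the steps of:
acquiring a plurality of structured documents;
extracting portions that differ among the acquired documents as variable elements;
extracting elements within a predetermined range of each variable element as peripheral information;
setting at least one of the variable elements as an extraction target and storing, in a storage unit, the variable element and the peripheral information at least for the extraction target;
acquiring the plurality of structured documents again;
re-extracting portions that differ among the re-acquired documents as variable elements;
re-extracting elements within a predetermined range of each re-extracted variable element as peripheral information;
calculating the similarity of the variable elements and the peripheral information before and after re-extraction, based on the re-extracted variable elements and peripheral information and on the variable elements and peripheral information stored in the storage unit; and
identifying, based on the calculated similarity, the variable element corresponding to the extraction target from among the re-extracted variable elements.
- The information extraction method according to claim 8, wherein the variable element having the highest similarity to the variable element of the extraction target is identified from among the re-extracted variable elements.
- The information extraction method according to claim 8, wherein the similarity between the re-extracted variable elements and the variable elements stored in the storage unit and the similarity between the re-extracted peripheral information and the peripheral information stored in the storage unit are calculated, and the variable element corresponding to the extraction target is identified from among the re-extracted variable elements based on the similarity between the variable elements and the similarity between the pieces of peripheral information.
- The information extraction method according to claim 8, wherein the numeric part and the character part contained in each of the re-extracted variable elements and the variable elements stored in the storage unit are separated into the numeric part and the character part, and the similarity of the variable elements is determined based on the similarity between the numeric parts and the similarity between the character parts.
- The information extraction method according to claim 8, wherein the variable elements are extracted by calculating the difference between the plurality of structured documents.
- The information extraction method according to claim 8, further comprising the steps of:
displaying the extracted variable elements; and
inputting the extraction target selected by the user from among the displayed variable elements.
- The information extraction method according to claim 8, wherein a target document is acquired a plurality of times, and a portion that differed a predetermined number of times among the documents acquired the plurality of times is excluded from the variable elements as an exclusion element.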
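- An information extraction program for causing a computer to execute the steps of:
acquiring a plurality of structured documents;
extracting portions that differ among the acquired documents as variable elements;
extracting elements within a predetermined range of each variable element as peripheral information;
setting at least one of the variable elements as an extraction target and storing, in a storage unit, the variable element and the peripheral information at least for the extraction target;
acquiring the plurality of structured documents again;
re-extracting portions that differ among the re-acquired documents as variable elements;
re-extracting elements within a predetermined range of each re-extracted variable element as peripheral information;
calculating the similarity of the variable elements and the peripheral information before and after re-extraction, based on the re-extracted variable elements and peripheral information and on the variable elements and peripheral information stored in the storage unit; and
identifying, based on the calculated similarity, the variable element corresponding to the extraction target from among the re-extracted variable elements.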
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2016564846A JP6562276B2 (ja) | 2014-12-15 | 2015-12-14 | Information extraction apparatus, information extraction method, and information extraction program |
US15/536,097 US11144565B2 (en) | 2014-12-15 | 2015-12-14 | Information extraction apparatus, information extraction method, and information extraction program |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2014-253058 | 2014-12-15 | ||
JP2014253058 | 2014-12-15 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2016098739A1 (ja) | 2016-06-23 |
Family
ID=56126628
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2015/084974 WO2016098739A1 (ja) | Information extraction apparatus, information extraction method, and information extraction program | 2014-12-15 | 2015-12-14 |
Country Status (3)
Country | Link |
---|---|
US (1) | US11144565B2 (ja) |
JP (1) | JP6562276B2 (ja) |
WO (1) | WO2016098739A1 (ja) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180025389A1 (en) * | 2016-07-21 | 2018-01-25 | Facebook, Inc. | Determining an efficient bid amount for each impression opportunity for a content item to be presented to a viewing user of an online system |
US10956106B1 (en) * | 2019-10-30 | 2021-03-23 | Xerox Corporation | Methods and systems enabling a user to customize content for printing |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2004178426A (ja) * | 2002-11-28 | 2004-06-24 | Mitsubishi Electric Corp | Structured document processing apparatus |
JP2005063332A (ja) * | 2003-08-19 | 2005-03-10 | Fujitsu Ltd | Information system association apparatus and association method |
JP2007293874A (ja) * | 2007-05-18 | 2007-11-08 | Degital Works Kk | Method and apparatus for compressed storage of documents |
US20130226944A1 (en) * | 2012-02-24 | 2013-08-29 | Microsoft Corporation | Format independent data transformation |
EP2648115A1 (en) * | 2012-04-03 | 2013-10-09 | Seeburger AG | Method and/or system for the execution of transformations of hierarchically structured data and relational data |
US20140297670A1 (en) * | 2013-04-01 | 2014-10-02 | Oracle International Corporation | Enhanced flexibility for users to transform xml data to a desired format |
Family Cites Families (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6092091A (en) * | 1996-09-13 | 2000-07-18 | Kabushiki Kaisha Toshiba | Device and method for filtering information, device and method for monitoring updated document information and information storage medium used in same devices |
US6353840B2 (en) * | 1997-08-15 | 2002-03-05 | Ricoh Company, Ltd. | User-defined search template for extracting information from documents |
JP2001166981A (ja) * | 1999-12-06 | 2001-06-22 | Fuji Xerox Co Ltd | Hypertext analysis apparatus and method |
JP4226261B2 (ja) * | 2002-04-12 | 2009-02-18 | 三菱電機株式会社 | Structured document type determination system and structured document type determination method |
JP2004086845A (ja) * | 2002-06-27 | 2004-03-18 | Oki Electric Ind Co Ltd | Electronic document information expansion apparatus, method, and program, and recording medium storing an electronic document information expansion program |
US20040158799A1 (en) * | 2003-02-07 | 2004-08-12 | Breuel Thomas M. | Information extraction from html documents by structural matching |
JP2004348706A (ja) * | 2003-04-30 | 2004-12-09 | Canon Inc | Information processing apparatus, information processing method, storage medium, and program |
US7669119B1 (en) * | 2005-07-20 | 2010-02-23 | Alexa Internet | Correlation-based information extraction from markup language documents |
JP4314221B2 (ja) * | 2005-07-28 | 2009-08-12 | 株式会社東芝 | Structured document storage device, structured document search device, structured document system, method, and program |
US8351706B2 (en) * | 2007-07-24 | 2013-01-08 | Sharp Kabushiki Kaisha | Document extracting method and document extracting apparatus |
JP4429356B2 (ja) * | 2007-12-26 | 2010-03-10 | 富士通株式会社 | Attribute extraction processing method and apparatus |
JP2012059212A (ja) | 2010-09-13 | 2012-03-22 | Nippon Telegr & Teleph Corp <Ntt> | Extraction apparatus, extraction method, and extraction program |
JP5331084B2 (ja) | 2010-11-01 | 2013-10-30 | 日本電信電話株式会社 | Specific information extraction apparatus and specific information extraction program |
JP5669611B2 (ja) | 2011-02-16 | 2015-02-12 | 田中 成典 | Grouping device and element extraction device |
US20130311875A1 (en) * | 2012-04-23 | 2013-11-21 | Derek Edwin Pappas | Web browser embedded button for structured data extraction and sharing via a social network |
JP6056610B2 (ja) * | 2013-03-29 | 2017-01-11 | 株式会社Jvcケンウッド | Text information processing apparatus, text information processing method, and text information processing program |
-
2015
- 2015-12-14 US US15/536,097 patent/US11144565B2/en active Active
- 2015-12-14 WO PCT/JP2015/084974 patent/WO2016098739A1/ja active Application Filing
- 2015-12-14 JP JP2016564846A patent/JP6562276B2/ja active Active
Also Published As
Publication number | Publication date |
---|---|
US11144565B2 (en) | 2021-10-12 |
US20180018378A1 (en) | 2018-01-18 |
JP6562276B2 (ja) | 2019-08-21 |
JPWO2016098739A1 (ja) | 2017-11-02 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 15869944 Country of ref document: EP Kind code of ref document: A1 |
|
ENP | Entry into the national phase |
Ref document number: 2016564846 Country of ref document: JP Kind code of ref document: A |
|
WWE | Wipo information: entry into national phase |
Ref document number: 15536097 Country of ref document: US |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 15869944 Country of ref document: EP Kind code of ref document: A1 |