WO2023085500A1 - System and method for knowledge extraction based on graph reading - Google Patents

System and method for knowledge extraction based on graph reading

Info

Publication number
WO2023085500A1
Authority
WO
WIPO (PCT)
Prior art keywords
knowledge
answer
entity
graph
query
Prior art date
Application number
PCT/KR2021/018458
Other languages
English (en)
Korean (ko)
Inventor
이경일
김창완
Original Assignee
주식회사 솔트룩스
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 주식회사 솔트룩스 (Saltlux Co., Ltd.)
Publication of WO2023085500A1


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3334 Selection or weighting of terms from queries, including natural language queries
    • G06F16/334 Query execution
    • G06F16/3347 Query execution using vector based model
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G06F16/9024 Graphs; Linked lists
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/258 Heading extraction; Automatic titling; Numbering
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods

Definitions

  • the technical idea of the present invention relates to knowledge extraction, and more particularly to a system and method for extracting knowledge based on graph reading.
  • the present invention is derived from research conducted by Saltlux Co., Ltd. as part of the Innovation Growth Engine Project (Artificial Intelligence) of the Ministry of Science and ICT (research period: 2021.01.01 - 2021.12.31; research management institution: Institute of Information & Communications Technology Planning & Evaluation; research project name: [Exobrain 2 detailed task] WiseKB: development of self-learning knowledge base and reasoning technology based on big data understanding; task identification number: 1711126235; detailed task number: 2013-2-00109-009).
  • a knowledge base that stores knowledge in a form recognizable by a computer enables various services that utilize the stored knowledge.
  • a question answering system that provides an answer to a user's query may provide the answer by analyzing the user's query and then referring to knowledge stored in a knowledge base.
  • the quality and scope of services utilizing such a knowledge base may depend on the accuracy and quantity of knowledge included in the knowledge base, and accordingly, it may be important to secure accurate knowledge to reinforce the knowledge base.
  • the technical idea of the present invention provides a system and method for automatically extracting knowledge contained in tables based on graph reading.
  • a knowledge extraction system for extracting knowledge from a document including a table may include: an entity extraction unit that extracts an entity from an input document; a query generator that generates a query including the entity based on an attribute included in an attribute list of the entity; a graph generator that generates graph data from the table; a graph reading engine that extracts an answer to the query from the graph data; and a knowledge generation unit that generates a knowledge instance from the entity, the attribute, and the answer based on the format of a knowledge base.
  • the entity extraction unit may extract the entity from the title of the input document.
  • the query generator may include: a preprocessing unit that sequentially selects each of a plurality of attributes included in the attribute list and generates a first word vector corresponding to the entity and a second word vector corresponding to the selected attribute; a first deep learning network that is trained to generate sample queries according to sample entities and sample attributes, and that generates a third word vector corresponding to the query from the first word vector and the second word vector; and a post-processing unit that generates the query from the third word vector.
  • the graph reading engine may include: a natural language processing unit that generates first input data by natural-language-processing the graph data and second input data by natural-language-processing the query; a second deep learning network trained to output sample graph vectors according to samples of the first input data; a third deep learning network trained to output sample word vectors according to samples of the second input data; a fourth deep learning network trained to output samples of output data according to sample graph vectors and sample word vectors; and an answer generation unit that generates the answer based on the output data of the fourth deep learning network, wherein the output data may include at least one of whether the correct answer is included in the table, the location of the correct answer, and the reliability of the correct answer.
  • the answer generation unit may determine that answer extraction has failed when the correct answer is not included in the input document or when the reliability is less than a predefined threshold.
  • graph data includes nodes including the index, coordinates, and contents of each of the cells included in the table, and edges connecting the nodes based on the arrangement of the cells included in the table.
  • a knowledge extraction method for extracting knowledge from a document may include: extracting an entity from an input document; generating a query including the entity based on an attribute included in an attribute list of the entity; generating graph data from a table included in the input document; extracting an answer to the query from the graph data; and generating a knowledge instance from the entity, the attribute, and the answer based on the format of a knowledge base.
  • knowledge can be extracted even from a table, from which knowledge extraction is otherwise not easy.
  • the extracted knowledge can be verified, so that ultimately accurate knowledge can be extracted.
  • the knowledge base can be efficiently reinforced with easily and accurately extracted knowledge, thereby improving the quality of services based on the knowledge base and expanding their scope.
  • FIG. 1 is a block diagram showing a knowledge extraction system and input/output thereof according to an exemplary embodiment of the present invention.
  • FIG. 2 is a diagram showing examples of an input document and a table extracted from the input document according to an exemplary embodiment of the present invention.
  • FIG. 3 is a block diagram illustrating an example of a query generator according to an exemplary embodiment of the present invention.
  • FIGS. 5A and 5B are diagrams showing examples of structured data generated from a table according to exemplary embodiments of the present invention.
  • FIG. 6 is a diagram showing an example of graph data according to an exemplary embodiment of the present invention.
  • FIG. 8 is a diagram illustrating an example of an operation of the natural language processing unit of FIG. 7 according to an exemplary embodiment of the present invention.
  • FIG. 9 is a flowchart illustrating a method for knowledge extraction according to an exemplary embodiment of the present invention.
  • FIG. 10 is a block diagram illustrating an example of a knowledge generation unit according to an exemplary embodiment of the present invention.
  • FIG. 11 is a diagram illustrating an example of an operation of a knowledge generation unit according to an exemplary embodiment of the present invention.
  • FIG. 12 is a flowchart illustrating a method for knowledge extraction according to an exemplary embodiment of the present invention.
  • FIG. 13 is a flowchart illustrating a method for knowledge extraction according to an exemplary embodiment of the present invention.
  • Steps, blocks, or functions of a method or algorithm described below may be directly implemented as hardware, a software module executed by a processor, or a combination thereof. If implemented in software, the functions may be stored as one or more instructions or code on a tangible, non-transitory computer readable medium.
  • a software module may reside in a storage medium in the form of random access memory (RAM), flash memory, read only memory (ROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), registers, a hard disk, a removable disk, a CD-ROM, or any other form known in the art.
  • a component represented or described as a block may be a hardware block or a software block.
  • each of the components may be an independent hardware block that exchanges signals with each other, or may be a software block executed on a single processor.
  • a software block may include program code, or a series of instructions compiled from program code, executable by at least one processor.
  • the software block may be stored in a non-transitory computer readable medium, such as a semiconductor memory device, a magnetic disk device, an optical disk device, or the like.
  • a “system” or “database” may refer to a computing system including at least one processor and memory accessed by the processor.
  • the knowledge extraction system 100 may receive an input document DIN and a property list PL, may communicate with a knowledge base 200, and may access a network 300.
  • the knowledge extraction system 100 may extract knowledge from a table included in the input document DIN and reinforce the knowledge base 200 with the extracted knowledge, as will be described below with reference to the drawings.
  • the knowledge extraction system 100 may include an entity extraction unit 110, a query generator 120, a graph generator 130, a graph reading engine 140, a knowledge generator 150, and a knowledge verification unit 160.
  • in some embodiments, instead of communicating with the knowledge base 200, the knowledge extraction system 100 may provide the extracted knowledge to another system, such as a system for augmenting the knowledge base 200.
  • also, in some embodiments, differently from what is shown in FIG. 1, the knowledge verification unit 160 may be outside the knowledge extraction system 100, and the knowledge extraction system 100 may perform verification of knowledge by communicating with the external knowledge verification unit 160 (e.g., via a network). Also, in some embodiments, differently from what is shown in FIG. 1, the graph generator 130 and the graph reading engine 140 may be outside the knowledge extraction system 100, and the knowledge extraction system 100 may provide a query (QUE) and receive an answer (ANS) by communicating with the external graph generator 130 and graph reading engine 140 (e.g., through a network).
  • the knowledge base 200 may include structured knowledge based on an ontology, that is, knowledge instances.
  • an ontology is a representation of things that exist or can be recognized by humans, in a form that can be handled by a computer; ontology components may include, for example, entities, classes, properties, and values. Additionally, ontology components may further include relations, function terms, restrictions, rules, events, and the like. Specific information about an entity, i.e., knowledge, may be referred to as a knowledge instance (or simply an instance), and the knowledge base 200 may store a vast number of knowledge instances.
  • the knowledge base 200 may include knowledge instances expressed based on the Resource Description Framework (RDF), and a knowledge instance may be expressed as a triple.
  • the knowledge base 200 may return a knowledge instance, that is, a triple, in response to a query, for example, a SPARQL (SPARQL Protocol and RDF Query Language) query.
  • a triple may consist of "subject (S) - predicate (P) - object (O)", and a knowledge instance may serve not only as the subject of a triple but also as its object, and may even serve as its predicate.
  • the knowledge base 200 may have a triple of "Yi Sun-sin (S)-nationality (P)-Joseon (O)" as a knowledge instance including the entity "Yi Sun-sin.”
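  • For illustration only (not part of the original disclosure), the following Python sketch shows how a triple such as "Yi Sun-sin (S) - nationality (P) - Joseon (O)" could be stored as RDF and retrieved with a SPARQL query; the namespace, resource names, and the use of the rdflib library are assumptions.

        from rdflib import Graph, Namespace

        KB = Namespace("http://example.org/kb/")  # hypothetical namespace
        g = Graph()
        # subject (S) - predicate (P) - object (O)
        g.add((KB["Yi_Sun-sin"], KB["nationality"], KB["Joseon"]))

        # SPARQL query asking for the nationality of Yi Sun-sin
        results = g.query("""
            SELECT ?o WHERE {
                <http://example.org/kb/Yi_Sun-sin>
                <http://example.org/kb/nationality> ?o .
            }
        """)
        for row in results:
            print(row.o)  # -> http://example.org/kb/Joseon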
  • the entity extractor 110 may extract the entity ENT from the input document DIN.
  • the entity ENT is a subject of knowledge and may constitute a knowledge instance included in the knowledge base 200 .
  • the input document DIN may refer to arbitrary data that mentions an entity ENT and includes information about the entity ENT.
  • the input document DIN may be a document provided from an encyclopedia service (or server) such as Wikipedia (wikipedia.org), an article provided from a newspaper or portal, or a document written on a social network service (SNS).
  • the input document DIN may be data stored in a local storage.
  • the input document DIN may be structured data or unstructured data.
  • the input document DIN may include a table.
  • the table may include a plurality of cells and may include knowledge according to the location and contents of the cells.
  • the knowledge included in the table may be based on the structure of the table, and it may not be easy to extract knowledge from the table through a method of extracting knowledge from text such as natural language sentences through semantic analysis.
  • knowledge extraction system 100 may use graphs to extract knowledge from tables, so that knowledge can be easily extracted from tables.
  • the input document DIN may be provided not only to the entity extractor 110 but also to the graph generator 130. An example of the input document DIN will be described later with reference to FIG. 2.
  • the entity extraction unit 110 may extract an entity from the input document DIN in an arbitrary method.
  • for example, the entity extractor 110 may perform morpheme analysis on text included in the input document DIN, and may extract one of the words corresponding to nouns among the morphemes as the entity ENT.
  • the entity extractor 110 may filter the words corresponding to nouns by referring to entities and/or knowledge instances included in the knowledge base 200, and may extract one of the filtered words as the entity (ENT).
  • the entity extractor 110 may also extract entities through analysis of sentences included in the text of the input document DIN, based on dependency parsing and/or semantic role labeling (SRL).
  • the entity extractor 110 may extract the entity ENT from the title of the input document DIN, as shown in FIG. 2 .
  • the query generator 120 may receive the entity ENT extracted from the entity extractor 110 and may receive the attribute list PL.
  • the attribute list PL may define the attributes that each of the classes in the knowledge base 200 has.
  • for example, the class "person" may define attributes such as "age", "gender", "name", "birth", "occupation", "nationality", and "spouse", and a knowledge instance containing "Yi Sun-sin" as an entity belonging to the class "person" may include one of the aforementioned attributes together with either another entity corresponding to that attribute (e.g., an entity corresponding to occupation, nationality, or spouse) or a value corresponding to that attribute (e.g., a value for age, gender, name, or birth).
  • the query generator 120 may refer to the input document DIN and/or the knowledge base 200 to obtain the class to which the entity ENT provided from the entity extractor 110 belongs, and may extract an attribute group corresponding to the acquired class from the attribute list PL.
  • the query generator 120 may generate a query QUE based on the entity ENT and one attribute PRO among a plurality of attributes included in the attribute group. As shown in FIG. 1 , the query generator 120 may provide the query QUE to the graph reading comprehension engine 140 and the property PRO to the knowledge generator 150 .
  • the query (QUE) provided to the graph reading engine 140 may be used to generate an answer (ANS) or may be used by the knowledge verification unit 160 to verify the knowledge instance (INS).
  • the query generator 120 may receive the attribute list PL from the knowledge base 200 . Examples of the query generator 120 will be described later with reference to FIGS. 3 and 4 .
  • the graph generator 130 may generate graph data GRA from the input document DIN.
  • the knowledge contained in a table included in the input document DIN may depend on the locations of cells, and accordingly, it may not be easy to extract knowledge from the table in the way knowledge is extracted from general text.
  • the graph generator 130 may generate graph data GRA, and the graph data GRA may include information about the position of the cell. Examples of the operation of the graph generator 130 and the graph data GRA will be described later with reference to FIGS. 5A, 5B, and 6 .
  • graph reading engine 140 may be used to verify extracted knowledge.
  • the graph reading engine 140 may communicate with the knowledge verification unit 160, and may generate not only the answer ANS but also additional information related to the graph data GRA and the query QUE.
  • the knowledge generating unit 150 and/or the knowledge verifying unit 160 may use additional information. Examples of the graph reading engine 140 will be described below with reference to FIG. 7 .
  • the knowledge generator 150 may receive the entity ENT from the entity extractor 110, the attribute PRO from the query generator 120, and the answer ANS from the graph reading engine 140.
  • the knowledge generator 150 may generate a knowledge instance (INS) (eg, a triple) from the entity (ENT), the attribute (PRO), and the answer (ANS).
  • the knowledge generation unit 150 may post-process the entity (ENT), the attribute (PRO), and the answer (ANS) according to the format of the knowledge base 200, and may create the knowledge instance (INS) by extracting an identifier of at least one of the entity (ENT), the attribute (PRO), and the answer (ANS) from the knowledge base 200.
  • the knowledge generation unit 150 may verify the knowledge instance INS by providing the knowledge instance INS to the knowledge verification unit 160, and may add the verified knowledge instance INS to the knowledge base 200. Examples of the knowledge generator 150 will be described with reference to FIGS. 10 and 11.
  • the knowledge verifier 160 may receive the knowledge instance INS from the knowledge generator 150, and may provide a verification result to the knowledge generator 150 by verifying the knowledge instance INS.
  • the knowledge verification unit 160 may access the network 300 and communicate with the graph reading engine 140 .
  • the network 300 may include a local network as well as a wide area network such as the Internet, and the knowledge verification unit 160 may obtain data (e.g., documents) required for verification of the knowledge instance INS by communicating with other systems connected to the network 300.
  • the knowledge verification unit 160 may verify the knowledge instance INS based on machine reading comprehension.
  • Machine Reading Comprehension (MRC) may refer to the ability of a machine to read texts on various subjects and understand their meaning.
  • the knowledge verification unit 160 may include a machine reading comprehension engine or may communicate with the machine reading comprehension engine.
  • the knowledge verification unit 160 may provide the document obtained through the network 300 and a query for verifying the knowledge instance (INS) to the machine reading comprehension engine, and may obtain a response corresponding to the query from the machine reading comprehension engine.
  • the knowledge verifier 160 may determine whether the verification of the knowledge instance (INS) is successful or not based on the response of the machine reading comprehension engine. An example of the knowledge verification unit 160 will be described later with reference to FIG. 10 .
  • the input document DIN′ may include a table, and graph data GRA may be generated from the table.
  • the graph generating unit 130 of FIG. 1 may extract a table from the input document DIN′, and FIG. 2 will be described with reference to FIG. 1 below.
  • the input document DIN′ may include information about the person “Son Heung-min”.
  • the input document DIN' may be provided from an encyclopedia service such as Wikipedia or namu.wiki.
  • the input document DIN' may include "Son Heung-min", a person's name, as its title, and may include text describing the person, for example, text beginning with "Son Heung-min is a Korean".
  • the contents of the input document DIN' may be updated, and accordingly, even after knowledge extraction from the input document DIN' is completed, the entity extraction unit 110 may again extract the same or different entities from the updated input document.
  • the entity extractor 110 of FIG. 1 may extract the entity ENT from the title of the input document DIN'.
  • the input document DIN' may include information about a subject, and accordingly, various knowledge instances including entities extracted from the subject may be extracted from the input document DIN'. Accordingly, the entity extraction unit 110 may extract "Son Heung-min" as the entity ENT from the input document DIN' of FIG. 2 .
  • the preprocessor 122 may generate a first word vector V1 and a second word vector V2 corresponding to the entity ENT and the attribute PRO with reference to the word vector model 400 .
  • the word vector model 400 may refer to a multidimensional space in which a word (or token) having meaning is represented by one coordinate, that is, a word vector, or to a system that includes word vectors and updates the word vectors. Semantically similar words may be placed adjacent to one another in the multidimensional space, and thus word vectors corresponding to semantically similar words may have similar values.
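  • As a minimal sketch of this idea (with made-up toy vectors, not values from the actual word vector model 400), cosine similarity can show that semantically similar words have similar word vectors:

        import numpy as np

        # hypothetical 3-dimensional word vectors
        word_vectors = {
            "occupation": np.array([0.81, 0.10, 0.05]),
            "job":        np.array([0.78, 0.14, 0.07]),
            "banana":     np.array([0.02, 0.91, 0.33]),
        }

        def cosine(u, v):
            # cosine similarity: 1.0 means identical direction in the space
            return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

        print(cosine(word_vectors["occupation"], word_vectors["job"]))     # close to 1
        print(cosine(word_vectors["occupation"], word_vectors["banana"]))  # much lower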
  • the first word vector V1 may have coordinate values corresponding to the entity ENT, and the second word vector V2 may have coordinate values corresponding to the attribute PRO.
  • the deep learning network 124 may perform mathematical operations.
  • in some embodiments, the word vector model 400 may be included in the knowledge extraction system 100 of FIG. 1; in other embodiments, the query generator 120' may access a word vector model 400 external to the knowledge extraction system 100.
  • the deep learning network 124 may receive the first word vector V1 and the second word vector V2 from the preprocessor 122 and output a third word vector V3.
  • the deep learning network 124 may be trained to generate sample queries according to sample entities and sample attributes, for example based on reinforcement learning (RL), and may have an arbitrary structure.
  • deep learning networks, including the deep learning network 124 of FIG. 3, may be implemented as hardware or a combination of hardware and software, and may be referred to as artificial neural networks (ANN).
  • the deep learning network 124 may generate a third word vector V3 in response to the first word vector V1 and the second word vector V2; the post-processing unit 126 may obtain a series of words (W1 to W4) from the third word vector V3, and, by combining the series of words (W1 to W4), may generate "What is Son Heung-min's job?" as the query.
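  • The following sketch mirrors the three stages described above (preprocessing unit 122, deep learning network 124, post-processing unit 126). Since the trained network itself is not disclosed, it is replaced by a trivial stand-in that only shows the data flow; all vectors, function names, and the decoding step are illustrative assumptions.

        from typing import List
        import numpy as np

        TOY_VECTORS = {
            "Son Heung-min": np.array([0.9, 0.1]),  # hypothetical first word vector V1
            "occupation":    np.array([0.2, 0.8]),  # hypothetical second word vector V2
        }

        def preprocess(entity: str, attribute: str):
            """Preprocessing unit 122: map entity and attribute to word vectors V1, V2."""
            return TOY_VECTORS[entity], TOY_VECTORS[attribute]

        def network(v1: np.ndarray, v2: np.ndarray) -> np.ndarray:
            """Stand-in for the first deep learning network 124: (V1, V2) -> V3."""
            return np.concatenate([v1, v2])  # a real network would be trained on sample queries

        def postprocess(v3: np.ndarray, entity: str, attribute: str) -> str:
            """Post-processing unit 126: decode V3 into a series of words W1..W4."""
            words: List[str] = ["What", "is", entity + "'s", attribute + "?"]  # W1..W4
            return " ".join(words)

        v1, v2 = preprocess("Son Heung-min", "occupation")
        print(postprocess(network(v1, v2), "Son Heung-min", "occupation"))
        # -> What is Son Heung-min's occupation?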
  • FIGS. 5A and 5B show examples of structured data generated from a table according to exemplary embodiments of the present invention. Specifically, FIGS. 5A and 5B show first data D51 and second data D52 generated from the table T20 of FIG. 2 .
  • the graph generator 130 of FIG. 1 may generate first data D51 and second data D52 from a table (eg, T20 of FIG. 2 ) extracted from the input document DIN.
  • FIGS. 5A and 5B will be described with reference to FIGS. 1 and 2 .
  • the graph generator 130 may extract information of the cells included in the table and generate data including the extracted information. For example, as shown in FIG. 5A, the graph generator 130 may generate first data D51 from the table T20 of FIG. 2, and one row in the first data D51 may include information corresponding to one cell included in the table T20.
  • in the first data D51, the column 'ENT' may represent the entity, the column 'INDEX' may represent a cell index, the columns 'TOP', 'BOTTOM', 'LEFT', and 'RIGHT' may represent the coordinates of the upper, lower, left, and right borders of a cell, respectively, and the column 'TEXT' may represent the content included in the cell.
  • cells included in the same row may have the same values in the columns 'TOP and BOTTOM', respectively, and cells included in the same column may have the same values in the columns 'LEFT and RIGHT', respectively.
  • one cell, that is, one row of the first data D51 may correspond to one node in the graph.
  • the graph generator 130 may generate data representing relationships between cells. For example, the graph generator 130 may identify relationships between cells from the first data D51 of FIG. 5A and may generate the second data D52 of FIG. 5B based on the identified relationships. In the second data D52, one row may indicate the indices of two cells, and the two cells may be adjacent to each other in the row direction or the column direction. For example, in the table T20 of FIG. 2, a cell including "name" may have index "0", and may be adjacent to the cell with index "1" including "Son Heung-min" in the row direction and to the cell with index "2" including "occupation" in the column direction. Accordingly, as shown in FIG. 5B, index "0" and index "1" may be included in one row, and index "0" and index "2" may be included in another row.
  • cells adjacent to each other in the row direction or the column direction, that is, two cells included in one row of the second data D52, may be connected by an edge in the graph.
  • FIG. 6 is a diagram showing an example of graph data according to an exemplary embodiment of the present invention. Specifically, FIG. 6 shows a graph represented by graph data GRA generated from the table T20 of FIG. 2 .
  • the graph data GRA may define a graph including nodes corresponding to cells included in the table and edges connecting adjacent cells.
  • the table T20 may include cells, as indicated by dotted lines in FIG. 6, and the graph data GRA may define a graph including nodes respectively corresponding to the cells and edges connecting adjacent cells.
  • a node may include information about a corresponding cell, such as an index of a cell, coordinate information of a cell, content (eg, text), and the like. Accordingly, information related to positions of cells in the table may be included in the graph data GRA.
  • the graph data GRA may have any format defining a graph.
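  • As a sketch of the graph generation described above (an illustration only, with an assumed 2x2 cell layout following the "Son Heung-min" example of FIG. 2), each cell becomes a node carrying its index, border coordinates, and text, as in the first data D51, and cells adjacent in the row or column direction are connected by edges, as in the second data D52:

        from dataclasses import dataclass
        from typing import List, Tuple

        @dataclass
        class Node:
            index: int
            top: int
            bottom: int
            left: int
            right: int
            text: str

        # hypothetical 2x2 table: (row, column, text) per cell
        cells = [(0, 0, "name"), (0, 1, "Son Heung-min"),
                 (1, 0, "occupation"), (1, 1, "soccer player")]

        nodes: List[Node] = [
            Node(index=i, top=r, bottom=r + 1, left=c, right=c + 1, text=t)
            for i, (r, c, t) in enumerate(cells)
        ]

        # connect two cells that share a border in the row or the column direction
        edges: List[Tuple[int, int]] = []
        for a in nodes:
            for b in nodes:
                if a.index < b.index:
                    row_adjacent = a.top == b.top and a.right == b.left
                    col_adjacent = a.left == b.left and a.bottom == b.top
                    if row_adjacent or col_adjacent:
                        edges.append((a.index, b.index))

        print(edges)  # -> [(0, 1), (0, 2), (1, 3), (2, 3)]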
  • the graph reading comprehension engine 140' may include a natural language processor 141, a graph embedding model 143, a text embedding model 145, a classification model 147, and an answer generator 149.
  • graph embedding model 143, text embedding model 145, and classification model 147 may be based on deep learning networks, which will be referred to herein as second to fourth deep learning networks, respectively.
  • the natural language processing unit 141 may generate the first input data IN1 by natural-language-processing the graph data GRA, that is, the contents included in each node of the graph data GRA, and may generate the second input data IN2 by natural-language-processing the query QUE.
  • the natural language processing unit 141 may perform morphological analysis on the graph data (GRA) and the query (QUE). For example, as shown in FIG. 8, the natural language processing unit 141 may receive "What is Son Heung-min's occupation?" as the query (QUE), and may generate the second input data IN2 shown in FIG. 8 through morphological analysis of the query (QUE).
  • the graph embedding model 143 may receive first input data IN1 from the natural language processing unit 141 and generate a graph vector GV in response to the first input data IN1.
  • the graph embedding model 143 may be in a state where it has been trained to output sample graph vectors according to samples of the first input data IN1, and accordingly may generate a graph vector (GV) corresponding to the knowledge inherent in the first input data IN1, which corresponds to the table. In this way, knowledge can be extracted based on graph data GRA that reflects not only the contents of the cells included in the table but also the relationships between the cells, and thus knowledge can be easily extracted from the table.
  • the text embedding model 145 may receive the second input data IN2 from the natural language processing unit 141 and generate a word vector WV in response to the second input data IN2.
  • the text embedding model 145 may be in a state where it has been trained to output sample word vectors according to samples of the second input data IN2, and thus may generate a word vector (WV) corresponding to the meaning of the second input data IN2, which corresponds to the query QUE.
  • in some embodiments, the text embedding model 145 may generate one word vector (WV) from the second input data (IN2) corresponding to one query (QUE) and provide it to the classification model 147 described below.
  • in some embodiments, the second input data IN2 may include a plurality of words generated by natural-language-processing the query QUE, and the text embedding model 145 may generate a plurality of word vectors respectively corresponding to the plurality of words and provide the plurality of word vectors to the classification model 147.
  • in some embodiments, the text embedding model 145 may be omitted from the graph reading engine 140'.
  • for example, the graph reading engine 140' may receive the third word vector V3 output from the deep learning network 124 of FIG. 3.
  • the third word vector V3 may be provided to the classification model 147 as the word vector WV of FIG. 7.
  • in this case, the operation of generating the second input data IN2 from the query QUE by the natural language processing unit 141 may be omitted, and the text embedding model 145 for generating the word vector WV from the second input data IN2 may be omitted.
  • the classification model 147 may receive a graph vector (GV) from the graph embedding model 143 and may receive at least one word vector (WV) from the text embedding model 145 .
  • the classification model 147 may output output data OUT in response to the graph vector GV and at least one word vector WV.
  • the classification model 147 may be in a state where it has been trained to output samples of the output data OUT according to sample graph vectors and sample word vectors, and the output data OUT may include information about an answer extracted from the graph vector GV and corresponding to the word vector WV.
  • the output data OUT may further include additional information as well as content corresponding to the answer ANS.
  • the output data OUT may further include the location (eg, index) of a cell including the answer.
  • the answer generator 149 may receive output data OUT from the classification model 147 and generate an answer ANS based on the output data OUT.
  • for example, the output data OUT may include the index of the cell including "soccer/NNG player/NNG" in the natural-language-processed first input data IN1, that is, "3", and the answer generator 149 may generate "soccer player" as the answer ANS based on the cell index included in the output data OUT.
  • the answer generation unit 149 may generate a decision result DET indicating whether answer extraction is successful based on additional information included in the output data OUT.
  • the decision result (DET) may be provided to at least one of the other components of the knowledge extraction system 100, such as the entity extraction unit 110, the query generator 120, and the knowledge generator 150. An example of an operation of the answer generator 149 will be described later with reference to FIG. 9.
  • in step S92, an operation of determining whether a correct answer is included may be performed.
  • the classification model 147 may generate output data OUT including information indicating whether or not the correct answer to the query QUE is included in the graph data GRA, and the answer generation unit 149 may determine whether a correct answer is included based on the information included in the output data OUT.
  • when it is determined that the correct answer is not included in the graph data GRA, step S96 may be subsequently performed, while when it is determined that the correct answer is included in the graph data GRA, step S94 may be subsequently performed.
  • in step S94, an operation of comparing the reliability of the answer (ANS) with a predefined threshold may be performed.
  • the classification model 147 may generate output data OUT including the reliability of the answer ANS together with the location information of the answer ANS, and the answer generator 149 may compare the reliability included in the output data OUT with the threshold.
  • as shown in FIG. 9, if the reliability is less than the threshold, step S96 may be subsequently performed, while if the reliability is greater than or equal to the threshold, step S98 may be subsequently performed.
  • in step S96, an operation of determining that answer extraction has failed may be performed.
  • the answer generating unit 149 may generate a decision result (DET) indicating answer extraction failure, and may provide the decision result (DET) to other components included in the knowledge extraction system 100. Components included in the knowledge extraction system 100 may perform an operation for extracting the next knowledge in response to a decision result (DET) indicating extraction failure.
  • the entity extraction unit 110 may extract an entity different from the previous entity from the input document DIN or may receive another input document DIN.
  • the query generation unit 120 may generate a query QUE by obtaining attributes different from previous attributes from the attribute list PL.
  • the knowledge generating unit 150 may stop generating knowledge instances for the current entity ENT and attribute PRO.
  • an operation of generating the answer ANS may be performed in step S98.
  • the answer generation unit 149 may extract the answer ANS from the graph data GRA based on the location information of the correct answer included in the output data OUT.
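  • A compact sketch of this decision flow (steps S92, S94, S96, and S98 of FIG. 9) is shown below; the field names of the output data OUT and the threshold value are assumptions made for illustration.

        from typing import Dict, Optional

        THRESHOLD = 0.5  # predefined threshold (illustrative value)

        def generate_answer(out: Dict, cell_texts: Dict[int, str]) -> Optional[str]:
            if not out["answer_included"]:        # S92: is the correct answer in the graph data?
                return None                       # S96: answer extraction failed
            if out["reliability"] < THRESHOLD:    # S94: compare reliability with the threshold
                return None                       # S96: answer extraction failed
            return cell_texts[out["cell_index"]]  # S98: generate the answer from the graph data

        out = {"answer_included": True, "cell_index": 3, "reliability": 0.93}
        print(generate_answer(out, {3: "soccer player"}))  # -> soccer player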
  • FIG. 10 is a block diagram illustrating an example of a knowledge generation unit according to an exemplary embodiment of the present invention
  • FIG. 11 is a diagram illustrating an example of an operation of a knowledge generation unit according to an exemplary embodiment of the present invention.
  • the knowledge generation unit 150' of FIG. 10 may generate a knowledge instance (INS) (e.g., a triple) from an entity (ENT), an attribute (PRO), and an answer (ANS).
  • the knowledge generation unit 150' may include a candidate instance generation unit 152 and an instance comparison unit 154.
  • the candidate instance generation unit 152 may receive the entity (ENT), the attribute (PRO), and the answer (ANS), and may generate a candidate knowledge instance (CAN).
  • the candidate instance generation unit 152 may post-process the entity (ENT), the attribute (PRO), and the answer (ANS) based on the format of the knowledge base 200, that is, the format of the knowledge instances (e.g., triples) included in the knowledge base 200.
  • for example, when "Yi Sun-sin", "birthday", and "April 28, 1545" are received as the entity (ENT), the attribute (PRO), and the answer (ANS), the candidate instance generation unit 152 may convert the answer (ANS) "April 28, 1545" into "1545-04-28", based on the format "YYYY-MM-DD" used for representing dates in the knowledge base 200.
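  • A one-function sketch of this post-processing step (assuming English-style date answers and the "YYYY-MM-DD" target format mentioned above):

        from datetime import datetime

        def normalize_date(answer: str) -> str:
            """Re-emit a date answer such as 'April 28, 1545' as YYYY-MM-DD."""
            return datetime.strptime(answer, "%B %d, %Y").strftime("%Y-%m-%d")

        print(normalize_date("April 28, 1545"))  # -> 1545-04-28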
  • the instance comparison unit 154 may receive candidate knowledge instances (CAN) and generate knowledge instances (INS) based on knowledge instances included in the knowledge base 200 .
  • the instance comparison unit 154 may calculate the similarity between the candidate knowledge instance (CAN) and the knowledge instances included in the knowledge base 200, like the "similarity calculation unit" described in Korean Patent Application No. 10-2018-0151222, which was filed by the same applicant as the present application and is incorporated herein by reference in its entirety.
  • the instance comparator 154 may detect entities of the knowledge base 200 corresponding to the subject, predicate, and object included in the candidate knowledge instance based on the calculated similarity, and, based on the detection result, may create the knowledge instance (INS) by extracting identifiers corresponding to the subject, predicate, and object, for example, Uniform Resource Identifiers (URIs), from the knowledge base 200.
  • the candidate instance generation unit 152 may create a candidate knowledge instance (CAN') corresponding to "Son Heung-min - occupation - soccer player".
  • even when the knowledge base 200 includes all of "Son Heung-min", "occupation", and "soccer player", it may not include the knowledge, that is, the knowledge instance, indicating that the person "Son Heung-min" is a soccer player.
  • FIG. 12 is a flowchart illustrating a method for knowledge extraction according to an exemplary embodiment of the present invention. Specifically, the flowchart of FIG. 12 shows a method of verifying a knowledge instance corresponding to extracted knowledge. In some embodiments, the method of FIG. 12 may be performed by the knowledge verification unit 160 of FIG. 1 and may be referred to as an operating method of the knowledge verification unit 160. As shown in FIG. 12, the method of FIG. 12 may include a plurality of steps (S121, S122, S123, S124, S125, S126). Hereinafter, FIG. 12 will be described with reference to FIG. 1, and descriptions overlapping with those of FIG. 9 will be omitted.
  • in step S121, an operation of searching for documents through the network 300 based on the knowledge instance (INS) may be performed.
  • the knowledge verification unit 160 may search systems connected to the network 300 for documents including at least one of the names in the knowledge instance (INS), for example, "Son Heung-min", "occupation", and "soccer player".
  • knowledge verification unit 160 may retrieve documents by accessing systems that provide documents containing various information.
  • in step S122, an operation of providing the retrieved document and the query (QUE) to the machine reading comprehension engine may be performed.
  • the knowledge verification unit 160 may provide the document retrieved in step S121 to the machine reading comprehension engine, and may either directly provide the query (QUE) used to generate the knowledge instance (INS) to the machine reading comprehension engine or cause the query generator 120 to provide it. Accordingly, the machine reading comprehension engine can find the correct answer to the query (QUE) in the retrieved document.
  • in step S123, an operation of determining whether a correct answer is included may be performed.
  • the knowledge verification unit 160 may directly receive the output data OUT of FIG. 7 and determine whether a correct answer is included based on the output data OUT.
  • the knowledge verification unit 160 may determine whether a correct answer is included based on the determination result (DET) provided by the answer generation unit 149 of FIG. 7 .
  • when it is determined that the correct answer is not included in the retrieved document, step S125 may be subsequently performed, while when it is determined that the retrieved document contains the correct answer, step S124 may be subsequently performed.
  • in step S124, an operation of comparing the reliability of the answer extracted from the retrieved document with a predefined threshold may be performed.
  • the threshold in FIG. 12 may be different from the threshold in FIG. 9 , for example, the threshold in FIG. 12 may be higher than the threshold in FIG. 9 .
  • if the reliability is less than the threshold, step S125 may be subsequently performed, while if the reliability is greater than or equal to the threshold, step S126 may be subsequently performed.
  • verification failure of the knowledge instance may be determined in step S125. For example, when verification failure is determined, the knowledge verification unit 160 may perform verification of the knowledge instance INS again using another retrieved document. When verification of the knowledge instance (INS) fails using all of the retrieved documents, or a predefined amount of documents, the knowledge verification unit 160 may finally determine that the verification of the knowledge instance (INS) has failed, and may notify the knowledge generation unit 150 of the failure.
  • the verification success of the knowledge instance may be determined in step S126.
  • the knowledge verification unit 160 may finally determine verification success of the knowledge instance INS.
  • the knowledge verification unit 160 may finally determine verification success of the knowledge instance INS when verification success is determined using a predefined number or ratio of retrieved documents.
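  • The verification loop of FIG. 12 (steps S121 to S126) can be sketched as follows; the document search and the machine reading comprehension (MRC) engine are stubbed out because the document treats them as external systems, and every function body and the threshold value here are assumptions for illustration.

        VERIFY_THRESHOLD = 0.8  # illustrative; may be higher than the FIG. 9 threshold

        def search_documents(instance):
            """S121: search systems on the network for documents naming the triple's terms (stub)."""
            return ["Son Heung-min is a forward whose occupation is soccer player."]

        def mrc_answer(document, query):
            """S122: stand-in MRC engine returning (answer, answer_included, reliability)."""
            return ("soccer player", "soccer player" in document, 0.9)

        def verify(instance, query):
            subject, predicate, obj = instance
            for doc in search_documents(instance):
                answer, included, reliability = mrc_answer(doc, query)
                if not included:                    # S123 -> S125: try another document
                    continue
                if reliability < VERIFY_THRESHOLD:  # S124 -> S125: try another document
                    continue
                if answer == obj:                   # S126: verification success
                    return True
            return False                            # S125: verification finally failed

        print(verify(("Son Heung-min", "occupation", "soccer player"),
                     "What is Son Heung-min's occupation?"))  # -> True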
  • an entity ENT may be extracted from the input document DIN.
  • the input document DIN may include a title
  • the entity extraction unit 110 may extract the entity ENT from the title of the input document DIN.
  • an answer (ANS) to the query (QUE) may be extracted from the graph data (GRA).
  • the graph reading comprehension engine 140 may generate input data by natural-language-processing the graph data GRA and the query QUE, and may create output data OUT from the input data using at least one trained model. The output data OUT may include not only information on the answer ANS but also additional information, and the answer ANS may be generated based on the output data OUT.
  • in step S900, an operation of generating a knowledge instance (INS) based on the entity (ENT), the attribute (PRO), and the answer (ANS) may be performed.
  • the knowledge generating unit 150 may generate a candidate knowledge instance (eg, CAN of FIG. 10 ) based on the entity (ENT), attribute (PRO), and answer (ANS).
  • the knowledge generation unit 150 may generate a knowledge instance INS from a candidate knowledge instance by comparing the candidate knowledge instance with the knowledge instances included in the knowledge base 200, and may determine whether to reinforce the knowledge base 200 based on the knowledge instance INS.
  • an operation of verifying the knowledge instance (INS) generated in step S900 may be further performed.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A knowledge extraction system for extracting knowledge from a document including a table may comprise: an entity extraction unit that extracts an entity from an input document; a query generation unit that generates a query including the entity on the basis of an attribute included in an attribute list of the entity; a graph generation unit that generates graph data from a table; a graph reading engine that extracts an answer to the query from the graph data; and a knowledge generation unit that generates a knowledge instance from the entity, the attribute, and the answer on the basis of the format of a knowledge base.
PCT/KR2021/018458 2021-11-15 2021-12-07 System and method for knowledge extraction based on graph reading WO2023085500A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
KR1020210157084A KR20230070968A (ko) 2021-11-15 2021-11-15 System and method for knowledge extraction based on graph reading
KR10-2021-0157084 2021-11-15

Publications (1)

Publication Number Publication Date
WO2023085500A1 (fr)

Family

ID=86336222

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2021/018458 WO2023085500A1 (fr) 2021-11-15 2021-12-07 System and method for knowledge extraction based on graph reading

Country Status (2)

Country Link
KR (1) KR20230070968A (fr)
WO (1) WO2023085500A1 (fr)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100250598A1 (en) * 2009-03-30 2010-09-30 Falk Brauer Graph based re-composition of document fragments for name entity recognition under exploitation of enterprise databases
KR20150032164A (ko) * 2013-09-17 2015-03-25 인터내셔널 비지네스 머신즈 코포레이션 심층적 문서 분석에 기초한 능동적 지식 안내
KR20200017347A (ko) * 2018-08-08 2020-02-18 베이징 바이두 넷컴 사이언스 앤 테크놀로지 코., 엘티디. 지식 그래프를 생성하기 위한 방법, 장치, 기기 및 컴퓨터 판독 가능 저장 매체
KR20210000952A (ko) * 2019-06-26 2021-01-06 주식회사 카카오 지식그래프 색인 방법 및 장치
KR20210043283A (ko) * 2019-10-11 2021-04-21 주식회사 솔트룩스 기계 독해 기반 지식 추출을 위한 시스템 및 방법


Also Published As

Publication number Publication date
KR20230070968A (ko) 2023-05-23

Similar Documents

Publication Publication Date Title
CN109697162B (zh) Automatic software defect detection method based on an open-source code library
CN107391677B (zh) Method and device for generating a general-purpose Chinese knowledge graph carrying entity relation attributes
US10503830B2 (en) Natural language processing with adaptable rules based on user inputs
WO2021049706A1 (fr) System and method for ensemble question answering
CN110162771B (zh) Event trigger word recognition method and device, and electronic device
KR102292040B1 (ko) System and method for knowledge extraction based on machine reading comprehension
EP3333731A1 (fr) Method and system for creating an instance model
Stancheva et al. A model for generation of test questions
CN116127090B (zh) Aviation system knowledge graph construction method based on fusion and semi-supervised information extraction
CN113157859A (zh) Event detection method based on hypernym concept information
CN113761208A (zh) Knowledge-graph-based science and technology innovation information classification method and storage device
Kalo et al. Knowlybert-hybrid query answering over language models and knowledge graphs
Solanki et al. A system to transform natural language queries into SQL queries
CN115114419A (zh) Question answering processing method and device, electronic device, and computer-readable medium
WO2023085500A1 (fr) System and method for knowledge extraction based on graph reading
Bhattacharjee et al. Named entity recognition: A survey for indian languages
WO2022177372A1 (fr) System for providing a tutoring service using artificial intelligence and method therefor
US20220229990A1 (en) System and method for lookup source segmentation scoring in a natural language understanding (nlu) framework
US11386132B2 (en) System and method for retrieving results and responses with context based exclusion criteria
WO2021054512A1 (fr) System and method for knowledge base reinforcement
US20200380012A1 (en) System and method for enabling interoperability between a first knowledge base and a second knowledge base
Quy Tran et al. FU Covid-19 AI Agent built on Attention algorithm using a combination of Transformer, ALBERT model, and RASA framework
Zhekova et al. Software Tool for Translation of natural language text to SQL query
WO2023128020A1 (fr) Method and device for normalizing multinational clinical data
Yuan et al. Robustness analysis on natural language processing based AI Q&A robots

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 21964209

Country of ref document: EP

Kind code of ref document: A1