WO2021107760A1 - System and method for dynamically processing data into a knowledge base repository - Google Patents

System and method for dynamically processing data into a knowledge base repository

Info

Publication number
WO2021107760A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
concept
component
triple
property
Prior art date
Application number
PCT/MY2020/050118
Other languages
French (fr)
Inventor
Ma. Stella Tabora DOMINGO
Nghia Pham DUC
Original Assignee
Mimos Berhad
Priority date
Filing date
Publication date
Application filed by Mimos Berhad
Publication of WO2021107760A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/205: Parsing
    • G06F40/226: Validation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/30: Semantic analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00: Computing arrangements using knowledge-based models
    • G06N5/02: Knowledge representation; Symbolic representation
    • G06N5/022: Knowledge engineering; Knowledge acquisition

Definitions

  • the present invention relates to data processing, in particular to a system and method for dynamically processing data into a knowledge base repository.
  • processing of data involves creating valid, useful information from collected data.
  • Data processing includes classification, computation, coding, and updating.
  • the processed data needs to be stored in the most suitable format and on the best available medium.
  • storage and processing of data from standard templates or formats into a knowledge base repository require extensive manual data validation and conversion, which leads to offline updates and slows the growth of the knowledge base.
  • Manual data validation and conversion are conducted by a knowledge engineer who is knowledgeable about the structure and definition of a specific knowledge base repository, which is called an ontology.
  • US patent number 9,406,018 B2 filed by Upadhyaya, et al. discloses a system for data integration.
  • the system disclosed in the patent includes a semantic data integrator.
  • the semantic data integrator includes an ontology repository containing an ontology.
  • the ontology includes an ontology level and a concept in an ontology level.
  • the semantic data integrator includes a query interface module that receives an input command from the query interface and maps the input command to concepts in the ontology repository. Mapped commands may further be composed into subcommands from the input command.
  • the data sources are queried in accordance with the composed data queries, wherein one or more of the data sources queried are tagged with concepts from the ontology repository.
  • US patent publication number US 20150310676 A1 filed by Lambert, Daniel, et al. discloses a system for a dynamic uploading protocol.
  • the system disclosed in Lambert, Daniel, et al. publication includes an input interface configured to receive a manifest including events which may be uploaded.
  • the manifest additionally includes sensor information relating to each of the plurality of events.
  • the system further includes a processor to determine whether to upload additional information about each event.
  • Determining whether to upload additional information about each event is based in part on the sensor information and contextual information.
  • the system also includes an output interface to request additional information.
  • the input interface disclosed in the Lambert, Daniel, et al. publication does not allow target users to configure it, and the configuration cannot be changed at will.
  • the data and structure disclosed in the Lambert, Daniel, et al. publication are fixed as initially configured, and the accepted data is fixed.
  • a method for dynamically processing data into a knowledge base repository through a dynamic data processing device comprises: facilitating a system administrator with a user interface to configure at least one of an acceptable data input content and a data source based on a predefined terminological component (TBox) ontology of one or more domains of interest through an input configuration and TBox assignment module, wherein the acceptable data input content has a predefined structure; receiving the acceptable data input content from the system administrator through a parser; validating the acceptable data input content together with its predefined structure before mapping and converting one or more field headers to a concept and a concept-property as configured in the input configuration and terminological component (TBox) assignment module through a field mapper and converter module; and processing the acceptable data input content to generate at least one of a plurality of semantic triples and a plurality of semantic quads to define an assertion component (ABox) to ensure homogeneity and cardinality of data
  • the input configuration and terminological component (TBox) assignment module performs steps comprising: selecting at least one of the acceptable data input content and the data source content; adding one or more preferred field headers; assigning the concept and the concept-property, with an object type of the concept and the concept-property, to the field headers based on a terminological component (TBox) definition library; adding the field header to a concept definition list; and confirming the concept definition list to a data declaration library.
  • the field mapper and converter module performs a plurality of steps comprising: identifying and validating the acceptable data input content and structure based on a data input/source list; extracting the field header; mapping the extracted field header; validating the field header as concept and concept-property; converting the field header to the assigned concept and concept-property to be stored in a temporary database; and logging the converted data to a summary log on identifying an invalid field header.
  • the assertion component (ABox) generator module performs a plurality of steps comprising: retrieving row data; reading and validating the datatype of column data; logging the read and validated datatype of the column data to a summary log on identifying invalid data; and trimming data with spaces and removing one or more special characters based on one or more data compliance rules to output cleansed data.
  • the cleansed data is received by a triple-quad generator for generating triple-quad, and uploading the triple-quads to the knowledge base repository by an assertion component, ABox, uploader.
  • the assertion component (ABox) generator module comprises the triple-quad generator, which performs a plurality of steps comprising: obtaining cleansed column data; obtaining the assigned predicate for the concept or concept-property; constructing semantic triples/quads; and storing the generated triples/quads to the temporary database; and the assertion component (ABox) uploader, which performs a plurality of steps comprising: retrieving a list of generated triples/quads from the temporary database; validating each triple/quad for homogeneity from the knowledge base repository; checking whether the concept-property cardinality is exceeded; logging to the summary log on determining that the triple/quad is homogenous and exceeding cardinality; uploading the triple/quad to the knowledge base repository; and generating a summary report.
  • a system for dynamically processing data into a knowledge base repository comprises: a processor and a memory communicatively coupled to the processor, wherein the memory stores instructions for processing data; a user interface adapted for facilitating a system administrator to configure at least one of an acceptable data input content and a data source based on a predefined terminological component (TBox) ontology of one or more domains of interest through an input configuration and TBox assignment module, wherein the acceptable data input content has a predefined structure; a parser adapted for receiving the acceptable data input content from the system administrator; a field mapper and converter module adapted for validating the acceptable data input content together with its predefined structure before mapping and converting one or more field headers to a concept and a concept-property as configured in the input configuration and terminological component (TBox) assignment module; and an assertion component (ABox) generator module for processing the acceptable data input content to generate at least one of a plurality of semantic
  • the input configuration and terminological component (TBox) assignment module is further adapted to perform steps comprising: selecting at least one of the acceptable data input content and the data source content; adding one or more preferred field headers; assigning a concept and a concept-property, with an object type of the concept and the concept-property, to the field headers based on a terminological component (TBox) definition library; adding the field header to a concept definition list; and confirming the concept definition list to a data declaration library.
  • the field mapper and converter module is adapted for identifying and validating the acceptable data input content and structure based on a data input/source list; extracting the field header; mapping the extracted field header; validating the field header as concept and concept-property; converting the field header to the assigned concept and concept-property to be stored in a temporary database; and logging the converted data to a summary log on identifying an invalid field header.
  • the assertion component (ABox) generator module is adapted for retrieving row data; reading and validating the datatype of column data; logging the read and validated datatype of the column data to the summary log on identifying invalid data; trimming data with spaces; and removing one or more special characters based on one or more data compliance rules.
  • the assertion component (ABox) generator module comprises a triple-quad generator adapted for obtaining cleansed column data; extracting the assigned predicate for the concept or concept-property; constructing semantic triples/quads; and storing the generated triples/quads to the temporary database; and an assertion component (ABox) uploader adapted for retrieving triples/quads from the temporary database; validating each triple/quad for homogeneity from the knowledge base repository; checking whether the concept-property cardinality is exceeded; logging to the summary log on determining that the triple/quad is homogenous and exceeding cardinality; uploading the triple/quad to the knowledge base repository; and generating the summary report.
  • FIG. 1 illustrates a block diagram of the present system for dynamically processing data into a knowledge base repository, in accordance with one embodiment of the present invention.
  • FIG. 2 illustrates a block diagram of modules within a memory of a dynamic data processing device for dynamically processing data into a knowledge base repository, in accordance with another embodiment of the present invention.
  • FIG. 3 illustrates an operational flowchart of the present system for dynamically processing data into the knowledge base repository in another embodiment of the present invention.
  • FIG. 4 illustrates a flowchart of the method for dynamically processing data into the knowledge base repository, in accordance with an alternative embodiment of the present invention.
  • FIG. 5 illustrates a flowchart of various steps performed by the input configuration and terminological component (TBox) assignment module in a further embodiment of the present invention.
  • FIG. 6 illustrates a flowchart of various steps performed by field mapper and converter module in yet another embodiment of the present invention.
  • FIG. 7 illustrates a flowchart of various steps performed by an assertion component (ABox) generator module in a further embodiment of the present invention.
  • FIG. 8 illustrates a flowchart of various steps performed by a triple-quad generator in another embodiment of the present invention.
  • FIG. 9 illustrates a flowchart of various steps performed by the ABox uploader in the further embodiment of the present invention.
  • FIG. 10 illustrates a perspective view of an input configuration and TBox assignment user interface in the further embodiment of the present invention.
  • FIG. 11 illustrates a perspective view of a sample Excel file in the further embodiment of the present invention.
  • FIG. 12 illustrates a perspective view of the assigned concept/concept-property in the further embodiment of the present invention.
  • FIG. 13 illustrates a perspective view of sample generated semantic triples in the further embodiment of the present invention.
  • Systems and methods are disclosed for dynamically processing data into a knowledge base repository.
  • Embodiments of the present invention include various steps, which will be described below. The steps may be performed by hardware components or may be embodied in machine-executable instructions, which may be used to cause a general-purpose or special-purpose processor programmed with the instructions to perform the steps. Alternatively, steps may be performed by a combination of hardware, software, firmware, and/or by human operators.
  • Embodiments of the present invention may be provided as a computer program product, which may include a machine-readable storage medium tangibly embodying thereon instructions, which may be used to program a computer (or other electronic devices) to perform a process.
  • the machine-readable medium may include, but is not limited to, fixed (hard) drives, magnetic tape, floppy diskettes, optical disks, compact disc read-only memories (CD-ROMs), and magneto-optical disks, semiconductor memories, such as ROMs, PROMs, random access memories (RAMs), programmable read-only memories (PROMs), erasable PROMs (EPROMs), electrically erasable PROMs (EEPROMs), flash memory, magnetic or optical cards, or other type of media/machine-readable medium suitable for storing electronic instructions (e.g., computer programming code, such as software or firmware).
  • Various methods described herein may be practiced by combining one or more machine-readable storage media containing the code according to the present invention with appropriate standard computer hardware to execute the code contained therein.
  • An apparatus for practicing various embodiments of the present invention may involve one or more computers (or one or more processors within a single computer) and storage systems containing or having network access to computer program(s) coded in accordance with various methods described herein, and the method steps of the invention could be accomplished by modules, routines, subroutines, or subparts of a computer program product.
  • the present invention discloses a system and method, whereby the data is automatically processed from standard templates or format into the knowledge base repository.
  • the present system and method dynamically analyze the data, convert the analyzed data, and upload the converted data into an assertion component (ABox) of the knowledge base repository to ensure its semantic relationship based on a defined terminological component (TBox) ontology.
  • TBox is a “terminological component” and ABox is an “assertion component”.
  • TBox and ABox describe two different types of statements in the knowledge base repository. TBox statements describe a conceptualization of a domain of interest by defining different sets of individuals described in terms of their characteristics (properties). Generally, TBox statements are associated with object-oriented classes and ABox statements are associated with instances of those classes.
  • machine-readable storage medium includes, but is not limited to, portable or non-portable storage devices, optical storage devices, and various other mediums capable of storing, containing, or carrying instruction(s) and/or data.
  • a machine-readable medium may include a non-transitory medium in which data can be stored, and that does not include carrier waves and/or transitory electronic signals propagating wirelessly or over wired connections. Examples of a non-transitory medium may include, but are not limited to, a magnetic disk or tape, optical storage media such as compact disk (CD) or digital versatile disk (DVD), flash memory, memory or memory devices.
  • FIG. 1 illustrates a block diagram of a system 100 for dynamically processing data into a knowledge base repository, in accordance with one embodiment of the present invention.
  • the system 100 includes a dynamic data processing device 102 that automates the data processing from standard templates or a format into the knowledge base repository.
  • the dynamic data processing device 102 facilitates a system administrator with a user interface 116 to configure at least one of an acceptable data input content and a data source based on a predefined terminological component (TBox) ontology of one or more domain of interest through an input configuration and terminological component (TBox) assignment module.
  • the acceptable data input content has a predefined structure.
  • the dynamic data processing device 102 is then configured to receive the acceptable data input content from the system administrator through a parser. Further, the dynamic data processing device 102 is configured to validate the acceptable data input content together with the predefined structure of the acceptable data input content before mapping and converting one or more field headers to a concept and a concept-property as configured in the input configuration and terminological component (TBox) assignment module through a field mapper and converter module.
  • the dynamic data processing device 102 is configured to process the acceptable data input content to generate at least one of a plurality of semantic triples, and a plurality of semantic quads to define an assertion component (ABox) to ensure homogeneity and cardinality of data within the knowledge base repository during upload through an assertion component (ABox) generator module.
  • the dynamically processed data into the knowledge base repository may be presented to the user by a plurality of computing devices 104 for example, a laptop 104a, a desktop 104b, and a smartphone 104c.
  • Other examples of a plurality of computing devices 104 may include but are not limited to a phablet and a tablet.
  • the dynamically processed data may be stored on a server 106 and may be accessed by a plurality of computing devices 104 via a network 108.
  • the network 108 may be a wired or a wireless network, and the examples may include but are not limited to the Internet, Wireless Local Area Network (WLAN), Wi-Fi, Long Term Evolution (LTE), Worldwide Interoperability for Microwave Access (WiMAX), and General Packet Radio Service (GPRS).
  • the dynamic data processing device 102 includes a processor 110 that is communicatively coupled to a memory 112, which may be a non-volatile memory or a volatile memory.
  • Examples of non-volatile memory may include, but are not limited to, flash memory, a Read Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), and an Electrically Erasable PROM (EEPROM).
  • Examples of volatile memory may include, but are not limited to, Dynamic Random Access Memory (DRAM) and Static Random-Access Memory (SRAM).
  • the processor 110 may include at least one data processor for executing program components for executing user- or system-generated requests.
  • a user may include a person, a person using a device such as those included in this invention, or such a device itself.
  • the processor 110 may include specialized processing units such as integrated system (bus) controllers, memory management control units, floating point units, graphics processing units, digital signal processing units, etc.
  • the processor 110 may include a microprocessor, such as AMD® ATHLON® microprocessor, DURON® microprocessor or OPTERON® microprocessor, ARM's application, embedded or secure processors, IBM® POWERPC®, INTEL'S CORE® processor, ITANIUM® processor, XEON® processor, CELERON® processor or other line of processors, etc.
  • the processor 110 may be implemented using mainframe, distributed processor, multi-core, parallel, grid, or other architectures. Some embodiments may utilize embedded technologies like application-specific integrated circuits (ASICs), digital signal processors (DSPs), Field Programmable Gate Arrays (FPGAs), etc.
  • the processor 110 may be disposed in communication with one or more input/output (I/O) devices via an I/O interface.
  • the I/O interface may employ communication protocols/methods such as, without limitation, audio, analog, digital, RCA, stereo, IEEE- serial bus, universal serial bus (USB), infrared, PS/2, BNC, coaxial, component, composite, digital visual interface (DVI), high-definition multimedia interface (HDMI), RF antennas, S-Video, VGA, IEEE 802.n/b/g/n/x, Bluetooth, cellular (e.g., code-division multiple access (CDMA), high-speed packet access (HSPA+), global system for mobile communications (GSM), long-term evolution (LTE), WiMax, or the like), etc.
  • the memory 112 further includes various modules that enable the dynamic data processing device 102 for dynamically processing data into the knowledge base repository. These modules are explained in detail in conjunction with FIG. 2.
  • the dynamic data processing device 102 may further include a display 114 having a User Interface (UI) 116 that may be used by a user or an administrator to initiate a request to view the dynamically processed data in a specific domain and provide various inputs to the dynamic data processing device 102.
  • the display 114 may further be used to display the dynamically processed data.
  • the functionality of the dynamic data processing device 102 may alternatively be configured within each of the plurality of computing devices 104.
  • the terms "ABox" and "TBox" are used to describe two different types of statements in the knowledge base repository.
  • TBox statements describe a conceptualization of a domain of interest by defining different sets of individuals described in terms of their characteristics (properties).
  • ABox statements are TBox-compliant statements about individuals belonging to these sets. For example, a specific employee is an individual in the set called “employee”. This set can be defined as a subset of all people who work in some service or manufacturing industries, making it possible to state the specific organization where each individual works.
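The TBox/ABox distinction can be illustrated with a minimal Python sketch built on the employee example. Everything here is an assumption made for illustration: the tuple layout, the names `TBOX`, `ABOX` and `is_tbox_compliant`, and the use of `domain`/`range` declarations are not taken from the patent, which does not prescribe a data model.

```python
# Hypothetical names throughout; the patent does not prescribe a data model.

# TBox: terminological statements that define sets (classes) and their properties.
TBOX = {
    ("Employee", "subClassOf", "Person"),
    ("worksFor", "domain", "Employee"),
    ("worksFor", "range", "Organization"),
}

# ABox: assertions about individuals that belong to those sets.
ABOX = {
    ("alice", "type", "Employee"),
    ("acme", "type", "Organization"),
    ("alice", "worksFor", "acme"),
}


def is_tbox_compliant(triple, tbox, abox):
    """Check a property assertion against the domain/range declared in the TBox."""
    subject, predicate, obj = triple
    domain = {c for (p, k, c) in tbox if p == predicate and k == "domain"}
    range_ = {c for (p, k, c) in tbox if p == predicate and k == "range"}
    if not domain or not range_:
        return False  # the predicate is not defined in the TBox
    subject_types = {c for (s, p, c) in abox if s == subject and p == "type"}
    object_types = {c for (s, p, c) in abox if s == obj and p == "type"}
    return bool(domain & subject_types) and bool(range_ & object_types)
```

Under these assumptions, the assertion that the employee works for the organization is compliant, while the reversed assertion is not, because the organization is not a member of the “employee” set.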
  • FIG. 2 illustrates a block diagram of the various modules within the memory 112 of the dynamic data processing device 102 for dynamically processing data into a knowledge base repository, in accordance with another embodiment of the present invention.
  • the memory 112 includes an input configuration and terminological component (TBox) assignment module 202, a parser 204, a field mapper and converter module 206, and an assertion component (ABox) generator module 208.
  • FIG. 10 exemplifies a screenshot 1000 of an input configuration and TBox assignment user interface in one embodiment of the present invention.
  • the acceptable data input content has a predefined structure.
  • the parser 204 is adapted to receive the acceptable data input content from the system administrator.
  • the field mapper and converter module 206 is operable to validate the acceptable data input content together with the predefined structure of the acceptable data input content before mapping and converting one or more field headers to a concept and a concept-property as configured in the input configuration and terminological component (TBox) assignment module 202.
  • the assertion component (ABox) generator module 208 processes the acceptable data input content to generate at least one of a plurality of semantic triples, and a plurality of semantic quads to define an assertion component (ABox) to ensure homogeneity and cardinality of data within the knowledge base repository during upload.
  • the assertion component (ABox) generator module 208 includes a triple-quad generator 702 (shown and explained in FIG. 7, with operations further provided in FIG. 8) and an assertion component (ABox) uploader 704 (shown and explained in detail in FIG. 7, with operations further provided in FIG. 9).
  • the triple-quad generator 702 performs a plurality of steps that includes obtaining cleansed column data, obtaining assigned predicate for the concept or concept-property, constructing semantic triples/quads, and storing generated triples/quads to a temporary database 308.
  • the assertion component (ABox) uploader 704 performs a plurality of steps that includes retrieving a list of generated triples/quads from the temporary database 308, validating each triple/quad for homogeneity from the knowledge base repository, checking whether the concept-property cardinality is exceeded, logging to a summary log on determining that the triple/quad is homogenous and exceeding cardinality, uploading the triple/quad to the knowledge base repository, and generating the summary report.
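The upload checks described above might be sketched as follows. This is only one possible reading: the text does not define "homogeneity", so the sketch assumes it means that a value has the Python type expected for its predicate, and it assumes that triples failing either check are logged and skipped. The function name `upload_triples` and the `expected_type` and `max_cardinality` mappings are hypothetical.

```python
from collections import Counter


def upload_triples(generated, repository, expected_type, max_cardinality):
    """Upload triples that pass both checks; return (uploaded, summary_log)."""
    summary_log = []
    uploaded = []
    # Count the values already stored per (subject, predicate) in the repository.
    counts = Counter((s, p) for (s, p, _o) in repository)
    for triple in generated:
        s, p, o = triple
        wanted = expected_type.get(p)
        # Assumed homogeneity check: the value must match the type used for p.
        if wanted is not None and not isinstance(o, wanted):
            summary_log.append((triple, "not homogeneous"))
            continue
        # Cardinality check: do not exceed the allowed number of values.
        limit = max_cardinality.get(p)
        if limit is not None and counts[(s, p)] >= limit:
            summary_log.append((triple, "exceeds cardinality"))
            continue
        repository.append(triple)
        counts[(s, p)] += 1
        uploaded.append(triple)
    return uploaded, summary_log
```

The returned `summary_log` plays the role of the summary log, and a summary report could be derived from it together with the list of uploaded triples.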
  • FIG. 3 illustrates an operational flowchart 300 of the present system 100 for dynamically processing data into a knowledge base repository 312 in another embodiment of the present invention.
  • FIG. 3 is explained in conjunction with FIG. 2.
  • the input configuration and terminological component (TBox) assignment module 202 performs a plurality of steps. The steps include selecting at least one of the acceptable data input content and the data source content.
  • the input configuration and terminological component (TBox) assignment module 202 then adds one or more preferred field headers.
  • the input configuration and terminological component (TBox) assignment module 202 assigns the concept and the concept-property, with an object type of the concept and the concept-property, to the field headers based on a terminological component (TBox) definition library 302.
  • the input configuration and terminological component (TBox) assignment module 202 adds the field header to a concept definition list and confirms the concept definition list to a data declaration library 304.
  • the field mapper and converter module 206 performs a plurality of steps. The steps include identifying and validating the acceptable data input content and structure based on a data input/source list 306.
  • the field mapper and converter module 206 extracts the field header.
  • the field mapper and converter module 206 maps the extracted field header and validates the field header as concept and concept-property.
  • the field mapper and converter module 206 converts the field header to the assigned concept and concept-property to be stored in a temporary database 308 and logs the converted data to a summary log 310 on identifying an invalid field header.
  • the assertion component (ABox) generator module 208 performs a plurality of steps that include retrieving row data.
  • the assertion component (ABox) generator module 208 reads and validates the datatype of column data.
  • the assertion component (ABox) generator module 208 logs the read and validated datatype of the column data to the summary log 310 on identifying invalid data.
  • the assertion component (ABox) generator module 208 trims data with spaces and removes one or more special characters based on one or more data compliance rules 314.
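The row-cleansing steps of the ABox generator module can be sketched in Python as below. The compliance rule (a regular expression that keeps word characters, whitespace, and `.`, `-`, `@`), the function name `cleanse_row`, and the `datatypes` mapping are assumptions for illustration; the actual data compliance rules 314 are not given in the text. The sketch also cleanses each value before casting it, which is an ordering choice of this example.

```python
import re

# Hypothetical compliance rule: drop everything except word characters,
# whitespace, and the characters ".", "-" and "@".
SPECIAL_CHARS = re.compile(r"[^\w\s.\-@]")


def cleanse_row(row, datatypes, summary_log):
    """Return the cleansed columns of one row; log columns with invalid data."""
    cleansed = {}
    for column, raw in row.items():
        # Remove special characters, then trim surrounding spaces.
        text = SPECIAL_CHARS.sub("", str(raw)).strip()
        caster = datatypes.get(column, str)
        try:
            # Validate the datatype by casting to the declared type.
            cleansed[column] = caster(text)
        except ValueError:
            summary_log.append((column, raw, "invalid datatype"))
    return cleansed
```

Columns that fail validation are left out of the cleansed output, so only valid values would reach the triple-quad generator.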
  • FIG. 4 illustrates a flowchart 400 of the method for dynamically processing data into a knowledge base repository, in accordance with an alternative embodiment of the present invention.
  • the method includes a step 402 of facilitating a system administrator with a user interface to configure at least one of an acceptable data input content and a data source based on a predefined terminological component (TBox) ontology of one or more domain of interest through an input configuration and terminological component (TBox) assignment module 202.
  • the acceptable data input content has a predefined structure.
  • the method then includes the step 404 of receiving the acceptable data input content from the system administrator through a parser 204.
  • the method includes the step 406 of validating the acceptable data input content together with the predefined structure of the acceptable data input content before mapping and converting one or more field headers to a concept and a concept-property as configured in the input configuration and terminological component (TBox) assignment module 202 through a field mapper and converter module 206.
  • the method includes the step 408 of processing the acceptable data input content to generate at least one of a plurality of semantic triples, and a plurality of semantic quads to define an assertion component (ABox) to ensure homogeneity and cardinality of data within the knowledge base repository during upload through an assertion component (ABox) generator module 208.
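Generating semantic triples or quads from cleansed rows, as in step 408, might look like the following sketch. The function name `build_statements`, the choice of one column as the subject, and the representation of a quad as a triple plus a graph name are assumptions for illustration, not details taken from the patent.

```python
def build_statements(rows, subject_column, predicates, graph=None):
    """Turn cleansed rows into triples, or into quads when a graph name is given."""
    statements = []
    for row in rows:
        subject = row[subject_column]
        for column, value in row.items():
            # Skip the subject column and any column without an assigned predicate.
            if column == subject_column or column not in predicates:
                continue
            triple = (subject, predicates[column], value)
            statements.append(triple + (graph,) if graph else triple)
    return statements
```

For example, a row `{"id": "emp1", "name": "Alice"}` with the predicate mapping `{"name": "hasName"}` would yield the triple `("emp1", "hasName", "Alice")`; passing `graph="hr"` would append `"hr"` as the fourth element.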
  • FIG. 5 illustrates a flowchart 500 of various steps performed by the input configuration and terminological component (TBox) assignment module 202 in a further embodiment of the present invention.
  • the user selects an acceptable data input/source from the pre-defined data input/source list 306.
  • at step 504, a preferred field header is added.
  • the control flow moves to step 508 to select the concept type of the field header based on the TBox definition library 302.
  • the control flow moves to step 514 to select the concept-property type and, at step 516, the object type is selected to assign to the field header.
  • at step 510, if the field header is not assigned as a concept-property, the control flow returns to step 504 to add the preferred field header.
  • at step 512, the selected object type of the field header is added to the concept definition list.
  • at step 518, the user can add multiple preferred field headers. If the user would like to add a new field header, the process repeats from step 504 to step 518.
  • at step 520, the user has to confirm the submission.
  • at step 520, if the user does not confirm the submission, the control flow returns to step 504 to add the preferred field header.
  • at step 522, the user confirms the submission of the concept definition list to the data declaration library 304.
  • FIG. 6 illustrates a flowchart 600 of various steps performed by the field mapper and converter module 206 in an embodiment of the present invention.
  • input/source and its structure 1102 shown in FIG. 11
  • FIG. 11 illustrates a perspective view 1100 of a sample excel file in the further embodiment of the present invention.
  • input/source and its structure are validated based on data input/source list 306. If the input and structure are not valid, the process ends at step 605. If the input and structure are determined to be valid, at step 606, the field header is extracted. At step 608, the extracted field header is mapped against the data declaration library 304.
  • the field header is checked to determine whether it is a valid concept, and if not, at step 614, it is further determined whether it is a valid concept-property. If the field header is not a valid concept-property, at step 616 the field header is logged to the summary log 310 as invalid field header 1104, as exemplified in FIG. 11. Returning to step 610, if the field header is a valid concept, at step 612, the field header is converted to its assigned concept 1200, as exemplified in FIG. 12, and stored in a temporary database 308. At step 614, if the field header is determined to be a valid concept-property, the field header is converted to its assigned concept-property at step 618.
  • FIG. 7 illustrates a process 700 of various steps performed by the assertion component (ABox) generator module 208 in a further embodiment of the present invention.
  • the process 700 is carried out after all the field headers are processed under FIG. 6.
  • the assertion component (ABox) generator module 208 includes a triple-quad generator 702 and an assertion component (ABox) uploader 704. The process 700 is initiated with a step
  • the column data for the header (such as R1-H1, R2-H2, ..., Rn-Hn) is read.
  • the column data is being validated against the TBox definition library 302 and the data declaration library 304.
  • the datatype matched with the datatype assigned to its designated field header. If the datatype does not match with the datatype assigned to its designated field header, at step 714, that field header and data is logged to the summary log 310 as an incorrect datatype.
  • cleansing is performed by trimming data with spaces and at step 718, removing unnecessary special characters, both based on data compliance rules 314, which is predefined by the system administrator.
  • the cleansed data is received by the triple quad generator 702 (further explanation in FIG. 8).
  • the output of the triple quad generator 702 is cleansed column data.
  • FIG. 8 illustrates a flowchart 800 of processes performed by a triple-quad generator 702 in one embodiment of the present invention.
  • the triple-quad generator 702 gets the cleansed column data.
  • the assigned predicate is extracted for the concept or concept-property of the field header from the TBox Definition Library 302 and Data Declaration Library 304.
  • semantic triples/quads are constructed as exemplified in FIG. 13.
  • the generated triples/quads are saved into the temporary database 308.
  • the triple-quad generator 702 determines if there are more predicates to be extracted. If there are, steps 806 and 808 are repeated.
  • FIG. 9 illustrates a flowchart 900 of steps performed by the ABox Uploader 704 in one embodiment of the present invention.
  • step 902 from the temporary database 308, the list of generated triples/quads is retrieved.
  • steps 904 and 906 for each triple/quad, validate whether the triple/quad is homogeneous with any of the existing content in the knowledge base repository 312.
  • step 914 log the triple/quad to summary log 310 as data that already exists.
  • a concept-property cardinality module 910 checks against the knowledge base repository 312 and the TBox definition library 302 to determine if the triple/quads exceeds the concept property cardinality.
  • the triple/quad is logged to summary log 310 at step 914.
  • the concept-property cardinality for the triple/quad has not exceeded the pre-defined value per unique URI, the triple/quad is added to knowledge base repository 312.
  • the ABox Uploader 704 determines if there are more triples/quads to be processed. If there are, the aforesaid processes are repeated, and if not, at step 920, a summary report of the summary log is generated once all triples/quads have been processed.
  • the present system and method provide an efficient, simpler and more elegant framework that automatically processes the data from standard templates or formats into the knowledge base repository. Further, the present system and method dynamically analyze the data, convert the analyzed data, and upload the converted data into the knowledge base's ABox to ensure its semantic relationship based on the defined TBox ontology.
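The triple-quad generation of FIG. 8 described above can be illustrated with a minimal Python sketch. This is an illustration only, not the disclosed implementation: the `PREDICATES` mapping stands in for the predicate assignments held in the TBox definition library 302 and data declaration library 304, and the `ex:` identifiers, header names, and function name are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical predicate assignments, standing in for the TBox definition
# library 302 and the data declaration library 304.
PREDICATES: Dict[str, str] = {
    "Employee Name": "ex:hasName",
    "Department": "ex:worksIn",
}

# A statement is an (s, p, o) triple or an (s, p, o, graph) quad.
Statement = Tuple[str, ...]


def build_statements(subject: str, row: Dict[str, str],
                     graph: Optional[str] = None) -> List[Statement]:
    """Construct one triple (or quad, if a graph is given) per mapped header."""
    statements: List[Statement] = []
    for header, value in row.items():
        predicate = PREDICATES.get(header)
        if predicate is None:
            continue  # no predicate assigned to this header; skip it
        triple = (subject, predicate, value)
        statements.append(triple + (graph,) if graph else triple)
    return statements
```

In this sketch a quad is simply a triple extended with a graph identifier, so the same loop serves both output forms.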

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Disclosed is a system and method for dynamically processing data into a knowledge base repository, comprising: facilitating (402) a system administrator to configure an acceptable data input content having a structure and a data source based on a predefined terminological component, TBox, ontology of a domain of interest through an input configuration and TBox assignment module (202); receiving (404) the acceptable data input content through a parser (204); validating (406) the acceptable data input content before mapping and converting field headers to a concept and a concept-property as configured in the input configuration and TBox assignment module (202) through a field mapper and converter module (206); and processing (408) the acceptable data input content to generate semantic triples and semantic quads to define an assertion component, ABox, to ensure homogeneity and cardinality of data within the knowledge base repository (312) during upload through an assertion component, ABox, generator module (208).

Description

SYSTEM AND METHOD FOR DYNAMICALLY PROCESSING DATA INTO A
KNOWLEDGE BASE REPOSITORY
FIELD OF INVENTION
[0001] The present invention relates to data processing, in particular to a system and method for dynamically processing data into a knowledge base repository.
BACKGROUND
[0002] Typically, processing of data involves creation of valid, useful information from collected data. Data processing includes classification, computation, coding, and updating. The processed data needs to be stored in the most suitable format and on the best available medium. Currently, storage and processing of data from standard templates/formats into a knowledge base repository require manual and extensive data validation and conversion, which leads to offline updates and slows the growth of the knowledge base. Manual data validation and conversion are conducted by a knowledge engineer who is knowledgeable about the structure and definition of a specific knowledge base repository, which is called an ontology.
[0003] Data collections from various sectors are often stored as comma-separated values (CSV) files or databases following a specific standard format. Knowledge engineers manually map and interpret those files or databases, based on the ontology design, into a format that is readable and accepted by the knowledge base. Data conversion usually misses validation of unnecessary special characters, which generates garbage data for the knowledge base.
[0004] The current process of data processing involves various problems, such as data from standard templates/formats that cannot be directly added to a knowledge base repository. Further, the data must first be analyzed and converted by a knowledge engineer to ensure that its semantic relationship complies with the defined terminological component (TBox) ontology before it can be added into an assertion component (ABox) of the knowledge base repository. Hence, dependency on the knowledge engineer is increased because the data has to be manually analyzed and manually added into the knowledge base repository. Further, manual conversion is needed to ensure that its semantic relationship complies with the defined terminological component ontology (TBox) before it can be added into the ABox of the knowledge base repository. Additionally, data conversion to the ABox takes a longer time, resulting in slow updates and slow growth of the knowledge base.
[0005] US patent number 9,406,018 B2 filed by Upadhyaya, et al. discloses a system for data integration. The system disclosed in the patent includes a semantic data integrator. The semantic data integrator includes an ontology repository containing an ontology. The ontology includes an ontology level and a concept in an ontology level. The semantic data integrator includes a query interface module that receives an input command from the query interface and maps the input command to concepts in the ontology repository. Mapped commands may further be composed into subcommands from the input command. The data sources are queried in accordance with the composed data queries, wherein one or more of the data sources queried are tagged with concepts from the ontology repository. However, the system does not accept data from input files and data mapping is fixed as defined in the ontology. The system requires a direct update of the actual ontology for the mapping to adjust. Also, the Upadhyaya, et al. patent updates the ontology repository which only contains the ontologies but does not update the data repository.
[0006] US patent publication number US 20150310676 A1 filed by Lambert, Daniel, et al. discloses a system for a dynamic uploading protocol. The system disclosed in the Lambert, Daniel, et al. publication includes an input interface configured to receive a manifest including events which may be uploaded. The manifest additionally includes sensor information relating to each of the plurality of events. The system further includes a processor to determine whether to upload additional information about each event. Determining whether to upload additional information about each event is based in part on the sensor information and contextual information. The system also includes an output interface to request additional information. However, the input interface disclosed in the Lambert, Daniel, et al. publication does not allow target users to configure it, and the configuration cannot be changed at any time. Also, the data and structure disclosed in the Lambert, Daniel, et al. publication are specific as configured initially, and the accepted data is fixed.
[0007] Accordingly, there is a need for a system and method to reduce the dependency on the knowledge engineer and automatically process the data into the knowledge base repository. Further, there is a need for a system and method to reduce the time duration to convert the data while updating the data in the knowledge base repository.
SUMMARY
[0008] In one aspect of the present invention, there is provided a method for dynamically processing data into a knowledge base repository, through a dynamic data processing device. The method comprises facilitating a system administrator with a user interface to configure at least one of an acceptable data input content and a data source based on a predefined terminological component, TBox, ontology of one or more domain of interest through an input configuration and TBox assignment module, wherein the acceptable data input content is having a predefined structure; receiving the acceptable data input content from the system administrator through a parser; validating the acceptable data input content together with the predefined structure of the acceptable data input content before mapping and converting one or more field headers to a concept and a concept-property as configured in the input configuration and terminological component (TBox) assignment module through a field mapper and converter module; and processing the acceptable data input content to generate at least one of a plurality of semantic triples, and a plurality of semantic quads to define an assertion component, ABox, to ensure homogeneity and cardinality of data within the knowledge base repository during upload through an assertion component, ABox, generator module.
[0009] In one embodiment, the input configuration and terminological component (TBox) assignment module performs steps comprising: selecting at least one of the acceptable data input content and the data source content; adding one or more preferred field headers; assigning the concept and the concept-property with an object type of the concept and the concept-property to the field headers based on a terminological component (TBox) definition library; adding the field headers to a concept definition list; and confirming the concept definition list to a data declaration library.
[0010] In another embodiment, the field mapper and converter module performs a plurality of steps comprising: identifying and validating the acceptable data input content and structure based on a data input/source list; extracting the field header; mapping the extracted field header; validating the field header as concept and concept-property; converting the field header to the assigned concept and concept-property to be stored in a temporary database; and logging the converted data to a summary log on identifying an invalid field header.
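The field mapping and conversion steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the `DATA_DECLARATION_LIBRARY` dictionary, its header names, and the `ex:` terms are hypothetical stand-ins for whatever the system administrator has configured, and the summary log is modeled as a plain list of strings.

```python
from typing import Dict, List, Tuple

# Hypothetical data declaration library: field header -> (kind, assigned term).
DATA_DECLARATION_LIBRARY: Dict[str, Tuple[str, str]] = {
    "Employee ID": ("concept", "ex:Employee"),
    "Employee Name": ("concept-property", "ex:hasName"),
}


def map_headers(headers: List[str]) -> Tuple[Dict[str, Tuple[str, str]], List[str]]:
    """Return the converted headers and the summary-log entries for invalid ones."""
    converted: Dict[str, Tuple[str, str]] = {}
    summary_log: List[str] = []
    for header in headers:
        entry = DATA_DECLARATION_LIBRARY.get(header.strip())
        if entry is None:
            # Neither a valid concept nor a valid concept-property.
            summary_log.append(f"invalid field header: {header}")
        else:
            converted[header] = entry
    return converted, summary_log
```

Headers that match neither a concept nor a concept-property are logged rather than rejected outright, so the remaining columns of the input can still be processed.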
[0011] In yet another embodiment, the assertion component (ABox) generator module performs a plurality of steps comprising: retrieving row data; reading and validating the datatype of column data; logging the read and validated datatype of the column data to a summary log on identifying invalid data; and trimming data with spaces and removing one or more special characters based on one or more data compliance rules to output cleansed data. The cleansed data is received by a triple-quad generator for generating triples/quads, and the triples/quads are uploaded to the knowledge base repository by an assertion component, ABox, uploader.
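The datatype validation and cleansing steps above can be sketched in Python as follows. The compliance rule shown (a regular expression listing the permitted characters) is an assumption made for illustration; in the described system the data compliance rules are predefined by the system administrator, and the function names here are hypothetical.

```python
import re
from typing import List, Optional

# Hypothetical compliance rule: keep word characters, whitespace, and a few
# punctuation marks; anything else counts as an unnecessary special character.
DISALLOWED = re.compile(r"[^\w\s.,@-]")


def cleanse(value: str) -> str:
    """Trim surrounding spaces, then remove disallowed special characters."""
    return DISALLOWED.sub("", value.strip())


def validate_and_cleanse(header: str, value: str, datatype: type,
                         summary_log: List[str]) -> Optional[str]:
    """Check the value against the header's assigned datatype, then cleanse it."""
    try:
        datatype(value.strip())
    except ValueError:
        summary_log.append(f"incorrect datatype: {header}={value!r}")
        return None
    return cleanse(value)
```

Values that fail the datatype check are logged and excluded, which keeps them out of the later triple/quad generation.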
[0012] In an alternative embodiment, the assertion component (ABox) generator module comprises the triple-quad generator, which performs a plurality of steps comprising: obtaining cleansed column data; obtaining the assigned predicate for the concept or concept-property; constructing semantic triples/quads; and storing the generated triples/quads to the temporary database; and the assertion component, ABox, uploader, which performs a plurality of steps comprising: retrieving a list of generated triples/quads from the temporary database; validating each triple/quad for homogeneity against the knowledge base repository; checking whether the concept-property cardinality is exceeded; logging to the summary log on determining that the triple/quad is homogeneous and exceeding cardinality; uploading the triple/quad to the knowledge base repository; and generating a summary report.
[0013] In another aspect of the present invention, there is provided a system for dynamically processing data into a knowledge base repository. The system comprises a processor and a memory communicatively coupled to the processor, wherein the memory stores instructions for processing data; a user interface adapted for facilitating a system administrator to configure at least one of an acceptable data input content and a data source based on a predefined terminological component, TBox, ontology of one or more domains of interest through an input configuration and TBox assignment module, wherein the acceptable data input content has a predefined structure; a parser adapted for receiving the acceptable data input content from the system administrator; a field mapper and converter module adapted for validating the acceptable data input content together with the predefined structure of the acceptable data input content before mapping and converting one or more field headers to a concept and a concept-property as configured in the input configuration and terminological component (TBox) assignment module; and an assertion component, ABox, generator module for processing the acceptable data input content to generate at least one of a plurality of semantic triples, and a plurality of semantic quads to define an assertion component, ABox, to ensure homogeneity and cardinality of data within the knowledge base repository during upload.
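The homogeneity and cardinality checks performed during upload can be sketched in Python as follows. This is a simplified illustration, not the disclosed implementation: the repository is modeled as an in-memory set of triples, homogeneity is interpreted as "the same triple already exists", and the `MAX_CARDINALITY` limits and `ex:` identifiers are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]

# Hypothetical cardinality limits per concept-property (values per subject URI).
MAX_CARDINALITY: Dict[str, int] = {"ex:hasName": 1}


def upload(triples: List[Triple], repository: Set[Triple]) -> List[str]:
    """Add new triples to the repository and return the summary log."""
    summary_log: List[str] = []
    for triple in triples:
        subject, predicate, _ = triple
        if triple in repository:
            summary_log.append(f"data already exists: {triple}")
            continue
        limit = MAX_CARDINALITY.get(predicate)
        # Count values already stored for this subject and concept-property.
        existing = sum(1 for s, p, _ in repository if (s, p) == (subject, predicate))
        if limit is not None and existing >= limit:
            summary_log.append(f"cardinality exceeded: {triple}")
            continue
        repository.add(triple)
    return summary_log
```

Counting existing values per subject and predicate mirrors the "pre-defined value per unique URI" wording of the uploader description; a production system would more likely issue a query against the repository than scan it in memory.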
[0014] In one embodiment, the input configuration and terminological component (TBox) assignment module is further adapted to perform steps comprising: selecting at least one of the acceptable data input content and the data source content; adding one or more preferred field headers; assigning a concept and a concept-property with an object type of the concept and the concept-property to the field headers based on a terminological component (TBox) definition library; adding the field header to a concept definition list; and confirming the concept definition list to a data declaration library.
[0015] In another embodiment, the field mapper and converter module is adapted for identifying and validating the acceptable data input content and structure based on a data input/source list; extracting the field header; mapping the extracted field header; validating the field header as concept and concept-property; converting the field header to the assigned concept and concept-property to be stored in a temporary database; and logging the converted data to a summary log on identifying an invalid field header.
[0016] In a further embodiment, the assertion component, ABox, generator module is adapted for retrieving row data; reading and validating datatype of column data; logging the read and validated datatype of the column data to the summary log on identifying invalid data; and trimming data with space; and removing one or more special characters based on one or more data compliance rules.
[0017] In an alternative embodiment, the assertion component (ABox) generator module comprises a triple-quad generator adapted for obtaining cleansed column data; extracting the assigned predicate for the concept or concept-property; constructing semantic triples/quads; and storing the generated triples/quads to the temporary database; and an assertion component, ABox, uploader adapted for retrieving triples/quads from the temporary database; validating each triple/quad for homogeneity against the knowledge base repository; checking whether the concept-property cardinality is exceeded; logging to the summary log on determining that the triple/quad is homogeneous and exceeding cardinality; uploading the triple/quad to the knowledge base repository; and generating the summary report.
BRIEF DESCRIPTION OF THE DRAWINGS
[0018] In the figures, similar components and/or features may have the same reference label. Further, various components of the same type may be distinguished by following the reference label with a second label that distinguishes among the similar components. If only the first reference label is used in the specification, the description applies to any one of the similar components having the same first reference label irrespective of the second reference label.
[0019] FIG. 1 illustrates a block diagram of the present system for dynamically processing data into a knowledge base repository, in accordance with one embodiment of the present invention.
[0020] FIG. 2 illustrates a block diagram of modules within a memory of a dynamic data processing device for dynamically processing data into a knowledge base repository, in accordance with another embodiment of the present invention.
[0021] FIG. 3 illustrates an operational flowchart of the present system for dynamically processing data into the knowledge base repository in another embodiment of the present invention.
[0022] FIG. 4 illustrates a flowchart of the method for dynamically processing data into the knowledge base repository, in accordance with an alternative embodiment of the present invention.
[0023] FIG. 5 illustrates a flowchart of various steps performed by the input configuration and terminological component (TBox) assignment module in a further embodiment of the present invention.
[0024] FIG. 6 illustrates a flowchart of various steps performed by field mapper and converter module in yet another embodiment of the present invention.
[0025] FIG. 7 illustrates a flowchart of various steps performed by an assertion component (ABox) generator module in a further embodiment of the present invention.
[0026] FIG. 8 illustrates a flowchart of various steps performed by a triple-quad generator in another embodiment of the present invention.
[0027] FIG. 9 illustrates a flowchart of various steps performed by ABox Uploader in the further embodiment of the present invention.
[0028] FIG. 10 illustrates a perspective view of an input configuration and TBox assignment user interface in the further embodiment of the present invention.
[0029] FIG. 11 illustrates a perspective view of a sample excel file in the further embodiment of the present invention.
[0030] FIG. 12 illustrates a perspective view of the assigned concept/concept-property in the further embodiment of the present invention.
[0031] FIG. 13 illustrates a perspective view of a sample generated semantic triples in the further embodiment of the present invention.
DETAILED DESCRIPTION
[0032] The present invention is best understood with reference to the detailed figures and description set forth herein. Various embodiments have been discussed with reference to the figures. However, those skilled in the art will readily appreciate that the detailed descriptions provided herein with respect to the figures are merely for explanatory purposes, as the methods and systems may extend beyond the described embodiments. For instance, the teachings presented and the needs of a particular application may yield multiple alternative and suitable approaches to implement the functionality of any detail described herein. Therefore, any approach may extend beyond certain implementation choices in the following embodiments.
[0033] Systems and methods are disclosed for dynamically processing data into a knowledge base repository. Embodiments of the present invention include various steps, which will be described below. The steps may be performed by hardware components or may be embodied in machine-executable instructions, which may be used to cause a general-purpose or special-purpose processor programmed with the instructions to perform the steps. Alternatively, steps may be performed by a combination of hardware, software, firmware, and/or by human operators.
[0034] Embodiments of the present invention may be provided as a computer program product, which may include a machine-readable storage medium tangibly embodying thereon instructions, which may be used to program a computer (or other electronic devices) to perform a process. The machine-readable medium may include, but is not limited to, fixed (hard) drives, magnetic tape, floppy diskettes, optical disks, compact disc read-only memories (CD-ROMs), and magneto-optical disks, semiconductor memories, such as ROMs, PROMs, random access memories (RAMs), programmable read-only memories (PROMs), erasable PROMs (EPROMs), electrically erasable PROMs (EEPROMs), flash memory, magnetic or optical cards, or other type of media/machine-readable medium suitable for storing electronic instructions (e.g., computer programming code, such as software or firmware).
[0035] Various methods described herein may be practiced by combining one or more machine-readable storage media containing the code according to the present invention with appropriate standard computer hardware to execute the code contained therein. An apparatus for practicing various embodiments of the present invention may involve one or more computers (or one or more processors within a single computer) and storage systems containing or having network access to computer program(s) coded in accordance with various methods described herein, and the method steps of the invention could be accomplished by modules, routines, subroutines, or subparts of a computer program product.
[0036] The present invention discloses a system and method whereby data is automatically processed from standard templates or formats into the knowledge base repository. Further, the present system and method dynamically analyze the data, convert the analyzed data, and upload the converted data into an assertion component (ABox) of the knowledge base repository to ensure its semantic relationship based on a defined terminological component ontology (TBox). Typically, TBox is a “terminological component” and ABox is an “assertion component”. TBox and ABox describe two different types of statements in the knowledge base repository. TBox statements describe a conceptualization of a domain of interest by defining different sets of individuals described in terms of their characteristics (properties). Generally, TBox statements are associated with object-oriented classes and ABox statements are associated with instances of those classes.
[0037] Although the present invention has been described with the purpose of dynamically processing data into a knowledge base repository, it should be appreciated that the same has been done merely to illustrate the invention in an exemplary manner and to highlight any other purpose or function for which explained structures or configurations could be used and is covered within the scope of the present invention.
[0038] The term “machine-readable storage medium” or “computer-readable storage medium” includes, but is not limited to, portable or non-portable storage devices, optical storage devices, and various other mediums capable of storing, containing, or carrying instruction(s) and/or data. A machine-readable medium may include a non-transitory medium in which data can be stored, and that does not include carrier waves and/or transitory electronic signals propagating wirelessly or over wired connections. Examples of a non-transitory medium may include, but are not limited to, a magnetic disk or tape, optical storage media such as compact disk (CD) or digital versatile disk (DVD), flash memory, memory or memory devices.
[0039] FIG. 1 illustrates a block diagram of a system 100 for dynamically processing data into a knowledge base repository, in accordance with one embodiment of the present invention. The system 100 includes a dynamic data processing device 102 that automates the data processing from standard templates or a format into the knowledge base repository.
In particular, the dynamic data processing device 102 facilitates a system administrator with a user interface 116 to configure at least one of an acceptable data input content and a data source based on a predefined terminological component (TBox) ontology of one or more domain of interest through an input configuration and terminological component (TBox) assignment module. The acceptable data input content is having a predefined structure.
[0040] The dynamic data processing device 102 is then configured to receive the acceptable data input content from the system administrator through a parser. Further, the dynamic data processing device 102 is configured to validate the acceptable data input content together with the predefined structure of the acceptable data input content before mapping and converting one or more field headers to a concept and a concept-property as configured in the input configuration and terminological component (TBox) assignment module through a field mapper and converter module.
[0041] Furthermore, the dynamic data processing device 102 is configured to process the acceptable data input content to generate at least one of a plurality of semantic triples, and a plurality of semantic quads to define an assertion component (ABox) to ensure homogeneity and cardinality of data within the knowledge base repository during upload through an assertion component (ABox) generator module.
[0042] The data dynamically processed into the knowledge base repository may be presented to the user by a plurality of computing devices 104, for example, a laptop 104a, a desktop 104b, and a smartphone 104c. Other examples of the plurality of computing devices 104 may include, but are not limited to, a phablet and a tablet. Alternatively, the dynamically processed data may be stored on a server 106 and may be accessed by the plurality of computing devices 104 via a network 108. The network 108 may be a wired or a wireless network, and the examples may include, but are not limited to, the Internet, Wireless Local Area Network (WLAN), Wi-Fi, Long Term Evolution (LTE), Worldwide Interoperability for Microwave Access (WiMAX), and General Packet Radio Service (GPRS).
[0043] When a user of laptop 104a, for example, wants to visualize the dynamically processed data, the laptop 104a communicates the same with the dynamic data processing device 102 via the network 108. The dynamic data processing device 102 then presents the dynamically processed data as per the user's request. To this end, the dynamic data processing device 102 includes a processor 110 that is communicatively coupled to a memory 112, which may be a non-volatile memory or a volatile memory. Examples of non-volatile memory may include, but are not limited to, flash memory, a Read Only Memory (ROM), a Programmable ROM (PROM), Erasable PROM (EPROM), and Electrically Erasable PROM (EEPROM) memory. Examples of volatile memory may include, but are not limited to, Dynamic Random Access Memory (DRAM) and Static Random-Access Memory (SRAM).
[0044] The processor 110 may include at least one data processor for executing program components for executing user- or system-generated requests. A user may include a person, a person using a device such as those included in this invention, or such a device itself. The processor 110 may include specialized processing units such as integrated system (bus) controllers, memory management control units, floating point units, graphics processing units, digital signal processing units, etc.
[0045] The processor 110 may include a microprocessor, such as AMD® ATHLON® microprocessor, DURON® microprocessor or OPTERON® microprocessor, ARM's application, embedded or secure processors, IBM® POWERPC®, INTEL'S CORE® processor, ITANIUM® processor, XEON® processor, CELERON® processor or other line of processors, etc. The processor 110 may be implemented using mainframe, distributed processor, multi-core, parallel, grid, or other architectures. Some embodiments may utilize embedded technologies like application-specific integrated circuits (ASICs), digital signal processors (DSPs), Field Programmable Gate Arrays (FPGAs), etc.
[0046] The processor 110 may be disposed in communication with one or more input/output (I/O) devices via an I/O interface. I/O interface may employ communication protocols/methods such as, without limitation, audio, analog, digital, RCA, stereo, IEEE-1394, serial bus, universal serial bus (USB), infrared, PS/2, BNC, coaxial, component, composite, digital visual interface (DVI), high-definition multimedia interface (HDMI), RF antennas, S-Video, VGA, IEEE 802.n/b/g/n/x, Bluetooth, cellular (e.g., code-division multiple access (CDMA), high-speed packet access (HSPA+), global system for mobile communications (GSM), long-term evolution (LTE), WiMax, or the like), etc.
[0047] The memory 112 further includes various modules that enable the dynamic data processing device 102 to dynamically process data into the knowledge base repository. These modules are explained in detail in conjunction with FIG. 2. The dynamic data processing device 102 may further include a display 114 having a User Interface (UI) 116 that may be used by a user or an administrator to initiate a request to view the dynamically processed data in a specific domain and provide various inputs to the dynamic data processing device 102. The display 114 may further be used to display the dynamically processed data. The functionality of the dynamic data processing device 102 may alternatively be configured within each of the plurality of computing devices 104.
[0048] Typically, the terms "ABox" and "TBox" are used to describe two different types of statements in the knowledge base repository. TBox statements describe a conceptualization of a domain of interest by defining different sets of individuals described in terms of their characteristics (properties). ABox statements are TBox-compliant statements about individuals belonging to these sets. For example, a specific employee is an individual in the set called "employee". This set can be defined as a subset of all people that work in some service or manufacturing industries, making it possible to state the specific organization where each individual works.
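By way of a non-limiting illustration, the TBox/ABox distinction described above may be sketched in Python as follows. The names `TBOX` and `is_tbox_compliant`, and the `ex:` identifiers, are hypothetical and chosen only for this example; they are not taken from the described system.

```python
# Hypothetical TBox: each concept (set of individuals) maps to the
# properties that its individuals are allowed to carry.
TBOX = {
    "ex:Person": {"ex:name"},
    "ex:Employee": {"ex:name", "ex:worksFor"},
}


def is_tbox_compliant(statement, individual_types):
    """Return True if an ABox statement uses a property that the TBox
    allows for the concept the subject belongs to."""
    subject, predicate, _obj = statement
    concept = individual_types.get(subject)
    return concept in TBOX and predicate in TBOX[concept]


# ABox: statements about one specific individual in the "employee" set.
individual_types = {"ex:alice": "ex:Employee"}
compliant = ("ex:alice", "ex:worksFor", "ex:AcmeManufacturing")
non_compliant = ("ex:alice", "ex:salaryGrade", "7")

print(is_tbox_compliant(compliant, individual_types))
print(is_tbox_compliant(non_compliant, individual_types))
```

In this sketch, the statement about the organization where the employee works is accepted because the hypothetical TBox defines `ex:worksFor` for employees, while the second statement is rejected because its property is not defined.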
[0049] FIG. 2 illustrates a block diagram of the various modules within the memory 112 of the dynamic data processing device 102 for dynamically processing data into a knowledge base repository, in accordance with another embodiment of the present invention. The memory 112 includes an input configuration and terminological component (TBox) assignment module 202, a parser 204, a field mapper and converter module 206, and an assertion component (ABox) generator module 208.
[0050] The input configuration and terminological component (TBox) assignment module 202 facilitates a system administrator with a user interface (shown in FIG. 10) to configure at least one of an acceptable data input content and a data source based on a predefined terminological component (TBox) ontology of one or more domains of interest. FIG. 10 exemplifies a screenshot 1000 of an input configuration and TBox assignment user interface in one embodiment of the present invention. The acceptable data input content has a predefined structure.
[0051] The parser 204 is adaptable to receive the acceptable data input content from the system administrator. The field mapper and converter module 206 is operable to validate the acceptable data input content together with the predefined structure of the acceptable data input content before mapping and converting one or more field headers to a concept and a concept-property as configured in the input configuration and terminological component (TBox) assignment module 202.
[0052] The assertion component (ABox) generator module 208 processes the acceptable data input content to generate at least one of a plurality of semantic triples, and a plurality of semantic quads to define an assertion component (ABox) to ensure homogeneity and cardinality of data within the knowledge base repository during upload. In an embodiment, the assertion component (ABox) generator module 208 includes a triple-quad generator 702 (shown and explained in FIG. 7, with operations therefor further provided in FIG. 8) and an assertion component (ABox) uploader 704 (shown and explained in detail in FIG. 7, with operations therefor further provided in FIG. 9). The triple-quad generator 702 performs a plurality of steps that includes obtaining cleansed column data, obtaining the assigned predicate for the concept or concept-property, constructing semantic triples/quads, and storing the generated triples/quads to a temporary database 308.
[0053] Further, the assertion component (ABox) uploader 704 performs a plurality of steps that includes retrieving a list of generated triples/quads from the temporary database 308, validating each triple/quad for homogeneity against the knowledge base repository, checking whether the concept-property cardinality is exceeded, logging to a summary log upon determining that the triple/quad is homogeneous with existing content or exceeds the cardinality, uploading the triple/quad to the knowledge base repository, and generating the summary report.
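The upload checks described above may be sketched, purely as an illustration, in Python. The function `upload_triples`, the use of a Python set as a stand-in for the knowledge base repository, and the per-predicate limits in `max_cardinality` are assumptions made for this example and do not represent the actual implementation.

```python
def upload_triples(triples, repository, max_cardinality):
    """Add each triple to the repository unless it already exists or
    would exceed the allowed cardinality; return the summary log."""
    summary_log = []
    for triple in triples:
        subject, predicate, _obj = triple
        # Homogeneity check: identical content is already in the repository.
        if triple in repository:
            summary_log.append((triple, "data already exists"))
            continue
        # Cardinality check: count existing values for this subject/predicate.
        limit = max_cardinality.get(predicate)
        current = sum(
            1 for s, p, _ in repository if s == subject and p == predicate
        )
        if limit is not None and current >= limit:
            summary_log.append((triple, "cardinality exceeded"))
            continue
        repository.add(triple)
    return summary_log


# Hypothetical data: one existing statement and three pending triples.
repository = {("ex:alice", "ex:worksFor", "ex:Acme")}
pending = [
    ("ex:alice", "ex:worksFor", "ex:Acme"),
    ("ex:alice", "ex:worksFor", "ex:Globex"),
    ("ex:alice", "ex:name", "Alice"),
]
for entry in upload_triples(pending, repository, {"ex:worksFor": 1}):
    print(entry)
```

Here the first pending triple is logged as already existing, the second is logged because the assumed limit of one `ex:worksFor` value per subject is reached, and only the third is added.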
[0054] FIG. 3 illustrates an operational flowchart 300 of the present system 100 for dynamically processing data into a knowledge base repository 312 in another embodiment of the present invention. FIG. 3 is explained in conjunction with FIG. 2. The input configuration and terminological component (TBox) assignment module 202 performs a plurality of steps. The steps include selecting at least one of the acceptable data input content and the data source content. The input configuration and terminological component (TBox) assignment module 202 then adds one or more preferred field headers. The input configuration and terminological component (TBox) assignment module 202 assigns the concept and the concept-property with an object type of the concept and the concept-property to the field headers based on a terminological component (TBox) definition library 302. The input configuration and terminological component (TBox) assignment module 202 adds to a concept definition list and confirms the concept definition list to a data declaration library 304.
[0055] In an embodiment, the field mapper and converter module 206 performs a plurality of steps. The steps include identifying and validating the acceptable data input content and structure based on a data input/source list 306. The field mapper and converter module 206 extracts the field header. The field mapper and converter module 206 maps the extracted field header and validates the field header as a concept or a concept-property. The field mapper and converter module 206 converts the field header to the assigned concept or concept-property to be stored in a temporary database 308, and logs the field header to a summary log 310 upon identifying an invalid field header.
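As a minimal sketch of the field mapping and conversion steps just described, the following Python function maps extracted field headers against a declaration library and logs invalid headers. The function name `map_field_headers`, the dictionary layout of `declaration_library`, and the sample headers are hypothetical and serve only to illustrate the flow.

```python
def map_field_headers(headers, declaration_library):
    """Convert each valid field header to its assigned concept or
    concept-property; log headers that have no valid assignment."""
    converted = {}
    summary_log = []
    for header in headers:
        entry = declaration_library.get(header)
        if entry is None or entry["kind"] not in ("concept", "concept-property"):
            summary_log.append((header, "invalid field header"))
            continue
        converted[header] = entry
    return converted, summary_log


# Hypothetical declaration library, as confirmed by the administrator.
declaration_library = {
    "Employee ID": {"kind": "concept", "assigned": "ex:Employee"},
    "Company": {"kind": "concept-property", "assigned": "ex:worksFor"},
}
converted, summary_log = map_field_headers(
    ["Employee ID", "Company", "Notes"], declaration_library
)
print(converted)
print(summary_log)
```

In this example the header "Notes" has no assignment and is therefore logged rather than converted.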
[0056] In an embodiment, the assertion component (ABox) generator module 208 performs a plurality of steps that include retrieving row data. The assertion component (ABox) generator module 208 reads and validates the datatype of column data. The assertion component (ABox) generator module 208 logs the read and validated datatype of the column data to the summary log 310 upon identifying invalid data. The assertion component (ABox) generator module 208 trims data with spaces and removes one or more special characters based on one or more data compliance rules 314.
[0057] FIG. 4 illustrates a flowchart 400 of the method for dynamically processing data into a knowledge base repository, in accordance with an alternative embodiment of the present invention. The method includes a step 402 of facilitating a system administrator with a user interface to configure at least one of an acceptable data input content and a data source based on a predefined terminological component (TBox) ontology of one or more domains of interest through an input configuration and terminological component (TBox) assignment module 202. The acceptable data input content has a predefined structure.
[0058] The method then includes the step 404 of receiving the acceptable data input content from the system administrator through a parser 204. The method includes the step 406 of validating the acceptable data input content together with the predefined structure of the acceptable data input content before mapping and converting one or more field headers to a concept and a concept-property as configured in the input configuration and terminological component (TBox) assignment module 202 through a field mapper and converter module 206. The method includes the step 408 of processing the acceptable data input content to generate at least one of a plurality of semantic triples, and a plurality of semantic quads to define an assertion component (ABox) to ensure homogeneity and cardinality of data within the knowledge base repository during upload through an assertion component (ABox) generator module 208.
[0059] FIG. 5 illustrates a flowchart 500 of various steps performed by the input configuration and terminological component (TBox) assignment module 202 in a further embodiment of the present invention. At step 502, the user selects an acceptable data input/source from the pre-defined data input/source list 306. At step 504, a preferred field header is added. At step 506, if the field header is assigned as a concept, the control flow moves to step 508 to select the concept type of the field header based on the TBox definition library 302. At step 510, if the field header is assigned as a concept-property, the control flow moves to step 514 to select the concept-property type and, at step 516, the object type is selected and assigned to the field header. If, at step 510, the field header is not assigned as a concept-property, the control flow returns to step 504 to add the preferred field header. At step 512, the selected object type of the field header is added to the concept definition list. At step 518, the user can add multiple preferred field headers; if the user would like to add a new field header, the process repeats from step 504 to step 518. At step 520, the user has to confirm the submission. If the user does not confirm the submission at step 520, the control flow returns to step 504 to add the preferred field header. At step 522, the user confirms the submission of the concept definition list to the data declaration library 304.
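The assignment flow of FIG. 5 may be illustrated, under stated assumptions, by the following Python sketch. The `FieldAssignment` record, the `add_assignment` function and the layout of `tbox_library` are hypothetical constructs introduced for this example only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FieldAssignment:
    header: str
    kind: str  # "concept" or "concept-property"
    tbox_type: str  # type selected from the TBox definition library
    object_type: Optional[str] = None  # required for a concept-property


def add_assignment(definition_list, assignment, tbox_library):
    """Append the assignment to the concept definition list if it is
    valid; return False so the caller can go back to adding a header."""
    if assignment.kind not in ("concept", "concept-property"):
        return False
    if assignment.tbox_type not in tbox_library[assignment.kind]:
        return False
    if assignment.kind == "concept-property" and assignment.object_type is None:
        return False
    definition_list.append(assignment)
    return True


# Hypothetical TBox definition library contents.
tbox_library = {
    "concept": {"ex:Employee"},
    "concept-property": {"ex:worksFor"},
}
definition_list = []
add_assignment(
    definition_list,
    FieldAssignment("Employee ID", "concept", "ex:Employee"),
    tbox_library,
)
add_assignment(
    definition_list,
    FieldAssignment("Company", "concept-property", "ex:worksFor", "ex:Organization"),
    tbox_library,
)
print([a.header for a in definition_list])
```

Once the user confirms the submission, a list such as `definition_list` would be what is submitted to the data declaration library; that confirmation step is omitted from this sketch.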
[0060] FIG. 6 illustrates a flowchart 600 of various steps performed by the field mapper and converter module 206 in an embodiment of the present invention. At step 602, the input/source and its structure 1102 (shown in FIG. 11) are identified. FIG. 11 illustrates a perspective view 1100 of a sample Excel file in the further embodiment of the present invention. At step 604, the input/source and its structure are validated based on the data input/source list 306. If the input and structure are not valid, the process ends at step 605. If the input and structure are determined to be valid, at step 606, the field header is extracted. At step 608, the extracted field header is mapped against the data declaration library 304. At step 610, it is determined whether the field header is a valid concept, and if not, at step 614, it is further determined whether it is a valid concept-property. If the field header is not a valid concept-property, at step 616 the field header is logged to the summary log 310 as an invalid field header 1104, as exemplified in FIG. 11. Returning to step 610, if the field header is a valid concept, at step 612, the field header is converted to its assigned concept 1200, as exemplified in FIG. 12, and stored in a temporary database 308. At step 614, if the field header is determined to be a valid concept-property, the field header is converted to its assigned concept-property at step 618. At step 618, the converted concept/concept-property is also stored in the temporary database 308. At step 620, on identifying a succeeding field header, the control flow proceeds in a loop from step 606 to step 618 until all the extracted field headers are processed.
[0061] FIG. 7 illustrates a process 700 of various steps performed by the assertion component (ABox) generator module 208 in a further embodiment of the present invention. The process 700 is carried out after all the field headers are processed under FIG. 6. The assertion component (ABox) generator module 208 includes a triple-quad generator 702 and an assertion component (ABox) uploader 704. The process 700 is initiated with a step 706 of getting content row data from various rows, e.g. R1, R2...Rn. At step 708, the column data for the header (such as R1-H1, R2-H2...Rn-Hn) is read. At step 710, the column data is validated against the TBox definition library 302 and the data declaration library 304. At step 712, for every column data within the content row data, its datatype is validated to determine whether it matches the datatype assigned to its designated field header. If the datatype does not match the datatype assigned to its designated field header, at step 714, that field header and data are logged to the summary log 310 as an incorrect datatype. If the datatype matches the datatype assigned to its designated field header, at step 716, cleansing is performed by trimming data with spaces and, at step 718, removing unnecessary special characters, both based on data compliance rules 314, which are predefined by the system administrator. The cleansed data is received by the triple-quad generator 702 (further explanation in FIG. 8). The output of the triple-quad generator 702 is cleansed column data.
[0062] The validations, matchings and cleansings of the data are looped through the steps 703 and 705 to ensure that all the column data and row data are processed accordingly. The processed data is transmitted to the ABox Uploader 704, which will be further illustrated in FIG. 9.
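The datatype validation and cleansing of one row, as described for FIG. 7, may be sketched as follows. This is an illustrative Python example only: the functions `cleanse` and `validate_and_cleanse_row`, the regular expression standing in for a data compliance rule, and the use of Python types as header datatypes are all assumptions.

```python
import re


def cleanse(value, disallowed_pattern=r"[^\w\s@.\-]"):
    """Trim surrounding spaces, then remove characters that the
    (assumed) compliance rule treats as unnecessary special characters."""
    return re.sub(disallowed_pattern, "", value.strip())


def validate_and_cleanse_row(row, header_types):
    """Validate each column value against the datatype assigned to its
    header; cleanse valid values and log invalid ones."""
    cleansed = {}
    summary_log = []
    for header, raw in row.items():
        expected_type = header_types[header]
        try:
            expected_type(raw.strip())  # raises ValueError on a mismatch
        except ValueError:
            summary_log.append((header, raw, "incorrect datatype"))
            continue
        cleansed[header] = cleanse(raw)
    return cleansed, summary_log


# Hypothetical row and header datatypes.
row = {"Name": "  Alice#  ", "Age": "thirty"}
header_types = {"Name": str, "Age": int}
print(validate_and_cleanse_row(row, header_types))
```

Looping such a function over every row corresponds to the repetition through steps 703 and 705; the loop itself is left out for brevity.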
[0063] FIG. 8 illustrates a flowchart 800 of processes performed by a triple-quad generator 702 in one embodiment of the present invention. At step 802, the triple-quad generator 702 gets the cleansed column data. At step 804, the assigned predicate is extracted for the concept or concept-property of the field header from the TBox Definition Library 302 and the Data Declaration Library 304. At step 806, semantic triples/quads are constructed as exemplified in FIG. 13. At step 808, the generated triples/quads are saved into the temporary database 308. At step 810, the triple-quad generator 702 determines if there are more predicates to be extracted. In case there are more predicates, the steps 806 and 808 are repeated.
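A minimal sketch of the triple/quad construction of FIG. 8 is given below in Python. The function `build_statements`, the representation of a triple as a 3-tuple and a quad as a 4-tuple with a graph identifier, and the sample predicates are assumptions made for illustration; the actual format shown in FIG. 13 may differ.

```python
def build_statements(subject_uri, cleansed_row, predicates, graph_uri=None):
    """Build one (subject, predicate, object) triple per column that has
    an assigned predicate; append a graph URI to form quads if given."""
    statements = []
    for header, value in cleansed_row.items():
        predicate = predicates.get(header)
        if predicate is None:
            continue  # no predicate assigned to this field header
        triple = (subject_uri, predicate, value)
        if graph_uri is not None:
            statements.append(triple + (graph_uri,))
        else:
            statements.append(triple)
    return statements


# Hypothetical cleansed row and predicate assignments.
cleansed_row = {"Name": "Alice", "Company": "Acme"}
predicates = {"Name": "ex:name", "Company": "ex:worksFor"}
print(build_statements("ex:alice", cleansed_row, predicates))
print(build_statements("ex:alice", cleansed_row, predicates, "ex:hrGraph"))
```

The returned list plays the role of the generated triples/quads that would then be saved to the temporary database.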
[0064] FIG. 9 illustrates a flowchart 900 of steps performed by the ABox Uploader 704 in one embodiment of the present invention. At step 902, the list of generated triples/quads is retrieved from the temporary database 308. At steps 904 and 906, each triple/quad is validated to determine whether it is homogeneous with any of the existing content in the knowledge base repository 312. At step 908, if the knowledge base repository 312 has similar content, the triple/quad is logged to the summary log 310 at step 914 as data that already exists. Otherwise, if the knowledge base repository 312 does not have similar content, at step 910, a concept-property cardinality module 910 checks against the knowledge base repository 312 and the TBox definition library 302 to determine if the triple/quad exceeds the concept-property cardinality. At step 912, if the concept-property cardinality for the triple/quad has exceeded a pre-defined value per unique URI as defined by the system administrator, the triple/quad is logged to the summary log 310 at step 914. At step 916, if the concept-property cardinality for the triple/quad has not exceeded the pre-defined value per unique URI, the triple/quad is added to the knowledge base repository 312. Either way, at step 918, the ABox Uploader 704 determines if there are more triples/quads to be processed. If there are, the aforesaid processes are repeated, and if not, at step 920, a summary report of the summary log is generated once all triples/quads have been processed.
[0065] Thus the present system and method provide an efficient, simpler and more elegant framework that automatically processes the data from standard templates or formats into the knowledge base repository. Further, the present system and method dynamically analyze the data, convert the analyzed data and upload the converted data into the knowledge base's ABox to ensure its semantic relationship based on the defined TBox ontology.
[0066] While embodiments of the present invention have been illustrated and described, it will be clear that the invention is not limited to these embodiments only. Numerous modifications, changes, variations, substitutions, and equivalents will be apparent to those skilled in the art, without departing from the scope of the invention, as described in the claims.

Claims

1. A method for dynamically processing data into a knowledge base repository (312), through a dynamic data processing device (102), the method comprising steps of: facilitating (402) a system administrator with a user interface (116) to configure at least one of an acceptable data input content and a data source based on a predefined terminological component, TBox, ontology of one or more domain of interest through an input configuration and terminological component, TBox, assignment module (202), wherein the acceptable data input content is having a predefined structure; receiving (404) the acceptable data input content from the system administrator through a parser (204); validating (406) the acceptable data input content together with the predefined structure of the acceptable data input content before mapping and converting one or more field headers to a concept and a concept-property as configured in the input configuration and terminological component, TBox, assignment module (202) through a field mapper and converter module (206); and processing (408) the acceptable data input content to generate at least one of a plurality of semantic triples, and a plurality of semantic quads to define an assertion component, ABox, to ensure homogeneity and cardinality of data within the knowledge base repository (312) during upload through an assertion component, ABox, generator module (208).
2. The method according to claim 1, wherein the input configuration and terminological component, TBox, assignment module (202) perform steps comprising: selecting (502) at least one of the acceptable data input content and the data source content; adding (504) one or more preferred field headers; assigning (506) the concept and the concept-property with an object type of the concept and the concept-property to the field headers based on a terminological component, TBox, definition library; adding (512) the selected type of the field header to a concept definition list; and confirming (522) the concept definition list to a data declaration library.
3. The method according to claim 1, wherein the field mapper and converter module (206) perform a plurality of steps comprising: identifying (602) and validating (604) the acceptable data input content and structure based on a data input/source list; extracting (606) the field header; mapping (608) the extracted field header; validating (610, 614) the field header as concept and concept-property; converting (612, 618) the field header to assigned concept and concept property to be stored in a temporary database (308); and logging (616) the converted data to a summary log (310) on identifying invalid field header.
4. The method according to claim 1, wherein the assertion component (ABox) generator module (208) performs a plurality of steps comprising: retrieving (706) row data; reading (708) and validating (710) datatype of column data; logging (714) the read and validated datatype of the column data to a summary log (310) on identifying invalid data; and trimming (716) data with space and removing (718) one or more special characters based on one or more data compliance rules (314) to output cleansed data, wherein the cleansed data is received by a triple-quad generator (702) for generating triple-quads, and uploading the triple-quads to the knowledge base repository (312) by an assertion component, ABox, uploader (704).
5. The method according to claim 4, wherein the assertion component (ABox) generator module (208) comprises: the triple-quad generator (702) performs a plurality of steps comprising: obtaining (802) cleansed column data; obtaining (804) assigned predicate for the concept or concept-property; constructing (806) semantic triple-quads; and storing (808) the generated triple-quads to the temporary database; and the assertion component, ABox, uploader (704) performs a plurality of steps comprising: retrieving (902) a list of generated triple-quads from the temporary database (308); validating (906) each triple-quads for homogeneity from the knowledge base repository; checking (912) if exceed the concept-property cardinality; logging (914) to the summary log upon determining that the triple-quad is homogenous and exceeding cardinality; uploading (916) the triple-quads to the knowledge base repository (312); and generating (920) a summary report.
6. A system (100) for dynamically processing data into a knowledge base repository (312), the system (100) comprising: a processor (110) and a memory (112) communicatively coupled to the processor (110), wherein the memory (112) stores instructions for processing data; a user interface (116) adapted for facilitating a system administrator to configure at least one of an acceptable data input content and a data source based on a predefined terminological component, TBox, ontology of one or more domain of interest through an input configuration and terminological component, TBox, assignment module (202), wherein the acceptable data input content is having a predefined structure; a parser (204) adapted for receiving the acceptable data input content from the system administrator; a field mapper and converter module (206) adapted for validating the acceptable data input content together with the predefined structure of the acceptable data input content before mapping and converting one or more field headers to a concept and a concept-property as configured in the input configuration and terminological component, TBox, assignment module (202); and an assertion component, ABox, generator module (208) for processing the acceptable data input content to generate at least one of a plurality of semantic triples, and a plurality of semantic quads to define an assertion component, ABox, to ensure homogeneity and cardinality of data within the knowledge base repository during upload.
7. The system (100) according to claim 6, wherein the input configuration and terminological component, TBox, assignment module (202) is further adapted to perform steps comprising: selecting (502) at least one of the acceptable data input content and the data source content; adding (504) one or more preferred field headers; assigning (506) a concept and a concept-property with an object type of the concept and the concept-property to the field headers based on a terminological component, TBox, definition library; adding (512) the field header to a concept definition list; and confirming (522) the concept definition list to a data declaration library.
8. The system (100) according to claim 6, wherein the field mapper and converter module (206) is adapted for identifying (602) and validating (604) the acceptable data input content and structure based on a data input/source list; extracting (606) the field header; mapping (608) the extracted field header; validating (610) the field header as concept and concept-property; converting (612, 618) the field header to assigned concept and concept property to be stored in a temporary database (308); and logging (616) the converted data to a summary log (310) on identifying invalid field header.
9. The system (100) according to claim 6, wherein the assertion component, ABox, generator module (208) is adapted for retrieving row data; reading (708) and validating (710) datatype of column data; logging (714) the read and validated datatype of the column data to the summary log (310) upon identifying invalid data; and trimming (716) data with space; and removing (718) one or more special characters based on one or more data compliance rules (314).
10. The system (100) according to claim 9, wherein the assertion component, ABox, generator module (208) comprises: a triple-quad generator (702) adapted for obtaining (802) cleansed column data; extracting (804) assigned predicate for the concept or concept-property; constructing (806) semantic triple-quads; and storing (808) generated triple-quads to the temporary database (308); and an assertion component, ABox, uploader (704) adapted for retrieving (902) triple quads from the temporary database (308); validating (906) each triple-quads for homogeneity from the knowledge base repository; checking the concept-property cardinality if exceeded; logging to the summary log upon determining that the triple-quad is homogenous and exceeding cardinality; uploading triple-quads to the knowledge base repository; and generating the summary report.
PCT/MY2020/050118 2019-11-29 2020-10-23 System and method for dynamically processing data into a knowledge base repository WO2021107760A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
MYPI2019007071 2019-11-29
MYPI2019007071 2019-11-29

Publications (1)

Publication Number Publication Date
WO2021107760A1 true WO2021107760A1 (en) 2021-06-03

Family

ID=76128873

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/MY2020/050118 WO2021107760A1 (en) 2019-11-29 2020-10-23 System and method for dynamically processing data into a knowledge base repository

Country Status (1)

Country Link
WO (1) WO2021107760A1 (en)

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20893113

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20893113

Country of ref document: EP

Kind code of ref document: A1