CN103412917B - Extensible database system and management method for coordinated management of multi-type domain data - Google Patents

Info

Publication number
CN103412917B
Authority
CN
China
Prior art keywords
data
database
domain
module
hierarchical
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201310343157.XA
Other languages
Chinese (zh)
Other versions
CN103412917A (en)
Inventor
陈宁江
肖中正
董世龙
胡丹丹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanning Super Cube Science And Technology Co Ltd
Original Assignee
Guangxi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangxi University
Priority to CN201310343157.XA
Publication of CN103412917A
Application granted
Publication of CN103412917B
Legal status: Active
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An extensible database system and management method for the coordinated management of multi-type domain data. The system comprises a data resource ontology library module, a hierarchical domain database module, a network-type domain database module, and a domain data evolution module; the data resource ontology library module, the hierarchical domain database module, and the network-type domain database module together constitute the database set. The invention can build a large, service-oriented domain data repository and, on that basis, an extensible data resource ontology library system. It can rapidly extend hierarchical and network-type domain databases of different kinds, and it can extract new data objects from unstructured raw text data to build new domain data.

Description

Extensible database system and management method for data coordination management in multiple types of fields
Technical Field
The invention relates to an extensible database system and a management method for data coordination management in multiple types of fields, and belongs to the fields of databases and artificial intelligence.
Background
A database is a repository that organizes, stores, and manages data according to a data structure, and serves as a general-purpose data processing system for an organization or an application domain. With the development of information technology and markets, data management is no longer just the storage and administration of data; it now covers the many forms of data management that users require. Databases come in many types, from the simplest tables holding assorted data to large database systems capable of mass data storage. With accelerating informatization and the arrival of the "big data" era, enterprise data is becoming ever more massive, unstructured, and complex. The combination of artificial intelligence with databases has driven the development of intelligent databases. Whereas conventional applications encode problem-solving knowledge implicitly in the program, systems based on intelligent databases express the problem-solving elements of the application domain explicitly and organize them into a relatively independent program entity.
As informatization accelerates, enterprises attach increasing importance to managing massive, complex data, but in managing these resources they often encounter the following problems: massive enterprise data is difficult to store and manage; retrieval is slow and inefficient; version management of domain data is chaotic; data security is not guaranteed; and the databases of different domains cannot be shared collaboratively in an effective way. To manage massive, complex, unstructured data, an intelligent database is therefore needed that is extensible, can evolve, and supports coordinated management across multiple types of domains, so that the data can be stored, processed, and analyzed.
Disclosure of Invention
The technical problem to be solved by the invention is the efficient storage organization, indexing, and querying of unstructured data with many kinds of complex relationships. To this end, an extensible database system and a management method for the coordinated management of multi-type domain data are provided.
The technical solution of the invention is as follows: a database system for extensible coordinated management of multi-type domain data, comprising a data resource ontology library module, a network-type domain database module, a hierarchical domain database module, and a domain data evolution module, wherein:
the data resource ontology library module defines the top-level data resource model, implements the logical view design and storage structure design of the basic data units, provides the basic capabilities for data storage and access, and establishes a database containing a large number of business data objects, relations, and concepts; it also provides the top-level data abstraction rules and data access rules for the network-type domain database module and the hierarchical domain database module;
the network-type domain database module builds, on top of the data resource ontology library, a database based on network-type data attributes according to the attributes of the data objects, their relational network, and other special attributes; it implements the data structure design, storage design, and index design of network-type data objects, forms a relational network containing a large number of network-type data objects, and exposes a network-type database access interface; the module inherits from the data resource ontology library and instantiates it for network-type domain data, and provides a query interface over network-type domain data to users, other modules, and external systems;
the hierarchical domain database module builds a database dedicated to data objects and the information about their hierarchical membership, according to characteristics such as membership, adjacency, intersection, and peer relations among hierarchical data objects, and exposes an access interface to the data objects and their hierarchical membership database; the module is a further evolution of the network-type domain database module: it stores and organizes, as a tree, domain data that has only a hierarchical structure, implements hierarchical semantics, and provides a query interface over hierarchical domain data to users, other modules, and external systems;
the domain data evolution module tracks and controls changes to the domain data in the data resource ontology library, the network-type domain database, and the hierarchical domain database while the system is in use, and establishes the version history of data objects; it analyzes raw data sets provided by the user together with existing data to obtain new domain data, which is screened and then entered into the domain databases; it provides record-based data version control for the three databases, automatically discovers new domain data in the raw data entered by the user, and performs the corresponding evolution management through the interfaces of the domain data evolution module.
The data resource ontology library module comprises a data persistence module, an underlying lexicon building module, a relation definition module, a data index module, and an interface module;
the data persistence module defines an interface-oriented implementation so that data persistence can be configured flexibly for different hardware environments, runtime contexts, and other requirements; it defines the serialization and deserialization protocols of domain data objects based on object serialization technology, and during persistence it writes the binary stream produced by serializing an object to a file, a database, or a network location through a file organization protocol; when an object that is not in the object buffer pool must be loaded, the corresponding data stream is read according to the logical address carried in the upper-layer request, and the object is reconstructed through its deserialization protocol; data in the file is organized logically in blocks, and the blocks are managed with a heap structure. The data persistence module of the data resource ontology library is also the persistence abstraction for the network-type domain database and the hierarchical domain database: the storage functions of those two modules are customized and extended from this persistence module according to their own persistence protocols, forming the persistence base for each specific data type;
the underlying lexicon building module stores data objects that have no extended attributes or relations, and establishes the basic domain data objects, the serialization protocol, the deserialization protocol, and the storage manager; the single data form of the underlying lexicon provides the data basis for defining network-type and hierarchical domain data. The network-type domain database module and the hierarchical domain database module implement the serialization and deserialization interfaces defined here;
the relation definition module, building on the underlying lexicon, uses new files to establish synonym, antonym, and membership relations for the entries of the underlying data object database; this highly abstract and general definition, organization, storage, and management of relations allows the network-type domain database to be extended flexibly on top of it;
the data index module first defines the digest of a domain data object, and then maps the digest to the logical storage information of the domain data object through a fast double-coding algorithm, in order to support fast retrieval and access control; the network-type domain database module and the hierarchical domain database module both contain an index part whose keys are pairs of long integers computed by the double-coding algorithm;
the interface module is implemented on the EJB 3.0 standard and is published as both an EJB interface and a Web Service interface to provide cross-platform services; the network-type domain database and the hierarchical domain database obtain their customized interface publishing functions by inheriting from the interface module of the data resource ontology library.
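The persistence behavior described above (serialize an object, write the binary stream into blocks, reload it from a logical address) can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the `BlockStore` class, the 64-byte block size, the use of `pickle` as the serialization protocol, and the reading of "heap structure" as a min-heap of free block numbers. The patent does not specify any of these details.

```python
import heapq
import io
import pickle
import struct

BLOCK_SIZE = 64                      # hypothetical block size in bytes
HEADER = struct.Struct("<ii")        # (next block number or -1, bytes used)
CAPACITY = BLOCK_SIZE - HEADER.size  # payload bytes available per block


class BlockStore:
    """Toy block-organized store; an in-memory buffer stands in for a disk file."""

    def __init__(self):
        self.file = io.BytesIO()
        self.free_blocks = []        # min-heap of reusable block numbers (assumed reading of "heap")
        self.block_count = 0

    def _allocate(self):
        # Reuse the lowest freed block first, otherwise grow the file by one block.
        if self.free_blocks:
            return heapq.heappop(self.free_blocks)
        self.block_count += 1
        return self.block_count - 1

    def _write_block(self, number, next_number, chunk):
        self.file.seek(number * BLOCK_SIZE)
        self.file.write(HEADER.pack(next_number, len(chunk)) + chunk.ljust(CAPACITY, b"\0"))

    def _read_block(self, number):
        self.file.seek(number * BLOCK_SIZE)
        raw = self.file.read(BLOCK_SIZE)
        next_number, used = HEADER.unpack_from(raw)
        return next_number, raw[HEADER.size:HEADER.size + used]

    def save(self, obj):
        """Serialize obj, spread it over chained blocks, return the first block number."""
        payload = pickle.dumps(obj)
        chunks = [payload[i:i + CAPACITY] for i in range(0, len(payload), CAPACITY)]
        numbers = [self._allocate() for _ in chunks]
        for i, (number, chunk) in enumerate(zip(numbers, chunks)):
            next_number = numbers[i + 1] if i + 1 < len(numbers) else -1
            self._write_block(number, next_number, chunk)
        return numbers[0]

    def load(self, address):
        """Follow the block chain from a logical address and deserialize the object."""
        parts = []
        while address != -1:
            address, chunk = self._read_block(address)
            parts.append(chunk)
        return pickle.loads(b"".join(parts))

    def delete(self, address):
        """Return every block of the chain to the free heap."""
        while address != -1:
            next_number, _ = self._read_block(address)
            heapq.heappush(self.free_blocks, address)
            address = next_number
```

In this sketch the value returned by `save` plays the role of the logical address that an upper layer would pass back to `load`; freed blocks are reused before the file grows.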
The network-type domain database module is implemented by the following steps:
(1) in the storage design, defining the domain data objects and the persistence protocol for network-type domain data on the basis of the storage management layer of the data resource ontology library;
(2) establishing a storage layer on the basis of the defined network-type domain data objects, and defining the basic structure and processing of the attribute part. The attribute part is divided into two parts: attributes that already exist when the database is designed, called basic attributes, and user-defined attributes, called extended attributes;
(3) establishing a data index over the storage structure of the network-type domain data objects, based on a B-tree with a cache and a Bloom filter; the B-tree is generated dynamically as network-type data objects are inserted, and its maximum number of levels is not limited. For network-type data objects that share the same name, the attribute blocks are linked together by pointers into an attribute block linked list, so that a query using the name as the key quickly returns the list of all network-type data objects with that name;
(4) after the data index is implemented, guaranteeing efficient access and high fault tolerance by implementing checkpoints and log files for updates to data records.
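Step (3) can be sketched as follows. This is an illustrative simplification, not the patented index: a Python dictionary stands in for the B-tree and its cache, the Bloom filter parameters are arbitrary, and the class names `BloomFilter`, `AttributeBlock`, and `NameIndex` are hypothetical. What it does show is the Bloom filter rejecting absent names before the index is consulted, and same-name objects chained through a pointer in each attribute block.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter using seeded SHA-256 digests as hash functions."""

    def __init__(self, size_bits=1024, hash_count=3):
        self.size = size_bits
        self.hash_count = hash_count
        self.bits = bytearray(size_bits // 8)

    def _positions(self, key):
        for seed in range(self.hash_count):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))


class AttributeBlock:
    """Attribute part of one network-type data object."""

    def __init__(self, basic, extended=None):
        self.basic = basic               # attributes fixed at design time
        self.extended = extended or {}   # user-defined attributes
        self.next = None                 # pointer to the next block with the same name


class NameIndex:
    def __init__(self):
        self.bloom = BloomFilter()
        self.heads = {}                  # name -> first AttributeBlock (stand-in for the B-tree)

    def insert(self, name, basic, extended=None):
        block = AttributeBlock(basic, extended)
        block.next = self.heads.get(name)   # link in front of existing same-name blocks
        self.heads[name] = block
        self.bloom.add(name)

    def find(self, name):
        if not self.bloom.might_contain(name):
            return []                    # skip the index lookup entirely
        results = []
        block = self.heads.get(name)
        while block is not None:
            results.append(block)
            block = block.next
        return results
```

A query such as `index.find("Nanning")` then returns every attribute block stored under that name by walking the linked list.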
The hierarchical domain database module is implemented by the following steps:
(1) in the storage design, defining the domain data objects and the persistence protocol for the hierarchical domain database on the basis of the storage management layer of the data resource ontology library;
(2) designing the hierarchical storage structure on the basis of the hierarchical domain data objects and of the extension of the persistence protocol for the hierarchical domain data structure;
(3) after the storage layer is complete, organizing the relational structure among hierarchical data objects with a defined binary tree. The key in the index file is the double-coding number pair of a data object; because the number pair is unique, collisions need not be considered. The parent and child data objects are stored in the attribute file using these number pairs. During retrieval, the number pair of the data object is computed and matched in the index file to obtain the pointer to the corresponding attribute, which is then read; if there are several child data objects, all lower-level data objects can be found by following each attribute's pointer to the next attribute;
(4) after the index is complete, building a simplified log file suited to the hierarchical structure on the basis of the checkpoint and log file functions of the network-type domain database module.
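The retrieval path in step (3) can be sketched in a few lines. The double-coding algorithm itself is not given in the text, so `double_code` below is a stand-in that simply splits a SHA-256 digest into two 64-bit integers; the `HierarchyStore` class and its in-memory lists are likewise hypothetical substitutes for the index file and attribute file.

```python
import hashlib


def double_code(name):
    """Stand-in for the double-coding algorithm: returns a pair of 64-bit integers."""
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    return (int.from_bytes(digest[:8], "big"), int.from_bytes(digest[8:16], "big"))


class HierarchyStore:
    def __init__(self):
        self.names = {}        # number pair -> object name (for display only)
        self.index = {}        # number pair -> position of the first child record, or None
        self.attributes = []   # stands in for the attribute file

    def add(self, name, parent=None):
        key = double_code(name)
        self.names[key] = name
        self.index.setdefault(key, None)
        if parent is not None:
            parent_key = double_code(parent)
            self.names.setdefault(parent_key, parent)
            # Each record stores parent and child as number pairs, plus a pointer
            # to the parent's next child record.
            record = {"parent": parent_key, "child": key, "next": self.index.get(parent_key)}
            self.attributes.append(record)
            self.index[parent_key] = len(self.attributes) - 1

    def children(self, name):
        """Compute the number pair, match it in the index, then follow the pointers."""
        pointer = self.index.get(double_code(name))
        found = []
        while pointer is not None:
            record = self.attributes[pointer]
            found.append(self.names[record["child"]])
            pointer = record["next"]
        return found

    def descendants(self, name):
        result = []
        for child in self.children(name):
            result.append(child)
            result.extend(self.descendants(child))
        return result
```

Chaining a parent's child records through a "next" pointer is one way to hold an arbitrary number of children with fixed-size records; it resembles a first-child/next-sibling layout, which may be what the text means by organizing the hierarchy as a binary tree.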
The domain data evolution module is implemented by the following steps:
(1) collecting the user activity records of each database and monitoring changes in the activity level of domain data objects;
(2) analyzing the collected activity data and moving domain data objects whose activity falls below a system threshold into the guard backup library;
(3) further analyzing the user activity records and creating a version change record for each domain data object whose core attributes have changed;
(4) analyzing text data provided by the user or found on the Internet to build a large data object analysis library; when new data enters the analysis library, this triggers a read of the version information of the associated data objects, after which the probability that the current data object is new domain data is calculated by analyzing the relation between the data object and the versions of its associated data objects; the new domain data is then added to the corresponding domain database, either automatically or after manual modification by the user.
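Steps (1) to (3) amount to counting accesses per object, flagging low-activity objects, and recording a new version whenever a core attribute changes. The sketch below shows one way this bookkeeping could look; the `EvolutionTracker` class, the integer activity counter, and the choice of which attributes count as "core" are all assumptions for illustration. Step (4), the probability calculation, is not covered.

```python
class EvolutionTracker:
    """Tracks activity and core-attribute versions of domain data objects."""

    def __init__(self, threshold, core_attributes):
        self.threshold = threshold            # activity below this marks a backup candidate
        self.core = set(core_attributes)      # names of attributes treated as core
        self.activity = {}                    # object id -> number of recorded accesses
        self.current = {}                     # object id -> current attribute values
        self.versions = {}                    # object id -> list of (version number, snapshot)

    def register(self, obj_id, attributes):
        self.current[obj_id] = dict(attributes)
        self.versions[obj_id] = [(1, dict(attributes))]
        self.activity[obj_id] = 0

    def record_access(self, obj_id):
        self.activity[obj_id] += 1

    def update(self, obj_id, changes):
        """Apply changes; create a new version only if a core attribute changed."""
        self.record_access(obj_id)
        state = self.current[obj_id]
        core_changed = any(
            key in self.core and state.get(key) != value for key, value in changes.items()
        )
        state.update(changes)
        if core_changed:
            number = self.versions[obj_id][-1][0] + 1
            self.versions[obj_id].append((number, dict(state)))
        return core_changed

    def backup_candidates(self):
        """Objects whose activity is below the threshold."""
        return sorted(o for o, count in self.activity.items() if count < self.threshold)
```

A non-core change (for example a free-text note) updates the current state without adding to the version history, which keeps the history limited to the changes the text singles out.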
An extensible database management method for the coordinated management of multi-type domain data comprises the following steps:
(1) preprocessing a text data file provided by the user by removing non-core data, including stop words, modal particles, and punctuation, to obtain preprocessed text data;
(2) feeding the preprocessed data output by step (1) into an LDA probability model and matching it against the established data model to obtain the domain-related data objects in the LDA probability model;
(3) constructing a suffix tree for the domain-related data objects output by step (2), merging it with the existing suffix tree, traversing the merged suffix tree step by step to obtain high-frequency strings, and initializing the domain-related data objects;
(4) submitting the domain-related data objects obtained in step (3) to the data resource ontology library for type and relation determination and matching, to obtain the type of each domain-related data object (hierarchical, network-type, or user-defined) and the other domain data objects related to it;
(5) entering the domain-related data objects and associated data output by step (4) into the domain database of the corresponding type, creating a data change log record, and passing each domain-related data object to the double-coding algorithm to obtain its index number pair;
(6) combining the number pair obtained in step (5), the data object, and its associated domain data objects into a service, and finally outputting domain data objects that carry domain relevance, multiple relations, and multiple attributes.
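Steps (1) and (3) can be illustrated with a small, deliberately simplified sketch. It is not the method of the patent: the LDA step is omitted, a counter over word n-grams is swapped in for the suffix tree, and the stop-word list and function names are hypothetical. The sketch only shows the general shape of "clean the text, then surface high-frequency strings as candidate data objects".

```python
import re
from collections import Counter

# Hypothetical stop-word and interjection list; a real system would use a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "oh", "um"}


def preprocess(text):
    """Lowercase the text, drop punctuation, and remove stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


def frequent_phrases(tokens, max_len=3, min_count=2):
    """Count every run of 1..max_len consecutive tokens and keep the frequent ones.

    This n-gram counter stands in for the suffix-tree traversal described in step (3).
    """
    counts = Counter()
    for start in range(len(tokens)):
        for length in range(1, max_len + 1):
            if start + length <= len(tokens):
                counts[tuple(tokens[start:start + length])] += 1
    return {" ".join(phrase): n for phrase, n in counts.items() if n >= min_count}


if __name__ == "__main__":
    sample = "The domain database stores data. Oh, the domain database is extensible!"
    print(frequent_phrases(preprocess(sample)))
```

A suffix tree would find the same repeated strings without enumerating every n-gram, which matters for large corpora; the counter is used here only because it is short and easy to verify.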
Compared with the prior art, the invention has the following advantages:
(1) it is based on a storage technology combining block storage, heap management, and multiple log groups, which ensures the efficiency and safety of the underlying storage;
(2) based on an extensible design method, domain databases for multiple sub-domains can be extended from the data resource ontology library through user-defined data relations;
(3) it automatically detects, analyzes, and extracts from large amounts of text data, and evolves the acquired data objects into new domain data on the basis of the latest version of the data resource ontology library.
Drawings
FIG. 1 is a block diagram of the system of the present invention;
FIG. 2 is a conceptual relationship diagram of a data resource ontology library module according to the present invention;
FIG. 3 is a schematic diagram of a triple basic data model of the data resource ontology library module according to the present invention;
FIG. 4 is a schematic diagram of a data resource ontology base index according to the present invention;
FIG. 5 is a schematic diagram of the access process of the data resource ontology library module according to the present invention;
FIG. 6 is a flow chart illustrating the definition of new field names in the data resource ontology library according to the present invention;
FIG. 7 is a schematic view of a process of inserting a new relationship into a data resource ontology library according to the present invention;
FIG. 8 is a schematic diagram illustrating a process of retrieving data objects from the data resource ontology library according to the present invention;
FIG. 9 is a diagram illustrating the basic structure and relationship of a network-type domain database according to the present invention;
FIG. 10 is a diagram illustrating the basic structure and relationship of a hierarchical domain database according to the present invention;
FIG. 11 is a schematic view of a hierarchical database query process according to the present invention;
FIG. 12 is a schematic view of a process for creating new domain data objects in the hierarchical domain database according to the present invention;
FIG. 13 is a representation of the LDA model of the domain data evolution module according to the present invention;
FIG. 14 is a diagram illustrating the version control structure of domain data evolution according to the present invention.
Detailed Description
The invention achieves effective storage and querying of data objects and relations through an efficient data storage organization method. New data objects and domain data can be discovered from the large amounts of data provided by users. The data resource ontology library, the network-type domain database, and the hierarchical domain database are implemented, and the domain databases can be extended.
The system comprises a data resource ontology library module, a hierarchical domain database module, a network-type domain database module, and a domain data evolution module. The first three modules together form the database set: they provide information extraction and data object retrieval, establish the various related data object libraries required by the domain database system, and provide the basic data storage and access capabilities for the coordinated management of multi-type domain data. The domain data evolution module tracks and controls changes to the domain data, and implements the self-evolution of the domain databases and the version management of domain data.
1. Data resource ontology library module
Definition of attribute: the properties of a thing and its relations are called the attributes of the thing. A characteristic attribute is one that every object of a kind has and that is abstracted from the concrete objects, such as speech and thought in humans. An accidental attribute is one that some, but not all, objects of a kind have; human skin color and ethnicity are accidental attributes.
Definition of concept: a concept is a form of thought that reflects the characteristic attributes (intrinsic or essential attributes) of a thing. Concepts are abstract and general. Concepts are either true or false; a true concept is one that correctly reflects the characteristic attributes of things. The connotation of a concept is the set of characteristic attributes that the concept reflects. The extension of a concept is the set of things that have those characteristic attributes. The "is-a" relationship is generally used to indicate that one conceptual model is an extension of another. Relationships are an extension of the concept, and so are roles. A concept must be made explicit, that is, clarified in terms of both its connotation and its extension. Depending on whether its extension is one thing or many, a concept is either an individual concept or a general concept. The extension of an individual concept is a unique thing, such as a specific time or a specific place. The extension of a general concept covers many things, as with "city" and "commodity": "city" covers many specific cities, and "commodity" is the conceptual collection of a large number of physical commodities. "is" and "is-a" are individual concepts. Concepts can also be divided into collective and non-collective concepts: a collective concept reflects an aggregate, and a non-collective concept does not. Concepts can be divided into positive and negative concepts: a positive concept reflects things that have a certain property, and a negative concept reflects things that lack a certain property. Concepts are also divided into relative and absolute concepts: a relative concept reflects things that stand in a certain relationship, and an absolute concept reflects things that have a certain nature.
The following 4 basic relationships exist between concepts, as shown in fig. 2:
(1) Identity relation: if every a is a b and every b is an a, then a and b are in an identity relation. When two concepts are in an identity relation, their extensions are the same.
(2) Membership relation: if every b is an a, but not every a is a b, then a and b are in a superordinate-subordinate relation.
(3) Cross relation: if some a are b, some a are not b, and some b are not a, then a and b are in a cross relation.
(4) Disjoint relation: if no a is a b, then a and b are in a disjoint relation.
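Treating the extension of each concept as a set, the four relations correspond to set equality, proper subset, partial overlap, and empty intersection. The function below is a small illustrative sketch; its name and return strings are not taken from the patent.

```python
def concept_relation(a, b):
    """Classify the relation between two concepts given their extensions as iterables."""
    a, b = set(a), set(b)
    if a == b:
        return "identity"
    if b < a:
        return "membership: a is superordinate to b"
    if a < b:
        return "membership: b is superordinate to a"
    if a & b:
        return "cross"
    return "disjoint"
```

For example, with `cities = {"Nanning", "Guilin", "Beijing"}` and `capitals = {"Beijing", "Paris"}`, the two extensions overlap without either containing the other, so the function reports a cross relation.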
The triple data model, "Concept A - relationship - Concept B", is the foundation of the domain database: it is both the basic logical structure and the basic physical storage structure (FIG. 3). The concept is the basic element of a domain database; it expresses people's understanding of things and is a form of thought that reflects their characteristic attributes. A concept may be a real-life object or something imagined or designed, and it may be an attribute or an activity. The logical meaning of a triple is that the given "relationship" holds between "concept A" and "concept B", where "concept A" is the owner and "concept B" is the participant. "Concept A" and "concept B" may also own or participate in other relationships. In this way a very complex network structure is built among n concepts through relationships, and its complexity is determined by the user or by the actual operating environment. By level of abstraction, concepts are divided into classes (Class) and individuals (Individual). Relationships connect concepts and are divided into inter-class relationships, class-to-individual relationships, and inter-individual relationships. In the design of the data model, relationships are themselves treated as concepts, so everything in a domain database is a concept, and each concept has a unique identifier. All domain data can be expressed in the concept-relationship-concept form; if the data structure of a domain is complex, the structure of its domain data table is likely to be a very complex network, but its organization remains clear.
Much of the world's domain data is related, directly or indirectly, and related data is easy to find through this structure of domain data and relationships. Unlike the relational data structure, the concept-relationship structure can accommodate almost any relationship and data, and new relationships and data can be created or modified at will. In a relational database, by contrast, the relationships among data no longer change once the database has been designed. In this respect the concept-relationship way of composing data offers a clear advantage for model searching and data mining.
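The "concept - relationship - concept" model, in which relationships are themselves concepts with unique identifiers, can be sketched as a tiny in-memory triple store. The `ConceptGraph` class and its sequential integer identifiers are hypothetical; the sketch only illustrates that new relationships can be introduced at any time without a schema change.

```python
import itertools


class ConceptGraph:
    """Stores triples (owner, relationship, participant) over uniquely identified concepts."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.concepts = {}     # concept name -> unique identifier
        self.triples = set()   # (owner id, relationship id, participant id)

    def concept(self, name):
        """Return the identifier of a concept, creating it on first use."""
        if name not in self.concepts:
            self.concepts[name] = next(self._ids)
        return self.concepts[name]

    def relate(self, owner, relationship, participant):
        # The relationship is registered as a concept too, so it gets its own identifier.
        triple = (self.concept(owner), self.concept(relationship), self.concept(participant))
        self.triples.add(triple)

    def related(self, owner, relationship):
        """Names of all participants linked to owner through relationship."""
        owner_id = self.concepts.get(owner)
        relationship_id = self.concepts.get(relationship)
        names = {identifier: name for name, identifier in self.concepts.items()}
        return sorted(
            names[p] for (o, r, p) in self.triples if o == owner_id and r == relationship_id
        )
```

Because a relationship such as "is-a" is an ordinary concept here, it can itself appear as the owner or participant of another triple, which mirrors the statement that everything in the domain database is a concept.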
1.1 data resource ontology library structural description
In the data resource ontology library, the stored ontology data is conceptually clear, simple in word form, and unambiguous. Each ontology entry represents exactly one concept and is based on nouns and noun phrases. Some data objects stand in specific relationships, such as synonym, antonym, and membership relations. An underlying database is designed first; it holds data objects that have no extended attributes or relations, and it supports the implementation of the synonym, antonym, and membership relations. The other important part is an in-memory index whose entries consist of an object hash value and a disk block address. The object hash value is a unique number; if two data objects collide, that is, their hash values are equal, they are stored in the same disk block, so one block can hold multiple records.
Disk file: the disk space is divided into blocks of a specified size. One block stores multiple records, and if the records exceed the block's capacity, a pointer leads to the next block. The program maps the source data object file into a binary data file. The data file is organized in data blocks, and data objects with the same hash value, that is, objects that collide, are stored in one data block. The block holds the attribute fields of each data object, and further fields can be added if required, such as an address field that links a data object to its entry in the synonym database. To look up a data object, the hash function computes its hash value, the relative physical block number is found directly from the mapping, and the block is handed to the database manager and the object manager for reading and unpacking. Because a database cache sits in between, large numbers of data objects do not have to be loaded into memory, which saves storage space and keeps access efficient.
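The lookup path described above (hash value to block address, colliding objects in the same block, overflow to a next block through a pointer) can be sketched as follows. The bucket count, block capacity, use of `zlib.crc32` as the hash function, and the `HashedBlockFile` class are all assumptions chosen to keep the example small; they are not values from the patent.

```python
import zlib

BLOCK_CAPACITY = 2   # hypothetical number of records per block
BUCKETS = 8          # hypothetical number of distinct hash values


class Block:
    def __init__(self):
        self.records = []   # (name, fields) tuples
        self.next = None    # address of the overflow block, if any


class HashedBlockFile:
    def __init__(self):
        self.blocks = []    # stands in for the disk file; list position = block address
        self.index = {}     # in-memory index: hash value -> address of the first block

    def _hash(self, name):
        return zlib.crc32(name.encode("utf-8")) % BUCKETS

    def _new_block(self):
        self.blocks.append(Block())
        return len(self.blocks) - 1

    def insert(self, name, fields):
        hash_value = self._hash(name)
        if hash_value not in self.index:
            self.index[hash_value] = self._new_block()
        block = self.blocks[self.index[hash_value]]
        # Colliding objects share a block; a full block points to an overflow block.
        while len(block.records) >= BLOCK_CAPACITY:
            if block.next is None:
                block.next = self._new_block()
            block = self.blocks[block.next]
        block.records.append((name, fields))

    def find(self, name):
        address = self.index.get(self._hash(name))
        while address is not None:
            block = self.blocks[address]
            for record_name, fields in block.records:
                if record_name == name:
                    return fields
            address = block.next
        return None
```

Only the small `index` dictionary needs to stay in memory in this sketch; the blocks themselves would live on disk and be read one at a time during a lookup.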
Independent data objects: data objects that have no relation to other data objects. A query returns only the word of the data object itself and shows no relations. This part uses the underlying database directly, which does not implement relations between data objects. The logical structure of domain data storage comprises (see FIG. 4):
(1) Index: the value computed from the data object by the hash function, used to locate the data object's address.
(2) Block address: the relative physical address of the block that stores the data object.
(3) Data object record: the data object itself.
Data objects with relations: the relations are synonym, antonym, and membership. They are implemented on top of the underlying database, using the data object addresses that it provides.
The data object relations are implemented as follows. An index table file and a record set file are created on top of the underlying database. In physical storage, each of the two files is divided into consecutive blocks of a fixed size; a block is called a record for short, and each record has a unique number (the first record is number 0, the second is number 1, and so on). In the index table file, an index record contains a synonym (positive-association) group number, an antonym (reverse-association) group number, a subordinate data object group number, and the relative physical address of the superordinate data object; the group numbers are record numbers in the record set file. In the record set file, each record carries two marks holding the numbers of its preceding and following records, and it can store a certain number of relative physical addresses of data objects; when a record is full, a following record is allocated as needed. When a synonym (or antonym, or subordinate) data object is added to a data object, the data object is first passed through the hash function to produce its unique relative physical address; an index record is then allocated for it in the index table file, and its index record number is written at a specific position in the data block management layer according to the data object's address. Next, a record dedicated to the relative physical addresses of the data object's synonyms (or antonyms, or subordinates) is allocated in the record set file, and the number of the newly allocated record is written into the data object's index entry.
The relative physical address of the synonym (or antonym, or subordinate) data object is then added to the corresponding record in the record set file. If a subordinate data object is being added, the relative physical address of the superordinate data object must also be set in the index record of the subordinate data object. After a synonym, antonym, or subordinate relation is deleted, the corresponding record set file records and index records are checked; if a record or an index record is now empty, its number is noted and its link to the corresponding data object is removed, so that the free disk space can be recycled and disk utilization improves considerably.
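The interplay of the index table file and the record set file can be sketched with in-memory structures. In the sketch below, integer "addresses" stand in for relative physical addresses, the record capacity of two addresses is arbitrary, and the `RelationFiles` class with its method names is hypothetical. It shows index records pointing at chained records, the superordinate address being set on the subordinate object, and empty records being recycled after deletion.

```python
RECORD_CAPACITY = 2   # hypothetical number of addresses per record


class RelationFiles:
    def __init__(self):
        self.index = {}     # object address -> index record (group numbers per relation)
        self.records = []   # record set file: position in the list = record number
        self.free = []      # numbers of emptied records available for reuse

    def _entry(self, address):
        return self.index.setdefault(
            address,
            {"synonym": None, "antonym": None, "subordinate": None, "superordinate": None},
        )

    def _new_record(self, prev):
        record = {"prev": prev, "next": None, "addresses": []}
        if self.free:
            number = self.free.pop()          # recycle a freed record first
            self.records[number] = record
        else:
            self.records.append(record)
            number = len(self.records) - 1
        return number

    def add(self, address, kind, related):
        entry = self._entry(address)
        if entry[kind] is None:
            entry[kind] = self._new_record(prev=None)
        number = entry[kind]
        while len(self.records[number]["addresses"]) >= RECORD_CAPACITY:
            if self.records[number]["next"] is None:
                self.records[number]["next"] = self._new_record(prev=number)
            number = self.records[number]["next"]
        self.records[number]["addresses"].append(related)
        if kind == "subordinate":
            self._entry(related)["superordinate"] = address

    def related(self, address, kind):
        number = self.index.get(address, {}).get(kind)
        found = []
        while number is not None:
            found.extend(self.records[number]["addresses"])
            number = self.records[number]["next"]
        return found

    def remove(self, address, kind, related):
        entry = self.index[address]
        number = entry[kind]
        while number is not None:
            record = self.records[number]
            if related in record["addresses"]:
                record["addresses"].remove(related)
                if kind == "subordinate":
                    self.index[related]["superordinate"] = None
                if not record["addresses"]:
                    self._unlink(entry, kind, number)
                return True
            number = record["next"]
        return False

    def _unlink(self, entry, kind, number):
        # Detach an empty record from its chain and mark it as reusable.
        record = self.records[number]
        if record["prev"] is None:
            entry[kind] = record["next"]
        else:
            self.records[record["prev"]]["next"] = record["next"]
        if record["next"] is not None:
            self.records[record["next"]]["prev"] = record["prev"]
        self.free.append(number)
```

Keeping both a previous and a next record number in each record, as the text describes, is what lets an emptied record be unlinked from the middle of a chain without rescanning it from the start.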
1.2 definition database
The main purpose of this module is to facilitate the management and expansion of the data object library. Only after a field name has been defined can a data object in the library add that field and its value; if a field name is modified, every data object in the library that has the field automatically takes the new field name, while the corresponding field value is unchanged; if a field name already defined in the library is deleted, every data object in the database that has the field automatically deletes the field and its value. Similarly, only after a relationship name has been defined can data objects in the database establish that relationship; if a relationship name is modified, the relationship between all data objects that have it is automatically changed to the new relationship; if a relationship name already defined in the database is deleted, the relationship is automatically released between all data objects that have it. The default field names and relationship names defined by the database cannot be modified or deleted, whereas user-defined field names and relationship names can be added, modified, and deleted. FIG. 5 shows the abstract architecture and access process of the data resource ontology library.
Taking the definition of a field name as an example, FIG. 6 shows the execution flow, which comprises the following steps:
(1) The client interface service component sends the server-side waiting service component a request to define a field name.
(2) The server side calls the domain database manager's method for defining a field name. The database manager calls the database access object to check whether the field name has already been defined, and the database access object returns the check result. If the field name is already defined in the database, the integer 0 is returned to the server, the server returns operation result 0 to the client, and the process ends.
(3) The database manager sends the database access object a request to read the old checkpoint in the log (the end position of the valid segment of the log).
(4) The database access object returns the old checkpoint, which is assigned to the variable startPos. The database manager sends the database access object a request to read the current valid backup version number of the database. The database access object returns the result, and the database manager assigns it to the variable version. The database registry is copied; the copy is called the new registry.
(5) If no field name is recorded in the new registry, the field name is added to the new registry directly and given the code 1. Otherwise, the current maximum field name code in the new registry is read. If there is a free field name code i smaller than the current maximum field name code, that code is assigned to the new field name; otherwise the current maximum field name code is incremented by 1 and assigned to the new field name. The current maximum field name code is then reset.
(6) The current maximum field name code in the registry is set to 1. A log object is created. The variables of the log object are assigned, including the data it carries and the action to be performed, and the log record is then written to the log file.
(7) A new valid checkpoint of the log file (new checkpoint for short) is returned. The new checkpoint is written to the log file header. The newly written log record content in the log file is then written to the data file (commit for short).
(8) The commit result is returned. If the commit is erroneous or fails, the database manager returns operation result -1 to the server and the process ends. If the commit succeeds, the size of the log file is checked and, if it exceeds a certain size, the log file is reset. The database manager returns operation result 1 to the server, the server returns operation result 1 to the client, and the field name has been defined successfully.
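The code assignment rule of step (5), together with the result codes 0 and 1, can be sketched as follows. This is a simplified illustration only: the function name `define_field_name` and the dictionary used as the registry are hypothetical, and the checkpoint, logging, and commit steps (3), (4), (6), (7), and (8) are deliberately left out.

```python
from typing import Dict


def define_field_name(registry: Dict[str, int], name: str) -> int:
    """Assign a code to a new field name; return 0 if it is already defined, 1 on success."""
    if name in registry:
        return 0  # step (2): the field name is already defined
    used = set(registry.values())
    if not used:
        code = 1  # empty registry: the first field name gets code 1
    else:
        current_max = max(used)
        # Reuse the smallest free code below the current maximum, if any.
        free = [i for i in range(1, current_max) if i not in used]
        code = free[0] if free else current_max + 1
    registry[name] = code
    return 1
```

Reusing free codes keeps the code space compact after field names have been deleted, which is the behavior step (5) describes.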
1.3 managing data object information
This module is mainly responsible for managing data object information and provides functions such as adding, deleting, and modifying data object information. When a data object is destroyed, all information for that data object (including the fields it owns and its relationships to other data objects) is completely deleted from the database. There are three ways to add a data object to the database: first, a data object is added directly, without any information attached; second, when a field is added to a data object that does not yet exist in the database, the database automatically adds the data object and then attaches the new field information to it; third, when a relationship is established between two data objects and one or both of them do not exist in the database, the system automatically adds the missing data objects and then establishes the relationship between them. When a field is added or modified for a data object, the field name must already be defined in the database; similarly, when a relationship is established between two data objects, the relationship name must already be defined in the database. The module can also set the word frequency of a data object and import and export database files. The following describes the operation flow for inserting a data object field; FIG. 7 is the corresponding sequence diagram, whose steps are as follows:
(1) The user service interface sends the server a request to add, for the keyword, a field fieldName with the value content.
(2) The server side calls the database manager's method for adding a field and its content to the data object.
(3) The database manager invokes the double encoder to compute the double encoding of the keyword.
(4) The double encoder returns the double-encoding object key of the keyword (index key for short); if the returned result is empty, go to step 5; otherwise go to step 7.
(5) The double-encoding calculation for the keyword has failed, and the database manager returns operation result -1 to the server.
(6) The server returns operation result -1 to the client, declares that the requested operation has failed, and goes to step 40.
(7) The database manager sends the database access object a request for the code of the field name in the registry.
(8) The database access object returns the code corresponding to the field name; if the returned value is not null (i.e., the field name is defined), go to step 11.
(9) The database manager returns operation result 0 to the server, declaring that the database does not define the field name fieldName and that this field and its content (field value) cannot be inserted for the keyword.
(10) The server returns operation result 0 to the client, declaring that the field name fieldName is not defined in the database and that the requested operation has failed.
(11) The database manager sends the database access object a request for the index value of the index key in the index table.
(12) The database access object returns the index value mapped by the index key; if the value is null, the keyword does not exist in the database, go to step 13; otherwise go to step 16.
(13) The keyword is added to the database and the addition result (an integer) is returned; if the addition fails, go to step 6; otherwise go to step 14.
(14) The request for the index value of the index key in the index table is sent to the database access object again.
(15) The database access object returns the index value of the index key.
(16) The database manager issues a request to the database access object to read the old checkpoint in the log (the end position of the log valid segment).
(17) The database access object returns the old checkpoint and the database manager assigns the returned result to the variable startPos.
(18) The database manager issues a request to the abstract data access object to read the current database valid backup version number.
(19) And the database access object returns a read result, and the database manager assigns the return result to the variable version.
(20) The database registry is copied, referred to as the new registry.
(21) The database manager sends the database access object a request to read the byte data of the keyword.
(22) The database access object returns a carrier holding the byte data of the keyword and its set of disk addresses (called a data cart).
(23) The database manager asks the data processing factory to convert the byte data of the keyword into a visual information object.
(24) The data processing factory returns the keyword content carrier to the database manager, and the database manager checks it; if the field name to be added and its corresponding field value already exist, go to step 25; otherwise go to step 27.
(25) The database manager returns the operation result 2 to the server indicating that the content to be added already exists.
(26) The server returns an operation result 2 to the client, which indicates that the content to be added already exists.
(27) fieldName and content are added to the keyword content carrier.
(28) The database manager sends the keyword data to the data processing factory to be converted into byte data.
(29) The data processing factory returns the byte data of the keyword content to the database manager, and the database manager loads the returned byte data into the data cart.
(30) The addresses of the required disk blocks are reallocated according to the data in the data cart and the information in the new registry, and the new registry is updated accordingly.
(31) A new log object is created, and the data cart, the new registry, and the action to be performed are loaded into it.
(32) The database manager sends the database access object a request to write the newly created log object into the log file.
(33) The database access object returns a new valid checkpoint of the log file (referred to as a new checkpoint) to the database manager.
(34) The database manager sends the database access object a request to write the new checkpoint to the header of the log file, and the database access object responds to the request.
(35) The database manager sends the database access object a request to write the newly written log record content in the log file into the data file (commit for short).
(36) The database access object returns the commit result; if the commit succeeds, go to step 38; otherwise go to step 37.
(37) If the commit fails or any of the above steps throws an exception, go to step 6.
(38) The database manager returns operation result 1 to the server.
(39) The server returns operation result 1 to the client, and the field has been inserted successfully.
(40) End.
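The four operation results of this flow (-1, 0, 1, and 2) can be summarized in a small in-memory sketch. The class `ObjectStore` and its method `insert_field` are hypothetical names; an empty keyword stands in for a failed double-encoding calculation, and the checkpoint, registry copy, logging, and commit steps are omitted.

```python
from typing import Dict, Set


class ObjectStore:
    """Illustrative stand-in for the database manager; not the patented implementation."""

    def __init__(self, defined_fields: Set[str]) -> None:
        self.defined_fields = defined_fields
        self.objects: Dict[str, Dict[str, str]] = {}

    def insert_field(self, keyword: str, field_name: str, content: str) -> int:
        if not keyword:
            return -1  # steps 5-6: encoding failed (modeled here as an empty keyword)
        if field_name not in self.defined_fields:
            return 0  # steps 9-10: the field name is not defined in the database
        # Steps 12-13: a keyword that is missing from the database is added first.
        fields = self.objects.setdefault(keyword, {})
        if fields.get(field_name) == content:
            return 2  # steps 25-26: the content to be added already exists
        fields[field_name] = content  # step 27
        return 1  # steps 38-39: the field was inserted successfully
```

For example, with `store = ObjectStore({"definition"})`, the first call `store.insert_field("tree", "definition", "a plant")` succeeds, and repeating the same call reports that the content already exists.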
1.4 retrieving data object information
The retrieval functions are as follows:
(1) checking data object existence: checking whether a data object exists in the database;
(2) retrieving the data packet of a data object: retrieving all visual data information of the data object (fields, field values, relationships, related words) and encapsulating it into a data packet for transmission over the network or in another form;
(3) retrieving a data object field value: retrieving the field value (field content) of a given field of the data object;
(4) retrieving related words: retrieving all related words of a given relationship of the data object;
(5) retrieving by field name: divided into single-field retrieval and two-field combined retrieval. Single-field retrieval retrieves the data packets of all data objects that have a given field; two-field combined retrieval retrieves the data packets of all data objects that have both of two given fields;
(6) retrieving by relationship name: retrieving the data packets of all data objects that have a given relationship;
(7) backward matching retrieval: retrieving the data packets of all data objects that begin with a given keyword;
(8) fuzzy matching retrieval: retrieving the data packets of all data objects that contain a given keyword;
(9) retrieving high-frequency words: retrieving a specified number of data objects, or their data packets, in descending order of frequency;
(10) retrieving low-frequency words: retrieving all data objects whose word frequency is lower than a given frequency;
(11) retrieving word frequency: retrieving the frequency of a data object (the number of times it has been retrieved);
(12) retrieving all field names defined by the database;
(13) retrieving all relationship names defined by the database.
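Functions (7) to (10) can be illustrated with a short sketch over an in-memory mapping from data object name to word frequency. The function names and the dictionary representation are assumptions made for illustration; the functions return names rather than full data packets.

```python
from typing import Dict, List


def backward_match(freq: Dict[str, int], keyword: str) -> List[str]:
    """Function (7): data objects that begin with the keyword."""
    return sorted(name for name in freq if name.startswith(keyword))


def fuzzy_match(freq: Dict[str, int], keyword: str) -> List[str]:
    """Function (8): data objects that contain the keyword anywhere."""
    return sorted(name for name in freq if keyword in name)


def high_frequency(freq: Dict[str, int], count: int) -> List[str]:
    """Function (9): the `count` most frequently retrieved data objects."""
    return sorted(freq, key=lambda name: (-freq[name], name))[:count]


def low_frequency(freq: Dict[str, int], threshold: int) -> List[str]:
    """Function (10): data objects whose frequency is below the threshold."""
    return sorted(name for name in freq if freq[name] < threshold)
```

Ties in `high_frequency` are broken alphabetically here; the source does not say how ties are ordered.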
FIG. 8 shows the sequence diagram for retrieving a data object data packet. The steps are as follows:
(1) The client sends the server a request to retrieve the data packet of the data object identified by the keyword.
(2) The server side calls the database retriever's method for retrieving a data object data packet.
(3) The database manager invokes the double encoder to compute the double encoding of the keyword.
(4) The double encoder returns the double-encoding object key of the keyword (index key for short).
(5) The database retriever sends the database access object a request for the index value corresponding to the key in the index table.
(6) The database access object returns the index value mapped by the key to the database retriever; if the value is null, the keyword does not exist in the database, go to step 7; otherwise go to step 9.
(7) The database retriever returns the retrieval result null to the server.
(8) The server returns the retrieval result null to the client and goes to step 14.
(9) The frequency of the keyword is incremented by 1; the database retriever then sends the database access object a request to update the word frequency of the keyword in the database index table, and the database access object responds to the request.
(10) The database retriever sends the database access object a request to update the keyword frequency in the data file, and the database access object responds to the request.
(11) The database retriever calls its own method to retrieve the data packet of the keyword, starting from the keyword's head address on disk.
(12) The database retriever returns the data packet of the keyword to the server.
(13) The server returns the data packet of the keyword to the client.
(14) End.
2. Network type domain database module:
2.1 basic principle
Each network-type domain data collection is divided into an index part and an attribute part; the index part is stored in a name.dct file and the attribute part in an attr.dct file. The name index part of the network-type data is an unrestricted B-tree that is generated dynamically as data objects are inserted.
Pointer operations are used in the index file: N pointers are defined, one for each of the N commonly used Chinese characters in the GB2312-80 code. Each pointer points to the root of the B-tree that holds the names beginning with that character. All names with the same first character are stored in the same B-tree.
During retrieval, the name is used as the key for retrieving network-type data; the key is computed by a Hash function from the GB2312-80 codes of the name. A search first locates the B-tree and then finds the name with the B-tree search algorithm. Names in the index file correspond one-to-one to attributes in the attribute file: if the summary information of a data object is found in the index file, the attributes of that network-type data object are guaranteed to exist in the attribute file. The search in the index file uses the summary information as the key and yields, as its result, the position in the attribute file of the attributes corresponding to that summary information. Once the summary information has been found in the index file, the related attributes can be read directly from the attribute file via the attribute index stored in front of the summary information. Operations on the attribute file are therefore very fast, and most of the time is spent searching the index file. In the index file, summary information is stored and searched with the Hash algorithm and the B-tree algorithm; the algorithms operate on disk, and addressing follows pointers directly, so the search is efficient.
2.2 index storage Structure
In the index file, the Hash algorithm and the B-tree algorithm are used to store and search the summary information of the network-type data. The first character of the summary information is referred to here as the "first word", and the remainder after the first word as the "suffix". First, a region table with N entries is established in the index file; each entry consists of a single character and its GB2312-80 value. The address of each character in the region table is obtained by applying the Hash function to the key value of the character in its entry. Each entry also stores a pointer to the root of a B-tree, and the B-tree stores the suffixes of the summary information.
When the name of a network-type data object is stored, the B-tree corresponding to the first character of the summary information is found in the region table, and the suffix of the summary information is then inserted into that B-tree. Searching works in the same way: the corresponding B-tree is first found in the region table from the first character of the summary information, and the suffix of the summary information is then searched for in that B-tree.
Structure of the B-tree: the B-tree stores the suffixes of the summary information, each suffix being stored as a key. To reduce the number of disk reads, a B-tree of order n is used as needed. Each node in the B-tree contains the following information:
$(n, C_0, A_1, K_1, C_1, A_2, K_2, C_2, \ldots, A_n, K_n, C_n, \mathit{Father})$
where $n$ is the number of keys in the node; $K_i$ ($i = 1, \ldots, n$) is a key (a suffix of summary information), with $K_i < K_{i+1}$ ($i = 1, \ldots, n-1$); $C_i$ ($i = 0, \ldots, n$) is a pointer to the root node of a subtree, all keys in the subtree pointed to by $C_{i-1}$ are less than $K_i$ ($i = 1, \ldots, n$), and all keys in the subtree pointed to by $C_n$ are greater than $K_n$; $A_i$ ($i = 1, \ldots, n$) is a pointer into the attribute file, giving the position in the attribute file of the attributes of the summary information whose first character is the character of the B-tree containing the node and whose suffix is $K_i$; $\mathit{Father}$ is a pointer to the parent node.
When a data object is searched for, the address of the region table entry is obtained by a Hash calculation on the first character of the given summary information; the content of the entry is then read to find the root address of the corresponding B-tree, and the suffix of the summary information is searched for in that B-tree. The time for one search is the sum of one Hash calculation and one B-tree search, so the search algorithm is efficient. With memory mapping, the index file does not have to be read into memory as a whole; only the nodes actually used are read in, which reduces disk reading time and improves memory utilization.
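The two-stage lookup (region table, then B-tree) can be sketched as follows. This is an in-memory illustration under stated assumptions: the names `Node`, `search`, and `lookup` are hypothetical, a Python dictionary stands in for the Hash over GB2312-80 codes, attribute-file positions are plain integers, and the Father pointer and node insertion are omitted.

```python
from bisect import bisect_left
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    keys: List[str] = field(default_factory=list)         # K1..Kn, sorted suffixes
    attrs: List[int] = field(default_factory=list)        # A1..An, attribute-file positions
    children: List["Node"] = field(default_factory=list)  # C0..Cn, empty for a leaf


def search(node: Optional[Node], suffix: str) -> Optional[int]:
    """Descend from `node`; return the attribute position for `suffix`, or None."""
    while node is not None:
        i = bisect_left(node.keys, suffix)
        if i < len(node.keys) and node.keys[i] == suffix:
            return node.attrs[i]
        # Keys below K(i+1) live in the subtree under child i.
        node = node.children[i] if node.children else None
    return None


def lookup(region_table: Dict[str, Node], name: str) -> Optional[int]:
    """Find the B-tree for the first character, then search it for the suffix."""
    if not name:
        return None
    root = region_table.get(name[0])  # dict hashing stands in for the Hash function
    return search(root, name[1:])
```

A lookup therefore costs one hash access plus one root-to-leaf descent, which matches the cost described above.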
2.3 Attribute storage Structure for network-type data
All attributes of a network-type data object other than its name are stored in attribute files. The attributes fall into two parts: those that exist when the database is designed are called basic attributes and are stored in a basic attribute file; user-defined attributes are called extended attributes and are stored in an extended attribute file. The basic attributes of a network-type data object are stored in attribute blocks, one block per data object, and the basic attribute blocks are written to the basic attribute file in the order in which the data objects are inserted. The extended attributes of a data object are stored as a linked list. The storage structure of the attribute file is shown in FIG. 9. A basic attribute block consists of the basic attributes of the data object, a pointer a, a pointer b, and so on. Pointer a points to the attribute block of a data object with the same name, and pointer b points to the start address of the data object's extended attributes in the extended attribute file.
When the attributes of a network-type data object are searched for, the index file is first searched by the name of the data object. The node that stores the name also stores the position pointer of the data object's attribute block in the attribute file, so as soon as the name is found, the basic attributes can be read directly from the corresponding address in the attribute file, and the extended attributes can then be read from the extended attribute file via the relevant pointer in the attribute block. Retrieval within the attribute file therefore requires no further searching.
3. Hierarchical domain database module:
3.1 basic principle
Judging by the characteristics of the relationships among hierarchical data, there are four main relationships: membership, adjacency, intersection, and co-reference. Membership is the principal relationship. A hierarchical data object may have both an upper level and a lower level, and each hierarchical data object is defined as directly reporting to its upper level or directly leading its lower level; this design takes into account that several basic primary attributes of different hierarchical data objects may be completely identical.
Each hierarchical data object can serve as a key that uniquely identifies it in the database. The key is a numeric pair computed by a double-encoding Hash function and serves as the logical address of the data object. In other words, the key of each data object corresponds one-to-one to the storage address of the data object on disk. When a data object needs to be looked up, computing its key value with the Hash function is equivalent to obtaining its logical address, after which the data access task is handed over to the object manager and the log manager. This approach avoids searching and matching; the time is spent mainly on computing the Hash value, not all data blocks have to be loaded into memory, and only the required data objects are read in, so the approach is practical in terms of both execution efficiency and memory usage. In addition, addressing follows pointers directly, so data objects are retrieved efficiently.
3.2 Structure of index File
The index file structure mainly comprises the following fields, whose functions are described below:
Key field: the key value calculated by the Hash function; it is unique for each data object;
Father/Son fields: represent the membership relationship; Father holds the Key of the hierarchical data object one level up, and Son holds the Key of the hierarchical data object one level down;
Neighbour field: represents the adjacency relationship; Neighbour holds the Keys of the hierarchical data objects that are adjacent in the relational structure, and the field may hold more than one Key;
Cross field: represents the intersection relationship; Cross holds the Keys of the hierarchical data objects that have an intersection relationship in the relational structure, and the field may likewise hold more than one Key;
Co-ref field: represents the co-reference relationship; Co-ref holds the Keys of the hierarchical data objects that are semantically understood to denote the same object, and the field may hold more than one Key;
0/1 field: indicates whether the hierarchical data object has a duplicate name, i.e., the same name with two completely different meanings. The value 0 means that no duplicate name exists; the value 1 means that a duplicate name exists, in which case the following Fathers field records all upper-level data objects that contain the data object;
Fathers field: when a duplicate name exists, lists all upper-level data objects that contain the hierarchical data object, so the field may hold more than one Key; it is NULL if there is no duplicate name.
Each field above occupies four bytes.
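The index record described above can be modeled with a small sketch. The class `IndexRecord`, the function `superiors`, and the use of Python lists for multi-valued fields are assumptions made for illustration; treating Son as a list is also an assumption, and the four-byte on-disk layout is not reproduced.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class IndexRecord:
    key: int                                             # Key: unique Hash value
    father: Optional[int] = None                         # Father: Key one level up
    son: List[int] = field(default_factory=list)         # Son: Keys one level down (assumed multi-valued)
    neighbour: List[int] = field(default_factory=list)   # Neighbour: adjacent objects
    cross: List[int] = field(default_factory=list)       # Cross: intersecting objects
    co_ref: List[int] = field(default_factory=list)      # Co-ref: co-referent objects
    duplicate: int = 0                                   # 0/1 field: 1 if a duplicate name exists
    fathers: List[int] = field(default_factory=list)     # Fathers: used only when duplicate == 1


def superiors(index: Dict[int, IndexRecord], key: int) -> List[int]:
    """Follow Father pointers upward and return the Keys of all upper levels."""
    path: List[int] = []
    seen = {key}
    current = index.get(key)
    while current is not None and current.father is not None:
        if current.father in seen:
            break  # guard against a malformed cyclic chain
        path.append(current.father)
        seen.add(current.father)
        current = index.get(current.father)
    return path
```

Because every relationship is stored as a Key, following a relationship needs only another index lookup rather than a search over the data file.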
3.3 Structure of data File
The data file stores the hierarchical data objects themselves; access operations on a data object are possible only when the data file is used together with the index file. Logically, the data file is a linear table of data objects, and each entry in the table, i.e., each data item, is linked by pointers to the fields of the structure, as shown in FIG. 10. In the figure, $W_i$ and $W_j$ are data items. The pointers in the Father, Son, Neighbour, Cross, Co-ref, and Fathers fields of a hierarchical data object point, respectively, to its upper-level data object, its lower-level data object, its adjacent data objects, its intersecting data objects, its co-referent data objects, and, when basic attributes are completely identical, all upper-level data objects of the data object. Two data objects with identical basic attributes can thus be distinguished by their upper-level data objects.
When a data object needs to be looked up, it is sufficient to obtain its address by a Hash calculation and load it into memory; the organizational structure of the data object and its related data objects can then be constructed directly. No searching or matching is needed, the lookup time is spent mainly on the Hash calculation, and the fill factor of the Hash table is greater than 0.8.
Fig. 11 and 12 show the related business algorithm process of the hierarchical domain database.
4. The domain data evolution module:
4.1 LDA-based information extraction
The first step of domain data evolution is to analyze and process a large amount of valuable text information. The system adopts a text clustering and mining technique based on the LDA (Latent Dirichlet Allocation) probabilistic generative model, which helps to discover related domain data by automatically grouping similar texts in a text collection into categories. When texts are represented with a vector space model, the text representation matrix usually has very high dimensionality, and similarity measures lose their meaning during clustering because of the curse of dimensionality. The LDA topic model has good text representation capability: it can mine the latent semantic information of the text, yields a representation of each document in topic space, and reduces the dimensionality of the document representation. Once the text has been modeled, feature selection, topic classification, similarity judgment, and similar tasks can be carried out. The LDA model adopts the bag-of-words approach and treats each text data resource as a word frequency vector, thereby converting text information into numerical information that is easy to model.
The three-level Bayesian representation of the LDA model is shown in FIG. 13. $\varphi_k$ denotes the probability distribution of terms in topic $k$, and $\theta_m$ denotes the topic probability distribution of the $m$-th document; $\theta_m$ and $\varphi_k$ serve as parameters of multinomial distributions and are used to generate topics and words, respectively. $K$ is the number of topics, $M$ is the number of documents, $N_m$ is the length of the $m$-th document, and $\omega_{m,n}$ and $z_{m,n}$ denote the $n$-th word of the $m$-th document and its topic. $\alpha$ and $\beta$ are the parameters of the Dirichlet distributions; they are usually fixed and symmetric and can therefore be expressed as scalars. $\varphi_k$ and $\theta_m$ both obey a Dirichlet distribution, which is given by:
$$\mathrm{Dir}(\mu \mid \alpha) = \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_1) \cdots \Gamma(\alpha_K)} \prod_{k=1}^{K} \mu_k^{\alpha_k - 1}$$ (formula one)
where $0 \le \mu_k \le 1$, $\sum_{k} \mu_k = 1$, $\alpha_0 = \sum_{k=1}^{K} \alpha_k$, and $\Gamma(\cdot)$ is the gamma function. The LDA generation process is as follows.
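(a) Sample the term distribution of each topic, $\varphi_k \sim \mathrm{Dir}(\beta)$, $k \in [1, K]$;
(b) For the $m$-th document in the corpus, $m \in [1, M]$:
(c) sample the topic probability distribution $\theta_m \sim \mathrm{Dir}(\alpha)$;
(d) sample the document length $N_m \sim \mathrm{Poiss}(\xi)$;
(e) for the $n$-th word in document $m$, $n \in [1, N_m]$:
(f) select the hidden topic $z_{m,n} \sim \mathrm{Mult}(\theta_m)$;
(g) generate the word $\omega_{m,n} \sim \mathrm{Mult}(\varphi_{z_{m,n}})$.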
For LDA parameter estimation, the conditional probability of the topic sequence given the word sequence is calculated first:
$$p(z \mid w) = \frac{p(w, z)}{\sum_{z} p(w, z)}$$ (formula two)
Gibbs sampling is then carried out on the topic sequence, with the following sampling formula:
$$p(z_i = k \mid z_{\neg i}, w) \propto \frac{n_{k,\neg i}^{(t)} + \beta_t}{\left[\sum_{\upsilon=1}^{V} n_k^{(\upsilon)} + \beta_\upsilon\right] - 1} \cdot \frac{n_{m,\neg i}^{(k)} + \alpha_k}{\left[\sum_{z=1}^{K} n_m^{(z)} + \alpha_z\right] - 1}$$ (formula three)
Once the topic label $z$ of each word $\omega$ has been obtained, the final parameter is calculated as follows:
$$\theta_{m,k} = \frac{n_m^{(k)} + \alpha_k}{\sum_{z=1}^{K} n_m^{(z)} + \alpha_z}$$ (formula four)
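Formulas three and four can be illustrated with a minimal collapsed Gibbs sampler. It is a sketch under stated assumptions, not the patented implementation: the function name `gibbs_lda` is hypothetical, documents are lists of integer term ids, $\alpha$ and $\beta$ are taken as symmetric scalars, and the document-side denominator of formula three is dropped because it is the same for every topic. Counts are decremented before sampling, which is equivalent to the "$-1$" terms in formula three.

```python
import random
from typing import List


def gibbs_lda(docs: List[List[int]], V: int, K: int, alpha: float,
              beta: float, iters: int, seed: int = 0) -> List[List[float]]:
    """Collapsed Gibbs sampling for LDA; returns theta[m][k] as in formula four."""
    rng = random.Random(seed)
    n_mk = [[0] * K for _ in docs]       # n_m^(k): topic counts per document
    n_kt = [[0] * V for _ in range(K)]   # n_k^(t): term counts per topic
    n_k = [0] * K                        # total number of words assigned to each topic
    z: List[List[int]] = []
    for m, doc in enumerate(docs):       # random initial topic assignment
        z.append([])
        for t in doc:
            k = rng.randrange(K)
            z[m].append(k)
            n_mk[m][k] += 1
            n_kt[k][t] += 1
            n_k[k] += 1
    for _ in range(iters):
        for m, doc in enumerate(docs):
            for i, t in enumerate(doc):
                k = z[m][i]              # remove word i from the counts
                n_mk[m][k] -= 1
                n_kt[k][t] -= 1
                n_k[k] -= 1
                weights = [(n_kt[j][t] + beta) / (n_k[j] + V * beta)
                           * (n_mk[m][j] + alpha) for j in range(K)]
                k = rng.choices(range(K), weights=weights)[0]
                z[m][i] = k              # add word i back under the new topic
                n_mk[m][k] += 1
                n_kt[k][t] += 1
                n_k[k] += 1
    return [[(n_mk[m][k] + alpha) / (len(doc) + K * alpha) for k in range(K)]
            for m, doc in enumerate(docs)]
```

Each returned row is the representation of one document in topic space and sums to 1, so it can be used directly as a reduced-dimension feature vector.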
Given the trained model $M$ and a new document $\tilde{m}$, the latent topic of each word in the new document is sampled with the following formula:
$$p(\tilde{z}_i = k \mid \tilde{\omega}_i = t, \tilde{z}_{\neg i}, \tilde{\omega}_{\neg i}; M) = \frac{n_k^{(t)} + \tilde{n}_{k,\neg i}^{(t)} + \beta_t}{\sum_{\upsilon=1}^{V} n_k^{(\upsilon)} + \tilde{n}_{k,\neg i}^{(\upsilon)} + \beta_\upsilon} \cdot \frac{n_{\tilde{m},\neg i}^{(k)} + \alpha_k}{\left[\sum_{z=1}^{K} n_{\tilde{m}}^{(z)} + \alpha_z\right] - 1}$$ (formula five)
where $\tilde{z}$ denotes the topic vector corresponding to the new document $\tilde{m}$.
After the topic label of each word has been obtained by the Gibbs sampling method, the value of the document on each topic component is calculated with formula six, so that a document in term space is represented in topic space.
$$\theta_{\tilde{m},k} = \frac{n_{\tilde{m}}^{(k)} + \alpha_k}{\sum_{z=1}^{K} n_{\tilde{m}}^{(z)} + \alpha_z}$$ (formula six)
After the above steps, a clustering process may be performed. After the LDA is used for selecting the characteristics in a certain proportion, a K-means algorithm is selected for clustering the texts. The text clustering process is as follows:
(1) preprocessing an original text, including word segmentation and removal of stop words;
(2) selecting characteristics by using an LDA model;
(3) for the selected features, counting the weight of each feature in each text, wherein the calculation formula of the weight W (d, W) of the feature in the text is as follows:
W ( d , w ) = log ( tf ( d , w ) + 1 ) &times; log ( ( M + 1 ) / ( df ( w ) + 0.5 ) ) &Sigma; log ( tf ( d , w &prime; ) + 1 ) &times; log ( ( M + 1 ) / ( df ( w &prime; ) + 0.5 ) ) (formula seven)
Here M is the total number of texts, tf(d, w) is the number of occurrences of the term w in text d, and df(w) is the document frequency of w, that is, the number of texts containing w. Once every text has been represented by its feature weights, the vector space model is generated.
(4) randomly selecting initial points and obtaining the final clustering result by using the K-means algorithm. The K-means algorithm needs to measure the distance between texts, and cosine similarity is used for this. For two texts d and d', the similarity is calculated as follows:
$$\mathrm{sim}(d, d') = \frac{\sum_{w \in d, d'} W(d, w) \times W(d', w)}{\lVert d \rVert \times \lVert d' \rVert}$$ (formula eight)
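Steps (3) and (4) can be illustrated with a short Python sketch of formula seven, formula eight, and a basic K-means loop that assigns each text to the most similar center. The function names, the sparse dictionary representation, and the choice to keep the previous center when a cluster becomes empty are assumptions made for illustration and are not taken from the patent.

```python
import math
import random
from collections import Counter


def feature_weights(texts):
    """Formula seven: normalized log-tf times log-idf weights for each text.

    texts: list of token lists (already segmented, stop words removed).
    Returns one sparse vector (dict term -> weight) per text.
    """
    M = len(texts)
    df = Counter(w for text in texts for w in set(text))
    vectors = []
    for text in texts:
        tf = Counter(text)
        raw = {
            w: math.log(tf[w] + 1) * math.log((M + 1) / (df[w] + 0.5))
            for w in tf
        }
        total = sum(raw.values())
        vectors.append({w: v / total for w, v in raw.items()} if total else {})
    return vectors


def cosine(u, v):
    """Formula eight: cosine similarity of two sparse weight vectors."""
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def kmeans(vectors, k, iters=10, seed=0):
    """K-means with cosine similarity; initial centers are random texts."""
    rng = random.Random(seed)
    centers = rng.sample(vectors, k)
    assign = [0] * len(vectors)
    for _ in range(iters):
        # Assign every text to its most similar center.
        assign = [
            max(range(k), key=lambda c: cosine(v, centers[c])) for v in vectors
        ]
        # Recompute each center as the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assign) if a == c]
            if not members:
                continue  # assumption: an empty cluster keeps its old center
            mean = Counter()
            for v in members:
                for w, x in v.items():
                    mean[w] += x / len(members)
            centers[c] = dict(mean)
    return assign
```

Because the initial centers are chosen at random, different seeds can give different partitions, which is inherent to the random initialization named in step (4).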
4.2 Domain database version control
The domain database version control module draws on version control theory and methods to manage and control the evolution of the database. Each evolution state of the data can be regarded as a version, and the module provides functions such as version generation, version recovery, and version deletion. Specifically, because the domain database is modified and evolves, it changes continuously over time, and the module records this series of evolution steps. On the one hand, it records the evolution of specific domain data, so that a user can inspect that data at any time and restore it to a past version; on the other hand, the user can mark the database in a given state as a version, so that the whole database can be restored to that version at a later time. The user may delete non-critical versions if desired. The structure is shown in fig. 14.
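The three functions named above (version generation, version recovery, and version deletion), together with the per-object evolution record, can be sketched with a toy in-memory store. The class `VersionedStore` and its method names are hypothetical and only illustrate the idea; the patent does not specify this interface, and a real module would persist versions rather than hold deep copies in memory.

```python
import copy


class VersionedStore:
    """Toy in-memory domain data store with whole-database versions."""

    def __init__(self):
        self.data = {}        # current state: object name -> attributes
        self.versions = {}    # version label -> snapshot of the whole state
        self._history = {}    # object name -> list of every value it has had

    def put(self, name, attributes):
        """Write an object and record the change in its evolution history."""
        self.data[name] = copy.deepcopy(attributes)
        self._history.setdefault(name, []).append(copy.deepcopy(attributes))

    def history(self, name):
        """Return the recorded evolution of one object, oldest first."""
        return copy.deepcopy(self._history.get(name, []))

    def generate(self, label):
        """Version generation: mark the current database state as a version."""
        self.versions[label] = copy.deepcopy(self.data)

    def restore(self, label):
        """Version recovery: restore the whole database to a marked version."""
        self.data = copy.deepcopy(self.versions[label])

    def delete(self, label):
        """Version deletion: drop a version that is no longer needed."""
        del self.versions[label]
```

A typical use would be to call `generate("v1")` before a batch of modifications and `restore("v1")` if the batch has to be rolled back.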
4.3 New data object discovery
Discovering new data objects requires the user to provide a large amount of basic text data. The system analyzes this data with the LDA-based analysis model and builds a large data object analysis library. When new data is input into the data object analysis library, the version information of the associated domain data is read; the probability that the current data object is new domain data is then calculated by analyzing the relation between the data object and the versions of the associated domain data; and the new domain data is added to the domain database either automatically or after manual modification by the user. The core structure of the data object analysis library is a suffix tree. Another important component is the directory monitoring module, which lets the system automatically detect the arrival of new data and then carry out evolution processing automatically. The processing steps are as follows:
(1) Start the system and check the configuration file to obtain the data source directory for domain data evolution.
(2) Start the directory monitor (autodelete) to listen for state changes in the data source directory. When a new file is added, the directory monitor detects the change and checks the file format; if the file is a text file, a PDF file, an HTML file, or a Word document, it is read and analyzed.
(3) A different file analyzer is implemented for each type of input file: TxtAnalyzer, PdfAnalyzer, HtmlAnalyzer, and WordAnalyzer. PdfAnalyzer and WordAnalyzer are implemented using the open source tool Apache POI. The file analyzer produces a text or a text stream (a text stream is returned when the data size is large).
(4) The obtained text or text stream is input into the analysis library, that is, inserted into the current latest suffix tree. A change to the suffix tree triggers a check of the affected terms: once the frequency of a term reaches a threshold t, the domain data related to the term is queried from the data resource ontology base, and the query result determines whether new basic domain data is constructed.
(5) After a file has been analyzed, it is renamed with the suffix ".analysized" to distinguish it from unanalyzed files. The metadata files in the data source directory are then examined, and some of the oldest analyzed files are deleted if the total size or the number of analyzed files has reached an upper limit.
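The monitoring steps above can be sketched as a single polling pass over the data source directory. This is a simplified illustration under several stated assumptions: a plain term counter stands in for the suffix tree, only plain-text reading is implemented in place of the four analyzers, the set of file extensions is assumed, "oldest" is approximated by file modification time, and the function and parameter names are hypothetical.

```python
import os
from collections import Counter

ANALYZED_SUFFIX = ".analysized"                  # suffix named in step (5)
SUPPORTED = {".txt", ".pdf", ".html", ".doc"}    # assumed extensions


def read_text(path):
    """Stand-in for the per-format analyzers; only plain text is read here."""
    with open(path, encoding="utf-8", errors="ignore") as fh:
        return fh.read()


def scan_once(directory, term_counts, threshold, max_analyzed):
    """One polling pass: analyze new files, rename them, prune old ones.

    term_counts must be a collections.Counter shared across passes.
    Returns the terms whose frequency reached the threshold in this pass;
    these are the candidates to look up in the ontology base.
    """
    candidates = set()
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        ext = os.path.splitext(name)[1].lower()
        if not os.path.isfile(path) or ext not in SUPPORTED:
            continue  # unsupported or already analyzed files are skipped
        for term in read_text(path).split():
            term_counts[term] += 1
            if term_counts[term] == threshold:
                candidates.add(term)
        # Mark the file as analyzed by renaming it.
        os.rename(path, path + ANALYZED_SUFFIX)

    # Delete the oldest analyzed files once their number exceeds the limit.
    analyzed = sorted(
        (
            os.path.join(directory, n)
            for n in os.listdir(directory)
            if n.endswith(ANALYZED_SUFFIX)
        ),
        key=os.path.getmtime,
    )
    for old in analyzed[: max(0, len(analyzed) - max_analyzed)]:
        os.remove(old)
    return candidates
```

A long-running monitor would call `scan_once` on a timer or react to file system notifications; polling is used here only because it needs nothing beyond the standard library.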
Those skilled in the art will appreciate that the invention may be practiced without these specific details.
The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various modifications and refinements without departing from the principle of the present invention, and these modifications and refinements should also be regarded as falling within the protection scope of the present invention.

Claims (6)

1. An extensible database system for multi-type field data coordination management, comprising: data resource ontology library module, network type domain database module, hierarchical type domain database module and domain data evolution module, wherein:
the data resource ontology library module is used for defining a top-level data resource model, realizing logic view design and storage structure design of basic data units, providing data storage and access basic support capacity and establishing a database containing a large number of business data objects, relations and concepts; the data resource ontology library module provides top-level data abstraction rules and data access rules for the network type domain database module and the hierarchical type domain database module;
the network type domain database module is used for constructing a database based on network type data attributes on the basis of a data resource ontology base according to the attributes of the data objects, a relational network and other special attributes, realizing the data structure design, storage design and index design of the network type data objects, forming a relational network containing a large number of network type data objects and realizing the provision of a network type database access interface to the outside; the network type domain database module is used for realizing inheritance of a data resource ontology base and instantiation on network type domain data; providing a query interface based on network type field data for a user, other modules and an external system;
the hierarchical domain database module is used for constructing a database specially representing data objects and related information of hierarchical membership thereof according to the characteristics of membership, adjacency, intersection and peer relationship among hierarchical data objects, and realizing an access interface for providing the data objects and the hierarchical membership database thereof to the outside; the hierarchical domain database module is used for further evolving the network domain database module, storing and organizing the data only having the hierarchical structure domain in a tree form, realizing hierarchical semantics, and providing a query interface based on the hierarchical domain data for a user, other modules and an external system;
the domain data evolution module tracks and controls the change of the domain data in the data resource body base, the network type domain database and the hierarchical domain database in the using process of the user, establishes data version history, analyzes an original data set provided by the user in combination with the existing data to obtain new domain data and inputs the new domain data into the domain database through screening, provides record-based data version control for the data resource body base module, the network type domain database module and the hierarchical domain database module, automatically discovers the new domain data from the original data input by the user, and uses an interface thereof to carry out corresponding evolution management.
2. The extensible database system for multi-type field data coordination management according to claim 1, wherein: the data resource ontology library module comprises a data persistence module, a bottom database establishing module, a relation defining module, a data index module and an interface module;
the data persistence module defines an interface-oriented implementation method and flexibly configures data persistence according to different hardware environments, context environments and other requirements; defining serialization and deserialization protocols of field data related objects based on an object serialization technology, and outputting binary streams obtained after object serialization to a file, a database or a network position through a file organization protocol during data persistence; when an object which is not loaded in the object buffer pool needs to be loaded, reading a corresponding data stream according to the logic address information sent by an upper layer request, and reconstructing the object through a deserialization protocol of the object; the logical organization mode of the data in the file is a block storage mode, and the blocks are managed by adopting a heap structure; the data persistence module of the data resource ontology base is also a data persistence abstraction of the network type domain database and the hierarchical type domain database, and the data storage functions of the network type domain database and the hierarchical type domain database are customized and expanded based on the persistence module according to different persistence protocols to form a persistence base of a specific data type;
the bottom database establishing module is used for storing data object data without extended attributes and relations, and establishing a basic field data object, a serialization protocol, a deserialization protocol and a storage manager; the single data form of the bottom database provides a data basis for the definition of the network type field data and the hierarchical type field data; the network type domain database module and the hierarchical type domain database module realize defined serialization and deserialization interfaces;
the relation definition module is used for establishing a synonymy relation, an antonymy relation and a membership relation for the entries of the bottom database by using a new file on the basis of the realization of the bottom database; the highly abstract and generalized relation definition, organization, storage and management enable the network type domain database to realize flexible extension on the basis;
the data index module is used for performing abstract definition on the field data object, and mapping the field data abstract and the logic storage information of the field data object through a quick double-coding algorithm so as to achieve the purposes of quick retrieval and access control; the network type domain database module and the hierarchical type domain database module both comprise index parts, wherein each keyword is a number pair consisting of a long value and an integer value obtained through the double-coding calculation;
the interface module is realized based on EJB3.0 standard, and is published in the form of EJB interface and Web Service interface to realize cross-platform Service, and the network type domain database and the hierarchical type domain database realize the customized interface publishing function by inheriting the data resource ontology interface module.
3. The extensible database system for multi-type field data coordination management according to claim 1, wherein: the network type domain database module is realized by the following steps:
(1) defining a domain data object and a persistence protocol related to the network type domain data on the basis of a storage management layer of the data resource ontology library in storage design;
(2) performing storage design on the basis of a defined network type field data object, and firstly defining a basic structure and a process of an attribute part; dividing an attribute part into two parts, wherein one part is an existing attribute during database design and is called a basic attribute; the other part is a user-defined attribute which is called an extended attribute;
(3) establishing a data index on a network type domain data object storage structure, dynamically generating a B tree when inserting a network type data object based on the B tree with a fast buffer and a Bloom Filter, and not limiting the maximum layer number of the B tree, connecting attribute blocks together by using pointers to form an attribute block linked list aiming at the condition that the network type data object has the same keyword, and quickly obtaining a network type data object list with the same keyword when inquiring the network type data object by using the keyword;
(4) after the data index is realized, the data record is updated according to the updating of the data record, and the data update is recorded in the finest granularity through the check point and the log file, so that the high-efficiency access and the high fault tolerance of the system are guaranteed.
4. The extensible database system for multi-type field data coordination management according to claim 1, wherein: the hierarchical domain database module is realized by the following steps:
(1) defining a domain data object and a persistence protocol related to a hierarchical domain database on the basis of a storage management layer of the data resource ontology base in storage design;
(2) on the basis of the extension of the hierarchical domain data object and the persistent protocol based on the hierarchical domain data structure, the storage establishment of the hierarchical structure is carried out;
(3) after the storage and establishment of the hierarchical structure are completed, organizing a relationship structure between hierarchical data objects through a defined binary tree; the keywords in the index file are formed by the unique double-code number pairs of the data objects, so that conflicts do not need to be considered; in the attribute file, the hierarchical data of the parent data object and the child data objects are stored by using the number pairs; during retrieval, the number pair of the data object is calculated, the same number pair is matched in the index file, the pointer of the corresponding attribute is obtained, and the attribute is read out; if a plurality of child data objects exist, all lower-level data objects can be found by following the pointer of each attribute to the next attribute;
(4) after the index is finished, a simplified log file suitable for a hierarchical structure is constructed on the basis of the functions of the check point and the log file of the network type domain database module.
5. The extensible database system for multi-type field data coordination management according to claim 1, wherein: the field data evolution module is realized by the following steps:
(1) firstly, collecting user activity records of various domain databases, and monitoring the change of activity degree of domain data objects;
(2) analyzing the activity change data of the collected data objects, and bringing the domain data objects with the activity lower than a system threshold value into a guard backup library;
(3) further analyzing the activity record of the user, and establishing a version change record of the data object for the field data object with the changed core attribute;
(4) the system analyzes the text data provided by the user or on the Internet to construct a huge data analysis library; when new data is input into the data analysis base, the version information of the associated data object is triggered to be read, then the probability that the current data object is new domain data is calculated by analyzing the relation between the data object and the associated data object version, the new domain data is automatically or manually modified by a user and added into the corresponding domain database.
6. An extensible database management method for data coordination management in multiple types of fields is characterized by comprising the following implementation steps:
(1) preprocessing text data provided by a user, removing non-core field data including stop words, modal particles and punctuation marks to obtain preprocessed text data;
(2) inputting the preprocessed data output in the step (1) into an LDA (latent Dirichlet allocation) probability model, and matching the preprocessed data with the established data model to obtain a field-related data object;
(3) constructing a Suffix Tree (Suffix Tree) on the field-related data object output in the step (2), fusing the existing Suffix Tree, gradually traversing the merged Suffix Tree to obtain a high-frequency character string, and initializing a field-related data object;
(4) inputting the field-related data object obtained in the step (3) into a data resource ontology base for type and relationship judgment and matching, and obtaining the type of the field-related data object, namely the hierarchical type, the network type or the user-defined type, and other field data objects related to the data object;
(5) inputting the field-related data object and the associated data output in the step (4) into a field database of a corresponding type, establishing a data change log record, and inputting the field-related data object into a double-coding algorithm to obtain a corresponding index number pair;
(6) combining the number pair obtained in the step (5), the data object and the related field data objects as a service, and finally outputting a field data object that is field-related and has multiple relations and multiple attributes.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310343157.XA CN103412917B (en) 2013-08-08 2013-08-08 The Database Systems of a kind of extendible polymorphic type FIELD Data coordinated management and management method


Publications (2)

Publication Number Publication Date
CN103412917A CN103412917A (en) 2013-11-27
CN103412917B true CN103412917B (en) 2016-08-10

Family

ID=49605929

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310343157.XA Active CN103412917B (en) 2013-08-08 2013-08-08 The Database Systems of a kind of extendible polymorphic type FIELD Data coordinated management and management method

Country Status (1)

Country Link
CN (1) CN103412917B (en)

Families Citing this family (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105183735B (en) * 2014-06-18 2019-02-19 阿里巴巴集团控股有限公司 The querying method and inquiry unit of data
CN105354266A (en) * 2015-10-23 2016-02-24 北京航空航天大学 Rich graph model RichGraph based graph data management method
US9952931B2 (en) * 2016-01-19 2018-04-24 Microsoft Technology Licensing, Llc Versioned records management using restart era
CN106326457B (en) * 2016-08-29 2019-04-30 山大地纬软件股份有限公司 The construction method and system of people society personnel file pouch database based on big data
CN106569941B (en) * 2016-11-04 2019-01-01 金蝶软件(中国)有限公司 The method and apparatus for recording data course
CN106682173B (en) * 2016-12-28 2019-10-18 华南理工大学 A kind of social security big data OLAP preprocess method and on-line analysis querying method
CN107133283A (en) * 2017-04-17 2017-09-05 北京科技大学 A kind of Legal ontology knowledge base method for auto constructing
CN109254962B (en) * 2017-07-06 2020-10-16 中国移动通信集团浙江有限公司 Index optimization method and device based on T-tree and storage medium
CN110019474B (en) * 2017-12-19 2022-03-04 北京金山云网络技术有限公司 Automatic synonymy data association method and device in heterogeneous database and electronic equipment
CN108182265B (en) * 2018-01-09 2021-06-29 清华大学 Multilayer iterative screening method and device for relational network
CN109446175A (en) * 2018-11-12 2019-03-08 郑州云海信息技术有限公司 A kind of method and apparatus for the log object constructing key operation
CN110569327A (en) * 2019-07-08 2019-12-13 电子科技大学 multi-keyword ciphertext retrieval method supporting dynamic updating
CN110851848B (en) * 2019-11-12 2022-03-25 广西师范大学 Privacy protection method for symmetric searchable encryption
CN111192176B (en) * 2019-12-30 2023-04-28 华中师范大学 Online data acquisition method and device supporting informatization assessment of education
CN111897824A (en) * 2020-03-25 2020-11-06 上海云励科技有限公司 Data operation method, device, equipment and storage medium
CN111767332B (en) * 2020-06-12 2021-07-30 上海森亿医疗科技有限公司 Data integration method, system and terminal for heterogeneous data sources
CN112597348A (en) * 2020-12-15 2021-04-02 电子科技大学中山学院 Method and device for optimizing big data storage
CN112990601B (en) * 2021-04-09 2023-10-31 重庆大学 Worm wheel machining precision self-healing system and method based on data mining
KR102392880B1 (en) * 2021-09-06 2022-05-02 (주) 바우디움 Method for managing hierarchical documents and apparatus using the same
CN115048344B (en) * 2022-08-16 2022-11-04 安格利(成都)仪器设备有限公司 Storage method for three-dimensional contour and image data of inner wall and outer wall of pipeline or container
CN115543960B (en) * 2022-09-16 2024-01-05 北京神舟航天软件技术股份有限公司 Dynamic modeling method and system for business object

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5724577A (en) * 1995-06-07 1998-03-03 Lockheed Martin Corporation Method for operating a computer which searches a relational database organizer using a hierarchical database outline
CN102110165A (en) * 2011-02-28 2011-06-29 深圳市五巨科技有限公司 Method and system for scheduling interior of browser of mobile terminal


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
用户访问特征驱动的中间件语义缓存替换策略;陈宁江等;《广西大学学报:自然科学版》;20101031;第35卷(第5期);第787-792页 *

Also Published As

Publication number Publication date
CN103412917A (en) 2013-11-27


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20191017

Address after: No. 1089, building a, No. 19, Guokai Avenue, Nanning City, 530000 Guangxi Zhuang Autonomous Region

Patentee after: Nanning super cube science and Technology Co Ltd

Address before: 530004 No. 100, University Road, the Guangxi Zhuang Autonomous Region, Nanning

Patentee before: Guangxi University