CN103034738A - Relational database for managing heterogeneous unstructured data, and methods for creating and querying description information for the unstructured data - Google Patents

Relational database for managing heterogeneous unstructured data, and methods for creating and querying description information for the unstructured data

Info

Publication number
CN103034738A
Authority
CN
China
Prior art keywords
data
database
formatted text
relevant database
unstructured data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2012105940955A
Other languages
Chinese (zh)
Inventor
武新
范振勇
张学
崔维力
赵伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
TIANJIN NANDA GENERAL DATA TECHNOLOGY Co Ltd
Original Assignee
TIANJIN NANDA GENERAL DATA TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by TIANJIN NANDA GENERAL DATA TECHNOLOGY Co Ltd
Priority to CN2012105940955A
Publication of CN103034738A
Legal status: Pending

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a relational database for managing heterogeneous unstructured data. The database contains formatted text that describes unstructured data stored outside the database. The formatted text comprises a uniform resource identifier (URI) string, which provides the access protocol and storage location of the data, a data verification attribute field, and a data format field. The invention also provides a creation method and a query method for the relational database that manages external heterogeneous data. The beneficial effects are that the database's mechanism for managing external data is highly extensible and can adapt to the various access protocols of external data, while the integrity of the database's external data management and the data independence of references to external data are improved.

Description

Relational database for managing heterogeneous unstructured data, and methods for creating and querying description information for the unstructured data
Technical field
The invention belongs to the field of data storage, and in particular relates to a relational database for managing external heterogeneous data, together with methods for creating and querying it.
Background art
"Big data", in brief, refers to the ability to obtain information quickly from large volumes of varied data; the technology for doing so is big data technology. The term is commonly used to describe the large amounts of unstructured and semi-structured data that a company creates, data that would take too much time and money to load into a relational database for analysis. Big data analysis is often linked with cloud computing, because real-time analysis of large data sets needs a framework such as MapReduce to distribute work across tens, hundreds, or even thousands of computers.
Fields such as BLOB and TEXT, which are frequently used in existing relational databases, cannot adequately support big data. The relational databases Oracle and MS SQL Server have BLOB-type fields that are stored outside the database: BFILE in Oracle and FileStream in MS SQL Server. Both store the file name of the data in the database, and the database reads the data on disk through that file name. The drawback is that the external application must guarantee the integrity of the data and its consistency with the other database fields; the database itself has no corresponding means of constraint. In addition, external storage supports only a single protocol, so it cannot support multiple kinds of remote storage or adapt to the steady stream of new distributed storage protocols.
Summary of the invention
The problem to be solved by the invention is to provide a relational database for managing external heterogeneous data, particularly suited to storing many kinds of external data across multiple systems.
To solve the above technical problem, the invention adopts the following technical solution:
A relational database for managing external heterogeneous data, wherein the relational database comprises formatted text that describes heterogeneous data stored outside the database.
Further, the formatted text comprises a URI string that provides the access protocol and storage location of the data.
Further, the formatted text comprises a data verification attribute field.
Further, the data verification field comprises the data length, the last modification time of the data, the MD5 digest of the data, or any combination of the three.
Further, the formatted text comprises a data format field.
Further, the data format field comprises a media type, an encoding algorithm, or a combination of the two.
The invention also provides a method for creating the above relational database for managing external heterogeneous data, comprising:
storing the heterogeneous data in a storage layer;
creating formatted text that describes the heterogeneous data;
storing the descriptive formatted text in the database.
Further, the second step of the method also comprises: storing the collected data verification attributes of the heterogeneous data in the formatted text.
Further, the second step of the method also comprises: storing the collected data attributes of the heterogeneous data in the formatted text.
According to another aspect of the invention, a method is also provided by which a database supporting big data queries external data, comprising:
the database receives a query request;
the database returns, in response to the query request, the formatted text describing the external data;
the formatted text is parsed;
the data is read through the storage layer according to the parsed formatted text.
The advantages and positive effects of the invention are as follows: with the above technical solution, the database is highly extensible and can adapt to multiple access protocols for external data; at the same time, the integrity of the database's external data management and the data independence of references to external data are strengthened.
Description of drawings
Fig. 1 is a schematic diagram of the database reading URI data in one embodiment of the invention.
Fig. 2 is a schematic diagram of the database creation method in one embodiment of the invention.
Fig. 3 is a schematic diagram of one mode of querying the database in one embodiment of the invention.
Fig. 4 is a schematic diagram of another mode of querying the database in one embodiment of the invention.
Embodiment
The invention is further described below with reference to one embodiment. GBase8a is a database that supports big data. The data is stored outside GBase8a, and it can be reached as a local file or as data held on an HTTP server, an FTP server, or storage that uses another dedicated protocol.
A uniform resource identifier (URI) can locate many types of data. The descriptive formatted text in GBase8a can contain a URI string; this URI string is simple formatted text that stores the URI of the external data.
The URI string is implemented by adding a URI marker to the varchar type. Its data is multi-line text, with lines separated by a carriage return and line feed pair, comprising:
The URI on the first line:
URI = protocol name ":" authentication information directory file name ["?" query argument] ["#" bookmark]
Only absolute URIs are supported; relative addresses are not.
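As a minimal sketch of the first-line grammar above, the following Python splits an absolute URI into the parts the grammar names and rejects relative addresses. The function name `parse_absolute_uri` and the dictionary keys are illustrative assumptions and are not part of GBase8a; the splitting itself relies on the standard `urllib.parse.urlsplit`.

```python
from urllib.parse import urlsplit


def parse_absolute_uri(uri: str) -> dict:
    """Split an absolute URI into protocol, authority, path, query and bookmark.

    Hypothetical helper: the key names mirror the grammar in the text,
    not any real GBase8a interface.
    """
    parts = urlsplit(uri)
    if not parts.scheme:
        # The text states that relative addresses are not supported.
        raise ValueError(f"relative URI not supported: {uri!r}")
    return {
        "protocol": parts.scheme,     # protocol name before ":"
        "authority": parts.netloc,    # authentication information and host
        "path": parts.path,           # directory and file name
        "query": parts.query,         # optional query argument
        "bookmark": parts.fragment,   # optional bookmark after "#"
    }
```

For example, `parse_absolute_uri("ftp://user:pw@host.example/dir/file.txt")` yields the protocol `ftp` and the path `/dir/file.txt`, while a bare `dir/file.txt` raises `ValueError`.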
The descriptive formatted text of the GBase8a database can also contain data verification attribute fields. These can be the length (Content-Length), the last modification time (Last-Modified), the MD5 digest of the data (Content-MD5), or a combination of them. The data verification attribute fields of the GBase8a URI data type include all three.
The basic form is:
Field name: field value
Fields are divided into two parts: verification and format description. The verification part is mainly used for data constraints, letting the database judge whether the data a URI points to has changed since it was loaded.
Data verification fields
1. Content-Length gives the size of the data as a string of decimal digits.
Content-Length = "Content-Length" ":" 1*DIGIT
For example, Content-Length: 3495
2. Last-Modified gives the date and time at which the data was last modified.
Last-Modified = "Last-Modified" ":" HTTP-date (RFC 1123 format)
For example, Last-Modified: Tue, 15 Nov 1994 12:45:26 GMT
3. Content-MD5 carries the MD5 checksum.
Content-MD5 = "Content-MD5" ":" md5-digest
md5-digest = <base64 encoding of the 128-bit MD5 digest, per RFC 1864>
Content-Length, Last-Modified and Content-MD5 are all optional. If they are present, the application and GBase8a should check, when reading the data, whether the actual data (for example its size) matches the description; any difference means that data consistency has been broken. If they are absent, no consistency check is performed.
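The check described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than GBase8a's implementation: the function names `describe` and `is_consistent` are hypothetical, the fields are held in a plain dictionary, and timestamps are compared at whole-second resolution because the HTTP date format carries no fractional seconds.

```python
import base64
import hashlib
from email.utils import formatdate, parsedate_to_datetime


def _md5_b64(data: bytes) -> str:
    """Base64 of the 128-bit MD5 digest, as in RFC 1864."""
    return base64.b64encode(hashlib.md5(data).digest()).decode("ascii")


def describe(data: bytes, mtime: float) -> dict:
    """Build the three verification fields for data last modified at mtime."""
    return {
        "Content-Length": str(len(data)),
        "Last-Modified": formatdate(mtime, usegmt=True),
        "Content-MD5": _md5_b64(data),
    }


def is_consistent(fields: dict, data: bytes, mtime: float) -> bool:
    """Check only the fields that are present; absent fields are skipped."""
    if "Content-Length" in fields and int(fields["Content-Length"]) != len(data):
        return False
    if "Last-Modified" in fields:
        recorded = parsedate_to_datetime(fields["Last-Modified"]).timestamp()
        if int(recorded) != int(mtime):
            return False
    if "Content-MD5" in fields and fields["Content-MD5"] != _md5_b64(data):
        return False
    return True
```

Because every field is optional, an empty dictionary always passes, which matches the rule that no consistency check is performed when the fields are absent.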
Adding the data verification attribute fields also strengthens the data independence of references to external data, since the fields can describe the complete attributes of the data.
The descriptive formatted text of the GBase8a database can also contain format fields. The format description fields mainly help GBase8a read unstructured data and parse it correctly. In addition, they can be extended flexibly according to the protocol type of the URI to describe the data in more detail, so that extension modules written by third-party developers for data extraction can identify the information.
Format description fields
1. Content-Type identifies the media type. Syntax:
Content-Type = "Content-Type" ":" media-type
media-type = type "/" subtype [";" "charset" "=" charset]
type = token
subtype = token
charset = token
For example, Content-Type: text/html; charset=ISO-8859-4
2. Content-Encoding indicates the encoding algorithm of the data. GBase8a can also let users add further description attributes for external data through extensions, which keeps table design flexible, especially when such attributes are sparse.
When Content-Encoding is present, its value indicates whether the data is compressed; it generally appears only when the data is plain text.
Content-Encoding = "Content-Encoding" ":" "gzip"
For example, Content-Encoding: gzip
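To illustrate how the two format fields might be applied when reading data, here is a small Python sketch. The names `parse_content_type` and `decode_body` are hypothetical, and the fallback to UTF-8 when no charset is given is an assumption of this example; the source does not specify a default.

```python
import gzip


def parse_content_type(value: str):
    """Split 'type/subtype; charset=x' into (type, subtype, charset or None)."""
    media, _, rest = value.partition(";")
    type_, _, subtype = media.strip().partition("/")
    charset = None
    for param in rest.split(";"):
        name, _, val = param.partition("=")
        if name.strip().lower() == "charset":
            charset = val.strip()
    return type_.lower(), subtype.lower(), charset


def decode_body(raw: bytes, fields: dict) -> str:
    """Undo gzip if Content-Encoding says so, then decode with the charset."""
    if fields.get("Content-Encoding") == "gzip":
        raw = gzip.decompress(raw)
    _, _, charset = parse_content_type(fields.get("Content-Type", "text/plain"))
    # Assumption: fall back to UTF-8 when the descriptor names no charset.
    return raw.decode(charset or "utf-8")
```

The encoding is undone before the charset is applied, since the charset describes the uncompressed text.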
The corresponding conversion plug-in is started according to Content-Type. If the converted result itself carries a Content-Type, GBase8a goes on to start the corresponding converter, until the output is plain text, "text/plain". If the URI field does not set Content-Type, or the type, subtype or charset that is set is outside the supported range, a general conversion plug-in is called; in that case, conversion as the application expects is not guaranteed.
In the GBase8a URI field, an empty line marks the end of the URI field data.
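Putting the layout together (URI on the first line, "Field name: field value" lines, CRLF separators, and an empty line as terminator), a descriptor could be built and parsed roughly as below. Both function names are hypothetical and the sketch assumes field values never contain line breaks.

```python
def build_descriptor(uri: str, fields: dict) -> str:
    """Serialize a URI and its fields; the trailing empty line ends the data."""
    lines = [uri] + [f"{name}: {value}" for name, value in fields.items()]
    return "\r\n".join(lines) + "\r\n\r\n"


def parse_descriptor(text: str):
    """Return (uri, fields) from descriptor text, stopping at the empty line."""
    lines = text.split("\r\n")
    uri = lines[0]
    fields = {}
    for line in lines[1:]:
        if line == "":
            break  # an empty line marks the end of the URI field data
        name, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"malformed field line: {line!r}")
        fields[name.strip()] = value.strip()
    return uri, fields
```

Splitting on the first colon only keeps values such as `Tue, 15 Nov 1994 12:45:26 GMT` intact.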
The URI type lets the GBase8a database system adapt to multiple access protocols for external data, making the database highly extensible with respect to external data. As shown in Fig. 1, the data access protocol supports redirection between protocols: for example, the HTTP protocol can redirect to FTP, and a network file system can redirect to the local file system. Redirection continues until the maximum number of redirects is exceeded or an unsupported protocol is encountered, at which point it stops.
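The stopping rule for redirection can be sketched as a bounded loop. This is a simplified model, not the mechanism of Fig. 1 itself: redirects are represented by a plain dictionary instead of real network responses, and the name `resolve` and the default limit of 5 are assumptions of the example.

```python
from urllib.parse import urlsplit


def resolve(uri: str, redirects: dict, supported: set, max_redirects: int = 5) -> str:
    """Follow redirects until a final URI is reached.

    Stops with ValueError on an unsupported protocol and with RuntimeError
    once more than max_redirects redirects would be needed.
    """
    for _ in range(max_redirects + 1):
        scheme = urlsplit(uri).scheme.lower()
        if scheme not in supported:
            raise ValueError(f"unsupported protocol: {scheme!r}")
        if uri not in redirects:
            return uri  # no further redirect: this is the final location
        uri = redirects[uri]
    raise RuntimeError("maximum number of redirects exceeded")
```

Bounding the loop guards against redirect cycles, and checking the protocol on every hop means an HTTP source cannot silently lead to a protocol the database cannot read.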
As shown in Fig. 2, when data is loaded, the application system first stores the unstructured data in the storage layer, which may be a disk, an array or another local storage device, or a remote storage service such as an FTP server or a distributed file system service. The application then generates URI data according to the agreed URI field format and stores it in GBase8a. GBase8a reads the data through the corresponding URI access program and performs processing such as content analysis and full-text indexing.
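The loading flow can be illustrated end to end with a stand-in database. In the sketch below SQLite replaces GBase8a purely so the example is runnable, the local file system plays the storage layer, and the table `documents`, its columns, and the function `ingest` are all hypothetical names.

```python
import base64
import hashlib
import sqlite3
from email.utils import formatdate
from pathlib import Path


def ingest(conn: sqlite3.Connection, doc_id: int, data: bytes, storage_dir: str) -> str:
    """Store data in the storage layer, then store its descriptor in the database."""
    # Step 1: the application saves the unstructured data in the storage layer.
    path = Path(storage_dir) / f"{doc_id}.bin"
    path.write_bytes(data)

    # Step 2: build the formatted text (URI first, then verification fields).
    md5 = base64.b64encode(hashlib.md5(data).digest()).decode("ascii")
    descriptor = "\r\n".join([
        path.resolve().as_uri(),
        f"Content-Length: {len(data)}",
        f"Last-Modified: {formatdate(path.stat().st_mtime, usegmt=True)}",
        f"Content-MD5: {md5}",
    ]) + "\r\n\r\n"

    # Step 3: store the descriptor, not the data itself, in the database.
    conn.execute("INSERT INTO documents (id, uri) VALUES (?, ?)", (doc_id, descriptor))
    return descriptor
```

The caller is expected to have created the table first, for example with `CREATE TABLE documents (id INTEGER PRIMARY KEY, uri TEXT)`.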
GBase8a can generally be queried in two modes. In mode one, shown in Fig. 3, the application sends a query request to GBase8a; after the URI information is returned, the application parses the URI itself, reads the data through the storage layer, processes it, and returns the result.
In mode two, shown in Fig. 4, the application sends a query request to GBase8a, and GBase8a fetches the data the URI points to directly through a built-in function or a user-defined function (UDF), returning it to the client through a TEXT or BLOB interface.
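The difference between the two modes can be shown with SQLite standing in for GBase8a, since SQLite also allows user-defined functions. Everything here is an assumption made for illustration: the table `documents`, the function names, and the restriction to `file:` URIs. It does not describe GBase8a's actual UDF interface.

```python
import sqlite3
from pathlib import Path
from urllib.parse import urlsplit
from urllib.request import url2pathname


def read_uri(descriptor: str) -> bytes:
    """Read the data named by the first line of a descriptor (file: URIs only)."""
    uri = descriptor.split("\r\n", 1)[0]
    parts = urlsplit(uri)
    if parts.scheme != "file":
        raise ValueError(f"unsupported protocol in this sketch: {parts.scheme!r}")
    return Path(url2pathname(parts.path)).read_bytes()


def query_mode_one(conn: sqlite3.Connection, doc_id: int) -> bytes:
    """Mode one: the database returns the descriptor; the application reads the data."""
    (descriptor,) = conn.execute(
        "SELECT uri FROM documents WHERE id = ?", (doc_id,)
    ).fetchone()
    return read_uri(descriptor)


def query_mode_two(conn: sqlite3.Connection, doc_id: int) -> bytes:
    """Mode two: a UDF inside the database dereferences the URI and returns a BLOB."""
    conn.create_function("read_uri", 1, read_uri)
    (blob,) = conn.execute(
        "SELECT read_uri(uri) FROM documents WHERE id = ?", (doc_id,)
    ).fetchone()
    return blob
```

In mode one the application needs its own protocol handling; in mode two that logic lives in the database, so clients only see an ordinary TEXT or BLOB result.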
One embodiment of the invention has been described in detail above, but it is only a preferred embodiment and must not be taken to limit the scope of the invention. All equivalent changes and improvements made within the scope of this application still fall within the coverage of this patent.

Claims (10)

1. A relational database for managing heterogeneous unstructured data, characterized in that: the relational database comprises formatted text that describes heterogeneous data stored outside the database.
2. The relational database according to claim 1, characterized in that: the formatted text comprises a URI string that provides the access protocol and storage location of the data.
3. The relational database according to claim 1, characterized in that: the formatted text comprises a data verification attribute field.
4. The relational database according to claim 3, characterized in that: the data verification field comprises the data length, the last modification time of the data, the MD5 digest of the data, or any combination of the three.
5. The relational database according to claim 1, characterized in that: the formatted text comprises a data format field.
6. The database according to claim 5, characterized in that: the data format field comprises a media type, an encoding algorithm, or a combination of the two.
7. A method for querying unstructured data in the relational database for managing heterogeneous unstructured data according to claim 1, comprising:
the database receives a query request;
the database returns, in response to the query request, the formatted text describing the external data;
the formatted text is parsed;
the data is read through the storage layer according to the parsed formatted text.
8. A method for creating the relational database for managing heterogeneous unstructured data according to claim 1, comprising:
storing the heterogeneous data in a storage layer;
creating formatted text that describes the heterogeneous data;
storing the descriptive formatted text in the database.
9. The creation method according to claim 8, characterized in that: the second step further comprises storing the collected verification fields of the heterogeneous data in the formatted text.
10. The creation method according to claim 8, characterized in that: the second step further comprises storing the collected attribute fields of the heterogeneous data in the formatted text.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2012105940955A CN103034738A (en) 2012-12-29 2012-12-29 Relational database for managing heterogeneous unstructured data, and methods for creating and querying description information for the unstructured data

Publications (1)

Publication Number Publication Date
CN103034738A (en) 2013-04-10

Family

ID=48021632

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2012105940955A Pending CN103034738A (en) 2012-12-29 2012-12-29 Relational database for managing heterogeneous unstructured data, and methods for creating and querying description information for the unstructured data

Country Status (1)

Country Link
CN (1) CN103034738A (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6301579B1 (en) * 1998-10-20 2001-10-09 Silicon Graphics, Inc. Method, system, and computer program product for visualizing a data structure
CN1723462A (en) * 2004-02-10 2006-01-18 微软公司 Systems and methods for a large object infrastructure in a database system

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104169914A (en) * 2013-12-11 2014-11-26 华为技术有限公司 Data storage method, data processing method, device and mobile terminal
CN103699682A (en) * 2013-12-31 2014-04-02 福建星网视易信息系统有限公司 Method for verifying structural logics of databases
CN103699682B (en) * 2013-12-31 2017-02-15 福建星网视易信息系统有限公司 Method for verifying structural logics of databases
CN105138533A (en) * 2015-06-29 2015-12-09 北京奇虎科技有限公司 Method and device for accessing statistical and scientific database (SSDB) server
CN105138533B (en) * 2015-06-29 2019-03-05 北京奇虎科技有限公司 Method and apparatus for accessing SSDB server
CN106027374A (en) * 2016-06-30 2016-10-12 乐视控股(北京)有限公司 Information transmitting method and information transmitting server
CN109446157A (en) * 2018-10-18 2019-03-08 武汉虹旭信息技术有限责任公司 System and method are looked into a kind of data format school based on format data
CN109446157B (en) * 2018-10-18 2021-10-29 武汉虹旭信息技术有限责任公司 Data format checking system and method based on formatted data
CN111143623A (en) * 2019-12-31 2020-05-12 科技谷(厦门)信息技术有限公司 Data quality monitoring method in big data environment

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C53 Correction of patent of invention or patent application
CB02 Change of applicant information

Address after: Haitai 300384 in Tianjin Binhai high tech Zone Huayuan Industrial Zone Development six road No. 6 Haitai green industry base J

Applicant after: Tianjin NanKai University General Data Technologies Co., Ltd.

Address before: Haitai 300384 in Tianjin Binhai high tech Zone Huayuan Industrial Zone Development six road No. 6 Haitai green industry base J

Applicant before: Tianjin Nanda General Data Technology Co., Ltd.

COR Change of bibliographic data

Free format text: CORRECT: APPLICANT; FROM: TIANJIN NANDA GENERAL DATA TECHNOLOGY CO., LTD. TO: TIANJIN NANDA CONVENTIONAL DATA TECHNOLOGY CO., LTD.

C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20130410