CN102012912B

CN102012912B - Management method for unstructured data based on cloud computing environment

Info

Publication number: CN102012912B
Application number: CN2010105545374A
Authority: CN
Inventors: 王建民; 丁贵广; 朱妤晴
Original assignee: Tsinghua University
Current assignee: Tsinghua University
Priority date: 2010-11-19
Filing date: 2010-11-19
Publication date: 2012-08-22
Anticipated expiration: 2030-11-19
Also published as: CN102012912A

Abstract

The invention relates to a method for managing unstructured data in a cloud computing environment and belongs to the technical field of computer data management. The method supports unstructured data management on top of multiple cloud computing storage systems. It adopts a loosely coupled architecture comprising a cloud storage system for unstructured source data, a feature data cloud storage system for non-text unstructured data, a feature data sub-cloud system for text unstructured data, and other storage subsystems. An independent query processing module, which can be deployed in multiple instances, schedules the underlying storage subsystems and the multi-type feature extraction submodules so that the source data and the feature data of unstructured data are associated. Storage, retrieval, query and other management functions for multiple kinds of unstructured data, addressed by either source data or feature data, are thereby provided in a unified manner. The resulting system architecture and the range of managed content are both highly extensible.

Description

A management method for unstructured data based on a cloud computing environment

Technical field

The present invention relates to a method for managing unstructured data based on a cloud computing environment and belongs to the technical field of computer data management.

Background technology

With the growth of emerging applications such as the Web and the continuing development of enterprise informatization, large amounts of unstructured data have appeared. The massive volumes of HTML and XML data in the Web environment; multimedia data such as sound, film and graphics; and, in enterprise IT applications, contract texts, spreadsheets, presentation files, e-mail data, product design documents and the like are all unstructured data without explicit structural constraints. According to research reports, unstructured data accounts for more than 80% of all data. Demand for unstructured data applications in government and business decision-making is becoming increasingly important.

Unstructured data management challenges the theories and methods of the traditional information field and has become an important new research direction. Unstructured data has rich data types and complex structure, lacks a clear and uniformly defined data structure constraint, is massive in scale and highly dynamic, serves diverse application scenarios, and requires unified, federated access; together these make its management very challenging. Research institutions at home and abroad have taken up the problem and made some significant progress, but there is still no mature theory or key technique for unstructured data management.

Because the kinds of unstructured data differ from one another and each data type has its own distinctive operations, one approach extends the object data model to support efficient operations on the different kinds of unstructured data. On this basis, major vendors have defined and implemented type-specific operations for different types of unstructured data and, in combination with applications, have built unstructured data management systems. For example: Oracle 9i supports SQL/XML, 10g supports XQuery, and 11g supports binary XML storage and indexing, while the Oracle Multimedia component provides preliminary support for managing the metadata of some multimedia data; Sybase Search supports the processing, analysis, management and querying of unstructured data; IBM DB2 supports the associated storage of external LOB files and their corresponding metadata; EMC Documentum provides enterprise-wide, cross-department sharing of information in all forms based on a consistent content repository; and UIMA (Unstructured Information Management Architecture) can analyze large amounts of unstructured data to obtain the information that end users care about.

The main problems of unstructured data management based on an object model are as follows. Current systems lack an optimized execution mechanism for object methods, so data processing efficiency is hard to guarantee at massive scale. Systems emphasize the differing needs of particular objects and therefore have difficulty handling unified data queries. Some systems are built on relational databases and are constrained by the relational database framework; they must deal strictly with issues such as concurrency control, which further reduces the efficiency of unstructured data processing.

Data integration systems and dataspaces are also solutions proposed for unstructured data management. A data integration system integrates data from dispersed, heterogeneous data sources, provides transparent access to the data in those sources, and offers users a global data view and a unified query service. Typical systems include the TSIMMIS system developed at Stanford University, the Information Manifold system of AT&T, and the PayGo data integration system proposed by Google. A data warehouse can also be regarded as an integration system that integrates by materializing virtual views. Representative dataspace systems include the Semex system developed by Dong et al. at the University of Washington and the iMemex system developed by Professor Dittrich et al. at the Swiss Federal Institute of Technology in Zurich. Semex supports keyword queries on the basis of a relational model and can use structure to optimize the Top-K ranking of keyword search results. iMemex proposes, on the basis of a graph model, the concept of a unified resource view and a formal representation for it, giving a unified representation of various data types (such as documents, directories, relational tables, XML documents and data streams).

Data integration techniques focus on sharing and querying heterogeneous data; in an unstructured data management system they can reduce space cost and improve the quality of query results. However, schema matching, query rewriting and similar steps in data integration make the cost of building the system and the cost of processing queries excessive. Dataspaces overcome some of the problems of data integration, but their internal models are too complex and do not support the management of massive data. Moreover, data integration systems do not address keyword query modes or a distributed management architecture for massive data.

Summary of the invention

The object of the invention is to propose a method for managing unstructured data based on a cloud computing environment that manages multiple types of unstructured data, including text, audio, video and pictures, by means of data features, so as to satisfy users' varied data query requirements.

The method for managing unstructured data based on a cloud computing environment proposed by the invention comprises the following steps:

(1) The query processing module initializes the source data cloud storage system and creates a corresponding directory in it for each user. The query processing module also initializes the feature data cloud storage system and creates in it a default table for each type of unstructured data handled by the system; each default table stores the feature data of unstructured data of the corresponding type;

(2) The query processing module opens a specified network port and monitors the connection status between itself and clients outside the network;

(3) When the query processing module receives a connection request from a client outside the network, it establishes a connection with that client. A control thread in the query processing module receives the client's operation command over the network, and a data thread in the query processing module receives over the network the unstructured data corresponding to that operation command and caches it;

(4) When the client's operation command is a store command, the control thread in the query processing module, in accordance with the operation command, stores the cached unstructured data in the source data cloud storage system at the address specified by the user. If the operation command contains a user-specified source data address, the address is checked for validity. If the address conflicts with an existing name, the query processing module appends a number to the address to obtain a new source data address and stores the cached unstructured data at the new source data address; if there is no name conflict, the cached unstructured data is stored at the user-specified source data address. If the user has not specified a source data address, the query processing module automatically generates a new source data address, associates it with the corresponding user directory, and stores the cached unstructured data at the new source data address;

(5) For non-text unstructured data, the query processing module determines the type of the stored non-text unstructured data and, according to the result, notifies the feature extraction module corresponding to that type to extract the feature data of the non-text unstructured data. On receiving the notification, the feature extraction module for that particular type fetches the non-text unstructured source data from the source data cloud storage system, extracts feature data from it, and returns the extracted feature data to the query processing module. On receiving the feature data, the query processing module stores it in the feature data storage system. If the operation command contains a user-specified feature data address, the address is checked for validity. If the feature data address conflicts with an existing name, the query processing module appends a number to the feature data address to obtain a new feature data address and stores the feature data at the new feature data address; if there is no name conflict, the cached unstructured data is stored at the user-specified feature data address. If the user has not specified a feature data address, the query processing module automatically generates a new feature data address, associates it with the corresponding user directory, and stores the cached unstructured data at the new feature data address;

(6) For text unstructured data, the control thread of the query processing module notifies the text feature data sub-cloud system to extract the feature data of the text unstructured data. On receiving the notification, the text feature data sub-cloud system fetches the text unstructured data from the source data cloud storage system, extracts text features from it and builds a text index;

(7) When the operation command is a query command: if the operation command contains a user-specified source data address, the query processing module fetches the unstructured source data from that address in the source data cloud storage system and returns it to the user through the data thread. If the operation command contains a user-specified feature data address, the query processing module fetches the feature data from that address in the feature data cloud storage system and, according to the source data address stored in the feature data table, fetches the unstructured source data from the corresponding address in the source data cloud storage system and returns it to the user through the data thread;

(8) When the operation command is a query command and contains user-specified feature data: if the feature data is text feature data, the query processing module notifies the text feature data sub-cloud system to perform the query; the text feature data sub-cloud system returns to the query processing module a query result containing the list of unstructured data addresses found, and the query processing module returns the query result to the user. If the feature data is non-text feature data, the query processing module reads all feature data from the feature data cloud storage system and compares it with the user-specified feature data. If the operation command specifies approximate matching, the similarity between the user-specified feature data and each item of feature data read is computed, and the source data addresses of all feature data whose similarity falls within the approximation range are returned to the user; if the operation command specifies exact (equal-value) matching, the source data addresses of all feature data exactly equal to the user-specified feature data are returned to the user;

(9) When the operation command is a query command and contains no user-specified source data address, feature data address or feature data, the query processing module retrieves, from the user's directory in the source data cloud storage system, the addresses of all unstructured data and returns them to the user.
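The address handling described in step (4) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: a Python dictionary stands in for the source data cloud storage system, and the function names `resolve_address` and `store`, the `_n` numbering suffix, and the rule that a valid address must lie under the user's directory are all hypothetical, since the text does not define them.

```python
import itertools
import uuid


def resolve_address(existing, user_dir, requested=None):
    """Return the address at which the cached data should be stored.

    existing  -- collection of addresses already in use (stand-in for the store)
    user_dir  -- the user's directory created in step (1)
    requested -- address given in the operation command, or None
    """
    if requested is None:
        # No address supplied: generate one inside the user's directory.
        return f"{user_dir}/{uuid.uuid4().hex}"
    if not requested.startswith(user_dir + "/"):
        # Hypothetical validity rule (assumption): stay inside the user's directory.
        raise ValueError(f"invalid address: {requested}")
    if requested not in existing:
        return requested
    # Name conflict: append a number until the address is unused.
    for n in itertools.count(1):
        candidate = f"{requested}_{n}"
        if candidate not in existing:
            return candidate


def store(store_map, user_dir, data, requested=None):
    """Store data in the dictionary standing in for the cloud store."""
    address = resolve_address(store_map.keys(), user_dir, requested)
    store_map[address] = data
    return address
```

Storing twice under the same requested address keeps both objects: the second call is redirected to a numbered address instead of overwriting the first.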

The method for managing unstructured data based on a cloud computing environment proposed by the invention has the following advantages. First, it builds on multiple existing cloud computing storage subsystems and adopts a loosely coupled architecture, so that the storage and management of unstructured data are highly scalable. Second, the loosely coupled architecture allows storage subsystems and feature extraction modules to be plugged in and extended easily. Third, a unified management approach is used for multiple known kinds of unstructured data; different types of unstructured data are managed in an integrated way, which enables information sharing. Fourth, the unified unstructured data management model makes it easy to extend the types of unstructured data that are managed. Fifth, the unified unstructured data management module supports cross-queries over unstructured data based on features of the same type. Sixth, all data related to the query processing module is stored in the cloud computing storage subsystems, and synchronization of query processing module data is carried out within those subsystems; multiple identical query processing modules can therefore run in parallel to meet the service demands of many concurrent users.
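One way to picture the pluggable feature extraction modules named in the second advantage is a registry keyed by data type, sketched below in Python. All names here (`EXTRACTORS`, `register_extractor`, `extract`) and the placeholder features are assumptions made for illustration; the text does not say how modules are registered or which features they compute.

```python
EXTRACTORS = {}


def register_extractor(data_type):
    """Decorator that plugs in a feature extraction function for one data type."""
    def decorator(func):
        EXTRACTORS[data_type] = func
        return func
    return decorator


@register_extractor("image")
def extract_image_features(data):
    # Placeholder feature (assumption): only the byte length of the data.
    return {"size": len(data)}


@register_extractor("music")
def extract_music_features(data):
    # Placeholder feature (assumption): byte length plus the first byte.
    return {"size": len(data), "first_byte": data[0] if data else None}


def extract(data_type, data):
    """Dispatch to the extractor registered for data_type."""
    try:
        extractor = EXTRACTORS[data_type]
    except KeyError:
        raise LookupError(f"no feature extractor registered for {data_type!r}") from None
    return extractor(data)
```

Under this arrangement, supporting a new data type means registering one more function; the dispatch code in `extract` does not change.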

Description of drawings

Fig. 1 is a flow diagram of the method of the invention.

Fig. 2 is a schematic diagram of the architecture of an unstructured data management system based on the method of the invention.

Embodiment

The flow diagram of the method for managing unstructured data based on a cloud computing environment proposed by the invention is shown in Fig. 1. The method comprises the following steps:

(1) The query processing module initializes the source data cloud storage system Hadoop and creates a corresponding directory in it for each user. The query processing module also initializes the feature data cloud storage system Cassandra and creates in it the default tables default_image, default_music and default_text for the types of unstructured data handled by the system; each default table stores the feature data of unstructured data of the corresponding type, as shown in Fig. 2;

(4) When the client's operation command is a store command, the control thread in the query processing module, in accordance with the operation command, stores the cached unstructured data in the source data cloud storage system Hadoop at the address specified by the user. If the operation command contains a user-specified source data address, the address is checked for validity. If the address conflicts with an existing name, the query processing module appends a number to the address to obtain a new source data address and stores the cached unstructured data at the new source data address; if there is no name conflict, the cached unstructured data is stored at the user-specified source data address. If the user has not specified a source data address, the query processing module automatically generates a new source data address, associates it with the corresponding user directory, and stores the cached unstructured data at the new source data address;

(5) For non-text unstructured data, the query processing module determines the type of the stored non-text unstructured data and, according to the result, notifies the feature extraction module corresponding to that type to extract the feature data of the non-text unstructured data. On receiving the notification, the feature extraction module for that particular type fetches the non-text unstructured source data from the source data cloud storage system Hadoop, extracts feature data from it, and returns the extracted feature data to the query processing module. On receiving the feature data, the query processing module stores it in the feature data storage system. If the operation command contains a user-specified feature data address, the address is checked for validity. If the feature data address conflicts with an existing name, the query processing module appends a number to the feature data address to obtain a new feature data address and stores the feature data at the new feature data address; if there is no name conflict, the cached unstructured data is stored at the user-specified feature data address. If the user has not specified a feature data address, the query processing module automatically generates a new feature data address, associates it with the corresponding user directory, and stores the cached unstructured data at the new feature data address;

(6) For text unstructured data, the control thread of the query processing module notifies the text feature data sub-cloud system to extract the feature data of the text unstructured data. On receiving the notification, the text feature data sub-cloud system fetches the text unstructured data from the source data cloud storage system Hadoop, extracts text features from it and builds a text index;

(7) When the operation command is a query command: if the operation command contains a user-specified source data address, the query processing module fetches the unstructured source data from that address in the source data cloud storage system Hadoop and returns it to the user through the data thread. If the operation command contains a user-specified feature data address, the query processing module fetches the feature data from that address in the feature data cloud storage system Cassandra and, according to the source data address stored in the feature data table, fetches the unstructured source data from the corresponding address in the source data cloud storage system Hadoop and returns it to the user through the data thread;

(8) When the operation command is a query command and contains user-specified feature data: if the feature data is text feature data, the query processing module notifies the text feature data sub-cloud system to perform the query; the text feature data sub-cloud system returns to the query processing module a query result containing the list of unstructured data addresses found, and the query processing module returns the query result to the user. If the feature data is non-text feature data, the query processing module reads all feature data from the feature data cloud storage system Cassandra and compares it with the user-specified feature data. If the operation command specifies approximate matching, the similarity between the user-specified feature data and each item of feature data read is computed, and the source data addresses of all feature data whose similarity falls within the approximation range are returned to the user; if the operation command specifies exact (equal-value) matching, the source data addresses of all feature data exactly equal to the user-specified feature data are returned to the user;

(9) When the operation command is a query command and contains no user-specified source data address, feature data address or feature data, the query processing module retrieves, from the user's directory in the source data cloud storage system Hadoop, the addresses of all unstructured data and returns them to the user.
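The non-text branch of step (8) amounts to a full scan of the stored feature data followed by either an equality test or a similarity test. The Python sketch below illustrates that scan under explicit assumptions: an in-memory list of `(feature_vector, source_address)` rows stands in for a default table in Cassandra, features are numeric tuples, and cosine similarity with a `threshold` parameter is chosen purely for illustration, since the text does not name a similarity measure or define the approximation range.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length numeric sequences."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def query_by_feature(table, feature, approximate=False, threshold=0.9):
    """Scan every stored row and return the matching source data addresses.

    table       -- iterable of (feature_vector, source_address) rows
    feature     -- the user-specified feature data
    approximate -- True for approximate matching, False for exact matching
    threshold   -- hypothetical lower bound of the approximation range
    """
    results = []
    for stored, source_address in table:
        if approximate:
            same_length = len(stored) == len(feature)
            if same_length and cosine_similarity(stored, feature) >= threshold:
                results.append(source_address)
        elif stored == feature:
            results.append(source_address)
    return results
```

Because every row is read and compared, the cost of this branch grows linearly with the number of stored feature rows.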

Claims

1. A method for managing unstructured data based on a cloud computing environment, characterized in that the method comprises the following nine steps:

(4) When the client's operation command is not a store command, the method proceeds to step (7). When the client's operation command is a store command, the control thread in the query processing module, in accordance with the operation command, stores the cached unstructured data in the source data cloud storage system at the address specified by the user. If the operation command contains a user-specified source data address, the address is checked for validity; if the address conflicts with an existing name, the query processing module appends a number to the address to obtain a new source data address and stores the cached unstructured data at the new source data address; if there is no name conflict, the cached unstructured data is stored at the user-specified source data address. If the user has not specified a source data address, the query processing module automatically generates a new source data address, associates it with the corresponding user directory, and stores the cached unstructured data at the new source data address. The query processing module then determines the type of the unstructured data: if it is text unstructured data, the method proceeds to step (6); otherwise the method proceeds to step (5);

(5) For non-text unstructured data, the query processing module determines the type of the stored non-text unstructured data and, according to the result, notifies the feature extraction module corresponding to that type to extract the feature data of the non-text unstructured data. On receiving the notification, the feature extraction module for that particular type fetches the non-text unstructured source data from the source data cloud storage system, extracts feature data from it, and returns the extracted feature data to the query processing module. On receiving the feature data, the query processing module stores it in the feature data cloud storage system. If the operation command contains a user-specified feature data address, the address is checked for validity. If the feature data address conflicts with an existing name, the query processing module appends a number to the feature data address to obtain a new feature data address and stores the feature data at the new feature data address; if there is no name conflict, the cached unstructured data is stored at the user-specified feature data address. If the user has not specified a feature data address, the query processing module automatically generates a new feature data address, associates it with the corresponding user directory, and stores the cached unstructured data at the new feature data address;

(7) When the operation command is a query command: if the operation command contains a user-specified source data address, the query processing module fetches the unstructured source data from that address in the source data cloud storage system and returns it to the user through the data thread. If the operation command contains a user-specified feature data address, the query processing module fetches the feature data from that address in the feature data cloud storage system and, according to the source data address stored in the default table of the feature data cloud storage system, fetches the unstructured source data from the corresponding address in the source data cloud storage system and returns it to the user through the data thread. If the operation command contains no user-specified address but contains feature data, the method proceeds to step (8); if the operation command contains neither a user-specified address nor feature data, the method proceeds to step (9);