CN104424309A

CN104424309A - Unstructured data processing method based on technological media cloud computing

Info

Publication number: CN104424309A
Application number: CN201310399024.4A
Authority: CN
Inventors: 渠继永
Original assignee: TTTH HOLDINGS Co Ltd
Current assignee: TTTH HOLDINGS Co Ltd
Priority date: 2013-09-05
Filing date: 2013-09-05
Publication date: 2015-03-18

Abstract

The invention discloses an unstructured data processing method based on technological media cloud computing. The method includes: (1) acquiring technological media information data; (2) storing the data in distributed cloud storage according to the characteristics of its different types; and (3) calling the unstructured data held in cloud storage in step (2), processing it offline (cleaning, deduplication, association, filtering, keyword extraction, and intelligent classification), and updating the processed unstructured data into cloud storage. The method provides a cloud-computing-based unstructured data solution for the vertical field of technological media. Because of its precise industry positioning and its in-depth analysis of frequently used keywords, it can improve the precision of information, remove some noise words, and improve data processing efficiency.

Description

An unstructured data processing method based on technological media cloud computing

Technical field

The present invention relates to the field of computer data processing, and in particular to an unstructured data processing method based on technological media cloud computing.

Background technology

Cloud computing is a model for adding, using, and delivering Internet-based services, and usually involves providing dynamically scalable and often virtualized resources over the Internet. In the narrow sense, cloud computing refers to the delivery and usage model of IT infrastructure, in which the required resources are obtained over the network on demand and in an easily scalable way. In the broad sense, it refers to the delivery and usage model of services, in which the required services are obtained over the network on demand and in an easily scalable way. Such services may relate to IT, software, or the Internet, or may be other services, which means that computing power can also circulate over the Internet as a commodity.

Unstructured data management challenges the theories and methods of the conventional information field and has become an important new research direction. Unstructured data is rich in data types and complex in structure, and it is not constrained by any clear, uniformly defined data structure. In addition, its massive scale, highly dynamic characteristics, varied application scenarios, and the requirement for unified, federated access pose great challenges for unstructured data management. Because the kinds of unstructured data differ from one another and each data type carries its own distinctive operations, effective operations on different unstructured data are supported by extending the object data model. Based on these considerations, major companies have defined and implemented type-specific operations for different types of unstructured data and, in combination with their application domains, have built unstructured data management systems.

The main problems of unstructured data management based on the object model are as follows. Current systems lack an optimized execution mechanism for object methods, so data processing efficiency is hard to guarantee in massive-data environments. Systems focus on the distinct needs of handling specific objects and therefore have some difficulty handling uniform data queries. Some systems are implemented on relational databases and are limited by the relational framework, which requires strict attention to issues such as concurrency control and further reduces the efficiency of unstructured data processing. Data integration techniques focus on sharing and querying heterogeneous data; in an unstructured data management system they can reduce space cost and improve the quality of query results. However, schema matching, query rewriting, and similar steps in data integration make system construction and query processing too costly. Dataspaces overcome some of the problems of data integration, but the internal model of a dataspace is too complex and does not support the management of massive data. Moreover, data integration systems do not address keyword query modes or distributed management architectures for massive data.

In light of the above analysis, existing cloud-computing-based unstructured data processing techniques are still broad in scope and insufficiently precise about the data they handle. In addition, existing cloud-computing-based unstructured data processing addresses only the implementation method and does not offer a complete solution from hardware and software configuration through to the implementation method. Effective innovation is therefore needed in these respects.

Summary of the invention

The object of the present invention is to provide an unstructured data processing method based on technological media cloud computing that combines unstructured data processing technology with cloud computing and covers the full workflow, including hardware configuration, system architecture, data processing, and result feedback, so as to overcome the deficiencies of the prior art.

The object of the present invention is achieved by the following technical solution:

An unstructured data processing method based on technological media cloud computing, mainly comprising the following steps:

(1) acquiring technological media information data to obtain the unstructured data to be processed;

(2) storing the unstructured data in distributed cloud storage according to the characteristics of its different types;

(3) calling the unstructured data held in cloud storage in step (2) and processing it offline, the offline processing comprising cleaning, deduplication, association, filtering, keyword extraction, and intelligent classification, and then updating the offline-processed unstructured data into cloud storage;

(4) responding to received information retrieval requests according to the characteristics of the unstructured data, and displaying the retrieval result sequence according to the characteristics of the unstructured data.
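The offline processing of step (3) can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not the method as implemented by the applicant: the noise-word list, the category vocabulary, and all function names are hypothetical, keyword extraction is reduced to simple word frequency, and the association and filtering sub-steps are omitted because the text does not describe them.

```python
import hashlib
import re
from collections import Counter

# Hypothetical noise words and category vocabularies; the source names neither.
NOISE_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on"}
CATEGORY_KEYWORDS = {
    "cloud": {"cloud", "hadoop", "storage"},
    "mobile": {"phone", "android", "app"},
}


def clean(text):
    """Cleaning: strip HTML-like tags and collapse runs of whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def deduplicate(docs):
    """Deduplication: drop documents whose case-folded content hash was seen."""
    seen, unique = set(), []
    for doc in docs:
        digest = hashlib.sha1(doc.lower().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


def extract_keywords(text, top_n=3):
    """Keyword extraction: the most frequent words that are not noise words."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in NOISE_WORDS]
    return [word for word, _ in Counter(words).most_common(top_n)]


def classify(keywords):
    """Classification: the category with the largest keyword overlap, else 'other'."""
    best, best_score = "other", 0
    for category, vocabulary in CATEGORY_KEYWORDS.items():
        score = len(vocabulary & set(keywords))
        if score > best_score:
            best, best_score = category, score
    return best


def process_offline(raw_docs):
    """Run cleaning, deduplication, keyword extraction, and classification."""
    cleaned = [clean(doc) for doc in raw_docs]
    records = []
    for doc in deduplicate(cleaned):
        keywords = extract_keywords(doc)
        records.append({"text": doc, "keywords": keywords, "category": classify(keywords)})
    return records
```

In a real deployment each record returned by `process_offline` would be written back to cloud storage; here the records are simply returned as dictionaries.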

In step (1), the technological media information data is acquired through two channels: manual entry and Internet crawling.

In step (3), the calling of the unstructured data and the subsequent offline processing are performed by a large-scale distributed computing platform.

In step (4), the retrieval result sequence is also stored in a cache.

In step (4), the retrieval result sequence in the cache is either updated directly into cloud storage or stored in the cloud after offline processing.

The beneficial effects of the unstructured data processing method based on technological media cloud computing according to the present invention are as follows. The method is a cloud-computing-based unstructured data solution positioned in the vertical field of technological media. Because of its precise industry positioning and its in-depth analysis of commonly used keywords, it can improve the precision of information, exclude some noise words, and improve data processing efficiency. Specifically:

First, it adopts an architecture of loosely coupled sub-storage systems: a cloud storage system for unstructured source data, a cloud storage system for the feature data of non-text unstructured data, and a cloud storage system for the feature data of text unstructured data;

Second, independent query processing modules, of which multiple instances can be deployed, schedule the underlying sub-storage systems and the multi-type feature extraction submodules, thereby associating the source data of the unstructured data with its feature data;

Third, management functions such as storage, acquisition, and querying of many kinds of unstructured data, covering both source data and feature data, are implemented in a unified mode;

Both the resulting system architecture and the managed content are highly scalable.
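The loosely coupled arrangement of three sub-storage systems behind a unified interface might be sketched as follows. This is a hedged illustration only: the class names `SubStore` and `UnifiedStore`, the `kind` label, and the use of in-memory dictionaries in place of real cloud storage are all assumptions introduced here and do not come from the patent text.

```python
from dataclasses import dataclass, field


@dataclass
class SubStore:
    """Stand-in for one sub-storage system; a plain in-memory dict here."""
    name: str
    rows: dict = field(default_factory=dict)

    def put(self, key, value):
        self.rows[key] = value

    def get(self, key):
        return self.rows.get(key)


class UnifiedStore:
    """Hypothetical facade over the source store and the two feature stores."""

    def __init__(self):
        self.source = SubStore("source")
        self.text_features = SubStore("text_features")
        self.nontext_features = SubStore("nontext_features")

    def _feature_store(self, kind):
        # Text items go to the text feature store, everything else to the other.
        return self.text_features if kind == "text" else self.nontext_features

    def store(self, item_id, payload, kind, features):
        """Write source data and feature data to their separate sub-stores."""
        self.source.put(item_id, {"kind": kind, "payload": payload})
        self._feature_store(kind).put(item_id, features)

    def query(self, item_id):
        """Associate the source data of an item with its feature data."""
        src = self.source.get(item_id)
        if src is None:
            return None
        features = self._feature_store(src["kind"]).get(item_id)
        return {"source": src["payload"], "features": features}
```

Because callers only see `store` and `query`, a sub-store could be replaced or scaled independently without changing the calling code, which is one way to read the scalability claim above.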

Brief description of the drawings

The present invention is described in further detail below with reference to the drawing and the embodiment.

Fig. 1 is a flowchart of the unstructured data processing method based on technological media cloud computing according to the embodiment of the present invention.

Embodiment

As shown in Fig. 1, the unstructured data processing method based on technological media cloud computing according to the embodiment of the present invention mainly comprises the following steps:

(2) storing the unstructured data in distributed cloud storage according to the characteristics of its different types; this step requires a Hadoop + HBase architecture that supports large capacity and high performance;

(4) responding to received information retrieval requests according to the characteristics of the unstructured data, and displaying the retrieval result sequence according to the characteristics of the unstructured data, each result in the retrieval result sequence being linked to its corresponding data source.

In step (4), the retrieval result sequence in the cache is either updated directly into cloud storage or stored in the cloud after offline processing. In this way, when the same information retrieval request arrives before the relevant data has been updated, the retrieval result sequence is sent directly to the requester without performing the cloud computation again.
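The caching behavior described for step (4) can be sketched in Python as below. This is a minimal sketch under assumptions: the class `RetrievalCache`, its methods, and the injected `search_fn` callable are hypothetical names, and the whole cache is cleared on any data update, which is coarser than the per-query "relevant data" condition stated in the text.

```python
class RetrievalCache:
    """Serve repeated retrieval requests from a cache until data is updated."""

    def __init__(self, search_fn):
        # search_fn stands in for the expensive cloud-side retrieval (assumption).
        self._search_fn = search_fn
        self._cache = {}

    def notify_data_updated(self):
        """Invalidate cached results after the underlying data changes."""
        self._cache.clear()

    def search(self, query):
        """Return the cached result sequence, computing it only on a miss."""
        key = query.strip().lower()  # normalize so equivalent requests share an entry
        if key not in self._cache:
            self._cache[key] = self._search_fn(key)
        return self._cache[key]
```

A second identical request before any update is answered from `_cache` without calling `search_fn`; after `notify_data_updated()` the next request triggers a fresh retrieval.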

Claims

1. An unstructured data processing method based on technological media cloud computing, characterized in that it mainly comprises the following steps:

2. The unstructured data processing method based on technological media cloud computing according to claim 1, characterized in that: in step (1), the technological media information data is acquired through two channels, namely manual entry and Internet crawling.

3. The unstructured data processing method based on technological media cloud computing according to claim 1, characterized in that: in step (3), the calling of the unstructured data and the subsequent offline processing are performed by a large-scale distributed computing platform.

4. The unstructured data processing method based on technological media cloud computing according to claim 1, characterized in that: in step (4), the retrieval result sequence is also stored in a cache.

5. The unstructured data processing method based on technological media cloud computing according to claim 4, characterized in that: in step (4), the retrieval result sequence in the cache is either updated directly into cloud storage or stored in the cloud after offline processing.