CN104778200A

CN104778200A - Heterogeneous processing big data retrieval method combining historical data

Info

Publication number: CN104778200A
Application number: CN201510016057.5A
Authority: CN
Inventors: 薛凯军; 周凡; 韩冠亚; 姜涛
Original assignee: Institute of Dongguan of Sun Yat Sen University; National Sun Yat Sen University
Current assignee: Sun Yat Sen University; Institute of Dongguan of Sun Yat Sen University; National Sun Yat Sen University
Priority date: 2015-01-13
Filing date: 2015-01-13
Publication date: 2015-07-15

Abstract

An embodiment of the invention discloses a heterogeneous processing big data retrieval method combining historical data. The method comprises the following steps: receiving a keyword phrase input by a user; searching a history table in a Web server based on the keyword phrase; determining whether the keyword phrase of the current search exists in the history table and, if it does, taking the result directly from the Web server; if it does not, performing a distributed search on the database servers using a meta search engine; and feeding the final search result back to the user. Implementing the embodiment effectively addresses the heterogeneity of big data sources, achieves high precision and recall, and avoids repeated searches.

Description

A heterogeneous processing big data retrieval method combining historical data

Technical field

The present invention relates to the field of big data technology, and in particular to a heterogeneous processing big data retrieval method combining historical data.

Background art

As a new data management model in the field of cloud computing, big data faces a key bottleneck in data management: as storage scale expands, how can the loading efficiency and the recall of data be improved? Because big data arose alongside the massive computing power of cloud computing, it has the following three characteristics:

Large scale: the data volume is enormous, beyond what people imagine. For an ordinary social network such as Facebook, for example, the data volume of a single day exceeds 500 TB of new data.

Heterogeneous data: the data types within big data differ greatly.

Low value density: in big data, the data that is valuable to us is only a very small fraction of the whole. The most typical example is video surveillance.

Finding the information people need effectively, quickly, and accurately, and so turning it into a valuable resource, is an important demand of the information age. Faced with massive information resources, information retrieval technology plays an increasingly important role. However, existing database management systems (DBMS) differ from one another, and data storage systems are deployed on different platforms, so data resources are heterogeneous both physically and logically. The many heterogeneous resource systems are mutually incompatible, and resource objects lack association with content. This diversity and heterogeneity make information resources inconvenient to access and difficult to share and communicate.

As early as 1998, Paepeke of Stanford University raised the interoperability of heterogeneous data. Paepeke held that heterogeneous database retrieval is the mainstream direction of future information retrieval. Its goal is to share heterogeneous data resources, and thereby to establish interoperable connections between information with different semantic structures and architectures.

Therefore, integrating the heterogeneous data resources, converting data between them, eliminating heterogeneity, and retrieving specified data from them is an important problem that urgently needs solving. In view of how heterogeneous data is currently stored in the information field, and of the urgent demands of user groups and users for information retrieval, information sharing, and information communication, this document considers how to resolve data source heterogeneity and, by combining word segmentation with the reuse of existing retrieval history, greatly improves the recall and precision of big data retrieval. The technique is named here the heterogeneous processing big data retrieval method combining historical data.

Middleware technology was proposed by Wiederhold as early as 1992; Fig. 1 shows a schematic of its structure. Many experts in data integration have since studied data integration middleware in depth. Typical data integration middleware builds a global data schema using the XML data model and interacts with each data source through wrappers. On the basis of the global data model, when a user sends a query to the middleware, the middleware converts the request into subqueries that the various data sources can handle. After data is taken from each source, the results are merged, and the final result of the user's global query is generated and returned. This model can, to some extent, solve the coexistence of structured, semi-structured, and unstructured data.

Although such middleware can process semi-structured and unstructured data, it handles unstructured data very inefficiently. Its data processing efficiency is low, and contaminated data may enter during integration. Moreover, existing data integration middleware generally focuses on processing and optimizing global queries, so data purity and the precision of retrieval results are low. This model also places heavy demands on middleware hardware and incurs high process cost.

Summary of the invention

The heterogeneous processing big data retrieval method combining historical data proposed here effectively addresses the heterogeneity of data in big data and greatly improves retrieval efficiency while preserving the recall and precision of data retrieval. It is a novel big data search method.

To solve the above problems, the present invention proposes a heterogeneous processing big data retrieval method combining historical data, comprising the following steps:

Receiving a keyword phrase input by a user;

Searching the history table in the Web server based on the keyword phrase;

Determining whether the history table contains the keyword phrase of the current search; if it does, taking the result directly from the Web server; if it does not, using a meta search engine to perform a distributed search on the database servers;

Feeding the final search result back to the user.
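The steps above can be sketched as a single dispatch function. This is a minimal illustration, not the patented implementation: the history table is modelled as an in-memory dictionary, and `retrieve` and `meta_search` are hypothetical names introduced only for this example. Storing the result of a miss follows the later description of keeping each search result in the Web server for reuse.

```python
from typing import Callable, Dict, List

# Hypothetical in-memory stand-in for the history table on the Web server.
HistoryTable = Dict[str, List[str]]


def retrieve(phrase: str,
             history: HistoryTable,
             meta_search: Callable[[str], List[str]]) -> List[str]:
    """Return results for `phrase`, reusing the history table when possible."""
    if phrase in history:
        # History hit: take the stored result directly from the Web server.
        return history[phrase]
    # History miss: fall back to the distributed meta search, then record
    # the result so that the next identical search can reuse it.
    results = meta_search(phrase)
    history[phrase] = results
    return results
```

With this shape, the second identical query never reaches the meta search engine, which is the repeated-search saving the method aims for.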

Before receiving the keyword phrase input by the user, the method further comprises:

The Web server, based on historical user retrieval requests, stores the historical data retrieved by users in local space.

Searching the history table in the Web server based on the keyword phrase comprises:

Using a fast word segmentation method based on an improved whole-word binary dictionary, splitting the request statement submitted by the user into individual words, and then searching for them in the history table in the Web server.

The history table in the Web server stores two key attributes: <keyword, time point>.

Taking the result directly from the Web server, if the keyword phrase exists, comprises:

If the keyword of the current search exists in the retrieval history, the result is taken directly from the Web server; the new data added after the time point recorded in the history table is then searched, and the two sets of data are combined to obtain the complete search result.
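The history-hit path can be sketched as follows. The `HistoryEntry` class, the `refresh` function, and the `search_since` callback are hypothetical names for illustration; the source specifies only the <keyword, time point> attributes. Removing duplicates while merging is an assumption of this sketch, since the text says only that the two sets of data are combined.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class HistoryEntry:
    keyword: str
    time_point: float                      # when this keyword was last searched
    results: List[str] = field(default_factory=list)


def refresh(entry: HistoryEntry,
            search_since: Callable[[str, float], List[str]],
            now: float) -> List[str]:
    """Merge stored results with data newer than the entry's time point."""
    new_items = search_since(entry.keyword, entry.time_point)
    seen = set(entry.results)
    for item in new_items:
        # Assumption: skip items already stored, preserving arrival order.
        if item not in seen:
            entry.results.append(item)
            seen.add(item)
    entry.time_point = now                 # the next refresh starts from here
    return entry.results
```

Only data newer than the stored time point is fetched from the data sources, so the cost of a repeated search scales with what has changed rather than with the full data set.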

Implementing the embodiments of the present invention effectively addresses the heterogeneity of big data sources, achieves high precision and recall, and avoids repeated searches.

Brief description of the drawings

To illustrate the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are merely some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a structural schematic diagram of a prior-art search system based on middleware technology;

Fig. 2 is a system architecture schematic diagram of heterogeneous processing big data retrieval in an embodiment of the present invention;

Fig. 3 is a flow chart of the heterogeneous processing big data retrieval method combining historical data in an embodiment of the present invention;

Fig. 4 is a schematic flow chart of partial matching in an embodiment of the present invention.

Detailed description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the scope of protection of the present invention.

This document focuses on two problems: how to resolve the heterogeneity of big data sources, and how to improve retrieval efficiency while preserving the recall and precision of data retrieval.

The transmission model of big data is shown in Fig. 2. The user sends a request to the Web server. According to the request, the Web server submits query statements to some or all of the database servers. The database servers find the required data among data held in various locations, and the data is then returned along the reverse path to the user. The heterogeneous processing big data retrieval method combining historical data used here mainly concerns two parts: the database servers and the Web server.

The method is implemented as follows. A sufficiently large space is set aside in the Web server. Whenever a user's keyword search produces a result, the result is stored in this space so that it can be reused in the next retrieval. When searching, the user necessarily enters keywords or key sentences. Using a fast word segmentation method based on an improved whole-word binary dictionary, the user's request statement is split into individual words, which are then searched for in the history table in the Web server. The history table in the Web server must store two key attributes: <keyword, time point>. Depending on which of three cases the lookup falls into, it can be determined whether part of the result can be taken directly from the Web server. If the keyword of the current search does not exist in the retrieval history, the meta search engine is used to perform a distributed search. If it does exist, the result is taken directly from the Web server, the new data added after the time point recorded in the history table is searched, and the two sets of data are combined to obtain the complete search result. The search flow of the method is shown in Fig. 3.

The heterogeneous processing big data retrieval method combining historical data can be divided into two processing parts:

a word segmentation part and a heterogeneity part.

TB (terabyte): a unit of computer storage capacity; 1 TB = 1024 GB.

Heterogeneous data retrieval: also called cross search. A unified retrieval interface is used to search multiple distributed heterogeneous data sources on local and wide area networks concurrently. The retrieval results are integrated through operations such as deduplication and sorting, and are presented to the user in a unified format.

Meta search engine: also called a multi-search engine. Through a unified user interface, it helps the user select and use a suitable search engine among several (or even several at once) to carry out a search operation. It is a global control mechanism over multiple retrieval tools distributed across the network.
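Heterogeneous data retrieval as defined above (concurrent search, then deduplication and sorting into one result list) can be sketched as follows. Each data source is modelled as a plain callable, and the `(document id, score)` hit shape and the `cross_search` name are assumptions made for this example; the source does not specify a result format or a ranking rule.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

Hit = Tuple[str, float]  # (document id, relevance score); an assumed shape


def cross_search(query: str,
                 sources: List[Callable[[str], List[Hit]]]) -> List[Hit]:
    """Query every source concurrently, then deduplicate and sort the hits."""
    if not sources:
        return []
    with ThreadPoolExecutor(max_workers=len(sources)) as pool:
        per_source = list(pool.map(lambda s: s(query), sources))
    best: Dict[str, float] = {}
    for hits in per_source:
        for doc_id, score in hits:
            # Deduplicate by document id, keeping the highest score seen.
            if doc_id not in best or score > best[doc_id]:
                best[doc_id] = score
    # Highest score first; ties broken by id so the order is deterministic.
    return sorted(best.items(), key=lambda kv: (-kv[1], kv[0]))
```

Threads suit this sketch because each source call is assumed to be I/O bound; a real meta search engine would also need timeouts and per-source error handling.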

The word segmentation problem is discussed first. The user may enter an arbitrary statement when searching, and using the statement directly as the retrieval expression is not feasible, so a keyword string must be extracted from the user's input. Because Chinese is built from individual characters that combine into words, segmentation easily gives rise to ambiguity.

The segmentation algorithm adopted here is the fast word segmentation method with an improved whole-word binary dictionary. Its main steps are as follows:

(1) Let K be the statement input by the user. Split it at the punctuation marks into sub-statements {K_i};

(2) Let S be the string to be segmented. Segment S with both forward maximum matching and reverse maximum matching to obtain two segmentation results. Forward maximum matching scans the string from left to right and matches the whole string against the dictionary. If the match fails, the rightmost character is removed and matching is tried again; this repeats until segmentation is complete. Reverse maximum matching differs only in that, each time the match fails, the leftmost character is removed; otherwise it is the same as the forward algorithm. The word strings obtained from S by forward and reverse maximum matching are S_1 and S_2 respectively, and each word carries the word frequency looked up in the dictionary.

(3) Compare S_1 and S_2. If they are identical, assign S_1 to T and go to step (7);

(4) Compute the unigram probabilities of S_1 and S_2 respectively. The unigram probability P(S_i) is defined as follows:

P(S_i) = P(W_1) \cdot P(W_2) \cdots P(W_n)

where W_1, ..., W_n are the words in S_i, and P(W_j) is the statistical probability obtained by dividing the word frequency of W_j by the total number of dictionary entries.

(5) Compute the penalties M_1 and M_2 of S_1 and S_2 respectively. The penalty of a word string is one point for each word it contains; if it contains a single character that does not form a word, a further fifteen points are added. For example, a string segmented into three words, such as "he / … / have a meal", has a penalty of 3.

(6) Compute the final evaluation value of the word strings S_1 and S_2, and assign the word string with the higher final evaluation value to T.

The final evaluation value is defined as follows:

P(E_i) = P(S_i) \cdot \frac{1}{M_i}

(7) Remove the meaningless auxiliary words from T to obtain the keyword string K_i. Segmentation of the i-th clause is now complete; continue with the next clause S_{i+1} and go to step (2).
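Steps (1) to (7) can be sketched as below. This is an illustration under stated assumptions rather than the improved whole-word binary dictionary itself: the dictionary is a plain mapping from word to frequency, words missing from it are given a frequency of 1, ties between the two segmentations go to the forward result, and the punctuation set and stop-word set are hypothetical. The probability follows the definition in step (4) literally (frequency divided by the number of dictionary entries), so its values are not normalized to 1.

```python
import re
from math import prod
from typing import Dict, List, Set


def forward_mm(text: str, freq: Dict[str, int]) -> List[str]:
    """Forward maximum matching: on failure, drop the rightmost character."""
    words, i = [], 0
    while i < len(text):
        j = len(text)
        while j > i + 1 and text[i:j] not in freq:
            j -= 1
        words.append(text[i:j])          # falls back to a single character
        i = j
    return words


def reverse_mm(text: str, freq: Dict[str, int]) -> List[str]:
    """Reverse maximum matching: on failure, drop the leftmost character."""
    words, j = [], len(text)
    while j > 0:
        i = 0
        while i < j - 1 and text[i:j] not in freq:
            i += 1
        words.append(text[i:j])
        j = i
    return words[::-1]


def unigram_prob(words: List[str], freq: Dict[str, int]) -> float:
    # Step (4): word frequency divided by the number of dictionary entries.
    # Assumption: a word missing from the dictionary counts as frequency 1.
    return prod(freq.get(w, 1) / len(freq) for w in words)


def penalty(words: List[str], freq: Dict[str, int], extra: int = 15) -> int:
    # Step (5): one point per word, plus `extra` for each single character
    # that is not a dictionary word (fifteen, as the text is translated).
    return len(words) + sum(extra for w in words
                            if len(w) == 1 and w not in freq)


def segment(text: str, freq: Dict[str, int]) -> List[str]:
    """Steps (2) to (6): pick the better of the two segmentations."""
    s1, s2 = forward_mm(text, freq), reverse_mm(text, freq)
    if s1 == s2:
        return s1

    def score(ws: List[str]) -> float:
        return unigram_prob(ws, freq) / penalty(ws, freq)

    return s1 if score(s1) >= score(s2) else s2


def keyword_string(sentence: str, freq: Dict[str, int],
                   stop_words: Set[str]) -> List[str]:
    """Steps (1) and (7): split on punctuation, segment, drop stop words."""
    clauses = [c for c in re.split(r"[，。！？；,.!?;\s]+", sentence) if c]
    return [w for c in clauses for w in segment(c, freq)
            if w not in stop_words]
```

On the classic ambiguous string "研究生命起源", with a toy dictionary in which "研究" and "生命" are more frequent than "研究生" and "命", the forward pass yields 研究生 / 命 / 起源 and the reverse pass yields 研究 / 生命 / 起源; both have a penalty of 3, so the higher unigram probability selects the reverse result.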

Using the fast word segmentation method, the search sentence sent by the user is converted into the corresponding keyword string, which is then compared against the keyword table on the Web server.

Next, the heterogeneity of the data sources is discussed. Although the data sources are heterogeneous and distributed, users should not perceive this heterogeneity during retrieval; searching should feel as easy as using a local database. With this in mind, a parallel meta search engine is adopted here, using the complete retrieval approach: the heterogeneous data sources are merged into one common view, which ensures the integrity and consistency of the stored data.
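Merging heterogeneous sources into one common view can be sketched with one small adapter per source. Everything named here is hypothetical and chosen only for illustration: the two source shapes (a SQL-style row tuple and a JSON-style document), the field names of the common view, and the functions `from_sql_row`, `from_json_doc`, and `common_view`.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple

Record = Dict[str, Any]  # one row of the common view


def from_sql_row(row: Tuple[Any, str, Any]) -> Record:
    """Adapter for a relational source returning (id, title, timestamp)."""
    doc_id, title, ts = row
    return {"id": str(doc_id), "title": title, "timestamp": float(ts)}


def from_json_doc(doc: Dict[str, Any]) -> Record:
    """Adapter for a document source with its own field names."""
    return {"id": str(doc["_id"]), "title": doc["name"],
            "timestamp": float(doc["created"])}


def common_view(
    sources: Iterable[Tuple[Callable[[Any], Record], Iterable[Any]]]
) -> List[Record]:
    """Merge (adapter, records) pairs into one list with a single schema."""
    view: List[Record] = []
    for adapter, records in sources:
        view.extend(adapter(r) for r in records)
    return view
```

Because every record leaves its adapter in the same shape, the code that searches, sorts, or caches results never needs to know which kind of source a record came from.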

In the parallel meta search engine, historical data is used to avoid repeated searches. A table space is set up to store the history of users' past searches. When a user searches, the history table in the Web server is looked up first. Depending on the lookup result, there are three cases:

(1) Complete match

A complete match means that the keywords of the user's search match a historical query record exactly, so the result held in the result space of the Web server can be used directly, and the search goes straight to that result space. Then, according to the time point recorded for the keyword in the history table, the parallel meta search engine searches each data source for new data later than that time point. Any matches found are merged with the result space to give the final data of this search, denoted A. The result of this search is also updated directly in the Web server.

(2) Partial match

Fig. 4 shows the partial match flow. A partial match means that the keywords of the user's search match a historical query record only in part; the result previously obtained for the shared part of the query can nevertheless be reused by the current search. The process is as follows. First, the query result B_1 for the shared part is taken from the result table. Second, an incomplete query, that is, a search on the shared keywords only, is run on the new data added after the time point in the history table, giving result B_2. B_1 and B_2 are merged to give result B, which is the result for the part the two queries share. The differing part of the query is then applied within B. The data obtained is the final data of this search, and the new keywords and the corresponding search results are added to the Web server.

(3) No match

No match means that the keywords of the user's search do not match any historical query record at all. The data sources must therefore be searched from start to finish using the meta search engine, and what that search returns is the final result of this search.
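The three cases can be told apart with a small classifier over keyword sets. The representation is an assumption of this sketch: each history record is keyed by the frozen set of its keywords and maps to its time point, and a partial match is taken to be the stored record sharing the most keywords with the query. The names `Match` and `classify` are hypothetical.

```python
from enum import Enum
from typing import Dict, FrozenSet, Iterable, Optional, Tuple


class Match(Enum):
    COMPLETE = "complete"
    PARTIAL = "partial"
    NONE = "none"


def classify(query: Iterable[str],
             history: Dict[FrozenSet[str], float]
             ) -> Tuple[Match, Optional[FrozenSet[str]]]:
    """Classify a keyword string against the history table's keyword sets."""
    q = frozenset(query)
    if q in history:
        return Match.COMPLETE, q
    # Partial match: the stored record sharing the most keywords with the
    # query (an assumed rule; the first such record wins on ties).
    best = max(history, key=lambda k: len(k & q), default=None)
    if best is not None and best & q:
        return Match.PARTIAL, best
    return Match.NONE, None
```

The returned record key tells the caller which stored result and time point to reuse: directly for a complete match, as the shared base B_1 for a partial match, and not at all when nothing matches.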

A person of ordinary skill in the art will appreciate that all or some of the steps in the methods of the above embodiments can be carried out by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium, which may include read-only memory (ROM), random access memory (RAM), a magnetic disk, an optical disc, and the like.

The heterogeneous processing big data retrieval method combining historical data provided by the embodiments of the present invention has been described in detail above. Specific examples are used here to set out the principle and embodiments of the present invention, and the description of the above embodiments is intended only to help in understanding the method of the present invention and its core idea. A person of ordinary skill in the art may, following the idea of the present invention, make changes to the specific embodiments and the scope of application. In summary, this description should not be construed as limiting the present invention.

Claims

1. A heterogeneous processing big data retrieval method combining historical data, characterized by comprising the following steps:

Receiving a keyword phrase input by a user;

Searching the history table in the Web server based on the keyword phrase;

Feeding the final search result back to the user.

2. The heterogeneous processing big data retrieval method combining historical data as claimed in claim 1, characterized in that, before receiving the keyword phrase input by the user, the method further comprises:

3. The heterogeneous processing big data retrieval method combining historical data as claimed in claim 2, characterized in that searching the history table in the Web server based on the keyword phrase comprises:

4. The heterogeneous processing big data retrieval method combining historical data as claimed in claim 3, characterized in that the history table in the Web server stores two key attributes: <keyword, time point>.

5. The heterogeneous processing big data retrieval method combining historical data as claimed in claim 4, characterized in that taking the result directly from the Web server, if the keyword phrase exists, comprises: