CN104778200A - Heterogeneous processing big data retrieval method combining historical data - Google Patents

Heterogeneous processing big data retrieval method combining historical data

Info

Publication number
CN104778200A
Authority
CN
China
Prior art keywords
data
search
web server
retrieval
user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510016057.5A
Other languages
Chinese (zh)
Inventor
薛凯军
周凡
韩冠亚
姜涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Institute of Dongguan of Sun Yat Sen University
National Sun Yat Sen University
Original Assignee
Institute of Dongguan of Sun Yat Sen University
National Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Dongguan of Sun Yat Sen University, National Sun Yat Sen University
Priority to CN201510016057.5A
Publication of CN104778200A
Legal status: Pending

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the invention discloses a heterogeneous processing big data retrieval method combining historical data. The method comprises the following steps: receiving a key word sentence input by a user; retrieving in a historical record sheet in a Web server based on the key word sentence; judging whether the key word sentence of current search exists in the historical record sheet and directly taking out a result from the Web server if the key word sentence exists; performing distributed search in a database server by using a meta search engine if the key word sentence of the current search does not exist; feeding back a final search result to the user. By implementing the embodiment of the invention, the problem of heterogeneity in a big data source is effectively solved; the precision ratio and the recall ratio are extremely high; repeated search is avoided.

Description

A heterogeneous-processing big data retrieval method combining historical data
Technical field
The present invention relates to the field of big data technology, and in particular to a heterogeneous-processing big data retrieval method combining historical data.
Background
Big data is a new data management model in the field of cloud computing. Its key challenge and bottleneck is how to improve data loading efficiency and recall as storage scale grows. Because big data arose alongside the high-speed computing capability of cloud computing, it has the following three characteristics:
Large scale: the data volume is enormous, beyond what people imagine. For example, an ordinary social network such as Facebook generates more than 500 TB of new data per day.
Heterogeneous data: the data types in big data differ greatly.
Low value density: only a very small fraction of big data is valuable to us. The most typical example is video surveillance.
Finding the information people need effectively, quickly, and accurately, so that it becomes a valuable resource, is an important need of the information age; faced with massive information resources, information retrieval technology plays an increasingly important role. However, existing database management systems (DBMS) differ from one another, and data storage systems are deployed on different platforms, so data resources are heterogeneous both physically and logically. Numerous heterogeneous resource systems are mutually incompatible, and resource objects lack association with content. The diversity and heterogeneity of information resources make information access inconvenient and make information difficult to share.
As early as 1998, Paepeke of Stanford University raised the interoperability problem of heterogeneous data. Paepeke held that heterogeneous database retrieval is the mainstream direction of future information retrieval; its goal is to share heterogeneous data resources and thereby establish interoperable connections between information with different semantic structures and architectures.
Therefore, integrating the various heterogeneous data resources, converting data between them, eliminating heterogeneity, and retrieving specified data from them is an important problem that urgently needs solving. In view of how heterogeneous data is currently stored in the information field, and of the needs of user groups and users for information retrieval, information sharing, and information exchange, this document considers how to solve the data source heterogeneity problem and, by combining word segmentation with the retrieval of existing historical data, greatly improves the recall and precision of big data retrieval. This technique is here named the heterogeneous-processing big data retrieval method combining historical data.
Middleware technology was proposed by Wiederhold as early as 1992; Fig. 1 shows a schematic of its structure. Many data integration experts have since studied data integration middleware in depth. Typical data integration middleware uses an XML data model to construct a global data schema and interacts with each data source through wrappers. On the basis of the global data model, when a user sends a query request to the middleware, the middleware converts the request into subqueries that each data source can process. After data is fetched from each source, the data from the respective sources is merged, and the final result of the user's global query is generated and returned. This approach solves, to some extent, the coexistence of structured, semi-structured, and unstructured data.
Although such middleware can process semi-structured and unstructured data, it processes unstructured data very inefficiently. Because the data-processing efficiency of this kind of middleware is low, polluted data may enter during integration. Moreover, existing data integration middleware generally focuses on processing and optimizing global queries, while data purity and the precision of retrieval results remain relatively low. This approach also places high demands on middleware hardware and incurs high processing costs.
Summary of the invention
The heterogeneous-processing big data retrieval method combining historical data proposed here effectively solves the data heterogeneity problem in big data while maintaining the recall and precision of data retrieval, and greatly improves retrieval efficiency. It is a novel big data retrieval method.
To solve the above problems, the present invention proposes a heterogeneous-processing big data retrieval method combining historical data, comprising the following steps:
receiving a keyword phrase input by a user;
searching a history table in a Web server based on the keyword phrase;
determining whether the keyword phrase of the current search exists in the history table; if it does, taking the result directly from the Web server; if it does not, using a meta search engine to perform a distributed search on the database servers;
returning the final search result to the user.
Before the keyword phrase input by the user is received, the method further comprises:
the Web server, based on historical user retrieval requests, storing the historical data retrieved by the relevant users in local space.
Searching the history table in the Web server based on the keyword phrase comprises:
using a fast word segmentation method based on an improved whole-word binary dictionary to split the user's request sentence into independent words, and then searching the history table in the Web server.
The history table in the Web server stores two key attributes: <keyword, time point>.
If the keyword phrase exists, taking the result directly from the Web server comprises:
if the keyword of the current search exists in the retrieval history, taking the result directly from the Web server, then searching for new data added after the time point recorded in the history table, and combining the two sets of data to obtain the complete search result.
Implementing the embodiments of the present invention effectively solves the heterogeneity problem in big data sources, achieves high precision and recall, and avoids repeated searches.
Brief description of the drawings
To explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are only some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic of a prior-art search system based on middleware technology;
Fig. 2 is a schematic of the system architecture for heterogeneous-processing big data retrieval in an embodiment of the present invention;
Fig. 3 is a flowchart of the heterogeneous-processing big data retrieval method combining historical data in an embodiment of the present invention;
Fig. 4 is a flowchart of partial matching in an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the scope of protection of the present invention.
The heterogeneous-processing big data retrieval method combining historical data proposed here effectively solves the data heterogeneity problem in big data while maintaining the recall and precision of data retrieval, and greatly improves retrieval efficiency. It is a novel big data retrieval method.
Accordingly, this document focuses on two problems: how to resolve the heterogeneity of big data sources, and how to improve retrieval efficiency while maintaining the recall and precision of data retrieval.
The big data transmission model is shown in Fig. 2. The user sends a request to the Web server; the Web server, according to the request, submits query statements to some or all of the database servers; the database servers then find the desired data among data from various locations, and the results travel back along the same path to the user. The method used here mainly targets two components: the database servers and the Web server.
The method is implemented as follows. A sufficiently large space is set aside in the Web server. When a user's keyword search produces a result, the result is stored in this space so that it can be reused in the next search. When searching, the user inputs keywords or a key sentence. Using the fast word segmentation method based on an improved whole-word binary dictionary, the user's request sentence is split into independent words, which are then looked up in the history table in the Web server. The history table must store two key attributes: <keyword, time point>. According to the three possible outcomes of this lookup, it can be determined whether part of the result can be taken directly from the Web server. If the keyword of the current search does not exist in the retrieval history, the meta search engine is used to perform a distributed search. If it does exist, the result is taken directly from the Web server, new data added after the time point recorded in the history table is then searched, and the two sets of data are combined to obtain the complete search result. The search flow of the method is shown in Fig. 3.
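The flow just described (look in the history table first, fall back to a distributed search otherwise) can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: the in-memory `HistoryTable`, the `meta_search(keyword, since)` callback, and the `now` parameter are all hypothetical names introduced here, and the history table is assumed to hold one entry per keyword.

```python
import time
from typing import Callable, Dict, List, Tuple

# Hypothetical in-memory stand-in for the Web server's history table:
# keyword -> (time point of the last search, cached results).
HistoryTable = Dict[str, Tuple[float, List[str]]]


def search_with_history(
    keyword: str,
    history: HistoryTable,
    meta_search: Callable[[str, float], List[str]],
    now: Callable[[], float] = time.time,
) -> List[str]:
    """Return results for `keyword`, reusing cached history where possible.

    `meta_search(keyword, since)` is an assumed callback that returns the
    items added to the data sources after the time point `since`.
    """
    if keyword in history:
        last_time, cached = history[keyword]
        # Only data newer than the stored time point needs to be searched.
        fresh = meta_search(keyword, last_time)
        results = cached + [r for r in fresh if r not in cached]
    else:
        # No history: run a full distributed search.
        results = meta_search(keyword, 0.0)
    # Record <keyword, time point> together with the merged result.
    history[keyword] = (now(), results)
    return results
```

Storing the merged result back under the same keyword means the next search for that keyword again only has to cover data newer than the updated time point.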
The heterogeneous-processing big data retrieval method combining historical data can be divided into two processing parts:
the word segmentation part and the heterogeneity part.
TB (terabyte): a unit of computer storage capacity; 1 TB = 1024 GB.
Heterogeneous data retrieval: also called cross-database search. A unified retrieval interface performs concurrent retrieval over multiple distributed heterogeneous data sources on local and wide area networks; the retrieval results are integrated through operations such as deduplication and sorting and presented to the user in a unified format.
Meta search engine: also called a multi-search engine. Through a unified user interface, it helps the user select and use suitable search engines (possibly several at once) among multiple search engines to carry out retrieval; it is a global control mechanism over multiple retrieval tools distributed across the network.
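The two definitions above (concurrent retrieval over several sources, followed by deduplication, sorting, and a unified output format) can be illustrated with a small sketch. Everything named here is an assumption for illustration: each source is modelled as a plain function that already returns records in a shared format, and the record fields `id`, `title`, and `score` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

# A source adapter takes a query and returns records in a unified format.
# The record fields ("id", "title", "score") are illustrative assumptions.
Record = Dict[str, object]
Source = Callable[[str], List[Record]]


def meta_search(query: str, sources: List[Source]) -> List[Record]:
    """Query all sources concurrently, then deduplicate and sort the results."""
    with ThreadPoolExecutor(max_workers=max(1, len(sources))) as pool:
        batches = list(pool.map(lambda src: src(query), sources))

    best: Dict[object, Record] = {}
    for batch in batches:
        for rec in batch:
            key = rec["id"]
            # Deduplicate by id, keeping the highest-scoring copy.
            if key not in best or rec["score"] > best[key]["score"]:
                best[key] = rec
    # Sort by descending score; ties are broken by id for a stable order.
    return sorted(best.values(), key=lambda r: (-r["score"], str(r["id"])))
```

In a real deployment each adapter would wrap one heterogeneous data source and translate its native results into the shared record format before returning them.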
Word segmentation is discussed first. A user may input an arbitrary sentence when searching, and the sentence cannot be used directly as the retrieval expression, so a keyword string must be extracted from it. Because Chinese is built from individual characters while meaning depends on words, segmentation is prone to ambiguity.
The segmentation algorithm adopted here is the fast word segmentation method based on an improved whole-word binary dictionary. Its main steps are as follows:
(1) Let the sentence input by the user be $K$; split it at its punctuation marks into sub-sentences $\{K_i\}$.
(2) Take the current sub-sentence as the string to be segmented, $S$, and segment it with both the forward maximum matching method and the reverse maximum matching method to obtain two results. Forward maximum matching scans the string from left to right and matches it against the dictionary: the whole string is matched first; if matching fails, the rightmost character is removed and matching is retried, repeating until segmentation is finished. Reverse maximum matching differs only in that, each time matching fails, the leftmost character is removed; the rest is the same as the forward algorithm. The strings obtained by segmenting $S$ with forward and reverse maximum matching are $S_1$ and $S_2$, and each word carries its looked-up word frequency.
(3) Compare $S_1$ and $S_2$. If they are identical, assign $S_1$ to $T$ and go to step (7).
(4) Compute the unigram probabilities of $S_1$ and $S_2$. The unigram probability $P(S_i)$ is defined as follows:
$$P(S_i) = P(W_1) \cdot P(W_2) \cdots P(W_n)$$
where $W_1, \dots, W_n$ are the words in $S_i$, and $P(W_j)$ is the statistical probability obtained by dividing the word frequency of $W_j$ by the total number of dictionary entries.
(5) Compute the penalties $M_1$ and $M_2$ of $S_1$ and $S_2$. The penalty of a segmented string is one point for each word it contains; for each single character that does not form a word, fifteen more points are added. For example, a sentence segmented into three words (rendered in translation as "he / / to have a meal") has a penalty of 3.
(6) Compute the final evaluation values of $S_1$ and $S_2$, and assign the string with the higher final evaluation value to $T$.
The final evaluation value is defined as follows:
$$P(E_i) = P(S_i) \cdot \frac{1}{M_i}$$
(7) Remove the meaningless auxiliary words from $T$ to obtain the keyword string $K_i$. The $i$-th sub-sentence is now segmented; continue with sub-sentence $i+1$ and go to step (2).
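Steps (1) to (7) can be sketched as below. This is an illustrative reading rather than the patent's implementation. The assumptions are: the dictionary is a plain `dict` from word to frequency (no binary-search dictionary structure is modelled); words missing from the dictionary are given frequency 1 so that the product does not collapse to zero; the normaliser `total` is passed in by the caller; "a single character that does not form a word" is read as a one-character token absent from the dictionary; ties in the final evaluation value go to the forward result; and `stop_words` stands in for the auxiliary words removed in step (7).

```python
import re
from typing import Dict, FrozenSet, List


def forward_max_match(s: str, freq: Dict[str, int]) -> List[str]:
    """Scan left to right; on failure drop the rightmost character and retry."""
    words, i = [], 0
    while i < len(s):
        j = len(s)
        while j > i + 1 and s[i:j] not in freq:
            j -= 1
        words.append(s[i:j])  # falls back to a single character
        i = j
    return words


def reverse_max_match(s: str, freq: Dict[str, int]) -> List[str]:
    """Scan right to left; on failure drop the leftmost character and retry."""
    words, j = [], len(s)
    while j > 0:
        i = 0
        while i < j - 1 and s[i:j] not in freq:
            i += 1
        words.insert(0, s[i:j])  # falls back to a single character
        j = i
    return words


def unigram_probability(words: List[str], freq: Dict[str, int], total: int) -> float:
    """P(S_i) = P(W_1) * ... * P(W_n); unseen words get frequency 1 (assumption)."""
    p = 1.0
    for w in words:
        p *= freq.get(w, 1) / total
    return p


def penalty(words: List[str], freq: Dict[str, int], single_char_penalty: int) -> int:
    """One point per word, plus extra points per single character not in the dictionary."""
    singles = sum(1 for w in words if len(w) == 1 and w not in freq)
    return len(words) + single_char_penalty * singles


def segment(
    sentence: str,
    freq: Dict[str, int],
    total: int,
    stop_words: FrozenSet[str] = frozenset(),
    single_char_penalty: int = 15,
) -> List[str]:
    """Split at punctuation, segment each clause both ways, keep the better result."""
    keywords: List[str] = []
    for clause in re.split(r"[^\w]+", sentence):  # step (1)
        if not clause:
            continue
        s1 = forward_max_match(clause, freq)      # step (2)
        s2 = reverse_max_match(clause, freq)
        if s1 == s2:                              # step (3)
            chosen = s1
        else:                                     # steps (4) to (6)
            e1 = unigram_probability(s1, freq, total) / penalty(s1, freq, single_char_penalty)
            e2 = unigram_probability(s2, freq, total) / penalty(s2, freq, single_char_penalty)
            chosen = s1 if e1 >= e2 else s2
        keywords.extend(w for w in chosen if w not in stop_words)  # step (7)
    return keywords
```

Because the penalty divides the unigram probability, a segmentation that leaves stray single characters is pushed well below one made up entirely of dictionary words.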
Using the fast word segmentation method, the search sentence sent by the user is converted into the corresponding keyword string and then compared against the keyword table of the Web server.
Next, the heterogeneity of data sources is discussed. Although the data sources are heterogeneous and distributed, retrieval should hide this heterogeneity from users, so that searching feels as easy as using a local database. For this reason, a parallel meta search engine is adopted here, using a complete-retrieval approach: the heterogeneous data sources are merged into a common view, which ensures the integrity and consistency of data storage.
In the parallel meta search engine, historical data is used to avoid repeated searches. A space table is set up to store the history of users' searches. When a user searches, the history table in the Web server is looked up first. Depending on the lookup result, there are three cases:
(1) Complete match
A complete match means that the keywords of the user's search exactly match a historical query record, so the result stored in the result space of the Web server can be used directly; the search therefore goes straight to the result space of the Web server. Then, according to the time point of the keyword in the history table, the parallel meta search engine searches each data source for new data later than that time point. The matches found are merged with the result from the result space to obtain the final data of this search, denoted A. The stored result for this search is also updated directly in the Web server.
(2) Partial match
Fig. 4 shows the partial match flow. A partial match means that the keywords of the user's search only partly match a historical query record; the results previously obtained for the shared part of the query can still be reused directly by this search. The processing is as follows. First, the query result $B_1$ for the shared part is taken from the result table. Second, an incomplete query, that is, a search using only the shared keywords, is run on the new data added after the time point in the history table, giving result $B_2$. $B_1$ and $B_2$ are combined to obtain result $B$, which is the complete result for the shared part of the query. The differing part of the query is then searched within $B$; the data obtained is the final data of this search, and the keywords and the corresponding search result are added to the Web server.
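The partial match procedure can be sketched as follows under some explicit assumptions that the text does not state: queries are treated as sets of keywords with AND semantics, the reusable historical record is one whose keyword set is a proper subset of the new query, documents are plain strings, and a keyword "matches" a document by substring test. The names `History`, `search_new`, and `partial_match_search` are hypothetical.

```python
from typing import Callable, Dict, FrozenSet, List, Tuple

Doc = str
# Hypothetical history table: keyword set -> (time point, cached results).
History = Dict[FrozenSet[str], Tuple[float, List[Doc]]]


def partial_match_search(
    query: FrozenSet[str],
    history: History,
    search_new: Callable[[FrozenSet[str], float], List[Doc]],
    now: float,
) -> List[Doc]:
    """Reuse the cached result of the shared keywords, then narrow it down.

    `search_new(keywords, since)` is an assumed callback returning documents
    added after `since` that match all of `keywords`.
    """
    # Historical records whose keywords are a proper subset of the query.
    candidates = [k for k in history if k < query]
    if not candidates:
        raise ValueError("no partially matching history record")
    shared = max(candidates, key=len)

    time_point, b1 = history[shared]          # B1: cached result of the shared part
    b2 = search_new(shared, time_point)       # B2: new data after the time point
    b = b1 + [d for d in b2 if d not in b1]   # B = B1 combined with B2

    extra = query - shared                    # the differing part of the query
    final = [d for d in b if all(k in d for k in extra)]

    # Add the new keywords and their result to the history table.
    history[query] = (now, final)
    return final
```

Only the shared keywords are sent to the data sources, and only for data newer than the stored time point; the remaining keywords are applied locally to the combined result.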
(3) No match
No match means that the keywords of the user's search do not match any historical query record at all. The search must therefore be run from scratch over the data sources using the meta search engine, and the result obtained is the final result of this search.
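The three cases above can be told apart with a small classifier. The sketch assumes that each query and each history record is a set of keywords, that a complete match means the sets are equal, and that a partial match means they share at least one keyword; the text does not define "partly match" more precisely, so that reading is an assumption, as are the names `Match` and `classify`.

```python
from enum import Enum
from typing import FrozenSet, Iterable


class Match(Enum):
    COMPLETE = "complete"
    PARTIAL = "partial"
    NONE = "none"


def classify(query: FrozenSet[str], history_keys: Iterable[FrozenSet[str]]) -> Match:
    """Classify a query against the keyword sets stored in the history table."""
    result = Match.NONE
    for key in history_keys:
        if key == query:
            return Match.COMPLETE      # case (1): reuse the stored result directly
        if key & query:
            result = Match.PARTIAL     # case (2): some keywords are shared
    return result                      # case (3) if nothing overlapped
```

A complete match always wins over a partial one, so a query that equals one record and overlaps another is handled by case (1).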
A person of ordinary skill in the art will understand that all or some of the steps in the methods of the above embodiments can be completed by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium, which may include read-only memory (ROM), random access memory (RAM), a magnetic disk, an optical disc, and the like.
In addition, the heterogeneous-processing big data retrieval method combining historical data provided by the embodiments of the present invention has been described in detail above. Specific examples are used herein to explain the principles and implementations of the present invention, and the description of the above embodiments is intended only to help in understanding the method of the present invention and its core idea. A person of ordinary skill in the art may, following the idea of the present invention, make changes to the specific implementations and the scope of application. In summary, the content of this description should not be construed as limiting the present invention.

Claims (5)

1. A heterogeneous-processing big data retrieval method combining historical data, characterized by comprising the following steps:
receiving a keyword phrase input by a user;
searching a history table in a Web server based on the keyword phrase;
determining whether the keyword phrase of the current search exists in the history table; if it does, taking the result directly from the Web server; if it does not, using a meta search engine to perform a distributed search on the database servers;
returning the final search result to the user.
2. The heterogeneous-processing big data retrieval method combining historical data of claim 1, characterized in that, before the keyword phrase input by the user is received, the method further comprises:
the Web server, based on historical user retrieval requests, storing the historical data retrieved by the relevant users in local space.
3. The heterogeneous-processing big data retrieval method combining historical data of claim 2, characterized in that searching the history table in the Web server based on the keyword phrase comprises:
using a fast word segmentation method based on an improved whole-word binary dictionary to split the user's request sentence into independent words, and then searching the history table in the Web server.
4. The heterogeneous-processing big data retrieval method combining historical data of claim 3, characterized in that the history table in the Web server stores two key attributes: <keyword, time point>.
5. The heterogeneous-processing big data retrieval method combining historical data of claim 4, characterized in that, if the keyword phrase exists, taking the result directly from the Web server comprises:
if the keyword of the current search exists in the retrieval history, taking the result directly from the Web server, then searching for new data added after the time point recorded in the history table, and combining the two sets of data to obtain the complete search result.
CN201510016057.5A 2015-01-13 2015-01-13 Heterogeneous processing big data retrieval method combining historical data Pending CN104778200A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510016057.5A CN104778200A (en) 2015-01-13 2015-01-13 Heterogeneous processing big data retrieval method combining historical data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510016057.5A CN104778200A (en) 2015-01-13 2015-01-13 Heterogeneous processing big data retrieval method combining historical data

Publications (1)

Publication Number Publication Date
CN104778200A true CN104778200A (en) 2015-07-15

Family

ID=53619664

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510016057.5A Pending CN104778200A (en) 2015-01-13 2015-01-13 Heterogeneous processing big data retrieval method combining historical data

Country Status (1)

Country Link
CN (1) CN104778200A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2006277642A (en) * 2005-03-30 2006-10-12 Nomura Research Institute Ltd Data transformation system and program
CN101071442A (en) * 2007-06-26 2007-11-14 腾讯科技(深圳)有限公司 Distributed indesx file searching method, searching system and searching server
CN102200979A (en) * 2010-03-26 2011-09-28 上海市浦东科技信息中心 Distributed parallel information retrieval system and distributed parallel information retrieval method
CN102436513A (en) * 2012-01-18 2012-05-02 中国电子科技集团公司第十五研究所 Distributed search method and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
尤川川等: "一种基于大数据的有效搜索方法", 《计算机科学》 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105183884A (en) * 2015-09-24 2015-12-23 西安未来国际信息股份有限公司 Search engine system and method based on big data technique
CN109558485A (en) * 2018-10-25 2019-04-02 安徽创见未来教育科技有限公司 A kind of study big data search management method
CN109947970A (en) * 2019-03-16 2019-06-28 张兴宇 A kind of textile fabric flower pattern searching system
CN109947970B (en) * 2019-03-16 2021-01-15 南通联发信息科技有限公司 Textile fabric pattern retrieval system
CN111339421A (en) * 2020-02-28 2020-06-26 腾讯科技(深圳)有限公司 Information search method, device, equipment and storage medium based on cloud technology
CN111339421B (en) * 2020-02-28 2023-02-28 腾讯科技(深圳)有限公司 Information search method, device, equipment and storage medium based on cloud technology

Similar Documents

Publication Publication Date Title
CN1845104B (en) System and method for intelligent retrieval and processing of information
CN110162644B (en) Image set establishing method, device and storage medium
US10452661B2 (en) Automated database schema annotation
CN104239513A (en) Semantic retrieval method oriented to field data
CN108446316B (en) association word recommendation method and device, electronic equipment and storage medium
US9317556B2 (en) Accelerating database queries containing bitmap-based conditions
CN104391908B (en) Multiple key indexing means based on local sensitivity Hash on a kind of figure
CN104778200A (en) Heterogeneous processing big data retrieval method combining historical data
CN115563313A (en) Knowledge graph-based document book semantic retrieval system
CN107577714A (en) A kind of data query method based on HBase
CN103226608A (en) Parallel file searching method based on folder-level telescopic Bloom Filter bit diagram
CN110110234B (en) Big data real-time searching system and method
CN105404677A (en) Tree structure based retrieval method
CN109783599A (en) Knowledge mapping search method and system based on multi storage
Rautray et al. Comparative study of DE and PSO over document summarization
Malhotra et al. An ingenious pattern matching approach to ameliorate web page rank
CN108804580B (en) Method for querying keywords in federal RDF database
US9547701B2 (en) Method of discovering and exploring feature knowledge
US9886497B2 (en) Indexing presentation slides
CN107657067B (en) Cosine distance-based leading-edge scientific and technological information rapid pushing method and system
CN105426490A (en) Tree structure based indexing method
Dai et al. Search Engine System Based on Ontology of Technological Resources.
US20230142351A1 (en) Methods and systems for searching and retrieving information
TW578067B (en) Knowledge graphic system and method based on ontology
Sunny et al. Potential Roles and Applications of Thesauri in Digital Information Retrieval Systems

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
EXSB Decision made by sipo to initiate substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20150715