CN106055622A - Data searching method and system - Google Patents

Publication number
CN106055622A
Authority
CN
China
Prior art keywords
index
server
cluster
data
request
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610362426.0A
Other languages
Chinese (zh)
Inventor
王之滨
程林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Software Group Co Ltd
Original Assignee
Inspur Software Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Software Group Co Ltd
Priority to CN201610362426.0A
Publication of CN106055622A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G06F16/903 Querying

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a data search method and system. The method comprises the following steps: determining the multiple servers currently used for data search; building a distributed m × n search model from the multiple servers; building indexes in the m servers of each cluster; when a search request is received, distributing the search request, according to the current load of each cluster, to any one server in the target cluster with the minimum current load; and determining a first index corresponding to the search request according to the built indexes, and providing target data according to the first index. In this scheme, a distributed m × n search model is built, indexes are built in each cluster of the model, and an incoming search request is distributed to any one server in the cluster with the minimum current load. This achieves load balancing, reduces the pressure on the servers in each cluster, and improves data search efficiency.

Description

Data search method and system
Technical field
The present invention relates to the technical field of search engines, and in particular to a data search method and system.
Background
With the rapid development of Internet technology, users are increasingly concerned with how to obtain time-sensitive data as quickly as possible, so that valuable information can be extracted from it.
Traditional information search techniques build, in a server, an index corresponding to each data source. When the server receives a search request, it looks up the corresponding index according to the request and determines the location of the data source from that index.
However, when the number of search requests is large, the search capacity of the server cannot sustain them, which reduces search efficiency.
Summary of the invention
Embodiments of the present invention provide a data search method and system to improve search efficiency.
In a first aspect, an embodiment of the present invention provides a data search method, including:
determining the multiple servers currently used for data search;
building a distributed m × n search model from the multiple servers, where n is the number of clusters in the search model and m is the number of servers in each cluster; n ≥ 1; m ≥ 1;
building indexes in the m servers of each cluster;
when a search request is received, distributing the search request, according to the current load of each cluster, to any one server in the target cluster with the minimum current load;
determining, according to the built indexes, a first index corresponding to the search request, and providing target data according to the first index.
Preferably, building the distributed m × n search model from the multiple servers includes:
calculating the value of n by the following formula: n = S/T;
where S is the total index volume and T is the maximum index volume that a single server can carry;
calculating the value of m by the following formula: m = Q/R + X;
where Q is the total search request volume, R is the maximum search request volume that a single server can carry, and X is the machine increment.
Preferably,
the method further includes: when the number of servers m in the current cluster is not less than 2, determining a master server and slave servers among the m servers of the current cluster;
building indexes in the m servers of each cluster includes: building master disk indexes in the master server, and updating the CommitLog log file each time a master disk index is built; setting a synchronization period in the slave server and, according to that period, synchronizing the master disk indexes and the CommitLog log file built in the master server to the slave server; in the slave server, building the master disk indexes with the largest search volume into memory indexes according to the CommitLog log file; and, when the amount of memory indexes exceeds a set threshold, flushing the earliest-built memory indexes to slave disk indexes in order of build time, so that the amount of memory indexes does not exceed the set threshold.
Preferably,
building master disk indexes in the master server includes:
triggering a full dump program according to an index build period preset in the master server, and recording the start time point;
determining the data corresponding to the data source;
each time the current record of the data source has been obtained, judging whether a next record exists; if it does, using the iterator model and the DataProvider interface to obtain the next record of the data source and build a corresponding Map record, and repeating this step until the judgment shows that no next record exists;
encapsulating the Map records that have been built as corresponding Map objects, using the encapsulated Map objects to build an add/batch-update object, and flushing this batch-update object to the master disk index;
according to the recorded start time point and the current time point, flushing the incremental data produced in the period from the start time point to the current time point to the master disk index.
Preferably, the method further includes: when the master server receives a real-time request for a second index, performing on the second index the operation corresponding to the real-time request, and recording the performed operation in the CommitLog log file, so that a slave server belonging to the same cluster as the master server builds the corresponding memory index according to the operations recorded in the synchronized CommitLog log file;
the real-time request includes: a delete request, an add request, an update request, a batch delete request, a batch add request, and a batch update request.
In a second aspect, an embodiment of the present invention provides a data search system, including:
a first determining unit, configured to determine the multiple servers currently used for data search;
a search model building unit, configured to build a distributed m × n search model from the multiple servers, where n is the number of clusters in the search model and m is the number of servers in each cluster; n ≥ 1; m ≥ 1;
an index building unit, configured to build indexes in the m servers of each cluster;
a distribution unit, configured to, when a search request is received, distribute the search request, according to the current load of each cluster, to any one server in the target cluster with the minimum current load;
a search unit, configured to determine, according to the built indexes, a first index corresponding to the search request, and to provide target data according to the first index.
Preferably, the search model building unit is specifically configured to:
calculate the value of n by the following formula: n = S/T;
where S is the total index volume and T is the maximum index volume that a single server can carry;
calculate the value of m by the following formula: m = Q/R + X;
where Q is the total search request volume, R is the maximum search request volume that a single server can carry, and X is the machine increment.
Preferably,
the system further includes: a second determining unit, configured to, when the number of servers m in the current cluster is not less than 2, determine a master server and slave servers among the m servers of the current cluster;
the index building unit is specifically configured to: build master disk indexes in the master server, and update the CommitLog log file each time a master disk index is built; set a synchronization period in the slave server and, according to that period, synchronize the master disk indexes and the CommitLog log file built in the master server to the slave server; in the slave server, build the master disk indexes with the largest search volume into memory indexes according to the CommitLog log file; and, when the amount of memory indexes exceeds a set threshold, flush the earliest-built memory indexes to slave disk indexes in order of build time, so that the amount of memory indexes does not exceed the set threshold.
Preferably,
the index building unit is specifically configured to:
trigger a full dump program according to an index build period preset in the master server, and record the start time point;
determine the data corresponding to the data source;
each time the current record of the data source has been obtained, judge whether a next record exists; if it does, use the iterator model and the DataProvider interface to obtain the next record of the data source and build a corresponding Map record, and repeat this step until the judgment shows that no next record exists;
encapsulate the Map records that have been built as corresponding Map objects, use the encapsulated Map objects to build an add/batch-update object, and flush this batch-update object to the master disk index;
according to the recorded start time point and the current time point, flush the incremental data produced in the period from the start time point to the current time point to the master disk index.
Preferably, the system further includes:
a real-time update unit, configured to, when the master server receives a real-time request for a second index, perform on the second index the operation corresponding to the real-time request, and record the performed operation in the CommitLog log file, so that a slave server belonging to the same cluster as the master server builds the corresponding memory index according to the operations recorded in the synchronized CommitLog log file;
the real-time request includes: a delete request, an add request, an update request, a batch delete request, a batch add request, and a batch update request.
Embodiments of the present invention provide a data search method and system. A distributed m × n search model is built and indexes are built in every cluster of the model; when a search request is received, it is distributed to any one server in the cluster with the minimum current load. This achieves load balancing, reduces the pressure on the servers in each cluster, and thereby improves data search efficiency.
Brief description of the drawings
To describe the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of a method provided by an embodiment of the present invention;
Fig. 2 is a flowchart of another method provided by an embodiment of the present invention;
Fig. 3 is a schematic diagram comparing the performance metrics of the Xsolr and Solr systems, provided by an embodiment of the present invention;
Fig. 4 is a schematic diagram of the data consistency and disaster tolerance of Xsolr, provided by an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of a system provided by an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of another system provided by an embodiment of the present invention;
Fig. 7 is a schematic structural diagram of yet another system provided by an embodiment of the present invention.
Detailed description
To make the purpose, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the protection scope of the present invention.
As shown in Fig. 1, an embodiment of the present invention provides a data search method, which may include the following steps:
Step 101: determining the multiple servers currently used for data search;
Step 102: building a distributed m × n search model from the multiple servers, where n is the number of clusters in the search model and m is the number of servers in each cluster; n ≥ 1; m ≥ 1;
Step 103: building indexes in the m servers of each cluster;
Step 104: when a search request is received, distributing the search request, according to the current load of each cluster, to any one server in the target cluster with the minimum current load;
Step 105: determining, according to the built indexes, a first index corresponding to the search request, and providing target data according to the first index.
In this scheme, a distributed m × n search model is built and indexes are built in every cluster of the model; when a search request is received, it is distributed to any one server in the cluster with the minimum current load. This achieves load balancing, reduces the pressure on the servers in each cluster, and thereby improves data search efficiency.
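The dispatch rule of step 104 can be sketched as follows. This is a minimal illustration, not the patented implementation: the `Cluster` class, the `current_load` field, and the server names are hypothetical, and the load metric itself is not specified in the source.

```python
import random
from dataclasses import dataclass


@dataclass
class Cluster:
    """One of the n clusters; holds its m server names and a load figure."""
    name: str
    servers: list
    current_load: float = 0.0  # hypothetical metric, e.g. requests in flight


def dispatch(clusters, rng=random):
    """Pick the cluster with the minimum current load, then any server in it."""
    target = min(clusters, key=lambda c: c.current_load)
    server = rng.choice(target.servers)
    return target, server


clusters = [
    Cluster("cluster-0", ["s0a", "s0b"], current_load=7.0),
    Cluster("cluster-1", ["s1a", "s1b"], current_load=2.5),
    Cluster("cluster-2", ["s2a", "s2b"], current_load=4.0),
]
target, server = dispatch(clusters)
print(target.name, server)
```

Choosing a random server inside the target cluster is one reading of "any one server"; a round-robin choice would satisfy the same wording.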
In an embodiment of the invention, to ensure that a cluster can keep its search service available after one of its servers goes down, the method may further include: when the number of servers m in the current cluster is not less than 2, determining a master server and slave servers among the m servers of the current cluster;
building indexes in the m servers of each cluster includes: building master disk indexes in the master server, and updating the CommitLog log file each time a master disk index is built; setting a synchronization period in the slave server and, according to that period, synchronizing the master disk indexes and the CommitLog log file built in the master server to the slave server; in the slave server, building the master disk indexes with the largest search volume into memory indexes according to the CommitLog log file; and, when the amount of memory indexes exceeds a set threshold, flushing the earliest-built memory indexes to slave disk indexes in order of build time, so that the amount of memory indexes does not exceed the set threshold.
In an embodiment of the invention, to further improve the efficiency of data search, the method may further include: establishing index classification rules;
after building indexes in the m servers of each cluster, the method further includes: dividing the built master disk indexes into multiple groups according to the index classification rules, each group corresponding to one index type;
determining, according to the built indexes, the first index corresponding to the search request includes: determining the target index type corresponding to the search request, and traversing each master disk index in the group corresponding to the target index type to determine the first index. This narrows the range of indexes that must be searched and thus improves the efficiency of index search.
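The grouping described above can be illustrated with a short sketch. The classification rule (a `type` field on each index entry) and the helper names are assumptions made for the example; the source does not say how the index classification rules are expressed.

```python
from collections import defaultdict


def group_by_type(indexes, classify):
    """Divide index entries into groups, one group per index type."""
    groups = defaultdict(list)
    for idx in indexes:
        groups[classify(idx)].append(idx)
    return groups


def find_first_index(groups, target_type, matches):
    """Traverse only the group of the target index type."""
    for idx in groups.get(target_type, []):
        if matches(idx):
            return idx
    return None


# Hypothetical index entries; "type" stands in for the classification rule.
indexes = [
    {"type": "news", "key": "n1"},
    {"type": "product", "key": "p1"},
    {"type": "news", "key": "n2"},
]
groups = group_by_type(indexes, classify=lambda idx: idx["type"])
hit = find_first_index(groups, "news", matches=lambda idx: idx["key"] == "n2")
print(hit)
```

Only the two "news" entries are visited for this lookup, which is the narrowing effect the paragraph describes.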
As shown in Fig. 2, an embodiment of the present invention provides a data search method, which may include the following steps:
Step 201: determining the multiple servers currently used for data search.
This embodiment is described using 100 servers as an example.
Step 202: dividing the multiple servers into multiple cluster groups according to group routing.
In this embodiment, the servers currently used for data search can be divided into multiple cluster groups according to system requirements; this grouping can be implemented with GroupFilter group routing. For example, the 100 servers are divided into two cluster groups, the first with 40 servers and the second with 60 servers. The first cluster group provides search service for search requests whose ID is Beijing, and the second cluster group provides search service for search requests whose ID is Hebei Province.
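A rough sketch of this grouping, using the 40/60 split and the Beijing/Hebei example from the text. The routing table, the `region` request field, and the function name are hypothetical; the actual GroupFilter interface is not described in the source.

```python
# Split the 100 servers of the example into two cluster groups.
servers = [f"server-{i:03d}" for i in range(100)]
cluster_groups = {
    "group-1": servers[:40],   # serves requests identified as Beijing
    "group-2": servers[40:],   # serves requests identified as Hebei
}

# Hypothetical routing table standing in for GroupFilter group routing.
GROUP_ROUTES = {"Beijing": "group-1", "Hebei": "group-2"}


def route_to_group(request, routes=GROUP_ROUTES):
    """Return the cluster group name for a request, or None if unrouted."""
    return routes.get(request.get("region"))


print(route_to_group({"region": "Beijing", "query": "solr"}))  # group-1
```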
Step 203: for each cluster group, dividing the servers in the cluster group into n clusters, each cluster including m servers.
The values of n and m depend mainly on the following two factors:
1. The maximum search request volume that a single server can carry.
The index capacity of a single server is determined by its disk capacity. Building a new index involves an index merge while the old index continues to serve requests, so the disk must hold one new index, one index currently in service, and the space required for the merge; the disk capacity must therefore be at least 3 times the index capacity.
The search request volume that a single server can bear can be obtained by stress testing. When the machine load value equals the number of CPU cores of the server, the TPS (search requests per second) that the server sustains is the stress-test peak; for a 4-core server, the TPS at load = 4 is generally the peak that the machine can bear.
2. The total index volume and the total search request volume.
The number of clusters n in a group and the number of servers m in each cluster can be calculated with formula (1) and formula (2) respectively:
n = S/T; (1)
m = Q/R + X; (2)
where S is the total index volume; T is the maximum index volume that a single server can carry; Q is the total search request volume; R is the maximum search request volume that a single server can carry; and X is the machine increment.
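As a worked example of formulas (1) and (2), the sketch below sizes a group. The values S = 60 GB and T = 12 GB are taken from the experiment reported later in this document; the values of Q, R, and X are hypothetical. Rounding the quotients up and flooring the results at 1 are assumptions, since the source gives the formulas without a rounding rule.

```python
import math


def cluster_count(index_total, max_index_per_server):
    """Formula (1): n = S / T, rounded up here (an assumption), with n >= 1."""
    return max(1, math.ceil(index_total / max_index_per_server))


def servers_per_cluster(request_total, max_requests_per_server, increment):
    """Formula (2): m = Q / R + X, quotient rounded up (an assumption), m >= 1."""
    return max(1, math.ceil(request_total / max_requests_per_server) + increment)


def min_disk_capacity(index_size):
    """New index + serving index + merge space: at least 3x the index size."""
    return 3 * index_size


n = cluster_count(index_total=60, max_index_per_server=12)   # GB
m = servers_per_cluster(request_total=1500,                  # hypothetical TPS
                        max_requests_per_server=2000,        # hypothetical TPS
                        increment=1)                         # hypothetical X
print(n, m, min_disk_capacity(12))  # 5 2 36
```

With these inputs the result is 5 clusters of 2 servers, which matches the shape of the 10-server experiment (5 groups, one master and one standby each).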
Step 204: determining a master server and slave servers among the m servers of each cluster.
In an embodiment of the invention, each server in a cluster publishes an awareness service to the other servers in the same cluster, so that every pair of servers is aware of each other.
Every server in a cluster provides search service. When m is not less than 2, a master (Master) server and slave (Slave) servers can be determined in the cluster: one server is chosen as the Master server and the remaining servers act as Slave servers. When m equals 1, the single server in the cluster acts as the Master server.
Step 205: building indexes in all servers of each cluster.
In this embodiment, the following three kinds of index can be built: the master disk index, the memory index, and the slave disk index.
The disk-plus-memory index structure ensures that newly added index entries become visible immediately. The Master server is responsible for building the master disk indexes and updates the CommitLog log file each time a master disk index is built. The Slave servers synchronize the master disk indexes and the CommitLog log file so that index data is not lost and, once the memory indexes that have been built reach the set threshold, flush the earliest-built memory index to a slave disk index, which ensures availability.
Indexes are built with a model of full dump plus real-time incremental dump. The full dump is executed periodically as a scheduled task, and the real-time incremental dump is triggered by real-time interface calls. The index building process is described in detail below.
(1) Building the master disk index
An index build period is preset, for example 1 week. The Master server triggers the full dump program on a timer according to the index build period in order to build the master disk index. The building process is as follows:
Determine the data corresponding to the data source. The data source can be information captured by a web crawler or any of various storage systems (for example, a database, files, or a NoSQL system). The master disk index can be built with the iterator model through the DataProvider interface: hasNext() first determines whether a next record exists; if it does, next() fetches the record and builds and returns a Map record; the iteration ends when hasNext() returns false. Each Map record corresponds to one index document, and the <key, value> pairs of the Map correspond to the fields and field values of the document. The Map records are encapsulated as corresponding Map objects, the encapsulated Map objects are used to build Solr AddUpdateCommand (add/batch-update) objects, and finally Solr's UpdateHandler.addDoc() method is called to add the documents to the index, which completes the building of the master disk index.
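The iterator model described above can be sketched in Python. `ListDataProvider` is a hypothetical stand-in for the DataProvider interface (the source only names `hasNext()` and `next()`), each `dict` plays the role of a Map record, and appending to a list stands in for the Solr add call.

```python
class ListDataProvider:
    """Hypothetical stand-in for the DataProvider interface."""

    def __init__(self, rows):
        self._rows = list(rows)
        self._pos = 0

    def has_next(self):
        """True while another record is available from the data source."""
        return self._pos < len(self._rows)

    def next(self):
        """Fetch the next record and return it as a field -> value map."""
        row = self._rows[self._pos]
        self._pos += 1
        return dict(row)


def full_dump(provider, add_doc):
    """Iterate until has_next() is False, adding one document per record."""
    count = 0
    while provider.has_next():
        record = provider.next()
        add_doc(record)  # stands in for adding the document to the index
        count += 1
    return count


master_disk_index = []
provider = ListDataProvider([
    {"id": "1", "title": "alpha"},
    {"id": "2", "title": "beta"},
    {"id": "3", "title": "gamma"},
])
built = full_dump(provider, master_disk_index.append)
print(built, [doc["id"] for doc in master_disk_index])  # 3 ['1', '2', '3']
```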
In an embodiment of the invention, incremental real-time index data is produced while the master disk index is being built, so the incremental data generated during the full dump must be filled in afterwards. A time-point mechanism can be used for this: the start time point checkPoint is recorded when the building of the master disk index begins; after the master disk index has been built, the incremental data produced in the period between the start time point and the current time point is flushed to the master disk index.
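The time-point mechanism can be illustrated as follows. This is a simplified sketch under stated assumptions: timestamps are plain integers, the change log is an in-memory list of `(timestamp, document)` pairs, and the function name is hypothetical.

```python
def build_with_checkpoint(snapshot, change_log, check_point, now):
    """Build from a full snapshot, then replay changes made during the build.

    Only changes whose timestamp lies between check_point (recorded when the
    full dump started) and now (when it finished) are flushed to the index.
    """
    index = {doc["id"]: doc for doc in snapshot}
    for timestamp, doc in change_log:
        if check_point <= timestamp <= now:
            index[doc["id"]] = doc
    return index


snapshot = [{"id": "1", "v": "a"}, {"id": "2", "v": "b"}]
change_log = [
    (90, {"id": "1", "v": "stale"}),   # before the dump started: ignored
    (105, {"id": "2", "v": "b2"}),     # changed during the dump
    (110, {"id": "3", "v": "c"}),      # added during the dump
    (130, {"id": "4", "v": "later"}),  # after the dump finished: ignored
]
index = build_with_checkpoint(snapshot, change_log, check_point=100, now=120)
print(sorted(index))  # ['1', '2', '3']
```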
To ensure that the other servers in a cluster can continue to provide search service after the Master server in that cluster goes down, the master disk index must be synchronized to the Slave servers, as follows: after the Master server finishes building the master disk index, it notifies the Slave servers to copy the master disk index and passes checkPoint to them; after a Slave server finishes building its copy of the master disk index, it notifies the Master server and provides search service using the master disk index it has built.
(2) Building the real-time index
In this embodiment, real-time index building comprises two parts, the building of memory indexes and the building of slave disk indexes, implemented as follows:
A: Building the memory index.
A real-time request may come from a data source (the same as for the master disk index) or from a real-time interface call by a client. Taking a real-time interface call as an example, the client calls a real-time method of the Real-timeBean class to send the request. The real-time request methods are Add, Update, Delete, mAdd (batch add), mUpdate (batch update), and mDelete (batch delete). The server receives the real-time request, encapsulates it as a DocumentRequest object, and appends it to the CommitLog log file.
When a Slave server starts, it creates the real-time index building process Real-timeJob, which continuously polls the CommitLog log file. As soon as a new record appears, the process performs an index write: depending on the request type, it calls the corresponding method of Solr's UpdateHandler to build the memory index.
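A minimal sketch of this polling loop is shown below. The `CommitLog` and `RealTimeJob` classes are hypothetical in-memory stand-ins: the real CommitLog is a file synchronized from the Master, and the real job calls Solr rather than updating a `dict`.

```python
class CommitLog:
    """In-memory stand-in for the append-only CommitLog log file."""

    def __init__(self):
        self._records = []

    def append(self, op, doc_id, fields=None):
        self._records.append((op, doc_id, fields))

    def read_from(self, offset):
        """Return all records written at or after the given offset."""
        return self._records[offset:]


class RealTimeJob:
    """Polls the commit log and applies new records to a memory index."""

    def __init__(self, log):
        self.log = log
        self.offset = 0          # position of the next unread record
        self.memory_index = {}

    def poll_once(self):
        new_records = self.log.read_from(self.offset)
        for op, doc_id, fields in new_records:
            if op in ("add", "update"):
                self.memory_index[doc_id] = fields
            elif op == "delete":
                self.memory_index.pop(doc_id, None)
        self.offset += len(new_records)
        return len(new_records)


log = CommitLog()
job = RealTimeJob(log)
log.append("add", "1", {"title": "alpha"})
log.append("add", "2", {"title": "beta"})
first = job.poll_once()
log.append("update", "1", {"title": "alpha v2"})
log.append("delete", "2")
second = job.poll_once()
print(first, second, job.memory_index)
```

Tracking an offset means each poll consumes only the records appended since the previous poll, which is what allows the log to be replayed incrementally.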
B: Building the slave disk index. The memory index is limited by memory size, so a threshold must be set to prevent the memory index from growing so large that it exhausts memory or affects other applications. When the memory index exceeds the set threshold, a new memory index is opened and incoming real-time requests are written to it, while the old memory index is flushed to the slave disk. Because the memory index and the slave disk index have the same structure, the copy method of Directory can be called to copy the memory index directly to disk as a slave disk index, which improves search performance and makes new data searchable immediately.
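The rollover behaviour can be sketched as below. The `TieredIndex` class is hypothetical, the threshold is counted in documents rather than bytes for simplicity, and a Python list of `dict` segments stands in for slave disk indexes written through a Directory copy.

```python
class TieredIndex:
    """Memory index that rolls over to slave disk segments at a threshold."""

    def __init__(self, threshold):
        self.threshold = threshold       # max documents held in memory
        self.memory = {}                 # current memory index
        self.slave_disk_segments = []    # flushed memory indexes, oldest first

    def add(self, doc_id, fields):
        if len(self.memory) >= self.threshold:
            # Flush the old memory index as-is and open a new one.
            self.slave_disk_segments.append(self.memory)
            self.memory = {}
        self.memory[doc_id] = fields


index = TieredIndex(threshold=2)
for i in range(1, 6):
    index.add(str(i), {"n": i})

print(len(index.slave_disk_segments), sorted(index.memory))  # 2 ['5']
```

Because the flushed segment is the memory index object itself, no conversion step is needed, which mirrors the "same structure" property the paragraph relies on.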
Step 206: receiving a real-time request sent by a user, and determining the cluster group corresponding to the real-time request according to its address information.
For example, if the address information of the real-time request is Beijing, the cluster group corresponding to the request is the first cluster group.
Step 207: according to the current load of each cluster in the cluster group, distributing the real-time request to any one server in the target cluster with the minimum current load.
To achieve load balancing and improve the response efficiency for real-time requests, the real-time request can be distributed to any one server in the target cluster with the minimum current load.
Step 208: responding to the real-time request according to its type.
In this embodiment, the real-time request may include: a search request, a delete request, an add request, an update request, a batch delete request, a batch add request, and a batch update request.
These types of real-time request are described in detail below.
(1) Search request
The server that receives the search request performs the following operations. For a single-group search, it searches the master disk index, the memory index, and the slave disk index thoroughly according to the search request, and filters deleted documents out of the search results; the docIds of deleted documents are stored in the delList set. For a multi-group search, it uses Solr's shard concept to take the union of the result sets of the groups: one group in the group set is chosen arbitrarily and accessed over TCP/IP; that group obtains its own index through the EmbeddedSolrServer provided by Solr and sends TCP/IP requests to the other groups to obtain their index data and merge it, which speeds up index retrieval.
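A toy sketch of the single-group search and the multi-group merge follows. Plain dictionaries and substring matching stand in for the Solr indexes and query parsing, and the function names are hypothetical. Applying the delList filter to the disk tiers only is an assumption, based on the later statement that in-memory deletions are applied directly.

```python
def search_group(query, master_disk, memory, slave_disk, del_list):
    """Search all three index tiers of one group, hiding deleted documents."""
    hits = {}
    for tier in (master_disk, slave_disk):
        for doc_id, text in tier.items():
            if query in text and doc_id not in del_list:
                hits[doc_id] = text
    for doc_id, text in memory.items():
        if query in text:
            hits[doc_id] = text
    return hits


def merge_groups(*group_results):
    """Union of the result sets of several groups (multi-group search)."""
    merged = {}
    for result in group_results:
        merged.update(result)
    return merged


group_a = search_group(
    "search",
    master_disk={"d1": "solr search", "d2": "search engine"},
    memory={"m1": "real-time search"},
    slave_disk={"s1": "search cluster", "s2": "load balance"},
    del_list={"d2"},
)
group_b = {"x1": "search shard"}  # result already returned by another group
print(sorted(merge_groups(group_a, group_b)))  # ['d1', 'm1', 's1', 'x1']
```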
(2) Add request, delete request, and update request
A: Add request. The client sends an Add command, which is encapsulated as an AddDocumentRequest and written to the CommitLog log file. When IndexBuildJob builds the index and determines that the request is an AddDocumentRequest, it extracts the index data, converts it into a Solr AddUpdateCommand object, and finally calls the addDoc() method of Solr's UpdateHandler to add the index data.
B: Delete request. The client sends a Delete command, which is encapsulated as a DeleteDocumentRequest and written to the CommitLog log file. In the Lucene search engine, both IndexWriter and IndexReader can perform deletions; the difference is that deletions made through an IndexWriter instance are buffered and do not take effect immediately. After the search engine receives a real-time delete request, if the document to be deleted is in memory, it is deleted directly with IndexWriter; otherwise it is marked as deleted in the disk indexes (that is, the master disk index and the slave disk index) and stored in the set delList. When the memory index is committed, the commit() method of IndexWriter is called, and then the IndexReader method of the disk index is called to delete the elements of the delList set one by one.
C: Update request. As with the Add command, an Update command sent by the client is encapsulated as an UpdateDocumentRequest and written to the CommitLog log. When IndexBuildJob builds the index, it converts the index data into an AddUpdateCommand and sets AllowDups to false, so that duplicate index data is not allowed. Solr then first checks whether the index entry already exists; if it does, Solr deletes it and then performs the add, which completes the update.
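The delete and update semantics described in B and C can be sketched with a small in-memory model. The `RealTimeIndex` class is hypothetical: dictionaries stand in for the Lucene indexes, and `commit()` stands in for the IndexWriter commit followed by the IndexReader deletions.

```python
class RealTimeIndex:
    """Toy model of the delete/update handling; not the Solr/Lucene API."""

    def __init__(self, disk_docs):
        self.disk = dict(disk_docs)  # master and slave disk indexes combined
        self.memory = {}             # memory index
        self.del_list = set()        # ids marked as deleted on disk

    def delete(self, doc_id):
        if doc_id in self.memory:
            del self.memory[doc_id]        # in memory: delete directly
        elif doc_id in self.disk:
            self.del_list.add(doc_id)      # on disk: mark only

    def update(self, doc_id, fields):
        # Duplicates are not allowed: remove the old copy, then add.
        self.delete(doc_id)
        self.memory[doc_id] = fields

    def commit(self):
        # Physically remove the marked documents from disk, one by one.
        for doc_id in self.del_list:
            self.disk.pop(doc_id, None)
        self.del_list.clear()


idx = RealTimeIndex({"d1": "old", "d2": "x"})
idx.memory["m1"] = "y"
idx.delete("m1")          # removed from memory immediately
idx.delete("d2")          # only marked; still on disk until commit
idx.update("d1", "new")   # old disk copy marked, new copy in memory
marked_before_commit = set(idx.del_list)
idx.commit()
print(sorted(marked_before_commit), idx.disk, idx.memory)
```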
The search model of the above embodiment of the present invention was verified by the following experiment. The experimental environment consists of 10 servers divided into 5 groups, each group with one master and one standby. Each server has an Intel(R) Xeon(R) 4-core E5520 CPU @ 2.27 GHz, 4 GB of memory, and a 60 GB 7200 rpm hard disk. The total index size is 60 GB, 12 GB per server, and the experimental data are averaged values.
The prototype system implemented for the experiment is Xsolr, and the comparison system is Solr. The data set contains 40 million documents, each consisting of 22 fields with an average size of 0.03 KB, and the test tool is LoadRunner. The experimental results fall into two parts: the first group of experiments compares the performance of Xsolr and Solr (see Fig. 3); the second group examines the data consistency and disaster tolerance of Xsolr (see Fig. 4).
A: Real-time response performance. According to Fig. 3, Xsolr outperforms Solr in both TPS and response time under all 4 test conditions. Real-time update requests are all answered within 1 s. This is because Xsolr performs update operations in memory: indexes are built in memory and are not merged with the disk index, and the memory and disk indexes are searched simultaneously, which reduces disk I/O and speeds up the appearance of real-time index data.
B: Load analysis. With comparable system load, Xsolr consumes more resources. Because Xsolr builds memory indexes, it occupies a large amount of memory in the experimental environment, peaking at 1 GB. CPU usage peaks at about 30%; at that point the index update TPS reaches 2100 and a large amount of memory is in use, but the machine load remains below 4, which is within an acceptable range.
C: Data consistency. According to Fig. 4, during the first 15 s, add, update, and delete operations are performed on a single server, and the Master/Slave server data remain consistent, with only very slight differences. During real-time operation, the Slave continuously pulls the incremental CommitLog log file from the Master and consumes it to create the real-time index, which keeps the master and standby server data consistent. However, because the CommitLog is written sequentially and copying files from the Master incurs some network overhead, slight differences can appear. Eventual consistency is reached at the millisecond level, so the impact on the overall service is minimal and within an acceptable range.
D: Data disaster tolerance and integrity. At 17 s, 1000 records are randomly added on the Master server; at this point the memory index has not reached the threshold and is not flushed to the slave disk index. At 30 s, the Slave server resumes service, and its index record count is the same as the Master's. After 35 s, the Master server goes down; it restarts after 60 s, and its index record count is the same as the Slave's and matches the 1000 records initially added. Xsolr persists real-time data through the CommitLog log and sets crash recovery points. When a server recovers from a crash, it reads the log offset recorded at the most recent CheckPoint in the CommitLog and rebuilds the memory index starting from that offset, which ensures data integrity and achieves data disaster tolerance.
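The crash recovery described in item D can be sketched as a log replay. The following Python sketch is illustrative only: the tuple layout (op, doc_id, fields), the function name rebuild_memory_index, and the use of a list index as the log offset are assumptions, not the CommitLog file format used by Xsolr.

```python
def rebuild_memory_index(commit_log, checkpoint_offset):
    """Replay log entries recorded after the last checkpoint.

    commit_log is a list of (op, doc_id, fields) tuples, a hypothetical
    in-memory stand-in for the CommitLog file. Entries before
    checkpoint_offset are assumed to be already persisted on disk.
    """
    memory_index = {}
    pending_deletes = set()  # deletes that target documents already on disk
    for op, doc_id, fields in commit_log[checkpoint_offset:]:
        if op in ("add", "update"):
            memory_index[doc_id] = fields
        elif op == "delete":
            if doc_id in memory_index:
                del memory_index[doc_id]
            else:
                pending_deletes.add(doc_id)
        else:
            raise ValueError(f"unknown operation: {op}")
    return memory_index, pending_deletes
```

Replaying only the tail of the log keeps recovery time proportional to the data written since the last checkpoint rather than to the whole index.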
The above experimental results show that the distributed m × n search model provided by this embodiment meets the real-time requirements of the system and, under large data volumes and high concurrency, also guarantees data consistency and data disaster tolerance, which demonstrates the feasibility of the system model.
Referring to Fig. 5, an embodiment of the present invention provides a data search system, which may include:
A first determining unit 501, configured to determine the multiple servers currently used for data search;
A search model construction unit 502, configured to build a distributed m × n search model from the multiple servers, where n denotes the number of clusters included in the search model and m denotes the number of servers included in each cluster; n ≥ 1; m ≥ 1;
An index construction unit 503, configured to build indexes in the m servers of each cluster;
An allocation unit 504, configured to, when a search request is received, assign the search request, according to the current load of each cluster, to any one server in the target cluster with the lowest current load;
A search unit 505, configured to determine, from the built indexes, the first index corresponding to the search request, and to provide target data according to the first index.
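The behaviour of the allocation unit 504 can be sketched in a few lines of Python. The function name pick_server, the dictionary shapes, and the use of a single number as the "current load" of a cluster are assumptions made for illustration; the patent text does not specify how load is measured.

```python
import random


def pick_server(clusters, loads, rng=random):
    """Choose the least-loaded cluster, then any one server inside it.

    clusters: cluster name -> list of server names (hypothetical layout)
    loads:    cluster name -> current load as a number (assumed metric)
    """
    target = min(loads, key=loads.get)    # cluster with the lowest load
    server = rng.choice(clusters[target])  # any one server in that cluster
    return target, server
```

Passing a seeded random.Random instance as rng makes the choice reproducible in tests.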
In an embodiment of the invention, the search model construction unit 502 is specifically configured to:
Calculate the value of n by the following formula: n = S/T;
Where S denotes the total index size, and T denotes the maximum index size that a single server can carry;
Calculate the value of m by the following formula: m = Q/R + X;
Where Q denotes the total search request volume, R denotes the maximum search request volume that a single server can carry, and X denotes the machine increment.
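As a worked example of the two formulas, the sketch below computes n and m in Python. Rounding the quotients up to whole servers and enforcing a minimum of 1 are assumptions added here (the text only gives n = S/T and m = Q/R + X), and the request figures in the usage note are hypothetical.

```python
import math


def cluster_count(total_index, max_index_per_server):
    """n = S / T, rounded up to a whole number of clusters (assumption)."""
    return max(1, math.ceil(total_index / max_index_per_server))


def servers_per_cluster(total_requests, max_requests_per_server, machine_increment):
    """m = Q / R + X, with Q / R rounded up (assumption)."""
    return max(1, math.ceil(total_requests / max_requests_per_server) + machine_increment)
```

With the experimental figures S = 60 GB and T = 12 GB, cluster_count(60, 12) gives n = 5, matching the 5 groups used in the experiment. With hypothetical values Q = 1500, R = 2000, and X = 1, servers_per_cluster(1500, 2000, 1) gives m = 2, i.e. one master and one standby per group.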
In an embodiment of the invention, referring to Fig. 6, the data search system may further include:
A second determining unit 601, configured to, when the number of the m servers in the current cluster is not less than 2, determine a master server and slave servers among the m servers of the current cluster;
The index construction unit 503 is specifically configured to: build master disk indexes in the master server, and update the CommitLog log file each time a master disk index is built; set a synchronization period in the slave server and, according to the set synchronization period, synchronize the master disk indexes and the CommitLog log file built in the master server to the slave server; in the slave server, build the master disk index with the largest search volume into a memory index according to the CommitLog log file; and, when the memory index volume exceeds a set threshold, flush the earliest-built memory index to a slave disk index according to the build time of the memory indexes, so that the memory index volume does not exceed the set threshold.
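The threshold rule for memory indexes (flush the earliest-built index first until the total is back under the threshold) can be sketched as follows. MemoryIndexPool and its fields are hypothetical names, sizes are abstract numbers, and the sketch assumes indexes are registered in build-time order.

```python
from collections import deque


class MemoryIndexPool:
    """Toy model of memory indexes that spill to slave disk indexes."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.memory = deque()   # (build_time, name, size), oldest first
        self.slave_disk = []    # names of indexes flushed to disk

    def memory_size(self):
        return sum(size for _, _, size in self.memory)

    def add_index(self, build_time, name, size):
        # Assumes calls arrive in increasing build_time order.
        self.memory.append((build_time, name, size))
        # Flush the earliest-built index until the total fits again.
        while self.memory and self.memory_size() > self.threshold:
            _, oldest, _ = self.memory.popleft()
            self.slave_disk.append(oldest)
```

A deque keeps the oldest index at the left end, so each flush is a constant-time pop.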
In an embodiment of the invention, the index construction unit 503 is specifically configured to:
Trigger a full dump program in the master server according to a preset index construction period, and record the start time point at that moment;
Determine the data corresponding to the data source;
Each time the current piece of data corresponding to the data source is obtained, determine whether a next piece of data exists; if it does, use the iterator pattern and the DataProvider interface to obtain the next piece of data corresponding to the data source, build a corresponding Map record, and continue executing this step until the determination is that no next piece exists;
Encapsulate the multiple Map records built into corresponding Map objects, use the encapsulated Map objects to build a batch add/update object, and flush this batch update object to the master disk index;
According to the recorded start time point and the current time point, flush the incremental data in the period from the start time point to the current time point to the master disk index.
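The full dump followed by an incremental catch-up can be sketched in Python. The function full_dump, the changed_between callback, the "id" key, and the use of a dictionary as the master disk index are all hypothetical; the DataProvider interface named in the text is represented here simply by any Python iterable.

```python
import time
from typing import Callable, Dict, Iterable, List

Record = Dict[str, object]


def full_dump(rows: Iterable[Record],
              changed_between: Callable[[float, float], List[Record]],
              master_disk_index: Dict[object, Record],
              clock: Callable[[], float] = time.time) -> None:
    """Rebuild the master disk index, then apply changes made meanwhile."""
    start = clock()                      # record the start time point
    batch: List[Record] = []
    for row in iter(rows):               # iterator model over the data source
        batch.append(dict(row))          # one Map record per piece of data
    for record in batch:                 # flush the batch to the disk index
        master_disk_index[record["id"]] = record
    # Catch up on data that changed while the dump was running.
    for record in changed_between(start, clock()):
        master_disk_index[record["id"]] = record
```

Recording the start time before iterating is what lets the final step pick up records that changed during the dump, so the full build does not lose concurrent updates.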
In an embodiment of the invention, referring to Fig. 7, the data search system may further include:
A real-time update unit 701, configured to, when the master server receives a real-time request for a second index, perform the operation corresponding to the real-time request on the second index and write the performed operation to the CommitLog log file; the slave server belonging to the same cluster as this master server builds the corresponding memory index according to the operations updated in the synchronized CommitLog log file;
The real-time requests include: delete requests, add requests, update requests, batch delete requests, batch add requests, and batch update requests.
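The interaction between the master server and a slave server in the real-time update path can be sketched as a log that the slave consumes from its last known position. The Master and Slave classes, the tuple layout of a log entry, and the list-index offset are hypothetical; batch requests are omitted to keep the sketch short.

```python
class Master:
    """Applies real-time requests and appends each one to a commit log."""

    def __init__(self):
        self.index = {}
        self.commit_log = []   # stand-in for the CommitLog file

    def handle(self, op, doc_id, fields=None):
        if op == "delete":
            self.index.pop(doc_id, None)
        elif op in ("add", "update"):
            self.index[doc_id] = fields
        else:
            raise ValueError(f"unknown operation: {op}")
        self.commit_log.append((op, doc_id, fields))


class Slave:
    """Builds its memory index by consuming new commit log entries."""

    def __init__(self):
        self.memory_index = {}
        self.offset = 0        # number of log entries already consumed

    def sync(self, master):
        new_entries = master.commit_log[self.offset:]
        for op, doc_id, fields in new_entries:
            if op == "delete":
                self.memory_index.pop(doc_id, None)
            else:
                self.memory_index[doc_id] = fields
        self.offset += len(new_entries)
        return len(new_entries)
```

Between two calls to sync the slave may lag behind the master, which corresponds to the eventual consistency observed in the experiment.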
In summary, the embodiments of the present invention can achieve at least the following beneficial effects:
1. In the embodiments of the present invention, a distributed m × n search model is built and indexes are built in every cluster of the search model; when a search request is received, it is assigned to any one server in the cluster with the lowest current load. This achieves load balancing, reduces the pressure borne by the cluster servers, and thereby improves data search efficiency.
2. In the embodiments of the present invention, the multiple servers are grouped according to routing rules; when a search request is received, the cluster that will provide the search service can be determined from the address information of the search request, which reduces search time and improves search efficiency.
3. In the embodiments of the present invention, a multi-index mechanism combining memory indexes and disk indexes is used. The master disk index is built periodically in full, which ensures data integrity. Real-time information is first written to memory incrementally; once the memory index exceeds the set threshold, it is copied to disk to form a slave disk index, which is not merged with the master disk index, reducing the time overhead of index merging. The memory and disk indexes are searched simultaneously, which improves search response time.
4. In the embodiments of the present invention, the index data disaster tolerance problem in a distributed environment is solved: a CommitLog log mechanism is introduced to persist index metadata, improving the Master/Slave (active/standby) model of Solr and ensuring data consistency and availability.
The information exchange between the units of the above apparatus, their execution processes, and similar content are based on the same concept as the method embodiments of the present invention; for details, refer to the description of the method embodiments, which is not repeated here.
It should be noted that, in this document, relational terms such as first and second are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise", or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element introduced by an "including" statement does not exclude the presence of other identical elements in the process, method, article, or device that includes the element.
Those of ordinary skill in the art will understand that all or part of the steps of the above method embodiments may be carried out by hardware under the control of program instructions. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments. The storage medium includes ROM, RAM, magnetic disks, optical discs, and other media capable of storing program code.
Finally, it should be noted that the above are only preferred embodiments of the present invention, intended merely to illustrate the technical solution of the present invention and not to limit its scope of protection. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention falls within the scope of protection of the present invention.

Claims (10)

1. A data search method, characterized by comprising:
Determining the multiple servers currently used for data search;
Building a distributed m × n search model from the multiple servers, where n denotes the number of clusters included in the search model and m denotes the number of servers included in each cluster; n ≥ 1; m ≥ 1;
Building indexes in the m servers of each cluster;
When a search request is received, assigning the search request, according to the current load of each cluster, to any one server in the target cluster with the lowest current load;
Determining, from the built indexes, the first index corresponding to the search request, and providing target data according to the first index.
2. The method according to claim 1, characterized in that building the distributed m × n search model from the multiple servers comprises:
Calculating the value of n by the following formula: n = S/T;
Where S denotes the total index size, and T denotes the maximum index size that a single server can carry;
Calculating the value of m by the following formula: m = Q/R + X;
Where Q denotes the total search request volume, R denotes the maximum search request volume that a single server can carry, and X denotes the machine increment.
3. The method according to claim 1, characterized in that
It further comprises: when the number of the m servers in the current cluster is not less than 2, determining a master server and slave servers among the m servers of the current cluster;
Building indexes in the m servers of each cluster comprises: building master disk indexes in the master server, and updating the CommitLog log file each time a master disk index is built; setting a synchronization period in the slave server and, according to the set synchronization period, synchronizing the master disk indexes and the CommitLog log file built in the master server to the slave server; in the slave server, building the master disk index with the largest search volume into a memory index according to the CommitLog log file; and, when the memory index volume exceeds a set threshold, flushing the earliest-built memory index to a slave disk index according to the build time of the memory indexes, so that the memory index volume does not exceed the set threshold.
4. The method according to claim 3, characterized in that
Building the master disk index in the master server comprises:
Triggering a full dump program in the master server according to a preset index construction period, and recording the start time point at that moment;
Determining the data corresponding to the data source;
Each time the current piece of data corresponding to the data source is obtained, determining whether a next piece of data exists; if it does, using the iterator pattern and the DataProvider interface to obtain the next piece of data corresponding to the data source, building a corresponding Map record, and continuing to execute this step until the determination is that no next piece exists;
Encapsulating the multiple Map records built into corresponding Map objects, using the encapsulated Map objects to build a batch add/update object, and flushing this batch update object to the master disk index;
According to the recorded start time point and the current time point, flushing the incremental data in the period from the start time point to the current time point to the master disk index.
5. The method according to any one of claims 1-4, characterized by further comprising: when the master server receives a real-time request for a second index, performing the operation corresponding to the real-time request on the second index, and writing the performed operation to the CommitLog log file; the slave server belonging to the same cluster as this master server builds the corresponding memory index according to the operations updated in the synchronized CommitLog log file;
The real-time requests include: delete requests, add requests, update requests, batch delete requests, batch add requests, and batch update requests.
6. A data search system, characterized by comprising:
A first determining unit, configured to determine the multiple servers currently used for data search;
A search model construction unit, configured to build a distributed m × n search model from the multiple servers, where n denotes the number of clusters included in the search model and m denotes the number of servers included in each cluster; n ≥ 1; m ≥ 1;
An index construction unit, configured to build indexes in the m servers of each cluster;
An allocation unit, configured to, when a search request is received, assign the search request, according to the current load of each cluster, to any one server in the target cluster with the lowest current load;
A search unit, configured to determine, from the built indexes, the first index corresponding to the search request, and to provide target data according to the first index.
7. The data search system according to claim 6, characterized in that the search model construction unit is specifically configured to:
Calculate the value of n by the following formula: n = S/T;
Where S denotes the total index size, and T denotes the maximum index size that a single server can carry;
Calculate the value of m by the following formula: m = Q/R + X;
Where Q denotes the total search request volume, R denotes the maximum search request volume that a single server can carry, and X denotes the machine increment.
8. The data search system according to claim 6, characterized in that
It further comprises: a second determining unit, configured to, when the number of the m servers in the current cluster is not less than 2, determine a master server and slave servers among the m servers of the current cluster;
The index construction unit is specifically configured to: build master disk indexes in the master server, and update the CommitLog log file each time a master disk index is built; set a synchronization period in the slave server and, according to the set synchronization period, synchronize the master disk indexes and the CommitLog log file built in the master server to the slave server; in the slave server, build the master disk index with the largest search volume into a memory index according to the CommitLog log file; and, when the memory index volume exceeds a set threshold, flush the earliest-built memory index to a slave disk index according to the build time of the memory indexes, so that the memory index volume does not exceed the set threshold.
9. The data search system according to claim 8, characterized in that
The index construction unit is specifically configured to:
Trigger a full dump program in the master server according to a preset index construction period, and record the start time point at that moment;
Determine the data corresponding to the data source;
Each time the current piece of data corresponding to the data source is obtained, determine whether a next piece of data exists; if it does, use the iterator pattern and the DataProvider interface to obtain the next piece of data corresponding to the data source, build a corresponding Map record, and continue executing this step until the determination is that no next piece exists;
Encapsulate the multiple Map records built into corresponding Map objects, use the encapsulated Map objects to build a batch add/update object, and flush this batch update object to the master disk index;
According to the recorded start time point and the current time point, flush the incremental data in the period from the start time point to the current time point to the master disk index.
10. The data search system according to any one of claims 6-9, characterized by further comprising:
A real-time update unit, configured to, when the master server receives a real-time request for a second index, perform the operation corresponding to the real-time request on the second index, and write the performed operation to the CommitLog log file; the slave server belonging to the same cluster as this master server builds the corresponding memory index according to the operations updated in the synchronized CommitLog log file;
The real-time requests include: delete requests, add requests, update requests, batch delete requests, batch add requests, and batch update requests.
CN201610362426.0A 2016-05-26 2016-05-26 Data searching method and system Pending CN106055622A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610362426.0A CN106055622A (en) 2016-05-26 2016-05-26 Data searching method and system

Publications (1)

Publication Number Publication Date
CN106055622A true CN106055622A (en) 2016-10-26

Family

ID=57174847

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610362426.0A Pending CN106055622A (en) 2016-05-26 2016-05-26 Data searching method and system

Country Status (1)

Country Link
CN (1) CN106055622A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101071442A (en) * 2007-06-26 2007-11-14 腾讯科技(深圳)有限公司 Distributed indesx file searching method, searching system and searching server
CN103729386A (en) * 2012-10-16 2014-04-16 阿里巴巴集团控股有限公司 Information query system and method
CN103984745A (en) * 2014-05-23 2014-08-13 何震宇 Distributed video vertical searching method and system
CN105468720A (en) * 2015-11-20 2016-04-06 北京锐安科技有限公司 Method for integrating distributed data processing systems, corresponding systems and data processing method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
傅巍玮 等: "基于Solr的分布式实时搜索模型研究与实现", 《电信科学》 *
傅巍玮: "分布式实时垂直搜索引擎研究与实现", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106649870A (en) * 2017-01-03 2017-05-10 山东浪潮商用系统有限公司 Distributed implementation method for search engine
CN110019080A (en) * 2017-07-14 2019-07-16 北京京东尚科信息技术有限公司 Data access method and device
CN110019080B (en) * 2017-07-14 2021-11-12 北京京东尚科信息技术有限公司 Data access method and device
CN108573063A (en) * 2018-04-27 2018-09-25 宁波银行股份有限公司 A kind of data query method and system
CN110609844A (en) * 2018-05-29 2019-12-24 优信拍(北京)信息科技有限公司 Data updating method, device and system
CN110609844B (en) * 2018-05-29 2022-05-13 优信拍(北京)信息科技有限公司 Data updating method, device and system
CN108763578A (en) * 2018-06-07 2018-11-06 腾讯科技(深圳)有限公司 A kind of newer method of index file and server
CN111209462A (en) * 2020-01-02 2020-05-29 北京字节跳动网络技术有限公司 Data processing method, device and equipment
CN111400555A (en) * 2020-03-05 2020-07-10 湖南大学 Graph data query task processing method and device, computer equipment and storage medium
CN111400555B (en) * 2020-03-05 2023-09-26 湖南大学 Graph data query task processing method and device, computer equipment and storage medium
CN112507187A (en) * 2020-11-11 2021-03-16 贝壳技术有限公司 Index changing method and device
CN113824776A (en) * 2021-09-02 2021-12-21 济南浪潮数据技术有限公司 Automatic network request distribution method and system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20161026
