CN106055622A - Data searching method and system - Google Patents
Data searching method and system
- Publication number
- CN106055622A (CN201610362426.0A)
- Authority
- CN
- China
- Prior art keywords
- index
- server
- cluster
- data
- request
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/901—Indexing; Data structures therefor; Storage structures
- G06F16/903—Querying
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Software Systems (AREA)
- Computational Linguistics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides a data searching method and system. The method comprises: determining the servers currently used for data search; constructing a distributed m × n search model from those servers; building indexes on the m servers of each cluster; when a search request is received, distributing the request, according to the current load of each cluster, to any one server in the target cluster with the minimum current load; and determining, from the built indexes, a first index corresponding to the search request and providing target data according to the first index. Because a distributed m × n search model is constructed, an index is built in every cluster of the model, and each incoming search request is sent to a server in the least-loaded cluster, load balancing is achieved, the pressure on the servers in each cluster is reduced, and data search efficiency is improved.
Description
Technical Field
The present invention relates to the field of search engine technology, and in particular to a data searching method and system.
Background
With the rapid development of Internet technology, users are increasingly concerned with how to obtain time-sensitive data as quickly as possible, so that valuable information can be mined from it.
Traditional information search technology builds, on a server, an index for each data source. When the server receives a search request, it looks up the corresponding index according to the request and determines the location of the data source from the index.
However, when the number of search requests is large, the server's search capacity cannot sustain them, which degrades search efficiency.
Summary of the Invention
Embodiments of the present invention provide a data searching method and system to improve search efficiency.
In a first aspect, an embodiment of the present invention provides a data searching method, comprising:
determining a plurality of servers currently used for data search;
constructing a distributed m × n search model from the plurality of servers, where n denotes the number of clusters in the search model and m denotes the number of servers in each cluster; n ≥ 1; m ≥ 1;
building indexes on the m servers of each cluster;
when a search request is received, distributing the search request, according to the current load of each cluster, to any one server in the target cluster with the minimum current load;
determining, from the built indexes, a first index corresponding to the search request, and providing target data according to the first index.
Preferably, constructing the distributed m × n search model from the plurality of servers comprises:
calculating the value of n by the formula n = S/T,
where S denotes the total index volume and T denotes the maximum index volume a single server can carry;
calculating the value of m by the formula m = Q/R + X,
where Q denotes the total search request volume, R denotes the maximum search request volume a single server can carry, and X denotes the machine increment.
Preferably, the method further comprises: when the number of servers m in the current cluster is not less than 2, determining a master server and slave servers among the m servers of the current cluster.
Building indexes on the m servers of each cluster then comprises: building master disk indexes on the master server, and updating the CommitLog log file each time a master disk index is built; setting a synchronization period on the slave server and, according to that period, synchronizing the master disk indexes and the CommitLog log file built on the master server to the slave server; on the slave server, building the most frequently searched master disk indexes into in-memory indexes according to the CommitLog log file; and, when the volume of in-memory indexes exceeds a set threshold, flushing the earliest-built in-memory index to a slave disk index according to the build time of the in-memory indexes, so that the in-memory index volume does not exceed the set threshold.
Preferably, building the master disk index on the master server comprises:
triggering a full dump program on the master server according to a preset index build period, and recording the start time point;
determining the data corresponding to the data source;
each time the current record of the data source has been obtained, judging whether a next record exists; if it does, using the iterator model and the DataProvider interface to obtain the next record of the data source and build a corresponding Map record, and repeating this step until the judgment shows that no next record exists;
encapsulating the Map records that were built into corresponding Map objects, using the encapsulated Map objects to build an add/batch-update object, and flushing that object to the master disk index;
according to the recorded start time point and the current time point, flushing the incremental data produced in the period between the start time point and the current time point to the master disk index.
Preferably, the method further comprises: when the master server receives a real-time request for a second index, performing on the second index the operation corresponding to the real-time request and recording the performed operation in the CommitLog log file; a slave server belonging to the same cluster as the master server then builds the corresponding in-memory index according to the operations recorded in the synchronized CommitLog log file.
The real-time request includes: a delete request, an add request, an update request, a batch delete request, a batch add request, and a batch update request.
In a second aspect, an embodiment of the present invention provides a data searching system, comprising:
a first determining unit, configured to determine a plurality of servers currently used for data search;
a search model construction unit, configured to construct a distributed m × n search model from the plurality of servers, where n denotes the number of clusters in the search model and m denotes the number of servers in each cluster; n ≥ 1; m ≥ 1;
an index construction unit, configured to build indexes on the m servers of each cluster;
an allocation unit, configured to, when a search request is received, distribute the search request, according to the current load of each cluster, to any one server in the target cluster with the minimum current load;
a search unit, configured to determine, from the built indexes, a first index corresponding to the search request, and to provide target data according to the first index.
Preferably, the search model construction unit is specifically configured to:
calculate the value of n by the formula n = S/T,
where S denotes the total index volume and T denotes the maximum index volume a single server can carry;
calculate the value of m by the formula m = Q/R + X,
where Q denotes the total search request volume, R denotes the maximum search request volume a single server can carry, and X denotes the machine increment.
Preferably, the system further comprises: a second determining unit, configured to determine a master server and slave servers among the m servers of the current cluster when the number of servers m in the current cluster is not less than 2.
The index construction unit is specifically configured to: build master disk indexes on the master server, and update the CommitLog log file each time a master disk index is built; set a synchronization period on the slave server and, according to that period, synchronize the master disk indexes and the CommitLog log file built on the master server to the slave server; on the slave server, build the most frequently searched master disk indexes into in-memory indexes according to the CommitLog log file; and, when the volume of in-memory indexes exceeds a set threshold, flush the earliest-built in-memory index to a slave disk index according to the build time of the in-memory indexes, so that the in-memory index volume does not exceed the set threshold.
Preferably, the index construction unit is specifically configured to:
trigger a full dump program on the master server according to a preset index build period, and record the start time point;
determine the data corresponding to the data source;
each time the current record of the data source has been obtained, judge whether a next record exists; if it does, use the iterator model and the DataProvider interface to obtain the next record of the data source and build a corresponding Map record, and repeat this step until the judgment shows that no next record exists;
encapsulate the Map records that were built into corresponding Map objects, use the encapsulated Map objects to build an add/batch-update object, and flush that object to the master disk index;
according to the recorded start time point and the current time point, flush the incremental data produced in the period between the start time point and the current time point to the master disk index.
Preferably, the system further comprises:
a real-time update unit, configured to, when the master server receives a real-time request for a second index, perform on the second index the operation corresponding to the real-time request and record the performed operation in the CommitLog log file; a slave server belonging to the same cluster as the master server then builds the corresponding in-memory index according to the operations recorded in the synchronized CommitLog log file.
The real-time request includes: a delete request, an add request, an update request, a batch delete request, a batch add request, and a batch update request.
Embodiments of the present invention provide a data searching method and system. A distributed m × n search model is constructed and an index is built in every cluster of the model; when a search request is received, it is distributed to any one server in the cluster with the minimum current load. This achieves load balancing, reduces the pressure on the servers in each cluster, and thereby improves data search efficiency.
Brief Description of the Drawings
To illustrate the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed to describe them are briefly introduced below. The drawings described below show only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of a method provided by an embodiment of the present invention;
Fig. 2 is a flowchart of another method provided by an embodiment of the present invention;
Fig. 3 is a schematic diagram comparing the performance metrics of the Xsolr and Solr systems, provided by an embodiment of the present invention;
Fig. 4 is a schematic diagram of the data consistency and disaster recovery of Xsolr, provided by an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of a system provided by an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of another system provided by an embodiment of the present invention;
Fig. 7 is a schematic structural diagram of yet another system provided by an embodiment of the present invention.
Detailed Description
To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the scope of protection of the present invention.
As shown in Fig. 1, an embodiment of the present invention provides a data searching method, which may comprise the following steps:
Step 101: determine a plurality of servers currently used for data search;
Step 102: construct a distributed m × n search model from the plurality of servers, where n denotes the number of clusters in the search model and m denotes the number of servers in each cluster; n ≥ 1; m ≥ 1;
Step 103: build indexes on the m servers of each cluster;
Step 104: when a search request is received, distribute the search request, according to the current load of each cluster, to any one server in the target cluster with the minimum current load;
Step 105: determine, from the built indexes, the first index corresponding to the search request, and provide target data according to the first index.
With this scheme, a distributed m × n search model is constructed, an index is built in every cluster of the model, and each incoming search request is distributed to any one server in the cluster with the minimum current load. This achieves load balancing, reduces the pressure on the servers in each cluster, and thereby improves data search efficiency.
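The dispatch rule of step 104 can be illustrated with a minimal Python sketch. The names `pick_server`, `clusters`, and `loads` are hypothetical, and the load is assumed to be a single comparable number per cluster; the patent does not specify how load is measured.

```python
import random

def pick_server(clusters, loads, rng=random):
    """Return (cluster_id, server) for the cluster with the minimum current load.

    clusters: dict mapping a cluster id to its list of m servers (assumed shape)
    loads:    dict mapping a cluster id to its current load (assumed metric)
    """
    if not clusters:
        raise ValueError("no clusters available")
    # Target cluster: the one whose current load is smallest.
    target = min(clusters, key=lambda cluster_id: loads[cluster_id])
    # "Any one server" in the target cluster: chosen at random here.
    return target, rng.choice(clusters[target])

clusters = {"c1": ["s1", "s2"], "c2": ["s3", "s4"], "c3": ["s5", "s6"]}
loads = {"c1": 0.8, "c2": 0.3, "c3": 0.5}
cluster_id, server = pick_server(clusters, loads)
```

Picking a random server inside the target cluster is only one way to satisfy "any one server"; round-robin would fit the description equally well.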
In an embodiment of the present invention, to ensure that a cluster can keep its search service available after one of its servers crashes, the method may further comprise: when the number of servers m in the current cluster is not less than 2, determining a master server and slave servers among the m servers of the current cluster.
Building indexes on the m servers of each cluster then comprises: building master disk indexes on the master server, and updating the CommitLog log file each time a master disk index is built; setting a synchronization period on the slave server and, according to that period, synchronizing the master disk indexes and the CommitLog log file built on the master server to the slave server; on the slave server, building the most frequently searched master disk indexes into in-memory indexes according to the CommitLog log file; and, when the volume of in-memory indexes exceeds a set threshold, flushing the earliest-built in-memory index to a slave disk index according to the build time of the in-memory indexes, so that the in-memory index volume does not exceed the set threshold.
In an embodiment of the present invention, to further improve search efficiency, the method may further comprise: establishing index classification rules.
After indexes are built on the m servers of each cluster, the method further comprises: dividing the built master disk indexes into multiple groups according to the index classification rules, each group corresponding to one index type.
Determining, from the built indexes, the first index corresponding to the search request comprises: determining the target index type corresponding to the search request, and traversing the master disk indexes in the group for that target index type to determine the first index. This narrows the range of indexes that must be searched and thus improves index search efficiency.
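A small Python sketch of the classification idea follows. The classification rule (a `type` field on each index) and the helper names are assumptions made for illustration; the patent does not say how index types are defined.

```python
from collections import defaultdict

def group_by_type(indexes, classify):
    """Divide indexes into groups, one group per index type."""
    groups = defaultdict(list)
    for index in indexes:
        groups[classify(index)].append(index)
    return dict(groups)

def find_first_index(groups, target_type, matches):
    """Traverse only the group of the target index type."""
    for index in groups.get(target_type, []):
        if matches(index):
            return index
    return None

# Hypothetical index descriptors; the real indexes would be Solr/Lucene indexes.
indexes = [
    {"name": "news-2016", "type": "news"},
    {"name": "shop-items", "type": "product"},
    {"name": "news-archive", "type": "news"},
]
groups = group_by_type(indexes, lambda index: index["type"])
first = find_first_index(groups, "news", lambda index: index["name"].endswith("archive"))
```

Only the two `news` indexes are visited in this lookup, which is the narrowing effect the paragraph above describes.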
As shown in Fig. 2, an embodiment of the present invention provides a data searching method, which may comprise the following steps:
Step 201: determine a plurality of servers currently used for data search.
This embodiment takes 100 servers as an example.
Step 202: divide the servers into multiple cluster groups according to group routing.
In this embodiment, the servers currently used for data search can be divided into multiple cluster groups according to system requirements, and the grouping can be implemented with GroupFilter group routing. For example, the 100 servers are divided into two cluster groups: the first with 40 servers and the second with 60 servers. The first cluster group serves as the search servers for search requests whose ID is Beijing, and the second cluster group serves as the search servers for search requests whose ID is Hebei Province.
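The 40/60 split in this example can be sketched in Python as follows. This is not the GroupFilter implementation, whose interface the patent does not describe; `split_servers` and `route` are hypothetical helpers, and the server names are invented.

```python
def split_servers(servers, sizes):
    """Split a flat server list into consecutive cluster groups of the given sizes."""
    if sum(sizes.values()) != len(servers):
        raise ValueError("group sizes must add up to the number of servers")
    groups, start = {}, 0
    for region, size in sizes.items():
        groups[region] = servers[start:start + size]
        start += size
    return groups

def route(region_id, groups):
    """Return the cluster group that serves requests with this region ID."""
    if region_id not in groups:
        raise LookupError(f"no cluster group for ID {region_id!r}")
    return groups[region_id]

servers = [f"server-{i:03d}" for i in range(100)]
groups = split_servers(servers, {"Beijing": 40, "Hebei": 60})
hebei_group = route("Hebei", groups)
```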
Step 203: for each cluster group, divide the servers in the group into n clusters, each containing m servers.
The values of n and m depend mainly on the following two factors:
1. The maximum index volume and the maximum search request volume that a single server can carry.
The index capacity of a single server is determined by its disk capacity. Building a new index involves index merging, and the old index keeps serving searches in the meantime, so the disk must hold a new index, the index currently in service, and the space needed for the merge. The disk capacity must therefore be at least 3 times the index capacity.
The search request volume a single server can bear can be obtained by stress testing. When the machine load value equals the number of CPU cores of the server, the TPS (search requests per second) the server sustains is the stress-test peak. For a 4-core server, for example, the TPS at load = 4 is generally the peak that the single machine can bear.
2. The total index volume and the total search request volume.
The number of clusters n in a cluster group and the number of servers m in each cluster can then be calculated with formulas (1) and (2):
n = S/T; (1)
m = Q/R + X; (2)
where S denotes the total index volume; T denotes the maximum index volume a single server can carry; Q denotes the total search request volume; R denotes the maximum search request volume a single server can carry; and X denotes the machine increment.
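Formulas (1) and (2) and the 3× disk rule can be evaluated as in the following Python sketch. Rounding up to whole servers and clamping to a minimum of 1 are assumptions added here (the patent only states n ≥ 1 and m ≥ 1). The index figures are the 60 GB total and 12 GB per server quoted in the experiment; the request volumes are hypothetical.

```python
import math

def cluster_count(total_index, max_index_per_server):
    """n = S / T, rounded up to a whole cluster (rounding is an assumption)."""
    return max(1, math.ceil(total_index / max_index_per_server))

def servers_per_cluster(total_requests, max_requests_per_server, increment):
    """m = Q / R + X, with Q / R rounded up (rounding is an assumption)."""
    return max(1, math.ceil(total_requests / max_requests_per_server) + increment)

def min_disk_capacity(index_capacity):
    """New index + index in service + merge space: at least 3x the index capacity."""
    return 3 * index_capacity

n = cluster_count(60, 12)                 # S = 60 GB, T = 12 GB
m = servers_per_cluster(1800, 1000, 0)    # hypothetical Q = 1800 TPS, R = 1000 TPS, X = 0
disk_gb = min_disk_capacity(12)           # GB of disk needed for a 12 GB index
```

With these inputs the sketch yields 5 clusters of 2 servers each, which matches the 10-server, 5-group layout used in the experiment.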
Step 204: determine a master server and slave servers among the m servers of each cluster.
In an embodiment of the present invention, each server in a cluster publishes an awareness service to every other server in the same cluster, so that every pair of servers is aware of each other.
Every server in the cluster provides the search service. When m is not less than 2, a master server and slave servers are determined in the cluster: one server is chosen as the Master server and the remaining servers act as Slave servers. When m equals 1, the single server in the cluster acts as the Master server.
Step 205: build indexes on all servers of each cluster.
In this embodiment, three kinds of index can be built: the master disk index, the in-memory index, and the slave disk index.
The disk-plus-memory index structure ensures that newly added index data becomes visible immediately. The Master server is responsible for building master disk indexes and updates the CommitLog log file each time one is built. The Slave server synchronizes the master disk indexes and the CommitLog log file so that index data is not lost, and, once the in-memory indexes it has built reach the set threshold, flushes the earliest-built in-memory index to a slave disk index to preserve availability.
Index building uses a full dump plus real-time incremental dump model. The full dump runs periodically as a scheduled task, and the real-time incremental dump is produced by real-time interface calls. The index building process is described in detail below.
(1) Building the master disk index
An index build period is preset, for example 1 week. The Master server triggers the full dump program on a timer according to this period to build the master disk index, as follows:
Determine the data corresponding to the data source. The data source may be information captured by a web crawler or any of various storage systems (for example, databases, files, or NoSQL systems). The master disk index can be built with the iterator model through the DataProvider interface: hasNext() first checks whether a next record exists; if it does, next() fetches that record, and a Map record is built and returned. The iteration ends when hasNext() returns false. Each Map record corresponds to one index document, and each <key, value> pair of the Map corresponds to a field of the document and its value. The Map records are encapsulated as Map objects, which are used to build Solr AddUpdateCommand (add/batch update) objects. Finally, Solr's UpdateHandler.addDoc() method is called to add the documents to the index, completing the master disk index.
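The iterator-driven full dump can be sketched in Python as below. `ListDataProvider` is a hypothetical stand-in for the DataProvider interface (only the hasNext/next pair is taken from the text), and a plain list with `append` stands in for the Solr index and its add call.

```python
class ListDataProvider:
    """Hypothetical DataProvider over an in-memory list of rows."""

    def __init__(self, rows):
        self._rows = list(rows)
        self._pos = 0

    def has_next(self):
        # True while at least one more record is available.
        return self._pos < len(self._rows)

    def next(self):
        # Return the next record as a map of field name -> field value.
        row = self._rows[self._pos]
        self._pos += 1
        return dict(row)

def full_dump(provider, add_doc):
    """Pull every record from the provider and add it to the index."""
    count = 0
    while provider.has_next():
        record = provider.next()   # one map record = one index document
        add_doc(record)            # stand-in for the add-document call
        count += 1
    return count

index = []
provider = ListDataProvider([{"id": "1", "title": "a"}, {"id": "2", "title": "b"}])
built = full_dump(provider, index.append)
```

Keeping the data source behind a two-method interface means a crawler feed, a database cursor, or a file reader could each be plugged in without changing the dump loop.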
In an embodiment of the present invention, incremental real-time index data is produced while the master disk index is being built, so the incremental data generated during the full dump must be backfilled. This can be done with a time-point mechanism: a start time point, checkPoint, is recorded when the build of the master disk index begins; after the master disk index has been built, the incremental data from the period between the start time point and the current time point is flushed to the master disk index.
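The time-point mechanism can be sketched in Python as follows. The shape of the change log (a list of entries with a `ts` timestamp and a `doc`) and the injectable `clock` are assumptions made so that the example is deterministic.

```python
def full_dump_with_catch_up(snapshot, change_log, clock, add_doc):
    """Run a full dump, then backfill the changes made while it was running."""
    check_point = clock()              # start time point of the full dump
    for doc in snapshot:
        add_doc(doc)
    now = clock()                      # current time point after the dump
    # Incremental data produced between the start point and the current point.
    missed = [c["doc"] for c in change_log if check_point <= c["ts"] <= now]
    for doc in missed:
        add_doc(doc)
    return check_point, len(missed)

ticks = iter([100, 200])               # fake clock: dump starts at 100, ends at 200
index = []
change_log = [
    {"ts": 50, "doc": {"id": "old"}},      # before the dump: already in the snapshot
    {"ts": 150, "doc": {"id": "new"}},     # during the dump: must be backfilled
    {"ts": 250, "doc": {"id": "later"}},   # after the dump: handled by real-time indexing
]
check_point, backfilled = full_dump_with_catch_up(
    [{"id": "base"}], change_log, lambda: next(ticks), index.append
)
```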
To ensure that the other servers in a cluster can continue to provide the search service after the Master server crashes, the master disk index must be synchronized to the Slave servers. The process is as follows: after the Master server finishes building the master disk index, it notifies the Slave server to copy the master disk index and passes checkPoint to it. After the Slave server has finished building its copy of the master disk index, it notifies the Master server and uses that index to provide the search service.
(2) Building the real-time index
In this embodiment, building the real-time index comprises two parts: building the in-memory index and building the slave disk index. The process is as follows:
A: Building the in-memory index.
A real-time request may come from a data source (the same as for the master disk index) or from a real-time interface call by a client. Taking the interface call as an example, the client calls a real-time method of the Real-timeBean class to send the request. The real-time request methods are Add, Update, Delete, mAdd (batch add), mUpdate (batch update), and mDelete (batch delete). When the server receives a real-time request, it encapsulates it as a Document request object and appends it to the CommitLog log file.
When the Slave server starts, it creates the real-time index build process Real-timeJob, which continuously polls the CommitLog log file. As soon as a new record appears, it performs the index write, calling the corresponding method of Solr's UpdateHandler according to the request type to build the in-memory index.
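The polling loop can be illustrated with a minimal Python sketch. An append-only list stands in for the CommitLog log file, a dict stands in for the in-memory index, and the `RealtimeJob` class and its record format are hypothetical; only the single-document request types are shown.

```python
class RealtimeJob:
    """Polls a commit log and applies new records to an in-memory index."""

    def __init__(self, commit_log):
        self.commit_log = commit_log   # append-only list of (request_type, doc)
        self.offset = 0                # position of the first unread record
        self.memory_index = {}         # doc id -> document

    def poll_once(self):
        """Apply every record appended since the last poll; return how many."""
        new_records = self.commit_log[self.offset:]
        for request_type, doc in new_records:
            if request_type in ("add", "update"):
                self.memory_index[doc["id"]] = doc
            elif request_type == "delete":
                self.memory_index.pop(doc["id"], None)
            else:
                raise ValueError(f"unknown request type: {request_type}")
        self.offset += len(new_records)
        return len(new_records)

log = [("add", {"id": "1", "title": "a"}), ("update", {"id": "1", "title": "b"})]
job = RealtimeJob(log)
first_batch = job.poll_once()
title_after_update = job.memory_index["1"]["title"]
log.append(("delete", {"id": "1"}))
second_batch = job.poll_once()
```

A real job would call `poll_once` in a loop with a short sleep; tracking the read offset is what lets it pick up only the records added since the previous poll.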
B: Building the slave disk index. The in-memory index is limited by memory size, so a threshold must be set to prevent an oversized in-memory index from exhausting memory or affecting other applications. When the in-memory index exceeds the set threshold, a new in-memory index is opened; new real-time requests are written to the new in-memory index while the old one is flushed to the slave disk. Because the in-memory index and the slave disk index share the same structure, the copy method of Directory is called to copy the in-memory index directly to disk, forming a slave disk index. This improves search performance and makes new data searchable immediately.
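The roll-over behaviour can be sketched in Python as below. The threshold is assumed to be a document count, and a list of dicts stands in for the slave disk indexes; a real implementation would copy a Lucene directory to disk instead.

```python
class MemoryIndexManager:
    """Keeps one active in-memory index and rolls it over at a threshold."""

    def __init__(self, threshold):
        self.threshold = threshold     # max documents per in-memory index (assumed unit)
        self.active = {}               # current in-memory index
        self.disk_segments = []        # stand-in for slave disk indexes, oldest first

    def add(self, doc_id, doc):
        if len(self.active) >= self.threshold:
            self._roll_over()
        self.active[doc_id] = doc      # new requests go to the new in-memory index

    def _roll_over(self):
        # The old in-memory index becomes a slave disk index unchanged,
        # mirroring the "same structure, direct copy" idea in the text.
        self.disk_segments.append(self.active)
        self.active = {}

    def search(self, doc_id):
        """Look in memory first, then in disk segments from newest to oldest."""
        if doc_id in self.active:
            return self.active[doc_id]
        for segment in reversed(self.disk_segments):
            if doc_id in segment:
                return segment[doc_id]
        return None

manager = MemoryIndexManager(threshold=2)
for doc_id in ("a", "b", "c"):
    manager.add(doc_id, {"id": doc_id})
```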
Step 206: receive a real-time request sent by a user and, according to the address information of the request, determine the cluster group corresponding to it.
For example, if the address information of the real-time request is Beijing, the corresponding cluster group is the first cluster group.
Step 207: according to the current load of each cluster in that cluster group, distribute the real-time request to any one server in the target cluster with the minimum current load.
Distributing the request to a server in the least-loaded cluster achieves load balancing and improves the response efficiency for real-time requests.
Step 208: respond to the real-time request according to its type.
In this embodiment, a real-time request may be a search request, a delete request, an add request, an update request, a batch delete request, a batch add request, or a batch update request.
Each of these request types is described in detail below.
(1) Search request
The server that receives the search request does the following. For a single-group search, it searches the master disk index, the in-memory index, and the slave disk index in full according to the request, and filters deleted documents out of the search results. The docIds of deleted documents are stored in the delList set. For a multi-group search, it uses Solr's shard concept to merge the result sets of the groups: any one group in the set of groups is selected and accessed over TCP/IP; that group obtains its own index through the EmbeddedSolrServer provided by Solr, sends TCP/IP requests to the other groups to obtain their index data, and merges the results, which speeds up index retrieval.
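The single-group and multi-group search paths can be sketched in Python as follows. Dicts stand in for the three indexes of a group, a set stands in for delList, and the network hop between groups is replaced by a direct function call; none of this is Solr's actual shard API.

```python
def search_group(indexes, del_list, predicate):
    """Search every index of one group and drop documents marked as deleted."""
    hits = {}
    for index in indexes:              # master disk, in-memory, slave disk
        for doc_id, doc in index.items():
            if doc_id not in del_list and predicate(doc):
                hits[doc_id] = doc
    return hits

def search_all_groups(groups, predicate):
    """Merge the result sets of several groups into one result."""
    merged = {}
    for indexes, del_list in groups:
        merged.update(search_group(indexes, del_list, predicate))
    return merged

# Each group: ([master disk index, in-memory index, slave disk index], delList)
group_a = (
    [{"1": {"title": "solr intro"}}, {"2": {"title": "solr shards"}}, {"3": {"title": "solr old"}}],
    {"3"},                             # document 3 was mark-deleted
)
group_b = ([{"4": {"title": "lucene"}}, {}, {"5": {"title": "solr merge"}}], set())
result = search_all_groups([group_a, group_b], lambda doc: "solr" in doc["title"])
```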
(2) Add, delete, and update requests
A: Add request. The client sends an Add command, which is encapsulated as an add request and written to the CommitLog log file. When IndexBuildJob builds the index and finds that a request is an AddDocumentRequest, it extracts the index data and converts it to a Solr AddUpdateCommand object, then calls the addDoc() method of Solr's UpdateHandler to add the index data.
B: Delete request. The client sends a Delete command, and the delete request is encapsulated as a DeleteDocumentRequest and written to the CommitLog log file. In the Lucene search engine, both IndexWriter and IndexReader objects can perform deletions; the difference is that deletions made through an IndexWriter instance are buffered and do not take effect immediately. After the search engine receives a real-time delete request, if the document to be deleted is in memory, it is deleted directly with IndexWriter; otherwise it is marked as deleted in the disk index (that is, the master disk index and the slave disk index), and the document to be deleted is recorded in the set delList. When the in-memory index is committed, the commit() method of IndexWriter is called, and then the IndexReader of the disk index is used to delete the elements of the delList set one by one.
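The delete path described above (direct deletion in memory, mark deletion on disk, purge on commit) can be illustrated with a small sketch. It is a simplified model and not the Lucene API: the class RealTimeIndex and its methods are hypothetical, the indexes are dictionaries keyed by docId, and commit() models only the purge of marked documents from the disk index.

```python
from typing import Dict, List, Set


class RealTimeIndex:
    """Toy model of an in-memory index plus a disk index with a delList."""

    def __init__(self, disk_docs: Dict[int, str], memory_docs: Dict[int, str]) -> None:
        self.disk = dict(disk_docs)
        self.memory = dict(memory_docs)
        self.del_list: Set[int] = set()

    def delete(self, doc_id: int) -> None:
        if doc_id in self.memory:
            # Document is in memory: delete it directly.
            del self.memory[doc_id]
        elif doc_id in self.disk:
            # Document is on disk: only mark it; it stays on disk until commit.
            self.del_list.add(doc_id)

    def visible_ids(self) -> List[int]:
        """Ids a search would return: everything except marked documents."""
        return sorted((set(self.disk) | set(self.memory)) - self.del_list)

    def commit(self) -> None:
        # Remove the marked documents from the disk index one by one.
        for doc_id in sorted(self.del_list):
            self.disk.pop(doc_id, None)
        self.del_list.clear()
```

The point of the design is that a disk-resident document disappears from search results as soon as it is marked, while the physical deletion is deferred to commit time.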
C: Update request. As with the Add command, the Update command sent by the client is encapsulated as an UpdateDocumentRequest and written to the CommitLog log. When the IndexBuildJob builds the index, the index data information is converted into an AddUpdateCommand with AllowDups set to false, that is, duplicate index data is not allowed. Solr then first checks whether the index entry already exists; if it does, Solr deletes it and then performs the add operation, which completes the update.
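The way add, delete, and update requests pass through the CommitLog before being applied to an index can be sketched as below. This is an illustration only: the JSON-lines encoding, the function names encode_request and apply_log, and the dictionary-based index are assumptions, since the source does not specify the CommitLog file format. The update branch mirrors the described behavior of deleting an existing entry before adding, so that no duplicates remain.

```python
import json
from typing import Dict, Iterable, Optional


def encode_request(op: str, doc_id: int, fields: Optional[dict] = None) -> str:
    """Serialize one request as a single log line (hypothetical format)."""
    return json.dumps({"op": op, "id": doc_id, "fields": fields or {}})


def apply_log(lines: Iterable[str], index: Dict[int, dict]) -> Dict[int, dict]:
    """Replay log lines in order, as an index build job might."""
    for line in lines:
        request = json.loads(line)
        op, doc_id = request["op"], request["id"]
        if op == "add":
            index[doc_id] = request["fields"]
        elif op == "update":
            # Duplicates are not allowed: drop any existing entry, then add.
            index.pop(doc_id, None)
            index[doc_id] = request["fields"]
        elif op == "delete":
            index.pop(doc_id, None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return index
```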
The search model of the above embodiment of the present invention is verified by the following experiment. The experimental environment consists of 10 servers divided into 5 groups, each with one master and one standby. Each server is configured with an Intel(R) Xeon(R) 4-core CPU E5520 @ 2.27GHz, 4GB of memory, and a 60GB 7200 rpm hard disk. The total index size is 60GB, with 12GB on a single server, and the experimental data are averaged.
The prototype system implemented for the experiment is Xsolr, and the system used for comparison is Solr. The data set contains 40 million documents; each document consists of 22 fields and has an average size of 0.03KB. The test tool is LoadRunner. The experimental results are in two parts: the first group of experiments compares the performance indicators of Xsolr and Solr (see Fig. 3); the second group examines the data consistency and disaster tolerance of Xsolr (see Fig. 4).
A: Real-time response performance. According to Fig. 3, Xsolr outperforms Solr in both TPS and response time under all 4 test conditions. The response time of real-time update requests is always within 1 s. This is because Xsolr performs update operations in memory: the index is built in memory and is not merged with the disk index, and the in-memory and disk indexes are searched simultaneously, which reduces disk I/O and speeds up the display of real-time index data.
B: Load analysis. Under comparable system load, Xsolr consumes more CPU resources. Because Xsolr builds in-memory indexes, it occupies a large amount of memory in the experimental environment, up to 1GB at peak. CPU usage peaks at about 30%; at that point the index update TPS reaches 2100 and a large amount of memory is occupied, but the machine load remains below 4, which is within the acceptable range.
C: Data consistency. According to Fig. 4, during the first 15 s, add, update, and delete operations are performed on a single server, and the data on the Master and Slave servers remain consistent, with only very slight differences. During real-time operation, the Slave continuously pulls the incremental CommitLog log file from the Master machine and consumes it to create the real-time index, which ensures data consistency between the master and standby servers. However, because the CommitLog is written sequentially and copying files from the Master machine incurs some network overhead, small differences can appear; eventual consistency is reached at the millisecond level, so the impact on overall system service is minimal and within the acceptable range.
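The incremental pull that keeps the Slave eventually consistent with the Master can be sketched as follows. The classes CommitLog and Slave, the tuple layout of a log entry, and the use of a list position as the offset are all assumptions made for illustration; the source only states that the Slave pulls the incremental CommitLog and consumes it to build the real-time index.

```python
from typing import Dict, List, Optional, Tuple

LogEntry = Tuple[str, int, Optional[str]]  # (operation, docId, value)


class CommitLog:
    """Append-only, sequentially written log held by the Master."""

    def __init__(self) -> None:
        self.entries: List[LogEntry] = []

    def append(self, entry: LogEntry) -> None:
        self.entries.append(entry)

    def read_from(self, offset: int) -> List[LogEntry]:
        return self.entries[offset:]


class Slave:
    """Consumes only the entries it has not yet seen."""

    def __init__(self) -> None:
        self.offset = 0
        self.memory_index: Dict[int, Optional[str]] = {}

    def pull(self, master_log: CommitLog) -> int:
        new_entries = master_log.read_from(self.offset)
        for op, doc_id, value in new_entries:
            if op == "delete":
                self.memory_index.pop(doc_id, None)
            else:  # "add" and "update" both overwrite the entry
                self.memory_index[doc_id] = value
        self.offset += len(new_entries)
        return len(new_entries)
```

Between two calls to pull() the Slave may lag behind the Master, which corresponds to the small, short-lived differences reported in the experiment.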
D: Data disaster tolerance and integrity. At 17 s, 1000 records are randomly added on the Master server; at this point the in-memory index has not reached the threshold and is not flushed to the slave disk index. At 30 s, the Slave server resumes service, and its index record count is the same as the Master's. After 35 s, the Master server goes down; it is restarted after 60 s, and its index record count is the same as the Slave's and matches the 1000 records initially added. Xsolr persists real-time data through the CommitLog log and sets recovery points for machine failure. When a server recovers from a crash, it reads the log offset recorded by the most recent CheckPoint from the CommitLog and rebuilds the in-memory index starting at that offset, which ensures data integrity and achieves data disaster tolerance.
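Crash recovery from the most recent CheckPoint can be sketched in a few lines. The sketch assumes that entries before the checkpoint offset are already reflected in the disk indexes, so only the entries from the offset onward need to be replayed into a fresh in-memory index; the function names replay and recover and the entry layout are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

LogEntry = Tuple[str, int, Optional[str]]  # (operation, docId, value)


def replay(entries: List[LogEntry],
           index: Dict[int, Optional[str]]) -> Dict[int, Optional[str]]:
    """Apply log entries to an index in order."""
    for op, doc_id, value in entries:
        if op == "delete":
            index.pop(doc_id, None)
        else:
            index[doc_id] = value
    return index


def recover(commit_log: List[LogEntry],
            checkpoint_offset: int) -> Dict[int, Optional[str]]:
    """Rebuild the in-memory index from the checkpoint offset onward."""
    return replay(commit_log[checkpoint_offset:], {})
```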
The above experimental results show that the distributed m × n search model provided by this embodiment meets the real-time requirements of the system while guaranteeing data consistency and data disaster tolerance in a large-data-volume, high-concurrency environment, which demonstrates the feasibility of the system model.
Referring to Fig. 5, an embodiment of the present invention provides a data search system, which may include:
A first determining unit 501, configured to determine multiple servers currently used for data search;
A search model construction unit 502, configured to build a distributed m × n search model from the multiple servers, where n denotes the number of clusters included in the search model and m denotes the number of servers included in each cluster; n ≥ 1; m ≥ 1;
An index construction unit 503, configured to build indexes on the m servers of each cluster;
An allocation unit 504, configured to, when a search request is received, allocate the search request, according to the current load of each cluster, to any one server in the target cluster that has the minimum current load;
A search unit 505, configured to determine, according to the built indexes, a first index corresponding to the search request, and to provide target data according to the first index.
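The behavior of the allocation unit 504 can be sketched as a least-loaded selection followed by an arbitrary choice of server. The function name pick_server, the dictionary layout, and the use of a single number per cluster as its "current load" are assumptions for illustration; the source does not define how load is measured.

```python
import random
from typing import Dict, List, Tuple


def pick_server(clusters: Dict[str, List[str]],
                loads: Dict[str, float],
                rng=random) -> Tuple[str, str]:
    """Return (target cluster, server) for an incoming search request."""
    # The target cluster is the one with the minimum current load.
    target = min(loads, key=loads.get)
    # Any one server of the target cluster may serve the request.
    return target, rng.choice(clusters[target])
```

For example, with loads {"c1": 0.7, "c2": 0.2} the request goes to a server of cluster "c2".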
In an embodiment of the present invention, the search model construction unit 502 is specifically configured to:
Calculate the value of n by the following formula: n = S/T;
where S denotes the total index size and T denotes the maximum index size that a single server can carry;
Calculate the value of m by the following formula: m = Q/R + X;
where Q denotes the total volume of search requests, R denotes the maximum volume of search requests that a single server can carry, and X denotes the machine increment.
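The two sizing formulas can be worked through in code. The source gives n = S/T and m = Q/R + X without saying how fractional results are handled; rounding up and enforcing the lower bound of 1 are assumptions of this sketch, as are the function names. The first example uses the figures from the experiment (60GB total index, 12GB per server); the request figures in the second example are hypothetical.

```python
import math


def cluster_count(total_index_size: float, max_index_per_server: float) -> int:
    """n = S / T, rounded up (assumption) and at least 1."""
    return max(1, math.ceil(total_index_size / max_index_per_server))


def servers_per_cluster(total_requests: float,
                        max_requests_per_server: float,
                        machine_increment: int) -> int:
    """m = Q / R + X, with Q / R rounded up (assumption) and m at least 1."""
    return max(1, math.ceil(total_requests / max_requests_per_server) + machine_increment)


# S = 60 (GB), T = 12 (GB) as in the experiment: 5 clusters.
n = cluster_count(60, 12)
# Hypothetical Q = 1000, R = 1000, X = 1: 2 servers per cluster.
m = servers_per_cluster(1000, 1000, 1)
```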
In an embodiment of the present invention, referring to Fig. 6, the data search system may further include:
A second determining unit 601, configured to, when the number of the m servers in a current cluster is not less than 2, determine a master server and slave servers among the m servers of the current cluster;
The index construction unit 503 is specifically configured to: build a master disk index on the master server and update the CommitLog log file each time a master disk index is built; set a synchronization period on the slave server and, according to the set synchronization period, synchronize the master disk index built on the master server and the CommitLog log file to the slave server; on the slave server, build the master disk index with the largest search volume into an in-memory index according to the CommitLog log file; and, when the amount of in-memory index exceeds a set threshold, flush the earliest-built in-memory index to a slave disk index according to the build time of the in-memory indexes, so that the amount of in-memory index does not exceed the set threshold.
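The threshold rule for in-memory indexes (flush the earliest-built index first until the total is back under the threshold) can be sketched with a queue ordered by build time. The class MemoryIndexPool, the notion of a size per index, and the list standing in for the slave disk index are assumptions for illustration.

```python
from collections import deque
from typing import Deque, List, Tuple


class MemoryIndexPool:
    """Keeps in-memory indexes in build order and flushes the oldest first."""

    def __init__(self, threshold: int) -> None:
        self.threshold = threshold
        self.memory: Deque[Tuple[str, int]] = deque()  # (name, size), oldest first
        self.slave_disk: List[Tuple[str, int]] = []

    def total(self) -> int:
        return sum(size for _, size in self.memory)

    def add(self, name: str, size: int) -> None:
        self.memory.append((name, size))
        # Flush the earliest-built indexes until the threshold is respected.
        while self.memory and self.total() > self.threshold:
            self.slave_disk.append(self.memory.popleft())
```

Flushed indexes are kept as separate slave disk indexes and are not merged into the master disk index, which matches the description of avoiding merge overhead.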
In an embodiment of the present invention, the index construction unit 503 is specifically configured to:
Trigger the full dump procedure on the master server according to a preset index build period, and record the start time point at that moment;
Determine the data corresponding to the data source;
Each time the current item of data corresponding to the data source is obtained, judge whether a next item of data exists; if it does, use the iterator pattern and the DataProvider interface to obtain the next item of data corresponding to the data source, build a corresponding Map record, and repeat this step until the judgment result is that no next item exists;
Encapsulate the multiple Map records that were built into corresponding Map objects, use the encapsulated Map objects to build an add batch-update object, and flush the batch-update object to the master disk index;
According to the recorded start time point and the current time point, flush the incremental data in the time period from the start time point to the current time point to the master disk index.
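The full dump steps above can be sketched as an iterator-driven loop followed by an incremental catch-up. The DataProvider class below is a hypothetical stand-in for the interface named in the text (its has_next and next methods are assumptions), rows are (docId, body) pairs, and changes are (timestamp, docId, body) tuples; none of these layouts come from the source.

```python
from typing import Dict, Iterable, List, Tuple


class DataProvider:
    """Hypothetical iterator-style access to a data source."""

    def __init__(self, rows: Iterable[Tuple[int, str]]) -> None:
        self._rows = list(rows)
        self._pos = 0

    def has_next(self) -> bool:
        return self._pos < len(self._rows)

    def next(self) -> Tuple[int, str]:
        row = self._rows[self._pos]
        self._pos += 1
        return row


def full_dump(provider: DataProvider,
              changes: Iterable[Tuple[int, int, str]],
              start_time: int,
              now: int) -> Dict[int, str]:
    """Build a master disk index in full, then apply changes made meanwhile."""
    batch: List[Dict[str, object]] = []
    while provider.has_next():
        doc_id, body = provider.next()
        batch.append({"id": doc_id, "body": body})  # one Map record per item
    master_disk_index: Dict[int, str] = {}
    for record in batch:  # flush the whole batch to the master disk index
        master_disk_index[record["id"]] = record["body"]
    # Incremental data written between the recorded start time and now.
    for timestamp, doc_id, body in changes:
        if start_time <= timestamp <= now:
            master_disk_index[doc_id] = body
    return master_disk_index
```

Recording the start time before the dump is what allows writes that arrive during the dump to be picked up afterwards instead of being lost.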
In an embodiment of the present invention, referring to Fig. 7, the data search system may further include:
A real-time update unit 701, configured to, when the master server receives a real-time request for a second index, perform on the second index the operation corresponding to the real-time request and record the performed operation in the CommitLog log file; a slave server belonging to the same cluster as the master server builds the corresponding in-memory index according to the operation recorded in the synchronized CommitLog log file;
The real-time request includes: a delete request, an add request, an update request, a batch delete request, a batch add request, and a batch update request.
In summary, the embodiments of the present invention can achieve at least the following beneficial effects:
1. In the embodiments of the present invention, a distributed m × n search model is built and indexes are built in every cluster of the search model; when a search request is received, it is allocated to any one server in the cluster with the minimum current load. This achieves load balancing, reduces the pressure on the cluster servers, and thereby improves data search efficiency.
2. In the embodiments of the present invention, the multiple servers are grouped according to routing rules, so that when a search request is received, the cluster that will serve it can be determined from the address information of the search request, which reduces search time and improves search efficiency.
3. In the embodiments of the present invention, a multi-index mechanism combining in-memory and disk indexes is used. The master disk index is built periodically in full, which ensures data integrity. Real-time information is first written to memory incrementally; once the in-memory index exceeds the set threshold, it is copied to disk to form a slave disk index, which is not merged with the master disk index, reducing the time cost of index merging. The in-memory and disk indexes are searched simultaneously, which improves search response time.
4. In the embodiments of the present invention, the problem of disaster tolerance for index data in a distributed environment is solved: a CommitLog log mechanism is introduced to persist index metadata, and the Master/Slave (master/standby) model of Solr is improved, ensuring the consistency and availability of data.
Details such as the information exchange between the units of the above apparatus and their execution processes are based on the same concept as the method embodiments of the present invention; refer to the description of the method embodiments for the specifics, which are not repeated here.
It should be noted that in this article, the relational terms of such as first and second etc is used merely to an entity
Or operation separates with another entity or operating space, and not necessarily require or imply existence between these entities or operation
The relation of any this reality or order.And, term " includes ", " comprising " or its any other variant are intended to non-
Comprising of exclusiveness, so that include that the process of a series of key element, method, article or equipment not only include those key elements,
But also include other key elements being not expressly set out, or also include being consolidated by this process, method, article or equipment
Some key elements.In the case of there is no more restriction, statement the key element " including " and limiting, do not arrange
Except there is also other same factor in including the process of described key element, method, article or equipment.
A person of ordinary skill in the art will understand that all or some of the steps of the above method embodiments may be carried out by hardware controlled by program instructions. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments. The storage medium includes various media capable of storing program code, such as ROM, RAM, magnetic disks, or optical discs.
Finally, it should be noted that the foregoing describes only preferred embodiments of the present invention; it serves merely to illustrate the technical solutions of the present invention and is not intended to limit its scope of protection. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention falls within the scope of protection of the present invention.
Claims (10)
1. A data search method, characterized by comprising:
determining multiple servers currently used for data search;
building a distributed m × n search model from the multiple servers, where n denotes the number of clusters included in the search model and m denotes the number of servers included in each cluster; n ≥ 1; m ≥ 1;
building indexes on the m servers of each cluster;
when a search request is received, allocating the search request, according to the current load of each cluster, to any one server in the target cluster that has the minimum current load;
determining, according to the built indexes, a first index corresponding to the search request, and providing target data according to the first index.
2. The method according to claim 1, characterized in that building the distributed m × n search model from the multiple servers comprises:
calculating the value of n by the following formula: n = S/T;
where S denotes the total index size and T denotes the maximum index size that a single server can carry;
calculating the value of m by the following formula: m = Q/R + X;
where Q denotes the total volume of search requests, R denotes the maximum volume of search requests that a single server can carry, and X denotes the machine increment.
3. The method according to claim 1, characterized in that
the method further comprises: when the number of the m servers in a current cluster is not less than 2, determining a master server and slave servers among the m servers of the current cluster;
building indexes on the m servers of each cluster comprises: building a master disk index on the master server, and updating the CommitLog log file each time a master disk index is built; setting a synchronization period on the slave server and, according to the set synchronization period, synchronizing the master disk index built on the master server and the CommitLog log file to the slave server; on the slave server, building the master disk index with the largest search volume into an in-memory index according to the CommitLog log file; and, when the amount of in-memory index exceeds a set threshold, flushing the earliest-built in-memory index to a slave disk index according to the build time of the in-memory indexes, so that the amount of in-memory index does not exceed the set threshold.
4. The method according to claim 3, characterized in that
building the master disk index on the master server comprises:
triggering the full dump procedure on the master server according to a preset index build period, and recording the start time point at that moment;
determining the data corresponding to the data source;
each time the current item of data corresponding to the data source is obtained, judging whether a next item of data exists; if it does, using the iterator pattern and the DataProvider interface to obtain the next item of data corresponding to the data source, building a corresponding Map record, and repeating this step until the judgment result is that no next item exists;
encapsulating the multiple Map records that were built into corresponding Map objects, using the encapsulated Map objects to build an add batch-update object, and flushing the batch-update object to the master disk index;
according to the recorded start time point and the current time point, flushing the incremental data in the time period from the start time point to the current time point to the master disk index.
5. The method according to any one of claims 1-4, characterized by further comprising: when the master server receives a real-time request for a second index, performing on the second index the operation corresponding to the real-time request, and recording the performed operation in the CommitLog log file; a slave server belonging to the same cluster as the master server builds the corresponding in-memory index according to the operation recorded in the synchronized CommitLog log file;
the real-time request includes: a delete request, an add request, an update request, a batch delete request, a batch add request, and a batch update request.
6. A data search system, characterized by comprising:
a first determining unit, configured to determine multiple servers currently used for data search;
a search model construction unit, configured to build a distributed m × n search model from the multiple servers, where n denotes the number of clusters included in the search model and m denotes the number of servers included in each cluster; n ≥ 1; m ≥ 1;
an index construction unit, configured to build indexes on the m servers of each cluster;
an allocation unit, configured to, when a search request is received, allocate the search request, according to the current load of each cluster, to any one server in the target cluster that has the minimum current load;
a search unit, configured to determine, according to the built indexes, a first index corresponding to the search request, and to provide target data according to the first index.
7. The data search system according to claim 6, characterized in that the search model construction unit is specifically configured to:
calculate the value of n by the following formula: n = S/T;
where S denotes the total index size and T denotes the maximum index size that a single server can carry;
calculate the value of m by the following formula: m = Q/R + X;
where Q denotes the total volume of search requests, R denotes the maximum volume of search requests that a single server can carry, and X denotes the machine increment.
8. The data search system according to claim 6, characterized in that
the system further comprises: a second determining unit, configured to, when the number of the m servers in a current cluster is not less than 2, determine a master server and slave servers among the m servers of the current cluster;
the index construction unit is specifically configured to: build a master disk index on the master server and update the CommitLog log file each time a master disk index is built; set a synchronization period on the slave server and, according to the set synchronization period, synchronize the master disk index built on the master server and the CommitLog log file to the slave server; on the slave server, build the master disk index with the largest search volume into an in-memory index according to the CommitLog log file; and, when the amount of in-memory index exceeds a set threshold, flush the earliest-built in-memory index to a slave disk index according to the build time of the in-memory indexes, so that the amount of in-memory index does not exceed the set threshold.
9. The data search system according to claim 8, characterized in that
the index construction unit is specifically configured to:
trigger the full dump procedure on the master server according to a preset index build period, and record the start time point at that moment;
determine the data corresponding to the data source;
each time the current item of data corresponding to the data source is obtained, judge whether a next item of data exists; if it does, use the iterator pattern and the DataProvider interface to obtain the next item of data corresponding to the data source, build a corresponding Map record, and repeat this step until the judgment result is that no next item exists;
encapsulate the multiple Map records that were built into corresponding Map objects, use the encapsulated Map objects to build an add batch-update object, and flush the batch-update object to the master disk index;
according to the recorded start time point and the current time point, flush the incremental data in the time period from the start time point to the current time point to the master disk index.
10. The data search system according to any one of claims 6-9, characterized by further comprising:
a real-time update unit, configured to, when the master server receives a real-time request for a second index, perform on the second index the operation corresponding to the real-time request and record the performed operation in the CommitLog log file; a slave server belonging to the same cluster as the master server builds the corresponding in-memory index according to the operation recorded in the synchronized CommitLog log file;
the real-time request includes: a delete request, an add request, an update request, a batch delete request, a batch add request, and a batch update request.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610362426.0A CN106055622A (en) | 2016-05-26 | 2016-05-26 | Data searching method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106055622A true CN106055622A (en) | 2016-10-26 |
Family
ID=57174847
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610362426.0A Pending CN106055622A (en) | 2016-05-26 | 2016-05-26 | Data searching method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106055622A (en) |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106649870A (en) * | 2017-01-03 | 2017-05-10 | 山东浪潮商用系统有限公司 | Distributed implementation method for search engine |
CN108573063A (en) * | 2018-04-27 | 2018-09-25 | 宁波银行股份有限公司 | A kind of data query method and system |
CN108763578A (en) * | 2018-06-07 | 2018-11-06 | 腾讯科技(深圳)有限公司 | A kind of newer method of index file and server |
CN110019080A (en) * | 2017-07-14 | 2019-07-16 | 北京京东尚科信息技术有限公司 | Data access method and device |
CN110609844A (en) * | 2018-05-29 | 2019-12-24 | 优信拍(北京)信息科技有限公司 | Data updating method, device and system |
CN111209462A (en) * | 2020-01-02 | 2020-05-29 | 北京字节跳动网络技术有限公司 | Data processing method, device and equipment |
CN111400555A (en) * | 2020-03-05 | 2020-07-10 | 湖南大学 | Graph data query task processing method and device, computer equipment and storage medium |
CN112507187A (en) * | 2020-11-11 | 2021-03-16 | 贝壳技术有限公司 | Index changing method and device |
CN113824776A (en) * | 2021-09-02 | 2021-12-21 | 济南浪潮数据技术有限公司 | Automatic network request distribution method and system |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101071442A (en) * | 2007-06-26 | 2007-11-14 | 腾讯科技(深圳)有限公司 | Distributed indesx file searching method, searching system and searching server |
CN103729386A (en) * | 2012-10-16 | 2014-04-16 | 阿里巴巴集团控股有限公司 | Information query system and method |
CN103984745A (en) * | 2014-05-23 | 2014-08-13 | 何震宇 | Distributed video vertical searching method and system |
CN105468720A (en) * | 2015-11-20 | 2016-04-06 | 北京锐安科技有限公司 | Method for integrating distributed data processing systems, corresponding systems and data processing method |
-
2016
- 2016-05-26 CN CN201610362426.0A patent/CN106055622A/en active Pending
Non-Patent Citations (2)
Title |
---|
傅巍玮 等: "基于Solr的分布式实时搜索模型研究与实现", 《电信科学》 * |
傅巍玮: "分布式实时垂直搜索引擎研究与实现", 《中国优秀硕士学位论文全文数据库 信息科技辑》 * |
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106649870A (en) * | 2017-01-03 | 2017-05-10 | 山东浪潮商用系统有限公司 | Distributed implementation method for search engine |
CN110019080A (en) * | 2017-07-14 | 2019-07-16 | 北京京东尚科信息技术有限公司 | Data access method and device |
CN110019080B (en) * | 2017-07-14 | 2021-11-12 | 北京京东尚科信息技术有限公司 | Data access method and device |
CN108573063A (en) * | 2018-04-27 | 2018-09-25 | 宁波银行股份有限公司 | A kind of data query method and system |
CN110609844A (en) * | 2018-05-29 | 2019-12-24 | 优信拍(北京)信息科技有限公司 | Data updating method, device and system |
CN110609844B (en) * | 2018-05-29 | 2022-05-13 | 优信拍(北京)信息科技有限公司 | Data updating method, device and system |
CN108763578A (en) * | 2018-06-07 | 2018-11-06 | 腾讯科技(深圳)有限公司 | A kind of newer method of index file and server |
CN111209462A (en) * | 2020-01-02 | 2020-05-29 | 北京字节跳动网络技术有限公司 | Data processing method, device and equipment |
CN111400555A (en) * | 2020-03-05 | 2020-07-10 | 湖南大学 | Graph data query task processing method and device, computer equipment and storage medium |
CN111400555B (en) * | 2020-03-05 | 2023-09-26 | 湖南大学 | Graph data query task processing method and device, computer equipment and storage medium |
CN112507187A (en) * | 2020-11-11 | 2021-03-16 | 贝壳技术有限公司 | Index changing method and device |
CN113824776A (en) * | 2021-09-02 | 2021-12-21 | 济南浪潮数据技术有限公司 | Automatic network request distribution method and system |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106055622A (en) | Data searching method and system | |
Zhang et al. | An efficient multi-dimensional index for cloud data management | |
CN108600321A (en) | A kind of diagram data storage method and system based on distributed memory cloud | |
CN106649870A (en) | Distributed implementation method for search engine | |
Ma et al. | Query processing of massive trajectory data based on mapreduce | |
CN111460023A (en) | Service data processing method, device, equipment and storage medium based on elastic search | |
CN103246612B (en) | A kind of method of data buffer storage and device | |
Ju et al. | iGraph: an incremental data processing system for dynamic graph | |
CN101169785A (en) | Clustered database system dynamic loading balancing method | |
CN102968498A (en) | Method and device for processing data | |
CN105468473A (en) | Data migration method and data migration apparatus | |
CN105069149A (en) | Structured line data-oriented distributed parallel data importing method | |
CN108073696B (en) | GIS application method based on distributed memory database | |
CN104239377A (en) | Platform-crossing data retrieval method and device | |
US11080207B2 (en) | Caching framework for big-data engines in the cloud | |
CN113220795B (en) | Data processing method, device, equipment and medium based on distributed storage | |
CN104951464B (en) | Date storage method and system | |
CN112162846B (en) | Transaction processing method, device and computer readable storage medium | |
CN109408590A (en) | Expansion method, device, equipment and the storage medium of distributed data base | |
Sethia et al. | A multi-agent simulation framework on small Hadoop cluster | |
CN105677761A (en) | Data sharding method and system | |
CN106569896A (en) | Data distribution and parallel processing method and system | |
Oruganti et al. | Exploring Hadoop as a platform for distributed association rule mining | |
CN110083306A (en) | A kind of distributed objects storage system and storage method | |
CN105975345A (en) | Video frame data dynamic equilibrium memory management method based on distributed memory |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20161026 |
RJ01 | Rejection of invention patent application after publication |