CN102169507B - Implementation method of distributed real-time search engine - Google Patents


Info

Publication number
CN102169507B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN 201110137785
Other languages
Chinese (zh)
Other versions
CN102169507A (en
Inventor
程行荣
季刚
陈青溪
时宜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen Yaxon Networks Co Ltd
Original Assignee
Xiamen Yaxon Networks Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen Yaxon Networks Co Ltd filed Critical Xiamen Yaxon Networks Co Ltd
Priority to CN 201110137785 priority Critical patent/CN102169507B/en
Publication of CN102169507A publication Critical patent/CN102169507A/en
Application granted granted Critical
Publication of CN102169507B publication Critical patent/CN102169507B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of search engines, and specifically to a distributed real-time search engine. The method for constructing and operating the search engine system comprises at least the following steps: A, designing the functional structure of the system; B, designing the data index structure of the system; C, creating an index; D, updating the index; and E, searching the index. The distributed real-time search engine builds an updating index and a combining index simultaneously in system memory and accesses both when searching the index. When the number of documents in the updating index accumulates to a threshold value, the updating index is submitted to a disk index and becomes the combining index, and the original combining index becomes the new updating index. Updated data can therefore be searched, and the real-time availability of the data retrieved by the search engine is improved.

Description

Implementation method of a distributed real-time search engine
Technical field
The present invention relates to the field of search engine technology, and in particular to an implementation method of a distributed real-time search engine.
Background art
With the arrival of the knowledge-economy era, information on the Internet is growing explosively. What people face at present is not a lack of information but a flood of it that they have no means of filtering. Obtaining the needed information accurately, quickly, and in time is therefore the problem a search engine has to solve.
A search engine is a system that, following a certain strategy, uses specific computer programs to collect information from a particular network such as the Internet, organizes and processes that information, provides a retrieval service to users, and presents the information relevant to a user's query to that user.
Traditional search engines such as Google, Baidu, and Yahoo process huge data volumes, reaching the TB level, but their data comes mainly from conventional websites such as portals, forums, and e-government sites. Such sites are updated infrequently and each update is small, so their information processing places no high real-time requirement on the search engine.
With the rise of social media such as microblogs and social networking sites, "micro-information" created by netizens has appeared in large quantities, producing massive real-time data. In addition, with the rapid development of enterprise mobile applications such as mobile CRM systems and handheld terminals, users place higher demands on query speed and on the real-time availability of information, and traditional search engines cannot meet the demands of processing massive real-time data and searching it in real time. Massive real-time data is characterized by a high update frequency, large updates, and a large accumulated volume, usually hundreds of GB and sometimes TB or even PB. A real-time search engine must meet very high requirements on the timeliness of both massive data processing and query response. When the data volume reaches the TB level, update frequency and query response speed are strongly at odds: when both the accumulated data and the updates are large, building and maintaining the index takes so long that real-time behavior cannot be guaranteed. Specifically, when an existing search engine scheme uses an incremental index mechanism, index construction and retrieval are carried out separately. The index construction logic submits a new segment to the index shard for use by the retrieval logic only after the number of documents accumulated in the new segment reaches a threshold (such as 10000) or the elapsed interval reaches a threshold (such as 5 minutes). There is therefore a delay between the submission of a document and the moment it becomes retrievable, usually ranging from a few minutes to tens of minutes, and such a long delay is intolerable in real-time retrieval.
Summary of the invention
In view of the deficiencies of the prior art, the present invention proposes a distributed real-time search engine that overcomes the conflict between the incremental index mechanism and index timeliness through the cooperation of an updating index and a combining index held in system memory with a disk index.
The technical solution adopted by the present invention is as follows:
An implementation method of a distributed real-time search engine, whose system construction and operation comprise at least the following steps:
A. Design the functional structure of the system. The functional structure is created in a Master/Slave-based cluster system and comprises the following functional nodes: a central control node, index data storage nodes, and external service nodes. The central control node is created in the Master system, and the index data storage nodes and external service nodes are created in the Slave system. The central control node stores and maintains the attribute information of the indexes in the data index structure and the attribute information of the index data storage nodes. The index data storage nodes create, update, and search the index shards of the data index structure. The external service nodes receive index creation, update, and retrieval requests and forward them to the central control node for processing;
B. Design the data index structure of the system. From top to bottom, the tree hierarchy of the index structure consists of: index, index shard, segment, document, and field. A system may contain multiple indexes. An index shard is a data block obtained by partitioning an index, and the index shards belonging to the same index are stored on the index data storage nodes. An index shard consists of one or more segments, and a segment consists of one or more documents; the documents in a segment may be of different data object types. Each document has a key that identifies it uniquely across the whole system, and the structure of a document comprises fields that describe the document type;
C. Creating an index comprises the following steps:
C1. After an external service node receives an index creation request, it forwards the request to the central control node. The central control node parses the request, extracts the attribute information of the index to be created, and verifies whether this attribute information is complete and valid. If it is complete and valid, step C2 is carried out; if it is incomplete or invalid, a failure response is sent to the external service node;
C2. The central control node divides the index to be created into a number of shards according to the shard count in the attribute information obtained in step C1. At the same time, based on the attribute information of the index data storage nodes held in the central control node, it judges the state and load of each index data storage node, decides on that basis on which index data storage node each index shard is to be stored and created, and then sends the attribute information of the index to be created to each corresponding index data storage node. Each index data storage node, according to the attribute information it receives, builds locally the index shards of the index to be created that the central control node assigned to it. If an index data storage node fails to create an index shard, the central control node assigns the creation of that shard to another index data storage node that is in good condition and relatively lightly loaded. This continues until all index shards of the index to be created have either been created on the index data storage nodes or failed to be created, and then step C3 is carried out;
C3. If in step C2 all index shards of the index to be created were created on the index data storage nodes, the central control node updates the index data storage node attribute information that it stores and sends the external service node a response message indicating that the index shards were created successfully; if in step C2 all index shards of the index to be created failed to be created on the index data storage nodes, it sends the external service node an index-creation-failure response;
D. Updating an index comprises the following steps:
D1. After an external service node receives an index update request, it forwards the request to the central control node. Based on the index attribute information and the index data storage node attribute information that it stores, the central control node sends the update request to the index data storage node that holds the relevant index shard of the index;
D2. According to the update request received, the index data storage node stores the updated documents in a new segment on the index shard of the index to be updated that resides on that node. If the updated documents are stored successfully, the old documents corresponding to them are marked as deleted in the new segment, and an update-success message is returned to the central control node; if storing fails, an update-failure message is returned to the central control node. Finally, the central control node sends the update-success or update-failure message to the external service node;
The index update of step D further comprises a document deletion step: when the index update request is only a delete-document command, the document is marked as deleted in a new segment on the shard, on the index data storage node, where the document to be deleted resides;
The index update of step D further comprises a step of building a real-time index: an updating index and a combining index are built simultaneously in system memory, and index retrieval accesses both the updating index and the combining index. While an index update is in progress, the index being updated is the updating index. When the number of documents in the updating index reaches a threshold, or the time over which the updating index has been updated reaches a threshold, the system submits the updating index to the disk index, after which the updating index becomes the combining index and, at the same time, the previous combining index becomes the updating index;
E. Searching an index comprises the following steps:
E1. After an external service node receives an index retrieval request, it sends the request to the central control node. The central control node parses the request and determines its target index, then, using the index data storage node attribute information and the attribute information of the target index, looks up all index shards of the target index and dispatches the retrieval request to the index data storage nodes that store each shard;
E2. Each index data storage node searches the corresponding index shards it stores for relevant documents according to the request received, sorts the results, and sends them to the external service node;
E3. The external service node integrates and sorts the results received from each index data storage node and sends them to the client.
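The retrieval flow of steps E1 to E3 is a scatter-gather pattern, which the following minimal Python sketch illustrates. The function names `search_shard` and `search_index`, the in-memory dictionaries standing in for index shards, and the use of term frequency as the ranking score are all illustrative assumptions; the patent does not specify a ranking function or an API.

```python
import heapq


def search_shard(shard_docs, term):
    """Step E2 (sketch): search one shard and return hits sorted best-first.

    shard_docs maps document key -> text. The score is plain term frequency,
    which is an assumption made only for this example.
    """
    hits = []
    for key, text in shard_docs.items():
        score = text.split().count(term)
        if score > 0:
            hits.append((score, key))
    return sorted(hits, reverse=True)


def search_index(shards, term, limit=10):
    """Steps E1 and E3 (sketch): query every shard, then merge the sorted lists."""
    partial_results = [search_shard(shard, term) for shard in shards]
    # Each partial list is already sorted in descending order, so a k-way
    # merge yields one globally sorted result stream.
    merged = heapq.merge(*partial_results, reverse=True)
    return [key for _, key in merged][:limit]


shards = [
    {"d1": "fast search engine", "d2": "slow boat"},
    {"d3": "search search index"},
]
print(search_index(shards, "search"))
```

Because every storage node returns an already sorted list, the external service node only needs a merge rather than a full re-sort.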
Further, the functional structure of the system in step A also comprises a standby central control node. The central control node synchronously backs up the data it stores to the standby central control node in real time. When the central control node fails, the standby central control node becomes the central control node, and when the former central control node recovers from the failure, it becomes the new standby central control node.
Further, the index data storage nodes and external service nodes periodically send the central control node heartbeat signals that carry their status information. If the central control node does not receive a heartbeat signal within a preset time, it marks the index data storage node or external service node concerned as dead. At the same time, for every index shard stored on an index data storage node marked as dead, the central control node copies one more copy, from the copies of that shard stored on other index data storage nodes, to some other index data storage node that does not yet store a copy of that shard, so that the number of copies of each index shard remains unchanged and the index shards are available at all times.
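The heartbeat check described above can be sketched in a few lines of Python. The class name `HeartbeatMonitor`, the `timeout_s` parameter, and passing the clock value explicitly as `now` are assumptions made for illustration; a real central control node would read its own clock and would also trigger the re-replication of shards.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat per node and reports nodes that timed out."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s  # preset time without heartbeat (assumed unit: seconds)
        self.last_seen = {}         # node id -> time of the last heartbeat

    def heartbeat(self, node_id, now):
        """Record a heartbeat received from node_id at time `now`."""
        self.last_seen[node_id] = now

    def dead_nodes(self, now):
        """Return the nodes whose last heartbeat is older than the timeout."""
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )


monitor = HeartbeatMonitor(timeout_s=10)
monitor.heartbeat("storage-1", now=0)
monitor.heartbeat("storage-2", now=0)
monitor.heartbeat("storage-1", now=8)
print(monitor.dead_nodes(now=12))
```

In this run only `storage-2` has gone more than ten seconds without a heartbeat, so it is the only node reported as dead.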
Further, the heartbeat signal that an index data storage node sends to the central control node contains the load information of that node. During index creation, the central control node assigns index shards to lightly loaded index data storage nodes as far as possible; likewise, during index retrieval, the central control node submits retrieval requests, as far as possible, to the lightly loaded index data storage node that holds the index shard or a replica of it.
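A minimal Python sketch of this load-aware choice follows. The function name `pick_node`, the numeric load values, and the tie-break on node name are assumptions; the patent does not define how load is measured.

```python
def pick_node(holders, node_loads):
    """Return the least-loaded node among those holding a shard or its replica.

    holders: node ids that store the shard or one of its replicas.
    node_loads: node id -> load value reported in the latest heartbeat.
    Ties are broken by node id so that the choice is deterministic.
    """
    return min(holders, key=lambda node: (node_loads[node], node))


loads = {"storage-1": 0.7, "storage-2": 0.1, "storage-3": 0.2}
# storage-2 is the idlest node overall, but it does not hold this shard.
print(pick_node(["storage-1", "storage-3"], loads))
```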
Further, the index data storage node attribute information comprises: the node ID, node name, node type, node state, node load, and node location. The index attribute information comprises: the index name, the structure definition of the documents in the index, the number of shards of the index, the number of replicas of each index shard, and the IDs of the storage nodes that hold the index shards and index shard replicas.
Further, in the data index structure of the system in step B, each index shard also has multiple index shard replicas. A replica is created during the index creation of step C, is updated asynchronously after the original index shard has been updated during the index update of step D, and is stored on a different index data storage node from the original index shard. The index data storage node holding the original index shard is responsible for processing update requests for that shard; once the original shard has been updated, that node asynchronously sends the update request to the index data storage nodes holding the corresponding replicas, which then update the replicas. Both the index shard replicas and the corresponding original index shard support retrieval: based on the load of the index data storage nodes holding the original shard and its replicas, the central control node submits each retrieval request to the lightly loaded node holding the shard or a replica.
Further, the central control node regularly checks the number of index shard replicas of every index. When the number of replicas of an index shard falls below the preset number, the system automatically copies a replica of that shard to another data node. When the index data storage node storing an original index shard fails, the system selects one of the corresponding replicas to take over the index update work of the original shard; this replica becomes the new original index shard, and a new replica is then generated on another index data storage node so that the number of replicas of the shard remains unchanged. When an index data storage node storing an index shard replica fails, the system generates a copy identical to the original index shard on another index data storage node, again keeping the number of replicas unchanged.
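The two failure cases above (loss of an original shard and loss of a replica) can be sketched as follows. The function name `handle_node_failure`, the dictionary layout with `primary` and `replicas` keys, and the rule of promoting the first listed replica are assumptions made for this example; only placement metadata is modeled, not the actual data copy.

```python
def handle_node_failure(shards, failed, live_nodes):
    """Repair shard placement after `failed` goes down.

    shards: shard id -> {"primary": node, "replicas": [nodes]}.
    live_nodes: nodes that are still available.
    """
    for info in shards.values():
        if info["primary"] == failed:
            if not info["replicas"]:
                continue  # no replica to promote; the shard stays unavailable
            # Promote a replica so it takes over the update work.
            info["primary"] = info["replicas"].pop(0)
        elif failed in info["replicas"]:
            info["replicas"].remove(failed)
        else:
            continue  # this shard was not stored on the failed node
        # Restore the replica count on a live node that holds no copy yet.
        holders = {info["primary"], *info["replicas"]}
        candidates = [n for n in live_nodes if n not in holders and n != failed]
        if candidates:
            info["replicas"].append(candidates[0])
    return shards


placement = {
    0: {"primary": "n1", "replicas": ["n2"]},
    1: {"primary": "n2", "replicas": ["n1"]},
}
handle_node_failure(placement, failed="n1", live_nodes=["n2", "n3"])
print(placement)
```

After `n1` fails, shard 0 is served by the promoted copy on `n2`, and both shards receive a fresh replica on `n3`, so each shard keeps one replica.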
Further, the index shards and index shard replicas of the same index are created and stored on the index data storage nodes according to the following strategy: based on the node load information in the attribute information of the index data storage nodes, the central control node assigns index shards and replicas to the most lightly loaded index data storage nodes. When the number of available index data storage nodes is smaller than the number of index shards, the central control node assigns multiple index shards to the same index data storage node and does not allocate any replicas of the index shards. When the number of available index data storage nodes is greater than the number of index shards, the central control node assigns replicas of some or all of the index shards to the remaining index data storage nodes.
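One possible reading of this allocation strategy is sketched below in Python. The function name `allocate`, the assumption that each placed shard adds one unit of load, and the choice to give at most one replica per shard are illustrative; the patent leaves these details open.

```python
def allocate(shard_count, node_loads):
    """Assign shards, then replicas, to index data storage nodes.

    node_loads: node id -> current load. Returns (primaries, replicas),
    each mapping shard number -> node id.
    """
    loads = dict(node_loads)
    primaries = {}
    for shard in range(shard_count):
        # Always choose the currently lightest node (ties broken by name).
        node = min(loads, key=lambda n: (loads[n], n))
        primaries[shard] = node
        loads[node] += 1  # assumed cost of hosting one shard

    replicas = {}
    if len(loads) > shard_count:
        # More nodes than shards: nodes without a shard receive replicas of
        # some or all shards. With fewer nodes, no replicas are allocated.
        used = set(primaries.values())
        spare = sorted(n for n in loads if n not in used)
        for shard, node in zip(range(shard_count), spare):
            replicas[shard] = node
    return primaries, replicas


print(allocate(4, {"n1": 0, "n2": 0}))
print(allocate(2, {"n1": 0, "n2": 0, "n3": 0}))
```

With four shards and two nodes, each node hosts two shards and no replicas exist; with two shards and three nodes, the spare node receives a replica of one shard only.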
Further, the index update of step D also comprises a segment merging step: when the number of segments in an index shard of the index being updated reaches a threshold, or the time since the last index merge reaches a threshold, the index data storage node holding that shard reads the documents in several of the smaller segments, stores them in one new segment, and then physically deletes those smaller segments.
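The segment merging step can be illustrated with a short Python sketch. The function name `merge_small_segments`, the parameters `segment_threshold` and `merge_count`, and the representation of a segment as a list of documents are assumptions; the time-based trigger is omitted for brevity.

```python
def merge_small_segments(segments, segment_threshold=4, merge_count=3):
    """Merge the smallest segments of a shard once there are too many.

    segments: list of segments, each a list of documents.
    Returns a new list in which the `merge_count` smallest segments have
    been replaced by one new segment holding all of their documents.
    """
    if len(segments) < segment_threshold:
        return segments  # threshold not reached; nothing to merge

    # Indices of the smallest segments (stable for equal sizes).
    by_size = sorted(range(len(segments)), key=lambda i: len(segments[i]))
    small = set(by_size[:merge_count])

    # Read the documents of the small segments into one new segment.
    merged = [doc for i in sorted(small) for doc in segments[i]]
    # Dropping the small segments models their physical deletion.
    kept = [seg for i, seg in enumerate(segments) if i not in small]
    return kept + [merged]


segments = [["a", "b", "c", "d"], ["e"], ["f", "g"], ["h"]]
print(merge_small_segments(segments))
```

Here the three smallest segments are combined into a single four-document segment, so the shard goes from four segments to two.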
Further, in step D an updated document is stored on an index shard as follows: the hash value of the key of the updated document is computed, this hash value is taken modulo the number of shards of the index to which the document belongs, and the document is finally assigned to, and stored on, the index shard whose number equals the result.
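The hash-modulo routing rule can be sketched in Python as follows. The function name `shard_for` and the use of MD5 as the hash function are assumptions; the patent does not name a hash function. A digest-based hash is used here because Python's built-in `hash()` for strings varies between interpreter runs.

```python
import hashlib


def shard_for(doc_key, shard_count):
    """Return the shard number for a document key: hash(key) mod shard_count."""
    digest = hashlib.md5(doc_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % shard_count


for key in ("doc-1", "doc-2", "doc-3"):
    print(key, "->", shard_for(key, shard_count=4))
```

Because the mapping depends only on the key and the shard count, every update of the same document reaches the same shard, which is what allows the old copy to be marked as deleted there.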
Further, the different data object types of the documents in step B comprise: text data objects, image data objects, audio data objects, video data objects, and executable program data objects. The attribute information of each data object type is stored in the field structure of the document.
By adopting the above technical solution, the present invention has the following beneficial effects:
1. An updating index and a combining index are built simultaneously in system memory, and retrieval accesses both. Once the number of documents in the updating index reaches the threshold, the updating index is submitted to the disk index and becomes the combining index, while the original combining index becomes the new updating index. This ensures that updated data can also be retrieved and improves the real-time availability of the data the search engine retrieves;
2. The central control node, standby central control node, external service nodes, and index data storage nodes of the system are created in a Master/Slave-based cluster system, which is highly fault tolerant, is suitable for deployment on inexpensive machines, and can provide high-throughput data access;
3. Creating index shard replicas for the index shards stored on the index data storage nodes strengthens the fault tolerance of the system.
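The dual in-memory index of effect 1 can be illustrated with a minimal Python sketch. The class name `RealTimeIndex`, the dictionary-based indexes, and the whitespace tokenizer are illustrative assumptions. The time threshold is omitted, and clearing the former combining index when it takes over as the updating index is also an assumption, made on the grounds that its contents were committed to the disk index in the previous cycle.

```python
class RealTimeIndex:
    """Two in-memory indexes that swap roles, plus a stand-in disk index."""

    def __init__(self, doc_threshold=3):
        self.doc_threshold = doc_threshold
        self.updating_index = {}   # receives newly added documents
        self.combining_index = {}  # last committed batch, still searchable
        self.disk_index = {}       # stand-in for the on-disk index

    def add(self, key, text):
        """Add a document; it is searchable immediately."""
        self.updating_index[key] = text
        if len(self.updating_index) >= self.doc_threshold:
            self._commit_and_swap()

    def _commit_and_swap(self):
        # Submit the updating index to the disk index.
        self.disk_index.update(self.updating_index)
        # Swap roles: the updating index becomes the combining index and
        # the previous combining index becomes the new updating index.
        self.updating_index, self.combining_index = (
            self.combining_index, self.updating_index)
        # Assumption: the old combining index is already on disk, so it
        # starts its new role empty.
        self.updating_index.clear()

    def search(self, term):
        """Search both in-memory indexes and the disk index."""
        hits = set()
        for index in (self.updating_index, self.combining_index, self.disk_index):
            hits.update(k for k, text in index.items() if term in text.split())
        return sorted(hits)


idx = RealTimeIndex(doc_threshold=3)
idx.add("d1", "real time search")
print(idx.search("real"))  # d1 is found before anything reaches the disk index
```

The point of the sketch is that a document is visible to `search` as soon as `add` returns, without waiting for the threshold-driven commit.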
Description of drawings
Fig. 1 is a schematic diagram of the functional structure of one embodiment of the present invention.
Fig. 2 is a schematic diagram of the data index structure of the present invention.
Fig. 3 is a schematic diagram of an embodiment of the storage strategy for index shards and index shard replicas of the present invention.
Embodiment
The present invention is now further described with reference to the drawings and embodiments.
An implementation method of a distributed real-time search engine, whose system construction and operation consist of the following steps:
Step A: Design the functional structure of the system, as shown in Fig. 1. The functional structure is created in a Master/Slave-based cluster system and comprises the following functional nodes: a central control node, index data storage nodes, and external service nodes. The central control node is created in the Master system, and the index data storage nodes and external service nodes are created in the Slave system. The central control node is the master node of the system; it stores and maintains the attribute information of the indexes in the data index structure and the attribute information of the index data storage nodes. The index data storage nodes are the data nodes of the system; they create, update, and search at the index shard level of the data index structure. The external service nodes are the client nodes of the system; they receive index creation, update, and retrieval requests and forward them to the central control node for processing;
Step B: Design the data index structure of the system, as shown in Fig. 2. From top to bottom, the tree hierarchy of the index structure consists of: index, index shard, segment, document, and field. A system may contain multiple indexes. An index shard is a data block obtained by partitioning an index, and the index shards belonging to the same index are stored on the index data storage nodes. An index shard consists of one or more segments, and a segment consists of one or more documents; the documents in a segment may be of different data object types. Each document has a key that identifies it uniquely across the whole system, and the structure of a document comprises fields that describe the different attributes of the document. An index is a collection of data objects of various kinds for which retrieval is supported, and its index shards are stored in a distributed manner across the index data storage nodes of the system, which improves the retrieval efficiency of the system;
Step C: Creating an index consists of the following steps:
C1. After an external service node receives an index creation request, it forwards the request to the central control node. The central control node parses the request, extracts the attribute information of the index to be created, and verifies whether this attribute information is complete and valid. If it is complete and valid, step C2 is carried out; if it is incomplete or invalid, a failure response is sent to the external service node;
C2. The central control node divides the index to be created into a number of shards according to the shard count in the attribute information obtained in step C1. At the same time, based on the attribute information of the index data storage nodes held in the central control node, it judges the state and load of each index data storage node, decides on that basis on which index data storage node each index shard is to be stored and created, and then sends the attribute information of the index to be created to each corresponding index data storage node. Each index data storage node, according to the attribute information it receives, builds locally the index shards of the index to be created that the central control node assigned to it. If an index data storage node fails to create an index shard, the central control node assigns the creation of that shard to another index data storage node that is in good condition and relatively lightly loaded. This continues until all index shards of the index to be created have either been created on the index data storage nodes or failed to be created, and then step C3 is carried out;
C3. If in step C2 all index shards of the index to be created were created on the index data storage nodes, the central control node updates the index data storage node attribute information that it stores and sends the external service node a response message indicating that the index shards were created successfully; if in step C2 all index shards of the index to be created failed to be created on the index data storage nodes, it sends the external service node an index-creation-failure response;
Step D: Updating an index consists of the following steps:
D1. After an external service node receives an index update request, it forwards the request to the central control node. Based on the index attribute information and the index data storage node attribute information that it stores, the central control node sends the update request to the index data storage node that holds the relevant index shard of the index;
D2. According to the update request received, the index data storage node stores the updated documents in a new segment on the index shard of the index to be updated that resides on that node. If the updated documents are stored successfully, the old documents corresponding to them are marked as deleted in the new segment, and an update-success message is returned to the central control node; if storing fails, an update-failure message is returned to the central control node. Finally, the central control node sends the update-success or update-failure message to the external service node;
The index update of step D further comprises a document deletion step: when the index update request is only a delete-document command, the document is marked as deleted in a new segment on the shard, on the index data storage node, where the document to be deleted resides;
The index update of step D further comprises a step of building a real-time index: an updating index and a combining index are built simultaneously in system memory, and index retrieval accesses both the updating index and the combining index. While an index update is in progress, the index being updated is the updating index. When the number of documents in the updating index reaches a threshold, or the time over which the updating index has been updated reaches a threshold, the system submits the updating index to the disk index, after which the updating index becomes the combining index and, at the same time, the previous combining index becomes the updating index;
Step E: Searching an index consists of the following steps:
E1. After an external service node receives an index retrieval request, it sends the request to the central control node. The central control node parses the request and determines its target index, then, using the index data storage node attribute information and the attribute information of the target index, looks up all index shards of the target index and dispatches the retrieval request to the index data storage nodes that store each shard;
E2. Each index data storage node searches the corresponding index shards it stores for relevant documents according to the request received, sorts the results, and sends them to the external service node;
E3. The external service node integrates and sorts the results received from each index data storage node and sends them to the client.
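The segment-level bookkeeping of step D2 and of the document deletion step can be sketched in Python as follows. The class names `Segment` and `Shard`, the method names, and the dictionary representation of a document's fields are assumptions made for illustration; persistence and failure handling are left out.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    docs: dict = field(default_factory=dict)   # key -> document fields
    deleted: set = field(default_factory=set)  # keys of older copies marked deleted


@dataclass
class Shard:
    segments: list = field(default_factory=list)  # ordered oldest to newest

    def update(self, docs):
        """Store updated documents in a new segment and mark old copies deleted."""
        new_segment = Segment()
        for key, fields in docs.items():
            new_segment.docs[key] = fields
            if any(key in seg.docs for seg in self.segments):
                new_segment.deleted.add(key)  # an older copy exists
        self.segments.append(new_segment)

    def delete(self, keys):
        """Delete-only request: record deletion marks in a new segment."""
        self.segments.append(Segment(deleted=set(keys)))

    def live_docs(self):
        """Replay the segments in order to obtain the visible documents."""
        live = {}
        for seg in self.segments:
            for key in seg.deleted:
                live.pop(key, None)  # marks apply to copies in older segments
            live.update(seg.docs)
        return live


shard = Shard()
shard.update({"k1": {"title": "v1"}})
shard.update({"k1": {"title": "v2"}, "k2": {"title": "x"}})
shard.delete(["k2"])
print(shard.live_docs())
```

Old copies are never rewritten in place: each update or delete only appends a segment, which is why a later segment merge is needed to reclaim space.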
In a preferred embodiment, the functional structure of the system in step A also comprises a standby central control node. The central control node synchronously backs up the data it stores to the standby central control node in real time. When the central control node fails, the standby central control node becomes the central control node, and when the former central control node recovers from the failure, it becomes the new standby central control node. Because the central control node is the master node of the system, its failure would paralyze the whole system; adding a standby central control node therefore enables failover of the central control node and improves the fault tolerance of the system.
In a preferred embodiment, the index data storage nodes and external service nodes periodically send the central control node heartbeat signals that carry their status information. If the central control node does not receive a heartbeat signal within a preset time, it marks the index data storage node or external service node concerned as dead. At the same time, for every index shard stored on an index data storage node marked as dead, the central control node copies one more copy, from the copies of that shard stored on other index data storage nodes, to some other index data storage node that does not yet store a copy of that shard, so that the number of copies of each index shard remains unchanged and the index shards are available at all times.
In a preferred embodiment, the heartbeat signal that an index data storage node sends to the central control node contains the load information of that node. During index creation, the central control node assigns index shards to lightly loaded index data storage nodes as far as possible; likewise, during index retrieval, the central control node submits retrieval requests, as far as possible, to the lightly loaded index data storage node that holds the index shard or a replica of it.
In a preferred embodiment, the index data storage node attribute information comprises: the node ID, node name, node type, node state, node load, and node location. The index attribute information comprises: the index name, the structure definition of the documents in the index, the number of shards of the index, the number of replicas of each index shard, and the IDs of the storage nodes that hold the index shards and index shard replicas. The index data storage node attribute information and the index attribute information are the metadata of the system. This metadata is stored on the central control node, and the central control node, index data storage nodes, and external service nodes can derive the position of each index shard in the cluster from it.
In a preferred embodiment, in the data index structure of the system of step B, each index burst also has multiple index burst copies. An index burst copy is created during the index creation of step C, is updated asynchronously after the original index burst has been updated during the index update of step D, and is stored on a different index datastore node from the original index burst. The index datastore node holding the original index burst handles the update requests for that index burst. Once the original index burst has been updated, that node asynchronously sends the update request to the index datastore nodes holding the corresponding index burst copies, which then update the copies. Both the original index burst and its copies support index retrieval: according to the load of the index datastore nodes holding them, the center control node submits an index retrieval request to the lightly loaded node holding either the original index burst or one of its copies.
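The update path described here, original first and copies afterwards, can be illustrated with a small in-process sketch. It assumes a simple queue of pending updates in place of the network messages a real cluster would use, and `BurstReplicaSet` is a hypothetical name.

```python
from collections import deque


class BurstReplicaSet:
    """One original index burst plus its copies, modelled as plain dicts."""

    def __init__(self, copy_count):
        self.primary = {}
        self.copies = [{} for _ in range(copy_count)]
        self.pending = deque()    # updates applied to the original but not yet to copies

    def update(self, key, doc):
        # The node holding the original index burst applies the update first ...
        self.primary[key] = doc
        # ... and only queues it for the copies; the caller does not wait for them.
        self.pending.append((key, doc))

    def flush_to_copies(self):
        # Stands in for the asynchronous forwarding to the nodes holding the copies.
        while self.pending:
            key, doc = self.pending.popleft()
            for copy in self.copies:
                copy[key] = doc


rs = BurstReplicaSet(copy_count=2)
rs.update("doc-1", {"title": "a"})
stale = [dict(c) for c in rs.copies]      # state of the copies before forwarding
rs.flush_to_copies()
```

Between `update` and `flush_to_copies` a retrieval served by a copy would not yet see the new document, which is the usual trade-off of asynchronous copy updates.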
Further, the center control node regularly checks the number of index burst copies of every index. When the number of copies of an index burst falls below the preset number, the system automatically replicates that index burst to other index datastore nodes. When the index datastore node storing an original index burst fails, the system selects one of the corresponding index burst copies to take over the index update work of the original index burst; this copy becomes the new original index burst, and a new index burst copy is then generated on another index datastore node so that the number of copies of the index burst remains unchanged. When an index datastore node storing an index burst copy fails, the system generates a copy identical to the original index burst on another index datastore node, again keeping the number of copies unchanged.
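The two failure cases above can be sketched for a single index burst as follows. `BurstPlacement` and `handle_node_failure` are hypothetical names, and promoting the first listed copy is an assumption, since the text does not say how the taking-over copy is chosen.

```python
from dataclasses import dataclass, field


@dataclass
class BurstPlacement:
    primary: str                              # node holding the original index burst
    copies: list = field(default_factory=list)


def handle_node_failure(p, failed, spare_nodes):
    """Repair one burst's placement in place after node `failed` goes down.

    Returns the node chosen to host the replacement copy, or None if the
    burst was not on the failed node or no suitable node exists.
    """
    if failed == p.primary:
        if not p.copies:
            raise RuntimeError("no copy left to promote")
        p.primary = p.copies.pop(0)           # a copy takes over the update work
    elif failed in p.copies:
        p.copies.remove(failed)
    else:
        return None
    used = {p.primary, *p.copies, failed}
    for node in spare_nodes:
        if node not in used:
            p.copies.append(node)             # restores the original copy count
            return node
    return None


p = BurstPlacement(primary="n1", copies=["n2"])
new_copy_node = handle_node_failure(p, "n1", ["n1", "n2", "n3"])

q = BurstPlacement(primary="n1", copies=["n2"])
q_copy_node = handle_node_failure(q, "n2", ["n1", "n2", "n3"])
```

In the first call the copy on `n2` becomes the new original and `n3` receives a fresh copy; in the second only the lost copy is replaced.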
Further, the index bursts and index burst copies of one index are created and stored on the index datastore nodes according to the following strategy. Using the node load information in the index datastore node attribute information, the center control node assigns the index bursts and index burst copies to the most lightly loaded index datastore nodes. When the number of available index datastore nodes is smaller than the number of index bursts, the center control node assigns several index bursts to the same index datastore node and does not allocate any index burst copies. When the number of available index datastore nodes is larger than the number of index bursts, the center control node assigns index burst copies of some or all of the index bursts to the remaining index datastore nodes. Figure 3 illustrates this strategy for an index that has 2 index bursts and 1 copy per index burst. With 1 index datastore node in the system, index burst 1 and index burst 2 are both stored on index datastore node 1 and neither has a copy, because a copy contributes to the availability and reliability of the system only when it is stored on a different node from the original index burst. With 2 index datastore nodes, index burst 1 and index burst 2 are stored on index datastore node 1 and their copies 1' and 2' are stored on index datastore node 2; index datastore node 2 can provide the same service as index datastore node 1, so adding index datastore nodes extends the service capacity of the system. With 4 index datastore nodes, index burst 1, index burst 2, index burst copy 1' and index burst copy 2' are each stored on a separate one of the 4 index datastore nodes.
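One way to read this strategy is as a greedy assignment to the least loaded node, with a copy never placed on a node that already holds the same index burst. The sketch below follows that reading; it is an assumption rather than the patent's exact procedure, the function name `allocate` is hypothetical, and load is approximated by the number of bursts already placed on a node.

```python
def allocate(num_bursts, copies_per_burst, nodes):
    """Return {node: [labels]}; "1" is index burst 1 and "1'" is its copy."""
    load = {n: 0 for n in nodes}              # assumed load: bursts placed so far
    placement = {n: [] for n in nodes}
    holders = {b: set() for b in range(1, num_bursts + 1)}

    def place(burst, label):
        # A node that already holds this burst must not receive another copy of it.
        candidates = [n for n in nodes if n not in holders[burst]]
        if not candidates:
            return
        target = min(candidates, key=lambda n: load[n])   # first node wins ties
        placement[target].append(label)
        holders[burst].add(target)
        load[target] += 1

    for b in range(1, num_bursts + 1):
        place(b, str(b))
    # With fewer nodes than index bursts, no copies are allocated.
    if len(nodes) >= num_bursts:
        for c in range(1, copies_per_burst + 1):
            for b in range(1, num_bursts + 1):
                place(b, str(b) + "'" * c)
    return placement


one_node = allocate(2, 1, ["node1"])
two_nodes = allocate(2, 1, ["node1", "node2"])
four_nodes = allocate(2, 1, ["node1", "node2", "node3", "node4"])
```

For 1 and 4 nodes this yields the layouts described for Figure 3. For 2 nodes the greedy variant spreads the two original index bursts over both nodes instead of keeping them together on node 1, but each node still ends up holding the data of both index bursts, so either node can serve the whole index.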
In a preferred embodiment, the index update of step D further comprises a section merging step. When the number of sections in an index burst of the index being updated reaches a threshold, or when the time elapsed since the last index merge reaches a threshold, the index datastore node holding that index burst reads the documents in several of the smaller sections, stores them in one new section, and then physically deletes those smaller sections. New sections are produced continually while an index is being built, and too many sections in an index burst reduce the efficiency of the index retrieval logic. This step therefore merges several small sections into one large section and discards the data marked as deleted, which optimizes the storage space of the index and reduces the number of sections that the index retrieval logic must operate on at once, thereby improving its retrieval efficiency.
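The merge trigger and the merge itself can be sketched as below. The `Section` layout is an assumption: each section is modelled as a dict of documents plus a set of keys marked as deleted, which simplifies how deletion marks are actually stored, and the function names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Section:
    docs: dict = field(default_factory=dict)     # document key -> document
    deleted: set = field(default_factory=set)    # keys marked as deleted (assumed layout)


def should_merge(section_count, max_sections, now, last_merge, max_interval_s):
    """Either threshold from the text triggers a merge."""
    return section_count >= max_sections or now - last_merge >= max_interval_s


def merge_smallest(sections, how_many):
    """Merge the `how_many` smallest sections into one new section.

    Documents marked as deleted are dropped; the merged-away sections are
    left out of the returned list, which models their physical deletion.
    """
    by_size = sorted(range(len(sections)), key=lambda i: len(sections[i].docs))
    chosen = set(by_size[:how_many])
    merged = Section()
    for i in sorted(chosen):                      # oldest first, so newer versions win
        for key, doc in sections[i].docs.items():
            if key not in sections[i].deleted:
                merged.docs[key] = doc
    return [s for i, s in enumerate(sections) if i not in chosen] + [merged]


sections = [
    Section(docs={"a": 1, "b": 2, "c": 3}),
    Section(docs={"d": 4}, deleted={"d"}),
    Section(docs={"e": 5}),
]
if should_merge(len(sections), max_sections=3, now=100.0, last_merge=0.0, max_interval_s=3600.0):
    sections = merge_smallest(sections, how_many=2)
```

After the merge the index burst holds two sections instead of three, and the deleted document `d` no longer occupies any storage.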
In a preferred embodiment, the index burst on which an updated document of step D is stored is determined as follows: the hash value of the key of the updated document is computed, this hash value is taken modulo the number of index bursts of the index to which the document belongs, and the document is assigned to and stored on the index burst whose number corresponds to the result.
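A short sketch of this hash-and-modulo routing follows. The patent does not name a hash function; CRC-32 from Python's standard `zlib` module is used here only as a stable stand-in (Python's built-in `hash()` for strings varies between processes), and `burst_for_key` is a hypothetical name.

```python
import zlib


def burst_for_key(key: str, burst_count: int) -> int:
    """Map a document key to an index burst number in the range 0..burst_count-1."""
    if burst_count <= 0:
        raise ValueError("burst_count must be positive")
    # CRC-32 is an assumed choice; any hash that is stable across nodes would do.
    return zlib.crc32(key.encode("utf-8")) % burst_count


target = burst_for_key("doc-001", 4)
again = burst_for_key("doc-001", 4)   # the same key always maps to the same burst
```

Because the result depends on the number of index bursts, changing that number would remap most keys, so this scheme fits an index whose burst count is fixed at creation.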
In a preferred embodiment, the data object types of the documents of step B are: text data objects, image data objects, audio data objects, video data objects and executable program data objects. The attribute information of each data object type is stored in the territories of the document; that is, the territory structure of a document is used to store the attribute information of the document. For example, a text document may include the following information: file name, keywords, author, file size, category, file description, and so on. An audio document may include the following information: file name, bit rate (bps), file size, duration, author or artist name, song title, genre, album name, and so on.
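As an illustration, documents of different data object types can share one structure in which the type-specific attributes live in a mapping of territories. The `Document` class, the key values and all attribute values below are made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    key: str                                   # unique across the whole system
    data_type: str                             # "text", "image", "audio", "video", ...
    territories: dict = field(default_factory=dict)


text_doc = Document(
    key="doc-001",
    data_type="text",
    territories={
        "file name": "report.txt",
        "keywords": ["search", "index"],
        "author": "example author",
        "file size": 2048,
        "category": "technical",
        "file description": "illustrative values only",
    },
)

audio_doc = Document(
    key="doc-002",
    data_type="audio",
    territories={
        "file name": "song.mp3",
        "bit rate (bps)": 128000,
        "file size": 4194304,
        "duration": 262,
        "artist name": "example artist",
        "song title": "example title",
        "genre": "example genre",
        "album name": "example album",
    },
)

# Both documents could sit in the same section despite their different types.
section_docs = {d.key: d for d in (text_doc, audio_doc)}
```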
Although the present invention has been specifically shown and described with reference to preferred embodiments, those skilled in the art should understand that various changes in form and detail may be made to the invention without departing from the spirit and scope of the invention as defined by the appended claims, and that such changes fall within the protection scope of the invention.

Claims (10)

1. An implementation method of a distributed real-time search engine, wherein the construction and operation of its system comprise at least the following steps:
A. designing the functional structure of the system, wherein the functional structure is created in a Master/Slave-based cluster system and comprises the following functional nodes: a center control node, index datastore nodes and external service nodes; the center control node is created on the Master system, and the index datastore nodes and external service nodes are created on the Slave systems; the center control node is used to store and maintain the attribute information of the indexes in the data index structure and the attribute information of the index datastore nodes; the index datastore nodes are used to create, update and retrieve the index bursts of the data index structure; and the external service nodes are used to receive index creation, update and retrieval requests and to forward these requests to the center control node for processing;
B. designing the data index structure of the system, wherein the index structure is a tree whose levels, from top to bottom, are: index, index burst, section, document and territory; a system may contain multiple indexes; an index burst is a data block obtained by dividing an index, and the index bursts belonging to the same index are stored on the index datastore nodes; an index burst consists of one or more sections; a section consists of one or more documents, and the documents contained in a section may be of different data object types; each document has a key that uniquely identifies it throughout the system; and the structure of a document comprises territories that describe the document type;
C. creating an index, comprising the following steps:
C1. after receiving an index creation request, an external service node forwards the request to the center control node; the center control node parses the index creation request, extracts from it the attribute information of the index to be created, and verifies whether this attribute information is complete and valid; if the attribute information is complete and valid, step C2 is performed; if the attribute information is incomplete or invalid, a failure response is sent to the external service node;
C2. the center control node divides the index to be created into a number of index bursts according to the index burst number in the attribute information extracted in step C1; at the same time, according to the index datastore node attribute information stored in the center control node, it judges the state and load of each index datastore node, determines on that basis on which index datastore node each index burst is to be created and stored, and then sends the attribute information of the index to be created to each corresponding index datastore node; each index datastore node, according to the attribute information it receives, builds on itself the index burst of the index to be created that the center control node assigned to it; if an index datastore node fails to create its index burst, the center control node reassigns the creation of that index burst to another index datastore node that is in good state and relatively lightly loaded; once all index bursts of the index to be created have either been created on the index datastore nodes or have failed to be created, step C3 is performed;
C3. if in step C2 all index bursts of the index to be created were created on the index datastore nodes, the center control node updates the index datastore node attribute information stored in it and sends a response indicating that the index bursts were created successfully to the external service node; if in step C2 the creation of the index bursts of the index to be created on the index datastore nodes failed, a response indicating that index creation failed is sent to the external service node;
D. updating an index, comprising the following steps:
D1. after receiving an index update request, an external service node forwards the request to the center control node; according to the index attribute information and index datastore node attribute information stored in it, the center control node sends the index update request to the index datastore node holding the relevant index burst of the index;
D2. according to the index update request it receives, the index datastore node stores the updated document in a new section of the index burst of the index to be updated that resides on that node; if the updated document is stored successfully, the old document corresponding to the updated document is marked as deleted in the new section, and an index update success message is returned to the center control node; if storing the updated document fails, an index update failure message is returned to the center control node; finally, the center control node sends the index update success or failure message to the external service node;
the index update of step D further comprises a document deletion step: when the index update request is only a command to delete a document, the document is marked as deleted in a new section on the index burst, on the index datastore node, where the document to be deleted resides;
the index update of step D further comprises a step of building a real-time index: an updating index and a combining index are built simultaneously in the memory of the system, and index retrieval accesses the updating index and the combining index at the same time; when an index update is performed, the index being updated is the updating index; when the number of documents in the updating index reaches a threshold, or the update time of the updating index reaches a threshold, the system submits the updating index to the disk index, after which the updating index becomes the combining index and, at the same time, the former combining index becomes the new updating index;
E. retrieving from an index, comprising the following steps:
E1. after receiving an index retrieval request, an external service node sends it to the center control node; the center control node parses the retrieval request and determines its target index, then, according to the index datastore node attribute information and the attribute information of the target index, looks up all index bursts of the target index and dispatches the retrieval request to the index datastore nodes storing each index burst;
E2. according to the retrieval request it receives, each index datastore node retrieves the relevant documents from the corresponding index burst that it stores, sorts the retrieval results, and sends them to the external service node;
E3. the external service node integrates and sorts the retrieval results received from the index datastore nodes and then sends them to the client.
2. The implementation method of a distributed real-time search engine as claimed in claim 1, characterized in that: the functional structure of the system of step A further comprises a standby center control node; the center control node backs up the data it stores to the standby center control node synchronously in real time; when the center control node fails, the standby center control node becomes the center control node; and when the former center control node recovers from the failure, it becomes the new standby center control node.
3. The implementation method of a distributed real-time search engine as claimed in claim 1, characterized in that: the index datastore nodes and external service nodes periodically send heartbeat signals describing their status to the center control node; if the center control node does not receive a heartbeat signal from a node within a preset time, it marks that index datastore node or external service node as dead; at the same time, for every index burst stored on an index datastore node marked as dead, the center control node copies that index burst, from a copy stored on another index datastore node, to some other index datastore node that does not yet store a copy of it, so that the number of copies of the index burst remains unchanged and every index burst is available at all times; the heartbeat signal that an index datastore node sends to the center control node includes the load information of that index datastore node; during index creation, the center control node assigns index bursts to lightly loaded index datastore nodes wherever possible; likewise, during index retrieval, the center control node submits retrieval requests, wherever possible, to the lightly loaded index datastore node holding the index burst or a copy of it.
4. The implementation method of a distributed real-time search engine as claimed in claim 1, characterized in that: the index datastore node attribute information comprises the ID, name, type, state, load and position of the node; and the index attribute information comprises the name of the index, the structure definition of the documents in the index, the number of index bursts of the index, the number of copies of each index burst, and the IDs of the nodes storing each index burst and each index burst copy.
5. The implementation method of a distributed real-time search engine as claimed in claim 1, characterized in that: in the data index structure of the system of step B, each index burst also has multiple index burst copies; an index burst copy is created during the index creation of step C, is updated asynchronously after the original index burst has been updated during the index update of step D, and is stored on a different index datastore node from the original index burst; the index datastore node holding the original index burst handles the update requests for that index burst; once the original index burst has been updated, that node asynchronously sends the update request to the index datastore nodes holding the corresponding index burst copies, which then update the copies; both the original index burst and its copies support index retrieval, and, according to the load of the index datastore nodes holding them, the center control node submits an index retrieval request to the lightly loaded node holding either the original index burst or one of its copies.
6. The implementation method of a distributed real-time search engine as claimed in claim 5, characterized in that: the center control node regularly checks the number of index burst copies of every index; when the number of copies of an index burst falls below the preset number, the system automatically replicates that index burst to other index datastore nodes; when the index datastore node storing an original index burst fails, the system selects one of the corresponding index burst copies to take over the index update work of the original index burst, this copy becomes the new original index burst, and a new index burst copy is then generated on another index datastore node so that the number of copies of the index burst remains unchanged; when an index datastore node storing an index burst copy fails, the system generates a copy identical to the original index burst on another index datastore node so that the number of copies of the index burst remains unchanged.
7. The implementation method of a distributed real-time search engine as claimed in claim 5, characterized in that: the index bursts and index burst copies of one index are created and stored on the index datastore nodes according to the following strategy: using the node load information in the index datastore node attribute information, the center control node assigns the index bursts and index burst copies to the most lightly loaded index datastore nodes; when the number of available index datastore nodes is smaller than the number of index bursts, the center control node assigns several index bursts to the same index datastore node and does not allocate any index burst copies; when the number of available index datastore nodes is larger than the number of index bursts, the center control node assigns index burst copies of some or all of the index bursts to the remaining index datastore nodes.
8. The implementation method of a distributed search engine as claimed in claim 1, characterized in that: the index update of step D further comprises a section merging step: when the number of sections in an index burst of the index being updated reaches a threshold, or when the time elapsed since the last index merge reaches a threshold, the index datastore node holding that index burst reads the documents in several of the smaller sections, stores them in one new section, and then physically deletes those smaller sections.
9. The implementation method of a distributed real-time search engine as claimed in claim 1, characterized in that: the index burst on which an updated document of step D is stored is determined by computing the hash value of the key of the updated document, taking this hash value modulo the number of index bursts of the index to which the document belongs, and assigning the document to the index burst whose number corresponds to the result, where it is stored.
10. The implementation method of a distributed real-time search engine as claimed in claim 1, characterized in that: the data object types of the documents of step B comprise: text data objects, image data objects, audio data objects, video data objects and executable program data objects; and the attribute information of each data object type is stored in the territories of the document.
CN 201110137785 2011-05-26 2011-05-26 Implementation method of distributed real-time search engine Active CN102169507B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201110137785 CN102169507B (en) 2011-05-26 2011-05-26 Implementation method of distributed real-time search engine

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 201110137785 CN102169507B (en) 2011-05-26 2011-05-26 Implementation method of distributed real-time search engine

Publications (2)

Publication Number Publication Date
CN102169507A CN102169507A (en) 2011-08-31
CN102169507B true CN102169507B (en) 2013-03-20

Family

ID=44490669

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201110137785 Active CN102169507B (en) 2011-05-26 2011-05-26 Implementation method of distributed real-time search engine

Country Status (1)

Country Link
CN (1) CN102169507B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106649804A (en) * 2016-12-29 2017-05-10 深圳市优必选科技有限公司 Data processing method, data processing device and data processing system for data query server

Families Citing this family (51)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102394922A (en) * 2011-10-27 2012-03-28 上海文广互动电视有限公司 Distributed cluster file system and file access method thereof
CN102523480A (en) * 2011-12-08 2012-06-27 成都东方盛行电子有限责任公司 Recording system and method based on active-standby and cache technology
CN103309903A (en) * 2012-03-16 2013-09-18 刘龙 Position search system and method based on cloud computing
CN102779185B (en) * 2012-06-29 2014-11-12 浙江大学 High-availability distribution type full-text index method
CN103685429B (en) * 2012-09-25 2017-09-26 阿里巴巴集团控股有限公司 A kind of method and apparatus of information displaying
CN103106233A (en) * 2012-11-02 2013-05-15 北京邮电大学 Asynchronous index and read-write method of massive files applied to search engine
US20140156668A1 (en) * 2012-12-04 2014-06-05 Linkedin Corporation Apparatus and method for indexing electronic content
CN102984762B (en) * 2012-12-12 2016-05-25 中国联合网络通信集团有限公司 IMS function assigning method and device
CN103914483B (en) * 2013-01-07 2018-09-25 深圳市腾讯计算机系统有限公司 File memory method, device and file reading, device
CN103067525B (en) * 2013-01-18 2015-11-25 广东工业大学 A kind of cloud storing data backup method of feature based code
CN103198108B (en) * 2013-03-27 2016-08-10 新浪网技术(中国)有限公司 A kind of index data update method, retrieval server and system
CN103258036A (en) * 2013-05-15 2013-08-21 广州一呼百应网络技术有限公司 Distributed real-time search engine based on p2p
CN103310023A (en) * 2013-07-05 2013-09-18 深圳中兴网信科技有限公司 Distributed searching system and method
CN104298692B (en) * 2013-07-19 2017-11-24 深圳中兴网信科技有限公司 A kind of method and system of distributed search
CN103488687A (en) * 2013-09-02 2014-01-01 用友软件股份有限公司 Searching system and searching method of big data
CN104239377A (en) * 2013-11-12 2014-12-24 新华瑞德(北京)网络科技有限公司 Platform-crossing data retrieval method and device
CN103699648A (en) * 2013-12-26 2014-04-02 成都市卓睿科技有限公司 Tree-form data structure used for quick retrieval and implementation method of tree-form data structure
CN104092735A (en) * 2014-06-23 2014-10-08 吕志雪 Cloud computing data access method and system based on binary tree
CN104252537B (en) * 2014-09-18 2019-05-21 彩讯科技股份有限公司 Index sharding method based on mail features
CN104361009B (en) * 2014-10-11 2017-10-31 北京中搜网络技术股份有限公司 A kind of real time indexing method based on inverted index
CN104820693B (en) * 2015-04-28 2018-07-24 广东小天才科技有限公司 A kind of method and device of data search
CN105045684B (en) * 2015-07-16 2018-06-15 北京京东尚科信息技术有限公司 Index switching and the method and device of index control
CN105138669A (en) * 2015-09-07 2015-12-09 天脉聚源(北京)传媒科技有限公司 Method and device for combining incremental indexes with general indexes
CN105373835B (en) * 2015-10-14 2021-07-02 国网湖北省电力公司 Link information management method based on structure tree model
CN106598990B (en) * 2015-10-16 2020-06-19 卓望数码技术(深圳)有限公司 Searching method and system
CN105843933B (en) * 2016-03-30 2019-01-29 电子科技大学 The index establishing method of distributed memory columnar database
US9934092B2 (en) * 2016-07-12 2018-04-03 International Business Machines Corporation Manipulating a distributed agreement protocol to identify a desired set of storage units
CN106294721B (en) * 2016-08-08 2020-05-19 无锡天脉聚源传媒科技有限公司 Cluster data counting and exporting methods and devices
CN108509438B (en) * 2017-02-24 2021-08-31 南京烽火星空通信发展有限公司 ElasticSearch fragment expansion method
CN108694188B (en) * 2017-04-07 2023-05-12 腾讯科技(深圳)有限公司 Index data updating method and related device
CN107133350A (en) * 2017-05-25 2017-09-05 努比亚技术有限公司 Data-updating method, mobile terminal and storage medium based on search engine
CN107220347B (en) * 2017-05-27 2020-07-03 国家计算机网络与信息安全管理中心 Custom relevance ranking algorithm based on Lucene support expression
CN109002448B (en) * 2017-06-07 2020-12-08 中国移动通信集团甘肃有限公司 Report statistical method, device and system
CN109120885B (en) * 2017-06-26 2021-01-05 杭州海康威视数字技术股份有限公司 Video data acquisition method and device
CN107436923A (en) * 2017-07-07 2017-12-05 北京奇虎科技有限公司 A kind of method and apparatus of the search index in big data cluster
JP7065956B2 (en) * 2017-10-23 2022-05-12 シーメンス アクチエンゲゼルシヤフト Methods and control systems for controlling and / or monitoring equipment
CN108804502A (en) * 2018-04-09 2018-11-13 中国平安人寿保险股份有限公司 Big data inquiry system, method, computer equipment and storage medium
CN108681592B (en) * 2018-05-15 2021-05-25 北京三快在线科技有限公司 Index switching method, device and system and index switching central control device
CN110609844B (en) * 2018-05-29 2022-05-13 优信拍(北京)信息科技有限公司 Data updating method, device and system
CN108959640B (en) * 2018-07-26 2021-02-12 浙江数链科技有限公司 ES index rapid construction method and device
CN109086409B (en) * 2018-08-02 2021-10-08 泰康保险集团股份有限公司 Microservice data processing method and device, electronic equipment and computer readable medium
CN109767247A (en) * 2019-01-15 2019-05-17 武汉费米坊科技有限公司 A kind of distribution commodity traceability system and source tracing method
CN109726264B (en) * 2019-01-16 2022-02-25 北京百度网讯科技有限公司 Method, apparatus, device and medium for index information update
CN110209910B (en) * 2019-05-20 2021-06-04 无线生活(杭州)信息科技有限公司 Index switching scheduling method and scheduling device
CN110175151A (en) * 2019-05-22 2019-08-27 中国农业科学院农业信息研究所 A kind of processing method, device, equipment and the storage medium of agricultural big data
CN110704453B (en) * 2019-10-15 2022-05-06 腾讯音乐娱乐科技(深圳)有限公司 Data query method and device, storage medium and electronic equipment
CN111324767A (en) * 2020-02-17 2020-06-23 厦门快商通科技股份有限公司 Distributed audio fingerprint engine system
CN111611222A (en) * 2020-04-27 2020-09-01 上海鼎茂信息技术有限公司 Data dynamic processing method based on distributed storage
CN112527210A (en) * 2020-12-22 2021-03-19 南京中兴力维软件有限公司 Storage method and device of full data and computer readable storage medium
CN113535730A (en) * 2021-07-21 2021-10-22 挂号网(杭州)科技有限公司 Index updating method and system for search engine, electronic equipment and storage medium
CN114020986B (en) * 2022-01-05 2022-04-26 深圳思谋信息科技有限公司 Content retrieval system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101677328A (en) * 2008-09-19 2010-03-24 中兴通讯股份有限公司 Content-fragment based multimedia distributing system and content-fragment based multimedia distributing method
CN101727460A (en) * 2008-10-31 2010-06-09 中兴通讯股份有限公司 Method and system for positioning content fragment
CN101853283A (en) * 2010-05-21 2010-10-06 南京邮电大学 Construction method for multidimensional data-oriented semantic indexing peer-to-peer network

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8903810B2 (en) * 2005-12-05 2014-12-02 Collarity, Inc. Techniques for ranking search results

Also Published As

Publication number Publication date
CN102169507A (en) 2011-08-31

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant