CN111198931B - Data indexing system and method - Google Patents

Info

Publication number
CN111198931B
Authority
CN
China
Prior art keywords
index
data
incremental
active
data storage
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811368059.0A
Other languages
Chinese (zh)
Other versions
CN111198931A (en
Inventor
顾朝媛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN201811368059.0A
Publication of CN111198931A
Application granted
Publication of CN111198931B
Legal status: Active
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a data indexing method comprising the following steps: providing data services to users through a current full index and an active incremental index; updating metadata and the active incremental index based on newly acquired data; triggering creation of a new full index at a trigger time, the new full index being created from the metadata as it stands at the trigger time; updating the metadata, the active incremental index, and a backup incremental index with data newly acquired between the trigger time and the time at which creation of the new full index completes; after the new full index has been created, setting it as the current full index and exchanging the active and backup incremental indexes; and providing data services to users through the newly set current full index and active incremental index. The invention also provides a corresponding system and a corresponding computing device.

Description

Data indexing system and method
Technical Field
The invention relates to the field of data query, and in particular to providing data query services to users through data indexes.
Background
With the arrival of the internet era, information such as breaking news and entertainment headlines spreads ever more widely, and search engines emerged to help people find and filter the content they want. A search engine must handle search requests from a very large number of users, with peak load reaching thousands or even tens of thousands of requests per second. To meet this demand, search engines typically provide a data indexing service. Such data indexes are usually inverted indexes: each entry in the index table comprises an attribute value and the addresses of the records having that attribute value. In an inverted index, the record does not determine the attribute value; rather, the attribute value determines the position of the record. The inverted index is well suited to search engines: a user's search request can be converted into one or more attribute values, the record addresses (for example, web page URLs) associated with each attribute value are obtained from the data index, and the search result is then returned.
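The inverted index described above can be illustrated with a minimal sketch. The function names, the sample URLs, and the choice to return only records matching every queried attribute value (AND semantics) are assumptions made for illustration; the source does not specify how multiple attribute values are combined.

```python
from collections import defaultdict


def build_inverted_index(records):
    """Map each attribute value to the set of record addresses that carry it."""
    index = defaultdict(set)
    for address, attribute_values in records.items():
        for value in attribute_values:
            index[value].add(address)
    return index


def lookup(index, query_values):
    """Return the addresses that carry every queried attribute value (assumed AND semantics)."""
    postings = [index.get(value, set()) for value in query_values]
    return set.intersection(*postings) if postings else set()


# Hypothetical records: address -> attribute values extracted from the record.
records = {
    "http://example.com/a": ["news", "sports"],
    "http://example.com/b": ["news", "music"],
}
index = build_inverted_index(records)
result = lookup(index, ["news", "music"])
```

Here the attribute value is the key and the record address is the payload, which is the inversion the paragraph describes.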
Since the search engine provides a search service for the user on one hand and needs to acquire the latest web page content from the internet on the other hand, in order to ensure the real-time performance of the search data of the search engine, the data index therein needs to be updated. There are two types of index update strategies used in the prior art:
1. Complete rebuild strategy: when the number of newly crawled documents reaches a certain threshold, the new documents are merged with the existing documents and an index is rebuilt over all of them. Once the new index has been built, the old index is discarded and released, and the new index takes over all responses to user queries. The problem with this strategy is that newly crawled documents are not reflected in the index in real time, so search results lag behind the data; in addition, documents that arrive while the index is being built may not be indexed, resulting in inconsistent data.
2. Re-merge strategy: when new documents enter the search system, the system maintains a temporary inverted index in memory to record their information. When the new documents reach a certain number, or a specified amount of memory has been consumed, the temporary index is merged with the inverted index of the old documents to generate a new index. The problem with this strategy is that as the index file grows, building the new index takes a great deal of time and resources, and documents added during the build are not reflected in the new index, which again leads to stale search results and inconsistent index results.
Although the above strategies can keep serving results to users through index switching, they do not take into account that, as network data grows, new network data may also arrive while a new index is being built. That data may not be reflected in the index results, so search results that include the newest documents cannot be obtained in real time.
Therefore, a new data index providing scheme is needed, which can ensure the consistency and real-time performance of data index updating while switching indexes.
Disclosure of Invention
To this end, the present invention provides a new data indexing scheme in an attempt to solve or at least alleviate at least one of the problems presented above.
According to an aspect of the present invention, there is provided a data indexing method comprising the following steps: providing data services to users through a current full index and an active incremental index; updating metadata and the active incremental index based on newly acquired data; triggering creation of a new full index at a trigger time, the new full index being created from the metadata as it stands at the trigger time; updating the metadata, the active incremental index, and a backup incremental index with data newly acquired between the trigger time and the time at which creation of the new full index completes; after the new full index has been created, setting it as the current full index and exchanging the active and backup incremental indexes; and providing data services to users through the newly set current full index and active incremental index.
Optionally, the method according to the present invention further comprises repeating the steps of: updating the metadata and the active incremental index; triggering creation of a new full index; updating the metadata, the active incremental index, and the backup incremental index with data newly acquired between the trigger time and the time at which creation of the new full index completes; and newly setting the current full index, the active incremental index, and the backup incremental index.
Optionally, in the method according to the invention, the trigger time is set to a predetermined time after the current full index was most recently set, or to the time at which the active incremental index reaches a predetermined amount of data.
Optionally, the data index includes a correspondence between data attribute values and data storage locations, and the method according to the present invention further comprises the step of extracting data attribute values and data storage locations from the newly acquired data. Updating the metadata and the incremental index with the newly acquired data then comprises writing the extracted data attribute values and data storage locations into the metadata and the active incremental index.
Optionally, in the method according to the present invention, the newly acquired data includes web content, the data storage location includes a web link corresponding to the web content, and the data attribute value includes a data attribute extracted from the web content.
Optionally, in the method according to the present invention, the step of updating the backup incremental index between the trigger time and the time at which creation of the new full index completes comprises: emptying the content of the backup incremental index before updating it.
Optionally, in the method according to the present invention, the step of providing the data service to the user through the current full index and the active incremental index comprises: receiving a user request including a data attribute value to be queried; obtaining a first data storage location corresponding to the queried data attribute value from the current full index; obtaining a second data storage location corresponding to the queried data attribute value from the active incremental index; and merging the first and second data storage locations and returning the merged data storage locations to the user as the result.
Optionally, the method according to the invention further comprises the steps of: counting user requests to obtain a hot request list with query times exceeding a preset number; during the creation of a new full index based on the metadata, for each request in the hot request list, acquiring a corresponding first data storage position from the new full index in advance; and caching the hot request list and the retrieved corresponding first data storage location for use after the new full index is set to the current full index.
Optionally, in the method according to the present invention, the step of providing the data service to the user through the current full index and the active incremental index comprises: if the user request is in the hot request list, obtaining the first data storage location from the cache; otherwise, obtaining the first data storage location from the current full index.
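The warm-up and cache lookup described in the two preceding paragraphs might look like the following sketch. The names `prewarm` and `first_locations`, and the use of plain dictionaries for the index and the cache, are illustrative assumptions rather than anything specified in the source.

```python
def prewarm(new_full_index, hot_requests):
    """While the new full index is being prepared, look up each hot request in advance."""
    return {query: list(new_full_index.get(query, [])) for query in hot_requests}


def first_locations(query, cache, current_full_index):
    """Serve the first data storage locations from the cache when the request is hot."""
    if query in cache:
        return cache[query]
    return current_full_index.get(query, [])


# Hypothetical data: attribute value -> data storage locations.
new_full_index = {"news": ["u1", "u2"], "music": ["u3"]}
cache = prewarm(new_full_index, ["news"])
hot_result = first_locations("news", cache, new_full_index)
cold_result = first_locations("music", cache, new_full_index)
```

The cache is only meaningful once the new full index has become the current full index, which is why the sketch builds it from `new_full_index`.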
Optionally, in the method according to the present invention, the incremental index includes an odd incremental index and an even incremental index, and when the odd incremental index is set as the active incremental index, the even incremental index is set as the backup incremental index; and when the even incremental index is set as the active incremental index, the odd incremental index is set as the backup incremental index.
According to another aspect of the present invention, there is also provided a data indexing system, comprising: a data storage unit in which metadata is stored; an index storage unit adapted to store the current full index, the active incremental index, and the backup incremental index; an incremental index updating unit adapted to update the metadata and the active incremental index based on newly acquired data; a full index building unit adapted to trigger creation of a new full index at a trigger time, the new full index being created from the metadata as it stands at the trigger time; and an index switching unit adapted to set the new full index as the current full index after its creation is completed, to exchange the active and backup incremental indexes, and to provide a data index to the user through the newly set current full index and active incremental index. The incremental index updating unit is further adapted to update the metadata, the active incremental index, and the backup incremental index with data newly acquired between the trigger time and the time at which creation of the new full index completes.
According to still another aspect of the invention, a computing device is also provided. The computing device includes at least one processor and a memory storing program instructions, wherein the program instructions are configured to be executed by the at least one processor and include instructions for performing the method of providing data services described above.
According to the data indexing scheme of the invention, double-writing the active and backup incremental indexes while the full index is being created ensures that index data updated in real time exists in both incremental indexes. After the full index has been created and distributed successfully, the active and backup incremental indexes are switched and double-writing stops, which ensures the consistency of the index data.
In addition, according to the scheme of the invention, the query statements most frequently issued by users are warmed up once when the full index is created and distributed, so that the results of hot queries are cached, further improving the real-time performance of the search service.
Drawings
To the accomplishment of the foregoing and related ends, certain illustrative aspects are described herein in connection with the following description and the annexed drawings, which are indicative of various ways in which the principles disclosed herein may be practiced, and all aspects and equivalents thereof are intended to be within the scope of the claimed subject matter. The above and other objects, features and advantages of the present disclosure will become more apparent from the following detailed description read in conjunction with the accompanying drawings. Throughout this disclosure, like reference numerals generally refer to like parts or elements.
FIG. 1 shows a schematic diagram of a search system 100 according to one embodiment of the invention;
FIG. 2 shows a schematic diagram of a computing device 200, according to one embodiment of the invention;
FIG. 3 illustrates a flow diagram of a data indexing method 300 according to one embodiment of the invention;
FIG. 4 shows a flow diagram of a data indexing method 400 according to another embodiment of the invention; and
FIG. 5 shows a schematic diagram of a data indexing system 500 according to one embodiment of the invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited by the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
FIG. 1 shows a schematic diagram of a search system 100 according to one embodiment of the invention. As shown in FIG. 1, search system 100 includes a client 110, a data indexing system 500, and a web server 120.
Although only three clients 110 are shown in FIG. 1, it should be understood that in practice there may be a large number of clients 110, and that a client 110 may take various forms, including but not limited to a mobile terminal, a personal computer, or a personal digital assistant. The invention is not limited by the type of client 110, as long as a user can use the client 110 to send a search request to the data indexing system 500 and receive the result returned by the system 500 for display on the client 110. For example, client 110 may be a mobile terminal that issues a search request to system 500 through an application installed on the mobile terminal and displays the search results on an interface of the mobile terminal.
Likewise, although only two web servers 120 are shown in FIG. 1, it should be understood that in practice there may be a large number of web servers 120, which may take various forms, including but not limited to web servers and application servers. The invention is not limited by the type of web server 120, as long as the web servers 120 can be accessed by the system 500 and provide the web content 130 they host to the data indexing system 500.
Data indexing system 500 is communicatively coupled to web server 120 and client 110. The system 500 may obtain the web content 130 from the web server 120 and process the obtained web content 130 to obtain various search keywords. The data index 510 is constructed using the obtained search keyword and the link address of the web content. Data index 510 includes a correspondence between data attribute values and data storage locations. In this embodiment, the data attribute value includes a search keyword extracted from the web content 130, and the data storage location includes a link address of the web content 130.
The system 500 may utilize web crawler technology to obtain web content 130 from the web server 120. Once the web server 120 provides the new web content 130, the system 500 may retrieve the content and update the data index 510. Methods 300 and 400 of managing data indexes in system 500 are described in further detail below with reference to fig. 3 and 4.
The system 500 receives a search request from the client 110, extracts search keywords from the request, retrieves the corresponding data storage locations from the data index 510, and returns them to the client 110. For search, the data storage location is the link address of the web content 130, so the client 110 can use the link address to obtain the web content 130 from the web server 120 and display it, thereby completing the web search operation.
According to embodiments of the invention, system 500 may be implemented by computing device 200 as described below. FIG. 2 shows a schematic diagram of a computing device 200, according to one embodiment of the invention.
As shown in FIG. 2, in a basic configuration 202, computing device 200 typically includes system memory 206 and one or more processors 204. A memory bus 208 may be used for communicating between the processor 204 and the system memory 206.
Depending on the desired configuration, the processor 204 may be any type of processor, including but not limited to: a microprocessor (μP), a microcontroller (μC), a digital signal processor (DSP), or any combination thereof. The processor 204 may include one or more levels of cache, such as a level one cache 210 and a level two cache 212, a processor core 214, and registers 216. An example processor core 214 may include an arithmetic logic unit (ALU), a floating point unit (FPU), a digital signal processing core (DSP core), or any combination thereof. An example memory controller 218 may be used with the processor 204, or in some implementations the memory controller 218 may be an internal part of the processor 204.
Depending on the desired configuration, system memory 206 may be any type of memory, including but not limited to: volatile memory (such as RAM), non-volatile memory (such as ROM, flash memory, etc.), or any combination thereof. System memory 206 may include an operating system 220, one or more applications 222, and program data 224. In some implementations, the application 222 can be arranged to execute instructions on the operating system with the program data 224 by the one or more processors 204.
Computing device 200 may also include an interface bus 240 that facilitates communication from various interface devices (e.g., output devices 242, peripheral interfaces 244, and communication devices 246) to the basic configuration 202 via the bus/interface controller 230. Example output devices 242 include a graphics processing unit 248 and an audio processing unit 250, which may be configured to facilitate communication with various external devices, such as a display or speakers, via one or more A/V ports 252. Example peripheral interfaces 244 can include a serial interface controller 254 and a parallel interface controller 256, which can be configured to facilitate communications with external devices such as input devices (e.g., keyboard, mouse, pen, voice input device, touch input device) or other peripherals (e.g., printer, scanner, etc.) via one or more I/O ports 258. An example communication device 246 may include a network controller 260, which may be arranged to facilitate communications with one or more other computing devices 262 over a network communication link via one or more communication ports 264.
A network communication link may be one example of a communication medium. Communication media may typically be embodied by computer readable instructions, data structures, or program modules in a modulated data signal, such as a carrier wave or other transport mechanism, and may include any information delivery media. A "modulated data signal" may be a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of non-limiting example, communication media may include wired media such as a wired network or a dedicated-line network, and various wireless media such as acoustic, radio frequency (RF), microwave, infrared (IR), or other wireless media. The term computer readable media as used herein may include both storage media and communication media.
Computing device 200 may be implemented as a server, such as a database server, an application server, a WEB server, and the like, or as a personal computer including desktop and notebook computer configurations. Of course, computing device 200 may also be implemented as part of a small-sized portable (or mobile) electronic device.
According to one embodiment of the invention, the system 500 is implemented by the computing device 200, and the computing device 200 is configured to perform the data indexing method 300 and/or 400 according to the invention. In this case, the application 222 of computing device 200 includes a plurality of program instructions that implement data indexing methods 300 and/or 400 according to the present invention, and the program data 224 may also store configuration information for system 500 and the like.
It should be noted that one or more of the components of system 500, or one or more portions of each of the components, may be implemented by computing device 200 in accordance with another embodiment of the present invention. These applications 222 in the computing device 200 implement program instructions of one or more portions of the methods 300 and/or 400, respectively, according to the present invention, and these applications 222 cooperate together to implement the methods 300 and/or 400 according to the present invention.
FIG. 3 shows a flow diagram of a data indexing method 300 according to one embodiment of the invention. The method 300 is performed in the data indexing system 500 described above. As shown in fig. 3, the method 300 begins at step S310.
In step S310, a data index is provided to the user by the current full index and the active incremental index. The current full index is a full index created from all metadata as of a certain time, and the incremental index is an additional index created from metadata added after that time. Combining the retrieval result from the full index with the retrieval result from the incremental index yields the complete retrieval result.
According to one embodiment of the invention, two incremental indexes are provided, namely an odd incremental index and an even incremental index. One of the two is set as the active incremental index, and the other is set as the backup incremental index. For example, when the odd incremental index is the active incremental index, the even incremental index is set as the backup incremental index, and vice versa. The description of step S310 below takes the odd incremental index as the active incremental index.
Alternatively, the metadata may be stored in a particular data storage device, such as a database. The existing database can be managed by a database management system, and various version control, data backup, distributed storage and the like can be supported. The present invention is not limited to a specific form of database, and such a database is within the scope of the present invention as long as the metadata can be written into the database and read out from the database to create the full-scale index.
Optionally, the metadata comprises a pair of a data attribute value and a data storage location. Thus, the data index created from the metadata includes a correspondence between the data attribute values and the data storage locations. Therefore, in step S310, in order to provide a data service to a user, a user request including a data attribute value to be queried is first received. A first data storage location corresponding to the queried data attribute value is then obtained from the current full index, and a second data storage location corresponding to the queried data attribute value is obtained from the active delta index. The obtained first data storage location and the second data storage location are then merged to return the merged data storage location as a result to the user.
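The query path in this paragraph can be sketched as follows. The function name `query`, the dictionary representation of the indexes, and the removal of duplicate locations during merging are assumptions for illustration; the source only states that the two results are merged.

```python
def query(full_index, active_incremental, attribute_value):
    """Merge locations from the current full index and the active incremental index."""
    first = full_index.get(attribute_value, [])           # first data storage locations
    second = active_incremental.get(attribute_value, [])  # second data storage locations
    merged = list(first)
    seen = set(first)
    for location in second:
        # Assumption: a location present in both indexes is returned only once.
        if location not in seen:
            merged.append(location)
            seen.add(location)
    return merged


# Hypothetical index contents: attribute value -> data storage locations.
full_index = {"news": ["u1", "u2"]}
active_incremental = {"news": ["u2", "u3"]}
merged_result = query(full_index, active_incremental, "news")
```

Deduplication matters after a switch, because the active incremental index can briefly overlap with what the full index already covers.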
Step S310 begins at the moment the current full index has just been put into use. While the full index was being created, the system 500 continued to process new documents and generate new metadata. This new metadata was stored in the database and also used to update the current active incremental index. Thus, at the time of step S310, the current full index is the full index created from all metadata as of the time its creation was triggered, and the active incremental index has been updated with the metadata added between that trigger time and the time the full index was put into use.
Subsequently, the method 300 proceeds to step S320, in which the metadata and the active incremental index are updated based on newly acquired data. Step S320 corresponds to the running period after the newly created current full index is put into use. During this period, the system 500 continuously retrieves documents from the network, analyzes them, and thereby obtains new data. Because the current full index is already in use and updating it takes a great deal of time, the newly acquired data cannot be used to update the current full index. Instead, it is used to update the active incremental index, which is quick to update, and at the same time to update the stored metadata, such as the metadata in the database.
In step S320, since the backup incremental index does not provide index service for the user, only the active incremental index is updated, and the backup incremental index is not updated.
Alternatively, in step S320, since the metadata includes pairs of data attribute values and data storage locations, the new document may be processed to obtain a plurality of data attribute values and data storage locations therein, so that a plurality of metadata may be created for one document and then written into a database storing metadata.
When the active incremental index is updated with the metadata, the incremental index is searched for an index node matching the data attribute value in the metadata. If such a node exists, the data storage location in the metadata is written into that index node; otherwise, a new index node is created in the incremental index.
When step S320 has been executing for a certain period of time, or the index data accumulated in the active incremental index exceeds a predetermined amount, or the size of the incremental index file exceeds a predetermined value, the method proceeds to step S330.
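A minimal sketch of the metadata extraction and incremental index update described above, assuming an index node is a list of storage locations keyed by attribute value. The function names and sample URLs are hypothetical, and skipping a location that is already present is an added assumption.

```python
def extract_metadata(url, keywords):
    """Create one (attribute value, storage location) pair per keyword of a document."""
    return [(keyword, url) for keyword in keywords]


def update_incremental_index(incremental_index, metadata):
    """Write each pair into the matching index node, creating the node if it is missing."""
    for attribute_value, location in metadata:
        node = incremental_index.get(attribute_value)
        if node is None:
            # No matching index node: create a new one.
            incremental_index[attribute_value] = [location]
        elif location not in node:
            # Matching node found: add the storage location to it.
            node.append(location)


incremental = {}
update_incremental_index(incremental, extract_metadata("http://example.com/a", ["news", "sports"]))
update_incremental_index(incremental, extract_metadata("http://example.com/b", ["news"]))
```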
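The three trigger conditions can be expressed as a single predicate. The threshold values below are placeholders chosen for illustration only; the source says merely that they are predetermined.

```python
def should_trigger_rebuild(elapsed_seconds, entry_count, file_bytes,
                           max_seconds=3600, max_entries=100_000,
                           max_bytes=64 * 1024 * 1024):
    """Return True when any condition for entering step S330 holds.

    All three thresholds are hypothetical defaults, not values from the source.
    """
    return (
        elapsed_seconds >= max_seconds  # step S320 has run for a certain period
        or entry_count > max_entries    # accumulated index data exceeds a set amount
        or file_bytes > max_bytes       # incremental index file exceeds a set size
    )


quiet = should_trigger_rebuild(elapsed_seconds=10, entry_count=5, file_bytes=1024)
timed_out = should_trigger_rebuild(elapsed_seconds=3600, entry_count=5, file_bytes=1024)
```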
In step S330, creation of a new full index is triggered, and the new full index is created from the metadata as it stands at the trigger time. Because creating the full index takes a long time, and the system 500 continues to acquire and process new data and to update the metadata in the meantime, it must be made clear which point-in-time state of the metadata the full index is built from.
Subsequently, in step S340, during the time interval defined by the trigger time determined in step S330 and the time when the creation of the new full index is completed and put into use, the metadata, the active incremental index and the backup incremental index are updated by using the newly acquired data during the time interval. The processing of updating the metadata and updating the incremental index by using the data is the same as the processing described in step S320, and is not described again here. It should be noted that during this time, not only the active incremental index (i.e., the odd incremental index in the example) is updated, but also the backup incremental index (i.e., the even incremental index). During execution of step S340, the data service for the user continues to be provided by the current full index and the active delta index.
Optionally, before updating the backup incremental index, the content in the backup incremental index is cleared, or a new backup incremental index is created to receive new metadata content.
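Steps S330 and S340 can be sketched as a small state holder that snapshots the metadata at the trigger time, empties the backup incremental index, and then double-writes until the rebuild finishes. The class and attribute names are hypothetical, and the sketch ignores concurrency and persistence.

```python
class IndexState:
    """Toy model of the metadata store and the two incremental indexes."""

    def __init__(self):
        self.metadata = []   # (attribute value, storage location) pairs
        self.active = {}     # active incremental index
        self.backup = {}     # backup incremental index
        self.rebuilding = False
        self.snapshot = None

    def trigger_rebuild(self):
        """Step S330: fix the metadata the new full index will be built from."""
        self.snapshot = list(self.metadata)
        self.backup.clear()  # empty the backup before it starts receiving writes
        self.rebuilding = True

    def ingest(self, attribute_value, location):
        """Steps S320/S340: always update metadata and active; also backup while rebuilding."""
        self.metadata.append((attribute_value, location))
        targets = [self.active, self.backup] if self.rebuilding else [self.active]
        for index in targets:
            index.setdefault(attribute_value, []).append(location)


state = IndexState()
state.ingest("news", "u1")   # arrives before the trigger
state.trigger_rebuild()
state.ingest("news", "u2")   # arrives during the rebuild, so it is double-written
```

After the trigger, the backup holds exactly the data that the snapshot lacks, which is what makes the later exchange safe.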
When the creation of the new full index is completed, in step S350, the newly created full index is set as the current full index, the active and backup incremental indexes are exchanged, the current active incremental index is set as the backup incremental index, and the current backup incremental index is set as the active incremental index. Namely, the odd incremental index is set as the backup incremental index, and the even incremental index is set as the active incremental index, so that the data service is continuously provided for the user by the newly set current full index and the active incremental index.
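The switch in step S350 amounts to replacing the full index and exchanging the two incremental roles. The sketch below uses a plain dictionary as the serving state, which is an assumption; a real system would also need to make the change atomic with respect to in-flight queries, which the source does not detail.

```python
def switch_indexes(serving, new_full_index):
    """Install the new full index and exchange the active and backup incremental indexes."""
    serving["full"] = new_full_index
    serving["active"], serving["backup"] = serving["backup"], serving["active"]
    return serving


# Hypothetical state just before the switch (odd index active, even index backup).
serving = {
    "full": {"a": ["u1"]},
    "active": {"b": ["u2"], "c": ["u3"]},  # odd incremental index
    "backup": {"c": ["u3"]},               # even incremental index
}
new_full_index = {"a": ["u1"], "b": ["u2"]}
switch_indexes(serving, new_full_index)
```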
Through steps S310 to S350, a new full index can be built and put into use as the data grows, while switching between the two incremental indexes keeps the data service consistent during the index build.
Subsequently, steps S320 to S350 may be repeated to build a new full index and switch the two incremental indexes again, ensuring data consistency during index construction and ensuring that index results are not affected by index switching.
The following provides a status process that changes according to the method 300 of providing data indexing, where Index represents the content of the Index corresponding to the metadata Meta, e.g., index1 corresponds to Meta1, index2 corresponds to Meta2, and so on.
1. At state 1 corresponding to step S310, at this time the current full index has just been created:
a database: metadata Meta1+ Meta2
Full index: current full Index Index1
Active delta index (in this case, odd delta index): index2
Backup incremental index (in this case, even incremental index): index2
The Index providing the service is Index1+ Index2
2. State 2, corresponding to step S320:
Database: metadata Meta1 + Meta2 + Meta3
Full index: current full index Index1
Active incremental index (in this case, the odd incremental index): Index2 + Index3
Backup incremental index (in this case, the even incremental index): Index2
Index providing the service: Index1 + Index2 + Index3
3. State 3, corresponding to steps S330 and S340:
Database: metadata Meta1 + Meta2 + Meta3 + Meta4
Full index: current full index Index1; newly created full index Index1 + Index2 + Index3
Active incremental index (in this case, the odd incremental index): Index2 + Index3 + Index4
Backup incremental index (in this case, the even incremental index): Index4
Index providing the service: Index1 + Index2 + Index3 + Index4
4. State 4, corresponding to step S350:
Database: metadata Meta1 + Meta2 + Meta3 + Meta4
Full index: current full index Index1 + Index2 + Index3
Active incremental index (in this case, the even incremental index): Index4
Backup incremental index (in this case, the odd incremental index): Index2 + Index3 + Index4
Index providing the service: Index1 + Index2 + Index3 + Index4
Because creating a full index takes a long time, the data indexing method 300 of this embodiment double-writes the odd and even incremental indexes while the full index is being created, so that index data updated in real time is present in both incremental indexes. After the full index has been created and distributed successfully, the incremental indexes are switched and double-writing stops, which keeps the index data consistent.
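To see why the double write matters, the four states listed above can be replayed with and without it. The sketch below is illustrative only: the function name `visible_after_switch` is hypothetical, and each index is represented simply by the set of metadata batches (Meta1 to Meta4) whose content it holds.

```python
def visible_after_switch(double_write):
    """Replay states 1-4 and return (database, served) after the switch in S350."""
    # State 1: the current full index has just been created.
    database = {"Meta1", "Meta2"}
    full = {"Meta1"}
    active, backup = {"Meta2"}, {"Meta2"}

    # State 2: Meta3 arrives before any rebuild is triggered.
    database.add("Meta3")
    active.add("Meta3")

    # State 3: the rebuild is triggered from the metadata snapshot; the backup is emptied.
    new_full = set(database)
    backup = set()
    database.add("Meta4")          # arrives while the new full index is being built
    active.add("Meta4")
    if double_write:
        backup.add("Meta4")

    # State 4: promote the new full index and exchange the incremental indexes.
    full = new_full
    active, backup = backup, active
    return database, full | active
```

With `double_write=True` the served content equals the database after the switch. With `double_write=False` the batch that arrived during the rebuild (Meta4) is in neither the new full index nor the newly active incremental index, so it would disappear from query results.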
FIG. 4 shows a schematic diagram of a data indexing method 400 according to another embodiment of the invention. Among the steps of the method 400 shown in fig. 4, the steps that are the same as or similar to the steps of the method 300 shown in fig. 3 are denoted by the same reference numerals and are not described again.
As shown in fig. 4, step S410 is provided before step S310. In step S410, user requests are counted to obtain a hot request list of queries whose query count exceeds a predetermined number. According to one embodiment, user requests may be recorded, the query statements in the user requests during the most recent period spanning steps S310 to S340 may be counted, and the one or more query statements whose query count exceeds the predetermined number may be used to construct the hot request list.
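The counting in step S410 could look like the following sketch. The function name `hot_request_list` and the representation of the request log as a list of query strings are assumptions made for illustration.

```python
from collections import Counter


def hot_request_list(query_log, min_count):
    """Return the queries seen more than min_count times, most frequent first."""
    counts = Counter(query_log)
    return [query for query, n in counts.most_common() if n > min_count]
```

The strict comparison `n > min_count` follows the wording "exceeding a predetermined number"; a real deployment might also cap the length of the list.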
After construction of a new full index is triggered at step S330, and before the new full index is used as the current full index at step S350, the method 400 further includes step S430. In step S430, once the new full index has been constructed but before it is put into use, a corresponding first data storage location is obtained in advance from the new full index for each request in the hot request list constructed in step S410. Subsequently, in step S440, the hot request list obtained in step S410 and the first data storage locations obtained in step S430 are cached so that the cached search results can be utilized after the new full index is set as the current full index and put into use in step S350.
To this end, in the method 400, step S320 further includes a sub-step S420: while the current full index and the active incremental index provide the data service to the user, if a user request is in the hot request list, the corresponding first data storage location is obtained from the cache constructed in step S440; otherwise, the first data storage location is obtained from the current full index.
According to the method 400 shown in fig. 4, the user's query results are cached: the statements with the highest query counts in each index period are counted, and once the new index has been created successfully, a single warm-up query is run for those statements. The results of hot queries are therefore already cached when the new full index is put into use, which improves the efficiency of data retrieval.
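Steps S430, S440 and S420 can be sketched as a warm-up pass followed by a cache-first lookup. The names `prewarm` and `first_locations` are hypothetical, and a full index is modeled here as a plain dictionary mapping a queried attribute value to a list of data storage locations.

```python
def prewarm(hot_requests, new_full_index):
    """S430/S440: look up each hot request in the new full index and cache the result."""
    return {query: new_full_index.get(query, []) for query in hot_requests}


def first_locations(query, cache, current_full_index):
    """S420: serve hot requests from the cache, everything else from the full index."""
    if query in cache:
        return cache[query]
    return current_full_index.get(query, [])
```

Because the cache is filled from the new full index before that index is put into use, the first hot queries after the switch do not have to touch the index at all.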
FIG. 5 shows a schematic diagram of a data indexing system 500 according to one embodiment of the invention. It should be noted that fig. 2 depicts one computing device 200 that may implement system 500. FIG. 2 thus depicts the system 500 in terms of hardware, while FIG. 5 depicts the system 500 in terms of functional blocks. The two descriptions differ, but both are within the scope of the invention.
As shown in fig. 5, the data indexing system 500 includes a data storage unit 520, an index storage unit 512, an incremental index updating unit 530, a full index building unit 540, and an index switching unit 550.
The data storage unit 520 is, for example, a database in which metadata 522 is stored. According to one implementation, various versions of the metadata may be stored in the database, and all metadata information at a given time may be recorded. The metadata includes, for example, pairs of a data attribute value and a data storage location. How a new document is processed to obtain metadata, and how that metadata is used to update the indexes (both the full index and the incremental indexes), has been described above for the methods 300 and 400 with reference to figs. 3 and 4 and is not repeated here.
The index storage unit 512 stores therein various indexes including the current full index 510a, the odd incremental index, and the even incremental index, etc. As previously described, the odd and even incremental indices are alternately set to the active incremental index 510b and the backup incremental index 510c, respectively, depending on the operational phase of the system.
The delta index update unit 530 is coupled to the data storage unit 520 and the index storage unit 512, and updates the metadata 522 in the data storage unit 520 and the active delta index 510b in the index storage unit 512 with the newly obtained data. According to one embodiment, the delta index update unit 530 extracts data attribute values and data storage locations from the newly acquired data and writes the extracted data attribute values and data storage locations into the metadata 522 and the active delta index 510b, respectively.
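As an illustration of the update performed by unit 530, the sketch below extracts (attribute value, storage location) pairs from a new document and writes them to both the metadata and the active incremental index. The extractor is purely hypothetical: it treats every distinct lowercase word of the text as an attribute value and the document URL as the storage location. The function names and the `defaultdict(set)` index layout are likewise assumptions.

```python
from collections import defaultdict


def extract_pairs(url, text):
    """Hypothetical extractor: each distinct lowercase word is an attribute value."""
    return [(word, url) for word in sorted(set(text.lower().split()))]


def apply_update(url, text, metadata, active_index):
    """Append the pairs to the metadata and add them to the active incremental index."""
    for value, location in extract_pairs(url, text):
        metadata.append((value, location))
        active_index[value].add(location)
```

A caller would create the index with `active_index = defaultdict(set)` and pass the same `metadata` list that a later full-index build reads from.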
The full index building unit 540 is also coupled to the data storage unit 520 and the index storage unit 512, and triggers the creation of a new full index at a predetermined trigger time, so as to create a new full index 510d in the index storage unit 512 by using the metadata 522 corresponding to the trigger time.
When the full index building unit 540 is triggered to create the new full index 510d, an indication that the new full index is being created is sent to the incremental index updating unit 530. Thereafter, during the period between the trigger time and the time at which creation of the new full index is complete, the incremental index updating unit 530 uses the data newly acquired during that period to update not only the metadata 522 and the active incremental index 510b, but also the backup incremental index 510c.
Optionally, the incremental index updating unit 530 clears the content of the backup incremental index before updating the backup incremental index 510c.
The index switching unit 550 is coupled to the full index building unit 540 and the index storage unit 512. After the full index building unit 540 indicates that the creation of the new full index is completed, the index switching unit 550 sets the new full index 510d as the current full index 510a and exchanges the current active and backup incremental indexes 510b and 510c, that is, the current active incremental index is set as the backup incremental index, and the current backup incremental index is set as the active incremental index. For example, if the active incremental index at this time is the odd incremental index, the odd incremental index is set as the backup incremental index and the even incremental index is set as the active incremental index, so that the newly set current full index 510a and active incremental index 510b continue to provide the data service to the user.
Data indexing system 500 also includes a component that processes requests for users, namely request processing unit 560. The request processing unit 560 receives a user request including a data attribute value to be queried, retrieves a first data storage location corresponding to the queried data attribute value from the current full index 510a of the index storage unit 512, retrieves a second data storage location corresponding to the queried data attribute value from the active delta index 510b, and then merges the first data storage location and the second data storage location to return the merged data storage location to the user as a result.
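The merge performed by the request processing unit 560 might be sketched as follows. The function name `handle_request` and the dictionary-based index layout are assumptions; dropping duplicate locations while preserving order is also an assumption, since the text only states that the two results are merged.

```python
def handle_request(value, current_full_index, active_incremental_index):
    """Merge locations from both indexes, dropping duplicates but keeping order."""
    first = current_full_index.get(value, [])         # first data storage locations
    second = active_incremental_index.get(value, [])  # second data storage locations
    return list(dict.fromkeys(first + second))
```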
In addition, according to an embodiment of the present invention, the processing results of user requests may be cached to improve the efficiency of request processing. The data indexing system 500 further includes a preloading unit 570. The preloading unit 570 may count the user requests to obtain a hot request list of requests whose query count exceeds a predetermined number. Subsequently, while the full index building unit 540 creates the new full index based on the metadata, the preloading unit 570 acquires in advance, for each request in the hot request list, a corresponding first data storage location from the new full index, and stores the hot request list together with the acquired first data storage locations in the caching unit 580, to be used after the new full index is set as the current full index.
In this way, when processing a user request, the request processing unit 560 first determines whether the user request is in the hot request list; if so, it acquires the first data storage location from the caching unit 580, and otherwise it acquires the first data storage location from the current full index 510a.
It should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be construed to reflect the intent: that the invention as claimed requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
Those skilled in the art will appreciate that the modules or units or components of the devices in the examples disclosed herein may be arranged in a device as described in this embodiment, or alternatively may be located in one or more devices different from the device in this example. The modules in the foregoing examples may be combined into one module or may additionally be divided into multiple sub-modules.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Moreover, those skilled in the art will appreciate that although some embodiments described herein include some features included in other embodiments, not others, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
Additionally, some of the embodiments are described herein as a method or combination of method elements that can be implemented by a processor of a computer system or by other means of performing the described functions. A processor having the necessary instructions for carrying out the method or method elements thus forms a means for carrying out the method or method elements. Further, the elements of the apparatus embodiments described herein are examples of the following apparatus: the apparatus is used to implement the functions performed by the elements for the purpose of carrying out the invention.
As used herein, unless otherwise specified the use of the ordinal adjectives "first", "second", "third", etc., to describe a common object, merely indicate that different instances of like objects are being referred to, and are not intended to imply that the objects so described must be in a given sequence, either temporally, spatially, in ranking, or in any other manner.
While the invention has been described with respect to a limited number of embodiments, those skilled in the art, having benefit of this description, will appreciate that other embodiments can be devised which do not depart from the scope of the invention as described herein. Moreover, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the inventive subject matter. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. The present invention has been disclosed with respect to the scope of the invention, which is to be considered as illustrative and not restrictive, and the scope of the invention is defined by the appended claims.

Claims (14)

1. A method of data indexing, comprising:
providing data service for a user by a current full index and an active incremental index, wherein the current full index is created based on metadata, and the active incremental index is updated by data newly acquired during the creation of the full index based on the metadata;
updating the metadata and the active delta index based on newly acquired data;
triggering to create a new full index at a triggering moment so as to create the new full index by using metadata corresponding to the triggering moment;
in the period between the trigger moment and the creation completion moment of the new full index, updating the metadata, the active incremental index and the backup incremental index by using newly acquired data in the period; and
after the creation of the new full index is completed, setting the new full index as the current full index, setting the active incremental index and the backup incremental index as the backup incremental index and the active incremental index respectively, and providing data service for a user by the newly set current full index and the newly set active incremental index;
the step of providing data service for the user by the current full index and the active incremental index comprises the following steps:
receiving and counting a user request to obtain a hot request list with query times exceeding a preset number, wherein the user request comprises a data attribute value to be queried;
during the creation of a new full index based on metadata, for each user request in the hot request list, acquiring a corresponding first data storage position from the new full index in advance;
caching the hot request list and the acquired corresponding first data storage position;
and each time a user request is received, if the received user request is in the hot request list, acquiring the first data storage position from the cache, and otherwise acquiring the first data storage position corresponding to the queried data attribute value from the current full index.
2. The method of claim 1, wherein the trigger time is set to a predetermined time after the current full index is newly set or to the time at which an active delta index reaches a predetermined amount of data.
3. The method of claim 1 or 2, wherein the data index comprises a correspondence between data attribute values and data storage locations, the method further comprising:
extracting data attribute values and data storage positions from the newly acquired data; and
the step of updating the metadata and the delta index with the newly acquired data comprises:
writing the extracted data attribute values and data storage locations into the metadata and the active delta index.
4. The method of claim 3, wherein:
the newly acquired data includes web content, the data storage location includes a web link corresponding to the web content, and the data attribute value includes a data attribute extracted from the web content.
5. The method of any of claims 1-2, wherein the step of updating the backup incremental index during the period between the trigger moment and the creation completion moment of the new full index comprises:
emptying the content in the backup incremental index before updating the backup incremental index.
6. The method of any of claims 1-2, wherein the step of providing data services to the user by the current full index and the active delta index further comprises:
obtaining a second data storage location corresponding to the queried data attribute value from the active delta index; and
merging the first data storage position and the second data storage position so as to return the merged data storage position to the user as a result.
7. The method of any of claims 1-2, wherein an incremental index comprises an odd incremental index and an even incremental index, the even incremental index being set to the backup incremental index when the odd incremental index is set to the active incremental index; and when the even delta index is set to the active delta index, the odd delta index is set to the backup delta index.
8. A data indexing system, comprising:
a data storage unit in which metadata is stored;
an index storage unit adapted to store a current full index, an active incremental index and a backup incremental index, wherein the current full index is created based on the metadata, and the active incremental index is updated with data newly acquired during the creation of the full index based on the metadata;
an incremental index updating unit adapted to update the metadata and the active incremental index based on newly acquired data;
a full index building unit adapted to trigger the creation of a new full index at a trigger moment, so as to create the new full index by using the metadata corresponding to the trigger moment;
an index switching unit adapted to, after the creation of the new full index is completed, set the new full index as the current full index and set the active incremental index and the backup incremental index as the backup incremental index and the active incremental index, respectively;
wherein the incremental index updating unit is further adapted to update the metadata, the active incremental index and the backup incremental index by using data newly acquired in the period between the trigger moment and the creation completion moment of the new full index;
a preloading unit adapted to count user requests to obtain a hot request list of requests whose query count exceeds a predetermined number, and, while the full index building unit creates a new full index based on the metadata, to acquire in advance from the new full index a corresponding first data storage location for each request in the hot request list;
a cache unit adapted to cache the hot request list and the corresponding first data storage locations; and
a request processing unit adapted to receive user requests and, for each received user request, to acquire the first data storage location from the cache unit if the user request is in the hot request list, and otherwise to acquire from the current full index the first data storage location corresponding to the queried data attribute value.
9. The system of claim 8, the data index comprising a correspondence between data attribute values and data storage locations, the incremental index update unit adapted to:
extracting data attribute values and data storage positions from the newly acquired data; and
writing the extracted data attribute values and data storage locations into the metadata and the active delta index.
10. The system of claim 9, wherein:
the newly acquired data includes web content, the data storage location includes a web link corresponding to the web content, and the data attribute value includes a data attribute extracted from the web content.
11. The system according to any of claims 8-10, wherein said incremental index updating unit is adapted to empty the contents of said backup incremental index before updating said backup incremental index during the period between the trigger moment and the creation completion moment of said new full index.
12. The system of any of claims 8-10, the request processing unit further adapted to:
obtaining a second data storage location corresponding to the queried data attribute value from the active delta index; and
merging the first data storage position and the second data storage position so as to return the merged data storage position to the user as a result.
13. The system of any of claims 8-10, wherein an incremental index comprises an odd incremental index and an even incremental index, the even incremental index being set to the backup incremental index when the odd incremental index is set to the active incremental index; and when the even delta index is set to the active delta index, the odd delta index is set to the backup delta index.
14. A computing device, comprising:
at least one processor; and
a memory storing program instructions configured for execution by the at least one processor, the program instructions comprising instructions for performing the method of any of claims 1-7.
CN201811368059.0A 2018-11-16 2018-11-16 Data indexing system and method Active CN111198931B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811368059.0A CN111198931B (en) 2018-11-16 2018-11-16 Data indexing system and method

Publications (2)

Publication Number Publication Date
CN111198931A CN111198931A (en) 2020-05-26
CN111198931B true CN111198931B (en) 2023-04-07

Family

ID=70744237

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811368059.0A Active CN111198931B (en) 2018-11-16 2018-11-16 Data indexing system and method

Country Status (1)

Country Link
CN (1) CN111198931B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103294731A (en) * 2012-03-05 2013-09-11 阿里巴巴集团控股有限公司 Real-time index creating and real-time searching method and device
WO2018120876A1 (en) * 2016-12-29 2018-07-05 北京奇艺世纪科技有限公司 Method and device for searching for cache update

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
程知群 ; 章超 ; 韩高帅 ; .基于Solr的数据检索技术研究.杭州电子科技大学学报(自然科学版).2017,第37卷(第1期),第11-15页. *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant