WO2016155669A1 - Data storage method and device - Google Patents

Data storage method and device

Publication number
WO2016155669A1
Authority
WO
WIPO (PCT)
Prior art keywords
update
storage area
field
webpage
index
Prior art date
Application number
PCT/CN2016/078369
Other languages
French (fr)
Chinese (zh)
Inventor
蔡迥航
李前令
Original Assignee
广州神马移动信息科技有限公司
Priority date
Filing date
Publication date
Application filed by 广州神马移动信息科技有限公司
Publication of WO2016155669A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor

Definitions

  • the present invention relates to the field of computer technologies, and in particular, to a data storage method and apparatus.
  • when a search engine provides a search service to a user, it uses the inverted index and the forward index (called the positive index in this document) to obtain the relevant information of the target webpages and provide that information to the user.
  • the inverted index is an index structure that maps a keyword to a plurality of sorted web pages.
  • the positive index is an index structure that maps a specific web page to the summary information of that web page.
  • webpages on the Internet are updated very quickly, and updated webpages are generated constantly (in this document, newly generated webpages, webpages to be deleted, and webpages with updates are collectively referred to as update webpages).
  • the search engine uses crawler software to search continuously for updated web pages in the background, stores the summaries of the updated webpages, and incrementally updates the existing inverted index and positive index. That is, on top of the existing index structures, it creates a new inverted index from keywords to the updated web pages and a new positive index from the updated web pages to the corresponding web page summaries.
  • the positive index is updated incrementally along with an incremental update of the stored page summary.
  • a page summary consists of several fields, such as author, keyword, title, creation time, update time, page hits, and more. When any of the fields changes, a full summary containing all the fields must be saved again, and the corresponding index of the page must be stored again.
  • after several incremental updates, search efficiency is reduced due to the large amount of new data, so a full update is performed, that is, the entire data structure of the positive index is updated.
  • the existing web page summary storage method needs to incrementally store all the fields included in the web page summary whenever the web page summary is incrementally updated, so the amount of data stored in each incremental update is large.
  • a full update of the web page summaries and the positive index then has to be performed, and the amount of data involved in a full update is even larger, which takes up a lot of time and device resources.
  • the embodiment of the invention provides a data storage method and device, which solves the technical problem that the webpage digest storage method of the prior art has to perform a full update after several incremental updates, which occupies a large amount of time and device resources.
  • a data storage method comprising: determining, when the webpage digest is updated, an update field in the webpage digest and a field storage area corresponding to the update field; newly adding an update storage area in the field storage area, and storing, in the update storage area, field data of the update field after the current update and index information of the field data.
  • the update storage area includes a data storage area and a corresponding index storage area, and the updated field data is stored in the data storage area, and the index information of the field data is stored in the index storage area.
  • the storing the index information of the field data in the index storage area includes: storing, in the index storage area, a webpage identifier corresponding to the field data, and storage location information of the field data in the data storage area.
  • the method further includes: newly adding a webpage index table, storing, in the webpage index table, a webpage identifier corresponding to the current update, and storage location information of the webpage identifier in the index storage area.
  • the storing the webpage identifier corresponding to the current update in the webpage index table includes: setting 2^N index sub-tables in the webpage index table, and setting a corresponding N-bit binary table value for each index sub-table, where N is a preset integer greater than or equal to 1; obtaining a binary value corresponding to the webpage identifier, and storing the webpage identifier in the index sub-table with the corresponding table value according to the first N bits of the binary value.
  • the method further includes: presetting a plurality of field storage areas, and respectively designating one or more corresponding fields for each of the field storage areas.
  • the specifying one or more fields for each field storage area respectively includes: collecting statistics on the update frequency of each field included in the webpage summary, and specifying one or more corresponding fields for each field storage area according to the update frequency.
  • specifying one or more fields corresponding to each field storage area according to the update frequency includes: dividing fields with the same or similar update frequency into the same field storage area.
  • the method further includes: determining whether there is a webpage to be deleted, and if yes, setting a valid time of the webpage to be deleted in the newly added update storage area; and after the valid time is reached, marking as invalid the field data and corresponding index information stored for the webpage to be deleted at each update.
  • the method further includes: marking the history field data corresponding to the update field and the corresponding history index information in the history update storage area as invalid.
  • the method further includes: merging a plurality of update storage areas included in the field storage area, and deleting field data marked as invalid and corresponding index information in the merged new update storage area.
  • the merging of the plurality of update storage areas included in the field storage area includes: selecting, in the field storage area, a plurality of update storage areas to be merged; calculating the sum of the quantities of valid field data included in the update storage areas to be merged; and if the sum is less than a first preset threshold, merging the update storage areas to be merged.
  • the selecting, in the field storage area, a plurality of update storage areas to be merged comprises: separately calculating the quantity of valid field data included in each update storage area; and selecting, from the field storage area, the plurality of update storage areas with the smallest quantity of valid field data as the update storage areas to be merged.
  • the selecting, in the field storage area, a plurality of update storage areas to be merged comprises: separately calculating, for each update storage area, the ratio of the quantity of valid field data it includes to the total quantity of field data it includes; and selecting, in the field storage area, the plurality of update storage areas with the lowest ratio as the update storage areas to be merged.
  • the present invention provides a data storage device, the data including a webpage digest and index information of the webpage digest, the apparatus comprising: a determining unit, configured to determine, when the webpage digest is updated, an update field in the webpage digest and a field storage area corresponding to the update field; a first storage unit, configured to newly add an update storage area in the field storage area, and to store, in the update storage area, field data of the update field after the current update and index information of the field data.
  • the update storage area includes a data storage area and a corresponding index storage area;
  • the first storage unit includes: a data storage subunit and an index storage subunit; the data storage subunit is configured to store, in the data storage area, the field data after the current update;
  • the index storage subunit is configured to store index information of the field data in the index storage area.
  • the index storage subunit is configured to store, in the index storage area, a webpage identifier corresponding to the field data, and storage location information of the field data in the data storage area.
  • the device further includes: a second storage unit, configured to newly add a webpage index table, and to store, in the webpage index table, a webpage identifier corresponding to the current update and storage location information of the webpage identifier in the index storage area.
  • the second storage unit includes: a setting subunit, configured to set 2^N index sub-tables in the webpage index table, and to set a corresponding N-bit binary table value for each index sub-table, where N is a preset integer greater than or equal to 1; a webpage storage subunit, configured to obtain a binary value corresponding to the webpage identifier, and to store the webpage identifier in the index sub-table with the corresponding table value according to the first N bits of the binary value.
  • the device further includes: a setting unit, configured to preset a plurality of field storage areas, and respectively specify one or more corresponding fields for each field storage area.
  • the setting unit is configured to: collect an update frequency of each field included in the webpage summary, and specify a corresponding one or more fields for each field storage area according to the update frequency.
  • the device further includes: a determining setting unit, configured to determine whether there is a webpage to be deleted, and if yes, to set a valid time of the webpage to be deleted in the newly added update storage area; and a first marking unit, configured to mark, after the valid time is reached, the field data and corresponding index information stored for the webpage to be deleted as invalid.
  • the device further includes: a second marking unit, configured to mark the historical field data corresponding to the update field and the corresponding historical index information in the history update storage area as invalid.
  • the apparatus further includes: a merging unit, configured to merge the plurality of update storage areas included in the field storage area; and a deleting unit, configured to delete, in the merged new update storage area, the field data and index information marked as invalid by the first marking unit and the second marking unit.
  • the merging unit includes: a first selecting subunit, configured to select, in the field storage area, a plurality of update storage areas to be merged; a first calculating subunit, configured to calculate the sum of the quantities of valid field data included in the update storage areas to be merged; and a first merge subunit, configured to merge the update storage areas to be merged if the sum is less than a first preset threshold.
  • the first selecting subunit includes: a second calculation subunit, configured to separately calculate the quantity of valid field data included in each update storage area; and a second selection subunit, configured to select, from the field storage area, the plurality of update storage areas with the smallest quantity of valid field data as the update storage areas to be merged.
  • the first selecting subunit includes: a third calculation subunit, configured to separately calculate, for each update storage area, the ratio of the quantity of valid field data it includes to the total quantity of field data it includes; and a selection subunit, configured to select, in the field storage area, the plurality of update storage areas with the lowest ratio as the update storage areas to be merged.
  • the foregoing technical solution provides a data storage method and apparatus: when a webpage summary is updated, an update field in the webpage summary and a field storage area corresponding to the update field are determined; an update storage area is newly added in the field storage area, and the update storage area stores field data of the update field after the current update and index information of the field data.
  • the setting unit is further configured to: divide the fields with the same or similar update frequency into the same field storage area.
  • the application also provides a computer terminal for executing the program code of the steps provided by the above data storage method.
  • the application also provides a storage medium for storing program code executed by the above data storage method.
  • the technical solution only needs to incrementally update the update field and the corresponding index information, without incrementally updating the data of all the fields, thereby greatly reducing the amount of data stored in a single update. This avoids the full update that results when the amount of new data becomes too large, saving time and storage space overhead and improving storage efficiency.
  • FIG. 1 is a schematic flowchart diagram of an embodiment of a data storage method according to the present invention
  • FIG. 2 is a schematic flowchart diagram of another embodiment of a data storage method according to the present invention.
  • FIG. 3 is a schematic diagram of a data storage structure corresponding to a data storage method according to the present invention.
  • FIG. 4 is a schematic flowchart diagram of another embodiment of a data storage method according to the present invention.
  • FIG. 5 is a schematic structural diagram of an embodiment of a data storage device according to the present invention.
  • FIG. 6 is a schematic structural diagram of another embodiment of a data storage device according to the present invention.
  • FIG. 7 is a schematic structural diagram of another embodiment of a data storage device according to the present invention.
  • FIG. 8 is a schematic structural diagram of another embodiment of a data storage device according to the present invention.
  • FIG. 9 is a schematic structural diagram of an embodiment of a merging unit of a data storage device according to the present invention.
  • referring to FIG. 1, which is a schematic flowchart of an embodiment of the data storage method provided by the present invention, the embodiment includes the following steps:
  • Step 101: Determine, when the webpage summary is updated, an update field in the webpage summary and a field storage area corresponding to the update field.
  • a web page summary usually includes multiple fields such as author, keyword, body, title, creation time, update time, and page hits.
  • when a web page is updated, not all of the fields it contains are updated.
  • fields such as author and creation time are very unlikely to be updated, while fields such as page hits and visitors are likely to be updated, so the updated fields in the summary need to be identified. For newly created pages and pages to be deleted, all fields contained in their summaries are considered to be update fields.
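As a rough illustration of Step 101, the update fields can be found by comparing the previous and current summaries of a page. The sketch below is only one possible reading: the function names, the dictionary representation of a summary, and the `FIELD_AREA` mapping are assumptions made for this example and do not come from the patent text.

```python
# Hypothetical mapping from summary field to field storage area (an assumption).
FIELD_AREA = {
    "author": "stable",
    "creation_time": "stable",
    "title": "non_changeable",
    "body": "non_changeable",
    "update_time": "variable",
    "page_hits": "variable",
}


def update_fields(old, new):
    """Return the set of fields whose data changed between two summaries.

    `old` is None for a newly created page and `new` is None for a page to be
    deleted; in both cases every field counts as an update field.
    """
    if old is None and new is None:
        return set()
    if old is None:
        return set(new)
    if new is None:
        return set(old)
    return {name for name in old.keys() | new.keys() if old.get(name) != new.get(name)}


def areas_to_extend(fields):
    """Field storage areas that would receive a new update storage area."""
    return {FIELD_AREA[name] for name in fields}


old = {"author": "Li", "title": "Hello", "page_hits": 10}
new = {"author": "Li", "title": "Hello", "page_hits": 11}
changed = update_fields(old, new)
```

In this example only `page_hits` changed, so only the area holding frequently changing fields would need a new update storage area.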
  • Step 102: Add an update storage area in the field storage area, and store, in the update storage area, field data of the update field after the current update and index information of the field data.
  • each field storage area contains several update storage areas, and a new update storage area is added each time the webpage summaries are updated.
  • the update storage area is added to the field storage area corresponding to the field in which the update occurred.
  • the index information is the positive index information from the webpage to the webpage summary field data.
  • the inverted index is used to retrieve a plurality of target webpages related to the search keyword, and then the summary field data of the target webpages is obtained according to the positive index information.
  • the update storage period of the webpage summary may be preset, for example, to one day. At each preset update time, all the webpages updated during the previous day are counted, and the summary fields in which updates occurred and the corresponding index information are stored.
  • the technical solution of the foregoing embodiment provides a data storage method and apparatus: when a webpage summary is updated, an update field in the webpage summary and a field storage area corresponding to the update field are determined; an update storage area is newly added in the field storage area, and the field data of the update field after the current update and the index information of the field data are stored in the update storage area.
  • a plurality of field storage areas may be pre-established, and one or more fields are respectively designated for each field storage area.
  • the update frequency of each field included in the webpage summary is collected in advance, and one or more fields corresponding to each field storage area are specified according to the update frequency; fields with the same or similar update frequency may be assigned to the same field storage area.
  • the update storage area can then be newly added only in the field storage area where the updated field is located.
  • for example, the fields can be divided among three field storage areas: a stable storage area, a non-changeable storage area, and a variable storage area.
  • the stable storage area corresponds to relatively stable fields such as author, keyword, and creation time; the non-changeable storage area corresponds to fields such as body text and title; and the variable storage area corresponds to fields that are more likely to change, such as update time and page hits.
  • the fields can also be divided into different field storage areas according to experience or statistical data from actual operation, to obtain higher update and storage efficiency.
  • the division manner may be fixed, or may be periodically adjusted dynamically by statistical data within a certain period of time.
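The grouping of fields by update frequency described above can be sketched as follows. This is a minimal illustration only: the two cut-off values, the area names, and the sample counts are assumptions chosen for the example, since the text does not give concrete thresholds.

```python
def assign_field_areas(update_counts, total_updates,
                       stable_below=0.05, variable_above=0.5):
    """Group fields into three field storage areas by observed update frequency.

    `update_counts` maps a field name to how many of the `total_updates`
    page updates touched that field. The two thresholds are illustrative.
    """
    if total_updates <= 0:
        raise ValueError("total_updates must be positive")
    areas = {"stable": [], "non_changeable": [], "variable": []}
    for name, count in update_counts.items():
        freq = count / total_updates
        if freq < stable_below:
            areas["stable"].append(name)
        elif freq > variable_above:
            areas["variable"].append(name)
        else:
            areas["non_changeable"].append(name)
    return areas


counts = {"author": 1, "creation_time": 0, "title": 20,
          "body": 30, "update_time": 90, "page_hits": 100}
areas = assign_field_areas(counts, total_updates=100)
```

Re-running such a grouping periodically on fresh statistics would correspond to the dynamic adjustment mentioned in the text.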
  • the update storage area may be specifically divided into two parts: a data storage area and a corresponding index storage area. The updated field data is stored in the data storage area, and the index information of the field data is stored in the index storage area.
  • the index information may specifically include a webpage identifier corresponding to the field data, and storage location information of the field data in the data storage area. Since the number of web pages updated each time is generally large, the number of field data items stored in the data storage area is also large. When a certain item of field data is to be acquired, its index information can be read from the corresponding index storage area, thereby locating that item of field data in the data storage area.
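One way to picture an update storage area is as an append-only data region paired with an index that records, for each webpage identifier, where its field data sits. The sketch below assumes a byte buffer for the data storage area and an (offset, length) pair as the storage location information; both choices, and the class and method names, are assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class UpdateStorageArea:
    """One update storage area: a data storage area plus its index storage area."""
    data: bytearray = field(default_factory=bytearray)  # data storage area
    index: dict = field(default_factory=dict)           # page id -> (offset, length)

    def put(self, page_id: str, value: str) -> None:
        """Append field data and record its location in the index storage area."""
        raw = value.encode("utf-8")
        self.index[page_id] = (len(self.data), len(raw))
        self.data.extend(raw)

    def get(self, page_id: str):
        """Locate field data through the index; return None if the page is absent."""
        location = self.index.get(page_id)
        if location is None:
            return None
        offset, length = location
        return bytes(self.data[offset:offset + length]).decode("utf-8")


area = UpdateStorageArea()
area.put("http://example.com/a", "1024 hits")
area.put("http://example.com/b", "7 hits")
```

Reading a field then costs one index lookup plus one slice of the data region, rather than a scan over all stored field data.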
  • FIG. 2 is a schematic flowchart diagram of another embodiment of a data storage method according to the present invention.
  • the embodiment includes the following steps 201 to 204:
  • Step 201: Determine, when the webpage summary is updated, an update field in the webpage summary and a field storage area corresponding to the update field.
  • Step 202: Add an update storage area to the field storage area, where the update storage area includes a data storage area and a corresponding index storage area.
  • Step 203: Store the updated field data in the data storage area, and store, in the index storage area, the webpage identifier corresponding to the field data and the storage location information of the field data in the data storage area.
  • Step 204: Newly add a webpage index table, and store, in the webpage index table, the webpage identifier corresponding to the current update and the storage location information of the webpage identifier in the index storage area.
  • the webpage identifier corresponding to the current update is stored in the newly added webpage index table, and the webpage identifier may be a URL address of the webpage, or other information that can be used to identify the webpage.
  • the webpage index table further stores storage location information of the webpage identifier in the index storage area, which is used to locate a target webpage identifier in the index storage area.
  • the step 204 may store the webpage identifier corresponding to the field data according to the following steps a) and b):
  • the webpage index table may be divided into 2^N index sub-tables, and each index sub-table corresponds to an N-bit binary table value.
  • the webpage identifier may be stored in the index sub-table with the corresponding table value according to the first N bits of the binary value of the webpage identifier. In this way, when searching for a webpage identifier, it is only necessary to search the index sub-table whose table value matches the first N bits of the binary value of the webpage identifier, which greatly saves search time.
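The sub-table scheme can be sketched as follows. The text does not say how the binary value of a webpage identifier is obtained; this example assumes it is a SHA-256 hash of the identifier, and the class name, method names, and the choice N = 4 are likewise assumptions.

```python
import hashlib

N = 4  # assumption: 2**4 = 16 index sub-tables


def table_value(page_id: str, n: int = N) -> int:
    """Return the first n bits of the identifier's binary value as an integer.

    Assumption: the binary value is the SHA-256 digest of the identifier.
    """
    digest = hashlib.sha256(page_id.encode("utf-8")).digest()
    as_int = int.from_bytes(digest, "big")
    return as_int >> (len(digest) * 8 - n)


class WebpageIndexTable:
    """Webpage index table split into 2**n index sub-tables."""

    def __init__(self, n: int = N):
        self.n = n
        self.sub_tables = [dict() for _ in range(2 ** n)]

    def store(self, page_id: str, location) -> None:
        """Store the identifier in the sub-table selected by its first n bits."""
        self.sub_tables[table_value(page_id, self.n)][page_id] = location

    def lookup(self, page_id: str):
        """Search only the one sub-table that can contain the identifier."""
        return self.sub_tables[table_value(page_id, self.n)].get(page_id)


table = WebpageIndexTable(n=3)
table.store("http://example.com/a", 0)
table.store("http://example.com/b", 1)
```

With N bits selecting the sub-table, each lookup touches roughly 1/2^N of the stored identifiers when the binary values are evenly distributed.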
  • FIG. 3 is a schematic diagram of a data storage structure established by using the foregoing embodiment of the present invention.
  • the number of web pages included in the Internet is large.
  • a separate index fragment is used as an example.
  • the index fragment includes a webpage index table, a field storage area, and a version information table.
  • the webpage index table stores a list of webpage identifiers involved in the current update, and storage location information of each webpage identifier in the newly added update storage area.
  • the web page index table i is a newly added web page index table at the time of the i-th update. In an actual application, the update time information may be added to the name of the webpage index table. On the one hand, each different webpage index table may be distinguished, and on the other hand, it is convenient to immediately distinguish which one is the latest webpage index table from the name.
  • each field storage area contains several update storage areas.
  • the update storage area i is an update storage area newly added in the non-changeable storage area at the ith update, and includes two parts of the data storage area i and the corresponding index storage area i.
  • the update time information may also be added to the name of the update storage area, on the one hand, the different update storage areas may be distinguished, and on the other hand, it is convenient to immediately distinguish which one is the latest update storage area from the name.
  • the latest version information is recorded in the version information table, such as which webpage index tables are included in the current index fragment and which update storage areas are included in each field storage area, so as to facilitate version management and ensure that the latest summary field data can be obtained from the latest index information.
  • the version information table i is a newly added version information table at the time of the ith update.
  • the update time information may be added to the name of the version information table; on the one hand, different version information tables can be distinguished, and on the other hand, it is convenient to immediately identify from the name which one is the latest version information table.
  • FIG. 4 is a schematic flowchart diagram of another embodiment of a data storage method according to the present invention.
  • for steps 201 to 204, refer to the descriptions of the corresponding steps in the foregoing embodiment. This embodiment also includes the following steps 205 to 208:
  • Step 205: Determine whether there is a webpage to be deleted, and if yes, set a valid time of the webpage to be deleted in the newly added update storage area.
  • Step 206: After the valid time is reached, mark the field data and the corresponding index information stored for the webpage to be deleted as invalid.
  • the update of the webpage summary and the index is an incremental update, and is also applicable to the webpage to be deleted. If the webpage that needs to be deleted is detected during the update, all the fields of the webpage are considered to be update fields, and the update storage area is also added in the corresponding field storage area.
  • the updated field data is empty data. Therefore, in the data storage area, the updated field data may be replaced by a preset identifier, and the storage location information of the preset identifier and the webpage identifier are stored in the index storage area.
  • in actual use, the inverted index and the positive index may be updated at different times. Suppose the field data and index information of each summary field of a webpage to be deleted were deleted, or marked as invalid, immediately after the webpage summary and the corresponding positive index are updated. The inverted index may not yet be completely updated at that point, that is, the identifier of the webpage to be deleted may not yet have been removed from the webpage list used by the inverted index, so there may still be index requests for the summary fields of the webpage to be deleted.
  • the "valid time" attribute is therefore set for the webpage to be deleted, so that after the webpage summary and the corresponding index information are updated, the field data and the corresponding index information of the webpage to be deleted are still retained. Only after the valid time is reached are the field data and the corresponding index information stored for the webpage to be deleted marked as invalid. This ensures that the identifier of the webpage to be deleted has already been removed from the webpage list used by the inverted index and that there is no longer any index request for the summary fields of the webpage to be deleted, which solves the problem of the update time difference between the inverted index and the positive index.
  • the "valid time” attribute may be stored in the index storage area together with the web page identifier to be deleted. In order to keep the storage format consistent, for the web pages that do not need to be deleted, the "valid time” attribute may be retained in each update storage area, and the attribute value is set to be invalid, or an infinitely long time is set until it is detected. When the page needs to be deleted, the value of the attribute is actually set in the newly added update store.
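The "valid time" mechanism can be sketched as a grace period between scheduling a deletion and actually invalidating the stored entries. The example below is a simplified illustration: the class and function names, the use of numeric timestamps, and the use of infinity as the "infinitely long" default are assumptions.

```python
import math

NEVER = math.inf  # pages not scheduled for deletion keep an unbounded valid time


class IndexEntry:
    """Index information for one page, carrying a valid-time attribute."""

    def __init__(self, page_id, location, valid_until=NEVER):
        self.page_id = page_id
        self.location = location
        self.valid_until = valid_until
        self.valid = True


def schedule_deletion(entry, now, grace_seconds):
    """Set the valid time so the entry stays readable during the grace period."""
    entry.valid_until = now + grace_seconds


def expire(entries, now):
    """Mark entries whose valid time has been reached as invalid."""
    expired = []
    for entry in entries:
        if entry.valid and now >= entry.valid_until:
            entry.valid = False
            expired.append(entry.page_id)
    return expired


a = IndexEntry("http://example.com/a", 0)
b = IndexEntry("http://example.com/b", 1)
entries = [a, b]
schedule_deletion(a, now=1000, grace_seconds=60)
still_valid = expire(entries, now=1030)   # inside the grace period
now_expired = expire(entries, now=1060)   # valid time reached
```

The grace period would be chosen long enough for the inverted index to finish dropping the page from its webpage lists.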
  • Step 207: Mark the history field data corresponding to the update field and the corresponding history index information in the history update storage areas as invalid.
  • because the update of the webpage summary fields in the present invention is an incremental update, it may happen that several update storage areas contain field data and corresponding index information of the same webpage updated at different times.
  • the most recently added update storage area should be used as the valid update storage area corresponding to the webpage, and the field data and the corresponding index information included in the valid update storage area are the valid field data and index information.
  • the history field data corresponding to the update field and the corresponding history index information in the past history update storage area are all marked as invalid.
  • the update time information may be included in the file name corresponding to the update storage area, so that the file name of each update storage area can quickly determine which update storage area is a valid update storage area.
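Step 207 can be pictured as follows: each time a new update storage area is added, older entries for the same webpages are flagged as invalid, so a lookup always resolves to the newest data. The list-of-dictionaries representation and the function names below are assumptions made for this sketch.

```python
def add_update(history, new_values):
    """Append a new update storage area and invalidate superseded entries.

    `history` is a list of update storage areas, oldest first; each area maps
    a page id to {"value": ..., "valid": bool}.
    """
    for area in history:
        for page_id in new_values:
            if page_id in area:
                area[page_id]["valid"] = False
    history.append({pid: {"value": value, "valid": True}
                    for pid, value in new_values.items()})


def latest(history, page_id):
    """Return the valid field data for a page, searching newest areas first."""
    for area in reversed(history):
        entry = area.get(page_id)
        if entry is not None and entry["valid"]:
            return entry["value"]
    return None


history = []
add_update(history, {"p1": "10 hits", "p2": "3 hits"})
add_update(history, {"p1": "11 hits"})
```

After the second update, the first area still holds the old `p1` entry, but it is flagged invalid and would be dropped at the next merge.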
  • Step 208: Merge the plurality of update storage areas included in the field storage area, and delete the field data marked as invalid and the corresponding index information in the merged new update storage area.
  • step 208 may specifically include the following steps c), d), e):
  • the number of valid field data referred to herein may be the number of valid fields or the memory value occupied by the valid field data.
  • an upper limit on the field data and/or index information may be set for each update storage area; for example, each update storage area may be set to hold field data for at most 100 fields. Assuming that two update storage areas to be merged are selected and that they contain 55 and 60 valid fields respectively, the sum of the valid fields in the two update storage areas is 115, which is more than 100, so the two update storage areas cannot be merged.
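The merge condition can be sketched as a check on the total count of valid entries followed by a copy of only the valid entries. The data layout (a dictionary mapping a page id to a value and a validity flag) and the function names are assumptions for illustration; the threshold of 100 follows the example in the text.

```python
def count_valid(area):
    """Number of valid field data items in one update storage area."""
    return sum(1 for _value, valid in area.values() if valid)


def try_merge(areas, threshold):
    """Merge update storage areas if their valid entries sum to less than threshold.

    Each area maps a page id to (value, valid). Invalid entries are dropped from
    the merged area. Returns None when the merge condition is not met.
    """
    total = sum(count_valid(area) for area in areas)
    if total >= threshold:
        return None
    merged = {}
    for area in areas:  # oldest first, so newer valid data wins on conflict
        for page_id, (value, valid) in area.items():
            if valid:
                merged[page_id] = (value, True)
    return merged


older = {"p1": ("10 hits", False), "p2": ("3 hits", True)}
newer = {"p1": ("11 hits", True)}
merged = try_merge([older, newer], threshold=100)

big_a = {f"a{i}": ("x", True) for i in range(55)}
big_b = {f"b{i}": ("x", True) for i in range(60)}
rejected = try_merge([big_a, big_b], threshold=100)  # 55 + 60 = 115
```

The second call mirrors the 55 and 60 example: the sum of 115 is not below the limit, so no merge takes place.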
  • step 2081 may specifically include the following sub-steps:
  • a plurality of update storage areas with the least number of valid field data may be selected in a targeted manner, so that the selected update storage area to be merged is more likely to meet the merge condition of the above step 2083.
  • step 2081 may specifically include the following sub-steps:
  • a plurality of update storage areas with the fewest valid field data may be selected, or a plurality of update storage areas with the lowest proportion of valid field data may be selected, so that the selected update storage areas to be merged are more likely to meet the merge condition of the above step 2083, which saves time and space overhead.
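The two selection strategies can be compared with a small sketch. It assumes each update storage area is summarized by a pair of counts (valid items, total items); the function names and sample figures are made up for the example.

```python
def pick_fewest_valid(stats, k):
    """Pick the k update storage areas holding the fewest valid field data items.

    `stats` maps an area name to (valid_count, total_count).
    """
    return sorted(stats, key=lambda name: stats[name][0])[:k]


def pick_lowest_ratio(stats, k):
    """Pick the k update storage areas with the lowest valid-to-total ratio."""
    def ratio(name):
        valid, total = stats[name]
        return valid / total if total else 0.0
    return sorted(stats, key=ratio)[:k]


stats = {"u1": (10, 100), "u2": (8, 10), "u3": (30, 40), "u4": (5, 100)}
by_count = pick_fewest_valid(stats, 2)
by_ratio = pick_lowest_ratio(stats, 2)
```

The two strategies can disagree: `u2` holds few valid items in absolute terms, but most of its content is still valid, so the ratio-based choice prefers `u1`, where merging reclaims more space.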
  • the foregoing technical solution provides an embodiment of a data storage method, in which an update field in the webpage summary and a field storage area corresponding to the update field are determined when the webpage summary is updated; an update storage area is newly added in the field storage area, and the field data of the update field after the current update and the index information of the field data are stored in the update storage area.
  • the technical solution only needs to incrementally update the update field and the corresponding index information, without incrementally updating the data of all the fields, thereby greatly reducing the amount of data stored in a single update. This avoids the full update that results when the amount of new data becomes too large, saving time and storage space overhead and improving storage efficiency.
  • the present invention further provides an embodiment of a data storage device, as shown in FIG. 5, which is a structure of an embodiment of a data storage device provided by the present invention.
  • the device comprises:
  • the determining unit 501 is configured to: when the webpage summary is updated, determine an update field in the webpage summary, and a field storage area corresponding to the update field;
  • the first storage unit 502 is configured to newly add an update storage area in the field storage area, and store, in the update storage area, field data of the update field after the current update and index information of the field data.
  • the update storage area includes a data storage area and a corresponding index storage area
  • the first storage unit 502 includes: a data storage subunit 5021 and an index storage subunit 5022;
  • the data storage subunit 5021 is specifically configured to store, in the data storage area, the field data after the current update;
  • the index storage subunit 5022 is specifically configured to store index information of the field data in the index storage area.
  • the index storage subunit 5022 is configured to store, in the index storage area, a webpage identifier corresponding to the field data, and storage location information of the field data in the data storage area.
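The split between a data storage area and an index storage area can be illustrated with a small in-memory sketch. It assumes that the "storage location information" is a byte offset plus a length and that field data is stored as raw bytes; the class and method names are hypothetical.

```python
class UpdateStorageArea:
    """Toy update storage area: a data storage area plus an index storage area."""

    def __init__(self):
        self.data = bytearray()  # data storage area: field data stored back to back
        self.index = []          # index storage area: (page_id, offset, length)

    def store(self, page_id, field_data: bytes):
        """Append field data and record where it lives, keyed by webpage identifier."""
        offset = len(self.data)
        self.data += field_data
        self.index.append((page_id, offset, len(field_data)))

    def load(self, page_id):
        """Find the index entry for page_id and read its field data back."""
        for pid, offset, length in self.index:
            if pid == page_id:
                return bytes(self.data[offset:offset + length])
        return None


area = UpdateStorageArea()
area.store("page-1", b"hits=120")
area.store("page-2", b"hits=7")
print(area.index[1])        # ('page-2', 8, 6)
print(area.load("page-2"))  # b'hits=7'
```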
  • FIG. 6 is a schematic structural diagram of another embodiment of a data storage device according to the present invention.
  • the device further includes:
  • the second storage unit 503 is configured to newly add a webpage index table, and store, in the webpage index table, a webpage identifier corresponding to the current update, and storage location information of the webpage identifier in the index storage area.
  • the second storage unit 503 includes:
  • the setting subunit 5031 is configured to set 2^N index sub-tables in the webpage index table, and set a corresponding N-bit binary table value for each index sub-table, where N is a preset integer greater than or equal to 1;
  • the webpage storage subunit 5032 is configured to obtain a binary value corresponding to the identifier of the webpage, and store the webpage identifier into an index subtable of the corresponding table value according to the first N digits of the binary value.
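A minimal sketch of the sub-table routing described for subunits 5031 and 5032 follows. It assumes the webpage identifier is a non-negative integer with a fixed bit width (`id_bits`, a hypothetical parameter) so that "the first N digits" of its binary value is well defined; the class name and the use of dictionaries as sub-tables are also assumptions.

```python
class WebpageIndexTable:
    """Toy webpage index table split into 2**n_bits sub-tables."""

    def __init__(self, n_bits=2, id_bits=32):
        self.n_bits = n_bits
        self.id_bits = id_bits
        # One sub-table per N-bit binary table value: "00", "01", "10", "11" for N = 2.
        self.sub_tables = {format(v, f"0{n_bits}b"): {} for v in range(2 ** n_bits)}

    def table_value(self, page_id: int) -> str:
        """Return the first N bits of the fixed-width binary form of page_id."""
        return format(page_id, f"0{self.id_bits}b")[: self.n_bits]

    def store(self, page_id: int, index_location: int):
        """Record where page_id's entry sits in the index storage area."""
        self.sub_tables[self.table_value(page_id)][page_id] = index_location

    def lookup(self, page_id: int):
        """Search only the sub-table selected by the identifier's leading bits."""
        return self.sub_tables[self.table_value(page_id)].get(page_id)


table = WebpageIndexTable(n_bits=2, id_bits=8)
table.store(177, 3)            # 177 is 10110001 in 8 bits, so it goes to sub-table "10"
print(table.table_value(177))  # 10
print(table.lookup(177))       # 3
```

Because a lookup touches only one of the 2^N sub-tables, each sub-table stays smaller than a single flat table would be.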
  • FIG. 7 is a schematic structural diagram of another embodiment of a data storage device according to the present invention.
  • the device further includes:
  • the setting unit 504 is configured to preset a plurality of field storage areas, and respectively specify one or more corresponding fields for each field storage area.
  • the setting unit 504 is specifically configured to:
  • count the update frequency of each field included in the webpage summary, and designate one or more corresponding fields for each field storage area according to the update frequency.
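One way to group fields by update frequency is to bucket them against fixed frequency boundaries, as in the sketch below. The boundaries, the field names, and the update counts are invented for illustration; the patent only states that fields are assigned to field storage areas according to update frequency.

```python
from bisect import bisect_left


def assign_fields_to_areas(update_counts, boundaries):
    """Group fields into field storage areas by update count.

    update_counts: mapping of field name to observed number of updates.
    boundaries: ascending upper bounds; len(boundaries) + 1 areas are produced.
    """
    areas = [[] for _ in range(len(boundaries) + 1)]
    for field, count in update_counts.items():
        areas[bisect_left(boundaries, count)].append(field)
    return areas


counts = {"author": 1, "creation_time": 0, "title": 40, "keywords": 55, "page_hits": 9000}
print(assign_fields_to_areas(counts, [10, 1000]))
# [['author', 'creation_time'], ['title', 'keywords'], ['page_hits']]
```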
  • the apparatus further includes:
  • a determining setting unit 505 configured to determine whether there is a webpage to be deleted, and if yes, set a valid time of the webpage to be deleted in the newly added update storage area;
  • the first marking unit 506 is configured to mark as invalid, once the valid time is reached, the field data and the corresponding index information stored for the webpage to be deleted at each update.
  • the apparatus further includes:
  • the second marking unit 507 is configured to mark the history field data corresponding to the update field and the corresponding history index information in the history update storage area as invalid.
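The two marking behaviours (expiring a deleted webpage once its valid time is reached, and invalidating historical data for an updated field) can be sketched as follows. The flat `history` list, the `Entry` record, and the numeric timestamps are simplifying assumptions; the patent keeps this data in separate update storage areas.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    page_id: str
    field: str
    value: str
    valid: bool = True


def store_update(history, page_id, field, value):
    """Mark older entries for the same page and field invalid, then append the new one."""
    for entry in history:
        if entry.page_id == page_id and entry.field == field:
            entry.valid = False
    history.append(Entry(page_id, field, value))


def expire_page(history, page_id, valid_until, now):
    """Once the valid time is reached, invalidate every entry of the deleted page."""
    if now >= valid_until:
        for entry in history:
            if entry.page_id == page_id:
                entry.valid = False


history = []
store_update(history, "p1", "page_hits", "10")
store_update(history, "p1", "page_hits", "11")
print([e.valid for e in history])  # [False, True]
expire_page(history, "p1", valid_until=100, now=100)
print([e.valid for e in history])  # [False, False]
```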
  • FIG. 8 is a schematic structural diagram of another embodiment of a data storage device according to the present invention.
  • the device further includes:
  • a merging unit 508 configured to merge the plurality of update storage areas included in the field storage area
  • the deleting unit 509 is configured to delete the field data and the index information marked as invalid by the first marking unit and the second marking unit in the merged new update storage area.
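The combined effect of the merging unit and the deleting unit can be sketched as a merge that carries over only valid entries. Representing each update storage area as a list of `(page_id, value, valid)` tuples is an assumption made to keep the example short.

```python
def merge_areas(areas):
    """Merge several update storage areas into one new area.

    Each area is a list of (page_id, value, valid) tuples. Entries marked
    invalid are dropped, so the merged area holds only valid entries.
    """
    merged = []
    for area in areas:
        merged.extend(entry for entry in area if entry[2])
    return merged


area_a = [("p1", "hits=10", False), ("p2", "hits=3", True)]
area_b = [("p1", "hits=11", True), ("p3", "hits=8", False)]
print(merge_areas([area_a, area_b]))
# [('p2', 'hits=3', True), ('p1', 'hits=11', True)]
```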
  • FIG. 9 is a schematic structural diagram of an embodiment of a merging unit 508 of a data storage device according to the present invention.
  • the merging unit 508 includes:
  • a first selection subunit 5081 configured to select, in the field storage area, a plurality of update storage areas to be merged;
  • a first calculating subunit 5082 configured to calculate the sum of the numbers of valid field data included in the update storage areas to be merged;
  • the first merging subunit 5083 is configured to merge the update storage areas to be merged if the sum is less than a first preset threshold.
  • the first selection subunit 5081 includes:
  • a second calculating subunit 50811, configured to separately calculate a quantity of valid field data included in each update storage area
  • the second selection sub-unit 50812 is configured to select, from the field storage area, a plurality of update storage areas with the least number of valid field data as the update storage area to be merged.
  • the first selection subunit 5081 may also include:
  • a third calculating subunit for respectively calculating a ratio of the number of valid field data included in the update storage area to the total field data quantity included in the update storage area;
  • a third selection subunit (not shown) for selecting, in the field storage area, the plurality of update storage areas with the lowest ratio as the update storage area to be merged.
  • each functional unit or subunit or module in each of the above embodiments may be operated in a computer terminal as part of the apparatus, and the above unit or subunit or module may be executed by a processor in the computer terminal.
  • the computer terminal can also be a smart phone (such as an Android phone or an iOS phone), a tablet computer, a palmtop computer, a mobile Internet device (MID), a PAD, or the like.
  • the embodiment of the data storage device provided by the present invention is essentially the same as the embodiment of the data storage method described above, and therefore is not explained in detail.
  • in the embodiment of the data storage device provided by the foregoing technical solution, when the webpage summary is updated, an update field in the webpage summary and a field storage area corresponding to the update field are determined; an update storage area is newly added in the field storage area, and the field data of the update field after the current update and the index information of the field data are stored in the update storage area.
  • the various functional modules provided by the embodiments of the present application may be run in a mobile terminal, a computer terminal, or the like, or may be stored as part of a storage medium.
  • embodiments of the present invention may provide a computer terminal, which may be any computer terminal device in a group of computer terminals.
  • a computer terminal may also be replaced with a terminal device such as a mobile terminal.
  • the computer terminal may be located in at least one network device of the plurality of network devices of the computer network.
  • the computer terminal may execute the program code of the following steps in the data storage method: when the webpage summary is updated, determining an update field in the webpage summary and a field storage area corresponding to the update field; and newly adding an update storage area in the field storage area, and storing, in the update storage area, the field data of the update field after the current update and the index information of the field data.
  • the computer terminal can include: one or more processors, memory, and transmission means.
  • the memory can be used to store software programs and modules, such as the data storage method and the program instructions/modules corresponding to the device in the embodiment of the present invention.
  • the processor executes various functional applications and data processing by running the software programs and modules stored in the memory, that is, it implements the above data storage method.
  • the memory may include a high speed random access memory, and may also include non-volatile memory such as one or more magnetic storage devices, flash memory, or other non-volatile solid state memory.
  • the memory can further include memory remotely located relative to the processor, which can be connected to the terminal over a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
  • the above transmission device is for receiving or transmitting data via a network.
  • Specific examples of the above network may include a wired network and a wireless network.
  • the transmission device includes a Network Interface Controller (NIC) that can be connected to other network devices and routers via a network cable to communicate with the Internet or a local area network.
  • the transmission device is a Radio Frequency (RF) module for communicating with the Internet wirelessly.
  • the memory is used to store preset action conditions, information about preset authorized users, and an application.
  • the processor can call, via the transmission device, the information and the application stored in the memory to execute the program code of the method steps of each of the optional or preferred embodiments of the above method embodiments.
  • the computer terminal can also be a smart phone (such as an Android phone or an iOS phone), a tablet computer (such as an iPad or a Samsung Galaxy Tab), a palmtop computer, a mobile Internet device (MID), a PAD, or the like.
  • Embodiments of the present invention also provide a storage medium.
  • the foregoing storage medium may be used to save program code executed by the data storage method provided by the foregoing method embodiment and the device embodiment.
  • the foregoing storage medium may be located in any one of the computer terminal groups in the computer network, or in any one of the mobile terminal groups.
  • the storage medium is arranged to store program code for performing the following steps: when the webpage summary is updated, determining an update field in the webpage summary and a field storage area corresponding to the update field; and newly adding an update storage area in the field storage area, and storing, in the update storage area, the field data of the update field after the current update and the index information of the field data.
  • the storage medium may also be configured to store program code of various preferred or optional method steps provided by the data storage method.
  • the technology in the embodiments of the present invention can be implemented by means of software plus the necessary general-purpose hardware, including general-purpose integrated circuits, general-purpose CPUs, general-purpose memories, general-purpose components, and the like. It can also be implemented by dedicated hardware, including application-specific integrated circuits, dedicated CPUs, dedicated memories, dedicated components, and the like, but in many cases the former is the better implementation. Based on such understanding, the technical solution in the embodiments of the present invention may in essence be embodied in the form of a software product, which may be stored in a storage medium such as a read-only memory (ROM), a random access memory (RAM), or a compact disc (CD).

Abstract

A data storage method and device. Data comprises a webpage abstract and index information about the webpage abstract. The method comprises: when a webpage abstract is updated, determining an update field in the webpage abstract and a field storage zone corresponding to the update field (101); and adding an update storage zone to the field storage zone, and storing the field data of the update field after the update at the present time and the index information about the field data in the update storage zone (102). In the technical solution, during storage of a webpage abstract, only an update field and corresponding index information are incrementally updated without the need to incrementally update the data of all fields, thereby greatly reducing the data volume stored during updating at a single time, avoiding an overly large newly added data volume and the occurrence of a full update caused thereby, saving time and overheads of storage space and improving storage efficiency.

Description

Data storage method and device
Technical field
The present invention relates to the field of computer technologies, and in particular to a data storage method and device.
Background
When providing a search service for a user, a search engine mainly uses the mapping structures of an inverted index and a forward index to obtain information about target webpages and provide it to the user. The inverted index is an index structure that maps a keyword to a number of sorted webpages, and the forward index is an index structure that maps a specific webpage to the summary information of that webpage. During a search, the search engine first determines the search keywords from the search query entered by the user, and then retrieves a number of relevant target webpages according to the search keywords and the inverted index. After sorting these target webpages, it uses the forward index to provide the webpage summaries and the original addresses of the webpages to the user, who then decides from a webpage summary whether to click the original address and browse the webpage further.
Webpages on the Internet are updated very quickly, and update webpages are generated constantly (in this document, newly generated webpages, webpages to be deleted, and webpages with updates are collectively referred to as update webpages). To provide users with a more real-time and accurate search service, the search engine uses crawler software in the background to continuously look for update webpages and stores the summaries of the update webpages. At the same time, it incrementally updates the existing inverted index and forward index, that is, on top of the original index structure it newly creates an inverted index from keywords to the update webpage and a forward index from the update webpage to the corresponding webpage summary.
An incremental update of the forward index is accompanied by an incremental update of the stored webpage summaries. A webpage summary consists of several fields, such as author, keywords, title, creation time, update time, and page hits. When any field changes, a complete webpage summary containing all fields must be stored again, and the forward index entry corresponding to the webpage must be stored again as well. Typically, after the forward index has undergone several incremental updates, the large amount of newly added data reduces search efficiency, so a full update is performed, that is, the entire data structure of the forward index is updated.
It can be seen that, with the existing way of storing webpage summaries, an incremental update of a webpage summary requires incremental storage of all the fields the summary contains, so the amount of data stored in each incremental update is large. In addition, because webpages are updated frequently, a full update of the webpage summaries and the forward index becomes unavoidable after several incremental updates, and a full update involves an even larger amount of data, which consumes a great deal of time and device resources.
Summary of the invention
Embodiments of the present invention provide a data storage method and device to solve the technical problem that the prior-art webpage summary storage method has to perform a full update after several incremental updates, which consumes a great deal of time and device resources.
To solve the above technical problem, the embodiments of the present invention disclose the following technical solutions:
In one aspect, a data storage method is provided, the data including a webpage summary and index information of the webpage summary. The method includes: when the webpage summary is updated, determining an update field in the webpage summary and a field storage area corresponding to the update field; and newly adding an update storage area in the field storage area, and storing, in the update storage area, the field data of the update field after the current update and the index information of the field data.
Further, the update storage area includes a data storage area and a corresponding index storage area; the field data after the current update is stored in the data storage area, and the index information of the field data is stored in the index storage area.
Further, storing the index information of the field data in the index storage area includes: storing, in the index storage area, the webpage identifier corresponding to the field data and the storage location information of the field data in the data storage area.
Further, the method also includes: newly adding a webpage index table, and storing, in the webpage index table, the webpage identifier corresponding to the current update and the storage location information of the webpage identifier in the index storage area.
Further, storing the webpage identifier corresponding to the current update in the webpage index table includes: setting 2^N index sub-tables in the webpage index table and setting a corresponding N-bit binary table value for each index sub-table, where N is a preset integer greater than or equal to 1; and obtaining the binary value corresponding to the identifier of the webpage, and storing the webpage identifier in the index sub-table with the corresponding table value according to the first N bits of the binary value.
Further, the method also includes: presetting a number of field storage areas, and designating one or more corresponding fields for each field storage area.
Further, designating one or more corresponding fields for each field storage area includes: counting the update frequency of each field contained in the webpage summary, and designating one or more corresponding fields for each field storage area according to the update frequency.
Further, designating one or more corresponding fields for each field storage area according to the update frequency includes: placing fields with the same or similar update frequency in the same field storage area.
Further, the method also includes: determining whether there is a webpage to be deleted and, if so, setting a valid time for the webpage to be deleted in the newly added update storage area; and, once the valid time is reached, marking as invalid the field data and the corresponding index information stored for the webpage to be deleted at each update.
Further, the method also includes: marking as invalid the historical field data corresponding to the update field, and the corresponding historical index information, in the historical update storage areas.
Further, the method also includes: merging a number of update storage areas contained in the field storage area, and deleting, in the merged new update storage area, the field data marked as invalid and the corresponding index information.
Further, merging a number of update storage areas contained in the field storage area includes: selecting a number of update storage areas to be merged in the field storage area; calculating the sum of the numbers of valid field data contained in the update storage areas to be merged; and merging the update storage areas to be merged if the sum is less than a first preset threshold.
Further, selecting a number of update storage areas to be merged from the field storage area includes: calculating the number of valid field data contained in each update storage area; and selecting, from the field storage area, the update storage areas with the fewest valid field data as the update storage areas to be merged.
Further, selecting a number of update storage areas to be merged from the field storage area includes: calculating, for each update storage area, the ratio of the number of valid field data it contains to the total number of field data it contains; and selecting, in the field storage area, the update storage areas with the lowest ratio as the update storage areas to be merged.
In another aspect, the present invention provides a data storage device, the data including a webpage summary and index information of the webpage summary. The device includes: a determining unit, configured to determine, when the webpage summary is updated, an update field in the webpage summary and a field storage area corresponding to the update field; and a first storage unit, configured to newly add an update storage area in the field storage area and to store, in the update storage area, the field data of the update field after the current update and the index information of the field data.
Further, the update storage area includes a data storage area and a corresponding index storage area; the first storage unit includes a data storage subunit and an index storage subunit, where the data storage subunit is configured to store the field data after the current update in the data storage area, and the index storage subunit is configured to store the index information of the field data in the index storage area.
Further, the index storage subunit is configured to store, in the index storage area, the webpage identifier corresponding to the field data and the storage location information of the field data in the data storage area.
Further, the device also includes: a second storage unit, configured to newly add a webpage index table and to store, in the webpage index table, the webpage identifier corresponding to the current update and the storage location information of the webpage identifier in the index storage area.
Further, the second storage unit includes: a setting subunit, configured to set 2^N index sub-tables in the webpage index table and to set a corresponding N-bit binary table value for each index sub-table, where N is a preset integer greater than or equal to 1; and a webpage storage subunit, configured to obtain the binary value corresponding to the identifier of the webpage and to store the webpage identifier in the index sub-table with the corresponding table value according to the first N bits of the binary value.
Further, the device also includes: a setting unit, configured to preset a number of field storage areas and to designate one or more corresponding fields for each field storage area.
Further, the setting unit is configured to: count the update frequency of each field contained in the webpage summary, and designate one or more corresponding fields for each field storage area according to the update frequency.
Further, the device also includes: a determining setting unit, configured to determine whether there is a webpage to be deleted and, if so, to set a valid time for the webpage to be deleted in the newly added update storage area; and a first marking unit, configured to mark as invalid, once the valid time is reached, the field data and the corresponding index information stored for the webpage to be deleted at each update.
Further, the device also includes: a second marking unit, configured to mark as invalid the historical field data corresponding to the update field, and the corresponding historical index information, in the historical update storage areas.
Further, the device also includes: a merging unit, configured to merge a number of update storage areas contained in the field storage area; and a deleting unit, configured to delete, in the merged new update storage area, the field data and index information marked as invalid by the first marking unit and the second marking unit.
Further, the merging unit includes: a first selection subunit, configured to select a number of update storage areas to be merged in the field storage area; a first calculating subunit, configured to calculate the sum of the numbers of valid field data contained in the update storage areas to be merged; and a first merging subunit, configured to merge the update storage areas to be merged if the sum is less than a first preset threshold.
Further, the first selection subunit includes: a second calculating subunit, configured to calculate the number of valid field data contained in each update storage area; and a second selection subunit, configured to select, from the field storage area, the update storage areas with the fewest valid field data as the update storage areas to be merged.
Further, the first selection subunit includes: a third calculating subunit, configured to calculate, for each update storage area, the ratio of the number of valid field data it contains to the total number of field data it contains; and a third selection subunit, configured to select, in the field storage area, the update storage areas with the lowest ratio as the update storage areas to be merged.
The above technical solution provides a data storage method and device in which, when a webpage summary is updated, an update field in the webpage summary and a field storage area corresponding to the update field are determined; an update storage area is newly added in the field storage area, and the field data of the update field after the current update and the index information of the field data are stored in the update storage area.
Further, the setting unit is also configured to: place fields with the same or similar update frequency in the same field storage area.
The present application also provides a computer terminal for executing the program code of the steps provided by the above data storage method.
The present application also provides a storage medium for storing the program code executed by the above data storage method.
When storing a webpage summary, this technical solution only needs to incrementally update the update field and the corresponding index information, without incrementally updating the data of all the fields. This greatly reduces the amount of data stored in a single update, avoids an excessive amount of newly added data and the full update it would trigger, saves time and storage space overhead, and improves storage efficiency.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. A person of ordinary skill in the art can obviously derive other drawings from these drawings without creative effort.
FIG. 1 is a schematic flowchart of one embodiment of a data storage method provided by the present invention;
FIG. 2 is a schematic flowchart of another embodiment of a data storage method provided by the present invention;
FIG. 3 is a schematic diagram of the data storage structure corresponding to a data storage method provided by the present invention;
FIG. 4 is a schematic flowchart of another embodiment of a data storage method provided by the present invention;
FIG. 5 is a schematic structural diagram of one embodiment of a data storage device provided by the present invention;
FIG. 6 is a schematic structural diagram of another embodiment of a data storage device provided by the present invention;
FIG. 7 is a schematic structural diagram of another embodiment of a data storage device provided by the present invention;
FIG. 8 is a schematic structural diagram of another embodiment of a data storage device provided by the present invention;
FIG. 9 is a schematic structural diagram of one embodiment of the merging unit of a data storage device provided by the present invention.
Detailed Description
Embodiments of the data storage method of the present invention are described first. FIG. 1 is a schematic flowchart of one embodiment of the data storage method of the present invention. This embodiment includes the following steps:
Step 101: When a webpage summary is updated, determine the update field in the webpage summary and the field storage area corresponding to the update field.
As described in the Background section, a webpage summary usually contains multiple fields, such as author, keywords, body text, title, creation time, update time, and page click count. In general, not all fields change when a webpage is updated: fields such as author and creation time are very unlikely to change, whereas fields such as page click count and visitors are much more likely to. It is therefore necessary to determine which fields in the webpage summary have changed. For a newly created webpage or a webpage to be deleted, all fields in its webpage summary are treated as update fields.
Step 102: Newly add an update storage area to the field storage area, and store in the update storage area the field data of the update field after the current update and the index information of that field data.
Each field storage area contains several update storage areas. Each of them was newly added, at the time of some update of the webpage summary, to the field storage area corresponding to a field that changed in that update.
The index information is forward index information that maps a webpage to the field data of its summary. During a search, the inverted index is first used to retrieve the target webpages related to the search keywords, and the summary field data of each target webpage is then obtained from the forward index information.
In practice, an update storage period for webpage summaries can be preset, for example one day. At the preset update time each day, all webpages that were updated during the previous day are collected, and their changed summary fields and the corresponding index information are stored.
The technical solution of the above embodiment provides a data storage method and apparatus. When a webpage summary is updated, the update field in the webpage summary and the field storage area corresponding to the update field are determined; an update storage area is newly added to the field storage area, and the field data of the update field after the current update, together with the index information of that field data, is stored in the update storage area.
When storing webpage summaries, this embodiment incrementally updates only the update fields and their corresponding index information, rather than the data of all fields. This greatly reduces the amount of data stored in a single update, which avoids an excessive volume of newly added data and the full update it would trigger, saves time and storage overhead, and improves storage efficiency.
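Steps 101 and 102 can be illustrated with a minimal Python sketch. The class names `UpdateStore` and `FieldStore`, the function `apply_update`, and the in-memory layout are assumptions made for illustration only; the embodiment does not prescribe a concrete data structure.

```python
from dataclasses import dataclass, field

@dataclass
class UpdateStore:
    """One update storage area: updated field data plus its index information."""
    data: list = field(default_factory=list)    # field values after this update
    index: dict = field(default_factory=dict)   # (page_id, field_name) -> position in data

@dataclass
class FieldStore:
    """One field storage area; it gains at most one update store per update round."""
    fields: set
    updates: list = field(default_factory=list)

def apply_update(stores, changed):
    """changed maps page_id -> {field_name: new_value} for the current update."""
    for store in stores:
        new_area = UpdateStore()
        for page_id, changes in changed.items():
            for name, value in changes.items():
                if name in store.fields:  # step 101: find the matching field store
                    new_area.index[(page_id, name)] = len(new_area.data)
                    new_area.data.append(value)
        if new_area.data:  # step 102: only field stores with changes grow
            store.updates.append(new_area)

stable = FieldStore({"author", "created"})
volatile = FieldStore({"clicks", "updated"})
apply_update([stable, volatile], {"page-1": {"clicks": 42}})
```

In this sketch a change to the click count adds an update store only to the field store that holds `clicks`; the store holding `author` and `created` is left untouched.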
Optionally, in other embodiments of the present invention, several field storage areas may be set up in advance, and one or more fields assigned to each field storage area.
Preferably, the update frequency of each field in the webpage summary is measured in advance, and one or more fields are assigned to each field storage area according to these frequencies. Fields with the same or similar update frequencies can be placed in the same field storage area, so that when a webpage summary is updated, a new update storage area is added only to the field storage areas that contain changed fields.
For example, the fields can be divided by update frequency into three field storage areas: a stable storage area, a slow-changing storage area, and a volatile storage area. The stable storage area holds relatively stable fields such as author, keywords, and creation time; the slow-changing storage area holds fields that change infrequently, such as body text and title; and the volatile storage area holds fields that change readily, such as update time and page click count.
A person skilled in the art may also assign fields to different field storage areas based on experience or on statistics gathered in operation, in order to achieve higher update and storage efficiency. The assignment may be fixed, or it may be adjusted dynamically at regular intervals based on statistics from a given time period.
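The frequency-based assignment described above might be sketched as follows in Python. The function name `assign_field_stores`, the two cut-off values, and the sample frequencies are illustrative assumptions; the source does not give numeric thresholds.

```python
def assign_field_stores(update_freq, stable_max=0.01, slow_max=0.2):
    """Group fields into three field storage areas by observed update frequency.

    update_freq maps a field name to the fraction of update rounds in which
    that field changed. Both cut-offs are assumed values for illustration.
    """
    groups = {"stable": [], "slow": [], "volatile": []}
    for name, freq in sorted(update_freq.items()):
        if freq <= stable_max:
            groups["stable"].append(name)
        elif freq <= slow_max:
            groups["slow"].append(name)
        else:
            groups["volatile"].append(name)
    return groups

groups = assign_field_stores(
    {"author": 0.001, "title": 0.05, "body": 0.1, "clicks": 0.9}
)
```

Re-running such a function periodically on fresh statistics would correspond to the dynamic adjustment mentioned above.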
In another embodiment of the data storage method of the present invention, the update storage area may be divided into two parts, a data storage area and a corresponding index storage area. The field data after the current update is stored in the data storage area, and the index information of that field data is stored in the index storage area.
The index information may include the webpage identifier corresponding to the field data and the storage location of the field data within the data storage area. Because the number of webpages in each update is generally large, the data storage area holds many field data entries. To fetch a particular entry, its index information is read from the corresponding index storage area and then used to locate the entry in the data storage area.
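The split into a data storage area and an index storage area can be sketched in Python as below. The class name `SplitUpdateStore` and the choice of a byte offset and length as the storage location are assumptions for illustration; the source only requires that the index hold a webpage identifier and a location in the data storage area.

```python
class SplitUpdateStore:
    """An update storage area made of a data storage area and an index storage area."""

    def __init__(self):
        self.data = bytearray()  # data storage area: concatenated field data
        self.index = {}          # index storage area: page_id -> (offset, length)

    def put(self, page_id, value):
        raw = value.encode("utf-8")
        self.index[page_id] = (len(self.data), len(raw))
        self.data += raw

    def get(self, page_id):
        # Read the location from the index, then slice the data storage area.
        offset, length = self.index[page_id]
        return self.data[offset:offset + length].decode("utf-8")

store = SplitUpdateStore()
store.put("page-a", "hello")
store.put("page-b", "世界")
```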
FIG. 2 is a schematic flowchart of another embodiment of the data storage method of the present invention. This embodiment includes the following steps 201 to 204:
Step 201: When a webpage summary is updated, determine the update field in the webpage summary and the field storage area corresponding to the update field.
Step 202: Newly add an update storage area to the field storage area, the update storage area including a data storage area and a corresponding index storage area.
Step 203: Store the field data after the current update in the data storage area, and store in the index storage area the webpage identifier corresponding to the field data and the storage location of the field data within the data storage area.
Step 204: Newly add a webpage index table, and store in it the webpage identifiers involved in the current update and the storage location of each webpage identifier within the index storage area.
The newly added webpage index table stores the webpage identifiers involved in the current update. A webpage identifier may be the URL of the webpage or any other information that identifies the webpage. The webpage index table also stores the storage location of each webpage identifier within the index storage area, which is used to locate the target webpage identifier in the index storage area.
The order in which steps 203 and 204 are performed is not limited.
Step 204 may store the webpage identifier corresponding to the field data according to the following steps a) and b):
Step a): Set up 2^N index sub-tables in the webpage index table, and assign each index sub-table a corresponding N-bit binary table value, where N is a preset integer greater than or equal to 1.
Step b): Obtain the binary value corresponding to the webpage identifier, and store the webpage identifier in the index sub-table whose table value matches the first N bits of that binary value.
Each update involves a large number of webpages, and a webpage identifier is generally stored in the computer as a binary value with many bits. To make it fast to find the identifier of a target webpage in the webpage index table, the table can be divided into 2^N index sub-tables, each corresponding to one N-bit binary table value. When a webpage identifier is stored, it is placed in the index sub-table whose table value equals the first N bits of the identifier's binary value. When the identifier is looked up later, only the index sub-table selected by its first N bits needs to be searched, which greatly reduces lookup time.
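Steps a) and b) can be sketched in Python as follows. Deriving a 64-bit binary value from the URL with SHA-1 is an assumption made here for illustration (the source does not say how the binary value is obtained), and `PageIndexTable`, `page_key`, and `sub_table_id` are hypothetical names.

```python
import hashlib

N = 4  # number of prefix bits; the table then has 2**N sub-tables

def page_key(url):
    """64-bit binary value for a webpage identifier (hashing choice is an assumption)."""
    return int.from_bytes(hashlib.sha1(url.encode("utf-8")).digest()[:8], "big")

def sub_table_id(key, n=N, width=64):
    """The leading n bits of a width-bit key select the index sub-table."""
    return key >> (width - n)

class PageIndexTable:
    def __init__(self, n=N):
        self.n = n
        self.sub_tables = [dict() for _ in range(2 ** n)]  # step a)

    def put(self, url, index_position):
        key = page_key(url)
        self.sub_tables[sub_table_id(key, self.n)][key] = index_position  # step b)

    def get(self, url):
        key = page_key(url)
        # Only the one sub-table selected by the first n bits is searched.
        return self.sub_tables[sub_table_id(key, self.n)].get(key)

table = PageIndexTable()
table.put("http://example.com/a", 7)
```

With N = 4, each lookup inspects one of 16 sub-tables instead of the whole table.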
FIG. 3 is a schematic diagram of the data storage structure built by the above embodiments of the present invention. The Internet contains a very large number of webpages, so the summaries and index information of webpages whose hash values are the same or similar are generally stored in the same index shard. FIG. 3 illustrates a single index shard, which contains webpage index tables, field storage areas, and version information tables.
A webpage index table stores the list of webpage identifiers involved in one update, together with the storage location of each webpage identifier within the newly added update storage area. Webpage index table i is the webpage index table newly added at the i-th update. In practice, the update time can be appended to the name of each webpage index table. This distinguishes the different webpage index tables and also makes it easy to tell from the names which table is the latest.
FIG. 3 contains three field storage areas: a stable storage area, a slow-changing storage area, and a volatile storage area. Each field storage area contains several update storage areas. For example, update storage area i is the update storage area newly added to the slow-changing storage area at the i-th update, and it consists of data storage area i and the corresponding index storage area i. In practice, the update time can likewise be appended to the name of each update storage area, which distinguishes the different update storage areas and makes it easy to tell from the names which one is the latest.
The version information table records the latest version information, such as which webpage index tables the current index shard contains and which update storage areas each field storage area contains. This supports version management and ensures that forward index lookups use the latest index information and return the latest summary field data. Version information table i is the version information table newly added at the i-th update. In practice, the update time can also be appended to the name of each version information table, which distinguishes the different version information tables and makes it easy to tell from the names which one is the latest.
FIG. 4 is a schematic flowchart of another embodiment of a data storage method provided by the present invention. In this embodiment, steps 201 to 204 are as described for the corresponding steps in the above embodiment, and the embodiment further includes the following steps 205 to 208:
Step 205: Determine whether there is a webpage to be deleted; if so, set the valid time of the webpage to be deleted in the newly added update storage area.
Step 206: Once the valid time has been reached, mark as invalid the field data and corresponding index information that were stored for the webpage to be deleted in every update.
In the technical solution of the present invention, updates to webpage summaries and indexes are all incremental, and this also applies to webpages to be deleted. If a webpage that needs to be deleted is detected during the current update, all of its fields are treated as update fields, and an update storage area is added to each corresponding field storage area.
Because the webpage is to be deleted, all of its updated field data is empty. A preset marker can therefore be stored in the data storage area in place of the updated field data, and the position of that marker and the webpage identifier are stored in the index storage area accordingly.
In practice, the inverted index and the forward index may be updated at different times. Suppose the current update includes a webpage to be deleted. If the summary fields and index information of that webpage were deleted or marked invalid as soon as the webpage summary and the forward index were updated, the inverted index might not yet be fully updated; that is, the identifier of the webpage might still be present in the webpage list used by the inverted index. Lookups of the summary fields of that webpage could therefore still occur.
For this reason, this embodiment sets a "valid time" attribute for the webpage to be deleted. After the webpage summary and the corresponding index information are updated, the field data and index information of the webpage to be deleted are still retained for a period of time. Only once the valid time has been reached are the field data and corresponding index information stored for that webpage in every update marked invalid. This ensures that the identifier of the webpage has actually been removed from the webpage list used by the inverted index, so that no further lookups of its summary fields will occur, which resolves the update time difference between the inverted index and the forward index.
The "valid time" attribute can be stored in the index storage area together with the identifier of the webpage to be deleted. To keep the storage format consistent, the attribute can also be kept in every update storage area for webpages that do not need to be deleted, with its value set to invalid or to an unlimited duration. The attribute is given a real value in a newly added update storage area only when the webpage is found to need deletion.
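The "valid time" mechanism of steps 205 and 206 can be sketched in Python as below. The entry layout, the use of `math.inf` for the unlimited duration, and the names `mark_deleted` and `is_visible` are assumptions for illustration.

```python
import math

NO_EXPIRY = math.inf  # stands in for the "unlimited duration" of pages not being deleted

def mark_deleted(entry, now, grace_seconds):
    """Step 205: record the deletion but keep the entry readable for a grace period."""
    entry["valid_until"] = now + grace_seconds

def is_visible(entry, now):
    """Step 206: once the valid time is reached, the entry is treated as invalid."""
    return now < entry["valid_until"]

entry = {"page_id": "page-1", "valid_until": NO_EXPIRY}
mark_deleted(entry, now=1000, grace_seconds=3600)
```

The grace period would be chosen long enough for the inverted index to finish removing the webpage from its lists.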
Step 207: Mark as invalid the historical field data corresponding to the update field, and the corresponding historical index information, in the historical update storage areas.
Because webpage summary fields are updated incrementally in the present invention, several update storage areas may contain field data and corresponding index information for the same webpage from different update times. In that case, the most recently added update storage area is treated as the valid update storage area for that webpage, and the field data and corresponding index information it contains are the valid field data and index information. The historical field data corresponding to the update field, and the corresponding historical index information, in all earlier update storage areas are marked invalid.
In practice, the update time can be included in the file name of each update storage area, so that the valid update storage area can be identified quickly from the file names.
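The "newest update storage area wins" rule can be sketched in Python as follows. Modeling each update storage area as a dictionary and ordering the list from oldest to newest are assumptions for illustration.

```python
def lookup(update_stores, page_id):
    """Return the valid field data for a page.

    update_stores is ordered oldest to newest; each maps page_id -> field data.
    The newest store that contains the page is its valid update storage area.
    """
    for store in reversed(update_stores):
        if page_id in store:
            return store[page_id]
    return None

history = [
    {"p1": "clicks=10", "p2": "clicks=3"},  # older update
    {"p1": "clicks=25"},                    # newer update supersedes p1 only
]
```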
Step 208: Merge several update storage areas contained in the field storage area, and delete from the merged new update storage area the field data and corresponding index information marked as invalid.
As webpage summaries continue to be updated, each field storage area accumulates more and more update storage areas, and the update storage areas contain more and more invalid field data and index information. Too many update storage areas reduce the efficiency of the retrieval service, and the invalid field data and index information waste considerable storage space. A full update would involve a large amount of data. This embodiment therefore merges several update storage areas, which reduces their number, removes the invalid field data and corresponding index information they contain, and saves storage space. Incremental updating can thus continue indefinitely, avoiding the heavy time and equipment costs of a full update.
Specifically, step 208 may include the following steps c), d), and e):
Step c): Select several update storage areas to be merged from the field storage area.
Step d): Compute the sum of the amounts of valid field data contained in the update storage areas to be merged.
The amount of valid field data here may be the number of valid fields or the amount of memory occupied by the valid field data.
Step e): If the sum is less than a first preset threshold, merge the update storage areas to be merged.
Specifically, an upper limit can be set on the amount of field data and/or index information that each update storage area may contain, for example at most 100 field data entries per update storage area. Suppose two update storage areas are selected for merging and they contain 55 and 60 valid fields respectively. Their combined total of 115 valid fields exceeds 100, so these two update storage areas cannot be merged.
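Steps d) and e), together with the removal of invalid entries from step 208, can be sketched in Python as follows. The tuple layout `(page_id, data, valid)` and the function name `try_merge` are assumptions; the limit of 100 follows the example above.

```python
def try_merge(stores, limit=100):
    """Merge update stores if their combined valid entries stay under the limit.

    Each store is a list of (page_id, data, valid) tuples. Returns the merged
    store with invalid entries dropped, or None when the merge is not allowed.
    """
    valid_total = sum(1 for s in stores for (_, _, valid) in s if valid)  # step d)
    if valid_total >= limit:  # step e): merge only when the sum is below the threshold
        return None
    return [entry for s in stores for entry in s if entry[2]]

a = [("a%d" % i, "x", True) for i in range(55)]
b = [("b%d" % i, "x", True) for i in range(60)]
c = [("q1", "x", True), ("q2", "y", False)]
d = [("q3", "z", True)]
```

Here `try_merge([a, b])` is rejected because 55 + 60 = 115 is not below 100, while `try_merge([c, d])` succeeds and drops the invalid entry `q2`.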
Optionally, step c) may include the following sub-steps:
Compute the amount of valid field data contained in each update storage area.
Select from the field storage area the several update storage areas with the least valid field data as the update storage areas to be merged.
Deliberately selecting the update storage areas with the least valid field data makes it more likely that the selected update storage areas satisfy the merge condition of step e).
Optionally, step c) may instead include the following sub-steps:
For each update storage area, compute the ratio of the amount of valid field data it contains to the total amount of field data it contains.
Select from the field storage area the several update storage areas with the lowest ratios as the update storage areas to be merged.
When selecting update storage areas to be merged, either the update storage areas with the least valid field data or those with the lowest proportion of valid field data can be chosen deliberately. Update storage areas selected in this way are more likely to satisfy the merge condition of step e), which further saves time and space.
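The two selection strategies for step c) can be compared with a short Python sketch. The per-store summary dictionaries, the sample numbers, and the function names are assumptions for illustration.

```python
def pick_by_count(stores, k=2):
    """Select the k update stores with the least valid field data."""
    return sorted(stores, key=lambda s: s["valid"])[:k]

def pick_by_ratio(stores, k=2):
    """Select the k update stores with the lowest valid-to-total ratio.

    Assumes every store has total > 0.
    """
    return sorted(stores, key=lambda s: s["valid"] / s["total"])[:k]

stores = [
    {"name": "u1", "valid": 40, "total": 50},   # ratio 0.8
    {"name": "u2", "valid": 10, "total": 100},  # ratio 0.1
    {"name": "u3", "valid": 30, "total": 30},   # ratio 1.0
]
by_count = [s["name"] for s in pick_by_count(stores)]
by_ratio = [s["name"] for s in pick_by_ratio(stores)]
```

The two strategies can disagree: with these sample numbers the count-based choice is `u2` and `u3`, while the ratio-based choice is `u2` and `u1`, because `u3` is small but contains no invalid data to reclaim.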
The above technical solution provides embodiments of a data storage method. When a webpage summary is updated, the update field in the webpage summary and the field storage area corresponding to the update field are determined; an update storage area is newly added to the field storage area, and the field data of the update field after the current update, together with the index information of that field data, is stored in the update storage area.
When storing webpage summaries, this technical solution incrementally updates only the update fields and their corresponding index information, rather than the data of all fields. This greatly reduces the amount of data stored in a single update, which avoids an excessive volume of newly added data and the full update it would trigger, saves time and storage overhead, and improves storage efficiency.
Corresponding to the embodiments of the data storage method of the present invention, the present invention also provides embodiments of a data storage apparatus. FIG. 5 is a schematic structural diagram of one embodiment of a data storage apparatus provided by the present invention. The apparatus includes:
a determining unit 501, configured to determine, when a webpage summary is updated, the update field in the webpage summary and the field storage area corresponding to the update field;
a first storage unit 502, configured to newly add an update storage area to the field storage area and to store in the update storage area the field data of the update field after the current update and the index information of that field data.
Optionally, the update storage area includes a data storage area and a corresponding index storage area;
the first storage unit 502 includes a data storage subunit 5021 and an index storage subunit 5022;
the data storage subunit 5021 is specifically configured to store the field data after the current update in the data storage area;
the index storage subunit 5022 is specifically configured to store the index information of the field data in the index storage area.
Optionally, the index storage subunit 5022 is configured to store in the index storage area the webpage identifier corresponding to the field data and the storage location of the field data within the data storage area.
FIG. 6 is a schematic structural diagram of another embodiment of a data storage apparatus provided by the present invention. The apparatus further includes:
a second storage unit 503, configured to newly add a webpage index table and to store in it the webpage identifiers involved in the current update and the storage location of each webpage identifier within the index storage area.
Optionally, the second storage unit 503 includes:
a setting subunit 5031, configured to set up 2^N index sub-tables in the webpage index table and to assign each index sub-table a corresponding N-bit binary table value, where N is a preset integer greater than or equal to 1;
a webpage storage subunit 5032, configured to obtain the binary value corresponding to the webpage identifier and to store the webpage identifier in the index sub-table whose table value matches the first N bits of that binary value.
FIG. 7 is a schematic structural diagram of another embodiment of a data storage apparatus provided by the present invention. The apparatus further includes:
a setting unit 504, configured to preset several field storage areas and to assign one or more fields to each field storage area.
Optionally, the setting unit 504 is specifically configured to:
measure the update frequency of each field in the webpage summary, and assign one or more fields to each field storage area according to the update frequencies.
Optionally, as shown in FIG. 7, the apparatus further includes:
a determining and setting unit 505, configured to determine whether there is a webpage to be deleted and, if so, to set the valid time of the webpage to be deleted in the newly added update storage area;
a first marking unit 506, configured to mark as invalid, once the valid time has been reached, the field data and corresponding index information stored for the webpage to be deleted in every update.
Optionally, as shown in FIG. 7, the apparatus further includes:
a second marking unit 507, configured to mark as invalid the historical field data corresponding to the update field, and the corresponding historical index information, in the historical update storage areas.
如图8所示,为本发明一种数据存储装置提供的另一个实施例的结构示意图,所述装置还包括:FIG. 8 is a schematic structural diagram of another embodiment of a data storage device according to the present invention. The device further includes:
合并单元508,用于合并所述字段存储区包含的若干更新存储区;a merging unit 508, configured to merge the plurality of update storage areas included in the field storage area;
删除单元509,用于在合并后的新更新存储区中将所述第一标记单元和第二标记单元标记为无效的字段数据和索引信息删除。The deleting unit 509 is configured to delete the field data and the index information marked as invalid by the first marking unit and the second marking unit in the merged new update storage area.
FIG. 9 is a schematic structural diagram of an embodiment of the merging unit 508 of the data storage apparatus provided by the present invention. The merging unit 508 includes:
a first selection subunit 5081, configured to select, in the field storage area, several update storage areas to be merged;
a first calculation subunit 5082, configured to calculate the sum of the numbers of valid field data items contained in the update storage areas to be merged; and
a first merging subunit 5083, configured to merge the update storage areas to be merged if the sum is less than a first preset threshold.
Optionally, the first selection subunit 5081 includes:
a second calculation subunit 50811, configured to calculate, for each update storage area, the number of valid field data items it contains; and
a second selection subunit 50812, configured to select, from the field storage area, the several update storage areas containing the fewest valid field data items as the update storage areas to be merged.
Optionally, the first selection subunit 5081 may instead include:
a third calculation subunit (not shown in the figures), configured to calculate, for each update storage area, the ratio of the number of valid field data items it contains to the total number of field data items it contains; and
a third selection subunit (not shown in the figures), configured to select, in the field storage area, the several update storage areas with the lowest ratios as the update storage areas to be merged.
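The two candidate-selection strategies and the threshold check performed by the merging unit can be compared in a short Python sketch. Here each update storage area is reduced to a list of validity flags, and the function names, the number of candidates `k` and the threshold value are all assumptions chosen for illustration.

```python
def valid_count(area):
    """Number of valid field data items (area is a list of booleans)."""
    return sum(1 for flag in area if flag)


def valid_ratio(area):
    """Valid items divided by total items; an empty area counts as 0.0."""
    return valid_count(area) / len(area) if area else 0.0


def pick_by_count(areas, k):
    """Names of the k areas holding the fewest valid items."""
    return sorted(areas, key=lambda name: valid_count(areas[name]))[:k]


def pick_by_ratio(areas, k):
    """Names of the k areas with the lowest valid-to-total ratio."""
    return sorted(areas, key=lambda name: valid_ratio(areas[name]))[:k]


def should_merge(areas, candidates, threshold):
    """Merge only if the candidates' valid items sum to less than threshold."""
    return sum(valid_count(areas[name]) for name in candidates) < threshold


areas = {
    "a": [True, False, False, False],
    "b": [True, True],
    "c": [True, False, True, True, True, True],
}
print(pick_by_count(areas, 2), pick_by_ratio(areas, 2))
```

The two strategies can disagree: area `b` is small but fully valid, so counting picks it while the ratio-based rule prefers area `c`, which carries an invalid item. Checking the sum against the threshold keeps any single merged area from growing too large.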
It should be noted here that the functional units, subunits or modules in the above embodiments may run in a computer terminal as part of the apparatus, and the functions they implement may be executed by a processor in the computer terminal. The computer terminal may also be a terminal device such as a smartphone (for example an Android phone or an iOS phone), a tablet computer, a palmtop computer, a mobile Internet device (MID) or a PAD.
The technical solution of the data storage apparatus embodiments provided by the present invention is essentially the same as that of the data storage method embodiments above, so it is not explained in detail here; for related details, refer to the corresponding parts of the data storage method embodiments.
In the data storage apparatus embodiments provided by the above technical solution, when a web page summary is updated, the update field in the web page summary and the field storage area corresponding to the update field are determined; an update storage area is newly added in the field storage area, and the field data of the update field after the current update, together with the index information of that field data, is stored in the update storage area.
When storing a web page summary, this embodiment only needs to incrementally update the update fields and their corresponding index information rather than the data of all fields. This greatly reduces the amount of data stored in a single update, which avoids an excessively large volume of newly added data and the full update it would otherwise trigger, saves time and storage space, and improves storage efficiency.
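To make the field-level incremental update concrete, a minimal Python sketch follows. The class `FieldStorageArea`, the helper `store_update` and the in-memory list and dictionary layout are assumptions for illustration; the split of each update storage area into a data part and an index part mirrors the arrangement recited in claims 2 and 3.

```python
class FieldStorageArea:
    """Holds the update storage areas for one group of summary fields."""

    def __init__(self, fields):
        self.fields = set(fields)
        self.update_areas = []  # oldest first

    def apply_update(self, page_id, changed):
        """Append a new update area holding only this area's changed fields."""
        mine = {f: v for f, v in changed.items() if f in self.fields}
        if not mine:
            return None  # nothing in this field storage area changed
        data = []   # data storage area: field values
        index = {}  # index storage area: (page ID, field) -> position in data
        for field, value in mine.items():
            index[(page_id, field)] = len(data)
            data.append(value)
        area = {"data": data, "index": index}
        self.update_areas.append(area)
        return area


def store_update(field_areas, page_id, old_summary, new_summary):
    """Determine the update fields and write them to their storage areas."""
    changed = {f: v for f, v in new_summary.items() if old_summary.get(f) != v}
    written = [fa.apply_update(page_id, changed) for fa in field_areas]
    return [area for area in written if area is not None]


stable = FieldStorageArea(["title", "author"])
volatile = FieldStorageArea(["price"])
old = {"title": "T", "author": "A", "price": 10}
new = {"title": "T", "author": "A", "price": 12}
print(store_update([stable, volatile], "page-1", old, new))
```

Only the `price` value is written; the field storage area holding `title` and `author` receives no new update storage area, which is where the saving over rewriting the whole summary comes from.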
The functional modules provided by the embodiments of the present application may run in a mobile terminal, a computer terminal or a similar computing device, or may be stored as part of a storage medium.
Accordingly, an embodiment of the present invention may provide a computer terminal, which may be any computer terminal device in a group of computer terminals. Optionally, in this embodiment, the computer terminal may also be replaced with a terminal device such as a mobile terminal.
Optionally, in this embodiment, the computer terminal may be located in at least one of a plurality of network devices of a computer network.
In this embodiment, the computer terminal may execute program code for the following steps of the data storage method: when a web page summary is updated, determining the update field in the web page summary and the field storage area corresponding to the update field; newly adding an update storage area in the field storage area; and storing, in the update storage area, the field data of the update field after the current update and the index information of the field data.
Optionally, the computer terminal may include one or more processors, a memory, and a transmission device.
The memory may be used to store software programs and modules, such as the program instructions/modules corresponding to the data storage method and apparatus in the embodiments of the present invention. By running the software programs and modules stored in the memory, the processor executes various functional applications and data processing, that is, implements the data storage method described above. The memory may include high-speed random access memory and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory may further include memory located remotely from the processor, and such remote memory may be connected to the terminal over a network. Examples of such networks include, but are not limited to, the Internet, corporate intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device is used to receive or send data over a network. Specific examples of the network include wired and wireless networks. In one example, the transmission device includes a network adapter (Network Interface Controller, NIC), which can be connected to other network devices and a router by a network cable so as to communicate with the Internet or a local area network. In another example, the transmission device is a radio frequency (RF) module, which is used to communicate with the Internet wirelessly.
Specifically, the memory is used to store preset action conditions, information about users with preset permissions, and application programs.
The processor may, through the transmission device, call the information and application programs stored in the memory to execute the program code of the method steps of each optional or preferred embodiment in the above method embodiments.
A person of ordinary skill in the art will understand that the computer terminal may also be a terminal device such as a smartphone (for example an Android phone or an iOS phone), a tablet computer, a palmtop computer, a mobile Internet device (MID) or a PAD.
A person of ordinary skill in the art will understand that all or some of the steps of the methods in the above embodiments may be completed by a program instructing hardware related to the terminal device. The program may be stored in a computer-readable storage medium, which may include a flash drive, read-only memory (ROM), random access memory (RAM), a magnetic disk, an optical disc, and the like.
An embodiment of the present invention further provides a storage medium. Optionally, in this embodiment, the storage medium may be used to store the program code executed by the data storage method provided by the above method embodiments and apparatus embodiments.
Optionally, in this embodiment, the storage medium may be located in any computer terminal of a group of computer terminals in a computer network, or in any mobile terminal of a group of mobile terminals.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: when a web page summary is updated, determining the update field in the web page summary and the field storage area corresponding to the update field; newly adding an update storage area in the field storage area; and storing, in the update storage area, the field data of the update field after the current update and the index information of the field data.
Optionally, in this embodiment, the storage medium may also be configured to store program code for the various preferred or optional method steps provided by the data storage method.
The data storage method and apparatus according to the present invention have been described above by way of example with reference to the accompanying drawings. However, those skilled in the art should understand that various improvements may be made to the data storage method and apparatus proposed by the present invention without departing from the content of the present invention. Therefore, the scope of protection of the present invention should be determined by the content of the appended claims.
Those skilled in the art will clearly understand that the techniques in the embodiments of the present invention may be implemented by software plus the necessary general-purpose hardware, where general-purpose hardware includes general-purpose integrated circuits, general-purpose CPUs, general-purpose memory, general-purpose components and the like. They may of course also be implemented by dedicated hardware, including application-specific integrated circuits, dedicated CPUs, dedicated memory, dedicated components and the like, but in many cases the former is the better implementation. Based on this understanding, the technical solutions in the embodiments of the present invention, in essence or in the part that contributes to the prior art, may be embodied in the form of a software product. The computer software product may be stored in a storage medium, such as read-only memory (ROM), random access memory (RAM), a magnetic disk or an optical disc, and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device or the like) to perform the methods described in the embodiments of the present invention or in certain parts of the embodiments.
The embodiments in this specification are described in a progressive manner. For parts that are the same or similar across embodiments, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, the apparatus and system embodiments are described relatively briefly because they are substantially similar to the method embodiments; for related details, refer to the description of the method embodiments.
The embodiments of the present invention described above do not limit the scope of protection of the present invention. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (30)

  1. A data storage method, the data comprising a web page summary and index information of the web page summary, characterized in that the method comprises:
    when the web page summary is updated, determining an update field in the web page summary and a field storage area corresponding to the update field;
    newly adding an update storage area in the field storage area; and
    storing, in the update storage area, field data of the update field after the current update and index information of the field data.
  2. The method according to claim 1, characterized in that the update storage area comprises a data storage area and a corresponding index storage area, the field data after the current update being stored in the data storage area and the index information of the field data being stored in the index storage area.
  3. The method according to claim 2, characterized in that storing the index information of the field data in the index storage area comprises:
    storing, in the index storage area, a web page identifier corresponding to the field data and storage location information of the field data in the data storage area.
  4. The method according to claim 3, characterized in that the method further comprises:
    newly adding a web page index table, and storing, in the web page index table, the web page identifier corresponding to the current update and storage location information of the web page identifier in the index storage area.
  5. The method according to claim 4, characterized in that storing the web page identifier corresponding to the current update in the web page index table comprises:
    setting 2^N index sub-tables in the web page index table, and setting a corresponding N-bit binary table value for each index sub-table, N being a preset integer greater than or equal to 1;
    obtaining a binary value corresponding to the web page identifier; and
    storing the web page identifier, according to the first N bits of the binary value, in the index sub-table having the corresponding table value.
  6. The method according to any one of claims 1 to 5, characterized in that the method further comprises:
    presetting several field storage areas, and assigning one or more corresponding fields to each field storage area.
  7. The method according to claim 6, characterized in that assigning one or more corresponding fields to each field storage area comprises:
    counting the update frequency of each field contained in the web page summary, and assigning one or more corresponding fields to each field storage area according to the update frequencies.
  8. The method according to claim 6, characterized in that the method further comprises:
    determining whether there is a web page to be deleted;
    if so, setting a validity period for the web page to be deleted in the newly added update storage area; and
    once the validity period has elapsed, marking as invalid the field data and the corresponding index information stored for the web page to be deleted at each update.
  9. The method according to any one of claims 1 to 5 or 7 to 8, characterized in that the method further comprises:
    marking as invalid the historical field data corresponding to the update field, and the corresponding historical index information, in the historical update storage areas.
  10. The method according to claim 9, characterized in that the method further comprises:
    merging several update storage areas contained in the field storage area; and
    deleting, from the new update storage area obtained by the merge, the field data and index information marked as invalid.
  11. The method according to claim 10, characterized in that merging several update storage areas contained in the field storage area comprises:
    selecting, in the field storage area, several update storage areas to be merged;
    calculating the sum of the numbers of valid field data items contained in the update storage areas to be merged; and
    if the sum is less than a first preset threshold, merging the update storage areas to be merged.
  12. The method according to claim 11, characterized in that selecting several update storage areas to be merged from the field storage area comprises:
    calculating, for each update storage area, the number of valid field data items it contains; and
    selecting, from the field storage area, the several update storage areas containing the fewest valid field data items as the update storage areas to be merged.
  13. The method according to claim 11, characterized in that selecting several update storage areas to be merged from the field storage area comprises:
    calculating, for each update storage area, the ratio of the number of valid field data items it contains to the total number of field data items it contains; and
    selecting, in the field storage area, the several update storage areas with the lowest ratios as the update storage areas to be merged.
  14. The method according to claim 7, characterized in that assigning one or more corresponding fields to each field storage area according to the update frequencies comprises:
    placing fields with the same or similar update frequencies in the same field storage area.
  15. A data storage apparatus, the data comprising a web page summary and index information of the web page summary, characterized in that the apparatus comprises:
    a determining unit, configured to determine, when the web page summary is updated, an update field in the web page summary and a field storage area corresponding to the update field; and
    a first storage unit, configured to newly add an update storage area in the field storage area, and to store, in the update storage area, field data of the update field after the current update and index information of the field data.
  16. The apparatus according to claim 15, characterized in that the update storage area comprises a data storage area and a corresponding index storage area, wherein the first storage unit comprises:
    a data storage subunit, configured to store the field data after the current update in the data storage area; and
    an index storage subunit, configured to store the index information of the field data in the index storage area.
  17. The apparatus according to claim 16, characterized in that the index storage subunit is configured to store, in the index storage area, a web page identifier corresponding to the field data and storage location information of the field data in the data storage area.
  18. The apparatus according to claim 17, characterized in that the apparatus further comprises:
    a second storage unit, configured to newly add a web page index table, and to store, in the web page index table, the web page identifier corresponding to the current update and storage location information of the web page identifier in the index storage area.
  19. The apparatus according to claim 18, characterized in that the second storage unit comprises:
    a setting subunit, configured to set 2^N index sub-tables in the web page index table, and to set a corresponding N-bit binary table value for each index sub-table, N being a preset integer greater than or equal to 1; and
    a web page storage subunit, configured to obtain a binary value corresponding to the web page identifier, and to store the web page identifier, according to the first N bits of the binary value, in the index sub-table having the corresponding table value.
  20. The apparatus according to any one of claims 15 to 19, characterized in that the apparatus further comprises:
    a setting unit, configured to preset several field storage areas, and to assign one or more corresponding fields to each field storage area.
  21. The apparatus according to claim 20, characterized in that the setting unit is further configured to:
    count the update frequency of each field contained in the web page summary, and assign one or more corresponding fields to each field storage area according to the update frequencies.
  22. The apparatus according to claim 21, characterized in that the apparatus further comprises:
    a judging and setting unit, configured to determine whether there is a web page to be deleted and, if so, to set a validity period for the web page to be deleted in the newly added update storage area; and
    a first marking unit, configured to mark as invalid, once the validity period has elapsed, the field data and the corresponding index information stored for the web page to be deleted at each update.
  23. The apparatus according to any one of claims 15 to 20 or 21 to 22, characterized in that the apparatus further comprises:
    a second marking unit, configured to mark as invalid the historical field data corresponding to the update field, and the corresponding historical index information, in the historical update storage areas.
  24. The apparatus according to claim 23, characterized in that the apparatus further comprises:
    a merging unit, configured to merge several update storage areas contained in the field storage area; and
    a deleting unit, configured to delete, from the new update storage area obtained by the merge, the field data and index information marked as invalid by the first marking unit and the second marking unit.
  25. The apparatus according to claim 24, characterized in that the merging unit comprises:
    a first selection subunit, configured to select, in the field storage area, several update storage areas to be merged;
    a first calculation subunit, configured to calculate the sum of the numbers of valid field data items contained in the update storage areas to be merged; and
    a first merging subunit, configured to merge the update storage areas to be merged if the sum is less than a first preset threshold.
  26. The apparatus according to claim 25, characterized in that the first selection subunit comprises:
    a second calculation subunit, configured to calculate, for each update storage area, the number of valid field data items it contains; and
    a second selection subunit, configured to select, from the field storage area, the several update storage areas containing the fewest valid field data items as the update storage areas to be merged.
  27. The apparatus according to claim 25, characterized in that the first selection subunit comprises:
    a third calculation subunit, configured to calculate, for each update storage area, the ratio of the number of valid field data items it contains to the total number of field data items it contains; and
    a third selection subunit, configured to select, in the field storage area, the several update storage areas with the lowest ratios as the update storage areas to be merged.
  28. The apparatus according to claim 21, characterized in that the setting unit is further configured to:
    place fields with the same or similar update frequencies in the same field storage area.
  29. A computer terminal, configured to execute program code for the steps provided by the data storage method according to any one of claims 1 to 14.
  30. A storage medium, configured to store the program code executed by the data storage method according to any one of claims 1 to 14.
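By way of illustration, the partitioning of the web page index table into 2^N index sub-tables recited in claims 5 and 19 might be sketched in Python as follows. The class name, the fixed identifier width `id_bits` and the use of integer web page identifiers are assumptions made for this sketch; the claims only require that the first N bits of the identifier's binary value select the sub-table.

```python
class PageIndexTable:
    """Web page index table split into 2**N index sub-tables."""

    def __init__(self, n_bits, id_bits=32):
        self.n_bits = n_bits
        self.id_bits = id_bits  # assumed fixed width of a page identifier
        # One sub-table per N-bit binary table value, e.g. "00", "01", ...
        self.sub_tables = {
            format(value, f"0{n_bits}b"): {} for value in range(2 ** n_bits)
        }

    def table_value(self, page_id):
        """First N bits of the identifier's fixed-width binary value."""
        return format(page_id, f"0{self.id_bits}b")[: self.n_bits]

    def store(self, page_id, index_position):
        """Record where the page identifier sits in the index storage area."""
        self.sub_tables[self.table_value(page_id)][page_id] = index_position

    def lookup(self, page_id):
        """Search only the one sub-table that can contain the identifier."""
        return self.sub_tables[self.table_value(page_id)].get(page_id)


table = PageIndexTable(n_bits=2, id_bits=8)
table.store(0b10110001, index_position=7)
print(table.table_value(0b10110001), table.lookup(0b10110001))
```

A lookup touches a single sub-table, so each search covers roughly 1/2^N of the stored identifiers when the leading bits are evenly distributed.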
PCT/CN2016/078369 2015-04-02 2016-04-01 Data storage method and device WO2016155669A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510155130.7A CN104794177B (en) 2015-04-02 2015-04-02 A kind of date storage method and device
CN201510155130.7 2015-04-02

Publications (1)

Publication Number Publication Date
WO2016155669A1 true WO2016155669A1 (en) 2016-10-06

Family

ID=53558969

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/078369 WO2016155669A1 (en) 2015-04-02 2016-04-01 Data storage method and device

Country Status (2)

Country Link
CN (1) CN104794177B (en)
WO (1) WO2016155669A1 (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104794177B (en) * 2015-04-02 2016-10-12 广州神马移动信息科技有限公司 A kind of date storage method and device
CN105138562A (en) * 2015-07-23 2015-12-09 小米科技有限责任公司 Data processing method and device of relational database
CN105068843A (en) * 2015-08-24 2015-11-18 北京网田科技发展有限公司 Data updating method of automobile recommendation program and data updating system of automobile recommendation program
CN105205688A (en) * 2015-08-25 2015-12-30 北京网田科技发展有限公司 Automobile information recommendation system
CN105223405B (en) * 2015-10-23 2017-12-05 上海理工大学 The determination method of the data storage frequency of battery management system
CN107315693B (en) * 2016-04-26 2020-06-09 阿里巴巴集团控股有限公司 Data storage method and device
CN108089879B (en) * 2016-11-21 2021-11-26 阿里巴巴(中国)有限公司 Incremental updating method, equipment and programmable equipment
CN109408599B (en) * 2018-09-20 2021-09-28 佛山科学技术学院 Distributed storage method for big data
CN109739857B (en) * 2018-12-28 2020-09-01 深圳市网心科技有限公司 Data distributed writing method and device under high concurrency, terminal and storage medium
CN110309162A (en) * 2019-06-14 2019-10-08 福建天泉教育科技有限公司 A kind of optimization method and server-side of ES more new data
CN111752941B (en) * 2019-07-31 2024-05-17 北京京东尚科信息技术有限公司 Data storage and access method and device, server and storage medium
CN111241135B (en) * 2019-12-31 2024-04-09 广州酷旅旅行社有限公司 Commodity searching method, commodity searching device, computer equipment and storage medium
CN114748875B (en) * 2022-05-20 2023-03-24 一点灵犀信息技术(广州)有限公司 Data saving method, device, equipment, storage medium and program product

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1831808A (en) * 2005-03-11 2006-09-13 佛山市顺德区顺达电脑厂有限公司 System for timing updating web specific field and its method
US20070067305A1 (en) * 2005-09-21 2007-03-22 Stephen Ives Display of search results on mobile device browser with background process
CN102831252A (en) * 2012-09-21 2012-12-19 北京奇虎科技有限公司 Method and device for updating index database and search method and system
CN104794177A (en) * 2015-04-02 2015-07-22 广州神马移动信息科技有限公司 Data storing method and device

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5233233B2 (en) * 2007-10-05 2013-07-10 日本電気株式会社 Information search system, information search index registration device, information search method and program
CN104468807B (en) * 2014-12-12 2018-11-13 北京易网无际科技有限公司 Carry out processing method, high in the clouds device, local device and the system of web cache

Also Published As

Publication number Publication date
CN104794177B (en) 2016-10-12
CN104794177A (en) 2015-07-22

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16771424

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16771424

Country of ref document: EP

Kind code of ref document: A1