CN109033462A - Method and system for determining low-frequency data items in storage devices for big data storage - Google Patents

Method and system for determining low-frequency data items in storage devices for big data storage

Info

Publication number
CN109033462A
Authority
CN
China
Prior art keywords
data item
data
storage
access
equipment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811006475.6A
Other languages
Chinese (zh)
Other versions
CN109033462B (en
Inventor
Inventor name not disclosed
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Sibeishou Engineering Consulting Co ltd
Original Assignee
Du Guangxiang
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Du Guangxiang
Priority to CN201811006475.6A
Publication of CN109033462A
Application granted
Publication of CN109033462B
Legal status: Active
Anticipated expiration

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and system for determining low-frequency data items in storage devices for big data storage. The method includes: upon determining that no data access operation is running on any storage device in the big data storage system, determining an access information statistics file for each storage device; determining, based on the access information statistics file, the preselected data items among all data items of each storage device whose access count in the current statistical period is below a low-frequency count threshold; determining the total storage capacity of each storage device from the device description information in the system recording device of the big data storage system; determining the free storage capacity of each storage device from the storage information file in the storage information area of that device; determining the low-frequency coefficient of each preselected data item in each storage device; and determining as low-frequency data items those preselected data items in each storage device whose low-frequency coefficient is below a low-frequency coefficient threshold.

Description

Method and system for determining low-frequency data items in storage devices for big data storage
Technical field
The present invention relates to the fields of big data storage and cloud storage, and more particularly to a method and system for determining low-frequency data items in storage devices for big data storage.
Background
At present, as information devices of all kinds are used more and more frequently, data volume is growing explosively at a geometric rate. To obtain useful information from massive data, the data must be stored effectively, and a big data storage system can meet this demand. However, current big data storage systems cannot identify the low-frequency data items in their storage devices. In general, as low-frequency data items gradually accumulate in a storage device, they can seriously reduce the data access efficiency of the storage device and even of the whole big data storage system.
Summary of the invention
According to one aspect of the present invention, a method for determining low-frequency data items in storage devices for big data storage is provided, the method comprising:
In response to receiving a request to determine low-frequency data items in each of the multiple storage devices used for big data storage in the big data storage system, redirecting new data access requests that the big data storage system receives from any data requester to the system buffer device of the big data storage system, instead of sending them to the corresponding storage devices. The system buffer device matches the description information of the query condition contained in each new data access request against the content of each temporary data item in its temporary data item set to determine a content matching degree for each temporary data item, selects at least one temporary data item whose content matching degree is greater than a matching degree threshold, sends the selected temporary data item or items to the data requester indicated by the new data access request, and saves the new data access request in the buffer area of the system buffer device;
Upon determining that no data access operation is running on any storage device in the big data storage system, obtaining the running log file of each of the multiple storage devices in the big data storage system; determining, based on the current statistical period and the running log file of each storage device, the aggregated access information of the data items stored on that device; and determining the access information statistics file of each storage device from a preset access interval threshold and that aggregated access information, where an access interval is the period between two adjacent accesses to a data item. The access information statistics file includes a frequency statistics table containing multiple frequency records, and each frequency record is an 8-tuple <identifier of the data item, access count, statistics start time, statistics end time, storage size, number of access intervals greater than the access interval threshold, maximum access interval, minimum access interval>;
Determining, based on the access information statistics file, the preselected data items among all data items of each storage device whose access count in the current statistical period is less than a low-frequency count threshold; determining the total storage capacity of each storage device from the device description information in the system recording device of the big data storage system; determining the free storage capacity of each storage device from the storage information file in the storage information area of that device; and determining the low-frequency coefficient of each preselected data item in each storage device according to the following formula:
[The formula appears as an image in the original publication and is not reproduced in this text.]
where DTF_i is the low-frequency coefficient of the i-th preselected data item in the current storage device; t_imax is the maximum access interval among the access intervals of the i-th preselected data item; t_imin is the minimum access interval among the access intervals of the i-th preselected data item; t_ibegin is the statistics start time of the i-th preselected data item; t_iend is the statistics end time of the i-th preselected data item; C is the total storage capacity of the current storage device; R is the free storage capacity of the current storage device; UN_i is the number of access intervals of the i-th preselected data item that are greater than the access interval threshold; AN_i is the access count of the i-th preselected data item; i is a natural number with 1 ≤ i ≤ PT; and PT is the number of preselected data items in the current storage device, with PT ≥ 100; and
Determining as low-frequency data items those preselected data items in each storage device whose low-frequency coefficient is less than a low-frequency coefficient threshold.
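The two-stage selection described above (preselect by access count, then classify by low-frequency coefficient) can be sketched as follows. This is a minimal illustration, not the patented implementation: because the coefficient formula is not available in this text, the coefficient is supplied by the caller through the hypothetical parameter `coefficient_of`, and the function name `select_low_frequency` is likewise an assumption. The default thresholds (100 and 120) are among the values the description lists.

```python
from typing import Callable, Dict, List


def select_low_frequency(
    access_counts: Dict[str, int],
    coefficient_of: Callable[[str], float],
    count_threshold: int = 100,
    coefficient_threshold: float = 120.0,
) -> List[str]:
    """Return identifiers of items classified as low-frequency on one device.

    coefficient_of is a caller-supplied stand-in for the low-frequency
    coefficient formula, which is not reproduced in the text.
    """
    # Stage 1: preselect items accessed fewer times than the count threshold.
    preselected = [
        item_id
        for item_id, count in access_counts.items()
        if count < count_threshold
    ]
    # Stage 2: keep preselected items whose coefficient is below the threshold.
    return [
        item_id
        for item_id in preselected
        if coefficient_of(item_id) < coefficient_threshold
    ]
```

Passing the coefficient as a callable keeps the sketch independent of the missing formula; any function of the quantities defined above (intervals, statistics period, capacities, counts) could be plugged in.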
Wherein, when a data management device located outside the big data storage system needs to determine low-frequency data items in the storage devices of the big data storage system, the data management device sends to the big data storage system the request to determine low-frequency data items in each of the multiple storage devices used for big data storage;
Wherein redirecting new data access requests that the big data storage system receives from any data requester to the system buffer device of the big data storage system, instead of sending them to the corresponding storage devices, includes:
Starting from the moment the big data storage system receives the request to determine low-frequency data items, redirecting new data access requests that the big data storage system subsequently receives from any data requester to the system buffer device of the big data storage system, instead of sending them to the corresponding storage devices among the multiple storage devices;
Wherein the new data access request includes a query condition and description information of the query condition; the temporary data item set includes multiple temporary data items, and each temporary data item has summary information that briefly describes the content of the temporary data item;
Wherein the system buffer device matching the description information of the query condition contained in the new data access request against each temporary data item in its temporary data item set to determine the content matching degree of each temporary data item includes:
The system buffer device compares the description information of the query condition contained in the new data access request with the summary information of each temporary data item in its temporary data item set, using content matching based on semantic content, content matching based on keywords, or content matching that combines semantic content and keywords, to determine the content matching degree between each temporary data item and the query condition;
Wherein the matching degree threshold is 60%, and the content matching degree lies in the range [0%, 100%];
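A keyword-based variant of the content matching described above might look like the following sketch. The scoring rule (the share of query keywords that also occur in the summary) and the names `keyword_match_degree` and `select_temporary_items` are assumptions made for illustration; the text does not specify how the matching degree is computed. Only the 60% threshold and the [0%, 100%] range are taken from the text.

```python
from typing import Dict, List


def keyword_match_degree(query_description: str, summary: str) -> float:
    """Share of query keywords found in the summary, in the range [0.0, 1.0]."""
    query_words = set(query_description.lower().split())
    summary_words = set(summary.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & summary_words) / len(query_words)


def select_temporary_items(
    query_description: str,
    summaries: Dict[str, str],
    threshold: float = 0.60,
) -> List[str]:
    """Return ids of temporary items whose matching degree exceeds the threshold."""
    return [
        item_id
        for item_id, summary in summaries.items()
        if keyword_match_degree(query_description, summary) > threshold
    ]
```

A semantic or combined matcher could replace `keyword_match_degree` without changing the selection step, as long as it returns a value in the same range.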
Wherein, after the new data access request is saved in the buffer area of the system buffer device, the method further includes: sending to the data requester indicated by the new data access request a response message indicating that the big data storage system has paused data access and that the new data access request has been saved in the buffer area of the system buffer device. The response message carries information showing the current queue position of the requester's new data access request in the buffer area. The current queue order of the new data access requests in the buffer area is determined by how long each request has been saved there, and the requests are sorted in descending order of saved duration.
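The queue ordering described above (longest-saved request first) can be sketched as follows. The `BufferedRequest` type, its fields, and the function names are hypothetical; timestamps are assumed to be seconds on a common clock.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BufferedRequest:
    request_id: str
    saved_at: float  # time at which the request was saved in the buffer area


def queue_order(requests: List[BufferedRequest], now: float) -> List[BufferedRequest]:
    """Sort requests in descending order of how long they have been saved."""
    return sorted(requests, key=lambda r: now - r.saved_at, reverse=True)


def queue_position(requests: List[BufferedRequest], request_id: str, now: float) -> int:
    """Return the 1-based queue position to report in the response message."""
    for position, request in enumerate(queue_order(requests, now), start=1):
        if request.request_id == request_id:
            return position
    raise KeyError(request_id)
```

Sorting by descending saved duration is equivalent to first-in, first-out order, so requests buffered earliest are served first once access resumes.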
Wherein each storage device saves its own running log file in its system data area;
Wherein the current statistical period is the period that starts on the day before the date on which the big data storage system receives the request to determine low-frequency data items and extends backward by a predetermined number of calendar days; the predetermined number of calendar days is 10, 20 or 30;
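The statistical period defined above can be computed as in the sketch below. It assumes that the period consists of exactly the predetermined number of calendar days and ends on the day before the request date; the function name `statistics_window` is hypothetical.

```python
from datetime import date, timedelta
from typing import Tuple


def statistics_window(request_date: date, days: int = 10) -> Tuple[date, date]:
    """Return (first_day, last_day) of the statistical period, both inclusive.

    Assumption: the period covers `days` calendar days and ends on the day
    before the date on which the request was received.
    """
    if days < 1:
        raise ValueError("days must be at least 1")
    last_day = request_date - timedelta(days=1)
    first_day = last_day - timedelta(days=days - 1)
    return first_day, last_day
```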
Wherein determining, based on the current statistical period and the running log file of each storage device, the aggregated access information of the data items stored on each storage device includes:
Selecting, based on the current statistical period, from all log records in the running log file of each storage device the log records of that device that fall within the current statistical period;
Classifying the log records of each storage device within the current statistical period by data item to obtain the aggregated access information of each data item;
Forming the aggregated access information of the data items stored on each storage device from the aggregated access information of each data item;
Wherein each log record includes: the identifier of the data item, the access start time, the access end time, the storage size and the storage start time;
Wherein each data item has summary information that briefly describes the content of the data item.
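The selection and classification of log records described above can be sketched as follows. The dictionary keys (`item_id`, `access_start`) and the rule that a record belongs to the period when its access start time falls inside it are assumptions; the text does not state how a record is assigned to the period.

```python
from collections import defaultdict
from typing import Dict, List


def group_log_records(
    records: List[dict], window_start: float, window_end: float
) -> Dict[str, List[dict]]:
    """Keep records inside the statistical period and group them by data item."""
    grouped: Dict[str, List[dict]] = defaultdict(list)
    for record in records:
        # Assumption: membership is decided by the access start time.
        if window_start <= record["access_start"] <= window_end:
            grouped[record["item_id"]].append(record)
    return dict(grouped)
```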
Wherein the preset access interval threshold is 5 minutes, 10 minutes, 15 minutes or 20 minutes.
Determining the access information statistics file of each storage device from the preset access interval threshold and the aggregated access information of the data items stored on each storage device includes:
Aggregating the access information of each of the data items stored on each storage device to determine the access count and all access intervals of each data item;
Determining, from all access intervals of each data item, the number of access intervals greater than the access interval threshold, the maximum access interval and the minimum access interval of that data item;
Taking the access start time of the first access in the aggregated access information of each data item as the statistics start time, and the access end time of the last access in the aggregated access information of each data item as the statistics end time;
Determining the storage size of each data item from the aggregated access information of that data item.
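The steps above that derive one frequency record (the 8-tuple) for a data item can be sketched as follows. The text defines an access interval only as the period between two adjacent accesses; this sketch assumes it is measured from the end of one access to the start of the next. The `FrequencyRecord` type and `build_frequency_record` function are hypothetical names, and times are assumed to be in seconds (300 seconds corresponds to the 5-minute threshold).

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class FrequencyRecord:
    item_id: str
    access_count: int
    stat_start: float
    stat_end: float
    size: int
    intervals_over_threshold: int
    max_interval: Optional[float]  # None when the item was accessed only once
    min_interval: Optional[float]


def build_frequency_record(
    item_id: str,
    accesses: List[Tuple[float, float]],  # (access start, access end) pairs
    size: int,
    interval_threshold: float = 300.0,
) -> FrequencyRecord:
    """Aggregate the accesses of one data item into a frequency record."""
    if not accesses:
        raise ValueError("at least one access is required")
    ordered = sorted(accesses)
    # Assumption: interval = start of next access minus end of previous access.
    intervals = [nxt[0] - prev[1] for prev, nxt in zip(ordered, ordered[1:])]
    return FrequencyRecord(
        item_id=item_id,
        access_count=len(ordered),
        stat_start=ordered[0][0],   # start of the first access
        stat_end=ordered[-1][1],    # end of the last access
        size=size,
        intervals_over_threshold=sum(1 for g in intervals if g > interval_threshold),
        max_interval=max(intervals) if intervals else None,
        min_interval=min(intervals) if intervals else None,
    )
```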
The low-frequency count threshold is 100, 150 or 200;
The device description information in the system recording device includes: the total number of storage devices included in the big data storage system, the total storage capacity of each storage device, the network address of each storage device and/or the time at which each storage device joined the big data storage system;
The storage information file in the storage information area of each storage device includes: the total number of data items, the storage size of each data item, the initial storage time of each data item, the identifier of each data item, the summary information of each data item and the free storage capacity of the storage device;
The low-frequency coefficient threshold is 120, 160 or 220.
After the preselected data items in each storage device whose low-frequency coefficient is less than the low-frequency coefficient threshold are determined as low-frequency data items, the method further includes:
Determining as candidate data items those data items, among all data items of each storage device, whose access count is greater than 2 times the low-frequency count threshold, so that the candidate data items of each device form its candidate data item set; and forming the low-frequency data item set of each storage device from the low-frequency data items on that device whose low-frequency coefficient is less than the low-frequency coefficient threshold;
For the current storage device among the multiple storage devices:
When the number of low-frequency data items in the low-frequency data item set of the current storage device is less than or equal to the number of candidate data items in its candidate data item set, sorting all low-frequency data items in the low-frequency data item set in ascending order of access count to generate a first sorted list, and taking the low-frequency data item ranked first in the first sorted list as the current low-frequency data item,
6.1. Matching the summary information of the current low-frequency data item against the summary information of each candidate data item in the candidate data item set to determine the content matching degree between the current low-frequency data item and each candidate data item;
6.2. Combining the current low-frequency data item with the candidate data item that has the highest content matching degree with it among all candidate data items in the candidate data item set, to form a new data item, and saving the new data item in the free storage space of the current storage device;
6.3. Deleting from the candidate data item set the candidate data item that has the highest content matching degree with the current low-frequency data item;
6.4. Determining whether the first sorted list contains a low-frequency data item ranked immediately after the current low-frequency data item; if so, proceeding to step 6.5; if not, ending;
6.5. Selecting the low-frequency data item ranked immediately after the current low-frequency data item in the first sorted list as the current low-frequency data item, and returning to step 6.1;
Or, when the number of low-frequency data items in the low-frequency data item set of the current storage device is greater than the number of candidate data items in its candidate data item set, grouping all low-frequency data items in the low-frequency data item set into multiple low-frequency data item groups such that the total access count of all low-frequency data items in each group is greater than 1.5 times the low-frequency count threshold, and determining the average access count of the low-frequency data items in each group, wherein the absolute difference between the average access counts of the groups is less than 20.
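Steps 6.1 to 6.5 amount to a greedy pairing loop, which the sketch below illustrates for the first case only (no more low-frequency items than candidate items, the "data items to be selected" of the text). The matching function is supplied by the caller, the function name `pair_low_frequency_items` is hypothetical, and the sketch only returns the pairings; combining the paired items and saving the result to free storage space is left out.

```python
from typing import Callable, Dict, List, Tuple


def pair_low_frequency_items(
    low_freq_counts: Dict[str, int],
    candidates: List[str],
    match: Callable[[str, str], float],
) -> List[Tuple[str, str]]:
    """Pair each low-frequency item with its best-matching remaining candidate."""
    if len(low_freq_counts) > len(candidates):
        raise ValueError("this sketch covers only the first case of the text")
    remaining = list(candidates)
    pairs: List[Tuple[str, str]] = []
    # First sorted list: low-frequency items in ascending order of access count.
    for item in sorted(low_freq_counts, key=low_freq_counts.get):
        # 6.1 / 6.2: pick the candidate with the highest content matching degree.
        best = max(remaining, key=lambda candidate: match(item, candidate))
        pairs.append((item, best))
        # 6.3: the chosen candidate is removed from the candidate set.
        remaining.remove(best)
        # 6.4 / 6.5: the loop moves on to the next item in the sorted list.
    return pairs
```

Because each candidate is removed once used, the least-accessed low-frequency items get first choice of the best-matching candidates.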
After the preselected data items in each storage device whose low-frequency coefficient is less than the low-frequency coefficient threshold are determined as low-frequency data items, the method further includes:
Performing data access operations for each data access request in the buffer area of the system buffer device according to the current queue order of the data access requests in the buffer area;
Upon determining that no saved data access request remains in the buffer area of the system buffer device, parsing a new data access request that the big data storage system receives from any data requester to obtain a new query condition;
Determining, in the directory storage server of the big data storage system, the data items involved in the new query condition, and determining at least one target storage device involved in those data items;
Sending the new query condition to each target storage device, and receiving from each target storage device at least one data item that satisfies the new query condition;
Forming a target data item set from all data items received from the target storage devices, and sending the target data item set to the data requester indicated by the new data access request.
Wherein performing data access operations for each data access request in the buffer area according to the current queue order of the data access requests in the buffer area of the system buffer device includes:
8.1. Determining the currently processed data access request according to the current queue order of the data access requests in the buffer area of the system buffer device, wherein the currently processed data access request is the request ranked first in the current queue order;
8.2. Parsing the currently processed data access request to obtain the currently processed query condition;
8.3. Determining, in the directory storage server of the big data storage system, the data items involved in the currently processed query condition, and determining at least one target storage device in the big data storage system involved in those data items;
8.4. Sending the currently processed query condition to each target storage device, and receiving from each target storage device at least one data item that satisfies the currently processed query condition;
8.5. Forming a target data item set from all data items received from the target storage devices, and sending the target data item set to the data requester indicated by the currently processed data access request;
8.6. Deleting the data access request ranked first in the current queue order of the data access requests in the buffer area;
8.7. Determining whether any saved data access request remains in the buffer area of the system buffer device; if so, returning to step 8.1; if not, determining that no saved data access request remains in the buffer area of the system buffer device.
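Steps 8.1 to 8.7 describe draining the buffer in queue order. A minimal sketch follows; the request layout (a dict with `requester` and `condition` keys) and the three callables standing in for the directory storage server lookup, the storage device query and the reply to the requester are all assumptions for illustration.

```python
from collections import deque
from typing import Callable, Deque, List


def drain_buffer(
    queue: Deque[dict],
    locate_devices: Callable[[str], List[str]],
    query_device: Callable[[str, str], List[str]],
    reply: Callable[[str, List[str]], None],
) -> None:
    """Process buffered requests in queue order until the buffer is empty."""
    while queue:                                  # 8.7: repeat while requests remain
        request = queue[0]                        # 8.1: first request in queue order
        condition = request["condition"]          # 8.2: obtain the query condition
        targets = locate_devices(condition)       # 8.3: directory lookup of target devices
        items: List[str] = []
        for device in targets:                    # 8.4: query each target device
            items.extend(query_device(device, condition))
        reply(request["requester"], items)        # 8.5: send the target data item set
        queue.popleft()                           # 8.6: delete the processed request
```

A request is removed only after its reply has been sent, mirroring the order of steps 8.5 and 8.6.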
According to another aspect of the present invention, a system for determining low-frequency data items in storage devices for big data storage is provided, the system comprising:
A preprocessing unit which, in response to receiving a request to determine low-frequency data items in each of the multiple storage devices used for big data storage in the big data storage system, redirects new data access requests that the big data storage system receives from any data requester to the system buffer device of the big data storage system, instead of sending them to the corresponding storage devices. The system buffer device matches the description information of the query condition contained in each new data access request against the content of each temporary data item in its temporary data item set to determine a content matching degree for each temporary data item, selects at least one temporary data item whose content matching degree is greater than a matching degree threshold, sends the selected temporary data item or items to the data requester indicated by the new data access request, and saves the new data access request in the buffer area of the system buffer device;
A statistics unit which, upon determining that no data access operation is running on any storage device in the big data storage system, obtains the running log file of each of the multiple storage devices in the big data storage system; determines, based on the current statistical period and the running log file of each storage device, the aggregated access information of the data items stored on that device; and determines the access information statistics file of each storage device from a preset access interval threshold and that aggregated access information, where an access interval is the period between two adjacent accesses to a data item. The access information statistics file includes a frequency statistics table containing multiple frequency records, and each frequency record is an 8-tuple <identifier of the data item, access count, statistics start time, statistics end time, storage size, number of access intervals greater than the access interval threshold, maximum access interval, minimum access interval>;
A computing unit which determines, based on the access information statistics file, the preselected data items among all data items of each storage device whose access count in the current statistical period is less than a low-frequency count threshold; determines the total storage capacity of each storage device from the device description information in the system recording device of the big data storage system; determines the free storage capacity of each storage device from the storage information file in the storage information area of that device; and determines the low-frequency coefficient of each preselected data item in each storage device according to the following formula:
[The formula appears as an image in the original publication and is not reproduced in this text.]
where DTF_i is the low-frequency coefficient of the i-th preselected data item in the current storage device; t_imax is the maximum access interval among the access intervals of the i-th preselected data item; t_imin is the minimum access interval among the access intervals of the i-th preselected data item; t_ibegin is the statistics start time of the i-th preselected data item; t_iend is the statistics end time of the i-th preselected data item; C is the total storage capacity of the current storage device; R is the free storage capacity of the current storage device; UN_i is the number of access intervals of the i-th preselected data item that are greater than the access interval threshold; AN_i is the access count of the i-th preselected data item; i is a natural number with 1 ≤ i ≤ PT; and PT is the number of preselected data items in the current storage device, with PT ≥ 100; and
Determines as low-frequency data items those preselected data items in each storage device whose low-frequency coefficient is less than a low-frequency coefficient threshold.
Wherein, when a data management device located outside the big data storage system needs to determine low-frequency data items in the storage devices of the big data storage system, the data management device sends to the big data storage system the request to determine low-frequency data items in each of the multiple storage devices used for big data storage;
Wherein the preprocessing unit redirecting new data access requests that the big data storage system receives from any data requester to the system buffer device of the big data storage system, instead of sending them to the corresponding storage devices, includes:
Starting from the moment the big data storage system receives the request to determine low-frequency data items, the preprocessing unit redirects new data access requests that the big data storage system subsequently receives from any data requester to the system buffer device of the big data storage system, instead of sending them to the corresponding storage devices among the multiple storage devices;
Wherein the new data access request includes a query condition and description information of the query condition; the temporary data item set includes multiple temporary data items, and each temporary data item has summary information that briefly describes the content of the temporary data item;
Wherein the preprocessing unit causing the system buffer device to match the description information of the query condition contained in the new data access request against each temporary data item in its temporary data item set to determine the content matching degree of each temporary data item includes:
The preprocessing unit causes the system buffer device to compare the description information of the query condition contained in the new data access request with the summary information of each temporary data item in its temporary data item set, using content matching based on semantic content, content matching based on keywords, or content matching that combines semantic content and keywords, to determine the content matching degree between each temporary data item and the query condition;
Wherein the matching degree threshold is 60%, and the content matching degree lies in the range [0%, 100%];
Wherein the preprocessing unit sends to the data requester indicated by the new data access request a response message indicating that the big data storage system has paused data access and that the new data access request has been saved in the buffer area of the system buffer device. The response message carries information showing the current queue position of the requester's new data access request in the buffer area. The current queue order of the new data access requests in the buffer area is determined by how long each request has been saved there, and the requests are sorted in descending order of saved duration.
Wherein running log file is saved in the system data region of each storage equipment;
Wherein current statistical time section receives the request when institute of determining low-frequency data item for big data storage system The proxima luce (prox. luc) of the current date at place starts and a period of time of the consecutive days of predetermined quantity forward;The wherein nature of predetermined quantity Day is 10 consecutive days, 20 consecutive days or 30 consecutive days;
Wherein the statistics unit determining, based on the current statistical time period and the running log file of each storage device, the statistical access information of the multiple data items stored in each storage device comprises:
the statistics unit selecting, based on the current statistical time period, from all log records in the running log file of each storage device, to obtain the multiple log records of each storage device within the current statistical time period;
the statistics unit classifying the multiple log records of each storage device within the current statistical time period by data item, to obtain the statistical access information of each data item;
the statistics unit forming the statistical access information of the multiple data items stored in each storage device from the statistical access information of each data item;
Wherein each log record includes: the identifier of the data item, an access start time, an access end time, a storage size and a storage start time;
Wherein each data item has summary information, which briefly describes the content of the data item.
Wherein the threshold of the preset access time interval is 5 minutes, 10 minutes, 15 minutes or 20 minutes.
The statistics unit determining the access information statistics file of each storage device according to the threshold of the preset access time interval and the statistical access information of the multiple data items stored in each storage device comprises:
the statistics unit compiling the statistical access information of each data item among the multiple data items stored in each storage device to determine the access count and all access time intervals of each data item;
the statistics unit determining, based on all access time intervals of each data item, the number of access time intervals greater than the threshold, the maximum access time interval and the minimum access time interval of each data item;
the statistics unit determining the access start time of the first access in the statistical access information of each data item as the statistics start time, and the access end time of the last access in the statistical access information of each data item as the statistics end time;
the statistics unit determining the storage size of each data item based on the statistical access information of each data item.
The low-frequency count threshold is 100, 150 or 200;
The device description information in the system log device includes: the total number of storage devices included in the big data storage system, the total storage capacity of each storage device, the network address of each storage device, or the time at which each storage device joined the big data storage system;
The storage information file in the storage information area of each storage device includes: the total number of data items, the storage size of each data item, the storage start time of each data item, the identifier of each data item, the summary information of each data item, and the free storage capacity of each storage device;
The low-frequency coefficient threshold is 120, 160 or 220.
Further comprising an adjustment unit, configured to determine, among all data items of each storage device, the data items whose access count is greater than 2 times the low-frequency count threshold as candidate data items to obtain multiple candidate data items, to form a candidate data item set from the multiple candidate data items, and to form a low-frequency data item set from the multiple low-frequency data items, among all data items of each storage device, whose low-frequency coefficient is less than the low-frequency coefficient threshold;
For the current storage device among the multiple storage devices:
when the number of low-frequency data items in the low-frequency data item set of the current storage device is less than or equal to the number of candidate data items in the candidate data item set, all low-frequency data items in the low-frequency data item set are sorted in ascending order of access count to generate a first sorted list, and the low-frequency data item ranked first in the first sorted list is taken as the current low-frequency data item,
14.1, performing content matching between the summary information of the current low-frequency data item and the summary information of each candidate data item in the candidate data item set, to determine the content matching degree between the current low-frequency data item and each candidate data item;
14.2, combining the current low-frequency data item with the candidate data item that, among all candidate data items of the candidate data item set, has the greatest content matching degree with the current low-frequency data item, to form a new data item, and saving the new data item in the free storage space of the current storage device;
14.3, deleting from the candidate data item set the candidate data item that has the greatest content matching degree with the current low-frequency data item;
14.4, determining whether the first sorted list contains a low-frequency data item ranked immediately after the current low-frequency data item; if so, proceeding to 14.5; if not, ending;
14.5, selecting the low-frequency data item ranked immediately after the current low-frequency data item in the first sorted list as the current low-frequency data item, and proceeding to 14.1;
Or, when the number of low-frequency data items in the low-frequency data item set is greater than the number of candidate data items in the candidate data item set, all low-frequency data items in the low-frequency data item set are grouped to generate multiple low-frequency data item groups, such that in each of the multiple low-frequency data item groups the total access count of all low-frequency data items is greater than 1.5 times the low-frequency count threshold, and the average access count of all low-frequency data items in each low-frequency data item group is determined, wherein the absolute value of the difference between the average access counts of the low-frequency data item groups is less than 20.
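Steps 14.1 to 14.5 above (the branch in which there are no more low-frequency data items than candidate data items) can be illustrated with a minimal sketch. The function name `pair_low_frequency_items` and the `match` callback are hypothetical; the text does not prescribe how the content matching degree is computed or how two data items are physically combined, so the sketch only returns which candidate each low-frequency item would be combined with.

```python
from typing import Callable, Dict, List, Tuple


def pair_low_frequency_items(
    low_items: Dict[str, int],            # low-frequency item id -> access count
    candidates: List[str],                # candidate data item ids
    match: Callable[[str, str], float],   # hypothetical content matching degree
) -> List[Tuple[str, str]]:
    """Return (low-frequency item, candidate) pairs chosen by steps 14.1-14.5."""
    if len(low_items) > len(candidates):
        raise ValueError("this branch needs at least as many candidates as low-frequency items")
    remaining = list(candidates)
    pairs = []
    # First sorted list: low-frequency items in ascending order of access count.
    for low_id in sorted(low_items, key=low_items.get):
        # 14.1 and 14.2: find the candidate with the greatest matching degree.
        best = max(remaining, key=lambda cand: match(low_id, cand))
        pairs.append((low_id, best))
        # 14.3: the chosen candidate is removed from the candidate set.
        remaining.remove(best)
        # 14.4 and 14.5: the loop moves on to the next item in the sorted list.
    return pairs
```

Because each candidate is removed once it has been used, every low-frequency item is combined with a different candidate.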
The preprocessing unit performs data access operations on each data access request in the buffer area according to the current queue order of the multiple data access requests in the buffer area of the system buffer device;
when it is determined that no saved data access request remains in the buffer area of the system buffer device, a new data access request that the big data storage system receives from any data requester is parsed to obtain a new query condition;
the multiple data items involved in the new query condition are determined in the directory storage server of the big data storage system, and at least one target storage device involved in the multiple data items is determined;
the new query condition is sent to each target storage device, and at least one data item that satisfies the new query condition is received from each target storage device;
a target data item set is formed from all data items received from the target storage devices, and the target data item set is sent to the data requester indicated by the new data access request.
Wherein the preprocessing unit performing data access operations on each data access request in the buffer area according to the current queue order of the multiple data access requests in the buffer area of the system buffer device comprises:
16.1, determining the currently processed data access request according to the current queue order of the multiple data access requests in the buffer area of the system buffer device, wherein the currently processed data access request is the data access request ranked first in the current queue order of the multiple data access requests in the buffer area;
16.2, parsing the currently processed data access request to obtain the currently processed query condition;
16.3, determining, in the directory storage server of the big data storage system, the multiple data items involved in the currently processed query condition, and determining at least one target storage device involved in the multiple data items;
16.4, sending the currently processed query condition to each target storage device, and receiving from each target storage device at least one data item that satisfies the currently processed query condition;
16.5, forming a target data item set from all data items received from the target storage devices, and sending the target data item set to the data requester indicated by the currently processed data access request;
16.6, deleting the data access request ranked first in the current queue order of the multiple data access requests in the buffer area;
16.7, determining whether any saved data access request remains in the buffer area of the system buffer device; if so, proceeding to 16.1; if not, determining that no saved data access request remains in the buffer area of the system buffer device.
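Steps 16.1 to 16.7 amount to draining a first-in, first-out queue. The following minimal sketch assumes that each buffered request is a dictionary with `requester` and `query` keys, that `locate` stands in for the lookup in the directory storage server, and that each storage device is modelled as a function from a query condition to a list of data items; all of these names are hypothetical.

```python
from collections import deque
from typing import Callable, Deque, Dict, List, Tuple


def drain_buffer(
    buffer: Deque[dict],
    locate: Callable[[str], List[str]],              # query -> target device ids
    devices: Dict[str, Callable[[str], List[str]]],  # device id -> retrieval function
) -> List[Tuple[str, List[str]]]:
    """Process buffered requests in queue order; return (requester, items) pairs."""
    delivered = []
    while buffer:                                    # 16.7: repeat while requests remain
        request = buffer[0]                          # 16.1: request ranked first
        query = request["query"]                     # 16.2: parse the query condition
        targets = locate(query)                      # 16.3: find target storage devices
        items: List[str] = []
        for device_id in targets:                    # 16.4: query each target device
            items.extend(devices[device_id](query))
        delivered.append((request["requester"], items))  # 16.5: target data item set
        buffer.popleft()                             # 16.6: delete the processed request
    return delivered
```

When the function returns, the buffer is empty, which corresponds to the terminating condition of step 16.7.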
Brief description of the drawings
Exemplary embodiments of the present invention can be understood more fully by reference to the following drawings:
Fig. 1 is a flowchart of a method for determining low-frequency data items in storage devices used for big data storage, according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of multiple access information statistics files according to an embodiment of the present invention; and
Fig. 3 is a structural schematic diagram of a system for determining low-frequency data items in storage devices used for big data storage, according to an embodiment of the present invention.
Detailed description of the embodiments
Fig. 1 is a flowchart of a method 100 for determining low-frequency data items in storage devices used for big data storage, according to an embodiment of the present invention.
In step 101, in response to receiving a request to determine low-frequency data items in each of the multiple storage devices used for big data storage in the big data storage system, new data access requests that the big data storage system receives from any data requester are redirected to the system buffer device of the big data storage system instead of being sent to the corresponding storage devices among the multiple storage devices. The system buffer device performs content matching between the description information of the query condition included in the new data access request and each temporary data item in the temporary data item set of the system buffer device to determine the content matching degree of each temporary data item, selects from the multiple temporary data items at least one selected temporary data item whose content matching degree is greater than the matching degree threshold, sends the at least one selected temporary data item to the data requester indicated by the new data access request, and saves the new data access request in the buffer area of the system buffer device.
When a data management device located outside the big data storage system needs to determine low-frequency data items in the storage devices of the big data storage system, the data management device sends to the big data storage system a request to determine low-frequency data items in each of the multiple storage devices used for big data storage in the big data storage system. The data management device located outside the big data storage system may be operated or controlled by maintenance, administrative or operations personnel of the big data storage system. For example, these personnel may trigger the identification or determination of low-frequency data items periodically or according to actual operating conditions. The big data storage system includes multiple storage devices, each storage device can store multiple data items, and the storage capacity of each storage device can be any reasonable value. Each data item can be a data file of any type, for example a text, audio or video file. A low-frequency data item is, for example, a data item whose access count within a specific time is lower than the average access count of all data items in the big data storage system, or lower than the average access count of all data items in its storage device.
Wherein redirecting new data access requests that the big data storage system receives from any data requester to the system buffer device of the big data storage system, instead of sending the received new data access requests to the corresponding storage devices among the multiple storage devices, comprises:
starting from the moment at which the big data storage system receives the request to determine low-frequency data items, redirecting the new data access requests that the big data storage system subsequently receives from any data requester to the system buffer device of the big data storage system, instead of sending the received new data access requests to the corresponding storage devices among the multiple storage devices;
wherein the new data access request includes a query condition and description information of the query condition, the temporary data item set includes multiple temporary data items, and each temporary data item has summary information that briefly describes the content of the temporary data item.
After the moment at which the big data storage system receives the request to determine low-frequency data items in each of the multiple storage devices used for big data storage, multiple new data access requests may be received. At this point, the big data storage system redirects all new data access requests subsequently received from one or more data requesters to the system buffer device of the big data storage system instead of sending them to the corresponding storage devices among the multiple storage devices. In normal operation, the big data storage system determines, in the directory storage server of the big data storage system and according to the query condition included in a new data access request, the multiple data items involved in the query condition, and determines at least one target storage device involved in the multiple data items. The query condition is sent to each target storage device, and at least one data item that satisfies the query condition is received from each target storage device. In order to identify or determine low-frequency data items, however, the big data storage system redirects all new data access requests to the system buffer device of the big data storage system. The system buffer device is located inside the big data storage system and is used to store the temporary data item set, which includes multiple temporary data items, or to buffer data access requests. The query condition is, for example, mobile communication AND 5G AND (uplink OR downlink). In this case, the description information of the query condition is, for example, the uplink or downlink of 5G mobile communication. The temporary data item set includes multiple temporary data items, and each temporary data item can be a data file of any type, for example a text, audio or video file. Each temporary data item or data item has summary information, which briefly describes the content of the temporary data item or data item. For example, the summary information is: C++ from zero, a plain and direct introduction that lets you learn the C++ programming language in 21 days.
Wherein performing, by the system buffer device, content matching between the description information of the query condition included in the new data access request and each temporary data item in the temporary data item set of the system buffer device, to determine the content matching degree of each temporary data item, comprises:
The system buffer device performs content matching between the description information of the query condition included in the new data access request and the summary information of each temporary data item in the temporary data item set of the system buffer device, the content matching being based on semantic content comparison, on keyword comparison, or on a combination of semantic content and keywords, to determine the content matching degree between each temporary data item and the query condition. The present application may use any existing text comparison method to determine the content matching degree between the description information of the query condition and the summary information of each temporary data item, the text comparison method being, for example, content matching based on semantic content comparison, on keyword comparison, or on a combination of the two. The content matching degree between each temporary data item and the query condition can indicate how close, similar, relevant or associated the temporary data item is to the query condition.
The matching degree threshold is 55%, 60%, 65%, 70% or any reasonable value, and the content matching degree ranges over [0%, 100%], i.e. it can be any value from 0% to 100%. At least one selected temporary data item whose content matching degree is greater than the matching degree threshold is selected from the multiple temporary data items, i.e. at least one temporary data item whose content matching degree is greater than 55%, 60%, 65% or 70%. The at least one selected temporary data item is sent to the data requester indicated by the new data access request, and the new data access request is saved in the buffer area of the system buffer device. The purpose of sending the selected temporary data items to the data requester is to let the data requester obtain content related to the data access request, and thus see relevant content, while the big data storage system has suspended its data access service.
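The keyword-based variant of the content matching, followed by the threshold selection, can be illustrated with a minimal sketch. The text leaves the comparison method open, so measuring keyword overlap with the Jaccard index is an assumption made here for illustration, as are the function names `keyword_match_degree` and `select_temporary_items`.

```python
import re
from typing import Dict, List


def keyword_match_degree(description: str, summary: str) -> float:
    """Keyword overlap (Jaccard index) expressed as a percentage in [0, 100]."""
    a = set(re.findall(r"\w+", description.lower()))
    b = set(re.findall(r"\w+", summary.lower()))
    if not a or not b:
        return 0.0
    return 100.0 * len(a & b) / len(a | b)


def select_temporary_items(
    description: str,
    summaries: Dict[str, str],   # temporary data item id -> summary information
    threshold: float = 60.0,     # matching degree threshold in percent
) -> List[str]:
    """Return ids of temporary data items whose matching degree exceeds the threshold."""
    return [
        item_id
        for item_id, summary in summaries.items()
        if keyword_match_degree(description, summary) > threshold
    ]
```

A semantic comparison could be substituted for `keyword_match_degree` without changing the selection step, as long as it also returns a value between 0 and 100.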
After the new data access request is saved in the buffer area of the system buffer device, the method further includes: sending, to the data requester indicated by the new data access request, a response message indicating that the big data storage system has suspended data access and that the new data access request has been saved in the buffer area of the system buffer device, the response message carrying information indicating the current queue order in the buffer area of the new data access request from the data requester. The current queue order of a new data access request in the buffer area is determined by the length of time for which it has been saved in the buffer area, and new data access requests are sorted in the current queue order in descending order of that length of time. That is, the longer a new data access request has been saved, the further forward it is in the current queue order. Preferably, after the response message is sent, the method further includes: periodically sending, to the data requester indicated by the new data access request, a notification message indicating the current queue order in the buffer area of the new data access request from the data requester.
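The queue order described above can be sketched as a sort on the time each request has spent in the buffer area. The function name `current_queue_order` and the mapping from request identifiers to save times are hypothetical.

```python
from datetime import datetime
from typing import Dict, List


def current_queue_order(saved_at: Dict[str, datetime], now: datetime) -> List[str]:
    """Return request ids ordered by descending length of time saved in the buffer."""
    return sorted(saved_at, key=lambda request_id: now - saved_at[request_id], reverse=True)
```

The position of a request in the returned list, counted from 1, is the value that would be carried in the response or notification message.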
In step 102, when it is determined that none of the storage devices in the big data storage system has a data access operation in progress, the running log file of each of the multiple storage devices in the big data storage system is obtained; the statistical access information of the multiple data items stored in each storage device is determined based on the current statistical time period and the running log file of each storage device; and the access information statistics file of each storage device is determined according to the threshold of the preset access time interval and the statistical access information of the multiple data items stored in each storage device, where an access time interval is the period between two adjacent accesses to a data item. The access information statistics file includes a frequency statistics table, the frequency statistics table includes multiple frequency records, and the content of each frequency record is the 8-tuple <data item identifier, access count, statistics start time, statistics end time, storage size, number of access time intervals greater than the threshold, maximum access time interval, minimum access time interval>.
A data access operation in progress is the processing in which a storage device performs data retrieval in its own storage space according to the query condition sent by the big data storage system, forms a data item set from the data items obtained by the retrieval, and sends the data item set to the data requester through the big data storage system.
The running log file is stored in the system data area of each storage device. The running log file includes multiple log records, and each log record includes: the identifier of the data item, an access start time, an access end time, a storage size and a storage start time. The identifier of the data item can be any information that uniquely identifies the data item, such as its name, unique identifier or code. The access start time is the time at which the access to the data item in the current log record started, and the access end time is the time at which that access ended. For example, accessing a data item in a storage device may involve operations such as reading or modifying, and the access start time and access end time indicate when such an operation started and ended. The storage size is the size the data item occupies in the storage device. The storage start time is the time at which the data item began to be stored in the storage device or big data storage system, that is, the time from which the data item was saved there and available for access. In the present application, access includes reading and/or modifying.
The current statistical time period is the period of a predetermined number of calendar days counted backward from the day before the date on which the big data storage system receives the request to determine low-frequency data items; the predetermined number of calendar days is 10, 20 or 30. For example, if the big data storage system receives the request to determine low-frequency data items at 11:25:36 on August 11, 2018, the current date at the time of the request is August 11, 2018, and the day before it is August 10, 2018. With a predetermined number of 10 calendar days, the current statistical time period is counted backward from that previous day, i.e. it runs from 00:00:00 on August 1, 2018 to 23:59:59 on August 10, 2018.
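The date arithmetic in this example can be checked with a short sketch; the function name `statistical_period` is hypothetical.

```python
from datetime import datetime, time, timedelta
from typing import Tuple


def statistical_period(request_time: datetime, days: int = 10) -> Tuple[datetime, datetime]:
    """Return (start, end) of the statistical period for a request received at request_time."""
    previous_day = request_time.date() - timedelta(days=1)
    # The period covers `days` calendar days and ends on the previous day.
    first_day = previous_day - timedelta(days=days - 1)
    start = datetime.combine(first_day, time(0, 0, 0))
    end = datetime.combine(previous_day, time(23, 59, 59))
    return start, end
```

For a request received at 11:25:36 on August 11, 2018 with 10 calendar days, the function returns 00:00:00 on August 1, 2018 and 23:59:59 on August 10, 2018, matching the example in the text.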
Wherein determining, based on the current statistical time period and the running log file of each storage device, the statistical access information of the multiple data items stored in each storage device comprises:
selecting, based on the current statistical time period, from all log records in the running log file of each storage device, to obtain the multiple log records of each storage device within the current statistical time period;
classifying the multiple log records of each storage device within the current statistical time period by data item, to obtain the statistical access information of each data item;
forming the statistical access information of the multiple data items stored in each storage device from the statistical access information of each data item.
For example, if the current statistical time period is from 00:00:00 on August 1, 2018 to 23:59:59 on August 10, 2018, i.e. 10 calendar days, all log records in the running log file of each storage device are filtered by that period to obtain all log records of each storage device between 00:00:00 on August 1, 2018 and 23:59:59 on August 10, 2018. These log records are then classified by data item (for example, by the identifier of the data item) to obtain the statistical access information of each data item. The statistical access information of each data item is, for example, all information about accesses to that data item within the current statistical time period. The statistical access information of the individual data items in each storage device together forms the statistical access information of the multiple data items stored in that storage device.
Each data item has summary information, which briefly describes the content of the data item. For example, the summary information is: C++ from zero, a plain and direct introduction that lets you learn the C++ programming language in 21 days.
An access time interval is the period between two adjacent accesses to a data item, for example the period from the access end time of the current access to the access start time of the next access. The threshold of the preset access time interval is 5 minutes, 10 minutes, 15 minutes, 20 minutes or any reasonable value. For example, if data item A is accessed 5 times within the current statistical time period and each access lasts 30 seconds, data item A has 4 access time intervals in the current statistical time period.
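The selection and classification of log records can be sketched as follows. The `LogRecord` type mirrors the five fields of a log record listed in the text, but its field names are hypothetical, and treating a record as inside the period when its access start time falls inside the period is an assumption.

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, Iterable, List, NamedTuple


class LogRecord(NamedTuple):
    item_id: str             # identifier of the data item
    access_start: datetime   # access start time
    access_end: datetime     # access end time
    storage_size: int        # storage size, here in bytes
    storage_start: datetime  # storage start time


def group_records(
    records: Iterable[LogRecord], start: datetime, end: datetime
) -> Dict[str, List[LogRecord]]:
    """Keep records whose access starts within [start, end] and group them by data item."""
    grouped: Dict[str, List[LogRecord]] = defaultdict(list)
    for record in records:
        if start <= record.access_start <= end:
            grouped[record.item_id].append(record)
    return dict(grouped)
```

Each list in the returned dictionary corresponds to the statistical access information of one data item.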
Determining the access information statistics file of each storage device according to the threshold of the preset access time interval and the statistical access information of the multiple data items stored in each storage device comprises:
compiling the statistical access information of each data item among the multiple data items stored in each storage device to determine the access count and all access time intervals of each data item;
determining, based on all access time intervals of each data item, the number of access time intervals greater than the threshold, the maximum access time interval and the minimum access time interval of each data item;
determining the access start time of the first access in the statistical access information of each data item as the statistics start time, and the access end time of the last access in the statistical access information of each data item as the statistics end time;
determining the storage size of each data item based on the statistical access information of each data item.
Because the statistical access information of each data item among the multiple data items stored in each storage device includes multiple log records, and each log record represents one access to the data item, the (total) access count of each data item is determined by the number of log records. In addition, sorting the multiple log records by access start time or access end time yields the access time interval between each pair of adjacent log records, and hence all access time intervals. Further, comparing the threshold of the preset access time interval with all access time intervals determines, for each data item, the number of access time intervals greater than the threshold, and collecting statistics over all access time intervals determines the maximum and minimum access time interval of each data item.
For example, the current statistical time period is from 00:00:00 on August 1, 2018 to 23:59:59 on August 10, 2018. The first access to data item A in the current statistical time period has an access start time of 09:02:11 on August 1, 2018 and an access end time of 09:05:36 on August 1, 2018, and the last access to data item A in the current statistical time period has an access start time of 22:26:53 on August 10, 2018 and an access end time of 22:27:39 on August 10, 2018. The statistics start time of data item A in the current statistical time period is then 09:02:11 on August 1, 2018, and the statistics end time is 22:27:39 on August 10, 2018.
In addition, the storage size of each data item is determined from the storage size in any log record in its statistical access information.
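The derivation of one frequency record from the log records of a single data item can be sketched as follows. The function name `frequency_record`, the representation of an access as a `(access_start, access_end, storage_size)` tuple, and the exact order of the fields in the returned 8-tuple are assumptions; the data item is assumed to have at least one access.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Access = Tuple[datetime, datetime, int]  # (access start, access end, storage size)


def frequency_record(item_id: str, accesses: List[Access], threshold: timedelta) -> tuple:
    """Build the 8-tuple frequency record for one data item with at least one access."""
    ordered = sorted(accesses, key=lambda access: access[0])
    # An interval runs from the end of one access to the start of the next.
    intervals = [nxt[0] - cur[1] for cur, nxt in zip(ordered, ordered[1:])]
    return (
        item_id,
        len(ordered),                                     # access count
        ordered[0][0],                                    # statistics start time
        ordered[-1][1],                                   # statistics end time
        ordered[0][2],                                    # storage size from any record
        sum(1 for gap in intervals if gap > threshold),   # intervals above the threshold
        max(intervals) if intervals else None,            # maximum access time interval
        min(intervals) if intervals else None,            # minimum access time interval
    )
```

A data item accessed n times yields n - 1 intervals, so an item accessed only once has no maximum or minimum interval and the sketch returns `None` for both.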
In step 103, based on the access information statistics file, multiple preselected data items whose access count in the current statistical time period is less than the low-frequency count threshold are determined among all data items of each storage device; the total storage capacity of each storage device is determined according to the device description information in the system log device of the big data storage system; the free storage capacity of each storage device is determined according to the storage information file in the storage information area of each storage device; and the low-frequency coefficient of each preselected data item in each storage device is determined according to the following formula:
where DTF_i is the low-frequency coefficient of the i-th preselected data item in the current storage device; t_imax is the maximum access time interval among the multiple access time intervals of the i-th preselected data item in the current storage device; t_imin is the minimum access time interval among the multiple access time intervals of the i-th preselected data item in the current storage device; t_ibegin is the statistics start time of the i-th preselected data item in the current storage device; t_iend is the statistics end time of the i-th preselected data item in the current storage device; C is the total storage capacity of the current storage device; R is the free storage capacity of the current storage device; UN_i is the number of access time intervals of the i-th preselected data item in the current storage device that are greater than the threshold of the access time interval; and AN_i is the access count of the i-th preselected data item in the current storage device. Here i and PT are natural numbers with PT >= i >= 1, where PT is the number of preselected data items in the current storage device and PT >= 100.
The low-frequency count threshold is 100, 150, 175, 200 or any other reasonable value. The device description information in the system log device includes: the total number of storage devices included in the big data storage system, the total storage capacity of each storage device, the network address of each storage device, or the time at which each storage device joined the big data storage system. The total number of storage devices included in the big data storage system is the total number of all storage devices in the big data storage system. The total storage capacity of each storage device is the total capacity of the storage space of that storage device, or the total capacity of the storage space that can be used for storing data items. The network address of each storage device is, for example, an IP address or a MAC address. The time at which each storage device joined the big data storage system is the start time from which that storage device stores data items as a storage device of the big data storage system.
The storage information file in the storage information area of each storage device includes: the total number of data items, the storage size of each data item, the initial storage time of each data item, the identifier of each data item, the summary information of each data item, and the free storage capacity of the storage device. The total number of data items is the total number of all data items in the storage device. The storage size of each data item is the size of the data item as stored in the storage device, that is, the storage space it occupies. The initial storage time of each data item is the time at which the data item was first stored in the storage device to which it belongs, for example, the time at which the data item was copied to the storage device. The identifier of each data item can be any information that uniquely identifies the data item, such as the name of the data item, its unique identifier or its code. The summary information of each data item briefly introduces the content of the temporary data item or data item. For example, the summary information is: "C++ from Zero" uses a straightforward introduction that lets you learn the C++ programming language in 21 days. The free storage capacity of each storage device is the free or remaining capacity in which new data items can be stored. The low-frequency coefficient threshold is any reasonable value such as 90, 100, 120, 130, 150, 160, 170 or 220.
In step 104, among the multiple preselected data items in each storage device, the preselected data items whose low-frequency coefficient is less than the low-frequency coefficient threshold are determined to be low-frequency data items. That is, through the above steps the present application identifies the low-frequency data items in each storage device used for big data storage in the big data storage system.
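Steps 103 and 104 amount to a two-stage filter, which the following minimal sketch illustrates. It is not the patented implementation: the low-frequency coefficient is assumed to be precomputed and passed in as a plain number, and the names `Item` and `select_low_frequency`, as well as the sample values, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Item:
    identifier: str
    access_count: int   # times accessed in the current statistical period
    coefficient: float  # low-frequency coefficient, assumed precomputed


def select_low_frequency(items, count_threshold=100, coefficient_threshold=120):
    """Two-stage filter: preselect by access count, then keep low coefficients."""
    # Step 103: preselected items are accessed fewer times than the count threshold.
    preselected = [it for it in items if it.access_count < count_threshold]
    # Step 104: low-frequency items have a coefficient below the coefficient threshold.
    return [it for it in preselected if it.coefficient < coefficient_threshold]


items = [
    Item("a", 40, 95.0),
    Item("b", 80, 130.0),   # preselected, but its coefficient is too high
    Item("c", 250, 60.0),   # accessed too often to be preselected
]
low = select_low_frequency(items)
print([it.identifier for it in low])
```

Only item "a" passes both stages in this example; the default thresholds 100 and 120 are taken from the example values given in the text.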
After the preselected data items whose low-frequency coefficient is less than the low-frequency coefficient threshold among the multiple preselected data items in each storage device are determined to be low-frequency data items, the method further includes:
Among all data items of each storage device, the data items whose access count is greater than 2 times the low-frequency count threshold are determined to be candidate data items, so as to obtain multiple candidate data items, and the multiple candidate data items constitute a candidate data item set. The multiple low-frequency data items in each storage device whose low-frequency coefficient is less than the low-frequency coefficient threshold constitute a low-frequency data item set. For example, when the low-frequency count threshold is 100, the data items whose access count is greater than 200 (100 × 2) among all data items of each storage device are determined to be candidate data items. For example, when the low-frequency coefficient threshold is 120, the multiple low-frequency data items in each storage device whose low-frequency coefficient is less than 120 constitute the low-frequency data item set, that is, all low-frequency data items in each storage device constitute the low-frequency data item set.
When the number of low-frequency data items in the low-frequency data item set of the current storage device is less than or equal to the number of candidate data items in the candidate data item set of the current storage device, all low-frequency data items in the low-frequency data item set are sorted in ascending order of access count to generate a first sorted list, and the low-frequency data item ranked first in the first sorted list is taken as the current low-frequency data item. For example, when the number of low-frequency data items in the low-frequency data item set (for example, 326) is less than the number of candidate data items in the candidate data item set (for example, 827), all low-frequency data items in the low-frequency data item set are sorted in ascending (increasing) order of access count to generate the first sorted list. In the first sorted list, the earlier a data item is ranked, the lower its access count, and the later it is ranked, the higher its access count. The low-frequency data item ranked first in the first sorted list (that is, the data item or low-frequency data item with the lowest access count) is taken as the current low-frequency data item.
6.1. Content matching is performed between the summary information of the current low-frequency data item and the summary information of each candidate data item in the candidate data item set, so as to determine the content matching degree between the current low-frequency data item and each candidate data item. The present application can use any existing text comparison method to determine the content matching degree between the summary information of the current low-frequency data item and the summary information of each candidate data item in the candidate data item set. The text comparison method is, for example, content matching based on semantic content comparison, content matching based on keyword comparison, or content matching based on a combination of semantic content and keywords. The content matching degree between each candidate data item and the current low-frequency data item can indicate how close, similar, relevant or associated the candidate data item is to the current low-frequency data item.
6.2. Among all candidate data items in the candidate data item set, the candidate data item with the highest content matching degree with the current low-frequency data item is combined with the current low-frequency data item to form a new data item, and the new data item is saved in the free storage space of the current storage device. Combining the two data items means that the candidate data item with the highest content matching degree and the current low-frequency data item form a file group, and that the summary information of that candidate data item and the summary information of the current low-frequency data item are merged to form the summary information of the file group. The file group thus formed is treated as one new data item, and the new data item is saved in the free storage space of the current storage device, that is, in storage space in which no data item is stored.
6.3. The candidate data item with the highest content matching degree with the current low-frequency data item is deleted from the candidate data item set. This deletion takes place after the new data item (the file group thus formed) has been saved in the free storage space of the current storage device. In addition, the candidate data item with the highest content matching degree with the current low-frequency data item and the current low-frequency data item itself are deleted from the current storage device, because the file group formed from these two data items has already been saved in the free storage space of the current storage device.
6.4. It is determined whether the first sorted list contains a low-frequency data item ranked one position after the current low-frequency data item. If so, step 6.5 is carried out; if not, the process ends. This determination means checking whether the first sorted list contains a low-frequency data item that has a higher access count than the current low-frequency data item and is adjacent to it in the first sorted list. For example, when the current low-frequency data item is the low-frequency data item ranked first, the low-frequency data item one position after it is the one ranked second, that is, the low-frequency data item or data item with the second-lowest access count in the first sorted list. If such an item exists, step 6.5 is carried out; if not, the above process ends.
6.5. The low-frequency data item ranked one position after the current low-frequency data item in the first sorted list is selected as the current low-frequency data item, and step 6.1 is carried out. For example, the low-frequency data item ranked second in the first sorted list is selected as the current low-frequency data item and step 6.1 is carried out; likewise, the low-frequency data items ranked third, fourth, fifth and so on, up to the last one in the first sorted list, are selected in turn as the current low-frequency data item.
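Steps 6.1 to 6.5 can be sketched as a loop over the first sorted list. The sketch below is illustrative only: it assumes a keyword-overlap (Jaccard) score as the matching degree, whereas the text allows any existing text comparison method, and the dictionary layout and the names `match_degree` and `merge_low_frequency` are hypothetical.

```python
def match_degree(summary_a, summary_b):
    """Keyword overlap (Jaccard) between two summaries; an assumed metric."""
    a = set(summary_a.lower().split())
    b = set(summary_b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0


def merge_low_frequency(low_items, candidates):
    """Pair each low-frequency item with its best-matching candidate item."""
    remaining = list(candidates)  # working copy of the candidate data item set
    file_groups = []
    # First sorted list: ascending access count; 6.4/6.5 advance through it.
    for low in sorted(low_items, key=lambda it: it["count"]):
        if not remaining:
            break
        # 6.1: content matching against every remaining candidate.
        best = max(remaining,
                   key=lambda c: match_degree(low["summary"], c["summary"]))
        # 6.2: combine both items and their summaries into one file group.
        file_groups.append({
            "members": [best["id"], low["id"]],
            "summary": best["summary"] + " " + low["summary"],
        })
        # 6.3: remove the used candidate from the candidate set.
        remaining.remove(best)
    return file_groups


low_items = [
    {"id": "L1", "count": 30, "summary": "5G uplink scheduling"},
    {"id": "L2", "count": 12, "summary": "Sanya travel guide"},
]
candidates = [
    {"id": "C1", "count": 320, "summary": "travel guide for Sanya beaches"},
    {"id": "C2", "count": 410, "summary": "5G downlink and uplink basics"},
    {"id": "C3", "count": 250, "summary": "C++ tutorial"},
]
groups = merge_low_frequency(low_items, candidates)
print([g["members"] for g in groups])
```

In this example L2 is processed first because it has the lowest access count; it is paired with C1, after which L1 is paired with C2. Deleting the original items from the storage device (the second half of step 6.3) is left out of the sketch.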
Alternatively, when the number of low-frequency data items in the low-frequency data item set of the current storage device is greater than the number of candidate data items in the candidate data item set of the current storage device, all low-frequency data items in the low-frequency data item set of the current storage device are grouped to generate multiple low-frequency data item groups, such that the total access count of all low-frequency data items in each of the multiple low-frequency data item groups is greater than 1.5 times the low-frequency count threshold. The average access count of all low-frequency data items in each low-frequency data item group is determined. Preferably, the absolute value of the difference between the average access counts of any two of the multiple low-frequency data item groups is less than any reasonable value such as 20, 30, 40, 50, 60 or 70.
For example, when the number of low-frequency data items in the low-frequency data item set (for example, 569) is greater than the number of candidate data items in the candidate data item set (for example, 516), the 569 low-frequency data items in the low-frequency data item set are grouped to generate multiple low-frequency data item groups. The present application determines the number of groups G into which the low-frequency data items are divided from the number K of low-frequency data items in the low-frequency data item set and a grouping parameter Z, where Z is any reasonable value such as 3, 4 or 5. When Z equals 5, the 569 low-frequency data items are divided into 113 low-frequency data item groups (569 / 5 = 113.8, rounded down).
Additionally, the total access count of all low-frequency data items in each of the multiple low-frequency data item groups is greater than 1.1 times, 1.2 times, 1.3 times, 1.5 times or any other reasonable multiple of the low-frequency count threshold. The average access count of all low-frequency data items in each low-frequency data item group, that is, the average access count of each group, is determined. For example, if a low-frequency data item group includes low-frequency data items 1 to 5 whose access counts are 95, 76, 110, 82 and 102 respectively, the average access count of all low-frequency data items in the group is 93. The absolute value of the difference between the average access counts of any two of the multiple low-frequency data item groups is less than any reasonable value such as 20, 30, 40, 50, 60 or 70.
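The grouping arithmetic can be checked with a short sketch. Two points in it are assumptions rather than statements of the text: the rule G = floor(K / Z) is inferred only from the single worked example (569 items, Z = 5, 113 groups), and dealing the sorted items round-robin into the groups is just one possible way to keep the group averages close. The names `group_count`, `make_groups` and `totals_ok` are hypothetical.

```python
from statistics import mean


def group_count(k, z):
    """Assumed rule G = floor(K / Z); it reproduces 569 items, Z = 5 -> 113."""
    return max(1, k // z)


def make_groups(access_counts, z):
    """Deal items, sorted by access count, round-robin into G groups."""
    g = group_count(len(access_counts), z)
    groups = [[] for _ in range(g)]
    for position, count in enumerate(sorted(access_counts)):
        groups[position % g].append(count)
    return groups


def totals_ok(groups, count_threshold=100, factor=1.5):
    """Check that every group's total access count exceeds factor x threshold."""
    return all(sum(group) > factor * count_threshold for group in groups)


print(group_count(569, 5))            # 113
example_group = [95, 76, 110, 82, 102]
print(mean(example_group))            # 93
print(totals_ok([example_group]))     # True, since 465 > 150
```

The second and third lines of output reproduce the worked example: the five access counts sum to 465, giving an average of 93 and a total above 1.5 times a threshold of 100.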
After the preselected data items whose low-frequency coefficient is less than the low-frequency coefficient threshold among the multiple preselected data items in each storage device are determined to be low-frequency data items, the method further includes:
A data access operation is carried out for each data access request in the buffer of the system buffer device according to the current queue order of the multiple data access requests in the buffer. For example, if the current queue order of the multiple data access requests in the buffer of the system buffer device is: first data access request, second data access request, third data access request, fourth data access request and fifth data access request, then the data access operations are carried out for the data access requests in the buffer in that order.
When it is determined that no saved data access request remains in the buffer of the system buffer device, a new data access request received by the big data storage system from any data requester is parsed to obtain a new query condition. For example, once the first, second, third, fourth and fifth data access requests in the buffer of the system buffer device have all been processed, no saved data access request remains in the buffer. A sixth data access request received by the big data storage system from a data requester is then parsed to obtain a new query condition. The new query condition is, for example: mobile communication AND 5G AND (uplink OR downlink).
The multiple data items involved in the new query condition are determined in the directory storage server of the big data storage system, and at least one target storage device in the big data storage system that is involved in the multiple data items is determined. The directory storage server stores the directory information of all data items in the big data storage system. The directory information is, for example, the identifier of a data item, the summary information of the data item, the metadata information of the data item, the keyword information of the data item and the storage device in which the data item is located. The directory storage server queries all data items stored in the big data storage system according to the query condition or new query condition. For example, the new query condition (for example, mobile communication AND 5G AND (uplink OR downlink)) is applied to the summary information, the metadata information and/or the keyword information of the data items, so as to determine the multiple data items involved in the new query condition. The storage device in which each data item is located or stored, or to which it relates, is determined from the directory information, thereby determining the at least one target storage device involved in the multiple data items. In special cases, the multiple data items may all be located in the same target storage device.
The new query condition is sent to each target storage device, and at least one data item meeting the new query condition is received from each target storage device. Each target storage device searches all data items it stores according to the new query condition to obtain at least one data item, and sends the at least one data item obtained to the interface device of the big data storage system. Preferably, the big data storage system of the present application contains no redundant data items, that is, each data item is unique. The interface device receives data access requests from data requesters and sends the data item set or target data item set to the corresponding data requester.
All data items received from the target storage devices form a target data item set, and the target data item set is sent to the data requester indicated by the new data access request. Specifically, the interface device forms the target data item set from all data items received from the target storage devices and sends the target data item set to the data requester indicated by the new data access request.
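The lookup and collection described above can be illustrated with a small in-memory sketch. Everything concrete in it is hypothetical: the `DIRECTORY` contents, the device names, and the choice to evaluate the example query against keyword sets only (the text also allows matching against summary and metadata information).

```python
from collections import defaultdict

# Hypothetical directory: identifier -> (keywords, storage device).
DIRECTORY = {
    "doc-1": ({"mobile communication", "5G", "uplink"}, "device-A"),
    "doc-2": ({"mobile communication", "5G", "downlink"}, "device-B"),
    "doc-3": ({"mobile communication", "4G", "uplink"}, "device-A"),
    "doc-4": ({"5G", "antenna"}, "device-C"),
}


def matches(keywords):
    """mobile communication AND 5G AND (uplink OR downlink)."""
    return ("mobile communication" in keywords
            and "5G" in keywords
            and ("uplink" in keywords or "downlink" in keywords))


def locate_targets(directory, predicate):
    """Map each target storage device to the matching data item identifiers."""
    targets = defaultdict(list)
    for identifier, (keywords, device) in directory.items():
        if predicate(keywords):
            targets[device].append(identifier)
    return dict(targets)


def collect(targets):
    """Interface device: merge per-device results into one target set."""
    return sorted(item for items in targets.values() for item in items)


targets = locate_targets(DIRECTORY, matches)
result = collect(targets)
print(targets)
print(result)
```

Here two target storage devices are involved (device-A and device-B), each returning one data item; the interface device role is reduced to merging those results into a single list.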
Carrying out a data access operation for each data access request in the buffer of the system buffer device according to the current queue order of the multiple data access requests in the buffer includes:
8.1. The currently processed data access request is determined according to the current queue order of the multiple data access requests in the buffer of the system buffer device, the currently processed data access request being the data access request ranked first in the current queue order of the multiple data access requests in the buffer. As described above, if the current queue order of the multiple data access requests in the buffer of the system buffer device is: first data access request, second data access request, third data access request, fourth data access request and fifth data access request, then the currently processed data access request is determined to be the first data access request.
8.2. The currently processed data access request is parsed to obtain the currently processed query condition. Every data access request, including the currently processed one, contains a query condition, so parsing the currently processed data access request yields the currently processed query condition. The currently processed query condition is, for example: mobile communication AND 5G AND (uplink OR downlink).
8.3. The multiple data items involved in the currently processed query condition are determined in the directory storage server of the big data storage system, and at least one target storage device involved in the multiple data items is determined. The directory storage server stores the directory information of all data items in the big data storage system. The directory information is, for example, the identifier of a data item, the summary information of the data item, the metadata information of the data item, the keyword information of the data item and the storage device in which the data item is located. The directory storage server queries all data items stored in the big data storage system according to the currently processed query condition. For example, the currently processed query condition (for example, mobile communication AND 5G AND (uplink OR downlink)) is applied to the summary information, the metadata information and/or the keyword information of the data items, so as to determine the multiple data items involved in the currently processed query condition. The storage device in which each data item is located or stored, or to which it relates, is determined from the directory information, thereby determining the at least one target storage device involved in the multiple data items. In special cases, the multiple data items may all be located in the same target storage device.
8.4. The currently processed query condition is sent to each target storage device, and at least one data item meeting the currently processed query condition is received from each target storage device. Each target storage device searches all data items it stores according to the currently processed query condition to obtain at least one data item, and sends the at least one data item obtained to the interface device of the big data storage system. Preferably, the big data storage system of the present application contains no redundant data items, that is, each data item is unique. The interface device receives data access requests from data requesters and sends the data item set or target data item set to the corresponding data requester.
8.5. All data items received from the target storage devices form a target data item set, and the target data item set is sent to the data requester indicated by the currently processed data access request. Specifically, the interface device forms the target data item set from all data items received from the target storage devices and sends the target data item set to the data requester indicated by the currently processed data access request.
8.6. The data access request ranked first in the current queue order of the multiple data access requests in the buffer is deleted. For example, the first data access request in the current queue order of the multiple data access requests in the buffer is deleted.
8.7. It is determined whether any saved data access request remains in the buffer of the system buffer device. If so, step 8.1 is carried out; if not, it is determined that no saved data access request remains in the buffer of the system buffer device.
For example, if the current queue order of the multiple data access requests in the buffer of the system buffer device is: first data access request, second data access request, third data access request, fourth data access request and fifth data access request, then after the first data access request has been deleted from the current queue order, it is determined that saved data access requests remain in the buffer of the system buffer device, namely the second, third, fourth and fifth data access requests, and step 8.1 is carried out.
After the fifth data access request has been deleted from the current queue order of the multiple data access requests in the buffer, it is determined that no saved data access request remains in the buffer of the system buffer device; that is, the data access operations for the first, second, third, fourth and fifth data access requests have all been completed. In that case, a new data access request received by the big data storage system from any data requester is parsed to obtain a new query condition, and the corresponding processing is carried out.
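The control flow of steps 8.1 to 8.7 is a first-in, first-out drain of the buffer followed by direct handling of new requests. The sketch below shows only that control flow under simplifying assumptions: requests are plain strings, and the `handle` callback stands in for steps 8.2 to 8.5 (parsing, directory lookup, retrieval and reply). The name `drain_buffer` is hypothetical.

```python
from collections import deque


def drain_buffer(buffer, handle, incoming):
    """Process buffered requests in queue order, then serve new requests."""
    log = []
    while buffer:                      # 8.7: is any saved request left?
        current = buffer[0]            # 8.1: request ranked first in the queue
        log.append(handle(current))    # 8.2-8.5: parse, look up, fetch, reply
        buffer.popleft()               # 8.6: delete the processed request
    # The buffer is now empty, so new requests are parsed and served directly.
    for request in incoming:
        log.append(handle(request))
    return log


buffer = deque(["req-1", "req-2", "req-3", "req-4", "req-5"])
log = drain_buffer(buffer, lambda r: f"served {r}", ["req-6"])
print(log)
```

With five buffered requests and one new request, the sixth request is served only after the buffer has been emptied, which mirrors the example in the text.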
In the present application, if different data items or low-frequency data items have the same access count and one of them has to be selected as the current data item or current low-frequency data item, the selection is made at random among the data items or low-frequency data items with the same access count.
Fig. 2 is a schematic diagram of multiple access information statistics files 200 according to an embodiment of the present invention. When it is determined that no data access operation is currently running on any storage device in the big data storage system, the present application obtains the running log file of each of the multiple storage devices in the big data storage system, determines the statistical access information of the multiple data items stored in each storage device based on the current statistical period and the running log file of each storage device, and determines the access information statistics file of each storage device according to a preset access time interval threshold and the statistical access information of the multiple data items stored in each storage device, where an access time interval is the period between two adjacent accesses to a data item. As shown in Fig. 2, since each storage device has its own access information statistics file, there are multiple access information statistics files 200. An access information statistics file includes a frequency statistics table 201, and the frequency statistics table 201 includes multiple frequency records (numbered 1, 2, 3, 4, 5, 6, ...), where the content of each frequency record is the 8-tuple <identifier of the data item, access count, statistics start time, statistics end time, storage size, number of access time intervals greater than the access time interval threshold, maximum access time interval, minimum access time interval>.
As shown in Fig. 2, access information statistics file 1 includes a frequency statistics table 201, and the frequency statistics table 201 includes multiple frequency records. Only 6 frequency records are shown in the frequency statistics table 201, in which the identifiers of the data items are, respectively, PPT Tutorial, Introduction to Big Data Systems, Tai Chi Master, C++ from Zero, U.S. Travel Handbook and Sanya Travel Guide. For example, PPT Tutorial and Introduction to Big Data Systems are PPT files, Tai Chi Master and C++ from Zero are video files, and U.S. Travel Handbook and Sanya Travel Guide are PDF files. The frequency statistics table 201 also shows, for each data item, the access count, the statistics start time, the statistics end time, the storage size, the number of access time intervals greater than the access time interval threshold, the maximum access time interval and the minimum access time interval.
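One way to picture a frequency record is as a small data structure derived from the access timestamps of a data item. The following is a minimal sketch, not the format used in Fig. 2: the class and field names, the helper `build_record`, the use of plain numbers as timestamps and the sample values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FrequencyRecord:
    """The 8 fields of one frequency record, in the order given in the text."""
    identifier: str
    access_count: int
    stat_start: float
    stat_end: float
    storage_size: int
    over_threshold_count: int  # intervals greater than the interval threshold
    max_interval: float
    min_interval: float


def build_record(identifier, access_times, stat_start, stat_end,
                 storage_size, interval_threshold):
    """Derive a frequency record from the access timestamps of one data item."""
    times = sorted(access_times)
    # An access time interval is the gap between two adjacent accesses.
    intervals = [later - earlier for earlier, later in zip(times, times[1:])]
    return FrequencyRecord(
        identifier=identifier,
        access_count=len(times),
        stat_start=stat_start,
        stat_end=stat_end,
        storage_size=storage_size,
        over_threshold_count=sum(1 for gap in intervals
                                 if gap > interval_threshold),
        max_interval=max(intervals, default=0.0),
        min_interval=min(intervals, default=0.0),
    )


record = build_record("PPT Tutorial", [0, 10, 70, 75], stat_start=0,
                      stat_end=100, storage_size=2048, interval_threshold=30)
print(record)
```

For the four sample accesses the intervals are 10, 60 and 5, so one interval exceeds the threshold of 30, the maximum interval is 60 and the minimum is 5.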
Fig. 3 is a schematic structural diagram of a system 300 for determining low-frequency data items in storage devices used for big data storage according to an embodiment of the present invention. The system 300 includes: a preprocessing unit 301, a statistics unit 302, a calculation unit 303, a determination unit 304 and an adjustment unit 305.
The preprocessing unit 301, in response to receiving a request to determine low-frequency data items in each of the multiple storage devices used for big data storage in the big data storage system, redirects each new data access request received by the big data storage system from any data requester to the system buffer device of the big data storage system, instead of sending the received new data access request to the corresponding storage device among the multiple storage devices. The system buffer device then performs content matching between the description information of the query condition contained in the new data access request and each temporary data item in the temporary data item set of the system buffer device, so as to determine the content matching degree of each temporary data item; selects from the multiple temporary data items at least one selected temporary data item whose content matching degree is greater than a matching degree threshold; sends the at least one selected temporary data item to the data requester indicated by the new data access request; and saves the new data access request in the buffer of the system buffer device.
When a data management device located outside the big data storage system needs low-frequency data items to be determined in the storage devices of the big data storage system, the data management device sends to the big data storage system a request to determine low-frequency data items in each of the multiple storage devices used for big data storage in the big data storage system. The data management device located outside the big data storage system can be operated or controlled by the maintenance personnel, administrators or operators of the big data storage system. For example, the maintenance personnel, administrators or operators of the big data storage system can trigger the identification or determination of low-frequency data items periodically or according to actual operating conditions. The big data storage system includes multiple storage devices, each storage device can store multiple data items, and the storage capacity of each storage device can be any reasonable value. Each data item can be a data file of any type, for example a text, audio or video file. A low-frequency data item is, for example, a data item whose access count within a specific period is lower than the average access count of all data items in the big data storage system, or lower than the average access count of all data items in its storage device.
Redirecting each new data access request received by the big data storage system from any data requester to the system buffer device of the big data storage system, instead of sending the received new data access request to the corresponding storage device among the multiple storage devices, includes:
from the moment the big data storage system receives the request to determine low-frequency data items, redirecting each new data access request subsequently received by the big data storage system from any data requester to the system buffer device of the big data storage system, instead of sending the received new data access request to the corresponding storage device among the multiple storage devices;
where the new data access request includes a query condition and description information of the query condition, the temporary data item set includes multiple temporary data items, and each temporary data item has summary information that briefly introduces the content of the temporary data item.
From the moment the big data storage system receives the request to determine low-frequency data items in each of the multiple storage devices used for big data storage in the big data storage system, multiple new data access requests may be received. In that case, all new data access requests subsequently received by the big data storage system from one or more data requesters are redirected to the system buffer device of the big data storage system, instead of being sent to the corresponding storage devices among the multiple storage devices. Under normal operation, the big data storage system determines, in the directory storage server of the big data storage system, the multiple data items involved in the query condition contained in a new data access request, and determines at least one target storage device involved in the multiple data items. The query condition is sent to each target storage device, and at least one data item meeting the query condition is received from each target storage device. When low-frequency data items are to be identified or determined, however, the big data storage system redirects all new data access requests to the system buffer device of the big data storage system. The system buffer device is located inside the big data storage system and is used to store the temporary data item set comprising multiple temporary data items, or to buffer data access requests. The query condition is, for example: mobile communication AND 5G AND (uplink OR downlink). In this case, the description information of the query condition is, for example: uplink or downlink of 5G mobile communication. The temporary data item set includes multiple temporary data items, and each temporary data item can be a data file of any type, for example a text, audio or video file. Each temporary data item or data item has summary information, and the summary information briefly introduces the content of the temporary data item or data item. For example, the summary information is: "C++ from Zero" uses a straightforward introduction that lets you learn the C++ programming language in 21 days.
Having the system buffer device perform content matching between the description information of the query condition included in the new data access request and each ephemeral data item in the ephemeral data item set of the system buffer device, so as to determine the content matching degree of each ephemeral data item, includes:
The system buffer device matches the description information of the query condition included in the new data access request against the summary information of each ephemeral data item in its ephemeral data item set, using content matching based on semantic comparison, content matching based on keyword comparison, or content matching that combines semantic content and keywords, to determine the content matching degree between each ephemeral data item and the query condition. The present application may use any existing text comparison method to determine this content matching degree; the three approaches named above are examples of such methods. The content matching degree between an ephemeral data item and the query condition indicates how close, similar, relevant, or associated the ephemeral data item is to the query condition.
The matching degree threshold is 55%, 60%, 65%, 70%, or any other reasonable value, and the content matching degree ranges over [0%, 100%], that is, it may take any value from 0% to 100%. At least one selected ephemeral data item whose content matching degree is greater than the matching degree threshold is chosen from the multiple ephemeral data items, for example at least one ephemeral data item whose content matching degree is greater than 55%, 60%, 65%, or 70%. The at least one selected ephemeral data item is sent to the data requester indicated by the new data access request, and the new data access request is saved in the buffer area of the system buffer device. The purpose of sending the selected ephemeral data items is to let the data requester obtain content related to its data access request, so that it can still learn about related content while the big data storage system has suspended its data access service.
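The keyword-based variant of this matching and threshold selection can be sketched as follows. This is only an illustration under stated assumptions: the function names, the tokenization rule, and the overlap measure (share of description keywords found in the summary) are hypothetical, since the application allows any existing text comparison method.

```python
import re


def _keywords(text):
    # Assumed tokenization: lowercase runs of letters, digits, and '+'.
    return set(re.findall(r"[a-z0-9+]+", text.lower()))


def keyword_match_degree(description, summary):
    """Share (0.0 to 1.0) of description keywords that occur in the summary."""
    wanted = _keywords(description)
    if not wanted:
        return 0.0
    return len(wanted & _keywords(summary)) / len(wanted)


def select_ephemeral_items(description, summaries, threshold=0.60):
    """Return ids of items whose match degree is strictly above the threshold."""
    return [item_id for item_id, summary in summaries.items()
            if keyword_match_degree(description, summary) > threshold]


# Hypothetical ephemeral data item summaries.
items = {
    "item-1": "Overview of the uplink and downlink of 5G mobile communication",
    "item-2": "C++ from zero: learn the C++ programming language in 21 days",
}
selected = select_ephemeral_items(
    "uplink or downlink of 5G mobile communication", items)
```

A semantic or combined matcher could replace `keyword_match_degree` without changing the threshold selection step.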
After the new data access request is saved in the buffer area of the system buffer device, the method further includes: sending, to the data requester indicated by the new data access request, a response message indicating that the big data storage system has suspended data access and that the new data access request has been saved in the buffer area of the system buffer device, the response message carrying information that indicates the current queue position of the new data access request from that data requester in the buffer area. The current queue position of a new data access request in the buffer area is determined by the length of time for which the request has been saved in the buffer area, and the requests in the current queue are sorted in descending order of that saved time. That is, the longer a new data access request has been saved, the nearer it is to the front of the current queue. Preferably, after the response message is sent, the method further includes: periodically sending, to the data requester indicated by the new data access request, a notification message indicating the current queue position of its new data access request in the buffer area.
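The queue ordering rule can be illustrated with a short sketch. The request identifiers and the `queue_positions` helper are hypothetical; the only rule taken from the text is that a longer saved time gives an earlier position.

```python
from datetime import datetime


def queue_positions(saved_at, now):
    """Map request id -> 1-based queue position, longest-saved request first."""
    ordered = sorted(saved_at, key=lambda rid: now - saved_at[rid], reverse=True)
    return {rid: pos for pos, rid in enumerate(ordered, start=1)}


# Hypothetical buffered requests and the times at which they were saved.
now = datetime(2018, 8, 11, 11, 30, 0)
saved_at = {
    "req-b": datetime(2018, 8, 11, 11, 27, 0),
    "req-a": datetime(2018, 8, 11, 11, 26, 0),
    "req-c": datetime(2018, 8, 11, 11, 29, 0),
}
positions = queue_positions(saved_at, now)
```

The resulting position is the kind of value that the response message and the periodic notification messages would carry.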
When the statistics unit 302 determines that none of the storage devices in the big data storage system has a data access operation currently running, it obtains the running log file of each of the multiple storage devices in the big data storage system; determines, based on the current statistical period and the running log file of each storage device, the statistical access information of the multiple data items stored in each storage device; and determines the access information statistics file of each storage device according to a preset access time interval threshold and the statistical access information of the multiple data items stored in each storage device, where an access time interval is the period of time between two adjacent accesses to a data item. The access information statistics file includes a frequency statistics table, and the frequency statistics table includes multiple frequency records, where the content of each frequency record is the 8-tuple <data item identifier, accessed count, statistics start time, statistics end time, storage size, number of access time intervals greater than the threshold, maximum access time interval, minimum access time interval>.
A currently running data access operation is the processing in which a storage device retrieves data in its own storage space according to the query condition sent by the big data storage system, the data items obtained by the retrieval are assembled into a data item set, and the big data storage system sends the data item set to the data requester.
The running log file is saved in the system data area of each storage device. The running log file includes multiple log records, where each log record includes: the data item identifier, the access start time, the access end time, the storage size, and the storage start time. The identifier of a data item may be any information that uniquely identifies the data item, such as its name, its unique identifier, or its code. The access start time is the time at which the access to the data item involved in the current log record began, and the access end time is the time at which that access ended. For example, accessing a data item in a storage device may involve operations such as reading or modifying it, and the access start time and access end time indicate when that operation began and ended. The storage size is the size the data item occupies in the storage device. The storage start time is the time at which the data item began to be stored in the storage device or the big data storage system, that is, the time from which the data item has been kept in the storage device or the big data storage system to provide access service. In the present application, access includes reading and/or modifying.
The current statistical period is the period that starts on the day before the current date on which the big data storage system receives the request to determine low-frequency data items and extends backward for a predetermined number of natural days, where the predetermined number is 10, 20, or 30 natural days. For example, if the big data storage system receives the request to determine low-frequency data items at 11:25:36 on August 11, 2018, the current date is August 11, 2018, and the previous day is August 10, 2018. With a predetermined number of 10 natural days, the current statistical period therefore runs from 00:00:00 on August 1, 2018 to 23:59:59 on August 10, 2018.
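The date arithmetic in this example can be checked with a small sketch. The function name `current_statistical_period` is hypothetical; the rule it implements (end on the previous day at 23:59:59, span a given number of natural days) is the one described above.

```python
from datetime import datetime, time, timedelta


def current_statistical_period(request_time, natural_days=10):
    """Return (start, end) of the period ending on the day before the request."""
    previous_day = request_time.date() - timedelta(days=1)
    first_day = previous_day - timedelta(days=natural_days - 1)
    start = datetime.combine(first_day, time(0, 0, 0))
    end = datetime.combine(previous_day, time(23, 59, 59))
    return start, end


# Request received at 11:25:36 on August 11, 2018, as in the example above.
period_start, period_end = current_statistical_period(
    datetime(2018, 8, 11, 11, 25, 36), natural_days=10)
```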
Determining, based on the current statistical period and the running log file of each storage device, the statistical access information of the multiple data items stored in each storage device includes:
selecting, based on the current statistical period, from all log records in the running log file of each storage device, to obtain the multiple log records of each storage device that fall within the current statistical period;
classifying the multiple log records of each storage device within the current statistical period by data item, to obtain the statistical access information of each data item; and
forming the statistical access information of the multiple data items stored in each storage device from the statistical access information of each data item.
For example, if the current statistical period runs from 00:00:00 on August 1, 2018 to 23:59:59 on August 10, 2018, that is, 10 natural days, then all log records in the running log file of each storage device are filtered against this period to obtain all log records of each storage device between 00:00:00 on August 1, 2018 and 23:59:59 on August 10, 2018. These log records are then classified by data item (for example, by data item identifier) to obtain the statistical access information of each data item. The statistical access information of a data item is, for example, all the information about accesses to that data item within the current statistical period. The statistical access information of all data items in a storage device together forms the statistical access information of the multiple data items stored in that storage device.
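The selection and classification steps can be sketched as below. The `LogRecord` type mirrors the five log record fields named earlier, but its field names are hypothetical, and the rule that a record counts only if the whole access lies inside the period is an assumption, since the text does not say how accesses that straddle the period boundary are treated.

```python
from collections import defaultdict
from datetime import datetime
from typing import NamedTuple


class LogRecord(NamedTuple):
    item_id: str
    access_start: datetime
    access_end: datetime
    storage_size: int
    storage_start: datetime


def group_records_by_item(records, period_start, period_end):
    """Keep records inside the period and classify them by data item id."""
    grouped = defaultdict(list)
    for record in records:
        # Assumption: the whole access must fall within the period.
        if period_start <= record.access_start and record.access_end <= period_end:
            grouped[record.item_id].append(record)
    return dict(grouped)


stored = datetime(2018, 1, 1)
records = [
    LogRecord("A", datetime(2018, 8, 1, 9, 2, 11), datetime(2018, 8, 1, 9, 5, 36), 4096, stored),
    LogRecord("B", datetime(2018, 8, 3, 10, 0, 0), datetime(2018, 8, 3, 10, 0, 30), 1024, stored),
    LogRecord("A", datetime(2018, 8, 10, 22, 26, 53), datetime(2018, 8, 10, 22, 27, 39), 4096, stored),
    LogRecord("A", datetime(2018, 7, 30, 8, 0, 0), datetime(2018, 7, 30, 8, 1, 0), 4096, stored),
]
grouped = group_records_by_item(
    records, datetime(2018, 8, 1, 0, 0, 0), datetime(2018, 8, 10, 23, 59, 59))
```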
Each data item has summary information, which briefly introduces the content of the data item. For example, the summary information is: "C++ from zero: a straightforward introduction that lets you learn the C++ programming language in 21 days."
An access time interval is the period of time between two adjacent accesses to a data item, for example, the period from the access end time of the current access to the access start time of the next access. The preset access time interval threshold is 5 minutes, 10 minutes, 15 minutes, 20 minutes, or any other reasonable value. For example, if data item A is accessed 5 times within the current statistical period (or statistical period) and each access lasts 30 seconds, then data item A has 4 access time intervals within that period.
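The interval definition can be made concrete with a minimal sketch. The access times below are invented for illustration; the helper name `access_intervals` is hypothetical. Five accesses of 30 seconds each yield four intervals, each measured from one access end time to the next access start time.

```python
from datetime import datetime, timedelta


def access_intervals(accesses):
    """accesses: list of (access_start, access_end); returns gaps between neighbours."""
    ordered = sorted(accesses)
    return [next_start - end
            for (_, end), (next_start, _) in zip(ordered, ordered[1:])]


# Hypothetical access times for data item A on one day.
day = datetime(2018, 8, 1)
starts = [timedelta(hours=9), timedelta(hours=9, minutes=3),
          timedelta(hours=9, minutes=20), timedelta(hours=10),
          timedelta(hours=10, seconds=40)]
accesses = [(day + s, day + s + timedelta(seconds=30)) for s in starts]

intervals = access_intervals(accesses)
# Number of intervals longer than a 10-minute threshold.
over_threshold = sum(1 for gap in intervals if gap > timedelta(minutes=10))
```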
Determining the access information statistics file of each storage device according to the preset access time interval threshold and the statistical access information of the multiple data items stored in each storage device includes:
performing statistics on the statistical access information of each of the multiple data items stored in each storage device, to determine the accessed count and all access time intervals of each data item;
determining, based on all access time intervals of each data item, the number of access time intervals greater than the threshold, the maximum access time interval, and the minimum access time interval of each data item;
taking the access start time of the first access in the statistical access information of each data item as the statistics start time, and taking the access end time of the last access in the statistical access information of each data item as the statistics end time; and
determining the storage size of each data item based on the statistical access information of each data item.
Because the statistical access information of each of the multiple data items stored in each storage device consists of multiple log records, and each log record represents one access to the data item, the (total) accessed count of each data item is determined from the number of its log records. In addition, sorting the log records by access start time or access end time yields the access time interval between each pair of adjacent log records, and thereby all access time intervals. Further, comparing all access time intervals with the preset access time interval threshold gives the number of access time intervals of each data item that are greater than the threshold, and performing statistics over all access time intervals gives the maximum access time interval and the minimum access time interval of each data item.
For example, suppose the current statistical period runs from 00:00:00 on August 1, 2018 to 23:59:59 on August 10, 2018. If the first access to data item A within this period has an access start time of 09:02:11 on August 1, 2018 and an access end time of 09:05:36 on August 1, 2018, and the last access to data item A within this period has an access start time of 22:26:53 on August 10, 2018 and an access end time of 22:27:39 on August 10, 2018, then the statistics start time of data item A in the current statistical period is 09:02:11 on August 1, 2018, and its statistics end time is 22:27:39 on August 10, 2018.
In addition, the storage size of each data item is determined from the storage size recorded in any one of the log records in its statistical access information.
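Putting these steps together, one frequency record (the 8-tuple of the frequency statistics table) could be derived from the accesses of a single data item roughly as follows. The `FrequencyRecord` field names, the storage size of 4096, and the use of `None` when a data item has fewer than two accesses are assumptions made for this sketch only.

```python
from datetime import datetime, timedelta
from typing import NamedTuple, Optional


class FrequencyRecord(NamedTuple):
    item_id: str
    accessed_count: int
    stats_start: datetime
    stats_end: datetime
    storage_size: int
    over_threshold_count: int
    max_interval: Optional[timedelta]
    min_interval: Optional[timedelta]


def build_frequency_record(item_id, accesses, storage_size, threshold):
    """accesses: non-empty list of (access_start, access_end) for one data item."""
    ordered = sorted(accesses)
    gaps = [nxt[0] - cur[1] for cur, nxt in zip(ordered, ordered[1:])]
    return FrequencyRecord(
        item_id=item_id,
        accessed_count=len(ordered),      # one log record per access
        stats_start=ordered[0][0],        # start of the first access
        stats_end=ordered[-1][1],         # end of the last access
        storage_size=storage_size,
        over_threshold_count=sum(1 for gap in gaps if gap > threshold),
        max_interval=max(gaps) if gaps else None,
        min_interval=min(gaps) if gaps else None,
    )


# First and last access of data item A from the example above.
record = build_frequency_record(
    "A",
    [(datetime(2018, 8, 10, 22, 26, 53), datetime(2018, 8, 10, 22, 27, 39)),
     (datetime(2018, 8, 1, 9, 2, 11), datetime(2018, 8, 1, 9, 5, 36))],
    storage_size=4096,
    threshold=timedelta(minutes=10),
)
```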
The computing unit 303 determines, based on the access information statistics file, the multiple preselected data items, namely those among all data items of each storage device whose accessed count in the current statistical period is less than the low-frequency count threshold; determines the total storage capacity of each storage device according to the device description information in the system log device of the big data storage system; determines the free storage capacity of each storage device according to the storage information file in the storage information area of that storage device; and determines the low-frequency coefficient of each preselected data item in each storage device according to the following formula:
where DTF_i is the low-frequency coefficient of the i-th preselected data item in the current storage device; t_imax is the maximum access time interval among the multiple access time intervals of the i-th preselected data item in the current storage device; t_imin is the minimum access time interval among the multiple access time intervals of the i-th preselected data item in the current storage device; t_ibegin is the statistics start time of the i-th preselected data item in the current storage device; t_iend is the statistics end time of the i-th preselected data item in the current storage device; C is the total storage capacity of the current storage device; R is the free storage capacity of the current storage device; UN_i is the number of access time intervals greater than the threshold among the multiple access time intervals of the i-th preselected data item in the current storage device; and AN_i is the accessed count of the i-th preselected data item in the current storage device. Here i is a natural number with PT ≥ i ≥ 1, and PT is a natural number denoting the number of preselected data items in the current storage device, with PT ≥ 100.
The low-frequency count threshold is 100, 150, 175, 200, or any other reasonable value. The device description information in the system log device includes: the total number of storage devices included in the big data storage system, the total storage capacity of each storage device, the network address of each storage device, and the time at which each storage device joined the big data storage system. The total number of storage devices included in the big data storage system is the total number of all storage devices in the big data storage system. The total storage capacity of each storage device is the total capacity of its storage space, or alternatively the total capacity of the storage space that it can use for storing data items. The network address of each storage device is, for example, an IP address or a MAC address. The time at which each storage device joined the big data storage system is the time from which that storage device began to store data items as a storage device of the big data storage system.
The storage information file in the storage information area of each storage device includes: the total number of data items, the storage size of each data item, the storage start time of each data item, the identifier of each data item, the summary information of each data item, and the free storage capacity of the storage device. The total number of data items is the total number of all data items in the storage device. The storage size of each data item is the size of the data item as stored in the storage device, or the storage space it occupies. The storage start time of each data item is the time at which the data item began to be stored in the storage device to which it belongs, for example, the time at which the data item was copied to the storage device. The identifier of each data item may be any information that uniquely identifies the data item, such as its name, its unique identifier, or its code. The summary information of each data item briefly introduces the content of the data item. For example, the summary information is: "C++ from zero: a straightforward introduction that lets you learn the C++ programming language in 21 days." The free storage capacity of each storage device is the free or remaining storage capacity in which the storage device can store new data items. The low-frequency coefficient threshold is 90, 100, 120, 130, 150, 160, 170, 220, or any other reasonable value.
The determination unit 304 determines as low-frequency data items those preselected data items, among the multiple preselected data items in each storage device, whose low-frequency coefficient is less than the low-frequency coefficient threshold. That is, through the above steps, the present application identifies the low-frequency data items in each storage device used for big data storage in the big data storage system.
After the preselected data items whose low-frequency coefficient is less than the low-frequency coefficient threshold have been determined as low-frequency data items, the method further includes using the adjustment unit 305 to determine as candidate data items (data items to be selected) those data items, among all data items of each storage device, whose accessed count is greater than 2 times the low-frequency count threshold, thereby obtaining multiple candidate data items; to form a candidate data item set from the multiple candidate data items; and to form a low-frequency data item set from the multiple low-frequency data items in each storage device whose low-frequency coefficient is less than the low-frequency coefficient threshold. For example, when the low-frequency count threshold is 100, the data items of each storage device whose accessed count is greater than 100 × 2 are determined as candidate data items. For example, when the low-frequency coefficient threshold is 120, the multiple low-frequency data items in each storage device whose low-frequency coefficient is less than 120 form the low-frequency data item set, that is, all low-frequency data items in each storage device form the low-frequency data item set.
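The three threshold tests described so far (preselection, low-frequency determination, and candidate selection) can be sketched as follows. The low-frequency coefficients are treated here as given inputs, because the formula that produces them is not reproduced in this text; the function name, item identifiers, and all numeric values are hypothetical.

```python
def split_item_sets(access_counts, low_freq_coeffs,
                    count_threshold=100, coeff_threshold=120):
    """Return (low_frequency_set, candidate_set) for one storage device."""
    # Preselected items: accessed fewer times than the low-frequency count threshold.
    preselected = {i for i, n in access_counts.items() if n < count_threshold}
    # Low-frequency items: preselected items with coefficient below the threshold.
    low_frequency = {i for i in preselected if low_freq_coeffs[i] < coeff_threshold}
    # Candidate items: accessed more than twice the low-frequency count threshold.
    candidates = {i for i, n in access_counts.items() if n > 2 * count_threshold}
    return low_frequency, candidates


access_counts = {"a": 40, "b": 95, "c": 150, "d": 201, "e": 500, "f": 200}
# Coefficients supplied for the preselected items only (illustrative values).
low_freq_coeffs = {"a": 80.0, "b": 130.0}
low_frequency, candidates = split_item_sets(access_counts, low_freq_coeffs)
```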
When the number of low-frequency data items in the low-frequency data item set is less than or equal to the number of candidate data items in the candidate data item set, all low-frequency data items in the low-frequency data item set are sorted in ascending order of accessed count to generate a first sorted list, and the low-frequency data item ranked 1st in the first sorted list is taken as the current low-frequency data item. For example, when the number of low-frequency data items in the low-frequency data item set (for example, 326) is less than the number of candidate data items in the candidate data item set (for example, 827), all low-frequency data items in the low-frequency data item set are sorted in ascending (increasing) order of accessed count to generate the first sorted list. In the first sorted list, the nearer a data item is to the front, the smaller its accessed count, and the nearer it is to the back, the larger its accessed count. The low-frequency data item ranked 1st in the first sorted list (that is, the data item or low-frequency data item with the smallest accessed count) is taken as the current low-frequency data item.
6.1. Content matching is performed between the summary information of the current low-frequency data item and the summary information of each candidate data item in the candidate data item set, to determine the content matching degree between the current low-frequency data item and each candidate data item. The present application may use any existing text comparison method to determine this content matching degree, for example content matching based on semantic comparison, content matching based on keyword comparison, or content matching that combines semantic content and keywords. The content matching degree between a candidate data item and the current low-frequency data item indicates how close, similar, relevant, or associated the candidate data item is to the current low-frequency data item.
6.2. The candidate data item that has the greatest content matching degree with the current low-frequency data item, among all candidate data items of the candidate data item set, is combined with the current low-frequency data item to form a new data item, and the new data item is saved in the free storage space of the storage device. Combining the two data items means forming a file group from the best-matching candidate data item and the current low-frequency data item, and merging the summary information of the best-matching candidate data item with the summary information of the current low-frequency data item to form the summary information of the file group. The resulting file group is treated as a new data item, and the new data item is saved in the free storage space of the storage device, that is, in storage space that does not hold any data item.
6.3. The candidate data item that has the greatest content matching degree with the current low-frequency data item is deleted from the candidate data item set. This deletion takes place after the new data item (the file group) has been saved in the free storage space of the storage device. In addition, the best-matching candidate data item and the current low-frequency data item are deleted from the storage device, because the file group formed from them has already been saved in the free storage space of the storage device.
6.4. It is determined whether the first sorted list contains a low-frequency data item ranked immediately after the current low-frequency data item; if it does, step 6.5 is performed; if it does not, the process ends. Here, a low-frequency data item ranked immediately after the current low-frequency data item means a low-frequency data item that is adjacent to the current low-frequency data item in the first sorted list and follows it in the ascending order of accessed count. For example, when the current low-frequency data item is the one ranked 1st, the low-frequency data item ranked immediately after it is the one ranked 2nd, that is, the low-frequency data item or data item with the second-smallest accessed count in the first sorted list. If such a low-frequency data item exists, step 6.5 is performed; otherwise, the above process ends.
6.5. The low-frequency data item ranked immediately after the current low-frequency data item in the first sorted list is selected as the current low-frequency data item, and step 6.1 is performed. For example, the low-frequency data item ranked 2nd in the first sorted list is selected as the current low-frequency data item and step 6.1 is performed, and so on, with the low-frequency data items ranked 3rd, 4th, 5th, ..., up to the last one in the first sorted list each being selected in turn as the current low-frequency data item.
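Steps 6.1 to 6.5 amount to a loop over the first sorted list, and the sketch below shows one possible reading of it. Several things here are assumptions: the word-overlap (Jaccard) matcher stands in for whichever text comparison method is used, the file group is modelled as a plain dictionary, summaries are merged by simple concatenation, and actual storage and deletion on the storage device are omitted.

```python
def word_overlap(summary_a, summary_b):
    """Assumed matcher: Jaccard overlap of lowercase whitespace-separated words."""
    a, b = set(summary_a.lower().split()), set(summary_b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0


def merge_low_frequency_items(low_items, candidates, match=word_overlap):
    """low_items: id -> (accessed_count, summary); candidates: id -> summary."""
    remaining = dict(candidates)
    file_groups = []
    # First sorted list: ascending accessed count (steps 6.4 and 6.5 advance through it).
    for low_id in sorted(low_items, key=lambda i: low_items[i][0]):
        if not remaining:
            break  # cannot occur when low-frequency items <= candidate items
        low_summary = low_items[low_id][1]
        # Step 6.1: match the current low-frequency item against every candidate.
        best = max(remaining, key=lambda c: match(low_summary, remaining[c]))
        # Step 6.2: combine both items into a file group with a merged summary.
        file_groups.append({"members": (low_id, best),
                            "summary": low_summary + " " + remaining[best]})
        # Step 6.3: remove the chosen candidate from the candidate set.
        del remaining[best]
    return file_groups, remaining


low_items = {"L1": (3, "5G uplink notes"), "L2": (1, "C++ tutorial")}
candidates = {"C1": "C++ programming guide",
              "C2": "5G uplink and downlink",
              "C3": "cooking recipes"}
file_groups, unused = merge_low_frequency_items(low_items, candidates)
```

Because each chosen candidate is removed in step 6.3, no candidate data item ends up in more than one file group.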
Alternatively, when the number of low-frequency data items in the low-frequency data item set is greater than the number of candidate data items in the candidate data item set, all low-frequency data items in the low-frequency data item set are grouped to generate multiple low-frequency data item groups, such that in each of the multiple low-frequency data item groups the total accessed count of all low-frequency data items is greater than 1.5 times the low-frequency count threshold. The average accessed count of all low-frequency data items in each low-frequency data item group is then determined. Preferably, the absolute value of the difference between the average accessed counts of any two of the multiple low-frequency data item groups is less than 20, 30, 40, 50, 60, 70, or any other reasonable value.
For example, when the number of low-frequency data items in the low-frequency data item set (for example, 569) is greater than the number of candidate data items in the candidate data item set (for example, 516), the 569 low-frequency data items in the low-frequency data item set are grouped to generate multiple low-frequency data item groups. The present application determines the number of groups G into which the low-frequency data items are divided from the number K of low-frequency data items in the low-frequency data item set and a grouping parameter Z, where Z is equal to 3, 4, 5, or any other reasonable value. When Z is equal to 5, the 569 low-frequency data items are divided into 113 low-frequency data item groups (a result consistent with dividing K by Z and rounding down).
In addition, in each of the multiple low-frequency data item groups, the total accessed count of all low-frequency data items is greater than 1.1 times, 1.2 times, 1.3 times, 1.5 times, or any other reasonable multiple of the low-frequency count threshold. The average accessed count of all low-frequency data items in each low-frequency data item group, that is, the average accessed count of the group, is determined. For example, if a low-frequency data item group includes low-frequency data items 1 to 5 whose accessed counts are 95, 76, 110, 82, and 102 respectively, then the average accessed count of all low-frequency data items in the group is 93. The absolute value of the difference between the average accessed counts of any two of the multiple low-frequency data item groups is less than 20, 30, 40, 50, 60, 70, or any other reasonable value.
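The arithmetic in the grouping examples can be checked with a brief sketch. Note the main assumption: the formula for the number of groups G is not reproduced in this text, so the floor division below is inferred only from the worked example (K = 569, Z = 5, G = 113). How individual items are assigned to groups is not specified in the text and is not modelled here; the helper names and the second group average are hypothetical.

```python
def group_count(k, z):
    # Assumption inferred from the example: G is K divided by Z, rounded down.
    return k // z


def average_accessed(counts):
    """Average accessed count of the low-frequency data items in one group."""
    return sum(counts) / len(counts)


def averages_are_balanced(averages, limit):
    """True if every pair of group averages differs by less than the limit."""
    return max(averages) - min(averages) < limit


groups = group_count(569, 5)
average = average_accessed([95, 76, 110, 82, 102])
balanced = averages_are_balanced([93.0, 101.5, 88.0], limit=20)
```

Comparing the largest and smallest averages is sufficient for the balance check, since no other pair can differ by more.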
After the preselected data items whose low-frequency coefficient is less than the low-frequency coefficient threshold, among the multiple preselected data items in each storage device, have been determined as low-frequency data items, the method further includes:
performing a data access operation for each data access request in the buffer area of the system buffer device according to the current queue order of the multiple data access requests in the buffer area. For example, if the current queue order of the multiple data access requests in the buffer area of the system buffer device is: first data access request, second data access request, third data access request, fourth data access request, and fifth data access request, then the data access operations are performed for the requests in the buffer area in that order, from the first data access request through the fifth data access request.
When it is determined that the buffer area of the system buffer device no longer holds any saved data access request, a new data access request received by the big data storage system from any data requester is parsed to obtain a new query condition. For example, once the first, second, third, fourth, and fifth data access requests in the buffer area of the system buffer device have all been processed, the buffer area no longer holds any saved data access request. A sixth data access request received by the big data storage system from a data requester is then parsed to obtain a new query condition. The new query condition is, for example, "mobile communication AND 5G AND (uplink OR downlink)".
It is determined in the catalogue storage server of the big data storage system more involved in the new querying condition A data item, and determine at least one target storage device involved in multiple data item.Wherein, catalogue storage server is used for Store the directory information of all data item in big data storage system.For example, directory information is the identifier of data item, data item Summary info, the storage equipment that is located at of the metadata information of data item, the keyword message of data item, data item etc..Mesh Address book stored server looks into all data item in storage big data storage system according to querying condition or new querying condition It askes, for example, using new in the keyword message of the summary info of data item, the metadata information of data item and/or data item Querying condition (for example, mobile communication and 5G and (uplink or downlink)) inquired, looked into so that determination is described new Multiple data item involved in inquiry condition.Determine that each data item is located at, is stored in or related according to directory information Equipment is stored, thereby determines that at least one target storage device involved in multiple data item.Under special circumstances, multiple data Item is likely located in same target storage device.
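The directory lookup described above can be sketched as follows. The sketch assumes, purely for illustration, that the directory information is held as in-memory entries with a keyword set and a device name, and that the example query is represented as a set of required keywords plus a set of alternatives; `CatalogEntry`, `matches`, and `locate_targets` are hypothetical names, not part of the application.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CatalogEntry:
    item_id: str          # identifier of the data item
    keywords: frozenset   # keyword information of the data item
    device: str           # storage device on which the item is located

def matches(entry, all_of, any_of):
    """Evaluate a condition of the form: every keyword in all_of AND
    at least one keyword in any_of."""
    return all_of <= entry.keywords and bool(any_of & entry.keywords)

def locate_targets(catalog, all_of, any_of):
    """Return {device: [item ids]} for the data items involved in the
    query; the keys are the target storage devices."""
    targets = {}
    for entry in catalog:
        if matches(entry, all_of, any_of):
            targets.setdefault(entry.device, []).append(entry.item_id)
    return targets

# "mobile communication AND 5G AND (uplink OR downlink)"
required = {"mobile communication", "5G"}
alternatives = {"uplink", "downlink"}
```

If every matching entry names the same device, the returned mapping has a single key, which corresponds to the special case in which all data items are located on the same target storage device.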
The new query condition is sent to each target storage device, and at least one data item satisfying the new query condition is received from each target storage device. Each target storage device searches all the data items it stores according to the new query condition to obtain at least one data item, and sends the obtained data item or items to the interface device of the big data storage system. Preferably, the big data storage system of the present application contains no redundant data items, i.e., each data item is unique. The interface device receives data access requests from data requesters and sends data item sets or target data item sets to the corresponding data requesters.
All data items received from the target storage devices are assembled into a target data item set, and the target data item set is sent to the data requester indicated by the new data access request. Performing a data access operation for each data access request in the buffer according to the current queue order of the multiple data access requests in the buffer of the system buffer device includes:
8.1. Determine the currently processed data access request according to the current queue order of the multiple data access requests in the buffer of the system buffer device, wherein the currently processed data access request is the data access request ranked first in the current queue order of the multiple data access requests in the buffer.
8.2. Parse the currently processed data access request to obtain the currently processed query condition.
8.3. Determine, in the directory storage server of the big data storage system, the multiple data items involved in the currently processed query condition, and determine at least one target storage device involved in those data items. The directory storage server stores the directory information of all data items in the big data storage system.
8.4. Send the currently processed query condition to each target storage device, and receive from each target storage device at least one data item satisfying the currently processed query condition.
8.5. Assemble all data items received from the target storage devices into a target data item set, and send the target data item set to the data requester indicated by the currently processed data access request.
8.6. Delete the data access request ranked first in the current queue order of the multiple data access requests in the buffer.
8.7. Determine whether any saved data access request remains in the buffer of the system buffer device; if so, return to step 8.1; if not, determine that no saved data access request remains in the buffer of the system buffer device.
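Steps 8.1 to 8.7 amount to draining a first-in, first-out queue. The following Python sketch shows one way the loop could look, assuming the buffer is a `collections.deque` and that parsing, directory lookup, retrieval, and delivery are supplied as callables. The names `drain_buffer`, `parse`, `locate`, `fetch`, and `send` are hypothetical placeholders for the components described in the text.

```python
from collections import deque

def drain_buffer(buffer, parse, locate, fetch, send):
    """Process every saved data access request in queue order.

    buffer -- deque of saved requests, first in line at index 0
    parse  -- request -> query condition                       (step 8.2)
    locate -- query condition -> iterable of target devices    (step 8.3)
    fetch  -- (device, query condition) -> list of data items  (step 8.4)
    send   -- (request, target data item set) -> None          (step 8.5)
    """
    while buffer:                            # 8.7: any saved request left?
        request = buffer[0]                  # 8.1: request ranked first
        condition = parse(request)           # 8.2
        target_set = []
        for device in locate(condition):     # 8.3
            target_set.extend(fetch(device, condition))  # 8.4
        send(request, target_set)            # 8.5
        buffer.popleft()                     # 8.6: delete the first request
```

The request is removed only after its result has been sent, mirroring the order of steps 8.5 and 8.6; when the loop exits, the buffer holds no saved data access request.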

Claims (10)

1. A method for determining low-frequency data items in storage devices used for big data storage, the method comprising:
in response to receiving a request to determine low-frequency data items in each of the multiple storage devices used for big data storage in a big data storage system, redirecting new data access requests received by the big data storage system from any data requester to the system buffer device of the big data storage system, instead of sending the received new data access requests to the corresponding storage devices among the multiple storage devices, so that the system buffer device performs content matching between the description information of the query condition contained in a new data access request and each temporary data item in the temporary data item set of the system buffer device to determine the content matching degree of each temporary data item, selects from the multiple temporary data items at least one selected temporary data item whose content matching degree is greater than a matching degree threshold, sends the at least one selected temporary data item to the data requester indicated by the new data access request, and saves the new data access request in the buffer of the system buffer device;
when it is determined that no data access operation is running on any storage device in the big data storage system, obtaining the operation log file of each of the multiple storage devices in the big data storage system, determining, based on the current statistical time period and the operation log file of each storage device, the statistical access information of the multiple data items stored on each storage device, and determining the access information statistics file of each storage device according to a preset access time interval threshold and the statistical access information of the multiple data items stored on each storage device, wherein an access time interval is the period of time between two adjacent accesses to a data item; wherein the access information statistics file includes a frequency statistics table, the frequency statistics table includes multiple frequency records, and the content of each frequency record is the 8-tuple <data item identifier, access count, statistics start time, statistics end time, storage size, number of access time intervals greater than the access time interval threshold, maximum access time interval, minimum access time interval>;
determining, based on the access information statistics file, the multiple preselected data items whose access count within the current statistical time period is less than the low-frequency access-count threshold among all data items of each storage device, determining the total storage capacity of each storage device from the device description information in the system record device of the big data storage system, determining the free storage capacity of each storage device from the storage information file in the storage information area of each storage device, and determining the low-frequency coefficient of each preselected data item in each storage device according to the following formula:
where DTF_i is the low-frequency coefficient of the i-th preselected data item in the current storage device, t_imax is the maximum access time interval among the multiple access time intervals of the i-th preselected data item in the current storage device, t_imin is the minimum access time interval among the multiple access time intervals of the i-th preselected data item in the current storage device, t_ibegin is the statistics start time of the i-th preselected data item in the current storage device, t_iend is the statistics end time of the i-th preselected data item in the current storage device, C is the total storage capacity of the current storage device, R is the free storage capacity of the current storage device, UN_i is the number of access time intervals of the i-th preselected data item in the current storage device that are greater than the access time interval threshold, and AN_i is the access count of the i-th preselected data item in the current storage device, where i is a natural number and PT ≥ i ≥ 1, PT is the number of preselected data items in the current storage device, and PT ≥ 100; and
determining, as low-frequency data items, the preselected data items whose low-frequency coefficient is less than the low-frequency coefficient threshold among the multiple preselected data items in each storage device.
2. The method according to claim 1, wherein when a data management device located outside the big data storage system needs to determine low-frequency data items in the storage devices of the big data storage system, the data management device sends to the big data storage system the request to determine low-frequency data items in each of the multiple storage devices used for big data storage in the big data storage system;
wherein redirecting new data access requests received by the big data storage system from any data requester to the system buffer device of the big data storage system, instead of sending the received new data access requests to the corresponding storage devices among the multiple storage devices, comprises:
starting from the moment at which the big data storage system receives the request to determine low-frequency data items, redirecting the new data access requests subsequently received by the big data storage system from any data requester to the system buffer device of the big data storage system, instead of sending the received new data access requests to the corresponding storage devices among the multiple storage devices;
wherein the new data access request includes a query condition and description information of the query condition, the temporary data item set includes multiple temporary data items, and each temporary data item has summary information that briefly describes the content of the temporary data item;
wherein performing, by the system buffer device, content matching between the description information of the query condition contained in the new data access request and each temporary data item in the temporary data item set of the system buffer device to determine the content matching degree of each temporary data item comprises:
performing, by the system buffer device, between the description information of the query condition contained in the new data access request and the summary information of each temporary data item in the temporary data item set of the system buffer device, content matching based on semantic content comparison, content matching based on keyword comparison, or content matching based on a combination of semantic content and keywords, to determine the content matching degree between each temporary data item and the query condition;
wherein the matching degree threshold is 60%, and the content matching degree lies in the range [0%, 100%];
wherein, after saving the new data access request in the buffer of the system buffer device, the method further comprises: sending, to the data requester indicated by the new data access request, a response message indicating that the big data storage system has paused data access and that the new data access request has been saved in the buffer of the system buffer device, the response message carrying information indicating the current queue position of the new data access request from the data requester in the buffer, wherein the current queue order of the new data access requests in the buffer is determined according to the length of time each new data access request has been saved in the buffer, and the new data access requests are sorted in the current queue order in descending order of the length of time they have been saved.
3. The method according to any one of claims 1-2, wherein each storage device saves its own operation log file in its system data area;
wherein the current statistical time period is a period of time that starts on the day before the current date on which the big data storage system receives the request to determine low-frequency data items and extends back a predetermined number of calendar days; wherein the predetermined number of calendar days is 10 calendar days, 20 calendar days, or 30 calendar days;
wherein determining, based on the current statistical time period and the operation log file of each storage device, the statistical access information of the multiple data items stored on each storage device comprises:
selecting, based on the current statistical time period, from all log records in the operation log file of each storage device, to obtain the multiple log records of each storage device that fall within the current statistical time period;
classifying the multiple log records of each storage device within the current statistical time period by data item, to obtain the statistical access information of each data item;
composing the statistical access information of the multiple data items stored on each storage device from the statistical access information of each data item;
wherein each log record includes: a data item identifier, an access start time, an access end time, a storage size, and a storage start time;
wherein each data item has summary information that briefly describes the content of the data item.
4. The method according to any one of claims 1-3,
wherein the preset access time interval threshold is 5 minutes, 10 minutes, 15 minutes, or 20 minutes;
wherein determining the access information statistics file of each storage device according to the preset access time interval threshold and the statistical access information of the multiple data items stored on each storage device comprises:
compiling statistics on the statistical access information of each of the multiple data items stored on each storage device to determine the access count and all access time intervals of each data item;
determining, based on all access time intervals of each data item, the number of access time intervals greater than the access time interval threshold, the maximum access time interval, and the minimum access time interval of each data item;
determining the access start time of the first access in the statistical access information of each data item as the statistics start time, and determining the access end time of the last access in the statistical access information of each data item as the statistics end time;
determining the storage size of each data item based on the statistical access information of each data item.
5. The method according to any one of claims 1-4, wherein:
the low-frequency access-count threshold is 100, 150, or 200;
the device description information in the system record device includes: the total number of storage devices included in the big data storage system, the total storage capacity of each storage device, the network address of each storage device, and/or the time at which each storage device joined the big data storage system;
the storage information file in the storage information area of each storage device includes: the total number of data items, the storage size of each data item, the start storage time of each data item, the identifier of each data item, the summary information of each data item, and the free storage capacity of each storage device;
the low-frequency coefficient threshold is 120, 160, or 220.
6. A system for determining low-frequency data items in storage devices used for big data storage, the system comprising:
a preprocessing unit configured to, in response to receiving a request to determine low-frequency data items in each of the multiple storage devices used for big data storage in a big data storage system, redirect new data access requests received by the big data storage system from any data requester to the system buffer device of the big data storage system, instead of sending the received new data access requests to the corresponding storage devices among the multiple storage devices, so that the system buffer device performs content matching between the description information of the query condition contained in a new data access request and each temporary data item in the temporary data item set of the system buffer device to determine the content matching degree of each temporary data item, selects from the multiple temporary data items at least one selected temporary data item whose content matching degree is greater than a matching degree threshold, sends the at least one selected temporary data item to the data requester indicated by the new data access request, and saves the new data access request in the buffer of the system buffer device;
a statistics unit configured to, when it is determined that no data access operation is running on any storage device in the big data storage system, obtain the operation log file of each of the multiple storage devices in the big data storage system, determine, based on the current statistical time period and the operation log file of each storage device, the statistical access information of the multiple data items stored on each storage device, and determine the access information statistics file of each storage device according to a preset access time interval threshold and the statistical access information of the multiple data items stored on each storage device, wherein an access time interval is the period of time between two adjacent accesses to a data item; wherein the access information statistics file includes a frequency statistics table, the frequency statistics table includes multiple frequency records, and the content of each frequency record is the 8-tuple <data item identifier, access count, statistics start time, statistics end time, storage size, number of access time intervals greater than the access time interval threshold, maximum access time interval, minimum access time interval>;
a computing unit configured to determine, based on the access information statistics file, the multiple preselected data items whose access count within the current statistical time period is less than the low-frequency access-count threshold among all data items of each storage device, determine the total storage capacity of each storage device from the device description information in the system record device of the big data storage system, determine the free storage capacity of each storage device from the storage information file in the storage information area of each storage device, and determine the low-frequency coefficient of each preselected data item in each storage device according to the following formula:
where DTF_i is the low-frequency coefficient of the i-th preselected data item in the current storage device, t_imax is the maximum access time interval among the multiple access time intervals of the i-th preselected data item in the current storage device, t_imin is the minimum access time interval among the multiple access time intervals of the i-th preselected data item in the current storage device, t_ibegin is the statistics start time of the i-th preselected data item in the current storage device, t_iend is the statistics end time of the i-th preselected data item in the current storage device, C is the total storage capacity of the current storage device, R is the free storage capacity of the current storage device, UN_i is the number of access time intervals of the i-th preselected data item in the current storage device that are greater than the access time interval threshold, and AN_i is the access count of the i-th preselected data item in the current storage device, where i is a natural number and PT ≥ i ≥ 1, PT is the number of preselected data items in the current storage device, and PT ≥ 100; and
determine, as low-frequency data items, the preselected data items whose low-frequency coefficient is less than the low-frequency coefficient threshold among the multiple preselected data items in each storage device.
7. The system according to claim 6, wherein when a data management device located outside the big data storage system needs to determine low-frequency data items in the storage devices of the big data storage system, the data management device sends to the big data storage system the request to determine low-frequency data items in each of the multiple storage devices used for big data storage in the big data storage system;
wherein the preprocessing unit redirecting new data access requests received by the big data storage system from any data requester to the system buffer device of the big data storage system, instead of sending the received new data access requests to the corresponding storage devices among the multiple storage devices, comprises:
the preprocessing unit, starting from the moment at which the big data storage system receives the request to determine low-frequency data items, redirecting the new data access requests subsequently received by the big data storage system from any data requester to the system buffer device of the big data storage system, instead of sending the received new data access requests to the corresponding storage devices among the multiple storage devices;
wherein the new data access request includes a query condition and description information of the query condition, the temporary data item set includes multiple temporary data items, and each temporary data item has summary information that briefly describes the content of the temporary data item;
wherein the preprocessing unit causing the system buffer device to perform content matching between the description information of the query condition contained in the new data access request and each temporary data item in the temporary data item set of the system buffer device to determine the content matching degree of each temporary data item comprises:
the preprocessing unit causing the system buffer device to perform, between the description information of the query condition contained in the new data access request and the summary information of each temporary data item in the temporary data item set of the system buffer device, content matching based on semantic content comparison, content matching based on keyword comparison, or content matching based on a combination of semantic content and keywords, to determine the content matching degree between each temporary data item and the query condition;
wherein the matching degree threshold is 60%, and the content matching degree lies in the range [0%, 100%];
wherein the preprocessing unit sends, to the data requester indicated by the new data access request, a response message indicating that the big data storage system has paused data access and that the new data access request has been saved in the buffer of the system buffer device, the response message carrying information indicating the current queue position of the new data access request from the data requester in the buffer, wherein the current queue order of the new data access requests in the buffer is determined according to the length of time each new data access request has been saved in the buffer, and the new data access requests are sorted in the current queue order in descending order of the length of time they have been saved.
8. The system according to any one of claims 6-7, wherein each storage device saves its operation log file in its system data area;
wherein the current statistical time period is a period of time that starts on the day before the current date on which the big data storage system receives the request to determine low-frequency data items and extends back a predetermined number of calendar days; wherein the predetermined number of calendar days is 10 calendar days, 20 calendar days, or 30 calendar days;
wherein the statistics unit determining, based on the current statistical time period and the operation log file of each storage device, the statistical access information of the multiple data items stored on each storage device comprises:
the statistics unit selecting, based on the current statistical time period, from all log records in the operation log file of each storage device, to obtain the multiple log records of each storage device that fall within the current statistical time period;
the statistics unit classifying the multiple log records of each storage device within the current statistical time period by data item, to obtain the statistical access information of each data item;
the statistics unit composing the statistical access information of the multiple data items stored on each storage device from the statistical access information of each data item;
wherein each log record includes: a data item identifier, an access start time, an access end time, a storage size, and a storage start time;
wherein each data item has summary information that briefly describes the content of the data item.
9. The system according to any one of claims 6-8,
wherein the preset access time interval threshold is 5 minutes, 10 minutes, 15 minutes, or 20 minutes;
wherein the statistics unit determining the access information statistics file of each storage device according to the preset access time interval threshold and the statistical access information of the multiple data items stored on each storage device comprises:
the statistics unit compiling statistics on the statistical access information of each of the multiple data items stored on each storage device to determine the access count and all access time intervals of each data item;
the statistics unit determining, based on all access time intervals of each data item, the number of access time intervals greater than the access time interval threshold, the maximum access time interval, and the minimum access time interval of each data item;
the statistics unit determining the access start time of the first access in the statistical access information of each data item as the statistics start time, and determining the access end time of the last access in the statistical access information of each data item as the statistics end time;
the statistics unit determining the storage size of each data item based on the statistical access information of each data item.
10. The system according to any one of claims 6-9, wherein:
the low-frequency access-count threshold is 100, 150, or 200;
the device description information in the system record device includes: the total number of storage devices included in the big data storage system, the total storage capacity of each storage device, the network address of each storage device, or the time at which each storage device joined the big data storage system;
the storage information file in the storage information area of each storage device includes: the total number of data items, the storage size of each data item, the start storage time of each data item, the identifier of each data item, the summary information of each data item, and the free storage capacity of each storage device;
the low-frequency coefficient threshold is 120, 160, or 220.
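As an illustration of the statistics described in the claims above, the following Python sketch derives the statistical time period and one frequency record (the 8-tuple) from log records. Several points are assumptions made only for this example: the window is taken to include the day before the request and to span exactly the predetermined number of calendar days; an access time interval is measured between the start times of two adjacent accesses; the storage size is taken from the most recent log record; and the names `LogRecord`, `statistics_window`, and `frequency_record` are hypothetical. The low-frequency coefficient is not computed here, because its formula is not reproduced in this text.

```python
from dataclasses import dataclass
from datetime import date, datetime, timedelta

@dataclass
class LogRecord:
    item_id: str             # data item identifier
    access_start: datetime   # access start time
    access_end: datetime     # access end time
    storage_size: int        # storage size (unit is an assumption)

def statistics_window(request_date: date, days: int):
    """First and last day of the statistical time period: `days` calendar
    days ending on the day before `request_date` (inclusive, assumed)."""
    last_day = request_date - timedelta(days=1)
    first_day = last_day - timedelta(days=days - 1)
    return first_day, last_day

def frequency_record(item_id, records, interval_threshold: timedelta):
    """Build the 8-tuple for one data item, or None if it was not accessed."""
    recs = sorted((r for r in records if r.item_id == item_id),
                  key=lambda r: r.access_start)
    if not recs:
        return None
    # Interval between adjacent accesses (start to start, assumed).
    intervals = [b.access_start - a.access_start
                 for a, b in zip(recs, recs[1:])]
    return (
        item_id,
        len(recs),                        # access count
        recs[0].access_start,             # statistics start time
        recs[-1].access_end,              # statistics end time
        recs[-1].storage_size,            # storage size
        sum(1 for gap in intervals if gap > interval_threshold),
        max(intervals, default=None),     # maximum access time interval
        min(intervals, default=None),     # minimum access time interval
    )
```

A data item with a single access has no access time intervals, so the last two fields are `None` in that case; a real implementation would need to decide how such items enter the low-frequency coefficient.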
CN201811006475.6A 2018-08-30 2018-08-30 Method and system for determining low frequency data items in a storage device for large data storage Active CN109033462B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811006475.6A CN109033462B (en) 2018-08-30 2018-08-30 Method and system for determining low frequency data items in a storage device for large data storage

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811006475.6A CN109033462B (en) 2018-08-30 2018-08-30 Method and system for determining low frequency data items in a storage device for large data storage

Publications (2)

Publication Number Publication Date
CN109033462A true CN109033462A (en) 2018-12-18
CN109033462B CN109033462B (en) 2023-04-28

Family

ID=64626509

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811006475.6A Active CN109033462B (en) 2018-08-30 2018-08-30 Method and system for determining low frequency data items in a storage device for large data storage

Country Status (1)

Country Link
CN (1) CN109033462B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109271104A (en) * 2018-08-30 2019-01-25 杜广香 Method and system for determining the operating status of a big data storage system
CN109739817A (en) * 2018-12-26 2019-05-10 杜广香 Method and system for storing data file in big data storage system
CN109753505A (en) * 2018-12-26 2019-05-14 杜广香 Method and system for creating temporary storage unit in big data storage system
CN112965810A (en) * 2021-01-27 2021-06-15 合肥大多数信息科技有限公司 Multi-kernel browser data integration method based on shared network channel
CN116541365A (en) * 2023-07-06 2023-08-04 成都泛联智存科技有限公司 File storage method, device, storage medium and client

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106775461A (en) * 2016-11-30 2017-05-31 华为技术有限公司 Hot spot data determination method, device, and apparatus

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109271104A (en) * 2018-08-30 2019-01-25 杜广香 Method and system for determining the operating status of a big data storage system
CN109739817A (en) * 2018-12-26 2019-05-10 杜广香 Method and system for storing data file in big data storage system
CN109753505A (en) * 2018-12-26 2019-05-14 杜广香 Method and system for creating temporary storage unit in big data storage system
CN109753505B (en) * 2018-12-26 2022-06-24 济南银华信息技术有限公司 Method and system for creating temporary storage unit in big data storage system
CN109739817B (en) * 2018-12-26 2023-01-03 深圳光点软件科技有限公司 Method and system for storing data file in big data storage system
CN112965810A (en) * 2021-01-27 2021-06-15 合肥大多数信息科技有限公司 Multi-kernel browser data integration method based on shared network channel
CN116541365A (en) * 2023-07-06 2023-08-04 成都泛联智存科技有限公司 File storage method, device, storage medium and client
CN116541365B (en) * 2023-07-06 2023-09-15 成都泛联智存科技有限公司 File storage method, device, storage medium and client

Also Published As

Publication number Publication date
CN109033462B (en) 2023-04-28

Similar Documents

Publication Publication Date Title
CN109033462A (en) The method and system of low-frequency data item are determined in the storage equipment of big data storage
US9727572B2 (en) Database compression system and method
CN107801086A (en) The dispatching method and system of more caching servers
CN106959963A (en) A kind of data query method, apparatus and system
US20030018688A1 (en) Method and apparatus to facilitate accessing data in network management protocol tables
KR101411321B1 (en) Method and apparatus for managing neighbor node having similar characteristic with active node and computer readable medium thereof
CN100512152C (en) Method of managing alarm inquiry
CN105550180B (en) The method, apparatus and system of data processing
CN108228432A (en) A kind of distributed link tracking, analysis method and server, global scheduler
CN109271104A (en) It is a kind of for determining the method and system of the operating status of big data storage system
CN107203623B (en) Load balancing and adjusting method of web crawler system
US20090037443A1 (en) Intelligent group communication
CN101848149B (en) Method and device for scheduling graded queues in packet network
CN109271103A (en) A kind of method and system carrying out data mixing storage in big data storage system
CN109271102A (en) Identify the method and system of the low access degree storage equipment in big data storage system
CN109271101A (en) It is a kind of for determining the method and system of the data balancing of big data storage system
CN100488173C (en) A method for carrying out automatic selection of packet classification algorithm
CN109240988A (en) For avoiding big data storage system from entering the method and system of access imbalance state
US6996577B1 (en) Method and system for automatically grouping objects in a directory system based on their access patterns
CN106940715B (en) A kind of method and apparatus of the inquiry based on concordance list
CN109150819B (en) A kind of attack recognition method and its identifying system
JP4648290B2 (en) Packet transfer apparatus, packet distribution method, group affiliation processor change method, and computer program
US11681680B2 (en) Method, device and computer program product for managing index tables
Mahmood et al. Fast: frequency-aware spatio-textual indexing for in-memory continuous filter query processing
CN108810095A (en) A kind of news push method and apparatus

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20230403

Address after: Room 201, No. 2-2-2 Yingcai Street, Tianhe District, Guangzhou City, Guangdong Province, 510000 (Location: 2) (Office only)

Applicant after: Guangzhou sibeishou Engineering Consulting Co.,Ltd.

Address before: 252659 Shandong province Liaocheng City Linqing City Dai Wan Town, the village of the South Village Health Room

Applicant before: Du Guangxiang

GR01 Patent grant