CN103853766B - Online processing method and system for streaming data - Google Patents

Info

Publication number
CN103853766B
Authority
CN
China
Prior art keywords
stream data
data
memory cache
cache layer
analysis program
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201210510056.2A
Other languages
Chinese (zh)
Other versions
CN103853766A (en)
Inventor
张瑾
程学旗
林祥辉
黄康平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS
Priority to CN201210510056.2A
Publication of CN103853766A
Application granted
Publication of CN103853766B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2471 Distributed queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/24552 Database cache management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Computer And Data Communications (AREA)

Abstract

The invention discloses an online processing method for streaming data, comprising: Step 1, establishing an online in-memory cache layer, extracting the attributes of the streaming data, and storing the data in the online in-memory cache layer in a key-value structure; Step 2, building a hybrid index structure over the streaming data in the in-memory cache layer; Step 3, adding an access flag to each indexed streaming data record, the flag indicating which analysis programs are registered for that record and recording each analysis program's access state for it; Step 4, data cleanup: if a streaming data record has been accessed by all of the designated analysis programs of the in-memory cache layer, cleaning up that record. The invention substantially reduces data read/write pressure during stream processing, effectively relieves the load on the database in large-scale streaming data processing systems, and improves the real-time processing speed of streaming data.

Description

Online processing method and system for streaming data
Technical Field
The present invention relates to large-scale data processing, and in particular to an online processing method and system for streaming data.
Background
With the progress of the times and the development of the economy, people's demand for information in daily life keeps growing. In particular, as the Internet becomes ever more widespread, massive amounts of information are published and spread online every day. In 2011, the market research firm IDC released the report "Extracting Value from Chaos". The report shows that the total amount of information worldwide doubles every two years. In 2011, the total amount of data created and replicated worldwide was 1.8 ZB. By way of illustration, 1.8 ZB is equivalent to the data that would be produced if every person in the world underwent 215 million high-resolution MRI scans per day.
The task of a large-scale data analysis and processing system is to process massive data and mine valuable knowledge from it. A typical data processing system first collects and stores data from each data source, then reads the data back from the storage device for analysis and processing. One commonly used architecture establishes a central database to handle the storage and reading of data. First, collection programs gather data from the Internet in categories such as news, forums, blogs, microblogs, social networks, and search engines, and write it to the central database; then the various analysis programs read data from the database for subsequent analysis and processing. The central database thus bears both the write and the read workload.
The architecture with a database as its storage center has been widely accepted and applied. In a massive-data environment, however, as the variety of data sources, the volume of derived data, and the number of analysis applications grow, the problems of the central database architecture become increasingly prominent. Its shortcomings show in three main respects: first, degraded real-time responsiveness; second, frequent database interactions; third, data processing latency.
As data sources, data volume, and the number of applications increase, the shortcomings of traditional data processing and analysis systems built on a central database architecture become increasingly prominent. A new data processing architecture is therefore needed to alleviate the above problems effectively.
In general, approaches to this problem fall into the following four categories:
Message-oriented middleware. Message-oriented middleware is a middleware technology built on a message-passing mechanism or a message queue model. It can deliver messages to individual applications, relieve data read/write pressure, and control within the middleware how applications access messages. Message-oriented middleware plays an important role in many industry applications. Enterprise applications require message delivery to be reliable and secure; however, an excessive focus on reliability and security increases data processing time and transmission latency, which does not suit the throughput requirements of large-scale data processing.
Distributed message queues. A growing number of companies and research institutions are trying to use distributed message-oriented systems to alleviate the problems of the central database architecture; most of these distributed message queues are released as open-source projects. A distributed message processing system can provide efficient messaging services in a massive-data environment. Such systems have two problems, however. First, they read and write data only by primary-key lookup and cannot query by other key fields, so they cannot fully replace the query capability of a relational database. Second, in order to guarantee high throughput, they cannot fully guarantee data integrity and security.
Caching. In computer architecture, memory read/write speed is more than 10 times that of disk, so to avoid frequent database reads and writes, some have adopted caching: a block of memory outside the database is allocated as a data buffer, reducing database load and improving data access speed. Such memory-based caches still have two problems. First, they cannot optimize the efficiency of writing data into the database. Second, data organized as key-value (Key-Value) pairs cannot support range queries on a specific field.
In-memory databases. In Web applications, data such as user visits and user clicks arrive as streams, so online processing methods for streaming data have become a question of great interest to both academia and industry. Another branch of research on online data processing is the development of in-memory databases. An in-memory database, as the name suggests, is a database that keeps its data in memory and operates on it there. Memory read/write speed is several orders of magnitude higher than that of disk, so keeping data in memory rather than accessing it from disk can greatly improve application performance. In addition, in-memory databases abandon the traditional disk-oriented approach to data management: the architecture is redesigned on the assumption that all data resides in memory, with corresponding improvements in data caching, fast algorithms, and parallel operation, so they process data much faster than traditional databases, typically by more than 10 times. The defining feature of an in-memory database is that its "primary copy" or "working version" resides in memory, i.e. active transactions interact only with the in-memory copy of the real-time in-memory database. The biggest shortcoming of Redis is that it does not solve the problem of data service reliability well: all data is stored in the memory space of the user application, and once the process restarts or exits abnormally, data is lost. It also cannot meet the need to query by different fields of the data.
In summary, the ability of the prior art to relieve data access pressure is limited by various factors and cannot meet practical needs.
Summary of the Invention
The purpose of the present invention is to introduce a memory-based online cache layer that, exploiting the characteristics of streaming data, shifts the heavy read/write load originally borne by the database onto the online cache layer, thereby substantially reducing data read/write pressure during stream processing, effectively relieving the load on the database in large-scale streaming data processing systems, and improving the real-time processing speed of streaming data.
To achieve the above purpose, the present invention proposes an online processing method for streaming data, comprising:
Step 1: establish an online in-memory cache layer; after attribute extraction, store the streaming data in the online in-memory cache layer in a key-value structure;
Step 2: build a hybrid index structure over the streaming data in the in-memory cache layer;
Step 3: add an access flag to each indexed streaming data record; the flag indicates which analysis programs are registered for that record and records each analysis program's access state for it;
Step 4: data cleanup: if a streaming data record has been accessed by all of the designated analysis programs of the in-memory cache layer, clean up that record.
The online processing method further comprises: after an analysis program reads a streaming data record from the in-memory cache layer, checking the record's access flag:
If the record has already been accessed by that analysis program, i.e. its flag bit is marked as read, the record is not returned to the analysis program;
If the record has not been accessed by that analysis program, i.e. its flag bit is marked as unread, the record is returned to the analysis program and its flag bit is set to read.
The online processing method further comprises: after a streaming data record is read, checking the record's access flag:
If the record has been accessed by all registered analysis programs, it is removed from the in-memory cache layer;
Otherwise, checking whether the record's residence time exceeds a threshold: if not, the record continues to wait for access by analysis programs; if so, the record is removed from the in-memory cache layer.
The key-value structure in step 1 is established as follows: for each streaming data record, the in-memory cache layer assigns a unique ID as the record's key, and the value stored under that key holds all attribute information of the record. The hybrid index structure in step 2 is built by combining the key-value structure with B+ tree and hash index structures.
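The key-value organization described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the class name `KeyValueCache`, the counter-based ID scheme, and the sample attribute names are all assumptions introduced here.

```python
import itertools
from typing import Any, Dict


class KeyValueCache:
    """Hypothetical in-memory store: one unique ID per record, attributes as the value."""

    def __init__(self) -> None:
        self._next_id = itertools.count(1)  # assumed ID scheme: a simple counter
        self._records: Dict[int, Dict[str, Any]] = {}

    def put(self, attributes: Dict[str, Any]) -> int:
        """Store the extracted attributes under a newly assigned ID and return the ID."""
        record_id = next(self._next_id)
        self._records[record_id] = dict(attributes)  # copy so callers cannot mutate the cache
        return record_id

    def get(self, record_id: int) -> Dict[str, Any]:
        """Return all attribute information stored for the given key."""
        return self._records[record_id]


cache = KeyValueCache()
first = cache.put({"source": "forum", "author": "alice", "timestamp": 1000})
second = cache.put({"source": "microblog", "author": "bob", "timestamp": 1005})
```

A counter is enough for a single process; a distributed deployment would need some other way to keep IDs globally unique, which the text does not specify.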
Step 2 comprises:
Determining whether the streaming data in the online cache layer needs to be queried by field:
If it does: when range queries are needed on an attribute, a B+ tree index is built on that attribute field; when primary-key (unique-value) queries are needed on an attribute, a hash index is built on that attribute field;
If it does not, no index is built on the attribute field.
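The per-field decision in step 2 can be written as a small function. The sketch below only chooses an index type; the enum `QueryKind`, the function `choose_index`, and the field names are hypothetical and serve purely as illustration.

```python
from enum import Enum
from typing import Dict, Optional


class QueryKind(Enum):
    RANGE = "range"    # interval queries on the field
    UNIQUE = "unique"  # exact-match (primary-key style) queries on the field


def choose_index(query_kind: Optional[QueryKind]) -> Optional[str]:
    """Return the index type to build for a field, or None if it is never queried."""
    if query_kind is None:
        return None  # no query by this field, so no index is built
    if query_kind is QueryKind.RANGE:
        return "b+tree"
    return "hash"


# Hypothetical field configuration for a collected post.
field_queries: Dict[str, Optional[QueryKind]] = {
    "timestamp": QueryKind.RANGE,
    "url": QueryKind.UNIQUE,
    "body": None,
}
index_plan = {field: choose_index(kind) for field, kind in field_queries.items()}
```

Here `index_plan` maps `timestamp` to a B+ tree index, `url` to a hash index, and leaves `body` unindexed.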
In step 3: the access flag is a 32-bit integer, each bit of which can represent the access state of one analysis program for the streaming data record; when streaming data is initialized in memory, every bit of each record's access flag is 0;
When an analysis program registers with the in-memory cache layer, the cache layer assigns it an access flag bit as its access identifier. After an analysis program accesses a streaming data record, the cache layer performs a bitwise operation on the record's access flag and the analysis program's access identifier, and takes the result as the record's current access flag.
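The flag arithmetic can be illustrated with plain integer operations. This sketch assumes the bitwise operation is an OR (as the detailed description states) and that each analysis program's identifier is a mask with exactly one bit set; the function names are invented for illustration.

```python
FLAG_MASK = 0xFFFFFFFF  # keep the access flag within 32 bits


def mask_for_bit(bit: int) -> int:
    """Access identifier for the analysis program assigned to the given bit position."""
    if not 0 <= bit < 32:
        raise ValueError("bit position must be in 0..31")
    return 1 << bit


def mark_accessed(record_flag: int, program_mask: int) -> int:
    """Bitwise OR of the record's flag with the program's identifier."""
    return (record_flag | program_mask) & FLAG_MASK


def has_accessed(record_flag: int, program_mask: int) -> bool:
    """True if the program's bit is already set in the record's flag."""
    return (record_flag & program_mask) != 0


sentiment = mask_for_bit(0)  # hypothetical analysis program on bit 0
topics = mask_for_bit(1)     # hypothetical analysis program on bit 1

flag = 0                     # every bit is 0 when the record is initialized
flag = mark_accessed(flag, topics)
seen_by_sentiment = has_accessed(flag, sentiment)
flag = mark_accessed(flag, sentiment)
```

One consequence of a 32-bit flag is that at most 32 analysis programs can be registered at a time.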
In step 4:
After a streaming data record is read, the record's access flag is checked:
If the record has been accessed by all registered analysis programs, it is removed from the in-memory cache layer;
Otherwise, whether the record's residence time exceeds a threshold is checked: if not, the record continues to wait for access by analysis programs; if so, the record is removed from the in-memory cache layer.
To achieve the above purpose, the present invention also provides an online processing system for streaming data, comprising:
An online in-memory cache layer construction module, configured to establish the online in-memory cache layer and, after attribute extraction, store the streaming data in it in a key-value structure;
A hybrid index construction module, configured to build a hybrid index structure over the streaming data in the in-memory cache layer;
An access flag construction module, configured to add an access flag to each indexed streaming data record, the flag indicating which analysis programs are registered for that record and recording each analysis program's access state for it;
An in-memory streaming data cleanup module, configured to clean up streaming data records that have been accessed by all of the designated analysis programs of the in-memory cache layer.
The online processing system further comprises:
A streaming data read-and-return module, configured to check a streaming data record's access flag after the record is read:
If the record has already been accessed by the analysis program, i.e. its flag bit is marked as read, the record is not returned to the analysis program; if the record has not been accessed by the analysis program, i.e. its flag bit is marked as unread, the flag bit is set to read and the record is returned to the analysis program.
In the in-memory streaming data cleanup module:
After an analysis program reads a streaming data record from the in-memory cache layer, the record's access flag is checked: if the record has been accessed by all registered analysis programs, it is removed from the in-memory cache layer; otherwise, whether the record's residence time exceeds a threshold is checked: if not, the record continues to wait for access by analysis programs; if so, it is removed from the in-memory cache layer.
The beneficial effects of the present invention are as follows: by adding a memory-based data cache that exploits the characteristics of streaming data, the online processing method and system shift the heavy read/write load originally borne by the database onto the online cache layer. This effectively relieves the load on the database in large-scale streaming data processing systems, greatly reduces streaming data read/write pressure, and improves both the real-time processing speed of streaming data and the timeliness of the data processing system.
The present invention is described below with reference to the drawings and specific embodiments, which are not intended to limit the invention.
Brief Description of the Drawings
Fig. 1 is a flowchart of the online processing method for streaming data of the present invention;
Fig. 2 is a schematic diagram of the online processing system for streaming data of the present invention.
Detailed Description
The core idea of the present invention is to introduce a memory-based online cache layer on top of the original architecture. Exploiting the characteristics of streaming data, the heavy read/write load originally borne by the database is shifted onto the online cache, so that data services can be provided efficiently.
Fig. 1 is a flowchart of the online processing method for streaming data of the present invention. As shown in Fig. 1, the method comprises:
Step 1: establish an online in-memory cache layer; after attribute extraction, store the streaming data in the online in-memory cache layer in a key-value structure.
Step 2: build a hybrid index structure over the streaming data in the in-memory cache layer.
Step 3: add an access flag to each indexed streaming data record; the flag indicates which analysis programs are registered for that record and records each analysis program's access state for it.
Streaming data exists dynamically, and for each streaming data record the set of analysis programs that may access it is fixed.
Step 4: data cleanup: if a streaming data record has been accessed by all of the designated analysis programs of the in-memory cache layer, clean up that record.
The key-value structure in step 1 is established as follows: for each streaming data record, the in-memory cache layer assigns a unique ID as the record's key, and the value stored under that key holds all attribute information of the record. An online in-memory cache layer is added on top of the original central-database architecture. The added cache layer manages streaming data in memory and exposes data read/write services through a network interface. Adding the cache layer changes the data flow of the data processing system. On the one hand, collection programs write the collected streaming data into the in-memory cache, and analysis programs read streaming data from the cache for analysis. On the other hand, the in-memory cache periodically writes the streaming data held in memory to the database for persistent storage.
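The adjusted data flow (collectors write to the cache, analysis programs read from the cache, and the cache periodically persists to the database) might look like the sketch below. It uses SQLite from the Python standard library as a stand-in for the central database; the class `WriteBehindCache`, the table layout, and the JSON serialization are assumptions, and the network interface and the timer that would call `flush()` are omitted.

```python
import json
import sqlite3
from typing import Any, Dict, List


class WriteBehindCache:
    """Hypothetical cache layer that batches database writes instead of writing per record."""

    def __init__(self, db: sqlite3.Connection) -> None:
        self._db = db
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS records (id INTEGER PRIMARY KEY, attrs TEXT)"
        )
        self._records: Dict[int, Dict[str, Any]] = {}
        self._unflushed: List[int] = []  # IDs written since the last flush

    def write(self, record_id: int, attrs: Dict[str, Any]) -> None:
        """Called by collection programs; touches memory only."""
        self._records[record_id] = attrs
        self._unflushed.append(record_id)

    def read(self, record_id: int) -> Dict[str, Any]:
        """Called by analysis programs; served from memory without a database query."""
        return self._records[record_id]

    def flush(self) -> int:
        """Persist everything written since the last flush; return the row count."""
        rows = [(rid, json.dumps(self._records[rid])) for rid in self._unflushed]
        self._db.executemany(
            "INSERT OR REPLACE INTO records (id, attrs) VALUES (?, ?)", rows
        )
        self._db.commit()
        self._unflushed.clear()
        return len(rows)


db = sqlite3.connect(":memory:")
cache = WriteBehindCache(db)
cache.write(1, {"source": "news", "title": "first"})
cache.write(2, {"source": "blog", "title": "second"})
flushed = cache.flush()
stored = db.execute("SELECT COUNT(*) FROM records").fetchone()[0]
```

Batching the writes is what moves per-record write pressure off the database; how often to flush is a tuning decision the text leaves open.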
In the online in-memory cache, each streaming data record is stored as a key-value pair. For each record, the cache assigns a globally unique ID as the record's key; the value stored under that key holds the information for all of the record's attributes. All streaming data is stored as key-value pairs, and a record is uniquely identified by its key. On top of the key-value store, the invention builds a hybrid multi-index structure for streaming data, creating different types of index for different fields of each record. Some queries on the stored data are uniqueness (exact-match) queries on an attribute field, and some are range queries on a field. For fields with uniqueness queries, a hash index is built in memory, with the unique field as the index key; in the best case, a uniqueness query on the hash index finds the record in O(1) (constant) time. For attribute fields with range queries, a B+ tree index is built in memory; in the average case, a range query on the B+ tree index completes in O(log n) (logarithmic) time.
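The two query paths can be sketched as follows. Python's standard library has no B+ tree, so this sketch substitutes a sorted list searched with `bisect`, which gives the same O(log n) seek to the start of a range but is not a B+ tree; a `dict` plays the role of the hash index. The sample records and field names are invented for illustration.

```python
import bisect
from typing import Dict, List, Optional, Tuple

# Key-value store: record ID -> attributes (sample data, assumed).
records = {
    1: {"url": "http://example.com/a", "timestamp": 1000},
    2: {"url": "http://example.com/b", "timestamp": 1010},
    3: {"url": "http://example.com/c", "timestamp": 1025},
}

# Hash index on the unique field: field value -> record ID.
url_index: Dict[str, int] = {rec["url"]: rid for rid, rec in records.items()}

# Ordered index on the range field: sorted (value, record ID) pairs.
# Stand-in for a B+ tree, not a B+ tree implementation.
ts_index: List[Tuple[int, int]] = sorted(
    (rec["timestamp"], rid) for rid, rec in records.items()
)


def lookup_by_url(url: str) -> Optional[int]:
    """Uniqueness query: one hash probe."""
    return url_index.get(url)


def range_by_timestamp(low: int, high: int) -> List[int]:
    """Range query: IDs of records with low <= timestamp <= high."""
    start = bisect.bisect_left(ts_index, (low,))
    end = bisect.bisect_right(ts_index, (high, float("inf")))
    return [rid for _, rid in ts_index[start:end]]


match = lookup_by_url("http://example.com/b")
window = range_by_timestamp(1005, 1025)
```

A sorted list is expensive to update on every insert, which is one reason a tree structure suits a cache that is written to continuously.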
The online processing method further includes a dynamic registration step:
After an analysis program reads a streaming data record from the in-memory cache layer, the record's access flag is checked:
If the record has already been accessed by that analysis program, i.e. its flag bit is marked as read, the record is not returned to the analysis program;
If the record has not been accessed by that analysis program, i.e. its flag bit is marked as unread, the record is returned to the analysis program and its flag bit is set to read.
The invention establishes in memory a mechanism, based on access control tags, for dynamically registering and deregistering applications, providing highly scalable streaming reads. For each streaming data record in memory, the invention adds a data access tag. The data access tag is a 32-bit integer, each bit of which can represent one analysis program's usage of the record. An analysis program must register with the in-memory cache, which assigns it a data access identifier; that is, one bit of the 32-bit integer represents the registered analysis program. Once registration succeeds, the analysis program uses this identifier to access and consume streaming data. To reduce the network bandwidth consumed by repeated data during stream processing, an analysis program may not access the same record more than once. When data is initialized in memory, every bit of each record's data access tag is 0. After an application accesses a record, the in-memory cache performs a bitwise OR of the record's data access tag and the analysis program's data access identifier, and takes the result as the record's current data access tag. Once an application has accessed a record, it cannot access that record again.
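Registration, deregistration, and read-once delivery can be combined in one small class. This is a sketch under stated assumptions: the class `ReadOnceCache` and its method names are hypothetical, and clearing a departing program's bit on every record at deregistration (so the bit can be reassigned) is a design choice made here, not something the text prescribes.

```python
from typing import Dict, Optional


class ReadOnceCache:
    """Hypothetical cache that delivers each record at most once per registered program."""

    MAX_PROGRAMS = 32  # one bit per program in a 32-bit access tag

    def __init__(self) -> None:
        self._bit_of: Dict[str, int] = {}   # program name -> assigned bit position
        self._payload: Dict[int, str] = {}  # record ID -> record content
        self._tag: Dict[int, int] = {}      # record ID -> access tag

    def register(self, program: str) -> int:
        """Assign the lowest free bit to the program and return its position."""
        if program in self._bit_of:
            return self._bit_of[program]
        used = set(self._bit_of.values())
        for bit in range(self.MAX_PROGRAMS):
            if bit not in used:
                self._bit_of[program] = bit
                return bit
        raise RuntimeError("all 32 tag bits are in use")

    def unregister(self, program: str) -> None:
        """Release the program's bit and clear it on every record (assumed policy)."""
        bit = self._bit_of.pop(program)
        for record_id in self._tag:
            self._tag[record_id] &= ~(1 << bit)

    def put(self, record_id: int, payload: str) -> None:
        self._payload[record_id] = payload
        self._tag[record_id] = 0  # all bits start at 0

    def read(self, program: str, record_id: int) -> Optional[str]:
        """Return the record on first access by this program, None on repeats."""
        mask = 1 << self._bit_of[program]
        if self._tag[record_id] & mask:
            return None
        self._tag[record_id] |= mask  # bitwise OR marks the record as read
        return self._payload[record_id]


cache = ReadOnceCache()
cache.register("sentiment")
cache.put(1, "post text")
first_read = cache.read("sentiment", 1)
second_read = cache.read("sentiment", 1)
```

The second read returns `None`, so the same record is never sent over the network twice to the same analysis program.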
Step 4 comprises:
After a streaming data record is read, the record's access flag is checked:
If the record has been accessed by all registered analysis programs, it is removed from the in-memory cache layer;
Otherwise, whether the record's residence time exceeds a threshold is checked: if not, the record continues to wait for access by analysis programs; if so, the record is removed from the in-memory cache layer.
That is, the invention establishes an efficient mechanism for cleaning up and evicting in-memory data, promptly clearing resident streaming data from memory and improving the availability of the data service. The cleanup mechanism for streaming data in memory considers two cases. In the normal case, the in-memory cache checks a record's access control tag; if it finds that all registered analysis programs have used the record, it starts the data cleanup process and deletes the record from memory. In the abnormal case, the in-memory cache checks the record's access control tag and finds that some analysis programs have not yet accessed the record; it then evaluates how long the record has resided in memory. If the record has resided in memory for longer than the specified time threshold, the data cleanup process is started and the record is deleted from memory; if the residence time has not exceeded the threshold, the record is left untouched and remains in memory.
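The two cleanup cases reduce to one predicate per record plus a sweep over the cache. In this sketch the names `CachedRecord`, `should_remove`, and `sweep` are hypothetical, times are plain seconds supplied by the caller, and `registered_mask` is assumed to be the OR of the identifiers of all currently registered analysis programs.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CachedRecord:
    access_flag: int   # bits set for the programs that have read the record
    arrived_at: float  # time the record entered the cache, in seconds


def should_remove(record: CachedRecord, registered_mask: int,
                  now: float, max_residence: float) -> bool:
    """Decide whether a record should be cleaned out of memory."""
    # Normal case: every registered program has already accessed the record.
    # Note: with no registered programs (mask 0) this is trivially true.
    if (record.access_flag & registered_mask) == registered_mask:
        return True
    # Abnormal case: some program has not read it; remove only if it has
    # stayed in memory longer than the allowed residence time.
    return (now - record.arrived_at) > max_residence


def sweep(cache: Dict[int, CachedRecord], registered_mask: int,
          now: float, max_residence: float) -> List[int]:
    """Delete removable records and return their IDs."""
    removable = [rid for rid, rec in cache.items()
                 if should_remove(rec, registered_mask, now, max_residence)]
    for rid in removable:
        del cache[rid]
    return removable


cache = {
    1: CachedRecord(access_flag=0b11, arrived_at=100.0),  # read by both programs
    2: CachedRecord(access_flag=0b01, arrived_at=100.0),  # still awaited, recent
    3: CachedRecord(access_flag=0b01, arrived_at=10.0),   # still awaited, stale
}
removed = sweep(cache, registered_mask=0b11, now=130.0, max_residence=60.0)
```

Record 1 is removed in the normal case, record 3 in the abnormal case after exceeding the threshold, and record 2 stays in memory to wait for the remaining program.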
Fig. 2 is a schematic diagram of the online processing system for streaming data of the present invention. As shown in Fig. 2, the system comprises:
An online in-memory cache layer construction module, configured to establish the online in-memory cache layer and, after attribute extraction, store the streaming data in it in a key-value structure;
A hybrid index construction module, configured to build a hybrid index structure over the streaming data in the in-memory cache layer;
An access flag construction module, configured to add an access flag to each indexed streaming data record, the flag indicating which analysis programs are registered for that record and recording each analysis program's access state for it;
An in-memory streaming data cleanup module, configured to clean up streaming data records that have been accessed by all of the designated analysis programs of the in-memory cache layer.
An online in-memory cache layer is added on top of the original central-database architecture. The added cache layer manages streaming data in memory and exposes data read/write services through a network interface. Adding the cache layer changes the data flow of the data processing system. On the one hand, collection programs write the collected streaming data into the in-memory cache, and analysis programs read streaming data from the cache for analysis. On the other hand, the in-memory cache periodically writes the streaming data held in memory to the database for persistent storage.
In the online in-memory cache, each streaming data record is stored as a key-value pair. For each record, the cache assigns a globally unique ID as the record's key, and the value stored under that key holds all attribute information of the record. All streaming data is stored as key-value pairs, and a record is uniquely identified by its key. On top of the key-value store, the invention builds a hybrid multi-index structure for streaming data, creating different types of index for different fields of each record. Some queries on the stored data are uniqueness (exact-match) queries on an attribute field, and some are range queries on a field. For fields with uniqueness queries, a hash index is built in memory, with the unique field as the index key; in the average case, a uniqueness query on the hash index finds the record in O(1) (constant) time. For attribute fields with range queries, a B+ tree index is built in memory; in the average case, a range query on the B+ tree index completes in O(log n) (logarithmic) time.
The Online Processing System also includes:
Stream data exits return module, for reading after stream data, checks the access flag of the stream data:
If the stream data has already been accessed by the analysis program, i.e. its flag bit is in the read state, the stream data is not returned to the analysis program. If the stream data has not been accessed by the analysis program, i.e. its flag bit is in the unread state, the flag bit of the stream data is set to the read state and the stream data is returned to the analysis program.
The present invention establishes, in memory, a dynamic registration and deregistration mechanism for application programs based on access control labels, providing highly scalable reading of stream data. For each stream data record in memory, the present invention adds a data access label. The data access label is a 32-bit integer, and each bit of the integer can represent one analysis program's use of the stream data. An analysis program must register with the memory cache, which allocates it a data access identifier, that is, one bit of the 32-bit integer is used to represent the registered analysis program. After registering successfully, the analysis program accesses and uses the stream data through this identifier. To reduce the network bandwidth consumed by duplicate data during stream data processing, no analysis program may access the same stream data repeatedly. When data is initialized in memory, every bit of the data access label of each stream data record is 0. After an application program has accessed a stream data record, the memory cache performs a bitwise OR of the record's data access label with the analysis program's data access identifier, and takes the result as the record's current data access label. Once an application program has accessed a given stream data record, it cannot access that record again.
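The registration and read-once behaviour described above can be illustrated with a minimal Python sketch. It is not the patented implementation: the names `AccessRegistry`, `Record` and `read` are hypothetical, and the rule for choosing which bit to assign (lowest free bit first) is an assumption, since the text only states that each registered program is represented by one bit of a 32-bit integer.

```python
class AccessRegistry:
    """Hypothetical registry: gives each analysis program one bit of a 32-bit label."""

    WIDTH = 32  # the text describes a 32-bit integer access label

    def __init__(self):
        self.registered = 0  # bitmask of the bits currently assigned to programs

    def register(self):
        """Return a one-bit identifier for a newly registered analysis program."""
        for bit in range(self.WIDTH):
            mask = 1 << bit
            if not self.registered & mask:  # assumption: lowest free bit wins
                self.registered |= mask
                return mask
        raise RuntimeError("all 32 access bits are in use")

    def deregister(self, mask):
        """Release a program's bit so that it can be assigned again."""
        self.registered &= ~mask


class Record:
    """One cached stream record; its access label starts with every bit at 0."""

    def __init__(self, payload):
        self.payload = payload
        self.access_flag = 0


def read(record, program_mask):
    """Return the payload on a program's first access, otherwise None."""
    if record.access_flag & program_mask:
        return None  # this program already read the record: do not return it
    record.access_flag |= program_mask  # bitwise OR marks the record as read
    return record.payload
```

One design question the sketch leaves open is what happens to a released bit: records already in the cache may still carry it, so a real system would need to clear that bit before handing it to another program.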
In the in-memory stream data cleaning module:
After an analysis program reads stream data from the memory cache layer, the access flag of the stream data is checked: if the stream data has been accessed by all registered analysis programs, the stream data is removed from the memory cache layer; otherwise, the residence time of the stream data is compared with a threshold. If the residence time does not exceed the threshold, the stream data continues to wait for access by analysis programs; if it exceeds the threshold, the stream data is removed from the memory cache layer.
That is, the present invention establishes an efficient in-memory data cleanup and exit mechanism that promptly cleans up stream data resident in memory and improves the availability of the data service. The cleanup mechanism for stream data in memory distinguishes two cases. In the normal case, the in-memory data cache checks the access control label of a stream data record; if it finds that every registered analysis program has already used the record, it starts the data cleanup process and deletes the record from memory, improving effective memory utilization. In the abnormal case, the in-memory data cache checks the access control label and finds that some analysis programs have not yet accessed the record; it then evaluates how long the record has been resident in memory. If the residence time exceeds the specified time threshold, the data cleanup process is started and the record is deleted from memory; if the residence time does not exceed the threshold, the record is left untouched and remains in memory.
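The two cleanup cases above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names `should_evict` and `sweep`, the representation of the cache as a dictionary of `(access_flag, arrived_at)` pairs, and the use of seconds for the residence threshold are all hypothetical and do not come from the patent text.

```python
import time


def should_evict(access_flag, registered_mask, arrived_at, max_age_seconds, now=None):
    """Decide whether a cached stream record can be removed from memory."""
    now = time.time() if now is None else now
    # Normal case: the bit of every registered program is set in the label.
    if (access_flag & registered_mask) == registered_mask:
        return True
    # Abnormal case: some program has not read the record yet, so it is
    # removed only if it has stayed in memory longer than the threshold.
    return (now - arrived_at) > max_age_seconds


def sweep(cache, registered_mask, max_age_seconds, now=None):
    """Delete evictable entries from a dict of key -> (access_flag, arrived_at)."""
    expired = [
        key
        for key, (flag, arrived_at) in cache.items()
        if should_evict(flag, registered_mask, arrived_at, max_age_seconds, now)
    ]
    for key in expired:
        del cache[key]
```

Note that with no registered programs (`registered_mask == 0`) this sketch treats every record as fully consumed; whether that is the desired behaviour is a design choice the text does not address.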
Of course, the present invention may have various other embodiments. Without departing from the spirit and essence of the present invention, those skilled in the art may make various corresponding changes and modifications according to the present invention, but all such changes and modifications shall fall within the protection scope of the claims of the present invention.

Claims (10)

1. An online processing method for stream data, characterized by comprising:
Step 1: establishing an online memory cache layer, and, after performing attribute extraction on the stream data, storing the stream data in the online memory cache layer according to a key-value structure;
Step 2: establishing a hybrid index structure for the stream data in the memory cache layer;
Step 3: adding an access flag to each stream data record for which the index structure has been established, the flag being used to indicate the registration status of different analysis programs with respect to the stream data, and at the same time recording the state of each analysis program's access to the stream data;
Step 4: data cleanup: if a stream data record has been accessed by all the analysis programs specified in the memory cache layer, performing a cleanup operation on the stream data record.
2. The online processing method according to claim 1, characterized in that the online processing method further comprises a dynamic registration step:
After an analysis program reads stream data from the memory cache layer, the access flag of the stream data is checked:
If the stream data has been accessed by the analysis program, i.e. the flag bit is in the read state, the stream data is not returned to the analysis program;
If the stream data has not been accessed by the analysis program, i.e. the flag bit is in the unread state, the stream data is returned to the analysis program and the flag bit of the stream data is set to the read state.
3. The online processing method according to claim 1, characterized in that the key-value structure in step 1 is established as follows: for each stream data record, the memory cache layer assigns a unique ID as the key of the record, and the entry stored under that key records the information of all attributes of the stream data.
4. The online processing method according to claim 1, characterized in that the hybrid index structure in step 2 is established by combining the key-value structure, a B+ tree index structure and a hash index structure.
5. The online processing method according to claim 1, characterized in that step 2 comprises:
Determining whether the stream data in the online cache layer needs to be queried by field:
If a query by field is needed: if a range query on the current attribute is needed, a B+ tree index structure is built on that attribute field; if a primary-key query on the current attribute is needed, a hash index structure is built on that attribute field;
If no query by field is needed, no index structure is built on that attribute field.
6. The online processing method according to claim 1, characterized in that, in step 3: the access flag is a 32-bit integer, each bit of which can represent one analysis program's access state for the stream data; when the stream data in memory is initialized, every bit of the access flag of each stream data record is 0;
When an analysis program registers with the memory cache layer, the memory cache layer allocates it an access identifier; after an analysis program accesses a stream data record, the memory cache layer performs a bitwise operation on the access flag of the stream data record and the access identifier of the analysis program, and takes the result as the current access flag of the stream data record.
7. The online processing method according to claim 1, characterized in that, in step 4:
After stream data is read, the access flag of the stream data is checked:
If the stream data has been accessed by all registered analysis programs, the stream data is removed from the memory cache layer;
Otherwise, the residence time of the stream data is compared with a threshold: if the residence time does not exceed the threshold, the stream data continues to wait for access by analysis programs; if it exceeds the threshold, the stream data is removed from the memory cache layer.
8. An online processing system for stream data, characterized by comprising:
An online memory cache layer construction module, configured to establish an online memory cache layer and, after performing attribute extraction on the stream data, to store the stream data in the online memory cache layer according to a key-value structure;
A hybrid index structure establishment module, configured to establish a hybrid index structure for the stream data in the memory cache layer;
An access flag construction module, configured to add an access flag to each stream data record for which the index structure has been established, the flag being used to indicate the registration status of different analysis programs with respect to the stream data, and at the same time to record the state of each analysis program's access to the stream data;
An in-memory stream data cleaning module, configured to perform a cleanup operation on stream data that has been accessed by all the analysis programs specified in the memory cache layer.
9. The online processing system according to claim 8, characterized in that the online processing system further comprises:
A stream data exit and return module, configured to check the access flag of the stream data after the stream data is read: if the stream data has been accessed by the analysis program, i.e. the flag bit is in the read state, the stream data is not returned to the analysis program; if the stream data has not been accessed by the analysis program, i.e. the flag bit is in the unread state, the flag bit of the stream data is set to the read state and the stream data is returned to the analysis program.
10. The online processing system according to claim 8, characterized in that, in the in-memory stream data cleaning module:
After an analysis program reads stream data from the memory cache layer, the access flag of the stream data is checked: if the stream data has been accessed by all registered analysis programs, the stream data is removed from the memory cache layer; otherwise, the residence time of the stream data is compared with a threshold: if the residence time does not exceed the threshold, the stream data continues to wait for access by analysis programs; if it exceeds the threshold, the stream data is removed from the memory cache layer.
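The per-attribute index choice recited in claims 4 and 5 can be illustrated with a minimal Python sketch. The class and function names are hypothetical. A Python dictionary stands in for the hash index, and a sorted list searched with `bisect` stands in for the B+ tree; the stand-in supports the same range lookups but is not a B+ tree.

```python
import bisect


class HashIndex:
    """Exact-match index on one attribute (a dict stands in for the hash index)."""

    def __init__(self):
        self._buckets = {}

    def add(self, value, record_id):
        self._buckets.setdefault(value, []).append(record_id)

    def lookup(self, value):
        return list(self._buckets.get(value, []))


class SortedIndex:
    """Range index on one attribute (a sorted list stands in for the B+ tree)."""

    def __init__(self):
        self._entries = []  # (value, record_id) pairs kept in sorted order

    def add(self, value, record_id):
        bisect.insort(self._entries, (value, record_id))

    def range(self, low, high):
        """Return the ids of records whose value lies in [low, high]."""
        start = bisect.bisect_left(self._entries, (low,))
        ids = []
        for value, record_id in self._entries[start:]:
            if value > high:
                break
            ids.append(record_id)
        return ids


def build_index(query_kind):
    """Choose an index for one attribute field.

    "range" -> range index, "key" -> hash index, None -> no index at all.
    """
    if query_kind == "range":
        return SortedIndex()
    if query_kind == "key":
        return HashIndex()
    return None
```

For example, a timestamp attribute that is queried by interval would get `build_index("range")`, a user ID attribute that is looked up exactly would get `build_index("key")`, and an attribute that is never queried would get no index.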
CN201210510056.2A 2012-12-03 2012-12-03 A kind of on-line processing method and system towards stream data Active CN103853766B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210510056.2A CN103853766B (en) 2012-12-03 2012-12-03 A kind of on-line processing method and system towards stream data

Publications (2)

Publication Number Publication Date
CN103853766A CN103853766A (en) 2014-06-11
CN103853766B true CN103853766B (en) 2017-04-05

Family

ID=50861433

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210510056.2A Active CN103853766B (en) 2012-12-03 2012-12-03 A kind of on-line processing method and system towards stream data

Country Status (1)

Country Link
CN (1) CN103853766B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104572973A (en) * 2014-12-31 2015-04-29 上海格尔软件股份有限公司 High-performance memory caching system and method
CN104657467B (en) * 2015-02-11 2017-09-05 南京国电南自维美德自动化有限公司 A kind of data-pushing framework with subscription/publication of real-time internal memory database
CN105242971B (en) * 2015-10-20 2019-02-22 北京航空航天大学 Memory object management method and system towards Stream Processing system
CN106911589B (en) 2015-12-22 2020-04-24 阿里巴巴集团控股有限公司 Data processing method and equipment
CN106506254B (en) * 2016-09-20 2019-04-16 北京理工大学 A kind of bottleneck node detection method of extensive stream data processing system
CN106959928B (en) * 2017-03-23 2019-08-13 华中科技大学 A kind of stream data real-time processing method and system based on multi-level buffer structure
CN110120959B (en) * 2018-02-05 2023-04-07 北京京东尚科信息技术有限公司 Big data pushing method, device, system, equipment and readable storage medium
CN110609707B (en) * 2018-06-14 2021-11-02 北京嘀嘀无限科技发展有限公司 Online data processing system generation method, device and equipment
CN110532072A (en) * 2019-07-24 2019-12-03 中国科学院计算技术研究所 Distributive type data processing method and system based on Mach
CN110532263A (en) * 2019-08-08 2019-12-03 杭州广立微电子有限公司 A kind of integrated circuit test system and its data base management system towards column
CN110990059B (en) * 2019-11-28 2021-11-19 中国科学院计算技术研究所 Stream type calculation engine operation method and system for tilt data
CN112035528B (en) * 2020-09-11 2024-04-16 中国银行股份有限公司 Data query method and device

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102495838A (en) * 2011-11-03 2012-06-13 成都市华为赛门铁克科技有限公司 Data processing method and data processing device
CN102542057A (en) * 2011-12-29 2012-07-04 北京大学 High dimension data index structure design method based on solid state hard disk
CN102567434A (en) * 2010-12-31 2012-07-11 百度在线网络技术(北京)有限公司 Data block processing method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7689602B1 (en) * 2005-07-20 2010-03-30 Bakbone Software, Inc. Method of creating hierarchical indices for a distributed object system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
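Research and Design of a Streaming Database System (流式数据库系统的研究与设计); Zhang Lingdong (张玲东); China Master's Theses Full-text Database, Information Science and Technology (《中国优秀硕士学位论文全文数据库信息科技辑》); 20050915 (No. 05); full text *
The Current State of Stream Data Mining and Research Trends in Statistics (流式数据挖掘的现状及统计学的研究趋势); Zhu Jianping et al. (朱建平等); Statistical Research (《统计研究》); 20070731; Vol. 24 (No. 7); pp. 84-87 *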

Also Published As

Publication number Publication date
CN103853766A (en) 2014-06-11

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20140611

Assignee: Branch DNT data Polytron Technologies Inc

Assignor: Institute of Computing Technology, Chinese Academy of Sciences

Contract record no.: 2018110000033

Denomination of invention: Online processing method and system oriented to streamed data

Granted publication date: 20170405

License type: Common License

Record date: 20180807