CN106326317A - Data processing method and device - Google Patents
- Publication number
- CN106326317A, CN201510400093.1A
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/2228—Indexing structures
Abstract
The invention discloses a data processing method and a device. The method comprises the steps of carrying out data screening processing on user behavior data, and forming a data screening result; on the basis of the data screening result, dividing the user behavior data into summary data meeting the query requirement and system data meeting the data analyzing and processing requirement, wherein the summary data belongs to a user list set, and the system data and the summary data both belong to a system data set; on the basis of the system data set, generating associated fields for querying the system data set, wherein the associated fields belong to a user detail set.
Description
Technical field
The present invention relates to the field of data processing, and in particular to a data processing method and device.
Background
With the development of information technology and electronic technology, the concept and use of big data have emerged. Big data makes data sharing easier to achieve. However, current data processing in the prior art still suffers from numerous problems, such as slow data queries, low data processing efficiency, and heavy consumption of data processing resources.
For example, in data processing based on an HBase database, queries by row key (RowKey) are fast and efficient, but queries by anything other than the row key are usually slow and inefficient. HBase is a distributed, open-source, column-oriented database. Unlike a typical relational database, it is suited to storing unstructured data. Another difference is that HBase is column-based rather than row-based.
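The contrast between row-key and non-row-key queries can be illustrated with a minimal Python sketch. It does not use HBase: a plain dictionary stands in for a table keyed by RowKey, and the table contents and the names `get_by_row_key` and `scan_by_column` are hypothetical, chosen only for illustration.

```python
# A plain dict stands in for a table keyed by RowKey (an assumption for
# illustration; this is not the HBase client API).
table = {
    "user1|20150701": {"site": "a.example", "bytes": 1200},
    "user2|20150701": {"site": "b.example", "bytes": 300},
    "user1|20150702": {"site": "c.example", "bytes": 4500},
}


def get_by_row_key(row_key):
    """Row-key query: a single keyed lookup."""
    return table.get(row_key)


def scan_by_column(column, value):
    """Non-row-key query: every row has to be examined."""
    return [key for key, row in table.items() if row.get(column) == value]
```

The keyed lookup touches one entry, while the column scan visits every row, which is the cost the rest of the document tries to avoid for common queries.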
Therefore, providing a data processing method with high data processing efficiency and fast queries is a problem in urgent need of a solution in the prior art.
Summary of the invention
In view of this, embodiments of the present invention aim to provide a data processing method and device that can at least partially solve the problem of low data processing efficiency or slow queries.
To achieve the above purpose, the technical solution of the present invention is implemented as follows. A first aspect of the embodiments of the present invention provides a data processing method, the method comprising:
performing data screening on user behavior data to form a data screening result;
based on the data screening result, dividing the user behavior data into summary data that meets the query requirement and system data that meets the data analysis and processing requirement, wherein the summary data belongs to a user list set, and both the system data and the summary data belong to a system data set;
generating, based on the system data set, associated fields for querying the system data set, wherein the associated fields belong to a user detail set.
Based on the above solution, the method further comprises:
establishing, based on the system data set, a main table associated with the associated fields.
Based on the above solution, the user list set further includes a summary field;
wherein the summary field has a mapping relationship with the summary data and can be used to query the summary data.
Based on the above solution, the summary field includes a user ID and a query time.
Based on the above solution, the method further comprises:
establishing an index table based on the user list set and the user detail set;
wherein the query indexes of the index table include the associated fields and the summary field.
Based on the above solution, the index table further includes the summary data.
Based on the above solution, the method further comprises:
receiving a query tag formed from user input;
matching the query tag against the fields in the index table;
if the query tag matches the summary field, querying the summary data based on the summary field and returning the summary data;
if the query tag matches an associated field, querying the main table based on the associated field and returning the query result.
A second aspect of the embodiments of the present invention provides a data processing device, the device comprising:
a screening unit, configured to perform data screening on user behavior data to form a data screening result;
a division unit, configured to divide, based on the data screening result, the user behavior data into summary data that meets the query requirement and system data that meets the data analysis and processing requirement, wherein the summary data belongs to a user list set, and both the system data and the summary data belong to a system data set;
a generation unit, configured to generate, based on the system data set, associated fields for querying the system data set, wherein the associated fields belong to a user detail set.
Based on the above solution, the device further comprises:
a first establishing unit, configured to establish, based on the system data set, a main table associated with the associated fields.
Based on the above solution, the user list set further includes a summary field;
wherein the summary field has a mapping relationship with the summary data and can be used to query the summary data.
Based on the above solution, the summary field includes a user ID and a query time.
Based on the above solution, the device further comprises:
a second establishing unit, configured to establish an index table based on the user list set and the user detail set;
wherein the query indexes of the index table include the associated fields and the summary field.
Based on the above solution, the index table further includes the summary data.
Based on the above solution, the device further comprises:
a receiving unit, configured to receive a query tag formed from user input;
a matching unit, configured to match the query tag against the fields in the index table;
a first query unit, configured to, if the query tag matches the summary field, query the summary data based on the summary field and return the summary data;
a second query unit, configured to, if the query tag matches an associated field, query the main table based on the associated field and return the query result.
The data processing method and device of the embodiments of the present invention form three data sets: the user list set, the user detail set, and the system data set. The data that users usually query is placed in the user list set, so for ordinary queries the data in the user list set is smaller than the full user behavior data, which reduces the amount of data to search and so speeds up queries. A user detail set is formed at the same time; the associated fields in it make it possible to query the rarely queried data in the system data set, and the data in the system data set is convenient for system analysis and processing. Practice has shown that data redundancy is small and can be controlled as needed by adjusting which data each set includes, which reduces the storage and system operation and maintenance resources that the data consumes.
Brief description of the drawings
Fig. 1 is a schematic flowchart of a data processing method according to an embodiment of the present invention;
Fig. 2 is a partial schematic flowchart of a data processing method according to an embodiment of the present invention;
Fig. 3 is a first schematic structural diagram of the data processing device according to an embodiment of the present invention;
Fig. 4 is a second schematic structural diagram of the data processing device according to an embodiment of the present invention;
Fig. 5 is a schematic flowchart of data division in the data processing method according to an embodiment of the present invention;
Fig. 6 is a schematic diagram of the relationship between the three data sets according to an embodiment of the present invention;
Fig. 7 is a schematic diagram of the roles of the main table and the index table according to an embodiment of the present invention.
Detailed description
The technical solution is explained in further detail below with reference to the accompanying drawings and specific embodiments.
Method embodiment:
As shown in Fig. 1, this embodiment provides a data processing method, the method comprising:
Step S110: performing data screening on user behavior data to form a data screening result;
Step S120: based on the data screening result, dividing the user behavior data into summary data that meets the query requirement and system data that meets the data analysis and processing requirement, wherein the summary data belongs to a user list set, and both the system data and the summary data belong to a system data set;
Step S130: generating, based on the system data set, associated fields for querying the system data set, wherein the associated fields belong to a user detail set.
In step S110, the data screening may be performed according to data processing requirements. Data that needs to meet ordinary users' query requirements should generally be grouped into one class. Here, meeting the user query requirement may include meeting the user's requirement for real-time queries of user behavior data that occurred within a specified time. The specified time may be a period extending back from the current time; for example, the data may be the user behavior data of the most recent month. The system data in the system data set may be data that users are less likely to query. For example, according to query statistics, data whose query probability is below a threshold, or whose query probability ranks low, is assigned to the system data set as system data.
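The screening rule described above can be sketched as follows. This is only an illustration under stated assumptions: the record layout, the `screen` function, the 30-day window, and the 0.1 probability threshold are all hypothetical and are not taken from the source.

```python
from datetime import datetime, timedelta


def screen(records, now, window_days=30, min_query_probability=0.1):
    """Split records into summary data and system-only data.

    A record counts as summary data when it falls inside the recent
    window and its observed query probability reaches the threshold.
    Both thresholds are illustrative assumptions.
    """
    cutoff = now - timedelta(days=window_days)
    summary, system_only = [], []
    for rec in records:
        recent = rec["time"] >= cutoff
        likely = rec["query_probability"] >= min_query_probability
        (summary if recent and likely else system_only).append(rec)
    return summary, system_only
```

In the scheme described here, the system data set would hold both lists together, while only the first list would go into the user list set.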
Note that the data screening in step S110 can be regarded as data analysis and abstraction performed according to data storage rules before the data is stored. In physical storage, the data of the different sets may be stored together; logically, however, the data belongs to different sets. The sets here may include the user list set, the system data set, and so on. The logical division of stored data can be implemented by means such as data pointers and data labels.
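As a minimal sketch of logical division by data labels, assume each stored record carries a set of labels naming the sets it belongs to. The record layout, the label names, and the `members` helper are hypothetical.

```python
# Physical storage: one list holds all records (hypothetical data).
# Each record carries labels naming the logical sets it belongs to.
storage = [
    {"id": 1, "labels": {"user_list", "system"}},
    {"id": 2, "labels": {"system"}},
    {"id": 3, "labels": {"user_list", "system"}},
]


def members(label):
    """Return the IDs of the records that logically belong to a set."""
    return [rec["id"] for rec in storage if label in rec["labels"]]
```

Every record is stored once, and set membership is decided only by the labels, so the user list set stays a logical subset of the system data set.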
Take data stored in an HBase database as an example. The main table stores P1 columns of data, and step S110 screens out P2 columns as the summary data, where P2 is a positive integer less than P1. Dividing the data in step S120 also includes a step of generating the query index for the summary data. In essence, the summary data is equivalent to a queryable table; querying this table requires a query index, and the summary data can be obtained through that index. For example, each row of the P2 columns of data corresponds to a query index, which in HBase is called the RowKey.
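The P1-to-P2 column selection can be pictured as a projection that keeps the RowKey as the query index. The sketch below uses plain dictionaries rather than HBase; the column names, the sample rows, and `build_summary` are assumptions made for illustration.

```python
# Main table: RowKey -> full row with P1 = 4 columns (hypothetical data).
main_table = {
    "user1|20150701": {"site": "a.example", "bytes": 1200,
                       "protocol": "HTTP", "ip": "203.0.113.5"},
    "user2|20150701": {"site": "b.example", "bytes": 300,
                       "protocol": "HTTPS", "ip": "203.0.113.9"},
}

SUMMARY_COLUMNS = ("site", "bytes")  # P2 = 2, less than P1 = 4


def build_summary(main, columns):
    """Project each row onto the summary columns, keeping the RowKey as index."""
    return {key: {c: row[c] for c in columns} for key, row in main.items()}


summary_table = build_summary(main_table, SUMMARY_COLUMNS)
```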
Users who query data can be divided into multiple classes, for example two classes. The summary data in step S120 may be data that meets the query requirements of first-class users. Here, ordinary users are the first-class users. First-class users usually have lower permissions: some data may not be open to them, or they may not be interested in some data and so never need to query it. That data includes the system data. For example, suppose the user behavior includes the Internet access behavior of user A. User A is likely to be interested in data such as which web pages they visited and the data traffic generated by visiting those pages, but is unlikely to be interested in the communication protocol used by the visited pages, the resource IP address, and so on.
In short, the user behavior data is divided into summary data and system data according to a predetermined data division strategy. The system data is data that meets the system analysis requirement.
To speed up users' summary queries, in this embodiment step S110 separates the system data, which users are not interested in or are not allowed to query, from the summary data. When a user queries data, the query then does not have to run over all of the user behavior data, which reduces the amount of comparison and matching and so speeds up the query.
To meet the analysis and processing requirement, in this embodiment both the summary data and the system data belong to the system data set. This makes it convenient for a data analysis device to analyze the data and preserves processing efficiency when the system needs to run data analysis. Meanwhile, to make it convenient to query all of the user behavior data, this embodiment also introduces the user detail set. Associated fields are formed and stored in the user detail set, and these associated fields can serve as query indexes for each item of user behavior data in the system data set.
In this way, the query requirement for system data is met as well, and practice has shown that queries using this data processing structure also complete at the millisecond level.
As a further improvement of this embodiment, the method further comprises:
establishing, based on the system data set, a main table associated with the associated fields.
The main table in this embodiment may include every item of user behavior data. The main table may be a table organized by time, and it is usually updated at a predetermined incremental time period.
The associated fields are associated with the main table: a query on the main table using an associated field returns the corresponding data. This makes querying data in the main table simple, and in particular makes it convenient for second-class users to query the results of system analysis in the main table. For example, analyzing the user behavior data produces an analysis result, which may be the number of visits to certain websites within a specified time. This data may involve commercially confidential information and is not open to first-class users, who have lower permissions, but is open to second-class users (such as website analysis and maintenance staff). Data produced by system analysis on the basis of user behavior can then be queried through the associated fields in the user detail set.
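One way to picture an associated field is as a secondary index that maps a field value to the RowKeys of the matching main-table rows. The sketch below is an assumption-laden illustration: the rows, `build_associated_index`, and `visit_count` are hypothetical, and the source does not prescribe this layout.

```python
from collections import defaultdict

# Main table rows keyed by RowKey (hypothetical data).
main_table = {
    "r1": {"user": "u1", "site": "a.example", "month": "2015-06"},
    "r2": {"user": "u2", "site": "a.example", "month": "2015-06"},
    "r3": {"user": "u1", "site": "b.example", "month": "2015-06"},
}


def build_associated_index(main, field):
    """Map each value of `field` to the RowKeys of the rows that hold it."""
    index = defaultdict(list)
    for row_key, row in main.items():
        index[row[field]].append(row_key)
    return dict(index)


def visit_count(index, site):
    """Example analysis query: number of recorded visits to a site."""
    return len(index.get(site, []))


site_index = build_associated_index(main_table, "site")
```

A second-class user's question such as "how many visits did a.example receive" is then answered from the index without scanning the main table.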
As a further improvement of this embodiment, the user list set further includes a summary field;
wherein the summary field has a mapping relationship with the summary data and can be used to query the summary data. The summary field includes a user ID and a query time. The user ID is any information that can identify a user, such as a user account or a user identity serial number. The query time is the time range that the user specifies for the query, usually the current time or a moment or period before the current time.
Different users correspond to different user IDs. The query time can be understood as the period in which the user behavior data that the user wants to query was produced. The user IDs here include at least first-class user IDs.
It follows that in this embodiment the data in the user list set is organized by user ID: each user's behavior is recorded and produces user behavior data available for query, such as when websites were visited, which websites were visited, the data traffic generated by visiting them, and the traffic type of that traffic. The traffic type may include second-generation (2G), third-generation (3G), or fourth-generation (4G) mobile communication traffic.
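A lookup by summary field (user ID plus query time) might look like the following sketch. The record fields, the sample data, and `query_summary` are hypothetical names used only to illustrate the mapping between summary field and summary data.

```python
from datetime import date

# User list set: summary data grouped by user ID (hypothetical records).
user_list_set = {
    "u1": [
        {"day": date(2015, 6, 1), "site": "a.example", "bytes": 1200, "traffic": "4G"},
        {"day": date(2015, 6, 20), "site": "b.example", "bytes": 300, "traffic": "3G"},
    ],
    "u2": [
        {"day": date(2015, 6, 5), "site": "a.example", "bytes": 50, "traffic": "2G"},
    ],
}


def query_summary(user_id, start, end):
    """Look up summary data by user ID and an inclusive query time range."""
    rows = user_list_set.get(user_id, [])
    return [r for r in rows if start <= r["day"] <= end]
```

Because the data is grouped by user ID, only one user's records are examined before the time range is applied.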
The summary field is equivalent to an index for querying the summary data. When the data processing device receives a summary field, it can find the corresponding summary data by operations such as matching on the summary field. In a specific implementation, to manage the summary fields and reduce index failures, the summary field usually also belongs to the user detail set, and the user detail set manages and maintains the indexes used for data queries (both summary fields and associated fields) in a unified way.
The method further comprises:
establishing an index table based on the user list set and the user detail set;
wherein the query indexes of the index table include the associated fields and the summary field.
In this embodiment, the method also involves an index table. According to the foregoing solution, data can be queried using the summary field and the associated fields. To manage these query fields in a unified way and avoid multiple index tables and multiple index types, this embodiment stores both the associated fields and the summary fields in the index table. This reduces the number of index types, avoids confusion among index types and the large storage footprint of index storage, and reduces data redundancy.
The index table further includes the summary data.
In this embodiment the index table directly includes the summary data, so the summary data in the index table can be queried directly with the summary field; in this sense, the summary field is equivalent to a non-associated index. An associated field in the index table can also be linked to the main table to query data there, and is equivalent to an associated index. This achieves unified management and processing of non-associated and associated indexes.
Summary data can also be reused as an associated field. For example, a first-class user may query the websites they visited earlier, while a second-class user may query information such as the cumulative number of visits to websites that first-class users visited. The websites that a first-class user visited are data available for that user to query and are stored in the user list set as summary data. At the same time, those websites also serve as associated fields and are stored in the user detail set. To reduce data redundancy further in this case, the data in the user detail set and the user list set can together form the index table described above. Some entries in the index table serve both as summary data and as associated fields, which minimizes data redundancy and reduces the resources consumed by data storage and maintenance.
In a specific implementation, each row element in the index table may include at least one summary field and one associated field. Each row element in the index table may also include an attribute indicating whether the row serves as a main-table association query parameter. If the value of this attribute is false, the query index in the row element is a summary field, and the summary data corresponding to that summary field can be returned directly; usually the summary data is also stored in the row element. If the value of this attribute is true, the query index of the row element is an associated field, which can be used to link to the main table for the query. In addition, the index table may include an associated-element-count attribute: a count of 0 also indicates a summary field, and any other value indicates an associated field.
In this embodiment, every item of data or field in the index table can serve as an index for user queries. Whether a given field is a summary field or an associated field can be indicated by a parameter set to distinguish the two kinds of fields, such as the associated element count or the is-associated attribute.
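The row layout described above could be modeled as follows. This is a sketch, not the patented structure: the `IndexRow` class, its field names, and the `kind` helper are hypothetical, and treating either signal as sufficient is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IndexRow:
    key: str                        # the query index itself
    is_associated: bool             # True: follow the key into the main table
    summary: Optional[dict] = None  # summary data stored inline, if any
    associated_count: int = 0       # number of associated main-table elements


def kind(row: IndexRow) -> str:
    """Classify a row; either signal marks it as an associated field."""
    if row.is_associated or row.associated_count > 0:
        return "associated"
    return "summary"
```

For example, `kind(IndexRow("u1|2015-06", False, {"sites": 2}))` classifies the row as a summary field, while a row with `is_associated=True` is treated as an associated field.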
As shown in Fig. 2, the method further comprises:
Step S210: receiving a query tag formed from user input;
Step S220: matching the query tag against the fields in the index table;
Step S230: if the query tag matches the summary field, querying the summary data based on the summary field and returning the summary data;
Step S240: if the query tag matches an associated field, querying the main table based on the associated field and returning the query result.
The query tag may be a query index entered by the user, or data that the electronic device generates by combining the user's input with information such as the user ID and that can be matched against the data in the user detail set.
This embodiment thus provides a data query method with a unified data query interface. By matching the query tag against the fields in the index table, it can return summary data quickly in the non-associated way and return data from the main table in the associated way. It features efficient data processing, low data redundancy, and simple index management.
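Steps S210 to S240 can be sketched as a single lookup function over an index table. Everything here is an illustrative assumption: the table contents, the `associated` flag name, the `row_keys` list, and the `query` function are hypothetical, and exact-match lookup stands in for whatever matching the device performs.

```python
# Index table: query index -> row. Rows flagged as associated point at
# main-table RowKeys; other rows carry their summary data inline.
index_table = {
    "u1|2015-06": {"associated": False,
                   "summary": {"sites": ["a.example", "b.example"]}},
    "a.example": {"associated": True, "row_keys": ["r1", "r2"]},
}
main_table = {
    "r1": {"user": "u1", "site": "a.example"},
    "r2": {"user": "u2", "site": "a.example"},
}


def query(tag):
    """Steps S210 to S240: receive a tag, match it, then answer by kind."""
    row = index_table.get(tag)  # S220: match the tag against index fields
    if row is None:
        return None             # no field matched the tag
    if not row["associated"]:
        return row["summary"]   # S230: return the summary data directly
    # S240: follow the associated field into the main table
    return [main_table[k] for k in row["row_keys"]]
```

The caller uses one entry point for both kinds of query, which mirrors the unified query interface described above.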
Device embodiment:
As shown in Fig. 3, this embodiment provides a data processing device, the device comprising:
a screening unit 110, configured to perform data screening on user behavior data to form a data screening result;
a division unit 120, configured to divide, based on the data screening result, the user behavior data into summary data that meets the query requirement and system data that meets the data analysis and processing requirement, wherein the summary data belongs to a user list set, and both the system data and the summary data belong to a system data set;
a generation unit 130, configured to generate, based on the system data set, associated fields for querying the system data set, wherein the associated fields belong to a user detail set.
The data processing device in this embodiment may correspond to any form of electronic device capable of data processing, such as a desktop computer, notebook computer, server, or server platform.
The structures corresponding to the screening unit 110, the division unit 120, and the generation unit 130 may include a processing structure and a storage medium. The processing structure may include a processor or a processing circuit. The processor may include structures such as an application processor (AP), a digital signal processor (DSP), a programmable array (PLC), or a microprocessor (MCU). The processing circuit may include an application-specific integrated circuit (ASIC). The storage medium can be connected to the processing structure through a connection structure such as a bus interface. The storage medium stores executable code, and the processing structure implements the functions of the above units by executing that code.
Any two of the above units may correspond to different processing structures, or may be integrated in the same processing structure. When one processing structure integrates multiple units, it completes the operations of the different units by time-division multiplexing or concurrent threads.
For the distinction between the summary data and the system data in this embodiment, see the corresponding method embodiment; it is not repeated here.
As shown in Fig. 4, the device further comprises:
a first establishing unit 130, configured to establish, based on the system data set, a main table associated with the associated fields.
In this embodiment the first establishing unit may include a storage medium used to store the main table. The first establishing unit forms the main table from the data in the system data set according to a main-table establishment function or strategy.
The main table is associated with the associated fields, and this association allows second-class users to query the data in the main table.
The user list set further includes a summary field, wherein the summary field has a mapping relationship with the summary data and can be used to query the summary data.
In this embodiment the user list set directly includes the summary data that first-class users can query and the summary field. Through the summary field, the data processing device can find the summary data quickly, at the millisecond level.
In this embodiment, the summary field includes the user ID and the query time; the summary data is organized by user ID, which makes queries fast and convenient.
As a further improvement of this embodiment, the device further comprises: a second establishing unit 150, configured to establish an index table based on the user list set and the user detail set, wherein the query indexes of the index table include the associated fields and the summary field.
In this embodiment the structure of the second establishing unit 150 is similar to that of the first establishing unit, except that the second establishing unit 150 is a structure for establishing the index table. The index table includes the summary fields and the associated fields, so the indexes are managed in a unified way through the index table, which avoids problems such as index confusion.
The index table further includes the summary data. In this embodiment the summary data is maintained directly in the index table. When matching against the fields in the index table determines that the query tag matches a summary field, the summary data maintained in the index table is returned directly, which maximizes query speed. Compared with storing the summary fields and the summary data separately with a mapping between them, this minimizes data redundancy and reduces the storage and data maintenance resources that the data consumes.
As a further improvement of this embodiment, as shown in Fig. 4, the device further comprises:
a receiving unit 210, configured to receive a query tag formed from user input;
a matching unit 220, configured to match the query tag against the fields in the index table;
a first query unit 230, configured to, if the query tag matches the summary field, query the summary data based on the summary field and return the summary data;
a second query unit 240, configured to, if the query tag matches an associated field, query the main table based on the associated field and return the query result.
The receiving unit 210 in this embodiment may include a communication interface for receiving the query tag formed from user input. Typically, the tag for querying summary data may be the user ID plus the query event. If data in the main table needs to be queried, the tag is typically the query time plus the user ID, plus information such as service fields.
The first query unit 230 and the second query unit 240 may each correspond to a processing structure with an information search function; for a description of the processing structure, see the preceding sections. In short, when querying data the device of this embodiment uses associated and non-associated indexes together, so it can quickly find the data the user wants, with low data redundancy and low system resource usage.
Several specific examples are provided below in connection with any of the above embodiments.
Example 1:
Fig. 5 shows the data division flow based on the foregoing embodiments.
The user behavior data in HBase is divided into data that meets users' real-time query requirements and data that meets the system's processing requirements. The data that meets users' real-time query requirements is further divided into data belonging to the user list set and data belonging to the user detail set. Data that meets only the system's processing requirements all belongs to the system data set.
The user behavior data may be divided as follows:
Step one: abstract the demand for real-time queries of user behavior:
User behavior data is placed in HBase mainly because of the demand for real-time queries of user behavior. We therefore first carried out an abstract analysis of the characteristics of the data that users query in real time.
Because a user's attention span is limited, a user usually cannot attend to a large amount of data content at once, so users prefer to browse data in two steps. Following this behavior, we split data display into two steps: showing summary data, and showing detailed data.
Based on this display mode, we divide the data that users will query into two record sets: the user list set and the user detail set. The user list set corresponds to the user's demand to browse summaries of bulk data; the user detail set corresponds to the demand to query details. Both data sets are subsets of the whole user behavior data.
How are these two data sets screened from the total data? A user cares only about his or her own data, and probably only within a certain period of time. So both data sets are obtained by filtering first on the user identifier field and then on the time field.
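The two-stage screening just described (user identifier first, then time) can be illustrated with a short sketch. The record layout and the field names `user_id`, `time`, and `site` are assumptions made for this example, and the sample records are invented.

```python
from datetime import datetime


def select_user_records(records, user_id, start, end):
    """Filter by user identifier first, then by time (start inclusive, end exclusive)."""
    own = [r for r in records if r["user_id"] == user_id]
    return [r for r in own if start <= r["time"] < end]


# Invented sample records for illustration only.
records = [
    {"user_id": "u1", "time": datetime(2015, 7, 9, 10), "site": "a"},
    {"user_id": "u2", "time": datetime(2015, 7, 9, 11), "site": "b"},
    {"user_id": "u1", "time": datetime(2015, 7, 8, 9), "site": "c"},
]
one_day = select_user_records(
    records, "u1", datetime(2015, 7, 9), datetime(2015, 7, 10)
)
```

Both the user list set and the user detail set would pass through the same filter; they differ only in which fields of the surviving records are kept.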
Step two: abstract the demand for data analysis and processing:
We have just analyzed the demand for users' real-time queries, but on a big data sharing platform, user behavior data is not used only for users' real-time queries. It may also be analyzed and processed by the system internally.
The business characteristics of the system's internal data analysis and processing are analyzed below.
The scenarios of in-system data analysis and processing differ according to what is being analyzed, and different analyses are interested in different data columns (this is also the origin of columnar storage). In theory, therefore, data analysis and processing may need any field in the table. At the same time, analysis on a big data sharing platform is usually not concerned with the data of one particular user but with the data of a large number of users, and it is essentially always performed incrementally by time period. This holds whether the processing uses SQL with multi-table joins or MapReduce. For these characteristics of the system's internal analysis and processing demand, we derive a "system data set": this data set contains all data fields and is filtered from the total data by the time field.
Fig. 6 is a schematic diagram of the data distribution after division according to the method shown in Fig. 5. As Fig. 6 shows, all user behavior data may meet the system's analysis and processing demand, so all user behavior data belongs to the system data set. The data that meets the users' real-time query demand may include the data in the user list set and in the user detail set. The data in the user list set includes summary data and key fields; a key field is the query index used to query the summary data. In addition, to facilitate unified management of the indexes, the user detail set includes, besides the associated field used to query data in the system data set, the summary fields as well.
The following table compares the similarities and differences of the three data sets.
As the table above shows, the amount of data in the user list set may lie between that of the user detail set and that of the system data set, but both the user detail set and the user list set are subsets of the system data set. The search field is the index used to query the corresponding data set. In this example, the data in the user list set can be queried using a search field made up of the user identifier plus the time. The time here is the aforementioned query time.
The data structures are set up based on the above data division. Because the system data set includes all fields, it is suited to storing its data in a master table. The user detail set is characterized by returning the fewest records, so it is well suited to being implemented by building an associated index on the master table (the associated index here is the aforementioned associated field), because an HBase associated secondary index performs best, and occupies little space, when few records are returned. The user list set returns more records, so an HBase associated secondary index would not give the best performance; but because this set has far fewer fields than the other two sets, it is better suited to an HBase non-associated secondary index. That is, the index itself stores all the required summary data, and a query does no associated lookup against the master table. Performance is therefore higher, at the cost of some space; because this set has the fewest fields, this amounts to accepting a little redundancy in exchange for optimal performance.
With the above design, query performance is high under the various query demands; some data redundancy is introduced, but the amount is small.
As described above, one master table and two indexes are formed. The two indexes correspond to the summary field and the associated field of the foregoing embodiments: the summary field corresponds to the non-associated index, and the associated field corresponds to the associated index.
Next, the two indexes are merged. In the implementation, the content of the non-associated index includes the content of the associated index. The merging method is to use the index content of the user list set as the content of the merged index and to pass two parameters to the index: "whether to do an associated query against the master table" and "number of associated elements". When the first parameter is false, the index table data is returned directly.
In analysis and processing on a big data sharing platform, the data of one particular user is usually not the concern; the concern is the data of a large number of users, and the processing is essentially always incremental by time period, whether it uses SQL with multi-table joins or MapReduce.
The system data set mainly serves internal analysis and processing, and a master table can better solve the performance problem of the system's internal analysis and processing. The master table uses the time plus business fields as its query index; that is, the data in the master table is distributed in HBase by time. As a result, when the system performs internal analysis and processing, it can efficiently locate where the newly added data resides and fetch only the data that needs processing. This improves the speed of querying and extracting data during data processing.
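Why a time-prefixed row key lets the system find newly added data without a full scan can be shown with a small sketch. It models the master table's sorted row keys as a sorted Python list and uses binary search in place of an HBase range scan. The key layout (`time|user_account`), the delimiter, and the fixed-width `YYYYMMDDHH` timestamp are assumptions for the example; fixed width matters because it makes lexicographic order coincide with chronological order.

```python
import bisect


def master_row_key(ts, user_account):
    """Build a row key with the time first so rows sort chronologically (assumed layout)."""
    return f"{ts}|{user_account}"


def scan_time_range(sorted_keys, start_ts, end_ts):
    """Return the keys with start_ts <= time < end_ts using two binary searches."""
    lo = bisect.bisect_left(sorted_keys, start_ts)
    hi = bisect.bisect_left(sorted_keys, end_ts)
    return sorted_keys[lo:hi]


# Invented sample keys for illustration only.
keys = sorted(
    master_row_key(ts, user)
    for ts, user in [
        ("2015070910", "u2"),
        ("2015070811", "u1"),
        ("2015070923", "u1"),
        ("2015071001", "u3"),
    ]
)
new_rows = scan_time_range(keys, "20150709", "20150710")
```

Only the slice between the two search positions is read, which is the in-memory analogue of fetching just the data that needs processing.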
As shown in Fig. 7, the rows in the master table are organized in chronological order, while the rows in the index table are organized by user account. The user account is, of course, one kind of user identifier.
The index table includes the summary field and the summary data; in Fig. 7, the displayed "5-6" and "6-5xxx" can both serve as summary data mapped to the summary field. An index value is also shown; the index value here corresponds to the associated field in the foregoing embodiments and is used to associate with the master table in order to query data in the master table.
The master table stores various data, which also include the user account, the query time, and the business fields, made up of Fa and Fb, used for user queries and retrieval.
The index table concatenates the user ID, the timestamp, and the other summary fields into one character string that serves as the row key, with delimiters between the parts. When a user queries summary data, the user ID, the time range of the query, and the other query conditions are turned into a row key query, which quickly locates the position of the data. The index table data is then taken out directly: since the index table already includes all summary fields, the data is returned directly from the index table without an associated query against the master table. The user ID here is one kind of the aforementioned user identifier.
The whole process is thus in effect a row key range query on one index table, so it is efficient.
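The row key range query on the index table can be sketched in the same in-memory style. The delimiter `|`, the helper names `index_row_key` and `query_summary`, and the sample values are assumptions for illustration; the sketch also assumes that no field value contains the delimiter and that timestamps are fixed-width.

```python
import bisect

SEP = "|"  # delimiter between row key parts (assumed)


def index_row_key(user_id, ts, *summary_fields):
    """Concatenate user ID, timestamp, and the other summary fields into one row key."""
    return SEP.join([user_id, ts, *summary_fields])


def query_summary(sorted_keys, user_id, start_ts, end_ts):
    """Range query over sorted index keys for one user and one time range."""
    lo = bisect.bisect_left(sorted_keys, user_id + SEP + start_ts)
    hi = bisect.bisect_left(sorted_keys, user_id + SEP + end_ts)
    # The summary fields are part of the key, so no master table lookup is needed.
    return [key.split(SEP)[2:] for key in sorted_keys[lo:hi]]


# Invented sample index rows for illustration only.
index_keys = sorted([
    index_row_key("u1", "2015070910", "im", "site42"),
    index_row_key("u1", "2015070915", "weibo", "site7"),
    index_row_key("u1", "2015071009", "im", "site42"),
    index_row_key("u2", "2015070911", "im", "site3"),
])
rows = query_summary(index_keys, "u1", "20150709", "20150710")
```

Because the user ID is the key prefix, all of one user's rows are contiguous, and the time range narrows the scan further within that block.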
The index table is associated with the master table: the first half of an index value in the index table is identical to a master table row key, which logically associates the index table with the master table, and the second half of the index value is the other user summary data. The reason is that an index value in the index table has two uses: first, associating with the master table row key; second, carrying the user summary data. The index value thereby achieves the merging of the associated index and the non-associated index. In other words, the index value includes a first part and a second part; the first part is used for mapping to the summary data and the second part for mapping to the master table, so the first part corresponds to the summary field and the second part corresponds to the associated field. Such an index value can be used at the same time to query both the summary data and the data in the system data set. In this approach, the associated field and the summary field are distinguished by the parameters set as described above. This is equivalent to one index value corresponding to two fields: one indicating associated or non-associated, the other the number of associated elements. In this way, one copy of the data is stored and serves two purposes.
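One way to read the merged index is sketched below. It is an interpretation, not the disclosed implementation: it assumes that the "number of associated elements" counts the leading delimiter-separated parts of the index value that together form the master table row key, and that the remaining parts are the carried summary data. The names `read_index`, `associate`, and `assoc_count`, the delimiter, and the sample rows are hypothetical.

```python
SEP = "|"  # delimiter between parts of an index value (assumed)

# Invented sample tables for illustration only.
MASTER = {
    "2015070910|u1": {"url": "/news/1", "protocol": "HTTP", "bytes": 2048},
}
# index row key -> index value; the value begins with the master row key parts
INDEX = {
    "u1|2015070910": "2015070910|u1|im|site42",
}


def read_index(index_key, associate, assoc_count=2):
    """Use one stored index value either as a non-associated or an associated index."""
    parts = INDEX[index_key].split(SEP)
    if not associate:
        # Non-associated use: return the carried summary data from the index alone.
        return parts[assoc_count:]
    # Associated use: rebuild the master row key and fetch the full record.
    master_key = SEP.join(parts[:assoc_count])
    return MASTER[master_key]


summary = read_index("u1|2015070910", associate=False)
detail = read_index("u1|2015070910", associate=True)
```

The same stored value answers both kinds of query, which is the sense in which one copy of the data serves two purposes.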
When a user queries detailed data, the master table is reached through the index table. The whole process performs only (number of returned records × 2) row key queries. Because a detail query retrieves a small amount of data, albeit with many fields, the association still performs well.
In data processing and analysis scenarios, processing is usually not centered on a single user. It is essentially always located by time, for example incremental processing of the previous hour's data, so that newly added data is processed continuously.
To obtain data, the data analysis and processing system can run row key queries against the master table through engines such as Hive, Impala, or Spark, or it can use MapReduce to locate the data interval to be processed directly by rowkey, avoiding a full table scan.
Example two:
The present invention is introduced below through an example:
The users' Internet behavior data from the A interface is put into storage to support the business's demand for using the data. This data usually has two kinds of uses:
Users query their Internet access records.
The system analyzes users' Internet behavior using this data.
The metadata is as follows:
Defining the data sets
The fields are sorted by user attention, from high attention to low attention. The fields with high attention are selected as summary fields, and the summary data set is determined according to the summary fields; adding the remaining data that users care about gives the detailed data set; and the complete set is the data set for system analysis and processing. Attention here may be reflected in how frequently users query the data. The sorting result is as follows:
In the sorting result formed in the table above, the user account, start time, business number, business category number, and website name number serve as summary fields, and the data corresponding to these summary fields serves as summary data. The data fields in the table above from user account through application layer protocol serve as the user detailed data set. The other system fields belong to the system data set.
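The field selection described above can be sketched as a ranking by query frequency. The field names, the counts, the cutoff `summary_size`, and the rule that a field belongs to the detail set when users query it at all are assumptions made for this example; the source states only that fields are ranked by attention and that attention may be reflected in query frequency.

```python
def define_data_sets(query_counts, summary_size):
    """Rank fields by how often users query them and derive the three field sets."""
    ranked = sorted(query_counts, key=lambda f: query_counts[f], reverse=True)
    summary_fields = ranked[:summary_size]                      # highest attention
    detail_fields = [f for f in ranked if query_counts[f] > 0]  # anything users query
    system_fields = ranked                                      # the complete set
    return summary_fields, detail_fields, system_fields


# Invented query frequencies for illustration only.
counts = {
    "user_account": 900,
    "start_time": 850,
    "website_id": 600,
    "app_protocol": 40,
    "probe_id": 0,
}
summary, detail, system = define_data_sets(counts, 3)
```

With these inputs the summary fields are a subset of the detail fields, which in turn are a subset of the system fields, matching the nesting of the three data sets.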
Defining the master table
The master table uses time plus business fields as its row key; in this example, the start time and the user account are joined to form the key of the master table. (The time is fixed as the prefix of the row key, rowkey; the remainder can be designed according to business needs.) All data is stored in the master table.
Defining the index table
The fields in the user list set are joined together to build the index. The fields in the user list set may include: user account, business number, business category number, start time, website name number.
The index table uses the user account as its prefix and serves user queries. When a user queries data, the user's account is used for fast matching.
When a user queries Internet access records.
When a user queries his or her Internet access records for a certain time period, the query uses the user account and the time period (the time period here corresponds to the aforementioned query time), together with a selected business type (instant messaging, microblog, etc.); what is queried is the summary part of the Internet access records. The user can see the corresponding summary data; in this example the user can see information such as which software (BusinessType_ID) was used to visit which website (WebsiteID). There may be a lot of this data, so it may be paged. Internally, the system completes the query quickly by row key matching on the HBase secondary index. Because the master table is not involved, a return within a few seconds can be ensured even when the query returns tens of thousands of records.
When a user queries detailed data, the user, having seen the summary data, selects the detailed data to inspect further. Detailed data has more information details (fields), but the number of records is small. Internally, the system performs a row key query on the HBase secondary index by user account, time period, and so on, and then follows the association to query the master table. This associated process is slower than the non-associated one, but because a detail query returns few records, a response within seconds is still ensured.
The system analyzes and processes user behavior data.
For example: analyzing which website users visit most often. The system does not analyze only once; it performs incremental analysis by time window (for example, one day). This requires the data to be distributed by time, so that only the most recent day's data need be processed. Our master table uses time plus business fields as its rowkey, so it is distributed by time, and it contains all fields. The analysis runs directly on the master table, without going through the index table. When fetching data from HBase, the operation is essentially a range rowkey query on the master table: the most recent day's data is taken out of HBase and aggregated, without scanning older data.
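The daily incremental analysis can be sketched as a range selection followed by a count. The master table is modeled as a list of `(rowkey, website_id)` pairs sorted by rowkey; the rowkey layout (`YYYYMMDDHH|user_account`), the helper names, and the sample rows are assumptions for illustration, and binary search stands in for an HBase rowkey range scan.

```python
import bisect
from collections import Counter
from datetime import datetime, timedelta


def next_day(day):
    """Return the day after `day`, both formatted as YYYYMMDD."""
    following = datetime.strptime(day, "%Y%m%d") + timedelta(days=1)
    return following.strftime("%Y%m%d")


def top_site_for_day(master_rows, day):
    """Count website visits for one day using only that day's rowkey range."""
    keys = [key for key, _ in master_rows]
    lo = bisect.bisect_left(keys, day)
    hi = bisect.bisect_left(keys, next_day(day))
    visits = Counter(site for _, site in master_rows[lo:hi])
    return visits.most_common(1)[0] if visits else None


# Invented sample master rows for illustration only.
rows = sorted([
    ("2015070809|u1", "site7"),
    ("2015070910|u1", "site42"),
    ("2015070911|u2", "site42"),
    ("2015070912|u3", "site7"),
    ("2015071001|u1", "site7"),
])
result = top_site_for_day(rows, "20150709")
```

Rows from earlier days fall outside the slice and are never read, which is the property the time-prefixed rowkey is meant to provide.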
In the several embodiments provided in this application, it should be understood that the disclosed device and method may be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division into units is only a division by logical function, and other divisions are possible in an actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling, direct coupling, or communication connection between the components shown or discussed may be indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or of another form.
The units described above as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may all be integrated into one processing module, or each unit may serve separately as one unit, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional units.
A person of ordinary skill in the art will understand that all or some of the steps of the above method embodiments may be carried out by hardware controlled by program instructions. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments. The storage medium includes various media that can store program code, such as a removable storage device, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), a magnetic disk, or an optical disc.
The above is only a specific implementation of the present invention, but the scope of protection of the present invention is not limited to it. Any change or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the invention shall fall within the scope of protection of the present invention. Therefore, the scope of protection of the present invention shall be determined by the scope of the claims.
Claims (14)
1. A data processing method, characterized in that the method comprises:
performing data screening on user behavior data to form a data screening result;
based on the data screening result, dividing the user behavior data into summary data that meets a query demand and system data that meets a data analysis and processing demand; wherein the summary data belongs to a user list set, and the system data and the summary data both belong to a system data set;
forming, based on the system data set, an associated field for querying the system data set; wherein the associated field belongs to a user detail set.
2. The method according to claim 1, characterized in that
the method further comprises:
establishing, based on the system data set, a master table associated with the associated field.
3. The method according to claim 1, characterized in that
the user list set further includes a summary field;
wherein the summary field has a mapping relationship with the summary data and can be used to query the summary data.
4. The method according to claim 3, characterized in that
the summary field includes a user identifier and a query time.
5. The method according to claim 3, characterized in that
the method further comprises:
establishing an index table based on the user list set and the user detail set;
wherein the query index of the index table includes the associated field and the summary field.
6. The method according to claim 5, characterized in that
the index table further includes the summary data.
7. The method according to claim 5, characterized in that
the method further comprises:
receiving a query tag formed from user input;
matching the query tag against the fields in the index table;
if the query tag matches the summary field, querying the summary data based on the summary field and returning the summary data;
if the query tag matches the associated field, querying the master table based on the associated field and returning a query result.
8. A data processing device, characterized in that the device comprises:
a screening unit, for performing data screening on user behavior data to form a data screening result;
a division unit, for dividing, based on the data screening result, the user behavior data into summary data that meets a query demand and system data that meets a data analysis and processing demand; wherein the summary data belongs to a user list set, and the system data and the summary data both belong to a system data set;
a generation unit, for forming, based on the system data set, an associated field for querying the system data set; wherein the associated field belongs to a user detail set.
9. The device according to claim 8, characterized in that
the device further comprises:
a first establishing unit, for establishing, based on the system data set, a master table associated with the associated field.
10. The device according to claim 9, characterized in that
the user list set further includes a summary field;
wherein the summary field has a mapping relationship with the summary data and can be used to query the summary data.
11. The device according to claim 10, characterized in that
the summary field includes a user identifier and a query time.
12. The device according to claim 10, characterized in that
the device further comprises:
a second establishing unit, for establishing an index table based on the user list set and the user detail set;
wherein the query index of the index table includes the associated field and the summary field.
13. The device according to claim 12, characterized in that
the index table further includes the summary data.
14. The device according to claim 12, characterized in that
the device further comprises:
a receiving unit, for receiving a query tag formed from user input;
a matching unit, for matching the query tag against the fields in the index table;
a first query unit, for querying the summary data based on the summary field, and returning the summary data, if the query tag matches the summary field;
a second query unit, for querying the master table based on the associated field, and returning a query result, if the query tag matches the associated field.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510400093.1A CN106326317A (en) | 2015-07-09 | 2015-07-09 | Data processing method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106326317A true CN106326317A (en) | 2017-01-11 |
Family
ID=57725456
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510400093.1A Pending CN106326317A (en) | 2015-07-09 | 2015-07-09 | Data processing method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106326317A (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102117320A (en) * | 2011-01-11 | 2011-07-06 | 百度在线网络技术(北京)有限公司 | Structured data searching method and device |
CN102609421A (en) * | 2011-01-24 | 2012-07-25 | 阿里巴巴集团控股有限公司 | Data query method and device |
CN102867064A (en) * | 2012-09-28 | 2013-01-09 | 用友软件股份有限公司 | Associated field query device and associated field query method |
CN103488704A (en) * | 2013-09-06 | 2014-01-01 | 乐视致新电子科技(天津)有限公司 | Method and device for storing data |
US20140330816A1 (en) * | 2011-11-18 | 2014-11-06 | Debabrata Dash | Query summary generation using row-column data storage |
CN104160394A (en) * | 2011-12-23 | 2014-11-19 | 阿米亚托股份有限公司 | Scalable analysis platform for semi-structured data |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107038224A (en) * | 2017-03-29 | 2017-08-11 | 腾讯科技(深圳)有限公司 | Data processing method and data processing equipment |
CN109117427A (en) * | 2017-06-22 | 2019-01-01 | 索意互动(北京)信息技术有限公司 | A kind of client, server, search method and its system |
CN108418827A (en) * | 2018-03-15 | 2018-08-17 | 北京知道创宇信息技术有限公司 | User's behaviors analysis method and device |
CN108418827B (en) * | 2018-03-15 | 2020-11-03 | 北京知道创宇信息技术股份有限公司 | Network behavior analysis method and device |
CN112214556A (en) * | 2020-09-30 | 2021-01-12 | 招商局金融科技有限公司 | Label generation method and device, electronic equipment and computer readable storage medium |
CN112214556B (en) * | 2020-09-30 | 2024-02-23 | 招商局金融科技有限公司 | Label generation method, label generation device, electronic equipment and computer readable storage medium |
CN115114344A (en) * | 2021-11-05 | 2022-09-27 | 腾讯科技(深圳)有限公司 | Transaction processing method and device, computing equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20170111 |