CN106326317A - Data processing method and device - Google Patents

Publication number
CN106326317A
Authority
CN
China
Prior art keywords
data
user
collection
inquiry
abstract fields
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510400093.1A
Other languages
Chinese (zh)
Inventor
卢山
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Group Shanxi Co Ltd
Original Assignee
China Mobile Group Shanxi Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Group Shanxi Co Ltd filed Critical China Mobile Group Shanxi Co Ltd
Priority to CN201510400093.1A priority Critical patent/CN106326317A/en
Publication of CN106326317A publication Critical patent/CN106326317A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures

Abstract

The invention discloses a data processing method and a device. The method comprises the steps of carrying out data screening processing on user behavior data, and forming a data screening result; on the basis of the data screening result, dividing the user behavior data into summary data meeting the query requirement and system data meeting the data analyzing and processing requirement, wherein the summary data belongs to a user list set, and the system data and the summary data both belong to a system data set; on the basis of the system data set, generating associated fields for querying the system data set, wherein the associated fields belong to a user detail set.

Description

Data processing method and device
Technical field
The present invention relates to the field of data processing, and in particular to a data processing method and device.
Background art
With the development of information technology and electronic technology, the concept and use of big data have emerged. Big data makes data sharing easier to achieve. In the prior art, however, data processing still suffers from problems such as slow queries over large volumes of data, low data processing efficiency, and heavy consumption of data processing resources.
For example, in data processing based on an HBase database, queries by row key (RowKey) are fast and efficient, but queries by anything other than the row key are usually slow and inefficient. HBase is a distributed, open-source, column-oriented database. Unlike a general relational database, it is suited to storing unstructured data. Another difference is that HBase uses a column-based rather than a row-based model.
Therefore, providing a data processing method with high data processing efficiency and fast queries is a problem in the prior art that urgently needs to be solved.
Summary of the invention
In view of this, embodiments of the present invention aim to provide a data processing method and device that can at least partially solve the problem of low data processing efficiency or slow queries.
To achieve the above object, the technical solution of the present invention is implemented as follows. A first aspect of the embodiments of the present invention provides a data processing method, the method comprising:
performing data screening on user behavior data to form a data screening result;
based on the data screening result, dividing the user behavior data into summary data that meets query requirements and system data that meets data analysis and processing requirements; wherein the summary data belongs to a user list set, and the system data and the summary data both belong to a system data set;
forming, based on the system data set, an associated field for querying the system data set; wherein the associated field belongs to a user detail set.
Based on the above solution, the method further comprises:
establishing, based on the system data set, a master table associated with the associated field.
Based on the above solution, the user list set further includes a summary field;
wherein the summary field has a mapping relationship with the summary data and can be used to query the summary data.
Based on the above solution, the summary field includes a user identifier and a query time.
Based on the above solution, the method further comprises:
establishing an index table based on the user list set and the user detail set;
wherein the query indexes of the index table include the associated field and the summary field.
Based on the above solution, the index table further includes the summary data.
Based on the above solution, the method further comprises:
receiving a query tag formed from user input;
matching the query tag against the fields in the index table;
if the query tag matches the summary field, querying the summary data based on the summary field and returning the summary data;
if the query tag matches the associated field, querying the master table based on the associated field and returning the query result.
A second aspect of the embodiments of the present invention provides a data processing device, the device comprising:
a screening unit, configured to perform data screening on user behavior data to form a data screening result;
a division unit, configured to divide, based on the data screening result, the user behavior data into summary data that meets query requirements and system data that meets data analysis and processing requirements; wherein the summary data belongs to a user list set, and the system data and the summary data both belong to a system data set;
a generation unit, configured to form, based on the system data set, an associated field for querying the system data set; wherein the associated field belongs to a user detail set.
Based on the above solution, the device further comprises:
a first establishing unit, configured to establish, based on the system data set, a master table associated with the associated field.
Based on the above solution, the user list set further includes a summary field;
wherein the summary field has a mapping relationship with the summary data and can be used to query the summary data.
Based on the above solution, the summary field includes a user identifier and a query time.
Based on the above solution, the device further comprises:
a second establishing unit, configured to establish an index table based on the user list set and the user detail set;
wherein the query indexes of the index table include the associated field and the summary field.
Based on the above solution,
the index table further includes the summary data.
Based on the above solution, the device further comprises:
a receiving unit, configured to receive a query tag formed from user input;
a matching unit, configured to match the query tag against the fields in the index table;
a first query unit, configured to, if the query tag matches the summary field, query the summary data based on the summary field and return the summary data;
a second query unit, configured to, if the query tag matches the associated field, query the master table based on the associated field and return the query result.
The data processing method and device of the embodiments of the present invention form three data sets: the user list set, the user detail set, and the system data set. The data that users usually query is placed in the user list set, so that for an ordinary query the data in the user list set is smaller than the full user behavior data, which reduces the amount of data searched and thus improves query speed. A user detail set is formed at the same time. It contains associated fields that can be used to query the rarely queried data in the system data set, and the data in the system data set is convenient for system analysis and processing. Practice has shown that data redundancy is small and can be controlled as needed by adjusting which data each data set includes, which reduces the storage and system operation and maintenance resources that the data occupies.
Brief description of the drawings
Fig. 1 is a schematic flowchart of a data processing method according to an embodiment of the present invention;
Fig. 2 is a partial schematic flowchart of a data processing method according to an embodiment of the present invention;
Fig. 3 is a first schematic structural diagram of the data processing device according to an embodiment of the present invention;
Fig. 4 is a second schematic structural diagram of the data processing device according to an embodiment of the present invention;
Fig. 5 is a schematic flowchart of data division in the data processing method according to an embodiment of the present invention;
Fig. 6 is a schematic diagram of the relationship between the three data sets according to an embodiment of the present invention;
Fig. 7 is a schematic diagram of the roles of the master table and the index table according to an embodiment of the present invention.
Detailed description
The technical solution of the present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.
Method embodiment:
As shown in Fig. 1, this embodiment provides a data processing method, the method comprising:
Step S110: performing data screening on user behavior data to form a data screening result;
Step S120: based on the data screening result, dividing the user behavior data into summary data that meets query requirements and system data that meets data analysis and processing requirements; wherein the summary data belongs to a user list set, and the system data and the summary data both belong to a system data set;
Step S130: forming, based on the system data set, an associated field for querying the system data set; wherein the associated field belongs to a user detail set.
In this embodiment, step S110 may perform the data screening according to data processing requirements. Data that meets the query requirements of ordinary users should generally be grouped into one class. Meeting user query requirements here may include meeting users' requirements for real-time query of user behavior data that occurred within a specified time. The specified time may be a period extending back from the current time; for example, the data may be the user behavior data of the most recent month. The system data in the system data set, by contrast, may be data that users are less likely to query. For example, according to query statistics, data whose probability of being queried by users is below a threshold, or whose query probability ranks low, is assigned to the system data set as system data.
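The threshold-based screening described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `screen_columns`, the column names, the query log, and the 0.2 threshold are all hypothetical and do not come from the patent text.

```python
from collections import Counter


def screen_columns(query_log, all_columns, threshold):
    """Split columns by how often users query them (hypothetical helper).

    Columns whose query probability reaches the threshold are treated as
    summary data; the rest are kept only as system data.
    """
    counts = Counter(query_log)
    total = len(query_log)
    summary_columns, system_only_columns = [], []
    for column in all_columns:
        probability = counts[column] / total if total else 0.0
        if probability >= threshold:
            summary_columns.append(column)
        else:
            system_only_columns.append(column)
    return summary_columns, system_only_columns


# Assumed query statistics: one entry per column a user asked for.
query_log = ["url"] * 6 + ["traffic_kb"] * 3 + ["protocol"]
columns = ["url", "traffic_kb", "protocol", "source_ip"]

summary_columns, system_only_columns = screen_columns(query_log, columns, threshold=0.2)
print(summary_columns)
print(system_only_columns)
```

In this sketch, every column still belongs to the system data set; the split only decides which columns are additionally exposed as summary data.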
It should be noted that the data screening in step S110 can be regarded as data analysis and abstraction carried out according to data storage rules before the data is stored. In the storage space, the data of the various sets may be stored together, while logically the data belongs to different sets. The sets here may include the user list set, the system data set, and so on. The logical division of the data can be implemented by means such as data pointers and data labels.
Taking data stored in an HBase database as an example, the master table stores P1 columns of data, and step S110 selects P2 columns as the summary data, where P2 is a positive integer smaller than P1. The data division in step S120 also includes a step of generating a query index for the summary data. In essence, the summary data is equivalent to a queryable table; querying this table requires a query index, through which the summary data can be obtained. For example, each row of the P2 columns of data corresponds to a query index, which in HBase is called the RowKey.
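A minimal sketch of the P1-to-P2 column selection, using plain Python dictionaries in place of HBase tables. The RowKey format, the column names, and the helper `build_summary_table` are assumptions made for illustration; the patent does not specify them.

```python
# Hypothetical master-table rows keyed by RowKey; each row has P1 = 4 columns.
master_rows = {
    "user_a|20150701": {"url": "example.com", "traffic_kb": 120,
                        "protocol": "HTTP", "source_ip": "10.0.0.1"},
    "user_b|20150701": {"url": "example.org", "traffic_kb": 45,
                        "protocol": "HTTPS", "source_ip": "10.0.0.2"},
}


def build_summary_table(rows, summary_columns):
    """Keep only the P2 summary columns of every row, indexed by the same RowKey."""
    return {
        row_key: {name: columns[name] for name in summary_columns if name in columns}
        for row_key, columns in rows.items()
    }


# P2 = 2 columns are selected as summary data.
summary_table = build_summary_table(master_rows, ["url", "traffic_kb"])
print(summary_table["user_a|20150701"])
```

Each row of the resulting summary table is reachable by its RowKey, which plays the role of the query index mentioned above.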
Of course, users who query data can be divided into multiple classes, for example two classes. The summary data in step S120 may be data that meets the query requirements of first-class users. Ordinary users are the first-class users here, and first-class users usually have lower permissions. Some data may not be open to these users, or these users may have no interest in some data and will not need to query it. Such data includes the system data. For example, the user behavior includes the Internet access behavior of user A. The data user A is interested in may be which web pages user A visited and the data traffic generated by visiting them. User A may not be interested in the communication protocol used by the visited web pages, the source IP address, and so on.
In short, the user behavior data is divided into summary data and system data according to a predetermined data division policy. System data is data that meets system analysis requirements.
To improve the speed at which users query summary data, this embodiment separates, in step S110, the system data that users are not interested in or are not allowed to query from the summary data. When a user queries data, the query then does not need to run against all the user behavior data, which reduces the amount of data compared and matched and thus improves query speed.
Of course, to meet analysis and processing requirements, in this embodiment both the summary data and the system data belong to the system data set. This makes it convenient for a data analysis device to analyze the data and thus also ensures the efficiency of data analysis and processing when the system needs to analyze data. Meanwhile, to facilitate queries over all the user behavior data, this embodiment also introduces a user detail set. Associated fields are formed and stored in the user detail set, and these associated fields can serve as query indexes for each item of user behavior data in the system data set.
In this way, the query requirements for system data are met at the same time, and practice has shown that queries on this data processing structure also complete within milliseconds.
As a further improvement of this embodiment, the method further comprises:
establishing, based on the system data set, a master table associated with the associated field.
The master table in this embodiment may include every item of user behavior data. The master table may be a table organized by time, and it is usually updated at a predetermined incremental time period.
The associated field is associated with the master table; the associated field can be used to query the master table and return the corresponding data. This makes querying data in the master table simple, and in particular makes it convenient for second-class users to query the results of system analysis and processing stored in the master table. For example, analysis of user behavior data produces an analysis result, which may be the number of visits to certain websites within a specified time. Such data may involve commercially confidential information and is not open to first-class users, who have lower permissions, but is open to second-class users (for example, website analysis and maintenance staff). Data that is generated from user behavior by system analysis and processing can then be queried through the associated fields in the user detail set.
As a further improvement of this embodiment, the user list set further includes a summary field;
wherein the summary field has a mapping relationship with the summary data and can be used to query the summary data. The summary field includes a user identifier and a query time. The user identifier may be any information that can identify a user, such as a user account or a user identity serial number. The query time is the time range specified by the user for the query, usually the current time or some moment or period before the current time.
Different users correspond to different user identifiers. The query time can be understood as the period whose user behavior data the user wants to query. The user identifiers here include at least first-class user identifiers.
As can be seen, in this embodiment the data in the user list set is arranged by user identifier. For each user, the user's behavior is recorded and user behavior data available for query is generated, such as which websites were visited, the data traffic generated by visiting them, and the traffic type generated by visiting them. The traffic type may include second-generation mobile communication (2G) traffic, third-generation mobile communication (3G) traffic, fourth-generation mobile communication (4G) traffic, and so on.
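One way to picture a summary field made of a user identifier and a query time is a sorted key space scanned by range, similar in spirit to a RowKey range scan. The sketch below is an assumption-laden illustration: the class `UserListSet`, the `user_id|YYYYMMDD` key layout, and the sample records are hypothetical, and an in-memory sorted list stands in for the real store.

```python
import bisect


class UserListSet:
    """In-memory stand-in for a user list set arranged by user identifier."""

    def __init__(self):
        self._keys = []   # sorted summary-field keys
        self._rows = {}   # key -> summary data

    def put(self, user_id, day, summary):
        key = f"{user_id}|{day}"  # assumed key layout: user identifier, then day
        if key not in self._rows:
            bisect.insort(self._keys, key)
        self._rows[key] = summary

    def query(self, user_id, start_day, end_day):
        """Return the summaries of one user for days in [start_day, end_day]."""
        lo = bisect.bisect_left(self._keys, f"{user_id}|{start_day}")
        hi = bisect.bisect_right(self._keys, f"{user_id}|{end_day}")
        return [self._rows[key] for key in self._keys[lo:hi]]


store = UserListSet()
store.put("user_a", "20150701", {"url": "example.com", "traffic_type": "4G"})
store.put("user_a", "20150702", {"url": "example.org", "traffic_type": "3G"})
store.put("user_a", "20150710", {"url": "example.net", "traffic_type": "2G"})
store.put("user_b", "20150701", {"url": "example.com", "traffic_type": "4G"})

result = store.query("user_a", "20150701", "20150702")
print(result)
```

Because the keys are sorted with the user identifier first and a fixed-width day second, all rows for one user and one period sit next to each other, so the lookup touches only that slice.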
The summary field is equivalent to an index for querying the summary data. When the data processing device receives a summary field, it can find the corresponding summary data through processing such as matching on the summary field. In a specific implementation, to manage the summary fields and reduce the problem of invalid indexes, the summary field usually also belongs to the user detail set, and the user detail set manages and maintains all the indexes used for data queries (both summary fields and associated fields) in a unified way.
The method further comprises:
establishing an index table based on the user list set and the user detail set;
wherein the query indexes of the index table include the associated field and the summary field.
In this embodiment, the method further involves an index table. According to the preceding solution, data can be queried using the summary field and the associated field. To manage these query fields in a unified way and avoid forming multiple index tables and multiple index types, this embodiment stores both the associated fields and the summary fields in the index table. This reduces the number of index types, avoids confusion among index types and the large storage footprint of index storage, and reduces data redundancy.
The index table further includes the summary data.
In this embodiment the index table directly includes the summary data, so the summary field can be used to query the summary data in the index table directly. In this case, the summary field is equivalent to a non-associated index.
The associated field in the index table can also be associated with the master table to query data in the master table, and is equivalent to an associated index. This achieves unified management and processing of non-associated and associated indexes.
Of course, summary data can also be reused as an associated field. For example, a first-class user may query the websites that the user previously visited, while a second-class user may query information such as the cumulative number of visits to the websites that first-class users visited. The websites visited by a first-class user are data available for query by that user and are stored in the user list set as summary data. The same websites also serve as associated fields and are stored in the user detail set. To further reduce data redundancy, the data in the user detail set and the user list set can together form the aforementioned index table. Some entries in the index table serve both as summary data and as associated fields, which reduces data redundancy as far as possible and reduces the resources taken up by data storage and maintenance.
In a specific implementation, each row element in the index table may include at least one summary field and one associated field. Each row element in the index table may also include an attribute indicating whether it is used as a master-table association query parameter. If the value of this attribute is false, the query index in this row element is a summary field, and the summary data corresponding to that summary field can be returned directly; usually this summary data is also stored in the row element. If the value of the attribute is true, the query index of this row element is an associated field, which can be used to join to the master table for the query. In addition, the index table may include an attribute giving the number of associated elements: if the number is 0, the row likewise corresponds to a summary field; otherwise it corresponds to an associated field.
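The two distinguishing attributes described above could be modeled roughly as follows. The class `IndexRow`, its attribute names, and the sample values are hypothetical; the patent only states that such attributes may exist, not how they are named or stored.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IndexRow:
    """One row element of the index table (illustrative layout only)."""
    index_value: str                # the query index held by this row
    joins_master: bool = False      # whether it is a master-table association parameter
    associated_count: int = 0       # number of associated elements
    summary: Optional[dict] = None  # summary data kept in the row, if any


def field_kind(row):
    """Classify the row's query index as a summary field or an associated field."""
    if row.joins_master or row.associated_count > 0:
        return "associated"
    return "summary"


summary_row = IndexRow("user_a|20150701", summary={"url": "example.com"})
associated_row = IndexRow("site:example.com", joins_master=True, associated_count=3)

print(field_kind(summary_row))
print(field_kind(associated_row))
```

Either attribute alone is enough to tell the two kinds of field apart; the sketch checks both only to show that they agree.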
In this embodiment, every data item or field in the index table can serve as an index for user data queries. Whether a given field is a summary field or an associated field can be indicated by a parameter set to distinguish the two kinds of field, such as the number of associated elements or the association attribute.
As shown in Fig. 2, the method further comprises:
Step S210: receiving a query tag formed from user input;
Step S220: matching the query tag against the fields in the index table;
Step S230: if the query tag matches the summary field, querying the summary data based on the summary field and returning the summary data;
Step S240: if the query tag matches the associated field, querying the master table based on the associated field and returning the query result.
The query tag may be a query index entered by the user, or data generated by the electronic device from the information entered by the user together with information such as the user identifier, which can be matched against the data in the user detail set.
This embodiment thus provides a data query method with a unified data query interface. By matching the query tag against the fields in the index table, summary data can be returned quickly in the non-associated way, while data in the master table can be returned in the associated way. The method features high data processing efficiency, low data redundancy, and simple index management.
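Steps S210 to S240 can be sketched as a single lookup function over an index table. Everything concrete here is an assumption for illustration: the function `run_query`, the dictionary layout of the index table and master table, and the tag formats are hypothetical, and exact-match lookup stands in for whatever matching the real device performs.

```python
def run_query(query_tag, index_table, master_table):
    """Route a query tag through the index table (steps S210 to S240)."""
    row = index_table.get(query_tag)   # S220: match the tag against index fields
    if row is None:
        return None                    # no field matched the tag
    if row["joins_master"]:            # S240: associated field, query the master table
        return [master_table[key] for key in row["master_keys"]]
    return row["summary"]              # S230: summary field, return stored summary data


# Hypothetical master table holding analysis results.
master_table = {
    "m1": {"site": "example.com", "visits": 42},
    "m2": {"site": "example.org", "visits": 7},
}

# Hypothetical index table mixing a summary field and an associated field.
index_table = {
    "user_a|20150701": {"joins_master": False,
                        "summary": {"url": "example.com", "traffic_kb": 120}},
    "site:example.com": {"joins_master": True, "master_keys": ["m1"]},
}

print(run_query("user_a|20150701", index_table, master_table))
print(run_query("site:example.com", index_table, master_table))
```

The caller uses one entry point for both kinds of query; only the associated path touches the master table.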
Device embodiment:
As shown in Fig. 3, this embodiment provides a data processing device, the device comprising:
a screening unit 110, configured to perform data screening on user behavior data to form a data screening result;
a division unit 120, configured to divide, based on the data screening result, the user behavior data into summary data that meets query requirements and system data that meets data analysis and processing requirements; wherein the summary data belongs to a user list set, and the system data and the summary data both belong to a system data set;
a generation unit 130, configured to form, based on the system data set, an associated field for querying the system data set; wherein the associated field belongs to a user detail set.
The data processing device of this embodiment may correspond to various forms of electronic equipment capable of data processing, such as a desktop computer, a notebook computer, a server, or a server platform.
The structures corresponding to the screening unit 110, the division unit 120, and the generation unit 130 may include a processing structure and a storage medium. The processing structure may include a processor or a processing circuit. The processor may include an application processor (AP), a digital signal processor (DSP), a programmable array (PLC), a microprocessor (MCU), or a similar structure. The processing circuit may include an application-specific integrated circuit (ASIC). The storage medium may be connected to the processing structure through a connection structure such as a bus interface. Executable code is stored on the storage medium, and the processing structure implements the functions of the above units by executing the executable code.
Any two of the above units may correspond to different processing structures, or may be integrated into the same processing structure. When one processing structure corresponds to multiple units, it completes the operations of the different units by time-division multiplexing or concurrent threads.
In this embodiment, for the distinction between the summary data and the system data, refer to the corresponding method embodiment; it is not repeated here.
As shown in Fig. 4, the device further comprises:
a first establishing unit 130, configured to establish, based on the system data set, a master table associated with the associated field.
The first establishing unit of this embodiment may include a storage medium for storing the master table. The first establishing unit is configured to form the master table from the data in the system data set according to a master table establishing function or policy.
There is an association between the master table and the associated field; this association can be used by second-class users to query the data in the master table.
The user list set further includes a summary field; wherein the summary field has a mapping relationship with the summary data and can be used to query the summary data.
In this embodiment, the user list set directly includes the summary data that first-class users can query and the summary field. The data processing device can use the summary field to find the summary data within milliseconds.
In this embodiment, the summary field includes a user identifier and a query time; the summary data is distributed and organized by user identifier, which makes queries quick and convenient.
As a further improvement of this embodiment, the device further comprises: a second establishing unit 150, configured to establish an index table based on the user list set and the user detail set; wherein the query indexes of the index table include the associated field and the summary field.
In this embodiment, the specific structure of the second establishing unit 150 is similar to that of the first establishing unit, except that the second establishing unit 150 is a structure for establishing the index table. The index table includes summary fields and associated fields, so that indexes are managed uniformly through the index table, which avoids problems such as index confusion.
The index table further includes the summary data. In this embodiment, the summary data is maintained directly in the index table. When matching against the fields in the index table determines that a query tag matches a summary field, the summary data maintained in the index table is returned directly, which maximizes the query rate. Compared with storing the summary field and the summary data separately with a mapping between them, this also minimizes data redundancy and reduces the storage and data maintenance resources that the data occupies.
As a further improvement of this embodiment, as shown in Fig. 4, the device further comprises:
a receiving unit 210, configured to receive a query tag formed from user input;
a matching unit 220, configured to match the query tag against the fields in the index table;
a first query unit 230, configured to, if the query tag matches the summary field, query the summary data based on the summary field and return the summary data;
a second query unit 240, configured to, if the query tag matches the associated field, query the master table based on the associated field and return the query result.
The specific structure of the receiving unit 210 in this embodiment may include a communication interface for receiving the query tag formed from user input. Typically, the tag for querying summary data may be the user identifier plus the query event. To query data in the master table, the tag is typically the query time plus the user identifier, together with information such as a service field.
The specific structures of the first query unit 230 and the second query unit 240 may both correspond to a processing structure with an information lookup function; for a description of the processing structure, see the preceding sections. In short, when querying data, the device of this embodiment combines associated and non-associated indexes, can quickly find the data the user wants, and features low data redundancy and low system resource usage.
Several concrete examples are provided below in conjunction with any of the above embodiments.
Example one:
Fig. 5 shows the data division flow based on the preceding embodiments.
The user behavior data in HBase is divided into data that meets users' real-time query needs and data that meets the system's processing needs. The data that meets users' real-time query needs is further divided into data belonging to the user list set and data belonging to the user detail set. Data that only meets the system's processing needs belongs entirely to the system data set.
The user behavior data may be divided as follows:
Step one: abstract the requirements of real-time user behavior queries.
User behavior data is placed in HBase mainly because of the need to query user behavior in real time, so we first analyze the characteristics of the data involved in such queries.
Because a user's attention span is limited, a user generally cannot attend to a large amount of data content at once, so users prefer to browse data in two steps. Following this behavior, we split data display into two steps: showing summary data, then showing detailed data.
Based on this display pattern, we divide the data that users will query into two record sets: the user list set and the user detail set. The user list set corresponds to the user's need to browse a large volume of data in summary form; the user detail set corresponds to the need to query details. Both sets are subsets of the full user behavior data.
How are these two data sets screened out of the full data? A user only cares about his or her own data, and usually only for a certain period. Both data sets are therefore obtained by filtering first on the user identifier field and then on the time field.
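The two-stage screening (user identifier first, then time) can be illustrated with a short Python sketch. The record layout, the field names and the helper `screen` are assumptions made for this example; the source does not specify which fields belong to each set at this point.

```python
# Hypothetical user behavior records; field names are illustrative only.
records = [
    {"user": "u1", "time": 100, "site": "a", "bytes": 10, "protocol": "http"},
    {"user": "u1", "time": 200, "site": "b", "bytes": 20, "protocol": "https"},
    {"user": "u2", "time": 150, "site": "a", "bytes": 30, "protocol": "http"},
]

SUMMARY_FIELDS = ("user", "time", "site")           # assumed list-set fields
DETAIL_FIELDS = ("user", "time", "site", "bytes")   # assumed detail-set fields


def screen(rows, user, start, end, fields):
    """Filter by user identifier first, then by time, then project fields."""
    by_user = [r for r in rows if r["user"] == user]
    by_time = [r for r in by_user if start <= r["time"] < end]
    return [{f: r[f] for f in fields} for r in by_time]


user_list_set = screen(records, "u1", 0, 150, SUMMARY_FIELDS)
user_detail_set = screen(records, "u1", 0, 150, DETAIL_FIELDS)
```

Both results are subsets of `records`, differing only in how many fields each row keeps.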
Step two: abstract the requirements of data analysis and processing.
The analysis above covers real-time user queries, but on a big data sharing platform user behavior data is not used only for real-time user queries; it may also be analyzed and processed inside the system.
The following analyzes the business characteristics of data analysis and processing inside the system.
Analysis scenarios inside the system vary with what is being analyzed, and different analyses care about different data columns (this is also the origin of columnar storage), so in principle any field in the table may be needed during analysis and processing. At the same time, analysis on a big data sharing platform usually concerns the data of a large number of users rather than of one particular user, and is essentially incremental processing by time period, whether it is done with multi-table SQL joins or with MapReduce. From these characteristics we obtain a "system data set": it contains all data fields and is filtered out of the full data by the time field.
Fig. 6 is a schematic of how the data is distributed after division according to the method of Fig. 5. As Fig. 6 shows, all user behavior data may be needed for system analysis and processing, so all user behavior data belongs to the system data set. The data that meets users' real-time query needs comprises the data in the user list set and in the user detail set. The user list set includes summary data and key fields; a key field is the query index for the summary data. To make unified management of the indexes easier, the user detail set includes summary fields in addition to the association field used to query data in the system data set.
The following table compares the similarities and differences of the three data sets.
As the table shows, the amount of data in the user list set may fall between that of the user detail set and that of the system data set, but both the user detail set and the user list set are subsets of the system data set. The query field is the index used to query data in the corresponding data set. In this example, data in the user list set can be queried with a query field consisting of the user identifier plus the time. The time here is the query time mentioned earlier.
The data structures are set up on the basis of the above division. Because the system data set includes all fields, it is suitable for storing in a master table. The user detail set returns very few records, which makes it well suited to being implemented by building an associated index on the master table (the associated index here is the association field mentioned earlier), because an HBase associated secondary index performs best and occupies little space when few records are returned. The user list set returns more records, so an HBase associated secondary index would not give the best performance; because this set has far fewer fields than the other two, it is better suited to an HBase non-associated secondary index. In other words, the index stores all of the required summary data, and a query does not perform an associated lookup against the master table. Performance is higher at the cost of some space; because this set has the fewest fields, this amounts to trading a little redundancy for the best performance.
With this design, each kind of query gets high performance, while the data carries some redundancy, though the amount is small.
As described above, one master table and two indexes are formed. The two indexes correspond to the summary field and the association field of the preceding embodiments: the summary field corresponds to the non-associated index, and the association field corresponds to the associated index.
Next, the two indexes are merged. In the implementation, the content of the non-associated index includes the content of the associated index, so the merge uses the index content of the user list set as the content of the merged index. Two parameters are passed to the index: "whether to perform an associated query against the master table" and "number of associated elements". When the first parameter is false, the index table data is returned directly.
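A merged index controlled by a flag can be sketched as follows. This is an assumption-laden illustration: `merged_index`, `query_index` and the parameter name `associate_master` are hypothetical, and the "number of associated elements" parameter is left out because the source does not describe how it is used.

```python
# Hypothetical merged index: each entry carries the master-table row key
# (associated part) and the summary data (non-associated part).
merged_index = {
    "u1|100": {"master_key": "100|u1", "summary": {"site": "a"}},
    "u1|200": {"master_key": "200|u1", "summary": {"site": "b"}},
}
master_table = {
    "100|u1": {"site": "a", "bytes": 10, "protocol": "http"},
    "200|u1": {"site": "b", "bytes": 20, "protocol": "https"},
}


def query_index(index_key, associate_master):
    """Return index data directly, or follow the link into the master table."""
    entry = merged_index[index_key]
    if not associate_master:
        # Parameter is false: answer from the index table alone.
        return entry["summary"]
    return master_table[entry["master_key"]]
```

A summary query would call `query_index(key, False)` and a detail query `query_index(key, True)`, so one stored index serves both access paths.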
In analysis and processing on a big data sharing platform, the concern is usually not one particular user's data but the data of a large number of users, processed incrementally by time period, whether with multi-table SQL joins or with MapReduce.
The system data set mainly serves internal analysis and processing, and the master table addresses the performance of that processing well. The master table uses the time plus service fields as its query index; that is, the data in the master table is distributed in HBase by time. When performing internal analysis and processing, we can therefore efficiently locate where newly added data resides and fetch only the data that needs processing, which speeds up data query and extraction during processing.
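Why a time-prefixed row key makes it cheap to locate new data can be shown with a sorted list and binary search. This is only a sketch: the key format and the helper `scan_range` are assumptions, and the sorted Python list merely imitates the fact that HBase stores rows in row-key order.

```python
import bisect

# Hypothetical master-table row keys: a fixed-width time prefix plus a
# service field, kept sorted the way HBase keeps row keys sorted.
row_keys = sorted([
    "2015070900|web", "2015070901|im", "2015070901|web",
    "2015070902|web", "2015070903|im",
])


def scan_range(keys, start_prefix, stop_prefix):
    """Return keys in [start_prefix, stop_prefix) without touching the rest."""
    lo = bisect.bisect_left(keys, start_prefix)
    hi = bisect.bisect_left(keys, stop_prefix)
    return keys[lo:hi]


new_rows = scan_range(row_keys, "2015070901", "2015070903")
```

Only the slice between the two binary-search positions is read, which mirrors fetching just the newly added time range instead of scanning the whole table.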
As shown in Fig. 7, rows in the master table are organized in chronological order, while rows in the index table are organized by user account. A user account is one kind of user identifier.
The index table includes summary fields and summary data; in Fig. 7, the displayed "5-6" and "6-5xxx" are both summary data mapped from the summary fields. Fig. 7 also shows an index value, which corresponds to the association field of the preceding embodiments and is used to associate with the master table for querying data in it.
The master table stores all kinds of data, including the user account, the query time, and service fields such as Fa and Fb used in user queries.
The index table concatenates the user ID, the timestamp and the other summary fields into a single string that serves as the row key, with a delimiter between the parts. When a user queries summary data, the user ID, the time range of the query and the other query conditions are turned into a row key query, which quickly locates the position of the data. The index table data is then taken out directly; because the index table already contains all the summary fields, the data is returned straight from the index table without an associated query against the master table. The user ID here is one kind of the user identifier mentioned earlier.
The whole process is therefore just a row key range query on the index table, so it is efficient.
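The row key construction and the range query over it can be sketched as below. The delimiter `|`, the fixed-width timestamp format and the helper names are assumptions; the source only says that a delimiter is used. The list comprehension stands in for a real range scan, which would seek to the start key rather than test every row.

```python
DELIM = "|"  # assumed delimiter; the source does not name one


def index_row_key(user_id, timestamp, *summary_fields):
    """Concatenate user ID, timestamp and other summary fields into one key."""
    return DELIM.join([user_id, timestamp, *summary_fields])


index_rows = sorted([
    index_row_key("u1", "20150709100000", "im", "siteA"),
    index_row_key("u1", "20150709120000", "web", "siteB"),
    index_row_key("u1", "20150710090000", "web", "siteC"),
    index_row_key("u2", "20150709110000", "web", "siteA"),
])


def summary_query(rows, user_id, start, end):
    """Row-key range query: same user, timestamp in [start, end)."""
    lo = index_row_key(user_id, start)
    hi = index_row_key(user_id, end)
    return [k.split(DELIM) for k in rows if lo <= k < hi]


hits = summary_query(index_rows, "u1", "20150709000000", "20150710000000")
```

Because the summary fields are part of the row key itself, splitting the matching keys is enough to answer the query; no second table is consulted.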
The index table is associated with the master table: the first half of an index value is identical to a master table row key, which logically links the index table to the master table, and the second half of the index value carries the remaining user summary data. The index value thus serves two purposes: first, it associates with the master table row key; second, it carries user summary data. In this way the index value merges the associated index and the non-associated index. Viewed this way, the index value comprises a first part and a second part: the first part maps to the summary data and the second part maps to the master table, so the first part corresponds to the summary field and the second part to the association field. A single index value can therefore be used to query both the summary data and the data in the system data set. This approach relies on the parameters set earlier to distinguish the association field from the summary field: each index value effectively has two corresponding parameters, one indicating associated or non-associated, the other giving the number of associated elements. The data is thus stored once and used for two purposes.
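One possible encoding of such a two-part index value is sketched here. The separator `#`, the comma-joined summary and both helper functions are hypothetical; the source does not specify how the two halves are delimited.

```python
SEP = "#"  # assumed separator between the two halves of an index value


def make_index_value(master_row_key, summary):
    """First half: master-table row key. Second half: summary data."""
    return master_row_key + SEP + ",".join(summary)


def split_index_value(value):
    """Recover the master-table row key and the summary data."""
    master_row_key, _, summary = value.partition(SEP)
    return master_row_key, summary.split(",")


value = make_index_value("20150709100000|u1", ["im", "siteA"])
master_key, summary = split_index_value(value)
```

A summary query would use only `summary`, while a detail query would use `master_key` to reach the master table, so the value is stored once and read in two ways.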
When a user queries detailed data, the index table is associated with the master table. The whole process performs only (number of returned records × 2) row key queries. Because a detail query returns a small number of records, each with many fields, the association performs well.
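The "records × 2" figure can be read as one index lookup plus one master table lookup per returned record. The sketch below counts lookups under that reading; the reading itself, the `CountingTable` wrapper and the sample rows are assumptions made for illustration.

```python
class CountingTable:
    """Dictionary wrapper that counts row-key lookups (illustrative only)."""

    def __init__(self, rows):
        self.rows = rows
        self.lookups = 0

    def get(self, key):
        self.lookups += 1
        return self.rows[key]


index = CountingTable({"u1|100": "100|u1", "u1|200": "200|u1"})
master = CountingTable({
    "100|u1": {"site": "a", "bytes": 10},
    "200|u1": {"site": "b", "bytes": 20},
})


def detail_query(index_keys):
    """One index lookup plus one master lookup per returned record."""
    return [master.get(index.get(k)) for k in index_keys]


details = detail_query(["u1|100", "u1|200"])
total_lookups = index.lookups + master.lookups
```

With two records returned, `total_lookups` is four, so the cost grows linearly with the (small) number of detail records.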
In data processing and analysis scenarios, processing is usually not centered on a single user. It is essentially incremental processing by time, for example scheduled processing of the previous hour's data, so that newly added data is processed continuously.
To obtain data, the data analysis and processing system can query the master table by row key and process the results through engines such as Hive, Impala or Spark, or use MapReduce to locate the data range to be processed directly by rowkey, avoiding a full table scan.
Example two:
The invention is described below through an example:
User internet behavior data from the A interface is stored to support the business's use of the data. This data typically has two kinds of use:
Users query their internet access records.
The system analyzes user internet behavior from this data.
The metadata is as follows:
Define the data sets
The fields are sorted by user attention, from high to low. The high-attention fields are then selected as summary fields, and the summary data set is determined from them; adding the remaining data that users care about gives the detailed data set; and the complete set is the data set for system analysis and processing. Attention here may be reflected in how frequently users query the data. The sorting result is as follows:
In the sorting result in the table above, the user account, start time, service number, service category number and website name number are taken as summary fields, and the data corresponding to these fields is the summary data. The fields numbered from user account through application layer protocol in the table are the data fields of the user detailed data set. The other system fields belong to the system data set.
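The ranking step can be sketched as sorting fields by a query-frequency score and cutting the ranking into nested sets. The frequencies below are made-up illustrative values, the field names `uplink_bytes` and `internal_batch_id` are hypothetical, and the two cut-off parameters are assumptions; the source gives no thresholds.

```python
# Made-up query frequencies standing in for "user attention".
query_frequency = {
    "user_account": 980, "start_time": 950, "service_id": 700,
    "service_category_id": 650, "website_name_id": 600,
    "uplink_bytes": 120, "app_protocol": 90, "internal_batch_id": 0,
}


def define_data_sets(freq, summary_count, min_detail_freq):
    """Rank fields by attention, then cut the ranking into nested sets."""
    ranked = sorted(freq, key=freq.get, reverse=True)
    summary = ranked[:summary_count]
    detail = [f for f in ranked if freq[f] >= min_detail_freq]
    return summary, detail, ranked  # ranked is the complete (system) set


summary_fields, detail_fields, system_fields = define_data_sets(
    query_frequency, summary_count=5, min_detail_freq=1
)
```

The three results are nested: every summary field is also a detail field, and every detail field is part of the system set.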
Define the master table
The master table uses time + service fields as its row key; in this example the start time and the user account are concatenated to form the master table key. (The time is fixed as the rowkey prefix; the remainder can be designed according to business needs.) All data is saved in the master table.
Define the index table
The fields in the user list set are combined to build the index. The fields in the user list set may include: user account, service number, service category number, start time and website name number.
The index table uses the user account as its prefix and serves user queries. When a user queries data, the user's account is used for fast matching.
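A master table row key with a fixed time prefix might be built as in the sketch below. The timestamp format, the `|` delimiter and the function name `master_row_key` are assumptions; the point illustrated is only that a fixed-width time prefix makes lexicographic key order coincide with chronological order.

```python
from datetime import datetime


def master_row_key(start_time, user_account):
    """Fixed-width time prefix first, then the service-specific part."""
    return start_time.strftime("%Y%m%d%H%M%S") + "|" + user_account


keys = [
    master_row_key(datetime(2015, 7, 9, 12, 0, 0), "u1"),
    master_row_key(datetime(2015, 7, 9, 8, 30, 0), "u2"),
    master_row_key(datetime(2015, 7, 8, 23, 59, 59), "u3"),
]
# Lexicographic order of the keys equals chronological order of the rows.
chronological = sorted(keys)
```

The part after the prefix can be swapped for other service fields without affecting the time ordering.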
When a user queries internet access records:
When a user queries his or her internet access records for a certain time period, the query is made by user account and time period (the time period corresponds to the query time mentioned earlier), optionally with a selected service type (instant messaging, microblog, etc.), and returns the summary part of the access records. The user sees the corresponding summary data; in this example, the user can see which software (BusinessType_ID) was used to visit which website (WebsiteID). The volume of this data may be large, so paging may be used. Internally, the system matches row keys through the HBase secondary index to complete the query quickly. Because the master table is not associated, even a query returning tens of thousands of records can be returned within seconds.
When a user queries detailed data: after viewing the summary data, the user selects the detailed records to inspect further. Detailed data has more fields per record, but the number of records is very small. Internally, the system performs a row key query on the HBase secondary index by user account, time period and so on, and then performs an associated query against the master table. This associated process is slower than the non-associated one, but because a detail query returns few records, a response within seconds is still assured.
The system analyzes and processes user behavior data.
For example, to analyze which website receives the most user visits: the system does not analyze once only, but performs incremental analysis by time window (for example one day). This requires the data to be distributed by time so that only the most recent day's data needs to be processed. Our master table uses time plus service fields as its rowkey, so it is distributed by time, and it contains all fields. The analysis runs directly on the master table without going through the index table. Fetching data from HBase is essentially a rowkey range query on the master table: the most recent day's data is taken out of HBase and aggregated, without scanning older data.
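The daily incremental analysis can be sketched with a time-prefixed key range and a counter. The key format, the sample rows and the function `visits_for_day` are assumptions; a real HBase job would seek to the start key instead of filtering a sorted dictionary as done here.

```python
from collections import Counter

# Hypothetical master-table rows keyed by "YYYYMMDDHH|user".
master_rows = {
    "2015070823|u1": {"website": "siteA"},
    "2015070901|u1": {"website": "siteB"},
    "2015070910|u2": {"website": "siteB"},
    "2015070915|u3": {"website": "siteC"},
    "2015071002|u1": {"website": "siteA"},
}


def visits_for_day(rows, day):
    """Count website visits for rows whose key falls within the given day."""
    start, stop = day + "00", day + "24"
    window = (v for k, v in sorted(rows.items()) if start <= k < stop)
    return Counter(row["website"] for row in window)


daily = visits_for_day(master_rows, "20150709")
top_site, top_count = daily.most_common(1)[0]
```

Rows from the previous and following days fall outside the key range and never enter the count, which is the behavior the time-distributed master table is meant to provide.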
In the several embodiments provided in this application, it should be understood that the disclosed device and method may be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division into units is only a division by logical function; in an actual implementation there may be other divisions, for example: multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be indirect coupling or communication connection through interfaces, devices or units, and may be electrical, mechanical or of another form.
The units described above as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the scheme of this embodiment.
In addition, the functional units in the embodiments of the invention may all be integrated into one processing module, or each unit may stand alone as a unit, or two or more units may be integrated into one unit. The integrated unit may be implemented in hardware, or in hardware plus software functional units.
A person of ordinary skill in the art will understand that all or some of the steps of the above method embodiments may be carried out by hardware under the control of program instructions. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments. The storage medium includes any medium that can store program code, such as a removable storage device, read-only memory (ROM), random access memory (RAM), a magnetic disk or an optical disc.
The above is only a specific embodiment of the invention, but the scope of protection of the invention is not limited to it. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the invention shall fall within the scope of protection of the invention. The scope of protection of the invention shall therefore be determined by the scope of the claims.

Claims (14)

1. A data processing method, characterized in that the method comprises:
performing data screening on user behavior data to form a data screening result;
based on the data screening result, dividing the user behavior data into summary data that meets query needs and system data that meets data analysis and processing needs; wherein the summary data belongs to a user list set, and the system data and the summary data belong to a system data set;
forming, based on the system data set, an association field for querying the system data set; wherein the association field belongs to a user detail set.
2. The method according to claim 1, characterized in that
the method further comprises:
establishing, based on the system data set, a master table associated with the association field.
3. The method according to claim 1, characterized in that
the user list set further includes a summary field;
wherein the summary field has a mapping relationship with the summary data and can be used to query the summary data.
4. The method according to claim 3, characterized in that
the summary field includes a user identifier and a query time.
5. The method according to claim 3, characterized in that
the method further comprises:
establishing an index table based on the user list set and the user detail set;
wherein the query index of the index table includes the association field and the summary field.
6. The method according to claim 5, characterized in that
the index table further includes the summary data.
7. The method according to claim 5, characterized in that
the method further comprises:
receiving a query tag formed from user input;
matching the query tag against the fields in the index table;
if the query tag matches the summary field, querying the summary data based on the summary field and returning the summary data;
if the query tag matches the association field, querying the master table based on the association field and returning the query result.
8. A data processing device, characterized in that the device comprises:
a screening unit, configured to perform data screening on user behavior data to form a data screening result;
a division unit, configured to divide, based on the data screening result, the user behavior data into summary data that meets query needs and system data that meets data analysis and processing needs; wherein the summary data belongs to a user list set, and the system data and the summary data belong to a system data set;
a generation unit, configured to form, based on the system data set, an association field for querying the system data set; wherein the association field belongs to a user detail set.
9. The device according to claim 8, characterized in that
the device further comprises:
a first establishing unit, configured to establish, based on the system data set, a master table associated with the association field.
10. The device according to claim 9, characterized in that
the user list set further includes a summary field;
wherein the summary field has a mapping relationship with the summary data and can be used to query the summary data.
11. The device according to claim 10, characterized in that
the summary field includes a user identifier and a query time.
12. The device according to claim 10, characterized in that
the device further comprises:
a second establishing unit, configured to establish an index table based on the user list set and the user detail set;
wherein the query index of the index table includes the association field and the summary field.
13. The device according to claim 12, characterized in that
the index table further includes the summary data.
14. The device according to claim 12, characterized in that
the device further comprises:
a receiving unit, configured to receive a query tag formed from user input;
a matching unit, configured to match the query tag against the fields in the index table;
a first query unit, configured to, if the query tag matches the summary field, query the summary data based on the summary field and return the summary data;
a second query unit, configured to, if the query tag matches the association field, query the master table based on the association field and return the query result.
CN201510400093.1A 2015-07-09 2015-07-09 Data processing method and device Pending CN106326317A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510400093.1A CN106326317A (en) 2015-07-09 2015-07-09 Data processing method and device

Publications (1)

Publication Number Publication Date
CN106326317A true CN106326317A (en) 2017-01-11

Family

ID=57725456

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510400093.1A Pending CN106326317A (en) 2015-07-09 2015-07-09 Data processing method and device

Country Status (1)

Country Link
CN (1) CN106326317A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107038224A (en) * 2017-03-29 2017-08-11 腾讯科技(深圳)有限公司 Data processing method and data processing equipment
CN108418827A (en) * 2018-03-15 2018-08-17 北京知道创宇信息技术有限公司 User's behaviors analysis method and device
CN109117427A (en) * 2017-06-22 2019-01-01 索意互动(北京)信息技术有限公司 A kind of client, server, search method and its system
CN112214556A (en) * 2020-09-30 2021-01-12 招商局金融科技有限公司 Label generation method and device, electronic equipment and computer readable storage medium
CN115114344A (en) * 2021-11-05 2022-09-27 腾讯科技(深圳)有限公司 Transaction processing method and device, computing equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102117320A (en) * 2011-01-11 2011-07-06 百度在线网络技术(北京)有限公司 Structured data searching method and device
CN102609421A (en) * 2011-01-24 2012-07-25 阿里巴巴集团控股有限公司 Data query method and device
CN102867064A (en) * 2012-09-28 2013-01-09 用友软件股份有限公司 Associated field query device and associated field query method
CN103488704A (en) * 2013-09-06 2014-01-01 乐视致新电子科技(天津)有限公司 Method and device for storing data
US20140330816A1 (en) * 2011-11-18 2014-11-06 Debabrata Dash Query summary generation using row-column data storage
CN104160394A (en) * 2011-12-23 2014-11-19 阿米亚托股份有限公司 Scalable analysis platform for semi-structured data

Legal Events

Date Code Title Description
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170111