CN106776901A - Data extraction method, apparatus and system - Google Patents
- Publication number
- CN106776901A (application CN201611080168.3A)
- Authority
- CN
- China
- Prior art keywords
- data
- key
- value
- type
- data type
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention discloses a method for extracting data from one or more data sources. Each of the one or more data sources includes multiple data records, and each record includes one or more data items in key-value pair form. The data extraction method includes the steps of: for each of the one or more data sources, determining the data type corresponding to each key and generating a data type table; parsing a data record and extracting the one or more data items it includes; and, for each data item: extracting the key-value pair that constitutes the data item; determining, from the data type table and according to the data source of the record, the data type corresponding to the extracted key; and verifying the value of the extracted key-value pair using the data verification method corresponding to that data type, where, if verification passes, the extraction succeeds and the value of the extracted key-value pair is recorded. The invention also discloses a corresponding data extraction device and system.
Description
Technical field
The present invention relates to the technical field of data extraction, and in particular to a data extraction method, device, and system.
Background art
In the current big data environment, accurately extracting the required data from massive data, such as HTTP access logs and Internet of Things data, is of great importance for analyzing user behavior, preferences, and habits, for predicting user behavior, and for improving the effectiveness of advertisement delivery.
Take extracting data from a URL (Uniform Resource Locator) as an example. Typically, the data is matched in full against a predetermined regular expression; whenever there is a hit, the matched data is extracted and its type is set to the type corresponding to that regular expression. Practice has shown that this scheme has a high error rate. For example, data of which only part satisfies the regular expression is still identified as the corresponding data type and extracted. Likewise, some data is not of the data type corresponding to the regular expression, yet in massive data part of its content happens to satisfy the regular expression, so that data is extracted by mistake.
Therefore, a data extraction method is needed that can accurately extract data from a variety of data sources while keeping data extraction efficient.
Summary of the invention
To this end, the present invention provides a data extraction method, device, and system that seek to solve, or at least alleviate, at least one of the problems above.
According to one aspect of the present invention, a method for extracting data from one or more data sources is provided. Each of the one or more data sources includes multiple data records, and each record includes one or more data items in key-value pair form. The data extraction method includes the steps of: for each of the one or more data sources, determining the data type corresponding to each key and generating a data type table; parsing a data record and extracting the one or more data items it includes; and, for each data item: extracting the key-value pair that constitutes the data item; determining, from the data type table and according to the data source of the record, the data type corresponding to the extracted key; and verifying the value of the extracted key-value pair using the data verification method corresponding to that data type, where, if verification passes, the extraction succeeds and the value of the extracted key-value pair is recorded.
Optionally, in the data extraction method according to the present invention, the step of generating the data type table includes: for each of the one or more data sources, sampling the data to obtain a first number of data records; for each of the first number of data records, parsing the record and extracting all of its data items one by one; for the value corresponding to the key of the key-value pair in each data item, analyzing its data type by means of a regular expression and/or a data verification method, and taking it as a data type corresponding to that key; counting, in each data source, the number of data types corresponding to each key and the number of values corresponding to each data type; and choosing, from the data types corresponding to each key, the data type whose value-count ratio exceeds a first threshold, determining it to be the data type corresponding to that key in that data source, and storing the key and the determined data type in association with the data source as the data type table.
Optionally, in the data extraction method according to the present invention, the step of sampling the data for each of the one or more data sources includes: extracting the first number of data records from the beginning of each data source; and/or randomly sampling the first number of data records from each data source; and/or extracting the first number of data records from each data source by time period.
Optionally, in the data extraction method according to the present invention, the value-count ratio of a data type is the number of values of a given key that have that data type, divided by the total number of values of that key across all of its data types in the data source.
Optionally, in the data extraction method according to the present invention, the step of verifying the value of the extracted key-value pair using the data verification method of the data type further includes: verifying the value of the extracted key-value pair using the regular expression of the data type.
Optionally, the data extraction method according to the present invention further includes a step of correcting data types: when a preset condition is met, counting, every first predetermined time and for each data source, the number of successful extractions and the number of failed extractions of each key, and calculating the extraction success rate of each key in each data source over that time period; and, if the extraction success rate is below a second threshold, generating an alarm signal to trigger data type correction, in which the data type corresponding to that key in that data source is re-sampled and re-counted.
Optionally, in the data extraction method according to the present invention, the step of correcting data types further includes: every second predetermined time, repeating the step of generating the data type table on the latest data to generate a new data type table; and, according to the new data type table, again choosing from the data types corresponding to each key the data type whose value-count ratio exceeds the first threshold as the data type corresponding to that key in that data source, for use in subsequent data extraction steps.
Optionally, in the data extraction method according to the present invention, the data types include: identity, social account, geographic location information, and mobile device identifier.
Optionally, in the data extraction method according to the present invention, the first predetermined time is one day, and the second predetermined time is seven days or one day.
According to another aspect of the present invention, a device for extracting data from one or more data sources is provided. Each of the one or more data sources includes multiple data records, and each record includes one or more data items in key-value pair form. The data extraction device includes: a data type analysis module, adapted to determine, for each of the one or more data sources, the data type corresponding to each key and to generate a data type table; a data extraction module, adapted to parse a data record and extract the one or more data items it includes, and further adapted to extract, for each data item, the key-value pair that constitutes the data item, the data type analysis module being further adapted to determine, from the data type table and according to the data source of the record, the data type corresponding to the extracted key; and a data check module, adapted to verify the value of the extracted key-value pair using the data verification method corresponding to that data type, where, if verification passes, the extraction succeeds and the value of the extracted key-value pair is recorded.
Optionally, in the data extraction device according to the present invention, the data type analysis module includes: a data sampling unit, adapted to sample the data of each of the one or more data sources to obtain a first number of data records; a data extracting unit, adapted to parse each of the first number of data records and extract all of its data items one by one; a data type analysis unit, adapted to analyze, by means of a regular expression and/or a data verification method, the data type of the value corresponding to the key of the key-value pair in each data item, and to take it as a data type corresponding to that key; and a statistic unit, adapted to count, in each data source, the number of data types corresponding to each key and the number of values corresponding to each data type. The data type analysis unit is further adapted to choose, from the data types corresponding to each key, the data type whose value-count ratio exceeds a first threshold, to determine it to be the data type corresponding to that key in that data source, and to store the key and the determined data type in association with the data source as the data type table.
Optionally, in the data extraction device according to the present invention, the data sampling unit is further adapted to extract the first number of data records from the beginning of each data source; and/or to randomly sample the first number of data records from each data source; and/or to extract the first number of data records from each data source by time period.
Optionally, in the data extraction device according to the present invention, the value-count ratio of a data type is the number of values of a given key that have that data type, divided by the total number of values of that key across all of its data types in the data source.
Optionally, in the data extraction device according to the present invention, the data check module is further adapted to verify the value of the extracted key-value pair using the regular expression of the data type.
Optionally, the data extraction device according to the present invention further includes a data type rectification module. The data type rectification module is adapted, when a preset condition is met, to count, every first predetermined time and for each data source, the number of successful extractions and the number of failed extractions of each key, and to calculate the extraction success rate of each key in each data source over that time period. The data type rectification module is further adapted, when the extraction success rate is below a second threshold, to generate an alarm signal to trigger data type correction, in which the data type corresponding to that key in that data source is re-sampled and re-counted.
Optionally, in the data extraction device according to the present invention, the data type rectification module is further adapted to trigger the data type analysis module every second predetermined time, so that the data type analysis module generates a new data type table from the latest data and, according to the new data type table, again chooses from the data types corresponding to each key the data type whose value-count ratio exceeds the first threshold as the data type corresponding to that key in that data source, for use in subsequent data extraction steps.
Optionally, in the data extraction device according to the present invention, the data types include: identity, social account, geographic location information, and mobile device identifier.
Optionally, in the data extraction device according to the present invention, the first predetermined time is one day, and the second predetermined time is seven days or one day.
According to yet another aspect of the present invention, a system for extracting data from one or more data sources is also provided, including: a data acquisition device, adapted to collect data from one or more data sources; the data extraction device described above; and a data analysis device, adapted to analyze the extracted data.
In the data extraction scheme of the present invention, the data type of the value (Value) of each key (Key) in each data source is obtained by sampling and statistical analysis, and a data type table is generated. At extraction time the data type of the key is therefore already known, so only the data verification method of that data type needs to be applied, which improves data extraction efficiency. Deciding by verification also ensures the accuracy of data extraction.
Brief description of the drawings
To achieve the above and related purposes, certain illustrative aspects are described herein in conjunction with the following description and the accompanying drawings. These aspects indicate various ways in which the principles disclosed herein can be practiced, and all aspects and their equivalents are intended to fall within the scope of the claimed subject matter. The above and other objects, features, and advantages of the present disclosure will become more apparent from the following detailed description read in conjunction with the accompanying drawings. Throughout the disclosure, the same reference numerals generally refer to the same parts or elements.
Fig. 1 shows a schematic diagram of a data extraction system 100 according to an embodiment of the present invention;
Fig. 2 shows a flowchart of a data extraction method 200 according to an embodiment of the present invention;
Fig. 3 shows a schematic diagram of a data extraction device 120 according to an embodiment of the present invention; and
Fig. 4 shows a schematic diagram of a data extraction device 120 according to a further embodiment of the present invention.
Detailed description
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the present disclosure, it should be understood that the present disclosure can be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the present disclosure can be understood more thoroughly and its scope can be conveyed completely to those skilled in the art.
Fig. 1 shows a schematic diagram of a data extraction system 100 according to an embodiment of the present invention. As shown in Fig. 1, the system 100 includes a data acquisition device 110, a data extraction device 120, and a data analysis device 130. The data acquisition device 110 is adapted to collect data from one or more data sources, where each of the one or more data sources includes multiple data records and each record includes one or more data items in key-value pair (Key-Value) form. The data extraction device 120 is adapted to accurately extract the values (Value) of the data items from the data collected by the data acquisition device 110, for example an Email address, a GPS location, or a social tool account. The data analysis device 130 is adapted to analyze the data extracted by the data extraction device 120, for example to build a user profile from a series of data characterizing the user and to predict user behavior.
From the above description of the system 100, how to extract data accurately and efficiently, that is, the operation performed by the data extraction device 120, is the key to realizing this scheme.
The flow by which the data extraction device 120 extracts data is described in detail below.
Fig. 2 shows a flowchart of a data extraction method 200 performed in the data extraction device 120 according to an embodiment of the present invention. The method 200 starts at step S210. In step S210, for each of the one or more data sources collected by the data acquisition device 110, the data type corresponding to each key is determined and a data type table is generated.
According to some implementations, the meaning of the value (Value) corresponding to the same key (Key) is very likely to differ between data sources. For example, the HTTP access logs of different websites belong to different data sources, because the same Key may have a different meaning in the logs of different websites. Take the following two URLs as an example:
URL1: http://www.xxx.com/index.html?id=aaa@bbb.com&name=test
URL2: http://www.yyy.com/index.html?id=123456&phone=13405671234
In both URLs the key-value pairs take the form Key=Value. In URL1 the id field carries the user's identity information, whereas in URL2 the id field carries the user's numeric ID. Although the Key is the same, the meanings are entirely different.
Therefore, the extracted key-value pairs must first be classified to determine the data type corresponding to each key in each data source.
First, for each of the one or more data sources, the data is sampled to obtain a first number of data records. Optionally, the step of sampling the data includes: extracting the first number of data records from the beginning of each data source; and/or randomly sampling the first number of data records from each data source; and/or extracting the first number of data records from each data source by time period.
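The three sampling options can be sketched as follows. This is only an illustration under assumptions that are not in the original text: records are held in an in-memory list, each record is a dict with a hypothetical `time` field, and the function names are invented for the example.

```python
import random
from datetime import datetime


def sample_first(records, n):
    """Take the first n records of a data source."""
    return records[:n]


def sample_random(records, n, seed=None):
    """Take n records uniformly at random (all of them if fewer exist)."""
    rng = random.Random(seed)
    return rng.sample(records, min(n, len(records)))


def sample_by_period(records, n, start, end):
    """Take up to n records whose timestamp falls in [start, end)."""
    in_window = [r for r in records if start <= r["time"] < end]
    return in_window[:n]


# Hypothetical data source: one record per hour of a single day.
records = [
    {"time": datetime(2016, 11, 1, hour), "line": f"record-{hour}"}
    for hour in range(24)
]
```

The options can be combined, for example by sampling randomly inside a time window, since the text joins them with "and/or".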
Second, each of the sampled first number of data records is parsed and all of its data items are extracted one by one. It should be noted that the present invention places no restriction on the method of extracting data items that contain key-value pairs. For example, if a data item has the form "key separator value", the item is split at the separator: the first part is the key and the second part is the value, where the separator may be ":", "=", and so on.
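A minimal sketch of this parsing step, assuming two record shapes: a URL whose query string carries the items, and a plain-text record whose items are separated by commas. The function names and the comma item separator are assumptions made for illustration; `urlsplit` and `parse_qsl` are from the Python standard library.

```python
from urllib.parse import parse_qsl, urlsplit


def items_from_url(url):
    """Return the (key, value) pairs carried in a URL's query string."""
    return parse_qsl(urlsplit(url).query, keep_blank_values=True)


def items_from_text(record, item_sep=",", kv_seps=(":", "=")):
    """Split each 'key<separator>value' item at its first separator.

    The separators are assumed to be single characters.
    """
    items = []
    for chunk in record.split(item_sep):
        positions = [chunk.find(sep) for sep in kv_seps if sep in chunk]
        if not positions:
            continue  # no separator, so this chunk is not a key-value item
        cut = min(positions)
        items.append((chunk[:cut].strip(), chunk[cut + 1:].strip()))
    return items
```

For instance, `items_from_url("http://www.xxx.com/index.html?id=aaa@bbb.com&name=test")` yields the pairs `("id", "aaa@bbb.com")` and `("name", "test")`.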
Then, for the value corresponding to the key of the key-value pair in each data item, its data type is analyzed by means of a regular expression and/or a data verification method and taken as a data type corresponding to that key. Optionally, the data types include: identity (e.g., an ID card number), social account (e.g., a WeChat ID, a QQ number, an Email address), geographic location information (e.g., a GPS location, city, country), and mobile device identifier (e.g., an IMEI).
According to one embodiment of the present invention, data of some data types contains a check code. For example, the last digit of an ID card number is a check digit, so the value can be verified by computing whether the check digit is correct. As another example, the regular expression for an Email address is ^[a-zA-Z0-9_-]+@[a-zA-Z0-9_-]+(\.[a-zA-Z0-9_-]+)+$, and this regular expression can be used to analyze whether the value of a key-value pair conforms to the Email data type.
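The two kinds of check mentioned here can be sketched as follows. The Email pattern is the one quoted in the text. The check-digit routine assumes that the ID card number is an 18-character resident identity number whose last character is computed with the MOD 11-2 scheme; the text only says that the last digit is a check digit, so that scheme and the function names are assumptions of this example.

```python
import re

EMAIL_RE = re.compile(r"^[a-zA-Z0-9_-]+@[a-zA-Z0-9_-]+(\.[a-zA-Z0-9_-]+)+$")

# Assumed MOD 11-2 weights and check characters for an 18-character ID number.
_WEIGHTS = (7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2)
_CHECK_CODES = "10X98765432"


def is_email(value):
    """Regular-expression check for the Email data type."""
    return EMAIL_RE.match(value) is not None


def is_id_number(value):
    """Check-digit verification: recompute the last character and compare."""
    if not re.fullmatch(r"[0-9]{17}[0-9Xx]", value):
        return False
    total = sum(int(digit) * weight for digit, weight in zip(value[:17], _WEIGHTS))
    return _CHECK_CODES[total % 11] == value[17].upper()
```

A value that merely looks like an ID number (18 digits) still fails unless its check character is correct, which is what makes a verification method stricter than a format-only match.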
Finally, for each data source, the number of data types corresponding to each key and the number of values corresponding to each data type are counted. From the data types corresponding to each key, the data type whose value-count ratio exceeds the first threshold is chosen and determined to be the data type corresponding to that key in that data source, and the key and the determined data type are stored in association with the data source as the data type table.
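The counting and selection in this step can be sketched as below. The sketch assumes the sampled items are available as `(source, key, value)` tuples and that a caller-supplied `classify` function maps a value to a data type name or to `"unknown"`; both the tuple layout and `classify` are hypothetical, and the default threshold of 0.8 is only an illustrative value.

```python
from collections import Counter, defaultdict


def build_type_table(samples, classify, first_threshold=0.8):
    """Build {(source, key): data_type} from sampled (source, key, value) items."""
    counts = defaultdict(Counter)
    for source, key, value in samples:
        # Count how many values of this key fall under each data type.
        counts[(source, key)][classify(value)] += 1

    table = {}
    for (source, key), by_type in counts.items():
        data_type, n = by_type.most_common(1)[0]
        ratio = n / sum(by_type.values())  # value-count ratio of the top type
        if data_type != "unknown" and ratio > first_threshold:
            table[(source, key)] = data_type
    return table
```

With a threshold above 0.5 at most one data type per key can qualify, so looking only at the most common type is sufficient. Keys whose values do not concentrate on one type are simply left out of the table.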
According to an embodiment of the present invention, the data types counted for each key in each data source can be represented in a table (not reproduced in this text). In that table, "count M" means that, in data source X, key Key1 corresponds to data type A, data type B, ..., and an unknown data type, and that M values in total were classified as data type A.
The value-count ratio of each data type corresponding to each key is then calculated. The value-count ratio of a data type is the number of values of a given key that have that data type, divided by the total number of values of that key across all of its data types in the data source. When the value-count ratio of a data type exceeds the first threshold (e.g., 0.8), that data type is determined to be the data type corresponding to the key in the data source. Values of the key that have other data types are probably erroneous and can be excluded. The resulting associations between keys and data types form the data type table (not reproduced in this text).
Then, in step S220, a data record is parsed and the one or more data items it includes are extracted.
Then, in step S230, for each data item, the key-value pair that constitutes the data item is extracted, and the data type corresponding to the extracted key is determined from the data type table according to the data source of the record. Assuming that the data source of the record is X, the data type table shows that the data type corresponding to key Key1 is A.
Then, in step S240, the value of the extracted key-value pair is verified using the data verification method corresponding to that data type. If verification passes, the extraction succeeds and the value of the extracted key-value pair is recorded. Ordinarily, analyzing the data type of a value with data verification methods requires traversing the list of data types and checking in turn whether the value satisfies the format and verification requirements of each data type; whichever data type's requirements it satisfies is the data type the value belongs to. In this method, however, the data type corresponding to the extracted key has already been determined from the data type table, so it is only necessary to verify whether the value corresponding to the key conforms to that one data type, which greatly improves efficiency.
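Steps S230 and S240 amount to one table lookup followed by one validator call per item, which can be sketched as follows. The dictionary layouts (`type_table` keyed by `(source, key)`, `validators` keyed by data type name) and the function name are assumptions of this example, not structures defined in the text.

```python
def extract(source, items, type_table, validators):
    """Return {key: value} for the items whose value passes its key's validator."""
    extracted = {}
    for key, value in items:
        data_type = type_table.get((source, key))  # step S230: table lookup
        if data_type is None:
            continue  # the key has no entry in the data type table
        if validators[data_type](value):  # step S240: a single check
            extracted[key] = value
    return extracted
```

Because the data type is looked up rather than discovered, the cost per item is one validator call instead of a pass over every known data type.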
According to another embodiment of the present invention, the value of the extracted key-value pair can also be verified using the regular expression of the data type. When a regular expression is used to analyze the data type of a value, the value is matched against the regular expression of that data type. For example, to test for the IP address data type, the value is matched against the regular expression for IP addresses.
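As an illustration of regular-expression verification, the sketch below checks dotted-decimal IPv4 addresses. The text does not give its IP address pattern, so the pattern here is an assumption, and it deliberately covers IPv4 only.

```python
import re

# One octet: 0-255 without leading zeros (assumed pattern, IPv4 only).
_OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])"
IPV4_RE = re.compile(rf"{_OCTET}(?:\.{_OCTET}){{3}}")


def is_ipv4(value):
    """Return True when the whole value is a dotted-decimal IPv4 address."""
    return IPV4_RE.fullmatch(value) is not None
```

Using `fullmatch` requires the entire value to satisfy the pattern, which avoids the partial-match errors that the background section attributes to plain full-text matching.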
According to an embodiment of the present invention, the data type of a value can be verified with a data verification method, with a regular expression, or with a combination of the two; the present invention is not limited in this respect.
If, after verification, the data type of the value is consistent with the data type determined in step S230, verification passes, which means the extraction succeeds, and the value of the extracted key-value pair is recorded. Typically, the value is stored in JSON format, for example:
{"ip":"1.1.1.1","email":"xxx@yyy.com"}
According to one implementation, because the meaning of a key may change when a data source is upgraded or otherwise modified, the data types then need to be corrected. Typically, the step of correcting data types includes:
When a preset condition is met (the data volume is sufficiently large, for example a key occurs thousands or tens of thousands of times in total), the number of successful extractions and the number of failed extractions of each key in each data source are counted every first predetermined time (e.g., 1 day), and the extraction success rate of each key in each data source over that time period is calculated.
If the extraction success rate is below the second threshold (e.g., the second threshold takes a value in the range 0.75 to 0.85), an alarm signal is generated to trigger data type correction (automatically, or manually by an administrator), in which the data type corresponding to that key in that data source is re-sampled and re-counted.
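The success-rate check can be sketched as follows. The sketch assumes that each extraction attempt in one statistics window is logged as a `(source, key, succeeded)` tuple; the function name, the tuple layout, the default second threshold of 0.8 (a value inside the 0.75 to 0.85 range given above), and the minimum count of 1000 standing in for the preset condition are all illustrative assumptions.

```python
from collections import Counter


def keys_needing_correction(events, second_threshold=0.8, min_total=1000):
    """Return the (source, key) pairs whose success rate fell below the threshold."""
    ok, total = Counter(), Counter()
    for source, key, succeeded in events:
        total[(source, key)] += 1
        ok[(source, key)] += bool(succeeded)
    return sorted(
        pair
        for pair, n in total.items()
        # Preset condition: only judge keys that occurred often enough.
        if n >= min_total and ok[pair] / n < second_threshold
    )
```

Each returned pair would then raise the alarm signal and have its data type re-sampled and re-counted for that data source.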
According to one implementation, the above step of generating the data type table (that is, step S210) can also be repeated on the latest data every second predetermined time (e.g., 1 day or 7 days) to generate a new data type table.
According to the new data type table, the data type whose value-count ratio exceeds the first threshold is again chosen from the data types corresponding to each key as the data type corresponding to that key in that data source, and is used in the subsequent data extraction steps.
In summary, the method 200 obtains the data type of the value (Value) of each key (Key) in each data source by sampling and statistical analysis, and generates a data type table. At extraction time the data type of the key is therefore already known, so only the data verification method of that data type needs to be applied, which improves data extraction efficiency. Deciding by verification also ensures the accuracy of data extraction.
Furthermore, because an upgrade or similar change to a data source may alter the meaning of a key, a step of correcting data types is added, which further improves the accuracy of the data.
Correspondingly, Fig. 3 shows a schematic diagram of the data extraction device 120 according to an embodiment of the present invention. As shown in Fig. 3, the device 120 includes: a data type analysis module 122, a data extraction module 124, and a data check module 126.
The data type analysis module 122 determines, for each of the one or more data sources, the data type corresponding to each key and generates a data type table.
Further, the data type analysis module 122 includes: a data sampling unit 1222, a data extracting unit 1224, a data type analysis unit 1226, and a statistic unit 1228, as shown in Fig. 3.
The data sampling unit 1222 is adapted to sample the data of each of the one or more data sources to obtain a first number of data records. Optionally, the data sampling unit 1222 is adapted to extract the first number of data records from the beginning of each data source; and/or to randomly sample the first number of data records from each data source; and/or to extract the first number of data records from each data source by time period.
The data extracting unit 1224 is adapted to parse each of the first number of data records and extract all of its data items one by one. The present invention places no restriction on the manner of extracting data items in key-value pair form.
The data type analysis unit 1226 is adapted to analyze, by means of a regular expression and/or a data verification method, the data type of the value corresponding to the key of the key-value pair in each data item, and to take it as a data type corresponding to that key. Optionally, the data types include: identity (e.g., an ID card number), social account (e.g., a WeChat ID, a QQ number, an Email address), geographic location information (e.g., a GPS location, city, country), and mobile device identifier (e.g., an IMEI).
According to one embodiment of the present invention, data of some data types contains a check code. For example, the last digit of an ID card number is a check digit, so the value can be verified by computing whether the check digit is correct. As another example, the regular expression for an Email address is ^[a-zA-Z0-9_-]+@[a-zA-Z0-9_-]+(\.[a-zA-Z0-9_-]+)+$, and this regular expression can be used to analyze whether the value of a key-value pair conforms to the Email data type. Of course, the two approaches can also be combined to analyze the data type of a value; the present invention is not limited in this respect.
The statistic unit 1228 is adapted to count, in each data source, the number of data types corresponding to each key and the number of values corresponding to each data type. The statistics are presented as a table (not reproduced in this text).
The data type analysis unit 1226 is further adapted to choose, from the data types corresponding to each key, the data type whose value-count ratio exceeds the first threshold and to determine it to be the data type corresponding to that key in that data source. Here, the value-count ratio of a data type is the number of values of a given key that have that data type, divided by the total number of values of that key across all of its data types in the data source.
The data type analysis unit 1226 is further adapted to store the key and the determined data type in association with the data source as the data type table (not reproduced in this text).
The data extraction module 124 is adapted to parse a data record and extract the one or more data items it includes, and is further adapted to extract, for each data item, the key-value pair that constitutes the data item.
The data type analysis module 122 is further adapted to determine, from the above data type table and according to the data source of the record, the data type corresponding to the extracted key. For example, the data type corresponding to key Key2 in data source X is E, and the data type corresponding to key Key5 in data source Y is G.
The data check module 126 is adapted to verify the value of the extracted key-value pair using the data verification method corresponding to that data type. If verification passes, the extraction succeeds and the value of the extracted key-value pair is recorded.
According to embodiments of the present invention, because the data type corresponding to the extracted key is determined from the data type table, it is only necessary to verify whether the value corresponding to the key satisfies the verification method of that data type.
According to another embodiment of the present invention, the value of the extracted key-value pair can also be verified using the regular expression of the data type. When a regular expression is used to analyze the data type of a value, the value is matched against the regular expression of that data type. For example, to test for the IP address data type, the value is matched against the regular expression for IP addresses.
According to an embodiment of the present invention, the data type of a value can also be verified by combining a data verification method with a regular expression; the present invention is not limited in this respect.
If, after verification, the data type of the value is consistent with the data type determined by the data type analysis module 122, verification passes, which means the extraction succeeds, and the value of the extracted key-value pair is recorded. Typically, the value is stored in JSON format.
Because an upgrade or similar change to a data source may alter the meaning of a key, the device 120 includes, in addition to the data type analysis module 122, the data extraction module 124, and the data check module 126, a data type rectification module 128, as shown in Fig. 4.
The data type rectification module 128 is adapted to, when a preset condition is met, count at every first predetermined time (e.g., 1 day) the number of successful extractions and the number of failed extractions for each key in each data source, and to calculate the extraction success rate of each key in each data source over that time period. Optionally, the preset condition is that the data volume is sufficiently large, for example that a given key occurs thousands or tens of thousands of times in total.
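The per-key statistics can be sketched as two counters over the extraction outcomes of one statistics window. The event tuple layout, the function name `success_rates` and the default `min_total` of 1000 (standing in for "sufficiently large") are assumptions.

```python
from collections import Counter


def success_rates(events, min_total=1000):
    """Compute the extraction success rate per (source, key).

    events: iterable of (source, key, succeeded) tuples from one window.
    Only keys seen at least `min_total` times are reported, as a stand-in
    for the preset condition that the data volume is sufficiently large.
    """
    ok, total = Counter(), Counter()
    for source, key, succeeded in events:
        total[(source, key)] += 1
        if succeeded:
            ok[(source, key)] += 1
    return {k: ok[k] / n for k, n in total.items() if n >= min_total}
```

Keys below the minimum count are left out of the result, so a handful of failures on a rarely seen key does not produce a misleading rate.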
The data type rectification module 128 is further adapted to generate an alarm signal when the extraction success rate is less than a second threshold (e.g., a value in the range 0.75-0.85), so as to trigger data type rectification (automatically, or manually by an administrator), in which the data type corresponding to that key in the data source is re-determined by sampling and statistics.
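A minimal sketch of the threshold check follows. It assumes success rates are already available as a dictionary keyed by (source, key), and it uses a log warning as a stand-in for the alarm signal. The value 0.8 is an assumed choice within the 0.75-0.85 range given in the text, and `raise_alarms` is a hypothetical name.

```python
import logging

logger = logging.getLogger("data_type_rectification")

SECOND_THRESHOLD = 0.8  # assumed value inside the 0.75-0.85 range from the text


def raise_alarms(rates, threshold=SECOND_THRESHOLD):
    """Log a warning for each key whose success rate is below the threshold.

    Returns the sorted list of flagged (source, key) pairs so that a caller
    can trigger re-sampling for exactly those keys.
    """
    flagged = sorted(k for k, rate in rates.items() if rate < threshold)
    for source, key in flagged:
        logger.warning(
            "extraction success rate for key %s in source %s is below %.2f",
            key, source, threshold,
        )
    return flagged
```

The comparison is strict, matching the "less than" wording: a rate exactly equal to the threshold does not raise an alarm.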
According to a further embodiment of the invention, the data type rectification module 128 is further adapted to trigger the data type analysis module 122 at every second predetermined time (e.g., 1 day or 7 days), so that the data type analysis module 122 generates a new data type table from the latest data and, according to the new data type table, re-selects from the data types corresponding to each key the data type whose proportion of values exceeds the first threshold as the data type corresponding to that key in the data source, for use in the subsequent data extraction steps.
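Regenerating the table amounts to recounting, per key, how many sampled values fall under each data type and keeping the type whose proportion exceeds the first threshold. The sketch below assumes each sampled value has already been classified. The threshold value 0.9 and the name `build_type_table` are assumptions, since this passage does not state a value for the first threshold.

```python
from collections import Counter, defaultdict

FIRST_THRESHOLD = 0.9  # assumed; this passage does not state a value


def build_type_table(observations, threshold=FIRST_THRESHOLD):
    """Build {(source, key): data_type} from classified sample values.

    observations: iterable of (source, key, data_type), one per sampled value.
    A key is entered only if its most frequent data type accounts for a
    proportion of values strictly greater than `threshold`.
    """
    counts = defaultdict(Counter)
    for source, key, data_type in observations:
        counts[(source, key)][data_type] += 1

    table = {}
    for source_key, by_type in counts.items():
        data_type, n = by_type.most_common(1)[0]
        if n / sum(by_type.values()) > threshold:
            table[source_key] = data_type
    return table
```

Calling this on the latest sample at every second predetermined time and replacing the old table with the result is one simple way to realize the periodic refresh. Keys with no dominant type are left out of the new table.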
Based on the above, the apparatus 120 derives the data type of the value of each key in each data source through sampling and statistical analysis, and generates a data type table. When data are extracted, the data type of each key is already known, so only the data verification method of that data type needs to be applied. This improves extraction efficiency, and the verification also ensures the accuracy of the extracted data.
Furthermore, because the meaning of a key may change when, for example, a data source is upgraded, a function for rectifying data types is added, which further improves the accuracy of the data.
It should be understood that, in order to streamline the disclosure and aid understanding of one or more of the various inventive aspects, in the above description of exemplary embodiments of the invention, features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof. However, this method of disclosure should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in fewer than all features of a single embodiment disclosed above. Thus, the claims following the detailed description are hereby expressly incorporated into the detailed description, with each claim standing on its own as a separate embodiment of the invention.
Those skilled in the art should understand that the modules, units or components of the device in the examples disclosed herein may be arranged in the device as described in the embodiment, or alternatively may be located in one or more devices different from the device in the example. The modules in the foregoing examples may be combined into one module or may further be divided into multiple submodules.
Those skilled in the art will appreciate that the modules in the device of an embodiment may be adaptively changed and arranged in one or more devices different from that embodiment. The modules, units or components of an embodiment may be combined into one module, unit or component, and may further be divided into multiple submodules, subunits or subcomponents. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract and drawings), and all processes or units of any method or device so disclosed, may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract and drawings) may be replaced by an alternative feature serving the same, an equivalent or a similar purpose.
A5. The method of any one of A1-4, wherein the step of verifying the value in the extracted key-value pair using the data verification method of the data type further comprises: verifying the value in the extracted key-value pair using the regular expression of the data type.
A6. The method of any one of A1-5, further comprising a step of rectifying data types: when a preset condition is met, counting at every first predetermined time the number of successful extractions and the number of failed extractions for each key in each data source, and calculating the extraction success rate of each key in each data source over that time period; and, if the extraction success rate is less than a second threshold, generating an alarm signal to trigger data type rectification, in which the data type corresponding to that key in the data source is re-determined by sampling and statistics.
A7. The method of A6, wherein the step of rectifying data types further comprises: at every second predetermined time, repeating the step of generating the data type table on the latest data to generate a new data type table; and, according to the new data type table, re-selecting from the data types corresponding to each key the data type whose proportion of values exceeds the first threshold as the data type corresponding to that key in the data source, for use in the subsequent data extraction steps.
A8. The method of any one of A1-7, wherein the data types include: identity, social account, geographic location information, and mobile device identifier.
A9. The method of any one of A1-8, wherein the first predetermined time is one day, and the second predetermined time is seven days or one day.
B15. The device of any one of B10-14, further comprising a data type rectification module adapted to, when a preset condition is met, count at every first predetermined time the number of successful extractions and the number of failed extractions for each key in each data source, and calculate the extraction success rate of each key in each data source over that time period; the data type rectification module being further adapted to generate an alarm signal when the extraction success rate is less than a second threshold, so as to trigger data type rectification, in which the data type corresponding to that key in the data source is re-determined by sampling and statistics.
B16. The device of B15, wherein the data type rectification module is further adapted to trigger the data type analysis module at every second predetermined time, so that the data type analysis module generates a new data type table from the latest data and, according to the new data type table, re-selects from the data types corresponding to each key the data type whose proportion of values exceeds the first threshold as the data type corresponding to that key in the data source, for use in the subsequent data extraction steps.
B17. The device of any one of B10-16, wherein the data types include: identity, social account, geographic location information, and mobile device identifier.
B18. The device of any one of B10-17, wherein the first predetermined time is one day, and the second predetermined time is seven days or one day.
Furthermore, those skilled in the art will understand that although some embodiments described herein include certain features that are included in other embodiments and not others, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
Furthermore, some of the embodiments are described herein as a method, or a combination of method elements, that can be implemented by a processor of a computer system or by other means of carrying out the function. Thus, a processor with the necessary instructions for carrying out such a method or method element forms a means for carrying out the method or method element. Furthermore, an element of a device embodiment described herein is an example of a means for carrying out the function performed by that element for the purpose of carrying out the invention.
As used herein, unless otherwise specified, the use of the ordinals "first", "second", "third", etc. to describe an ordinary object merely indicates that different instances of like objects are being referred to, and is not intended to imply that the objects so described must be in a given sequence, whether temporally, spatially, in ranking, or in any other manner.
Although the invention has been described with respect to a limited number of embodiments, those skilled in the art, having the benefit of the above description, will appreciate that other embodiments can be devised within the scope of the invention thus described. It should also be noted that the language used in this specification has been selected principally for readability and instructional purposes, and not to delineate or circumscribe the inventive subject matter. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. With respect to the scope of the invention, the disclosure made herein is illustrative and not restrictive, and the scope of the invention is defined by the appended claims.
Claims (10)
1. A method for extracting data from one or more data sources, each of the one or more data sources comprising a plurality of data records, each data record comprising one or more data items in key-value pair form, the data extraction method comprising the steps of:
for each of the one or more data sources, determining the data type corresponding to each key and generating a data type table;
parsing a data record and extracting the one or more data items it contains, and for each data item:
extracting the key-value pair constituting the data item, and determining the data type corresponding to the extracted key from the data type table according to the data source of the data record; and
verifying the value in the extracted key-value pair using the data verification method corresponding to that data type, wherein, if verification passes, the extraction succeeds and the value in the extracted key-value pair is recorded.
2. The method of claim 1, wherein the step of generating the data type table comprises:
for each of the one or more data sources, sampling the data to obtain a first number of data records;
for each of the first number of data records, parsing the data record and extracting all of its data items;
for the value corresponding to the key in the key-value pair of each data item, analyzing its data type by a regular expression and/or a data verification method, as a data type corresponding to that key;
counting, for each data source, the number of data types corresponding to each key and the number of values corresponding to each data type; and
selecting, from the data types corresponding to each key, the data type whose proportion of values exceeds a first threshold, determining it to be the data type corresponding to that key in the data source, and storing the key of that data source in association with the determined data type, as the data type table.
3. The method of claim 2, wherein the step of sampling the data for each of the one or more data sources comprises:
extracting the first number of data records from the beginning of each data source; and/or
randomly sampling the first number of data records from each data source; and/or
extracting the first number of data records from each data source by time period.
4. The method of any one of claims 1-3, wherein the proportion of values corresponding to a data type is the ratio of the number of values of a given key that correspond to that data type to the total number of values of that key across all of its data types in the data source.
5. An extraction device for extracting data from one or more data sources, each of the one or more data sources comprising a plurality of data records, each data record comprising one or more data items in key-value pair form, the data extraction device comprising:
a data type analysis module adapted to, for each of the one or more data sources, determine the data type corresponding to each key and generate a data type table;
a data extraction module adapted to parse a data record and extract the one or more data items it contains, and further adapted to, for each data item, extract the key-value pair constituting the data item;
the data type analysis module being further adapted to determine the data type corresponding to the extracted key from the data type table according to the data source of the data record; and
a data check module adapted to verify the value in the extracted key-value pair using the data verification method corresponding to that data type, wherein, if verification passes, the extraction succeeds and the value in the extracted key-value pair is recorded.
6. The device of claim 5, wherein the data type analysis module comprises:
a data sampling unit adapted to, for each of the one or more data sources, sample the data to obtain a first number of data records;
a data extracting unit adapted to, for each of the first number of data records, parse the data record and extract all of its data items;
a data type analysis unit adapted to, for the value corresponding to the key in the key-value pair of each data item, analyze its data type by a regular expression and/or a data verification method, as a data type corresponding to that key;
a statistic unit adapted to count, for each data source, the number of data types corresponding to each key and the number of values corresponding to each data type;
the data type analysis unit being further adapted to select, from the data types corresponding to each key, the data type whose proportion of values exceeds a first threshold, determine it to be the data type corresponding to that key in the data source, and store the key of that data source in association with the determined data type, as the data type table.
7. The device of claim 6, wherein the data sampling unit is further adapted to extract the first number of data records from the beginning of each data source; and/or to randomly sample the first number of data records from each data source; and/or to extract the first number of data records from each data source by time period.
8. The device of any one of claims 5-7, wherein the proportion of values corresponding to a data type is the ratio of the number of values of a given key that correspond to that data type to the total number of values of that key across all of its data types in the data source.
9. The device of any one of claims 5-8, wherein the data check module is further adapted to verify the value in the extracted key-value pair using the regular expression of the data type.
10. An extraction system for extracting data from one or more data sources, comprising:
a data acquisition device adapted to collect data from one or more data sources;
the data extraction device of any one of claims 5-9; and
a data analysis device adapted to analyze the extracted data.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611080168.3A CN106776901B (en) | 2016-11-30 | 2016-11-30 | Data extraction method, device and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611080168.3A CN106776901B (en) | 2016-11-30 | 2016-11-30 | Data extraction method, device and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106776901A true CN106776901A (en) | 2017-05-31 |
CN106776901B CN106776901B (en) | 2019-12-06 |
Family
ID=58901448
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611080168.3A Active CN106776901B (en) | 2016-11-30 | 2016-11-30 | Data extraction method, device and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106776901B (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107894973A (en) * | 2017-10-30 | 2018-04-10 | 武汉华工赛百数据系统有限公司 | A kind of method for interchanging data and system based on XML |
CN109684374A (en) * | 2018-11-28 | 2019-04-26 | 海南电网有限责任公司信息通信分公司 | A kind of extracting method and device of the key-value pair of time series data |
CN109710651A (en) * | 2018-12-25 | 2019-05-03 | 成都四方伟业软件股份有限公司 | Data type recognition methods and device |
CN110390208A (en) * | 2019-06-26 | 2019-10-29 | 联动优势科技有限公司 | A kind of the preferred data source access method and device of composite data item label |
CN110866557A (en) * | 2019-11-12 | 2020-03-06 | 贵州医渡云技术有限公司 | Data evaluation method and device, storage medium and electronic device |
CN111488260A (en) * | 2019-01-29 | 2020-08-04 | 华为技术有限公司 | Data template acquisition method and device, computer equipment and readable storage medium |
CN111753332A (en) * | 2020-06-29 | 2020-10-09 | 上海通联金融服务有限公司 | Method for completing log desensitization in log writing stage based on sensitive information rule |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103870381A (en) * | 2012-12-10 | 2014-06-18 | 百度在线网络技术(北京)有限公司 | Test data generating method and device |
CN104809178A (en) * | 2015-04-15 | 2015-07-29 | 北京科电高技术公司 | Write-in method of key/value database memory log |
CN104933096A (en) * | 2015-05-22 | 2015-09-23 | 北京奇虎科技有限公司 | Abnormal key recognition method of database, abnormal key recognition device of database and data system |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107894973A (en) * | 2017-10-30 | 2018-04-10 | 武汉华工赛百数据系统有限公司 | A kind of method for interchanging data and system based on XML |
CN109684374A (en) * | 2018-11-28 | 2019-04-26 | 海南电网有限责任公司信息通信分公司 | A kind of extracting method and device of the key-value pair of time series data |
CN109684374B (en) * | 2018-11-28 | 2021-05-25 | 海南电网有限责任公司信息通信分公司 | Method and device for extracting key value pairs of time series data |
CN109710651A (en) * | 2018-12-25 | 2019-05-03 | 成都四方伟业软件股份有限公司 | Data type recognition methods and device |
CN111488260A (en) * | 2019-01-29 | 2020-08-04 | 华为技术有限公司 | Data template acquisition method and device, computer equipment and readable storage medium |
CN111488260B (en) * | 2019-01-29 | 2023-12-08 | 华为云计算技术有限公司 | Data template acquisition method, device, computer equipment and readable storage medium |
CN110390208A (en) * | 2019-06-26 | 2019-10-29 | 联动优势科技有限公司 | A kind of the preferred data source access method and device of composite data item label |
CN110390208B (en) * | 2019-06-26 | 2023-02-21 | 联动优势科技有限公司 | Optimized data source access method and device for composite data item label |
CN110866557A (en) * | 2019-11-12 | 2020-03-06 | 贵州医渡云技术有限公司 | Data evaluation method and device, storage medium and electronic device |
CN110866557B (en) * | 2019-11-12 | 2022-12-13 | 贵州医渡云技术有限公司 | Data evaluation method and device, storage medium and electronic device |
CN111753332A (en) * | 2020-06-29 | 2020-10-09 | 上海通联金融服务有限公司 | Method for completing log desensitization in log writing stage based on sensitive information rule |
Also Published As
Publication number | Publication date |
---|---|
CN106776901B (en) | 2019-12-06 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106776901A (en) | Data extraction method, apparatus and system | |
CN106127872B (en) | Work attendance method, client and equipment based on mobile terminal | |
CN105446706B (en) | Method and device for evaluating form page use effect and providing original data | |
CN106557747B (en) | The method and device of identification insurance single numbers | |
CN102768659A (en) | Method and system for identifying repeated account | |
CN110336838B (en) | Account abnormity detection method, device, terminal and storage medium | |
CN110634471B (en) | Voice quality inspection method and device, electronic equipment and storage medium | |
CN103530347A (en) | Internet resource quality assessment method and system based on big data mining | |
CN110750433A (en) | Interface test method and device | |
CN102497435A (en) | Data distributing method and device of data service | |
CN109857714A (en) | Journal obtaining method, device, electronic equipment and computer readable storage medium | |
CN108234345A (en) | A kind of traffic characteristic recognition methods of terminal network application, device and system | |
CN105630867A (en) | Method and device for matching data | |
CN111291990A (en) | Quality monitoring processing method and device | |
JP2019504545A (en) | Method and apparatus for recognizing service request for changing mobile phone number | |
CN113792248B (en) | Online education course sharing and distributing system based on Internet and mobile terminal | |
CN109245910B (en) | Method and device for identifying fault type | |
CN103838739B (en) | The detection method and system of error correction term in a kind of search engine | |
CN112507041B (en) | Equipment model identification method and device, electronic equipment and storage medium | |
CN112465565B (en) | User portrait prediction method and device based on machine learning | |
CN110830499B (en) | Network attack application detection method and system | |
CN110516258B (en) | Data verification method and device, storage medium and electronic device | |
CN111092879A (en) | Log association method and device, electronic equipment and storage medium | |
CN109313827A (en) | Classroom is registered method, apparatus, terminal and storage medium | |
CN107040603A (en) | For determining the method and apparatus that application program App enlivens scene |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
CB02 | Change of applicant information | Address after: Room 311501, Unit 1, Building 5, Courtyard 1, Futong East Street, Chaoyang District, Beijing 100102; Applicant after: Beijing Zhichuangyu Information Technology Co., Ltd.; Address before: 100097 Jinwei Building 803, 55 Lanindichang South Road, Haidian District, Beijing; Applicant before: Beijing Knows Chuangyu Information Technology Co.,Ltd. |
GR01 | Patent grant | ||