CN106776901A - Data extraction method, apparatus and system - Google Patents


Info

Publication number
CN106776901A
Authority
CN
China
Prior art keywords
data
key
value
type
data type
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201611080168.3A
Other languages
Chinese (zh)
Other versions
CN106776901B (en)
Inventor
蔡自彬
何金良
李娟�
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Knownsec Information Technology Co Ltd
Original Assignee
Beijing Knownsec Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Knownsec Information Technology Co Ltd
Priority to CN201611080168.3A
Publication of CN106776901A
Application granted
Publication of CN106776901B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a method for extracting data from one or more data sources, where each data source includes multiple records and each record includes one or more data items in key-value pair form. The data extraction method includes the steps of: for each of the one or more data sources, determining the data type corresponding to each key and generating a data type table; parsing a record and extracting the one or more data items it includes; and, for each data item: extracting the key-value pair that constitutes the data item, and determining from the data type table, according to the data source of the record, the data type corresponding to the extracted key; and verifying the value in the extracted key-value pair using the data verification method corresponding to that data type, where, if verification succeeds, extraction succeeds and the value in the extracted key-value pair is recorded. The invention also discloses a corresponding data extraction device and system.

Description

Data extraction method, apparatus and system
Technical field
The present invention relates to the technical field of data extraction, and in particular to a data extraction method, apparatus and system.
Background
In today's big data environment, accurately extracting the required information from massive data, such as HTTP access logs or Internet of Things data, is of great significance for analyzing user behavior, preferences and habits, for predicting user behavior, and for improving the effect of advertisement delivery.
Take extracting data from a URL (Uniform Resource Locator) as an example. Typically, the data is matched in full against a predetermined regular expression; whenever there is a hit, the matched data is extracted and assigned the type associated with that regular expression. Practice shows that this scheme has a high error rate. For example, a piece of data may be recognized as a given data type and extracted even though only part of its content matches the regular expression. Alternatively, a piece of data may not be of the type associated with the regular expression at all, yet within massive data some part of it happens to match the expression, so that part is extracted by mistake.
Therefore, a data extraction method is needed that can accurately extract data from a variety of data sources while maintaining extraction efficiency.
Summary of the invention
To this end, the present invention provides a data extraction method, apparatus and system that seek to solve, or at least alleviate, at least one of the problems described above.
According to one aspect of the invention, there is provided a method for extracting data from one or more data sources, where each data source includes multiple records and each record includes one or more data items in key-value pair form. The data extraction method includes the steps of: for each of the one or more data sources, determining the data type corresponding to each key and generating a data type table; parsing a record and extracting the one or more data items it includes; and, for each data item: extracting the key-value pair that constitutes the data item, and determining from the data type table, according to the data source of the record, the data type corresponding to the extracted key; and verifying the value in the extracted key-value pair using the data verification method corresponding to that data type, where, if verification succeeds, extraction succeeds and the value in the extracted key-value pair is recorded.
Optionally, in the data extraction method according to the invention, the step of generating the data type table includes: for each of the one or more data sources, sampling the data to obtain a first number of records; for each of the first number of records, parsing the record and extracting all of its data items; for the value corresponding to the key in each data item's key-value pair, analyzing its data type by regular expression and/or data verification method, and taking that type as a data type corresponding to the key; counting, for each data source, the data types corresponding to each key and the number of values corresponding to each data type; and selecting, from the data types corresponding to each key, the data type whose share of the value count exceeds a first threshold, determining it to be the data type corresponding to that key in that data source, and storing the key and the determined data type in association with the data source as the data type table.
Optionally, in the data extraction method according to the invention, the step of sampling the data for each of the one or more data sources includes: taking the leading records of each data source, up to the first number; and/or randomly sampling the first number of records from each data source; and/or extracting the first number of records from each data source by time period.
Optionally, in the data extraction method according to the invention, the value-count share of a data type is the ratio of the number of values of a given key that belong to that data type to the total number of values of that key across all of its data types in the data source.
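The share and the selection rule can be written as a formula. The symbols are introduced here only for illustration and do not appear in the original: $n_s(k,t)$ is the number of sampled values of key $k$ in data source $s$ that were classified as data type $t$, and $\theta_1$ is the first threshold.

\[
r_s(k,t) = \frac{n_s(k,t)}{\sum_{t'} n_s(k,t')}, \qquad \mathrm{type}_s(k) = t \quad \text{if} \quad r_s(k,t) > \theta_1 .
\]

For instance, if 9 of 10 sampled values of a key are classified as e-mail addresses and 1 as unknown, then $r = 9/10 = 0.9$, which exceeds a threshold of $\theta_1 = 0.8$, so the key would be typed as an e-mail address in that data source.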
Optionally, in the data extraction method according to the invention, the step of verifying the value in the extracted key-value pair using the data verification method of the data type further includes: verifying the value in the extracted key-value pair using the regular expression of the data type.
Optionally, the data extraction method according to the invention further includes a step of correcting the data type: when a preset condition is met, counting, at intervals of a first predetermined time, the number of successful extractions and the number of failed extractions for each key in each data source, and calculating the extraction success rate of each key in each data source over that period; and, if the extraction success rate is below a second threshold, generating an alarm signal to trigger data type correction, in which the data type corresponding to that key in that data source is re-sampled and re-counted.
Optionally, in the data extraction method according to the invention, the step of correcting the data type further includes: repeating, at intervals of a second predetermined time, the step of generating the data type table on the latest data to generate a new data type table; and, according to the new data type table, again selecting from the data types corresponding to each key the data type whose share of the value count exceeds the first threshold as the data type corresponding to that key in that data source, for use in subsequent data extraction steps.
Optionally, in the data extraction method according to the invention, the data types include: identity identifier, social account, geographic location information, and mobile device identifier.
Optionally, in the data extraction method according to the invention, the first predetermined time is one day, and the second predetermined time is seven days or one day.
According to another aspect of the invention, there is provided a device for extracting data from one or more data sources, where each data source includes multiple records and each record includes one or more data items in key-value pair form. The data extraction device includes: a data type analysis module adapted to determine, for each of the one or more data sources, the data type corresponding to each key and to generate a data type table; a data extraction module adapted to parse a record and extract the one or more data items it includes, and further adapted to extract, for each data item, the key-value pair that constitutes the data item, the data type analysis module being further adapted to determine from the data type table, according to the data source of the record, the data type corresponding to the extracted key; and a data check module adapted to verify the value in the extracted key-value pair using the data verification method corresponding to that data type, where, if verification succeeds, extraction succeeds and the value in the extracted key-value pair is recorded.
Optionally, in the data extraction device according to the invention, the data type analysis module includes: a data sampling unit adapted to sample the data of each of the one or more data sources to obtain a first number of records; a data extracting unit adapted to parse each of the first number of records and extract all of its data items; a data type analysis unit adapted to analyze, by regular expression and/or data verification method, the data type of the value corresponding to the key in each data item's key-value pair, and to take that type as a data type corresponding to the key; and a statistic unit adapted to count, for each data source, the data types corresponding to each key and the number of values corresponding to each data type. The data type analysis unit is further adapted to select, from the data types corresponding to each key, the data type whose share of the value count exceeds a first threshold, to determine it to be the data type corresponding to that key in that data source, and to store the key and the determined data type in association with the data source as the data type table.
Optionally, in the data extraction device according to the invention, the data sampling unit is further adapted to take the leading records of each data source, up to the first number; and/or to randomly sample the first number of records from each data source; and/or to extract the first number of records from each data source by time period.
Optionally, in the data extraction device according to the invention, the value-count share of a data type is the ratio of the number of values of a given key that belong to that data type to the total number of values of that key across all of its data types in the data source.
Optionally, in the data extraction device according to the invention, the data check module is further adapted to verify the value in the extracted key-value pair using the regular expression of the data type.
Optionally, the data extraction device according to the invention further includes a data type rectification module adapted, when a preset condition is met, to count at intervals of a first predetermined time the number of successful extractions and the number of failed extractions for each key in each data source, and to calculate the extraction success rate of each key in each data source over that period. The data type rectification module is further adapted, when the extraction success rate is below a second threshold, to generate an alarm signal to trigger data type correction, in which the data type corresponding to that key in that data source is re-sampled and re-counted.
Optionally, in the data extraction device according to the invention, the data type rectification module is further adapted to trigger the data type analysis module at intervals of a second predetermined time, so that the data type analysis module generates a new data type table from the latest data and, according to the new data type table, again selects from the data types corresponding to each key the data type whose share of the value count exceeds the first threshold as the data type corresponding to that key in that data source, for use in subsequent data extraction steps.
Optionally, in the data extraction device according to the invention, the data types include: identity identifier, social account, geographic location information, and mobile device identifier.
Optionally, in the data extraction device according to the invention, the first predetermined time is one day, and the second predetermined time is seven days or one day.
According to yet another aspect of the invention, there is also provided a system for extracting data from one or more data sources, including: a data acquisition device adapted to collect data from the one or more data sources; the data extraction device described above; and a data analysis device adapted to analyze the extracted data.
In the data extraction scheme of the invention, the data type of the value (Value) of each key (Key) in each data source is derived by sampling and statistical analysis, and a data type table is generated. During extraction the data type of a key is already known, so the value only needs to be verified with that data type's verification method, which improves extraction efficiency. Moreover, because every extracted value is confirmed by verification, the accuracy of data extraction is also ensured.
Brief description of the drawings
To the accomplishment of the foregoing and related ends, certain illustrative aspects are described herein in connection with the following description and the accompanying drawings. These aspects are indicative of the various ways in which the principles disclosed herein may be practiced, and all aspects and their equivalents are intended to fall within the scope of the claimed subject matter. The above and other objects, features and advantages of the disclosure will become more apparent from the following detailed description read in conjunction with the drawings. Throughout the disclosure, the same reference numerals generally refer to the same parts or elements.
Fig. 1 shows a schematic diagram of a data extraction system 100 according to an embodiment of the invention;
Fig. 2 shows a flow chart of a data extraction method 200 according to an embodiment of the invention;
Fig. 3 shows a schematic diagram of a data extraction device 120 according to an embodiment of the invention; and
Fig. 4 shows a schematic diagram of the data extraction device 120 according to a further embodiment of the invention.
Detailed description
Exemplary embodiments of the disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure will be understood more thoroughly and so that its scope can be fully conveyed to those skilled in the art.
Fig. 1 shows a schematic diagram of a data extraction system 100 according to an embodiment of the invention. As shown in Fig. 1, the system 100 includes a data acquisition device 110, a data extraction device 120 and a data analysis device 130. The data acquisition device 110 is adapted to collect data from one or more data sources, where each data source includes multiple records and each record includes one or more data items in key-value pair (Key-Value) form. The data extraction device 120 is adapted to accurately extract the values (Value) of data items from the data collected by the data acquisition device 110, for example e-mail addresses, GPS locations and social tool accounts. The data analysis device 130 is adapted to analyze the data extracted by the data extraction device 120, for example to build a user profile from a series of data characterizing the user and to predict user behavior.
As the description of the system 100 above shows, the key to the scheme is how to extract data accurately and efficiently, that is, the operations performed by the data extraction device 120.
The flow by which the data extraction device 120 extracts data is described in detail below.
Fig. 2 shows a flow chart of a data extraction method 200 performed in the data extraction device 120 according to an embodiment of the invention. The method 200 starts at step S210. In step S210, for each of the one or more data sources collected by the data acquisition device 110, the data type corresponding to each key is determined and a data type table is generated.
According to some implementations, the value (Value) corresponding to the same key (Key) is quite likely to have different meanings in different data sources. For example, the HTTP access logs of different websites belong to different data sources, because the same Key may have a different meaning in each website's log. Take the following two URLs as an example:
URL1: http://www.xxx.com/index.htm?id=aaa@bbb.com&name=test
URL2: http://www.yyy.com/index.html?id=123456&phone=13405671234
In both URLs the key-value pairs take the form Key=Value. In URL1 the id field carries the user's identity information, whereas in URL2 the id field carries the user's numeric ID. Although the Key is the same in both, its meaning is entirely different.
Therefore, the extracted key-value pairs must first be classified, to determine the data type corresponding to each key in each data source.
First, for each of the one or more data sources, the data is sampled to obtain a first number of records. Optionally, the sampling step includes: taking the leading records of each data source, up to the first number; and/or randomly sampling the first number of records from each data source; and/or extracting the first number of records from each data source by time period.
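The three sampling options can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names, the `ts` timestamp field and the half-open time window are assumptions made here for the example.

```python
import random
from itertools import islice

def sample_first(records, n):
    """Take the leading n records of a data source."""
    return list(islice(records, n))

def sample_random(records, n, seed=None):
    """Draw up to n records uniformly at random, without replacement."""
    records = list(records)
    rng = random.Random(seed)
    return rng.sample(records, min(n, len(records)))

def sample_by_period(records, n, start, end, timestamp=lambda r: r["ts"]):
    """Take up to n records whose timestamp falls in [start, end).

    The 'ts' field is a hypothetical name; the source does not say how
    the time of a record is stored.
    """
    in_window = (r for r in records if start <= timestamp(r) < end)
    return list(islice(in_window, n))
```

Random sampling is less sensitive to ordering effects in a log, while the leading-records option is the cheapest because it never reads past the first n records.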
Second, each of the sampled first number of records is parsed and all of its data items are extracted. Note that the invention does not restrict the method used to extract data items containing key-value pairs. For example, if a data item has the form "key separator value", it is split at the separator: the first part is the key and the second part is the value, where the separator may be ":", "=", and so on.
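As one possible way to perform this parsing, the sketch below uses Python's standard `urllib.parse` for URL query strings and a small splitter for generic "key separator value" items. The helper names are hypothetical, and splitting at the earliest separator is a design choice made here, not something the source prescribes.

```python
from urllib.parse import urlsplit, parse_qsl

def items_from_url(url):
    """Return the (key, value) pairs carried in a URL's query string."""
    return parse_qsl(urlsplit(url).query, keep_blank_values=True)

def item_from_text(text, separators=(":", "=")):
    """Split one 'key<separator>value' item at the earliest separator found.

    Returns None when the text contains none of the separators.
    """
    positions = [(text.find(s), s) for s in separators if s in text]
    if not positions:
        return None
    pos, sep = min(positions)
    return text[:pos], text[pos + len(sep):]
```

Applied to a URL such as `http://www.yyy.com/index.html?id=123456&phone=13405671234`, `items_from_url` yields the pairs `("id", "123456")` and `("phone", "13405671234")`.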
Then, for the value corresponding to the key in each data item's key-value pair, its data type is analyzed by regular expression and/or data verification method and taken as a data type corresponding to the key. Optionally, the data types include: identity identifier (such as an ID card number), social account (such as a WeChat ID, QQ number or e-mail address), geographic location information (such as GPS location, city or country), and mobile device identifier (such as IMEI).
According to one embodiment of the invention, data of some data types contains a check code. For example, the last digit of an ID card number is a check digit, so the value can be verified by computing whether the check digit is correct. As another example, the regular expression for an e-mail address is ^[a-zA-Z0-9_-]+@[a-zA-Z0-9_-]+(\.[a-zA-Z0-9_-]+)+$, and this regular expression can be used to analyze whether the value of a key-value pair conforms to the e-mail data type.
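Both kinds of check can be sketched in a few lines. The e-mail pattern is the one quoted above, with the dot before each domain label written as `\.`. The check-digit routine assumes the ID card number is an 18-character Chinese resident identity number, whose last character is computed with the weighted mod-11 scheme of GB 11643-1999; the source only says that the last digit is a check digit, so that choice of algorithm is an assumption, as are the function names.

```python
import re

EMAIL_RE = re.compile(r"^[a-zA-Z0-9_-]+@[a-zA-Z0-9_-]+(\.[a-zA-Z0-9_-]+)+$")

# Weights and check characters of the mod-11 scheme (assumed: GB 11643-1999).
ID_WEIGHTS = (7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2)
ID_CHECK_CODES = "10X98765432"

def is_email(value):
    """Regular-expression check: the whole value must match the pattern."""
    return EMAIL_RE.match(value) is not None

def is_id_number(value):
    """Checksum check: recompute the check character from the first 17 digits."""
    if not re.fullmatch(r"[0-9]{17}[0-9Xx]", value):
        return False
    total = sum(int(d) * w for d, w in zip(value[:17], ID_WEIGHTS))
    return ID_CHECK_CODES[total % 11] == value[17].upper()
```

Because the pattern is anchored with `^` and `$`, a value that merely contains an e-mail address does not pass, which is exactly the partial-match error the background section describes.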
Finally, for each data source, the data types corresponding to each key and the number of values corresponding to each data type are counted; from the data types corresponding to each key, the data type whose share of the value count exceeds a first threshold is selected and determined to be the data type corresponding to that key in that data source; and the key and the determined data type are stored in association with the data source as the data type table.
According to an embodiment of the invention, the statistics of the data types corresponding to each key in each data source can be tabulated per data source and key, listing each observed data type together with its value count.
In such a table, "number M" indicates that, in data source X, key Key1 corresponds to data type A, data type B, ..., an unknown data type, and so on, and that a total of M values correspond to data type A.
The value-count share of each data type corresponding to each key is then calculated, where the value-count share of a data type is the ratio of the number of values of a given key that belong to that data type to the total number of values of that key across all of its data types in the data source. When the value-count share of a data type exceeds the first threshold (for example, 0.8), that data type is determined to be the data type corresponding to the key in the data source. Values of the key that belong to other data types are probably erroneous and can be excluded. The resulting data type table records the determined data type for each data source and key, for example data type A for key Key1 in data source X.
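The counting and threshold selection described above can be sketched as follows. The function name, the `(source, key, value)` tuple layout and the `"unknown"` label are assumptions for illustration; `classify` stands for whatever regular-expression or checksum analysis assigns a type name to a value.

```python
from collections import Counter, defaultdict

def build_type_table(samples, classify, first_threshold=0.8):
    """Build {(source, key): type_name} from sampled (source, key, value) triples.

    classify(value) returns a type name, or "unknown" when no type matches.
    Assumes first_threshold >= 0.5, so at most one type per key can qualify.
    """
    counts = defaultdict(Counter)  # (source, key) -> value count per type name
    for source, key, value in samples:
        counts[(source, key)][classify(value)] += 1

    table = {}
    for source_key, by_type in counts.items():
        total = sum(by_type.values())
        for type_name, n in by_type.items():
            # Keep the type only if its share of the key's values is large enough.
            if type_name != "unknown" and n / total > first_threshold:
                table[source_key] = type_name
    return table
```

Keying the table by the pair (data source, key) is what lets the same key, such as `id`, map to different types in different data sources. Keys whose values are split across several types get no entry, so their values are simply not extracted.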
Next, in step S220, a record is parsed and the one or more data items it includes are extracted.
Next, in step S230, for each data item, the key-value pair constituting the data item is extracted, and the data type corresponding to the extracted key is determined from the data type table according to the record's data source. Assuming the record's data source is X, the data type table above gives A as the data type corresponding to key Key1.
Next, in step S240, the value in the extracted key-value pair is verified using the data verification method corresponding to that data type; if verification passes, extraction succeeds and the value in the extracted key-value pair is recorded. Ordinarily, analyzing a value's data type with data verification methods requires traversing the list of data types and checking in turn whether the value satisfies each type's format and verification requirements; the value belongs to whichever data type it satisfies. In this method, however, the data type corresponding to the extracted key has already been determined from the data type table, so it is only necessary to check whether the value satisfies that one data type, which greatly improves efficiency.
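Steps S220 to S240 amount to one table lookup and one validator call per data item. The sketch below assumes a type table keyed by (data source, key) and a dictionary of validator functions keyed by type name; both structures and the function name are illustrative assumptions rather than the patent's own data layout.

```python
def extract(source, items, type_table, validators):
    """Return {key: value} for the items whose value passes its key's validator.

    source     -- name of the data source the record came from
    items      -- iterable of (key, value) pairs parsed from one record
    type_table -- {(source, key): type_name}, built beforehand by sampling
    validators -- {type_name: callable(value) -> bool}
    """
    record = {}
    for key, value in items:
        type_name = type_table.get((source, key))
        if type_name is None:
            continue  # no type was determined for this key in this source
        # Only the one validator for the known type is run, not every type's.
        if validators[type_name](value):
            record[key] = value
    return record
```

Compared with trying every known type against every value, the work per item is constant in the number of data types, which is the efficiency gain the paragraph above describes.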
According to a further embodiment of the invention, the value in the extracted key-value pair may also be verified using the regular expression of the data type. When a regular expression is used to analyze a value's data type, the value is matched against the regular expression of the data type; for example, to check for the IP address data type, the value is matched against the regular expression for IP addresses.
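The source does not give the regular expression it uses for IP addresses. As an illustration only, the sketch below uses a common pattern for dotted-decimal IPv4 addresses in which each octet is limited to 0 to 255; the pattern and the function name are assumptions.

```python
import re

# One octet: 250-255, 200-249, 100-199, or 0-99 without a leading zero.
_OCTET = r"(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])"
IPV4_RE = re.compile(r"^" + _OCTET + r"(\." + _OCTET + r"){3}$")

def is_ipv4(value):
    """Return True when the whole value is a dotted-decimal IPv4 address."""
    return IPV4_RE.match(value) is not None
```

A looser pattern such as four groups of one to three digits would also accept values like `999.1.1.1`, so bounding each octet keeps the check closer to a real validity test.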
According to embodiments of the invention, the data type of a value may be verified using a data verification method, a regular expression, or a combination of the two; the invention is not limited in this regard.
If, upon verification, the data type of the value is consistent with the data type determined in step S230, verification passes, which means extraction has succeeded, and the value in the extracted key-value pair is recorded. Typically the value is stored in JSON format, for example:
{"ip":"1.1.1.1","email":"xxx@yyy.com"}
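A record like the one above can be produced with Python's standard `json` module. This is only a sketch of one storage option; the function name and the one-object-per-line layout are assumptions, since the source says no more than that JSON is used.

```python
import json

def record_to_json(record):
    """Serialize one extracted record (a dict of key -> value) as a JSON line."""
    return json.dumps(record, ensure_ascii=False)

line = record_to_json({"ip": "1.1.1.1", "email": "xxx@yyy.com"})
```

Setting `ensure_ascii=False` keeps any non-ASCII values readable in the stored text instead of escaping them.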
According to one implementation, the meaning of a key may change, for example because a data source is upgraded, in which case the data type needs to be corrected. Typically, the step of correcting the data type includes the following.
When a preset condition is met (the data volume is large enough, for example a key has occurred thousands or tens of thousands of times in total), the number of successful extractions and the number of failed extractions of each key in each data source are counted at intervals of a first predetermined time (for example, 1 day), and the extraction success rate of each key in each data source over that period is calculated.
If the extraction success rate is below a second threshold (for example, a second threshold chosen from the range 0.75 to 0.85), an alarm signal is generated to trigger data type correction (automatically, or manually by an administrator), in which the data type corresponding to that key in that data source is re-sampled and re-counted.
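The success-rate check can be sketched as a function over one period's counters. The function name, the shape of `stats` and the default values are assumptions chosen to match the examples in the text (a second threshold within 0.75 to 0.85, and a minimum volume in the thousands).

```python
def keys_needing_correction(stats, second_threshold=0.8, min_total=1000):
    """Return the (source, key) pairs whose extraction success rate is too low.

    stats -- {(source, key): (n_success, n_failure)} counted over one period,
             for example one day.
    """
    flagged = []
    for source_key, (ok, failed) in stats.items():
        total = ok + failed
        if total < min_total:
            continue  # preset condition not met: too little data to judge
        if ok / total < second_threshold:
            flagged.append(source_key)  # would raise the alarm for this key
    return flagged
```

Each flagged pair would then be re-sampled and its data type re-determined; how the alarm is delivered is left open by the source.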
According to one implementation, the above step of generating the data type table (that is, step S210) may also be repeated on the latest data at intervals of a second predetermined time (for example, 1 day or 7 days) to generate a new data type table.
According to the new data type table, the data type whose share of the value count exceeds the first threshold is again selected from the data types corresponding to each key as the data type corresponding to that key in that data source, and is used in the subsequent data extraction steps.
In summary, the method 200 derives the data type of the value (Value) of each key (Key) in each data source by sampling and statistical analysis, and generates a data type table. During extraction the data type of a key is already known, so the value only needs to be verified with that data type's verification method, which improves extraction efficiency. Moreover, because every extracted value is confirmed by verification, the accuracy of data extraction is also ensured.
In addition, because the meaning of a key may change when a data source is upgraded, a step of correcting the data type is added, which further improves the accuracy of the data.
Correspondingly, Fig. 3 shows a schematic diagram of the data extraction device 120 according to an embodiment of the invention. As shown in Fig. 3, the device 120 includes a data type analysis module 122, a data extraction module 124 and a data check module 126.
The data type analysis module 122 determines, for each of the one or more data sources, the data type corresponding to each key and generates a data type table.
Further, the data type analysis module 122 includes a data sampling unit 1222, a data extracting unit 1224, a data type analysis unit 1226 and a statistic unit 1228, as shown in Fig. 3.
The data sampling unit 1222 is adapted to sample the data of each of the one or more data sources to obtain a first number of records. Optionally, the data sampling unit 1222 is adapted to take the leading records of each data source, up to the first number; and/or to randomly sample the first number of records from each data source; and/or to extract the first number of records from each data source by time period.
The data extracting unit 1224 is adapted to parse each of the first number of records and extract all of its data items. The invention does not restrict the way in which data items in key-value pair form are extracted.
The data type analysis unit 1226 is adapted to analyze, by regular expression and/or data verification method, the data type of the value corresponding to the key in each data item's key-value pair, and to take that type as a data type corresponding to the key. Optionally, the data types include: identity identifier (such as an ID card number), social account (such as a WeChat ID, QQ number or e-mail address), geographic location information (such as GPS location, city or country), and mobile device identifier (such as IMEI).
According to one embodiment of the invention, data of some data types contains a check code. For example, the last digit of an ID card number is a check digit, so the value can be verified by computing whether the check digit is correct. As another example, the regular expression for an e-mail address is ^[a-zA-Z0-9_-]+@[a-zA-Z0-9_-]+(\.[a-zA-Z0-9_-]+)+$, and this regular expression can be used to analyze whether the value of a key-value pair conforms to the e-mail data type. Of course, the two approaches may also be combined to analyze the data type of a value; the invention is not limited in this regard.
The statistic unit 1228 is adapted to count, for each data source, the data types corresponding to each key and the number of values corresponding to each data type; the statistics can be tabulated per data source and key, listing each observed data type together with its value count.
The data type analysis unit 1226 is further adapted to select, from the data types corresponding to each key, the data type whose share of the value count exceeds the first threshold and to determine it to be the data type corresponding to that key in that data source, where the value-count share of a data type is the ratio of the number of values of a given key that belong to that data type to the total number of values of that key across all of its data types in the data source.
The data type analysis unit 1226 is further adapted to store the key and the determined data type in association with the data source as the data type table.
The data extraction module 124 is adapted to parse a record and extract the one or more data items it includes, and is further adapted to extract, for each data item, the key-value pair that constitutes the data item.
The data type analysis module 122 is further adapted to determine from the above data type table, according to the record's data source, the data type corresponding to the extracted key. For example, the data type corresponding to key Key2 in data source X is E, and the data type corresponding to key Key5 in data source Y is G.
The data check module 126 is adapted to verify the value in the extracted key-value pair using the data verification method corresponding to that data type; if verification passes, extraction succeeds and the value in the extracted key-value pair is recorded.
According to embodiments of the invention, because the data type corresponding to the extracted key has been determined from the data type table, it is only necessary to check whether the value corresponding to the key passes that data type's verification method.
According to still another embodiment of the invention, can also utilize the data type regular expression to extracted key- The value of value centering is verified.Using regular expression assay value data type when, the canonical of use value matched data type Expression formula, if desired for analyzing IP address data type, then the regular expression of use value Corresponding matching IP address.
According to embodiments of the present invention, the data type of a value may also be verified by combining a data verification method with a regular expression; the present invention is not limited in this respect.
If verification shows that the data type of the value is consistent with the data type determined by the data type analysis module 122, the verification passes, which indicates that the extraction succeeded, and the value of the extracted key-value pair is recorded. Typically, the value is stored in JSON format.
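The verify-then-record step might look like the sketch below. The validator registry, the type name `numeric_id`, and the field names of the JSON record are hypothetical; the source states only that the value is typically stored in JSON format.

```python
import json

def is_ascii_digits(value):
    """Check method for a hypothetical 'numeric_id' data type."""
    return isinstance(value, str) and value.isascii() and value.isdigit()

# Hypothetical registry: data type name -> check method for that type.
VALIDATORS = {"numeric_id": is_ascii_digits}

def extract_item(key, value, data_type, validators=VALIDATORS):
    """Return a JSON record if value passes the check method registered
    for data_type; return None if the check fails or no method exists."""
    check = validators.get(data_type)
    if check is None or not check(value):
        return None
    return json.dumps({"key": key, "type": data_type, "value": value},
                      ensure_ascii=False)

record = extract_item("uid", "12345", "numeric_id")
```

Because the data type is already known from the data type table, only one check method runs per value instead of trying every known type in turn.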
Considering that the meaning of a key may change when a data source is, for example, upgraded, the apparatus 120 includes, in addition to the data type analysis module 122, the data extraction module 124 and the data check module 126, a data type rectification module 128, as shown in Figure 4.
The data type rectification module 128 is adapted, when a preset condition is met, to count at every first predetermined time (e.g., 1 day) the number of successful extractions and the number of failed extractions for each key in each data source, and to calculate the extraction success rate of each key in each data source within that time period. Optionally, the preset condition may be that the data volume is sufficiently large, for example that a given key occurs thousands or tens of thousands of times in total.
The data type rectification module 128 is further adapted to generate an alarm signal when the extraction success rate is below a second threshold (e.g., a value in the range 0.75-0.85), so as to trigger data type correction (automatically, or manually by an administrator), in which the data type corresponding to the key in the data source is re-determined by sampling and statistics.
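The success-rate check can be sketched as below. The function name, the shape of the `stats` mapping, the threshold 0.8 (chosen from within the stated 0.75-0.85 range) and the minimum count of 1000 (standing in for the "sufficiently large" precondition) are assumptions for illustration.

```python
def keys_needing_correction(stats, second_threshold=0.8, min_total=1000):
    """Return the (data source, key) pairs whose extraction success rate
    in one statistics period is below second_threshold.

    stats maps (data source, key) -> (successes, failures) for the period.
    Keys with fewer than min_total extractions are skipped, modeling the
    precondition that the data volume be sufficiently large.
    """
    flagged = []
    for (source, key), (successes, failures) in stats.items():
        total = successes + failures
        if total == 0 or total < min_total:
            continue
        if successes / total < second_threshold:
            flagged.append((source, key))
    return flagged

# Hypothetical daily statistics.
daily_stats = {
    ("X", "Key2"): (900, 100),   # rate 0.90: no alarm
    ("Y", "Key5"): (600, 400),   # rate 0.60: alarm
    ("Y", "Key9"): (1, 9),       # too few samples: skipped
}
alarms = keys_needing_correction(daily_stats)
```

Each flagged pair would then trigger re-sampling of that key's data type, either automatically or after review by an administrator.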
According to yet another embodiment of the present invention, the data type rectification module 128 is further adapted to trigger the data type analysis module 122 at every second predetermined time (e.g., 1 day or 7 days), so that the data type analysis module 122 generates a new data type table from the latest data and, according to the new data type table, again selects from the data types corresponding to each key the data type whose value-count proportion exceeds the first threshold as the data type corresponding to that key in the data source, in order to complete the subsequent data extraction steps.
Based on the above, the apparatus 120 derives, through sampling and statistical analysis, the data type of the value of each key in each data source and generates the data type table. When data is extracted, the data type of the key is already known, so only the data verification method of that data type needs to be applied. This improves the efficiency of data extraction, and the verification also ensures the accuracy of the extracted data.
Furthermore, since the meaning of a key may change when a data source is, for example, upgraded, a function for correcting data types is added, which further improves the accuracy of the data.
It should be appreciated that, in order to streamline the disclosure and aid in the understanding of one or more of the various inventive aspects, in the above description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof. However, the disclosed method should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of the invention.
Those skilled in the art will appreciate that the modules, units or components of the devices in the examples disclosed herein may be arranged in a device as described in the embodiment, or alternatively may be located in one or more devices different from the device in the example. The modules in the foregoing examples may be combined into one module or may further be divided into a plurality of sub-modules.
Those skilled in the art will appreciate that the modules in the device of an embodiment may be adaptively changed and arranged in one or more devices different from that embodiment. The modules, units or components of the embodiments may be combined into one module, unit or component, and may further be divided into a plurality of sub-modules, sub-units or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract and drawings) may be replaced by an alternative feature serving the same, an equivalent or a similar purpose.
A5. The method of any one of A1-4, wherein the step of verifying the value of the extracted key-value pair using the data verification method of the data type further comprises: verifying the value of the extracted key-value pair using the regular expression of the data type.
A6. The method of any one of A1-5, further comprising a step of correcting data types: when a preset condition is met, counting, at every first predetermined time, the number of successful extractions and the number of failed extractions for each key in each data source, and calculating the extraction success rate of each key in each data source within that time period; and, if the extraction success rate is below a second threshold, generating an alarm signal to trigger data type correction, in which the data type corresponding to the key in the data source is re-determined by sampling and statistics.
A7. The method of A6, wherein the step of correcting data types further comprises: at every second predetermined time, repeating the step of generating the data type table on the latest data to generate a new data type table; and, according to the new data type table, again selecting, from the data types corresponding to each key, the data type whose value-count proportion exceeds the first threshold as the data type corresponding to that key in the data source, so as to perform the subsequent data extraction steps.
A8. The method of any one of A1-7, wherein the data types include: identity, social account, geographic location information, and mobile device identifier.
A9. The method of any one of A1-8, wherein the first predetermined time is one day, and the second predetermined time is seven days or one day.
B15. The apparatus of any one of B10-14, further comprising a data type rectification module, the data type rectification module being adapted, when a preset condition is met, to count, at every first predetermined time, the number of successful extractions and the number of failed extractions for each key in each data source, and to calculate the extraction success rate of each key in each data source within that time period; the data type rectification module being further adapted to generate an alarm signal when the extraction success rate is below a second threshold, so as to trigger data type correction, in which the data type corresponding to the key in the data source is re-determined by sampling and statistics.
B16. The apparatus of B15, wherein the data type rectification module is further adapted to trigger the data type analysis module at every second predetermined time, so that the data type analysis module generates a new data type table from the latest data and, according to the new data type table, again selects, from the data types corresponding to each key, the data type whose value-count proportion exceeds the first threshold as the data type corresponding to that key in the data source, so as to perform the subsequent data extraction steps.
B17. The apparatus of any one of B10-16, wherein the data types include: identity, social account, geographic location information, and mobile device identifier.
B18. The apparatus of any one of B10-17, wherein the first predetermined time is one day, and the second predetermined time is seven days or one day.
Furthermore, those skilled in the art will appreciate that although some embodiments described herein include some features that are included in other embodiments but not others, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
Furthermore, some of the embodiments are described herein as a method, or a combination of method elements, that can be implemented by a processor of a computer system or by other means of carrying out the function. Thus, a processor with the necessary instructions for carrying out such a method or method element forms a means for carrying out the method or method element. Furthermore, an element of an apparatus embodiment described herein is an example of a means for carrying out the function performed by the element for the purpose of carrying out the invention.
As used herein, unless otherwise specified, the use of the ordinal adjectives "first", "second", "third", etc., to describe a common object merely indicates that different instances of like objects are being referred to, and is not intended to imply that the objects so described must be in a given sequence, whether temporally, spatially, in ranking, or in any other manner.
Although the invention has been described with respect to a limited number of embodiments, those skilled in the art, having the benefit of the above description, will appreciate that other embodiments can be devised within the scope of the invention as thus described. Additionally, it should be noted that the language used in this specification has been selected principally for readability and instructional purposes, and not to delineate or circumscribe the inventive subject matter. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. With respect to the scope of the invention, the disclosure of the invention is illustrative and not restrictive, and the scope of the invention is defined by the appended claims.

Claims (10)

1. A data extraction method for extracting data from one or more data sources, each of the one or more data sources comprising a plurality of pieces of data, each piece of data comprising one or more data items in the form of key-value pairs, the data extraction method comprising the steps of:
for each of the one or more data sources, determining the data type corresponding to each key and generating a data type table;
parsing a piece of data and extracting the one or more data items contained in the piece of data, and, for each data item:
extracting the key-value pair constituting the data item, and determining, from the data type table and according to the data source of the piece of data, the data type corresponding to the extracted key; and
verifying the value of the extracted key-value pair using the data verification method corresponding to the data type, and, if the verification passes, treating the extraction as successful and recording the value of the extracted key-value pair.
2. The method of claim 1, wherein the step of generating the data type table comprises:
for each of the one or more data sources, sampling the data to obtain a first number of pieces of data;
for each piece of data in the first number of pieces of data, parsing the data piece by piece and extracting all data items;
for the value corresponding to the key of the key-value pair in each data item, analyzing its data type by means of a regular expression and/or a data verification method, as a data type corresponding to that key;
counting, for each data source, the number of data types corresponding to each key and the number of values corresponding to each data type; and
selecting, from the data types corresponding to each key, the data type whose value-count proportion exceeds a first threshold, determining it as the data type corresponding to that key in the data source, and storing the key in the data source in association with the determined data type, as the data type table.
3. The method of claim 2, wherein the step of sampling the data for each of the one or more data sources comprises:
taking the leading first number of pieces of data in each data source; and/or
randomly sampling the first number of pieces of data in each data source; and/or
extracting the first number of pieces of data in each data source by time period.
4. The method of any one of claims 1-3, wherein the value-count proportion of a data type is the ratio of the number of values of a given key that correspond to that data type to the total number of values of that key across all of its data types in the data source.
5. A data extraction apparatus for extracting data from one or more data sources, each of the one or more data sources comprising a plurality of pieces of data, each piece of data comprising one or more data items in the form of key-value pairs, the data extraction apparatus comprising:
a data type analysis module adapted, for each of the one or more data sources, to determine the data type corresponding to each key and generate a data type table;
a data extraction module adapted to parse a piece of data and extract the one or more data items contained in the piece of data, and further adapted to extract, for each data item, the key-value pair constituting the data item;
the data type analysis module being further adapted to determine, from the data type table and according to the data source of the piece of data, the data type corresponding to the extracted key; and
a data check module adapted to verify the value of the extracted key-value pair using the data verification method corresponding to the data type, and, if the verification passes, to treat the extraction as successful and record the value of the extracted key-value pair.
6. The apparatus of claim 5, wherein the data type analysis module comprises:
a data sampling unit adapted, for each of the one or more data sources, to sample the data to obtain a first number of pieces of data;
a data extraction unit adapted, for each piece of data in the first number of pieces of data, to parse the data piece by piece and extract all data items;
a data type analysis unit adapted, for the value corresponding to the key of the key-value pair in each data item, to analyze its data type by means of a regular expression and/or a data verification method, as a data type corresponding to that key;
a statistics unit adapted to count, for each data source, the number of data types corresponding to each key and the number of values corresponding to each data type;
the data type analysis unit being further adapted to select, from the data types corresponding to each key, the data type whose value-count proportion exceeds a first threshold, to determine it as the data type corresponding to that key in the data source, and to store the key in the data source in association with the determined data type, as the data type table.
7. The apparatus of claim 6, wherein the data sampling unit is further adapted to take the leading first number of pieces of data in each data source; and/or is further adapted to randomly sample the first number of pieces of data in each data source; and/or is further adapted to extract the first number of pieces of data in each data source by time period.
8. The apparatus of any one of claims 5-7, wherein the value-count proportion of a data type is the ratio of the number of values of a given key that correspond to that data type to the total number of values of that key across all of its data types in the data source.
9. The apparatus of any one of claims 5-8, wherein
the data check module is further adapted to verify the value of the extracted key-value pair using the regular expression of the data type.
10. A data extraction system for extracting data from one or more data sources, comprising:
a data acquisition apparatus adapted to collect data from the one or more data sources;
the data extraction apparatus of any one of claims 5-9; and
a data analysis apparatus adapted to analyze the extracted data.
CN201611080168.3A 2016-11-30 2016-11-30 Data extraction method, device and system Active CN106776901B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611080168.3A CN106776901B (en) 2016-11-30 2016-11-30 Data extraction method, device and system

Publications (2)

Publication Number Publication Date
CN106776901A true CN106776901A (en) 2017-05-31
CN106776901B CN106776901B (en) 2019-12-06

Family

ID=58901448

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611080168.3A Active CN106776901B (en) 2016-11-30 2016-11-30 Data extraction method, device and system

Country Status (1)

Country Link
CN (1) CN106776901B (en)

Also Published As

Publication number Publication date
CN106776901B (en) 2019-12-06

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: Room 311501, Unit 1, Building 5, Courtyard 1, Futong East Street, Chaoyang District, Beijing 100102

Applicant after: Beijing Zhichuangyu Information Technology Co., Ltd.

Address before: 100097 Jinwei Building 803, 55 Lanindichang South Road, Haidian District, Beijing

Applicant before: Beijing Knows Chuangyu Information Technology Co.,Ltd.

GR01 Patent grant