CN109388675A - Data analysing method, device, computer equipment and storage medium - Google Patents
Data analysis method, device, computer equipment and storage medium
- Publication number: CN109388675A
- Application number: CN201811188076.6A
- Authority: CN (China)
- Prior art keywords: data, information, field, field information, data table
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abstract
This application relates to the field of data processing and discloses a data analysis method, device, computer equipment and storage medium. The method comprises: scanning a data table in a database to obtain field information of the data table; identifying the data type corresponding to the field information, and classifying the field information according to the data type to obtain classified field information; determining the format processing rule corresponding to the classified field information according to historical field processing information, wherein the historical field processing information records the correspondence between historical fields and format processing rules; processing the classified field information according to the determined format processing rule to obtain corresponding data information; and predicting the data characteristics of the data table from the data information based on a pre-trained characteristic prediction model. The method can improve the efficiency of big data analysis: the value of the data is understood through its data characteristics before the data is used, which in turn improves work efficiency.
Description
Technical field
This application relates to the technical field of data processing, and in particular to a data analysis method, device, computer equipment and storage medium.
Background
In the big data era, practitioners routinely face massive volumes of data and encounter many new data sources. Before data can be used, it must first be analyzed; only then is there a basic understanding of the data that allows it to be used well. If characteristic information such as the nature, format and saturation of the data is unknown, using the data can have adverse effects, for example producing incorrect association results. With existing methods, every field of a data table has to be checked and verified before the data is used, including the format, the saturation and the update status of the field contents. If a table has several hundred fields, this analysis is very time-consuming and can also significantly affect production business. It is therefore necessary to provide an analysis method that solves the above problems.
Summary of the invention
This application provides a data analysis method, device, computer equipment and storage medium to improve the efficiency of big data analysis.
This application provides a data analysis method, comprising:
scanning a data table in a database to obtain field information of the data table;
identifying the data type corresponding to the field information, and classifying the field information according to the data type to obtain classified field information;
determining the format processing rule corresponding to the classified field information according to historical field processing information, wherein the historical field processing information records the correspondence between historical fields and format processing rules;
processing the classified field information according to the determined format processing rule to obtain corresponding data information;
predicting the data characteristics of the data table from the data information based on a pre-trained characteristic prediction model.
This application provides a data analysis device, comprising:
a scanning acquisition unit, configured to scan a data table in a database to obtain field information of the data table;
an identification and classification unit, configured to identify the data type corresponding to the field information, and to classify the field information according to the data type to obtain classified field information;
a rule determination unit, configured to determine the format processing rule corresponding to the classified field information according to historical field processing information, wherein the historical field processing information records the correspondence between historical fields and format processing rules;
an information processing unit, configured to process the classified field information according to the determined format processing rule to obtain corresponding data information;
a characteristic prediction unit, configured to predict the data characteristics of the data table from the data information based on a pre-trained characteristic prediction model.
This application also provides computer equipment comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of any one of the data analysis methods provided by this application.
This application also provides a computer storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of the data analysis method of any embodiment provided by this application.
The embodiments of this application provide a data analysis method, device, computer equipment and storage medium. A data table in a database is scanned to obtain field information of the data table; the data type corresponding to the field information is identified, and the field information is classified according to the data type to obtain classified field information; the format processing rule corresponding to the classified field information is determined according to historical field processing information, which records the correspondence between historical fields and format processing rules; the classified field information is processed according to the determined format processing rule to obtain corresponding data information; and the data characteristics of the data table are predicted from the data information based on a pre-trained characteristic prediction model. This increases the speed of data analysis and lets the value of the data be understood through its data characteristics before the data is used, which in turn improves work efficiency.
Brief description of the drawings
To illustrate the technical solutions of the embodiments of this application more clearly, the drawings needed for describing the embodiments are briefly introduced below. The drawings described below show some embodiments of this application; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of the steps for training the characteristic prediction model provided by an embodiment of this application;
Fig. 2 is a schematic flowchart of a data analysis method provided by an embodiment of this application;
Fig. 3 is a schematic flowchart of the sub-steps of the data analysis method in Fig. 2;
Fig. 4 is a schematic flowchart of the steps for supplementing data information provided by an embodiment of this application;
Fig. 5 is a schematic block diagram of a data analysis device provided by an embodiment of this application;
Fig. 6 is a schematic block diagram of a data analysis device provided by another embodiment of this application;
Fig. 7 is a schematic block diagram of a data analysis device provided by yet another embodiment of this application;
Fig. 8 is a schematic block diagram of computer equipment provided by an embodiment of this application.
Detailed description
The technical solutions in the embodiments of this application are described clearly and completely below with reference to the drawings. The described embodiments are some, but not all, of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the protection scope of this application.
It should be understood that, when used in this specification and the appended claims, the terms "comprising" and "including" indicate the presence of the described features, wholes, steps, operations, elements and/or components, but do not exclude the presence or addition of one or more other features, wholes, steps, operations, elements, components and/or sets thereof.
It should also be understood that the terminology used in this specification is for the purpose of describing particular embodiments only and is not intended to limit this application. As used in this specification and the appended claims, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" used in this specification and the appended claims refers to any combination and all possible combinations of one or more of the associated listed items, and includes these combinations.
This application provides a data analysis method, device, computer equipment and storage medium. The method can be applied to a server, and in particular to a server of a distributed system, so as to analyze and process massive data and obtain the data characteristics of that data before it is used.
Referring to Fig. 1, Fig. 1 is a schematic flowchart of the steps for training the characteristic prediction model provided by an embodiment of this application. As shown in Fig. 1, the training specifically includes the following steps S101 to S103.
S101: Obtain the data of a historical data table as sample data.
Specifically, a historical data table is a data table whose data characteristics the user already understands. The data characteristics include information such as data field type, data field format and saturation. The server obtains the historical data table and scans the data in it as sample data. The historical data table can be selected by the user through a terminal, which sends the identifier of the selected data table to the server; the server then obtains that table as the historical data table and scans and reads its data as sample data.
S102: Perform feature extraction on the sample data to obtain feature fields, and construct a feature vector from the feature fields.
Specifically, feature extraction is performed on the sample data, for example by extracting field information of the text type, dimension type and discrete numeric type as feature fields. A feature vector is then constructed from these fields, with each value in the feature vector corresponding to one feature field.
S103: Based on a logistic regression algorithm, perform model training on the feature vectors to obtain the pre-trained characteristic prediction model.
Specifically, a logistic regression algorithm is selected; a neural network algorithm can of course be selected instead. Based on the selected algorithm, the feature vector is used as the input and the desired target value as the output for model training, for example with data volume and saturation as the output targets. The model obtained by training is saved as the pre-trained characteristic prediction model.
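Step S103 names logistic regression but gives no training details. The sketch below shows one plain way to fit such a model with batch gradient descent in pure Python. The sample vectors, the binary "high saturation" label, the learning rate and the epoch count are all illustrative assumptions, not values from the patent.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(weights, bias, x):
    """Probability that feature vector x belongs to the positive class."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train_logistic_regression(vectors, labels, lr=0.5, epochs=5000):
    """Fit weights and bias by batch gradient descent on the mean log loss."""
    n = len(vectors)
    weights = [0.0] * len(vectors[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(vectors, labels):
            err = predict_proba(weights, bias, x) - y
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


# Hypothetical samples: [is_text, is_dimension, is_discrete_number, fill_ratio].
SAMPLE_VECTORS = [
    [1, 0, 0, 0.95], [0, 1, 0, 0.90], [0, 0, 1, 0.99],
    [1, 0, 0, 0.20], [0, 1, 0, 0.10], [0, 0, 1, 0.30],
]
# Hypothetical target: 1 = high saturation, 0 = low saturation.
SAMPLE_LABELS = [1, 1, 1, 0, 0, 0]

WEIGHTS, BIAS = train_logistic_regression(SAMPLE_VECTORS, SAMPLE_LABELS)
```

Saving `WEIGHTS` and `BIAS` would correspond to storing the pre-trained model; a production system would more likely rely on an established machine learning library than on a hand-written loop.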
Referring to Fig. 2, Fig. 2 is a schematic flowchart of a data analysis method provided by an embodiment of this application. The method can be applied to a server, and in particular to a server of a distributed system, so as to predict the characteristics of massive data. As shown in Fig. 2, the data analysis method specifically includes steps S201 to S205.
S201: Scan a data table in a database to obtain field information of the data table.
Here, the database is the database of a business system, such as a property insurance system, a life insurance system, an accident insurance system or a mobile banking system. The database stores, in real time, the massive data generated by the business system, so the data volume is large. Before the data in the database is used, it is necessary to understand its data characteristics; the data tables of the database therefore need to be scanned to obtain the field information in them.
Specifically, to increase the data processing speed, the data tables of the database can be scanned by a distributed system to obtain the field information. The data tables can also be scanned by periodic polling. The meaning, type and format of the field information reflect the data characteristics.
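As a rough, single-machine illustration of the scanning step, the sketch below reads column metadata and a few sample values from one table. It uses SQLite from the Python standard library purely for convenience; the patent does not name a database product, and the function names, the `customers` table and the sample limit are hypothetical.

```python
import sqlite3


def scan_table_fields(conn, table_name, sample_limit=100):
    """Return name, declared type and sample values for every column of a table."""
    if not table_name.isidentifier():
        # Table names cannot be bound as SQL parameters, so validate them instead.
        raise ValueError(f"unsafe table name: {table_name!r}")
    columns = conn.execute(f"PRAGMA table_info({table_name})").fetchall()
    fields = []
    for _cid, name, declared_type, _notnull, _default, _pk in columns:
        rows = conn.execute(
            f'SELECT "{name}" FROM {table_name} LIMIT ?', (sample_limit,)
        ).fetchall()
        fields.append({
            "name": name,
            "declared_type": declared_type,
            "samples": [row[0] for row in rows],
        })
    return fields


def demo_connection():
    """Build a small in-memory table used only for demonstration."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (name TEXT, gender TEXT, phone TEXT)")
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?, ?)",
        [("Alice", "F", "13800000000"), ("Bob", "M", None)],
    )
    return conn
```

Calling `scan_table_fields(demo_connection(), "customers")` yields one dictionary per column. Periodic polling, as mentioned in the text, could be approximated by invoking this function on a schedule.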
S202: Identify the data type corresponding to the field information, and classify the field information according to the data type to obtain classified field information.
Here, the data types include the text type, the dimension type and the discrete data type, among others. Field information of the text type contains text information such as characters and English words. Field information of the dimension type contains a limited number of distinct values, such as constellation or gender. Field information of the discrete data type contains numeric information, such as bank card number, telephone number, age or birthday.
Specifically, identifying the data type corresponding to the field information comprises: identifying the data type from the features of the field information, and classifying the field information according to the recognized data type to obtain classified field information.
For example, if the field information is recognized as numeric, it is determined to be of the discrete data type and is classified accordingly, that is, placed in the discrete data type class of field information. Classifying the fields in this way increases the speed at which field information is processed and analyzed.
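The classification in S202 could be approximated with simple heuristics such as the following. The thresholds and the fallback order (numeric first, then dimension, then text) are assumptions chosen for illustration; the patent only states that the type is recognized from the features of the field information.

```python
def classify_field(values, max_dimension_values=10):
    """Guess whether a field is 'discrete_number', 'dimension' or 'text'."""
    present = [str(v).strip() for v in values if v is not None and str(v).strip()]
    if not present:
        return "unknown"
    # Digits, optionally separated by hyphens or spaces (card numbers, dates).
    if all(v.replace("-", "").replace(" ", "").isdigit() for v in present):
        return "discrete_number"
    distinct = set(present)
    # A small vocabulary with repeated values suggests a dimension field.
    if len(distinct) <= max_dimension_values and len(distinct) < len(present):
        return "dimension"
    return "text"


def group_fields_by_type(fields):
    """Map each detected type to the names of the fields assigned to it."""
    groups = {}
    for name, values in fields.items():
        groups.setdefault(classify_field(values), []).append(name)
    return groups
```

Grouping the fields by type lets later steps restrict their search to historical fields of the same type.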
S203: Determine the format processing rule corresponding to the classified field information according to historical field processing information, wherein the historical field processing information records the correspondence between historical fields and format processing rules.
Here, the historical field processing information is the record produced when historical field information was processed with its corresponding format processing rule.
For example, if the historical field information is field information of the TXT type, the format processing rule used for it is a TXT format processing rule. The current field information can then be compared with the historical field information to determine whether the two are similar; if they are, the format processing rule of the current field information is determined to be the rule corresponding to that historical field information.
In one embodiment, to determine the format processing rule corresponding to the field information quickly, step S203 includes sub-steps S203a and S203b, as shown in Fig. 3.
S203a: Determine the corresponding historical fields according to the data type of the field information, and calculate the Jaccard similarity coefficient between the field information and each historical field.
S203b: Determine, according to the Jaccard similarity coefficients, the historical field most similar to the field information, and use the format processing rule of that most similar historical field as the format processing rule of the field information.
Specifically, the Jaccard similarity coefficient (Jaccard index), also called the Jaccard coefficient, is used to compare the similarity and difference between finite sample sets. The larger the coefficient, the more similar the samples. To calculate the Jaccard similarity coefficient between the field information and a historical field, the two are treated as two sets, for example set A and set B, formed from the characters of their strings; the coefficient is the ratio of the size of the intersection of the two character sets to the size of their union. The historical field most similar to the field information is then determined from the coefficients, and its format processing rule is used as the format processing rule of the field information.
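Sub-steps S203a and S203b can be sketched as follows, using the character-set form of the Jaccard coefficient, J(A, B) = |A ∩ B| / |A ∪ B|. The layout of the history records (dictionaries with `name`, `data_type` and `rule` keys) and the rule names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of the character sets of two strings."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 1.0  # Convention chosen here: two empty strings are identical.
    return len(set_a & set_b) / len(union)


def pick_format_rule(field_name, data_type, history):
    """Return (most similar historical field, its rule), or None if no candidate.

    Only historical fields of the same data type are considered (S203a);
    the one with the highest coefficient supplies the rule (S203b).
    """
    candidates = [h for h in history if h["data_type"] == data_type]
    if not candidates:
        return None
    best = max(candidates, key=lambda h: jaccard(field_name, h["name"]))
    return best["name"], best["rule"]


# Hypothetical historical field processing information.
HISTORY = [
    {"name": "phone_number", "data_type": "discrete_number", "rule": "phone_rule"},
    {"name": "birth_date", "data_type": "discrete_number", "rule": "date_rule"},
    {"name": "gender", "data_type": "dimension", "rule": "enum_rule"},
]
```

For instance, `pick_format_rule("mobile_phone", "discrete_number", HISTORY)` selects `phone_number`, because the two names share 8 of 12 distinct characters, compared with 5 of 14 for `birth_date`.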
S204: Process the classified field information according to the determined format processing rule to obtain corresponding data information.
Specifically, the classified field information is parsed according to the determined format processing rule, which includes identifying the format, the meaning and the completeness of the classified field information.
To further increase the speed at which field information is processed and analyzed, the classified field information is processed according to the determined format processing rule using a distributed processing technique to obtain the corresponding data information, wherein the distributed processing technique includes Hadoop or Spark processing.
For example, the determined format processing rule and the corresponding classified field information can be sent to a host of the distributed system, so that the host processes the classified field information according to the rule and obtains the corresponding data information; the data information fed back by the host is then received.
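The patent mentions Hadoop or Spark for this step but does not define what a format processing rule looks like. The sketch below therefore makes two explicit assumptions: a rule is a regular expression that valid values must match, and a local thread pool stands in for the distributed hosts. `FORMAT_RULES`, `apply_rule` and `process_fields` are hypothetical names.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Hypothetical rules: each rule name maps to a pattern that valid values match.
FORMAT_RULES = {
    "phone_rule": re.compile(r"\d{11}"),
    "date_rule": re.compile(r"\d{4}-\d{2}-\d{2}"),
}


def apply_rule(task):
    """Process one field and return its data information.

    task is a (field_name, rule_name, values) tuple.
    """
    field_name, rule_name, values = task
    pattern = FORMAT_RULES[rule_name]
    filled = [v for v in values if v not in (None, "")]
    matched = [v for v in filled if pattern.fullmatch(v)]
    return {
        "field": field_name,
        "rule": rule_name,
        # Share of rows that carry any value at all.
        "saturation": len(filled) / len(values) if values else 0.0,
        # Share of filled rows that follow the expected format.
        "format_match_ratio": len(matched) / len(filled) if filled else 0.0,
    }


def process_fields(tasks, workers=4):
    """Fan the tasks out to workers and collect results in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(apply_rule, tasks))
```

In a real deployment, each task would be shipped to a cluster host together with its rule, and the returned dictionaries would play the role of the data information fed back by that host.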
In one embodiment, as shown in Fig. 4, the following steps are further included after step S204. S204a: Judge whether the data information is address information. S204b: If the data information is address information, obtain the complete address information from a matching address library and supplement the data information with it.
For example, if the data information is address information, the user may not have filled it in completely, for instance giving only a street in a district without the province. In that case the complete address information can be obtained from the matching address library and added to the data information.
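A toy version of steps S204a and S204b is sketched below. The keyword test for recognizing an address and the contents of the address library are invented for illustration; a real matching address library would be far larger and would handle partial prefixes more carefully.

```python
# Hypothetical keywords suggesting that a value is an address.
ADDRESS_KEYWORDS = ("Province", "City", "District", "Road", "Street")

# Hypothetical matching address library: district -> missing leading part.
ADDRESS_LIBRARY = {
    "Nanshan District": "Guangdong Province, Shenzhen City",
    "Haidian District": "Beijing City",
}


def is_address(text):
    """S204a: decide whether the data information looks like an address."""
    return any(keyword in text for keyword in ADDRESS_KEYWORDS)


def complete_address(text):
    """S204b: prepend the province and city when the library knows the district."""
    if not is_address(text):
        return text
    for district, prefix in ADDRESS_LIBRARY.items():
        if district in text and prefix not in text:
            return f"{prefix}, {text}"
    return text
```

Applying `complete_address` twice gives the same result as applying it once, so re-running the supplement step does not duplicate the prefix.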
S205: Based on the pre-trained characteristic prediction model, predict the data characteristics of the data table from the data information.
Specifically, the characteristic prediction model is an algorithm model trained in advance. Its input is the data information of the data table, and it predicts the data characteristics of the data in that table so that the data can be put to better use. The data characteristics include information such as data field type, data field format and saturation. With this method the characteristics of the data can be understood quickly, which helps subsequent use of the data.
In this embodiment, the data analysis method scans a data table in a database to obtain field information of the data table; identifies the data type corresponding to the field information and classifies the field information according to the data type to obtain classified field information; determines the format processing rule corresponding to the classified field information according to historical field processing information, which records the correspondence between historical fields and format processing rules; processes the classified field information according to the determined format processing rule to obtain corresponding data information; and predicts the data characteristics of the data table from the data information based on a pre-trained characteristic prediction model. This increases the speed of data analysis and lets the value of the data be understood through its data characteristics before the data is used, which in turn improves work efficiency.
Fig. 5 is a schematic block diagram of a data analysis device provided by an embodiment of this application. As shown in Fig. 5, corresponding to the above data analysis method, this application also provides a data analysis device. The device includes units for executing the above data analysis method and can be configured in a server.
As shown in Fig. 5, the data analysis device 400 includes: a data determination unit 401, an extraction and construction unit 402, a model training unit 403, a scanning acquisition unit 404, an identification and classification unit 405, a rule determination unit 406, an information processing unit 407 and a characteristic prediction unit 408.
The data determination unit 401 is configured to obtain the data of a historical data table as sample data.
The extraction and construction unit 402 is configured to perform feature extraction on the sample data to obtain feature fields, and to construct a feature vector from the feature fields.
The model training unit 403 is configured to perform model training on the feature vectors, based on a logistic regression algorithm, to obtain the pre-trained characteristic prediction model.
The scanning acquisition unit 404 is configured to scan a data table in a database to obtain field information of the data table.
The identification and classification unit 405 is configured to identify the data type corresponding to the field information, and to classify the field information according to the data type to obtain classified field information.
The rule determination unit 406 is configured to determine the format processing rule corresponding to the classified field information according to historical field processing information, wherein the historical field processing information records the correspondence between historical fields and format processing rules.
In one embodiment, as shown in Fig. 6, the rule determination unit 406 includes a determination and calculation subunit 4061 and a determination and setting subunit 4062.
The determination and calculation subunit 4061 is configured to determine the corresponding historical fields according to the data type of the field information, and to calculate the Jaccard similarity coefficient between the field information and each historical field. The determination and setting subunit 4062 is configured to determine, according to the Jaccard similarity coefficients, the historical field most similar to the field information, and to use the format processing rule of that most similar historical field as the format processing rule of the field information.
The information processing unit 407 is configured to process the classified field information according to the determined format processing rule to obtain corresponding data information.
The characteristic prediction unit 408 is configured to predict the data characteristics of the data table from the data information based on the pre-trained characteristic prediction model.
In one embodiment, as shown in Fig. 7, the data analysis device 500 further includes an information judging unit 501 and an acquisition and supplement unit 502.
The information judging unit 501 is configured to judge whether the data information is address information. The acquisition and supplement unit 502 is configured to, if the data information is address information, obtain the complete address information from a matching address library and supplement the data information with it.
A person skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the data analysis device and its units described above can be found in the corresponding processes of the foregoing method embodiments and are not repeated here.
The above device can be implemented in the form of a computer program, which can run on computer equipment as shown in Fig. 8.
Referring to Fig. 8, Fig. 8 is a schematic block diagram of computer equipment provided by an embodiment of this application. The computer equipment 700 can be a server.
Referring to Fig. 8, the computer equipment 700 includes a processor 720, a memory and a network interface 750 connected through a system bus 710, wherein the memory may include a non-volatile storage medium 730 and an internal memory 740.
The non-volatile storage medium 730 can store an operating system 731 and a computer program 732. When the computer program 732 is executed, it can cause the processor 720 to perform any one of the data analysis methods.
The processor 720 provides computing and control capabilities and supports the operation of the entire computer equipment 700.
The internal memory 740 provides an environment for running the computer program 732 stored in the non-volatile storage medium 730. When the computer program 732 is executed by the processor 720, it can cause the processor 720 to perform any one of the data analysis methods.
The network interface 750 is used for network communication, such as sending assigned tasks. A person skilled in the art will understand that the structure shown in Fig. 8 is only a block diagram of the part of the structure related to the solution of this application and does not limit the computer equipment 700 to which the solution is applied; a specific computer equipment 700 may include more or fewer components than shown, combine certain components, or arrange the components differently. The processor 720 is configured to run program code stored in the memory to implement the following steps:
scanning a data table in a database to obtain field information of the data table;
identifying the data type corresponding to the field information, and classifying the field information according to the data type to obtain classified field information;
determining the format processing rule corresponding to the classified field information according to historical field processing information, wherein the historical field processing information records the correspondence between historical fields and format processing rules;
processing the classified field information according to the determined format processing rule to obtain corresponding data information;
predicting the data characteristics of the data table from the data information based on a pre-trained characteristic prediction model.
In one embodiment, before scanning the data table in the database to obtain the field information of the data table, the processor 720, running the program code stored in the memory, further implements the following steps:
obtaining the data of a historical data table as sample data;
performing feature extraction on the sample data to obtain feature fields, and constructing a feature vector from the feature fields;
performing model training on the feature vectors, based on a logistic regression algorithm, to obtain the pre-trained characteristic prediction model.
In one embodiment, when determining the format processing rule corresponding to the classified field information according to the historical field processing information, the processor 720, running the program code stored in the memory, specifically implements the following steps:
determining the corresponding historical fields according to the data type of the field information, and calculating the Jaccard similarity coefficient between the field information and each historical field;
determining, according to the Jaccard similarity coefficients, the historical field most similar to the field information, and using the format processing rule of that most similar historical field as the format processing rule of the field information.
In one embodiment, when processing the classified field information according to the determined format processing rule to obtain the corresponding data information, the processor 720, running the program code stored in the memory, specifically implements the following step:
processing the classified field information according to the determined format processing rule using a distributed processing technique to obtain the data information corresponding to the classified field information, wherein the distributed processing technique includes Hadoop or Spark processing.
In one embodiment, when running the program code stored in the memory, the processor 720 further implements the following steps after processing the classified field information according to the determined format processing rule to obtain corresponding data information:
Determine whether the data information is address information;
If the data information is address information, obtain the complete address information from a matching address library and supplement the data information with it.
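A minimal sketch of the address supplement step follows. The text does not say how address information is detected or how the address library is matched, so the keyword heuristic, the substring match, and the library entries below are all assumptions made for illustration.

```python
# Hypothetical address library; real entries would come from a maintained source.
ADDRESS_LIBRARY = [
    "12 Example Road, Futian District, Shenzhen, Guangdong",
    "8 Sample Street, Pudong District, Shanghai",
]

# Assumed heuristic: a value counts as address information if it contains
# one of these keywords.
ADDRESS_KEYWORDS = ("road", "street", "district", "avenue")

def is_address(value):
    lowered = value.lower()
    return any(keyword in lowered for keyword in ADDRESS_KEYWORDS)

def complete_address(value, library=ADDRESS_LIBRARY):
    """Return the full library entry when exactly one entry contains the value.

    Non-address values and ambiguous or unmatched addresses are returned
    unchanged, so the step never replaces data with a guess.
    """
    if not is_address(value):
        return value
    needle = value.strip().lower()
    matches = [entry for entry in library if needle in entry.lower()]
    return matches[0] if len(matches) == 1 else value
```

Returning the input unchanged on zero or multiple matches is a design choice of this sketch: it keeps partial addresses intact when the library cannot identify them unambiguously.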
In one embodiment, when running the program code stored in the memory to scan the data table in the database to obtain the field information of the data table, the processor 720 specifically implements the following step:
Scan the data table in the database by periodic polling through a distributed system to obtain the field information of the data table.
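The periodic scan can be sketched on a single machine with Python's built-in `sqlite3` module. This is only an illustration of polling a database for field information: the distributed system named in the text is not modeled, and the table, the polling interval, and the `poll` helper are assumptions.

```python
import sqlite3
import time

def scan_field_info(conn):
    """Return {table_name: [(column_name, declared_type), ...]} for user tables."""
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'"
    )]
    info = {}
    for table in tables:
        quoted = '"' + table.replace('"', '""') + '"'
        # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
        columns = conn.execute(f"PRAGMA table_info({quoted})").fetchall()
        info[table] = [(col[1], col[2]) for col in columns]
    return info

def poll(conn, interval_seconds, rounds, on_scan):
    """Scan the database `rounds` times, sleeping between consecutive scans."""
    for i in range(rounds):
        on_scan(scan_field_info(conn))
        if i < rounds - 1:
            time.sleep(interval_seconds)

# Hypothetical in-memory database with one table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER, name TEXT, address TEXT)")

snapshots = []
poll(conn, interval_seconds=0, rounds=2, on_scan=snapshots.append)
```

Each snapshot maps a table name to its column names and declared types, which is the kind of field information the later classification step would consume.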
It should be understood that, in the embodiments of the present application, the processor 720 may be a central processing unit (Central Processing Unit, CPU), or another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. The general-purpose processor may be a microprocessor or any conventional processor.
Those skilled in the art will understand that the structure of the computer equipment 700 shown in Fig. 8 does not limit the computer equipment 700, which may include more or fewer components than illustrated, combine certain components, or use a different arrangement of components.
Those of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be completed by a computer program instructing the relevant hardware. The computer program can be stored in a storage medium, which is a computer-readable storage medium. In the embodiments of the present invention, the computer program can be stored in a storage medium of a computer system and executed by at least one processor of that computer system to implement the process steps of the method embodiments described above.
The computer-readable storage medium may be any medium capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (Read-Only Memory, ROM), a magnetic disk, or an optical disc.
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To illustrate the interchangeability of hardware and software clearly, the composition and steps of each example have been described above in general terms of their functions. Whether these functions are implemented in hardware or software depends on the specific application and the design constraints of the technical solution. Skilled practitioners may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of the present application.
In the several embodiments provided in the present application, it should be understood that the disclosed data analysis apparatus and method may be implemented in other ways. For example, the data analysis apparatus embodiment described above is merely illustrative: the division into units is only a division by logical function, and other divisions are possible in an actual implementation. For instance, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed.
The steps in the methods of the embodiments of the present application may be reordered, combined, and deleted according to actual needs.
The units in the apparatus of the embodiments of the present application may be combined, divided, and deleted according to actual needs.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in hardware or as a software functional unit.
If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied as a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a terminal, a network device, or the like) to execute all or part of the steps of the methods described in the embodiments of the present application.
The above are only specific embodiments of the present application, but the protection scope of the present application is not limited thereto. Any person skilled in the art can readily conceive of various equivalent modifications or substitutions within the technical scope disclosed in the present application, and these modifications or substitutions shall all fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Claims (10)
1. A data analysis method, characterized by comprising:
scanning a data table in a database to obtain field information of the data table;
identifying the data type corresponding to the field information, and classifying the field information according to the data type to obtain classified field information;
determining, according to history field processing information, the format processing rule corresponding to the classified field information, wherein the history field processing information records the correspondence between history fields and format processing rules;
processing the classified field information according to the determined format processing rule to obtain corresponding data information;
predicting, based on a pre-trained characteristic prediction model, the data characteristics of the data table according to the data information.
2. The data analysis method according to claim 1, characterized in that, before the scanning of the data table in the database to obtain the field information of the data table, the method further comprises:
obtaining data from historical data tables as sample data;
performing feature extraction on the sample data to obtain feature fields, and constructing feature vectors from the feature fields;
performing model training on the feature vectors using a logistic regression algorithm to obtain the pre-trained characteristic prediction model.
3. The data analysis method according to claim 1, characterized in that the determining, according to history field processing information, of the format processing rule corresponding to the classified field information comprises:
determining a plurality of corresponding history fields according to the data type of the field information, and calculating the Jaccard similarity coefficient between the field information and each of the history fields;
determining, according to the Jaccard similarity coefficients, the history field most similar to the field information, and using the format processing rule corresponding to the most similar history field as the format processing rule corresponding to the field information.
4. The data analysis method according to claim 1, characterized in that the processing of the classified field information according to the determined format processing rule to obtain corresponding data information comprises:
processing the classified field information according to the determined format processing rule using a distributed processing technology to obtain the data information corresponding to the classified field information, wherein the distributed processing technology includes Hadoop or Spark processing technology.
5. The data analysis method according to claim 1, characterized in that, after the processing of the classified field information according to the determined format processing rule to obtain corresponding data information, the method further comprises:
determining whether the data information is address information;
if the data information is address information, obtaining the complete address information from a matching address library and supplementing the data information with it.
6. The data analysis method according to claim 1, characterized in that the scanning of the data table in the database to obtain the field information of the data table comprises:
scanning the data table in the database by periodic polling through a distributed system to obtain the field information of the data table.
7. The data analysis method according to claim 1, characterized in that the data type includes a text type, a dimension type, and a discrete data type.
8. A data analysis apparatus, characterized by comprising:
a scan acquisition unit, configured to scan a data table in a database to obtain field information of the data table;
an identification and classification unit, configured to identify the data type corresponding to the field information and to classify the field information according to the data type to obtain classified field information;
a rule determination unit, configured to determine, according to history field processing information, the format processing rule corresponding to the classified field information, wherein the history field processing information records the correspondence between history fields and format processing rules;
an information processing unit, configured to process the classified field information according to the determined format processing rule to obtain corresponding data information;
a characteristic prediction unit, configured to predict, based on a pre-trained characteristic prediction model, the data characteristics of the data table according to the data information.
9. A computer device, characterized by comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to execute the steps of the method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811188076.6A CN109388675A (en) | 2018-10-12 | 2018-10-12 | Data analysing method, device, computer equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811188076.6A CN109388675A (en) | 2018-10-12 | 2018-10-12 | Data analysing method, device, computer equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109388675A true CN109388675A (en) | 2019-02-26 |
Family
ID=65427284
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811188076.6A Pending CN109388675A (en) | 2018-10-12 | 2018-10-12 | Data analysing method, device, computer equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109388675A (en) |
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110008250A (en) * | 2019-03-07 | 2019-07-12 | 平安科技(深圳)有限公司 | Social security data processing method, device and computer equipment based on data mining |
CN110083839A (en) * | 2019-04-29 | 2019-08-02 | 珠海豹好玩科技有限公司 | Text introduction method, device and equipment |
CN110399434A (en) * | 2019-07-25 | 2019-11-01 | 北京明略软件系统有限公司 | Field classification method and device, storage medium, electronic device |
CN110781183A (en) * | 2019-09-10 | 2020-02-11 | 中国平安财产保险股份有限公司 | Method and device for processing incremental data in Hive database and computer equipment |
CN110851428A (en) * | 2019-11-19 | 2020-02-28 | 厦门市美亚柏科信息股份有限公司 | Database analysis method, device and medium based on rule operator dynamic arrangement |
CN111159181A (en) * | 2019-12-18 | 2020-05-15 | 东软集团股份有限公司 | Medical data screening method and device, storage medium and electronic equipment |
CN111639077A (en) * | 2020-05-15 | 2020-09-08 | 杭州数梦工场科技有限公司 | Data management method and device, electronic equipment and storage medium |
CN112685415A (en) * | 2020-12-30 | 2021-04-20 | 平安普惠企业管理有限公司 | Data import method and device, computer equipment and storage medium |
CN112865999A (en) * | 2019-11-28 | 2021-05-28 | 华为技术有限公司 | Information processing method and related equipment |
CN113379004A (en) * | 2021-07-26 | 2021-09-10 | 浙江大华技术股份有限公司 | Data table classification method and device, electronic equipment and storage medium |
WO2021179579A1 (en) * | 2020-03-09 | 2021-09-16 | 平安科技(深圳)有限公司 | Backup data analysis method and apparatus based on file information, and computer device |
CN115543977A (en) * | 2022-09-29 | 2022-12-30 | 河北雄安睿天科技有限公司 | Water supply industry data cleaning method |
CN116975401A (en) * | 2023-09-19 | 2023-10-31 | 杭州美创科技股份有限公司 | Database field identification method, device, computer equipment and storage medium |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1673995A (en) * | 2004-03-24 | 2005-09-28 | 微软公司 | Method and apparatus for populating electronic forms from scanned documents |
CN106909811A (en) * | 2015-12-23 | 2017-06-30 | 腾讯科技(深圳)有限公司 | The method and apparatus of ID treatment |
US20170278015A1 (en) * | 2016-03-24 | 2017-09-28 | Accenture Global Solutions Limited | Self-learning log classification system |
US20170286388A1 (en) * | 2016-04-04 | 2017-10-05 | Accenture Global Solutions Limited | Document presentation interface based on intelligent mapping |
CN108197109A (en) * | 2017-12-29 | 2018-06-22 | 北京百分点信息科技有限公司 | A kind of multilingual analysis method and device based on natural language processing |
CN108337316A (en) * | 2018-02-08 | 2018-07-27 | 平安科技(深圳)有限公司 | Information-pushing method, device, computer equipment and storage medium |
CN108388924A (en) * | 2018-03-08 | 2018-08-10 | 平安科技(深圳)有限公司 | A kind of data classification method, device, equipment and computer readable storage medium |
-
2018
- 2018-10-12 CN CN201811188076.6A patent/CN109388675A/en active Pending
Cited By (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110008250B (en) * | 2019-03-07 | 2024-03-15 | 平安科技(深圳)有限公司 | Social security data processing method and device based on data mining and computer equipment |
CN110008250A (en) * | 2019-03-07 | 2019-07-12 | 平安科技(深圳)有限公司 | Social security data processing method, device and computer equipment based on data mining |
CN110083839A (en) * | 2019-04-29 | 2019-08-02 | 珠海豹好玩科技有限公司 | Text introduction method, device and equipment |
CN110083839B (en) * | 2019-04-29 | 2023-08-22 | 珠海豹好玩科技有限公司 | Text importing method, device and equipment |
CN110399434A (en) * | 2019-07-25 | 2019-11-01 | 北京明略软件系统有限公司 | Field classification method and device, storage medium, electronic device |
CN110781183A (en) * | 2019-09-10 | 2020-02-11 | 中国平安财产保险股份有限公司 | Method and device for processing incremental data in Hive database and computer equipment |
CN110781183B (en) * | 2019-09-10 | 2023-06-27 | 中国平安财产保险股份有限公司 | Processing method and device for incremental data in Hive database and computer equipment |
CN110851428B (en) * | 2019-11-19 | 2022-05-20 | 厦门市美亚柏科信息股份有限公司 | Database analysis method, device and medium based on rule operator dynamic arrangement |
CN110851428A (en) * | 2019-11-19 | 2020-02-28 | 厦门市美亚柏科信息股份有限公司 | Database analysis method, device and medium based on rule operator dynamic arrangement |
CN112865999A (en) * | 2019-11-28 | 2021-05-28 | 华为技术有限公司 | Information processing method and related equipment |
CN111159181A (en) * | 2019-12-18 | 2020-05-15 | 东软集团股份有限公司 | Medical data screening method and device, storage medium and electronic equipment |
WO2021179579A1 (en) * | 2020-03-09 | 2021-09-16 | 平安科技(深圳)有限公司 | Backup data analysis method and apparatus based on file information, and computer device |
CN111639077A (en) * | 2020-05-15 | 2020-09-08 | 杭州数梦工场科技有限公司 | Data management method and device, electronic equipment and storage medium |
CN111639077B (en) * | 2020-05-15 | 2024-03-22 | 杭州数梦工场科技有限公司 | Data management method, device, electronic equipment and storage medium |
CN112685415A (en) * | 2020-12-30 | 2021-04-20 | 平安普惠企业管理有限公司 | Data import method and device, computer equipment and storage medium |
CN113379004A (en) * | 2021-07-26 | 2021-09-10 | 浙江大华技术股份有限公司 | Data table classification method and device, electronic equipment and storage medium |
CN113379004B (en) * | 2021-07-26 | 2023-04-14 | 浙江大华技术股份有限公司 | Data table classification method and device, electronic equipment and storage medium |
CN115543977A (en) * | 2022-09-29 | 2022-12-30 | 河北雄安睿天科技有限公司 | Water supply industry data cleaning method |
CN116975401A (en) * | 2023-09-19 | 2023-10-31 | 杭州美创科技股份有限公司 | Database field identification method, device, computer equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109388675A (en) | Data analysing method, device, computer equipment and storage medium | |
CN108628741A (en) | Webpage test method, device, electronic equipment and medium | |
CN108415980A (en) | Question and answer data processing method, electronic device and storage medium | |
CN110222330B (en) | Semantic recognition method and device, storage medium and computer equipment | |
CN112883190A (en) | Text classification method and device, electronic equipment and storage medium | |
CN115130711A (en) | Data processing method and device, computer and readable storage medium | |
CN110276382A (en) | Listener clustering method, apparatus and medium based on spectral clustering | |
CN112328909A (en) | Information recommendation method and device, computer equipment and medium | |
CN112906361A (en) | Text data labeling method and device, electronic equipment and storage medium | |
CN110968664A (en) | Document retrieval method, device, equipment and medium | |
CN110442803A (en) | Data processing method, device, medium and the calculating equipment executed by calculating equipment | |
CN114692889A (en) | Meta-feature training model for machine learning algorithm | |
CN112579781A (en) | Text classification method and device, electronic equipment and medium | |
CN111782774B (en) | Method and device for recommending problems | |
CN111597336A (en) | Processing method and device of training text, electronic equipment and readable storage medium | |
CN109033078B (en) | The recognition methods of sentence classification and device, storage medium, processor | |
CN115952800A (en) | Named entity recognition method and device, computer equipment and readable storage medium | |
CN114780688A (en) | Text quality inspection method, device and equipment based on rule matching and storage medium | |
CN107577760A (en) | A kind of file classification method and device based on constrained qualification | |
CN114141235A (en) | Voice corpus generation method and device, computer equipment and storage medium | |
JP5824429B2 (en) | Spam account score calculation apparatus, spam account score calculation method, and program | |
CN112329943A (en) | Combined index selection method and device, computer equipment and medium | |
CN117010349B (en) | Form filling method, system and storage medium based on neural network model | |
CN114021739B (en) | Business processing method, business processing model training device and electronic equipment | |
CN111680513B (en) | Feature information identification method and device and computer readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||