CN109388675A - Data analysing method, device, computer equipment and storage medium - Google Patents

Publication number
CN109388675A
Authority
CN
China
Prior art keywords
data
information
field
field information
tables
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811188076.6A
Other languages
Chinese (zh)
Inventor
陈健鹏
伍文岳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN201811188076.6A
Publication of CN109388675A

Abstract

This application relates to the field of data processing and discloses a data analysis method, apparatus, computer device, and storage medium. The method comprises: scanning a data table in a database to obtain field information of the data table; identifying the data type corresponding to the field information, and classifying the field information according to the data type to obtain classified field information; determining the format processing rule corresponding to the classified field information according to historical field processing information, where the historical field processing information records the correspondence between historical fields and format processing rules; processing the classified field information according to the determined format processing rule to obtain corresponding data information; and predicting the data characteristics of the data table from the data information, based on a pre-trained characteristic prediction model. The method can improve the efficiency of big data analysis: the value of the data is understood through its data characteristics before the data is used, which in turn improves work efficiency.

Description

Data analysis method, apparatus, computer device, and storage medium
Technical field
This application relates to the technical field of data processing, and in particular to a data analysis method, apparatus, computer device, and storage medium.
Background
In the big data era, one routinely faces massive volumes of data and encounters new data sources of all kinds. Before data can be used, it must first be analyzed; only then can one gain a basic understanding of the data and use it well. If characteristics such as the data's properties, format, and saturation are unknown, the use of the data can suffer; for example, associations between data may produce incorrect results. Existing methods require checking and verifying every field of a data table before its data is used, including how each field is formatted when filled in, how saturated it is, and how it is updated. If a table has several hundred fields, this analysis is very time-consuming and can also significantly affect production business. It is therefore necessary to provide an analysis method that solves these problems.
Summary of the invention
This application provides a data analysis method, apparatus, computer device, and storage medium to improve the efficiency of big data analysis.
This application provides a data analysis method, comprising:
scanning a data table in a database to obtain field information of the data table;
identifying the data type corresponding to the field information, and classifying the field information according to the data type to obtain classified field information;
determining the format processing rule corresponding to the classified field information according to historical field processing information, where the historical field processing information records the correspondence between historical fields and format processing rules;
processing the classified field information according to the determined format processing rule to obtain corresponding data information;
predicting the data characteristics of the data table from the data information, based on a pre-trained characteristic prediction model.
This application provides a data analysis apparatus, comprising:
a scanning and acquiring unit, configured to scan a data table in a database to obtain field information of the data table;
an identification and classification unit, configured to identify the data type corresponding to the field information, and to classify the field information according to the data type to obtain classified field information;
a rule determination unit, configured to determine the format processing rule corresponding to the classified field information according to historical field processing information, where the historical field processing information records the correspondence between historical fields and format processing rules;
an information processing unit, configured to process the classified field information according to the determined format processing rule to obtain corresponding data information;
a characteristic prediction unit, configured to predict the data characteristics of the data table from the data information, based on a pre-trained characteristic prediction model.
This application also provides a computer device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor, when executing the program, implements the steps of any one of the data analysis methods provided by this application.
This application also provides a computer storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of the data analysis method described in any embodiment provided by this application.
The embodiments of this application provide a data analysis method, apparatus, computer device, and storage medium. A data table in a database is scanned to obtain field information of the data table; the data type corresponding to the field information is identified, and the field information is classified according to the data type to obtain classified field information; the format processing rule corresponding to the classified field information is determined according to historical field processing information, which records the correspondence between historical fields and format processing rules; the classified field information is processed according to the determined format processing rule to obtain corresponding data information; and the data characteristics of the data table are predicted from the data information, based on a pre-trained characteristic prediction model. This speeds up data analysis: the value of the data is understood through its data characteristics before the data is used, which in turn improves work efficiency.
Brief description of the drawings
To explain the technical solutions of the embodiments of this application more clearly, the drawings needed for describing the embodiments are briefly introduced below. The drawings described below show some embodiments of this application; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of the steps for training the characteristic prediction model provided by an embodiment of this application;
Fig. 2 is a schematic flowchart of a data analysis method provided by an embodiment of this application;
Fig. 3 is a schematic flowchart of the sub-steps of the data analysis method in Fig. 2;
Fig. 4 is a schematic flowchart of the steps for supplementing data information provided by an embodiment of this application;
Fig. 5 is a schematic block diagram of a data analysis apparatus provided by an embodiment of this application;
Fig. 6 is a schematic block diagram of a data analysis apparatus provided by another embodiment of this application;
Fig. 7 is a schematic block diagram of a data analysis apparatus provided by yet another embodiment of this application;
Fig. 8 is a schematic block diagram of a computer device provided by an embodiment of this application.
Detailed description of the embodiments
The technical solutions in the embodiments of this application are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are some, but not all, of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of this application without creative effort fall within the scope of protection of this application.
It should be understood that, when used in this specification and the appended claims, the terms "includes" and "comprises" indicate the presence of the described features, wholes, steps, operations, elements, and/or components, but do not exclude the presence or addition of one or more other features, wholes, steps, operations, elements, components, and/or sets thereof.
It should also be understood that the terms used in this specification serve only to describe specific embodiments and are not intended to limit this application. As used in this specification and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" used in this specification and the appended claims refers to any combination, and all possible combinations, of one or more of the associated listed items, and includes these combinations.
This application provides a data analysis method, apparatus, computer device, and storage medium. The method can be applied in a server, and specifically in a server of a distributed system, to analyze and process massive data and to obtain the data characteristics of that data before it is used.
Referring to Fig. 1, Fig. 1 is a schematic flowchart of the steps for training the characteristic prediction model provided by an embodiment of this application. As shown in Fig. 1, the training specifically includes the following steps S101 to S103.
S101: obtain the data of a historical data table as sample data.
Specifically, the historical data table is a data table whose data characteristics the user already understands. The data characteristics include information such as data field type, data field format, and saturation. The server obtains the historical data table and scans the data in it as sample data. The historical data table may be a data table selected by the user through a terminal, which sends the identifier of the data table to the server; the server then obtains that data table as the historical data table and scans and reads its data as sample data.
S102: perform feature extraction on the sample data to obtain feature fields, and construct a feature vector from the feature fields.
Specifically, feature extraction is performed on the sample data, for example by extracting field information of the text type, dimension type, and discrete number type as feature fields. A feature vector is then constructed from these fields, where each value in the feature vector corresponds to one feature field.
S103: based on a logistic regression algorithm, perform model training on the feature vector to obtain the pre-trained characteristic prediction model.
Specifically, a suitable logistic regression algorithm is selected; a neural network algorithm may of course be selected instead. Based on the selected logistic regression algorithm, model training is performed with the feature vector as input and the desired target value as output, for example with data volume and saturation as the output targets. The model obtained from training is saved as the pre-trained characteristic prediction model.
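The training in S101 to S103 can be illustrated with a minimal sketch. The patent does not specify the feature layout, the label, or the optimizer, so everything concrete here is an assumption: each hypothetical feature vector holds the share of text, dimension, and discrete-number fields in a historical table, the hypothetical label marks whether the table is "high saturation", and plain batch gradient descent stands in for whatever logistic regression implementation is actually used.

```python
import math


def sigmoid(z: float) -> float:
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic_regression(X, y, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n, n_features = len(X), len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_proba(model, x):
    """Probability of the positive label for one feature vector."""
    w, b = model
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


# Hypothetical sample data: [text share, dimension share, discrete-number share].
X = [
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.2, 0.7],
    [0.2, 0.1, 0.7],
]
y = [0, 0, 1, 1]  # hypothetical label: 1 means "high saturation"

model = train_logistic_regression(X, y)
```

The returned `model` plays the role of the saved characteristic prediction model; `predict_proba(model, vector)` would then be called on the feature vector of a newly scanned table.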
Referring to Fig. 2, Fig. 2 is a schematic flowchart of a data analysis method provided by an embodiment of this application. The method can be applied in a server, and specifically in a server of a distributed system, to predict the characteristics of massive data. As shown in Fig. 2, the data analysis method specifically includes steps S201 to S205.
S201: scan a data table in a database to obtain field information of the data table.
Here, the database is the database of a business system, such as a property insurance system, life insurance system, accident insurance system, or mobile banking system. The database stores in real time the massive data generated by the business system, so the data volume is large. Before the data in the database is used, it is necessary to understand its data characteristics, so the data tables of the database are scanned to obtain the field information in them.
Specifically, to improve data processing speed, the data tables of the database can be scanned through a distributed system to obtain the field information in them. The data tables can also be scanned by periodic polling. The meaning, type, and format of the field information reflect the data characteristics.
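As a small illustration of S201, the sketch below reads the field names, declared types, and a few sample values from one table. It uses Python's built-in `sqlite3` module only so that the example is self-contained; the patent does not name a database engine, and the `policy` table and its columns are hypothetical.

```python
import sqlite3


def scan_table_fields(conn, table, sample_size=100):
    """Return name, declared type, and sample values for each column of a table."""
    # PRAGMA statements cannot take bound parameters, so the table name is
    # validated before being interpolated into the SQL text.
    if not table.isidentifier():
        raise ValueError(f"unsafe table name: {table!r}")
    # Each row of table_info is (cid, name, type, notnull, dflt_value, pk).
    columns = conn.execute(f"PRAGMA table_info({table})").fetchall()
    fields = []
    for _cid, name, declared_type, *_rest in columns:
        rows = conn.execute(
            f'SELECT "{name}" FROM {table} LIMIT ?', (sample_size,)
        ).fetchall()
        fields.append(
            {
                "name": name,
                "declared_type": declared_type,
                "samples": [row[0] for row in rows],
            }
        )
    return fields


# Hypothetical business table used only for demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE policy (holder TEXT, gender TEXT, phone TEXT)")
conn.executemany(
    "INSERT INTO policy VALUES (?, ?, ?)",
    [("Li Lei", "M", "13800000000"), ("Han Mei", "F", None)],
)
fields = scan_table_fields(conn, "policy")
```

Periodic polling could be layered on top by calling `scan_table_fields` from a scheduler; that part is left out here.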
S202: identify the data type corresponding to the field information, and classify the field information according to the data type to obtain classified field information.
Here, the data types include the text type, the dimension type, the discrete data type, and so on. Field information of the text type contains text such as written characters and English words. Field information of the dimension type takes a limited number of values, such as zodiac sign or gender. Field information of the discrete data type contains numeric information, such as a bank card number, telephone number, age, or birthday.
Specifically, identifying the data type corresponding to the field information includes: identifying the data type from the features of the field information, and classifying the field information according to the identified data type to obtain classified field information.
For example, if the field information is recognized as numeric, it is determined to be of the discrete data type and is placed in the class of discrete-data-type field information. Classifying the field information in this way speeds up its subsequent processing and analysis.
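The classification in S202 might be sketched as follows. The patent only says that the type is identified "according to the features of the field information", so the concrete heuristics here (digits-only values are discrete numbers, a small set of repeating values is a dimension, everything else is text) and the cardinality threshold are assumptions chosen for illustration.

```python
from collections import defaultdict


def identify_data_type(values, max_dimension_cardinality=12):
    """Guess 'discrete_number', 'dimension', or 'text' from sample values."""
    non_empty = [str(v).strip() for v in values if v is not None and str(v).strip()]
    if not non_empty:
        return "unknown"
    # Digits, optionally separated by '-' or '/', cover card numbers,
    # phone numbers, ages, and dates such as birthdays.
    if all(v.replace("-", "").replace("/", "").isdigit() for v in non_empty):
        return "discrete_number"
    distinct = set(non_empty)
    # A small set of values that repeats suggests an enumeration (gender, zodiac sign).
    if len(distinct) <= max_dimension_cardinality and len(distinct) < len(non_empty):
        return "dimension"
    return "text"


def classify_fields(samples_by_field):
    """Group field names by their identified data type."""
    classified = defaultdict(list)
    for name, values in samples_by_field.items():
        classified[identify_data_type(values)].append(name)
    return dict(classified)


# Hypothetical sample values per field.
classified = classify_fields(
    {
        "phone": ["13800000000", "13900000000"],
        "gender": ["M", "F", "M", "F"],
        "remark": ["called twice", "asked for refund"],
    }
)
```

Grouping by type first means each later step only has to compare a field against historical fields of the same type.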
S203: determine the format processing rule corresponding to the classified field information according to historical field processing information, where the historical field processing information records the correspondence between historical fields and format processing rules.
Here, the historical field processing information is the record produced when historical field information was processed with its corresponding format processing rule.
Specifically, if the historical field information is of the TXT type, the format processing rule used for it is the TXT format processing rule. The current field information can then be compared with the historical field information to determine whether the two are similar; if they are, the format processing rule of the current field information is determined to be the rule corresponding to that historical field information.
In one embodiment, to determine the format processing rule corresponding to the field information quickly, step S203 includes sub-steps S203a and S203b, as shown in Fig. 3.
S203a: determine the corresponding multiple historical fields according to the data type of the field information, and calculate the Jaccard similarity coefficient between the field information and each historical field.
S203b: determine, according to the Jaccard similarity coefficients, the historical field most similar to the field information, and take the format processing rule corresponding to that most similar historical field as the format processing rule corresponding to the field information.
Specifically, the Jaccard similarity coefficient (Jaccard index) is used to compare the similarity and difference between finite sample sets. The larger the Jaccard similarity coefficient, the more similar the samples. To calculate the Jaccard similarity coefficient between the field information and a historical field, the two are treated as two sets, for example set A and set B, made up of the characters of their strings; the coefficient is the ratio of the size of the intersection of the two character sets to the size of their union. The historical field most similar to the field information is then determined from the coefficients, and its format processing rule is taken as the format processing rule corresponding to the field information.
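Written out, the coefficient described above is J(A, B) = |A ∩ B| / |A ∪ B|, where A and B are the character sets of the two strings. The sketch below applies it to pick a format processing rule; the historical field names and rule names are hypothetical, and returning 1.0 for two empty strings is a convention chosen here rather than something the patent states.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity coefficient of the character sets of two strings."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 1.0  # convention for two empty strings (assumption)
    return len(set_a & set_b) / len(union)


def pick_format_rule(field: str, history_rules: dict):
    """Return the most similar historical field and its format processing rule."""
    best = max(history_rules, key=lambda hist: jaccard(field, hist))
    return best, history_rules[best]


# Hypothetical historical field processing information:
# historical field -> format processing rule.
history_rules = {
    "mobile_phone": "digits_only",
    "birth_date": "date_yyyy_mm_dd",
}

match = pick_format_rule("phone_no", history_rules)
```

For `"phone_no"`, the character sets share 6 of 10 characters with `"mobile_phone"` (coefficient 0.6) and 3 of 12 with `"birth_date"` (coefficient 0.25), so the rule of `"mobile_phone"` is selected.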
S204: process the classified field information according to the determined format processing rule to obtain corresponding data information.
Specifically, the classified field information is parsed according to the determined format processing rule, which includes identifying the format, the meaning, and the completeness of the classified field information.
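One way to picture S204 is a function that applies a named rule to a field's values and reports simple statistics. The rule names and regular expressions below are hypothetical, and "saturation" is interpreted here as the share of non-empty values, which is an assumption; the patent does not define the term or the rule contents.

```python
import re

# Hypothetical format processing rules, keyed by rule name.
FORMAT_RULES = {
    "digits_only": re.compile(r"\d+"),
    "date_yyyy_mm_dd": re.compile(r"\d{4}-\d{2}-\d{2}"),
}


def process_field(values, rule_name):
    """Apply one format rule to a field and summarise the result."""
    pattern = FORMAT_RULES[rule_name]
    # Values that are neither None nor blank count as filled in.
    filled = [str(v).strip() for v in values if v is not None and str(v).strip()]
    conforming = [v for v in filled if pattern.fullmatch(v)]
    return {
        "rule": rule_name,
        # Share of rows that carry a value at all.
        "saturation": len(filled) / len(values) if values else 0.0,
        # Share of filled values that match the expected format.
        "format_match_rate": len(conforming) / len(filled) if filled else 0.0,
    }


info = process_field(["13800000000", None, "", "139-0000"], "digits_only")
```

The returned dictionary is a stand-in for the "data information" that is later fed to the characteristic prediction model.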
To further speed up the processing and analysis of the field information, the classified field information is processed according to the determined format processing rule using distributed processing technology to obtain the corresponding data information, where the distributed processing technology includes Hadoop or Spark processing technology.
For example, the determined format processing rule and the classified field information can be sent to a host of the distributed system, so that the host processes the classified field information according to the rule to obtain the corresponding data information; the data information fed back by the host is then received.
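The dispatch-and-collect pattern described above can be sketched on a single machine. This example deliberately does not use Hadoop or Spark: a standard-library thread pool stands in for the hosts of the distributed system, and `process_on_host` is a hypothetical placeholder for the work each host would perform.

```python
from concurrent.futures import ThreadPoolExecutor


def process_on_host(task):
    """Stand-in for the work one host does for one classified field."""
    field_name, values, rule_name = task
    filled = sum(1 for v in values if v not in (None, ""))
    info = {
        "rule": rule_name,
        "saturation": filled / len(values) if values else 0.0,
    }
    return field_name, info


def process_distributed(tasks, workers=4):
    """Send each (field, values, rule) task to a worker and collect the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves task order; each result is a (field, info) pair.
        return dict(pool.map(process_on_host, tasks))


# Hypothetical tasks: field name, sample values, determined rule name.
tasks = [
    ("phone", ["13800000000", None], "digits_only"),
    ("gender", ["M", "F"], "enum"),
]
results = process_distributed(tasks)
```

In a real cluster, the same shape would apply: the rule and the field data travel to a worker, and only the summarised data information travels back.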
In one embodiment, as shown in Fig. 4, the following is also performed after step S204. S204a: determine whether the data information is address information. S204b: if the data information is address information, obtain the complete address information from a matching address library and add it to the data information.
For example, if the data information is address information, the user may not have filled it in completely, for example entering only a district and a street without the province. The complete address information can then be obtained from the matching address library and added to the data information.
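A minimal sketch of S204a and S204b follows. The patent does not describe how address information is detected or how the matching address library is queried, so the keyword check, the suffix match, and the library entries are all illustrative assumptions.

```python
# Hypothetical matching address library of complete addresses.
ADDRESS_LIBRARY = [
    "Guangdong Province, Shenzhen, Futian District, Fuhua Road",
    "Guangdong Province, Guangzhou, Tianhe District, Tiyu West Road",
]

# Hypothetical keywords used to decide whether a value is an address.
ADDRESS_KEYWORDS = ("Road", "Street", "District")


def looks_like_address(text: str) -> bool:
    """S204a: crude check for whether the data information is an address."""
    return any(keyword in text for keyword in ADDRESS_KEYWORDS)


def supplement_address(text: str, library=ADDRESS_LIBRARY) -> str:
    """S204b: complete a partial address when exactly one library entry matches."""
    if not looks_like_address(text):
        return text
    matches = [full for full in library if full.endswith(text)]
    # Leave the value untouched when the match is missing or ambiguous.
    return matches[0] if len(matches) == 1 else text


completed = supplement_address("Futian District, Fuhua Road")
```

Refusing to fill in the address when several library entries match avoids writing a wrong province into the data information.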
S205: based on the pre-trained characteristic prediction model, predict the data characteristics of the data table from the data information.
Specifically, the characteristic prediction model is an algorithm model trained in advance. Its input parameter is the data information of the data table, and it is used to predict the data characteristics of the data in the table so that the data can be put to better use. The data characteristics include information such as data field type, data field format, and saturation. The method makes it possible to understand the characteristics of the data quickly, which helps with subsequent use of the data.
In this embodiment, the data analysis method scans a data table in a database to obtain field information of the data table; identifies the data type corresponding to the field information and classifies the field information according to the data type to obtain classified field information; determines the format processing rule corresponding to the classified field information according to historical field processing information, which records the correspondence between historical fields and format processing rules; processes the classified field information according to the determined format processing rule to obtain corresponding data information; and predicts the data characteristics of the data table from the data information, based on the pre-trained characteristic prediction model. This speeds up data analysis: the value of the data is understood through its data characteristics before the data is used, which in turn improves work efficiency.
Fig. 5 is a schematic block diagram of a data analysis apparatus provided by an embodiment of this application. As shown in Fig. 5, corresponding to the data analysis method above, this application also provides a data analysis apparatus. The data analysis apparatus includes units for performing the data analysis method described above, and the apparatus can be configured in a server.
As shown in Fig. 5, the data analysis apparatus 400 includes: a data determination unit 401, an extraction and construction unit 402, a model training unit 403, a scanning and acquiring unit 404, an identification and classification unit 405, a rule determination unit 406, an information processing unit 407, and a characteristic prediction unit 408.
The data determination unit 401 is configured to obtain the data of a historical data table as sample data.
The extraction and construction unit 402 is configured to perform feature extraction on the sample data to obtain feature fields, and to construct a feature vector from the feature fields.
The model training unit 403 is configured to perform model training on the feature vector, based on a logistic regression algorithm, to obtain the pre-trained characteristic prediction model.
The scanning and acquiring unit 404 is configured to scan a data table in a database to obtain field information of the data table.
The identification and classification unit 405 is configured to identify the data type corresponding to the field information, and to classify the field information according to the data type to obtain classified field information.
The rule determination unit 406 is configured to determine the format processing rule corresponding to the classified field information according to historical field processing information, where the historical field processing information records the correspondence between historical fields and format processing rules.
In one embodiment, as shown in Fig. 6, the rule determination unit 406 includes a determination and calculation subunit 4061 and a determination and setting subunit 4062.
The determination and calculation subunit 4061 is configured to determine the corresponding multiple historical fields according to the data type of the field information, and to calculate the Jaccard similarity coefficient between the field information and each historical field. The determination and setting subunit 4062 is configured to determine, according to the Jaccard similarity coefficients, the historical field most similar to the field information, and to take the format processing rule corresponding to that most similar historical field as the format processing rule corresponding to the field information.
The information processing unit 407 is configured to process the classified field information according to the determined format processing rule to obtain corresponding data information.
The characteristic prediction unit 408 is configured to predict the data characteristics of the data table from the data information, based on the pre-trained characteristic prediction model.
In one embodiment, as shown in Fig. 7, the data analysis apparatus 500 further includes an information judging unit 501 and an acquisition and supplement unit 502.
The information judging unit 501 is configured to determine whether the data information is address information. The acquisition and supplement unit 502 is configured to obtain, if the data information is address information, the complete address information from a matching address library and add it to the data information.
A person skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the data analysis apparatus and units described above can be found in the corresponding processes of the foregoing method embodiments and are not repeated here.
The apparatus described above can be implemented in the form of a computer program, and the computer program can run on a computer device as shown in Fig. 8.
Referring to Fig. 8, Fig. 8 is a schematic block diagram of a computer device provided by an embodiment of this application. The computer device 700 can be a server.
Referring to Fig. 8, the computer device 700 includes a processor 720, a memory, and a network interface 750 connected through a system bus 710, where the memory may include a non-volatile storage medium 730 and an internal memory 740.
The non-volatile storage medium 730 can store an operating system 731 and a computer program 732. When executed, the computer program 732 can cause the processor 720 to perform any one of the data analysis methods.
The processor 720 provides computing and control capabilities and supports the operation of the entire computer device 700.
The internal memory 740 provides an environment for running the computer program 732 stored in the non-volatile storage medium 730. When executed by the processor 720, the computer program 732 can cause the processor 720 to perform any one of the data analysis methods.
The network interface 750 is used for network communication, such as sending assigned tasks. A person skilled in the art will understand that the structure shown in Fig. 8 is only a block diagram of the part of the structure relevant to the solution of this application and does not limit the computer device 700 to which the solution is applied; a specific computer device 700 may include more or fewer components than shown, combine certain components, or have a different arrangement of components. The processor 720 is configured to run program code stored in the memory to implement the following steps:
scanning a data table in a database to obtain field information of the data table;
identifying the data type corresponding to the field information, and classifying the field information according to the data type to obtain classified field information;
determining the format processing rule corresponding to the classified field information according to historical field processing information, where the historical field processing information records the correspondence between historical fields and format processing rules;
processing the classified field information according to the determined format processing rule to obtain corresponding data information;
predicting the data characteristics of the data table from the data information, based on a pre-trained characteristic prediction model.
In one embodiment, before implementing the scanning of a data table in a database to obtain field information of the data table, the processor 720, running program code stored in the memory, also implements the following steps:
obtaining the data of a historical data table as sample data;
performing feature extraction on the sample data to obtain feature fields, and constructing a feature vector from the feature fields;
performing model training on the feature vector, based on a logistic regression algorithm, to obtain the pre-trained characteristic prediction model.
In one embodiment, when implementing the determining of the format processing rule corresponding to the classified field information according to historical field processing information, the processor 720, running program code stored in the memory, specifically implements the following steps:
determining the corresponding multiple historical fields according to the data type of the field information, and calculating the Jaccard similarity coefficient between the field information and each historical field;
determining, according to the Jaccard similarity coefficients, the historical field most similar to the field information, and taking the format processing rule corresponding to that most similar historical field as the format processing rule corresponding to the field information.
In one embodiment, when implementing the processing of the classified field information according to the determined format processing rule to obtain corresponding data information, the processor 720, running program code stored in the memory, specifically implements the following step:
processing the classified field information according to the determined format processing rule using distributed processing technology to obtain the data information corresponding to the classified field information, where the distributed processing technology includes Hadoop or Spark processing technology.
In one embodiment, after implementing the processing of the classified field information according to the determined format processing rule to obtain corresponding data information, the processor 720, running program code stored in the memory, also implements the following steps:
determining whether the data information is address information;
if the data information is address information, obtaining the complete address information from a matching address library and adding it to the data information.
In one embodiment, when implementing the scanning of a data table in a database to obtain field information of the data table, the processor 720, running program code stored in the memory, specifically implements the following step:
scanning the data table in the database by periodic polling through a distributed system to obtain the field information of the data table.
It should be appreciated that in the embodiment of the present application, processor 720 can be central processing unit (Central ProcessingUnit, CPU), which can also be other general processors, digital signal processor (Digital Signal Processor, DSP), specific integrated circuit (Application Specific Integrated Circuit, ASIC), ready-made programmable gate array (Field-Programmable GateArray, FPGA) or other programmable logic devices Part, discrete gate or transistor logic, discrete hardware components etc..Wherein, general processor can be microprocessor or The processor is also possible to any conventional processor etc..
Those skilled in the art will understand that the structure of the computer device 700 shown in Fig. 8 does not limit the computer device 700, which may include more or fewer components than illustrated, combine certain components, or use a different arrangement of components.
Those of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments may be completed by a computer program instructing the relevant hardware. The computer program may be stored in a storage medium, which is a computer-readable storage medium. In the embodiments of the present invention, the computer program may be stored in a storage medium of a computer system and executed by at least one processor of that computer system to implement the process steps of the method embodiments described above.
The computer-readable storage medium may be a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk, an optical disc, or any other medium capable of storing program code.
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above in general terms of their functions. Whether these functions are performed in hardware or software depends on the specific application and the design constraints of the technical solution. Skilled practitioners may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of this application.
In the several embodiments provided in this application, it should be understood that the disclosed data analysis device and method may be implemented in other ways. For example, the data analysis device embodiments described above are merely illustrative. The division into units is only a division by logical function, and other divisions are possible in actual implementation. For example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed.
The steps in the methods of the embodiments of this application may be reordered, merged, or deleted according to actual needs.
The units in the devices of the embodiments of this application may be combined, divided, or deleted according to actual needs.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes a number of instructions that cause a computer device (which may be a personal computer, a terminal, a network device, or the like) to execute all or part of the steps of the methods of the embodiments of this application.
The above are only specific embodiments of this application, but the protection scope of this application is not limited thereto. Any person skilled in the art can readily conceive of various equivalent modifications or replacements within the technical scope disclosed in this application, and such modifications or replacements shall fall within the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.

Claims (10)

1. A data analysis method, comprising:
Scanning a data table in a database to obtain field information of the data table;
Identifying the data type corresponding to the field information, and classifying the field information according to the data type to obtain classified field information;
Determining the format processing rule corresponding to the classified field information according to history field processing information, wherein the history field processing information records the correspondence between history fields and format processing rules;
Processing the classified field information according to the determined format processing rule to obtain corresponding data information;
Predicting the data characteristic of the data table according to the data information, based on a pre-trained characteristic prediction model.
2. The data analysis method according to claim 1, wherein before the scanning of the data table in the database to obtain the field information of the data table, the method further comprises:
Obtaining data of historical data tables as sample data;
Performing feature extraction on the sample data to obtain feature fields, and constructing feature vectors from the feature fields;
Performing model training on the feature vectors based on a logistic regression algorithm to obtain the pre-trained characteristic prediction model.
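The training step recited in claim 2 can be illustrated with a small sketch. It assumes the feature vectors are lists of numbers and the data characteristic is a binary label; the plain batch gradient-descent trainer below is a generic logistic regression, not the applicant's implementation, and all function names and hyperparameters are hypothetical.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic_regression(features, labels, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n_features = len(features[0])
    weights = [0.0] * n_features
    bias = 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(features, labels):
            # Prediction error for one sample: predicted probability minus label.
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict(weights, bias, x) -> int:
    """Return 1 when the predicted probability is at least 0.5, else 0."""
    return int(sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) >= 0.5)
```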
3. The data analysis method according to claim 1, wherein the determining of the format processing rule corresponding to the classified field information according to the history field processing information comprises:
Determining a plurality of corresponding history fields according to the data type of the field information, and calculating the Jaccard similarity coefficient between the field information and each of the history fields;
Determining, according to the Jaccard similarity coefficients, the history field most similar to the field information, and using the format processing rule corresponding to that most similar history field as the format processing rule corresponding to the field information.
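The rule-matching step recited in claim 3 can be sketched as below. The claim does not say how a field is turned into a set, so this sketch assumes each field is represented by the set of lowercase tokens in its name, and it omits the preliminary filtering of history fields by data type. `HISTORY_RULES` and the rule names are made up for illustration.

```python
import re


def tokenize(field_name: str) -> set:
    """Split a field name into lowercase alphanumeric tokens."""
    return {t for t in re.split(r"[^0-9a-zA-Z]+", field_name.lower()) if t}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity coefficient |A ∩ B| / |A ∪ B| (0.0 when both sets are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical history field processing information: history field -> format processing rule.
HISTORY_RULES = {
    "customer_birth_date": "normalize_date",
    "customer_phone_number": "normalize_phone",
    "home_address": "normalize_address",
}


def pick_rule(field_name: str, history=HISTORY_RULES):
    """Return (most similar history field, its rule, similarity coefficient)."""
    tokens = tokenize(field_name)
    best = max(history, key=lambda h: jaccard(tokens, tokenize(h)))
    return best, history[best], jaccard(tokens, tokenize(best))
```

For a new field named `birth_date`, the token sets share `birth` and `date` with `customer_birth_date` out of three distinct tokens, so that history field wins with a coefficient of 2/3 and its rule is reused.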
4. The data analysis method according to claim 1, wherein the processing of the classified field information according to the determined format processing rule to obtain corresponding data information comprises:
Processing the classified field information according to the determined format processing rule by using a distributed processing technology, to obtain the data information corresponding to the classified field information, wherein the distributed processing technology includes Hadoop system or Spark system processing technology.
5. The data analysis method according to claim 1, wherein after the processing of the classified field information according to the determined format processing rule to obtain corresponding data information, the method further comprises:
Determining whether the data information is address information;
If the data information is address information, obtaining complete address information from a matching address library and adding it to the data information.
6. The data analysis method according to claim 1, wherein the scanning of the data table in the database to obtain the field information of the data table comprises:
Scanning the data table in the database by periodic polling of a distributed system, to obtain the field information of the data table.
7. The data analysis method according to claim 1, wherein the data type includes a text type, a dimension type, and a discrete data type.
8. A data analysis device, comprising:
A scanning and obtaining unit, configured to scan a data table in a database to obtain field information of the data table;
An identifying and classifying unit, configured to identify the data type corresponding to the field information, and to classify the field information according to the data type to obtain classified field information;
A rule determining unit, configured to determine the format processing rule corresponding to the classified field information according to history field processing information, wherein the history field processing information records the correspondence between history fields and format processing rules;
An information processing unit, configured to process the classified field information according to the determined format processing rule to obtain corresponding data information;
A characteristic prediction unit, configured to predict the data characteristic of the data table according to the data information, based on a pre-trained characteristic prediction model.
9. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 7.
10. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to perform the steps of the method according to any one of claims 1 to 7.
CN201811188076.6A 2018-10-12 2018-10-12 Data analysing method, device, computer equipment and storage medium Pending CN109388675A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811188076.6A CN109388675A (en) 2018-10-12 2018-10-12 Data analysing method, device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811188076.6A CN109388675A (en) 2018-10-12 2018-10-12 Data analysing method, device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN109388675A true CN109388675A (en) 2019-02-26

Family

ID=65427284

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811188076.6A Pending CN109388675A (en) 2018-10-12 2018-10-12 Data analysing method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN109388675A (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110008250A (en) * 2019-03-07 2019-07-12 平安科技(深圳)有限公司 Social security data processing method, device and computer equipment based on data mining
CN110083839A (en) * 2019-04-29 2019-08-02 珠海豹好玩科技有限公司 Text introduction method, device and equipment
CN110399434A (en) * 2019-07-25 2019-11-01 北京明略软件系统有限公司 Field classification method and device, storage medium, electronic device
CN110781183A (en) * 2019-09-10 2020-02-11 中国平安财产保险股份有限公司 Method and device for processing incremental data in Hive database and computer equipment
CN110851428A (en) * 2019-11-19 2020-02-28 厦门市美亚柏科信息股份有限公司 Database analysis method, device and medium based on rule operator dynamic arrangement
CN111159181A (en) * 2019-12-18 2020-05-15 东软集团股份有限公司 Medical data screening method and device, storage medium and electronic equipment
CN111639077A (en) * 2020-05-15 2020-09-08 杭州数梦工场科技有限公司 Data management method and device, electronic equipment and storage medium
CN112685415A (en) * 2020-12-30 2021-04-20 平安普惠企业管理有限公司 Data import method and device, computer equipment and storage medium
CN112865999A (en) * 2019-11-28 2021-05-28 华为技术有限公司 Information processing method and related equipment
CN113379004A (en) * 2021-07-26 2021-09-10 浙江大华技术股份有限公司 Data table classification method and device, electronic equipment and storage medium
WO2021179579A1 (en) * 2020-03-09 2021-09-16 平安科技(深圳)有限公司 Backup data analysis method and apparatus based on file information, and computer device
CN115543977A (en) * 2022-09-29 2022-12-30 河北雄安睿天科技有限公司 Water supply industry data cleaning method
CN116975401A (en) * 2023-09-19 2023-10-31 杭州美创科技股份有限公司 Database field identification method, device, computer equipment and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1673995A (en) * 2004-03-24 2005-09-28 微软公司 Method and apparatus for populating electronic forms from scanned documents
CN106909811A (en) * 2015-12-23 2017-06-30 腾讯科技(深圳)有限公司 The method and apparatus of ID treatment
US20170278015A1 (en) * 2016-03-24 2017-09-28 Accenture Global Solutions Limited Self-learning log classification system
US20170286388A1 (en) * 2016-04-04 2017-10-05 Accenture Global Solutions Limited Document presentation interface based on intelligent mapping
CN108197109A (en) * 2017-12-29 2018-06-22 北京百分点信息科技有限公司 A kind of multilingual analysis method and device based on natural language processing
CN108337316A (en) * 2018-02-08 2018-07-27 平安科技(深圳)有限公司 Information-pushing method, device, computer equipment and storage medium
CN108388924A (en) * 2018-03-08 2018-08-10 平安科技(深圳)有限公司 A kind of data classification method, device, equipment and computer readable storage medium

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110008250B (en) * 2019-03-07 2024-03-15 平安科技(深圳)有限公司 Social security data processing method and device based on data mining and computer equipment
CN110008250A (en) * 2019-03-07 2019-07-12 平安科技(深圳)有限公司 Social security data processing method, device and computer equipment based on data mining
CN110083839A (en) * 2019-04-29 2019-08-02 珠海豹好玩科技有限公司 Text introduction method, device and equipment
CN110083839B (en) * 2019-04-29 2023-08-22 珠海豹好玩科技有限公司 Text importing method, device and equipment
CN110399434A (en) * 2019-07-25 2019-11-01 北京明略软件系统有限公司 Field classification method and device, storage medium, electronic device
CN110781183A (en) * 2019-09-10 2020-02-11 中国平安财产保险股份有限公司 Method and device for processing incremental data in Hive database and computer equipment
CN110781183B (en) * 2019-09-10 2023-06-27 中国平安财产保险股份有限公司 Processing method and device for incremental data in Hive database and computer equipment
CN110851428B (en) * 2019-11-19 2022-05-20 厦门市美亚柏科信息股份有限公司 Database analysis method, device and medium based on rule operator dynamic arrangement
CN110851428A (en) * 2019-11-19 2020-02-28 厦门市美亚柏科信息股份有限公司 Database analysis method, device and medium based on rule operator dynamic arrangement
CN112865999A (en) * 2019-11-28 2021-05-28 华为技术有限公司 Information processing method and related equipment
CN111159181A (en) * 2019-12-18 2020-05-15 东软集团股份有限公司 Medical data screening method and device, storage medium and electronic equipment
WO2021179579A1 (en) * 2020-03-09 2021-09-16 平安科技(深圳)有限公司 Backup data analysis method and apparatus based on file information, and computer device
CN111639077A (en) * 2020-05-15 2020-09-08 杭州数梦工场科技有限公司 Data management method and device, electronic equipment and storage medium
CN111639077B (en) * 2020-05-15 2024-03-22 杭州数梦工场科技有限公司 Data management method, device, electronic equipment and storage medium
CN112685415A (en) * 2020-12-30 2021-04-20 平安普惠企业管理有限公司 Data import method and device, computer equipment and storage medium
CN113379004A (en) * 2021-07-26 2021-09-10 浙江大华技术股份有限公司 Data table classification method and device, electronic equipment and storage medium
CN113379004B (en) * 2021-07-26 2023-04-14 浙江大华技术股份有限公司 Data table classification method and device, electronic equipment and storage medium
CN115543977A (en) * 2022-09-29 2022-12-30 河北雄安睿天科技有限公司 Water supply industry data cleaning method
CN116975401A (en) * 2023-09-19 2023-10-31 杭州美创科技股份有限公司 Database field identification method, device, computer equipment and storage medium

Similar Documents

Publication Publication Date Title
CN109388675A (en) Data analysing method, device, computer equipment and storage medium
CN108628741A (en) Webpage test method, device, electronic equipment and medium
CN108415980A (en) Question and answer data processing method, electronic device and storage medium
CN110222330B (en) Semantic recognition method and device, storage medium and computer equipment
CN112883190A (en) Text classification method and device, electronic equipment and storage medium
CN115130711A (en) Data processing method and device, computer and readable storage medium
CN110276382A (en) Listener clustering method, apparatus and medium based on spectral clustering
CN112328909A (en) Information recommendation method and device, computer equipment and medium
CN112906361A (en) Text data labeling method and device, electronic equipment and storage medium
CN110968664A (en) Document retrieval method, device, equipment and medium
CN110442803A (en) Data processing method, device, medium and the calculating equipment executed by calculating equipment
CN114692889A (en) Meta-feature training model for machine learning algorithm
CN112579781A (en) Text classification method and device, electronic equipment and medium
CN111782774B (en) Method and device for recommending problems
CN111597336A (en) Processing method and device of training text, electronic equipment and readable storage medium
CN109033078B (en) The recognition methods of sentence classification and device, storage medium, processor
CN115952800A (en) Named entity recognition method and device, computer equipment and readable storage medium
CN114780688A (en) Text quality inspection method, device and equipment based on rule matching and storage medium
CN107577760A (en) A kind of file classification method and device based on constrained qualification
CN114141235A (en) Voice corpus generation method and device, computer equipment and storage medium
JP5824429B2 (en) Spam account score calculation apparatus, spam account score calculation method, and program
CN112329943A (en) Combined index selection method and device, computer equipment and medium
CN117010349B (en) Form filling method, system and storage medium based on neural network model
CN114021739B (en) Business processing method, business processing model training device and electronic equipment
CN111680513B (en) Feature information identification method and device and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination