CN110362620A

CN110362620A - A machine-learning-based method for structuring tabular data

Info

Publication number: CN110362620A
Application number: CN201910623601.0A
Authority: CN
Inventors: 廖闻剑; 李曙光; 宋万军; 姜广栋; 杨万刚; 尹若成
Original assignee: NANJING FIBERHOME INFORMATION DEVELOPMENT Co Ltd
Current assignee: NANJING FIBERHOME INFORMATION DEVELOPMENT Co Ltd
Priority date: 2019-07-11
Filing date: 2019-07-11
Publication date: 2019-10-22
Anticipated expiration: 2039-07-11
Also published as: CN110362620B

Abstract

The present invention relates to a machine-learning-based method for structuring tabular data. The objects in the cells of a large number of sample spreadsheets are counted to build a dictionary table. For each cell of the spreadsheet to be processed, a score is obtained from the number of times the cell's object occurs in that spreadsheet and the quantity recorded for the object in the dictionary table. With the cell score as the smallest unit, rows and columns are compared to determine the header rows or header columns of the spreadsheet to be processed, from which the header items are obtained; the data items are then extracted and structured on the basis of the header items. This overcomes the shortcomings of prior-art rule-based approaches, which identify only horizontal headers and cannot identify multiple headers, and achieves accurate and efficient structuring of spreadsheet data.

Description

A machine-learning-based method for structuring tabular data

Technical field

The present invention relates to a machine-learning-based method for structuring tabular data and belongs to the technical field of tabular data structuring.

Background art

The spreadsheet is one of the most commonly used computer software tools. In the prior art, for a Sheet (spreadsheet) whose content is unknown, only the data item of each cell can be read after the file is opened. The steps are as follows:

(1) Open the Excel file through the interface;

(2) Read the Sheet in the Excel file through the interface;

(3) Read the cells in the Sheet through the interface.

When the above method is executed, the meaning of each data item is unknown, so the data cannot be structured. The meaning of a data item is described by the table's header; without knowing the header, the data cannot be understood. Therefore, to structure tabular data, some work adopts an assumption, namely that the header is located in the first row of the table. Under this assumption, the header is extracted first and the data afterwards, which completes the structuring of the tabular data. The steps are as follows:

(1) Open the Excel file through the interface;

(2) Read the Sheet in the Excel file through the interface;

(3) Read the cells of the first row of the Sheet through the interface and use them as the header;

(4) Read the data corresponding to each header item column by column to complete the structuring of the data.
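The first-row assumption described in the steps above can be sketched as follows. This is a minimal illustration only: the sheet is assumed to be already loaded into a list of rows, and the function name `structure_with_first_row_header` is hypothetical, not taken from the source.

```python
from typing import Any, Dict, List


def structure_with_first_row_header(sheet: List[List[Any]]) -> List[Dict[str, Any]]:
    """Treat row 0 as the header and map every later row onto it.

    Assumption: `sheet` is a rectangular list of rows already read from the file.
    """
    if not sheet:
        return []
    header = [str(item) for item in sheet[0]]
    # Each remaining row becomes one record keyed by the header items.
    return [dict(zip(header, row)) for row in sheet[1:]]


sheet = [
    ["Name", "Phone"],
    ["Alice", "13800000000"],
    ["Bob", "13900000000"],
]
records = structure_with_first_row_header(sheet)
```

If a title row or an empty row precedes the real header, this sketch silently uses the wrong row as keys, which is the defect discussed next.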

This assumption has an obvious defect: only horizontal headers can be extracted, and the header must be in the first row. Tables with vertical headers, tables whose header is not in the first row, and tables containing multiple header rows are misjudged. For this reason, some work improves on this approach using prior knowledge and solves the problem of a header that is not in the first row. The steps are as follows:

(1) Open the Excel file through the interface;

(2) Read the Sheet in the Excel file through the interface;

(3) Read the data of each row and column in the Sheet in turn through the interface until recognizable data is encountered (matched by rules, such as a mobile phone number, an ID card number, or a bank card number); then search upward from that row and column for the first row that does not satisfy the rule, and use that row as the header;

This approach also has problems: tables with vertical headers and tables containing multiple headers are still misjudged, and for tables that contain no recognizable data the header cannot be identified at all.
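The rule-based variant can be sketched in the same spirit. The regular expression below is a hypothetical stand-in for "recognizable data" (an 11-digit number starting with 1), and the function name is illustrative; the source does not specify the actual rules.

```python
import re
from typing import Any, List, Optional

# Hypothetical rule: an 11-digit number starting with 1 (stand-in for a mobile number).
PHONE_RULE = re.compile(r"^1\d{10}$")


def find_header_row_by_rule(sheet: List[List[Any]]) -> Optional[int]:
    """Return the index of the presumed header row, or None if it cannot be found.

    Assumption: `sheet` is a rectangular list of rows.
    """
    for r, row in enumerate(sheet):
        for c, cell in enumerate(row):
            if PHONE_RULE.match(str(cell)):
                # Walk upward in the same column to the first cell
                # that does not satisfy the rule.
                for up in range(r - 1, -1, -1):
                    if not PHONE_RULE.match(str(sheet[up][c])):
                        return up
                return None  # recognizable data starts at the top: no header above it
    return None  # no recognizable data: the header cannot be identified


sheet = [
    ["Contact list", ""],
    ["Name", "Phone"],
    ["Alice", "13800000000"],
]
header_index = find_header_row_by_rule(sheet)
```

The second `return None` is the failure case named in the text: a table without any rule-matching data yields no header at all.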

Summary of the invention

The technical problem to be solved by the present invention is to provide a machine-learning-based method for structuring tabular data that can accurately identify the header items in a spreadsheet and, based on those header items, efficiently structure the data items in the spreadsheet.

To solve the above technical problem, the present invention adopts the following technical solution. The present invention provides a machine-learning-based method for structuring tabular data, used to structure the data items in a spreadsheet to be processed, characterized in that it comprises the following steps:

Step A. For a preset number of sample spreadsheets, count the objects in the cells to obtain each object and its corresponding quantity, and build a dictionary table; then proceed to Step B;
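Step A can be pictured as a frequency count over all cells of the sample spreadsheets. The sketch below assumes that an "object" is the trimmed text of a cell and that empty cells are skipped; both are assumptions, since the source does not define them, and `build_dictionary` is a hypothetical name.

```python
from collections import Counter
from typing import Any, List


def build_dictionary(sample_sheets: List[List[List[Any]]]) -> Counter:
    """Count how often each cell object occurs across all sample sheets."""
    dictionary: Counter = Counter()
    for sheet in sample_sheets:
        for row in sheet:
            for cell in row:
                text = "" if cell is None else str(cell).strip()
                if text:  # assumption: empty cells are not counted
                    dictionary[text] += 1
    return dictionary


samples = [
    [["Name", "Phone"], ["Alice", "13800000000"]],
    [["Name", "Age"], ["Bob", 30]],
]
dictionary = build_dictionary(samples)
```

Objects that recur across many samples, such as typical header words, end up with large quantities, while one-off data values stay near 1.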

Step B. For each cell in the spreadsheet to be processed, count the number of times, count, that the object in the cell occurs in the spreadsheet to be processed; then proceed to Step C;

Step C. For each cell in the spreadsheet to be processed, obtain the quantity c that corresponds to the object in the cell in the dictionary table; if the dictionary table does not contain the object in a cell, the quantity c for that cell is 0; then proceed to Step D;
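Steps B and C produce two numbers per cell, count and c. A minimal sketch, assuming objects are compared as strings and the dictionary table is a plain mapping (the function name `count_and_lookup` is hypothetical):

```python
from collections import Counter
from typing import Any, Dict, List, Tuple


def count_and_lookup(
    sheet: List[List[Any]], dictionary: Dict[str, int]
) -> Tuple[List[List[int]], List[List[int]]]:
    """Return (count_grid, c_grid) with the same shape as `sheet`."""
    # Step B: occurrences of each object inside the sheet being processed.
    occurrences = Counter(str(cell) for row in sheet for cell in row)
    count_grid = [[occurrences[str(cell)] for cell in row] for row in sheet]
    # Step C: quantity in the dictionary table, 0 when the object is absent.
    c_grid = [[dictionary.get(str(cell), 0) for cell in row] for row in sheet]
    return count_grid, c_grid


sheet = [
    ["Name", "City"],
    ["Alice", "Nanjing"],
    ["Bob", "Nanjing"],
]
count_grid, c_grid = count_and_lookup(sheet, {"Name": 5, "City": 3})
```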

Step D. For each cell in the spreadsheet to be processed, according to the following formula:

obtain the score corresponding to the cell; then proceed to Step E;
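Because the scoring formula itself is not reproduced in this text, the sketch below only shows how a per-cell score could be wired in: the formula is passed in as a function of count and c. The `placeholder_score` function is purely a stand-in chosen for illustration and should not be read as the patent's formula.

```python
from typing import Callable, List


def score_cells(
    count_grid: List[List[int]],
    c_grid: List[List[int]],
    score_fn: Callable[[int, int], float],
) -> List[List[float]]:
    """Apply a caller-supplied scoring function to every (count, c) pair."""
    return [
        [score_fn(count, c) for count, c in zip(count_row, c_row)]
        for count_row, c_row in zip(count_grid, c_grid)
    ]


def placeholder_score(count: int, c: int) -> float:
    """PLACEHOLDER ONLY: not the formula of the source, which is not available here.

    count is at least 1 for any cell, because the cell itself is one occurrence.
    """
    return c / count


score_grid = score_cells([[1, 1], [1, 2]], [[5, 3], [0, 0]], placeholder_score)
```

Keeping the formula as a parameter means the later steps can be exercised without committing to any particular scoring rule.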

Step E. For each row in the spreadsheet to be processed, obtain the sum of the scores of the cells in the row as the score of that row;

Meanwhile, for each column in the spreadsheet to be processed, obtain the sum of the scores of the cells in the column as the score of that column;

The scores of the rows and of the columns of the spreadsheet to be processed are thereby obtained; then proceed to Step F;
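Step E reduces the grid of cell scores to one score per row and one per column. A short sketch, assuming a rectangular grid of already computed cell scores (the function name is hypothetical):

```python
from typing import List, Tuple


def row_and_column_scores(
    score_grid: List[List[float]],
) -> Tuple[List[float], List[float]]:
    """Sum cell scores along each row and along each column."""
    row_scores = [sum(row) for row in score_grid]
    # zip(*grid) transposes the grid so that each tuple is one column.
    col_scores = [sum(col) for col in zip(*score_grid)]
    return row_scores, col_scores


row_scores, col_scores = row_and_column_scores(
    [[5.0, 3.0], [0.0, 0.0], [0.0, 1.0]]
)
```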

Step F. According to the scores of the rows of the spreadsheet to be processed, cluster the rows; for each row cluster, take the average of the scores of its rows as the score of that cluster; then select the row cluster with the highest score as the candidate row cluster;

Meanwhile, according to the scores of the columns of the spreadsheet to be processed, cluster the columns; for each column cluster, take the average of the scores of its columns as the score of that cluster; then select the column cluster with the highest score as the candidate column cluster;

Then proceed to Step G;
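The selection part of Step F can be sketched independently of how the clusters were formed. Here a cluster is assumed to be a non-empty list of row (or column) indices, and `select_candidate_cluster` is a hypothetical name:

```python
from typing import List


def select_candidate_cluster(
    clusters: List[List[int]], scores: List[float]
) -> List[int]:
    """Return the cluster whose members have the highest average score."""

    def cluster_score(cluster: List[int]) -> float:
        return sum(scores[i] for i in cluster) / len(cluster)

    return max(clusters, key=cluster_score)


row_scores = [8.0, 0.0, 1.0, 7.0]
candidate_rows = select_candidate_cluster([[0, 3], [1, 2]], row_scores)
```

The same function serves for columns when column indices and column scores are passed in.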

Step G. Among the rows in the candidate row cluster, select the row with the highest score and, from the score of that row, obtain the average score of the non-empty cells in the row as the row cell average score;

Meanwhile, among the columns in the candidate column cluster, select the column with the highest score and, from the score of that column, obtain the average score of the non-empty cells in the column as the column cell average score;

Then proceed to Step H;
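Step G can be read as dividing the best line's score by its number of non-empty cells. The sketch below follows that reading; treating `None` and whitespace-only text as empty is an assumption, and the function name is hypothetical.

```python
from typing import Any, List, Sequence


def cell_average_of_best_line(
    cluster: List[int], line_scores: List[float], lines: Sequence[Sequence[Any]]
) -> float:
    """Average score per non-empty cell of the highest-scoring line in `cluster`.

    `lines` holds the rows of the sheet, or its columns when called for columns.
    """
    best = max(cluster, key=lambda i: line_scores[i])
    non_empty = sum(
        1 for cell in lines[best] if cell is not None and str(cell).strip() != ""
    )
    return line_scores[best] / non_empty if non_empty else 0.0


sheet = [["Name", "Phone", ""], ["Alice", "13800000000", "x"]]
row_avg = cell_average_of_best_line([0], [8.0, 0.0], sheet)
columns = list(zip(*sheet))  # transpose to address columns as lines
col_avg = cell_average_of_best_line([0, 1], [5.0, 3.0, 0.0], columns)
```

Normalizing by non-empty cells keeps a sparsely filled header line from being penalized against a fully filled one.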

Step H. If the row cell average score is greater than the column cell average score, the rows in the candidate row cluster are the header rows of the spreadsheet to be processed; obtain the header items in them and proceed to Step J;

If the row cell average score is less than the column cell average score, the columns in the candidate column cluster are the header columns of the spreadsheet to be processed; obtain the header items in them and proceed to Step J;
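The comparison in Step H is a two-way decision. In the sketch below, the tie case returns `None` because the text does not say what happens when the two averages are equal; the function name and the `"rows"`/`"columns"` labels are illustrative.

```python
from typing import List, Optional, Tuple


def choose_header_orientation(
    row_cell_avg: float,
    col_cell_avg: float,
    candidate_rows: List[int],
    candidate_cols: List[int],
) -> Optional[Tuple[str, List[int]]]:
    """Decide whether the header runs along rows or along columns."""
    if row_cell_avg > col_cell_avg:
        return "rows", sorted(candidate_rows)
    if row_cell_avg < col_cell_avg:
        return "columns", sorted(candidate_cols)
    return None  # equal averages: behaviour not specified in the text


decision = choose_header_orientation(4.0, 2.5, [0], [1, 0])
```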

Step J. According to the header items of the spreadsheet to be processed, read the data items in the spreadsheet to be processed and structure the tabular data.
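Step J can be sketched once the header rows or columns are known. The text does not say how several header rows are combined, so joining them with "/" per column is an assumption made here for illustration, as is transposing the sheet when the header runs along columns.

```python
from typing import Any, Dict, List


def structure_sheet(
    sheet: List[List[Any]], orientation: str, header_indices: List[int]
) -> List[Dict[str, Any]]:
    """Turn a rectangular sheet into records keyed by its header items."""
    if orientation == "columns":
        # Transpose so that header columns become header rows.
        sheet = [list(col) for col in zip(*sheet)]
    header_set = set(header_indices)
    header_rows = [sheet[i] for i in sorted(header_set)]
    # Assumption: multi-row headers are joined per column with "/".
    keys = [
        "/".join(str(row[c]) for row in header_rows if str(row[c]) != "")
        for c in range(len(sheet[0]))
    ]
    return [
        dict(zip(keys, row)) for i, row in enumerate(sheet) if i not in header_set
    ]


horizontal = [["Name", "Phone"], ["Alice", "138"], ["Bob", "139"]]
vertical = [["Name", "Alice", "Bob"], ["Phone", "138", "139"]]
records_h = structure_sheet(horizontal, "rows", [0])
records_v = structure_sheet(vertical, "columns", [0])
```

Both layouts yield the same records, which is the point of detecting the header orientation before extraction.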

As a preferred technical solution of the present invention: in Step A, after the dictionary table is built, it is updated by the following Steps I to II before proceeding to Step B;

Step I. Obtain the maximum of the quantities corresponding to the objects in the dictionary table; then proceed to Step II;

Step II. For each object in the dictionary table, perform the following Steps II-1 to II-2 to update the quantity corresponding to the object, thereby updating the dictionary table;

Step II-1. Determine whether the object belongs to a preset header-item set; if so, set the quantity corresponding to the object to the maximum quantity; otherwise proceed to Step II-2;

Step II-2. Determine whether the object belongs to a preset data-item set; if so, set the quantity corresponding to the object to 0; otherwise leave the quantity corresponding to the object unchanged.
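Steps I and II amount to overriding the learned quantities with prior knowledge. A minimal sketch, assuming the dictionary table is a plain mapping and the two preset sets are supplied by the caller (all names are hypothetical):

```python
from typing import Dict, Set


def update_dictionary(
    dictionary: Dict[str, int],
    preset_headers: Set[str],
    preset_data_items: Set[str],
) -> Dict[str, int]:
    """Pin known header items to the maximum quantity and known data items to 0."""
    if not dictionary:
        return {}
    max_quantity = max(dictionary.values())  # Step I
    updated = {}
    for obj, quantity in dictionary.items():  # Step II
        if obj in preset_headers:  # Step II-1
            updated[obj] = max_quantity
        elif obj in preset_data_items:  # Step II-2
            updated[obj] = 0
        else:
            updated[obj] = quantity
    return updated


updated = update_dictionary(
    {"Name": 10, "Phone": 4, "Male": 7, "Alice": 1},
    preset_headers={"Phone"},
    preset_data_items={"Male"},
)
```

In this example "Male" occurs often in the samples but is known to be data, so its quantity is forced to 0 and it no longer looks like a header item.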

As a preferred technical solution of the present invention: in Step F, according to the scores of the rows of the spreadsheet to be processed, the rows are clustered by the following Steps Fa-1 to Fa-3;

Step Fa-1. Obtain the minimum row score and the maximum row score among the scores of the rows of the spreadsheet to be processed, and proceed to Step Fa-2;

Step Fa-2. Divide the span from the minimum row score to the maximum row score according to the preset row score levels to obtain the row score intervals; then proceed to Step Fa-3;

Step Fa-3. According to the scores of the rows of the spreadsheet to be processed, assign each row to a row score interval; each row score interval that contains rows of the spreadsheet to be processed constitutes a row cluster;

Meanwhile, according to the scores of the columns of the spreadsheet to be processed, the columns are clustered by the following Steps Fb-1 to Fb-3;

Step Fb-1. Obtain the minimum column score and the maximum column score among the scores of the columns of the spreadsheet to be processed, and proceed to Step Fb-2;

Step Fb-2. Divide the span from the minimum column score to the maximum column score according to the preset column score levels to obtain the column score intervals; then proceed to Step Fb-3;

Step Fb-3. According to the scores of the columns of the spreadsheet to be processed, assign each column to a column score interval; each column score interval that contains columns of the spreadsheet to be processed constitutes a column cluster.
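The interval-based clustering of Steps Fa-1 to Fa-3 (and Fb-1 to Fb-3) can be sketched as binning. The sketch assumes that the preset score levels mean a fixed number of equal-width intervals, which the text does not state explicitly; `cluster_by_score_levels` is a hypothetical name.

```python
from typing import Dict, List


def cluster_by_score_levels(scores: List[float], levels: int) -> List[List[int]]:
    """Group line indices by the score interval they fall into.

    Assumption: the span [min, max] is cut into `levels` equal-width intervals.
    Only intervals that actually contain a line become clusters.
    """
    low, high = min(scores), max(scores)  # Step Fa-1 / Fb-1
    if high == low:
        return [list(range(len(scores)))]  # all lines share one score
    width = (high - low) / levels  # Step Fa-2 / Fb-2
    bins: Dict[int, List[int]] = {}
    for index, score in enumerate(scores):  # Step Fa-3 / Fb-3
        # The maximum score is clamped into the last interval.
        interval = min(int((score - low) / width), levels - 1)
        bins.setdefault(interval, []).append(index)
    return [bins[interval] for interval in sorted(bins)]


clusters = cluster_by_score_levels([8.0, 0.0, 1.0, 7.0, 4.0], levels=4)
```

With four levels over the span 0 to 8, rows 1 and 2 share the lowest interval, row 4 sits alone in a middle interval, and rows 0 and 3 form the top cluster.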

Compared with the prior art, the machine-learning-based method for structuring tabular data of the present invention, by adopting the above technical solution, has the following technical effects:

The method counts the objects in the cells of a large number of sample spreadsheets to build a dictionary table. For each cell of the spreadsheet to be processed, it obtains a score from the number of times the cell's object occurs in that spreadsheet and the quantity recorded for the object in the dictionary table. With the cell score as the smallest unit, rows and columns are compared to determine the header rows or header columns of the spreadsheet to be processed, from which the header items are obtained; the data items are then extracted and structured on the basis of the header items. This overcomes the shortcomings of prior-art rule-based approaches, which identify only horizontal headers and cannot identify multiple headers, and achieves accurate and efficient structuring of spreadsheet data.

Brief description of the drawings

Fig. 1 is a schematic diagram of the machine-learning-based method for structuring tabular data designed by the present invention.

Detailed description of the embodiments

Specific embodiments of the present invention are described in further detail below with reference to the accompanying drawing.

The present invention provides a machine-learning-based method for structuring tabular data, used to structure the data items in a spreadsheet to be processed. In practical application, the following Steps A to J are performed.

Step A. For a preset number of sample spreadsheets, count the objects in the cells to obtain each object and its corresponding quantity, and build a dictionary table; then update the dictionary table by the following Steps I to II, and proceed to Step B.

Step I. Obtain the maximum of the quantities corresponding to the objects in the dictionary table; then proceed to Step II.

Step II. For each object in the dictionary table, perform the following Steps II-1 to II-2 to update the quantity corresponding to the object, thereby updating the dictionary table.

Step B. For each cell in the spreadsheet to be processed, count the number of times, count, that the object in the cell occurs in the spreadsheet to be processed; then proceed to Step C.

Step C. For each cell in the spreadsheet to be processed, obtain the quantity c that corresponds to the object in the cell in the dictionary table; if the dictionary table does not contain the object in a cell, the quantity c for that cell is 0; then proceed to Step D.

Step D. For each cell in the spreadsheet to be processed, obtain the score corresponding to the cell; then proceed to Step E.

Step E. Obtain the scores corresponding to the rows and to the columns of the spreadsheet to be processed; then proceed to Step F.

Step F. According to the scores of the rows of the spreadsheet to be processed, cluster the rows by the following Steps Fa-1 to Fa-3; for each row cluster, take the average of the scores of its rows as the score of that cluster; then select the row cluster with the highest score as the candidate row cluster.

Step Fa-3. According to the scores of the rows of the spreadsheet to be processed, assign each row to a row score interval; each row score interval that contains rows of the spreadsheet to be processed constitutes a row cluster.

Meanwhile, according to the scores of the columns of the spreadsheet to be processed, cluster the columns by the following Steps Fb-1 to Fb-3; for each column cluster, take the average of the scores of its columns as the score of that cluster; then select the column cluster with the highest score as the candidate column cluster.

After the candidate row cluster and the candidate column cluster are obtained, proceed to Step G.

Then proceed to Step H.

The machine-learning-based method for structuring tabular data designed in the above technical solution counts the objects in the cells of a large number of sample spreadsheets to build a dictionary table. For each cell of the spreadsheet to be processed, it obtains a score from the number of times the cell's object occurs in that spreadsheet and the quantity recorded for the object in the dictionary table. With the cell score as the smallest unit, rows and columns are compared to determine the header rows or header columns of the spreadsheet to be processed, from which the header items are obtained; the data items are then extracted and structured on the basis of the header items. This overcomes the shortcomings of prior-art rule-based approaches, which identify only horizontal headers and cannot identify multiple headers, and achieves accurate and efficient structuring of spreadsheet data.

The embodiments of the present invention have been explained in detail above with reference to the accompanying drawing, but the present invention is not limited to the above embodiments; within the knowledge of a person of ordinary skill in the art, various changes can also be made without departing from the purpose of the present invention.

Claims

1. A machine-learning-based method for structuring tabular data, used to structure the data items in a spreadsheet to be processed, characterized in that it comprises the following steps:

Step A. For a preset number of sample spreadsheets, count the objects in the cells to obtain each object and its corresponding quantity, and build a dictionary table; then proceed to Step B;

Step B. For each cell in the spreadsheet to be processed, count the number of times, count, that the object in the cell occurs in the spreadsheet to be processed; then proceed to Step C;

Step C. For each cell in the spreadsheet to be processed, obtain the quantity c that corresponds to the object in the cell in the dictionary table; if the dictionary table does not contain the object in a cell, the quantity c for that cell is 0; then proceed to Step D;

Step E. For each row in the spreadsheet to be processed, obtain the sum of the scores of the cells in the row as the score of that row;

Meanwhile, for each column in the spreadsheet to be processed, obtain the sum of the scores of the cells in the column as the score of that column;

Step F. According to the scores of the rows of the spreadsheet to be processed, cluster the rows; for each row cluster, take the average of the scores of its rows as the score of that cluster; then select the row cluster with the highest score as the candidate row cluster;

Meanwhile, according to the scores of the columns of the spreadsheet to be processed, cluster the columns; for each column cluster, take the average of the scores of its columns as the score of that cluster; then select the column cluster with the highest score as the candidate column cluster;

Then proceed to Step G;

Step G. Among the rows in the candidate row cluster, select the row with the highest score and, from the score of that row, obtain the average score of the non-empty cells in the row as the row cell average score;

Meanwhile, among the columns in the candidate column cluster, select the column with the highest score and, from the score of that column, obtain the average score of the non-empty cells in the column as the column cell average score;

Then proceed to Step H;

Step H. If the row cell average score is greater than the column cell average score, the rows in the candidate row cluster are the header rows of the spreadsheet to be processed; obtain the header items in them and proceed to Step J;

If the row cell average score is less than the column cell average score, the columns in the candidate column cluster are the header columns of the spreadsheet to be processed; obtain the header items in them and proceed to Step J;

Step J. According to the header items of the spreadsheet to be processed, read the data items in the spreadsheet to be processed and structure the tabular data.

2. The machine-learning-based method for structuring tabular data according to claim 1, characterized in that: in Step A, after the dictionary table is built, it is updated by the following Steps I to II before proceeding to Step B;

Step II. For each object in the dictionary table, perform the following Steps II-1 to II-2 to update the quantity corresponding to the object, thereby updating the dictionary table;

Step II-1. Determine whether the object belongs to a preset header-item set; if so, set the quantity corresponding to the object to the maximum quantity; otherwise proceed to Step II-2;

Step II-2. Determine whether the object belongs to a preset data-item set; if so, set the quantity corresponding to the object to 0; otherwise leave the quantity corresponding to the object unchanged.

3. The machine-learning-based method for structuring tabular data according to claim 1, characterized in that: in Step F, according to the scores of the rows of the spreadsheet to be processed, the rows are clustered by the following Steps Fa-1 to Fa-3;

Step Fa-1. Obtain the minimum row score and the maximum row score among the scores of the rows of the spreadsheet to be processed, and proceed to Step Fa-2;

Step Fa-2. Divide the span from the minimum row score to the maximum row score according to the preset row score levels to obtain the row score intervals; then proceed to Step Fa-3;

Step Fa-3. According to the scores of the rows of the spreadsheet to be processed, assign each row to a row score interval; each row score interval that contains rows of the spreadsheet to be processed constitutes a row cluster;

Meanwhile, according to the scores of the columns of the spreadsheet to be processed, the columns are clustered by the following Steps Fb-1 to Fb-3;

Step Fb-1. Obtain the minimum column score and the maximum column score among the scores of the columns of the spreadsheet to be processed, and proceed to Step Fb-2;

Step Fb-2. Divide the span from the minimum column score to the maximum column score according to the preset column score levels to obtain the column score intervals; then proceed to Step Fb-3;

Step Fb-3. According to the scores of the columns of the spreadsheet to be processed, assign each column to a column score interval; each column score interval that contains columns of the spreadsheet to be processed constitutes a column cluster.