CN109710651B - Data type identification method and device - Google Patents

Publication number
CN109710651B
Authority
CN
China
Legal status
Active
Application number
CN201811586956.9A
Other languages
Chinese (zh)
Other versions
CN109710651A (en)
Inventor
赖文文
王纯斌
赵神州
Current Assignee
Chengdu Sefon Software Co Ltd
Original Assignee
Chengdu Sefon Software Co Ltd
Application filed by Chengdu Sefon Software Co Ltd
Priority to CN201811586956.9A
Publication of CN109710651A
Application granted
Publication of CN109710651B
Legal status: Active

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the present application provide a data type identification method and device. The method comprises: obtaining a data table to be processed; sampling the data table to be processed, and obtaining data type information of each node data of each data position according to each node data of each data position in the sampled data table to be processed; and traversing each node data of each data position, obtaining the common data type information of the node data according to the data type information of each node data of the data position, and obtaining, according to the common data type information, the data type information of the data position in the data table to be processed before sampling. By sampling the data table to be processed and performing data type identification only on the sampled node data, the method avoids performing data type identification on all the node data in the data to be processed, which reduces the calculation amount of data type identification and improves the identification speed of the data type of each data position.

Description

Data type identification method and device
Technical Field
The present application relates to the field of data analysis, and in particular, to a data type identification method and apparatus.
Background
In the field of data analysis, the data type of the data to be processed generally needs to be known before the data is processed. In practice, however, technicians often skip setting the data type of the data at entry time for convenience, and the data type is then marked manually by a technician before analysis. With the rapid development of computer technology, the volume of data to be processed keeps growing, so the workload of manual marking, and with it the calculation amount of data type identification, becomes larger and larger.
Summary of the Application
In view of the above, an object of the present application is to provide a data type identification method and apparatus, so as to solve or improve the above problems.
In order to achieve the above purpose, the embodiments of the present application employ the following technical solutions:
in a first aspect, an embodiment of the present application provides a data type identification method, which is applied to an electronic device, and the method includes:
obtaining a data table to be processed, wherein the data table to be processed comprises a plurality of data positions and a plurality of node data of each data position;
sampling the to-be-processed data table, and obtaining data type information of each node data of each data position according to each node data of each data position in the sampled to-be-processed data table, wherein the data type information comprises at least one data type, and the data type comprises one of character strings, numbers and time;
and traversing each node data of each data position, obtaining the common data type information of each node data according to the data type information of each node data of the data position, and obtaining the data type information of the data position in the data table to be processed before sampling according to the common data type information.
Optionally, the to-be-processed data table includes a plurality of sampling data units, the sampling data units include node data of different data locations, and the step of sampling the to-be-processed data table includes:
sampling each sampling data unit of the data table to be processed, and obtaining the sampled data table to be processed according to each sampled sampling data unit.
Optionally, the step of obtaining the data type information of the data position in the to-be-processed data table before sampling according to the common data type information includes:
obtaining label information of the data position, and performing semantic analysis on the label information to obtain data type information corresponding to the label information;
and extracting the data type information shared by the common data type information and the data type information corresponding to the label information, wherein the shared data type information is the data type information of the data position.
Optionally, after the step of obtaining the data type information of the data position in the data table to be processed before sampling according to the common data type information, the method further includes:
verifying the data table to be processed according to the data type information of each data position, and judging whether a verification result meets a preset standard or not;
if not, returning to the step of sampling the data table to be processed.
Optionally, the step of verifying the to-be-processed data table according to the data type information of each data position and determining whether a verification result meets a preset standard includes:
generating a verification rule of each data position according to the data type information of each data position;
traversing each data position, and verifying each node data of the data position in the re-sampled data table to be processed according to a verification rule of the data position to obtain a verification result, wherein the verification result comprises first node data matched with the verification rule and second node data unmatched with the verification rule;
obtaining the matching proportion of the data position and the verification rule according to the first node data and the second node data, and judging whether the matching proportion is higher than a proportion threshold value;
if so, judging that the checking result meets the preset standard;
if not, judging that the checking result does not meet the preset standard.
In a second aspect, an embodiment of the present application further provides a data type identification device, which is applied to an electronic device, and the device includes:
an obtaining module, configured to obtain a data table to be processed, wherein the data table to be processed comprises a plurality of data positions and a plurality of node data of each data position;
a sampling module, configured to sample the data table to be processed and obtain the data type information of each node data of each data position according to each node data of each data position in the sampled data table to be processed, wherein the data type information comprises at least one data type, and the data type comprises one of character strings, numbers and time; and
an identification module, configured to traverse each node data of each data position, obtain the common data type information of each node data according to the data type information of each node data of the data position, and obtain the data type information of the data position in the data table to be processed before sampling according to the common data type information.
Optionally, the sampling module is further configured to:
sampling each sampling data unit of the data table to be processed, and obtaining the sampled data table to be processed according to each sampled sampling data unit.
Optionally, the identification module is further configured to:
obtaining label information of the data position, and performing semantic analysis on the label information to obtain data type information corresponding to the label information;
and extracting the data type information shared by the common data type information and the data type information corresponding to the label information, wherein the shared data type information is the data type information of the data position.
Optionally, the data type identification apparatus further includes a verification module;
the checking module is used for checking the data table to be processed according to the data type information of each data position, judging whether the checking result meets the preset standard or not, and enabling the sampling module to sample the data table to be processed again when the checking result does not meet the preset standard.
Optionally, the verification module is further configured to:
generating a verification rule of each data position according to the data type information of each data position;
traversing each data position, and verifying each node data of the data position in the re-sampled data table to be processed according to a verification rule of the data position to obtain a verification result, wherein the verification result comprises first node data matched with the verification rule and second node data unmatched with the verification rule;
obtaining the matching proportion of the data position and the verification rule according to the first node data and the second node data, and judging whether the matching proportion is higher than a proportion threshold value;
if so, judging that the checking result meets the preset standard;
if not, judging that the checking result does not meet the preset standard.
Compared with the prior art, the beneficial effects of the application are that:
according to the data type identification method and device provided by the embodiment of the application, the data table to be processed is sampled, and only the sampled node data is subjected to data type identification, so that the data type identification of all the node data in the data to be processed is avoided, the calculated amount of the data type identification is reduced, and the identification speed of the data type of the data position is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed to be used in the embodiments will be briefly described below. It is appreciated that the following drawings depict only certain embodiments of the application and are therefore not to be considered limiting of its scope. For a person skilled in the art, it is possible to derive other relevant figures from these figures without inventive effort.
Fig. 1 is a schematic block diagram of a structure of an electronic device for implementing a data type identification method according to an embodiment of the present application.
Fig. 2 is a schematic flowchart of a data type identification method according to an embodiment of the present application.
Fig. 3 is another schematic flow chart of the data type identification method according to the embodiment of the present application.
Fig. 4 is a functional block diagram of a data type identification apparatus according to an embodiment of the present application.
Reference numerals: 100-an electronic device; 110-a bus; 120-a processor; 130-a storage medium; 140-a bus interface; 150-a network adapter; 160-a user interface; 200-a data type identification device; 210-an obtaining module; 220-a sampling module; 230-an identification module; 240-a verification module.
Detailed Description
In view of the technical problems described in the background art, it should be particularly noted that, for a standardized data table such as an Excel table, a technician editing the table often sets the data type of the table data to a general type with strong compatibility. The specific data type of the table data cannot be recovered from that general type when the data is processed, which often causes significant problems; for example, the data in the table cannot be directly divided into data to be processed and data marks according to the data type. Based on this, the present inventors provide a data type identification method and apparatus to solve the above technical problems, focusing on the data type identification of table data in an Excel table. In the data type identification method and device, the data table to be processed is sampled and data type identification is performed only on the sampled node data, so that data type identification of all the node data in the data to be processed is avoided, the calculation amount of data type identification is reduced, and the identification speed of the data type of each data position is improved.
The drawbacks of the prior art described above were identified by the inventors through practice and careful study. Therefore, both the discovery of these problems and the solutions proposed for them in the following embodiments should be regarded as contributions of the inventors to the present application.
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
In the description of the present application, it is also to be noted that, unless otherwise explicitly specified or limited, the terms "disposed" and "connected" are to be interpreted broadly, e.g., as being either fixedly connected, detachably connected, or integrally connected; can be mechanically or electrically connected; they may be connected directly or indirectly through intervening media, or they may be interconnected between two elements. The specific meaning of the above terms in the present application can be understood in a specific case by those of ordinary skill in the art.
Some embodiments of the present application will be described in detail below with reference to the accompanying drawings. The embodiments described below and the features in the embodiments can be combined with each other without conflict.
Referring to fig. 1, a block diagram of an electronic device 100 according to an embodiment of the present disclosure is shown. The electronic device 100 may be part of a server and communicatively coupled to a client via a network protocol.
As shown in FIG. 1, electronic device 100 may be implemented by bus 110 as a general bus architecture. Bus 110 may include any number of interconnecting buses and bridges depending on the specific application of electronic device 100 and the overall design constraints. Bus 110 connects various circuits together, including processor 120, storage medium 130, and bus interface 140. Alternatively, the electronic apparatus 100 may connect a network adapter 150 or the like via the bus 110 using the bus interface 140. The network adapter 150 may be used to implement signal processing functions of a physical layer in the electronic device 100, and is communicatively connected to each user end through a network protocol. The user interface 160 may connect external devices such as: a keyboard, a display, a mouse or a joystick, etc. The bus 110 may also connect various other circuits such as timing sources, peripherals, voltage regulators, or power management circuits, which are well known in the art, and therefore, will not be described in detail.
Alternatively, the electronic device 100 may be configured as a general purpose processing system, for example, commonly referred to as a chip, including: one or more microprocessors providing processing functions, and an external memory providing at least a portion of storage medium 130, all connected together with other support circuits through an external bus architecture.
Alternatively, the electronic device 100 may be implemented using: an ASIC (application specific integrated circuit) having a processor 120, a bus interface 140, a user interface 160; and at least a portion of the storage medium 130 integrated in a single chip, or the electronic device 100 may be implemented using: one or more FPGAs (field programmable gate arrays), PLDs (programmable logic devices), controllers, state machines, gate logic, discrete hardware components, any other suitable circuitry, or any combination of circuitry capable of performing the various functions described throughout this application.
Among other things, processor 120 is responsible for managing bus 110 and general processing (including the execution of software stored on storage medium 130). Processor 120 may be implemented using one or more general-purpose processors and/or special-purpose processors. Examples of processor 120 include microprocessors, microcontrollers, DSP processors, and other circuits capable of executing software. Software should be construed broadly to mean instructions, data, or any combination thereof, whether referred to as software, firmware, middleware, microcode, hardware description language, or otherwise.
Storage medium 130 is shown in fig. 1 as being separate from processor 120, however, one skilled in the art will readily appreciate that storage medium 130, or any portion thereof, may be located external to electronic device 100. Storage medium 130 may include, for example, a transmission line, a carrier waveform modulated with data, and/or a computer product separate from the wireless node, which may be accessed by processor 120 via bus interface 140. Alternatively, the storage medium 130, or any portion thereof, may be integrated into the processor 120, e.g., may be a cache and/or general purpose registers.
The processor 120 may perform the following embodiments. Specifically, the storage medium 130 may store the data type identification apparatus 200, and the processor 120 may be used to execute the data type identification apparatus 200.
Further, please refer to fig. 2, which is a flowchart illustrating a data type identification method according to an embodiment of the present application, wherein the data type identification method is executed by the electronic device 100 shown in fig. 1. It should be noted that the data type identification method provided in the embodiment of the present application is not limited by fig. 2 and the following specific sequence. The specific flow of the data type identification method provided by the application is as follows:
step S110, obtaining a to-be-processed data table, where the to-be-processed data table includes a plurality of data positions and a plurality of node data of each data position.
It should be noted that the data position is a position where the node data exists in the data table to be processed, for example, in an Excel table, the data position may be a data column of the Excel table, and the plurality of node data of each data position may be table data in each data column.
As an embodiment, the electronic device 100 may directly obtain the standard to-be-processed data table from a data source in response to a user operation, where the data source may be the storage medium 130 in the electronic device 100, or may be a data storage server communicatively connected to the electronic device 100, and the standard to-be-processed data table may be one or a combination of a database data table, an Excel table, and the like.
For a normal data table to be processed, a plurality of data positions and a plurality of node data of each data position can be obtained according to a table structure of the data table to be processed.
As another embodiment, the electronic device 100 may obtain crawler data from the internet by a data obtaining method such as a crawler, and for the crawler data, the crawler data may be normalized according to tag information of each crawler data, where the tag information may be configured as a data location of a normalized data table, and different data of each same tag information may be used as multiple node data of the data location of the tag information.
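The normalization of crawler data described above can be sketched as follows. This is a minimal illustration, assuming each crawled item arrives as a (tag, value) pair; the function name `normalize_crawler_data` and the sample records are hypothetical and do not come from the application.

```python
from collections import defaultdict


def normalize_crawler_data(records):
    """Group (tag, value) pairs so that each tag becomes one data position.

    The values collected under a tag play the role of the node data
    of that data position.
    """
    table = defaultdict(list)
    for tag, value in records:
        table[tag].append(value)
    return dict(table)


# Hypothetical crawler output: tag information plus raw text value.
records = [("price", "12.5"), ("date", "2018.01.01"), ("price", "8")]
table = normalize_crawler_data(records)
print(table)  # {'price': ['12.5', '8'], 'date': ['2018.01.01']}
```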
As an implementation manner, the electronic device 100 may obtain the crawler data or the standard to-be-processed data table in real time through the data transmission port, and may preset a data filling threshold value for detecting a data type of the real-time data, and generate the to-be-processed data table according to the obtained real-time data when a data amount of the real-time data meets the data filling threshold value.
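One possible reading of the data filling threshold is a simple buffer that releases a data table to be processed once enough real-time rows have arrived. The sketch below assumes that reading; the class name `RealTimeBuffer` and the row format are hypothetical.

```python
class RealTimeBuffer:
    """Collect real-time rows until a preset data filling threshold is met."""

    def __init__(self, fill_threshold):
        self.fill_threshold = fill_threshold
        self.rows = []

    def push(self, row):
        """Add one row; return the buffered table once the threshold is met, else None."""
        self.rows.append(row)
        if len(self.rows) >= self.fill_threshold:
            table, self.rows = self.rows, []  # hand over the table and start a new buffer
            return table
        return None


buf = RealTimeBuffer(fill_threshold=2)
first = buf.push({"year": "2018"})   # threshold not met yet
second = buf.push({"year": "2019"})  # threshold met, table released
```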
Based on the design, the data type identification method provided by the application can identify various data sources and data types of data in various data tables, and improves the universality of the data type identification method provided by the application.
And step S120, sampling the data table to be processed, and obtaining the data type information of each node data of each data position according to each node data of each data position in the sampled data table to be processed.
It should be noted that the data type information includes at least one data type. When the type precision of the data type identification is high, the data type information may be a set of several data types; for example, for the node data "2018", the data type information may include a time of year type, a time type, and a number type. When the type precision of the data type identification is low, the data type information may include only one data type; for example, if the data types are limited to character string, time, and number in order to improve the operation speed of the data type identification method, the data type information of the node data "2018.01.01" may include only the time type.
It should be noted that the type precision is a measure of the precision of the data type, for example, for the time type, this data type may be further divided into more accurate data types such as time of year, time of day, time of timestamp, etc., wherein the type precision of the time of year type is higher than the type precision of the time type.
Optionally, before step S120, the method may further include:
and pre-verifying the data type of each data position in the data table to be processed.
The data type of each data position in the data table to be processed can be pre-verified in the following way:
firstly, obtaining label information of each data position in the to-be-processed data table, where the label information may include one or more combinations of header information, remark information, and a preset data type, and it should be noted that the preset data type is a data type of each data position set in the to-be-processed data table by a related technician.
Then, the pre-check data type information of each data position is obtained according to the label information.
As an implementation mode, semantic analysis can be performed on the header information or the remark information, and the data type information of the label information is obtained according to the result of the semantic analysis; for a preset data type, the preset data type information can be obtained directly.
Optionally, for label information that includes both the preset data type and the header information or remark information, the data type information of the preset data type may be merged with the data type information obtained by semantic analysis. For example, if the data type information of the preset data type includes a time type and the header information is "year", the data type information obtained by semantic analysis of the header information includes a time of year type, and the pre-check data type information of the data position then includes a time type and a time of year type.
It should be noted that, in order to increase the execution speed, for label information that includes the preset data type as well as the header information or the remark information, the pre-check data type information may be obtained from the preset data type alone.
Finally, the data positions corresponding to the pre-check data type information are divided, according to the data type requirement, into a data position group needing to be identified and a data position group not needing to be identified. A data position whose pre-check data type information meets the data type requirement may be configured as a data position that does not need to be identified; otherwise, the data position is configured as a data position that needs to be identified.
Optionally, the data type requirement may be the type precision of the data type identification method: if the type precision of the pre-check data type information is higher than the data type requirement, the pre-check data type information is considered to satisfy the data type requirement. For example, the data type identified by the present application may include one of a character string, a number, and a date. If the pre-check data type information includes a time type and a time of year type, then, because the type precision of the time of year type is higher than that of the time type, the pre-check data type information is determined to satisfy the data type requirement, and the data position is classified into the data position group that does not need to be identified.
Based on the above design, when step S120 is executed, the data position group needing to be identified may be extracted from the data table to be processed, and only that group is sampled and identified.
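The grouping performed by pre-verification can be sketched as follows. The numeric precision ranks in `PRECISION`, the type names, and the function `partition_positions` are assumptions made for illustration; the application only states that a time of year type is more precise than a time type.

```python
# Assumed precision ranks: a larger number means a more precise data type.
PRECISION = {"string": 1, "number": 1, "time": 1, "year_time": 2}


def partition_positions(precheck, required_precision):
    """Split data positions into (needs identification, no identification needed).

    `precheck` maps each data position to its pre-check data type information
    (a set of type names, possibly empty when no label information exists).
    """
    need, skip = [], []
    for position, types in precheck.items():
        best = max((PRECISION[t] for t in types), default=0)
        # A position is skipped only if its best precision exceeds the requirement.
        (skip if best > required_precision else need).append(position)
    return need, skip


precheck = {"year": {"time", "year_time"}, "amount": {"number"}, "note": set()}
need, skip = partition_positions(precheck, required_precision=1)
print(need, skip)  # ['amount', 'note'] ['year']
```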
As an embodiment, step S120 may be implemented by sampling each sampled data unit of the data table to be processed. Wherein the sampled data unit includes node data of different data locations.
It should be noted that the sample data unit is a constituent unit of the to-be-processed data table, for example, in an Excel table, each data position may be each data column in the table, correspondingly, the sample data unit may be each data row, and one sample data unit may include table data of each data column in the same row. For crawler data, the sample data unit may be a crawler data packet.
Specifically, in step S120, each sampled data unit of the data table to be processed may be sampled, and the sampled data table to be processed is then obtained from the sampled data units. The sampling method may be one of simple random sampling, stratified sampling, and systematic sampling. Because the data volume in the application scenario is large, systematic sampling is generally adopted: at a preset sampling interval, one sample is taken every N sampled data units.
Based on the design, when sampling, the node data samples of all data positions needing to identify the data types can be extracted by sampling once, so that the sampling times are reduced.
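Row-wise systematic sampling can be sketched as below, assuming the table is held as a list of rows (the sampled data units) keyed by data position. The helper names `systematic_sample` and `columns_from_rows` are hypothetical. Because a whole row is taken at once, a single pass yields samples for every data position.

```python
def systematic_sample(rows, interval, start=0):
    """Take one sampled data unit (row) every `interval` rows."""
    return rows[start::interval]


def columns_from_rows(rows, positions):
    """Regroup sampled rows into per-position lists of node data."""
    return {p: [row[p] for row in rows] for p in positions}


rows = [{"id": str(i), "year": str(2000 + i)} for i in range(10)]
sampled = systematic_sample(rows, interval=4)          # rows 0, 4 and 8
columns = columns_from_rows(sampled, ["id", "year"])
print(columns)  # {'id': ['0', '4', '8'], 'year': ['2000', '2004', '2008']}
```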
As an embodiment, the step of obtaining the data type information of the respective node data of each data location may include first inputting the node data of each data location into the data type identification network, and then obtaining the data type information through the data type identification network.
Optionally, for data with an irregular structure, the data type identification network may be a neural network divided into a feature extraction layer and a classification function. In operation, each node data is input into the feature extraction layer to obtain semantic features, and the semantic features are then classified by the classification function to obtain the data type information of each node data.
Optionally, for data with a standard structure, in order to reduce the amount of computation, the data type identification network may be formed by regular expressions of each data type, after the data of each node is input into the data type identification network, each regular expression is matched with the data of the node, and if the matching is successful, the data type corresponding to the successfully matched regular expression is written in the data type information of the node data.
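A regular-expression based identifier of this kind might look like the following sketch. The two patterns and the fallback to a string type when nothing else matches are illustrative assumptions; the application does not list concrete expressions.

```python
import re

# Assumed patterns, one per data type; real deployments would need more formats.
TYPE_PATTERNS = {
    "number": re.compile(r"[+-]?\d+(\.\d+)?"),
    "time": re.compile(r"\d{4}([./-]\d{1,2}){0,2}"),
}


def identify_types(value):
    """Return the set of data types whose regular expression matches the whole value."""
    matched = {name for name, pattern in TYPE_PATTERNS.items() if pattern.fullmatch(value)}
    return matched or {"string"}  # assumption: unmatched values are treated as strings


print(sorted(identify_types("2018")))        # ['number', 'time']
print(sorted(identify_types("2018.01.01")))  # ['time']
print(sorted(identify_types("abc")))         # ['string']
```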
Step S130, traversing each node data of each data position, obtaining the common data type information of each node data according to the data type information of each node data of the data position, and obtaining the data type information of the data position in the data table to be processed before sampling according to the common data type information.
In one embodiment, for each data position, the common data type information of the node data is obtained according to the data type information of each node data of the data position, wherein each data type in the common data type information is a data type that appears in the data type information of every node data of that position. For example, suppose that, when the data types of the node data of a certain data position are identified, one node data is "2018", so that its data type information includes a value type and a time type, while the data type information of the other node data of the data position includes only the value type. The data type information of the data position then includes only the value type.
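Read this way, the common data type information is the set intersection of the per-node data type information. A minimal sketch, with the hypothetical helper name `common_type_info`:

```python
def common_type_info(type_infos):
    """Intersect the data type information of all node data of one data position."""
    if not type_infos:
        return set()
    return set.intersection(*map(set, type_infos))


# "2018" is both a number and a time; the other node data are numbers only.
type_infos = [{"number", "time"}, {"number"}, {"number"}]
print(common_type_info(type_infos))  # {'number'}
```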
Optionally, in consideration of the situation that the data table to be processed may have misjudgment and input error, the data types of the data of each node at the sampled data position may be counted, the frequency of each data type is configured as the data type probability of the data position, and the data type with the probability higher than the preset threshold is used as the data type of the data position.
Based on the design, data type identification errors caused by the contingency of data values are avoided, and the accuracy of data type identification is improved.
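A minimal Python sketch of this frequency-based variant follows. The function name `types_above_threshold` and the default threshold of 0.9 are assumptions; the application only speaks of a preset threshold.

```python
from collections import Counter


def types_above_threshold(type_sets, threshold=0.9):
    """Keep the data types whose relative frequency among the sampled values exceeds the threshold."""
    sets = [set(s) for s in type_sets]
    if not sets:
        return set()
    # Count, for each data type, how many sampled values carry it.
    counts = Counter(t for s in sets for t in s)
    total = len(sets)
    return {t for t, n in counts.items() if n / total > threshold}
```

Unlike a strict intersection, a single mistyped value does not remove a data type from the result as long as the remaining values keep its frequency above the threshold.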
As an embodiment, for a data position whose data type is identified after pre-verification, once the common data type information of the data position has been obtained, shared data type information may be derived from the common data type information and the pre-verification data type information, where the shared data type information includes the data types present in both.
As another embodiment, for a data table to be processed that has not been pre-verified, the label information of each data position may be processed in the same way as in pre-verification to obtain the data type information corresponding to each data position, and the shared data type information may then be obtained from that data type information and the common data type information.
Based on the design, the label information can be effectively utilized, and the type precision of the data type identification method provided by the application is improved.
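The combination of label information with the sampled result can be sketched in Python as below. The keyword table `LABEL_HINTS` is a hypothetical stand-in for the semantic analysis step, and the fallback to the sampled result when the label gives no usable hint is an assumption not stated in the application.

```python
# Hypothetical keyword table standing in for semantic analysis of header or remark text.
LABEL_HINTS = {
    "date": {"time"},
    "year": {"time", "number"},
    "amount": {"number"},
    "name": {"string"},
}


def types_from_label(label):
    """Collect the data types hinted at by the words of a label."""
    hinted = set()
    for word in label.lower().replace("_", " ").split():
        hinted |= LABEL_HINTS.get(word, set())
    return hinted


def refine_with_label(sampled_common, label):
    """Keep the data types shared by the sampled result and the label hints."""
    shared = set(sampled_common) & types_from_label(label)
    # Assumption: if the label gives no hint or no overlap, keep the sampled result.
    return shared or set(sampled_common)
```

For instance, a column headed "Order Date" whose sampled values are compatible with both `number` and `time` would be narrowed to `time`.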
Optionally, referring to FIG. 3, which shows a flowchart of the data type identification method provided in the embodiment of the present application, the method further includes step S140 after step S130.
Step S140: verify the data table to be processed according to the data type information of each data position, and judge whether the verification result meets a preset standard.
When the verification result does not meet the preset standard, step S120 is executed again.
As an embodiment, step S140 may be performed by the following sub-steps:
First, a verification rule for each data position is generated according to the data type information of each data position.
When this sub-step is executed, for each data position, the data type with the highest type precision in the data type information of the data position may first be selected, a corresponding regular expression generated according to that data type, and the regular expression then used as the verification rule of the data position.
Then, each data position is traversed, and each node data of the data position in the resampled data table to be processed is verified according to the verification rule of the data position to obtain a verification result, where the verification result includes first node data that matches the verification rule and second node data that does not match the verification rule.
Specifically, the first node data is the node data at the data position that matches the regular expression, and the second node data is the node data at the data position that does not. During matching, the node data at the position may either be filtered by the regular expression of that position or be tested against the regular expression one by one.
And finally, obtaining the matching proportion of the data position and the verification rule according to the first node data and the second node data, and judging whether the matching proportion is higher than a proportion threshold value or not, thereby obtaining a judgment result whether the verification result meets a preset standard or not.
When the matching proportion is higher than the proportion threshold value, judging that the verification result meets a preset standard; and when the matching proportion is lower than the proportion threshold value, judging that the verification result does not meet the preset standard.
Based on the steps, the data type is checked after the data type is identified, so that the data type identification error caused by sampling contingency during sampling is avoided, and the accuracy of the data type identification method provided by the application is further improved.
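The verification sub-steps can be sketched in Python as follows. The function name `verify_position`, the default proportion threshold of 0.95, and the example rule are assumptions for illustration only.

```python
import re


def verify_position(values, rule, ratio_threshold=0.95):
    """Check the resampled values of one data position against its verification rule.

    Returns (passed, ratio), where ratio is the share of values matching the rule.
    """
    first = [v for v in values if rule.fullmatch(v)]       # first node data: matches the rule
    second = [v for v in values if not rule.fullmatch(v)]  # second node data: does not match
    total = len(first) + len(second)
    ratio = len(first) / total if total else 0.0
    return ratio > ratio_threshold, ratio


# Hypothetical rule for a position identified as numeric.
numeric_rule = re.compile(r"[+-]?\d+(\.\d+)?")
passed, ratio = verify_position(["1", "2.5", "x", "4"], numeric_rule)
print(passed, ratio)
```

When `passed` is false for a data position, the method described above would return to the sampling step and identify the data types again.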
Building on the data type identification method, the inventors of the present application found that the data table to be processed can be reconstructed according to the identified data types, such that the reconstructed data table includes at least one data dimension and a data metric corresponding to each data dimension. For example, the data type may be one of a numeric type, a string type, and a time type; data of the numeric type may be configured as a data metric, and data of the string type and of the time type may be configured as data dimensions.
The data metric may be configured as a data value determined by a data dimension, and the data dimension may be configured as a data identifier used to associate the table with other data tables to be processed. For example, a data unit of the data table to be processed may be located by the value of a data dimension, which at the same time determines the data metric value corresponding to that data dimension.
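The split into dimensions and metrics can be sketched in Python as below, assuming each data position has already been assigned a single data type. The function name and the column names in the example are hypothetical.

```python
def split_dimensions_and_metrics(column_types):
    """Partition data positions into dimensions (string, time) and metrics (number)."""
    dimensions = [name for name, t in column_types.items() if t in ("string", "time")]
    metrics = [name for name, t in column_types.items() if t == "number"]
    return dimensions, metrics


# Hypothetical identified types for a three-column table.
identified = {"region": "string", "day": "time", "sales": "number"}
dims, mets = split_dimensions_and_metrics(identified)
print(dims, mets)
```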
In an embodiment, referring to FIG. 4, which shows a functional block diagram of the data type identification apparatus 200 according to an embodiment of the present application, the data type identification apparatus 200 includes the following functional modules:
an obtaining module 210, configured to obtain a to-be-processed data table, where the to-be-processed data table includes a plurality of data positions and a plurality of node data of each data position;
the sampling module 220 is configured to sample the data table to be processed, and obtain data type information of each node data of each data position according to each node data of each data position in the sampled data table to be processed, where the data type information includes at least one data type, and the data type includes one of a character string, a number, and time;
the identifying module 230 is configured to traverse each node data of each data position, obtain common data type information of each node data according to the data type information of each node data of the data position, and obtain data type information of the data position in the to-be-processed data table before sampling according to the common data type information.
Optionally, the sampling module 220 is further configured to:
sampling each sampling data unit of the data table to be processed, and obtaining the sampled data table to be processed according to each sampled sampling data unit.
Optionally, the identification module 230 is further configured to:
obtaining label information of the data position, and performing semantic analysis on the label information to obtain data type information corresponding to the label information;
and extracting the data type information shared by the common data type information and the data type information corresponding to the label information, wherein the shared data type information is the data type information of the data position.
Optionally, the data type identifying device 200 may further include a checking module 240, where the checking module 240 is configured to check the to-be-processed data table according to the data type information of each data position, determine whether a checking result meets a preset standard, and enable the sampling module 220 to sample the to-be-processed data table again when the checking result does not meet the preset standard.
Optionally, the checking module 240 is further configured to:
generating a verification rule of each data position according to the data type information of each data position;
traversing each data position, and verifying each node data of the data position in the re-sampled data table to be processed according to the verification rule of the data position to obtain a verification result, wherein the verification result comprises first node data matched with the verification rule and second node data not matched with the verification rule;
obtaining the matching proportion of the data position and the verification rule according to the first node data and the second node data, and judging whether the matching proportion is higher than a proportion threshold value;
if so, judging that the checking result meets a preset standard;
if not, judging that the verification result does not meet the preset standard.
The embodiment of the present application further provides a readable storage medium, where a computer program is stored, and when the computer program is executed, the data type identification method in any of the above method embodiments may be implemented.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The apparatus and method embodiments described above are illustrative only, as the flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
It will be evident to those skilled in the art that the application is not limited to the details of the foregoing illustrative embodiments, and that the present application may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the application being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.

Claims (8)

1. A data type identification method, applied to an electronic device, the method comprising:
obtaining a data table to be processed, wherein the data table to be processed comprises a plurality of data positions and a plurality of node data of each data position;
sampling the to-be-processed data table, and obtaining data type information of each node data of each data position according to each node data of each data position in the sampled to-be-processed data table, wherein the data type information comprises at least one data type, and the data type comprises one of a character string type, a number type and a time type;
traversing each node data of each data position, obtaining the common data type information of each node data according to the data type information of each node data of the data position, and obtaining the data type information of the data position in the data table to be processed before sampling according to the common data type information;
the step of obtaining the data type information of the data position in the data table to be processed before sampling according to the common data type information comprises the following steps:
obtaining label information of the data position, and performing semantic analysis on the label information to obtain data type information corresponding to the label information, wherein the label information comprises one or more combinations of header information, remark information and preset data types;
and extracting common data type information between the common data type information and the data type information corresponding to the label information, wherein the common data type information is the data type information of the data position.
2. The data type identification method according to claim 1, wherein the to-be-processed data table includes a plurality of sample data units, the sample data units include node data of different data locations, and the step of sampling the to-be-processed data table includes:
sampling each sampling data unit of the data table to be processed, and obtaining the sampled data table to be processed according to each sampled sampling data unit.
3. The data type identification method according to claim 1, wherein after the step of obtaining the data type information of the data position in the data table to be processed before sampling according to the common data type information, the method further comprises:
verifying the data table to be processed according to the data type information of each data position, and judging whether a verification result meets a preset standard or not;
if not, returning to the step of sampling the data table to be processed.
4. The data type identification method according to claim 3, wherein the step of verifying the to-be-processed data table according to the data type information of each data position and determining whether the verification result meets a preset standard comprises:
generating a verification rule of each data position according to the data type information of each data position;
traversing each data position, and verifying each node data of the data position in the re-sampled data table to be processed according to a verification rule of the data position to obtain a verification result, wherein the verification result comprises first node data matched with the verification rule and second node data unmatched with the verification rule;
obtaining the matching proportion of the data position and the verification rule according to the first node data and the second node data, and judging whether the matching proportion is higher than a proportion threshold value;
if so, judging that the checking result meets the preset standard;
if not, judging that the checking result does not meet the preset standard.
5. A data type recognition device, applied to an electronic device, the device comprising:
an acquisition module, configured to obtain a data table to be processed, wherein the data table to be processed comprises a plurality of data positions and a plurality of node data of each data position;
the sampling module is used for sampling the data table to be processed and obtaining the data type information of each node data of each data position according to each node data of each data position in the sampled data table to be processed, wherein the data type information comprises at least one data type, and the data type comprises one of character strings, numbers and time; and
the identification module is used for traversing each node data of each data position, obtaining the common data type information of each node data according to the data type information of each node data of the data position, and obtaining the data type information of the data position in the data table to be processed before sampling according to the common data type information;
the identification module is further configured to:
obtaining label information of the data position, and performing semantic analysis on the label information to obtain data type information corresponding to the label information, wherein the label information comprises one or more combinations of header information, remark information and preset data types;
and extracting common data type information between the common data type information and the data type information corresponding to the label information, wherein the common data type information is the data type information of the data position.
6. The data type identification device of claim 5, wherein the sampling module is further configured to:
sampling each sampling data unit of the data table to be processed, and obtaining the sampled data table to be processed according to each sampled sampling data unit.
7. The data type identification device of claim 5, further comprising a verification module;
the checking module is used for checking the data table to be processed according to the data type information of each data position, judging whether the checking result meets the preset standard or not, and enabling the sampling module to sample the data table to be processed again when the checking result does not meet the preset standard.
8. The data type identification device of claim 7, wherein the check module is further configured to:
generating a verification rule of each data position according to the data type information of each data position;
traversing each data position, and verifying each node data of the data position in the re-sampled data table to be processed according to a verification rule of the data position to obtain a verification result, wherein the verification result comprises first node data matched with the verification rule and second node data unmatched with the verification rule;
obtaining the matching proportion of the data position and the verification rule according to the first node data and the second node data, and judging whether the matching proportion is higher than a proportion threshold value;
if so, judging that the checking result meets the preset standard;
if not, judging that the checking result does not meet the preset standard.
CN201811586956.9A 2018-12-25 2018-12-25 Data type identification method and device Active CN109710651B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811586956.9A CN109710651B (en) 2018-12-25 2018-12-25 Data type identification method and device

Publications (2)

Publication Number Publication Date
CN109710651A CN109710651A (en) 2019-05-03
CN109710651B true CN109710651B (en) 2020-11-10

Family

ID=66257461

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811586956.9A Active CN109710651B (en) 2018-12-25 2018-12-25 Data type identification method and device

Country Status (1)

Country Link
CN (1) CN109710651B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant