CN112231417A - Data classification method and device, electronic equipment and storage medium - Google Patents

Data classification method and device, electronic equipment and storage medium

Info

Publication number
CN112231417A
Authority
CN
China
Prior art keywords
data
standard
business entity
original
entity table
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011099802.4A
Other languages
Chinese (zh)
Inventor
陈昱彬
李婕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An International Smart City Technology Co Ltd
Original Assignee
Ping An International Smart City Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An International Smart City Technology Co Ltd
Priority to CN202011099802.4A
Publication of CN112231417A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2282 Tablespace storage structures; Management thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to big data technology and discloses a data classification method comprising: obtaining an original data dictionary set and a preset service theme set; extracting the data dictionaries in the original data dictionary set to the service theme set to obtain an original business entity table; performing missing value detection and duplicate removal on the original business entity table to obtain a standard business entity table; obtaining a standard data table and generating a mapping relation table according to the standard business entity table and the standard data table; generating a query statement according to the standard business entity table and the mapping relation table; generating a data extraction script according to the query statement; and extracting data with the data extraction script and classifying it to obtain a classification result. The invention also relates to blockchain technology: the classification result can be stored in a node of a blockchain. The invention further provides a data classification apparatus, an electronic device and a computer-readable storage medium. The invention removes the need for technicians to understand the specific business before they can classify data.

Description

Data classification method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of big data technologies, and in particular, to a data classification method and apparatus, an electronic device, and a computer-readable storage medium.
Background
As internet big data platforms and technologies mature, professional fields increasingly need big-data-based analysis and prediction of their business data. Technicians must clean, integrate, process and standardize multi-channel, multi-source data in order to give managers accurate business analysis and business prediction.
In this scenario the prior art has the following drawbacks: 1. Data processing on the market mainly processes and applies general-purpose source data, and technicians must understand the specific business behind the data before they can process it. 2. Some fields with stricter requirements lack a dedicated automatic data processing method. For example, business data in the judicial field requires data modeling, classification and layering, and industry standardization, and the prior art lacks a processing method for such complex business data.
Disclosure of Invention
The invention provides a data classification method, a data classification device and a computer readable storage medium, and mainly aims to solve the problem that a technician needs to know specific services to classify data.
In order to achieve the above object, the present invention provides a data classification method, including:
acquiring an original data dictionary set and a preset service theme set, and extracting a data dictionary in the original data dictionary set to the service theme set to obtain an original business entity table under each service theme;
carrying out missing value detection and duplicate removal operation on the original business entity table to obtain a standard business entity table;
acquiring a standard data table, and generating a mapping relation table according to the standard business entity table and the standard data table;
generating a query statement according to the standard business entity table and the mapping relation table;
and generating a data extraction script according to the query statement, extracting data by using the data extraction script and classifying to obtain a classification result.
Optionally, the acquiring an original data dictionary set and a preset service theme set, and extracting a data dictionary in the original data dictionary set to the service theme set to obtain an original business entity table under each service theme includes:
extracting key words in the service theme set by using a preset language processing algorithm;
matching a corresponding data dictionary in the original data dictionary set according to the keywords, and extracting metadata in the data dictionary to the service theme set;
and summarizing metadata in all data dictionaries under all the business topics in the business topic set to obtain the original business entity table.
Optionally, the extracting the keywords in the service theme set by using a preset language processing algorithm includes:
performing word segmentation processing on the text in the service theme set, and removing stop words to obtain word segmentation results;
and selecting one or more keywords from the word segmentation result.
Optionally, the performing missing value detection and duplicate removal operations on the original business entity table to obtain a standard business entity table includes:
carrying out missing value detection and filling on the data in the original business entity table to obtain a filled original business entity table;
and carrying out duplication removal operation on the data filled in the original business entity table, and obtaining the standard business entity table according to a preset business rule.
Optionally, the generating a mapping relationship table according to the standard business entity table and the standard data table includes:
finding data which is the same as the standard field name in the standard data table from the standard business entity table;
and configuring the mapping relation between the data and the standard code value corresponding to the standard field, and generating a mapping relation table.
Optionally, the generating a query statement according to the standard business entity table and the mapping relationship table includes:
generating a table building statement of the standard business entity table by using a preset statement building function;
acquiring the mapping ID of the standard business entity table, and searching all mapping scripts under the same mapping ID in the mapping relation table;
and summarizing the table building statement and the mapping script to obtain the query statement.
Optionally, the generating a data extraction script according to the query statement, extracting data by using the data extraction script, and classifying to obtain a classification result includes:
acquiring an operation script template of a preset platform, and generating the data extraction script by using the operation script template and the query statement;
and running the data extraction script in a preset time, extracting data from a database according to the data extraction script, and classifying to obtain the classification result.
In order to solve the above problem, the present invention also provides a data classification apparatus, comprising:
the data dictionary extraction module is used for acquiring an original data dictionary set and a preset service theme set, extracting a data dictionary in the original data dictionary set to the service theme set and obtaining an original business entity table under each service theme;
the entity table processing module is used for carrying out missing value detection and duplicate removal operation on the original business entity table to obtain a standard business entity table;
the relation mapping module is used for acquiring a standard data table and generating a mapping relation table according to the standard business entity table and the standard data table;
the statement generating module is used for generating query statements according to the standard business entity table and the mapping relation table;
and the data classification module is used for generating a data extraction script according to the query statement, extracting data by using the data extraction script and classifying to obtain a classification result.
In order to solve the above problem, the present invention also provides an electronic device, including:
a memory storing at least one instruction; and
a processor that executes the instructions stored in the memory to implement the data classification method described above.
In order to solve the above problem, the present invention further provides a computer-readable storage medium, which stores at least one instruction, and the at least one instruction is executed by a processor in an electronic device to implement the data classification method described above.
According to the embodiment of the invention, the original business entity table under each service theme can be determined accurately from the original data dictionary set and the preset service theme set. Performing missing value detection and duplicate removal on the original business entity table yields the standard business entity table and improves the accuracy of its data. A mapping relation table is then generated according to the standard business entity table and the standard data table, a query statement is generated according to the standard business entity table and the mapping relation table, and a data extraction script is generated according to the query statement, so that data standardization and data classification can be carried out directly. Therefore, the data classification method, the data classification apparatus, the electronic device and the computer-readable storage medium provided by the invention remove the need for technicians to understand the specific business before classifying data.
Drawings
Fig. 1 is a schematic flow chart of a data classification method according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart showing a detailed implementation of one of the steps in FIG. 1;
FIG. 3 is a schematic flow chart showing another step of FIG. 1;
FIG. 4 is a schematic flow chart showing another step of FIG. 1;
FIG. 5 is a schematic diagram of a mapping relation table;
FIG. 6 is a schematic flow chart showing another step of FIG. 1;
FIG. 7 is a schematic diagram of a mapping script;
FIG. 8 is a schematic flow chart showing another step of FIG. 1;
FIG. 9 is a functional block diagram of a data classification apparatus according to an embodiment of the present invention;
FIG. 10 is a schematic structural diagram of an electronic device implementing the data classification method according to an embodiment of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The embodiment of the application provides a data classification method. The execution subject of the data classification method includes, but is not limited to, at least one of electronic devices such as a server and a terminal, which can be configured to execute the method provided by the embodiments of the present application. In other words, the data classification method may be performed by software or hardware installed in the terminal device or the server device, and the software may be a blockchain platform. The server includes but is not limited to: a single server, a server cluster, a cloud server or a cloud server cluster, and the like.
Fig. 1 is a schematic flow chart of a data classification method according to an embodiment of the present invention. In this embodiment, the data classification method includes:
S1, acquiring an original data dictionary set and a preset service theme set, extracting a data dictionary in the original data dictionary set to the service theme set, and obtaining an original business entity table under each service theme.
In an embodiment of the present invention, a data dictionary contains metadata, which describes the content of the data, and the original data dictionary set contains a plurality of data dictionaries. For example, in a big data platform the "case table" has fields such as case ID, case number and contractor, whose values are 23423546666, (2020) yue 0308 min 453 and Zhang San, respectively; here the "case table" is a data dictionary, and case ID, case number and contractor are metadata in that data dictionary. The preset service theme set may contain service themes from multiple fields; for example, a judicial service theme set in the judicial field may be divided into case information, judge information, party information, document information, evidence information, and the like.
Preferably, referring to fig. 2, the S1 includes:
S10, extracting keywords in the service theme set by using a preset language processing algorithm;
S11, matching a corresponding data dictionary in the original data dictionary set according to the keywords, and extracting metadata in the data dictionary to the service theme set;
S12, summarizing metadata in all data dictionaries under all the service themes in the service theme set to obtain the original business entity table.
In detail, the extracting the keywords in the service theme set by using a preset language processing algorithm includes:
performing word segmentation processing on the text in the service theme set, and removing stop words to obtain word segmentation results;
and selecting one or more keywords from the word segmentation result.
The preset language processing algorithm in the embodiment of the invention may be a publicly known algorithm such as TextRank or a semantics-based keyword extraction algorithm. For example, in judicial business, the keyword "case" is extracted from the service theme "case information" in the judicial service theme set, the data dictionary "case table" in the original data dictionary set is matched according to the keyword "case", the metadata in that data dictionary (fields such as case ID, case number and contractor) is extracted to the "case information" service theme, and the results are summarized to obtain the original business entity table under "case information".
According to the embodiment of the invention, the preset language processing algorithm is utilized to quickly identify the data in the original data dictionary, so that the omission of some key information in the original data dictionary is avoided.
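Steps S10 to S12 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: whitespace splitting stands in for real word segmentation (Chinese text would need a segmenter), and the stop-word list, the table names and the function names `extract_keywords` and `build_entity_tables` are all hypothetical.

```python
# Hypothetical stop-word list; a real system would use a curated one.
STOP_WORDS = {"information", "the", "of", "and"}


def extract_keywords(topic):
    """Segment a service theme name into words and drop stop words (S10)."""
    words = topic.lower().split()
    return [w for w in words if w not in STOP_WORDS]


def build_entity_tables(topics, data_dictionaries):
    """Collect, per service theme, the metadata of every data dictionary
    whose name contains one of the theme's keywords (S11 and S12)."""
    entity_tables = {}
    for topic in topics:
        keywords = extract_keywords(topic)
        fields = []
        for table_name, metadata in data_dictionaries.items():
            if any(kw in table_name.lower() for kw in keywords):
                fields.extend(metadata)
        entity_tables[topic] = fields
    return entity_tables


# Illustrative input modelled on the judicial example in the text.
topics = ["case information", "judge information"]
data_dictionaries = {
    "case table": ["case ID", "case number", "contractor"],
    "judge table": ["judge ID", "judge name"],
    "evidence table": ["evidence ID"],
}
entity_tables = build_entity_tables(topics, data_dictionaries)
```

Substring matching on the table name is the simplest possible matching rule; a TextRank-style or semantic extractor, as the text suggests, would replace `extract_keywords` without changing the rest of the flow.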
S2, carrying out missing value detection and duplicate removal operation on the original business entity table to obtain a standard business entity table.
Preferably, referring to fig. 3, the S2 includes:
S20, carrying out missing value detection and filling on the data in the original business entity table to obtain a filled original business entity table;
S21, carrying out a duplicate removal operation on the data in the filled original business entity table, and obtaining the standard business entity table according to a preset business rule.
In the embodiment of the invention, a missing-value detection function such as the mismap function can be used to detect whether the data in the original business entity table contains missing values. If there are no missing values, the data is left unchanged; if there are, the missing values are filled using a preset filling algorithm to obtain the filled original business entity table.
In detail, the preset filling algorithm may be:
Figure BDA0002723880590000061
where L(θ) denotes the filled value of the missing data, x_i denotes the i-th missing data value, θ denotes the probability parameter corresponding to the filled value, n denotes the number of data items in the original business entity table, and p(x_i | θ) denotes the probability of the filled value.
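The formula itself is only present as an image reference in this text. Given the symbols defined above (L(θ), x_i, θ, n and p(x_i | θ)), one consistent reading, offered here as an assumption rather than as the patent's confirmed formula, is the likelihood function used in maximum likelihood estimation:

```latex
L(\theta) = \prod_{i=1}^{n} p(x_i \mid \theta)
```

Under that reading, the filling value would be derived from the parameter θ that maximizes L(θ).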
Further, in the embodiment of the present invention, the data in the filled original business entity table is deduplicated using a distance formula, where the distance formula includes:
Figure BDA0002723880590000062
where d denotes the distance between any two data items in the filled original business entity table, and w_1j and w_2j denote those two data items. If the distance is smaller than a preset distance value, either one of the two data items is deleted; otherwise both are kept. Preferably, the preset distance value may be 0.1.
Further, in the embodiment of the present invention, the preset business rule is a rule for keeping or discarding parts of the original business entity table in a given business scenario. For example, in a judicial business scenario, if "certificate data" appears in both the "case table" and the "document table", the "document table" is removed from the original business entity table.
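The distance-based duplicate removal can be sketched as below. The distance formula is only an image reference in this text, so the sketch assumes the Euclidean distance d = sqrt(Σ_j (w_1j − w_2j)²), which matches the symbols d, w_1j and w_2j but is not confirmed by the source. It also assumes the data items are already encoded as numeric vectors; the function names are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def deduplicate(rows, threshold=0.1):
    """Keep a row only if it is at least `threshold` away from every row kept so far.

    A row closer than the threshold to an already kept row is treated as a
    duplicate and dropped, mirroring "delete either one of the two data items".
    """
    kept = []
    for row in rows:
        if all(euclidean(row, k) >= threshold for k in kept):
            kept.append(row)
    return kept


# The second row lies about 0.05 from the first, so it is dropped.
rows = [(1.0, 2.0), (1.0, 2.05), (3.0, 4.0)]
unique_rows = deduplicate(rows)
```

The default threshold of 0.1 follows the preferred preset distance value given in the text.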
According to the embodiment of the invention, the missing value detection and the duplicate removal operation are carried out on the data in the original business entity table, and the data in the original business entity table is adjusted according to the preset business rule, so that the accuracy of the data is improved.
S3, acquiring a standard data table, and generating a mapping relation table according to the standard business entity table and the standard data table.
Preferably, the standard data table may be a national standard data table, and the national standard data table specifies each standard field and a corresponding standard code value of the standard field. For example, in the national standard data table, the value 1 of the gender field indicates male, and the value 2 indicates female.
In detail, referring to fig. 4, the generating a mapping relationship table according to the standard business entity table and the standard data table includes:
S30, finding, in the standard business entity table, the data whose name is the same as a standard field name in the standard data table;
S31, configuring the mapping relation between the data and the standard code value corresponding to the standard field, and generating a mapping relation table.
For example, the standard business entity table "party information" has a gender field that may merge gender data from system A (value 01 represents male, value 02 represents female) and system B (value 00 represents male, value 01 represents female). The mapping relation table unifies the data in the standard business entity table, which improves data utilization efficiency. Illustratively, in the mapping relation table shown in fig. 5, the source fields "XB" and "SEX" are both mapped to the standard field "xingbie", and the source code values "value 00 represents male, value 01 represents female" and "value 01 represents male, value 02 represents female" are both mapped to "value 1 represents male, value 2 represents female", and so on.
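The gender example can be expressed as a small lookup structure. This is a sketch only: the text does not say which source field belongs to which system, so pairing "XB" with system A and "SEX" with system B is an assumption, and the row layout and the `standardize` function are hypothetical.

```python
# Mapping relation table rows:
# (source system, source field, source code, standard field, standard code)
# The system-to-field pairing is assumed for illustration.
MAPPING_TABLE = [
    ("A", "XB", "01", "xingbie", "1"),   # system A: 01 = male
    ("A", "XB", "02", "xingbie", "2"),   # system A: 02 = female
    ("B", "SEX", "00", "xingbie", "1"),  # system B: 00 = male
    ("B", "SEX", "01", "xingbie", "2"),  # system B: 01 = female
]


def standardize(system, field, value):
    """Return (standard field, standard code) for a source value, or None if unmapped."""
    for src_system, src_field, src_code, std_field, std_code in MAPPING_TABLE:
        if (src_system, src_field, src_code) == (system, field, value):
            return std_field, std_code
    return None


male_from_a = standardize("A", "XB", "01")
female_from_b = standardize("B", "SEX", "01")
```

Note that the source code "01" means male in system A but female in system B, which is why the lookup key has to include the source system and not just the code value.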
S4, generating a query statement according to the standard business entity table and the mapping relation table.
Preferably, the query statement generated by the embodiment of the present invention may be written in Structured Query Language (SQL). SQL is the most widely used language in data processing and lets a user state the required business logic concisely. It is a declarative language: the user only needs to express the requirement clearly, without knowing the specific implementation. SQL can also be optimized: built-in query optimizers translate an SQL statement into an optimized execution plan.
Preferably, referring to fig. 6, the S4 includes:
S40, generating a table building statement of the standard business entity table by using a preset statement building function;
S41, obtaining the mapping ID of the standard business entity table, and searching all mapping scripts in the mapping relation table under the same mapping ID;
S42, summarizing the table building statement and the mapping script to obtain the query statement.
In the embodiment of the present invention, the preset statement creation function may be, for example, CREATE TABLE IF NOT EXISTS RY_ZP_HTXX (id string comment 'xx'). The table building statement created with the statement creation function may be CREATE TABLE IF NOT EXISTS RY_ZP_HTXX (id string comment 'id', ryid string comment 'person id', scbs string comment 'delete identifier', …), where "ryid" denotes "person id", "scbs" denotes "delete identifier", and so on.
In the embodiment of the invention, each standard business entity table has a unique mapping ID. For example, as shown in the mapping script of fig. 7, all mapping scripts under the mapping ID "MP0001" are searched to obtain a complete mapping script: "select case when a.xb = '00' then '1' when a.xb = '01' then '2' else null end as a.xb from dsrxxx a join CD_yinggsxx_YSLDMXX b on a.xb = BDMZ".
According to the embodiment of the invention, the mapping script is generated through the unique mapping ID and the mapping relation table, so that the mapping is more accurate, the modification difficulty of the mapping script is reduced, and the maintainability is improved.
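Steps S40 to S42 can be sketched as string generation over the table definition and the mapping rows. This is an illustrative sketch under assumptions: the function names, the `(mapping ID, source code, standard code)` row layout and the source table name `party_info` are hypothetical, and every column is given the `string` type as in the example statement.

```python
def build_create_table(table, columns):
    """Build a table building statement from (column name, comment) pairs (S40)."""
    cols = ", ".join(f"{name} string comment '{comment}'" for name, comment in columns)
    return f"CREATE TABLE IF NOT EXISTS {table} ({cols})"


def build_mapping_script(mapping_rows, mapping_id, source_table, field):
    """Build a CASE WHEN mapping script from all rows under one mapping ID (S41)."""
    branches = " ".join(
        f"when {field} = '{src}' then '{std}'"
        for row_id, src, std in mapping_rows
        if row_id == mapping_id
    )
    return f"select case {branches} else null end as {field} from {source_table}"


def build_query_statement(create_stmt, mapping_script):
    """Combine the table building statement and the mapping script (S42)."""
    return f"{create_stmt};\n{mapping_script};"


columns = [("id", "id"), ("ryid", "person id"), ("scbs", "delete identifier")]
mapping_rows = [("MP0001", "00", "1"), ("MP0001", "01", "2"), ("MP0002", "M", "1")]

create_stmt = build_create_table("RY_ZP_HTXX", columns)
mapping_script = build_mapping_script(mapping_rows, "MP0001", "party_info", "xb")
query_statement = build_query_statement(create_stmt, mapping_script)
```

Because the CASE branches are derived from the mapping rows, changing a code mapping only requires editing the mapping relation table, which is the maintainability benefit the text describes. A production version would need to validate or escape identifiers and values instead of interpolating them directly.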
S5, generating a data extraction script according to the query statement, extracting data by using the data extraction script and classifying to obtain a classification result.
Preferably, referring to fig. 8, the S5 includes:
S50, acquiring a run script template of a preset platform, and generating the data extraction script by using the run script template and the query statement;
S51, running the data extraction script at a preset time, extracting data from a database according to the data extraction script, and classifying to obtain the classification result.
Preferably, the preset platform may be a pre-built big data management platform, and the data extraction script may be a shell script. The big data management platform has a scheduled-task management module that provides a script template for timed runs. The query statement is entered into the big data management platform and a new scheduled task is created; the platform then generates the data extraction script from the query statement at the scheduled time, extracts data into the standard business entity table, and uses the mapping script in the query statement to standardize the data in the standard business entity table, giving the final classification result. For example, if the script is set to extract data every morning, the classification result can be obtained directly.
The embodiment of the invention uses the big data management platform to generate the data extraction script automatically, which lowers the operating threshold: technicians can run it without understanding the specific business.
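Steps S50 and S51 amount to filling a script template with the query statement and scheduling the result. The sketch below assumes a shell script template and a crontab-style schedule; `sql_client --execute` is a placeholder command invented for illustration and does not name a real tool or the platform described in the source, and the script path is likewise hypothetical.

```python
from string import Template

# Hypothetical run script template; "sql_client" is a placeholder command.
SCRIPT_TEMPLATE = Template(
    "#!/bin/sh\n"
    "# Data extraction script generated from a query statement\n"
    'sql_client --execute "$query"\n'
)


def build_extraction_script(query):
    """Fill the run script template with the query statement (S50)."""
    escaped = query.replace('"', '\\"')  # keep the shell double quotes balanced
    return SCRIPT_TEMPLATE.substitute(query=escaped)


def cron_line(hour, script_path):
    """Return a crontab entry that runs the script once a day at the given hour (S51)."""
    return f"0 {hour} * * * {script_path}"


script = build_extraction_script("select xb from party_info")
schedule = cron_line(6, "/opt/jobs/extract.sh")
```

Keeping the template separate from the query means the same scheduled task can be reused for any standard business entity table by swapping in a different query statement.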
According to the embodiment of the invention, the original business entity table under each service theme can be determined accurately from the original data dictionary set and the preset service theme set. Performing missing value detection and duplicate removal on the original business entity table yields the standard business entity table and improves the accuracy of its data. A mapping relation table is then generated according to the standard business entity table and the standard data table, a query statement is generated according to the standard business entity table and the mapping relation table, and a data extraction script is generated according to the query statement, so that data standardization and data classification can be carried out directly. Therefore, the embodiment of the invention removes the need for technicians to understand the specific business before classifying data.
Fig. 9 is a functional block diagram of a data classification apparatus according to an embodiment of the present invention.
The data classification apparatus 100 of the present invention may be installed in an electronic device. According to the functions implemented, the data classification apparatus 100 may include a data dictionary extraction module 101, an entity table processing module 102, a relation mapping module 103, a statement generation module 104 and a data classification module 105. A module of the present invention, which may also be referred to as a unit, is a series of computer program segments that are stored in a memory of the electronic device, can be executed by a processor of the electronic device, and perform a fixed function.
In the present embodiment, the functions regarding the respective modules/units are as follows:
The data dictionary extraction module 101 is configured to obtain an original data dictionary set and a preset service theme set, extract a data dictionary in the original data dictionary set to the service theme set, and obtain an original business entity table under each service theme.
In an embodiment of the present invention, a data dictionary contains metadata, which describes the content of the data, and the original data dictionary set contains a plurality of data dictionaries. For example, in a big data platform the "case table" has fields such as case ID, case number and contractor, whose values are 23423546666, (2020) yue 0308 min 453 and Zhang San, respectively; here the "case table" is a data dictionary, and case ID, case number and contractor are metadata in that data dictionary. The preset service theme set may contain service themes from multiple fields; for example, a judicial service theme set in the judicial field may be divided into case information, judge information, party information, document information, evidence information, and the like.
Preferably, the data dictionary extraction module 101 obtains the original business entity table by:
extracting key words in the service theme set by using a preset language processing algorithm;
matching a corresponding data dictionary in the original data dictionary set according to the keywords, and extracting metadata in the data dictionary to the service theme set;
and summarizing metadata in all data dictionaries under all the business topics in the business topic set to obtain the original business entity table.
In detail, the data dictionary extraction module 101 obtains the keywords in the business topic set by the following operations:
performing word segmentation processing on the text in the service theme set, and removing stop words to obtain word segmentation results;
and selecting one or more keywords from the word segmentation result.
The preset language processing algorithm in the embodiment of the invention may be a publicly known algorithm such as TextRank or a semantics-based keyword extraction algorithm. For example, in judicial business, the keyword "case" is extracted from the service theme "case information" in the judicial service theme set, the data dictionary "case table" in the original data dictionary set is matched according to the keyword "case", the metadata in that data dictionary (fields such as case ID, case number and contractor) is extracted to the "case information" service theme, and the results are summarized to obtain the original business entity table under "case information".
According to the embodiment of the invention, the preset language processing algorithm is utilized to quickly identify the data in the original data dictionary, so that the omission of some key information in the original data dictionary is avoided.
The entity table processing module 102 is configured to perform missing value detection and duplicate removal operations on the original business entity table to obtain a standard business entity table.
Preferably, the entity table processing module 102 obtains the standard business entity table by:
carrying out missing value detection and filling on the data in the original business entity table to obtain a filled original business entity table;
and carrying out duplication removal operation on the data filled in the original business entity table, and obtaining the standard business entity table according to a preset business rule.
In the embodiment of the invention, a missing-value function (for example, the missmap function) can be used to detect whether the data in the original business entity table has missing values. If there is no missing value, the data is not processed; if there is a missing value, it is filled by using a preset filling algorithm to obtain the filled original business entity table.
In detail, the preset filling algorithm may be:
Figure BDA0002723880590000111
wherein L(θ) represents the filled data missing value, x_i represents the i-th data missing value, θ represents the probability parameter corresponding to the filled data missing value, n represents the quantity of data in the original business entity table, and p(x_i | θ) represents the probability of the filled data missing value.
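The formula image itself is not reproduced in this text. The symbols defined above are consistent with a likelihood of the form L(θ) = ∏ p(x_i | θ) taken over i = 1..n, but that reading is an assumption. Under it, a minimal sketch for a categorical field fills each missing value with the most frequent observed value, because the empirical frequencies maximize such a likelihood. The function name and sample rows are hypothetical.

```python
from collections import Counter


def fill_missing(rows, column):
    """Return a copy of rows in which None values of `column` are filled.

    Assumption: the field is categorical, so the maximum-likelihood estimate
    of its distribution is the empirical frequency, and the most probable
    fill value is the mode of the observed values.
    """
    observed = [r[column] for r in rows if r[column] is not None]
    if not observed:
        return list(rows)  # nothing to estimate from; leave the data as is
    mode = Counter(observed).most_common(1)[0][0]
    return [{**r, column: mode} if r[column] is None else r for r in rows]


rows = [
    {"case ID": 1, "contractor": "Zhang"},
    {"case ID": 2, "contractor": None},
    {"case ID": 3, "contractor": "Zhang"},
    {"case ID": 4, "contractor": "Li"},
]
filled = fill_missing(rows, "contractor")
print(filled[1])
```

Rows without a missing value pass through unchanged, which matches the "not processed" branch described in the text.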
Further, in the embodiment of the present invention, the data in the filled original business entity table is deduplicated by using a distance formula, where the distance formula includes:
Figure BDA0002723880590000112
wherein d represents the distance value between any two data items in the filled original business entity table, and w_1j and w_2j represent any two data items in the filled original business entity table. When the distance value is smaller than a preset distance value, either one of the two data items is deleted; if the distance value is not smaller than the preset distance value, both data items are kept. Preferably, the preset distance value may be 0.1.
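The distance formula image is not reproduced here either. The sketch below assumes a Euclidean distance over numeric feature vectors, which is one common choice that fits the symbols w_1j and w_2j; the actual formula may differ. The encoding of records as numeric tuples and the function names are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def deduplicate(records, threshold=0.1):
    """Keep a record only if it is at least `threshold` away from every kept record."""
    kept = []
    for rec in records:
        if all(euclidean(rec, k) >= threshold for k in kept):
            kept.append(rec)
    return kept


# The second record lies about 0.05 from the first, so it is treated as a duplicate.
records = [(1.0, 2.0), (1.0, 2.05), (3.0, 4.0)]
unique = deduplicate(records)
print(unique)
```

Keeping the first of two near-identical records is an arbitrary choice; the text only requires that one of the two be deleted.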
Further, in the embodiment of the present invention, the preset business rule refers to a rule for accepting or rejecting the original business entity table in different business scenarios, for example, in a judicial business scenario, if "certificate data" appears repeatedly in a "case table" or a "document table", the "document table" in the original business entity table is removed.
According to the embodiment of the invention, the missing value detection and the duplicate removal operation are carried out on the data in the original business entity table, and the data in the original business entity table is adjusted according to the preset business rule, so that the accuracy of the data is improved.
The relation mapping module 103 is configured to obtain a standard data table, and generate a mapping relation table according to the standard business entity table and the standard data table.
Preferably, the standard data table may be a national standard data table, and the national standard data table specifies each standard field and a corresponding standard code value of the standard field. For example, in the national standard data table, the value 1 of the gender field indicates male, and the value 2 indicates female.
In detail, the relationship mapping module 103 generates the mapping relationship table by:
finding data which is the same as the standard field name in the standard data table from the standard business entity table;
and configuring the mapping relation between the data and the standard code value corresponding to the standard field, and generating a mapping relation table.
For example, the standard business entity table "party information" has a gender field, which may merge gender data from system A (value 01 represents male, value 02 represents female) and system B (value 00 represents male, value 01 represents female). The data in the standard business entity table can be unified through the mapping relation table, which improves data utilization efficiency.
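The gender example can be sketched as a small lookup table. The code values are the ones given in the text (standard: 1 male, 2 female; system A: 01/02; system B: 00/01); the table layout, the "system" key and the function name are assumptions made for illustration.

```python
# Mapping relation table: (source system, field, source code) -> standard code value.
MAPPING_TABLE = {
    ("A", "gender", "01"): "1",  # system A: 01 = male
    ("A", "gender", "02"): "2",  # system A: 02 = female
    ("B", "gender", "00"): "1",  # system B: 00 = male
    ("B", "gender", "01"): "2",  # system B: 01 = female
}


def standardize(record, field="gender"):
    """Return a copy of the record with `field` replaced by its standard code value."""
    key = (record["system"], field, record[field])
    return {**record, field: MAPPING_TABLE[key]}


party_a = standardize({"system": "A", "name": "party 1", "gender": "01"})
party_b = standardize({"system": "B", "name": "party 2", "gender": "01"})
print(party_a["gender"], party_b["gender"])
```

The same source code "01" maps to different standard values depending on the source system, which is why the lookup key has to include the system.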
The statement generating module 104 is configured to generate a query statement according to the standard business entity table and the mapping relationship table.
Preferably, the query statement generated by the embodiment of the present invention may be written in the existing Structured Query Language (SQL). SQL is the most widely used language in data processing and allows a user to declare the required business logic concisely. SQL is a declarative language: the user only needs to express the requirement clearly, without knowing the specific implementation. SQL can also be optimized: various query optimizers are built in, and they can translate SQL into an optimal execution plan.
Preferably, the statement generation module 104 generates the query statement by:
generating a table building statement of the standard business entity table by using a preset statement building function;
acquiring the mapping ID of the standard business entity table, and searching all mapping scripts under the same mapping ID in the mapping relation table;
and summarizing the table building statement and the mapping script to obtain the query statement.
In the embodiment of the present invention, for example, the preset statement building function may be CREATE TABLE IF NOT EXISTS RY_ZP_HTXX (id string comment 'xx'), and the table building statement may be CREATE TABLE IF NOT EXISTS RY_ZP_HTXX (id string comment 'id', ryid string comment 'person id', scbs string comment 'delete identifier', …), where "ryid" denotes "person id", "scbs" denotes "delete identifier", ….
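A statement building function of this kind can be sketched as a simple string formatter. The table and column names follow the example in the text; the function name and its parameter layout are hypothetical, and the sketch assumes that comments contain no quote characters.

```python
def build_create_table(table_name, columns):
    """Build a table building statement.

    columns is a list of (column name, column type, comment) tuples.
    Assumption: comments contain no single quotes, so no escaping is done.
    """
    cols = ", ".join(
        f"{name} {ctype} comment '{comment}'" for name, ctype, comment in columns
    )
    return f"CREATE TABLE IF NOT EXISTS {table_name} ({cols})"


ddl = build_create_table(
    "RY_ZP_HTXX",
    [
        ("id", "string", "id"),
        ("ryid", "string", "person id"),
        ("scbs", "string", "delete identifier"),
    ],
)
print(ddl)
```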
Further, in this embodiment of the present invention, each of the standard business entity tables has a unique mapping ID.
According to the embodiment of the invention, the mapping script is generated through the unique mapping ID and the mapping relation table, so that the mapping is more accurate, the modification difficulty of the mapping script is reduced, and the maintainability is improved.
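The lookup by mapping ID and the summarizing step can be sketched as follows. The row layout of the mapping relation table, the mapping IDs and the mapping scripts (including the columns xb and src) are all hypothetical placeholders; the text does not specify them.

```python
def build_query_statement(create_stmt, mapping_id, mapping_table):
    """Join the table building statement with every mapping script of one mapping ID."""
    scripts = [
        row["script"] for row in mapping_table if row["mapping_id"] == mapping_id
    ]
    return ";\n".join([create_stmt] + scripts) + ";"


# Hypothetical mapping relation table: one row per mapping script.
mapping_table = [
    {"mapping_id": "M001", "script": "UPDATE RY_ZP_HTXX SET xb = '1' WHERE src = 'A' AND xb = '01'"},
    {"mapping_id": "M001", "script": "UPDATE RY_ZP_HTXX SET xb = '2' WHERE src = 'A' AND xb = '02'"},
    {"mapping_id": "M002", "script": "UPDATE OTHER_TABLE SET xb = '1' WHERE xb = '00'"},
]
query = build_query_statement(
    "CREATE TABLE IF NOT EXISTS RY_ZP_HTXX (id string comment 'id')",
    "M001",
    mapping_table,
)
print(query)
```

Because each standard business entity table has its own mapping ID, changing one mapping script only affects the query statement of the table that owns that ID.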
The data classification module 105 is configured to generate a data extraction script according to the query statement, extract data by using the data extraction script, and classify the data to obtain a classification result.
Preferably, the data classification module 105 obtains the classification result by:
acquiring an operation script template of a preset platform, and generating the data extraction script by using the operation script template and the query statement;
and running the data extraction script in a preset time, extracting data from a database according to the data extraction script, and classifying to obtain the classification result.
Preferably, the preset platform may be a pre-constructed big data management platform, and the data extraction script may be a shell script. The big data management platform has a scheduled task management module, which provides a script template for timed runs. The query statement is input into the big data management platform and a scheduling task is created; the data extraction script can then be generated from the query statement at scheduled times, data is extracted into the standard business entity table, and the mapping script in the query statement is used to standardize the data in the standard business entity table to obtain the final classification result. For example, if the script is set to extract data every morning, the classification result can be obtained directly.
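Generating a shell script from a template and a query statement can be sketched with Python's string.Template. The template layout, the schedule label and the run_query command are hypothetical; the text does not describe the platform's actual template or the client used to execute the query.

```python
from string import Template

# Hypothetical run script template; run_query is a placeholder command,
# not a real CLI. The query is passed through a quoted here-document.
SCRIPT_TEMPLATE = Template(
    "#!/bin/sh\n"
    "# Scheduled run: $schedule\n"
    "run_query <<'SQL'\n"
    "$query\n"
    "SQL\n"
)


def build_extraction_script(query, schedule="every morning"):
    """Fill the run script template with the schedule label and the query statement."""
    return SCRIPT_TEMPLATE.substitute(schedule=schedule, query=query)


script = build_extraction_script(
    "CREATE TABLE IF NOT EXISTS RY_ZP_HTXX (id string comment 'id');"
)
print(script)
```

The quoted here-document keeps the shell from expanding anything inside the query, so the SQL text reaches the client unchanged.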
The embodiment of the invention uses the big data management platform to generate the data extraction script automatically, which also lowers the operation threshold: technical personnel can operate it without knowing the specific business.
Fig. 10 is a schematic structural diagram of an electronic device implementing a data classification method according to an embodiment of the present invention.
The electronic device 1 may comprise a processor 10, a memory 11 and a bus, and may further comprise a computer program, such as a data classification program 12, stored in the memory 11 and operable on the processor 10.
The memory 11 includes at least one type of readable storage medium, which includes flash memory, removable hard disk, multimedia card, card-type memory (e.g., SD or DX memory, etc.), magnetic memory, magnetic disk, optical disk, etc. The memory 11 may in some embodiments be an internal storage unit of the electronic device 1, such as a removable hard disk of the electronic device 1. The memory 11 may also be an external storage device of the electronic device 1 in other embodiments, such as a plug-in mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the electronic device 1. Further, the memory 11 may also include both an internal storage unit and an external storage device of the electronic device 1. The memory 11 may be used not only to store application software installed in the electronic device 1 and various types of data, such as the code of the data classification program 12, but also to temporarily store data that has been output or is to be output.
The processor 10 may be composed of an integrated circuit in some embodiments, for example, a single packaged integrated circuit, or may be composed of a plurality of integrated circuits packaged with the same or different functions, including one or more Central Processing Units (CPUs), microprocessors, digital Processing chips, graphics processors, and combinations of various control chips. The processor 10 is a Control Unit (Control Unit) of the electronic device, connects various components of the electronic device by using various interfaces and lines, and executes various functions and processes data of the electronic device 1 by running or executing programs or modules (e.g., data classification programs, etc.) stored in the memory 11 and calling data stored in the memory 11.
The bus may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. The bus is arranged to enable connection communication between the memory 11 and at least one processor 10 or the like.
Fig. 10 shows only an electronic device with components, and it will be understood by those skilled in the art that the structure shown in fig. 10 does not constitute a limitation of the electronic device 1, and may comprise fewer or more components than those shown, or some components may be combined, or a different arrangement of components.
For example, although not shown, the electronic device 1 may further include a power supply (such as a battery) for supplying power to each component, and preferably, the power supply may be logically connected to the at least one processor 10 through a power management device, so as to implement functions of charge management, discharge management, power consumption management, and the like through the power management device. The power supply may also include any component of one or more dc or ac power sources, recharging devices, power failure detection circuitry, power converters or inverters, power status indicators, and the like. The electronic device 1 may further include various sensors, a bluetooth module, a Wi-Fi module, and the like, which are not described herein again.
Further, the electronic device 1 may further include a network interface, and optionally, the network interface may include a wired interface and/or a wireless interface (such as a WI-FI interface, a bluetooth interface, etc.), which are generally used for establishing a communication connection between the electronic device 1 and other electronic devices.
Optionally, the electronic device 1 may further comprise a user interface, which may be a Display (Display), an input unit (such as a Keyboard), and optionally a standard wired interface, a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display, which may also be referred to as a display screen or display unit, is suitable for displaying information processed in the electronic device 1 and for displaying a visualized user interface, among other things.
It is to be understood that the described embodiments are for purposes of illustration only and that the scope of the appended claims is not limited to such structures.
The data classification program 12 stored in the memory 11 of the electronic device 1 is a combination of instructions that, when executed in the processor 10, may implement:
acquiring an original data dictionary set and a preset service theme set, and extracting a data dictionary in the original data dictionary set to the service theme set to obtain an original service entity table under each service theme;
carrying out missing value detection and duplicate removal operation on the original business entity table to obtain a standard business entity table;
acquiring a standard data table, and generating a mapping relation table according to the standard business entity table and the standard data table;
generating a query statement according to the standard business entity table and the mapping relation table;
and generating a data extraction script according to the query statement, extracting data by using the data extraction script and classifying to obtain a classification result.
Specifically, the specific implementation method of the processor 10 for the instruction may refer to the description of the relevant steps in the embodiments corresponding to fig. 1 to fig. 8, which is not repeated herein.
Further, the integrated modules/units of the electronic device 1, if implemented in the form of software functional units and sold or used as separate products, may be stored in a non-volatile computer-readable storage medium. The computer-readable medium may include: any entity or device capable of carrying said computer program code, recording medium, U-disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM).
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus, device and method can be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is only one logical functional division, and other divisions may be realized in practice.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional module.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof.
The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned.
The block chain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, a consensus mechanism, an encryption algorithm and the like. A block chain (Blockchain), which is essentially a decentralized database, is a series of data blocks associated by using a cryptographic method, and each data block contains information of a batch of network transactions, so as to verify the validity (anti-counterfeiting) of the information and generate a next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the system claims may also be implemented by one unit or means in software or hardware. The terms first, second, etc. are used to denote names, and do not denote any particular order.
Finally, it should be noted that the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.

Claims (10)

1. A method of data classification, the method comprising:
acquiring an original data dictionary set and a preset service theme set, and extracting a data dictionary in the original data dictionary set to the service theme set to obtain an original service entity table under each service theme;
carrying out missing value detection and duplicate removal operation on the original business entity table to obtain a standard business entity table;
acquiring a standard data table, and generating a mapping relation table according to the standard business entity table and the standard data table;
generating a query statement according to the standard business entity table and the mapping relation table;
and generating a data extraction script according to the query statement, extracting data by using the data extraction script and classifying to obtain a classification result.
2. The data classification method according to claim 1, wherein the obtaining an original data dictionary set and a preset service topic set, and extracting a data dictionary from the original data dictionary set to the service topic set to obtain an original service entity table under each service topic comprises:
extracting key words in the service theme set by using a preset language processing algorithm;
matching a corresponding data dictionary in the original data dictionary set according to the keywords, and extracting metadata in the data dictionary to the service theme set;
and summarizing metadata in all data dictionaries under all the business topics in the business topic set to obtain the original business entity table.
3. The data classification method according to claim 2, wherein the extracting the keywords in the business topic sets by using a preset language processing algorithm comprises:
performing word segmentation processing on the text in the service theme set, and removing stop words to obtain word segmentation results;
and selecting one or more keywords from the word segmentation result.
4. The data classification method of claim 1, wherein the performing missing value detection and deduplication operations on the original business entity table to obtain a standard business entity table comprises:
carrying out missing value detection and filling on the data in the original business entity table to obtain a filled original business entity table;
and carrying out duplication removal operation on the data filled in the original business entity table, and obtaining the standard business entity table according to a preset business rule.
5. The data classification method according to claim 1, wherein the generating a mapping relation table according to the standard business entity table and the standard data table comprises:
finding data which is the same as the standard field name in the standard data table from the standard business entity table;
and configuring the mapping relation between the data and the standard code value corresponding to the standard field, and generating a mapping relation table.
6. The data classification method according to claim 1, wherein the generating a query statement from the standard business entity table and the mapping relationship table comprises:
generating a table building statement of the standard business entity table by using a preset statement building function;
acquiring the mapping ID of the standard business entity table, and searching all mapping scripts under the same mapping ID in the mapping relation table;
and summarizing the table building statement and the mapping script to obtain the query statement.
7. The data classification method according to claim 1, wherein the generating a data extraction script according to the query statement, and extracting and classifying data by using the data extraction script to obtain a classification result comprises:
acquiring an operation script template of a preset platform, and generating the data extraction script by using the operation script template and the query statement;
and running the data extraction script in a preset time, extracting data from a database according to the data extraction script, and classifying to obtain the classification result.
8. An apparatus for classifying data, the apparatus comprising:
the data dictionary extraction module is used for acquiring an original data dictionary set and a preset service theme set, extracting a data dictionary in the original data dictionary set to the service theme set and obtaining an original service entity table under each service theme;
the entity table processing module is used for carrying out missing value detection and duplicate removal operation on the original business entity table to obtain a standard business entity table;
the relation mapping module is used for acquiring a standard data table and generating a mapping relation table according to the standard business entity table and the standard data table;
the statement generating module is used for generating query statements according to the standard business entity table and the mapping relation table;
and the data classification module is used for generating a data extraction script according to the query statement, extracting data by using the data extraction script and classifying to obtain a classification result.
9. An electronic device, characterized in that the electronic device comprises:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a data classification method as claimed in any one of claims 1 to 7.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out a data classification method according to any one of claims 1 to 7.
CN202011099802.4A 2020-10-14 2020-10-14 Data classification method and device, electronic equipment and storage medium Pending CN112231417A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011099802.4A CN112231417A (en) 2020-10-14 2020-10-14 Data classification method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011099802.4A CN112231417A (en) 2020-10-14 2020-10-14 Data classification method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112231417A true CN112231417A (en) 2021-01-15

Family

ID=74112971

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011099802.4A Pending CN112231417A (en) 2020-10-14 2020-10-14 Data classification method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112231417A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112948380A (en) * 2021-02-24 2021-06-11 深圳壹账通智能科技有限公司 Data storage method and device based on big data, electronic equipment and storage medium
CN113283765A (en) * 2021-05-31 2021-08-20 浙江环玛信息科技有限公司 Intelligent court case data processing method and system
CN113806434A (en) * 2021-09-22 2021-12-17 平安科技(深圳)有限公司 Big data processing method, device, equipment and medium
CN117540343A (en) * 2024-01-09 2024-02-09 苏州元澄科技股份有限公司 Data fusion method and system

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112948380A (en) * 2021-02-24 2021-06-11 深圳壹账通智能科技有限公司 Data storage method and device based on big data, electronic equipment and storage medium
CN113283765A (en) * 2021-05-31 2021-08-20 浙江环玛信息科技有限公司 Intelligent court case data processing method and system
CN113806434A (en) * 2021-09-22 2021-12-17 平安科技(深圳)有限公司 Big data processing method, device, equipment and medium
CN113806434B (en) * 2021-09-22 2023-09-05 平安科技(深圳)有限公司 Big data processing method, device, equipment and medium
CN117540343A (en) * 2024-01-09 2024-02-09 苏州元澄科技股份有限公司 Data fusion method and system
CN117540343B (en) * 2024-01-09 2024-04-16 苏州元澄科技股份有限公司 Data fusion method and system

Similar Documents

Publication Publication Date Title
CN112231417A (en) Data classification method and device, electronic equipment and storage medium
CN112052242A (en) Data query method and device, electronic equipment and storage medium
CN111428458A (en) Universal report generation method and device and computer readable storage medium
CN112541338A (en) Similar text matching method and device, electronic equipment and computer storage medium
CN112115152B (en) Data increment updating and inquiring method and device, electronic equipment and storage medium
CN114979120B (en) Data uploading method, device, equipment and storage medium
CN112541745A (en) User behavior data analysis method and device, electronic equipment and readable storage medium
CN112364107A (en) System analysis visualization method and device, electronic equipment and computer readable storage medium
CN113961584A (en) Method and device for analyzing field blood relationship, electronic equipment and storage medium
CN115408399A (en) Blood relationship analysis method, device, equipment and storage medium based on SQL script
CN112559687A (en) Question identification and query method and device, electronic equipment and storage medium
CN114138784A (en) Information tracing method and device based on storage library, electronic equipment and medium
CN114610747A (en) Data query method, device, equipment and storage medium
CN112528013A (en) Text abstract extraction method and device, electronic equipment and storage medium
CN113806434A (en) Big data processing method, device, equipment and medium
CN113887941A (en) Business process generation method and device, electronic equipment and medium
CN114880368A (en) Data query method and device, electronic equipment and readable storage medium
CN113434542B (en) Data relationship identification method and device, electronic equipment and storage medium
CN114003704A (en) Method and device for creating designated tag guest group, electronic equipment and storage medium
CN113157739A (en) Cross-modal retrieval method and device, electronic equipment and storage medium
CN115409041B (en) Unstructured data extraction method, device, equipment and storage medium
CN115146064A (en) Intention recognition model optimization method, device, equipment and storage medium
CN115114297A (en) Data lightweight storage and search method and device, electronic equipment and storage medium
CN114996386A (en) Business role identification method, device, equipment and storage medium
CN112506931A (en) Data query method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination