CN113094415A - Data extraction method and device, computer readable medium and electronic equipment - Google Patents

Publication number
CN113094415A
Authority
CN
China
Prior art keywords
data
field
target
record
empty
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911342183.4A
Other languages
Chinese (zh)
Other versions
CN113094415B (en)
Inventor
陈雪松
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Yiyiyun Technology Co ltd
Original Assignee
Beijing Yiyiyun Technology Co ltd
Application filed by Beijing Yiyiyun Technology Co ltd
Priority to CN201911342183.4A
Publication of CN113094415A
Application granted
Publication of CN113094415B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25: Integrating or interfacing systems involving database management systems
    • G06F16/258: Data format conversion from or to a database

Abstract

The present disclosure provides a data extraction method, a data extraction apparatus, a computer-readable medium, and an electronic device, and relates to the technical field of data processing. The data extraction method comprises the following steps: acquiring a non-empty field set of a database and a plurality of data records in each data table in the database; determining a target field contained in each data record according to a non-null field value contained in each data record; and determining a target data record from the plurality of data records according to the target field contained in each data record and the non-empty field set, and determining the target data record as the data to be extracted of the database. The disclosed data extraction method can still ensure the consistency of the samples when fewer samples are extracted, thereby saving computing resources and improving the extraction effectiveness.

Description

Data extraction method and device, computer readable medium and electronic equipment
Technical Field
The present disclosure relates to the field of data processing technologies, and in particular, to a data extraction method, a data extraction device, a computer-readable medium, and an electronic device.
Background
Databases are important tools for storing and managing data, and virtually every internet technology depends on database support. When data from different sources and with different structures are used, the data must first be structured to obtain a standard data model with a uniform, standardized structure; the database can then be used to process the service data.
When converted to the standard data model, a given set of raw data typically does not involve all types of standard data, and the conversion result is only a subset of the standard data. For example, the original data may be mapped and associated so as to fill in the relevant fields of several service tables of the standard data model, while the other service tables of the standard data model, or the irrelevant fields in those tables, are left empty. However, the data distribution in the service tables is not clear to service personnel: when data quality inspection and verification are required, it is not clear which fields in a table actually hold values, so targeted extraction cannot be performed, all the data must be processed, and efficiency is low.
It is to be noted that the information disclosed in the above background section is only for enhancement of understanding of the background of the present disclosure, and thus may include information that does not constitute prior art known to those of ordinary skill in the art.
Disclosure of Invention
The present disclosure is directed to a data extraction method, a data extraction device, a computer readable medium, and an electronic device, which can overcome the problem of low extraction efficiency caused by a large data extraction scale to a certain extent, and further improve the efficiency of extracting and processing data.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows, or in part will be obvious from the description, or may be learned by practice of the disclosure.
According to a first aspect of the present disclosure, there is provided a data extraction method, including:
acquiring a non-empty field set of a database and a plurality of data records in each data table in the database;
determining a target field contained in each data record according to a non-null field value contained in each data record;
and determining a target data record from the plurality of data records according to the target field and the non-empty field set contained in each data record, and determining the target data record as the data to be extracted of the database.
In an exemplary embodiment of the present disclosure, before acquiring the non-empty field set of the database, the method further includes:
and judging whether the field value corresponding to each field in each data table in the database is empty or not, and determining the field with the corresponding field value not being empty as a non-empty field in the data table so as to obtain the non-empty field set.
In an exemplary embodiment of the present disclosure, the determining whether a field value corresponding to each field in each data table in the database is empty includes:
and judging whether the field value corresponding to each field is empty or not according to the field type of each field.
In an exemplary embodiment of the disclosure, the determining a target data record from the plurality of data records according to the target field and the non-empty field set included in each of the data records includes:
classifying the plurality of data records according to primary key values contained in the plurality of data records, and determining a record set corresponding to each primary key value;
and determining the target set from each record set according to the non-empty field set and the target field contained in each record set, and taking the data contained in the target set as the target data record.
In an exemplary embodiment of the present disclosure, the determining the target set from each record set according to the non-empty field set and a target field included in each record set includes:
determining the number of target fields contained in each record set, and sequencing each record set according to the number from large to small;
and determining a plurality of target sets according to the sorted order, so that all target fields contained in the plurality of target sets are the same as the non-empty fields in the non-empty field set.
In an exemplary embodiment of the present disclosure, the determining the target set from each record set according to the non-empty field set and a target field included in each record set includes:
according to target fields contained in each record set, respectively calculating a first complementary set of each record set and the non-empty field set, and taking a record set corresponding to the target first complementary set with the least contained elements as a candidate set;
calculating second complementary sets of the target first complementary set and all the record sets, and merging the record set corresponding to the target second complementary set with the least elements into the candidate set, until the target fields contained in the candidate set are equal to the non-empty field set;
and, if the target fields contained in the candidate set are equal to the non-empty field set, determining the candidate set as the target set.
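As an illustrative sketch only, the complement-based selection above can be read as a greedy covering procedure. The Python below assumes each record set has already been reduced to the set of target fields it contains; the function name `select_by_complement`, the dictionary layout, and the early exit when no record set adds coverage are assumptions made for illustration and are not taken from the disclosure.

```python
def select_by_complement(set_fields: dict[str, set[str]],
                         non_empty: set[str]) -> list[str]:
    """Greedily pick record sets whose complement against the uncovered fields is smallest."""
    remaining = set(non_empty)   # fields not yet covered by the candidate set
    unused = dict(set_fields)    # record sets not yet merged into the candidate set
    candidate: list[str] = []
    while remaining and unused:
        # Complement of every unused record set relative to the uncovered fields.
        complements = {key: remaining - fields for key, fields in unused.items()}
        best = min(complements, key=lambda key: len(complements[key]))
        if len(complements[best]) == len(remaining):
            break  # assumption: stop when no record set covers anything new
        candidate.append(best)
        remaining = complements[best]
        del unused[best]
    return candidate


record_sets = {"A": {"a", "b", "c"}, "B": {"c", "d"}, "C": {"d", "e"}}
print(select_by_complement(record_sets, {"a", "b", "c", "d", "e"}))
```

In this hypothetical example, record set A leaves the smallest first complement, and record set C then leaves an empty second complement, so the candidate set consists of A and C.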
In an exemplary embodiment of the present disclosure, after determining the target data record as the data to be extracted of the database, the method further includes:
determining primary key values respectively corresponding to the target data records to obtain a primary key value set;
and if an extraction request is received, sending the primary key value set to a sending end of the extraction request so that the sending end can extract the database through the primary key value set.
According to a second aspect of the present disclosure, there is provided a data extraction apparatus, including a non-empty field acquisition module, a table data determination module, and an extraction data determination module, wherein:
the non-empty field acquisition module is used for acquiring a non-empty field set of a database and a plurality of data records in each data table in the database;
the table data determining module is used for determining a target field contained in each data record according to a non-null field value contained in each data record;
and the extracted data determining module is used for determining a target data record from the plurality of data records according to the target field and the non-empty field set contained in each data record, and determining the target data record as the data to be extracted of the database.
In an exemplary embodiment of the present disclosure, the apparatus further includes a non-empty determining module, configured to determine whether a field value corresponding to each field in each data table in the database is empty, and determine a field whose corresponding field value is not empty as a non-empty field in the data table, so as to obtain the non-empty field set.
In an exemplary embodiment of the disclosure, the non-null determining module may be specifically configured to determine whether a field value corresponding to each of the fields is null according to a field type of each of the fields.
In an exemplary embodiment of the present disclosure, the extracted data determining module includes a classifying unit and a set determining unit, wherein:
and the classifying unit is used for classifying the plurality of data records according to the primary key values contained in the plurality of data records and determining the record sets corresponding to the primary key values respectively.
And the set determining unit is used for determining the target set from each record set according to the non-empty field set and the target field contained in each record set, and taking the data contained in the target set as the target data record.
In an exemplary embodiment of the present disclosure, the set determining unit may specifically include a sorting unit and a set selecting unit, where:
and the sorting unit is used for determining the number of target fields contained in each record set and sorting each record set according to the number from large to small.
And the set selection unit is used for determining a plurality of target sets according to the sorting order so that all target fields contained in the target sets are the same as the non-empty fields in the non-empty field set.
In an exemplary embodiment of the present disclosure, the set determining unit may specifically include a complement calculating unit, a set merging unit, and a set judging unit, where:
and the complementary set calculating unit is used for respectively calculating first complementary sets of the record sets and the non-empty field sets according to target fields contained in the record sets, and taking the record set corresponding to the target first complementary set with the least contained elements as a candidate set.
And the set merging unit is used for calculating the target first complementary set and second complementary sets of all the record sets, and merging the record sets corresponding to the target second complementary set with the least elements into the candidate set.
A set judgment unit, configured to determine the candidate set as the target set if the target field included in the candidate set is equal to the non-empty field set.
In an exemplary embodiment of the present disclosure, the apparatus further includes a primary key determining module and a set sending module, wherein:
and the primary key determining module is used for determining the primary key values corresponding to the target data records respectively so as to acquire a primary key value set.
And the set sending module is used for receiving an extraction request and sending the primary key value set to a sending end of the extraction request so that the sending end can extract the database through the primary key value set.
According to a third aspect of the present disclosure, there is provided an electronic device comprising: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform the method of any one of the above via execution of the executable instructions.
According to a fourth aspect of the present disclosure, there is provided a computer readable medium having stored thereon a computer program which, when executed by a processor, implements the method of any one of the above.
Exemplary embodiments of the present disclosure may have some or all of the following benefits:
in the data extraction method provided by an example embodiment of the present disclosure, the target data record is determined from the data table through the non-empty field set of the database to obtain the extracted data, which can ensure that the extracted data is non-empty and improve the extraction effectiveness. Moreover, the coverage degree of the target data record to the field in the database can be maximized through the field value contained in the data record, so that the extraction effect can be improved. In addition, the target data records to be extracted are far less than the data amount required by random extraction, and the number of samples can be reduced, so that the computing resources are saved, the time cost is reduced, and the extraction efficiency is improved; meanwhile, the processing efficiency of the extracted data is improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure. It is to be understood that the drawings in the following description are merely exemplary of the disclosure, and that other drawings may be derived from those drawings by one of ordinary skill in the art without the exercise of inventive faculty.
FIG. 1 schematically illustrates a flow diagram of a data extraction method according to one embodiment of the present disclosure;
FIG. 2 schematically illustrates a flow diagram of a data extraction method according to another embodiment of the present disclosure;
FIG. 3 schematically illustrates a flow diagram of a data extraction method according to another embodiment of the present disclosure;
FIG. 4 schematically illustrates a flow diagram of a data extraction method according to another embodiment of the present disclosure;
FIG. 5 schematically illustrates a flow diagram of a data extraction method according to another embodiment of the present disclosure;
FIG. 6 schematically illustrates a data table according to one embodiment of the present disclosure;
FIG. 7 schematically illustrates a block diagram of a data extraction apparatus according to one embodiment of the present disclosure;
FIG. 8 schematically illustrates a system architecture diagram for implementing a data extraction method according to one embodiment of the present disclosure;
FIG. 9 illustrates a schematic structural diagram of a computer system suitable for use in implementing the electronic device of an embodiment of the present disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the disclosure. One skilled in the relevant art will recognize, however, that the subject matter of the present disclosure can be practiced without one or more of the specific details, or with other methods, components, devices, steps, and the like. In other instances, well-known technical solutions have not been shown or described in detail to avoid obscuring aspects of the present disclosure.
Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus their repetitive description will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
The technical solution of the embodiment of the present disclosure is explained in detail below:
In one solution provided by the inventor, when data in a database is extracted, all tables may be traversed and, for every field in each table, a data record containing a value for that field is extracted, so that each table is exhausted to obtain the final extraction result. However, the size of the extracted sample is then on the order of the total number of fields across all tables, and checking the extracted data afterwards consumes a large amount of resources, which increases cost.
In view of one or more of the above problems, the present example embodiment provides a data extraction method. Referring to fig. 1, the data extraction method may include the steps of:
Step S110: acquiring a non-empty field set of a database, as well as a plurality of data records in each data table in the database.
Step S120: determining a target field contained in each data record according to the non-null field value contained in each data record.
Step S130: determining a target data record from the plurality of data records according to the target field contained in each data record and the non-empty field set, and determining the target data record as the data to be extracted of the database.
In the data extraction method provided by an example embodiment of the present disclosure, the target data record is determined from the data table through the non-empty field set of the database to obtain the extracted data, which can ensure that the extracted data is non-empty and improve the extraction effectiveness. Moreover, the coverage of the fields in the database by the target data records can be maximized through the field values contained in the data records, so that the extraction effect can be improved. In addition, the target data records to be extracted are far fewer than the amount of data required by random extraction, and the number of samples can be reduced, so that computing resources are saved, the time cost is reduced, and the extraction efficiency is improved; meanwhile, the processing efficiency of the extracted data is improved.
The above steps of the present exemplary embodiment will be described in more detail below.
In step S110, a non-empty field set of a database and a plurality of data records in each data table in the database are obtained.
A non-empty field may refer to a field in which a value exists; that is, a field is a non-empty field as long as at least one value exists in the column to which the field corresponds. The non-empty fields may include the primary key of the table and may also include other fields in the table; in addition, one table may contain a plurality of non-empty fields, which is not particularly limited in this embodiment. A data record is a row of data in a table. Each data record may include a field value corresponding to each field in the table, and, because the primary key is unique, the value corresponding to the primary key uniquely identifies one data record. By querying each data table, all data records in the data table can be obtained.
When data is filled into the data model, each field that is filled with a value can be recorded, so that the non-empty fields of the database are obtained. For example, if the "name" field is assigned the value "Zhang San", the field is a non-empty field and is labeled as such; the non-empty fields of each data table are then obtained according to the labels, yielding the non-empty field set of the database. As another example, before the non-empty fields in each data table of the database are acquired, whether the value corresponding to each field is empty may be judged across all the data tables, and a field whose corresponding value is not empty is determined as a non-empty field. Specifically, all fields in the database are determined, and for each field it is judged, table by table, whether the field has a value. If all data records of a table have been traversed and the only value corresponding to the field is NULL, the next data table is traversed; if, after all data tables have been traversed, the field has no value other than NULL, the field is an empty field. If the field has some value other than NULL, for example the field is "cost" and one of its values is "123", the field is a non-empty field.
A field may have one of several field types, for example integer or array, and the type of the field's values differs between field types, so whether the value corresponding to a field is empty may be judged according to the field type of the field. Specifically, if the field type is a simple type such as integer or boolean, it is only necessary to judge whether a value exists in the field, and if the field has a value, the field is a non-empty field. If the field type is a composite type such as an array or a dictionary, the elements contained in each value of the field need to be examined. Illustratively, when the field type is a key-value dictionary, each key serves as a subordinate field of the field; when the value of a key is not empty, that subordinate field can be considered a non-empty field, and the field to which the subordinate field belongs is also a non-empty field. When the field type is an array, each element in the array is examined in turn until a non-empty element is encountered, at which point the field to which the array belongs is considered non-empty.
In addition, according to each field type, whether each field is a non-empty field may be further differentiated in a more detailed manner, for example, for each field type, a different non-empty condition is determined to determine whether the field satisfies the non-empty condition, and if so, the field is a non-empty field, which is not limited in this embodiment.
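A minimal sketch of such a type-dependent check, written in Python, is given below. The function name `is_non_empty` is hypothetical, `None` stands in for the database NULL, and treating values such as `0`, `False`, or an empty string as "a value exists" is an assumption; a stricter non-empty condition per type could be substituted.

```python
from typing import Any


def is_non_empty(value: Any) -> bool:
    """Return True if `value` counts as non-empty under a type-based rule."""
    if value is None:  # None models NULL here (assumption)
        return False
    # Dictionary type: each key acts as a subordinate field; the field is
    # non-empty as soon as one subordinate value is non-empty.
    if isinstance(value, dict):
        return any(is_non_empty(v) for v in value.values())
    # Array type: examine the elements in turn until a non-empty one is found.
    if isinstance(value, (list, tuple)):
        return any(is_non_empty(v) for v in value)
    # Simple types (integer, boolean, string, ...): any present value counts.
    return True


print(is_non_empty(123))                 # simple type holding a value
print(is_non_empty([None, {}]))          # array with no non-empty element
print(is_non_empty({"a": [None, 3]}))    # nested composite holding a value
```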
The data table may generally define a plurality of fields, and the fields defined by different data tables may also be different, after all the fields in all the data tables are subjected to non-null determination, a plurality of non-null fields may be obtained, and each non-null field may be identified by an identifier, so as to distinguish the non-null field from the null field. The non-empty fields in all tables in the database may constitute a corresponding set of non-empty fields of the database. The corresponding set of non-empty fields may be different for different databases.
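The traversal can be sketched as follows. This is an illustration under stated assumptions rather than the disclosed implementation: the database is modeled as an in-memory dictionary from table name to rows, fields are identified by a `table.field` string, `None` models NULL, and the name `non_empty_field_set` is hypothetical.

```python
from typing import Any

Row = dict[str, Any]
Database = dict[str, list[Row]]  # hypothetical layout: table name -> rows


def non_empty_field_set(db: Database) -> set[str]:
    """Collect every field that holds at least one non-NULL value."""
    fields: set[str] = set()
    for table, rows in db.items():
        for row in rows:
            for field, value in row.items():
                if value is not None:  # a type-aware check could be used instead
                    fields.add(f"{table}.{field}")
    return fields


db = {
    "patient": [
        {"id": 1, "name": "Zhang San", "cost": None},
        {"id": 2, "name": None, "cost": None},
    ],
    "visit": [{"id": 1, "cost": 123, "note": None}],
}
print(sorted(non_empty_field_set(db)))
```

In this made-up data, `patient.cost` and `visit.note` hold only NULL and are therefore left out of the resulting set.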
In step S120, a target field included in each of the data records is determined according to a non-null field value included in each of the data records.
The target field refers to a field corresponding to a non-null field value, i.e., a field for which a value exists in the data record; in other words, the target fields are the non-empty fields of that data record. After all the data records in the data table are obtained, it can be judged whether each data record contains non-null field values; if a data record contains a non-null field value, the corresponding field is a target field of that record. By determining the target fields contained in the data records of each data table, the coverage of the non-empty fields by each data record can be determined: the more target fields a data record contains, the more information it carries, and the more valuable it is as a sample.
For example, as shown in Table 1, the table has 7 fields, where ID is the primary key and "√" indicates that the field has a value. Looking up Table 1, two data records with primary key values Id1 and Id2 can be obtained. The data record Id1 contains field values for field 1, field 2, and field 3, and the values of its other fields are all null, so the target fields contained in that data record are field 1, field 2, and field 3; the target fields in the data record Id2 are field 3 and field 4.
ID   Field 1  Field 2  Field 3  Field 4  Field 5  Field 6
Id1  √        √        √
Id2                    √        √
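The Table 1 example can be reproduced with a short Python sketch. The function name `target_fields`, the placeholder value `"x"`, and the choice to leave the primary key out of the result (following the example above, which lists only field 1 to field 4) are assumptions made for illustration.

```python
from typing import Any


def target_fields(record: dict[str, Any], primary_key: str = "ID") -> set[str]:
    """Return the fields of one data record that hold a non-NULL value."""
    return {
        name
        for name, value in record.items()
        if name != primary_key and value is not None
    }


# Table 1, with "x" standing in for a cell marked as having a value.
table_1 = [
    {"ID": "Id1", "Field 1": "x", "Field 2": "x", "Field 3": "x",
     "Field 4": None, "Field 5": None, "Field 6": None},
    {"ID": "Id2", "Field 1": None, "Field 2": None, "Field 3": "x",
     "Field 4": "x", "Field 5": None, "Field 6": None},
]
for record in table_1:
    print(record["ID"], sorted(target_fields(record)))
```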
In step S130, a target data record is determined from the plurality of data records according to the target field and the non-empty field set included in each data record, and the target data record is determined as the data to be extracted from the database.
And selecting the target data record from the data records according to the target fields contained in the data records and all the non-empty fields in the non-empty field set, wherein the target data record can be used as the data to be extracted. The data to be extracted can be used as a sample for data quality inspection and data verification, so that all data in the database are verified through the data to be extracted, and compared with a random extraction sample, the field value contained in the target data record is more, and the consistency with the database is higher.
For example, when determining the target fields included in each data record, the number of the target fields may be recorded, so that the data record with a larger number of the target fields is selected as the target data record. For example, data records containing target fields with a number exceeding a preset value are selected as target data records, or the data records are sorted according to the number of the contained target fields, and the first N (N is a positive integer) or the last N data records are selected as the target data records.
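Both selection rules can be sketched in a few lines of Python. The mapping from primary key value to target fields and the function names `top_n` and `above_threshold` are hypothetical; only the descending "first N" variant is shown.

```python
def top_n(fields_by_record: dict[str, set[str]], n: int) -> list[str]:
    """Return the keys of the N records containing the most target fields."""
    ranked = sorted(
        fields_by_record,
        key=lambda key: len(fields_by_record[key]),
        reverse=True,
    )
    return ranked[:n]


def above_threshold(fields_by_record: dict[str, set[str]], preset: int) -> list[str]:
    """Return the keys of records whose target-field count exceeds `preset`."""
    return [key for key, fields in fields_by_record.items() if len(fields) > preset]


fields_by_record = {
    "Id1": {"Field 1", "Field 2", "Field 3"},
    "Id2": {"Field 3", "Field 4"},
    "Id3": {"Field 5"},
}
print(top_n(fields_by_record, 2))
print(above_threshold(fields_by_record, 2))
```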
Since the primary key can uniquely identify a data record, the method may further include steps S201 to S202, as shown in fig. 2, wherein:
In step S201, primary key values corresponding to the target data records are determined to obtain a primary key value set. The primary key in each data table is set by a user when the table is built; the primary key is unique and non-null, and the primary key value is the value corresponding to the primary key. For example, the primary key in a student table may be a "student number" field, and a primary key value may be "20112356". After the target data records are determined, the primary key value of each target data record can be extracted, so that the primary key value set corresponding to all the target data records is obtained.
In step S202, an extraction request is received, and the primary key value set is sent to a sending end of the extraction request, so that the sending end extracts the database through the primary key value set. For example, in a plurality of scenarios that need to operate a database, such as data model verification and data quality inspection, data in the database needs to be extracted, a client having a data extraction requirement may send an extraction request to a server, the server may send the primary key value set to the client after receiving the extraction request, and the client may obtain a target data record by using the primary key value after obtaining the primary key value set, thereby completing operations such as data verification by using the target data record.
In other embodiments of the present disclosure, the module sending the extraction request may be a module in the terminal device, and the module receiving the extraction request may be another module, or the client sending the extraction request may be another client receiving the extraction request, and the present disclosure is not limited thereto. In the embodiment, the magnitude order of the primary key value set is much smaller than that of the database, and the extracted data in the database can be acquired by using the primary key value set, so that the operability of the data is greatly improved; and, the space occupied by the target data record is greatly reduced, thereby saving resources.
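The exchange in steps S201 and S202 can be sketched as follows, with the server side reduced to building the primary key value set and the requesting side reduced to a lookup by key. The in-memory table, the `id` column, and the function names are assumptions for illustration; no particular database API is implied.

```python
from typing import Any

Row = dict[str, Any]


def primary_key_set(target_records: list[Row], primary_key: str = "id") -> set[Any]:
    """Step S201 (sketch): collect the primary key values of the target records."""
    return {record[primary_key] for record in target_records}


def fetch_by_keys(table: list[Row], keys: set[Any], primary_key: str = "id") -> list[Row]:
    """Requesting side (sketch): use the key set to pull the target records."""
    return [row for row in table if row[primary_key] in keys]


table = [
    {"id": "20112356", "name": "Zhang San", "cost": 123},
    {"id": "20112357", "name": None, "cost": None},
]
keys = primary_key_set([table[0]])   # the small set that is sent in step S202
print(fetch_by_keys(table, keys))
```

Only the key set crosses the boundary between the two sides, which is what keeps the exchanged data small relative to the database.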
In an alternative embodiment, determining the target data record from the plurality of data records may comprise step S301 and step S302, as shown in fig. 3, wherein:
In step S301, the data records are classified according to the primary key values contained in the data records, and a record set corresponding to each primary key value is determined. Specifically, the data records identified by the same primary key value are placed in the same class, so that a plurality of record sets identified by different primary key values are obtained, and the number of primary key values may be the same as the number of record sets. Within one data table, a primary key value identifies only one data record, but the same primary key value may also identify data records in other data tables; that is, the record set corresponding to a primary key value may contain as many data records as there are data tables. For example, if a database contains 20 data tables, each record set may contain 20 data records, one identified by the primary key value in each of the 20 tables.
It should be understood that in this embodiment, all tables in the database may have the same primary key. The value of the primary key may be the same in different data tables, while the value of the primary key is unique in the same data table.
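A minimal sketch of this classification step is shown below, assuming every table shares the same primary key column (here the hypothetical name `id`) and that the database is held as a dictionary from table name to rows.

```python
from collections import defaultdict
from typing import Any

Row = dict[str, Any]


def group_by_primary_key(db: dict[str, list[Row]],
                         primary_key: str = "id") -> dict[Any, list[tuple[str, Row]]]:
    """Group the rows of all tables into one record set per primary key value."""
    record_sets: dict[Any, list[tuple[str, Row]]] = defaultdict(list)
    for table, rows in db.items():
        for row in rows:
            # Keep the table name so the origin of each record stays visible.
            record_sets[row[primary_key]].append((table, row))
    return dict(record_sets)


db = {
    "patient": [{"id": 1, "name": "Zhang San"}, {"id": 2, "name": None}],
    "visit": [{"id": 1, "cost": 123}, {"id": 2, "cost": None}],
}
for key, records in group_by_primary_key(db).items():
    print(key, [table for table, _ in records])
```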
In step S302, the target set is determined from the record sets according to the non-empty field set and the target fields contained in each record set, and the data contained in the target set is used as the target data record. For example, a record set A is randomly selected as a target set, and the target fields contained in A are a, b, and c. The non-empty fields in the non-empty field set other than a, b, and c are then determined, and a record set B containing some of those other non-empty fields is selected as a further target set. The non-empty fields not yet covered by the target fields of A and B are determined in turn, and further record sets containing such fields are selected, until the union of the target fields in the selected target sets covers all the non-empty fields.
In an alternative embodiment, determining the target set from each record set may include step S401 and step S402, as shown in fig. 4, where:
in step S401, the number of target fields contained in each record set is determined, and the record sets are sorted by that number in descending order. Specifically, the target fields contained in all the data records of a record set are counted together to give the number of target fields of the record set, and a field that has a value in more than one data record is counted only once. For example, suppose record set A contains 10 data records, where the first data record contains field 1, field 3, and field 4, and the second data record contains field 2, field 3, and field 5. The fields contained in the first and second data records are then field 1, field 2, field 3, field 4, and field 5, so the number is 5; the fields contained in all 10 data records of record set A are counted in the same way. After the number of target fields contained in each record set is determined, the record sets are sorted by number in descending order.
In step S402, a plurality of target sets are determined according to the sorted order, so that the target fields contained in the plurality of target sets, taken together, are the same as the non-empty fields in the non-empty field set. Illustratively, the first record set may be selected as a target set, and the target fields it contains are removed from the non-empty field set to obtain the remaining non-empty fields. The second record set is then selected as a target set in order, and the target fields it contains are removed from the remaining non-empty fields. If the remaining non-empty fields do not contain a target field of the second record set, that target field is the same as a target field of the first record set and has already been removed; such a target field is skipped and the next target field is processed. These steps are repeated until all the non-empty fields in the non-empty field set have been removed, and the target sets determined at that point are the final extracted data. For example, if the non-empty field set becomes empty after the target fields contained in the 50th record set are removed from it, the first 50 record sets are the target sets. Furthermore, the obtained target sets may be merged into one set, and that set is taken as the extracted data.
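Steps S401 and S402 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the implementation of the disclosure: the function name select_by_sorted_order, the mapping from primary key value to target fields, and the sample data are hypothetical.

```python
def select_by_sorted_order(fields_by_key, non_empty_fields):
    """Take record sets in descending order of field count (S401) until every
    non-empty field has been removed from the remaining set (S402)."""
    remaining = set(non_empty_fields)
    # sorted() is stable, so ties keep their original relative order.
    ordered = sorted(fields_by_key, key=lambda k: len(fields_by_key[k]), reverse=True)
    selected = []
    for key in ordered:
        if not remaining:
            break
        selected.append(key)
        # Fields already removed by an earlier record set are simply skipped.
        remaining -= fields_by_key[key]
    return selected, remaining


# Hypothetical target fields per primary key value.
fields_by_key = {
    "k1": {"f1", "f2", "f3"},
    "k2": {"f1", "f2"},
    "k3": {"f4"},
    "k4": {"f1"},
}
selected, remaining = select_by_sorted_order(fields_by_key, {"f1", "f2", "f3", "f4"})
```

Following the description literally, every record set up to the one that empties the remaining fields is kept, including a set such as "k2" that adds no new field. Skipping such sets would give a smaller result but would be a variant of the described procedure.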
In an alternative embodiment, determining the target set from each record set may include steps S501 to S503, as shown in fig. 5, where:
in step S501, a first complement of each record set with respect to the non-empty field set is calculated according to the target fields contained in each record set, and the record set corresponding to the target first complement, that is, the first complement containing the fewest elements, is used as a candidate set. Specifically, a complement operation between the target fields contained in each record set and the non-empty field set yields the first complement corresponding to each record set. The first complement indicates how well a record set covers the non-empty fields: the fewer elements in the complement, the greater the coverage. The first complement with the fewest elements is taken as the target first complement, and the record set corresponding to it is taken as the candidate set.
In step S502, a second complement of each record set with respect to the target first complement is calculated, and the record set corresponding to the target second complement, that is, the second complement containing the fewest elements, is merged into the candidate set. Specifically, the complement operation is performed again between the candidate set and each record set to obtain the second complement corresponding to each record set, the second complement with the fewest elements is taken as the target second complement, and the elements of the record set corresponding to the target second complement are merged into the candidate set; that is, the candidate set now includes both the record set corresponding to the target first complement and the record set corresponding to the target second complement. The record set corresponding to the target second complement may then be deleted.
In step S503, if the target fields contained in the candidate set are equal to the non-empty field set, the candidate set is determined as the target set. Specifically, after step S501 it may be determined whether the target fields of the candidate set are equal to the non-empty field set; if so, the candidate set is the target set; if not, step S502 is executed. After the record set corresponding to the target second complement is merged into the candidate set, the candidate set may be checked again: if it is equal to the non-empty field set, the candidate set is the target set; if not, third complements of the record sets with respect to the target second complement may be calculated, the target third complement with the fewest elements is determined among them, the elements of the record set corresponding to the target third complement are merged into the candidate set, and that record set is deleted. It should be understood that, in this embodiment, the candidate set may be checked each time a record set is merged into it. If the candidate set is equal to the non-empty field set, it is determined as the target set and no further complement needs to be calculated; if not, the complements of the record sets are calculated again and the record set whose complement contains the fewest elements is merged into the candidate set, and this is repeated until the candidate set is equal to the non-empty field set. The data records in the finally obtained target set contain field values for all the non-empty fields, so the target set can meet the requirements of data quality inspection, data verification, and the like.
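The iteration of steps S501 to S503 can be read as a greedy covering procedure, sketched below. This is one possible reading under stated assumptions rather than the implementation of the disclosure: the function name select_by_complement and the sample data are hypothetical, and the early exit for record sets that add no coverage is an added safeguard that the description does not mention.

```python
def select_by_complement(fields_by_key, non_empty_fields):
    """Repeatedly merge the record set whose complement against the
    still-uncovered non-empty fields has the fewest elements."""
    uncovered = set(non_empty_fields)
    available = dict(fields_by_key)  # merged record sets are deleted from here
    candidate = []
    while uncovered and available:
        # Complement of a record set: uncovered fields it does not contain.
        best = min(available, key=lambda k: len(uncovered - available[k]))
        if not (uncovered & available[best]):
            break  # safeguard (assumption): no remaining set adds coverage
        candidate.append(best)
        uncovered -= available.pop(best)
    return candidate, uncovered


# Hypothetical target fields per primary key value.
fields_by_key = {
    "k1": {"f1", "f2"},
    "k2": {"f1", "f2", "f3"},
    "k3": {"f4", "f5"},
    "k4": {"f3"},
}
candidate, uncovered = select_by_complement(
    fields_by_key, {"f1", "f2", "f3", "f4", "f5"}
)
```

In this data the first round picks "k2", whose complement has two elements, and the second round picks "k3", whose complement against the remaining fields is empty, so the loop stops with every non-empty field covered.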
For example, as shown in fig. 6, assume that the database includes table 1, table 2, and table 3. Whether each field in the tables is non-empty may be determined in advance: if a field value exists in the column corresponding to a field, the field is a non-empty field and is marked as such. The non-empty fields of each table may then be determined from these marks. As shown in fig. 6, the non-empty fields in table 1 are "field 1, field 2, and field 3"; every field in table 2 has a value, that is, every field in table 2 is a non-empty field; and every data record in the column corresponding to "field 3" in table 3 is empty, so "field 3" is not a non-empty field of table 3. Combining the fields of all tables in the database, the non-empty field set of the database may be { field 1, field 2, field 3, field 4, field 5, field 6 }. According to the primary keys of the tables, the data records corresponding to each primary key value can be determined: the target fields contained in the data record in table 1 corresponding to "id 1" are "field 1 and field 2", the target field contained in the data record in table 2 corresponding to "id 1" is "field 1", and there is no data record identified by "id 1" in table 3. The target fields contained in the record set corresponding to "id 1" are therefore "field 1, field 2". Similarly, the target fields contained in the record set corresponding to "id 2" are "field 1, field 2, and field 3"; the target fields contained in the record set corresponding to "id 3" are "field 1, field 4, and field 5"; and the target fields contained in the record set corresponding to "id 4" are "field 1, field 2, field 5, and field 6". According to the number of target fields contained in each record set, the record sets corresponding to "id 2 and id 4" can be used as target sets, and then the record sets corresponding to "id 1 and id 3" can be used as target sets.
Therefore, the extracted data of the database may be a record set corresponding to id1, id2, id3 and id4 respectively, that is, data records with primary key values of id1, id2, id3 and id4 in table 1, table 2 and table 3. Alternatively, the extracted data may be a set of primary key values { id1, id2, id3, id4 }.
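How a receiver might use a primary key value set to pull the corresponding records from one table can be sketched as follows. The sketch uses Python's standard sqlite3 module purely for illustration; the disclosure does not name a database engine, and the function name fetch_by_keys, the table and column names, and the sample rows are hypothetical.

```python
import sqlite3


def fetch_by_keys(conn, table, key_column, keys):
    """Return the rows of `table` whose primary key is in `keys`."""
    keys = list(keys)
    if not keys:
        return []
    placeholders = ", ".join("?" for _ in keys)
    # Table and column names are interpolated, so they must come from trusted
    # configuration; the key values themselves are bound as parameters.
    sql = (
        f'SELECT * FROM "{table}" WHERE "{key_column}" IN ({placeholders}) '
        f'ORDER BY "{key_column}"'
    )
    return conn.execute(sql, keys).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (id TEXT PRIMARY KEY, field1 TEXT)")
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?)",
    [("id1", "a"), ("id2", "b"), ("id3", "c")],
)
rows = fetch_by_keys(conn, "table1", "id", ["id1", "id3"])
```

Only the key values travel between sender and receiver in this arrangement, and the same call can be repeated for each table that shares the primary key.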
With this embodiment, the minimum data set covering all the non-empty fields can be found, which avoids the excessive computation required when data is extracted by exhaustive search; in big data scenarios, the amount of computation is reduced and data processing efficiency is improved.
Further, in the present exemplary embodiment, a data extraction device is also provided, which is configured to execute the data extraction method of the present disclosure. The device can be applied to a server or terminal equipment.
Referring to fig. 7, the data extracting apparatus 700 may include: a non-empty field obtaining module 710, a table data determining module 720, and an extracted data determining module 730, wherein:
a non-empty field obtaining module 710, configured to obtain a set of non-empty fields of a database and a plurality of data records in each data table in the database.
A table data determining module 720, configured to determine a target field included in each data record according to a non-null field value included in each data record.
An extracted data determining module 730, configured to determine a target data record from the multiple data records according to the target field and the non-empty field set included in each data record, and determine the target data record as the data to be extracted from the database.
In an exemplary embodiment of the present disclosure, the apparatus further includes a non-empty determining module, configured to determine whether a field value corresponding to each field in each data table in the database is empty, and determine a field whose corresponding field value is not empty as a non-empty field in the data table, so as to obtain the non-empty field set.
In an exemplary embodiment of the disclosure, the non-null determining module may be specifically configured to determine whether a field value corresponding to each of the fields is null according to a field type of each of the fields.
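A type-dependent emptiness check of the kind the non-empty determining module performs might look like the sketch below. The disclosure does not specify the rules for each field type, so the rules here are assumptions: a missing value is empty for every type, a string is also empty when it contains only whitespace, and a numeric zero counts as a value. The type labels, function names, and sample rows are hypothetical.

```python
def is_empty(value, field_type):
    """Decide whether a field value is empty, depending on the field type."""
    if value is None:
        return True
    if field_type == "string":
        # Assumption: blank or whitespace-only strings count as empty.
        return value.strip() == ""
    # Assumption: any other present value, including 0, is non-empty.
    return False


def non_empty_fields(rows, schema):
    """Fields that hold a non-empty value in at least one row of a table."""
    return {
        name
        for name, field_type in schema.items()
        if any(not is_empty(row.get(name), field_type) for row in rows)
    }


# Hypothetical table: "note" never holds a usable value.
schema = {"name": "string", "age": "number", "note": "string"}
rows = [
    {"name": "Li", "age": None, "note": "  "},
    {"name": "", "age": 0, "note": None},
]
fields = non_empty_fields(rows, schema)
```

With these rows, "name" and "age" are non-empty fields and "note" is not; taking the union of such per-table results would give the non-empty field set of the database.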
In an exemplary embodiment of the present disclosure, the extracted data determining module 730 includes a classifying unit and a set determining unit, wherein:
and the classifying unit is used for classifying the plurality of data records according to the primary key values contained in the plurality of data records and determining the record sets corresponding to the primary key values respectively.
And the set determining unit is used for determining the target set from each record set according to the non-empty field set and the target field contained in each record set, and taking the data contained in the target set as the target data record.
In an exemplary embodiment of the present disclosure, the set determining unit may specifically include a sorting unit and a set selecting unit, where:
and the sorting unit is used for determining the number of target fields contained in each record set and sorting each record set according to the number from large to small.
And the set selection unit is used for determining a plurality of target sets according to the sorting order so that all target fields contained in the target sets are the same as the non-empty fields in the non-empty field set.
In an exemplary embodiment of the present disclosure, the set determining unit may specifically include a complement calculating unit, a set merging unit, and a set judging unit, where:
and the complementary set calculating unit is used for respectively calculating first complementary sets of the record sets and the non-empty field sets according to target fields contained in the record sets, and taking the record set corresponding to the target first complementary set with the least contained elements as a candidate set.
And the set merging unit is used for calculating the target first complementary set and second complementary sets of all the record sets, and merging the record sets corresponding to the target second complementary set with the least elements into the candidate set.
A set judgment unit, configured to determine the candidate set as the target set if the target field included in the candidate set is equal to the non-empty field set.
In an exemplary embodiment of the present disclosure, the apparatus further includes a primary key determining module and a set sending module, wherein:
and the primary key determining module is used for determining the primary key values corresponding to the target data records respectively so as to acquire a primary key value set.
And the set sending module is used for receiving an extraction request and sending the primary key value set to a sending end of the extraction request so that the sending end can extract the database through the primary key value set.
For details not disclosed in the apparatus embodiments of the present disclosure, please refer to the embodiments of the data extraction method of the present disclosure.
Referring to fig. 8, fig. 8 is a schematic diagram illustrating a system architecture of an exemplary application environment to which a data extraction method and a data extraction device according to an embodiment of the present disclosure may be applied.
As shown in fig. 8, the system architecture 800 may include one or more of terminal devices 801, 802, 803, a network 804, and a server 805. The network 804 serves as the medium providing communication links between the terminal devices 801, 802, 803 and the server 805. The network 804 may include various types of connections, such as wired communication links, wireless communication links, or fiber optic cables.
The terminal devices 801, 802, 803 may be various electronic devices having a display screen including, but not limited to, desktop computers, portable computers, smart phones, tablet computers, and the like. It should be understood that the number of terminal devices, networks, and servers in fig. 8 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation. For example, server 805 may be a server cluster comprised of multiple servers, or the like.
The data extraction method provided by the embodiment of the present disclosure is generally executed by the server 805, and accordingly, the data extraction device is generally disposed in the server 805. However, it is easily understood by those skilled in the art that the data extraction method provided in the embodiment of the present disclosure may also be executed by the terminal devices 801, 802, and 803, and accordingly, the data extraction apparatus may also be disposed in the terminal devices 801, 802, and 803, which is not particularly limited in this exemplary embodiment.
For example, in an exemplary embodiment, the server 805 may receive an extraction request from the terminal device 801, obtain the non-empty field set of the database and the data records in each data table, and determine the target fields contained in each data record according to the field values it contains. The server 805 may then select the target data records from the plurality of data records according to the non-empty field set and the target fields contained in each data record, and send the target data records to the terminal device 801 as the extracted data, so that the terminal device 801 can operate on the database according to the extracted data, for example to verify database data mapping relationships or to perform data quality inspection.
FIG. 9 illustrates a schematic structural diagram of a computer system suitable for use in implementing the electronic device of an embodiment of the present disclosure.
It should be noted that the computer system 900 of the electronic device shown in fig. 9 is only an example, and should not bring any limitation to the functions and the scope of the application of the embodiments of the present disclosure.
As shown in fig. 9, the computer system 900 includes a Central Processing Unit (CPU)901 that can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)902 or a program loaded from a storage section 908 into a Random Access Memory (RAM) 903. In the RAM 903, various programs and data necessary for system operation are also stored. The CPU 901, ROM 902, and RAM 903 are connected to each other via a bus 904. An input/output (I/O) interface 905 is also connected to bus 904.
The following components are connected to the I/O interface 905: an input section 906 including a keyboard, a mouse, and the like; an output section 907 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), a speaker, and the like; a storage section 908 including a hard disk and the like; and a communication section 909 including a network interface card such as a LAN card, a modem, or the like. The communication section 909 performs communication processing via a network such as the Internet. A drive 910 is also connected to the I/O interface 905 as necessary. A removable medium 911 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 910 as necessary, so that a computer program read out therefrom is installed into the storage section 908 as necessary.
In particular, according to embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 909, and/or installed from the removable medium 911. When executed by the Central Processing Unit (CPU) 901, the computer program performs the various functions defined in the method and apparatus of the present application.
It should be noted that the computer readable media shown in the present disclosure may be computer readable signal media or computer readable storage media or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer-readable signal medium may include a propagated data signal with computer-readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented by software or by hardware, and the described units may also be disposed in a processor. The names of the units do not, in some cases, constitute a limitation on the units themselves.
As another aspect, the present application also provides a computer-readable medium, which may be contained in the electronic device described in the above embodiments, or may exist separately without being assembled into the electronic device. The computer readable medium carries one or more programs which, when executed by an electronic device, cause the electronic device to implement the method described in the above embodiments. For example, the electronic device may implement the steps shown in fig. 1 and 2, and so on.
It should be noted that although several modules or units of the device for performing actions are mentioned in the above detailed description, such a division is not mandatory. Indeed, according to embodiments of the present disclosure, the features and functions of two or more modules or units described above may be embodied in one module or unit. Conversely, the features and functions of one module or unit described above may be further divided and embodied by a plurality of modules or units.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A data extraction method, comprising:
acquiring a non-empty field set of a database and a plurality of data records in each data table in the database;
determining a target field contained in each data record according to a non-null field value contained in each data record;
and determining a target data record from the plurality of data records according to the target field and the non-empty field set contained in each data record, and determining the target data record as the data to be extracted of the database.
2. The method of claim 1, wherein prior to obtaining the non-empty set of fields of the database, further comprising:
and judging whether the field value corresponding to each field in each data table in the database is empty or not, and determining the field with the corresponding field value not being empty as a non-empty field in the data table so as to obtain the non-empty field set.
3. The method of claim 2, wherein determining whether field values corresponding to fields in data tables in the database are empty comprises:
and judging whether the field value corresponding to each field is empty or not according to the field type of each field.
4. The method of claim 1, wherein determining a target data record from the plurality of data records based on the target field and the set of non-empty fields contained in each of the data records comprises:
classifying the plurality of data records according to primary key values contained in the plurality of data records, and determining a record set corresponding to each primary key value;
and determining the target set from each record set according to the non-empty field set and the target field contained in each record set, and taking the data contained in the target set as the target data record.
5. The method of claim 4, wherein determining the target set from each of the record sets according to the non-empty field sets and the target fields contained in each of the record sets comprises:
determining the number of target fields contained in each record set, and sequencing each record set according to the number from large to small;
and determining a plurality of target sets according to the sorted order, so that all target fields contained in the plurality of target sets are the same as the non-empty fields in the non-empty field set.
6. The method of claim 4, wherein determining the target set from each of the record sets according to the non-empty field sets and the target fields contained in each of the record sets comprises:
according to target fields contained in each record set, respectively calculating a first complementary set of each record set and the non-empty field set, and taking a record set corresponding to the target first complementary set with the least contained elements as a candidate set;
and calculating the target first complementary set and a second complementary set of each record set, merging the record sets corresponding to the target second complementary set containing the fewest elements into the candidate set, and determining the candidate set as the target set until the target fields contained in the candidate set are equal to the non-empty field set.
7. The method of claim 1, wherein after determining the target data record as data to be extracted from the database, the method further comprises:
determining primary key values respectively corresponding to the target data records to obtain a primary key value set;
and receiving an extraction request, and sending the primary key value set to a sending end of the extraction request so that the sending end extracts the database through the primary key value.
8. A data extraction apparatus, comprising:
the non-empty field acquisition module is used for acquiring a non-empty field set of a database and a plurality of data records in each data table in the database;
the table data determining module is used for determining a target field contained in each data record according to a non-null field value contained in each data record;
and the extracted data determining module is used for determining a target data record from the plurality of data records according to the target field and the non-empty field set contained in each data record, and determining the target data record as the data to be extracted of the database.
9. A computer-readable medium, on which a computer program is stored, which, when being executed by a processor, carries out the method of any one of claims 1 to 7.
10. An electronic device, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the method of any of claims 1-7 via execution of the executable instructions.
CN201911342183.4A 2019-12-23 2019-12-23 Data extraction method, data extraction device, computer readable medium and electronic equipment Active CN113094415B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911342183.4A CN113094415B (en) 2019-12-23 2019-12-23 Data extraction method, data extraction device, computer readable medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN113094415A true CN113094415A (en) 2021-07-09
CN113094415B CN113094415B (en) 2024-03-29

Family

ID=76663930

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911342183.4A Active CN113094415B (en) 2019-12-23 2019-12-23 Data extraction method, data extraction device, computer readable medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN113094415B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115686939A (en) * 2022-10-27 2023-02-03 湖南长银五八消费金融股份有限公司 Data backup method and device, computer equipment and storage medium

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2003323450A (en) * 2002-04-26 2003-11-14 Yamato Hiroshi Database retrieving device and method, computer program and computer-readable recording medium
US20040260711A1 (en) * 2003-06-21 2004-12-23 International Business Machines Corporation Profiling data in a data store
CN1834891A (en) * 2004-12-17 2006-09-20 佳能株式会社 Information processor, information processing method, and control program
CN104408150A (en) * 2014-12-03 2015-03-11 天津南大通用数据技术股份有限公司 Data import/ export method and device adapted to a plurality of data formats of databases
CN105930462A (en) * 2016-04-21 2016-09-07 成都数联铭品科技有限公司 Cloud computing platform based massive data processing method
CN107229721A (en) * 2017-06-02 2017-10-03 泰华智慧产业集团股份有限公司 A kind of method and device for changing data pick-up
CN107924417A (en) * 2015-08-26 2018-04-17 片山成仁 Data bank management device and its method
CN109271435A (en) * 2018-09-14 2019-01-25 南威软件股份有限公司 A kind of data pick-up method and system for supporting breakpoint transmission
US10192031B1 (en) * 2006-11-03 2019-01-29 Vidistar, Llc System for extracting information from DICOM structured reports
CN109491989A (en) * 2018-11-12 2019-03-19 北京懿医云科技有限公司 Data processing method and device, electronic equipment, storage medium
CN109977110A (en) * 2019-04-28 2019-07-05 杭州数梦工场科技有限公司 Data cleaning method, device and equipment
CN110502515A (en) * 2019-08-15 2019-11-26 中国平安财产保险股份有限公司 Collecting method, device, equipment and computer readable storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
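刘如九 et al.: "A general method for data extraction between multiple databases and its application", Journal of Beijing Jiaotong University, no. 4, pages 14-18 *
邹晓顺: "Merging bibliographic databases based on heterogeneous classification systems", Journal of Wuhan University of Science and Technology (Social Science Edition), no. 4, pages 94-97 *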

Also Published As

Publication number Publication date
CN113094415B (en) 2024-03-29

Similar Documents

Publication Publication Date Title
US11526799B2 (en) Identification and application of hyperparameters for machine learning
EP3563243B1 (en) Determining application test results using screenshot metadata
CN107729935B (en) Method and device for recognizing similar pictures, server, storage medium
CN108933695B (en) Method and apparatus for processing information
CN111694866A (en) Data searching and storing method, data searching system, data searching device, data searching equipment and data searching medium
CN108460161B (en) Hierarchical sampling method and device and computer equipment
CN111680799B (en) Method and device for processing model parameters
CN114281663A (en) Test processing method, test processing device, electronic equipment and storage medium
CN114327493A (en) Data processing method and device, electronic equipment and computer readable medium
CN113094415B (en) Data extraction method, data extraction device, computer readable medium and electronic equipment
CN117093619A (en) Rule engine processing method and device, electronic equipment and storage medium
CN111125311A (en) Method and device for checking information normalization processing, storage medium and electronic equipment
CN116204428A (en) Test case generation method and device
CN111488386A (en) Data query method and device
CN110909288B (en) Service data processing method, device, platform, service end, system and medium
CN111737371B (en) Data flow detection classification method and device capable of dynamically predicting
CN113053531B (en) Medical data processing method, medical data processing device, computer readable storage medium and equipment
CN111079185B (en) Database information processing method and device, storage medium and electronic equipment
CN110321435B (en) Data source dividing method, device, equipment and storage medium
CN110688295A (en) Data testing method and device
CN110134691B (en) Data verification method, device, equipment and medium
CN112600756B (en) Service data processing method and device
CN109840196B (en) Method and device for testing business logic
US20230214394A1 (en) Data search method and apparatus, electronic device and storage medium
CN117807056A (en) Data auditing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant