CN113094415B - Data extraction method, data extraction device, computer readable medium and electronic equipment - Google Patents

Data extraction method, data extraction device, computer readable medium and electronic equipment

Info

Publication number
CN113094415B
Authority
CN
China
Prior art keywords
target
data
field
record
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911342183.4A
Other languages
Chinese (zh)
Other versions
CN113094415A (en
Inventor
陈雪松
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Yiyiyun Technology Co ltd
Original Assignee
Beijing Yiyiyun Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Yiyiyun Technology Co ltd filed Critical Beijing Yiyiyun Technology Co ltd
Priority to CN201911342183.4A priority Critical patent/CN113094415B/en
Publication of CN113094415A publication Critical patent/CN113094415A/en
Application granted granted Critical
Publication of CN113094415B publication Critical patent/CN113094415B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 - Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 - Integrating or interfacing systems involving database management systems
    • G06F16/258 - Data format conversion from or to a database

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present disclosure provides a data extraction method, a data extraction device, a computer readable medium, and an electronic apparatus, and relates to the technical field of data processing. The data extraction method comprises the following steps: acquiring a non-empty field set of a database, and a plurality of data records in each data table in the database; determining the target fields contained in each data record according to the non-empty field values contained in that data record; and determining target data records from the plurality of data records according to the target fields contained in each data record and the non-empty field set, and determining the target data records as the data to be extracted from the database. The data extraction method can still ensure the consistency of the samples when fewer samples are extracted, thereby saving computing resources and improving the effectiveness of extraction.

Description

Data extraction method, data extraction device, computer readable medium and electronic equipment
Technical Field
The present disclosure relates to the field of data processing technologies, and in particular, to a data extraction method, a data extraction device, a computer readable medium, and an electronic apparatus.
Background
Databases are important tools for storing and managing data, and virtually every internet technology depends on database support. When data from different sources and with different structures is used, it must first be structured into a standard data model with a unified structure, so that service data can then be processed using a database.
When a piece of raw data is converted to the standard data model, it typically does not involve every type of standard data result; the converted result is only a subset of the standard data. For example, through mapping and association, the raw data may fill the relevant fields of several service tables of the standard data model, while other service tables, or irrelevant fields within those tables, are left unfilled. However, service personnel do not know how data is distributed in the service tables. When quality inspection and verification of the data are required, it is unclear which fields in a table actually hold values, so targeted extraction is not possible; all of the data must be processed, which is inefficient.
It should be noted that the information disclosed in the above background section is only for enhancing understanding of the background of the present disclosure and thus may include information that does not constitute prior art known to those of ordinary skill in the art.
Disclosure of Invention
The disclosure aims to provide a data extraction method, a data extraction device, a computer readable medium and an electronic device, which can overcome, to a certain extent, the problem of low extraction efficiency caused by a large data extraction scale, thereby improving the efficiency of data extraction and processing.
Other features and advantages of the present disclosure will be apparent from the following detailed description, or may be learned in part by the practice of the disclosure.
According to a first aspect of the present disclosure, there is provided a data extraction method, including:
obtaining a non-empty field set of a database, and a plurality of data records in each data table in the database;
determining a target field contained in each data record according to the non-empty field value contained in each data record;
and determining a target data record from the plurality of data records according to the target field and the non-empty field set contained in each data record, and determining the target data record as data to be extracted of the database.
In an exemplary embodiment of the present disclosure, before acquiring the non-empty field set of the database, the method further includes:
judging whether field values corresponding to fields in each data table in the database are empty or not, and determining the fields with the corresponding field values not being empty as non-empty fields in the data table so as to obtain the non-empty field set.
In an exemplary embodiment of the disclosure, the determining whether the field value corresponding to each field in each data table in the database is null includes:
judging whether the field value corresponding to each field is null or not according to the field type of each field.
In an exemplary embodiment of the disclosure, the determining a target data record from the plurality of data records according to the target field and the set of non-empty fields included in each of the data records includes:
classifying the plurality of data records according to the primary key values contained in the plurality of data records, and determining record sets corresponding to the primary key values respectively;
and determining the target set from each record set according to the non-empty field set and the target field contained in each record set, and taking the data contained in the target set as the target data record.
In an exemplary embodiment of the disclosure, the determining the target set from each record set according to the non-null field set and the target field contained in each record set includes:
determining the number of target fields contained in each record set, and sorting the record sets according to the number from large to small;
and determining a plurality of target sets according to the sorted order, such that the target fields contained in the target sets together are identical to the non-null fields in the non-null field set.
In an exemplary embodiment of the disclosure, the determining the target set from each record set according to the non-null field set and the target field contained in each record set includes:
according to the target fields contained in each record set, respectively calculating a first complement of each record set with respect to the non-empty field set, and taking the record set corresponding to the target first complement, i.e. the first complement containing the fewest elements, as a candidate set;
calculating a second complement of each record set with respect to the target first complement, and merging the record set corresponding to the target second complement, i.e. the second complement containing the fewest elements, into the candidate set, repeating until the target fields contained in the candidate set are equal to the non-empty field set;
and if the target fields contained in the candidate set are equal to the non-empty field set, determining the candidate set as the target set.
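The complement-based selection above resembles a greedy set cover. The following is a minimal Python sketch of that reading, assuming each record set has already been reduced to the set of target fields it contains. The function name `greedy_target_sets`, the record-set keys and the field names are hypothetical and are not part of the disclosure.

```python
def greedy_target_sets(record_set_fields, non_empty_fields):
    """Greedily pick record sets until their fields cover non_empty_fields.

    record_set_fields: dict mapping a record-set key to its set of target fields.
    Returns (chosen keys in pick order, fields that could not be covered).
    """
    remaining = set(non_empty_fields)   # the current complement
    unused = dict(record_set_fields)    # record sets not yet merged
    chosen = []
    while remaining and unused:
        # The best record set leaves the smallest complement behind.
        best = min(unused, key=lambda k: len(remaining - unused[k]))
        if not (unused[best] & remaining):
            break                       # nothing left can shrink the complement
        chosen.append(best)
        remaining -= unused.pop(best)
    return chosen, remaining


# Illustrative data only.
record_set_fields = {
    "r1": {"a", "b", "c"},
    "r2": {"c", "d"},
    "r3": {"d", "e"},
    "r4": {"a"},
}
selected, uncovered = greedy_target_sets(record_set_fields, {"a", "b", "c", "d", "e"})
```

In this example `r1` is merged first because it leaves only two fields uncovered, and `r3` then closes the gap, so two record sets suffice for five non-empty fields.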
In an exemplary embodiment of the present disclosure, after determining the target field corresponding to the target data record as data to be extracted of the database, the method further includes:
determining a primary key value corresponding to each target data record respectively so as to obtain a primary key value set;
and if the extraction request is received, the primary key value set is sent to a sending end of the extraction request, so that the sending end extracts the database through the primary key value set.
According to a second aspect of the present disclosure, there is provided a data extraction apparatus including a non-empty field acquisition module, a table data determination module, and an extraction data determination module, wherein:
the non-empty field acquisition module is used for acquiring a non-empty field set of the database and a plurality of data records in each data table in the database;
a table data determining module, configured to determine a target field included in each data record according to a non-null field value included in each data record;
and the extraction data determining module is used for determining a target data record from the plurality of data records according to the target field and the non-empty field set contained in each data record, and determining the target data record as data to be extracted of the database.
In an exemplary embodiment of the present disclosure, the apparatus further includes a non-null judging module, configured to judge whether field values corresponding to fields in each data table in the database are null, and determine a field whose corresponding field value is not null as a non-null field in the data table, so as to obtain the non-null field set.
In an exemplary embodiment of the present disclosure, the non-null determining module may be specifically configured to determine, according to a field type of each of the fields, whether a field value corresponding to each of the fields is null.
In one exemplary embodiment of the present disclosure, the extraction data determination module includes a classification unit and a set determination unit, wherein:
the classification unit is used for classifying the plurality of data records according to the primary key values contained in the plurality of data records, and determining the record set corresponding to each primary key value; and
the set determining unit is used for determining the target set from the record sets according to the non-empty field set and the target fields contained in the record sets, and taking the data contained in the target set as the target data record.
In an exemplary embodiment of the present disclosure, the set determining unit may specifically include a sorting unit and a set selection unit, wherein:
the sorting unit is used for determining the number of target fields contained in each record set, and sorting the record sets from large to small according to the number; and
the set selection unit is used for determining a plurality of target sets according to the sorted order, so that the target fields contained in the target sets together are identical to the non-null fields in the non-null field set.
In an exemplary embodiment of the present disclosure, the set determining unit may specifically include a complement calculating unit, a set merging unit, and a set judging unit, wherein:
the complement calculating unit is used for respectively calculating a first complement of each record set with respect to the non-empty field set according to the target fields contained in the record sets, and taking the record set corresponding to the target first complement containing the fewest elements as a candidate set;
the set merging unit is used for calculating a second complement of each record set with respect to the target first complement, and merging the record set corresponding to the target second complement containing the fewest elements into the candidate set; and
the set judging unit is used for determining the candidate set as the target set if the target fields contained in the candidate set are equal to the non-null field set.
In an exemplary embodiment of the present disclosure, the apparatus further includes a primary key determining module and a set sending module, wherein:
the primary key determining module is used for determining the primary key value corresponding to each target data record, so as to acquire a primary key value set; and
the set sending module is used for receiving the extraction request and sending the primary key value set to the sending end of the extraction request, so that the sending end extracts data from the database through the primary key value set.
According to a third aspect of the present disclosure, there is provided an electronic device comprising: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform the method of any of the above via execution of the executable instructions.
According to a fourth aspect of the present disclosure, there is provided a computer readable medium having stored thereon a computer program which, when executed by a processor, implements the method of any of the above.
Exemplary embodiments of the present disclosure may have some or all of the following advantages:
in the data extraction method provided by the exemplary embodiment of the present disclosure, the target data record is determined from the data table through the non-empty field set of the database, so as to obtain the extracted data, which can ensure that the extracted data is non-empty, and can improve the effectiveness of extraction. And, the coverage degree of the target data record to the fields in the database can be maximized through the field values contained in the data record, so that the extraction effect can be improved. In addition, the target data record required to be extracted is far less than the data volume required by random extraction, so that the number of samples can be reduced, thereby saving the computing resources, reducing the time cost and improving the extraction efficiency; meanwhile, the processing efficiency of the extracted data is improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure. It will be apparent to those of ordinary skill in the art that the drawings in the following description are merely examples of the disclosure and that other drawings may be derived from them without undue effort.
FIG. 1 schematically illustrates a flow chart of a data extraction method according to one embodiment of the present disclosure;
FIG. 2 schematically illustrates a flow chart of a data extraction method according to another embodiment of the present disclosure;
FIG. 3 schematically illustrates a flow chart of a data extraction method according to another embodiment of the present disclosure;
FIG. 4 schematically illustrates a flow chart of a data extraction method according to another embodiment of the present disclosure;
FIG. 5 schematically illustrates a flow chart of a data extraction method according to another embodiment of the disclosure;
FIG. 6 schematically illustrates a data table according to one embodiment of the present disclosure;
FIG. 7 schematically illustrates a block diagram of a data extraction apparatus according to one embodiment of the disclosure;
FIG. 8 schematically illustrates a system architecture diagram for implementing a data extraction method according to one embodiment of the present disclosure;
FIG. 9 shows a schematic diagram of a computer system suitable for use in implementing embodiments of the present disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments may be embodied in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the present disclosure. One skilled in the relevant art will recognize, however, that the aspects of the disclosure may be practiced without one or more of the specific details, or with other methods, components, devices, steps, etc. In other instances, well-known technical solutions have not been shown or described in detail to avoid obscuring aspects of the present disclosure.
Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus a repetitive description thereof will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in software or in one or more hardware modules or integrated circuits or in different networks and/or processor devices and/or microcontroller devices.
The following describes the technical scheme of the embodiments of the present disclosure in detail:
In one solution provided by the inventor, when extracting data from the database, all tables may be traversed and, for every field in each table, one data record containing a value for that field is extracted; exhausting each table in this way yields the final extraction result. However, the number of extracted samples is then equivalent to the total number of fields in the tables, and the subsequent verification of the extracted data consumes a large amount of resources, which increases cost.
In view of one or more of the above problems, the present exemplary embodiment provides a data extraction method. Referring to fig. 1, the data extraction method may include the steps of:
Step S110: obtain a non-empty field set of a database, along with a plurality of data records in the respective data tables of the database.
Step S120: determine the target fields contained in each data record according to the non-null field values contained in that data record.
Step S130: determine target data records from the plurality of data records according to the target fields contained in each data record and the non-empty field set, and determine the target data records as the data to be extracted from the database.
In the data extraction method provided by the exemplary embodiment of the present disclosure, the target data record is determined from the data table through the non-empty field set of the database, so as to obtain the extracted data, which can ensure that the extracted data is non-empty and can improve the effectiveness of extraction. Moreover, the coverage of the fields in the database by the target data records can be maximized through the field values contained in the data records, so that the extraction effect can be improved. In addition, the target data records required to be extracted are far fewer than the data volume required by random extraction, so the number of samples can be reduced, thereby saving computing resources, reducing the time cost and improving the extraction efficiency; meanwhile, the processing efficiency of the extracted data is improved.
Next, the above steps of the present exemplary embodiment will be described in more detail.
In step S110, a non-empty set of fields of a database is obtained, along with a plurality of data records in respective data tables in the database.
A non-empty field is a field that has a value: a field is non-empty as long as at least one value exists in the column corresponding to it. The non-empty fields of a table may include its primary key as well as other fields, and one table may have multiple non-empty fields; this embodiment imposes no particular limitation. A data record is one row of data in a table. Each data record may include a field value for each field in the table and, because primary keys are unique, the value of the primary key uniquely identifies one data record. All data records in a data table can be obtained by querying that table.
When data is filled into the data model, each field that receives a value can be recorded, which yields the non-empty fields of the database. For example, if the name field is assigned the value "Zhang Sany", the field is non-empty and is marked as such; the non-empty fields of each data table are then obtained from the marks, giving the non-empty field set of the database. Alternatively, before the non-empty fields of the data tables are acquired, all data tables may be examined together to judge whether the values corresponding to each field are empty, and the fields whose values are not all empty are determined to be non-empty fields. Specifically, all fields in the database are determined, and for each field it is checked in a data table whether the field has a value. If all data records of that table have been traversed and the field's values are all NULL, the next data table is traversed, and so on until every data table has been traversed. If the field has no value other than NULL anywhere, the field is empty; if the field has any value that is not NULL, for example the field "expense" has the value "123", the field is non-empty.
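The traversal just described can be sketched in Python. This is an illustrative sketch only: it assumes the database is modelled as an in-memory dictionary mapping table names to lists of row dictionaries, with `None` standing in for NULL. The function name `non_empty_field_set` and the tables `t1` and `t2` are hypothetical.

```python
def non_empty_field_set(database):
    """Collect (table, field) pairs that hold at least one non-NULL value."""
    non_empty = set()
    for table_name, rows in database.items():
        for row in rows:
            for field, value in row.items():
                if value is not None:      # one value is enough to mark the field
                    non_empty.add((table_name, field))
    return non_empty


# Hypothetical sample data; None plays the role of NULL.
database = {
    "t1": [
        {"ID": "Id1", "expense": None, "remark": None},
        {"ID": "Id2", "expense": "123", "remark": None},
    ],
    "t2": [
        {"ID": "Id1", "name": "Zhang Sany"},
    ],
}
fields = non_empty_field_set(database)
```

Here `remark` never receives a value in any row, so it is the only field left out of the resulting set.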
Fields may have various field types, for example integer or array, and fields of different types hold different kinds of values, so whether a field's value is empty can be judged according to the field type. Specifically, if the field type is a simple type such as integer or Boolean, it is only necessary to check whether the field holds a value; if it does, the field is non-empty. If the field type is a composite type such as an array or a dictionary, the elements contained in each value of the field must be examined. For example, when the field type is a key-value dictionary, each key is treated as a sub-field of the field; when the value of a key is not empty, that sub-field is regarded as non-empty, and the field to which it belongs is then also non-empty. For an array, the elements are inspected in turn until an element with a value is encountered, at which point the field to which the array belongs is regarded as non-empty.
In addition, the judgment may be further subdivided by field type: for example, a distinct non-empty condition may be defined for each field type, and a field that satisfies the condition for its type is a non-empty field. This embodiment is not limited in this respect.
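A type-aware emptiness check along these lines might look as follows in Python. This is a hedged sketch rather than the disclosed implementation: `is_non_empty` is a hypothetical helper, Python `dict` and `list` stand in for the dictionary and array field types, and treating `0`, `False` and the empty string as values is an assumption of the sketch.

```python
from typing import Any


def is_non_empty(value: Any) -> bool:
    """Return True if the value carries at least one actual value."""
    if value is None:
        return False
    if isinstance(value, dict):
        # Each key acts as a sub-field; one non-empty sub-field is enough.
        return any(is_non_empty(v) for v in value.values())
    if isinstance(value, (list, tuple)):
        # Inspect elements in turn; any() stops at the first one with a value.
        return any(is_non_empty(v) for v in value)
    # Simple types (int, bool, str, ...): any non-NULL value counts (assumption).
    return True
```

For instance, `is_non_empty({"a": None, "b": []})` is false because neither sub-field holds a value, while `is_non_empty([None, {"k": 1}])` is true.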
A data table may define a plurality of fields, and different data tables may define different fields. After the non-empty judgment has been performed on all fields in all data tables, a plurality of non-empty fields is obtained, and each may be marked with an identifier to distinguish it from the empty fields. The non-empty fields in all tables of a database constitute the non-empty field set of that database. The non-empty field sets of different databases may differ.
In step S120, a target field included in each of the data records is determined according to a non-null field value included in each of the data records.
The target field refers to a field corresponding to a non-null field value, i.e., a field for which a value exists in the data record; in other words, the target fields are the non-empty fields of that data record. After all the data records in a data table are obtained, each field value of a data record can be checked, and every field whose value is not empty is a target field of that record. Determining the target fields contained in the data records of each data table reveals how well each data record covers the non-empty fields: the more target fields a data record contains, the more information it carries and the more valuable it is as a sample.
For example, as shown in Table 1, the table has 7 fields, where ID is the primary key and "v" indicates that a value is present. Querying Table 1 yields two data records whose primary key values are Id1 and Id2 respectively. The data record with ID = Id1 has values for field 1, field 2 and field 3, and all its other field values are empty, so its target fields are field 1, field 2 and field 3; the target fields of the data record with ID = Id2 are field 3 and field 4.
Table 1
ID  | Field 1 | Field 2 | Field 3 | Field 4 | Field 5 | Field 6
Id1 | v       | v       | v       |         |         |
Id2 |         |         | v       | v       |         |
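The Table 1 example can be reproduced with a short Python sketch. It assumes each data record is a dictionary with `None` for empty values and, following the example, leaves the primary key out of the target fields. The helper name `target_fields` is hypothetical.

```python
def target_fields(record, primary_key="ID"):
    """Return the non-key fields of one data record that hold a value."""
    return {f for f, v in record.items() if f != primary_key and v is not None}


# The two data records of Table 1; "v" marks a present value.
table_1 = [
    {"ID": "Id1", "Field 1": "v", "Field 2": "v", "Field 3": "v",
     "Field 4": None, "Field 5": None, "Field 6": None},
    {"ID": "Id2", "Field 1": None, "Field 2": None, "Field 3": "v",
     "Field 4": "v", "Field 5": None, "Field 6": None},
]
targets = {record["ID"]: target_fields(record) for record in table_1}
```

The resulting mapping gives fields 1 to 3 for Id1 and fields 3 and 4 for Id2, matching the description of Table 1.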
In step S130, a target data record is determined from the plurality of data records according to the target field and the non-null field set included in each data record, and the target data record is determined as data to be extracted of the database.
A target data record is selected from the data records according to the target fields contained in each data record and all the non-null fields in the non-null field set, and the target data record serves as the data to be extracted. The data to be extracted can be used as samples for data quality inspection and data verification, so that all the data in the database is verified through them. Compared with randomly extracted samples, the target data records contain more field values and are more consistent with the database.
For example, when determining the target fields included in each data record, the number of target fields may be recorded, so that a data record with a larger number of target fields is selected as the target data record. For example, selecting the data records with the number of the included target fields exceeding a preset value as target data records, or sorting the data records according to the number of the included target fields, selecting the first N (N is a positive integer) or the last N data records as target data records, and the like.
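Both selection rules mentioned above, a preset threshold and a top-N cut after sorting, can be sketched in Python as follows. The helper names and the sample records are hypothetical, and records are assumed to be dictionaries with `None` for empty values.

```python
def count_target_fields(record, primary_key="ID"):
    """Number of non-key fields in the record that hold a value."""
    return sum(1 for f, v in record.items() if f != primary_key and v is not None)


def select_top_n(records, n, primary_key="ID"):
    """Keep the n records with the most target fields (stable for ties)."""
    ranked = sorted(records,
                    key=lambda r: count_target_fields(r, primary_key),
                    reverse=True)
    return ranked[:n]


def select_above(records, preset, primary_key="ID"):
    """Keep records whose target-field count exceeds the preset value."""
    return [r for r in records if count_target_fields(r, primary_key) > preset]


# Illustrative records with 3, 1 and 2 target fields respectively.
records = [
    {"ID": "Id1", "f1": 1, "f2": 2, "f3": 3},
    {"ID": "Id2", "f1": None, "f2": None, "f3": 3},
    {"ID": "Id3", "f1": 1, "f2": None, "f3": 3},
]
top_two = [r["ID"] for r in select_top_n(records, 2)]
above_one = [r["ID"] for r in select_above(records, 1)]
```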
Since the primary key can uniquely identify a data record, the method may further include steps S201 to S202, as shown in fig. 2, in which:
in step S201, the primary key value corresponding to each target data record is determined, so as to obtain a primary key value set. The primary key of each data table is set by the user when the table is created; it is unique and non-null, and the primary key value is the value corresponding to the primary key. For example, the primary key of a student table may be the "number" field, and a primary key value may be "20112356". After the target data records are determined, the primary key value of each can be extracted, giving the primary key value set corresponding to all the target data records.
In step S202, an extraction request is received, and the primary key value set is sent to a sending end of the extraction request, so that the sending end extracts the database through the primary key value set. In the data model verification, the data quality inspection and other scenarios in which the database needs to be operated, the client with data extraction requirements can send an extraction request to the server, the server can send the primary key value set to the client after receiving the extraction request, and the client can acquire the target data record by using the primary key value after acquiring the primary key value set, so that the operations such as data verification and the like are completed by using the target data record.
In other embodiments of the present disclosure, the sender of the extraction request may also be one module in a terminal device and the receiver another module, or the sender may be one client and the receiver another client, and so on; this embodiment is not limited in this respect. In this embodiment, the primary key value set is orders of magnitude smaller than the database yet can be used to obtain the extracted data from the database, which greatly improves the operability of the data; the space occupied by the target data records is also greatly reduced, which saves resources.
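A client that has received the primary key value set could fetch the target data records with a single keyed query. The sketch below uses Python's standard `sqlite3` module against an in-memory database. The `student` table layout, the sample rows and the helper `fetch_by_primary_keys` are assumptions made for illustration; the disclosure does not prescribe a particular database engine or query.

```python
import sqlite3


def fetch_by_primary_keys(conn, table, key_column, key_values):
    """Fetch the rows whose primary key is in key_values, ordered by key."""
    keys = sorted(key_values)
    if not keys:
        return []
    placeholders = ", ".join("?" for _ in keys)
    # Identifiers cannot be bound as parameters; they are assumed trusted here.
    sql = (f'SELECT * FROM "{table}" WHERE "{key_column}" IN ({placeholders}) '
           f'ORDER BY "{key_column}"')
    return conn.execute(sql, keys).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (number TEXT PRIMARY KEY, name TEXT, expense INTEGER)")
conn.executemany(
    "INSERT INTO student VALUES (?, ?, ?)",
    [("20112356", "Zhang Sany", 123),
     ("20112357", None, None),
     ("20112358", None, 456)],
)

# The primary key value set sent back in response to the extraction request.
primary_key_set = {"20112356", "20112358"}
rows = fetch_by_primary_keys(conn, "student", "number", primary_key_set)
```

Only the key values travel between server and client; the full rows are read from the database when the client actually needs them.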
In an alternative embodiment, determining the target data record from the plurality of data records may include step S301 and step S302, as shown in fig. 3, wherein:
in step S301, the plurality of data records are classified according to the primary key values they contain, and the record set corresponding to each primary key value is determined. Specifically, data records identified by the same primary key value are placed in the same class, which yields a plurality of record sets identified by different primary key values; the number of record sets may equal the number of primary key values. Within one data table a primary key value identifies only one data record, but the same primary key value may also identify a data record in each of the other data tables. A record set corresponding to a primary key value may therefore contain as many data records as there are data tables: for example, if the database contains 20 data tables, each record set may include the 20 data records that the primary key value identifies in those 20 tables.
It should be understood that in this embodiment, all tables in the database may have the same primary key. The values of the primary keys may be identical in different data tables, while the values of the primary keys have uniqueness in the same data table.
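The grouping in step S301 can be sketched in Python under the stated assumption that every table shares the same primary key. The database is modelled as a dictionary of table names to row lists; `build_record_sets` and the tables `t1` and `t2` are hypothetical names.

```python
from collections import defaultdict


def build_record_sets(database, primary_key="ID"):
    """Group the rows of all tables by primary key value.

    Returns a dict mapping each primary key value to a list of
    (table_name, row) pairs, one per table that contains the key.
    """
    record_sets = defaultdict(list)
    for table_name, rows in database.items():
        for row in rows:
            record_sets[row[primary_key]].append((table_name, row))
    return dict(record_sets)


# Illustrative two-table database sharing the primary key "ID".
database = {
    "t1": [{"ID": "Id1", "x": 1}, {"ID": "Id2", "x": None}],
    "t2": [{"ID": "Id1", "y": None}, {"ID": "Id2", "y": 2}],
}
record_sets = build_record_sets(database)
```

With two tables, each record set holds two data records, one from each table.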
In step S302, the target set is determined from the record sets according to the non-null field set and the target fields contained in each record set, and the data contained in the target set is used as the target data record. For example, target sets may be selected from the record sets one at a time according to the target fields each contains. A record set A is first selected at random as a target set; if the target fields contained in A are a, b, and c, the non-null fields other than a, b, and c in the non-null field set are determined. A record set B containing some of these other non-null fields is then selected as a further target set, the non-null fields not yet covered by the target fields of A and B are determined, and another record set containing some of them is selected, and so on, until the union of the target fields of the selected target sets covers all non-null fields.
In an alternative embodiment, determining the target set from each record set may include step S401 and step S402, as shown in fig. 4, where:
in step S401, the number of target fields contained in each record set is determined, and the record sets are sorted in descending order of that number. Specifically, the number of target fields contained in a record set is the number of distinct target fields across all data records in the set: a field that has a field value in several data records is counted only once. For example, suppose record set A contains 10 data records, of which the first contains field 1, field 3, and field 4, and the second contains field 2, field 3, and field 5. The fields contained in the first and second data records are then field 1, field 2, field 3, field 4, and field 5, a count of 5; continuing in the same way over all 10 data records gives the number of target fields contained in record set A. After the number of target fields contained in each record set is determined, the record sets are sorted from largest to smallest.
In step S402, a plurality of target sets are determined in the sorted order, so that the target fields contained in the target sets together are identical to the non-null fields in the non-null field set. For example, the first record set in the sorted order may be selected as a target set, and the target fields it contains are removed from the non-null field set to obtain the remaining non-null fields. The second record set in the sorted order is then selected as a target set, and the target fields it contains are removed from the remaining non-null fields. If a target field of the second record set is not among the remaining non-null fields, that field is the same as a target field of the first record set and has already been removed, so it is skipped and the next target field is processed. This continues until all non-null fields in the non-null field set have been removed, at which point the target sets determined so far are the final extraction data. For example, if the non-null field set becomes empty after the target fields contained in the 50th record set are removed, the first 50 record sets are the target sets. The plurality of target sets obtained may be combined into one set, and that set may be used as the extraction data.
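Steps S401 and S402 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the disclosed implementation: each record set is represented only by the set of target fields it contains, and the function name `select_by_count` and the sample keys and field names are hypothetical.

```python
def select_by_count(fields_by_key, non_null_fields):
    """Take record sets in descending order of field count until all fields are covered."""
    remaining = set(non_null_fields)
    # S401: sort record sets by number of distinct target fields, largest first.
    ordered = sorted(fields_by_key, key=lambda k: len(fields_by_key[k]), reverse=True)
    selected = []
    # S402: walk the sorted order, removing covered fields, until none remain.
    for key in ordered:
        if not remaining:
            break
        selected.append(key)
        remaining -= fields_by_key[key]  # fields already removed are simply skipped
    return selected, remaining

# Hypothetical record sets keyed by primary key value.
fields_by_key = {
    "k1": {"f1", "f2"},
    "k2": {"f1", "f2", "f3"},
    "k3": {"f4"},
    "k4": {"f1"},
}
selected, remaining = select_by_count(fields_by_key, {"f1", "f2", "f3", "f4"})
```

As in the description above, every record set up to the point of full coverage is kept, including one such as `k1` that adds no new field. A variant that skips such record sets would return a smaller result, but that is a design choice beyond what the text states.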
In an alternative embodiment, determining the target set from each record set may include steps S501 to S503, as shown in fig. 5, where:
in step S501, according to the target fields contained in each record set, a first complement of each record set with respect to the non-null field set is calculated, and the record set corresponding to the target first complement, namely the first complement containing the fewest elements, is used as a candidate set. Specifically, performing the complement operation between the target fields contained in each record set and the non-null field set yields a first complement for each record set. The first complement shows how well a record set covers the non-null fields: the fewer elements the complement contains, the more non-null fields the record set covers. The first complement with the fewest elements is taken as the target first complement, and the record set corresponding to it is taken as the candidate set.
In step S502, a second complement of each record set with respect to the target first complement is calculated, and the record set corresponding to the target second complement, namely the second complement containing the fewest elements, is merged into the candidate set. Specifically, the complement operation is performed again between the candidate set and each record set to obtain a second complement for each record set, the second complement with the fewest elements is taken as the target second complement, and the elements of the record set corresponding to the target second complement are merged into the candidate set. The candidate set then contains both the record set corresponding to the target first complement and the record set corresponding to the target second complement. In addition, the record set corresponding to the target second complement may be deleted.
In step S503, if the target fields contained in the candidate set are equal to the non-null field set, the candidate set is determined as the target set. Specifically, after step S501 it may be determined whether the target fields of the candidate set are equal to the non-null field set. If so, the candidate set is the target set; if not, step S502 is executed, and after the record set corresponding to the target second complement is merged into the candidate set, the candidate set is checked again. If it is now equal to the non-null field set, the candidate set is the target set; if not, third complements of the record sets with respect to the target second complement are calculated, the target third complement containing the fewest elements is determined among them, the elements of the record set corresponding to the target third complement are merged into the candidate set, and that record set is deleted. It should be understood that in this embodiment the candidate set may be checked each time a record set is merged into it: if the candidate set is equal to the non-null field set, it is determined as the target set and no further complement calculation is needed; otherwise the complements between the candidate set and the record sets are calculated again and the record set whose complement contains the fewest elements is merged into the candidate set, repeating until the candidate set is equal to the non-null field set. The data records in the final target set contain field values for all non-null fields, so the target set can meet the needs of data quality inspection, data verification, and the like.
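The complement-based iteration of steps S501 to S503 can be sketched in Python as below. This is one possible reading of the steps, offered as an illustration only: record sets are again reduced to their sets of target fields, the name `select_by_complement` and the sample data are hypothetical, and the early exit when no record set adds coverage is an added safeguard that the text does not describe.

```python
def select_by_complement(fields_by_key, non_null_fields):
    """Repeatedly merge the record set that leaves the smallest complement."""
    uncovered = set(non_null_fields)  # non-null fields the candidate set lacks
    pool = dict(fields_by_key)        # record sets not yet merged
    candidate = []                    # primary key values merged so far
    while uncovered and pool:
        # Complement of each remaining record set with respect to the
        # still-uncovered fields; pick the one with the fewest elements.
        best = min(pool, key=lambda k: len(uncovered - pool[k]))
        if not (uncovered & pool[best]):
            break                     # assumption: stop if nothing new can be covered
        candidate.append(best)
        uncovered -= pool.pop(best)   # merge, then delete the record set from the pool
    return candidate, uncovered

# Hypothetical record sets keyed by primary key value.
fields_by_key = {
    "k1": {"f1", "f2"},
    "k2": {"f1", "f2", "f3"},
    "k3": {"f4"},
    "k4": {"f1"},
}
candidate, uncovered = select_by_complement(fields_by_key, {"f1", "f2", "f3", "f4"})
```

With this sample data the first round picks `k2` (its complement is `{f4}`), and the second round picks `k3`, after which nothing is left uncovered.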
For example, as shown in fig. 6, assume that the database includes table 1, table 2, and table 3. Each field in each table can be checked in advance: if the column corresponding to a field contains a field value, the field is a non-empty field and is marked as such. The non-empty fields in each table can then be determined from these marks. As shown in fig. 6, the non-empty fields in table 1 are "field 1, field 2, field 3"; every field in table 2 has a value, so every field in table 2 is a non-empty field; and every data record in the column corresponding to "field 3" in table 3 is empty, so "field 3" is not a non-empty field of table 3. Combining the fields of all tables gives the non-empty field set of the database, {field 1, field 2, field 3, field 4, field 5, field 6}. The data records corresponding to each primary key value are then determined according to the primary key of the tables. The data record in table 1 corresponding to "id1" contains the target fields "field 1, field 2", and the data record in table 2 corresponding to "id1" contains the target field "field 1"; table 3 has no data record identified by "id1". The target fields contained in the record set corresponding to "id1" are therefore "field 1, field 2". Similarly, the target fields contained in the record set corresponding to "id2" are "field 1, field 2, field 3"; those for "id3" are "field 1, field 4, field 5"; and those for "id4" are "field 1, field 2, field 5, field 6". According to the number of target fields contained in each record set, the record sets corresponding to "id2" and "id4" may first be used as target sets, and then the record sets corresponding to "id1" and "id3" may be used as target sets.
Thus, the extraction data of the database may be a record set corresponding to id1, id2, id3, and id4, that is, data records with primary key values of id1, id2, id3, and id4 in table 1, table 2, and table 3, respectively. Alternatively, the extraction data may be the primary key value set { id1, id2, id3, id4}.
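When the extraction data is delivered as a primary key value set, the receiving side still has to fetch the matching records from each table. The sketch below shows one way this might look using Python's standard `sqlite3` module with an in-memory database. The table names, the key column `id`, and the helper `fetch_by_keys` are hypothetical and chosen only for illustration; the disclosure does not name a database engine.

```python
import sqlite3

def fetch_by_keys(conn, table_names, key_values, key_column="id"):
    """Return, per table, the rows whose primary key is in key_values."""
    placeholders = ",".join("?" for _ in key_values)
    result = {}
    for table in table_names:
        # Table and column names are assumed trusted; only values are parameterized.
        sql = (f'SELECT * FROM "{table}" '
               f'WHERE "{key_column}" IN ({placeholders}) '
               f'ORDER BY "{key_column}"')
        result[table] = conn.execute(sql, list(key_values)).fetchall()
    return result

# Hypothetical database with two tables sharing the primary key "id".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (id TEXT PRIMARY KEY, field1 TEXT)")
conn.execute("CREATE TABLE table2 (id TEXT PRIMARY KEY, field2 TEXT)")
conn.executemany("INSERT INTO table1 VALUES (?, ?)",
                 [("id1", "a"), ("id2", "b"), ("id5", "c")])
conn.executemany("INSERT INTO table2 VALUES (?, ?)",
                 [("id2", "x"), ("id9", "y")])

rows = fetch_by_keys(conn, ["table1", "table2"], ["id1", "id2"])
```

Only the key values travel between sender and receiver in this arrangement, which is why the key set is so much smaller than the records it stands for.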
With this method, the smallest data set covering all non-empty fields can be found without the excessive computation of extracting data by exhaustive search, so the amount of computation is reduced in big data scenarios and data processing efficiency is improved.
Further, in this example embodiment, a data extraction apparatus is further provided, which is configured to perform the data extraction method described in the disclosure. The device can be applied to a server or terminal equipment.
Referring to fig. 7, the data extraction apparatus 700 may include: a non-empty field acquisition module 710, a table data determination module 720, and an extraction data determination module 730, wherein:
the non-null field obtaining module 710 is configured to obtain a non-null field set of the database and a plurality of data records in each data table in the database.
The table data determining module 720 is configured to determine the target field included in each data record according to the non-null field value included in each data record.
The extraction data determining module 730 is configured to determine a target data record from the plurality of data records according to the non-null field set and the target field contained in each data record, and to determine the target data record as the data to be extracted from the database.
In an exemplary embodiment of the present disclosure, the apparatus further includes a non-null judging module, configured to judge whether field values corresponding to fields in each data table in the database are null, and determine a field whose corresponding field value is not null as a non-null field in the data table, so as to obtain the non-null field set.
In an exemplary embodiment of the present disclosure, the non-null determining module may be specifically configured to determine, according to a field type of each of the fields, whether a field value corresponding to each of the fields is null.
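A type-dependent emptiness check of the kind the non-null judging module performs might look like the following Python sketch. The specific rules used here (a string is empty if it is blank, a collection is empty if it has no elements, any other non-`None` value counts as present) are illustrative assumptions, as are the names `is_empty` and `non_empty_fields`; this section does not spell out the per-type rules.

```python
def is_empty(value):
    """Type-dependent emptiness check; the rules are illustrative assumptions."""
    if value is None:
        return True
    if isinstance(value, str):
        return value.strip() == ""
    if isinstance(value, (list, tuple, dict, set)):
        return len(value) == 0
    return False  # numbers, booleans, dates: any non-None value counts as present

def non_empty_fields(rows, key="id"):
    """Fields that hold a non-empty value in at least one row of a table."""
    fields = set()
    for row in rows:
        fields.update(f for f, v in row.items() if f != key and not is_empty(v))
    return fields

# Hypothetical table: field2 is blank in every row, so it is not a non-empty field.
rows = [
    {"id": "id1", "field1": "x", "field2": "", "field3": None},
    {"id": "id2", "field1": "", "field2": "  ", "field3": 0},
]
table_fields = non_empty_fields(rows)
```

Taking the union of `non_empty_fields` over every table would give the non-null field set of the whole database under these assumptions.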
In one exemplary embodiment of the present disclosure, the extraction data determination module 730 includes a classification unit and a set determination unit, wherein:
the classification unit is configured to classify the plurality of data records according to the primary key values contained in the plurality of data records and to determine the record set corresponding to each primary key value.
The set determining unit is configured to determine the target set from the record sets according to the non-empty field set and the target fields contained in each record set, and to take the data contained in the target set as the target data record.
In an exemplary embodiment of the present disclosure, the set determining unit may specifically include a sorting unit and a set selecting unit, wherein:
the sorting unit is configured to determine the number of target fields contained in each record set and to sort the record sets in descending order of that number.
The set selecting unit is configured to determine a plurality of target sets in the sorted order, so that the target fields contained in the target sets together are identical to the non-null fields in the non-null field set.
In an exemplary embodiment of the present disclosure, the set determining unit may specifically include a complement calculating unit, a set merging unit, and a set judging unit, wherein:
the complement calculating unit is configured to calculate, according to the target fields contained in each record set, a first complement of each record set with respect to the non-empty field set, and to take the record set corresponding to the target first complement containing the fewest elements as a candidate set.
The set merging unit is configured to calculate a second complement of each record set with respect to the target first complement, and to merge the record set corresponding to the target second complement containing the fewest elements into the candidate set.
The set judging unit is configured to determine the candidate set as the target set if the target fields contained in the candidate set are equal to the non-null field set.
In an exemplary embodiment of the present disclosure, the apparatus further includes a primary key determining module and a set sending module, wherein:
the primary key determining module is configured to determine the primary key value corresponding to each target data record, so as to obtain a primary key value set.
The set sending module is configured to receive an extraction request and send the primary key value set to the sender of the extraction request, so that the sender extracts data from the database using the primary key value set.
Since each functional module of the data extraction device according to the exemplary embodiment of the present disclosure corresponds to a step of the exemplary embodiment of the data extraction method described above, for details not disclosed in the embodiment of the device of the present disclosure, please refer to the embodiment of the data extraction method described above in the present disclosure.
Referring to fig. 8, fig. 8 is a schematic diagram illustrating a system architecture of an exemplary application environment to which a data extraction method and a data extraction apparatus according to embodiments of the present disclosure may be applied.
As shown in fig. 8, the system architecture 800 may include one or more of terminal devices 801, 802, 803, a network 804, and a server 805. The network 804 serves as a medium for providing communication links between the terminal devices 801, 802, 803 and the server 805. The network 804 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
The terminal devices 801, 802, 803 may be various electronic devices with display screens including, but not limited to, desktop computers, portable computers, smart phones, tablet computers, and the like. It should be understood that the number of terminal devices, networks and servers in fig. 8 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation. For example, the server 805 may be a server cluster formed by a plurality of servers.
The data extraction method provided by the embodiments of the present disclosure is generally performed by the server 805, and accordingly, the data extraction device is generally disposed in the server 805. However, it will be readily understood by those skilled in the art that the data extraction method provided in the embodiment of the present disclosure may be performed by the terminal devices 801, 802, 803, and accordingly, the data extraction apparatus may be provided in the terminal devices 801, 802, 803, which is not particularly limited in the present exemplary embodiment.
For example, in one exemplary embodiment, the server 805 may receive an extraction request from the client 801, obtain the non-empty field set of the database and the data records in each data table, determine the target fields contained in each data record according to the field values it contains, select target data records from the plurality of data records according to the non-empty field set and the target fields contained in each data record, and send the target data records to the client 801 as extraction data, so that the client 801 can operate on the database according to the extraction data, for example to verify database data mapping relationships or to perform data quality inspection.
Fig. 9 shows a schematic diagram of a computer system suitable for use in implementing embodiments of the present disclosure.
It should be noted that, the computer system 900 of the electronic device shown in fig. 9 is only an example, and should not impose any limitation on the functions and the application scope of the embodiments of the present disclosure.
As shown in fig. 9, the computer system 900 includes a Central Processing Unit (CPU) 901, which can execute various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 902 or a program loaded from a storage section 908 into a Random Access Memory (RAM) 903. In the RAM 903, various programs and data required for system operation are also stored. The CPU 901, ROM 902, and RAM 903 are connected to each other through a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.
The following components are connected to the I/O interface 905: an input section 906 including a keyboard, a mouse, and the like; an output section 907 including a display such as a cathode ray tube (CRT) or a liquid crystal display (LCD), and a speaker; a storage section 908 including a hard disk or the like; and a communication section 909 including a network interface card such as a LAN card, a modem, or the like. The communication section 909 performs communication processing via a network such as the internet. A drive 910 is also connected to the I/O interface 905 as needed. A removable medium 911, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 910 as needed, so that a computer program read from it can be installed into the storage section 908 as needed.
In particular, according to embodiments of the present disclosure, the processes described with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flowcharts. In such an embodiment, the computer program may be downloaded and installed from a network via the communication section 909 and/or installed from the removable medium 911. When the computer program is executed by the central processing unit (CPU) 901, the various functions defined in the methods and apparatus of the present application are performed.
It should be noted that the computer readable medium shown in the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present disclosure, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. 
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units involved in the embodiments of the present disclosure may be implemented by means of software, or may be implemented by means of hardware, and the described units may also be provided in a processor. Wherein the names of the units do not constitute a limitation of the units themselves in some cases.
As another aspect, the present application also provides a computer-readable medium, which may be contained in the electronic device described in the above embodiments or may exist alone without being incorporated into the electronic device. The computer-readable medium carries one or more programs which, when executed by an electronic device, cause the electronic device to implement the methods described in the above embodiments. For example, the electronic device may implement the steps shown in fig. 1 and 2, and so on.
It should be noted that although in the above detailed description several modules or units of a device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functionality of two or more modules or units described above may be embodied in one module or unit in accordance with embodiments of the present disclosure. Conversely, the features and functions of one module or unit described above may be further divided into a plurality of modules or units to be embodied.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (7)

1. A method of data extraction, comprising:
obtaining a non-empty field set of a database, and a plurality of data records in each data table in the database;
determining a target field contained in each data record according to the non-empty field value contained in each data record;
classifying the plurality of data records according to the primary key values contained in the plurality of data records, determining record sets corresponding to the primary key values respectively, determining a target set from the record sets according to the non-empty field sets and target fields contained in the record sets, taking data contained in the target set as target data records, and determining the target data records as data to be extracted of the database;
wherein determining a target set from each record set according to the non-empty field set and the target field contained in each record set comprises:
determining the number of target fields contained in each record set, sorting the record sets according to the number from large to small, and determining a plurality of target sets according to the sorting order so that all target fields contained in the plurality of target sets are identical to non-empty fields in the non-empty field set; or
according to target fields contained in each record set, respectively calculating first complement sets of each record set and the non-empty field set, and taking the record set corresponding to the target first complement set with the least contained elements as a candidate set;
and calculating second complement sets of the target first complement set and each record set, merging the record set corresponding to the target second complement set with the least contained elements into the candidate set until the target field contained in the candidate set is equal to the non-empty field set, and determining the candidate set as the target set.
2. The method of claim 1, further comprising, prior to obtaining the non-empty field set of the database:
judging whether field values corresponding to fields in each data table in the database are empty or not, and determining the fields with the corresponding field values not being empty as non-empty fields in the data table so as to obtain the non-empty field set.
3. The method of claim 2, wherein determining whether field values corresponding to fields in respective data tables in the database are null comprises:
judging whether the field value corresponding to each field is null or not according to the field type of each field.
4. The method of claim 1, wherein after determining the target data record as data of the database to be extracted, the method further comprises:
determining a primary key value corresponding to each target data record respectively so as to obtain a primary key value set;
and receiving an extraction request, and sending the primary key value set to a sending end of the extraction request so that the sending end extracts the database through the primary key value.
5. A data extraction apparatus, comprising:
the non-empty field acquisition module is used for acquiring a non-empty field set of the database and a plurality of data records in each data table in the database;
a table data determining module, configured to determine a target field included in each data record according to a non-null field value included in each data record;
the extraction data determining module is used for classifying the plurality of data records according to the primary key values contained in the plurality of data records, determining record sets corresponding to the primary key values respectively, determining a target set from the record sets according to the non-empty field sets and target fields contained in the record sets, taking data contained in the target set as target data records, and determining the target data records as data to be extracted of the database;
wherein determining a target set from each record set according to the non-empty field set and the target field contained in each record set comprises:
determining the number of target fields contained in each record set, sorting the record sets according to the number from large to small, and determining a plurality of target sets according to the sorting order so that all target fields contained in the plurality of target sets are identical to non-empty fields in the non-empty field set; or
according to target fields contained in each record set, respectively calculating first complement sets of each record set and the non-empty field set, and taking the record set corresponding to the target first complement set with the least contained elements as a candidate set;
and calculating second complement sets of the target first complement set and each record set, merging the record set corresponding to the target second complement set with the least contained elements into the candidate set until the target field contained in the candidate set is equal to the non-empty field set, and determining the candidate set as the target set.
6. A computer readable medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the method of any of claims 1-4.
7. An electronic device, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the method of any of claims 1-4 via execution of the executable instructions.
CN201911342183.4A 2019-12-23 2019-12-23 Data extraction method, data extraction device, computer readable medium and electronic equipment Active CN113094415B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911342183.4A CN113094415B (en) 2019-12-23 2019-12-23 Data extraction method, data extraction device, computer readable medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911342183.4A CN113094415B (en) 2019-12-23 2019-12-23 Data extraction method, data extraction device, computer readable medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN113094415A CN113094415A (en) 2021-07-09
CN113094415B true CN113094415B (en) 2024-03-29

Family

ID=76663930

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911342183.4A Active CN113094415B (en) 2019-12-23 2019-12-23 Data extraction method, data extraction device, computer readable medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN113094415B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115686939A (en) * 2022-10-27 2023-02-03 湖南长银五八消费金融股份有限公司 Data backup method and device, computer equipment and storage medium

Citations (11)

Publication number Priority date Publication date Assignee Title
JP2003323450A (en) * 2002-04-26 2003-11-14 Yamato Hiroshi Database retrieving device and method, computer program and computer-readable recording medium
CN1834891A (en) * 2004-12-17 2006-09-20 佳能株式会社 Information processor, information processing method, and control program
CN104408150A (en) * 2014-12-03 2015-03-11 天津南大通用数据技术股份有限公司 Data import/ export method and device adapted to a plurality of data formats of databases
CN105930462A (en) * 2016-04-21 2016-09-07 成都数联铭品科技有限公司 Cloud computing platform based massive data processing method
CN107229721A (en) * 2017-06-02 2017-10-03 泰华智慧产业集团股份有限公司 A kind of method and device for changing data pick-up
CN107924417A (en) * 2015-08-26 2018-04-17 片山成仁 Data bank management device and its method
CN109271435A (en) * 2018-09-14 2019-01-25 南威软件股份有限公司 A kind of data pick-up method and system for supporting breakpoint transmission
US10192031B1 (en) * 2006-11-03 2019-01-29 Vidistar, Llc System for extracting information from DICOM structured reports
CN109491989A (en) * 2018-11-12 2019-03-19 北京懿医云科技有限公司 Data processing method and device, electronic equipment, storage medium
CN109977110A (en) * 2019-04-28 2019-07-05 杭州数梦工场科技有限公司 Data cleaning method, device and equipment
CN110502515A (en) * 2019-08-15 2019-11-26 中国平安财产保险股份有限公司 Collecting method, device, equipment and computer readable storage medium

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
GB0314591D0 (en) * 2003-06-21 2003-07-30 Ibm Profiling data in a data store

Non-Patent Citations (2)

Title
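A general method for data extraction between multiple databases and its application; Liu Rujiu et al.; Journal of Beijing Jiaotong University (No. 4); 14-18 *
Merging bibliographic databases based on heterogeneous classification systems; Zou Xiaoshun; Journal of Wuhan University of Science and Technology (Social Science Edition) (No. 4); 94-97 *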

Similar Documents

Publication Publication Date Title
US11526799B2 (en) Identification and application of hyperparameters for machine learning
CN108933695B (en) Method and apparatus for processing information
CN109597810B (en) Task segmentation method, device, medium and electronic equipment
CN112364014B (en) Data query method, device, server and storage medium
CN112084179B (en) Data processing method, device, equipment and storage medium
CN109976999B (en) Method and device for measuring coverage rate of test cases
CN114281663A (en) Test processing method, test processing device, electronic equipment and storage medium
CN113094415B (en) Data extraction method, data extraction device, computer readable medium and electronic equipment
CN111414528B (en) Method and device for determining equipment identification, storage medium and electronic equipment
CN117093619A (en) Rule engine processing method and device, electronic equipment and storage medium
CN112162859A (en) Data processing method and device, computer readable medium and electronic equipment
CN116204428A (en) Test case generation method and device
CN111241137A (en) Data processing method and device, electronic equipment and storage medium
CN111125311A (en) Method and device for checking information normalization processing, storage medium and electronic equipment
CN110909288B (en) Service data processing method, device, platform, service end, system and medium
CN111026629A (en) Method and device for automatically generating test script
CN112163127B (en) Relationship graph construction method and device, electronic equipment and storage medium
CN111079185B (en) Database information processing method and device, storage medium and electronic equipment
CN113053531B (en) Medical data processing method, medical data processing device, computer readable storage medium and equipment
CN110532304B (en) Data processing method and device, computer readable storage medium and electronic device
CN112488625A (en) Returned piece identification method, returned piece identification device, returned piece identification equipment and storage medium
CN112379967A (en) Simulator detection method, device, equipment and medium
CN113495891A (en) Data processing method and device
CN109840196B (en) Method and device for testing business logic
CN110134691B (en) Data verification method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant