WO2023245941A1 - A data migration method and device - Google Patents

A data migration method and device

Info

Publication number
WO2023245941A1
Authority
WO
WIPO (PCT)
Prior art keywords
field
primary key
data
migrated
fields
Prior art date
Application number
PCT/CN2022/127665
Other languages
English (en)
French (fr)
Inventor
奚伟宏
蔡远航
Original Assignee
深圳前海微众银行股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳前海微众银行股份有限公司 (Shenzhen Qianhai WeBank Co., Ltd.)
Publication of WO2023245941A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 - Information retrieval of structured data, e.g. relational data
    • G06F16/21 - Design, administration or maintenance of databases
    • G06F16/22 - Indexing; Data structures therefor; Storage structures
    • G06F16/28 - Databases characterised by their database models, e.g. relational or object models

Definitions

  • the embodiments of this application relate to the field of financial technology (Fintech), and in particular, to a data migration method and device.
  • the Nebula graph database is an open-source, distributed, easily scalable native graph database; it uses vertices and edges as its basic storage units, can carry ultra-large-scale data sets with hundreds of billions of vertices and trillions of edges, and provides millisecond-level queries; it is a graph database designed to store huge graph networks and retrieve information from them.
  • first, the number of data tables in the Hive database is very large, usually hundreds or even thousands, and the number of fields in each data table is also large, with an average of 20-40; secondly, some data tables in the Hive database may be associated with multiple Tags/Edge Types in the Nebula graph database (for example, if a data table in the Hive database stores a user's personal information, bank card information, device information, and company information, then the personal information, bank card information, device information, and company information need to be split and stored in 4 Tags in the Nebula graph database).
  • This application provides a data migration method and device to simply, quickly and accurately migrate data from the Hive database to the Nebula graph database.
  • embodiments of the present application provide a data migration method, which includes: for a graph space in a graph database, constructing a primary key dictionary of the graph space based on multiple elements of the graph space, the graph space being determined based on multiple data tables to be migrated in the source database, where for any element the primary key dictionary uses the primary key field of the element as the key and the element name of the element as the value; for any configured data table, matching each table field in the table against each primary key field in the primary key dictionary; if at least one table field in the table is the same as a primary key field with the same meaning in the primary key dictionary, determining that the table is a data table to be migrated and that the at least one table field is paired fields, and obtaining from the primary key dictionary, based on the at least one table field, the element names under the elements where the same primary key fields are located to form an element name set.
  • constructing a primary key dictionary of the graph space based on multiple elements of the graph space includes: when an element of the graph space is a tag, using the primary key field of the tag as the key and the name of the tag as the value, where the names of different tags are different and the primary key field of any tag is consistent with the primary key field with the same meaning in the source database; when an element of the graph space is a directed edge, using the first spliced field obtained by splicing the primary key fields corresponding to the starting point and end point of the directed edge as the key, and the name of the directed edge as the value.
  • in the above solution, the construction method of the primary key dictionary is described in detail; because the primary key dictionary is the basis for executing the data migration method of this application, accurately constructing the primary key dictionary of the graph space improves the speed and accuracy of data migration.
  • obtaining from the primary key dictionary, based on the at least one table field, the element names under the elements where the same primary key fields are located, and forming an element name set, includes: for any configured data table, matching each table field in the table against each primary key field in the primary key dictionary; if only one primary key field in the primary key dictionary is the same as a table field in the table, obtaining from the primary key dictionary, based on that primary key field, the element name under the element where the primary key field is located, and adding the element name to the element name set; if at least two primary key fields in the primary key dictionary are the same as the same number of table fields in the table, splicing the at least two primary key fields in pairs to obtain multiple second spliced fields; for any second spliced field, searching for the second spliced field among the primary key fields of the primary key dictionary, and if the second spliced field exists in the primary key dictionary, obtaining from the primary key dictionary, based on the second spliced field, the element name under the element where the second spliced field is located, and adding the element name to the element name set.
  • each element name in the element name set is the basis for subsequently matching the non-primary key fields (i.e., non-paired fields) of the data table to be migrated in the Hive database with the non-primary key attribute fields in the Nebula graph database; accurately building the element name set improves the speed and accuracy of data migration.
  • determining each non-primary key attribute field of the graph space based on the element name set and each non-primary key attribute of the graph space includes: for any element name in the element name set, splicing the element name one by one with each non-primary key attribute of that element name in the graph space, and using each third spliced field obtained by splicing as a non-primary key attribute field of the graph space under that element name.
  • the above solution describes how to determine each non-primary key attribute field of the graph space, and how, based on the determined non-primary key attribute fields, to match the same fields for each non-primary key field (i.e., non-paired field) in the data table to be migrated; that is, it describes how to determine the data migration mapping relationship. In this way, the data migration mapping relationship can be established accurately and automatically, and based on the established mapping relationship the speed and accuracy of data migration can be improved.
  • determining the data migration mapping relationship based on each first field and each second field includes: for any first field among the first fields, determining, according to the data type of the field value of the first field, each third field of the same data type from among the second fields; for any third field, determining the field similarity between the third field and the first field according to a field similarity calculation method that matches the data type; and determining, based on the field similarity, whether to construct a data migration mapping relationship between the first field and the third field.
  • the method further includes: setting a first Bloom filter for the source database and a second Bloom filter for the graph database; according to the generation time of the data, writing the data moved out of the source database within a set time period into the first Bloom filter, and writing the data written into the graph database within the set time period into the second Bloom filter; and determining, based on the first write result of the first Bloom filter and the second write result of the second Bloom filter, whether the data migration within the set time period is correct.
  • any data migration method requires verification of the consistency of data migration during execution, that is, it is necessary to ensure that data is not lost during the migration process.
  • currently, when verifying whether data is consistent during migration, processing is done at the granularity of a single piece of data: for each piece of data, all fields are read and converted to JSON (JavaScript Object Notation), and then a hash value is calculated, so that data consistency can be determined by comparing the hash value of each piece of data in the Hive database with the corresponding hash value in the Nebula graph database.
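The per-row hashing approach described above can be sketched as follows. This is an illustrative Python sketch, not the application's actual implementation; the field names are hypothetical:

```python
import hashlib
import json


def row_fingerprint(row):
    # Serialize every field of one record to canonical JSON (sorted keys),
    # then hash it; equal fingerprints on both sides mean the row survived
    # migration intact.
    canonical = json.dumps(row, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Key order does not matter, so the Hive-side and Nebula-side rows below
# produce the same fingerprint.
hive_row = {"userId": "u1", "age": 30}
nebula_row = {"age": 30, "userId": "u1"}
```

Comparing one hash per row is precisely what makes this approach expensive at scale, which is the inefficiency the Bloom-filter design addresses.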
  • the above-mentioned solution of this application designs a Bloom filter and then verifies the data within a specified time window according to the time when the data was generated; in this way, data consistency verification can be achieved even for data whose field structures differ, which overcomes the low verification efficiency of current data consistency verification.
  • the first Bloom filter and the second Bloom filter are both N-layer Bloom filters, and any Bloom filter in a latter layer is used to write the write result of the Bloom filter in the previous layer; determining whether the data migration within the set time period is correct according to the first write result of the first Bloom filter and the second write result of the second Bloom filter includes: comparing the first write result written into the last layer of Bloom filters in the first Bloom filter with the second write result written into the last layer of Bloom filters in the second Bloom filter; if the first write result is the same as the second write result, determining that the data within the set time period has been migrated correctly.
  • N is 2; the method further includes: designing the first-layer Bloom filter in the first Bloom filter and the first-layer Bloom filter in the second Bloom filter in the form of a linked list; the linked-list form means that after the number of written data items meets a set threshold, a new Bloom filter is appended to the first layer of Bloom filters.
  • as the amount of data written into a Bloom filter increases, its false-positive rate will also increase.
  • the above solution of this application records the number of data items written into a Bloom filter, and when the number of written data items reaches the designed threshold, a new Bloom filter is automatically used to write data; this reduces the probability of false positives during use of the Bloom filter.
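As a rough illustration of the time-window verification idea, the sketch below writes each side's records for one window into its own single-layer Bloom filter and compares the resulting bit arrays. The class name, sizes, and hash scheme are assumptions for illustration only, not the N-layer linked-list design of the application:

```python
import hashlib


class BloomFilter:
    # A minimal Bloom filter: k hash positions per item over a fixed bit array.
    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes = size, hashes
        self.bits = bytearray(size)

    def _positions(self, item):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1


def window_migrated_correctly(source_records, target_records):
    # Write the records moved out of the source database and the records
    # written into the graph database for the same time window, then compare
    # the two write results (the bit arrays).
    bf_src, bf_dst = BloomFilter(), BloomFilter()
    for rec in source_records:
        bf_src.add(rec)
    for rec in target_records:
        bf_dst.add(rec)
    return bf_src.bits == bf_dst.bits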
  • embodiments of the present application provide a data migration device, which includes: a primary key dictionary construction unit, configured to construct, for a graph space in a graph database, a primary key dictionary of the graph space based on multiple elements of the graph space, where the graph space is determined based on multiple data tables to be migrated in the source database, and for any element the primary key dictionary uses the primary key field of the element as the key and the element name of the element as the value; an element name set construction unit, configured to match, for any configured data table, each table field in the table against each primary key field in the primary key dictionary, and if at least one table field in the table is the same as a primary key field with the same meaning in the primary key dictionary, determine that the table is a data table to be migrated, determine the at least one table field as paired fields, and obtain from the primary key dictionary, based on the at least one table field, the element names under the elements where the same primary key fields are located to form an element name set; an attribute field determination unit, configured to determine each non-primary key attribute field of the graph space based on the element name set and each non-primary key attribute of the graph space; and a migration processing unit, configured to determine, for any data table to be migrated, the data migration mapping relationship according to each non-paired field in the table and each non-primary key attribute field of the graph space, and to perform data migration; the data migration mapping relationship is used to migrate the data in the table to the same fields in the graph space according to the same fields, and the non-paired fields are the table fields in the table excluding the paired fields.
  • embodiments of the present application provide a computing device, including:
  • a memory, configured to store program instructions;
  • a processor, configured to call the program instructions stored in the memory and to execute any implementation method of the first aspect according to the obtained program.
  • embodiments of the present application provide a computer-readable storage medium that stores computer-executable instructions, where the computer-executable instructions are used to cause a computer to execute any implementation method of the first aspect.
  • Figure 1 is a schematic diagram of a data migration method provided by an embodiment of the present application.
  • Figure 2 is a schematic diagram of a configuration file provided by an embodiment of the present application.
  • Figure 3 is a schematic diagram of a data migration device provided by an embodiment of the present application.
  • FIG. 4 is a schematic diagram of a computing device provided by an embodiment of the present application.
  • Figure 1 is a schematic diagram of a data migration method provided by an embodiment of the present application; the method is executed by a data migration device. Referring to Figure 1, the data migration method includes the following steps:
  • Step 101 For the graph space in the graph database, construct a primary key dictionary of the graph space based on multiple elements of the graph space; the graph space is determined based on multiple data tables to be migrated in the source database; for For any element, the primary key dictionary is constructed with the primary key field of the element as the key and the element name of the element as the value.
  • in the embodiments of this application, the graph database is the Nebula graph database and the source database is the Hive database; this application will therefore describe its data migration method by migrating data from the Hive database to the Nebula graph database.
  • Table 1 is an element information table of a graph space provided by an embodiment of the present application. The element information table includes 4 columns, which from left to right represent the element type of the graph space, the element name, the primary key field of the element, and the non-primary key attributes of the element. Of the 4 records in column 1 of Table 1, the 1st and 2nd records are tag elements (Tag), and the 3rd and 4th records are edge type elements (Edge Type); the first record has the element name "user", the second record has the element name "organization", the third record has the element name "loan", and the fourth record has the element name "occupy".
  • the data in columns 1 and 2 of Table 1 are determined from the existing elements in the Nebula graph database after analyzing the multiple data tables to be migrated in the Hive database. As for the data in columns 3 and 4 (taking the row of the tag element "user" as an example), after the element name "user" is determined and a tag element with this name is created in the Nebula graph database, technical staff create the primary key field and non-primary key attributes for the tag element in advance based on business needs; the data in columns 3 and 4 are then obtained by reading the created primary key field and non-primary key attributes from the Nebula graph database.
  • constructing a primary key dictionary of the graph space based on multiple elements of the graph space includes: when an element of the graph space is a tag, using the primary key field of the tag as the key and the name of the tag as the value, where the names of different tags are different and the primary key field of any tag is consistent with the primary key field with the same meaning in the source database; when an element of the graph space is a directed edge, using the first spliced field obtained by splicing the primary key fields corresponding to the starting point and end point of the directed edge as the key, and the name of the directed edge as the value.
  • for each Tag, its primary key name is used as the key and a list composed of the Tag name is used as the value, and the pair is stored in the primary key dictionary; for each Edge Type, the names of its SRC_VID and DST_VID are string-concatenated, and the spliced field (i.e., the first spliced field) is used as the key, with a list composed of the Edge Type names stored in the Nebula primary key dictionary as the value. Note that a value list keyed by a Tag primary key name contains only a single list element, whereas a value list keyed by the string concatenation of an Edge Type's SRC_VID and DST_VID names can contain several different values.
  • Step 102 For any configured data table, match each table field in the table against each primary key field in the primary key dictionary; if at least one table field in the table is the same as a primary key field with the same meaning in the primary key dictionary, determine that the table is a data table to be migrated and that the at least one table field is paired fields, and obtain from the primary key dictionary, based on the at least one table field, the element names under the elements where the same primary key fields are located, forming an element name set.
  • in specific implementation, the data migration method in this application uses the Nebula Exchange tool and is completed on a Spark cluster. After the primary key dictionary of the graph space has been obtained in step 101, a configuration file needs to be prepared; the required configuration includes the Spark-related configuration, the Nebula Graph-related configuration, the table names of the data tables to be migrated in the Hive database, and the name of the destination graph space in the Nebula graph database (the destination graph space refers to the graph space into which data needs to be migrated).
  • the primary key dictionary of the graph space obtained in step 101 only establishes the correspondence between the primary key fields of the data tables to be migrated in the Hive database and the primary key fields of the destination graph space in the Nebula graph database. However, the remaining table fields of the data tables to be migrated are not exactly the same as the field names of the Tag/Edge Type in the Nebula graph database, fields with the same meaning may have different names in different data tables to be migrated, and the field alignment relationship is not specified in the configuration file. It is therefore necessary to align the fields between the data tables to be migrated in the Hive database and the Tag (label)/Edge Type (edge type) of the corresponding graph space in the Nebula graph database; after alignment, the data migration mapping relationship generated during field alignment is written into the configuration file, so that by subsequently executing the configuration file, the data in the Hive database can be accurately migrated to the Nebula graph database.
  • the data tables to be migrated fall into two types: entity information tables (for example, a user information table or an institution information table) and entity relationship tables (for example, a user loan flow table describing the lending relationship between users and institutions). The following describes how to form an element name set for each of these two types of data tables; once the element name set is available, the non-primary key fields in the Hive database can be automatically aligned with the non-primary key attribute fields in the Nebula graph database.
  • for any data table configured in the configuration file, search all table fields in the table based on each primary key field in the constructed primary key dictionary. If no corresponding primary key field can be found among the table fields, it is determined that the configuration of this table in the configuration file is incorrect, that is, the table is not a data table that needs to be migrated this time (it is not a data table to be migrated); in this case, the next configured data table in the configuration file is processed in the same way.
  • Search result 1 indicates that only one primary key field is retrieved; the element name set is then determined as follows: based on that primary key field, the element name under the element where the primary key field is located is obtained from the primary key dictionary and added to the element name set.
  • Search result 2 indicates that two or more primary key fields are retrieved; the element name set is then determined as follows: the at least two primary key fields are spliced in pairs to obtain multiple second spliced fields; for any second spliced field, the second spliced field is searched among the primary key fields of the primary key dictionary, and if the second spliced field exists in the primary key dictionary, the element name under the element where the second spliced field is located is obtained from the primary key dictionary based on the second spliced field and added to the element name set.
  • the table structure of the data table to be migrated contains the following 9 table fields:
  • the association relationship in this example is the loan relationship between the user and the institution. These associations also need to be imported into the Nebula graph database, that is, an Edge Type (edge relationship) is built between the user Tag (user tag) and the organization Tag (organization tag). userId (user identification number) and organizationId (organization identification number) are string-spliced to obtain "userIdorganizationId" (the second spliced field), and retrieving the above primary key dictionary with it returns "loan" and "occupy". Finally, the returned results are combined to form the element name set ["user", "organization", "loan", "occupy"].
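Putting steps 101 and 102 together, deriving the element name set for the example table might be sketched as below. The names are illustrative; both splicing orders are tried because edges are directed and only one order appears in the dictionary:

```python
from itertools import permutations


def element_name_set(table_fields, pk_dict):
    # Step 102 sketch: find table fields that match primary key fields,
    # then collect element names, including those reached via pairwise
    # spliced fields ("second spliced fields").
    matched = [f for f in table_fields if f in pk_dict]
    if not matched:
        return None  # table is not configured correctly / not to be migrated
    names = []
    for field in matched:                      # tag names from single primary keys
        names.extend(pk_dict[field])
    if len(matched) >= 2:                      # second spliced fields, in pairs
        for a, b in permutations(matched, 2):
            names.extend(pk_dict.get(a + b, []))
    return names


pk_dict = {
    "userId": ["user"],
    "organizationId": ["organization"],
    "userIdorganizationId": ["loan", "occupy"],
}
fields = ["userId", "organizationId", "user_name", "age", "telephone",
          "org_address", "loan_date", "loan_amount", "start_year"]
names = element_name_set(fields, pk_dict)
# names == ["user", "organization", "loan", "occupy"], as in the example
```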
  • Step 103 Determine each non-primary key attribute field of the graph space based on the element name set and each non-primary key attribute of the graph space.
  • determining each non-primary key attribute field of the graph space based on the element name set and each non-primary key attribute of the graph space includes: for any element name in the element name set, splicing the element name one by one with each non-primary key attribute of that element name in the graph space, and using each third spliced field obtained by splicing as a non-primary key attribute field of the graph space under that element name.
  • taking the element name "user" as an example, the element name is spliced one by one with each non-primary key attribute of that element name in the graph space. The non-primary key attributes of "user" in this graph space include "username", "nationality", "age", and "phone"; therefore, by splicing "user" with these four non-primary key attributes one by one, four spliced fields are obtained, namely "user.username", "user.nationality", "user.age", and "user.phone".
  • similarly, by splicing the element name "organization" with its non-primary key attribute "address", the spliced field "organization.address" is obtained; by splicing the element name "loan" one by one with its non-primary key attributes "loan_date" (loan date) and "loan_amount" (loan amount), two spliced fields are obtained, namely "loan.loan_date" and "loan.loan_amount"; and by splicing the element name "occupy" one by one with its non-primary key attributes "start_year" (start year) and "end_year" (end year), two spliced fields are obtained, namely "occupy.start_year" and "occupy.end_year".
  • these spliced fields are the third spliced fields, and they are the non-primary key attribute fields of the graph space.
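Step 103's splicing can be sketched as follows, using the attributes listed above (function name illustrative):

```python
def non_primary_key_attribute_fields(element_names, attrs_by_element):
    # Splice each element name one by one with its non-primary key attributes;
    # each "name.attribute" string is a third spliced field.
    return [f"{name}.{attr}"
            for name in element_names
            for attr in attrs_by_element.get(name, [])]


attrs = {
    "user": ["username", "nationality", "age", "phone"],
    "organization": ["address"],
    "loan": ["loan_date", "loan_amount"],
    "occupy": ["start_year", "end_year"],
}
third_fields = non_primary_key_attribute_fields(
    ["user", "organization", "loan", "occupy"], attrs)
# third_fields contains e.g. "user.username", "organization.address",
# "loan.loan_amount", "occupy.end_year"
```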
  • Step 104 For any data table to be migrated, determine the data migration mapping relationship according to each non-paired field in the table and each non-primary key attribute field of the graph space, and perform data migration; the data migration mapping relationship is used to migrate the data in the table to the same fields in the graph space according to the same fields, and the non-paired fields are the table fields in the table excluding the paired fields.
  • determining the data migration mapping relationship based on each non-paired field in the data table to be migrated and each non-primary key attribute field of the graph space includes: for any data table to be migrated, normalizing each non-paired field in the table to obtain the first fields; normalizing each non-primary key attribute field of the graph space to obtain the second fields; and determining the data migration mapping relationship based on the first fields and the second fields.
  • determining the data migration mapping relationship based on each first field and each second field includes: for any first field among the first fields, determining, according to the data type of the field value of the first field, each third field of the same data type from among the second fields; for any third field, determining the field similarity between the third field and the first field according to a field similarity calculation method that matches the data type; and determining, based on the field similarity, whether to construct a data migration mapping relationship between the first field and the third field.
  • Numeric type: fields of this type include "age", etc.
  • for distribution similarity testing of numerical data, there are usually two algorithms: the t-test and the KS test. Considering actual data migration scenarios, the amount of data in some tables may be too small; in addition, for product positioning reasons, users tend to cluster at a certain feature level rather than randomly conforming to a normal distribution. The difference between the KS test and the t-test is that the KS test does not need to know the distribution of the data and can be regarded as a non-parametric test method; thus, when the data does not conform to a specific distribution, the KS test is more sensitive than the t-test. The present invention therefore selects the KS test to describe the distribution similarity of the values of numeric fields. Calculation process: take the sets of values of the two fields, run the KS test, and obtain the similarity result.
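For reference, the two-sample KS statistic is the largest gap between the two fields' empirical CDFs; in practice one would likely call a library routine such as scipy.stats.ks_2samp, but a self-contained sketch of the statistic itself is:

```python
import bisect


def ks_statistic(sample_a, sample_b):
    # Two-sample Kolmogorov-Smirnov statistic: the largest vertical gap
    # between the two empirical CDFs; 0 means identical empirical
    # distributions, 1 means fully disjoint value ranges.
    a, b = sorted(sample_a), sorted(sample_b)

    def ecdf(sorted_sample, x):
        # fraction of the sample that is <= x
        return bisect.bisect_right(sorted_sample, x) / len(sorted_sample)

    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in sorted(set(a) | set(b)))
```

A small statistic (equivalently, a large p-value in the library version) is then taken as evidence that the two numeric fields' value distributions are similar.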
  • for enumerated fields, the method of the present invention is: first perform frequency statistics on the set of all values in the field, then sort the values in descending order of frequency, take the set of values covering the first 75% of the total data amount as the positive set, and take the set of remaining values as the negative set (for example, if the total amount of data is 10 and the frequency statistics of all values are 7 A's, 2 B's, and 1 C, the data sorted in descending order of value frequency is A, A, A, A, A, A, A, B, B, C; the set of values of the first 75% of the data is [A, B], which serves as the positive set, and the set [C] serves as the negative set). Next, "the value belongs to the positive set" is regarded as "the event occurs".
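The positive/negative split for enumerated fields can be sketched as follows, reproducing the 7-A / 2-B / 1-C example from the text (function name illustrative):

```python
from collections import Counter


def split_positive_negative(values, ratio=0.75):
    # Frequency-count the values, sort in descending frequency, then take the
    # most frequent values covering `ratio` of the data as the positive set
    # and the remaining values as the negative set.
    total = len(values)
    positive, covered = [], 0
    ordered = Counter(values).most_common()   # values in descending frequency
    for value, count in ordered:
        positive.append(value)
        covered += count
        if covered >= ratio * total:
            break
    negative = [value for value, _ in ordered if value not in positive]
    return positive, negative


pos, neg = split_positive_negative(["A"] * 7 + ["B"] * 2 + ["C"])
# pos == ["A", "B"] and neg == ["C"], matching the example in the text
```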
  • a Trie tree is also constructed in the reverse direction, and the similarity is calculated using the same algorithm as above; as long as the Trie trees constructed from the sets of values of the two fields are similar in either the forward or the reverse direction, the distributions of the values of the two fields are considered similar.
  • if the field name is fewer than 8 characters long, the field names are the same, and the distributions of the field values are similar, the two fields are considered alignable and are merged; for example, "age" in the source field set is aligned with "age" in the destination field set.
  • if the field name is more than 8 characters long, the field name similarity is greater than 0.8, and the distributions of the field values are similar, the two fields are considered alignable and are merged; for example, "user_name" (user name), "telephone" (phone number), and "org_address" (organization address) in the source field set are aligned with "username", "phone", and "address" in the destination field set respectively.
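The two alignment rules can be sketched as below. The original does not specify which name-similarity measure produces the score compared against 0.8; difflib.SequenceMatcher is an assumed stand-in for illustration (it aligns "user_name" with "username", though it scores "telephone" vs "phone" below 0.8, so the application's actual measure is presumably more tolerant of abbreviations):

```python
from difflib import SequenceMatcher


def can_align(src_field, dst_field, values_similar=True):
    # Alignment rule sketch: short names (< 8 chars) must match exactly;
    # longer names need name similarity > 0.8. Either way the value
    # distributions must already have been judged similar.
    if not values_similar:
        return False
    if len(src_field) < 8:
        return src_field == dst_field
    return SequenceMatcher(None, src_field, dst_field).ratio() > 0.8
```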
  • Figure 2 is a schematic diagram of a configuration file provided by an embodiment of the present application; the content enclosed by the rectangular frame is the alignment information between the non-paired fields of the data table to be migrated and the non-primary key attribute fields of the graph space, that is, the data migration mapping relationship.
  • the mapping method for mapping Hive table data to a Nebula Tag is as follows:
  • VID represents the primary key of a specific vertex;
  • PROP_NAME_LIST represents the other attribute values of the Tag.
  • the mapping method for mapping Hive table data to a Nebula Edge Type is as follows:
  • the field names, field order, number of fields, and field types in the Edge Type statement must be consistent with the fields appearing in the SELECT statement. Since an Edge Type is directional, the start point is SRC_VID and the end point is DST_VID. Rank is an edge field attribute unique to Nebula, used to distinguish data when the Edge Type, start point, and end point are all the same; rank can be a time attribute field or some other meaningful field. PROP_NAME_LIST represents the other attribute values of the Edge Type.
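As an illustration of the required column ordering (start point, end point, optional rank, then the remaining properties), a hypothetical helper that assembles the SELECT statement might look like this; the helper itself and all table and column names are illustrative, not part of the Nebula Exchange API:

```python
def build_edge_select(table, src_vid, dst_vid, prop_names, rank=None):
    """Build a SELECT statement whose column order mirrors the Edge Type
    definition: SRC_VID, DST_VID, optional rank, then PROP_NAME_LIST."""
    columns = [src_vid, dst_vid]
    if rank is not None:
        columns.append(rank)       # rank sits between the end point and props
    columns.extend(prop_names)
    return "SELECT {} FROM {}".format(", ".join(columns), table)
```

With the example fields of this document, `build_edge_select("loan_table", "userId", "organizationId", ["loan_amount"], rank="loan_date")` produces a SELECT whose columns follow the Edge Type order.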
  • finally, spark.sql is called to write the data.
  • any data migration method requires the consistency of the data migration to be verified during execution; that is, it must be ensured that no data is lost during migration.
  • the data migration method of the present application is no exception. To this end, the present application proposes the following data consistency verification method:
  • a first Bloom filter for the source database and a second Bloom filter for the graph database are set respectively; according to the generation time of the data, the data migrated out of the source database within a set time period is written into the first Bloom filter, and the data written into the graph database within the set time period is written into the second Bloom filter; according to the first write result written into the first Bloom filter and the second write result written into the second Bloom filter, it is determined whether the data migration within the set time period is correct.
  • the first Bloom filter and the second Bloom filter are both N-layer Bloom filters, where any Bloom filter in a later layer is used to write the write result of the Bloom filter in the preceding layer. Determining whether the data migration within the set time period is correct according to the first and second write results includes: comparing the first write result written into the last layer of the first Bloom filter with the second write result written into the last layer of the second Bloom filter; if it is determined that the first write result is the same as the second write result, the data migration within the set time period is determined to be correct.
  • N = 2, and the method further includes: designing the first-layer Bloom filter of the first Bloom filter and the first-layer Bloom filter of the second Bloom filter both in linked-list form; the linked-list form means that once the number of written records meets a set threshold, a new Bloom filter is added to the first layer of Bloom filters.
  • BloomFilterList instances are first generated based on the BloomFilter (Bloom filter) class.
  • two BloomFilterList instances are used to store the consistency verification data of the data tables to be migrated in the Hive database.
  • two BloomFilterList instances are used to store the consistency verification data of the Nebula graph database.
  • according to the creation time (the create_time field) or modification time (the update_time field) of the data, the data within the same specified time period is extracted from the Hive database and the Nebula graph database respectively.
  • the time period can be measured in minutes. For example, all fields of all data in the first minute are written into the first-layer Bloom filter; after all the data of the first minute has been written into the first-layer Bloom filter, the binary string sequence of the first-layer Bloom filter is written into the second-layer Bloom filter.
  • the state of the first-layer Bloom filter is then cleared, and the state of the corresponding Bloom filter of the Hive database (specifically, its second-layer Bloom filter) is compared with the state of the corresponding Bloom filter of the Nebula graph database (specifically, its second-layer Bloom filter). If they are consistent, the data migration is correct; otherwise, there is a problem with the data migration, and all data within this time window must be migrated again. The next minute's data is then processed, until all data processing is complete.
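The per-minute, two-layer flow described above can be sketched as follows. Python sets stand in for real Bloom filters purely to keep the sketch short; a production version would write into actual Bloom filters and compare their binary string states:

```python
def verify_migration(hive_batches, nebula_batches):
    """Two-layer consistency check: per minute, write every record into a
    fresh layer-1 structure, fold its serialized state into layer 2, clear
    layer 1, and finally compare the two layer-2 states."""
    def second_layer_state(batches):
        layer2 = set()
        for batch in batches:          # one batch = one minute of records
            layer1 = set()
            for record in batch:       # write every record into layer 1
                layer1.add(record)
            # Write layer 1's serialized state into layer 2; dropping the
            # local layer1 plays the role of clearing the first layer.
            layer2.add(frozenset(layer1))
        return layer2

    return second_layer_state(hive_batches) == second_layer_state(nebula_batches)
```

If the Hive side and the Nebula side saw identical per-minute data, the final states match; a lost record makes the comparison fail and flags the window for re-migration.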
  • this double-layer compressed Bloom filter greatly reduces the amount of data stored and reduces the server memory consumed by the data consistency verification process.
  • the BloomFilter class contains two properties: a LinkedList&lt;BloomFilterEntity&gt; bloomFilterList property and a String bloomFilterName property.
  • the bloomFilterList property wraps BloomFilterEntity entities in a doubly linked list such as LinkedList; this property is the core of the Bloom filter's automatic expansion.
  • bloomFilterName is the name of the defined Bloom filter.
  • the core methods encapsulated in the BloomFilter class are mightContain, put, and mightContainAndPut. The following focuses on the implementation logic of these three methods.
  • the function of mightContain is to determine whether a string is present in the Bloom filter.
  • the implementation logic of the method is to return false directly when bloomFilterList is empty.
  • when bloomFilterList is not empty, each BloomFilterEntity entity in bloomFilterList is traversed to judge whether any single entity contains the externally passed data; if one does, true is returned, otherwise false is returned.
  • the put method is the core of the automatically expanding Bloom filter. It first determines whether bloomFilterList is empty; if so, a new BloomFilterEntity is created and added to the end of bloomFilterList. Then, by virtue of the doubly linked list characteristics of bloomFilterList, the last Bloom filter entity, lastBloomFilterEntity, can be obtained in O(1) time, and the data is added to lastBloomFilterEntity.
  • the mightContainAndPut method is an enhancement of mightContain.
  • if the incoming data is not in the Bloom filter, the data is added to the Bloom filter.
  • the implementation logic of this method is to first call the mightContain method and, when false is returned, directly call the put method described above.
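A minimal sketch of the three methods, using a set-backed stand-in for a real fixed-capacity Bloom filter entity. The method names are Python-cased versions of mightContain / put / mightContainAndPut; the capacity threshold, the use of a deque as the linked list, and all other implementation details are assumptions:

```python
from collections import deque

class BloomFilterEntity:
    """Stand-in for one fixed-capacity Bloom filter; a real implementation
    would hash into a bit array instead of keeping a set of items."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = set()

    def might_contain(self, value):
        return value in self.items

    def is_full(self):
        return len(self.items) >= self.capacity

    def put(self, value):
        self.items.add(value)

class BloomFilterList:
    """Auto-expanding Bloom filter modeled on the described logic."""
    def __init__(self, capacity_per_filter=1000):
        self.filters = deque()          # plays the role of the linked list
        self.capacity = capacity_per_filter

    def might_contain(self, value):
        # Empty list: the value cannot be present, so this returns False.
        return any(f.might_contain(value) for f in self.filters)

    def put(self, value):
        # Expand with a fresh entity when empty or when the tail is full.
        if not self.filters or self.filters[-1].is_full():
            self.filters.append(BloomFilterEntity(self.capacity))
        self.filters[-1].put(value)     # O(1) access to the last entity

    def might_contain_and_put(self, value):
        if not self.might_contain(value):
            self.put(value)
            return False
        return True
```

Once a filter reaches its threshold, the next `put` transparently appends a new entity, which mirrors the automatic-expansion behavior described for the linked-list form.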
  • an embodiment of the present application provides a data migration device.
  • FIG. 3 is a schematic diagram of a data migration device provided by an embodiment of the present application.
  • the device includes a primary key dictionary construction unit 301, an element name set construction unit 302, a non-primary-key attribute field determination unit 303, and a migration processing unit 304;
  • the primary key dictionary construction unit 301 is configured to construct a primary key dictionary of a graph space in the graph database according to multiple elements of the graph space; the graph space is determined according to multiple data tables to be migrated in the source database; for any element, the primary key dictionary is constructed with the primary key field of the element as the key and the element name of the element as the value;
  • the element name set construction unit 302 is configured to, for any configured candidate migration data table, match each table field in the candidate table against each primary key field in the primary key dictionary; if it is determined that at least one table field in the candidate table is identical to a primary key field of the same meaning in the primary key dictionary, determine the candidate table to be a data table to be migrated and the at least one table field to be a paired field, and, based on the at least one table field, obtain from the primary key dictionary the element names of the elements containing the same primary key fields to form an element name set;
  • the non-primary-key attribute field determination unit 303 is configured to determine each non-primary-key attribute field of the graph space according to the element name set and each non-primary-key attribute of the graph space;
  • the migration processing unit 304 is configured to, for any data table to be migrated, determine the data migration mapping relationship according to each non-paired field in the data table to be migrated and each non-primary-key attribute field of the graph space, and perform data migration; the data migration mapping relationship is used to migrate the data in the data table to be migrated, field by field, to the identical fields of the graph space; the non-paired fields are every table field in the data table to be migrated other than the paired fields.
  • the primary key dictionary construction unit 301 is specifically configured to: when the element of the graph space is a tag, use the primary key field of the tag as the key and the name of the tag as the value, wherein the names of different tags differ from one another, and the primary key field of any tag is consistent with the primary key field of the same meaning in the source database; and, when the element of the graph space is a directed edge, use the first concatenated field obtained by concatenating the primary key fields corresponding to the start point and the end point of the directed edge as the key, and the name of the directed edge as the value.
  • the element name set construction unit 302 is specifically configured to: for any configured candidate migration data table, match each table field in the candidate table against each primary key field in the primary key dictionary; if it is determined that only one primary key field in the primary key dictionary is identical to a table field in the candidate table, obtain, based on that primary key field, the element name of the element containing the primary key field from the primary key dictionary and add the element name to the element name set; if it is determined that at least two primary key fields in the primary key dictionary are identical, one to one, to the same number of table fields in the candidate table, concatenate the at least two primary key fields pairwise to obtain multiple second concatenated fields; for any second concatenated field, search for it among the primary key fields of the primary key dictionary, and if it is determined that the second concatenated field exists in the primary key dictionary, obtain, based on it, the element name of the element containing the second concatenated field from the primary key dictionary and add the element name to the element name set.
  • the non-primary-key attribute field determination unit 303 is specifically configured to: for any element name in the element name set, concatenate the element name one by one with each of its non-primary-key attributes in the graph space, and use each third concatenated field thus obtained as a non-primary-key attribute field of the graph space under that element name;
  • the migration processing unit 304 is specifically configured to: for any data table to be migrated, normalize each non-paired field in the data table to be migrated to obtain first fields; for each non-primary-key attribute field in the graph space, perform the same normalization to obtain second fields; and determine the data migration mapping relationship according to the first fields and the second fields.
  • the migration processing unit 304 is further configured to: for any first field among the first fields, determine, according to the data type of the field value of the first field, third fields of the same data type from among the second fields; for any third field, determine the field similarity between the third field and the first field according to a field similarity calculation method matched to the data type; and determine, according to the field similarity, whether to construct a data migration mapping relationship between the first field and the third field.
  • the device further includes a data consistency check unit 305, which is configured to: set a first Bloom filter for the source database and a second Bloom filter for the graph database respectively; according to the generation time of the data, write the data migrated out of the source database within a set time period into the first Bloom filter, and write the data written into the graph database within the set time period into the second Bloom filter; and determine, according to the first write result written into the first Bloom filter and the second write result written into the second Bloom filter, whether the data migration within the set time period is correct.
  • the first Bloom filter and the second Bloom filter are both N-layer Bloom filters, where any Bloom filter in a later layer is used to write the write result of the Bloom filter in the preceding layer; the data consistency check unit 305 is specifically configured to: compare the first write result written into the last layer of the first Bloom filter with the second write result written into the last layer of the second Bloom filter; and, if it is determined that the first write result is the same as the second write result, determine that the data migration within the set time period is correct.
  • the data consistency check unit 305 is further configured to: design the first-layer Bloom filter of the first Bloom filter and the first-layer Bloom filter of the second Bloom filter both in linked-list form; the linked-list form means that once the number of written records meets a set threshold, a new Bloom filter is added to the first layer of Bloom filters.
  • embodiments of the present application further provide a computing device, which may be a desktop computer, a portable computer, a smartphone, a tablet computer, a personal digital assistant (PDA), or the like.
  • the computing device may include a central processing unit (CPU), memory, input/output devices, and so on.
  • the input devices may include a keyboard, a mouse, a touch screen, and so on.
  • the output devices may include a display device, such as a liquid crystal display (LCD) or a cathode ray tube (CRT).
  • memory, which may include read-only memory (ROM) and random access memory (RAM), provides the processor with program instructions and data stored in the memory.
  • the memory may be used to store the program instructions of the data migration method.
  • the processor is configured to call the program instructions stored in the memory and execute the data migration method according to the obtained program.
  • FIG. 4 is a schematic diagram of a computing device provided by an embodiment of the present application.
  • the computing device includes:
  • a processor 401, configured to read the program in the memory 402 and execute the above data migration method;
  • the processor 401 may be a central processing unit (CPU), a network processor (NP), or a combination of a CPU and an NP; it may also be a hardware chip.
  • the above hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof.
  • the above PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.
  • the memory 402 is used to store one or more executable programs and may store data used by the processor 401 when performing operations.
  • the program may include program code, which includes computer operating instructions.
  • the memory 402 may include volatile memory, such as random-access memory (RAM); it may also include non-volatile memory, such as flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); the memory 402 may also include a combination of the above types of memory.
  • the memory 402 stores the following elements, executable modules or data structures, or a subset or extended set thereof:
  • operation instructions: including various operation instructions, used to implement various operations;
  • an operating system: including various system programs, used to implement various basic services and handle hardware-based tasks.
  • the bus 405 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like.
  • the bus can be divided into an address bus, a data bus, a control bus, and so on. For ease of presentation, only one thick line is used in FIG. 4, but this does not mean that there is only one bus or one type of bus.
  • the bus interface 404 may be a wired communication access port, a wireless bus interface, or a combination thereof, where the wired bus interface may be, for example, an Ethernet interface.
  • the Ethernet interface may be an optical interface, an electrical interface, or a combination thereof.
  • the wireless bus interface may be a WLAN interface.
  • embodiments of the present application further provide a computer-readable storage medium that stores computer-executable instructions, the computer-executable instructions being used to cause a computer to execute the data migration method.
  • embodiments of the present application may be provided as methods or computer program products. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
  • these computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means, the instruction means implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
  • these computer program instructions may also be loaded onto a computer or other programmable data processing device, causing a series of operating steps to be performed on the computer or other programmable device to produce computer-implemented processing, such that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.


Abstract

The present invention relates to the field of financial technology (Fintech) and discloses a data migration method and device. A primary key dictionary is constructed according to the elements of a graph space; the table fields of a candidate migration data table are matched against the primary key fields of the primary key dictionary; if it is determined that a table field is identical to a primary key field of the same meaning in the primary key dictionary, the candidate table is determined to be a data table to be migrated and the table field is determined to be a paired field, and, based on the table field, the element name of the element containing that primary key field is obtained from the primary key dictionary to form an element name set; the non-primary-key attribute fields of the graph space are determined according to the element name set and the non-primary-key attributes of the graph space; and a data migration mapping relationship is determined according to the non-paired fields in the data table to be migrated and the non-primary-key attribute fields of the graph space, and data migration is performed. On this basis, data can be migrated from a Hive database into a Nebula graph database simply, quickly, and accurately.

Description

A data migration method and device
Cross-reference to related applications
This application claims priority to Chinese patent application No. 202210693937.6, entitled "A data migration method and device", filed with the China National Intellectual Property Administration on June 19, 2022, the entire contents of which are incorporated herein by reference.
Technical Field
Embodiments of the present application relate to the field of financial technology (Fintech), and in particular to a data migration method and device.
Background
With the development of computer technology, more and more technologies (for example, big data, cloud computing, and blockchain) are being applied in the financial field, and the traditional financial industry is gradually transforming into financial technology. In financial technology scenarios, there is a large amount of data describing associations between users and enterprises or organizations, and such data is commonly stored in data tables in a Hive database. With further technical development, the Nebula graph database has emerged. Specifically, Nebula is an open-source, distributed, easily scalable native graph database; it uses vertices and edges as its basic storage units, can carry ultra-large-scale data sets of hundreds of billions of vertices and trillions of edges, and provides millisecond-level queries. It is a graph database designed to store huge graph networks and retrieve information from them, is suitable for storing all kinds of intrinsically related data, and has application scenarios in fraud detection, real-time recommendation, social networks, knowledge graphs, and other fields. Clearly, migrating data originally stored in a Hive database into a Nebula graph database facilitates efficient data queries.
At present, when migrating data stored in a Hive database into a Nebula graph database, the migration correspondence between the fields to be migrated out of the Hive database and the fields to be migrated into the Nebula graph database must be written into a configuration file in advance; only then can the data under the same field be migrated from the Hive database to the Nebula database. However, this data migration method has the following drawbacks:
First, the number of data tables in a Hive database is very large, usually several hundred or even more than a thousand, and each data table also has a large number of fields, on average 20 to 40. Second, some data tables in the Hive database may be associated with multiple Tags/Edge Types in the Nebula graph database (for example, suppose one data table in the Hive database simultaneously stores users' personal information, bank card information, device information, and company information; the personal information, bank card information, device information, and company information then need to be split and stored in four Tags in the Nebula graph database, and Edge Type relationships also need to be established between each pair of these four Tags). As a result, the data tables in the Hive database and the Tags/Edge Types in the Nebula graph database are cross-associated, with complex mapping relationships. In addition, data tables with the same meaning in different products and different systems of the Hive database may also differ in table structure; for example, for a user information table, some systems may store the user's device model while others may not. Consequently, any Tag or Edge Type in the Nebula graph database may correspond to multiple data tables in the Hive database, and a change to a Tag or Edge Type brings complex associated impacts. Finally, the table structures of the data tables in the Hive database also change over time, so changes to the Tags or Edge Types in the Nebula graph database are inevitable.
For all of the above reasons, when migrating data from a Hive database to a Nebula graph database at present, the task of writing the migration correspondence between source fields and destination fields into the configuration file is clearly time-consuming and error-prone.
Therefore, a data migration method is urgently needed that can migrate data from a Hive database into a Nebula graph database simply, quickly, and accurately.
Summary
The present application provides a data migration method and device for migrating data from a Hive database into a Nebula graph database simply, quickly, and accurately.
In a first aspect, an embodiment of the present application provides a data migration method, including: for a graph space in a graph database, constructing a primary key dictionary of the graph space according to multiple elements of the graph space, wherein the graph space is determined according to multiple data tables to be migrated in a source database, and for any element, the primary key dictionary is constructed with the primary key field of the element as the key and the element name of the element as the value; for any configured candidate migration data table, matching each table field in the candidate table against each primary key field in the primary key dictionary; if it is determined that at least one table field in the candidate table is identical to a primary key field of the same meaning in the primary key dictionary, determining the candidate table to be a data table to be migrated and the at least one table field to be a paired field, and, based on the at least one table field, obtaining from the primary key dictionary the element names of the elements containing the same primary key fields to form an element name set; determining each non-primary-key attribute field of the graph space according to the element name set and each non-primary-key attribute of the graph space; and, for any data table to be migrated, determining a data migration mapping relationship according to each non-paired field in the data table to be migrated and each non-primary-key attribute field of the graph space, and performing data migration; wherein the data migration mapping relationship is used to migrate the data in the data table to be migrated, field by field, to the identical fields of the graph space, and the non-paired fields are every table field in the data table to be migrated other than the paired fields.
In the above solution, by constructing the primary key dictionary of the graph space, the primary key fields of the data tables to be migrated in the Hive database can be matched automatically; and in order to further map the data of the other, non-primary-key fields of the data tables to be migrated into the Nebula graph database, the constructed primary key dictionary can also be used to automatically match the non-primary-key fields of the data tables to be migrated. This avoids the problem described in the background, namely the large amount of error-prone manual work required to configure the intricate field migration correspondences in a configuration file. On this basis, data can be migrated from the Hive database into the Nebula graph database simply, quickly, and accurately.
In one possible implementation, constructing the primary key dictionary of the graph space according to the multiple elements of the graph space includes: when the element of the graph space is a tag, using the primary key field of the tag as the key and the name of the tag as the value, wherein the names of different tags differ from one another, and the primary key field of any tag is consistent with the primary key field of the same meaning in the source database; and, when the element of the graph space is a directed edge, using the first concatenated field obtained by concatenating the primary key fields corresponding to the start point and the end point of the directed edge as the key, and the name of the directed edge as the value.
The above solution describes specifically how the primary key dictionary is composed. Because the primary key dictionary is the foundation on which the data migration method of the present application executes, accurately constructing the primary key dictionary of the graph space improves both the speed and the accuracy of data migration.
In one possible implementation, obtaining, based on the at least one table field, the element names of the elements containing the same primary key fields from the primary key dictionary to form the element name set includes: for any configured candidate migration data table, matching each table field in the candidate table against each primary key field in the primary key dictionary; if it is determined that only one primary key field in the primary key dictionary is identical to a table field in the candidate table, obtaining, based on that primary key field, the element name of the element containing the primary key field from the primary key dictionary, and adding the element name to the element name set; if it is determined that at least two primary key fields in the primary key dictionary are identical, one to one, to the same number of table fields in the candidate table, concatenating the at least two primary key fields pairwise to obtain multiple second concatenated fields; and, for any second concatenated field, searching for the second concatenated field among the primary key fields of the primary key dictionary, and, if it is determined that the second concatenated field exists in the primary key dictionary, obtaining, based on the second concatenated field, the element name of the element containing it from the primary key dictionary and adding the element name to the element name set.
The above solution describes specifically how the element name set is composed. Because the element names in the element name set are the basis on which the non-primary-key fields (that is, the non-paired fields) of the data tables to be migrated in the Hive database are subsequently matched against the non-primary-key attribute fields in the Nebula graph database, accurately constructing the element name set improves both the speed and the accuracy of data migration.
In one possible implementation, determining each non-primary-key attribute field of the graph space according to the element name set and each non-primary-key attribute of the graph space includes: for any element name in the element name set, concatenating the element name one by one with each of its non-primary-key attributes in the graph space, and using each third concatenated field thus obtained as a non-primary-key attribute field of the graph space under that element name. Determining, for any data table to be migrated, the data migration mapping relationship according to each non-paired field in the data table to be migrated and each non-primary-key attribute field of the graph space includes: for any data table to be migrated, normalizing each non-paired field in the data table to be migrated to obtain first fields; for each non-primary-key attribute field in the graph space, performing the same normalization to obtain second fields; and determining the data migration mapping relationship according to the first fields and the second fields.
The above solution describes specifically the technique for determining the non-primary-key attribute fields of the graph space and, once those fields are determined, the technique for matching the non-primary-key fields (that is, the non-paired fields) of the data table to be migrated against them by identical fields; in other words, the technique for determining the data migration mapping relationship. In this process, the data migration mapping relationship can be established accurately and automatically, so that, based on the established mapping relationship, both the speed and the accuracy of data migration are improved.
In one possible implementation, determining the data migration mapping relationship according to the first fields and the second fields includes: for any first field among the first fields, determining, according to the data type of the field value of the first field, third fields of the same data type from among the second fields; for any third field, determining the field similarity between the third field and the first field according to a field similarity calculation method matched to the data type; and determining, according to the field similarity, whether to construct a data migration mapping relationship between the first field and the third field.
The above solution further details the technique for deciding whether a data migration mapping relationship can be established between a non-primary-key field (that is, a non-paired field) of a data table to be migrated and a non-primary-key attribute field of the graph space. Because this technique takes into account the data types of the field values and computes field similarity only for fields of the same data type, it improves the efficiency of determining the data migration mapping relationship and, in view of this, the speed of data migration.
In one possible implementation, after performing data migration, the method further includes: setting a first Bloom filter for the source database and a second Bloom filter for the graph database, respectively; according to the generation time of the data, writing the data migrated out of the source database within a set time period into the first Bloom filter, and writing the data written into the graph database within the set time period into the second Bloom filter; and determining, according to the first write result written into the first Bloom filter and the second write result written into the second Bloom filter, whether the data migration within the set time period is correct.
As is well known, any data migration method requires the consistency of the data migration to be verified during execution; that is, it must be ensured that no data is lost during migration. At present, verification of data consistency during migration is handled at the granularity of individual records: for each record, all fields are read out and converted to JSON (JavaScript Object Notation), and a hash value is computed, so that consistency can be determined by comparing each record's hash value in the Hive database with its hash value in the Nebula graph database. In practice, however, a data table in the Hive database often needs to be split into multiple Tags/Edge Types in the Nebula graph database, or multiple compatible data tables need to be migrated into one and the same Tag/Edge Type; in those cases, the field structures of the Hive data table and the destination Tag/Edge Type differ. When this happens, the table structure of the Hive data table must be split, or the table structures of the multiple compatible data tables must be unified, before processing at the granularity of individual records becomes possible. Clearly, this consistency verification method is inefficient in execution. To address this, the above solution of the present application designs Bloom filters and verifies the data within a specified time period according to the data's generation time, so that data consistency can be verified even for data whose field structures differ, thereby overcoming the inefficiency of current data consistency verification.
In one possible implementation, the first Bloom filter and the second Bloom filter are both N-layer Bloom filters, where any Bloom filter in a later layer is used to write the write result of the Bloom filter in the preceding layer. Determining, according to the first write result written into the first Bloom filter and the second write result written into the second Bloom filter, whether the data migration within the set time period is correct includes: comparing the first write result written into the last layer of the first Bloom filter with the second write result written into the last layer of the second Bloom filter; and, if it is determined that the first write result is the same as the second write result, determining that the data migration within the set time period is correct.
In the above solution, by designing multi-layer Bloom filters and using the Bloom filter in each later layer to write the write result of the Bloom filter in the preceding layer, comparing the write results of the last layer achieves the technical effect of quickly verifying the data migrated out of the Hive database against the data migrated into the Nebula graph database.
In one possible implementation, N = 2, and the method further includes: designing the first-layer Bloom filter of the first Bloom filter and the first-layer Bloom filter of the second Bloom filter both in linked-list form, where the linked-list form means that once the number of written records meets a set threshold, a new Bloom filter is added to the first layer.
As more and more data is written into a Bloom filter, its false positive rate grows. To address this problem, the above solution of the present application records the number of records written into each Bloom filter, so that when that number reaches the designed threshold, a new Bloom filter is automatically used for subsequent writes, which reduces the probability of false positives during use.
In a second aspect, an embodiment of the present application provides a data migration device, including: a primary key dictionary construction unit, configured to, for a graph space in a graph database, construct a primary key dictionary of the graph space according to multiple elements of the graph space, wherein the graph space is determined according to multiple data tables to be migrated in a source database, and for any element, the primary key dictionary is constructed with the primary key field of the element as the key and the element name of the element as the value; an element name set construction unit, configured to, for any configured candidate migration data table, match each table field in the candidate table against each primary key field in the primary key dictionary, and, if it is determined that at least one table field in the candidate table is identical to a primary key field of the same meaning in the primary key dictionary, determine the candidate table to be a data table to be migrated, determine the at least one table field to be a paired field, and, based on the at least one table field, obtain from the primary key dictionary the element names of the elements containing the same primary key fields to form an element name set; a non-primary-key attribute field determination unit, configured to determine each non-primary-key attribute field of the graph space according to the element name set and each non-primary-key attribute of the graph space; and a migration processing unit, configured to, for any data table to be migrated, determine a data migration mapping relationship according to each non-paired field in the data table to be migrated and each non-primary-key attribute field of the graph space and perform data migration, wherein the data migration mapping relationship is used to migrate the data in the data table to be migrated, field by field, to the identical fields of the graph space, and the non-paired fields are every table field in the data table to be migrated other than the paired fields.
In a third aspect, an embodiment of the present application provides a computing device, including:
a memory, configured to store program instructions; and
a processor, configured to call the program instructions stored in the memory and execute any implementation of the first aspect according to the obtained program.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium storing computer-executable instructions, the computer-executable instructions being used to cause a computer to execute any implementation of the first aspect.
Brief Description of the Drawings
To explain the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic diagram of a data migration method provided by an embodiment of the present application;
FIG. 2 is a schematic diagram of a configuration file provided by an embodiment of the present application;
FIG. 3 is a schematic diagram of a data migration device provided by an embodiment of the present application;
FIG. 4 is a schematic diagram of a computing device provided by an embodiment of the present application.
Detailed Description
To make the objectives, technical solutions, and advantages of the present application clearer, the present application is described in further detail below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the scope of protection of the present application.
To address the time-consuming and error-prone nature of the current approach to migrating data from a Hive database to a Nebula graph database, specifically the task of writing the migration correspondence between source and destination fields into a configuration file, the present application proposes a data migration method. FIG. 1 is a schematic diagram of a data migration method provided by an embodiment of the present application; the method is executed by a data migration device. Referring to FIG. 1, the data migration method includes the following steps:
Step 101: for a graph space in a graph database, construct a primary key dictionary of the graph space according to multiple elements of the graph space; the graph space is determined according to multiple data tables to be migrated in a source database; for any element, the primary key dictionary is constructed with the primary key field of the element as the key and the element name of the element as the value.
Optionally, the graph database is a Nebula graph database and the source database is a Hive database. As an example, the data migration method of the present application is described below in terms of migrating data from a Hive database into a Nebula graph database.
A Hive database contains a large number of data tables. When migrating data from the Hive database to the Nebula graph database, one may choose to migrate the data of some of the tables or of all of the tables; the chosen subset, or the full set, of the Hive database's data tables then constitutes the multiple data tables to be migrated.
Table 1 below is an element information table of a graph space provided by an embodiment of the present application.
Table 1
| Element | Element name | Primary key field | Non-primary-key attributes |
| Tag | user | userId | username (string), nationality (string), age (int), phone (string) |
| Tag | organization | organizationId | address (string) |
| Edge Type | loan | userId (SRC_VID) + organizationId (DST_VID) | loan_date, loan_amount |
| Edge Type | occupy | userId (SRC_VID) + organizationId (DST_VID) | start_year, end_year |
Referring to Table 1, the element information table of the graph space contains four columns, which from left to right represent the element of the graph space, the element name, the element's primary key field, and the element's non-primary-key attributes. Specifically, of the four records in the first column of Table 1, the first and second records are tag elements (Tag), and the third and fourth records are edge type elements (Edge Type). Of the four records in the second column of Table 1, the first record indicates that the tag element's name is "user", the second that the tag element's name is "organization", the third that the edge type element's name is "loan", and the fourth that the edge type element's name is "occupy". The data in the first and second columns of Table 1 are determined from the existing elements of the Nebula graph database after analyzing the multiple data tables to be migrated in the Hive database. As for the data in the third and fourth columns, take the third-column and fourth-column data in the row of the "user" tag element as an example: after the tag element named "user" is identified, the primary key field and the non-primary-key attributes that technical personnel created in advance for this tag element according to business needs, when establishing it in the Nebula graph database, are retrieved. In other words, when technical personnel created the tag element named "user" in the Nebula graph database in advance, they also defined its primary key field as "userId" (user identity number) and its non-primary-key attributes as "username", "nationality", "age", and "phone". Here "username" denotes the user name and its value is of type string; "nationality" denotes nationality, of type string; "age" denotes age, of type int (a numeric type); and "phone" denotes the telephone number, of type string. Note that the present application does not further explain the data in the second, third, and fourth rows of Table 1.
In one possible implementation, constructing the primary key dictionary of the graph space according to the multiple elements of the graph space includes: when the element of the graph space is a tag, using the primary key field of the tag as the key and the name of the tag as the value, wherein the names of different tags differ from one another, and the primary key field of any tag is consistent with the primary key field of the same meaning in the source database; and, when the element of the graph space is a directed edge, using the first concatenated field obtained by concatenating the primary key fields corresponding to the start point and the end point of the directed edge as the key, and the name of the directed edge as the value.
Continuing with the element information of the graph space provided in Table 1, the primary key dictionary of the graph space can be constructed as follows:
Traverse all Tags and Edge Types in the graph space, where:
for each Tag, its primary key name is used as the key, the list formed from the Tag name is used as the value, and the pair is stored in the primary key dictionary;
for each Edge Type, the names of its SRC_VID and DST_VID are concatenated as strings, the concatenated field (which is the first concatenated field) is used as the key, the list formed from the Edge Type name is used as the value, and the pair is stored in the Nebula primary key dictionary.
Note that current database development standards stipulate that the primary key names of different Tags in the same graph space (space) of a Nebula graph database must not be identical, and must be consistent with the names of the primary key fields of the same meaning in the data tables of the Hive database.
Therefore, because the primary key names of different Tags cannot be identical, a value list keyed by a Tag primary key name contains exactly one list element. By contrast, a value list keyed by the string concatenation of an Edge Type's SRC_VID and DST_VID names may contain several different values.
Applying the primary key dictionary construction method described above to the element information of the graph space shown in Table 1 yields the following primary key dictionary:
{"userId": ["user"], "organizationId": ["organization"], "userIdorganizationId": ["loan", "occupy"]}
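A sketch of the dictionary construction described in this step; the input tuples are taken from Table 1, and the function name and input shapes are our own:

```python
def build_primary_key_dict(tags, edge_types):
    """Construct the graph space's primary key dictionary: for a Tag, the
    key is its primary key field; for an Edge Type, the key is the string
    concatenation of its SRC_VID and DST_VID primary key fields. Values
    are lists of element names, since several Edge Types may share a key."""
    pk_dict = {}
    for name, pk in tags:
        # Tag primary key names are unique within one graph space, so the
        # value list keyed by a Tag primary key holds exactly one name.
        pk_dict.setdefault(pk, []).append(name)
    for name, src_pk, dst_pk in edge_types:
        pk_dict.setdefault(src_pk + dst_pk, []).append(name)
    return pk_dict
```

Feeding in the Table 1 elements reproduces the dictionary shown above.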
Step 102: for any configured candidate migration data table, match each table field in the candidate table against each primary key field in the primary key dictionary; if it is determined that at least one table field in the candidate table is identical to a primary key field of the same meaning in the primary key dictionary, determine the candidate table to be a data table to be migrated and the at least one table field to be a paired field, and, based on the at least one table field, obtain from the primary key dictionary the element names of the elements containing the same primary key fields to form an element name set.
Optionally, the data migration method of the present application uses the Nebula Exchange tool and is completed on a Spark cluster.
Continuing the foregoing example, as to how the data in each data table to be migrated in the Hive database is correctly migrated into the Nebula graph database, specifically into a graph space of the Nebula graph database: once the primary key dictionary of the graph space has been obtained in step 101, the configuration file can be set up. The content to be configured includes the Spark-related configuration, the Nebula Graph-related configuration, the table names of the data tables to be migrated in the Hive database, and the name of the destination graph space of the Nebula graph database (the destination graph space being the graph space into which the data is to be migrated). In addition, how the non-primary-key table fields of the data tables to be migrated in the Hive database are aligned with the non-primary-key attribute fields of the destination graph space of the Nebula graph database also needs to be configured in the configuration file.
The primary key dictionary obtained in step 101 only establishes the correspondence between the primary key fields of the data tables to be migrated in the Hive database and the primary key fields of the destination graph space of the Nebula graph database. The remaining table fields of a data table to be migrated do not have exactly the same names as the fields of the Tags/Edge Types in the Nebula graph database, fields of the same meaning may have different names in different data tables to be migrated, and no field alignment relationship is specified in the configuration file. Therefore, the fields of the data tables to be migrated in the Hive database and the fields of the Tags/Edge Types of the corresponding graph space in the Nebula graph database need to be aligned, and the data migration mapping relationship produced by the field alignment is written into the configuration file, so that the data in the Hive database can subsequently be accurately migrated into the Nebula graph database by executing the configuration file.
Specifically, there are two types of data tables in the Hive database: entity information tables and entity relation tables. An entity information table may be a user information table, an institution information table, and so on; an entity relation table may be a user loan transaction table (describing the lending relationship between a user and an institution), and so on. The following describes, for these two different types of data tables, how the element name set is formed; once the element name set is available, the non-primary-key fields in the Hive database can subsequently be aligned with the non-primary-key attribute fields in the Nebula graph database automatically.
Case 1: entity information tables.
For any candidate migration data table configured in the configuration file, all of its table fields are searched against the primary key fields of the constructed primary key dictionary. If no corresponding primary key field is found for any table field, it is determined that configuring this candidate table in the configuration file was incorrect; that is, the candidate table is not a data table that needs to be migrated this time, i.e., it is not a data table to be migrated, and the next candidate table in the configuration file is then processed in the same way. If, after all candidate tables in the configuration file have been processed in this way, every one of them is determined not to be a data table to be migrated, a system exception is output and a prompt indicating an error in the configuration file is thrown. Apart from this situation, when all table fields of a candidate table are searched against the primary key fields of the constructed primary key dictionary, the following two search results may also occur:
Search result 1: exactly one primary key field is found, in which case the element name set can be determined as follows:
if it is determined that only one primary key field in the primary key dictionary is identical to a table field in the candidate table, obtain, based on that primary key field, the element name of the element containing the primary key field from the primary key dictionary, and add the element name to the element name set.
Search result 2: two or more primary key fields are found, in which case the element name set can be determined as follows:
if it is determined that at least two primary key fields in the primary key dictionary are identical, one to one, to the same number of table fields in the candidate table, concatenate the at least two primary key fields pairwise to obtain multiple second concatenated fields; for any second concatenated field, search for it among the primary key fields of the primary key dictionary, and, if it is determined that the second concatenated field exists in the primary key dictionary, obtain, based on it, the element name of the element containing the second concatenated field from the primary key dictionary and add the element name to the element name set.
An example follows:
Continuing the earlier example, suppose a candidate migration data table T whose table structure contains the following nine table fields:
"userId" (user identity number), "user_name" (user name), "age", "nationality", "telephone" (telephone number), "organizationId" (organization identity number), "org_address" (organization address), "loanDate" (loan date), "loanAmount" (loan amount).
All table fields are traversed and searched in the primary key dictionary of the graph space determined above, which is:
{"userId": ["user"], "organizationId": ["organization"], "userIdorganizationId": ["loan", "occupy"]}
The search returns "user" and "organization". Since the number of returned results is two or more (which suffices to show that the candidate table is in fact a data table to be migrated, that it stores several kinds of entity information at once, and that some intrinsic association exists between those entities — in this example, the lending relationship between users and institutions; after the data is imported into the Nebula graph database, these associations also need to be imported, that is, Edge Type relationships need to be built between the user Tag and the organization Tag), "userId" (user identity number) and "organizationId" (organization identity number) are concatenated as strings to obtain "userIdorganizationId" (this is the second concatenated field), and searching the above primary key dictionary for it returns "loan" and "occupy". Finally, the results of the two searches are merged to form the element name set ["user", "organization", "loan", "occupy"].
Case 2: entity relationship tables.
The handling is analogous to that of the entity information tables above and is not detailed in this application.
Step 103: determine the non-primary-key property fields of the graph space from the element name set and the non-primary-key properties of the graph space.
In one possible implementation, determining the non-primary-key property fields of the graph space from the element name set and the non-primary-key properties of the graph space includes: for any element name in the element name set, concatenating the element name one by one with each of that element's non-primary-key properties in the graph space, and taking the resulting third concatenated fields as the graph space's non-primary-key property fields under that element name.
Take the element name "user" from the element name set ["user", "organization", "loan", "occupy"] of the earlier example. The element name is concatenated with each of its non-primary-key properties in the graph space. Specifically, according to the element information of the graph space shown in Table 1, the non-primary-key properties of "user" are "username", "nationality", "age" and "phone", so concatenating "user" with these four properties yields four concatenated fields: "user.username", "user.nationality", "user.age" and "user.phone". Likewise, concatenating "organization" with its property "address" yields "organization.address"; concatenating "loan" with its properties "loan_date" and "loan_amount" yields "loan.loan_date" and "loan.loan_amount"; and concatenating "occupy" with its properties "start_year" and "end_year" yields "occupy.start_year" and "occupy.end_year". These concatenated fields are the third concatenated fields, and they are the graph space's non-primary-key property fields.
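Step 103's concatenation can be sketched as below; the `attrs` mapping hard-codes the Table 1 properties of the example as an assumption.

```python
# Non-primary-key properties per element, as listed in Table 1 of the example.
attrs = {
    "user": ["username", "nationality", "age", "phone"],
    "organization": ["address"],
    "loan": ["loan_date", "loan_amount"],
    "occupy": ["start_year", "end_year"],
}

def non_key_property_fields(element_names, attrs):
    # Join every element name with each of its non-primary-key properties
    # to form the "third concatenated fields".
    return [f"{e}.{a}" for e in element_names for a in attrs.get(e, [])]

fields = non_key_property_fields(["user", "organization", "loan", "occupy"], attrs)
# fields runs from 'user.username' to 'occupy.end_year' (9 fields in total).
```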
Step 104: for any to-be-migrated table, determine the data migration mapping from the table's unpaired fields and the graph space's non-primary-key property fields, and perform the data migration. The data migration mapping is used to migrate the data of the to-be-migrated table field by field into the same fields of the graph space; the unpaired fields are all table fields of the to-be-migrated table other than the paired fields.
Optionally, determining the data migration mapping for any to-be-migrated table from its unpaired fields and the graph space's non-primary-key property fields includes: for any to-be-migrated table, normalizing the table's unpaired fields to obtain first fields; normalizing the graph space's non-primary-key property fields in the same way to obtain second fields; and determining the data migration mapping from the first fields and the second fields.
For example, continuing the earlier example, the table fields of candidate table T other than the primary keys are:
"user_name", "age", "nationality", "telephone", "org_address", "loanDate", "loanAmount".
Converting all of these fields to lowercase snake_case yields: "user_name", "age", "nationality", "telephone", "org_address", "loan_date", "loan_amount". These are the first fields.
The non-primary-key property fields of the earlier example — "user.username", "user.nationality", "user.age", "user.phone", "organization.address", "loan.loan_date", "loan.loan_amount", "occupy.start_year", "occupy.end_year" — are likewise converted to lowercase snake_case, yielding: "user.username", "user.nationality", "user.age", "user.phone", "organization.address", "loan.loan_date", "loan.loan_amount", "occupy.start_year", "occupy.end_year". These are the second fields.
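The lowercase snake_case normalization described above amounts to a camelCase-to-snake_case conversion; a minimal sketch:

```python
import re

def normalize(name):
    # Insert "_" before an uppercase letter that follows a lowercase letter
    # or a digit, then lowercase everything: "loanDate" -> "loan_date".
    return re.sub(r"(?<=[a-z0-9])([A-Z])", r"_\1", name).lower()

normalized = [normalize(f) for f in
              ["user_name", "age", "nationality", "telephone",
               "org_address", "loanDate", "loanAmount"]]
# -> ['user_name', 'age', 'nationality', 'telephone',
#     'org_address', 'loan_date', 'loan_amount']
```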
In some implementations of this application, determining the data migration mapping from the first fields and the second fields includes: for any first field, determining, based on the data type of the first field's values, the third fields of the same data type among the second fields; for any third field, computing the field similarity between the third field and the first field with the similarity computation method matching that data type; and deciding, based on the field similarity, whether to build a data migration mapping between the first field and the third field.
During field alignment, both the similarity of the field names and the similarity of the field value distributions need to be computed:
(1) Field name similarity:
Let k denote the edit distance between two field names and let m1 and m2 denote their string lengths; this invention uses L = 1 − k/(2 × (m1 + m2)) as the similarity between the two field names (so identical names score 1, and the score decreases as the edit distance grows, consistent with the 0.8 threshold used below).
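A sketch of this name-similarity computation; the edit distance is a standard dynamic-programming Levenshtein implementation (not taken from the patent), and the similarity follows the formula above.

```python
def edit_distance(s, t):
    # Standard single-row dynamic-programming Levenshtein distance.
    dp = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        prev, dp[0] = dp[0], i
        for j, ct in enumerate(t, 1):
            # dp[j] still holds the previous row; prev holds dp_old[j-1].
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (cs != ct))
    return dp[len(t)]

def name_similarity(a, b):
    # L = 1 - k / (2 * (m1 + m2)); identical names give 1.0.
    k = edit_distance(a, b)
    return 1 - k / (2 * (len(a) + len(b)))

# "telephone" vs "phone": k = 4, so L = 1 - 4/28, which is about 0.857 (> 0.8).
```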
(2) Field value distribution similarity:
The data type of the field values must be distinguished first. Numeric, discrete and string types each have their own distribution similarity method, as follows:
i. Numeric: fields such as "age". The usual distribution similarity tests for numeric data are the t-test and the KS test. In practical migration scenarios some tables hold little data, and, because of product positioning, users tend to cluster around certain features rather than following a normal distribution. Unlike the t-test, the KS test does not need to know the data's distribution — it is a non-parametric test — so when the data does not follow a particular distribution the KS test is more sensitive than the t-test; being non-parametric, it also performs better at judging whether two small samples are similar. Based on this analysis, this invention uses the KS test to describe the distribution similarity of numeric field values. Computation: take the two fields' value sets, run the KS test, and obtain the similarity result.
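As a sketch, the two-sample KS statistic — the maximum gap between the two empirical distribution functions — can be computed directly; a production implementation would also derive a p-value (e.g. via scipy.stats.ks_2samp), which this simplified version omits.

```python
import bisect

def ks_statistic(xs, ys):
    # Two-sample Kolmogorov-Smirnov statistic: the largest vertical gap
    # between the two empirical CDFs, evaluated at every observed value.
    xs_s, ys_s = sorted(xs), sorted(ys)
    d = 0.0
    for v in sorted(set(xs) | set(ys)):
        fx = bisect.bisect_right(xs_s, v) / len(xs_s)
        fy = bisect.bisect_right(ys_s, v) / len(ys_s)
        d = max(d, abs(fx - fy))
    return d

# Identical samples give 0.0; fully separated samples give 1.0.
```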
ii. Discrete: such data cannot simply be deduplicated and compared for set equality or similarity — for the "nationality" field, for example, the overlap of values other than "China" across Hive tables may be very low. This invention therefore proceeds as follows: first compute the frequency of every value of the field, sort the values by frequency in descending order, take the values covering 75% of the total data volume as the positive set and the remaining values as the negative set (e.g., with a total of 10 records whose value frequencies are 7 A, 2 B and 1 C, sorting by frequency gives A,A,A,A,A,A,A,B,B,C; the values of the first 75% of the data form the positive set [A, B] and [C] forms the negative set). A value belonging to the positive set is treated as "the event occurs" and a value in the negative set as "the event does not occur", and a binomial test yields the event probability P. Finally, if the absolute difference between the probabilities P of the two fields' value sets is less than 0.05 (another constant may be chosen as appropriate), the two fields' distributions are considered similar.
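A sketch of the positive-set approach for discrete fields; here the event probability P is taken directly as the positive set's share of the data, which matches the worked example, and the 0.05 threshold is assumed configurable.

```python
from collections import Counter

def positive_event_prob(values, quantile=0.75):
    # Frequency-sort the values; values covering the top `quantile` of the
    # data form the positive set; P is the positive set's share of the data.
    counts = Counter(values)
    total = len(values)
    cum = 0
    for v, c in counts.most_common():
        if cum >= quantile * total:
            break
        cum += c
    return cum / total

def discrete_similar(a, b, eps=0.05):
    # Distributions are similar when the two probabilities differ by < eps.
    return abs(positive_event_prob(a) - positive_event_prob(b)) < eps

# For 7 A, 2 B, 1 C the positive set is {A, B} and P = 0.9.
```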
iii. String: fields such as "name", "mobile number", "ID card number" and "residential address". Although the values of such fields may vary widely in content, they are highly similar in prefix or suffix, so this invention uses a Trie to compute the distribution similarity of such values. Steps: build a Trie from all values of the field, find the maximum value length L, then for the first 0.3L levels of the Trie take the nodes of each level and judge whether the two trees are similar (the node sets of each level overlap by more than 0.9). Then take all values of the field again, build a Trie in the reverse direction, and run the same similarity computation. If the Tries built from the two fields' values are similar in either the forward or the reverse direction, the two fields' value distributions are considered similar.
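A simplified stand-in for the Trie comparison: instead of materializing an explicit Trie, it compares the character sets at each of the first 0.3·L positions, forward and reversed. The set-overlap criterion approximates the per-level node comparison described above, so treat it as a sketch rather than the patent's exact algorithm.

```python
def level_sets(values, depth):
    # Characters observed at each of the first `depth` positions;
    # each set plays the role of one Trie level's node set.
    return [{v[i] for v in values if len(v) > i} for i in range(depth)]

def trie_similar(a_vals, b_vals, depth_ratio=0.3, overlap=0.9):
    max_len = max(len(v) for v in a_vals + b_vals)
    depth = max(1, int(depth_ratio * max_len))
    for la, lb in zip(level_sets(a_vals, depth), level_sets(b_vals, depth)):
        union = la | lb
        if union and len(la & lb) / len(union) < overlap:
            return False
    return True

def prefix_suffix_similar(a_vals, b_vals):
    # Similar if either the forward (prefix) or the reversed (suffix)
    # comparison agrees on the upper levels.
    rev = lambda vs: [v[::-1] for v in vs]
    return trie_similar(a_vals, b_vals) or trie_similar(rev(a_vals), rev(b_vals))
```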
Finally, the field alignment results fall into the following cases:
i. If the field names are at least 8 characters long and identical, the two fields are aligned and merged. In the example, "nationality", "loan_date" and "loan_amount" in the source field set align with "nationality", "loan_date" and "loan_amount" in the destination field set;
ii. If the field names are shorter than 8 characters and identical, and the value distributions are similar, the two fields are aligned and merged. In the example, "age" in the source field set aligns with "age" in the destination field set;
iii. If the field names are longer than 8 characters, their name similarity exceeds 0.8, and the value distributions are similar, the two fields are aligned and merged. In the example, "user_name", "telephone" and "org_address" in the source field set align with "username", "phone" and "address" in the destination field set;
iv. In all other cases a source field is considered not to align with any destination field and is skipped.
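The four alignment rules above can be sketched as a single decision function; `sim` is the field-name similarity and `dist_similar` is the outcome of the type-specific distribution test.

```python
def aligned(src_name, dst_name, sim, dist_similar):
    # Rule i: long (>= 8 chars) identical names align unconditionally.
    if len(src_name) >= 8 and src_name == dst_name:
        return True
    # Rule ii: short identical names also need similar value distributions.
    if len(src_name) < 8 and src_name == dst_name and dist_similar:
        return True
    # Rule iii: long names with similarity > 0.8 and similar distributions.
    if len(src_name) > 8 and sim > 0.8 and dist_similar:
        return True
    # Rule iv: everything else does not align.
    return False
```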
Figure 2 is a schematic diagram of a configuration file provided by an embodiment of this application; the content framed by the rectangles is the alignment information between the unpaired fields of the to-be-migrated tables and the non-primary-key property fields of the graph space, i.e., the data migration mapping.
Finally, once the data migration mapping of Figure 2 is obtained, Spark reads the table data from the Hive database and imports it into the Nebula graph database as follows:
a. In the Spark program's main function, create a SparkSession object. Specify the application name of this import via appName and call enableHiveSupport to enable querying Hive tables.
b. Read the source Hive table information, obtain the source table fields, and compose the mapping SQL according to the field alignment between the source Hive table and the destination Tag/Edge Type in the configuration file:
1) Heterogeneous data mapping for migrating Hive table data to a Tag:
The Nebula syntax for creating a Tag is:
CREATE TAG [IF NOT EXISTS] <tag_name> (<prop_name1> <data_type1> [, <prop_name2> <data_type2> ...])
The syntax for writing data into a Tag is:
INSERT VERTEX [IF NOT EXISTS] <tag_name> (<prop_name_list>) VALUES <vid>: (<prop_value_list>)
Following the syntax for writing data into a Tag, we define the mapping from Hive table data to a Nebula Tag as:
SELECT VID, PROP_NAME_LIST FROM SRC_TABLE
The mapping requires that the field names, field order, field count and field types in the Tag-creation statement agree with the fields appearing in the SELECT statement, where VID is the primary key of a specific vertex and PROP_NAME_LIST holds the Tag's other property values.
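A sketch of composing the Tag mapping SQL from an alignment; `tag_insert_sql` and its arguments are illustrative names, not the patent's actual code.

```python
def tag_insert_sql(vid_field, prop_map, src_table):
    # prop_map: Nebula property -> source Hive column, kept in the same
    # order as the Tag definition so the mapping requirement holds.
    cols = ", ".join(f"{src} AS {dst}" for dst, src in prop_map.items())
    return f"SELECT {vid_field} AS VID, {cols} FROM {src_table}"

sql = tag_insert_sql("userId",
                     {"username": "user_name", "age": "age"},
                     "hive_user_table")
# -> "SELECT userId AS VID, user_name AS username, age AS age FROM hive_user_table"
```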
2) Heterogeneous data mapping for migrating Hive table data to an Edge Type:
The Nebula syntax for creating an Edge Type is:
CREATE EDGE [IF NOT EXISTS] <edge_type_name> (<prop_name1> <data_type1> [, ...])
The syntax for writing data into an edge is:
INSERT EDGE [IF NOT EXISTS] <edge_type_name> (<prop_name_list>) VALUES <src_vid> -> <dst_vid>[@<rank>]: (<prop_value_list>) [, ...];
Following the syntax for writing data into an Edge Type, we define the mapping from Hive table data to a Nebula Edge Type as:
SELECT SRC_VID, DST_VID, RANK, PROP_NAME_LIST FROM SRC_TABLE
The mapping requires that the field names, field order, field count and field types in the Edge-Type-creation statement agree with the fields appearing in the SELECT statement. Because an Edge Type is directed, a start point SRC_VID and an end point DST_VID are needed. Rank is an edge property specific to Nebula, used to distinguish data whose Edge Type, start point and end point are all identical; it can be a time-related field or any other meaningful field. PROP_NAME_LIST holds the Edge Type's other property values.
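Similarly, a sketch of composing the Edge Type mapping SQL; the helper and argument names are illustrative assumptions.

```python
def edge_insert_sql(src_vid, dst_vid, rank_field, prop_map, src_table):
    # prop_map: Nebula edge property -> source Hive column, in definition order.
    cols = ", ".join(f"{src} AS {dst}" for dst, src in prop_map.items())
    rank = f"{rank_field} AS RANK, " if rank_field else ""
    return (f"SELECT {src_vid} AS SRC_VID, {dst_vid} AS DST_VID, "
            f"{rank}{cols} FROM {src_table}")

sql = edge_insert_sql("userId", "organizationId", "loanDate",
                      {"loan_date": "loanDate", "loan_amount": "loanAmount"},
                      "hive_loan_table")
```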
Using the Tag and Edge Type mapping logic obtained above, call spark.sql to write the data.
In the scheme above, building the graph space's primary key dictionary enables automatic matching of the primary key fields of the to-be-migrated tables in the Hive database; and, so that the data of the non-primary-key fields of the to-be-migrated tables can also be mapped into the Nebula graph database, the same dictionary enables automatic matching of those non-primary-key fields as well. This avoids the heavy and error-prone manual work, described in the background, of configuring intricate field migration correspondences in a configuration file, and allows data to be migrated from the Hive database into the Nebula graph database simply, quickly and accurately.
As is well known, any data migration method must verify the consistency of the migration as it runs, i.e., ensure that no data is lost in transit; the data migration method of this application is no exception. This application therefore proposes the following data consistency check:
After the data migration, set up a first Bloom filter for the source database and a second Bloom filter for the graph database; by data generation time, write the data migrated out of the source database within a set time window into the first Bloom filter, and write the data written into the graph database within the same window into the second Bloom filter; and determine, from the first write result of the first Bloom filter and the second write result of the second Bloom filter, whether the data migration within the window is correct.
In some implementations of this application, the first and second Bloom filters are both Bloom filters of an N-layer design, where any later layer writes the write result of the preceding layer. Determining whether the migration within the set window is correct from the first and second write results includes: comparing the first write result of the last layer of the first Bloom filter with the second write result of the last layer of the second Bloom filter; if the two are determined to be identical, the migration within the window is deemed correct.
In some implementations, N = 2, and the method further includes designing the first layer of both the first and second Bloom filters as a linked list: once the number of written data items reaches a set threshold, a new Bloom filter is appended to the first layer.
Specifically, four BloomFilterList instances are first created from the BloomFilter class: two hold the consistency check data of the to-be-migrated tables in the Hive database, and two hold the consistency check data of the Nebula graph database.
Based on the data's creation time (the create_time field) or modification time (update_time), the data of the same specified time window is extracted from both the Hive database and the Nebula graph database; the window can be minute-sized. For example, all fields of all data of the first minute are written into the first-layer Bloom filter; once the first minute's data has all been written, the first layer's binary bit string is written into the second-layer Bloom filter, after which the first layer's state is cleared. The state of the Hive database's Bloom filter (specifically the second layer) is then compared with the state of the Nebula graph database's Bloom filter (specifically the second layer). If they match, the migration is correct; otherwise the migration is faulty and all data of that time window must be migrated again. The next minute's data is then processed, until all data has been handled. This doubly compressed Bloom filter greatly reduces the volume of data stored and the server memory consumed by the consistency check.
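The two-layer check can be sketched as follows. `bloom_bits` is a deliberately tiny Bloom filter (fixed 1024 bits, 3 SHA-256-derived hash positions) standing in for the real one; the point is that the per-window comparison is order-independent and only the compressed layer-2 state is compared.

```python
import hashlib

def bloom_bits(items, m=1024, k=3):
    # Minimal Bloom filter: return the filter's bit string for a batch of items.
    bits = [0] * m
    for item in items:
        for i in range(k):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            bits[int(h, 16) % m] = 1
    return "".join(map(str, bits))

def window_consistent(src_rows, dst_rows):
    # Layer 1: hash the window's rows into a filter; layer 2: compress the
    # layer-1 bit string into a second filter; compare the layer-2 states.
    return bloom_bits([bloom_bits(src_rows)]) == bloom_bits([bloom_bits(dst_rows)])
```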
Moreover, the false positive rate of a traditional Bloom filter grows as data keeps being written. Here the traditional Bloom filter is modified to support automatic expansion, so that the false positive rate does not increase significantly as the number of stored elements grows. The specific approach for lowering the false positive rate during the consistency check is as follows:
The BloomFilter class contains two attributes: LinkedList&lt;BloomFilterEntity&gt; bloomFilterList and String bloomFilterName. The bloomFilterList attribute wraps BloomFilterEntity instances in a LinkedList (a doubly linked list) and is the core of the filter's automatic expansion; bloomFilterName is the defined name of the Bloom filter. The core methods wrapped by the BloomFilter class are mightContain, put and mightContainAndPut; their implementation logic is described below.
a) mightContain determines whether a string is in the Bloom filter. If bloomFilterList is empty, it returns false directly; otherwise it traverses every BloomFilterEntity in bloomFilterList and checks whether that entity contains the given data, returning true if any entity does and false otherwise.
b) put is the core of the automatically expanding Bloom filter. It first checks whether bloomFilterList is empty; if so, it creates a new BloomFilterEntity and appends it to the end of bloomFilterList. Thanks to the doubly linked list, the last Bloom filter entity, lastBloomFilterEntity, can be fetched in O(1) time. Data is then added to lastBloomFilterEntity; once its count attribute (the amount of data the filter records) reaches a configurable threshold m — say 10,000 records — a new Bloom filter is created and appended to the end of bloomFilterList.
c) mightContainAndPut strengthens mightContain: when the given data is not in the Bloom filter, it adds the data to it. Its logic is to call mightContain first and, if that returns false, to call the put method above.
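A sketch of the auto-expanding filter's logic. The per-entity filter is replaced by a Python set (exact membership) so that the expansion behavior — a new entity once the tail reaches capacity — is easy to see; a real implementation would use actual Bloom filter entities as described above.

```python
class BloomFilterEntity:
    # Stand-in for a real Bloom filter entity: a set gives exact membership,
    # which is enough to demonstrate the expansion logic.
    def __init__(self):
        self.items = set()
    @property
    def count(self):
        return len(self.items)
    def might_contain(self, x):
        return x in self.items
    def put(self, x):
        self.items.add(x)

class AutoExpandingBloomFilter:
    def __init__(self, name, capacity=10000):
        self.name = name            # bloomFilterName
        self.capacity = capacity    # threshold m per entity
        self.filters = []           # plays the role of LinkedList<BloomFilterEntity>
    def might_contain(self, x):
        return any(f.might_contain(x) for f in self.filters)
    def put(self, x):
        # Append a new entity when the list is empty or the tail is full,
        # so each entity's error rate stays bounded.
        if not self.filters or self.filters[-1].count >= self.capacity:
            self.filters.append(BloomFilterEntity())
        self.filters[-1].put(x)
    def might_contain_and_put(self, x):
        # Returns whether x was already present; inserts it if it was not.
        if not self.might_contain(x):
            self.put(x)
            return False
        return True
```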
During the migration, a BloomFilter object is created and mightContainAndPut is called to write data into it. After the migration, the Nebula data is read back and the BloomFilter object's mightContain method is called for each record; if every call returns true, the Nebula record exists in the Bloom filter and the data is intact. When a call returns false, the migration of that record went wrong and the record is printed.
Based on the same idea, an embodiment of this application provides a data migration apparatus. Figure 3 is a schematic diagram of the apparatus, which includes a primary key dictionary construction unit 301, an element name set construction unit 302, a non-primary-key property field determination unit 303 and a migration processing unit 304.
The primary key dictionary construction unit 301 is configured to build, for a graph space in a graph database, the graph space's primary key dictionary from multiple elements of the graph space; the graph space is determined from multiple to-be-migrated data tables in a source database; for any element, the primary key dictionary is built with the element's primary key field as the key and the element's name as the value.
The element name set construction unit 302 is configured to match, for any configured candidate table, the table fields of the candidate table against the primary key fields of the primary key dictionary; if at least one table field of the candidate table is determined to be identical to a primary key field of the same meaning in the dictionary, determine the candidate table to be a to-be-migrated table, determine the at least one table field to be paired fields, and obtain from the dictionary, based on the at least one table field, the element names of the elements containing those primary key fields to form the element name set.
The non-primary-key property field determination unit 303 is configured to determine the graph space's non-primary-key property fields from the element name set and the graph space's non-primary-key properties.
The migration processing unit 304 is configured to determine, for any to-be-migrated table, the data migration mapping from the table's unpaired fields and the graph space's non-primary-key property fields, and perform the data migration; the data migration mapping is used to migrate the table's data field by field into the same fields of the graph space; the unpaired fields are all table fields of the table other than the paired fields.
Further, in the apparatus, the primary key dictionary construction unit 301 is specifically configured to: when an element of the graph space is a Tag, use the Tag's primary key field as the key and the Tag's name as the value, where different Tags have different names and any Tag's primary key field agrees with the primary key field of the same meaning in the source database; and when an element of the graph space is a directed edge, use as the key the first concatenated field formed by concatenating the primary key fields corresponding to the edge's start point and end point, and use the edge's name as the value.
Further, in the apparatus, the element name set construction unit 302 is specifically configured to: for any configured candidate table, match the candidate table's table fields against the primary key fields of the primary key dictionary; if exactly one primary key field in the dictionary is identical to one table field of the candidate table, obtain from the dictionary, based on that primary key field, the element name of the element containing it and add the element name to the element name set; if at least two primary key fields in the dictionary are identical, one to one, to the same number of table fields of the candidate table, concatenate the at least two primary key fields pairwise to obtain multiple second concatenated fields; and, for any second concatenated field, search for it among the dictionary's primary key fields, and if it exists in the dictionary, obtain from the dictionary the element name of the element containing it and add the element name to the element name set.
Further, in the apparatus, the non-primary-key property field determination unit 303 is specifically configured to: for any element name in the element name set, concatenate the element name one by one with each of that element's non-primary-key properties in the graph space, and take the resulting third concatenated fields as the graph space's non-primary-key property fields under that element name. The migration processing unit 304 is specifically configured to: for any to-be-migrated table, normalize the table's unpaired fields to obtain first fields; normalize the graph space's non-primary-key property fields in the same way to obtain second fields; and determine the data migration mapping from the first fields and the second fields.
Further, in the apparatus, the migration processing unit 304 is also configured to: for any first field, determine, based on the data type of the first field's values, the third fields of the same data type among the second fields; for any third field, compute the field similarity between the third field and the first field with the similarity computation method matching that data type; and decide, based on the field similarity, whether to build a data migration mapping between the first field and the third field.
Further, the apparatus also includes a data consistency check unit 305, configured to: set up a first Bloom filter for the source database and a second Bloom filter for the graph database; by data generation time, write the data migrated out of the source database within a set time window into the first Bloom filter and the data written into the graph database within the same window into the second Bloom filter; and determine, from the first write result of the first Bloom filter and the second write result of the second Bloom filter, whether the data migration within the window is correct.
Further, in the apparatus, the first and second Bloom filters are both Bloom filters of an N-layer design, where any later layer writes the write result of the preceding layer; the data consistency check unit 305 is specifically configured to compare the first write result of the last layer of the first Bloom filter with the second write result of the last layer of the second Bloom filter, and to deem the migration within the window correct if the two write results are determined to be identical.
Further, in the apparatus, N = 2, and the data consistency check unit 305 is also configured to design the first layer of both the first and second Bloom filters as a linked list: once the number of written data items reaches a set threshold, a new Bloom filter is appended to the first layer.
An embodiment of this application also provides a computing device, which may specifically be a desktop computer, a portable computer, a smartphone, a tablet, a personal digital assistant (PDA), etc. The computing device may include a central processing unit (CPU), memory, and input/output devices; input devices may include a keyboard, a mouse or a touchscreen, and output devices may include a display such as a liquid crystal display (LCD) or a cathode ray tube (CRT).
The memory may include read-only memory (ROM) and random access memory (RAM) and provides the processor with the program instructions and data it stores. In embodiments of this application, the memory may store the program instructions of the data migration method.
The processor is configured to call the program instructions stored in the memory and execute the data migration method according to the obtained program.
Figure 4 is a schematic diagram of a computing device provided by an embodiment of this application. The computing device includes:
a processor 401, memory 402, a transceiver 403 and a bus interface 404, where the processor 401, the memory 402 and the transceiver 403 are connected by a bus 405;
the processor 401 is configured to read the program in the memory 402 and execute the data migration method above.
The processor 401 may be a central processing unit (CPU), a network processor (NP) or a combination of the two. It may also be a hardware chip: an application-specific integrated circuit (ASIC), a programmable logic device (PLD) or a combination thereof. The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL) or any combination thereof.
The memory 402 is configured to store one or more executable programs and may store the data used by the processor 401 in its operations.
Specifically, a program may include program code comprising computer operation instructions. The memory 402 may include volatile memory such as random-access memory (RAM); it may also include non-volatile memory such as flash memory, a hard disk drive (HDD) or a solid-state drive (SSD); it may also include combinations of the above kinds of memory.
The memory 402 stores the following elements — executable modules or data structures, or subsets or extended sets of them:
operation instructions: various operation instructions used to implement various operations;
an operating system: various system programs used to implement basic services and handle hardware-based tasks.
The bus 405 may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, etc. A bus can be divided into an address bus, a data bus and a control bus. For ease of representation only one thick line is drawn in Figure 4, which does not mean there is only one bus or only one type of bus.
The bus interface 404 may be a wired communication port, a wireless bus interface or a combination of the two; a wired bus interface may, for example, be an Ethernet interface, which may be optical, electrical or a combination of the two; a wireless bus interface may be a WLAN interface.
An embodiment of this application also provides a computer-readable storage medium storing computer-executable instructions that cause a computer to execute the data migration method.
Those skilled in the art should understand that embodiments of this application may be provided as a method or a computer program product. This application may therefore take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, this application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM and optical storage) containing computer-usable program code.
This application is described with reference to flowcharts and/or block diagrams of methods, devices (systems) and computer program products according to its embodiments. It should be understood that every flow and/or block of the flowcharts and/or block diagrams, and combinations of flows and/or blocks, can be implemented by computer program instructions. These computer program instructions can be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to work in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus that implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, such that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing, so that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Although preferred embodiments of this application have been described, those skilled in the art may make additional changes and modifications to these embodiments once they learn of the basic inventive concept. The appended claims are therefore intended to be construed as covering the preferred embodiments and all changes and modifications falling within the scope of this application.
Obviously, those skilled in the art can make various changes and variations to this application without departing from its spirit and scope. If these modifications and variations of this application fall within the scope of its claims and their technical equivalents, this application is intended to include them as well.

Claims (11)

  1. A data migration method, comprising:
    for a graph space in a graph database, building a primary key dictionary of the graph space from multiple elements of the graph space, wherein the graph space is determined from multiple to-be-migrated data tables in a source database, and wherein, for any element, the primary key dictionary is built with the element's primary key field as the key and the element's name as the value;
    for any configured candidate table, matching the table fields of the candidate table against the primary key fields of the primary key dictionary; if at least one table field of the candidate table is determined to be identical to a primary key field of the same meaning in the primary key dictionary, determining the candidate table to be a to-be-migrated table and the at least one table field to be paired fields, and obtaining from the primary key dictionary, based on the at least one table field, the element names of the elements containing those primary key fields to form an element name set;
    determining the non-primary-key property fields of the graph space from the element name set and the non-primary-key properties of the graph space;
    for any to-be-migrated table, determining a data migration mapping from the table's unpaired fields and the graph space's non-primary-key property fields and performing data migration, wherein the data migration mapping is used to migrate the data of the to-be-migrated table field by field into the same fields of the graph space, and the unpaired fields are all table fields of the to-be-migrated table other than the paired fields.
  2. The method of claim 1, wherein building the primary key dictionary of the graph space from the multiple elements of the graph space comprises:
    when an element of the graph space is a Tag, using the Tag's primary key field as the key and the Tag's name as the value, wherein different Tags have different names, and any Tag's primary key field agrees with the primary key field of the same meaning in the source database;
    when an element of the graph space is a directed edge, using as the key a first concatenated field formed by concatenating the primary key fields corresponding to the directed edge's start point and end point, and using the directed edge's name as the value.
  3. The method of claim 2, wherein obtaining, based on the at least one table field, the element names of the elements containing those primary key fields from the primary key dictionary to form the element name set comprises:
    for any configured candidate table, matching the table fields of the candidate table against the primary key fields of the primary key dictionary; if exactly one primary key field in the primary key dictionary is determined to be identical to one table field of the candidate table, obtaining from the primary key dictionary, based on that primary key field, the element name of the element containing it and adding the element name to the element name set;
    if at least two primary key fields in the primary key dictionary are determined to be identical, one to one, to the same number of table fields of the candidate table, concatenating the at least two primary key fields pairwise to obtain multiple second concatenated fields; for any second concatenated field, searching for it among the primary key fields of the primary key dictionary, and, if it is determined to exist in the primary key dictionary, obtaining from the dictionary the element name of the element containing it and adding the element name to the element name set.
  4. The method of claim 3, wherein determining the non-primary-key property fields of the graph space from the element name set and the non-primary-key properties of the graph space comprises:
    for any element name in the element name set, concatenating the element name one by one with each of that element's non-primary-key properties in the graph space, and taking the resulting third concatenated fields as the graph space's non-primary-key property fields under that element name;
    and wherein determining, for any to-be-migrated table, the data migration mapping from the table's unpaired fields and the graph space's non-primary-key property fields comprises:
    for any to-be-migrated table, normalizing the table's unpaired fields to obtain first fields;
    normalizing the graph space's non-primary-key property fields in the same way to obtain second fields;
    determining the data migration mapping from the first fields and the second fields.
  5. The method of claim 4, wherein determining the data migration mapping from the first fields and the second fields comprises:
    for any first field, determining, based on the data type of the first field's values, third fields of the same data type among the second fields;
    for any third field, determining the field similarity between the third field and the first field with the field similarity computation method matching that data type; and deciding, based on the field similarity, whether to build a data migration mapping between the first field and the third field.
  6. The method of claim 1, wherein, after the data migration, the method further comprises:
    setting up a first Bloom filter for the source database and a second Bloom filter for the graph database;
    by data generation time, writing the data migrated out of the source database within a set time window into the first Bloom filter, and writing the data written into the graph database within the set time window into the second Bloom filter;
    determining, from the first write result of the first Bloom filter and the second write result of the second Bloom filter, whether the data migration within the set time window is correct.
  7. The method of claim 6, wherein the first Bloom filter and the second Bloom filter are both Bloom filters of an N-layer design, any later-layer Bloom filter writing the write result of the preceding-layer Bloom filter;
    and wherein determining, from the first write result of the first Bloom filter and the second write result of the second Bloom filter, whether the data migration within the set time window is correct comprises:
    comparing the first write result written into the last layer of the first Bloom filter with the second write result written into the last layer of the second Bloom filter;
    if the first write result and the second write result are determined to be identical, determining that the data migration within the set time window is correct.
  8. The method of claim 7, wherein N = 2;
    the method further comprises:
    designing the first-layer Bloom filter of both the first Bloom filter and the second Bloom filter as a linked list, the linked-list form meaning that, once the number of written data items reaches a set threshold, a new Bloom filter is added to the first layer.
  9. A data migration apparatus, comprising:
    a primary key dictionary construction unit, configured to build, for a graph space in a graph database, the graph space's primary key dictionary from multiple elements of the graph space, wherein the graph space is determined from multiple to-be-migrated data tables in a source database, and wherein, for any element, the primary key dictionary is built with the element's primary key field as the key and the element's name as the value;
    an element name set construction unit, configured to match, for any configured candidate table, the table fields of the candidate table against the primary key fields of the primary key dictionary; if at least one table field of the candidate table is determined to be identical to a primary key field of the same meaning in the primary key dictionary, determine the candidate table to be a to-be-migrated table and the at least one table field to be paired fields, and obtain from the primary key dictionary, based on the at least one table field, the element names of the elements containing those primary key fields to form an element name set;
    a non-primary-key property field determination unit, configured to determine the graph space's non-primary-key property fields from the element name set and the graph space's non-primary-key properties;
    a migration processing unit, configured to determine, for any to-be-migrated table, the data migration mapping from the table's unpaired fields and the graph space's non-primary-key property fields and perform data migration, wherein the data migration mapping is used to migrate the data of the to-be-migrated table field by field into the same fields of the graph space, and the unpaired fields are all table fields of the to-be-migrated table other than the paired fields.
  10. A computer device, comprising:
    a memory configured to store a computer program;
    a processor configured to call the computer program stored in the memory and execute the method of any one of claims 1-8 according to the obtained program.
  11. A computer-readable storage medium, wherein the storage medium stores computer-executable instructions configured to cause a computer to execute the method of any one of claims 1-8.
PCT/CN2022/127665 2022-06-19 2022-10-26 Data migration method and apparatus WO2023245941A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210693937.6 2022-06-19
CN202210693937.6A CN115858487A (zh) 2022-06-19 Data migration method and apparatus

Publications (1)

Publication Number Publication Date
WO2023245941A1 true WO2023245941A1 (zh) 2023-12-28

Family

ID=85660205

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/127665 WO2023245941A1 (zh) 2022-06-19 2022-10-26 Data migration method and apparatus

Country Status (2)

Country Link
CN (1) CN115858487A (zh)
WO (1) WO2023245941A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118132540A (zh) * 2024-05-08 2024-06-04 杭州悦数科技有限公司 一种实现在线图数据库迁移的方法及装置

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116737870B (zh) * 2023-08-09 2023-10-27 北京国电通网络技术有限公司 上报信息存储方法、装置、电子设备和计算机可读介质

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160092527A1 (en) * 2014-09-30 2016-03-31 Bitnine Co., Ltd. Data processing apparatus and data mapping method thereof
CN105930361A (zh) * 2016-04-12 2016-09-07 北京恒冠网络数据处理有限公司 一种关系型数据库向Neo4j模型转换和数据迁移方法
US20200201909A1 (en) * 2015-09-11 2020-06-25 Entit Software Llc Graph database and relational database mapping



Also Published As

Publication number Publication date
CN115858487A (zh) 2023-03-28

Similar Documents

Publication Publication Date Title
WO2023245941A1 (zh) 一种数据迁移方法及装置
US11194779B2 (en) Generating an index for a table in a database background
US20200050968A1 (en) Interactive interfaces for machine learning model evaluations
US8892525B2 (en) Automatic consistent sampling for data analysis
US9721009B2 (en) Primary and foreign key relationship identification with metadata analysis
WO2021068547A1 (zh) 日志模板提取方法及装置
TW202029079A (zh) 異常群體識別方法及裝置
US9135647B2 (en) Methods and systems for flexible and scalable databases
US20180253653A1 (en) Rich entities for knowledge bases
WO2019161645A1 (zh) 基于Shell的数据表提取方法、终端、设备及存储介质
US9600559B2 (en) Data processing for database aggregation operation
CN112052138A (zh) 业务数据质量检测方法、装置、计算机设备及存储介质
US10762068B2 (en) Virtual columns to expose row specific details for query execution in column store databases
JP7153420B2 (ja) データベース中にグラフ情報を記憶するためのb木使用
WO2020259325A1 (zh) 一种适用于机器学习的特征处理方法及装置
US20180307722A1 (en) Pattern mining method, high-utility itemset mining method, and related device
WO2020233347A1 (zh) 工作流管理系统的测试方法、装置、存储介质及终端设备
CN113220710B (zh) 数据查询方法、装置、电子设备以及存储介质
US9619458B2 (en) System and method for phrase matching with arbitrary text
US11520763B2 (en) Automated optimization for in-memory data structures of column store databases
US11720563B1 (en) Data storage and retrieval system for a cloud-based, multi-tenant application
US11847121B2 (en) Compound predicate query statement transformation
US9659059B2 (en) Matching large sets of words
CN113742322A (zh) 一种数据质量检测方法和装置
US20210165772A1 (en) Discovering and merging entity record fragments of a same entity across multiple entity stores for improved named entity disambiguation

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22947678

Country of ref document: EP

Kind code of ref document: A1