CN115952168A - Education industry-oriented multi-scale progressive difference data positioning method - Google Patents

Education industry-oriented multi-scale progressive difference data positioning method Download PDF

Info

Publication number
CN115952168A
CN115952168A CN202211664131.0A CN202211664131A CN115952168A CN 115952168 A CN115952168 A CN 115952168A CN 202211664131 A CN202211664131 A CN 202211664131A CN 115952168 A CN115952168 A CN 115952168A
Authority
CN
China
Prior art keywords
data
primary key
kstart
kend
hash
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211664131.0A
Other languages
Chinese (zh)
Inventor
唐雪飞
陈科
马晨曦
魏友佳
王帆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Comsys Information Technology Co ltd
Original Assignee
Chengdu Comsys Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Comsys Information Technology Co ltd filed Critical Chengdu Comsys Information Technology Co ltd
Priority to CN202211664131.0A priority Critical patent/CN115952168A/en
Publication of CN115952168A publication Critical patent/CN115952168A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a multi-scale progressive difference data positioning method for the education industry, which comprises the following steps: s1, data sorting: sequencing the operation table TR and the backup table TB according to the size of the primary key values; s2, calculating a starting primary key value KStart and a terminating primary key value KEnd; s3, aiming at the data range [ Kttart, KSnd ] of the primary key values]Carrying out scale division; s4, judging the grouping: s4.1, KStart i <KEnd i Calculating the data records of the primary key values in the TR and the TB within the grouping range; s4.2, KStart i =KEnd i Let K = KStart i =KEnd i Taking out the data records corresponding to K for comparison; and S5, performing the operation of the S4 on all the groups to obtain the difference data item. The invention can quickly and efficiently locate the data record items with differences in the operation table and the backup table, and output the result of data comparison, and has smaller performance pressure on the database.

Description

Education industry-oriented multi-scale progressive difference data positioning method
Technical Field
The invention belongs to the technical field of data processing, and particularly relates to a multi-scale progressive difference data positioning method for the education industry.
Background
With the continuous promotion and wide application of education informatization, a large amount of data is newly added every day in the education industry. Compared with other industries, the education industry has the characteristics of complex data structure, huge user quantity and fast data growth. Taking a common college as an example, the college comprises tens of thousands of students and thousands of teachers, each department runs various application systems, such as financial systems, personnel systems, academic systems, educational system, etc., which generate a large amount of data at every moment, and in order to store the data, a database is used, wherein a relational database is the most common and mature database system which stores and retrieves data in row (row) units.
The data in the database in the education industry is not only huge in quantity, but also can be changed at any time, such as the operations of adding, modifying and deleting the data, in order to manage the data more safely and effectively, the database is required to be backed up regularly, and the backup database provides necessary support conditions for maintenance operations such as data cleaning, error correction, recovery and the like. Many times, these operations require searching for changes in data, that is, comparing backup data with current running data to find out differences between the backup data and the current running data.
The traditional data comparison method needs to compare the data rows of the backup library and the data rows of the running library one by one, and the operation is time-consuming, and the required time is exponentially increased along with the expansion of the data volume. Therefore, it is necessary to design a more efficient and faster data comparison method, which can locate the difference position between two different databases in a shorter time and give the specific occurrence of change types (addition, modification, deletion).
Relational databases store data in the form of data tables (tables). The data table is a two-dimensional table comprising columns representing data fields and rows representing data records, as table 1 gives the table student for storing student information:
table 1: student information table example (student)
Student ID Name (I) Age (age) Sex College of college Type (B) ……
1 Zhang San 20 For male Computer school Master's soldier ……
2 Li Si 19 Woman LiteratureHotel Doctor (Rooibos) ……
3 Wang Wu 21 For male Foreign language college Bachelor ……
…… …… …… …… …… …… ……
Each table has one or more columns for uniquely identifying data row records, called "primary keys" (keys), for example, student IDs in Table 1 are primary keys of the student tables. Although the primary key may be multiple fields, in most cases, only one column is used as the primary key identification, since this is easier to manage and maintain and data retrieval is more efficient. The primary key typically has the following characteristics:
(1) The primary key cannot be empty and is not repeatable: the primary key must have a value and cannot be duplicated with the primary key values of existing records, which is a basic requirement of the relational database for the primary key.
(2) Primary key values are comparable and therefore can be sorted by primary key size. The most common primary key values are recorded in a way that positive integers sequentially increase. Although a character string can be used as a primary key, it is less suitable because it is not possible to perform addition and subtraction operations and has many restrictions in use.
(3) The primary key value is not modifiable. The primary key is used as a unique identifier of a data record, and represents a unique retrieval basis of a record, and modification of the primary key can cause data inconsistency, so most databases do not suggest or even allow modification of the primary key value. In practice we can avoid modifying its value by adding a column of fields with unique values but no actual physical meaning as a primary key.
(4) Primary key values are not reusable. If a record is deleted, the primary key value corresponding to that record will not be used in the newly inserted data record. For example, in table 1, after a student with ID 3 is deleted, the newly inserted data later does not use 3 as the primary key. The reason for not multiplexing primary key values: firstly, because the primary key value exists in the operation table and the backup table at the same time, in the process of comparing the data of the backup table and the operation table, whether the record is deleted, newly added or directly modified cannot be distinguished; secondly, because the primary key is only used as the data retrieval identification, under the condition of no actual physical significance, the primary key value is not needed to be multiplexed.
The data volume means: in most cases, the columns (i.e., fields) of the table will not change, and if the columns change, the database structure is considered to be changed qualitatively, which will cause all data usage, operation and maintenance procedures related to the data table to be greatly affected, so that in the actual operation process of the database, once formally online, the column definition never changes. Therefore, the measurement standard of the data amount of the data table is determined by the number of rows, no general standard of the database level exists at present, and the data amount is empirically regarded as a large amount of data in million levels, an ultra large amount of data in million levels, and a huge amount of data in more than hundred million levels.
The data sheet actually used in the education information system is called a run sheet, for example, a student information sheet in a student system, and all student information is recorded in real time. The structure and record of the operation table at a certain moment are copied and stored, and at the moment of finishing the backup, the backup table has the same data as the current operation table, but the operation table can be changed along with the time, for example, the type of student ID 3 is modified from student to master, but the data of the backup table can not be changed.
As mentioned above, the operation table may change with the execution of the service operation, and these changes mainly include the following cases:
(1) Inserting data: a record is newly added in the operation table, and the main key of the new record does not exist in the backup table at the moment;
(2) And (3) modifying data: when the value of one or more fields (which cannot be primary keys) of a record in the operation table changes, the record in the operation table and the backup table has the same primary key value;
(3) And (4) deleting data: and deleting one or more records from the operating list, wherein the backup list has the main key of the deleted record and the operating list does not have the main key of the deleted record.
The differential data positioning is to compare the data of the operation table and the backup table, and accurately position the changed record. Since the data columns are fixed and finite, the comparison of field values is easy as long as the data rows are located. The unique identification of the row is the primary key, so the difference data comparison requires that the primary key value and the change type of the difference data be given. According to different variation types, the difference results are given as follows:
(1) Inserting data: changeType = insert, newId = primary key value of new record;
(2) And (3) modifying data: changeType = update, id = primary key value of modified record;
(3) And (3) deleting data: changeType = delete, oldId = primary key value of deleted record.
To locate the difference data, data comparison needs to be performed on the operating table and the backup table, and particularly when the data volume is large (more than millions), it is difficult to find some tiny data changes. The current data comparison method mainly comprises the following steps:
(1) Comparing the whole table: each row of data in the backup table is sequentially taken out and compared with the data in the running table one by one, the process can locate the modified and deleted data, but the newly added data cannot be found (because the backup table does not have the data), so all the data which are not compared in the running table need to be recorded, and the process is time-consuming. For the case of large data volume, the full table comparison cannot meet the requirements on time and efficiency.
(2) Log analysis: most databases provide log records of data operations, and theoretically, all data change information can be found from the log, so that data difference positions can be further located. However, from the practical operation, the technical difficulty of log analysis is very high, firstly, the log information is unstructured and is only described in a section of characters, and the detailed content of data change is difficult to extract and analyze from the log information; secondly, the amount of log information is huge, all data operations need to be analyzed in sequence from the backup time, and the required time is long; thirdly, different databases provide different log information formats and data description modes, and the changed primary key value position may not be accurately located. Therefore, under the condition of large data volume, the log analysis method cannot meet the requirement of accurate positioning of difference data.
(3) Writing a trigger program: most databases support the writing of trigger programs, which perform a specific task (e.g., recording change information in a specific data table or file) when certain operations (e.g., adding, modifying, and deleting) occur on the run list, so as to obtain data changes in real time. Although the trigger program is powerful, the disadvantages are also very significant: firstly, the trigger program is generated in real time along with data change operation, so that the trigger program is additionally executed in each data operation, the burden of a database server is greatly increased, the performance of a database is greatly reduced, and the trigger program cannot be tolerated in a large-data-volume environment; secondly, the trigger program is not a universal standard, and the compiling formats of the trigger programs of different databases are different; thirdly, the programming of the trigger program has higher difficulty, and the BUG is easy to occur, thereby influencing the normal data storage operation and causing the system to be unstable or even crash. The trigger program is therefore only suitable for scenes with small data volumes and is not universal.
In summary, none of the currently available technical means and methods can support fast locating of difference data of the run table and the backup table well, especially in a large data volume environment, and therefore, redesign from the aspect of the method is required.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a multi-scale progressive difference data positioning method facing the education industry, which can quickly and efficiently position data record items with differences in an operation table and a backup table, output the result of data comparison and have less pressure on the performance of a database.
The purpose of the invention is realized by the following technical scheme: a multi-scale progressive difference data positioning method for education industry comprises the following steps:
s1, data sorting: sequencing a running table TR and a backup table TB according to the size of a primary key value, wherein the primary key of the TR is marked as KR, and the primary key of the TB is marked as KB;
s2, calculating a starting primary key value KStart and a terminating primary key value KEnd;
s3, carrying out scale division on a primary key value data range [ KStart, KEnd ]; the choice of the size of the scale is determined according to the following formula:
Figure BDA0004014032380000041
wherein the symbol
Figure BDA0004014032380000042
The upper integer is taken; the denominator n is an element of [2]If p is less than 2, taking n =2;
using the selected scale pair [ KStart, KEnd]Divide into the ith group G i The data value range calculation formula of (1) is as follows:
G i =[KStart i ,KEnd i ]=[KStart+(i-1)×D,min(KStart+i×D-1,KEnd)]
i≥1;
s4, grouping G according to the size of the scale i The judgment is carried out, and the following substeps are included:
s4.1, if KStart i <KEnd i Respectively calculating the data records of the primary key values in the TR and the TB in the grouping range;
s4.2, if KStart i =KEnd i Indicating that there is only one primary key value data, let: k = KStart i =KEnd i Directly taking out the data records corresponding to K for comparison;
s5, performing the operation of the step S4 on all the packets of [ KStart, KEnd ], and obtaining all the difference data items in TR and TB.
The specific implementation method of the step S2 comprises the following steps:
s2.1, calculating the minimum value min (KR) of KR;
s2.2, calculating the minimum value min (KB) of the KB;
s2.3, calculating the maximum value max (KR) of KR;
s2.4, calculating the maximum value max (KB) of the KB;
s2.5, calculating the start primary key value KStart = min (min (KR), min (KB));
s2.6, calculate termination primary key value KEnd = max (max (KR), max (KB)).
The specific implementation method of the step S4.1 comprises the following steps:
s4.1.1, querying data records in the TR under the following query conditions: KR e [ KStart ] i ,KEnd i ]The resulting data record set is DataSetR i (ii) a The corresponding SQL statement is: select from TR where id beta weenKStart i andKEnd i
S4.1.2, inquiring that the range of the primary key value is [ KStart ] in TB i ,KEnd i ]Data record set DataSetB in (1) i The corresponding SQL statement is: select from TB where id beta weenKStart i andKEnd i
S4.1.3 for DataSetR i And DataSetB i And carrying out Hash operation, and comparing whether the results are equal:
if hash (DataSetR) i )=hash(DataSetB i ) Showing that G is i Dataset DataSetR within range i And DataSetB i Equality, absence of difference data, no need to subdivide within the comparison data setRecord, packet G i The operation of (1) is directly finished;
if hash (DataSetR) i )≠hash(DataSetB i ) Indicates the primary key value is at G i The data records within the range are different and need to be at G i Carrying out progressive comparison by further dividing the scale; at this time, G i And (4) regarding the data as a complete set, continuously grouping and dividing the data, and returning to the step S3.
The specific implementation method of the step S4.2 is as follows:
s4.2.1, inquiring data records in the TR under the following conditions: KR = K, corresponding to SQL statement: select from TR where KR = K, and recording the query result as DataR i If the primary key value K does not exist in TR, put DataR i =null;
S4.2.2, querying the data record in the TB under the query conditions: KB = K, corresponding to SQL statement: select from TB where KB = K, and the query result is noted as DataB i If the primary key value K does not exist in TB, put DataB i = null; since K is the union from TR and TB primary bonds, dataR cannot occur i And DataB i Null cases at the same time;
s4.2.3, comparison DataR i And DataB i The value of (c):
if DataR i ≠nullandDataB i ≠nullandhash(DataR i )≠hash(DataB i ) If the record of the primary key K exists in both TR and TB, but the hash values of the TR and the TB are not equal, the data modification is caused, and the record data change is as follows: changeType = update, id = K;
if DataR i ≠nullandDataB i ≠nullandhash(DataR i )=hash(DataB i ) The TR and the TB are both recorded with the main key K, but the data is not changed and is not difference data, and at the moment, no record is needed;
if DataR i ≠nullandDataB i = null, indicating that the running table TR has the primary key K and the backup table TB has no K, indicating that data insertion has occurred, recording data change: changeType = insert, newId = K;
if DataR i =nullandDataB i Not equal to null, it shows that the running table TR has no main key K, and the backup table TB has K, indicating that data deletion has occurred, and recording data change: changeType = delete, oldId = K.
The invention has the beneficial effects that: the invention can quickly and efficiently locate the data record items with differences in the operation table and the backup table and output the result of data comparison. The method supports any general relational database, and has small performance pressure on the database, thereby solving various defects of the traditional data comparison method.
Drawings
FIG. 1 is a flow chart of the multi-scale progressive data comparison method of the present invention.
Detailed Description
The terms appearing in the present invention are defined as follows:
data division: all data in a certain data set can be grouped, the number of the groups is n, and each group is marked as G 1 ,G 2 ,G 3 ...G n (n.gtoreq.2). Let the full set of data be U, each packet being a subset of U. A certain grouping is called a division of U and only if the following conditions hold:
(1)G 1 ∪G 2 ∪G 3 ∪...∪G n = U, i.e. the union of all packets is the full set, so any one data record must be in a certain packet;
(2) For any sequence number i not equal to j, all have
Figure BDA0004014032380000061
I.e. no identical data record is contained in any two packets, any one data record must therefore be located in a unique packet.
The grouping mode satisfying the above condition is called a division of U, for example, the U is divided into odd-numbered groups and even-numbered groups according to the primary key value, or divided into male and female according to the gender, etc.
Scale of data: the data scale definition involved in the invention is a metric data packet G i An indication of scale. To improve data processingFor reasons of efficiency, the packets should be kept equal, i.e. the number of data records in each packet should be equal or as close as possible. The number of data in a divided packet is called the data scale and is noted
Figure BDA0004014032380000062
U represents the total number of records of the corpus. Since | U | may not be evenly divisible by n, in this way of data partitioning, the number of most packets should be equal (equal to the scale), and the last packet may be offset by | U | mod n data, although this does not affect the definition of the scale. The size of the scale value indicates the fineness of the U division, the larger the scale is, the coarser the data division is, and the smaller the scale is, the finer the division is.
If a certain data packet G is to be grouped i Treated as a new corpus, further comprising G i Is divided into
Figure BDA0004014032380000063
By analogy, any one packet can be further subdivided until the scale equals 1.
The multi-scale progressive is to gradually approach the data records with certain characteristics by adjusting the scale of the division, and large-scale groups which do not meet the conditions in the division can be eliminated integrally, so that the data matching and comparing efficiency is greatly improved.
Hash (Hash) operation of data: one or more data records are mapped by a hash function into message digests of equal length. The requirements of an ideal hash operation are:
(1) The hash operation is fast and the lengths of the operation results are equal regardless of the size of the original data r.
(2) If 2 data records r 1 =r 2 Then there should be a hash (r) 1 )=hash(r 2 )。
(3) If r is 1 ≠r 2 Then there should be a hash (r) 1 )≠hash(r 2 )。
For any hash function, (1) and (2) must be satisfied, but(3) Is satisfied only in the ideal state, i.e. r 1 =r 2 Is hash (r) 1 )=hash(r 2 ) However, in practical situations, since there is some probability of hash collision (different raw data are calculated to obtain the same hash value), the above r is implemented 1 =r 2 Is hash (r) 1 )=hash(r 2 ) Is not necessary enough, but if the probability of hash collisions is low enough (e.g., below 2) -10 ) The hash operation can be considered reliable.
The algorithm process of the hash function can be designed autonomously, or known public algorithms can be directly used, the invention recommends using an MD5 abstract algorithm, and the algorithm has the advantages of high operation speed and low collision probability (less than 2) -64 ) Is characterized in that.
In some limit cases, the reliability of the hash operation needs to be guaranteed with a great probability, and a double hash mode can be used, that is, 2 different hash algorithms are used for data comparison, if there are: hash1 (r 1) = hash1 (r 2) and hash2 (r 1) = hash2 (r 2), and the probability of collision at this time is reduced to collision probability 1 × collision probability 2, which can be considered as a very reliable hash operation, and it can be considered that r1= r2 is true at all.
The reason for using hash operation is that when one or more data records are compared, the data items can be prevented from being sequentially compared row by row, and only the hash values of the data items need to be compared integrally.
The technical scheme of the invention is further explained by combining the attached drawings.
As shown in fig. 1, the multi-scale progressive difference data positioning method for the educational industry of the present invention is used for quickly positioning data records with differences in an operation table and a backup table, and outputting a result of data comparison. The preconditions of the invention include:
(1) The running table (denoted as TR) and the backup table (denoted as TB) have identical columns (fields), including number and type, and order of the fields; the data sheet actually used in the education information system is called a run sheet, for example, a student information sheet in a student system, and all student information is recorded in real time.
(2) TR and TB have unique primary keys and are not modified;
(3) The primary key types of TR and TB are integers, so that sorting and addition and subtraction operations can be performed.
The positioning method comprises the following steps:
s1, data sorting: sorting the operation table TR and the backup table TB according to the size of a primary key value (ascending or descending, if no special requirement exists, the ascending is generally used, because the ascending is a default sorting rule of most databases), wherein the primary key of the TR is marked as KR, and the primary key of the TB is marked as KB; the purpose of the sorting is to facilitate subsequent calculation of primary key value ranges.
S2, calculating a starting primary key value KStart and a terminating primary key value KEnd; the specific implementation method comprises the following steps:
s2.1, calculating the minimum value min (KR) of KR;
s2.2, calculating the minimum value min (KB) of the KB;
s2.3, calculating the maximum value max (KR) of KR;
s2.4, calculating the maximum value max (KB) of the KB;
as the database adopts the index technology for the primary key values by default, the speed for calculating the minimum value and the maximum value of the primary key is very high, and the data size of ten million levels can be generally inquired within 0.1 second to obtain the result through testing.
S2.5, calculating a start primary key value KStart = min (min (KR), min (KB));
s2.6, calculate termination primary key value KEnd = max (max (KR), max (KB)).
Namely: the start primary key KStart is equal to the smallest primary key of KR and KB and the end primary key KEnd is equal to the largest primary key of KR and KB.
S3, carrying out scale division on a primary key value data range [ KStart, KEnd ]; the selection of the scale size is related to the size of the data range (i.e., the KEnd-KStart) value and can be determined according to the following formula:
Figure BDA0004014032380000081
wherein(symbol)
Figure BDA0004014032380000082
The upper integer is taken; the denominator n is an element of [2]If p is less than 2, taking n =2; the whole is taken to ensure that D is more than or equal to 1. When D =1 is true, it indicates that there is only one piece of data in the packet, and that it cannot be subdivided.
The value of n can be properly adjusted according to the actual situation, and the scale value obtained by too large n is smaller, so that the number of packets is too large, the number of data comparison times is increased, and the efficiency is reduced; and too small n is obtained with a larger scale value, so that the range of each group is larger, the probability of containing difference data is increased, and a new round of division comparison process in the group is initiated. Generally, p can take the core number of the CPU of the current computer so as to support multithread concurrent computation and improve the operation efficiency.
Using the selected scale pair [ KStart, KKnd ]]Divide into the ith group G i The data value range calculation formula of (1) is as follows:
G i =[KStart i ,KEnd i ]=[KStart+(i-1)×D,min(KStart+i×D-1,KEnd)]
i is more than or equal to 1; the formula conforms to the definition of the partition and ensures that the upper limit value does not exceed the KEnd, thereby avoiding an invalid data search range.
S4, grouping G according to the size of the scale i The judgment is carried out, and the following substeps are included:
s4.1, if KStart i <KEnd i Respectively calculating the data records of the primary key values in the TR and the TB in the grouping range; since the primary key value is the only mark of the record, this step effectively enables the partitioning of the data records in TR and TB. The specific implementation method comprises the following steps:
s4.1.1, inquiring data records in TR under the following conditions: KR e [ KStart ] i ,KEnd i ]Obtaining a data record set as DataSetS i (ii) a The corresponding SQL statement is: select from TR world id between
Figure BDA0004014032380000083
andKEnd i (ii) a Due to the key value ofIndexing, the time consumption of the step is very short, and the results can be obtained on the level of seconds for tens of millions of data.
S4.1.2, querying primary key value range in TB in [ KStart ] i ,KEnd i ]Data record set DataSetB in (1) i The corresponding SQL statement is: select from TB where id beta weenKStart i andKEnd i
S4.1.3 for DataSetR i And DataSetB i And carrying out Hash operation, and comparing whether the results are equal:
if hash (DataSetR) i )=hash(DataSetB i ) Showing that G is i Dataset DataSetR within range i And DataSetB i Equality, absence of difference data, no need to subdivide records in the comparison data set, group G i The operation of (1) is directly finished;
if hash (DataSetR) i )≠hash(DataSetB i ) Indicating that the primary key value is at G i The data records within the range are different and need to be at G i Carrying out progressive comparison by further dividing the scale; at this time, G i The data is regarded as a data complete set, the grouping and the division are continuously carried out, the step S3 is returned, and the data can be divided according to G when the data division is carried out i Take different values of n.
S4.2 if KStart i =KEnd i Indicating that there is only one primary key value data, let: k = KStart i =KEnd i Directly taking out the data records corresponding to K for comparison; the specific implementation method comprises the following steps:
s4.2.1, inquiring data records in the TR under the following conditions: KR = K, corresponding to SQL statement: select from TR where KR = K, and recording the query result as DataR i If the primary key value K does not exist in TR, put DataR i =null;
S4.2.2, inquiring data records in the TB under the following conditions: KB = K, corresponding to SQL statement: select from TB where KB = K, and the query result is noted as DataB i If the primary key value K does not exist in TB, put DataB i = null; since K is the union from TR and TB primary keys, dataR is unlikely to occur i And DataB i Null cases at the same time;
s4.2.3, comparison DataR i And DataB i The value of (c):
if DataR i ≠nullandDataB i ≠nullandhash(DataR i )≠hash(DataB i ) If the record of the primary key K exists in TR and TB, but the hash values of the TR and the TB are different, the data modification is generated, and the record data change is as follows: changeType = update, id = K;
if DataR i ≠nullandDataB i ≠nullandhash(DataR i )=hash(DataB i ) The TR and the TB are both recorded with the main key K, but the data is not changed and is not difference data, and at the moment, no record is needed;
if DataR i ≠nullandDataB i = null, indicating that the running table TR has a primary key K and the backup table TB has no K, indicating that data insertion has occurred, recording data change: changeType = insert, newId = K;
if DataR i =nullandDatab i And is not null, the fact that the main key K does not exist in the operation table TR, the fact that the K exists in the backup table TB indicates that data deletion and data record change occur is as follows: changeType = delete, oldId = K.
And S5, executing the judgment operation of the step S4 on all the groups of [ KStart, KEnd ] to obtain all the difference data items in TR and TB.
The method and the steps related to the invention are suitable for the case of larger data volume (more than 10 ten thousand), and the difference data volume < = 20%. More difference data may cause more progressive partition contrast and efficiency may be reduced. Through testing, ten million levels of data volume (5 ten thousand pieces of difference data) can be searched within 2 minutes.
It will be appreciated by those of ordinary skill in the art that the embodiments described herein are intended to assist the reader in understanding the principles of the invention and are to be construed as being without limitation to such specifically recited embodiments and examples. Those skilled in the art can make various other specific changes and combinations based on the teachings of the present invention without departing from the spirit of the invention, and these changes and combinations are within the scope of the invention.

Claims (4)

1. A multi-scale progressive difference data positioning method oriented to education industry is characterized by comprising the following steps:
s1, data sorting: sequencing a running table TR and a backup table TB according to the size of a primary key value, wherein the primary key of the TR is marked as KR, and the primary key of the TB is marked as KB;
s2, calculating a starting primary key value KStart and a terminating primary key value KEnd;
s3, carrying out scale division on a primary key value data range [ KStart, KEnd ]; the choice of the size of the scale is determined according to the following formula:
Figure FDA0004014032370000011
wherein the symbol
Figure FDA0004014032370000012
The upper integer is taken; the denominator n is an element of [2]If p is<2, taking n =2;
using the selected scale pair [ KStart, KEnd]Is divided into the ith packet G The data value range calculation formula of (1) is as follows:
G i =[KStart i ,KEnd i ]=[KStart+(i-1)×D,min(KStart+i×D-1,KEnd)]i≥1;
s4, grouping G according to the size of the scale i The judgment is carried out, and the following substeps are included:
s4.1, if KStart i <KEnd i Respectively calculating the data records of the primary key values in the TR and the TB in the grouping range;
s4.2 if KStart i =KEnd i Indicating that there is only one primary key value data, let: k = KStart i =KEnd Directly taking out the data records corresponding to K for comparison;
s5, performing the operation of the step S4 on all the packets of [ KStart, KEnd ], and obtaining all the difference data items in TR and TB.
2. The education industry-oriented multi-scale progressive difference data positioning method according to claim 1, wherein the step S2 is specifically realized by the following steps:
s2.1, calculating the minimum value min (KR) of KR;
s2.2, calculating the minimum value min (KB) of the KB;
s2.3, calculating the maximum value max (KR) of KR;
s2.4, calculating the maximum value max (KB) of the KB;
s2.5, calculating the start primary key value KStart = min (min (KR), min (KB));
s2.6, calculate termination primary key value KEnd = max (max (KR), max (KB)).
3. The method for positioning the multi-scale progressive difference data facing the education industry according to claim 1, wherein the step S4.1 is realized by:
s4.1.1, inquiring data records in TR under the following conditions: KR e [ KStart ] i ,KEnd i ]The resulting data record set is DataSetR (ii) a The corresponding SQL statement is: select from TR where id between KStart i and KEnd i
S4.1.2, inquiring that the range of the primary key value is [ KStart ] in TB i ,KEnd i ]Data record set DataSetB in (1) i The corresponding SQL statement is: select from TB whereid between KStart i and KEnd i
S4.1.3 for DataSetR i And DataSetB i And carrying out Hash operation, and comparing whether the results are equal:
if hash (DataSetR) i )=hash(DataSetB i ) Showing that G is i Dataset DataSetR within range i And DataSetB i The data are equal, no difference data exists, the records in the data set do not need to be subdivided and compared, and the operation of the group Gi is directly finished;
if hash (DataSetR) i )≠hash(DataSetB i ) Indicating that the primary key value is at G i The data records within the range are different and need to be at G i Carrying out progressive comparison by further dividing the scale; at this time, G i And (4) regarding the data as a complete set, continuously grouping and dividing the data, and returning to the step S3.
4. The method for positioning the multi-scale progressive difference data facing the education industry according to claim 1, wherein the step S4.2 is realized by a method comprising the following steps:
s4.2.1, inquiring data records in the TR under the following conditions: KR = K, corresponding to SQL statement: select from TR where KR = K, and recording the query result as DataR i If the primary key value K does not exist in TR, put DataR i =null;
S4.2.2, querying the data record in the TB under the query conditions: KB = K, corresponding to SQL statement: select from TB where KB = K, and the query result is marked as DataB i If the primary key value K does not exist in TB, put DataB i = null; since K is the union from TR and TB primary bonds, dataR cannot occur i And DataB i Null cases at the same time;
s4.2.3, comparison DataR i And DataB i The value of (c):
if DataR i ≠null and DataB i ≠null and hash(DataR i )≠hash(DataB i ) If the record of the primary key K exists in TR and TB, but the hash values of the TR and the TB are different, the data modification is generated, and the record data change is as follows: changeType = update, id = K;
if DataR i ≠null and DataB ≠null and hash(DataR i )=hash(DataB i ) The TR and the TB are both recorded with the main key K, but the data is not changed and is not difference data, and at the moment, no record is needed;
if DataR i ≠null and DataB i = null, indicating that the running table TR has the primary key K and the backup table TB has no K, indicating that data insertion has occurred, recording data change: changeType = insert, newId = K;
if DataR i =null and DataB i Not equal to null, it shows that the running table TR has no main key K, and the backup table TB has K, indicating that data deletion has occurred, and recording data change: changeType = delete, oldId = K.
CN202211664131.0A 2022-12-23 2022-12-23 Education industry-oriented multi-scale progressive difference data positioning method Pending CN115952168A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211664131.0A CN115952168A (en) 2022-12-23 2022-12-23 Education industry-oriented multi-scale progressive difference data positioning method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211664131.0A CN115952168A (en) 2022-12-23 2022-12-23 Education industry-oriented multi-scale progressive difference data positioning method

Publications (1)

Publication Number Publication Date
CN115952168A true CN115952168A (en) 2023-04-11

Family

ID=87297626

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211664131.0A Pending CN115952168A (en) 2022-12-23 2022-12-23 Education industry-oriented multi-scale progressive difference data positioning method

Country Status (1)

Country Link
CN (1) CN115952168A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116150179A (en) * 2023-04-14 2023-05-23 天津南大通用数据技术股份有限公司 Method and device for comparing data consistency between databases

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116150179A (en) * 2023-04-14 2023-05-23 天津南大通用数据技术股份有限公司 Method and device for comparing data consistency between databases

Similar Documents

Publication Publication Date Title
US10515090B2 (en) Data extraction and transformation method and system
CN108388632B (en) Data clustering, segmentation, and parallelization
US8370355B2 (en) Managing entities within a database
US10521441B2 (en) System and method for approximate searching very large data
US20100257440A1 (en) High precision web extraction using site knowledge
US20160055212A1 (en) Automatic joining of data sets based on statistics of field values in the data sets
CN107506260A (en) A kind of dynamic division database incremental backup method
US10445370B2 (en) Compound indexes for graph databases
CN112463774A (en) Data deduplication method, data deduplication equipment and storage medium
CN115952168A (en) Education industry-oriented multi-scale progressive difference data positioning method
CN113297238B (en) Method and device for mining information based on history change record
WO2021021614A1 (en) Techniques for database entries de-duplication
Javdani et al. Deepblock: A novel blocking approach for entity resolution using deep learning
US20100063966A1 (en) Method for fast de-duplication of a set of documents or a set of data contained in a file
US20180144060A1 (en) Processing deleted edges in graph databases
EP3005161A1 (en) Datasets profiling tools, methods, and systems
US20210326361A1 (en) Dynamic clustering of sparse data utilizing hash partitions
Koch et al. Duplicate Table Detection with Xash
Zhang et al. An approximate approach to frequent itemset mining
CN116881262B (en) Intelligent multi-format digital identity mapping method and system
CN113779955B (en) Method, device and storage medium for generating difference script
Brkić et al. Improving the completeness and timeliness by horizontal fragmentation of data warehouse tables
CN114168478B (en) Software Bug detection method based on multi-graph multi-mark learning
CN108304430B (en) Method for modifying database
CN116383314A (en) Pattern and tuple level origin data storage method of relational database

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination