CN115952168A

CN115952168A - Education industry-oriented multi-scale progressive difference data positioning method

Info

Publication number: CN115952168A
Application number: CN202211664131.0A
Authority: CN
Inventors: 唐雪飞; 陈科; 马晨曦; 魏友佳; 王帆
Original assignee: Chengdu Comsys Information Technology Co ltd
Current assignee: Chengdu Comsys Information Technology Co ltd
Priority date: 2022-12-23
Filing date: 2022-12-23
Publication date: 2023-04-11

Abstract

The invention discloses a multi-scale progressive difference data positioning method for the education industry, which comprises the following steps: s1, data sorting: sequencing the operation table TR and the backup table TB according to the size of the primary key values; s2, calculating a starting primary key value KStart and a terminating primary key value KEnd; s3, aiming at the data range [ Kttart, KSnd ] of the primary key values]Carrying out scale division; s4, judging the grouping: s4.1, KStart _i ＜KEnd _i Calculating the data records of the primary key values in the TR and the TB within the grouping range; s4.2, KStart _i ＝KEnd _i Let K = KStart _i ＝KEnd _i Taking out the data records corresponding to K for comparison; and S5, performing the operation of the S4 on all the groups to obtain the difference data item. The invention can quickly and efficiently locate the data record items with differences in the operation table and the backup table, and output the result of data comparison, and has smaller performance pressure on the database.

Description

Education industry-oriented multi-scale progressive difference data positioning method

Technical Field

The invention belongs to the technical field of data processing, and particularly relates to a multi-scale progressive difference data positioning method for the education industry.

Background

With the continuous promotion and wide application of education informatization, a large amount of data is newly added every day in the education industry. Compared with other industries, the education industry has the characteristics of complex data structure, huge user quantity and fast data growth. Taking a common college as an example, the college comprises tens of thousands of students and thousands of teachers, each department runs various application systems, such as financial systems, personnel systems, academic systems, educational system, etc., which generate a large amount of data at every moment, and in order to store the data, a database is used, wherein a relational database is the most common and mature database system which stores and retrieves data in row (row) units.

The data in the database in the education industry is not only huge in quantity, but also can be changed at any time, such as the operations of adding, modifying and deleting the data, in order to manage the data more safely and effectively, the database is required to be backed up regularly, and the backup database provides necessary support conditions for maintenance operations such as data cleaning, error correction, recovery and the like. Many times, these operations require searching for changes in data, that is, comparing backup data with current running data to find out differences between the backup data and the current running data.

The traditional data comparison method needs to compare the data rows of the backup library and the data rows of the running library one by one, and the operation is time-consuming, and the required time is exponentially increased along with the expansion of the data volume. Therefore, it is necessary to design a more efficient and faster data comparison method, which can locate the difference position between two different databases in a shorter time and give the specific occurrence of change types (addition, modification, deletion).

Relational databases store data in the form of data tables (tables). The data table is a two-dimensional table comprising columns representing data fields and rows representing data records, as table 1 gives the table student for storing student information:

table 1: student information table example (student)

Student ID

Name (I)

Age (age)

Sex

College of college

Type (B)

……

1

Zhang San

20

For male

Computer school

Master's soldier

……

2

Li Si

19

Woman

LiteratureHotel

Doctor (Rooibos)

……

3

Wang Wu

21

For male

Foreign language college

Bachelor

……

Each table has one or more columns for uniquely identifying data row records, called "primary keys" (keys), for example, student IDs in Table 1 are primary keys of the student tables. Although the primary key may be multiple fields, in most cases, only one column is used as the primary key identification, since this is easier to manage and maintain and data retrieval is more efficient. The primary key typically has the following characteristics:

(1) The primary key cannot be empty and is not repeatable: the primary key must have a value and cannot be duplicated with the primary key values of existing records, which is a basic requirement of the relational database for the primary key.

(2) Primary key values are comparable and therefore can be sorted by primary key size. The most common primary key values are recorded in a way that positive integers sequentially increase. Although a character string can be used as a primary key, it is less suitable because it is not possible to perform addition and subtraction operations and has many restrictions in use.

(3) The primary key value is not modifiable. The primary key is used as a unique identifier of a data record, and represents a unique retrieval basis of a record, and modification of the primary key can cause data inconsistency, so most databases do not suggest or even allow modification of the primary key value. In practice we can avoid modifying its value by adding a column of fields with unique values but no actual physical meaning as a primary key.

(4) Primary key values are not reusable. If a record is deleted, the primary key value corresponding to that record will not be used in the newly inserted data record. For example, in table 1, after a student with ID 3 is deleted, the newly inserted data later does not use 3 as the primary key. The reason for not multiplexing primary key values: firstly, because the primary key value exists in the operation table and the backup table at the same time, in the process of comparing the data of the backup table and the operation table, whether the record is deleted, newly added or directly modified cannot be distinguished; secondly, because the primary key is only used as the data retrieval identification, under the condition of no actual physical significance, the primary key value is not needed to be multiplexed.

The data volume means: in most cases, the columns (i.e., fields) of the table will not change, and if the columns change, the database structure is considered to be changed qualitatively, which will cause all data usage, operation and maintenance procedures related to the data table to be greatly affected, so that in the actual operation process of the database, once formally online, the column definition never changes. Therefore, the measurement standard of the data amount of the data table is determined by the number of rows, no general standard of the database level exists at present, and the data amount is empirically regarded as a large amount of data in million levels, an ultra large amount of data in million levels, and a huge amount of data in more than hundred million levels.

The data sheet actually used in the education information system is called a run sheet, for example, a student information sheet in a student system, and all student information is recorded in real time. The structure and record of the operation table at a certain moment are copied and stored, and at the moment of finishing the backup, the backup table has the same data as the current operation table, but the operation table can be changed along with the time, for example, the type of student ID 3 is modified from student to master, but the data of the backup table can not be changed.

As mentioned above, the operation table may change with the execution of the service operation, and these changes mainly include the following cases:

(1) Inserting data: a record is newly added in the operation table, and the main key of the new record does not exist in the backup table at the moment;

(2) And (3) modifying data: when the value of one or more fields (which cannot be primary keys) of a record in the operation table changes, the record in the operation table and the backup table has the same primary key value;

(3) And (4) deleting data: and deleting one or more records from the operating list, wherein the backup list has the main key of the deleted record and the operating list does not have the main key of the deleted record.

The differential data positioning is to compare the data of the operation table and the backup table, and accurately position the changed record. Since the data columns are fixed and finite, the comparison of field values is easy as long as the data rows are located. The unique identification of the row is the primary key, so the difference data comparison requires that the primary key value and the change type of the difference data be given. According to different variation types, the difference results are given as follows:

(1) Inserting data: changeType = insert, newId = primary key value of new record;

(2) And (3) modifying data: changeType = update, id = primary key value of modified record;

(3) And (3) deleting data: changeType = delete, oldId = primary key value of deleted record.

To locate the difference data, data comparison needs to be performed on the operating table and the backup table, and particularly when the data volume is large (more than millions), it is difficult to find some tiny data changes. The current data comparison method mainly comprises the following steps:

(1) Comparing the whole table: each row of data in the backup table is sequentially taken out and compared with the data in the running table one by one, the process can locate the modified and deleted data, but the newly added data cannot be found (because the backup table does not have the data), so all the data which are not compared in the running table need to be recorded, and the process is time-consuming. For the case of large data volume, the full table comparison cannot meet the requirements on time and efficiency.

(2) Log analysis: most databases provide log records of data operations, and theoretically, all data change information can be found from the log, so that data difference positions can be further located. However, from the practical operation, the technical difficulty of log analysis is very high, firstly, the log information is unstructured and is only described in a section of characters, and the detailed content of data change is difficult to extract and analyze from the log information; secondly, the amount of log information is huge, all data operations need to be analyzed in sequence from the backup time, and the required time is long; thirdly, different databases provide different log information formats and data description modes, and the changed primary key value position may not be accurately located. Therefore, under the condition of large data volume, the log analysis method cannot meet the requirement of accurate positioning of difference data.

(3) Writing a trigger program: most databases support the writing of trigger programs, which perform a specific task (e.g., recording change information in a specific data table or file) when certain operations (e.g., adding, modifying, and deleting) occur on the run list, so as to obtain data changes in real time. Although the trigger program is powerful, the disadvantages are also very significant: firstly, the trigger program is generated in real time along with data change operation, so that the trigger program is additionally executed in each data operation, the burden of a database server is greatly increased, the performance of a database is greatly reduced, and the trigger program cannot be tolerated in a large-data-volume environment; secondly, the trigger program is not a universal standard, and the compiling formats of the trigger programs of different databases are different; thirdly, the programming of the trigger program has higher difficulty, and the BUG is easy to occur, thereby influencing the normal data storage operation and causing the system to be unstable or even crash. The trigger program is therefore only suitable for scenes with small data volumes and is not universal.

In summary, none of the currently available technical means and methods can support fast locating of difference data of the run table and the backup table well, especially in a large data volume environment, and therefore, redesign from the aspect of the method is required.

Disclosure of Invention

The invention aims to overcome the defects of the prior art and provide a multi-scale progressive difference data positioning method facing the education industry, which can quickly and efficiently position data record items with differences in an operation table and a backup table, output the result of data comparison and have less pressure on the performance of a database.

The purpose of the invention is realized by the following technical scheme: a multi-scale progressive difference data positioning method for education industry comprises the following steps:

s1, data sorting: sequencing a running table TR and a backup table TB according to the size of a primary key value, wherein the primary key of the TR is marked as KR, and the primary key of the TB is marked as KB;

s2, calculating a starting primary key value KStart and a terminating primary key value KEnd;

s3, carrying out scale division on a primary key value data range [ KStart, KEnd ]; the choice of the size of the scale is determined according to the following formula:

wherein the symbol

The upper integer is taken; the denominator n is an element of [2]If p is less than 2, taking n =2;

using the selected scale pair [ KStart, KEnd]Divide into the ith group G _i The data value range calculation formula of (1) is as follows:

G _i ＝[KStart _i ，KEnd _i ]＝[KStart+(i-1)×D，min(KStart+i×D-1，KEnd)]

i≥1；

s4, grouping G according to the size of the scale _i The judgment is carried out, and the following substeps are included:

s4.1, if KStart _i ＜KEnd _i Respectively calculating the data records of the primary key values in the TR and the TB in the grouping range;

s4.2, if KStart _i ＝KEnd _i Indicating that there is only one primary key value data, let: k = KStart _i ＝KEnd _i Directly taking out the data records corresponding to K for comparison;

s5, performing the operation of the step S4 on all the packets of [ KStart, KEnd ], and obtaining all the difference data items in TR and TB.

The specific implementation method of the step S2 comprises the following steps:

s2.1, calculating the minimum value min (KR) of KR;

s2.2, calculating the minimum value min (KB) of the KB;

s2.3, calculating the maximum value max (KR) of KR;

s2.4, calculating the maximum value max (KB) of the KB;

s2.5, calculating the start primary key value KStart = min (min (KR), min (KB));

s2.6, calculate termination primary key value KEnd = max (max (KR), max (KB)).

The specific implementation method of the step S4.1 comprises the following steps:

s4.1.1, querying data records in the TR under the following query conditions: KR e [ KStart ] _i ，KEnd _i ]The resulting data record set is DataSetR _i (ii) a The corresponding SQL statement is: select from TR where id beta weenKStart _i andKEnd _i ；

S4.1.2, inquiring that the range of the primary key value is [ KStart ] in TB _i ，KEnd _i ]Data record set DataSetB in (1) _i The corresponding SQL statement is: select from TB where id beta weenKStart _i andKEnd _i ；

S4.1.3 for DataSetR _i And DataSetB _i And carrying out Hash operation, and comparing whether the results are equal:

if hash (DataSetR) _i )＝hash(DataSetB _i ) Showing that G is _i Dataset DataSetR within range _i And DataSetB _i Equality, absence of difference data, no need to subdivide within the comparison data setRecord, packet G _i The operation of (1) is directly finished;

if hash (DataSetR) _i )≠hash(DataSetB _i ) Indicates the primary key value is at G _i The data records within the range are different and need to be at G _i Carrying out progressive comparison by further dividing the scale; at this time, G _i And (4) regarding the data as a complete set, continuously grouping and dividing the data, and returning to the step S3.

The specific implementation method of the step S4.2 is as follows:

s4.2.1, inquiring data records in the TR under the following conditions: KR = K, corresponding to SQL statement: select from TR where KR = K, and recording the query result as DataR _i If the primary key value K does not exist in TR, put DataR _i ＝null；

S4.2.2, querying the data record in the TB under the query conditions: KB = K, corresponding to SQL statement: select from TB where KB = K, and the query result is noted as DataB _i If the primary key value K does not exist in TB, put DataB _i = null; since K is the union from TR and TB primary bonds, dataR cannot occur _i And DataB _i Null cases at the same time;

s4.2.3, comparison DataR _i And DataB _i The value of (c):

if DataR _i ≠nullandDataB _i ≠nullandhash(DataR _i )≠hash(DataB _i ) If the record of the primary key K exists in both TR and TB, but the hash values of the TR and the TB are not equal, the data modification is caused, and the record data change is as follows: changeType = update, id = K;

if DataR _i ≠nullandDataB _i ≠nullandhash(DataR _i )＝hash(DataB _i ) The TR and the TB are both recorded with the main key K, but the data is not changed and is not difference data, and at the moment, no record is needed;

if DataR _i ≠nullandDataB _i = null, indicating that the running table TR has the primary key K and the backup table TB has no K, indicating that data insertion has occurred, recording data change: changeType = insert, newId = K;

if DataR _i ＝nullandDataB _i Not equal to null, it shows that the running table TR has no main key K, and the backup table TB has K, indicating that data deletion has occurred, and recording data change: changeType = delete, oldId = K.

The invention has the beneficial effects that: the invention can quickly and efficiently locate the data record items with differences in the operation table and the backup table and output the result of data comparison. The method supports any general relational database, and has small performance pressure on the database, thereby solving various defects of the traditional data comparison method.

Drawings

FIG. 1 is a flow chart of the multi-scale progressive data comparison method of the present invention.

Detailed Description

The terms appearing in the present invention are defined as follows:

data division: all data in a certain data set can be grouped, the number of the groups is n, and each group is marked as G ₁ ，G ₂ ，G ₃ ...G _n (n.gtoreq.2). Let the full set of data be U, each packet being a subset of U. A certain grouping is called a division of U and only if the following conditions hold:

(1)G ₁ ∪G ₂ ∪G ₃ ∪...∪G _n = U, i.e. the union of all packets is the full set, so any one data record must be in a certain packet;

(2) For any sequence number i not equal to j, all have

I.e. no identical data record is contained in any two packets, any one data record must therefore be located in a unique packet.

The grouping mode satisfying the above condition is called a division of U, for example, the U is divided into odd-numbered groups and even-numbered groups according to the primary key value, or divided into male and female according to the gender, etc.

Scale of data: the data scale definition involved in the invention is a metric data packet G _i An indication of scale. To improve data processingFor reasons of efficiency, the packets should be kept equal, i.e. the number of data records in each packet should be equal or as close as possible. The number of data in a divided packet is called the data scale and is noted

U represents the total number of records of the corpus. Since | U | may not be evenly divisible by n, in this way of data partitioning, the number of most packets should be equal (equal to the scale), and the last packet may be offset by | U | mod n data, although this does not affect the definition of the scale. The size of the scale value indicates the fineness of the U division, the larger the scale is, the coarser the data division is, and the smaller the scale is, the finer the division is.

If a certain data packet G is to be grouped _i Treated as a new corpus, further comprising G _i Is divided into

By analogy, any one packet can be further subdivided until the scale equals 1.

The multi-scale progressive is to gradually approach the data records with certain characteristics by adjusting the scale of the division, and large-scale groups which do not meet the conditions in the division can be eliminated integrally, so that the data matching and comparing efficiency is greatly improved.

Hash (Hash) operation of data: one or more data records are mapped by a hash function into message digests of equal length. The requirements of an ideal hash operation are:

(1) The hash operation is fast and the lengths of the operation results are equal regardless of the size of the original data r.

(2) If 2 data records r ₁ ＝r ₂ Then there should be a hash (r) ₁ )＝hash(r ₂ )。

(3) If r is ₁ ≠r ₂ Then there should be a hash (r) ₁ )≠hash(r ₂ )。

For any hash function, (1) and (2) must be satisfied, but(3) Is satisfied only in the ideal state, i.e. r ₁ ＝r ₂ Is hash (r) ₁ )＝hash(r ₂ ) However, in practical situations, since there is some probability of hash collision (different raw data are calculated to obtain the same hash value), the above r is implemented ₁ ＝r ₂ Is hash (r) ₁ )＝hash(r ₂ ) Is not necessary enough, but if the probability of hash collisions is low enough (e.g., below 2) ^-10 ) The hash operation can be considered reliable.

The algorithm process of the hash function can be designed autonomously, or known public algorithms can be directly used, the invention recommends using an MD5 abstract algorithm, and the algorithm has the advantages of high operation speed and low collision probability (less than 2) ^-64 ) Is characterized in that.

In some limit cases, the reliability of the hash operation needs to be guaranteed with a great probability, and a double hash mode can be used, that is, 2 different hash algorithms are used for data comparison, if there are: hash1 (r 1) = hash1 (r 2) and hash2 (r 1) = hash2 (r 2), and the probability of collision at this time is reduced to collision probability 1 × collision probability 2, which can be considered as a very reliable hash operation, and it can be considered that r1= r2 is true at all.

The reason for using hash operation is that when one or more data records are compared, the data items can be prevented from being sequentially compared row by row, and only the hash values of the data items need to be compared integrally.

The technical scheme of the invention is further explained by combining the attached drawings.

As shown in fig. 1, the multi-scale progressive difference data positioning method for the educational industry of the present invention is used for quickly positioning data records with differences in an operation table and a backup table, and outputting a result of data comparison. The preconditions of the invention include:

(1) The running table (denoted as TR) and the backup table (denoted as TB) have identical columns (fields), including number and type, and order of the fields; the data sheet actually used in the education information system is called a run sheet, for example, a student information sheet in a student system, and all student information is recorded in real time.

(2) TR and TB have unique primary keys and are not modified;

(3) The primary key types of TR and TB are integers, so that sorting and addition and subtraction operations can be performed.

The positioning method comprises the following steps:

s1, data sorting: sorting the operation table TR and the backup table TB according to the size of a primary key value (ascending or descending, if no special requirement exists, the ascending is generally used, because the ascending is a default sorting rule of most databases), wherein the primary key of the TR is marked as KR, and the primary key of the TB is marked as KB; the purpose of the sorting is to facilitate subsequent calculation of primary key value ranges.

S2, calculating a starting primary key value KStart and a terminating primary key value KEnd; the specific implementation method comprises the following steps:

s2.1, calculating the minimum value min (KR) of KR;

s2.2, calculating the minimum value min (KB) of the KB;

s2.3, calculating the maximum value max (KR) of KR;

s2.4, calculating the maximum value max (KB) of the KB;

as the database adopts the index technology for the primary key values by default, the speed for calculating the minimum value and the maximum value of the primary key is very high, and the data size of ten million levels can be generally inquired within 0.1 second to obtain the result through testing.

S2.5, calculating a start primary key value KStart = min (min (KR), min (KB));

s2.6, calculate termination primary key value KEnd = max (max (KR), max (KB)).

Namely: the start primary key KStart is equal to the smallest primary key of KR and KB and the end primary key KEnd is equal to the largest primary key of KR and KB.

S3, carrying out scale division on a primary key value data range [ KStart, KEnd ]; the selection of the scale size is related to the size of the data range (i.e., the KEnd-KStart) value and can be determined according to the following formula:

wherein(symbol)

The upper integer is taken; the denominator n is an element of [2]If p is less than 2, taking n =2; the whole is taken to ensure that D is more than or equal to 1. When D =1 is true, it indicates that there is only one piece of data in the packet, and that it cannot be subdivided.

The value of n can be properly adjusted according to the actual situation, and the scale value obtained by too large n is smaller, so that the number of packets is too large, the number of data comparison times is increased, and the efficiency is reduced; and too small n is obtained with a larger scale value, so that the range of each group is larger, the probability of containing difference data is increased, and a new round of division comparison process in the group is initiated. Generally, p can take the core number of the CPU of the current computer so as to support multithread concurrent computation and improve the operation efficiency.

Using the selected scale pair [ KStart, KKnd ]]Divide into the ith group G _i The data value range calculation formula of (1) is as follows:

G _i ＝[KStart _i ，KEnd _i ]＝[KStart+(i-1)×D，min(KStart+i×D-1，KEnd)]

i is more than or equal to 1; the formula conforms to the definition of the partition and ensures that the upper limit value does not exceed the KEnd, thereby avoiding an invalid data search range.

s4.1, if KStart _i ＜KEnd _i Respectively calculating the data records of the primary key values in the TR and the TB in the grouping range; since the primary key value is the only mark of the record, this step effectively enables the partitioning of the data records in TR and TB. The specific implementation method comprises the following steps:

s4.1.1, inquiring data records in TR under the following conditions: KR e [ KStart ] _i ，KEnd _i ]Obtaining a data record set as DataSetS _i (ii) a The corresponding SQL statement is: select from TR world id between

andKEnd _i (ii) a Due to the key value ofIndexing, the time consumption of the step is very short, and the results can be obtained on the level of seconds for tens of millions of data.

S4.1.2, querying primary key value range in TB in [ KStart ] _i ，KEnd _i ]Data record set DataSetB in (1) _i The corresponding SQL statement is: select from TB where id beta weenKStart _i andKEnd _i ；

if hash (DataSetR) _i )＝hash(DataSetB _i ) Showing that G is _i Dataset DataSetR within range _i And DataSetB _i Equality, absence of difference data, no need to subdivide records in the comparison data set, group G _i The operation of (1) is directly finished;

if hash (DataSetR) _i )≠hash(DataSetB _i ) Indicating that the primary key value is at G _i The data records within the range are different and need to be at G _i Carrying out progressive comparison by further dividing the scale; at this time, G _i The data is regarded as a data complete set, the grouping and the division are continuously carried out, the step S3 is returned, and the data can be divided according to G when the data division is carried out _i Take different values of n.

S4.2 if KStart _i ＝KEnd _i Indicating that there is only one primary key value data, let: k = KStart _i ＝KEnd _i Directly taking out the data records corresponding to K for comparison; the specific implementation method comprises the following steps:

S4.2.2, inquiring data records in the TB under the following conditions: KB = K, corresponding to SQL statement: select from TB where KB = K, and the query result is noted as DataB _i If the primary key value K does not exist in TB, put DataB _i = null; since K is the union from TR and TB primary keys, dataR is unlikely to occur _i And DataB _i Null cases at the same time;

s4.2.3, comparison DataR _i And DataB _i The value of (c):

if DataR _i ≠nullandDataB _i ≠nullandhash(DataR _i )≠hash(DataB _i ) If the record of the primary key K exists in TR and TB, but the hash values of the TR and the TB are different, the data modification is generated, and the record data change is as follows: changeType = update, id = K;

if DataR _i ≠nullandDataB _i = null, indicating that the running table TR has a primary key K and the backup table TB has no K, indicating that data insertion has occurred, recording data change: changeType = insert, newId = K;

if DataR _i ＝nullandDatab _i And is not null, the fact that the main key K does not exist in the operation table TR, the fact that the K exists in the backup table TB indicates that data deletion and data record change occur is as follows: changeType = delete, oldId = K.

And S5, executing the judgment operation of the step S4 on all the groups of [ KStart, KEnd ] to obtain all the difference data items in TR and TB.

The method and the steps related to the invention are suitable for the case of larger data volume (more than 10 ten thousand), and the difference data volume < = 20%. More difference data may cause more progressive partition contrast and efficiency may be reduced. Through testing, ten million levels of data volume (5 ten thousand pieces of difference data) can be searched within 2 minutes.

It will be appreciated by those of ordinary skill in the art that the embodiments described herein are intended to assist the reader in understanding the principles of the invention and are to be construed as being without limitation to such specifically recited embodiments and examples. Those skilled in the art can make various other specific changes and combinations based on the teachings of the present invention without departing from the spirit of the invention, and these changes and combinations are within the scope of the invention.

Claims

1. A multi-scale progressive difference data positioning method oriented to education industry is characterized by comprising the following steps:

wherein the symbol

The upper integer is taken; the denominator n is an element of [2]If p is<2, taking n =2;

using the selected scale pair [ KStart, KEnd]Is divided into the ith packet G _∈ The data value range calculation formula of (1) is as follows:

G _i ＝[KStart _i ，KEnd _i ]＝[KStart+(i-1)×D，min(KStart+i×D-1，KEnd)]i≥1；

s4.1, if KStart _i <KEnd _i Respectively calculating the data records of the primary key values in the TR and the TB in the grouping range;

s4.2 if KStart _i ＝KEnd _i Indicating that there is only one primary key value data, let: k = KStart _i ＝KEnd _∈ Directly taking out the data records corresponding to K for comparison;

2. The education industry-oriented multi-scale progressive difference data positioning method according to claim 1, wherein the step S2 is specifically realized by the following steps:

s2.1, calculating the minimum value min (KR) of KR;

s2.2, calculating the minimum value min (KB) of the KB;

s2.3, calculating the maximum value max (KR) of KR;

s2.4, calculating the maximum value max (KB) of the KB;

s2.6, calculate termination primary key value KEnd = max (max (KR), max (KB)).

3. The method for positioning the multi-scale progressive difference data facing the education industry according to claim 1, wherein the step S4.1 is realized by:

s4.1.1, inquiring data records in TR under the following conditions: KR e [ KStart ] _i ，KEnd _i ]The resulting data record set is DataSetR _∈ (ii) a The corresponding SQL statement is: select from TR where id between KStart _i and KEnd _i ；

S4.1.2, inquiring that the range of the primary key value is [ KStart ] in TB _i ，KEnd _i ]Data record set DataSetB in (1) _i The corresponding SQL statement is: select from TB whereid between KStart _i and KEnd _i ；

if hash (DataSetR) _i )＝hash(DataSetB _i ) Showing that G is _i Dataset DataSetR within range _i And DataSetB _i The data are equal, no difference data exists, the records in the data set do not need to be subdivided and compared, and the operation of the group Gi is directly finished;

if hash (DataSetR) _i )≠hash(DataSetB _i ) Indicating that the primary key value is at G _i The data records within the range are different and need to be at G _i Carrying out progressive comparison by further dividing the scale; at this time, G _i And (4) regarding the data as a complete set, continuously grouping and dividing the data, and returning to the step S3.

4. The method for positioning the multi-scale progressive difference data facing the education industry according to claim 1, wherein the step S4.2 is realized by a method comprising the following steps:

S4.2.2, querying the data record in the TB under the query conditions: KB = K, corresponding to SQL statement: select from TB where KB = K, and the query result is marked as DataB _i If the primary key value K does not exist in TB, put DataB _i = null; since K is the union from TR and TB primary bonds, dataR cannot occur _i And DataB _i Null cases at the same time;

s4.2.3, comparison DataR _i And DataB _i The value of (c):

if DataR _i ≠null and DataB _i ≠null and hash(DataR _i )≠hash(DataB _i ) If the record of the primary key K exists in TR and TB, but the hash values of the TR and the TB are different, the data modification is generated, and the record data change is as follows: changeType = update, id = K;

if DataR _i ≠null and DataB _∈ ≠null and hash(DataR _i )＝hash(DataB _i ) The TR and the TB are both recorded with the main key K, but the data is not changed and is not difference data, and at the moment, no record is needed;

if DataR _i ≠null and DataB _i = null, indicating that the running table TR has the primary key K and the backup table TB has no K, indicating that data insertion has occurred, recording data change: changeType = insert, newId = K;

if DataR _i ＝null and DataB _i Not equal to null, it shows that the running table TR has no main key K, and the backup table TB has K, indicating that data deletion has occurred, and recording data change: changeType = delete, oldId = K.