CN104376055B - A kind of large-sized model data comparing method based on allocation methods - Google Patents
- Publication number
- CN104376055B
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
Abstract
The invention discloses a large-model data comparison method based on fragmentation technology, comprising the following steps: setting the fragmentation parameter; retrieving all keywords of the reference data source, sorting them in ascending order, and storing them in a keyword array; calculating the number of fragments fragment_num and the number of records in each fragment, then reading the first and last keyword values of each fragment from the keyword array in order; starting one worker thread per fragment, each of which fetches the corresponding data from the reference data source and from the data source to be compared; having each worker thread compare its assigned data row by row and record the differences; and, after all worker threads have finished, merging the resulting fragment_num difference results into the final difference result. Applied between two systems or two databases, the invention can greatly improve the efficiency of comparing large model data.
Description
Technical field
The invention relates to a large-model data comparison method based on fragmentation technology, and belongs to the technical field of distribution network model management in power system automation.
Background
Distribution network models contain large volumes of data, and a single model table may well hold millions of records. For tables of this size, the traditional comparison that runs as a single workflow can take a long time.
Summary of the invention
To address the shortcomings of the prior art, the object of the invention is to provide a fragmentation-based method for comparing large model data that, applied between two systems or two databases, can greatly improve comparison efficiency.
To achieve this object, the invention adopts the following technical solution:
The fragmentation-based large-model data comparison method of the invention comprises the following steps:
(1) Set the fragmentation parameter, which can be specified in one of two ways: by number of records or by data block size. If it is specified by data block size, let the data block size be m, the length of each record in the data source to be compared be k, and the maximum number of records per fragment be n; then n=m/k. If it is specified by number of records, n is directly the maximum number of records per fragment;
(2) Retrieve all keywords of the reference data source, sort them in ascending order, and store them in a keyword array whose size equals record_sum, the total number of records in the reference data source;
(3) Calculate the number of fragments fragment_num and the number of records in each fragment, then read the first and last keyword values of each fragment from the keyword array in order; this yields the fragment information;
fragment_num=record_sum/n+(record_sum%n!=0)
where / denotes integer division and the term (record_sum%n!=0) equals 1 when the remainder is nonzero and 0 otherwise;
If the total number of records record_sum is an integer multiple of n, every fragment contains n records;
If record_sum is not an integer multiple of n, each of the first fragment_num-1 fragments contains n records and the remaining records are placed in the last fragment;
(4) Start one worker thread per fragment; using its fragment information, each worker thread fetches the corresponding data from the reference data source and from the data source to be compared;
(5) Each worker thread compares its assigned data row by row and field by field, and records the differences;
(6) After all worker threads have finished, fragment_num difference results are available; they are merged in ascending keyword order into a single result, which is the final difference result.
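Steps (1) to (3) can be illustrated with a short Python sketch. It is only an illustration under stated assumptions, not the patented implementation: the names `Fragment`, `records_per_fragment`, and `plan_fragments` are hypothetical, keywords are assumed to be integers, and rounding n=m/k down to at least 1 is an assumption the source does not spell out.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Fragment:
    first_key: int  # smallest keyword in the fragment
    last_key: int   # largest keyword in the fragment
    count: int      # number of reference records in the fragment


def records_per_fragment(block_size: int, record_length: int) -> int:
    """Step (1), block-size mode: n = m / k (assumed: rounded down, at least 1)."""
    return max(1, block_size // record_length)


def plan_fragments(keys: Sequence[int], n: int) -> List[Fragment]:
    """Steps (2)-(3): sort the keywords and cut them into runs of at most n."""
    ordered = sorted(keys)
    record_sum = len(ordered)
    # Same formula as in the text: one extra fragment if there is a remainder.
    fragment_num = record_sum // n + (record_sum % n != 0)
    fragments = []
    for i in range(fragment_num):
        chunk = ordered[i * n:(i + 1) * n]
        fragments.append(Fragment(chunk[0], chunk[-1], len(chunk)))
    return fragments
```

For example, ten keywords with n=4 give three fragments holding 4, 4, and 2 records, matching the rule that only the last fragment may be short.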
The difference result consists of a difference flag and a description of the difference content. There are three difference flags: insert, update, and delete. A record that is absent from the data source to be compared but present in the reference data source is flagged insert; a record that is present in the data source to be compared but absent from the reference data source is flagged delete; a record whose keyword is the same in both data sources but whose content differs is flagged update. The difference content description consists of the corresponding data records in the reference data source and the data source to be compared.
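The three difference flags can be sketched as follows. This is a minimal illustration, assuming each data source has already been loaded into a dictionary that maps an integer keyword to a record (itself a dictionary of field values); the function name `diff_records` and the tuple layout are hypothetical.

```python
from typing import Dict, List, Tuple

Record = Dict[str, object]               # field name -> value
Diff = Tuple[str, int, object, object]   # (flag, keyword, reference record, compared record)


def diff_records(reference: Dict[int, Record],
                 compared: Dict[int, Record]) -> List[Diff]:
    """Classify every keyword found in either source, in ascending keyword order."""
    diffs = []
    for key in sorted(reference.keys() | compared.keys()):
        ref, cmp_ = reference.get(key), compared.get(key)
        if cmp_ is None:
            # Only in the reference: the compared source would need an insert.
            diffs.append(("insert", key, ref, None))
        elif ref is None:
            # Only in the compared source: it would need a delete.
            diffs.append(("delete", key, None, cmp_))
        elif ref != cmp_:
            # Same keyword, different field values.
            diffs.append(("update", key, ref, cmp_))
    return diffs
```

Records that are identical in both sources produce no entry, so the result lists only what would have to change in the data source to be compared.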
In the invention, offering two ways of setting the fragmentation parameter, by number of records and by data block size, keeps fragmentation flexible. Dividing the fragments by keyword ensures that they are disjoint and complete, which in turn ensures that the comparison does no redundant work and that the difference result is complete. Multiple worker threads read and compare data simultaneously according to their own fragment information, so the comparison runs concurrently and overall efficiency improves. Recording the differences between the records of the data source to be compared and those of the reference data source as a difference flag plus a difference content description makes it easy to assemble the SQL statements needed for synchronization from the difference result.
Description of drawings
Fig. 1 is a flowchart of the fragmentation-based large-model data comparison method of the invention.
Detailed description
To make the technical means, inventive features, objectives, and effects of the invention easy to understand, the invention is further described below with reference to specific embodiments.
The invention is a large-model data comparison method based on fragmentation technology. Model tables in a distribution network system generally have keywords, which makes it possible to compare them fragment by fragment according to keyword. The invention is aimed mainly at model data tables with many records: the fragmentation parameter is set as needed, the fragment contents are fetched from the reference data source and the data source to be compared accordingly, multiple fragments are compared simultaneously, and the difference result is finally obtained. The difference result consists of a difference flag and a difference content description. There are three difference flags (insert, update, and delete), and the difference content description is the content of the corresponding records in the reference data source and the data source to be compared. The difference result is generated for the data source to be compared, with the reference data source as the baseline.
Referring to Fig. 1, the method comprises the following steps:
(1) Specify the data sources to compare and the model table to be compared. Supported data source types include databases and data files, and the model table must have a keyword. Set the fragmentation parameter as needed, either by number of records or by data block size.
If the fragmentation parameter is set by data block size, suppose the data block size is m, the length of each record in the table is k, and the corresponding maximum number of records per fragment is n; then n=m/k. If the fragmentation parameter is set by number of records, n is simply the value that was set.
(2) Retrieve all keywords of the reference data source, sort them in ascending order, and store them in a keyword array whose size equals record_sum, the total number of records in the reference data source.
Using the keyword array and the fragmentation parameter, obtain the fragment information, which includes the number of fragments and the first and last keyword values of each fragment.
(3) Calculate the number of fragments and the number of records in each fragment, then read the first and last keyword values of each fragment from the keyword array in order;
The number of fragments fragment_num is:
fragment_num=record_sum/n+(record_sum%n!=0)
If the total number of records record_sum is an integer multiple of n, every fragment contains n records;
If record_sum is not an integer multiple of n, each of the first fragment_num-1 fragments contains n records and the remaining records are placed in the last fragment.
(4) Start one worker thread per fragment, and fetch the corresponding content of the reference data source and the data source to be compared according to the fragment information;
(5) Each worker thread compares its assigned data row by row and records the differences. The difference result contains a difference flag and a difference content description. There are three difference flags (insert, update, and delete), and the difference content description is the content of the corresponding records in the reference data source and the data source to be compared.
(6) Once all worker threads have finished comparing, fragment_num difference results are available; they are merged into a single result, which is the final difference result.
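The six steps above can be sketched end to end in Python. This is an illustration under several assumptions, not the patented implementation: both data sources are stood in for by in-memory dictionaries, all function names are hypothetical, and each fragment is queried as a half-open keyword range that is left open below the first fragment and above the last. The source specifies only the first and last keyword of each fragment; the open-ended ranges are an added assumption so that records existing only in the data source to be compared are not missed. In standard CPython, threads mainly help when fetching data involves I/O waits.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Optional, Tuple

Table = Dict[int, dict]                       # keyword -> record (field -> value)
Diff = Tuple[str, int, Optional[dict], Optional[dict]]
Bounds = Tuple[Optional[int], Optional[int]]  # half-open keyword range [low, high)


def fragment_bounds(reference: Table, n: int) -> List[Bounds]:
    """Steps (2)-(3): cut the sorted reference keywords into runs of at most n."""
    keys = sorted(reference)
    starts = keys[::n]  # first keyword of every fragment
    if not starts:
        return [(None, None)]  # empty reference: one fragment covers everything
    lows = [None] + starts[1:]   # first fragment is open below (assumption)
    highs = starts[1:] + [None]  # last fragment is open above (assumption)
    return list(zip(lows, highs))


def fetch(table: Table, bounds: Bounds) -> Table:
    """Stand-in for a keyword range query against a real data source."""
    low, high = bounds
    return {k: v for k, v in table.items()
            if (low is None or k >= low) and (high is None or k < high)}


def compare_fragment(reference: Table, compared: Table, bounds: Bounds) -> List[Diff]:
    """Steps (4)-(5): load one fragment from both sources and compare it."""
    ref, cmp_ = fetch(reference, bounds), fetch(compared, bounds)
    diffs: List[Diff] = []
    for key in sorted(ref.keys() | cmp_.keys()):
        if key not in cmp_:
            diffs.append(("insert", key, ref[key], None))
        elif key not in ref:
            diffs.append(("delete", key, None, cmp_[key]))
        elif ref[key] != cmp_[key]:
            diffs.append(("update", key, ref[key], cmp_[key]))
    return diffs


def compare_tables(reference: Table, compared: Table, n: int) -> List[Diff]:
    """Step (6): one task per fragment, results merged in keyword order."""
    bounds = fragment_bounds(reference, n)
    with ThreadPoolExecutor(max_workers=len(bounds)) as pool:
        parts = pool.map(lambda b: compare_fragment(reference, compared, b), bounds)
        # map() yields results in fragment order, so concatenation stays sorted.
        return [d for part in parts for d in part]
```

Because the ranges do not overlap and together cover every possible keyword, each record is examined by exactly one task, which mirrors the disjointness and completeness properties the description relies on.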
The working principle of the invention is as follows:
The invention is aimed mainly at comparing large model data tables that share the same structure and hold many records but reside in different systems or different databases, and at obtaining the differences between them. The fragment information is set according to the keywords of the reference data source, the fragment contents are then fetched from the reference data source and the data source to be compared, multiple fragments are compared simultaneously, and the difference result is finally obtained. The method implements fragmented comparison and greatly improves the efficiency of comparing large model data with many records.
In the invention, allowing the fragmentation parameter to be set either by number of records or by data block size provides a degree of flexibility in fragmentation. Dividing the fragments by keyword ensures that they are disjoint and complete, which in turn ensures that the comparison does no redundant work and that the difference result is complete. Multiple worker threads read and compare data simultaneously according to their own fragment information, so the comparison runs concurrently and overall efficiency improves. Recording the differences between the records of the data source to be compared and those of the reference data source as a difference flag plus a difference content description makes it easy to assemble the SQL statements needed for synchronization from the difference result.
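How SQL statements might be assembled from a difference result can be sketched as follows. The source does not give the statement format, so everything here is an assumption for illustration: the function name `to_sql`, the table name `model_table`, the key column `id`, the `?` placeholder style, and the direction of synchronization (bringing the data source to be compared in line with the reference). Records are assumed to be dictionaries of non-key fields.

```python
from typing import Optional, Tuple

Diff = Tuple[str, int, Optional[dict], Optional[dict]]
Statement = Tuple[str, tuple]  # (SQL text with ? placeholders, parameters)


def to_sql(diff: Diff, table: str = "model_table", key_col: str = "id") -> Statement:
    """Turn one difference entry into a parameterized statement.

    Table and column names are interpolated directly, so they must come
    from trusted configuration; only values are passed as parameters.
    """
    flag, key, ref, _compared = diff
    if flag == "insert":
        cols = [key_col] + list(ref)
        marks = ", ".join("?" for _ in cols)
        return (f"INSERT INTO {table} ({', '.join(cols)}) VALUES ({marks})",
                (key, *ref.values()))
    if flag == "update":
        sets = ", ".join(f"{c} = ?" for c in ref)
        return (f"UPDATE {table} SET {sets} WHERE {key_col} = ?",
                (*ref.values(), key))
    if flag == "delete":
        return (f"DELETE FROM {table} WHERE {key_col} = ?", (key,))
    raise ValueError(f"unknown difference flag: {flag}")
```

Insert and update statements take their values from the reference record, while a delete needs only the keyword, which is why the difference content description carries both records.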
With the method of the invention, large model tables are compared using fragmentation, which can greatly improve comparison efficiency. Leaving machine performance and resource usage aside, the speedup over the unfragmented comparison is close to the number of fragments.
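The speedup estimate can be made explicit with a rough, idealized calculation. It assumes a hypothetical constant cost c per compared record, assumes that all fragments really run at the same time, and ignores the time spent retrieving and sorting the keywords and merging the results; none of these assumptions comes from the source.

```latex
T_{\text{single}} \approx c \cdot \text{record\_sum}, \qquad
T_{\text{fragmented}} \approx c \cdot n, \qquad
\frac{T_{\text{single}}}{T_{\text{fragmented}}}
  \approx \frac{\text{record\_sum}}{n} \approx \text{fragment\_num}
```

Under these assumptions the fragmented run takes about as long as its largest fragment, which holds n records, so the ratio approaches fragment_num.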
The basic principles, main features, and advantages of the invention have been shown and described above. Those skilled in the art should understand that the invention is not limited by the above embodiments; the embodiments and the description merely illustrate the principles of the invention. Various changes and improvements may be made without departing from the spirit and scope of the invention, and all such changes and improvements fall within the scope of the claimed invention. The scope of protection of the invention is defined by the appended claims and their equivalents.
Claims (2)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410614042.4A CN104376055B (en) | 2014-11-04 | 2014-11-04 | A kind of large-sized model data comparing method based on allocation methods |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104376055A CN104376055A (en) | 2015-02-25 |
CN104376055B true CN104376055B (en) | 2017-08-29 |
Family
ID=52554962
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410614042.4A Active CN104376055B (en) | 2014-11-04 | 2014-11-04 | A kind of large-sized model data comparing method based on allocation methods |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104376055B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106033427A (en) * | 2015-03-11 | 2016-10-19 | 阿里巴巴集团控股有限公司 | A sampling data verification method and device |
CN105843886A (en) * | 2016-03-21 | 2016-08-10 | 国电南瑞科技股份有限公司 | Multi-thread based power grid offline model data query method |
CN106777337A (en) * | 2017-01-13 | 2017-05-31 | 山东浪潮商用系统有限公司 | The management method of data model |
CN116308848A (en) * | 2023-03-28 | 2023-06-23 | 中国工商银行股份有限公司 | Information processing method, device, computer equipment and storage medium |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1652116A (en) * | 2005-03-29 | 2005-08-10 | 威盛电子股份有限公司 | Database Synchronization System and Method |
CN101236554A (en) * | 2007-11-29 | 2008-08-06 | 中兴通讯股份有限公司 | A Method of Mass Data Comparison in Database |
CN102467570A (en) * | 2010-11-17 | 2012-05-23 | 日电(中国)有限公司 | Connection query system and method for distributed data warehouse |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP1708096A1 (en) * | 2005-03-31 | 2006-10-04 | Ubs Ag | Computer Network System and Method for the Synchronisation of a Second Database with a First Database |
Also Published As
Publication number | Publication date |
---|---|
CN104376055A (en) | 2015-02-25 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||