WO2012079460A1 - Data repeatability verification method, apparatus, and system - Google Patents

Data repeatability verification method, apparatus, and system

Info

Publication number
WO2012079460A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
character
node
parameters
parallel
Prior art date
Application number
PCT/CN2011/083206
Other languages
English (en)
French (fr)
Inventor
刘洋
Original Assignee
成都市华为赛门铁克科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 成都市华为赛门铁克科技有限公司
Publication of WO2012079460A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G06F16/9027 Trees
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/123 Storage facilities

Definitions

  • Embodiments of the present invention relate to data processing technologies, and in particular to a data repeatability verification method, apparatus, and system.
  • Background
  • Existing data repeatability verification methods fall into two types: one checks for duplicates before the new data is inserted; the other checks for duplicates after the new data is inserted. Both rely on the database to compare the data item by item. With database-dependent verification, however, the speed and efficiency of the check drop significantly as the amount of data grows.
  • Summary of the invention
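  • Embodiments of the present invention provide a data repeatability verification method, apparatus, and system to improve the efficiency of data repeatability verification.
  • An embodiment of the present invention provides a data repeatability verification method, including:
  • matching the parameters of each character of the data against the parameters of the nodes in a parallel index tree, where each node of the parallel index tree corresponds to one character, and the parameters of a node include at least the string length of the data containing the character and the position of the character in the string; and
  • determining, according to the matching result of each character, whether the data duplicates the stored data, and if not, storing the parameters of each character of the data into the parallel index tree as node parameters.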
  • An embodiment of the present invention provides a data repeatability verification apparatus, including:
  • a parallel index tree storage module, configured to store the parameters of each node in the parallel index tree, where each node of the parallel index tree corresponds to one character, and the parameters of a node include at least the string length of the data containing the character and the position of the character in the string;
  • a parameter matching module, configured to match the parameters of each character of the data against the parameters of the nodes in the parallel index tree; and
  • a parallel repeatability determining module, configured to determine, according to the matching result of each character, whether the data duplicates the stored data, and if not, to store the parameters of each character of the data into the parallel index tree as node parameters.
  • An embodiment of the present invention further provides a data application system, including:
  • an application server, configured to receive data input by a user and provide the data to the verification server for repeatability verification;
  • a verification server, configured to: match the parameters of each character of the received data against the parameters of the nodes in a parallel index tree, where each node of the parallel index tree corresponds to one character, and the parameters of a node include at least the string length of the data containing the character and the position of the character in the string; determine, according to the matching result of each character, whether the data duplicates the stored data; and if not, store the parameters of each character of the data into the parallel index tree as node parameters and provide the data to the database server for storage; and
  • a database server, configured to store the data.
  • Brief description of the drawings
  • FIG. 1 is a flowchart of a data repeatability verification method according to Embodiment 1 of the present invention.
  • FIG. 2 is a flowchart of a data repeatability verification method according to Embodiment 2 of the present invention.
  • FIG. 3 is a schematic diagram of a tree structure of stored data in an embodiment of the present invention.
  • FIG. 4 is a flowchart of a data repeatability verification method according to Embodiment 3 of the present invention.
  • FIG. 5 is a schematic structural diagram of a data repeatability verification apparatus according to Embodiment 6 of the present invention.
  • FIG. 6 is a schematic structural diagram of a data repeatability verification apparatus according to Embodiment 7 of the present invention.
  • FIG. 7 is a schematic structural diagram of a parameter matching module in a data repeatability verification apparatus according to Embodiment 8 of the present invention.
  • FIG. 8 is a schematic structural diagram of a data repeatability verification apparatus according to Embodiment 9 of the present invention.
  • FIG. 9 is a schematic structural diagram of a data application system according to Embodiment 10 of the present invention.
  • Detailed description
  • FIG. 1 is a flowchart of a data repeatability verification method according to Embodiment 1 of the present invention. The method applies to any situation in which data must be checked for duplicates, for example, checking whether a newly added user name duplicates a stored user name, or whether the file name of a file newly added to a file system duplicates a file name already in use.
  • The data repeatability verification method of this embodiment may be performed by a data operating system, which may be a device implemented by a combination of hardware and software. The method includes the following steps:
  • Step 110: Match the parameters of each character of the data against the parameters of each node in a parallel index tree, where each node of the parallel index tree corresponds to one character, and the parameters of each node include at least the string length of the data containing the character and the position of the character in the string.
  • Step 120: Determine, according to the matching result of each character, whether the data duplicates the stored data. If not, perform step 130; if yes, the data is considered a duplicate of stored data, and step 140 is performed.
  • Step 130: Store the parameters of each character of the data into the parallel index tree as node parameters. The data may then be accepted and persisted, that is, stored in the database, and the process ends.
  • Step 140: The data may be rejected or discarded directly. If a file path is duplicated, the file storage path may be rewritten.
  • In step 120, because the parameters of each character are stored into the parallel index tree as node parameters every time data is stored, matching the character parameters of newly added data in the parallel index tree determines whether the data is a duplicate.
  • The technical solution of this embodiment verifies data repeatability by indexing each character of the data in parallel.
  • The parallel index tree may take the form of a B+ tree with multiple nodes arranged layer by layer, each node corresponding to one character.
  • The parameters of each node include at least the string length of the data containing the character and the position of the character in the data string. When character parameters are matched against node parameters, every character of the data can be matched against its corresponding node at the same time, which increases matching speed.
  • If any one character finds no corresponding node, the data has not been stored and does not duplicate existing data. If a corresponding node is found for every character, the probability that the data has already been stored is relatively high.
  • In the latter case there are two ways to proceed. One is to treat the data as a duplicate by default and discard it directly; this is more accurate when the data string is short. The other is to continue with a more precise repeatability check, which is more necessary when the data string is long. Whichever is used, because the data is first checked in parallel, the repeatability of at least part of the data is settled at that stage, so the efficiency of the repeatability check improves to some extent.
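  • As an illustration only, steps 110 to 140 can be sketched in Python as follows. The class name `ParallelIndex`, its method names, and the dictionary-of-sets layout are assumptions made for this sketch; the text describes a B+ tree form and does not prescribe this structure. The characters are checked in a simple loop here rather than concurrently.

```python
from collections import defaultdict


class ParallelIndex:
    """Maps each character to the set of (string length, position) pairs seen so far."""

    def __init__(self):
        self.nodes = defaultdict(set)

    def params(self, data):
        # One (character, (string length, 1-based position)) pair per character.
        return [(ch, (len(data), pos)) for pos, ch in enumerate(data, start=1)]

    def maybe_duplicate(self, data):
        # Steps 110/120: every character must find matching node parameters.
        # False means the data is definitely new; True only means "probably stored".
        return all(p in self.nodes.get(ch, ()) for ch, p in self.params(data))

    def check_and_store(self, data):
        if self.maybe_duplicate(data):
            return False              # step 140: reject or discard
        for ch, p in self.params(data):
            self.nodes[ch].add(p)     # step 130: store character parameters as node parameters
        return True
```

  • With "a", "ad", "an", "adm", "adn", and "and" stored, `maybe_duplicate("adi")` is false because "i" has no node, while `maybe_duplicate("add")` is true even though "add" was never stored. The second case is why a full match only indicates a high probability of duplication.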
  • The characters in the data correspond to nodes one to one, and a character may correspond to a node directly. However, to increase computation speed and generality, before the parameters of each character of the data are matched against the stored node parameters in the parallel index tree, each character of the data may first be converted into a data identifier, for example in byte form, so that each node corresponds to a character through the character's data identifier.
  • Each character is identified by numeric bytes, for example "0001" for "up" and "0002" for "z"; any scheme works as long as the data identifier uniquely identifies the character.
  • A character here may be a single digit, punctuation mark, English letter, Chinese character, or some bytes in the data, or a combination of these elements; for example, the file name suffix ".PDF" can be defined as one character.
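  • A minimal sketch of the conversion into data identifiers, assuming identifiers are assigned sequentially as four-digit strings and that multi-character units such as ".PDF" are recognized only as suffixes. The names `tokenize` and `IdentifierTable` are hypothetical.

```python
def tokenize(name, multi_char_tokens=(".PDF",)):
    """Split a string into 'characters', treating a configured suffix as one unit."""
    for tok in multi_char_tokens:
        if name.endswith(tok):
            return list(name[:-len(tok)]) + [tok]
    return list(name)


class IdentifierTable:
    """Assigns each distinct character a unique numeric identifier such as '0001'."""

    def __init__(self):
        self.ids = {}

    def identify(self, token):
        if token not in self.ids:
            self.ids[token] = f"{len(self.ids) + 1:04d}"
        return self.ids[token]

    def encode(self, tokens):
        return [self.identify(t) for t in tokens]
```

  • The only property the scheme relies on is uniqueness: the same character always maps to the same identifier, and different characters never share one.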
  • FIG. 2 is a flowchart of a data repeatability verification method according to Embodiment 2 of the present invention. This embodiment builds on Embodiment 1 and adds a serial index on top of the parallel index. In this embodiment, after the data is determined to duplicate the stored data according to the matching result of each character, the following operations are also performed:
  • Step 210: Match the index of the data in an auxiliary index table, where the auxiliary index table includes the indexes of the stored data. The index of stored data may be the data string itself, or a numeric index string composed of data identifiers.
  • The matching in step 210 is an index lookup match. Any of the data matching schemes in the prior art can be used; they are means of performing an exact search based on the data itself.
  • Step 220: Determine, according to the matching result of the data in the auxiliary index table, whether the data duplicates the stored data. If yes, perform step 230; if not, perform step 240.
  • Step 230: Generate a data repetition result, that is, the data duplicates the stored data, and the process ends.
  • The technical solution of this embodiment increases the speed of the data repeatability check through the parallel index, and ensures the uniqueness and accuracy of the check through the serial exact index.
  • Although the exact index check still depends on the index of the data itself, much of the data is first excluded by the parallel index, so the amount of data that needs the exact check is significantly reduced, and the efficiency of the data repeatability check improves to a certain extent. This advantage is especially noticeable when data strings are short, and user names, file names, and the like are mostly short strings of up to 20 characters.
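  • The two-stage check can be sketched as follows. This is an assumption-laden illustration: the auxiliary index table is modeled as a Python set holding the data strings themselves, and the `store` function follows the apparatus description, in which non-duplicate data is written to both the parallel index tree and the auxiliary index table.

```python
from collections import defaultdict

nodes = defaultdict(set)   # character -> {(string length, position)}
auxiliary_index = set()    # exact index of stored data (here: the strings themselves)


def params(data):
    return [(ch, (len(data), pos)) for pos, ch in enumerate(data, start=1)]


def is_duplicate(data):
    # Stage 1 (parallel index): a single unmatched character proves the data is new.
    if not all(p in nodes.get(ch, ()) for ch, p in params(data)):
        return False
    # Stage 2 (steps 210 to 220): exact lookup, reached only by data that passed stage 1.
    return data in auxiliary_index


def store(data):
    for ch, p in params(data):
        nodes[ch].add(p)
    auxiliary_index.add(data)
```

  • In this sketch "add" passes stage 1 when "adm", "adn", and "and" are stored, and stage 2 then correctly reports it as new.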
  • Before the parameters of each character of the data are matched against the node parameters in the parallel index tree, the method may further include: intercepting a set number of characters from the data as the characters to be matched against the node parameters in the parallel index tree, so that the parallel index covers only the intercepted characters.
  • The number of characters to intercept can be set according to the amount of data to be stored and the number of characters.
  • For example, at most 7 characters are intercepted from the data each time, and the remaining characters are not indexed in parallel. Note that although fewer characters are indexed, the string length of the data used in the parallel index is not truncated. Because parallel indexing has clear advantages and higher reliability when there are fewer characters, intercepting a limited number of characters preserves the efficiency and accuracy advantages of the parallel index while avoiding unnecessary parallel indexing of an excessive number of characters.
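  • A short sketch of the interception, assuming the leading characters are the ones kept (the text does not say which characters are intercepted). The constant `MAX_INDEXED_CHARS` and the function name are hypothetical.

```python
MAX_INDEXED_CHARS = 7  # the example limit given in the text


def truncated_params(data, limit=MAX_INDEXED_CHARS):
    """Return parameters for at most `limit` characters, keeping the full string length."""
    full_length = len(data)  # the layer still uses the untruncated length
    return [(ch, (full_length, pos)) for pos, ch in enumerate(data[:limit], start=1)]
```

  • For a 13-character string, only seven parameter tuples are produced, but each still carries the length 13.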
  • The parallel index tree can be organized in various forms, for example as multiple parallel index trees each rooted at the first character of the data, or as multiple parallel index trees rooted at another data parameter such as the string length of the data, the user name type, or the file name type. Matching works similarly for each form.
  • The following explanation takes the first character of the data as the root node of the parallel index tree.
  • A simplified example is used for clarity. Assume that the data "a", "ad", "an", "adm", "adn", and "and" has been stored in the database; FIG. 3 shows the tree structure of the stored data.
  • The new data to be added is "adi".
  • The first letter "a" serves as the root node, and each character serves as a node.
  • Each character may appear in different data and therefore at different layers and positions.
  • Each layer and position combination is a parameter of a node corresponding to the character, so one character may correspond to multiple nodes.
  • Alternatively, all the layer number and position combinations of the character may be stored as the parameters of a single node, so that one character corresponds to one node.
  • The parameter value of each character in each node is recorded in the form "character data identifier" {layer number 1 [position 1], layer number 2 [position 2], ...}.
  • The parameters of the node can be mounted under each character.
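  • To make the notation concrete, the following sketch derives the node parameters from the six example strings, taking the layer number to be the string length as in the matching steps of Embodiment 3. The values are computed from the strings listed above, not copied from the patent's table, and the character itself is printed in place of a data identifier. The function names are hypothetical.

```python
from collections import defaultdict


def build_nodes(stored):
    """Collect, for each character, every (layer, position) at which it occurs."""
    nodes = defaultdict(set)
    for data in stored:
        for pos, ch in enumerate(data, start=1):
            nodes[ch].add((len(data), pos))
    return nodes


def render(ch, params):
    """Format one node as: "character" {layer[position], ...}."""
    body = ", ".join(f"{layer}[{pos}]" for layer, pos in sorted(params))
    return f'"{ch}" {{{body}}}'


nodes = build_nodes(["a", "ad", "an", "adm", "adn", "and"])
for ch in sorted(nodes):
    print(render(ch, nodes[ch]))
```

  • For "d" this yields `"d" {2[2], 3[2], 3[3]}`, which agrees with the two third-layer values 3[2] and 3[3] used in the "adi" example.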
  • FIG. 4 is a flowchart of a data repeatability verification method according to Embodiment 3 of the present invention.
  • This embodiment builds on the foregoing embodiments. Specifically, there are multiple parallel index trees, the root node of each parallel index tree corresponds to the first character of a data string, and the operation of matching the parameters of each character of the data against the node parameters in the parallel index tree includes:
  • Step 420: Perform the following search matching operation for each character in the data in parallel, that is, for "a", "d", and "i" respectively. Taking the matching of "d" as an example, the operation is as follows:
  • Step 421: In the selected parallel index tree, find the layer of the node according to the length of the data string. The string length of "adi" is 3, so the third-layer parameters are looked up in the nodes corresponding to "d", and two values are found: 3[2] and 3[3].
  • Step 422: In the layer found, find the matching node according to the position of the character in the data string, and generate a character search matching result. "d" is at the second character position of "adi", so 3[2] is matched. A similar lookup is made for the "i" character, but the result is negative: there is no matching node.
  • Step 430: When the character search matching result of any one character is negative, generate a negative data search matching result and stop the search matching operations of the other characters. If selecting a parallel index tree by the first letter yields nothing, this is equivalent to a negative character search matching result for that character.
  • The characters are verified concurrently, and as soon as one character fails to match, the data can be determined not to be a duplicate and the search matching operations of the other characters can be terminated, so verification is efficient.
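  • The per-character search and early stop can be sketched with a thread pool. The nested dictionary `trees` is a hypothetical in-memory layout populated with the FIG. 3 example data, and the function names are assumptions. In CPython the threads mainly illustrate the control flow; real parallel speed-up would depend on the runtime. The sketch assumes non-empty input.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical layout: first character -> character -> layer -> set of positions.
trees = {
    "a": {
        "a": {1: {1}, 2: {1}, 3: {1}},
        "d": {2: {2}, 3: {2, 3}},
        "n": {2: {2}, 3: {2, 3}},
        "m": {3: {3}},
    }
}


def match_char(tree, ch, length, pos):
    layer = tree.get(ch, {}).get(length, set())  # step 421: find the layer by string length
    return pos in layer                          # step 422: find the position in that layer


def parallel_match(data):
    tree = trees.get(data[0])                    # select the tree by first character
    if tree is None:
        return False                             # empty selection counts as a failed match
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(match_char, tree, ch, len(data), pos)
                   for pos, ch in enumerate(data, start=1)]
        for fut in as_completed(futures):
            if not fut.result():                 # step 430: one miss settles the result
                for other in futures:
                    other.cancel()               # best effort; running tasks are not interrupted
                return False
    return True
```

  • For "adi", the lookups for "a" and "d" succeed and the lookup for "i" fails, so `parallel_match("adi")` returns false.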
  • The parameters of each node of the parallel index tree may further include a pre-coordinate and/or a post-coordinate of the character.
  • The pre-coordinate refers to a parameter of the character or characters before the character, such as the data identifier of the previous character or of the previous two characters; correspondingly, the post-coordinate refers to a parameter of the character or characters after the character.
  • When the node parameters include the pre-coordinate and/or the post-coordinate, after the matching node is found in the layer according to the position of the character in the data string and before the character search matching result is generated, the following operation is also performed: the preceding and/or following characters of the character are matched for consistency against the pre-coordinate and/or post-coordinate in the parameters of the matched node.
  • The node parameters may include only the pre-coordinate, only the post-coordinate, or both; the choice can weigh the execution speed against the accuracy of the matching operation.
  • the parameter form of the node after setting the front and back coordinates is ⁇ layer number 1 [position 1: ⁇
  • The node values of the parallel index tree for the data shown in FIG. 3 are listed in Table 2:
  • In step 422, taking the matching of the "d" character as an example, after the matching node is found in the layer and before the character search matching result is generated, the preceding character and the following character of "d" are also matched against the pre-coordinate and post-coordinate in the parameters of the matched node.
  • Matching the pre-coordinate and post-coordinate further improves the accuracy of parallel matching during repeatability verification, reduces the need for serial matching, and improves verification efficiency.
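  • A sketch of matching with pre- and post-coordinates, assuming each coordinate is the single neighboring character and that the sentinels "^" and "$" (hypothetical, and assumed not to occur in the data) mark a missing neighbor. Node parameters become (string length, position, previous character, next character).

```python
from collections import defaultdict

START, END = "^", "$"   # hypothetical sentinels for "no previous/next character"


def params_with_coords(data):
    padded = [START] + list(data) + [END]
    return [(padded[i], (len(data), i, padded[i - 1], padded[i + 1]))
            for i in range(1, len(data) + 1)]


nodes = defaultdict(set)
for stored in ["a", "ad", "an", "adm", "adn", "and"]:
    for ch, p in params_with_coords(stored):
        nodes[ch].add(p)


def maybe_duplicate(data):
    # A character matches only if its length, position, and both neighbors agree.
    return all(p in nodes.get(ch, ()) for ch, p in params_with_coords(data))
```

  • With length and position alone, "add" would match every node; with the coordinates, the "d" at position 2 expects "m" or "n" after it, so "add" is rejected in the parallel stage. The coordinates reduce false matches but are not claimed here to eliminate them.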
  • The data repeatability verification method of Embodiment 5 of the present invention builds on the foregoing embodiments.
  • The parameters of a parallel index tree node further include the number of occurrences of the character. Continuing the foregoing example, the count records how many times a character occurs at the same position of the same layer. For example, if the "a" character appears twice at the first character position of the second layer, "2" is recorded as a parameter of the node.
  • The parameter form of the node is {layer number 1 [position 1: number of times], layer number 2 [position 2: number of times], ...}.
  • The node parameters including the number of occurrences in the parallel index tree for the data shown in FIG. 3 are listed in Table 3:
  • The method further includes: after the parameters of each character of the data are stored into the parallel index tree as node parameters, increasing the number of occurrences in the parameters of the corresponding node by one; and when data is deleted, finding the corresponding nodes in the parallel index tree according to each character of the deleted data and decreasing the number of occurrences in the parameters of the found nodes by one.
  • For example, the "d" character appears twice at the second position of the third layer, recorded as 3[2:2]. When "adn" is deleted, 3[2:2] is changed to 3[2:1], which still represents the "d" of "adm" while removing the contribution of "adn".
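  • The occurrence counts can be sketched with a counter keyed by (character, string length, position). The function names are hypothetical, and `remove` assumes it is only called for data that was previously added.

```python
from collections import Counter

counts = Counter()   # (character, string length, position) -> number of occurrences


def add(data):
    for pos, ch in enumerate(data, start=1):
        counts[(ch, len(data), pos)] += 1


def remove(data):
    for pos, ch in enumerate(data, start=1):
        key = (ch, len(data), pos)
        counts[key] -= 1
        if counts[key] <= 0:
            del counts[key]   # the node parameter disappears with its last occurrence


def render(ch, length, pos):
    """Format one parameter as layer[position:count]."""
    return f"{length}[{pos}:{counts[(ch, length, pos)]}]"
```

  • After the six example strings are added, the "d" parameter for layer 3, position 2 renders as 3[2:2]; removing "adn" changes it to 3[2:1] and removes the parameter for "n" at layer 3, position 3 entirely.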
  • The technical advantages of the embodiments of the present invention are particularly significant as the amount of data grows.
  • The speed of the prior-art duplicate check that relies solely on the database drops significantly, and if the database needs to be partitioned, the check can no longer rely on the database, or doing so is unacceptably costly.
  • In addition, with the prior art relying solely on the database, when concurrent access pressure is high and many duplicates occur, the database reports a very large number of exceptions, which affects its performance and stability.
  • The technical solution of the embodiments of the present invention overcomes these defects and checks the uniqueness of individual data items without relying on the database. Verification efficiency is not affected by the amount of data to be stored; even a massive amount of data does not affect the verification logic or efficiency. Because the parallel index tree occupies little storage space, it can support changes in system partitioning when the data is massive, and it applies equally to ordinary and distributed systems. Regardless of the number of databases and whether the database is distributed, the parallel index tree and the auxiliary index table occupy little storage space and can be stored centrally, so the repeatability check is not affected by the form of the database, and no additional work is needed to adapt to changes in that form.
  • The storage needed for index matching is also reduced.
  • The number of characters such as Chinese characters, English letters, and digits is about 6,000.
  • The division into parallel index trees is a virtual structure; what is physically stored is the parameters of each node of the parallel index tree and the auxiliary index table. The index storage occupied by these characters can be held entirely in memory, which further increases matching speed.
  • FIG. 5 is a schematic structural diagram of a data repeatability verification apparatus according to Embodiment 6 of the present invention.
  • The apparatus includes a parallel index tree storage module 510, a parameter matching module 520, and a parallel repeatability determining module 530.
  • The parallel index tree storage module 510 is configured to store the parameters of each node in the parallel index tree. Each node of the parallel index tree corresponds to one character, and the parameters of a node include at least the string length of the data containing the character and the position of the character in the string.
  • The parameter matching module 520 is configured to match the parameters of each character of the data against the node parameters in the parallel index tree. The parallel repeatability determining module 530 is configured to determine, according to the matching result of each character, whether the data duplicates the stored data, and if not, to store the parameters of each character of the data into the parallel index tree as node parameters.
  • The technical solution of this embodiment verifies data repeatability by indexing each character of the data in parallel, which improves the efficiency of the repeatability check.
  • FIG. 6 is a schematic structural diagram of a data repeatability verification apparatus according to Embodiment 7 of the present invention. This embodiment builds on Embodiment 6 and further includes an auxiliary index table storage module 540, an index matching module 550, and a serial repeatability determining module 560.
  • The auxiliary index table storage module 540 is configured to store an auxiliary index table, where the auxiliary index table includes the indexes of the stored data.
  • The index matching module 550 is configured to match the index of the data in the auxiliary index table when the parallel repeatability determining module 530 determines, according to the matching result of each character, that the data duplicates the stored data.
  • The serial repeatability determining module 560 is configured to determine, according to the matching result of the data in the auxiliary index table, whether the data duplicates the stored data. If yes, a data repetition result is generated. If not, the parallel repeatability determining module 530 is instructed to store the parameters of each character of the data into the parallel index tree as node parameters, and the index of the data is stored in the auxiliary index table.
  • The technical solution of this embodiment increases the speed of the data repeatability check through the parallel index, and ensures the uniqueness and accuracy of the check through the serial exact index.
  • The apparatus may further include a character intercepting module 570, connected to the parameter matching module 520 and configured to intercept a set number of characters from the data, before the parameters of each character of the data are matched against the node parameters in the parallel index tree, as the characters to be matched against the node parameters in the parallel index tree.
  • Intercepting a set number of characters controls the amount of work done for parallel indexing.
  • The apparatus may further include a data conversion module 580, connected to the parameter matching module 520 and configured to convert each character of the data into a data identifier before the parameters of each character of the data are matched against the stored node parameters in the parallel index tree, where each node corresponds to a character through the character's data identifier, and the index of the data is a numeric index string composed of data identifiers.
  • Representing characters by data identifiers reduces the amount of computation for matching and indexing, and also makes the data index independent of database storage.
  • FIG. 7 is a schematic structural diagram of a parameter matching module in a data repeatability verification apparatus according to Embodiment 8 of the present invention.
  • This embodiment may build on Embodiment 6 or 7.
  • There are multiple parallel index trees, and the root node of each parallel index tree corresponds to the first character of a data string. The parameter matching module 520 includes an index tree selecting unit 521, one or more search matching units 522, and a result generating unit 523.
  • The index tree selecting unit 521 is configured to select the corresponding parallel index tree according to the first character of the data string.
  • The search matching units 522 are configured to perform the search matching operation for each character in the data in parallel.
  • Each search matching unit 522 includes a layer selection sub-unit 5221 and a node matching sub-unit 5222.
  • The layer selection sub-unit 5221 is configured to find, in the selected parallel index tree, the layer of the node according to the length of the data string. The node matching sub-unit 5222 is configured to find the matching node in that layer according to the position of the character in the data string and generate a character search matching result.
  • The result generating unit 523 is configured to generate a negative data search matching result when the character search matching result of any one character is negative, and to stop the search matching operations of the other characters.
  • The parameters of each node of the parallel index tree may further include a pre-coordinate and/or a post-coordinate of the character. In that case each search matching unit 522 further includes a coordinate matching sub-unit 5223, configured to, after the matching node is found in the layer according to the position of the character in the data string and before the character search matching result is generated, match the preceding and/or following characters of the character for consistency against the pre-coordinate and/or post-coordinate in the parameters of the matched node.
  • FIG. 8 is a schematic structural diagram of a data repeatability verification apparatus according to Embodiment 9 of the present invention.
  • This embodiment may build on the foregoing apparatus embodiments.
  • The parameters of a parallel index tree node may further include the number of occurrences of the character, and the apparatus may further include a number increase module 590 and a number reduction module 5100.
  • The number increase module 590 is configured to increase the number of occurrences in the parameters of the corresponding node by one after the parameters of each character of the data are stored into the parallel index tree as node parameters. The number reduction module 5100 is connected to the parallel index tree storage module 510 and is configured to, when data is deleted, find the corresponding nodes in the parallel index tree according to each character of the deleted data and decrease the number of occurrences in the parameters of the found nodes by one.
  • The data repeatability verification apparatus provided by the embodiments of the present invention can carry out the technical solution of any embodiment of the data repeatability verification method of the present invention, has the corresponding functional modules, and effectively improves verification efficiency.
  • FIG. 9 is a schematic structural diagram of a data application system according to Embodiment 10 of the present invention. The system includes an application server 910, a verification server 920, and a database server 930.
  • The application server 910 is configured to receive data input by a user and provide the data to the verification server 920 for repeatability verification.
  • The verification server 920 is configured to match the parameters of each character of the received data against the node parameters in the parallel index tree, where each node of the parallel index tree corresponds to one character, and the parameters of a node include at least the string length of the data containing the character and the position of the character in the string; and to determine, according to the matching result of each character, whether the data duplicates the stored data. If not, the parameters of each character of the data are stored into the parallel index tree as node parameters, and the data is provided to the database server 930 for storage. The database server 930 is configured to store the data.
  • The database server 930 should be understood in a broad sense: it can be a database built on storage media, or a file system, such as a content management system (CMS for short).
  • The verification server in the data application system provided by the embodiment of the present invention may use the data repeatability verification apparatus provided by the embodiments of the present invention.
  • The verification server may be set up independently of the application server or integrated into it.
  • The application server can be a server with any application service function, for example a forum web publishing server that handles user login, registration, and forum access. Besides providing the verification server with the data that needs repeatability verification, the application server has other functions that respond to specific services.
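  • The division of work among the three servers can be sketched as plain in-process classes. This is a simplification under stated assumptions: the class and method names are hypothetical, network communication is omitted, and the verification server uses only the parallel index without the auxiliary index table.

```python
class DatabaseServer:
    """Stands in for any persistent store (database or file system)."""

    def __init__(self):
        self.rows = []

    def store(self, data):
        self.rows.append(data)


class VerificationServer:
    def __init__(self, database):
        self.database = database
        self.nodes = {}   # character -> set of (string length, position)

    def submit(self, data):
        params = [(ch, (len(data), pos)) for pos, ch in enumerate(data, start=1)]
        if all(p in self.nodes.get(ch, ()) for ch, p in params):
            return False                          # treated as a duplicate
        for ch, p in params:
            self.nodes.setdefault(ch, set()).add(p)
        self.database.store(data)                 # only verified data reaches storage
        return True


class ApplicationServer:
    def __init__(self, verifier):
        self.verifier = verifier

    def register(self, user_name):
        return "accepted" if self.verifier.submit(user_name) else "rejected: duplicate"
```

  • In this arrangement the database server never sees rejected data, so duplicate submissions place no load on it.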
  • the uniqueness of the repetitiveness check is improved by the depth-first manner of the parallel index.
  • the repetitive verification implementation method of the prior art has a great influence on the verification efficiency after the data amount is increased.
  • the technical solution of the embodiment of the present invention does not depend on the persistent application data, and has a direct relationship with the used characters.
  • the verification efficiency is directly related to the number of characters constituting the data, and the auxiliary index has an indirect relationship with the data amount, but can be pressed.
  • the relevant data index needs to be loaded, and the "depth-first, breadth-first" strategy can greatly improve the efficiency, so even if the amount of data reaches a large amount, the impact on the verification efficiency is relatively small.
  • the technical solution of the embodiment of the present invention is only directly related to the used character. Relationships, regardless of whether the system is distributed or not, can use a centralized verification method; no matter how the amount of data changes, the data is a string of characters, and the verification method does not change.
  • the technical solution of the embodiment of the present invention is highly cost-effective because it does not depend on hardware, especially in the case of a large amount of data.
  • the foregoing storage medium includes: a medium that can store program codes, such as a ROM, a RAM, a magnetic disk, or an optical disk.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Description

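Data Duplication Check Method, Apparatus and System
This application claims priority to Chinese Patent Application No. 201010588219.X, filed with the Chinese Patent Office on December 14, 2010 and entitled "Data Duplication Check Method, Apparatus and System", which is incorporated herein by reference in its entirety.
Technical Field
Embodiments of the present invention relate to data processing technologies, and in particular, to a data duplication check method, apparatus, and system.
Background
In many fields of data operation, for example in a software system, a data item often has to be unique. Newly added data must then be checked against the existing data for that item. For example, when a new user registers on a Web forum, the new user name is checked against the existing user names; if it is already taken, the user is asked to enter a different user name.
Existing data duplication checks are implemented in two main ways: the check is performed either before the new data is inserted or after it is inserted. Both ways rely on the database to compare the data item by item. With this database-dependent check, however, the speed and efficiency of the determination drop significantly as the amount of data grows.
Summary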
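Embodiments of the present invention provide a data duplication check method, apparatus, and system to improve the efficiency of data duplication checks.
An embodiment of the present invention provides a data duplication check method, including:
matching the parameters of each character of data against the parameters of nodes in a parallel index tree, where each node of the parallel index tree corresponds to one character, and the parameters of a node include at least the string length of the data in which the character is located and the position of the character in the string; and
determining, according to the matching results of the characters, whether the data duplicates stored data, and if not, storing the parameters of each character of the data in the parallel index tree as parameters of nodes.
An embodiment of the present invention provides a data duplication check apparatus, including:
a parallel index tree storage module, configured to store the parameters of each node in a parallel index tree, where each node of the parallel index tree corresponds to one character, and the parameters of a node include at least the string length of the data in which the character is located and the position of the character in the string;
a parameter matching module, configured to match the parameters of each character of data against the parameters of nodes in the parallel index tree; and
a parallel duplication determining module, configured to determine, according to the matching results of the characters, whether the data duplicates stored data, and if not, store the parameters of each character of the data in the parallel index tree as parameters of nodes.
An embodiment of the present invention further provides a data application system, including:
an application server, configured to receive data input by a user and provide the data to a verification server for a duplication check;
the verification server, configured to match the parameters of each character of the received data against the parameters of nodes in a parallel index tree, where each node of the parallel index tree corresponds to one character, and the parameters of a node include at least the string length of the data in which the character is located and the position of the character in the string; determine, according to the matching results of the characters, whether the data duplicates stored data; and if not, store the parameters of each character of the data in the parallel index tree as parameters of nodes and provide the data to a database server for storage; and
the database server, configured to store the data.
The data duplication check method, apparatus, and system provided by the embodiments of the present invention match the parameter values of the characters in the data in parallel by means of a parallel index tree. The solution does not depend on the database that stores the data, so the index is small and the efficiency of the duplication check can be improved significantly.
Brief Description of the Drawings
FIG. 1 is a flowchart of a data duplication check method according to Embodiment 1 of the present invention;
FIG. 2 is a flowchart of a data duplication check method according to Embodiment 2 of the present invention;
FIG. 3 is a schematic diagram of the tree structure of stored data in an embodiment of the present invention;
FIG. 4 is a flowchart of a data duplication check method according to Embodiment 3 of the present invention;
FIG. 5 is a schematic structural diagram of a data duplication check apparatus according to Embodiment 6 of the present invention;
FIG. 6 is a schematic structural diagram of a data duplication check apparatus according to Embodiment 7 of the present invention;
FIG. 7 is a schematic structural diagram of the parameter matching module in a data duplication check apparatus according to Embodiment 8 of the present invention;
FIG. 8 is a schematic structural diagram of a data duplication check apparatus according to Embodiment 9 of the present invention;
FIG. 9 is a schematic structural diagram of a data application system according to Embodiment 10 of the present invention.
Detailed Description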
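To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
Embodiment 1
FIG. 1 is a flowchart of a data duplication check method according to Embodiment 1 of the present invention. The method applies to any case in which data must be checked for duplication, for example whether a new user name duplicates a stored user name, or whether the name of a new file in a file system duplicates a file name already in use. The method in this embodiment may be executed by a data operating system, which may be an apparatus implemented by a combination of software and hardware. The method includes the following steps:
Step 110: Match the parameters of each character of the data against the parameters of the nodes in a parallel index tree, where each node of the parallel index tree corresponds to one character, and the parameters of each node include at least the string length of the data in which the corresponding character is located and the position of the corresponding character in that string.
Step 120: Determine, according to the matching results of the characters, whether the data duplicates stored data. If not, that is, the data does not duplicate stored data, perform step 130; if so, the data is regarded as duplicating stored data, and step 140 is performed.
Step 130: Store the parameters of each character of the data in the parallel index tree as parameters of nodes. The data can then be accepted or stored, that is, persisted to the database. The procedure ends.
Step 140: The data may be rejected or discarded directly; if a file path is duplicated, the file storage path may be rewritten, and so on.
In step 120, because the parameters of each character are stored in the parallel index tree as node parameters every time data is stored, matching the character parameters of newly added data in the parallel index tree is enough to determine whether the data is a duplicate.
The technical solution of this embodiment checks data for duplication by indexing the characters of the data in parallel. The parallel index tree may take the form of a B+ tree; it includes multiple nodes, which may be arranged layer by layer, and each node corresponds to one character. The parameters of each node include at least the string length of the data in which the corresponding character is located and the position of the corresponding character in the data string. When characters are matched against node parameters, all characters of the data can be matched against their corresponding nodes at the same time, which speeds up matching. If no corresponding node is found for any one character, the data has not been stored and does not duplicate existing data. If a corresponding node is found for every character, the data has probably been stored already. This case can be handled in several ways. One is to treat the data as a duplicate by default and discard it directly, which is fairly accurate when the data string is short. The other is to continue with a more precise duplication check, which is more necessary when the data string is long. Whichever way is used, because the characters of the data are first checked in parallel, the duplication of at least part of the data can be verified, so the efficiency of the duplication check improves to some extent. In current data duplication checks, strings such as user names and file names are usually not very long, so the technical solution of this embodiment can improve efficiency in most duplication checks while ensuring a certain accuracy.
The characters in the data correspond one-to-one to nodes. A character may correspond to a node directly. However, to speed up computation and make it more general, before the parameters of each character of the data are matched against the parameters of the stored nodes in the parallel index tree, each character of the data may first be converted into a data identifier, for example in byte form, so that each node corresponds to a character through the data identifier of that character. For example, each character is identified by numeric bytes, such as "0001" for "上" and "0002" for "z"; any scheme is acceptable as long as the data identifier uniquely identifies the character. A character here may be a single digit, punctuation mark, English letter, Chinese character, or a few bytes of the data, or an organic combination of these elements; for example, the file name suffix ".PDF" may be defined as one character.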
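The matching in steps 110 to 130 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the class name `ParallelIndex` and its methods are hypothetical, Python's `ord()` stands in for the data identifier, and a flat per-character set of (string length, position) pairs replaces the B+ tree mentioned above.

```python
from collections import defaultdict


class ParallelIndex:
    """Hypothetical sketch: data identifier -> set of (string length, position)."""

    def __init__(self):
        self.nodes = defaultdict(set)

    def may_contain(self, data: str) -> bool:
        """True if every character has a matching (length, position) entry."""
        n = len(data)
        # A single missing entry proves the data has never been stored.
        return all(
            (n, pos) in self.nodes.get(ord(ch), ())
            for pos, ch in enumerate(data, start=1)
        )

    def add(self, data: str) -> None:
        """Store the parameters of each character as node parameters."""
        n = len(data)
        for pos, ch in enumerate(data, start=1):
            self.nodes[ord(ch)].add((n, pos))


index = ParallelIndex()
for stored in ["a", "ad", "an", "adm", "adn", "and"]:
    index.add(stored)
```

With this sketch, `index.may_contain("adi")` is false because no entry exists for "i", while `index.may_contain("ann")` is true even though "ann" was never stored. That second case is the "probably stored" outcome described above, which is why a more precise follow-up check can be needed.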
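Embodiment 2
FIG. 2 is a flowchart of a data duplication check method according to Embodiment 2 of the present invention. This embodiment builds on Embodiment 1 and adds a serial index on top of the parallel index. In this embodiment, after the data is determined to duplicate stored data according to the matching results of the characters, the following operations are also performed:
Step 210: Match the index of the data in an auxiliary index table. The auxiliary index table includes the indexes of stored data; the index of stored data may be the data string itself or a numeric index string composed of data identifiers.
The matching in step 210 is a one-by-one lookup of data indexes. Any existing data matching scheme may be used; it is a means of exact lookup based on the data itself.
Step 220: Determine, according to the matching result of the data in the auxiliary index table, whether the data duplicates stored data. If so, perform step 230; if not, perform step 240.
Step 230: Produce a data duplication result, that is, the data duplicates stored data. The procedure ends.
Step 240: The data does not duplicate stored data, so the parameters of each character of the data are stored in the parallel index tree as parameters of nodes, and the index of the data is stored in the auxiliary index table for indexing and matching when other data is added later.
The technical solution of this embodiment speeds up the duplication check through the parallel index and ensures the uniqueness and accuracy of the check through the serial exact index. Although the exact index check still depends on the index of the data itself, the parallel index first rules out many candidates, so the amount of data that needs the exact check drops significantly and the efficiency of the duplication check improves to some extent. The advantage is especially clear when data strings are short, and strings such as user names and file names are mostly short strings of no more than 20 characters.
In this embodiment, the number of characters to be indexed in parallel can preferably be set. That is, before the parameters of each character of the data are matched against the parameters of the nodes in the parallel index tree, the method further includes: extracting a set number of characters from the data as the characters to be matched against the node parameters in the parallel index tree. The subsequent parallel indexing is limited to the extracted characters. The number to extract may be set according to the amount of data to be stored and the number of characters.
For example, if the number of characters for parallel indexing is set to 7, at most 7 characters are extracted from the data each time it is indexed, and the remaining characters are not indexed in parallel. Note that although fewer characters are extracted for parallel indexing, the length of the data string is not reduced by the extraction. Because parallel indexing has clear advantages and higher reliability when the number of characters is small, extracting a limited number of characters preserves the efficiency and accuracy advantages of parallel indexing and avoids unnecessary parallel indexing when there are too many characters.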
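The two-stage check of steps 210 to 240 can be sketched as follows. This is an illustrative sketch only: the function name `check_and_store` is hypothetical, both indexes are modelled as plain Python sets, and `ord()` is assumed as the data identifier.

```python
def check_and_store(data, char_index, aux_index):
    """Return True if data duplicates stored data; otherwise store it and return False.

    char_index: set of (data identifier, string length, position) entries.
    aux_index:  set of exact data indexes (here simply the strings themselves).
    """
    n = len(data)
    keys = [(ord(ch), n, pos) for pos, ch in enumerate(data, start=1)]
    # Stage 1: parallel index. Any missing entry means the data is certainly new.
    maybe_duplicate = all(key in char_index for key in keys)
    # Stage 2: exact lookup, needed only when stage 1 cannot rule out a duplicate.
    if maybe_duplicate and data in aux_index:
        return True
    char_index.update(keys)
    aux_index.add(data)
    return False


char_index, aux_index = set(), set()
results = [check_and_store(s, char_index, aux_index)
           for s in ["a", "ad", "an", "adm", "adn", "and"]]
```

In this sketch the exact lookup runs only for data that the character-level stage could not exclude, which is the source of the efficiency gain described above.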
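The character extraction described above can be illustrated with a short sketch. The function name `truncated_keys` and the default limit of 7 are assumptions taken from the example; the point shown is that only the leading characters are indexed while the recorded string length stays that of the full data.

```python
def truncated_keys(data: str, limit: int = 7):
    """Build (data identifier, full string length, position) entries
    for at most `limit` leading characters of the data."""
    full_length = len(data)  # not reduced by the extraction
    return [
        (ord(ch), full_length, pos)
        for pos, ch in enumerate(data[:limit], start=1)
    ]


keys = truncated_keys("administrator")
```

For the 13-character string above, seven entries are produced and each carries the length 13, so strings that share a 7-character prefix but differ in length still map to different entries.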
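Embodiment 3
Parallel index trees can be organized in several ways, for example as multiple parallel index trees whose root nodes are the first characters of the data, or as multiple parallel index trees whose root nodes are other parameters of the data as a whole, such as the string length, user name type, or file name type. Matching works similarly for each kind of parallel index tree; Embodiment 3 takes the first character of the data as the root node of the parallel index tree as an example, with a simplified example for clarity. Suppose the database already stores the data "a", "ad", "an", "adm", "adn", and "and"; FIG. 3 is a schematic diagram of the tree structure of the stored data. The data to be added is "adi". In the parallel index tree built for these data, the first letter "a" is the root node and each character is a node. A character may appear in different data and therefore have different layers and positions. Each layer and position pair may be stored as the parameters of one node for that character, in which case one character may correspond to multiple nodes. Alternatively, all layer and position pairs of the character may be stored together as the parameters of a single node, in which case one character corresponds to one node. The parameter values of each character are recorded in the form "data identifier of the character" {layer 1 [position 1], layer 2 [position 2], ...}. Node parameters can be attached under each character. When the data identifiers are set to a=97, d=100, m=109, i=105, and n=110, the node parameters of the parallel index tree in this example can be represented in the matrix form shown in Table 1:
Table 1
[Table 1: image imgf000009_0001]
FIG. 4 is a flowchart of a data duplication check method according to Embodiment 3 of the present invention. This embodiment builds on the foregoing embodiments. Specifically, there are multiple parallel index trees, and the root node of each parallel index tree corresponds to the first character of a data string. Matching the parameters of each character of the data against the node parameters in the parallel index tree then includes:
Step 410: Select the corresponding parallel index tree according to the first character of the data string. For example, the parallel index tree shown in Table 1 is selected according to the first letter "a" of "adi".
Step 420: Perform the following lookup and matching operations for each character in the data, in parallel. Here the lookups for "a", "d", and "i" are performed at the same time. Matching "d" is described below as an example:
Step 421: Look up the layer where the node is located in the selected parallel index tree according to the length of the data string. The string length of "adi" is 3, so the third-layer parameters are looked up among the nodes corresponding to "d", and two values are found: 3[2] and 3[3].
Step 422: Look up the matching node in the found layer according to the position of the character in the data string, and produce a character lookup result. "d" is at the second character position of "adi", so 3[2] is matched. A similar lookup is performed for the character "i", but the result is negative: there is no matching node.
Step 430: When the character lookup result of any one character is negative, produce a data lookup result and stop the lookup and matching operations for the other characters. If selecting the parallel index tree by the first letter yields nothing, this is also equivalent to a negative character lookup result for one character.
In the foregoing steps, once no node is found for the character "i", the lookup for the character "d" can be stopped; any one character failing to match means that no duplicate of the data exists.
[…] The duplication check is performed on the characters concurrently, and as soon as one character fails to match, the data can be judged non-duplicate and the lookups for the other characters terminated, so the check is efficient.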
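The lookup flow of this embodiment (select a tree by first character, look up the layer by string length, match the position, and stop early on the first miss) can be sketched as follows. This is a hedged illustration: `build_trees` and `match` are hypothetical names, nested dictionaries stand in for the tree, `ord()` is assumed as the data identifier, a thread pool with a shared flag is just one possible way to run the per-character lookups concurrently, and the data is assumed to be non-empty.

```python
from concurrent.futures import ThreadPoolExecutor
from threading import Event


def build_trees(stored):
    """trees[first-character id][character id][layer] -> set of positions."""
    trees = {}
    for s in stored:
        tree = trees.setdefault(ord(s[0]), {})
        for pos, ch in enumerate(s, start=1):
            tree.setdefault(ord(ch), {}).setdefault(len(s), set()).add(pos)
    return trees


def match(trees, data):
    """True if every character of data matches a node (possible duplicate)."""
    tree = trees.get(ord(data[0]))  # select the tree by the first character
    if tree is None:
        return False                # no tree: equivalent to a failed character
    miss = Event()
    layer = len(data)               # the layer is given by the string length

    def lookup(item):
        pos, ch = item
        if miss.is_set():           # another character already failed; skip
            return False
        found = pos in tree.get(ord(ch), {}).get(layer, set())
        if not found:
            miss.set()              # tell the remaining lookups to stop
        return found

    with ThreadPoolExecutor() as pool:
        return all(pool.map(lookup, enumerate(data, start=1)))


trees = build_trees(["a", "ad", "an", "adm", "adn", "and"])
```

Here `trees[97][100][3]` holds the positions {2, 3}, matching the two values 3[2] and 3[3] found for "d" in step 421, and `match(trees, "adi")` is false because "i" has no node.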
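Embodiment 4
This embodiment builds on the foregoing embodiments. Preferably, the parameters of each node of the parallel index tree further include a preceding coordinate and/or a following coordinate of the character. The preceding coordinate is a parameter of the characters before this character, for example the data identifier of the previous character or of the previous two characters; correspondingly, the following coordinate is a parameter of the characters after this character. When a node includes a preceding coordinate and/or a following coordinate, then after the matching node is looked up in the found layer according to the position of the character in the data string, and before the character lookup result is produced, the following operation is also performed:
matching the preceding character and/or following character of the character for consistency against the preceding coordinate and/or following coordinate in the parameters of the matched node.
A node may include only one of the preceding coordinate and the following coordinate, or both; the choice can balance the execution speed of the matching operation against its accuracy.
With preceding and following coordinates, the parameters of a node take the form {layer 1 [position 1: {|following coordinate 1, |following coordinate 2, ...}], layer 2 [position 2: {preceding coordinate 1|following coordinate 1, preceding coordinate 2|following coordinate 2}], layer 3 [position 3: {preceding coordinate 1|, preceding coordinate 2|}]}. The node values of the parallel index tree for the data shown in FIG. 3 are then as shown in Table 2:
Table 2
[Table 2: image imgf000010_0001]
Take again the match of the character "d" in step 422 of Embodiment 3. After the matching node is looked up in the found layer and before the character lookup result is produced, the preceding and following characters of "d" are also matched for consistency against the preceding and following coordinates in the parameters of the matched node.
The preceding character of "d" is "a" (97) and the following character is "i" (105). Matching shows that 3[2: {97|109, 97|110}] contains no entry 3[2: {97|105}], so the matching result for the character "d" is also negative.
Matching the preceding and following coordinates further improves the accuracy of parallel matching during the duplication check, reduces the cases that need serial matching, and improves the efficiency of the duplication check.
In the above example, suppose the newly added string is "admin在喝酒". After "admin" is extracted for parallel indexing and "admin***" is found to exist, the lookup continues in the auxiliary index table. Converted into data identifiers, the string "admin在喝酒" corresponds to the data index "/97/100/109/105/110/410, 510, 610", where "410, 510, 610" corresponds to "在喝酒".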
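The coordinate check of this embodiment can be sketched as follows. It is an illustration under assumptions: `build_nodes` and `char_matches` are hypothetical names, `ord()` is the assumed data identifier, only one neighbour on each side is recorded, and `None` marks a missing neighbour at the start or end of a string.

```python
def build_nodes(stored):
    """nodes[char id][(layer, position)] -> set of (preceding id, following id)."""
    nodes = {}
    for s in stored:
        ids = [None] + [ord(c) for c in s] + [None]  # None marks a string boundary
        for pos in range(1, len(s) + 1):
            key = (len(s), pos)
            nodes.setdefault(ids[pos], {}).setdefault(key, set()).add(
                (ids[pos - 1], ids[pos + 1])
            )
    return nodes


def char_matches(nodes, data, pos):
    """Match layer and position first, then the neighbouring characters."""
    ids = [None] + [ord(c) for c in data] + [None]
    neighbours = nodes.get(ids[pos], {}).get((len(data), pos))
    return neighbours is not None and (ids[pos - 1], ids[pos + 1]) in neighbours


nodes = build_nodes(["a", "ad", "an", "adm", "adn", "and"])
```

In this sketch `nodes[100][(3, 2)]` equals {(97, 109), (97, 110)}, which corresponds to 3[2: {97|109, 97|110}] in the text, so `char_matches(nodes, "adi", 2)` is false. A string such as "ann", which passes the plain layer and position check, is also rejected here at position 2, illustrating how the coordinates reduce the cases that reach the serial check.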
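Embodiment 5
The data duplication check method provided by Embodiment 5 of the present invention can be an improvement on any of the foregoing embodiments: the parameters of a parallel index tree node further include the number of occurrences of the character. Continuing the foregoing example, the number of times a character occurs at the same position of the same layer of a node is added. For example, the character "a" occurs twice at the first character position of the second layer, so "2" is recorded as a node parameter. The node parameters take the form {layer 1 [position 1: count], layer 2 [position 2: count], ...}. The node parameters, including occurrence counts, of the parallel index tree for the data shown in FIG. 3 are then as shown in Table 3:
Table 3
[Table 3: image imgf000011_0001]
On the basis of the foregoing embodiments, after the parameters of each character of the data are stored in the parallel index tree as node parameters, the method further includes:
incrementing by one the occurrence count of the character in the parameters of the corresponding node; and
when data is deleted, looking up the corresponding nodes in the parallel index tree according to the characters of the deleted data, and decrementing by one the occurrence count of the character in the parameters of each node found.
The steps of adding and deleting data have no particular order relative to each other. The above technical solution meets the needs of data deletion. When data is deleted from the database, the occurrence count is decremented by one. This prevents the nodes of deleted data from lingering in the parallel index tree, and also ensures that deleting data does not remove nodes that identify the same characters in other data.
For example, the character "d" occurs twice at the second position of the third layer, recorded as 3[2: 2]. When "adn" is deleted, 3[2: 2] is changed to 3[2: 1], which still represents the "d" in "adm" and also reflects the removal of "adn".
After the lookup and matching operations of the above technical solution determine that "adi" is not a duplicate, the data "adi" is persisted in the database, and the node index table of the parallel index tree is modified as shown in Table 4:
Table 4
[Table 4: image imgf000012_0001]
Here the occurrence counts, preceding coordinates, and following coordinates of the characters "a" and "d" are modified, and an index for the character "i" is added.
The advantages of the technical solutions of the embodiments of the present invention are especially significant when the amount of data grows. When the amount of data becomes massive, for example when the registered user names of a forum or the file names in a file system grow to a massive number, the speed of the prior-art duplication check, which relies purely on the database, drops significantly. If the database has to be split into multiple databases, the database can no longer be relied on for the determination; the cost is high and unacceptable. When the prior art relies purely on the database and the pressure of concurrent access is high, many duplicates cause the database to report a large number of exceptions, which affects the performance and stability of the database itself.
The technical solutions of the embodiments of the present invention overcome these defects of the prior art. They implement the uniqueness check of certain individual data items without relying on the database. The efficiency of the check is not affected by the amount of data to be stored; even a massive amount of data does not affect the check logic or efficiency. Because the parallel index tree occupies little storage space, it can support the splitting of databases when data becomes massive. The solutions are widely applicable to both ordinary systems and distributed systems. However many databases there are, and whether or not the database is a distributed system, the parallel index tree and the auxiliary index table occupy little storage space and can be stored centrally, so the duplication check is not affected by the form of the database, and no extra work is needed to adapt to changes in the form of the database.
The technical solutions of the embodiments of the present invention reduce the storage needed for index matching; for example, the total number of characters such as Chinese characters, English letters, and digits is about 6000. In the technical solutions of the embodiments, the division into parallel index trees is a virtual structure; what actually needs physical storage is the parameters of the nodes of the parallel index trees and the auxiliary index table. The index storage occupied by these characters can be held entirely in memory, which helps to further increase matching speed.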
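The occurrence counting of this embodiment can be sketched as follows. This is only an illustration: the class name `CountingIndex` is hypothetical, `ord()` is the assumed data identifier, and `remove` is assumed to be called only for data that was previously added.

```python
from collections import Counter


class CountingIndex:
    """Hypothetical sketch: (data identifier, layer, position) -> occurrence count."""

    def __init__(self):
        self.counts = Counter()

    @staticmethod
    def _keys(data):
        return [(ord(ch), len(data), pos) for pos, ch in enumerate(data, start=1)]

    def add(self, data):
        # Increment the occurrence count of every character entry by one.
        self.counts.update(self._keys(data))

    def remove(self, data):
        # Decrement each count by one; drop an entry once no stored data uses it.
        for key in self._keys(data):
            self.counts[key] -= 1
            if self.counts[key] <= 0:
                del self.counts[key]

    def may_contain(self, data):
        return all(self.counts[key] > 0 for key in self._keys(data))


idx = CountingIndex()
for stored in ["a", "ad", "an", "adm", "adn", "and"]:
    idx.add(stored)
count_before = idx.counts[(100, 3, 2)]  # "d" at layer 3, position 2
idx.remove("adn")
count_after = idx.counts[(100, 3, 2)]
```

In this sketch the count for "d" at layer 3, position 2 goes from 2 to 1 when "adn" is removed, mirroring the change from 3[2: 2] to 3[2: 1] described above, and "adm" still matches afterwards.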
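Embodiment 6
FIG. 5 is a schematic structural diagram of a data duplication check apparatus according to Embodiment 6 of the present invention. The apparatus includes a parallel index tree storage module 510, a parameter matching module 520, and a parallel duplication determining module 530. The parallel index tree storage module 510 is configured to store the parameters of each node in a parallel index tree, where each node of the parallel index tree corresponds to one character, and the parameters of a node include at least the string length of the data in which the character is located and the position of the character in the string. The parameter matching module 520 is configured to match the parameters of each character of data against the parameters of nodes in the parallel index tree. The parallel duplication determining module 530 is configured to determine, according to the matching results of the characters, whether the data duplicates stored data, and if not, store the parameters of each character of the data in the parallel index tree as parameters of nodes.
The technical solution of this embodiment checks data for duplication by indexing the characters of the data in parallel, which improves the efficiency of the duplication check.
Embodiment 7
FIG. 6 is a schematic structural diagram of a data duplication check apparatus according to Embodiment 7 of the present invention. This embodiment builds on Embodiment 6 and further includes an auxiliary index table storage module 540, an index matching module 550, and a serial duplication determining module 560. The auxiliary index table storage module 540 is configured to store an auxiliary index table, which includes the indexes of stored data. The index matching module 550 is configured to match the index of the data in the auxiliary index table after the parallel duplication determining module 530 determines, according to the matching results of the characters, that the data duplicates stored data. The serial duplication determining module 560 is configured to determine, according to the matching result of the data in the auxiliary index table, whether the data duplicates stored data; if so, produce a data duplication result; if not, instruct the parallel duplication determining module 530 to store the parameters of each character of the data in the parallel index tree as parameters of nodes, and store the index of the data in the auxiliary index table.
The technical solution of this embodiment speeds up the duplication check through the parallel index and ensures the uniqueness and accuracy of the check through the serial exact index.
On the basis of the above technical solution, the apparatus may further include a character extraction module 570, connected to the parameter matching module 520 and configured to extract, before the parameters of each character of the data are matched against the node parameters in the parallel index tree, a set number of characters from the data as the characters to be matched against the node parameters in the parallel index tree. Extracting a set number of characters is preferably used to control the workload of parallel indexing.
The apparatus preferably further includes a data conversion module 580, connected to the parameter matching module 520 and configured to convert, before the parameters of each character of the data are matched against the parameters of the stored nodes in the parallel index tree, each character of the data into a data identifier, where each node corresponds to a character through the data identifier of that character, and the index of the data is a numeric index string composed of data identifiers.
Representing data by data identifiers reduces the computation required for matching and indexing, and also makes the data index independent of database storage.
Embodiment 8
FIG. 7 is a schematic structural diagram of the parameter matching module in a data duplication check apparatus according to Embodiment 8 of the present invention. This embodiment may build on Embodiment 6 or 7. In this embodiment, there are multiple parallel index trees, and the root node of each parallel index tree corresponds to the first character of a data string. The parameter matching module 520 then includes an index tree selection unit 521, one or more lookup matching units 522, and a result generating unit 523. The index tree selection unit 521 is configured to select the corresponding parallel index tree according to the first character of the data string. The lookup matching units 522 are configured to perform lookup and matching operations for the characters in the data in parallel. Each lookup matching unit 522 includes a layer selection subunit 5221 and a node matching subunit 5222. The layer selection subunit 5221 is configured to look up the layer where the node is located in the selected parallel index tree according to the length of the data string. The node matching subunit 5222 is configured to look up the matching node in the found layer according to the position of the character in the data string, and produce a character lookup result. The result generating unit 523 is configured to produce a data lookup result and stop the lookup and matching operations for the other characters when the character lookup result of any one character is negative.
On the basis of the above solution, the parameters of each node of the parallel index tree may further include a preceding coordinate and/or a following coordinate of the character. Each lookup matching unit 522 then further includes a coordinate matching subunit 5223, configured to match, after the matching node is looked up in the found layer according to the position of the character in the data string and before the character lookup result is produced, the preceding character and/or following character of the character for consistency against the preceding coordinate and/or following coordinate in the parameters of the matched node.
Embodiment 9
FIG. 8 is a schematic structural diagram of a data duplication check apparatus according to Embodiment 9 of the present invention. This embodiment may build on any of the foregoing apparatus embodiments. In this embodiment, the parameters of a parallel index tree node may further include the number of occurrences of the character, and the apparatus may further include a count increment module 590 and a count decrement module 5100. The count increment module 590 is configured to increment by one the occurrence count of the character in the parameters of the corresponding node after the parameters of each character of the data are stored in the parallel index tree as node parameters. The count decrement module 5100 is connected to the parallel index tree storage module 510 and is configured to, when data is deleted, look up the corresponding nodes in the parallel index tree according to the characters of the deleted data and decrement by one the occurrence count of the character in the parameters of each node found.
The data duplication check apparatus provided by the embodiments of the present invention can execute the technical solution of any embodiment of the data duplication check method of the present invention, includes the corresponding functional modules, and effectively improves the efficiency of the duplication check.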
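Embodiment 10
FIG. 9 is a schematic structural diagram of a data application system according to Embodiment 10 of the present invention. The system includes an application server 910, a verification server 920, and a database server 930. The application server 910 is configured to receive data input by a user and provide the data to the verification server 920 for a duplication check. The verification server 920 is configured to match the parameters of each character of the received data against the parameters of nodes in a parallel index tree, where each node of the parallel index tree corresponds to one character, and the parameters of a node include at least the string length of the data in which the character is located and the position of the character in the string; determine, according to the matching results of the characters, whether the data duplicates stored data; and if not, store the parameters of each character of the data in the parallel index tree as parameters of nodes and provide the data to the database server 930 for storage. The database server 930 is configured to store the data. The database server 930 should be understood in a broad sense: it can be a database built on storage media, or a file system such as a content management system (Content Management System, CMS for short).
The verification server in the data application system provided by the embodiment of the present invention may use the data duplication check apparatus provided by the embodiments of the present invention. The verification server may be set up independently of the application server or integrated into it. The application server may be a server with any application service function, for example a forum Web publishing server that handles user login, registration, and forum access. Besides providing the verification server with the data to be checked for duplication, the application server also has other functions that respond to specific services.
The technical solutions of the embodiments of the present invention improve the uniqueness of the duplication check through the depth-first manner of the parallel index. In prior-art implementations of the duplication check, check efficiency is strongly affected as the amount of data grows. The technical solutions of the embodiments do not depend on persisted application data and are directly related to the characters used. The check efficiency is directly related to the number of characters that make up the data. The auxiliary index is indirectly related to the amount of data, but the relevant data index can be loaded on demand, and a "depth-first, breadth-first" strategy can greatly improve efficiency, so even a massive amount of data has relatively little impact on check efficiency. When system evolution must be supported automatically, for example when an ordinary system becomes a distributed system, the amount of data is huge and the database must be split. In this case the existing check methods cannot meet the requirements or are not usable at all. The technical solutions of the embodiments are directly related only to the characters used, so a centralized check can be used whether or not the system is distributed; however the amount of data changes, the data is still a string of characters, and the check method does not need to change. Because the technical solutions do not depend on hardware, they are cost-effective to implement, and this advantage is especially significant when the amount of data is huge.
A person of ordinary skill in the art can understand that all or some of the steps of the foregoing method embodiments may be implemented by hardware related to program instructions. The program may be stored in a computer-readable storage medium, and when executed, performs the steps of the foregoing method embodiments. The storage medium includes any medium that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disk.
Finally, it should be noted that the foregoing embodiments are intended only to describe the technical solutions of the present invention, not to limit them. Although the present invention is described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that modifications may still be made to the technical solutions described in the foregoing embodiments, or equivalent replacements may be made to some of their technical features, and that such modifications or replacements do not make the essence of the corresponding technical solutions depart from the spirit and scope of the technical solutions of the embodiments of the present invention.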
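The interaction between the three servers can be sketched as follows. This is a deliberately simplified, single-process illustration: the class and method names are hypothetical, `ord()` is the assumed data identifier, the servers are plain objects rather than networked services, and only the parallel character check described for this embodiment is modelled, so a new name that happens to match on every character would also be rejected.

```python
class DatabaseServer:
    """Stands in for any persistent store (database or file system)."""

    def __init__(self):
        self.rows = []

    def store(self, data):
        self.rows.append(data)


class VerificationServer:
    def __init__(self, database):
        self.database = database
        self.index = set()  # (data identifier, string length, position)

    def submit(self, data):
        """Return True if the data was accepted and stored, False if treated as duplicate."""
        n = len(data)
        keys = {(ord(ch), n, pos) for pos, ch in enumerate(data, start=1)}
        if keys <= self.index:      # every character matched a node
            return False
        self.index |= keys          # store the character parameters as node parameters
        self.database.store(data)   # hand the data to the database server
        return True


class ApplicationServer:
    def __init__(self, verifier):
        self.verifier = verifier

    def register(self, username):
        accepted = self.verifier.submit(username)
        return "registered" if accepted else "name already in use"


db = DatabaseServer()
app = ApplicationServer(VerificationServer(db))
first = app.register("adm")
second = app.register("adm")
```

In this sketch the application server never queries the database to decide uniqueness; the decision is made entirely from the index held by the verification server.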

Claims

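1. A data duplication check method, comprising:
matching parameters of each character of data against parameters of nodes in a parallel index tree, wherein each node of the parallel index tree corresponds to one character, and the parameters of a node include at least the string length of the data in which the corresponding character is located and the position of the corresponding character in the string; and
determining, according to the matching results of the characters, whether the data duplicates stored data, and if not, storing the parameters of each character of the data in the parallel index tree as parameters of nodes.
2. The data duplication check method according to claim 1, wherein after the data is determined to duplicate stored data according to the matching results of the characters, the method further comprises:
matching an index of the data in an auxiliary index table, wherein the auxiliary index table includes indexes of stored data; and
determining, according to the matching result of the data in the auxiliary index table, whether the data duplicates stored data; if so, producing a data duplication result; if not, storing the parameters of each character of the data in the parallel index tree as parameters of nodes, and storing the index of the data in the auxiliary index table.
3. The data duplication check method according to claim 2, wherein before the parameters of each character of the data are matched against the parameters of the nodes in the parallel index tree, the method further comprises:
extracting a set number of characters from the data as the characters to be matched against the parameters of the nodes in the parallel index tree.
4. The data duplication check method according to claim 2, wherein before the parameters of each character of the data are matched against the parameters of the stored nodes in the parallel index tree, the method further comprises: converting each character of the data into a data identifier, wherein each node corresponds to a character through the data identifier of the character, and the index of the data is a numeric index string composed of data identifiers.
5. The data duplication check method according to any one of claims 1 to 4, wherein there are multiple parallel index trees, the root node of each parallel index tree corresponds to the first character of a data string, and matching the parameters of each character of the data against the parameters of the nodes in the parallel index tree comprises:
selecting the corresponding parallel index tree according to the first character of the data string;
performing the following lookup and matching operations for each character in the data:
looking up, in the selected parallel index tree, the layer where the node is located according to the length of the data string; and
looking up the matching node in the found layer according to the position of the character in the data string, and producing a character lookup result; and
when the character lookup result of any one character is negative, producing a data lookup result and stopping the lookup and matching operations for the other characters.
6. The data duplication check method according to claim 5, wherein the parameters of each node of the parallel index tree further include a preceding coordinate and/or a following coordinate of the character, and after the matching node is looked up in the found layer according to the position of the character in the data string and before the character lookup result is produced, the method further comprises:
matching the preceding character and/or following character of the character for consistency against the preceding coordinate and/or following coordinate in the parameters of the matched node.
7. The data duplication check method according to any one of claims 1 to 4, wherein the parameters of a node further include the number of occurrences of the character, and after the parameters of each character of the data are stored in the parallel index tree as parameters of nodes, the method further comprises:
incrementing by one the occurrence count of the character in the parameters of the corresponding node; and
when data is deleted, looking up the corresponding nodes in the parallel index tree according to the characters of the deleted data, and decrementing by one the occurrence count of the character in the parameters of each node found.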
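8. A data duplication check apparatus, comprising:
a parallel index tree storage module, configured to store the parameters of each node in a parallel index tree, wherein each node of the parallel index tree corresponds to one character, and the parameters of a node include at least the string length of the data in which the corresponding character is located and the position of the corresponding character in the string;
a parameter matching module, configured to match the parameters of each character of data against the parameters of the nodes in the parallel index tree; and
a parallel duplication determining module, configured to determine, according to the matching results of the characters, whether the data duplicates stored data, and if not, store the parameters of each character of the data in the parallel index tree as parameters of nodes.
9. The data duplication check apparatus according to claim 8, further comprising:
an auxiliary index table storage module, configured to store an auxiliary index table, wherein the auxiliary index table includes indexes of stored data;
an index matching module, configured to match the index of the data in the auxiliary index table after the parallel duplication determining module determines, according to the matching results of the characters, that the data duplicates stored data;
a serial duplication determining module, configured to determine, according to the matching result of the data in the auxiliary index table, whether the data duplicates stored data; if so, produce a data duplication result; if not, instruct the parallel duplication determining module to store the parameters of each character of the data in the parallel index tree as parameters of nodes, and store the index of the data in the auxiliary index table;
a character extraction module, configured to extract, before the parameters of each character of the data are matched against the parameters of the nodes in the parallel index tree, a set number of characters from the data as the characters to be matched against the parameters of the nodes in the parallel index tree; and
a data conversion module, configured to convert, before the parameters of each character of the data are matched against the parameters of the stored nodes in the parallel index tree, each character of the data into a data identifier, wherein each node corresponds to a character through the data identifier of the character, and the index of the data is a numeric index string composed of data identifiers.
10. The data duplication check apparatus according to claim 8, wherein there are multiple parallel index trees, the root node of each parallel index tree corresponds to the first character of a data string, and the parameter matching module comprises:
an index tree selection unit, configured to select the corresponding parallel index tree according to the first character of the data string;
one or more lookup matching units, configured to perform lookup and matching operations for the characters in the data in parallel, each lookup matching unit comprising:
a layer selection subunit, configured to look up, in the selected parallel index tree, the layer where the node is located according to the length of the data string; and
a node matching subunit, configured to look up the matching node in the found layer according to the position of the character in the data string, and produce a character lookup result; and
a result generating unit, configured to produce a data lookup result and stop the lookup and matching operations for the other characters when the character lookup result of any one character is negative.
11. The data duplication check apparatus according to claim 10, wherein the parameters of each node further include a preceding coordinate and/or a following coordinate of the character, and each lookup matching unit further comprises:
a coordinate matching subunit, configured to match, after the matching node is looked up in the found layer according to the position of the character in the data string and before the character lookup result is produced, the preceding character and/or following character of the character for consistency against the preceding coordinate and/or following coordinate in the parameters of the matched node.
12. The data duplication check apparatus according to claim 8, wherein the parameters of a parallel index tree node further include the number of occurrences of the character, and the apparatus further comprises:
a count increment module, configured to increment by one the occurrence count of the character in the parameters of the corresponding node after the parameters of each character of the data are stored in the parallel index tree as parameters of nodes; and
a count decrement module, configured to, when data is deleted, look up the corresponding nodes in the parallel index tree according to the characters of the deleted data and decrement by one the occurrence count of the character in the parameters of each node found.
13. A data application system, comprising:
an application server, configured to receive data input by a user and provide the data to a verification server for a duplication check;
the verification server, configured to match the parameters of each character of the received data against the parameters of the nodes in a parallel index tree, wherein each node of the parallel index tree corresponds to one character, and the parameters of a node include at least the string length of the data in which the character is located and the position of the character in the string; determine, according to the matching results of the characters, whether the data duplicates stored data; and if not, store the parameters of each character of the data in the parallel index tree as parameters of nodes and provide the data to a database server for storage; and
the database server, configured to store the data.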
PCT/CN2011/083206 2010-12-14 2011-11-30 Data duplication check method, apparatus and system WO2012079460A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201010588219.X 2010-12-14
CN201010588219XA CN102024046B (zh) 2010-12-14 2010-12-14 数据重复性校验方法和装置及系统

Publications (1)

Publication Number Publication Date
WO2012079460A1 true WO2012079460A1 (zh) 2012-06-21

Family

ID=43865343

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2011/083206 WO2012079460A1 (zh) 2010-12-14 2011-11-30 Data duplication check method, apparatus and system

Country Status (2)

Country Link
CN (1) CN102024046B (zh)
WO (1) WO2012079460A1 (zh)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
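CN102024046B (zh) * 2010-12-14 2013-04-24 华为数字技术(成都)有限公司 Data duplication check method, apparatus and system
US9070090B2 (en) * 2012-08-28 2015-06-30 Oracle International Corporation Scalable string matching as a component for unsupervised learning in semantic meta-model development
CN106649360B (zh) * 2015-10-30 2020-09-22 北京国双科技有限公司 Data duplication check method and apparatus
CN106649346B (zh) * 2015-10-30 2020-09-22 北京国双科技有限公司 Data duplication check method and apparatus
CN106055527B (zh) * 2016-05-24 2019-11-19 华为技术有限公司 Data processing method and apparatus
CN108460048B (zh) * 2017-02-21 2022-05-10 阿里巴巴集团控股有限公司 Method and device for querying unique values
CN107679146A (zh) * 2017-09-25 2018-02-09 南方电网科学研究院有限责任公司 Method and system for verifying power grid data quality
CN110245330B (zh) * 2018-03-09 2023-07-07 腾讯科技(深圳)有限公司 Character sequence matching method, and preprocessing method and apparatus for implementing matching
CN109299719B (zh) * 2018-09-30 2021-07-23 武汉斗鱼网络科技有限公司 Bullet-screen comment verification method based on character segmentation, apparatus, terminal and storage medium
CN109739831A (zh) * 2018-11-23 2019-05-10 网联清算有限公司 Method and apparatus for data verification between databases
CN112148710B (zh) * 2020-09-21 2023-11-14 珠海市卓轩科技有限公司 Microservice database sharding method, system and medium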

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6047283A (en) * 1998-02-26 2000-04-04 Sap Aktiengesellschaft Fast string searching and indexing using a search tree having a plurality of linked nodes
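CN101159658A (zh) * 2007-11-02 2008-04-09 华为技术有限公司 Method and apparatus for virtual private network route lookup
CN101442731A (zh) * 2008-12-12 2009-05-27 中国移动通信集团安徽有限公司 Method and apparatus for removing duplicate call detail records
CN101496005A (zh) * 2005-12-29 2009-07-29 亚马逊科技公司 Distributed storage system with web services client interface
CN102024046A (zh) * 2010-12-14 2011-04-20 成都市华为赛门铁克科技有限公司 Data duplication check method, apparatus and system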

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
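JP4445509B2 (ja) * 2007-03-20 2010-04-07 株式会社東芝 Structured document retrieval system and program
CN101183369A (zh) * 2007-12-11 2008-05-21 中山大学 Lexicon structure for an embedded electronic dictionary
CN101587484B (zh) * 2009-06-19 2011-05-11 南京航空航天大学 Index method for a main-memory database based on a T-lt tree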

Also Published As

Publication number Publication date
CN102024046A (zh) 2011-04-20
CN102024046B (zh) 2013-04-24

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 11849652

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 11849652

Country of ref document: EP

Kind code of ref document: A1