US20210227007A1 - Data storage method, encoding device, and decoding device - Google Patents

Data storage method, encoding device, and decoding device

Info

Publication number
US20210227007A1
Authority
US
United States
Prior art keywords
data
piece
stream data
stream
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US16/099,796
Inventor
Zhaoxin LU
Peng Lin
Xun Chen
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wangsu Science and Technology Co Ltd
Original Assignee
Wangsu Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wangsu Science and Technology Co Ltd
Assigned to WANGSU SCIENCE & TECHNOLOGY CO.,LTD. Assignment of assignors interest (see document for details). Assignors: CHEN, XUN; LIN, PENG; LU, Zhaoxin
Publication of US20210227007A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H04L67/1095 Replication or mirroring of data, e.g. scheduling or transport for data synchronisation between network nodes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638 Organizing or formatting or addressing of data
    • G06F3/064 Management of blocks
    • G06F3/0641 De-duplication techniques
    • H04L65/4069
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L65/00 Network arrangements, protocols or services for supporting real-time applications in data packet communication
    • H04L65/60 Network streaming of media packets
    • H04L65/61 Network streaming of media packets for supporting one-way streaming services, e.g. Internet radio
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/906 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0608 Saving storage space on storage systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0668 Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/067 Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30181 Instruction operation extension or modification
    • G06F9/30196 Instruction operation extension or modification using decoder, e.g. decoder per instruction set, adaptable or programmable decoders
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L51/00 User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
    • H04L51/04 Real-time or near real-time messaging, e.g. instant messaging [IM]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L51/00 User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
    • H04L51/21 Monitoring or handling of messages
    • H04L51/212 Monitoring or handling of messages using filtering or selective blocking
    • H04L65/608
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L65/00 Network arrangements, protocols or services for supporting real-time applications in data packet communication
    • H04L65/60 Network streaming of media packets
    • H04L65/65 Network streaming protocols, e.g. real-time transport protocol [RTP] or real-time control protocol [RTCP]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H04L67/1097 Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/50 Network services
    • H04L67/56 Provisioning of proxy services
    • H04L67/568 Storing data temporarily at an intermediate stage, e.g. caching
    • H04L67/5682 Policies or rules for updating, deleting or replacing the stored data

Definitions

  • the present disclosure generally relates to the field of Internet technology and, more particularly, relates to a data storage method, an encoding device, and a decoding device thereof.
  • the data deduplication technology may deploy a data-storing database at both the data transmitting terminal and the data receiving terminal.
  • the data in the database may be stored in the form of data fragments, where each data fragment may have a unique pointer.
  • the to-be-transmitted data is first divided into multiple data fragments. If some of the data fragments are already stored in the database, these data fragments will be replaced with pointers.
  • the to-be-transmitted data can ultimately be processed as a combination of pointers and data fragments, which thus allows the amount of data that needs to be transmitted to be reduced.
  • the data receiving terminal may identify the data fragments corresponding to the pointers from the local database, thereby restoring the pointers to the data fragments.
  • One premise of the above pointer-mediated data fragment restore is that the databases of the data transmitting terminal and the data receiving terminal are kept in synchronization.
  • problems such as disconnection of data streaming or restart of the transmission processes may cause the databases of the data transmitting terminal and the data receiving terminal to fail to keep data synchronized.
  • the data receiving terminal may have a great possibility of not being able to identify data fragments corresponding to the pointers from the local database, and thus cannot receive the full data.
  • the data receiving terminal may require data synchronization with the data transmitting terminal, which will seriously affect the performance of data transmission.
  • the purpose of the present disclosure is to provide a data storage method, an encoding device, and a decoding device that can effectively reduce the generation of data fragments so that the consumption of memory resources is controllable, improve the ability to identify small stream data, and avoid data synchronization so as to improve data transmission efficiency.
  • the present disclosure provides a data storage method.
  • the method includes: acquiring to-be-transmitted data, and dividing the to-be-transmitted data into a plurality of data blocks; determining, from the plurality of data blocks, a set of target data blocks that are not stored in a predefined database; classifying the set of target data blocks as at least one piece of stream data based on stream information, and mounting the at least one piece of stream data to a to-be-confirmed queue; encoding the at least one piece of stream data, and transmitting the encoded at least one piece of stream data to a decoding device; and receiving a confirmation message sent by the decoding device for the encoded at least one piece of stream data, and storing a set of data blocks, in the to-be-confirmed queue, that correspond to the confirmation message, in the predefined database.
  • the present disclosure further provides an encoding device.
  • the encoding device comprises a memory and a processor, where the memory stores computer programs that, when executed by the processor, implement the following steps: acquiring to-be-transmitted data, and dividing the to-be-transmitted data into a plurality of data blocks; determining, from the plurality of data blocks, a set of target data blocks that are not stored in a predefined database; classifying the set of target data blocks as at least one piece of stream data based on stream information, and mounting the at least one piece of stream data to a to-be-confirmed queue; encoding the at least one piece of stream data, and transmitting the encoded at least one piece of stream data to a decoding device; and receiving a confirmation message sent by the decoding device for the encoded at least one piece of stream data, and storing a set of data blocks, in the to-be-confirmed queue, that correspond to the confirmation message, in the predefined database.
  • the present disclosure further provides a data storage method.
  • the method includes: receiving encoded data transmitted by an encoding device, and decoding the encoded data into decoded data, where the decoded data includes a plurality of data blocks; classifying the plurality of data blocks as at least one piece of stream data based on stream information, and determining a piece of target stream data whose data volume is greater than or equal to a specified threshold; and storing the piece of target stream data in a predefined database, and sending a confirmation message to the encoding device, where the confirmation message includes a tag for a data block in the piece of target stream data, to allow the encoding device to store the data block in the piece of target stream data.
  • the present disclosure further provides a decoding device.
  • the decoding device comprises a memory and a processor, where the memory stores computer programs that, when executed by the processor, implement the following steps: receiving encoded data transmitted by an encoding device, and decoding the encoded data into decoded data, where the decoded data includes a plurality of data blocks; classifying the plurality of data blocks as at least one piece of stream data based on stream information, and determining a piece of target stream data whose data volume is greater than or equal to a specified threshold; and storing the piece of target stream data in a predefined database, and sending a confirmation message to the encoding device, where the confirmation message includes a tag for a data block in the piece of target stream data, to allow the encoding device to store the data block in the piece of target stream data.
  • when the encoding device needs to transmit data to the decoding device, the encoding device may first divide the data into blocks based on the content, and determine target data blocks that are not stored in the predefined database. For a target data block, the encoding device may not directly store it in the predefined database, but rather classify it as stream data based on the stream information, and mount the classified stream data in the to-be-confirmed queue. The stream data may then be encoded and transmitted to the decoding device. After decoding the received data, the decoding device classifies the decoded data blocks as stream data based on the stream information.
  • the decoding device may store the stream data in the local database, and send a confirmation message to the encoding device.
  • the confirmation message may include the tags of the data blocks in the stored stream data.
  • the encoding device may identify the data blocks in the to-be-confirmed queue that correspond to the confirmation message, so that the data blocks corresponding to the confirmation message may be stored in its own database.
  • the encoding device, serving as the data transmitting terminal, may execute its data block storing process only after the decoding device (serving as the data receiving terminal) has stored the data blocks.
  • even if a data transmission failure prevents the encoding device from receiving the confirmation message, the only effect is that the data in the database of the decoding device is more complete than the data in the database of the encoding device. Accordingly, for a tag sent by the encoding device to signify an already-stored data block, the decoding device is able to identify the corresponding data block in its local database and restore the data block. Therefore, in the technical solutions provided by the present disclosure, even if the database of the encoding device and the database of the decoding device are not synchronized, the data receiving process in the decoding device is not affected, thereby improving the efficiency of data transmission.
  • the virtual stream data where the target stream data is located may be determined. Based on the feature index of the virtual stream data, the similarity match is performed for the target stream data. In this way, when a new stream similarity match is performed on the small stream data, all feature indexes of the stream cluster are used, and thus the problem of insufficiency of the feature index can be solved.
  • FIG. 1 is a schematic diagram of a system for applying a data storage method according to some embodiments of the present disclosure.
  • FIG. 2 is a flowchart of a data storage method in an encoding device according to some embodiments of the present disclosure.
  • FIG. 3 is a schematic diagram of a data dividing process according to some embodiments of the present disclosure.
  • FIG. 4 is a schematic diagram of data storage according to some embodiments of the present disclosure.
  • FIG. 5 is a schematic structural diagram of an encoding device according to some embodiments of the present disclosure.
  • FIG. 6 is a flowchart of a data storage method in a decoding device according to some embodiments of the present disclosure.
  • FIG. 7 is a schematic structural diagram of a decoding device according to some embodiments of the present disclosure.
  • FIG. 8 is a schematic structural diagram of a computer terminal according to some embodiments of the present disclosure.
  • the present disclosure provides a data storage method, which may be applied in a system architecture shown in FIG. 1 .
  • the server may serve as a data source, and the client terminal may serve as a party to request loading data from the server.
  • the server may transmit data corresponding to the data loading request, through an encoding device, to a decoding device on the client terminal side.
  • the technical solutions provided by the present disclosure may be applied to the aforementioned encoding device and decoding device.
  • the execution entity may be an encoding device.
  • the method may include the following steps.
  • S11: acquiring to-be-transmitted data, and dividing the to-be-transmitted data into a plurality of data blocks.
  • the encoding device may acquire the to-be-transmitted data from the server, and divide the to-be-transmitted data into blocks based on the content.
  • the data may be divided into blocks using a Rabin fingerprint algorithm.
  • the data may comprise a plurality of characters, which may be 8-bit binary numbers.
  • a data sliding window may be employed to slide from the beginning to the end of the data according to a fixed step length, and the Rabin fingerprint of the data blocks in each data sliding window is calculated one by one. If the calculated fingerprint value is the same as a predefined fingerprint value, the ending position of the present data sliding window may be used as a data block dividing position. For example, in FIG.
  • the data sliding window may slide to the right one character at a time, where the fingerprint calculated for the data in window k is the same as the predefined fingerprint value.
  • the ending position (the dashed line position in the figure) of the k window may thus be set as a dividing position for dividing data blocks.
  • the encoding device may also calculate a hash value for each data block, and the calculated hash value may uniquely signify the corresponding data block.
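The content-based dividing step can be illustrated with a minimal sketch. It is not the disclosure's implementation: a simple polynomial rolling hash stands in for a true Rabin fingerprint, and the window width, base, modulus, boundary mask, target value, and the use of SHA-256 for the block hash are all assumptions chosen for illustration.

```python
import hashlib

WINDOW = 16          # sliding-window width in bytes (assumed)
BASE = 257           # polynomial base of the stand-in rolling hash (assumed)
MOD = (1 << 31) - 1  # modulus of the stand-in rolling hash (assumed)
MASK = 0x3F          # only the low 6 bits are compared (assumed)
TARGET = 0           # the "predefined fingerprint value" (assumed)


def split_blocks(data: bytes) -> list[bytes]:
    """Divide data wherever the fingerprint of the current window equals TARGET."""
    blocks, start, fp = [], 0, 0
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte that leaves the window
    for i, byte in enumerate(data):
        if i >= WINDOW:
            fp = (fp - data[i - WINDOW] * top) % MOD  # drop the oldest byte
        fp = (fp * BASE + byte) % MOD                 # take in the newest byte
        if i + 1 >= WINDOW and (fp & MASK) == TARGET:
            blocks.append(data[start:i + 1])          # window end is a dividing position
            start = i + 1
    if start < len(data):
        blocks.append(data[start:])                   # trailing remainder
    return blocks


def block_hash(block: bytes) -> str:
    """Hash value used to signify a data block (SHA-256 is an assumption)."""
    return hashlib.sha256(block).hexdigest()
```

Because a dividing position depends only on the bytes inside the window, inserting data near the start of the input tends to leave later block boundaries unchanged, which is what makes content-based dividing useful for deduplication.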
  • S12: determining, from the plurality of data blocks, a set of target data blocks that are not stored in a predefined database.
  • the encoding device may store to-be-transmitted data in a predefined database, so that when sending out data later, the encoding device may determine whether the to-be-transmitted data is the duplicate data that has already been stored in the predefined database. In this way, after performing the block dividing process on the to-be-transmitted data to get a plurality of data blocks, the encoding device may compare the plurality of data blocks with the data blocks in the predefined database, to determine, among the plurality of data blocks, target data blocks that are not stored in the predefined database and the already-stored data blocks that have already been stored in the predefined database.
  • a data block in the predefined database may have its own hash value, while each data block in the plurality of data blocks may also have its own hash value.
  • the target data blocks and the already-stored data blocks in the plurality of data blocks may thus be determined.
  • a data block stored in the predefined database may be associated with a data block tag, and the data block tag may uniquely signify the associated data block.
  • the data block tag may be a hash value of the data block or a unique character string assigned to the data block by the encoding device.
  • the modes for expressing a data block tag are not limited in the present disclosure, and other expression modes are also applicable as long as they are able to uniquely signify a data block.
  • a data block tag associated with an already-stored data block may be used to replace the already-stored data block in the plurality of data blocks, and thus the plurality of data blocks may be converted into a combination of the target data blocks and data block tags. This may reduce the amount of data that needs to be transmitted.
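The comparison against the predefined database and the replacement of already-stored blocks with tags might be sketched as follows. The function names, the `("tag", ...)`/`("data", ...)` payload layout, and the choice of a SHA-256 hash as the data block tag are hypothetical; the disclosure only requires that a tag uniquely signify a block.

```python
import hashlib


def tag_of(block: bytes) -> str:
    """Data block tag; a SHA-256 hex digest is assumed here."""
    return hashlib.sha256(block).hexdigest()


def deduplicate(blocks, database):
    """Split blocks into already-stored ones (sent as tags) and target blocks.

    database: dict mapping tag -> block, standing in for the predefined database.
    Returns (payload, targets): payload keeps the original block order, and
    targets maps tag -> block for the blocks not yet stored.
    """
    payload, targets = [], {}
    for block in blocks:
        tag = tag_of(block)
        if tag in database:
            payload.append(("tag", tag))     # already stored: send only the tag
        else:
            payload.append(("data", block))  # target block: send the data itself
            targets[tag] = block
    return payload, targets
```

Note that the sketch does not write the target blocks into `database`; in the method described here that happens only after the decoding device confirms them.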
  • S13: classifying the set of target data blocks as at least one piece of stream data based on stream information, and mounting the at least one piece of stream data to a to-be-confirmed queue.
  • the encoding device may classify the target data blocks as at least one piece of stream data based on the stream information.
  • the stream information may be five-tuple information included in a target data block.
  • the five-tuple information may include the source IP address for transmitting the data block, the destination IP address for receiving the data block, the source port for transmitting the data block, the destination port for receiving the data block, and the transmission protocol used for the data block. In this way, the encoding device may classify data blocks that have the same five-tuple information into one piece of stream data.
  • the encoding device may not directly store the stream data in the predefined database, but rather mount the stream data to a to-be-confirmed queue.
  • only after receiving a confirmation message from the decoding device may the encoding device store the corresponding stream data from the to-be-confirmed queue in the predefined database.
  • the total mounting period of each piece of stream data may be tracked. When the total mounting period reaches a predefined time-length threshold, the stream data is considered to have timed out. To avoid accumulating excessive stream data in the to-be-confirmed queue, stream data that has timed out may be directly discarded, to save space in the to-be-confirmed queue.
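A minimal sketch of the five-tuple classification and of a to-be-confirmed queue with a mounting time limit is given below. The `FiveTuple` and `PendingQueue` names, the explicit `now` argument, and the representation of a data block as a `(five_tuple, bytes)` pair are assumptions made for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class FiveTuple(NamedTuple):
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: str


def classify(blocks):
    """Group (five_tuple, block) pairs into pieces of stream data."""
    streams = defaultdict(list)
    for key, block in blocks:
        streams[key].append(block)
    return dict(streams)


class PendingQueue:
    """To-be-confirmed queue that discards stream data mounted for too long."""

    def __init__(self, max_age: float):
        self.max_age = max_age  # predefined time-length threshold, in seconds
        self._items = {}        # five_tuple -> (mounted_at, blocks)

    def mount(self, key, blocks, now: float):
        self._items[key] = (now, blocks)

    def expire(self, now: float):
        """Drop every stream whose mounting period has reached max_age."""
        stale = [k for k, (t, _) in self._items.items() if now - t >= self.max_age]
        for k in stale:
            del self._items[k]
        return stale

    def __contains__(self, key):
        return key in self._items
```

Passing the clock value in as `now` keeps the sketch deterministic; a real queue would more likely read a monotonic clock itself.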
  • the encoding device may encode the stream data according to a specified encoding algorithm.
  • the to-be-transmitted data may include data block tags that replace the already-stored data blocks and stream data that are classified from the un-stored target data blocks.
  • the data block tags that replace the already-stored data blocks may be encoded at the same time.
  • the encoding device may transmit the encoded stream data and the encoded data block tags to the links of the wide area network, to allow the transmission of the encoded stream data and the encoded data block tags to the decoding device on the client terminal side.
  • the decoding device may decode the encoded data into decoded data.
  • the decoded data may include a plurality of data blocks and tags that are used to signify the data blocks that have already been stored in the encoding device. Since the decoding device may alternately receive different data blocks from multiple pieces of stream data when receiving the encoded data, the plurality of data blocks in the decoded data may not be arranged according to the stream data. In this case, the decoding device may classify the plurality of data blocks as at least one piece of stream data based on stream information.
  • the stream information may be the above-described five-tuple information. The decoding device may classify the data blocks that have the same five-tuple information into the same piece of stream data.
  • the decoding device may determine target stream data that has a data volume greater than or equal to a specified threshold.
  • the specified threshold may be a predefined constant.
  • the specified threshold may be used to determine whether the data volume of the stream data is too small. When the data volume of the stream data is greater than or equal to the specified threshold, the stream data is considered to meet the criteria of being storable.
  • the decoding device may store the target stream data that has a data volume greater than or equal to the specified threshold in its own predefined database.
  • the predefined database of the decoding device may be a magnetic disk of the decoding device.
  • the target stream data may be stored as a storage unit.
  • the storage unit may be defined as a container.
  • the storage unit may contain data blocks included in the target stream data and index information of the target stream data. The index information may be used to signify the starting offset of each data block in the storage unit and the data size of each data block.
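The storage unit (container) layout described above, with index information recording the starting offset and size of each data block, can be sketched as follows. The in-memory representation and the function names are assumptions; the on-disk format is not specified in this text.

```python
def build_container(blocks):
    """Pack (tag, block) pairs into one storage unit.

    Returns (index, payload): index maps tag -> (starting offset, size) and
    payload is the concatenation of all blocks.
    """
    index, payload = {}, bytearray()
    for tag, block in blocks:
        index[tag] = (len(payload), len(block))  # where this block starts, how long it is
        payload += block
    return index, bytes(payload)


def read_block(index, payload, tag):
    """Recover a single block from the storage unit through the index."""
    offset, size = index[tag]
    return payload[offset:offset + size]
```

For example, packing `("t1", b"hello")` and `("t2", b"world!")` yields the index `{"t1": (0, 5), "t2": (5, 6)}`.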
  • the feature index of the target stream data may also be extracted.
  • the extracted feature index may be written into the memory and be periodically and persistently stored on the magnetic disk. If it is necessary to match similar or identical stream data from the predefined database later, a preliminary assessment may be made based on the feature index.
  • the feature index may be extracted based on the intrinsic features of the target stream data.
  • the intrinsic features may include, for example, blocking window sizes, out-of-order fragments, and the like.
  • certain features may be specified in advance from the intrinsic features. The feature to which each data block of the target stream data corresponds may then be identified. The identified features may be sequenced based on the order of the data blocks in the target stream data, to obtain the feature index of the target stream data. As one example, feature 1 to feature 3 are specified at this time.
  • the target stream data include five data blocks.
  • the five data blocks respectively correspond to feature 2, feature 1, feature 3, feature 1, and feature 2.
  • the sequence of feature 2, feature 1, feature 3, feature 1, and feature 2 may then be considered as the feature index of the target stream data.
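The five-block example can be written out as a short sketch. Representing the feature index as a tuple of feature numbers follows the example directly; the `similarity` helper, which uses Python's `difflib.SequenceMatcher` for a preliminary comparison of two feature indexes, is purely an assumption, since the text does not say how the preliminary assessment is computed.

```python
from difflib import SequenceMatcher


def feature_index(block_features):
    """Sequence the per-block features in the order the blocks appear."""
    return tuple(block_features)


def similarity(index_a, index_b):
    """Rough similarity in [0, 1] between two feature indexes (assumed measure)."""
    return SequenceMatcher(None, index_a, index_b).ratio()


# The example from the text: five blocks mapped to features 2, 1, 3, 1, 2.
example = feature_index([2, 1, 3, 1, 2])
```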
  • each feature index may include a key and a value corresponding to the key (cn1, cn2, cnn, etc.).
  • the structure of the storage unit may include metadata, offset and data size corresponding to the key, and specific stored data (data rec).
  • the decoding device may query, in its own predefined database, whether there exists a data block corresponding to the tag. If there exists a data block corresponding to the tag, the tag may be restored to the corresponding data block. If there does not exist a data block corresponding to the tag, it indicates that certain data blocks are missing in the current database of the decoding device. At this moment, the decoding device may send a synchronization request to the encoding device, to synchronize with the data blocks stored in the encoding device.
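Restoring tags on the decoding side might look like the sketch below. The payload layout of `("tag", ...)` and `("data", ...)` pairs and the `MissingBlock` exception are hypothetical; the exception merely marks the point at which the decoding device would send a synchronization request.

```python
class MissingBlock(Exception):
    """Raised when a tag has no data block in the local database."""


def restore(payload, database):
    """Rebuild the original data from a mix of raw blocks and tags.

    payload: list of ("data", block) or ("tag", tag) entries in original order.
    database: dict mapping tag -> block, standing in for the local database.
    """
    out = []
    for kind, value in payload:
        if kind == "data":
            out.append(value)            # block was transmitted in full
        elif value in database:
            out.append(database[value])  # tag restored to its data block
        else:
            raise MissingBlock(value)    # caller would request synchronization
    return b"".join(out)
```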
  • the decoding device may send a confirmation message to the encoding device.
  • the confirmation message may include tags for the data blocks included in the target stream data, which allows the encoding device to store the data blocks in the target stream data.
  • the tags for the data blocks in the target stream data may be the calculated hash values of the data blocks.
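The decoding-side handling of the specified threshold, storage, and confirmation can be sketched as follows. The threshold value, the dict standing in for the predefined database, and the use of SHA-256 hashes as tags are assumptions for illustration.

```python
import hashlib

THRESHOLD = 8  # specified threshold in bytes (assumed value)


def tag_of(block: bytes) -> str:
    """Tag for a data block; a SHA-256 hex digest is assumed."""
    return hashlib.sha256(block).hexdigest()


def store_and_confirm(streams, database, threshold=THRESHOLD):
    """Store every stream whose data volume reaches the threshold.

    streams: dict mapping a stream key -> list of blocks.
    Returns (confirmed_tags, small_streams): the tags to put in the
    confirmation message, and the streams left for the to-be-aggregated queue.
    """
    confirmed, small = [], {}
    for key, blocks in streams.items():
        if sum(len(b) for b in blocks) >= threshold:
            for block in blocks:
                tag = tag_of(block)
                database[tag] = block  # store before confirming
                confirmed.append(tag)
        else:
            small[key] = blocks        # too small to store on its own
    return confirmed, small
```

Storing before building the confirmation preserves the ordering the disclosure relies on: the decoding device's database is never behind the encoding device's.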
  • S15: receiving a confirmation message sent by the decoding device for the encoded at least one piece of stream data, and storing a set of data blocks, in the to-be-confirmed queue, that correspond to the confirmation message, in the predefined database.
  • the encoding device may determine data blocks, in the to-be-confirmed queue, that correspond to the confirmation message, and store the data blocks that correspond to the confirmation message in the predefined database.
  • the confirmation message may include the tags for the data blocks.
  • the tags may be, for example, the hash values of the data blocks.
  • the encoding device may identify the corresponding data blocks in the to-be-confirmed queue through the tags of the data blocks, and store the identified data blocks.
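On the encoding side, handling a confirmation message reduces to moving the confirmed blocks out of the to-be-confirmed queue. In this sketch both the queue and the database are plain dicts keyed by tag, which is an assumption; the function name is hypothetical.

```python
def on_confirmation(tags, pending, database):
    """Store the blocks named in a confirmation message.

    tags: tags carried by the confirmation message.
    pending: dict tag -> block, standing in for the to-be-confirmed queue.
    database: dict tag -> block, standing in for the predefined database.
    Returns the tags that were actually stored.
    """
    stored = []
    for tag in tags:
        block = pending.pop(tag, None)  # unknown or expired tags are ignored
        if block is not None:
            database[tag] = block
            stored.append(tag)
    return stored
```

Ignoring tags that are no longer pending is safe under this scheme: the encoding device simply keeps fewer blocks than the decoding device, which does not prevent the decoding device from restoring any tag it later receives.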
  • the encoding device's own database may also be a magnetic disk as shown in FIG. 4 .
  • the encoding device may also store the data blocks on the magnetic disk based on the stream data.
  • the stream data where the data blocks corresponding to the confirmation message are located may be stored as a storage unit on the magnetic disk of the encoding device.
  • the storage unit may be defined as a container.
  • the storage unit may include data blocks included in the stream data and index information of the stream data. The index information may be used to signify the starting offset of each data block in the storage unit and the data size of each data block.
  • the encoding device may also extract the feature index of the stream data, write the feature index into the memory, and periodically and persistently store the feature index on the magnetic disk.
  • the decoding device may mount the stream data that has a data volume less than the specified threshold into a to-be-aggregated queue. For the stream data in the to-be-aggregated queue, the decoding device will aggregate multiple pieces of stream data only after it is confirmed that there is a correlation among the multiple pieces of stream data.
  • an application proxy module or other correlation analysis modules may be used to determine multiple pieces of stream data that have a correlation.
  • the application proxy module may be, for example, a proxy for HTTP (HyperText Transfer Protocol), MAPI (Messaging Application Programming Interface), or CIFS (Common Internet File System). These application proxy modules may determine that multiple different pieces of stream data belong to the same session, and thus determine that the pieces of stream data are correlated.
  • the decoding device may aggregate multiple pieces of stream data, in the to-be-aggregated queue, that correspond to the correlation information, into a piece of virtual stream data.
  • the correlation information may include five-tuple information of the multiple pieces of correlated stream data.
  • the decoding device may determine a set of corresponding stream data in the to-be-aggregated queue. Although the set of stream data are correlated, due to the difference in five-tuple information, the set of stream data may only be aggregated into a piece of virtual stream data in the form of stream data. In the virtual stream data, multiple sets of five-tuple information are still included. After obtaining the virtual stream data from the aggregation, the decoding device may store the virtual stream data in its own predefined database, extract the feature index of the virtual stream data, write the feature index into the memory, and periodically and persistently store the feature index on the magnetic disk.
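The aggregation of correlated small streams into a piece of virtual stream data can be sketched as below. The dict layout of the virtual stream and the function name are hypothetical; the sketch only reflects that the virtual stream keeps every contributing five-tuple alongside the combined blocks.

```python
def aggregate(to_be_aggregated, correlated_keys):
    """Merge correlated small streams into one piece of virtual stream data.

    to_be_aggregated: dict five_tuple -> list of blocks (the to-be-aggregated queue).
    correlated_keys: five-tuples named by the correlation information.
    """
    virtual = {"five_tuples": [], "blocks": []}
    for key in correlated_keys:
        blocks = to_be_aggregated.pop(key, None)
        if blocks is None:
            continue                         # stream not in the queue; skip it
        virtual["five_tuples"].append(key)   # every five-tuple is retained
        virtual["blocks"].extend(blocks)
    return virtual
```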
  • the decoding device may add an aggregation instruction to a confirmation message to be sent to the encoding device.
  • the aggregation instruction may include five-tuple information of the multiple pieces of stream data that need to be aggregated, or may include the tag of each data block in the multiple pieces of stream data that need to be aggregated. In this way, the aggregation instruction may include the tag of each piece of stream data in the already-stored virtual stream data.
  • the encoding device may identify a set of target stream data in the to-be-confirmed queue, aggregate the set of target stream data into a piece of virtual stream data, and extract the feature index of the virtual stream data into the memory.
  • the virtual stream data where the target stream data is located may be determined. Based on the feature index of the virtual stream data, similarity match may be performed on the target stream data. In this way, when a stream similarity match is performed on a piece of new small stream data, all feature indexes in the stream cluster are used, and thus the problem of insufficiency of the feature index can be solved.
  • the present disclosure further provides an encoding device.
  • the encoding device comprises a memory and a processor, where the memory stores computer programs that, when executed by the processor, implement the following steps:
  • S 13 classifying the set of target data blocks as at least one piece of stream data based on stream information, and mounting the at least one piece of stream data to a to-be-confirmed queue;
  • S 15 receiving a confirmation message sent by the decoding device for the encoded at least one piece of stream data, and storing a set of data blocks, in the to-be-confirmed queue, that correspond to the confirmation message, in the predefined database.
  • the computer programs, when executed by the processor, further implement the following step:
  • the confirmation message further includes an aggregation instruction
  • the aggregation instruction includes tags for a set of target stream data in the to-be-confirmed queue, and correspondingly, the computer programs, when executed by the processor, further implement the following steps:
  • the present disclosure further provides a data storage method.
  • the execution entity of the method may be a decoding device.
  • the method may include the following steps.
  • S 22 classifying the plurality of data blocks as at least one piece of stream data based on stream information, and determining a piece of target stream data whose data volume is greater than or equal to a specified threshold.
  • S 23 storing the piece of target stream data in a predefined database, and sending a confirmation message to the encoding device, where the confirmation message includes a tag for a data block in the piece of target stream data, to allow the encoding device to store the data block in the piece of target stream data.
  • the decoding device may decode the encoded data into decoded data.
  • the decoded data may include a plurality of data blocks and tags that are used to signify the data blocks that have already been stored in the encoding device. Since the decoding device may receive data blocks from multiple pieces of stream data in an interleaved manner when receiving the encoded data, the plurality of data blocks in the decoded data may not be arranged according to the stream data. In this case, the decoding device may classify the plurality of data blocks as at least one piece of stream data based on stream information.
  • the stream information may be the above-described five-tuple information. The encoding device may classify the data blocks that have the same five-tuple information into the same piece of stream data.
  • the decoding device may determine target stream data that has a data volume greater than or equal to a specified threshold.
  • the specified threshold may be a predefined constant.
  • the specified threshold may be used to determine whether the data volume of the stream data is too small. When the data volume of the stream data is greater than or equal to the specified threshold, the stream data is considered to meet the criteria of being storable.
  • the decoding device may store the target stream data that has a data volume greater than or equal to the specified threshold in its own predefined database.
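The classification and threshold check described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the names `FiveTuple`, `DataBlock`, `classify_streams`, and `split_by_volume` are hypothetical, and data volume is assumed to be the total payload length in bytes.

```python
from collections import defaultdict
from typing import NamedTuple


class FiveTuple(NamedTuple):
    """Hypothetical representation of the five-tuple stream information."""
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: str


class DataBlock(NamedTuple):
    flow: FiveTuple
    payload: bytes


def classify_streams(blocks):
    """Group data blocks that share the same five-tuple into one piece of stream data."""
    streams = defaultdict(list)
    for block in blocks:
        streams[block.flow].append(block)
    return dict(streams)


def split_by_volume(streams, threshold):
    """Separate streams whose volume reaches the threshold from the smaller ones."""
    large, small = {}, {}
    for flow, blocks in streams.items():
        volume = sum(len(b.payload) for b in blocks)  # assumed: volume in bytes
        (large if volume >= threshold else small)[flow] = blocks
    return large, small
```

In this sketch, the streams in `large` would be stored directly, while those in `small` would be candidates for the to-be-aggregated queue.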
  • the predefined database of the decoding device may be a magnetic disk of the decoding device.
  • the target stream data may be stored as a storage unit.
  • the storage unit may be defined as a container.
  • the storage unit may contain data blocks included in the target stream data and index information of the target stream data. The index information may be used to signify the starting offset of each data block in the storage unit and the data size of each data block.
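A storage unit of this kind can be sketched as a byte buffer plus an index of (starting offset, size) pairs. The class name `Container` and its methods are assumptions made for illustration only; the source does not specify a concrete layout.

```python
from dataclasses import dataclass, field


@dataclass
class Container:
    """Hypothetical storage unit: concatenated blocks plus per-block index."""
    data: bytearray = field(default_factory=bytearray)
    index: list = field(default_factory=list)  # entries are (offset, size)

    def append(self, block: bytes) -> int:
        """Store a block and record its starting offset and size."""
        self.index.append((len(self.data), len(block)))
        self.data.extend(block)
        return len(self.index) - 1  # position of the block in the index

    def read(self, position: int) -> bytes:
        """Read a block back using only the index information."""
        offset, size = self.index[position]
        return bytes(self.data[offset:offset + size])
```

Because the index records both offset and size, any block can be read back without scanning the blocks stored before it.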
  • the feature index of the target stream data may also be extracted.
  • the extracted feature index may be written into the memory and periodically and persistently stored on the magnetic disk. If it is necessary to match similar or identical stream data from the predefined database later, a preliminary assessment may be made based on the feature index.
  • the feature index may be extracted based on the intrinsic features of the target stream data.
  • the intrinsic features may include, for example, blocking window sizes, out-of-order fragments, and the like.
  • certain features may be specified in advance from the intrinsic features. The feature to which each data block of the target stream data corresponds may be identified. The identified features may be sequenced based on the order of the data blocks in the target stream data, to obtain the feature index of the target stream data. As one example, feature 1 to feature 3 are specified in advance.
  • the target stream data includes five data blocks. The five data blocks respectively correspond to feature 2, feature 1, feature 3, feature 1, and feature 2. The sequence of feature 2, feature 1, feature 3, feature 1, and feature 2 may then be considered as the feature index of the target stream data.
  • the decoding device may query, in its own predefined database, whether there exists a data block corresponding to the tag. If there exists a data block corresponding to the tag, the tag may be restored to the corresponding data block. If there does not exist a data block corresponding to the tag, it indicates that certain data blocks are missing in the current database of the decoding device. At this moment, the decoding device may send a synchronization request to the encoding device, to synchronize with the data blocks stored in the encoding device.
  • the decoding device may send a confirmation message to the encoding device.
  • the confirmation message may include tags for the data blocks included in the target stream data, which allows the encoding device to store the data blocks in the target stream data.
  • the tags for the data blocks in the target stream data may be the calculated hash values of the data blocks.
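The tag lookup and restoration described above can be sketched as follows. The source states only that a tag may be a calculated hash value; the use of SHA-256, the `(kind, value)` item format, and the function names are assumptions for illustration.

```python
import hashlib


def tag_of(block: bytes) -> str:
    """Compute a tag for a data block (SHA-256 is an assumed choice of hash)."""
    return hashlib.sha256(block).hexdigest()


def restore(items, database):
    """Restore tags to data blocks using the local database.

    items: list of ("block", bytes) or ("tag", str) entries (assumed format).
    database: dict mapping tag -> stored data block.
    Returns the blocks that could be restored and the tags that were missing.
    """
    restored, missing = [], []
    for kind, value in items:
        if kind == "block":
            restored.append(value)
        elif value in database:
            restored.append(database[value])
        else:
            missing.append(value)  # would trigger a synchronization request
    return restored, missing
```

A non-empty `missing` list corresponds to the case in which the decoding device sends a synchronization request to the encoding device.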
  • the decoding device may mount the stream data that has a data volume less than the specified threshold into a to-be-aggregated queue.
  • the decoding device will aggregate multiple pieces of stream data only after it is confirmed that there is a correlation among the multiple pieces of stream data.
  • an application proxy module or other correlation analysis modules may be used to determine the stream data that have a correlation.
  • the application proxy module may be, for example, an HTTP (HyperText Transfer Protocol) proxy, a MAPI (Messaging Application Programming Interface) proxy, or a CIFS (Common Internet File System) proxy. These application proxy modules may determine that multiple different pieces of stream data are located in the same session, and thus determine that correlated stream data exists.
  • the decoding device may aggregate multiple pieces of stream data, in the to-be-aggregated queue, that correspond to the correlation information, into a piece of virtual stream data.
  • the correlation information may include five-tuple information of the multiple pieces of correlated stream data.
  • the decoding device may determine a set of corresponding stream data in the to-be-aggregated queue. Although the set of stream data are correlated, due to the difference of their five-tuple information, the set of stream data may only be aggregated into a piece of virtual stream data in the form of stream data. In the virtual stream data, multiple sets of five-tuple information are still included. After obtaining the virtual stream data from the aggregation, the decoding device may store the virtual stream data in its own predefined database, extract the feature index of the virtual stream data, write the feature index into the memory, and periodically and persistently store the feature index on the magnetic disk.
  • the decoding device may add an aggregation instruction to a confirmation message to be sent to the encoding device.
  • the aggregation instruction may include five-tuple information of the multiple pieces of stream data that need to be aggregated, or may include the tag of each data block in the multiple pieces of stream data that need to be aggregated. In this way, the aggregation instruction may include the tag of each piece of stream data in the already-stored virtual stream data.
  • the encoding device may identify a set of target stream data in the to-be-confirmed queue, aggregate the set of target stream data into a piece of virtual stream data, and extract the feature index of the virtual stream data into the memory.
  • the decoding device may determine the total mounting period of the stream data in the to-be-aggregated queue. When the total mounting period exceeds a specified time-length threshold, the mounting of the stream data may be considered to have timed out. To save space in the to-be-aggregated queue, if stream data whose data volume is less than the specified threshold has timed out in the to-be-aggregated queue, the decoding device may discard that stream data.
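The to-be-aggregated queue, the aggregation into virtual stream data, and the timeout-based discarding can be sketched together as follows. The class and method names are hypothetical, timestamps are passed in explicitly to keep the example deterministic, and the stream key stands in for the five-tuple information.

```python
from dataclasses import dataclass


@dataclass
class VirtualStream:
    """Hypothetical virtual stream: keeps every member's five-tuple."""
    five_tuples: list
    blocks: list


class AggregationQueue:
    def __init__(self, max_age: float):
        self.max_age = max_age  # assumed time-length threshold, in seconds
        self._pending = {}      # five_tuple -> (mounted_at, blocks)

    def mount(self, five_tuple, blocks, now: float) -> None:
        """Mount a small piece of stream data into the queue."""
        self._pending[five_tuple] = (now, list(blocks))

    def aggregate(self, correlated) -> VirtualStream:
        """Merge the pending streams named by the correlation information."""
        found = [ft for ft in correlated if ft in self._pending]
        blocks = []
        for ft in found:
            blocks.extend(self._pending.pop(ft)[1])
        return VirtualStream(five_tuples=found, blocks=blocks)

    def discard_expired(self, now: float) -> list:
        """Drop streams whose mounting period exceeds the threshold."""
        expired = [ft for ft, (mounted_at, _) in self._pending.items()
                   if now - mounted_at > self.max_age]
        for ft in expired:
            del self._pending[ft]
        return expired
```

Streams are removed from the queue once aggregated, so only uncorrelated small streams remain subject to the timeout.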
  • the classified stream data may very likely be similar to some stream data stored in the database.
  • the decoding device has previously received the data for the first draft of a document, and has again received the data for the revised draft of the document.
  • the revised draft may have only a little content different from the first draft.
  • the data for the first draft and the data for the revised draft may thus be considered as two pieces of similar stream data.
  • if the classified stream data is stored in the database as it is, there will be two pieces of similar stream data in the database. This will waste the storage space of the database.
  • the decoding device may check, in its own predefined database, similar stream data whose similarity to the classified data stream reaches a specified similarity threshold.
  • the similarity may be determined by comparing the feature indexes of the two pieces of stream data. For example, the number of correspondingly matched features in the feature indexes of the two pieces of stream data may be determined, and the ratio of the number of correspondingly matched features to the total number of features of the feature indexes is calculated. The calculated ratio may be considered as the similarity between the two pieces of stream data. The higher the ratio, the higher the similarity.
  • the specified similarity threshold may be a predefined constant, which can be flexibly adjusted based on the real conditions.
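The ratio-based similarity can be sketched as follows. The source does not define "the total number of features" when the two feature indexes differ in length; this sketch assumes the length of the longer index, and the function names are hypothetical.

```python
def similarity(index_a, index_b) -> float:
    """Ratio of position-wise matching features to the total number of features.

    Assumption: the total is taken as the length of the longer feature index.
    """
    if not index_a or not index_b:
        return 0.0
    matched = sum(1 for a, b in zip(index_a, index_b) if a == b)
    return matched / max(len(index_a), len(index_b))


def is_similar(index_a, index_b, threshold: float) -> bool:
    """Check the similarity against the specified similarity threshold."""
    return similarity(index_a, index_b) >= threshold
```

For instance, the feature index 2, 1, 3, 1, 2 compared with 2, 1, 3, 2, 2 matches in four of five positions, giving a similarity of 0.8.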
  • the classified stream data and the determined similar stream data may be compared, to determine the additional new data in the classified stream data.
  • the similar stream data may be read into the memory from the predefined database, and different data between the similar stream data and the classified stream data may be identified by comparison.
  • the identified different data may be considered as additional new data.
  • the additional new data may be stored as a piece of new stream data according to the above-described approach. If the data volume of the additional new data is smaller than the specified data volume threshold, the additional new data may be added to the similar stream data. In this way, the problem that the stream data with a small data volume cannot be efficiently identified through the feature index can then be avoided.
  • the present disclosure further provides a decoding device.
  • the decoding device comprises a memory and a processor, where the memory stores computer programs that, when executed by the processor, implement the following steps:
  • S 22 classifying the plurality of data blocks as at least one piece of stream data based on stream information, and determining a piece of target stream data whose data volume is greater than or equal to a specified threshold;
  • S 23 storing the piece of target stream data in a predefined database, and sending a confirmation message to the encoding device, where the confirmation message includes a tag for a data block in the piece of target stream data, to allow the encoding device to store the data block in the piece of target stream data.
  • the computer programs, when executed by the processor, further implement the following steps:
  • the aggregation instruction includes a tag for each piece of stream data in the piece of virtual stream data, to allow the encoding device to aggregate each piece of stream data in the piece of virtual stream data and store the aggregated virtual stream data.
  • the computer programs, when executed by the processor, further implement the following steps:
  • the encoding device when the encoding device needs to transmit data to the decoding device, the encoding device may first divide the data into blocks based on the content, and determine target data blocks that are not stored in the predefined database. For a target data block, the encoding device may not directly store it in the predefined database, but rather classify it as stream data based on the stream information, and mount the stream data in the to-be-confirmed queue. The stream data may then be encoded and transmitted to the decoding device. After decoding the received data, the decoding device classifies the decoded data blocks as stream data based on the stream information.
  • the decoding device may store the stream data in the local database, and send a confirmation message to the encoding device.
  • the confirmation message may include the tags of the data blocks in the stored stream data.
  • the encoding device may identify the data blocks in the to-be-confirmed queue that correspond to the confirmation message, so that the data blocks corresponding to the confirmation message may be stored in its own database.
  • serving as a data transmitting terminal, the encoding device may execute the data block storing process only after the decoding device (serving as the data receiving terminal) stores the data blocks.
  • even if a data transmission failure occurs and leaves the encoding device unable to receive the confirmation message, the effect is merely that the data in the database of the decoding device is more complete than the data in the database of the encoding device. Accordingly, for a tag sent by the encoding device that signifies an already-stored data block, the decoding device is evidently able to identify the corresponding data block in its local database and restore the data block. Therefore, in the technical solutions provided by the present disclosure, even if the data in the database of the encoding device and the database of the decoding device are not synchronized, the data receiving process in the decoding device is not affected, thereby improving the efficiency of data transmission.
  • the virtual stream data where the target stream data is located may be determined. Based on the feature index of the virtual stream data, the similarity match is performed for the target stream data. In this way, when a stream similarity match is performed on a piece of new small stream data, all feature indexes in the stream cluster are used, and thus the problem of insufficiency of the feature index can be solved.
  • the computer terminal 10 may include one or more (only one is shown in the figure) processors 102 (a processor 102 may include, but is not limited to, a processing device such as a microprocessor MCU or a programmable logic device FPGA), a memory 104 for storing data, and a transmission device 106 for communication purpose.
  • the structure shown in FIG. 8 is provided by way of illustration, but not by way of limitation of the structures of the above-described electronic devices.
  • the computer terminal 10 may also include more or fewer components than those shown in FIG. 8 , or have a different configuration than that shown in FIG. 8 .
  • the memory 104 may be used to store software programs and modules of application software.
  • the processor 102 implements various functional applications and data processing by executing software programs and modules stored in the memory 104 .
  • the memory 104 may include a high-speed random access memory, and also a non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory.
  • the memory 104 may further include a memory remotely disposed with respect to the processor 102 , which may be connected to the computer terminal 10 through a network. Examples of such network include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.
  • the transmission device 106 is configured to receive or transmit data via the network.
  • the aforementioned specific examples of the network may include a wireless network provided by the communication provider of the computer terminal 10 .
  • the transmission device 106 includes a network interface controller (NIC) that may be connected to other network devices through the base stations to allow it to communicate with the Internet.
  • the transmission device 106 may be a Radio Frequency (RF) module that is configured to communicate with the Internet via a wireless approach.
  • the various embodiments may take the form of software plus a necessary general hardware platform implementation, or entirely a hardware implementation.
  • the technical solutions, or essentially the parts that contribute to the current technology, may be embodied by way of a software product.
  • the computer software product may be stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disc, an optical disc, etc., and include a variety of programs that cause a computing device (which may be a personal computer, a server, or a network device, etc.) to implement each embodiment or methods described in certain parts of each embodiment.

Abstract

A data storage method includes: acquiring to-be-transmitted data, and dividing the to-be-transmitted data into a plurality of data blocks; determining, from the plurality of data blocks, a set of target data blocks that are not stored in a predefined database; classifying the set of target data blocks as at least one piece of stream data based on stream information, and mounting the at least one piece of stream data to a to-be-confirmed queue; encoding the at least one piece of stream data, and transmitting the encoded at least one piece of stream data to a decoding device; and receiving a confirmation message sent by the decoding device for the encoded at least one piece of stream data, and storing a set of data blocks, in the to-be-confirmed queue, that correspond to the confirmation message, in the predefined database.

Description

    FIELD OF THE DISCLOSURE
  • The present disclosure generally relates to the field of Internet technology and, more particularly, relates to a data storage method, an encoding device, and a decoding device thereof.
  • BACKGROUND
  • With the continuous development of the Internet, the amount of data on the Internet has also increased day-by-day. At present, a huge amount of the data transmitted over the Internet is duplicate data. For example, in mass emailing and mass-messaging in instant messaging software, etc., multiple copies are made for the same data and then transmitted. This will inevitably waste precious bandwidth resources.
  • In order to solve the problem of duplicate data transmission, data deduplication technology can be used nowadays to reduce the amount of data that needs to be transmitted in the network. Specifically, the data deduplication technology may deploy a data-storing database at both the data transmitting terminal and the data receiving terminal. The data in the database may be stored in the form of data fragments, where each data fragment may have a unique pointer. When the data transmitting terminal needs to transmit data to the data receiving terminal, the to-be-transmitted data is first divided into multiple data fragments. If some of the data fragments are already stored in the database, these data fragments will be replaced with pointers. In this way, the to-be-transmitted data can ultimately be processed as a combination of pointers and data fragments, which thus allows the amount of data that needs to be transmitted to be reduced. After receiving the combination of pointers and data fragments, the data receiving terminal may identify the data fragments corresponding to the pointers from the local database, thereby restoring the pointers to the data fragments.
  • One premise of the above pointer-mediated data fragment restoration is that the databases of the data transmitting terminal and the data receiving terminal are kept in synchronization. However, in actual transmission processes, problems such as disconnection of data streaming or restart of the transmission processes may cause the databases of the data transmitting terminal and the data receiving terminal to fall out of synchronization. Accordingly, the data receiving terminal is very likely to be unable to identify the data fragments corresponding to the pointers in the local database, and thus cannot receive the full data. At this point, the data receiving terminal may require data synchronization with the data transmitting terminal, which will seriously affect the performance of data transmission.
  • BRIEF SUMMARY OF THE DISCLOSURE
  • The purpose of the present disclosure is to provide a data storage method, an encoding device, and a decoding device that can effectively reduce the generation of data fragments to keep the consumption of memory resources controllable, improve the ability to identify small stream data, and avoid data synchronization so as to improve the data transmission efficiency.
  • To achieve the above purpose, in one aspect, the present disclosure provides a data storage method. The method includes: acquiring to-be-transmitted data, and dividing the to-be-transmitted data into a plurality of data blocks; determining, from the plurality of data blocks, a set of target data blocks that are not stored in a predefined database; classifying the set of target data blocks as at least one piece of stream data based on stream information, and mounting the at least one piece of stream data to a to-be-confirmed queue; encoding the at least one piece of stream data, and transmitting the encoded at least one piece of stream data to a decoding device; and receiving a confirmation message sent by the decoding device for the encoded at least one piece of stream data, and storing a set of data blocks, in the to-be-confirmed queue, that correspond to the confirmation message, in the predefined database.
  • To achieve the above purpose, in another aspect, the present disclosure further provides an encoding device. The encoding device comprises a memory and a processor, where the memory stores computer programs that, when executed by the processor, implement the following steps: acquiring to-be-transmitted data, and dividing the to-be-transmitted data into a plurality of data blocks; determining, from the plurality of data blocks, a set of target data blocks that are not stored in a predefined database; classifying the set of target data blocks as at least one piece of stream data based on stream information, and mounting the at least one piece of stream data to a to-be-confirmed queue; encoding the at least one piece of stream data, and transmitting the encoded at least one piece of stream data to a decoding device; and receiving a confirmation message sent by the decoding device for the encoded at least one piece of stream data, and storing a set of data blocks, in the to-be-confirmed queue, that correspond to the confirmation message, in the predefined database.
  • To achieve the above purpose, in another aspect, the present disclosure further provides a data storage method. The method includes: receiving encoded data transmitted by an encoding device, and decoding the encoded data into decoded data, where the decoded data includes a plurality of data blocks; classifying the plurality of data blocks as at least one piece of stream data based on stream information, and determining a piece of target stream data whose data volume is greater than or equal to a specified threshold; and storing the piece of target stream data in a predefined database, and sending a confirmation message to the encoding device, where the confirmation message includes a tag for a data block in the piece of target stream data, to allow the encoding device to store the data block in the piece of target stream data.
  • To achieve the above purpose, in another aspect, the present disclosure further provides a decoding device. The decoding device comprises a memory and a processor, where the memory stores computer programs that, when executed by the processor, implement the following steps: receiving encoded data transmitted by an encoding device, and decoding the encoded data into decoded data, where the decoded data includes a plurality of data blocks; classifying the plurality of data blocks as at least one piece of stream data based on stream information, and determining a piece of target stream data whose data volume is greater than or equal to a specified threshold; and storing the piece of target stream data in a predefined database, and sending a confirmation message to the encoding device, where the confirmation message includes a tag for a data block in the piece of target stream data, to allow the encoding device to store the data block in the piece of target stream data.
  • As can be seen from the above, in the technical solutions provided by the present disclosure, when the encoding device needs to transmit data to the decoding device, the encoding device may first divide the data into blocks based on the content, and determine target data blocks that are not stored in the predefined database. For a target data block, the encoding device may not directly store it in the predefined database, but rather classify it as stream data based on the stream information, and mount the classified stream data in the to-be-confirmed queue. The stream data may then be encoded and transmitted to the decoding device. After decoding the received data, the decoding device classifies the decoded data blocks as stream data based on the stream information. If the data volume of the stream data is large, the decoding device may store the stream data in the local database, and send a confirmation message to the encoding device. The confirmation message may include the tags of the data blocks in the stored stream data. In this way, after receiving the confirmation message, the encoding device may identify the data blocks in the to-be-confirmed queue that correspond to the confirmation message, so that the data blocks corresponding to the confirmation message may be stored in its own database. As can be seen, serving as a data transmitting terminal, the encoding device may execute a data block storing process only after the decoding device (serving as the data receiving terminal) stores the data blocks. In this way, even if a data transmission failure occurs and leaves the encoding device unable to receive the confirmation message, the effect is merely that the data in the database of the decoding device is more complete than the data in the database of the encoding device. Accordingly, for a tag sent by the encoding device that signifies an already-stored data block, the decoding device is evidently able to identify the corresponding data block in its local database and restore the data block. Therefore, in the technical solutions provided by the present disclosure, even if the data in the database of the encoding device and the database of the decoding device are not synchronized, the data receiving process in the decoding device is not affected, thereby improving the efficiency of data transmission. In addition, when there is a need to perform a similarity match on target stream data with a relatively small data volume, in order to avoid the problem of insufficiency of the feature index of the target stream data, the virtual stream data where the target stream data is located may be determined. Based on the feature index of the virtual stream data, the similarity match is performed for the target stream data. In this way, when a stream similarity match is performed on a piece of new small stream data, all feature indexes of the stream cluster are used, and thus the problem of insufficiency of the feature index can be solved.
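The confirm-before-store behavior summarized above can be sketched from the encoding device's side as follows. This is a simplified illustration under stated assumptions: stream classification is omitted, tags are assumed to be SHA-256 hashes, and the `Encoder` class, its method names, and the `(kind, value)` output format are hypothetical.

```python
import hashlib


def tag_of(block: bytes) -> str:
    """Assumed tag: SHA-256 hash of the data block."""
    return hashlib.sha256(block).hexdigest()


class Encoder:
    def __init__(self):
        self.database = {}         # tag -> block, stored only after confirmation
        self.to_be_confirmed = {}  # tag -> block, sent but not yet confirmed

    def encode(self, blocks):
        """Replace already-stored blocks with tags; queue the rest for confirmation."""
        encoded = []
        for block in blocks:
            tag = tag_of(block)
            if tag in self.database:
                encoded.append(("tag", tag))
            else:
                self.to_be_confirmed[tag] = block
                encoded.append(("block", block))
        return encoded

    def on_confirmation(self, tags):
        """Store only the blocks that the decoding device confirms it has stored."""
        for tag in tags:
            block = self.to_be_confirmed.pop(tag, None)
            if block is not None:
                self.database[tag] = block
```

If a confirmation message is lost, the block simply stays out of the encoder's database and is sent in full again, so the encoder never emits a tag that the decoder cannot resolve.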
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • To make the technical solutions in the embodiments of the present disclosure clearer, a brief introduction of the accompanying drawings consistent with descriptions of the embodiments will be provided hereinafter. It is to be understood that the following described drawings are merely some embodiments of the present disclosure. Based on the accompanying drawings and without creative efforts, persons of ordinary skill in the art may derive other drawings.
  • FIG. 1 is a schematic diagram of a system for applying a data storage method according to some embodiments of the present disclosure;
  • FIG. 2 is a flowchart of a data storage method in an encoding device according to some embodiments of the present disclosure;
  • FIG. 3 is a schematic diagram of a data dividing process according to some embodiments of the present disclosure;
  • FIG. 4 is a schematic diagram of data storage according to some embodiments of the present disclosure;
  • FIG. 5 is a schematic structural diagram of an encoding device according to some embodiments of the present disclosure;
  • FIG. 6 is a flowchart of a data storage method in a decoding device according to some embodiments of the present disclosure;
  • FIG. 7 is a schematic structural diagram of a decoding device according to some embodiments of the present disclosure; and
  • FIG. 8 is a schematic structural diagram of a computer terminal according to some embodiments of the present disclosure.
  • DETAILED DESCRIPTION
  • To make the objectives, technical solutions, and advantages of the present disclosure clearer, specific embodiments of the present disclosure will be described in detail with reference to the accompanying drawings.
  • Embodiment 1
  • The present disclosure provides a data storage method, which may be applied in a system architecture shown in FIG. 1. In FIG. 1, the server may serve as a data source, and the client terminal may serve as a party to request loading data from the server. Here, after receiving the data loading request sent by the client terminal, the server may transmit data corresponding to the data loading request, through an encoding device, to a decoding device on the client terminal side. The technical solutions provided by the present disclosure may be applied to the aforementioned encoding device and decoding device.
  • In the data storage method provided in the disclosed embodiment, the execution entity may be an encoding device. Referring to FIG. 2, the method may include the following steps.
  • S11: acquiring to-be-transmitted data, and dividing the to-be-transmitted data into a plurality of data blocks.
  • In the disclosed embodiment, the encoding device may acquire the to-be-transmitted data from the server, and divide the to-be-transmitted data into blocks based on the content. In actual applications, the data may be divided into blocks using a Rabin fingerprint algorithm. Specifically, referring to FIG. 3, the data may comprise a plurality of characters, which may be 8-bit binary numbers. When dividing the data into blocks, a data sliding window may be employed to slide from the beginning to the end of the data according to a fixed step length, and the Rabin fingerprint of the data blocks in each data sliding window is calculated one by one. If the calculated fingerprint value is the same as a predefined fingerprint value, the ending position of the present data sliding window may be used as a data block dividing position. For example, in FIG. 3, the data sliding window may slide to the right one character at a time, where the fingerprint k for the data blocks in the k window is the same as the predefined fingerprint value. The ending position (the dashed line position in the figure) of the k window may thus be set as a dividing position for dividing data blocks. Eventually, when the data sliding window moves to the end of the data, the block dividing process may be completed. In this way, through the block dividing process, the data may be divided into a plurality of data blocks.
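  • The block dividing process described above can be illustrated with a minimal Python sketch. It is not the Rabin fingerprint algorithm itself: a simple polynomial rolling hash stands in for the fingerprint, and the function name `split_blocks`, the window size, the mask, and the target value are all assumptions chosen for illustration. A dividing position is declared when the masked hash of the current window equals the predefined value.

```python
def split_blocks(data: bytes, window: int = 4, mask: int = 0x0F, target: int = 0) -> list:
    """Cut `data` wherever the rolling hash of the last `window` bytes matches `target`.

    Assumption: a polynomial rolling hash replaces the Rabin fingerprint,
    and only the low bits selected by `mask` are compared.
    """
    base, mod = 257, (1 << 31) - 1
    top = pow(base, window - 1, mod)   # weight of the byte that leaves the window
    blocks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        if i >= window:
            h = (h - data[i - window] * top) % mod   # drop the oldest byte
        h = (h * base + byte) % mod                  # append the newest byte
        if i + 1 >= window and (h & mask) == target:
            blocks.append(data[start:i + 1])         # window end is a dividing position
            start = i + 1
    if start < len(data):
        blocks.append(data[start:])                  # trailing bytes form the last block
    return blocks
```

  • Because the dividing positions depend only on the bytes inside the window, identical content tends to produce identical blocks even when it appears at a different offset, which is what makes the later hash comparison effective.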
  • In the disclosed embodiment, after the data is divided into a plurality of data blocks, the encoding device may also calculate a hash value for each data block, and the calculated hash value may uniquely signify the corresponding data block.
  • S12: determining, from the plurality of data blocks, a set of target data blocks that are not stored in a predefined database.
  • In the disclosed embodiment, the encoding device may store to-be-transmitted data in a predefined database, so that when sending out data later, the encoding device may determine whether the to-be-transmitted data is the duplicate data that has already been stored in the predefined database. In this way, after performing the block dividing process on the to-be-transmitted data to get a plurality of data blocks, the encoding device may compare the plurality of data blocks with the data blocks in the predefined database, to determine, among the plurality of data blocks, target data blocks that are not stored in the predefined database and the already-stored data blocks that have already been stored in the predefined database. Specifically, a data block in the predefined database may have its own hash value, while each data block in the plurality of data blocks may also have its own hash value. By comparing the hash values of the two sides, the target data blocks and the already-stored data blocks in the plurality of data blocks may thus be determined.
  • In the disclosed embodiment, a data block stored in the predefined database may be associated with a data block tag, and the data block tag may uniquely signify the associated data block. Compared to the data volume of a data block, the data volume of a data block tag will be much smaller. In actual applications, the data block tag may be a hash value of the data block or a unique character string assigned to the data block by the encoding device. The modes for expressing a data block tag are not limited in the present disclosure, and other expression modes are also applicable as long as they are able to uniquely signify a data block.
  • In the disclosed embodiment, after determining the un-stored target data blocks and the already-stored data blocks in the plurality of data blocks, a data block tag associated with an already-stored data block may be used to replace the already-stored data block in the plurality of data blocks, and thus the plurality of data blocks may be converted into a combination of the target data blocks and data block tags. This may reduce the amount of data that needs to be transmitted.
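  • The replacement of already-stored data blocks by data block tags may be sketched as follows. The use of SHA-256 as the hash, the helper names `tag_of` and `substitute_tags`, and the `("tag", ...)` / `("block", ...)` payload representation are assumptions made for illustration; the disclosure only requires that a tag uniquely signify its data block.

```python
import hashlib

def tag_of(block: bytes) -> str:
    # Assumption: the data block tag is the SHA-256 digest of the block.
    return hashlib.sha256(block).hexdigest()

def substitute_tags(blocks, stored_tags):
    """Return (payload, targets).

    payload: the blocks in order, with already-stored blocks replaced by tags.
    targets: the un-stored target data blocks, in order.
    """
    payload, targets = [], []
    for block in blocks:
        tag = tag_of(block)
        if tag in stored_tags:
            payload.append(("tag", tag))       # already stored: send only the tag
        else:
            payload.append(("block", block))   # not stored: send the block itself
            targets.append(block)
    return payload, targets
```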
  • S13: classifying the set of target data blocks as at least one piece of stream data based on stream information, and mounting the at least one piece of stream data to a to-be-confirmed queue.
  • In the disclosed embodiment, for the target data blocks that are not stored in the predefined database, the encoding device may classify the target data blocks as at least one piece of stream data based on the stream information. Here, the stream information may be five-tuple information included in a target data block. The five-tuple information may include the source IP address for transmitting the data block, the destination IP address for receiving the data block, the source port for transmitting the data block, the destination port for receiving the data block, and the transmission protocol used for the data block. In this way, the encoding device may classify data blocks that have the same five-tuple information into one piece of stream data.
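  • As a minimal sketch of this classification, data blocks may be grouped by their five-tuple information. The `FiveTuple` type, its field names, and the `classify` function are hypothetical and serve only to illustrate the grouping.

```python
from collections import defaultdict
from typing import NamedTuple

class FiveTuple(NamedTuple):
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: str

def classify(blocks):
    """Group (FiveTuple, bytes) pairs into stream data, keeping arrival order."""
    streams = defaultdict(list)
    for five_tuple, block in blocks:
        streams[five_tuple].append(block)   # same five-tuple, same piece of stream data
    return dict(streams)
```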
  • In the disclosed embodiment, after classifying the target data blocks as the stream data, the encoding device may not directly store the stream data in the predefined database, but rather mount the stream data to a to-be-confirmed queue. After receiving the confirmation message sent by the decoding device, the encoding device may store the corresponding stream data in the to-be-confirmed queue in the predefined database.
  • In one embodiment, the total mounting period of each stream data may be determined. When the total mounting period reaches a predefined time-length threshold, the stream data is considered to have an overtime mounting. In order to avoid excessive stream data mounted in the to-be-confirmed queue, if the stream data in the to-be-confirmed queue has an overtime mounting, the stream data with the overtime mounting may be directly discarded, to save space of the to-be-confirmed queue.
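  • The to-be-confirmed queue, including the handling of overtime mounting, may be sketched as below. The class name `PendingQueue`, its methods, and the choice to key entries by data block tag are assumptions for illustration; the disclosure mounts stream data and does not prescribe a particular data structure.

```python
import time

class PendingQueue:
    """Holds data awaiting confirmation from the decoding device."""

    def __init__(self, max_age: float):
        self.max_age = max_age   # predefined time-length threshold, in seconds
        self._pending = {}       # tag -> (block, time at which it was mounted)

    def mount(self, tag, block, now=None):
        self._pending[tag] = (block, time.monotonic() if now is None else now)

    def confirm(self, tags, database):
        """Move confirmed blocks into `database`; unknown tags are ignored."""
        for tag in tags:
            entry = self._pending.pop(tag, None)
            if entry is not None:
                database[tag] = entry[0]

    def discard_expired(self, now=None):
        """Drop entries whose mounting period has reached the threshold."""
        now = time.monotonic() if now is None else now
        expired = [t for t, (_, at) in self._pending.items() if now - at >= self.max_age]
        for tag in expired:
            del self._pending[tag]
        return expired
```

  • The optional `now` argument exists only so that the expiry behavior can be exercised without waiting for real time to pass.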
  • S14: encoding the at least one piece of stream data, and transmitting the encoded at least one piece of stream data to a decoding device.
  • In the disclosed embodiment, after classifying the target data blocks as the stream data, the encoding device may encode the stream data according to a specified encoding algorithm. In actual applications, the to-be-transmitted data may include data block tags that replace the already-stored data blocks and stream data that are classified from the un-stored target data blocks. In this case, during encoding, the data block tags that replace the already-stored data blocks may be encoded at the same time. After encoding the data block tags and the stream data separately, the encoding device may transmit the encoded stream data and the encoded data block tags to the links of the wide area network, to allow the transmission of the encoded stream data and the encoded data block tags to the decoding device on the client terminal side.
  • In the disclosed embodiment, after receiving the encoded data transmitted by the encoding device, the decoding device may decode the encoded data into decoded data. The decoded data may include a plurality of data blocks and tags that are used to signify the data blocks that have already been stored in the encoding device. Since the decoding device may receive data blocks from multiple pieces of stream data in an interleaved manner, the plurality of data blocks in the decoded data may not be arranged according to the stream data. In this case, the decoding device may classify the plurality of data blocks as at least one piece of stream data based on stream information. Here, the stream information may be the above-described five-tuple information. The decoding device may classify the data blocks that have the same five-tuple information into the same piece of stream data.
  • In the disclosed embodiment, storing stream data with a relatively small data volume in the predefined database of the decoding device may lead to a large number of data fragments in that database, which would cost the decoding device a considerable amount of time in subsequent data indexing. To prevent this from happening, the decoding device may determine target stream data that has a data volume greater than or equal to a specified threshold. The specified threshold may be a predefined constant used to determine whether the data volume of the stream data is too small. When the data volume of the stream data is greater than or equal to the specified threshold, the stream data is considered to meet the criteria of being storable. At this point, the decoding device may store the target stream data in its own predefined database. Referring to FIG. 4, the predefined database of the decoding device may be a magnetic disk of the decoding device. On the magnetic disk, the target stream data may be stored as a storage unit. As shown in FIG. 4, the storage unit may be defined as a container. The storage unit may contain the data blocks included in the target stream data and index information of the target stream data. The index information may be used to signify the starting offset of each data block in the storage unit and the data size of each data block. After the target stream data is stored on the magnetic disk as a storage unit, the feature index of the target stream data may also be extracted. The extracted feature index may be written into the memory and periodically persisted to the magnetic disk. If it is necessary to match similar or identical stream data from the predefined database later, a preliminary assessment may be made based on the feature index.
Specifically, the feature index may be extracted based on the intrinsic features of the target stream data. The intrinsic features may include, for example, blocking window sizes, out-of-order fragments, and the like. When extracting the feature index, certain features may be specified in advance from the intrinsic features. The feature to which each data block of the target stream data corresponds may then be identified. The identified features may be sequenced based on the order of the data blocks in the target stream data, to obtain the feature index of the target stream data. As one example, suppose feature 1 to feature 3 are specified and the target stream data includes five data blocks, which respectively correspond to feature 2, feature 1, feature 3, feature 1, and feature 2. The sequence of feature 2, feature 1, feature 3, feature 1, and feature 2 may then be considered as the feature index of the target stream data. In FIG. 4, each feature index may include a key and a value corresponding to the key (cn1, cn2, cnn, etc.). The structure of the storage unit may include metadata, the offset and data size corresponding to the key, and the specific stored data (data rec).
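  • The storage unit and the feature index described above may be sketched as follows. The dictionary layout, the function names `build_container`, `read_block`, and `feature_index`, and the `feature_of` callback that maps a data block to one of the pre-specified features are hypothetical; the on-disk format of FIG. 4 is not reproduced here.

```python
def build_container(blocks):
    """Pack blocks into one storage unit with (starting offset, size) index entries."""
    index, offset = [], 0
    for block in blocks:
        index.append((offset, len(block)))
        offset += len(block)
    return {"index": index, "data": b"".join(blocks)}

def read_block(container, i):
    """Recover the i-th data block using only the index information."""
    offset, size = container["index"][i]
    return container["data"][offset:offset + size]

def feature_index(blocks, feature_of):
    """Sequence the feature of each block in block order.

    `feature_of` is an assumed classifier supplied by the caller.
    """
    return [feature_of(block) for block in blocks]
```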
  • In the disclosed embodiment, regarding the tag for a data block included in the decoded data, the decoding device may query, in its own predefined database, whether there exists a data block corresponding to the tag. If there exists a data block corresponding to the tag, the tag may be restored to the corresponding data block. If there does not exist a data block corresponding to the tag, it indicates that certain data blocks are missing in the current database of the decoding device. At this moment, the decoding device may send a synchronization request to the encoding device, to synchronize with the data blocks stored in the encoding device.
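  • The tag lookup on the decoding side may be sketched as below. Each payload item is assumed to be either `("block", bytes)` or `("tag", str)`, and the database is assumed to be a mapping from tag to data block; the function name `restore` and the use of `None` as a placeholder are illustrative only. The returned list of missing tags is what a synchronization request to the encoding device would need to cover.

```python
def restore(payload, database):
    """Replace tags by stored blocks; collect tags that cannot be resolved."""
    restored, missing = [], []
    for kind, value in payload:
        if kind == "block":
            restored.append(value)             # block was transmitted in full
        elif value in database:
            restored.append(database[value])   # tag restored to its data block
        else:
            restored.append(None)              # placeholder until synchronization
            missing.append(value)
    return restored, missing
```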
  • In the disclosed embodiment, after storing the target stream data in its own database and extracting the feature index into the memory, the decoding device may send a confirmation message to the encoding device. The confirmation message may include tags for the data blocks included in the target stream data, which allows the encoding device to store the data blocks in the target stream data. Here, the tags for the data blocks in the target stream data may be the calculated hash values of the data blocks.
  • S15: receiving a confirmation message sent by the decoding device for the encoded at least one piece of stream data, and storing a set of data blocks, in the to-be-confirmed queue, that correspond to the confirmation message, in the predefined database.
  • In the disclosed embodiment, after receiving the confirmation message sent by the decoding device, the encoding device may determine data blocks, in the to-be-confirmed queue, that correspond to the confirmation message, and store the data blocks that correspond to the confirmation message in the predefined database. Specifically, the confirmation message may include the tags for the data blocks. The tags may be, for example, the hash values of the data blocks. The encoding device may identify the corresponding data blocks in the to-be-confirmed queue through the tags of the data blocks, and store the identified data blocks.
  • In actual applications, the encoding device's own database may also be a magnetic disk as shown in FIG. 4. The encoding device may also store the data blocks on the magnetic disk based on the stream data. The stream data where the data blocks corresponding to the confirmation message are located may be stored as a storage unit on the magnetic disk of the encoding device. The storage unit may be defined as a container. The storage unit may include data blocks included in the stream data and index information of the stream data. The index information may be used to signify the starting offset of each data block in the storage unit and the data size of each data block. After the stream data is stored in the magnetic disk according to the storage unit, the encoding device may also extract the feature index of the stream data, write the feature index into the memory, and periodically and persistently store the feature index on the magnetic disk.
  • In one embodiment of the present disclosure, after the decoding device obtains the decoded data and classifies the data blocks contained therein as stream data, if the data volume of the stream data is less than the above-noted specified threshold, the decoding device may mount the stream data that has a data volume less than the specified threshold into a to-be-aggregated queue. For the stream data in the to-be-aggregated queue, the decoding device will aggregate multiple pieces of stream data only after it is confirmed that there is a correlation among the multiple pieces of stream data. Specifically, in the disclosed embodiment, an application proxy module or other correlation analysis modules may be used to determine multiple pieces of stream data that have a correlation. The application proxy module may be, for example, a proxy for HTTP (HyperText Transfer Protocol), MAPI (Messaging Application Programming Interface), or CIFS (Common Internet File System). These application proxy modules may determine that multiple different pieces of stream data are located in the same session, and thus determine the existence of correlated stream data. After receiving the correlation information sent by an application proxy module, the decoding device may aggregate multiple pieces of stream data, in the to-be-aggregated queue, that correspond to the correlation information, into a piece of virtual stream data. The correlation information may include five-tuple information of the multiple pieces of correlated stream data. Based on the respective five-tuple information included in the correlation information, the decoding device may determine a set of corresponding stream data in the to-be-aggregated queue. Although these pieces of stream data are correlated, their five-tuple information differs, so they can only be aggregated into a piece of virtual stream data that takes the form of stream data.
In the virtual stream data, multiple sets of five-tuple information are still included. After obtaining the virtual stream data from the aggregation, the decoding device may store the virtual stream data in its own predefined database, extract the feature index of the virtual stream data, write the feature index into the memory, and periodically and persistently store the feature index on the magnetic disk.
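  • The threshold check and the aggregation into virtual stream data may be sketched as follows. The function names, the dictionary used for the virtual stream data, and the use of arbitrary hashable keys in place of five-tuple information are assumptions for illustration.

```python
def partition_by_volume(streams, threshold):
    """Split streams into storable ones and ones for the to-be-aggregated queue."""
    storable, to_aggregate = {}, {}
    for key, blocks in streams.items():
        volume = sum(len(block) for block in blocks)
        if volume >= threshold:
            storable[key] = blocks
        else:
            to_aggregate[key] = blocks
    return storable, to_aggregate

def aggregate(to_aggregate, correlated_keys):
    """Merge the correlated small streams into one piece of virtual stream data.

    The virtual stream keeps every member's key, mirroring the multiple
    sets of five-tuple information retained in the virtual stream data.
    """
    members = [key for key in correlated_keys if key in to_aggregate]
    virtual = {"five_tuples": members, "blocks": []}
    for key in members:
        virtual["blocks"].extend(to_aggregate.pop(key))   # leaves the queue
    return virtual
```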
  • In the disclosed embodiment, after aggregating multiple pieces of stream data into one piece of virtual stream data and storing the virtual stream data, the decoding device may add an aggregation instruction to a confirmation message to be sent to the encoding device. The aggregation instruction may include five-tuple information of the multiple pieces of stream data that need to be aggregated, or may include the tag of each data block in the multiple pieces of stream data that need to be aggregated. In this way, the aggregation instruction may include the tag of each stream data in the already-stored virtual stream data. After extracting the aggregation instruction from the confirmation message, the encoding device may identify a set of target stream data in the to-be-confirmed queue, aggregate the set of target stream data into a piece of virtual stream data, and extract the feature index of the virtual stream data into the memory.
  • In the disclosed embodiment, when a similarity match needs to be performed on the target stream data that has a relatively small data volume, in order to avoid the problem of insufficiency of the feature index of the target stream data, the virtual stream data where the target stream data is located may be determined. Based on the feature index of the virtual stream data, similarity match may be performed on the target stream data. In this way, when a stream similarity match is performed on a piece of new small stream data, all feature indexes in the stream cluster are used, and thus the problem of insufficiency of the feature index can be solved.
  • Embodiment 2
  • Referring to FIG. 5, the present disclosure further provides an encoding device. The encoding device comprises a memory and a processor, where the memory stores computer programs that, when executed by the processor, implement the following steps:
  • S11: acquiring to-be-transmitted data, and dividing the to-be-transmitted data into a plurality of data blocks;
  • S12: determining, from the plurality of data blocks, a set of target data blocks that are not stored in a predefined database;
  • S13: classifying the set of target data blocks as at least one piece of stream data based on stream information, and mounting the at least one piece of stream data to a to-be-confirmed queue;
  • S14: encoding the at least one piece of stream data, and transmitting the encoded at least one piece of stream data to a decoding device; and
  • S15: receiving a confirmation message sent by the decoding device for the encoded at least one piece of stream data, and storing a set of data blocks, in the to-be-confirmed queue, that correspond to the confirmation message, in the predefined database.
  • In the disclosed embodiment, the computer programs, when executed by the processor, further implement the following step:
  • extracting a feature index of a piece of stream data where a data block corresponding to the confirmation message is located, writing the feature index into the memory, and periodically and persistently storing the feature index on a magnetic disk.
  • In the disclosed embodiment, the confirmation message further includes an aggregation instruction, and the aggregation instruction includes tags for a set of target stream data in the to-be-confirmed queue, and correspondingly, the computer programs, when executed by the processor, further implement the following steps:
  • aggregating the set of target stream data into a piece of virtual stream data; and
  • storing the piece of virtual stream data in the predefined database, extracting a feature index of the piece of virtual stream data into the memory, and periodically and persistently storing the feature index on a magnetic disk.
  • Embodiment 3
  • Referring to FIG. 6, the present disclosure further provides a data storage method. The execution entity of the method may be a decoding device. The method may include the following steps.
  • S21: receiving encoded data transmitted by an encoding device, and decoding the encoded data into decoded data, where the decoded data includes a plurality of data blocks.
  • S22: classifying the plurality of data blocks as at least one piece of stream data based on stream information, and determining a piece of target stream data whose data volume is greater than or equal to a specified threshold.
  • S23: storing the piece of target stream data in a predefined database, and sending a confirmation message to the encoding device, where the confirmation message includes a tag for a data block in the piece of target stream data, to allow the encoding device to store the data block in the piece of target stream data.
  • In the disclosed embodiment, after receiving the encoded data transmitted by the encoding device, the decoding device may decode the encoded data into decoded data. The decoded data may include a plurality of data blocks and tags that are used to signify the data blocks that have already been stored in the encoding device. Since the decoding device may receive data blocks from multiple pieces of stream data in an interleaved manner, the plurality of data blocks in the decoded data may not be arranged according to the stream data. In this case, the decoding device may classify the plurality of data blocks as at least one piece of stream data based on stream information. Here, the stream information may be the above-described five-tuple information. The decoding device may classify the data blocks that have the same five-tuple information into the same piece of stream data.
  • In the disclosed embodiment, storing stream data with a relatively small data volume in the predefined database of the decoding device may lead to a large number of data fragments in that database, which would cost the decoding device a considerable amount of time in subsequent data indexing. To prevent this from happening, the decoding device may determine target stream data that has a data volume greater than or equal to a specified threshold. The specified threshold may be a predefined constant used to determine whether the data volume of the stream data is too small. When the data volume of the stream data is greater than or equal to the specified threshold, the stream data is considered to meet the criteria of being storable. At this point, the decoding device may store the target stream data in its own predefined database. Referring to FIG. 4, the predefined database of the decoding device may be a magnetic disk of the decoding device. On the magnetic disk, the target stream data may be stored as a storage unit. As shown in FIG. 4, the storage unit may be defined as a container. The storage unit may contain the data blocks included in the target stream data and index information of the target stream data. The index information may be used to signify the starting offset of each data block in the storage unit and the data size of each data block. After the target stream data is stored on the magnetic disk as a storage unit, the feature index of the target stream data may also be extracted. The extracted feature index may be written into the memory and periodically persisted to the magnetic disk. If it is necessary to match similar or identical stream data from the predefined database later, a preliminary assessment may be made based on the feature index.
Specifically, the feature index may be extracted based on the intrinsic features of the target stream data. The intrinsic features may include, for example, blocking window sizes, out-of-order fragments, and the like. When extracting the feature index, certain features may be specified in advance from the intrinsic features. The feature to which each data block of the target stream data corresponds may then be identified. The identified features may be sequenced based on the order of the data blocks in the target stream data, to obtain the feature index of the target stream data. As one example, suppose feature 1 to feature 3 are specified and the target stream data includes five data blocks, which respectively correspond to feature 2, feature 1, feature 3, feature 1, and feature 2. The sequence of feature 2, feature 1, feature 3, feature 1, and feature 2 may then be considered as the feature index of the target stream data.
  • In the disclosed embodiment, regarding the tag for a data block included in the decoded data, the decoding device may query, in its own predefined database, whether there exists a data block corresponding to the tag. If there exists a data block corresponding to the tag, the tag may be restored to the corresponding data block. If there does not exist a data block corresponding to the tag, it indicates that certain data blocks are missing in the current database of the decoding device. At this moment, the decoding device may send a synchronization request to the encoding device, to synchronize with the data blocks stored in the encoding device.
  • In the disclosed embodiment, after storing the target stream data in its own database and extracting the feature index into the memory, the decoding device may send a confirmation message to the encoding device. The confirmation message may include tags for the data blocks included in the target stream data, which allows the encoding device to store the data blocks in the target stream data. Here, the tags for the data blocks in the target stream data may be the calculated hash values of the data blocks.
  • In one embodiment of the present disclosure, after the decoding device obtains the decoded data and classifies the data blocks contained therein as stream data, if the data volume of the stream data is less than the above-noted specified threshold, the decoding device may mount the stream data that has a data volume less than the specified threshold into a to-be-aggregated queue. For the stream data in the to-be-aggregated queue, the decoding device will aggregate multiple pieces of stream data only after it is confirmed that there is a correlation among the multiple pieces of stream data. Specifically, in the disclosed embodiment, an application proxy module or other correlation analysis modules may be used to determine the pieces of stream data that have a correlation. The application proxy module may be, for example, a proxy for HTTP (HyperText Transfer Protocol), MAPI (Messaging Application Programming Interface), or CIFS (Common Internet File System). These application proxy modules may determine that multiple different pieces of stream data are located in the same session, and thus determine the existence of correlated stream data. After receiving the correlation information sent by an application proxy module, the decoding device may aggregate multiple pieces of stream data, in the to-be-aggregated queue, that correspond to the correlation information, into a piece of virtual stream data. The correlation information may include five-tuple information of the multiple pieces of correlated stream data. Based on the respective five-tuple information included in the correlation information, the decoding device may determine a set of corresponding stream data in the to-be-aggregated queue. Although these pieces of stream data are correlated, their five-tuple information differs, so they can only be aggregated into a piece of virtual stream data that takes the form of stream data.
In the virtual stream data, multiple sets of five-tuple information are still included. After obtaining the virtual stream data from the aggregation, the decoding device may store the virtual stream data in its own predefined database, extract the feature index of the virtual stream data, write the feature index into the memory, and periodically and persistently store the feature index on the magnetic disk.
  • In the disclosed embodiment, after aggregating multiple pieces of stream data into one piece of virtual stream data and storing the virtual stream data, the decoding device may add an aggregation instruction to a confirmation message to be sent to the encoding device. The aggregation instruction may include five-tuple information of the multiple pieces of stream data that need to be aggregated, or may include the tag of each data block in the multiple pieces of stream data that need to be aggregated. In this way, the aggregation instruction may include the tag of each stream data in the already-stored virtual stream data. After extracting the aggregation instruction from the confirmation message, the encoding device may identify a set of target stream data in the to-be-confirmed queue, aggregate the set of target stream data into a piece of virtual stream data, and extract the feature index of the virtual stream data into the memory.
  • In one embodiment, the decoding device may determine the total mounting period of the stream data in the to-be-aggregated queue. When the total mounting period exceeds a specified time-length threshold, the stream data may be considered to have an overtime mounting. To save the space of the to-be-aggregated queue, if the stream data, whose data volume is less than the specified threshold, has an overtime mounting in the to-be-aggregated queue, the decoding device may discard the stream data with the overtime mounting.
  • In one embodiment of the present disclosure, after the decoding device classifies the data blocks in the decoded data as stream data, the classified stream data may well be similar to some stream data already stored in the database. For example, the decoding device may have previously received the data for the first draft of a document, and then received the data for the revised draft of the document. The revised draft may differ from the first draft in only a small amount of content. The data for the first draft and the data for the revised draft may thus be considered as two pieces of similar stream data. At this point, if the classified stream data is stored in the database as is, there will be two pieces of similar stream data in the database, which wastes the storage space of the database. In this situation, the decoding device may search its own predefined database for similar stream data whose similarity to the classified stream data reaches a specified similarity threshold. Specifically, the similarity of two pieces of stream data may be determined by comparing their feature indexes. For example, the number of correspondingly matched features in the feature indexes of the two pieces of stream data may be determined, and the ratio of the number of correspondingly matched features to the total number of features of the feature indexes is calculated. The calculated ratio may be considered as the similarity between the two pieces of stream data. The higher the ratio, the higher the similarity. The specified similarity threshold may be a predefined constant, which can be flexibly adjusted based on actual conditions. After determining the similar stream data, the classified stream data and the determined similar stream data may be compared, to determine the additional new data in the classified stream data.
Specifically, the similar stream data may be read into the memory from the predefined database, and different data between the similar stream data and the classified stream data may be identified by comparison. The identified different data may be considered as additional new data. In the disclosed embodiment, if the data volume of the additional new data is large, the additional new data may be stored as a piece of new stream data according to the above-described approach. If the data volume of the additional new data is smaller than the specified data volume threshold, the additional new data may be added to the similar stream data. In this way, the problem that the stream data with a small data volume cannot be efficiently identified through the feature index can then be avoided.
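  • The similarity ratio and the handling of additional new data can be illustrated with a short sketch. Several details are assumptions made only for illustration: a feature index is modeled as a set of integers, the ratio is taken over the features of the classified stream, streams are modeled as lists of blocks, and the function names are hypothetical.

```python
from __future__ import annotations


def similarity(index_a: set[int], index_b: set[int]) -> float:
    """Ratio of matched features to the total number of features in index_a."""
    if not index_a:
        return 0.0
    return len(index_a & index_b) / len(index_a)


def additional_new_blocks(classified: list[bytes], similar: list[bytes]) -> list[bytes]:
    """Blocks present in the classified stream but absent from the similar stream."""
    known = set(similar)
    return [block for block in classified if block not in known]


def plan_storage(
    classified: list[bytes], similar: list[bytes], volume_threshold: int
) -> tuple[str, list[bytes]]:
    """Decide whether the new data is appended to the similar stream or stored
    as a new stream, based on its total size in bytes."""
    new_blocks = additional_new_blocks(classified, similar)
    volume = sum(len(block) for block in new_blocks)
    if volume < volume_threshold:
        return "append", new_blocks
    return "new_stream", new_blocks
```

  • For example, with a first draft `[b"intro", b"body", b"end"]` and a revised draft `[b"intro", b"body", b"new", b"end"]`, only `b"new"` is additional, and with a threshold of 16 bytes it would be appended to the stored first draft rather than stored separately.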
  • Embodiment 4
  • Referring to FIG. 7, the present disclosure further provides a decoding device. The decoding device comprises a memory and a processor, where the memory stores computer programs that, when executed by the processor, implement the following steps:
  • S21: receiving encoded data transmitted by an encoding device, and decoding the encoded data into decoded data, where the decoded data includes a plurality of data blocks;
  • S22: classifying the plurality of data blocks as at least one piece of stream data based on stream information, and determining a piece of target stream data whose data volume is greater than or equal to a specified threshold; and
  • S23: storing the piece of target stream data in a predefined database, and sending a confirmation message to the encoding device, where the confirmation message includes a tag for a data block in the piece of target stream data, to allow the encoding device to store the data block in the piece of target stream data.
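  • Steps S21 to S23 can be sketched roughly as below. The sketch assumes that decoding has already produced `(stream_id, block)` pairs, that a block tag is the SHA-256 digest of the block, and that the database is an in-memory dictionary; none of these choices is specified by the disclosure, and the function names are hypothetical.

```python
from __future__ import annotations

import hashlib
from collections import defaultdict


def block_tag(block: bytes) -> str:
    # Assumption: a tag is a content hash; the disclosure does not fix the scheme.
    return hashlib.sha256(block).hexdigest()


def handle_decoded(
    blocks: list[tuple[str, bytes]], threshold: int, database: dict[str, bytes]
) -> tuple[dict, dict[str, list[bytes]]]:
    """Classify blocks into streams, store the large streams, and build a confirmation.

    Returns the confirmation message and the small streams left for aggregation.
    """
    # S22: classify data blocks as stream data based on stream information.
    streams: dict[str, list[bytes]] = defaultdict(list)
    for stream_id, block in blocks:
        streams[stream_id].append(block)

    confirmed_tags: list[str] = []
    small_streams: dict[str, list[bytes]] = {}
    for stream_id, stream_blocks in streams.items():
        volume = sum(len(block) for block in stream_blocks)
        if volume >= threshold:
            # S23: store the target stream and collect the tags to confirm.
            for block in stream_blocks:
                tag = block_tag(block)
                database[tag] = block
                confirmed_tags.append(tag)
        else:
            small_streams[stream_id] = stream_blocks

    return {"tags": confirmed_tags}, small_streams
```

  • The returned small streams are the ones that would be mounted to the to-be-aggregated queue in the steps that follow.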
  • In the disclosed embodiment, the computer programs, when executed by the processor, further implement the following steps:
  • when there exists a piece of stream data, in the classified at least one piece of stream data, whose data volume is less than the specified threshold, mounting the piece of stream data whose data volume is less than the specified threshold to a to-be-aggregated queue;
  • receiving correlation information sent by an application proxy module, where the correlation information is used to signify multiple pieces of stream data with a correlation;
  • aggregating multiple pieces of stream data, corresponding to the correlation information, in the to-be-aggregated queue, to a piece of virtual stream data, and storing the piece of virtual stream data in the predefined database;
  • extracting a feature index of the piece of virtual stream data, writing the feature index into the memory, and periodically and persistently storing the feature index in a magnetic disk; and
  • adding an aggregation instruction to the confirmation message, where the aggregation instruction includes a tag for each piece of stream data in the piece of virtual stream data, to allow the encoding device to aggregate each piece of stream data in the piece of virtual stream data and store the aggregated virtual stream data.
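  • The aggregation steps above might look like the following sketch. It assumes the to-be-aggregated queue is a dictionary from stream tag to stream bytes, that the correlation information arrives as a list of stream tags, and that a virtual stream is the concatenation of its members; the function name and the shape of the instruction are hypothetical.

```python
from __future__ import annotations


def aggregate(
    queue: dict[str, bytes], correlated_tags: list[str]
) -> tuple[bytes, dict] | None:
    """Merge the queued small streams named by the correlation information.

    Returns the virtual stream and an aggregation instruction listing the tag
    of each member, or None if fewer than two correlated streams are queued.
    """
    members = [tag for tag in correlated_tags if tag in queue]
    if len(members) < 2:
        return None
    # Members are removed from the queue as they are merged.
    virtual_stream = b"".join(queue.pop(tag) for tag in members)
    instruction = {"aggregate": members}
    return virtual_stream, instruction
```

  • The instruction carries only tags, so the encoding device can rebuild the same virtual stream from the streams it still holds in its to-be-confirmed queue without the data being retransmitted.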
  • In the disclosed embodiment, the computer programs, when executed by the processor, further implement the following steps:
  • querying, in the predefined database, a piece of similar stream data whose similarity to a piece of classified stream data reaches a specified similarity threshold;
  • comparing the piece of classified stream data and the piece of similar stream data, to identify additional new data in the piece of classified stream data; and
  • when a data volume of the additional new data is less than a specified data volume threshold, adding the additional new data to the piece of similar stream data.
  • As can be seen from the above, in the technical solutions provided by the present disclosure, when the encoding device needs to transmit data to the decoding device, the encoding device may first divide the data into blocks based on the content and determine the target data blocks that are not stored in the predefined database. The encoding device does not store a target data block in the predefined database directly; instead, it classifies the target data blocks as stream data based on the stream information and mounts the stream data in the to-be-confirmed queue. The stream data may then be encoded and transmitted to the decoding device. After decoding the received data, the decoding device classifies the decoded data blocks as stream data based on the stream information. If the data volume of the stream data is greater than or equal to a specified threshold, the decoding device may store the stream data in its local database and send a confirmation message to the encoding device. The confirmation message may include the tags of the data blocks in the stored stream data. After receiving the confirmation message, the encoding device may identify the data blocks in the to-be-confirmed queue that correspond to the confirmation message and store them in its own database. In other words, the encoding device, serving as the data transmitting terminal, stores a data block only after the decoding device, serving as the data receiving terminal, has stored it. As a result, even if a data transmission failure prevents the encoding device from receiving the confirmation message, the only effect is that the database of the decoding device is more complete than the database of the encoding device. 
Accordingly, for any tag sent by the encoding device that signifies an already-stored data block, the decoding device is able to identify the corresponding data block in its local database and restore the data block. Therefore, in the technical solutions provided by the present disclosure, even if the databases of the encoding device and the decoding device are not synchronized, the data receiving process in the decoding device is not affected, which improves the efficiency of data transmission. In addition, when a similarity match needs to be performed on target stream data with a relatively small data volume, the virtual stream data in which the target stream data is located may be determined, in order to avoid the problem of an insufficient feature index for the target stream data. The similarity match is then performed for the target stream data based on the feature index of the virtual stream data. In this way, when a stream similarity match is performed on a new piece of small stream data, all feature indexes in the stream cluster are used, which solves the problem of an insufficient feature index.
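  • The store-after-confirmation behavior of the encoding device summarized above can be sketched as follows. This is a simplified illustration under stated assumptions: tags are SHA-256 digests, the database and the to-be-confirmed queue are dictionaries, classification into stream data is omitted, and the class and method names are hypothetical.

```python
from __future__ import annotations

import hashlib


def block_tag(block: bytes) -> str:
    # Assumption: a tag is a content hash of the block.
    return hashlib.sha256(block).hexdigest()


class Encoder:
    def __init__(self) -> None:
        self.database: dict[str, bytes] = {}         # confirmed blocks only
        self.to_be_confirmed: dict[str, bytes] = {}  # sent but not yet confirmed

    def prepare(self, blocks: list[bytes]) -> list[tuple[str, bytes | str]]:
        """Replace already-stored blocks with tags; queue the rest for confirmation."""
        payload: list[tuple[str, bytes | str]] = []
        for block in blocks:
            tag = block_tag(block)
            if tag in self.database:
                payload.append(("tag", tag))
            else:
                self.to_be_confirmed[tag] = block
                payload.append(("block", block))
        return payload

    def on_confirmation(self, tags: list[str]) -> None:
        """Move confirmed blocks from the queue into the database."""
        for tag in tags:
            block = self.to_be_confirmed.pop(tag, None)
            if block is not None:
                self.database[tag] = block
```

  • If a confirmation is lost, the block simply stays out of the encoder's database and is sent in full again next time. The encoder therefore never sends a tag for a block it has not seen confirmed, which is the property the paragraph above relies on.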
  • Referring to FIG. 8, in the present disclosure, the technical solutions in the foregoing embodiments may be applied to a computer terminal 10 shown in FIG. 8. The computer terminal 10 may include one or more (only one is shown in the figure) processors 102 (a processor 102 may include, but is not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)), a memory 104 for storing data, and a transmission device 106 for communication. Persons of ordinary skill in the art may understand that the structure shown in FIG. 8 is provided by way of illustration, and does not limit the structures of the above-described electronic devices. For example, the computer terminal 10 may include more or fewer components than those shown in FIG. 8, or have a configuration different from that shown in FIG. 8.
  • The memory 104 may be used to store software programs and modules of application software. The processor 102 implements various functional applications and data processing by executing software programs and modules stored in the memory 104. The memory 104 may include a high-speed random access memory, and also a non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some applications, the memory 104 may further include a memory remotely disposed with respect to the processor 102, which may be connected to the computer terminal 10 through a network. Examples of such network include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.
  • The transmission device 106 is configured to receive or transmit data via a network. Specific examples of the network may include a wireless network provided by the communication provider of the computer terminal 10. In one application, the transmission device 106 includes a network interface controller (NIC) that may be connected to other network devices through a base station so that it can communicate with the Internet. In one application, the transmission device 106 may be a radio frequency (RF) module that is configured to communicate with the Internet wirelessly.
  • Through the foregoing description of the embodiments, it is clear to those skilled in the art that the various embodiments may be implemented in software plus a necessary general-purpose hardware platform, or entirely in hardware. In light of this understanding, the technical solutions, or essentially the parts that contribute over the existing technology, may be embodied as a software product. The computer software product may be stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk, an optical disc, etc., and includes a variety of programs that cause a computing device (which may be a personal computer, a server, a network device, etc.) to implement each embodiment or the methods described in certain parts of each embodiment.
  • Although the present disclosure has been described above with reference to preferred embodiments, these embodiments are not to be construed as limiting the present disclosure. Any modifications, equivalent replacements, and improvements made without departing from the spirit and principle of the present disclosure shall fall within the scope of protection of the present disclosure.

Claims (19)

What is claimed is:
1. A data storage method, comprising:
acquiring to-be-transmitted data, and dividing the to-be-transmitted data into a plurality of data blocks;
determining, from the plurality of data blocks, a set of target data blocks that are not stored in a predefined database;
classifying the set of target data blocks as at least one piece of stream data based on stream information, and mounting the at least one piece of stream data to a to-be-confirmed queue;
encoding the at least one piece of stream data, and transmitting the encoded at least one piece of stream data to a decoding device; and
receiving a confirmation message sent by the decoding device for the encoded at least one piece of stream data, and storing a set of data blocks, in the to-be-confirmed queue, that correspond to the confirmation message, in the predefined database.
2. The method according to claim 1, wherein a data block stored in the predefined database is associated with a data block tag, and after dividing the to-be-transmitted data into the plurality of data blocks, the method further includes:
determining, from the plurality of data blocks, an already-stored data block that has already been stored in the predefined database; and
replacing an already-stored data block in the plurality of data blocks with a data block tag associated with the already-stored data block in the plurality of data blocks.
3. The method according to claim 2, when encoding the at least one piece of stream data, the method further includes:
encoding the data block tag that replaces the already-stored data block, and transmitting the encoded data block tag to the decoding device, to allow the decoding device to identify a corresponding data block based on the encoded data block tag.
4. The method according to claim 1, after storing the set of data blocks in the to-be-confirmed queue that correspond to the confirmation message in the predefined database, the method further includes:
extracting a feature index of a piece of stream data where a data block corresponding to the confirmation message is located, writing the feature index into a memory, and backing up and storing the backup on a hard drive.
5. The method according to claim 1, further comprising:
if a piece of stream data has an overtime mounting in the to-be-confirmed queue, discarding the piece of stream data with the overtime mounting.
6. The method according to claim 1, wherein the confirmation message further includes an aggregation instruction, and the aggregation instruction includes tags for a set of target stream data in the to-be-confirmed queue, and the method further includes:
aggregating the set of target stream data into a piece of virtual stream data; and
storing the piece of virtual stream data in the predefined database, extracting a feature index of the piece of virtual stream data into a memory, and persistently storing the feature index on a hard drive.
7. The method according to claim 6, further comprising:
when performing a similarity match for a piece of target stream data, determining a piece of virtual stream data where the piece of target stream data is located, and performing the similarity match for the piece of target stream data based on a feature index of the piece of virtual stream data where the piece of target stream data is located.
8. An encoding device, comprising a memory and a processor, wherein the memory stores computer programs that, when executed by the processor, implement the following steps:
acquiring to-be-transmitted data, and dividing the to-be-transmitted data into a plurality of data blocks;
determining, from the plurality of data blocks, a set of target data blocks that are not stored in a predefined database;
classifying the set of target data blocks as at least one piece of stream data based on stream information, and mounting the at least one piece of stream data to a to-be-confirmed queue;
encoding the at least one piece of stream data, and transmitting the encoded at least one piece of stream data to a decoding device; and
receiving a confirmation message sent by the decoding device for the encoded at least one piece of stream data, and storing a set of data blocks, in the to-be-confirmed queue, that correspond to the confirmation message, in the predefined database.
9. The encoding device according to claim 8, wherein the computer programs, when executed by the processor, further implement the following step:
extracting a feature index of a piece of stream data where a data block corresponding to the confirmation message is located, writing the feature index into the memory, and persistently storing the feature index on a hard drive.
10. The encoding device according to claim 8, wherein the confirmation message further includes an aggregation instruction, and the aggregation instruction includes tags for a set of target stream data in the to-be-confirmed queue, and the computer programs, when executed by the processor, further implement the following steps:
aggregating the set of target stream data into a piece of virtual stream data; and
storing the piece of virtual stream data in the predefined database, extracting a feature index of the piece of virtual stream data into the memory, and persistently storing the feature index on a hard drive.
11. A data storage method, comprising:
receiving encoded data transmitted by an encoding device, and decoding the encoded data into decoded data, wherein the decoded data includes a plurality of data blocks;
classifying the plurality of data blocks as at least one piece of stream data based on stream information, and determining a piece of target stream data whose data volume is greater than or equal to a specified threshold; and
storing the piece of target stream data in a predefined database, and sending a confirmation message to the encoding device, wherein the confirmation message includes a tag for a data block in the piece of target stream data, to allow the encoding device to store the data block in the piece of target stream data.
12. The method according to claim 11, wherein the decoded data further include a tag that is used to signify a data block already stored in the encoding device, and, after decoding the encoded data into the decoded data, the method further includes:
checking, in the predefined database, whether there exists a data block corresponding to the tag included in the decoded data;
if there exists the data block corresponding to the tag included in the decoded data, restoring the tag included in the encoded data to the corresponding data block; and
if there does not exist the data block corresponding to the tag included in the decoded data, transmitting a synchronization request to the encoding device, to synchronize with data blocks stored in the encoding device.
13. The method according to claim 11, when there exists a piece of stream data, in the classified at least one piece of stream data, whose data volume is less than the specified threshold, the method further includes:
mounting the piece of stream data, whose data volume is less than the specified threshold, to a to-be-aggregated queue;
receiving correlation information sent by an application proxy module, wherein the correlation information is used to signify multiple pieces of stream data with a correlation;
aggregating multiple pieces of stream data, corresponding to the correlation information, in the to-be-aggregated queue, to a piece of virtual stream data, and storing the piece of virtual stream data in the predefined database; and
extracting a feature index of the piece of virtual stream data, and writing the feature index into a memory, and persistently storing the feature index in a hard drive.
14. The method according to claim 13, further comprising:
adding an aggregation instruction to the confirmation message, wherein the aggregation instruction includes a tag for each piece of stream data in the piece of virtual stream data, to allow the encoding device to aggregate each piece of stream data in the piece of virtual stream data and store the aggregated virtual stream data.
15. The method according to claim 13, further comprising:
if a piece of stream data, whose data volume is less than the specified threshold, has an overtime mounting in the to-be-aggregated queue, discarding the piece of stream data with the overtime mounting.
16. The method according to claim 11, after classifying the plurality of data blocks as the at least one piece of stream data based on the stream information, the method further includes:
querying, in the predefined database, a piece of similar stream data whose similarity to a piece of classified stream data reaches a specified similarity threshold;
comparing the piece of classified stream data and the piece of similar stream data, to identify additional new data in the piece of classified stream data; and
when a data volume of the additional new data is less than a specified data volume threshold, adding the additional new data to the piece of similar stream data.
17. A decoding device, comprising a memory and a processor, wherein the memory stores computer programs that, when executed by the processor, implement the following steps:
receiving encoded data transmitted by an encoding device, and decoding the encoded data into decoded data, wherein the decoded data includes a plurality of data blocks;
classifying the plurality of data blocks as at least one piece of stream data based on stream information, and determining a piece of target stream data whose data volume is greater than or equal to a specified threshold; and
storing the piece of target stream data in a predefined database, and sending a confirmation message to the encoding device, wherein the confirmation message includes a tag for a data block in the piece of target stream data, to allow the encoding device to store the data block in the piece of target stream data.
18. The decoding device according to claim 17, wherein the computer programs, when executed by the processor, further implement the following steps:
when there exists a piece of stream data, in the classified at least one piece of stream data, whose data volume is less than the specified threshold, mounting the piece of stream data whose data volume is less than the specified threshold to a to-be-aggregated queue;
receiving correlation information, wherein the correlation information is used to signify multiple pieces of stream data with a correlation;
aggregating multiple pieces of stream data, corresponding to the correlation information, in the to-be-aggregated queue, to a piece of virtual stream data, and storing the piece of virtual stream data in the predefined database;
extracting a feature index of the piece of virtual stream data, writing the feature index into the memory, and persistently storing the feature index in a hard drive; and
adding an aggregation instruction to the confirmation message, wherein the aggregation instruction includes a tag for each piece of stream data in the piece of virtual stream data, to allow the encoding device to aggregate each piece of stream data in the piece of virtual stream data and store the aggregated virtual stream data.
19. The decoding device according to claim 17, wherein the computer programs, when executed by the processor, further implement the following steps:
querying, in the predefined database, a piece of similar stream data whose similarity to a piece of classified stream data reaches a specified similarity threshold;
comparing the piece of classified stream data and the piece of similar stream data, to identify additional new data in the piece of classified stream data; and
when a data volume of the additional new data is less than a specified data volume threshold, adding the additional new data to the piece of similar stream data.
US16/099,796 2018-01-19 2018-02-12 Data storage method, encoding device, and decoding device Abandoned US20210227007A1 (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
CN201810054884.7A CN108243256B (en) 2018-01-19 2018-01-19 Data storage method, coding equipment and decoding equipment
CN2018100548847 2018-01-19
PCT/CN2018/076411 WO2019140732A1 (en) 2018-01-19 2018-02-12 Data storage method, encoding device and decoding device

Publications (1)

Publication Number Publication Date
US20210227007A1 (en) 2021-07-22

Family

ID=62699663

Family Applications (1)

Application Number Title Priority Date Filing Date
US16/099,796 Abandoned US20210227007A1 (en) 2018-01-19 2018-02-12 Data storage method, encoding device, and decoding device

Country Status (4)

Country Link
US (1) US20210227007A1 (en)
EP (1) EP3588914A4 (en)
CN (1) CN108243256B (en)
WO (1) WO2019140732A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114721601A (en) * 2022-05-26 2022-07-08 昆仑智汇数据科技(北京)有限公司 Industrial equipment data storage method and device
CN116204136A (en) * 2023-05-04 2023-06-02 山东浪潮科学研究院有限公司 Data storage and query method, device, equipment and storage medium
WO2023202294A1 (en) * 2022-04-18 2023-10-26 华为技术有限公司 Data stream order-preserving method, data exchange device, and network

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110837414B (en) * 2018-08-15 2024-04-12 京东科技控股股份有限公司 Task processing method and device
CN113918090A (en) * 2018-08-21 2022-01-11 成都华为技术有限公司 Data stream processing method, device and system
US11030149B2 (en) * 2018-09-06 2021-06-08 Sap Se File format for accessing data quickly and efficiently
CN111722787B (en) 2019-03-22 2021-12-03 华为技术有限公司 Blocking method and device
CN110737409B (en) * 2019-10-21 2023-09-26 网易(杭州)网络有限公司 Data loading method and device and terminal equipment
US11809430B2 (en) 2019-10-23 2023-11-07 Nec Corporation Efficient stream processing with data aggregations in a sliding window over out-of-order data streams
CN113746763B (en) * 2020-05-29 2022-11-11 华为技术有限公司 Data processing method, device and equipment
CN116963178B (en) * 2023-09-21 2024-01-16 季华实验室 Method for reducing power consumption of NB-IOT equipment and related equipment

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101159757B (en) * 2007-10-25 2011-11-30 中兴通讯股份有限公司 Dual-home synchronous data transmission method
US8862682B2 (en) * 2010-02-17 2014-10-14 Emulex Corporation Accelerated sockets
CN102469142A (en) * 2010-11-16 2012-05-23 英业达股份有限公司 Data transmission method for data deduplication program
US8943023B2 (en) * 2010-12-29 2015-01-27 Amazon Technologies, Inc. Receiver-side data deduplication in data systems
CN103067129B (en) * 2012-12-24 2015-10-28 中国科学院深圳先进技术研究院 network data transmission method and system
CN104753626B (en) * 2013-12-25 2019-05-24 华为技术有限公司 A kind of data compression method, equipment and system
CN103970875B (en) * 2014-05-15 2017-02-15 华中科技大学 Parallel repeated data deleting method and system
CN107291924B (en) * 2017-06-29 2020-08-14 深信服科技股份有限公司 Synchronous log replication control method and system for disaster recovery backup system

Also Published As

Publication number Publication date
EP3588914A1 (en) 2020-01-01
EP3588914A4 (en) 2020-05-27
CN108243256A (en) 2018-07-03
CN108243256B (en) 2020-08-04
EP3588914A8 (en) 2020-02-19
WO2019140732A1 (en) 2019-07-25

Similar Documents

Publication Publication Date Title
US20210227007A1 (en) Data storage method, encoding device, and decoding device
CN107046812B (en) Data storage method and device
US20140172795A1 (en) Data processing method and data processing device
US20150006475A1 (en) Data deduplication in a file system
Yu et al. CoRE: Cooperative end-to-end traffic redundancy elimination for reducing cloud bandwidth cost
US11301425B2 (en) Systems and computer implemented methods for semantic data compression
US11431662B2 (en) Techniques for message deduplication
US11461276B2 (en) Method and device for deduplication
CN114201421B (en) Data stream processing method, storage control node and readable storage medium
US9843802B1 (en) Method and system for dynamic compression module selection
US10608960B2 (en) Techniques for batched bulk processing
CN111245748A (en) File transmission method, device, system, electronic equipment and storage medium
CN105511812A (en) Method and device for optimizing big data of memory system
WO2016095149A1 (en) Data compression and storage method and device, and distributed file system
CN112087490A (en) High-performance mobile terminal application software log collection system
US11315605B2 (en) Method, device, and computer program product for storing and providing video
US20230107760A1 (en) System and method for multiple pass data compaction utilizing delta encoding
US20190207899A1 (en) Techniques for messaging conversation indexing
EP4042307A1 (en) Method, system, electronic device, and storage medium for storing and collecting temperature data
Kim et al. Safe: Structure-aware file and email deduplication for cloud-based storage systems
CN112650755A (en) Data storage method, method for querying data, database and readable medium
US9571698B1 (en) Method and system for dynamic compression module selection
US9843702B1 (en) Method and system for dynamic compression module selection
EP3961414A1 (en) Method and apparatus for processing time records
Nam et al. An inter-data encoding technique that exploits synchronized data for network applications

Legal Events

Date Code Title Description
AS Assignment

Owner name: WANGSU SCIENCE & TECHNOLOGY CO.,LTD., CHINA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:LU, ZHAOXIN;LIN, PENG;CHEN, XUN;REEL/FRAME:047451/0085

Effective date: 20181102

STPP Information on status: patent application and granting procedure in general

Free format text: NON FINAL ACTION MAILED

STPP Information on status: patent application and granting procedure in general

Free format text: FINAL REJECTION MAILED

STPP Information on status: patent application and granting procedure in general

Free format text: ADVISORY ACTION MAILED

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION