WO2019140732A1 - Data storage method, encoding device, and decoding device - Google Patents

Data storage method, encoding device, and decoding device

Info

Publication number
WO2019140732A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
stream data
stream
target
block
Prior art date
Application number
PCT/CN2018/076411
Other languages
English (en)
French (fr)
Inventor
陆兆新
林鹏
陈迅
Original Assignee
网宿科技股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 网宿科技股份有限公司
Priority to US16/099,796 (US20210227007A1)
Priority to EP18901556.3A (EP3588914A4)
Publication of WO2019140732A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638 Organizing or formatting or addressing of data
    • G06F3/064 Management of blocks
    • G06F3/0641 De-duplication techniques
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H04L67/1095 Replication or mirroring of data, e.g. scheduling or transport for data synchronisation between network nodes
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L65/00 Network arrangements, protocols or services for supporting real-time applications in data packet communication
    • H04L65/60 Network streaming of media packets
    • H04L65/61 Network streaming of media packets for supporting one-way streaming services, e.g. Internet radio
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/906 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0608 Saving storage space on storage systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0668 Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/067 Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30181 Instruction operation extension or modification
    • G06F9/30196 Instruction operation extension or modification using decoder, e.g. decoder per instruction set, adaptable or programmable decoders
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L51/00 User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
    • H04L51/04 Real-time or near real-time messaging, e.g. instant messaging [IM]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L51/00 User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
    • H04L51/21 Monitoring or handling of messages
    • H04L51/212 Monitoring or handling of messages using filtering or selective blocking
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L65/00 Network arrangements, protocols or services for supporting real-time applications in data packet communication
    • H04L65/60 Network streaming of media packets
    • H04L65/65 Network streaming protocols, e.g. real-time transport protocol [RTP] or real-time control protocol [RTCP]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H04L67/1097 Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/50 Network services
    • H04L67/56 Provisioning of proxy services
    • H04L67/568 Storing data temporarily at an intermediate stage, e.g. caching
    • H04L67/5682 Policies or rules for updating, deleting or replacing the stored data

Definitions

  • the present invention relates to the field of Internet technologies, and in particular, to a data storage method, an encoding device, and a decoding device.
  • the data deduplication technology may deploy a database for storing data on both the data sending end and the data receiving end.
  • the data in the database may be stored in the form of data fragments, and each data fragment may have a unique pointer.
  • when the data sending end needs to send data to the data receiving end, the data to be sent is first divided into multiple data fragments; any of those fragments that are already stored in the database are then replaced with their pointers.
  • the data to be transmitted can eventually be processed as a combination of pointers and data fragments, thereby reducing the amount of data that needs to be transmitted.
  • the data receiving end queries the local database for the data fragment corresponding to the pointer, thereby restoring the pointer to the data fragment.
  • the premise of restoring data fragments from pointers in this way is that the databases of the data sending end and the data receiving end are kept in sync.
  • however, if data transmission is interrupted or the transmission process is restarted, the databases of the data sending end and the data receiving end can no longer be kept in sync. The data receiving end may then be unable to find, in its local database, the data fragment corresponding to a pointer, so that it cannot recover the complete data. In that case, the data receiving end requests data synchronization from the data sending end, which seriously degrades data transmission performance.
  • the purpose of the application is to provide a data storage method, an encoding device, and a decoding device that can effectively reduce the generation of data fragments, keep memory resource consumption controllable, improve the ability to identify small data streams, and avoid triggering synchronization, thereby improving data transmission efficiency.
  • one aspect of the present application provides a data storage method, the method comprising: acquiring data to be transmitted, and dividing the data into a plurality of data blocks; determining, from the plurality of data blocks, a target data block that is not stored in a preset database; classifying the target data block into at least one piece of stream data according to stream information, and mounting the stream data in a queue to be confirmed; encoding the stream data, and transmitting the encoded stream data to a decoding device; and receiving acknowledgement information sent by the decoding device for the encoded stream data, and storing the data block in the queue to be confirmed that corresponds to the acknowledgement information in the preset database.
  • another aspect of the present application further provides an encoding device including a memory and a processor, wherein the memory stores a computer program, and when the computer program is executed by the processor, the following steps are implemented: acquiring data to be transmitted, and dividing the data into a plurality of data blocks; determining, from the plurality of data blocks, a target data block that is not stored in a preset database; classifying the target data block into at least one piece of stream data according to stream information, and mounting the stream data in a queue to be confirmed; encoding the stream data, and sending the encoded stream data to a decoding device; and receiving acknowledgement information sent by the decoding device for the encoded stream data, and storing the data block in the queue to be confirmed that corresponds to the acknowledgement information in the preset database.
  • another aspect of the present application provides a data storage method, the method comprising: receiving encoded data sent by an encoding device, and decoding the encoded data into decoded data, where the decoded data includes multiple data blocks; classifying the plurality of data blocks into at least one piece of stream data according to stream information, and determining target stream data whose data amount is greater than or equal to a specified threshold; and storing the target stream data in a preset database, and sending acknowledgement information to the encoding device, where the acknowledgement information includes an identifier of a data block in the target stream data, such that the encoding device stores the data block in the target stream data.
  • another aspect of the present application is to provide a decoding apparatus including a memory and a processor in which a computer program is stored, and when the computer program is executed by the processor, the following steps are implemented.
  • when the encoding device needs to send data to the decoding device, it can first perform content-based block processing on the data and determine the target data blocks that are not stored in the preset database. The encoding device does not store a target data block in the preset database directly; instead, it classifies the target data block into stream data according to the stream information and mounts the stream data in the queue to be confirmed. The stream data can then be encoded and sent to the decoding device. After decoding the received data, the decoding device classifies the decoded data blocks into stream data according to the stream information.
  • the decoding device can store the stream data in a local database, and then send an acknowledgment message to the encoding device, where the acknowledgment information can include the identifier of the data block in the stream data stored above.
  • after the encoding device receives the confirmation information, it can identify the data block corresponding to the confirmation information in the queue to be confirmed and store that data block in its own database.
  • in other words, the encoding device, as the data transmitting end, stores a data block only after the decoding device, as the data receiving end, has stored it.
  • even if the encoding device fails to receive the acknowledgment information, the only effect is that the database of the decoding device holds more complete data than the database of the encoding device. For any label that the encoding device uses to represent a stored data block, the decoding device is therefore always able to find the corresponding data block in its local database and restore it. Consequently, with the technical solution provided by the present application, even if the database of the encoding device and the database of the decoding device are not synchronized, the decoding device's reception of data is not affected, which improves the efficiency of data transmission.
  • in addition, the virtual stream data in which the target stream data is located may be determined, and similarity matching may be performed on the target stream data based on the feature index of that virtual stream data. In this way, when similarity matching for a new stream is performed against a small data stream, all the feature indexes in the stream cluster are used, which solves the problem of an insufficient feature index.
  • FIG. 1 is a schematic diagram of a system for applying a data storage method according to an embodiment of the present invention.
  • FIG. 2 is a flowchart of a data storage method in an encoding device according to an embodiment of the present invention.
  • FIG. 3 is a schematic diagram of data partitioning in an embodiment of the present invention.
  • FIG. 4 is a schematic diagram of data storage in an embodiment of the present invention.
  • FIG. 5 is a schematic structural diagram of an encoding device according to an embodiment of the present invention.
  • FIG. 6 is a flowchart of a data storage method in a decoding device according to an embodiment of the present invention.
  • FIG. 7 is a schematic structural diagram of a decoding device according to an embodiment of the present invention.
  • FIG. 8 is a schematic structural diagram of a computer terminal according to an embodiment of the present invention.
  • the application provides a data storage method, which can be applied to the system architecture shown in FIG. 1.
  • the server can act as a data source, and the client can act as a party to request data loading from the server. After receiving the data loading request sent by the client, the server may send the data corresponding to the data loading request to the decoding device on the client side through the encoding device.
  • the technical solution provided by the present application can be applied to the above encoding device and decoding device.
  • the execution subject may be an encoding device.
  • the method may include the following steps.
  • S11 Acquire data to be sent, and divide the data into multiple data blocks.
  • the encoding device may acquire data to be transmitted from the server, and then may perform block processing on the data based on the content.
  • the data may be subjected to block processing using the Rabin fingerprint algorithm.
  • the data may be composed of a plurality of characters, which may be 8-bit binary numbers.
  • a data sliding window can be taken and slid from the beginning to the end of the data with a fixed step size, and the Rabin fingerprint of the data in the sliding window is calculated at each position. If the calculated fingerprint value is the same as a preset fingerprint value, the starting position of the current data sliding window can be used as the segmentation position of the data block. For example, in the data partitioning diagram, the data sliding window slides to the right by one character length each time. If the fingerprint k of the data in window k is the same as the preset fingerprint value, the end position of window k (the dotted line position) can be used as a separation position for dividing the data block. Finally, when the data sliding window has moved to the end of the data, the block processing ends. In this way, the data is divided into a plurality of data blocks.
  • the encoding device may further calculate a hash value of each data block, and the calculated hash value may uniquely represent the corresponding data block.
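  • as a rough illustration of step S11, the following minimal sketch cuts data at content-defined boundaries and hashes each block. It is not the method of the application: it substitutes a simple polynomial rolling hash for a true Rabin fingerprint, cuts at the end position of the matching window, and all constants (`WINDOW`, `BASE`, `MOD`, `MASK`, `TARGET`) as well as the use of SHA-256 are assumptions chosen for the example.

```python
import hashlib

WINDOW = 16          # sliding-window length in bytes (assumed)
BASE = 257           # polynomial base of the rolling hash (assumed)
MOD = (1 << 31) - 1  # modulus of the rolling hash (assumed)
MASK = 0x3F          # only the low 6 bits are compared (assumed)
TARGET = 0           # stands in for the "preset fingerprint value" (assumed)


def split_blocks(data: bytes) -> list:
    """Cut `data` wherever the window fingerprint matches the preset value."""
    if len(data) < WINDOW:
        return [data] if data else []
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window
    fp = 0
    for b in data[:WINDOW]:
        fp = (fp * BASE + b) % MOD
    blocks, start = [], 0
    end = WINDOW  # the window currently covers data[end - WINDOW:end]
    while True:
        if (fp & MASK) == TARGET:
            blocks.append(data[start:end])  # cut at the window's end position
            start = end
        if end == len(data):
            break
        # slide one byte to the right: drop data[end - WINDOW], add data[end]
        fp = ((fp - data[end - WINDOW] * top) * BASE + data[end]) % MOD
        end += 1
    if start < len(data):
        blocks.append(data[start:])  # trailing block after the last cut
    return blocks


def block_hash(block: bytes) -> str:
    """Hash value used to represent a block (SHA-256 is an assumption)."""
    return hashlib.sha256(block).hexdigest()
```

  • because the boundaries depend only on the bytes inside the window, identical content tends to produce identical blocks even when it appears at a different offset, which is what makes the later duplicate lookup effective.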
  • S12 Determine, from the plurality of data blocks, a target data block that is not stored in a preset database.
  • the encoding device may store the data it sends out in the preset database, so that when data is sent later it can determine whether the data to be sent duplicates data already stored in the preset database. Thus, after performing block processing on the data to be sent and obtaining the plurality of data blocks, the encoding device can match the plurality of data blocks against the data blocks in the preset database, and thereby determine, among the plurality of data blocks, the target data blocks that are not stored in the preset database and the stored data blocks that are already stored in the preset database.
  • the data blocks in the preset database may have respective hash values, and each of the plurality of data blocks may also have a respective hash value, so by comparing the two sets of hash values, the target data blocks and the stored data blocks among the plurality of data blocks can be determined.
  • a data block stored in the preset database may be associated with a data block label that uniquely represents the associated data block. Compared with the data amount of the data block itself, the data amount of the label is much smaller.
  • the data block label may be a hash value of the data block, or may be a unique character string allocated by the encoding device to the data block. The present application does not limit the representation of the data block label, as long as it uniquely characterizes the data block.
  • for a stored data block, the associated data block label may be used in place of that data block among the plurality of data blocks, so that the plurality of data blocks is converted into a combination of target data blocks and data block labels, which reduces the amount of data to be transmitted.
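  • the lookup in step S12 can be sketched as follows. This is a minimal illustration under stated assumptions: the label is taken to be the SHA-256 hash of the block, the preset database is modelled as a plain set of labels, and the names `label_of` and `to_wire_items` are hypothetical.

```python
import hashlib


def label_of(block: bytes) -> str:
    """Data block label; using the SHA-256 hash is an assumption."""
    return hashlib.sha256(block).hexdigest()


def to_wire_items(blocks, stored_labels):
    """Replace blocks already in the database with their labels.

    Returns a list of ("label", label) for stored blocks and
    ("block", data) for target blocks that still have to be sent in full.
    """
    items = []
    for block in blocks:
        label = label_of(block)
        if label in stored_labels:
            items.append(("label", label))
        else:
            items.append(("block", block))
    return items
```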
  • S13 classify the target data block into at least one piece of stream data according to the stream information, and mount the stream data to the queue to be confirmed.
  • the encoding device may classify the target data block into at least one piece of stream data according to the stream information.
  • the stream information may be the quintuple (five-tuple) information carried by the target data block, which may include the source IP address from which the data block is sent, the destination IP address that receives the data block, the source port that sends the data block, the destination port that receives the data block, and the transmission protocol used for the data block.
  • the encoding device can classify the data block having the same quintuple information into one stream data.
  • after the target data block is classified into stream data, the encoding device does not store the stream data in the preset database directly, but mounts the stream data in the queue to be confirmed. Only after the encoding device receives the confirmation information sent by the decoding device is the corresponding stream data in the queue to be confirmed stored in the preset database.
  • the total duration for which each piece of stream data has been mounted can be tracked.
  • when that duration exceeds a specified limit, the stream data is regarded as having timed out.
  • timed-out stream data may be discarded directly to save space in the queue to be confirmed.
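  • a minimal sketch of step S13, assuming each target data block arrives paired with its five-tuple. The class names `FiveTuple` and `PendingQueue`, the default timeout of 30 seconds, and the choice to measure the mount duration from the first block of a stream are all assumptions made for illustration.

```python
import time
from typing import NamedTuple


class FiveTuple(NamedTuple):
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: str


class PendingQueue:
    """Queue to be confirmed: stream data grouped by five-tuple."""

    def __init__(self, timeout_s: float = 30.0):
        self.timeout_s = timeout_s  # assumed mount-timeout limit
        self.streams = {}           # FiveTuple -> (mounted_at, [blocks])

    def mount(self, blocks_with_flow, now=None):
        """Classify (flow, block) pairs into streams and mount them."""
        now = time.monotonic() if now is None else now
        for flow, block in blocks_with_flow:
            if flow not in self.streams:
                self.streams[flow] = (now, [])
            self.streams[flow][1].append(block)

    def expire(self, now=None):
        """Discard streams mounted for longer than the timeout."""
        now = time.monotonic() if now is None else now
        expired = [f for f, (t, _) in self.streams.items()
                   if now - t > self.timeout_s]
        for flow in expired:
            del self.streams[flow]
        return expired
```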
  • S14 Encode the stream data, and send the encoded stream data to the decoding device.
  • the encoding device can encode the stream data according to a specified encoding algorithm.
  • the data to be transmitted may include the data block labels that replace the stored data blocks and the stream data into which the unstored target data blocks are classified, so that, when encoding, the data block labels that replace the stored data blocks may be encoded at the same time.
  • the encoding device can send the encoded stream data and the encoded data block labels onto the wide area network link, so as to deliver them to the decoding device on the client side.
  • after receiving the encoded data sent by the encoding device, the decoding device may decode it into decoded data, where the decoded data may include multiple data blocks and labels used to represent data blocks already stored in the encoding device. Since data blocks belonging to different pieces of stream data may be interleaved when the decoding device receives the encoded data, the data blocks in the decoded data may not be arranged by stream data, so the decoding device may classify the plurality of data blocks into at least one piece of stream data according to the stream information.
  • the stream information may be the quintuple information described above, and the decoding device may classify the data blocks having the same quintuple information into the same stream data.
  • the decoding device may determine target stream data whose amount of data is greater than or equal to a specified threshold.
  • the specified threshold may be a preset constant that may be used to determine whether the amount of data of the stream data is too small.
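  • when the data amount of a piece of stream data is greater than or equal to the specified threshold, the stream data is considered to meet the standard for storage.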
  • the decoding device may store the target stream data whose data amount is greater than or equal to the specified threshold in its own preset database.
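  • the classification and threshold check on the decoding side can be sketched as below. The function name `select_target_streams` and the representation of a flow as any hashable key are assumptions; streams under the threshold are simply returned separately here.

```python
def select_target_streams(blocks_with_flow, threshold):
    """Group decoded blocks by flow and split streams by data amount.

    Returns (targets, small): streams whose total size in bytes is
    greater than or equal to `threshold`, and the remaining streams.
    """
    streams = {}
    for flow, block in blocks_with_flow:
        streams.setdefault(flow, []).append(block)
    targets, small = {}, {}
    for flow, blocks in streams.items():
        size = sum(len(b) for b in blocks)
        if size >= threshold:
            targets[flow] = blocks
        else:
            small[flow] = blocks
    return targets, small
```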
  • the preset database of the decoding device may be a disk of the decoding device, in which the target stream data may be stored as a storage unit.
  • the storage unit may be defined as a container, and the storage unit may include the data blocks included in the target stream data and index information of the target stream data.
  • the index information can be used to characterize the starting offset of each data block in the storage unit and the amount of data of each data block.
  • the feature index of the target stream data may also be extracted, written into memory, and periodically persisted to the disk. When similar or identical stream data needs to be matched in the preset database, a preliminary judgment can be made through the feature index.
  • the feature index may be extracted based on an intrinsic feature of the target stream data.
  • the intrinsic features may include, for example, blocking window sizes, out-of-order segments, and the like.
  • a subset of the intrinsic features may be specified in advance, and the feature of each data block of the target stream data is then identified. Finally, the identified features may be sorted according to the order of the data blocks in the target stream data to obtain the feature index of the target stream data. For example, suppose feature 1 to feature 3 are currently specified, and the target stream data includes 5 data blocks corresponding to feature 2, feature 1, feature 3, feature 1, and feature 2, respectively. The sequence (feature 2, feature 1, feature 3, feature 1, feature 2) can then be used as the feature index of the target stream data.
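  • the worked example above can be expressed as a small sketch. It assumes that each data block has already been mapped to a single feature identifier; the function name `feature_index` and the use of integers for features are hypothetical.

```python
def feature_index(block_features, specified):
    """Build a feature index: the pre-specified features, in block order."""
    return tuple(f for f in block_features if f in specified)
```

  • for instance, `feature_index([2, 1, 3, 1, 2], {1, 2, 3})` yields `(2, 1, 3, 1, 2)`, matching the sequence described in the text.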
  • each feature index may include a key and a value corresponding to the key (cn1, cn2, cnn, etc.).
  • the structure of the storage unit may include metadata, an offset (size) and a data size (size) corresponding to a key, and specifically stored data (data rec).
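  • the index information of a storage unit (starting offset and data amount of each data block) can be illustrated with the sketch below. The in-memory layout, with the index kept as a list of `(offset, size)` pairs next to one concatenated payload, is an assumption; the on-disk format in the application is not specified here.

```python
def pack_storage_unit(blocks):
    """Concatenate blocks and record (offset, size) for each of them."""
    index, payload, offset = [], bytearray(), 0
    for block in blocks:
        index.append((offset, len(block)))
        payload += block
        offset += len(block)
    return index, bytes(payload)


def read_block(index, payload, i):
    """Recover block `i` from the payload using the index information."""
    offset, size = index[i]
    return payload[offset:offset + size]
```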
  • for a label in the decoded data, the decoding device may query its own preset database for the data block corresponding to the label. If the data block is found, the label may be restored to the corresponding data block. If not, a part of the data blocks is missing from the database of the decoding device, and the decoding device may send a synchronization request to the encoding device to synchronize the data blocks stored in the encoding device.
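  • a minimal sketch of label restoration on the decoding side. The decoded data is assumed to be a list of `("label", label)` and `("block", data)` items, the local database a dictionary from label to block, and the hypothetical `MissingBlock` exception stands in for the point at which a synchronization request would be sent.

```python
class MissingBlock(Exception):
    """Raised when a label has no block locally; a sync request would follow."""


def restore(items, local_db):
    """Turn a mix of blocks and labels back into a list of data blocks."""
    restored = []
    for kind, value in items:
        if kind == "block":
            restored.append(value)
        elif value in local_db:
            restored.append(local_db[value])
        else:
            raise MissingBlock(value)
    return restored
```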
  • the decoding device may send confirmation information to the encoding device, where the confirmation information includes an identifier of a data block in the target stream data, such that the encoding device can store the data blocks in the target stream data.
  • the identifier of the data block in the target stream data may be a hash value of the calculated data block.
  • S15 Receive acknowledgment information sent by the decoding device for the encoded stream data, and store the data block corresponding to the acknowledgment information in the to-be-confirmed queue into the preset database.
  • the encoding device may determine the data block corresponding to the acknowledgement information in the to-be-confirmed queue, and store that data block in the preset database.
  • the confirmation information may include an identifier of the data block, and the identifier may be, for example, a hash value of the data block. Then, the encoding device can query the corresponding data block by using the identifier of the data block in the queue to be confirmed, and store the queried data block.
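  • step S15 can be sketched as follows, assuming the queue to be confirmed and the preset database are both dictionaries keyed by block identifier (for example a hash value). The function name `apply_ack` is hypothetical.

```python
def apply_ack(pending, database, acked_ids):
    """Move acknowledged blocks from the pending queue into the database."""
    stored = []
    for block_id in acked_ids:
        block = pending.pop(block_id, None)
        if block is not None:  # ids no longer in the queue are ignored
            database[block_id] = block
            stored.append(block_id)
    return stored
```

  • because a block enters the encoder's database only through this path, the encoder never holds a block that the decoder has not already stored.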
  • the database of the encoding device itself may also be a disk as shown in FIG. 4, and the encoding device may also store data blocks on the disk organized by stream data.
  • the stream data in which the data block corresponding to the confirmation information is located may be stored as a storage unit in a disk of the encoding device.
  • the storage unit may be defined as a container, and the storage unit may include the data blocks included in the stream data and index information of the stream data.
  • the index information can be used to characterize the starting offset of each data block in the storage unit and the amount of data of each data block.
  • the encoding device can also extract the feature index of the stream data, write the feature index into memory, and periodically persist it to the disk.
  • the decoding device may mount stream data whose data amount is less than the specified threshold in a queue to be aggregated.
  • the stream data in the queue to be aggregated will only be aggregated by the decoding device when it is confirmed that there is an association between the plurality of stream data.
  • the associated stream data may be determined by an application proxy module or another association analysis module.
  • the application proxy module may be, for example, an application proxy module for HTTP (HyperText Transfer Protocol), MAPI (Messaging Application Programming Interface), or CIFS (Common Internet File System). These application proxy modules can identify multiple different pieces of stream data in the same session, so that the associated stream data can be determined.
  • the decoding device may aggregate the multiple stream data corresponding to the association information in the to-be-aggregated queue into one piece of virtual stream data.
  • the association information may include quintuple information of the plurality of pieces of stream data that are associated with each other.
  • the decoding device may determine the corresponding stream data in the queue to be aggregated according to each quintuple information included in the association information.
  • the decoding device may store the virtual stream data in its own preset database, extract a feature index of the virtual stream data, write the feature index into memory, and periodically persist it to disk.
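  • the aggregation of small streams into one piece of virtual stream data can be sketched as below. The queue to be aggregated is assumed to be a dictionary from flow key to list of blocks, and the association information a list of flow keys; the function name `aggregate` is hypothetical.

```python
def aggregate(waiting, association):
    """Merge the waiting streams named in `association` into one virtual stream.

    The merged streams are removed from the queue to be aggregated.
    """
    virtual = []
    for flow in association:
        virtual.extend(waiting.pop(flow, []))
    return virtual
```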
  • the decoding device may add an aggregation instruction to the confirmation information to be sent to the encoding device. The aggregation instruction may include the quintuple information of the multiple pieces of stream data to be aggregated, or the identifier of each of those pieces of stream data, so that the aggregation instruction carries the identifier of each piece of stream data in the already stored virtual stream data.
  • the target stream data may be found in the queue to be confirmed and then aggregated into one piece of virtual stream data; the virtual stream data is stored, and its feature index is extracted into memory.
  • the virtual stream data in which the target stream data is located may be determined, and similarity matching may be performed on the target stream data based on the feature index of that virtual stream data. In this way, when similarity matching for a new stream is performed against a small data stream, all the feature indexes in the stream cluster are used, which solves the problem of an insufficient feature index.
  • the present application also provides an encoding device.
  • the encoding device includes a memory and a processor, wherein the memory stores a computer program, and when the computer program is executed by the processor, the following steps are implemented:
  • S11 Acquire data to be sent, and divide the data into multiple data blocks;
  • S12 Determine, from the plurality of data blocks, a target data block that is not stored in a preset database;
  • S13 Classify the target data block into at least one piece of stream data according to the stream information, and mount the stream data to the queue to be confirmed;
  • S14 Encode the stream data, and send the encoded stream data to the decoding device;
  • S15 Receive acknowledgment information sent by the decoding device for the encoded stream data, and store the data block corresponding to the acknowledgment information in the to-be-confirmed queue into the preset database.
  • the feature index of the stream data in which the data block corresponding to the confirmation information is located is extracted, and the feature index is written into the memory and periodically stored on the disk.
  • the confirmation information further includes an aggregation instruction, where the aggregation instruction carries an identifier of the target stream data in the queue to be confirmed; correspondingly, when the computer program is executed by the processor, the following steps are further implemented:
  • the target stream data is aggregated into one piece of virtual stream data;
  • the virtual stream data is stored in the preset database, and the feature index of the virtual stream data is extracted into memory and periodically persisted to disk.
  • the application further provides a data storage method, and the execution body of the method may be a decoding device.
  • the method includes the following steps.
  • S21 Receive encoded data sent by the encoding device, and decode the encoded data into decoded data, where the decoded data includes multiple data blocks.
  • S22 classify the plurality of data blocks into at least one piece of stream data according to the stream information, and determine target stream data whose data amount is greater than or equal to a specified threshold.
  • S23 storing the target stream data in a preset database, and sending confirmation information to the encoding device, where the confirmation information includes the identifiers of the data blocks in the target stream data, so that the encoding device stores the data blocks in the target stream data.
  • after receiving the encoded data, the decoding device may decode it into decoded data, which may include multiple data blocks and labels representing data blocks already stored in the encoding device. Because the decoding device may receive data blocks of different stream data interleaved, the data blocks in the decoded data may not be arranged by stream data, so the decoding device may classify the multiple data blocks into at least one piece of stream data according to the stream information.
  • the stream information may be the quintuple information described above, and the encoding device may classify the data blocks having the same quintuple information into the same stream data.
  • the decoding device may determine target stream data whose amount of data is greater than or equal to a specified threshold.
  • the specified threshold may be a preset constant that may be used to determine whether the amount of data of the stream data is too small.
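  • when the amount of data of a piece of stream data is greater than or equal to the specified threshold, the stream data is considered to meet the standard for storage.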
  • the decoding device may store the target stream data whose data amount is greater than or equal to the specified threshold in its own preset database.
  • the preset database of the decoding device may be a disk of the decoding device, in which the target stream data may be stored as a storage unit.
  • the storage unit may be defined as a container, and the storage unit may include the data blocks contained in the target stream data and the index information of the target stream data.
  • the index information can be used to characterize the starting offset of each data block in the storage unit and the amount of data of each data block.
  • the feature index of the target stream data may also be extracted, written into memory, and periodically persisted to disk. If similar or identical stream data later needs to be matched in the preset database, a preliminary judgment can be made through the feature index.
  • the feature index may be extracted based on an intrinsic feature of the target stream data.
  • the intrinsic features may include, for example, blocking window sizes, out-of-order segments, and the like.
  • a part of the features may be specified from the intrinsic features in advance, then the feature corresponding to each data block of the target stream data is identified, and finally the identified features are ordered according to the order of the data blocks in the target stream data to obtain the feature index of the target stream data. For example, suppose feature 1 to feature 3 are specified and the target stream data includes 5 data blocks corresponding to feature 2, feature 1, feature 3, feature 1, and feature 2, respectively; the sequence (feature 2, feature 1, feature 3, feature 1, feature 2) can then be used as the feature index of the target stream data.
  • for the labels of data blocks contained in the decoded data, the decoding device may query its own preset database for the data block corresponding to each label. If the data block exists, the label may be restored to the corresponding data block; if not, part of the data blocks are missing from the database of the decoding device, and the decoding device may send a synchronization request to the encoding device to synchronize the data blocks stored in the encoding device.
  • the decoding device may send the confirmation information to the encoding device, where the confirmation information includes the identifiers of the data blocks in the target stream data, so that the encoding device can store the data blocks in the target stream data.
  • the identifier of the data block in the target stream data may be a hash value of the calculated data block.
  • if the amount of data of a piece of stream data is less than the specified threshold, the decoding device may mount that stream data in the queue to be aggregated.
  • the stream data in the queue to be aggregated will only be aggregated by the decoding device when it is confirmed that there is an association between the plurality of stream data.
  • the associated stream data may be determined by the application proxy module or another association analysis module.
  • the application proxy module may be, for example, an application proxy module for HTTP (HyperText Transfer Protocol), MAPI (Messaging Application Programming Interface), or CIFS (Common Internet File System). These application proxy modules can identify multiple different pieces of stream data in the same session, so that the associated stream data can be determined.
  • the decoding device may aggregate the multiple stream data corresponding to the association information in the to-be-aggregated queue into one piece of virtual stream data.
  • the association information may include quintuple information of the plurality of pieces of stream data that are associated with each other.
  • the decoding device may determine the corresponding stream data in the queue to be aggregated according to each quintuple information included in the association information.
  • the decoding device may store the virtual stream data in its own preset database, extract a feature index of the virtual stream data, write the feature index into memory, and periodically persist it to disk.
  • after aggregating multiple pieces of stream data into one piece of virtual stream data and storing it, the decoding device may add an aggregation instruction to the confirmation information to be sent to the encoding device.
  • the aggregation instruction may include the quintuple information of the pieces of stream data to be aggregated, or the identifiers of the data blocks in those pieces of stream data, so that the aggregation instruction carries the identifier of each piece of stream data in the stored virtual stream data.
  • after extracting the aggregation instruction from the confirmation information, the encoding device may find the target stream data in the queue to be confirmed, aggregate the target stream data into one piece of virtual stream data, and extract the feature index of the virtual stream data into memory.
  • the decoding device may track the total time that stream data has been mounted in the queue to be aggregated. When the total time exceeds a fixed duration threshold, the stream data may be considered to have timed out. To save space in the queue to be aggregated, if stream data whose amount of data is less than the specified threshold times out in the queue to be aggregated, the decoding device may discard the timed-out stream data.
  • the stream data obtained by the categorization may be similar to the stream data stored in the database.
  • for example, the decoding device first receives the first-draft data of a document and then receives the revised-draft data of the same document.
  • the revised draft may differ from the first draft in only a small amount of content, so the first-draft data and the revised-draft data may be regarded as two similar pieces of stream data.
  • in that case, if the stream data obtained by the classification is stored in the database, the database will hold two similar pieces of stream data, which wastes storage space.
  • the decoding device may query its own preset database for similar stream data whose similarity with the classified stream data reaches a specified similarity threshold.
  • the similarity of two pieces of stream data can be determined by comparing their feature indexes. For example, the number of matching features at corresponding positions in the two feature indexes may be determined, and the ratio of that number to the total number of features in the feature index may be used as the similarity of the two pieces of stream data. The higher the ratio, the higher the similarity.
  • the specified similarity threshold may be a preset constant, and may be flexibly adjusted according to actual conditions.
  • the stream data obtained by the classification and the similar stream data may be compared to determine the newly added data in the classified stream data.
  • the similar stream data can be read from the preset database into memory and compared with the classified stream data; the data that differs can be treated as the newly added data.
  • if the amount of the newly added data is large, the newly added data may be stored as a new piece of stream data in the manner described above; when the amount of the newly added data is less than the specified data amount threshold, the newly added data may be added to the similar stream data. In this way, the problem that stream data with a small amount of data cannot be efficiently queried through the feature index can be avoided.
  • the present application further provides a decoding device, which includes a memory and a processor, wherein the memory stores a computer program, and when the computer program is executed by the processor, the following steps are implemented:
  • S21 Receive encoded data sent by the encoding device, and decode the encoded data into decoded data, where the decoded data includes multiple data blocks;
  • S22 classify the plurality of data blocks into at least one piece of stream data according to the stream information, and determine target stream data whose data amount is greater than or equal to a specified threshold;
  • S23 storing the target stream data in a preset database, and sending confirmation information to the encoding device, where the confirmation information includes an identifier of a data block in the target stream data, so that the encoding device stores A block of data in the target stream data.
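  • when the stream data obtained by the classification includes stream data whose amount of data is less than the specified threshold, the stream data whose amount of data is less than the specified threshold is mounted in the queue to be aggregated;
  • an aggregation instruction is added to the confirmation information, the aggregation instruction carrying the identifier of each piece of stream data in the virtual stream data, so that the encoding device aggregates the pieces of stream data and stores the aggregated virtual stream data;
  • when the amount of the newly added data is less than the specified data amount threshold, the newly added data is added to the similar stream data.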
  • when the encoding device needs to send data to the decoding device, it can first divide the data into blocks based on content and determine the target data blocks that are not stored in the preset database. The encoding device does not store the target data blocks in the preset database directly. Instead, it classifies them into stream data according to the stream information and mounts the stream data in the queue to be confirmed. The stream data can then be encoded and sent to the decoding device. After decoding the received data, the decoding device classifies the decoded data blocks into stream data according to the stream information.
  • if the amount of data of a piece of stream data is large, the decoding device can store the stream data in its local database and then send confirmation information to the encoding device, where the confirmation information can include the identifiers of the data blocks in the stored stream data.
  • after receiving the confirmation information, the encoding device can identify the corresponding data blocks in the queue to be confirmed and store them in its own database.
  • the encoding device, as the data sending end, stores a data block only after the decoding device, as the data receiving end, has stored it.
  • even if a transmission failure prevents the encoding device from receiving the confirmation information, the only consequence is that the database of the decoding device is more complete than the database of the encoding device.
  • for any label sent by the encoding device to represent an already stored data block, the decoding device is therefore certain to find the corresponding data block in its local database and restore it. Consequently, even if the databases of the encoding device and the decoding device are not synchronized, the decoding device's reception of data is not affected, which improves the efficiency of data transmission.
  • when similarity matching is needed for target stream data with a small amount of data, the virtual stream data in which the target stream data is located may be determined, and similarity matching may be performed on the target stream data based on the feature index of the virtual stream data. In this way, similarity matching of a new small data stream uses all the feature indexes in the stream cluster, which solves the problem of insufficient feature indexes.
  • Computer terminal 10 may include one or more (only one of which is shown) processors 102 (processor 102 may include, but is not limited to, a processing device such as a microprocessor MCU or a programmable logic device FPGA), a memory 104 for storing data, and a transmission module 106 for communication functions.
  • FIG. 8 is merely illustrative and does not limit the structure of the above electronic device.
  • computer terminal 10 may also include more or fewer components than shown in FIG. 8, or have a different configuration than that shown in FIG. 8.
  • the memory 104 can be used to store software programs and modules of application software, and the processor 102 executes various functional applications and data processing by running software programs and modules stored in the memory 104.
  • Memory 104 may include high speed random access memory, and may also include non-volatile memory such as one or more magnetic storage devices, flash memory, or other non-volatile solid state memory.
  • memory 104 may further include memory remotely located relative to processor 102, which may be coupled to computer terminal 10 via a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
  • Transmission device 106 is for receiving or transmitting data via a network.
  • the network specific examples described above may include a wireless network provided by a communication provider of the computer terminal 10.
  • the transmission device 106 includes a Network Interface Controller (NIC) that can be connected to other network devices through a base station to communicate with the Internet.
  • the transmission device 106 can be a Radio Frequency (RF) module for communicating with the Internet wirelessly.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

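The present invention discloses a data storage method, an encoding device, and a decoding device. The method includes: acquiring data to be sent, and dividing the data into multiple data blocks; determining, from the multiple data blocks, target data blocks that are not stored in a preset database; classifying the target data blocks into at least one piece of stream data according to stream information, and mounting the stream data in a queue to be confirmed; encoding the stream data, and sending the encoded stream data to a decoding device; and receiving, by the encoding device, confirmation information sent by the decoding device for the encoded stream data, and storing the data blocks corresponding to the confirmation information in the queue to be confirmed into the preset database. The technical solution provided by the present application can effectively reduce the generation of data fragments so that memory consumption is controllable, improve the ability to identify small data streams, and avoid synchronization so as to improve the efficiency of data transmission.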

Description

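Data storage method, encoding device, and decoding device
Technical Field
The present invention relates to the field of Internet technologies, and in particular to a data storage method, an encoding device, and a decoding device.
Background
With the continuous development of the Internet, the amount of data on the Internet grows day by day. Much of the data currently transmitted over the Internet is duplicate data. For example, mass emails and broadcast messages in instant messaging software copy the same data many times before transmission, which inevitably wastes valuable bandwidth.
To address the transmission of duplicate data, data deduplication can currently be used to reduce the amount of data that needs to be transmitted in the network. Specifically, with data deduplication, a database for storing data is deployed at both the data sending end and the data receiving end. The data in the database can be stored in the form of data fragments, and each data fragment can have a unique pointer. When the data sending end needs to send data to the data receiving end, it first divides the data to be sent into multiple data fragments. If some of these data fragments are already stored in the database, those fragments are replaced with their pointers. In this way, the data to be sent is ultimately processed into a combination of pointers and data fragments, which reduces the amount of data that needs to be sent. After receiving the combination of pointers and data fragments, the data receiving end queries its local database for the data fragments corresponding to the pointers and thereby restores the pointers to data fragments.
The premise of restoring data fragments from pointers is that the databases of the data sending end and the data receiving end remain synchronized. In actual transmission, however, problems such as an interrupted data flow or a restart of the transmission process can prevent the two databases from staying synchronized. The data receiving end may then be unable to find the data fragment corresponding to a pointer in its local database and therefore cannot receive the complete data. In that case, the data receiving end requests data synchronization with the data sending end, which seriously degrades transmission performance.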
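Summary of the Invention
The purpose of the present application is to provide a data storage method, an encoding device, and a decoding device that can effectively reduce the generation of data fragments so that memory consumption is controllable, improve the ability to identify small data streams, and avoid synchronization so as to improve the efficiency of data transmission.
To achieve the above purpose, one aspect of the present application provides a data storage method, the method including: acquiring data to be sent, and dividing the data into multiple data blocks; determining, from the multiple data blocks, target data blocks that are not stored in a preset database; classifying the target data blocks into at least one piece of stream data according to stream information, and mounting the stream data in a queue to be confirmed; encoding the stream data, and sending the encoded stream data to a decoding device; and receiving confirmation information sent by the decoding device for the encoded stream data, and storing the data blocks corresponding to the confirmation information in the queue to be confirmed into the preset database.
To achieve the above purpose, another aspect of the present application provides an encoding device. The encoding device includes a memory and a processor, the memory stores a computer program, and when the computer program is executed by the processor, the following steps are implemented: acquiring data to be sent, and dividing the data into multiple data blocks; determining, from the multiple data blocks, target data blocks that are not stored in a preset database; classifying the target data blocks into at least one piece of stream data according to stream information, and mounting the stream data in a queue to be confirmed; encoding the stream data, and sending the encoded stream data to a decoding device; and receiving confirmation information sent by the decoding device for the encoded stream data, and storing the data blocks corresponding to the confirmation information in the queue to be confirmed into the preset database.
To achieve the above purpose, another aspect of the present application provides a data storage method, the method including: receiving encoded data sent by an encoding device, and decoding the encoded data into decoded data, the decoded data including multiple data blocks; classifying the multiple data blocks into at least one piece of stream data according to stream information, and determining target stream data whose amount of data is greater than or equal to a specified threshold; and storing the target stream data in a preset database, and sending confirmation information to the encoding device, the confirmation information containing the identifiers of the data blocks in the target stream data, so that the encoding device stores the data blocks in the target stream data.
To achieve the above purpose, another aspect of the present application provides a decoding device. The decoding device includes a memory and a processor, the memory stores a computer program, and when the computer program is executed by the processor, the following steps are implemented: receiving encoded data sent by an encoding device, and decoding the encoded data into decoded data, the decoded data including multiple data blocks; classifying the multiple data blocks into at least one piece of stream data according to stream information, and determining target stream data whose amount of data is greater than or equal to a specified threshold; and storing the target stream data in a preset database, and sending confirmation information to the encoding device, the confirmation information containing the identifiers of the data blocks in the target stream data, so that the encoding device stores the data blocks in the target stream data.
As can be seen from the above, in the technical solution provided by the present application, when the encoding device needs to send data to the decoding device, it can first divide the data into blocks based on content and determine the target data blocks that are not stored in the preset database. The encoding device does not store the target data blocks in the preset database directly. Instead, it classifies them into stream data according to the stream information and mounts the stream data in the queue to be confirmed. The stream data can then be encoded and sent to the decoding device. After decoding the received data, the decoding device classifies the decoded data blocks into stream data according to the stream information. If the amount of data of a piece of stream data is large, the decoding device can store the stream data in its local database and then send confirmation information to the encoding device, where the confirmation information can contain the identifiers of the data blocks in the stored stream data. After receiving the confirmation information, the encoding device can identify the corresponding data blocks in the queue to be confirmed and store them in its own database. The encoding device, as the data sending end, therefore stores a data block only after the decoding device, as the data receiving end, has stored it. Even if a transmission failure prevents the encoding device from receiving the confirmation information, the only consequence is that the database of the decoding device is more complete than that of the encoding device. For any label sent by the encoding device to represent an already stored data block, the decoding device is therefore certain to find the corresponding data block in its local database and restore it. Consequently, even if the databases of the encoding device and the decoding device are not synchronized, the decoding device's reception of data is not affected, which improves the efficiency of data transmission. In addition, when similarity matching is needed for target stream data with a small amount of data, the virtual stream data in which the target stream data is located can be determined to avoid the problem of insufficient feature indexes for the target stream data, and similarity matching can be performed on the target stream data based on the feature index of the virtual stream data. In this way, similarity matching of a new small data stream uses all the feature indexes in the stream cluster, which solves the problem of insufficient feature indexes.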
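Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the drawings needed for the description of the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present invention, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic diagram of a system to which the data storage method is applied in an embodiment of the present invention;
FIG. 2 is a flowchart of the data storage method in the encoding device according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of data division in an embodiment of the present invention;
FIG. 4 is a schematic diagram of data storage in an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of the encoding device in an embodiment of the present invention;
FIG. 6 is a flowchart of the data storage method in the decoding device according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of the decoding device in an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of a computer terminal in an embodiment of the present invention.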
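Detailed Description
To make the purpose, technical solutions, and advantages of the present invention clearer, the embodiments of the present invention are described in further detail below with reference to the drawings.
Embodiment 1
The present application provides a data storage method that can be applied to the system architecture shown in FIG. 1. In FIG. 1, the server serves as the data source and the client is the party that requests data from the server. After receiving a data loading request from the client, the server can send the data corresponding to the request to the decoding device on the client side through the encoding device. The technical solution provided by the present application can be applied to the encoding device and the decoding device described above.
The data storage method provided in this embodiment can be executed by the encoding device. Referring to FIG. 2, the method can include the following steps.
S11: Acquire data to be sent, and divide the data into multiple data blocks.
In this embodiment, the encoding device can acquire the data to be sent from the server and then divide the data into blocks based on content. In practice, the Rabin fingerprint algorithm can be used for the division. Specifically, referring to FIG. 3, the data can consist of multiple characters, each of which can be an 8-bit binary number. To divide the data, a sliding data window can be moved from the head of the data toward the tail in fixed steps, and the Rabin fingerprint of the data inside the sliding window is computed at each position. If the computed fingerprint equals a preset fingerprint value, the start position of the current sliding window can be used as a split position of the data blocks. For example, in FIG. 3 the sliding window moves right by one character at a time. The fingerprint k of the data in window k equals the preset fingerprint value, so the end position of window k (the dashed line in the figure) can be used as the split position between data blocks. The division ends when the sliding window reaches the end of the data. In this way, the data is divided into multiple data blocks.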
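The sliding-window division described above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: a simple polynomial rolling hash stands in for a true Rabin fingerprint, the cut is made at the end position of the matching window (as in the FIG. 3 example), and the names `WINDOW`, `BASE`, `MOD`, `MASK`, `TARGET`, `chunk`, and `block_id` as well as the use of SHA-256 for the block hash are hypothetical choices that do not come from the application.

```python
import hashlib

WINDOW = 16          # sliding-window width in bytes (assumed value)
BASE = 257           # polynomial base of the stand-in rolling hash (assumed)
MOD = (1 << 31) - 1  # modulus of the stand-in rolling hash (assumed)
MASK = 0x3F          # only the low 6 bits of the fingerprint are compared (assumed)
TARGET = 0           # the "preset fingerprint value" (assumed)


def chunk(data: bytes) -> list[bytes]:
    """Split data wherever the window fingerprint matches the preset value."""
    if len(data) <= WINDOW:
        return [data] if data else []
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte that leaves the window
    fp = 0
    for b in data[:WINDOW]:
        fp = (fp * BASE + b) % MOD
    blocks, start = [], 0
    for end in range(WINDOW, len(data) + 1):
        # The window currently covers data[end - WINDOW:end].
        if (fp & MASK) == TARGET and end - start >= WINDOW:
            blocks.append(data[start:end])  # cut at the end of the window
            start = end
        if end < len(data):
            # Slide one byte to the right: drop the oldest byte, add the newest.
            fp = ((fp - data[end - WINDOW] * top) * BASE + data[end]) % MOD
    if start < len(data):
        blocks.append(data[start:])  # trailing data forms the last block
    return blocks


def block_id(block: bytes) -> str:
    """Hash value used to represent a data block."""
    return hashlib.sha256(block).hexdigest()
```

Because the split positions depend only on the content inside the window, inserting bytes near the head of the data shifts later boundaries along with the content instead of changing every block, which is what makes content-based division useful for detecting duplicate blocks.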
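In this embodiment, after dividing the data into multiple data blocks, the encoding device can also compute a hash value for each data block. The computed hash value uniquely represents the corresponding data block.
S12: Determine, from the multiple data blocks, target data blocks that are not stored in the preset database.
In this embodiment, the encoding device can store the data it sends out in the preset database so that, when sending data later, it can determine whether the data to be sent duplicates data already stored in the preset database. After dividing the data to be sent into multiple data blocks, the encoding device can match these data blocks against the data blocks in the preset database to determine which are target data blocks not stored in the preset database and which are stored data blocks already in the preset database. Specifically, the data blocks in the preset database and the data blocks just obtained each have their own hash values, so comparing the hash values identifies the target data blocks and the stored data blocks.
In this embodiment, each data block stored in the preset database can be associated with a data block label. The label uniquely represents the associated data block and is much smaller than the data block itself. In practice, the label can be the hash value of the data block or a unique character string assigned to the data block by the encoding device. The present application does not limit the form of the label, as long as it uniquely represents the data block.
In this embodiment, after the target data blocks and the stored data blocks have been determined from the multiple data blocks, each stored data block can be replaced with its associated data block label. The multiple data blocks are thereby converted into a combination of target data blocks and data block labels, which reduces the amount of data that needs to be sent.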
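The replacement of stored data blocks by labels can be sketched as follows. The sketch assumes that the label is the SHA-256 hash of the block and that the preset database is a plain dictionary keyed by label; the function name `encode_blocks` and the tuple layout of the payload are hypothetical.

```python
import hashlib


def encode_blocks(blocks, database):
    """Replace blocks already in the database by their labels.

    Returns (payload, targets): payload mixes ("tag", label) references with
    ("data", label, block) entries, and targets lists the blocks that are not
    yet stored. The database itself is deliberately left unchanged here.
    """
    payload, targets = [], []
    for block in blocks:
        label = hashlib.sha256(block).hexdigest()
        if label in database:
            payload.append(("tag", label))          # stored block: send label only
        else:
            payload.append(("data", label, block))  # target block: send the data
            targets.append(block)
    return payload, targets
```

Note that the function does not write the target blocks into the database; in the scheme described here that happens only after the decoding device has confirmed them.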
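S13: Classify the target data blocks into at least one piece of stream data according to stream information, and mount the stream data in the queue to be confirmed.
In this embodiment, the encoding device can classify the target data blocks that are not stored in the preset database into at least one piece of stream data according to stream information. The stream information can be the quintuple (five-tuple) information contained in a target data block: the source IP address that sends the data block, the destination IP address that receives it, the source port, the destination port, and the transport protocol used. The encoding device can thus classify data blocks with the same quintuple information into one piece of stream data.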
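Classification by quintuple information amounts to grouping blocks by a five-field key. A minimal sketch, assuming each block arrives as a `(five_tuple, payload)` pair; the `FiveTuple` type and the `classify` function are hypothetical names used only for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class FiveTuple(NamedTuple):
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: str


def classify(blocks):
    """Group (five_tuple, payload) pairs into streams, keeping arrival order."""
    streams = defaultdict(list)
    for five_tuple, payload in blocks:
        streams[five_tuple].append(payload)
    return dict(streams)
```

The same grouping applies on the decoding side, where blocks of different streams may arrive interleaved.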
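In this embodiment, after classifying the target data blocks into stream data, the encoding device does not store the stream data in the preset database directly but mounts it in the queue to be confirmed. Only after the encoding device receives the confirmation information sent by the decoding device does it store the corresponding stream data from the queue to be confirmed into the preset database.
In one embodiment, the total time that each piece of stream data has been mounted can be tracked. When the total mounted time reaches a preset duration threshold, the stream data is considered to have timed out. To keep too much stream data from accumulating in the queue to be confirmed, stream data that times out in the queue can be discarded directly to save space in the queue.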
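The timeout rule for the queue to be confirmed can be sketched as below, assuming the queue is a dictionary that maps a stream key to a `(mounted_at, blocks)` pair and that the current time is passed in explicitly; `drop_expired` is a hypothetical name.

```python
def drop_expired(queue, now, timeout):
    """Discard streams mounted for at least `timeout` seconds; return their keys."""
    expired = [key for key, (mounted_at, _) in queue.items()
               if now - mounted_at >= timeout]
    for key in expired:
        del queue[key]
    return expired
```

Passing `now` as an argument rather than reading the clock inside the function keeps the rule easy to test; a caller could supply `time.monotonic()`.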
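S14: Encode the stream data, and send the encoded stream data to the decoding device.
In this embodiment, after classifying the target data blocks into stream data, the encoding device can encode the stream data with a specified encoding algorithm. In practice, the data to be sent can include both the data block labels that replace stored data blocks and the stream data obtained from the unstored target data blocks, so the data block labels can be encoded at the same time. After encoding both, the encoding device can send the encoded stream data and the encoded data block labels over the wide area network link to the decoding device on the client side.
In this embodiment, after receiving the encoded data sent by the encoding device, the decoding device can decode it into decoded data, which can include multiple data blocks and labels representing data blocks already stored in the encoding device. Because the decoding device may receive data blocks of different stream data interleaved, the data blocks in the decoded data may not be arranged by stream data, so the decoding device can classify the multiple data blocks into at least one piece of stream data according to the stream information. The stream information can be the quintuple information described above, and data blocks with the same quintuple information can be classified into the same piece of stream data.
In this embodiment, if stream data with a small amount of data were stored in the preset database of the decoding device, the database could accumulate many data fragments, and the decoding device would later need more time for data indexing. To avoid this, the decoding device can determine the target stream data whose amount of data is greater than or equal to a specified threshold. The specified threshold can be a preset constant used to decide whether the amount of data of a piece of stream data is too small. When the amount of data of a piece of stream data is greater than or equal to the specified threshold, the stream data is considered to meet the standard for storage, and the decoding device can store it as target stream data in its own preset database. Referring to FIG. 4, the preset database of the decoding device can be the disk of the decoding device, on which the target stream data can be stored as one storage unit. As shown in FIG. 4, the storage unit can be defined as a container that includes the data blocks contained in the target stream data and the index information of the target stream data. The index information can represent the starting offset of each data block in the storage unit and the size of each data block. After the target stream data has been stored on disk as a storage unit, the feature index of the target stream data can also be extracted, written into memory, and periodically persisted to disk. If similar or identical stream data later needs to be matched in the preset database, a preliminary judgment can be made through the feature index. Specifically, the feature index can be extracted based on intrinsic features of the target stream data, which can include, for example, the blocking window size and out-of-order segments. To extract the feature index, a part of the intrinsic features can be specified in advance, then the feature corresponding to each data block of the target stream data is identified, and finally the identified features are ordered according to the order of the data blocks in the target stream data to obtain the feature index of the target stream data. For example, suppose feature 1 to feature 3 are specified and the target stream data includes 5 data blocks corresponding to feature 2, feature 1, feature 3, feature 1, and feature 2, respectively; the sequence (feature 2, feature 1, feature 3, feature 1, feature 2) can then be used as the feature index of the target stream data. In FIG. 4, each feature index can include a key and the values corresponding to the key (cn1, cn2, cnn, and so on). The structure of a storage unit can include metadata, the offset and size corresponding to each key, and the stored data itself (data rec).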
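The storage unit with per-block offset and size can be sketched as follows. The dictionary layout, the field names `metadata`, `index`, and `data`, and the functions `build_container` and `read_block` are assumptions made for illustration; the application does not specify an on-disk format beyond what FIG. 4 describes.

```python
def build_container(blocks):
    """Pack (key, block) pairs back to back; index maps key -> (offset, size)."""
    index, data = {}, bytearray()
    for key, block in blocks:
        if key in index:
            continue  # a block already in this container is not stored twice
        index[key] = (len(data), len(block))
        data += block
    return {"metadata": {"count": len(index)}, "index": index, "data": bytes(data)}


def read_block(container, key):
    """Return one block using its starting offset and size from the index."""
    offset, size = container["index"][key]
    return container["data"][offset:offset + size]
```

Storing a whole stream in one unit means a later lookup of neighbouring blocks touches one contiguous region instead of many scattered fragments.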
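In this embodiment, for the labels of data blocks contained in the decoded data, the decoding device can query its own preset database for the data block corresponding to each label. If the data block exists, the label can be restored to the corresponding data block. If it does not exist, part of the data blocks are missing from the database of the decoding device, and the decoding device can send a synchronization request to the encoding device to synchronize the data blocks stored in the encoding device.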
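Label restoration on the decoding side can be sketched as below. The payload layout of `("tag", label)` and `("data", label, block)` entries and the function name `restore` are assumptions; the sketch only collects the labels that would require a synchronization request and does not model the request itself.

```python
def restore(payload, database):
    """Rebuild the block sequence; return (blocks, missing_labels)."""
    blocks, missing = [], []
    for entry in payload:
        if entry[0] == "tag":
            block = database.get(entry[1])
            if block is None:
                missing.append(entry[1])  # would trigger a synchronization request
            else:
                blocks.append(block)      # label restored to its data block
        else:
            blocks.append(entry[2])       # literal data block
    return blocks, missing
```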
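In this embodiment, after storing the target stream data in its own database and extracting the feature index into memory, the decoding device can send confirmation information to the encoding device. The confirmation information contains the identifiers of the data blocks in the target stream data, so that the encoding device can store the data blocks in the target stream data. The identifier of a data block in the target stream data can be the computed hash value of the data block.
S15: Receive the confirmation information sent by the decoding device for the encoded stream data, and store the data blocks corresponding to the confirmation information in the queue to be confirmed into the preset database.
In this embodiment, after receiving the confirmation information sent by the decoding device, the encoding device can determine the data blocks corresponding to the confirmation information in the queue to be confirmed and store them in the preset database. Specifically, the confirmation information can include the identifiers of data blocks, for example their hash values. The encoding device can then look up the corresponding data blocks in the queue to be confirmed by their identifiers and store the data blocks it finds.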
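Handling of the confirmation information on the encoding side can be sketched as follows, assuming the queue to be confirmed maps a stream key to a dictionary of `block_id -> block` and the preset database is a dictionary keyed by block identifier; `handle_ack` is a hypothetical name.

```python
def handle_ack(ack_ids, pending, database):
    """Move acknowledged blocks from the pending queue into the database."""
    stored = []
    for stream_key in list(pending):
        blocks = pending[stream_key]
        for block_id in [b for b in blocks if b in ack_ids]:
            database[block_id] = blocks.pop(block_id)
            stored.append(block_id)
        if not blocks:
            del pending[stream_key]  # every block of this stream is confirmed
    return stored
```

Because a block reaches the encoder's database only through this path, the encoder never holds a label that the decoder has not already stored, which is the property the scheme relies on when the two databases fall out of step.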
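In practice, the database of the encoding device can also be a disk as shown in FIG. 4, and the encoding device can likewise store data blocks on disk by stream data. The stream data containing the data blocks that correspond to the confirmation information can be stored on the disk of the encoding device as one storage unit. The storage unit can be defined as a container that includes the data blocks contained in the stream data and the index information of the stream data. The index information can represent the starting offset of each data block in the storage unit and the size of each data block. After the stream data has been stored on disk as a storage unit, the encoding device can likewise extract the feature index of the stream data, write it into memory, and periodically persist it to disk.
In one embodiment of the present application, after the decoding device obtains the decoded data and classifies the data blocks it contains into stream data, it can mount any stream data whose amount of data is less than the specified threshold in the queue to be aggregated. The decoding device aggregates stream data in the queue to be aggregated only after confirming that multiple pieces of stream data are associated with one another. Specifically, in this embodiment the associated stream data can be determined by an application proxy module or another association analysis module. The application proxy module can be, for example, an application proxy module for HTTP (HyperText Transfer Protocol), MAPI (Messaging Application Programming Interface), or CIFS (Common Internet File System). These application proxy modules can identify multiple different pieces of stream data in the same session and thereby determine the associated stream data. After receiving association information from the application proxy module, the decoding device can aggregate the pieces of stream data in the queue to be aggregated that correspond to the association information into one piece of virtual stream data. The association information can include the quintuple information of the associated pieces of stream data, and the decoding device can use each quintuple in the association information to locate the corresponding stream data in the queue to be aggregated. Although these pieces of stream data are associated, their quintuple information differs, so they can only be aggregated into one piece of virtual stream data in the form of stream data, and the virtual stream data still includes multiple sets of quintuple information. After obtaining the virtual stream data by aggregation, the decoding device can store it in its own preset database, extract the feature index of the virtual stream data, write the feature index into memory, and periodically persist it to disk.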
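The aggregation of associated small streams into one piece of virtual stream data can be sketched as below. The queue to be aggregated is assumed to be a dictionary from a stream key (standing in for the quintuple information) to its list of blocks, and the association information is assumed to be a list of such keys; `aggregate` and the field names of the result are hypothetical.

```python
def aggregate(queue, association):
    """Merge the queued streams named in `association` into one virtual stream."""
    members = [key for key in association if key in queue]
    if len(members) < 2:
        return None  # fewer than two associated streams are queued
    virtual = {"five_tuples": members, "blocks": []}
    for key in members:
        # The virtual stream keeps every member's key alongside the merged blocks.
        virtual["blocks"].extend(queue.pop(key))
    return virtual
```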
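In this embodiment, after the decoding device has aggregated multiple pieces of stream data into one piece of virtual stream data and stored it, the decoding device can add an aggregation instruction to the confirmation information to be sent to the encoding device. The aggregation instruction can include the quintuple information of the pieces of stream data to be aggregated, or the identifiers of the data blocks in those pieces of stream data, so that the aggregation instruction carries the identifier of each piece of stream data in the stored virtual stream data. After extracting the aggregation instruction from the confirmation information, the encoding device can find the target stream data in the queue to be confirmed, aggregate the target stream data into one piece of virtual stream data, and extract the feature index of the virtual stream data into memory.
In this embodiment, when similarity matching is needed for target stream data with a small amount of data, the virtual stream data in which the target stream data is located can be determined to avoid the problem of insufficient feature indexes for the target stream data, and similarity matching can be performed on the target stream data based on the feature index of the virtual stream data. In this way, similarity matching of a new small data stream uses all the feature indexes in the stream cluster, which solves the problem of insufficient feature indexes.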
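Embodiment 2
The present application further provides an encoding device. Referring to FIG. 5, the encoding device includes a memory and a processor, the memory stores a computer program, and when the computer program is executed by the processor, the following steps are implemented:
S11: Acquire data to be sent, and divide the data into multiple data blocks;
S12: Determine, from the multiple data blocks, target data blocks that are not stored in the preset database;
S13: Classify the target data blocks into at least one piece of stream data according to stream information, and mount the stream data in the queue to be confirmed;
S14: Encode the stream data, and send the encoded stream data to the decoding device;
S15: Receive the confirmation information sent by the decoding device for the encoded stream data, and store the data blocks corresponding to the confirmation information in the queue to be confirmed into the preset database.
In this embodiment, when the computer program is executed by the processor, the following step is further implemented:
extracting the feature index of the stream data that contains the data blocks corresponding to the confirmation information, writing the feature index into memory, and periodically persisting it to disk.
In this embodiment, the confirmation information further includes an aggregation instruction, and the aggregation instruction carries the identifier of the target stream data in the queue to be confirmed; correspondingly, when the computer program is executed by the processor, the following steps are further implemented:
aggregating the target stream data into one piece of virtual stream data;
storing the virtual stream data in the preset database, extracting the feature index of the virtual stream data into memory, and periodically persisting it to disk.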
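Embodiment 3
The present application further provides a data storage method that can be executed by the decoding device. Referring to FIG. 6, the method includes the following steps.
S21: Receive encoded data sent by the encoding device, and decode the encoded data into decoded data, the decoded data including multiple data blocks.
S22: Classify the multiple data blocks into at least one piece of stream data according to stream information, and determine target stream data whose amount of data is greater than or equal to a specified threshold.
S23: Store the target stream data in a preset database, and send confirmation information to the encoding device, the confirmation information containing the identifiers of the data blocks in the target stream data, so that the encoding device stores the data blocks in the target stream data.
In this embodiment, after receiving the encoded data sent by the encoding device, the decoding device can decode it into decoded data, which can include multiple data blocks and labels representing data blocks already stored in the encoding device. Because the decoding device may receive data blocks of different stream data interleaved, the data blocks in the decoded data may not be arranged by stream data, so the decoding device can classify the multiple data blocks into at least one piece of stream data according to the stream information. The stream information can be the quintuple information described above, and data blocks with the same quintuple information can be classified into the same piece of stream data.
In this embodiment, if stream data with a small amount of data were stored in the preset database of the decoding device, the database could accumulate many data fragments, and the decoding device would later need more time for data indexing. To avoid this, the decoding device can determine the target stream data whose amount of data is greater than or equal to a specified threshold. The specified threshold can be a preset constant used to decide whether the amount of data of a piece of stream data is too small. When the amount of data of a piece of stream data is greater than or equal to the specified threshold, the stream data is considered to meet the standard for storage, and the decoding device can store it as target stream data in its own preset database. Referring to FIG. 4, the preset database of the decoding device can be the disk of the decoding device, on which the target stream data can be stored as one storage unit. As shown in FIG. 4, the storage unit can be defined as a container that includes the data blocks contained in the target stream data and the index information of the target stream data. The index information can represent the starting offset of each data block in the storage unit and the size of each data block. After the target stream data has been stored on disk as a storage unit, the feature index of the target stream data can also be extracted, written into memory, and periodically persisted to disk. If similar or identical stream data later needs to be matched in the preset database, a preliminary judgment can be made through the feature index. Specifically, the feature index can be extracted based on intrinsic features of the target stream data, which can include, for example, the blocking window size and out-of-order segments. To extract the feature index, a part of the intrinsic features can be specified in advance, then the feature corresponding to each data block of the target stream data is identified, and finally the identified features are ordered according to the order of the data blocks in the target stream data to obtain the feature index of the target stream data. For example, suppose feature 1 to feature 3 are specified and the target stream data includes 5 data blocks corresponding to feature 2, feature 1, feature 3, feature 1, and feature 2, respectively; the sequence (feature 2, feature 1, feature 3, feature 1, feature 2) can then be used as the feature index of the target stream data.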
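The threshold decision in step S22 can be sketched as follows, assuming each stream is a list of byte blocks and the amount of data is the sum of the block lengths; `partition` and the threshold value used in the usage example are hypothetical.

```python
def partition(streams, threshold):
    """Split streams into (targets, small) by total size against the threshold."""
    targets, small = {}, {}
    for key, blocks in streams.items():
        size = sum(len(block) for block in blocks)
        if size >= threshold:
            targets[key] = blocks  # large enough to be stored as target stream data
        else:
            small[key] = blocks    # candidate for the queue to be aggregated
    return targets, small
```

For instance, with a threshold of 1024 bytes, a stream totalling 1100 bytes lands in `targets` and a 10-byte stream lands in `small`.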
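In this embodiment, for the labels of data blocks contained in the decoded data, the decoding device can query its own preset database for the data block corresponding to each label. If the data block exists, the label can be restored to the corresponding data block. If it does not exist, part of the data blocks are missing from the database of the decoding device, and the decoding device can send a synchronization request to the encoding device to synchronize the data blocks stored in the encoding device.
In this embodiment, after storing the target stream data in its own database and extracting the feature index into memory, the decoding device can send confirmation information to the encoding device. The confirmation information contains the identifiers of the data blocks in the target stream data, so that the encoding device can store the data blocks in the target stream data. The identifier of a data block in the target stream data can be the computed hash value of the data block.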
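In one embodiment of the present application, after the decoding device obtains the decoded data and classifies the data blocks it contains into stream data, it can mount any stream data whose amount of data is less than the specified threshold in the queue to be aggregated. The decoding device aggregates stream data in the queue to be aggregated only after confirming that multiple pieces of stream data are associated with one another. Specifically, in this embodiment the associated stream data can be determined by an application proxy module or another association analysis module. The application proxy module can be, for example, an application proxy module for HTTP (HyperText Transfer Protocol), MAPI (Messaging Application Programming Interface), or CIFS (Common Internet File System). These application proxy modules can identify multiple different pieces of stream data in the same session and thereby determine the associated stream data. After receiving association information from the application proxy module, the decoding device can aggregate the pieces of stream data in the queue to be aggregated that correspond to the association information into one piece of virtual stream data. The association information can include the quintuple information of the associated pieces of stream data, and the decoding device can use each quintuple in the association information to locate the corresponding stream data in the queue to be aggregated. Although these pieces of stream data are associated, their quintuple information differs, so they can only be aggregated into one piece of virtual stream data in the form of stream data, and the virtual stream data still includes multiple sets of quintuple information. After obtaining the virtual stream data by aggregation, the decoding device can store it in its own preset database, extract the feature index of the virtual stream data, write the feature index into memory, and periodically persist it to disk.
In this embodiment, after the decoding device has aggregated multiple pieces of stream data into one piece of virtual stream data and stored it, the decoding device can add an aggregation instruction to the confirmation information to be sent to the encoding device. The aggregation instruction can include the quintuple information of the pieces of stream data to be aggregated, or the identifiers of the data blocks in those pieces of stream data, so that the aggregation instruction carries the identifier of each piece of stream data in the stored virtual stream data. After extracting the aggregation instruction from the confirmation information, the encoding device can find the target stream data in the queue to be confirmed, aggregate the target stream data into one piece of virtual stream data, and extract the feature index of the virtual stream data into memory.
In one embodiment, the decoding device can track the total time that stream data has been mounted in the queue to be aggregated. When the total time exceeds a fixed duration threshold, the stream data can be considered to have timed out. To save space in the queue to be aggregated, if stream data whose amount of data is less than the specified threshold times out in the queue to be aggregated, the decoding device can discard the timed-out stream data.
In one embodiment of the present application, after the decoding device classifies the data blocks in the decoded data into stream data, the classified stream data may be similar to stream data already stored in the database. For example, the decoding device first receives the first-draft data of a document and then receives the revised-draft data of the same document. The revised draft may differ from the first draft in only a small amount of content, so the first-draft data and the revised-draft data can be regarded as two similar pieces of stream data. In that case, if the classified stream data were stored in the database, the database would hold two similar pieces of stream data, which wastes storage space. The decoding device can therefore query its own preset database for similar stream data whose similarity with the classified stream data reaches a specified similarity threshold. Specifically, the similarity of two pieces of stream data can be determined by comparing their feature indexes. For example, the number of matching features at corresponding positions in the two feature indexes can be determined, and the ratio of that number to the total number of features in the feature index can be used as the similarity of the two pieces of stream data. The higher the ratio, the higher the similarity. The specified similarity threshold can be a preset constant that is adjusted flexibly according to actual conditions. After the similar stream data has been determined, the classified stream data and the similar stream data can be compared to determine the newly added data in the classified stream data. Specifically, the similar stream data can be read from the preset database into memory and compared with the classified stream data, and the data that differs can be treated as the newly added data. In this embodiment, if the amount of the newly added data is large, the newly added data can be stored as a new piece of stream data in the manner described above; when the amount of the newly added data is less than a specified data amount threshold, the newly added data can be added to the similar stream data. In this way, the problem that stream data with a small amount of data cannot be efficiently queried through the feature index can be avoided.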
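The similarity ratio and the determination of newly added data can be sketched as follows. The application does not say which feature index supplies the total number of features when the two indexes differ in length; this sketch assumes the longer one. It also assumes that streams are compared as lists of block identifiers. The names `similarity` and `added_blocks` are hypothetical.

```python
def similarity(index_a, index_b):
    """Share of positions at which both feature indexes hold the same feature."""
    total = max(len(index_a), len(index_b))  # assumption: longer index is the total
    if total == 0:
        return 0.0
    matches = sum(1 for a, b in zip(index_a, index_b) if a == b)
    return matches / total


def added_blocks(new_stream, similar_stream):
    """Blocks of new_stream that do not occur in similar_stream, in order."""
    known = set(similar_stream)
    return [block for block in new_stream if block not in known]
```

With the feature index (2, 1, 3, 1, 2) from the earlier example and a second index (2, 1, 3, 2, 2), four of the five positions agree, so the similarity is 0.8.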
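Embodiment 4
Referring to FIG. 7, the present application further provides a decoding device. The decoding device includes a memory and a processor, the memory stores a computer program, and when the computer program is executed by the processor, the following steps are implemented:
S21: Receive encoded data sent by the encoding device, and decode the encoded data into decoded data, the decoded data including multiple data blocks;
S22: Classify the multiple data blocks into at least one piece of stream data according to stream information, and determine target stream data whose amount of data is greater than or equal to a specified threshold;
S23: Store the target stream data in a preset database, and send confirmation information to the encoding device, the confirmation information containing the identifiers of the data blocks in the target stream data, so that the encoding device stores the data blocks in the target stream data.
In this embodiment, when the computer program is executed by the processor, the following steps are further implemented:
when the classified stream data includes stream data whose amount of data is less than the specified threshold, mounting the stream data whose amount of data is less than the specified threshold in the queue to be aggregated;
receiving association information sent by the application proxy module, the association information representing the associated stream data;
aggregating the pieces of stream data in the queue to be aggregated that correspond to the association information into one piece of virtual stream data, and storing the virtual stream data in the preset database;
extracting the feature index of the virtual stream data, writing the feature index into memory, and periodically persisting it to disk;
adding an aggregation instruction to the confirmation information, the aggregation instruction carrying the identifier of each piece of stream data in the virtual stream data, so that the encoding device aggregates the pieces of stream data and stores the aggregated virtual stream data.
In this embodiment, when the computer program is executed by the processor, the following steps are further implemented:
querying the preset database for similar stream data whose similarity with the classified stream data reaches a specified similarity threshold;
comparing the classified stream data with the similar stream data to determine the newly added data in the classified stream data;
when the amount of the newly added data is less than a specified data amount threshold, adding the newly added data to the similar stream data.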
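As can be seen from the above, in the technical solution provided by the present application, when the encoding device needs to send data to the decoding device, it can first divide the data into blocks based on content and determine the target data blocks that are not stored in the preset database. The encoding device does not store the target data blocks in the preset database directly. Instead, it classifies them into stream data according to the stream information and mounts the stream data in the queue to be confirmed. The stream data can then be encoded and sent to the decoding device. After decoding the received data, the decoding device classifies the decoded data blocks into stream data according to the stream information. If the amount of data of a piece of stream data is large, the decoding device can store the stream data in its local database and then send confirmation information to the encoding device, where the confirmation information can contain the identifiers of the data blocks in the stored stream data. After receiving the confirmation information, the encoding device can identify the corresponding data blocks in the queue to be confirmed and store them in its own database. The encoding device, as the data sending end, therefore stores a data block only after the decoding device, as the data receiving end, has stored it. Even if a transmission failure prevents the encoding device from receiving the confirmation information, the only consequence is that the database of the decoding device is more complete than that of the encoding device. For any label sent by the encoding device to represent an already stored data block, the decoding device is therefore certain to find the corresponding data block in its local database and restore it. Consequently, even if the databases of the encoding device and the decoding device are not synchronized, the decoding device's reception of data is not affected, which improves the efficiency of data transmission. In addition, when similarity matching is needed for target stream data with a small amount of data, the virtual stream data in which the target stream data is located can be determined to avoid the problem of insufficient feature indexes for the target stream data, and similarity matching can be performed on the target stream data based on the feature index of the virtual stream data. In this way, similarity matching of a new small data stream uses all the feature indexes in the stream cluster, which solves the problem of insufficient feature indexes.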
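Referring to FIG. 8, in the present application the technical solutions in the above embodiments can be applied to the computer terminal 10 shown in FIG. 8. The computer terminal 10 may include one or more (only one is shown in the figure) processors 102 (the processor 102 may include, but is not limited to, a processing device such as a microprocessor MCU or a programmable logic device FPGA), a memory 104 for storing data, and a transmission module 106 for communication functions. A person of ordinary skill in the art will understand that the structure shown in FIG. 8 is merely illustrative and does not limit the structure of the above electronic device. For example, the computer terminal 10 may include more or fewer components than shown in FIG. 8, or have a configuration different from that shown in FIG. 8.
The memory 104 can be used to store software programs and modules of application software, and the processor 102 executes various functional applications and data processing by running the software programs and modules stored in the memory 104. The memory 104 may include high-speed random access memory and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, and such remote memory may be connected to the computer terminal 10 through a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used to receive or send data via a network. Specific examples of the network may include a wireless network provided by the communication provider of the computer terminal 10. In one example, the transmission device 106 includes a network adapter (Network Interface Controller, NIC), which can be connected to other network devices through a base station so as to communicate with the Internet. In one example, the transmission device 106 may be a radio frequency (Radio Frequency, RF) module, which is used to communicate with the Internet wirelessly.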
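From the description of the above embodiments, a person skilled in the art can clearly understand that each embodiment can be implemented by software plus a necessary general-purpose hardware platform, or alternatively by hardware. Based on this understanding, the part of the above technical solutions that is essential, or that contributes to the prior art, can be embodied in the form of a software product. The computer software product can be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments or in certain parts of the embodiments.
The above are only preferred embodiments of the present invention and are not intended to limit the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.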

Claims (19)

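  1. A data storage method, characterized in that the method comprises:
    acquiring data to be sent, and dividing the data into multiple data blocks;
    determining, from the multiple data blocks, target data blocks that are not stored in a preset database;
    classifying the target data blocks into at least one piece of stream data according to stream information, and mounting the stream data in a queue to be confirmed;
    encoding the stream data, and sending the encoded stream data to a decoding device;
    receiving confirmation information sent by the decoding device for the encoded stream data, and storing the data blocks corresponding to the confirmation information in the queue to be confirmed into the preset database.
  2. The method according to claim 1, characterized in that the data blocks stored in the preset database are associated with data block labels; correspondingly, after the data is divided into multiple data blocks, the method further comprises:
    determining, from the multiple data blocks, stored data blocks that are already stored in the preset database;
    replacing the stored data blocks among the multiple data blocks with the data block labels associated with the stored data blocks.
  3. The method according to claim 2, characterized in that, when the stream data is encoded, the method further comprises:
    encoding the data block labels that replace the stored data blocks, and sending the encoded data block labels to the decoding device, so that the decoding device finds the corresponding data blocks based on the encoded data block labels.
  4. The method according to claim 1, characterized in that, after the data blocks corresponding to the confirmation information in the queue to be confirmed are stored into the preset database, the method further comprises:
    extracting the feature index of the stream data that contains the data blocks corresponding to the confirmation information, writing the feature index into memory, and backing it up to the hard disk.
  5. The method according to claim 1, characterized in that the method further comprises:
    if the stream data times out in the queue to be confirmed, discarding the timed-out stream data.
  6. The method according to claim 1, characterized in that the confirmation information further includes an aggregation instruction, and the aggregation instruction carries the identifier of the target stream data in the queue to be confirmed; correspondingly, the method further comprises:
    aggregating the target stream data into one piece of virtual stream data;
    storing the virtual stream data in the preset database, extracting the feature index of the virtual stream data into memory, and persisting it to the hard disk.
  7. The method according to claim 6, characterized in that the method further comprises:
    when similarity matching is performed on the target stream data, determining the virtual stream data in which the target stream data is located, and performing similarity matching on the target stream data based on the feature index of the virtual stream data.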
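  8. An encoding device, characterized in that the encoding device comprises a memory and a processor, the memory storing a computer program which, when executed by the processor, implements the following steps:
    obtaining data to be sent, and dividing the data into a plurality of data blocks;
    determining, from the plurality of data blocks, target data blocks that are not stored in a preset database;
    classifying the target data blocks into at least one piece of stream data according to stream information, and mounting the stream data in a to-be-confirmed queue;
    encoding the stream data, and sending the encoded stream data to a decoding device;
    receiving confirmation information sent by the decoding device for the encoded stream data, and storing, in the preset database, the data blocks in the to-be-confirmed queue that correspond to the confirmation information.
  9. The encoding device according to claim 8, characterized in that the computer program, when executed by the processor, further implements the following steps:
    extracting a feature index of the stream data containing the data blocks that correspond to the confirmation information, writing the feature index into memory, and persisting it to a hard disk.
  10. The encoding device according to claim 8, characterized in that the confirmation information further includes an aggregation instruction, the aggregation instruction carrying identifiers of target stream data in the to-be-confirmed queue; correspondingly, the computer program, when executed by the processor, further implements the following steps:
    aggregating the target stream data into one piece of virtual stream data;
    storing the virtual stream data in the preset database, extracting a feature index of the virtual stream data into memory, and persisting it to a hard disk.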
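  11. A data storage method, characterized in that the method comprises:
    receiving encoded data sent by an encoding device, and decoding the encoded data into decoded data, the decoded data including a plurality of data blocks;
    classifying the plurality of data blocks into at least one piece of stream data according to stream information, and determining target stream data whose data volume is greater than or equal to a specified threshold;
    storing the target stream data in a preset database, and sending confirmation information to the encoding device, the confirmation information including identifiers of the data blocks in the target stream data, so that the encoding device stores the data blocks in the target stream data.
  12. The method according to claim 11, characterized in that the decoded data further includes labels representing data blocks already stored in the encoding device; correspondingly, after decoding the encoded data into decoded data, the method further comprises:
    querying the preset database for the data block corresponding to a label; if it exists, restoring the label to the corresponding data block; if it does not exist, sending a synchronization request to the encoding device to synchronize the data blocks stored in the encoding device.
  13. The method according to claim 11, characterized in that, when the classified stream data includes stream data whose data volume is less than the specified threshold, the method further comprises:
    mounting the stream data whose data volume is less than the specified threshold in a to-be-aggregated queue;
    receiving association information sent by an application proxy module, the association information representing stream data that are related to one another;
    aggregating the multiple pieces of stream data in the to-be-aggregated queue that correspond to the association information into one piece of virtual stream data, and storing the virtual stream data in the preset database;
    extracting a feature index of the virtual stream data, writing the feature index into memory, and persisting it to a hard disk.
  14. The method according to claim 13, characterized in that the method further comprises:
    adding an aggregation instruction to the confirmation information, the aggregation instruction carrying the identifier of each piece of stream data in the virtual stream data, so that the encoding device aggregates those pieces of stream data and stores the aggregated virtual stream data.
  15. The method according to claim 13, characterized in that the method further comprises:
    if stream data whose data volume is less than the specified threshold has been mounted in the to-be-aggregated queue beyond a timeout, discarding the timed-out stream data.
  16. The method according to claim 11, characterized in that, after classifying the plurality of data blocks into at least one piece of stream data according to stream information, the method further comprises:
    querying the preset database for similar stream data whose similarity to the classified stream data reaches a specified similarity threshold;
    comparing the classified stream data with the similar stream data to determine the data newly added in the classified stream data;
    when the data volume of the newly added data is less than a specified data volume threshold, adding the newly added data to the similar stream data.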
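  17. A decoding device, characterized in that the decoding device comprises a memory and a processor, the memory storing a computer program which, when executed by the processor, implements the following steps:
    receiving encoded data sent by an encoding device, and decoding the encoded data into decoded data, the decoded data including a plurality of data blocks;
    classifying the plurality of data blocks into at least one piece of stream data according to stream information, and determining target stream data whose data volume is greater than or equal to a specified threshold;
    storing the target stream data in a preset database, and sending confirmation information to the encoding device, the confirmation information including identifiers of the data blocks in the target stream data, so that the encoding device stores the data blocks in the target stream data.
  18. The decoding device according to claim 17, characterized in that the computer program, when executed by the processor, further implements the following steps:
    when the classified stream data includes stream data whose data volume is less than the specified threshold, mounting the stream data whose data volume is less than the specified threshold in a to-be-aggregated queue;
    receiving association information, the association information representing stream data that are related to one another;
    aggregating the multiple pieces of stream data in the to-be-aggregated queue that correspond to the association information into one piece of virtual stream data, and storing the virtual stream data in the preset database;
    extracting a feature index of the virtual stream data, writing the feature index into memory, and persisting it to a hard disk;
    adding an aggregation instruction to the confirmation information, the aggregation instruction carrying the identifier of each piece of stream data in the virtual stream data, so that the encoding device aggregates those pieces of stream data and stores the aggregated virtual stream data.
  19. The decoding device according to claim 17, characterized in that the computer program, when executed by the processor, further implements the following steps:
    querying the preset database for similar stream data whose similarity to the classified stream data reaches a specified similarity threshold;
    comparing the classified stream data with the similar stream data to determine the data newly added in the classified stream data;
    when the data volume of the newly added data is less than a specified data volume threshold, adding the newly added data to the similar stream data.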
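
As an illustration only, the receiver-side steps recited in claims 11, 13, and 14 might be sketched in Python as follows. The `Decoder` class, the byte-count threshold, the SHA-256 block identifier, and the shape of the returned acknowledgement are assumptions of this sketch, not details taken from the claims; decoding, label lookup (claim 12), feature-index extraction, and queue timeouts (claim 15) are omitted.

```python
import hashlib
from collections import defaultdict


def block_id(block: bytes) -> str:
    """Identify a block by its content hash (an assumed scheme)."""
    return hashlib.sha256(block).hexdigest()


class Decoder:
    def __init__(self, threshold: int):
        self.threshold = threshold   # minimum stream size in bytes to store directly
        self.database = {}           # block id -> block
        self.to_aggregate = {}       # to-be-aggregated queue: stream key -> blocks
        self.virtual_streams = {}    # virtual stream id -> member stream keys

    def receive(self, decoded):
        """Group decoded (stream key, block) pairs; store large streams.

        Returns the block ids to put in the confirmation information.
        """
        streams = defaultdict(list)
        for stream_key, block in decoded:
            streams[stream_key].append(block)
        acked = []
        for stream_key, blocks in streams.items():
            if sum(len(b) for b in blocks) >= self.threshold:
                acked.extend(self._store(blocks))
            else:
                # Small stream: wait for association information.
                self.to_aggregate.setdefault(stream_key, []).extend(blocks)
        return acked

    def aggregate(self, virtual_id, related_keys):
        """Merge related small streams into one virtual stream and store it."""
        members = [k for k in related_keys if k in self.to_aggregate]
        acked = []
        for key in members:
            acked.extend(self._store(self.to_aggregate.pop(key)))
        self.virtual_streams[virtual_id] = members
        # The confirmation carries an aggregation instruction naming the members.
        return {"block_ids": acked, "aggregate": members}

    def _store(self, blocks):
        ids = []
        for block in blocks:
            bid = block_id(block)
            self.database[bid] = block
            ids.append(bid)
        return ids
```

In this sketch the related stream keys passed to `aggregate` stand in for the association information that claim 13 attributes to an application proxy module.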
PCT/CN2018/076411 2018-01-19 2018-02-12 Data storage method, encoding device, and decoding device WO2019140732A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
US16/099,796 US20210227007A1 (en) 2018-01-19 2018-02-12 Data storage method, encoding device, and decoding device
EP18901556.3A EP3588914A4 (en) 2018-01-19 2018-02-12 DATA STORAGE METHOD, CODING DEVICE, AND DECODING DEVICE

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810054884.7 2018-01-19
CN201810054884.7A CN108243256B (zh) 2018-01-19 2018-01-19 Data storage method, encoding device, and decoding device

Publications (1)

Publication Number Publication Date
WO2019140732A1 true WO2019140732A1 (zh) 2019-07-25

Family

ID=62699663

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/076411 WO2019140732A1 (zh) 2018-01-19 2018-02-12 Data storage method, encoding device, and decoding device

Country Status (4)

Country Link
US (1) US20210227007A1 (zh)
EP (1) EP3588914A4 (zh)
CN (1) CN108243256B (zh)
WO (1) WO2019140732A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11809430B2 (en) 2019-10-23 2023-11-07 Nec Corporation Efficient stream processing with data aggregations in a sliding window over out-of-order data streams

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110837414B (zh) * 2018-08-15 2024-04-12 京东科技控股股份有限公司 Task processing method and device
CN109032530B (zh) * 2018-08-21 2021-10-01 成都华为技术有限公司 Data stream processing method and device
US11030149B2 (en) * 2018-09-06 2021-06-08 Sap Se File format for accessing data quickly and efficiently
CN111722787B (zh) * 2019-03-22 2021-12-03 华为技术有限公司 Chunking method and device thereof
CN110737409B (zh) * 2019-10-21 2023-09-26 网易(杭州)网络有限公司 Data loading method, device, and terminal equipment
CN113746763B (zh) * 2020-05-29 2022-11-11 华为技术有限公司 Data processing method, device, and equipment
CN116962301A (zh) * 2022-04-18 2023-10-27 华为技术有限公司 Data stream order-preserving method, data exchange device, and network
CN114721601B (zh) * 2022-05-26 2022-08-30 昆仑智汇数据科技(北京)有限公司 Method and device for storing industrial equipment data
CN116204136B (zh) * 2023-05-04 2023-08-15 山东浪潮科学研究院有限公司 Data storage and query method, device, equipment, and storage medium
CN116963178B (zh) * 2023-09-21 2024-01-16 季华实验室 Method for reducing power consumption of NB-IoT devices and related devices

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110202623A1 (en) * 2010-02-17 2011-08-18 Emulex Design & Manufacturing Corporation Accelerated sockets
CN103970875A (zh) * 2014-05-15 2014-08-06 华中科技大学 Parallel data deduplication method
CN107291924A (zh) * 2017-06-29 2017-10-24 深信服科技股份有限公司 Synchronous replication log control method and system for a disaster recovery system

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101159757B (zh) * 2007-10-25 2011-11-30 中兴通讯股份有限公司 Dual-homing synchronous data transmission method
CN102469142A (zh) * 2010-11-16 2012-05-23 英业达股份有限公司 Data transmission method for a data deduplication program
US8943023B2 (en) * 2010-12-29 2015-01-27 Amazon Technologies, Inc. Receiver-side data deduplication in data systems
CN103067129B (zh) * 2012-12-24 2015-10-28 中国科学院深圳先进技术研究院 Network data transmission method and system
CN104753626B (zh) * 2013-12-25 2019-05-24 华为技术有限公司 Data compression method, device, and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3588914A4 *

Also Published As

Publication number Publication date
CN108243256B (zh) 2020-08-04
EP3588914A8 (en) 2020-02-19
CN108243256A (zh) 2018-07-03
EP3588914A1 (en) 2020-01-01
EP3588914A4 (en) 2020-05-27
US20210227007A1 (en) 2021-07-22

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18901556

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2018901556

Country of ref document: EP

Effective date: 20190923

NENP Non-entry into the national phase

Ref country code: DE