WO2021174839A1 - Data compression method and apparatus, and computer-readable storage medium - Google Patents

Data compression method and apparatus, and computer-readable storage medium

Info

Publication number
WO2021174839A1
WO2021174839A1 PCT/CN2020/119122 CN2020119122W WO2021174839A1 WO 2021174839 A1 WO2021174839 A1 WO 2021174839A1 CN 2020119122 W CN2020119122 W CN 2020119122W WO 2021174839 A1 WO2021174839 A1 WO 2021174839A1
Authority
WO
WIPO (PCT)
Prior art keywords
character
data set
characters
message data
character set
Prior art date
Application number
PCT/CN2020/119122
Inventor
李桃
Original Assignee
平安科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2021174839A1

Classifications

    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00 Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30 Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/55 Compression Theory, e.g. compression of random number, repeated compression

Definitions

  • This application relates to the field of big data technology, and in particular to a data compression method, device, and computer-readable storage medium.
  • The current messaging system in the IoT business scenario puts all the message data generated by IoT devices during transmission into a database or file, and the storage medium then compresses and stores the message data.
  • Messages transmitted by Internet of Things devices usually contain a lot of redundant device data, most of which is unchanged or rarely changes.
  • The inventor realized that this redundant device data is currently usually kept in disk space; because the amount of data generated in Internet of Things scenarios is very large, disk space is easily exhausted, which causes maintenance inconvenience and increased cost later on.
  • a data compression method provided by this application includes:
  • the minimum path length of the characters in the redundant data set is calculated by using the character code table, and the redundant data set is compressed according to the minimum path length of the characters.
  • This application also provides an electronic device, which includes:
  • Memory storing at least one instruction
  • the processor executes the instructions stored in the memory to implement the data compression method as described below:
  • the minimum path length of the characters in the redundant data set is calculated by using the character code table, and the redundant data set is compressed according to the minimum path length of the characters.
  • the present application also provides a computer-readable storage medium in which at least one instruction is stored, and the at least one instruction is executed by a processor in an electronic device to implement the following data compression method:
  • the minimum path length of the characters in the redundant data set is calculated by using the character code table, and the redundant data set is compressed according to the minimum path length of the characters.
  • the present application also provides a data compression device, which includes:
  • the recognition module is used to obtain the message data set generated by the Internet of Things device during the transmission process, identify the character set in the message data set, and use the character set as the initial character set;
  • the calculation and screening module is used to calculate the frequency of each character in the initial character set, filter out the redundant data set in the message data set in a preset manner according to the frequency of the character, and obtain the standard character set of the redundant data set;
  • the encoding module is used to encode the characters in the standard character set to obtain a character code table;
  • the compression module is configured to use the character code table to calculate the minimum path length of the characters in the redundant data set, and compress the redundant data set according to the minimum path length of the characters.
  • FIG. 1 is a schematic flowchart of a data compression method provided by an embodiment of this application
  • FIG. 2 is a schematic diagram of the internal structure of a data compression device provided by an embodiment of the application.
  • FIG. 3 is a schematic diagram of modules of a data compression program in a data compression device provided by an embodiment of the application.
  • This application provides a data compression method.
  • Referring to FIG. 1, a schematic flowchart of a data compression method provided by an embodiment of this application is shown.
  • the method can be executed by a device, and the device can be implemented by software and/or hardware.
  • the data compression method includes:
  • the IoT device includes a barcode, a sensor, a scanner, and so on.
  • a large amount of message data will be generated.
  • this application collects the message data generated by different IoT devices in the transmission process as the message data set of this application.
  • the character set includes a word character set and a symbol character set.
  • the present application recognizes the word characters in the message data set through the shortest path algorithm, so as to combine to form the word character set.
  • the shortest path algorithm includes: constructing a word-segmentation directed acyclic graph from a custom dictionary, in which each word corresponds to a directed edge of the graph and is assigned a corresponding edge length (weight); then, over all paths from the start node to the end node of the graph, sorting the calculated path lengths in strictly ascending order (the values at any two different positions must differ; the same applies below) and taking the 1st, 2nd, ..., i-th, ..., N-th path sets as the corresponding rough segmentation result set.
  • if the size of the final rough segmentation result set is greater than or equal to N, the word character set contained in the path sets is obtained.
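The word-segmentation step can be illustrated with a minimal Python sketch. It is a simplification under stated assumptions, not the method of this application: it keeps only the single shortest path (N = 1) rather than the N shortest path sets, and the function name `segment_shortest_path`, the dictionary weights, and the fallback penalty for out-of-dictionary characters are all hypothetical.

```python
from typing import Dict, List, Tuple


def segment_shortest_path(text: str, dictionary: Dict[str, float]) -> List[str]:
    """Segment `text` along the minimum-weight path of a word-segmentation DAG.

    Node i is the position before text[i]. Each dictionary word found at
    text[i:j] is a directed edge i -> j carrying the word's weight. Single
    characters not in the dictionary get a fallback edge, so a path from
    node 0 to node len(text) always exists.
    """
    n = len(text)
    max_len = max((len(w) for w in dictionary), default=1)
    fallback_weight = 10.0  # hypothetical penalty, not taken from the source
    # best[i] = (cost of the cheapest path from node 0 to node i, previous node)
    best: List[Tuple[float, int]] = [(float("inf"), -1)] * (n + 1)
    best[0] = (0.0, -1)
    for i in range(n):  # nodes are already in topological order
        cost_i = best[i][0]
        for j in range(i + 1, min(n, i + max_len) + 1):
            word = text[i:j]
            if word in dictionary:
                weight = dictionary[word]
            elif j == i + 1:
                weight = fallback_weight
            else:
                continue
            if cost_i + weight < best[j][0]:
                best[j] = (cost_i + weight, i)
    # Walk back from the end node to recover the words on the best path.
    words: List[str] = []
    j = n
    while j > 0:
        i = best[j][1]
        words.append(text[i:j])
        j = i
    return words[::-1]
```

Under these assumptions, the word character set would simply be the set of words returned for all messages in the message data set.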
  • the present application recognizes the symbol characters in the message data set through a character recognition algorithm, so as to combine to form the symbol character set.
  • the character recognition algorithm includes: presetting a character database template, matching the character database template against the message data set using character matching technology, and extracting the symbol characters in the message data set that are successfully matched, thereby obtaining the symbol character set.
  • the word character set and the symbol character set are stored in the database as the initial character set.
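As a rough illustration of the symbol recognition and of combining both sets, the sketch below treats the "character database template" as a plain set of symbols and keeps every character of the messages that matches it. The template contents (`string.punctuation`) and the function names are assumptions made for the example; the source does not specify them.

```python
import string
from typing import Iterable, Set

# Hypothetical "character database template": the symbols to look for.
SYMBOL_TEMPLATE: Set[str] = set(string.punctuation)


def extract_symbol_characters(
    messages: Iterable[str], template: Set[str] = SYMBOL_TEMPLATE
) -> Set[str]:
    """Return every character of the messages that matches the template."""
    found: Set[str] = set()
    for message in messages:
        found.update(ch for ch in message if ch in template)
    return found


def build_initial_character_set(word_chars: Set[str], symbol_chars: Set[str]) -> Set[str]:
    """Combine the word character set and the symbol character set."""
    return word_chars | symbol_chars
```

A real template could equally be a regular expression or a database lookup; a set is used here only because membership tests keep the example short.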
  • the frequency of each character in the initial character set is calculated as $f_i = \frac{n_i}{v}$, where:
  • $f_i$ represents the frequency of appearance of the initial character $i$
  • $n_i$ represents the number of occurrences of character $i$ in the initial character set
  • $v$ represents the number of all characters in the initial character set.
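The frequency calculation (occurrences of a character divided by the total number of characters) can be sketched in a few lines of Python. The sketch assumes the initial character set is held as a sequence in which a character may occur more than once, since a count per character would otherwise be meaningless; the function name is hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def character_frequencies(initial_characters: Iterable[str]) -> Dict[str, float]:
    """Map each character i to f_i = n_i / v."""
    counts = Counter(initial_characters)  # n_i for every character i
    total = sum(counts.values())          # v, the number of all characters
    if total == 0:
        return {}
    return {ch: n / total for ch, n in counts.items()}
```

For example, `character_frequencies(["a", "b", "a", "c"])` gives a frequency of 0.5 for `"a"` and 0.25 each for `"b"` and `"c"`.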
  • the present application filters out the redundant data in the message data set in a preset manner according to the sort of the character frequency, and identifies the standard character set in the redundant data.
  • the preset method is to compare the character frequency with a preset threshold, and if the character frequency is less than the preset threshold, use its corresponding message data as redundant data.
  • the threshold is set to 0.35.
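The screening rule can be sketched as follows. The comparison direction (frequency below the threshold) and the value 0.35 follow the text above; how a character maps to "its corresponding message data" is not specified, so the sketch assumes a message corresponds to a character when it contains that character. The function name is hypothetical.

```python
from typing import Dict, List

THRESHOLD = 0.35  # preset threshold stated in the text


def select_redundant_messages(
    messages: List[str], frequencies: Dict[str, float], threshold: float = THRESHOLD
) -> List[str]:
    """Return the messages treated as redundant data.

    Assumption: a message is selected when it contains at least one
    character whose frequency is below the threshold.
    """
    below = {ch for ch, f in frequencies.items() if f < threshold}
    return [m for m in messages if any(ch in below for ch in m)]
```

If a different association between characters and messages is intended, only the list comprehension in the last line needs to change.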
  • the method for identifying the standard character set in the redundant data is the same as the method for identifying the character set in the message data set in step S1, and the description is not repeated here.
  • the encoding of the standard character set in the embodiment of the present application includes: presetting a serial number character set for the standard character set and obtaining a one-to-one correspondence between the standard character set and the serial number character set; calculating the probability of each standard character in the standard character set according to the one-to-one correspondence; forming the code number of each standard character according to its probability; and establishing the character code table according to the code numbers.
  • the probability of each standard character in the standard character set can be expressed as:
  • the serial number character set $n_0$ of the standard character set is preset to {01, 02, 03, 04}.
  • the method for calculating the minimum path length of characters in the redundant data set includes: $\mathrm{MINWPL} = \sum_{k=1}^{n} w_k l_k$, where:
  • MINWPL represents the minimum path length of a character
  • $w_k$ represents the weight of the k-th character in the character code table
  • $l_k$ represents the offset of the k-th character in the character code table
  • $n$ represents the number of characters in the character code table.
  • this application compresses the redundant data set according to the minimum path length of the character.
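A weighted path length of the form "sum of weight times offset over all characters" is minimized by a Huffman tree, so the sketch below uses Huffman construction to obtain the minimum value. This is an assumption: the source does not name Huffman coding, and the sketch interprets the character offset as the depth of the character in the code tree (its code length). All function names are hypothetical.

```python
import heapq
from itertools import count
from typing import Dict


def huffman_code_lengths(weights: Dict[str, float]) -> Dict[str, int]:
    """Return the depth (code length) of each character in a Huffman tree."""
    if not weights:
        return {}
    if len(weights) == 1:
        return {ch: 1 for ch in weights}
    tie = count()  # tie-breaker so the heap never compares the dict payloads
    heap = [(w, next(tie), {ch: 0}) for ch, w in weights.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every character in them one level down.
        merged = {ch: depth + 1 for ch, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, next(tie), merged))
    return heap[0][2]


def weighted_path_length(weights: Dict[str, float], lengths: Dict[str, int]) -> float:
    """Compute the sum of w_k * l_k over all characters."""
    return sum(weights[ch] * lengths[ch] for ch in weights)
```

With weights `{"a": 5, "b": 2, "c": 1, "d": 1}` the code lengths come out as 1, 2, 3, and 3, giving a weighted path length of 15, compared with 18 for a fixed two-bit code.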
  • the embodiment of the present application further includes: performing a decoding operation on the compressed redundant data set by using a pre-created decoding rule.
  • the pre-created decoding rules in this application include: according to the coding diagram of the serial number characters, suppose the maximum number of layers is k; the tree layers are then created cyclically from the kth layer up to the second layer.
  • if the tree layer being established is not the kth layer, 2 nodes at a time are taken from the non-leaf nodes and leaf nodes of the tree layer to build a new node; if the number of non-leaf nodes of the tree layer is odd, the last non-leaf node is combined with the first leaf node of the tree layer to form a new node; if the tree layer being established is the kth layer, the leaf nodes are used directly to build the new nodes, thereby forming the decoding rule.
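Decoding with a code tree can be sketched as below. The sketch does not reproduce the layer-by-layer construction described above; instead it assumes a prefix-free code table (character to bit string) is already available, rebuilds the tree from that table, and walks it bit by bit. The table used in the usage note and the function names are hypothetical.

```python
from typing import Dict


def build_decoding_tree(code_table: Dict[str, str]) -> dict:
    """Build a nested-dict tree: internal nodes are dicts, leaves are characters."""
    root: dict = {}
    for ch, bits in code_table.items():
        node = root
        for bit in bits[:-1]:
            node = node.setdefault(bit, {})
        node[bits[-1]] = ch
    return root


def decode(bits: str, code_table: Dict[str, str]) -> str:
    """Decode a bit string by walking the tree from the root for each code word."""
    root = build_decoding_tree(code_table)
    out = []
    node = root
    for bit in bits:
        node = node[bit]
        if isinstance(node, str):  # reached a leaf: emit it and restart at the root
            out.append(node)
            node = root
    if node is not root:
        raise ValueError("bit string ends in the middle of a code word")
    return "".join(out)
```

With the table `{"a": "0", "b": "10", "c": "110", "d": "111"}`, the bit string `"0100111"` decodes to `"abad"`.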
  • FIG. 2 is a functional block diagram of the data compression device of the present application.
  • the data compression apparatus 100 described in this application can be installed in an electronic device.
  • the data compression device may include an identification module 101, a calculation and screening module 102, an encoding module 103, and a compression module 104.
  • the module described in the present invention can also be called a unit, which refers to a series of computer program segments that can be executed by the processor of an electronic device and can complete fixed functions, and are stored in the memory of the electronic device.
  • each module/unit is as follows:
  • the recognition module 101 is configured to obtain a message data set generated by an Internet of Things device during transmission, identify a character set in the message data set, and use the character set as an initial character set;
  • the calculation and screening module 102 is configured to calculate the frequency of each character in the initial character set, filter out the redundant data set in the message data set in a preset manner according to the frequency of the character, and obtain the standard character set of the redundant data set;
  • the encoding module 103 is configured to encode characters in the standard character set to obtain a character code table;
  • the compression module 104 is configured to use the character code table to calculate the minimum path length of the characters in the redundant data set, and compress the redundant data set according to the minimum path length of the characters.
  • each module in the data compression device 100 is as follows:
  • the recognition module 101 obtains a message data set generated by the Internet of Things device during transmission, recognizes a character set in the message data set, and uses the character set as an initial character set.
  • the IoT device includes a barcode, a sensor, a scanner, and so on.
  • a large amount of message data will be generated.
  • this application collects the message data generated by different IoT devices in the transmission process as the message data set of this application.
  • the character set includes a word character set and a symbol character set.
  • the present application recognizes the word characters in the message data set through the shortest path algorithm, so as to combine to form the word character set.
  • the shortest path algorithm includes: constructing a word-segmentation directed acyclic graph from a custom dictionary, in which each word corresponds to a directed edge of the graph and is assigned a corresponding edge length (weight); then, over all paths from the start node to the end node of the graph, sorting the calculated path lengths in strictly ascending order (the values at any two different positions must differ; the same applies below) and taking the 1st, 2nd, ..., i-th, ..., N-th path sets as the corresponding rough segmentation result set.
  • if the size of the final rough segmentation result set is greater than or equal to N, the word character set contained in the path sets is obtained.
  • the present application recognizes the symbol characters in the message data set through a character recognition algorithm, so as to combine to form the symbol character set.
  • the character recognition algorithm includes: presetting a character database template, matching the character database template against the message data set using character matching technology, and extracting the symbol characters in the message data set that are successfully matched, thereby obtaining the symbol character set.
  • the word character set and the symbol character set are stored in the database as the initial character set.
  • the calculation and screening module 102 calculates the frequency of each character in the initial character set, filters out the redundant data set in the message data set in a preset manner according to the frequency of the character, and obtains the standard character set of the redundant data set.
  • the frequency of each character in the initial character set is calculated as $f_i = \frac{n_i}{v}$, where:
  • $f_i$ represents the frequency of appearance of the initial character $i$
  • $n_i$ represents the number of occurrences of character $i$ in the initial character set
  • $v$ represents the number of all characters in the initial character set.
  • the present application filters out the redundant data in the message data set in a preset manner according to the sorting of the character frequencies, and obtains the standard character set of the redundant data. The preset manner is to compare the character frequency with a preset threshold and, if the character frequency is less than the preset threshold, to use the corresponding message data as redundant data. The preset threshold in this application is 0.35.
  • the encoding module 103 encodes the characters in the standard character set to obtain a character code table.
  • the encoding of the standard character set in the embodiment of the present application includes: presetting a serial number character set for the standard character set and obtaining a one-to-one correspondence between the standard character set and the serial number character set; calculating the probability of each standard character in the standard character set according to the one-to-one correspondence; forming the code number of each standard character according to its probability; and establishing the character code table according to the code numbers.
  • the probability of each standard character in the standard character set can be expressed as:
  • the serial number character set $n_0$ of the standard character set is preset to {01, 02, 03, 04}.
  • the compression module 104 uses the character code table to calculate the minimum path length of the characters in the redundant data set, and compresses the redundant data set according to the minimum path length of the characters.
  • the method for calculating the minimum path length of characters in the redundant data set includes: $\mathrm{MINWPL} = \sum_{k=1}^{n} w_k l_k$, where:
  • MINWPL represents the minimum path length of a character
  • $w_k$ represents the weight of the k-th character in the character code table
  • $l_k$ represents the offset of the k-th character in the character code table
  • $n$ represents the number of characters in the character code table.
  • this application compresses the redundant data set according to the minimum path length of the character.
  • the embodiment of the present application further includes: performing a decoding operation on the compressed redundant data set by using a pre-created decoding rule.
  • the pre-created decoding rules in this application include: according to the coding diagram of the serial number characters, suppose the maximum number of layers is k; the tree layers are then created cyclically from the kth layer up to the second layer.
  • if the tree layer being established is not the kth layer, 2 nodes at a time are taken from the non-leaf nodes and leaf nodes of the tree layer to build a new node; if the number of non-leaf nodes of the tree layer is odd, the last non-leaf node is combined with the first leaf node of the tree layer to form a new node; if the tree layer being established is the kth layer, the leaf nodes are used directly to build the new nodes, thereby forming the decoding rule.
  • FIG. 3 is a schematic diagram of the structure of an electronic device implementing the data compression method of the present application.
  • the electronic device 1 may include a processor 10, a memory 11, and a bus, and may also include a computer program stored in the memory 11 and running on the processor 10, such as a data compression program 12.
  • the memory 11 includes at least one type of readable storage medium, and the readable storage medium includes flash memory, mobile hard disk, multimedia card, card-type memory (for example: SD or DX memory, etc.), magnetic memory, magnetic disk, CD etc.
  • the memory 11 may be an internal storage unit of the electronic device 1 in some embodiments, for example, a mobile hard disk of the electronic device 1.
  • the memory 11 may also be an external storage device of the electronic device 1, such as a plug-in mobile hard disk, a smart media card (SmartMediaCard, SMC), a secure digital (SecureDigital, SD) card, or a flash card (FlashCard) equipped on the electronic device 1.
  • the memory 11 may also include both an internal storage unit of the electronic device 1 and an external storage device.
  • the memory 11 can be used not only to store application software and various data installed in the electronic device 1, such as the code of a data compression program, etc., but also to temporarily store data that has been output or will be output.
  • the processor 10 may be composed of integrated circuits in some embodiments, for example, may be composed of a single packaged integrated circuit, or may be composed of multiple integrated circuits with the same function or different functions, including one or more Central Processing Unit (CPU), microprocessor, digital processing chip, graphics processor and a combination of various control chips, etc.
  • the processor 10 is the control unit (ControlUnit) of the electronic device. It uses various interfaces and lines to connect the components of the entire electronic device, runs or executes the programs or modules stored in the memory 11 (such as the data compression program), and calls data stored in the memory 11 to execute the various functions of the electronic device 1 and process data.
  • the bus may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like.
  • the bus can be divided into address bus, data bus, control bus and so on.
  • the bus is configured to implement connection and communication between the memory 11 and at least one processor 10 and the like.
  • FIG. 3 only shows an electronic device with certain components. Those skilled in the art will understand that the structure shown in FIG. 3 does not limit the electronic device 1, which may include fewer or more components than shown, a combination of certain components, or a different arrangement of components.
  • the electronic device 1 may also include a power source (such as a battery) for supplying power to various components.
  • the power source may be logically connected to the at least one processor 10 through a power management device, so that functions such as charge management, discharge management, and power consumption management are implemented through the power management device.
  • the power supply may also include any components such as one or more DC or AC power supplies, recharging devices, power failure detection circuits, power converters or inverters, and power status indicators.
  • the electronic device 1 may also include various sensors, Bluetooth modules, Wi-Fi modules, etc., which will not be repeated here.
  • the electronic device 1 may also include a network interface.
  • the network interface may include a wired interface and/or a wireless interface (such as a Wi-Fi interface, a Bluetooth interface, etc.), and is usually used to establish a communication connection between the electronic device 1 and other electronic devices.
  • the electronic device 1 may also include a user interface.
  • the user interface may be a display (Display) and an input unit (such as a keyboard (Keyboard)).
  • the user interface may also be a standard wired interface or a wireless interface.
  • the display may be an LED display, a liquid crystal display, a touch liquid crystal display, an OLED (Organic Light-Emitting Diode, organic light emitting diode) touch device, etc.
  • the display can also be appropriately called a display screen or a display unit, which is used to display the information processed in the electronic device 1 and to display a visualized user interface.
  • the data compression program 12 stored in the memory 11 in the electronic device 1 is a combination of multiple instructions. When running in the processor 10, it can realize:
  • the minimum path length of the characters in the redundant data set is calculated by using the character code table, and the redundant data set is compressed according to the minimum path length of the characters.
  • if the integrated module/unit of the electronic device 1 is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • the computer-readable medium may include: any entity or device capable of carrying the computer program code, recording medium, U disk, mobile hard disk, magnetic disk, optical disk, computer memory, read-only memory (ROM, Read-Only Memory).
  • the computer-readable storage medium may be non-volatile or volatile.
  • modules described as separate components may or may not be physically separated, and the components displayed as modules may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the modules can be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • the functional modules in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit may be implemented in the form of hardware, or may be implemented in the form of hardware plus software functional modules.

Abstract

A data compression method and apparatus, an electronic device, and a computer-readable storage medium. The method comprises: obtaining a message data set generated by an Internet of Things device during transmission, identifying a character set in the message data set, and using the character set as an initial character set (S1); calculating the frequency of each character in the initial character set, selecting a redundant data set in the message data set in a preset manner according to the character frequencies, identifying a standard character set in the redundant data set, and encoding the characters in the standard character set to obtain a character code table (S2); and using the character code table to calculate a minimum character path length in the redundant data set, and compressing the redundant data set according to the minimum character path length (S3). The method, apparatus, electronic device, and computer-readable storage medium compress redundant data and thereby reduce wasted storage space.

Description

Data compression method, apparatus, and computer-readable storage medium
This application claims priority to Chinese patent application No. 202010155298.9, filed with the Chinese Patent Office on March 6, 2020 and entitled "Data compression method, device and computer readable storage medium", the entire content of which is incorporated herein by reference.
Technical Field
This application relates to the field of big data technology, and in particular to a data compression method, apparatus, and computer-readable storage medium.
Background
The current messaging system in the Internet of Things business scenario puts all the message data generated by Internet of Things devices during transmission into a database or file, and the storage medium then compresses and stores the message data. Messages transmitted by Internet of Things devices usually contain a lot of redundant device data, most of which is unchanged or rarely changes. The inventor realized that this redundant device data is currently usually kept in disk space; because the amount of data generated in Internet of Things scenarios is very large, disk space is easily exhausted, which causes maintenance inconvenience and increased cost later on.
Summary of the Invention
A data compression method provided by this application includes:
obtaining a message data set generated by Internet of Things devices during transmission, identifying a character set in the message data set, and using the character set as an initial character set;
calculating the frequency of each character in the initial character set, filtering out a redundant data set in the message data set in a preset manner according to the character frequencies, obtaining a standard character set of the redundant data set, and encoding the characters in the standard character set to obtain a character code table;
using the character code table to calculate the minimum path length of the characters in the redundant data set, and compressing the redundant data set according to the minimum path length of the characters.
This application also provides an electronic device, which includes:
a memory storing at least one instruction; and
a processor that executes the instructions stored in the memory to implement the following data compression method:
obtaining a message data set generated by Internet of Things devices during transmission, identifying a character set in the message data set, and using the character set as an initial character set;
calculating the frequency of each character in the initial character set, filtering out a redundant data set in the message data set in a preset manner according to the character frequencies, obtaining a standard character set of the redundant data set, and encoding the characters in the standard character set to obtain a character code table;
using the character code table to calculate the minimum path length of the characters in the redundant data set, and compressing the redundant data set according to the minimum path length of the characters.
This application also provides a computer-readable storage medium in which at least one instruction is stored, the at least one instruction being executed by a processor in an electronic device to implement the following data compression method:
obtaining a message data set generated by Internet of Things devices during transmission, identifying a character set in the message data set, and using the character set as an initial character set;
calculating the frequency of each character in the initial character set, filtering out a redundant data set in the message data set in a preset manner according to the character frequencies, obtaining a standard character set of the redundant data set, and encoding the characters in the standard character set to obtain a character code table;
using the character code table to calculate the minimum path length of the characters in the redundant data set, and compressing the redundant data set according to the minimum path length of the characters.
This application also provides a data compression apparatus, which includes:
a recognition module, used to obtain a message data set generated by Internet of Things devices during transmission, identify a character set in the message data set, and use the character set as an initial character set;
a calculation and screening module, used to calculate the frequency of each character in the initial character set, filter out a redundant data set in the message data set in a preset manner according to the character frequencies, and obtain a standard character set of the redundant data set;
an encoding module, used to encode the characters in the standard character set to obtain a character code table;
a compression module, used to calculate the minimum path length of the characters in the redundant data set using the character code table, and to compress the redundant data set according to the minimum path length of the characters.
Description of the drawings
FIG. 1 is a schematic flowchart of a data compression method provided by an embodiment of the present application;
FIG. 2 is a schematic diagram of the internal structure of a data compression apparatus provided by an embodiment of the present application;
FIG. 3 is a schematic diagram of the modules of the data compression program in the data compression apparatus provided by an embodiment of the present application.
The realization of the objectives, the functional characteristics, and the advantages of the present application are further described below in conjunction with the embodiments and with reference to the accompanying drawings.
Detailed description
It should be understood that the specific embodiments described here are intended only to explain the present application and not to limit it.
The present application provides a data compression method. FIG. 1 is a schematic flowchart of a data compression method provided by an embodiment of the present application. The method may be executed by an apparatus, and the apparatus may be implemented in software and/or hardware.
In this embodiment, the data compression method includes:
S1. Acquire a message data set generated by IoT devices during transmission, identify the character set in the message data set, and use that character set as an initial character set.
In a preferred embodiment of the present application, the IoT devices include barcodes, sensors, scanners, and the like. While in use, the IoT devices generate a large amount of message data; for example, scanning a barcode generates a message data set that includes the barcode's parameter configuration data, the barcode model, and the barcode identification number. Preferably, the present application collects the message data generated by different IoT devices during transmission as the message data set.
In a preferred embodiment of the present application, the character set includes a word character set and a symbol character set.
Preferably, the present application identifies the word characters in the message data set using a shortest-path algorithm and combines them to form the word character set. The shortest-path algorithm includes: constructing a word-segmentation directed acyclic graph from a custom dictionary, where each word corresponds to one directed edge in the graph and is assigned a corresponding edge length (weight); then, among all paths from the start node to the end node of the graph, taking the paths whose lengths, sorted in ascending order (values at any two different ranks are strictly different; the same applies below), rank 1st, 2nd, …, i-th, …, N-th as the coarse segmentation result set. If two or more paths have equal length, they share rank i and are all included in the coarse segmentation result set, without affecting the ranks of the other paths, so the final coarse segmentation result set has a size greater than or equal to N. The word character set is then obtained from the words contained in this path set.
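The following Python sketch illustrates the idea of shortest-path word segmentation over a dictionary-built directed acyclic graph. It is a simplified illustration under stated assumptions, not the embodiment itself: every edge is given weight 1, only the single shortest path (N = 1) is kept rather than the N shortest, and the function names `build_dag` and `shortest_segmentation` are hypothetical.

```python
def build_dag(text, dictionary):
    """Map each start index to the end indices of dictionary words starting there.

    A one-character edge is always added so that every position stays reachable
    (an assumption of this sketch).
    """
    dag = {}
    for i in range(len(text)):
        ends = [i + 1]  # fallback edge covering a single character
        for j in range(i + 2, len(text) + 1):
            if text[i:j] in dictionary:
                ends.append(j)
        dag[i] = ends
    return dag


def shortest_segmentation(text, dictionary):
    """Return the segmentation that uses the fewest edges (each edge has weight 1)."""
    n = len(text)
    dag = build_dag(text, dictionary)
    # best[i] = (edge count, next index) of the cheapest path from i to the end
    best = {n: (0, None)}
    for i in range(n - 1, -1, -1):
        best[i] = min((best[j][0] + 1, j) for j in dag[i])
    words, i = [], 0
    while i < n:
        j = best[i][1]
        words.append(text[i:j])
        i = j
    return words


words = shortest_segmentation("barcodemodel", {"bar", "barcode", "code", "model"})
print(words)
```

With this dictionary the two-edge path "barcode" + "model" is preferred over the three-edge path "bar" + "code" + "model".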
Preferably, the present application identifies the symbol characters in the message data set using a character recognition algorithm and combines them to form the symbol character set. The character recognition algorithm includes: presetting a character database template, matching the character database template against the message data set using character matching, and extracting the symbol characters of the message data set that match successfully, thereby obtaining the symbol character set.
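A minimal Python sketch of template-based symbol extraction follows. The contents of the template and the names `SYMBOL_TEMPLATE` and `extract_symbols` are assumptions made for illustration; the text does not specify what the character database template contains.

```python
# Hypothetical template: the symbols this sketch treats as "symbol characters".
SYMBOL_TEMPLATE = frozenset("{}[]():,;=\"'")


def extract_symbols(messages, template=SYMBOL_TEMPLATE):
    """Collect, in order of first appearance, the template symbols found in the messages."""
    found = []
    for message in messages:
        for ch in message:
            if ch in template and ch not in found:
                found.append(ch)
    return found


symbols = extract_symbols(['{"id":1}', "a=b;"])
print(symbols)
```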
The present application stores the word character set and the symbol character set in a database as the initial character set.
S2. Calculate the frequency of each character in the initial character set, filter out a redundant data set from the message data set in a preset manner according to the character frequencies, identify the standard character set in the redundant data set, and encode the characters in the standard character set to obtain a character code table.
In a preferred embodiment of the present application, the frequency of each character in the initial character set is calculated as follows:
$$ f_i = \frac{n_i}{v} $$
where f_i denotes the frequency with which character i occurs, n_i denotes the number of occurrences of character i in the initial character set, and v denotes the total number of characters in the initial character set.
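As a short illustration, the frequency defined above (the count n_i of character i divided by the total count v) can be computed as in the following Python sketch; the function name `character_frequencies` is hypothetical.

```python
from collections import Counter


def character_frequencies(characters):
    """Return {character: n_i / v} for an iterable of characters."""
    counts = Counter(characters)      # n_i for every distinct character i
    total = sum(counts.values())      # v, the total number of characters
    return {ch: n / total for ch, n in counts.items()}


freqs = character_frequencies("aabc")
print(freqs)
```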
Preferably, the present application filters out the redundant data in the message data set in a preset manner according to the ordering of the character frequencies, and identifies the standard character set in the redundant data. The preset manner is to compare each character frequency with a preset threshold: if the character frequency is less than the preset threshold, the message data corresponding to that character is treated as redundant data. In the present application, the preset threshold is 0.35.
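The text does not spell out how a character is mapped to its "corresponding message data". The Python sketch below adopts one possible reading, stated here as an assumption: a message is treated as redundant if it contains at least one character whose frequency is below the threshold. The name `select_redundant` is hypothetical; the threshold value 0.35 is the one given in the embodiment.

```python
THRESHOLD = 0.35  # preset threshold stated in the embodiment


def select_redundant(messages, frequencies, threshold=THRESHOLD):
    """Return the messages containing at least one character with frequency < threshold.

    Assumed reading: "message data corresponding to a character" means any
    message in which that character occurs.
    """
    low = {ch for ch, f in frequencies.items() if f < threshold}
    return [m for m in messages if any(ch in low for ch in m)]


redundant = select_redundant(["aa", "ab", "c"], {"a": 0.5, "b": 0.3, "c": 0.2})
print(redundant)
```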
In this embodiment, the method for identifying the character set in the message data set is the same as the method described in step S1 and is not repeated here. Further, encoding the standard character set includes: presetting a serial-number character set for the standard character set, obtaining the one-to-one correspondence between the standard character set and the serial-number character set, calculating the probability of each standard character in the standard character set according to the one-to-one correspondence, forming the code number of each standard character according to its probability, and building the character code table from the code numbers. The correspondence can be expressed as v = {(1, v_1), (1, v_2), …, (1, v_n)}, and the probability of each standard character in the standard character set can be expressed as:
[Formula image PCTCN2020119122-appb-000002]
The following example illustrates the encoding of the standard character set n_0 = {1, 2, 3, 4}: the serial-number character set of n_0 is preset to {01, 02, 03, 04}; the probabilities of the standard characters in n_0 are calculated as {0.4, 0.3, 0.2, 0.1}; the code numbers of the standard characters in n_0 are obtained as {1, 00, 010, 110}; and the character code table of n_0 is therefore w_0 = {(1, 1), (2, 00), (3, 010), (4, 110)}.
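The text does not name the coding scheme used to derive code numbers from probabilities. One common scheme that yields code lengths of 1, 2, 3 and 3 bits for the probabilities {0.4, 0.3, 0.2, 0.1} is Huffman coding, which the following Python sketch assumes; `huffman_code_table` is a hypothetical helper name. The bit patterns it produces need not match the code numbers listed in the example. In particular, a Huffman table is prefix-free, whereas in the example table the code 1 is a prefix of the code 110.

```python
import heapq
from itertools import count


def huffman_code_table(probabilities):
    """Build a prefix-free code table {symbol: bitstring} from {symbol: probability}."""
    if not probabilities:
        return {}
    if len(probabilities) == 1:
        return {next(iter(probabilities)): "0"}
    tiebreak = count()  # unique counter so the heap never compares the dicts
    heap = [(p, next(tiebreak), {sym: ""}) for sym, p in probabilities.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, low = heapq.heappop(heap)    # least probable subtree
        p1, _, high = heapq.heappop(heap)   # second least probable subtree
        merged = {s: "0" + c for s, c in low.items()}
        merged.update({s: "1" + c for s, c in high.items()})
        heapq.heappush(heap, (p0 + p1, next(tiebreak), merged))
    return heap[0][2]


table = huffman_code_table({1: 0.4, 2: 0.3, 3: 0.2, 4: 0.1})
print(table)
```

More probable characters receive shorter codes, which is what makes the encoded data smaller than a fixed-length encoding.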
S3. Calculate the minimum character path length of the redundant data set using the character code table, and compress the redundant data set according to the minimum character path length.
In a preferred embodiment of the present application, the minimum character path length of the redundant data set is calculated as follows:
$$ \mathrm{MINWPL} = \sum_{k=1}^{n} w_k \, l_k $$
where MINWPL denotes the minimum character path length, w_k denotes the weight of the k-th character in the character code table, l_k denotes the offset of the k-th character in the character code table, and n denotes the number of characters in the character code table. Preferably, the present application compresses the redundant data set according to the minimum character path length.
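As a worked illustration, the Python sketch below computes a weighted path length as the sum of w_k · l_k over the code table. It assumes that w_k is the probability of the k-th character and that l_k is the length in bits of its code number; the text calls l_k an offset, so this reading is an assumption. The function name `weighted_path_length` is hypothetical.

```python
def weighted_path_length(weights, code_table):
    """Sum of weight * code length over every character in the code table."""
    return sum(weights[sym] * len(code) for sym, code in code_table.items())


# Probabilities and code numbers taken from the example for n_0 = {1, 2, 3, 4}.
weights = {1: 0.4, 2: 0.3, 3: 0.2, 4: 0.1}
code_table = {1: "1", 2: "00", 3: "010", 4: "110"}

wpl = weighted_path_length(weights, code_table)
print(round(wpl, 6))  # 0.4*1 + 0.3*2 + 0.2*3 + 0.1*3, about 1.9 bits per character
```

Under these assumptions the example table needs about 1.9 bits per character on average, compared with 2 bits for a fixed-length code over four characters.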
Further, after compressing the redundant data set, this embodiment also includes decoding the compressed redundant data set using a pre-created decoding rule. The pre-created decoding rule includes: according to the coding diagram of the serial-number characters above, let the maximum number of layers be k, and build the tree layers in a loop from layer k down to layer 2. If the layer being built is not layer k, two nodes at a time are taken from the non-leaf nodes and leaf nodes of that layer to build a new node; if the number of non-leaf nodes in that layer is odd, the last non-leaf node is combined with the first leaf node of the layer to form a new node. If the layer being built is layer k, new nodes are built directly from the leaf nodes. This forms the decoding rule.
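The layer-by-layer tree reconstruction described above is not reproduced here. Instead, the following Python sketch shows a simpler table-lookup round trip that gives the same result whenever the code table is prefix-free. The code table used is an assumed prefix-free table, not the one from the example, and the names `encode` and `decode` are hypothetical.

```python
def encode(symbols, code_table):
    """Concatenate the code of each symbol into one bit string."""
    return "".join(code_table[s] for s in symbols)


def decode(bits, code_table):
    """Decode a bit string produced by encode(); requires a prefix-free code table."""
    reverse = {code: sym for sym, code in code_table.items()}
    symbols, buffer = [], ""
    for bit in bits:
        buffer += bit
        if buffer in reverse:          # a complete code has been read
            symbols.append(reverse[buffer])
            buffer = ""
    if buffer:
        raise ValueError("trailing bits do not form a complete code")
    return symbols


# Assumed prefix-free table with code lengths 1, 2, 3 and 3 bits.
table = {1: "0", 2: "10", 3: "110", 4: "111"}
bits = encode([1, 3, 2, 4, 1], table)
print(bits, decode(bits, table))
```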
FIG. 2 is a functional module diagram of the data compression apparatus of the present application.
The data compression apparatus 100 of the present application may be installed in an electronic device. According to the functions implemented, the data compression apparatus may include an identification module 101, a calculation and screening module 102, an encoding module 103, and a compression module 104. A module in the present application, which may also be called a unit, is a series of computer program segments that can be executed by the processor of an electronic device to perform a fixed function and that are stored in the memory of the electronic device.
In this embodiment, the functions of the modules/units are as follows:
The identification module 101 is configured to acquire a message data set generated by IoT devices during transmission, identify the character set in the message data set, and use that character set as an initial character set;
The calculation and screening module 102 is configured to calculate the frequency of each character in the initial character set, filter out a redundant data set from the message data set in a preset manner according to the character frequencies, and obtain a standard character set of the redundant data set;
The encoding module 103 is configured to encode the characters in the standard character set to obtain a character code table;
The compression module 104 is configured to calculate the minimum character path length of the redundant data set using the character code table, and compress the redundant data set according to the minimum character path length.
In detail, the specific implementation steps of each module in the data compression apparatus 100 are as follows:
The identification module 101 acquires a message data set generated by IoT devices during transmission, identifies the character set in the message data set, and uses that character set as an initial character set.
In a preferred embodiment of the present application, the IoT devices include barcodes, sensors, scanners, and the like. While in use, the IoT devices generate a large amount of message data; for example, scanning a barcode generates a message data set that includes the barcode's parameter configuration data, the barcode model, and the barcode identification number. Preferably, the present application collects the message data generated by different IoT devices during transmission as the message data set.
In a preferred embodiment of the present application, the character set includes a word character set and a symbol character set.
Preferably, the present application identifies the word characters in the message data set using a shortest-path algorithm and combines them to form the word character set. The shortest-path algorithm includes: constructing a word-segmentation directed acyclic graph from a custom dictionary, where each word corresponds to one directed edge in the graph and is assigned a corresponding edge length (weight); then, among all paths from the start node to the end node of the graph, taking the paths whose lengths, sorted in ascending order (values at any two different ranks are strictly different; the same applies below), rank 1st, 2nd, …, i-th, …, N-th as the coarse segmentation result set. If two or more paths have equal length, they share rank i and are all included in the coarse segmentation result set, without affecting the ranks of the other paths, so the final coarse segmentation result set has a size greater than or equal to N. The word character set is then obtained from the words contained in this path set.
Preferably, the present application identifies the symbol characters in the message data set using a character recognition algorithm and combines them to form the symbol character set. The character recognition algorithm includes: presetting a character database template, matching the character database template against the message data set using character matching, and extracting the symbol characters of the message data set that match successfully, thereby obtaining the symbol character set.
The present application stores the word character set and the symbol character set in a database as the initial character set.
The calculation and screening module 102 calculates the frequency of each character in the initial character set, filters out a redundant data set from the message data set in a preset manner according to the character frequencies, and obtains a standard character set of the redundant data set.
In a preferred embodiment of the present application, the frequency of each character in the initial character set is calculated as follows:
$$ f_i = \frac{n_i}{v} $$
where f_i denotes the frequency with which character i occurs, n_i denotes the number of occurrences of character i in the initial character set, and v denotes the total number of characters in the initial character set.
Preferably, the present application filters out the redundant data in the message data set in a preset manner according to the ordering of the character frequencies, and obtains the standard character set of the redundant data. The preset manner is to compare each character frequency with a preset threshold: if the character frequency is less than the preset threshold, the message data corresponding to that character is treated as redundant data. In the present application, the preset threshold is 0.35.
The encoding module 103 encodes the characters in the standard character set to obtain a character code table.
In this embodiment, encoding the standard character set includes: presetting a serial-number character set for the standard character set, obtaining the one-to-one correspondence between the standard character set and the serial-number character set, calculating the probability of each standard character in the standard character set according to the one-to-one correspondence, forming the code number of each standard character according to its probability, and building the character code table from the code numbers. The correspondence can be expressed as v = {(1, v_1), (1, v_2), …, (1, v_n)}, and the probability of each standard character in the standard character set can be expressed as:
[Formula image PCTCN2020119122-appb-000005]
The following example illustrates the encoding of the standard character set n_0 = {1, 2, 3, 4}: the serial-number character set of n_0 is preset to {01, 02, 03, 04}; the probabilities of the standard characters in n_0 are calculated as {0.4, 0.3, 0.2, 0.1}; the code numbers of the standard characters in n_0 are obtained as {1, 00, 010, 110}; and the character code table of n_0 is therefore w_0 = {(1, 1), (2, 00), (3, 010), (4, 110)}.
The compression module 104 calculates the minimum character path length of the redundant data set using the character code table, and compresses the redundant data set according to the minimum character path length.
In a preferred embodiment of the present application, the minimum character path length of the redundant data set is calculated as follows:
$$ \mathrm{MINWPL} = \sum_{k=1}^{n} w_k \, l_k $$
where MINWPL denotes the minimum character path length, w_k denotes the weight of the k-th character in the character code table, l_k denotes the offset of the k-th character in the character code table, and n denotes the number of characters in the character code table. Preferably, the present application compresses the redundant data set according to the minimum character path length.
Further, after compressing the redundant data set, this embodiment also includes decoding the compressed redundant data set using a pre-created decoding rule. The pre-created decoding rule includes: according to the coding diagram of the serial-number characters above, let the maximum number of layers be k, and build the tree layers in a loop from layer k down to layer 2. If the layer being built is not layer k, two nodes at a time are taken from the non-leaf nodes and leaf nodes of that layer to build a new node; if the number of non-leaf nodes in that layer is odd, the last non-leaf node is combined with the first leaf node of the layer to form a new node. If the layer being built is layer k, new nodes are built directly from the leaf nodes. This forms the decoding rule.
FIG. 3 is a schematic structural diagram of an electronic device that implements the data compression method of the present application.
The electronic device 1 may include a processor 10, a memory 11, and a bus, and may also include a computer program that is stored in the memory 11 and can run on the processor 10, such as a data compression program 12.
The memory 11 includes at least one type of readable storage medium, including flash memory, a mobile hard disk, a multimedia card, card-type memory (for example, SD or DX memory), magnetic memory, a magnetic disk, an optical disc, and the like. In some embodiments, the memory 11 may be an internal storage unit of the electronic device 1, for example a mobile hard disk of the electronic device 1. In other embodiments, the memory 11 may be an external storage device of the electronic device 1, for example a plug-in mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card provided on the electronic device 1. Further, the memory 11 may include both an internal storage unit and an external storage device of the electronic device 1. The memory 11 can be used not only to store application software installed on the electronic device 1 and various kinds of data, such as the code of the data compression program, but also to temporarily store data that has been output or is about to be output.
In some embodiments, the processor 10 may be composed of integrated circuits, for example a single packaged integrated circuit or multiple packaged integrated circuits with the same or different functions, including one or more central processing units (CPUs), microprocessors, digital processing chips, graphics processors, and combinations of various control chips. The processor 10 is the control unit of the electronic device: it connects the components of the entire electronic device through various interfaces and lines, and performs the functions of the electronic device 1 and processes data by running or executing the programs or modules stored in the memory 11 (for example, the data compression program) and by calling the data stored in the memory 11.
The bus may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on. The bus is configured to enable connection and communication between the memory 11 and the at least one processor 10, among other components.
FIG. 3 shows only an electronic device with certain components. Those skilled in the art will understand that the structure shown in FIG. 3 does not limit the electronic device 1, which may include fewer or more components than shown, combine certain components, or arrange the components differently.
For example, although not shown, the electronic device 1 may also include a power supply (such as a battery) that powers the components. Preferably, the power supply is logically connected to the at least one processor 10 through a power management device, so that charging management, discharging management, and power consumption management are implemented through the power management device. The power supply may also include one or more DC or AC power sources, a recharging device, a power failure detection circuit, a power converter or inverter, a power status indicator, and other such components. The electronic device 1 may also include various sensors, a Bluetooth module, a Wi-Fi module, and the like, which are not described further here.
Further, the electronic device 1 may also include a network interface. Optionally, the network interface may include a wired interface and/or a wireless interface (such as a Wi-Fi interface or a Bluetooth interface), and is typically used to establish a communication connection between the electronic device 1 and other electronic devices.
Optionally, the electronic device 1 may also include a user interface. The user interface may be a display or an input unit (such as a keyboard); optionally, the user interface may also be a standard wired or wireless interface. Optionally, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display, which may also be called a display screen or display unit, is used to display the information processed in the electronic device 1 and to display a visual user interface.
It should be understood that the embodiments are for illustration only, and the scope of the patent application is not limited by this structure.
The data compression program 12 stored in the memory 11 of the electronic device 1 is a combination of multiple instructions which, when run on the processor 10, can implement:
acquiring a message data set generated by IoT devices during transmission, identifying the character set in the message data set, and using that character set as an initial character set;
calculating the frequency of each character in the initial character set, filtering out a redundant data set from the message data set in a preset manner according to the character frequencies, identifying the standard character set in the redundant data set, and encoding the characters in the standard character set to obtain a character code table;
calculating the minimum character path length of the redundant data set using the character code table, and compressing the redundant data set according to the minimum character path length.
Specifically, for how the processor 10 implements the above instructions, refer to the description of the relevant steps in the embodiment corresponding to FIG. 1; the details are not repeated here.
进一步地,所述电子设备1集成的模块/单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。所述计算机可读介质可以包括:能够携带所述计算机程序代码的任何实体或装置、记录介质、U盘、移动硬盘、磁碟、光盘、计算机存储器、只读存储器(ROM,Read-OnlyMemory)。所述计算机可读存储介质可以是非易失性,也可以是易失性。Further, if the integrated module/unit of the electronic device 1 is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer readable storage medium. The computer-readable medium may include: any entity or device capable of carrying the computer program code, recording medium, U disk, mobile hard disk, magnetic disk, optical disk, computer memory, read-only memory (ROM, Read-Only Memory). The computer-readable storage medium may be non-volatile or volatile.
In the several embodiments provided in this application, it should be understood that the disclosed equipment, apparatus, and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division into modules is only a division by logical function, and other divisions are possible in an actual implementation.
The modules described as separate components may or may not be physically separate, and the components shown as modules may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional modules in the embodiments of this application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in hardware, or in hardware combined with software functional modules.
It is apparent to those skilled in the art that this application is not limited to the details of the foregoing exemplary embodiments, and that it can be implemented in other specific forms without departing from its spirit or essential characteristics.
Therefore, the embodiments should be regarded in every respect as exemplary and non-limiting. The scope of this application is defined by the appended claims rather than by the above description, and all changes that fall within the meaning and range of equivalents of the claims are therefore intended to be embraced by this application. No reference sign in the claims should be construed as limiting the claim concerned.
In addition, the word "including" does not exclude other units or steps, and the singular does not exclude the plural. Multiple units or devices recited in a system claim may also be implemented by a single unit or device through software or hardware. Words such as "second" are used to denote names and do not indicate any particular order.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of this application, not to limit them. Although this application has been described in detail with reference to preferred embodiments, those of ordinary skill in the art should understand that the technical solutions of this application may be modified or equivalently replaced without departing from their spirit and scope.

Claims (20)

  1. A data compression method, wherein the method comprises:
    acquiring a message data set generated by Internet of Things devices during transmission, identifying a character set in the message data set, and taking the character set as an initial character set;
    calculating the frequency of each character in the initial character set, filtering out a redundant data set from the message data set in a preset manner according to the character frequencies, identifying a standard character set in the redundant data set, and encoding the characters in the standard character set to obtain a character code table;
    calculating a minimum character path length for the redundant data set by using the character code table, and compressing the redundant data set according to the minimum character path length.
  2. The data compression method according to claim 1, wherein identifying the character set in the message data set comprises:
    identifying a word character set in the message data set by using a shortest path algorithm;
    identifying a symbol character set in the message data set by using a character recognition algorithm;
    combining the word character set and the symbol character set to obtain the character set.
  3. The data compression method according to claim 2, wherein identifying the word character set in the message data set by using the shortest path algorithm comprises:
    constructing a word segmentation directed acyclic graph from a custom dictionary, calculating, among all paths from the start point to the end point of the word segmentation directed acyclic graph, a path set of word length values, and obtaining, in a preset manner, the word character set contained in the path set.
  4. The data compression method according to claim 2, wherein identifying the symbol character set in the message data set by using the character recognition algorithm comprises:
    presetting a character database template, matching the character database template against the message data set by using a character matching technique, and extracting the successfully matched symbol characters from the message data set to obtain the symbol character set.
  5. The data compression method according to claim 1, wherein calculating the frequency of each character in the initial character set comprises:
    calculating the frequency of each character in the initial character set by using the following formula (formula image PCTCN2020119122-appb-100001):
    $$f_i = \frac{n_i}{v}$$
    where $f_i$ denotes the frequency of character $i$, $n_i$ denotes the number of occurrences of character $i$ in the initial character set, and $v$ denotes the total number of characters in the initial character set.
  6. The data compression method according to claim 1, wherein encoding the standard character set to obtain the character code table comprises:
    presetting a serial-number character set for the standard character set, obtaining a one-to-one correspondence between the standard character set and the serial-number character set, calculating the probability of each standard character in the standard character set according to the one-to-one correspondence, forming a code number for the corresponding standard character according to the probability of each standard character, and building the character code table from the code numbers.
  7. The data compression method according to any one of claims 1 to 6, wherein calculating the minimum character path length for the redundant data set by using the character code table comprises:
    calculating the minimum character path length for the redundant data set by using the following formula (formula image PCTCN2020119122-appb-100002):
    $$\mathrm{MIN\ WPL} = \sum_{k=1}^{n} w_k \, l_k$$
    where MIN WPL denotes the minimum character path length, $w_k$ denotes the weight of the $k$-th character in the character code table, $l_k$ denotes the offset of the $k$-th character in the character code table, and $n$ denotes the number of characters in the character code table.
  8. An electronic device, wherein the electronic device comprises:
    at least one processor; and
    a memory communicatively connected to the at least one processor; wherein
    the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, enable the at least one processor to perform the following data compression method:
    acquiring a message data set generated by Internet of Things devices during transmission, identifying a character set in the message data set, and taking the character set as an initial character set;
    calculating the frequency of each character in the initial character set, filtering out a redundant data set from the message data set in a preset manner according to the character frequencies, identifying a standard character set in the redundant data set, and encoding the characters in the standard character set to obtain a character code table;
    calculating a minimum character path length for the redundant data set by using the character code table, and compressing the redundant data set according to the minimum character path length.
  9. The electronic device according to claim 8, wherein identifying the character set in the message data set comprises:
    identifying a word character set in the message data set by using a shortest path algorithm;
    identifying a symbol character set in the message data set by using a character recognition algorithm;
    combining the word character set and the symbol character set to obtain the character set.
  10. The electronic device according to claim 9, wherein identifying the word character set in the message data set by using the shortest path algorithm comprises:
    constructing a word segmentation directed acyclic graph from a custom dictionary, calculating, among all paths from the start point to the end point of the word segmentation directed acyclic graph, a path set of word length values, and obtaining, in a preset manner, the word character set contained in the path set.
  11. The electronic device according to claim 9, wherein identifying the symbol character set in the message data set by using the character recognition algorithm comprises:
    presetting a character database template, matching the character database template against the message data set by using a character matching technique, and extracting the successfully matched symbol characters from the message data set to obtain the symbol character set.
  12. The electronic device according to claim 8, wherein calculating the frequency of each character in the initial character set comprises:
    calculating the frequency of each character in the initial character set by using the following formula (formula image PCTCN2020119122-appb-100003):
    $$f_i = \frac{n_i}{v}$$
    where $f_i$ denotes the frequency of character $i$, $n_i$ denotes the number of occurrences of character $i$ in the initial character set, and $v$ denotes the total number of characters in the initial character set.
  13. The electronic device according to any one of claims 8 to 12, wherein calculating the minimum character path length for the redundant data set by using the character code table comprises:
    calculating the minimum character path length for the redundant data set by using the following formula (formula image PCTCN2020119122-appb-100004):
    $$\mathrm{MIN\ WPL} = \sum_{k=1}^{n} w_k \, l_k$$
    where MIN WPL denotes the minimum character path length, $w_k$ denotes the weight of the $k$-th character in the character code table, $l_k$ denotes the offset of the $k$-th character in the character code table, and $n$ denotes the number of characters in the character code table.
  14. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the following data compression method:
    acquiring a message data set generated by Internet of Things devices during transmission, identifying a character set in the message data set, and taking the character set as an initial character set;
    calculating the frequency of each character in the initial character set, filtering out a redundant data set from the message data set in a preset manner according to the character frequencies, identifying a standard character set in the redundant data set, and encoding the characters in the standard character set to obtain a character code table;
    calculating a minimum character path length for the redundant data set by using the character code table, and compressing the redundant data set according to the minimum character path length.
  15. The computer-readable storage medium according to claim 14, wherein identifying the character set in the message data set comprises:
    identifying a word character set in the message data set by using a shortest path algorithm;
    identifying a symbol character set in the message data set by using a character recognition algorithm;
    combining the word character set and the symbol character set to obtain the character set.
  16. The computer-readable storage medium according to claim 15, wherein identifying the word character set in the message data set by using the shortest path algorithm comprises:
    constructing a word segmentation directed acyclic graph from a custom dictionary, calculating, among all paths from the start point to the end point of the word segmentation directed acyclic graph, a path set of word length values, and obtaining, in a preset manner, the word character set contained in the path set.
  17. The computer-readable storage medium according to claim 15, wherein identifying the symbol character set in the message data set by using the character recognition algorithm comprises:
    presetting a character database template, matching the character database template against the message data set by using a character matching technique, and extracting the successfully matched symbol characters from the message data set to obtain the symbol character set.
  18. The computer-readable storage medium according to claim 14, wherein calculating the frequency of each character in the initial character set comprises:
    calculating the frequency of each character in the initial character set by using the following formula (formula image PCTCN2020119122-appb-100005):
    $$f_i = \frac{n_i}{v}$$
    where $f_i$ denotes the frequency of character $i$, $n_i$ denotes the number of occurrences of character $i$ in the initial character set, and $v$ denotes the total number of characters in the initial character set.
  19. The computer-readable storage medium according to any one of claims 14 to 18, wherein calculating the minimum character path length for the redundant data set by using the character code table comprises:
    calculating the minimum character path length for the redundant data set by using the following formula (formula image PCTCN2020119122-appb-100006):
    $$\mathrm{MIN\ WPL} = \sum_{k=1}^{n} w_k \, l_k$$
    where MIN WPL denotes the minimum character path length, $w_k$ denotes the weight of the $k$-th character in the character code table, $l_k$ denotes the offset of the $k$-th character in the character code table, and $n$ denotes the number of characters in the character code table.
  20. A data compression apparatus, wherein the apparatus comprises:
    a recognition module, configured to acquire a message data set generated by Internet of Things devices during transmission, identify a character set in the message data set, and take the character set as an initial character set;
    a calculation and filtering module, configured to calculate the frequency of each character in the initial character set, filter out a redundant data set from the message data set in a preset manner according to the character frequencies, and identify a standard character set in the redundant data set;
    an encoding module, configured to encode the characters in the standard character set to obtain a character code table;
    a compression module, configured to calculate a minimum character path length for the redundant data set by using the character code table, and to compress the redundant data set according to the minimum character path length.
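As an illustration only, the frequency, code-table, and minimum-path-length steps recited in the claims can be sketched in Python. The claims do not name a specific coding algorithm; this sketch assumes Huffman coding, whose tree minimizes the weighted path length, as a stand-in. All function names (`char_frequencies`, `build_code_table`, `weighted_path_length`, `compress`, `decompress`) are hypothetical, and the "preset manner" of filtering the redundant data set is omitted because the claims do not specify it.

```python
import heapq
from collections import Counter
from itertools import count


def char_frequencies(text):
    """Relative frequency of each character: occurrences divided by total length."""
    counts = Counter(text)
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()}


def build_code_table(weights):
    """Build a prefix code table with Huffman's algorithm (assumed, not from the claims)."""
    if not weights:
        return {}
    if len(weights) == 1:
        # A single symbol still needs a one-bit code.
        return {next(iter(weights)): "0"}
    tie = count()  # tie-breaker so heapq never has to compare dicts
    heap = [(w, next(tie), {ch: ""}) for ch, w in weights.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        # Merging two subtrees pushes every contained symbol one level deeper.
        merged = {ch: "0" + code for ch, code in left.items()}
        merged.update({ch: "1" + code for ch, code in right.items()})
        heapq.heappush(heap, (w1 + w2, next(tie), merged))
    return heap[0][2]


def weighted_path_length(weights, table):
    """Sum over characters of weight times code length (tree depth)."""
    return sum(weights[ch] * len(table[ch]) for ch in table)


def compress(text, table):
    """Encode text as a string of '0'/'1' characters using the code table."""
    return "".join(table[ch] for ch in text)


def decompress(bits, table):
    """Decode a bit string produced by compress(); relies on the prefix property."""
    inverse = {code: ch for ch, code in table.items()}
    out, buf = [], ""
    for bit in bits:
        buf += bit
        if buf in inverse:
            out.append(inverse[buf])
            buf = ""
    return "".join(out)
```

A typical call sequence would be `freqs = char_frequencies(message)`, `table = build_code_table(freqs)`, then `compress(message, table)`. With frequencies as weights, `weighted_path_length(freqs, table)` is the average number of bits per character in the encoded output.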
PCT/CN2020/119122 2020-03-06 2020-09-29 Data compression method and apparatus, and computer-readable storage medium WO2021174839A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010155298.9A CN111431537A (en) 2020-03-06 2020-03-06 Data compression method and device and computer readable storage medium
CN202010155298.9 2020-03-06

Publications (1)

Publication Number Publication Date
WO2021174839A1 true WO2021174839A1 (en) 2021-09-10

Family

ID=71547445

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/119122 WO2021174839A1 (en) 2020-03-06 2020-09-29 Data compression method and apparatus, and computer-readable storage medium

Country Status (2)

Country Link
CN (1) CN111431537A (en)
WO (1) WO2021174839A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111431537A (en) * 2020-03-06 2020-07-17 平安科技(深圳)有限公司 Data compression method and device and computer readable storage medium
CN112506879A (en) * 2020-12-18 2021-03-16 深圳智慧林网络科技有限公司 Data processing method and related equipment
CN113220651B (en) * 2021-04-25 2024-02-09 暨南大学 Method, device, terminal equipment and storage medium for compressing operation data

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101783788A (en) * 2009-01-21 2010-07-21 联想(北京)有限公司 File compression method, file compression device, file decompression method, file decompression device, compressed file searching method and compressed file searching device
EP2680445A2 (en) * 2012-06-28 2014-01-01 Fujitsu Limited Code processing technique
CN104283567A (en) * 2013-07-02 2015-01-14 北京四维图新科技股份有限公司 Method for compressing or decompressing name data, and equipment thereof
CN109361686A (en) * 2018-11-16 2019-02-19 重庆邮电大学 A kind of compression method reducing sensing data time redundancy
CN111431537A (en) * 2020-03-06 2020-07-17 平安科技(深圳)有限公司 Data compression method and device and computer readable storage medium

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114399766A (en) * 2022-01-18 2022-04-26 平安科技(深圳)有限公司 Optical character recognition model training method, device, equipment and medium
CN114399766B (en) * 2022-01-18 2024-05-10 平安科技(深圳)有限公司 Optical character recognition model training method, device, equipment and medium
WO2024021491A1 (en) * 2022-07-29 2024-02-01 天翼云科技有限公司 Data slicing method, apparatus and system
CN116663069A (en) * 2023-08-01 2023-08-29 国家基础地理信息中心 Database security encryption method and system based on data coding
CN116663069B (en) * 2023-08-01 2023-10-03 国家基础地理信息中心 Database security encryption method and system based on data coding
CN117176177B (en) * 2023-11-03 2024-02-06 金乡县林业保护和发展服务中心(金乡县湿地保护中心、金乡县野生动植物保护中心、金乡县国有白洼林场) Data sharing method and system for forestry information

Also Published As

Publication number Publication date
CN111431537A (en) 2020-07-17

Similar Documents

Publication Publication Date Title
WO2021174839A1 (en) Data compression method and apparatus, and computer-readable storage medium
WO2022134759A1 (en) Keyword generation method and apparatus, and electronic device and computer storage medium
WO2021189826A1 (en) Message generation method and apparatus, electronic device, and computer-readable storage medium
WO2022121171A1 (en) Similar text matching method and apparatus, and electronic device and computer storage medium
WO2022160449A1 (en) Text classification method and apparatus, electronic device, and storage medium
JP4912399B2 (en) Method for compressing language models using GOLOMB codes
CN104283567A (en) Method for compressing or decompressing name data, and equipment thereof
US9916314B2 (en) File extraction method, computer product, file extracting apparatus, and file extracting system
WO2022105179A1 (en) Biological feature image recognition method and apparatus, and electronic device and readable storage medium
WO2022222943A1 (en) Department recommendation method and apparatus, electronic device and storage medium
WO2021189911A1 (en) Target object position detection method and apparatus based on video stream, and device and medium
CN113157927B (en) Text classification method, apparatus, electronic device and readable storage medium
CN114979120B (en) Data uploading method, device, equipment and storage medium
US20060167902A1 (en) System and method for storing a document in a serial binary format
WO2021189897A1 (en) Road matching method and apparatus, and electronic device and readable storage medium
CN111651585A (en) Information verification method and device, electronic equipment and storage medium
CN112231417A (en) Data classification method and device, electronic equipment and storage medium
WO2021184641A1 (en) Intelligent sleep staging method and apparatus, electronic device, and computer readable storage medium
CN114138784A (en) Information tracing method and device based on storage library, electronic equipment and medium
WO2022142106A1 (en) Text analysis method and apparatus, electronic device, and readable storage medium
CN113360768A (en) Product recommendation method, device and equipment based on user portrait and storage medium
CN113627160B (en) Text error correction method and device, electronic equipment and storage medium
CN115409041B (en) Unstructured data extraction method, device, equipment and storage medium
CN113205814A (en) Voice data labeling method and device, electronic equipment and storage medium
CN116844711A (en) Disease auxiliary identification method and device based on deep learning

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 20922959; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 20922959; Country of ref document: EP; Kind code of ref document: A1)