CN110442489B - Method of data processing and storage medium - Google Patents


Info

Publication number
CN110442489B
CN110442489B (application CN201810410873.8A)
Authority
CN
China
Prior art keywords
log
preset
compressed
heat
text block
Prior art date
Legal status
Active
Application number
CN201810410873.8A
Other languages
Chinese (zh)
Other versions
CN110442489A (en)
Inventor
朱成生
俞飞江
Current Assignee
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd filed Critical Alibaba Group Holding Ltd
Priority to CN201810410873.8A
Publication of CN110442489A
Application granted
Publication of CN110442489B

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/30Monitoring
    • G06F11/3065Monitoring arrangements determined by the means or processing involved in reporting the monitored data
    • G06F11/3072Monitoring arrangements determined by the means or processing involved in reporting the monitored data where the reporting involves data filtering, e.g. pattern matching, time or event triggered, adaptive or policy-based reporting
    • G06F11/3082Monitoring arrangements determined by the means or processing involved in reporting the monitored data where the reporting involves data filtering, e.g. pattern matching, time or event triggered, adaptive or policy-based reporting the data filtering being achieved by aggregating or compressing the monitored data

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a data processing method and a storage medium. The method comprises the following steps: acquiring a high-heat (frequently occurring) text block from a data file to be compressed; and storing the high-heat text block in place of the data file to be compressed. The invention solves the technical problem that, with common compression techniques, compressed data still requires a large amount of storage space.

Description

Method of data processing and storage medium
Technical Field
The present application relates to the field of internet technology application, and in particular, to a data processing method and a storage medium.
Background
As the internet industry expands, more and more industries are connected to the internet, generating a large amount of data. At the enterprise level in particular, the creation, execution and archiving of daily business all produce large volumes of data. When the database that calls this data and the storage space that holds it generate logs, SQL statements serve as call instructions or management logs; these SQL statements occupy many bytes and require substantial storage space, a problem that increasingly plagues the operation and maintenance staff responsible for enterprise data.
Existing solutions adopt the same cold-data storage mode: the data to be stored is compressed with a common compression technique, reducing its storage-space requirement, and the compressed data is then stored. The problem with this prior art is that, given the large volume of data to be stored, the compressed data still requires a great deal of storage space even after compression, which places heavy pressure on the limited physical storage space.
For the problem that compressed data still requires a large amount of storage space when common compression techniques are used, no effective solution has yet been proposed.
Disclosure of Invention
The embodiments of the present application provide a data processing method and a storage medium, so as at least to solve the technical problem that compressed data still requires a large amount of storage space when a common compression technique is used.
According to one aspect of the embodiments of the present application, there is provided a method of data processing, including: acquiring a high-heat text block from a data file to be compressed; and storing the high-heat text block in place of the data file to be compressed.
Optionally, the high-heat text block is a text block whose heat is greater than the heat of a preset index, where the heat of the preset index is the average number of references of the indexes in the same group.
Optionally, obtaining the high-heat text block from the data file to be compressed includes: performing data analysis on the data file to be compressed, and calculating the text blocks of a preset heat rank in the data file to be compressed through a preset algorithm; and determining the text blocks of the preset heat rank as high-heat text blocks.
Further, optionally, calculating the text blocks of a preset heat rank in the data file to be compressed through a preset algorithm includes: in the case that the data file to be compressed is a log data table, performing word segmentation on the log data table according to preset word segmentation conditions to obtain segmented logs; vectorizing the segmented logs to convert the logs into a high-dimensional vector space; clustering at least one high-dimensional vector space through a preset clustering algorithm to obtain a log similarity class set; generating a dictionary library from the log similarity class set, and generating a digital log from the dictionary library and the log similarity class set; calculating convolution blocks of different spans through preset spans, and determining the high-compression-rate convolution blocks of a preset rank according to the product of each preset span and its number of occurrences in the digital log; and restoring the data file to be compressed according to the dictionary-library formatting codes to obtain the high-heat text blocks.
Optionally, clustering at least one high-dimensional vector space through a preset clustering algorithm to obtain the log similarity class set includes: in the case that the preset clustering algorithm is the K-means clustering algorithm, clustering at least one high-dimensional vector space through the K-means clustering algorithm to obtain the log similarity class set.
Optionally, generating the dictionary library from the log similarity class set and generating the digital log from the dictionary library and the log similarity class set includes: performing word-frequency statistics on each word in the log similarity class set to obtain the dictionary library; and mapping the log similarity class set through the dictionary library to obtain the digital log, where the digital log is used for convolution summation, and the convolution summation is used to determine the spans of similar text blocks.
Optionally, calculating convolution blocks of different spans through preset spans and determining the high-compression-rate convolution blocks of a preset rank according to the product of each preset span and its number of occurrences in the digital log includes: calculating the convolution sums of different spans according to the preset spans; obtaining the high-compression-rate spans of a preset rank according to the products of the convolution sums corresponding to the different spans, the spans themselves, and their numbers of occurrences in the digital log; and calculating convolution blocks of the different spans according to the high-compression-rate spans of the preset rank, and determining the high-compression-rate convolution blocks of the preset rank according to the product of each high-compression-rate span of the preset rank and its number of occurrences in the digital log.
Optionally, storing the high-heat text block in place of the data file to be compressed includes: encoding the high-heat text block according to a preset model to obtain an encoded high-heat text block; and storing the encoded high-heat text block in place of the data file to be compressed.
According to another aspect of the embodiments of the present application, there is further provided a storage medium, including a stored program, where the program, when executed, controls the device in which the storage medium is located to perform: acquiring a high-heat text block from a data file to be compressed; and storing the high-heat text block in place of the data file to be compressed.
According to still another aspect of the embodiments of the present application, there is further provided a processor configured to execute a program, where the program, when running, performs: acquiring a high-heat text block from a data file to be compressed; and storing the high-heat text block in place of the data file to be compressed.
According to still another aspect of the embodiments of the present application, there is also provided a method of data processing, including: acquiring a target data object, where the target data object is stored at a target data address; acquiring, from the target data object, a text block whose heat is greater than a preset threshold, where the preset threshold comprises a reference count or reference frequency; and storing the text block at the target data address.
In the embodiments of the present application, a high-heat text block is obtained from a data file to be compressed, and the high-heat text block is stored in place of the data file to be compressed. This achieves the aim of encoding and compressing the high-heat text blocks in each distinct log, realizes the technical effect of reducing storage space, and solves the technical problem that compressed data still requires a large amount of storage space when a common compression technique is used.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute an undue limitation to the application. In the drawings:
FIG. 1 is a block diagram of the hardware architecture of a computer terminal of a method of data processing according to an embodiment of the present application;
FIG. 2 is a flow chart of a method of data processing according to an embodiment of the present application;
FIG. 3 is a flow chart of a method of data processing according to a first embodiment of the present application;
FIG. 4 is a flow chart of computing high-heat text blocks in a method of data processing according to an embodiment of the present application;
fig. 5 is a flowchart of a method of data processing according to a second embodiment of the present application.
Detailed Description
In order to enable those skilled in the art to better understand the solution of the present application, the technical solutions in the embodiments of the present application will be described below clearly and completely with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments herein without inventive effort shall fall within the scope of protection of the present application.
It should be noted that the terms "first," "second," and the like in the description and claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that embodiments of the present application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The technical terms referred to in this application are:
data compression: the method is a technical method for reducing the data volume to reduce the storage space and improve the transmission, storage and processing efficiency of the data or reorganizing the data according to a certain algorithm on the premise of not losing useful information and reducing the redundancy and storage space of the data.
Word segmentation: the text is split into single or multiple words.
Fast convolution: from each starting point and span, a convolution sum is calculated.
Aggregation: common aggregation methods include counting, distinct counting, summation, maximum, minimum, etc.
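The "fast convolution" defined above can be sketched in a few lines of Python (an illustrative sketch, not the patent's implementation; the function name and sample values are assumptions):

```python
def convolution_sums(digital_log, span):
    """For each starting point, sum `span` consecutive codes of the
    numeric log: the "fast convolution" described in the terms above."""
    return [sum(digital_log[i:i + span])
            for i in range(len(digital_log) - span + 1)]

# A repeated 3-code block (7, 1, 4) yields equal sums at its occurrences.
log = [7, 1, 4, 9, 7, 1, 4]
sums = convolution_sums(log, 3)  # sums[0] == sums[4]
```

Equal sums flag *candidate* repetitions; the description later notes they must still be verified by exact match.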
Example 1
In accordance with the embodiments of the present application, there is also provided a method embodiment of data processing. It should be noted that the steps shown in the flowcharts of the figures may be performed in a computer system containing, for example, a set of computer-executable instructions, and that, although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in an order different from that shown or described herein.
The method embodiment provided in the first embodiment of the present application may be executed in a mobile terminal, a computer terminal or a similar computing device. Taking a computer terminal as an example, fig. 1 is a block diagram of a hardware structure of a computer terminal of a data processing method according to an embodiment of the present application. As shown in fig. 1, the computer terminal 10 may include one or more (only one is shown in the figure) processors 102 (the processors 102 may include, but are not limited to, a microprocessor MCU or a processing device such as a programmable logic device FPGA), a memory 104 for storing data, and a transmission module 106 for communication functions. It will be appreciated by those of ordinary skill in the art that the configuration shown in fig. 1 is merely illustrative and is not intended to limit the configuration of the electronic device described above. For example, the computer terminal 10 may also include more or fewer components than shown in FIG. 1, or have a different configuration than shown in FIG. 1.
The memory 104 may be used to store software programs and modules of application software, such as program instructions/modules corresponding to the data processing methods in the embodiments of the present application, and the processor 102 executes the software programs and modules stored in the memory 104, thereby performing various functional applications and data processing, that is, implementing the data processing methods of the application programs. Memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, which may be connected to the computer terminal 10 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is arranged to receive or transmit data via a network. Specific examples of the above network may include a wireless network provided by the communication provider of the computer terminal 10. In one example, the transmission device 106 includes a network adapter (Network Interface Controller, NIC) that can connect to other network devices through a base station so as to communicate with the internet. In another example, the transmission device 106 may be a Radio Frequency (RF) module for communicating with the internet wirelessly.
In the above-described operating environment, the present application provides a method of data processing as shown in fig. 2. Fig. 2 is a flowchart of a method of data processing according to a first embodiment of the present application.
Step S202, obtaining a high-heat text block from a data file to be compressed;
in the above step S202 of the present application, the method for processing data provided in the present application selects a log data table with a large storage amount from the data files to be compressed, performs data analysis on the format of the log data table, and then calculates a text block with a preset ranking in the data files to be compressed through a calculation model, where the text block with the preset ranking may be a text block of TOP N in the data files to be compressed, and N is an integer, for example, 1,2,3,4,5,6,7,8,9,10, … …, N. And the text blocks with high heat provided by the application, namely the text blocks of TOP N in the data file to be compressed.
In the process of acquiring the high-heat text block, the method can search through data analysis, find out partial rules of log content, and find out the high-heat text block through a constructed algorithm model.
Here, in the process of calculating the heat of a text block, the heat of an index is obtained from the reference count of the text block, where the present application defines the heat of an index as the average reference count of the indexes in the same group: assuming that a sample file contains n rows of log text strings and that text block i_m is referenced r_m times, the heat of text block i_m in the sample is r_m / n.
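This heat definition can be sketched as follows (assuming heat is simply the block's reference count averaged over the n sample rows, which is one reading of the "average reference count" definition; the names and numbers are illustrative):

```python
def block_heat(reference_count, n_rows):
    """Heat of a text block: its reference count averaged over the
    n rows of the sample file (assumed reading of the definition)."""
    return reference_count / n_rows

# A block referenced in 80 of 100 log rows is "hotter" than one referenced in 5.
assert block_heat(80, 100) > block_heat(5, 100)
```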
The data processing method provided by the present application differs from the prior-art storage based on common compression techniques and overcomes their inability to encode and compress the high-heat text blocks found across different logs: the high-heat text block is acquired from the data file to be compressed in step S202, and the data file to be compressed is then purposefully compressed and stored, the storage step being shown in step S204.
Step S204, storing the high-heat text block in place of the data file to be compressed.
In the above step S204, based on the high-heat text block obtained in step S202 and calculated through the data model, the block is re-encoded, and the re-encoded high-heat text block is stored in place of the data file to be compressed from step S202.
Specifically, as shown in fig. 3, fig. 3 is a flowchart of a method of data processing according to the first embodiment of the present application. The data processing method provided by the present application is applicable to database logs: the SQL statements in the logs occupy a large number of bytes, yet they have very high similarity and contain many very hot text blocks, and these text blocks can be re-encoded to reduce the storage space of the data. The method provided by the present application therefore compresses and stores the high-heat text blocks by re-encoding, achieving the technical effect of reducing the storage-space requirement.
In the embodiments of the present application, a high-heat text block is obtained from a data file to be compressed, and the high-heat text block is stored in place of the data file to be compressed. This achieves the aim of encoding and compressing the high-heat text blocks in each distinct log, realizes the technical effect of reducing storage space, and solves the technical problem that compressed data still requires a large amount of storage space when a common compression technique is used.
Optionally, the high-heat text block is a text block with heat greater than heat of a preset index, wherein the heat of the preset index is average reference times of the indexes in the same group.
Referring to fig. 4, fig. 4 is a flowchart of calculating a high-heat text block in the data processing method according to the first embodiment of the present application. The high-heat text block is calculated as follows:
optionally, the obtaining the high-heat text block from the data file to be compressed in step S202 includes:
step S2021, data analysis is carried out on the data file to be compressed, and text blocks with preset hotness ranking in the data file to be compressed are calculated through a preset algorithm;
step S2022, the text block of the preset hotrank is determined as a high-hottext block.
Specifically, in the data processing method provided by the present application, combining step S2021 and step S2022, a log data table with a large storage volume is selected from the data files to be compressed according to the service data, data analysis is performed on the format of the log data table, and the TOP N text blocks in the log file are then calculated through a preset calculation model, thereby obtaining the high-heat text blocks provided by the present application.
Further optionally, calculating the text blocks of the preset heat rank in the data file to be compressed through a preset algorithm in step S2021 includes:
step S20211, in the case that the data file to be compressed is a log data table, performing word segmentation according to a preset word segmentation condition from the log data table to obtain a segmented log;
in the above step S20211 of the present application, in the process of performing word segmentation according to the preset word segmentation condition in the log data table, the word segmentation method may include the following two methods:
taking TXX_CHN and INTERNET_CHN as examples, a sentence is converted into a word segmented by space, and the two word segmentation modes are similar, wherein the word segmentation modes are embedded with word segmentation vocabulary related to panning, and meanwhile, the word segmentation modes can also be according to defined word segmentation standards, so that the method is more flexible.
Step S20212, vectorizing the segmented logs, and converting the logs into a high-dimensional vector space;
in step S20212, the log is converted into a high-dimensional vector space by vectorization based on the segmented log obtained in step S20211.
Word vectorization is mainly divided into two types:
the training input of the CBOW model is the word vectors of the context words surrounding a particular word, and the output is the word vector of that particular word;
the idea of the Skip-Gram model is the reverse of CBOW: the input is the word vector of a particular word, and the output is the word vectors of that word's context.
The present application applies a DOC2VEC (sentence vector) model on top of word vectors. Two methods exist for this model: Distributed Memory (DM) and Distributed Bag of Words (DBOW). DM attempts to predict the probability of a word given context and paragraph vectors; the paragraph ID remains unchanged while training on a sentence or document, sharing the same paragraph vector. DBOW predicts the probability of a set of random words in a paragraph given only the paragraph vector.
For example: the input "this is a sentence" (a Chinese sentence in the original), after word segmentation: "this is", "one", "sentence";
sentence vectorization is then performed, e.g. with DM and a 100-dimensional output:
doc_id ver1 ver2 … ver100
1 0.1 0.2 … 0.5
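As a self-contained illustration of vectorization (a toy bag-of-words stand-in, not the Doc2Vec DM/DBOW models described above, which learn dense 100-dimensional vectors from training data; the vocabulary and inputs are invented):

```python
def vectorize(words, vocabulary):
    """Toy sentence vectorization: the count of each vocabulary word.
    A stand-in for the DOC2VEC (DM/DBOW) models the text describes."""
    return [words.count(v) for v in vocabulary]

vocab = ["SELECT", "FROM", "WHERE", "UPDATE"]
vec = vectorize(["SELECT", "FROM", "WHERE"], vocab)  # one dimension per vocabulary word
```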
Step S20213, clustering at least one high-dimensional vector space by a preset clustering algorithm to obtain a log similarity set;
the method for obtaining the log similarity set comprises the following steps of:
step S202131, clustering at least one high-dimensional vector space by a K-means clustering algorithm to obtain a log similarity set under the condition that the preset clustering algorithm is the K-means clustering algorithm.
Based on the high-dimensional vector space obtained in step S20212, at least one high-dimensional vector space is clustered through the preset clustering algorithm to obtain the log similarity class set. Existing clustering algorithms include the following three:
K-Means: one-dimensional grouping is calculated by using a distance concept;
kohonen: using a model of nerve-like self-organization to perform two-dimensional grouping;
2-Step: the most suitable grouping number can be automatically found out;
although the 2-Step training is quick, the K-Means has the advantages that the number of clusters can be specified, the required cluster amount of different log amounts is different, the uncontrollability of automatic N clusters is eliminated, and the method is more flexible, so the method is illustrated by taking the K-Means algorithm as a preferred example, and the method for realizing the data processing provided by the method is particularly not limited.
Step S20214, generating a dictionary library from the log similarity class set, and generating a digital log from the dictionary library and the log similarity class set;
Optionally, generating the dictionary library from the log similarity class set and generating the digital log from the dictionary library and the log similarity class set includes the following steps:
Step S202141, performing word-frequency statistics on each word in the log similarity class set to obtain the dictionary library;
Step S202142, mapping the log similarity class set through the dictionary library to obtain the digital log, where the digital log is used for convolution summation, and the convolution summation is used to determine the spans of similar text blocks.
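Steps S202141 and S202142 can be sketched as follows (the code assignment, where the most frequent word gets the smallest code, is an assumption; the patent does not fix a particular mapping):

```python
from collections import Counter

def build_dictionary(segmented_logs):
    """Word-frequency statistics over a similarity class; more frequent
    words get smaller codes (one plausible scheme)."""
    freq = Counter(w for log in segmented_logs for w in log)
    return {w: i for i, (w, _) in enumerate(freq.most_common())}

def to_digital_log(segmented_log, dictionary):
    """Map a segmented log to its numeric form for convolution summation."""
    return [dictionary[w] for w in segmented_log]

logs = [["SELECT", "FROM", "t1"], ["SELECT", "FROM", "t2"]]
dictionary = build_dictionary(logs)
digital = to_digital_log(logs[0], dictionary)
```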
In the above step S20214, word-frequency statistics are performed on the similar classes to form a dictionary library, which is used to map the logs into digital logs on which convolution summation can be performed;
the effect of the convolution summation is to quickly determine the spans of similar text blocks;
for example:
import tensorflow as tf  # TensorFlow 1.x-style API

W_conv1 = tf.ones([j, 1, 1, 1])  # all-ones kernel: sums a window of j codes (j is the span)
conv = tf.nn.conv2d(x_image, W_conv1, strides=[1, 1, 1, 1], padding='VALID')  # x_image: the digital log reshaped to a 4-D tensor

Because only TensorFlow functions are used, concurrent GPU computation is supported, which is more efficient.
Step S20215, calculating convolution blocks of different spans through preset spans, and determining the high-compression-rate convolution blocks of a preset rank according to the product of each preset span and its number of occurrences in the digital log;
Optionally, calculating the convolution blocks of different spans through the preset spans and determining the high-compression-rate convolution blocks of the preset rank according to the product of each preset span and its number of occurrences in the digital log includes the following steps:
Step S202151, calculating the convolution sums of different spans according to the preset spans;
Step S202152, obtaining the high-compression-rate spans of a preset rank according to the products of the convolution sums corresponding to the different spans, the spans themselves, and their numbers of occurrences in the digital log;
Step S202153, calculating convolution blocks of the different spans according to the high-compression-rate spans of the preset rank, and determining the high-compression-rate convolution blocks of the preset rank according to the product of each high-compression-rate span of the preset rank and its number of occurrences in the digital log.
It should be noted that text blocks with the same convolution sum are not necessarily identical: their elements may merely be a permutation of one another, giving the same sum without being the same text block. Therefore only candidate spans can be obtained here; the log blocks must then be taken according to those spans and matched exactly.
Specifically, taking spans from 2 to n as an example, the convolution sums of the different spans are rapidly calculated, and the TOP N high-compression-rate spans are selected according to the product of each span's convolution sum, the span, and its number of occurrences in the log file. Convolution blocks of the different spans are then recalculated according to the determined TOP N high-compression-rate spans, and the TOP N high-compression-rate convolution blocks are determined according to the product of each TOP N high-compression-rate span and its number of occurrences in the log file.
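The two-stage procedure above (candidate spans via repeated convolution sums, then exact matching) can be sketched as follows (the scoring rule, span times occurrence count, and all names are illustrative assumptions):

```python
def best_span(digital_log, spans):
    """Pick the span maximizing span * occurrences of its most repeated
    block. Candidate blocks are found via repeated convolution sums and
    then verified by exact match, since equal sums alone are not
    conclusive (the elements may merely be permuted)."""
    best = (0, 0)  # (score, span)
    for span in spans:
        windows = [digital_log[i:i + span]
                   for i in range(len(digital_log) - span + 1)]
        sums = [sum(w) for w in windows]
        for w, s in zip(windows, sums):
            if sums.count(s) > 1:          # candidate: repeated sum
                count = windows.count(w)   # exact verification
                best = max(best, (span * count, span))
    return best

# block [7, 1, 4] (span 3) appears twice: score 3 * 2 = 6
score, span = best_span([7, 1, 4, 9, 7, 1, 4], spans=[2, 3])
```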
Step S20216, restoring the data file to be compressed according to the dictionary-library formatting codes to obtain the high-heat text blocks.
Optionally, storing the high-heat text block in place of the data file to be compressed in step S204 includes:
Step S2041, encoding the high-heat text block according to a preset model to obtain an encoded high-heat text block; and storing the encoded high-heat text block in place of the data file to be compressed.
The preset model comprises a multi-algorithm pipeline of sentence vectorization, clustering, deep-learning convolution and the like.
In summary, as shown in fig. 4, in the method for data processing provided in the present application, a preferred example of calculating a high-heat text block is specifically as follows:
(1) Extracting logs and segmenting (step 1 in fig. 4);
performing log file standardization (replacing TABs and line breaks with spaces) and segmenting according to spaces;
(2) Vectorizing the log after the log word segmentation (step 2 in fig. 4);
converting the log into a high-dimensional vector space through vectorization;
(3) Clustering (step 3-4 in FIG. 4);
after the logs are converted into a high-dimensional vector space, similar logs are gathered together through ordinary K-means clustering;
(4) Formatting for similar classes (steps 5-8 in fig. 4);
performing word-frequency statistics on the similar classes to form a dictionary library, which maps the logs into digital logs on which convolution summation can be performed;
(5) Fast convolution and span selection (steps 9-12 in fig. 4);
according to spans of 2 to n, quickly calculating the convolution sums of the different spans;
selecting the TOP N high-compression-rate spans according to the product of each span's convolution sum, the span, and its number of occurrences in the log file;
(6) Convolutional word segmentation and compression rate evaluation (steps 13-15 in fig. 4);
recalculating the convolution blocks for the previously determined spans, and determining the TOP N high-compression-rate convolution blocks according to the product of each span and its number of occurrences in the log file;
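The convolution-block selection of step (6) could then be sketched as follows, again scoring each candidate block by the product of its span and its occurrence count; the data values are illustrative:

```python
from collections import Counter

def top_n_blocks(digital_log, spans, n):
    """Re-enumerate the windows for the selected spans only and keep
    the blocks whose span * occurrence product is largest (ties broken
    deterministically by block value)."""
    counts = Counter()
    for s in spans:
        for i in range(len(digital_log) - s + 1):
            counts[tuple(digital_log[i:i + s])] += 1
    ranked = sorted(counts, key=lambda b: (-len(b) * counts[b], b))
    return ranked[:n]

log = [1, 2, 3, 1, 2, 3, 1, 2, 3, 4]
print(top_n_blocks(log, [3], 1))
```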
(7) Formatting the code (step 16 in fig. 4);
and restoring the actual log text block content according to the dictionary library formatting codes to obtain the high-heat text block.
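The formatting-code restoration of step (7) amounts to inverting the dictionary library; a minimal sketch with an illustrative dictionary:

```python
def decode_block(block, dictionary):
    """Invert the word-to-code dictionary to turn a high-compression-rate
    numeric block back into the original log words, i.e. the high-heat
    text block."""
    inverse = {code: word for word, code in dictionary.items()}
    return " ".join(inverse[code] for code in block)

# Hypothetical dictionary for illustration only.
dictionary = {"GET": 0, "/index": 1, "200": 2}
print(decode_block((0, 1, 2), dictionary))
```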
According to the data processing method, the log big data is subjected to word segmentation, the high-heat text blocks are found, encoded, and compressed, and the encoded and compressed high-heat text blocks are stored in place of the original data files, which reduces the demand on storage space, improves the utilization rate of the storage space, and eases the pressure of later maintenance.
It should be noted that, for simplicity of description, the foregoing method embodiments are all expressed as a series of action combinations, but it should be understood by those skilled in the art that the present application is not limited by the order of actions described, as some steps may be performed in other order or simultaneously in accordance with the present application. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily required in the present application.
From the description of the above embodiments, it will be clear to a person skilled in the art that the method of data processing according to the above embodiments may be implemented by software plus a necessary general-purpose hardware platform, and may of course also be implemented by hardware, although in many cases the former is the preferred implementation. Based on such understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium (such as a ROM/RAM, a magnetic disk, or an optical disk), comprising several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to perform the methods described in the embodiments of the present application.
Example 2
The present application provides a method of data processing as shown in fig. 5. Fig. 5 is a flowchart of a method of data processing according to a second embodiment of the present application.
Step S502, a target data object is obtained, wherein the target data object is stored in a target data address;
in the above step S502 of the present application, the method for data processing provided in the present application acquires the target data object stored at the target data address, where the target data object may include a data file to be compressed. The foregoing examples are illustrative only, and the method for data processing provided in the present application is not particularly limited thereto.
Step S504, obtaining a text block with the heat degree larger than a preset threshold value from the target data object, wherein the preset threshold value comprises the reference times or the reference frequency;
in the above step S504 of the present application, based on the target data object obtained in step S502, the data processing method provided in the present application selects a log data table with a large storage footprint from the target data object, performs data analysis on the format of the log data table, and calculates the text blocks with a preset ranking in the target data object through a calculation method model. The text blocks with a preset ranking may be the TOP N text blocks in the target data object, where N is an integer, for example, 1, 2, 3, ..., N. The text block provided by the present application, that is, a TOP N text block in the target data object (for example, a TOP 3 text block), may be a text block whose heat is greater than a preset threshold.
In the process of acquiring text blocks whose heat is greater than the preset threshold, partial patterns of the log content are discovered through data analysis and exploration, and the text blocks whose heat is greater than the preset threshold are found through a constructed algorithm model.
Here, in the process of calculating the heat of a text block, the heat of an index is obtained from the number of times the text block is referenced, where the present application defines the heat of an index as the average number of references of the indexes in the same group: assuming that there are n rows of log text strings in a sample file and that a text block i_m is referenced r_m times, the heat of text block i_m in the sample is r_m / n.
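Taking the heat of a text block to be its reference count divided by the number of log rows in the sample, one reading of the average-reference definition above, the computation can be sketched as follows; the block names and counts are illustrative:

```python
def block_heat(reference_counts, n_rows):
    """Heat of each text block: its reference count r_m divided by the
    n rows of log text strings in the sample file."""
    return {block: r / n_rows for block, r in reference_counts.items()}

# Hypothetical reference counts over a 100-row sample.
heats = block_heat({"GET /index 200": 30, "disk quota exceeded": 5}, 100)
print(heats)
```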
Step S506, storing the text block in the target data address.
The data processing method provided by the present application differs from storage based on common compression techniques in the prior art, which cannot encode and compress the high-heat text blocks within each different log: the text blocks whose heat is greater than the preset threshold are acquired from the target data object and compressed purposefully, and the original target data object is then replaced by those text blocks for storage, thereby saving storage space.
In the embodiments of the present application, the target data object is acquired, where the target data object is stored at the target data address; a text block whose heat is greater than a preset threshold is acquired from the target data object, where the preset threshold includes a reference count or a reference frequency; and the text block is stored at the target data address. This achieves the aim of encoding and compressing the high-heat text blocks within each different log, realizes the technical effect of reducing the storage space, and solves the technical problem that data compressed with common compression techniques still places a great demand on storage space.
Example 3
According to another aspect of the embodiments of the present application, there is further provided a storage medium, including a stored program, where the program, when executed, controls a device in which the storage medium is located to perform: acquiring a high-heat text block from a data file to be compressed; and replacing the data file to be compressed with the high-heat text block for storage.
Example 4
According to still another aspect of the embodiments of the present application, there is further provided a processor, configured to run a program, where the program, when run, performs: acquiring a high-heat text block from a data file to be compressed; and replacing the data file to be compressed with the high-heat text block for storage.
Example 5
Embodiments of the present application also provide a storage medium. Alternatively, in this embodiment, the storage medium may be used to store program codes executed by the method for data processing provided in the first embodiment.
Alternatively, in this embodiment, the storage medium may be located in any one of the computer terminals in the computer terminal group in the computer network, or in any one of the mobile terminals in the mobile terminal group.
Alternatively, in the present embodiment, the storage medium is configured to store program code for performing the steps of: acquiring a high-heat text block from a data file to be compressed; and replacing the data file to be compressed with the high-heat text block for storage.
Alternatively, in the present embodiment, the storage medium is configured to store program code for performing the steps of: the high-heat text block is a text block with heat greater than the heat of a preset index, wherein the heat of the preset index is the average reference times of the indexes in the same group.
Alternatively, in the present embodiment, the storage medium is configured to store program code for performing the steps of: the obtaining of the high-heat text block from the data file to be compressed comprises: carrying out data analysis on the data file to be compressed, and calculating text blocks with preset hotness ranking in the data file to be compressed through a preset algorithm; and determining the text blocks with preset hotness ranking as high-hotness text blocks.
Further optionally, in the present embodiment, the storage medium is configured to store program code for performing the steps of: calculating text blocks with preset hotness ranking in the data file to be compressed through a preset algorithm comprises: under the condition that the data file to be compressed is a log data table, word segmentation is carried out from the log data table according to preset word segmentation conditions, and a segmented log is obtained; vectorizing the segmented logs, and converting the logs into a high-dimensional vector space; clustering at least one high-dimensional vector space through a preset clustering algorithm to obtain a log similarity set; generating a dictionary library according to the log similarity set, and generating a digital log according to the dictionary library and the log similarity set; calculating convolution blocks with different spans through preset spans, and determining high-compression-rate convolution blocks with preset ranks according to the product of the preset spans and the occurrence times in the digital log; and restoring the data file to be compressed according to the dictionary library formatting codes to obtain a high-heat text block.
Alternatively, in the present embodiment, the storage medium is configured to store program code for performing the steps of: clustering at least one high-dimensional vector space through a preset clustering algorithm, wherein the obtaining of the log similarity set comprises the following steps: and under the condition that the preset clustering algorithm is a K-means clustering algorithm, clustering at least one high-dimensional vector space through the K-means clustering algorithm to obtain a log similarity set.
Alternatively, in the present embodiment, the storage medium is configured to store program code for performing the steps of: generating a dictionary library according to the log similarity class set, and generating a digital log according to the dictionary library and the log similarity class set comprises: performing word frequency statistics on each word in the log similarity class set to obtain a dictionary base; and mapping according to the dictionary library and the log similarity set to obtain a digital log, wherein the digital log is used for convolution summation, and the convolution summation is used for determining the span of the similar text blocks.
Alternatively, in the present embodiment, the storage medium is configured to store program code for performing the steps of: calculating convolution blocks of different spans through preset spans, and determining the preset ranked high-compression-rate convolution blocks according to the product of the preset spans and the occurrence times in the digital log comprises the following steps: calculating convolution summation of different spans according to the preset spans; obtaining a high compression rate span with preset ranking according to products of corresponding volumes of different spans and preset spans and occurrence times in the digital log; and calculating convolution blocks of different spans according to the high compression rate spans of the preset ranks, and determining the convolution blocks of the high compression rates of the preset ranks according to the product of the high compression rate spans of the preset ranks and the occurrence times in the digital log.
Alternatively, in the present embodiment, the storage medium is configured to store program code for performing the steps of: storing the high-heat text block in place of the data file to be compressed includes: coding the high-heat text block according to a preset model to obtain a coded high-heat text block; and replacing the data file to be compressed with the encoded high-heat text block for storage.
The foregoing embodiment numbers of the present application are merely for description and do not represent the relative merits of the embodiments.
In the foregoing embodiments of the present application, the descriptions of the embodiments are emphasized, and for a portion of this disclosure that is not described in detail in this embodiment, reference is made to the related descriptions of other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed technical content may be implemented in other manners. The above-described embodiments of the apparatus are merely exemplary; for example, the division of the units is merely a logical function division, and there may be another division manner in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling or direct coupling or communication connection shown or discussed may be through some interfaces, units, or modules, and may be in electrical or other forms.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence, the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
The foregoing is merely a preferred embodiment of the present application, and it should be noted that those skilled in the art may make several improvements and modifications without departing from the principles of the present application, and such improvements and modifications shall also be regarded as falling within the protection scope of the present application.

Claims (10)

1. A method of data processing, comprising:
acquiring a high-heat text block from a data file to be compressed;
replacing the data file to be compressed with the high-heat text block for storage;
the obtaining the high-heat text block from the data file to be compressed comprises the following steps: when the data file to be compressed is a log data table, word segmentation is carried out from the log data table according to preset word segmentation conditions, and a segmented log is obtained; vectorizing the segmented log, and converting the log into a high-dimensional vector space; clustering at least one high-dimensionality vector space through a preset clustering algorithm to obtain a log similarity set; generating a dictionary library according to the log similarity class set, and generating a digital log according to the dictionary library and the log similarity class set; calculating convolution blocks with different spans through preset spans, and determining high-compression-rate convolution blocks with preset ranks according to the product of the preset spans and the occurrence times in the digital log; and restoring the data file to be compressed according to the dictionary database formatting codes to obtain the high-heat text block.
2. The method of claim 1, wherein the high-heat text block is a text block having a heat greater than a heat of a preset index, wherein the heat of the preset index is an average number of references of the same set of indexes.
3. The method of claim 1, wherein the obtaining the high-heat text block from the data file to be compressed comprises:
carrying out data analysis on the data file to be compressed, and calculating text blocks with preset hotness ranking in the data file to be compressed through a preset algorithm;
and determining the text blocks with the preset hotness ranking as the high-hotness text blocks.
4. The method of claim 1, wherein clustering at least one of the high-dimensional vector spaces by a preset clustering algorithm to obtain a log similarity class set comprises:
and under the condition that the preset clustering algorithm is a K-means clustering algorithm, clustering at least one high-dimensionality vector space through the K-means clustering algorithm to obtain a log similarity set.
5. The method of claim 1, wherein generating a dictionary library from the set of log similarity classes and generating a digital log from the dictionary library and the set of log similarity classes comprises:
performing word frequency statistics on each word in the log similarity class set to obtain the dictionary base;
and mapping according to the dictionary library and the log similarity set to obtain the digital log, wherein the digital log is used for convolution summation, and the convolution summation is used for determining the span of similar text blocks.
6. The method of data processing according to claim 1 or 5, wherein calculating convolution blocks of different spans by a preset span and determining a preset ranked high compression rate convolution block from a product of the preset span and a number of occurrences in the digital log comprises:
calculating convolution summation of different spans according to the preset spans;
obtaining a preset ranking high compression rate span according to products of corresponding volumes of the different spans and the preset spans and the occurrence times in the digital log;
and calculating convolution blocks with different spans according to the preset ranking high-compression-rate spans, and determining the preset ranking high-compression-rate convolution blocks according to the product of the preset ranking high-compression-rate spans and the occurrence times in the digital log.
7. The method of data processing according to claim 1, wherein said storing the high-heat text block in place of the data file to be compressed comprises:
coding the high-heat text block according to a preset model to obtain a coded high-heat text block;
and replacing the data file to be compressed with the encoded high-heat text block for storage.
8. A storage medium comprising a stored program, wherein the program, when run, controls a device in which the storage medium resides to perform: acquiring a high-heat text block from a data file to be compressed; and replacing the data file to be compressed with the high-heat text block for storage, wherein the obtaining the high-heat text block from the data file to be compressed comprises the following steps: when the data file to be compressed is a log data table, word segmentation is carried out from the log data table according to preset word segmentation conditions, and a segmented log is obtained; vectorizing the segmented log, and converting the log into a high-dimensional vector space; clustering at least one high-dimensionality vector space through a preset clustering algorithm to obtain a log similarity set; generating a dictionary library according to the log similarity class set, and generating a digital log according to the dictionary library and the log similarity class set; calculating convolution blocks with different spans through preset spans, and determining high-compression-rate convolution blocks with preset ranks according to the product of the preset spans and the occurrence times in the digital log; and restoring the data file to be compressed according to the dictionary library formatting codes to obtain the high-heat text block.
9. A processor for running a program, wherein the program, when run, performs: acquiring a high-heat text block from a data file to be compressed; and replacing the data file to be compressed with the high-heat text block for storage, wherein the obtaining the high-heat text block from the data file to be compressed comprises the following steps: when the data file to be compressed is a log data table, word segmentation is carried out from the log data table according to preset word segmentation conditions, and a segmented log is obtained; vectorizing the segmented log, and converting the log into a high-dimensional vector space; clustering at least one high-dimensionality vector space through a preset clustering algorithm to obtain a log similarity set; generating a dictionary library according to the log similarity class set, and generating a digital log according to the dictionary library and the log similarity class set; calculating convolution blocks with different spans through preset spans, and determining high-compression-rate convolution blocks with preset ranks according to the product of the preset spans and the occurrence times in the digital log; and restoring the data file to be compressed according to the dictionary library formatting codes to obtain the high-heat text block.
10. A method of data processing, comprising:
acquiring a target data object, wherein the target data object is stored in a target data address;
acquiring a text block with the heat degree larger than a preset threshold value from the target data object, wherein the preset threshold value comprises reference times or reference frequency;
storing the text block in the target data address, wherein the target data object comprises a data file to be compressed, and acquiring the high-heat text block from the data file to be compressed comprises: when the data file to be compressed is a log data table, word segmentation is carried out from the log data table according to preset word segmentation conditions, and a segmented log is obtained; vectorizing the segmented log, and converting the log into a high-dimensional vector space; clustering at least one high-dimensionality vector space through a preset clustering algorithm to obtain a log similarity set; generating a dictionary library according to the log similarity class set, and generating a digital log according to the dictionary library and the log similarity class set; calculating convolution blocks with different spans through preset spans, and determining high-compression-rate convolution blocks with preset ranks according to the product of the preset spans and the occurrence times in the digital log; and restoring the data file to be compressed according to the dictionary database formatting codes to obtain the high-heat text block.
CN201810410873.8A 2018-05-02 2018-05-02 Method of data processing and storage medium Active CN110442489B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810410873.8A CN110442489B (en) 2018-05-02 2018-05-02 Method of data processing and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810410873.8A CN110442489B (en) 2018-05-02 2018-05-02 Method of data processing and storage medium

Publications (2)

Publication Number Publication Date
CN110442489A CN110442489A (en) 2019-11-12
CN110442489B true CN110442489B (en) 2024-03-01

Family

ID=68427586

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810410873.8A Active CN110442489B (en) 2018-05-02 2018-05-02 Method of data processing and storage medium

Country Status (1)

Country Link
CN (1) CN110442489B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111104381A (en) * 2019-11-30 2020-05-05 北京浪潮数据技术有限公司 Log management method, device and equipment and computer readable storage medium
CN113282552B (en) * 2021-06-04 2022-11-22 上海天旦网络科技发展有限公司 Similarity direction quantization method and system for flow statistic log
CN115834504A (en) * 2022-11-04 2023-03-21 电子科技大学 AXI bus-based data compression/decompression method and device
CN117313657B (en) * 2023-11-30 2024-03-19 深圳市伟奇服装有限公司 School uniform design data coding compression method

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2002132546A (en) * 2000-10-24 2002-05-10 Xaxon R & D Corp Storage device
CN1367896A (en) * 1999-08-13 2002-09-04 富士通株式会社 File processing method, data processing device and storage medium
TW527784B (en) * 2000-12-18 2003-04-11 Inventec Besta Co Ltd Method for compressing statistical data characteristics
CN105893337A (en) * 2015-01-04 2016-08-24 伊姆西公司 Method and equipment for text compression and decompression
CN106446148A (en) * 2016-09-21 2017-02-22 中国运载火箭技术研究院 Cluster-based text duplicate checking method
CN106815124A (en) * 2015-12-01 2017-06-09 北京国双科技有限公司 Journal file treating method and apparatus
CN107145485A (en) * 2017-05-11 2017-09-08 百度国际科技(深圳)有限公司 Method and apparatus for compressing topic model
CN107977442A (en) * 2017-12-08 2018-05-01 北京希嘉创智教育科技有限公司 Journal file compresses and decompression method, electronic equipment and readable storage medium storing program for executing




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40016266

Country of ref document: HK

GR01 Patent grant