CN110442489A

CN110442489A - Data processing method and storage medium

Info

Publication number: CN110442489A
Application number: CN201810410873.8A
Authority: CN
Inventors: 朱成生; 俞飞江
Original assignee: Alibaba Group Holding Ltd
Current assignee: Alibaba Group Holding Ltd
Priority date: 2018-05-02
Filing date: 2018-05-02
Publication date: 2019-11-12
Anticipated expiration: 2038-05-02
Also published as: CN110442489B

Abstract

This application discloses a data processing method and a storage medium. The method comprises: obtaining high-heat text blocks from a data file to be compressed; and storing the high-heat text blocks in place of the data file to be compressed. The invention solves the technical problem that data compressed with general-purpose compression techniques still requires a large amount of storage space.

Description

Data processing method and storage medium

Technical field

This application relates to the field of Internet technology applications, and in particular to a data processing method and a storage medium.

Background

As the Internet industry has expanded, more and more industries have become connected to the Internet, and large volumes of data are generated as a result. At the enterprise level in particular, the generation, execution, and archiving of routine work produce large amounts of data. Databases that are used to call and store this data use SQL statements as call instructions or to manage logs when generating logs, but SQL statements occupy many bytes, and the large storage space they require increasingly troubles the operation and maintenance personnel responsible for enterprise data.

Existing solutions store the data as cold data: the data to be stored is compressed with an ordinary compression technique, which reduces its storage requirement, and the compressed data is what is subsequently stored. The problem with this prior art is that, when large volumes of data are generated, the compressed data still requires a great deal of storage space even after compression, which puts considerable pressure on limited physical storage.

No effective solution has yet been proposed for the problem that data compressed with general-purpose compression techniques still requires a large amount of storage space.

Summary of the invention

The embodiments of this application provide a data processing method and a storage medium, to at least solve the technical problem that data compressed with general-purpose compression techniques still requires a large amount of storage space.

According to one aspect of the embodiments of this application, a data processing method is provided, comprising: obtaining high-heat text blocks from a data file to be compressed; and storing the high-heat text blocks in place of the data file to be compressed.

Optionally, a high-heat text block is a text block whose heat is greater than a preset index heat, where the preset index heat is the average reference count of the indexes in the same group.

Optionally, obtaining high-heat text blocks from the data file to be compressed comprises: performing data analysis on the data file to be compressed, and calculating, by a preset algorithm, the text blocks at a preset heat ranking in the data file to be compressed; and determining the text blocks at the preset heat ranking to be the high-heat text blocks.

Further, optionally, calculating, by the preset algorithm, the text blocks at the preset heat ranking in the data file to be compressed comprises: when the data file to be compressed is a log data table, tokenizing the log data table according to a preset word segmentation condition to obtain tokenized logs; vectorizing the tokenized logs to convert the logs into a high-dimensional vector space; clustering at least one high-dimensional vector space by a preset clustering algorithm to obtain a set of similar log classes; generating a dictionary library from the set of similar log classes, and generating a digital log from the dictionary library and the set of similar log classes; calculating convolution blocks of different spans according to preset spans, and determining the high-compression-rate convolution blocks at a preset ranking according to the product of the preset span and the number of occurrences in the digital log; and restoring the data file to be compressed according to the dictionary library formatting code, to obtain the high-heat text blocks.

Optionally, clustering at least one high-dimensional vector space by the preset clustering algorithm to obtain the set of similar log classes comprises: when the preset clustering algorithm is the K-means clustering algorithm, clustering at least one high-dimensional vector space by the K-means clustering algorithm to obtain the set of similar log classes.

Optionally, generating the dictionary library from the set of similar log classes, and generating the digital log from the dictionary library and the set of similar log classes, comprises: performing word frequency statistics on each token in the set of similar log classes to obtain the dictionary library; and mapping the set of similar log classes through the dictionary library to obtain the digital log, where the digital log is used for convolution summation, and the convolution summation is used to determine the span of similar text blocks.

Optionally, calculating convolution blocks of different spans according to preset spans, and determining the high-compression-rate convolution blocks at the preset ranking according to the product of the preset span and the number of occurrences in the digital log, comprises: calculating the convolution sums for the different spans according to the preset spans; obtaining the high-compression-rate spans at the preset ranking according to the product of each span and the number of occurrences in the digital log of the convolution sum corresponding to that span; and calculating the convolution blocks of different spans according to the high-compression-rate spans at the preset ranking, and determining the high-compression-rate convolution blocks at the preset ranking according to the product of the high-compression-rate span and the number of occurrences in the digital log.

Optionally, storing the high-heat text blocks in place of the data file to be compressed comprises: encoding the high-heat text blocks according to a preset model to obtain encoded high-heat text blocks; and storing the encoded high-heat text blocks in place of the data file to be compressed.

According to another aspect of the embodiments of this application, a storage medium is also provided. The storage medium includes a stored program, and when the program runs, the device on which the storage medium resides is controlled to perform the following: obtaining high-heat text blocks from a data file to be compressed; and storing the high-heat text blocks in place of the data file to be compressed.

According to another aspect of the embodiments of this application, a processor is also provided. The processor is used to run a program, and the program, when running, performs the following: obtaining high-heat text blocks from a data file to be compressed; and storing the high-heat text blocks in place of the data file to be compressed.

According to yet another aspect of the embodiments of this application, a data processing method is also provided, comprising: obtaining a target data object, where the target data object is stored at a target data address; obtaining, from the target data object, text blocks whose heat is greater than a preset threshold, where the preset threshold includes a reference count or a reference frequency; and storing the text blocks at the target data address.

In the embodiments of this application, high-heat text blocks are obtained from the data file to be compressed and stored in place of the data file to be compressed. This achieves the purpose of encoding and compressing according to the high-heat text blocks in each different log, which realizes the technical effect of reducing storage space and solves the technical problem that data compressed with general-purpose compression techniques still requires a large amount of storage space.

Brief description of the drawings

The drawings described here are provided for further understanding of this application and form a part of it. The illustrative embodiments of this application and their descriptions are used to explain the application and do not unduly limit it. In the drawings:

Fig. 1 is a hardware block diagram of a computer terminal for a data processing method according to an embodiment of this application;

Fig. 2 is a flowchart of the data processing method according to Embodiment 1 of this application;

Fig. 3 is a flowchart of a data processing method according to Embodiment 1 of this application;

Fig. 4 is a flowchart of calculating the high-heat text blocks in the data processing method according to Embodiment 1 of this application;

Fig. 5 is a flowchart of the data processing method according to Embodiment 2 of this application.

Detailed description

To help those skilled in the art better understand the solution of this application, the technical solutions in the embodiments of this application are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently only some, not all, of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of this application without creative effort shall fall within the scope of protection of this application.

It should be noted that the terms "first", "second", and so on in the description, the claims, and the above drawings are used to distinguish similar objects and do not necessarily describe a particular order or sequence. Data used in this way are interchangeable where appropriate, so that the embodiments of this application described here can be implemented in orders other than those illustrated or described. In addition, the terms "comprise" and "have" and any variants of them are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device that contains a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product, or device.

Technical terms used in this application:

Data compression: a technique that, without losing useful information, reduces the volume of data in order to reduce storage space and improve the efficiency of transmission, storage, and processing, or that reorganizes data according to some algorithm to reduce redundancy and storage space.

Word segmentation (tokenization): splitting text into one or more words.

Fast convolution: computing the convolution sum for each starting point and span.

Aggregation: common aggregations include count, distinct count, sum, maximum, minimum, and so on.

Embodiment 1

According to an embodiment of this application, a method embodiment of data processing is provided. It should be noted that the steps shown in the flowcharts of the drawings can be executed in a computer system, for example as a set of computer-executable instructions, and that although a logical order is shown in the flowcharts, in some cases the steps shown or described can be executed in a different order.

The method embodiment provided in Embodiment 1 of this application may be executed in a mobile terminal, a computer terminal, or a similar computing device. Taking execution on a computer terminal as an example, Fig. 1 is a hardware block diagram of a computer terminal for a data processing method according to an embodiment of this application. As shown in Fig. 1, the computer terminal 10 may include one or more processors 102 (only one is shown in the figure; the processor 102 may include, but is not limited to, a processing device such as a microcontroller unit (MCU) or a field-programmable gate array (FPGA)), a memory 104 for storing data, and a transmission module 106 for communication functions. A person of ordinary skill in the art will understand that the structure shown in Fig. 1 is only illustrative and does not limit the structure of the above electronic device. For example, the computer terminal 10 may include more or fewer components than shown in Fig. 1, or have a configuration different from that shown in Fig. 1.

The memory 104 can be used to store the software programs and modules of application software, such as the program instructions/modules corresponding to the data processing method in the embodiments of this application. By running the software programs and modules stored in the memory 104, the processor 102 executes various functional applications and data processing, that is, implements the data processing method of the above application program. The memory 104 may include high-speed random access memory and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, and such remote memory may be connected to the computer terminal 10 through a network. Examples of such networks include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations of these.

The transmission device 106 is used to receive or send data via a network. A specific example of such a network is a wireless network provided by the communication provider of the computer terminal 10. In one example, the transmission device 106 includes a network adapter (Network Interface Controller, NIC), which can connect to other network devices through a base station in order to communicate with the Internet. In another example, the transmission device 106 may be a radio frequency (RF) module, which is used to communicate with the Internet wirelessly.

In the above operating environment, this application provides the data processing method shown in Fig. 2. Fig. 2 is a flowchart of the data processing method according to Embodiment 1 of this application.

Step S202: obtain high-heat text blocks from the data file to be compressed;

In step S202 above, the data processing method provided by this application selects a log data table with a large storage footprint from the data file to be compressed, performs data analysis on the format of the log data table, and then uses an algorithm model to calculate the text blocks at a preset ranking in the data file to be compressed. The text blocks at the preset ranking may be the TOP N text blocks in the data file to be compressed, where N is an integer, for example 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, and so on. The high-heat text blocks provided by this application are the TOP N text blocks in the data file to be compressed.

In the process of obtaining the high-heat text blocks, partial regularities in the log content can be found through data analysis and probing, and the high-heat text blocks can then be found by the constructed algorithm model.

When calculating the heat of a text block, the heat can be obtained from the number of times the text block is referenced. This application defines the index heat as the average reference count of the indexes in the same group: assuming that a sample file contains n lines of log text strings and that the reference count of text block i_m is r_m, the heat of all text blocks in the sample is:
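The expression that followed this sentence does not survive in this text. As a rough illustration only, the sketch below assumes that the heat of a text block is simply its reference count r_m and that the cut-off is the mean reference count over all blocks in the sample, in line with the "average reference count" wording above. The function names, the fixed block length `span`, and the sample log lines are hypothetical.

```python
from collections import Counter

def block_heat(log_lines, span):
    """Count how often each block of `span` consecutive tokens is referenced."""
    counts = Counter()
    for line in log_lines:
        tokens = line.split()
        for start in range(len(tokens) - span + 1):
            counts[tuple(tokens[start:start + span])] += 1
    return counts

def high_heat_blocks(counts):
    """Keep blocks whose reference count exceeds the mean reference count.

    Assumption: the mean over all observed blocks stands in for the
    'preset index heat'; the source formula is not available.
    """
    if not counts:
        return {}
    mean = sum(counts.values()) / len(counts)
    return {block: r for block, r in counts.items() if r > mean}

# Hypothetical sample of n = 3 log text strings.
logs = [
    "select id from orders where id = 1",
    "select id from orders where id = 2",
    "select name from users where id = 3",
]
counts = block_heat(logs, span=3)
hot = high_heat_blocks(counts)
```

In this sample the block `("where", "id", "=")` is referenced in every line, so it sits above the mean, while blocks containing the literal values do not.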

Unlike the prior art, which stores data after general-purpose compression, the data processing method provided by this application overcomes the inability to encode and compress the high-heat text blocks in each different log. Step S202 obtains the high-heat text blocks from the data file to be compressed so that compression and storage can be targeted; the storage step is described in step S204.

Step S204: store the high-heat text blocks in place of the data file to be compressed.

In step S204 above, the high-heat text blocks obtained in step S202, which were calculated by the data model, are re-encoded, and the re-encoded high-heat text blocks are stored in place of the data file to be compressed from step S202.

Specifically, Fig. 3 is a flowchart of a data processing method according to Embodiment 1 of this application. Combining steps S202 and S204, the data processing method provided by this application is applicable to database logs. The SQL statements in such logs occupy a large number of bytes, but they are highly similar to one another and contain many text blocks with very high heat. Re-encoding these text blocks reduces the storage space the data requires. The method therefore compresses and stores the high-heat text blocks by re-encoding them, achieving the technical effect of reducing the demand for storage space.

Fig. 4 is a flowchart of calculating the high-heat text blocks in the data processing method according to Embodiment 1 of this application. The high-heat text blocks are calculated as follows:

Optionally, obtaining the high-heat text blocks from the data file to be compressed in step S202 comprises:

Step S2021: perform data analysis on the data file to be compressed, and calculate, by a preset algorithm, the text blocks at a preset heat ranking in the data file to be compressed;

Step S2022: determine the text blocks at the preset heat ranking to be the high-heat text blocks.

Specifically, combining steps S2021 and S2022, the data processing method provided by this application selects, according to the business data, a log data table with a large storage footprint from the data file to be compressed, performs data analysis on the format of the log data table, and then uses a preset calculation model to calculate the TOP text blocks in the log file, thereby obtaining the high-heat text blocks provided by this application.

Further, optionally, calculating, by the preset algorithm, the text blocks at the preset heat ranking in the data file to be compressed in step S2021 comprises:

Step S20211: when the data file to be compressed is a log data table, tokenize the log data table according to a preset word segmentation condition to obtain tokenized logs;

In step S20211 above, when the log data table is tokenized according to the preset word segmentation condition, the segmentation method may be either of the following two:

Taking TXXX_CHN and INTERNET_CHN as examples, each converts a sentence into words separated by spaces. The two segmentation methods are similar; the former embeds a Taobao-related segmentation vocabulary and can also segment according to a defined segmentation standard, which makes it more flexible.
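Neither tokenizer is specified further in this text. As a stand-in, the sketch below uses the simpler normalization described in the worked example later in this embodiment (tabs and line breaks replaced with spaces, then splitting on spaces). The function name and the sample SQL line are hypothetical.

```python
import re

def normalize_and_tokenize(log_line):
    """Replace tabs and line breaks with spaces, then split on spaces."""
    normalized = re.sub(r"[\t\r\n]", " ", log_line)
    # Drop the empty strings produced by consecutive spaces.
    return [tok for tok in normalized.split(" ") if tok]

tokens = normalize_and_tokenize("SELECT\tid FROM orders\nWHERE id = 1")
```

A dedicated word segmenter would be needed for logs that contain unspaced Chinese text; this sketch only covers space-delimited content such as SQL.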

Step S20212: vectorize the tokenized logs to convert the logs into a high-dimensional vector space;

In step S20212 above, the tokenized logs obtained in step S20211 are vectorized, converting the logs into a high-dimensional vector space.

There are two main word vector models:

In the CBOW model, the training input is the word vectors of the context words surrounding a given target word, and the output is the word vector of that target word;

The Skip-Gram model reverses the idea of CBOW: the input is the word vector of a specific word, and the output is the word vectors of that word's context.

Building on word vectors, this application uses the DOC2VEC (sentence vector) model, which also has two variants: Distributed Memory (DM) and Distributed Bag of Words (DBOW). DM tries to predict the probability of a word given the context and the paragraph vector. During the training of a sentence or document, the paragraph ID stays unchanged and the same paragraph vector is shared. DBOW predicts the probability of a group of random words in the paragraph given only the paragraph vector.

For example, the input "this is a sentence" is tokenized into "this is", "a", "sentence";

Sentence vectorization is then performed, for example with DM and a 100-dimensional output:

doc_id  ver1  ver2  …  ver100
1       0.1   0.2   …  0.5
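Training a DOC2VEC model requires a third-party library and a corpus, so it is not reproduced here. Purely to show the shape of the output (one fixed-length row per log line), the sketch below swaps in a much simpler technique, feature hashing of token counts. It is not DOC2VEC and does not capture context; the function name and the dimension of 8 are hypothetical.

```python
import zlib

def hashed_vector(tokens, dim=8):
    """Map a token list to a fixed-length count vector via feature hashing.

    Stand-in for a learned sentence vector: each token increments one
    bucket chosen by a stable CRC32 hash of the token.
    """
    vec = [0.0] * dim
    for tok in tokens:
        vec[zlib.crc32(tok.encode("utf-8")) % dim] += 1.0
    return vec

v1 = hashed_vector(["select", "id", "from", "orders"])
v2 = hashed_vector(["select", "id", "from", "orders"])
```

Identical token lists always map to identical vectors, which is the property the clustering step relies on when grouping similar logs.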

Step S20213: cluster at least one high-dimensional vector space by a preset clustering algorithm to obtain a set of similar log classes;

Clustering at least one high-dimensional vector space by the preset clustering algorithm to obtain the set of similar log classes comprises:

Step S202131: when the preset clustering algorithm is the K-means clustering algorithm, cluster at least one high-dimensional vector space by the K-means clustering algorithm to obtain the set of similar log classes.

Based on the high-dimensional vector space obtained in step S20212, at least one high-dimensional vector space is clustered by the preset clustering algorithm to obtain the set of similar log classes. Existing clustering algorithms include the following three:

K-Means: one-dimensional grouping, calculated using the concept of "distance";

Kohonen: two-dimensional grouping using a self-organizing neural network model;

2-Step: automatically finds the most suitable number of groups;

Although 2-Step trains quickly, the advantage of K-Means in this application is that the number of clusters can be specified. Different log volumes need different numbers of clusters, and specifying the number avoids the uncontrollability of having the algorithm automatically produce N clusters, which is more flexible. This application therefore uses the K-Means algorithm as a preferred example; any algorithm that implements the data processing method provided by this application may be used, and no specific limitation is imposed.
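The point about choosing the number of clusters can be seen in a minimal K-means sketch. The code below is a plain textbook version, assuming squared Euclidean distance and seeding the centroids with the first k points so that the result is deterministic; a production system would more likely use a library implementation. All names and the sample points are hypothetical.

```python
def kmeans(points, k, iterations=10):
    """Plain K-means; the first k points seed the centroids (an assumption)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for i, p in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Four hypothetical 2-dimensional log vectors forming two obvious groups.
points = [[0.0, 0.1], [5.0, 5.1], [0.2, 0.0], [5.2, 4.9]]
labels, centroids = kmeans(points, k=2)
```

Here `k` is passed in by the caller, which is the controllability the text attributes to K-Means: a larger log volume can simply be clustered with a larger `k`.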

Step S20214: generate a dictionary library from the set of similar log classes, and generate a digital log from the dictionary library and the set of similar log classes;

Generating the dictionary library from the set of similar log classes, and generating the digital log from the dictionary library and the set of similar log classes, comprises:

Step S202141: perform word frequency statistics on each token in the set of similar log classes to obtain the dictionary library;

Step S202142: map the set of similar log classes through the dictionary library to obtain the digital log, where the digital log is used for convolution summation, and the convolution summation is used to determine the span of similar text blocks.
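Steps S202141 and S202142 can be sketched as follows. The text does not say how codes are assigned, so this sketch assumes integer codes starting at 1, handed out in descending order of word frequency; the function names and the sample cluster are hypothetical.

```python
from collections import Counter

def build_dictionary(tokenized_logs):
    """Assign integer codes by descending word frequency (ties: first seen)."""
    freq = Counter(tok for line in tokenized_logs for tok in line)
    # Codes start at 1; most_common keeps first-seen order for equal counts.
    return {tok: code for code, (tok, _) in enumerate(freq.most_common(), start=1)}

def to_digital_log(tokenized_logs, dictionary):
    """Replace every token with its integer code."""
    return [[dictionary[tok] for tok in line] for line in tokenized_logs]

# One hypothetical cluster of similar, already tokenized log lines.
cluster = [
    ["select", "id", "from", "orders"],
    ["select", "id", "from", "users"],
]
dictionary = build_dictionary(cluster)
digital = to_digital_log(cluster, dictionary)
```

Because the mapping is one-to-one, inverting `dictionary` later recovers the original tokens from any block of codes.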

In step S20214 above, word frequency statistics are performed on the similar classes to form the dictionary library, and the logs are mapped to a digital log on which convolution sums can be computed;

The purpose of the convolution summation is to quickly determine the span of similar text blocks;

For example:

import tensorflow as tf

# All-ones kernel of height j: each output value is the sum of j consecutive values along the height axis of x_image.
W_conv1 = tf.ones([j, 1, 1, 1])

conv = tf.nn.conv2d(x_image, W_conv1, strides=[1, 1, 1, 1], padding='VALID')

Using TensorFlow functions supports parallel computation on GPUs, which is more efficient.
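Convolving with an all-ones kernel of length j and no padding is the same as summing every run of j consecutive codes. The dependency-free sketch below shows that computation on a single line of the digital log; it is meant to illustrate what the convolution produces, not to replace the GPU version. The function name and the sample codes are hypothetical.

```python
def window_sums(codes, span):
    """Sum every run of `span` consecutive codes (all-ones kernel, no padding)."""
    if span > len(codes):
        return []
    total = sum(codes[:span])
    sums = [total]
    # Slide the window one position at a time: add the new code, drop the old one.
    for i in range(span, len(codes)):
        total += codes[i] - codes[i - span]
        sums.append(total)
    return sums

sums = window_sums([1, 2, 3, 4, 1, 2, 3, 5], span=3)
```

In this sample the value 6 appears twice, at the two positions where the block 1, 2, 3 starts, which is how repeated sums point to candidate repeated blocks.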

Step S20215: calculate convolution blocks of different spans according to preset spans, and determine the high-compression-rate convolution blocks at a preset ranking according to the product of the preset span and the number of occurrences in the digital log;

Calculating convolution blocks of different spans according to preset spans, and determining the high-compression-rate convolution blocks at the preset ranking according to the product of the preset span and the number of occurrences in the digital log, comprises:

Step S202151: calculate the convolution sums for the different spans according to the preset spans;

Step S202152: obtain the high-compression-rate spans at the preset ranking according to the product of each span and the number of occurrences in the digital log of the convolution sum corresponding to that span;

Step S202153: calculate the convolution blocks of different spans according to the high-compression-rate spans at the preset ranking, and determine the high-compression-rate convolution blocks at the preset ranking according to the product of the high-compression-rate span and the number of occurrences in the digital log.

It should be noted that text blocks with the same convolution sum are not necessarily identical: the order of the elements may be swapped so that the convolution sums are equal even though the text blocks differ. Only similar spans can therefore be obtained at this point, and log blocks must subsequently be cut out according to the span and matched exactly.

Specifically, for spans from 2 to n, the convolution sums for the different spans are quickly calculated, and the TOP N high-compression-rate spans are selected according to the product of each span and the number of occurrences in the log file of the convolution sum corresponding to that span. The convolution blocks of different spans are then recalculated according to the determined TOP N high-compression-rate spans, and the TOP N high-compression-rate convolution blocks are determined according to the product of the TOP N high-compression-rate span and the number of occurrences in the log file.
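The two-stage selection can be sketched as follows. The text does not say which occurrence count is used when scoring a span, so this sketch assumes the count of the most frequent window sum at that span; the second stage then matches blocks exactly, as the note above requires, and scores each block by span times occurrences. All names and the sample digital log are hypothetical.

```python
from collections import Counter

def top_spans(digital_log, max_span, top_n):
    """Rank spans by span * (highest frequency of any window sum at that span)."""
    scores = {}
    for span in range(2, max_span + 1):
        sums = Counter(
            sum(line[i:i + span])
            for line in digital_log
            for i in range(len(line) - span + 1)
        )
        if sums:
            scores[span] = span * max(sums.values())
    return sorted(scores, key=lambda s: (-scores[s], s))[:top_n]

def top_blocks(digital_log, spans, top_n):
    """Exact-match blocks at the chosen spans, ranked by span * occurrences."""
    counts = Counter(
        tuple(line[i:i + span])
        for span in spans
        for line in digital_log
        for i in range(len(line) - span + 1)
    )
    ranked = sorted(counts, key=lambda b: (-len(b) * counts[b], b))
    return [(block, len(block) * counts[block]) for block in ranked[:top_n]]

digital = [[1, 2, 3, 4], [1, 2, 3, 5], [1, 2, 3, 4]]
spans = top_spans(digital, max_span=4, top_n=2)
blocks = top_blocks(digital, spans, top_n=1)
```

The product of span and occurrences approximates how many tokens a single code could replace, which is why it serves as a proxy for compression rate in this sketch.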

Step S20216: restore the data file to be compressed according to the dictionary library formatting code, to obtain the high-heat text blocks.

Optionally, storing the high-heat text blocks in place of the data file to be compressed in step S204 comprises:

Step S2041: encode the high-heat text blocks according to a preset model to obtain encoded high-heat text blocks, and store the encoded high-heat text blocks in place of the data file to be compressed.

The preset model comprises multiple algorithm models such as sentence vectorization, clustering, and deep-learning convolution.
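The encoding format itself is not described in this text. As one possible reading, the sketch below replaces each high-heat block with a short code from a codebook and shows that the original tokens can be recovered. The codebook contents, the `#1`-style codes, and the greedy longest-first matching are all assumptions, and the codes are assumed never to occur as ordinary tokens in the log.

```python
def encode(tokens, codebook):
    """Greedy left-to-right replacement of high-heat blocks with short codes."""
    blocks = sorted(codebook, key=len, reverse=True)  # prefer longer blocks
    out, i = [], 0
    while i < len(tokens):
        for block in blocks:
            if tuple(tokens[i:i + len(block)]) == block:
                out.append(codebook[block])
                i += len(block)
                break
        else:
            # No block starts here: keep the token unchanged.
            out.append(tokens[i])
            i += 1
    return out

def decode(encoded, codebook):
    """Expand codes back into the original tokens."""
    reverse = {code: block for block, code in codebook.items()}
    out = []
    for item in encoded:
        out.extend(reverse.get(item, (item,)))
    return out

# Hypothetical codebook built from two high-heat blocks.
codebook = {("select", "id", "from"): "#1", ("where", "id", "="): "#2"}
tokens = "select id from orders where id = 1".split()
encoded = encode(tokens, codebook)
```

In this sample eight tokens shrink to four stored items, and `decode(encoded, codebook)` returns the original token list.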

In summary, as shown in Fig. 4, a preferred example of calculating the high-heat text blocks in the data processing method provided by this application is as follows:

(1) Extract the logs and tokenize them (step 1 in Fig. 4);

Normalize the log file (replace TABs and newlines with spaces) and split it into tokens on spaces;

(2) Vectorize the tokenized logs (step 2 in Fig. 4);

Vectorization converts the logs into a high-dimensional vector space;

(3) Cluster (steps 3-4 in Fig. 4);

After the logs are converted into a high-dimensional vector space, ordinary K-means clustering groups similar logs together;

(4) Format the similar classes (steps 5-8 in Fig. 4);

Perform word frequency statistics on the similar classes to form the dictionary library, and map the logs to a digital log on which convolution sums can be computed;

(5) Fast convolution and span selection (steps 9-12 in Fig. 4);

For spans from 2 to n, quickly calculate the convolution sums for the different spans;

Select the TOP N high-compression-rate spans according to the product of each span and the number of occurrences in the log file of the convolution sum corresponding to that span;

(6) Convolution block cutting and compression rate assessment (steps 13-15 in Fig. 4);

According to the spans determined above, recalculate the convolution blocks of different spans, and determine the TOP N high-compression-rate convolution blocks according to the product of the span and the number of occurrences in the log file;

(7) Formatting code (step 16 in Fig. 4);

Restore the actual log text block content according to the dictionary library formatting code, to obtain the high-heat text blocks.

By tokenizing log-type big data, the data processing method provided by this application finds the high-heat text blocks, encodes and compresses them, and stores the encoded and compressed high-heat text blocks in place of the original data file to be stored. This reduces the demand for storage space, improves storage utilization, and eases the pressure of later maintenance.

It should be noted that, for brevity, each of the foregoing method embodiments is described as a series of combined actions. Those skilled in the art should understand, however, that this application is not limited by the described order of actions, because according to this application some steps may be performed in other orders or simultaneously. Those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and that the actions and modules involved are not necessarily required by this application.

From the description of the above embodiments, those skilled in the art can clearly understand that the data processing method of the above embodiments can be implemented by software together with a necessary general-purpose hardware platform, or by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of this application, in essence or in the part that contributes over the prior art, can be embodied as a software product. The computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and includes instructions that cause a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to execute the methods described in the embodiments of this application.

Embodiment 2

This application provides the methods of data processing as shown in Figure 5.Fig. 5 is at the data according to the embodiment of the present application two The flow chart of the method for reason.

Step S502 obtains target data objects, wherein target data objects are stored in target data address；

In the application above-mentioned steps S502, the method for data processing provided by the present application is obtained on target data address and is deposited The target data objects of storage, the target data objects may include data file to be compressed, which may include: number According to the function file of the calling data stored in library, to the run program file of data processing or encryption and decryption is carried out to data adds Decryption program file, the method for data processing provided by the present application are only illustrated taking the above example as an example, specifically without limitation.

Step S504: obtain, from the target data object, a text block whose temperature is greater than a preset threshold, wherein the preset threshold includes a reference count or a reference frequency;

In step S504 above, based on the target data object obtained in step S502, the method of data processing provided by the application selects a log data table with a large storage footprint from the target data object, performs data analysis on the format of the log data table, and then uses an algorithm model to calculate the text blocks of a preset ranking in the target data object. The text blocks of the preset ranking may be the TOP N text blocks in the target data object, where N is an integer, for example 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, and so on. The text block provided by the application, that is, a TOP N text block in the target data object (for example, a TOP 3 text block), is the text block whose temperature is greater than the preset threshold.

In the process of obtaining the text block whose temperature is greater than the preset threshold, data analysis and detection can be used to find local regularities in the log content, and the constructed algorithm model can then be used to find the text block whose temperature is greater than the preset threshold.

Step S506: store the text block at the target data address.

The method of data processing provided by the application differs from prior-art storage based on general compression techniques, and overcomes the problem that the high temperature text blocks in each different log cannot be encoded and compressed. It obtains, from the target data object, the text block whose temperature is greater than the preset threshold, so that compressed storage is carried out in a targeted manner. Storing the text block whose temperature is greater than the preset threshold in place of the original target data object then saves storage space.

In this embodiment of the application, a target data object is obtained, wherein the target data object is stored at a target data address; a text block whose temperature is greater than a preset threshold is obtained from the target data object, wherein the preset threshold includes a reference count or a reference frequency; and the text block is stored at the target data address. This achieves the purpose of encoding and compressing the high temperature text blocks in each different log, thereby realizing the technical effect of reducing the storage space required, and in turn solving the technical problem that compressed data still demands a large amount of storage space when general compression techniques are used.
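The flow of steps S502 to S506 can be illustrated with a minimal sketch. It rests on assumptions that are not in the application: each whole log line is treated as one text block, the "temperature" of a block is simply its reference count, and the dictionary `storage` is a hypothetical stand-in for addressed storage.

```python
from collections import Counter

def hot_text_blocks(log_lines, threshold):
    """Return the text blocks whose reference count exceeds `threshold`."""
    counts = Counter(log_lines)
    return {block: n for block, n in counts.items() if n > threshold}

# Hypothetical in-memory storage: target data address -> stored object.
storage = {
    "addr-1": [
        "SELECT * FROM orders WHERE id = ?",
        "SELECT * FROM orders WHERE id = ?",
        "UPDATE users SET name = ? WHERE id = ?",
        "SELECT * FROM orders WHERE id = ?",
    ]
}

target = storage["addr-1"]                   # S502: obtain the target data object
hot = hot_text_blocks(target, threshold=2)   # S504: temperature > preset threshold
storage["addr-1"] = hot                      # S506: store the text blocks at the same address
```

In this sketch only the block referenced three times survives the threshold of 2; a real implementation would also need to keep enough information to restore the original data.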

Embodiment 3

Embodiment 4

Embodiment 5

An embodiment of the application further provides a storage medium. Optionally, in this embodiment, the storage medium may be used to store the program code executed by the method of data processing provided in Embodiment 1 above.

Optionally, in this embodiment, the storage medium may be located in any computer terminal of a computer terminal group in a computer network, or in any mobile terminal of a mobile terminal group.

Optionally, in this embodiment, the storage medium is configured to store program code for executing the following steps: obtaining a high temperature text block from a data file to be compressed; and storing the high temperature text block in place of the data file to be compressed.

Optionally, in this embodiment, the storage medium is configured to store program code for executing the following steps: the high temperature text block is a text block whose temperature is greater than a preset index temperature, wherein the preset index temperature is the average reference count of the indexes in the same group.
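As a rough illustration of this threshold, the sketch below computes the average reference count over a group of text blocks and keeps the blocks that exceed it. The function name `high_temperature_blocks` and the sample data are hypothetical, and treating the reference count as the number of occurrences in the group is an assumption.

```python
from collections import Counter
from statistics import mean

def high_temperature_blocks(blocks):
    """Return (hot blocks, threshold), where threshold is the group's average reference count."""
    counts = Counter(blocks)
    preset_index_temperature = mean(list(counts.values()))
    hot = [b for b, n in counts.items() if n > preset_index_temperature]
    return hot, preset_index_temperature

# Reference counts: A = 4, B = 2, C = 1, so the average is 7 / 3.
blocks = ["A", "A", "A", "A", "B", "B", "C"]
hot, threshold = high_temperature_blocks(blocks)
```

Only block "A" exceeds the group average here, so it alone would be treated as a high temperature text block.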

Optionally, in this embodiment, the storage medium is configured to store program code for executing the following steps: obtaining the high temperature text block from the data file to be compressed includes: performing data analysis on the data file to be compressed, and calculating, by a preset algorithm, the text blocks of a preset temperature ranking in the data file to be compressed; and determining the text blocks of the preset temperature ranking as the high temperature text block.

Further, optionally, in this embodiment, the storage medium is configured to store program code for executing the following steps: calculating, by the preset algorithm, the text blocks of the preset temperature ranking in the data file to be compressed includes: in the case where the data file to be compressed is a log data table, segmenting the log data table into words according to a preset word segmentation condition to obtain segmented logs; vectorizing the segmented logs to convert the logs into a high-dimensional vector space; clustering at least one high-dimensional vector space by a preset clustering algorithm to obtain a set of log similarity classes; generating a dictionary library from the set of log similarity classes, and generating a digital log from the dictionary library and the set of log similarity classes; calculating convolution blocks of different spans by preset spans, and determining the high-compression-rate convolution blocks of a preset ranking according to the product of the preset span and the number of occurrences in the digital log; and restoring the data file to be compressed according to the formatted encoding of the dictionary library, to obtain the high temperature text block.
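The first two steps of this pipeline, word segmentation and vectorization, can be sketched as follows. The application does not specify the segmentation condition or the vector representation, so the regular expression and the bag-of-words counts used here are assumptions made only for illustration.

```python
import re

def tokenize(line):
    """Assumed segmentation condition: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", line)

def vectorize(token_lists):
    """Map each segmented log to a bag-of-words count vector over a shared vocabulary."""
    vocab = sorted({tok for toks in token_lists for tok in toks})
    index = {tok: i for i, tok in enumerate(vocab)}
    vectors = []
    for toks in token_lists:
        vec = [0] * len(vocab)
        for tok in toks:
            vec[index[tok]] += 1
        vectors.append(vec)
    return vocab, vectors

logs = ["SELECT id FROM t", "SELECT id FROM u"]
tokens = [tokenize(line) for line in logs]
vocab, vectors = vectorize(tokens)
```

Each log becomes one point in a space with one dimension per distinct token, which is the form a clustering algorithm can consume.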

Optionally, in this embodiment, the storage medium is configured to store program code for executing the following steps: clustering at least one high-dimensional vector space by the preset clustering algorithm to obtain the set of log similarity classes includes: in the case where the preset clustering algorithm is the K-means clustering algorithm, clustering at least one high-dimensional vector space by the K-means clustering algorithm to obtain the set of log similarity classes.
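A minimal pure-Python sketch of K-means clustering over log vectors is given below. It is not the implementation of the application: the deterministic initialization (the first k vectors become the initial centroids), the fixed iteration count, and the two-dimensional sample vectors are all assumptions chosen to keep the example small and reproducible.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, iterations=10):
    """Assign each vector to one of k clusters; returns one cluster label per vector."""
    # Assumed initialization: the first k vectors serve as starting centroids.
    centroids = [[float(x) for x in v] for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector joins its nearest centroid.
        for i, v in enumerate(vectors):
            labels[i] = min(range(k), key=lambda c: squared_distance(v, centroids[c]))
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical log vectors: two logs dominated by the first token, two by the second.
vectors = [[2, 0], [0, 2], [2, 1], [1, 2]]
labels = kmeans(vectors, k=2)
```

Logs that share a label form one log similarity class; in practice a library implementation with better initialization would normally be preferred.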

Optionally, in this embodiment, the storage medium is configured to store program code for executing the following steps: generating the dictionary library from the set of log similarity classes, and generating the digital log from the dictionary library and the set of log similarity classes, includes: performing word frequency statistics on each segmented word in the set of log similarity classes to obtain the dictionary library; and mapping the set of log similarity classes through the dictionary library to obtain the digital log, wherein the digital log is used for convolution summation, and the convolution summation is used to determine the spans of similar text blocks.
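The dictionary library and the digital log can be sketched as a word-frequency table that assigns each segmented word an integer ID, followed by a mapping of every log onto those IDs. Assigning smaller IDs to more frequent words, and breaking ties alphabetically, are assumptions of this sketch rather than details stated in the application.

```python
from collections import Counter

def build_dictionary(token_lists):
    """Word frequency statistics: more frequent tokens get smaller integer IDs."""
    freq = Counter(tok for toks in token_lists for tok in toks)
    ordered = sorted(freq, key=lambda t: (-freq[t], t))  # ties broken alphabetically
    return {tok: i for i, tok in enumerate(ordered)}

def to_digital_log(token_lists, dictionary):
    """Map each segmented log to its sequence of dictionary IDs."""
    return [[dictionary[tok] for tok in toks] for toks in token_lists]

# One hypothetical log similarity class containing two segmented logs.
cluster = [["SELECT", "id", "FROM", "t"], ["SELECT", "name", "FROM", "t"]]
dictionary = build_dictionary(cluster)
digital_log = to_digital_log(cluster, dictionary)
```

Because the mapping is one-to-one, inverting the dictionary is enough to turn a digital log back into text.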

Optionally, in this embodiment, the storage medium is configured to store program code for executing the following steps: calculating the convolution blocks of different spans by the preset spans, and determining the high-compression-rate convolution blocks of the preset ranking according to the product of the preset span and the number of occurrences in the digital log, includes: calculating convolution sums of different spans according to the preset spans; obtaining high-compression-rate spans of a preset ranking according to the product of each preset span and the number of occurrences in the digital log of the convolution sums corresponding to that span; and calculating convolution blocks of different spans according to the high-compression-rate spans of the preset ranking, and determining the high-compression-rate convolution blocks of the preset ranking according to the product of the high-compression-rate span of the preset ranking and the number of occurrences in the digital log.
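One possible reading of this two-stage selection is sketched below. It assumes that a "convolution sum" of span s is the sum of s consecutive dictionary IDs (a sliding window with all-ones weights), that a "convolution block" is the window of IDs itself, and that the score is the span multiplied by the number of occurrences. These interpretations, and the function names, are assumptions and not definitions taken from the application.

```python
from collections import Counter

def windows(row, span):
    """All contiguous windows of length `span` in one digital log row."""
    return [tuple(row[i:i + span]) for i in range(len(row) - span + 1)]

def best_spans(digital_log, spans, top_n):
    """Stage 1: score each span by span * (count of its most frequent window sum)."""
    scored = []
    for span in spans:
        sums = Counter(sum(w) for row in digital_log for w in windows(row, span))
        if sums:
            scored.append((span * max(sums.values()), span))
    scored.sort(reverse=True)
    return [span for _, span in scored[:top_n]]

def best_blocks(digital_log, spans, top_n):
    """Stage 2: score the actual windows of the chosen spans by span * occurrence count."""
    scores = {}
    for span in spans:
        counts = Counter(w for row in digital_log for w in windows(row, span))
        for block, n in counts.items():
            scores[block] = span * n
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]

digital_log = [[1, 3, 0, 2], [1, 4, 0, 2], [1, 3, 0, 2]]
spans = best_spans(digital_log, spans=[1, 2, 3, 4], top_n=2)
blocks = best_blocks(digital_log, spans, top_n=1)
```

Window sums are cheap to count but can collide (here the windows (3, 0, 2) and (1, 4, 0) both sum to 5), which is one reason a second stage over the actual blocks is useful.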

Optionally, in this embodiment, the storage medium is configured to store program code for executing the following steps: storing the high temperature text block in place of the data file to be compressed includes: encoding the high temperature text block according to a preset model to obtain an encoded high temperature text block; and storing the encoded high temperature text block in place of the data file to be compressed.
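The encode-and-replace step might look like the sketch below, in which every occurrence of a high temperature text block is replaced by a short code and a codebook is kept for restoration. The application does not describe the preset model, so the `<i>` code format, the codebook, and the assumption that these codes never occur in the original logs are all hypothetical.

```python
def encode(lines, hot_blocks):
    """Replace each hot block with a short code; returns (encoded lines, codebook)."""
    codebook = {block: f"<{i}>" for i, block in enumerate(hot_blocks)}
    encoded = []
    for line in lines:
        for block, code in codebook.items():
            line = line.replace(block, code)
        encoded.append(line)
    return encoded, codebook

def decode(encoded, codebook):
    """Restore the original lines by expanding each code back to its block."""
    decoded = []
    for line in encoded:
        for block, code in codebook.items():
            line = line.replace(code, block)
        decoded.append(line)
    return decoded

lines = [
    "SELECT * FROM orders WHERE id = 1",
    "SELECT * FROM orders WHERE id = 2",
]
encoded, codebook = encode(lines, ["SELECT * FROM orders WHERE id = "])
restored = decode(encoded, codebook)
```

The encoded lines plus the codebook are what would be stored in place of the original file; the saving grows with the length of each hot block and the number of times it occurs.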

The serial numbers of the above embodiments of the application are for description only and do not represent the relative merits of the embodiments.

In the above embodiments of the application, the description of each embodiment has its own emphasis. For parts not described in detail in one embodiment, refer to the related descriptions of other embodiments.

In the several embodiments provided by the application, it should be understood that the disclosed technical content may be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the division into units is only a division by logical function, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection of units or modules through some interfaces, and may be electrical or take other forms.

The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, the functional units in the embodiments of the application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.

If the integrated unit is implemented in the form of a software functional unit and is sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the application, in essence or in the part that contributes over the prior art, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods of the embodiments of the application. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), a removable hard disk, a magnetic disk, or an optical disc.

The above are only preferred embodiments of the application. It should be noted that those of ordinary skill in the art may make several improvements and modifications without departing from the principle of the application, and these improvements and modifications shall also be regarded as falling within the protection scope of the application.

Claims

1. A method of data processing, characterized by comprising:

obtaining a high temperature text block from a data file to be compressed;

storing the high temperature text block in place of the data file to be compressed.

2. The method of data processing according to claim 1, characterized in that the high temperature text block is a text block whose temperature is greater than a preset index temperature, wherein the preset index temperature is the average reference count of the indexes in the same group.

3. The method of data processing according to claim 1, characterized in that obtaining the high temperature text block from the data file to be compressed comprises:

performing data analysis on the data file to be compressed, and calculating, by a preset algorithm, the text blocks of a preset temperature ranking in the data file to be compressed;

determining the text blocks of the preset temperature ranking as the high temperature text block.

4. The method of data processing according to claim 3, characterized in that calculating, by the preset algorithm, the text blocks of the preset temperature ranking in the data file to be compressed comprises:

in the case where the data file to be compressed is a log data table, segmenting the log data table into words according to a preset word segmentation condition to obtain segmented logs;

vectorizing the segmented logs to convert the logs into a high-dimensional vector space;

clustering at least one said high-dimensional vector space by a preset clustering algorithm to obtain a set of log similarity classes;

generating a dictionary library from the set of log similarity classes, and generating a digital log from the dictionary library and the set of log similarity classes;

calculating convolution blocks of different spans by preset spans, and determining the high-compression-rate convolution blocks of a preset ranking according to the product of the preset span and the number of occurrences in the digital log;

restoring the data file to be compressed according to the formatted encoding of the dictionary library, to obtain the high temperature text block.

5. The method of data processing according to claim 4, characterized in that clustering at least one high-dimensional vector space by the preset clustering algorithm to obtain the set of log similarity classes comprises:

in the case where the preset clustering algorithm is the K-means clustering algorithm, clustering at least one high-dimensional vector space by the K-means clustering algorithm to obtain the set of log similarity classes.

6. The method of data processing according to claim 4, characterized in that generating the dictionary library from the set of log similarity classes, and generating the digital log from the dictionary library and the set of log similarity classes, comprises:

performing word frequency statistics on each segmented word in the set of log similarity classes to obtain the dictionary library;

mapping the set of log similarity classes through the dictionary library to obtain the digital log, wherein the digital log is used for convolution summation, and the convolution summation is used to determine the spans of similar text blocks.

7. The method of data processing according to claim 4 or 6, characterized in that calculating the convolution blocks of different spans by the preset spans, and determining the high-compression-rate convolution blocks of the preset ranking according to the product of the preset span and the number of occurrences in the digital log, comprises:

calculating convolution sums of different spans according to the preset spans;

obtaining high-compression-rate spans of a preset ranking according to the product of each preset span and the number of occurrences in the digital log of the convolution sums corresponding to that span;

calculating convolution blocks of different spans according to the high-compression-rate spans of the preset ranking, and determining the high-compression-rate convolution blocks of the preset ranking according to the product of the high-compression-rate span of the preset ranking and the number of occurrences in the digital log.

8. The method of data processing according to claim 1, characterized in that storing the high temperature text block in place of the data file to be compressed comprises:

encoding the high temperature text block according to a preset model to obtain an encoded high temperature text block;

storing the encoded high temperature text block in place of the data file to be compressed.

9. A storage medium, characterized in that the storage medium includes a stored program, wherein, when the program runs, a device on which the storage medium is located is controlled to execute: obtaining a high temperature text block from a data file to be compressed; and storing the high temperature text block in place of the data file to be compressed.

10. A processor, characterized in that the processor is used to run a program, wherein the program, when run, executes: obtaining a high temperature text block from a data file to be compressed; and storing the high temperature text block in place of the data file to be compressed.

11. A method of data processing, characterized by comprising:

obtaining a target data object, wherein the target data object is stored at a target data address;

obtaining, from the target data object, a text block whose temperature is greater than a preset threshold, wherein the preset threshold includes a reference count or a reference frequency;

storing the text block at the target data address.