CN110442489A - The method and storage medium of data processing - Google Patents


Info

Publication number
CN110442489A
Authority
CN
China
Prior art keywords
text block
log
compressed
data
high temperature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810410873.8A
Other languages
Chinese (zh)
Other versions
CN110442489B (en)
Inventor
朱成生
俞飞江
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN201810410873.8A
Publication of CN110442489A
Application granted
Publication of CN110442489B
Current legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3065 Monitoring arrangements determined by the means or processing involved in reporting the monitored data
    • G06F11/3072 Monitoring arrangements determined by the means or processing involved in reporting the monitored data where the reporting involves data filtering, e.g. pattern matching, time or event triggered, adaptive or policy-based reporting
    • G06F11/3082 Monitoring arrangements determined by the means or processing involved in reporting the monitored data where the reporting involves data filtering, e.g. pattern matching, time or event triggered, adaptive or policy-based reporting the data filtering being achieved by aggregating or compressing the monitored data

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

This application discloses a data processing method and a storage medium. The method comprises: obtaining a high-heat text block from a data file to be compressed; and storing the high-heat text block in place of the data file to be compressed. The invention solves the technical problem that data compressed with general-purpose compression techniques still requires a very large amount of storage space.

Description

The method and storage medium of data processing
Technical field
This application relates to the field of Internet technology applications, and in particular to a data processing method and a storage medium.
Background art
As the Internet industry expands, more and more industries become linked to the Internet, and with that comes the generation of large amounts of data. At the enterprise level in particular, the creation, execution, and archiving of routine work all produce large amounts of data. The databases used to retrieve data and the storage used to hold it all rely on SQL statements as call instructions, or to manage logs, when logs are generated. SQL statements occupy many bytes, however, and the large storage space they require increasingly troubles the operations and maintenance staff responsible for enterprise data.
Existing solutions use cold data storage: the data to be stored is compressed with a general-purpose compression technique to reduce its storage requirement, and the compressed data is then stored. The problem with this prior art is that, for data generated in large volumes, the compressed data still requires a very large amount of storage space, which puts considerable pressure on limited physical storage.
No effective solution has yet been proposed for the problem that data compressed with general-purpose compression techniques still requires a very large amount of storage space.
Summary of the invention
The embodiments of this application provide a data processing method and a storage medium, to at least solve the technical problem that data compressed with general-purpose compression techniques still requires a very large amount of storage space.
According to one aspect of the embodiments of this application, a data processing method is provided, comprising: obtaining a high-heat text block from a data file to be compressed; and storing the high-heat text block in place of the data file to be compressed.
Optionally, the high-heat text block is a text block whose heat is greater than a preset index heat, where the preset index heat is the average reference count of the indices in the same group.
Optionally, obtaining the high-heat text block from the data file to be compressed comprises: performing data analysis on the data file to be compressed, and computing, by a preset algorithm, the text blocks at a preset heat ranking in the data file to be compressed; and determining the text blocks at the preset heat ranking as high-heat text blocks.
Further, optionally, computing, by the preset algorithm, the text blocks at the preset heat ranking in the data file to be compressed comprises: when the data file to be compressed is a log data table, segmenting the logs in the log data table according to a preset word segmentation condition to obtain segmented logs; vectorizing the segmented logs to map the logs into a high-dimensional vector space; clustering at least one high-dimensional vector space by a preset clustering algorithm to obtain a set of similar-log classes; generating a dictionary library from the set of similar-log classes, and generating a numeric log from the dictionary library and the set of similar-log classes; computing convolution blocks of different spans from preset spans, and determining the high-compression-ratio convolution blocks at a preset ranking according to the product of the preset span and the number of occurrences in the numeric log; and restoring the data file to be compressed according to the dictionary library's formatted encoding to obtain the high-heat text blocks.
Optionally, clustering at least one high-dimensional vector space by the preset clustering algorithm to obtain the set of similar-log classes comprises: when the preset clustering algorithm is the K-means clustering algorithm, clustering at least one high-dimensional vector space by the K-means clustering algorithm to obtain the set of similar-log classes.
Optionally, generating the dictionary library from the set of similar-log classes and generating the numeric log from the dictionary library and the set of similar-log classes comprises: performing word frequency statistics on each segmented word in the set of similar-log classes to obtain the dictionary library; and mapping the set of similar-log classes through the dictionary library to obtain the numeric log, where the numeric log is used for convolution summation, and convolution summation is used to determine the span of similar text blocks.
Optionally, computing the convolution blocks of different spans from the preset spans and determining the high-compression-ratio convolution blocks at the preset ranking according to the product of the preset span and the number of occurrences in the numeric log comprises: computing convolution sums for the different spans according to the preset spans; obtaining the high-compression-ratio spans at the preset ranking according to the product of each span and the number of times the convolution sum corresponding to that span occurs in the numeric log; and computing convolution blocks of different spans according to the high-compression-ratio spans at the preset ranking, and determining the high-compression-ratio convolution blocks at the preset ranking according to the product of each such span and the number of occurrences in the numeric log.
Optionally, storing the high-heat text block in place of the data file to be compressed comprises: encoding the high-heat text block according to a preset model to obtain an encoded high-heat text block; and storing the encoded high-heat text block in place of the data file to be compressed.
According to another aspect of the embodiments of this application, a storage medium is also provided. The storage medium includes a stored program, and when the program runs it controls the device on which the storage medium resides to: obtain a high-heat text block from a data file to be compressed; and store the high-heat text block in place of the data file to be compressed.
According to another aspect of the embodiments of this application, a processor is also provided. The processor is configured to run a program, and the program, when run, performs: obtaining a high-heat text block from a data file to be compressed; and storing the high-heat text block in place of the data file to be compressed.
According to yet another aspect of the embodiments of this application, a data processing method is also provided, comprising: obtaining a target data object, where the target data object is stored at a target data address; obtaining, from the target data object, text blocks whose heat is greater than a preset threshold, where the preset threshold includes a reference count or a reference frequency; and storing the text blocks at the target data address.
In the embodiments of this application, high-heat text blocks are obtained from the data file to be compressed and stored in place of that file. This achieves the goal of encoding and compressing according to the high-heat text blocks in each different log, which realizes the technical effect of reducing storage space and thereby solves the technical problem that data compressed with general-purpose compression techniques still requires a very large amount of storage space.
Brief description of the drawings
The drawings described here are provided for further understanding of this application and form a part of it. The illustrative embodiments of this application and their descriptions serve to explain the application and do not unduly limit it. In the drawings:
Fig. 1 is a hardware block diagram of a computer terminal for a data processing method according to an embodiment of this application;
Fig. 2 is a flowchart of the data processing method according to Embodiment 1 of this application;
Fig. 3 is a flowchart of a data processing method according to Embodiment 1 of this application;
Fig. 4 is a flowchart of computing high-heat text blocks in the data processing method according to Embodiment 1 of this application;
Fig. 5 is a flowchart of the data processing method according to Embodiment 2 of this application.
Detailed description of the embodiments
To help those skilled in the art better understand the solution of this application, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently only some, not all, of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of this application without creative effort shall fall within the scope of protection of this application.
It should be noted that the terms "first", "second", and the like in the description, claims, and drawings of this application are used to distinguish similar objects and do not necessarily describe a particular order or sequence. It should be understood that data so labeled are interchangeable where appropriate, so that the embodiments described here can be implemented in orders other than those illustrated or described. In addition, the terms "comprise" and "have" and any variants of them are intended to cover non-exclusive inclusion: a process, method, system, product, or device that contains a series of steps or units is not necessarily limited to the steps or units expressly listed, and may include other steps or units that are not expressly listed or that are inherent to the process, method, product, or device.
Technical terms used in this application:
Data compression: a technique that, without losing useful information, reduces the volume of data to save storage space and improve the efficiency of transmission, storage, and processing, or that reorganizes data according to some algorithm to reduce its redundancy and storage footprint.
Word segmentation: splitting text into one or more words.
Fast convolution: computing a convolution sum for each starting point and span.
Aggregation: common aggregations include count, distinct count, sum, maximum, and minimum.
Embodiment 1
According to an embodiment of this application, an embodiment of a data processing method is also provided. It should be noted that the steps shown in the flowcharts of the drawings may be executed in a computer system, for example as a set of computer-executable instructions, and that although a logical order is shown in the flowcharts, in some cases the steps shown or described may be executed in a different order.
The method embodiment provided in Embodiment 1 of this application may be executed in a mobile terminal, a computer terminal, or a similar computing device. Taking execution on a computer terminal as an example, Fig. 1 is a hardware block diagram of a computer terminal for a data processing method according to an embodiment of this application. As shown in Fig. 1, the computer terminal 10 may include one or more processors 102 (only one is shown in the figure; the processor 102 may include, but is not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)), a memory 104 for storing data, and a transmission device 106 for communication. A person of ordinary skill in the art will understand that the structure shown in Fig. 1 is only illustrative and does not limit the structure of the electronic device described above. For example, the computer terminal 10 may include more or fewer components than shown in Fig. 1, or have a configuration different from that shown in Fig. 1.
The memory 104 may be used to store the software programs and modules of application software, such as the program instructions/modules corresponding to the data processing method in the embodiments of this application. The processor 102 runs the software programs and modules stored in the memory 104 to execute various functional applications and data processing, that is, to implement the data processing method described above. The memory 104 may include high-speed random access memory and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, and such remote memory may be connected to the computer terminal 10 over a network. Examples of such networks include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations of these.
The transmission device 106 is used to receive or send data over a network. A specific example of such a network is a wireless network provided by the communication provider of the computer terminal 10. In one example, the transmission device 106 includes a network adapter (Network Interface Controller, NIC), which can connect to other network devices through a base station in order to communicate with the Internet. In another example, the transmission device 106 may be a radio frequency (Radio Frequency, RF) module, which communicates with the Internet wirelessly.
In the operating environment described above, this application provides the data processing method shown in Fig. 2. Fig. 2 is a flowchart of the data processing method according to Embodiment 1 of this application.
Step S202: obtain a high-heat text block from a data file to be compressed;
In step S202, the data processing method provided by this application selects, from the data files to be compressed, a log data table with a large storage footprint and analyzes the structure of the log data table. It then uses an algorithm model to compute the text blocks at a preset ranking in the data file to be compressed. The text blocks at the preset ranking may be the top N text blocks in the data file to be compressed, where N is an integer such as 1, 2, 3, ..., 10, and so on. The high-heat text blocks referred to in this application are these top N text blocks in the data file to be compressed.
In obtaining the high-heat text blocks, data analysis and detection can be used to find partial regularities in the log content, and the constructed algorithm model then identifies the high-heat text blocks.
When computing the heat of a text block, the heat can be derived from the reference count of the text block. This application defines the index heat as the average reference count of the indices in the same group: suppose a sample file contains n lines of log text strings and text block i_m has reference count r_m; then the heat of all text blocks in the sample is: [formula missing in the source text]
Unlike prior-art storage based on general-purpose compression, the data processing method provided by this application overcomes the inability to encode and compress the high-heat text blocks specific to each different log. Step S202 obtains the high-heat text blocks from the data file to be compressed so that compression and storage can be targeted; the storage step is described in step S204.
Step S204: store the high-heat text block in place of the data file to be compressed.
In step S204, the high-heat text blocks obtained in step S202, as computed by the data model, are re-encoded, and the re-encoded high-heat text blocks are stored in place of the data file to be compressed from step S202.
Specifically, Fig. 3 is a flowchart of a data processing method according to Embodiment 1 of this application. Combining steps S202 to S204, the data processing method provided by this application is applicable to database logs. The SQL statements in such logs occupy a large number of bytes but are highly similar to one another, so the logs contain many high-heat text blocks. Re-encoding these text blocks reduces the storage space the data requires. The method therefore compresses and stores high-heat text blocks by re-encoding them, achieving the technical effect of reducing the demand for storage space.
In the embodiments of this application, high-heat text blocks are obtained from the data file to be compressed and stored in place of that file. This achieves the goal of encoding and compressing according to the high-heat text blocks in each different log, which realizes the technical effect of reducing storage space and thereby solves the technical problem that data compressed with general-purpose compression techniques still requires a very large amount of storage space.
Optionally, the high-heat text block is a text block whose heat is greater than a preset index heat, where the preset index heat is the average reference count of the indices in the same group.
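The threshold rule above can be sketched in a few lines. This is only an illustration under stated assumptions: a block's heat is taken to be the number of times the block occurs, and the preset index heat is taken to be the mean count over all distinct blocks. The function name `high_heat_blocks` and the sample data are hypothetical and do not come from the patent.

```python
from collections import Counter


def high_heat_blocks(blocks):
    """Return the blocks whose reference count exceeds the average count.

    Assumption: heat == number of occurrences of the block in the sample.
    """
    counts = Counter(blocks)                      # reference count per distinct block
    if not counts:
        return {}
    average = sum(counts.values()) / len(counts)  # assumed "preset index heat"
    return {block: count for block, count in counts.items() if count > average}


sample = [
    "SELECT * FROM t",
    "SELECT * FROM t",
    "SELECT * FROM t",
    "UPDATE t SET a=1",
    "DELETE FROM t",
]
hot = high_heat_blocks(sample)  # only the SELECT block is above the mean of 5/3
```

Using the mean as the cut-off means the threshold adapts to each sample file instead of being a fixed constant.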
Fig. 4 is a flowchart of computing high-heat text blocks in the data processing method according to Embodiment 1 of this application. High-heat text blocks are computed as follows:
Optionally, obtaining the high-heat text block from the data file to be compressed in step S202 comprises:
Step S2021: perform data analysis on the data file to be compressed, and compute, by a preset algorithm, the text blocks at a preset heat ranking in the data file to be compressed;
Step S2022: determine the text blocks at the preset heat ranking as high-heat text blocks.
Specifically, combining steps S2021 and S2022, the data processing method provided by this application selects, based on business data, a log data table with a large storage footprint from the data files to be compressed and analyzes the structure of the log data table. It then uses a preset computation model to compute the top-ranked (TOP) text blocks in the log file, which yields the high-heat text blocks.
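A ranking-based selection like the one in steps S2021 and S2022 could look as follows. This is a hedged sketch that ranks blocks purely by occurrence count; the patent's preset computation model is more involved, and the name `top_n_blocks` is an assumption made for illustration.

```python
from collections import Counter


def top_n_blocks(blocks, n):
    """Return the n most frequently referenced blocks with their counts."""
    return Counter(blocks).most_common(n)


ranked = top_n_blocks(["a", "b", "a", "c", "a", "b"], 2)
```

`Counter.most_common` orders blocks with equal counts by first appearance, so the ranking is deterministic for a given input order.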
Further, optionally, computing, by the preset algorithm, the text blocks at the preset heat ranking in the data file to be compressed in step S2021 comprises:
Step S20211: when the data file to be compressed is a log data table, segment the logs in the log data table according to a preset word segmentation condition to obtain segmented logs;
In step S20211, segmenting the log data table according to the preset word segmentation condition may use either of the following two segmentation methods:
Taking TXXX_CHN and INTERNET_CHN as examples, each turns a sentence into words separated by spaces. The two segmentation modes are similar; the former has a built-in vocabulary of related segmentation terms and can also segment according to a user-defined segmentation standard, which makes it more flexible.
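The two named tokenizers are not reproduced here. As a minimal sketch of the simplest case, the code below normalizes a log line by replacing TAB and newline characters with spaces and then splits on spaces, which is how the summary of the preferred example later in this description characterizes segmentation. The function name `segment` is hypothetical.

```python
import re


def segment(line):
    """Normalize whitespace and split a log line into space-separated tokens."""
    normalized = re.sub(r"[\t\r\n]+", " ", line)  # TAB / newline -> space
    return [token for token in normalized.split(" ") if token]


tokens = segment("SELECT\tid\nFROM  users")
```

Empty tokens produced by consecutive spaces are dropped so that later steps see only real words.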
Step S20212: vectorize the segmented logs to map the logs into a high-dimensional vector space;
In step S20212, the segmented logs obtained in step S20211 are vectorized, which maps the logs into a high-dimensional vector space.
There are two main kinds of word-vector model:
The CBOW model takes as training input the word vectors of the context words surrounding a given target word, and outputs the word vector of that target word;
The Skip-Gram model reverses the CBOW approach: its input is the word vector of a given word, and its output is the word vectors of that word's context.
On top of word vectors, this application applies the Doc2Vec (sentence vector) model, which likewise has two variants: Distributed Memory (DM) and Distributed Bag of Words (DBOW). DM tries to predict the probability of a word given its context and a paragraph vector. While training on a sentence or document, the paragraph ID stays fixed and the same paragraph vector is shared. DBOW predicts the probability of a random set of words in the paragraph given only the paragraph vector.
For example, the input "this is a sentence" is segmented into "this is", "a", "sentence";
Running sentence vectorization, for example DM with 100-dimensional output, gives:
doc_id    ver1    ver2    …    ver100
1         0.1     0.2     …    0.5
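Training a Doc2Vec model needs a third-party library and a corpus, so the sketch below swaps in a much simpler technique, a bag-of-words count vector, purely to show the shape of the result: one fixed-length numeric vector per segmented log. It is a stand-in, not the DM or DBOW model the text describes, and the names `build_vocabulary` and `vectorize` are hypothetical.

```python
def build_vocabulary(segmented_logs):
    """Assign each distinct token a column index in order of first appearance."""
    vocabulary = {}
    for tokens in segmented_logs:
        for token in tokens:
            vocabulary.setdefault(token, len(vocabulary))
    return vocabulary


def vectorize(tokens, vocabulary):
    """Bag-of-words stand-in for a sentence vector: one count per vocabulary entry."""
    vector = [0.0] * len(vocabulary)
    for token in tokens:
        if token in vocabulary:
            vector[vocabulary[token]] += 1.0
    return vector


logs = [
    ["SELECT", "id", "FROM", "users"],
    ["SELECT", "name", "FROM", "users"],
]
vocab = build_vocabulary(logs)
vectors = [vectorize(tokens, vocab) for tokens in logs]
```

Similar logs end up with vectors that differ in only a few positions, which is the property the clustering step relies on.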
Step S20213: cluster at least one high-dimensional vector space by a preset clustering algorithm to obtain a set of similar-log classes;
Clustering at least one high-dimensional vector space by the preset clustering algorithm to obtain the set of similar-log classes comprises:
Step S202131: when the preset clustering algorithm is the K-means clustering algorithm, cluster at least one high-dimensional vector space by the K-means clustering algorithm to obtain the set of similar-log classes.
Based on the high-dimensional vector space obtained in step S20212, at least one high-dimensional vector space is clustered by the preset clustering algorithm to obtain the set of similar-log classes. Existing clustering algorithms include the following three:
K-Means: one-dimensional grouping, computed using the notion of distance;
Kohonen: two-dimensional grouping, using a neural-network-like self-organizing model;
2-Step: automatically finds the most suitable number of groups;
Although 2-Step trains quickly, the advantage of K-Means for this application is that the number of clusters can be specified. Different log volumes call for different numbers of clusters, and specifying the number avoids the uncontrollability of automatically forming N clusters, which is more flexible. This application therefore uses the K-Means algorithm as a preferred example; any algorithm that implements the data processing method provided by this application may be used, and no specific limitation is imposed.
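To make the K-Means choice concrete, here is a minimal sketch of Lloyd's algorithm in plain Python. It assumes Euclidean distance and, to stay deterministic, initializes the centroids from the first k points; a production system would more likely use a library implementation with better initialization. The function name `k_means` is hypothetical.

```python
import math


def k_means(points, k, iterations=20):
    """Cluster points into k groups; returns (labels, centroids)."""
    if k > len(points):
        raise ValueError("k must not exceed the number of points")
    centroids = [list(p) for p in points[:k]]  # deterministic initialization
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids


vectors = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
labels, centroids = k_means(vectors, k=2)
```

The caller chooses `k`, which reflects the reason the text gives for preferring K-Means: the number of clusters can be set to match the log volume.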
Step S20214: generate a dictionary library from the set of similar-log classes, and generate a numeric log from the dictionary library and the set of similar-log classes;
Generating the dictionary library from the set of similar-log classes and generating the numeric log from the dictionary library and the set of similar-log classes comprises:
Step S202141: perform word frequency statistics on each segmented word in the set of similar-log classes to obtain the dictionary library;
Step S202142: map the set of similar-log classes through the dictionary library to obtain the numeric log, where the numeric log is used for convolution summation, and convolution summation is used to determine the span of similar text blocks.
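Steps S202141 and S202142 can be sketched as follows, assuming that the dictionary library maps each token to an integer id with the most frequent token receiving the smallest id. The id scheme and the names `build_dictionary` and `to_numeric_log` are assumptions made for illustration; the patent does not specify them.

```python
from collections import Counter


def build_dictionary(similar_logs):
    """Word-frequency statistics -> token-to-id dictionary (most frequent first)."""
    frequency = Counter(token for tokens in similar_logs for token in tokens)
    return {
        token: index
        for index, (token, _count) in enumerate(frequency.most_common(), start=1)
    }


def to_numeric_log(similar_logs, dictionary):
    """Replace every token with its dictionary id."""
    return [[dictionary[token] for token in tokens] for tokens in similar_logs]


cluster = [
    ["SELECT", "id", "FROM", "users"],
    ["SELECT", "name", "FROM", "users"],
]
dictionary = build_dictionary(cluster)
numeric_log = to_numeric_log(cluster, dictionary)
```

Once the log is a sequence of integers, sums over sliding windows become cheap to compute, which is what the convolution step exploits.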
In step S20214, word frequency statistics are computed for each similar class to form the dictionary library, and the logs are mapped to a numeric log on which convolution sums can be computed;
The purpose of convolution summation is to quickly determine the span of similar text blocks;
For example:
import tensorflow as tf
# j is the span; x_image is the numeric log as a 4-D NHWC float tensor with one channel
W_conv1 = tf.ones([j, 1, 1, 1])
conv = tf.nn.conv2d(x_image, W_conv1, strides=[1, 1, 1, 1], padding='VALID')
TensorFlow functions are used here only because they support parallel GPU computation, which is more efficient.
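For a one-dimensional numeric log, convolving with an all-ones kernel of length j under VALID padding is the same as taking the sum of every window of j consecutive values. The sketch below shows that equivalent computation in plain Python using prefix sums, without the GPU acceleration the text mentions. The function name `window_sums` and the sample log are hypothetical.

```python
from itertools import accumulate


def window_sums(numeric_log, span):
    """Sum of every window of `span` consecutive ids (ones kernel, VALID padding)."""
    prefix = [0] + list(accumulate(numeric_log))
    return [
        prefix[start + span] - prefix[start]
        for start in range(len(numeric_log) - span + 1)
    ]


sums = window_sums([1, 4, 2, 3, 1, 5, 2, 3], span=2)
# sums == [5, 6, 5, 4, 6, 7, 5]
```

Repeated values in the result mark candidate positions where similar blocks of that span may occur.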
Step S20215: compute convolution blocks of different spans from preset spans, and determine the high-compression-ratio convolution blocks at a preset ranking according to the product of the preset span and the number of occurrences in the numeric log;
Computing the convolution blocks of different spans from the preset spans and determining the high-compression-ratio convolution blocks at the preset ranking according to the product of the preset span and the number of occurrences in the numeric log comprises:
Step S202151: compute convolution sums for the different spans according to the preset spans;
Step S202152: obtain the high-compression-ratio spans at the preset ranking according to the product of each span and the number of times the convolution sum corresponding to that span occurs in the numeric log;
Step S202153: compute convolution blocks of different spans according to the high-compression-ratio spans at the preset ranking, and determine the high-compression-ratio convolution blocks at the preset ranking according to the product of each such span and the number of occurrences in the numeric log.
It should be noted that text blocks with the same convolution sum are not necessarily identical: the order of elements may be swapped, giving the same convolution sum for what are in fact different text blocks. Only a span of similar blocks can therefore be obtained at this point, and the log blocks must subsequently be extracted according to that span and matched exactly.
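The caveat about equal sums can be demonstrated directly. In this sketch, windows are first grouped by their sum and then counted by exact content, so blocks that merely share a sum are kept apart. The function name `exact_matches_by_sum` and the sample log are illustrative assumptions.

```python
from collections import Counter, defaultdict


def exact_matches_by_sum(numeric_log, span):
    """Group windows by sum, then count exact window contents within each group."""
    windows_by_sum = defaultdict(list)
    for start in range(len(numeric_log) - span + 1):
        window = tuple(numeric_log[start:start + span])
        windows_by_sum[sum(window)].append(window)
    return {total: dict(Counter(windows)) for total, windows in windows_by_sum.items()}


groups = exact_matches_by_sum([1, 2, 3, 3, 2, 1], span=3)
# (1, 2, 3) and (3, 2, 1) both sum to 6 but are different blocks.
```

The sum is a cheap filter that narrows the candidates; the exact comparison is what confirms that two blocks are really the same.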
Specifically, for spans from 2 to n, the convolution sums of the different spans are computed quickly. The top-N (TOPN) high-compression-ratio spans are selected according to the product of each span and the number of times the convolution sum corresponding to that span occurs in the log file. The convolution blocks of different spans are then recomputed for the selected TOPN high-compression-ratio spans, and the TOPN high-compression-ratio convolution blocks are determined according to the product of each such span and the number of occurrences in the log file.
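The scoring rule, span multiplied by number of occurrences, can be sketched as below. For brevity this simplified version skips the sum-based pre-selection of spans and scores exact blocks for every span from 2 to `max_span`; it also counts overlapping occurrences, which a real encoder would need to handle. The function name `top_blocks` and the sample log are hypothetical.

```python
from collections import Counter


def top_blocks(numeric_log, max_span, top_n):
    """Rank repeated blocks by span * occurrences; returns (score, block, count)."""
    scored = []
    for span in range(2, max_span + 1):
        windows = Counter(
            tuple(numeric_log[start:start + span])
            for start in range(len(numeric_log) - span + 1)
        )
        for block, count in windows.items():
            if count > 1:  # a block seen once offers nothing to compress
                scored.append((span * count, block, count))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_n]


log = [1, 2, 3, 9, 1, 2, 3, 8, 1, 2, 3]
best = top_blocks(log, max_span=3, top_n=1)
```

The product favors blocks that are both long and frequent, since replacing such a block with a short code saves the most space.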
Step S20216: restore the data file to be compressed according to the dictionary library's formatted encoding to obtain the high-heat text blocks.
Optionally, storing the high-heat text block in place of the data file to be compressed in step S204 comprises:
Step S2041: encode the high-heat text block according to a preset model to obtain an encoded high-heat text block; and store the encoded high-heat text block in place of the data file to be compressed.
The preset model includes multiple algorithm models such as sentence vectorization, clustering, and deep-learning convolution.
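The patent does not spell out the encoding format, so the following is only a sketch of the general idea: each occurrence of a high-heat block is replaced by a short code, and the block is kept once so that the original log can be restored. It assumes the code string never appears as an ordinary token; the names `encode` and `decode` and the code `#1` are hypothetical.

```python
def encode(tokens, block, code):
    """Replace every occurrence of `block` (a token list) with a single `code` token."""
    if not block:
        raise ValueError("block must not be empty")
    encoded, position, length = [], 0, len(block)
    while position < len(tokens):
        if tokens[position:position + length] == block:
            encoded.append(code)
            position += length
        else:
            encoded.append(tokens[position])
            position += 1
    return encoded


def decode(encoded, block, code):
    """Expand every `code` token back into the original block."""
    decoded = []
    for token in encoded:
        decoded.extend(block if token == code else [token])
    return decoded


line = "SELECT id FROM users WHERE id = 1".split(" ")
hot_block = ["SELECT", "id", "FROM", "users"]
packed = encode(line, hot_block, "#1")
restored = decode(packed, hot_block, "#1")
```

Because decoding restores the exact token sequence, the replacement is lossless while the stored form is shorter wherever the block recurs.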
In summary, as shown in Fig. 4, a preferred example of computing the high-heat text block in the data processing method provided by this application is as follows:
(1) Extract and tokenize the log (step 1 in Fig. 4);
Normalize the log file (replace TAB and newline characters with spaces) and split it into tokens on spaces;
(2) Vectorize the tokenized log (step 2 in Fig. 4);
Vectorization converts the log into a high-dimensional vector space;
(3) Cluster (steps 3-4 in Fig. 4);
Once the log has been converted into a high-dimensional vector space, ordinary K-means clustering groups similar logs together;
(4) Format each similarity class (steps 5-8 in Fig. 4);
Compute word-frequency statistics for each similarity class to build a dictionary library, and map the class to a digital log on which convolution sums can be computed;
(5) Fast convolution and span selection (steps 9-12 in Fig. 4);
For spans from 2 to n, quickly compute the convolution sums of each span;
According to the product of each span and the number of times the corresponding convolution sum occurs in the log file, select the TOP N high-compression-rate spans;
(6) Convolution-based block cutting and compression-rate evaluation (steps 13-15 in Fig. 4);
Using the spans determined above, recompute the convolution blocks of the different spans, and determine the TOP N high-compression-rate convolution blocks according to the product of the span and the number of occurrences in the log file;
(7) Format encoding (step 16 in Fig. 4);
Using the dictionary library for format encoding, restore the actual log text block content to obtain the high-heat text blocks.
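Steps (1) and (2) above can be sketched as follows. The normalization (TAB and newline to space, split on spaces) follows the text. The vectorization method is not named in the text, so a simple bag-of-words count vector is assumed here for illustration; the function names and the choice to process one log record at a time are also assumptions.

```python
from collections import Counter


def tokenize_log(record):
    """Step (1): replace TABs and newlines with spaces, then split on spaces."""
    normalized = record.replace("\t", " ").replace("\r", " ").replace("\n", " ")
    return [token for token in normalized.split(" ") if token]


def build_vocabulary(tokenized_records):
    """Give every distinct token a column index, in order of first appearance."""
    vocabulary = {}
    for tokens in tokenized_records:
        for token in tokens:
            vocabulary.setdefault(token, len(vocabulary))
    return vocabulary


def vectorize(tokens, vocabulary):
    """Step (2), assumed form: a count vector with one dimension per token."""
    vector = [0] * len(vocabulary)
    for token, count in Counter(tokens).items():
        if token in vocabulary:
            vector[vocabulary[token]] = count
    return vector


records = ["GET\t/index.html 200", "GET /index.html\n200", "POST /login 302"]
tokenized = [tokenize_log(r) for r in records]
vocabulary = build_vocabulary(tokenized)
vectors = [vectorize(tokens, vocabulary) for tokens in tokenized]
```

With this representation, records that differ only in whitespace map to the same vector, which is what lets the clustering step group them.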
By tokenizing log-type big data, the data processing method provided by this application finds high-heat text blocks, encodes and compresses them, and stores the encoded and compressed high-heat text blocks in place of the original data file to be stored. This reduces the demand for storage space, improves storage utilization, and eases later maintenance.
It should be noted that, for simplicity of description, each of the foregoing method embodiments is described as a series of combined actions. Those skilled in the art should understand, however, that this application is not limited by the described order of actions, because according to this application some steps may be performed in another order or simultaneously. Those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily required by this application.
From the description of the above embodiments, those skilled in the art can clearly understand that the data processing method according to the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, or by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of this application, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions that cause a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to execute the methods described in the embodiments of this application.
Embodiment 2
This application provides the data processing method shown in Fig. 5. Fig. 5 is a flowchart of the data processing method according to Embodiment 2 of this application.
Step S502: obtain a target data object, where the target data object is stored at a target data address;
In step S502, the data processing method provided by this application obtains the target data object stored at the target data address. The target data object may include a data file to be compressed, which may include: a function file stored in a database for calling data, a program file for running data processing, or an encryption/decryption program file for encrypting and decrypting data. These are only examples and are not limiting.
Step S504: obtain, from the target data object, a text block whose heat is greater than a preset threshold, where the preset threshold includes a reference count or a reference frequency;
In step S504, based on the target data object obtained in step S502, the data processing method provided by this application selects the log data table with a large storage footprint from the target data object, performs data analysis on the format of the log data table, and then uses an algorithm model to compute the text blocks with a preset ranking in the target data object. The text blocks with the preset ranking may be the TOP N text blocks in the target data object, where N is an integer, for example 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, ..., N. The text block provided by this application, that is, a TOP N text block in the target data object (for example, a TOP 3 text block), is a text block whose heat is greater than the preset threshold.
In obtaining text blocks whose heat is greater than the preset threshold, data analysis and detection can be used to find local regularities in the log content, and the constructed algorithm model then finds the text blocks whose heat is greater than the preset threshold.
Here, the heat of a text block can be computed from its reference count. This application defines the index heat as the average reference count of indexes in the same group: assume that a sample file contains n lines of log text strings and that text block i_m has reference count r_m; then the heat of all text blocks in the sample is
Step S506: store the text block at the target data address.
Unlike prior-art storage based on general-purpose compression, the data processing method provided by this application overcomes the inability to encode and compress the high-heat text blocks in each different log. It obtains from the target data object the text blocks whose heat is greater than the preset threshold and compresses and stores them in a targeted way, so that the stored text blocks whose heat is greater than the preset threshold replace the original target data object, saving storage space.
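The threshold test of step S504 can be illustrated with a short sketch. The text states only that the preset threshold includes a reference count or a reference frequency; the function name `select_hot_blocks`, its parameters, and the definition of frequency as reference count divided by the number of log lines are assumptions made for this example.

```python
def select_hot_blocks(reference_counts, total_lines, min_count=None, min_frequency=None):
    """Return the blocks whose reference count or frequency exceeds the threshold.

    reference_counts maps a text block to the number of times it is referenced.
    Frequency is assumed here to be count / total_lines.
    """
    hot = []
    for block, count in reference_counts.items():
        frequency = count / total_lines if total_lines else 0.0
        if min_count is not None and count > min_count:
            hot.append(block)
        elif min_frequency is not None and frequency > min_frequency:
            hot.append(block)
    return hot


counts = {"GET /index.html": 40, "POST /login": 5, "DELETE /x": 1}
by_count = select_hot_blocks(counts, total_lines=50, min_count=10)
by_frequency = select_hot_blocks(counts, total_lines=50, min_frequency=0.05)
```

Only the blocks that pass the test would then be stored at the target data address in step S506.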
In this embodiment of the application, a target data object stored at a target data address is obtained; text blocks whose heat is greater than a preset threshold are obtained from the target data object, where the preset threshold includes a reference count or a reference frequency; and the text blocks are stored at the target data address. This achieves the purpose of encoding and compressing according to the high-heat text blocks in each different log, thereby reducing storage space and solving the technical problem that data compressed with common compression techniques still requires a large amount of storage space.
Embodiment 3
According to another aspect of the embodiments of this application, a storage medium is further provided. The storage medium includes a stored program, where, when the program runs, the device on which the storage medium resides is controlled to: obtain a high-heat text block from a data file to be compressed; and store the high-heat text block in place of the data file to be compressed.
Embodiment 4
According to another aspect of the embodiments of this application, a processor is further provided. The processor is configured to run a program, where the program, when run, executes: obtaining a high-heat text block from a data file to be compressed; and storing the high-heat text block in place of the data file to be compressed.
Embodiment 5
An embodiment of this application further provides a storage medium. Optionally, in this embodiment, the storage medium can be used to store the program code executed by the data processing method provided in Embodiment 1.
Optionally, in this embodiment, the storage medium may be located in any computer terminal of a computer terminal group in a computer network, or in any mobile terminal of a mobile terminal group.
Optionally, in this embodiment, the storage medium is configured to store program code for executing the following steps: obtaining a high-heat text block from a data file to be compressed; and storing the high-heat text block in place of the data file to be compressed.
Optionally, in this embodiment, the storage medium is configured to store program code for executing the following step: the high-heat text block is a text block whose heat is greater than a preset index heat, where the preset index heat is the average reference count of indexes in the same group.
Optionally, in this embodiment, the storage medium is configured to store program code for executing the following steps: obtaining the high-heat text block from the data file to be compressed includes: performing data analysis on the data file to be compressed, and computing, by a preset algorithm, the text blocks with a preset heat ranking in the data file to be compressed; and determining the text blocks with the preset heat ranking to be the high-heat text blocks.
Further, optionally, in this embodiment, the storage medium is configured to store program code for executing the following steps: computing, by the preset algorithm, the text blocks with the preset heat ranking in the data file to be compressed includes: when the data file to be compressed is a log data table, tokenizing the log data table according to a preset tokenization condition to obtain a tokenized log; vectorizing the tokenized log to convert the log into a high-dimensional vector space; clustering at least one high-dimensional vector space by a preset clustering algorithm to obtain a log similarity-class set; generating a dictionary library from the log similarity-class set, and generating a digital log from the dictionary library and the log similarity-class set; computing convolution blocks of different spans using preset spans, and determining the high-compression-rate convolution blocks of a preset ranking according to the product of the preset span and the number of occurrences in the digital log; and, using the dictionary library for format encoding, restoring the data file to be compressed to obtain the high-heat text blocks.
Optionally, in this embodiment, the storage medium is configured to store program code for executing the following steps: clustering at least one high-dimensional vector space by the preset clustering algorithm to obtain the log similarity-class set includes: when the preset clustering algorithm is the K-means clustering algorithm, clustering at least one high-dimensional vector space by the K-means clustering algorithm to obtain the log similarity-class set.
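As an illustration of the clustering step, the sketch below implements plain K-means over log vectors in pure Python. The text specifies only that K-means may be used; the squared Euclidean distance, the deterministic initialization from the first k distinct vectors, the iteration limit, and the function names are assumptions of this example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=20):
    """Cluster `vectors` into k groups; return one cluster label per vector."""
    # Assumption: the first k distinct vectors are the initial centroids,
    # which keeps the sketch deterministic.
    centroids = []
    for vector in vectors:
        if list(vector) not in centroids:
            centroids.append(list(vector))
        if len(centroids) == k:
            break
    if len(centroids) < k:
        raise ValueError("fewer than k distinct vectors")

    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(vector, centroids[c]))
            for vector in vectors
        ]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                updated.append([sum(col) / len(members) for col in zip(*members)])
            else:
                updated.append(centroids[c])  # keep an empty cluster's centroid
        if updated == centroids:
            break
        centroids = updated
    return labels
```

Log records that receive the same label form one similarity class, which is the input to the dictionary and digital-log step.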
Optionally, in this embodiment, the storage medium is configured to store program code for executing the following steps: generating the dictionary library from the log similarity-class set, and generating the digital log from the dictionary library and the log similarity-class set, includes: computing word-frequency statistics for each token in the log similarity-class set to obtain the dictionary library; and mapping the log similarity-class set through the dictionary library to obtain the digital log, where the digital log is used for convolution summation, and the convolution summation is used to determine the span of similar text blocks.
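The dictionary library and the digital log can be sketched as follows. The text says that word-frequency statistics produce the dictionary and that the logs are then mapped to numbers; assigning the smallest id to the most frequent token, starting ids at 1, and breaking ties alphabetically are assumptions made here so that the mapping is reproducible. The function names are hypothetical.

```python
from collections import Counter


def build_dictionary(tokenized_logs):
    """Word-frequency statistics: the most frequent token gets the smallest id."""
    frequency = Counter(token for tokens in tokenized_logs for token in tokens)
    ordered = sorted(frequency, key=lambda token: (-frequency[token], token))
    return {token: i for i, token in enumerate(ordered, start=1)}


def to_digital_log(tokenized_logs, dictionary):
    """Map every token of every log line to its numeric id."""
    return [[dictionary[token] for token in tokens] for tokens in tokenized_logs]


def from_digital_log(digital_log, dictionary):
    """Inverse mapping, used when restoring the actual text of a block."""
    reverse = {i: token for token, i in dictionary.items()}
    return [[reverse[i] for i in row] for row in digital_log]


logs = [["GET", "/a", "200"], ["GET", "/b", "200"], ["GET", "/a", "404"]]
dictionary = build_dictionary(logs)
digital = to_digital_log(logs, dictionary)
```

Because the mapping is invertible, any block selected in the numeric domain can be translated back to its original text through the same dictionary.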
Optionally, in this embodiment, the storage medium is configured to store program code for executing the following steps: computing the convolution blocks of different spans using the preset spans, and determining the high-compression-rate convolution blocks of the preset ranking according to the product of the preset span and the number of occurrences in the digital log, includes: computing the convolution sums of the different spans according to the preset spans; obtaining the high-compression-rate spans of the preset ranking according to the product of each preset span and the number of occurrences of its convolution sums in the digital log; and computing the convolution blocks of the different spans according to the high-compression-rate spans of the preset ranking, and determining the high-compression-rate convolution blocks of the preset ranking according to the product of the high-compression-rate span of the preset ranking and the number of occurrences in the digital log.
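One possible reading of the span selection and block selection is sketched below; the text does not give the exact computation, so the details are assumptions. Here the "convolution sum" of a span is taken to be the sum of each window of that many consecutive ids (a convolution with an all-ones kernel, computed with prefix sums), a span is scored by span times the occurrence count of its most frequent sum, and a block is scored by its length times its occurrence count. The function names are hypothetical.

```python
from collections import Counter
from itertools import accumulate


def window_sums(row, span):
    """Sum of every window of `span` consecutive ids, via prefix sums."""
    prefix = [0] + list(accumulate(row))
    return [prefix[i + span] - prefix[i] for i in range(len(row) - span + 1)]


def top_spans(digital_log, max_span, top_n):
    """Rank spans 2..max_span by span x occurrences of the most frequent sum."""
    scores = {}
    for span in range(2, max_span + 1):
        counts = Counter(s for row in digital_log for s in window_sums(row, span))
        if counts:
            scores[span] = span * max(counts.values())
    return sorted(scores, key=lambda span: (-scores[span], -span))[:top_n]


def top_blocks(digital_log, spans, top_n):
    """Re-cut the exact blocks for the chosen spans; rank by span x occurrences."""
    counts = Counter(
        tuple(row[i:i + span])
        for span in spans
        for row in digital_log
        for i in range(len(row) - span + 1)
    )
    ranked = sorted(counts, key=lambda block: (-len(block) * counts[block], block))
    return ranked[:top_n]


digital_log = [[1, 2, 3, 4], [1, 2, 3, 5], [1, 2, 3, 6]]
spans = top_spans(digital_log, max_span=4, top_n=2)
blocks = top_blocks(digital_log, spans, top_n=1)
```

Different windows can share the same sum, so the sums serve only as a cheap first filter for choosing spans; the exact blocks are counted again in `top_blocks` before the final ranking.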
Optionally, in this embodiment, the storage medium is configured to store program code for executing the following steps: storing the high-heat text block in place of the data file to be compressed includes: encoding the high-heat text block according to a preset model to obtain an encoded high-heat text block; and storing the encoded high-heat text block in place of the data file to be compressed.
The serial numbers of the above embodiments of this application are for description only and do not indicate the relative merits of the embodiments.
In the above embodiments of this application, the description of each embodiment has its own emphasis. For parts not described in detail in one embodiment, refer to the related descriptions of the other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed technical content may be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the division into units is only a division by logical function, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be indirect coupling or communication connection through some interfaces, units, or modules, and may be electrical or take other forms.
Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence or in the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or some of the steps of the methods described in the embodiments of this application. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), a removable hard disk, a magnetic disk, or an optical disc.
The above are only preferred embodiments of this application. It should be noted that those of ordinary skill in the art can make several improvements and refinements without departing from the principles of this application, and these improvements and refinements should also be regarded as falling within the protection scope of this application.

Claims (11)

1. A data processing method, comprising:
obtaining a high-heat text block from a data file to be compressed; and
storing the high-heat text block in place of the data file to be compressed.
2. The data processing method according to claim 1, wherein the high-heat text block is a text block whose heat is greater than a preset index heat, the preset index heat being the average reference count of indexes in the same group.
3. The data processing method according to claim 1, wherein obtaining the high-heat text block from the data file to be compressed comprises:
performing data analysis on the data file to be compressed, and computing, by a preset algorithm, the text blocks with a preset heat ranking in the data file to be compressed; and
determining the text blocks with the preset heat ranking to be the high-heat text blocks.
4. The data processing method according to claim 3, wherein computing, by the preset algorithm, the text blocks with the preset heat ranking in the data file to be compressed comprises:
when the data file to be compressed is a log data table, tokenizing the log data table according to a preset tokenization condition to obtain a tokenized log;
vectorizing the tokenized log to convert the log into a high-dimensional vector space;
clustering at least one said high-dimensional vector space by a preset clustering algorithm to obtain a log similarity-class set;
generating a dictionary library from the log similarity-class set, and generating a digital log from the dictionary library and the log similarity-class set;
computing convolution blocks of different spans using preset spans, and determining the high-compression-rate convolution blocks of a preset ranking according to the product of the preset span and the number of occurrences in the digital log; and
using the dictionary library for format encoding, restoring the data file to be compressed to obtain the high-heat text blocks.
5. The data processing method according to claim 4, wherein clustering at least one said high-dimensional vector space by the preset clustering algorithm to obtain the log similarity-class set comprises:
when the preset clustering algorithm is the K-means clustering algorithm, clustering at least one said high-dimensional vector space by the K-means clustering algorithm to obtain the log similarity-class set.
6. The data processing method according to claim 4, wherein generating the dictionary library from the log similarity-class set, and generating the digital log from the dictionary library and the log similarity-class set, comprises:
computing word-frequency statistics for each token in the log similarity-class set to obtain the dictionary library; and
mapping the log similarity-class set through the dictionary library to obtain the digital log, wherein the digital log is used for convolution summation, and the convolution summation is used to determine the span of similar text blocks.
7. The data processing method according to claim 4 or 6, wherein computing the convolution blocks of different spans using the preset spans, and determining the high-compression-rate convolution blocks of the preset ranking according to the product of the preset span and the number of occurrences in the digital log, comprises:
computing the convolution sums of the different spans according to the preset spans;
obtaining the high-compression-rate spans of the preset ranking according to the product of each preset span and the number of occurrences of its convolution sums in the digital log; and
computing the convolution blocks of the different spans according to the high-compression-rate spans of the preset ranking, and determining the high-compression-rate convolution blocks of the preset ranking according to the product of the high-compression-rate span of the preset ranking and the number of occurrences in the digital log.
8. The data processing method according to claim 1, wherein storing the high-heat text block in place of the data file to be compressed comprises:
encoding the high-heat text block according to a preset model to obtain an encoded high-heat text block; and
storing the encoded high-heat text block in place of the data file to be compressed.
9. A storage medium, wherein the storage medium includes a stored program, and when the program runs, the device on which the storage medium resides is controlled to: obtain a high-heat text block from a data file to be compressed; and store the high-heat text block in place of the data file to be compressed.
10. A processor, wherein the processor is configured to run a program, and the program, when run, executes: obtaining a high-heat text block from a data file to be compressed; and storing the high-heat text block in place of the data file to be compressed.
11. A data processing method, comprising:
obtaining a target data object, wherein the target data object is stored at a target data address;
obtaining, from the target data object, a text block whose heat is greater than a preset threshold, wherein the preset threshold includes a reference count or a reference frequency; and
storing the text block at the target data address.
CN201810410873.8A 2018-05-02 2018-05-02 Method of data processing and storage medium Active CN110442489B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810410873.8A CN110442489B (en) 2018-05-02 2018-05-02 Method of data processing and storage medium

Publications (2)

Publication Number Publication Date
CN110442489A true CN110442489A (en) 2019-11-12
CN110442489B CN110442489B (en) 2024-03-01

Family

ID=68427586

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810410873.8A Active CN110442489B (en) 2018-05-02 2018-05-02 Method of data processing and storage medium

Country Status (1)

Country Link
CN (1) CN110442489B (en)

Also Published As

Publication number Publication date
CN110442489B (en) 2024-03-01

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code (Ref country code: HK; Ref legal event code: DE; Ref document number: 40016266; Country of ref document: HK)
GR01 Patent grant