CN110442489A - The method and storage medium of data processing - Google Patents
Method and storage medium for data processing
- Publication number
- CN110442489A (application number CN201810410873.8A)
- Authority
- CN
- China
- Prior art keywords
- text block
- log
- compressed
- data
- high temperature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/3065—Monitoring arrangements determined by the means or processing involved in reporting the monitored data
- G06F11/3072—Monitoring arrangements determined by the means or processing involved in reporting the monitored data where the reporting involves data filtering, e.g. pattern matching, time or event triggered, adaptive or policy-based reporting
- G06F11/3082—Monitoring arrangements determined by the means or processing involved in reporting the monitored data where the reporting involves data filtering, e.g. pattern matching, time or event triggered, adaptive or policy-based reporting the data filtering being achieved by aggregating or compressing the monitored data
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Quality & Reliability (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Compression, Expansion, Code Conversion, And Decoders (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
This application discloses a data processing method and a storage medium. The method comprises: obtaining high-heat text blocks from a data file to be compressed; and storing the high-heat text blocks in place of the data file to be compressed. The application solves the technical problem that data compressed with general-purpose compression techniques still requires a very large amount of storage space.
Description
Technical field
This application relates to the field of Internet technology applications, and in particular to a data processing method and a storage medium.
Background art
As the Internet industry expands, more and more industries are connected to the Internet, and with this comes a large volume of data. At the enterprise level in particular, producing, executing, and archiving routine work generates large amounts of data. Databases used to retrieve and store data generally use SQL statements as call instructions, or to manage logs, when generating logs. SQL statements occupy many bytes, however, and the large storage space they require increasingly troubles the operations and maintenance staff responsible for enterprise data.
Existing solutions use cold-data storage: the data to be stored is compressed with a general-purpose compression technique to reduce its storage requirement, and the compressed data is then stored. The problem with the prior art is that, for data generated in large volumes, the compressed data still requires a very large amount of storage space, which puts heavy pressure on limited physical storage.
No effective solution has yet been proposed for the problem that data compressed with general-purpose compression techniques still requires a very large amount of storage space.
Summary of the invention
The embodiments of the present application provide a data processing method and a storage medium, to at least solve the technical problem that data compressed with general-purpose compression techniques still requires a very large amount of storage space.
According to one aspect of the embodiments of the present application, a data processing method is provided, comprising: obtaining high-heat text blocks from a data file to be compressed; and storing the high-heat text blocks in place of the data file to be compressed.
Optionally, a high-heat text block is a text block whose heat is greater than a preset index heat, where the preset index heat is the average reference count of the indexes in the same group.
Optionally, obtaining the high-heat text blocks from the data file to be compressed comprises: performing data analysis on the data file to be compressed, and computing, by a preset algorithm, the text blocks in the data file that fall within a preset heat ranking; and determining the text blocks within the preset heat ranking to be the high-heat text blocks.
Further, optionally, computing, by the preset algorithm, the text blocks within the preset heat ranking in the data file to be compressed comprises: when the data file to be compressed is a log data table, segmenting the logs in the log data table according to a preset word segmentation condition to obtain segmented logs; vectorizing the segmented logs to convert the logs into a high-dimensional vector space; clustering at least one high-dimensional vector space with a preset clustering algorithm to obtain a set of similar-log classes; generating a dictionary library from the set of similar-log classes, and generating a digital log from the dictionary library and the set of similar-log classes; computing convolution blocks of different spans over preset spans, and determining the high-compression-rate convolution blocks of a preset ranking according to the product of the preset span and the number of occurrences in the digital log; and restoring the data file to be compressed from the dictionary-library formatted encoding to obtain the high-heat text blocks.
Optionally, clustering at least one high-dimensional vector space with the preset clustering algorithm to obtain the set of similar-log classes comprises: when the preset clustering algorithm is the K-means clustering algorithm, clustering at least one high-dimensional vector space with the K-means clustering algorithm to obtain the set of similar-log classes.
Optionally, generating the dictionary library from the set of similar-log classes, and generating the digital log from the dictionary library and the set of similar-log classes, comprises: computing word frequency statistics for each segmented word in the set of similar-log classes to obtain the dictionary library; and mapping the set of similar-log classes through the dictionary library to obtain the digital log, where the digital log is used for convolution summation and the convolution summation is used to determine the spans of similar text blocks.
Optionally, computing convolution blocks of different spans over preset spans, and determining the high-compression-rate convolution blocks of the preset ranking according to the product of the preset span and the number of occurrences in the digital log, comprises: computing the convolution sums of the different spans according to the preset spans; obtaining the high-compression-rate spans of the preset ranking according to the product of each span and the number of times its corresponding convolution sum occurs in the digital log; and computing the convolution blocks of the different spans for the high-compression-rate spans of the preset ranking, and determining the high-compression-rate convolution blocks of the preset ranking according to the product of the high-compression-rate span and the number of occurrences in the digital log.
Optionally, storing the high-heat text blocks in place of the data file to be compressed comprises: encoding the high-heat text blocks according to a preset model to obtain encoded high-heat text blocks; and storing the encoded high-heat text blocks in place of the data file to be compressed.
According to another aspect of the embodiments of the present application, a storage medium is also provided. The storage medium includes a stored program, and when the program runs it controls the device on which the storage medium resides to perform the following: obtaining high-heat text blocks from a data file to be compressed; and storing the high-heat text blocks in place of the data file to be compressed.
According to another aspect of the embodiments of the present application, a processor is also provided. The processor is configured to run a program, and when the program runs it performs the following: obtaining high-heat text blocks from a data file to be compressed; and storing the high-heat text blocks in place of the data file to be compressed.
According to yet another aspect of the embodiments of the present application, a data processing method is also provided, comprising: obtaining a target data object, where the target data object is stored at a target data address; obtaining, from the target data object, text blocks whose heat is greater than a preset threshold, where the preset threshold includes a reference count or a reference frequency; and storing the text blocks at the target data address.
In the embodiments of the present application, high-heat text blocks are obtained from the data file to be compressed and are stored in place of that data file. This achieves the goal of encoding and compressing each log according to its own high-heat text blocks, realizes the technical effect of reducing storage space, and thereby solves the technical problem that data compressed with general-purpose compression techniques still requires a very large amount of storage space.
Brief description of the drawings
The drawings described here are provided for a further understanding of the present application and form a part of it. The illustrative embodiments of the present application and their descriptions are used to explain the application and do not unduly limit it. In the drawings:
Fig. 1 is a hardware block diagram of a computer terminal for a data processing method according to an embodiment of the present application;
Fig. 2 is a flowchart of the data processing method according to Embodiment 1 of the present application;
Fig. 3 is a flowchart of a data processing method according to Embodiment 1 of the present application;
Fig. 4 is a flowchart of computing the high-heat text blocks in the data processing method according to Embodiment 1 of the present application;
Fig. 5 is a flowchart of the data processing method according to Embodiment 2 of the present application.
Detailed description of the embodiments
To help those skilled in the art better understand the solution of the present application, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are only some of the embodiments of the present application, not all of them. All other embodiments obtained by a person of ordinary skill in the art from the embodiments in the present application without creative effort shall fall within the scope of protection of the present application.
It should be noted that the terms "first", "second", and the like in the description, the claims, and the above drawings of the present application are used to distinguish similar objects, not to describe a particular order or sequence. It should be understood that data so described are interchangeable where appropriate, so that the embodiments of the present application described here can be implemented in orders other than those illustrated or described here. In addition, the terms "include" and "have", and any variations of them, are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that contains a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product, or device.
Technical terms used in this application:
Data compression: a technique that, without losing useful information, reduces the volume of data to save storage space and improve the efficiency of transmission, storage, and processing, or that reorganizes data according to a certain algorithm to reduce its redundancy and storage space.
Word segmentation: splitting text into single or multiple words.
Fast convolution: computing the convolution sum for each starting point and span.
Aggregation: common aggregations include count, distinct count, sum, maximum, and minimum.
Embodiment 1
According to an embodiment of the present application, a method embodiment of data processing is also provided. It should be noted that the steps shown in the flowcharts of the drawings may be executed in a computer system, for example as a set of computer-executable instructions, and that although a logical order is shown in the flowcharts, in some cases the steps shown or described may be executed in a different order.
The method embodiment provided in Embodiment 1 of the present application may be executed on a mobile terminal, a computer terminal, or a similar computing device. Taking execution on a computer terminal as an example, Fig. 1 is a hardware block diagram of a computer terminal for a data processing method according to an embodiment of the present application. As shown in Fig. 1, computer terminal 10 may include one or more processors 102 (only one is shown in the figure; processor 102 may include, but is not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)), a memory 104 for storing data, and a transmission device 106 for communication functions. A person of ordinary skill in the art will understand that the structure shown in Fig. 1 is only illustrative and does not limit the structure of the electronic device described above. For example, computer terminal 10 may include more or fewer components than shown in Fig. 1, or have a configuration different from that shown in Fig. 1.
Memory 104 may be used to store the software programs and modules of application software, such as the program instructions/modules corresponding to the data processing method in the embodiments of the present application. Processor 102 runs the software programs and modules stored in memory 104 to execute various functional applications and data processing, that is, to implement the data processing method described above. Memory 104 may include high-speed random access memory and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, memory 104 may further include memory located remotely from processor 102, and such remote memory may be connected to computer terminal 10 over a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations of these.
Transmission device 106 is used to receive or send data over a network. Specific examples of such a network may include a wireless network provided by the communication provider of computer terminal 10. In one example, transmission device 106 includes a network adapter (Network Interface Controller, NIC), which can connect to other network devices through a base station in order to communicate with the Internet. In another example, transmission device 106 may be a radio frequency (RF) module, which communicates with the Internet wirelessly.
In the operating environment described above, the present application provides the data processing method shown in Fig. 2. Fig. 2 is a flowchart of the data processing method according to Embodiment 1 of the present application.
Step S202: obtain high-heat text blocks from the data file to be compressed.
In step S202, the data processing method provided by the present application selects, from the data files to be compressed, a log data table with a large storage footprint, analyzes the format of the log data table, and then uses an algorithm model to compute the text blocks of a preset ranking in the data file to be compressed. The text blocks of the preset ranking may be the TOP N text blocks in the data file to be compressed, where N is a positive integer, for example 1, 2, 3, ..., 10, and so on. The high-heat text blocks provided by the present application are thus the TOP N text blocks in the data file to be compressed.
While the high-heat text blocks are being obtained, data analysis and probing can be used to find partial regularities in the log content, and the constructed algorithm model then finds the high-heat text blocks.
When computing the heat of a text block, the heat can be obtained from the reference count of the text block. The present application defines the index heat as the average reference count of the indexes in the same group. Suppose a sample file contains n rows of log text strings and the reference count of text block i_m is r_m; the heat of all text blocks in the sample is then given by a formula that is not reproduced in this text.
Unlike the prior art, which stores data after general-purpose compression, the data processing method provided by the present application overcomes the inability to encode and compress the high-heat text blocks found in each different log. Step S202 obtains the high-heat text blocks of the data file to be compressed so that compression and storage can be targeted; the storing step is described in step S204.
Step S204: store the high-heat text blocks in place of the data file to be compressed.
In step S204, the high-heat text blocks obtained in step S202, which are computed by the data model, are re-encoded, and the re-encoded high-heat text blocks are stored in place of the data file to be compressed from step S202.
Specifically, Fig. 3 is a flowchart of a data processing method according to Embodiment 1 of the present application. Combining steps S202 to S204, the data processing method provided by the present application is applicable to database logs. The SQL statements in such logs occupy a large number of bytes, but the statements are highly similar to one another and contain many high-heat text blocks. Re-encoding these text blocks reduces the storage space the data needs. The method therefore compresses and stores the high-heat text blocks by re-encoding them, which achieves the technical effect of reducing the storage requirement.
In the embodiments of the present application, high-heat text blocks are obtained from the data file to be compressed and are stored in place of that data file. This achieves the goal of encoding and compressing each log according to its own high-heat text blocks, realizes the technical effect of reducing storage space, and thereby solves the technical problem that data compressed with general-purpose compression techniques still requires a very large amount of storage space.
Optionally, a high-heat text block is a text block whose heat is greater than a preset index heat, where the preset index heat is the average reference count of the indexes in the same group.
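As a minimal sketch of this threshold, the Python snippet below counts how often each text block is referenced and keeps the blocks that are referenced more often than the average. It assumes that the heat of a block equals its reference count and that all blocks belong to one index group; the function name `high_heat_blocks` and the sample statements are hypothetical and do not come from the application.

```python
from collections import Counter

def high_heat_blocks(blocks):
    """Return the blocks whose reference count exceeds the group average.

    Assumption: heat == reference count, and every block in `blocks`
    belongs to the same index group.
    """
    counts = Counter(blocks)                      # reference count per block
    if not counts:
        return {}
    average = sum(counts.values()) / len(counts)  # preset index heat
    return {b: c for b, c in counts.items() if c > average}

sample = ["SELECT * FROM t", "SELECT * FROM t", "SELECT * FROM t",
          "UPDATE t SET a=1", "DELETE FROM t"]
hot = high_heat_blocks(sample)
```

Here the average reference count is 5/3, so only the block referenced three times passes the threshold.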
Fig. 4 is a flowchart of computing the high-heat text blocks in the data processing method according to Embodiment 1 of the present application. The high-heat text blocks are computed as follows:
Optionally, obtaining the high-heat text blocks from the data file to be compressed in step S202 comprises:
Step S2021: perform data analysis on the data file to be compressed, and compute, by a preset algorithm, the text blocks in the data file that fall within a preset heat ranking;
Step S2022: determine the text blocks within the preset heat ranking to be the high-heat text blocks.
Specifically, combining steps S2021 and S2022, the data processing method provided by the present application selects, according to the business data, a log data table with a large storage footprint from the data files to be compressed, analyzes the format of the log data table, and then uses a preset computation model to compute the TOP text blocks in the log file, which yields the high-heat text blocks provided by the present application.
Further, optionally, computing, by the preset algorithm, the text blocks within the preset heat ranking in the data file to be compressed in step S2021 comprises:
Step S20211: when the data file to be compressed is a log data table, segment the logs in the log data table according to a preset word segmentation condition to obtain segmented logs.
In step S20211, when the log data table is segmented according to the preset word segmentation condition, either of the following two segmentation methods may be used:
Taking TXXX_CHN and INTERNET_CHN as examples, a sentence is converted into words separated by spaces. The two segmentation methods are similar; the former has a built-in business-specific segmentation vocabulary and can also segment according to user-defined segmentation standards, which makes it more flexible.
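The two segmenters named above are not described further in this text. As a hedged stand-in, the sketch below uses plain whitespace segmentation: tabs and newlines are normalized to spaces and the line is then split on spaces, so that a sentence becomes words separated by spaces. The function name `segment_log` and the sample line are hypothetical, and a real segmenter with a domain vocabulary would behave differently.

```python
def segment_log(line):
    """Normalize tabs and newlines to spaces, then split on whitespace.

    Stand-in for a real word segmenter; no vocabulary is consulted.
    """
    normalized = line.replace("\t", " ").replace("\n", " ")
    return normalized.split()

tokens = segment_log("SELECT id,\tname FROM users\nWHERE id = 42")
```

Whitespace splitting is enough for SQL-like logs, where keywords and identifiers are already space-delimited.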
Step S20212: vectorize the segmented logs to convert the logs into a high-dimensional vector space.
In step S20212, the segmented logs obtained in step S20211 are vectorized, which converts the logs into a high-dimensional vector space.
There are two main word vector models:
The training input of the CBOW model is the word vectors of the context words surrounding a given target word, and the output is the word vector of that target word.
The Skip-Gram model reverses the idea of CBOW: the input is the word vector of a specific word, and the output is the word vectors of the context of that word.
On top of word vectors, the present application applies the DOC2VEC (sentence vector) model, which also has two variants: Distributed Memory (DM) and Distributed Bag of Words (DBOW). DM tries to predict the probability of a word given the context and the paragraph vector. During the training of a sentence or document, the paragraph ID stays unchanged and the same paragraph vector is shared. DBOW instead predicts the probability of a group of random words in the paragraph given only the paragraph vector.
For example, the input "this is a sentence" becomes, after segmentation: "this is", "a", "sentence".
Running sentence vectorization, for example DM with 100-dimensional output, gives:
doc_id  ver1  ver2  …  ver100
1       0.1   0.2   …  0.5
Step S20213: cluster at least one high-dimensional vector space with a preset clustering algorithm to obtain a set of similar-log classes.
Clustering at least one high-dimensional vector space with the preset clustering algorithm to obtain the set of similar-log classes comprises:
Step S202131: when the preset clustering algorithm is the K-means clustering algorithm, cluster at least one high-dimensional vector space with the K-means clustering algorithm to obtain the set of similar-log classes.
Based on the high-dimensional vector space obtained in step S20212, at least one high-dimensional vector space is clustered with the preset clustering algorithm to obtain the set of similar-log classes. Existing clustering algorithms include the following three:
K-Means: single-dimension grouping, computed using the concept of "distance";
Kohonen: two-dimensional grouping, using a self-organizing neural network model;
2-Step: can automatically find the most suitable number of groups.
Although 2-Step trains quickly, the advantage of K-Means in this application is that the number of clusters can be specified. Different log volumes need different numbers of clusters, and specifying the number avoids the uncontrollability of automatically producing N clusters, which is more flexible. The present application therefore uses the K-Means algorithm as a preferred example; any algorithm that implements the data processing method provided by the present application may be used, and no specific limitation is imposed.
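To illustrate why a caller-chosen number of clusters is convenient, here is a minimal pure-Python sketch of K-means (Lloyd's algorithm) over small log vectors. It is an assumption-laden toy: the first k points seed the centroids, Euclidean distance is used, and the two-dimensional vectors stand in for the high-dimensional sentence vectors. The function name `kmeans` and the data are hypothetical.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(x) / len(members) for x in zip(*members)]
    return labels, centroids

vectors = [(0.1, 0.2), (5.0, 5.1), (0.2, 0.1), (5.2, 4.9)]
labels, centroids = kmeans(vectors, k=2)
```

Logs whose vectors lie close together receive the same label, so each label corresponds to one similar-log class; `k` can be raised for larger log volumes.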
Step S20214: generate a dictionary library from the set of similar-log classes, and generate a digital log from the dictionary library and the set of similar-log classes.
Generating the dictionary library from the set of similar-log classes, and generating the digital log from the dictionary library and the set of similar-log classes, comprises:
Step S202141: compute word frequency statistics for each segmented word in the set of similar-log classes to obtain the dictionary library;
Step S202142: map the set of similar-log classes through the dictionary library to obtain the digital log, where the digital log is used for convolution summation and the convolution summation is used to determine the spans of similar text blocks.
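The two sub-steps can be sketched in Python as follows. This is one possible reading, not the application's implementation: it assumes the dictionary library assigns integer IDs in descending order of word frequency (ties broken alphabetically), and that the digital log is simply each log with its words replaced by those IDs. The names `build_dictionary` and `to_digital_log` are hypothetical.

```python
from collections import Counter

def build_dictionary(segmented_logs):
    """Map each token to an integer ID, most frequent token first."""
    freq = Counter(tok for log in segmented_logs for tok in log)
    # IDs start at 1; ties are broken alphabetically for determinism.
    ordered = sorted(freq, key=lambda t: (-freq[t], t))
    return {tok: i for i, tok in enumerate(ordered, start=1)}

def to_digital_log(segmented_logs, dictionary):
    """Replace every token with its dictionary ID."""
    return [[dictionary[tok] for tok in log] for log in segmented_logs]

cluster = [["SELECT", "a", "FROM", "t"], ["SELECT", "b", "FROM", "t"]]
dictionary = build_dictionary(cluster)
digital = to_digital_log(cluster, dictionary)
```

Once the logs are sequences of integers, sums over sliding windows can be computed numerically, which is what the later convolution step relies on.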
In step S20214, word frequency statistics are computed for each similar class to form the dictionary library, and the logs are mapped to a digital log on which convolution sums can be computed.
The purpose of the convolution summation is to determine the spans of similar text blocks quickly.
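For example:
import tensorflow as tf
# x_image: the digital log as a 4-D tensor [batch, length, 1, 1] (assumed layout).
# All-ones kernel of height j: each output value is the sum of j consecutive inputs.
W_conv1 = tf.ones([j, 1, 1, 1])
conv = tf.nn.conv2d(x_image, W_conv1, strides=[1, 1, 1, 1], padding='VALID')
Only TensorFlow functions are used here because they support parallel GPU computation, which is more efficient.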
Step S20215: compute convolution blocks of different spans over preset spans, and determine the high-compression-rate convolution blocks of a preset ranking according to the product of the preset span and the number of occurrences in the digital log.
Computing convolution blocks of different spans over preset spans, and determining the high-compression-rate convolution blocks of the preset ranking according to the product of the preset span and the number of occurrences in the digital log, comprises:
Step S202151: compute the convolution sums of the different spans according to the preset spans;
Step S202152: obtain the high-compression-rate spans of the preset ranking according to the product of each span and the number of times its corresponding convolution sum occurs in the digital log;
Step S202153: compute the convolution blocks of the different spans for the high-compression-rate spans of the preset ranking, and determine the high-compression-rate convolution blocks of the preset ranking according to the product of the high-compression-rate span and the number of occurrences in the digital log.
It should be noted that text blocks with the same convolution sum are not necessarily identical: the order of the words may be swapped, so that the convolution sums are equal even though the text blocks differ. Only similar spans can therefore be obtained at this stage, and log blocks must subsequently be extracted according to the span and matched exactly.
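The caveat can be seen in a small pure-Python sketch. Summing every window of `span` consecutive IDs is equivalent to convolving with an all-ones kernel; grouping the windows by their sum then shows that two different windows can share one sum, which is why exact matching is still needed. The function names and the sample IDs are hypothetical.

```python
from collections import defaultdict

def window_sums(ids, span):
    """Sum of every window of `span` consecutive IDs (all-ones convolution)."""
    return [sum(ids[i:i + span]) for i in range(len(ids) - span + 1)]

def windows_by_sum(ids, span):
    """Group the distinct windows under their shared convolution sum."""
    groups = defaultdict(set)
    for i in range(len(ids) - span + 1):
        window = tuple(ids[i:i + span])
        groups[sum(window)].add(window)
    return dict(groups)

ids = [2, 4, 1, 3, 4, 2]
sums = window_sums(ids, 2)
groups = windows_by_sum(ids, 2)
```

In this sample the windows (2, 4) and (4, 2) both sum to 6, so a sum of 6 appearing twice does not mean the same block appeared twice.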
Specifically, for spans from 2 to n, the convolution sums of the different spans are computed quickly. The TOP N high-compression-rate spans are selected according to the product of each span and the number of times its corresponding convolution sum occurs in the log file. The convolution blocks of the different spans are then recomputed for the selected TOP N high-compression-rate spans, and the TOP N high-compression-rate convolution blocks are determined according to the product of the span and the number of occurrences in the log file.
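The ranking criterion, span multiplied by occurrence count, can be sketched as below. This simplified version is an assumption, not the two-pass procedure described above: it skips the convolution-sum pre-selection of spans and counts exact windows directly, and it ignores blocks that occur only once. The name `top_blocks` and the sample digital logs are hypothetical.

```python
from collections import Counter

def top_blocks(digital_logs, spans, top_n):
    """Rank exact token-ID windows by span * occurrence count."""
    counts = Counter()
    for log in digital_logs:
        for span in spans:
            for i in range(len(log) - span + 1):
                counts[tuple(log[i:i + span])] += 1
    # Score = span * occurrences, i.e. the total tokens the block covers.
    scored = {block: len(block) * n for block, n in counts.items() if n > 1}
    return sorted(scored, key=lambda b: (-scored[b], b))[:top_n]

logs = [[2, 4, 1, 3], [2, 5, 1, 3], [2, 4, 1, 3]]
best = top_blocks(logs, spans=range(2, 5), top_n=2)
```

A long block that repeats twice (score 8) outranks a short block that repeats three times (score 6), which matches the intent of preferring blocks with the highest compression payoff.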
Step S20216: restore the data file to be compressed from the dictionary-library formatted encoding to obtain the high-heat text blocks.
Optionally, storing the high-heat text blocks in place of the data file to be compressed in step S204 comprises:
Step S2041: encode the high-heat text blocks according to a preset model to obtain encoded high-heat text blocks; and store the encoded high-heat text blocks in place of the data file to be compressed.
The preset model includes multiple algorithm models such as sentence vectorization, clustering, and deep-learning convolution.
In summary, as shown in Fig. 4, a preferred example of calculating the high-temperature text blocks in the data processing method provided by the present application is as follows:
(1) Extract the logs and segment them into words (step 1 in Fig. 4);
Normalize the log file (replace TAB and newline characters with spaces) and segment it by spaces;
(2) Vectorize the segmented logs (step 2 in Fig. 4);
Vectorization converts the logs into a high-dimensional vector space;
(3) Cluster (steps 3-4 in Fig. 4);
After the logs are converted into the high-dimensional vector space, ordinary K-means clustering groups similar logs together;
(4) Format each similarity class (steps 5-8 in Fig. 4);
Word-frequency statistics are computed for each similarity class to form a dictionary library, and the logs are mapped to digital logs on which convolution sums can be computed;
(5) Fast convolution and span selection (steps 9-12 in Fig. 4);
For spans from 2 to n, quickly calculate the convolution sums of the different spans;
Select the TOP N high-compression-rate spans according to the product of each span and the number of times its corresponding convolution sum occurs in the log file;
(6) Convolution word cutting and compression-rate assessment (steps 13-15 in Fig. 4);
According to the spans determined above, recalculate the convolution blocks of the different spans, and determine the TOP N high-compression-rate convolution blocks according to the product of each span and its number of occurrences in the log file;
(7) Formatted encoding (step 16 in Fig. 4);
Restore the actual log text block content according to the formatted encoding of the dictionary library, obtaining the high-temperature text blocks.
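Steps (1), (4), and (7) of the example above can be sketched as follows. This is a minimal illustration under stated assumptions: the dictionary library is taken to be a mapping from each word to an integer id assigned in order of descending word frequency (the text only says that word-frequency statistics form the dictionary library), and the function names `tokenize`, `build_dictionary`, `to_digital_log`, and `restore` are hypothetical.

```python
from collections import Counter


def tokenize(line):
    """Step (1): replace TAB and newline characters with spaces, then split on spaces."""
    normalized = line.replace("\t", " ").replace("\n", " ")
    return [tok for tok in normalized.split(" ") if tok]


def build_dictionary(tokenized_logs):
    """Step (4): word-frequency statistics; ids start at 1 for the most frequent word (assumption)."""
    freq = Counter(tok for line in tokenized_logs for tok in line)
    ordered = sorted(freq, key=lambda tok: (-freq[tok], tok))  # ties broken alphabetically
    return {tok: idx for idx, tok in enumerate(ordered, start=1)}


def to_digital_log(tokenized_logs, dictionary):
    """Step (4): map every word to its id so that window sums can be computed."""
    return [[dictionary[tok] for tok in line] for line in tokenized_logs]


def restore(block, dictionary):
    """Step (7): map a block of ids back to the actual log text."""
    inverse = {idx: tok for tok, idx in dictionary.items()}
    return " ".join(inverse[idx] for idx in block)


raw = ["GET\t/index 200\n", "GET /index 404"]
tokens = [tokenize(line) for line in raw]
dictionary = build_dictionary(tokens)
print(to_digital_log(tokens, dictionary), restore((2, 1), dictionary))
```

Because the mapping is invertible, any block of ids selected in the digital log can be turned back into the text block it stands for.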
The data processing method provided by the present application segments log-type big data into words, finds the high-temperature text blocks, encodes and compresses them, and stores the encoded and compressed high-temperature text blocks in place of the original data file to be stored. This reduces the demand for storage space, improves storage utilization, and reduces the maintenance burden during later maintenance.
It should be noted that, for brevity, each of the foregoing method embodiments is described as a series of combined actions. Those skilled in the art should understand, however, that the present application is not limited by the described order of actions, because according to the present application some steps may be performed in other orders or simultaneously. Those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and that the actions and modules involved are not necessarily required by the present application.
From the above description of the embodiments, those skilled in the art can clearly understand that the data processing method according to the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, or of course by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes over the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as a ROM/RAM, magnetic disk, or optical disc) and includes several instructions that cause a terminal device (which may be a mobile phone, computer, server, network device, or the like) to execute the methods described in the embodiments of the present application.
Embodiment 2
The present application provides a data processing method as shown in Fig. 5. Fig. 5 is a flowchart of the data processing method according to Embodiment 2 of the present application.
Step S502: obtain a target data object, where the target data object is stored at a target data address;
In step S502, the data processing method provided by the present application obtains the target data object stored at the target data address. The target data object may include a data file to be compressed, which may include a function file stored in a database for calling data, a program file that runs data processing, or an encryption and decryption program file that encrypts and decrypts data. The data processing method provided by the present application is described using these examples only and is not specifically limited to them.
Step S504: obtain, from the target data object, the text blocks whose temperature is greater than a preset threshold, where the preset threshold includes a reference count or a reference frequency;
In step S504, based on the target data object obtained in step S502, the data processing method provided by the present application selects a log data table with a large storage footprint from the target data object, performs data analysis on the format of the log data table, and then uses an algorithm model to calculate the text blocks of the preset ranking in the target data object. The text blocks of the preset ranking may be the TOP N text blocks in the target data object, where N is an integer, for example 1, 2, 3, ..., 10, and so on. The text blocks referred to in the present application, that is, the TOP N text blocks in the target data object (for example, the TOP 3 text blocks), are the text blocks whose temperature is greater than the preset threshold.
In the process of obtaining the text blocks whose temperature is greater than the preset threshold, data analysis and detection can be used to find local regularities in the log content, and the constructed algorithm model is then used to find the text blocks whose temperature is greater than the preset threshold.
The temperature of a text block can be obtained from the number of times the text block is referenced. The present application defines the index temperature as the average reference count of the same group of indexes: assuming that a sample file contains n lines of log text strings and that text block i_m is referenced r_m times, the temperature of the text block is, over all text blocks in the sample, [formula not reproduced]
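A minimal sketch of one possible reading of this definition, assuming that the temperature of a text block is its reference count in the sample and that the threshold is the average reference count over the group of candidate blocks. The function names `reference_counts` and `hot_blocks` are hypothetical, and the exact formula of the present application may differ.

```python
def reference_counts(log_lines, blocks):
    """Count how often each candidate text block occurs across the sample's log lines."""
    return {block: sum(line.count(block) for line in log_lines) for block in blocks}


def hot_blocks(counts):
    """Keep the blocks whose reference count exceeds the group average (assumed threshold)."""
    if not counts:
        return []
    average = sum(counts.values()) / len(counts)
    return sorted(block for block, refs in counts.items() if refs > average)


lines = [
    "ERROR disk full on /dev/sda",
    "ERROR disk full on /dev/sdb",
    "INFO started",
]
counts = reference_counts(lines, ["ERROR disk full", "INFO started", "/dev/sda"])
print(counts, hot_blocks(counts))
```

Here the three candidates are referenced 2, 1, and 1 times, so only the block referenced more often than the average of 4/3 is kept.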
Step S506: store the text blocks at the target data address.
Unlike prior-art storage based on general-purpose compression techniques, the data processing method provided by the present application overcomes the inability to encode and compress the high-temperature text blocks in each different log. It obtains from the target data object the text blocks whose temperature is greater than the preset threshold so that compression and storage are targeted, and it stores those text blocks in place of the original target data object, saving storage space.
In this embodiment of the present application, a target data object is obtained, where the target data object is stored at a target data address; the text blocks whose temperature is greater than a preset threshold are obtained from the target data object, where the preset threshold includes a reference count or a reference frequency; and the text blocks are stored at the target data address. This achieves the purpose of encoding and compressing according to the high-temperature text blocks in each different log, thereby achieving the technical effect of reducing storage space and solving the technical problem that data compressed with common compression techniques still requires a large amount of storage space.
Embodiment 3
According to another aspect of the embodiments of the present application, a storage medium is further provided. The storage medium includes a stored program, where, when the program runs, the device on which the storage medium resides is controlled to: obtain high-temperature text blocks from a data file to be compressed; and store the high-temperature text blocks in place of the data file to be compressed.
Embodiment 4
According to another aspect of the embodiments of the present application, a processor is further provided. The processor is configured to run a program, where, when the program runs, it performs: obtaining high-temperature text blocks from a data file to be compressed; and storing the high-temperature text blocks in place of the data file to be compressed.
Embodiment 5
The embodiments of the present application further provide a storage medium. Optionally, in this embodiment, the storage medium may be used to store the program code executed by the data processing method provided in Embodiment 1 above.
Optionally, in this embodiment, the storage medium may be located in any computer terminal of a computer terminal group in a computer network, or in any mobile terminal of a mobile terminal group.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: obtaining high-temperature text blocks from a data file to be compressed; and storing the high-temperature text blocks in place of the data file to be compressed.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following step: a high-temperature text block is a text block whose temperature is greater than a preset index temperature, where the preset index temperature is the average reference count of the same group of indexes.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: obtaining high-temperature text blocks from the data file to be compressed includes: performing data analysis on the data file to be compressed, and calculating, by a preset algorithm, the text blocks of a preset temperature ranking in the data file to be compressed; and determining the text blocks of the preset temperature ranking as the high-temperature text blocks.
Further, optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: calculating, by the preset algorithm, the text blocks of the preset temperature ranking in the data file to be compressed includes: when the data file to be compressed is a log data table, segmenting the log data table according to a preset word segmentation condition to obtain the segmented logs; vectorizing the segmented logs to convert the logs into high-dimensional vector spaces; clustering at least one high-dimensional vector space by a preset clustering algorithm to obtain a set of log similarity classes; generating a dictionary library according to the set of log similarity classes, and generating a digital log according to the dictionary library and the set of log similarity classes; calculating the convolution blocks of different spans by preset spans, and determining the high-compression-rate convolution blocks of a preset ranking according to the product of the preset span and the number of occurrences in the digital log; and restoring the data file to be compressed according to the formatted encoding of the dictionary library to obtain the high-temperature text blocks.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: clustering at least one high-dimensional vector space by the preset clustering algorithm to obtain the set of log similarity classes includes: when the preset clustering algorithm is the K-means clustering algorithm, clustering at least one high-dimensional vector space by the K-means clustering algorithm to obtain the set of log similarity classes.
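The vectorization and K-means clustering steps can be sketched as follows. This is a minimal illustration, assuming a bag-of-words count vector per log line (the text does not specify the vectorization) and a plain Lloyd-style K-means written with the standard library only; `vectorize`, `squared_distance`, and `k_means` are hypothetical names.

```python
import random


def vectorize(tokenized_logs):
    """Bag-of-words count vectors over the shared vocabulary (assumed vectorization)."""
    vocab = sorted({tok for line in tokenized_logs for tok in line})
    index = {tok: i for i, tok in enumerate(vocab)}
    vectors = []
    for line in tokenized_logs:
        vec = [0.0] * len(vocab)
        for tok in line:
            vec[index[tok]] += 1.0
        vectors.append(vec)
    return vectors


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(vectors, k, iterations=20, seed=0):
    """Plain K-means; returns one cluster label per input vector."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: attach each vector to its nearest centroid.
        labels = [min(range(k), key=lambda c: squared_distance(v, centroids[c]))
                  for v in vectors]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


logs = [["GET", "/a", "200"], ["GET", "/b", "200"],
        ["disk", "full", "error"], ["disk", "full", "warning"]]
print(k_means(vectorize(logs), k=2))
```

Log lines that share most of their words end up with the same label, and each label corresponds to one log similarity class.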
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: generating the dictionary library according to the set of log similarity classes, and generating the digital log according to the dictionary library and the set of log similarity classes includes: performing word-frequency statistics on each word in the set of log similarity classes to obtain the dictionary library; and performing mapping according to the dictionary library and the set of log similarity classes to obtain the digital log, where the digital log is used for convolution summation, and the convolution sums are used to determine the spans of similar text blocks.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: calculating the convolution blocks of different spans by preset spans, and determining the high-compression-rate convolution blocks of the preset ranking according to the product of the preset span and the number of occurrences in the digital log includes: calculating the convolution sums of the different spans according to the preset spans; obtaining the high-compression-rate spans of the preset ranking according to the product of each span and the number of times its corresponding convolution sum occurs in the digital log; and calculating the convolution blocks of the different spans according to the high-compression-rate spans of the preset ranking, and determining the high-compression-rate convolution blocks of the preset ranking according to the product of each such span and its number of occurrences in the digital log.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: storing the high-temperature text block in place of the data file to be compressed includes: encoding the high-temperature text block according to a preset model to obtain the encoded high-temperature text block; and storing the encoded high-temperature text block in place of the data file to be compressed.
The serial numbers of the above embodiments of the present application are for description only and do not indicate the relative merits of the embodiments.
In the above embodiments of the present application, the description of each embodiment has its own emphasis. For parts not described in detail in one embodiment, refer to the related descriptions of the other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed technical content may be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the division into units is only a division by logical function, and other divisions are possible in an actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, units, or modules, and may be electrical or take other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes over the prior art, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, server, network device, or the like) to execute all or some of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), removable hard disk, magnetic disk, or optical disc.
The above are only preferred embodiments of the present application. It should be noted that those of ordinary skill in the art may make several improvements and refinements without departing from the principles of the present application, and these improvements and refinements shall also be regarded as falling within the protection scope of the present application.
Claims (11)
1. A data processing method, characterized by comprising:
obtaining a high-temperature text block from a data file to be compressed;
storing the high-temperature text block in place of the data file to be compressed.
2. The data processing method according to claim 1, characterized in that the high-temperature text block is a text block whose temperature is greater than a preset index temperature, wherein the preset index temperature is the average reference count of the same group of indexes.
3. The data processing method according to claim 1, characterized in that obtaining the high-temperature text block from the data file to be compressed comprises:
performing data analysis on the data file to be compressed, and calculating, by a preset algorithm, the text blocks of a preset temperature ranking in the data file to be compressed;
determining the text blocks of the preset temperature ranking as the high-temperature text block.
4. The data processing method according to claim 3, characterized in that calculating, by the preset algorithm, the text blocks of the preset temperature ranking in the data file to be compressed comprises:
when the data file to be compressed is a log data table, segmenting the log data table according to a preset word segmentation condition to obtain the segmented logs;
vectorizing the segmented logs to convert the logs into high-dimensional vector spaces;
clustering at least one of the high-dimensional vector spaces by a preset clustering algorithm to obtain a set of log similarity classes;
generating a dictionary library according to the set of log similarity classes, and generating a digital log according to the dictionary library and the set of log similarity classes;
calculating the convolution blocks of different spans by preset spans, and determining the high-compression-rate convolution blocks of a preset ranking according to the product of the preset span and the number of occurrences in the digital log;
restoring the data file to be compressed according to the formatted encoding of the dictionary library to obtain the high-temperature text block.
5. The data processing method according to claim 4, characterized in that clustering at least one of the high-dimensional vector spaces by the preset clustering algorithm to obtain the set of log similarity classes comprises:
when the preset clustering algorithm is the K-means clustering algorithm, clustering at least one of the high-dimensional vector spaces by the K-means clustering algorithm to obtain the set of log similarity classes.
6. The data processing method according to claim 4, characterized in that generating the dictionary library according to the set of log similarity classes, and generating the digital log according to the dictionary library and the set of log similarity classes comprises:
performing word-frequency statistics on each word in the set of log similarity classes to obtain the dictionary library;
performing mapping according to the dictionary library and the set of log similarity classes to obtain the digital log, wherein the digital log is used for convolution summation, and the convolution sums are used to determine the spans of similar text blocks.
7. The data processing method according to claim 4 or 6, characterized in that calculating the convolution blocks of different spans by the preset spans, and determining the high-compression-rate convolution blocks of the preset ranking according to the product of the preset span and the number of occurrences in the digital log comprises:
calculating the convolution sums of the different spans according to the preset spans;
obtaining the high-compression-rate spans of the preset ranking according to the product of each span and the number of times its corresponding convolution sum occurs in the digital log;
calculating the convolution blocks of the different spans according to the high-compression-rate spans of the preset ranking, and determining the high-compression-rate convolution blocks of the preset ranking according to the product of each such span and its number of occurrences in the digital log.
8. The data processing method according to claim 1, characterized in that storing the high-temperature text block in place of the data file to be compressed comprises:
encoding the high-temperature text block according to a preset model to obtain the encoded high-temperature text block;
storing the encoded high-temperature text block in place of the data file to be compressed.
9. A storage medium, characterized in that the storage medium includes a stored program, wherein, when the program runs, the device on which the storage medium resides is controlled to: obtain a high-temperature text block from a data file to be compressed; and store the high-temperature text block in place of the data file to be compressed.
10. A processor, characterized in that the processor is configured to run a program, wherein, when the program runs, it performs: obtaining a high-temperature text block from a data file to be compressed; and storing the high-temperature text block in place of the data file to be compressed.
11. A data processing method, characterized by comprising:
obtaining a target data object, wherein the target data object is stored at a target data address;
obtaining, from the target data object, a text block whose temperature is greater than a preset threshold, wherein the preset threshold includes a reference count or a reference frequency;
storing the text block at the target data address.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810410873.8A CN110442489B (en) | 2018-05-02 | 2018-05-02 | Method of data processing and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110442489A true CN110442489A (en) | 2019-11-12 |
CN110442489B CN110442489B (en) | 2024-03-01 |
Family
ID=68427586
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810410873.8A Active CN110442489B (en) | 2018-05-02 | 2018-05-02 | Method of data processing and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110442489B (en) |
Also Published As
Publication number | Publication date |
---|---|
CN110442489B (en) | 2024-03-01 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
REG | Reference to a national code | Ref country code: HK Ref legal event code: DE Ref document number: 40016266 Country of ref document: HK |
GR01 | Patent grant | ||