CN109992569A - Cluster log feature extracting method, device and storage medium - Google Patents
- Publication number: CN109992569A
- Application number: CN201910123928.1A
- Authority
- CN
- China
- Prior art keywords
- data
- value
- log
- log data
- acquisition
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06F16/17 — File systems; file servers; details of further file system functions
- G06F16/182 — Distributed file systems
- G06F16/215 — Improving data quality; data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
Abstract
The present invention relates to infrastructure operations and maintenance, and provides a cluster log feature extraction method, device and storage medium. The logs of a server cluster are collected by a Flume client and sent to a database; the log data are cleaned to screen out raw data; features including the mean, RMS value, peak value, square-root amplitude, waveform index, pulse index and kurtosis index are extracted from the raw data; the Pearson correlation coefficient between each extracted feature value and the raw data is computed and compared with a correlation threshold, values above the threshold being regarded as valid data while values below it are regarded as invalid data and rejected. The invention can effectively screen out the useful information in the production data of each host in a server cluster and extract feature values of the production data from it, facilitating fault prediction and fault classification of the production system and reducing the occurrence of production accidents.
Description
Technical field
The present invention relates to infrastructure operations and maintenance, and in particular to a cluster log feature extraction method, device and storage medium.
Background art
In an era of explosive information growth, file sizes and data volumes at the TB or even PB scale have become reality, and cluster storage systems have grown to 64 nodes or more; managing such huge cluster systems has become a severe challenge for data centers. Tracking the operating state of cluster nodes in time and accurately locating node error messages is therefore particularly important. In the actual operation of cluster storage systems, a common log management approach today sends system logs periodically or in real time, achieving centralized transmission of the logs, but does not analyze or manage them, so the operating condition of the whole cluster storage system cannot be grasped globally and error messages cannot be located quickly. Moreover, as the number of cluster nodes grows, cluster system management becomes increasingly complex. Extracting, from massive server data, features that can reflect server performance, precisely locating incipient faults of cluster nodes, and carrying out the corresponding performance checks in advance is particularly important.
Summary of the invention
To solve the above problems, the present invention provides a cluster log feature extraction method, applied to an electronic device, comprising the following steps: collecting the logs of a server cluster through a Flume client and sending them to an HBase database, wherein the Flume client collects the log of each server in the cluster through a corresponding Agent process, and each Agent periodically collects the log data of its server and sends it to the HBase database through an API interface; performing data cleansing on the log data with Hadoop to screen out raw data, wherein the raw data include at least server disk occupancy, memory usage, CPU usage and business-interface call volume; extracting from the raw data features including the mean, RMS value, peak value, square-root amplitude, waveform index, pulse index and kurtosis index; and screening out valid features with the Pearson correlation coefficient, by computing the Pearson correlation coefficient between each extracted feature value and the raw data and comparing the calculated coefficient with a correlation threshold, values above the threshold being regarded as valid data while values below it are regarded as invalid data and rejected.
Preferably, data with gross errors are rejected during data cleansing using the Pauta (3σ) criterion, comprising the following steps: for log data $x_1, x_2, \ldots, x_n$, calculate the arithmetic mean $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ and the residuals $v_i = x_i - \bar{x}$, where $x_i$ is the log data collected by a single Agent;

calculate the standard deviation $S_x = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} v_i^2}$;

if the residual $v_b$ ($1 \le b \le n$) of a value $x_b$ satisfies $|v_b| > 3S_x$, then $x_b$ is regarded as a singular value containing a gross error and is eliminated as an abnormal value.

Preferably, singular values in the log data are replaced with the median, where the median is the value in the middle position when the log data $x_1, x_2, \ldots, x_n$ are arranged in order of magnitude.
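As a minimal Python sketch, the 3σ (Pauta) rejection and median replacement described above might look like this; the function name and list-based interface are illustrative, not from the patent:

```python
import statistics

def remove_gross_errors(samples):
    """Replace values whose residual exceeds 3 standard deviations (the
    Pauta / 3-sigma criterion) with the median, so no null values remain."""
    n = len(samples)
    mean = sum(samples) / n
    # Sample standard deviation with Bessel's correction, as in S_x above.
    s_x = (sum((x - mean) ** 2 for x in samples) / (n - 1)) ** 0.5
    median = statistics.median(samples)
    return [median if abs(x - mean) > 3 * s_x else x for x in samples]
```

Note that with very few samples a single outlier inflates $S_x$ so much that nothing exceeds $3S_x$; the criterion is meaningful only for reasonably long log series.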
Preferably, features including the mean, RMS value, peak value, square-root amplitude, waveform index, pulse index and kurtosis index are extracted from the raw data, wherein

the RMS value is calculated as $X_{rms} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} x_i^2}$;

the peak value is calculated as $X_p = \max(x_i)$;

the square-root amplitude is calculated as $X_r = \left(\frac{1}{n}\sum_{i=1}^{n}\sqrt{|x_i|}\right)^2$;

the waveform index is calculated as $X_{ws} = X_{rms} \Big/ \frac{1}{n}\sum_{i=1}^{n}|x_i|$;

the pulse index is calculated as $X_{if} = X_p \Big/ \frac{1}{n}\sum_{i=1}^{n}|x_i|$;

the kurtosis index is calculated as $X_{kv} = \frac{1}{n}\sum_{i=1}^{n} x_i^4 \Big/ X_{rms}^4$;

where $x_i$ is the log data collected by a single Agent; $n$ is the number of data acquisitions; $\bar{x}$ is the arithmetic mean of the collected log data; $X_{rms}$ is the RMS value, $X_p$ the peak value, $X_r$ the square-root amplitude, $X_{ws}$ the waveform index, $X_{if}$ the pulse index and $X_{kv}$ the kurtosis index of the collected log data.
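A compact Python sketch of these feature computations. Since the patent's formula images are not reproduced in the text, the standard time-domain definitions from vibration analysis are assumed here, and may differ in detail from the original:

```python
def extract_features(samples):
    """Compute the time-domain feature values named in the patent
    (mean, RMS, peak, square-root amplitude, waveform/pulse/kurtosis index)."""
    n = len(samples)
    mean = sum(samples) / n
    abs_mean = sum(abs(x) for x in samples) / n                  # mean |x|
    rms = (sum(x * x for x in samples) / n) ** 0.5               # X_rms
    peak = max(samples)                                          # X_p
    root_amp = (sum(abs(x) ** 0.5 for x in samples) / n) ** 2    # X_r
    kurtosis = (sum(x ** 4 for x in samples) / n) / rms ** 4     # X_kv
    return {
        "mean": mean,
        "rms": rms,
        "peak": peak,
        "root_amplitude": root_amp,
        "waveform_index": rms / abs_mean,    # X_ws, shape factor
        "pulse_index": peak / abs_mean,      # X_if, impulse factor
        "kurtosis_index": kurtosis,
    }
```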
Preferably, the formula of the Pearson correlation coefficient is as follows:

$r = \dfrac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}$

where $x_i$ is the log data collected by a single Agent; $y_i$ is a feature value extracted from the data collected by a single Agent; $\bar{x}$ is the arithmetic mean of $x_1, x_2, \ldots, x_n$; $\bar{y}$ is the arithmetic mean of $y_1, y_2, \ldots, y_n$; and $n$ is the number of log data acquisitions.
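The correlation-based screening can be sketched in Python as follows; `filter_features`, the 0.7 default threshold (taken from the worked example later in the description), and the per-feature series layout are illustrative assumptions:

```python
def pearson(xs, ys):
    """Pearson correlation coefficient r between two equal-length series."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def filter_features(features, raw, threshold=0.7):
    """Keep only the feature series whose |r| against the raw data
    reaches the correlation threshold; the rest are rejected."""
    return {name: series for name, series in features.items()
            if abs(pearson(series, raw)) >= threshold}
```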
Preferably, Flume comprises multiple first-level Agents and one second-level Agent; each first-level Agent collects the log data of one server, the log data collected by the first-level Agents are aggregated at the second-level Agent, and the second-level Agent transmits them to HDFS.
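The fan-in from first-level Agents to the second-level Agent might be simulated as below. All class and field names are hypothetical; a real deployment would use Flume configuration files, and the second-level Agent's sink would write to HDFS rather than return a batch:

```python
import queue

class FirstLevelAgent:
    """Stand-in for a first-level Agent tailing one server's log."""
    def __init__(self, server, channel):
        self.server = server
        self.channel = channel

    def collect(self, line):
        # Tag each event with its originating server before forwarding.
        self.channel.put({"server": self.server, "body": line})

class SecondLevelAgent:
    """Stand-in for the single aggregation Agent in front of HDFS."""
    def __init__(self):
        self.channel = queue.Queue()

    def drain(self):
        # Take everything currently queued; a real sink would write to HDFS.
        events = []
        while not self.channel.empty():
            events.append(self.channel.get())
        return events
```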
The present invention also provides an electronic device comprising a memory and a processor, the memory storing a cluster log feature extraction program which, when executed by the processor, implements the following steps: collecting the logs of a server cluster through a Flume client and sending them to an HBase database, wherein the Flume client collects the log of each server in the cluster through a corresponding Agent process, and each Agent periodically collects the log data of its server and sends it to the HBase database through an API interface; performing data cleansing on the log data with Hadoop to screen out raw data, wherein the raw data include at least server disk occupancy, memory usage, CPU usage and business-interface call volume; extracting from the raw data features including the mean, RMS value, peak value, square-root amplitude, waveform index, pulse index and kurtosis index; and screening out valid features with the Pearson correlation coefficient, by computing the Pearson correlation coefficient between each extracted feature value and the raw data and comparing the calculated coefficient with a correlation threshold, values above the threshold being regarded as valid data while values below it are regarded as invalid data and rejected.
Preferably, data with gross errors are rejected during data cleansing using the Pauta criterion, comprising the following steps: for log data $x_1, x_2, \ldots, x_n$, calculate the arithmetic mean $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ and the residuals $v_i = x_i - \bar{x}$, where $x_i$ is the data value collected by a single Agent; calculate the standard deviation $S_x = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} v_i^2}$; if the residual $v_b$ ($1 \le b \le n$) of a value $x_b$ satisfies $|v_b| > 3S_x$, then $x_b$ is regarded as a singular value containing a gross error and is rejected.

Preferably, singular values in the log data are replaced with the median, where the median is the value in the middle position when the log data $x_1, x_2, \ldots, x_n$ are arranged in order of magnitude.
The present invention also provides a computer-readable storage medium storing a computer program, the computer program comprising program instructions which, when executed by a processor, implement the cluster log feature extraction method described above.

The present invention can effectively screen out the useful information in the production data of each host in a server cluster and extract feature values of the production data from it, facilitating fault prediction and fault classification of the production system and reducing the occurrence of production accidents.
Brief description of the drawings
The above features and technical advantages of the present invention will become clearer and more readily understood from the following description of embodiments in conjunction with the accompanying drawings.

Fig. 1 is a flow diagram of the cluster log feature extraction method of an embodiment of the present invention;
Fig. 2 is a schematic diagram of the hardware architecture of the electronic device of an embodiment of the present invention;
Fig. 3 is a module structure diagram of the cluster log feature extraction program of an embodiment of the present invention;
Fig. 4 is a unit composition diagram of the log acquisition module of an embodiment of the present invention;
Fig. 5 is a unit composition diagram of the feature extraction module of an embodiment of the present invention;
Fig. 6 is a unit composition diagram of the data cleansing module of an embodiment of the present invention;
Fig. 7 is a schematic diagram of an Agent process of Flume reading data.
Specific embodiment
Cluster log feature extracting method of the present invention, device and storage medium described below with reference to the accompanying drawings
Embodiment.Those skilled in the art will recognize, without departing from the spirit and scope of the present invention, can be with
Described embodiment is modified with a variety of different modes or combinations thereof.Therefore, attached drawing and description are inherently said
Bright property, it is not intended to limit the scope of the claims.In addition, in the present specification, attached drawing is drawn not in scale, and
And identical appended drawing reference indicates identical part.
As shown in Fig. 1, the cluster log feature extraction method of this embodiment comprises the following steps:

Step S10: collect the logs of the server cluster through a Flume (distributed massive log collection, aggregation and transmission system) client and send them to the HBase database server. Flume takes the Agent process as its smallest independent running unit; one Agent process is a complete data collection tool. As shown in Fig. 7, an Agent consists of a Source (data collection component), a Channel (temporary transfer store) and a Sink, the three together forming an Agent. The Source collects data from the server and passes it to the Channel; the Channel holds the Events (data units) passed over by the Source component; the Sink reads and removes Events from the Channel and forwards them to the backend. Flume collects the log data of each server through multiple Agents: one Agent is deployed for each server, and it periodically collects the log data of its server and sends it to the backend through the API interface.
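The Source → Channel → Sink flow just described can be sketched in Python; the classes below are an illustrative stand-in (the real Flume components are Java), with a plain list standing in for the backend store:

```python
from collections import deque

class Channel:
    """Temporary store holding Events between the Source and the Sink."""
    def __init__(self):
        self._events = deque()
    def put(self, event):
        self._events.append(event)
    def take(self):
        # Reading also removes the Event, as in Flume's channel semantics.
        return self._events.popleft() if self._events else None

class Source:
    """Collects raw lines from a server and wraps each as an Event."""
    def __init__(self, channel):
        self.channel = channel
    def collect(self, raw_line):
        self.channel.put({"body": raw_line})

class Sink:
    """Drains Events from the Channel and forwards them to the backend."""
    def __init__(self, channel, backend):
        self.channel = channel
        self.backend = backend  # e.g. an HBase client, stubbed here as a list
    def process(self):
        event = self.channel.take()
        if event is not None:
            self.backend.append(event)
        return event
```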
Step S30: perform data cleansing on the log data using Hadoop (a distributed system infrastructure) and screen out the raw data, wherein the raw data include at least server disk occupancy, memory usage, CPU usage and business-interface call volume.
Step S50: extract from the raw data features including the mean, RMS value, peak value, square-root amplitude, waveform index, pulse index and kurtosis index.
Step S70: screen out valid features with the Pearson correlation coefficient: compute the Pearson correlation coefficient between each extracted feature value and the raw data, and compare the calculated coefficient with the correlation threshold; values above the threshold are regarded as valid data, while values below it are regarded as invalid data and rejected.
Further, data with gross errors are rejected during data cleansing using the Pauta criterion, comprising the following steps: for log data $x_1, x_2, \ldots, x_n$, calculate the arithmetic mean $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ and the residuals $v_i = x_i - \bar{x}$, where $x_i$ is the log data collected by a single Agent; calculate the standard deviation $S_x = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} v_i^2}$; if the residual $v_b$ ($1 \le b \le n$) of a value $x_b$ in the log data satisfies $|v_b| > 3S_x$, then $x_b$ is regarded as a singular value containing a gross error and is eliminated as an abnormal value.

Further, the Pauta (La Yida) rule can efficiently identify singular values in the production data, but the rejected data would leave null values. Therefore, the identified singular values in the log data are replaced with the median, pre-processing the production data information, where the median is the value in the middle position when the variable values $x_1, x_2, \ldots, x_n$ are arranged in order of magnitude to form a sequence.
In an alternative embodiment, features including the mean, RMS value, peak value, square-root amplitude, waveform index, pulse index and kurtosis index are extracted from the raw data, wherein

the RMS value is calculated as $X_{rms} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} x_i^2}$;

the peak value is calculated as $X_p = \max(x_i)$;

the square-root amplitude is calculated as $X_r = \left(\frac{1}{n}\sum_{i=1}^{n}\sqrt{|x_i|}\right)^2$;

the waveform index is calculated as $X_{ws} = X_{rms} \Big/ \frac{1}{n}\sum_{i=1}^{n}|x_i|$;

the pulse index is calculated as $X_{if} = X_p \Big/ \frac{1}{n}\sum_{i=1}^{n}|x_i|$;

the kurtosis index is calculated as $X_{kv} = \frac{1}{n}\sum_{i=1}^{n} x_i^4 \Big/ X_{rms}^4$;

where $x_i$ is the log data collected by a single Agent; $n$ is the number of log data acquisitions; $\bar{x}$ is the arithmetic mean of the collected log data; $X_{rms}$ is the RMS value, $X_p$ the peak value, $X_r$ the square-root amplitude, $X_{ws}$ the waveform index, $X_{if}$ the pulse index and $X_{kv}$ the kurtosis index of the collected log data.

Valid features are screened out with the Pearson correlation coefficient: specifically, the Pearson correlation coefficient between each of the above feature values and the raw data is computed, and the calculated coefficient is compared with the correlation threshold; values above the threshold are regarded as valid data, while values below it are regarded as invalid data and must be rejected, so that valid data are screened out. For example, if the correlation threshold is 0.7 and the correlation coefficient between the square-root amplitude and the raw data is 0.2, the square-root amplitude is invalid data; if the correlation coefficient between the kurtosis index and the raw data is 0.85, the kurtosis index is regarded as valid data. The formula of the Pearson correlation coefficient is:

$r = \dfrac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}$

where $x_i$ is the data value collected by a single Agent; $y_i$ is a feature value extracted from the data collected by a single Agent; $\bar{x}$ is the arithmetic mean of $x_1, x_2, \ldots, x_n$; $\bar{y}$ is the arithmetic mean of $y_1, y_2, \ldots, y_n$; and $n$ is the number of log data acquisitions.
In an alternative embodiment, Flume comprises multiple first-level Agents and one second-level Agent; each first-level Agent collects the log data of one server, the log data collected by the first-level Agents are aggregated at the second-level Agent, and the second-level Agent transmits them to HDFS (a distributed file system).
As shown in Fig. 2, which is a schematic diagram of the hardware architecture of an embodiment of the electronic device of the present invention: in this embodiment the electronic device 2 is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions. For example, it may be a smartphone, a tablet computer, a laptop, a desktop computer, a rack server, a blade server, a tower server or a cabinet server (including an independent server, or a cluster composed of multiple servers), etc. As shown in Fig. 2, the electronic device 2 includes at least, but is not limited to, a memory 21, a processor 22 and a network interface 23 communicatively connected to each other through a system bus. The memory 21 includes at least one type of computer-readable storage medium, including flash memory, hard disk, multimedia card, card-type memory (for example SD or DX memory), random access memory (RAM), static random-access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disk, etc. In some embodiments the memory 21 may be an internal storage unit of the electronic device 2, such as its hard disk or memory. In other embodiments the memory 21 may also be an external storage device of the electronic device 2, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card or a Flash Card equipped on the electronic device 2. Of course, the memory 21 may also include both the internal storage unit of the electronic device 2 and its external storage device. In this embodiment the memory 21 is generally used to store the operating system and various application software installed on the electronic device 2, such as the code of the cluster log feature extraction program. In addition, the memory 21 may also be used to temporarily store various data that have been output or are to be output.
The processor 22 may in some embodiments be a central processing unit (CPU), controller, microcontroller, microprocessor or other data processing chip. The processor 22 is generally used to control the overall operation of the electronic device 2, for example performing control and processing related to the data interaction or communication of the electronic device 2. In this embodiment the processor 22 is used to run the program code stored in the memory 21 or to process data, for example to run the cluster log feature extraction program.
The network interface 23 may include a wireless network interface or a wired network interface, and is generally used to establish a communication connection between the electronic device 2 and other electronic devices. For example, the network interface 23 is used to connect the electronic device 2 with a push platform through a network, establishing a data transmission channel and a communication connection between the electronic device 2 and the push platform. The network may be an intranet, the Internet, the Global System for Mobile communications (GSM), Wideband Code Division Multiple Access (WCDMA), a 4G network, a 5G network, Bluetooth, Wi-Fi or another wireless or wired network.
Optionally, the electronic device 2 may also include a display, which may also be called a display screen or display unit. In some embodiments it may be an LED display, a liquid crystal display, a touch liquid crystal display, an Organic Light-Emitting Diode (OLED) display, etc. The display is used to show the information processed in the electronic device 2 and to present a visual user interface.
It should be pointed out that Fig. 2 only shows the electronic device 2 with components 21-23; it should be understood that not all of the shown components are required to be implemented, and more or fewer components may be implemented instead.
The memory 21, which contains a readable storage medium, may include an operating system, a cluster log feature extraction program 50, etc. The processor 22 implements the following steps when executing the cluster log feature extraction program 50 in the memory 21:

Step S10: collect the logs of the server cluster through a Flume (distributed massive log collection, aggregation and transmission system) client and send them to the HBase database server. Flume takes the Agent component as its smallest independent running unit; one Agent component is a complete data collection tool. Flume collects the log data of each server through multiple Agents: one Agent is deployed for each server, and it periodically collects the log data of its server and sends it to the backend through the API interface.

Step S30: perform data cleansing on the log data using Hadoop (a distributed system infrastructure) and screen out the raw data, wherein the raw data include at least server disk occupancy, memory usage, CPU usage and business-interface call volume.

Step S50: extract from the raw data features including the mean, RMS value, peak value, square-root amplitude, waveform index, pulse index and kurtosis index.

Step S70: screen out valid features with the Pearson correlation coefficient: compute the Pearson correlation coefficient between each extracted feature value and the raw data, and compare the calculated coefficient with the correlation threshold; values above the threshold are regarded as valid data, while values below it are regarded as invalid data and rejected.

In this embodiment, the cluster log feature extraction program stored in the memory 21 may be divided into one or more program modules, which are stored in the memory 21 and executed by one or more processors (the processor 22 in this embodiment) to implement the present invention. For example, Fig. 3 shows a schematic diagram of the program modules of the cluster log feature extraction program; in this embodiment the cluster log feature extraction program 50 may be divided into a log acquisition module 501, a data cleansing module 502, a feature extraction module 503 and a valid-feature screening module 504. A program module as referred to in the present invention is a series of computer program instruction segments capable of completing a specific function, and is more suitable than a whole program for describing the execution of the cluster log feature extraction program in the electronic device 2. The specific functions of the program modules are described below.
The log acquisition module 501 is used to collect the logs of the server cluster through a Flume (distributed massive log collection, aggregation and transmission system) client and send them to the HBase database server. Flume takes the Agent component as its smallest independent running unit; one Agent component is a complete data collection tool. Flume collects the log data of each server through multiple Agents: one Agent is deployed for each server, and it periodically collects the log data of its server and sends it to the backend through the API interface.

The data cleansing module 502 is used to perform data cleansing on the log data using Hadoop (a distributed system infrastructure) and screen out the raw data, wherein the raw data include at least server disk occupancy, memory usage, CPU usage and business-interface call volume.

The feature extraction module 503 is used to extract from the raw data features including the mean, RMS value, peak value, square-root amplitude, waveform index, pulse index and kurtosis index.

The valid-feature screening module 504 screens out valid features with the Pearson correlation coefficient: the Pearson correlation coefficient between each extracted feature value and the raw data is computed and compared with the correlation threshold; values above the threshold are regarded as valid data, while values below it are regarded as invalid data and rejected.
In an alternative embodiment, as shown in Fig. 6, the data cleansing module 502 includes a Pauta criterion judging unit 5021, which rejects data with gross errors using the Pauta criterion, comprising the following steps: for log data $x_1, x_2, \ldots, x_n$, calculate the arithmetic mean $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ and the residuals $v_i = x_i - \bar{x}$, where $x_i$ is the data value collected by a single Agent; calculate the standard deviation $S_x = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} v_i^2}$; if the residual $v_b$ ($1 \le b \le n$) of a value $x_b$ satisfies $|v_b| > 3S_x$, then $x_b$ is regarded as a singular value containing a gross error and is rejected.

Further, the data cleansing module 502 also includes a singular value replacement unit 5022. The Pauta (La Yida) rule can efficiently identify singular values in the production data, but the rejected data would leave null values. The singular value replacement unit 5022 therefore replaces the identified singular values in the log data with the median, pre-processing the production data information, where the median is the value in the middle position when the variable values $x_1, x_2, \ldots, x_n$ are arranged in order of magnitude to form a sequence.
In one alternate embodiment, as shown in figure 5, characteristic extracting module 503 include mean value extraction unit 5031, effectively
It is worth extraction unit 5032, peak extraction unit 5033, root magnitude extraction unit 5034, waveform index extraction unit 5035, arteries and veins
Rush index extraction unit 5036, kurtosis index extraction unit 5037.Respectively initial data is carried out to include mean value, virtual value, peak
The characteristics extraction of value, root amplitude, waveform index, pulse index, kurtosis index, wherein
The RMS value (effective value) is calculated using the following formula:
$X_{rms} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} x_i^2}$
The peak value is calculated using the following formula:
$X_p = \max(x_i)$
The root amplitude is calculated using the following formula:
$X_r = \left(\frac{1}{n}\sum_{i=1}^{n}\sqrt{|x_i|}\right)^2$
The waveform index is calculated using the following formula:
$X_{ws} = X_{rms} / |\bar{x}|$
The pulse index is calculated using the following formula:
$X_{if} = X_p / |\bar{x}|$
The kurtosis index is calculated using the following formula:
$X_{kv} = \frac{1}{n}\sum_{i=1}^{n} x_i^4 \,/\, X_{rms}^4$
wherein:
$x_i$ is the log data acquired by a single Agent;
$n$ is the number of log-data acquisitions;
$\bar{x}$ is the arithmetic mean of the acquired log data;
$X_{rms}$ is the RMS value of the acquired log data;
$X_p$ is the peak value of the acquired log data;
$X_r$ is the root amplitude of the acquired log data;
$X_{ws}$ is the waveform index of the acquired log data;
$X_{if}$ is the pulse index of the acquired log data;
$X_{kv}$ is the kurtosis index of the acquired log data.
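The seven formulas above can be sketched in Python as follows (a non-limiting illustration: function and key names are mine, the peak is taken here as max |x_i|, and the mean is assumed nonzero for the two ratio indices):

```python
import math

def extract_features(x):
    """Compute the seven statistical features used by the extraction module:
    mean, RMS value, peak, root amplitude, waveform index, pulse index,
    kurtosis index."""
    n = len(x)
    mean = sum(x) / n
    rms = math.sqrt(sum(v * v for v in x) / n)                 # X_rms
    peak = max(abs(v) for v in x)                              # X_p
    root_amp = (sum(math.sqrt(abs(v)) for v in x) / n) ** 2    # X_r
    waveform = rms / abs(mean)                                 # X_ws = X_rms / |mean|
    pulse = peak / abs(mean)                                   # X_if = X_p / |mean|
    kurtosis = (sum(v ** 4 for v in x) / n) / rms ** 4         # X_kv
    return {"mean": mean, "rms": rms, "peak": peak,
            "root_amplitude": root_amp, "waveform_index": waveform,
            "pulse_index": pulse, "kurtosis_index": kurtosis}
```

For a constant series all seven ratios collapse to the same scale (e.g. a series of ones yields 1.0 for every feature), which is a quick sanity check on the implementation.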
The valid features are screened out with the Pearson correlation coefficient. Specifically, each of the above feature values is correlated with the raw data, and the calculated correlation coefficient is compared with a correlation threshold: features above the threshold are considered valid data, while features below the threshold are considered invalid data and are rejected, so that the valid features are screened out. For example, if the correlation threshold is 0.7 and the correlation coefficient between the root amplitude and the raw data is 0.2, the root amplitude is invalid data; if the correlation coefficient between the kurtosis index and the raw data is 0.85, the kurtosis index is recognized as valid data. The Pearson correlation coefficient is given by the following formula:
$r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}$
wherein:
$x_i$ is the data value acquired by a single Agent;
$y_i$ is a certain feature value extracted from the data acquired by a single Agent;
$\bar{x}$ is the arithmetic mean of the log data $x_1, x_2, \ldots, x_n$;
$\bar{y}$ is the arithmetic mean of $y_1, y_2, \ldots, y_n$;
$n$ is the number of data acquisitions.
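A Python sketch of this screening step, under the assumption that each feature is computed per acquisition window so that it forms a series of the same length as the raw series (function names and the 0.7 default are illustrative):

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def select_valid_features(raw, feature_series, threshold=0.7):
    """Keep only the features whose |r| with the raw series
    reaches the correlation threshold."""
    return {name: series for name, series in feature_series.items()
            if abs(pearson(raw, series)) >= threshold}
```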
In an alternative embodiment, as shown in Fig. 4, the log acquisition module 501 further includes an Agent setting unit 5011 for configuring Flume with multiple first-level Agents and one second-level Agent. Each first-level Agent collects the log data of one server; the log data collected by the multiple first-level Agents is aggregated at the second-level Agent and transmitted by the second-level Agent into HDFS.
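The two-tier topology described above might be configured in Flume properties along these lines (a sketch only: the hostnames, ports, paths and agent names are placeholders, not taken from the disclosure):

```properties
# First-level agent, one per application server (names and paths illustrative)
a1.sources = r1
a1.channels = c1
a1.sinks = k1
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app/app.log
a1.sources.r1.channels = c1
a1.channels.c1.type = memory
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = collector.example.internal
a1.sinks.k1.port = 4545
a1.sinks.k1.channel = c1

# Second-level (collector) agent, aggregating the first-level streams into HDFS
a2.sources = r1
a2.channels = c1
a2.sinks = k1
a2.sources.r1.type = avro
a2.sources.r1.bind = 0.0.0.0
a2.sources.r1.port = 4545
a2.sources.r1.channels = c1
a2.channels.c1.type = memory
a2.sinks.k1.type = hdfs
a2.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/logs/%Y%m%d
a2.sinks.k1.hdfs.fileType = DataStream
a2.sinks.k1.channel = c1
```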
In addition, an embodiment of the present invention also proposes a computer-readable storage medium. The computer-readable storage medium may be any one of, or any combination of, a hard disk, a multimedia card, an SD card, a flash card, an SMC, a read-only memory (ROM), an erasable programmable read-only memory (EPROM), a portable compact-disc read-only memory (CD-ROM), a USB memory, and the like. The computer-readable storage medium includes, among others, a cluster log feature extraction program 50; when the cluster log feature extraction program 50 is executed by the processor 22, the following operations are implemented:
Step S10: collect the logs of the server cluster through the Flume client and send them to the HBase database server.
Flume takes the Agent component as its smallest independent operating unit; one Agent component is a complete data-gathering tool.
Flume collects log data from each server through multiple Agents: one Agent is deployed for each corresponding server and periodically collects the log data on that server and sends it to the back end through an API interface.
Step S30: perform data cleansing on the log data using Hadoop to obtain the raw data, wherein the raw data includes at least the server disk occupancy, memory usage, CPU occupancy and business-interface call volume.
Step S50: extract from the raw data the features including the mean, RMS value, peak value, root amplitude, waveform index, pulse index and kurtosis index.
Step S70: screen out the valid features with the Pearson correlation coefficient: correlate each extracted feature value with the raw data, compare the calculated correlation coefficient with the correlation threshold, treat features above the threshold as valid data and features below the threshold as invalid data, and reject the invalid data.
The specific embodiments of the computer-readable storage medium of the present invention are substantially the same as those of the above cluster log feature extraction method and of the electronic device 2, and are not described again here.
The above description is only a preferred embodiment of the present invention and is not intended to limit the invention; for those skilled in the art, the invention may be variously modified and varied. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present invention shall be included in the protection scope of the present invention.
Claims (10)
1. A cluster log feature extraction method applied to an electronic device, characterized by comprising the following steps:
collecting the logs of a server cluster through a Flume client and sending them to an HBase database, wherein the Flume client collects the log of each server in the server cluster through a corresponding plurality of Agent processes, and each Agent periodically collects the log data on its corresponding server and sends it to the HBase database through an API interface;
performing data cleansing on the log data using Hadoop to obtain raw data, wherein the raw data includes at least server disk occupancy, memory usage, CPU occupancy and business-interface call volume;
extracting from the raw data features including the mean, RMS value, peak value, root amplitude, waveform index, pulse index and kurtosis index;
screening out valid features with the Pearson correlation coefficient: correlating each extracted feature value with the raw data, comparing the calculated correlation coefficient with a correlation threshold, treating features above the threshold as valid data and features below the threshold as invalid data, and rejecting the invalid data.
2. The cluster log feature extraction method according to claim 1, characterized in that,
during data cleansing, data with gross errors is rejected using the Pauta criterion, comprising the following steps:
for the log data $x_1, x_2, \ldots, x_n$, calculating the arithmetic mean $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ and the residuals $v_i = x_i - \bar{x}$, wherein $x_i$ is the log data acquired by a single Agent;
calculating the standard deviation $S_x = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} v_i^2}$;
if the residual $v_b$ of a data point $x_b$ ($1 \le b \le n$) in the log data satisfies $|v_b| > 3S_x$,
determining that $x_b$ is a singular value containing a gross error, and rejecting the abnormal value.
3. The cluster log feature extraction method according to claim 2, characterized in that,
the singular values of the log data are substituted with the median, wherein the median refers to the value in the middle position when the log data $x_1, x_2, \ldots, x_n$ are arranged in order of size.
4. The cluster log feature extraction method according to claim 2, characterized in that,
the features extracted from the raw data include the mean, RMS value, peak value, root amplitude, waveform index, pulse index and kurtosis index, wherein:
the RMS value is calculated using the following formula:
$X_{rms} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} x_i^2}$
the peak value is calculated using the following formula:
$X_p = \max(x_i)$
the root amplitude is calculated using the following formula:
$X_r = \left(\frac{1}{n}\sum_{i=1}^{n}\sqrt{|x_i|}\right)^2$
the waveform index is calculated using the following formula:
$X_{ws} = X_{rms} / |\bar{x}|$
the pulse index is calculated using the following formula:
$X_{if} = X_p / |\bar{x}|$
the kurtosis index is calculated using the following formula:
$X_{kv} = \frac{1}{n}\sum_{i=1}^{n} x_i^4 \,/\, X_{rms}^4$
wherein:
$x_i$ is the log data acquired by a single Agent;
$n$ is the number of data acquisitions;
$\bar{x}$ is the arithmetic mean of the acquired log data;
$X_{rms}$ is the RMS value of the acquired log data;
$X_p$ is the peak value of the acquired log data;
$X_r$ is the root amplitude of the acquired log data;
$X_{ws}$ is the waveform index of the acquired log data;
$X_{if}$ is the pulse index of the acquired log data;
$X_{kv}$ is the kurtosis index of the acquired log data.
5. The cluster log feature extraction method according to claim 2, characterized in that the Pearson correlation coefficient is given by the following formula:
$r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}$
wherein:
$x_i$ is the log data acquired by a single Agent;
$y_i$ is a certain feature value extracted from the data acquired by a single Agent;
$\bar{x}$ is the arithmetic mean of the log data $x_1, x_2, \ldots, x_n$;
$\bar{y}$ is the arithmetic mean of $y_1, y_2, \ldots, y_n$;
$n$ is the number of log-data acquisitions.
6. The cluster log feature extraction method according to claim 1, characterized in that,
the Flume comprises multiple first-level Agents and one second-level Agent, each first-level Agent collecting the log data of one server; the log data collected by the multiple first-level Agents is aggregated at the second-level Agent and transmitted by the second-level Agent into HDFS.
7. An electronic device, characterized in that the electronic device comprises a memory and a processor, the memory storing a cluster log feature extraction program which, when executed by the processor, implements the following steps:
collecting the logs of a server cluster through a Flume client and sending them to an HBase database, wherein the Flume client collects the log of each server in the server cluster through a corresponding plurality of Agent processes, and each Agent periodically collects the log data on its corresponding server and sends it to the HBase database through an API interface;
performing data cleansing on the log data using Hadoop to obtain raw data, wherein the raw data includes at least server disk occupancy, memory usage, CPU occupancy and business-interface call volume;
extracting from the raw data features including the mean, RMS value, peak value, root amplitude, waveform index, pulse index and kurtosis index;
screening out valid features with the Pearson correlation coefficient: correlating each extracted feature value with the raw data, comparing the calculated correlation coefficient with a correlation threshold, treating features above the threshold as valid data and features below the threshold as invalid data, and rejecting the invalid data.
8. The electronic device according to claim 7, characterized in that,
data with gross errors is rejected during data cleansing using the Pauta criterion, comprising the following steps:
for the log data $x_1, x_2, \ldots, x_n$, calculating the arithmetic mean $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ and the residuals $v_i = x_i - \bar{x}$, wherein $x_i$ is the data value acquired by a single Agent;
calculating the standard deviation $S_x = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} v_i^2}$;
if the residual $v_b$ of a data point $x_b$ ($1 \le b \le n$) in the log data satisfies $|v_b| > 3S_x$,
considering $x_b$ a singular value containing a gross error, and rejecting the singular value.
9. The electronic device according to claim 8, characterized in that,
the singular values in the log data are substituted with the median, wherein the median refers to the value in the middle position when the log data $x_1, x_2, \ldots, x_n$ are arranged in order of size.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program, the computer program comprising program instructions which, when executed by a processor, implement the cluster log feature extraction method according to any one of claims 1 to 6.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910123928.1A CN109992569A (en) | 2019-02-19 | 2019-02-19 | Cluster log feature extracting method, device and storage medium |
PCT/CN2019/118288 WO2020168756A1 (en) | 2019-02-19 | 2019-11-14 | Cluster log feature extraction method, and apparatus, device and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109992569A true CN109992569A (en) | 2019-07-09 |
Family
ID=67129790
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910123928.1A Pending CN109992569A (en) | 2019-02-19 | 2019-02-19 | Cluster log feature extracting method, device and storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN109992569A (en) |
WO (1) | WO2020168756A1 (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110737648A (en) * | 2019-09-17 | 2020-01-31 | 平安科技(深圳)有限公司 | Performance characteristic dimension reduction method and device, electronic equipment and storage medium |
CN111290916A (en) * | 2020-02-18 | 2020-06-16 | 深圳前海微众银行股份有限公司 | Big data monitoring method, device and equipment and computer readable storage medium |
WO2020168756A1 (en) * | 2019-02-19 | 2020-08-27 | 平安科技(深圳)有限公司 | Cluster log feature extraction method, and apparatus, device and storage medium |
CN111984499A (en) * | 2020-08-04 | 2020-11-24 | 中国建设银行股份有限公司 | Fault detection method and device for big data cluster |
CN112069036A (en) * | 2020-11-10 | 2020-12-11 | 南京信易达计算技术有限公司 | Management and monitoring system based on cluster computing |
CN113945684A (en) * | 2021-10-14 | 2022-01-18 | 中国计量科学研究院 | Big data-based micro air station self-calibration method |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105353644A (en) * | 2015-09-29 | 2016-02-24 | 中国人民解放军63892部队 | Radar target track derivative system on the basis of information mining of real-equipment data and method thereof |
US20170116330A1 (en) * | 2015-10-23 | 2017-04-27 | International Business Machines Corporation | Generating Important Values from a Variety of Server Log Files |
CN106769032A (en) * | 2016-11-28 | 2017-05-31 | 南京工业大学 | A kind of Forecasting Methodology of pivoting support service life |
US20170169360A1 (en) * | 2013-04-02 | 2017-06-15 | Patternex, Inc. | Method and system for training a big data machine to defend |
CN108399199A (en) * | 2018-01-30 | 2018-08-14 | 武汉大学 | A kind of collection of the application software running log based on Spark and service processing system and method |
CN109032910A (en) * | 2018-07-24 | 2018-12-18 | 北京百度网讯科技有限公司 | Log collection method, device and storage medium |
CN109033404A (en) * | 2018-08-03 | 2018-12-18 | 北京百度网讯科技有限公司 | Daily record data processing method, device and system |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7356550B1 (en) * | 2001-06-25 | 2008-04-08 | Taiwan Semiconductor Manufacturing Company | Method for real time data replication |
CN104036025A (en) * | 2014-06-27 | 2014-09-10 | 蓝盾信息安全技术有限公司 | Distribution-base mass log collection system |
CN106570151A (en) * | 2016-10-28 | 2017-04-19 | 上海斐讯数据通信技术有限公司 | Data collection processing method and system for mass files |
CN106845799B (en) * | 2016-12-29 | 2023-12-19 | 中国电力科学研究院 | Evaluation method for typical working condition of battery energy storage system |
CN107092592B (en) * | 2017-04-10 | 2020-06-05 | 浙江鸿程计算机系统有限公司 | Site personalized semantic recognition method based on multi-situation data and cost-sensitive integrated model |
CN109992569A (en) * | 2019-02-19 | 2019-07-09 | 平安科技(深圳)有限公司 | Cluster log feature extracting method, device and storage medium |
- 2019-02-19: CN application CN201910123928.1A filed (status: Pending)
- 2019-11-14: PCT application PCT/CN2019/118288 filed
Also Published As
Publication number | Publication date |
---|---|
WO2020168756A1 (en) | 2020-08-27 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109992569A (en) | Cluster log feature extracting method, device and storage medium | |
CN107800591B (en) | Unified log data analysis method | |
EP3031216A1 (en) | Dynamic collection analysis and reporting of telemetry data | |
CN108415845A (en) | AB tests computational methods, device and the server of system index confidence interval | |
CN103294592A (en) | Leveraging user-to-tool interactions to automatically analyze defects in it services delivery | |
CN105930527A (en) | Searching method and device | |
CN102541884B (en) | Method and device for database optimization | |
CN108875091A (en) | A kind of distributed network crawler system of unified management | |
CN108650684A (en) | A kind of correlation rule determines method and device | |
CN104765689A (en) | Method and device for conducting real-time supervision to interface performance data | |
CN107832291A (en) | Client service method, electronic installation and the storage medium of man-machine collaboration | |
CN112463859B (en) | User data processing method and server based on big data and business analysis | |
CN111800292B (en) | Early warning method and device based on historical flow, computer equipment and storage medium | |
CN110519263A (en) | Anti- brush amount method, apparatus, equipment and computer readable storage medium | |
CN111242430A (en) | Power equipment supplier evaluation method and device | |
CN111970151A (en) | Flow fault positioning method and system for virtual and container network | |
CN104199850A (en) | Method and device for processing essential data | |
CN108664322A (en) | Data processing method and system | |
CN113360313B (en) | Behavior analysis method based on massive system logs | |
CN114598731B (en) | Cluster log acquisition method, device, equipment and storage medium | |
CN112416800B (en) | Intelligent contract testing method, device, equipment and storage medium | |
CN104426708A (en) | Method and system for executing security detection service | |
CN107147542A (en) | A kind of information generating method and device | |
KR101718599B1 (en) | System for analyzing social media data and method for analyzing social media data using the same | |
CN112685376A (en) | Massive log data analysis method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||