CN109992569A - Cluster log feature extracting method, device and storage medium - Google Patents

Cluster log feature extracting method, device and storage medium

Info

Publication number
CN109992569A
CN109992569A
Authority
CN
China
Prior art keywords
data
value
log
log data
acquisition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910123928.1A
Other languages
Chinese (zh)
Inventor
吴超勇
陈仕财
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN201910123928.1A priority Critical patent/CN109992569A/en
Publication of CN109992569A publication Critical patent/CN109992569A/en
Priority to PCT/CN2019/118288 priority patent/WO2020168756A1/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors

Abstract

The present invention relates to infrastructure operations and maintenance (O&M), and provides a cluster log feature extraction method, device, and storage medium. The logs of a server cluster are collected by a Flume client and sent to a database; data cleansing is performed on the log data to filter out raw data; feature extraction is performed on the raw data, including the mean, RMS value, peak value, root amplitude, waveform index, pulse index, and kurtosis index; the Pearson correlation coefficient between each extracted feature value and the raw data is computed and compared against a correlation threshold, where features above the threshold are considered valid data and features below the threshold are considered invalid and rejected. The present invention can effectively screen out the useful information in the production data of each host in the server cluster and extract feature values of the production data from it, facilitating fault prediction and fault classification for the production system and reducing the occurrence of production accidents.

Description

Cluster log feature extracting method, device and storage medium
Technical field
The present invention relates to infrastructure operations and maintenance (O&M), and specifically to a cluster log feature extraction method, device, and storage medium.
Background technique
In an era of explosive information growth, file sizes and data volumes at the TB or even PB scale have become a reality, and cluster storage systems have grown to 64 nodes or more; managing such a large cluster system has become a severe challenge for data centers. Tracking the operating status of cluster nodes in time and accurately locating node error messages has therefore become particularly important. In the actual operation of cluster storage systems, a commonly used cluster storage log management method sends system logs periodically or in real time, achieving centralized log transmission; however, the logs are not analyzed or managed, so the operating condition of the whole cluster storage system cannot be understood globally and error messages cannot be located quickly. Moreover, as the number of cluster nodes grows, cluster system management becomes increasingly complex. Extracting, from massive server data, the features that reflect server performance, accurately locating incipient faults of cluster nodes, and carrying out corresponding performance checks in advance is particularly important.
Summary of the invention
In order to solve the above problems, the present invention provides a cluster log feature extraction method applied to an electronic device, comprising the following steps: collecting the logs of a server cluster through a Flume client and sending them to an HBase database, wherein the Flume client collects the log of each server in the cluster through a corresponding Agent process, and each Agent periodically collects the log data on its server and sends it to the HBase database through an API interface; performing data cleansing on the log data using Hadoop to filter out raw data, wherein the raw data includes at least server disk usage, memory usage, CPU usage, and business interface call volume; performing feature extraction on the raw data, including the mean, RMS value, peak value, root amplitude, waveform index, pulse index, and kurtosis index; and screening out valid features with the Pearson correlation coefficient, by computing the Pearson correlation coefficient between each extracted feature value and the raw data and comparing the calculated coefficient with a correlation threshold: features above the threshold are considered valid data, while features below the threshold are considered invalid and rejected.
Preferably, data with gross errors are rejected during data cleansing using the Pauta (3σ) criterion, comprising the following steps: for the log data x1, x2, ..., xn, calculate the arithmetic mean

x̄ = (x1 + x2 + ... + xn)/n

and the residuals

vi = xi - x̄,

where xi is the log data collected by a single Agent;

calculate the standard deviation Sx:

Sx = √( Σvi² / (n - 1) )

If the residual vb (1 ≤ b ≤ n) of a data point xb satisfies

|vb| > 3Sx

then xb is considered an outlier containing a gross error, and the outlier is rejected.
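As a concrete illustration, the Pauta (3σ) rejection described above can be sketched in Python as follows; this is a minimal sketch only, and the function name and sample readings are hypothetical, not from the patent.

```python
import math

def pauta_reject(samples):
    """Drop values whose residual exceeds 3 standard deviations (Pauta / 3-sigma)."""
    n = len(samples)
    mean = sum(samples) / n                        # arithmetic mean of the log data
    residuals = [x - mean for x in samples]        # v_i = x_i - mean
    s_x = math.sqrt(sum(v * v for v in residuals) / (n - 1))  # standard deviation S_x
    # keep x_b only while |v_b| <= 3 * S_x
    return [x for x, v in zip(samples, residuals) if abs(v) <= 3 * s_x]

# 20 CPU-usage readings (hypothetical) with one gross error at 99.0; note that
# with only a handful of samples a single outlier cannot exceed 3 sigma, so a
# realistically sized collection window is needed for the criterion to fire.
readings = [41.0, 42.5, 40.8, 43.1, 41.9, 42.2, 41.5, 42.8, 41.2, 42.0,
            41.7, 42.4, 41.3, 42.6, 41.1, 42.9, 41.8, 42.3, 41.6, 99.0]
cleaned = pauta_reject(readings)  # 99.0 is rejected, 19 values remain
```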
Preferably, the outliers in the log data are replaced with the median, wherein the median refers to the value in the middle position when the log data x1, x2, ..., xn are arranged in order of magnitude.
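The median-substitution variant can be sketched as follows; the helper name is hypothetical, and Python's `statistics.stdev` matches the n - 1 denominator used by the criterion.

```python
import statistics

def replace_outliers_with_median(samples, k=3.0):
    """Replace Pauta-criterion outliers with the median instead of leaving gaps."""
    mean = sum(samples) / len(samples)
    s_x = statistics.stdev(samples)       # sample standard deviation (n - 1 denominator)
    med = statistics.median(samples)      # middle value after arranging by size
    return [med if abs(x - mean) > k * s_x else x for x in samples]

readings = [41.0, 42.5, 40.8, 43.1, 41.9, 42.2, 41.5, 42.8, 41.2, 42.0,
            41.7, 42.4, 41.3, 42.6, 41.1, 42.9, 41.8, 42.3, 41.6, 99.0]
smoothed = replace_outliers_with_median(readings)  # 99.0 becomes the median
```

Substituting the median rather than simply deleting keeps the series length unchanged, which avoids the null values the patent notes would otherwise appear after rejection.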
Preferably, feature extraction is performed on the raw data, including the mean, RMS value, peak value, root amplitude, waveform index, pulse index, and kurtosis index, wherein

the RMS value is calculated using the following formula:

Xrms = √( (1/n) Σ xi² )

the peak value is calculated using the following formula:

Xp = max(xi)

the root amplitude is calculated using the following formula:

Xr = ( (1/n) Σ √|xi| )²

the waveform index is calculated using the following formula:

Xws = Xrms / |x̄|

the pulse index is calculated using the following formula:

Xif = Xp / |x̄|

the kurtosis index is calculated using the following formula:

Xkv = ( (1/n) Σ xi⁴ ) / Xrms⁴
where xi is the log data collected by a single Agent;
n is the number of data acquisitions;
x̄ is the arithmetic mean of the collected log data;
Xrms is the RMS value of the collected log data;
Xp is the peak value of the collected log data;
Xr is the root amplitude of the collected log data;
Xws is the waveform index of the collected log data;
Xif is the pulse index of the collected log data;
Xkv is the kurtosis index of the collected log data.
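The features listed above are standard time-domain statistics; a minimal Python sketch, assuming the textbook definitions of these features (the function and dictionary key names are illustrative, not from the patent), is:

```python
import math

def extract_features(x):
    """Time-domain features of one series of collected log data."""
    n = len(x)
    mean = sum(x) / n                                     # arithmetic mean
    x_rms = math.sqrt(sum(v * v for v in x) / n)          # RMS value
    x_p = max(x)                                          # peak value
    x_r = (sum(math.sqrt(abs(v)) for v in x) / n) ** 2    # root amplitude
    return {
        "mean": mean,
        "rms": x_rms,
        "peak": x_p,
        "root_amp": x_r,
        "waveform": x_rms / abs(mean),                    # waveform index
        "pulse": x_p / abs(mean),                         # pulse index
        "kurtosis": sum(v ** 4 for v in x) / n / x_rms ** 4,  # kurtosis index
    }

feats = extract_features([1.0, 2.0, 3.0, 4.0])
```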
Preferably, the formula of the Pearson correlation coefficient is as follows:

r = Σ (xi - x̄)(yi - ȳ) / ( √( Σ (xi - x̄)² ) · √( Σ (yi - ȳ)² ) )

where xi is the log data collected by a single Agent;
yi is a certain feature value extracted from the data collected by a single Agent;
x̄ is the arithmetic mean of the log data x1, x2, ..., xn;
ȳ is the arithmetic mean of y1, y2, ..., yn;
n is the number of log data acquisitions.
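The coefficient itself is straightforward to compute directly from this definition; a minimal sketch (the function name is illustrative):

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between a data series and a feature series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))   # numerator
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

r_up = pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])    # perfectly correlated
r_down = pearson([1.0, 2.0, 3.0], [6.0, 4.0, 2.0])  # perfectly anti-correlated
```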
Preferably, Flume includes multiple first-level Agents and a second-level Agent; each first-level Agent collects the log data of one server, the log data collected by the multiple first-level Agents is aggregated to the second-level Agent, and the second-level Agent transmits it to HDFS.
The present invention also provides an electronic device comprising a memory and a processor. A cluster log feature extraction program is stored in the memory, and when the cluster log feature extraction program is executed by the processor it implements the following steps: collecting the logs of a server cluster through a Flume client and sending them to an HBase database, wherein the Flume client collects the log of each server in the cluster through a corresponding Agent process, and each Agent periodically collects the log data on its server and sends it to the HBase database through an API interface; performing data cleansing on the log data using Hadoop to filter out raw data, wherein the raw data includes at least server disk usage, memory usage, CPU usage, and business interface call volume; performing feature extraction on the raw data, including the mean, RMS value, peak value, root amplitude, waveform index, pulse index, and kurtosis index; and screening out valid features with the Pearson correlation coefficient, by computing the Pearson correlation coefficient between each extracted feature value and the raw data and comparing the calculated coefficient with a correlation threshold: features above the threshold are considered valid data, while features below the threshold are considered invalid and rejected.
Preferably, data with gross errors are rejected during data cleansing using the Pauta (3σ) criterion, comprising the following steps: for the log data x1, x2, ..., xn, calculate the arithmetic mean x̄ = (x1 + x2 + ... + xn)/n and the residuals vi = xi - x̄, where xi is the data value collected by a single Agent;

calculate the standard deviation Sx = √( Σvi² / (n - 1) );

if the residual vb (1 ≤ b ≤ n) of a data point xb satisfies |vb| > 3Sx,

then xb is considered an outlier containing a gross error, and the outlier is rejected.
Preferably, the outliers in the log data are replaced with the median, wherein the median refers to the value in the middle position when the log data x1, x2, ..., xn are arranged in order of magnitude.
The present invention also provides a computer-readable storage medium storing a computer program. The computer program includes program instructions which, when executed by a processor, implement the cluster log feature extraction method described above.
The present invention can effectively screen out the useful information in the production data of each host in the server cluster and extract feature values of the production data from it, facilitating fault prediction and fault classification for the production system and reducing the occurrence of production accidents.
Detailed description of the invention
By describing the embodiments in conjunction with the following accompanying drawings, the above features and technical advantages of the present invention will become clearer and easier to understand.
Fig. 1 is the flow diagram of the cluster log feature extracting method of the embodiment of the present invention;
Fig. 2 is the hardware structure schematic diagram of the electronic device of the embodiment of the present invention;
Fig. 3 is the module structure diagram of the cluster log feature extraction program of the embodiment of the present invention;
Fig. 4 is the unit composition figure of the log acquisition module of the embodiment of the present invention;
Fig. 5 is the unit composition figure of the characteristic extracting module of the embodiment of the present invention;
Fig. 6 is the unit composition figure of the data cleansing module of the embodiment of the present invention;
Fig. 7 is a schematic diagram of the Agent process of Flume reading data.
Specific embodiment
Embodiments of the cluster log feature extraction method, device, and storage medium of the present invention are described below with reference to the accompanying drawings. Those skilled in the art will recognize that the described embodiments can be modified in a variety of different ways, or combinations thereof, without departing from the spirit and scope of the present invention. Therefore, the drawings and description are illustrative in nature and are not intended to limit the scope of the claims. In addition, in the present specification, the drawings are not drawn to scale, and identical reference numerals indicate identical parts.
As shown in Fig. 1, the cluster log feature extraction method of the present embodiment includes the following steps:
Step S10: collect the logs of the server cluster through a Flume (a distributed system for massive log collection, aggregation, and transmission) client, and send them to the HBase database server. The smallest independently running unit of Flume is the Agent process; one Agent process is a complete data collection tool. As shown in Fig. 7, an Agent consists of a Source (data collection component), a Channel (temporary transfer store), and a Sink, and the three together make up an Agent. The Source collects data from the server and passes it to the Channel; the Channel saves the Events (data units) passed over by the Source component; the Sink reads and removes Events from the Channel and transmits them to the backend. Flume collects log data from each server through multiple corresponding Agents: one Agent is set up for each server, and it periodically collects the log data on its server and sends it to the backend through an API interface.
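As an illustration of this Source-Channel-Sink anatomy, a single-Agent Flume configuration might look like the following sketch; the patent gives no configuration, so the agent name, log path, table name, and capacity values here are hypothetical.

```properties
# One Agent per server: the Source tails the server log, the Channel buffers
# Events in memory, and the Sink delivers them to HBase.
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: collect data from the server (hypothetical log path)
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app/server.log
a1.sources.r1.channels = c1

# Channel: temporary in-memory store for Events
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# Sink: read Events off the Channel and write them to HBase
a1.sinks.k1.type = hbase
a1.sinks.k1.table = cluster_logs
a1.sinks.k1.columnFamily = raw
a1.sinks.k1.channel = c1
```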
Step S30: perform data cleansing on the log data using Hadoop (a distributed system infrastructure), and filter out the raw data, wherein the raw data includes at least server disk usage, memory usage, CPU usage, and business interface call volume.
Step S50: perform feature extraction on the raw data, including the mean, RMS value, peak value, root amplitude, waveform index, pulse index, and kurtosis index.
Step S70: screen out valid features with the Pearson correlation coefficient: compute the Pearson correlation coefficient between each extracted feature value and the raw data, and compare the calculated coefficient with a correlation threshold; features above the threshold are considered valid data, while features below the threshold are considered invalid and rejected.
Further, data with gross errors are rejected during data cleansing using the Pauta (3σ) criterion, comprising the following steps:

For the log data x1, x2, ..., xn, calculate the arithmetic mean x̄ = (x1 + x2 + ... + xn)/n and the residuals vi = xi - x̄, where xi is the log data collected by a single Agent;

calculate the standard deviation Sx = √( Σvi² / (n - 1) );

if the residual vb (1 ≤ b ≤ n) of a data point xb in the log data satisfies |vb| > 3Sx,

then xb is considered an outlier containing a gross error, and the outlier is rejected.
Further, the Pauta (3σ) criterion can efficiently identify outliers in the production data, but the rejected data points leave null values. Therefore, the identified outliers in the log data are replaced with the median, realizing preprocessing of the production data. Here, the median refers to the value in the middle position when the variable values x1, x2, ..., xn are arranged in order of magnitude to form a sequence.
In one alternative embodiment, feature extraction is performed on the raw data, including the mean, RMS value, peak value, root amplitude, waveform index, pulse index, and kurtosis index, wherein

the RMS value is calculated using the following formula:

Xrms = √( (1/n) Σ xi² )

the peak value is calculated using the following formula:

Xp = max(xi)

the root amplitude is calculated using the following formula:

Xr = ( (1/n) Σ √|xi| )²

the waveform index is calculated using the following formula:

Xws = Xrms / |x̄|

the pulse index is calculated using the following formula:

Xif = Xp / |x̄|

the kurtosis index is calculated using the following formula:

Xkv = ( (1/n) Σ xi⁴ ) / Xrms⁴
where xi is the log data collected by a single Agent;
n is the number of log data acquisitions;
x̄ is the arithmetic mean of the collected log data;
Xrms is the RMS value of the collected log data;
Xp is the peak value of the collected log data;
Xr is the root amplitude of the collected log data;
Xws is the waveform index of the collected log data;
Xif is the pulse index of the collected log data;
Xkv is the kurtosis index of the collected log data.
The valid features are screened out with the Pearson correlation coefficient. Specifically, the Pearson correlation coefficient between each of the above feature values and the raw data is computed, and the calculated coefficient is compared with a correlation threshold; features above the threshold are considered valid data, and features below the threshold are considered invalid data and need to be rejected, so that valid data can be screened out. For example, if the correlation threshold is 0.7 and the correlation coefficient between the root amplitude and the raw data is 0.2, the root amplitude is invalid data; if the correlation coefficient between the kurtosis index and the raw data is 0.85, the kurtosis index is considered valid data. The formula of the Pearson correlation coefficient is as follows:

r = Σ (xi - x̄)(yi - ȳ) / ( √( Σ (xi - x̄)² ) · √( Σ (yi - ȳ)² ) )

where xi is the data value collected by a single Agent;
yi is a certain feature value extracted from the data collected by a single Agent;
x̄ is the arithmetic mean of the log data x1, x2, ..., xn;
ȳ is the arithmetic mean of y1, y2, ..., yn;
n is the number of log data acquisitions.
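The worked example above (threshold 0.7; root amplitude at 0.2 rejected, kurtosis index at 0.85 kept) corresponds to a screening step that can be sketched as follows; the function and feature names are illustrative.

```python
def screen_features(correlations, threshold=0.7):
    """Split features into valid and invalid by their correlation with the raw data."""
    valid = {name: r for name, r in correlations.items() if r > threshold}
    invalid = {name: r for name, r in correlations.items() if r <= threshold}
    return valid, invalid

# correlation coefficients of each extracted feature with the raw data,
# using the values from the example in the text
corrs = {"root_amplitude": 0.2, "kurtosis_index": 0.85}
valid, invalid = screen_features(corrs)  # kurtosis_index kept, root_amplitude rejected
```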
In one alternative embodiment, Flume includes multiple first-level Agents and a second-level Agent; each first-level Agent collects the log data of one server, the log data collected by the multiple first-level Agents is aggregated to the second-level Agent, and the second-level Agent transmits it to HDFS (a distributed file system).
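This two-tier topology is conventionally wired with Avro between the tiers; a configuration fragment might look like the following sketch (the patent gives no configuration, so the agent names, host, port, and HDFS path are hypothetical).

```properties
# First-level Agent (one per server): forward collected Events to the
# second-level Agent over Avro.
agent1.sinks.k1.type = avro
agent1.sinks.k1.hostname = collector.example.com
agent1.sinks.k1.port = 4545

# Second-level Agent: receive Events from all first-level Agents and
# write them to HDFS.
agent2.sources.r1.type = avro
agent2.sources.r1.bind = 0.0.0.0
agent2.sources.r1.port = 4545
agent2.sinks.k1.type = hdfs
agent2.sinks.k1.hdfs.path = hdfs://namenode/flume/cluster-logs/%Y%m%d
```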
As shown in Fig. 2, which is the hardware structure schematic diagram of an embodiment of the electronic device of the present invention, the electronic device 2 of the present embodiment is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions. For example, it may be a smartphone, a tablet computer, a laptop, a desktop computer, a rack-mount server, a blade server, a tower server, or a cabinet server (including an independent server or a cluster composed of multiple servers). As shown in Fig. 2, the electronic device 2 includes at least, but is not limited to, a memory 21, a processor 22, and a network interface 23 that can communicate with each other through a system bus. The memory 21 includes at least one type of computer-readable storage medium, which includes flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic storage, magnetic disk, optical disc, etc. In some embodiments, the memory 21 may be an internal storage unit of the electronic device 2, such as the hard disk or memory of the electronic device 2.

In other embodiments, the memory 21 may also be an external storage device of the electronic device 2, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card equipped on the electronic device 2. Of course, the memory 21 may also include both the internal storage unit of the electronic device 2 and its external storage device. In the present embodiment, the memory 21 is generally used to store the operating system and various application software installed on the electronic device 2, such as the cluster log feature extraction program code. In addition, the memory 21 can also be used to temporarily store various data that has been output or will be output.
In some embodiments, the processor 22 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 22 is generally used to control the overall operation of the electronic device 2, for example, to execute control and processing related to data interaction or communication with the electronic device 2. In the present embodiment, the processor 22 is used to run the program code stored in the memory 21 or to process data, for example, to run the cluster log feature extraction program.
The network interface 23 may include a wireless network interface or a wired network interface, and is generally used to establish a communication connection between the electronic device 2 and other electronic devices. For example, the network interface 23 is used to connect the electronic device 2 with a push platform through a network, and to establish a data transmission channel and communication connection between the electronic device 2 and the push platform. The network may be a wireless or wired network such as an intranet, the Internet, the Global System for Mobile communication (GSM), Wideband Code Division Multiple Access (WCDMA), a 4G network, a 5G network, Bluetooth, or Wi-Fi.
Optionally, the electronic device 2 may also include a display, which may also be called a display screen or display unit. In some embodiments it may be an LED display, a liquid crystal display, a touch liquid crystal display, an organic light-emitting diode (OLED) display, etc. The display is used to show the information processed in the electronic device 2 and to display a visual user interface.
It should be pointed out that Fig. 2 shows only the electronic device 2 with components 21-23; it should be understood that not all of the shown components are required to be implemented, and more or fewer components may be implemented instead.
The memory 21, which includes a readable storage medium, may contain an operating system, a cluster log feature extraction program 50, etc. The processor 22 implements the following steps when executing the cluster log feature extraction program 50 in the memory 21:
Step S10: collect the logs of the server cluster through a Flume (a distributed system for massive log collection, aggregation, and transmission) client, and send them to the HBase database server. The smallest independently running unit of Flume is the Agent component; one Agent component is a complete data collection tool. Flume collects log data from each server through multiple corresponding Agents: one Agent is set up for each server, and it periodically collects the log data on its server and sends it to the backend through an API interface.
Step S30: perform data cleansing on the log data using Hadoop (a distributed system infrastructure), and filter out the raw data, wherein the raw data includes at least server disk usage, memory usage, CPU usage, and business interface call volume.
Step S50: perform feature extraction on the raw data, including the mean, RMS value, peak value, root amplitude, waveform index, pulse index, and kurtosis index.
Step S70: screen out valid features with the Pearson correlation coefficient: compute the Pearson correlation coefficient between each extracted feature value and the raw data, and compare the calculated coefficient with a correlation threshold; features above the threshold are considered valid data, while features below the threshold are considered invalid and rejected.
In the present embodiment, the cluster log feature extraction program stored in the memory 21 can be divided into one or more program modules, which are stored in the memory 21 and executed by one or more processors (the processor 22 in the present embodiment) to complete the present invention. For example, Fig. 3 shows a program module schematic diagram of the cluster log feature extraction program; in this embodiment, the cluster log feature extraction program 50 can be divided into a log acquisition module 501, a data cleansing module 502, a feature extraction module 503, and a valid feature screening module 504. The program modules referred to in the present invention are series of computer program instruction segments capable of completing specific functions, and are more suitable than a whole program for describing the execution process of the cluster log feature extraction program in the electronic device 2. The specific functions of the program modules are described below.
The log acquisition module 501 is used to collect the logs of the server cluster through a Flume (a distributed system for massive log collection, aggregation, and transmission) client and send them to the HBase database server. The smallest independently running unit of Flume is the Agent component; one Agent component is a complete data collection tool. Flume collects log data from each server through multiple corresponding Agents: one Agent is set up for each server, and it periodically collects the log data on its server and sends it to the backend through an API interface.
The data cleansing module 502 is used to perform data cleansing on the log data using Hadoop (a distributed system infrastructure) and filter out the raw data, wherein the raw data includes at least server disk usage, memory usage, CPU usage, and business interface call volume.
The feature extraction module 503 is used to perform feature extraction on the raw data, including the mean, RMS value, peak value, root amplitude, waveform index, pulse index, and kurtosis index.
The valid feature screening module 504 screens out valid features with the Pearson correlation coefficient: the Pearson correlation coefficient between each extracted feature value and the raw data is computed and compared with a correlation threshold; features above the threshold are considered valid data, while features below the threshold are considered invalid and rejected.
In one alternative embodiment, as shown in Fig. 6, the data cleansing module 502 includes a Pauta criterion judging unit 5021, which rejects data with gross errors using the Pauta (3σ) criterion, comprising the following steps:

For the log data x1, x2, ..., xn, calculate the arithmetic mean x̄ = (x1 + x2 + ... + xn)/n and the residuals vi = xi - x̄, where xi is the data value collected by a single Agent;

calculate the standard deviation Sx = √( Σvi² / (n - 1) );

if the residual vb (1 ≤ b ≤ n) of a data point xb satisfies |vb| > 3Sx,

then xb is considered an outlier containing a gross error, and the outlier is rejected.
Further, the data cleansing module 502 also includes an outlier replacement unit 5022. The Pauta (3σ) criterion can effectively identify outliers in the production data, but the rejected data points leave null values. The outlier replacement unit 5022 replaces the identified outliers in the log data with the median, realizing preprocessing of the production data. Here, the median refers to the value in the middle position when the variable values x1, x2, ..., xn are arranged in order of magnitude to form a sequence.
In one alternative embodiment, as shown in Fig. 5, the feature extraction module 503 includes a mean extraction unit 5031, an RMS value extraction unit 5032, a peak extraction unit 5033, a root amplitude extraction unit 5034, a waveform index extraction unit 5035, a pulse index extraction unit 5036, and a kurtosis index extraction unit 5037, which respectively perform feature extraction on the raw data, including the mean, RMS value, peak value, root amplitude, waveform index, pulse index, and kurtosis index, wherein

the RMS value is calculated using the following formula:

Xrms = √( (1/n) Σ xi² )

the peak value is calculated using the following formula:

Xp = max(xi)

the root amplitude is calculated using the following formula:

Xr = ( (1/n) Σ √|xi| )²

the waveform index is calculated using the following formula:

Xws = Xrms / |x̄|

the pulse index is calculated using the following formula:

Xif = Xp / |x̄|

the kurtosis index is calculated using the following formula:

Xkv = ( (1/n) Σ xi⁴ ) / Xrms⁴

where xi is the log data collected by a single Agent;
n is the number of log data acquisitions;
x̄ is the arithmetic mean of the collected log data;
Xrms is the RMS value of the collected log data;
Xp is the peak value of the collected log data;
Xr is the root amplitude of the collected log data;
Xws is the waveform index of the collected log data;
Xif is the pulse index of the collected log data;
Xkv is the kurtosis index of the collected log data.
Filter out validity feature with Pearson correlation coefficient, specifically, be by features above value respectively with initial data The operation for carrying out Pearson correlation coefficient, according to calculated related coefficient with relevance threshold come compared with, higher than degree of correlation threshold Value is then considered valid data, is then considered invalid data lower than relevance threshold, needs to be rejected, so as to filter out The data of effect.For example, relevance threshold is 0.7, the related coefficient of root amplitude and initial data is 0.2, then shows root width Being worth is invalid data, and the related coefficient of kurtosis index and initial data is 0.85, then assert that kurtosis index is valid data.Its In, the formula of Pearson correlation coefficient is as follows:
Wherein, xi is the data value collected by a single Agent;
yi is a feature value extracted from the data collected by a single Agent;
x̄ is the arithmetic mean of the log data x1, x2, ..., xn;
ȳ is the arithmetic mean of y1, y2, ..., yn;
n is the number of data samples collected.
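The screening step above can be illustrated with a short Python sketch. The function and threshold are assumptions for illustration: each candidate feature is assumed to have been computed over successive collection windows so that it forms a series of the same length as the raw series, and the 0.7 threshold follows the example above; the comparison r ≥ threshold mirrors the patent's rule, which discards negatively correlated features as well.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def filter_features(raw, feature_series, threshold=0.7):
    """Keep only the features whose correlation with the raw series meets the threshold."""
    return {name: series for name, series in feature_series.items()
            if pearson(raw, series) >= threshold}
```

A perfectly correlated feature series (r = 1) is kept, while an anti-correlated one (r = −1) falls below the 0.7 threshold and is rejected.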
In one alternate embodiment, as shown in figure 4, the log acquisition module 501 further includes an Agent setting unit 5011, which configures Flume with multiple first-level Agents and one second-level Agent. Each first-level Agent collects the log data of one server; the log data collected by the multiple first-level Agents is aggregated at the second-level Agent, which transmits it to HDFS.
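A two-tier topology of this kind can be expressed in a standard Flume 1.x properties file. The following is only a sketch of the pattern the embodiment describes, not the patent's actual configuration: the agent names (a1, a2), the tailed log path, the collector hostname and the port are illustrative assumptions.

```properties
# First-level agent (one per application server): tail the local log
# and forward events to the collector over Avro.
a1.sources = r1
a1.channels = c1
a1.sinks = k1
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app/app.log
a1.sources.r1.channels = c1
a1.channels.c1.type = memory
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = collector.example.com
a1.sinks.k1.port = 4545
a1.sinks.k1.channel = c1

# Second-level (collector) agent: fan-in from all first-level agents
# via an Avro source, then write to HDFS.
a2.sources = r2
a2.channels = c2
a2.sinks = k2
a2.sources.r2.type = avro
a2.sources.r2.bind = 0.0.0.0
a2.sources.r2.port = 4545
a2.sources.r2.channels = c2
a2.channels.c2.type = memory
a2.sinks.k2.type = hdfs
a2.sinks.k2.hdfs.path = hdfs://namenode/flume/logs/%Y-%m-%d
a2.sinks.k2.channel = c2
```

Each server runs an a1-style agent; a single collector host runs a2, so all first-level traffic converges on one Avro port before landing in HDFS.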
In addition, an embodiment of the present invention also proposes a computer-readable storage medium, which may be any one of, or any combination of, a hard disk, a multimedia card, an SD card, a flash card, an SMC, a read-only memory (ROM), an erasable programmable read-only memory (EPROM), a portable compact disc read-only memory (CD-ROM), a USB memory, and the like. The computer-readable storage medium includes a cluster log feature extraction program; the following operations are realized when the cluster log feature extraction program 50 is executed by the processor 22:
Step S10: collect the logs of the server cluster through the Flume client and send them to the HBase database server. Flume takes the Agent component as its smallest independent operating unit; an Agent component is a complete data gathering tool. Flume collects log data from the servers through multiple Agents, one Agent per server; each Agent periodically collects the log data on its corresponding server and sends it to the back end through an API interface.
Step S30: perform data cleansing on the log data using Hadoop to filter out the raw data, wherein the raw data includes at least the server disk usage, memory usage, CPU usage and service interface call volume.
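The cleansing step is elaborated later in claims 2 and 3 as the Pauta (3σ) criterion with median substitution. A minimal Python sketch under that reading follows; it is an illustration of the criterion, not the patent's Hadoop implementation.

```python
import statistics

def pauta_clean(x):
    """Reject gross-error samples by the Pauta (3-sigma) criterion and
    substitute the median for each rejected sample, per claims 2-3."""
    mean = statistics.mean(x)
    std = statistics.stdev(x)        # sample standard deviation Sx
    median = statistics.median(x)    # middle value of the sorted samples
    # A sample whose residual |xi - mean| exceeds 3*Sx is treated as a
    # singular value containing a gross error and replaced by the median.
    return [v if abs(v - mean) <= 3 * std else median for v in x]
```

On a series of twenty samples near 10.0 with a single sample of 1000.0, the outlier's residual exceeds 3σ and it is replaced by the median, 10.0.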
Step S50: extract from the raw data feature values including the mean, RMS (effective) value, peak value, root amplitude, waveform index, pulse index and kurtosis index.
Step S70: screen out valid features with the Pearson correlation coefficient. Each extracted feature value is correlated with the raw data, and the resulting correlation coefficient is compared with a correlation threshold: a feature above the threshold is treated as valid data, while one below the threshold is treated as invalid data and rejected.
The specific embodiments of the computer-readable storage medium of the present invention are substantially the same as those of the above cluster log feature extraction method and of the electronic device 2, and are not repeated here.
The above description covers only preferred embodiments of the present invention and is not intended to limit it; those skilled in the art may make various modifications and variations to the invention. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present invention shall be included in the protection scope of the present invention.

Claims (10)

1. A cluster log feature extraction method applied to an electronic device, characterized by comprising the following steps:
collecting the logs of a server cluster through a Flume client and sending them to an HBase database, wherein the Flume client collects the log of each server in the server cluster through multiple Agent processes, and each Agent periodically collects the log data on its corresponding server and sends it to the HBase database through an API interface;
performing data cleansing on the log data using Hadoop to filter out raw data, wherein the raw data includes at least the server disk usage, memory usage, CPU usage and service interface call volume;
extracting from the raw data feature values including the mean, RMS (effective) value, peak value, root amplitude, waveform index, pulse index and kurtosis index;
screening out valid features with the Pearson correlation coefficient: each extracted feature value is correlated with the raw data, and the resulting correlation coefficient is compared with a correlation threshold; a feature above the threshold is valid data, while one below the threshold is invalid data and is rejected.
2. cluster log feature extracting method according to claim 1, which is characterized in that
during data cleansing, data with gross errors are rejected using the Pauta criterion, comprising the following steps:
for the log data x1, x2, ..., xn, calculating the arithmetic mean x̄ = (1/n) · Σ xi and the residuals vi = xi − x̄, wherein xi is the log data collected by a single Agent;
calculating the standard deviation Sx = sqrt( (1/(n−1)) · Σ vi² );
if the residual vb (1 ≤ b ≤ n) of xb in the log data satisfies |vb| > 3·Sx,
then determining that xb is a singular value containing a gross error, and rejecting the abnormal value.
3. cluster log feature extracting method according to claim 2, which is characterized in that
substituting the median for the singular value of the log data, wherein the median refers to the value in the middle position when the log data x1, x2, ..., xn are arranged in order of magnitude.
4. cluster log feature extracting method according to claim 2, which is characterized in that
feature values including the mean, RMS (effective) value, peak value, root amplitude, waveform index, pulse index and kurtosis index are extracted from the raw data, wherein
the RMS value is calculated using the following formula:
Xrms = sqrt( (1/N) · Σ xi² )
the peak value is calculated using the following formula:
Xp = max(xi)
the root amplitude is calculated using the following formula:
Xr = ( (1/N) · Σ sqrt(|xi|) )²
the waveform index is calculated using the following formula:
Xws = Xrms / x̄
the pulse index is calculated using the following formula:
Xif = Xp / x̄
the kurtosis index is calculated using the following formula:
Xkv = ( (1/N) · Σ xi⁴ ) / Xrms⁴
wherein, xi is the log data collected by a single Agent;
N is the number of data samples collected;
x̄ is the arithmetic mean of the collected log data;
Xrms is the RMS (effective) value of the collected log data;
Xp is the peak value of the collected log data;
Xr is the root amplitude of the collected log data;
Xws is the waveform index of the collected log data;
Xif is the pulse index of the collected log data;
Xkv is the kurtosis index of the collected log data.
5. The cluster log feature extraction method according to claim 2, characterized in that the Pearson correlation coefficient is given by:
r = Σ (xi − x̄)(yi − ȳ) / sqrt( Σ (xi − x̄)² · Σ (yi − ȳ)² )
wherein, xi is the log data collected by a single Agent;
yi is a feature value extracted from the data collected by a single Agent;
x̄ is the arithmetic mean of the log data x1, x2, ..., xn;
ȳ is the arithmetic mean of y1, y2, ..., yn;
n is the number of log data samples collected.
6. cluster log feature extracting method according to claim 1, which is characterized in that
Flume includes multiple first-level Agents and one second-level Agent; each first-level Agent collects the log data of one server, the log data collected by the multiple first-level Agents is aggregated at the second-level Agent, and the second-level Agent transmits it to HDFS.
7. An electronic device, characterized in that the electronic device includes a memory and a processor, the memory stores a cluster log feature extraction program, and the following steps are realized when the cluster log feature extraction program is executed by the processor:
collecting the logs of a server cluster through a Flume client and sending them to an HBase database, wherein the Flume client collects the log of each server in the server cluster through multiple Agent processes, and each Agent periodically collects the log data on its corresponding server and sends it to the HBase database through an API interface;
performing data cleansing on the log data using Hadoop to filter out raw data, wherein the raw data includes at least the server disk usage, memory usage, CPU usage and service interface call volume;
extracting from the raw data feature values including the mean, RMS (effective) value, peak value, root amplitude, waveform index, pulse index and kurtosis index;
screening out valid features with the Pearson correlation coefficient: each extracted feature value is correlated with the raw data, and the resulting correlation coefficient is compared with a correlation threshold; a feature above the threshold is valid data, while one below the threshold is invalid data and is rejected.
8. electronic device according to claim 7, which is characterized in that
data with gross errors are rejected during data cleansing using the Pauta criterion, comprising the following steps:
for the log data x1, x2, ..., xn, calculating the arithmetic mean x̄ = (1/n) · Σ xi and the residuals vi = xi − x̄, wherein xi is the data value collected by a single Agent;
calculating the standard deviation Sx = sqrt( (1/(n−1)) · Σ vi² );
if the residual vb (1 ≤ b ≤ n) of xb in the log data satisfies |vb| > 3·Sx,
then xb is considered a singular value containing a gross error, and the singular value is rejected.
9. electronic device according to claim 8, which is characterized in that
substituting the median for the singular values in the log data, wherein the median refers to the value in the middle position when the log data x1, x2, ..., xn are arranged in order of magnitude.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program, the computer program includes program instructions, and the cluster log feature extraction method of any one of claims 1 to 6 is realized when the program instructions are executed by a processor.
CN201910123928.1A 2019-02-19 2019-02-19 Cluster log feature extracting method, device and storage medium Pending CN109992569A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910123928.1A CN109992569A (en) 2019-02-19 2019-02-19 Cluster log feature extracting method, device and storage medium
PCT/CN2019/118288 WO2020168756A1 (en) 2019-02-19 2019-11-14 Cluster log feature extraction method, and apparatus, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910123928.1A CN109992569A (en) 2019-02-19 2019-02-19 Cluster log feature extracting method, device and storage medium

Publications (1)

Publication Number Publication Date
CN109992569A true CN109992569A (en) 2019-07-09

Family

ID=67129790

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910123928.1A Pending CN109992569A (en) 2019-02-19 2019-02-19 Cluster log feature extracting method, device and storage medium

Country Status (2)

Country Link
CN (1) CN109992569A (en)
WO (1) WO2020168756A1 (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110737648A (en) * 2019-09-17 2020-01-31 平安科技(深圳)有限公司 Performance characteristic dimension reduction method and device, electronic equipment and storage medium
CN111290916A (en) * 2020-02-18 2020-06-16 深圳前海微众银行股份有限公司 Big data monitoring method, device and equipment and computer readable storage medium
WO2020168756A1 (en) * 2019-02-19 2020-08-27 平安科技(深圳)有限公司 Cluster log feature extraction method, and apparatus, device and storage medium
CN111984499A (en) * 2020-08-04 2020-11-24 中国建设银行股份有限公司 Fault detection method and device for big data cluster
CN112069036A (en) * 2020-11-10 2020-12-11 南京信易达计算技术有限公司 Management and monitoring system based on cluster computing
CN113945684A (en) * 2021-10-14 2022-01-18 中国计量科学研究院 Big data-based micro air station self-calibration method

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105353644A (en) * 2015-09-29 2016-02-24 中国人民解放军63892部队 Radar target track derivative system on the basis of information mining of real-equipment data and method thereof
US20170116330A1 (en) * 2015-10-23 2017-04-27 International Business Machines Corporation Generating Important Values from a Variety of Server Log Files
CN106769032A (en) * 2016-11-28 2017-05-31 南京工业大学 A kind of Forecasting Methodology of pivoting support service life
US20170169360A1 (en) * 2013-04-02 2017-06-15 Patternex, Inc. Method and system for training a big data machine to defend
CN108399199A (en) * 2018-01-30 2018-08-14 武汉大学 A kind of collection of the application software running log based on Spark and service processing system and method
CN109032910A (en) * 2018-07-24 2018-12-18 北京百度网讯科技有限公司 Log collection method, device and storage medium
CN109033404A (en) * 2018-08-03 2018-12-18 北京百度网讯科技有限公司 Daily record data processing method, device and system

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7356550B1 (en) * 2001-06-25 2008-04-08 Taiwan Semiconductor Manufacturing Company Method for real time data replication
CN104036025A (en) * 2014-06-27 2014-09-10 蓝盾信息安全技术有限公司 Distribution-base mass log collection system
CN106570151A (en) * 2016-10-28 2017-04-19 上海斐讯数据通信技术有限公司 Data collection processing method and system for mass files
CN106845799B (en) * 2016-12-29 2023-12-19 中国电力科学研究院 Evaluation method for typical working condition of battery energy storage system
CN107092592B (en) * 2017-04-10 2020-06-05 浙江鸿程计算机系统有限公司 Site personalized semantic recognition method based on multi-situation data and cost-sensitive integrated model
CN109992569A (en) * 2019-02-19 2019-07-09 平安科技(深圳)有限公司 Cluster log feature extracting method, device and storage medium

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170169360A1 (en) * 2013-04-02 2017-06-15 Patternex, Inc. Method and system for training a big data machine to defend
CN105353644A (en) * 2015-09-29 2016-02-24 中国人民解放军63892部队 Radar target track derivative system on the basis of information mining of real-equipment data and method thereof
US20170116330A1 (en) * 2015-10-23 2017-04-27 International Business Machines Corporation Generating Important Values from a Variety of Server Log Files
CN106769032A (en) * 2016-11-28 2017-05-31 南京工业大学 A kind of Forecasting Methodology of pivoting support service life
CN108399199A (en) * 2018-01-30 2018-08-14 武汉大学 A kind of collection of the application software running log based on Spark and service processing system and method
CN109032910A (en) * 2018-07-24 2018-12-18 北京百度网讯科技有限公司 Log collection method, device and storage medium
CN109033404A (en) * 2018-08-03 2018-12-18 北京百度网讯科技有限公司 Daily record data processing method, device and system

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020168756A1 (en) * 2019-02-19 2020-08-27 平安科技(深圳)有限公司 Cluster log feature extraction method, and apparatus, device and storage medium
CN110737648A (en) * 2019-09-17 2020-01-31 平安科技(深圳)有限公司 Performance characteristic dimension reduction method and device, electronic equipment and storage medium
WO2021051578A1 (en) * 2019-09-17 2021-03-25 平安科技(深圳)有限公司 Method and device for performance feature dimensionality reduction, electronic device, and storage medium
CN111290916A (en) * 2020-02-18 2020-06-16 深圳前海微众银行股份有限公司 Big data monitoring method, device and equipment and computer readable storage medium
CN111984499A (en) * 2020-08-04 2020-11-24 中国建设银行股份有限公司 Fault detection method and device for big data cluster
CN112069036A (en) * 2020-11-10 2020-12-11 南京信易达计算技术有限公司 Management and monitoring system based on cluster computing
CN112069036B (en) * 2020-11-10 2021-09-03 南京信易达计算技术有限公司 Management and monitoring system based on cluster computing
CN113945684A (en) * 2021-10-14 2022-01-18 中国计量科学研究院 Big data-based micro air station self-calibration method

Also Published As

Publication number Publication date
WO2020168756A1 (en) 2020-08-27

Similar Documents

Publication Publication Date Title
CN109992569A (en) Cluster log feature extracting method, device and storage medium
CN107800591B (en) Unified log data analysis method
EP3031216A1 (en) Dynamic collection analysis and reporting of telemetry data
CN108415845A (en) AB tests computational methods, device and the server of system index confidence interval
CN103294592A (en) Leveraging user-to-tool interactions to automatically analyze defects in it services delivery
CN105930527A (en) Searching method and device
CN102541884B (en) Method and device for database optimization
CN108875091A (en) A kind of distributed network crawler system of unified management
CN108650684A (en) A kind of correlation rule determines method and device
CN104765689A (en) Method and device for conducting real-time supervision to interface performance data
CN107832291A (en) Client service method, electronic installation and the storage medium of man-machine collaboration
CN112463859B (en) User data processing method and server based on big data and business analysis
CN111800292B (en) Early warning method and device based on historical flow, computer equipment and storage medium
CN110519263A (en) Anti- brush amount method, apparatus, equipment and computer readable storage medium
CN111242430A (en) Power equipment supplier evaluation method and device
CN111970151A (en) Flow fault positioning method and system for virtual and container network
CN104199850A (en) Method and device for processing essential data
CN108664322A (en) Data processing method and system
CN113360313B (en) Behavior analysis method based on massive system logs
CN114598731B (en) Cluster log acquisition method, device, equipment and storage medium
CN112416800B (en) Intelligent contract testing method, device, equipment and storage medium
CN104426708A (en) Method and system for executing security detection service
CN107147542A (en) A kind of information generating method and device
KR101718599B1 (en) System for analyzing social media data and method for analyzing social media data using the same
CN112685376A (en) Massive log data analysis method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination