WO2023061209A1 - Method for predicting memory fault, electronic device, and computer-readable storage medium - Google Patents

Method for predicting memory fault, electronic device, and computer-readable storage medium

Info

Publication number
WO2023061209A1
Authority
WO
WIPO (PCT)
Prior art keywords
memory
data
log data
time period
feature
Application number
PCT/CN2022/121694
Inventor
易哲
黄景丰
郑紫阳
陈晓艳
Original Assignee
中兴通讯股份有限公司
Application filed by 中兴通讯股份有限公司
Priority to KR1020247014345A (published as KR20240065183A)
Publication of WO2023061209A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/08 Error detection or correction by redundancy in data representation, e.g. by using checking codes
    • G06F11/10 Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N20/10 Machine learning using kernel methods, e.g. support vector machines [SVM]
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • This application relates to, but is not limited to, the fields of big data analysis and artificial intelligence technology.
  • The present application provides a method for predicting memory faults, including: obtaining various log data of the memory to be tested, wherein the various log data include at least memory error information address data; performing feature engineering construction according to the various log data to obtain feature data tables respectively corresponding to the various log data; splicing the feature data tables respectively corresponding to the various log data to obtain a feature splicing data table; and obtaining a fault prediction result of the memory to be tested according to the feature splicing data table and a pre-trained fault prediction model, wherein the fault prediction model is trained on a pre-collected training data set, and the samples in the training data set include various log data of various types of memory.
  • The present application also provides an electronic device, including: at least one processor; and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can perform any memory fault prediction method described herein.
  • The present application also provides a computer-readable storage medium storing a computer program; when the computer program is executed by a processor, any method for predicting a memory fault described herein is implemented.
  • Fig. 1 is a flow chart of the method for predicting memory faults according to the present application.
  • Fig. 2 is an architecture diagram of the memory to be tested according to the present application.
  • Fig. 3 is a schematic diagram of obtaining various log data according to the present application.
  • Fig. 4 is a flow chart of data preprocessing according to the present application.
  • Fig. 5 is a flow chart of obtaining the feature data table corresponding to the memory error information address data according to the present application.
  • Fig. 6 is a flow chart of training the fault prediction model according to the present application.
  • Fig. 7 is a schematic diagram of labeling according to the binary classification and regression methods of the present application.
  • Fig. 8 is a structural diagram of an electronic device according to the present application.
  • The electronic device may be a server for fault prediction (prediction server for short). For example, a device under test whose memory faults need to be predicted collects its own data and sends the data to the prediction server, and the prediction server then predicts the memory faults of the device under test.
  • the device to be tested may be a server, a big data center cluster, a communication base station, a PC, and other devices containing memory devices.
  • the flowchart of the memory fault prediction method may be as shown in FIG. 1 , including steps 101 to 104 .
  • Step 101: Obtain various log data of the memory to be tested, wherein the various log data include at least memory error information address data.
  • Step 102: Perform feature engineering construction according to the various log data to obtain feature data tables respectively corresponding to the various log data.
  • Step 103: Concatenate the feature data tables corresponding to the various log data to obtain a feature splicing data table.
  • Step 104: Obtain the fault prediction result of the memory to be tested according to the feature splicing data table and the pre-trained fault prediction model, wherein the fault prediction model is trained on the pre-collected training data set, and the samples in the training data set include various log data of various types of memory.
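  • As a minimal sketch of how steps 101 to 104 fit together, the Python below splices per-log feature rows into one wide row and hands it to a model. The names `splice_feature_tables`, `predict_fault`, the `"error_address"` key, and the prefixing of column names are illustrative assumptions; the source does not specify data structures or a model interface.

```python
from typing import Callable, Dict, List

FeatureTable = Dict[str, float]  # one row of features, keyed by column name


def splice_feature_tables(tables: Dict[str, FeatureTable]) -> FeatureTable:
    """Step 103: concatenate per-log feature tables into one wide row.

    Column names are prefixed with the log type (an assumption) so that
    equally named columns from different logs cannot collide.
    """
    spliced: FeatureTable = {}
    for log_type in sorted(tables):
        for column, value in tables[log_type].items():
            spliced[f"{log_type}.{column}"] = value
    return spliced


def predict_fault(
    logs: Dict[str, List[dict]],
    builders: Dict[str, Callable[[List[dict]], FeatureTable]],
    model: Callable[[FeatureTable], float],
) -> float:
    """Steps 101-104 for one memory module at one time point."""
    # Step 101: memory error information address data is mandatory.
    if "error_address" not in logs:
        raise ValueError("memory error information address data is required")
    # Step 102: build one feature table per kind of log data.
    tables = {name: builders[name](rows) for name, rows in logs.items()}
    # Steps 103 and 104: splice the tables and apply the trained model.
    return model(splice_feature_tables(tables))
```

  • Here `model` stands in for any pre-trained predictor; passing it as a callable keeps the sketch independent of a particular machine learning library.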
  • Various log data of the memory to be tested are obtained. The various log data can measure the state of the memory to be tested from multiple aspects, thereby providing information for fault prediction.
  • The various log data include at least memory error information address data, which ensures that the necessary reference information is available for fault prediction.
  • As a result, the generalization ability and stability of the fault prediction model are stronger, and the prediction results based on the fault prediction model are more accurate; that is, the accuracy of memory fault prediction is improved.
  • The device under test (for example, a server) where the memory to be tested is located can obtain the various log data from the memory to be tested and send them to the prediction server, so that the prediction server obtains the various log data of the memory to be tested.
  • The architecture of the memory to be tested is shown in Fig. 2. The left side of Fig. 2 is a schematic diagram of the multiple memory chips (memory particles) in the memory to be tested, and the right side of Fig. 2 is a schematic diagram of one chip in the memory to be tested.
  • Physically, a chip is divided into 8 layers, and each layer is a bank.
  • For example, layer 1 (bank 1) in Fig. 2 is the first layer in the chip; a row is a row on bank 1, a column is a column on bank 1, and the intersection of a row and a column is a memory cell.
  • The acquired memory error information address data include: memory serial number, memory manufacturer, log reporting time, the dual in-line memory module (Dual-Inline-Memory-Modules, "DIMM") of the memory, memory rank number, memory chip number, memory bank information, the row number of the memory cell, and the column number of the memory cell.
  • The memory error information address data may also include the detailed physical location of the memory fault, parsed from the memory fault log.
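  • The fields listed above can be pictured as one record per reported error. The following is a sketch only: the class name `ErrorAddressRecord`, the field names, and the integer types are assumptions for illustration, not a format defined by the source.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Tuple


@dataclass(frozen=True)
class ErrorAddressRecord:
    """One row of memory error information address data (field names assumed)."""

    serial_number: str
    manufacturer: str
    reported_at: datetime
    dimm: int
    rank: int
    chip: int
    bank: int
    row: int
    column: int

    @property
    def cell(self) -> Tuple[int, int, int, int, int, int]:
        """Physical identity of the memory cell that reported the error."""
        return (self.dimm, self.rank, self.chip, self.bank, self.row, self.column)
```

  • Keeping the cell identity as a hashable tuple makes it straightforward to group errors by cell, row, or column when features are built later.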
  • In addition to the memory error information address data, the various log data obtained may include one or more of the following: memory log data, operating system kernel log, error detection and correction (EDAC) log, performance data, and environment and location information data.
  • The memory log data include: register group number field, transaction field, memory serial number, memory manufacturer, and log reporting time.
  • The memory log data are the dynamic random access memory (Dynamic Random Access Memory, DRAM) fault logs collected and reported by the mcelog tool. mcelog is a standard tool for logging hardware failures; that is, the memory log data can be mcelog data.
  • The operating system kernel log may record information related to memory faults obtained from the Linux kernel log.
  • The operating system kernel log includes: memory serial number, memory manufacturer, log reporting time, and various fields related to memory errors.
  • The error detection and correction (Error Detection and Correction, EDAC) log includes: memory serial number, log reporting time, memory controller (Memory Controller, MC) field, page field, and offset field.
  • EDAC and ECC both denote error detection and correction: ECC is error detection and correction implemented in the memory hardware, whereas EDAC is error detection and correction implemented in software in the Linux kernel.
  • The performance data record physical performance data of the server where the memory to be tested is located, including: the number of pages transferred from disk to memory per second (page in), the number of pages transferred from memory to disk per second (page out), minimum voltage, maximum voltage, configured voltage, and memory operating speed.
  • The environment and location information data record the environment and location of the server where the memory to be tested is located, including: temperature, humidity, and the site, room, and rack in which the server is located.
  • Memory log data, operating system kernel logs, EDAC logs, performance data, and environment and location information data all have an important influence on memory faults. Adding these data improves the generalization ability and stability of the fault prediction model, which helps improve the accuracy of fault prediction.
  • The server where the memory to be tested is located can simultaneously obtain 6 kinds of log data of the memory to be tested, as shown in Fig. 3: mcelog data, Linux kernel log data, EDAC log data, performance data, environment and location information data, and memory error information address data.
  • The prediction server can preprocess the obtained log data of the memory to be tested. The preprocessing can include: filling fields that were not obtained with the mean value or zero; discarding fields that are unrelated to memory faults; and, where several columns of fields are very strongly correlated, keeping only one of those columns and deleting the others.
  • Data verification can also be performed on the preprocessed log data, for example checking whether the time length of the data and the number of sampling points meet the minimum data volume required for fault prediction; that is, whether the data collection time covers the length of time required for feature engineering construction, and whether the collected log data include at least memory error information address data.
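  • The verification described above could be sketched as a single check. The function name `verify_logs`, the `"error_address"` key, the `"time"` field, and the thresholds are hypothetical; the source does not state concrete minimums.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def verify_logs(
    logs: Dict[str, List[dict]],
    now: datetime,
    required_span: timedelta,
    min_points: int,
) -> bool:
    """Return True if the collected data is sufficient for feature construction."""
    # The log data must at least contain memory error information address data.
    rows = logs.get("error_address")
    if not rows:
        return False
    # Enough sampling points must have been collected.
    if len(rows) < min_points:
        return False
    # The collection time must cover the span needed by feature engineering.
    earliest = min(r["time"] for r in rows)
    return now - earliest >= required_span
```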
  • Step 102, in which the various log data are processed and feature engineering is constructed to obtain the corresponding feature data tables, can be implemented through the following sub-steps. An exemplary process is shown in Fig. 4 and includes sub-steps 1021 to 1023.
  • Sub-step 1021: preprocess the various log data.
  • The preprocessing can include: filling empty values with "0"; analyzing the collected log data with the Pearson correlation coefficient and deleting columns that are strongly correlated with other fields, as well as columns whose data never change or are all "0"; and discarding fields unrelated to memory faults.
  • Sub-step 1022: data verification.
  • The data verification may include checking the time length and the number of sampling points of the preprocessed log data against the minimum data volume required for fault detection and prediction; that is, whether the data collection time covers the length of time required for feature engineering construction, and whether the log data include at least memory error information address data.
  • Sub-step 1023: perform feature engineering construction on each kind of verified log data.
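  • A minimal sketch of the preprocessing in sub-step 1021, written in plain Python. The column-oriented input, the function names, and the 0.95 correlation threshold are assumptions; the source only says that strongly correlated columns are reduced to one.

```python
from math import sqrt
from typing import Dict, List, Optional

Columns = Dict[str, List[Optional[float]]]


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient of two equally long, non-constant columns."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def preprocess(columns: Columns, threshold: float = 0.95) -> Dict[str, List[float]]:
    # 1. Fill empty values with 0.
    filled = {
        name: [0.0 if v is None else float(v) for v in values]
        for name, values in columns.items()
    }
    # 2. Drop columns that never change (this also covers all-zero columns).
    varying = {name: v for name, v in filled.items() if len(set(v)) > 1}
    # 3. Of each strongly correlated group, keep only the first column seen.
    kept: Dict[str, List[float]] = {}
    for name, values in varying.items():
        if all(abs(pearson(values, other)) < threshold for other in kept.values()):
            kept[name] = values
    return kept
```

  • Constant columns are removed before the correlation step, so the variance in `pearson` is never zero.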
  • The prediction server performs feature construction on the preprocessed memory error information address data to obtain the feature data table corresponding to the memory error information address data.
  • Obtaining the feature data table corresponding to the memory error information address data includes steps 501 to 506.
  • Step 501: Count the first quantity and the first number of times.
  • According to the memory error information address data, the prediction server counts the first quantity of first objects appearing in the first preset time period before the current time point and the first number of times ECC occurred on each first object. A first object is a memory cell of the memory to be tested on which ECC has occurred 2 or more times cumulatively during the past second preset time period; the second preset time period is longer than the first preset time period.
  • Step 502: Count the second quantity and the second number of times.
  • According to the memory error information address data, the prediction server counts the second quantity of second objects appearing in the third preset time period before the current time point and the second number of times ECC occurred on each second object. A second object is a column of memory cells of the memory to be tested in which ECC has occurred 2 or more times at the same moment during the past fourth preset time period; the fourth preset time period is longer than the third preset time period.
  • The same moment may be the same second or the same minute.
  • Step 503: Count the third quantity and the third number of times.
  • According to the memory error information address data, the prediction server counts the third quantity of third objects appearing in the fifth preset time period before the current time point and the third number of times ECC occurred on each third object. A third object is a row of memory cells of the memory to be tested in which ECC has occurred 2 or more times at the same moment during the past sixth preset time period; the sixth preset time period is longer than the fifth preset time period.
  • Step 504: Count the fourth quantity.
  • According to the memory error information address data, the prediction server counts the fourth quantity of fourth objects appearing in the seventh preset time period before the current time point. A fourth object is a memory cell block composed of at least 3 memory cells located in the same column of the memory to be tested, where these memory cells all had ECC at the same moment during the past eighth preset time period and neighboring cells of the block are separated by at most one memory cell; the eighth preset time period is longer than the seventh preset time period.
  • Step 505: Count the fifth quantity.
  • According to the memory error information address data, the prediction server counts the fifth quantity of fifth objects appearing in the ninth preset time period before the current time point. A fifth object is a memory cell block composed of at least 3 memory cells located in the same row of the memory to be tested, where these memory cells all had ECC at the same moment during the past tenth preset time period and neighboring cells of the block are separated by at most one memory cell; the tenth preset time period is longer than the ninth preset time period.
  • Step 506: Obtain the feature data table corresponding to the memory error information address data.
  • The feature data table corresponding to the memory error information address data is obtained from the first quantity, the first number of times, the second quantity, the second number of times, the third quantity, the third number of times, the fourth quantity, and the fifth quantity.
  • The first preset time period is a period of time before the current time point, and the first object can be recorded as ERROR CELL. The first quantity is the number of ERROR CELLs occurring in the first preset time period before the current time point, and the first number of times is the total number of ECC occurrences on each ERROR CELL within the first preset time period before the current time point. The second preset time period is the length of time over which memory error information address data must be collected in order to construct the first object.
  • The second preset time period can be 3 months.
  • The first preset time period can be a whole-minute-granularity period before the current time point, for example 8 minutes, 4 minutes, 2 minutes, and 1 minute before the current time point. That is, the total number of ECC occurrences on ERROR CELLs (the first number of times mentioned above) and the number of ERROR CELLs on which ECC occurred (the first quantity mentioned above) are counted separately for the 8 minutes, 4 minutes, 2 minutes, and 1 minute before the current time point.
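  • The ERROR CELL statistics can be sketched as follows. This is an illustration under assumptions: events are `(time, cell)` pairs, "3 months" is approximated as 90 days, and the function and feature names are hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta
from typing import Dict, Hashable, List, Tuple

Event = Tuple[datetime, Hashable]  # (log reporting time, cell address)


def error_cell_features(
    events: List[Event],
    now: datetime,
    history: timedelta = timedelta(days=90),  # second preset time period (assumed 90 days)
    windows_min: Tuple[int, ...] = (8, 4, 2, 1),  # first preset time periods
) -> Dict[str, int]:
    # ERROR CELLs: cells with 2 or more ECC occurrences in the long history period.
    hist = Counter(cell for t, cell in events if now - history <= t <= now)
    error_cells = {cell for cell, n in hist.items() if n >= 2}

    features: Dict[str, int] = {}
    for w in windows_min:
        start = now - timedelta(minutes=w)
        recent = [c for t, c in events if start <= t <= now and c in error_cells]
        # First quantity: distinct ERROR CELLs seen in the short window.
        features[f"error_cell_quantity_{w}min"] = len(set(recent))
        # First number of times: ECC occurrences on those cells in the window.
        features[f"error_cell_times_{w}min"] = len(recent)
    return features
```

  • Each window contributes two columns, so the four windows in the example yield eight ERROR CELL features per time point.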
  • The second object can be recorded as ERROR COLUMN. The second quantity is the number of ERROR COLUMNs occurring in the third preset time period before the current time point, and the second number of times is the total number of ECC occurrences on each ERROR COLUMN within the third preset time period before the current time point. The fourth preset time period is the length of time over which memory error information address data must be collected in order to construct the second object.
  • The fourth preset time period can be 3 months, and the third preset time period can be 8 minutes, 4 minutes, 2 minutes, and 1 minute before the current time point. That is, the total number of ECC occurrences on ERROR COLUMNs (the second number of times mentioned above) and the number of ERROR COLUMNs on which ECC occurred (the second quantity mentioned above) are counted separately for the 8 minutes, 4 minutes, 2 minutes, and 1 minute before the current time point.
  • The third object can be recorded as ERROR ROW. The third quantity is the number of ERROR ROWs occurring in the fifth preset time period before the current time point, and the third number of times is the total number of ECC occurrences on each ERROR ROW within the fifth preset time period before the current time point. The sixth preset time period is the length of time over which memory error information address data must be collected in order to construct the third object.
  • The sixth preset time period can be 3 months, and the fifth preset time period can be 8 minutes, 4 minutes, 2 minutes, and 1 minute before the current time point. That is, the total number of ECC occurrences on ERROR ROWs (the third number of times mentioned above) and the number of ERROR ROWs on which ECC occurred (the third quantity mentioned above) are counted separately for the 8 minutes, 4 minutes, 2 minutes, and 1 minute before the current time point.
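  • ERROR COLUMN and ERROR ROW differ only in which coordinate is shared, so one sketch can cover both. The assumptions here are: "same moment" means the same minute, a line is identified by `(bank, row or column)`, "3 months" is taken as 90 days, and all names are hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta
from typing import Dict, List, NamedTuple, Tuple


class Ecc(NamedTuple):
    time: datetime
    bank: int
    row: int
    column: int


def error_line_features(
    events: List[Ecc],
    now: datetime,
    axis: str,  # "column" for ERROR COLUMN, "row" for ERROR ROW
    history: timedelta = timedelta(days=90),
    windows_min: Tuple[int, ...] = (8, 4, 2, 1),
) -> Dict[str, int]:
    if axis not in ("row", "column"):
        raise ValueError("axis must be 'row' or 'column'")

    def line(e: Ecc) -> Tuple[int, int]:
        return (e.bank, getattr(e, axis))

    def moment(e: Ecc) -> datetime:
        # "Same moment" is taken to mean "same minute" in this sketch.
        return e.time.replace(second=0, microsecond=0)

    # A line qualifies if it saw 2 or more ECC occurrences in one minute
    # at any point in the history period.
    per_moment = Counter(
        (line(e), moment(e)) for e in events if now - history <= e.time <= now
    )
    error_lines = {ln for (ln, _), n in per_moment.items() if n >= 2}

    features: Dict[str, int] = {}
    for w in windows_min:
        start = now - timedelta(minutes=w)
        hits = [
            line(e) for e in events
            if start <= e.time <= now and line(e) in error_lines
        ]
        features[f"error_{axis}_quantity_{w}min"] = len(set(hits))
        features[f"error_{axis}_times_{w}min"] = len(hits)
    return features
```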
  • The fourth object can be recorded as ERROR COL BLOCK. The fourth quantity is the number of ERROR COL BLOCKs appearing in the seventh preset time period before the current time point, and the eighth preset time period is the length of time over which memory error information address data must be collected in order to construct the fourth object.
  • The eighth preset time period may be 3 months. For example, if at the same moment 3 of 5 consecutive memory cells in the same column have experienced ECC, namely the 1st, 3rd, and 5th of the 5 memory cells, then these 5 memory cells are an ERROR COL BLOCK. Alternatively, if at the same moment ECC has occurred in 3 consecutive memory cells in the same column, then these 3 memory cells are an ERROR COL BLOCK.
  • The seventh preset time period can be 8 minutes, 4 minutes, 2 minutes, and 1 minute before the current time point; that is, the number of ERROR COL BLOCKs (the fourth quantity mentioned above) is counted for the 8 minutes, 4 minutes, 2 minutes, and 1 minute before the current time point.
  • The same moment may be the same second or the same minute.
  • The fifth object can be recorded as ERROR ROW BLOCK. The fifth quantity is the number of ERROR ROW BLOCKs appearing in the ninth preset time period before the current time point, and the tenth preset time period is the length of time over which memory error information address data must be collected in order to construct the fifth object.
  • The tenth preset time period may be 3 months. For example, if at the same moment 3 of 5 consecutive memory cells in the same row have experienced ECC, namely the 1st, 3rd, and 5th of the 5 memory cells, then these 5 consecutive memory cells are an ERROR ROW BLOCK. Alternatively, if at the same moment ECC has occurred in 3 consecutive memory cells in the same row, then these 3 memory cells are an ERROR ROW BLOCK.
  • The ninth preset time period can be 8 minutes, 4 minutes, 2 minutes, and 1 minute before the current time point; that is, the number of ERROR ROW BLOCKs (the fifth quantity mentioned above) is counted for the 8 minutes, 4 minutes, 2 minutes, and 1 minute before the current time point.
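  • The block objects can be detected by grouping same-minute errors that share a column (or row) and looking for runs of at least 3 cells whose neighbors are at most one cell apart. The sketch below assumes "same moment" means the same minute and uses hypothetical names; the long collection period is omitted because every counting window lies inside it.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Ecc(NamedTuple):
    time: datetime
    bank: int
    row: int
    column: int


def count_blocks(positions: Iterable[int], min_cells: int = 3, max_gap: int = 1) -> int:
    """Count runs of at least min_cells positions with at most max_gap cells between neighbors."""
    blocks, run, prev = 0, 0, None
    for p in sorted(set(positions)):
        if prev is not None and p - prev <= max_gap + 1:
            run += 1
        else:
            blocks += int(run >= min_cells)
            run = 1
        prev = p
    return blocks + int(run >= min_cells)


def block_features(
    events: List[Ecc],
    now: datetime,
    shared: str = "column",  # "column" -> ERROR COL BLOCK, "row" -> ERROR ROW BLOCK
    windows_min: Tuple[int, ...] = (8, 4, 2, 1),
) -> Dict[str, int]:
    if shared not in ("row", "column"):
        raise ValueError("shared must be 'row' or 'column'")
    other = "row" if shared == "column" else "column"
    features: Dict[str, int] = {}
    for w in windows_min:
        start = now - timedelta(minutes=w)
        groups: Dict[tuple, List[int]] = defaultdict(list)
        for e in events:
            if start <= e.time <= now:
                minute = e.time.replace(second=0, microsecond=0)
                groups[(e.bank, getattr(e, shared), minute)].append(getattr(e, other))
        features[f"error_{shared}_block_quantity_{w}min"] = sum(
            count_blocks(v) for v in groups.values()
        )
    return features
```

  • With `max_gap=1`, errors at positions 1, 3, 5 and at positions 1, 2, 3 both form one block, matching the two examples in the text.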
  • The first, third, fifth, seventh, and ninth preset time periods may be the same or different; the second, fourth, sixth, eighth, and tenth preset time periods may also be the same or different.
  • Because memory faults tend to spread along the rows and columns of the memory, constructing the memory-defect concepts of the first object ERROR CELL, the second object ERROR COLUMN, the third object ERROR ROW, the fourth object ERROR COL BLOCK, and the fifth object ERROR ROW BLOCK, and counting the related quantities and numbers of times, helps measure memory faults accurately and improves the accuracy of memory fault prediction.
  • The prediction server performs feature construction on the various log data and obtains the feature data table corresponding to each kind of log data.
  • When, in addition to the memory error information address data, the various log data include one or more of memory log data, operating system kernel log, error detection and correction (EDAC) log, performance data, and environment and location information data, a feature data table is obtained for each type of log data, which helps improve the accuracy of fault prediction.
  • The following is an exemplary description of how the feature data tables corresponding to the various types of log data are obtained.
  • When the various log data also include memory log data, the transaction field within the eleventh preset time period before the current time point is summed to obtain the summation result of the transaction field, and the register group number field within the eleventh preset time period before the current time point is counted to obtain the count result of the register group number field. The feature data table corresponding to the memory log data is then obtained from the summation result of the transaction field and the count result of the register group number field.
  • The register group number field can be the mca_bank field, and the eleventh preset time period can be 8 minutes, 4 minutes, 2 minutes, and 1 minute. That is, the prediction server can count the mca_bank field and sum the transaction field within the 8 minutes, 4 minutes, 2 minutes, and 1 minute before the current time point and use the results as new feature columns, which are the feature columns of the feature data table corresponding to the memory log data.
  • For example, if the current time point is 10:27, the data within 1 minute before the current time point are the data collected during 10:26-10:27; the mca_bank field of these data is counted and the transaction field is summed. Likewise, the data within 2 minutes before the current time point are the data collected during 10:25-10:27, the data within 4 minutes are the data collected during 10:23-10:27, and the data within 8 minutes are the data collected during 10:19-10:27.
  • The register group number field is the register group number captured from the memory by the mcelog tool. Its data type is string, so it can be counted. The transaction field describes a group of information-transfer operations between the CPU and the memory, including read transactions and write transactions, and its data type is integer. Summing the transaction field reflects its data characteristics better, whereas merely counting it would lose the information it carries; the transaction field is therefore summed.
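  • The count-versus-sum treatment of the two mcelog fields can be sketched as below. The row layout (dicts with `"time"`, `"mca_bank"`, `"transaction"`) and the feature names are assumptions made for illustration.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Tuple


def mcelog_features(
    rows: List[dict],
    now: datetime,
    windows_min: Tuple[int, ...] = (8, 4, 2, 1),  # eleventh preset time periods
) -> Dict[str, int]:
    features: Dict[str, int] = {}
    for w in windows_min:
        start = now - timedelta(minutes=w)
        recent = [r for r in rows if start <= r["time"] <= now]
        # mca_bank is a string identifier, so it is counted.
        features[f"mca_bank_count_{w}min"] = sum(
            1 for r in recent if r.get("mca_bank") is not None
        )
        # transaction is an integer, so it is summed to keep its magnitude.
        features[f"transaction_sum_{w}min"] = sum(
            r.get("transaction", 0) for r in recent
        )
    return features
```

  • With a current time point of 10:27, the 1-minute window covers 10:26-10:27 and the 8-minute window covers 10:19-10:27, as in the example above.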
  • When the various log data also include operating system kernel logs, the various fields related to memory errors in the operating system kernel logs within the twelfth preset time period before the current time point are counted to obtain a count result for each type of field, and the feature data table corresponding to the operating system kernel log is obtained from these count results.
  • The fields related to memory errors within the twelfth preset time period can be counted at whole-minute granularity. For example, when the twelfth preset time period is 8 minutes, 4 minutes, 2 minutes, and 1 minute, the prediction server can count the fields related to memory errors within the 8 minutes, 4 minutes, 2 minutes, and 1 minute before the current time point and use the counts as new feature columns, which are the feature columns of the feature data table corresponding to the operating system kernel log.
  • When the various log data also include EDAC logs, the MC field, page field, and offset field within the thirteenth preset time period before the current time point are counted, and the feature data table corresponding to the EDAC log is obtained from the count results of the MC field, page field, and offset field.
  • The MC field indicates the number of the memory controller, page indicates the page of the virtual memory, and offset indicates the offset; the physical address of the memory can be calculated from page and offset.
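  • One common way to combine a page number and an offset into an address is shown below. It is a sketch under assumptions the source does not state: that page is a page frame number, that offset is a byte offset inside that page, and that pages are 4 KiB.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages; the source does not give a page size


def physical_address(page: int, offset: int, page_size: int = PAGE_SIZE) -> int:
    """Combine a page number and an in-page offset into one address."""
    if not 0 <= offset < page_size:
        raise ValueError("offset must lie inside the page")
    return page * page_size + offset
```

  • For example, `physical_address(0x12345, 0x678)` evaluates to `0x12345678` under these assumptions.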
  • The fields (MC, page, offset) within the thirteenth preset time period can be counted at whole-minute granularity. For example, the fields (MC, page, offset) within the 8 minutes, 4 minutes, 2 minutes, and 1 minute before the current time point are counted and used as new feature columns, which are the feature columns of the feature data table corresponding to the EDAC log.
  • When the various log data also include performance data, the various performance data affecting memory faults within the fourteenth preset time period before the current time point are averaged, and the feature data table corresponding to the performance data is obtained from the average value of each type of performance data.
  • The fields (page in, page out, minimum voltage, maximum voltage, configured voltage, memory operating speed) within the fourteenth preset time period can be processed at whole-minute granularity. For example, when the fourteenth preset time period is 8 min, 4 min, 2 min, and 1 min, the prediction server can average the performance-data fields within the 8 min, 4 min, 2 min, and 1 min before the current time point and use the averages as new feature columns.
  • That is, page in, page out, minimum voltage, maximum voltage, configured voltage, and memory operating speed are each averaged over the fourteenth preset time period before the current time point, and the feature data table corresponding to the performance data is obtained from these average values.
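  • A sketch of the windowed averaging for performance data follows. The field names in `PERF_FIELDS`, the sample layout, and the choice to emit 0.0 for an empty window are assumptions; the same pattern would apply to averaging temperature and humidity.

```python
from datetime import datetime, timedelta
from statistics import mean
from typing import Dict, List, Tuple

# Hypothetical column names for the six performance fields named in the text.
PERF_FIELDS = (
    "page_in", "page_out", "min_voltage",
    "max_voltage", "configured_voltage", "speed",
)


def performance_features(
    samples: List[dict],
    now: datetime,
    windows_min: Tuple[int, ...] = (8, 4, 2, 1),  # fourteenth preset time periods
    fields: Tuple[str, ...] = PERF_FIELDS,
) -> Dict[str, float]:
    features: Dict[str, float] = {}
    for w in windows_min:
        start = now - timedelta(minutes=w)
        recent = [s for s in samples if start <= s["time"] <= now]
        for field in fields:
            values = [s[field] for s in recent if s.get(field) is not None]
            # An empty window yields 0.0 (an assumption, in line with zero filling).
            features[f"{field}_mean_{w}min"] = mean(values) if values else 0.0
    return features
```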
  • the various log data also include environment and location information data
  • the average temperature and humidity in the fifteenth preset time period before the current time point are calculated according to the environment and location information data
  • the average temperature and the average humidity are obtained
  • the feature data table corresponding to the environment and location information data is obtained according to the average temperature and the average humidity.
  • the fields (such as temperature and humidity) within the fifteenth preset time period can be counted according to the granularity of the whole minute.
  • fields related to environmental data, such as temperature and humidity, are averaged over the 8 min, 4 min, 2 min, and 1 min before the current time point and used as new feature columns.
  • these new feature columns are the feature columns of the feature data table corresponding to the environment and location information data.
  • the memory under test is then associated with the location information (site, room, rack, and so on) of the server in which it is installed.
  • the above eleventh, twelfth, thirteenth, fourteenth, and fifteenth preset time periods may be the same or different, which is not specifically limited in this embodiment.
  • the prediction server can obtain various log data of the memory to be tested at the granularity of the whole minute
  • the feature engineering construction in step 102 can include: when the log data for the most recent m minutes before the current time point has been obtained, merging it with the log data for the n minutes preceding those m minutes, and using the merged log data as the log data required for fault prediction at the current time point; where n+m minutes is the preset minimum duration required for fault prediction.
  • minute-granularity feature engineering is then performed on the merged log data to obtain the feature data table corresponding to each type of log data.
  • the value of m can be, for example, 1, 10, or 0.1. Obtaining log data at minute granularity makes minute-level prediction possible, which supports real-time monitoring of memory status and improves the timeliness of memory fault prediction.
  • in one example, m is 1 and n is 7.
  • suppose the log data for the most recent 1 minute before the current time point has been obtained, but 8 minutes of log data are needed before fault prediction can be performed
  • that is, 8 minutes is the preset minimum duration required for fault prediction
  • then, on reaching the 9th minute, the log data obtained in the 9th minute can be merged with the log data obtained in the preceding 7 minutes.
  • the merged log data is used as the log data required for fault prediction at the 9th minute.
  • in step 103, after feature construction has been performed on each type of log data of the memory under test and the corresponding feature data tables have been obtained, the feature data table corresponding to the memory error information address data is spliced, by memory serial number and at whole-minute granularity, with any one or more of the following: the feature data table corresponding to the memory log data, the feature data table corresponding to the operating system kernel log, the feature data table corresponding to the performance data, and the feature data table corresponding to the environment and location information data; the result is the feature splicing data table.
  • in step 104, the servers hosting the various memories collect the various log data of those memories in advance; the prediction server performs feature engineering construction on that log data to obtain the feature data table corresponding to each type of log data
  • it then splices those feature data tables into a feature splicing data table and uses the feature splicing data table as the samples of the training data set for model training, obtaining the pre-trained fault prediction model.
  • the fault prediction of the memory to be tested can be performed according to the fault prediction model.
  • the fault prediction model is trained on the pre-collected training data set with the LightGBM machine learning method, where the loss function of the LightGBM machine learning method is Focal Loss; the Focal Loss formula appears in the original filing as formula image PCTCN2022121694-appb-000001 and is not reproduced in this text. In that formula:
  • α is the balance factor
  • γ is the modulation parameter
  • y′ is the sample prediction value
  • y is the sample label
  • L_fl is the error between the sample prediction value and the sample label value.
  • because normal memory samples far outnumber faulty ones, the Focal Loss loss function is used to address the imbalance between positive and negative samples in the training data set of the fault prediction model, which helps preserve the accuracy of memory fault prediction.
  • the samples in the training data set are marked with label values, and the label values are determined in a binary manner. For example, if the sample at a certain time point is a memory-failure sample, then counting back T minutes from that time point, the samples within those T minutes are labeled "1" and the samples earlier than T minutes are labeled "0"; the value of T can be 30 to 100, for example 90. In other words, although the memory does eventually fail, data from more than 90 minutes before the failure has not yet reached the risk threshold and is treated as normal data.
  • for a fault prediction model trained on samples labeled in this binary manner, an output of "1" means that the memory under test may fail within the next T minutes, and an output of "0" means that the memory under test is normal.
  • the samples in the training data set are marked with label values, and the label values are determined by regression: the time interval between the time point of each sample and the time point of the sample at which the memory failure occurred is calculated, and the label value of each sample is then computed from that time interval with a formula that maps the time interval into the interval [0, 1]. The formula appears in the original filing as formula image PCTCN2022121694-appb-000002 and is not reproduced in this text. In that formula:
  • label is the calculated label value
  • X is the time interval
  • a is the first preset coefficient
  • T is the preset fault influence duration.
  • labeling by regression avoids the information loss caused by the all-or-nothing cutoff of binary labeling.
  • the label value reflects how far a sample is from the fault, which is closer to the real situation.
  • the samples in the training data set use a modified sigmoid function to map the time interval into the interval [0, 1] and compute the label value of each sample; because the label encodes the distance to the failure time point, the same function can be used to predict the time point at which the memory under test will fail.
  • the first preset coefficient a is an empirical value, and the value range is [1, 10], for example, it can be 8; the value range of X/T is [0, 1].
  • once the output of the fault prediction model has been obtained, the predicted time point at which the memory under test will fail can be derived from that output and a time formula for predicting the fault occurrence time; the time formula appears in the original filing as formula image PCTCN2022121694-appb-000003 and is not reproduced in this text. In that formula:
  • t is the time point at which the memory under test fails
  • b is the second preset coefficient
  • output is the output result
  • T is the preset fault influence duration. The method can therefore predict not only whether a failure will occur but also when it will occur, which makes the prediction more precise.
  • the output of the fault prediction model lies in [0, 1]; the second preset coefficient is an empirical value in the range [1, 20], for example 11.25.
  • the fault prediction model in step 104 can be obtained in the manner shown in FIG. 6 , including steps 601 to 603 .
  • Step 601 Label the samples in the training data set to obtain the label value of each sample.
  • Step 602 perform model training to obtain a fault prediction model according to each sample marked with a label value.
  • Step 603 evaluating the fault prediction model.
  • the formation method of the training data set can be:
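  • S1: collect memory-related log data from different data centers, manufacturers, and models (such as the 6 types of log data mentioned above);
  • S2: perform feature engineering construction on the log data to obtain the feature data table corresponding to each type of log data;
  • S3: splice the feature data tables corresponding to the various log data to form the training data set.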
  • feature engineering can be constructed according to the log data of the memory error information address data obtained in S1.
  • the manner of performing feature engineering construction according to the memory error information address data can refer to the manner in FIG. 5 , and will not be repeated here to avoid repetition.
  • step 601 can use the binary classification method (that is, the above-mentioned binary method) to mark the samples in the training data set.
  • the label-time relationship for binary classification is shown by the dashed line (Classification) in FIG. 7; the abscissa is the time before the failure of the memory under test (Time before failure), and the ordinate is the label value (Label) assigned by the binary classification method.
  • if the preset fault influence duration T is 90 minutes, the prediction server labels the training samples from the 90 minutes before the fault as "1" and the samples from more than 90 minutes before the fault as "0".
  • step 601 can use a regression method to label the samples in the training data set.
  • the label-time relationship for the regression method is shown by the solid line (Regression) in FIG. 7; the abscissa is the time before the failure of the memory under test, and the ordinate is the label value assigned by the regression method.
  • if the preset fault influence duration T is 90 minutes, the prediction server maps the time interval into the interval [0, 1] with a modified sigmoid function (formula image PCTCN2022121694-appb-000004 in the original filing, not reproduced in this text), in which:
  • label is the calculated label value
  • X is the time interval.
  • for normal memory data, X is set to +∞, so label is 0.
  • LightGBM may be used to train and model each sample marked with a label value, and the loss function may use a mean square error (Mean Square Error, MSE) loss function or a Focal Loss loss function.
  • step 602 may use random forest to train and model each sample marked with a label value.
  • step 603 30% of sample data is randomly selected from the training data set as a verification set, and the remaining sample data is used as a training set.
  • the evaluation index of the model may use the F1-Score index, that is, use the F1-Score index to evaluate on the verification set.
  • the terms used by the F1-Score indicator are defined as follows, where precision is the precision rate and recall is the recall rate:
  • n_pp: the number of memories predicted, within the evaluation window, to fail within the next T period
  • n_tp: the number of faulty memories in the evaluation window that were detected T period in advance
  • n_tr: the number of all memory faults within the evaluation window.
  • grid search can be used for model-related parameters to make the model score F1 reach the highest point.
  • obtaining the fault prediction result of the memory under test includes: obtaining, from the feature splicing data table and the pre-trained fault prediction model, the confidence with which the memory under test is classified as "1"; and determining the health degree of the memory under test from that confidence, where the health degree is 1 − confidence, and the lower the health degree, the more likely the memory under test is to fail.
  • that is, if binary classification is used to label the samples in the training data set, the output of the fault prediction model obtained by the method in FIG. 6 is the confidence (in the range 0-100%) with which the machine learning algorithm classifies the memory under test as "1"; the health degree of the memory under test is determined from this confidence, and the health degree indicator is finally reported to the monitoring center.
  • obtaining the fault prediction result of the memory under test includes: obtaining, from the feature splicing data table and the pre-trained fault prediction model, a predicted value for the memory under test, where the predicted value is a floating-point number between 0 and 1; and determining the health degree of the memory under test from the predicted value, where the health degree is 1 − predicted value, and the lower the health degree, the more likely the memory under test is to fail. That is, if the regression method is used to label the samples in the training data set, the prediction server maps the time interval into the interval [0, 1] with a modified sigmoid function (formula image PCTCN2022121694-appb-000008 in the original filing, not reproduced in this text), in which:
  • label is the calculated label value
  • X is the time interval.
  • for normal memory data, X is set to +∞, so label is 0.
  • the output of the fault prediction model obtained by the method in FIG. 6 is then a floating-point number in [0, 1]. In general, an output greater than 0.5 means that a memory fault will occur within the next T period; otherwise, no memory fault is expected within the next T period. The output is converted to a health degree of 0-100%: the output is the predicted value (in the range 0-1) produced by the machine learning algorithm for the memory under test, and the health degree is 1 − predicted value. The lower the health degree, the higher the probability that the memory will fail within T, and the health degree indicator is finally reported to the monitoring center.
  • the prediction result of the fault prediction model obtained in the manner shown in FIG. 6 may be: the time point when a memory fault will occur.
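  • for example, when the fault prediction model is built with the MSE loss function, the time point at which a memory fault will occur can be obtained from a formula that appears in the original filing as formula image PCTCN2022121694-appb-000009 and is not reproduced in this text.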
  • the prediction server can report the calculated time point of memory failure to the monitoring center, so that the monitoring center can take corresponding actions.
  • the division of the above methods into steps is only for clarity of description. In implementation, steps can be combined into one or split into several; as long as the same logical relationship is preserved, they fall within the scope of protection of this patent. Adding insignificant modifications to an algorithm or process, or introducing insignificant designs, without changing the core design of the algorithm or process also falls within the scope of protection of this patent.
  • another embodiment of the present invention relates to an electronic device, as shown in FIG. 8, including at least one processor 801 and a memory 802 communicatively connected to the at least one processor 801; the memory 802 stores instructions executable by the at least one processor 801, and the instructions are executed by the at least one processor 801 so that the at least one processor 801 can execute the method for predicting a memory fault in any one of the above embodiments.
  • the memory 802 and the processor 801 are connected by a bus, and the bus may include any number of interconnected buses and bridges, and the bus connects one or more processors 801 and various circuits of the memory 802 together.
  • the bus may also connect together various other circuits such as peripherals, voltage regulators, and power management circuits, all of which are well known in the art and therefore will not be further described herein.
  • the bus interface provides an interface between the bus and the transceivers.
  • a transceiver may be a single element, or multiple elements, such as multiple receivers and transmitters, providing means for communicating with various other devices over a transmission medium.
  • the data processed by the processor 801 is transmitted on the wireless medium through the antenna, and further, the antenna also receives the data and transmits the data to the processor 801 .
  • the processor 801 is responsible for managing the bus and general processing, and may also provide various functions including timing, peripheral interface, voltage regulation, power management and other control functions. And the memory 802 may be used to store data used by the processor 801 when performing operations.
  • another embodiment of the present invention relates to a computer-readable storage medium storing a computer program.
  • when the computer program is executed by a processor, the above method embodiments are implemented.
  • that is, those skilled in the art will understand that all or some of the steps of the methods in the above embodiments can be carried out by a program instructing the relevant hardware; the program is stored in a storage medium and includes several instructions that cause a device (which may be a single-chip microcomputer, a chip, etc.) or a processor to execute all or some of the steps of the methods described in the embodiments of the present application.
  • the aforementioned storage media include media that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disc.


Abstract

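The present application relates to a method for predicting memory faults, an electronic device, and a computer-readable storage medium. The method for predicting memory faults includes: obtaining multiple types of log data of a memory under test, where the multiple types of log data include at least memory error information address data; performing feature engineering construction on the multiple types of log data to obtain a feature data table corresponding to each type of log data; splicing the feature data tables corresponding to the multiple types of log data to obtain a feature splicing data table; and obtaining a fault prediction result for the memory under test from the feature splicing data table and a pre-trained fault prediction model, where the fault prediction model is trained on a pre-collected training data set whose samples include multiple types of log data of multiple memories.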

Description

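Method for predicting memory faults, electronic device, and computer-readable storage medium
Cross-reference to related applications
This application claims priority to Chinese patent application No. 202111189254.9, filed with the China Patent Office on October 12, 2021, the entire contents of which are incorporated herein by reference.
Technical field
This application relates to, but is not limited to, the technical fields of big data analysis and artificial intelligence.
Background
In big data application fields such as servers, unexpected memory faults frequently cause computers to go down, and in severe cases even interrupt server services. Real-time fault prediction and health assessment of memory is therefore of practical importance for maintaining service reliability and stability.
At present, industry perceives memory fault risk through simple cumulative counting of the logs reported by memory: when the number of error checking and correcting (ECC) events reaches a set threshold, an alarm is reported to the network management system. In practice, however, the number of ECC events is not strongly correlated with memory faults, so predicting memory faults from ECC counts alone has very low accuracy.
Summary
The present application provides a method for predicting memory faults, including: obtaining multiple types of log data of a memory under test, where the multiple types of log data include at least memory error information address data; performing feature engineering construction on the multiple types of log data to obtain a feature data table corresponding to each type of log data; splicing the feature data tables corresponding to the multiple types of log data to obtain a feature splicing data table; and obtaining a fault prediction result for the memory under test from the feature splicing data table and a pre-trained fault prediction model, where the fault prediction model is trained on a pre-collected training data set whose samples include multiple types of log data of multiple memories.
The present application further provides an electronic device, including: at least one processor; and a memory communicatively connected to the at least one processor; where the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can execute any of the methods for predicting memory faults described herein.
The present application further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements any of the methods for predicting memory faults described herein.
Brief description of the drawings
FIG. 1 is a flowchart of a method for predicting memory faults according to the present application;
FIG. 2 is an architecture diagram of the memory under test according to the present application;
FIG. 3 is a schematic diagram of obtaining multiple types of log data according to the present application;
FIG. 4 is a flowchart of data preprocessing according to the present application;
FIG. 5 is a flowchart of obtaining the feature data table corresponding to the memory error information address data according to the present application;
FIG. 6 is a flowchart of training the fault prediction model according to the present application;
FIG. 7 is a schematic diagram of labeling with the binary classification and regression methods according to the present application;
FIG. 8 is a structural diagram of the electronic device according to the present application.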
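Detailed description
To make the purposes, technical solutions, and advantages of the embodiments of the present application clearer, the embodiments are described in detail below with reference to the drawings. Those of ordinary skill in the art will understand that many technical details are given in the embodiments to help the reader understand the application; the claimed technical solutions can nevertheless be implemented without these details and with various changes and modifications based on the following embodiments. The division into embodiments below is for convenience of description and does not limit the specific implementation of the application; the embodiments can be combined with and refer to one another where they do not conflict.
One embodiment of the present application relates to a method for predicting memory faults, applied to an electronic device. The electronic device can be a server used for fault prediction (the prediction server for short): for example, a device under test that needs memory fault prediction collects its own data and sends it to the prediction server, which then predicts memory faults of the device under test. The device under test can be any device containing memory components, such as a server, a big data center cluster, a communication base station, or a PC. In this embodiment, the flowchart of the method can be as shown in FIG. 1, including steps 101 to 104.
Step 101: obtain multiple types of log data of the memory under test, where the multiple types of log data include at least memory error information address data.
Step 102: perform feature engineering construction on the multiple types of log data to obtain the feature data table corresponding to each type of log data.
Step 103: splice the feature data tables corresponding to the multiple types of log data to obtain the feature splicing data table.
Step 104: obtain the fault prediction result of the memory under test from the feature splicing data table and the pre-trained fault prediction model, where the fault prediction model is trained on a pre-collected training data set whose samples include multiple types of log data of multiple memories.
In this embodiment, multiple types of log data of the memory under test are obtained, including at least memory error information address data. Multiple types of log data measure the state of the memory under test from several angles and so give fault prediction a more comprehensive reference, improving its accuracy; including at least the memory error information address data ensures that the essential reference information is available. Feature engineering construction is performed on each type of log data, and the resulting feature data tables are spliced into a feature splicing data table suitable as input to the fault prediction model, so that the fault prediction result of the memory under test can be obtained from the feature splicing data table and the pre-trained fault prediction model. Because the samples in the model's training data set include multiple types of log data of multiple memories, the model generalizes better and is more stable, and its predictions are correspondingly more accurate.
The implementation details of the method are described below by way of example. They are provided only to aid understanding and are not required to implement the solution.
In step 101, the device under test (for example, a server) that hosts the memory under test can obtain the multiple types of log data from the memory under test and send them to the prediction server, so that the prediction server obtains the log data of the memory under test. The log data include at least memory error information address data, ensuring that the essential reference information for memory fault prediction is available.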
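The architecture of the memory under test is shown in FIG. 2. The left side of FIG. 2 shows the multiple memory chips in the memory under test, and the right side shows one chip. A chip is physically divided into 8 layers, each layer being one bank. For example, layer 1 (bank 1) in FIG. 2 is the first layer of the chip; a row is one row on bank 1, a column is one column on bank 1, and the intersection of a row and a column is one memory cell.
The memory error information address data include: memory serial number, memory manufacturer, log report time, the dual in-line memory module (DIMM) of the memory, memory rank number, memory chip number, memory bank information, the row number of the memory cell, and the column number of the memory cell. The memory error information address data can also include the detailed physical location of the memory fault as parsed from the memory fault log.
In one embodiment, in addition to the memory error information address data, the log data include one or any combination of the following: memory log data, operating system kernel logs, error detection and correction (EDAC) logs, performance data, and environment and location information data.
The memory log data include: the register bank number field, the transaction field, memory serial number, memory manufacturer, and log report time.
In one example, the memory log data are the dynamic random access memory (DRAM) fault logs collected with the mcelog tool. mcelog is the standard tool on Linux for recording DRAM faults, based on Intel's Machine Check Architecture (MCA); that is, the memory log data can be mcelog data.
In one example, the operating system kernel log records the memory-fault-related information obtained from the Linux kernel log. It includes: memory serial number, memory manufacturer, log report time, and the various fields related to memory errors.
In one example, the error detection and correction (EDAC) log includes: memory serial number, log report time, the memory controller (MC) field, the page field, and the offset field. EDAC and ECC both perform error detection and correction: ECC does so in the memory hardware, while EDAC does so in software inside the Linux kernel.
In one example, the performance data record physical performance data of the server hosting the memory under test, including: the number of times per second data is transferred from disk to memory (page in), the number of times per second data is transferred from memory to disk (page out), minimum voltage, maximum voltage, configured voltage, and memory operating speed.
In one example, the environment and location information data record the environment and location of the server hosting the memory under test, including: temperature, humidity, and the site, room, and rack in which that server is located.
In this embodiment, research has found that memory log data, operating system kernel logs, EDAC logs, performance data, and environment and location information data all have an important influence on memory faults; adding these data improves the generalization ability and stability of the fault prediction model and helps raise prediction accuracy.
In one embodiment, when implementing step 101, the server hosting the memory under test can obtain 6 types of log data from the memory under test at the same time, as shown in FIG. 3: mcelog data, Linux kernel log data, EDAC log data, performance data, environment and location information data, and memory error information address data.
In step 102, before feature engineering construction is performed on the log data to obtain the corresponding feature data tables, the prediction server can preprocess the log data obtained from the memory under test. Preprocessing can include: filling fields that were not obtained with the mean value or with 0; discarding fields unrelated to memory faults; and, where several columns are strongly correlated, keeping only one of them and deleting the others.
In one example, the preprocessed log data can also be validated, for example by checking whether the time span of the data and the number of sampling points meet the minimum data requirements for fault prediction, that is, whether the data collection time covers the duration needed for feature engineering construction, and whether the collected log data include at least the memory error information address data.
In one embodiment, the processing and feature engineering construction performed on the log data in step 102 to obtain the corresponding feature data tables can be implemented through the following sub-steps; an exemplary flow is shown in FIG. 4 and includes sub-steps 1021 to 1023.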
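Sub-step 1021: preprocess the multiple types of log data.
Preprocessing can include: filling null values with 0; analyzing the collected log data with the Pearson correlation coefficient and deleting columns that are strongly correlated with other columns as well as columns whose data never change or are all 0; and discarding fields unrelated to memory faults.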
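The preprocessing described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the column-dictionary layout, the function names `pearson` and `preprocess`, and the 0.95 correlation threshold are hypothetical and do not come from the filing, which does not state a threshold.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def preprocess(columns, threshold=0.95):
    """columns: dict of column name -> list of values, with None for missing entries."""
    # 1. Fill missing values with 0.
    filled = {k: [0 if v is None else v for v in vals] for k, vals in columns.items()}
    # 2. Drop columns that never change (this also covers all-zero columns).
    varying = {k: v for k, v in filled.items() if len(set(v)) > 1}
    # 3. Within each group of strongly correlated columns, keep only the first one seen.
    kept = {}
    for name, vals in varying.items():
        if all(abs(pearson(vals, other)) < threshold for other in kept.values()):
            kept[name] = vals
    return kept
```

Which column of a correlated group survives depends on iteration order here; a production pipeline would likely choose based on domain knowledge instead.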
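Sub-step 1022: data validation.
Data validation can include checking the time span and the number of sampling points of the preprocessed log data, that is, whether they meet the minimum data requirements for fault detection and prediction: whether the data collection time covers the duration needed for feature engineering construction, and whether the log data include at least the memory error information address data.
Sub-step 1023: perform feature engineering construction on each type of validated log data.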
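In one embodiment, the prediction server performs feature construction on the preprocessed memory error information address data to obtain the corresponding feature data table. Referring to FIG. 5, this includes steps 501 to 506.
Step 501: count the first quantity and the first count. From the memory error information address data, the prediction server counts the first quantity of first targets that appear within the first preset time period before the current time point, and the first count of ECC events on each first target. A first target is a memory cell of the memory under test on which 2 or more ECC events have accumulated within the past second preset time period; the second preset time period is longer than the first preset time period.
Step 502: count the second quantity and the second count. From the memory error information address data, the prediction server counts the second quantity of second targets that appear within the third preset time period before the current time point, and the second count of ECC events on each second target. A second target is a column of memory cells of the memory under test on which 2 or more ECC events have occurred at the same moment within the past fourth preset time period; the fourth preset time period is longer than the third preset time period. "The same moment" can mean the same second or the same minute.
Step 503: count the third quantity and the third count. From the memory error information address data, the prediction server counts the third quantity of third targets that appear within the fifth preset time period before the current time point, and the third count of ECC events on each third target. A third target is a row of memory cells of the memory under test on which 2 or more ECC events have occurred at the same moment within the past sixth preset time period; the sixth preset time period is longer than the fifth preset time period.
Step 504: count the fourth quantity. From the memory error information address data, the prediction server counts the fourth quantity of fourth targets that appear within the seventh preset time period before the current time point. A fourth target is a memory cell block of the memory under test consisting of at least 3 memory cells in the same column, where those cells all reported ECC at the same moment within the past eighth preset time period and neighboring cells among them are separated by at most 1 memory cell; the eighth preset time period is longer than the seventh preset time period.
Step 505: count the fifth quantity. From the memory error information address data, the prediction server counts the fifth quantity of fifth targets that appear within the ninth preset time period before the current time point. A fifth target is a memory cell block of the memory under test consisting of at least 3 memory cells in the same row, where those cells all reported ECC at the same moment within the past tenth preset time period and neighboring cells among them are separated by at most 1 memory cell; the tenth preset time period is longer than the ninth preset time period.
Step 506: obtain the feature data table corresponding to the memory error information address data.
The feature data table corresponding to the memory error information address data is obtained from one or any combination of the following: the first quantity, the first count, the second quantity, the second count, the third quantity, the third count, the fourth quantity, and the fifth quantity.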
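In step 501, the first preset time period is a period before the current time point. The first target can be denoted ERROR CELL; the first quantity is the number of ERROR CELLs that appear within the first preset time period before the current time point; the first count is the total number of ECC events on each ERROR CELL within that period; and the second preset time period is the length of time over which memory error information address data must be collected to construct the first target.
In one example, the second preset time period can be 3 months. To make splicing and alignment of the data easier, the first preset time period can be a whole-minute window before the current time point, for example 8 min, 4 min, 2 min, and 1 min; that is, for each of the 8 min, 4 min, 2 min, and 1 min before the current time point, the total number of ECC events on ERROR CELLs (the first count above) and the number of ERROR CELLs with ECC events (the first quantity above) are counted.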
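As a rough sketch of the ERROR CELL features, the function below counts, for each short window, how many ERROR CELLs reported ECC and how many ECC events they produced in total. The event tuple layout `(minute, row, col)`, the function name, and the use of 90 days as "3 months" are assumptions made for illustration; the filing does not prescribe a data layout.

```python
from collections import Counter

def error_cell_features(events, now, history=90 * 24 * 60, windows=(8, 4, 2, 1)):
    """events: list of (minute, row, col) ECC records; now: current minute.

    Returns {window: (number of ERROR CELLs seen, ECC events on those cells)}.
    """
    # An ERROR CELL is a cell with 2 or more ECC events in the long history window.
    hist = Counter((r, c) for t, r, c in events if now - history <= t < now)
    error_cells = {cell for cell, n in hist.items() if n >= 2}

    features = {}
    for w in windows:
        # ECC events in the last w minutes that landed on an ERROR CELL.
        recent = [(r, c) for t, r, c in events
                  if now - w <= t < now and (r, c) in error_cells]
        features[w] = (len(set(recent)), len(recent))
    return features
```

The same shape works for ERROR COLUMN and ERROR ROW by keying on the column or row index together with the timestamp instead of on the cell.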
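In step 502, the second target can be denoted ERROR COLUMN; the second quantity is the number of ERROR COLUMNs that appear within the third preset time period before the current time point; the second count is the total number of ECC events on each ERROR COLUMN within that period; and the fourth preset time period is the length of time over which memory error information address data must be collected to construct the second target.
In one example, the fourth preset time period can be 3 months and the third preset time period can take the values 8 min, 4 min, 2 min, and 1 min before the current time point; that is, for each of those windows, the total number of ECC events on ERROR COLUMNs (the second count above) and the number of ERROR COLUMNs with ECC events (the second quantity above) are counted.
In step 503, the third target can be denoted ERROR ROW; the third quantity is the number of ERROR ROWs that appear within the fifth preset time period before the current time point; the third count is the total number of ECC events on each ERROR ROW within that period; and the sixth preset time period is the length of time over which memory error information address data must be collected to construct the third target.
In one example, the sixth preset time period can be 3 months and the fifth preset time period can take the values 8 min, 4 min, 2 min, and 1 min before the current time point; that is, for each of those windows, the total number of ECC events on ERROR ROWs (the third count above) and the number of ERROR ROWs with ECC events (the third quantity above) are counted.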
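In step 504, the fourth target can be denoted ERROR COL BLOCK; the fourth quantity is the number of ERROR COL BLOCKs that appear within the seventh preset time period before the current time point; and the eighth preset time period is the length of time over which memory error information address data must be collected to construct the fourth target.
In one example, the eighth preset time period can be 3 months. For instance, if at the same moment 3 of 5 consecutive memory cells in the same column report ECC, namely the 1st, 3rd, and 5th of the 5 cells, then these 5 memory cells form one ERROR COL BLOCK. Likewise, if at the same moment 3 consecutive memory cells in the same column all report ECC, these 3 memory cells form one ERROR COL BLOCK. The seventh preset time period can take the values 8 min, 4 min, 2 min, and 1 min before the current time point; that is, the number of ERROR COL BLOCKs (the fourth quantity above) is counted for each of those windows. "The same moment" can mean the same second or the same minute.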
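The block rule (at least 3 erroring cells in one column, neighbors separated by at most 1 cell) can be checked with a single pass over sorted row indices. The sketch below assumes the input is the list of row indices that reported ECC in one column at one moment; the function name and parameters are hypothetical.

```python
def count_blocks(rows, max_gap=1, min_cells=3):
    """Count ERROR COL BLOCKs among the erroring row indices of ONE column at ONE moment."""
    rows = sorted(set(rows))
    blocks, run = 0, 1
    for prev, cur in zip(rows, rows[1:]):
        if cur - prev <= max_gap + 1:
            # At most max_gap healthy cells lie between the two erroring cells.
            run += 1
        else:
            blocks += run >= min_cells  # close the current run
            run = 1
    if rows:
        blocks += run >= min_cells  # close the final run
    return blocks
```

With the filing's two examples, cells 1, 3, 5 and cells 7, 8, 9 each form one block. Transposing rows and columns gives the ERROR ROW BLOCK count.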
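In step 505, the fifth target can be denoted ERROR ROW BLOCK; the fifth quantity is the number of ERROR ROW BLOCKs that appear within the ninth preset time period before the current time point; and the tenth preset time period is the length of time over which memory error information address data must be collected to construct the fifth target.
In one example, the tenth preset time period can be 3 months. For instance, if at the same moment 3 of 5 consecutive memory cells in the same row report ECC, namely the 1st, 3rd, and 5th of the 5 cells, then these 5 consecutive memory cells form one ERROR ROW BLOCK. Likewise, if at the same moment 3 consecutive memory cells in the same row all report ECC, these 3 memory cells form one ERROR ROW BLOCK. The ninth preset time period can take the values 8 min, 4 min, 2 min, and 1 min before the current time point; that is, the number of ERROR ROW BLOCKs (the fifth quantity above) is counted for each of those windows.
Note that in the above steps the first, third, fifth, seventh, and ninth preset time periods can be the same or different, and the second, fourth, sixth, eighth, and tenth preset time periods can be the same or different.
In this embodiment, because memory faults tend to spread along the rows and columns of the memory, constructing memory-defect concepts such as the first target ERROR CELL, the second target ERROR COLUMN, the third target ERROR ROW, the fourth target ERROR COL BLOCK, and the fifth target ERROR ROW BLOCK, and counting the related quantities and counts, helps measure memory faults accurately and raises the accuracy of memory fault prediction.
In one embodiment, the prediction server performs feature construction on the multiple types of log data to obtain the corresponding feature data tables. In addition to the memory error information address data, the log data include one or any combination of the following: memory log data, operating system kernel logs, EDAC logs, performance data, and environment and location information data. Fine-grained processing of these data yields a feature data table for each type of log data and helps raise prediction accuracy. Obtaining these feature data tables is described below by way of example.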
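In one example, where the log data also include memory log data, the transaction field within the eleventh preset time period before the current time point is summed and the register bank number field within the same period is counted; the feature data table corresponding to the memory log data is then obtained from the transaction sum and the register bank number count.
In one example, the register bank number field can be the mca_bank field and the eleventh preset time period can take the values 8 min, 4 min, 2 min, and 1 min; that is, the prediction server can count mca_bank and sum the transaction field over the 8 min, 4 min, 2 min, and 1 min before the current time point and use the results as new feature columns, which are the feature columns of the feature data table corresponding to the memory log data.
For example, if the current time is 10:27, the data within the 1 min before the current time point are the data collected from 10:26 to 10:27; the mca_bank field collected in that period is counted and the transaction field is summed. The data within the preceding 2 min are those collected from 10:25 to 10:27, within 4 min those from 10:23 to 10:27, and within 8 min those from 10:19 to 10:27.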
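The 10:27 example can be reproduced with a small windowing function. The record layout `(minute, mca_bank, transaction)`, minutes counted from midnight, and the generated feature names are assumptions for illustration only.

```python
def mcelog_features(records, now, windows=(8, 4, 2, 1)):
    """records: list of (minute, mca_bank, transaction); now: current minute.

    For each window, count the mca_bank entries and sum the transaction field.
    """
    feats = {}
    for w in windows:
        in_window = [r for r in records if now - w <= r[0] < now]
        feats[f"mca_bank_count_{w}min"] = len(in_window)
        feats[f"transaction_sum_{w}min"] = sum(r[2] for r in in_window)
    return feats
```

With `now = 627` (10:27), the 1-minute window covers minute 626 only and the 8-minute window covers minutes 619 through 626, matching the intervals listed above.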
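In an exemplary implementation, the register bank number field is the register bank number that the mcelog tool captures from the memory; its data type is string, so it can be counted. The transaction field describes a group of operations exchanged between the CPU and the memory, including read and write transactions; its data type is integer, and summing it reflects the data better than counting, which would discard the information the field carries. The transaction field is therefore summed.
In one example, where the log data also include the operating system kernel log, the fields of the kernel log that relate to memory errors are identified and counted over the twelfth preset time period before the current time point; the feature data table corresponding to the operating system kernel log is obtained from the resulting counts.
The fields related to memory errors within the twelfth preset time period can be aggregated at whole-minute granularity. For example, with the twelfth preset time period taking the values 8 min, 4 min, 2 min, and 1 min, the prediction server can count the memory-error-related fields over the 8 min, 4 min, 2 min, and 1 min before the current time point and use the counts as new feature columns, which are the feature columns of the feature data table corresponding to the operating system kernel log.
In an exemplary implementation, the operating system kernel log has 24 types of fields, 8 of which relate to memory errors; only those 8 types need to be counted over the twelfth preset time period before the current time point, and the feature data table corresponding to the operating system kernel log is obtained from their counts.
In one example, where the log data also include the EDAC log, the MC field, page field, and offset field within the thirteenth preset time period before the current time point are counted, and the feature data table corresponding to the EDAC log is obtained from the three counts. The MC field is the number of the memory controller, page is the virtual memory page, and offset is the offset; the physical address of the memory can be computed from page + offset.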
其中,可以按照整分钟粒度对第十三预设时间段内的字段(MC、page、offset)进行统计,比如,第十三预设时间段取8min、4min、2min、1min,即预测服务器可以对当前时间点前8min、4min、2min、1min内的字段(MC、page、offset)进行计数作为新的特征列。该新的特征列即为EDAC日志对应的特征数据表中的特征列。
在一个例子中,在多种日志数据还包括性能数据的情况下,根据性能数据,对当前时间点之前的第十四预设时间段内影响内存故障的各类性能数据进行平均计算,得到各类性能数据的平均值,根据各类性能数据的平均值,得到性能数据对应的特征数据表。
其中,可以按照整分钟粒度对第十四预设时间段内的字段(Page_in、page_out、最低电压、最高电压、配置电压、内存运行速度)进行统计,比如,第十四预设时间段取8min、4min、2min、1min,即预测服务器可以对当前时间点前8min、4min、2min、1min内和性能数据相关的字段进行平均计算作为新的特征列。
在一个示例性实现中,对当前时间点之前的第十四预设时间段内的page in、page out、最低电压、最高电压、配置电压以及内存运行速度进行平均计算,分别得到page in、page out、最低电压、最高电压、配置电压以及内存运行速度的平均值,然后根据这些平均值,得到性能数据对应的特征数据表。
在一个例子中,在多种日志数据还包括环境及位置信息数据的情况下,根据环境及位置信息数据,对当前时间点之前的第十五预设时间段内的温度和湿度进行平均计算,得到温度的平均值和所述湿度的平均值,根据温度的平均值和所述湿度的平均值,得到环境及位置信息数据对应的特征数据表。
其中,可以按照整分钟粒度对第十五预设时间段内的字段(比如温度、湿度)进行统计,比如第十五预设时间段取8min、4min、2min、1min,即预测服务器可以将对前时间点前8min、4min、2min、1min内温度、湿度等和环境数据相关的字段进行平均计算作为新的 特征列,该新的特征列即为环境及位置信息数据对应的特征数据表中的特征列。然后该待测内存再对应上所在服务器的site、room、rack等位置信息。
需要说明的是,在具体实现中,上述第十一、十二、十三、十四、十五预设时间段可以相同也可以不同,本实施方式对此不作具体限定。
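In one embodiment, in step 101 the prediction server can obtain the log data of the memory under test at whole-minute granularity, and the feature engineering construction in step 102 can include: when the log data for the most recent m minutes before the current time point have been obtained, merging them with the log data for the n minutes preceding those m minutes, and using the merged log data as the log data required for fault prediction at the current time point, where n+m minutes is the preset minimum duration required for fault prediction. Minute-granularity feature engineering is then performed on these log data to obtain the corresponding feature data tables. The value of m can be, for example, 1, 10, or 0.1. Obtaining log data at minute granularity makes minute-level prediction possible, supports real-time monitoring of memory status, and improves the timeliness of memory fault prediction.
In one example, m is 1 and n can be 7. Suppose the log data for the most recent 1 minute have been obtained, but fault prediction needs 8 minutes of log data, that is, 8 minutes is the preset minimum duration required for fault prediction. On reaching the 9th minute, the log data of the 9th minute can be merged with the log data obtained during the preceding 7 minutes, and the merged data are used as the log data required for fault prediction at the 9th minute. For example, if the current time is 10:27 and the log data for the 1 minute before 10:27 (10:26 to 10:27) have been obtained, but fault prediction needs the 8 minutes of log data from 10:19 to 10:27, then the data for the 1 minute from 10:26 to 10:27 are merged with the data for the 7 minutes from 10:19 to 10:26, and the merged 8 minutes of data are used as the log data required for fault prediction at 10:27.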
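One simple way to realize this rolling merge is a fixed-length buffer that holds one chunk of log records per minute. The class below is a sketch under the assumption that m = 1 (one new chunk per minute); the class name and its interface are hypothetical.

```python
from collections import deque

class MinuteBuffer:
    """Keeps the most recent n + m minutes of log records (one chunk per minute)."""

    def __init__(self, min_minutes=8):
        self.min_minutes = min_minutes
        # deque(maxlen=...) silently drops the oldest chunk once it is full.
        self.chunks = deque(maxlen=min_minutes)

    def push(self, minute_logs):
        """Add the newest minute; return the merged window, or None if too short."""
        self.chunks.append(minute_logs)
        if len(self.chunks) < self.min_minutes:
            return None
        return [rec for chunk in self.chunks for rec in chunk]
```

Until 8 minutes have accumulated, `push` returns `None` and no prediction is attempted; from then on every new minute yields a merged 8-minute window.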
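In step 103, after feature construction has been performed on each type of log data of the memory under test and the corresponding feature data tables have been obtained, the feature data table corresponding to the memory error information address data is spliced, by memory serial number and at whole-minute granularity, with any one or more of the following: the feature data table corresponding to the memory log data, the feature data table corresponding to the operating system kernel log, the feature data table corresponding to the performance data, and the feature data table corresponding to the environment and location information data. The result is the feature splicing data table.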
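Splicing by memory serial number and minute amounts to a keyed join. The sketch below treats the address-data table as the base and left-joins the other tables onto it; representing each table as a dictionary keyed by `(serial, minute)` is an assumption for illustration, and features missing from an optional table are simply left absent.

```python
def splice(base, *others):
    """Each table: {(serial, minute): {feature_name: value}}. Left-join others onto base."""
    out = {}
    for key, row in base.items():
        merged = dict(row)  # copy so the input tables are not modified
        for table in others:
            merged.update(table.get(key, {}))
        out[key] = merged
    return out
```

For example, `splice(address_table, mcelog_table, perf_table)` would produce one row per serial number and minute containing all available feature columns.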
在步骤104中,多种内存所在的服务器预先会采集多种内存的多种日志数据,预测服务器根据多种日志数据进行特征工程构造,得到多种日志数据分别对应的特征数据表,并拼接多种日志数据分别对应的特征数据表得到特征拼接数据表,以特征拼接数据表作为训练数据集中的样本进行模型训练,得到预训练的故障预测模型。训练得到故障预测模型后,则可以根据该故障预测模型对待测内存进行故障预测。
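In one embodiment, the fault prediction model is trained on the pre-collected training data set with the LightGBM machine learning method, where the loss function of the LightGBM machine learning method is Focal Loss. The Focal Loss formula is:
Figure PCTCN2022121694-appb-000001
where α is the balance factor, γ is the modulation parameter, y′ is the sample prediction value, y is the sample label, and L_fl is the error between the sample prediction value and the sample label value.
In an exemplary implementation, memory fault prediction faces an imbalance between positive and negative samples: normal memory data far outnumber faulty memory data. The Focal Loss function is designed for exactly this kind of imbalance, so using it when building the training model with LightGBM addresses the imbalance in the training data set and helps preserve the accuracy of memory fault prediction.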
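The formula image itself is not reproduced in this text. The sketch below uses the widely used binary form of focal loss, whose symbols (α, γ, y′, y) match the definitions above; that the filing uses exactly this form is an assumption, and the default values α = 0.25 and γ = 2 are common choices from the focal loss literature rather than values taken from the filing.

```python
import math

def focal_loss(y_true, y_pred, alpha=0.25, gamma=2.0, eps=1e-12):
    """Binary focal loss for one sample (assumed standard form).

    y_true: label, 0 or 1; y_pred: predicted probability of class 1.
    """
    p = min(max(y_pred, eps), 1 - eps)  # clamp to keep log() finite
    if y_true == 1:
        # Well-classified positives (p near 1) are down-weighted by (1 - p)^gamma.
        return -alpha * (1 - p) ** gamma * math.log(p)
    # Well-classified negatives (p near 0) are down-weighted by p^gamma.
    return -(1 - alpha) * p ** gamma * math.log(1 - p)
```

With γ = 0 and α = 0.5 this reduces to half the ordinary cross-entropy. Note that LightGBM takes custom objectives as gradient and Hessian values, so plugging this loss into training would additionally require its first and second derivatives, which are not shown here.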
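In one embodiment, the samples in the training data set are all marked with label values, and the label values are determined in a binary manner. For example, if the sample at a certain time point is a memory-fault sample, then counting back T minutes from that time point, the samples within those T minutes are labeled "1" and the samples earlier than T minutes are labeled "0"; T can be 30 to 100, for example 90. In other words, although the memory does eventually fail, data from more than 90 minutes before the fault has not yet reached the risk threshold and is treated as normal data. For a fault prediction model trained on samples labeled in this binary manner, an output of "1" means that the memory under test may fail within the next T minutes, and an output of "0" means that the memory under test is normal.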
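The binary labeling rule is short enough to write down directly. In this sketch, sample times and the failure time are minutes on a common clock; treating a sample exactly T minutes before the fault as positive is an assumption, since the filing does not say which side the boundary falls on.

```python
def binary_labels(sample_minutes, failure_minute, T=90):
    """Label 1 for samples within T minutes before the failure, 0 otherwise."""
    return [1 if 0 <= failure_minute - t <= T else 0 for t in sample_minutes]
```

For a memory that fails at minute 200, samples taken at minutes 190 and 200 are labeled 1, while samples at minutes 0, 10, and 100 are labeled 0.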
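In one embodiment, the samples in the training data set are all marked with label values, and the label values are determined by regression. For example: the time interval between the time point of each sample and the time point of the sample at which the memory fault occurred is calculated, and the label value of each sample is computed from that time interval with a formula that maps the time interval into the interval [0, 1]. The formula is:
Figure PCTCN2022121694-appb-000002
where label is the calculated label value, X is the time interval, a is the first preset coefficient, and T is the preset fault influence duration. Labeling by regression avoids the information loss caused by the all-or-nothing cutoff of binary labeling; the label value reflects how far a sample is from the fault, which is closer to the real situation.
In an exemplary implementation, the samples in the training data set use a modified sigmoid function to map the time interval into [0, 1] and compute the label value of each sample. Because the label encodes how far a sample is from the failure time point, the sigmoid function can also be used to predict the time point at which the memory under test will fail. The first preset coefficient a is an empirical value in the range [1, 10], for example 8; X/T lies in [0, 1].
In one embodiment, after the output of the fault prediction model has been obtained from the feature splicing data table and the pre-trained fault prediction model, the predicted time point at which the memory under test will fail can be derived from that output and a time formula for predicting the fault occurrence time.
The time formula is:
Figure PCTCN2022121694-appb-000003
where t is the time point at which the memory under test fails, b is the second preset coefficient, output is the output result, and T is the preset fault influence duration. The method can therefore predict not only whether a fault will occur but also when, which makes the prediction more precise.
The output of the fault prediction model lies in [0, 1]; the second preset coefficient is an empirical value in the range [1, 20], for example 11.25.
In one example, a*b = T; for instance, with a = 8 and b = 11.25, a*b = 8 × 11.25 = 90, so the preset fault influence duration T = 90.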
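Because the two formula images are not reproduced in this text, the exact functions are unknown. The sketch below shows one possible pair that is consistent with the stated properties: the label lies in [0, 1], equals 0 when X is +∞, and the relation a*b = T links the labeling formula to the time formula. The specific form label = 2 / (1 + e^(aX/T)) and its inverse are assumptions chosen for illustration, not the filing's formulas.

```python
import math

def regression_label(x, a=8.0, T=90.0):
    """Assumed modified sigmoid: 1 at the failure itself, falling toward 0 as x grows."""
    z = a * x / T
    if z > 700:  # math.exp would overflow; the label is 0 to double precision anyway
        return 0.0
    return 2.0 / (1.0 + math.exp(z))

def minutes_to_failure(output, a=8.0, T=90.0):
    """Inverse of regression_label, using b = T / a so that a * b = T."""
    if output <= 0.0:
        return math.inf  # a label of 0 corresponds to a healthy memory
    output = min(output, 1.0)
    b = T / a  # 11.25 for a = 8, T = 90
    return b * math.log(2.0 / output - 1.0)
```

Under this assumed form, a model output of 1.0 maps to a failure right now, and smaller outputs map to failures further in the future.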
在一个实施方式中,步骤104中故障预测模型可以通过如图6所示的方式得,包括步骤601至603。
步骤601,对训练数据集中的样本进行标注,得到各样本的标签值。
步骤602,根据标注有标签值的各样本,进行模型训练得到故障预测模型。
步骤603,对故障预测模型进行评估。
其中,训练数据集的形成方式可以为:
S1:采集不同数据中心、厂商、型号的内存相关日志数据(比如上述的6种日志数据);
S2:对日志数据进行特征工程构造,得到多种日志数据分别对应的特征数据表;
S3:将多种日志数据分别对应的特征数据表拼接,形成训练数据集。
在S2中,多种日志数据包括内存错误信息地址数据的情况下,为了获取内存错误信息地址数据对应的特征数据表,使其作为故障预测模型的训练数据集样本,供预测服务器建立故障预测模型,可以根据S1中得到的内存错误信息地址数据这一种日志数据进行特征工程构造。其中,根据内存错误信息地址数据进行特征工程构造的方式,可以参考图5中的方式,为避免重复,此处不再赘述。
在S2中,多种日志数据还包括内存日志数据、操作系统内核日志、性能数据、环境及位置信息数据的情况下,为了获取内存日志数据对应的特征数据表、操作系统内核日志对应的特征数据表、性能数据对应的特征数据表、环境及位置信息数据对应的特征数据表,使其作为故障预测模型的训练数据集样本,供预测服务器建立故障预测模型,因此,可以根据S1中得到的内存日志数据,操作系统内核日志、错误检测与纠正EDAC日志、性能数据、环境及位置信息数据等日志数据进行特征工程构造。其中,特征工程构造的方式在上文已经 描述过,为避免重复,此处不再赘述。
在一个例子中,步骤601可以采用二分类法(即上述的二值方式)对训练数据集中的样本进行标注,示例性地,二分类的标签-时间关系图如图7中的虚线分类(Classification)所示,横坐标为待测内存发生故障前的一段时间(Time before failure),纵坐标为利用二分类方法标注的标签值(Label)。其中,若预设的故障影响时长T为90分钟,则对于二分类方法,预测服务器会将训练数据集中的样本中发生故障的前90分钟内的样本的标签值都标注为“1”,90分钟之前的样本的标签值标注为“0”。
在另一个例子中,步骤601可以采用回归方式对训练数据集中的样本进行标注,示例性地,回归模型的标签-时间关系图如图7中的实线回归(Regression)所示,横坐标为待测内存发生故障前的一段时间,纵坐标为利用回归方法标注的标签值。其中,若预设的故障影响时长T为90分钟,则对于回归方法,预测服务器会通过如下变型的sigmoid函数将时间间隔映射到[0,1]区间内:
Figure PCTCN2022121694-appb-000004
其中,label为计算的标签值,X为时间间隔。对于正常的内存数据X定为+∞,label则为0。
在一个例子中,步骤602可以采用LightGBM对标注有标签值的各样本进行训练建模,损失函数可以使用均方误差(Mean Square Error,MSE)损失函数或Focal Loss损失函数。
在另一个例子中,步骤602可以采用随机森林对标注有标签值的各样本进行训练建模。
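In step 603, 30% of the sample data in the training data set are randomly selected as the validation set, and the remaining sample data are used as the training set.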
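A random 70/30 split can be sketched with the standard library alone. The function name, the fixed seed, and rounding the validation size to the nearest integer are assumptions for illustration.

```python
import random

def split(samples, val_fraction=0.3, seed=0):
    """Return (training set, validation set) from a random shuffle of samples."""
    rng = random.Random(seed)  # local generator, so global random state is untouched
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_val = round(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]
```

Since samples from the same memory are strongly correlated over time, a split by memory serial number may give a more honest evaluation than a per-sample split; the filing itself only specifies random selection.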
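In one example, the model can be evaluated with the F1-Score metric on the validation set. The terms and formulas of the F1-Score metric are as follows, where precision is the precision rate and recall is the recall rate:
n_pp: the number of memories predicted, within the evaluation window, to fail within the next T period;
n_tp: the number of faulty memories in the evaluation window that were detected T period in advance;
n_tr: the number of all memory faults within the evaluation window.
precision = n_tp / n_pp
recall = n_tp / n_tr
F1 = 2 × precision × recall / (precision + recall)
Grid search can be applied to the model parameters to maximize the F1 score.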
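The metric follows directly from the three counts defined above. In the sketch below, returning 0.0 when a denominator is zero is a convention chosen here to avoid division errors, not something the filing specifies.

```python
def f1_score(n_pp, n_tp, n_tr):
    """F1 from predicted-faulty count, correctly-predicted count, and actual fault count."""
    precision = n_tp / n_pp if n_pp else 0.0
    recall = n_tp / n_tr if n_tr else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For instance, if 10 memories are predicted to fail, 8 of them really do, and 16 memories fail in total, precision is 0.8 and recall is 0.5.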
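In one embodiment, obtaining the fault prediction result of the memory under test from the feature splicing data table and the pre-trained fault prediction model includes: obtaining the confidence with which the memory under test is classified as 1; and determining the health degree of the memory under test from that confidence, where the health degree is 1 − confidence, and the lower the health degree, the more likely the memory under test is to fail. That is, if binary classification is used to label the samples in the training data set, the output of the fault prediction model obtained as shown in FIG. 6 is the confidence (in the range 0-100%) with which the machine learning algorithm classifies the memory under test as "1"; the health degree of the memory under test is determined from this confidence, and the health degree indicator is finally reported to the monitoring center.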
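In one embodiment, obtaining the fault prediction result of the memory under test according to the feature concatenation data table and the pre-trained fault prediction model includes: obtaining, according to the feature concatenation data table and the pre-trained fault prediction model, a predicted value for the memory under test, where the predicted value is a floating-point number between 0 and 1; and determining the health degree of the memory under test according to the predicted value, where the health degree is 1 minus the predicted value, and the lower the health degree, the more likely the memory under test is to fail. In other words, if the regression approach is used to label the samples in the training data set, the prediction server maps the time interval into the interval [0,1] using the following variant of the sigmoid function: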
Figure PCTCN2022121694-appb-000008
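Here, label is the computed label value and X is the time interval. For normal memory data, X is set to +∞, so label is 0. The output of the fault prediction model obtained in the manner of FIG. 6 is a floating-point number in [0,1]. In general, an output greater than 0.5 indicates that a memory fault will occur within the next period T; otherwise, no memory fault will occur within the next period T. The output is converted into a health degree of 0-100%. The output is the predicted value (in the range 0-1) produced by the machine learning algorithm for the memory under test, and the health degree of the memory under test is 1 minus the predicted value. The lower the health degree, the higher the probability that the memory fails within the period T. Finally, the health degree metric is reported to the monitoring center.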
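The conversion from model output to health degree, together with the 0.5 decision threshold, can be sketched as follows. The function name and the range check are assumptions; the same conversion applies whether the score is a classification confidence or a regression output in [0,1].

```python
from typing import Tuple


def health_from_score(score: float, threshold: float = 0.5) -> Tuple[float, bool]:
    """Return (health degree, fault expected within T) for a model output.

    health degree = 1 - score; a score above the threshold is read as
    "a memory fault is expected within the next period T".
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    health = 1.0 - score
    fault_expected = score > threshold
    return health, fault_expected


health, fault_expected = health_from_score(0.8)
health_pct = round(health * 100.0, 1)  # health degree expressed as a percentage
```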
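In one embodiment, if the regression approach is used to label the samples in the training data set, the prediction result of the fault prediction model obtained in the manner of FIG. 6 may be the time point at which a memory fault will occur. For example, when the fault prediction model is built with the MSE loss function, the time point at which a memory fault will occur can be obtained from the following formula: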
Figure PCTCN2022121694-appb-000009
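Finally, the prediction server may report the computed time point of the memory fault to the monitoring center so that the monitoring center can take appropriate action.
It should be noted that the above examples in this embodiment are given only to aid understanding and do not limit the technical solution of the present invention.
The division of the above methods into steps is only for clarity of description. In implementation, steps may be merged into one step, or a step may be split into multiple steps; as long as the same logical relationship is included, all such variants fall within the scope of protection of this patent. Adding insignificant modifications to, or introducing insignificant designs into, the algorithm or the flow without changing the core design of the algorithm and the flow also falls within the scope of protection of this patent.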
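Another embodiment of the present invention relates to an electronic device which, as shown in FIG. 8, includes at least one processor 801 and a memory 802 communicatively connected to the at least one processor 801. The memory 802 stores instructions executable by the at least one processor 801, and the instructions are executed by the at least one processor 801 so that the at least one processor 801 can perform the memory fault prediction method of any of the above embodiments.
The memory 802 and the processor 801 are connected by a bus. The bus may include any number of interconnected buses and bridges, and connects the various circuits of the one or more processors 801 and the memory 802 together. The bus may also connect various other circuits, such as peripheral devices, voltage regulators, and power management circuits, all of which are well known in the art and are therefore not described further here. A bus interface provides an interface between the bus and a transceiver. The transceiver may be a single element or multiple elements, such as multiple receivers and transmitters, and provides a unit for communicating with various other apparatuses over a transmission medium. Data processed by the processor 801 is transmitted over a wireless medium via an antenna; the antenna also receives data and passes it to the processor 801.
The processor 801 is responsible for managing the bus and general processing, and may also provide various functions, including timing, peripheral interfaces, voltage regulation, power management, and other control functions. The memory 802 may be used to store data used by the processor 801 when performing operations.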
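Another embodiment of the present invention relates to a computer-readable storage medium storing a computer program. When executed by a processor, the computer program implements the above method embodiments.
That is, those skilled in the art will understand that all or some of the steps of the methods in the above embodiments may be completed by a program instructing the relevant hardware. The program is stored in a storage medium and includes several instructions for causing a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to perform all or some of the steps of the methods described in the embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
Those of ordinary skill in the art will understand that the above embodiments are exemplary embodiments for implementing the present invention, and that in practical applications various changes may be made to them in form and detail without departing from the spirit and scope of the present invention.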

Claims (12)

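  1. A memory fault prediction method, comprising:
    acquiring multiple kinds of log data of a memory under test, wherein the multiple kinds of log data include at least memory error information address data;
    performing feature engineering according to the multiple kinds of log data to obtain feature data tables respectively corresponding to the multiple kinds of log data;
    concatenating the feature data tables respectively corresponding to the multiple kinds of log data to obtain a feature concatenation data table;
    obtaining a fault prediction result of the memory under test according to the feature concatenation data table and a pre-trained fault prediction model, wherein the fault prediction model is trained on a pre-collected training data set, and the samples in the training data set include multiple kinds of log data of multiple kinds of memory.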
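  2. The memory fault prediction method according to claim 1, wherein performing feature engineering according to the multiple kinds of log data to obtain the feature data tables respectively corresponding to the multiple kinds of log data comprises:
    counting, according to the memory error information address data, a first quantity of first targets appearing within a first preset time period before the current time point and a first count of error checking and correction (ECC) occurrences on each first target; wherein a first target is a memory cell, among the memory cells of the memory under test, on which ECC has cumulatively occurred 2 or more times within a past second preset time period, and the second preset time period is longer than the first preset time period;
    counting, according to the memory error information address data, a second quantity of second targets appearing within a third preset time period before the current time point and a second count of ECC occurrences on each second target; wherein a second target is a column, among the columns of memory cells of the memory under test, on which ECC has cumulatively occurred 2 or more times at the same moment within a past fourth preset time period, and the fourth preset time period is longer than the third preset time period;
    counting, according to the memory error information address data, a third quantity of third targets appearing within a fifth preset time period before the current time point and a third count of ECC occurrences on each third target; wherein a third target is a row, among the rows of memory cells of the memory under test, on which ECC has cumulatively occurred 2 or more times at the same moment within a past sixth preset time period, and the sixth preset time period is longer than the fifth preset time period;
    counting, according to the memory error information address data, a fourth quantity of fourth targets appearing within a seventh preset time period before the current time point; wherein a fourth target is a memory cell block in the memory under test composed of at least 3 memory cells located in the same column, the at least 3 memory cells located in the same column all exhibit ECC at the same moment within a past eighth preset time period and are separated from one another by at most 1 memory cell, and the eighth preset time period is longer than the seventh preset time period;
    counting, according to the memory error information address data, a fifth quantity of fifth targets appearing within a ninth preset time period before the current time point; wherein a fifth target is a memory cell block in the memory under test composed of at least 3 memory cells located in the same row, the at least 3 memory cells located in the same row all exhibit ECC at the same moment within a past tenth preset time period and are separated from one another by at most 1 memory cell, and the tenth preset time period is longer than the ninth preset time period;
    obtaining the feature data table corresponding to the memory error information address data according to one or any combination of the following:
    the first quantity, the first count, the second quantity, the second count, the third quantity, the third count, the fourth quantity, and the fifth quantity.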
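  3. The memory fault prediction method according to claim 1 or 2, wherein the multiple kinds of log data further include one or any combination of the following:
    memory log data, operating system kernel logs, error detection and correction (EDAC) logs, performance data, and environment and location information data.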
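  4. The memory fault prediction method according to claim 3, wherein,
    in the case where the multiple kinds of log data further include the memory log data, performing feature engineering according to the multiple kinds of log data to obtain the feature data tables respectively corresponding to the multiple kinds of log data comprises:
    summing, according to the memory log data, the transaction field within an eleventh preset time period before the current time point to obtain a summation result of the transaction field, and counting the register bank number field within the eleventh preset time period before the current time point to obtain a counting result of the register bank number field;
    obtaining the feature data table corresponding to the memory log data according to the summation result of the transaction field and the counting result of the register bank number field;
    in the case where the multiple kinds of log data further include the operating system kernel logs, performing feature engineering according to the multiple kinds of log data to obtain the feature data tables respectively corresponding to the multiple kinds of log data comprises:
    collecting the various fields related to memory errors in the operating system kernel logs;
    counting the various fields within a twelfth preset time period before the current time point to obtain counting results respectively corresponding to the various fields;
    obtaining the feature data table corresponding to the operating system kernel logs according to the counting results respectively corresponding to the various fields;
    in the case where the multiple kinds of log data further include the EDAC logs, performing feature engineering according to the multiple kinds of log data to obtain the feature data tables respectively corresponding to the multiple kinds of log data comprises:
    counting, according to the EDAC logs, the MC field, the page field, and the offset field within a thirteenth preset time period before the current time point to obtain a counting result of the MC field, a counting result of the page field, and a counting result of the offset field;
    obtaining the feature data table corresponding to the EDAC logs according to the counting result of the MC field, the counting result of the page field, and the counting result of the offset field;
    in the case where the multiple kinds of log data further include the performance data, performing feature engineering according to the multiple kinds of log data to obtain the feature data tables respectively corresponding to the multiple kinds of log data comprises:
    averaging, according to the performance data, the various kinds of performance data that affect memory faults within a fourteenth preset time period before the current time point to obtain an average value of each kind of performance data;
    obtaining the feature data table corresponding to the performance data according to the average value of each kind of performance data;
    in the case where the multiple kinds of log data further include the environment and location information data, performing feature engineering according to the multiple kinds of log data to obtain the feature data tables respectively corresponding to the multiple kinds of log data comprises:
    averaging, according to the environment and location information data, the temperature and the humidity within a fifteenth preset time period before the current time point to obtain an average value of the temperature and an average value of the humidity;
    obtaining the feature data table corresponding to the environment and location information data according to the average value of the temperature and the average value of the humidity.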
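  5. The memory fault prediction method according to claim 1, wherein the fault prediction model is trained using a pre-collected training data set and a machine learning method, the preset loss function of the machine learning method is Focal Loss, and the loss function formula is:
    Figure PCTCN2022121694-appb-100001
    wherein α is a balance factor, γ is a modulation parameter, y′ is the predicted value of a sample, y is the label value of the sample, and $L_{fl}$ is the error between the predicted value of the sample and the label value of the sample.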
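  6. The memory fault prediction method according to claim 1, wherein every sample in the training data set is labeled with a label value, and the label value is determined as follows:
    calculating the time interval between the time point corresponding to each sample and the time point corresponding to the sample at which the memory fault occurs;
    calculating the label value of each sample according to the time interval and a calculation formula for mapping the time interval into the interval [0,1];
    wherein the calculation formula is:
    Figure PCTCN2022121694-appb-100002
    wherein label is the computed label value, X is the time interval, a is a first preset coefficient, and T is the preset fault impact duration.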
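  7. The memory fault prediction method according to claim 1 or 6, wherein obtaining the fault prediction result of the memory under test according to the feature concatenation data table and the pre-trained fault prediction model comprises:
    obtaining an output result of the fault prediction model according to the feature concatenation data table and the pre-trained fault prediction model;
    obtaining a prediction result of the time point at which the memory under test fails according to the output result and a time formula for predicting the time point at which a fault occurs;
    wherein the time formula is:
    Figure PCTCN2022121694-appb-100003
    wherein t is the time point at which the memory under test fails, b is a second preset coefficient, output is the output result, and T is the preset fault impact duration.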
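  8. The memory fault prediction method according to claim 1, wherein acquiring the multiple kinds of log data of the memory under test comprises:
    acquiring the multiple kinds of log data of the memory under test at minute granularity;
    and performing feature engineering according to the multiple kinds of log data to obtain the feature data tables respectively corresponding to the multiple kinds of log data comprises:
    when the multiple kinds of log data for the m minutes closest to the current time point have been acquired, merging the multiple kinds of log data for the m minutes with the multiple kinds of log data for the n minutes preceding the m minutes, and using the merged log data as the log data required for fault prediction at the current time point; wherein n+m minutes is the preset minimum duration required for fault prediction;
    performing minute-granularity feature engineering according to the log data required for fault prediction at the current time point to obtain the feature data tables respectively corresponding to the multiple kinds of log data.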
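  9. The memory fault prediction method according to claim 5, wherein obtaining the fault prediction result of the memory under test according to the feature concatenation data table and the pre-trained fault prediction model comprises:
    obtaining, according to the feature concatenation data table and the pre-trained fault prediction model, the confidence with which the memory under test is classified as 1;
    determining the health degree of the memory under test according to the confidence; wherein the health degree is 1 minus the confidence, and the lower the health degree, the more likely the memory under test is to fail.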
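  10. The memory fault prediction method according to claim 6, wherein obtaining the fault prediction result of the memory under test according to the feature concatenation data table and the pre-trained fault prediction model comprises:
    obtaining, according to the feature concatenation data table and the pre-trained fault prediction model, a predicted value for the memory under test; wherein the predicted value is a floating-point number between 0 and 1;
    determining the health degree of the memory under test according to the predicted value; wherein the health degree is 1 minus the predicted value, and the lower the health degree, the more likely the memory under test is to fail.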
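  11. An electronic device, comprising:
    at least one processor; and
    a memory communicatively connected to the at least one processor; wherein
    the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can perform the memory fault prediction method according to any one of claims 1 to 10.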
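  12. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the memory fault prediction method according to any one of claims 1 to 10.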
PCT/CN2022/121694 2021-10-12 2022-09-27 Memory fault prediction method, electronic device, and computer-readable storage medium WO2023061209A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
KR1020247014345A KR20240065183A (ko) 2021-10-12 2022-09-27 Memory fault prediction method, electronic device, and computer-readable storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111189254.9A CN115981911A (zh) 2021-10-12 2021-10-12 Memory fault prediction method, electronic device, and computer-readable storage medium
CN202111189254.9 2021-10-12

Publications (1)

Publication Number Publication Date
WO2023061209A1 true WO2023061209A1 (zh) 2023-04-20

Family

ID=85966745

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/121694 WO2023061209A1 (zh) 2021-10-12 2022-09-27 Memory fault prediction method, electronic device, and computer-readable storage medium

Country Status (3)

Country Link
KR (1) KR20240065183A (zh)
CN (1) CN115981911A (zh)
WO (1) WO2023061209A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116755910B (zh) * 2023-08-16 2023-11-03 中移(苏州)软件技术有限公司 Cold-start-based host high-availability prediction method, apparatus, and electronic device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150067410A1 (en) * 2013-08-27 2015-03-05 Tata Consultancy Services Limited Hardware failure prediction system
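WO2016188175A1 (zh) * 2015-10-14 2016-12-01 中兴通讯股份有限公司 Hardware fault analysis system and method
CN110598802A (zh) * 2019-09-26 2019-12-20 腾讯科技(深圳)有限公司 Memory detection model training method, memory detection method, and apparatus
CN112579327A (zh) * 2019-09-27 2021-03-30 阿里巴巴集团控股有限公司 Fault detection method, apparatus, and device
CN113297046A (zh) * 2020-08-03 2021-08-24 阿里巴巴集团控股有限公司 Memory fault early-warning method and apparatus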

Also Published As

Publication number Publication date
CN115981911A (zh) 2023-04-18
KR20240065183A (ko) 2024-05-14

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 22880149; Country of ref document: EP; Kind code of ref document: A1)
ENP Entry into the national phase (Ref document number: 20247014345; Country of ref document: KR; Kind code of ref document: A)
ENP Entry into the national phase (Ref document number: 2022880149; Country of ref document: EP; Effective date: 20240430)