CN117608956A - Performance abnormality analysis method and storage device - Google Patents

Performance abnormality analysis method and storage device

Info

Publication number
CN117608956A
Authority
CN
China
Prior art keywords
command
key
event
performance
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311429740.2A
Other languages
Chinese (zh)
Inventor
秦汉张
袁戎
孙宝勇
徐凯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Memblaze Technology Co Ltd
Original Assignee
Beijing Memblaze Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Memblaze Technology Co Ltd
Priority to CN202311429740.2A
Publication of CN117608956A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3037 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system component is a memory, e.g. virtual memory, cache
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/34 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3466 Performance evaluation by tracing or monitoring
    • G06F11/3485 Performance evaluation by tracing or monitoring for I/O devices

Abstract

The disclosure relates to a performance abnormality analysis method and a storage device. In at least one embodiment of the present disclosure, the command types and the key events in the command execution process are preconfigured, and the corresponding key events are triggered during command execution so that information of the key events is collected in real time. The performance abnormality is then analyzed using the information of the key events, so that the specific cause of the performance abnormality can be determined, that is, at the moment the performance abnormality occurs, which key node of which command executed abnormally and what the cause of the abnormality is, thereby improving the efficiency of performance abnormality analysis.

Description

Performance abnormality analysis method and storage device
Technical Field
The embodiment of the disclosure relates to the technical field of computers, in particular to a performance abnormality analysis method and storage equipment.
Background
With the development of hard disk technology, solid state disks (Solid State Disk, SSD) are gradually replacing mechanical hard disks (Hard Disk Drive, HDD) to improve data storage stability and data reading efficiency.
Quality of service (Quality of Service, QoS) is a performance indicator of an SSD; it represents the IO delay at a given instant while the SSD performs input/output (IO).
The causes of poor QoS are complex: a moment of large IO delay may arise anywhere in the SSD firmware, and the state of the SSD at the moment of poor QoS cannot be analyzed and determined through common debugging (debug) means such as a serial port or capturing the failure scene.
Disclosure of Invention
At least one embodiment of the present disclosure provides a method and a storage device for analyzing a performance anomaly to determine a specific cause of the performance anomaly.
In a first aspect, an embodiment of the present disclosure provides a method for analyzing a performance anomaly, where a command type and a key event in a command execution process are preconfigured; the method comprises the following steps:
in response to receiving the command, determining a critical event in the execution of the command based on the type of the command;
triggering a key event if the key event is detected in the executing process of the command; generating information of the key event in response to triggering the key event;
based on the information of the key event, analyzing the performance abnormality to obtain a command and/or an abnormality cause causing the performance abnormality.
In some embodiments, the method of analyzing a performance anomaly further comprises: pre-configuring the type of the command, key nodes in the command execution process and association relations among key events corresponding to the key nodes; wherein, the key node is an executing step in the command executing process;
Determining key events in the execution of a command based on the type of the command, including: determining a key node and a key event corresponding to the key node in the execution process of the command based on the type and the association relation of the command;
if a critical event is detected, triggering the critical event, including: and if the key node is detected, triggering a key event corresponding to the key node.
In some embodiments, after generating the information of the critical event, the analysis method of the performance anomaly further includes: caching information of key events;
based on the information of the key event, analyzing the performance abnormality to obtain a command and/or an abnormality cause causing the performance abnormality, including:
acquiring a performance detection value;
if the performance detection value meets the triggering condition preset by the abnormal performance, backing up the cached information of the key event;
and analyzing the performance abnormality based on the information of the backup key event to obtain a command and/or an abnormality cause which cause the performance abnormality.
In some embodiments, analyzing the performance anomalies based on the information of the critical event includes:
based on the information of the key events, determining each abnormal command related to the performance abnormality and the information of each key event corresponding to each abnormal command;
Based on the information of each key event corresponding to each abnormal command, carrying out event semantic analysis on the performance abnormality to obtain an abnormality cause causing the performance abnormality; the event semantic analysis refers to analyzing whether an abnormal cause causing the abnormal performance exists among the key events corresponding to the abnormal command.
In some embodiments, when the performance exception is a delay exception, performing event semantic analysis on the performance exception based on information of each key event corresponding to each exception command to obtain an exception cause causing the performance exception, including:
if the time interval between two adjacent key events of the same abnormal command is greater than or equal to a preset first time interval, determining that the abnormality cause is that execution of the earlier of the two adjacent key events timed out; and/or,
if the same executor executes a plurality of key events and the time interval between the first key event and the last key event executed by the executor is greater than or equal to a preset second time interval, determining that the abnormality cause is that the executor occupied the CPU for too long; and/or,
if, for any executor, the time interval between starting to wait for a resource or key event and acquiring that resource or key event is greater than or equal to a preset third time interval, the time for which the executor occupies the CPU has not timed out, and execution of the earlier of two adjacent key events of the same abnormal command executed by the executor has not timed out, determining that the abnormality cause is that the executor timed out waiting for the resource or key event; and/or,
if the time interval between the NAND flash memory controller acquiring a command and responding to the command is greater than or equal to a preset fourth time interval, determining that the abnormality cause is abnormal processing capacity of the NAND flash memory controller.
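The first two delay rules above can be illustrated with a minimal Python sketch. It is not the patented implementation: the tuple layout `(time, command_id, event_type, executor)`, the function name `find_delay_causes`, and the cause labels are all hypothetical, and the second rule is simplified to the span between the first and last key event recorded for each executor.

```python
from collections import defaultdict

def find_delay_causes(events, first_interval, second_interval):
    """Apply two of the delay rules to key-event records.

    events: iterable of (time, command_id, event_type, executor) tuples
            (hypothetical layout chosen for this sketch).
    """
    events = sorted(events, key=lambda e: e[0])
    by_command = defaultdict(list)
    by_executor = defaultdict(list)
    for ev in events:
        by_command[ev[1]].append(ev)
        by_executor[ev[3]].append(ev)

    causes = []
    # Rule 1: adjacent key events of the same command are too far apart,
    # so the earlier key event is reported as having timed out.
    for cmd, evs in by_command.items():
        for prev, nxt in zip(evs, evs[1:]):
            if nxt[0] - prev[0] >= first_interval:
                causes.append(("key event execution timeout", cmd, prev[2]))
    # Rule 2 (simplified): the span between the first and last key event
    # run by one executor is too long, so the executor held the CPU too long.
    for executor, evs in by_executor.items():
        if len(evs) > 1 and evs[-1][0] - evs[0][0] >= second_interval:
            causes.append(("executor CPU occupancy timeout", executor))
    return causes
```

With a read command whose NAND read takes 500 time units and a first interval of 100, the sketch reports the "start read NAND flash" event as the timed-out one. The third and fourth rules would need wait/acquire and controller request/response timestamps, which are omitted here.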
In some embodiments, the information of the critical event includes at least one of:
time, event type, executor, additional information;
wherein the event type is an event type preconfigured based on a key event;
the executor is an execution subject for executing the key event;
the additional information is preconfigured information providing basis for performance anomaly analysis, and comprises at least one of the following:
NAND flash addresses, context addresses inside SSD firmware, and summary information.
In some embodiments, based on information of each key event corresponding to each abnormal command, performing event semantic analysis on the performance abnormality to obtain an abnormality cause causing the performance abnormality, including:
if a read command and a write command are executed at the same NAND flash memory address and a read-write conflict exists, determining that the abnormality cause is that a read-write conflict exists at the NAND flash memory address; and/or,
if a plurality of key events are executed by the same context address in the SSD firmware and a key event of timeout processing exists in the plurality of key events, determining that the reason of the abnormality is that the key event of timeout processing exists in the same context address.
In some embodiments, analyzing the performance anomalies based on the information of the critical event includes:
determining a complete command period of the command based on event semantics of the key event and command identification of the command;
extracting all logs of the command that occur in one complete command cycle;
and carrying out event semantic analysis based on all logs and information of key events which occur in one complete command period of the command to obtain an abnormal cause causing abnormal performance.
In some embodiments, event semantic analysis is performed based on information of all logs and critical events occurring in one complete command cycle of the command, resulting in an anomaly cause that leads to a performance anomaly, including:
based on all logs of the command and information of key events occurring in one complete command period, carrying out event semantic hierarchical processing to obtain an abnormality cause causing the performance abnormality; the event semantic hierarchical processing comprises the following steps:
based on the information of the key event, performing primary analysis on the performance index type to determine whether the command is abnormal;
if the command is abnormal, performing secondary analysis on the performance index type based on all logs and information of key events of the command in one complete command period, and obtaining an abnormal reason for the performance abnormality.
In some embodiments, when the performance anomaly is a bandwidth anomaly, performing event semantic analysis on the performance anomaly based on information of each key event corresponding to each anomaly command to obtain an anomaly cause causing the performance anomaly, including:
based on the information of each key event corresponding to each abnormal command, determining the bandwidth occupied by the command execution process corresponding to each key event;
judging whether the bandwidth occupied by the command execution process corresponding to each key event is larger than or equal to a bandwidth critical value corresponding to bandwidth abnormality;
if the bandwidth occupied by a command execution process is greater than or equal to the critical value, determining that the bandwidth abnormality occurs in that command execution process.
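The threshold comparison above can be sketched as follows. This is only an illustration: the record layout `(event_type, bytes_transferred, duration_s)` and the function name `find_bandwidth_anomalies` are assumptions, and the text does not say how the occupied bandwidth is derived from the key-event information, so bytes divided by duration is used here as a stand-in.

```python
def find_bandwidth_anomalies(events, critical_value):
    """Flag command-execution stages whose bandwidth reaches the critical value.

    events: iterable of (event_type, bytes_transferred, duration_s) tuples
            (hypothetical layout chosen for this sketch).
    critical_value: bandwidth threshold in bytes per second.
    """
    anomalies = []
    for event_type, nbytes, duration_s in events:
        if duration_s <= 0:
            continue  # skip records with no measurable duration
        bandwidth = nbytes / duration_s
        # The text treats "greater than or equal to" the critical value as abnormal.
        if bandwidth >= critical_value:
            anomalies.append((event_type, bandwidth))
    return anomalies
```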
In some embodiments, after analyzing the performance anomalies, the method of analyzing the performance anomalies further comprises:
generating an analysis log; the analysis log comprises at least one of the following:
the number of abnormal commands, the number of commands with different abnormal reasons, a file list of the abnormal commands with different abnormal reasons, and the number of abnormal commands processed by different chips in SSD firmware;
the number of the files in the file list is the same as the number of the abnormal commands and corresponds to the number of the abnormal commands one by one, and each file in the file list records information of each key event generated in the corresponding abnormal command executing process.
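One possible shape for such an analysis log is sketched below. The dictionary keys, the file-naming scheme `cmd_<id>.log`, and the input record fields (`command_id`, `cause`, `chip`, `events`) are all hypothetical; the text specifies only which counts and files the log may contain.

```python
from collections import Counter

def build_analysis_log(abnormal_commands):
    """Summarize analyzed abnormal commands.

    abnormal_commands: list of dicts with the hypothetical keys
        'command_id', 'cause', 'chip' and 'events'.
    """
    return {
        "abnormal_command_count": len(abnormal_commands),
        "count_by_cause": dict(Counter(c["cause"] for c in abnormal_commands)),
        "count_by_chip": dict(Counter(c["chip"] for c in abnormal_commands)),
        # One file per abnormal command, holding that command's key-event records.
        "files": {
            f"cmd_{c['command_id']}.log": list(c["events"])
            for c in abnormal_commands
        },
    }
```

Because the file entries are keyed by command identifier, the number of files equals the number of abnormal commands as long as the identifiers are unique.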
In a second aspect, an embodiment of the present disclosure further provides an analysis apparatus for performance anomaly, where a command type and a key event in a command execution process are preconfigured; the device comprises:
a determining unit for determining a critical event in the execution of the command based on the type of the command in response to receiving the command;
the generating unit is used for triggering the key event if the key event is detected in the executing process of the command; generating information of the key event in response to triggering the key event;
and the analysis unit is used for analyzing the performance abnormality based on the information of the key event to obtain a command and/or an abnormality cause which cause the performance abnormality.
In a third aspect, embodiments of the present disclosure further provide an electronic device, including a memory, a processor, and a computer program stored on the memory, where the processor executes the computer program to implement the steps of the method for analyzing a performance anomaly as provided by any embodiment of the first aspect.
In a fourth aspect, embodiments of the present disclosure further provide a storage device including a control unit and an NVM chip, where the control unit performs the steps of the method for analyzing a performance anomaly as provided in any embodiment of the first aspect.
In a fifth aspect, embodiments of the present disclosure also provide a computer-readable storage medium, where the computer-readable storage medium stores a program or instructions that cause a computer to perform the steps of the method for analyzing a performance anomaly as provided by any embodiment of the first aspect.
In a sixth aspect, embodiments of the present disclosure also provide a computer program product, wherein the computer program product comprises a computer program stored in a computer readable storage medium, from which at least one processor of the computer reads and executes the computer program, such that the computer performs the steps of the method of analyzing a performance anomaly as provided by any embodiment of the first aspect.
It can be seen that, in at least one embodiment of the present disclosure, the command types and the key events in the command execution process are preconfigured, and the corresponding key events are triggered during command execution so that information of the key events is collected in real time. Analyzing the performance anomaly using the information of the key events makes it possible to determine the specific cause of the performance anomaly, that is, which key node of which command executed abnormally and what the cause of the anomaly is, thereby improving the efficiency of performance anomaly analysis.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings used in the embodiments or the description of the prior art will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present disclosure, and other drawings may be obtained according to these drawings to those of ordinary skill in the art.
FIG. 1 is a flow chart of a method for analyzing performance anomalies according to an embodiment of the present disclosure;
FIG. 2 is a scene graph of a method for analyzing performance anomalies provided by an embodiment of the present disclosure;
FIG. 3 is a schematic representation of the content of the file exported on the basis of FIG. 2;
FIG. 4 is a schematic diagram of an analysis device for performance anomalies according to an embodiment of the present disclosure;
FIG. 5 is an exemplary block diagram of an electronic device provided by an embodiment of the present disclosure.
Detailed Description
The following description of the technical solutions in the embodiments of the present disclosure is made clearly and completely with reference to the accompanying drawings in the embodiments of the present disclosure, and it is apparent that the described embodiments are some embodiments of the present disclosure, but not all embodiments. Based on the embodiments in this disclosure, all other embodiments that a person skilled in the art would obtain without making any inventive effort are within the scope of protection of this disclosure.
The following detailed description is provided to assist the reader in obtaining a thorough understanding of the methods, apparatus, and/or systems described herein. However, various changes, modifications, and equivalents of the methods, apparatuses, and/or systems described herein will be apparent after an understanding of the present disclosure. For example, the order of operations described herein is merely an example and is not limited to those set forth herein, but rather may be altered as would be apparent after an understanding of the disclosure, except for operations that must occur in a specific order. Furthermore, descriptions of features known after understanding the present disclosure may be omitted for added clarity and conciseness.
The features described herein may be embodied in different forms and should not be construed as limited to the examples described herein. Rather, the examples described herein have been provided to illustrate only some of the many possible ways in which the methods, devices, and/or systems described herein may be implemented that will be apparent upon an understanding of the present disclosure.
The terminology used herein is for the purpose of describing various examples only and is not intended to be limiting of the disclosure. Singular forms also are intended to include plural forms unless the context clearly indicates otherwise. The terms "comprises," "comprising," and "having" specify the presence of stated features, amounts, operations, components, elements, and/or combinations thereof, but do not preclude the presence or addition of one or more other features, amounts, operations, components, elements, and/or combinations thereof.
Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs based on the understanding of this disclosure. Unless explicitly so defined herein, terms (such as those defined in a general dictionary) should be construed to have meanings consistent with their meanings in the context of the relevant art and the disclosure of the present disclosure, and should not be interpreted idealized or overly formal. The use of the term "may" herein with respect to an example or embodiment (e.g., with respect to what the example or embodiment may include or implement) indicates that there is at least one example or embodiment that includes or implements such feature, and all examples are not so limited.
In SSD performance testing, because the QoS indicator reflects long delays that occur with relatively small probability, debugging (debug) can only be performed by manually analyzing the code and repeatedly adding debug code, which is inefficient. Therefore, the embodiments of the present disclosure provide a method, an apparatus, a storage device, and an electronic device for analyzing a performance abnormality. The command types and the key events in the command execution process are preconfigured, and the corresponding key events are triggered during command execution so that information of the key events is collected in real time. Analyzing the performance abnormality using the information of the key events makes it possible to determine the specific cause of the performance abnormality, that is, at the moment the performance abnormality occurs, which key node of which command executed abnormally and what the cause of the abnormality is, thereby improving the efficiency of performance abnormality analysis.
Fig. 1 is a flow chart of an analysis method of performance abnormality provided by an embodiment of the present disclosure, where an execution body of the analysis method of performance abnormality is an electronic device, and the electronic device includes, but is not limited to, a storage device (for example, a solid state disk, a flash memory device, etc.), a smart phone, a palm computer, a tablet computer, a wearable device with a display screen, a desktop computer, a notebook computer, an all-in-one machine, a smart home device, a server, etc., where the server may be an independent server, or may be a cluster of multiple servers, or may include a server built locally and a server erected at a cloud.
As shown in fig. 1, the analysis method of the performance abnormality may include, but is not limited to, steps 101 to 103:
in step 101, in response to receiving a command, critical events in the execution of the command are determined based on the type of command.
In this embodiment, the command type and the key event in the command execution process are preconfigured. The key events are used to trigger a collection operation to collect relevant information in real time during command execution, the details of which are described below.
In some embodiments, the association relationship between the command type, the key node in the command execution process, and the key event corresponding to the key node is also preconfigured, where the key node is an execution step in the command execution process.
In some embodiments, in response to receiving a command, a critical node and a critical event corresponding to the critical node in the execution of the command are determined based on the type of the command and a pre-configured association.
The pre-configured association relationship is as follows: the command type, the key nodes in the command execution process and the association relation between key events corresponding to the key nodes.
For example, the command type is Read (Read), and the Read command execution process is summarized as: resolving a read command, applying for resources, querying L2P (mapping logical address to physical address), reading data in NAND flash, and replying the data to a Host (Host).
Based on the read command execution process, key nodes in the read command execution process are preconfigured, for example, "start read NAND flash", "end read NAND flash", "start reply Host", "reply Host complete" are configured as four key nodes in the read command execution process.
After the key nodes are configured, key events corresponding to the key nodes are further configured, each key node corresponds to one key event, and the key events are used for triggering collection operation so as to collect relevant information in the execution process of the key nodes in real time.
The relevant information in the execution process of the key node includes, but is not limited to, at least one of the following information items: start time, event type, the executor executing the key node, etc. In this embodiment, for convenience of description, the event type of a key event is denoted by the node name of the corresponding key node; for example, the event type of the key event corresponding to the key node "start read NAND flash" is denoted as "start read NAND flash". The executor executing a key node can be understood as which module on which CPU executes the key node. For example, if there are two CPUs, denoted CPU0 and CPU1, and the read module of CPU0 executes the key node "start read NAND flash", then the read module of CPU0 is the executor executing that key node. The executor may also be a functional module or a function in the SSD firmware.
For another example, the command type is Write (Write), and the Write command execution process is summarized as: resolving write commands, applying resources, querying L2P (logical address to physical address mapping), modifying L2P, replying to Host. Based on the write command execution process, key nodes in the write command execution process are preconfigured, and after the key nodes are configured, key events corresponding to the key nodes are further configured, so that a person skilled in the art can configure specific key nodes and corresponding key events according to actual needs.
For another example, the command type is Erase (Erase), and the Erase command execution process is summarized as: apply for resources → send an Erase command to different LUNs (logical units) → wait for all LUNs to complete erasure. Based on the execution process of the erase command, the key nodes in the execution process of the erase command are preconfigured, and after the key nodes are configured, the key events corresponding to the key nodes are further configured; a person skilled in the art can configure specific key nodes and corresponding key events according to actual needs.
As another example, the command type is garbage collection (Garbage Collection, GC), and the GC command execution process is summarized as: selecting a GC source (namely a physical block to be recycled), reading effective data, comparing L2P and writing the effective data. Based on the GC command execution process, key nodes in the GC command execution process are preconfigured, and after the key nodes are configured, key events corresponding to the key nodes are further configured, so that a person skilled in the art can configure specific key nodes and corresponding key events according to actual needs.
Therefore, the type of the command (read, write, erase or garbage collection) can be determined in response to the command being received because the association relationship among the command type, the key node in the command execution process and the key event corresponding to the key node is preconfigured, and further the preconfigured association relationship is searched based on the type of the command, so that the key node in the command execution process and the key event corresponding to the key node can be determined.
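The lookup described above can be pictured as a simple table keyed by command type. The following Python sketch is illustrative only: the names `ASSOCIATION` and `key_events_for` are hypothetical, only the four read-command key nodes named in the text are filled in, and the other command types are left empty because the text leaves their key nodes to the implementer.

```python
# Hypothetical pre-configured association:
# command type -> ordered list of (key node, key event type) pairs.
# As in the text, each key event type reuses the name of its key node.
ASSOCIATION = {
    "read": [
        ("start read NAND flash", "start read NAND flash"),
        ("end read NAND flash", "end read NAND flash"),
        ("start reply Host", "start reply Host"),
        ("reply Host complete", "reply Host complete"),
    ],
    "write": [],               # to be configured according to actual needs
    "erase": [],               # to be configured according to actual needs
    "garbage collection": [],  # to be configured according to actual needs
}

def key_events_for(command_type):
    """Return the (key node, key event type) pairs configured for a command type."""
    try:
        return ASSOCIATION[command_type]
    except KeyError:
        raise ValueError(f"no key events configured for command type {command_type!r}")
```

Firmware would more likely hold such a table in a compile-time structure; a dictionary is used here only to show the mapping.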
In step 102, in the process of executing the command, if a key event is detected, triggering the key event; in response to triggering the critical event, information of the critical event is generated.
In this embodiment, if the association relationship among the command type, the key nodes in the command execution process, and the key events corresponding to the key nodes is preconfigured, then, during execution of the command, if a key node is detected, the key event corresponding to that key node is triggered; in response to triggering the key event, information of the key event is generated.
Because the key nodes and the key events corresponding to the key nodes in the execution process of the command are determined, the corresponding key events are triggered along with the execution of different key nodes in the execution process of the command, and the information of the key events is collected in real time.
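A minimal Python sketch of triggering a key event and generating its information record is given below. The record fields follow the items the disclosure lists (time, event type, executor, additional information); the class name `KeyEventInfo`, the function `trigger_key_event`, the `command_id` field, and the injectable `clock` parameter are assumptions made for illustration.

```python
import time
from dataclasses import dataclass, field

@dataclass
class KeyEventInfo:
    timestamp_ns: int   # time at which the key event was triggered
    event_type: str     # e.g. "start read NAND flash"
    executor: str       # e.g. "CPU0/read" (hypothetical naming)
    command_id: int     # hypothetical identifier of the owning command
    extra: dict = field(default_factory=dict)  # additional information

def trigger_key_event(log, command_id, event_type, executor,
                      clock=time.monotonic_ns, **extra):
    """Generate the information record for a key event and append it to log."""
    info = KeyEventInfo(clock(), event_type, executor, command_id, dict(extra))
    log.append(info)
    return info
```

A call such as `trigger_key_event(log, 7, "start read NAND flash", "CPU0/read", nand_addr=0x10)` would be placed at the point in the command path where the key node is reached.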
In some embodiments, considering that an SSD has multiple performance indicators, different performance indicators correspond to different performance anomalies; for example, the QoS indicator corresponds to delay anomalies and the bandwidth indicator corresponds to bandwidth anomalies. The information of the key events that needs to be collected therefore differs between performance anomalies: for delay anomalies, time-related information needs to be collected, and for bandwidth anomalies, bandwidth-related information needs to be collected. Accordingly, the information of the key events to be collected (i.e., one or more information items of the key events) is preconfigured according to the performance test requirements (e.g., QoS test, bandwidth test, etc.).
In step 103, based on the information of the critical event, the performance anomaly is analyzed to obtain a command and/or an anomaly cause that resulted in the performance anomaly.
In this embodiment, if the association relationship among the command type, the key nodes in the command execution process, and the key events corresponding to the key nodes is preconfigured, the performance abnormality is analyzed based on the information of the key events to obtain the command, the key node, and/or the abnormality cause causing the performance abnormality.
Because the collected information of the key events differs between performance abnormalities, the abnormality causes obtained by analyzing that information also differ. To facilitate analysis of the abnormality cause, the value range and/or combination relationship of the key-event information corresponding to each abnormality cause can be preconfigured for each kind of performance abnormality; when the collected information of the key events satisfies the value range and/or combination relationship corresponding to an abnormality cause, that abnormality cause is determined to have occurred.
It can be seen that, in this embodiment, analyzing the performance abnormality using the information of the key events makes it possible to determine the specific cause of the performance abnormality, that is, at the moment the performance abnormality occurs, which key node of which command executed abnormally and what the cause of the abnormality is. This solves the problem that common debugging (debug) means such as a serial port or capturing the failure scene cannot determine the state of the SSD at the moment of the performance abnormality.
In some embodiments, after "generating information of critical events" in step 102, the analysis method of performance anomalies further includes: information of the key event is cached.
In this embodiment, the generated information of the key event is cached in a buffer area corresponding to the CPU, where the CPU is the CPU executing the command received in step 101, so that the subsequent acquisition of the information of the key event from the buffer area is convenient to analyze the cause of the abnormality. In addition, the cycle count (cycle count) of the CPU may be cached in a buffer (buffer) corresponding to the CPU as auxiliary information for analyzing the cause of the abnormality.
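The per-CPU caching described above might look like the following Python sketch. The class name `PerCpuEventCache` and its methods are hypothetical, and the use of a bounded ring buffer that silently drops the oldest record is an assumption; the text says only that the information is cached in a buffer corresponding to the CPU.

```python
from collections import deque

class PerCpuEventCache:
    """One bounded buffer per CPU holding (key-event info, cycle count) pairs."""

    def __init__(self, cpu_ids, capacity):
        # deque(maxlen=...) discards the oldest entry once capacity is reached.
        self._buffers = {cpu: deque(maxlen=capacity) for cpu in cpu_ids}

    def record(self, cpu, info, cycle_count=None):
        """Cache a key-event record, optionally with the CPU cycle count."""
        self._buffers[cpu].append((info, cycle_count))

    def snapshot(self, cpu):
        """Return a copy of the records currently cached for one CPU."""
        return list(self._buffers[cpu])
```

Keeping a separate buffer per CPU avoids contention between CPUs when records are written, which is one plausible reason for the arrangement in the text.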
Accordingly, in step 103, "analyzing the performance abnormality based on the information of the key event, and obtaining the command and/or the cause of the performance abnormality" is implemented as follows:
acquiring a performance detection value; if the performance detection value meets the triggering condition preset by the abnormal performance, backing up the cached information of the key event; and analyzing the performance abnormality based on the information of the backup key event to obtain a command and/or an abnormality cause which cause the performance abnormality.
If the scheme of the present embodiment is applied to the performance test phase, the performance test value is a performance test value collected in response to the performance test command (i.e., a performance test value collected during the performance test phase). If the scheme of the embodiment is applied to a non-performance test phase, that is, to a normal operation phase, the performance detection value is a performance detection value acquired during the execution of the command.
To improve the efficiency of performance abnormality analysis, this embodiment acquires only the key-event information that is already cached when the performance abnormality occurs, rather than waiting for the entire performance test phase or normal operation phase to end. Key-event information unrelated to the performance abnormality is therefore not acquired, and all the information used for the analysis is related to the abnormality, which improves analysis efficiency.
To obtain the key-event information cached at the time of the performance abnormality, this embodiment presets a trigger condition for the performance abnormality. The trigger condition can be understood as a critical value of the performance detection value corresponding to the performance abnormality. If the performance detection value meets the preset trigger condition, a performance abnormality is currently occurring, and the cached key-event information is backed up, which captures the information cached at the time of the abnormality. The trigger condition may be set in various ways, for example by a Vendor Specific (VS) command sent from the host (Host), by directly modifying the code, or through GDB (GNU Debugger) access.
As an example, the performance detection value is a delay time and the trigger condition is the critical value of the delay time corresponding to a delay abnormality. If the performance detection value is greater than or equal to the critical value, that is, the delay time meets the trigger condition preset for the delay abnormality, a delay abnormality is currently occurring, and the cached key-event information is backed up, which captures the information cached at the time of the delay abnormality. As another example, the performance detection value is a bandwidth and the trigger condition is the critical value of the bandwidth corresponding to a bandwidth abnormality. If the performance detection value is greater than or equal to the critical value, that is, the bandwidth meets the trigger condition preset for the bandwidth abnormality, a bandwidth abnormality is currently occurring, and the cached key-event information is backed up, which captures the information cached at the time of the bandwidth abnormality.
After the cached key-event information is backed up, the backed-up information may be exported through GDB (GNU Debugger) or by other means. The export may take place when the performance test ends or at any manually defined time. The performance abnormality is then analyzed based on the backed-up key-event information to obtain the command, key node and/or abnormality cause that caused it. Because the backed-up information is the key-event information cached at the time of the performance abnormality and is therefore related to it, the efficiency of the analysis is improved.
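The trigger-and-backup step for the delay example can be sketched in Python as follows. This is only an illustration under stated assumptions, not the firmware implementation: the names `LATENCY_THRESHOLD_US` and `maybe_backup`, the microsecond unit, and the list-based cache are all hypothetical.

```python
from typing import List, Optional, Tuple

# Hypothetical critical value: a delay (in microseconds) at or above this
# value is treated as a delay abnormality.
LATENCY_THRESHOLD_US = 1000

Event = Tuple[int, str]  # (timestamp in microseconds, event description)


def maybe_backup(latency_us: int, cached: List[Event],
                 threshold_us: int = LATENCY_THRESHOLD_US) -> Optional[List[Event]]:
    """Return a snapshot of the cached events if the trigger condition is met."""
    if latency_us >= threshold_us:
        # Copy the cache so that later writes to it do not alter the backup.
        return list(cached)
    return None


cache = [(0, "command parsing start"), (1500, "return host end")]
backup = maybe_backup(latency_us=1500, cached=cache)
cache.append((1600, "command parsing start"))  # the backup is unaffected
```

Copying the cache at trigger time is what keeps the backed-up events tied to the abnormal moment, even though the live cache continues to change.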
Fig. 2 is a scenario diagram of the performance abnormality analysis method provided in an embodiment of the present disclosure. Fig. 2 assumes a plurality of CPUs, denoted CPU0, ..., CPUN. Taking a write command (Write Order) executed by CPU0 as an example, CPU0 generates key-event information while executing the write command and caches it in the buffer corresponding to CPU0 (Buffer0). Each entry in the buffer holds the information of one key event and includes the time (Time) and the event information (event). Once CPU0 has recorded N entries in the buffer, the information of each new key event is recorded by looping back (Loop Back) within the buffer.
In Fig. 2, if the performance detection value meets the trigger condition preset for the performance abnormality, CPU0 backs up the cached key-event information into the backup buffer (BackUp Buffer0). After the backup, the backed-up information may be exported through GDB (GNU Debugger) or by other means (for example, a VS command), either when the performance test ends or at any manually defined time. The performance abnormality is then analyzed based on the backed-up key-event information to obtain the command, key node and/or abnormality cause that caused it. Because the backed-up information is the key-event information cached at the time of the performance abnormality and is therefore related to it, the efficiency of the analysis is improved. In Fig. 2, a data analysis tool may be used to generate a readable text log, which a script (for example, in Python) then parses automatically to extract the required information and produce the possible causes of poor QoS (Quality of Service).
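The loop-back behaviour of a per-CPU buffer can be illustrated with a small Python sketch. The class name `EventBuffer` and its methods are hypothetical, and `collections.deque` with `maxlen` stands in for the fixed-size firmware buffer; the actual buffer layout is not specified in the text.

```python
from collections import deque


class EventBuffer:
    """Fixed-capacity buffer for one CPU; the oldest entries are overwritten."""

    def __init__(self, capacity: int) -> None:
        self._entries = deque(maxlen=capacity)

    def record(self, time: int, event: str) -> None:
        # When the deque is full, appending discards the oldest entry,
        # which models the loop-back write into the buffer.
        self._entries.append((time, event))

    def snapshot(self) -> list:
        """Copy the current contents, oldest first (the backup step)."""
        return list(self._entries)


buffer0 = EventBuffer(capacity=3)
for t, name in enumerate(["e0", "e1", "e2", "e3"]):
    buffer0.record(t, name)
```

With a capacity of three, the fourth event replaces the oldest one, so a snapshot always holds the most recent events leading up to the abnormality.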
In some embodiments, one implementation of "analyze performance anomalies based on information of critical events" in step 103 is:
based on the information of the key events, determining each abnormal command related to the performance abnormality and the information of each key event corresponding to each abnormal command; based on the information of each key event corresponding to each abnormal command, carrying out event semantic analysis on the performance abnormality to obtain an abnormality cause causing the performance abnormality; the event semantic analysis refers to analyzing whether an abnormal cause causing the abnormal performance exists among the key events corresponding to the abnormal command.
In this embodiment, the information of the critical event includes command identifiers, so, based on the information of the critical event, each abnormal command related to the performance abnormality may be determined, for example, based on the information of the backed-up critical event (i.e., the information of the critical event that has been cached in the case of the performance abnormality), the command corresponding to each command identifier included in the information of the backed-up critical event is taken as an abnormal command that may cause the performance abnormality. And further, by utilizing the pre-configured command type, the key nodes in the command execution process and the association relation between the key events corresponding to the key nodes, the information of each key event corresponding to each abnormal command can be determined from the information of the backed-up key events.
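As a rough illustration of this step, the backed-up events can be grouped by command identifier so that each candidate abnormal command has its own time-ordered event list. The `KeyEvent` fields and the function name `group_by_command` are assumptions made for the sketch.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class KeyEvent(NamedTuple):
    time: int        # timestamp, arbitrary units
    cmd_id: int      # command identifier carried in the event information
    event_type: str  # for example, the node name of the key node


def group_by_command(events: List[KeyEvent]) -> Dict[int, List[KeyEvent]]:
    """Collect the backed-up events of each command, ordered by time."""
    grouped: Dict[int, List[KeyEvent]] = defaultdict(list)
    for ev in events:
        grouped[ev.cmd_id].append(ev)
    for evs in grouped.values():
        evs.sort(key=lambda e: e.time)
    return dict(grouped)


backed_up = [
    KeyEvent(30, 1, "read NAND start"),
    KeyEvent(0, 1, "command parsing start"),
    KeyEvent(5, 2, "command parsing start"),
]
per_command = group_by_command(backed_up)
```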
In this embodiment, by using information of each key event corresponding to each abnormal command, event semantic analysis is performed on the performance abnormality, where the event semantic analysis refers to analyzing whether there is an abnormality cause that causes the performance abnormality between each key event corresponding to the abnormal command.
In order to facilitate the event semantic analysis, a value range and/or a combination relation of information of a key event corresponding to an abnormal cause causing the performance abnormality may be preconfigured, wherein the value range and/or the combination relation of information of the key event may be understood as event semantics corresponding to the abnormal cause. Thus, when the information of the key event satisfies the value range and/or the combination relation of the information of the key event corresponding to the abnormality cause, it can be determined that the abnormality cause has caused the performance abnormality.
Taking the performance abnormality as a delay abnormality as an example, carrying out event semantic analysis on the performance abnormality based on the information of each key event corresponding to each abnormal command to obtain an abnormality cause causing the performance abnormality, including but not limited to the following cases:
Case one
If the time interval of two adjacent key events of the same abnormal command is greater than or equal to the preset first time interval, determining that the abnormal cause is the execution timeout of the previous key event in the two adjacent key events.
In this embodiment, where the association relationship is preconfigured, the rule is expressed in terms of key nodes: if the time interval between two adjacent key nodes in the execution of the same abnormal command is greater than or equal to the preset first time interval, the abnormality cause is determined to be an execution timeout of the earlier of the two key nodes.
In this embodiment, for the delay exception, a time interval threshold (i.e., a first time interval) between two neighboring key nodes executing the same command may be preconfigured, and exceeding the time interval threshold indicates that the execution of a preceding key node in the two neighboring key nodes is overtime.
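Case one amounts to scanning the adjacent events of one command and comparing each gap with the first time interval. A minimal sketch, assuming the events are already sorted by time and using the hypothetical name `find_slow_step`:

```python
from typing import List, Optional, Tuple

Event = Tuple[int, str]  # (timestamp, event name), sorted by timestamp


def find_slow_step(events: List[Event], first_interval: int) -> Optional[str]:
    """Return the earlier event of the first adjacent pair whose gap
    is greater than or equal to first_interval, or None if there is none."""
    for (t_prev, name_prev), (t_next, _) in zip(events, events[1:]):
        if t_next - t_prev >= first_interval:
            return name_prev
    return None


read_cmd = [
    (0, "command parsing start"),
    (10, "command parsing end"),
    (20, "read NAND start"),
    (900, "read NAND end"),
]
```

Here `find_slow_step(read_cmd, 500)` reports "read NAND start", because the 880-unit gap that follows it is the only one reaching the threshold.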
Case two
If the same executor executes a plurality of key events and the time interval between the first key event and the last key event executed by the executor is greater than or equal to a preset second time interval, determining that the abnormality cause is that the time of the executor occupying the CPU is overtime.
In this embodiment, where the association relationship is preconfigured, the rule is expressed in terms of key nodes: if the same executor executes a plurality of key nodes and the time interval between the first key node and the last key node executed by the executor is greater than or equal to the preset second time interval, the abnormality cause is determined to be that the executor has occupied the CPU for too long.
In this embodiment, the executor is an execution body for executing the key nodes, for example, may be any module on any CPU in the SSD firmware, and it is understood that the executor may execute a plurality of key nodes, where the plurality of key nodes may be derived from the same command or may be derived from different commands.
In this embodiment, for the delay exception, the same executor may be preconfigured to execute a time interval threshold (i.e., a second time interval) between the first and the last key nodes, and if the time interval threshold is exceeded, it indicates that the time of the executor occupying the CPU is overtime.
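Case two can be sketched by recording, for each executor, the times of the events it executed and comparing the span between the first and the last with the second time interval. The sketch assumes that the events passed in belong to one uninterrupted run of each executor; the executor labels and the function name `executors_over_budget` are hypothetical.

```python
from typing import Dict, List, Tuple

Event = Tuple[int, str]  # (timestamp, executor name)


def executors_over_budget(events: List[Event], second_interval: int) -> List[str]:
    """Return executors that ran several events spanning >= second_interval."""
    times: Dict[str, List[int]] = {}
    for time, executor in events:
        times.setdefault(executor, []).append(time)
    return sorted(
        name for name, ts in times.items()
        if len(ts) > 1 and max(ts) - min(ts) >= second_interval
    )


run = [
    (0, "CPU0.read"),
    (5, "CPU1.write"),
    (20, "CPU1.write"),
    (400, "CPU0.read"),
]
```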
Case three
If the time interval between the start of waiting for the resource or the key event and the acquisition of the resource or the key event by any executor is larger than or equal to a preset third time interval, the time occupied by the executor by the CPU is not overtime, and the execution of the previous key event in the two adjacent key events of the same abnormal command executed by the executor is not overtime, determining that the abnormal reason is that the executor waits for the resource or the key event overtime.
In this embodiment, where the association relationship is preconfigured, the rule is expressed in terms of key nodes: if the time interval between the moment any executor starts waiting for a resource or key event and the moment it obtains that resource or key event is greater than or equal to the preset third time interval, the executor has not occupied the CPU for too long, and the earlier of two adjacent key nodes of the same abnormal command executed by the executor has not timed out, the abnormality cause is determined to be that the executor timed out while waiting for the resource or key event.
In this embodiment, for the delay exception, a waiting time interval threshold (i.e., a third time interval) may be preconfigured, and if the time interval between waiting for a resource or a critical event and obtaining the resource or the critical event exceeds the waiting time interval threshold, and in the case that the first case and the second case do not occur, it may be determined that the executor waits for the resource or the critical event to timeout.
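The precedence between the cases can be made explicit in code: a wait timeout is reported only when the wait reaches the third time interval and neither case one nor case two applies. The function name and parameters below are illustrative assumptions.

```python
def is_wait_timeout(wait_start: int, wait_end: int, third_interval: int,
                    step_timed_out: bool, cpu_timed_out: bool) -> bool:
    """Case three: long wait, with case one and case two both ruled out."""
    waited_too_long = wait_end - wait_start >= third_interval
    return waited_too_long and not step_timed_out and not cpu_timed_out


# A 700-unit wait against a 500-unit threshold, with no other timeout found.
verdict = is_wait_timeout(100, 800, 500, step_timed_out=False, cpu_timed_out=False)
```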
Case four
If the time interval between the acquisition command and the response command of the NAND flash memory controller is greater than or equal to a preset fourth time interval, determining that the abnormality is caused by abnormal processing capacity of the NAND flash memory controller.
In this embodiment, for the delay exception, a time interval threshold (i.e., a fourth time interval) between the time when the NAND flash memory controller acquires the command and the time when the NAND flash memory controller responds to the same command may be preconfigured, and when the time interval threshold is exceeded, it is indicated that the NAND flash memory controller processes the same command overtime, i.e., the processing capability of the NAND flash memory controller is abnormal.
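Case four pairs, for each command, the moment the NAND flash controller acquires the command with the moment it responds. A minimal sketch, assuming hypothetical event labels `"nand_acquire"` and `"nand_respond"`:

```python
from typing import Dict, List, Tuple

# (timestamp, command id, "nand_acquire" or "nand_respond")
Event = Tuple[int, int, str]


def slow_nand_commands(events: List[Event], fourth_interval: int) -> List[int]:
    """Return ids of commands whose acquire-to-respond time >= fourth_interval."""
    acquired: Dict[int, int] = {}
    slow: List[int] = []
    for time, cmd_id, kind in sorted(events):
        if kind == "nand_acquire":
            acquired[cmd_id] = time
        elif kind == "nand_respond" and cmd_id in acquired:
            if time - acquired.pop(cmd_id) >= fourth_interval:
                slow.append(cmd_id)
    return slow


nand_events = [
    (0, 1, "nand_acquire"),
    (10, 2, "nand_acquire"),
    (50, 2, "nand_respond"),
    (700, 1, "nand_respond"),
]
```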
On the basis of the above embodiments, the present embodiment specifically describes information items included in information of a key event, including but not limited to at least one of the following:
time, event type, executor, additional information;
The event type is a type preconfigured for the key event and can be customized. In this embodiment, for ease of description, the event type of a key event is recorded as the node name of the corresponding key node. For example, the event type of the key event corresponding to the key node "start reading NAND flash" is recorded as "start reading NAND flash".
The executor is the execution body that executes the key event. Where the association relationship is preconfigured, the executor is the execution body that executes the key node, that is, it identifies which module on which CPU executes the key node. For example, with two CPUs denoted CPU0 and CPU1, if the read module of CPU0 executes the key node "start reading NAND flash", the read module of CPU0 is the executor of that key node.
The additional information is preconfigured information that provides basis for performance anomaly analysis, for example, the additional information includes, but is not limited to, at least one of the following: NAND flash addresses, context addresses inside SSD firmware, and summary information. The summary information is the sum of time consumed by all events in the whole command or task execution process when the ending event of one command or task occurs.
The NAND flash address helps determine whether a NAND error or similar fault occurred at that address, helps identify the commands processed at the same NAND flash address in order to analyze whether a read-write conflict exists, and helps determine the load of each die (Die) in the NAND flash in order to analyze whether the delay abnormality is caused by excessive load. A die is the smallest independent unit in the NAND flash that can execute commands and report its own status. A NAND flash package includes one or more dies. Typically, a logical unit (LUN) corresponds to a single die. A logical unit may include multiple planes (Plane), and the planes within a logical unit may be accessed in parallel.
Fig. 3 is a schematic view of the content of the file exported in the scenario of Fig. 2. As shown in Fig. 3, the file records the key-event information generated while processing a write command that met the trigger condition preset for the performance abnormality; the processing of the write command lasted 1.5 ms. The key-event information includes: time (shown as time points in Fig. 3), executor (the execution body that executes the event), additional information, and event type (shown as event names in Fig. 3). For details, refer to the description of the key-event information above, which is not repeated here.
Based on different additional information, performing event semantic analysis on the performance anomalies to obtain anomaly causes causing the performance anomalies, including but not limited to the following ways:
Mode one: if a read command and a write command are executed at the same NAND flash address and a read-write conflict exists, the abnormality cause is determined to be a read-write conflict at that NAND flash address.
Mode two: if a plurality of key events are executed by the same context address in the SSD firmware and a key event of timeout processing exists in the plurality of key events, determining that the reason of the abnormality is that the key event of timeout processing exists in the same context address.
In this embodiment, where the association relationship is preconfigured, the rule is expressed in terms of key nodes: if a plurality of key nodes are executed by the same context address in the SSD firmware and one of them is processed with a timeout, the abnormality cause is determined to be that a timed-out key node exists at that context address.
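Mode one above can be sketched as a search for a read and a write at the same NAND flash address. The text does not define what counts as a conflict, so this sketch assumes that a conflict means the two accesses overlap in time; the `NandAccess` fields and the function name are likewise hypothetical.

```python
from itertools import combinations
from typing import List, NamedTuple, Set


class NandAccess(NamedTuple):
    address: int  # NAND flash address taken from the additional information
    op: str       # "read" or "write"
    start: int    # time the access started
    end: int      # time the access finished


def conflicting_addresses(accesses: List[NandAccess]) -> Set[int]:
    """Return addresses where a read and a write overlap in time (assumed rule)."""
    conflicts: Set[int] = set()
    for a, b in combinations(accesses, 2):
        same_address = a.address == b.address
        read_and_write = {a.op, b.op} == {"read", "write"}
        overlap = a.start < b.end and b.start < a.end
        if same_address and read_and_write and overlap:
            conflicts.add(a.address)
    return conflicts


accesses = [
    NandAccess(0x10, "read", 0, 100),
    NandAccess(0x10, "write", 50, 200),
    NandAccess(0x20, "read", 0, 10),
    NandAccess(0x20, "write", 20, 30),
]
```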
In some embodiments, one implementation of "analyze performance anomalies based on information of critical events" in step 103 is:
Determining a complete command period of the command based on event semantics of the key event and command identification of the command; extracting all logs of the command that occur in one complete command cycle; and carrying out event semantic analysis based on all logs and information of key events which occur in one complete command period of the command to obtain an abnormal cause causing abnormal performance.
As an example, the event semantic analysis is performed over a complete command cycle. A read command, for example, goes through the following steps: step 1, command parsing starts; step 2, command parsing ends; step 3, read L2P starts; step 4, read L2P ends; step 5, read NAND starts; step 6, read NAND ends; step 7, return host starts; step 8, return host ends. Based on the event semantics and the command identifier included in the additional information, step 1 can be identified as the start of a read command and step 8 as its end. If the delay of a complete command cycle is too long, all logs occurring between step 1 and step 8 are extracted, and the analysis then determines between which two steps the processing took too long.
In some embodiments, event semantic hierarchical processing may be performed based on all logs and the key-event information occurring in one complete command cycle of the command, to obtain the abnormality cause that led to the performance abnormality. The event semantic hierarchical processing comprises the following steps:
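The read command example can be sketched as two small functions: one extracts the logs of a command between its first and last step, and the other finds the adjacent pair of steps with the longest gap. The event names follow the eight steps listed above; the tuple layout, the function names, and the choice to keep only the logs of the same command are assumptions of the sketch.

```python
from typing import List, Optional, Tuple

Log = Tuple[int, int, str]  # (timestamp, command id, event name)

START, END = "command parsing start", "return host end"


def command_cycle(logs: List[Log], cmd_id: int) -> List[Log]:
    """Return the logs of cmd_id from its START event to its END event."""
    own = sorted(entry for entry in logs if entry[1] == cmd_id)
    names = [entry[2] for entry in own]
    if START not in names or END not in names:
        return []  # the cycle is incomplete in the captured logs
    return own[names.index(START):names.index(END) + 1]


def slowest_gap(cycle: List[Log]) -> Optional[Tuple[str, str, int]]:
    """Return (earlier step, later step, gap) for the longest adjacent gap."""
    gaps = [(b[0] - a[0], a[2], b[2]) for a, b in zip(cycle, cycle[1:])]
    if not gaps:
        return None
    duration, earlier, later = max(gaps)
    return earlier, later, duration


logs = [
    (0, 7, "command parsing start"), (5, 7, "command parsing end"),
    (6, 8, "command parsing start"),
    (10, 7, "read L2P start"), (20, 7, "read L2P end"),
    (25, 7, "read NAND start"), (900, 7, "read NAND end"),
    (905, 7, "return host start"), (910, 7, "return host end"),
]
cycle = command_cycle(logs, 7)
```

For this data the longest gap lies between "read NAND start" and "read NAND end", which points the analysis at the NAND read.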
Based on the information of the key event, performing primary analysis on the performance index type to determine whether the command is abnormal; if the command is abnormal, performing secondary analysis on the performance index type based on all logs and information of key events of the command in one complete command period, and obtaining an abnormal reason for the performance abnormality.
As an example, the event semantic analysis is performed over a complete command cycle. A read command, for example, goes through the following steps: step 1, command parsing starts; step 2, command parsing ends; step 3, read L2P starts; step 4, read L2P ends; step 5, read NAND starts; step 6, read NAND ends; step 7, return host starts; step 8, return host ends. Based on the event semantics and the command identifier included in the additional information, step 1 can be identified as the start of a read command and step 8 as its end. If the delay of a complete command cycle is too long, all logs occurring between step 1 and step 8 are extracted, and the analysis then determines between which two steps the processing took too long. During the analysis, hierarchical processing is applied according to actual conditions or needs. For example, when the collected data contains a large amount of similar data, the total duration is examined first; once a problem command is found, it is extracted and its cause is analyzed according to the event semantics.
In some embodiments, when the performance anomaly is a bandwidth anomaly, one implementation of "analyze performance anomaly based on information of critical events" in step 103 is:
based on the information of the key events, determining each abnormal command related to the performance abnormality and the information of each key event corresponding to each abnormal command; based on the information of each key event corresponding to each abnormal command, carrying out event semantic analysis on the performance abnormality to obtain an abnormality cause causing the performance abnormality; the event semantic analysis refers to analyzing whether an abnormal cause causing the abnormal performance exists among the key events corresponding to the abnormal command.
Based on the information of each key event corresponding to each abnormal command, performing event semantic analysis on the performance abnormality to obtain an abnormality cause causing the performance abnormality, including:
based on the information of each key event corresponding to each abnormal command, determining the bandwidth occupied by the command execution process corresponding to each key event; judging whether the bandwidth occupied by the command execution process corresponding to each key event is greater than or equal to the bandwidth critical value corresponding to the bandwidth abnormality; if the occupied bandwidth is greater than or equal to the critical value, determining that the bandwidth abnormality occurred in the corresponding command execution process.
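The bandwidth check can be sketched as follows. The text does not say how the occupied bandwidth is derived from the key-event information, so this sketch assumes that each command execution process has a known number of bytes transferred and a duration; the field layout and function names are hypothetical.

```python
from typing import Dict, List, Tuple


def bandwidth_mb_s(bytes_transferred: int, duration_us: int) -> float:
    """Bytes per microsecond equals megabytes (10**6 bytes) per second."""
    return bytes_transferred / duration_us


def bandwidth_abnormal(processes: Dict[int, Tuple[int, int]],
                       critical_mb_s: float) -> List[int]:
    """Return command ids whose occupied bandwidth >= the critical value.

    processes maps command id -> (bytes transferred, duration in microseconds).
    """
    return sorted(
        cmd_id for cmd_id, (size, duration) in processes.items()
        if bandwidth_mb_s(size, duration) >= critical_mb_s
    )


processes = {1: (4096, 2), 2: (4096, 100)}
flagged = bandwidth_abnormal(processes, critical_mb_s=1000)
```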
On the basis of the above embodiments, after analyzing the performance abnormality, the analysis method of the performance abnormality further includes: and generating an analysis log so that after the analysis log is manually obtained, the analysis log can be combined to analyze the abnormal reasons causing the abnormal performance more quickly. The generated analysis log comprises at least one of the following:
the number of abnormal commands, the number of commands with different abnormal reasons, a file list of the abnormal commands with different abnormal reasons, and the number of abnormal commands processed by different chips in SSD firmware; the number of the files in the file list is the same as the number of the abnormal commands and corresponds to the number of the abnormal commands one by one, and each file in the file list records information of each key event generated in the corresponding abnormal command executing process.
Taking a delay abnormality as an example, different key events are triggered while commands execute, so a large number of logs are recorded. These logs may contain many commands with normal delay, and because commands are processed in parallel, the events of different commands are interleaved in the record. It is therefore difficult to manually pick out the long-delay commands that need to be examined. To help staff filter out the interfering information and focus on a particular command when analyzing the cause of its long delay, this embodiment generates an analysis log after analyzing the performance abnormality. The analysis log records all key events of one or more long-delay commands (that is, abnormal commands), so that staff who obtain the analysis log can focus on a particular abnormal command and, with the help of the log, analyze the abnormality cause more quickly. In this embodiment, the content of the generated analysis log (report) is as follows:
Report
Number of timeout commands found: N
Number of commands with timeout reason a: A, of which read commands: Ra, write commands: Wa;
Number of commands with timeout reason b: B, of which read commands: Rb, write commands: Wb;
Number of commands with timeout reason c: C, of which read commands: Rc, write commands: Wc;
……
File list of timeout commands with timeout reason a: [files A0, A1, A2, ...];
File list of timeout commands with timeout reason b: [files B0, B1, B2, ...];
File list of timeout commands with timeout reason c: [files C0, C1, C2, ...];
……
Die0 processed Read: X0 commands, Write: Y0 commands;
……
DieN processed Read: Xn commands, Write: Yn commands.
After staff obtain the report, suppose timeout reason a is a long delay caused by waiting too long for a resource, and file A0 records the information of all key events of long-delay command A0. When reviewing the report, staff can focus on the resource-waiting events. Once the event is located, they can determine which specific resource was waited for too long and whether the wait was affected by other factors. The report also makes it possible to compute the proportion of read and write commands among the timeout commands.
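A report of this shape can be produced by simple counting once each abnormal command has been labelled with its operation, timeout reason and die. The sketch below is illustrative only; the `AbnormalCommand` fields, the function name `build_report` and the exact wording of the output lines are assumptions, and the per-reason file lists are omitted.

```python
from collections import Counter
from typing import List, NamedTuple


class AbnormalCommand(NamedTuple):
    cmd_id: int
    op: str      # "read" or "write"
    reason: str  # label of the timeout reason, e.g. "a"
    die: int     # index of the die that processed the command


def build_report(cmds: List[AbnormalCommand]) -> str:
    """Summarize abnormal commands by timeout reason and by die."""
    lines = [f"Number of timeout commands found: {len(cmds)}"]
    by_reason = Counter((c.reason, c.op) for c in cmds)
    for reason in sorted({c.reason for c in cmds}):
        reads = by_reason[(reason, "read")]
        writes = by_reason[(reason, "write")]
        lines.append(f"Timeout reason {reason}: {reads + writes} commands "
                     f"(read: {reads}, write: {writes})")
    by_die = Counter((c.die, c.op) for c in cmds)
    for die in sorted({c.die for c in cmds}):
        lines.append(f"Die{die} processed Read: {by_die[(die, 'read')]}, "
                     f"Write: {by_die[(die, 'write')]}")
    return "\n".join(lines)


cmds = [
    AbnormalCommand(1, "read", "a", 0),
    AbnormalCommand(2, "write", "a", 1),
    AbnormalCommand(3, "read", "b", 0),
]
report = build_report(cmds)
```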
It can be seen that, conventionally, the possible causes of problems such as poor QoS can only be guessed by reasoning about the code, after which the code is modified repeatedly to verify and eliminate each possibility. With the present application, the execution details of the SSD firmware within a given short instant can be known in detail: which function the firmware executed at which point in time, which command was sent to the NAND, which resource was being waited for, at which point in time the awaited resource was obtained, and so on. The present application automates this analysis: from a large amount of data it determines whether, during firmware execution, a function occupied the CPU for too long, a resource was waited for too long, the NAND did not respond in time, and the like, and it extracts the log of the abnormal part for engineers to analyze.
It should be noted that, for simplicity of description, the foregoing method embodiments are all expressed as a series of combinations of actions, but those skilled in the art will appreciate that the disclosed embodiments are not limited by the order of actions described, as some steps may occur in other orders or concurrently in accordance with the disclosed embodiments. In addition, those skilled in the art will appreciate that the embodiments described in the specification are all optional embodiments.
Fig. 4 is a schematic diagram of a performance abnormality analysis device provided in an embodiment of the present disclosure, in which command types and the key nodes in the command execution process are preconfigured. The analysis device may execute the processing flow provided by each embodiment of the performance abnormality analysis method. As shown in Fig. 4, the analysis device includes, but is not limited to: a determination unit 41, a generation unit 42 and an analysis unit 43. The functions of each unit are described as follows:
a determining unit for determining a critical event in the execution of the command based on the type of the command in response to receiving the command;
the generating unit is used for triggering the key event if the key event is detected in the executing process of the command; generating information of the key event in response to triggering the key event;
And the analysis unit is used for analyzing the performance abnormality based on the information of the key event to obtain a command and/or an abnormality cause which cause the performance abnormality.
In some embodiments, the association relationship among the command type, the key nodes in the command execution process and the key events corresponding to the key nodes is preconfigured, where a key node is an execution step in the command execution process.
A determining unit 41, configured to determine, in response to receiving the command, a key node and a key event corresponding to the key node in the execution process of the command based on the type and the association relationship of the command;
the generating unit 42 is configured to trigger a key event corresponding to a key node if the key node is detected during the execution of the command; generating information of the key event in response to triggering the key event;
and an analysis unit 43, configured to analyze the performance abnormality based on the information of the key event, so as to obtain a command, a key node and/or an abnormality cause that cause the performance abnormality.
In some embodiments, the analysis device of performance anomaly further includes a buffering unit for buffering the information of the critical event after the generating unit 42 generates the information of the critical event. An analysis unit 43 for: acquiring a performance detection value; if the performance detection value meets the triggering condition preset by the abnormal performance, backing up the cached information of the key event; and analyzing the performance abnormality based on the information of the backup key event to obtain a command and/or an abnormality cause which cause the performance abnormality.
In some embodiments, the analysis unit 43 is configured to:
determine, based on the information of the key events, each abnormal command related to the performance abnormality and the information of each key event corresponding to each abnormal command;
perform event semantic analysis on the performance abnormality based on the information of each key event corresponding to each abnormal command, to obtain the abnormality cause that caused the performance abnormality; here, event semantic analysis refers to analyzing whether a cause of the performance abnormality exists among the key events corresponding to an abnormal command.
In some embodiments, when the performance abnormality is a delay abnormality, the analysis unit 43 performing event semantic analysis on the performance abnormality based on the information of each key event corresponding to each abnormal command, to obtain the abnormality cause that caused the performance abnormality, includes:
if the time interval between two adjacent key events of the same abnormal command is greater than or equal to a preset first time interval, determining that the abnormality cause is an execution timeout of the earlier of the two adjacent key events; and/or
if the same executor executes a plurality of key events, and the time interval between the first key event and the last key event executed by the executor is greater than or equal to a preset second time interval, determining that the abnormality cause is that the executor occupied the CPU for too long; and/or
if, for any executor, the time interval between starting to wait for a resource or key event and acquiring that resource or key event is greater than or equal to a preset third time interval, the executor's CPU occupancy has not timed out, and the earlier of two adjacent key events of the same abnormal command executed by the executor has not timed out, determining that the abnormality cause is that the executor timed out waiting for the resource or key event; and/or
if the time interval between the NAND flash memory controller acquiring a command and responding to the command is greater than or equal to a preset fourth time interval, determining that the abnormality cause is abnormal processing capability of the NAND flash memory controller.
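The first two delay rules can be sketched as interval checks over timestamped event records. The record layout (dicts with `time`, `executor` and `event_type` keys), the function names and the threshold values are assumptions for illustration; the third and fourth rules follow the same pattern with wait and controller timestamps.

```python
def adjacent_event_timeouts(events, first_interval):
    """Rule 1 sketch: within one abnormal command, return the earlier event
    of every adjacent pair whose gap is >= first_interval."""
    ordered = sorted(events, key=lambda e: e["time"])
    return [
        prev
        for prev, nxt in zip(ordered, ordered[1:])
        if nxt["time"] - prev["time"] >= first_interval
    ]


def executor_cpu_timeouts(events, second_interval):
    """Rule 2 sketch: return executors whose span from first to last
    executed key event is >= second_interval."""
    spans = {}
    for e in events:
        first, last = spans.get(e["executor"], (e["time"], e["time"]))
        spans[e["executor"]] = (min(first, e["time"]), max(last, e["time"]))
    return sorted(
        name for name, (first, last) in spans.items()
        if last - first >= second_interval
    )
```

An executor with a single recorded event has a span of zero, so it is only flagged if the configured interval is not positive.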
In some embodiments, the information of the critical event includes at least one of:
time, event type, executor, additional information;
wherein the event type is an event type preconfigured based on a key event;
the executor is an execution subject for executing the key event;
the additional information is preconfigured information providing basis for performance anomaly analysis, and comprises at least one of the following:
NAND flash addresses, context addresses inside SSD firmware, and summary information.
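One possible in-memory layout for this information is sketched below. The field names, the microsecond time unit and the use of optional integers for the addresses are assumptions for illustration, not the format used by the disclosure.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class AdditionalInfo:
    """Optional evidence for the analysis (all fields hypothetical)."""
    nand_address: Optional[int] = None      # NAND flash address
    context_address: Optional[int] = None   # context address inside SSD firmware
    summary: str = ""                       # free-form summary information


@dataclass(frozen=True)
class KeyEventInfo:
    """Information generated when a key event is triggered."""
    time_us: int        # time of the event, here in microseconds
    event_type: str     # pre-configured event type
    executor: str       # execution subject that executed the key event
    additional: AdditionalInfo = field(default_factory=AdditionalInfo)
```

Frozen dataclasses are used so that a cached record cannot be modified after it has been generated.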
In some embodiments, the analysis unit 43 performing event semantic analysis on the performance abnormality based on the information of each key event corresponding to each abnormal command, to obtain the abnormality cause that caused the performance abnormality, includes:
if a read command and a write command are executed at the same NAND flash memory address and a read-write conflict exists, determining that the abnormality cause is a read-write conflict at that NAND flash memory address; and/or
if a plurality of key events are executed by the same context address in the SSD firmware and one of those key events was processed with a timeout, determining that the abnormality cause is that a key event processed with a timeout exists at that context address.
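The read-write conflict check can be sketched by grouping accesses by NAND flash address. The disclosure does not define when a read and a write at the same address conflict; the sketch assumes that overlapping time windows constitute a conflict, and the tuple layout and function name are hypothetical.

```python
from collections import defaultdict


def find_rw_conflicts(accesses):
    """accesses: iterable of (nand_address, op, start, end), op is "read" or "write".
    Returns the sorted addresses where a read window overlaps a write window
    (overlap is the assumed conflict criterion)."""
    by_addr = defaultdict(list)
    for addr, op, start, end in accesses:
        by_addr[addr].append((op, start, end))

    conflicts = []
    for addr, ops in by_addr.items():
        reads = [(s, e) for op, s, e in ops if op == "read"]
        writes = [(s, e) for op, s, e in ops if op == "write"]
        # Two half-open windows overlap when each starts before the other ends.
        if any(r_start < w_end and w_start < r_end
               for r_start, r_end in reads
               for w_start, w_end in writes):
            conflicts.append(addr)
    return sorted(conflicts)
```

The context-address rule can reuse the same grouping idea, keyed by context address instead of NAND flash address.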
In some embodiments, the analysis unit 43 is configured to:
determining a complete command period of the command based on event semantics of the key event and command identification of the command;
extracting all logs of the command that occur in one complete command cycle;
and carrying out event semantic analysis based on all logs and information of key events which occur in one complete command period of the command to obtain an abnormal cause causing abnormal performance.
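A sketch of the cycle extraction follows. It assumes that a "command received" event and a "command completed" event with the same command identifier mark the two ends of a complete command period, and that each identifier appears in only one period within the captured data. The event type names and record fields are hypothetical.

```python
def command_cycle(events, command_id,
                  start_type="CMD_RECEIVED", end_type="CMD_COMPLETED"):
    """Return (start_time, end_time) of the command's complete period,
    or None if either boundary event is missing."""
    times = {
        e["event_type"]: e["time"]
        for e in events
        if e["command_id"] == command_id
        and e["event_type"] in (start_type, end_type)
    }
    if start_type not in times or end_type not in times:
        return None
    return times[start_type], times[end_type]


def logs_in_cycle(logs, command_id, cycle):
    """Extract all log records of the command that fall inside the cycle."""
    start, end = cycle
    return [
        rec for rec in logs
        if rec["command_id"] == command_id and start <= rec["time"] <= end
    ]
```

A command whose completion event has not been recorded yields no cycle, so incomplete commands are not analyzed against a partial log set.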
In some embodiments, the analysis unit 43 performing event semantic analysis based on the information of the key events and all logs of the command that occur in one complete command period, to obtain the abnormality cause that caused the performance abnormality, includes:
performing hierarchical event semantic processing based on the information of the key events and all logs of the command that occur in one complete command period, to obtain the abnormality cause that caused the performance abnormality, where the hierarchical event semantic processing includes:
performing a primary analysis of the performance index type based on the information of the key events, to determine whether the command is abnormal;
if the command is abnormal, performing a secondary analysis of the performance index type based on the information of the key events and all logs of the command in one complete command period, to obtain the abnormality cause that caused the performance abnormality.
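The two-level structure can be sketched as a cheap first pass that decides whether a command is abnormal, followed by a more detailed pass that runs only for abnormal commands. The criteria below (total latency for the primary analysis, the largest gap between adjacent key events for the secondary analysis) are assumptions chosen for a delay-type index, and the function names are hypothetical.

```python
def primary_analysis(events, latency_limit):
    """Primary analysis sketch: the command is abnormal if the span of its
    key events is >= latency_limit."""
    times = [e["time"] for e in events]
    return bool(times) and max(times) - min(times) >= latency_limit


def secondary_analysis(events, logs):
    """Secondary analysis sketch: locate the largest gap between adjacent
    key events and collect the logs recorded inside that gap."""
    ordered = sorted(events, key=lambda e: e["time"])
    if len(ordered) < 2:
        return None
    prev, nxt = max(zip(ordered, ordered[1:]),
                    key=lambda pair: pair[1]["time"] - pair[0]["time"])
    related = [rec for rec in logs if prev["time"] <= rec["time"] <= nxt["time"]]
    return {
        "slow_event": prev["event_type"],
        "gap": nxt["time"] - prev["time"],
        "logs": related,
    }


def hierarchical_analysis(events, logs, latency_limit):
    """Run the secondary analysis only when the primary analysis flags the command."""
    if not primary_analysis(events, latency_limit):
        return None
    return secondary_analysis(events, logs)
```

Gating the detailed pass on the primary result keeps the log inspection limited to commands already known to be abnormal.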
In some embodiments, when the performance abnormality is a bandwidth abnormality, the analysis unit 43 performing event semantic analysis on the performance abnormality based on the information of each key event corresponding to each abnormal command, to obtain the abnormality cause that caused the performance abnormality, includes:
determining, based on the information of each key event corresponding to each abnormal command, the bandwidth occupied by the command execution process corresponding to each key event;
judging whether the bandwidth occupied by the command execution process corresponding to each key event is greater than or equal to a bandwidth critical value corresponding to a bandwidth abnormality;
if the occupied bandwidth is greater than or equal to the critical value, determining that the bandwidth abnormality occurs in the corresponding command execution process.
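The bandwidth check can be sketched as a per-stage throughput computation followed by a threshold comparison. The sketch assumes that each key event's information yields the number of bytes moved and the duration of the corresponding execution stage; these fields, the MB/s unit and the function name are assumptions for illustration.

```python
def bandwidth_anomalies(stages, critical_mb_per_s):
    """stages: list of dicts with 'event_type', 'bytes' and 'duration_s'.
    Returns (event_type, MB/s) for every stage whose occupied bandwidth
    is >= the critical value."""
    flagged = []
    for stage in stages:
        if stage["duration_s"] <= 0:
            continue  # skip records without a usable duration
        mb_per_s = stage["bytes"] / stage["duration_s"] / 1e6
        if mb_per_s >= critical_mb_per_s:
            flagged.append((stage["event_type"], mb_per_s))
    return flagged
```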
In some embodiments, the performance abnormality analysis device further includes a log unit, configured to generate an analysis log after the analysis unit 43 analyzes the performance abnormality; the analysis log includes at least one of the following:
the number of abnormal commands, the number of commands for each abnormality cause, a file list of the abnormal commands for each abnormality cause, and the number of abnormal commands processed by each chip in the SSD firmware; the files in the file list are equal in number to the abnormal commands and correspond to them one to one, and each file records the information of each key event generated during execution of the corresponding abnormal command.
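The contents of such an analysis log can be sketched as a few aggregations over the analyzed abnormal commands. The record fields (`command_id`, `cause`, `chip`), the file naming scheme and the output layout are hypothetical.

```python
from collections import Counter


def build_analysis_log(abnormal_commands):
    """abnormal_commands: list of dicts with 'command_id', 'cause' and 'chip'.
    Returns a summary with one per-command file name listed under each cause."""
    files_by_cause = {}
    for cmd in abnormal_commands:
        # One file per abnormal command; the name pattern is illustrative only.
        files_by_cause.setdefault(cmd["cause"], []).append(
            f"cmd_{cmd['command_id']}.log"
        )
    return {
        "abnormal_count": len(abnormal_commands),
        "count_by_cause": dict(Counter(c["cause"] for c in abnormal_commands)),
        "files_by_cause": files_by_cause,
        "count_by_chip": dict(Counter(c["chip"] for c in abnormal_commands)),
    }
```

Because each abnormal command contributes exactly one file name, the total number of listed files equals the number of abnormal commands.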
According to the performance abnormality analysis device provided by at least one embodiment of the present disclosure, the association among the command type, the key nodes in the command execution process, and the key events corresponding to the key nodes is pre-configured. During command execution, as each key node is executed, the corresponding key event is triggered, so that the information of the key events is collected in real time. Analyzing the performance abnormality with this information makes it possible to determine its specific cause, namely the time at which the performance abnormality occurred, which key node of which command executed abnormally, and what the abnormality cause is.
For details of the embodiments of the analysis device for abnormal performance, reference is made to the embodiments of the analysis method for abnormal performance, and details are not repeated.
In an embodiment of the present disclosure, there is also provided a storage device (or solid state storage device, etc.), including a control unit and an NVM (Non-Volatile Memory) chip, where the control unit performs the performance abnormality analysis method.
Fig. 5 is an exemplary block diagram of an electronic device provided by an embodiment of the present disclosure. As shown in fig. 5, the electronic device includes: a memory 51, a processor 52 and a computer program stored on said memory 51. It is to be understood that the memory 51 in the present embodiment may be a volatile memory or a nonvolatile memory, or may include both volatile and nonvolatile memories.
In some embodiments, the memory 51 stores the following elements, executable modules or data structures, or a subset thereof, or an extended set thereof: an operating system and application programs.
The operating system includes various system programs, such as a framework layer, a core library layer, and a driver layer, which are used to implement various basic services and to handle hardware-based tasks. The application programs include various applications, such as a media player and a browser, which are used to implement various application services. A program implementing the performance abnormality analysis method provided by the embodiments of the present disclosure may be included in an application program.
In the embodiment of the present disclosure, the at least one processor 52 is configured to execute the steps of the embodiments of the analysis method for performance anomalies provided by the embodiment of the present disclosure by calling a program or an instruction stored in the at least one memory 51, specifically, a program or an instruction stored in an application program.
The analysis method of the performance anomaly provided by the embodiment of the present disclosure may be applied to the processor 52 or implemented by the processor 52. Processor 52 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuitry in hardware in processor 52 or by instructions in the form of software. The processor 52 may be a general purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field programmable gate array (Field Programmable Gate Array, FPGA) or other programmable logic device, discrete gate or transistor logic device, discrete hardware components. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The steps of the performance abnormality analysis method provided in the embodiments of the present disclosure may be directly embodied as being executed by a hardware decoding processor, or executed by a combination of hardware and software modules in a decoding processor. The software modules may be located in a storage medium well known in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or registers. The storage medium is located in the memory 51; the processor 52 reads the information in the memory 51 and, in combination with its hardware, performs the steps of the method.
The embodiments of the present disclosure further provide a computer-readable storage medium storing a program or instructions that cause a computer to perform the steps of each embodiment of the performance abnormality analysis method; these are not described again here to avoid repetition. The computer-readable storage medium may be a non-transitory computer-readable storage medium.
The embodiments of the present disclosure also provide a computer program product including a computer program stored in a computer-readable storage medium, which may be a non-transitory computer-readable storage medium. At least one processor of a computer reads the computer program from the computer-readable storage medium and executes it, so that the computer performs the steps of each embodiment of the performance abnormality analysis method; these are not described again here to avoid repetition.
The apparatus or device embodiments described above are merely illustrative, in which the unit modules illustrated as separate components may or may not be physically separate, and the components shown as unit modules may or may not be physical units, may be located in one place, or may be distributed over multiple network module units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by software plus a general-purpose hardware platform, or by hardware. Based on this understanding, the foregoing technical solution, in essence or in the part that contributes to the related art, may be embodied in the form of a software product. The software product may be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disk, and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute the method of each embodiment or of some parts of the embodiments.
Finally, it should be noted that the above embodiments merely illustrate the technical solutions of the present disclosure and do not limit them. Within the idea of the present disclosure, the technical features of the above embodiments or of different embodiments may also be combined, the steps may be implemented in any order, and many other variations of the different aspects of the present disclosure exist as described above, which are not provided in detail for the sake of brevity. Although the present disclosure has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments can still be modified, or some of their technical features can be replaced by equivalents, and that such modifications and substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present disclosure.

Claims (10)

1. A performance abnormality analysis method, wherein command types and key events in the command execution process are pre-configured, the method comprising:
in response to receiving a command, determining a key event in the execution process of the command based on the type of the command;
triggering the key event if the key event is detected in the execution process of the command; generating information of the key event in response to triggering the key event;
and analyzing the performance abnormality based on the information of the key event to obtain a command and/or an abnormality cause that caused the performance abnormality.
2. The method of claim 1, wherein the method further comprises: pre-configuring an association among the command type, key nodes in the command execution process, and key events corresponding to the key nodes; wherein a key node is an execution step in the command execution process;
the determining a key event in the execution process of the command based on the type of the command comprises: determining a key node in the execution process of the command and the key event corresponding to the key node based on the type of the command and the association;
and the triggering the key event if the key event is detected comprises: triggering the key event corresponding to the key node if the key node is detected.
3. The method of claim 1, wherein after the generating the information of the critical event, the method further comprises: caching information of the key event;
the analyzing the performance abnormality based on the information of the key event to obtain a command and/or an abnormality cause causing the performance abnormality comprises:
acquiring a performance detection value;
if the performance detection value meets the triggering condition preset by the abnormal performance, backing up the cached information of the key event;
and analyzing the performance abnormality based on the information of the backup key event to obtain a command and/or an abnormality cause which cause the performance abnormality.
4. The method of claim 1, wherein the analyzing the performance anomalies based on the information of the critical event comprises:
based on the information of the key events, determining each abnormal command related to the performance abnormality and the information of each key event corresponding to each abnormal command;
based on the information of each key event corresponding to each abnormal command, carrying out event semantic analysis on the performance abnormality to obtain an abnormality cause causing the performance abnormality; the event semantic analysis refers to analyzing whether an abnormal reason causing abnormal performance exists among the key events corresponding to the abnormal command.
5. The method according to claim 4, wherein when the performance abnormality is a delay abnormality, the performing event semantic analysis on the performance abnormality based on the information of each key event corresponding to each abnormal command to obtain the abnormality cause that caused the performance abnormality comprises:
if the time interval between two adjacent key events of the same abnormal command is greater than or equal to a preset first time interval, determining that the abnormality cause is an execution timeout of the earlier of the two adjacent key events; and/or
if the same executor executes a plurality of key events, and the time interval between the first key event and the last key event executed by the executor is greater than or equal to a preset second time interval, determining that the abnormality cause is that the executor occupied the CPU for too long; and/or
if, for any executor, the time interval between starting to wait for a resource or key event and acquiring that resource or key event is greater than or equal to a preset third time interval, the executor's CPU occupancy has not timed out, and the earlier of two adjacent key events of the same abnormal command executed by the executor has not timed out, determining that the abnormality cause is that the executor timed out waiting for the resource or key event; and/or
if the time interval between the NAND flash memory controller acquiring a command and responding to the command is greater than or equal to a preset fourth time interval, determining that the abnormality cause is abnormal processing capability of the NAND flash memory controller.
6. The method of any of claims 1 to 5, wherein the information of the critical event comprises at least one of:
time, event type, executor, additional information;
wherein the event type is an event type preconfigured based on a key event;
the executor is an execution subject for executing the key event;
the additional information is preconfigured information for providing basis for performance anomaly analysis, and comprises at least one of the following:
NAND flash addresses, context addresses inside SSD firmware, and summary information.
7. The method of claim 6, wherein the performing event semantic analysis on the performance abnormality based on the information of each key event corresponding to each abnormal command to obtain the abnormality cause that caused the performance abnormality comprises:
if a read command and a write command are executed at the same NAND flash memory address and a read-write conflict exists, determining that the abnormality cause is a read-write conflict at that NAND flash memory address; and/or
if a plurality of key events are executed by the same context address in the SSD firmware and one of those key events was processed with a timeout, determining that the abnormality cause is that a key event processed with a timeout exists at that context address.
8. The method of claim 1, wherein the analyzing the performance abnormality based on the information of the key event comprises:
determining a complete command cycle of the command based on event semantics of the critical event and command identification of the command;
extracting all logs of the command which occur in one complete command cycle;
and carrying out event semantic analysis based on all logs of the command and the information of the key event, and obtaining an abnormal cause causing abnormal performance.
9. The method of claim 8, wherein the performing event semantic analysis based on all logs of the command and the information of the key event to obtain the abnormality cause that caused the performance abnormality comprises:
performing hierarchical event semantic processing based on all logs of the command that occur in one complete command period and the information of the key event, to obtain the abnormality cause that caused the performance abnormality; wherein the hierarchical event semantic processing comprises:
performing a primary analysis of the performance index type based on the information of the key event, to determine whether the command is abnormal;
and if the command is abnormal, performing a secondary analysis of the performance index type based on all logs of the command and the information of the key event, to obtain the abnormality cause that caused the performance abnormality.
10. A storage device, comprising: a control unit and an NVM chip, the control unit performing the steps of the performance abnormality analysis method according to any one of claims 1 to 9.
CN202311429740.2A 2023-10-31 2023-10-31 Performance abnormality analysis method and storage device Pending CN117608956A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311429740.2A CN117608956A (en) 2023-10-31 2023-10-31 Performance abnormality analysis method and storage device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311429740.2A CN117608956A (en) 2023-10-31 2023-10-31 Performance abnormality analysis method and storage device

Publications (1)

Publication Number Publication Date
CN117608956A true CN117608956A (en) 2024-02-27

Family

ID=89955179

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311429740.2A Pending CN117608956A (en) 2023-10-31 2023-10-31 Performance abnormality analysis method and storage device

Country Status (1)

Country Link
CN (1) CN117608956A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination