WO2024066506A1 - Data monitoring and analysis method, apparatus, server, operation and maintenance system, and storage medium

Info

Publication number
WO2024066506A1
Authority
WO
WIPO (PCT)
Prior art keywords
server
monitoring
change command
identifier
data
Application number
PCT/CN2023/101436
Inventor
上官栋栋
张钧宇
曾维富
Original Assignee
华为云计算技术有限公司
Application filed by 华为云计算技术有限公司

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/32 Monitoring with visual or acoustical indication of the functioning of the machine

Definitions

  • the present application relates to the field of computers, and in particular to data monitoring and analysis methods, devices, servers, operation and maintenance systems, and storage media.
  • Operation and maintenance personnel will execute commands to change applications according to user needs to delete, modify or add application data and modify the behavior of applications.
  • when the server runs the application, it monitors the application to find operational failures caused by changes.
  • the server will generate a large amount of monitoring data when monitoring all processes in the application, and the efficiency of analyzing all monitoring data at the same time is low. Therefore, how to provide a more effective monitoring and analysis method for the change process has become an urgent problem to be solved.
  • the present application provides a data monitoring and analysis method, device, server, operation and maintenance system and storage medium to solve the problem of low efficiency in analyzing monitoring data in conventional technologies.
  • a data monitoring and analysis method is provided, which is executed by one or more servers in a server cluster, and the data monitoring and analysis method includes: first, the server receives and executes a change command, and the change command is used to instruct to perform a change operation on a first object in a first application. Secondly, in the process of executing the change command, the server determines the monitoring object associated with the first object, and uses a pre-deployed monitor to obtain the identifier of the first object and the identifier of the monitoring object generated during the operation of the monitoring object. Finally, the server performs a risk assessment based on the identifier of the first object and the identifier of the monitoring object, and obtains the risk level corresponding to the change command.
  • the server may determine whether to issue an alarm according to the risk level.
  • the first object may be a data file or process in the first application.
  • the monitoring object may be a process associated with the change command.
  • the server only monitors the monitoring object associated with the change command, which reduces the monitoring scope and improves the monitoring accuracy and efficiency. Moreover, since the scope of the process that the server needs to monitor is reduced, the amount of monitoring data generated by the server is reduced, the redundant data generated during the data monitoring process is reduced, and the occupation of storage resources in the server by the monitoring data is reduced. In addition, the server only analyzes the data of the monitoring object associated with the change command, and does not need to analyze the aforementioned redundant data, which improves the data analysis efficiency of the server.
  • the monitoring object includes a second object in the first application and an object in the second application, wherein the second application is an application that interacts with the first application.
  • the above objects are used to indicate a process.
  • the monitoring object is associated with the first object through a system call function.
  • the monitoring object can read the data file through a read function, and the monitoring object is associated with the first object based on the read function.
  • the monitoring object can create the first object through a process creation (copy_process) function, and the monitoring object is associated with the first object based on the copy_process function.
  • the server can clearly display the relationship between the monitored object and the first object, and then accurately associate the monitored object with the first object, ensuring that the process of performing change operations on the first object belongs to the monitored object, avoiding the problem of incomplete monitored data due to missing monitored objects, and improving the accuracy of monitoring.
  • the change operation includes one or more of adding, deleting, and modifying.
  • the server processes the identifier of the first object and the identifier of the monitored object using a preset risk assessment model to obtain a risk level corresponding to the change command.
  • the server processes the identifier of the first object and the identifier of the monitored object using the risk assessment model. Rather than analyzing indicator data, the server inputs the identifier of the first object and the identifier of the monitored object into the risk assessment model to determine the risk level of the change command, and thereby determines whether to issue an alarm.
  • after the change command is executed, when a failure occurs in the server or in another server in the server cluster, the server obtains alarm information corresponding to the failure and retrieves the identifier of the first object and the identifier of the monitored object based on the alarm information. The server determines the corresponding operation log based on the retrieved identifier of the first object and the identifier of the monitored object, and then obtains the change command in the operation log.
  • the alarm information is used to indicate the fault data generated by the server during operation.
  • the operation log is used to indicate the operation records of multiple change commands for the first application, and the multiple change commands include the aforementioned change command.
  • the server may use a spatiotemporal retrieval algorithm of graph computing to determine the identifier of the first object and the identifier of the monitored object that match the alarm information from the data stored in the server.
  • the server uses a spatiotemporal retrieval algorithm based on graph computing to retrieve data.
  • the spatiotemporal retrieval algorithm based on graph computing refers to a deep semantic matching model (DSSM). If the server queries the impact surface data corresponding to the alarm information based on the DSSM, it can quickly obtain the impact surface data with the highest matching degree with the alarm information, thereby improving the retrieval efficiency of the impact surface data.
  • the server determines the change command in the operation log corresponding to the identifier of the aforementioned first object and the identifier of the monitored object, and outputs it to the front end, indicating to the user the change command that may cause a fault, thereby shortening the time consumption of abnormality troubleshooting and improving the efficiency of abnormality troubleshooting.
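  • The lookup from alarm information back to a change command can be sketched as follows. This is only an illustration: it replaces the graph-computing spatiotemporal retrieval described above with a plain exact-match scan, and the names `ImpactRecord`, `find_suspect_commands`, and the sample log entries are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImpactRecord:
    first_object_id: str      # identifier of the first object, e.g. a resource name
    monitored_object_id: str  # identifier of the monitored object, e.g. a PID
    log_id: str               # hypothetical key of the matching operation-log entry


def find_suspect_commands(alarm_resource, impact_records, operation_log):
    """Return change commands whose stored identifiers match the alarmed resource."""
    suspects = []
    for rec in impact_records:
        if rec.first_object_id == alarm_resource:
            cmd = operation_log.get(rec.log_id)
            if cmd is not None and cmd not in suspects:
                suspects.append(cmd)
    return suspects


# Hypothetical stored data: impact records and an operation log keyed by log_id.
records = [
    ImpactRecord("/etc/app/search.conf", "pid:4321", "op-1"),
    ImpactRecord("/var/log/app.log", "pid:4400", "op-2"),
]
oplog = {
    "op-1": "sed -i s/a/b/ /etc/app/search.conf",
    "op-2": "tail /var/log/app.log",
}
suspects = find_suspect_commands("/etc/app/search.conf", records, oplog)
```

  • The returned list corresponds to the change commands that would be shown to the user as possible causes of the fault.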
  • the server uses the change command and the corresponding risk level as training data to update the interception model deployed in the bastion host, and the updated interception model is used to intercept some change commands.
  • the server includes the bastion host.
  • the updated interception model can more accurately intercept execution commands whose risk levels meet the set conditions, thereby improving the accuracy of interception.
  • the server does not need to judge the execution commands intercepted by the interception model, reducing the number of execution commands that need to be monitored and analyzed during the data monitoring and analysis process, which is conducive to improving the efficiency of monitoring and analysis.
  • the server calls a preset detection point, which triggers a monitor deployed in advance in the server, and the monitor obtains the system resources processed by the monitored object when it is running, and obtains the identifier of the first object and the identifier of the monitored object.
  • the tracking point includes the detection point and the monitor.
  • the above-mentioned monitor can be an extended Berkeley Packet Filter (eBPF).
  • the server cooperates with the detection point and the monitor according to the tracking point.
  • the detection point only monitors the monitoring object that executes the preset command or function, thereby further narrowing the monitoring scope and avoiding the generation of redundant monitoring data.
  • the aforementioned detection point triggers the monitor to monitor the aforementioned monitoring object, and obtains the identification of the first object and the identification of the monitoring object.
  • the server inputs the impact surface data into the risk assessment model for analysis and processing, determines the risk level of the change command, and thus determines whether to issue an alarm, thereby improving the accuracy of the server's alarm.
  • when the monitored object is a remote access process, the server obtains a message generated while the monitored object is running and parses the message to obtain an identifier of the first object and an identifier of the monitored object.
  • the server can obtain the messages generated when running the monitored object through the Express Data Path (XDP) and parse them.
  • the server parses the message of the remote access process, determines the type of remote service accessed in the impact surface data, and monitors the remote access process, thereby increasing the types of monitored processes and improving monitoring efficiency.
  • the server sends at least one of the identifier of the first object, the identifier of the monitored object, and the risk level to the front end of the terminal for display.
  • the front end here may refer to a display connected to the terminal, or a display screen provided by the terminal, etc., which is not limited in this application.
  • the server displays the aforementioned data on the front end, realizing data visualization, and the user can process the command input to the server in a timely manner according to the visualized data.
  • a data monitoring and analysis device is provided, which is applied to a server and includes modules for executing the data monitoring and analysis method in the first aspect or any possible implementation of the first aspect.
  • the data monitoring and analysis device includes: a receiving module, an object determination module, and a level determination module.
  • the receiving module is used to receive a change command;
  • the object determination module is used to determine the monitoring object associated with the first object during the execution of the change command;
  • the level determination module is used to determine the risk level corresponding to the change command based on the identifier of the first object and the identifier of the monitoring object.
  • the change command is used to instruct the execution of a change operation on the first object in the first application.
  • the beneficial effects can be found in the description of any possible implementation in the first aspect, which will not be repeated here.
  • the data monitoring and analysis device has the function of implementing the behavior in the method example in any possible implementation in the first aspect.
  • the function can be implemented by hardware, or by hardware executing corresponding software.
  • the hardware or software includes one or more modules corresponding to the above functions.
  • a server is provided, comprising at least one processor and a memory, wherein the memory stores instructions.
  • the processor calls the instructions to implement the method in the first aspect and any possible implementation manner of the first aspect.
  • an operation and maintenance system is provided, comprising a bastion host and multiple servers.
  • the bastion host is used to receive and filter execution commands to obtain change commands.
  • the server is used to execute the change command and monitor and analyze the process of executing the change command to implement the method in the first aspect and any possible implementation of the first aspect.
  • the present application provides a computer-readable storage medium in which a computer program or instructions are stored.
  • when the computer program or instructions are executed, the method in the first aspect and any possible implementation manner of the first aspect is implemented.
  • the present application provides a computer program product, which includes a computer program or instructions.
  • the processing device executes the computer program or instructions to implement the method in the first aspect and any possible implementation of the first aspect.
  • FIG. 1 is an application scenario diagram of an operation and maintenance system provided by the present application.
  • FIG. 2 is a first monitoring diagram of eBPF provided by the present application.
  • FIG. 3 is a second monitoring diagram of eBPF provided by the present application.
  • FIG. 4 is a schematic diagram of XDP monitoring provided by the present application.
  • FIG. 5 is a schematic diagram of a process tree provided by the present application.
  • FIG. 6 is a flow chart of a data monitoring and analysis method provided by the present application.
  • FIG. 7 is a schematic diagram of an association of servers provided by the present application.
  • FIG. 8 is a schematic diagram of the structure of a data monitoring and analysis device provided by the present application.
  • FIG. 9 is a schematic diagram of the structure of a server provided by the present application.
  • the present application provides a data monitoring and analysis method, which includes: first, the server receives a change command, executes the change command, and saves a log record (or operation log) of the change command; the change command is used to indicate the execution of a change operation on a first object in a first application. Secondly, in the process of executing the change command, the server determines the monitoring object associated with the first object, and uses a pre-deployed monitor to obtain the identifier of the first object and the identifier of the monitoring object generated during the operation of the monitoring object. Finally, the server performs a risk assessment based on the identifier of the first object and the identifier of the monitoring object, and obtains the risk level corresponding to the change command. The server can determine whether to issue an alarm based on the risk level.
  • the first object may be a data file or process in the first application.
  • the monitoring object may be a process associated with the change command.
  • the server may use a deep learning model to perform a risk assessment on the identifier of the first object and the identifier of the monitoring object to obtain an analysis result (such as the aforementioned risk level).
  • the server only monitors the monitoring object associated with the change command, which reduces the monitoring scope and improves the monitoring accuracy and efficiency. Moreover, since the scope of the processes that the server needs to monitor is reduced, the amount of monitoring data generated by the server is reduced, the redundant data generated during the data monitoring process is reduced, and the occupation of storage resources in the server by the monitoring data is reduced. In addition, the server only analyzes the data of the monitoring object associated with the change command, and does not need to analyze the aforementioned redundant data, which improves the data analysis efficiency of the server.
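  • The three steps above (execute the change command, determine the associated monitoring objects, assess risk from the identifiers) can be sketched as a minimal pipeline. The rule-based `assess_risk` function is only a stand-in for the preset risk assessment model, and the level numbers, the `SENSITIVE_OBJECTS` set, and all function names are assumptions made for illustration.

```python
# Hypothetical set of first-object identifiers treated as sensitive.
SENSITIVE_OBJECTS = {"/etc/passwd"}


def assess_risk(first_object_id, monitored_ids):
    """Stand-in for the risk assessment model: map identifiers to a level 0..2."""
    if first_object_id in SENSITIVE_OBJECTS:
        return 2
    if len(monitored_ids) > 3:
        return 1
    return 0


def handle_change_command(command, first_object_id, find_monitored, alarm_level=2):
    """Collect identifiers of associated monitoring objects, then assess risk."""
    monitored_ids = find_monitored(command)  # only objects tied to this command
    level = assess_risk(first_object_id, monitored_ids)
    return {
        "command": command,
        "monitored": monitored_ids,
        "risk_level": level,
        "alarm": level >= alarm_level,
    }


# Usage with a hypothetical monitor that reports a single associated process.
result = handle_change_command(
    "cat /etc/passwd", "/etc/passwd", lambda cmd: ["pid:1001"]
)
benign = handle_change_command(
    "cat notes.txt", "notes.txt", lambda cmd: ["pid:1002"]
)
```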
  • the server's management of command execution is usually divided into three stages: pre-event management, in-process management, and post-event management.
  • Pre-event management refers to the screening and processing phase before the server executes the change command.
  • In-process management refers to the monitoring process during the server's execution of change commands.
  • Post-event management refers to the troubleshooting and model update process after the server executes the change command.
  • Interception error means that, in pre-event management, because execution commands are complex, the interception method used by the server can easily misjudge the risk of an execution command, so that a high-risk execution command is released or a normal execution command is intercepted. For example, modifying an ordinary file does not cause risk, but displaying the /etc/passwd file (the user database, whose fields give the user name, real name, home directory, encrypted password, and other user information) can cause serious consequences such as user information leakage.
  • Risk level refers to the risk status of server command execution, which can be divided into multiple levels.
  • the blacklist and whitelist method is used to intercept commands.
  • the risk level of the execution command is level 2, so the execution command is a command that needs to be intercepted.
  • the monitoring object refers to the process being monitored in in-process management, for example, the process that executes the read command.
  • Monitoring behavior refers to the action or behavior of the monitored object in in-process management.
  • the server monitors the resource calls of a process during its execution.
  • Secondary risk means that the commands in an operation and maintenance script are interrelated. If the previous command is intercepted by the server, the next command will be executed in an incorrect state, which may easily cause unknown risks.
  • a bastion host is a device that uses various technical means to monitor and record the operations of servers, network equipment, security equipment, databases and other equipment within the network in a specific network environment in order to protect the network and data from intrusion and damage from external and internal users, so as to centrally alarm, handle in a timely manner and audit and determine responsibility.
  • FIG. 1 is an application scenario diagram of an operation and maintenance system provided by the present application.
  • the operation and maintenance system 100 may include a bastion host 102 and n servers 103, where n is a positive integer.
  • the bastion host 102 and any server 103 may communicate via wire or wirelessly.
  • the impact surface data is used to indicate the identity of the first object and the identity of the monitored object.
  • the operation and maintenance system 100 also includes a terminal 101.
  • the bastion host 102 filters and intercepts the execution commands to obtain the filtered change commands.
  • the bastion host 102 forwards the change command to the server 103 where the first application that needs to perform the change operation is located.
  • the server 103 executes the change command, and the monitor on the server 103 obtains the impact surface data generated by the monitoring object associated with the change command during the execution process.
  • the server 103 determines the risk level of the aforementioned change command based on the impact surface data, and issues an alarm based on the risk level.
  • the operation and maintenance personnel can specify the changed server in the change system, set the operations to be performed, and upload the script program used for the change.
  • the change system then executes the change and returns the execution result, which can be an alarm or no alarm.
  • the interceptor may be a blacklist and whitelist interception algorithm.
  • the bastion host 102 uses the blacklist and whitelist interception algorithm to judge each execution command and obtain its risk level.
  • the bastion host 102 intercepts or releases the execution command according to the obtained risk level.
  • the blacklist and whitelist interception algorithm means that the bastion host 102 uses the input execution command to query whether the same command exists in the preset blacklist and whitelist. If the command is included in the blacklist, the corresponding risk level is determined; if the command is included in the whitelist, the execution command is released.
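  • A minimal sketch of the list lookup described above. The list contents, the risk-level values, and the handling of commands found in neither list (returned as "unknown") are assumptions; the text does not specify that case.

```python
def intercept(command, blacklist, whitelist):
    """Look up an execution command in the preset lists.

    blacklist maps a command to its risk level; whitelist is a set of
    commands that may be released. Both are hypothetical sample data.
    """
    if command in blacklist:
        return ("intercept", blacklist[command])
    if command in whitelist:
        return ("release", None)
    # Not covered by the description above; left for another decision step.
    return ("unknown", None)


blacklist = {"cat /etc/passwd": 2, "rm -rf /": 3}
whitelist = {"ls", "pwd"}

decision_black = intercept("cat /etc/passwd", blacklist, whitelist)
decision_white = intercept("ls", blacklist, whitelist)
decision_other = intercept("vim notes.txt", blacklist, whitelist)
```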
  • the interceptor may be a deep learning model.
  • the deep learning model may include models such as K-Nearest Neighbor (KNN) and Support Vector Machine (SVM).
  • the method for the deep learning model to process the execution command can refer to the processing steps of the black and white list interception algorithm for the execution command, which will not be repeated here.
  • a monitor will be deployed in the server 103 where the change operation needs to be performed.
  • this application provides two possible implementation methods as follows.
  • the server 103 deploys eBPF locally, and uses eBPF to obtain the identifier of the first object operated by the server 103 when running the monitoring object and the identifier of the monitoring object; wherein, the eBPF includes a kernel program, a collector, and an intermediate medium, and the intermediate medium is used to exchange data between the kernel program and the collector.
  • FIG. 2 is a first monitoring schematic diagram of eBPF provided by the present application.
  • the server 103 monitors the system resource status by deploying the eBPF kernel program in the kernel state.
  • the server 103 collects the system resource status by deploying the collector in the user state.
  • the server 103 also deploys an intermediate storage medium (eBPF Map) for exchanging data between the eBPF kernel program and the collector.
  • the eBPF Map is shared memory through which the eBPF kernel program and the collector interact.
  • when the server 103 executes the change command, the eBPF kernel program monitors the operation status of the system resources and writes the operation status into the eBPF Map; the collector then reads the operation status from the eBPF Map to obtain the impact surface data.
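  • The data flow between the kernel program, the eBPF Map, and the collector can be illustrated with a pure-Python simulation. This is not eBPF code: `SharedMap` is a hypothetical in-memory stand-in for the eBPF Map, and `kernel_probe` and `collect` only mimic the producer and consumer roles described above.

```python
from collections import deque


class SharedMap:
    """In-memory stand-in for the eBPF Map shared by kernel program and collector."""

    def __init__(self):
        self._events = deque()

    def push(self, event):
        self._events.append(event)

    def drain(self):
        events = list(self._events)
        self._events.clear()
        return events


def kernel_probe(shared_map, pid, syscall, resource):
    """Producer role: record one observed operation on a system resource."""
    shared_map.push({"pid": pid, "syscall": syscall, "resource": resource})


def collect(shared_map):
    """Consumer role: turn raw events into (first object id, monitored object id)."""
    return [(e["resource"], e["pid"]) for e in shared_map.drain()]


ebpf_map = SharedMap()
kernel_probe(ebpf_map, 4321, "read", "/etc/passwd")
kernel_probe(ebpf_map, 4321, "open", "/tmp/out.txt")
impact_surface = collect(ebpf_map)
leftover = collect(ebpf_map)  # the map is empty after the collector drains it
```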
  • FIG3 is a second monitoring schematic diagram of an eBPF provided by the present application. Taking the monitoring object accessing the local dynamic object through the socket as an example, the above eBPF will monitor the sys_recv and sys_send system calls to obtain the impact surface data.
  • the server 103 is further provided with a detection point, which is used to trigger the above-mentioned eBPF when the server 103 executes a preset command or function during the operation of the monitoring object.
  • when the server 103 receives the "vim" command to view the contents of a file, the server 103 first starts a child process through Bash to execute the vim program, and the vim program calls the system's open and read functions. Since the detection point is inserted in the read function in advance, when the child process executes the vim program and calls the read function, the detection point triggers eBPF to monitor and obtain the impact surface data of the child process.
  • the server cooperates with the detection points and eBPF included in the tracking points.
  • the detection points only monitor the monitoring objects that execute preset commands or functions, thereby further narrowing the monitoring scope and avoiding the generation of redundant monitoring data.
  • the detection point triggers eBPF to monitor the aforementioned monitoring object to obtain the impact surface data.
  • the server uses the impact surface data to determine the risk of the change command, such as inputting the impact surface data into the risk assessment model for processing, determining the risk level of the change command, and thus determining whether to issue an alarm.
  • the server 103 deploys the XDP program locally.
  • the server 103 uses the XDP program to parse the message received or sent by the monitored object in the server 103 to obtain the IP-Port information. Based on the IP-Port information, the server 103 can determine the access service in the impact surface data.
  • FIG. 4 is a schematic diagram of XDP monitoring provided by the present application.
  • the monitored object in the server 103 can access the service of a remote node through HTTP access, gRPC calls, etc., all of which use the TCP/IP protocol stack.
  • the XDP program runs before the TCP/IP technology stack, performs Ethernet protocol parsing, IP protocol parsing, and TCP protocol parsing on the message to obtain the above IP-Port information.
  • the XDP program writes the IP-Port information into the eBPF Map, and the collector obtains the above IP-Port information from the eBPF Map. Finally, the server 103 obtains the above access service based on the correspondence between the IP-Port information and the process.
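  • The layer-by-layer parsing that yields the IP-Port information can be illustrated in user space. The sketch below is plain Python operating on a raw IPv4/TCP Ethernet frame, not an XDP program (real XDP programs run in the kernel and are typically written in restricted C); the function name and the sample addresses are hypothetical.

```python
import socket
import struct


def parse_ip_port(frame: bytes):
    """Parse Ethernet -> IPv4 -> TCP headers and return (src_ip, src_port, dst_ip, dst_port)."""
    if len(frame) < 14:
        return None
    ethertype = struct.unpack("!H", frame[12:14])[0]
    if ethertype != 0x0800:  # only IPv4 is handled in this sketch
        return None
    ip = frame[14:]
    if len(ip) < 20:
        return None
    ihl = (ip[0] & 0x0F) * 4  # IPv4 header length in bytes
    if ihl < 20 or len(ip) < ihl + 4 or ip[9] != 6:  # protocol 6 is TCP
        return None
    src_ip = socket.inet_ntoa(ip[12:16])
    dst_ip = socket.inet_ntoa(ip[16:20])
    src_port, dst_port = struct.unpack("!HH", ip[ihl:ihl + 4])
    return (src_ip, src_port, dst_ip, dst_port)


# Build a minimal sample frame: zeroed MAC addresses, IPv4 header, TCP header.
eth = b"\x00" * 12 + struct.pack("!H", 0x0800)
ip_hdr = struct.pack(
    "!BBHHHBBH4s4s",
    0x45, 0, 40, 0, 0, 64, 6, 0,
    socket.inet_aton("10.0.0.1"), socket.inet_aton("10.0.0.2"),
)
tcp_hdr = struct.pack("!HH", 40000, 443) + b"\x00" * 16
ip_port = parse_ip_port(eth + ip_hdr + tcp_hdr)
```

  • In the scheme described above, the resulting IP-Port tuple would then be matched against the process that owns the connection to determine the accessed service.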
  • the server 103 may deploy the XDP program in a local kernel state.
  • the server 103 parses the message of the remote access process by using the XDP program, determines the type of remote service accessed in the impact surface data, and monitors the remote access process, thereby increasing the types of processes that can be monitored and improving the monitoring efficiency.
  • the impact surface data may include content as shown in Table 1 below.
  • the content of the impact surface data shown in Table 1 is only an example provided by this application and should not be understood as a limitation of this application.
  • the impact surface data may also include more or less content.
  • the identification of the first object includes the resource name of the above-mentioned calling object, the operation object, the accessed local cloud service or the accessed remote service, etc.
  • the identification of the monitored object includes the above-mentioned calling function name, process PID, calling parameters, etc.
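  • One possible in-memory shape for a single impact surface record, grouping the fields listed above. The class name, field names, and the `identifiers` helper are hypothetical and are not taken from Table 1.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ImpactSurfaceRecord:
    # Fields that make up the identifier of the first object.
    resource_name: str
    accessed_service: Optional[str]  # local cloud service or remote service, if any
    # Fields that make up the identifier of the monitored object.
    function_name: str
    pid: int
    call_args: Tuple[str, ...] = ()

    def identifiers(self):
        """Return (first object identifier, monitored object identifier)."""
        return (self.resource_name, f"{self.function_name}:{self.pid}")


record = ImpactSurfaceRecord("/etc/passwd", None, "read", 4321, ("fd=3",))
ids = record.identifiers()
```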
  • FIG. 5 is a schematic diagram of a process tree provided by the present application.
  • Each node in the process tree represents a process, the lines between the nodes represent the association relationships between the processes, and the serial number in a node represents the PID of the process.
  • the monitor uses the PID of the preset process (initial object) as the startup parameter, and uses the initial object and PID as the root node (root process) of the process tree. All processes in the process tree belong to the monitoring object.
  • When other processes are associated with processes in the process tree, those processes also belong to the monitoring object and are maintained in the process tree.
  • the server only monitors the processes in the process tree. This avoids the problem of incomplete monitored data caused by omitting processes that need to be monitored, which improves the accuracy of monitoring.
  • this application provides the following four optional scenarios.
  • the initial object in the server 103 receives and executes the change command, and during the process of the initial object executing the change command, the subprocess (second object) created by the server 103 is the monitoring object.
  • the first object is used to indicate a file in the first application.
  • a second object needs to be created to perform the operation. For example, when the change command needs to search for the content in the first object, a process for searching will be started.
  • eBPF determines whether the initial object has an action to create a second object by monitoring the use of the copy_process system call. If the copy_process system call is monitored, the second object is used as the monitoring object. The aforementioned copy_process system call is used as the association relationship between the initial object and the monitoring object, and the PID of the second object is maintained in the process tree.
  • the server 103 associates multiple processes through the process tree, and the association relationship between the multiple processes is determined by the system call function. Since all processes in the process tree will be monitored by the server 103, when the server 103 adds the aforementioned second object to the process tree, the second object will also be monitored by the server 103, ensuring that all processes associated with the change command will not be omitted by the server 103, thereby improving the integrity and accuracy of the server 103 monitoring all processes associated with the change command.
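  • Maintaining the process tree from observed copy_process events can be sketched as follows. The `ProcessTree` class and its method names are hypothetical; the sketch only models the rule that a child created by a process already in the tree becomes a monitoring object.

```python
class ProcessTree:
    """Tracks monitored PIDs, rooted at the initial object's PID."""

    def __init__(self, root_pid):
        self.root = root_pid
        self.parent = {root_pid: None}  # pid -> parent pid

    def on_copy_process(self, parent_pid, child_pid):
        """Handle one observed process creation; add the child if the parent is monitored."""
        if parent_pid in self.parent:
            self.parent[child_pid] = parent_pid
            return True
        return False  # creation by an unrelated process is ignored

    def is_monitored(self, pid):
        return pid in self.parent


tree = ProcessTree(root_pid=100)
added_child = tree.on_copy_process(100, 101)       # child of the initial object
added_grandchild = tree.on_copy_process(101, 102)  # child of a monitored child
added_unrelated = tree.on_copy_process(900, 901)   # parent is not in the tree
```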
  • the initial object interacts with other processes during the execution of the change command, and the other processes belong to the monitoring object.
  • the first object is used to indicate a process in the first application.
  • when the initial object executes the change command to process the running result of a process in the first application, that process in the first application is the monitoring object.
  • the process in the first application is associated with the initial object through a system call function.
  • the object in the second application is a monitoring object.
  • the first application and the second application may be applications running in a server.
  • the change command executed by the initial object needs to call a running process in other applications, and the running process is the monitoring object.
  • the initial object calls, views or modifies a file of a process in the first application or the second application during the execution of the change command, and the process belongs to the monitoring object.
  • the determination of the monitoring object is not limited to the above four situations.
  • if a monitoring object included in the process tree affects other processes when it executes the change command, those processes have an association relationship with the monitoring object and are therefore also monitoring objects.
  • FIG. 6 is a flow chart of a data monitoring and analysis method provided by this application.
  • the steps performed by the monitor and analyzer in the figure can be executed by the processor in the server 103.
  • the server 103 executes the received change command, and uses the monitor deployed in the server to monitor the process that executes the change command, obtains the impact surface data, and then determines whether to terminate the change based on the impact surface data.
  • the data monitoring and analysis method of this embodiment can be executed by one or more servers.
  • the data monitoring and analysis method provided in this embodiment includes the following steps S610 to S630 .
  • S610 The server 103 receives a change command.
  • the change command is used to instruct to perform a change operation on the first object in the first application.
  • the server 103 receives the change command sent by the bastion host 102.
  • the steps for the bastion host 102 to obtain the change command can refer to the content of the bastion host shown in FIG. 1 and will not be described in detail here.
  • the server 103 performs a change operation on the first object to change the behavior of the first application, for example, adding a new function to the first application, modifying an existing function in the first application, etc.
  • three possible scenarios for performing the change operation are provided.
  • the change operation is adding, where adding refers to adding a new process to execute commands, such as the server 103 adding a new process to execute commands in the first application, thereby adding a new function to the first application.
  • the change operation is deletion, where deletion refers to deleting an existing process that executes a command or deleting a data file.
  • for example, when the server 103 deletes the process that implements the search function in the first application, the search function in the first application goes offline; likewise, when the server 103 deletes the data file that supports the search function in the first application, the search function also goes offline.
  • the change operation is modification, which refers to modifying an existing process that executes a command or modifying a data file.
  • for example, when the server 103 modifies the process that implements the hot product push function in the first application, the hot product push function becomes an activity push function.
  • the change operation may also be a combination of the above-mentioned situations, such as the change operation includes addition and modification.
  • the data monitoring and analysis method provided in this embodiment further includes step S620 .
  • S620 The server 103 determines a monitoring object associated with the first object.
  • the monitoring object is the process that needs to be scheduled when the server 103 executes the change command.
  • the change command executed in the initial object may call or view other processes, and the other processes are associated with the change command, so the other processes are the monitoring objects.
  • the change command has a corresponding relationship with the first object, so the first object is associated with the monitoring object.
  • for more ways to determine the monitoring object, refer to the description of determining the monitoring object shown in Figure 5 above, which is not repeated here.
  • the change command received by the server 103 is an execution command that the bastion host 102 has screened in advance; refer to the description of the bastion host 102 shown in Figure 1 above, which is not repeated here.
  • the monitor deployed in advance by the server 103 obtains the operating system resource status of the monitored object during the execution of the change command, thereby obtaining the impact surface data.
  • the process of the monitor obtaining the impact surface data can refer to the process of deploying the monitor, which will not be repeated here.
  • this embodiment provides two possible examples.
  • the impact surface data is obtained locally on the server 103.
  • the acquisition method may refer to the above-mentioned content of the deployment monitor, which is not described in detail here.
  • the server 103 obtains the impact surface data of the remote server through a remote procedure call (RPC), and the remote server may also be deployed with a monitor.
  • the server 103 calls the remote server to execute the change command based on the RPC, and reads the impact surface data obtained by the monitor in the remote server.
  • S630 The server 103 determines the risk level corresponding to the change command according to the impact surface data.
  • the risk level is used to indicate the impact of the change command on the first application.
  • the server 103 determines whether to issue an alarm for the change command according to the risk level of the change command and the preset alarm table.
  • the alarm table is used to indicate the corresponding relationship between the risk level and the analysis result, and the analysis result is an alarm or no alarm.
  • the analyzer in the server 103 can judge the risk level of the impact surface data based on a preset rule or a deep learning model, and the server 103 determines whether to issue an alarm according to the risk level.
  • the following describes the processing of the impact surface data by the analyzer based on the deep learning model.
  • the server 103 inputs the collected impact surface data into a preset risk assessment model through a collector in the monitor for processing, obtains the risk level corresponding to the change command, and determines the analysis result based on the risk level.
  • the risk assessment model can be obtained by training with training data based on algorithms such as SVM, density-based spatial clustering of applications with noise (DBSCAN), KNN, neural network, etc.
  • the analysis result is used to instruct the server 103 to issue an alarm, and the risk level and the analysis result may have a corresponding relationship.
  • the alarm table is shown in Table 2 below.
  • after determining that the analysis result is an alarm, the server 103 sends an alarm to the terminal 101 or the bastion host 102 and terminates the execution of the current change command.
  • the server inputs the impact surface data into the risk assessment model for analysis and processing to determine the risk level of the change command, thereby determining whether to issue an alarm, thereby improving the accuracy of the server's alarms.
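The alarm-table lookup described above can be sketched as follows. Table 2 is not reproduced in this text, so the three risk levels and their mapping to "alarm" or "no alarm" are hypothetical placeholders, as are the names `ALARM_TABLE`, `Decision`, and `decide`.

```python
from dataclasses import dataclass

# Hypothetical alarm table: risk level -> analysis result (True = alarm).
# The actual levels and mapping of Table 2 are not given in this text.
ALARM_TABLE = {0: False, 1: False, 2: True}


@dataclass
class Decision:
    alarm: bool       # analysis result: alarm or no alarm
    terminate: bool   # whether to stop executing the change command


def decide(risk_level: int) -> Decision:
    """Look up the analysis result for a risk level in the alarm table."""
    # Assumption: a level missing from the table is treated as an alarm.
    alarm = ALARM_TABLE.get(risk_level, True)
    # Per the text, an alarm also terminates the current change command.
    return Decision(alarm=alarm, terminate=alarm)
```

Treating unknown levels as alarms is a conservative choice made only for this sketch; the text does not say how such a case is handled.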
  • the server 103 only monitors the monitoring objects associated with the change command, which narrows the monitoring scope and improves monitoring accuracy and efficiency. Moreover, because the range of processes that the server 103 needs to monitor is smaller, the server 103 generates less monitoring data, less redundant data is produced during data monitoring, and the monitoring data occupies fewer storage resources in the server. In addition, the server 103 only analyzes the data of the monitoring objects associated with the change command and does not need to analyze the aforementioned redundant data, which improves the data analysis efficiency of the server 103. The impact surface data obtained by the server 103 is the operating status of resources while the monitored object runs.
  • one or more servers in the server cluster fail.
  • the failed servers will issue a fault alarm.
  • server 103 obtains the alarm information in the fault alarm, retrieves the operation log associated with the impact surface data according to the alarm information, and determines one or more change commands corresponding to the alarm information.
  • after the server 103 obtains the impact surface data, it associates the impact surface data with the corresponding operation log.
  • the operation log stores operation records of multiple change commands for the first application, and the multiple change commands include the change command.
  • the alarm information indicates the fault data generated by the server executing the change command.
  • the server 103 searches and matches the fault data with the impact surface data to obtain one or more impact surface data.
  • the server 103 then obtains the change command in the operation log based on the correspondence between the impact surface data and the operation log, and outputs the change command to the terminal 101.
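The step from matched impact surface data back to change commands can be sketched as a simple join. The record identifiers, the `LogEntry` type, and the `impact_to_log` mapping below are hypothetical; the text states only that each piece of impact surface data is associated with its operation log.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogEntry:
    log_id: int     # hypothetical key of one operation-log record
    command: str    # the change command recorded in that entry


def commands_for(matched_ids, impact_to_log, operation_log):
    """Map matched impact surface record ids to the change commands behind them."""
    by_id = {entry.log_id: entry for entry in operation_log}
    seen, result = set(), []
    for record_id in matched_ids:
        log_id = impact_to_log.get(record_id)
        # Skip records with no log association and avoid duplicate commands.
        if log_id in by_id and log_id not in seen:
            seen.add(log_id)
            result.append(by_id[log_id].command)
    return result
```

The returned list is what would be output to the terminal 101 as candidate causes of the fault.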
  • the alarm information includes the content shown in Table 3 below.
  • the alarm information shown in Table 3 is only an example provided by the present application and should not be construed as a limitation of the present application.
  • the alarm information may also include more or less content.
  • the server 103 uses a spatiotemporal retrieval algorithm based on graph computing to match the above-mentioned alarm information with the impact surface data, wherein the impact surface data obtained by each server will be stored in the local storage of the server respectively, and the data stored in each server will be associated based on the logical order in which each server executes the business.
  • the server 103 determines the server that has failed based on the alarm information.
  • the server 103 determines the server that needs to be retrieved through the above-mentioned association relationship and the server that has failed.
  • if no matching impact surface data is found, the range of servers that need to be retrieved will be expanded until the impact surface data matching the alarm information is obtained.
  • the spatiotemporal retrieval algorithm based on graph computing can be a deep structured semantic model (DSSM), which is used to indicate the similarity between the alarm information and the multiple impact surface data stored in the server. The server obtains the maximum value of the similarity between the alarm information and the multiple impact surface data and, based on that maximum value, obtains the impact surface data corresponding to the alarm information.
  • Figure 7 is a schematic diagram of the association of servers provided by this application, showing the association relationship between the servers.
  • the overall process of the example shown in Figure 7 is as follows: (1) the product display server displays the product and, after receiving the customer's click to order, jumps to the order server for processing; (2) after the order server receives the payment completion instruction, it jumps to the inventory server, and the inventory server updates the product inventory; (3) after the inventory server updates the product inventory, it jumps to the delivery server for product delivery; (4) after the delivery server ships the product out of the warehouse, it sends an instruction to the order server indicating that the product has been shipped out of the warehouse.
  • Servers connected by a short line are directly associated, such as servers corresponding to ordering and delivery; servers connected by two short lines are indirectly associated, such as servers corresponding to product display and delivery.
  • the server 103 determines the server directly connected to the server corresponding to the inventory, such as the order server and the delivery server, as the server required to be retrieved.
  • the server 103 matches the alarm information with the impact surface data in the order server, the inventory server, and the delivery server to determine the impact surface data corresponding to the alarm information. If the server 103 fails to match the corresponding impact surface data during the above matching process, it will determine the server indirectly connected to the inventory server, such as the product display server, as the server required to be retrieved.
  • the server 103 can match the impact surface data stored in the product display server with the alarm information to determine the impact surface data corresponding to the alarm information.
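The widening search over associated servers can be sketched as below, using the Figure 7 example with the inventory server as the failed server. This is only an illustration under stated assumptions: `similarity` is a simple token-overlap score standing in for the DSSM, and the `threshold` parameter, the server names, and the stored records are hypothetical.

```python
# Association graph from the Figure 7 example (undirected edges).
GRAPH = {
    "display": {"order"},
    "order": {"display", "inventory", "delivery"},
    "inventory": {"order", "delivery"},
    "delivery": {"inventory", "order"},
}


def similarity(alarm: str, record: str) -> float:
    """Stand-in for the DSSM score: Jaccard overlap of whitespace tokens."""
    a, b = set(alarm.split()), set(record.split())
    return len(a & b) / len(a | b) if a | b else 0.0


def rings(graph, start):
    """Yield the sets of servers at hop distance 0, 1, 2, ... from start."""
    seen, frontier = {start}, {start}
    while frontier:
        yield frontier
        nxt = set()
        for node in frontier:
            nxt |= graph[node] - seen
        seen |= nxt
        frontier = nxt


def retrieve(graph, store, failed, alarm, threshold=0.5):
    """Widen the search ring by ring until some record matches the alarm."""
    candidates = []
    for distance, ring in enumerate(rings(graph, failed)):
        for server in sorted(ring):
            for record in store.get(server, []):
                candidates.append((similarity(alarm, record), server, record))
        if distance == 0 and graph[failed]:
            continue  # search the failed server together with its direct neighbors
        best = max(candidates, default=None)
        if best is not None and best[0] >= threshold:
            return best[1], best[2]  # server and record with the highest score
    return None
```

Taking the maximum score follows the text; the cut-off that decides when a maximum counts as a match is an assumption of this sketch.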
  • the server 103 determines the change command in the corresponding change order based on the impact surface data.
  • server 103 retrieves the impact surface data that matches the alarm information based on the spatiotemporal retrieval algorithm of graph computing, and the server establishes a spatiotemporal retrieval algorithm based on DSSM.
  • the DSSM uses the characters in the text as the finest segmentation granularity, which can reuse the semantics expressed by each character and reduce the dependence on word segmentation, thereby improving the generalization ability of the model; and DSSM is supervised training with high accuracy. Therefore, in this example, server 103 uses DSSM to perform a spatiotemporal retrieval algorithm to improve the matching accuracy of the alarm information and the impact surface data.
  • Server 103 determines the change command in the operation log corresponding to the aforementioned impact surface data, and outputs it to the front end, indicating to the user the change command that may cause a fault alarm, shortening the time consumption of abnormality troubleshooting and improving the efficiency of abnormality troubleshooting.
  • based on the risk level determined for the aforementioned impact surface data, the server 103 uses the corresponding change command and its risk level as training data, and updates the interceptor deployed in the bastion host 102 with that training data.
  • the interceptor in the bastion host is an interception model obtained through deep learning model training.
  • the server retrains the interception model using the training data obtained in actual production to obtain an updated interception model, and redeploys the updated interception model to the bastion host 102.
  • the updated interception model can more accurately intercept execution commands whose risk levels meet the set conditions, thereby improving the accuracy of interception.
  • the server 103 does not need to judge the execution commands intercepted by the interception model, thereby reducing the number of execution commands that need to be monitored and analyzed during data monitoring and analysis, which is conducive to improving the efficiency of monitoring and analysis.
  • the server 103 sends the impact surface data or risk level to the terminal 101, and the terminal 101 displays the impact surface data or risk level on the front end.
  • the front end here may refer to a display connected to the terminal 101, or a display screen provided by the terminal 101, etc., which is not limited in this application.
  • server 103 sends the identifier of the first object indicated by the impact surface data to terminal 101 through bastion host 102, so that terminal 101 displays the identifier of the first object.
  • the server 103 sends the identifier of the monitored object indicated by the impact surface data to the terminal 101 through the bastion host 102, so that the terminal 101 displays the identifier of the monitored object.
  • the server 103 sends the change command and the corresponding risk level to the terminal 101 through the bastion host 102, so that the terminal 101 displays the change command and the risk level.
  • the server 103 sends one type of data.
  • the server 103 can send multiple types of data to the terminal 101 at the same time, such as sending the identification of the first object and the identification of the monitored object.
  • the server 103 sends at least one of the identification of the first object, the identification of the monitored object and the risk level to the terminal 101 for display, thereby realizing data visualization.
  • the user can process the command input to the server in a timely manner according to the visualized data.
  • the processing device includes hardware structures and/or software modules corresponding to the execution of each function.
  • the present application can be implemented in the form of hardware or a combination of hardware and computer software. Whether a function is executed in the form of hardware or computer software driving hardware depends on the specific application scenario and design constraints of the technical solution.
  • the data monitoring and analysis device can be used to implement the function of the processor in the above method embodiment, and thus can also achieve the beneficial effects possessed by the above method embodiment.
  • the data monitoring and analysis device can be a module (such as a chip) applied to the server 103.
  • the data monitoring and analysis device 800 includes a receiving module 810, an object determination module 820 and a level determination module 830.
  • the data monitoring and analysis device 800 is used to implement the functions of the method embodiments shown in Figs. 2 to 7 above.
  • the receiving module 810 is used to receive a change command.
  • the object determination module 820 is used to determine the monitoring object associated with the first object during the execution of the change command.
  • the monitoring object is the process that needs to be scheduled when the server 103 executes the change command.
  • the change command executed in the initial object may call or view other processes, and the other processes are associated with the change command, so the other processes belong to the monitoring object.
  • the object determination module 820 uses the pre-deployed monitor to obtain the system resources operated by the monitored object during the execution of the change command, the system resources including the identifier of the first object and the identifier of the monitored object, and the server thereby obtains the impact surface data.
  • the monitoring means of the monitor can refer to the aforementioned process of deploying the monitor, which will not be described in detail here.
  • the level determination module 830 is used to determine the risk level corresponding to the change command according to the identifier of the first object and the identifier of the monitored object.
  • the level determination module 830 may use preset rules or deep learning models to determine the risk level of the impact surface data.
  • the data monitoring and analysis device 800 further includes an information acquisition module 840 , a search module 850 , an update module 860 , a first monitoring module 870 , a second monitoring module 880 , and a display module 890 .
  • the server 103 of the aforementioned embodiments may correspond to the data monitoring and analysis device 800, and to the entity that performs the methods shown in Figures 2 to 7 according to the embodiments of the present application. The operations and/or functions of each module in the data monitoring and analysis device 800 implement the corresponding processes of those methods, which are not repeated here for the sake of brevity.
  • the server 103 may include a variety of hardware, as shown in FIG. 9, which is a schematic diagram of the structure of a server provided by the present application.
  • the server 900 may be applied to the operation and maintenance system shown in FIG. 1, and the server may be any one of the bastion host 102 and the server 103.
  • the server 900 may include a processor 910 , a memory 920 , a communication interface 930 , a bus 940 , etc.
  • the processor 910 , the memory 920 , and the communication interface 930 are connected via the bus 940 .
  • the processor 910 is the computing core and control core of the server 900.
  • the processor 910 may be a very large scale integrated circuit. An operating system and other software programs are installed in the processor 910, so that the processor 910 can access the memory 920 and various peripheral component interconnect express (PCIe) devices.
  • the processor 910 includes one or more processor cores.
  • the processor core in the processor 910 is, for example, a central processing unit (CPU) or other application specific integrated circuit (ASIC).
  • the processor 910 may also be other general-purpose processors, digital signal processors (DSP), field programmable gate arrays (FPGA) or other programmable logic devices, discrete gates or transistor logic devices, discrete hardware components, etc.
  • the server 900 may also include multiple processors.
  • the memory 920 can be used to store computer executable program codes, which include instructions.
  • the processor 910 executes various functional applications and data processing of the server 900 by running the instructions stored in the memory 920.
  • the memory 920 may include a program storage area and a data storage area.
  • the program storage area may store an operating system, an application required for at least one function (such as a running model function, a sending function, etc.), etc.
  • the data storage area may store data created during the use of the server 900 (such as impact surface data, etc.), etc.
  • the memory 920 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one disk storage device, a flash memory device, a universal flash storage (UFS), etc.
  • the communication interface 930 is used to implement communication between the server 900 and an external device or component. In this embodiment, the communication interface 930 is used to perform data exchange with other processing devices.
  • the bus 940 may include a path for transmitting information between the above components (such as the processor 910, the memory 920, and the communication interface 930).
  • the bus 940 may also include a power bus, a control bus, and a status signal bus.
  • various buses are labeled as bus 940 in the figure.
  • the bus 940 may be a PCIe bus, or an extended industry standard architecture (EISA) bus, a unified bus (Ubus or UB), a compute express link (CXL), a cache coherent interconnect for accelerators (CCIX), etc.
  • the processor 910 can access these I/O devices through the PCIe bus.
  • the processor 910 is connected to the memory 920 through a double data rate (DDR) bus.
  • different memories 920 may use different data buses to communicate with the processor 910, so the DDR bus can also be replaced by other types of data buses, and the embodiment of the present application does not limit the bus type.
  • FIG. 9 only takes the example of a server 900 including one processor 910 and one memory 920.
  • the processor 910 and the memory 920 are respectively used to indicate a type of device or equipment.
  • the number of each type of device or equipment can be determined according to business requirements.
  • all or part of the embodiments may be implemented by software, hardware, firmware or any combination thereof.
  • all or part of the embodiments may be implemented in the form of a computer program product.
  • the computer program product includes one or more computer programs or instructions.
  • when the computer program or instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present application are executed in whole or in part.
  • the computer may be a general-purpose computer, a special-purpose computer, a computer network, a network device, a user device, or another programmable device.
  • the computer program or instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium, for example, the computer program or instructions may be transmitted from one website, computer, server or data center to another website, computer, server or data center by wired or wireless means.
  • the computer-readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server or data center that integrates one or more available media.
  • the available medium may be a magnetic medium, such as a floppy disk, a hard disk, or a magnetic tape; it may also be an optical medium, such as a digital video disc (DVD); it may also be a semiconductor medium, such as a solid state drive (SSD).

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

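Disclosed are a data monitoring and analysis method, device, server, operation and maintenance system, and storage medium, relating to the field of computer technology. The data monitoring and analysis method includes: a server receives and executes a change command, where the change command is used to perform a change operation on an object in an application; the server monitors the monitoring object associated with the change command and obtains the identifier of the first object and the identifier of the monitoring object generated while the monitoring object runs. Because only the monitoring object associated with the change command is monitored, the monitoring scope is narrowed and monitoring accuracy and efficiency are improved. Moreover, the narrower monitoring scope reduces the amount of monitoring data generated by the server, cuts the redundant data produced during data monitoring, and reduces the storage resources that monitoring data occupies in the server. The server performs risk assessment on the above data to obtain the risk level corresponding to the change command, without analyzing the aforementioned redundant data, which improves the server's data analysis efficiency.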

Description

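Data monitoring and analysis method, device, server, operation and maintenance system, and storage medium
This application claims priority to Chinese patent application No. 202211172716.0, filed with the China National Intellectual Property Administration on September 26, 2022 and entitled "Data monitoring and analysis method, device, server, operation and maintenance system, and storage medium", which is incorporated herein by reference in its entirety.
Technical Field
The present application relates to the field of computers, and in particular to a data monitoring and analysis method, device, server, operation and maintenance system, and storage medium.
Background
Operation and maintenance personnel execute commands to change applications according to user needs, deleting, modifying, or adding application data and modifying the behavior of applications. While a server runs an application, it monitors the application to find operational failures caused by changes. However, monitoring all processes in the application generates a large amount of monitoring data, and analyzing all of that monitoring data at once is inefficient. How to provide a more effective monitoring and analysis method for the change process has therefore become an urgent problem.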
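Summary
The present application provides a data monitoring and analysis method, device, server, operation and maintenance system, and storage medium, to address the low efficiency of analyzing monitoring data in conventional technology.
The present application adopts the following technical solutions.
In a first aspect, a data monitoring and analysis method is provided. The method is performed by one or more servers in a server cluster and includes the following. First, the server receives and executes a change command, where the change command is used to instruct that a change operation be performed on a first object in a first application. Second, while executing the change command, the server determines the monitoring object associated with the first object, and uses a monitor deployed in advance to obtain the identifier of the first object and the identifier of the monitoring object generated while the monitoring object runs. Finally, the server performs risk assessment based on the identifier of the first object and the identifier of the monitoring object to obtain the risk level corresponding to the change command.
For example, the server may determine whether to issue an alarm according to the risk level. The first object may be a data file or a process in the first application. The monitoring object may be a process associated with the change command.
Compared with a server monitoring all processes of the entire application, in this embodiment the server monitors only the monitoring objects associated with the change command, which narrows the monitoring scope and improves monitoring accuracy and efficiency. Moreover, because the range of processes the server needs to monitor is smaller, the server generates less monitoring data, less redundant data is produced during data monitoring, and monitoring data occupies fewer storage resources in the server. In addition, the server analyzes only the data of the monitoring objects associated with the change command and does not need to analyze the aforementioned redundant data, which improves the server's data analysis efficiency.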
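In a possible implementation, the monitoring object includes a second object in the first application and an object in a second application, where the second application is an application that interacts with the first application. The above objects are used to indicate processes.
In a possible implementation, the monitoring object is associated with the first object through a system call function.
For example, when the first object is a data file, the monitoring object may read the data file through the read function, and the monitoring object and the first object are associated based on the read function.
When the first object is a process, the monitoring object may create the first object through the process creation (copy_process) function, and the monitoring object and the first object are associated based on the copy_process function.
Through system call functions, the server can clearly expose the relationship between the monitoring object and the first object and thus accurately associate them. This ensures that every process that performs a change operation on the first object is a monitoring object, avoids incomplete monitoring data caused by missed monitoring objects, and improves monitoring accuracy.
In a possible implementation, the change operation includes one or more of addition, deletion, and modification.
In a possible implementation, the server uses a preset risk assessment model to process the identifier of the first object and the identifier of the monitoring object to obtain the risk level corresponding to the change command.
Rather than analyzing metric-type data, the server uses the identifier of the first object and the identifier of the monitoring object to determine the risk of the change command: for example, it inputs the two identifiers into the risk assessment model for analysis, determines the risk level of the change command, and thereby determines whether to issue an alarm.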
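In a possible implementation, after the change command has been executed, when the server or another server in the server cluster fails, the server obtains the alarm information corresponding to the fault and retrieves the identifier of the first object and the identifier of the monitoring object according to the alarm information. Based on the retrieved identifiers, the server determines the corresponding operation log and thereby obtains the change command in the operation log.
The alarm information indicates the fault data generated by the server during operation. The operation log indicates the operation records of multiple change commands for the first application, and the multiple change commands include the aforementioned change command.
For example, the server may use a graph-computing-based spatiotemporal retrieval algorithm to determine, from the data stored in the servers, the identifier of the first object and the identifier of the monitoring object that match the alarm information.
The server uses a graph-computing-based spatiotemporal retrieval algorithm for data retrieval; for example, the algorithm may be a deep structured semantic model (Deep Structured Semantic Models, DSSM). If the server uses the DSSM to query the impact surface data corresponding to the alarm information, it can quickly find the impact surface data that best matches the alarm information, which improves the retrieval efficiency for impact surface data. The server then determines the change command in the operation log corresponding to the identifier of the first object and the identifier of the monitoring object and outputs it to the front end, indicating to the user the change command that may have caused the fault. This shortens the time spent on troubleshooting and improves its efficiency.
In a possible implementation, the server uses the change command and its corresponding risk level as training data to update the interception model deployed in the bastion host; the updated interception model is used to intercept some change commands. Here, the servers include the bastion host.
The updated interception model can more accurately intercept execution commands whose risk levels meet the set conditions, which improves interception accuracy. The server does not need to evaluate execution commands that the interception model has intercepted, which reduces the number of execution commands to be monitored and analyzed and helps improve monitoring and analysis efficiency.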
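In a possible implementation, when the monitoring object runs and the server hits a preset probe point, the probe point triggers the monitor deployed in advance in the server; the monitor obtains the system resources processed while the monitoring object runs, yielding the identifier of the first object and the identifier of the monitoring object. Here, a tracepoint includes the probe point and the monitor.
For example, the monitor may be an extended Berkeley Packet Filter (eBPF).
The probe point and the monitor included in the tracepoint work together: the probe point ensures that only monitoring objects that reach a preset command or function are monitored, which further narrows the monitoring scope and avoids redundant monitoring data. When a monitoring object reaches the preset command or function, the probe point triggers the monitor to monitor that object, yielding the identifier of the first object and the identifier of the monitoring object. Compared with conventional technology, in which monitoring data consists of indicative data such as performance metrics, the server inputs the impact surface data into the risk assessment model for analysis, determines the risk level of the change command, and thereby determines whether to issue an alarm, which improves the accuracy of the server's alarms.
In another possible implementation, when the monitoring object is a remote access process, the server obtains the packets generated while the monitoring object runs and parses them to obtain the identifier of the first object and the identifier of the monitoring object.
For example, the server may obtain and parse the packets generated while the monitoring object runs through the high-performance data path (Express Data Path, XDP).
By parsing the packets of a remote access process, the server determines the type of remote service accessed in the impact surface data. This enables monitoring of remote access processes, increases the types of processes that can be monitored, and improves monitoring efficiency.
In a possible implementation, the server sends at least one of the identifier of the first object, the identifier of the monitoring object, and the risk level to the front end of the terminal for display. The front end here may refer to a display connected to the terminal, or a display screen of the terminal, etc., which is not limited in this application. By displaying the aforementioned data on the front end, the server achieves data visualization, and the user can promptly handle the commands input to the server according to the visualized data.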
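In a second aspect, a data monitoring and analysis device is provided. The device is applied in a server and includes modules for performing the data monitoring and analysis method of the first aspect or any possible implementation of the first aspect. For example, the data monitoring and analysis device includes a receiving module, an object determination module, and a level determination module. The receiving module is used to receive a change command; the object determination module is used to determine, while the change command is executed, the monitoring object associated with the first object; the level determination module is used to determine the risk level corresponding to the change command according to the identifier of the first object and the identifier of the monitoring object. The change command is used to instruct that a change operation be performed on the first object in the first application.
For beneficial effects, see the description of any possible implementation of the first aspect; they are not repeated here. The data monitoring and analysis device has the function of implementing the behavior in the method examples of any possible implementation of the first aspect. The function may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the above function.
In a third aspect, a server is provided. The server includes at least one processor and a memory; the memory stores instructions, and the processor invokes the instructions to implement the method of the first aspect or any possible implementation of the first aspect.
In a fourth aspect, an operation and maintenance system is provided. The operation and maintenance system includes a bastion host and multiple servers;
the bastion host is used to receive and screen execution commands to obtain change commands;
the servers are used to execute the change commands and to monitor and analyze the process of executing them, implementing the method of the first aspect or any possible implementation of the first aspect.
In a fifth aspect, the present application provides a computer-readable storage medium storing a computer program or instructions which, when executed by a processing device, implement the method of the first aspect or any possible implementation of the first aspect.
In a sixth aspect, the present application provides a computer program product including a computer program or instructions which, when run on a processing device, cause the processing device to execute the computer program or instructions to implement the method of the first aspect or any possible implementation of the first aspect.
For the beneficial effects of the second to sixth aspects, refer to the description of the first aspect or any implementation of the first aspect; they are not repeated here.
On the basis of the implementations provided in the above aspects, the present application may further combine them to provide more implementations.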
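Brief Description of the Drawings
FIG. 1 is an application scenario diagram of an operation and maintenance system provided by this application;
FIG. 2 is a first schematic diagram of eBPF monitoring provided by this application;
FIG. 3 is a second schematic diagram of eBPF monitoring provided by this application;
FIG. 4 is a schematic diagram of XDP monitoring provided by this application;
FIG. 5 is a schematic diagram of a process tree provided by this application;
FIG. 6 is a flow chart of a data monitoring and analysis method provided by this application;
FIG. 7 is a schematic diagram of the association of servers provided by this application;
FIG. 8 is a schematic diagram of the structure of a data monitoring and analysis device provided by this application;
FIG. 9 is a schematic diagram of the structure of a server provided by this application.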
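Detailed Description
The present application provides a data monitoring and analysis method that includes the following. First, the server receives and executes a change command and saves a log record of the change command (also called an operation log); the change command is used to instruct that a change operation be performed on a first object in a first application. Second, while executing the change command, the server determines the monitoring object associated with the first object, and uses a monitor deployed in advance to obtain the identifier of the first object and the identifier of the monitoring object generated while the monitoring object runs. Finally, the server performs risk assessment based on the identifier of the first object and the identifier of the monitoring object to obtain the risk level corresponding to the change command, and may determine whether to issue an alarm according to the risk level.
For example, the first object may be a data file or a process in the first application. The monitoring object may be a process associated with the change command. For instance, the server may use a deep learning model to perform risk assessment on the identifier of the first object and the identifier of the monitoring object to obtain an analysis result (such as the aforementioned risk level).
Compared with a server monitoring all processes of the entire application, in this embodiment the server monitors only the monitoring objects associated with the change command, which narrows the monitoring scope and improves monitoring accuracy and efficiency. Moreover, because the range of processes the server needs to monitor is smaller, the server generates less monitoring data, less redundant data is produced during data monitoring, and monitoring data occupies fewer storage resources in the server. In addition, the server analyzes only the data of the monitoring objects associated with the change command and does not need to analyze the aforementioned redundant data, which improves the server's data analysis efficiency.
The data monitoring and analysis method provided in this embodiment is described below, beginning with an introduction to related technology.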
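In operation and maintenance scenarios, a server's management of execution commands is usually divided into three stages: pre-event management, in-event management, and post-event management.
Pre-event management refers to the screening stage before the server executes a change command.
In-event management refers to the monitoring stage while the server executes a change command.
Post-event management refers to the fault handling and model update stage after the server executes a change command.
Interception error means that, in pre-event management, because execution commands are complex, the interception method used by the server is prone to misjudging the risk of an execution command, so that a high-risk execution command is let through or a normal execution command is intercepted. For example, modifying an ordinary file does not cause risk, but displaying the /etc/passwd file (the user database, whose fields give the user name, real name, home directory, encrypted password, and other user information) can lead to serious consequences such as leakage of user information.
Risk level refers to the risk status of a command executed by the server, which can be divided into multiple levels. For example, when pre-event management intercepts commands using blacklists and whitelists and an execution command uses the vim command to view the /etc/passwd file, the risk level of that execution command is level 2, so it is a command that must be intercepted.
Monitoring object refers to a process that is monitored during in-event management, for example, a process that executes the read command.
Monitoring behavior refers to the act of monitoring a monitoring object during in-event management. For example, the server monitors the resource calls of a process during its execution.
Secondary risk arises because the commands in an operation and maintenance script are interrelated: if one command is intercepted by the server, the next command will execute on an incorrect state, which easily creates unknown risks.
Bastion host refers to a device that, in a specific network environment, uses various technical means to monitor and record the operations of operation and maintenance personnel on servers, network devices, security devices, databases, and other equipment in the network, in order to protect the network and data from intrusion and damage by external and internal users and to enable centralized alarms, timely handling, and auditing for accountability.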
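To avoid the secondary risk described above, the present application monitors the process of executing the change command. FIG. 1 is an application scenario diagram of an operation and maintenance system provided by this application. The operation and maintenance system 100 may include a bastion host 102 and n servers 103, where n is a positive integer. The bastion host 102 and any server 103 may communicate in a wired or wireless manner. Impact surface data is used to denote the identifier of the first object and the identifier of the monitoring object. In a possible example, the operation and maintenance system 100 further includes a terminal 101.
For example, in the operation and maintenance system 100 shown in FIG. 1, after logging in to the bastion host 102, operation and maintenance personnel input one or more execution commands. The bastion host 102 screens and intercepts the execution commands to obtain the screened change command, and forwards the change command to the server 103 hosting the first application on which the change operation is to be performed. The server 103 executes the change command, and the monitor on the server 103 obtains the impact surface data generated during execution by the monitoring objects associated with the change command. The server 103 then determines the risk level of the change command from the impact surface data and issues alarms based on that risk level.
In a possible example, operation and maintenance personnel may specify in a change system the servers to be changed, set the operations to be performed, and upload the script used for the change. The change system then executes the change and returns the execution result, which may be alarm or no alarm.
In an optional implementation, the bastion host 102 uses an interceptor to screen and intercept execution commands.
In a possible example, the interceptor may be a blacklist/whitelist interception algorithm. The bastion host 102 uses the algorithm to judge the risk level of each execution command and, according to the resulting risk level, intercepts or releases the execution command. In the blacklist/whitelist interception algorithm, the bastion host 102 checks whether the input execution command appears in the preset blacklist or whitelist: if the blacklist contains the command, the corresponding risk level is determined; if the whitelist contains the command, the execution command is released.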
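The blacklist/whitelist screening can be sketched as below. The list contents, the `INTERCEPT_AT` threshold, and the "unlisted" verdict are assumptions made for illustration; the text gives only the example that viewing /etc/passwd with vim is risk level 2 and must be intercepted.

```python
import shlex
from typing import Optional, Tuple

# Hypothetical lists; the text does not enumerate their contents.
# Blacklist entries map a (program, target) pair to a risk level.
BLACKLIST = {("vim", "/etc/passwd"): 2, ("cat", "/etc/passwd"): 2}
WHITELIST = {"ls", "pwd"}
INTERCEPT_AT = 2  # assumed threshold: levels at or above this are intercepted


def screen(command: str) -> Tuple[str, Optional[int]]:
    """Return a verdict and, for blacklisted commands, the risk level."""
    argv = shlex.split(command)
    if not argv:
        return "release", None
    program, targets = argv[0], argv[1:]
    for target in targets:
        level = BLACKLIST.get((program, target))
        if level is not None:
            verdict = "intercept" if level >= INTERCEPT_AT else "release"
            return verdict, level
    if program in WHITELIST:
        return "release", None
    # Assumption: a command on neither list gets its own verdict so the
    # caller can apply an explicit policy; the text does not cover this case.
    return "unlisted", None
```

Matching on the program and its target rather than on the whole command string is a design choice of this sketch, so that the same editor is treated differently depending on the file it opens.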
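In another possible example, the interceptor may be a deep learning model, which may include models such as the k-nearest neighbor algorithm (K-Nearest Neighbor, KNN) and the support vector machine (Support Vector Machine, SVM). For how the deep learning model processes execution commands, refer to the processing steps of the blacklist/whitelist interception algorithm above; they are not repeated here.
Before the bastion host 102 sends the change command to the server 103 hosting the first application on which the change operation is to be performed, a monitor is deployed in that server 103.
For the process of deploying the monitor during pre-event management, two possible implementations are given below.
In the first possible implementation, the server 103 deploys eBPF locally and uses eBPF to obtain the identifier of the first object operated on and the identifier of the monitoring object while the server 103 runs the monitoring object. The eBPF includes a kernel program, a collector, and an intermediate medium used for exchanging data between the kernel program and the collector.
FIG. 2 is a first schematic diagram of eBPF monitoring provided by this application. The server 103 deploys the eBPF kernel program in kernel mode to monitor system resources, and deploys the collector in user mode to collect system resource information. The server 103 also deploys an intermediate storage medium (eBPF Map) for exchanging data between the eBPF kernel program and the collector; the eBPF Map is a block of shared memory used by both. After the eBPF kernel program observes how the server 103 operates on system resources while executing the change command, it writes that information into the eBPF Map; the collector reads it from the eBPF Map and thereby obtains the impact surface data.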
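In one possible case, the monitored object running on the server 103 accesses local dynamic objects through shared memory, pipes, signals, shared files, sockets, and the like. Figure 3 is a second schematic diagram of eBPF monitoring provided by the present application. Taking a monitored object that accesses a local dynamic object through a socket as an example, eBPF monitors the sys_recv and sys_send system calls to obtain the impact-scope data.
In one possible example, the server 103 is also provided with a probe point, which triggers eBPF when the server 103, while running the monitored object, reaches a preset command or function.
For example, when the server 103 receives a "vim" command to view the contents of a file, the server 103 first spawns a child process through Bash to run the vim program, and vim calls the system's open and read functions. Because a probe point was inserted into the read function in advance, the probe point triggers eBPF monitoring when the child process runs vim and calls read, so that the impact-scope data of the child process is obtained.
The server combines the probe point included in the tracepoint with eBPF. The probe point ensures that only monitored objects that reach a preset command or function are monitored, which further narrows the monitoring scope and avoids redundant monitoring data.
In addition, whereas monitoring data in conventional techniques consists of indicative data such as performance metrics, here the probe point triggers eBPF to monitor the monitored object when it reaches a preset command or function, yielding impact-scope data. The server uses the impact-scope data to assess the risk of the change command, for example by feeding the impact-scope data into a risk assessment model to determine the risk level of the change command and thus decide whether to raise an alarm.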
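In a second possible implementation, the server 103 deploys an XDP program locally. When the monitored object is a remote-access process, the server 103 uses the XDP program to parse the packets received or sent by the monitored object and obtain IP-Port information. From the IP-Port information, the server 103 can determine the accessed service recorded in the impact-scope data.
As an example of the server 103 using XDP to obtain IP-Port information, this embodiment describes how the server 103 uses XDP to identify the remote-node service accessed by a monitored object. Figure 4 is a schematic diagram of XDP monitoring provided by the present application. A monitored object on the server 103 can access services on remote nodes through HTTP requests, gRPC calls, and similar means, all of which use the TCP/IP stack. After the network interface card of the server 103 receives a packet, the XDP program runs before the TCP/IP stack and parses the Ethernet, IP, and TCP headers of the packet to obtain the IP-Port information. The XDP program writes the IP-Port information into the eBPF Map, and the collector reads it from the eBPF Map. Finally, the server 103 obtains the accessed service from the correspondence between IP-Port information and processes.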
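The parsing order described above (Ethernet, then IP, then TCP) can be illustrated in user space. The sketch below is not an XDP program, which would run in the kernel; it only mirrors the header parsing on a raw frame, handles untagged IPv4/TCP frames only, and uses the hypothetical function name `parse_ip_port`.

```python
import socket
import struct


def parse_ip_port(frame: bytes):
    """Return (src_ip, src_port, dst_ip, dst_port) for an IPv4/TCP frame, else None."""
    if len(frame) < 14:                       # Ethernet header is 14 bytes
        return None
    ethertype = struct.unpack("!H", frame[12:14])[0]
    if ethertype != 0x0800:                   # 0x0800 = IPv4
        return None
    ip = frame[14:]
    if len(ip) < 20:                          # minimum IPv4 header
        return None
    ihl = (ip[0] & 0x0F) * 4                  # IPv4 header length in bytes
    if ihl < 20 or ip[9] != 6 or len(ip) < ihl + 4:   # protocol 6 = TCP
        return None
    src_ip = socket.inet_ntoa(ip[12:16])
    dst_ip = socket.inet_ntoa(ip[16:20])
    # The first four bytes of the TCP header are the source and destination ports.
    src_port, dst_port = struct.unpack("!HH", ip[ihl:ihl + 4])
    return src_ip, src_port, dst_ip, dst_port
```

The returned tuple corresponds to the IP-Port information that the text says is written into the eBPF Map; mapping it to a process is a separate step not shown here.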
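In one possible case, the server 103 may deploy the XDP program in its local kernel mode.
By using the XDP program to parse the packets of remote-access processes, the server 103 determines the type of remote service accessed, as recorded in the impact-scope data. This makes it possible to monitor remote-access processes, increases the types of processes that can be monitored, and improves monitoring efficiency.
Table 1 below shows what the impact-scope data may include.
Table 1
It is worth noting that the contents of the impact-scope data shown in Table 1 are only an example provided by the present application and should not be construed as limiting it; the impact-scope data may include more or fewer items. The identifier of the first object includes the resource name of the called object, the operated object, the local cloud service accessed, the remote service accessed, and so on. The identifier of the monitored object includes the name of the called function, the process PID, the call parameters, and so on.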
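To ensure that the server monitors only the monitored objects associated with the change command, this embodiment provides an implementation for determining the monitored objects. Figure 5 is a schematic diagram of a process tree provided by the present application. Each node in the process tree represents a process, the edges between nodes represent associations between processes, and the number in each node is the PID of the process. When the server 103 starts the monitor, the monitor takes the PID of a preset process (the initial object) as a startup parameter and uses the initial object and its PID as the root node (root process) of the process tree; every process in the process tree is a monitored object. When another process becomes associated with a process in the tree, that process is also a monitored object and is added to the tree. Because the server monitors only the processes in the process tree, and every process associated with them is added to it, processes that need monitoring are not missed, the monitored data is not left incomplete, and monitoring accuracy improves.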
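For determining the monitored objects, the present application gives the following four optional cases.
In a first optional case, the initial object on the server 103 receives and executes the change command, and a child process (the second object) that the server 103 creates while the initial object executes the change command is a monitored object. The first object indicates a file in the first application.
For example, while the initial object executes the change command, a second object has to be created to carry out the operation, because the change command performs a change operation on a file in the first application; if the change command needs to search the contents of the first object, a process for searching is started. eBPF monitors use of the copy_process system call to determine whether the initial object creates a second object. If a copy_process system call is observed, the second object is treated as a monitored object. The copy_process system call serves as the association between the initial object and the monitored object, and the PID of the second object is added to the process tree.
The example above shows only that the relationship between the initial object and its child process is determined from the copy_process system call; the relationships between other processes in the process tree can be determined in the same way.
Based on the system call function described above, the server 103 links multiple processes through the process tree, and the associations among them are determined by the system call function. Because the server 103 monitors every process in the process tree, the second object is also monitored once the server 103 adds it to the tree. This ensures that no process associated with the change command is missed by the server 103 and improves the completeness and accuracy with which the server 103 monitors all processes associated with the change command.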
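The process-tree bookkeeping described above can be sketched as a small data structure. This is a minimal sketch under stated assumptions: the `ProcessTree` class and its method names are hypothetical, and the fork events are assumed to arrive as (parent PID, child PID) pairs from whatever observes copy_process.

```python
class ProcessTree:
    """Tracks the PIDs that belong to one change command."""

    def __init__(self, root_pid):
        self.root = root_pid
        self.parent = {root_pid: None}   # pid -> parent pid

    def is_monitored(self, pid):
        return pid in self.parent

    def on_fork(self, parent_pid, child_pid):
        """Handle one process-creation event; return True if the child was added."""
        if parent_pid not in self.parent:
            return False                 # parent is outside the tree: ignore
        self.parent.setdefault(child_pid, parent_pid)
        return True

    def lineage(self, pid):
        """Return the path from a monitored pid up to the root."""
        path = []
        while pid is not None:
            path.append(pid)
            pid = self.parent[pid]
        return path
```

Events whose parent is not in the tree are dropped, which is what keeps the monitoring scope limited to the change command; `lineage` raises `KeyError` for a PID that is not monitored.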
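In a second optional case, the initial object interacts with another process while executing the change command, and that other process is a monitored object. The first object indicates a process in the first application.
For example, when the initial object's execution of the change command processes the running result of a process in the first application, that process is a monitored object. The process in the first application is associated with the initial object through a system call function.
In a third optional case, when the initial object accesses a process in a second application over the network while executing the change command, that object in the second application is a monitored object. The first application and the second application may be applications running on the server.
For example, if the change command executed by the initial object needs to call a running process in another application, that running process is a monitored object.
In a fourth optional case, when the initial object calls, views, or modifies a file of a process in the first application or the second application while executing the change command, that process is a monitored object.
It is worth noting that the determination of monitored objects is not limited to the four cases above. When a monitored object in the process tree affects another process while executing the change command, that other process is associated with the monitored object and is therefore also a monitored object.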
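To solve the problem that a server monitoring every process of an entire application produces redundant data and makes analysis inefficient, this embodiment provides a data monitoring and analysis method. Figure 6 is a schematic flowchart of a data monitoring and analysis method provided by the present application; the steps performed by the monitor and the analyzer in the figure can all be executed by the processor of the server 103. In this embodiment, during in-execution management, the server 103 executes the received change command, uses the monitor deployed on the server to monitor the processes executing the change command and obtain impact-scope data, and then decides from the impact-scope data whether to terminate the change. The data monitoring and analysis method of this embodiment may be performed by one or more servers.
Referring to Figure 6, the data monitoring and analysis method provided by this embodiment includes the following steps S610 to S630.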
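S610. The server 103 receives a change command.
The change command instructs that a change operation be performed on a first object in a first application.
The server 103 receives the change command sent by the bastion host 102. For how the bastion host 102 obtains the change command, refer to the description of the bastion host given with Figure 1; it is not repeated here.
The server 103 performs the change operation on the first object to change the behavior of the first application, for example by adding a new function to the first application or modifying one of its existing functions. This embodiment provides three possible cases of the change operation.
In a first possible case, the change operation is an addition, which means adding a new process that executes commands. For example, the server 103 adds a new command-executing process to the first application, giving the first application a new function.
In a second possible case, the change operation is a deletion, which means deleting an existing command-executing process or deleting a data file. For example, when the server 103 deletes the process that implements the search function in the first application, the search function of the first application goes offline; likewise, when the server 103 deletes the data file that supports the search function, the search function also goes offline.
In a third possible case, the change operation is a modification, which means modifying an existing command-executing process or modifying a data file. For example, when the server 103 modifies the process that implements the popular-product push function in the first application, that function becomes a campaign push function.
It is worth noting that the change operation may also combine several of these cases, for example an addition together with a modification.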
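Still referring to Figure 6, the data monitoring and analysis method provided by this embodiment further includes step S620.
S620. While executing the change command, the server 103 determines the monitored object associated with the first object.
A monitored object is a process that the server 103 needs to schedule while executing the change command. For example, a change command executed in the initial object may call or inspect other processes; those processes are associated with the change command and are therefore monitored objects. Because the change command corresponds to the first object, the first object is associated with the monitored objects.
For further ways of determining monitored objects, refer to the description given with Figure 5; it is not repeated here. The change command received by the server 103 is an execution command that the bastion host 102 has screened in advance; refer to the description of the bastion host 102 given with Figure 1, which is not repeated here.
The monitor that the server 103 deployed during pre-execution management obtains how the monitored object operates on system resources while executing the change command, thereby obtaining the impact-scope data. For how the monitor obtains the impact-scope data, refer to the monitor deployment process described above; it is not repeated here.
This embodiment provides two possible examples of how the impact-scope data is obtained.
In a first possible example, the impact-scope data is obtained locally on the server 103; for the method, refer to the description of monitor deployment above, which is not repeated here.
In a second possible example, the server 103 obtains the impact-scope data of a remote server through a remote procedure call (RPC); the remote server may also have a monitor deployed. The server 103 uses RPC to have the remote server execute the change command and reads the impact-scope data obtained by the monitor on the remote server.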
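S630. The server 103 determines the risk level corresponding to the change command according to the impact-scope data.
The risk level indicates the impact of the change command on the first application. The server 103 decides whether to raise an alarm for the change command according to its risk level and a preset alarm table. The alarm table indicates the correspondence between risk levels and analysis results, where an analysis result is either alarm or no alarm.
For example, the analyzer in the server 103 may determine the risk level of the impact-scope data based on preset rules or a deep learning model, and the server 103 decides whether to raise an alarm according to the risk level. The following describes how the analyzer processes impact-scope data using a deep learning model.
In one possible implementation, the server 103 feeds the impact-scope data gathered by the collector in the monitor into a preset risk assessment model, obtains the risk level corresponding to the change command, and determines the analysis result from the risk level.
The risk assessment model may be obtained by training on training data with algorithms such as SVM, Density-Based Spatial Clustering of Applications with Noise (DBSCAN), KNN, or neural networks. The analysis result indicates whether the server 103 raises an alarm, and risk levels may correspond to analysis results.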
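As one illustration of a risk assessment model, the sketch below uses KNN, one of the algorithms named above, in pure Python. The feature layout, the training samples, and the function name `predict_risk` are all hypothetical; the application does not specify how impact-scope data is encoded as features.

```python
from collections import Counter
import math

# Hypothetical training samples: (feature vector, risk level).
# Assumed features: [sensitive files touched, child processes, remote services].
TRAINING = [
    ([0, 1, 0], 0), ([0, 2, 0], 0), ([1, 1, 0], 1),
    ([1, 3, 1], 1), ([3, 2, 2], 2), ([4, 5, 3], 2),
]


def predict_risk(features, k=3):
    """Return the majority risk level among the k nearest training samples."""
    nearest = sorted(TRAINING, key=lambda s: math.dist(s[0], features))[:k]
    votes = Counter(level for _, level in nearest)
    return votes.most_common(1)[0][0]
```

A change that touches no sensitive files and spawns one child process lands among the level-0 samples, while one that touches several sensitive files and remote services lands among the level-2 samples.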
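For example, the alarm table is shown in Table 2 below.
Table 2
It is worth noting that the correspondence between risk levels and analysis results shown in Table 2 is only an example provided by the present application and should not be construed as limiting it; the correspondence may include more or fewer entries.
After the analysis result is determined to be an alarm, the server 103 sends an alarm to the terminal 101 or the bastion host 102 and terminates execution of the current change command.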
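The alarm-table lookup and the alarm-then-terminate step can be sketched as follows. The entries of `ALARM_TABLE` are hypothetical, since the contents of Table 2 are not reproduced here, and treating an unknown risk level as an alarm is an assumed fail-safe choice; `notify` and `terminate` are placeholder callbacks.

```python
# Hypothetical alarm table: risk level -> whether to alarm.
ALARM_TABLE = {0: False, 1: False, 2: True, 3: True}


def handle_risk(command, risk_level, notify, terminate):
    """Raise an alarm and stop the change when the alarm table says so."""
    # Unknown levels default to an alarm (assumption, not from the source).
    should_alarm = ALARM_TABLE.get(risk_level, True)
    if should_alarm:
        notify(f"risk level {risk_level}: {command}")
        terminate(command)
    return should_alarm
```

Passing the notification and termination actions in as callbacks keeps the lookup itself independent of how the terminal or bastion host is reached.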
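Compared with analysis based on metric data, in this embodiment the server feeds impact-scope data into the risk assessment model, determines the risk level of the change command, and thereby decides whether to raise an alarm, which improves the accuracy of the server's alarms.
Compared with a server that monitors every process of an entire application, in this embodiment the server 103 monitors only the monitored objects associated with the change command, which narrows the monitoring scope and improves monitoring accuracy and efficiency. Because the range of processes to be monitored is smaller, the server 103 produces less monitoring data, which cuts the redundant data generated during monitoring and reduces the storage that monitoring data occupies on the server. In addition, the server 103 analyzes only the data of the monitored objects associated with the change command and does not need to analyze the redundant data, which improves its data analysis efficiency. The impact-scope data obtained by the server 103 describes how resources are operated on while the monitored objects run. Whereas monitoring data in conventional techniques consists of indicative data such as performance metrics, the server feeds the impact-scope data into the risk assessment model to determine the risk level of the change command and thus decide whether to raise an alarm, which improves the accuracy of the server's alarms.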
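In an optional implementation, during post-execution management, when one or more servers in the server cluster fail, the failed server issues a fault alarm. The server 103 obtains the alarm information in the fault alarm, searches the operation log of the impact-scope data according to the alarm information, and determines the one or more change commands that correspond to the alarm information.
After obtaining the impact-scope data, the server 103 associates it with the corresponding operation log. The operation log stores the operation records of multiple change commands directed at the first application, and those change commands include the change command described above.
The alarm information indicates the fault data produced when a server executed a change command. The server 103 matches this fault data against the impact-scope data and obtains one or more matching items of impact-scope data; then, from the correspondence between impact-scope data and the operation log, the server 103 obtains the change command in the operation log and outputs it to the terminal 101.
For example, Table 3 below shows what the alarm information includes.
Table 3
It is worth noting that the alarm information shown in Table 3 is only an example provided by the present application and should not be construed as limiting it; the alarm information may include more or fewer items.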
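In one possible case, the server 103 uses a spatio-temporal retrieval algorithm based on graph computing to match the alarm information against the impact-scope data. The impact-scope data obtained by each server is stored in that server's local storage, and the data stored on the servers is linked according to the logical order in which the servers process the service. The server 103 identifies the failed server from the alarm information and then, from these associations and the failed server, determines the servers to be searched. If no data matching the alarm information is found in the impact-scope data stored on those servers, the range of servers to be searched is widened until impact-scope data matching the alarm information is found.
The spatio-temporal retrieval algorithm based on graph computing may be a Deep Structured Semantic Model (DSSM), which computes the similarity between the alarm information and each of the items of impact-scope data stored on the servers. The server takes the maximum of these similarities and obtains the impact-scope data that corresponds to the alarm information on the basis of that maximum.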
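The "take the maximum similarity" step can be sketched without a trained model. The code below is not a DSSM: it swaps in a plain character-count vector and cosine similarity as a stand-in for the learned embeddings, purely to show the selection logic. The function names are hypothetical.

```python
from collections import Counter
import math


def char_vector(text):
    """Character counts as a crude stand-in for a learned text embedding."""
    return Counter(text)


def cosine(a, b):
    dot = sum(a[ch] * b[ch] for ch in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def best_match(alarm_text, impact_records):
    """Return (record, score) with the highest similarity, or (None, 0.0)."""
    alarm_vec = char_vector(alarm_text)
    best, best_score = None, 0.0
    for record in impact_records:
        score = cosine(alarm_vec, char_vector(record))
        if score > best_score:
            best, best_score = record, score
    return best, best_score
```

In a real system the two `char_vector` calls would be replaced by the model's encoders; the argmax over similarity scores stays the same.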
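For example, a product order passes in turn through the product display, ordering, inventory, and delivery servers. Figure 7 is a schematic diagram of server associations provided by the present application and shows the associations among these servers. The overall flow of the example in Figure 7 is as follows. ① The product display server displays products and, on receiving the customer's click to place an order, hands over to the ordering server. ② After the ordering server receives the payment-complete instruction, it hands over to the inventory server, which updates the product inventory. ③ After the inventory server updates the product inventory, it hands over to the delivery server, which ships the product. ④ After the delivery server ships the product, it sends an instruction to the ordering server to indicate that shipment of the product is complete. Servers joined by one edge are directly associated, such as the ordering and delivery servers; servers joined by two edges are indirectly associated, such as the product display and delivery servers.
When the server to be searched is the inventory server in Figure 7, the server 103 identifies the servers directly connected to it, such as the ordering server and the delivery server, as servers to be searched as well. The server 103 matches the alarm information against the impact-scope data on the ordering, inventory, and delivery servers to find the impact-scope data that corresponds to the alarm information. If no matching impact-scope data is found, the server 103 identifies the servers indirectly connected to the inventory server, such as the product display server, as servers to be searched. The server 103 can then match the impact-scope data stored on the product display server against the alarm information to find the corresponding impact-scope data. Finally, the server 103 determines the change command in the corresponding change ticket from that impact-scope data.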
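The widening search over the ordering example can be sketched with a breadth-first traversal. The `LINKS` graph encodes the four servers and the hand-overs described above as undirected edges; the function names and the `matches` callback (which stands for "this server holds impact-scope data matching the alarm") are hypothetical.

```python
from collections import deque

# Undirected links from the ordering example.
LINKS = {
    "display": {"order"},
    "order": {"display", "inventory", "delivery"},
    "inventory": {"order", "delivery"},
    "delivery": {"inventory", "order"},
}


def servers_by_distance(start):
    """Group servers into rings by hop count from the failed server (BFS)."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in LINKS[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    rings = {}
    for node, d in dist.items():
        rings.setdefault(d, set()).add(node)
    return [rings[d] for d in sorted(rings)]


def search(start, matches):
    """Widen the search ring by ring until matches(server) finds a hit."""
    rings = servers_by_distance(start)
    # The first round covers the failed server and its direct neighbours.
    rounds = [rings[0].union(*rings[1:2])] + rings[2:]
    searched = []
    for group in rounds:
        searched.extend(sorted(group))
        hits = [s for s in sorted(group) if matches(s)]
        if hits:
            return hits, searched
    return [], searched
```

Starting from the inventory server, the first round covers the inventory, ordering, and delivery servers, and the product display server is only reached in the second round, as in the example above.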
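In post-execution management, the server 103 uses the spatio-temporal retrieval algorithm based on graph computing to retrieve the impact-scope data that matches the alarm information, and the server builds this algorithm on a DSSM. The DSSM uses individual characters of the text as its finest segmentation granularity, so it can reuse the semantics expressed by each character and depends less on word segmentation, which improves the generalization ability of the model; the DSSM is also trained with supervision and is relatively accurate. In this example, therefore, building the spatio-temporal retrieval algorithm on a DSSM can improve the precision with which alarm information is matched to impact-scope data. The server 103 then determines the change command in the operation log corresponding to that impact-scope data and outputs it to the front end, pointing the user to the change command that may have caused the fault alarm. This shortens the time spent on troubleshooting and improves its efficiency.
In an optional implementation, during post-execution management, the server 103 takes the change command corresponding to the impact-scope data, together with the risk level of that impact-scope data, as training data, and uses the training data to update the interceptor deployed on the bastion host 102.
For example, the interceptor on the bastion host is an interception model obtained by training a deep learning model. The server retrains the interception model with training data obtained in actual production and redeploys the updated interception model to the bastion host 102. The updated model intercepts execution commands whose risk level meets the set condition more accurately, which improves interception accuracy. The server 103 does not need to assess execution commands that the interception model has already intercepted, which reduces the number of execution commands to be monitored and analyzed and helps improve monitoring and analysis efficiency.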
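The feedback loop from observed risk levels back to the interceptor can be illustrated with a much simpler interceptor than the one described. The sketch below does not retrain a deep learning model; it swaps in a blacklist-style interceptor and shows only how (change command, risk level) pairs could update it. The function name and the `min_level` threshold are assumptions.

```python
def update_blacklist(blacklist, training_data, min_level=2):
    """Return a new blacklist that adds commands observed at or above min_level."""
    updated = dict(blacklist)           # leave the deployed list untouched
    for command, level in training_data:
        if level >= min_level:
            # Keep the highest risk level seen for each command.
            updated[command] = max(level, updated.get(command, 0))
    return updated
```

Returning a new mapping rather than mutating the old one mirrors the redeploy step: the updated interceptor replaces the previous one as a whole.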
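In an optional implementation, the server 103 sends the impact-scope data or the risk level to the terminal 101, and the terminal 101 displays it on the front end. The front end here may be a display connected to the terminal 101 or a display screen that the terminal 101 itself has; the present application does not limit this.
For example, the server 103 sends the identifier of the first object indicated by the impact-scope data to the terminal 101 via the bastion host 102, so that the terminal 101 displays the identifier of the first object.
As another example, the server 103 sends the identifier of the monitored object indicated by the impact-scope data to the terminal 101 via the bastion host 102, so that the terminal 101 displays the identifier of the monitored object.
As a further example, the server 103 sends the change command and its risk level to the terminal 101 via the bastion host 102, so that the terminal 101 displays the change command and the risk level.
It is worth noting that the examples above cover only the case in which the server 103 sends one type of data. In another case, the server 103 may send several types of data to the terminal 101 at once, for example the identifier of the first object together with the identifier of the monitored object.
By sending at least one of the identifier of the first object, the identifier of the monitored object, and the risk level to the terminal 101 for display, the server 103 makes the data visible, and the user can act promptly on the commands entered into the server according to that data.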
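It can be understood that, to implement the functions of the above embodiments, the processing device includes hardware structures and/or software modules that perform the respective functions. Those skilled in the art will readily appreciate that, in combination with the units and method steps of the examples described in the embodiments disclosed herein, the present application can be implemented in hardware or in a combination of hardware and computer software. Whether a function is performed by hardware or by computer software driving hardware depends on the particular application scenario and design constraints of the technical solution.
The data monitoring and analysis method provided by this embodiment has been described in detail above with reference to Figures 1 to 7. The data monitoring and analysis apparatus provided by this embodiment is described below with reference to Figure 8, which is a schematic structural diagram of a data monitoring and analysis apparatus provided by the present application.
The data monitoring and analysis apparatus can be used to implement the functions of the processor in the method embodiments above and can therefore also achieve the beneficial effects of those embodiments. In this embodiment, the apparatus may be a module (such as a chip) applied to the server 103.
As shown in Figure 8, the data monitoring and analysis apparatus 800 includes a receiving module 810, an object determination module 820, and a level determination module 830. The apparatus 800 is used to implement the functions of the method embodiments shown in Figures 2 to 7.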
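The receiving module 810 is configured to receive a change command.
The object determination module 820 is configured to determine, while the change command is being executed, the monitored object associated with the first object.
A monitored object is a process that the server 103 needs to schedule while executing the change command. For example, a change command executed in the initial object may call or inspect other processes; those processes are associated with the change command and are therefore monitored objects.
The object determination module 820 uses the monitor deployed in advance, during pre-execution management, to obtain the system resources that the monitored object operates on while executing the change command. These system resources include the identifier of the first object and the identifier of the monitored object, from which the server obtains the impact-scope data. For the monitoring techniques used by the monitor, refer to the monitor deployment process described above; it is not repeated here.
The level determination module 830 is configured to determine the risk level corresponding to the change command according to the identifier of the first object and the identifier of the monitored object.
The level determination module 830 may determine the risk level of the impact-scope data using preset rules or a deep learning model.
To further implement the functions of the method embodiments shown in Figures 2 to 7, the data monitoring and analysis apparatus 800 further includes an acquisition module 840, a retrieval module 850, an update module 860, a first monitoring module 870, a second monitoring module 880, and a display module 890.
The acquisition module 840 is configured to obtain alarm information. The retrieval module 850 is configured to search the operation log according to the alarm information. The update module 860 is configured to take the change command and the risk level as input and update the interception model deployed in the server. The first monitoring module 870 is configured to monitor system resources through tracepoints and obtain the identifier of the first object and the identifier of the monitored object. The second monitoring module 880 is configured to monitor remote-access processes: it receives the packets produced when the monitored object runs and parses them to obtain the identifier of the first object and the identifier of the monitored object. The display module 890 is configured to display at least one of the impact-scope data and the risk level.
It should be understood that the server 103 of the foregoing embodiments may correspond to the data monitoring and analysis apparatus 800 and to the entity that performs the methods of Figures 2 to 7 according to the embodiments of the present application, and that the operations and/or functions of the modules in the apparatus 800 respectively implement the corresponding flows of the methods in the embodiments of Figures 2 to 7. For brevity, they are not repeated here.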
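For example, when the data monitoring and analysis apparatus 800 is implemented by the server 103, the server 103 may include various hardware. Figure 9 is a schematic structural diagram of a server provided by the present application. The server 900 can be used in the operation and maintenance system shown in Figure 1 and may be either the bastion host 102 or a server 103.
As shown in Figure 9, the server 900 may include a processor 910, a memory 920, a communication interface 930, a bus 940, and so on; the processor 910, the memory 920, and the communication interface 930 are connected through the bus 940.
The processor 910 is the computing core and control core of the server 900. The processor 910 may be a very-large-scale integrated circuit. An operating system and other software programs are installed in the processor 910 so that it can access the memory 920 and various Peripheral Component Interconnect Express (PCIe) devices. The processor 910 includes one or more processor cores. A processor core in the processor 910 is, for example, a central processing unit (CPU) or another application-specific integrated circuit (ASIC). The processor 910 may also be another general-purpose processor, a digital signal processor (DSP), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. In practice, the server 900 may also include multiple processors.
The memory 920 may be used to store computer-executable program code, which includes instructions. The processor 910 performs the various functional applications and data processing of the server 900 by running the instructions stored in the memory 920. The memory 920 may include a program storage area and a data storage area. The program storage area may store the operating system and the applications required for at least one function (such as a model-running function or a sending function). The data storage area may store data created while the server 900 is in use (such as impact-scope data). In addition, the memory 920 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or universal flash storage (UFS).
The communication interface 930 is used for communication between the server 900 and external devices or components. In this embodiment, the communication interface 930 is used to exchange data with other processing devices.
The bus 940 may include a path for transferring information between the above components (such as the processor 910, the memory 920, and the communication interface 930). Besides a data bus, the bus 940 may also include a power bus, a control bus, a status signal bus, and so on; for clarity, however, all of these buses are labeled as bus 940 in the figure. The bus 940 may be a PCIe bus, an extended industry standard architecture (EISA) bus, a unified bus (Ubus or UB), a compute express link (CXL), a cache coherent interconnect for accelerators (CCIX), or the like. For example, the processor 910 can access I/O devices through the PCIe bus, and the processor 910 is connected to the memory 920 through a double data rate (DDR) bus. Different memories 920 may use different data buses to communicate with the processor 910, so the DDR bus may be replaced with another type of data bus; the embodiments of the present application do not limit the bus type.
It is worth noting that Figure 9 merely takes a server 900 with one processor 910 and one memory 920 as an example. Here, the processor 910 and the memory 920 each denote a type of component or device; in a specific embodiment, the number of components or devices of each type can be determined according to service requirements.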
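The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When software is used, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer programs or instructions. When the computer programs or instructions are loaded and executed on a computer, the flows or functions described in the embodiments of the present application are performed in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, a network device, user equipment, or another programmable apparatus. The computer programs or instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired or wireless manner. The computer-readable storage medium may be any available medium that a computer can access, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium, for example a floppy disk, hard disk, or magnetic tape; an optical medium, for example a digital video disc (DVD); or a semiconductor medium, for example a solid state drive (SSD).
The foregoing is merely a specific implementation of the present application, but the scope of protection of the present application is not limited thereto. Any person skilled in the art can readily conceive of various equivalent modifications or replacements within the technical scope disclosed herein, and these modifications or replacements shall fall within the scope of protection of the present application. Therefore, the scope of protection of the present application shall be subject to the scope of protection of the claims.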

Claims (24)

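  1. A data monitoring and analysis method, characterized in that the method is performed by a server and comprises:
    receiving a change command, the change command instructing that a change operation be performed on a first object in a first application;
    determining, while executing the change command, a monitored object associated with the first object;
    determining a risk level corresponding to the change command according to an identifier of the first object and an identifier of the monitored object.
  2. The method according to claim 1, characterized in that the monitored object comprises one or more of the following:
    a second object in the first application, an object in a second application.
  3. The method according to claim 1 or 2, characterized in that the monitored object is associated with the first object through a system call function.
  4. The method according to any one of claims 1 to 3, characterized in that the change operation comprises one or more of the following:
    addition, deletion, modification.
  5. The method according to any one of claims 1 to 4, characterized in that determining the risk level corresponding to the change command according to the identifier of the first object and the identifier of the monitored object comprises:
    inputting the identifier of the first object and the identifier of the monitored object into a risk assessment model to determine the risk level corresponding to the change command.
  6. The method according to any one of claims 1 to 5, characterized in that the method further comprises: obtaining alarm information, the alarm information indicating fault data produced by the server during operation;
    searching an operation log according to the alarm information to determine the change command corresponding to the fault data, the operation log indicating operation records of a plurality of change commands directed at the first application, the plurality of change commands including the change command.
  7. The method according to claim 5 or 6, characterized in that the method further comprises:
    taking the change command and the risk level as input to update an interception model deployed in the server, the updated interception model being used to intercept some change commands.
  8. The method according to any one of claims 1 to 7, characterized in that the method further comprises:
    invoking a tracepoint of the monitored object when the monitored object is run;
    monitoring system resources through the tracepoint to obtain the identifier of the first object and the identifier of the monitored object.
  9. The method according to any one of claims 1 to 7, characterized in that the monitored object is a remote-access process, and the method further comprises:
    receiving a packet produced when the monitored object is run;
    parsing the packet to obtain the identifier of the first object and the identifier of the monitored object.
  10. The method according to any one of claims 1 to 9, characterized in that the method further comprises:
    displaying at least one of the identifier of the first object, the identifier of the monitored object, and the risk level.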
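  11. A data monitoring and analysis apparatus, characterized in that the apparatus comprises:
    a receiving module, configured to receive a change command, the change command instructing that a change operation be performed on a first object in a first application;
    an object determination module, configured to determine, while the change command is being executed, a monitored object associated with the first object;
    a level determination module, configured to determine a risk level corresponding to the change command according to an identifier of the first object and an identifier of the monitored object.
  12. The apparatus according to claim 11, characterized in that the monitored object comprises one or more of the following:
    a second object in the first application, an object in a second application.
  13. The apparatus according to claim 11 or 12, characterized in that the monitored object is associated with the first object through a system call function.
  14. The apparatus according to any one of claims 11 to 13, characterized in that the change operation comprises one or more of the following:
    addition, deletion, modification.
  15. The apparatus according to any one of claims 11 to 14, characterized in that the level determination module is further configured to: input the identifier of the first object and the identifier of the monitored object into a risk assessment model to determine the risk level corresponding to the change command.
  16. The apparatus according to any one of claims 11 to 15, characterized in that the apparatus further comprises:
    an acquisition module, configured to obtain alarm information, the alarm information indicating fault data produced by the server during operation;
    a retrieval module, configured to search an operation log according to the alarm information to determine the change command corresponding to the fault data, the operation log indicating operation records of a plurality of change commands directed at the first application, the plurality of change commands including the change command.
  17. The apparatus according to claim 15 or 16, characterized in that the apparatus further comprises:
    an update module, configured to take the change command and the risk level as input to update an interception model deployed in the server, the updated interception model being used to intercept some change commands.
  18. The apparatus according to any one of claims 11 to 17, characterized in that the apparatus further comprises:
    a first monitoring module, configured to invoke a tracepoint of the monitored object when the monitored object is run, and to monitor system resources through the tracepoint to obtain the identifier of the first object and the identifier of the monitored object.
  19. The apparatus according to any one of claims 11 to 17, characterized in that the monitored object is a remote-access process, and the apparatus further comprises:
    a second monitoring module, configured to receive a packet produced when the monitored object is run, and to parse the packet to obtain the identifier of the first object and the identifier of the monitored object.
  20. The apparatus according to any one of claims 11 to 19, characterized in that the apparatus further comprises:
    a display module, configured to display at least one of the identifier of the first object, the identifier of the monitored object, and the risk level.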
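  21. A server, characterized by comprising a processor and a memory, wherein the memory stores instructions and the processor invokes the instructions to implement the method according to any one of claims 1 to 10.
  22. An operation and maintenance system, characterized by comprising a bastion host and a plurality of servers;
    the bastion host is configured to receive and screen execution commands to obtain change commands;
    the servers are configured to execute the change commands and to monitor and analyze the process of executing the change commands, implementing the method according to any one of claims 1 to 10.
  23. A computer-readable storage medium, characterized in that the storage medium stores a computer program or instructions which, when executed by a processing device, implement the method according to any one of claims 1 to 10.
  24. A computer program product comprising a computer program or instructions, characterized in that, when the computer program or instructions are executed by a processing device, the method according to any one of claims 1 to 10 is implemented.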
PCT/CN2023/101436 2022-09-26 2023-06-20 数据监控分析方法、装置、服务器、运维系统及存储介质 WO2024066506A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211172716.0A CN117806899A (zh) 2022-09-26 2022-09-26 数据监控分析方法、装置、服务器、运维系统及存储介质
CN202211172716.0 2022-09-26

Publications (1)

Publication Number Publication Date
WO2024066506A1 true WO2024066506A1 (zh) 2024-04-04

Family

ID=90427410

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/101436 WO2024066506A1 (zh) 2022-09-26 2023-06-20 数据监控分析方法、装置、服务器、运维系统及存储介质

Country Status (2)

Country Link
CN (1) CN117806899A (zh)
WO (1) WO2024066506A1 (zh)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8646084B1 (en) * 2012-09-28 2014-02-04 Kaspersky Lab Zao Securing file launch activity utilizing safety ratings
CN110390465A (zh) * 2019-06-18 2019-10-29 深圳壹账通智能科技有限公司 业务数据的风控分析处理方法、装置和计算机设备
CN112559023A (zh) * 2020-12-24 2021-03-26 中国农业银行股份有限公司 一种变更风险的预测方法、装置、设备及可读存储介质
CN112749879A (zh) * 2020-12-18 2021-05-04 成都飞机工业(集团)有限责任公司 一种基于异地厂所协同环境下的工程全局变更方法
CN113469584A (zh) * 2021-09-02 2021-10-01 云账户技术(天津)有限公司 一种业务服务运营的风险管理方法及装置

Also Published As

Publication number Publication date
CN117806899A (zh) 2024-04-02

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23869754

Country of ref document: EP

Kind code of ref document: A1