CN115391000A - Business resource monitoring method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN115391000A
Authority
CN
China
Prior art keywords
container
parameter
data processing
configuration time
graphics processor
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110571059.6A
Other languages
Chinese (zh)
Inventor
查冲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202110571059.6A
Publication of CN115391000A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806 Task transfer initiation or dispatching
    • G06F9/4843 Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881 Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5022 Mechanisms to release resources
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/5038 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the execution order of a plurality of tasks, e.g. taking priority or time dependency constraints into consideration
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00 General purpose image data processing
    • G06T1/20 Processor architectures; Processor configuration, e.g. pipelining
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00 Indexing scheme relating to G06F9/00
    • G06F2209/50 Indexing scheme relating to G06F9/50
    • G06F2209/5021 Priority
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00 Indexing scheme relating to G06F9/00
    • G06F2209/50 Indexing scheme relating to G06F9/50
    • G06F2209/508 Monitor

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Power Sources (AREA)

Abstract

The invention provides a service resource monitoring method, a service resource monitoring device, electronic equipment and a storage medium, wherein the method comprises the following steps: monitoring data processing tasks in the operating environment of a graphics processor computing power container; acquiring a first process termination signal sent by a container closing instruction; triggering a signal processing function; acquiring, through the signal processing function, a processing time delay parameter and a configuration time parameter matched with the signal processing function, and obtaining a comparison result of the processing time delay parameter and the configuration time parameter; and monitoring the recovery process of the service resources corresponding to the operating environment of the graphics processor computing power container based on the comparison result of the processing time delay parameter and the configuration time parameter. Therefore, the recovery process of the service resources can be monitored, the service resources can be recovered in a timely and comprehensive manner, graceful exit of the container can be ensured, the use efficiency of the service resources can be improved, the service data processing speed can be guaranteed, and the use experience of a user can be improved.

Description

Business resource monitoring method and device, electronic equipment and storage medium
Technical Field
The present invention relates to a service resource monitoring processing technology of a cloud network, and in particular, to a service resource monitoring method and apparatus, an electronic device, and a storage medium.
Background
With the continuous development of computer technology, a cloud server (Cloud Virtual Machine, CVM) can provide a secure and reliable elastic computing service, and can also provide different instance types to meet the specific use scenarios of a user. The instance types comprise different combinations of CPU, memory, storage and network. When a cloud server runs a neural network model or executes a training task for a neural network model, the user may need to interrupt the training task, or a system exception may interrupt the task during training. In either case, regardless of whether a data IO read operation or another training computation is being executed in the container, graceful exit of the container needs to be realized so that occupied service resources do not remain unreleased; otherwise the use efficiency of the service resources is affected.
Disclosure of Invention
In view of this, embodiments of the present invention provide a method and an apparatus for monitoring a service resource, an electronic device, and a storage medium, which can monitor the recovery process of the service resources corresponding to the operating environment of a graphics processor computing power container based on a comparison result of a processing delay parameter and a configuration time parameter, so as to ensure timely and comprehensive recovery of the service resources and graceful exit of the container, thereby improving the use efficiency of the service resources, ensuring the processing speed of service data, and improving the use experience of a user.
The technical scheme of the embodiment of the invention is realized as follows:
the embodiment of the invention provides a service resource monitoring method, which comprises the following steps:
monitoring data processing tasks in the operating environment of the graphics processor computational power container;
when the interruption of the data processing task is monitored, acquiring a first process termination signal sent by a container closing instruction;
triggering a signal processing function in response to the first process termination signal;
acquiring a processing time delay parameter and a configuration time parameter matched with the signal processing function through the signal processing function;
comparing the processing time delay parameter with the configuration time parameter to obtain a comparison result of the processing time delay parameter and the configuration time parameter;
and monitoring the recovery process of the service resources corresponding to the operating environment of the computational power container of the graphics processor based on the comparison result of the processing time delay parameter and the configuration time parameter.
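The steps above can be sketched in Python as a minimal illustration. It assumes, since the text does not define it precisely, that the processing time delay parameter is the time elapsed since the first process termination signal (SIGTERM) was received; the names `TerminationMonitor`, `compare_delay` and `CONFIG_TIME_S` are hypothetical.

```python
import signal
import time

CONFIG_TIME_S = 10.0  # hypothetical configuration time parameter, in seconds


def compare_delay(processing_delay_s, config_time_s):
    """Compare the processing delay with the configured time limit."""
    # Below the limit: keep recovering resources; otherwise force the close.
    return "recover" if processing_delay_s < config_time_s else "force_close"


class TerminationMonitor:
    """Records when the first termination signal arrived (assumed design)."""

    def __init__(self, config_time_s=CONFIG_TIME_S, clock=time.monotonic):
        self.config_time_s = config_time_s
        self.clock = clock
        self.signal_received_at = None

    def handle_sigterm(self, signum, frame):
        # Signal processing function: note the arrival time of the signal.
        self.signal_received_at = self.clock()

    def decision(self):
        if self.signal_received_at is None:
            return "running"
        delay = self.clock() - self.signal_received_at
        return compare_delay(delay, self.config_time_s)


def install(monitor):
    """Register the handler for SIGTERM in the current process."""
    signal.signal(signal.SIGTERM, monitor.handle_sigterm)
```

A supervising loop could call `decision()` periodically and start the forced close once it returns `"force_close"`.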
The embodiment of the present invention further provides a service resource monitoring device, including:
the information transmission module is used for monitoring data processing tasks in the operating environment of the computational power container of the graphics processor;
the information processing module is used for acquiring a first process termination signal sent by a container closing instruction when the interruption of the data processing task is monitored;
the information processing module is used for responding to the first process termination signal and triggering a signal processing function;
the information processing module is used for acquiring a processing time delay parameter and a configuration time parameter matched with the signal processing function through the signal processing function;
the information processing module is used for comparing the processing time delay parameter with the configuration time parameter to obtain a comparison result of the processing time delay parameter and the configuration time parameter;
and the information processing module is used for monitoring the recovery process of the service resources corresponding to the operating environment of the computational power container of the graphic processor based on the comparison result of the processing time delay parameter and the configuration time parameter.
In the above scheme,
the information processing module is used for executing a service resource recovery process corresponding to the trigger signal processing function when the processing time delay parameter is smaller than the configuration time parameter;
the information processing module is used for recovering the memory corresponding to the data processing task through the service resource recovery process;
and the information processing module is used for closing the bidirectional communication connection corresponding to the data processing task through the service resource recycling process.
In the above scheme,
the information processing module is used for closing the user space file system client corresponding to the data processing task through the service resource recovery process and terminating the user space file system client when the operation environment of the computing power container of the graphic processor is a cloud server network;
and the information processing module is used for closing the file descriptor corresponding to the data processing task through the service resource recovery process.
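The recovery operations listed above (memory, bidirectional communication connections, file descriptors) might look like the following sketch. The `task` dictionary and its keys are hypothetical; shutting down a user-space file system client is environment-specific and is deliberately left out.

```python
import os
import socket


def recover_resources(task):
    """Release the resources held by one data processing task.

    `task` is a hypothetical dict with the keys 'buffers' (in-memory
    buffers), 'connections' (sockets) and 'fds' (raw file descriptors).
    Returns the names of the completed recovery steps.
    """
    completed = []

    # 1. Drop references to in-memory buffers so they can be reclaimed.
    task["buffers"].clear()
    completed.append("memory")

    # 2. Shut down both directions of every connection, then close it.
    for conn in task["connections"]:
        try:
            conn.shutdown(socket.SHUT_RDWR)
        except OSError:
            pass  # peer already gone or socket not connected
        conn.close()
    completed.append("connections")

    # 3. Close raw file descriptors; ignore ones that are already closed.
    for fd in task["fds"]:
        try:
            os.close(fd)
        except OSError:
            pass
    completed.append("file_descriptors")
    return completed
```

Each step tolerates resources that are already gone, so the routine can be called again safely if the first attempt is interrupted.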
In the above scheme,
the information processing module is used for triggering a second process termination signal corresponding to the graphics processor computing power container when the processing time delay parameter is greater than or equal to the configuration time parameter;
and the information processing module is used for closing the graphics processor computing power container through a container closing instruction corresponding to the second process termination signal.
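As an illustration of this escalation, the sketch below stops a child process from the supervisor side using Python's `subprocess` module. It assumes a POSIX system, where `Popen.terminate()` sends SIGTERM and `Popen.kill()` sends SIGKILL, and it treats a plain child process as a stand-in for the container; `stop_process` is a hypothetical helper.

```python
import subprocess


def stop_process(proc, config_time_s):
    """Ask a process to exit, and force it if it exceeds the time limit."""
    proc.terminate()  # first termination signal (SIGTERM on POSIX)
    try:
        # The process finished its cleanup within the configured time.
        proc.wait(timeout=config_time_s)
        return "graceful"
    except subprocess.TimeoutExpired:
        # Delay reached the configured time: send the second signal.
        proc.kill()  # SIGKILL on POSIX, cannot be caught or ignored
        proc.wait()  # reap the child so it does not linger as a zombie
        return "forced"
```

The final `wait()` matters: without it the killed child would remain in the process table until the parent collects its exit status.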
In the above scheme,
the information processing module is used for determining the environment type of the operating environment of the computing power container of the graphics processor;
and the information processing module is used for adjusting the configuration time parameter based on the environment type of the operation environment of the graphics processor computing power container so as to realize the matching of the configuration time parameter and the environment type of the operation environment of the graphics processor computing power container.
In the above scheme,
the information processing module is used for determining the complexity of the neural network model operated in the graphics processor computational power container;
and the information processing module is used for adjusting the configuration time parameter when the complexity of the neural network model is greater than a complexity threshold value so as to realize the matching of the configuration time parameter and the complexity of the neural network model operated in the computational power container of the graphic processor.
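One possible way to adjust the configuration time parameter is sketched below. The environment names, base values, the use of a parameter count as the complexity measure, and the scaling factor are all assumptions made for illustration; the text does not specify them.

```python
# Hypothetical base grace periods (seconds) per environment type.
BASE_CONFIG_TIME_S = {
    "cloud_server": 30.0,
    "local_cluster": 10.0,
}


def adjusted_config_time(env_type, model_param_count,
                         complexity_threshold=100_000_000,
                         scale=2.0, default_s=10.0):
    """Pick a grace period for the environment, extended for large models."""
    config_time_s = BASE_CONFIG_TIME_S.get(env_type, default_s)
    # Models above the complexity threshold get more time to release state.
    if model_param_count > complexity_threshold:
        config_time_s *= scale
    return config_time_s
```

For example, `adjusted_config_time("cloud_server", 300_000_000)` yields a longer limit than the same call with a small model.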
In the above scheme,
the information processing module is used for analyzing the data processing task and acquiring a priority identifier of the data processing task;
the information processing module is used for sequencing the priority of the received data processing task according to the priority identification of the data processing task;
the information processing module is used for creating a corresponding data processing task queue according to the priority of the data processing task;
and the information processing module is used for adjusting the processing sequence of the computational power container of the graphics processor according to the data processing task queue.
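The priority-based queue described above can be illustrated with Python's `heapq` module. This is a sketch under assumptions: each task is a hypothetical dict with `name` and `priority` keys, a lower number means a higher priority, and tasks of equal priority keep their arrival order.

```python
import heapq
import itertools


class TaskQueue:
    """Orders data processing tasks by their priority identifier."""

    def __init__(self):
        self._heap = []
        self._arrival = itertools.count()  # tie-breaker: arrival order

    def submit(self, task):
        # Lower priority number is served first (assumed convention).
        entry = (task["priority"], next(self._arrival), task["name"])
        heapq.heappush(self._heap, entry)

    def drain(self):
        """Return task names in the order the container should run them."""
        order = []
        while self._heap:
            _, _, name = heapq.heappop(self._heap)
            order.append(name)
        return order
```

The arrival counter keeps the ordering stable, so two tasks with the same priority are never reordered relative to each other.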
An embodiment of the present invention further provides an electronic device, where the electronic device includes:
a memory for storing executable instructions;
and the processor is used for realizing the foregoing service resource monitoring method when running the executable instructions stored in the memory.
The embodiment of the invention also provides a computer-readable storage medium, which stores executable instructions, and the executable instructions, when executed by a processor, realize the foregoing service resource monitoring method.
Embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instruction from the computer-readable storage medium, and executes the computer instruction, so that the computer device executes the service resource monitoring method provided by the embodiment of the application.
The embodiment of the invention has the following beneficial effects:
the embodiment of the invention monitors data processing tasks in the operating environment of the graphics processor computing power container; when the interruption of the data processing task is monitored, acquires a first process termination signal sent by a container closing instruction; triggers a signal processing function in response to the first process termination signal; acquires, through the signal processing function, a processing time delay parameter and a configuration time parameter matched with the signal processing function; compares the processing time delay parameter with the configuration time parameter to obtain a comparison result of the processing time delay parameter and the configuration time parameter; and monitors the recovery process of the service resources corresponding to the operating environment of the graphics processor computing power container based on the comparison result of the processing time delay parameter and the configuration time parameter. Therefore, the recovery process of the service resources corresponding to the operating environment of the graphics processor computing power container can be monitored based on the comparison result, timely and comprehensive recovery of the service resources is guaranteed, graceful exit of the container is guaranteed, the use efficiency of the service resources is improved, the service data processing speed is also guaranteed, and the use experience of a user is improved.
Drawings
Fig. 1 is a schematic view of a usage scenario of a service resource monitoring method according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of an electronic device according to an embodiment of the present invention;
fig. 3 is an optional flowchart of a service resource monitoring method according to an embodiment of the present invention;
fig. 4 is an optional flowchart of a service resource monitoring method according to an embodiment of the present invention;
FIG. 5 is a schematic diagram illustrating a process of graceful exit of a graphics processor computing power container according to an embodiment of the present invention;
fig. 6 is an optional flowchart of the service resource monitoring method according to the embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention will be further described in detail with reference to the accompanying drawings, the described embodiments should not be construed as limiting the present invention, and all other embodiments obtained by a person of ordinary skill in the art without creative efforts shall fall within the protection scope of the present invention.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
Before further detailed description of the embodiments of the present invention, terms and expressions mentioned in the embodiments of the present invention are explained, and the terms and expressions mentioned in the embodiments of the present invention are applied to the following explanations.
1) In response to: indicates the condition or state on which a performed operation depends. When the dependent condition or state is satisfied, the one or more performed operations may be executed in real time or with a set delay; unless otherwise specified, there is no restriction on the order in which the operations are executed.
2) Terminal: including but not limited to a common terminal and a dedicated terminal, wherein the common terminal maintains a long connection and/or a short connection with a sending channel, and the dedicated terminal maintains a long connection with the sending channel.
3) Client: a carrier for realizing a specific function in a terminal. For example, a mobile client (APP) is a carrier for realizing a specific function in a mobile terminal, such as a function of making reports or a function of displaying reports.
4) Server cluster: a collection of servers that together perform the same service and that appears to a client as a single server. A server cluster can use a plurality of computers to perform parallel computation so as to obtain high computation speed, and can also use a plurality of computers for backup so that the whole system can still run normally when any one machine is damaged. The server cluster hard disk fault processing method provided by the application can be applied to a cloud server use scenario and a distributed server use scenario, and can be used for detecting the state of, and repairing faults in, server hard disks in different use scenarios. In particular, a cloud server (Cloud Virtual Machine, CVM) is a simple, efficient, safe and reliable computing service with elastically scalable processing capacity, and its management mode is simpler and more efficient than that of a traditional physical server. A user can quickly create or release any number of cloud servers for the user's service processes without purchasing hardware in advance, and the cloud servers also store the user's data. In a distributed server use environment, the data and programs of a user may not be located on one server but may be dispersed across a plurality of servers. Similarly, a distributed server use environment also needs to be configured with a large number of hard disks, and state detection and fault repair of the server hard disks need to be realized by the server cluster hard disk fault processing method provided by the application.
5) Graceful container exit: when a graphics processor computing power container is running an AI training task and a data IO operation or another training computation is being executed in the container, the user may need to interrupt the task for some reason, or the task may be interrupted by a system exception. In that case the process environment, network environment and file system environment in the container need to exit completely, leaving no dirty data and no zombie or stuck processes, so that the exited container does not keep occupying unreleased resources. Completely destroying the container environment in this way is called graceful container exit.
Fig. 1 is a schematic view of a usage scenario of a service resource monitoring method provided by an embodiment of the present invention. Referring to fig. 1, with the continuous development of computer technology, a cloud server (Cloud Virtual Machine, CVM) may provide a secure and reliable elastic computing service, and may also provide different instance types to meet the specific usage scenarios of a user. The terminals (including the terminal 10-1 and the terminal 10-2) are provided with corresponding clients capable of executing different functions, through which the terminals acquire different information from the corresponding cloud server 200 via the network 300, and different services can be deployed in the cloud server. The terminals are connected to the cloud server 200 through the network 300; the network 300 may be a wide area network or a local area network, or a combination of the two, and uses a wireless link to implement data transmission. The instance types provided by the cloud server are composed of different combinations of CPU, memory, storage and network. The service data of a user is stored in a hard disk of the cloud server, and a graphics processor computing power container deployed in the cloud server can perform different data processing tasks, such as training a neural network model in a game environment to determine different parameters of the neural network model. The cloud server 200 may deploy single-card instances with a hardware JPEG decoder (JPEG HW decoder), each of which may also contain 5 video decoders (NVDEC) and 1 optical flow accelerator (OFA). These resources allow the user to move time-consuming preprocessing load that would otherwise run on the CPU to the GPU, so as to accelerate the training of the neural network.
In the embodiment provided by the present invention, the cloud server application running in the cloud server 200 may be written in software code environments of different programming languages, and the code objects may be different types of code entities. For example, in C language software code, a code object may be a function. In JAVA language software code, a code object may be a class; in the iOS Objective-C (OC) language, a code object may be a piece of object code. In C++ language software code, a code object may be a class or a function that executes processing instructions from different terminals. In the application, no further distinction is made as to the compiling environment of the cloud server application.
The structure of the service resource monitoring device according to the embodiment of the present invention is described in detail below. The service resource monitoring device may be implemented in various forms, such as a dedicated terminal with a processing function of the service resource monitoring device, or a server with a processing function of the service resource monitoring device, such as the server 200 in the foregoing fig. 1. Fig. 2 is a schematic structural diagram of a service resource monitoring device according to an embodiment of the present invention. It can be understood that fig. 2 only shows an exemplary structure of the service resource monitoring device, not the whole structure, and part or all of the structure shown in fig. 2 may be implemented as needed.
The electronic device provided by the embodiment of the invention comprises: at least one processor 201, memory 202, user interface 203, and at least one network interface 204. The various components in the service resource monitoring apparatus are coupled together by a bus system 205. It will be appreciated that the bus system 205 is used to enable communications among the components. The bus system 205 includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, however, the various buses are labeled as bus system 205 in fig. 2.
The user interface 203 may include, among other things, a display, a keyboard, a mouse, a trackball, a click wheel, a key, a button, a touch pad, or a touch screen.
It will be appreciated that the memory 202 can be either volatile memory or nonvolatile memory, and can include both volatile and nonvolatile memory. The memory 202 in the embodiments of the present invention is capable of storing data to support the operation of the terminal (e.g., 10-1). Examples of such data include: any computer program, such as an operating system and application programs, for operating on a terminal (e.g., 10-1). The operating system includes various system programs, such as a framework layer, a core library layer, a driver layer, and the like, and is used for implementing various basic services and processing hardware-based tasks. The application program may include various application programs.
In some embodiments, the service resource monitoring apparatus provided in the embodiments of the present invention may be implemented by a combination of software and hardware, and as an example, the service resource monitoring apparatus provided in the embodiments of the present invention may be a processor in the form of a hardware decoding processor, which is programmed to execute the service resource monitoring method provided in the embodiments of the present invention. For example, a processor in the form of a hardware decoding processor may employ one or more Application Specific Integrated Circuits (ASICs), DSPs, programmable Logic Devices (PLDs), complex Programmable Logic Devices (CPLDs), field Programmable Gate Arrays (FPGAs), or other electronic components.
As an example that the service resource monitoring apparatus provided by the embodiment of the present invention is implemented by combining software and hardware, the service resource monitoring apparatus provided by the embodiment of the present invention may be directly embodied as a combination of software modules executed by the processor 201, where the software modules may be located in a storage medium, the storage medium is located in the memory 202, and the processor 201 reads executable instructions included in the software modules in the memory 202, and completes the service resource monitoring method provided by the embodiment of the present invention in combination with necessary hardware (for example, including the processor 201 and other components connected to the bus 205).
By way of example, the Processor 201 may be an integrated circuit chip having Signal processing capabilities, such as a general purpose Processor, a Digital Signal Processor (DSP), or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or the like, wherein the general purpose Processor may be a microprocessor or any conventional Processor or the like.
As an example of the service resource monitoring apparatus provided by the embodiment of the present invention implemented by hardware, the apparatus provided by the embodiment of the present invention may be implemented by directly using the processor 201 in the form of a hardware decoding processor, for example, the apparatus may be implemented by one or more Application Specific Integrated Circuits (ASICs), DSPs, programmable Logic Devices (PLDs), complex Programmable Logic Devices (CPLDs), field Programmable Gate Arrays (FPGAs), or other electronic components to implement the service resource monitoring method provided by the embodiment of the present invention.
The memory 202 in the embodiment of the present invention is used to store various types of data to support the operation of the service resource monitoring apparatus. Examples of such data include any executable instructions for operating on the service resource monitoring apparatus; a program that implements the service resource monitoring method of the embodiment of the present invention may be included in the executable instructions.
In other embodiments, the service resource monitoring apparatus provided in the embodiments of the present invention may be implemented in software. Fig. 2 illustrates the service resource monitoring apparatus stored in the memory 202, which may be software in the form of a program, a plug-in, and the like, and which includes a series of modules. As an example of the program stored in the memory 202, the service resource monitoring apparatus includes the following software modules: an information transmission module 2081 and an information processing module 2082. When the software modules in the service resource monitoring apparatus are read into the RAM by the processor 201 and executed, the service resource monitoring method provided by the embodiment of the present invention is implemented. The functions of each software module in the service resource monitoring apparatus include:
The information transmission module 2081 is configured to monitor data processing tasks in the operating environment of the graphics processor computing power container.
The information processing module 2082 is configured to, when it is monitored that the data processing task is interrupted, obtain a first process termination signal sent by a container closing instruction.
The information processing module 2082 is configured to trigger a signal processing function in response to the first process termination signal.
The information processing module 2082 is configured to obtain, through the signal processing function, a processing delay parameter and a configuration time parameter that are matched with the signal processing function.
The information processing module 2082 is configured to compare the processing delay parameter with the configuration time parameter to obtain a comparison result between the processing delay parameter and the configuration time parameter.
The information processing module 2082 is configured to monitor a recycling process of the service resources corresponding to the graphics processor computing power container operating environment based on the comparison result between the processing delay parameter and the configuration time parameter.
According to the electronic device shown in fig. 2, in one aspect of the present application, the present application also provides a computer program product or a computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and executes the computer instructions, so that the computer device executes the method provided in the various optional implementation manners of the business resource monitoring method.
Before introducing the service resource monitoring method provided by the present application, the shortcomings of the graphics processor computing power container exit process in the conventional technology are first described. For an abnormal exit of a graphics processor computing power container, the conventional scheme relies on executing the docker stop command, but it does not standardize, from the perspective of platform use, the capture of the standard stop signal and the execution of a corresponding handler function. Resource release at container exit therefore depends too heavily on the user program. Because the quality of user code is uneven, the SIGTERM signal sent by the stop command is easily ignored, and the container is then forcibly released by a SIGKILL signal. Zombie processes may result, so that the resources in the container cannot be released in time, the graceful exit of the container cannot be completed, and the allocation and utilization efficiency of new GPU computing power containers is affected.
To solve this drawback, referring to fig. 3, fig. 3 is an optional flowchart of the service resource monitoring method provided in the embodiment of the present invention, and it can be understood that the steps shown in fig. 3 may be executed by various electronic devices operating the service resource monitoring apparatus, for example, the steps may be executed by a dedicated terminal with a service resource monitoring function, a server or a server cluster controller, or a control terminal of a cloud network server. The dedicated terminal with the service resource monitoring device can be packaged in the server 200 shown in fig. 1 to execute the corresponding software module in the service resource monitoring device shown in the foregoing fig. 2. The following is a description of the steps shown in fig. 3.
Step 301: the service resource monitoring device monitors data processing tasks in the operating environment of the computing power container of the graphics processor.
The embodiment of the present invention may be implemented in combination with cloud technology. Cloud technology refers to a hosting technology that unifies a series of resources such as hardware, software, and networks in a wide area network or a local area network to implement the computation, storage, processing, and sharing of data. It can also be understood as a general term for the network technology, information technology, integration technology, management platform technology, application technology, and the like that are applied based on the cloud computing business model. The background services of technical network systems, such as video websites, picture websites, and portal websites, require a large amount of computing and storage resources, so cloud technology needs to be supported by cloud computing.
It should be noted that cloud computing is a computing mode, and distributes computing tasks on a resource pool formed by a large number of computers, so that various application systems can obtain computing power, storage space and information services as required. The network that provides the resources is referred to as the "cloud". Resources in the "cloud" appear to the user as being infinitely expandable and available at any time, available on demand, expandable at any time, and paid for on-demand. As a basic capability provider of cloud computing, a cloud computing resource pool platform, which is called an Infrastructure as a Service (IaaS) for short, is established, and multiple types of virtual resources are deployed in a resource pool and are used by external clients selectively. The cloud computing resource pool mainly comprises: a computing device (which may be a virtualized machine, including an operating system), a storage device, and a network device.
Step 302: when monitoring that the data processing task is interrupted, the service resource monitoring device acquires a first process termination signal sent by a container closing instruction.
The data processing tasks in the graphics processor computing power container operating environment may be of various types, for example: deep learning, graphics and image processing, video encoding and decoding, graph databases, high-performance databases, computational fluid dynamics, computational finance, seismic analysis, molecular modeling, and genomics. Taking a neural network model training task in deep learning as an example, when a user's training task is terminated, a docker stop command (that is, the container closing instruction) is issued for the graphics processor computing power container. The container closing instruction triggers the first process termination signal (that is, a SIGTERM signal event). This signal event needs to be captured by the start command of the graphics processor computing power container, and the first process termination signal must not be ignored or blocked.
Step 303: the service resource monitoring device responds to the first process termination signal and triggers a signal processing function.
In some embodiments of the present invention, after the graphics processor computing power container captures the first process termination signal, the SIG_Function in the start command is called to handle the signal, and the container's resources are released and reclaimed through the signal processing function.
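As a rough illustration of the signal capture described above, the following Python sketch registers a handler for SIGTERM. The class name `GracefulExit` and its attributes are hypothetical and are not taken from the embodiment; only the idea of a handler playing the role of SIG_Function comes from the text, and the language choice is an assumption.

```python
import signal


class GracefulExit:
    """Hypothetical holder for the state set by a SIG_Function-style handler."""

    def __init__(self):
        self.requested = False      # becomes True once SIGTERM has been handled
        self.signal_name = None     # name of the signal that triggered the exit

    def sig_function(self, signum, frame):
        # Stand-in for SIG_Function: record the request so that the main
        # program can start releasing the container's resources.
        self.requested = True
        self.signal_name = signal.Signals(signum).name

    def install(self):
        # Must be called from the main thread of the process that will
        # receive the signal (an assumption about how the start command runs).
        signal.signal(signal.SIGTERM, self.sig_function)
```

In this sketch the start command would call `install()` once at startup and poll `requested` (or perform cleanup directly in the handler). The handler only helps if the process that registers it is the one that actually receives SIGTERM from the container closing instruction.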
Step 304: the service resource monitoring device acquires, through the signal processing function, a processing time delay parameter and a configuration time parameter that are matched with the signal processing function.
Step 305: the service resource monitoring device compares the processing time delay parameter with the configuration time parameter to obtain a comparison result of the processing time delay parameter and the configuration time parameter.
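The comparison in step 305 can be sketched as a small function. This is a minimal sketch under the assumption that both parameters are expressed in seconds; the names `Comparison` and `compare` are hypothetical, while the two outcomes follow the branches described for steps 402 to 405 of fig. 4.

```python
from enum import Enum


class Comparison(Enum):
    # Delay is smaller than the configured time: run the recovery process.
    WITHIN_CONFIGURED_TIME = "recover"
    # Delay is greater than or equal to the configured time:
    # trigger the second process termination signal.
    CONFIGURED_TIME_REACHED = "force_kill"


def compare(processing_delay: float, config_time: float) -> Comparison:
    """Return the comparison result used to decide the next step."""
    if processing_delay < config_time:
        return Comparison.WITHIN_CONFIGURED_TIME
    return Comparison.CONFIGURED_TIME_REACHED
```

Note that equality falls on the forced-termination side, matching the "greater than or equal to" wording of the method.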
In some embodiments of the present invention, the configuration time parameter can be flexibly adjusted for different use environments. Specifically, the environment type of the graphics processor computing power container operating environment can be determined, and the configuration time parameter can be adjusted based on that environment type so that the configuration time parameter matches the environment type. For example, for training a neural network model in a cloud network, the configuration time parameter is usually on the order of seconds; if task data is stored through a network disk mounted over the cloud network, the configuration time parameter can be increased to ensure that the network disk data I/O exits completely and to avoid zombie processes.
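One possible way to express this adjustment is a lookup keyed by environment type. The environment names and the numeric values below are assumptions chosen only for illustration; the text specifies neither, beyond stating that the parameter is typically on the order of seconds and may be increased when a network disk is mounted.

```python
# Hypothetical base value and per-environment extensions, in seconds.
BASE_CONFIG_TIME = 10.0
EXTRA_TIME_BY_ENV = {
    "local_disk": 0.0,           # no network I/O to drain
    "cloud_network_disk": 20.0,  # allow mounted network disk I/O to finish
}


def config_time_for(env_type: str) -> float:
    """Return a configuration time parameter matched to the environment type."""
    try:
        return BASE_CONFIG_TIME + EXTRA_TIME_BY_ENV[env_type]
    except KeyError:
        raise ValueError(f"unknown environment type: {env_type!r}") from None
```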
In some embodiments of the present invention, the neural network models run in the graphics processor computing power container differ in complexity, and the training task for a neural network model with a complex structure is itself complex. Therefore, when determining the configuration time parameter, the complexity of the neural network model run in the graphics processor computing power container may be determined first; when the complexity of the neural network model is greater than a complexity threshold, the configuration time parameter is adjusted so that it matches the complexity of the neural network model run in the graphics processor computing power container. For example, in a game scene, the complexity of a bidirectional attention neural network model that analyzes user behavior information in the game pictures of shooting games, parkour games, Multiplayer Online Battle Arena (MOBA) games, racing games (RCG), sports games (SPG), and the like is far greater than that of a graph neural network model that analyzes user behavior information in chess and card games. When the complexity threshold is 1, the configuration time parameter used when training the bidirectional attention neural network model in the graphics processor computing power container can be greater than the configuration time parameter corresponding to the graph neural network model, so as to ensure the graceful exit of the graphics processor computing power container.
Step 306: the service resource monitoring device monitors the recovery process of the service resources corresponding to the graphics processor computing power container operating environment based on the comparison result of the processing time delay parameter and the configuration time parameter.
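The threshold-based adjustment can be sketched as follows. The text does not say how complexity is measured or by how much the parameter grows, so the linear rule and the `seconds_per_unit` value here are purely hypothetical; only the "adjust when complexity exceeds the threshold" condition and the example threshold of 1 come from the description.

```python
def adjust_for_complexity(config_time: float,
                          complexity: float,
                          threshold: float = 1.0,
                          seconds_per_unit: float = 5.0) -> float:
    """Extend the configuration time when model complexity exceeds the threshold.

    The linear extension is an illustrative assumption, not the patented rule.
    """
    if complexity <= threshold:
        return config_time
    return config_time + (complexity - threshold) * seconds_per_unit
```

Under these assumed values, a model with complexity 3.0 would receive 10 extra seconds, while a model at or below the threshold keeps the original configuration time.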
Referring to fig. 4, fig. 4 is an optional schematic flow chart of the service resource monitoring method according to the embodiment of the present invention, and it can be understood that the steps shown in fig. 4 may be executed by various electronic devices operating the service resource monitoring apparatus, for example, the steps may be executed by a dedicated terminal with a service resource monitoring function, a server or a server cluster controller, or a control terminal of a cloud network server. The dedicated terminal with the service resource monitoring apparatus may be packaged in the server 200 shown in fig. 1 to execute the corresponding software modules in the service resource monitoring apparatus shown in the foregoing fig. 2. The following is a description of the steps shown in fig. 4.
Step 401: obtain a comparison result of the processing time delay parameter and the configuration time parameter.
Step 402: judge whether the processing time delay parameter is smaller than the configuration time parameter; if so, execute step 403; otherwise, execute step 405.
Step 403: when the processing time delay parameter is smaller than the configuration time parameter, execute a service resource recovery process corresponding to the triggered signal processing function.
Step 404: recover the memory corresponding to the data processing task through the service resource recovery process, and close the bidirectional communication connection corresponding to the data processing task.
In some embodiments of the present invention, because the operating environments of graphics processor computing power containers differ, the resource recovery process needs to be adjusted according to the operating environment in order to ensure a graceful exit of the graphics processor computing power container. Specifically, when the operating environment of the graphics processor computing power container is a cloud server network, the service resource recovery process closes the user space file system client corresponding to the data processing task and terminates that client, and also closes the file descriptor corresponding to the data processing task.
Referring to fig. 5, fig. 5 is a schematic diagram of the graceful exit process of a graphics processor computing power container according to an embodiment of the present invention. The signal processing function is SIG_Function, a signal processing function embedded in the start command. When the stop signal of an abnormal exit is captured, SIG_Function is executed. The function mainly performs a memory recovery operation, releasing the allocated memory (delete) to avoid memory leaks, and actively initiates the closing of the socket connection so that the peer is not left blocked waiting for data.
For a cloud server network, the socket needs to be closed and the user space file system client needs to be terminated. The user space file system client (FUSE, Filesystem in Userspace) is a module used in Linux to mount certain network spaces, such as SSH, to the local file system; related material can be found on SourceForge. Terminating it ensures that the network disk data I/O exits completely. Meanwhile, the file descriptor is actively closed, which avoids resource leaks or unexplained exceptions; even though the operating system closes file descriptors after the container exits, the file descriptor is still closed explicitly. In this way, resource recovery for the graphics processor computing power container is consolidated on the platform side, which lowers the threshold for writing user programs and spares users from having to consider the recovery and complete release of resources whenever they write a program.
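The socket and file descriptor cleanup described above might look like the following sketch. The function name `release_resources` and its argument shapes are hypothetical, and terminating the user space file system client is deliberately left out because the text does not describe how that client is managed.

```python
import os
import socket


def release_resources(sockets, fds):
    """Actively close sockets and raw file descriptors held by a task."""
    for sock in sockets:
        try:
            # Shut down both directions first so the peer sees end-of-stream
            # instead of blocking while it waits for data.
            sock.shutdown(socket.SHUT_RDWR)
        except OSError:
            pass  # already disconnected; closing below is still required
        sock.close()
    for fd in fds:
        try:
            # Close explicitly rather than relying on the operating system
            # to reclaim the descriptor when the container exits.
            os.close(fd)
        except OSError:
            pass  # descriptor was already closed
```

Closing in the handler rather than at process teardown is the design point made in the text: the peer is unblocked promptly, and descriptors are not left to the operating system.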
Step 405: when the processing time delay parameter is greater than or equal to the configuration time parameter, trigger a second process termination signal corresponding to the graphics processor computing power container.
The second process termination signal corresponding to the graphics processor computing power container may be the SIGKILL signal, which cannot be ignored or blocked. It should be noted that the SIGTERM (terminate) signal, unlike SIGKILL, can be blocked and handled; it is typically used to request that a program exit normally by itself, and the shell command kill generates this signal by default.
Step 406: close the graphics processor computing power container through a container closing instruction corresponding to the second process termination signal.
After the graphics processor computing power container receives a kill command signal, the forced exit is executed immediately, without waiting for the cleanup of the process environment and the release of resources in the container. Therefore, before the second process termination signal is handled, a time parameter can be configured so that no kill signal is triggered within the configured time period, which guarantees that the cleanup of the container environment and the release of resources can be completed within that period. In this way, by handling the SIGTERM signal in a unified manner, incomplete cleanup caused by relying on user programs is eliminated, the operability of the graphics processor computing power system is improved, and the life cycle of graphics processor computing power containers becomes easier to manage.
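The "terminate, wait for the configured time, then kill" sequence can be illustrated with an ordinary child process standing in for the container. This is a sketch under stated assumptions: it targets POSIX systems, uses Python's `subprocess` module rather than any container runtime, and the helper names `spawn` and `stop_with_grace` and the two child scripts are hypothetical.

```python
import subprocess
import sys

# Child that handles SIGTERM and exits cleanly (well-behaved start command).
CHILD_HANDLES_TERM = (
    "import signal, sys, time\n"
    "signal.signal(signal.SIGTERM, lambda s, f: sys.exit(0))\n"
    "print('ready', flush=True)\n"
    "time.sleep(60)\n"
)

# Child that ignores SIGTERM (the failure mode described for user code).
CHILD_IGNORES_TERM = (
    "import signal, time\n"
    "signal.signal(signal.SIGTERM, signal.SIG_IGN)\n"
    "print('ready', flush=True)\n"
    "time.sleep(60)\n"
)


def spawn(code):
    """Start a child and wait until it has installed its signal disposition."""
    proc = subprocess.Popen([sys.executable, "-c", code],
                            stdout=subprocess.PIPE, text=True)
    proc.stdout.readline()  # blocks until the child prints 'ready'
    proc.stdout.close()
    return proc


def stop_with_grace(proc, config_time):
    """Send SIGTERM, wait up to config_time seconds, then fall back to SIGKILL."""
    proc.terminate()  # first process termination signal (SIGTERM on POSIX)
    try:
        proc.wait(timeout=config_time)
        return "graceful"
    except subprocess.TimeoutExpired:
        proc.kill()   # second process termination signal (SIGKILL on POSIX)
        proc.wait()
        return "killed"
```

With these helpers, `stop_with_grace(spawn(CHILD_HANDLES_TERM), 5.0)` should take the graceful branch, whereas a child that ignores SIGTERM is killed once the configured time elapses.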
In some embodiments of the present invention, the data processing task may be parsed to obtain a priority identifier of the data processing task; the received data processing tasks are sorted by priority according to their priority identifiers; and a corresponding data processing task queue is created according to the priorities of the data processing tasks. In this way, different customers can be treated differently. In addition, the data processing task queue may be traversed to determine the data processing task with the highest priority; the link quality of each link in the network resources is determined; and the link with the highest link quality is allocated to the data processing task with the highest priority, so that the data processing tasks in the queue are transmitted and processed over the allocated link.
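The queueing and link selection just described can be sketched with a heap. The representation of a task as a `(priority, name)` pair, the convention that a larger number means higher priority, and the link quality scores are all assumptions made for illustration.

```python
import heapq
from itertools import count


def build_task_queue(tasks):
    """Build a priority queue from (priority, name) pairs.

    Larger priority values are treated as more urgent (an assumption).
    A running counter keeps insertion order stable among equal priorities.
    """
    order = count()
    heap = [(-priority, next(order), name) for priority, name in tasks]
    heapq.heapify(heap)
    return heap


def assign_best_link(heap, link_quality):
    """Pop the highest-priority task and pair it with the best-quality link."""
    _, _, name = heapq.heappop(heap)
    best_link = max(link_quality, key=link_quality.get)
    return name, best_link
```

For example, with tasks `[(1, "a"), (5, "b"), (3, "c")]` and link scores `{"l1": 0.4, "l2": 0.9}`, the first call to `assign_best_link` pairs task `"b"` with link `"l2"`.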
The service resource monitoring method provided by the present invention is described below, taking the training of a neural network model in a game processing environment as an example, with reference to the schematic diagram of the use environment shown in fig. 1. The terminals (including the terminal 11-1 and the terminal 11-2) are provided with clients capable of executing different functions; through these clients, the terminals acquire the neural network model from the corresponding cloud server 200 over the network 300 and train it to determine the parameters of the neural network model. For example, a game operator can actively acquire historical data for monitoring the use of game plug-ins, or the historical data can be acquired automatically when a trigger condition is met; for instance, the game log may be analyzed through a neural network model, and when the information in the game log meets the trigger condition, the operation of acquiring the historical data is triggered. The terminals are connected to the server 200 through the network 300, which may be a wide area network, a local area network, or a combination of both, and which uses wireless links for data transmission. Referring to fig. 6, fig. 6 is an optional flowchart of the service resource monitoring method provided in the embodiment of the present invention; the steps shown in fig. 6 are described below.
Step 601: a graphics processor computing power container is deployed by different service clusters of a cloud computing server.
Step 602: the close container command docker stop triggers the SIGTERM signal.
Step 603: the start command of the graphics processor computing power container captures the SIGTERM signal.
Step 604: trigger the signal processing function and execute SIG_Function processing.
Step 605: judge whether the time delay parameter exceeds the configuration time parameter; if so, execute step 606; otherwise, execute step 604.
Step 606: the docker kill triggers the SIGKILL signal.
Step 607: in response to the SIGKILL signal, destroy the graphics processor computing power container and recover the service resources.
In this way, the difficulty of using third-party components in the graphics processor computing power container can be effectively reduced. In particular, when a training task reads and writes data in the container through the distributed file storage of a cloud server network, there is no need to worry that an abnormal interruption of the training task will prevent the container from exiting completely and leave resources unreleased; the graceful exit of the graphics processor computing power container is guaranteed.
The beneficial technical effects are as follows:
The embodiment of the present invention monitors data processing tasks in the graphics processor computing power container operating environment; when interruption of the data processing task is monitored, acquires a first process termination signal sent by a container closing instruction; triggers a signal processing function in response to the first process termination signal; acquires, through the signal processing function, a processing time delay parameter and a configuration time parameter matched with the signal processing function; compares the processing time delay parameter with the configuration time parameter to obtain a comparison result; and monitors the recovery process of the service resources corresponding to the graphics processor computing power container operating environment based on the comparison result. Therefore, the recovery process of the service resources can be monitored based on the comparison result of the processing time delay parameter and the configuration time parameter, the timely and comprehensive recovery of the service resources is ensured, the graceful exit of the container is guaranteed, the utilization efficiency of the service resources is improved, the service data processing speed is maintained, and the user experience is improved.
The above description is intended to be illustrative only, and should not be taken as limiting the scope of the invention, which is intended to include all such modifications, equivalents, and improvements as fall within the true spirit and scope of the invention.

Claims (10)

1. A method for monitoring service resources is characterized in that the method comprises the following steps:
monitoring data processing tasks in the operating environment of the graphics processor computing power container;
when the interruption of the data processing task is monitored, acquiring a first process termination signal sent by a container closing instruction;
triggering a signal processing function in response to the first process termination signal;
acquiring a processing time delay parameter and a configuration time parameter matched with the signal processing function through the signal processing function;
comparing the processing time delay parameter with the configuration time parameter to obtain a comparison result of the processing time delay parameter and the configuration time parameter;
and monitoring the recovery process of the service resources corresponding to the operating environment of the computing power container of the graphic processor based on the comparison result of the processing time delay parameter and the configuration time parameter.
2. The method according to claim 1, wherein the monitoring a recycling process of the service resource corresponding to the graphics processor computing power container operating environment based on the comparison result between the processing delay parameter and the configuration time parameter comprises:
when the processing time delay parameter is smaller than the configuration time parameter, executing a service resource recovery process corresponding to the triggered signal processing function;
through the business resource recovery process, recovering the memory corresponding to the data processing task;
and closing the bidirectional communication connection corresponding to the data processing task through the service resource recovery process.
3. The method of claim 2, further comprising:
when the operation environment of the computing power container of the graphic processor is a cloud server network, closing a user space file system client corresponding to the data processing task through the service resource recovery process, and terminating the user space file system client;
and closing the file descriptor corresponding to the data processing task through the service resource recycling process.
4. The method according to claim 1, wherein the monitoring, based on the comparison result between the processing delay parameter and the configuration time parameter, a recycling process of a service resource corresponding to the graphics processor computing power container operating environment includes:
when the processing time delay parameter is greater than or equal to the configuration time parameter, triggering a second process termination signal corresponding to the graphics processor computing power container;
and closing the graphics processor computing power container through a container closing instruction corresponding to the second process termination signal.
5. The method of claim 1, further comprising:
determining an environment type of the operating environment of the graphics processor computational power container;
and adjusting the configuration time parameter based on the environment type of the operation environment of the graphics processor computing power container to realize that the configuration time parameter is matched with the environment type of the operation environment of the graphics processor computing power container.
6. The method of claim 1, further comprising:
determining a complexity of a neural network model run in the graphics processor computational power container;
and when the complexity of the neural network model is greater than a complexity threshold value, adjusting the configuration time parameter to realize that the configuration time parameter is matched with the complexity of the neural network model operated in the calculation capacity container of the graphic processor.
7. The method of claim 1, further comprising:
analyzing the data processing task to obtain a priority identifier of the data processing task;
sorting the priority of the received data processing tasks according to the priority identification of the data processing tasks;
creating a corresponding data processing task queue according to the priority of the data processing task;
and adjusting the processing sequence of the computational power container of the graphics processor according to the data processing task queue.
8. A service resource monitoring apparatus, the apparatus comprising:
the information transmission module is used for monitoring data processing tasks in the operating environment of the computational power container of the graphics processor;
the information processing module is used for acquiring a first process termination signal sent by a container closing instruction when the interruption of the data processing task is monitored;
the information processing module is used for responding to the first process termination signal and triggering a signal processing function;
the information processing module is used for acquiring a processing time delay parameter and a configuration time parameter matched with the signal processing function through the signal processing function;
the information processing module is used for comparing the processing time delay parameter with the configuration time parameter to obtain a comparison result of the processing time delay parameter and the configuration time parameter;
and the information processing module is used for monitoring the recovery process of the service resources corresponding to the operating environment of the computational power container of the graphic processor based on the comparison result of the processing time delay parameter and the configuration time parameter.
9. An electronic device, characterized in that the electronic device comprises:
a memory for storing executable instructions;
a processor configured to implement the method of monitoring a business resource of any one of claims 1 to 7 when executing the executable instructions stored in the memory.
10. A computer-readable storage medium storing executable instructions, wherein the executable instructions when executed by a processor implement the method for monitoring service resources of any one of claims 1 to 7.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110571059.6A CN115391000A (en) 2021-05-25 2021-05-25 Business resource monitoring method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115391000A true CN115391000A (en) 2022-11-25

Family

ID=84113913

Country Status (1)

Country Link
CN (1) CN115391000A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117593172A (en) * 2024-01-16 2024-02-23 北京趋动智能科技有限公司 Process management method, device, medium and equipment
CN117593172B (en) * 2024-01-16 2024-04-23 北京趋动智能科技有限公司 Process management method, device, medium and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination