CN113238888A - Data processing method, system and device - Google Patents

Info

Publication number
CN113238888A
Authority
CN
China
Prior art keywords
calling
abnormal
information
platform
service platform
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110615637.1A
Other languages
Chinese (zh)
Inventor
杨仲谋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang eCommerce Bank Co Ltd
Original Assignee
Zhejiang eCommerce Bank Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang eCommerce Bank Co Ltd
Priority to CN202110615637.1A
Publication of CN113238888A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0751 Error or fault detection not based on redundancy
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/54 Interprogram communication
    • G06F9/547 Remote procedure calls [RPC]; Web services

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Quality & Reliability (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The specification provides a data processing method, system and device. The data processing method is applied to a data processing system comprising a calling service platform, a called service platform, a detection platform and a processing platform. When a call exception occurs, the calling service platform generates call exception information, writes the call exception information into a call exception log file, and sends target call exception information to the detection platform. The detection platform determines, based on the received call exception information, the number of call exceptions of the calling service platform for the called service platform, generates a call exception characterization value based on the number of call exceptions, determines an abnormal device in the called service platform based on the call exception characterization value, and sends abnormal device information to the processing platform. The processing platform performs exception handling based on the received abnormal device information.

Description

Data processing method, system and device
Technical Field
The present disclosure relates to the field of data processing technologies, and in particular, to a data processing method. The present specification also relates to a data processing system, a data processing apparatus, a computing device, and a computer-readable storage medium.
Background
With the rapid development of online services and the increasing complexity of users' online service requests, service providers usually need multiple online service platforms to cooperatively process the service requests sent by users.
However, as the number of online service platforms keeps growing, communication between the machines belonging to those platforms becomes increasingly frequent and complex. When one or more machines of an online service platform become abnormal and cause online business processing to fail, detecting the abnormal machines usually takes a certain amount of time, which can seriously affect the stability of the online service platform. It is therefore desirable to provide a data processing method that can accurately and efficiently identify abnormal machines in an online service platform, thereby improving its stability.
Disclosure of Invention
In view of this, the embodiments of the present specification provide a data processing method. The present specification also relates to a data processing system, a data processing apparatus, a computing device, and a computer readable storage medium to solve the technical problems in the prior art.
According to a first aspect of embodiments herein, there is provided a data processing method applied to a data processing system, the system comprising a calling service platform, a called service platform, a detection platform and a processing platform, wherein,
the calling service platform, when a call exception occurs, generates call exception information, writes the call exception information into a call exception log file, and sends target call exception information to the detection platform;
the detection platform determines, based on the received call exception information, the number of call exceptions of the calling service platform for the called service platform, generates a call exception characterization value based on the number of call exceptions, determines an abnormal device in the called service platform based on the call exception characterization value, and sends abnormal device information to the processing platform;
and the processing platform performs exception handling based on the received abnormal device information.
According to a second aspect of the embodiments of the present specification, there is provided a data processing method applied to a detection platform, including:
determining, based on received call exception information, the number of call exceptions of a calling service platform for a called service platform;
generating a call exception characterization value based on the number of call exceptions;
and determining an abnormal device in the called service platform based on the call exception characterization value, and sending abnormal device information to a processing platform.
According to a third aspect of embodiments herein, there is provided a data processing system comprising a calling service platform, a called service platform, a detection platform and a processing platform, wherein,
the calling service platform is configured to, when a call exception occurs, generate call exception information, write the call exception information into a call exception log file, and send target call exception information to the detection platform;
the detection platform is configured to determine, based on the received call exception information, the number of call exceptions of the calling service platform for the called service platform, generate a call exception characterization value based on the number of call exceptions, determine an abnormal device in the called service platform based on the call exception characterization value, and send abnormal device information to the processing platform;
the processing platform is configured to perform exception handling based on the received abnormal device information.
According to a fourth aspect of the embodiments of the present specification, there is provided a data processing apparatus applied to a detection platform, including:
a receiving module configured to determine, based on received call exception information, the number of call exceptions of a calling service platform for a called service platform;
a generation module configured to generate a call exception characterization value based on the number of call exceptions;
a determining module configured to determine an abnormal device in the called service platform based on the call exception characterization value and send abnormal device information to a processing platform.
According to a fifth aspect of embodiments herein, there is provided a computing device comprising:
a memory and a processor;
the memory is configured to store computer-executable instructions and the processor is configured to execute the computer-executable instructions, wherein the processor implements the steps of any of the data processing methods when executing the computer-executable instructions.
According to a sixth aspect of embodiments herein, there is provided a computer-readable storage medium storing computer-executable instructions that, when executed by a processor, implement the steps of any of the data processing methods.
The data processing method provided by this specification is applied to a data processing system comprising a calling service platform, a called service platform, a detection platform and a processing platform. When a call exception occurs, the calling service platform generates call exception information, writes the call exception information into a call exception log file, and sends target call exception information to the detection platform; the detection platform determines, based on the received call exception information, the number of call exceptions of the calling service platform for the called service platform, generates a call exception characterization value based on the number of call exceptions, determines an abnormal device in the called service platform based on the call exception characterization value, and sends abnormal device information to the processing platform; and the processing platform performs exception handling based on the received abnormal device information. Specifically, the detection platform uses the call exception information sent by the calling service platform to determine the number of call exceptions between the calling service platform and the called service platform, and generates a call exception characterization value from that number. The characterization value makes it possible to identify abnormal devices in the called service platform promptly and accurately, which improves the stability of the service platforms and relieves the operation and maintenance pressure and emergency response pressure caused by call exceptions between service platforms in the data processing system.
Drawings
FIG. 1 is a block diagram of a data processing system according to an embodiment of the present specification;
FIG. 2 is a flowchart of a data processing method provided in an embodiment of the present specification;
FIG. 3 is a schematic diagram of fixed-format call exception information in a data processing method provided in an embodiment of the present specification;
FIG. 4 is a schematic diagram of data collection in a data processing method provided in an embodiment of the present specification;
FIG. 5A is a schematic diagram of call exception information received by a detection platform in a data processing method provided in an embodiment of the present specification;
FIG. 5B is a schematic diagram illustrating classification of call exception information in a data processing method provided in an embodiment of the present specification;
FIG. 5C is a schematic diagram illustrating the number of call exceptions in a data processing method provided in an embodiment of the present specification;
FIG. 6 is a schematic diagram of the total number of call exceptions in a data processing method provided in an embodiment of the present specification;
FIG. 7 is a schematic diagram illustrating call exception information sorted by the number of call exceptions in a data processing method provided in an embodiment of the present specification;
FIG. 8 is a schematic diagram illustrating determination of an abnormal device from the sample variance in a data processing method provided in an embodiment of the present specification;
FIG. 9 is a schematic diagram of a call exception chain graph in a data processing method provided in an embodiment of the present specification;
FIG. 10 is a schematic diagram of an automated processing scheme of a processing platform in a data processing method provided in an embodiment of the present specification;
FIG. 11 is a flowchart of another data processing method provided in an embodiment of the present specification;
FIG. 12 is a flowchart of a data processing method applied to an abnormal machine detection scenario of an information system according to an embodiment of the present specification;
FIG. 13 is a schematic flowchart of the process of identifying an abnormal machine in an information system in a data processing method provided in an embodiment of the present specification;
FIG. 14 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present specification;
FIG. 15 is a block diagram of a computing device according to an embodiment of the present specification.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present description. This description may be embodied in many different forms and should not be construed as limited to the embodiments set forth herein, as those skilled in the art will be able to make and use the present disclosure without departing from the spirit and scope of the present disclosure.
The terminology used in the description of the one or more embodiments is for the purpose of describing the particular embodiments only and is not intended to be limiting of the description of the one or more embodiments. As used in one or more embodiments of the present specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in one or more embodiments of the present specification refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It will be understood that, although the terms first, second, etc. may be used herein in one or more embodiments to describe various information, these information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, a first can also be referred to as a second and, similarly, a second can also be referred to as a first without departing from the scope of one or more embodiments of the present description. The word "if" as used herein may be interpreted as "at … …" or "when … …" or "in response to a determination", depending on the context.
First, the terms involved in one or more embodiments of the present specification are explained.
Microservice: an information system architecture and organization method in which, typically, two or more information systems establish connections through RPC (remote procedure call) calls to cooperatively process service requests.
RPC call: communication between information systems through well-defined APIs.
Single-machine exception: one or more machines belonging to an information system behave abnormally, typically by failing to respond to service requests, responding very slowly, or responding normally but returning seriously incorrect results.
SOFARPC: an open-source, production-grade RPC framework with high extensibility and high performance, comprising components such as a microservice development framework, an RPC call framework, a service registry, distributed scheduled tasks, a rate limiting/circuit breaking mechanism, dynamic configuration push, distributed link tracing, Metrics monitoring, a distributed highly available queue, a distributed transaction framework, and a distributed database proxy.
Python: a computer programming language.
Go: also known as Golang, a statically and strongly typed, compiled computer programming language.
OceanBase: a distributed database.
MySQL: a relational database management system.
PostgreSQL: an open-source client/server relational database management system.
epoll mechanism: an improved version of poll (an I/O multiplexing mechanism) designed to handle large numbers of handles.
kqueue mechanism: an I/O multiplexing mechanism.
SDK: a Software Development Kit (SDK) is a collection of related documents, examples and tools that assist in developing a certain class of software.
In the present specification, a data processing method is provided, and the present specification relates to a data processing system, a data processing apparatus, a computing device, and a computer-readable storage medium, which are described in detail one by one in the following embodiments.
With the development of internet technology, more and more internet technology enterprises have begun to deploy their information systems as microservices, but operation and maintenance (O&M) costs and emergency response costs have increased accordingly.
Specifically, in a microservice environment, even a seemingly simple service request requires multiple information systems to complete in cooperation. An information system is only a logical concept; what actually carries traffic and processes service requests are the machines belonging to each information system, which establish connections and cooperate with one another through RPC calls. If a single machine becomes abnormal, the service request may fail.
In the prior art, the O&M personnel of internet technology enterprises allocate a group of load balancers to each information system. While distributing traffic, the load balancers also use a heartbeat mechanism to detect whether each machine can respond. If a machine cannot respond, it is considered abnormal and service request traffic is no longer forwarded to it. However, this solution cannot cope with more complex exception scenarios, such as a machine that has hung (it can respond to heartbeats but cannot work normally) or a processing exception at the service level (the request is normal but the returned data is abnormal).
To address these problems, internet technology enterprises currently tend to adopt the open-source SOFARPC product. SOFARPC has a built-in single-machine fault detection capability: it calculates the call exception rate in each window period and, combined with a certain measurement strategy, finds and isolates machines that behave abnormally. However, this approach falls short in the following scenarios. First, service-level exceptions, in which the request is normal but the returned data is abnormal, are usually caused by inconsistent machine environments or deployment versions. Second, internet technology enterprises have many information systems developed in Python and Go whose RPC calling mechanisms are not fixed, so it is difficult to embed SOFARPC's single-machine fault detection capability in them. Third, some information systems are "stateful": their machines cannot be isolated and replaced at will and require specialized, customized handling.
Based on this, this specification provides a data processing method that collects the RPC call exception logs of the machines in each information system, performs data aggregation and inference analysis, identifies the abnormal machine, and offers flexible handling options. The method is highly general: it can cope with many complex single-machine exception scenarios and is not tied to any specific infrastructure or technology stack.
In addition, the developers and O&M personnel of an information system understand its business data best and can print call exception logs according to actual needs, which overcomes the prior art's inability to cover service-level exceptions. Because data on the various RPC call failures is collected through log printing, O&M and emergency response costs are reduced and generality is higher: an information system only needs to follow the log format specification, and no requirements are imposed on its development language, framework, and so on. The method also provides a post-action capability: each information system can configure how abnormal machines are handled automatically according to actual needs, which solves the prior-art limitation that only uniform isolation and replacement is possible.
Referring to FIG. 1, FIG. 1 shows a schematic structural diagram of a data processing system provided according to an embodiment of the present specification. The system includes a calling service platform 102, a called service platform 104, a detection platform 106, and a processing platform 108. The calling service platform 102 is configured to, in the event of a call exception, generate call exception information, write the call exception information into a call exception log file, and send target call exception information to the detection platform 106. The detection platform 106 is configured to determine, based on the received call exception information, the number of call exceptions of the calling service platform 102 for the called service platform 104, generate a call exception characterization value based on the number of call exceptions, determine an abnormal device in the called service platform 104 based on the call exception characterization value, and send abnormal device information to the processing platform 108. The processing platform 108 is configured to perform exception handling based on the received abnormal device information.
In this embodiment, an information system may be only a logical concept; what actually performs a specific task or processes a specific service request are the machines belonging to the information system, which include but are not limited to servers.
The service platform may be understood as a platform for processing a specific service task, and may specifically be an information system for executing the specific service task.
A call may be understood as communication among a plurality of service platforms through well-defined rules; in embodiments of the present specification, the call may be an RPC call.
The calling service platform 102 may be understood as a service platform that initiates a call to the called service platform 104; correspondingly, the called service platform 104 may be understood as a service platform that is called by the calling service platform 102. When a plurality of service platforms call one another, a given service platform may act as either the calling service platform 102 or the called service platform 104. In this embodiment, the calling service platform 102 may request service data from the called service platform 104 by sending a call request, and after receiving the call request, the called service platform 104 returns the corresponding service data to the calling service platform 102 within a specific time.
The detection platform 106 may be understood as a platform for processing a specific detection task, and may specifically be an information system for executing the specific detection task, and in this embodiment of the present specification, the detection platform 106 may be an information system for detecting and identifying a device with an abnormality in the service platform.
The processing platform 108 may be understood as a platform that processes a specific processing task, and may specifically be an information system that performs the specific processing task.
A call exception may be understood as an exception that occurs while the calling service platform 102 calls the called service platform 104. Specifically, after sending a call request to the called service platform 104, the calling service platform 102 does not receive, within a specific time, the service data returned in response to the call request; or the service data returned by the called service platform 104 and received by the calling service platform 102 is abnormal. The embodiments of this specification use only call timeouts and abnormal returned data as examples; what counts as a call exception may be set according to the needs of the actual application scenario.
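The two kinds of call exception named above (timeout and abnormal returned data) can be distinguished on the calling side roughly as follows. This is a minimal sketch, not the method of the specification: the function name `classify_call`, the assumption that the RPC client raises Python's built-in `TimeoutError` on timeout, and the caller-supplied `is_valid` check are all hypothetical.

```python
from typing import Any, Callable, Optional, Tuple


def classify_call(
    call: Callable[[], Any],
    is_valid: Callable[[Any], bool],
) -> Tuple[str, Optional[Any]]:
    """Run one call and label the outcome.

    Assumption: the RPC client used inside `call` raises TimeoutError
    when no response arrives within its configured time limit.
    """
    try:
        result = call()
    except TimeoutError:
        # No service data was received within the specific time.
        return "timeout", None
    if not is_valid(result):
        # Data came back, but the business-level check rejects it.
        return "data_exception", result
    return "ok", result
```

Because `is_valid` is supplied by the calling platform, a service-level exception (normal response, wrong data) can be reported in the same way as a timeout.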
The call exception information may be understood as information characterizing a call exception of the calling service platform 102. In this embodiment of the present specification, each time a call exception occurs while the calling service platform 102 calls another service platform, one piece of call exception information characterizing that call exception is generated. The call exception information may include name information of the calling service platform 102, identification information of the calling service platform 102, name information of the called service platform 104, identification information of the called service platform 104, the time at which the call exception occurred, and the like.
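As an illustration of the fields listed above, one piece of call exception information could be modeled as a small record that serializes to a single log line. The sketch below is hypothetical: the class name, the field names, the use of IP addresses as identifiers, and the pipe-delimited layout are assumptions, since the fixed log format itself is not reproduced in this section.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class CallExceptionInfo:
    caller_name: str       # name of the calling service platform
    caller_ip: str         # identifier of the calling machine (assumed: IP address)
    callee_name: str       # name of the called service platform
    callee_ip: str         # identifier of the called machine (assumed: IP address)
    occurred_at: datetime  # time of the call exception

    def to_log_line(self) -> str:
        # Hypothetical pipe-delimited layout, one record per line.
        return "|".join([
            self.occurred_at.isoformat(),
            self.caller_name,
            self.caller_ip,
            self.callee_name,
            self.callee_ip,
        ])

    @classmethod
    def from_log_line(cls, line: str) -> "CallExceptionInfo":
        ts, caller_name, caller_ip, callee_name, callee_ip = line.strip().split("|")
        return cls(caller_name, caller_ip, callee_name, callee_ip,
                   datetime.fromisoformat(ts))
```

A fixed, language-neutral line format of this kind is what lets platforms written in different languages feed the same detection platform.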
The call exception log file may be understood as a log file in the call services platform 102 dedicated to storing call exception information.
The target call exception information may be understood as specific call exception information in a call exception log file, and in this specification embodiment, the target call exception information may be call exception information in a specific time period.
The number of call exceptions can be understood as the number of call exceptions occurring between the calling service platform 102 and the called service platform 104. When there are a plurality of called service platforms 104, the number of call exceptions of the calling service platform 102 for each called service platform 104 can be determined. Where the service platform is an information system, the number of call exceptions may be understood as the number of call exceptions between a machine in the calling information system and a machine in the called information system.
The call exception characterization value may be understood as a value characterizing the degree of difference among the numbers of call exceptions of the calling service platform 102 for the called service platform 104. In embodiments of the present specification, the call exception characterization value may be one or more values.
The abnormal device may be understood as a device in which an abnormality occurs in the service platform, and in this embodiment of the present specification, the device may be a server; accordingly, an anomalous device may be understood as a server in which an anomaly has occurred.
The abnormal device information may be understood as an identifier of the abnormal device, and in the case that the abnormal device is a server in which an abnormality occurs, the abnormal device information may include information such as an internet protocol address (IP address) of the server, a media access control address (MAC address) of the server, and the like.
Exception handling may be understood as a specific operation performed based on the abnormal device information, and in this embodiment, exception handling may include the processing platform 108 sending a mail to a specific mailbox and/or sending a prompt to a specific person and/or sending an HTTP request and/or executing a custom script deployed on the processing platform 108, and the like, based on the abnormal device information.
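The configurable handling described above can be pictured as a lookup from action names to handler functions. The sketch below is illustrative only: the names `ACTIONS`, `notify_by_mail`, `call_webhook` and `handle_abnormal_device` are hypothetical, and the handlers return a description string instead of actually sending mail or issuing an HTTP request.

```python
from typing import Callable, Dict, List

Action = Callable[[dict], str]


def notify_by_mail(device: dict) -> str:
    # Placeholder: a real handler might send a message with smtplib here.
    return f"mail: device {device['ip']} is abnormal"


def call_webhook(device: dict) -> str:
    # Placeholder: a real handler might issue an HTTP request here.
    return f"http: reported {device['ip']}"


# Hypothetical registry of the actions a processing platform supports.
ACTIONS: Dict[str, Action] = {
    "mail": notify_by_mail,
    "http": call_webhook,
}


def handle_abnormal_device(device: dict, configured: List[str]) -> List[str]:
    """Run every configured action that is registered; skip unknown names."""
    return [ACTIONS[name](device) for name in configured if name in ACTIONS]
```

Keeping the actions in a registry lets each information system choose its own combination, which matters for stateful systems that cannot simply be isolated and replaced.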
Specifically, while the data processing system is running, the calling service platform 102 sends a call request to the called service platform 104, and a call exception may occur during the call. When a call exception occurs, the calling service platform 102 generates call exception information and writes it into a call exception log file for storage. The calling service platform 102 may also collect specific call exception information from the call exception log file through a data collection program deployed on the calling service platform 102, and send that call exception information to the detection platform 106 for processing.
After receiving the call exception information sent by the calling service platform 102, the detection platform 106 parses and structures the call exception information, and sends the parsed and structured call exception information to a database for storage.
The detection platform 106 may obtain the call exception information of a specific time period from the database at a preset frequency. For example, the detection platform 106 may, once per minute, obtain from the database the call exception information stored during the minute preceding the current time, and process that call exception information.
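The "previous minute" window in this example can be computed as follows. This is a minimal sketch under stated assumptions: aligning the window to whole-minute boundaries and representing records as dictionaries with an `occurred_at` key are choices made here for illustration, and the function names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Iterable, List, Tuple


def previous_minute_window(now: datetime) -> Tuple[datetime, datetime]:
    """Return [start, end) of the last complete minute before `now`."""
    end = now.replace(second=0, microsecond=0)
    start = end - timedelta(minutes=1)
    return start, end


def select_window(records: Iterable[dict], now: datetime) -> List[dict]:
    """Keep only records whose occurrence time falls in the previous minute."""
    start, end = previous_minute_window(now)
    return [r for r in records if start <= r["occurred_at"] < end]
```

Using a half-open interval means that consecutive one-minute runs neither skip nor double-count a record that falls exactly on a boundary.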
After obtaining the call exception information, the detection platform 106 classifies the call exception information sent by each calling service platform 102 according to the attribute information of the called service platform 104. Based on the classified call exception information, it determines the number of call exceptions of the calling service platform 102 for the called service platform 104 and generates a call exception characterization value from that number. According to the call exception characterization value, it determines which device in the called service platform 104 is causing the call exceptions between the calling service platform 102 and the called service platform 104, and sends the abnormal device information corresponding to that abnormal device to the processing platform 108.
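The classify, count and characterize steps can be sketched as below. This is a hedged illustration, not the specification's algorithm: it uses the sample variance of per-machine exception counts as the characterization value (the sample variance is named in the caption of FIG. 8), while the threshold value, the rule of picking the machine with the highest count, the record keys, and the function names are all assumptions introduced here.

```python
from collections import Counter, defaultdict
from statistics import variance
from typing import Dict, Iterable, List


def count_by_callee_machine(records: Iterable[dict]) -> Dict[str, Counter]:
    """Classify records by called platform, then count exceptions per called machine."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for r in records:
        counts[r["callee_name"]][r["callee_ip"]] += 1
    return counts


def find_abnormal_machines(
    records: Iterable[dict],
    all_machines: Dict[str, List[str]],
    variance_threshold: float = 4.0,  # assumed value, for illustration only
) -> Dict[str, str]:
    """Return {called platform: suspected abnormal machine}.

    `all_machines` lists every machine of each called platform so that
    machines with zero exceptions still contribute to the variance.
    """
    abnormal: Dict[str, str] = {}
    for callee, counter in count_by_callee_machine(records).items():
        ips = all_machines.get(callee, list(counter))
        per_machine = [counter.get(ip, 0) for ip in ips]
        if len(per_machine) < 2:
            continue  # sample variance needs at least two data points
        # A large variance means the exceptions concentrate on few machines.
        if variance(per_machine) > variance_threshold:
            abnormal[callee] = max(ips, key=lambda ip: counter.get(ip, 0))
    return abnormal
```

The intuition is that a platform-wide problem raises the counts of all machines evenly (low variance), whereas a single faulty machine stands out from its peers (high variance).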
After receiving the abnormal device information sent by the detection platform 106, the processing platform 108 generates specific prompt information according to the abnormal device information and sends the prompt information; alternatively, the processing platform 108 isolates and replaces the abnormal device in the called service platform 104 according to the abnormal device information.
For example, where the service platforms are a shopping platform and a payment platform, the calling service platform 102 may be understood as the shopping platform, and the called service platform 104 may be understood as the payment platform.
When the shopping platform receives a commodity purchase transaction request, it needs to invoke the payment function of the payment platform to deduct the corresponding amount of the purchaser's resources (which may be, but are not limited to, funds or points) so that the transaction request can be processed smoothly. However, when the payment platform does not send the transaction data to the shopping platform within a specific time, or the data it sends to the shopping platform is abnormal, the shopping platform generates call exception information indicating that its call to the payment platform was abnormal, and stores the call exception information in the call exception log file of the shopping platform.
The data acquisition program deployed in the shopping platform can acquire the calling abnormal information newly written in the calling abnormal log file after detecting that the calling abnormal information is written in the calling abnormal log file, specifically, the shopping platform deploys a data acquisition program for acquiring the calling abnormal information in the calling abnormal log file, and the data acquisition program can read the contents in the calling abnormal log file with a path of "/home/admin/logs/. log" under the default condition, wherein the path can be configured according to the actual application requirement and can be specified when the data acquisition program is started. After acquiring the abnormal calling information, the data collection program sends the abnormal calling information to the detection platform 106 for detection.
After receiving the abnormal calling information sent by the shopping platform, the detection platform 106 analyzes and structures the abnormal calling information, obtains information such as a shopping platform name, a payment platform name, an IP address of a server initiating calling in the shopping platform, an IP address of the server called in the payment platform, time for sending the abnormal calling and the like contained in the abnormal calling information, and sends the information to the database for storage; in practical application, the detection platform uses OceanBase to save data such as calling abnormal information and set necessary indexes; however, the content of the data such as the abnormal calling information is relatively simple, and the detection platform also stores the data such as the abnormal calling information by adopting databases such as MySQL and PostgreSQL.
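As a rough sketch of the storage step, the following uses Python's built-in `sqlite3` module purely as a stand-in for OceanBase, MySQL, or PostgreSQL. The table name, column names, and index are assumptions made for illustration; the specification does not give a schema.

```python
import sqlite3


def open_store():
    """Create an in-memory table for structured call exception records (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE call_exception ("
        " occurred_at INTEGER NOT NULL,"
        " caller TEXT NOT NULL, caller_ip TEXT NOT NULL,"
        " callee TEXT NOT NULL, callee_ip TEXT NOT NULL)"
    )
    # Index the time column, because the detection platform reads by time window.
    conn.execute("CREATE INDEX idx_occurred_at ON call_exception (occurred_at)")
    return conn


def save(conn, record):
    """Insert one (occurred_at, caller, caller_ip, callee, callee_ip) tuple."""
    conn.execute("INSERT INTO call_exception VALUES (?, ?, ?, ?, ?)", record)


def fetch_window(conn, start, end):
    """Return the records with start <= occurred_at < end, oldest first."""
    cur = conn.execute(
        "SELECT caller, caller_ip, callee, callee_ip FROM call_exception"
        " WHERE occurred_at >= ? AND occurred_at < ? ORDER BY occurred_at",
        (start, end),
    )
    return cur.fetchall()
```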
The detection platform 106 acquires calling abnormal information stored in a database in the previous minute at the current moment from the database according to the frequency of one minute, classifies the calling abnormal information according to information such as the name of the shopping platform and the name of the payment platform, determines the calling abnormal times of the shopping platform for the payment platform according to the classified calling abnormal information, and generates a calling abnormal representation value based on the calling abnormal times; and determining a server causing abnormal calling between the shopping platform and the payment platform in the payment platform according to the abnormal calling characteristic value, and sending the IP address of the abnormal server to the processing platform 108.
After receiving the IP address of the abnormal server sent by the detection platform 106, the processing platform 108 generates a mail or prompt message according to the IP address of the abnormal server, and sends the mail or prompt message to the operation and maintenance staff; or the processing platform 108 can also perform isolation replacement processing on the abnormal server in the payment platform according to the IP address of the abnormal server.
In the embodiments of the present specification, the data processing system writes call exception information to a call exception log file when a call exception occurs, and sends the call exception information in that log file to the detection platform 106 for processing. This places no requirement on the development language, framework, or the like of the system, so the approach is highly general and saves cost. Meanwhile, the detection platform 106 determines the number of call exceptions between the calling service platform 102 and the called service platform 104 based on the call exception information sent by the service platform, and generates a call exception characterization value from that number. With this value, abnormal devices in the called service platform 104 can be identified promptly and accurately, which helps improve the stability of the service platform and relieve the operation and maintenance pressure and emergency-response pressure caused by call exceptions of the calling service platform 102 in the data processing system.
However, as the number of online service platforms keeps growing, communication between the machines belonging to those platforms becomes increasingly frequent and complex. When one or more machines of an online service platform become abnormal and cause online business processing to fail, locating the abnormal machines usually takes some time, which can seriously affect the stability of the online service platform. For this reason, the data processing method provided in the embodiments of the present specification identifies abnormal machines in the online service platform promptly and accurately, which helps improve the stability of the online service platform and relieve the operation and maintenance pressure and emergency-response pressure caused by call exceptions of the calling service platform in the data processing system.
Referring to fig. 2, fig. 2 shows a flowchart of a data processing method provided in an embodiment of the present specification, which is applied to a data processing system, where the system includes a calling service platform, a called service platform, a detection platform, and a processing platform, and the data processing method specifically includes the following steps:
Step 202: The calling service platform generates call exception information when a call exception occurs, writes the call exception information to a call exception log file, and sends target call exception information to the detection platform.
Specifically, a calling service platform in the data processing system sends a call request to a called service platform, and a call exception may occur during the call. When a call exception occurs, the calling service platform generates call exception information and writes it to a call exception log file for storage. The calling service platform then collects specific call exception information from the call exception log file through a data collection program deployed on the calling service platform, and sends it to the detection platform for processing.
Referring to fig. 3, fig. 3 is a schematic diagram of call exception information in a fixed format in a data processing method provided in an embodiment of the present specification, in the embodiment of the present specification, a call service platform may be understood as an initiator information system, and accordingly, a called service platform may be understood as a receiver information system.
When an RPC call exception occurs in the initiator information system, the initiator information system prints RPC call exception information to a specified call exception log file in a fixed format. The RPC call exception information may include an exception occurrence timestamp, the initiator information system name, the initiator machine address (the address of the machine in the initiator information system that initiated the call), the receiver information system name, and the receiver machine address (the address of the called machine in the receiver information system). For example, the exception occurrence timestamp may be understood as the timestamp "161547006000" in fig. 3; the initiator information system name may be understood as "A" in fig. 3; the initiator machine address may be understood as ". times.. 12. times.2" in fig. 3; the receiver information system name may be understood as "B" in fig. 3; and the receiver machine address may be understood as ". times.. 12. 3" in fig. 3.
Further, in this embodiment of the present specification, the call exception conditions include:
the call from the calling service platform to the called service platform times out; or
the calling service platform receives abnormal service data when it calls the service data of the called service platform.
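To make the five fields concrete, here is a minimal parsing sketch for the detection platform side. The actual fixed format is shown only in fig. 3 and is not reproduced in the text, so the comma-separated layout, the class name, and the documentation-range addresses used below are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CallException:
    timestamp: int          # exception occurrence timestamp
    initiator_system: str   # initiator information system name
    initiator_addr: str     # address of the machine that initiated the call
    receiver_system: str    # receiver information system name
    receiver_addr: str      # address of the called machine


def parse_record(line: str) -> Optional[CallException]:
    """Parse one log line in an assumed 'ts,initiator,addr,receiver,addr' layout.

    Returns None for malformed lines instead of raising, so a single bad
    record does not stop the processing of the rest.
    """
    parts = [p.strip() for p in line.strip().split(",")]
    if len(parts) != 5 or not all(parts) or not parts[0].isdigit():
        return None
    return CallException(int(parts[0]), *parts[1:])
```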
The service data may be data generated by the service platform in a request for executing a specific service, and in this embodiment of the present specification, the service data may be account balance of a user in the payment platform, transaction record data, and the like.
Specifically, when a calling service platform in the data processing system calls a called service platform, a call exception arises in either of two cases. In the first, after the calling platform sends a call request to the called service platform, it does not receive the corresponding service data from the called service platform within a preset time, so the call has timed out. In the second, the calling platform receives the corresponding service data from the called service platform within the preset time, but the received service data is abnormal.
For example, where the service platforms are a shopping platform and a payment platform, the calling service platform may be understood as the shopping platform, and the called service platform may be understood as the payment platform.
Correspondingly, a call exception of the first kind arises as follows: the shopping platform sends a service data call request to the payment platform and starts timing at the moment the request is sent. If the payment platform returns the corresponding service data to the shopping platform within 1 second, the call has not timed out; if it does not, the call from the shopping platform to the payment platform has timed out.
A call exception of the second kind arises as follows: the shopping platform sends a service data call request to the payment platform, and the payment platform returns the corresponding service data within 1 second. Although the call has not timed out, if the shopping platform finds that the returned service data is abnormal, the call is also treated as abnormal.
In actual application, the call exception condition may be set according to the needs of an actual application scenario, and the present specification only describes in detail the call timeout condition, the return data exception condition, and the like, and is not limited specifically.
In the embodiment of the description, the service platform generates the calling abnormal information under the abnormal conditions of calling overtime, returned data abnormality and the like, so that the subsequent detection platform can accurately identify the problems of the called service platform based on the calling abnormal information, and the stability of the service platform is improved.
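The two call exception conditions can be sketched as a small wrapper around a remote call. This is only an illustration under assumptions: `remote_call` and `is_valid` are hypothetical callables supplied by the caller, the 1-second default mirrors the example above, and a thread pool stands in for whatever RPC framework the platform actually uses.

```python
import concurrent.futures
import time

TIMEOUT_SECONDS = 1.0  # preset time from the example; configurable in practice


def call_with_detection(remote_call, is_valid, timeout=TIMEOUT_SECONDS):
    """Run `remote_call` and classify the outcome.

    Returns (data, reason), where reason is None on success, 'timeout' when no
    data arrives within `timeout` seconds, or 'abnormal_data' when data arrives
    in time but fails the `is_valid` check.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(remote_call)
    try:
        data = future.result(timeout=timeout)
    except concurrent.futures.TimeoutError:
        return None, "timeout"
    finally:
        # Do not block on a call that is still running in the background.
        pool.shutdown(wait=False)
    if not is_valid(data):
        return data, "abnormal_data"
    return data, None
```

A caller would write a call exception log record whenever the returned reason is not `None`.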
In an embodiment of the present application, the sending the target call exception information to the detection platform includes:
acquiring a file identifier of the calling abnormal log file, and searching the corresponding calling abnormal log file according to the file identifier;
acquiring the target calling abnormal information in the log file according to the calling abnormal information identifier;
and sending the target calling abnormal information to the detection platform.
In this embodiment of the present specification, before the abnormal call log file is read, an index number pointing to the abnormal call log file needs to be obtained, and then the content in the abnormal call log file is read according to the index number. However, if the index number cannot be obtained due to the reasons that the system authority is not enough, the calling abnormal log file is locked, and the like, reading and writing cannot be performed. The index number may also be a handle, file descriptor, file pointer, etc.
The calling exception information identifier may be understood as an identifier representing calling exception information, each calling exception information identifier may uniquely represent one calling exception information, and in this specification, the calling exception information identifier may be information such as a sequence number, a line number, and write time.
After the calling service platform writes the calling abnormal information into the calling abnormal log file, searching the calling abnormal log file by acquiring the file identifier of the calling abnormal log file; acquiring target calling abnormal information in the calling abnormal log file according to the calling abnormal information identifier; and then sending the target calling abnormal information to the detection platform.
Specifically, the calling service platform writes calling exception information into a calling exception log file for storage; a data acquisition program deployed on the calling service platform monitors file identification of the calling abnormal log file; reading the calling abnormal file based on the monitored file identification; and acquiring newly written calling abnormal information in the calling abnormal log file based on the calling abnormal information identifier, and sending the calling abnormal information serving as target calling abnormal information to the detection platform.
For example, where the service platforms are a shopping platform and a payment platform, the calling service platform may be understood as the shopping platform, and the called service platform may be understood as the payment platform.
After the shopping platform writes call exception information to the call exception log file, a data collection program deployed on the shopping platform continuously monitors the handle of the call exception log file and obtains it. With the handle, the data collection program locates the corresponding call exception log file, determines the most recently written call exception information as the target call exception information according to the line numbers of the call exception information in the log file, pushes the target call exception information into a sending queue, and sends it to the detection platform in the order in which it is stored in the queue.
Referring to fig. 4, fig. 4 is a schematic diagram of data acquisition in the data processing method provided in the embodiment of the present specification, a calling service platform may be understood as an initiator information system; the initiator information system prints RPC calling exception information to a calling exception log file (equivalent to log output in FIG. 4) through an information system process; and a data acquisition program deployed in the initiator information system monitors a handle of the call exception log file through an I/O multiplexing mechanism, when new RPC call exception information is detected to appear in the call exception log file, an acquisition thread in the data acquisition program reads the new RPC call exception information and pushes the new RPC call exception information to a queue to be sent, and a sending thread in the data acquisition program sends the new RPC call exception information to a detection platform in an asynchronous mode. Wherein the I/O multiplexing mechanism includes, but is not limited to, epoll mechanism, kqueue mechanism, etc.
It should be noted that, to keep the data collection program as lightweight as possible, it may skip checking the content format of the RPC call exception information and send the obtained information to the detection platform as is; the detection platform then performs the checking and parsing.
In the embodiments of the present specification, the calling service platform collects the various kinds of RPC call failure data by writing call exception information to a call exception log file, which reduces the cost of detecting abnormal devices. Because the data is collected through the log file, it only needs to follow the log format specification, and no requirement is placed on the development language, framework, or the like of each information system, so the approach is highly general. In addition, the call exception log file can be located promptly by its file identifier, and the target call exception information can be found quickly in the log file by its call exception information identifier, which saves processing resources of the service platform.
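The collection and sending threads can be sketched as follows. Note the substitution: the text describes monitoring the file handle through an I/O multiplexing mechanism such as epoll or kqueue, whereas this sketch reads from a remembered byte offset, which is simpler and portable. The function names are hypothetical, and records are forwarded unchanged, without format checks, as described above.

```python
import queue
import threading


def read_new_lines(path, offset):
    """Read the complete lines appended to `path` after byte `offset`.

    Returns (lines, new_offset). A trailing partial line (no newline yet) is
    left unread so that a half-written record is never sent.
    """
    with open(path, "rb") as f:
        f.seek(offset)
        chunk = f.read()
    end = chunk.rfind(b"\n") + 1  # 0 when no complete line is available
    lines = chunk[:end].decode("utf-8", errors="replace").splitlines()
    return lines, offset + end


def sender_loop(send_queue, send, stop_marker=None):
    """Sending thread: forward queued records as is, in FIFO order."""
    while True:
        item = send_queue.get()
        if item is stop_marker:
            break
        send(item)
```

A collection thread would call `read_new_lines` whenever the file changes, put each returned line on a `queue.Queue`, and run `sender_loop` in a separate `threading.Thread` so that slow network sends do not delay reading.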
Step 204: the detection platform determines the calling abnormal times of the calling service platform aiming at the called service platform based on the received calling abnormal information, generates a calling abnormal characteristic value based on the calling abnormal times, determines abnormal equipment in the called service platform based on the calling abnormal characteristic value, and sends the abnormal equipment information to the processing platform.
Specifically, the detection platform classifies the call exception information sent by each calling service platform; determines the number of call exceptions of the calling service platform for the called service platform based on the classified call exception information, and generates a call exception characterization value based on the number of call exceptions; and, according to the call exception characterization value, determines the abnormal device in the called service platform that causes the call exceptions between the calling service platform and the called service platform, and sends the abnormal device information corresponding to that device to the processing platform.
Further, in an embodiment of the present application, the determining, by the detection platform, the number of call exceptions of the calling service platform for the called service platform based on the received call exception information includes:
dividing the calling abnormal information into different categories according to the attribute information of the calling service platform and the attribute information of the called service platform;
and acquiring the device parameters of the calling abnormal information in each category, and determining the calling abnormal times of the devices in the calling service platform aiming at the devices in the called service platform according to the device parameters.
The attribute information may be, but is not limited to, information such as an identifier and a name of the platform. In the case where the platform is an information system, the attribute information may be a system name of the information system.
The device parameters may be understood as the server parameters, such as server IP addresses, of the calling platform and the called platform that are carried in the call exception information.
Specifically, the detection platform divides calling abnormal information into different categories according to the attribute information of the calling service platform and the attribute information of the called service platform; and acquiring the device parameters of the calling abnormal information in each category, and determining the calling abnormal times of the devices in the calling service platform aiming at the devices in the called service platform according to the device parameters.
For example, taking the service platform as a shopping platform, a payment platform, an after-sales platform, and a claim settlement platform as an example, the calling service platform may be understood as a shopping platform or an after-sales platform, and the called service platform may be understood as a payment platform and a claim settlement platform.
As shown in fig. 5A, fig. 5A is a schematic diagram of call exception information received by a detection platform in a data processing method provided in an embodiment of the present specification; the detection platform receives calling exception information sent by the shopping platform A and the after-sales platform C, wherein the calling exception information comprises Z1, Z2, Z3, Z4 and Z5.
As shown in fig. 5B, fig. 5B is a schematic diagram illustrating classification of call exception information in a data processing method provided in an embodiment of the present specification; after receiving the calling abnormal information sent by the shopping platform A and the after-sales platform C, the detection platform classifies the calling abnormal information sent by the shopping platform A and the after-sales platform C according to the called platform name (B, D), and classifies the calling abnormal information into different categories.
For example, the calling exception information of the shopping platform A calling the payment platform B comprises Z1, Z2 and Z3, so that Z1, Z2 and Z3 are divided into one class.
The calling exception information for the after-market platform C to call the claim platform D includes Z4 and Z5, thus Z4 and Z5 are categorized as one.
As shown in fig. 5C, fig. 5C is a schematic diagram of the number of call exceptions in the data processing method provided in an embodiment of the present specification. After classification, the detection platform uses the call exception information in each category to count, for each pair of servers, the number of call exceptions that occurred when a server of the shopping platform A called a server of the payment platform B, and likewise the number of call exceptions that occurred when a server of the after-sales platform C called a server of the claim settlement platform D.
First, it is determined that an exception occurred while a server of the shopping platform A was calling a server of the payment platform B.
Second, using the IP addresses of the two servers, the call exception information is queried for the records that involve both servers, and the number of such records is taken as the number of call exceptions between the two servers.
For example, an exception occurs while the server with IP address ". x. x.12. 2" in the shopping platform A is calling the server with IP address ". x. x.11. 3" in the payment platform B.
According to the IP addresses of the two servers, 2 pieces of calling abnormal information between the two servers are inquired in the calling abnormal information, so that the calling abnormal times between the two servers is determined to be 2.
In the embodiment of the description, the calling abnormal information is divided into different categories according to the attribute information of the calling service platform and the attribute information of the called service platform, the detection platform can detect the calling abnormal information in each category more finely, and the calling abnormal times of the calling service platform to the called service platform are clearly identified based on the device parameters of the calling abnormal information, so that the subsequent detection platform can more accurately detect the abnormal devices in the called service platform.
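The classification and counting steps can be sketched with two levels of grouping: first by the (calling platform, called platform) pair, then by the (calling server, called server) pair. The dictionary keys and the sample addresses in the usage note are hypothetical; only the platform names A, B, C, D and the grouping of Z1 to Z5 come from the example above.

```python
from collections import Counter, defaultdict


def count_call_exceptions(records):
    """Classify records by platform pair, then count them per server pair.

    Each record is a dict with the hypothetical keys 'caller', 'callee',
    'caller_ip' and 'callee_ip'. Returns {(caller, callee): Counter} where
    each Counter maps (caller_ip, callee_ip) to its number of call exceptions.
    """
    by_category = defaultdict(Counter)
    for r in records:
        category = (r["caller"], r["callee"])
        by_category[category][(r["caller_ip"], r["callee_ip"])] += 1
    return by_category
```

For instance, if Z1 and Z2 both involve the same pair of servers in A and B while Z3 involves a different called server, the category `("A", "B")` holds one server pair with a count of 2 and another with a count of 1.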
In an embodiment of the present application, the generating a call exception characterizing value based on the number of call exceptions includes:
obtaining a total calling abnormal frequency according to the calling abnormal frequency of the calling service platform aiming at the called service platform;
and generating a calling abnormity representation value representing the numerical difference between the calling abnormity numbers based on the total calling abnormity numbers and the calling abnormity numbers.
The total number of call exceptions may be understood as the sum of the number of call exceptions of the device in the calling service platform to the device in the called service platform.
Specifically, the detection platform accumulates the calling abnormal times of the calling of the plurality of devices in the calling service platform to the plurality of devices in the called service platform to obtain the total calling abnormal times; and generating a calling abnormity representation value representing the numerical difference between the calling abnormity numbers based on the total calling abnormity numbers and the calling abnormity numbers.
Continuing the above example, fig. 6 is a schematic diagram of the total number of call exceptions in the data processing method provided in an embodiment of the present specification. If call exceptions occur when the shopping platform calls 4 servers in the payment platform, the numbers of call exceptions corresponding to the 4 servers are accumulated to obtain the total number of call exceptions; for example, referring to fig. 6, the accumulated total number of call exceptions is 802. A call exception characterization value representing the numerical difference among the numbers of call exceptions is then generated from the total number of call exceptions and the individual numbers of call exceptions.
In the embodiment of the specification, a calling exception representation value for representing a numerical difference between a plurality of calling exception numbers is rapidly generated on the basis of the total calling exception number and the calling exception number; the processing resources of the data processing system are saved, and the efficiency of detecting the abnormal equipment in the called service platform by the detection platform is improved.
In an embodiment of the application, the generating a call exception characterization value characterizing a numerical difference between the number of call exceptions based on the total number of call exceptions and the number of call exceptions includes:
calculating and obtaining a calling abnormal frequency mean value based on the total calling abnormal frequency;
processing the calling abnormal times and the calling abnormal time mean value to generate a sample variance, and taking the sample variance as the calling abnormal characteristic value; or
And calculating to obtain a percentage value of each calling abnormity frequency based on the total calling abnormity frequency and the plurality of calling abnormity frequencies, and taking the percentage value as the calling abnormity representation value.
The average number of call exceptions may be understood to mean the number of call exceptions averaged for each device in the called service platform.
Specifically, the detection platform divides the total number of call exceptions by the number of devices in the called service platform on which call exceptions occurred, obtaining the average number of call exceptions per device, that is, the mean number of call exceptions of the called service platform; it then processes the individual numbers of call exceptions together with this mean to generate a sample variance, and takes the sample variance as the call exception characterization value. Alternatively, the detection platform calculates, for each device in the called service platform, the percentage of the total number of call exceptions that the device's number of call exceptions accounts for, and takes the percentage value as the call exception characterization value.
For example, the detection platform divides the sum of the numbers of call exceptions of all servers in the payment platform by the number of servers in the payment platform on which call exceptions occurred, obtaining the mean number of call exceptions per device in the payment platform, and then calculates the sample variance of the numbers of call exceptions of the payment platform according to the sample standard deviation formula:
$$s = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}\left(x_i - \bar{x}\right)^2}$$
where $s$ is the sample variance of the numbers of call exceptions of the payment platform, $x_i$ is the number of call exceptions of each server in the payment platform, $\bar{x}$ is the mean number of call exceptions per device in the payment platform, and $N$ is the number of servers in the payment platform on which call exceptions occurred.
Alternatively, the detection platform calculates, for each server in the payment platform, the percentage that the server's number of call exceptions accounts for in the sum of the numbers of call exceptions of all servers in the payment platform, and uses each server's percentage value as a call exception characterization value.
In the embodiments of the present specification, using an elimination algorithm based on the sample variance and an aggregation algorithm based on proportion statistics makes it possible to find as many abnormal machines as possible, improving accuracy while reducing false alarms. It also diversifies the detection modes of the detection platform and improves the adaptability of the data processing system.
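Both characterization values can be computed with Python's standard `statistics` module. The sketch below is hedged in two ways. First, the text speaks of a sample variance but names the sample standard deviation formula, so the function exposes both through a hypothetical `use_std` flag. Second, the per-server breakdown of the total of 802 is not given in the text, so any concrete counts used with these functions are illustrative.

```python
import statistics
from typing import List


def sample_spread(counts: List[int], use_std: bool = True) -> float:
    """Spread of per-device call exception counts.

    Uses the N-1 (sample) denominator. With fewer than two devices the spread
    is undefined, so 0.0 is returned as an assumed convention.
    """
    if len(counts) < 2:
        return 0.0
    return statistics.stdev(counts) if use_std else statistics.variance(counts)


def percentages(counts: List[int]) -> List[float]:
    """Share of the total number of call exceptions attributed to each device."""
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [c / total for c in counts]
```

With a hypothetical breakdown such as `[780, 10, 7, 5]` (which sums to 802), the first server accounts for more than 90% of the call exceptions, so it would exceed a 90% percentage threshold.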
In an embodiment of the present specification, the determining an exception device in the invoked service platform based on the invocation exception characterizing value includes:
sorting the calling exception information according to the calling exception times;
and under the condition that the calling abnormity representation value meets a preset detection threshold value, obtaining the calling abnormity information of a preset sequence position, and determining abnormal equipment in the called service platform according to the calling abnormity information of the preset sequence position.
Sorting the call exception information may be understood as ordering it by the number of call exceptions, either from largest to smallest or from smallest to largest; correspondingly, the call exception information at the preset sequence position may be understood as the call exception information in the first or the last position after this ordering.
The detection threshold may be understood as a value or a range of values. In the embodiments of the present specification, where the call exception characterization value is the sample variance, the detection threshold may be 1.0; where the call exception characterization value is a percentage value, the detection threshold may be 90%. In practical applications, the detection threshold can be set as required; the values "1.0" and "90%" are used here only to illustrate the embodiments and do not limit the detection threshold.
Specifically, after determining the call exception characterization value, the detection platform sorts the call exception information in descending order of the number of call exceptions, so that the entry with the most call exceptions is ranked first and the entry with the fewest is ranked last.
After sorting, the detection platform compares the call exception characterization value with the preset detection threshold. If the characterization value is greater than the detection threshold, the detection platform obtains the first-ranked call exception information, determines the device in the called service platform to which that information corresponds as an abnormal device, stores the abnormal device in an abnormal device list, and deletes the first-ranked call exception information.
After deleting the first-ranked call exception information, the detection platform recomputes the call exception characterization value of the called service platform from the remaining call exception information and sorts the remaining information again by the number of call exceptions. It then judges whether the recomputed characterization value is greater than the preset detection threshold; if so, it determines an abnormal device of the called service platform from the entry now ranked first, stores that device in the abnormal device list, and deletes that entry.
By iterating these steps, the abnormal devices in the called service platform are determined one by one, until the call exception characterization value is no longer greater than the preset detection threshold, which indicates that no further abnormal device exists in the called service platform.
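The iterative steps above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the implementation of the embodiment: the function name `find_abnormal_devices`, the dictionary layout (device IP mapped to its number of call exceptions), and the example counts are hypothetical; the sample variance is computed with the standard library's `statistics.variance` (denominator N-1); and the loop stops when fewer than two devices remain, because the sample variance is then undefined.

```python
from statistics import variance


def find_abnormal_devices(exception_counts, threshold=1.0):
    """Return devices removed one by one while the sample variance
    of the remaining call exception counts exceeds the threshold."""
    remaining = dict(exception_counts)  # work on a copy
    abnormal = []
    # Assumption: with fewer than two devices the variance is undefined, so stop.
    while len(remaining) >= 2:
        value = variance(list(remaining.values()))
        if value <= threshold:  # continue only while "greater than" holds
            break
        # Equivalent to sorting in descending order and taking the first entry.
        worst = max(remaining, key=remaining.get)
        abnormal.append(worst)
        del remaining[worst]
    return abnormal


# Made-up counts for four called servers (hypothetical IP addresses).
counts = {"10.0.0.1": 300, "10.0.0.2": 280, "10.0.0.3": 2, "10.0.0.4": 1}
print(find_abnormal_devices(counts))
```

With these made-up counts, the two servers with hundreds of call exceptions are removed in turn; the remaining counts 2 and 1 have a sample variance of 0.5, which is not greater than 1.0, so the iteration stops.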
For example, taking a shopping platform and a payment platform as the service platforms, the calling service platform may be understood as the shopping platform, and the called service platform may be understood as the payment platform.
As shown in fig. 7, fig. 7 is a schematic diagram of sorting the call exception information by the number of call exceptions in the data processing method provided in an embodiment of the present specification. After determining the call exception characterization value, the detection platform sorts the call exception information in descending order of the number of call exceptions that occurred when servers of the shopping platform called servers of the payment platform; the entry with the most call exceptions is ranked first, and the entry with the fewest is ranked last.
As shown in fig. 8, fig. 8 is a schematic diagram of determining abnormal devices according to the sample variance in the data processing method provided in an embodiment of the present specification. After sorting the call exception information, the detection platform compares the call exception characterization value of the payment platform with the preset detection threshold. Referring to fig. 8, the call exception characterization value may be, for example, 54639, 39690, or 0.5, and the preset detection threshold may be 1.0. The detection platform judges whether the characterization value is greater than the detection threshold; if so, it obtains the first-ranked call exception information, determines the server to which that information corresponds as an abnormal server, stores the abnormal server in the abnormal device list, and deletes the first-ranked call exception information.
After deleting the first-ranked call exception information, the detection platform recomputes the call exception characterization value from the remaining call exception information and sorts the remaining information again by the number of call exceptions. It then judges whether the recomputed characterization value is greater than the preset detection threshold; if so, it determines an abnormal device in the payment platform from the entry now ranked first, stores that device in the abnormal device list, and deletes that entry.
By iterating these steps, the abnormal devices in the payment platform are determined one by one, until the call exception characterization value is no longer greater than the preset detection threshold, which indicates that no further abnormal device exists in the payment platform.
In the embodiments of this specification, comparing the call exception characterization value with a preset detection threshold allows abnormal devices in the called service platform to be identified quickly, which improves the detection efficiency of the detection platform. Because the characterization value is recomputed and compared with the threshold repeatedly, the abnormal devices in the called service platform can be determined accurately and completely, reducing the loss that abnormal devices cause to the data processing system.
In an embodiment of this specification, sending the abnormal device information to the processing platform includes:
judging whether the abnormal devices belong to the same called service platform;
if yes, sending the abnormal device information to the processing platform;
if not, generating a call exception chain graph according to the call exception information and the abnormal devices;
and determining target abnormal devices based on the call exception chain graph, and sending the target abnormal device information to the processing platform.
The call exception chain graph represents the call exception relationships among the abnormal devices.
Specifically, the detection platform judges whether the abnormal devices belong to the same called service platform. If they do, the abnormal device information is sent directly to the processing platform. If the detected abnormal devices belong to different called service platforms, an indirect call failure may have occurred: some of the devices may have been misjudged as abnormal only because they call a truly abnormal device, while they themselves have no fault.
In the case of a possible indirect call failure, the detection platform reads the call exception information sent by the calling service platform again and checks, according to that information, whether direct call relationships exist among the abnormal devices. If they do, it constructs the call exception chain graph from those relationships; the graph represents the call relationships among the abnormal devices.
After the call exception chain graph is constructed, the detection platform finds each device in the graph that neither receives any call nor sends a call request to any device of a called service platform, and determines it as a target abnormal device. It also finds each device that receives call requests from devices of a calling service platform but does not itself send a request to any device, and likewise determines it as a target abnormal device.
After the detection of the target abnormal devices is completed, the detection platform sends the target abnormal device information to the processing platform.
Continuing the above example, the detection platform judges whether the abnormal servers belong to the same platform. If all of the abnormal servers belong to the payment platform, their IP addresses are sent directly to the processing platform; if the detected abnormal servers belong to different platforms, an indirect call failure may have occurred.
Referring to fig. 9, fig. 9 is a schematic diagram of a call exception chain graph in the data processing method provided in an embodiment of the present specification. In the case of a possible indirect call failure, the detection platform reads the call exception information sent by the shopping platform again and checks, according to that information, whether direct call relationships exist among the abnormal servers. If they do, it constructs the call exception chain graph from those relationships; the graph represents the call relationships among the abnormal servers.
After the call exception chain graph is constructed, the detection platform finds each server in the graph that neither receives any call nor sends a call request to any server, determines it as a target abnormal server, and stores its IP address in the final abnormal server list. It also finds each server that receives call requests from other servers but does not itself send a request to any server, determines it as a target abnormal server, and stores its IP address in the final abnormal server list.
After the detection of the target abnormal servers is completed, the detection platform sends the IP addresses of the target abnormal servers to the processing platform.
In the embodiments of this specification, when the abnormal devices do not belong to the same called service platform, the detection platform generates a call exception chain graph to detect indirect call failures. This allows it to identify accurately which of the abnormal devices are the target abnormal devices actually responsible for the call exceptions, improving the accuracy of the detection platform and reducing the false alarm rate for abnormal devices.
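The selection of target abnormal devices from the call exception chain graph can be sketched in Python as follows. This is a minimal sketch under assumptions, not the embodiment's implementation: the function name `find_target_abnormal_devices`, the representation of call exception information as `(caller, callee)` pairs, and the device names are all hypothetical. The sketch keeps only direct calls between devices already flagged as abnormal and returns those devices that send no call request to any other flagged device, whether or not they receive calls themselves.

```python
def find_target_abnormal_devices(abnormal_devices, exception_calls):
    """Return flagged devices that call no other flagged device."""
    abnormal = set(abnormal_devices)
    # Chain graph: outgoing edges restricted to calls between flagged devices.
    outgoing = {device: set() for device in abnormal}
    for caller, callee in exception_calls:
        if caller in abnormal and callee in abnormal and caller != callee:
            outgoing[caller].add(callee)
    # Devices with no outgoing edge are the ends of the call chains.
    return sorted(d for d, callees in outgoing.items() if not callees)


# Hypothetical chain: shop-1 -> pay-1 -> risk-1 -> db-1, plus an isolated device.
flagged = ["pay-1", "risk-1", "db-1", "claim-1"]
calls = [("shop-1", "pay-1"), ("pay-1", "risk-1"), ("risk-1", "db-1")]
print(find_target_abnormal_devices(flagged, calls))
```

In this made-up chain, `pay-1` and `risk-1` are treated as indirect failures because each calls another flagged device, while `db-1` (called but calling nothing) and `claim-1` (neither called nor calling) are reported as targets.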
Step 206: the processing platform performs exception handling based on the received abnormal device information.
Specifically, after receiving the abnormal device information sent by the detection platform, the processing platform generates prompt information from it and sends the prompt information; alternatively, the processing platform isolates or replaces the abnormal devices in the called service platform according to the abnormal device information.
For example, after receiving the IP address of an abnormal server sent by the detection platform, the processing platform generates an e-mail or a prompt message containing that IP address and sends it to the operation and maintenance staff.
Referring to fig. 10, fig. 10 is a schematic diagram of the automated processing scheme of the processing platform in the data processing method provided in an embodiment of the present specification. The automated processing scheme comprises two modes: an HTTP mode and a custom script mode. The HTTP mode may be as follows:
An HTTP template to be called is pre-configured on the processing platform. After receiving from the detection platform an abnormal machine list containing the IP addresses of the abnormal servers, the processing platform calls the pre-configured HTTP template and replaces the placeholder parameter in the template with the IP address of the abnormal server to generate an HTTP request. It then calls the HTTP interface provided by the operation and maintenance system and sends the IP address of the abnormal server to that system in the HTTP request; the operation and maintenance system performs operations on the abnormal server such as machine offline, machine replacement (replacing the abnormal server with a good machine), and machine traffic removal (the machine no longer provides service, but the system programs on it keep running).
In this embodiment, the processing platform can also specify which operation the operation and maintenance system performs on the abnormal server. Specifically, the HTTP template includes an operation parameter, which the processing platform replaces with a specific operation name such as "machine offline", "machine traffic removal", or "machine replacement".
After receiving the abnormal machine list, the processing platform replaces the placeholder parameter in the HTTP template with the IP address of the abnormal server and replaces the operation parameter with the specified operation name. It then calls the HTTP interface provided by the operation and maintenance system and sends the IP address of the abnormal server together with the operation name; the operation and maintenance system performs machine offline, machine replacement, or machine traffic removal on the abnormal server according to the operation name.
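The placeholder substitution in the HTTP mode can be sketched in Python as follows. This is a hedged illustration only: the URL, the JSON body layout, the placeholder names `${ip}` and `${operation}`, and the function name `build_requests` are assumptions, since the specification does not define the format of the HTTP template or the interface of the operation and maintenance system. The sketch only builds the requests with the standard library and does not send them.

```python
import json
from string import Template
from urllib.request import Request

# Hypothetical pre-configured template; URL and body fields are assumptions.
OPS_URL = "http://ops.example.invalid/api/machines/action"
BODY_TEMPLATE = Template('{"ip": "${ip}", "operation": "${operation}"}')


def build_requests(abnormal_ips, operation):
    """Fill the template once per abnormal server and build HTTP requests."""
    requests = []
    for ip in abnormal_ips:
        # Replace the placeholder and operation parameters in the template.
        body = BODY_TEMPLATE.substitute(ip=ip, operation=operation)
        requests.append(
            Request(
                OPS_URL,
                data=body.encode("utf-8"),
                headers={"Content-Type": "application/json"},
                method="POST",
            )
        )
    return requests


for req in build_requests(["10.0.0.1", "10.0.0.2"], "machine offline"):
    print(req.get_method(), req.full_url, json.loads(req.data))
```

Sending each request (for example with `urllib.request.urlopen`) is left out on purpose, because the real endpoint, authentication, and response handling depend on the operation and maintenance system in use.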
The custom script mode may be as follows:
Specifically, when writing the script content, the user refers to the abnormal machine list as $1. After receiving the abnormal machine list, the processing platform calls the custom processing script pre-configured on it, passes the abnormal machine list as the first positional parameter of the script, and performs operations such as machine offline, machine replacement (replacing the abnormal machine with a good machine), and machine traffic removal (the machine no longer provides service, but the system programs on it keep running) through the custom processing script.
In practical applications, the custom processing script is mainly intended for service information systems with special requirements. For example, a service information system may store user data files on each running machine, with different files on different machines. If a machine is taken offline or replaced directly, some of these files may be lost. The embodiments of this specification therefore provide the custom processing script function: the user writes the processing logic in the script, and the script is called when an abnormal machine is detected in the service information system. The user data files are packaged and backed up first, and the machine is taken offline only after the backup is complete, which prevents loss of user data.
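How the processing platform might hand the abnormal machine list to a custom script as its first positional parameter can be sketched in Python as follows. This is an assumption-laden sketch: the function name `run_custom_script`, the comma-separated encoding of the list, and the stand-in script are hypothetical, and the specification does not say how the platform launches the script. The stand-in script is a one-line Python program so that the example runs anywhere; in a shell script the same argument would be read as `$1`.

```python
import subprocess
import sys


def run_custom_script(script_argv, abnormal_ips):
    """Run the configured script with the machine list as its first argument."""
    # Assumption: the list is passed as one comma-separated string.
    machine_list = ",".join(abnormal_ips)
    result = subprocess.run(
        [*script_argv, machine_list],
        capture_output=True,
        text=True,
        check=True,  # raise if the script exits with a non-zero status
    )
    return result.stdout


# Stand-in for a user script that backs up data before taking machines offline.
stand_in = [sys.executable, "-c", "import sys; print('backup, then offline:', sys.argv[1])"]
print(run_custom_script(stand_in, ["10.0.0.1", "10.0.0.2"]))
```

Passing the arguments as a list rather than building a shell command string avoids quoting problems if a list entry ever contains unexpected characters.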
In the data processing method provided by this specification, the detection platform determines the number of call exceptions between the calling service platform and the called service platform from the call exception information sent by the service platform, and generates a call exception characterization value from the number of call exceptions. The characterization value allows abnormal devices in the called service platform to be identified promptly and accurately, which improves the stability of the service platform and relieves the operation and maintenance pressure and the emergency-response pressure caused by call exceptions in the data processing system. In addition, the processing platform provides flexible automated processing: the way abnormal devices are handled automatically can be set according to practical requirements, which improves the adaptability and flexibility of the processing platform and avoids limiting it to a single processing mode.
Referring to fig. 11, fig. 11 is a flowchart of another data processing method provided in an embodiment of the present specification. The method is applied to a detection platform and specifically includes the following steps:
Step 1102: determining, based on the received call exception information, the number of call exceptions of the calling service platform for the called service platform.
Specifically, the detection platform classifies the call exception information sent by each calling service platform, and determines the number of call exceptions of the calling service platform for the called service platform from the classified call exception information.
Further, in an embodiment of the present application, determining, by the detection platform, the number of call exceptions of the calling service platform for the called service platform based on the received call exception information includes:
dividing the call exception information into different categories according to the attribute information of the calling service platform and the attribute information of the called service platform;
and obtaining the device parameters of the call exception information in each category, and determining, according to the device parameters, the number of call exceptions of the devices in the calling service platform for the devices in the called service platform.
The attribute information may be, but is not limited to, the identifier or name of a platform. When the platform is an information system, the attribute information may be the system name of that information system.
The device parameters may be understood as the server parameters, such as server IP addresses, of the calling platform and the called platform that are carried in the call exception information.
Specifically, the detection platform divides the call exception information into different categories according to the attribute information of the calling service platform and of the called service platform, obtains the device parameters of the call exception information in each category, and determines from those parameters how many call exceptions occurred between each device in the calling service platform and each device in the called service platform.
For example, taking a shopping platform, a payment platform, an after-sales platform, and a claim settlement platform as the service platforms, the calling service platform may be understood as the shopping platform or the after-sales platform, and the called service platform may be understood as the payment platform or the claim settlement platform.
As shown in fig. 5A, fig. 5A is a schematic diagram of the call exception information received by the detection platform in the data processing method provided in an embodiment of the present specification. The detection platform receives call exception information sent by shopping platform A and after-sales platform C, comprising Z1, Z2, Z3, Z4, and Z5.
As shown in fig. 5B, fig. 5B is a schematic diagram of classifying the call exception information in the data processing method provided in an embodiment of the present specification. After receiving the call exception information sent by shopping platform A and after-sales platform C, the detection platform classifies it into different categories according to the names of the called platforms (B and D).
For example, the call exception information for shopping platform A calling payment platform B comprises Z1, Z2, and Z3, so Z1, Z2, and Z3 are placed in one category.
The call exception information for after-sales platform C calling claim settlement platform D comprises Z4 and Z5, so Z4 and Z5 are placed in another category.
As shown in fig. 5C, fig. 5C is a schematic diagram of the number of call exceptions in the data processing method provided in an embodiment of the present specification. After classification, the detection platform uses the call exception information in each category to count the number of call exceptions between each pair of servers when a server of shopping platform A calls a server of payment platform B, and likewise when a server of after-sales platform C calls a server of claim settlement platform D.
First, it is determined that an exception occurred while a server of shopping platform A was calling a server of payment platform B.
Second, based on the IP addresses of the two servers, the call exception information is searched for entries involving both servers, and the number of such entries is taken as the number of call exceptions between the two servers.
For example, an exception occurs while the server with IP address "x.x.12.2" in shopping platform A is calling the server with IP address "x.x.11.3" in payment platform B.
Based on the IP addresses of the two servers, 2 entries involving both servers are found in the call exception information, so the number of call exceptions between the two servers is determined to be 2.
In the embodiments of this specification, the call exception information is divided into categories according to the attribute information of the calling service platform and of the called service platform, so the detection platform can examine the call exception information of each category at a finer granularity. The number of call exceptions of the calling service platform for the called service platform is identified clearly from the device parameters in the call exception information, which allows the detection platform to detect abnormal devices in the called service platform more accurately in subsequent steps.
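The classification and counting described above can be sketched in Python as follows. This is a minimal sketch under assumptions: the record fields (`caller_platform`, `callee_platform`, `caller_ip`, `callee_ip`), the function name `count_call_exceptions`, and the concrete IP addresses are hypothetical, because the specification only states that the call exception information carries platform attribute information and server parameters such as IP addresses.

```python
from collections import Counter, defaultdict


def count_call_exceptions(records):
    """Group records by (calling platform, called platform) and count
    call exceptions per (calling server IP, called server IP) pair."""
    by_category = defaultdict(Counter)
    for record in records:
        category = (record["caller_platform"], record["callee_platform"])
        pair = (record["caller_ip"], record["callee_ip"])
        by_category[category][pair] += 1
    return by_category


# Five made-up records standing in for Z1..Z5.
records = [
    {"caller_platform": "A", "callee_platform": "B", "caller_ip": "10.0.12.2", "callee_ip": "10.0.11.3"},
    {"caller_platform": "A", "callee_platform": "B", "caller_ip": "10.0.12.2", "callee_ip": "10.0.11.3"},
    {"caller_platform": "A", "callee_platform": "B", "caller_ip": "10.0.12.5", "callee_ip": "10.0.11.4"},
    {"caller_platform": "C", "callee_platform": "D", "caller_ip": "10.0.30.1", "callee_ip": "10.0.40.1"},
    {"caller_platform": "C", "callee_platform": "D", "caller_ip": "10.0.30.1", "callee_ip": "10.0.40.2"},
]
counts = count_call_exceptions(records)
print(counts[("A", "B")][("10.0.12.2", "10.0.11.3")])
```

Keying the inner counter by the pair of IP addresses mirrors the per-server-pair counting in the example; summing over the calling side would give a per-called-server count if that is what a later step needs.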
Step 1104: generating a call exception characterization value based on the number of call exceptions.
Specifically, the detection platform generates the call exception characterization value from the number of call exceptions of the calling service platform for the called service platform, as determined from the received call exception information.
In an embodiment of the present application, generating a call exception characterization value based on the number of call exceptions includes:
obtaining a total number of call exceptions from the numbers of call exceptions of the calling service platform for the called service platform;
and generating, based on the total number of call exceptions and the individual numbers of call exceptions, a call exception characterization value that characterizes the numerical difference between the numbers of call exceptions.
The total number of call exceptions may be understood as the sum of the numbers of call exceptions of the devices in the calling service platform for the devices in the called service platform.
Specifically, the detection platform adds up the numbers of call exceptions that occurred when the devices in the calling service platform called the devices in the called service platform to obtain the total number of call exceptions, and then generates, from the total and the individual numbers of call exceptions, a call exception characterization value that characterizes the numerical difference between the numbers of call exceptions.
Continuing the above example, as shown in fig. 6, fig. 6 is a schematic diagram of the total number of call exceptions in the data processing method provided in an embodiment of the present specification. If call exceptions occur when 4 servers in the payment platform are called by the shopping platform, the numbers of call exceptions of the 4 servers are added up to obtain the total number of call exceptions; referring to fig. 6, for example, the total is 802. A call exception characterization value that characterizes the numerical difference between the numbers of call exceptions is then generated from the total and the individual numbers of call exceptions.
In the embodiments of this specification, a call exception characterization value that characterizes the numerical difference between multiple numbers of call exceptions is generated quickly from the total number of call exceptions and the individual numbers; this saves processing resources of the data processing system and improves the efficiency with which the detection platform detects abnormal devices in the called service platform.
In an embodiment of the application, generating, based on the total number of call exceptions and the numbers of call exceptions, a call exception characterization value that characterizes the numerical difference between the numbers of call exceptions includes:
calculating a mean number of call exceptions based on the total number of call exceptions;
processing the numbers of call exceptions and the mean number of call exceptions to generate a sample variance, and taking the sample variance as the call exception characterization value; or
calculating, based on the total number of call exceptions and the multiple numbers of call exceptions, the percentage value of each number of call exceptions, and taking the percentage value as the call exception characterization value.
The mean number of call exceptions may be understood as the average number of call exceptions per device in the called service platform.
Specifically, the detection platform divides the total number of call exceptions by the number of devices in the called service platform on which call exceptions occurred, obtaining the average number of call exceptions per device, that is, the mean number of call exceptions of the called service platform. It then processes the numbers of call exceptions and the mean to generate a sample variance and takes the sample variance as the call exception characterization value. Alternatively, the detection platform calculates, for each device in the called service platform, the percentage of the total number of call exceptions that the device's number of call exceptions accounts for, and takes that percentage value as the call exception characterization value.
For example, the detection platform divides the sum of the numbers of call exceptions of all servers in the payment platform by the number of servers in the payment platform on which call exceptions occurred, obtaining the mean number of call exceptions per device, and then calculates the sample variance of the payment platform's numbers of call exceptions according to the sample variance formula (formula image BDA0003097334800000201 in the original publication),
where s is the sample variance of the numbers of call exceptions of the payment platform, x_i is the number of call exceptions of each server in the payment platform, x̄ (formula image BDA0003097334800000202) is the mean number of call exceptions per device in the payment platform, and N is the number of servers in the payment platform on which call exceptions occurred.
Alternatively, the detection platform calculates, for each server in the payment platform, the percentage of the sum of the numbers of call exceptions of all servers in the payment platform that the server's number of call exceptions accounts for, and takes the percentage value of each server as a call exception characterization value.
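The two kinds of characterization value described above can be sketched in Python as follows. This is an illustrative sketch, not the formula of the publication, whose formula images are not reproduced in this text: it assumes the usual sample variance with denominator N-1, as computed by `statistics.variance`, and the function name `characterization_values` and the example counts are hypothetical.

```python
from statistics import variance


def characterization_values(counts):
    """Return total, mean, sample variance, and per-server percentage
    values for a mapping of server -> number of call exceptions."""
    values = list(counts.values())
    total = sum(values)                # total number of call exceptions
    mean = total / len(values)         # total divided by number of servers
    sample_variance = variance(values) # assumes denominator N - 1
    percentages = {server: count / total * 100 for server, count in counts.items()}
    return total, mean, sample_variance, percentages


# Made-up counts for three called servers.
total, mean, s, shares = characterization_values({"s1": 1, "s2": 2, "s3": 3})
print(total, mean, s, shares["s3"])
```

For the made-up counts 1, 2, and 3, the total is 6, the mean is 2, the sample variance is 1, and the third server accounts for 50% of all call exceptions.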
In the embodiments of this specification, by using an elimination algorithm based on sample variance and an aggregation algorithm based on proportion statistics, as many abnormal machines as possible can be found, improving accuracy while reducing false alarms; this also diversifies the detection modes of the detection platform and improves the adaptability of the data processing system.
Step 1106: determining an abnormal device in the called service platform based on the call exception characterization value, and sending the abnormal device information to a processing platform.
Specifically, the call exception characterization value is used to determine which device in the called service platform caused the call exceptions between the calling service platform and the called service platform, and the abnormal device information corresponding to that device is sent to the processing platform.
In an embodiment of the present specification, determining an abnormal device in the called service platform based on the call exception characterization value includes:
sorting the call exception information according to the number of call exceptions;
and, when the call exception characterization value satisfies a preset detection threshold, obtaining the call exception information at a preset sequence position, and determining the abnormal device in the called service platform according to the call exception information at the preset sequence position.
Sorting the call exception information can be understood as ordering it by the number of call exceptions, either from largest to smallest or from smallest to largest; correspondingly, the call exception information at the preset sequence position can be understood as the entry ranked first or last after sorting.
The detection threshold may be understood as a value or a range of values. In the embodiments of this specification, when the call exception characterization value is the sample variance, the detection threshold may be 1.0; when the call exception characterization value is a percentage value, the detection threshold may be 90%. In practical applications, the detection threshold can be set as required; "1.0" and "90%" are used here only for explanation and do not limit the detection threshold.
Specifically, after determining the calling abnormity representation value, the detection platform sorts the calling abnormity information according to the calling abnormity times and the sequence from large to small, the calling abnormity information with the largest calling abnormity times is ranked at the first position, and the calling abnormity information with the smallest calling abnormity times is ranked at the last position.
After sorting, the detection platform compares the call exception characterization value with the preset detection threshold and judges whether the value is greater than the threshold. If so, it obtains the call exception information ranked first, determines the device in the called service platform that corresponds to that entry to be an abnormal device, stores the abnormal device in an abnormal device list, and deletes the first-ranked call exception information.
After deleting the first-ranked entry, the detection platform recomputes the call exception characterization value of the called service platform from the remaining call exception information and sorts the remaining information again by the number of call exceptions. It then judges whether the recomputed value is greater than the preset detection threshold; if so, it determines an abnormal device of the called service platform from the entry now ranked first, stores that device in the abnormal device list, and deletes that entry.
By executing these steps iteratively, the abnormal devices in the called service platform are determined one by one until the call exception characterization value is judged to be smaller than the preset detection threshold, which indicates that no further abnormal device exists in the called service platform.
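The iterative procedure described above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the call exception characterization value is the sample variance, that the input is a mapping from a called-device identifier to its number of call exceptions, and that iteration stops once fewer than two entries remain (the sample variance is undefined for a single value). The function name `find_abnormal_devices` and the input shape are hypothetical.

```python
from statistics import variance


def find_abnormal_devices(failure_counts, threshold=1.0):
    """Return the devices removed while the sample variance exceeds the threshold.

    failure_counts: dict mapping a called-device identifier to its number
                    of call exceptions (hypothetical input shape).
    threshold:      detection threshold compared with the sample variance.
    """
    # Sort from the largest to the smallest number of call exceptions.
    remaining = sorted(failure_counts.items(), key=lambda kv: kv[1], reverse=True)
    abnormal = []
    # Assumption: stop when fewer than two entries remain, because the
    # sample variance needs at least two data points.
    while len(remaining) >= 2:
        value = variance([count for _, count in remaining])
        if value <= threshold:
            break  # the characterization value no longer exceeds the threshold
        device, _ = remaining.pop(0)  # the entry ranked first
        abnormal.append(device)
        # Removing the head of a descending list keeps it sorted, so the
        # re-sorting step of the text is a no-op here.
    return abnormal
```

For example, with counts of 300, 2, 2 and 1 the first pass removes the device with 300 call exceptions; the sample variance of the remaining values is about 0.33, which is below 1.0, so iteration stops.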
For example, taking a shopping platform and a payment platform as the service platforms, the calling service platform may be understood as the shopping platform, and the called service platform may be understood as the payment platform.
As shown in fig. 7, fig. 7 is a schematic diagram of sorting the call exception information according to the number of call exceptions in a data processing method provided in an embodiment of the present specification. After determining the call exception characterization value, the detection platform sorts the call exception information from largest to smallest by the number of call exceptions that occurred when servers of the shopping platform called servers of the payment platform; the entry with the most call exceptions is ranked first and the entry with the fewest is ranked last.
Fig. 8 is a schematic diagram of determining an abnormal device according to the sample variance in a data processing method provided in an embodiment of the present specification. As shown in fig. 8, after sorting the call exception information, the detection platform compares the call exception characterization value of the payment platform with the preset detection threshold (in fig. 8, the call exception characterization value may be 54639, 39690 and 0.5, and the preset detection threshold may be 1.0) and judges whether the value is greater than the threshold. If so, it obtains the call exception information ranked first, determines the server corresponding to that entry to be an abnormal server, stores it in the abnormal device list, and deletes the first-ranked call exception information.
After deleting the first-ranked entry, the detection platform recomputes the call exception characterization value from the remaining call exception information and sorts the remaining information again by the number of call exceptions. It then judges whether the recomputed value is greater than the preset detection threshold; if so, it determines an abnormal device in the payment platform from the entry now ranked first, stores that device in the abnormal device list, and deletes that entry.
By executing these steps iteratively, the abnormal devices in the payment platform are determined one by one until the call exception characterization value is judged to be smaller than the preset detection threshold, which indicates that no further abnormal device exists in the payment platform.
In the embodiment of the specification, devices with abnormalities in the called service platform are identified quickly by comparing the call exception characterization value with a preset detection threshold, which improves the efficiency with which the detection platform detects abnormal devices. By repeatedly recomputing the call exception characterization value and comparing it with the preset detection threshold, the abnormal devices in the called service platform can be determined accurately and completely, which reduces the loss that abnormal devices cause to the data processing system.
In an embodiment of this specification, sending the abnormal device information to the processing platform includes:
judging whether the abnormal devices belong to the same called service platform;
if yes, sending the abnormal device information to the processing platform;
if not, generating a call exception chain graph according to the call exception information and the abnormal devices;
and determining target abnormal devices based on the call exception chain graph, and sending the target abnormal device information to the processing platform.
The call exception chain graph represents the abnormal calling relationships among the abnormal devices.
Specifically, the detection platform judges whether the abnormal devices belong to the same called service platform. If they do, the abnormal device information is sent directly to the processing platform. If the detected abnormal devices belong to different called service platforms, an indirect call failure may have occurred; that is, some of the devices detected as abnormal may have been misjudged because they have a calling relationship with a truly abnormal device, while they themselves are not abnormal.
In the case of a possible indirect call failure, the detection platform reads the call exception information sent by the calling service platform again and detects from it whether direct calling relationships exist among the abnormal devices. If they do, it constructs a call exception chain graph from those relationships; the graph represents the calling relationships among the abnormal devices.
After the call exception chain graph has been constructed, the detection platform identifies every device in the graph that neither receives a call from any device nor sends a call request to any device, and determines it to be a target abnormal device. It also identifies every device that receives call requests from other devices in the graph but does not itself send a request to any device, and likewise determines it to be a target abnormal device.
After the detection platform has identified the target abnormal devices, it sends the target abnormal device information to the processing platform.
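The selection of target abnormal devices from the call exception chain graph can be illustrated with a short sketch. It assumes the graph is built from (caller, callee) pairs taken from the call exception information and restricted to devices already on the abnormal device list; the function name `find_target_devices` and the input shapes are hypothetical.

```python
def find_target_devices(abnormal_devices, failed_calls):
    """Classify abnormal devices by their position in the call exception chain graph.

    abnormal_devices: iterable of device identifiers already judged abnormal.
    failed_calls:     iterable of (caller, callee) pairs from the call
                      exception information (hypothetical input shape).
    """
    abnormal = set(abnormal_devices)
    out_edges = {d: set() for d in abnormal}
    in_edges = {d: set() for d in abnormal}
    for caller, callee in failed_calls:
        # Keep only direct calling relationships between abnormal devices.
        if caller in abnormal and callee in abnormal and caller != callee:
            out_edges[caller].add(callee)
            in_edges[callee].add(caller)
    # Isolated: receives no call and sends no call request.
    isolated = sorted(d for d in abnormal if not out_edges[d] and not in_edges[d])
    # End of chain: receives calls but sends no request to any device.
    chain_ends = sorted(d for d in abnormal if not out_edges[d] and in_edges[d])
    return {"isolated": isolated, "chain_ends": chain_ends}
```

Devices that still send failing calls to another abnormal device (here, anything with outgoing edges) are treated as possibly misjudged and are left out of the target set.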
Continuing the above example, the detection platform judges whether the abnormal servers belong to the same platform. If the abnormal servers all belong to the payment platform, their IP addresses are sent directly to the processing platform. If the detected abnormal servers belong to different platforms, an indirect call failure may have occurred.
Referring to fig. 9, fig. 9 is a schematic diagram of a call exception chain graph in a data processing method provided in an embodiment of the present specification. In the case of a possible indirect call failure, the detection platform reads the call exception information sent by the shopping platform again and detects from it whether direct calling relationships exist among the abnormal servers. If they do, it constructs a call exception chain graph from those relationships; the graph represents the calling relationships among the abnormal servers.
After constructing the call exception chain graph, the detection platform identifies every server in the graph that neither receives a call from any server nor sends a call request to any server, determines it to be a target abnormal server, and stores its IP address in the final abnormal server list. It also identifies every server that receives call requests from other servers in the graph but does not itself send a request to any server, determines it to be a target abnormal server as well, and stores its IP address in the final abnormal server list.
After the detection platform has identified the target abnormal servers, it sends their IP addresses to the processing platform.
In the embodiment of the specification, when the abnormal devices do not belong to the same called service platform, the detection platform generates the call exception chain graph to check for indirect call failures. This allows it to accurately identify, among the abnormal devices, the target abnormal devices that actually cause the call exceptions, which improves the accuracy of the detection platform and reduces the false alarm rate for abnormal devices.
The data processing method provided by the specification is applied to a detection platform and includes: determining the number of call exceptions of a calling service platform for a called service platform based on received call exception information; generating a call exception characterization value based on the number of call exceptions; and determining an abnormal device in the called service platform based on the call exception characterization value, and sending the abnormal device information to a processing platform. Specifically, in this method the detection platform determines the number of call exceptions between the calling service platform and the called service platform based on the call exception information sent by the service platform, and generates a call exception characterization value from that number. Through this value, abnormal devices appearing in the called service platform can be identified timely and accurately, which helps improve the stability of the service platform and relieves the operation and maintenance pressure and emergency pressure caused by call exceptions of service platforms in the data processing system.
The data processing method provided in this specification is further described below with reference to fig. 12, taking its application in an abnormal machine detection scenario of an information system as an example. Fig. 12 shows a processing flow chart of a data processing method applied to an abnormal machine detection scenario of an information system according to an embodiment of the present specification, which specifically includes the following steps:
Step 1202: Information system A performs log printing.
Information system A may be understood as the calling service platform in the above embodiment; log printing may be understood as writing the call exception information into the call exception log file in the above embodiment. The log file can be understood as the call exception log file in the above embodiment; the RPC exception information may be understood as the call exception information in the above embodiment.
Specifically, information system A prints the RPC exception information into a log file in a specific format based on the auxiliary SDK.
In practical applications, to make it easier for information systems to print logs according to the specified specification, the embodiment of the specification provides an auxiliary SDK for information systems to use; an information system is free either to use the auxiliary SDK or to implement log printing by itself.
Step 1204: The data acquisition program collects the RPC exception information written to the log file of information system A.
The log file handle may be understood as the file identifier in the above embodiment.
Specifically, when an RPC call exception occurs, the information system writes the RPC exception information into a designated log file according to a fixed format.
A data acquisition program deployed in an information system monitors a log file handle; when new RPC exception information is written into the log file, the RPC call exception information written into the log file can be read through the log file handle.
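Reading newly written records through an open log file handle can be sketched as below. This is an illustrative polling approach under stated assumptions: one record per line, UTF-8 text, and a caller that invokes the function periodically. The function name `read_new_records` is hypothetical and the specification does not prescribe how the handle is monitored.

```python
def read_new_records(handle):
    """Return the complete lines appended to the log since the previous call.

    handle: a text-mode file object positioned where the last read ended.
    """
    records = []
    while True:
        position = handle.tell()
        line = handle.readline()
        if not line:
            break  # nothing new has been written yet
        if not line.endswith("\n"):
            # The writer is in the middle of a record: rewind and let the
            # next poll pick up the complete line.
            handle.seek(position)
            break
        records.append(line.rstrip("\n"))
    return records
```

A collector would typically open the file once, seek to the end so that old records are skipped, and then call `read_new_records` on a timer, forwarding each returned record to the analysis decision node.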
Step 1206: The data acquisition program sends the RPC exception information to the analysis decision node.
The analysis decision node may be understood as the detection platform in the above embodiment.
Step 1208: The analysis decision node stores the RPC exception information in a database.
Specifically, the analysis decision node receives the RPC exception information sent by the data acquisition program, parses and structures it, and stores it in the database.
Step 1210: The analysis task obtains the RPC exception information from the database.
Specifically, the analysis task in the analysis decision node reads the abnormal RPC log records of the previous time window from the database at a preset frequency. Typically, the analysis task is executed once a minute, so the "previous time window" covers the range from "the current time minus 1 minute" to "the current time".
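Steps 1208 and 1210 can be sketched with an embedded SQLite database. Everything concrete here is an assumption made for illustration: the pipe-delimited record layout, the field names, the table name `rpc_exception`, and the use of integer Unix timestamps; the specification only says that records follow a fixed format.

```python
import sqlite3

# Hypothetical fixed log format: ts|caller_system|caller_ip|callee_system|callee_ip|error
FIELDS = ("ts", "caller_system", "caller_ip", "callee_system", "callee_ip", "error")


def parse_record(line):
    """Turn one log line into a dict, or return None if it is malformed."""
    parts = line.strip().split("|")
    if len(parts) != len(FIELDS):
        return None
    record = dict(zip(FIELDS, parts))
    try:
        record["ts"] = int(record["ts"])
    except ValueError:
        return None
    return record


def store_records(conn, lines):
    """Parse the lines and insert the valid ones; return how many were stored."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS rpc_exception ("
        "ts INTEGER, caller_system TEXT, caller_ip TEXT, "
        "callee_system TEXT, callee_ip TEXT, error TEXT)"
    )
    rows = [r for r in map(parse_record, lines) if r is not None]
    conn.executemany(
        "INSERT INTO rpc_exception VALUES "
        "(:ts, :caller_system, :caller_ip, :callee_system, :callee_ip, :error)",
        rows,
    )
    conn.commit()
    return len(rows)


def read_window(conn, now, window_seconds=60):
    """Return the records whose timestamp lies in (now - window, now]."""
    cursor = conn.execute(
        "SELECT ts, caller_system, caller_ip, callee_system, callee_ip, error "
        "FROM rpc_exception WHERE ts > ? AND ts <= ? ORDER BY ts",
        (now - window_seconds, now),
    )
    return cursor.fetchall()
```

Treating the window as half-open, so that a record on the boundary is counted by exactly one run of the analysis task, is a design choice of this sketch rather than something the specification states.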
Referring to fig. 13, fig. 13 is a schematic flow chart of identifying an abnormal machine in an information system in a data processing method provided in an embodiment of the present specification. The steps shown are those actually executed by the analysis task in the analysis decision node to identify abnormal machines in the information system, specifically as follows:
Step 1302: The analysis decision node classifies and aggregates the RPC exception information by initiator information system.
The initiator information system can be understood as the calling service platform in the above embodiment; the machine may be understood as the device in the above embodiment.
Specifically, for each initiator information system, the analysis decision node counts, per machine in the receiver information system, the number of RPC call failures between machines in the current time window.
Step 1304: The analysis decision node sorts the machines of the receiver information system in descending order of the number of RPC call failures.
The number of RPC call failures may be understood as the number of call exceptions in the above embodiment.
Specifically, for each initiator information system, the analysis decision node sorts the machine IPs of the receiver information system in reverse order (i.e., from largest to smallest) of the number of RPC call exceptions.
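Steps 1302 and 1304 amount to a group-and-count followed by a descending sort. A minimal sketch, assuming each structured record is a dict with the hypothetical keys `caller_system` and `callee_ip`:

```python
from collections import Counter, defaultdict


def aggregate_failures(records):
    """Count RPC call failures per receiver machine for each initiator system.

    records: iterable of dicts with the (hypothetical) keys
             "caller_system" and "callee_ip".
    Returns a dict mapping each initiator system to a list of
    (receiver machine IP, failure count) pairs, largest count first.
    """
    per_system = defaultdict(Counter)
    for record in records:
        per_system[record["caller_system"]][record["callee_ip"]] += 1
    # Counter.most_common() returns the pairs sorted from largest to smallest.
    return {system: counts.most_common() for system, counts in per_system.items()}
```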
Step 1306: the sample variance is calculated.
Specifically, for each initiator information system, the analysis decision node calculates the sample variance of the numbers of RPC call exceptions recorded for that system.
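For reference, the sample variance can be written out as follows. The symbols are introduced here for illustration only: x_i is the number of RPC call exceptions for the i-th receiver machine, n is the number of receiver machines in the sorted list, and the usual n - 1 divisor is assumed, since the specification does not state which divisor is used.

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i,
\qquad
s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2
```

As a small worked case, counts of 3, 1 and 2 give a mean of 2 and a sample variance of (1 + 1 + 0) / 2 = 1, whereas counts of 300, 1 and 2 give a mean of 101 and a sample variance of (39601 + 10000 + 9801) / 2 = 29701, which is far above a threshold such as 1.0.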
Step 1308: Judge whether the sample variance is greater than the fluctuation threshold.
Here, the fluctuation threshold may be understood as the detection threshold in the above-described embodiment.
Specifically, the analysis decision node compares the sample variance with the fluctuation threshold and judges whether the sample variance is greater than the fluctuation threshold.
If the sample variance is not greater than the fluctuation threshold, the detection of abnormal machines in the receiver information system stops and step 1314 is performed.
If the sample variance is greater than the fluctuation threshold, the analysis decision node proceeds to step 1310.
Step 1310: Add the first-ranked machine of the receiver information system to the abnormal machine list.
Step 1312: Remove the record corresponding to the first-ranked machine of the receiver information system from the sorted list.
Specifically, if the sample variance is greater than the fluctuation threshold, the analysis decision node removes the record corresponding to the first-ranked receiver information system machine from the sorted list, and steps 1306 through 1312 are re-executed until the sample variance is no longer greater than the fluctuation threshold.
Step 1314: a list of abnormal machines is determined.
Specifically, the analysis decision node determines the machines in the current abnormal machine list to be abnormal machines if the sample variance is no longer greater than the "fluctuation threshold".
Step 1316: Judge whether multiple abnormal machines belong to different information systems.
Specifically, the analysis decision node determines whether there are multiple abnormal machines and whether the abnormal machines belong to different information systems, and if there are multiple abnormal machines in the abnormal machine list and the abnormal machines belong to different information systems, the analysis decision node performs steps 1318 to 1322.
If multiple abnormal machines in the abnormal machine list belong to the same information system, the analysis may be ended, and the analysis decision node performs step 1324.
Step 1318: Construct a call exception chain graph.
Specifically, the analysis decision node reads the RPC exception information for the time window again and checks whether a direct abnormal RPC calling relationship exists between the abnormal machines in the abnormal machine list; if such a relationship exists, it constructs the call exception chain graph through multiple iterations.
Step 1320: Add isolated points to the final abnormal machine list.
An isolated point can be understood as an abnormal machine that, in the call exception chain graph, neither receives a call from any machine nor initiates a call to any machine.
Step 1322: Find the points with out-degree 0 in the maximum connected graph and add them to the final abnormal machine list.
A point with out-degree 0 may be understood as an abnormal machine that is called by other machines in the call exception chain graph but does not itself initiate a call to any machine.
Specifically, if a maximum connected graph exists in the graph, the points with out-degree 0 in it are found; the machines represented by these points are also abnormal machines, and the analysis decision node puts them into the final abnormal machine list.
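Finding the out-degree-0 points of the maximum connected graph can be sketched as follows. The sketch interprets "maximum connected graph" as the largest weakly connected component of the directed call exception chain graph, which is an assumption; the function name `sinks_in_largest_component` and the edge-list input are hypothetical.

```python
from collections import deque


def sinks_in_largest_component(edges):
    """Return the out-degree-0 machines of the largest weakly connected component.

    edges: iterable of (caller, callee) pairs, one per abnormal RPC calling
           relationship between abnormal machines (hypothetical input shape).
    """
    out_edges, neighbours = {}, {}
    for caller, callee in edges:
        out_edges.setdefault(caller, set()).add(callee)
        out_edges.setdefault(callee, set())
        # Direction is ignored when deciding which machines are connected.
        neighbours.setdefault(caller, set()).add(callee)
        neighbours.setdefault(callee, set()).add(caller)

    seen, largest = set(), set()
    for start in sorted(neighbours):
        if start in seen:
            continue
        # Breadth-first search collects one connected component.
        component, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for nxt in neighbours[node]:
                if nxt not in component:
                    component.add(nxt)
                    queue.append(nxt)
        seen |= component
        if len(component) > len(largest):
            largest = component

    # A point with out-degree 0 is called by others but calls nobody.
    return sorted(node for node in largest if not out_edges[node])
```

Isolated points never appear in the edge list, so they are handled separately, as in step 1320.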
Step 1324: Output the final abnormal machine list.
Specifically, the analysis decision node generates a final abnormal machine list including abnormal machines in the receiver information system.
Step 1212: The analysis task sends the final abnormal machine list to the post-action node.
Step 1214: the post action node generates a mail based on the list of abnormal machines.
Specifically, after receiving the abnormal machine list, the post action node generates a mail and sends the abnormal machine list to the mailbox of the user.
Step 1216: The post-action node generates a prompt message based on the abnormal machine list.
Specifically, after receiving the abnormal machine list, the post-action node generates a prompt message and sends the prompt message to the social account of the user to remind the user of the abnormal machine in the information system.
Step 1218: the post action node generates an HTTP request based on the list of anomalous machines.
Specifically, the post-action node also provides an automatic processing capability: after receiving the abnormal machine list, it calls a pre-configured HTTP template, adds the abnormal machine list to the HTTP template as part of the request parameters, and initiates an HTTP request; the HTTP service provider implements the processing action on its own.
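Filling a pre-configured HTTP template with the abnormal machine list might look like the sketch below. The template structure (a dict with `url`, `method` and `params`), the JSON body and the parameter name `abnormal_machines` are all assumptions for illustration; the specification does not define the template format.

```python
import json
import urllib.request


def build_request(template, abnormal_machines):
    """Build an HTTP request from a template and the abnormal machine list.

    template: dict with a "url" key and optional "method" and "params" keys
              (hypothetical template layout).
    """
    body = dict(template.get("params", {}))
    # The abnormal machine list becomes one of the request parameters.
    body["abnormal_machines"] = list(abnormal_machines)
    return urllib.request.Request(
        url=template["url"],
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method=template.get("method", "POST"),
    )
```

The returned object can be passed to `urllib.request.urlopen` to send it; what the receiving service then does with the list is up to the HTTP service provider.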
Step 1220: the post action node executes the processing script based on the exception machine list.
Specifically, the user fills in a processing script on the page, and the post-action node directly executes the processing script after receiving the abnormal machine list.
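Executing a user-supplied processing script against the abnormal machine list can be sketched as follows. The sketch assumes the script is Python source and that the machine identifiers are passed as command-line arguments; both choices, and the function name `run_processing_script`, are hypothetical, since the specification does not name a script language or a calling convention.

```python
import subprocess
import sys


def run_processing_script(script_source, abnormal_machines, timeout=30):
    """Run the script in a child interpreter and return (exit code, stdout).

    script_source:     text of the processing script entered by the user
                       (assumed to be Python for this sketch).
    abnormal_machines: machine identifiers passed to the script as arguments.
    """
    result = subprocess.run(
        [sys.executable, "-c", script_source, *abnormal_machines],
        capture_output=True,
        text=True,
        timeout=timeout,  # avoid blocking the post-action node indefinitely
    )
    return result.returncode, result.stdout
```

Running the script in a separate process with a timeout keeps a faulty script from stalling the post-action node; a production deployment would also need to restrict what such scripts are allowed to do.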
Based on the above embodiment, the data processing method applied to the abnormal machine detection scenario of an information system can be divided into four parts:
1. Log printing: the information system prints the RPC exception information to a specified log file in a fixed format;
2. Data acquisition: a data acquisition program is deployed on each machine to read the log file and send its content to the analysis decision node;
3. Analysis and decision making: on the analysis decision node, the log data sent from each machine is aggregated, and the single machines behaving abnormally are computed using an algorithm;
4. Post-action: the detection result is recorded, notified and displayed and, if necessary, a subsequent automatic action is executed to resolve the single-machine abnormality.
In the embodiment of the specification, the analysis decision node determines the number of call failures between information systems based on the RPC exception information sent by the information systems, and then generates the sample variance for each initiator information system from those numbers. Through the sample variance, abnormal machines appearing in the receiver information system can be identified timely and accurately, which helps improve the stability of the information systems and relieves the operation and maintenance pressure and emergency pressure caused by call exceptions of information systems in the data processing system. In addition, the flexible configuration and execution capability of the post-action node can meet the processing requirements for abnormal machines in different service scenarios.
Corresponding to the above method embodiment, the present specification further provides an embodiment of a data processing apparatus, and fig. 14 shows a schematic structural diagram of another data processing apparatus provided in an embodiment of the present specification. As shown in fig. 14, the apparatus includes:
a receiving module 1402 configured to determine a number of call exceptions of the calling service platform for the called service platform based on the received call exception information;
a generating module 1404 configured to generate a call exception characterizing value based on the number of call exceptions;
a determining module 1406 configured to determine an abnormal device in the called service platform based on the calling abnormal characterization value, and send the abnormal device information to a processing platform.
Optionally, the receiving module 1402 is further configured to divide the calling exception information into different categories according to the attribute information of the calling service platform and the attribute information of the called service platform;
and acquiring the device parameters of the calling abnormal information in each category, and determining the calling abnormal times of the devices in the calling service platform aiming at the devices in the called service platform according to the device parameters.
Optionally, the generating module 1404 is further configured to obtain a total number of call exceptions for the called service platform according to the number of call exceptions of the called service platform;
and generating a calling abnormity representation value representing the numerical difference between the calling abnormity numbers based on the total calling abnormity numbers and the calling abnormity numbers.
Optionally, the generating module 1404 is further configured to calculate and obtain a mean value of the number of call exceptions based on the total number of call exceptions;
processing the number of call exceptions and the mean number of call exceptions to generate a sample variance, and taking the sample variance as the call exception characterization value; or
calculating a percentage value for each number of call exceptions based on the total number of call exceptions and the plurality of numbers of call exceptions, and taking the percentage value as the call exception characterization value.
Optionally, the determining module 1406 is further configured to sort the call exception information according to the number of call exceptions;
and under the condition that the calling abnormity representation value meets a preset detection threshold value, obtaining the calling abnormity information of a preset sequence position, and determining abnormal equipment in the called service platform according to the calling abnormity information of the preset sequence position.
Optionally, the determining module 1406 is further configured to determine whether the abnormal device belongs to the same called service platform;
if yes, sending the abnormal equipment information to the processing platform;
if not, generating a calling exception chain diagram according to the calling exception information and the exception equipment;
and determining target abnormal equipment based on the calling abnormal chain diagram, and sending the target abnormal equipment information to the processing platform.
In the data processing device provided in the embodiment of the present specification, the detection platform determines the number of calling exception times between the calling service platform and the called service platform based on the calling exception information sent by the service platform, and then generates an exception characteristic value according to the number of calling exception times, and the exception device appearing in the called service platform can be timely and accurately identified through the exception characteristic value, so that the stability of the service platform is favorably improved, and the operation and maintenance pressure and the emergency pressure caused by calling exception of the service platform in the data processing system are favorably relieved.
The above is a schematic configuration of another data processing apparatus of the present embodiment. It should be noted that the technical solution of the data processing apparatus and the technical solution of the data processing method described above belong to the same concept; for details not described in the technical solution of the data processing apparatus, refer to the description of the technical solution of the data processing method.
FIG. 15 illustrates a block diagram of a computing device 1500 provided in accordance with one embodiment of the present description. The components of the computing device 1500 include, but are not limited to, a memory 1510 and a processor 1520. The processor 1520 is coupled to the memory 1510 via a bus 1530 and a database 1550 is used to store data.
The computing device 1500 also includes an access device 1540 that enables the computing device 1500 to communicate via one or more networks 1560. Examples of such networks include the Public Switched Telephone Network (PSTN), a Local Area Network (LAN), a Wide Area Network (WAN), a Personal Area Network (PAN), or a combination of communication networks such as the internet. The access device 1540 may include one or more of any type of network interface (e.g., a Network Interface Card (NIC)) whether wired or wireless, such as an IEEE802.11 Wireless Local Area Network (WLAN) wireless interface, a worldwide interoperability for microwave access (Wi-MAX) interface, an ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a bluetooth interface, a Near Field Communication (NFC) interface, and so forth.
In one embodiment of the present description, the above-described components of computing device 1500, as well as other components not shown in FIG. 15, may also be connected to each other, such as by a bus. It should be understood that the block diagram of the computing device structure shown in FIG. 15 is for purposes of example only and is not limiting as to the scope of the description. Those skilled in the art may add or replace other components as desired.
Computing device 1500 may be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (e.g., tablet, personal digital assistant, laptop, notebook, netbook, etc.), mobile phone (e.g., smartphone), wearable computing device (e.g., smartwatch, smartglasses, etc.), or other type of mobile device, or a stationary computing device such as a desktop computer or PC. Computing device 1500 may also be a mobile or stationary server.
Wherein the processor 1520 is configured to execute computer-executable instructions that, when executed by the processor, implement the steps of any of the data processing methods.
The above is an illustrative scheme of a computing device of the present embodiment. It should be noted that the technical solution of the computing device and the technical solution of the data processing method belong to the same concept, and details that are not described in detail in the technical solution of the computing device can be referred to the description of the technical solution of the data processing method.
An embodiment of the present specification also provides a computer readable storage medium storing computer instructions which, when executed by a processor, implement the steps of any of the data processing methods.
The above is an illustrative scheme of the computer-readable storage medium of this embodiment. It should be noted that the technical solution of the storage medium and the technical solution of the data processing method belong to the same concept; for details not described in the technical solution of the storage medium, refer to the description of the technical solution of the data processing method.
The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
The computer instructions comprise computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content of the computer-readable medium may be expanded or restricted as appropriate according to the requirements of legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, in accordance with legislation and patent practice, computer-readable media do not include electrical carrier signals and telecommunications signals.
It should be noted that, for simplicity of description, the foregoing method embodiments are each described as a series of combined actions, but those skilled in the art should understand that the embodiments of this specification are not limited by the described order of actions, because some steps may be performed in other orders or simultaneously. Further, those skilled in the art should also appreciate that the embodiments described in this specification are preferred embodiments, and that the actions and modules involved are not necessarily required by the embodiments of this specification.
In the above embodiments, each embodiment is described with its own emphasis; for parts not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.
The preferred embodiments of the present specification disclosed above are intended only to help explain the specification. The alternative embodiments do not set out every detail exhaustively, nor do they limit the invention to the specific embodiments described. Obviously, many modifications and variations are possible in light of the content of this specification. The embodiments were chosen and described in order to best explain the principles of the specification and its practical application, thereby enabling others skilled in the art to understand and use the specification well. The specification is limited only by the claims and their full scope and equivalents.

Claims (18)

1. A data processing method, applied to a data processing system, the system comprising a calling service platform, a called service platform, a detection platform and a processing platform, wherein:
the calling service platform generates call exception information when a call exception occurs, writes the call exception information into a call exception log file, and sends target call exception information to the detection platform;
the detection platform determines, based on the received call exception information, the number of call exceptions of the calling service platform with respect to the called service platform, generates a call exception characterization value based on the number of call exceptions, determines an abnormal device in the called service platform based on the call exception characterization value, and sends abnormal device information to the processing platform; and
the processing platform performs exception handling based on the received abnormal device information.
2. The data processing method of claim 1, wherein the call exception comprises:
a timeout occurring when the calling service platform calls the called service platform; or
the calling service platform receiving abnormal service data when it calls the service data of the called service platform.
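As a non-limiting illustration, the two call exception conditions of claim 2 can be sketched as follows. The names `CallExceptionInfo`, `detect_call_exception`, `invoke` and `is_valid`, as well as the record fields, are hypothetical and are not taken from the specification; the sketch also assumes that a timeout surfaces as Python's built-in `TimeoutError`.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class CallExceptionInfo:
    """Hypothetical record of one call exception (fields are assumptions)."""
    caller: str         # calling service platform
    callee: str         # called service platform
    callee_device: str  # device on the called service platform
    reason: str         # "timeout" or "abnormal_data"


def detect_call_exception(
    invoke: Callable[[], Any],
    is_valid: Callable[[Any], bool],
    caller: str,
    callee: str,
    callee_device: str,
) -> Optional[CallExceptionInfo]:
    """Return call exception information if the call is abnormal, else None."""
    try:
        data = invoke()
    except TimeoutError:
        # Condition 1: the call to the called service platform timed out.
        return CallExceptionInfo(caller, callee, callee_device, "timeout")
    if not is_valid(data):
        # Condition 2: the call returned abnormal service data.
        return CallExceptionInfo(caller, callee, callee_device, "abnormal_data")
    return None
```

What counts as "abnormal service data" is left to the caller-supplied `is_valid` predicate, since the claim does not fix a particular check.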
3. The data processing method of claim 1, wherein sending the target call exception information to the detection platform comprises:
acquiring a file identifier of the call exception log file, and locating the corresponding call exception log file according to the file identifier;
acquiring the target call exception information from the call exception log file according to a call exception information identifier; and
sending the target call exception information to the detection platform.
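As a non-limiting illustration, the lookup steps of claim 3 can be sketched as follows. The log format (one JSON object per line), the `info_id` field, the `log_index` mapping from file identifier to path, and both function names are assumptions made for this sketch only; the specification does not prescribe them.

```python
import json
from pathlib import Path
from typing import Dict, Iterable, List


def append_call_exception(log_path: Path, record: dict) -> None:
    """Append one call exception record to the log file as a JSON line."""
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")


def load_target_call_exceptions(
    log_index: Dict[str, Path], file_id: str, info_ids: Iterable[str]
) -> List[dict]:
    """Locate the log file by its identifier, then select records by info identifier."""
    log_path = log_index[file_id]  # step 1: file identifier -> log file
    wanted = set(info_ids)
    targets = []
    with log_path.open(encoding="utf-8") as fh:
        for line in fh:
            record = json.loads(line)
            if record.get("info_id") in wanted:  # step 2: match the info identifier
                targets.append(record)
    # Step 3 (sending the records to the detection platform) is left to the caller.
    return targets
```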
4. The data processing method of claim 1, wherein the detection platform determining, based on the received call exception information, the number of call exceptions of the calling service platform with respect to the called service platform comprises:
dividing the call exception information into different categories according to attribute information of the calling service platform and attribute information of the called service platform; and
acquiring device parameters of the call exception information in each category, and determining, according to the device parameters, the number of call exceptions of devices in the calling service platform with respect to devices in the called service platform.
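As a non-limiting illustration, the categorize-then-count step of claim 4 can be sketched as follows. The record keys (`caller_platform`, `callee_platform`, `caller_device`, `callee_device`) and the function name are hypothetical; the sketch assumes that the platform attribute information reduces to a platform name and that the device parameters reduce to a device name.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

Category = Tuple[str, str]    # (calling platform, called platform)
DevicePair = Tuple[str, str]  # (calling device, called device)


def count_call_exceptions(records: Iterable[dict]) -> Dict[Category, Counter]:
    """Group records by platform pair, then count exceptions per device pair."""
    counts: Dict[Category, Counter] = defaultdict(Counter)
    for record in records:
        category = (record["caller_platform"], record["callee_platform"])
        pair = (record["caller_device"], record["callee_device"])
        counts[category][pair] += 1
    return counts
```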
5. The data processing method of claim 1, wherein generating the call exception characterization value based on the number of call exceptions comprises:
obtaining a total number of call exceptions according to the numbers of call exceptions of the calling service platform with respect to the called service platform; and
generating, based on the total number of call exceptions and the numbers of call exceptions, a call exception characterization value that characterizes the numerical difference among the plurality of numbers of call exceptions.
6. The data processing method of claim 5, wherein generating, based on the total number of call exceptions and the numbers of call exceptions, the call exception characterization value that characterizes the numerical difference among the plurality of numbers of call exceptions comprises:
calculating a mean number of call exceptions based on the total number of call exceptions;
processing the numbers of call exceptions and the mean number of call exceptions to generate a sample variance, and taking the sample variance as the call exception characterization value; or
calculating, based on the total number of call exceptions and the plurality of numbers of call exceptions, a percentage value for each number of call exceptions, and taking the percentage value as the call exception characterization value.
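As a non-limiting illustration, the two alternative characterization values of claim 6 can be sketched as follows. The function names are hypothetical, and the sketch assumes the usual sample variance with an n - 1 denominator; the claim itself does not state which denominator is used. Percentages are returned as fractions of the total.

```python
from typing import List, Sequence


def sample_variance(counts: Sequence[int]) -> float:
    """Sample variance of the per-device call exception counts (n - 1 denominator)."""
    n = len(counts)
    if n < 2:
        raise ValueError("at least two counts are needed for a sample variance")
    total = sum(counts)  # total number of call exceptions
    mean = total / n     # mean number of call exceptions
    return sum((c - mean) ** 2 for c in counts) / (n - 1)


def count_fractions(counts: Sequence[int]) -> List[float]:
    """Share of the total number of call exceptions held by each count."""
    total = sum(counts)
    if total == 0:
        return [0.0 for _ in counts]
    return [c / total for c in counts]
```

For counts of 10, 1 and 1, the mean is 4 and the sample variance is (36 + 9 + 9) / 2 = 27. A large variance, or one share far above the others, indicates that the exceptions are concentrated on a few devices rather than spread evenly.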
7. The data processing method of claim 1, wherein determining the abnormal device in the called service platform based on the call exception characterization value comprises:
sorting the call exception information according to the number of call exceptions; and
when the call exception characterization value meets a preset detection threshold, acquiring the call exception information at a preset position in the sorted order, and determining the abnormal device in the called service platform according to that call exception information.
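As a non-limiting illustration, the sort-and-threshold step of claim 7 can be sketched as follows. The sketch assumes that "meets the threshold" means greater than or equal to it, that sorting is in descending order of call exception count, and that the "preset position" is the first `top_k` entries; all three are assumptions, as are the function and parameter names.

```python
from typing import Dict, List


def find_abnormal_devices(
    device_counts: Dict[str, int],
    characterization_value: float,
    threshold: float,
    top_k: int = 1,
) -> List[str]:
    """Return the called-platform devices at the top of the ranking, if any."""
    if characterization_value < threshold:
        # Exceptions are not concentrated enough to single out a device.
        return []
    ranked = sorted(device_counts.items(), key=lambda item: item[1], reverse=True)
    return [device for device, _ in ranked[:top_k]]
```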
8. The data processing method of claim 1, wherein sending the abnormal device information to the processing platform comprises:
determining whether the abnormal devices belong to the same called service platform;
if so, sending the abnormal device information to the processing platform;
if not, generating a call exception chain diagram according to the call exception information and the abnormal devices; and
determining a target abnormal device based on the call exception chain diagram, and sending target abnormal device information to the processing platform.
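As a non-limiting illustration, the branching of claim 8 can be sketched as follows. The claim does not state how the target abnormal device is selected from the call exception chain diagram; this sketch assumes, purely for illustration, that the target is an abnormal device at the downstream end of the chain, that is, one that does not itself report exceptions when calling another abnormal device. The function name and data shapes are likewise hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple


def find_target_abnormal_devices(
    abnormal_devices: Dict[str, str],
    exception_edges: Iterable[Tuple[str, str]],
) -> List[str]:
    """abnormal_devices maps device -> called platform;
    exception_edges holds (calling device, called device) pairs with exceptions."""
    platforms = set(abnormal_devices.values())
    if len(platforms) <= 1:
        # All abnormal devices belong to the same called platform: report them as-is.
        return sorted(abnormal_devices)

    # Build the call exception chain, restricted to abnormal devices.
    downstream: Dict[str, Set[str]] = {device: set() for device in abnormal_devices}
    for caller, callee in exception_edges:
        if caller in downstream and callee in downstream and caller != callee:
            downstream[caller].add(callee)

    # Assumed rule: devices at the end of the chain are the targets.
    targets = sorted(d for d, callees in downstream.items() if not callees)
    # If the chain is cyclic and has no end, fall back to all abnormal devices.
    return targets or sorted(abnormal_devices)
```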
9. A data processing method, applied to a detection platform, comprising:
determining, based on received call exception information, the number of call exceptions of a calling service platform with respect to a called service platform;
generating a call exception characterization value based on the number of call exceptions; and
determining an abnormal device in the called service platform based on the call exception characterization value, and sending abnormal device information to a processing platform.
10. The data processing method of claim 9, wherein determining, based on the received call exception information, the number of call exceptions of the calling service platform with respect to the called service platform comprises:
dividing the call exception information into different categories according to attribute information of the calling service platform and attribute information of the called service platform; and
acquiring device parameters of the call exception information in each category, and determining, according to the device parameters, the number of call exceptions of devices in the calling service platform with respect to devices in the called service platform.
11. The data processing method of claim 9, wherein generating the call exception characterization value based on the number of call exceptions comprises:
obtaining a total number of call exceptions according to the numbers of call exceptions of the calling service platform with respect to the called service platform; and
generating, based on the total number of call exceptions and the numbers of call exceptions, a call exception characterization value that characterizes the numerical difference among the plurality of numbers of call exceptions.
12. The data processing method of claim 11, wherein generating, based on the total number of call exceptions and the numbers of call exceptions, the call exception characterization value that characterizes the numerical difference among the plurality of numbers of call exceptions comprises:
calculating a mean number of call exceptions based on the total number of call exceptions;
processing the numbers of call exceptions and the mean number of call exceptions to generate a sample variance, and taking the sample variance as the call exception characterization value; or
calculating, based on the total number of call exceptions and the plurality of numbers of call exceptions, a percentage value for each number of call exceptions, and taking the percentage value as the call exception characterization value.
13. The data processing method of claim 9, wherein determining the abnormal device in the called service platform based on the call exception characterization value comprises:
sorting the call exception information according to the number of call exceptions; and
when the call exception characterization value meets a preset detection threshold, acquiring the call exception information at a preset position in the sorted order, and determining the abnormal device in the called service platform according to that call exception information.
14. The data processing method of claim 9, wherein sending the abnormal device information to the processing platform comprises:
determining whether the abnormal devices belong to the same called service platform;
if so, sending the abnormal device information to the processing platform;
if not, generating a call exception chain diagram according to the call exception information and the abnormal devices; and
determining a target abnormal device based on the call exception chain diagram, and sending target abnormal device information to the processing platform.
15. A data processing system, comprising a calling service platform, a called service platform, a detection platform and a processing platform, wherein:
the calling service platform is configured to generate call exception information when a call exception occurs, write the call exception information into a call exception log file, and send target call exception information to the detection platform;
the detection platform is configured to determine, based on the received call exception information, the number of call exceptions of the calling service platform with respect to the called service platform, generate a call exception characterization value based on the number of call exceptions, determine an abnormal device in the called service platform based on the call exception characterization value, and send abnormal device information to the processing platform; and
the processing platform is configured to perform exception handling based on the received abnormal device information.
16. A data processing device, applied to a detection platform, comprising:
a receiving module configured to determine, based on received call exception information, the number of call exceptions of a calling service platform with respect to a called service platform;
a generation module configured to generate a call exception characterization value based on the number of call exceptions; and
a determining module configured to determine an abnormal device in the called service platform based on the call exception characterization value, and to send abnormal device information to a processing platform.
17. A computing device, comprising:
a memory and a processor;
the memory is configured to store computer-executable instructions and the processor is configured to execute the computer-executable instructions, wherein the processor implements the steps of the data processing method according to any one of claims 1 to 8 or 9 to 14 when executing the computer-executable instructions.
18. A computer readable storage medium storing computer executable instructions which, when executed by a processor, carry out the steps of the data processing method of any one of claims 1 to 8 or 9 to 14.
CN202110615637.1A 2021-06-02 2021-06-02 Data processing method, system and device Pending CN113238888A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110615637.1A CN113238888A (en) 2021-06-02 2021-06-02 Data processing method, system and device

Publications (1)

Publication Number Publication Date
CN113238888A true CN113238888A (en) 2021-08-10

Family

ID=77136465

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110615637.1A Pending CN113238888A (en) 2021-06-02 2021-06-02 Data processing method, system and device

Country Status (1)

Country Link
CN (1) CN113238888A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107404456A (en) * 2016-05-18 2017-11-28 阿里巴巴集团控股有限公司 Location of mistake method and device
CN109684280A (en) * 2018-12-19 2019-04-26 泰康保险集团股份有限公司 Journal file processing method, apparatus and system
CN109739727A (en) * 2019-01-03 2019-05-10 优信拍(北京)信息科技有限公司 Service monitoring method and device in micro services framework
CN110297746A (en) * 2019-07-05 2019-10-01 北京慧眼智行科技有限公司 A kind of data processing method and system
CN110851298A (en) * 2019-11-08 2020-02-28 卫盈联信息技术(深圳)有限公司 Abnormality analysis and processing method, electronic device, and storage medium
CN112052109A (en) * 2020-08-28 2020-12-08 西安电子科技大学 Cloud service platform event anomaly detection method based on log analysis

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115242613A (en) * 2022-08-03 2022-10-25 浙江网商银行股份有限公司 Target node determination method and device
CN115242613B (en) * 2022-08-03 2024-03-15 浙江网商银行股份有限公司 Target node determining method and device

Similar Documents

Publication Publication Date Title
AU2019275633B2 (en) System and method of automated fault correction in a network environment
CN113553210A (en) Alarm data processing method, device, equipment and storage medium
US10554701B1 (en) Real-time call tracing in a service-oriented system
CN111078513A (en) Log processing method, device, equipment, storage medium and log alarm system
CN116414717A (en) Automatic testing method, device, equipment, medium and product based on flow playback
CN114356499A (en) Kubernetes cluster alarm root cause analysis method and device
CN113515434A (en) Abnormity classification method, abnormity classification device, abnormity classification equipment and storage medium
Ali et al. [Retracted] Classification and Prediction of Software Incidents Using Machine Learning Techniques
CN113238888A (en) Data processing method, system and device
US11822578B2 (en) Matching machine generated data entries to pattern clusters
CN108763916B (en) Service interface security assessment method and device
CN112612679A (en) System running state monitoring method and device, computer equipment and storage medium
CN116149877A (en) Fault detection method and device
US20100153783A1 (en) Method and apparatus for system analysis
CN116136801B (en) Cloud platform data processing method and device, electronic equipment and storage medium
KR101288535B1 (en) Method for monitoring communication system and apparatus therefor
CN115545452A (en) Operation and maintenance method, operation and maintenance system, equipment and storage medium
CN113282751A (en) Log classification method and device
CN111211938B (en) Biological information software monitoring system and method
CN115048345A (en) Abnormal log detection method and device, electronic equipment and storage medium
CN112668744A (en) Data processing method and device
CN112100047A (en) Service performance monitoring and analyzing method and device
WO2021151494A1 (en) Device for monitoring a computer network system
CN111581062A (en) Service fault processing method and server
CN116739646B (en) Method and system for analyzing big data of network transaction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20210810)