CN111680009A - Log processing method and device, storage medium and processor

Log processing method and device, storage medium and processor

Info

Publication number
CN111680009A
Authority
CN
China
Prior art keywords
data
log
processed
resource manager
processing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010525810.4A
Other languages
Chinese (zh)
Other versions
CN111680009B (en)
Inventor
黄宇
王风雷
李东军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Yuemeng Information Technology Co ltd
Original Assignee
Suzhou Yuemeng Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Yuemeng Information Technology Co ltd
Priority to CN202010525810.4A
Publication of CN111680009A
Application granted
Publication of CN111680009B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/1734 Details of monitoring file system events, e.g. by the use of hooks, filter drivers, logs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/1805 Append-only file systems, e.g. using logs or journals to store data
    • G06F16/1815 Journaling file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention discloses a log processing method and apparatus, a storage medium, and a processor. The method comprises: acquiring a log to be processed; copying the log to be processed to obtain a plurality of copies of the log to be processed; processing the plurality of copies in parallel in a data stream and at least one service stream, respectively; and sending a first processing result obtained in the data stream to a first data processing system and a second processing result obtained in each service stream to a second data processing system. The invention solves the technical problem that logs are difficult to process effectively.

Description

Log processing method and device, storage medium and processor
Technical Field
The present invention relates to the field of data processing, and in particular, to a log processing method, apparatus, storage medium, and processor.
Background
At present, the definition, collection and processing of logs are of great significance to continuously optimizing clients (apps).
However, log processing is complex. The standard log processing methods provided by prior-art streaming systems, such as the in-memory distributed computing framework (Spark) and the stream computing framework (Storm), can only perform operations such as filtering and conversion on logs. Log processing is therefore rather limited, cannot adapt to the coexistence of multiple different services downstream, and it is difficult to process logs effectively.
In view of the above-mentioned problem that it is difficult to effectively process the log in the prior art, no effective solution has been proposed at present.
Disclosure of Invention
The embodiment of the invention provides a log processing method, a log processing device, a storage medium and a processor, which are used for at least solving the technical problem that logs are difficult to effectively process.
According to an aspect of an embodiment of the present invention, there is provided a log processing method. The method can comprise the following steps: acquiring a log to be processed; copying the logs to be processed to obtain a plurality of logs to be processed; respectively carrying out parallel processing on a plurality of logs to be processed in a data stream and at least one service stream; the first processing result obtained in the data stream is sent to the first data processing system and the second processing result obtained in each traffic stream is sent to the second data processing system.
Optionally, the parallel processing of the multiple logs to be processed in the data stream and the at least one service stream respectively includes: and respectively carrying out parallel processing on a plurality of logs to be processed in the data stream and at least one service stream through the resource manager.
Optionally, processing the log to be processed in the data stream by the resource manager includes: acquiring first identification information in a log to be processed in a data stream; reporting the first identification information to a resource manager; acquiring first data corresponding to the first identification information from the resource manager, and determining second identification information associated with the first identification information; reporting the second identification information to a resource manager; acquiring second data corresponding to the second identification information from the resource manager; and splicing the original log, the first data and the second data corresponding to the log to be processed into a first target log.
Optionally, the first data and the second data are requested and cached by the resource manager from an external database, or are retrieved and cached by the resource manager from the database simulator.
Optionally, sending the first processing result obtained in the data stream to the first data processing system, includes: and determining the first target log as a first processing result, and sending the first processing result to the distributed file system.
Optionally, processing, by the resource manager, each log to be processed in each traffic flow includes: in each service flow, determining the service corresponding to each log to be processed; determining third data corresponding to the service in each log to be processed; acquiring fourth data from the resource manager based on the third data; and converting the fourth data into the target message.
Optionally, the fourth data is requested and cached by the resource manager from an external database, or is retrieved and cached by the resource manager from the database simulator.
Optionally, obtaining fourth data from the resource manager based on the third data includes: sending a request carrying the identification information in the third data to a resource manager; and acquiring fourth data sent by the resource manager in response to the request.
Optionally, the converting the fourth data into the target message includes: and converting the fourth data into a target message based on the requirement information of the service.
Optionally, sending the second processing result obtained in each service flow to the second data processing system, including: and determining the target message as a second processing result, and sending the second processing result to the distributed publish-subscribe message system.
Optionally, the target message of the distributed publish-subscribe message system is processed by a corresponding downstream system.
Optionally, the logs to be processed include an original log having basic information and a second target log having database data, where the second target log is used for data recovery.
Optionally, after acquiring the log to be processed, the method further includes: and forwarding the logs to be processed to a database simulator through a converter, wherein the database data is extracted from the logs to be processed and stored by the database simulator.
Optionally, changes to the external database are masked by the resource manager when data recovery is performed.
According to another aspect of the embodiment of the invention, a log processing device is also provided. The apparatus may include: the acquisition unit is used for acquiring the log to be processed; the copying unit is used for copying the logs to be processed to obtain a plurality of logs to be processed; the processing unit is used for respectively carrying out parallel processing on a plurality of logs to be processed in the data stream and at least one service stream; and the sending unit is used for sending the first processing result obtained in the data flow to the first data processing system and sending the second processing result obtained in each service flow to the second data processing system.
According to another aspect of the embodiments of the present invention, there is also provided a computer-readable storage medium. The computer readable storage medium includes a stored program, wherein when the program runs, the apparatus where the computer readable storage medium is located is controlled to execute the log processing method of the embodiment of the present invention.
According to another aspect of the embodiments of the present invention, there is also provided a processor. The processor is used for running a program, wherein the program executes the log processing method of the embodiment of the invention when running.
In the embodiment of the invention, a log to be processed is obtained; the log to be processed is copied to obtain a plurality of logs to be processed; the plurality of logs to be processed are processed in parallel in a data stream and at least one service stream, respectively; the first processing result obtained in the data stream is sent to the first data processing system and the second processing result obtained in each service stream is sent to the second data processing system. That is, by decomposing the processing of the log to be processed into parallel processing in a data stream and a plurality of task streams, the method and the apparatus adapt to the coexistence of multiple different downstream services and avoid being limited to operations such as filtering and conversion on the log, thereby solving the technical problem that logs are difficult to process effectively and achieving the technical effect of effective log processing.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
FIG. 1 is a flow diagram of a method of log processing according to an embodiment of the invention;
FIG. 2 is a flow chart of a log processing method in a normal mode according to an embodiment of the present invention;
FIG. 3 is a flow chart of a method of log processing in data recovery mode according to an embodiment of the present invention; and
fig. 4 is a schematic diagram of a log processing apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example 1
In accordance with an embodiment of the present invention, there is provided an embodiment of a log processing method. It should be noted that the steps illustrated in the flowchart of the figure may be performed in a computer system, such as by a set of computer-executable instructions, and that, although a logical order is illustrated in the flowchart, in some cases the steps illustrated or described may be performed in an order different from that described herein.
Fig. 1 is a flowchart of a log processing method according to an embodiment of the present invention. As shown in fig. 1, the method may include the steps of:
step S102, obtaining a log to be processed.
In the technical solution provided by step S102 of the present invention, the log to be processed is an input log received at the log entry. Optionally, the log to be processed in this embodiment is first input to a checker, which verifies it, for example, by checking whether it is a correct log conforming to the definition. When the log is found to conform to the definition, the verification is deemed successful and the log is retained. When the log is found not to conform to the definition, the verification is deemed failed and the erroneous log is filtered out, for example, by sending it directly to a Distributed File System (HDFS). In this way the checker achieves the purpose of screening the input logs.
Optionally, the log to be processed in this embodiment may be a short video App log, in which various user behaviors are recorded, for example, behaviors of logging in and out of an App, browsing and watching a short video, searching for a short video, and the like are recorded.
And step S104, copying the logs to be processed to obtain a plurality of logs to be processed.
In the technical solution provided by step S104 of the present invention, after the log to be processed is obtained, the successfully verified log may be copied to obtain multiple identical logs to be processed, and each copy is input in full into one of the streams, where the streams include a data stream and service streams, and a service stream may also be referred to as a task stream.
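For illustration only, the following is a minimal Python sketch of the verification step S102 and the copy/fan-out step S104. The field names follow the log elements listed later in this description, and the error sink and stream containers are assumptions standing in for HDFS and the stream entry points; this is a sketch under those assumptions, not the implementation of the embodiment.

```python
# Sketch of the checker (step S102) and the copy/fan-out (step S104).
# Field names and the sink/stream containers are illustrative assumptions.
REQUIRED_FIELDS = {"app_version", "timestamp", "user_action",
                   "user_id", "object_type", "object_id", "page"}

def check_log(raw_log: dict) -> bool:
    """A log is deemed correct when it carries every field of the definition."""
    return REQUIRED_FIELDS.issubset(raw_log.keys())

def route_log(raw_log: dict, error_sink: list, stream_inputs: list) -> None:
    """Send malformed logs straight to the error sink (standing in for HDFS)
    and fan identical copies of verified logs out to every stream."""
    if not check_log(raw_log):
        error_sink.append(raw_log)          # error logs bypass the streams
        return
    for stream in stream_inputs:
        stream.append(dict(raw_log))        # one identical copy per stream
```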
And step S106, respectively carrying out parallel processing on a plurality of logs to be processed in the data stream and at least one service stream.
In the technical solution provided in step S106 of the present invention, after the log to be processed is copied to obtain multiple logs to be processed, the multiple logs to be processed may be processed in parallel in the data stream and the at least one service stream, respectively.
In this embodiment, a copy of the successfully verified log to be processed may be input into the data stream, where it is processed to obtain a complete log. Optionally, in the data stream, various pieces of identification information in the log to be processed are extracted, where the identification information may be an identity (ID), and are reported to the resource manager; a complete log is then generated and sent.
This embodiment may simultaneously input the other successfully verified copies into the respective service flows, where the service flows are used for processing the logs related to each service, for example, collecting the relevant logs in each service flow, processing the relevant data, and sending target messages. The embodiment can process a plurality of service flows and can naturally add a new service flow for processing.
Step S108, sending the first processing result obtained in the data flow to the first data processing system, and sending the second processing result obtained in each service flow to the second data processing system.
In the technical solution provided by step S108 of the present invention, after the multiple to-be-processed logs are respectively processed in the data stream and the at least one service stream in parallel, the first processing result obtained in the data stream is sent to the first data processing system, and the second processing result obtained in each service stream can be sent to the second data processing system in parallel.
In this embodiment, the log to be processed is processed in the data stream to obtain a first processing result, and this embodiment may send the first processing result to the first data processing system, for example, the first data processing system may be an HDFS; other logs to be processed in the embodiment may be processed in parallel in each service flow respectively to obtain a second processing result, and the second processing result may be sent to a second data processing system, and the second data processing system may be a distributed publish-subscribe message system (Kafka), so as to adapt to different downstream coexistence of multiple services.
Through the above steps S102 to S108, a log to be processed is obtained; the log to be processed is copied to obtain a plurality of logs to be processed; the plurality of logs to be processed are processed in parallel in a data stream and at least one service stream, respectively; the first processing result obtained in the data stream is sent to the first data processing system and the second processing result obtained in each service stream is sent to the second data processing system. That is, this embodiment decomposes the processing of the log to be processed into parallel processing in the data stream and in a plurality of task streams, so as to adapt to the coexistence of multiple different services downstream and avoid being limited to operations such as filtering and conversion on the log, thereby solving the technical problem that logs are difficult to process effectively and achieving the technical effect of effective log processing.
The above-described method of this embodiment is further described below.
As an optional implementation manner, in step S106, the parallel processing of multiple to-be-processed logs in the data stream and at least one service stream respectively includes: and respectively carrying out parallel processing on a plurality of logs to be processed in the data stream and at least one service stream through the resource manager.
In this embodiment, a plurality of to-be-processed logs may be processed in parallel in a data stream and at least one service stream respectively through the resource manager, where the resource manager may be configured to request data from an external database, cache the requested data, and respond to a data obtaining request, where the data obtaining request is used to request to obtain data from the external database.
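A minimal sketch of such a resource manager is shown below, assuming a `fetch_batch` callable that stands in for the external database client (normal mode) or the database simulator (data recovery mode); the method names are illustrative, not taken from the embodiment.

```python
class ResourceManager:
    """Batches ID lookups, caches the returned records, and answers
    data requests from the data stream and the service flows."""

    def __init__(self, fetch_batch):
        # fetch_batch(ids) -> {id: record}; assumed external data source.
        self._fetch_batch = fetch_batch
        self._pending = set()
        self._cache = {}

    def report(self, ids):
        """Streams report IDs here; nothing is fetched yet."""
        self._pending.update(ids)

    def flush(self):
        """Once a reporting round is complete, fetch all missing records in one batch."""
        missing = self._pending - self._cache.keys()
        if missing:
            self._cache.update(self._fetch_batch(missing))
        self._pending.clear()

    def get(self, object_id):
        """Data stream and service flows read cached records here."""
        return self._cache.get(object_id)
```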
The following describes a method for processing a log to be processed in a data stream by a resource manager.
As an optional implementation, processing the log to be processed in the data stream by the resource manager includes: acquiring first identification information in a log to be processed in a data stream; reporting the first identification information to a resource manager; acquiring first data corresponding to the first identification information from the resource manager, and determining second identification information associated with the first identification information; reporting the second identification information to a resource manager; acquiring second data corresponding to the second identification information from the resource manager; and splicing the original log, the first data and the second data corresponding to the log to be processed into a first target log.
In this embodiment, the obtaining of the first identification information in the log to be processed in the data stream may be extracting all user IDs and object IDs from all logs, that is, collecting all IDs, for example, collecting short video IDs. In this embodiment, the first identification information is reported to the resource manager, the resource manager may cache the received first identification information, and after all the first identification information is reported in the first round, the resource manager may request to obtain, in batch, first data corresponding to the first identification information from an external database or a database simulator, where the first data may also be referred to as original object data, and then obtain, in a data stream, the first data corresponding to the first identification information from the resource manager, that is, the resource manager returns the first data to the data stream for processing.
This embodiment may also determine second identification information associated with the first identification information, which may be an associated ID, for example, an author ID in the short video data. In this embodiment, all of the second identification information may be reported to the resource manager, and the resource manager may then request, in batch, second data corresponding to the second identification information from an external database or a database simulator, where the second data may also be referred to as data of associated objects; that is, the resource manager returns the second data to the data stream for processing.
After the first data and the second data are obtained, the original log, the first data, and the second data corresponding to the log to be processed are spliced into a first target log. The first target log may be a complete log, thereby achieving the purpose of conveniently obtaining data from an external database to supplement the log information.
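Building on the `ResourceManager` sketch above, the two reporting rounds and the final splice in the data stream might look as follows; the `author_id` field and the names of the spliced fields are assumptions used only for illustration.

```python
def process_data_stream(logs, rm):
    """Two-round ID resolution and splicing into first target (complete) logs."""
    # Round 1: collect user IDs and object IDs from all logs and report them.
    first_ids = {i for log in logs for i in (log["user_id"], log["object_id"])}
    rm.report(first_ids)
    rm.flush()                                   # batch fetch of the first data

    # Round 2: derive associated IDs (e.g. the author ID inside short-video data).
    second_ids = set()
    for log in logs:
        record = rm.get(log["object_id"]) or {}
        if "author_id" in record:                # illustrative associated ID
            second_ids.add(record["author_id"])
    rm.report(second_ids)
    rm.flush()                                   # batch fetch of the second data

    # Splice the original log, the first data and the second data together.
    complete_logs = []
    for log in logs:
        obj = rm.get(log["object_id"]) or {}
        complete_logs.append({**log,
                              "user_data": rm.get(log["user_id"]),
                              "object_data": obj,
                              "author_data": rm.get(obj.get("author_id"))})
    return complete_logs                         # first target logs, bound for HDFS
```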
As an alternative embodiment, the first data and the second data are requested and cached by the resource manager from an external database, or are retrieved and cached by the resource manager from the database simulator.
The method of this embodiment may be implemented in a normal mode and a data recovery mode. In a normal mode, after reporting the first identification information to a resource manager in a data stream, the resource manager may request to acquire first data in batches from an external database, and cache the first data acquired by the request; optionally, in this embodiment, after reporting the second identification information to the resource manager, the resource manager may request to acquire the second data in batch from the external database, and cache the second data that is requested to be acquired, so as to achieve the purpose of accurately acquiring the data from the external database to form a complete log.
Optionally, in the data recovery mode, after the first identification information is reported to the resource manager in the data stream, the resource manager may request the first data from the database simulator and cache the acquired first data, where the database simulator stores the data of previous complete logs; optionally, after the second identification information is reported to the resource manager, the resource manager may request the second data in batch from the database simulator and cache the acquired second data, so as to achieve the purpose of accurately obtaining the data needed to form a complete log.
As an alternative implementation, step S108, sending the first processing result obtained in the data stream to the first data processing system, includes: and determining the first target log as a first processing result, and sending the first processing result to the distributed file system.
In this embodiment, after the original log, the first data, and the second data corresponding to the log to be processed are spliced into the first target log, the first target log may be determined as a first processing result obtained in the data stream, and then sent to the distributed file system HDFS, and then the first processing result is stored in the HDFS.
The method of processing each log to be processed in each traffic flow by the resource manager according to this embodiment is described below.
As an optional implementation, processing each log to be processed in each traffic flow through the resource manager includes: in each service flow, determining the service corresponding to each log to be processed; determining third data corresponding to the service in each log to be processed; acquiring fourth data from the resource manager based on the third data; and converting the fourth data into the target message.
In this embodiment, each log to be processed has a corresponding service. Third data corresponding to the service may be determined in each log to be processed, where the third data may be the part related to the service that is filtered out of all logs according to the needs of that service and counted, thereby achieving the purpose of preprocessing each log to be processed. The third data is then post-processed: fourth data may be obtained from the resource manager based on the third data, where the resource manager has already cached the complete database data and can meet the needs of all service flows. After the fourth data is obtained, it may be processed and converted into a target message to be sent, that is, the processed fourth data is sent in the form of a message.
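A hedged sketch of one service flow is given below: it filters the logs it is concerned with (the third data), reads the cached records from the resource manager (the fourth data), and converts them into target messages. The `is_relevant` predicate, the message layout, and the producer interface (any client exposing `send(topic, value)`, for example kafka-python's `KafkaProducer`) are assumptions.

```python
import json

def process_service_flow(logs, rm, producer, topic, is_relevant):
    """Preprocess, enrich from the resource manager, and publish target messages."""
    relevant = [log for log in logs if is_relevant(log)]   # preprocessing / third data
    for log in relevant:
        record = rm.get(log["object_id"])                  # fourth data from the cache
        message = {"user_id": log["user_id"],              # per-service message shape
                   "action": log["user_action"],
                   "object": record}
        producer.send(topic, json.dumps(message).encode("utf-8"))
```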
As an alternative embodiment, the fourth data is requested and cached by the resource manager from an external database, or is retrieved and cached by the resource manager from the database simulator.
In this embodiment, in the normal mode, the resource manager may request to acquire fourth data from the external database, and cache the fourth data requested to be acquired; optionally, in the data recovery mode, the resource manager may request the database simulator to acquire fourth data, and cache the acquired fourth data.
As an optional implementation, obtaining the fourth data from the resource manager based on the third data includes: sending a request carrying the identification information in the third data to a resource manager; and acquiring fourth data sent by the resource manager in response to the request.
In this embodiment, when the fourth data is obtained from the resource manager based on the third data, a request carrying the identification information in the third data may be sent to the resource manager in the service flow, where the identification information in the third data may be the IDs collected in the preprocessing step; the request may be sent to the resource manager according to these IDs, that is, a data obtaining request is sent to request the fourth data from the resource manager. After receiving the request, the resource manager sends the fourth data to the service flow in response to the request.
As an optional implementation, the converting the fourth data into the target message includes: and converting the fourth data into a target message based on the requirement information of the service.
In this embodiment, when the fourth data is converted into the target message, the requirement information of the service may be determined, then the fourth data is further processed based on the requirement information of the service to obtain a processing result, and the processing result is converted into the target message.
As an optional implementation manner, step S108, sending the second processing result obtained in each service flow to the second data processing system, includes: and determining the target message as a second processing result, and sending the second processing result to the distributed publish-subscribe message system.
In this embodiment, when the second processing result obtained in each service flow is sent to the second data processing system, the target message may be determined as the second processing result, and then the second processing result is sent to the distributed publish-subscribe message system Kafka.
In an alternative embodiment, the target message of the distributed publish-subscribe message system is processed by a corresponding downstream system.
In this embodiment, Kafka is connected to corresponding downstream systems that consume the target messages by subscribing to Kafka topics. The downstream systems may include, but are not limited to, a reporting system, a recommendation system, and a retrieval system. For the reporting system, almost all logs enter the sub-reports of the reporting system in a classified manner; for the recommendation system, the user's browsing and click logs are processed and updated to a data management platform (DMP) and are finally used for iteration of the recommendation system; for the retrieval system, the user's search logs are updated to the search-engine history, and the click-through rate of short videos is updated to influence subsequent search results.
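For completeness, a downstream system could consume these messages with a plain Kafka consumer; the sketch below uses the kafka-python client, and the topic name, group ID, and `handle_event` handler are assumptions rather than components defined by the embodiment.

```python
import json
from kafka import KafkaConsumer

# Topic, group and handler are illustrative; each downstream system
# (reporting, recommendation, retrieval) would subscribe to its own topics.
consumer = KafkaConsumer("log-service-flow-1",
                         bootstrap_servers="localhost:9092",
                         group_id="reporting-system")

for msg in consumer:
    event = json.loads(msg.value)
    handle_event(event)   # hypothetical handler: update sub-report / DMP / search history
```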
It should be noted that the above method of this embodiment describes log processing in the normal mode. However, when an unexpected event occurs, for example, a bug appears in the code of a certain service flow or a certain downstream system crashes, a data recovery mode needs to be entered to recover the data. The data recovery mode of this embodiment is a variation of the normal mode: the log processing steps in the data recovery mode may be the same as the corresponding steps of the log processing method in the normal mode, and the differences between them include the manner of acquiring external data, namely that the first data, the second data, and the fourth data are obtained and cached by the resource manager from the database simulator rather than from the external database. The parts of log processing in the data recovery mode that differ from the normal mode are further described below.
As an optional implementation manner, the logs to be processed include an original log having basic information and a second target log having database data, where the second target log is used for data recovery.
In this embodiment, in the data recovery mode, the log to be processed is not an original log having only basic information but includes a second target log having database data. The second target log may be a complete log previously spliced in the data stream, the database data being the data carried in that complete log, and the second target log is used for data recovery.
As an optional implementation manner, after obtaining the log to be processed, the method further includes: and forwarding the logs to be processed to a database simulator through a converter, wherein the database data is extracted from the logs to be processed and stored by the database simulator.
Compared with the log processing method in the normal mode, the log processing method in the data recovery mode of this embodiment additionally provides a converter and a database simulator. The function of the converter is to forward the log to be processed to the database simulator at the entry, and the database simulator extracts the database data from the log to be processed and stores it, that is, it extracts and stores the data carried in the complete log.
In the data recovery mode, the resource manager no longer requests data (the first data, the second data, and the fourth data) from the external database as in the normal mode, but instead requests it from the database simulator. The reason is that the data in the external database changes dynamically; in the data recovery mode, in order to guarantee that the data of that time is restored exactly, the data can only be obtained from the database snapshot stored in the complete logs.
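The converter and database simulator described above could be sketched as follows, reusing the field names of the earlier splice sketch; the `ingest`/`fetch_batch` interface and the assumption that each record carries an `id` field are illustrative, not specified by the embodiment.

```python
class DatabaseSimulator:
    """Keeps the database snapshot carried inside complete logs."""

    def __init__(self):
        self._snapshot = {}

    def ingest(self, complete_log: dict) -> None:
        # Extract and store the database data embedded in the complete log.
        for key in ("user_data", "object_data", "author_data"):
            record = complete_log.get(key)
            if record and "id" in record:
                self._snapshot[record["id"]] = record

    def fetch_batch(self, ids):
        # Same interface the resource manager expects from the external database,
        # so the streams cannot tell the difference during recovery.
        return {i: self._snapshot[i] for i in ids if i in self._snapshot}

def forward(complete_log: dict, simulator: DatabaseSimulator, pipeline_entry: list) -> None:
    """The converter at the entry: hand the log to the simulator, then let it
    continue into the normal checker/stream pipeline."""
    simulator.ingest(complete_log)
    pipeline_entry.append(complete_log)
```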
As an alternative embodiment, changes to the external database are masked by the resource manager when data recovery is performed.
In this embodiment, when data recovery is performed, changes to the external database are completely shielded by the resource manager and are invisible to the data stream and the service flows.
In this embodiment, the method may be performed by a streaming log processing system, specifically a short-video App log processing system with multiple task streams. The method splits the log stream into a data stream and a plurality of task streams that are processed in parallel, so as to adapt to the coexistence of multiple different services downstream, and data recovery can be performed using the complete logs spliced in the data stream. Data can thus be obtained accurately from an external database to form complete logs, multiple service flows can be processed with a clear logical structure, new service flows can be added naturally, and data can be recovered when an accident occurs. This avoids being limited to operations such as filtering and conversion on the log, solves the technical problem that logs are difficult to process effectively, and achieves the technical effect of effective log processing.
Example 2
The technical solutions of the embodiments of the present invention will be illustrated below with reference to preferred embodiments.
The short video App log records various user behaviors, such as logging in and out of apps, browsing and watching short videos, searching for short videos, and the like. The logs are defined, collected and processed properly, and the method has important significance for continuously optimizing App.
First, the definition of the log should be comprehensive and concise. Comprehensive means that user behaviors should be recorded as completely as possible; concise means that, to reduce the network overhead of transmission, the log records only core information and discards redundant information. The log should contain the following elements: App version number, timestamp, user action, user ID, object type, object ID, and the page where the action occurred. Redundant information, such as user and object details, should not be included in the original log.
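The elements listed above could be captured in a compact record such as the following; the field names and types are assumptions, since the description lists the elements but not their encoding.

```python
from dataclasses import dataclass

@dataclass
class RawLog:
    """Core fields of an original log; names and types are illustrative."""
    app_version: str
    timestamp: int        # e.g. milliseconds since epoch
    user_action: str      # e.g. "login", "watch", "search"
    user_id: str
    object_type: str      # e.g. "short_video"
    object_id: str
    page: str             # page where the action occurred
```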
Secondly, the collection of logs should be efficient in real time and have a disaster recovery system. Streaming systems such as Spark, Storm, etc. are commonly used to collect data into HDFS.
Finally, the processing of the log should serve multiple downstream systems, which may include: a reporting system, where all logs enter the sub-reports of the reporting system in a classified manner; a recommendation system, where the user's browsing and click logs are processed and updated to the DMP and finally used for iteration of the recommendation system; and a retrieval system, where the user's search logs are updated to the search-engine history, and the click-through rate of short videos is updated to influence subsequent search results.
It can be seen from the above that log processing is very complex. The standard log processing methods provided by streaming systems such as Spark and Storm can only perform operations such as filtering and conversion on logs; they cannot adapt to the coexistence of multiple different services downstream, and they cannot conveniently acquire data from an external database to supplement the log information.
This embodiment is a streaming log processing system designed specifically for these log processing requirements. It can accurately acquire data from an external database to form complete logs, process a plurality of service flows with a clear logical structure, and naturally add new service flows. Data can be recovered when an accident occurs (for example, a bug in the code of a service flow, or a crash of a downstream system).
This embodiment abandons the standard method provided by Spark, organizing the system into three major modules: a resource manager, a data flow and several traffic flows. The resource manager is used for requesting data from an external database, caching the data and responding to a data acquisition request; the data stream is used for extracting various IDs in the log, reporting the IDs to a resource manager and generating a complete log; and the service flow is used for processing the logs related to the service. The embodiment splits the data stream and the service stream into parallel modules which do not interfere with each other.
This embodiment can operate in two modes: a normal mode and a data recovery mode. Which are described separately below.
Fig. 2 is a flowchart of a log processing method in a normal mode according to an embodiment of the present invention. As shown in fig. 2, a solid line represents a flow direction of the log, a dashed line represents a flow direction of the data, and characters on the line represent steps of processing, and the method may include the steps of:
in step S1, the log input from the log entry is sent to the verifier for verification.
In an embodiment, the error logs that do not conform to the definition are entered directly into the HDFS, and the correct logs that conform to the definition are copied into the same multiple copies and entered into each stream intact.
In the data flow, the log is processed by the following steps:
step S2.1, the log that was successfully verified is input to the data stream.
This embodiment collects the IDs of all objects in the data stream.
Alternatively, the embodiment extracts all user IDs and object IDs from all logs.
Step S3, reporting all user IDs and object IDs to the resource manager.
The resource manager caches the received IDs, and executes step S4 after receiving a signal that all IDs have been reported.
In step S4, the resource manager obtains the return data of the external database in batch.
And step S5, obtaining the return data corresponding to all the IDs reported in the first round from the resource manager.
The return data corresponding to all IDs of this embodiment may be referred to as raw object data.
This embodiment acquires all the original object data in the data stream.
This embodiment finds the associated object ID in the data stream, e.g. a first round finds the short video ID and a second round finds the author ID from the short video data.
Step S6, reporting all the associated object IDs to the resource manager.
In step S7, the resource manager obtains the return data of the external database.
The return data of the external database of this embodiment is also the data of the associated object.
At step S8, the resource manager returns the data to the data stream.
The data returned by the resource manager to the data stream in this embodiment may be associated object data, and the associated object data is acquired in the data stream.
The embodiment can splice the original log, the original object data and the data of the associated object into a complete log.
And step S10.3, storing the complete log to the HDFS.
In each service flow, the log is subjected to the following processing steps:
and S2.2 and S2.3, screening out the concerned parts from all logs according to the respective needs of each service, and counting.
Step S2.2, step S2.3 of this embodiment belong to the pre-processing step of processing in the traffic flow.
Step S9.1, step S9.2, obtain the return data from the resource manager.
Step S9.1 and step S9.2 in this embodiment are performed after the data stream completes step S8; at this point the resource manager has cached the complete database data from the external database, so the needs of all service flows can be met and the return data can be obtained from the resource manager.
Step S10.1, step S10.2, sends a message to Kafka.
Each service flow in the embodiment requests data from the resource manager according to the ID collected in the preprocessing step, further processes the return data of the resource manager according to the service requirement to obtain a processing result, and further sends the processing result to Kafka in the form of a message. Downstream systems consume these messages by subscribing to Kafka topics.
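Tying steps S1 through S10 together, one micro-batch through the normal-mode pipeline might be orchestrated as in the sketch below, which reuses the `ResourceManager`, `check_log`, `process_data_stream`, and `process_service_flow` sketches from Example 1; `external_db_client`, `save_to_hdfs`, and the per-service `(topic, predicate)` pairs are placeholders, not components defined by the embodiment.

```python
def run_normal_mode(batch, external_db_client, producer, service_flows, error_sink):
    """One micro-batch through the normal-mode flow of Fig. 2 (steps S1-S10)."""
    rm = ResourceManager(external_db_client.fetch_batch)   # S4/S7 data source

    verified = []
    for raw_log in batch:                                  # S1: checker
        if check_log(raw_log):
            verified.append(raw_log)
        else:
            error_sink.append(raw_log)                     # bad logs go straight to HDFS

    complete = process_data_stream(verified, rm)           # S2.1, S3, S5, S6, S8
    save_to_hdfs(complete)                                 # S10.3 (placeholder call)

    for topic, is_relevant in service_flows:               # S2.2/S2.3, S9.x, S10.1/S10.2
        process_service_flow(verified, rm, producer, topic, is_relevant)
```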
Fig. 3 is a flowchart of a log processing method of a data recovery mode according to an embodiment of the present invention. As shown in fig. 3, the method may include the steps of:
in step S1', the log input from the log entry is sent to the verifier for verification.
In this embodiment, the input log is not an original log containing only the basic information, but a complete log containing the database data. This embodiment adds two modules, a converter and a database simulator. The function of the converter is to forward the log to the database simulator at the entrance of the system, and the database simulator extracts and stores the data carried in the complete log.
The embodiment also checks the logs in the checker, directly enters the error logs which do not accord with the definition into the HDFS, copies the correct logs which accord with the definition into the same multiple copies, and completely enters each stream.
In the data flow, the log is processed by the following steps:
step S2.1', the log that was successfully verified is input into the data stream.
This embodiment collects the IDs of all objects in the data stream.
Alternatively, the embodiment extracts all user IDs and object IDs from all logs.
Step S3', report all user IDs and object IDs to the resource manager.
The resource manager caches the received IDs, and executes step S4' after receiving the signal that all IDs have been reported.
In step S4', the resource manager obtains the return data of the database simulator in batch.
The resource manager of this embodiment no longer requests data from the database, but instead requests data from the database simulator. This is because the data in the database changes dynamically; to ensure that the data of that time is restored exactly in the data recovery phase, the data can only be obtained from the database snapshot stored in the complete logs.
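In terms of the earlier sketches, the only change in this mode is the data source handed to the resource manager; everything downstream of it is unchanged. A minimal, illustrative wiring:

```python
# Data-recovery wiring (illustrative): swap the external database client
# for the database simulator; the streams are unaware of the switch.
simulator = DatabaseSimulator()
rm = ResourceManager(simulator.fetch_batch)
```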
And step S5', obtaining the return data corresponding to all the IDs reported in the first round from the database simulator.
The return data corresponding to all IDs of this embodiment may be referred to as raw object data.
This embodiment acquires all the original object data in the data stream.
This embodiment finds the associated object ID in the data stream, e.g. a first round finds the short video ID and a second round finds the author ID from the short video data.
Step S6', report all the associated object IDs to the resource manager.
In step S7', the resource manager obtains the return data of the database simulator.
The return data of the database simulator of this embodiment is also the data of the associated object.
Step S8', the resource manager returns the data to the data stream.
The data returned by the resource manager to the data stream in this embodiment may be associated object data, and the associated object data is acquired in the data stream.
The embodiment can splice the original log, the original object data and the data of the associated object into a complete log.
Step S10.3', save the complete log to HDFS.
In each service flow, the log is subjected to the following processing steps:
and S2.2 'and S2.3', each service screens out the concerned part from all logs according to the respective needs and counts.
Step S2.2 'and step S2.3' of this embodiment belong to the pre-processing step of processing in the traffic flow.
Step S9.1 ', step S9.2', the return data is obtained from the resource manager.
Step S9.1' and step S9.2' in this embodiment are performed after the data stream completes step S8'; at this point the resource manager has cached the complete database data from the database simulator, so the needs of all service flows can be met and the return data can be obtained from the resource manager.
Step S10.1 ', step S10.2', send message to Kafka.
Each service flow in the embodiment requests data from the resource manager according to the ID collected in the preprocessing step, further processes the return data of the resource manager according to the service requirement to obtain a processing result, and further sends the processing result to Kafka in the form of a message. Downstream systems consume these messages by subscribing to Kafka topics.
In the data recovery mode, changes to the external database are completely shielded by the resource manager and are invisible to the data stream and the service flows.
In this embodiment, the method may be performed by a streaming log processing system. The log stream is split into a data stream and a plurality of task streams that are processed in parallel, so as to adapt to the coexistence of multiple different services downstream, and data recovery can be performed using the complete logs spliced in the data stream. Data can thus be obtained accurately from an external database to form complete logs, multiple service flows can be processed with a clear logical structure, new service flows can be added naturally, and data can be recovered when an accident occurs. This avoids being limited to operations such as filtering and conversion on the log, solves the technical problem that logs are difficult to process effectively, and achieves the technical effect of effective log processing.
Example 3
The embodiment of the invention also provides a log processing device. It should be noted that the log processing apparatus of this embodiment may be used to execute the log processing method of the embodiment of the present invention.
Fig. 4 is a schematic diagram of a log processing apparatus according to an embodiment of the present invention. As shown in fig. 4, the log processing apparatus 40 may include: an acquisition unit 41, a copying unit 42, a processing unit 43 and a sending unit 44.
An obtaining unit 41, configured to obtain a log to be processed.
And the copying unit 42 is used for copying the logs to be processed to obtain multiple logs to be processed.
The processing unit 43 is configured to perform parallel processing on the multiple to-be-processed logs in the data stream and the at least one service stream, respectively.
A sending unit 44, configured to send the first processing result obtained in the data flow to the first data processing system, and send the second processing result obtained in each service flow to the second data processing system.
The log processing apparatus of this embodiment decomposes the processing of the log to be processed into parallel processing in a data stream and a plurality of task streams, so as to adapt to the coexistence of multiple different downstream services and avoid being limited to operations such as filtering and conversion, thereby solving the technical problem that logs are difficult to process effectively and achieving the technical effect of effective log processing.
Example 4
According to an embodiment of the present invention, there is also provided a computer-readable storage medium including a stored program, wherein the program executes the log processing method described in embodiment 1.
Example 5
According to an embodiment of the present invention, there is also provided a processor, configured to execute a program, where the program executes the log processing method described in embodiment 1.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units may be a logical division, and in actual implementation, there may be another division, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a Read-only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program codes.
The foregoing is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, various modifications and decorations can be made without departing from the principle of the present invention, and these modifications and decorations should also be regarded as the protection scope of the present invention.

Claims (17)

1. A log processing method, comprising:
acquiring a log to be processed;
copying the log to be processed to obtain a plurality of copies of the log to be processed;
respectively carrying out parallel processing on a plurality of logs to be processed in a data stream and at least one service stream;
and sending the first processing result obtained in the data flow to a first data processing system, and sending the second processing result obtained in each service flow to a second data processing system.
2. The method of claim 1, wherein the parallel processing of the multiple copies of the log to be processed in the data stream and the at least one service stream respectively comprises:
and respectively carrying out parallel processing on a plurality of logs to be processed in the data stream and the at least one service stream through a resource manager.
3. The method of claim 2, wherein processing the log to be processed in the data stream by the resource manager comprises:
acquiring first identification information in the log to be processed in the data stream;
reporting the first identification information to the resource manager;
acquiring first data corresponding to the first identification information from the resource manager, and determining second identification information associated with the first identification information;
reporting the second identification information to the resource manager;
acquiring second data corresponding to the second identification information from the resource manager;
and splicing the original log, the first data and the second data corresponding to the log to be processed into a first target log.
4. The method of claim 3, wherein the first data and the second data are requested and cached by the resource manager from an external database or are retrieved and cached by the resource manager from a database simulator.
5. The method of claim 3, wherein sending the first processing result obtained in the data stream to a first data processing system comprises:
and determining the first target log as the first processing result, and sending the first processing result to a distributed file system.
6. The method of claim 2, wherein processing each of the pending logs in each of the traffic flows by the resource manager comprises:
in each service flow, determining the service corresponding to each log to be processed;
determining third data corresponding to the service in each log to be processed;
acquiring fourth data from the resource manager based on the third data;
and converting the fourth data into a target message.
7. The method of claim 6, wherein the fourth data is requested and cached by the resource manager from an external database or is retrieved and cached by the resource manager from a database simulator.
8. The method of claim 6, wherein obtaining fourth data from the resource manager based on the third data comprises:
sending a request carrying the identification information in the third data to the resource manager;
and acquiring the fourth data sent by the resource manager in response to the request.
9. The method of claim 6, wherein translating the fourth data into a destination message comprises:
and converting the fourth data into the target message based on the requirement information of the service.
10. The method of claim 6, wherein sending the second processing result obtained in each of the traffic flows to a second data processing system comprises:
and determining the target message as the second processing result, and sending the second processing result to a distributed publish-subscribe message system.
11. The method of claim 10, wherein the target message of the distributed publish-subscribe message system is processed by a corresponding downstream system.
12. The method according to any one of claims 1 to 11, wherein the to-be-processed log comprises an original log having basic information and a second target log having database data, wherein the second target log is used for data recovery.
13. The method of claim 12, wherein after obtaining the pending log, the method further comprises:
and forwarding the logs to be processed to a database simulator through a converter, wherein the database data is extracted from the logs to be processed and stored by the database simulator.
14. The method of claim 12, wherein changes to the external database are masked by a resource manager when performing the data recovery.
15. A log processing apparatus, comprising:
the acquisition unit is used for acquiring the log to be processed;
the copying unit is used for copying the log to be processed to obtain a plurality of copies of the log to be processed;
the processing unit is used for respectively carrying out parallel processing on a plurality of logs to be processed in a data stream and at least one service stream;
and the sending unit is used for sending the first processing result obtained in the data flow to a first data processing system and sending the second processing result obtained in each service flow to a second data processing system.
16. A computer-readable storage medium, comprising a stored program, wherein the program, when executed by a processor, controls an apparatus in which the computer-readable storage medium is located to perform the method of any of claims 1-14.
17. A processor, characterized in that the processor is configured to run a program, wherein the program when running performs the method of any of claims 1 to 14.
CN202010525810.4A 2020-06-10 2020-06-10 Log processing method, device, storage medium and processor Active CN111680009B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010525810.4A CN111680009B (en) 2020-06-10 2020-06-10 Log processing method, device, storage medium and processor

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010525810.4A CN111680009B (en) 2020-06-10 2020-06-10 Log processing method, device, storage medium and processor

Publications (2)

Publication Number Publication Date
CN111680009A true CN111680009A (en) 2020-09-18
CN111680009B CN111680009B (en) 2023-10-03

Family

ID=72435195

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010525810.4A Active CN111680009B (en) 2020-06-10 2020-06-10 Log processing method, device, storage medium and processor

Country Status (1)

Country Link
CN (1) CN111680009B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104969213A (en) * 2013-01-31 2015-10-07 脸谱公司 Data stream splitting for low-latency data access
CN107918621A (en) * 2016-10-10 2018-04-17 阿里巴巴集团控股有限公司 Daily record data processing method, device and operation system
US20190080063A1 (en) * 2017-09-13 2019-03-14 Facebook, Inc. De-identification architecture


Also Published As

Publication number Publication date
CN111680009B (en) 2023-10-03

Similar Documents

Publication Publication Date Title
CN109034993B (en) Account checking method, account checking equipment, account checking system and computer readable storage medium
CN111752799B (en) Service link tracking method, device, equipment and storage medium
CN111046011B (en) Log collection method, system, device, electronic equipment and readable storage medium
CN107341258B (en) Log data acquisition method and system
US11151660B1 (en) Intelligent routing control
CN105488610A (en) Fault real-time analysis and diagnosis system and method for power application system
CN104067281A (en) Clustering event data by multiple time dimensions
CN111740868B (en) Alarm data processing method and device and storage medium
CN107085549B (en) Method and device for generating fault information
EP3272097B1 (en) Forensic analysis
CN105743730A (en) Method and system used for providing real-time monitoring for webpage service of mobile terminal
CN108322350B (en) Service monitoring method and device and electronic equipment
CN112350854B (en) Flow fault positioning method, device, equipment and storage medium
CN112181931A (en) Big data system link tracking method and electronic equipment
CN103701906A (en) Distributed real-time calculation system and data processing method thereof
CN112632129A (en) Code stream data management method, device and storage medium
CN112954031B (en) Equipment state notification method based on cloud mobile phone
CN114844771A (en) Monitoring method, device, storage medium and program product for micro-service system
CN114968959A (en) Log processing method, log processing device and storage medium
CN108509293A (en) A kind of user journal timestamp fault-tolerance approach and system
CN117097571A (en) Method, system, device and medium for detecting network transmission sensitive data
CN112969172A (en) Communication flow control method based on cloud mobile phone
CN111680009A (en) Log processing method and device, storage medium and processor
CN116264541A (en) Multi-dimension-based database disaster recovery method and device
CN114598622B (en) Data monitoring method and device, storage medium and computer equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant