CN115860008A - Data processing method, electronic device and medium for determining abnormal log information - Google Patents


Info

Publication number
CN115860008A
Authority
CN
China
Prior art keywords
log
log information
vector
clustering result
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310160654.XA
Other languages
Chinese (zh)
Other versions
CN115860008B (en)
Inventor
李峰
孙晓鹏
时伟强
夏国栋
宋衍龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Yuntian Safety Technology Co ltd
Original Assignee
Shandong Yuntian Safety Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Yuntian Safety Technology Co ltd
Priority to CN202310160654.XA
Publication of CN115860008A
Application granted
Publication of CN115860008B
Active legal status
Anticipated expiration legal status

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application relates to the field of data processing, and in particular to a data processing method, electronic device and medium for determining abnormal log information. The method comprises the following steps: acquiring an original log information set A to be processed corresponding to a target network; traversing A and, if TAGi is a target log type identifier, deleting the fixed phrase segment corresponding to TAGi from ATXTi to obtain a target log information set B to be processed; performing word segmentation on Bi to obtain a word segmentation list set C; performing one-hot encoding on C to obtain a first encoding list set D; vectorizing D to obtain a first word vector list set E; performing vector conversion on E to obtain log semantic vectors F1, F2, …, Fi, …, Fn corresponding to each original log information item to be processed; and determining an abnormal log information set Y from A according to F1, F2, …, Fi, …, Fn. The method and device can improve the efficiency of identifying abnormal log information.

Description

Data processing method, electronic device and medium for determining abnormal log information
Technical Field
The present application relates to the field of data processing, and in particular, to a data processing method, an electronic device, and a medium for determining abnormal log information.
Background
In a network system, a large amount of log information is generated every day. Abnormal log information can be used to locate anomalies and thereby eliminate potential network security hazards. However, most existing methods for determining abnormal logs check each log information item in turn against a large set of preset regular expressions. Because every log information item must be matched against all regular expressions, this approach is inefficient.
Disclosure of Invention
In view of the above, the present application provides a data processing method, an electronic device, and a medium for determining abnormal log information, which at least partially solve the problems in the prior art.
In an aspect of the present application, a data processing method for determining abnormal log information is provided, including:
s100, acquiring An original log information set to be processed corresponding to a target network, wherein the log information set to be processed is A = (A1, A2, …, ai, …, an), and the Ai = (TAGi, ATXTi); the method comprises the steps that Ai is ith original log information to be processed generated in a time sequence in a target network, TAGi is a log type identifier corresponding to Ai, and ATXTi is original log content information corresponding to Ai;
s200, traversing A, and if TAGi is the target log type identifier, deleting the fixed speech segment corresponding to TAGi in ATXTi to obtain a target log information set B = (B1, B2, …, bi, …, bn) and Bi = (TAGi, BTXTi); wherein Bi is target to-be-processed log information corresponding to Ai, and BTXTi is target log content information corresponding to Bi; the target log type identifier is a log type identifier which contains a fixed speech segment in the corresponding original log content information;
s300, performing word segmentation processing on Bi to obtain a word segmentation listSet C = (C1, C2, …, ci, …, cn), ci = (TAGi, C) i 1 ,C i 2 ,…,C i j ,…,C i g(i) ) J =1,2, …, g (i); wherein Ci is a word segmentation list corresponding to Bi, C i j Is the jth participle in BTXTi, and g (i) is the number of the participles in BTXTi;
s400, performing hot single encoding processing on the C to obtain a first encoding list set D = (D1, D2, …, di, …, dn), and Di = (D) i 1 ,D i 2 ,…,D i p ,…,D i g(i) ,D i g(i)+1 ) P =1,2, …, g (i) +1; di is a first coding list corresponding to Ci, and D i 1 For the first code obtained after the hot independent coding of TAGi, D i p Is a pair C i p-1 Carrying out thermal independent coding processing to obtain a first code; each first code is a one-dimensional vector consisting of s characteristic elements; s is the number of words in a dictionary formed by using the historical log information set;
s500, vectorizing D to obtain a first word vector list set E = (E1, E2, …, ei, …, en), ei = (E1, E2, …, ei, …, en), and then i 1 ,E i 2 ,…,E i p ,…,E i g(i) ,E i g(i)+1 ) (ii) a Wherein Ei is a first word vector list corresponding to Di, E i p Is D i p Corresponding first word vector, E i p Containing contextual semantic information, E i p =D i p X VecT; vecT is a target word vector matrix obtained after a historical log information set is processed according to a preset word2vec model; size of VecT is s h, s is row number of VecT, h is column number of VecT, h < s;
s600, performing vector conversion processing on the E to obtain log semantic vectors F1, F2, …, fi, … and Fn corresponding to each original log information to be processed; fi is a log semantic vector corresponding to Ai obtained after vector conversion processing is carried out on Ei;
s700, determining an abnormal log information set Y = (Y1, Y2, …, yq, …, ym), q =1,2, …, m from a according to F1, F2, …, fi, …, fn; yq is the q-th abnormal log information determined from A, and m is the number of the abnormal log information determined from A; m is less than n.
In another aspect of the present application, there is provided an electronic device comprising a processor and a memory;
the processor is configured to perform the steps of any of the above methods by calling a program or instructions stored in the memory.
In another aspect of the application, a non-transitory computer readable storage medium is provided, storing a program or instructions that causes a computer to perform the steps of any of the methods described above.
The beneficial effects of this application are as follows.
According to the data processing method provided by the application, after the original log information set A to be processed is obtained, because the original log content corresponding to some log type identifiers carries fixed phrase segments, and because such segments occur frequently, semantic extraction easily treats them as important semantic information even though their actual importance is low. Therefore, in the application, the fixed phrase segment in the original log content information corresponding to a target log type identifier is deleted, which achieves a denoising effect.
Meanwhile, the log information to be processed in the target log information set B is arranged in its generation order, so the content of B can be regarded as a structure similar to an "article" (i.e., ordered rather than unordered). As a result, after the first codes are converted into first word vectors by the target word vector matrix, each first word vector contains contextual semantic information, so the generation order of the original log information is taken into account when abnormal log information is determined from the first word vectors and the subsequently generated log semantic vectors. Abnormal log information that can only be determined by jointly analyzing several logs can therefore be identified effectively.
Further, in the present application, through the processing in steps S200 to S500, and through clustering or semantic recognition, the abnormal log information set Y can be quickly determined from A. Compared with identification using regular expressions, the processing load is greatly reduced, and because the identification can incorporate contextual semantic information, the identification accuracy is higher.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed for the embodiments are briefly described below. The drawings described below show only some embodiments of the present application, and those skilled in the art can obtain other drawings based on them without creative effort.
Fig. 1 is a flowchart of a data processing method for determining abnormal log information according to the present application.
Detailed Description
The embodiments of the present application will be described in detail below with reference to the accompanying drawings.
It should be noted that, in the case of no conflict, the features in the following embodiments and examples may be combined with each other; moreover, all other embodiments that can be derived by one of ordinary skill in the art from the embodiments disclosed herein without making any creative effort fall within the scope of the present disclosure.
It is noted that various aspects of the embodiments are described below within the scope of the appended claims. It should be apparent that the aspects described herein may be embodied in a wide variety of forms and that any specific structure and/or function described herein is merely illustrative. Based on the disclosure, one skilled in the art should appreciate that one aspect described herein may be implemented independently of any other aspects and that two or more of these aspects may be combined in various ways. For example, an apparatus may be implemented and/or a method practiced using any number of the aspects set forth herein. Additionally, such an apparatus may be implemented and/or such a method may be practiced using other structure and/or functionality in addition to one or more of the aspects set forth herein.
Referring to fig. 1, in an exemplary embodiment of the present application, a data processing method for determining abnormal log information is provided, which includes the following steps:
s100, acquiring An original log information set to be processed corresponding to a target network, wherein the log information set to be processed is A = (A1, A2, …, ai, …, an), and the Ai = (TAGi, ATXTi); the Ai is the ith original log information to be processed generated in the target network in time sequence, TAGi is the log type identifier corresponding to the Ai, and ATXTi is the original log content information corresponding to the Ai. Specifically, the original log information to be processed may be linux system log information, such as "localhost sshd [1630]: accepted password for root from 192.168.0.104 port 4229 ssh2", wherein the log type identifier may be" localhost sshd [1630] "or" [1630] ", the original log content information is" Accept password for root from 192.168.0.104 port 4229 ssh2", and" Accepted password for root from "is" [1630] "corresponding fixed phrase segment.
S200, traversing A and, if TAGi is a target log type identifier, deleting the fixed phrase segment corresponding to TAGi from ATXTi to obtain a target log information set B = (B1, B2, …, Bi, …, Bn) to be processed, where Bi = (TAGi, BTXTi); Bi is the target log information to be processed corresponding to Ai, and BTXTi is the target log content information corresponding to Bi; a target log type identifier is a log type identifier whose corresponding original log content information contains a fixed phrase segment. The target log type identifiers can be determined through a preset identifier list that stores every log type identifier having a corresponding fixed phrase segment: whether TAGi belongs to the identifier list is checked, and if so, TAGi is determined to be a target log type identifier. The identifier list can also record the fixed phrase segment corresponding to each log type identifier, which makes it convenient to delete that segment from ATXTi. It should be appreciated that if TAGi is not a target log type identifier, ATXTi is used directly as BTXTi. A sketch of this step is given below.
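The sketch assumes the identifier list is a plain mapping from each target log type identifier to its fixed phrase segment; the mapping contents and the names FIXED_SEGMENTS and strip_fixed_segment are illustrative only.

```python
# Illustrative identifier list: target log type identifier -> fixed phrase segment.
FIXED_SEGMENTS = {
    "[1630]": "Accepted password for root from ",
}

def strip_fixed_segment(tag: str, atxt: str) -> str:
    """Return BTXTi: ATXTi with its fixed phrase segment removed when TAGi is a target identifier."""
    segment = FIXED_SEGMENTS.get(tag)
    if segment is None:
        # TAGi is not a target log type identifier, so ATXTi is used directly as BTXTi.
        return atxt
    return atxt.replace(segment, "", 1)

# A is the original set [(TAGi, ATXTi), ...] from S100; B keeps the same order as A.
B = [(tag, strip_fixed_segment(tag, atxt)) for tag, atxt in A]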
S300, performing word segmentation processing on Bi to obtain a word segmentation list set C = (C1, C2, …, Ci, …, Cn), where Ci = (TAGi, C_i^1, C_i^2, …, C_i^j, …, C_i^g(i)), j = 1, 2, …, g(i); Ci is the word segmentation list corresponding to Bi, C_i^j is the jth word segment in BTXTi, and g(i) is the number of word segments in BTXTi. Specifically, each log type identifier is treated as a single word segment, and the target log content information is segmented using a preset dictionary. The preset dictionary may be constructed from a historical log information set that stores a plurality of historical log information items; the historical log information can be understood as all, or at least part of, the log information generated before the current time. A possible segmentation routine is sketched below.
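In the sketch, the whitespace split and the "<UNK>" placeholder are assumptions suited to English-style linux logs; a dictionary-based tokenizer could be substituted without changing the rest of the pipeline. The name preset_dictionary stands for the dictionary built from the historical log information set.

```python
def segment(tag: str, btxt: str, dictionary: set[str]) -> list[str]:
    """Return Ci = [TAGi, C_i^1, ..., C_i^g(i)] for one target log information item."""
    words = []
    for token in btxt.split():
        # Keep tokens found in the preset dictionary; map unknown tokens to a
        # placeholder instead of dropping them (an illustrative design choice).
        words.append(token if token in dictionary else "<UNK>")
    # The log type identifier is treated as a single word segment at position 0.
    return [tag] + words

# B comes from the S200 sketch; preset_dictionary is assumed to be available.
C = [segment(tag, btxt, preset_dictionary) for tag, btxt in B]
```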
S400, performing one-hot encoding processing on C to obtain a first encoding list set D = (D1, D2, …, Di, …, Dn), where Di = (D_i^1, D_i^2, …, D_i^p, …, D_i^g(i), D_i^(g(i)+1)), p = 1, 2, …, g(i)+1; Di is the first encoding list corresponding to Ci, D_i^1 is the first code obtained by one-hot encoding TAGi, and D_i^p is the first code obtained by one-hot encoding C_i^(p-1); each first code is a one-dimensional vector consisting of s feature elements; s is the number of words in the dictionary formed from the historical log information set.
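A minimal sketch of the one-hot step, assuming word_index maps every word in the dictionary of size s (built from the historical log information set) to a fixed position; both names are assumptions.

```python
import numpy as np

def one_hot(word: str, word_index: dict[str, int], s: int) -> np.ndarray:
    """Return a first code: a one-dimensional vector of s feature elements."""
    code = np.zeros(s, dtype=np.float32)
    idx = word_index.get(word)
    if idx is not None:
        code[idx] = 1.0
    return code

# C comes from the S300 sketch. Di contains g(i)+1 first codes:
# one for TAGi followed by one per word segment.
D = [[one_hot(word, word_index, s) for word in Ci] for Ci in C]
```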
S500, vectorizing D to obtain a first word vector list set E = (E1, E2, …, Ei, …, En), where Ei = (E_i^1, E_i^2, …, E_i^p, …, E_i^g(i), E_i^(g(i)+1)); Ei is the first word vector list corresponding to Di, E_i^p is the first word vector corresponding to D_i^p and contains contextual semantic information, and E_i^p = D_i^p × VecT; VecT is the target word vector matrix obtained by processing the historical log information set with a preset word2vec model; the size of VecT is s × h, s is the number of rows of VecT, h is the number of columns of VecT, and h < s. h may be 300.
Because the first codes obtained by one-hot encoding have a very large number of feature elements, generally tens of thousands or more, and carry no contextual semantic information, using them directly for semantic recognition or anomaly recognition would require a large amount of processing and yield low accuracy. In this embodiment, each first code is therefore vectorized with the target word vector matrix and converted into a first word vector, which reduces the number of feature elements and gives each first word vector contextual semantic information.
Specifically, when training with the historical log information set, the historical log information in the set is arranged in order of generation time, from earliest to latest, and is cut into a plurality of subsets in units of "days" or "weeks". Each subset can be regarded as a "paragraph" of an "article", and the historical log information items within the same subset can be regarded as "sentences" of that "paragraph". The target word vector matrix obtained after processing the historical log information set with the word2vec model can thus learn the correlation in the order in which logs are generated, so a first word vector obtained through the target word vector matrix carries contextual semantic information. A sketch of this training and conversion is given below.
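The sketch shows one way VecT could be obtained and applied; the gensim library, the day-level grouping variable daily_subsets, and the hyper-parameters are assumptions, since the application only specifies that a preset word2vec model processes the historical log information set and that h may be 300.

```python
import numpy as np
from gensim.models import Word2Vec

# daily_subsets: historical logs grouped by day, each log already segmented, so a
# subset plays the role of a "paragraph" and each log of a "sentence".
sentences = [tokens for subset in daily_subsets for tokens in subset]
w2v = Word2Vec(sentences, vector_size=300, window=5, min_count=1)  # h = 300

# Target word vector matrix VecT of size s x h: one row per dictionary word.
# word_index and s are the same assumed names used in the S400 sketch.
VecT = np.zeros((s, 300), dtype=np.float32)
for word, idx in word_index.items():
    if word in w2v.wv:
        VecT[idx] = w2v.wv[word]

# Step S500: each first code D_i^p is one-hot, so D_i^p x VecT simply selects the
# corresponding row of VecT, giving the first word vector E_i^p.
E = [[code @ VecT for code in Di] for Di in D]
```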
S600, performing vector conversion processing on E to obtain log semantic vectors F1, F2, …, Fi, …, Fn corresponding to each original log information item to be processed; Fi is the log semantic vector corresponding to Ai, obtained by performing vector conversion processing on Ei.
Specifically, the step S600 includes:
s610, obtaining a vector weight corresponding to each first word vector, and obtaining a vector weight list set Q = (Q1, Q2, …, qi, …, qn), qi = (Q1, Q2, …, qi, …, qn) i 1 ,Q i 2 ,…,Q i p ,…,Q i g(i) ,Q i g(i)+1 ) (ii) a Wherein Qi is Ei corresponding toVector weight list of, Q i p Is E i p Corresponding vector weight, Q i p Is according to E i p Word frequency of the corresponding word in the historical log information set, and E i p Word frequency size and Q of corresponding participles in historical log information set i p Is inversely proportional to the size of (a);
s620, acquiring Fi = (∑) p=1 g(i)+1 Q i p * E i p ) V (g (i) + 1) to give F1, F2, …, fi, …, fn.
Research shows that the higher the word frequency of a word in system log information, the lower its importance. Therefore, when all first word vectors corresponding to each target log information item to be processed are converted into the corresponding log semantic vector, the vector weights of first word vectors with high word frequencies are reduced, so that the resulting log semantic vector carries more effective information, which improves the accuracy of the subsequent determination of abnormal log information. A sketch of this weighted conversion is given below.
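The sketch assumes the weight is computed as 1 / (1 + word frequency); the application only requires the weight to be inversely proportional to the word frequency in the historical log information set, so any monotonically decreasing mapping would fit.

```python
from collections import Counter
import numpy as np

# Word frequencies over the historical log information set (daily_subsets as above).
word_freq = Counter(w for subset in daily_subsets for tokens in subset for w in tokens)

def vector_weight(word: str) -> float:
    """Q_i^p: inversely proportional to the historical word frequency (illustrative formula)."""
    return 1.0 / (1.0 + word_freq.get(word, 0))

# C and E come from the earlier sketches; Ci and Ei both have g(i)+1 entries.
F = []
for Ci, Ei in zip(C, E):
    weights = np.array([vector_weight(word) for word in Ci], dtype=np.float32)
    # Fi = (sum over p of Q_i^p * E_i^p) / (g(i) + 1)
    Fi = (weights[:, None] * np.stack(Ei)).sum(axis=0) / len(Ei)
    F.append(Fi)
```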
S700, determining an abnormal log information set Y = (Y1, Y2, …, Yq, …, Ym), q = 1, 2, …, m, from A according to F1, F2, …, Fi, …, Fn; Yq is the qth abnormal log information item determined from A, m is the number of abnormal log information items determined from A, and m < n.
Specifically, determining the abnormal log information set Y from A according to F1, F2, …, Fi, …, Fn may be done by performing semantic recognition on F1, F2, …, Fi, …, Fn, or by clustering F1, F2, …, Fi, …, Fn.
In the data processing method provided by this embodiment, after the original log information set A to be processed is obtained, because the original log content corresponding to some log type identifiers carries fixed phrase segments, and because such segments occur frequently, semantic extraction easily treats them as important semantic information even though their actual importance is low. Therefore, in this embodiment, the fixed phrase segment in the original log content information corresponding to a target log type identifier is deleted, which achieves a denoising effect.
Meanwhile, in this embodiment, the target log information to be processed in the target log information set B is arranged in its generation order, so the content of B can be regarded as a structure similar to an "article" (i.e., ordered rather than unordered). As a result, after the first codes are converted into first word vectors by the target word vector matrix, each first word vector contains contextual semantic information, so the generation order of the original log information is taken into account when abnormal log information is determined from the first word vectors and the subsequently generated log semantic vectors. Abnormal log information that can only be determined by jointly analyzing several logs can therefore be identified effectively.
Further, in this embodiment, through the processing in steps S200 to S600, log semantic vectors with contextual semantic information are obtained, and the abnormal log information set Y can be quickly determined from A through clustering or semantic recognition. Compared with identification using regular expressions, the processing load is greatly reduced, and because the identification can incorporate contextual semantic information, the identification accuracy is higher.
In an exemplary embodiment of the present application, the step S700 includes:
s710, performing random initialization to obtain a first unit vector W1 and a second unit vector W2; w1 and W2 are both one-dimensional vectors consisting of h characteristic elements.
S720, obtaining a first parameter t = 0.
S730, if t = h, setting t = 1; otherwise, setting t = t + 1.
S740, according to Ft, obtaining Dist1 = sqrt(∑_{k=1}^{h} (X1k - X2k)^2) and Dist2 = sqrt(∑_{k=1}^{h} (X1k - X3k)^2); where X1k is the kth feature element in Ft, X2k is the kth feature element in W1, and X3k is the kth feature element in W2.
S750, if Dist1 < Dist2, obtaining W1 = W1 + β(Ft - W1); otherwise, obtaining W2 = W2 + β(Ft - W2); where β is the learning rate, and the magnitude of β changes as the number of executions of step S750 increases. Specifically, β may be determined by a Mexican hat function or a top-hat function.
S760, clustering F1, F2, …, Fi, …, Fn according to W1 and W2 to obtain a first clustering result and a second clustering result. The first clustering result is the clustering result obtained by taking W1 as the clustering core, and the second clustering result is the clustering result obtained by taking W2 as the clustering core. Each clustering result contains at least one log semantic vector.
S770, determining whether the clustering termination condition is met according to the first clustering result and the second clustering result; if so, proceeding to step S780, and if not, returning to step S730. The clustering termination condition includes: the ratio of the number of log semantic vectors in the first clustering result to the number of log semantic vectors in the second clustering result is less than 0.0001 or greater than 10000.
S780, determining, of the first clustering result and the second clustering result, the one containing the smaller number of log semantic vectors as the target clustering result.
S790, determining the original log information to be processed corresponding to each log semantic vector in the target clustering result as abnormal log information, to obtain Y.
In this embodiment, steps S730 to S760 are executed in a loop, and the first unit vector W1 and the second unit vector W2 are continuously optimized, so that the distance between W1 and W2 gradually increases and each of them approaches a clustering core. In the optimization process, one round of optimization can be considered complete when t has increased to h; within each round, the log semantic vectors are fed in according to the generation time of the corresponding original logs to be processed, so contextual correlation can be learned during optimization, which further improves the accuracy of the subsequent clustering. The optimization and clustering of W1 and W2 can also be performed with an SOM model. In a conventional clustering process, clustering ends only when the first unit vector W1 and the second unit vector W2 reach a convergence condition (e.g., the changes of W1 and W2 become stable). However, research found that if a conventional convergence condition is used as the clustering termination condition, too much abnormal log information may be determined (i.e., the accuracy is low). Therefore, in this embodiment, the ratio of the numbers of log semantic vectors in the two clustering results is used as the termination condition, so that clustering stops when this ratio approaches the ratio of normal to abnormal log information observed in manual identification, which improves the accuracy of the determined abnormal log information. Experiments show that the accuracy of the clustering result obtained with the clustering termination condition provided in this embodiment is significantly higher than that obtained with a conventional convergence condition as the termination condition. A simplified sketch of the whole loop is given below.
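In the sketch of steps S710 to S790, the initial learning rate, the plain exponential decay used in place of a Mexican hat schedule, the round cap, and the small epsilon guarding the ratio are all assumptions; it is meant to show the shape of the loop, not the patented implementation.

```python
import numpy as np

h = 300  # dimension of the log semantic vectors (see step S500)
rng = np.random.default_rng(0)
W1 = rng.standard_normal(h).astype(np.float32)  # first unit vector (random init)
W2 = rng.standard_normal(h).astype(np.float32)  # second unit vector (random init)
beta, decay = 0.5, 0.99                         # assumed learning-rate schedule

def assign(F, W1, W2):
    """Cluster every log semantic vector to the nearer of W1 and W2."""
    first, second = [], []
    for idx, Ft in enumerate(F):
        d1, d2 = np.linalg.norm(Ft - W1), np.linalg.norm(Ft - W2)
        (first if d1 < d2 else second).append(idx)
    return first, second

# F and A come from the earlier sketches.
for _ in range(1000):  # cap the number of rounds for safety in this sketch
    for Ft in F:  # one round: vectors fed in the generation order of the original logs
        if np.linalg.norm(Ft - W1) < np.linalg.norm(Ft - W2):
            W1 = W1 + beta * (Ft - W1)   # move the winning unit toward Ft
        else:
            W2 = W2 + beta * (Ft - W2)
        beta *= decay
    first, second = assign(F, W1, W2)
    ratio = (len(first) + 1e-9) / (len(second) + 1e-9)
    if ratio < 1e-4 or ratio > 1e4:      # termination condition on the cluster-size ratio
        break

target = first if len(first) < len(second) else second
Y = [A[idx] for idx in target]           # abnormal original log information items
```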
In an exemplary embodiment of the present application, the clustering termination condition includes:
W1 and W2 meet a preset convergence condition; and the ratio of the number of log semantic vectors in the first clustering result to the number of log semantic vectors in the second clustering result is less than 0.0001 or greater than 10000.
In this embodiment, the clustering termination condition is set as two conditions that must be satisfied simultaneously; that is, only after W1 and W2 satisfy the preset convergence condition (e.g., the changes of W1 and W2 become stable) is it checked whether the ratio of the number of log semantic vectors in the first clustering result to the number in the second clustering result is less than 0.0001 or greater than 10000. In this way, the accuracy of the clustering can be further improved.
In an exemplary embodiment of the present application, the size of β decreases as the number of execution times of step S750 increases.
And the clustering termination condition includes:
β is smaller than a preset learning rate threshold; and the ratio of the number of log semantic vectors in the first clustering result to the number of log semantic vectors in the second clustering result is less than 0.0001 or greater than 10000.
In this embodiment, β decreases as the number of learning iterations increases and gradually tends toward 0; when the learning rate tends toward 0, W1 and W2 can also be considered stable. At that point it is checked whether the ratio of the number of log semantic vectors in the first clustering result to the number in the second clustering result is less than 0.0001 or greater than 10000. In this way, the accuracy of the clustering can be further improved.
Moreover, although the steps of the methods of the present disclosure are depicted in the drawings in a particular order, this does not require or imply that these steps must be performed in this particular order, or that all of the depicted steps must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions, etc.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which may be a personal computer, a server, a mobile terminal, or a network device, etc.) to execute the method according to the embodiments of the present disclosure.
In an exemplary embodiment of the present disclosure, an electronic device capable of implementing the above method is also provided.
As will be appreciated by one skilled in the art, aspects of the present application may be embodied as a system, method or program product. Accordingly, various aspects of the present application may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.) or an embodiment combining hardware and software aspects, all of which may generally be referred to herein as a "circuit," "module," or "system."
An electronic device according to this embodiment of the present application is described below. The electronic device is only an example and should not impose any limitation on the function and scope of use of the embodiments of the present application.
The electronic device is in the form of a general purpose computing device. Components of the electronic device may include, but are not limited to: the at least one processor, the at least one memory, and a bus connecting the various system components (including the memory and the processor).
The memory stores program code executable by the processor to cause the processor to perform the steps according to various exemplary embodiments of the present application described in the "exemplary methods" section above.
The memory may include readable media in the form of volatile memory, such as Random Access Memory (RAM) and/or cache memory, and may further include Read Only Memory (ROM).
The memory may also include a program/utility having a set (at least one) of program modules including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
The bus may be any of several types of bus structures including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, a processor, or a local bus using any of a variety of bus architectures.
The electronic device may also communicate with one or more external devices (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device to communicate with one or more other computing devices. Such communication may be through an input/output (I/O) interface. Also, the electronic device may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the internet) via a network adapter. The network adapter communicates with other modules of the electronic device over the bus. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device, including but not limited to: microcode, device drivers, redundant processors, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which may be a personal computer, a server, a terminal device, or a network device, etc.) to execute the method according to the embodiments of the present disclosure.
In an exemplary embodiment of the present disclosure, there is also provided a computer-readable storage medium having stored thereon a program product capable of implementing the above-described method of the present specification. In some possible embodiments, various aspects of the present application may also be implemented in the form of a program product comprising program code for causing a terminal device to perform the steps according to various exemplary embodiments of the present application described in the "exemplary methods" section above of this specification, when the program product is run on the terminal device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A computer readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).
Furthermore, the above-described figures are only schematic illustrations of the processes involved in the methods according to exemplary embodiments of the present application and are not intended to be limiting. It will be readily understood that the processes shown in the above figures are not intended to indicate or limit the chronological order of the processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, e.g., in multiple modules.
It should be noted that although in the above detailed description several modules or units of the device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functionality of two or more modules or units described above may be embodied in one module or unit, according to embodiments of the present disclosure. Conversely, the features and functions of one module or unit described above may be further divided into embodiments by a plurality of modules or units.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present application should be covered within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A data processing method for abnormal log information determination, comprising:
s100, acquiring An original log information set to be processed, wherein A = (A1, A2, …, ai, …, an), and Ai = (TAGi, ATXTi); ai is ith original log information to be processed generated in a time sequence in a target network, TAGi is a log type identifier corresponding to Ai, and ATXTi is original log content information corresponding to Ai;
s200, traversing A, and if TAGi is the target log type identifier, deleting the fixed speech segment corresponding to TAGi in ATXTi to obtain a target log information set B = (B1, B2, …, bi, …, bn) and Bi = (TAGi, BTXTi); bi is target log information to be processed corresponding to Ai, and BTXTi is target log content information corresponding to Bi; the target log type identifier is a log type identifier containing a fixed language segment in the corresponding original log content information;
s300, performing word segmentation processing on Bi to obtain a word segmentation list set C = (C1, C2, …, ci, …, cn), and Ci = (TAGi, C) i 1 ,C i 2 ,…,C i j ,…,C i g(i) ) J =1,2, …, g (i); ci is a word segmentation list corresponding to Bi, C i j The j-th participle in BTXTi is obtained, and g (i) is the number of the participles in BTXTi;
s400, performing hot single encoding processing on the C to obtain a first encoding list set D = (D1, D2, …, di, …, dn), and Di = (D) i 1 ,D i 2 ,…,D i p ,…,D i g(i) ,D i g(i)+1 ) P =1,2, …, g (i) +1; di is a first coding list corresponding to Ci, D i 1 For the first code obtained after the hot independent coding of TAGi, D i p Is a pair C i p-1 Carrying out thermal independent coding processing to obtain a first code; each first code is a one-dimensional vector consisting of s characteristic elements; s is the number of words in a dictionary formed by using the historical log information set;
s500, vectorizing D to obtain a first word vector list set E = (E1, E2, …, ei, …, en), ei = (E1, E2, …, ei, …, en), and then i 1 ,E i 2 ,…,E i p ,…,E i g(i) ,E i g(i)+1 ) (ii) a Ei is the first word vector list corresponding to Di, E i p Is D i p Corresponding first word vector, E i p Containing contextual semantic information, E i p =D i p X VecT; vecT is a target word vector matrix obtained after a historical log information set is processed according to a preset word2vec model; size of VecT is s x h, s is row number of VecT, h is column number of VecT, h ≪ s; e is used for determining the abnormal log information set Y from A.
2. The data processing method of claim 1, further comprising:
s600, performing vector conversion processing on the E to obtain log semantic vectors F1, F2, …, fi, … and Fn corresponding to each original log information to be processed; fi is a log semantic vector corresponding to Ai obtained after vector conversion processing is carried out on Ei;
s700, determining an abnormal log information set Y = (Y1, Y2, …, yq, …, ym), q =1,2, …, m from a according to F1, F2, …, fi, …, fn; yq is the q-th abnormal log information determined from A, and m is the number of the abnormal log information determined from A; m is less than n.
3. The data processing method according to claim 2, wherein the step S600 comprises:
s610, obtaining a vector weight corresponding to each first word vector, and obtaining a vector weight list set Q = (Q1, Q2, …, qi, …, qn), qi = (Q) i 1 ,Q i 2 ,…,Q i p ,…,Q i g(i) ,Q i g(i)+1 ) (ii) a Wherein Qi is a vector weight list corresponding to Ei, Q i p Is E i p Corresponding vector weight, Q i p Is according to E i p The corresponding word is obtained from the word frequency of the history log information set, and E i p Word frequency size and Q of corresponding participles in historical log information set i p Is inversely proportional to the size of (a);
s620, acquiring Fi = (∑) p=1 g(i)+1 Q i p * E i p ) V (g (i) + 1) to give F1, F2, …, fi, …, fn.
4. The data processing method according to claim 3, wherein the step S700 comprises:
s710, performing random initialization to obtain a first unit vector W1 and a second unit vector W2; w1 and W2 are both one-dimensional vectors consisting of h characteristic elements;
s720, obtaining a first parameter t =0;
s730, if t = h, obtaining t =1, otherwise, obtaining t = t +1;
s740, according to Ft, obtaining Dist1= sqrt (∑) k=1 h (X1 k-X2 k)), and Dist2= sqrt (∑ s) k=1 h (X1 k-X3 k)); wherein, X1k is the kth characteristic element in Ft, X2k is the kth characteristic element in W1, and X3k is the kth characteristic element in W2;
s750, if Dist1 < Dist2, obtaining W1= W1+ β (Ft-W1); otherwise, obtain W2= W2+ β (Ft-W2); wherein β is a learning rate, and the magnitude of β changes with the increase of the execution times of step S750;
s760, clustering F1, F2, …, fi, … and Fn according to W1 and W2 to obtain a first clustering result and a second clustering result;
s770, determining whether the current condition is met according to the first clustering result and the second clustering result, if yes, entering step S780, and if not, entering step S730; the clustering termination condition includes: the number of the log semantic vectors in the first clustering result and the number of the log semantic vectors in the second clustering result are less than 0.0001 or more than 10000;
s780, determining that the number of the log semantic vectors in the first clustering result and the second clustering result is less as a target clustering result;
s790, determining the original log information to be processed corresponding to each log semantic vector in the target clustering result as abnormal log information to obtain Y.
5. The data processing method of claim 4, wherein the clustering termination condition comprises:
W1 and W2 meet a preset convergence condition; and the ratio of the number of log semantic vectors in the first clustering result to the number of log semantic vectors in the second clustering result is less than 0.0001 or greater than 10000.
6. The data processing method of claim 4, wherein the size of β decreases as the number of executions of step S750 increases.
7. The data processing method of claim 6, wherein the clustering termination condition comprises:
β is smaller than a preset learning rate threshold; and the ratio of the number of log semantic vectors in the first clustering result to the number of log semantic vectors in the second clustering result is less than 0.0001 or greater than 10000.
8. The data processing method according to claim 6, wherein the original log information to be processed is linux system log information.
9. An electronic device comprising a processor and a memory;
the processor is adapted to perform the steps of the method of any one of claims 1 to 8 by calling a program or instructions stored in the memory.
10. A non-transitory computer readable storage medium storing a program or instructions for causing a computer to perform the steps of the method of any one of claims 1 to 8.
CN202310160654.XA 2023-02-24 2023-02-24 Data processing method, electronic equipment and medium for determining abnormal log information Active CN115860008B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310160654.XA CN115860008B (en) 2023-02-24 2023-02-24 Data processing method, electronic equipment and medium for determining abnormal log information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310160654.XA CN115860008B (en) 2023-02-24 2023-02-24 Data processing method, electronic equipment and medium for determining abnormal log information

Publications (2)

Publication Number Publication Date
CN115860008A true CN115860008A (en) 2023-03-28
CN115860008B CN115860008B (en) 2023-05-12

Family

ID=85658835

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310160654.XA Active CN115860008B (en) 2023-02-24 2023-02-24 Data processing method, electronic equipment and medium for determining abnormal log information

Country Status (1)

Country Link
CN (1) CN115860008B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116599778A (en) * 2023-07-18 2023-08-15 山东溯源安全科技有限公司 Data processing method for determining malicious device
CN116866047A (en) * 2023-07-18 2023-10-10 山东溯源安全科技有限公司 Method, medium and device for determining malicious equipment in industrial equipment network

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112613309A (en) * 2020-12-24 2021-04-06 北京浪潮数据技术有限公司 Log classification analysis method, device and equipment and readable storage medium
CN113434357A (en) * 2021-05-17 2021-09-24 中国科学院信息工程研究所 Log abnormity detection method and device based on sequence prediction
CN113986864A (en) * 2021-11-11 2022-01-28 建信金融科技有限责任公司 Log data processing method and device, electronic equipment and storage medium
CN114785606A (en) * 2022-04-27 2022-07-22 哈尔滨工业大学 Log anomaly detection method based on pre-training LogXLNET model, electronic device and storage medium
CN115454788A (en) * 2022-08-09 2022-12-09 长沙理工大学 Log anomaly detection method, device, equipment and storage medium
CN115454706A (en) * 2022-10-17 2022-12-09 中国农业银行股份有限公司 System abnormity determining method and device, electronic equipment and storage medium
US20220405592A1 (en) * 2022-03-10 2022-12-22 University Of Electronic Science And Technology Of China Multi-feature log anomaly detection method and system based on log full semantics
CN115600607A (en) * 2022-11-02 2023-01-13 中国农业银行股份有限公司(Cn) Log detection method and device, electronic equipment and medium

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112613309A (en) * 2020-12-24 2021-04-06 北京浪潮数据技术有限公司 Log classification analysis method, device and equipment and readable storage medium
CN113434357A (en) * 2021-05-17 2021-09-24 中国科学院信息工程研究所 Log abnormity detection method and device based on sequence prediction
CN113986864A (en) * 2021-11-11 2022-01-28 建信金融科技有限责任公司 Log data processing method and device, electronic equipment and storage medium
US20220405592A1 (en) * 2022-03-10 2022-12-22 University Of Electronic Science And Technology Of China Multi-feature log anomaly detection method and system based on log full semantics
CN114785606A (en) * 2022-04-27 2022-07-22 哈尔滨工业大学 Log anomaly detection method based on pre-training LogXLNET model, electronic device and storage medium
CN115454788A (en) * 2022-08-09 2022-12-09 长沙理工大学 Log anomaly detection method, device, equipment and storage medium
CN115454706A (en) * 2022-10-17 2022-12-09 中国农业银行股份有限公司 System abnormity determining method and device, electronic equipment and storage medium
CN115600607A (en) * 2022-11-02 2023-01-13 中国农业银行股份有限公司(Cn) Log detection method and device, electronic equipment and medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
MINGYANG ZHANG ET AL.: "LogST: Log Semi-supervised Anomaly Detection Based on Sentence-BERT", IEEE *
REN MING; SONG YUNKUI: "Anomaly Detection Method for Cloud Computing Systems Based on Deep Learning", Computer Technology and Development *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116599778A (en) * 2023-07-18 2023-08-15 山东溯源安全科技有限公司 Data processing method for determining malicious device
CN116599778B (en) * 2023-07-18 2023-09-26 山东溯源安全科技有限公司 Data processing method for determining malicious device
CN116866047A (en) * 2023-07-18 2023-10-10 山东溯源安全科技有限公司 Method, medium and device for determining malicious equipment in industrial equipment network

Also Published As

Publication number Publication date
CN115860008B (en) 2023-05-12

Similar Documents

Publication Publication Date Title
CN115860008B (en) Data processing method, electronic equipment and medium for determining abnormal log information
CN116170300B (en) Data processing method, electronic equipment and medium for determining abnormal log information
CN109242164B (en) Method and device for optimizing product path, computer storage medium and electronic equipment
US11748393B2 (en) Creating compact example sets for intent classification
CN111368878A (en) Optimization method based on SSD target detection, computer equipment and medium
CN110162518B (en) Data grouping method, device, electronic equipment and storage medium
CN111368551A (en) Method and device for determining event subject
CN115221516A (en) Malicious application program identification method and device, storage medium and electronic equipment
CN113344647B (en) Information recommendation method and device
CN111190967A (en) User multi-dimensional data processing method and device and electronic equipment
CN114817473A (en) Methods, apparatus, devices, media and products for compressing semantic understanding models
US20220020361A1 (en) Systems and methods for fast filtering of audio keyword search
CN111240971B (en) Method and device for generating wind control rule test case, server and storage medium
CN110472241B (en) Method for generating redundancy-removed information sentence vector and related equipment
CN115858776B (en) Variant text classification recognition method, system, storage medium and electronic equipment
CN115186650B (en) Data detection method and related device
CN113569578B (en) User intention recognition method and device and computer equipment
CN115952258A (en) Generation method of government affair label library, and label determination method and device of government affair text
CN114897099A (en) User classification method and device based on passenger group deviation smooth optimization and electronic equipment
CN111625587B (en) Data sharing apparatus
CN114357180A (en) Knowledge graph updating method and electronic equipment
CN113627514A (en) Data processing method and device of knowledge graph, electronic equipment and storage medium
CN113851117A (en) Voice keyword recognition method, system, device and storage medium
CN108768742B (en) Network construction method and device, electronic equipment and storage medium
CN112801226A (en) Data screening method and device, computer readable storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant