CN111488260B

CN111488260B - Data template acquisition method, device, computer equipment and readable storage medium

Info

Publication number: CN111488260B
Application number: CN201910087251.0A
Authority: CN
Inventors: 赵锋; 孟庆月; 田雨; 张朋朋; 马平丽
Original assignee: Huawei Cloud Computing Technologies Co Ltd
Current assignee: Huawei Cloud Computing Technologies Co Ltd
Priority date: 2019-01-29
Filing date: 2019-01-29
Publication date: 2023-12-08
Anticipated expiration: 2039-01-29
Also published as: CN111488260A

Abstract

The application discloses a data template acquisition method, a data template acquisition device, computer equipment and a readable storage medium, and belongs to the technical field of data processing. In the method, target data is stored into a data set in which one column contains multiple kinds of character strings and every other column contains only one character string, so that the data set then has at least two columns with more than one kind of character string. The computer equipment can therefore identify the variables of the log data in the data set based on the target data and obtain a data template. When the log data in which the variables were identified corresponds to API service requests, the data template can be used as an API template. This solves the problem that an API template cannot be obtained because the second target column and the third target column of the data set cannot be determined at the same time.

Description

Data template acquisition method, device, computer equipment and readable storage medium

Technical Field

The present application relates to the field of data processing technologies, and in particular, to a method and apparatus for acquiring a data template, a computer device, and a readable storage medium.

Background

An application typically consists of at least one micro-service. The micro-services may be distributed over different servers and may communicate with one another through application programming interfaces (APIs) that follow the representational state transfer (REST) style and the hypertext transfer protocol (HTTP), thereby implementing the functionality of each micro-service. To understand the service behavior of each micro-service and analyze the performance indexes of API calls, a computer device can acquire log data that records the micro-service functions implemented through the API, identify the constants and variables in the log data, and replace the variables with a first character string to obtain an API template of the log data. The performance indexes of API calls can then be analyzed to understand the service behavior of each micro-service.

Currently, a data template may be obtained by the following procedure. The computer device acquires 5 pieces of log data: /user/a1; /user/b1/c1/d1; /user/b1/f1/g1; /user/d1/k1/m1; and /api/v1/v2/v3. It divides each piece of log data into character strings at each "/", puts the log data that includes 2 character strings into data set 1, so that data set 1 is {/user/a1}, and puts the log data that includes 4 character strings into data set 2, so that data set 2 is {/user/b1/c1/d1; /user/b1/f1/g1; /user/d1/k1/m1; /api/v1/v2/v3}. The character strings user, user, user and api form the first column of data set 2; b1, b1, d1 and v1 form the second column; c1, f1, k1 and v2 form the third column; and d1, g1, m1 and v3 form the fourth column. Data set 2 is split on its first column to obtain sub-data set 2.1 {/user/b1/c1/d1; /user/b1/f1/g1; /user/d1/k1/m1} and sub-data set 2.2 {/api/v1/v2/v3}. A one-to-many correspondence is then established between the character strings of two columns that each have more than one kind of character string; the "one" end of the correspondence is identified as a constant and the "many" end as variables. Here, the character string b1 in the second column of sub-data set 2.1 corresponds to the character strings c1 and f1 in the third column, so the computer device identifies b1 as a constant and c1 and f1 as variables, and replaces the variables with the character string {variable}. The replaced sub-data set 2.1 is {/user/b1/{variable}/d1; /user/b1/{variable}/g1; /user/d1/k1/m1}. The computer device uses /user/b1/{variable}/d1 and /user/b1/{variable}/g1 in sub-data set 2.1 as API templates.
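The one-to-many identification in this background procedure can be sketched in a few lines of Python. This is only an illustrative sketch: the function name replace_one_to_many and its parameters are hypothetical, and the two columns to compare are passed in by hand rather than selected automatically.

```python
from collections import defaultdict

VARIABLE = "{variable}"

def replace_one_to_many(data_set, const_col, var_col):
    """Replace strings in var_col that sit at the "many" end of a
    one-to-many correspondence with the string in const_col."""
    # Map each string in const_col to the set of strings it co-occurs with in var_col.
    mapping = defaultdict(set)
    for row in data_set:
        mapping[row[const_col]].add(row[var_col])
    result = []
    for row in data_set:
        new_row = list(row)
        # More than one partner means this row is part of a one-to-many correspondence.
        if len(mapping[row[const_col]]) > 1:
            new_row[var_col] = VARIABLE
        result.append(new_row)
    return result

# Sub-data set 2.1 from the example above, already split on "/".
sub_2_1 = [
    ["user", "b1", "c1", "d1"],
    ["user", "b1", "f1", "g1"],
    ["user", "d1", "k1", "m1"],
]
replaced = replace_one_to_many(sub_2_1, const_col=1, var_col=2)
for row in replaced:
    print("/" + "/".join(row))
```

With the second and third columns chosen, b1 maps to both c1 and f1 and so both are replaced, while d1 maps only to k1 and the third piece of log data is left unchanged.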

Based on the above process of obtaining the data template, a one-to-many correspondence must be established between the character strings of two columns that each have more than one kind of character string in the sub-data set, so that the character string at the "one" end is identified as a constant, the character strings at the "many" end are identified as variables, and the API template is obtained. When a data set does not have two such columns, the two columns cannot both be determined, the correspondence cannot be established, and no API template can be obtained.

Disclosure of Invention

The embodiment of the application provides a data template acquisition method, a data template acquisition device, computer equipment and a readable storage medium, which can solve the problem that an API template cannot be acquired because a second target column and a third target column of a data group cannot be determined at the same time. The technical scheme is as follows:

in a first aspect, a method for obtaining a data template is provided, the method comprising:

Grouping the plurality of pieces of log data to obtain a plurality of data groups, wherein the number of character strings of the log data included in each data group is the same, and the character strings at the same position of the plurality of pieces of log data in one data group form a column of the data group;

for any one data set, when the number of character string types on any column of the data set is equal to the number of log data in the data set and only one character string is arranged on each column of other columns, storing target data into the data set, wherein the number of character strings in the target data is the same as the column number of the data set;

based on the target data, replacing the character string identified as the variable in the data set with the first character string to obtain at least one data template, wherein the at least one data template consists of at least one piece of log data except the target data in the data set.

In one possible implementation, after grouping the plurality of pieces of log data to obtain the plurality of data groups, the method further includes:

when the format of any character string in any data set meets the preset condition, the character string is replaced by the first character string.

Based on the possible implementation manner, the specific variable in the log data can be identified in advance, and then only the non-specific variable needs to be identified in the subsequent variable identification process, so that the variable identification efficiency is improved.

In one possible implementation manner, based on the target data, replacing the character string identified as the variable in the data set with the first character string to obtain at least one data template, including:

whenever a first target column of the data set has different kinds of character strings, replacing a second character string in the first target column with the first character string, wherein the occurrence probability of the second character string in the first target column is smaller than or equal to a preset value;

splitting the data set based on the third character string in the first target column to obtain at least one sub-data set when the replaced first target column has the third character string, wherein the occurrence probability of the third character string in the first target column is larger than the preset value;

at least one data template is obtained from the sub-data set.

Based on the possible implementation manner, by continuously splitting the data set and replacing the second character string in the sub-data set with the first character string, the one-to-many correspondence relation between the character strings on the two columns with the character string types larger than 1 in the data set can be avoided, and thus the simultaneous determination of the second target column and the third target column of the data set can be avoided, and the problem that the API template cannot be acquired because the second target column and the third target column of the data set cannot be determined simultaneously can be avoided.
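The replace-then-split step described above might look roughly like the following Python sketch. Several details are assumptions made only for illustration: the function name replace_rare_and_split, the choice of column, the preset value 0.5, the sample rows, and the decision to collect rows whose string was replaced under the key {variable}.

```python
from collections import Counter, defaultdict

VARIABLE = "{variable}"

def replace_rare_and_split(data_set, col, preset_value):
    """Replace strings whose occurrence probability in column `col` is
    <= preset_value with the first character string, then split the data
    set on the strings that remain in that column."""
    counts = Counter(row[col] for row in data_set)
    total = len(data_set)
    subsets = defaultdict(list)
    for row in data_set:
        new_row = list(row)
        # Occurrence probability = occurrences of this string / number of rows.
        if counts[row[col]] / total <= preset_value:
            new_row[col] = VARIABLE          # "second character string": rare, treated as a variable
        subsets[new_row[col]].append(new_row)  # "third character string": frequent, used as split key
    return dict(subsets)

# Hypothetical rows, already split on "/".
rows = [
    ["api", "user", "zhang", "status"],
    ["api", "user", "wang", "status"],
    ["api", "user", "li", "status"],
    ["api", "user", "zhao", "status"],
    ["fake", "fake", "fake", "fake"],
]
by_first = replace_rare_and_split(rows, col=0, preset_value=0.5)
by_third = replace_rare_and_split(rows, col=2, preset_value=0.5)
```

In this sketch, splitting on the first column keeps api (probability 0.8) as a constant and replaces fake (probability 0.2), while applying the same step to the third column replaces every name because each occurs with probability 0.2.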

In one possible implementation, after replacing the character string identified as the variable in the data set with the first character string based on the target data, the method further includes:

the data set is de-duplicated, so that all log data in the de-duplicated data set are different from each other;

and acquiring at least one piece of log data in the de-duplicated data set as at least one data template.

Based on the possible implementation manner, at least one piece of log data which is different from each other in the data set can be obtained, so that the computer device can take the at least one piece of log data as a data template.
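A minimal Python sketch of the de-duplication step follows. The helper names deduplicate and to_templates are hypothetical, and the sample rows are illustrative only.

```python
def deduplicate(data_set):
    """Keep the first occurrence of each piece of log data, preserving order."""
    seen = set()
    unique = []
    for row in data_set:
        key = tuple(row)  # lists are not hashable, tuples are
        if key not in seen:
            seen.add(key)
            unique.append(list(row))
    return unique

def to_templates(data_set, sep="/"):
    """Join each distinct piece of log data back into a template string."""
    return [sep + sep.join(row) for row in deduplicate(data_set)]

rows = [
    ["user", "update", "{variable}"],
    ["user", "update", "{variable}"],
    ["user", "delete", "{variable}"],
    ["user", "update", "{variable}"],
]
templates = to_templates(rows)
print(templates)
```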

In one possible implementation, when the number of character string types on any column of the data set is equal to the number of log data in the data set and there is only one character string on each of the other columns, storing the target data into the data set includes:

when only one column of the data set has more than one kind of character string, the number of pieces of log data in the data set is larger than a first preset value, and each of the other columns has only one character string, the target data is stored in the data set.

Based on the above possible implementation, target data is stored only in data sets that have enough log data; when a data set contains only a few pieces of log data, the target data need not be stored.

In a second aspect, a data template acquisition method is provided, the method including:

for any one data set, when a first target column of the data set has different kinds of character strings, replacing a second character string in the first target column with the first character string, wherein the occurrence probability of the second character string in the first target column is smaller than or equal to a preset numerical value;

at least one data template is obtained from the sub-data set, the at least one data template being comprised of at least one piece of log data in the sub-data set.

In one possible implementation, when the format of any string in any data set satisfies a preset condition, the string is replaced with the first string.

In a third aspect, a data template acquiring apparatus is provided for performing the above data template acquiring method. Specifically, the data template acquiring apparatus includes a functional module for executing the data template acquiring method provided in the first aspect or any of the optional manners of the first aspect.

In a fourth aspect, a data template acquiring apparatus is provided for performing the above data template acquiring method. Specifically, the data template acquiring apparatus includes a functional module for executing the data template acquiring method provided in the second aspect or any of the optional manners of the second aspect.

In a fifth aspect, a computer device is provided, the computer device comprising a processor and a memory having stored therein at least one instruction that is loaded and executed by the processor to perform operations as performed by the data template acquisition method described above.

In a sixth aspect, a computer readable storage medium having stored therein at least one instruction that is loaded and executed by a processor to implement operations performed by a data template acquisition method as described above is provided.

The technical scheme provided by the embodiment of the application has the beneficial effects that:

By storing target data in a data group with multiple character strings on only one column and only one character string on other columns, the data group is provided with two columns with character string types larger than 1, and the computer equipment can identify variables of log data in the data group based on the target data, so that a data template is obtained, when the variables in the data template correspond to an API service request before being identified, the data template can be used as the API template, and the problem that the API template cannot be obtained due to the fact that a second target column and a third target column of the data group cannot be determined at the same time can be solved.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a schematic illustration of an implementation environment provided by an embodiment of the present application;

FIG. 2 is a schematic diagram of a computer device according to an embodiment of the present application;

FIG. 3 is a flowchart of a method for obtaining a data template according to an embodiment of the present application;

FIG. 4 is a schematic diagram of a method for obtaining a data template according to an embodiment of the present application;

FIG. 5 is a schematic diagram of a method for splitting a data set according to an embodiment of the present application;

FIG. 6 is a schematic diagram of a method for splitting a data set according to an embodiment of the present application;

FIG. 7 is a flowchart of a method for obtaining a data template according to an embodiment of the present application;

FIG. 8 is a schematic diagram of a method for obtaining a data template according to an embodiment of the present application;

FIG. 9 is a structural diagram of a data template acquiring device according to an embodiment of the present application;

FIG. 10 is a structural diagram of a data template acquiring device according to an embodiment of the present application.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail with reference to the accompanying drawings.

Fig. 1 is a schematic diagram of an implementation environment provided in an embodiment of the present application, referring to fig. 1, where the implementation environment includes a computer device, a database, and at least one user device, and the database is connected to the computer device and the at least one user device.

The user device is configured to provide log data. The log data may be data that records service requests, where a service request may be a request made through an API by a micro-service of an application on the user device. The log data may store the uniform resource locator (URL) corresponding to each service request, and the format of the URL may be protocol://username:password@subdomain.domain.top-level-domain:port/directory/filename?parameter=value#flag. Of course, the log data may also be other data, and the embodiment of the present application does not limit the specific content of the log data. In a possible implementation manner, the user device may send an API service request to a server through a browser installed on the user device, so that corresponding log data is generated at the server.

And the computer equipment is used for identifying constants and variables in the log data, replacing the variables in the log data and further acquiring an API template of the log data.

The database is used for storing log data. The database may be connected to the computer device or may reside in the computer device. Log data sent by the user device or by the server may be stored directly in the database; alternatively, the user device may send the generated log data to the computer device, which stores the received log data in the database.

Fig. 2 is a schematic structural diagram of a computer device according to an embodiment of the present application. The computer device 200 may vary considerably in configuration or performance, and may include one or more processors (central processing units, CPUs) 201 and one or more memories 202, where at least one instruction is stored in the memories 202, and the at least one instruction is loaded and executed by the processors 201 to implement the data template acquiring method provided in the method embodiments described below. Of course, the computer device 200 may also have a wired or wireless network interface, a keyboard, an input/output interface, and other components for implementing the functions of the device, which are not described herein.

In an exemplary embodiment, a computer readable storage medium, such as a memory comprising instructions executable by a processor in a terminal to perform the data template acquisition method in the embodiments described below, is also provided. For example, the computer readable storage medium may be a read-only memory (ROM), a random access memory (random access memory, RAM), a compact disc-read only memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, and the like.

The foregoing describes the implementation environment and the hardware of the computer device. To illustrate how the computer device obtains a log data template, a specific embodiment is described below with reference to fig. 3. Fig. 3 is a flowchart of a data template obtaining method provided by an embodiment of the present application, and the method flow includes:

301. The computer device obtains at least one log file, and obtains a plurality of pieces of log data from the at least one log file.

The computer device may directly obtain the log file from the user device, or may obtain the log file from the database, where a plurality of pieces of log data are stored in the log file. In addition, the method for acquiring the log file by the computer device is not particularly limited in the embodiment of the present application.

In one possible implementation, the computer device may send a log data acquisition request to the database, the request asking for at least one log file. The log collection system sends the requested at least one log file to the computer device based on the log data acquisition request, and upon receiving the at least one log file the computer device may acquire a plurality of pieces of log data from it.

302. The computer equipment groups a plurality of pieces of log data to obtain a plurality of data groups, the number of character strings of the log data included in each data group is the same, and the character strings at the same position of the plurality of pieces of log data in one data group form a column of the data group.

In one possible implementation, when a separator is detected in the log data, the computer device segments the log data at the separator so that the segmented log data includes a plurality of character strings. For example, segmenting the log data /user/a1/add with "/" as the separator yields the character strings user, a1 and add.

In a possible implementation manner, referring to fig. 4, fig. 4 is a schematic diagram of a data template obtaining method provided by an embodiment of the present application, input data in fig. 4 is log data segmented by the computer device, output data is a data group obtained by grouping the segmented log data by the computer device, as shown in fig. 4, the input data has 17 pieces, the first 13 pieces of log data all include 3 character strings, so that the first 13 pieces of log data are distributed to one data group to obtain a data group 1, and the last 4 pieces of log data all include 4 character strings, so that the last 4 pieces of log data are distributed to one data group to obtain a data group 2.

In a possible implementation manner, the position of each character string of each piece of log data in the data set may be regarded as a unit. Taking the 1st piece of log data /api/user/zhang/status in data set 2 as an example, in left-to-right order the character string api is in the first unit of the log data, user is in the second unit, zhang is in the third unit, and status is in the fourth unit. The character strings in the same unit of all pieces of log data form one column of the data set. In data set 2, the character string in the third unit is zhang for the 1st piece of log data, wang for the 2nd, li for the 3rd and zhao for the 4th, so zhang, wang, li and zhao form the third column of data set 2.
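Segmentation, grouping and column formation (step 302) can be sketched as follows. The function names are hypothetical, the separator is assumed to be "/", and the sample input is a small subset chosen for illustration rather than the 17 pieces shown in fig. 4.

```python
from collections import defaultdict

def split_log(line, sep="/"):
    """Split one piece of log data into its character strings, dropping empty parts."""
    return [part for part in line.split(sep) if part]

def group_by_length(lines, sep="/"):
    """Group log data so every piece in a data set has the same number of strings."""
    groups = defaultdict(list)
    for line in lines:
        parts = split_log(line, sep)
        groups[len(parts)].append(parts)
    return dict(groups)

def columns(data_set):
    """Columns of a data set: the strings at the same position (unit) of every piece."""
    return [list(col) for col in zip(*data_set)]

logs = [
    "/user/a1/add",            # 3 character strings
    "/api/user/zhang/status",  # 4 character strings
    "/api/user/wang/status",
    "/api/user/li/status",
    "/api/user/zhao/status",
]
groups = group_by_length(logs)
third_column = columns(groups[4])[2]
print(third_column)  # ['zhang', 'wang', 'li', 'zhao']
```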

303. When the format of any character string in any data set meets the preset condition, the computer equipment replaces the character string with the first character string.

The preset condition may be that the character string has the format of a universally unique identifier (UUID), a number or a timestamp, or the format of another kind of variable, which is not specifically limited in the embodiment of the present application. The first character string is used to represent a variable and may be the character string {variable}. UUIDs, numbers and timestamps in a URL are variables, so when the computer device recognizes one of them it can replace it directly with the first character string "{variable}".

In one possible implementation manner, the formats of the UUID, the number and the timestamp are added to a set of regular expression rules, and the computer device matches each character string in the data set against the regular expressions in the rules. When the format of any character string matches that of a UUID, a number or a timestamp, the character string is replaced with the first character string {variable}. Still taking fig. 4 as an example, the character strings 1, 2 and 3 in the 4th to 6th pieces of log data in data set 1 are numbers, so the computer device replaces them with the first character string {variable}. After the replacement, the 4th to 6th pieces of log data in data set 1 are all /user/update/{variable}, so they can be represented by the single piece of log data /user/update/{variable}, which yields data set 3. When the 4th to 6th pieces of log data are API service requests, the log data /user/update/{variable}, in which user and update are constants, can be used as an API template.

The computer device can identify the specific variable in the log data in advance through the step 303, and then only the non-specific variable needs to be identified in the subsequent variable identification process, so that the variable identification efficiency is improved.
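Step 303 could be sketched in Python as below. The regular expressions are assumptions: the text names UUIDs, numbers and timestamps as formats but does not give concrete patterns, and the function name replace_specific_variables is hypothetical.

```python
import re

VARIABLE = "{variable}"

# Assumed patterns for "specific" variables; the source does not fix these.
PATTERNS = [
    # UUID in the canonical 8-4-4-4-12 hexadecimal layout
    re.compile(r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"),
    # Plain number (this also covers numeric epoch timestamps)
    re.compile(r"\d+"),
    # ISO-8601-like timestamp such as 2019-01-29T08:00:00Z
    re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z?"),
]

def replace_specific_variables(parts):
    """Replace every string whose whole text matches one of the patterns."""
    return [VARIABLE if any(p.fullmatch(s) for p in PATTERNS) else s for s in parts]

# The 4th to 6th pieces of log data of data set 1 in the example above.
rows = [["user", "update", "1"], ["user", "update", "2"], ["user", "update", "3"]]
replaced = [replace_specific_variables(r) for r in rows]
print(replaced[0])  # ['user', 'update', '{variable}']
```

Using fullmatch rather than search means a string such as a1, which merely contains a digit, is not treated as a number.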

It should be noted that, the computer device may identify a specific variable in the obtained log data before grouping, that is, the computer device may execute step 303 first and then execute step 302, and the execution sequence of steps 302-303 is not specifically limited in the embodiment of the present application.

304. For any one of the data sets, the computer device determines the column in the data set that has the fewest kinds of character strings.

In one possible implementation, the computer device counts the number of character string types in each column of the data set to obtain the number of character string types in each column of the data set, compares the number of character string types in each column, and determines the column with the least character string type in the data set. Still taking fig. 4 as an example, the first column of the data set 3 includes 4 kinds of character strings, the second column of the data set 3 includes 9 kinds of character strings, and the third column of the data set 3 includes 4 kinds of character strings, and it can be seen that the first column and the third column have the smallest kinds of character strings, and the first column or the third column may be taken as the column with the smallest kinds of character strings in the data set 3.

In this step 304, so that the data set can be split, the number of kinds of character strings in the selected column must be greater than 1. For example, if the first column of data set x has 1 kind of character string, the second column has 2 kinds and the third column has 3 kinds, the computer device takes the second column of data set x as the column with the fewest kinds of character strings.
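The column selection in step 304 can be sketched as follows. The function names are hypothetical, the rows for data set x are invented to match the 1/2/3 counts in the example, and breaking ties by taking the leftmost column is an assumption (the text allows either tied column to be chosen).

```python
def kinds_per_column(data_set):
    """Number of distinct character strings in each column."""
    return [len(set(col)) for col in zip(*data_set)]

def column_with_fewest_kinds(data_set):
    """Index of the column with the fewest kinds of strings among the columns
    that have more than one kind; None if no column can be used for splitting."""
    counts = kinds_per_column(data_set)
    candidates = [(n, i) for i, n in enumerate(counts) if n > 1]
    # min() on (count, index) tuples picks the smallest count, leftmost on ties.
    return min(candidates)[1] if candidates else None

# Hypothetical data set x: 1 kind in column 1, 2 kinds in column 2, 3 kinds in column 3.
data_set_x = [
    ["a", "p", "x"],
    ["a", "p", "y"],
    ["a", "q", "z"],
]
print(kinds_per_column(data_set_x))          # [1, 2, 3]
print(column_with_fewest_kinds(data_set_x))  # 1, i.e. the second column
```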

305. According to the character string type in the column with the least character string type, the computer equipment splits the data set to obtain at least one sub-data set.

In one possible implementation manner, the computer device de-duplicates the character strings in the column with the fewest kinds of character strings to obtain at least one distinct character string, and places the pieces of log data that contain the same one of these character strings into the same sub-data set. That is, within any sub-data set obtained by splitting, the column corresponding to the column with the fewest kinds contains a single kind of character string. Referring to fig. 5, fig. 5 is a schematic diagram of a method for splitting a data set according to an embodiment of the present application, in which the input data is the data set to be split and the output data is the sub-data sets obtained by splitting. The first column of data set 3 includes the four character strings user, abc, def and ghi. The computer device splits data set 3 on the character strings in the first unit, for example placing the log data whose first unit is the character string user together, to obtain sub-data set 3.1 and sub-data set 3.2; the character strings in the first column of sub-data set 3.1 are all user.

In one possible implementation, when every column of a sub-data set has only one kind of character string, the computer device does not split that sub-data set; when any column of a sub-data set has more than one kind of character string, the computer device splits the sub-data set further. For example, the second column of sub-data set 3.2 has only the character string cluster and the third column has only the character string del, so the computer device does not split sub-data set 3.2. The third column of sub-data set 3.1 has multiple kinds of character strings, so the computer device can split sub-data set 3.1 to obtain sub-data sets 3.1.1, 3.1.2 and 3.1.3. Sub-data set 3.1 is split in the same manner as data set 3, which is not repeated here.

In one possible implementation, when the character string type of any column in any data set is 1, the computer device does not split the data set, for example, only one character string is on the first column, the second column and the third column of the data set 2 in fig. 4, and then the computer device does not split the data set 2.
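The split in step 305 amounts to grouping rows by the string they hold in the chosen column. The sketch below uses a hypothetical function name and invented rows, since the full contents of data set 3 in fig. 5 are not reproduced in the text.

```python
from collections import defaultdict

def split_by_column(data_set, col):
    """Split a data set into sub-data sets, one per distinct string in column `col`."""
    subsets = defaultdict(list)
    for row in data_set:
        subsets[row[col]].append(row)
    return dict(subsets)

# Hypothetical rows; only the strings user, abc, cluster and del come from the text.
data_set = [
    ["user", "a1", "add"],
    ["user", "a2", "delete"],
    ["abc", "cluster", "del"],
]
subsets = split_by_column(data_set, col=0)
for key, rows in subsets.items():
    print(key, rows)
```

Every sub-data set produced this way has a single kind of string in the column used for splitting, so that column is treated as a constant in later steps.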

306. For any one of the data sets, when the number of character string types on any one column of the data set is equal to the number of pieces of log data in the data set and only one character string is on each of the other columns, the computer device stores target data into the data set, the number of character strings in the target data being the same as the number of columns of the data set.

Any data set here may be a data set that has not been split or a sub-data set obtained by splitting. The target data is not log data generated by the user device but a piece of dummy data whose format is the same as that of the log data in the data set. The character strings in the target data may all be the same character string, for example /fake/fake/fake, or may be several different character strings, for example /one/two/three/four.

It should be noted that, in one possible implementation manner, every character string in the target data differs from every character string in the data set. In one possible implementation manner, the computer device stores a plurality of candidate character strings that are used to form the target data. Before storing the target data in the data set, the computer device checks each character string in the data set to detect whether any candidate character string appears in the data set, and forms the target data from candidate character strings that were not detected in the data set. Still taking data set 2 in fig. 4 as an example, the candidate character strings stored in the computer device include fake, one, two, three and four. The computer device inspects data set 2 and finds that the 1st column includes only the character string api, the 2nd column includes only the character string user, the 4th column includes only the character string status, and the 3rd column includes the four character strings zhang, wang, li and zhao. None of the candidate character strings appears in data set 2, so the computer device may select at least one of the candidate character strings fake, one, two, three and four to form the target data. For example, the computer device may store the target data /fake/fake/fake/fake in data set 2 to obtain data set 4.
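Forming target data from unused candidate strings might be sketched as follows. The function name build_target_data is hypothetical, and filling every position with the first unused candidate is just one of the choices the text permits.

```python
CANDIDATES = ["fake", "one", "two", "three", "four"]

def build_target_data(data_set, candidates=CANDIDATES):
    """Build one piece of dummy data with as many strings as the data set has
    columns, using only candidate strings that do not occur in the data set."""
    used = {s for row in data_set for s in row}
    free = [c for c in candidates if c not in used]
    if not free:
        raise ValueError("every candidate string already occurs in the data set")
    n_cols = len(data_set[0])
    # Assumption: repeat the first unused candidate in every position.
    return [free[0]] * n_cols

data_set_2 = [
    ["api", "user", "zhang", "status"],
    ["api", "user", "wang", "status"],
    ["api", "user", "li", "status"],
    ["api", "user", "zhao", "status"],
]
target = build_target_data(data_set_2)
data_set_4 = data_set_2 + [target]
print("/" + "/".join(target))  # /fake/fake/fake/fake
```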

In one possible implementation, the computer device stores a plurality of pieces of target data, each including a preset number of character strings. Before storing any piece of target data in a data set, the computer device acquires the number of columns of the data set and stores in it a first piece of target data whose number of character strings equals that number of columns. Still taking data set 2 in fig. 4 as an example, the computer device stores target data /fake/fake including 2 character strings, target data /fake/fake/fake including 3 character strings, and target data /fake/fake/fake/fake including 4 character strings. On detecting that data set 2 has 4 columns, the computer device may store the target data /fake/fake/fake/fake, which includes 4 character strings, in data set 2 to obtain data set 4.
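Selecting pre-stored target data by column count reduces to a lookup keyed on the number of columns. In this sketch the table STORED_TARGETS and the function pick_target_data are hypothetical names, and returning None when no stored piece fits is an assumption.

```python
# Pre-stored target data, keyed by the number of character strings each contains.
STORED_TARGETS = {
    2: ["fake", "fake"],
    3: ["fake", "fake", "fake"],
    4: ["fake", "fake", "fake", "fake"],
}

def pick_target_data(data_set, stored=STORED_TARGETS):
    """Return a copy of the stored target data whose length equals the number
    of columns of the data set, or None if no such piece is stored."""
    n_cols = len(data_set[0])
    target = stored.get(n_cols)
    return list(target) if target is not None else None

data_set_2 = [
    ["api", "user", "zhang", "status"],
    ["api", "user", "wang", "status"],
    ["api", "user", "li", "status"],
    ["api", "user", "zhao", "status"],
]
picked = pick_target_data(data_set_2)
print(picked)  # ['fake', 'fake', 'fake', 'fake']
```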

In one possible implementation, when only one column of the data set has more than one kind of character string, the number of pieces of log data in the data set is greater than a first preset value, and each of the other columns has only one character string, the target data is stored in the data set. Taking data set 2 as an example, data set 2 has 4 pieces of log data; when the first preset value is 3, the number of pieces of log data in data set 2 (4) is greater than the first preset value (3), so the computer device may store the target data /fake/fake/fake/fake in data set 2 to obtain data set 4. Taking sub-data set 3.1.1 as an example, the first column and the third column of sub-data set 3.1.1 each have only one character string, but sub-data set 3.1.1 has only 3 pieces of log data; the number of pieces of log data (3) is not greater than the first preset value (3), so the computer device does not store the target data /fake/fake/fake in sub-data set 3.1.1. In this way, target data is stored only in data sets that have enough log data; when a data set contains few pieces of log data, the target data need not be stored. For example, when a data set contains only one piece of log data, that piece can be used directly as an API template and no target data is stored in the data set.

It should be noted that, in the embodiment of the present application, the first preset value is not specifically limited.
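The storage condition can be sketched as a single check. The sketch below is illustrative only: `should_store_target_data` is a hypothetical name, each piece of log data is assumed to be already split into a list of character strings, and the contents of data set 2 are reconstructed from the description of data set 4 rather than copied from fig. 4.

```python
def should_store_target_data(rows, first_preset_value):
    """Check the storage conditions described above for one data set."""
    if not rows:
        return False
    # Number of distinct strings in each column.
    kinds = [len(set(column)) for column in zip(*rows)]
    # Exactly one column may vary; every other column then holds a single string.
    exactly_one_varying = sum(1 for k in kinds if k > 1) == 1
    # The data set must hold more pieces of log data than the first preset value.
    enough_log_data = len(rows) > first_preset_value
    return exactly_one_varying and enough_log_data


data_set_2 = [
    ["api", "user", "zhang", "status"],
    ["api", "user", "wang", "status"],
    ["api", "user", "li", "status"],
    ["api", "user", "zhao", "status"],
]

print(should_store_target_data(data_set_2, 3))      # 4 pieces, 4 > 3
print(should_store_target_data(data_set_2[:3], 3))  # 3 pieces, 3 is not > 3
```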

307. The computer device replaces the character string identified as a variable in the data set with a first character string based on the target data, resulting in at least one data template comprised of at least one piece of log data in the data set other than the target data.

The data set is a data set or sub-data set in which target data is stored, and in one possible implementation, the computer device may implement this step 307 by a process shown in steps 307A to 307E described below.

Step 307A, the computer device determines at least one column in the data set having a string category greater than 1.

In one possible implementation, the computer device counts the number of character string types in each column of the data set to obtain the number of character string types in each column of the data set, so as to determine at least one column with a character string type greater than 1 in the data set. Still taking the data set 4 in fig. 4 as an example, the first column of the data set 4 includes two strings of api and fake, the second column includes two strings of user and fake, the third column includes five strings of zhang, wang, li, zhao and fake, and the fourth column includes two strings of status and fake, and then each column in the data set 4 is a column with a string type greater than 1.
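Counting the kinds of character strings per column, as in step 307A, might look like the sketch below. The function name is hypothetical, columns are indexed from 0, and the rows of data set 4 are reconstructed from the text above.

```python
def columns_with_multiple_kinds(rows):
    """Map column index -> number of string kinds, for columns with more than one kind."""
    kinds = [len(set(column)) for column in zip(*rows)]
    return {index: count for index, count in enumerate(kinds) if count > 1}


data_set_4 = [
    ["api", "user", "zhang", "status"],
    ["api", "user", "wang", "status"],
    ["api", "user", "li", "status"],
    ["api", "user", "zhao", "status"],
    ["fake", "fake", "fake", "fake"],  # the stored target data
]

multi = columns_with_multiple_kinds(data_set_4)
print(multi)
```

With the target data present, every column of data set 4 has more than one kind of character string, and the third column (index 2) has five.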

Step 307B, the computer device uses the column with the least character string type in the at least one column as the second target column, and uses the column with the most character string type in the at least one column as the third target column.

In one possible implementation, the computer device determines the column with the fewest kinds of character strings among the at least one column by comparing the number of kinds of character strings in each of the at least one column. Still taking the example in step 307A, the first, second, and fourth columns of data set 4 have the same number of kinds of character strings, and that number is the smallest, so the computer device may take any one of the first, second, and fourth columns as the second target column. In some possible embodiments, the computer device takes, in left-to-right order, the leftmost of the columns with the fewest kinds of character strings as the second target column; here the computer device may take the first column as the second target column of data set 4. The third column has the most kinds of character strings, so the computer device takes the third column as the third target column of data set 4.
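The choice of the second and third target columns in step 307B can be sketched as below. The name `pick_target_columns` is hypothetical, columns are indexed from 0, and the leftmost-on-tie rule is the optional variant mentioned above. The sketch relies on Python's `min` and `max` returning the first of several equal items.

```python
def pick_target_columns(rows):
    """Return (second_target_column, third_target_column) as 0-based indices."""
    kinds = [len(set(column)) for column in zip(*rows)]
    # Only columns with more than one kind of string are candidates.
    candidates = [i for i, k in enumerate(kinds) if k > 1]
    if not candidates:
        raise ValueError("no column has more than one kind of string")
    # Fewest kinds -> second target column; ties resolve to the leftmost column.
    second = min(candidates, key=lambda i: kinds[i])
    # Most kinds -> third target column.
    third = max(candidates, key=lambda i: kinds[i])
    return second, third


data_set_4 = [
    ["api", "user", "zhang", "status"],
    ["api", "user", "wang", "status"],
    ["api", "user", "li", "status"],
    ["api", "user", "zhao", "status"],
    ["fake", "fake", "fake", "fake"],
]

second_col, third_col = pick_target_columns(data_set_4)
print(second_col, third_col)
```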

Step 307C, the computer device identifies the first target string on the second target column as a constant, identifies the plurality of second target strings on the third target column as variables, and replaces the plurality of second target strings with the first string, the first string corresponding to the plurality of strings on the third target column.

In one possible implementation, the computer device establishes a one-to-many correspondence between the character strings on the second target column and the character strings on the third target column, identifies the character string on the "one" side of the correspondence as a constant, and identifies the character strings on the "many" side as variables; that is, the character string on the "one" side is the first target character string, and the character strings on the "many" side are the second target character strings. Still taking the example in step 307B, the character string api on the second target column (the first column) of data set 4 corresponds to the character strings zhang, wang, li and zhao on the third target column (the third column), so the computer device identifies api as a constant and takes it as the first target character string. The character string fake on the second target column (the first column) of data set 4 corresponds only to the character string fake on the third target column (the third column) and not to a plurality of character strings on the third target column, so the computer device by default identifies the character string fake on both the second target column and the third target column as a constant.

Step 307D, the computer device deletes the target data in the data set.

Since the target data is not log data generated by the user equipment but dummy data, the computer device needs to delete the target data after completing variable identification, so as to prevent the target data from being acquired as a data template, thereby improving the accuracy of the acquired data templates.

Still taking the example in step 307C, the computer device identifies the character strings zhang, wang, li and zhao on the third target column, which correspond to the first target character string api, as variables and takes them as the second target character strings, replaces each of zhang, wang, li and zhao with the first character string { variable }, and deletes the target data /fake/fake/fake/fake from data set 4 to obtain data set 5.
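Steps 307C and 307D can be sketched together as follows. This is an illustrative reading of the one-to-many rule, not the implementation of the embodiment: the function name, the 0-based column indices, and the way the target data is passed in are all assumptions.

```python
from collections import defaultdict

VARIABLE = "{ variable }"  # the "first character string" used as a placeholder


def replace_variables(rows, second_col, third_col, target_row):
    """Replace one-to-many strings on third_col with VARIABLE and drop target_row."""
    # For each string on the second target column, collect the strings
    # it co-occurs with on the third target column.
    mapping = defaultdict(set)
    for row in rows:
        mapping[row[second_col]].add(row[third_col])

    result = []
    for row in rows:
        if row == target_row:
            continue  # step 307D: the dummy target data is deleted
        new_row = list(row)
        # One-to-many: the "many" side is treated as a variable.
        if len(mapping[row[second_col]]) > 1:
            new_row[third_col] = VARIABLE
        result.append(new_row)
    return result


data_set_4 = [
    ["api", "user", "zhang", "status"],
    ["api", "user", "wang", "status"],
    ["api", "user", "li", "status"],
    ["api", "user", "zhao", "status"],
    ["fake", "fake", "fake", "fake"],
]

data_set_5 = replace_variables(data_set_4, 0, 2, ["fake", "fake", "fake", "fake"])
for row in data_set_5:
    print("/" + "/".join(row))
```

Because fake maps only to fake, the target data never triggers a replacement by itself; its role is to give the data set a second column with more than one kind of character string.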

Step 307E, the computer device obtains at least one data template, where the at least one data template is at least one log data different from each other in the data group after deleting the target data.

At least one piece of log data different from each other means at least one kind of log data. In one possible implementation, the computer device deduplicates the log data in the data set to obtain at least one piece of log data different from each other, and takes it as at least one data template. Still taking the example in step 307D, data set 5 is the data set after the target data is deleted, and the four pieces of log data in data set 5 are all the same log data /api/user/{ variable }/status, so the computer device obtains the log data /api/user/{ variable }/status as a data template; when the variable in the data template corresponded, before being identified, to a service request made through an API, the data template may be used as an API template. If data set 5 contains at least one kind of log data, the computer device may take each kind of log data as a data template.
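The deduplication in step 307E can be sketched as an order-preserving pass over the rows. The function name is hypothetical, and joining the strings with "/" is an assumption based on the path format of the examples.

```python
def deduplicate_to_templates(rows):
    """Return the distinct rows, in first-seen order, rendered as '/'-joined templates."""
    seen = set()
    templates = []
    for row in rows:
        key = tuple(row)  # lists are not hashable, tuples are
        if key not in seen:
            seen.add(key)
            templates.append("/" + "/".join(row))
    return templates


data_set_5 = [["api", "user", "{ variable }", "status"] for _ in range(4)]
templates = deduplicate_to_templates(data_set_5)
print(templates)
```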

In a possible implementation manner, the computer device may further obtain a data template of log data by means of continuous grouping, and in particular, referring to fig. 6, fig. 6 is a schematic diagram of a data group splitting method provided by an embodiment of the present application, where the grouping method may be implemented by a procedure shown in the following steps 307F to 307H.

Step 307F, whenever the first target column of the data set has different kinds of character strings, the computer device replaces a second character string in the first target column with the first character string, where the probability of occurrence of the second character string in the first target column is less than or equal to a preset value.

The first target column may be any column of the data group that has different kinds of character strings, or it may be the leftmost column, in left-to-right order, whose number of kinds of character strings is greater than 1. The embodiment of the application is described by taking as an example the case where the first target column is any column with different kinds of character strings in the data set, and the preset value is not specifically limited.

The probability of occurrence of a second string in the first target column is the ratio of the number of second strings on the first target column to the number of all strings on the first target column.

In one possible implementation, the computer device counts the kinds of character strings on each column of the data set, takes any column with more than one kind of character string as the first target column, and calculates the probability of occurrence of each character string on the first target column. When the probability of a character string on the first target column is less than or equal to the preset value, the computer device identifies the character string as a variable and replaces it with the first character string; when the probability is greater than the preset value, the computer device identifies the character string as a constant and keeps it unchanged. For example, in fig. 6, data set 4 is the data set to be split, the preset value is 0.3, and the third column of data set 4 is taken as the first target column. The character strings zhang, wang, li, zhao and fake each occur in the third column with a probability of 0.2, which is less than 0.3, so the computer device may replace the character strings zhang, wang, li, zhao and fake with the first character string { variable } to obtain the replaced data set 4.
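The probability-based replacement of step 307F can be sketched as below. The function name `replace_rare_strings` and the 0-based column index are assumptions; the probability is computed as the count of a string on the column divided by the number of rows, as defined above.

```python
from collections import Counter

VARIABLE = "{ variable }"  # the "first character string"


def replace_rare_strings(rows, col, preset_value):
    """Replace strings on column col whose occurrence probability is <= preset_value."""
    counts = Counter(row[col] for row in rows)
    total = len(rows)
    replaced = []
    for row in rows:
        new_row = list(row)
        if counts[row[col]] / total <= preset_value:
            new_row[col] = VARIABLE  # rare string: treated as a variable
        replaced.append(new_row)     # frequent string: kept as a constant
    return replaced


data_set_4 = [
    ["api", "user", "zhang", "status"],
    ["api", "user", "wang", "status"],
    ["api", "user", "li", "status"],
    ["api", "user", "zhao", "status"],
    ["fake", "fake", "fake", "fake"],
]

# Third column (index 2): every string occurs once in five rows, 0.2 <= 0.3.
replaced_4 = replace_rare_strings(data_set_4, 2, 0.3)
for row in replaced_4:
    print(row)
```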

Step 307G, whenever the replaced first target column includes a third character string, the computer device splits the data set based on the third character string in the first target column to obtain at least one sub-data set, where the probability of occurrence of the third character string in the first target column is greater than the preset value.

In one possible implementation, after replacing the second character strings on the first target column with the first character string, the computer device obtains the probability of occurrence of each character string on the first target column, takes each character string whose probability is greater than the preset value as a third character string, splits the log data containing the third character string into one sub-data group, and splits the log data containing character strings other than the third character string on the first target column into another sub-data group. Taking the example in step 307F, the first target column (the third column) of the replaced data set 4 has only the character string { variable }, so the computer device does not split the data set based on the character strings on the replaced first target column and may re-execute steps 307F-307G. That is, the computer device may take the second column of the replaced data set 4 as the first target column. The second column includes the character strings user and fake; the probability of occurrence of fake in the second column is 0.2, which is less than the preset value 0.3, so the computer device takes fake as the second character string of the second column and replaces it with the first character string { variable } to obtain data set 4'. The probability of occurrence of user in the second column of data set 4' is 0.8, which is greater than the preset value 0.3, so the computer device takes user as the third character string and splits data set 4' into one sub-data group holding the log data whose second column is user and another sub-data group holding the remaining log data, obtaining sub-data sets 4'.1 and 4'.2.
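The split of step 307G can be sketched as follows. The function name is hypothetical, the column index is 0-based, and data set 4' is written out by hand from the example above (third column already replaced, fake in the second column replaced). The order of the returned groups is an artifact of this sketch and is not meant to match the numbering of the sub-data sets in fig. 6.

```python
from collections import Counter

VARIABLE = "{ variable }"


def split_by_third_strings(rows, col, preset_value):
    """Split rows on column col: one group per frequent string, one group for the rest."""
    counts = Counter(row[col] for row in rows)
    total = len(rows)
    groups = {}  # third string -> rows that contain it
    rest = []    # rows whose string on col is not a third string
    for row in rows:
        if counts[row[col]] / total > preset_value:
            groups.setdefault(row[col], []).append(row)
        else:
            rest.append(row)
    sub_groups = list(groups.values())
    if rest:
        sub_groups.append(rest)
    return sub_groups


data_set_4_prime = [
    ["api", "user", VARIABLE, "status"],
    ["api", "user", VARIABLE, "status"],
    ["api", "user", VARIABLE, "status"],
    ["api", "user", VARIABLE, "status"],
    ["fake", VARIABLE, VARIABLE, "fake"],  # what remains of the target data
]

# Second column (index 1): user has probability 0.8 > 0.3 and is a third string.
sub_groups = split_by_third_strings(data_set_4_prime, 1, 0.3)
for group in sub_groups:
    print(group)
```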

Step 307H, the computer device obtains at least one data template from the sub-data set, the at least one data template being comprised of at least one piece of log data in the sub-data set.

In one possible implementation, the computer device checks the kinds of log data in the sub-data group. When the number of kinds of log data in the sub-data group is equal to the number of pieces of log data in the sub-data group, the computer device acquires all the log data in the sub-data group as data templates; when the number of kinds of log data in the sub-data group is smaller than the number of pieces of log data in the sub-data group, the computer device deduplicates the log data in the sub-data group to obtain at least one piece of log data different from each other and acquires it as at least one data template. Still taking the example in step 307G, sub-data set 4'.1 contains only one piece of log data, /fake/{ variable }/{ variable }/fake, so each column of sub-data set 4'.1 has only one character string. However, this piece of log data was the target data before replacement; the target data is added dummy data rather than real log data acquired by the computer device from the log collection system, so the log data /fake/{ variable }/{ variable }/fake cannot be used as a data template. Sub-data set 4'.2 contains only one kind of log data, so the computer device takes the log data /api/user/{ variable }/status as the data template of the log data in data set 4.

By continuously splitting the data set and replacing the second character strings in the sub-data sets with the first character string, there is no need to establish a one-to-many correspondence between the character strings on two columns whose number of kinds of character strings is greater than 1. This avoids the problem that the API template cannot be acquired because the second target column and the third target column of the data set cannot be determined at the same time.

According to the method provided by the embodiment of the application, the target data is stored in the data group with a plurality of character strings on one column and only one character string on the other column, so that the data group is provided with two columns with the character string types larger than 1, and further the computer equipment can identify the variables of the log data in the data group based on the target data, so that the data template is obtained, and when the variables in the data template are corresponding to the API service request before being identified, the data template can be used as the API template, and the problem that the API template cannot be obtained due to the fact that the second target column and the third target column of the data group cannot be determined at the same time can be solved. In addition, by identifying specific variables in log data in advance, only non-specific variables are needed to be identified in the follow-up variable identification process, and therefore variable identification efficiency is improved. And by continuously splitting the data set and replacing the second character string in the sub data set with the first character string, the one-to-many correspondence relation between the character strings on the two columns with the character string types larger than 1 in the data set can be avoided, so that the simultaneous determination of the second target column and the third target column of the data set can be avoided, and the problem that the API template cannot be acquired because the second target column and the third target column of the data set cannot be determined at the same time can be avoided.

Fig. 3 shows a process of acquiring a data template by storing target data in a data set. In one possible implementation, the computer device need not store target data in the data set and may instead split the data set continuously to acquire the data template. Specifically, referring to fig. 7, fig. 7 is a flowchart of a data template acquisition method provided by an embodiment of the present application, and the method includes the following steps.

701. The computer device obtains at least one log file, and obtains a plurality of pieces of log data from the at least one log file.

Step 701 follows the same principle as step 301, and the specific process of step 701 is not described herein.

702. The computer equipment groups a plurality of pieces of log data to obtain a plurality of data groups, the number of character strings of the log data included in each data group is the same, and the character strings at the same position of the plurality of pieces of log data in one data group form a column of the data group.

The step 702 is the same as the step 302, and the specific process of the step 702 is not repeated here in the embodiment of the present application.
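The grouping in step 702 can be sketched as below. This is only an illustration: the function name is hypothetical, and splitting each piece of log data on "/" is an assumption taken from the path-style examples in this description.

```python
from collections import defaultdict


def group_by_string_count(log_lines):
    """Group log data by number of strings; each group is a list of rows."""
    groups = defaultdict(list)
    for line in log_lines:
        # "/api/user/zhang/status" -> ["api", "user", "zhang", "status"]
        parts = [p for p in line.split("/") if p]
        groups[len(parts)].append(parts)
    return dict(groups)


log_lines = [
    "/api/user/zhang/status",
    "/abc/cluster/del",
    "/api/user/wang/status",
]

groups = group_by_string_count(log_lines)

# Strings at the same position across the rows of one group form a column.
third_column = list(zip(*groups[4]))[2]
print(sorted(groups), third_column)
```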

703. When the format of any character string in any data set meets the preset condition, the computer equipment replaces the character string with the first character string.

Step 703 follows the same principle as step 303, and the specific process of step 703 is not described herein.

704. For any one of the data sets, whenever a first target column of the data set has different kinds of character strings, the computer device replaces a second character string in the first target column with the first character string, where the probability of occurrence of the second character string in the first target column is less than or equal to a preset value.

The first target column may be any column having different kinds of character strings in the data group, and the first target column may be a column having a leftmost character string kind greater than 1 in the data group in order from left to right in the data group. The embodiment of the present application is described by taking a case where the first target column is a column in which the leftmost character string type in the data group is greater than 1 as an example.

In one possible implementation, the computer device obtains the kinds of character strings on each column of the data set, takes the leftmost column that has different kinds of character strings as the first target column, obtains the probability of occurrence of each character string on the first target column, and replaces a character string with the first character string when its probability of occurrence on the first target column is less than or equal to the preset value. For example, fig. 8 is a schematic diagram of a data template acquisition method provided in the embodiment of the present application. Data set 6 contains 6 pieces of log data, the first column of data set 6 is taken as its first target column, and the preset value is 0.3. The character strings abc, def and ghi each occur in the first target column with a probability of 0.167, which is less than the preset value 0.3, so the computer device takes abc, def and ghi as second character strings and can replace them with the first character string { variable }. The character string user occurs in the first target column with a probability of 0.5, which is greater than the preset value 0.3, so the computer device does not replace user.

705. Whenever the replaced first target column has a third character string, the computer device splits the data set based on the third character string in the first target column to obtain at least one sub-data set, where the probability of occurrence of the third character string in the first target column is greater than the preset value.

In one possible implementation, after replacing the second character strings on the first target column with the first character string, the computer device obtains the probability of occurrence of each character string on the first target column, takes each character string whose probability is greater than the preset value as a third character string, splits the log data containing the third character string into one sub-data group, and splits the log data containing character strings other than the third character string on the first target column into another sub-data group. Taking the example in step 704, the replaced first target column of data set 6 includes the character strings { variable } and user, each of which occurs in the first target column with a probability of 0.5, which is greater than the preset value 0.3, so { variable } and user are both third character strings. The computer device splits the log data based on the third character strings { variable } and user to obtain sub-data sets 6.1 and 6.2, so that the character strings on the first column of sub-data set 6.1 are all { variable } and the character strings on the first column of sub-data set 6.2 are all user.

706. The computer device obtains at least one data template from the sub-data set, the at least one data template being comprised of at least one piece of log data in the sub-data set.

In one possible implementation, the computer device detects a type of log data in the sub-data group, and when the number of types of log data in the sub-data group is equal to the number of pieces of log data in the sub-data group, the computer device acquires all log data in the sub-data group as a data template; when the number of types of log data in the sub-data group is smaller than the number of log data in the sub-data group, the computer equipment performs de-duplication on the log data in the sub-data group to obtain at least one log data which is different from each other, and the computer equipment acquires the at least one log data which is different from each other as a data template. Still taking the example in step 705 as an example, where there is only one character string on each column in the sub-data set 6.1, i.e. there is only one kind of log data in the sub-data set 6.1, the computer device may directly obtain such log data as the data template 1.

The second column of sub-data set 6.2 includes two kinds of character strings, so the computer device may execute steps 704 to 706 on sub-data set 6.2 to obtain data templates. Specifically, the computer device takes the second column of sub-data set 6.2 as the first target column. The probability of occurrence of the character string update on the first target column is 0.33, which is greater than the preset value 0.3, and the probability of occurrence of the character string { variable } on the first target column is 0.67, which is also greater than the preset value 0.3, so the computer device does not replace the character strings update and { variable }. Because both probabilities are greater than the preset value 0.3, the character strings update and { variable } are third character strings on the first target column, and the computer device can split sub-data set 6.2 based on them to obtain sub-data sets 6.2.1 and 6.2.2. Sub-data set 6.2.1 includes only one piece of log data, so the computer device can obtain that log data as a data template.

The third column of sub-data set 6.2.2 includes two kinds of character strings, so the computer device executes steps 704-706 on sub-data set 6.2.2, taking its third column as the first target column. The character strings add and end each occur on the first target column with a probability of 0.5, which is greater than the preset value 0.3, so add and end are third character strings on the first target column and the computer device does not replace them. The computer device can split sub-data set 6.2.2 based on the character strings add and end to obtain sub-data sets 6.2.2.1 and 6.2.2.2, where the third column of sub-data set 6.2.2.1 includes only the character string add and the third column of sub-data set 6.2.2.2 includes only the character string end. The computer device can then obtain the log data in sub-data set 6.2.2.1 and the log data in sub-data set 6.2.2.2 as data templates.

It should be noted that each replacement or splitting operation that the computer device performs on one column of a data set may be regarded as one iteration. In one possible implementation, after iterating on the data set a first preset number of times, the computer device obtains the data template of the log data in the data set based on the iteration result. Specifically, still taking fig. 8 as an example, the first iteration on data set 6 yields sub-data sets 6.1 and 6.2. In the second iteration on sub-data set 6.1 (iteration 2), the character string cluster on the second column of sub-data set 6.1 occurs with a probability of 1, which is greater than the preset value 0.3, so the computer device does not replace cluster; and because the second column has only this one character string, the computer device does not split sub-data set 6.1 in iteration 2. In the third iteration on sub-data set 6.1 (iteration 3), the computer device likewise neither replaces nor splits, so sub-data set 6.1 is unchanged from the result of iteration 1, and the computer device can obtain the log data in sub-data set 6.1 as a data template.

It should be noted that continuously splitting the data set can improve the convergence of the data templates. For example, acquiring the data templates of data set 6 in the prior-art manner yields 6 data templates: /abc/cluster/del, /def/cluster/del, /ghi/cluster/del, /user/update/{ variable }, /user/{ variable }/add and /user/{ variable }/end. With the method provided by the embodiment of the present application, the acquired data templates of data set 6 are the 4 templates /{ variable }/cluster/del, /user/update/{ variable }, /user/{ variable }/add and /user/{ variable }/end.
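Steps 704 to 706 applied repeatedly to data set 6 can be sketched as a short recursive routine. This is one possible reading of the procedure, not the implementation of the embodiment: the function name is hypothetical, the first target column is taken to be the leftmost column with more than one kind of character string, recursion stands in for the iteration count, and the rows of data set 6 are reconstructed from the templates listed above.

```python
from collections import Counter

VARIABLE = "{ variable }"  # the "first character string"


def extract_templates(rows, preset_value):
    """Recursively replace rare strings and split on the remaining ones."""
    # First target column: leftmost column with more than one kind of string.
    col = next((i for i, c in enumerate(zip(*rows)) if len(set(c)) > 1), None)
    if col is None:
        # Every column holds a single string, so all rows are one template.
        return ["/" + "/".join(rows[0])]

    # Step 704: strings with probability <= preset_value become VARIABLE.
    counts = Counter(row[col] for row in rows)
    total = len(rows)
    replaced = []
    for row in rows:
        new_row = list(row)
        if counts[row[col]] / total <= preset_value:
            new_row[col] = VARIABLE
        replaced.append(new_row)

    # Step 705: split on the strings left on the column. Only VARIABLE can
    # still be at or below the threshold, so it forms the "rest" group.
    groups = {}
    for row in replaced:
        groups.setdefault(row[col], []).append(row)

    # Step 706: collect templates from every sub-data set.
    templates = []
    for sub_rows in groups.values():
        templates.extend(extract_templates(sub_rows, preset_value))
    return templates


data_set_6 = [
    ["abc", "cluster", "del"],
    ["def", "cluster", "del"],
    ["ghi", "cluster", "del"],
    ["user", "update", VARIABLE],
    ["user", VARIABLE, "add"],
    ["user", VARIABLE, "end"],
]

templates = extract_templates(data_set_6, 0.3)
for template in templates:
    print(template)
```

Each split leaves the chosen column with a single string in every sub-data set, so the first target column moves strictly to the right and the recursion ends after at most one pass per column.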

According to the method provided by the embodiment of the application, the second character string in the data group is replaced with the first character string, and the data group is continuously split based on the third character string, so that the log data in the sub-data groups obtained by the final splitting are different from each other and are used as the templates of the log data in the data group. The second target column and the third target column of the data group therefore need not be determined at the same time, which avoids the problem that an API template cannot be obtained because the two columns cannot be determined at the same time. In addition, by identifying specific variables in log data in advance, only non-specific variables need to be identified in the subsequent variable identification process, which improves variable identification efficiency. In addition, the convergence of the data templates can be improved.

Fig. 9 is a schematic structural diagram of a data template acquiring device according to an embodiment of the present application, where the device includes: a grouping module 901, a storage module 902, and an acquisition module 903.

A grouping module 901, configured to execute the step 302;

a storage module 902, configured to perform step 306;

the obtaining module 903 is configured to perform step 307.

Optionally, the apparatus further comprises:

a replacement module, configured to perform step 303 described above.

Optionally, the obtaining module is configured to perform the steps 307F-307H described above.

optionally, the apparatus further comprises:

the de-duplication module is used for de-duplication the data set to ensure that log data in the de-duplicated data set are different from each other;

the obtaining module 903 is further configured to obtain at least one piece of log data in the deduplicated data set as at least one data template.

Optionally, the storage module 902 is configured to:

Fig. 10 is a schematic structural diagram of a data template acquiring device according to an embodiment of the present application, where the device includes: a grouping module 1001, a replacing module 1002, a splitting module 1003, and an acquiring module 1004.

A grouping module 1001, configured to perform the step 702;

a replacement module 1002, configured to perform step 704;

a splitting module 1003, configured to execute step 705;

the obtaining module 1004 is configured to execute step 706.

Optionally, the replacement module is further configured to perform step 703 described above.

Any combination of the above-mentioned optional solutions may be adopted to form an optional embodiment of the present disclosure, which is not described herein in detail.

It should be noted that: the apparatus for obtaining a data template according to the above embodiment is only exemplified by the division of the above functional modules when obtaining the data template, and in practical application, the above functional allocation may be performed by different functional modules according to needs, that is, the internal structure of the apparatus is divided into different functional modules to perform all or part of the functions described above. In addition, the device for acquiring the data template provided in the above embodiment and the method embodiment for acquiring the data template belong to the same concept, and the specific implementation process is detailed in the method embodiment, which is not repeated here.

It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program for instructing relevant hardware, where the program may be stored in a computer readable storage medium, and the above storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.

The foregoing description of the preferred embodiments of the application is not intended to limit the application to the precise form disclosed, and any such modifications, equivalents, and alternatives falling within the spirit and scope of the application are intended to be included within the scope of the application.

Claims

1. A method for obtaining a data template, the method comprising:

for any one data group, when the number of character string types on any column of the data group is equal to the number of the log data in the data group and only one character string is arranged on each column in other columns, storing target data into the data group, wherein the number of character strings in the target data is the same as the column number of the data group;

each time when a first target column of the data set has different kinds of character strings, replacing a second character string in the first target column with a first character string, wherein the occurrence probability of the second character string in the first target column is smaller than or equal to a preset numerical value;

Splitting the data set based on a third character string in the first target column to obtain at least one sub-data set when the replaced first target column has the third character string, wherein the occurrence probability of the third character string in the first target column is larger than the preset numerical value;

at least one data template is obtained from the sub-data group, and the at least one data template consists of at least one piece of log data except the target data in the sub-data group.

2. The method of claim 1, wherein after grouping the plurality of pieces of log data to obtain the plurality of data sets, the method further comprises:

when the format of any character string in any data set meets a preset condition, replacing the character string with the first character string.
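
As a non-limiting illustration of the format-based replacement in claim 2, the sketch below masks strings whose format matches a pattern. The claims do not state what the preset condition is; the two patterns used here (all digits, and a UUID-like layout), the wildcard `"*"`, and the name `mask_by_format` are assumptions chosen only for the example.

```python
import re

WILDCARD = "*"  # hypothetical stand-in for the "first character string"

# Assumed "preset conditions": formats that usually indicate a variable.
VARIABLE_PATTERNS = [
    re.compile(r"\d+"),
    re.compile(
        r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}"
        r"-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
    ),
]


def mask_by_format(group):
    """Return a copy of the group with format-matching strings replaced."""
    return [
        [
            WILDCARD if any(p.fullmatch(s) for p in VARIABLE_PATTERNS) else s
            for s in row
        ]
        for row in group
    ]


masked = mask_by_format([
    ["GET", "users", "42"],
    ["GET", "users", "123e4567-e89b-12d3-a456-426614174000"],
])
```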

3. The method of claim 1, wherein after replacing the character string identified as the variable in the data set with the first character string based on the target data, the method further comprises:

4. The method of claim 1, wherein storing the target data into the data set when the number of character string categories on any one column of the data set is equal to the number of pieces of log data in the data set and there is only one character string on each of the other columns, comprises:

when the number of character string types in only one column of the data set is greater than 1, the number of pieces of log data in the data set is greater than a first preset value, and there is only one character string on each of the other columns, storing the target data into the data set.
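
The storing conditions of claims 1 and 4 can be checked as in the following sketch, which is offered only as an assumption-laden illustration. The claims do not specify the content of the target data, so the caller supplies it here; the function names and the example target row are hypothetical.

```python
def column_type_counts(group):
    """Number of distinct character strings in each column of the group."""
    return [len(set(column)) for column in zip(*group)]


def one_column_all_distinct(group):
    """Claim 1 condition: one column has as many string types as there are
    rows, and every other column holds a single string."""
    counts = column_type_counts(group)
    n_rows = len(group)
    return any(
        c == n_rows and all(o == 1 for j, o in enumerate(counts) if j != i)
        for i, c in enumerate(counts)
    )


def one_column_varies(group, min_rows):
    """Claim 4 condition: exactly one column has more than one string type
    and the group holds more than `min_rows` rows."""
    counts = column_type_counts(group)
    return sum(1 for c in counts if c > 1) == 1 and len(group) > min_rows


def store_target_data(group, target):
    """Append the target data; it must have one string per column."""
    if len(target) != len(group[0]):
        raise ValueError("target data must match the column count")
    return group + [list(target)]


group = [["GET", "/a", "1"], ["GET", "/a", "2"], ["GET", "/a", "3"]]
if one_column_all_distinct(group):
    # The target row below is purely hypothetical placeholder content.
    group = store_target_data(group, ["GET", "/target", "0"])
```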

5. A method for obtaining a data template, the method comprising:

for any one data set, when a first target column of the data set has different kinds of character strings, replacing a second character string in the first target column with a first character string, wherein the occurrence probability of the second character string in the first target column is smaller than or equal to a preset numerical value;

obtaining at least one data template from the sub-data group, wherein the at least one data template consists of at least one piece of log data in the sub-data group.

6. The method of claim 5, wherein the grouping the plurality of pieces of log data to obtain a plurality of data sets, the method further comprises:

7. A data template acquisition device, the device comprising:

a grouping module, configured to group a plurality of pieces of log data to obtain a plurality of data groups, wherein the pieces of log data included in each data group have the same number of character strings, and the character strings at the same position of the plurality of pieces of log data in one data group form a column of the data group;

a storage module, configured to store, for any one data set, target data into the data set when the number of character string types on any column of the data set is equal to the number of log data in the data set, and only one character string is on each of other columns, where the number of character strings in the target data is the same as the number of columns of the data set;

an acquisition module, configured to: replace a second character string in a first target column of the data set with a first character string each time the first target column has different kinds of character strings, wherein the occurrence probability of the second character string in the first target column is less than or equal to a preset numerical value; split the data set based on a third character string in the first target column to obtain at least one sub-data group when the first target column after replacement has the third character string, wherein the occurrence probability of the third character string in the first target column is greater than the preset numerical value; and obtain at least one data template from the sub-data group, wherein the at least one data template consists of at least one piece of log data, other than the target data, in the sub-data group.
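
For illustration only, the grouping performed by the grouping module of claim 7 may be sketched as below. Splitting each piece of log data on whitespace and the function names `group_by_length` and `columns` are assumptions of this example; the claim only requires that the log data in one data group have the same number of character strings.

```python
from collections import defaultdict


def group_by_length(log_lines):
    """Group log lines by their number of character strings (tokens)."""
    groups = defaultdict(list)
    for line in log_lines:
        tokens = line.split()  # assumed tokenisation: split on whitespace
        groups[len(tokens)].append(tokens)
    return dict(groups)


def columns(group):
    """Strings at the same position across the rows form one column."""
    return [list(column) for column in zip(*group)]


groups = group_by_length([
    "GET /users 200",
    "POST /orders 201",
    "GET /health",
])
first_column = columns(groups[3])[0]
```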

8. The apparatus of claim 7, wherein the apparatus further comprises:

a replacement module, configured to replace any character string in any data set with the first character string when the format of the character string meets a preset condition.

9. The apparatus of claim 7, wherein the apparatus further comprises:

a de-duplication module, configured to de-duplicate the data sets so that the pieces of log data in each de-duplicated data set are different from one another;

and the acquisition module is further configured to acquire at least one piece of log data in the de-duplicated data set as at least one data template.

10. The apparatus of claim 7, wherein the storage module is configured to:

11. A data template acquisition device, the device comprising:

a replacing module, configured to replace, for any one data set, when a first target column of the data set has different kinds of character strings, a second character string in the first target column with a first character string, where a probability of occurrence of the second character string in the first target column is less than or equal to a preset numerical value;

a splitting module, configured to split the data set based on a third character string in the first target column to obtain at least one sub-data group when the first target column after replacement has the third character string, wherein the occurrence probability of the third character string in the first target column is greater than the preset numerical value;

and an acquisition module, configured to obtain at least one data template from the sub-data group, wherein the at least one data template consists of at least one piece of log data in the sub-data group.

12. The apparatus of claim 11, wherein the replacement module is further configured to:

13. A computer device comprising a processor and a memory having stored therein at least one instruction that is loaded and executed by the processor to implement the operations performed by the data template acquisition method of any one of claims 1 to 6.

14. A computer-readable storage medium having stored therein at least one instruction that is loaded and executed by a processor to implement the operations performed by the data template acquisition method of any one of claims 1 to 6.