CN116956083A

CN116956083A - Data processing method and device

Info

Publication number: CN116956083A
Application number: CN202310855609.6A
Authority: CN
Inventors: 朱浩文; 陈明; 张园超; 余锋
Original assignee: Zhejiang eCommerce Bank Co Ltd
Current assignee: Zhejiang eCommerce Bank Co Ltd
Priority date: 2021-04-20
Filing date: 2021-04-20
Publication date: 2023-10-27
Also published as: CN113111951B; CN113111951A

Abstract

The embodiment of the specification provides a data processing method and a device, wherein the data processing method comprises the following steps: the method comprises the steps of collecting historical access flow data of a server, wherein the historical access flow data comprises data identifications of access data and user identifications of access users, the data identifications are used for indexing the access data, clustering the user identifications of the access users according to association relations between the user identifications and the data identifications, generating clustering results respectively corresponding to the data identifications, and carrying out data category labeling on the access data according to the clustering results.

Description

Data processing method and device

This application is a divisional application of the application entitled "Data processing method and device", application number 202110423768.X, filed on April 20, 2021.

Technical Field

The embodiment of the specification relates to the technical field of computers, in particular to a data processing method. One or more embodiments of the present specification also relate to a data processing apparatus, a computing device, and a computer-readable storage medium.

Background

A Web application is an application based on the browser/server architecture, a type of application that has evolved along with Web technology. A Web application comprises a plurality of static pages; besides displaying information, it can perform processing operations on data by calling different business logic interfaces from those pages. However, like conventional computer applications, Web applications inevitably contain some vulnerabilities because of security policy flaws introduced during development. The override vulnerability (unauthorized access vulnerability) is one of the common business logic vulnerabilities found when testing Web applications. It arises because the server places excessive trust in the data operation requests submitted by the client-side Web application and neglects to check the operation permission of the requester.

Because of such programming defects, and because URL input parameters are often guessable, changing an input parameter value can cause horizontal unauthorized access, resulting in leakage of users' private information. Existing detection of override vulnerabilities relies mainly on replacing user authentication information; this approach is inefficient, has a high false alarm rate, and yields inaccurate detection results.

Disclosure of Invention

In view of this, the present embodiments provide a data processing method. One or more embodiments of the present specification are also directed to a data processing apparatus, a computing device, and a computer-readable storage medium, which address the technical deficiencies of the prior art.

According to a first aspect of embodiments of the present specification, there is provided a data processing method, including:

collecting historical access flow data of a server, wherein the historical access flow data comprises a data identifier of access data and a user identifier of an access user, and the data identifier is used for indexing the access data;

clustering the user identifications of the access users according to the association relation between the user identifications and the data identifications, and generating clustering results respectively corresponding to the plurality of data identifications;

and labeling the data category of the access data according to the clustering result.

Optionally, after the collecting the historical access flow data of the server, the method further includes:

splitting the historical access flow data;

determining hit results of access data contained in the split results on at least one preset data screening rule;

and screening the access data according to the hit result, and marking the access data contained in the screening result by utilizing the target character corresponding to the data identifier.

Optionally, the performing category labeling on the access data according to the clustering result includes:

performing de-duplication processing on the user identifiers in the clustering result corresponding to the target data identifiers, and comparing the number of the target user identifiers contained in the de-duplication processing result with a first preset number threshold;

if the number of the target user identifications contained in the de-duplication processing result is smaller than or equal to the first preset number threshold, determining that the access data associated with the target data identifications are private data; or,

if the number of the target user identifications contained in the duplication removal result is determined to be larger than the first preset number threshold, determining that the access data associated with the target data identifications are public data; wherein the target data identifier is one of the plurality of data identifiers.

Optionally, the data processing method further includes:

and establishing a mapping relation table between the data identification of the access data and the data category according to the data category labeling result.

Optionally, the data processing method further includes:

acquiring access flow data to be detected;

and determining the data category of the data to be accessed according to the data identifier of the data to be accessed contained in the access flow data and the mapping relation table.

Optionally, the data processing method further includes:

and detecting whether the server has an unauthorized vulnerability according to the data type of the data to be accessed.

Optionally, the labeling the access data according to the clustering result includes:

performing de-duplication processing on user identifiers in a clustering result corresponding to the target data identifiers, and determining the number of the target user identifiers contained in the de-duplication processing result, wherein the target data identifiers are one of the plurality of data identifiers;

screening the clustering results according to the quantity to obtain target clustering results;

and marking the data category of the data identifier corresponding to the target clustering result.

Optionally, the data processing method further includes:

inputting the access data contained in the labeling result, together with the user identifications associated with the access data, into a data category labeling model to be trained, and training it to obtain the data category labeling model.

Optionally, the data processing method further includes:

acquiring access flow data to be detected;

inputting the data to be accessed contained in the access flow data into the data category labeling model to label the data category, and generating a data category labeling result of the data to be accessed.

Optionally, the labeling the data category of the access data according to the clustering result includes:

performing de-duplication processing on the user identifiers in the clustering result corresponding to the target data identifiers, and determining the number of the target user identifiers contained in the de-duplication processing result;

determining an access interface of the access data corresponding to the target data identifier;

taking the average value of the reciprocal of the number of target user identifications in the duplicate removal processing results corresponding to different data identifications under the access interface;

performing category labeling on the access data according to the average value; wherein the target data identifier is one of the plurality of data identifiers.
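The averaging step above can be illustrated with a minimal sketch. It assumes the clustering results for one access interface are available as a dictionary from data identifier to the list of user identifiers seen for it; the function name and sample values are hypothetical. This passage does not state how the average is turned into a label, so the sketch only computes the value; arithmetically, a result close to 1 means that most identifiers under the interface were seen from a single user.

```python
from statistics import mean

def mean_reciprocal_user_count(users_per_data_id):
    """Average of 1/n over the data identifiers under one interface,
    where n is the number of distinct users seen for that identifier."""
    reciprocals = [
        1 / len(set(users))          # de-duplicate before counting
        for users in users_per_data_id.values()
        if users                     # skip identifiers with no recorded user
    ]
    if not reciprocals:
        raise ValueError("no data identifiers with recorded users")
    return mean(reciprocals)

# Hypothetical interface: D1 and D2 each seen from one user, D3 from two.
interface_clusters = {"D1": ["U1", "U1"], "D2": ["U2"], "D3": ["U1", "U3"]}
score = mean_reciprocal_user_count(interface_clusters)  # (1 + 1 + 0.5) / 3
```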

According to a second aspect of embodiments of the present specification, there is provided a data processing apparatus comprising:

an acquisition module configured to collect historical access flow data of a server, wherein the historical access flow data comprises a data identifier of access data and a user identifier of an access user, and the data identifier is used for indexing the access data;

the clustering module is configured to cluster the user identifications of the access users according to the association relation between the user identifications and the data identifications, and generate clustering results respectively corresponding to the plurality of data identifications;

and the labeling module is configured to label the data category of the access data according to the clustering result.

According to a third aspect of embodiments of the present specification, there is provided a computing device comprising:

a memory and a processor;

the memory is for storing computer-executable instructions, and the processor is for executing the computer-executable instructions:

According to a fourth aspect of embodiments of the present description, there is provided a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the steps of the data processing method.

According to one embodiment of the specification, historical access flow data of a server are collected, the historical access flow data comprise data identification of access data and user identification of an access user, the data identification is used for indexing the access data, the user identification of the access user is clustered according to the association relation between the user identification and the data identification, clustering results corresponding to the data identifications respectively are generated, and the access data is subjected to data category labeling according to the clustering results.

According to the embodiment of the specification, the access data are labeled by data category, so that the private data and the public data among the access data are identified from the historical access flow data. Override detection is then performed on the data to be accessed based on the identification result, which improves both the accuracy of horizontal override detection and the efficiency of detecting override vulnerabilities.

Drawings

FIG. 1 is a process flow diagram of a data processing method provided in one embodiment of the present disclosure;

FIG. 2 (a) is a schematic diagram of a private data access scheme provided in one embodiment of the present disclosure;

FIG. 2 (b) is a schematic diagram of a public data access scheme according to one embodiment of the present disclosure;

FIG. 3 is a process flow diagram of a data processing method according to one embodiment of the present disclosure;

FIG. 4 is a schematic diagram of a data processing apparatus according to one embodiment of the present disclosure;

FIG. 5 is a block diagram of a computing device provided in one embodiment of the present description.

Detailed Description

In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present description. However, this description can be implemented in many ways other than those described herein, and those skilled in the art can make similar generalizations without departing from its spirit; therefore, this description is not limited by the specific implementations disclosed below.

The terminology used in the one or more embodiments of the specification is for the purpose of describing particular embodiments only and is not intended to be limiting of the one or more embodiments of the specification. As used in this specification, one or more embodiments and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in one or more embodiments of the present specification refers to and encompasses any or all possible combinations of one or more of the associated listed items.

It should be understood that, although the terms first, second, etc. may be used in one or more embodiments of this specification to describe various information, the information should not be limited by these terms. These terms are only used to distinguish information of the same type from one another. For example, a first may also be referred to as a second, and similarly, a second may also be referred to as a first, without departing from the scope of one or more embodiments of the present description. The word "if" as used herein may be interpreted as "when", "upon", or "in response to determining", depending on the context.

First, terms related to one or more embodiments of the present specification will be explained.

Private data: data on a business system that is assigned to one particular user and can be operated on or accessed only by that user is referred to as private data in this specification.

Private data interface: an interface that the business system is designed to provide for accessing private data is referred to herein as a private data interface.

Horizontal override vulnerability: a security vulnerability in which a user can access or manipulate the private data of other users through a private data interface is referred to as a horizontal override vulnerability.

Unique identification of data: a value that identifies a piece of data. The notion is broad: in most cases it is a plain number, e.g. id=1, but other forms, such as a random string, also occur.

Public data: the opposite of private data, i.e., data that is accessible to all users by system design.

Public data interface: the opposite of a private data interface, i.e., an interface that the system is designed to provide for accessing public data.

In the present specification, a data processing method is provided, and the present specification relates to a data processing apparatus, a computing device, and a computer-readable storage medium, which are described in detail in the following embodiments one by one.

Fig. 1 shows a process flow diagram of a data processing method according to one embodiment of the present disclosure, including steps 102 to 106.

Step 102, collecting historical access flow data of a server.

The historical access flow data comprises a data identifier of access data and a user identifier of an access user, and the data identifier is used for indexing the access data.

Specifically, the historical access flow data may be access flow data generated by a user accessing a target application program through a server; the target application program can be a Web application program based on a browser/server architecture, and the Web application program not only comprises a plurality of static pages with a data display function, but also has different business logic interfaces capable of carrying out query, modification, addition, deletion and other processing operations on data.

In practical applications, a user may operate on data in the target application program through the server, for example by adding, deleting, modifying or querying it. The data in the target application program can be divided into private data and public data. If a piece of private data belongs to user U1, user U1 may call an interface to add, delete, modify or query that data, while users other than user U1 have no permission to operate on it; likewise, user U1 has no permission to operate on the private data of other users.

However, because of negligence by a back-end developer, the user identity may not be checked when data is added, deleted, modified or queried, or the identity check may return a wrong result. In other words, the server places excessive trust in the data operation request submitted by user U1 when accessing the target application program and neglects to check the operation permission, so that user U1 can add, delete, modify or query private data belonging to other users. The target application program then has an override vulnerability.

The data processing method provided by the embodiment of the specification can be applied at the gateway at the ingress/egress of a machine room, or implemented in a WAF (Web Application Firewall) protection system, so that the collected historical access flow data is as complete as possible. This makes it possible to accurately detect whether an override vulnerability exists on a page or an interface of the target application program accessed by a user.

Common network monitoring modes can be divided into two types: the bypass monitoring mode and the series monitoring mode. In the bypass monitoring mode, monitoring is generally realized through the port mirroring function of network equipment such as a switch; the monitoring device only needs to be connected to a designated mirror port of the switch, hence the name "bypass monitoring". In the bypass monitoring mode, the historical access flow data may be bypass mirror traffic. In the series monitoring mode, monitoring is generally performed through a gateway, a bridge or a proxy server; because the monitoring device is connected in series in the network as a gateway or a bridge, this is called the "series monitoring mode". In the series monitoring mode, the historical access flow data may be the series (in-line) traffic.

The specific network monitoring mode may be determined according to actual requirements, and is not limited in this regard.

In this embodiment of the present disclosure, the historical access flow data includes a data identifier of the access data and a user identifier of the access user, where the user identifier indicates the identity of the user accessing the target application program, and the data identifier is used to index the access data. To detect whether the target application program has an override vulnerability, it must first be determined whether the user has permission to access the access data, and this determination is made according to the user identifier of the user.

In practical applications, the identity of the user accessing the access data could be identified by the session ID in a cookie. A cookie is data that some websites store on the user's local terminal in order to distinguish user identities and track sessions. It is a small text file that the Web server stores in the user's browser; it can contain information about the user and is one of the main carriers through which user information is obtained, exchanged and transferred. The Web site can read the cookie information whenever the user connects to the server.

However, the session ID in the cookie is randomly generated when the user logs in, and it is reset when the user logs in again or the session expires. In the historical access flow data, a single piece of data may therefore appear to have been accessed by multiple users, when in fact the different session IDs all point to the same user and only the session ID has changed. This makes the result of public/private data analysis on the historical traffic inaccurate.

In the embodiment of the specification, to overcome this problem and improve the accuracy of override vulnerability detection while ensuring that the user's private data is not leaked, encrypted user information is stored in the cookie, which avoids one user appearing as multiple users because of changing session IDs.

Specifically, when the user links to the server to request access to the access data, the Web site may access the cookie information to decrypt the encrypted user information stored in the cookie, and then use the decrypted result as the user identifier for accessing the access data. Because the encrypted user information stored in the cookie does not change, the decrypted user identification always points to the same user.

By encrypting the user information and putting the user information into the cookie, the encrypted user information is acquired and decrypted after the data access request of the user is received, so that the user information is prevented from being tampered, and inaccurate statistics of the user information caused by expiration of a session or re-login is also avoided.
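The effect of counting by session ID rather than by a stable user identifier can be shown with a small sketch. The record layout and all values below are hypothetical, and the decryption of the cookie is not modeled; the sketch simply assumes that a stable user identifier has already been recovered for each request.

```python
# Hypothetical log records: (session_id, stable_user_id, data_id).
records = [
    ("sess-a1", "U1", "D1"),
    ("sess-b7", "U1", "D1"),  # same user after logging in again
    ("sess-c3", "U1", "D1"),  # same user after the session expired
]

def distinct_visitors(records, data_id, key_index):
    """Count distinct identities seen for data_id, using the chosen field
    (0 = session ID, 1 = stable user identifier)."""
    return len({rec[key_index] for rec in records if rec[2] == data_id})

by_session = distinct_visitors(records, "D1", key_index=0)  # 3 apparent users
by_user = distinct_visitors(records, "D1", key_index=1)     # 1 real user
```

Counted by session ID, D1 would look like data accessed by several users; counted by the stable identifier, it is accessed by a single user.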

In addition, the historical access flow data is acquired from the bypass mirror image flow in the gateway, so that the integrity of the historical access flow data acquisition is guaranteed, and the accuracy of the data category labeling result is guaranteed.

In a specific implementation, after the historical access flow data of the server is collected, the user identifiers in it need to be clustered, and the access data is divided into data categories according to the clustering result. To ensure the accuracy of the data category division, the embodiment of the specification also filters and screens the historical access flow data, which can be realized by the following steps:

cleaning the historical access flow data to generate a corresponding data cleaning result;

splitting the historical access flow data contained in the data cleaning result to generate a plurality of historical access flow sub-data;

marking user identifiers and data identifiers in the historical access flow sub-data by using target characters;

determining hit results of the plurality of historical access flow sub-data on at least one preset data screening rule in the marking results;

and screening the historical access flow sub-data according to the hit result.

Specifically, real access flow data in actual application scenarios is complex: the historical access flow data generated by users accessing data in the target application program may also include external attack traffic, crawler traffic, invalid access traffic and the like. To ensure data processing efficiency, the historical access flow data can therefore be cleaned after it is collected, that is, the attack traffic, crawler traffic, invalid access traffic and the like are removed, and the splitting, marking, screening, clustering and other operations are performed on the remaining historical access flow data.
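A minimal sketch of such a cleaning pass is given below. The specification does not say how attack, crawler or invalid traffic is recognized, so the record fields, the two regular expressions and the status-code rule are all illustrative assumptions.

```python
import re

# Hypothetical heuristics; a real deployment would use its own rules.
CRAWLER_UA = re.compile(r"bot|spider|crawler", re.IGNORECASE)
ATTACK_PAYLOAD = re.compile(r"<script|union\s+select|\.\./", re.IGNORECASE)

def is_noise(record):
    """Return True for crawler, attack or invalid (error-status) traffic."""
    if CRAWLER_UA.search(record.get("user_agent", "")):
        return True
    if ATTACK_PAYLOAD.search(record.get("url", "")):
        return True
    return record.get("status", 200) >= 400

def clean(records):
    """Keep only the records that pass all noise checks."""
    return [r for r in records if not is_noise(r)]

traffic = [
    {"url": "/ccc?info_id=333", "user_agent": "Mozilla/5.0", "status": 200},
    {"url": "/ccc?info_id=334", "user_agent": "ExampleBot/1.0", "status": 200},
    {"url": "/ccc?info_id=1 UNION SELECT 1", "user_agent": "Mozilla/5.0", "status": 200},
    {"url": "/missing?info_id=9", "user_agent": "Mozilla/5.0", "status": 404},
]
cleaned = clean(traffic)  # only the first record survives
```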

Since one piece of historical access flow data may include characteristic parameters of multiple types of access data, each characteristic parameter in a piece of historical access flow data should be turned into a separate piece of data to avoid interference between different characteristic parameters. For example, if the historical access flow data contains /ccc?phone=13000000000&info_id=333, it needs to be split into two pieces of data (historical access flow sub-data), /ccc?phone=13000000000 and /ccc?info_id=333, and the marking, screening, clustering and data category labeling are then performed on the two pieces of data separately.
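The splitting step can be sketched with the Python standard library. The sketch assumes that the request in the example uses ordinary query-string syntax, i.e. the path /ccc followed by ?phone=13000000000&info_id=333; the function name is hypothetical.

```python
from urllib.parse import urlsplit, parse_qsl, urlencode

def split_record(url):
    """Turn one request URL into one sub-record per query parameter."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    # Rebuild "path?name=value" for each parameter on its own.
    return [f"{parts.path}?{urlencode([pair])}" for pair in pairs]

sub_records = split_record("/ccc?phone=13000000000&info_id=333")
# sub_records == ["/ccc?phone=13000000000", "/ccc?info_id=333"]
```

Each sub-record can then be marked, screened and clustered independently of the other parameters in the original request.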

In practical applications, a preset marking rule may be used to mark the user identifiers and data identifiers in the plurality of historical access flow sub-data. The marking rule may be defined according to actual requirements; for example, it may match parameters whose name ends with the characters "ID" and whose value is an ID value (ID value is used broadly here and may be a value in any format that uniquely identifies data). Specifically, the target character (ID) is used to mark the user identifiers and data identifiers in the plurality of historical access flow sub-data.

After marking, a whitelist approach can be adopted: a hit rule is set, and the historical access flow sub-data that hit the rule are extracted. For example, suppose the hit rule is that the historical access flow sub-data ends with ID, and the historical access flow data contains a data identifier of the access data, a user identifier of the access user, an access time (the generation time of the historical access flow data) and the like. If the user identifier and the data identifier are to be obtained through screening, they can be marked with the target character (ID); the marked user identifier and data identifier then hit the hit rule, and the sub-data in the historical access flow data can be screened according to the hit result.
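The marking and whitelist screening described above might look as follows. This is a sketch under stated assumptions: each sub-record carries exactly one query parameter, the marking rule is "parameter name ends with ID (case-insensitive) and the value is non-empty", and the function names are hypothetical. A name-suffix rule of this kind can produce false hits on names such as "paid", so a real rule set would likely be stricter.

```python
from urllib.parse import urlsplit, parse_qsl

TARGET_CHARACTER = "ID"  # the marker applied to identifier parameters

def mark(sub_record):
    """Return (path, name, value, marker); marker is 'ID' or None."""
    parts = urlsplit(sub_record)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    if len(pairs) != 1:
        raise ValueError("a sub-record must carry exactly one parameter")
    name, value = pairs[0]
    hit = name.upper().endswith(TARGET_CHARACTER) and bool(value)
    return (parts.path, name, value, TARGET_CHARACTER if hit else None)

def screen(sub_records):
    """Whitelist screening: keep only the sub-records that were marked."""
    marked = [mark(s) for s in sub_records]
    return [m for m in marked if m[3] == TARGET_CHARACTER]

hits = screen(["/ccc?phone=13000000000", "/ccc?info_id=333"])
# hits == [("/ccc", "info_id", "333", "ID")]
```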

Cleaning the historical access flow data removes interference traffic and prevents it from affecting subsequent statistics. In addition, marking is used and only the data hit by the marking is kept; one piece of traffic can hit multiple times and thus generate multiple pieces of data, which is the splitting of the historical access flow data. This avoids interference between characteristic parameters and also avoids wasting computing resources.

Step 104, clustering the user identifications of the access users according to the association relation between the user identifications and the data identifications, and generating clustering results respectively corresponding to the plurality of data identifications.

Specifically, after the historical access flow data is collected, the historical access flow data comprises the data identification of the access data and the user identification of the access user, so that the user identification of the access user can be clustered according to the association relationship between the user identification and the data identification, and a clustering result corresponding to each data identification is generated.

For example, if it is determined that the user U1, the user U2, and the user U3 have an association relationship with the data D1 according to the historical access flow data, the user U1, the user U2, and the user U3 are clustered, and a clustering result corresponding to the data D1 is generated.
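In its simplest reading, this clustering is a grouping of user identifiers by data identifier. The sketch below assumes the screened traffic has been reduced to (user identifier, data identifier) pairs; the function name and sample records are hypothetical.

```python
from collections import defaultdict

def cluster_users_by_data(records):
    """Group the user identifiers of all accesses under each data identifier.
    Duplicates are kept here; de-duplication happens in the labeling step."""
    clusters = defaultdict(list)
    for user_id, data_id in records:
        clusters[data_id].append(user_id)
    return dict(clusters)

records = [("U1", "D1"), ("U2", "D1"), ("U1", "D1"), ("U3", "D1"), ("U2", "D2")]
clusters = cluster_users_by_data(records)
# clusters == {"D1": ["U1", "U2", "U1", "U3"], "D2": ["U2"]}
```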

Step 106, labeling the data category of the access data according to the clustering result.

In a specific implementation, the data category labeling of the access data according to the clustering result can be realized as follows:

Specifically, in the historical access flow data, the association relationship between the user identifier and the data identifier can be used for representing which users access which data, so that the clustering result corresponding to the data identifier generated by clustering according to the association relationship can be used for representing which users access the access data corresponding to the data identifier.

In the horizontal override scenario, if each data identifier (ID value) under an interface is accessed by only one user, the interface is determined to be an interface for accessing users' private data, and the access data corresponding to the data identifier is private data of the user; if each data identifier (ID value) under an interface is accessed by a plurality of users, the interface is determined to be an interface for accessing public data, and the access data corresponding to the data identifier is public data. Therefore, after the clustering result corresponding to each data identifier is generated, the user identifiers included in the clustering result may be de-duplicated. For example, if the clustering result includes user U1, user U2, user U1 and user U3, it includes user U1, user U2 and user U3 after de-duplication.

After the de-duplication processing result is obtained, the number of target user identifiers contained in the clustering result obtained by de-duplication can be compared with a first preset number threshold, and if the number of target user identifiers contained in the clustering result obtained by de-duplication is determined to be smaller than or equal to the first preset number threshold, the access data associated with the target data identifiers is determined to be private data; or if the number of the target user identifiers contained in the clustering result obtained by the de-duplication is determined to be larger than a first preset number threshold, determining that the access data associated with the target data identifiers is public data.
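The de-duplication and threshold comparison can be sketched as follows. The value of the first preset number threshold is not given in the specification; the default of 1 used here is an assumption, as are the function name and the label strings.

```python
def label_data(clusters, threshold=1):
    """Label each data identifier 'private' or 'public' from its user cluster.
    `threshold` stands in for the first preset number threshold (assumed 1)."""
    labels = {}
    for data_id, user_ids in clusters.items():
        distinct_users = set(user_ids)  # de-duplication step
        if len(distinct_users) <= threshold:
            labels[data_id] = "private"
        else:
            labels[data_id] = "public"
    return labels

clusters = {"D1": ["U1", "U2", "U1", "U3"], "D2": ["U2", "U2"]}
labels = label_data(clusters)
# labels == {"D1": "public", "D2": "private"}
```

With a threshold of 1, D1 (three distinct users after de-duplication) is labeled public and D2 (one distinct user) is labeled private.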

In the horizontal override scenario provided in the embodiment of the present disclosure, a schematic diagram of a private data access manner is shown in fig. 2 (a). In fig. 2 (a), data D1 has only one access user, namely user U1; data D2 has only one access user; and similarly, data D3 has only one access user, namely user U3. Data D1, data D2 and data D3 are therefore private data.

In the horizontal override scenario provided in the embodiment of the present disclosure, a schematic diagram of a public data access manner is shown in fig. 2 (b), and in fig. 2 (b), data D1 has three access users, namely, user U1, user U2 and user U3, and data D2 has two access users, namely, user U1 and user U3, so that data D1 and data D2 are public data.

Further, after the data category labeling is performed on the access data according to the clustering result, a mapping relation table between the data identification and the data category of the access data can be established according to the data category labeling result.

Whether the data to be accessed in newly generated access flow data is private data can then be determined based on the mapping relation table, so as to perform override detection. This can be realized by the following steps:

acquiring access flow data to be detected;

determining the data category of the data to be accessed according to the data identifier of the data to be accessed contained in the access flow data and the mapping relation table;

and detecting whether the server has an override vulnerability according to the data category of the data to be accessed.

Specifically, after the clustering result corresponding to each data identifier is generated, the access data corresponding to the data identifier is labeled with a data category according to the number of distinct users contained in the clustering result: data accessed by only one user is labeled as private data, and an interface accessing private data is labeled as a private data interface; data accessed by a plurality of users is labeled as public data, and an interface accessing public data is labeled as a public data interface.

The data labeling results can serve as the basis for deciding whether to perform override detection. In practical applications, a scanner can be used to perform override detection according to the data category of the data to be accessed; methods other than a scanner that can realize override detection may also be used. The choice can be made according to actual requirements and is not limited here.
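One way to use the mapping relation table as a gate in front of a scanner is sketched below. The table layout, the "unknown" fallback for identifiers that were never labeled, and the policy of forwarding only requests that touch private data are illustrative assumptions; the scanner itself is not modeled.

```python
def lookup_category(data_id, mapping_table):
    """Return the labeled category, or 'unknown' if the identifier is unseen."""
    return mapping_table.get(data_id, "unknown")

def select_for_override_test(requests, mapping_table):
    """Keep only the requests whose target data is labeled private."""
    return [
        req for req in requests
        if lookup_category(req["data_id"], mapping_table) == "private"
    ]

# Hypothetical mapping relation table built from earlier labeling results.
mapping_table = {"D1": "public", "D2": "private"}
incoming = [
    {"user_id": "U9", "data_id": "D1"},
    {"user_id": "U9", "data_id": "D2"},
    {"user_id": "U9", "data_id": "D7"},  # never seen in historical traffic
]
candidates = select_for_override_test(incoming, mapping_table)
# candidates == [{"user_id": "U9", "data_id": "D2"}]
```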

In addition, the data category labeling of the access data according to the clustering result can be realized by the following ways:

Specifically, because actual service usage may be complex, some interfaces of a target application may be rarely accessed by users. If these cases are ignored and the calculation assumes the ideal condition that every interface is frequently accessed, the result is not sufficiently accurate, because the data generated by rarely accessed interfaces has no statistical significance.

Therefore, after the clustering result corresponding to each data identifier is generated, the embodiment of the specification can perform the deduplication processing on the user identifiers in the clustering result corresponding to the target data identifier, and determine the number of target user identifiers contained in the deduplication processing result, wherein the target data identifier is one of the plurality of data identifiers, and the number of target user identifiers can be used for representing the number of access users accessing the access data corresponding to the target data identifier.

After the number of target user identifiers in the deduplication processing result is determined, the access interface of the access data corresponding to the target data identifier can be determined. The deduplication processing results corresponding to the target data identifiers of the plurality of access data under the same access interface are then merged and deduplicated to obtain the number of target user identifiers corresponding to the access interface. This number represents the number of access users accessing the access interface, and is used to determine whether the historical access flow data of the access interface has statistical significance.

If the number of access users accessing the access interface is lower than a second preset number threshold, the historical access flow data of the access interface has no statistical significance; if the number is higher than the second preset number threshold, the historical access flow data has statistical significance. The clustering results are screened according to this number to obtain target clustering results: when the number of access users accessing the access interface is higher than the second preset number threshold, the clustering results corresponding to the target data identifiers of the access data under that access interface are taken as target clustering results. Finally, the data category of the data identifiers corresponding to the target clustering results is marked.

Because the number of user identifiers contained in a clustering result can represent how many times users accessed the access data, screening the clustering results removes interface flow that has too few access users to be statistically significant, which improves the accuracy of the data category labeling result.
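The interface-level screening described above can be sketched as follows. This is an illustration under stated assumptions: `filter_clusters_by_interface`, `interface_of`, the interface names and the threshold value 2 are all hypothetical, and because the specification does not say how a count exactly equal to the threshold is treated, this sketch drops it.

```python
from collections import defaultdict
from typing import Dict, List


def filter_clusters_by_interface(
    clusters: Dict[str, List[str]],
    interface_of: Dict[str, str],
    second_threshold: int,
) -> Dict[str, List[str]]:
    """Keep only clustering results whose interface has enough distinct users.

    clusters: data identifier -> user identifiers that accessed it.
    interface_of: data identifier -> access interface serving that data.
    """
    # Merge and deduplicate the users of all data items under each interface.
    users_per_interface = defaultdict(set)
    for data_id, user_ids in clusters.items():
        users_per_interface[interface_of[data_id]].update(user_ids)

    # An interface is kept only if its distinct-user count exceeds the threshold.
    return {
        data_id: user_ids
        for data_id, user_ids in clusters.items()
        if len(users_per_interface[interface_of[data_id]]) > second_threshold
    }


clusters = {"D1": ["U1", "U2"], "D2": ["U2", "U3"], "D9": ["U7"]}
interface_of = {"D1": "/a", "D2": "/a", "D9": "/b"}
target_clusters = filter_clusters_by_interface(clusters, interface_of, second_threshold=2)
```

In this toy input, interface `/a` has three distinct users and is kept, while `/b` has one and is removed together with its data item D9.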

In addition, the classification labeling of the access data according to the clustering result can be realized by the following ways:

Specifically, in the historical access flow data, the association relationship between the user identifier and the data identifier may be used to characterize which users access which data, so in the embodiment of the present disclosure, after the user identifiers are clustered according to the association relationship, the user identifiers included in the clustering result are subjected to deduplication processing, so that the number of access users accessing the access data corresponding to the target data identifier is characterized according to the number of target user identifiers included in the deduplication processing result.

Further, in this embodiment, the access data under an access interface is labeled using the mean of the reciprocals of the numbers of access users corresponding to the different access data under that access interface. After the user identifiers in the clustering results corresponding to the different target data identifiers are deduplicated, the access interface of the access data corresponding to each target data identifier is determined. For the access data under the same access interface, the number of access users contained in the deduplication processing result of each target data identifier is determined, the reciprocal of each number is taken, and the reciprocals are averaged. The access data is then labeled according to this mean.

In an ideal state, if the access data under a certain access interface is private data, the calculation result obtained by the calculation process of averaging should be equal to 1, which represents that only one user has accessed each access data under the access interface; if the access data under a certain access interface is public data, the calculation result obtained by the mean value calculation process should be less than 1, which represents that a plurality of users have accessed.

However, in practical applications, the collected historical access flow data may include disturbance flow data such as external attack flow data, crawler flow data and invalid access flow data. Although the historical access flow data may be cleaned before clustering, it cannot be guaranteed that this disturbance flow data is completely removed, so the result of the mean calculation may contain an error. To ensure the accuracy of the labeling result generated from the mean, a numerical range may be set in this embodiment, for example [0.95, 1]. If the mean calculation result falls within the numerical range, the access interface may be labeled as a private interface, and the access data under the access interface may be labeled as private data.

For example, users may access data D1, data D2 and data D3 through access interface A, where the number of target user identifiers contained in the deduplication processing result is 1 for data D1, 2 for data D2 and 1 for data D3. Averaging the reciprocals of the numbers of access users of data D1, data D2 and data D3 under access interface A gives (1/1 + 1/2 + 1/1)/3 ≈ 0.83. This result does not fall within the foregoing numerical range, so access interface A may be labeled as a public interface, and data D1, data D2 and data D3 under access interface A may be labeled as public data.
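The reciprocal-mean rule and the worked example for access interface A can be reproduced with a short sketch. The function names and the string labels are illustrative assumptions; the range [0.95, 1] is the example range given above, and each count is assumed to be a positive integer.

```python
from typing import List, Tuple


def reciprocal_mean(user_counts: List[int]) -> float:
    """Mean of 1/n over the distinct-user counts of the data under one interface."""
    return sum(1.0 / n for n in user_counts) / len(user_counts)


def label_interface(
    user_counts: List[int],
    private_range: Tuple[float, float] = (0.95, 1.0),
) -> str:
    """Label an interface private if the reciprocal mean falls in private_range."""
    low, high = private_range
    mean = reciprocal_mean(user_counts)
    return "private" if low <= mean <= high else "public"


# Interface A serves D1, D2 and D3 with 1, 2 and 1 distinct users respectively.
mean_a = reciprocal_mean([1, 2, 1])
label_a = label_interface([1, 2, 1])
```

Using the reciprocal keeps the statistic bounded in (0, 1], so a single tolerance band near 1 can absorb a small share of disturbance traffic without relabeling a private interface as public.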

In addition, after the access data contained in the screening result is marked using the target characters corresponding to the data identifiers, the access data contained in the marking result and the user identifiers associated with that access data can be input into a data category labeling model to be trained, and the model is trained to obtain the data category labeling model.

After the new data access flow to be detected is obtained, the data to be accessed contained in the access flow data can be input into the data category labeling model to carry out data category labeling, and a data category labeling result of the data to be accessed is generated.

Specifically, the data access flow to be detected is input into the pre-trained data category labeling model, and whether the access data in the data access flow is private data can be determined from the model output. If the access data is determined to be private data, the user identifier of the user to whom the access data belongs is compared with the user identifier of the access user, so as to detect override.
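The comparison step can be illustrated with a small sketch. It does not implement the labeling model: the model output is stood in for by a plain lookup `category_of`, and `owner_of`, a mapping from private data to the user it belongs to, is a hypothetical structure that the specification does not define.

```python
from typing import Dict


def is_override(
    data_id: str,
    requesting_user: str,
    category_of: Dict[str, str],
    owner_of: Dict[str, str],
) -> bool:
    """Flag a request as a possible override.

    A request is flagged only when the data is labeled private and the
    requesting user differs from the user the data belongs to.
    """
    if category_of.get(data_id) != "private":
        return False  # public or unknown data is not checked here
    owner = owner_of.get(data_id)
    return owner is not None and owner != requesting_user


category_of = {"D1": "public", "D3": "private"}
owner_of = {"D3": "U2"}

flag_foreign = is_override("D3", "U1", category_of, owner_of)
flag_owner = is_override("D3", "U2", category_of, owner_of)
flag_public = is_override("D1", "U1", category_of, owner_of)
```

Treating unknown data as "not flagged" is a design choice of this sketch; a stricter deployment might route such requests to further inspection instead.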

According to the embodiment of the specification, the public and private data model of the application interface and the data identifier corresponding to the interface is obtained through aggregation analysis of the historical traffic of the application, so that whether the interface and the data are the private interface and the private data or not is judged, and the accuracy of override detection is improved.

The application of the data processing method provided in the present specification in the unauthorized detection scenario is taken as an example, and the data processing method is further described below with reference to fig. 3. Fig. 3 is a flowchart of a processing procedure of a data processing method according to an embodiment of the present disclosure, and specific steps include steps 302 to 328.

Step 302, historical access traffic data for a server is collected.

Wherein the historical access traffic data comprises a data identification of access data, the data identification being used to index the access data.

Step 304, splitting the historical access flow data.

Specifically, before the historical access flow data is split, the historical access flow data can be cleaned, data such as part of external malicious attack flow, crawler flow, invalid access flow and the like are removed, and the rest of the historical access flow data is split.

In addition, after the historical access flow data is cleaned to obtain a first screening result, the historical access flow data contained in the first screening result can be screened a second time. Specifically, cookie information associated with each piece of historical access flow data in the first screening result can be obtained, the encrypted user information stored in the cookie information is decrypted, and the access data contained in the first screening result is screened a second time according to the decryption result to generate a second screening result.

Specifically, when a user connects to the server to request access to the access data, the Web site may read the cookie information, decrypt the encrypted user information stored in the cookie, and use the decryption result as the user identifier for accessing the access data. In this embodiment, after the encrypted user information stored in the cookie is decrypted to generate the corresponding decryption result, the historical access flow data in the first screening result can be screened a second time according to the decryption result. Specifically, it can be determined whether the decryption result contains a user identifier; if it does not, the historical access flow data associated with the cookie information can be deleted from the first screening result to generate the second screening result.
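The secondary screening can be sketched as a filter that keeps only records whose cookie yields a user identifier. The specification does not state how the user information is encrypted, so decryption is passed in as a callable; `toy_decrypt` below is a placeholder for demonstration and is not a real decryption routine. The record fields `data_id`, `cookie` and `user_id` are likewise assumed names.

```python
from typing import Callable, Dict, List, Optional

Record = Dict[str, str]


def second_screening(
    records: List[Record],
    decrypt_user: Callable[[str], Optional[str]],
) -> List[Record]:
    """Drop records whose cookie does not decrypt to a user identifier."""
    kept = []
    for record in records:
        user_id = decrypt_user(record.get("cookie", ""))
        if user_id:  # a missing or empty identifier removes the record
            kept.append({**record, "user_id": user_id})
    return kept


def toy_decrypt(cookie: str) -> Optional[str]:
    """Placeholder 'decryption': reads a plain 'uid=<id>' cookie value."""
    prefix = "uid="
    if cookie.startswith(prefix) and len(cookie) > len(prefix):
        return cookie[len(prefix):]
    return None


first_result = [
    {"data_id": "D1", "cookie": "uid=U1"},
    {"data_id": "D2", "cookie": "session=abc"},  # no user identifier
    {"data_id": "D3"},  # no cookie at all
]
second_result = second_screening(first_result, toy_decrypt)
```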

Further, the historical access flow data is split, namely, the historical access flow data contained in the second screening result is split.

Step 306, marking the historical access flow sub-data in the split result.

Step 308, determining a hit result of the historical access flow sub-data contained in the marking result on at least one preset data screening rule.

Step 310, screening the historical access flow sub-data according to the hit result.

Step 312, clustering the user identifications of the access users according to the association relationship between the user identifications and the data identifications in the screening result, and generating clustering results respectively corresponding to the plurality of data identifications.

Step 314, performing deduplication processing on the user identifications in the clustering result corresponding to the target data identifications, and determining the number of the target user identifications contained in the deduplication processing result, wherein the target data identifications are one of the plurality of data identifications.

Step 316, screening the clustering results according to the number to obtain target clustering results in which the number of target user identifiers is greater than a second preset number threshold.

Step 318, performing deduplication on the user identifiers in the target clustering result, and comparing the number of user identifiers contained in the deduplication result with a first preset number threshold.

Step 320, marking the data category of the access data according to the comparison result.

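Specifically, deduplication processing is performed on the user identifiers in the clustering result corresponding to the target data identifier, and the number of target user identifiers contained in the deduplication processing result is compared with the first preset number threshold. If the number is less than or equal to the first preset number threshold, the access data associated with the target data identifier is determined to be private data; if the number is greater than the first preset number threshold, the access data is determined to be public data.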

Step 322, establishing a mapping relation table between the data identification of the access data and the data category according to the data category labeling result.

Step 324, acquiring access flow data to be detected.

Step 326, determining the data category of the data to be accessed according to the data identifier of the data to be accessed contained in the access flow data and the mapping relation table.

Step 328, detecting whether the server has an unauthorized vulnerability according to the data type of the data to be accessed.

Specifically, whether the server has an unauthorized vulnerability can be detected through a scanner and the data type of the data to be accessed.
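Steps 312 and 314 above can be sketched as follows. The sketch reads "clustering" as grouping user identifiers by the data identifier they are associated with, which is an interpretation rather than something the specification states; the function names and the `(data_id, user_id)` record shape are illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def cluster_users_by_data(records: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Step 312 (as read here): group user identifiers under each data identifier."""
    clusters = defaultdict(list)
    for data_id, user_id in records:
        clusters[data_id].append(user_id)
    return dict(clusters)


def distinct_user_counts(clusters: Dict[str, List[str]]) -> Dict[str, int]:
    """Step 314: deduplicate user identifiers and count them per data identifier."""
    return {data_id: len(set(user_ids)) for data_id, user_ids in clusters.items()}


# (data identifier, user identifier) pairs matching fig. 2 (b), with one repeat.
records = [
    ("D1", "U1"), ("D1", "U2"), ("D1", "U3"),
    ("D2", "U1"), ("D2", "U3"),
    ("D1", "U1"),
]
clusters = cluster_users_by_data(records)
counts = distinct_user_counts(clusters)
```

The counts produced here are the quantities that steps 316 to 320 compare against the second and first preset number thresholds.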

Corresponding to the above method embodiments, the present disclosure further provides an embodiment of a data processing apparatus, and fig. 4 shows a schematic diagram of a data processing apparatus provided in one embodiment of the present disclosure. As shown in fig. 4, the apparatus includes:

the collection module 402 is configured to collect historical access flow data of the server, wherein the historical access flow data comprises a data identifier of access data and a user identifier of an access user, and the data identifier is used for indexing the access data;

the clustering module 404 is configured to cluster the user identifications of the access users according to the association relationship between the user identifications and the data identifications, and generate clustering results respectively corresponding to the plurality of data identifications;

the labeling module 406 is configured to label the data category of the access data according to the clustering result.

Optionally, the data processing apparatus further includes:

a splitting module configured to split the historical access traffic data;

the determining module is configured to determine a hit result of the access data contained in the split result on at least one preset data screening rule;

and the screening module is configured to screen the access data according to the hit result and mark the access data contained in the screening result by utilizing the target character corresponding to the data identifier.

Optionally, the labeling module 406 includes:

the comparison sub-module is configured to perform duplication elimination processing on the user identifications in the clustering result corresponding to the target data identifications, and compare the number of the target user identifications contained in the duplication elimination processing result with a first preset number threshold;

the first determining submodule is configured to determine that the access data associated with the target data identifier is private data if the number of the target user identifiers contained in the deduplication result is smaller than or equal to the first preset number threshold; or

the second determining submodule is configured to determine that the access data associated with the target data identifier is public data if the number of the target user identifiers contained in the deduplication result is greater than the first preset number threshold; wherein the target data identifier is one of the plurality of data identifiers.

Optionally, the data processing apparatus further includes:

the establishing module is configured to establish a mapping relation table between the data identification of the access data and the data category according to the data category labeling result.

Optionally, the data processing apparatus further includes:

the first data acquisition module is configured to acquire access flow data to be detected;

the first data type determining module is configured to determine the data type of the data to be accessed according to the data identifier of the data to be accessed contained in the access flow data and the mapping relation table.

Optionally, the data processing apparatus further includes:

and the detection module is configured to detect whether the server has an override vulnerability according to the data type of the data to be accessed.

Optionally, the labeling module 406 includes:

the quantity determination submodule is configured to perform de-duplication processing on the user identifications in the clustering result corresponding to the target data identifications, and determine the quantity of the target user identifications contained in the de-duplication processing result, wherein the target data identifications are one of the plurality of data identifications;

the clustering result screening sub-module is configured to screen the clustering results according to the quantity to obtain target clustering results;

and the labeling sub-module is configured to label the data category of the data identifier corresponding to the target clustering result.

Optionally, the data processing apparatus further includes:

the training module is configured to input the access data contained in the marking result and the user identification with the association relation with the access data into a data category labeling model to be trained for training, and the data category labeling model is obtained.

Optionally, the data processing apparatus further includes:

the second data acquisition module is configured to acquire access flow data to be detected;

the generation module is configured to input the data to be accessed contained in the access flow data into the data category labeling model to label the data category, and generate a data category labeling result of the data to be accessed.

Optionally, the labeling module 406 includes:

the de-duplication processing sub-module is configured to perform de-duplication processing on the user identifications in the clustering result corresponding to the target data identifications, and determine the number of the target user identifications contained in the de-duplication processing result;

the access interface determination sub-module is configured to determine an access interface of the access data corresponding to the target data identifier;

the computing sub-module is configured to average the reciprocal of the number of the target user identifications in the deduplication processing results corresponding to different data identifications under the access interface;

the category labeling sub-module is configured to label the category of the access data according to the average value; wherein the target data identifier is one of the plurality of data identifiers.

The above is a schematic solution of a data processing apparatus of the present embodiment. It should be noted that, the technical solution of the data processing apparatus and the technical solution of the data processing method belong to the same conception, and details of the technical solution of the data processing apparatus, which are not described in detail, can be referred to the description of the technical solution of the data processing method.

Fig. 5 illustrates a block diagram of a computing device 500 provided in accordance with one embodiment of the present description. The components of the computing device 500 include, but are not limited to, a memory 510 and a processor 520. Processor 520 is coupled to memory 510 via bus 530 and database 550 is used to hold data.

Computing device 500 also includes access device 540, which enables computing device 500 to communicate via one or more networks 560. Examples of such networks include the Public Switched Telephone Network (PSTN), a Local Area Network (LAN), a Wide Area Network (WAN), a Personal Area Network (PAN), or a combination of communication networks such as the internet. The access device 540 may include one or more of any type of network interface, wired or wireless (e.g., a Network Interface Card (NIC)), such as an IEEE 802.11 Wireless Local Area Network (WLAN) wireless interface, a Worldwide Interoperability for Microwave Access (WiMAX) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth interface, a Near Field Communication (NFC) interface, and so forth.

In one embodiment of the present description, the above-described components of computing device 500, as well as other components not shown in FIG. 5, may also be connected to each other, such as by a bus. It should be understood that the block diagram of the computing device shown in FIG. 5 is for exemplary purposes only and is not intended to limit the scope of the present description. Those skilled in the art may add or replace other components as desired.

Computing device 500 may be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (e.g., tablet, personal digital assistant, laptop, notebook, netbook, etc.), mobile phone (e.g., smart phone), wearable computing device (e.g., smart watch, smart glasses, etc.), or other type of mobile device, or a stationary computing device such as a desktop computer or PC. Computing device 500 may also be a mobile or stationary server.

Wherein the memory 510 is configured to store computer executable instructions and the processor 520 is configured to execute the following computer executable instructions:

The foregoing is a schematic illustration of a computing device of this embodiment. It should be noted that, the technical solution of the computing device and the technical solution of the data processing method belong to the same concept, and details of the technical solution of the computing device, which are not described in detail, can be referred to the description of the technical solution of the data processing method.

An embodiment of the present specification also provides a computer-readable storage medium storing computer instructions that, when executed by a processor, perform the steps of the data processing method.

The above is an exemplary version of a computer-readable storage medium of the present embodiment. It should be noted that, the technical solution of the storage medium and the technical solution of the data processing method belong to the same concept, and details of the technical solution of the storage medium which are not described in detail can be referred to the description of the technical solution of the data processing method.

The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.

The computer instructions include computer program code that may be in source code form, object code form, executable file form or some intermediate form, etc. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth. It should be noted that the content contained in the computer-readable medium may be appropriately added to or reduced according to the requirements of legislation and patent practice in a jurisdiction; for example, in certain jurisdictions, according to legislation and patent practice, the computer-readable medium does not include electrical carrier signals and telecommunication signals.

It should be noted that, for simplicity of description, the foregoing method embodiments are all expressed as a series of combinations of actions, but it should be understood by those skilled in the art that the embodiments are not limited by the order of actions described, as some steps may be performed in other order or simultaneously according to the embodiments of the present disclosure. Further, those skilled in the art will appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily all required for the embodiments described in the specification.

In the foregoing embodiments, each embodiment is described with its own emphasis; for parts of one embodiment that are not described in detail, reference may be made to the related descriptions of other embodiments.

The preferred embodiments of the present specification disclosed above are merely used to help clarify the present specification. Alternative embodiments are not intended to be exhaustive or to limit the invention to the precise form disclosed. Obviously, many modifications and variations are possible in light of the teaching of the embodiments. The embodiments were chosen and described in order to best explain the principles of the embodiments and the practical application, to thereby enable others skilled in the art to best understand and utilize the invention. This specification is to be limited only by the claims and the full scope and equivalents thereof.

Claims

1. A data processing method, comprising:

collecting historical access flow data of a server, wherein the historical access flow data comprises a data identifier of access data;

and marking the data category of the access data, wherein the data category comprises: private data and public data;

establishing a mapping relation table between the data identification of the access data and the data category according to the data category labeling result; acquiring access flow data to be detected; determining the data category of the data to be accessed according to the data identifier of the data to be accessed contained in the access flow data and the mapping relation table; and detecting whether the server has an unauthorized vulnerability according to the data type of the data to be accessed.

2. The data processing method according to claim 1, further comprising, after the collecting the historical access traffic data of the server:

splitting the historical access flow data;

3. The data processing method according to claim 1 or 2, wherein the classifying the access data includes:

4. The data processing method according to claim 1, wherein the labeling of the access data according to the data category comprises:

5. The data processing method of claim 2, further comprising:

6. The data processing method of claim 5, further comprising:

acquiring access flow data to be detected;

7. The data processing method according to claim 1 or 2, wherein the classifying the access data includes:

8. A data processing apparatus comprising:

an acquisition module configured to acquire historical access flow data of a server, wherein the historical access flow data comprises a data identifier of access data;

the labeling module is configured to label the data category of the access data, and the data category comprises: private data and public data;

the data processing apparatus further includes: the establishing module is configured to establish a mapping relation table between the data identification of the access data and the data category according to the data category labeling result.

the first data type determining module is configured to determine the data type of the data to be accessed according to the data identifier of the data to be accessed contained in the access flow data and the mapping relation table;

9. A computing device, comprising:

a memory and a processor;

the memory is configured to store computer executable instructions, the processor being configured to implement the steps of the data processing method of any one of claims 1 to 7 when the computer executable instructions are executed.

10. A computer readable storage medium storing computer instructions which, when executed by a processor, implement the steps of the data processing method of any one of claims 1 to 7.