CN112583827A

CN112583827A - Data leakage detection method and device

Info

Publication number: CN112583827A
Application number: CN202011463110.3A
Authority: CN
Inventors: 唐通
Original assignee: Beijing Topsec Technology Co Ltd; Beijing Topsec Network Security Technology Co Ltd; Beijing Topsec Software Co Ltd
Current assignee: Beijing Topsec Technology Co Ltd; Beijing Topsec Network Security Technology Co Ltd; Beijing Topsec Software Co Ltd
Priority date: 2020-12-11
Filing date: 2020-12-11
Publication date: 2021-03-30
Anticipated expiration: 2040-12-11
Also published as: CN112583827B

Abstract

The application relates to the technical field of information security, and provides a data leakage detection method and device. The method comprises the following steps: acquiring a plurality of target DNS requests, and acquiring a source IP address, a secondary domain name and a sub-domain name of the secondary domain name in each target DNS request; dividing the obtained plurality of sub-domain names into a plurality of groups, wherein the sub-domain names in the same group have the same source IP address and second-level domain name, and each group corresponds to a host in a local area network; splicing the sub-domain names in the same group to obtain a character string to be detected; constructing a corresponding feature vector according to the character string to be detected of each group; and respectively carrying out anomaly detection on the feature vector of each group by using an anomaly detection model so as to determine a host with data leakage in the local area network according to the detection result of each group.

Description

Data leakage detection method and device

Technical Field

The invention relates to the technical field of information security, in particular to a data leakage detection method and device.

Background

The DNS (Domain Name System) protocol is one of the essential network communication protocols in most enterprise network environments. DNS provides domain name resolution services that translate between domain names and IP addresses so that internet and intranet resources can be accessed. Because of how DNS is used, border protection devices rarely filter, analyze or block DNS packets; the DNS protocol is therefore often used by attackers to steal data or achieve other malicious purposes, which poses a significant security risk to the enterprise.

DNS obtains the IP address of a destination domain name through recursive or iterative queries, and an attacker may exploit this query mechanism to steal data. In a DNS data leak, an infected host in the local area network embeds data in a sub-domain of a secondary domain name and performs a DNS query, for example, initiates a query for the domain name data.

At present, the following two detection methods for DNS data leakage are mainly used:

(1) Establishing a domain name blacklist. The domain name blacklist is built from threat intelligence data on domain names. However, an attacker can bypass detection by registering a new domain name, and because blacklist updates lag behind, new malicious domain names fall into a detection blind spot.

(2) Detection based on domain name features. The selected features are generally statistics such as the numbers of capital letters, lowercase letters and numeric characters in a sub-domain name; an attacker can bypass detection by shortening the sub-domain names or by spreading the transmission over multiple queries.

Disclosure of Invention

An object of the embodiments of the present application is to provide a data leakage detection method and apparatus that address the above technical problems.

In order to achieve the above purpose, the present application provides the following technical solutions:

in a first aspect, an embodiment of the present application provides a data leakage detection method, including: acquiring a plurality of target DNS requests, and acquiring a source IP address, a secondary domain name and a sub-domain name of the secondary domain name in each target DNS request; dividing the obtained plurality of sub-domain names into a plurality of groups, wherein the sub-domain names in the same group have the same source IP address and second-level domain name, and each group corresponds to a host in a local area network; splicing the sub-domain names in the same group to obtain a character string to be detected; constructing a corresponding feature vector according to the character string to be detected of each group; and respectively carrying out anomaly detection on the feature vector of each group by using an anomaly detection model so as to determine a host with data leakage in the local area network according to the detection result of each group.

In the above technical solution, the source IP address and the domain name information in each DNS request are parsed, and sub-domain names having the same source IP address and the same second-level domain name are placed in the same group, so that each group holds the sub-domain names of the DNS requests sent by a single host to the domain name server of a particular second-level domain name. The sub-domain names in the same group are spliced to obtain a character string to be detected, and detection is performed on the spliced string. The method therefore does not depend on the features of any single sub-domain name and remains effective in data leakage scenarios involving very short sub-domain names, multiple transmissions or multiple queries. Moreover, because detection relies on an anomaly detection model rather than on a list of known domain names, a newly registered malicious domain name can still be detected.

In an optional implementation manner, the obtaining a plurality of target DNS requests and obtaining a source IP address, a secondary domain name, and a sub-domain name of the secondary domain name in each target DNS request includes: acquiring a plurality of DNS data packets sent by each host in a local area network, and analyzing to acquire a DNS request in each DNS data packet; and screening the obtained multiple DNS requests based on a domain name blacklist and/or a domain name whitelist to obtain multiple target DNS requests, and obtaining a source IP address, a secondary domain name and a sub-domain name of the secondary domain name in each target DNS request.

In an optional embodiment, the screening the obtained plurality of DNS requests based on the domain name blacklist and/or the domain name whitelist to obtain a plurality of target DNS requests includes: matching the secondary domain name in each DNS request with each domain name in the domain name blacklist respectively, and taking the DNS request corresponding to the secondary domain name which does not hit any domain name in the domain name blacklist as a target DNS request; or respectively matching the secondary domain name in each DNS request with each domain name in the domain name white list, and taking the DNS request corresponding to the secondary domain name which does not hit any domain name in the domain name white list as a target DNS request; or matching the secondary domain name in each DNS request with each domain name in the domain name blacklist and each domain name in the domain name whitelist respectively, and taking the DNS request corresponding to the secondary domain name which does not hit any domain name in the domain name blacklist and also does not hit any domain name in the domain name whitelist as a target DNS request.

The DNS requests in the local area network are screened against the domain name blacklist and/or the domain name whitelist; requests whose secondary domain names appear in either list need no further detection, which improves detection efficiency.

In an optional embodiment, after the anomaly detection is performed on the feature vectors of each group by using an anomaly detection model, the method further includes: and if the detection result of any one group is abnormal, adding the second-level domain name corresponding to the group into a domain name blacklist.

With the above technical solution, once the detection result of a group is found to be abnormal, the corresponding second-level domain name is promptly added to the domain name blacklist. In subsequent detection rounds, DNS requests for that malicious second-level domain name are identified directly by the blacklist, so sub-domain name grouping, splicing, feature extraction and detection need not be repeated for them, which improves detection efficiency.

In an optional implementation manner, the constructing a corresponding feature vector according to the character string to be detected of each group includes: acquiring at least one feature value of the character string to be detected, wherein the magnitude of each feature value reflects the probability that the sub-domain names forming the character string to be detected contain leaked data; and constructing the feature vector according to the at least one feature value and a data input rule of the anomaly detection model.

In an optional embodiment, the at least one feature value includes a value of at least one of the following features: the length of the character string to be detected; the total number of sub-domain names in the group corresponding to the character string to be detected; the number of sub-domain names remaining in that group after deduplication; the entropy of the character string to be detected; the proportion of capital letters in the character string to be detected; and the proportion of numeric characters in the character string to be detected.

In a second aspect, an embodiment of the present application provides a data leakage detection apparatus, including: the DNS protocol analysis module is used for acquiring a plurality of target DNS requests and acquiring a source IP address, a secondary domain name and a sub-domain name of the secondary domain name in each target DNS request; the DNS data grouping module is used for dividing the obtained sub domain names into a plurality of groups, wherein the sub domain names in the same group have the same source IP address and secondary domain name, and each group corresponds to a host in a local area network; the sub-domain name splicing module is used for splicing sub-domain names in the same group to obtain a character string to be detected; the characteristic vector construction module is used for constructing corresponding characteristic vectors according to the character strings to be detected of each group; and the detection module is used for respectively carrying out anomaly detection on the characteristic vector of each group by utilizing an anomaly detection model so as to determine a host with data leakage in the local area network according to the detection result of each group.

In a third aspect, an embodiment of the present application provides a storage medium, where a computer program is stored on the storage medium, and when the computer program is executed by a processor, the method according to any one of the first aspect and the optional implementation manner of the first aspect is performed.

In a fourth aspect, an embodiment of the present application provides an electronic device, including: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating via the bus when the electronic device is running, the machine-readable instructions when executed by the processor performing the method according to any one of the first aspect, the optional implementation of the first aspect.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments of the present application will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and that those skilled in the art can also obtain other related drawings based on the drawings without inventive efforts.

FIG. 1 is a flow chart of a data leakage detection method in a detection stage according to an embodiment of the present application;

FIG. 2 is a flow chart of a data leak detection method provided by an embodiment of the present application during a training phase;

FIG. 3 is a schematic diagram illustrating a data leak detection apparatus according to an embodiment of the present application;

FIG. 4 is a schematic diagram of an electronic device provided by an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application. It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures. The terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element. The terms "first," "second," "third," and the like are used solely to distinguish one from another and are not to be construed as indicating or implying relative importance.

To address the shortcomings of the prior art, the embodiments of the present application provide a data leakage detection method. The detection method extracts the sub-domain names of the secondary domain names from DNS requests, splices the sub-domain names that have the same source IP address and the same secondary domain name to obtain a character string to be detected, and extracts features from that character string for detection. The detection method therefore does not depend on the features of any single sub-domain name and remains effective in data leakage scenarios involving very short sub-domain names, multiple transmissions or multiple queries. With this detection method, data leakage between an infected host in a local area network (such as an enterprise network or a campus network) and a malicious server in an external network can be effectively detected.

The technical scheme includes a training stage and a detection stage, wherein the training stage is used for training an anomaly detection model, and the detection stage is used for detecting whether data leakage exists or not by using the trained anomaly detection model.

FIG. 1 is a flow chart of the data leakage detection method in the detection stage. Referring to FIG. 1, the detection method includes the following steps:

Step 110: Obtain a plurality of target DNS requests, and obtain the source IP address, the secondary domain name and the sub-domain name of the secondary domain name in each target DNS request.

For example, a certain target DNS request is used to query the IP address of the domain name data.

A detection device is deployed in the local area network. The detection device captures the DNS data packet traffic sent by each host in the local area network and parses each DNS data packet to obtain a plurality of target DNS requests and the content carried in each target DNS request.
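As a non-limiting illustration of the parsing performed by the detection device, the following Python sketch extracts the queried domain name from the raw bytes of a DNS message, relying on the standard DNS message layout (a 12-byte header followed by length-prefixed labels). The function name `parse_dns_query_name` is hypothetical and does not appear in the application; obtaining the source IP address from the enclosing IP header and handling name compression are assumed to be done elsewhere.

```python
import struct


def parse_dns_query_name(payload: bytes) -> str:
    """Return the queried name from a raw DNS request (the UDP payload)."""
    if len(payload) < 12:
        raise ValueError("a DNS header is 12 bytes; payload is too short")
    # Header fields: ID, flags, QDCOUNT, ANCOUNT, NSCOUNT, ARCOUNT (16-bit, big-endian).
    _msg_id, flags, qdcount, _an, _ns, _ar = struct.unpack("!6H", payload[:12])
    if flags & 0x8000:
        raise ValueError("QR bit is set: this is a response, not a request")
    if qdcount < 1:
        raise ValueError("message carries no question")

    labels = []
    pos = 12
    while True:
        if pos >= len(payload):
            raise ValueError("name is truncated")
        length = payload[pos]
        pos += 1
        if length == 0:  # a zero-length label terminates the name
            break
        if length & 0xC0:
            raise ValueError("compression pointers are not handled in this sketch")
        if pos + length > len(payload):
            raise ValueError("label is truncated")
        labels.append(payload[pos:pos + length].decode("ascii", errors="replace"))
        pos += length
    return ".".join(labels)
```

In this sketch the case of each label is preserved, since encoded data carried in a sub-domain name may be case-sensitive.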

In one embodiment, step 110 includes: acquiring a plurality of DNS data packets sent by the hosts in the local area network, parsing the DNS data packets to obtain the DNS requests therein, and parsing the DNS requests to obtain the source IP addresses, secondary domain names and sub-domain names of the secondary domain names therein; and screening the obtained DNS requests based on a domain name blacklist and/or a domain name whitelist to obtain a plurality of target DNS requests, together with the source IP address, secondary domain name and sub-domain name of the secondary domain name in each target DNS request.

Specifically, after the detection device finishes parsing the DNS data packets, it obtains the source IP address, the secondary domain name and the sub-domain name of the secondary domain name in each DNS request, matches the secondary domain name in each DNS request against the domain names in the domain name blacklist and/or the domain name whitelist, and screens out suspicious target DNS requests according to the matching result. There are three main screening methods:
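By way of example only, a queried name may be separated into its secondary domain name and the sub-domain name of that secondary domain name as sketched below in Python. The function name `split_query_name` is hypothetical. The sketch assumes that the secondary domain name is simply the last two labels of the name; multi-label public suffixes such as "co.uk" would need a suffix list, which is not covered by the application or by this sketch.

```python
from typing import Tuple


def split_query_name(qname: str) -> Tuple[str, str]:
    """Split a queried name into (secondary domain name, sub-domain name).

    Assumption: the secondary domain name is the last two labels.
    """
    labels = qname.rstrip(".").split(".")
    if len(labels) < 2:
        raise ValueError("name has fewer than two labels: %r" % qname)
    # DNS names are case-insensitive, so the grouping key is lowercased.
    second_level = ".".join(labels[-2:]).lower()
    # The sub-domain part keeps its case because it may carry encoded data.
    sub_domain = ".".join(labels[:-2])
    return second_level, sub_domain
```

For a name with no labels in front of the secondary domain name, the returned sub-domain name is the empty string.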

(1) Match the secondary domain name in each DNS request against each malicious domain name in the domain name blacklist, and take each DNS request whose secondary domain name does not hit any malicious domain name in the blacklist as a target DNS request.

The domain name blacklist contains a plurality of malicious domain names, each of which is a secondary domain name. If the secondary domain name in a DNS request hits any malicious domain name in the blacklist, it can be determined to be malicious, and there is no need to further detect whether its sub-domain names exhibit a data leakage anomaly.

(2) Match the secondary domain name in each DNS request against each trusted domain name in the domain name whitelist, and take each DNS request whose secondary domain name does not hit any trusted domain name in the whitelist as a target DNS request.

The domain name whitelist contains a plurality of trusted domain names, each of which is a secondary domain name. If the secondary domain name in a DNS request does not hit any trusted domain name in the whitelist, it is necessary to further check whether its sub-domain names exhibit a data leakage anomaly.

(3) Match the secondary domain name in each DNS request against each malicious domain name in the domain name blacklist and each trusted domain name in the domain name whitelist, and take each DNS request whose secondary domain name hits neither a malicious domain name in the blacklist nor a trusted domain name in the whitelist as a target DNS request.
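The three screening methods above may be sketched, purely as an illustration, by a single Python function in which either list is optional. The names `DnsRequest` and `screen_requests` are hypothetical, and exact string matching on the secondary domain name is an assumption of this sketch.

```python
from typing import Iterable, List, NamedTuple, Optional, Set


class DnsRequest(NamedTuple):
    src_ip: str
    second_level_domain: str
    sub_domain: str


def screen_requests(
    requests: Iterable[DnsRequest],
    blacklist: Optional[Set[str]] = None,
    whitelist: Optional[Set[str]] = None,
) -> List[DnsRequest]:
    """Keep only requests whose secondary domain name hits neither list.

    Passing only `blacklist` corresponds to method (1), only `whitelist`
    to method (2), and both to method (3).
    """
    excluded = set(blacklist or ()) | set(whitelist or ())
    return [r for r in requests if r.second_level_domain not in excluded]
```

Requests that hit the blacklist are already known to be malicious and requests that hit the whitelist are trusted, so only the remaining requests proceed to grouping and detection.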

In the foregoing embodiment, a detection period may be set, in which case step 110 acquires the DNS data packets sent by the hosts in the local area network during the current detection period, and detection is performed on those packets. Alternatively, a packet-count threshold may be set, in which case detection is performed once the number of not-yet-detected DNS data packets sent by the hosts in the local area network reaches the threshold.

Step 120: Divide the obtained sub-domain names into a plurality of groups, wherein the sub-domain names in the same group have the same source IP address and secondary domain name, and each group corresponds to a host in the local area network.
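One possible way to decide when a batch of DNS data packets is handed to detection is sketched below in Python. The class `BatchTrigger` and its parameters are hypothetical; the sketch supports either trigger (a detection period or a packet-count threshold) and takes timestamps as explicit arguments so that its behavior does not depend on the system clock.

```python
from typing import Any, List, Optional


class BatchTrigger:
    """Buffer packets and release them when a period elapses or a count is reached."""

    def __init__(self, period_seconds: Optional[float] = None,
                 max_packets: Optional[int] = None) -> None:
        if period_seconds is None and max_packets is None:
            raise ValueError("set a detection period, a packet threshold, or both")
        self.period_seconds = period_seconds
        self.max_packets = max_packets
        self._buffer: List[Any] = []
        self._window_start: Optional[float] = None

    def add(self, packet: Any, now: float) -> Optional[List[Any]]:
        """Record a packet seen at time `now`; return a batch if detection is due."""
        if self._window_start is None:
            self._window_start = now  # the window opens with its first packet
        self._buffer.append(packet)
        period_due = (self.period_seconds is not None
                      and now - self._window_start >= self.period_seconds)
        count_due = (self.max_packets is not None
                     and len(self._buffer) >= self.max_packets)
        if period_due or count_due:
            batch, self._buffer = self._buffer, []
            self._window_start = None
            return batch
        return None
```

In this sketch a period is only evaluated when a packet arrives; a deployment that must flush idle windows would need a timer in addition.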

After the source IP addresses, the second-level domain names and the sub-domain names of the second-level domain names in the target DNS requests are obtained, the obtained sub-domain names are grouped, and the sub-domain names with the same source IP addresses and the same second-level domain names are grouped into the same group.

Data leakage is usually the behavior of a single infected host in the local area network toward a domain name server under the attacker's control; the infected host corresponds to a source IP address in the local area network, and the domain name server corresponds to a secondary domain name. After the grouping in step 120, the sub-domain names of the DNS requests sent by a single host to the domain name server of a particular secondary domain name therefore fall into the same group. Each group corresponds to one host in the local area network, and the same host may have multiple groups (each corresponding to a different secondary domain name). After grouping is completed, a group may contain one sub-domain name or several.

Step 130: Splice the sub-domain names in the same group to obtain the character string to be detected.
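The grouping of step 120 may be illustrated by the following Python sketch, which keys each group on the pair (source IP address, secondary domain name). The names `DnsRequest` and `group_sub_domains` are hypothetical, and keeping the sub-domain names in arrival order is an assumption of this sketch.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple


class DnsRequest(NamedTuple):
    src_ip: str
    second_level_domain: str
    sub_domain: str


GroupKey = Tuple[str, str]  # (source IP address, secondary domain name)


def group_sub_domains(requests: Iterable[DnsRequest]) -> Dict[GroupKey, List[str]]:
    """Collect sub-domain names per (source IP, secondary domain name) pair."""
    groups: Dict[GroupKey, List[str]] = defaultdict(list)
    for req in requests:
        groups[(req.src_ip, req.second_level_domain)].append(req.sub_domain)
    return dict(groups)
```

A host that queries two different secondary domain names appears under two keys, which matches the statement that one host may have multiple groups.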

In one embodiment, all the sub-domain names in the same group are spliced to generate the character string to be detected.

Illustratively, in step 110, target DNS requests for the following three domain names are obtained: MWQy.magic.com, YTN0.magic.com and NGE1.magic.com. The three requests have the same source IP address and the same second-level domain name (i.e. magic.com), and the sub-domain names of the second-level domain name are MWQy, YTN0 and NGE1, respectively. After grouping, the group therefore contains three sub-domain names, MWQy, YTN0 and NGE1, which are spliced to generate the character string to be detected, "MWQyYTN0NGE1".

In step 130, if a group contains only one sub-domain name, the character string to be detected is that sub-domain name itself.

Step 140: Construct a corresponding feature vector according to the character string to be detected of each group.
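The splicing of step 130 amounts to string concatenation, as the following minimal Python sketch shows. The function name `splice_sub_domains` is hypothetical; concatenating in arrival order and leaving any dots inside a multi-label sub-domain name untouched are assumptions of this sketch, since the application does not specify either point.

```python
from typing import Sequence


def splice_sub_domains(sub_domains: Sequence[str]) -> str:
    """Concatenate the sub-domain names of one group into a single string."""
    return "".join(sub_domains)


# The example group from the description, and a single-member group.
example = splice_sub_domains(["MWQy", "YTN0", "NGE1"])
single = splice_sub_domains(["MWQy"])
```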

After the character string to be detected corresponding to each group is obtained, a characteristic vector is constructed according to the character string to be detected of each group, and the characteristic vector is input into an anomaly detection model for detection.

Specifically, step 140 includes: acquiring at least one feature value of the character string to be detected, wherein the magnitude of each feature value reflects the probability that the sub-domain names forming the character string to be detected contain leaked data; and constructing a corresponding feature vector according to the at least one feature value and the data input rule of the anomaly detection model. The anomaly detection model has been trained during a training phase.

The at least one characteristic value of the character string to be detected comprises at least one of the following characteristic values:

(1) The length of the character string to be detected. The longer the character string to be detected, the greater the possibility that it contains leaked data.

(2) The total number of sub-domain names in the group corresponding to the character string to be detected. The more sub-domain names in the group, the more DNS requests the corresponding host has initiated, and the greater the possibility that they contain leaked data.

(3) The number of sub-domain names remaining after deduplication in the group corresponding to the character string to be detected. The same host may initiate multiple DNS requests for the same domain name, so duplicate sub-domain names may occur within a group; the more sub-domain names remain after deduplication, the greater the possibility that they contain leaked data.

(4) The entropy of the character string to be detected. Entropy represents how uncertain and unreadable the character string to be detected is. If data leakage occurs, the infected host generally encrypts or encodes the data to obtain the sub-domain names, and encrypted or encoded data is less predictable and less readable, so the entropy value reflects the possibility that the character string to be detected contains leaked data.

(5) The proportion of capital letters in the character string to be detected. If data leakage occurs, the infected host generally encrypts or encodes the data to obtain the sub-domain names, and encrypted or encoded data tends to have a higher proportion of capital letters.

(6) The proportion of numeric characters in the character string to be detected. If data leakage occurs, the infected host generally encrypts or encodes the data to obtain the sub-domain names, and encrypted or encoded data tends to have a higher proportion of numeric characters.

Because the character string to be detected is spliced from a plurality of sub-domain names, the absolute numbers of capital letters and numeric characters cannot by themselves represent the probability that the corresponding host is leaking data; they must be considered relative to the length of the string. For example, even if the numbers of capital letters and numeric characters are small, the possibility of data leakage cannot be considered low when the string itself is short. This embodiment therefore uses the proportions of capital letters and numeric characters, rather than their counts, as detection features, which gives a better detection effect.

It is understood that the present embodiment may compute the values of one or more of the above six features and construct the feature vector from the computed values. If only one feature value is used, the method may be effective for only some data leakage scenarios, and its detection rate and false alarm rate may be unsatisfactory. It is therefore preferable to select as many of the features as practical, so that the character string to be detected is characterized along several different dimensions, the detection accuracy is higher, and the method applies to more data leakage scenarios.

Step 150: Perform anomaly detection on the feature vector of each group by using an anomaly detection model, so as to determine a host with data leakage in the local area network according to the detection result of each group.
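As a hedged illustration of step 140, the Python sketch below computes all six features for one group and arranges them in a fixed order. The function names are hypothetical, and two choices are assumptions of this sketch rather than statements of the application: entropy is taken to be the character-level Shannon entropy in bits, and the "data input rule" of the model is taken to be a flat list of floats in the order (1) to (6).

```python
import math
from collections import Counter
from typing import List, Sequence


def shannon_entropy(text: str) -> float:
    """Character-level Shannon entropy in bits (assumed entropy definition)."""
    if not text:
        return 0.0
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in Counter(text).values())


def build_feature_vector(sub_domains: Sequence[str]) -> List[float]:
    """Return [length, total count, distinct count, entropy, upper ratio, digit ratio]."""
    spliced = "".join(sub_domains)
    length = len(spliced)
    # Ratios, not raw counts, so that short strings are not under-weighted.
    upper_ratio = sum(ch.isupper() for ch in spliced) / length if length else 0.0
    digit_ratio = sum(ch.isdigit() for ch in spliced) / length if length else 0.0
    return [
        float(length),
        float(len(sub_domains)),
        float(len(set(sub_domains))),
        shannon_entropy(spliced),
        upper_ratio,
        digit_ratio,
    ]
```

For the group MWQy, YTN0, NGE1 the spliced string has 12 characters, of which 9 are capital letters and 2 are digits, giving ratios of 0.75 and about 0.167.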

The feature vector of each group is input in turn to the trained anomaly detection model, the anomaly detection model outputs the detection result of each group, and the host with data leakage in the local area network is determined according to the detection result of each group. If the detection result of a group is abnormal, the host corresponding to that group is determined to have data leakage, that is, it is an infected host in the local area network.

Optionally, if the detection result of any one of the groups is abnormal, the second-level domain name corresponding to that group is added to the domain name blacklist.

For a group determined to be abnormal by the detection method of this embodiment, the corresponding secondary domain name is added to the domain name blacklist. When steps 110 to 150 are executed again later, DNS requests for that secondary domain name no longer need to undergo sub-domain name grouping, sub-domain name splicing and the like, so unnecessary feature extraction and detection steps are avoided and detection efficiency is further improved.
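Mapping the per-group detection results back to hosts may be sketched as follows in Python. The function `find_leaking_hosts` is hypothetical. The sketch assumes a model object exposing a `predict` method that returns -1 for an abnormal feature vector and 1 for a normal one, which is the convention used by scikit-learn's `IsolationForest`; the `LengthThresholdModel` class is only a stand-in so that the example runs without a trained model, and is not an anomaly detection model.

```python
from typing import Dict, List, Sequence, Set, Tuple

GroupKey = Tuple[str, str]  # (source IP address, secondary domain name)


class LengthThresholdModel:
    """Stand-in for a trained model: flags vectors whose first feature is large."""

    def __init__(self, max_length: float) -> None:
        self.max_length = max_length

    def predict(self, vectors: Sequence[Sequence[float]]) -> List[int]:
        return [-1 if v[0] > self.max_length else 1 for v in vectors]


def find_leaking_hosts(group_vectors: Dict[GroupKey, Sequence[float]], model) -> Set[str]:
    """Return the source IPs of all groups the model labels as abnormal (-1)."""
    keys = list(group_vectors)
    labels = model.predict([group_vectors[k] for k in keys])
    return {src_ip for (src_ip, _domain), label in zip(keys, labels) if label == -1}
```

Because a host may own several groups, it is reported as soon as any one of its groups is abnormal.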
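The feedback from detection results into the domain name blacklist may be sketched in Python as below. The function name `update_blacklist` and the representation of detection results as a mapping from (source IP address, secondary domain name) to a boolean are assumptions made for illustration.

```python
from typing import Dict, Set, Tuple

GroupKey = Tuple[str, str]  # (source IP address, secondary domain name)


def update_blacklist(blacklist: Set[str], results: Dict[GroupKey, bool]) -> Set[str]:
    """Add the secondary domain name of every abnormal group to the blacklist.

    `results` maps each group key to True when the group was found abnormal.
    The set is updated in place and also returned for convenience.
    """
    for (_src_ip, second_level_domain), is_abnormal in results.items():
        if is_abnormal:
            blacklist.add(second_level_domain)
    return blacklist
```

In a later round, a request whose secondary domain name is now in the blacklist is filtered out during screening and never reaches grouping or feature extraction.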

FIG. 2 is a flow chart of the data leakage detection method in the training stage. Referring to FIG. 2, the training stage includes the following steps, which are executed by an electronic device:

Step 210: Acquire a plurality of DNS data packets sent by each host in the local area network.

The electronic device may be the above detection device, or may be any one of a PC, a notebook computer, a tablet computer, a server, an embedded device, and the like, and the electronic device is not limited to a single device, and may also be a combination of multiple devices or a cluster formed by a large number of devices.

In step 210, the original DNS data packets sent by each host in the local area network may be collected by the detection device or by an independent traffic collection device. After collection is completed, the original DNS data packets are saved in the pcap file format, and the pcap file is transmitted to the electronic device.

Step 220: Parse the plurality of DNS data packets to obtain the source IP address, the secondary domain name and the sub-domain name of the secondary domain name of the DNS request in each DNS data packet.

For example, a DNS request is used to query the IP address of the domain name news.

Step 230: Divide the obtained sub-domain names into a plurality of groups, wherein the sub-domain names in the same group have the same source IP address and secondary domain name, and each group corresponds to a host in the local area network.

Step 240: Splice the sub-domain names in the same group to obtain a target character string.

Step 250: Construct a corresponding feature vector according to the target character string of each group.

For the detailed implementation of steps 230 to 250, refer to the description of the detection stage; it is not repeated here.

Step 260: Train an initial model by using the feature vector of each group, and obtain the anomaly detection model after the training is finished.

The initial model may use the iForest (Isolation Forest) algorithm.

In step 260, the training of the initial model is unsupervised and does not require the construction of malicious DNS packet traffic.

In summary, the data leakage detection method provided by the invention divides the sub domain names having the same source IP address and the same secondary domain name into the same group by analyzing the source IP address and the domain name information in the DNS request, splices the sub domain names in the same group to obtain the character string to be detected, and detects the spliced character string to be detected, thereby effectively detecting data leakage in data leakage scenes, such as too short sub domain names, multiple transmission or multiple queries. Moreover, the result of the data leakage detection is combined with the domain name blacklist, so that the efficiency of feature extraction and detection can be further improved.

Based on the same inventive concept, an embodiment of the present application provides a data leakage detection apparatus, please refer to fig. 3, the apparatus includes:

a DNS protocol resolution module 310, configured to obtain a plurality of target DNS requests, and obtain a source IP address, a secondary domain name, and a sub-domain name of the secondary domain name in each target DNS request;

a DNS data grouping module 320, configured to divide the obtained multiple sub-domain names into multiple groups, where the sub-domain names in the same group have the same source IP address and secondary domain name, and each group corresponds to a host in the local area network;

the sub-domain name splicing module 330 is configured to splice sub-domain names in the same group to obtain a character string to be detected;

the feature vector construction module 340 is configured to construct a corresponding feature vector according to each group of the character strings to be detected;

and a detection module 350, configured to perform anomaly detection on the feature vector of each group by using an anomaly detection model, so as to determine a host with data leakage in the local area network according to the detection result of each group.

Optionally, the DNS protocol resolution module 310 is specifically configured to: acquire a plurality of DNS data packets sent by each host in a local area network, and parse them to obtain the DNS request in each DNS data packet; and screen the obtained DNS requests based on a domain name blacklist and/or a domain name whitelist to obtain a plurality of target DNS requests, and obtain a source IP address, a secondary domain name and a sub-domain name of the secondary domain name in each target DNS request.

Optionally, the DNS protocol resolution module 310 is specifically configured to: match the secondary domain name in each DNS request against each domain name in the domain name blacklist, and take each DNS request whose secondary domain name does not hit any domain name in the domain name blacklist as a target DNS request; or match the secondary domain name in each DNS request against each domain name in the domain name whitelist, and take each DNS request whose secondary domain name does not hit any domain name in the domain name whitelist as a target DNS request; or match the secondary domain name in each DNS request against each domain name in the domain name blacklist and each domain name in the domain name whitelist, and take each DNS request whose secondary domain name hits neither the domain name blacklist nor the domain name whitelist as a target DNS request.
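The three screening variants can be sketched in Python as a single filter in which either list is optional. The function name `screen_requests` and the dictionary layout of a request are hypothetical and chosen only for illustration.

```python
def screen_requests(requests, blacklist=None, whitelist=None):
    """Keep only requests whose secondary domain hits neither supplied list.

    requests: iterable of dicts with a "secondary_domain" key (assumed layout).
    Passing only one of blacklist / whitelist gives the single-list variants.
    """
    excluded = set(blacklist or ()) | set(whitelist or ())
    return [r for r in requests if r["secondary_domain"] not in excluded]


# Hypothetical requests and lists.
requests = [
    {"source_ip": "10.0.0.5", "secondary_domain": "known-bad.test", "sub_domain": "x1"},
    {"source_ip": "10.0.0.5", "secondary_domain": "trusted.test", "sub_domain": "www"},
    {"source_ip": "10.0.0.5", "secondary_domain": "unknown.test", "sub_domain": "aGVs"},
]
targets = screen_requests(
    requests, blacklist={"known-bad.test"}, whitelist={"trusted.test"}
)
```

Requests that hit either list are already classified, so only the unclassified remainder needs to go through feature extraction and anomaly detection.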

Optionally, the apparatus further comprises: a blacklist updating module, configured to, after the detection module performs anomaly detection on the feature vector of each group by using the anomaly detection model, add the secondary domain name corresponding to a group to the domain name blacklist if the detection result of that group is abnormal.
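A minimal Python sketch of this update step follows. It assumes that detection results are available as a mapping from each group key, a (source IP address, secondary domain name) pair, to a boolean that is true when the group is abnormal; this representation and the name `update_blacklist` are hypothetical.

```python
def update_blacklist(blacklist, results):
    """Add the secondary domain of every abnormal group to the blacklist.

    blacklist: set of domain names, modified in place and returned.
    results: dict mapping (source_ip, secondary_domain) -> bool (assumed layout).
    """
    for (_source_ip, secondary_domain), is_abnormal in results.items():
        if is_abnormal:
            blacklist.add(secondary_domain)
    return blacklist


# Hypothetical detection results for two groups.
blacklist = {"known-bad.test"}
results = {
    ("10.0.0.5", "unknown.test"): True,
    ("10.0.0.7", "benign.test"): False,
}
update_blacklist(blacklist, results)
```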

Optionally, the feature vector construction module 340 includes: a feature value acquisition submodule, configured to acquire at least one feature value of the character string to be detected, where the magnitude of each feature value reflects the probability that the sub-domain names forming the character string to be detected contain leaked data; and a construction submodule, configured to construct the feature vector according to the at least one feature value and the data input rule of the anomaly detection model.

Optionally, the at least one feature value includes a value of at least one of the following features:

the length of the character string to be detected;

the total number of the sub domain names in the group corresponding to the character string to be detected;

the number of distinct sub-domain names, after deduplication, in the group corresponding to the character string to be detected;

entropy of the character string to be detected;

the proportion of uppercase letters in the character string to be detected;

and the proportion of digit characters in the character string to be detected.
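The six features listed above can be computed as in the following Python sketch. It assumes that "entropy" means character-level Shannon entropy in bits and that the data input rule of the model is simply a fixed ordering of the six values; both are assumptions for illustration, as are the names `shannon_entropy` and `build_feature_vector`.

```python
import math
from collections import Counter


def shannon_entropy(text):
    """Character-level Shannon entropy in bits (assumed meaning of 'entropy')."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def build_feature_vector(sub_domains):
    """Build [length, total count, distinct count, entropy, upper ratio, digit ratio].

    sub_domains: list of the sub-domain names belonging to one group.
    The ordering of the features is an assumed input rule.
    """
    spliced = "".join(sub_domains)
    n = len(spliced)
    upper_ratio = sum(ch.isupper() for ch in spliced) / n if n else 0.0
    digit_ratio = sum(ch.isdigit() for ch in spliced) / n if n else 0.0
    return [
        float(n),                       # length of the spliced string
        float(len(sub_domains)),        # total number of sub-domain names
        float(len(set(sub_domains))),   # number of distinct sub-domain names
        shannon_entropy(spliced),       # entropy of the spliced string
        upper_ratio,                    # proportion of uppercase letters
        digit_ratio,                    # proportion of digit characters
    ]


vector = build_feature_vector(["AB12", "AB12", "cd"])
```

A vector of this form would then be passed to whichever anomaly detection model is in use.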

The implementation principle and technical effects of the data leakage detection apparatus provided by the embodiment of the present application have been introduced in the foregoing method embodiments. For brevity, where the apparatus embodiments do not mention a detail, reference may be made to the corresponding content in the method embodiments.

Fig. 4 shows a possible structure of an electronic device 400 provided in an embodiment of the present application. Referring to fig. 4, the electronic device 400 includes: a processor 410, a memory 420, and a communication interface 430, which are interconnected and in communication with each other via a communication bus 440 and/or other form of connection mechanism (not shown).

There may be one or more memories 420 (only one is shown in the figure), each of which may be, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), and the like. The processor 410, as well as possibly other components, may access the memory 420 to read and/or write data.

There may be one or more processors 410 (only one is shown in the figure), each of which may be an integrated circuit chip having signal processing capabilities. The processor 410 may be a general-purpose processor, including a Central Processing Unit (CPU), a Micro Control Unit (MCU), a Network Processor (NP), or another conventional processor; or a special-purpose processor, including a Graphics Processing Unit (GPU), a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. Also, when there are a plurality of processors 410, some of them may be general-purpose processors and the others may be special-purpose processors.

There may be one or more communication interfaces 430 (only one is shown in the figure), which can be used to communicate directly or indirectly with other devices for data interaction. The communication interface 430 may include an interface that performs wired and/or wireless communication.

One or more computer program instructions may be stored in memory 420 and read and executed by processor 410 to implement the data leak detection methods provided by embodiments of the present application, as well as other desired functions, including performing the steps of the detection phase and/or the steps of the training phase of the data leak detection methods.

It will be appreciated that the configuration shown in fig. 4 is merely illustrative and that electronic device 400 may include more or fewer components than shown in fig. 4 or have a different configuration than shown in fig. 4. The components shown in fig. 4 may be implemented in hardware, software, or a combination thereof. The electronic device 400 may be a PC, a notebook computer, a tablet computer, a server, an embedded device, or the like, and the electronic device 400 is not limited to a single device, and may also be a combination of multiple devices or a cluster formed by a large number of devices.

The embodiment of the present application further provides a computer-readable storage medium, where computer program instructions are stored on the computer-readable storage medium, and when the computer program instructions are read and executed by a processor of a computer, the data leakage detection method provided in the embodiment of the present application is executed. The computer-readable storage medium may be implemented as, for example, memory 420 in electronic device 400 in fig. 4.

In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described apparatus embodiments are merely illustrative; for example, the division into modules is only a logical division, and other divisions are possible in an actual implementation. Furthermore, the functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.

The above description is only an example of the present application and is not intended to limit the scope of the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims

1. A method for detecting data leakage, comprising:

acquiring a plurality of target DNS requests, and acquiring a source IP address, a secondary domain name and a sub-domain name of the secondary domain name in each target DNS request;

dividing the obtained plurality of sub-domain names into a plurality of groups, wherein the sub-domain names in the same group have the same source IP address and secondary domain name, and each group corresponds to a host in a local area network;

splicing the sub-domain names in the same group to obtain a character string to be detected;

constructing a corresponding feature vector according to the character string to be detected of each group;

and respectively carrying out anomaly detection on the feature vector of each group by using an anomaly detection model so as to determine a host with data leakage in the local area network according to the detection result of each group.

2. The method of claim 1, wherein obtaining a plurality of target DNS requests and obtaining a source IP address, a secondary domain name, and a sub-domain name of the secondary domain name in each target DNS request comprises:

acquiring a plurality of DNS data packets sent by each host in a local area network, and analyzing to acquire a DNS request in each DNS data packet;

and screening the obtained multiple DNS requests based on a domain name blacklist and/or a domain name whitelist to obtain multiple target DNS requests, and obtaining a source IP address, a secondary domain name and a sub-domain name of the secondary domain name in each target DNS request.

3. The method of claim 2, wherein the screening the obtained plurality of DNS requests based on the domain name blacklist and/or the domain name whitelist to obtain a plurality of target DNS requests comprises:

matching the secondary domain name in each DNS request with each domain name in the domain name blacklist respectively, and taking the DNS request corresponding to the secondary domain name which does not hit any domain name in the domain name blacklist as a target DNS request; or,

matching the secondary domain name in each DNS request with each domain name in the domain name whitelist, and taking the DNS request corresponding to the secondary domain name which does not hit any domain name in the domain name whitelist as a target DNS request; or,

and matching the secondary domain name in each DNS request with each domain name in the domain name blacklist and each domain name in the domain name whitelist respectively, and taking the DNS request corresponding to the secondary domain name which does not hit any domain name in the domain name blacklist and also does not hit any domain name in the domain name whitelist as a target DNS request.

4. The method of claim 2, wherein after the anomaly detection is performed on the feature vectors of each group using an anomaly detection model, the method further comprises:

and if the detection result of any one group is abnormal, adding the secondary domain name corresponding to the group into the domain name blacklist.

5. The method according to any one of claims 1 to 4, wherein constructing the corresponding feature vector according to the character string to be detected of each group comprises:

acquiring at least one characteristic value of the character string to be detected, wherein the magnitude of each characteristic value reflects the probability that the sub-domain names forming the character string to be detected contain leaked data;

and constructing the feature vector according to the at least one feature value and a data input rule of the anomaly detection model.

6. The method of claim 5, wherein the at least one characteristic value comprises a value of at least one of the following characteristics:

the length of the character string to be detected;

entropy of the character string to be detected;

the proportion of uppercase letters in the character string to be detected;

and the proportion of digit characters in the character string to be detected.

7. A data leak detection apparatus, characterized by comprising:

the DNS protocol analysis module is used for acquiring a plurality of target DNS requests and acquiring a source IP address, a secondary domain name and a sub-domain name of the secondary domain name in each target DNS request;

the DNS data grouping module is used for dividing the obtained sub domain names into a plurality of groups, wherein the sub domain names in the same group have the same source IP address and secondary domain name, and each group corresponds to a host in a local area network;

the sub-domain name splicing module is used for splicing sub-domain names in the same group to obtain a character string to be detected;

the feature vector construction module is used for constructing a corresponding feature vector according to the character string to be detected of each group;

and the detection module is used for respectively carrying out anomaly detection on the feature vector of each group by utilizing an anomaly detection model so as to determine a host with data leakage in the local area network according to the detection result of each group.

8. A storage medium, characterized in that the storage medium has stored thereon a computer program which, when being executed by a processor, performs the method according to any one of claims 1-6.

9. An electronic device, comprising: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating over the bus when the electronic device is operating, the machine-readable instructions when executed by the processor performing the method of any of claims 1-6.