CN115114329A - Method and device for data stream anomaly detection, electronic device and storage medium - Google Patents

Method and device for data stream anomaly detection, electronic device and storage medium Download PDF

Info

Publication number
CN115114329A
CN115114329A (application CN202110288769.8A)
Authority
CN
China
Prior art keywords
feature vector
data stream
target
behavior
equipment
Prior art date
Legal status
Pending
Application number
CN202110288769.8A
Other languages
Chinese (zh)
Inventor
叶继明
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority application: CN202110288769.8A
Publication: CN115114329A
Legal status: Pending

Classifications

    • G06F16/24568 Query execution: Data stream processing; Continuous queries
    • G06F16/285 Relational databases: Clustering or classification
    • G06F16/367 Creation of semantic tools: Ontology
    • G06N3/08 Neural networks: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The application provides a method and device for data stream anomaly detection, an electronic device, and a storage medium, applicable to traffic risk control and anti-fraud for internet applications in industries such as e-commerce, live streaming, travel, and video. Risk identification is performed on the data stream by combining behavior and relation chain, and whether the data stream is black product behavior is then judged, improving both the accuracy and the coverage of black product identification. A method of data stream anomaly detection includes: acquiring a data stream, where the data stream includes at least IP information and device information; performing feature vector processing on the data stream to obtain a target relation chain feature vector and a target behavior feature vector; fusing the target relation chain feature vector and the target behavior feature vector to obtain a fused feature vector; and inputting the fused feature vector into at least one pre-trained risk identification model, and obtaining a risk identification result of the data stream output by the at least one risk identification model.

Description

Method and device for data stream anomaly detection, electronic device, and storage medium
Technical Field
Embodiments of the present application relate to the technical field of network security, and in particular to a method and device for data stream anomaly detection, an electronic device, and a storage medium.
Background
"Black product" refers to the underground industry that uses the internet as a medium and network technology as a means to obtain profit through illegal methods. To carry out batch cheating and traffic inflation, network black products usually initiate a large number of requests from the same Internet Protocol (IP) address, device number, mobile phone number, and user account within a short time, which seriously affects normal network activities. How to perform risk identification on a data stream to determine whether it is black product behavior is an urgent problem to be solved.
Disclosure of Invention
The application provides a method and device for data stream anomaly detection, an electronic device, a chip, and a computer-readable storage medium. By vector-embedding the IP and device relation chain features, behavior features can be fused with relation chain features, so that risk identification can be performed on the data stream by combining behavior and relation chain, whether the data stream is black product behavior is judged, and the accuracy and coverage of black product identification are improved.
Other features and advantages of the present application will be apparent from the following detailed description, or may be learned by practice of the application.
According to an aspect of the present application, there is provided a method for data stream anomaly detection, including:
acquiring a data stream, where the data stream includes at least IP information and device information;
performing feature vector processing on the data stream to obtain a target relation chain feature vector and a target behavior feature vector, where the target relation chain feature vector is an IP relation chain feature vector and the target behavior feature vector is an IP behavior feature vector, or the target relation chain feature vector is a device relation chain feature vector and the target behavior feature vector is a device behavior feature vector;
fusing the target relation chain feature vector and the target behavior feature vector to obtain a fused feature vector;
and inputting the fused feature vector into at least one pre-trained risk identification model, and obtaining a risk identification result of the data stream output by the at least one risk identification model.
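The claimed steps (acquire, vectorize, fuse, score) can be sketched as follows. This is an illustrative sketch, not the patented implementation: the extractor callables, field names, and the default threshold of 0.5 are assumptions supplied by the caller.

```python
def detect_data_stream_risk(data_stream, extract_chain_vec, extract_behavior_vec,
                            risk_models, threshold=0.5):
    """Sketch of the claimed method. `extract_chain_vec` / `extract_behavior_vec`
    return the target relation chain / behavior feature vectors (IP dimension or
    device dimension); `risk_models` are hypothetical pre-trained scorers."""
    chain_vec = extract_chain_vec(data_stream)        # S220: relation chain features
    behavior_vec = extract_behavior_vec(data_stream)  # S220: behavior features
    fused = list(chain_vec) + list(behavior_vec)      # S230: fusion by concatenation
    scores = [model(fused) for model in risk_models]  # S240: risk identification
    return {"scores": scores, "risky": any(s > threshold for s in scores)}
```

With stub extractors and a max-score model, a stream whose fused features carry a score above the threshold is flagged as risky.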
According to an aspect of the present application, there is provided an apparatus for data stream anomaly detection, including:
an acquisition module, configured to acquire a data stream, where the data stream includes at least IP information and device information;
a determination module, configured to perform feature vector processing on the data stream to obtain a target relation chain feature vector and a target behavior feature vector, where the target relation chain feature vector is an IP relation chain feature vector and the target behavior feature vector is an IP behavior feature vector, or the target relation chain feature vector is a device relation chain feature vector and the target behavior feature vector is a device behavior feature vector;
a fusion module, configured to fuse the target relation chain feature vector and the target behavior feature vector to obtain a fused feature vector;
an input module, configured to input the fused feature vector into at least one pre-trained risk identification model;
where the acquisition module is further configured to obtain the risk identification result of the data stream output by the at least one risk identification model.
According to an aspect of the present application, there is provided an electronic device, including a processor and a memory, where the memory is configured to store a computer program, and the processor is configured to call and run the computer program stored in the memory to perform the steps of the above method for data stream anomaly detection.
According to an aspect of the present application, there is provided a chip, including a processor configured to call and run a computer program from a memory to perform the steps of the above method for data stream anomaly detection.
According to an aspect of the present application, there is provided a computer-readable storage medium storing a computer program that causes a computer to perform the steps of the above method for data stream anomaly detection.
Based on the above technical solution, the IP and device relation chain features are vector-embedded so that behavior features can be fused with relation chain features; risk identification is then performed on the data stream by jointly combining behavior and relation chain, whether the data stream is black product behavior is further judged, and the accuracy and coverage of black product identification are improved.
Additional features and advantages of embodiments of the present application will be set forth in the detailed description which follows, or may be learned by practice of the application.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application. It is obvious that the drawings in the following description are only some embodiments of the application, and that for a person skilled in the art, other drawings can be derived from them without inventive effort.
FIG. 1 schematically illustrates an application scenario of a method of data flow anomaly detection provided in accordance with an embodiment of the present application;
FIG. 2 schematically illustrates an architecture diagram for data flow anomaly detection provided in one embodiment of the present application;
FIG. 3 schematically illustrates a flow diagram of a method of data flow anomaly detection according to an embodiment of the present application;
FIG. 4 schematically illustrates a flow chart for determining a target relationship chain feature vector and a target behavior feature vector according to an embodiment of the present application;
FIG. 5 shows a schematic flow diagram of a method of training a risk identification model for an IP dimension according to one embodiment of the present application;
FIG. 6 shows a schematic block diagram of the training of a risk identification model for the IP dimension according to one embodiment of the present application;
FIG. 7 schematically illustrates a flow chart for determining a target relationship chain feature vector and a target behavior feature vector according to another embodiment of the present application;
FIG. 8 shows a schematic flow diagram of a method of training a risk identification model for a device dimension according to an embodiment of the application;
FIG. 9 shows a schematic block diagram of the training of a risk identification model of the device dimensions according to one embodiment of the present application;
FIG. 10 schematically illustrates a block diagram of an apparatus for data flow anomaly detection according to an embodiment of the present application;
FIG. 11 illustrates a schematic structural diagram of a computer system suitable for use in implementing the electronic device of an embodiment of the present application.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these example embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The drawings are schematic illustrations of the present application and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus their repetitive description will be omitted.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more example embodiments. In the following description, numerous specific details are provided to give a thorough understanding of example embodiments of the present application. One skilled in the relevant art will recognize, however, that the embodiments of the present application can be practiced without one or more of the specific details, or with other methods, components, steps, etc. In other instances, well-known structures, methods, implementations, or operations are not shown or described in detail to avoid obscuring aspects of the application.
Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in software, or in one or more hardware modules or integrated circuits, or may be embodied in different networks, processor devices, or micro-control devices.
With continued research and progress, artificial intelligence technology has been developed and applied in many fields, such as smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, autonomous driving, unmanned aerial vehicles, robots, smart healthcare, and smart customer service. The present application performs data stream risk identification and risk identification model training based on artificial intelligence technology.
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machine has the functions of perception, reasoning and decision, namely the machine has the learning ability.
Machine Learning (ML) is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It specializes in studying how computers simulate or implement human learning behavior to acquire new knowledge or skills and to reorganize existing knowledge structures so as to continuously improve performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks (e.g., convolutional neural networks), belief networks, reinforcement learning, transfer learning, inductive learning, and formal learning.
Artificial intelligence combined with cloud services can also implement artificial intelligence cloud services, commonly referred to as AI as a Service (AIaaS). This is a service model for artificial intelligence platforms: the AIaaS platform splits several types of common AI services and provides independent or packaged services in the cloud. The model is similar to an AI-themed app store: all developers can access one or more of the platform's artificial intelligence services through Application Programming Interfaces (APIs), and experienced developers can also use the AI framework and AI infrastructure provided by the platform to deploy, operate, and maintain their own dedicated cloud AI services.
Fig. 1 is a diagram of an application scenario of the method for data stream anomaly detection provided in an embodiment. As shown in Fig. 1, the application scenario includes a terminal 110 and a server 120.
In some implementations, at least one risk identification model may be trained by the server 120, and the at least one risk identification model may include some or all of a supervised learning model, a clustering model, and an anomaly detection model. After training, the server 120 may deploy the trained model(s) in a risk identification application, which the terminal 110 may install. After the terminal 110 obtains a data stream, the user may send a risk identification instruction through a corresponding operation; the terminal 110 receives the instruction, performs risk identification with the obtained data stream as the data to be processed, and obtains the risk identification result of the data stream.
The risk identification application may be a network security application, and the network security application may further have functions of data recording, audio/video playing, translation, data query, and the like.
In other implementations, at least one risk recognition model may be trained by the terminal 110, where the at least one risk recognition model may include some or all of a supervised learning model, a clustering model, and an anomaly detection model. After the terminal 110 obtains the data stream, the user may send a risk identification instruction through a corresponding operation, and the terminal 110 may receive the risk identification instruction, perform risk identification using the obtained data stream as data to be processed, and obtain a risk identification result of the data stream.
It is to be understood that the above application scenario is only an example, and does not constitute a limitation on the method for detecting data stream anomalies provided in the embodiment of the present application. For example, the trained at least one risk identification model may be stored in the server 120, and the server 120 may receive a data stream sent by the terminal 110, perform risk identification on the data stream to obtain a risk identification result of the data stream, and then return the risk identification result to the terminal 110.
The server 120 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud service, a cloud database, cloud computing, a cloud function, cloud storage, network service, cloud communication, middleware service, domain name service, security service, CDN, and a big data and artificial intelligence platform. The terminal 110 may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a vehicle-mounted computer, a smart watch, and the like. The terminal 110 and the server 120 may be directly or indirectly connected through wired or wireless communication, which is not limited in this application.
Fig. 2 schematically shows an architecture diagram of data stream anomaly detection provided in an embodiment of the present application. As shown in Fig. 2, the input is a data stream generated in network activities such as registration, login, coupon claiming, group ordering, bargain assistance, product ordering, video on demand, and commenting; the data stream generally contains the user's account information, IP information, and device information. After account risk identification, IP risk identification, device risk identification, and behavior risk identification, a risk grade and a risk label are finally output through model fusion and strategy fusion, and whether the data stream comes from black product behavior is judged. The present application is particularly suitable for the two scenarios of IP risk identification and device risk identification.
To better understand the embodiments of the present application, black product identification at the present stage is explained below.
Currently, risk-control teams generally identify the IPs and devices used by black products in three ways:
1. Maintaining blacklists of IPs, devices, and accounts: these blacklists may come from IPs and devices with abnormal behavior in one's own business, or from profile labels of the IP itself, such as data-center IPs and open-port information. Because the IPs and devices controlled by black products are not fixed, blacklists have a limited useful lifetime.
2. Identifying black products by behavior: to carry out batch cheating and traffic inflation, black products usually initiate a large number of requests within a short time, so a risk-control team can identify black product cheating from the number of requests issued by the same IP, device number, mobile phone number, or user account within a short period.
3. Identifying black products through relation chains: to bypass a service provider's risk control, network black products develop tools for modifying, stealing, and forging IPs, device numbers, mobile phone numbers, and so on. When black products frequently switch IPs and device numbers, group-like relation chains appear among those IPs or device numbers, and a risk-control team can mine black product groups through these relation chains.
The current black product identification techniques have the following shortcomings:
1. Network black products can quickly rotate IPs and device numbers using second-dial IPs, device-number modification, modem-pool ("cat pool") account farming, and similar techniques, greatly shortening the useful lifetime of a service provider's blacklists.
2. Black product IPs may heavily share the same IP segments with normal users for a period of time, making it difficult to block them directly with an IP blacklist without harming normal users.
3. Because black products frequently switch IPs and device numbers, mining black products from behavior chains or relation chains alone can hardly achieve effective coverage.
To address these technical problems, the present application provides a data stream anomaly detection scheme. By vector-embedding the IP and device relation chain features, behavior features can be fused with relation chain features, so that risk identification can be performed on the data stream by combining behavior and relation chain, whether the data stream is black product behavior is judged, the accuracy of black product identification is improved, and higher coverage is achieved.
The following describes in detail a specific implementation of the embodiments of the present application.
Fig. 3 shows a schematic flow diagram of a method 200 of data flow anomaly detection according to an embodiment of the present application, which method 200 of data flow anomaly detection may be performed by a device having computing processing capabilities, such as the terminal 110 or the server 120 described above. Referring to fig. 3, the method 200 for data flow anomaly detection may at least include steps S210 to S240, which are described in detail as follows:
in S210, a data stream is obtained, where the data stream includes at least IP information and device information.
Specifically, the data stream may be generated during registration, login, coupon claiming, group ordering, bargain assistance, product ordering, video on demand, commenting, and other network activities.
The IP information may be, for example, an IP address. Specifically, the IP information may include multiple IP addresses or multiple associated IP addresses, possibly within a period of time; it may also include multiple IP addresses requesting the same service, possibly within a period of time.
The device information may be, for example, a device number or a device identifier. Specifically, the device information may include multiple device numbers or device identifiers, or multiple associated device numbers or device identifiers, possibly within a period of time; it may also include multiple device numbers or device identifiers requesting the same service, possibly within a period of time.
In some embodiments, the data stream may further include, but is not limited to, account information and content information. The account information may be, for example, a user account, or information associated with the user account such as a mobile phone number and an identity card number. The content information may be, for example, information related to the requested service, such as registration, login, coupon claiming, or group ordering.
In some embodiments, the data stream may be request transaction-log data.
In some embodiments, abnormal data in the original data stream may be filtered out to obtain the data stream. The abnormal data may be, for example, data with an illegal IP address, an illegal device number, an illegal account number, or illegal content. Of course, other illegal data is also possible, and the present application is not limited thereto.
In some embodiments, the original data stream may be screened to obtain the data stream, for example based on possible implementations of black product cheating and traffic inflation.
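As a minimal sketch of this filtering step (the field names and validity rules here are assumptions, not the patent's), records with unparseable IPs or missing device numbers can be dropped using the standard `ipaddress` module:

```python
import ipaddress

def filter_stream(raw_records):
    """Drop records whose IP address fails to parse or whose device number is
    missing: one hedged reading of filtering abnormal data (field names assumed)."""
    clean = []
    for rec in raw_records:
        try:
            ipaddress.ip_address(rec.get("ip", ""))  # raises ValueError on illegal IPs
        except ValueError:
            continue
        if not rec.get("device_id"):                 # illegal / missing device number
            continue
        clean.append(rec)
    return clean
```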
In S220, feature vector processing is performed on the data stream to obtain a target relation chain feature vector and a target behavior feature vector, where the target relation chain feature vector is an IP relation chain feature vector and the target behavior feature vector is an IP behavior feature vector; or the target relation chain feature vector is a device relation chain feature vector and the target behavior feature vector is a device behavior feature vector.
In S230, the target relationship chain feature vector and the target behavior feature vector are fused to obtain a fused feature vector.
In some embodiments, the target relation chain feature vector and the target behavior feature vector may be spliced (concatenated) to obtain the fused feature vector.
Assume the target relation chain feature vector is (a1, a2, …, an) and the target behavior feature vector is (b1, b2, …, bm). The feature vector fusion process may specifically include: splicing the target relation chain feature vector with the target behavior feature vector to obtain (a1, a2, …, an, b1, b2, …, bm), and then unifying the spliced feature vector to the same range with a regularization algorithm to obtain (x1, x2, …, xm+n).
That is, the target relation chain feature vector (a1, a2, …, an) and the target behavior feature vector (b1, b2, …, bm) are fused to obtain the fused feature vector (x1, x2, …, xm+n).
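The splice-then-normalize fusion can be written directly. L2 normalization is used here as one plausible choice of "unifying to the same range with a regularization algorithm"; the patent does not fix a specific algorithm, so this choice is an assumption.

```python
import math

def fuse(chain_vec, behavior_vec):
    """Splice (a1..an) with (b1..bm), then rescale the result to unit L2 norm
    to obtain (x1..x(m+n)). L2 scaling is an assumption; the text only says
    the spliced vector is unified to the same range."""
    v = list(chain_vec) + list(behavior_vec)
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else v
```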
It should be noted that the target relation chain feature vector and the target behavior feature vector may also be fused in other ways, which is not limited in this application.
Specifically, using the fused feature vector as the input of the risk identification model enriches the feature dimensions to the greatest extent: risk identification is performed on the data stream by combining relation chain features and behavior features, whether the data stream is black product behavior is judged, and the accuracy and coverage of black product identification are improved.
In S240, the fusion feature vector is input into at least one risk recognition model trained in advance, and a risk recognition result of the data stream output by the at least one risk recognition model is obtained.
In some embodiments, the at least one risk identification model includes a supervised learning model, a clustering model, and an anomaly detection model.
The label data used by the supervised learning model is obtained from a risk database accumulated in the business. The risk database may be derived, for example, from IPs and devices with abnormal behavior in one's own business, or from profile labels of the IP itself, such as data-center IPs and open-port information.
For risk identification in the IP dimension, the risk database may be, for example, an IP blacklist/profile library built from IPs with abnormal behavior in one's own business.
For risk identification in the device dimension, the risk database may be, for example, a device blacklist/profile library built from devices with abnormal behavior in one's own business.
It should be noted that, for the supervised learning model, the training data has both features and labels; through training, the model finds the connection between features and labels, so that when it encounters data with only features and no label, it can predict the label for that data.
That is, the supervised learning model is trained on the relationship between the independent variable (the fused feature vector, X) and the dependent variable (the label data, Y), so that Y can be predicted from X.
It should be noted that the clustering model divides the samples into several classes composed of similar objects. No classification standard needs to be given in advance during this process; cluster analysis classifies automatically based on the sample data. Objects in the same class are highly similar, while objects in different classes are highly dissimilar.
It should be noted that the anomaly detection model finds objects that differ from most other objects, that is, outliers. The data is generally assumed to follow a "normal" model, and anomalies are considered deviations from this normal model.
In some embodiments, the supervised learning model may be, for example, a random forest model, the clustering model may be, for example, a Gaussian mixture model, and the anomaly detection model may be, for example, an isolation forest model, which is not limited in this application.
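As a rough sketch of the three model types named above, using scikit-learn with synthetic stand-in data (the fused feature vectors, labels, and model settings here are illustrative, not those of the application; the 0.5 threshold matches the example given later):

```python
# Sketch: the three risk identification model types named in the text.
import numpy as np
from sklearn.ensemble import RandomForestClassifier, IsolationForest
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))             # stand-in fused feature vectors
y = (X[:, 0] + X[:, 1] > 1).astype(int)   # stand-in risk labels (from a risk database)

# Supervised learning model: learns the mapping from X (features) to Y (labels).
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
risk_prob = rf.predict_proba(X)[:, 1]     # per-stream risk score

# Clustering model: groups similar objects with no predefined classification standard.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
cluster = gmm.predict(X)

# Anomaly detection model: flags outliers deviating from the "normal" model.
iso = IsolationForest(random_state=0).fit(X)
is_outlier = iso.predict(X) == -1

first_threshold = 0.5                     # the manually set first threshold
flagged = risk_prob > first_threshold     # streams judged risky by the supervised model
```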
In some embodiments, a user instruction is obtained, the user instruction indicating a first threshold; and determining the data flow as the data flow with the risk in the case that the risk identification result output by one or more risk identification models in the at least one risk identification model is larger than the first threshold value.
In some embodiments, the data stream is determined to be a risky data stream if the risk identification result output by one or more of the at least one risk identification model is greater than the first threshold; that is, the data stream may be black-product activity.
Specifically, the first threshold may be set manually according to requirements; for example, the first threshold may be set to 0.5.
In some embodiments, the training process of the at least one risk identification model includes:
acquiring a training sample set, wherein each training sample in the training sample set comprises a fusion characteristic vector obtained by fusing a target relation chain characteristic vector and a target behavior characteristic vector and a risk identification label corresponding to the fusion characteristic vector;
the at least one risk recognition model is trained according to a training sample set.
Specifically, the target relation chain feature vector included in the training sample is an IP relation chain feature vector, and the target behavior feature vector is an IP behavior feature vector; or the target relation chain feature vector included in the training sample is the device relation chain feature vector, and the target behavior feature vector is the device behavior feature vector.
Specifically, the feature vector processing may be performed on the data stream to obtain a target relationship chain feature vector and a target behavior feature vector included in the training sample.
The number of training samples in the training sample set can be set as required. For example, if the model is to be trained 40 times with 5000 training samples used in each round, the training sample set may include 5000 training samples, and those 5000 training samples are used each time the model is trained. A fused feature vector and its corresponding risk identification label mean the following: the fused feature vector is the feature vector on which risk identification needs to be performed, and the risk identification label is the risk identification result expected after the fused feature vector is identified using the risk identification model.
The goal of model training is to obtain better model parameters and thereby improve the risk identification effect. During training, the fused feature vector is input into the risk identification model; the model parameters are then adjusted according to the difference between the risk identification result output by the model and the risk identification label corresponding to the fused feature vector, so that the risk identification result obtained with the adjusted parameters moves closer to that label. This is repeated until the model convergence condition is met, finally yielding the trained risk identification model.
In some embodiments, the model training effect of the at least one risk identification model may be verified against a risk database.
Specifically, at least one verification sample is selected from a risk database, wherein the verification sample comprises a fusion feature vector obtained by fusing an IP relation chain feature vector and an IP behavior feature vector and a risk identification label corresponding to the fusion feature vector; under the condition that a risk identification model (such as a supervised learning model, a clustering model and an anomaly detection model) is fitted with a training sample, a verification sample is input into the risk identification model, and optimal parameters are selected from multiple groups of convergence parameters obtained by the risk identification model in the training process according to a risk identification result output by the risk identification model and a risk identification label corresponding to the verification sample.
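A minimal sketch of this validation-based parameter selection, with hypothetical converged parameter snapshots and a plain-accuracy score standing in for the real risk identification models (all names and data are invented for illustration):

```python
# Sketch: score several converged parameter snapshots on validation samples
# drawn from the risk database, and keep the best-scoring snapshot.
import numpy as np

def accuracy(predict, X_val, y_val):
    return float(np.mean(predict(X_val) == y_val))

rng = np.random.default_rng(1)
X_val = rng.normal(size=(50, 4))          # stand-in fused validation vectors
y_val = (X_val[:, 0] > 0).astype(int)     # stand-in risk identification labels

# Hypothetical converged parameter sets: each is a weight vector for a
# simple linear scorer standing in for a trained risk identification model.
snapshots = [rng.normal(size=4) for _ in range(5)]

def make_predict(w):
    return lambda X: (X @ w > 0).astype(int)

scores = [accuracy(make_predict(w), X_val, y_val) for w in snapshots]
best = snapshots[int(np.argmax(scores))]  # the "optimal parameters"
```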
The embodiments of the application can be applied to traffic risk control and anti-fraud for internet applications in industries such as e-commerce, live streaming, navigation, and video, improving the accuracy of identifying network black products in traffic scenarios such as registration, login, coupon collection, commodity ordering, video on demand, and commenting, thereby improving traffic quality, optimizing marketing costs, and improving traffic marketing effectiveness. For example, in a typical e-commerce coupon fission campaign, it may be used to identify black-product gangs that use throwaway secondary accounts to invite one another, exploit promotions, and batch-register new accounts to claim newcomer coupons.
The method and device of the application can solve the problems of accuracy and coverage rate in identifying black products when network black-product operators control a large number of IP, device, and account resources.
Therefore, in the embodiment of the application, the IP and the device relation chain feature are subjected to vector embedding, so that the behavior feature can be fused with the relation chain feature, the risk identification can be performed on the data stream by combining the behavior and the relation chain, whether the data stream is a black product behavior or not is further judged, and the black product identification accuracy and coverage rate are improved.
In addition, quantifying the IP/device relation chain allows model training to be performed on the relation chain, avoiding the manual threshold setting required when mining relation chains with a community discovery algorithm, while also improving the accuracy of risk identification for the data stream.
Fig. 4 is a schematic flow chart of a method for detecting data flow anomalies according to an embodiment of the present application, and the method shown in fig. 4 details a specific process of obtaining the target relationship chain feature vector and the target behavior feature vector in S220 described above. The target relationship chain feature vector is an IP relationship chain feature vector, and the target behavior feature vector is an IP behavior feature vector, as shown in fig. 4, the following steps S2201 to S2204 may be included. The detailed description is as follows:
in S2201, an IP-based mapping process is performed on the data stream to obtain an IP relationship map, nodes of the IP relationship map are IP addresses, an edge is created between two IP addresses having at least one of a device, an account, and a content that correspond to each other, and a weight of the edge is a number of at least one of the device, the account, and the content that correspond to each other between the two IP addresses.
That is, in the IP relationship map, an edge is created between two IP addresses having at least one of a device, an account, and a content that correspond to each other, and the weight of the edge is the number of at least one of the device, the account, and the content that correspond to each other between the two IP addresses.
In some embodiments, before constructing the IP relationship graph, IP information and first information (the first information may specifically include device information, account information, and content information) of each request flow in the data stream need to be extracted and then deduplicated to form an IP-first information relationship pair.
In S2201, it is assumed that an edge is created between two IP addresses having devices that correspond in common, and the weight of the edge is the number of devices that correspond in common between the two IP addresses. Specifically, in S2201, the IP and device information of each request flow in the data stream are extracted and then deduplicated to form an IP-device relationship pair. And if the two IP addresses have the devices which correspond to each other in the relation pair, establishing an edge between the two nodes, wherein the weight of the edge is the number of the devices which correspond to each other in the two IP addresses.
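The pair extraction and edge construction described in S2201 might be sketched as follows (the IP addresses and device identifiers are invented for illustration):

```python
# Sketch: build the IP relationship graph from deduplicated IP-device pairs.
# Nodes are IP addresses; two IPs sharing at least one device get an edge
# whose weight is the number of commonly corresponding devices.
from collections import defaultdict
from itertools import combinations

# Deduplicated IP-device relationship pairs extracted from the data stream.
pairs = {("1.1.1.1", "devA"), ("2.2.2.2", "devA"),
         ("1.1.1.1", "devB"), ("2.2.2.2", "devB"),
         ("3.3.3.3", "devC")}

devices_by_ip = defaultdict(set)
for ip, dev in pairs:
    devices_by_ip[ip].add(dev)

# Edge weight = number of devices common to the two IP addresses.
edges = {}
for ip_a, ip_b in combinations(sorted(devices_by_ip), 2):
    shared = devices_by_ip[ip_a] & devices_by_ip[ip_b]
    if shared:
        edges[(ip_a, ip_b)] = len(shared)

print(edges)  # {('1.1.1.1', '2.2.2.2'): 2}
```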
In S2202, a node is sampled multiple times on the IP relationship map in a random walk manner, and an IP address sequence sample is obtained.
In some embodiments, after nodes on the IP relationship graph are randomly arranged, a walking path is generated by using a randomly selected ith IP address as a root node, each walking path randomly moves from a current node to an adjacent node, and the walking is finished when a next adjacent node cannot be found or the path length reaches an upper limit value; and after multiple wandering sampling is executed, an IP address sequence sample is obtained.
The upper limit of the path length may be determined based on an IP relationship map, or the upper limit of the path length may be artificially set according to actual requirements.
Specifically, a Graph Neural Network (GNN) node representation algorithm may be used to generate an embedded vector of nodes, thereby obtaining IP address sequence samples.
The following takes the classic DeepWalk algorithm as an example (of course, other walk algorithms are also possible, and the application is not limited thereto). Nodes on the IP relationship graph are sampled through random walks: starting from a specific node, each walk moves randomly from the current node to an adjacent node, and this process is repeated continuously to generate a random walk path. The node sampling task is finished after a certain number of random walk paths have been generated.
Specifically, the IP relationship graph may be represented as G = (V, E), where V is the node set and E is the edge set. After the nodes are randomly arranged, all nodes are traversed; each time a random node v_i is selected, and a walk path W(v_i) = (v_i^1, v_i^2, …, v_i^n) is generated with v_i as the root node, where n is the maximum number of nodes in a walk path. A walk ends when the next adjacent node cannot be found or the path length reaches the upper limit; the above steps may be repeated γ times to obtain the IP address sequence samples.
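The walk-sampling procedure above can be sketched as follows (toy graph; the length cap n and repetition count gamma are small illustrative values):

```python
# Sketch: DeepWalk-style sampling — shuffle the nodes, start a walk at each
# node in turn, step to a random neighbor until no neighbor exists or the
# length cap n is reached, and repeat the whole pass gamma times.
import random

graph = {"A": ["B", "C"], "B": ["A"], "C": ["A"], "D": []}

def random_walk(root, n):
    path = [root]
    while len(path) < n:
        neighbors = graph[path[-1]]
        if not neighbors:         # next adjacent node cannot be found: end walk
            break
        path.append(random.choice(neighbors))
    return path

def sample_walks(n=5, gamma=3):
    walks = []
    for _ in range(gamma):        # repeat the traversal gamma times
        nodes = list(graph)
        random.shuffle(nodes)     # randomly arrange the nodes
        for v in nodes:           # each node serves once as the root v_i
            walks.append(random_walk(v, n))
    return walks

random.seed(0)
walks = sample_walks()
```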
In S2203, an embedded vector of each IP address in the IP address sequence sample is calculated to obtain an IP relationship chain feature vector.
Specifically, a sequence algorithm may be used to calculate an embedding vector for each IP address in the IP address sequence sample, where the sequence algorithm may be, for example, word2vec algorithm, node2vec, BERT, or the like.
The classic word2vec algorithm is taken as an example below. For a random walk path (v_1, v_2, …, v_n) and any intermediate node v_i, the model predicts node v_i from its surrounding nodes, thereby obtaining the embedded vector of node v_i. By analogy, the embedded vectors of all the nodes are obtained.
In S2204, the data streams are aggregated through the IP dimension, and behavior statistical information under the IP sliding window is calculated to obtain an IP behavior feature vector.
In some embodiments, the size of the IP sliding window may be set manually to a period of time according to actual requirements, for example, 1 minute, 10 minutes, 30 minutes, 1 hour, 2 hours, 1 day, and so on, which is not limited in this application.
Specifically, data streams are aggregated through IP dimensions, and behavior statistical information under an IP sliding window is calculated, so that behavior characteristics of the IP are obtained.
In some embodiments, the behavior statistics under the IP sliding window include at least one of:
the number of requests, the number of associated devices, the number of associated accounts and the number of associated mobile phones.
In some embodiments, the data streams are aggregated through an IP dimension, and at least one of the number of requests, the number of associated devices, the number of associated accounts, and the number of associated mobile phones in an IP sliding window is calculated; and performing feature vector processing on at least one of the request number, the associated equipment number, the associated account number and the associated mobile phone number under the IP sliding window to obtain the IP behavior feature vector.
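A sketch of the IP-dimension aggregation with pandas (column names, timestamps, and the 10-minute window are illustrative):

```python
# Sketch: aggregate request flows by IP and count behavior statistics
# (request count, distinct devices, distinct accounts) inside a sliding window.
import pandas as pd

flows = pd.DataFrame({
    "ts": pd.to_datetime(["2021-03-16 10:00", "2021-03-16 10:03",
                          "2021-03-16 10:08", "2021-03-16 10:20"]),
    "ip": ["1.1.1.1", "1.1.1.1", "1.1.1.1", "2.2.2.2"],
    "device": ["devA", "devB", "devA", "devC"],
    "account": ["u1", "u1", "u2", "u3"],
})

window_end = pd.Timestamp("2021-03-16 10:10")
window = flows[(flows.ts > window_end - pd.Timedelta("10min"))
               & (flows.ts <= window_end)]

# Per-IP behavior features under the sliding window.
stats = window.groupby("ip").agg(
    requests=("ts", "size"),
    devices=("device", "nunique"),
    accounts=("account", "nunique"),
)
print(stats.loc["1.1.1.1"].tolist())  # → [2, 2, 2]
```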
FIG. 5 shows a schematic flow diagram of a method of training a risk identification model for an IP dimension according to one embodiment of the present application. FIG. 6 shows a schematic block diagram of the training of a risk identification model for the IP dimension according to one embodiment of the present application. Fig. 5 corresponds to fig. 6, and specifically, as shown in fig. 5, may include the following S1-1 to S1-11. The detailed description is as follows:
s1-1, acquiring the original data stream.
Specifically, the original data stream may be generated during network activities such as registration, login, coupon collection, group ordering, bargain assistance, commodity ordering, video on demand, and commenting.
S1-2, carrying out abnormal data filtering on the original data flow to obtain the data flow, wherein the data flow at least comprises IP information and equipment information.
Specifically, abnormal data in the original data stream may be filtered to obtain the data stream. The abnormal data may be, for example, data whose IP address is illegal, data whose device number is illegal, data whose account number is illegal, data whose content is illegal, or the like. Of course, other illegal data are also possible, and the present application is not limited thereto.
The IP information may be, for example, an IP address. Specifically, the IP information may include, for example, a plurality of IP addresses or a plurality of associated IP addresses, or a plurality of IP addresses or a plurality of associated IP addresses over a period of time; the IP information may also include, for example, multiple IP addresses requesting the same service, or multiple IP addresses requesting the same service over a period of time.
The device information may be, for example, a device number or a device identification. Specifically, the device information may include, for example, a plurality of device numbers or device identifications, or a plurality of associated device numbers or device identifications; the device information may also include, for example, a plurality of device numbers or device identifications over a period of time, or a plurality of associated device numbers or device identifications over a period of time; the device information may also include, for example, a plurality of device numbers or device identifications that request the same service, or a plurality of device numbers or device identifications that request the same service for a period of time.
In some embodiments, the data stream may further include, but is not limited to, account information and content information. The account information may be, for example, a user account, or information such as a mobile phone number or identification number associated with the user account. The content information may be, for example, information related to the requested service, such as registration, login, coupon collection, bargain assistance, and the like.
S1-3, extracting IP and device duplicate removal relation pairs from the data stream.
Specifically, the IP and device information of each request flow in the data stream are extracted and then de-duplicated to form IP-device relationship pairs, thereby ensuring that each IP-device pair is unique.
And S1-4, performing IP-based mapping processing on the IP and the equipment duplicate removal relation pair to obtain an IP relation map, wherein the nodes of the IP relation map are IP addresses, an edge is created between two IP addresses of the equipment which are in common correspondence, and the weight of the edge is the number of the equipment which is in common correspondence between the two IP addresses.
And S1-5, performing graph node wandering sampling on the IP relation graph to obtain an IP address sequence sample.
Specifically, after nodes on the IP relation graph are randomly arranged, a walking path is generated by taking the ith randomly selected IP address as a root node, each walking step randomly moves from the current node to an adjacent node, and the walking is finished when the next adjacent node cannot be found or the path length reaches an upper limit value; and after multiple walk sampling is executed, obtaining an IP address sequence sample.
For example, the IP relationship graph may be represented as G = (V, E), where V is the node set and E is the edge set. After the nodes are randomly arranged, all nodes are traversed; each time a random node v_i is selected, and a walk path W(v_i) = (v_i^1, v_i^2, …, v_i^n) is generated with v_i as the root node, where n is the maximum number of nodes in a walk path. A walk ends when the next adjacent node cannot be found or the path length reaches the upper limit; the above steps may be repeated γ times to obtain the IP address sequence samples.
And S1-6, calculating the embedded vector of each IP address in the IP address sequence sample to obtain the IP relation chain feature vector.
Specifically, a sequence algorithm may be adopted to calculate an embedded vector for each IP address in the IP address sequence sample, and the sequence algorithm may be, for example, word2vec algorithm, node2vec, BERT, or the like.
The classic word2vec algorithm is taken as an example below. For a random walk path (v_1, v_2, …, v_n) and any intermediate node v_i, the model predicts node v_i from its surrounding nodes, thereby obtaining the embedded vector of node v_i. By analogy, the embedded vectors of all the nodes are obtained.
And S1-7, carrying out data aggregation processing of IP dimension on the data stream.
Specifically, the data included in the data stream is aggregated in the IP dimension in order to calculate the behavior statistical information under the IP sliding window.
And S1-8, calculating behavior statistical information under the IP sliding window to obtain an IP behavior characteristic vector.
In some embodiments, the size of the IP sliding window may be set manually to a period of time according to actual requirements, for example, 1 minute, 10 minutes, 30 minutes, 1 hour, 2 hours, 1 day, and so on, which is not limited in this application.
Specifically, data streams are aggregated through IP dimensions, and behavior statistical information under an IP sliding window is calculated, so that behavior characteristics of the IP are obtained.
In some embodiments, the behavioral statistics under the IP sliding window include at least one of:
the number of requests, the number of associated devices, the number of associated accounts and the number of associated mobile phones.
And S1-9, fusing the IP relation chain feature vector and the IP behavior feature vector by using a regularization algorithm to obtain a fused feature vector.
In some embodiments, the IP relationship chain feature vector and the IP behavior feature vector may be spliced or connected to obtain a fused feature vector.
Assume the IP relation chain feature vector is (a_1, a_2, …, a_n) and the IP behavior feature vector is (b_1, b_2, …, b_m). The feature vector fusion process may specifically include: splicing the IP relation chain feature vector and the IP behavior feature vector to obtain (a_1, a_2, …, a_n, b_1, b_2, …, b_m), and then unifying the spliced feature vector into the same range using a regularization algorithm to obtain (x_1, x_2, …, x_{m+n}).
That is, the IP relation chain feature vector (a_1, a_2, …, a_n) and the IP behavior feature vector (b_1, b_2, …, b_m) are fused to obtain the fused feature vector (x_1, x_2, …, x_{m+n}).
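The splicing-and-regularization fusion can be sketched in a few lines (min-max scaling is used here as one possible regularization choice; the vectors are illustrative):

```python
# Sketch: concatenate the relation chain vector (a_1..a_n) with the behavior
# vector (b_1..b_m), then rescale the spliced vector into a common range.
import numpy as np

a = np.array([0.4, -1.2, 3.0])        # IP relation chain feature vector
b = np.array([120.0, 5.0, 2.0, 1.0])  # IP behavior feature vector (counts)

concat = np.concatenate([a, b])       # (a_1..a_n, b_1..b_m)

# Unify the spliced feature vector into the same range [0, 1].
fused = (concat - concat.min()) / (concat.max() - concat.min())
```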
It should be noted that the IP relationship chain feature vector and the IP behavior feature vector may also be fused in other manners, which is not limited in the present application.
And S1-10, respectively inputting the fused feature vectors into a supervised learning model, a clustering model and an anomaly detection model for model training.
That is, the at least one risk identification model includes a supervised learning model, a clustering model, and an anomaly detection model. The supervised learning model uses a random forest model, the clustering model uses a Gaussian mixture model, and the anomaly detection model uses an isolated forest model.
Wherein the label data used by the supervised learning model is obtained from a risk database accumulated in the business. The risk database may be derived, for example, from IPs exhibiting abnormal behavior in the business itself, or from portrait labels of the IP itself, such as machine-room (data-center) IPs and open-port information.
Specifically, the risk database may be, for example, an IP blacklist/profile library formed from IPs with abnormal behavior in the business itself.
If the risk identification result output by any one of the supervised learning model, the clustering model, and the anomaly detection model exceeds the first threshold, the data stream is judged to be a risky data stream; that is, the data stream may be black-product activity.
And S1-11, performing effect verification on model training of the supervised learning model, the clustering model and the anomaly detection model according to the risk database accumulated in the business.
In some embodiments, model effect verification is performed using a risk database accumulated in the business.
Specifically, at least one verification sample is selected from a risk database, wherein the verification sample comprises a fusion feature vector obtained by fusing an IP relation chain feature vector and an IP behavior feature vector and a risk identification label corresponding to the fusion feature vector; under the condition that a risk identification model (such as a supervised learning model, a clustering model and an anomaly detection model) is fitted with a training sample, a verification sample is input into the risk identification model, and optimal parameters are selected from multiple groups of convergence parameters obtained by the risk identification model in the training process according to a risk identification result output by the risk identification model and a risk identification label corresponding to the verification sample.
Fig. 7 is a schematic flow chart of a method for detecting data flow anomalies according to an embodiment of the present application, and the method shown in fig. 7 details a specific process of obtaining the target relationship chain feature vector and the target behavior feature vector in S220 described above. The target relationship chain feature vector is a device relationship chain feature vector, and the target behavior feature vector is a device behavior feature vector, as shown in fig. 7, the following steps S2205 to S2208 may be included. The detailed description is as follows:
in S2205, the data stream is subjected to feature vector processing to obtain an equipment relationship map, a node of the equipment relationship map is an equipment identifier, an edge is created between two pieces of equipment having at least one of an IP address, an account, and a content that correspond to each other, and a weight of the edge is a number of at least one of an IP address, an account, and a content that correspond to each other between the two pieces of equipment.
That is, in the device relationship map, an edge is created between two devices having at least one of the IP address, the account, and the content that correspond to each other, and the weight of the edge is the number of at least one of the IP address, the account, and the content that correspond to each other between the two devices.
In some embodiments, before constructing the device relationship graph, the device information and the second information (the second information may specifically include IP information, account information, and content information) of each request flow in the data stream need to be extracted and then deduplicated to form a device-second information relationship pair.
In S2205, it is assumed that an edge is created between two devices having IP addresses that correspond in common, and the weight of the edge is the number of IP addresses that correspond in common between the two devices. Specifically, in S2205, the IP and device information of each request stream in the data stream are extracted and then deduplicated to form a device-IP relationship pair. And constructing an equipment relation graph through the relation pairs, wherein the nodes of the graph are equipment numbers (identifications), if two equipment have IP addresses which correspond to each other in the relation pairs, an edge is created between the two equipment, and the weight of the edge is the number of the IP addresses which correspond to each other between the two equipment.
In S2206, multiple sampling is performed on the node on the device relationship map by using a random walk manner, so as to obtain a device sequence sample.
In some embodiments, after nodes on the device relationship graph are randomly arranged, a randomly selected jth device is used as a root node to generate a walking path, each walking path randomly moves from a current node to an adjacent node, and the walking is finished when a next adjacent node cannot be found or the path length reaches an upper limit value; and obtaining a device sequence sample after executing multi-time walk sampling.
Specifically, a Graph Neural Network (GNN) node representation algorithm may be used to generate embedded vectors of nodes, resulting in device sequence samples.
The following takes the classic DeepWalk algorithm as an example (of course, other walk algorithms are also possible, and the application is not limited thereto). Nodes on the device relationship graph are sampled through random walks: starting from a specific node, each walk moves randomly from the current node to an adjacent node, and this process is repeated continuously to generate a random walk path. The node sampling task is finished after a certain number of random walk paths have been generated.
Specifically, the device relationship graph may be represented as G = (V, E), where V is the node set and E is the edge set. After the nodes are randomly arranged, all nodes are traversed; each time a random node v_i is selected, and a walk path W(v_i) = (v_i^1, v_i^2, …, v_i^n) is generated with v_i as the root node, where n is the maximum number of nodes in a walk path. A walk ends when the next adjacent node cannot be found or the path length reaches the upper limit; the above steps may be repeated γ times to obtain the device sequence samples.
In S2207, the embedded vector of each device in the device sequence sample is calculated to obtain the device relation chain feature vector.
Specifically, a sequence algorithm may be used to calculate an embedding vector for each device in the device sequence sample, and the sequence algorithm may be, for example, word2vec algorithm, node2vec, BERT, or the like.
The classic word2vec algorithm is taken as an example below. For a random walk path (v_1, v_2, …, v_n) and any intermediate node v_i, the model predicts node v_i from its surrounding nodes, thereby obtaining the embedded vector of node v_i. By analogy, the embedded vectors of all the nodes are obtained.
In S2208, the data streams are aggregated through the device dimensions, and behavior statistical information under the device sliding window is calculated to obtain a device behavior feature vector.
In some embodiments, the size of the device sliding window may be set manually to a period of time according to actual requirements, for example, 1 minute, 10 minutes, 30 minutes, 1 hour, 2 hours, 1 day, and so on, which is not limited in this application.
Specifically, the data streams are aggregated through the device dimension, and behavior statistical information under the device sliding window (e.g., over the last 1 day / 1 hour / 10 minutes) is calculated, thereby obtaining the behavior features of the device.
In some embodiments, the behavior statistics under the device sliding window include at least one of:
request number, correlation IP address number, correlation account number and correlation mobile phone number.
In some embodiments, the data streams are aggregated through device dimensions, and at least one of the number of requests, the number of associated IP addresses, the number of associated accounts, and the number of associated mobile phone numbers in a sliding window of the device is calculated; and performing feature vector processing on at least one of the request number, the associated IP address number, the associated account number and the associated mobile phone number under the equipment sliding window to obtain the equipment behavior feature vector.
FIG. 8 shows a schematic flow diagram of a method of training a risk recognition model of a device dimension according to one embodiment of the present application. FIG. 9 shows a schematic block diagram of the training of a risk identification model of the device dimensions according to one embodiment of the present application. Fig. 8 corresponds to fig. 9, and specifically, as shown in fig. 8, may include the following S2-1 to S2-11. The detailed description is as follows:
s2-1, acquiring the original data stream.
Specifically, the original data stream may be generated during network activities such as registration, login, coupon collection, group ordering, bargain assistance, commodity ordering, video on demand, and commenting.
S2-2, carrying out abnormal data filtering on the original data flow to obtain the data flow, wherein the data flow at least comprises IP information and equipment information.
Specifically, abnormal data in the original data stream may be filtered to obtain the data stream. The abnormal data may be, for example, data with an illegal IP address, data with an illegal device number, data with an illegal account number, data with an illegal content, and the like. Of course, other illegal data are also possible, and the present application is not limited thereto.
The IP information may be, for example, an IP address. Specifically, the IP information may include, for example, a plurality of IP addresses or a plurality of associated IP addresses, or a plurality of IP addresses or a plurality of associated IP addresses over a period of time; the IP information may also include, for example, multiple IP addresses requesting the same service, or multiple IP addresses requesting the same service over a period of time.
The device information may be, for example, a device number or a device identification. Specifically, the device information may include, for example, a plurality of device numbers or device identifications, or a plurality of associated device numbers or device identifications; the device information may also include, for example, a plurality of device numbers or device identifications over a period of time, or a plurality of associated device numbers or device identifications over a period of time; the device information may also include, for example, a plurality of device numbers or device identifications requesting the same service, or a plurality of device numbers or device identifications requesting the same service for a period of time.
In some embodiments, the data stream may further include, but is not limited to, account information and content information. The account information may be, for example, a user account, or information such as a mobile phone number and an identification number associated with the user account. The content information may be, for example, requested service-related information such as registration, login, coupon pickup, scrip, and the like.
S2-3, extracting de-duplicated IP-device relationship pairs from the data stream.
Specifically, the IP and device information of each request in the data stream are extracted and then de-duplicated to form IP-device relationship pairs, ensuring that each IP-device pair is unique.
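The de-duplication step can be sketched with a set of (IP, device) tuples; the record field names are illustrative assumptions:

```python
def extract_ip_device_pairs(records):
    """Extract the (IP, device) pair of each request in the data stream
    and de-duplicate them, so that every pair appears exactly once."""
    return {(r["ip"], r["device"]) for r in records}
```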
S2-4, performing device-based mapping processing on the IP-device relationship pairs to obtain a device relationship graph, where the nodes of the device relationship graph are device numbers, an edge is created between two devices that share at least one common IP address, and the weight of the edge is the number of IP addresses the two devices have in common.
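The graph construction in S2-4 can be sketched as below, using a plain adjacency-list representation (a minimal sketch; a graph library could be used instead):

```python
from collections import defaultdict
from itertools import combinations

def build_device_graph(ip_device_pairs):
    """Build the device relationship graph from de-duplicated (IP, device)
    pairs: nodes are device numbers, and two devices sharing at least one
    IP get an edge weighted by the number of IPs they have in common."""
    # Invert the pairs: IP -> set of devices that used it.
    devices_by_ip = defaultdict(set)
    for ip, device in ip_device_pairs:
        devices_by_ip[ip].add(device)

    # Count shared IPs for every device pair seen under the same IP.
    weights = defaultdict(int)
    for devices in devices_by_ip.values():
        for a, b in combinations(sorted(devices), 2):
            weights[(a, b)] += 1

    # Symmetric adjacency list: device -> {neighbour: edge weight}.
    graph = defaultdict(dict)
    for (a, b), w in weights.items():
        graph[a][b] = w
        graph[b][a] = w
    return dict(graph)
```

Note that devices sharing no IP with any other device (such as most benign devices) simply have no edges, which is what makes densely connected clusters stand out.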
S2-5, performing graph-node walk sampling on the device relationship graph to obtain device sequence samples.
S2-6, calculating the embedded vector of each device in the device sequence sample to obtain the device relation chain feature vector.
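Steps S2-5 and S2-6 can be sketched as follows. The walk sampler mirrors the described procedure (shuffle the nodes, walk to random neighbours, stop at a dead end or the length cap); turning the resulting device sequences into embedding vectors could then be done with a skip-gram model such as gensim's `Word2Vec` — that choice is an assumption, since the application does not prescribe a specific embedding algorithm.

```python
import random

def sample_walks(graph, walks_per_node=2, max_len=5, seed=0):
    """Sample device sequences by random walks over the relationship graph.

    From each root in a shuffled node order, every step moves to a random
    neighbour; a walk ends when the current node has no neighbour or the
    length cap is reached.
    """
    rng = random.Random(seed)
    nodes = sorted(graph)
    walks = []
    for _ in range(walks_per_node):
        rng.shuffle(nodes)            # random arrangement of the nodes
        for root in nodes:
            walk = [root]
            while len(walk) < max_len:
                neighbours = list(graph.get(walk[-1], {}))
                if not neighbours:    # no next adjacent node can be found
                    break
                walk.append(rng.choice(neighbours))
            walks.append(walk)
    return walks
```

The resulting walks play the role of "sentences" whose "words" are device numbers, so any word-embedding trainer can produce the device relationship chain feature vectors.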
S2-7, performing device-dimension data aggregation processing on the data stream.
Specifically, the data included in the data stream is aggregated in the device dimension, in order to compute the behavior statistics under the device sliding window.
S2-8, calculating the behavior statistics under the device sliding window to obtain the device behavior feature vector.
In some embodiments, the size of the device sliding window may be set manually according to actual requirements, for example, it may be 1 minute, 10 minutes, 30 minutes, 1 hour, 2 hours, 1 day, and so on, which is not limited in this application.
Specifically, data streams are aggregated through device dimensions, and behavior statistical information under a device sliding window is calculated, so that behavior characteristics of the device are obtained.
In some embodiments, the behavior statistics under the device sliding window include at least one of:
request number, correlation IP address number, correlation account number and correlation mobile phone number.
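A minimal sketch of these per-device sliding-window statistics (request count plus distinct associated IPs, accounts and phone numbers) is given below; the record field names and the numeric-timestamp window representation are assumptions for the example:

```python
from collections import defaultdict

def device_behavior_features(records, window_start, window_end):
    """Aggregate the stream by device over one sliding window and return,
    per device, [request count, #distinct IPs, #distinct accounts,
    #distinct phone numbers] as a fixed-order feature vector."""
    stats = defaultdict(lambda: {"requests": 0, "ips": set(),
                                 "accounts": set(), "phones": set()})
    for r in records:
        if not (window_start <= r["ts"] < window_end):
            continue                      # outside the sliding window
        s = stats[r["device"]]
        s["requests"] += 1
        s["ips"].add(r["ip"])
        s["accounts"].add(r["account"])
        s["phones"].add(r["phone"])
    return {dev: [s["requests"], len(s["ips"]),
                  len(s["accounts"]), len(s["phones"])]
            for dev, s in stats.items()}
```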
S2-9, fusing the device relationship chain feature vector and the device behavior feature vector by using a regularization algorithm to obtain a fused feature vector.
In some embodiments, the device relationship chain feature vector and the device behavior feature vector may be spliced or connected to obtain a fusion feature vector.
Assume the device relationship chain feature vector is (a_1, a_2, …, a_n) and the device behavior feature vector is (b_1, b_2, …, b_m). The feature vector fusion process may specifically include: splicing the device relationship chain feature vector and the device behavior feature vector to obtain (a_1, a_2, …, a_n, b_1, b_2, …, b_m), and then unifying the spliced feature vector into the same range by using a regularization algorithm to obtain (x_1, x_2, …, x_{m+n}).

That is, the device relationship chain feature vector (a_1, a_2, …, a_n) and the device behavior feature vector (b_1, b_2, …, b_m) are fused to obtain the fused feature vector (x_1, x_2, …, x_{m+n}).
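The splice-then-regularize fusion can be sketched as below. The application does not fix the regularization algorithm, so L2 normalisation is used here as one plausible choice (an assumption):

```python
import math

def fuse(relation_vec, behavior_vec):
    """Concatenate the relationship-chain and behavior feature vectors,
    then scale the result to unit L2 norm so all components share the
    same range."""
    fused = list(relation_vec) + list(behavior_vec)
    norm = math.sqrt(sum(x * x for x in fused))
    return [x / norm for x in fused] if norm else fused
```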
It should be noted that the device relationship chain feature vector and the device behavior feature vector may also be fused in other manners, which is not limited in the present application.
S2-10, inputting the fused feature vectors into a supervised learning model, a clustering model, and an anomaly detection model, respectively, for model training.
That is, the at least one risk identification model includes a supervised learning model, a clustering model, and an anomaly detection model. The supervised learning model uses a random forest model, the clustering model uses a Gaussian mixture model, and the anomaly detection model uses an isolation forest model.
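Using scikit-learn's implementations of the three named model families, a training sketch might look like this (the hyper-parameters are illustrative assumptions, not values stated in the application):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, IsolationForest
from sklearn.mixture import GaussianMixture

def train_risk_models(fused_vectors, labels, seed=0):
    """Train the three risk identification models on fused feature vectors:
    a random forest (supervised, uses risk-database labels), a Gaussian
    mixture (clustering) and an isolation forest (anomaly detection)."""
    X = np.asarray(fused_vectors)
    rf = RandomForestClassifier(n_estimators=50, random_state=seed).fit(X, labels)
    gmm = GaussianMixture(n_components=2, random_state=seed).fit(X)
    iso = IsolationForest(random_state=seed).fit(X)
    return rf, gmm, iso
```

At serving time, a stream whose score from any one of the three models crosses the configured threshold would be flagged as risky.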
The label data used by the supervised learning model is obtained from a risk database accumulated in the business. The risk database may be derived from, for example, devices exhibiting abnormal behavior in the business itself, or from profile labels of the device itself, such as data-center (machine room) device flags, open-port information, and the like.
Specifically, the risk database may be, for example, a device blacklist/profile library formed from devices exhibiting abnormal behavior in the business.
If the output of any one of the supervised learning model, the clustering model, and the anomaly detection model reaches a first threshold, the data stream is determined to be risky flow data; that is, the data stream may originate from black-market (fraudulent) activity.
S2-11, performing effect verification on the model training of the supervised learning model, the clustering model, and the anomaly detection model according to the device risk database accumulated in the business.
In some embodiments, the model effect verification is performed using a risk database accumulated in the business.
Specifically, at least one verification sample is selected from a risk database, wherein the verification sample comprises a fusion feature vector obtained by fusing an equipment relation chain feature vector and an equipment behavior feature vector and a risk identification label corresponding to the fusion feature vector; under the condition that a risk identification model (such as a supervised learning model, a clustering model and an anomaly detection model) is fitted with a training sample, inputting a verification sample into the risk identification model, and selecting optimal parameters from multiple groups of convergence parameters obtained by the risk identification model in the training process according to a risk identification result output by the risk identification model and a risk identification label corresponding to the verification sample.
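The parameter-selection part of this verification step can be sketched as below: candidate models (e.g. checkpoints saved with different convergence parameters) are scored against labelled validation samples drawn from the risk database, and the best one is kept. The accuracy criterion is an assumption; any agreement metric between predictions and risk labels would serve.

```python
def select_best_model(candidates, validation_set):
    """Among candidate models (callables mapping a fused feature vector to
    a risk label), return the one that best matches the validation labels
    taken from the accumulated risk database."""
    def accuracy(model):
        hits = sum(1 for x, y in validation_set if model(x) == y)
        return hits / len(validation_set)
    return max(candidates, key=accuracy)
```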
While the method embodiments of the present application are described in detail above with reference to FIGS. 3-9, the apparatus embodiments of the present application are described below with reference to FIG. 10. It should be understood that the apparatus embodiments correspond to the method embodiments, and similar details may be found in the method embodiments.
FIG. 10 schematically illustrates a block diagram of an apparatus for data flow anomaly detection according to an embodiment of the present application. The data flow anomaly detection apparatus may be a part of a computer device using a software unit or a hardware unit, or a combination of both. As shown in fig. 10, the apparatus 300 for detecting data flow anomalies according to the embodiment of the present application may specifically include:
an obtaining module 310, configured to obtain a data stream, where the data stream at least includes IP information and device information;
a determining module 320, configured to perform feature vector processing on the data stream to obtain a target relation chain feature vector and a target behavior feature vector, where the target relation chain feature vector is an IP relation chain feature vector and the target behavior feature vector is an IP behavior feature vector; or the target relation chain feature vector is a device relation chain feature vector and the target behavior feature vector is a device behavior feature vector;
the fusion module 330 is configured to perform fusion processing on the target relationship chain feature vector and the target behavior feature vector to obtain a fusion feature vector;
the input module 340 is configured to input the fusion feature vector into at least one risk recognition model trained in advance, and obtain a risk recognition result of the data stream output by the at least one risk recognition model.
In one embodiment, if the target relation chain feature vector is an IP relation chain feature vector and the target behavior feature vector is an IP behavior feature vector, the determining module 320 is specifically configured to:
performing IP-based mapping processing on the data stream to obtain an IP relation map, wherein the nodes of the IP relation map are IP addresses, an edge is created between two IP addresses with at least one of the equipment, the account and the content which correspond to each other, and the weight of the edge is the number of at least one of the equipment, the account and the content which correspond to each other between the two IP addresses;
performing multiple sampling on the nodes on the IP relation map by adopting a random walk mode to obtain an IP address sequence sample; calculating an embedded vector of each IP address in the IP address sequence sample to obtain an IP relation chain feature vector; and
and aggregating the data streams through IP dimension, and calculating behavior statistical information under an IP sliding window to obtain an IP behavior characteristic vector.
In one embodiment, the determining module 320 is specifically configured to:
after nodes on the IP relation map are randomly arranged, generate a walk path by taking the randomly selected ith IP address as a root node, where each walk step randomly moves from the current node to an adjacent node, and the walk ends when the next adjacent node cannot be found or the path length reaches an upper limit value;
and obtain the IP address sequence samples after performing walk sampling multiple times.
In one embodiment, the determining module 320 is specifically configured to:
aggregating the data streams through IP dimension, and calculating at least one of request number, associated equipment number, associated account number and associated mobile phone number under an IP sliding window;
and performing feature vector processing on at least one of the request number, the associated equipment number, the associated account number and the associated mobile phone number under the IP sliding window to obtain an IP behavior feature vector.
In one embodiment, if the target relation chain feature vector is the device relation chain feature vector and the target behavior feature vector is the device behavior feature vector, the determining module 320 is specifically configured to:
performing device-based mapping processing on the data stream to obtain a device relationship map, wherein nodes of the device relationship map are device identifiers, an edge is created between two devices having at least one of a commonly corresponding IP address, account and content, and the weight of the edge is the number of at least one of the commonly corresponding IP address, account and content between the two devices;
performing multiple sampling on the nodes on the device relation map in a random walk mode to obtain a device sequence sample; calculating an embedded vector of each device in the device sequence sample to obtain a device relation chain feature vector; and
and aggregating the data streams through the dimension of the equipment, and calculating the behavior statistical information under the sliding window of the equipment to obtain the behavior characteristic vector of the equipment.
In one embodiment, the determining module 320 is specifically configured to:
after nodes on the device relation map are randomly arranged, generate a walk path by taking the randomly selected jth device as a root node, where each walk step randomly moves from the current node to an adjacent node, and the walk ends when the next adjacent node cannot be found or the path length reaches an upper limit value;
and obtain the device sequence samples after performing walk sampling multiple times.
In one embodiment, the determining module 320 is specifically configured to:
aggregating the data stream through the equipment dimension, and calculating at least one of the request number, the associated IP address number, the associated account number and the associated mobile phone number under the equipment sliding window;
and performing feature vector processing on at least one of the request number, the associated IP address number, the associated account number and the associated mobile phone number under the equipment sliding window to obtain an equipment behavior feature vector.
In one embodiment, the at least one risk identification model includes a supervised learning model, a clustering model, and an anomaly detection model, wherein tag data used by the supervised learning model is obtained from a risk database accumulated in the business.
In one embodiment, the training process of the at least one risk identification model comprises:
acquiring a training sample set, wherein each training sample in the training sample set comprises a fusion characteristic vector obtained by fusing a target relation chain characteristic vector and a target behavior characteristic vector and a risk identification label corresponding to the fusion characteristic vector;
at least one risk recognition model is trained from a training sample set.
In one embodiment, the obtaining module 310 is configured to obtain a user instruction, where the user instruction is used to indicate a first threshold;
the determining module 320 is further configured to determine the data stream to be risky flow data if the risk identification result output by one or more of the at least one risk identification model is greater than the first threshold.
In one embodiment, the obtaining module 310 is specifically configured to:
acquiring an original data stream;
and filtering abnormal data in the original data stream to obtain the data stream.
The specific implementation of each module in the apparatus for detecting data stream anomalies provided in the embodiment of the present application may refer to the content in the method for detecting data stream anomalies, and is not described herein again.
The modules in the apparatus for detecting data stream anomalies can be wholly or partially implemented by software, hardware, or a combination thereof. The modules can be embedded in hardware form in, or independent of, a processor in the computer device, or stored in software form in a memory of the computer device, so that the processor can invoke and execute the operations corresponding to the modules.
Fig. 11 shows a schematic structural diagram of a computer system of an electronic device implementing the embodiment of the present application. It should be noted that the computer system 400 of the electronic device shown in fig. 11 is only an example, and should not bring any limitation to the functions and the scope of the application of the embodiments.
As shown in fig. 11, the computer system 400 includes a Central Processing Unit (CPU) 401 that can perform various appropriate actions and processes in accordance with a program stored in a Read-Only Memory (ROM) 402 or a program loaded from a storage section 408 into a Random Access Memory (RAM) 403. In the RAM 403, various programs and data necessary for system operation are also stored. The CPU 401, the ROM 402, and the RAM 403 are connected to each other via a bus 404. An Input/Output (I/O) interface 405 is also connected to the bus 404.
The following components are connected to the I/O interface 405: an input section 406 including a keyboard, a mouse, and the like; an output section 407 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), and a speaker; a storage section 408 including a hard disk and the like; and a communication section 409 including a network interface card such as a Local Area Network (LAN) card, a modem, or the like. The communication section 409 performs communication processing via a network such as the Internet. A drive 410 is also connected to the I/O interface 405 as needed. A removable medium 411, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 410 as necessary, so that a computer program read therefrom is installed into the storage section 408 as necessary.
In particular, the processes described in the above flowcharts may be implemented as computer software programs according to embodiments of the present application. For example, embodiments of the present application include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated by the above-described flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 409, and/or installed from the removable medium 411. When the computer program is executed by a Central Processing Unit (CPU)401, various functions defined in the apparatus of the present application are executed.
In one embodiment, there is also provided an electronic device comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the steps in the above-described method embodiments via execution of executable instructions.
In one embodiment, a computer device is further provided, which includes a memory and a processor, the memory stores a computer program, and the processor implements the steps of the above method embodiments when executing the computer program.
In an embodiment, a computer-readable storage medium is provided, in which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method embodiments.
It should be noted that the computer readable storage medium described in this application can be, for example but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM), a flash Memory, an optical fiber, a portable Compact Disc Read-Only Memory (CD-ROM), an optical storage device, a magnetic disk storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present application, a computer-readable signal medium may comprise a propagated data signal with computer-readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable storage medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, radio frequency, etc., or any suitable combination of the foregoing.
The embodiment is only used for illustrating the present application, and the selection of the software and hardware platform architecture, the development environment, the development language, the message acquisition source, and the like of the embodiment may be changed, and on the basis of the technical solution of the present application, any improvement and equivalent transformation performed on a certain part according to the principle of the present application should not be excluded from the protection scope of the present application.
It is to be understood that the terminology used in the embodiments of the present application and the appended claims is for the purpose of describing particular embodiments only and is not intended to be limiting of the embodiments of the present application.
Those of skill in the art would appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the embodiments of the present application.
If the functions are implemented in the form of software functional units and sold or used as stand-alone products, they may be stored in a computer-readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium and including several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory, a random access memory, a magnetic disk, or an optical disk.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described apparatuses, devices and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed electronic device, apparatus and method may be implemented in other ways.
For example, the division of a unit or a module or a component in the above-described device embodiments is only one logical function division, and there may be other divisions in actual implementation, for example, a plurality of units or modules or components may be combined or may be integrated into another system, or some units or modules or components may be omitted, or not executed.
Also for example, the units/modules/components described above as separate components may or may not be physically separate, and components displayed as units may be located in one place or distributed over a plurality of network elements. Some or all of the units/modules/components can be selected according to actual needs to achieve the purposes of the embodiments of the present application.
Finally, it should be noted that the above shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The above description is only a specific implementation of the embodiments of the present application, but the scope of the embodiments of the present application is not limited thereto, and any person skilled in the art can easily conceive of changes or substitutions within the technical scope of the embodiments of the present application, and all the changes or substitutions should be covered by the scope of the embodiments of the present application. Therefore, the protection scope of the embodiments of the present application shall be subject to the protection scope of the claims.

Claims (14)

1. A method of data flow anomaly detection, comprising:
acquiring a data stream, wherein the data stream at least comprises IP information and equipment information;
performing feature vector processing on the data stream to obtain a target relation chain feature vector and a target behavior feature vector; the target relation chain feature vector is an IP relation chain feature vector, and the target behavior feature vector is an IP behavior feature vector; or the target relation chain feature vector is an equipment relation chain feature vector, and the target behavior feature vector is an equipment behavior feature vector;
performing fusion processing on the target relation chain feature vector and the target behavior feature vector to obtain a fusion feature vector;
and inputting the fusion feature vector into at least one risk recognition model trained in advance, and acquiring a risk recognition result of the data stream output by the at least one risk recognition model.
2. The method of claim 1, wherein if the target relationship chain feature vector is an IP relationship chain feature vector and the target behavior feature vector is an IP behavior feature vector;
the processing the feature vector of the data stream to obtain a target relationship chain feature vector and a target behavior feature vector includes:
performing IP-based mapping processing on the data stream to obtain an IP relationship map, wherein nodes of the IP relationship map are IP addresses, an edge is created between two IP addresses of at least one of the devices, the accounts and the contents which correspond to each other, and the weight of the edge is the number of at least one of the devices, the accounts and the contents which correspond to each other between the two IP addresses;
sampling the nodes on the IP relation map for multiple times in a random walk mode to obtain an IP address sequence sample; calculating an embedded vector of each IP address in the IP address sequence sample to obtain the IP relation chain feature vector; and
and aggregating the data streams through IP dimension, and calculating behavior statistical information under an IP sliding window to obtain the IP behavior characteristic vector.
3. The method of claim 2, wherein the performing multiple sampling on the nodes on the IP relationship map by using the random walk manner to obtain IP address sequence samples comprises:
after nodes on the IP relation graph are randomly arranged, a walking path is generated by taking the ith randomly selected IP address as a root node, each walking step randomly moves from the current node to an adjacent node, and the walking is finished when the next adjacent node cannot be found or the path length reaches an upper limit value;
and after multiple walk sampling is executed, obtaining the IP address sequence sample.
4. The method of claim 2, wherein the aggregating the data streams through an IP dimension, and calculating behavior statistics under an IP sliding window to obtain the IP behavior feature vector comprises:
aggregating the data streams through IP dimensions, and calculating at least one of the request number, the associated equipment number, the associated account number and the associated mobile phone number under an IP sliding window;
and performing feature vector processing on at least one of the request number, the associated equipment number, the associated account number and the associated mobile phone number under the IP sliding window to obtain the IP behavior feature vector.
5. The method of claim 1, wherein if the target relationship chain feature vector is a device relationship chain feature vector and the target behavior feature vector is a device behavior feature vector;
the processing the feature vector of the data stream to obtain a target relationship chain feature vector and a target behavior feature vector includes:
performing device-based mapping processing on the data stream to obtain a device relationship map, wherein nodes of the device relationship map are device identifiers, an edge is created between two devices having at least one of an IP address, an account and content which correspond to each other, and the weight of the edge is the number of at least one of the IP address, the account and the content which correspond to each other between the two devices;
performing multiple sampling on the nodes on the equipment relation map in a random walk mode to obtain an equipment sequence sample; calculating an embedded vector of each device in the device sequence sample to obtain a device relation chain feature vector; and
and aggregating the data streams through the device dimension, and calculating behavior statistical information under a device sliding window to obtain the device behavior characteristic vector.
6. The method of claim 5, wherein the performing a plurality of sampling on the device relationship map in a random walk manner to obtain device sequence samples comprises:
after nodes on the equipment relation graph are randomly arranged, a randomly selected jth device is used as a root node to generate a walking path, each walking step randomly moves from the current node to an adjacent node, and the walking is finished when the next adjacent node cannot be found or the path length reaches an upper limit value;
and obtaining the device sequence sample after executing multi-time walk sampling.
7. The method of claim 5, wherein aggregating the data streams through device dimensions, and calculating behavior statistics under a device sliding window to obtain the device behavior feature vector comprises:
aggregating the data streams through the device dimension, and calculating at least one of the request number, the associated IP address number, the associated account number and the associated mobile phone number under a sliding window of the device;
and performing feature vector processing on at least one of the request number, the number of associated IP addresses, the number of associated account numbers and the number of associated mobile phone numbers under the equipment sliding window to obtain the equipment behavior feature vector.
8. The method of claim 1, wherein the at least one risk identification model comprises a supervised learning model, a clustering model, and an anomaly detection model, wherein the label data used by the supervised learning model is obtained from a risk database accumulated in a business.
9. The method of claim 1, wherein the training process of the at least one risk identification model comprises:
acquiring a training sample set, wherein each training sample in the training sample set comprises a fusion feature vector obtained by fusing the target relation chain feature vector and the target behavior feature vector and a risk identification label corresponding to the fusion feature vector;
training the at least one risk recognition model according to the training sample set.
10. The method of claim 1, further comprising:
acquiring a user instruction, wherein the user instruction indicates a first threshold;
and determining the data stream as a risky data stream when the risk identification result output by one or more of the at least one risk identification model is greater than the first threshold.
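The decision rule in claim 10 is a simple any-model-exceeds-threshold check, which can be written as:

```python
def is_risky(scores, first_threshold):
    """Flag the data stream as risky when one or more model outputs
    exceeds the user-supplied first threshold (claim 10's rule)."""
    return any(score > first_threshold for score in scores)
```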
11. The method of claim 1, wherein the obtaining the data stream comprises:
acquiring an original data stream;
and filtering abnormal data in the original data stream to obtain the data stream.
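Claim 11's pre-filter can be as simple as dropping malformed records before feature extraction. What counts as "abnormal data" is business-specific; a missing-field check and the `ip`/`device_id` field names are illustrative assumptions:

```python
def prefilter(raw_stream):
    """Filter abnormal records out of the original data stream,
    keeping only records that carry both IP and device information."""
    return [r for r in raw_stream if r.get("ip") and r.get("device_id")]
```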
12. An apparatus for data flow anomaly detection, comprising:
an acquisition module, configured to acquire a data stream, wherein the data stream comprises at least IP information and device information;
a determination module, configured to perform feature vector processing on the data stream to obtain a target relation chain feature vector and a target behavior feature vector, wherein either the target relation chain feature vector is an IP relation chain feature vector and the target behavior feature vector is an IP behavior feature vector, or the target relation chain feature vector is a device relation chain feature vector and the target behavior feature vector is a device behavior feature vector;
a fusion module, configured to fuse the target relation chain feature vector and the target behavior feature vector to obtain a fusion feature vector;
and an input module, configured to input the fusion feature vector into at least one pre-trained risk identification model;
wherein the acquisition module is further configured to acquire a risk identification result of the data stream output by the at least one risk identification model.
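The apparatus of claim 12 mirrors the method of claim 1; a minimal sketch wiring the modules together, with the extractor callables and class name as illustrative assumptions:

```python
class AnomalyDetector:
    """Sketch of the claim 12 apparatus: feature extraction
    (determination module), concatenation (fusion module), and model
    scoring (input/acquisition modules) over one data stream."""

    def __init__(self, chain_fn, behavior_fn, models):
        self.chain_fn = chain_fn        # stream -> relation chain feature vector
        self.behavior_fn = behavior_fn  # stream -> behavior feature vector
        self.models = models            # pre-trained risk identification models

    def detect(self, stream):
        chain_vec = self.chain_fn(stream)
        behavior_vec = self.behavior_fn(stream)
        fused = list(chain_vec) + list(behavior_vec)   # fusion feature vector
        return [model(fused) for model in self.models]  # one result per model
```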
13. An electronic device, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the method of any one of claims 1 to 11 via execution of the executable instructions.
14. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 11.
CN202110288769.8A 2021-03-18 2021-03-18 Method and device for detecting data stream abnormity, electronic equipment and storage medium Pending CN115114329A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110288769.8A CN115114329A (en) 2021-03-18 2021-03-18 Method and device for detecting data stream abnormity, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115114329A true CN115114329A (en) 2022-09-27

Family

ID=83324076

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110288769.8A Pending CN115114329A (en) 2021-03-18 2021-03-18 Method and device for detecting data stream abnormity, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115114329A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024082859A1 (en) * 2022-10-20 2024-04-25 天翼数字生活科技有限公司 Underground industry user identification method and apparatus
CN116760643A (en) * 2023-08-21 2023-09-15 明阳时创(北京)科技有限公司 IPv6 risk quantification method, system, medium and device based on artificial intelligence
CN116760643B (en) * 2023-08-21 2023-10-20 明阳时创(北京)科技有限公司 IPv6 risk quantification method, system, medium and device based on artificial intelligence

Similar Documents

Publication Publication Date Title
CN109922032B (en) Method, device, equipment and storage medium for determining risk of logging in account
WO2022252363A1 (en) Data processing method, computer device and readable storage medium
CN110929806B (en) Picture processing method and device based on artificial intelligence and electronic equipment
CN113761261A (en) Image retrieval method, image retrieval device, computer-readable medium and electronic equipment
CN113765928B (en) Internet of things intrusion detection method, equipment and medium
CN115114329A (en) Method and device for detecting data stream abnormity, electronic equipment and storage medium
CN111523413A (en) Method and device for generating face image
CN111563267A (en) Method and device for processing federal characteristic engineering data
CN110046297A (en) Recognition methods, device and the storage medium of O&M violation operation
CN116932919B (en) Information pushing method, device, electronic equipment and computer readable medium
CN114692007B (en) Method, device, equipment and storage medium for determining representation information
CN114419363A (en) Target classification model training method and device based on label-free sample data
CN114492601A (en) Resource classification model training method and device, electronic equipment and storage medium
CN111159241B (en) Click conversion estimation method and device
CN112884235A (en) Travel recommendation method, and training method and device of travel recommendation model
US20220207861A1 (en) Methods, devices, and computer readable storage media for image processing
CN115130536A (en) Training method of feature extraction model, data processing method, device and equipment
CN110087230B (en) Data processing method, data processing device, storage medium and electronic equipment
CN114238968A (en) Application program detection method and device, storage medium and electronic equipment
CN112231571A (en) Information data processing method, device, equipment and storage medium
CN116501993B (en) House source data recommendation method and device
CN114820085B (en) User screening method, related device and storage medium
CN113946758B (en) Data identification method, device, equipment and readable storage medium
Bao et al. Neural network‐based image quality comparator without collecting the human score for training
CN114510638A (en) Information processing method, apparatus, device, storage medium, and program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination