CN114915566A - Application identification method, device, equipment and computer readable storage medium

Publication number
CN114915566A
Authority
CN
China
Prior art keywords
data
fingerprint
application
transmission type
field
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110119196.6A
Other languages
Chinese (zh)
Other versions
CN114915566B (en)
Inventor
聂利权
郭晶
曾凡
容汉铿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202110119196.6A
Priority claimed from CN202110119196.6A
Publication of CN114915566A
Application granted
Publication of CN114915566B
Legal status: Active

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/02 Capturing of monitoring data
    • H04L43/028 Capturing of monitoring data by filtering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/30 Authentication, i.e. establishing the identity or authorisation of security principals
    • G06F21/44 Program or device authentication

Abstract

The application provides an application identification method, device, equipment and computer readable storage medium; the method comprises the following steps: extracting flow characteristics according to the transmission type of the network flow data; matching the flow characteristics in a preset flow fingerprint database to obtain a target sample application; the preset flow fingerprint database comprises a plaintext transmission type fingerprint and an encrypted transmission type fingerprint corresponding to each sample application in at least one sample application; the plaintext transmission type fingerprint comprises at least one of an application name fingerprint, a load key data fingerprint and a header field sequence fingerprint generated by performing feature extraction and preprocessing on plaintext transmission type data; the encrypted transmission type fingerprint comprises at least one of a target host fingerprint and a client handshake message fingerprint generated by performing feature extraction and preprocessing on encrypted transmission type data; and determining the target sample application as an application identification result of the network traffic data. Through the application, the efficiency and the accuracy of application identification can be improved.

Description

Application identification method, device, equipment and computer readable storage medium
Technical Field
The present application relates to cloud computing technologies, and in particular, to an application identification method, apparatus, device, and computer-readable storage medium.
Background
Currently, application (App) identification mainly relies on Deep Packet Inspection (DPI). DPI is an application-layer traffic detection and control technology: it analyzes the packet headers of the network data generated by an App, together with the application-layer data carried over the network, against a manually extracted feature library, and identifies the App by matching the features that the network data exhibits. However, manual feature extraction is inefficient and slow to update, and the extracted features cover only a limited number of dimensions, which greatly reduces the efficiency and accuracy of application identification.
Disclosure of Invention
The embodiments of the present application provide an application identification method, apparatus, device, and computer-readable storage medium, which can improve the efficiency and accuracy of application identification.
The technical solutions of the embodiments of the present application are implemented as follows:
An embodiment of the present application provides an application identification method, including:
acquiring and analyzing network traffic data in real time to obtain the transmission type of the network traffic data;
extracting flow characteristics corresponding to the network flow data according to the transmission type;
matching the flow characteristics in a preset flow fingerprint database to obtain a target sample application matched with the flow characteristics; the preset flow fingerprint database comprises a plaintext transmission type fingerprint and an encrypted transmission type fingerprint corresponding to each sample application in at least one sample application; the plaintext transmission type fingerprint comprises at least one of an application name fingerprint, a load key data fingerprint and a header field sequence fingerprint generated by performing feature extraction and preprocessing on plaintext transmission type data; the encrypted transmission type fingerprint comprises at least one of a target host fingerprint and a client handshake message fingerprint generated by performing feature extraction and preprocessing on encrypted transmission type data;
and determining the target sample application as an application identification result of the network traffic data.
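The matching step above can be illustrated with a minimal Python sketch. It is not the implementation described in this application: the `AppFingerprints` layout, the `identify` function, the "most shared features wins" rule, and the example package names and feature strings are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class AppFingerprints:
    """Fingerprints recorded for one sample application (hypothetical layout)."""
    plaintext: set = field(default_factory=set)  # app name, load key data, header order
    encrypted: set = field(default_factory=set)  # target host, client handshake digest


def identify(transmission_type, features, library):
    """Return the sample application sharing the most features with the input, or None.

    Assumption: a simple overlap count stands in for the matching rule,
    which the text does not specify.
    """
    best_app, best_hits = None, 0
    for app, fps in library.items():
        pool = fps.plaintext if transmission_type == "plaintext" else fps.encrypted
        hits = len(pool & set(features))
        if hits > best_hits:
            best_app, best_hits = app, hits
    return best_app


# Hypothetical fingerprint library with two sample applications.
library = {
    "com.example.chat": AppFingerprints(
        plaintext={"host:api.chat.example.com", "order:Host,User-Agent,Accept"},
        encrypted={"sni:chat.example.com"},
    ),
    "com.example.video": AppFingerprints(
        plaintext={"host:cdn.video.example.com"},
        encrypted={"sni:video.example.com"},
    ),
}

result = identify("encrypted", ["sni:video.example.com"], library)
```

Keeping plaintext and encrypted fingerprints in separate pools means a feature extracted from encrypted traffic is compared only against encrypted transmission type fingerprints.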
In the above method, before matching the traffic characteristics in the preset traffic fingerprint library to obtain the target sample application matched with the traffic characteristics, the method further includes:
acquiring application information of each sample application; the application information comprises an application package name;
collecting multiple pieces of sample flow data while each sample application is running; each piece of sample flow data comprises plaintext transmission type data and encrypted transmission type data;
in plaintext transmission type data of each sample flow data, when a character string containing the application packet name is detected, the application packet name is taken as the application name fingerprint;
detecting whether the plaintext transmission type data of each sample flow data contains at least one of the following fields: a target host domain name field, a universal gateway interface field, a resource request parameter field, and a user agent information field;
extracting the detected at least one field for normalization preprocessing to obtain at least one feature, counting the number of the obtained features of the at least one feature based on the multiple sample flow data, and taking the feature with the number of the features larger than a preset number threshold as the load key data fingerprint;
extracting at least one preset field name from a data packet header field of the plaintext transmission type data in the plaintext transmission type data of each sample flow data, and generating a header field sequence fingerprint according to the arrangement sequence of the at least one preset field name in the data packet header field so as to obtain the plaintext transmission type fingerprint;
when the encrypted transmission type data of each sample flow data is first handshake message data of a server name indication protocol, extracting a domain name field of a target host from the first handshake message data to be used as the fingerprint of the target host;
when the encrypted transmission type data of each sample flow data is second handshake message data of a secure transport layer protocol, extracting at least one preset handshake message field from the second handshake message data, and performing content splicing and information signature processing on the at least one preset handshake message field to obtain a client handshake message fingerprint;
counting the number of the obtained target host fingerprint and the client side handshake message fingerprint based on the multiple sample flow data, and taking the fingerprint with the number larger than a preset number threshold value as the encrypted transmission type fingerprint;
and constructing and obtaining the preset flow fingerprint database according to the plaintext transmission type fingerprint and the encrypted transmission type fingerprint corresponding to each sample application.
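The threshold-based selection used above for both the load key data fingerprint and the encrypted transmission type fingerprint can be sketched as follows. This is a minimal illustration: the function name `frequent_features` is hypothetical, and counting each feature at most once per sample is an assumption that the text does not state.

```python
from collections import Counter


def frequent_features(samples, min_count):
    """Keep the features whose count across samples is larger than min_count.

    `samples` is an iterable of per-sample feature lists. Assumption: a
    feature is counted once per sample, even if it repeats inside a sample.
    """
    counts = Counter()
    for sample_features in samples:
        counts.update(set(sample_features))
    return {feat for feat, n in counts.items() if n > min_count}


# Hypothetical features extracted from four pieces of sample flow data.
samples = [
    ["host:api.example.com", "ua:ExampleChat"],
    ["host:api.example.com"],
    ["host:api.example.com", "host:ads.example.net"],
    ["ua:ExampleChat"],
]
stable = frequent_features(samples, min_count=2)
```

With a threshold of 2, only the feature seen in three samples survives; features that appear occasionally (for example third-party hosts) are filtered out before the library is built.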
In the above method, the application information includes client information; the extracting the detected at least one field for normalization preprocessing to obtain at least one feature, and generating the load key data fingerprint according to the respective quantity information of the at least one feature, including:
when the at least one field contains the target host domain name field, carrying out normalization of a preset format on the target host domain name field to obtain a host domain name characteristic in the at least one characteristic;
when the at least one field comprises a universal gateway interface field, replacing a digital character string and a random character string in the universal gateway interface field with preset characters to obtain a gateway interface feature in the at least one feature;
when the at least one field comprises a resource request parameter field, extracting a key type character string from the resource request parameter field as a request parameter characteristic in the at least one characteristic;
when the at least one field contains the user agent information field, extracting the client information from the user agent information field as a proxy feature of the at least one feature.
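The four normalization steps above might look like the following Python sketch. All function names are hypothetical, and the concrete rules are assumptions: lower-casing and dropping the port as the "preset format", treating all-digit or long hexadecimal path segments as "digital" or "random" strings, reading the "key type character string" as the query parameter names, and matching the client information against a list of known client tokens.

```python
import re
from urllib.parse import parse_qsl, urlsplit


def normalize_host(host):
    """Lower-case the host and drop any port (assumes no IPv6 literal)."""
    return host.strip().lower().split(":")[0]


def normalize_path(path, placeholder="*"):
    """Replace numeric or random-looking path segments with a preset character."""
    out = []
    for seg in path.split("/"):
        # Assumption: 8 or more hex characters is treated as a random string.
        if seg.isdigit() or re.fullmatch(r"[0-9a-fA-F]{8,}", seg):
            out.append(placeholder)
        else:
            out.append(seg)
    return "/".join(out)


def param_keys(query):
    """Return the parameter names of a query string, in order of appearance."""
    return [k for k, _ in parse_qsl(query, keep_blank_values=True)]


def client_from_user_agent(user_agent, known_clients):
    """Return the first known client token found in the User-Agent, else None."""
    ua = user_agent.lower()
    for name in known_clients:
        if name.lower() in ua:
            return name
    return None


# Hypothetical request URL and User-Agent value.
parts = urlsplit("http://API.Example.com:8080/v1/user/12345/avatar/9f8e7d6c5b4a3f2e?uid=1&token=abc")
host_feature = normalize_host(parts.netloc)
path_feature = normalize_path(parts.path)
key_feature = param_keys(parts.query)
agent_feature = client_from_user_agent("ExampleChat/8.0.1 (Android 11)", ["ExampleChat"])
```

Normalizing before counting matters because values such as user IDs or tokens differ on every request; without replacing them, the same interface would never reach the preset number threshold.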
In the above method, the generating the sequence fingerprint of the header field according to the arrangement sequence of the at least one preset field name in the header field of the data packet includes:
acquiring the arrangement sequence of the at least one preset field name in the data header field;
and splicing the at least one preset field name according to the arrangement sequence to obtain the sequence fingerprint of the header field.
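A header field sequence fingerprint of this kind can be sketched in a few lines of Python. The function name, the comma separator, and the choice of preset field names are assumptions for illustration; the sketch also assumes HTTP/1.x style headers separated by CRLF.

```python
def header_order_fingerprint(raw_headers, preset_names, sep=","):
    """Join the preset header names in the order they appear in the request."""
    wanted = {n.lower() for n in preset_names}
    seen = []
    for line in raw_headers.split("\r\n")[1:]:  # skip the request line
        if not line:
            break  # a blank line ends the header section
        name = line.split(":", 1)[0].strip()
        if name.lower() in wanted and name not in seen:
            seen.append(name)
    return sep.join(seen)


# Hypothetical plaintext request; X-Trace is not a preset field and is ignored.
raw = (
    "GET / HTTP/1.1\r\n"
    "Host: a.example.com\r\n"
    "X-Trace: 1\r\n"
    "User-Agent: Foo\r\n"
    "Accept: */*\r\n"
    "\r\n"
)
fingerprint = header_order_fingerprint(raw, ["Accept", "Host", "User-Agent"])
```

The order of `preset_names` does not affect the result; only the order in which the fields occur in the packet header does, which is what makes the sequence usable as a fingerprint.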
In the above method, the second handshake message data includes: client handshake message data; the extracting at least one preset handshake message field from the second handshake message data of the security transport layer protocol data, and performing content splicing and information signature processing on the at least one preset handshake message field to obtain the client handshake message fingerprint includes:
extracting version information, an encryption suite candidate list, an extension list length and an elliptic curve key exchange algorithm support list from the client handshake message data as the at least one preset handshake message field;
splicing the at least one preset handshake message field, separating each preset handshake message field by using a first preset symbol, and separating at least one message content in each preset handshake message field by using a second preset symbol to obtain spliced data;
and calculating the hash value of the spliced data through a message digest algorithm, finishing information signature of the spliced data, and obtaining the client handshake message fingerprint.
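The splicing and signing steps above can be sketched as follows. The text only says "a first preset symbol", "a second preset symbol", and "a message digest algorithm"; using a comma, a hyphen, and MD5 here is an assumption, as are the function name and the example field values.

```python
import hashlib


def client_hello_fingerprint(version, cipher_suites, extensions_length, curves,
                             field_sep=",", item_sep="-"):
    """Splice four client handshake fields and hash the result.

    Fields are separated by `field_sep`; items inside a list field are
    separated by `item_sep`. Returns (spliced string, hex digest).
    """
    fields = [
        str(version),
        item_sep.join(str(c) for c in cipher_suites),
        str(extensions_length),
        item_sep.join(str(c) for c in curves),
    ]
    spliced = field_sep.join(fields)
    # Assumption: MD5 stands in for the unspecified message digest algorithm.
    digest = hashlib.md5(spliced.encode("ascii")).hexdigest()
    return spliced, digest


# Hypothetical values parsed from one client handshake message.
spliced, digest = client_hello_fingerprint(
    version=771,
    cipher_suites=[4865, 4866, 49195],
    extensions_length=178,
    curves=[29, 23, 24],
)
```

Hashing the spliced string gives a fixed-length fingerprint, so lookups in the fingerprint library do not depend on how many encryption suites or curves a client offers.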
In the above method, the extracting, according to the transmission type of the network traffic data, a traffic feature corresponding to the network traffic data includes:
when the transmission type is a plaintext transmission type, performing character extraction and preprocessing on the network traffic data to obtain at least one of an application name characteristic, a load key data characteristic and a header field sequence characteristic as the traffic characteristic;
and when the transmission type is an encryption transmission type, performing character extraction and preprocessing on the network flow data to obtain at least one of target host characteristics and client handshake message characteristics as the flow characteristics.
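Choosing the extraction path by transmission type can be sketched as below. The detection rule is an assumption and not taken from the text: a payload is treated as encrypted when it starts with a TLS handshake record (content type 0x16, major version 0x03) carrying a client handshake message (handshake type 0x01), and as plaintext when it starts with an HTTP method. The function names and the extractor callables are hypothetical.

```python
HTTP_METHODS = (b"GET ", b"POST ", b"PUT ", b"HEAD ", b"DELETE ", b"OPTIONS ", b"PATCH ")


def transmission_type(payload):
    """Classify a payload as 'encrypted', 'plaintext', or 'unknown'."""
    if len(payload) >= 6 and payload[0] == 0x16 and payload[1] == 0x03 and payload[5] == 0x01:
        return "encrypted"
    if payload.startswith(HTTP_METHODS):
        return "plaintext"
    return "unknown"


def extract_features(payload, plaintext_extractors, encrypted_extractors):
    """Run only the extractors that belong to the payload's transmission type."""
    kind = transmission_type(payload)
    extractors = {"plaintext": plaintext_extractors,
                  "encrypted": encrypted_extractors}.get(kind, [])
    features = []
    for extract in extractors:
        value = extract(payload)
        if value is not None:
            features.append(value)
    return kind, features


def first_line(payload):
    """Toy plaintext extractor: the request line as text."""
    return payload.split(b"\r\n", 1)[0].decode("ascii")


http_payload = b"GET /index HTTP/1.1\r\nHost: a.example.com\r\n\r\n"
tls_payload = bytes([0x16, 0x03, 0x01, 0x00, 0x05, 0x01, 0x00, 0x00, 0x01, 0x00])

http_result = extract_features(http_payload, [first_line], [len])
tls_result = extract_features(tls_payload, [first_line], [len])
```

In a fuller version, the plaintext extractors would produce the application name, load key data, and header field sequence features, and the encrypted extractors would produce the target host and client handshake message features.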
In the above method, after the target sample application is determined as the application identification result of the network traffic data, the method further includes:
and realizing application monitoring and/or network management functions based on the application identification result.
An embodiment of the present application provides an application identification apparatus, including:
the flow acquisition module is used for acquiring and analyzing network flow data in real time to obtain the transmission type of the network flow data;
the data preprocessing module is used for extracting the flow characteristics corresponding to the network flow data according to the transmission type;
the flow matching engine module is used for matching the flow characteristics in a preset flow fingerprint database to obtain target sample application matched with the flow characteristics; the preset flow fingerprint database comprises a plaintext transmission type fingerprint and an encrypted transmission type fingerprint corresponding to each sample application in at least one sample application; the plaintext transmission type fingerprint comprises at least one of an application name fingerprint, a load key data fingerprint and a header field sequence fingerprint generated by performing feature extraction and preprocessing on plaintext transmission type data; the encrypted transmission type fingerprint comprises at least one of a target host fingerprint and a client handshake message fingerprint generated by performing feature extraction and preprocessing on encrypted transmission type data;
and the determining module is used for determining the target sample application as an application identification result of the network flow data.
In the above device, the device further includes a fingerprint generation module, where the fingerprint generation module is configured to obtain application information of each sample application before matching the traffic characteristics in a preset traffic fingerprint library to obtain a target sample application with which the traffic characteristics are matched; the application information comprises an application package name;
collecting a plurality of sample flow data in the running process of each sample application; each sample flow data of the multiple sample flow data comprises plaintext transmission type data and encrypted transmission type data;
in the plaintext transmission type data of each sample flow data, when a character string containing the application packet name is detected, taking the application packet name as the application name fingerprint;
detecting whether the plaintext transmission type data of each sample flow data contains at least one of the following fields: a target host domain name field, a universal gateway interface field, a resource request parameter field and a user agent information field;
extracting the detected at least one field for normalization preprocessing to obtain at least one feature, counting the number of the obtained features of the at least one feature based on the multiple sample flow data, and taking the feature with the number of the features larger than a preset number threshold as the load key data fingerprint;
extracting at least one preset field name from a data packet header field of the plaintext transmission type data in the plaintext transmission type data of each sample flow data, and generating a header field sequence fingerprint according to the arrangement sequence of the at least one preset field name in the data packet header field so as to obtain the plaintext transmission type fingerprint;
when the encrypted transmission type data of each sample flow data is first handshake message data of a server name indication protocol, extracting a domain name field of a target host from the first handshake message data to be used as the fingerprint of the target host;
when the encrypted transmission type data of each sample flow data is second handshake message data of a secure transport layer protocol, extracting at least one preset handshake message field from the second handshake message data, and performing content splicing and information signature processing on the at least one preset handshake message field to obtain a client handshake message fingerprint;
counting the number of the obtained target host fingerprint and the client side handshake message fingerprint based on the multiple sample flow data, and taking the fingerprint with the number larger than a preset number threshold value as the encrypted transmission type fingerprint; and constructing and obtaining the preset flow fingerprint database according to the plaintext transmission type fingerprint and the encrypted transmission type fingerprint corresponding to each sample application.
In the above apparatus, the application information includes client information; the fingerprint generation module is further configured to, when the at least one field includes the target host domain name field, perform normalization of a preset format on the target host domain name field to obtain a host domain name feature in the at least one feature; when the at least one field comprises a universal gateway interface field, replacing a digital character string and a random character string in the universal gateway interface field with preset characters to obtain a gateway interface characteristic in the at least one characteristic; when the at least one field contains a resource request parameter field, extracting a key type character string from the resource request parameter field as a request parameter characteristic in the at least one characteristic; when the at least one field contains the user agent information field, extracting the client information from the user agent information field as a proxy feature in the at least one feature; obtaining respective quantities of the host domain name characteristic, the gateway interface characteristic, the request parameter characteristic and the proxy characteristic according to the multiple sample flow data, and taking the characteristic of which the quantity exceeds a preset quantity threshold value as the load key data fingerprint.
In the above apparatus, the fingerprint generating module is further configured to obtain an arrangement order of the at least one preset field name in the data header field; and splicing the at least one preset field name according to the arrangement sequence to obtain the sequence fingerprint of the header field.
In the above apparatus, the second handshake message data includes: client handshake message data; the fingerprint generation module is further configured to extract version information, an encryption suite candidate list, an extension list length, and an elliptic curve key exchange algorithm support list from the client handshake message data as the at least one preset handshake message field; splicing the at least one preset handshake message field, separating each preset handshake message field by using a first preset symbol, and separating at least one message content in each preset handshake message field by using a second preset symbol to obtain spliced data; and calculating the hash value of the spliced data through a message digest algorithm, finishing information signature of the spliced data, and obtaining the client handshake message fingerprint.
In the above apparatus, the data preprocessing module is further configured to, when the transmission type is a plaintext transmission type, perform character extraction and preprocessing on the network traffic data to obtain at least one of an application name feature, a load key data feature, and a header field order feature as the traffic feature; and when the transmission type is an encryption transmission type, perform character extraction and preprocessing on the network flow data to obtain at least one of target host characteristics and client handshake message characteristics as the flow characteristics.
In the above apparatus, the apparatus further includes an identification result application module, where the identification result application module is configured to, after determining the target sample application as the application identification result of the network traffic data, implement an application monitoring and/or a network management function based on the application identification result.
An embodiment of the present application provides an electronic device, including:
a memory for storing executable instructions;
and the processor is used for realizing the method provided by the embodiment of the application when executing the executable instructions stored in the memory.
Embodiments of the present application provide a computer-readable storage medium, which stores executable instructions for causing a processor to implement the method provided by the embodiments of the present application when the processor executes the executable instructions.
The embodiment of the application has the following beneficial effects:
Feature extraction and preprocessing are performed on the plaintext transmission type data and the encrypted transmission type data of each sample application to generate the plaintext transmission type fingerprint and the encrypted transmission type fingerprint corresponding to each sample application, so that the preset flow fingerprint library is generated automatically. This greatly speeds up feature extraction and fingerprint updating, lets the application identification system recognize new applications more quickly, and improves the efficiency of application identification. In addition, the plaintext transmission type fingerprint and the encrypted transmission type fingerprint each comprise at least one fingerprint of a different feature dimension, which improves the degree of matching for at least one flow characteristic and thereby further improves the accuracy of application identification.
Drawings
Fig. 1 is an alternative structural diagram of an application identification system architecture provided in an embodiment of the present application;
Fig. 2 is an alternative structural diagram of an application identification device provided in an embodiment of the present application;
Fig. 3 is an alternative flow chart of an application identification method provided in an embodiment of the present application;
Fig. 4 is an alternative flow chart of an application identification method provided in an embodiment of the present application;
Fig. 5 is an alternative flow chart of an application identification method provided in an embodiment of the present application;
Fig. 6 is an alternative flow chart of an application identification method provided in an embodiment of the present application;
Fig. 7 is a data content diagram of client handshake message data provided in an embodiment of the present application;
Fig. 8 is a schematic diagram of an optional functional module of the application identification apparatus provided in an embodiment of the present application;
Fig. 9 is an alternative flow chart of an application identification method provided in an embodiment of the present application.
Detailed Description
In order to make the objectives, technical solutions, and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings. The described embodiments should not be considered as limiting the present application, and all other embodiments obtained by a person of ordinary skill in the art without creative effort shall fall within the protection scope of the present application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
In the following description, the terms "first/second/third" are used only to distinguish similar objects and do not denote a particular order. Where permitted, the specific order or sequence of "first/second/third" may be interchanged, so that the embodiments of the application described herein can be practiced in an order other than that shown or described herein.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the application.
Before the embodiments of the present application are described in further detail, the terms and expressions involved in the embodiments are explained; the following explanations apply to these terms and expressions.
1) Cloud security (Cloud Security) is a general term for the security software, hardware, users, organizations, and secure cloud platforms applied on the basis of the cloud computing business model. Cloud security combines emerging technologies and concepts such as parallel processing, grid computing, and judgment of unknown virus behavior. A large number of networked clients monitor software behavior in the network for anomalies to obtain the latest information on Trojans and malicious programs on the Internet; this information is sent to a server for automatic analysis and processing, and the resulting solutions for viruses and Trojans are then distributed to every client.
The main research directions of cloud security include: 1. cloud computing security, which mainly studies how to guarantee the security of the cloud itself and of the applications on it, including cloud computer system security, secure storage and isolation of user data, user access authentication, information transmission security, network attack protection, compliance audit, and the like; 2. cloudification of security infrastructure, which mainly studies how to use cloud computing to build and integrate security infrastructure resources and optimize security protection mechanisms, including using cloud computing technology to construct ultra-large-scale platforms for collecting and processing security events and information, so as to collect and correlate massive amounts of information and improve the ability to handle security events and control risk across the whole network; 3. cloud security services, which mainly studies the various security services, such as anti-virus services, provided to users on the basis of a cloud computing platform.
2) Big data (Big data) refers to a data set that cannot be captured, managed, and processed by conventional software tools within a certain time range; it is a massive, fast-growing, and diversified information asset that yields stronger decision-making power, insight discovery, and process optimization capability only with new processing modes. With the advent of the cloud era, big data has attracted more and more attention, and special techniques are needed to process large amounts of data effectively within a tolerable elapsed time. Technologies applicable to big data include massively parallel processing databases, data mining, distributed file systems, distributed databases, cloud computing platforms, the Internet, and scalable storage systems.
3) Optical splitter switch: also known as an optical splitter, a fiber-optic junction device with multiple inputs and multiple outputs, commonly used for coupling, branching, and distributing optical signals. In a mobile communication network, an optical splitter serves as a dedicated probe for signaling monitoring and is mainly used to acquire raw signaling data. Used together with a signaling analysis system, it enables real-time monitoring and in-depth fault localization of the network, provides strong support for network maintenance, marketing, and customers, and, through statistical index reports across multiple dimensions, allows network and service quality to be evaluated and improved.
4) CGI: the Common Gateway Interface (Common Gateway Interface) is a standard Interface for providing information service for a Web server host. Through the CGI interface, the Web server can acquire information submitted by the client, transfer the information to a CGI program of the server for processing, and finally return a result to the client.
5) Transport Layer Security (TLS) protocol: used to provide privacy and data integrity between two communicating applications.
6) Client_Hello message: in the TLS protocol, the message with which a client initiates a request. It is transmitted in plaintext and carries information such as version information, an encryption suite candidate list, a compression algorithm candidate list, a random number, and extension fields.
The following describes an exemplary application of the electronic device provided in the embodiments of the present application, and the electronic device provided in the embodiments of the present application may be implemented as various types of user terminals such as a notebook computer, a tablet computer, a desktop computer, a set-top box, a mobile device (e.g., a mobile phone, a portable music player, a personal digital assistant, a dedicated messaging device, and a portable game device), and may also be implemented as a server. In the following, an exemplary application will be explained when the device is implemented as a server.
Referring to fig. 1, fig. 1 is an alternative architecture diagram of an application identification system 100 provided in the embodiment of the present application, where terminals (terminal 400-1 and terminal 400-2 are exemplarily shown) are connected to a server 200 through a network 300, and the network 300 may be a wide area network or a local area network, or a combination of both.
The terminal 400 may be a terminal in an enterprise network or a local area network, the terminal 400 is installed with an application client 410 (an application client 410-1 and an application client 410-2 are exemplarily shown), and when receiving a user operation, the application client 410 performs data interaction with a corresponding application server 600 through the network 300 (an application server 600-1 and an application server 600-2 are exemplarily shown, where the application server 600-1 may be a background server corresponding to the application client 410-1, and the application server 600-2 may be a background server corresponding to the application client 410-2), so as to complete a function corresponding to the user operation.
The application identification server 200 may be deployed at a network tap switch of the network 300 to globally monitor an enterprise network or a local area network. The application identification server 200 collects and analyzes, in real time, the network traffic data exchanged between the application client 410 and the application server 600 to obtain the transmission type of the network traffic data; extracts at least one flow characteristic corresponding to the network traffic data according to that transmission type; and matches the at least one flow characteristic in a preset flow fingerprint database in the database 500 to obtain a target sample application matched with the flow characteristic. The preset flow fingerprint database comprises a plaintext transmission type fingerprint and an encrypted transmission type fingerprint corresponding to each sample application in at least one sample application; the plaintext transmission type fingerprint comprises at least one of an application name fingerprint, a load key data fingerprint and a header field sequence fingerprint generated by performing feature extraction and preprocessing on plaintext transmission type data; the encrypted transmission type fingerprint comprises at least one of a target host fingerprint and a client handshake message fingerprint generated by performing feature extraction and preprocessing on encrypted transmission type data. The application identification server 200 then determines the target sample application as the application identification result of the network traffic data. Furthermore, the application identification server 200 can output the application identification result to a storage platform or a downstream service, so that downstream services, such as an application monitoring and management service or a network security service, can use, track and analyze the application identification result to complete the corresponding application monitoring management and network security protection functions.
In some embodiments, the server 200 may be an independent physical server, may also be a server cluster or a distributed system formed by a plurality of physical servers, and may also be a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a network service, cloud communication, a middleware service, a domain name service, a security service, a CDN, and a big data and artificial intelligence platform. The terminal 400 may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, and the like. The terminal and the server may be directly or indirectly connected through wired or wireless communication, which is not limited in this embodiment of the present application.
Referring to fig. 2, fig. 2 is a schematic structural diagram of an application identification server 200 according to an embodiment of the present application, where the application identification server 200 shown in fig. 2 includes: at least one processor 410, memory 450, at least one network interface 420, and a user interface 430. The various components in the application identification server 200 are coupled together by a bus system 440. It is understood that the bus system 440 is used to enable connection and communication between these components. The bus system 440 includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, however, the various buses are all labeled as bus system 440 in fig. 2.
The Processor 410 may be an integrated circuit chip having Signal processing capabilities, such as a general purpose Processor, a Digital Signal Processor (DSP), or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like, wherein the general purpose Processor may be a microprocessor or any conventional Processor, or the like.
The user interface 430 includes one or more output devices 431, including one or more speakers and/or one or more visual displays, that enable the presentation of media content. The user interface 430 also includes one or more input devices 432, including user interface components that facilitate user input, such as a keyboard, mouse, microphone, touch screen display, camera, other input buttons and controls.
The memory 450 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard disk drives, optical disk drives, and the like. Memory 450 optionally includes one or more storage devices physically located remote from processor 410.
The memory 450 includes either volatile memory or nonvolatile memory, and may include both volatile and nonvolatile memory. The nonvolatile memory may be a Read Only Memory (ROM), and the volatile memory may be a Random Access Memory (RAM). The memory 450 described in embodiments herein is intended to comprise any suitable type of memory.
In some embodiments, memory 450 is capable of storing data, examples of which include programs, modules, and data structures, or a subset or superset thereof, to support various operations, as exemplified below.
An operating system 451, including system programs for handling various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, a driver layer, etc., for implementing various basic services and handling hardware-based tasks;
a network communication module 452 for communicating with other computing devices via one or more (wired or wireless) network interfaces 420, exemplary network interfaces 420 including: Bluetooth, Wireless Fidelity (WiFi), and Universal Serial Bus (USB), etc.;
a presentation module 453 for enabling presentation of information (e.g., user interfaces for operating peripherals and displaying content and information) via one or more output devices 431 (e.g., display screens, speakers, etc.) associated with user interface 430;
an input processing module 454 for detecting one or more user inputs or interactions from one of the one or more input devices 432 and translating the detected inputs or interactions.
In some embodiments, the apparatus provided in this embodiment may be implemented in software, and fig. 2 illustrates the application identification apparatus 455 stored in the memory 450, which may be software in the form of programs and plug-ins, and includes the following software modules: a flow collection module 4551, a data preprocessing module 4552, a flow matching engine module 4553 and a determination module 4554, which are logical and thus may be arbitrarily combined or further divided according to the functions implemented.
The functions of the respective modules will be explained below.
In other embodiments, the apparatus provided in the embodiments of the present application may be implemented in hardware. As an example, the apparatus may be a processor in the form of a hardware decoding processor that is programmed to execute the application identification method provided in the embodiments of the present application; for example, the processor in the form of the hardware decoding processor may be one or more Application Specific Integrated Circuits (ASICs), DSPs, Programmable Logic Devices (PLDs), Complex Programmable Logic Devices (CPLDs), Field Programmable Gate Arrays (FPGAs), or other electronic components.
The application identification method provided by the embodiment of the present application will be described in conjunction with an exemplary application and implementation of the server provided by the embodiment of the present application.
Referring to fig. 3, fig. 3 is an alternative flowchart of an application identification method provided in an embodiment of the present application, and will be described with reference to the steps shown in fig. 3.
S101, collecting and analyzing network traffic data in real time to obtain the transmission type of the network traffic data.
In the embodiment of the application, the application program performs data interaction with the corresponding application server in the running process, so as to generate network traffic, and the network traffic data usually carries some identification information corresponding to the application and characteristic content data of some common characteristics of the application. Therefore, the application identification device can analyze the network traffic data by collecting the network traffic data interacted between the application program on the terminal and the application server, including the network traffic data sent by the application program to the application server and the network traffic data responded to the application program by the application server, so as to further identify the application.
In some embodiments, the application identification device may use the data packet as the data unit of application identification, and collect each layer-7 HTTP packet or each layer-4 TCP/UDP packet exchanged between the application program and the application server as the network traffic data.
In the embodiment of the application, the application identification device may analyze the acquired network traffic data to determine whether it is transmitted in plaintext or encrypted, and correspondingly determine whether the transmission type of the network traffic data is the plaintext transmission type, such as HTTP protocol data, or the encrypted transmission type, such as HTTPS protocol data.
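As a non-limiting illustration of S101, the transmission type may be inferred from the first bytes of a packet payload. The following Python sketch assumes that plaintext traffic is HTTP and that encrypted traffic begins with a TLS handshake record; the function name `transmission_type` and the method list are illustrative assumptions and are not taken from the embodiment.

```python
# Hypothetical helper: classify a payload by its leading bytes.
HTTP_PREFIXES = (b"GET ", b"POST ", b"PUT ", b"DELETE ", b"HEAD ",
                 b"OPTIONS ", b"PATCH ", b"HTTP/")


def transmission_type(payload: bytes) -> str:
    """Return 'encrypted', 'plaintext', or 'unknown' for one payload."""
    # A TLS handshake record starts with content type 0x16 and major version 0x03.
    if len(payload) >= 3 and payload[0] == 0x16 and payload[1] == 0x03:
        return "encrypted"
    # An HTTP request starts with a method token; a response starts with "HTTP/".
    if payload.startswith(HTTP_PREFIXES):
        return "plaintext"
    return "unknown"
```

Only handshake records are recognized here; a production classifier would likely also track the connection so that later encrypted records inherit the type determined at handshake time.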
S102, extracting the flow characteristics corresponding to the network flow data according to the transmission type.
In the embodiment of the present application, when the same application interacts with an application server, the network traffic data of different transmission types includes different data contents, and thus, data characteristics presented in the network traffic data of different transmission types are also different. And the application identification device extracts the flow characteristics corresponding to the network flow data according to the transmission type of the network flow data.
In the embodiment of the application, when the transmission type is the plaintext transmission type, the application identification device can acquire the specific data content of the network traffic data. The application identification device therefore detects the network traffic data according to preset key fields, and performs character extraction and preprocessing on the character strings corresponding to the detected key fields to obtain the data characteristics of the plaintext transmission type. For example, the application identification device may detect whether the network traffic data contains an application name and use it as the application name feature; detect whether the network traffic data contains a target resource host domain name field, a universal gateway interface field, and key message fields that distinguish the application from other applications, and extract and normalize the detected fields to obtain the load key data feature; and detect the order of the fields in the data packet header to obtain the header field order feature. At least one of the application name feature, the load key data feature, and the header field order feature is used as the traffic feature.
In the embodiment of the application, when the transmission type is an encrypted transmission type, because the content of the message data which is encrypted for transmission cannot be obtained, the application identification device can obtain the network traffic data from the message in the handshake phase between the application program and the server, and perform character extraction and preprocessing on the network traffic data in the handshake phase to obtain at least one of the target host characteristic and the client handshake message characteristic as the traffic characteristic.
S103, matching the flow characteristics in a preset flow fingerprint database to obtain target sample application matched with the flow characteristics; the preset flow fingerprint database comprises a plaintext transmission type fingerprint and an encrypted transmission type fingerprint corresponding to each sample application in at least one sample application; the plaintext transmission type fingerprint comprises at least one of an application name fingerprint, a load key data fingerprint and a header field sequence fingerprint generated by performing feature extraction and preprocessing on plaintext transmission type data; the encrypted transmission type fingerprint comprises at least one of a target host fingerprint and a client handshake message fingerprint generated by performing feature extraction and preprocessing on the encrypted transmission type data.
In the embodiment of the application, the preset flow fingerprint database is constructed by collecting sample flow data interacted between at least one sample application and each application server, preprocessing and extracting characteristics of the sample flow data, and automatically generating a flow fingerprint corresponding to each sample application. The traffic fingerprint of each sample application is a stable traffic characteristic generated in the process that the sample application interacts with the application server in the traffic operation process, and the characteristic can be used for identifying which sample application the network traffic data belongs to, and meanwhile, the traffic fingerprint of each sample application does not conflict with the traffic fingerprints of other sample applications.
In the embodiment of the application, the application identification device can acquire the encrypted transmission type data and the plaintext transmission type data generated while each sample application interacts with its background server; generate at least one of an application name fingerprint, a load key data fingerprint and a header field sequence fingerprint as the plaintext transmission type fingerprint by performing feature extraction and preprocessing on the plaintext transmission type data; and generate at least one of the target host fingerprint and the client handshake message fingerprint as the encrypted transmission type fingerprint by performing feature extraction and preprocessing on the encrypted transmission type data.
In the embodiment of the application, the application identification device can match the flow characteristics in a preset flow fingerprint database to obtain at least one target flow fingerprint matched with the flow characteristics. According to a preset corresponding relationship between at least one target traffic fingerprint and each sample application, for example, the preset corresponding relationship may be a regular expression that the traffic fingerprint of each sample application satisfies, and a sample application corresponding to at least one target traffic fingerprint is determined as a target sample application.
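As a non-limiting illustration of S103, the matching may be sketched as a lookup in which each sample application is associated with regular expressions over its flow characteristics. In the following Python sketch, the library contents, the application names, and the function name `match_application` are hypothetical and serve only to illustrate the correspondence between fingerprints and sample applications.

```python
import re
from typing import Dict, Optional

# Hypothetical fingerprint library: sample application -> list of
# (feature type, regular expression the feature must satisfy).
FINGERPRINT_DB = {
    "ExampleChat": [("host", r"^chat\.example\.com$"), ("agent", r"^ExampleChat$")],
    "ExampleMail": [("host", r"^mail\.example\.com$")],
}


def match_application(features: Dict[str, str]) -> Optional[str]:
    """Return the first sample application whose fingerprint matches a feature."""
    for app, patterns in FINGERPRINT_DB.items():
        for feature_type, pattern in patterns:
            value = features.get(feature_type)
            if value is not None and re.search(pattern, value):
                return app
    return None
```

Returning the first hit is a simplification; it relies on the stated property that the fingerprints of different sample applications do not conflict.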
S104, determining the target sample application as an application identification result of the network traffic data.
In the embodiment of the application, the application identification device determines the target sample application as the application identification result of the network traffic data, that is, the application to which the network traffic data belongs, and may perform further functions such as application monitoring, network management and control, network security management, and the like based on the application identification result.
It can be understood that, in the embodiment of the present application, the application identification device performs feature extraction and preprocessing on the plaintext transmission type data and the encrypted transmission type data of each sample application to generate the plaintext transmission type fingerprint and the encrypted transmission type fingerprint corresponding to each sample application. The preset flow fingerprint library is thereby generated automatically, which greatly improves the speed of feature extraction and fingerprint updating, enables the application identification system to identify new applications more quickly, and improves the efficiency of application identification. In addition, the plaintext transmission type fingerprint and the encrypted transmission type fingerprint each comprise at least one fingerprint of a different feature dimension, which improves the matching degree of the at least one flow characteristic and further improves the accuracy of application identification.
In some embodiments, referring to fig. 4, fig. 4 is an optional flowchart of the application identification method provided in the embodiments of the present application, and based on fig. 3, before S103, S201 to S210 may also be executed, which will be described with reference to the steps.
S201, obtaining application information of each sample application; the application information includes an application package name.
In the embodiment of the application, the application identification device obtains each sample application according to a preset sample application list, downloads and installs the sample application to the terminal, and obtains application information of each application, such as an application package name.
In some embodiments, the application package name may be a bundle id. The bundle id is a unique id distributed by each mobile phone App in an application market and can be used as a strong characteristic for identifying the application.
S202, collecting a plurality of sample flow data in the running process of each sample application; each sample traffic data of the plurality of sample traffic data includes plaintext transmission type data and encrypted transmission type data.
In the embodiment of the application, in order to extract fingerprint features which stably appear, the application identification device can acquire a plurality of sample flow data in the running process of each sample application. Each sample flow data of the multiple sample flow data sets comprises plaintext transmission type data and encrypted transmission type data.
In some embodiments, each sample application, when performing a different operation, may generate plaintext transmission type data, such as HTTP data, upon interacting with the application server; encrypted transmission type data, such as HTTPS data, may also be generated.
S203, in the plaintext transmission type data of each sample traffic data, when a character string including the application package name is detected, the application package name is used as an application name fingerprint.
In this embodiment, the application identification device may detect whether a character string containing the application package name exists in the plaintext transmission type HTTP traffic data. If such a character string is stably detected across the multiple pieces of sample traffic data of the application, for example in 4 or all 5 of 5 pieces of sample traffic data, the application identification device takes the application package name as the application name fingerprint.
In some embodiments, when the application identification apparatus detects a character string containing the bundle id of a certain sample application in the HTTP traffic data of the certain sample application, it indicates that the sample application and the server interaction data usually carry an application package name, and the application identification apparatus may use the bundle id as the application name fingerprint of the certain sample application.
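As a non-limiting illustration of S203, the stability check may be sketched as counting how many pieces of sample traffic data contain the package name. In the following Python sketch, the function name `app_name_fingerprint`, the parameter `min_hits`, and the package name used in the usage example are illustrative assumptions.

```python
from typing import List, Optional


def app_name_fingerprint(package_name: str,
                         sample_payloads: List[str],
                         min_hits: int = 4) -> Optional[str]:
    """Return the package name as the fingerprint if it appears stably."""
    # Count the samples whose plaintext content carries the package name.
    hits = sum(1 for payload in sample_payloads if package_name in payload)
    return package_name if hits >= min_hits else None
```

For example, with five samples of which four carry the hypothetical package name `com.example.app`, the function returns `com.example.app`; with only two such samples it returns `None`.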
S204, in the plaintext transmission type data of each sample flow data, detecting whether at least one of the following fields is included: a target host domain name field, a universal gateway interface field, a resource request parameter field, and a user agent information field.
In the embodiment of the present application, when each sample application interacts with its corresponding application server, individual features that distinguish the application from other applications are usually carried in the target host domain name field, the CGI interface field, the HTTP request parameter field, and the UA (User-Agent) information field of the HTTP traffic data. Therefore, the application identification device can extract at least one character string corresponding to the network resource address field, the HTTP request parameter field and the UA information field in the plaintext transmission type data according to the actual project requirement or the specific data content of each sample traffic data in the plurality of sample traffic data, and use the at least one character string as the load key data capable of identifying the application to generate the load key data fingerprint.
S205, extracting at least one detected field, carrying out normalization preprocessing to obtain at least one feature, counting the feature quantity of each obtained at least one feature based on a plurality of sample flow data, and taking the feature with the feature quantity larger than a preset quantity threshold value as a load key data fingerprint.
In the embodiment of the present application, in order to reduce the influence of random data in sample traffic data on the accuracy of generating a fingerprint, data content that can be stably generated by a sample application needs to be extracted from the traffic data to generate a characteristic fingerprint of the sample application. Therefore, the application identification device can preprocess the extracted at least one character string, extract common parts of fields of the same type from a plurality of sample flow data generated by each sample application, normalize the common parts into a fixed characteristic format, obtain at least one characteristic, and determine the characteristic which can stably appear in the plurality of sample flow data from the at least one characteristic as the load key data fingerprint of each sample application.
In some embodiments, the application information includes client information; referring to fig. 5, fig. 5 is an optional flowchart of the application identification method provided in the embodiment of the present application, and the process of performing normalization preprocessing on at least one field detected by extraction in S205 based on fig. 4 to obtain at least one feature may be implemented by performing S2051 to S2054, which will be described with reference to the steps.
S2051, when at least one field contains the domain name field of the target host, normalizing the domain name field of the target host in a preset format to obtain the host domain name characteristic in at least one characteristic.
In some embodiments, the application identification device may normalize the character string corresponding to the target host domain name field into a preset format, such as www.example.com, and use it as the preprocessed host address string.
S2052, when at least one field contains the universal gateway interface field, replacing the numeric character string and the random character string in the universal gateway interface field with preset characters to obtain the gateway interface characteristic in at least one characteristic.
In the embodiment of the application, each sample application has a corresponding universal gateway interface, which may be used as a feature identifier of the application, but a field of the universal gateway interface may contain some numeric character strings or random character strings generated for gateway access, and this part of content data is also frequently changed for the same application, so that the numeric character strings and the random character strings in the field of the universal gateway interface need to be replaced with preset characters, thereby reducing the influence of random data on feature generation and obtaining gateway interface features in at least one feature.
In some embodiments, the application identification device may replace the character string /index/abskhjshdkjfkashdf/123 in the CGI interface field with the preset character string /index/STR/NUM, so that the easily changed character strings are replaced with uniform character strings and a more stable feature is generated.
S2053, when at least one field contains the resource request parameter field, extracting a key type character string from the resource request parameter field as a request parameter characteristic in at least one characteristic.
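As a non-limiting illustration of S2052, the replacement may be sketched per path segment. The embodiment does not specify how a random character string is recognized; the following Python sketch assumes, purely for illustration, that any alphanumeric segment of at least 12 characters is random. The function name `normalize_cgi_path` and the parameter `min_random_len` are hypothetical.

```python
def normalize_cgi_path(path: str, min_random_len: int = 12) -> str:
    """Replace numeric segments with NUM and long alphanumeric segments with STR."""
    normalized = []
    for segment in path.split("/"):
        if segment.isdigit():
            # Purely numeric segment, e.g. an object identifier.
            normalized.append("NUM")
        elif len(segment) >= min_random_len and segment.isalnum():
            # Assumed heuristic: long alphanumeric segments are random tokens.
            normalized.append("STR")
        else:
            normalized.append(segment)
    return "/".join(normalized)
```

Under these assumptions, an input of `/index/abskhjshdkjfkashdf/123` yields `/index/STR/NUM`.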
In the embodiment of the present application, the resource request parameter field may be the param field in HTTP, which is in a key-value pair format, for example key1=value1&key2=value2. For the same application, the keys are usually fixed, while the values usually change with the terminal operating system or firmware version. Therefore, the application identification apparatus extracts the fixed key type character strings from the character string corresponding to the HTTP request parameter field as the request parameter feature.
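As a non-limiting illustration of S2053, the key extraction may be sketched as follows, assuming the request parameters are encoded as key=value pairs joined by the & character. The function name `request_param_keys` is an illustrative assumption.

```python
from typing import List


def request_param_keys(query: str) -> List[str]:
    """Return the parameter keys of a query string in order of first appearance."""
    keys = []
    for pair in query.split("&"):
        if not pair:
            continue  # tolerate empty pairs such as the one produced by "&&"
        key = pair.split("=", 1)[0]
        if key and key not in keys:
            keys.append(key)
    return keys
```

The values are discarded on purpose, since only the keys are expected to remain stable across devices and firmware versions.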
S2054, when the at least one field includes the user agent information field, extracting the client information from the user agent information field as a proxy feature of the at least one feature.
In this embodiment, the user agent information field may be the User-Agent field in HTTP message data. This field informs the application server of the tool through which the access request was initiated, and typically carries identifying information of the client. The application identification device may extract the client information from the user agent information field as the proxy feature of the at least one feature.
In some embodiments, the application identification apparatus may remove information such as the browser version and the operating system from the character string corresponding to the UA field, and extract the character string corresponding to the client information as the proxy feature. For example, the client information of the WeChat application is known to be MicroMessenger, and a UA field of the WeChat application is as follows:
Mozilla/5.0 (Linux; Android 4.4.2; NX505J Build/KVT49L) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/37.0.0.0 Mobile MQQBrowser/6.2 TBS/036548 Safari/537.36 MicroMessenger/6.3.18.800 NetType/WIFI Language/zh_CN
The application identification device processes this UA information and retains the MicroMessenger character string corresponding to the client information as the proxy feature.
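As a non-limiting illustration of S2054, the extraction may be sketched as scanning the product tokens of the UA string for a client name that is already known from the application information. In the following Python sketch, the function name `extract_client_info` and the parameter `known_clients` are illustrative assumptions.

```python
import re
from typing import Iterable, Optional


def extract_client_info(user_agent: str, known_clients: Iterable[str]) -> Optional[str]:
    """Return the first known client name found among the UA product tokens."""
    known = set(known_clients)
    # Drop parenthesized comments such as "(Linux; Android 4.4.2; ...)".
    stripped = re.sub(r"\([^)]*\)", " ", user_agent)
    # Product tokens have the form name/version; keep only the name part.
    for token in stripped.split():
        if "/" in token:
            name = token.split("/", 1)[0]
            if name in known:
                return name
    return None
```

Browser and system tokens are ignored simply because they are not in the set of known client names, which avoids maintaining a separate list of browsers to remove.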
In this embodiment of the present application, for the multiple pieces of sample traffic data corresponding to each sample application, the application identification apparatus may perform the same extraction in S2051 to S2054 on each piece of sample traffic data, and then count, across the multiple pieces of sample traffic data, the feature quantity of each host domain name feature, gateway interface feature, request parameter feature, and proxy feature. The application identification device then selects the features whose feature quantity is larger than a preset quantity threshold value as the load key data fingerprint of each sample application.
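As a non-limiting illustration of the counting in S205, the selection of stable features may be sketched as follows. In this Python sketch, each piece of sample traffic data is assumed to have already been reduced to a collection of normalized feature strings; the function name `stable_features` and the feature string format are illustrative assumptions.

```python
from collections import Counter
from typing import Iterable, Set


def stable_features(per_sample_features: Iterable[Iterable[str]],
                    min_count: int) -> Set[str]:
    """Return the features that occur in more than min_count samples."""
    counts = Counter()
    for features in per_sample_features:
        # Count each feature at most once per sample.
        counts.update(set(features))
    return {feature for feature, n in counts.items() if n > min_count}
```

Counting once per sample prevents a feature repeated many times inside a single capture from being mistaken for a stable one.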
It should be noted that, in this embodiment of the present application, the application identification apparatus may also extract other key fields capable of identifying the application characteristics from the plaintext transmission type data of each sample flow data, perform characteristic extraction and normalization according to the same principle, and obtain other types of flow characteristics, which are specifically selected according to the actual situation, and this embodiment of the present application is not limited.
S206, extracting at least one preset field name from the data packet header field of the plaintext transmission type data in the plaintext transmission type data of each sample flow data, and generating a header field sequence fingerprint according to the arrangement sequence of the at least one preset field name in the data packet header field so as to obtain the plaintext transmission type fingerprint.
In the embodiment of the application, for plaintext transmission type data generated by the same application, such as HTTP traffic data, the sequence of each field in the HTTP header is often fixed, so that the sequence of the preset fields can be used as a feature for identifying the application. The application identification device can acquire the arrangement sequence of at least one preset field name in a data header field from the plaintext transmission type data; and splicing at least one preset field name according to the arrangement sequence to obtain a head field sequence fingerprint.
In some embodiments, the HTTP traffic data content may be as follows:
POST / HTTP/1.1
Host: www.example.com
User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:10.0.2) Gecko/20100101 Firefox/10.0.2
Content-Length: 40
Content-Type: application/x-www-form-urlencoded
Connection: Keep-Alive

sex=man&&name=Professional
Here, the at least one preset field name is, in order, "Host", "User-Agent", "Content-Length", "Content-Type" and "Connection". The application identification device may generate the character string "Host_User-Agent_Content-Length_Content-Type_Connection" as the header field order fingerprint according to the arrangement order of the at least one preset field name.
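As a non-limiting illustration of S206, the header field order fingerprint may be sketched as collecting the preset field names in their order of appearance and joining them with a separator. In the following Python sketch, the underscore separator, the preset list, and the function name `header_order_fingerprint` are illustrative assumptions.

```python
# Hypothetical preset field names; their canonical spelling is used in the output.
PRESET_FIELDS = ("Host", "User-Agent", "Content-Length", "Content-Type", "Connection")


def header_order_fingerprint(raw_request: str, preset=PRESET_FIELDS) -> str:
    """Join the preset header names in the order they appear in the request."""
    head = raw_request.split("\r\n\r\n", 1)[0]   # drop the message body
    header_lines = head.split("\r\n")[1:]        # drop the request line
    canonical = {name.lower(): name for name in preset}
    ordered = []
    for line in header_lines:
        name = line.split(":", 1)[0].strip().lower()
        if name in canonical and canonical[name] not in ordered:
            ordered.append(canonical[name])
    return "_".join(ordered)
```

Header names are compared case-insensitively because HTTP header names are case-insensitive; only the relative order is treated as the distinguishing feature.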
In the embodiment of the application, for each sample application, the application identification device performs detection and generation of the three fingerprints according to the multiple corresponding sample flow data, and due to different specific contents of the sample flow data, the application identification device can obtain at least one of an application name fingerprint, a load key data fingerprint and a header field sequence fingerprint as a plaintext transmission type fingerprint corresponding to the application.
S207, when the encrypted transmission type data of each sample flow data is first handshake message data of a server name indication protocol, extracting a domain name field of the target host from the first handshake message data to be used as a fingerprint of the target host.
In the embodiment of the present application, when the sample traffic data is encrypted transmission type data, the plaintext information of the domain name of the target host generally cannot be parsed from the encrypted content. The Server Name Indication (SNI) extension is an extension of the TLS protocol, under which an application can submit the host name it is connecting to to its corresponding server at the start of the handshake process. Therefore, the application identification apparatus may obtain the domain name field of the target host corresponding to the application from the handshake message data of the SNI protocol, and normalize the domain name field into a uniform preset format, such as www.example.com, as the fingerprint of the target host by the same method as in S2051.
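As a non-limiting illustration of S207, the host name may be read from the server_name extension (extension type 0) of a Client Hello message. The following Python sketch assumes that the whole Client Hello is contained in a single TLS record and normalizes the name only by lowercasing it; the function name `extract_sni` is an illustrative assumption.

```python
import struct
from typing import Optional


def extract_sni(record: bytes) -> Optional[str]:
    """Return the SNI host name of a single-record TLS Client Hello, or None."""
    # Record type 0x16 = handshake; handshake type 0x01 = Client Hello.
    if len(record) < 9 or record[0] != 0x16 or record[5] != 0x01:
        return None
    try:
        pos = 9 + 2 + 32                      # headers, client_version, random
        pos += 1 + record[pos]                # session_id
        (cs_len,) = struct.unpack_from("!H", record, pos)
        pos += 2 + cs_len                     # cipher_suites
        pos += 1 + record[pos]                # compression_methods
        (ext_total,) = struct.unpack_from("!H", record, pos)
        pos += 2
        end = min(pos + ext_total, len(record))
        while pos + 4 <= end:
            ext_type, ext_len = struct.unpack_from("!HH", record, pos)
            pos += 4
            if ext_type == 0:                 # server_name extension
                # Layout: list length (2), name type (1), name length (2), name.
                name_type = record[pos + 2]
                (name_len,) = struct.unpack_from("!H", record, pos + 3)
                if name_type != 0:
                    return None
                name = record[pos + 5:pos + 5 + name_len]
                return name.decode("ascii", "replace").lower()
            pos += ext_len
    except (IndexError, struct.error):
        return None                           # truncated or malformed message
    return None
```

A Client Hello without the server_name extension yields `None`, in which case only the client handshake message fingerprint would be available for that flow.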
S208, when the encrypted transmission type data of each sample flow data is second handshake message data of a secure transport layer protocol, extracting at least one preset handshake message field from the second handshake message data, and performing content splicing and information signature processing on the at least one preset handshake message field to obtain a client handshake message fingerprint.
In the embodiment of the application, in HTTPS data carried over the secure transport layer protocol (TLS), the message content of the handshake phase is not encrypted. Therefore, when the encrypted transmission type data is the second handshake message data of the secure transport layer protocol, field content capable of representing the individual characteristics of the application can be extracted from the second handshake message data and used as the client handshake message fingerprint.
In some embodiments, referring to fig. 6, fig. 6 is an optional flowchart of the method provided in the embodiments of the present application, and based on fig. 4 or fig. 5, S208 may be implemented by executing S2081 to S2083, which will be described with reference to the steps.
S2081, extracting version information, an encryption suite candidate list, an extension list length and an elliptic curve key exchange algorithm support list from the client handshake message data as the at least one preset handshake message field.
In this embodiment of the present application, the client handshake message data may be a Client Hello message, and fig. 7 shows a display interface of the content of Client Hello message data captured and analyzed by tcpdump or Wireshark software. The content "TLS 1.2 (0x0303)" of the "Version" field is the version information; the content of the "Cipher Suites (19 suites)" field is the encryption suite candidate list; the content "141" of the "Extensions Length" field is the extension list length; and the "Extension: elliptic_curves" and "Extension: ec_point_formats" fields correspond to the elliptic curve key exchange algorithm support list. The application identification device can extract these fields to obtain the at least one preset handshake message field.
S2082, splicing the at least one preset handshake message field, separating each preset handshake message field by using a first preset symbol, and separating at least one message content in each preset handshake message field by using a second preset symbol to obtain spliced data.
In this embodiment of the application, the application identification device may concatenate the at least one preset handshake message field, separating the fields from one another with a first preset symbol and separating the message contents within each field with a second preset symbol, so as to obtain the spliced data.
In some embodiments, the application identification device may use "," to separate the preset handshake message fields and "-" to separate the message contents within each preset handshake message field, giving spliced data such as: 769,47-53-5-10-49161-49162-49171-49172-50-56-19-4,0-10-11,23-24-25,0.
It should be noted that, if the content of a certain preset handshake message field is null, it may be directly set as a null character string.
S2083, calculating the hash value of the spliced data through a message digest algorithm to complete the information signature of the spliced data and obtain the client handshake message fingerprint.
In this embodiment, the application identification device may calculate the hash value of the spliced data through the MD5 algorithm and use the hash value as the client handshake message fingerprint.
It should be noted that, in this embodiment of the present application, the application identification device may also perform the information signature processing on the spliced data through another signature algorithm to obtain the client handshake message fingerprint; the algorithm is selected according to the actual situation and is not limited by this embodiment of the present application.
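Steps S2081 to S2083 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name and parameter names are hypothetical, and the third field is treated as a list of extension values because that is how it appears in the example spliced data (0-10-11), although the step text calls it the extension list length.

```python
import hashlib

def client_hello_fingerprint(version, ciphers, extensions, curves, point_formats):
    """Join ClientHello fields with ',' and their values with '-', then MD5 the result."""
    fields = [
        str(version),
        "-".join(str(v) for v in ciphers),
        "-".join(str(v) for v in extensions),
        "-".join(str(v) for v in curves),
        "-".join(str(v) for v in point_formats),
    ]
    # An empty field becomes an empty string, as noted for null field contents.
    spliced = ",".join(fields)
    digest = hashlib.md5(spliced.encode("ascii")).hexdigest()
    return spliced, digest

spliced, digest = client_hello_fingerprint(
    769,
    [47, 53, 5, 10, 49161, 49162, 49171, 49172, 50, 56, 19, 4],
    [0, 10, 11],
    [23, 24, 25],
    [0],
)
print(spliced)
print(digest)
```

Replacing `hashlib.md5` with another digest such as `hashlib.sha256` would correspond to choosing a different signature algorithm.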
S209, based on the multiple sample flow data, counting the number of the obtained target host fingerprints and the number of the client handshake message fingerprints, and taking the fingerprints with the number larger than a preset number threshold value as encryption transmission type fingerprints.
In the embodiment of the application, for the multiple sample flow data corresponding to each sample application, the application identification device may separately count the target host fingerprints and the client handshake message fingerprints calculated from the multiple sample flow data, and take the fingerprints whose count is larger than the preset number threshold as stable features of the application's encrypted transmission type data, thereby obtaining the encrypted transmission type fingerprints.
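The counting and thresholding in S209 can be sketched as below. The function name is hypothetical, and counting each fingerprint at most once per capture is an assumption; the text only says that fingerprints whose number exceeds a preset threshold are kept.

```python
from collections import Counter

def stable_fingerprints(runs, threshold):
    """Keep fingerprints that occur in more than `threshold` captures.

    `runs` holds one collection of fingerprints per sample capture.
    """
    counts = Counter()
    for fingerprints in runs:
        counts.update(set(fingerprints))   # count once per capture (assumed)
    return {fp for fp, n in counts.items() if n > threshold}
```

With five captures and a threshold of 4, for example, only fingerprints present in every capture survive, which filters out random background traffic.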
S210, constructing the preset flow fingerprint database according to the plaintext transmission type fingerprint and the encrypted transmission type fingerprint corresponding to each sample application.
In this embodiment of the application, for the at least one of the application name fingerprint, the load key data fingerprint and the header field sequence fingerprint obtained for each sample application through S203-S206, the application identification device may establish a correspondence between each sample application and the fingerprints generated from its plaintext transmission type data. Exemplarily, through the regular expression "sample application A = application name fingerprint | (load key data fingerprint & header field sequence fingerprint)", network flow data is identified as network flow data corresponding to sample application A when its flow characteristics contain the application name fingerprint, or contain both the load key data fingerprint and the header field sequence fingerprint.
The application identification device can generate the corresponding relation between each sample application and the encrypted transmission type fingerprint through a similar method, and a preset flow fingerprint database is constructed and obtained according to the corresponding relation between each sample application and the plaintext transmission type fingerprint and the encrypted transmission type fingerprint.
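One way to picture a library entry and its match rule is the sketch below. It evaluates the expression "application name fingerprint | (load key data fingerprint & header field sequence fingerprint)" with plain set membership rather than regular expressions, which is a simplification; the class, field, and function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PlaintextRule:
    app: str
    name_fp: str
    payload_fp: str
    header_order_fp: str

    def matches(self, features: set) -> bool:
        # name | (payload & header order)
        return self.name_fp in features or (
            self.payload_fp in features and self.header_order_fp in features
        )

def identify(features, rules):
    """Return the first sample application whose rule matches, else None."""
    for rule in rules:
        if rule.matches(features):
            return rule.app
    return None
```

An analogous rule type could hold the target host fingerprint and the client handshake message fingerprint for encrypted traffic.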
It can be understood that, in the embodiment of the present application, the application identification device may automatically perform feature extraction and fingerprint generation in at least one dimension on sample flow data of both the plaintext and the encrypted transmission types, thereby improving the efficiency of constructing the fingerprint library and the accuracy of the generated application flow fingerprints. Moreover, the application identification device keeps only the flow fingerprints that are generated stably across repeated executions as the final App identification features, which further ensures the stability and accuracy of application identification.
In some embodiments, after S104, S105 may be further included, as follows:
S105, realizing application monitoring and/or network management functions based on the application identification result.
In the embodiment of the application, based on the identification of each piece of network traffic data in the network, the application identification device can implement functions such as monitoring application behavior in the network, managing network security, and managing and controlling network traffic; exemplarily, preventing malicious application attacks, or managing and controlling application networking behavior according to preset permissions. The specific function is selected according to the actual situation and is not limited by the embodiments of the present application.
Next, an exemplary application of the embodiment of the present application in a practical application scenario will be described.
In the embodiment of the present application, the application identification apparatus is applied to an application monitoring platform, and when the application traffic identification function is implemented, each functional module may be as shown in fig. 8. The application identification device downloads at least one sample application through an application downloading module 80_1 in the fingerprint extraction module 80, acquires data packets through an application automatic packet capturing module 80_2 in the running process of each sample application in the at least one sample application to obtain multiple sample flow data corresponding to each sample application, extracts different data flow characteristics from the multiple sample flow data according to an encryption transmission type and a plaintext transmission type through a characteristic extraction module 80_3, generates a plaintext transmission type fingerprint and an encryption transmission type fingerprint corresponding to each sample application according to different data flow characteristics through a fingerprint generation module 80_4, and further constructs and obtains a preset flow fingerprint library. Taking the generation of the plaintext transmission type fingerprint as an example, the implementation process shown in fig. 9 may be included:
S901, downloading a plurality of sample applications, and analyzing the application package names of the sample applications.
In S901, the application downloading module 80_1 may download and install at least one sample application in the preset sample application list on a terminal, such as a mobile phone, according to the preset sample application list that needs to be identified; and acquiring the bundleid of each sample application as the name of the application package through analysis in the downloading process.
S902, automatically capturing packets in the running process of each sample application to obtain 5 flow packet files of each sample application.
In S902, for the downloaded sample applications, the application automatic packet capturing module 80_2 installs and starts each sample application on the mobile phone through an automated script command, and simultaneously starts a packet capture program, for example tcpdump, in the background. After each sample application is started, the application identification device clicks content on the screen at random, automatically or manually, so that interactive traffic is exchanged with the server and a traffic packet file in pcap format can be captured through tcpdump as sample flow data. Because some random traffic may be mixed in during application execution, in order to extract the traffic characteristics that appear stably while the application runs, the application automatic packet capturing module 80_2 executes each sample application 5 times, so that 5 pcap traffic packet files are obtained for each sample application as the multiple sample flow data.
S903, analyzing the HTTP traffic data in each traffic packet file to obtain analyzed content data.
In S903, the pcap packets captured by a packet capture tool such as tcpdump cannot be used directly; the HTTP traffic data in them needs to be parsed into a structured log by a parsing program. The feature extraction module 80_3 may parse the original pcap format file into content data in field format through the pyshark module in Python.
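To show what "content data in field format" might look like, the sketch below parses one already reassembled HTTP request into the fields used by the later steps. It deliberately uses only the Python standard library on request text rather than a pcap parsing library, so it runs without a capture file; the function name and dictionary keys are hypothetical.

```python
from urllib.parse import urlsplit, parse_qsl

def parse_http_request(raw: str) -> dict:
    """Split a raw HTTP request into Host, CGI path, parameters and User-Agent."""
    head = raw.split("\r\n\r\n", 1)[0]           # ignore any message body
    lines = head.split("\r\n")
    method, target, _version = lines[0].split(" ", 2)
    headers = []
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers.append((name.strip(), value.strip()))
    lookup = {k.lower(): v for k, v in headers}
    parts = urlsplit(target)
    return {
        "method": method,
        "host": lookup.get("host", ""),
        "cgi": parts.path,
        "params": parse_qsl(parts.query, keep_blank_values=True),
        "user_agent": lookup.get("user-agent", ""),
        "header_names": [k for k, _ in headers],  # original order preserved
    }
```

The `header_names` list keeps the on-the-wire order, which is what a header field order fingerprint depends on.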
S904, in the analyzed content data, obtaining the application name fingerprint by detecting the application package name.
In S904, for the formatted field analyzed in S903, the feature extraction module 80_3 may detect whether a field matching the application package name is included therein, and if so, the fingerprint generation module 80_4 may use the application package name as the application name fingerprint.
S905, statistically extracting the load key data from the analyzed content data.
In S905, for the formatted fields parsed in S903, the feature extraction module 80_3 may detect whether they include a target host domain name field Host, a common gateway interface field CGI, a resource request parameter field Param, and a user agent information field User-Agent. When any of these fields is detected, the fingerprint generation module 80_4 performs normalization preprocessing on it by the same process as in S204 to obtain at least one feature, and a feature that appears stably, for example one whose occurrence count reaches 5, is used as the load key data fingerprint.
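The normalization preprocessing of the CGI and Param fields might look like the following sketch. The exact rules are assumptions: a path segment is treated as variable when it consists only of digits or is a long hexadecimal token, the placeholder character is "*", and parameter keys are sorted to give a stable feature. All names are hypothetical.

```python
import re

def normalize_cgi(path: str, placeholder: str = "*") -> str:
    """Replace numeric or random-looking path segments with a placeholder."""
    segments = []
    for seg in path.split("/"):
        # Assumed heuristic for "digital" and "random" character strings.
        if re.fullmatch(r"\d+", seg) or re.fullmatch(r"[0-9a-fA-F]{16,}", seg):
            segments.append(placeholder)
        else:
            segments.append(seg)
    return "/".join(segments)

def param_keys(query: str) -> str:
    """Keep only the parameter keys of a query string, sorted for stability."""
    keys = [pair.split("=", 1)[0] for pair in query.split("&") if pair]
    return "&".join(sorted(keys))
```

After this step, requests that differ only in identifiers or parameter values map to the same feature, so they can be counted together across captures.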
S906, extracting the client fingerprint from the analyzed content data.
In S906, the fingerprint generation module 80_4 arranges the HTTP header field names in order to obtain HOST_User-Agent_Content-Type_Content-Length_selection as the client fingerprint, where the client fingerprint is equivalent to the header field order fingerprint.
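A header field order fingerprint of this kind can be sketched as follows. The set of preset field names is an assumption chosen for illustration, and the function name is hypothetical.

```python
def header_order_fingerprint(
    header_names,
    preset=("Host", "User-Agent", "Content-Type", "Content-Length"),
):
    """Join the preset header names in the order they appear in the request."""
    wanted = {name.lower() for name in preset}
    kept = [name for name in header_names if name.lower() in wanted]
    return "_".join(kept)

print(header_order_fingerprint(
    ["Host", "Accept", "User-Agent", "Content-Type", "Content-Length"]
))
```

Two clients that send the same headers in a different order produce different fingerprints, which is what makes the order usable as a client characteristic.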
S907, generating a plaintext flow fingerprint according to the application name fingerprint, the load key data fingerprint and the client fingerprint.
In this embodiment of the application, the fingerprint generation module 80_4 takes the application name fingerprint, the load key data fingerprint, and the client fingerprint as a plaintext traffic fingerprint corresponding to the sample application, that is, a plaintext transmission type fingerprint.
Further, based on fig. 8, in the embodiment of the present application, when the application identification device processes network traffic in a real-time monitoring scenario, the traffic collection module 81 may collect data traffic packets in a network specified by the application monitoring platform to obtain HTTP data packets of the plaintext transmission type or the encrypted transmission type as the network traffic data. The data preprocessing module 82 may extract the corresponding real-time traffic features from the HTTP data packets through a process similar to that of the feature extraction module 80_3 and output them to the traffic matching engine module 83. Before matching the real-time traffic features, the traffic matching engine module 83 loads, by component calling, the preset flow fingerprint library constructed by the fingerprint extraction module 80. It then combines the real-time traffic features with each flow fingerprint in the preset flow fingerprint library into logical expressions through regular expressions, determines the logical expression whose features match, and takes the sample application corresponding to the flow fingerprint in that expression as the target sample application, that is, as the identification result of the HTTP data packet. The result output module 84 may output the identification result to a downstream application monitoring service module, so that the application monitoring service module decides, according to the identification result and a preset application network access white list or black list, whether to restrict and manage the networking behavior of the application.
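The final decision made by the downstream application monitoring service module could be sketched as below. The policy semantics are assumptions, since the text only states that a white list or black list is consulted: here a black list entry always restricts, and a non-empty white list is treated as exclusive. The function name and return values are hypothetical.

```python
def decide(app, whitelist=frozenset(), blacklist=frozenset()):
    """Return 'allow', 'restrict' or 'unknown' for an identification result."""
    if app is None:
        return "unknown"          # no fingerprint in the library matched
    if app in blacklist:
        return "restrict"
    if whitelist and app not in whitelist:
        return "restrict"         # assumption: a non-empty white list is exclusive
    return "allow"
```

Returning a separate "unknown" result keeps unidentified traffic distinguishable from traffic that was identified and permitted.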
It can be understood that, in the embodiment of the present application, the application identification apparatus may extract the flow fingerprint features generated in the running process of the sample application in an automated manner, so that a process of manually extracting the features is omitted, and the encrypted flow is processed, thereby accurately and efficiently identifying the application corresponding to the flow. Furthermore, the application identification device repeatedly collects the fingerprints generated in the application running process in a statistical-based mode, so that the accuracy and the stability of the application flow fingerprints can be greatly improved.
Continuing with the exemplary structure of the application identification device 455 provided by the embodiments of the present application implemented as software modules, in some embodiments, as shown in fig. 2, the software modules of the application identification device 455 stored in the memory 450 may include:
the traffic collection module 4551 is used for collecting network traffic data in real time;
the data preprocessing module 4552 is configured to extract traffic characteristics corresponding to the network traffic data according to the transmission type of the network traffic data;
the flow matching engine module 4553 is configured to match the flow characteristics in a preset flow fingerprint database, so as to obtain a target sample application matched with the flow characteristics; the preset flow fingerprint database comprises a plaintext transmission type fingerprint and an encrypted transmission type fingerprint corresponding to each sample application in at least one sample application; the plaintext transmission type fingerprint comprises at least one of an application name fingerprint, a load key data fingerprint and a header field sequence fingerprint generated by performing feature extraction and pretreatment on plaintext transmission type data; the encrypted transmission type fingerprint comprises at least one of a target host fingerprint and a client handshake message fingerprint generated by performing feature extraction and preprocessing on encrypted transmission type data;
a determining module 4554, configured to determine the target sample application as an application identification result of the network traffic data.
In some embodiments, the application identification apparatus 455 further includes a fingerprint generation module, where the fingerprint generation module is configured to, before the traffic characteristics are matched in a preset traffic fingerprint library to obtain a target sample application with matched traffic characteristics, obtain application information of each sample application; the application information comprises an application package name; collecting a plurality of sample flow data in the running process of each sample application; the multiple sample flow data comprise plaintext transmission type data and encrypted transmission type data; in plaintext transmission type data of each sample flow data, obtaining the application name fingerprint by detecting a character string containing the application packet name; detecting whether a target host domain name field, a general gateway interface field, a resource request parameter field and a user agent information field are contained in plaintext transmission type data of each sample flow data, extracting at least one detected field to carry out normalization pretreatment to obtain at least one characteristic, and generating the load key data fingerprint according to respective quantity information of the at least one characteristic; extracting at least one preset field name from a data header field of the plaintext transmission type data in the plaintext transmission type data of each sample flow data, generating a header field sequence fingerprint according to the arrangement sequence of the at least one preset field name in the data header field, and further obtaining the plaintext transmission type fingerprint; in the encrypted transmission type data of each sample flow data, when the encrypted transmission type data is server name indication protocol data, extracting a target host domain name field from first handshake message data of the server name indication protocol data to be used as the target host fingerprint; in the encrypted transmission type data of each sample flow data, when the encrypted transmission type data is a secure transport layer protocol data, extracting at least one preset handshake message field from a second handshake message data of the secure transport layer protocol data, and performing content splicing and information signature processing on the at least one preset handshake message field to obtain a client handshake message fingerprint; obtaining respective quantities of the target host fingerprint and the client side handshake message fingerprint according to the multiple sample flow data, and further taking the fingerprints with the quantities larger than a preset quantity threshold value as the encrypted transmission type fingerprints; and constructing and obtaining the preset flow fingerprint database according to the plaintext transmission type fingerprint and the encrypted transmission type fingerprint corresponding to each sample application.
In some embodiments, the application information includes client information; the fingerprint generation module is further configured to, when the at least one field includes the target host domain name field, perform normalization of a preset format on the target host domain name field to obtain a host domain name feature in the at least one feature; when the at least one field comprises a universal gateway interface field, replacing a digital character string and a random character string in the universal gateway interface field with preset characters to obtain a gateway interface characteristic in the at least one characteristic; when the at least one field contains a resource request parameter field, extracting a key type character string from the resource request parameter field as a request parameter characteristic in the at least one characteristic; when the at least one field contains the user agent information field, extracting the client information from the user agent information field as a proxy feature in the at least one feature; obtaining respective quantities of the host domain name characteristic, the gateway interface characteristic, the request parameter characteristic and the proxy characteristic according to the multiple sample flow data, and taking the characteristic of which the quantity exceeds a preset quantity threshold value as the load key data fingerprint.
In some embodiments, the fingerprint generation module is further configured to concatenate the at least one preset field name according to an arrangement order to obtain the sequence fingerprint of the header field.
In the above apparatus, the second handshake message data includes: client handshake message data; the fingerprint generation module is further configured to extract version information, an encryption suite candidate list, an extension list length, and an elliptic curve key exchange algorithm support list from the client handshake message data as the at least one preset handshake message field; connecting the at least one preset handshake message field, separating each preset handshake message field by using a first preset symbol, and separating at least one message content in each preset handshake message field by using a second preset symbol to obtain splicing data; and calculating the hash value of the spliced data through a message digest algorithm, finishing the information signature of the spliced data, and obtaining the handshake message fingerprint of the client.
In some embodiments, the data preprocessing module 4552 is further configured to, when the transmission type is a plaintext transmission type, perform character extraction and preprocessing on the network traffic data to obtain at least one of an application name feature, a load-critical data feature, and a header field order feature as the traffic feature; and when the transmission type is an encryption transmission type, performing character extraction and pretreatment on the network flow data to obtain at least one of target host characteristics and client handshake message characteristics as the flow characteristics.
In some embodiments, the application identification device 455 further includes an identification result application module, and the identification result application module is configured to, after determining the target sample application as the application identification result of the network traffic data, implement an application monitoring and/or network management function based on the application identification result.
It should be noted that the above description of the embodiment of the apparatus, similar to the above description of the embodiment of the method, has similar beneficial effects as the embodiment of the method. For technical details not disclosed in the embodiments of the apparatus of the present application, reference is made to the description of the embodiments of the method of the present application for understanding.
Embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and executes the computer instructions, so that the computer device executes the application identification method described in the embodiment of the present application.
Embodiments of the present application provide a computer-readable storage medium having stored therein executable instructions, which when executed by a processor, will cause the processor to perform a method provided by embodiments of the present application, for example, a method as illustrated in fig. 3-6 or fig. 9.
In some embodiments, the computer-readable storage medium may be memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash, magnetic surface memory, optical disk, or CD-ROM; or may be various devices including one or any combination of the above memories.
In some embodiments, executable instructions may be written in any form of programming language (including compiled or interpreted languages), in the form of programs, software modules, scripts or code, and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
By way of example, executable instructions may correspond, but do not necessarily have to correspond, to files in a file system, and may be stored in a portion of a file that holds other programs or data, such as in one or more scripts in a hypertext Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).
By way of example, executable instructions may be deployed to be executed on one computing device or on multiple computing devices at one site or distributed across multiple sites and interconnected by a communication network.
In summary, the embodiment of the application can extract and preprocess the characteristics of the plaintext transmission type data and the encrypted transmission type data applied to each sample to generate the plaintext transmission type fingerprint and the encrypted transmission type fingerprint corresponding to each sample application, so that the automatic generation of the preset flow fingerprint library is realized, the speed of characteristic extraction and characteristic fingerprint updating is greatly improved, the application identification system can more quickly identify novel application, and the efficiency of application identification is improved. In addition, the plaintext transmission type fingerprint and the encrypted transmission type fingerprint respectively comprise at least one fingerprint with different characteristic dimensions, so that the matching degree of at least one flow characteristic is improved, and the accuracy of application identification is further improved.
The above description is only an example of the present application, and is not intended to limit the scope of the present application. Any modification, equivalent replacement, and improvement made within the spirit and scope of the present application are included in the protection scope of the present application.

Claims (10)

1. An application identification method, comprising:
acquiring and analyzing network traffic data in real time to obtain the transmission type of the network traffic data;
extracting flow characteristics corresponding to the network flow data according to the transmission type;
matching the flow characteristics in a preset flow fingerprint database to obtain target sample application matched with the flow characteristics; the preset flow fingerprint database comprises a plaintext transmission type fingerprint and an encrypted transmission type fingerprint corresponding to each sample application in at least one sample application; the plaintext transmission type fingerprint comprises at least one of an application name fingerprint, a load key data fingerprint and a header field sequence fingerprint generated by performing feature extraction and preprocessing on plaintext transmission type data; the encrypted transmission type fingerprint comprises at least one of a target host fingerprint and a client handshake message fingerprint generated by performing feature extraction and preprocessing on encrypted transmission type data;
and determining the target sample application as an application identification result of the network traffic data.
2. The method according to claim 1, wherein before the matching of the traffic characteristics in a preset traffic fingerprint database to obtain the target sample application of the traffic characteristic matching, the method further comprises:
acquiring application information of each sample application; the application information comprises an application package name;
collecting a plurality of sample flow data in the running process of each sample application; each sample flow data of the multiple sample flow data comprises plaintext transmission type data and encrypted transmission type data;
in the plaintext transmission type data of each sample flow data, when a character string containing the application packet name is detected, taking the application packet name as the application name fingerprint;
detecting whether the plaintext transmission type data of each sample flow data contains at least one of the following fields: a target host domain name field, a universal gateway interface field, a resource request parameter field, and a user agent information field;
extracting the detected at least one field for normalization pretreatment to obtain at least one feature, counting the feature quantity of the obtained at least one feature based on the multiple sample flow data, and taking the feature with the feature quantity larger than a preset quantity threshold value as the load key data fingerprint;
extracting at least one preset field name from a data packet header field of the plaintext transmission type data in the plaintext transmission type data of each sample flow data, and generating a header field sequence fingerprint according to the arrangement sequence of the at least one preset field name in the data packet header field so as to obtain the plaintext transmission type fingerprint;
when the encrypted transmission type data of each sample flow data is first handshake message data of a server name indication protocol, extracting a domain name field of a target host from the first handshake message data to be used as the fingerprint of the target host;
when the encrypted transmission type data of each sample flow data is second handshake message data of a secure transport layer protocol, extracting at least one preset handshake message field from the second handshake message data, and performing content splicing and information signature processing on the at least one preset handshake message field to obtain a client handshake message fingerprint;
counting the number of the obtained target host fingerprint and the client side handshake message fingerprint based on the multiple sample flow data, and taking the fingerprint with the number larger than a preset number threshold value as the encrypted transmission type fingerprint;
and constructing and obtaining the preset flow fingerprint database according to the plaintext transmission type fingerprint and the encrypted transmission type fingerprint corresponding to each sample application.
3. The method of claim 2, wherein the application information comprises client information, and the extracting the detected at least one field for normalization preprocessing to obtain at least one feature comprises:
when the at least one field comprises the target host domain name field, normalizing the target host domain name field to a preset format to obtain a host domain name feature among the at least one feature;
when the at least one field comprises the universal gateway interface field, replacing numeric character strings and random character strings in the universal gateway interface field with preset characters to obtain a gateway interface feature among the at least one feature;
when the at least one field comprises the resource request parameter field, extracting key-type character strings from the resource request parameter field as a request parameter feature among the at least one feature;
when the at least one field comprises the user agent information field, extracting the client information from the user agent information field as a user agent feature among the at least one feature.
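The four normalization steps of claim 3 might look roughly as follows. This is a sketch under stated assumptions, not the patented implementation: the "preset format" for host names is assumed to be lowercase without port or trailing dot, "random character strings" are approximated as hexadecimal tokens of 16 or more characters, the "key-type character strings" are assumed to be the query parameter names, and the client information is assumed to be the first product token of the User-Agent value. All function names are hypothetical.

```python
import re
from urllib.parse import parse_qsl

def normalize_host(host):
    # Assumed preset format: lowercase, port removed, trailing dot removed.
    return host.strip().lower().split(":")[0].rstrip(".")

def normalize_cgi(path):
    # Replace purely numeric segments and long hex-looking segments with "*".
    out = []
    for seg in path.split("/"):
        if re.fullmatch(r"\d+", seg) or re.fullmatch(r"[0-9a-fA-F]{16,}", seg):
            out.append("*")
        else:
            out.append(seg)
    return "/".join(out)

def request_param_keys(query):
    # Keep only the parameter names, de-duplicated and sorted.
    return sorted({key for key, _ in parse_qsl(query, keep_blank_values=True)})

def client_from_user_agent(user_agent):
    # Assumed heuristic: take the first "name/version" product token.
    match = re.match(r"\s*([^\s/]+)/", user_agent)
    return match.group(1) if match else ""
```

For example, `normalize_cgi("/user/12345/avatar/9f86d081884c7d659a2feaa0c55ad015")` collapses both variable segments, so requests from different users map to the same gateway interface feature.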
4. The method according to claim 2 or 3, wherein the generating a header field sequence fingerprint according to the arrangement order of the at least one preset field name in the data packet header field comprises:
acquiring the arrangement order of the at least one preset field name in the data packet header field;
and splicing the at least one preset field name in that arrangement order to obtain the header field sequence fingerprint.
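The two steps of claim 4 can be sketched for an HTTP request as below. The set `PRESET_FIELDS`, the comma separator, and the lowercasing of field names are assumptions chosen for illustration; the claim does not specify them.

```python
# Hypothetical set of preset header field names (lowercase).
PRESET_FIELDS = {"host", "user-agent", "accept", "accept-encoding", "connection"}

def header_order_fingerprint(raw_request, preset=PRESET_FIELDS, sep=","):
    """Join the preset header names in the order they appear in the request."""
    names = []
    # Skip the request line, then read header lines until the blank line.
    for line in raw_request.split("\r\n")[1:]:
        if not line:
            break
        name = line.split(":", 1)[0].strip().lower()
        if name in preset:
            names.append(name)
    return sep.join(names)

request = (
    "GET / HTTP/1.1\r\n"
    "User-Agent: x\r\n"
    "Host: a.com\r\n"
    "X-Custom: 1\r\n"
    "Accept: */*\r\n"
    "\r\n"
)
fingerprint = header_order_fingerprint(request)
```

Field names outside the preset set, such as `X-Custom` here, are ignored, so the fingerprint reflects only the relative order of the preset fields.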
5. The method of claim 2, wherein the second handshake message data comprises client handshake message data, and the extracting at least one preset handshake message field from the second handshake message data, and performing content splicing and information signature processing on the at least one preset handshake message field to obtain the client handshake message fingerprint comprises:
extracting version information, a cipher suite candidate list, an extension list length and a supported list of elliptic curve key exchange algorithms from the client handshake message data as the at least one preset handshake message field;
splicing the at least one preset handshake message field, separating the preset handshake message fields from one another with a first preset symbol, and separating the individual message contents within each preset handshake message field with a second preset symbol, to obtain spliced data;
and calculating a hash value of the spliced data with a message digest algorithm, thereby completing the information signature of the spliced data and obtaining the client handshake message fingerprint.
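The splice-then-hash procedure of claim 5 can be sketched as follows. The separators `","` and `"-"` and the use of MD5 as the message digest algorithm are assumptions; the claim only speaks of a first preset symbol, a second preset symbol and a message digest algorithm. The function name and the numeric field values are hypothetical, and parsing of the handshake message itself is out of scope here.

```python
import hashlib

def client_hello_fingerprint(version, cipher_suites, extensions_length, curves,
                             field_sep=",", item_sep="-"):
    """Splice the four handshake fields and sign the result with a digest."""
    fields = [
        str(version),
        item_sep.join(str(c) for c in cipher_suites),  # items inside a field
        str(extensions_length),
        item_sep.join(str(c) for c in curves),
    ]
    spliced = field_sep.join(fields)                   # fields between each other
    # Assumption: MD5 stands in for the unspecified message digest algorithm.
    digest = hashlib.md5(spliced.encode("ascii")).hexdigest()
    return spliced, digest

spliced, fingerprint = client_hello_fingerprint(
    version=771,
    cipher_suites=[4865, 4866],
    extensions_length=120,
    curves=[29, 23],
)
```

Here `spliced` is `"771,4865-4866,120,29-23"`, and `fingerprint` is the hexadecimal digest of that string. Two clients that offer the same version, cipher suites, extension list length and curves produce the same fingerprint.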
6. The method according to claim 3 or 5, wherein the extracting, according to the transmission type, the traffic feature corresponding to the network traffic data includes:
when the transmission type of the network traffic data is a plaintext transmission type, performing character extraction and preprocessing on the network traffic data to obtain at least one of an application name feature, a load key data feature and a header field sequence feature as the traffic feature;
and when the transmission type of the network traffic data is an encrypted transmission type, performing character extraction and preprocessing on the network traffic data to obtain at least one of a target host feature and a client handshake message feature as the traffic feature.
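The branch on transmission type in claim 6 amounts to selecting a different feature set per type. A minimal sketch, assuming the features have already been extracted into a dictionary whose keys (`app_name`, `payload_key_data`, `header_order`, `target_host`, `client_handshake`) are hypothetical names:

```python
PLAINTEXT_KEYS = ("app_name", "payload_key_data", "header_order")
ENCRYPTED_KEYS = ("target_host", "client_handshake")

def extract_traffic_features(transmission_type, record):
    """Return the features relevant to the given transmission type."""
    if transmission_type == "plaintext":
        keys = PLAINTEXT_KEYS
    elif transmission_type == "encrypted":
        keys = ENCRYPTED_KEYS
    else:
        raise ValueError(f"unknown transmission type: {transmission_type!r}")
    # "At least one of": keep only the features that are actually present.
    return {key: record[key] for key in keys if record.get(key)}

record = {
    "app_name": "demo",
    "header_order": "host,accept",
    "target_host": "example.com",
}
plain = extract_traffic_features("plaintext", record)
encrypted = extract_traffic_features("encrypted", record)
```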
7. The method of any of claims 1-6, wherein after determining the target sample application as the application identification result of the network traffic data, the method further comprises:
implementing application monitoring and/or network management functions based on the application identification result.
8. An application identification apparatus, comprising:
a flow acquisition module, configured to acquire and parse network traffic data in real time to obtain the transmission type of the network traffic data;
a data preprocessing module, configured to extract the traffic feature corresponding to the network traffic data according to the transmission type;
a flow matching engine module, configured to match the traffic feature against a preset flow fingerprint database to obtain a target sample application matching the traffic feature; wherein the preset flow fingerprint database comprises a plaintext transmission type fingerprint and an encrypted transmission type fingerprint corresponding to each sample application of at least one sample application; the plaintext transmission type fingerprint comprises at least one of an application name fingerprint, a load key data fingerprint and a header field sequence fingerprint generated by performing feature extraction and preprocessing on plaintext transmission type data; and the encrypted transmission type fingerprint comprises at least one of a target host fingerprint and a client handshake message fingerprint generated by performing feature extraction and preprocessing on encrypted transmission type data;
and a determining module, configured to determine the target sample application as an application identification result of the network traffic data.
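The matching step performed by the flow matching engine module can be sketched as a lookup against a per-application fingerprint set. The matching rule used here (pick the application with the most overlapping fingerprints, or nothing if there is no overlap) is an assumption for illustration; the claim does not specify how a match is scored. The database contents and the name `match_application` are hypothetical.

```python
def match_application(features, fingerprint_db):
    """Return the sample application sharing the most fingerprints, or None."""
    observed = set(features)
    best_app, best_hits = None, 0
    for app, fingerprints in fingerprint_db.items():
        hits = len(observed & fingerprints)
        if hits > best_hits:
            best_app, best_hits = app, hits
    return best_app

# Hypothetical preset flow fingerprint database.
fingerprint_db = {
    "app_a": {"host:a.example", "hdr:host,accept"},
    "app_b": {"host:b.example"},
}
result = match_application(["host:a.example", "hdr:host,accept"], fingerprint_db)
```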
9. An electronic device, comprising:
a memory for storing executable instructions;
a processor for implementing the method of any one of claims 1 to 7 when executing executable instructions stored in the memory.
10. A computer-readable storage medium storing executable instructions which, when executed by a processor, implement the method of any one of claims 1 to 7.
CN202110119196.6A 2021-01-28 Application identification method, device, equipment and computer readable storage medium Active CN114915566B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110119196.6A CN114915566B (en) 2021-01-28 Application identification method, device, equipment and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110119196.6A CN114915566B (en) 2021-01-28 Application identification method, device, equipment and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN114915566A (en) 2022-08-16
CN114915566B (en) 2024-05-17

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116894011A (en) * 2023-07-17 2023-10-17 上海螣龙科技有限公司 Multi-dimensional intelligent fingerprint library and multi-dimensional intelligent fingerprint library design and query method

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016201731A1 (en) * 2015-06-17 2016-12-22 宇龙计算机通信科技(深圳)有限公司 Fingerprint recognition method and apparatus, and electronic device
WO2018137225A1 (en) * 2017-01-25 2018-08-02 深圳市汇顶科技股份有限公司 Fingerprint data processing method and processing apparatus
CN109617762A (en) * 2018-12-14 2019-04-12 南京财经大学 A method of mobile application is identified using network flow
CN109802924A (en) * 2017-11-17 2019-05-24 华为技术有限公司 A kind of method and device identifying encrypting traffic
CN110198328A (en) * 2018-03-05 2019-09-03 腾讯科技(深圳)有限公司 Client recognition methods, device, computer equipment and storage medium
CN110602059A (en) * 2019-08-23 2019-12-20 东南大学 Method for accurately restoring clear text length fingerprint of TLS protocol encrypted transmission data
CN112261645A (en) * 2020-10-16 2021-01-22 北京锐驰信安技术有限公司 Mobile application fingerprint automatic extraction method and system based on grouping and domain division

Similar Documents

Publication Publication Date Title
CN110855676B (en) Network attack processing method and device and storage medium
US10769228B2 (en) Systems and methods for web analytics testing and web development
CN102945347A (en) Method, system and device for detecting Android malicious software
CN102984161B (en) The recognition methods of a kind of reliable website and device
US8875227B2 (en) Privacy aware authenticated map-reduce
CN112073437B (en) Multi-dimensional security threat event analysis method, device, equipment and storage medium
CN111200523B (en) Method, device, equipment and storage medium for configuring middle platform system
CN111222547B (en) Traffic feature extraction method and system for mobile application
CN107168844B (en) Performance monitoring method and device
CN114528457A (en) Web fingerprint detection method and related equipment
CN110932918A (en) Log data acquisition method and device and storage medium
EP3151124A1 (en) On-board information system and information processing method therefor
CN112822121A (en) Traffic identification method, traffic determination method and knowledge graph establishment method
CN102984162B (en) The recognition methods of credible website and gathering system
CN113821254A (en) Interface data processing method, device, storage medium and equipment
CN116069838A (en) Data processing method, device, computer equipment and storage medium
CN110830416A (en) Network intrusion detection method and device
CN111625837A (en) Method and device for identifying system vulnerability and server
CN114915566B (en) Application identification method, device, equipment and computer readable storage medium
WO2023082605A1 (en) Http message extraction method and apparatus, and medium and device
CN111367686A (en) Service interface calling method and device, computer equipment and storage medium
CN111277569A (en) Network message decoding method and device and electronic equipment
CN114915566A (en) Application identification method, device, equipment and computer readable storage medium
CN114374745A (en) Protocol format processing method and system
CN111818154B (en) Service pushing system and method based on network layer message analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant