CN111884813A

CN111884813A - Malicious certificate detection method

Info

Publication number: CN111884813A
Application number: CN202010775718.3A
Authority: CN
Inventors: 闫健恩; 李佳欣; 程亚楠; 张兆心; 黄俊凯; 姚雨辰
Original assignee: Harbin Institute of Technology Weihai
Current assignee: Harbin Institute of Technology Weihai
Priority date: 2020-08-05
Filing date: 2020-08-05
Publication date: 2020-11-03
Anticipated expiration: 2040-08-05
Also published as: CN111884813B

Abstract

The invention relates to a malicious certificate detection method that addresses the low accuracy and narrow detection range of existing malicious certificate detection methods. The method comprises the following steps: performing basic content parsing and conformance checking on the certificate, and judging whether the certificate conforms to RFC 5280; constructing a complete certificate chain from the trusted root certificates and intermediate certificates obtained from the CCADB, combined with CERT_ISSUER in the AIA extension information of the certificate, verifying the certificate signatures, and verifying the certificates on the whole chain; extracting features from the previously obtained certificate content and the related verification information; collecting benign certificate data and malicious certificate data and extracting features from the certificates; and, after the features are extracted, constructing a detection model and performing malicious certificate detection. The invention can be widely applied to detecting malicious X.509 certificates.

Description

Malicious certificate detection method

Technical Field

The invention relates to the field of certificate encryption, in particular to a malicious certificate detection method.

Background

The X.509 certificate is the basis of the HTTPS protocol; it is identity information issued by a certificate authority for the public key used to encrypt transmitted information. An X.509 certificate contains the signature algorithm used by the certificate authority when signing, the signature produced with the private key of the certificate authority, and basic information about the certificate authority, the certificate holder and the public key. In HTTPS communication, the server sends its X.509 certificate to the client during the handshake phase, and the client decides whether the certificate is trustworthy after parsing the certificate information and constructing the certificate chain. The indicators checked generally include whether the certificate has expired, whether the root certificate of the certificate is trusted, whether the certificate is in a certificate revocation list, and whether the online verification status of the certificate is valid. Because the X.509 certificate guarantees the security of access and connection to some extent, more and more websites use the HTTPS protocol to transfer information. This also raises problems: phishing websites and the control servers of some botnets have also started to use certificates to encrypt their data, thereby evading detection by some malicious traffic analysis tools. The Phishing Activity Trends Report published by the APWG (Anti-Phishing Working Group) for the first quarter of 2020 shows that approximately 75% of phishing websites use X.509 certificates to transfer data and disguise themselves. The SSL Blacklist project collects information about certificates used in communication between botnet servers and botnet nodes that have been discovered from 2014 to date. Certificates used for such malicious activities are called malicious certificates, and how to detect these malicious X.509 certificates has become a difficult problem.

At present, methods for detecting malicious X.509 certificates mainly include the following: first, filtering certificates with a blacklist, which has a low detection rate for newly appearing malicious certificates; second, applying statistical rules to certificate information and judging a certificate to be malicious if certain information appears, which has low accuracy and is easy to deceive; third, modeling benign and malicious certificates with classical machine learning methods, whose accuracy also leaves room for improvement. These methods differ in the technology they adopt and in the range of malicious certificates they cover; that is, the existing analysis of phishing website certificates and botnet certificates is not comprehensive enough.

Disclosure of Invention

To address the technical problems that existing malicious certificate detection methods have low accuracy and cover only a narrow range of malicious certificates, the invention provides a malicious certificate detection method based on ensemble learning with a wider detection range.

The invention provides a malicious certificate detection method, which comprises the following steps:

step one, performing basic content parsing and conformance checking on a certificate based on the Cryptography library and RFC 5280, preliminarily acquiring the basic information of the certificate, judging whether it meets the specifications and constraints of RFC 5280, and recording the related information;

step two, constructing a complete certificate chain based on the trusted root certificates and intermediate certificates obtained from the CCADB, combined with CERT_ISSUER in the AIA extension information of the certificate, verifying the certificate signatures and recording the corresponding information;

step three, if the certificate chain is successfully constructed, that is, no signature verification mismatch occurs and the final root certificate, namely a trusted self-signed certificate, is found, using OpenSSL to verify the certificate chain for each certificate of the chain until the certificates on the whole chain have been verified; if the certificates on the certificate chain are not successfully verified, proceeding directly to step four;

step four, extracting features from the previously obtained certificate content and the related verification information;

step five, collecting benign certificate data and malicious certificate data, and extracting features from the certificates;

step six, after the features are extracted, constructing a detection model and performing malicious certificate detection.

Preferably, the step one of acquiring the basic information of the certificate comprises the following steps:

step (1): importing the input certificates, which may be in pem or cer format, and converting and storing them in a common format;

step (2): acquiring the basic information of the X.509 certificate and any extension information present, using the Cryptography library;

step (3): checking the specification constraints set out in the RFC 5280 document and recording the relevant checking information.

Preferably, the certificate chain verification in step three is performed as follows:

step 1): importing the root certificate store trusted under Windows into the context as the base trusted certificate store;

step 2): loading the CRL information collected during certificate parsing into the context, for performing CRL verification on the certificate and checking whether the certificate is in a certificate revocation list;

step 3): adding each certificate to the context in order from the root certificate to the end certificate of the chain, verifying the chain as it stands after each addition, and adding newly found CRL information to the context to facilitate verification, which is equivalent to verifying the certificate chain for every certificate in the whole chain, while recording the corresponding information.

Preferably, the features extracted in step four include:

A. basic information of the certificate;

B. specification verification information of the certificate;

C. authentication information of certificate chains.

Preferably, the step five specific method comprises the following steps:

the benign certificate data are the certificates of the Alexa top one million domain names; the certificates for the Alexa sites are obtained directly from scans.io, and about 700,000 certificates can be obtained; the malicious certificates are X.509 certificates used by phishing websites and botnet control servers; to collect the phishing website and botnet control server certificates, the corresponding domain names and certificate fingerprints are first obtained from the PhishTank and SSL Blacklist websites; if a domain name is obtained, the certificate is retrieved over an HTTPS connection, with multiple processes used for acceleration; and if a fingerprint is obtained, the certificate is downloaded directly from crt.sh using the fingerprint.

Preferably, the specific method for constructing the detection model in the sixth step includes:

transforming the benign certificate data so that the benign data and the malicious certificate data are consistent in quantity; constructing models and selecting the optimal hyper-parameters using the XGBoost, LightGBM and CatBoost models; shuffling the data randomly when merging, splitting training data and validation data at a ratio of 0.15, and verifying model generalization by cross-validation.

Preferably, the construction of the model comprises the steps of:

step 1: importing the data and handling missing values, without normalizing the data;

step 2: splitting the data, with the validation set split off at a ratio of 0.15;

step 3: training the model and finding the optimal hyper-parameters.

Preferably, the specific implementation steps of the verification of the malicious certificate in the sixth step are as follows:

step a): performing basic parsing and RFC 5280-based specification constraint checks on the input certificate;

step b): constructing a certificate chain and verifying the certificate chain;

step c): extracting features based on the two preceding steps to obtain the specific features of the certificate;

step d): inputting the extracted features into the trained model and obtaining the model's judgment of whether the certificate is malicious.

The invention has the following beneficial effects: to address the insufficient accuracy and the narrow range of malicious certificates covered by conventional malicious certificate detection methods, the invention extracts features based on RFC 5280 and on information from the certificate chain verification process, collects a wider range of malicious certificate data, and uses ensemble learning to construct a malicious certificate detection model with higher accuracy and wider coverage. The problems currently faced in malicious certificate detection are thereby effectively addressed.

Drawings

FIG. 1 is a model workflow diagram of the present invention;

FIG. 2 is a flow chart of the construction of the certificate chain of the present invention;

FIG. 3 is a certificate verification data flow diagram of the present invention.

Detailed Description

The present invention is further described below with reference to the drawings and examples so that those skilled in the art can easily practice the present invention.

Example 1: FIGS. 1-3 show the model workflow diagram, the certificate chain construction flow chart, and the certificate verification data flow diagram, respectively.

The invention first performs basic content parsing and conformance checking on the certificate based on the Cryptography library and RFC 5280, preliminarily obtains the basic information of the certificate, judges whether the certificate conforms to the specifications and constraints of RFC 5280, and records the related information. The process mainly comprises the following steps:

step (1): importing the input certificates, which may be in pem or cer format, and converting and storing them in a common format.

step (2): acquiring the basic information of the X.509 certificate and any extension information present, using the Cryptography library.
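The format conversion in step (1) can be sketched with the Python standard library alone. This is a minimal illustration, not the implementation used by the invention; the helper names to_der, to_pem and sha256_fingerprint are hypothetical, and a .cer file is assumed to hold either PEM text or raw DER bytes.

```python
import hashlib
import ssl

PEM_HEADER = b"-----BEGIN CERTIFICATE-----"


def to_der(data: bytes) -> bytes:
    """Return DER bytes for a certificate supplied as PEM text or raw DER."""
    if data.lstrip().startswith(PEM_HEADER):
        # PEM is base64-encoded DER between header and footer lines.
        return ssl.PEM_cert_to_DER_cert(data.decode("ascii").strip())
    return data  # assumption: anything else is already DER


def to_pem(data: bytes) -> str:
    """Return PEM text for a certificate supplied as PEM text or raw DER."""
    return ssl.DER_cert_to_PEM_cert(to_der(data))


def sha256_fingerprint(data: bytes) -> str:
    """Fingerprint over the DER encoding, so PEM and DER inputs agree."""
    return hashlib.sha256(to_der(data)).hexdigest()
```

Storing every certificate in one encoding means the later parsing and chain-building steps never need to branch on the input format.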

Secondly, a complete certificate chain is constructed based on the trusted root certificates and intermediate certificates obtained from the CCADB (Common CA Database), combined with CERT_ISSUER in the AIA (Authority Information Access) extension information of the certificate. The CCADB is a repository of root and intermediate certificates maintained by Mozilla and supported by Microsoft and Google. There are two main ways to find the superior certificate of a given X.509 certificate: looking it up in a certificate store, or retrieving it according to the CERT_ISSUER information present in the certificate extensions. The invention combines the two approaches and can therefore construct the certificate chain effectively. Chain construction also involves signature verification: a certificate and its superior certificate are confirmed only if the signature verification succeeds. Specifically, the superior certificate that was found, the signature value of the current certificate, the signature algorithm and the TBS (to-be-signed) information, namely the input to the signing operation, are used to verify whether the signature of the current certificate was produced by the private key of the superior certificate. Information generated in this process is recorded for later feature extraction.
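The combination of store lookup and AIA retrieval described above might look roughly like the following sketch. The Cert record, the store dictionary keyed by subject name, and the fetch_aia and signature_ok callbacks are all hypothetical stand-ins: a real implementation would parse actual certificates and verify the signature over the TBS bytes with a cryptographic library.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Cert:
    subject: str
    issuer: str
    aia_issuer_url: Optional[str] = None  # CERT_ISSUER entry of the AIA extension
    trusted_root: bool = False


def build_chain(
    leaf: Cert,
    store: Dict[str, Cert],                        # e.g. certificates taken from the CCADB
    fetch_aia: Callable[[str], Optional[Cert]],    # downloads the issuer certificate
    signature_ok: Callable[[Cert, Cert], bool],    # (child, parent) signature check
    max_depth: int = 10,
) -> Tuple[List[Cert], dict]:
    """Walk from the leaf towards a root; return the chain and a log of events."""
    chain = [leaf]
    log = {"aia_used": 0, "signature_mismatch": False, "complete": False}
    current = leaf
    for _ in range(max_depth):
        if current.subject == current.issuer:      # self-signed: top of the chain
            log["complete"] = current.trusted_root
            return chain, log
        parent = store.get(current.issuer)         # approach 1: certificate store
        if parent is None and current.aia_issuer_url:
            parent = fetch_aia(current.aia_issuer_url)  # approach 2: AIA
            if parent is not None:
                log["aia_used"] += 1
        if parent is None:
            return chain, log                      # issuer not found
        if not signature_ok(current, parent):
            log["signature_mismatch"] = True
            return chain, log
        chain.append(parent)
        current = parent
    return chain, log
```

The log dictionary is kept alongside the chain because the source records information from this stage for later feature extraction.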

If the certificate chain is successfully constructed, that is, no signature verification mismatch occurs during construction and the final root certificate, namely a trusted self-signed certificate, is found, OpenSSL is used to verify the certificate chain for each certificate of the chain until the certificates on the whole chain have been verified. The specific process is as follows:

step 3): each certificate is added to the context in order from the root certificate to the end certificate of the chain, and the chain as it stands after each addition is verified; newly found CRL information is also added to the context to facilitate verification. This is equivalent to verifying the certificate chain for every certificate in the whole chain. The corresponding information is recorded during this process as well.
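The root-to-leaf ordering of step 3) can be illustrated with a simplified simulation. The checks below (issuer already trusted, validity period, revocation) are only stand-ins for what OpenSSL performs, and the dictionary-based certificate records, the crl_by_issuer mapping and the function name are hypothetical.

```python
from datetime import datetime, timezone
from typing import Dict, Iterable, List, Optional, Set


def verify_chain_stepwise(
    chain_root_first: List[dict],
    trusted_roots: Iterable[str],
    crl_by_issuer: Dict[str, Set[int]],
    now: Optional[datetime] = None,
) -> List[dict]:
    """Verify certificates one by one, growing the trusted context as we go."""
    now = now or datetime.now(timezone.utc)
    context = set(trusted_roots)   # subjects that are trusted so far
    revoked = set()                # (issuer, serial) pairs from CRLs loaded so far
    records = []
    for cert in chain_root_first:
        # Load any CRL published by this certificate's issuer into the context.
        for serial in crl_by_issuer.get(cert["issuer"], ()):
            revoked.add((cert["issuer"], serial))
        errors = []
        if cert["issuer"] not in context:
            errors.append("untrusted_issuer")
        if not (cert["not_before"] <= now <= cert["not_after"]):
            errors.append("outside_validity_period")
        if (cert["issuer"], cert["serial"]) in revoked:
            errors.append("revoked")
        records.append({"subject": cert["subject"], "errors": errors})
        if not errors:
            context.add(cert["subject"])  # later certificates may chain to it
    return records
```

Because a certificate is added to the context only after it passes, a problem in an intermediate certificate shows up both in its own record and in the records of everything below it.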

Whether or not the verification succeeds, features must be extracted from the previously obtained certificate content and the related verification information. A total of 169 features are extracted, covering information and flags generated in the processes above. The extracted features fall broadly into three categories:

A. basic information of certificate

B. Canonical verification information for certificates

C. Authentication information for certificate chains
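As a rough illustration of how the three categories could be merged into one model input, the sketch below flattens three dictionaries into a fixed-order numeric vector. The eight feature names are invented for the example; the source states only that 169 features are extracted and does not list them.

```python
from typing import Dict, List, Optional

# Hypothetical feature names, one prefix per category (A, B, C).
FEATURE_ORDER = [
    "basic.validity_days", "basic.san_count", "basic.is_self_signed",
    "spec.serial_ok", "spec.key_usage_ok",
    "chain.complete", "chain.length", "chain.revoked",
]


def to_feature_vector(basic: Dict, spec: Dict, chain: Dict) -> List[Optional[float]]:
    """Flatten the three feature groups into a fixed-order numeric vector."""
    merged = {}
    for prefix, group in (("basic", basic), ("spec", spec), ("chain", chain)):
        for key, value in group.items():
            merged[f"{prefix}.{key}"] = value
    vector = []
    for name in FEATURE_ORDER:
        value = merged.get(name)
        # Booleans become 0.0/1.0; features that were not recorded stay None.
        vector.append(None if value is None else float(value))
    return vector
```

A fixed ordering matters because the same column layout has to be used when training the model and when scoring a new certificate.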

At this point the parsing and verification of a certificate and the extraction of its features are complete, and data collection follows. The benign certificate data used by the method are the certificates of the Alexa top one million domain names, and the malicious certificates are X.509 certificates used by phishing websites and botnet control servers. The certificates for the Alexa sites are obtained directly from scans.io; about 700,000 can be obtained. To collect the phishing website and botnet control server certificates, the corresponding domain names and certificate fingerprints are first obtained from the PhishTank and SSL Blacklist websites. If a domain name is obtained, the certificate is retrieved over an HTTPS connection, with multiple processes used for acceleration; if a fingerprint is obtained, the certificate is downloaded directly from crt.sh using the fingerprint. About 2000 malicious certificates were finally obtained. Once the benign and malicious certificate data are available, features are extracted from the certificates according to the process above.
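Retrieving certificates for a list of domain names over HTTPS could be sketched as follows, using only the standard library. The function names are hypothetical. Certificate validation is deliberately disabled because the goal is to collect the certificate, not to trust it. A thread pool is used by default for portability; the source describes multi-process acceleration, which corresponds to passing ProcessPoolExecutor as executor_cls (with a picklable fetch function).

```python
import socket
import ssl
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, Optional


def fetch_pem(host: str, port: int = 443, timeout: float = 5.0) -> Optional[str]:
    """Return the server's leaf certificate as PEM, or None on any failure."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False       # must be disabled before verify_mode
    ctx.verify_mode = ssl.CERT_NONE  # collect even invalid certificates
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with ctx.wrap_socket(sock, server_hostname=host) as tls:
                der = tls.getpeercert(binary_form=True)
    except (OSError, ValueError):
        return None
    return ssl.DER_cert_to_PEM_cert(der) if der else None


def collect(
    hosts: Iterable[str],
    fetch: Callable[[str], Optional[str]] = fetch_pem,
    workers: int = 8,
    executor_cls=ThreadPoolExecutor,
) -> Dict[str, str]:
    """Fetch certificates for many hosts in parallel; skip hosts that fail."""
    hosts = list(hosts)
    with executor_cls(max_workers=workers) as pool:
        results = list(pool.map(fetch, hosts))
    return {h: pem for h, pem in zip(hosts, results) if pem is not None}
```

Many phishing and botnet hosts are short-lived, so failures are expected and are simply dropped rather than raised.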

After the data features are extracted, the detection model is constructed and implemented. The numbers of benign and malicious certificates collected differ greatly, so the data imbalance problem is pronounced. This is handled by transforming the benign certificate data so that it matches the malicious certificate data in quantity; that is, 4000 samples are used in model training, 2000 benign and 2000 malicious. The data are shuffled randomly when merged, training data and validation data are split at a ratio of 0.15, and model generalization is verified by cross-validation. As for model selection, conventional machine learning methods may suffer from low accuracy, while deep learning may overfit because the amount of data is too small. The model is therefore built with XGBoost, LightGBM and CatBoost, which are popular in both industry and academia. All three are gradient boosting tree algorithms, but they differ in training time and in the quality of the final results, and all three are among the better-performing models in ensemble learning. XGBoost was proposed by Tianqi Chen et al. in a competition, LightGBM by Guolin Ke et al., and CatBoost by Liudmila Prokhorenkova et al. All are based on decision trees: many weak decision tree models are combined into a strong classifier, the trees are built sequentially, and each new tree is constructed to improve the accuracy of the model based on the training results of the trees built before it, until the model reaches the required accuracy or the maximum number of trees.
The three models differ in the optimization and iteration strategies of their algorithms, so the trained models differ somewhat in performance. The method uses the Python implementations of the three models to find the one that performs best on this problem: a search range is given for the model parameters, and the optimal parameters within that range are found by grid search with GridSearchCV, the model parameter search tool in sklearn. Each of the three models is used to build a model and select the optimal hyper-parameters; the specific process comprises the following steps:

step 1: data import and missing-value handling; considering the convenience of later certificate detection and the fact that all three models are tree-based, the data are not normalized, and because the data set is large and few samples have missing values, samples with missing values are discarded directly;

step 2: data splitting; equal numbers of malicious and benign certificate samples are mixed and shuffled, and the training set and validation set are split at a ratio of 0.15 with the train_test_split tool in sklearn;

step 3: training the model and finding the optimal hyper-parameters; the model is built with its Python implementation library, a parameter search range is given, and the optimal model parameters within that range are found by grid search with GridSearchCV.
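Steps 2 and 3 can be sketched in plain Python as below. This is a standard-library stand-in for train_test_split and GridSearchCV from sklearn, not the implementation described in the source; it assumes that "transforming" the benign data means random down-sampling, and fit_and_score is a hypothetical callback that would train one of the boosting models and return its validation accuracy.

```python
import random
from itertools import product
from statistics import mean


def balance_and_split(benign, malicious, val_ratio=0.15, seed=0):
    """Down-sample to equal class sizes, shuffle, and split off a validation set."""
    rng = random.Random(seed)
    n = min(len(benign), len(malicious))
    rows = [(x, 0) for x in rng.sample(benign, n)]
    rows += [(x, 1) for x in rng.sample(malicious, n)]
    rng.shuffle(rows)
    n_val = round(len(rows) * val_ratio)
    return rows[n_val:], rows[:n_val]  # (training rows, validation rows)


def k_fold_indices(n, k):
    """Yield (train_indices, val_indices) for k interleaved folds."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, folds[i]


def grid_search(param_grid, rows, fit_and_score, k=5):
    """Try every parameter combination; keep the best mean cross-validation score."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = mean(
            fit_and_score(params, [rows[i] for i in tr], [rows[i] for i in va])
            for tr, va in k_fold_indices(len(rows), k)
        )
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

With the figures given in the source (2000 benign and 2000 malicious samples), a ratio of 0.15 yields 600 validation rows and 3400 training rows.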

With the trained model, the specific implementation of the verification of the malicious certificate follows. The method comprises the following specific steps:

step b): constructing a certificate chain and verifying the certificate chain;

The entire certificate detection process is now complete.

Example 2

In the basic parsing and checking of the input X.509 certificate, the basic information stored in the certificate is extracted, including the certificate subject, the certificate issuer, the certificate extensions and the public key usage. In addition, checks are innovatively added for some restrictions in RFC 5280, such as whether decipher_only and encipher_only in the key usage are set only when key_agreement is true, and whether serial_number is a positive integer represented in no more than 20 bytes. The restrictions that the RFC 5280 document places on conforming certificates are carefully collected and checked, and the relevant information is recorded. This completes the basic parsing and information checking of the certificate.
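The two RFC 5280 restrictions named above can be expressed as small checks on already-parsed values. The function names and the returned flag names are hypothetical; the inputs are assumed to come from whatever parser extracted the serial number and key usage bits.

```python
def serial_number_flags(serial: int) -> dict:
    """RFC 5280 section 4.1.2.2: positive integer, at most 20 octets."""
    # A DER INTEGER is two's complement, so a positive value whose top bit
    # is set needs one extra leading zero octet (exact for positive values).
    der_octets = abs(serial).bit_length() // 8 + 1
    return {
        "serial_positive": serial > 0,
        "serial_len_ok": der_octets <= 20,
    }


def key_usage_flags(key_agreement: bool, encipher_only: bool, decipher_only: bool) -> dict:
    """RFC 5280 section 4.2.1.3: encipherOnly and decipherOnly are only
    defined when the keyAgreement bit is also set."""
    return {"key_usage_ok": key_agreement or not (encipher_only or decipher_only)}
```

For example, serial_number_flags(2**159 - 1) passes the length check, while 2**159 needs 21 octets in DER and fails it.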

Example 3

The certificate chain is the complete list of certificates from the end certificate to the root certificate; the signature of every certificate in the list other than the root certificate can be verified with the public key of its superior certificate. Besides basic signature verification, certificate chain verification checks whether the certificate usage is compliant, whether the certificate policies are satisfied, whether the name constraints of the certificate are satisfied, whether the policy constraints of the certificate are satisfied, and so on. These items are checked during certificate chain verification, which also yields related information. The process is performed for each certificate in the chain, so problems with intermediate certificates can also be detected.

Example 4

Most of the information about a certificate can be obtained through basic parsing, specification checking and certificate chain verification. This information is then extracted and processed for later input into the model. After the features of benign and malicious certificates are extracted, different ensemble learning models are trained and tuned to find the model with the highest accuracy. Experimental results show that the accuracy of the ensemble learning model can reach about 97%, and that certificates used by phishing websites and botnet servers can be detected, which effectively addresses some of the problems in malicious certificate detection.

The above description is only for the purpose of illustrating preferred embodiments of the present invention and is not to be construed as limiting the present invention, and it is apparent to those skilled in the art that various modifications and variations can be made in the present invention. All changes, equivalents, modifications and the like which come within the scope of the invention as defined by the appended claims are intended to be embraced therein.

Claims

1. A malicious certificate detection method, characterized by comprising the following steps:

step one, performing basic content parsing and conformance checking on a certificate based on the Cryptography library and RFC 5280, preliminarily acquiring the basic information of the certificate, judging whether it meets the specifications and constraints of RFC 5280, and recording the related information;

2. The malicious certificate detection method according to claim 1, wherein the step one of obtaining the basic information of the certificate comprises the steps of:

step (1): importing the input pem and cer format certificates, and converting and storing them in a common format;

step (2): acquiring the basic information and the existing extension information of the X.509 certificate using the Cryptography library;

step (3): checking the specification constraints set out in the RFC 5280 document and recording the relevant checking information.

3. The malicious certificate detection method according to claim 1, wherein the certificate chain verification in the third step is performed as follows:

4. The malicious certificate detection method according to claim 1, wherein the features extracted in the fourth step include:

A. basic information of the certificate;

B. specification verification information of the certificate;

C. authentication information of certificate chains.

5. The malicious certificate detection method according to claim 1, wherein the step five specific method comprises:

the benign certificate data are the certificates of the Alexa top one million domain names, and the certificates for the Alexa sites are obtained directly from scans.io; the malicious certificates are X.509 certificates used by phishing websites and botnet control servers; to collect the phishing website and botnet control server certificates, the corresponding domain names and certificate fingerprints are first obtained from the PhishTank and SSL Blacklist websites; if a domain name is obtained, the certificate is retrieved over an HTTPS connection, with multiple processes used for acceleration; and if a fingerprint is obtained, the certificate is downloaded directly from crt.sh using the fingerprint.

6. The malicious certificate detection method according to claim 1, wherein the specific method for constructing the detection model in the sixth step includes:

7. The malicious certificate detection method according to claim 6, wherein the model is constructed by the following steps:

step 3: training the model and finding the optimal hyper-parameters.

8. The method for detecting a malicious certificate according to claim 1, wherein the step six for verifying the malicious certificate comprises the following steps:

step b): constructing a certificate chain and verifying the certificate chain;