CN116662576A - Association method and association system for security vulnerabilities and laws and regulations - Google Patents

Info

Publication number
CN116662576A
Authority
CN
China
Prior art keywords
legal
data
security
vulnerability
vector
Legal status
Pending
Application number
CN202310919095.6A
Other languages
Chinese (zh)
Inventor
王建国
王德民
田鑫程
王建龙
郭飞
李可
Current Assignee
Beijing Tianyun Sea Number Technology Co ltd
Original Assignee
Beijing Tianyun Sea Number Technology Co ltd

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G06F16/35 Clustering; Classification
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/57 Certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities
    • G06F21/577 Assessing vulnerabilities and evaluating computer system security
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/0442 Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G06N3/08 Learning methods
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/02 Knowledge representation; Symbolic representation
    • G06N5/022 Knowledge engineering; Knowledge acquisition
    • G06N5/025 Extracting rules from data
    • G06N5/027 Frames
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/18 Legal services; Handling legal documents
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a method and a system for associating security vulnerabilities with laws and regulations, relating to the technical field of network security. The association method comprises: collecting law and regulation data and security vulnerability data and structuring them to obtain the corresponding structured text data; performing Chinese word segmentation on the structured text data and filtering stop words out of the segmentation results to obtain an entry list for each of the two types of data; constructing a dictionary from the entries and constructing a document vector for each entry based on TF-IDF; and traversing and calculating the cosine similarity between each law and regulation vector and each security vulnerability vector, and constructing a knowledge base from the data whose cosine similarity exceeds a preset threshold, to obtain the association between security vulnerabilities and laws and regulations. With this technical solution, the law and regulation items to which each security vulnerability may relate can be sorted out automatically, making the business process automated and intelligent, greatly reducing manual workload and improving accuracy.

Description

Association method and association system for security vulnerabilities and laws and regulations
Technical Field
The invention relates to the technical field of network security, in particular to a method for associating security vulnerabilities with laws and regulations and a system for associating security vulnerabilities with laws and regulations.
Background
With the rapid development and popularization of the Internet, our lives have become inseparable from the network. With the development of technologies such as 5G, IPv6 and cloud computing, the world is surrounded by more and more intelligent hardware and data; some data indicate that the number of networked devices worldwide is expected to exceed one billion by 2030. As the virtual cyberspace keeps expanding, some very important critical infrastructure is also exposed in it. Wherever there is software there may be vulnerabilities: new vulnerabilities are disclosed every day, and the historical total exceeds 500,000.
In managing the network assets of critical infrastructure, the business needs to state, for each vulnerability, which laws, regulations and protection requirements would be violated if the vulnerability is not repaired. At present this is done manually or by fuzzy matching. New laws and regulations keep being introduced, so the manual workload is considerable. Fuzzy matching relies on keyword matching, which associates each vulnerability with a large number of law and regulation items; accuracy is very low, manual review is still needed in the end, and efficiency is poor.
Disclosure of Invention
In view of the above problems, the invention provides a method and a system for associating security vulnerabilities with laws and regulations. By constructing a knowledge base that associates law and regulation data with security vulnerability data, the law and regulation items to which each security vulnerability may relate are sorted out automatically, making the business process automated and intelligent, greatly reducing manual workload and improving accuracy. In addition, the knowledge base can be used to train a supervised learning model that predicts the laws and regulations associated with new security vulnerabilities. The predictions can be applied in intelligent reports to assess which laws a hacker would violate by exploiting a vulnerability and how much loss might be caused, and which laws and regulations the organization owning the vulnerable network asset would violate, and what consequences it would bear, if it fails to repair the vulnerability.
In order to achieve the above object, the present invention provides a method for associating security vulnerabilities with laws and regulations, comprising:
collecting law and regulation data and security vulnerability data, and structuring both to obtain the corresponding structured text data;
performing Chinese word segmentation on the structured law and regulation text data and the structured security vulnerability text data, and filtering stop words out of the segmentation results using a preset stop-word list, to obtain an entry list for each of the two types of data;
constructing a dictionary from the entries in the entry lists, and constructing a document vector for each entry based on TF-IDF (term frequency-inverse document frequency);
and traversing and calculating the cosine similarity between each law and regulation vector and each security vulnerability vector, and constructing a knowledge base from the data whose cosine similarity exceeds a preset threshold, to obtain the association between security vulnerabilities and laws and regulations.
In the above technical solution, preferably, the method for associating security vulnerabilities with laws and regulations further comprises:
training an LSTM-based text classifier using the knowledge base as training samples;
inputting new security vulnerability data into the trained text classifier to obtain the laws and regulations associated with the new security vulnerability data.
In the above technical solution, preferably, the specific process of structuring the law and regulation data and the security vulnerability data comprises:
for the downloaded law and regulation data in text form, processing it into a structured list composed of law and regulation name, law and regulation item and law and regulation content;
and for the security vulnerability data downloaded from different channels, associating the data of the different channels to obtain the Chinese description text corresponding to each structured security vulnerability.
In the above technical solution, preferably, the specific process of performing Chinese word segmentation on the structured law and regulation text data and the structured security vulnerability text data comprises:
merging the two fields, law and regulation name and law and regulation content, in the structured law and regulation text data, and performing Chinese word segmentation on the merged field using a word segmentation algorithm;
merging the vulnerability name, vulnerability type and vulnerability description in the security vulnerability data and converting them into a list, and performing Chinese word segmentation on the fields in the list using the same word segmentation algorithm.
In the above technical solution, preferably, the specific process of constructing a dictionary from the entries in the entry lists and constructing a document vector for each entry based on TF-IDF comprises:
merging the entry lists of the two types of data and assigning a unique index or number to each entry to construct a dictionary, wherein the dictionary establishes a mapping between entries and indices and is stored as a dictionary or hash table;
and constructing the document vectors based on TF-IDF (term frequency-inverse document frequency) using the entry-to-index mapping in the dictionary.
In the above technical solution, preferably, the specific process of calculating the cosine similarity between each law and regulation vector and each security vulnerability vector comprises:
performing length normalization on each law and regulation vector and each security vulnerability vector, and storing the vectors in compressed form using a sparse matrix;
and traversing and calculating the cosine similarity between each law and regulation vector and each security vulnerability vector.
In the above technical solution, preferably, the specific process of constructing the knowledge base from the data whose cosine similarity exceeds the preset threshold comprises:
storing the law and regulation data and security vulnerability data whose cosine similarity exceeds the preset threshold in a database to form the knowledge base;
the fields of the knowledge base comprise the law and regulation name, law and regulation item and law and regulation content of the law and regulation data, the vulnerability name, vulnerability type and vulnerability description of the security vulnerability data from the different channels, and the similarity between the law and regulation data and the security vulnerability data;
wherein the database is MySQL, Oracle, Hive or HBase.
In the above technical solution, preferably, when training the text classifier, the vulnerability name, vulnerability type and vulnerability description are used as input features, and the law and regulation item is used as the output class;
and the vulnerability name, vulnerability type and/or vulnerability description of newly published security vulnerability data are input into the text classifier to obtain the associated law and regulation items.
The invention also provides a system for associating security vulnerabilities with laws and regulations, which applies the association method disclosed in any one of the above technical solutions and comprises:
a data structuring module, used for collecting law and regulation data and security vulnerability data, and structuring both to obtain the corresponding structured text data;
a data preprocessing module, used for performing Chinese word segmentation on the structured law and regulation text data and the structured security vulnerability text data, and filtering stop words out of the segmentation results using a preset stop-word list, to obtain an entry list for each of the two types of data;
a dictionary vector construction module, used for constructing a dictionary from the entries in the entry lists and constructing a document vector for each entry based on TF-IDF;
a knowledge base construction module, used for traversing and calculating the cosine similarity between each law and regulation vector and each security vulnerability vector, and constructing a knowledge base from the data whose cosine similarity exceeds a preset threshold, to obtain the association between security vulnerabilities and laws and regulations.
In the above technical solution, preferably, the system for associating security vulnerabilities with laws and regulations further comprises a classifier prediction module, used for training an LSTM-based text classifier using the knowledge base as training samples,
and for inputting new security vulnerability data into the trained text classifier to obtain the laws and regulations associated with the new security vulnerability data.
Compared with the prior art, the invention has the following beneficial effects: by constructing a knowledge base that associates law and regulation data with security vulnerability data, the law and regulation items to which each security vulnerability may relate are sorted out automatically, making the business process automated and intelligent, greatly reducing manual workload and improving accuracy. In addition, the knowledge base can be used to train a supervised learning model that predicts the laws and regulations associated with new security vulnerabilities. The predictions can be applied in intelligent reports to assess which laws a hacker would violate by exploiting a vulnerability and how much loss might be caused, and which laws and regulations the organization owning the vulnerable network asset would violate, and what consequences it would bear, if it fails to repair the vulnerability.
Drawings
FIG. 1 is a schematic flow diagram of a method for associating security vulnerabilities with laws and regulations according to an embodiment of the present invention;
FIG. 2 is a schematic block diagram of a system for associating security vulnerabilities with laws and regulations according to an embodiment of the present invention.
In the figures, the correspondence between the components and the reference numerals is:
1. data structuring module; 2. data preprocessing module; 3. dictionary vector construction module; 4. knowledge base construction module; 5. classifier prediction module.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The invention is described in further detail below with reference to the accompanying drawings:
As shown in FIG. 1, the method for associating security vulnerabilities with laws and regulations provided by the present invention comprises:
collecting law and regulation data and security vulnerability data, and structuring both to obtain the corresponding structured text data;
performing Chinese word segmentation on the structured law and regulation text data and the structured security vulnerability text data, and filtering stop words out of the segmentation results using a preset stop-word list, to obtain an entry list for each of the two types of data;
constructing a dictionary from the entries in the entry lists, and constructing a document vector for each entry based on TF-IDF;
and traversing and calculating the cosine similarity between each law and regulation vector and each security vulnerability vector, and constructing a knowledge base from the data whose cosine similarity exceeds a preset threshold, to obtain the association between security vulnerabilities and laws and regulations.
In this embodiment, by constructing a knowledge base that associates law and regulation data with security vulnerability data, the law and regulation items to which each security vulnerability may relate are sorted out automatically, making the business process automated and intelligent, greatly reducing manual workload and improving accuracy. In addition, the knowledge base can be used to train a supervised learning model that predicts the laws and regulations associated with new security vulnerabilities. The predictions can be applied in intelligent reports to assess which laws a hacker would violate by exploiting a vulnerability and how much loss might be caused, and which laws and regulations the organization owning the vulnerable network asset would violate, and what consequences it would bear, if it fails to repair the vulnerability.
The law and regulation data are all publicly available, so the laws and regulations related to network security can be found and downloaded. The vulnerability data are downloaded from the CVE, CNVD and CNNVD websites.
In the above embodiment, preferably, the specific process of structuring the law and regulation data and the security vulnerability data comprises:
for the downloaded law and regulation data in text form, processing it into a structured list composed of law and regulation name, law and regulation item and law and regulation content;
and for the security vulnerability data downloaded from different channels, associating the data of the different channels to obtain the Chinese description text corresponding to each structured security vulnerability.
Specifically, the downloaded law and regulation data are texts in various forms, but they basically follow the format "Chapter X, Section X, Article X". For ease of processing, each document is first converted into a structured form consisting of three columns: law and regulation name, law and regulation item and law and regulation content. The downloaded vulnerability data are already structured; the Chinese description of each vulnerability is obtained by associating the CVE, CNVD and CNNVD data.
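The three-column structuring of a law text can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `structure_law`, the regular expressions and the sample text are all hypothetical, and it assumes that each article starts on its own line with a marker of the form "第…条" while chapter and section headings carry no article content.

```python
import re

# Assumed numbering style: an article starts with "第<number>条".
ARTICLE_RE = re.compile(r"^(第[一二三四五六七八九十百千零〇0-9]+条)\s*(.*)$")
# Chapter and section headings are skipped.
HEADING_RE = re.compile(r"^第[一二三四五六七八九十百千零〇0-9]+[章节]")


def structure_law(law_name, raw_text):
    """Split raw law text into rows of (law name, item, content)."""
    rows = []
    for line in raw_text.splitlines():
        line = line.strip()
        if not line or HEADING_RE.match(line):
            continue
        m = ARTICLE_RE.match(line)
        if m:
            rows.append([law_name, m.group(1), m.group(2)])
        elif rows:
            # Continuation line of the current article.
            rows[-1][2] += line
    return [tuple(r) for r in rows]


# Invented placeholder text, not a real law.
sample = """第一章 总则
第一条 示例条文甲。
第二条 示例条文乙,
续行内容。
第二章 附则
第三条 示例条文丙。"""

rows = structure_law("示例法", sample)
for row in rows:
    print(row)
```

Real law texts vary in layout, so the patterns would need adjusting per source.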
In the above embodiment, preferably, the specific process of performing Chinese word segmentation on the structured law and regulation text data and the structured security vulnerability text data comprises:
merging the two fields, law and regulation name and law and regulation content, in the structured law and regulation text data, and performing Chinese word segmentation on the merged field using a word segmentation algorithm;
merging the vulnerability name, vulnerability type and vulnerability description in the security vulnerability data and converting them into a list, and performing Chinese word segmentation on the fields in the list using the same word segmentation algorithm.
Specifically, the structured Chinese text data obtained from the structuring step must be segmented into words before being processed by the model. Chinese word segmentation is the process of splitting continuous Chinese text into individual, meaningful words. Because Chinese has no explicit word boundaries as English does, word segmentation is an important preprocessing step in Chinese natural language processing. Common Chinese word segmentation methods include dictionary-based, rule-based and statistics-based segmentation. The two fields "law and regulation name" and "law and regulation content" in the law and regulation data are merged and converted into a list; the "vulnerability name + vulnerability type + vulnerability description" text in the processed vulnerability data is likewise merged and converted into a list. Chinese word segmentation is then performed on each of the two lists.
Stop words are a class of common words that are filtered out or deleted in text processing; they typically occur very frequently but carry little actual meaning or information value. The main purpose of removing stop words is to reduce noise in the text and improve the results of subsequent text analysis tasks. Stop words generally include common function words, prepositions, conjunctions, pronouns and some common high-frequency words. The Chinese word segmentation results are filtered in a loop to remove the stop words, which reduces the dimension of the vectors and improves computational efficiency.
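As an illustration of the segmentation and stop-word steps, the sketch below uses forward maximum matching, one simple dictionary-based method. The patent does not say which segmentation algorithm is used, so this choice is an assumption; the lexicon, the stop-word list and the function names are hypothetical toy values.

```python
def fmm_segment(text, lexicon, max_len=4):
    """Forward maximum matching: at each position take the longest lexicon word."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Fall back to a single character when nothing longer matches.
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


def remove_stop_words(tokens, stop_words):
    """Loop over the segmentation result and drop every stop word."""
    return [t for t in tokens if t not in stop_words]


# Toy lexicon and stop-word list (assumptions for this example only).
LEXICON = {"网络", "运营者", "应当", "修复", "安全", "漏洞", "安全漏洞"}
STOP_WORDS = {"的", "应当", "和", "。", ","}

tokens = fmm_segment("网络运营者应当修复安全漏洞。", LEXICON)
entries = remove_stop_words(tokens, STOP_WORDS)
print(tokens)
print(entries)
```

In practice a statistics-based segmenter with a much larger lexicon would likely give better results; the filtering step stays the same.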
In the above embodiment, preferably, the specific process of constructing a dictionary from the entries in the entry lists and constructing a document vector for each entry based on TF-IDF comprises:
merging the entry lists of the two types of data and assigning a unique index or number to each entry to construct a dictionary, wherein the dictionary establishes a mapping between entries and indices and is stored as a dictionary or hash table;
and constructing the document vectors based on TF-IDF (term frequency-inverse document frequency) using the entry-to-index mapping in the dictionary.
TF-IDF is used to weight the entries. TF-IDF weights highlight the important words in a text, which improves the accuracy of the similarity calculation.
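A minimal sketch of the dictionary and TF-IDF steps is given below. It assumes that each structured record (one law item or one vulnerability) is treated as one document, and it uses the plain unsmoothed weighting tf × ln(N / df); the patent does not specify the exact TF-IDF variant, and the function names and sample entries are hypothetical.

```python
import math
from collections import Counter


def build_dictionary(docs):
    """Map each distinct entry to a unique integer index (a hash table)."""
    dictionary = {}
    for doc in docs:
        for term in doc:
            dictionary.setdefault(term, len(dictionary))
    return dictionary


def tfidf_vectors(docs, dictionary):
    """Return one sparse {index: weight} vector per document."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        vec = {}
        for term, count in Counter(doc).items():
            tf = count / len(doc)
            idf = math.log(n_docs / df[term])  # unsmoothed IDF (an assumption)
            if idf > 0:  # entries present in every document carry no weight
                vec[dictionary[term]] = tf * idf
        vectors.append(vec)
    return vectors


# Toy entry lists after segmentation and stop-word removal.
law_docs = [["网络", "运营者", "修复", "安全漏洞"]]
vuln_docs = [["缓冲区", "溢出", "安全漏洞"], ["网络", "服务", "拒绝"]]
docs = law_docs + vuln_docs  # merge the entry lists of both data types

dictionary = build_dictionary(docs)
vectors = tfidf_vectors(docs, dictionary)
print(dictionary)
print(vectors[0])
```

Entries shared by several documents receive a lower weight than entries unique to one document, which is the highlighting effect described above.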
In the above embodiment, preferably, the specific process of traversing and calculating the cosine similarity between each law and regulation vector and each security vulnerability vector comprises:
performing length normalization on each law and regulation vector and each security vulnerability vector, and storing the vectors in compressed form using a sparse matrix;
and traversing and calculating the cosine similarity between each law and regulation vector and each security vulnerability vector.
Specifically, vector length affects the inner product on which the cosine similarity calculation is based. To eliminate the effect of vector length, each vector may be length-normalized and converted into a unit vector. One common approach is to normalize with the L2 norm, i.e. to divide the vector by its L2 norm. This ensures that all vectors have length 1, so that only the direction of a vector matters and not its length. For high-dimensional sparse vectors, cosine similarity calculation may run into efficiency problems; this is addressed with compressed storage: representing the vectors as a compressed sparse matrix reduces their storage space and improves computational efficiency.
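The normalization and pairwise traversal can be sketched as follows. Sparse vectors are represented here as `{index: weight}` dictionaries, which is one possible compressed form and an assumption of this example; a sparse matrix library could serve the same purpose. After L2 normalization, the cosine similarity of two vectors is simply their dot product. All names and sample values are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a sparse {index: weight} vector to unit length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    if norm == 0:
        return {}
    return {i: w / norm for i, w in vec.items()}


def dot(a, b):
    """Dot product that only visits the indices of the shorter vector."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b[i] for i, w in a.items() if i in b)


def pairwise_similarity(law_vecs, vuln_vecs):
    """Traverse every (law, vulnerability) pair and yield its cosine similarity."""
    laws = [l2_normalize(v) for v in law_vecs]
    vulns = [l2_normalize(v) for v in vuln_vecs]
    for li, lv in enumerate(laws):
        for vi, vv in enumerate(vulns):
            yield li, vi, dot(lv, vv)


# Toy vectors: the first vulnerability points in the same direction as the law.
law_vecs = [{0: 3.0, 1: 4.0}]
vuln_vecs = [{0: 6.0, 1: 8.0}, {2: 1.0}, {0: 1.0}]

sims = list(pairwise_similarity(law_vecs, vuln_vecs))
for law_idx, vuln_idx, sim in sims:
    print(law_idx, vuln_idx, round(sim, 4))
```

The first pair shows that scaling a vector does not change its similarity once both vectors are normalized.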
In the above embodiment, preferably, the specific process of constructing the knowledge base from the data whose cosine similarity exceeds the preset threshold comprises:
storing the law and regulation data and security vulnerability data whose cosine similarity exceeds the preset threshold in a database to form the knowledge base;
the fields of the knowledge base comprise the law and regulation name, law and regulation item and law and regulation content of the law and regulation data, the CVE_CODE, CNNVD_CODE, CNVD_CODE, vulnerability name, vulnerability type and vulnerability description of the security vulnerability data from the different channels, and the similarity between the law and regulation data and the security vulnerability data;
wherein the database is MySQL, Oracle, Hive or HBase.
In implementation, once the similarity between every vulnerability and every law and regulation item has been calculated, the resulting data set is huge. The results are then ranked in descending order of similarity, the quality of the text associations is judged sequentially from top to bottom, and a suitable cutoff point is chosen. Preferably, associations with a similarity greater than 80% are found to work best, and this portion of the data is stored in a database to form the knowledge base.
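The ranking, cutoff and storage steps can be sketched as below. SQLite (Python's built-in `sqlite3` module) is used only as a stand-in for the MySQL, Oracle, Hive or HBase databases named in the text; the table name, column names and every row value are hypothetical placeholders, not real vulnerability or law data.

```python
import sqlite3

THRESHOLD = 0.8  # the "greater than 80%" cutoff preferred in the text

# Hypothetical scored pairs; every value is a placeholder.
scored = [
    ("示例法", "第一条", "示例条文甲。", "CVE-0000-0001", "CNNVD-000000-001",
     "CNVD-0000-00001", "示例漏洞A", "缓冲区错误", "示例描述A", 0.91),
    ("示例法", "第二条", "示例条文乙。", "CVE-0000-0002", "CNNVD-000000-002",
     "CNVD-0000-00002", "示例漏洞B", "输入验证错误", "示例描述B", 0.42),
    ("示例法", "第三条", "示例条文丙。", "CVE-0000-0003", "CNNVD-000000-003",
     "CNVD-0000-00003", "示例漏洞C", "权限许可", "示例描述C", 0.85),
]

conn = sqlite3.connect(":memory:")  # stand-in for the production database
conn.execute("""CREATE TABLE knowledge_base (
    law_name TEXT, law_item TEXT, law_content TEXT,
    cve_code TEXT, cnnvd_code TEXT, cnvd_code TEXT,
    vuln_name TEXT, vuln_type TEXT, vuln_desc TEXT,
    similarity REAL)""")

# Rank in descending order of similarity, then keep rows above the cutoff.
ranked = sorted(scored, key=lambda row: row[-1], reverse=True)
kept = [row for row in ranked if row[-1] > THRESHOLD]
conn.executemany(
    "INSERT INTO knowledge_base VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)", kept)
conn.commit()

stored = conn.execute(
    "SELECT law_item, cve_code, similarity FROM knowledge_base "
    "ORDER BY similarity DESC").fetchall()
print(stored)
```

The cutoff is kept as a single constant because the text describes it as something chosen by inspecting the ranked results.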
In the above embodiment, preferably, when training the text classifier, the large corpus in the constructed knowledge base is used as samples, the vulnerability name, vulnerability type and vulnerability description are used as input features, and the law and regulation item is used as the output class;
and the vulnerability name, vulnerability type and/or vulnerability description of newly published security vulnerability data are input into the text classifier to obtain the associated law and regulation items.
In the above embodiment, preferably, the method for associating security vulnerabilities with laws and regulations further comprises:
training an LSTM-based text classifier using the knowledge base as training samples;
and inputting new security vulnerability data into the trained text classifier to obtain the laws and regulations associated with the new security vulnerability data. Using the classifier, more law and regulation data can be associated with security vulnerabilities.
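The sketch below shows only how knowledge-base rows might be turned into supervised training samples for such a classifier; the LSTM model itself is not shown. It assumes the class label is the pair (law name, law item), since item numbers alone can repeat across different laws, and it joins the three input fields with spaces. The function name, field names and row values are hypothetical.

```python
def build_training_samples(kb_rows):
    """Turn knowledge-base rows into (input text, class id) pairs."""
    label_to_id = {}
    samples = []
    for row in kb_rows:
        # Input features: vulnerability name, type and description.
        text = " ".join([row["vuln_name"], row["vuln_type"], row["vuln_desc"]])
        # Output class: the associated law item (qualified by the law name).
        label = (row["law_name"], row["law_item"])
        class_id = label_to_id.setdefault(label, len(label_to_id))
        samples.append((text, class_id))
    return samples, label_to_id


# Placeholder rows standing in for records read from the knowledge base.
kb_rows = [
    {"law_name": "示例法", "law_item": "第一条",
     "vuln_name": "示例漏洞A", "vuln_type": "缓冲区错误", "vuln_desc": "示例描述A"},
    {"law_name": "示例法", "law_item": "第一条",
     "vuln_name": "示例漏洞B", "vuln_type": "缓冲区错误", "vuln_desc": "示例描述B"},
    {"law_name": "示例法", "law_item": "第二条",
     "vuln_name": "示例漏洞C", "vuln_type": "权限许可", "vuln_desc": "示例描述C"},
]

samples, label_to_id = build_training_samples(kb_rows)
print(samples)
print(label_to_id)
```

The resulting pairs would still need segmentation and index encoding before being fed to a sequence model.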
As shown in fig. 2, the present invention further provides a system for associating a security hole with a law and regulation, and the method for associating a security hole with a law and regulation disclosed in any one of the above embodiments is applied, and includes:
the data structuring module 1 is used for collecting legal and legal data and security hole data, structuring the legal and legal data and the security hole data and respectively obtaining corresponding structured text data;
the data preprocessing module 2 is used for performing Chinese word segmentation on the legal and legal structured text data and the security vulnerability structured text data, filtering off stop words from word segmentation results by utilizing a preset stop word list, and respectively obtaining entry lists of the two types of data;
a dictionary vector construction module 3 for constructing a dictionary based on the terms in the term list and constructing a document vector for each term based on TF-IDF;
the knowledge base construction module 4 is used for traversing and calculating cosine similarity of each legal and legal vectors and each security vulnerability vector, and constructing a knowledge base by utilizing data with the cosine similarity larger than a preset threshold value to obtain the association relation between the security vulnerabilities and the legal and legal laws.
In the above embodiment, preferably, the system for associating security vulnerabilities with laws and regulations further includes a classifier prediction module 5, configured to train an LSTM-based text classifier using the knowledge base as training samples,
and to input new security vulnerability data into the trained text classifier to obtain the laws and regulations associated with it.
In the system disclosed in the above embodiment, the function of each module corresponds to a step of the association method described above; the system is implemented with reference to that method, so the details are not repeated here.
The implementation of the association method and system disclosed in the above embodiments is described in detail in the following example.
(1) Data collection. Laws and regulations related to network security are downloaded, and vulnerability data is downloaded from the CVE, CNVD and CNNVD websites.
(2) Data preprocessing. The downloaded laws and regulations are text in various forms, but essentially follow a "Chapter X, Section X, Article X" layout. For ease of processing, each document is first converted into a structured form with three columns: law and regulation name, law and regulation item, and law and regulation content. The downloaded vulnerability data is already structured and can be imported into a database, where preprocessing is completed with SQL statements. Because the vulnerability data downloaded from CVE is in English, the CVE_CODE field is used to join the CVE data with the CNVD and CNNVD data; after the join, fields such as CVE_CODE, CNNVD_CODE, CNVD_CODE, Chinese vulnerability name, vulnerability level, vulnerability type and vulnerability description are retained.
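The two preprocessing operations in step (2) can be sketched as follows. This is a minimal illustration and not the patent's implementation: the regular expressions, the function names `structure_law` and `join_vulnerabilities`, the table and column names, the choice of an inner join with CNNVD and a left join with CNVD, and all sample rows (including the made-up identifiers) are assumptions. The standard-library `sqlite3` module stands in for whatever database is actually used.

```python
import re
import sqlite3

# Assumed layout: headings look like "第X章"/"第X节", articles start with "第X条".
NUM = r"[一二三四五六七八九十百千零〇\d]+"
HEADING_RE = re.compile(rf"^第{NUM}[章节]")
ARTICLE_RE = re.compile(rf"^(第{NUM}条)\s*(.*)$")

def structure_law(name, text):
    """Split raw law text into (law name, item, content) rows."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or HEADING_RE.match(line):
            continue  # skip blank lines and chapter/section headings
        m = ARTICLE_RE.match(line)
        if m:
            rows.append([name, m.group(1), m.group(2)])
        elif rows:
            rows[-1][2] += line  # continuation paragraph of the current article
    return [tuple(r) for r in rows]

def join_vulnerabilities(conn):
    """Join the three sources on cve_code (hypothetical schema)."""
    return conn.execute(
        """
        SELECT c.cve_code, n.cnnvd_code, d.cnvd_code,
               n.vuln_name, n.vuln_level, n.vuln_type, n.vuln_desc
        FROM cve AS c
        JOIN cnnvd AS n ON n.cve_code = c.cve_code
        LEFT JOIN cnvd AS d ON d.cve_code = c.cve_code
        ORDER BY c.cve_code
        """
    ).fetchall()

# Made-up sample regulation text.
LAW_TEXT = """第一章 总则
第一条 为了保障网络安全，制定本规定。
第二条 网络运营者应当及时修复安全漏洞。
并向有关部门报告。
第二章 附则
第三条 本规定自发布之日起施行。
"""
law_rows = structure_law("示例规定", LAW_TEXT)

# Made-up vulnerability records with placeholder identifiers.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cve   (cve_code TEXT PRIMARY KEY, description_en TEXT);
    CREATE TABLE cnnvd (cnnvd_code TEXT, cve_code TEXT, vuln_name TEXT,
                        vuln_level TEXT, vuln_type TEXT, vuln_desc TEXT);
    CREATE TABLE cnvd  (cnvd_code TEXT, cve_code TEXT);
    INSERT INTO cve VALUES ('CVE-0000-0001', 'sample overflow'),
                           ('CVE-0000-0002', 'sample leak');
    INSERT INTO cnnvd VALUES
        ('CNNVD-000000-001', 'CVE-0000-0001', '示例缓冲区溢出漏洞', '高危', '缓冲区错误', '示例描述一'),
        ('CNNVD-000000-002', 'CVE-0000-0002', '示例信息泄露漏洞', '中危', '信息泄露', '示例描述二');
    INSERT INTO cnvd VALUES ('CNVD-0000-00001', 'CVE-0000-0001');
""")
vuln_rows = join_vulnerabilities(conn)
```

In this sketch a vulnerability with no CNVD record keeps a `NULL` CNVD code rather than being dropped; whether the original system keeps or discards such rows is not stated in the source.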
(3) Chinese word segmentation. The structured Chinese text data obtained in step (2) must be segmented into words before model processing. Chinese word segmentation splits continuous Chinese text into individual words; common approaches are dictionary-based, rule-based and statistics-based segmentation. Many open-source segmentation tools exist, such as jieba, SnowNLP, HanLP and Stanford CoreNLP. This embodiment selects jieba and segments the "law and regulation name" and "law and regulation content" fields of the laws and regulations data, and the "vulnerability name", "vulnerability type" and "vulnerability description" fields of the vulnerability data.
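To make the dictionary-based approach mentioned in step (3) concrete, here is a minimal forward-maximum-matching segmenter in pure Python. It is a teaching sketch only and is not how the tools named above work internally; the function name `forward_max_match` and the tiny vocabulary are assumptions.

```python
def forward_max_match(text, vocabulary, max_len=None):
    """Greedy dictionary-based segmentation: at each position take the
    longest vocabulary word, falling back to a single character."""
    if max_len is None:
        max_len = max((len(w) for w in vocabulary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocabulary:
                tokens.append(piece)
                i += size
                break
    return tokens

# Hypothetical vocabulary; a real one would hold many thousands of words.
VOCAB = {"缓冲区", "溢出", "漏洞", "远程", "攻击者", "执行", "任意", "代码"}
tokens = forward_max_match("远程攻击者可执行任意代码", VOCAB)
# tokens == ["远程", "攻击者", "可", "执行", "任意", "代码"]
```

Greedy matching is simple but can mis-split ambiguous strings, which is one reason statistics-based tools are usually preferred in practice.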
(4) Stop-word removal. Stop words generally include common function words, prepositions, conjunctions, pronouns and some common high-frequency words, for example the Chinese equivalents of "a", "an", "the", "and" and "as well as". They are filtered out of the segmentation results using a preset stop-word list.
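Filtering against a preset stop-word list, as in step (4), might look like the following. The stop words shown are a hypothetical sample, not the list used in the embodiment, and dropping whitespace-only tokens is an extra assumption of this sketch.

```python
# Hypothetical preset stop-word list; a real list would typically be loaded from a file.
STOP_WORDS = frozenset({"的", "了", "和", "是", "在", "可", "及"})

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep tokens that are neither stop words nor whitespace-only."""
    return [t for t in tokens if t.strip() and t not in stop_words]

entries = remove_stop_words(["远程", "攻击者", "可", "执行", "任意", "代码", " "])
# entries == ["远程", "攻击者", "执行", "任意", "代码"]
```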
(5) Document vector construction. Document vectors are constructed using term frequency-inverse document frequency (TF-IDF).
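A minimal sketch of dictionary construction and TF-IDF weighting follows. The source does not say which TF-IDF variant is used; this example assumes tf = count / document length and idf = ln(N / df) without smoothing, and stores each vector sparsely as an `{index: weight}` mapping. The function names and the sample documents are hypothetical.

```python
import math
from collections import Counter

def build_dictionary(docs):
    """Assign each distinct entry a unique integer index, in first-seen order."""
    dictionary = {}
    for doc in docs:
        for term in doc:
            dictionary.setdefault(term, len(dictionary))
    return dictionary

def tfidf_vectors(docs, dictionary):
    """Return one sparse {index: weight} vector per document."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))  # document frequency
    vectors = []
    for doc in docs:
        vec = {}
        for term, count in Counter(doc).items():
            weight = (count / len(doc)) * math.log(n_docs / df[term])
            if weight:  # entries present in every document get weight 0 and are dropped
                vec[dictionary[term]] = weight
        vectors.append(vec)
    return vectors

# Made-up entry lists: law documents and vulnerability documents share one dictionary.
docs = [["网络", "安全", "漏洞"], ["漏洞", "修复"], ["网络", "运营者"]]
dictionary = build_dictionary(docs)
vectors = tfidf_vectors(docs, dictionary)
```

With the unsmoothed idf assumed here, an entry that appears in every document contributes nothing, so the weighting is only meaningful on a corpus with many documents.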
(6) Similarity calculation. The document vectors constructed in step (5) are traversed, and the similarity between each vulnerability document vector and each laws and regulations document vector is calculated using an optimized cosine similarity algorithm.
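The source does not detail the optimization in step (6); the claims mention length normalization and sparse storage, so the sketch below assumes that reading. Each vector is L2-normalized once, after which cosine similarity reduces to a sparse dot product. Plain dicts stand in for a sparse matrix, and the function names and sample vectors are hypothetical.

```python
import math

def l2_normalize(vec):
    """Scale a sparse {index: weight} vector to unit length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {i: w / norm for i, w in vec.items()} if norm else {}

def sparse_dot(a, b):
    """Dot product that iterates over the shorter vector only."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b[i] for i, w in a.items() if i in b)

def associate(law_vecs, vuln_vecs, threshold):
    """Return (vuln index, law index, similarity) for pairs above the threshold."""
    laws = [l2_normalize(v) for v in law_vecs]
    vulns = [l2_normalize(v) for v in vuln_vecs]
    pairs = []
    for vi, v in enumerate(vulns):
        for li, law in enumerate(laws):
            sim = sparse_dot(v, law)  # equals cosine similarity for unit vectors
            if sim > threshold:
                pairs.append((vi, li, sim))
    return pairs

# Made-up vectors: vulnerability 0 points in the same direction as law 0.
law_vecs = [{0: 1.0, 1: 1.0}, {2: 3.0}]
vuln_vecs = [{0: 2.0, 1: 2.0}, {1: 1.0, 2: 1.0}]
pairs = associate(law_vecs, vuln_vecs, threshold=0.8)
```

Normalizing once up front avoids recomputing vector norms for every pair, which matters when every vulnerability is compared against every law item.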
(7) Knowledge base construction. Step (6) yields a data set of similarities between laws and regulations and vulnerabilities. The 1.08 million records with similarity greater than 0.8 are selected and written to a MySQL database. The table fields are: CVE_CODE, CNNVD_CODE, CNVD_CODE, vulnerability name, vulnerability type, vulnerability description, law and regulation name, law and regulation item, law and regulation content, and similarity.
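Writing the filtered pairs to a table could be sketched as below. The embodiment uses MySQL; this example substitutes the standard-library `sqlite3` module so that it runs without a server. The table name `knowledge_base`, the column names and the sample records are assumptions.

```python
import sqlite3

# Column names mirror the fields listed in step (7); the spelling is an assumption.
FIELDS = ("cve_code", "cnnvd_code", "cnvd_code", "vuln_name", "vuln_type",
          "vuln_desc", "law_name", "law_item", "law_content", "similarity")

def build_knowledge_base(conn, candidates, threshold=0.8):
    """Store candidate pairs whose similarity exceeds the threshold."""
    columns = ", ".join(
        f"{f} {'REAL' if f == 'similarity' else 'TEXT'}" for f in FIELDS
    )
    conn.execute(f"CREATE TABLE IF NOT EXISTS knowledge_base ({columns})")
    kept = [row for row in candidates if row["similarity"] > threshold]
    placeholders = ", ".join(f":{f}" for f in FIELDS)
    conn.executemany(f"INSERT INTO knowledge_base VALUES ({placeholders})", kept)
    conn.commit()
    return len(kept)

# Made-up candidate pairs with placeholder identifiers.
base = dict.fromkeys(FIELDS, "")
candidates = [
    {**base, "cve_code": "CVE-0000-0001", "law_item": "第二条", "similarity": 0.93},
    {**base, "cve_code": "CVE-0000-0002", "law_item": "第三条", "similarity": 0.41},
]
conn = sqlite3.connect(":memory:")
stored = build_knowledge_base(conn, candidates)
rows = conn.execute("SELECT cve_code, law_item, similarity FROM knowledge_base").fetchall()
```

Only the first candidate passes the 0.8 threshold, so the table ends up with a single row.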
(8) Classification model training. Step (7) yields about 1 million knowledge base records, which are used as training samples to train an LSTM text classifier, with the law and regulation item as the classification label and the vulnerability name, vulnerability type, vulnerability description, etc. as the features.
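The LSTM itself would normally come from a deep learning framework, which the source does not name, so the sketch below covers only the data preparation such a classifier typically needs: turning knowledge base records into fixed-length sequences of entry indexes plus integer class labels. The padding length, the `<pad>`/`<unk>` conventions, the use of the (law name, law item) pair as the label key, and all sample records are assumptions.

```python
def build_training_samples(kb_rows, max_len=8):
    """Convert knowledge base rows into (index sequence, label id) pairs."""
    vocab = {"<pad>": 0, "<unk>": 1}
    labels = {}
    samples = []
    for row in kb_rows:
        # Features: segmented vulnerability name, type and description, concatenated.
        tokens = row["vuln_name"] + row["vuln_type"] + row["vuln_desc"]
        ids = [vocab.setdefault(t, len(vocab)) for t in tokens][:max_len]
        ids += [vocab["<pad>"]] * (max_len - len(ids))
        # Label: item numbers repeat across laws, so the law name is part of the key.
        label = labels.setdefault((row["law_name"], row["law_item"]), len(labels))
        samples.append((ids, label))
    return samples, vocab, labels

def encode(tokens, vocab, max_len=8):
    """Encode a new vulnerability for prediction; unseen entries map to <unk>."""
    ids = [vocab.get(t, vocab["<unk>"]) for t in tokens][:max_len]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))

# Made-up, already segmented records; the law name and items are placeholders.
kb_rows = [
    {"vuln_name": ["缓冲区", "溢出"], "vuln_type": ["溢出"],
     "vuln_desc": ["远程", "执行", "代码"], "law_name": "示例规定", "law_item": "第二条"},
    {"vuln_name": ["信息", "泄露"], "vuln_type": ["泄露"],
     "vuln_desc": ["敏感", "信息"], "law_name": "示例规定", "law_item": "第三条"},
]
samples, vocab, labels = build_training_samples(kb_rows)
new_input = encode(["远程", "未知词"], vocab)
```

The resulting index sequences would be fed to an embedding layer followed by an LSTM and a softmax over the label ids; that part is omitted here because it depends on the chosen framework.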
The above is only a preferred embodiment of the present invention and is not intended to limit it; those skilled in the art may make various modifications and variations. Any modification, equivalent replacement or improvement made within the spirit and principle of the present invention shall fall within its scope of protection.

Claims (10)

1. A method for associating security vulnerabilities with laws and regulations, comprising:
collecting laws and regulations data and security vulnerability data, and structuring both to obtain the corresponding structured text data;
performing Chinese word segmentation on the structured laws and regulations text data and the structured security vulnerability text data, and filtering stop words out of the segmentation results using a preset stop-word list to obtain an entry list for each of the two types of data;
constructing a dictionary from the entries in the entry lists, and constructing a document vector for each entry based on TF-IDF;
and traversing and calculating the cosine similarity between each laws and regulations vector and each security vulnerability vector, and constructing a knowledge base from the data whose cosine similarity is greater than a preset threshold, thereby obtaining the association between security vulnerabilities and laws and regulations.
2. The method for associating security vulnerabilities with laws and regulations of claim 1, further comprising:
training an LSTM-based text classifier using the knowledge base as training samples;
and inputting new security vulnerability data into the trained text classifier to obtain the laws and regulations associated with the new security vulnerability data.
3. The method for associating security vulnerabilities with laws and regulations of claim 1, wherein structuring the laws and regulations data and the security vulnerability data comprises:
processing the downloaded laws and regulations data, which is in text form, into a structured list form consisting of law and regulation name, law and regulation item, and law and regulation content;
and, for security vulnerability data downloaded from different channels, joining the data of the different channels to obtain a structured Chinese description text for each security vulnerability.
4. The method for associating security vulnerabilities with laws and regulations of claim 3, wherein performing Chinese word segmentation on the structured laws and regulations text data and the structured security vulnerability text data comprises:
merging the law and regulation name and law and regulation content columns of the structured laws and regulations text data, and performing Chinese word segmentation on the merged field using a word segmentation algorithm;
and merging the vulnerability name, vulnerability type and vulnerability description of the vulnerability data, converting them into list form, and performing Chinese word segmentation on the fields in the list using the same word segmentation algorithm.
5. The method for associating security vulnerabilities with laws and regulations of claim 4, wherein constructing a dictionary from the entries in the entry lists and constructing a document vector for each entry based on TF-IDF comprises:
merging the entry lists of the two types of data and assigning a unique index or number to each entry to construct the dictionary, wherein the dictionary establishes a mapping between entries and indexes and is stored as a dictionary or hash table;
and constructing document vectors, based on term frequency-inverse document frequency (TF-IDF), using the mapping between entries and indexes in the dictionary.
6. The method for associating security vulnerabilities with laws and regulations of claim 5, wherein traversing and calculating the cosine similarity between each laws and regulations vector and each security vulnerability vector comprises:
performing length normalization on each laws and regulations vector and each security vulnerability vector, and storing the vectors in compressed form using a sparse matrix;
and traversing and calculating the cosine similarity between each laws and regulations vector and each security vulnerability vector.
7. The method for associating security vulnerabilities with laws and regulations of claim 6, wherein constructing a knowledge base from the data whose cosine similarity is greater than a preset threshold comprises:
storing the laws and regulations data and security vulnerability data whose cosine similarity is greater than the preset threshold in a database to form the knowledge base;
wherein the fields of the knowledge base comprise the law and regulation name, law and regulation item and law and regulation content of the laws and regulations data, the vulnerability name, vulnerability type and vulnerability description of the security vulnerability data from the different channels, and the similarity between the laws and regulations data and the security vulnerability data;
and wherein the database is MySQL, Oracle, Hive or HBase.
8. The method for associating security vulnerabilities with laws and regulations of claim 2, wherein, in training the text classifier, the vulnerability name, vulnerability type and vulnerability description are used as input features and the law and regulation item is used as the output class;
and the vulnerability name, vulnerability type and/or vulnerability description of newly published security vulnerability data is input into the text classifier to obtain the associated law and regulation items.
9. A system for associating security vulnerabilities with laws and regulations, applying the method for associating security vulnerabilities with laws and regulations of any one of claims 1 to 8, comprising:
a data structuring module, configured to collect laws and regulations data and security vulnerability data, and to structure both to obtain the corresponding structured text data;
a data preprocessing module, configured to perform Chinese word segmentation on the structured laws and regulations text data and the structured security vulnerability text data, and to filter stop words out of the segmentation results using a preset stop-word list, obtaining an entry list for each of the two types of data;
a dictionary vector construction module, configured to construct a dictionary from the entries in the entry lists and to construct a document vector for each entry based on TF-IDF;
and a knowledge base construction module, configured to traverse and calculate the cosine similarity between each laws and regulations vector and each security vulnerability vector, and to construct a knowledge base from the data whose cosine similarity is greater than a preset threshold, thereby obtaining the association between security vulnerabilities and laws and regulations.
10. The system for associating security vulnerabilities with laws and regulations of claim 9, further comprising a classifier prediction module, configured to train an LSTM-based text classifier using the knowledge base as training samples,
and to input new security vulnerability data into the trained text classifier to obtain the laws and regulations associated with the new security vulnerability data.
CN202310919095.6A 2023-07-26 2023-07-26 Association method and association system for security vulnerabilities and laws and regulations Pending CN116662576A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310919095.6A CN116662576A (en) 2023-07-26 2023-07-26 Association method and association system for security vulnerabilities and laws and regulations

Publications (1)

Publication Number Publication Date
CN116662576A true CN116662576A (en) 2023-08-29

Family

ID=87720850

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310919095.6A Pending CN116662576A (en) 2023-07-26 2023-07-26 Association method and association system for security vulnerabilities and laws and regulations

Country Status (1)

Country Link
CN (1) CN116662576A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090327243A1 (en) * 2008-06-27 2009-12-31 Cbs Interactive, Inc. Personalization engine for classifying unstructured documents
CN109902159A (en) * 2019-01-29 2019-06-18 华融融通(北京)科技有限公司 A kind of intelligent O&M statement similarity matching process based on natural language processing
CN111444353A (en) * 2020-04-03 2020-07-24 杭州叙简科技股份有限公司 Construction and use method of warning situation knowledge graph
CN113420127A (en) * 2021-07-06 2021-09-21 北京信安天途科技有限公司 Threat information processing method, device, computing equipment and storage medium
CN113656807A (en) * 2021-08-23 2021-11-16 杭州安恒信息技术股份有限公司 Vulnerability management method, device, equipment and storage medium
CN115563619A (en) * 2022-09-27 2023-01-03 北京墨云科技有限公司 Vulnerability similarity comparison method and system based on text pre-training model
CN116244448A (en) * 2023-02-24 2023-06-09 中国电子科技集团公司第十研究所 Knowledge graph construction method, device and system based on multi-source data information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination