SE2050302A1 - A method for linking a cve with at least one synthetic cpe - Google Patents

A method for linking a cve with at least one synthetic cpe

Info

Publication number
SE2050302A1
Authority
SE
Sweden
Prior art keywords
cpe
cve
word
synthetic
version
Prior art date
Application number
SE2050302A
Inventor
Emil Wåreus
Original Assignee
Debricked Ab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Debricked Ab filed Critical Debricked Ab
Priority to SE2050302A priority Critical patent/SE2050302A1/en
Priority to PCT/EP2021/053487 priority patent/WO2021160822A1/en
Priority to US17/759,968 priority patent/US20230075290A1/en
Publication of SE2050302A1 publication Critical patent/SE2050302A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/57 Certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities
    • G06F21/577 Assessing vulnerabilities and evaluating computer system security
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning


Abstract

A method (300) for linking a common vulnerability and exposure, CVE, (106) with at least one synthetic common platform enumeration, CPE, (112) wherein the CVE (106) comprises a summary of a vulnerability, is disclosed. The method (300) comprising: receiving (S302) the summary of the CVE (106) from a vulnerability database, VD, (104); extracting (S304) information from the summary of the CVE (106) using a Natural Language Processing, NLP, model; building (S306) at least one synthetic CPE (112) based on the extracted information; and linking (S308) the CVE (106) with the at least one synthetic CPE (112).

Description

A METHOD FOR LINKING A CVE WITH AT LEAST ONE SYNTHETIC CPE

Technical field
The disclosure relates to software development and IT security in general. More particularly, it relates to a method for linking a common vulnerability and exposure, CVE, with at least one synthetic common platform enumeration, CPE, by using a trained NLP model.
Background
In almost all software development today, using open source and third-party components is crucial for success. It is beneficial to the quality, security, functionality, and development efficiency. However, at the same time, it increases the exposure to vulnerabilities in code developed by third parties. To maintain control over the security of the developed software, the maintainers need to continuously monitor whether vulnerabilities have been introduced or found in these third-party dependencies. This is commonly done with Dependency Vulnerability Management (DVM) tools that automate the process of Software Composition Analysis (SCA) and match used software components with known vulnerabilities.
A major source of vulnerabilities is the National Vulnerability Database (NVD) [15]. These vulnerabilities have a unique Common Vulnerabilities and Exposures (CVE) identifier. The list of such identifiers is maintained by Mitre and includes a short summary of the vulnerability. In the last few years, around 30-50 new vulnerabilities have been given a CVE identifier and been recorded in NVD each day. Unfortunately, far from all CVEs maintained in the NVD database are correctly linked to CPEs. Moreover, as reported in [4], there is a notable time lag from the first CVE disclosure to the linking of CPEs to the vulnerability. In 2018, the median time to correctly assign the CPE metadata was 35 days.
Summary
National Institute of Standards and Technology (NIST) security professionals take the CVEs as they are published by Mitre and link one or more Common Platform Enumerations (CPE) [14] to each CVE. These CPEs are used to specify which software and versions are vulnerable. NIST also adds other pieces of information, such as a Common Vulnerability Scoring System (CVSS) score. While the summary, as recorded in the original identifier provided by Mitre, often includes information regarding which product and versions are affected, the list of CPEs formalizes this information and provides it in a standardized, machine-readable format. Thus, the CPE can be a crucial addition to the CVE information when vulnerability identification and assessment are being automated.
It is an object of the invention to at least partly overcome one or more of the above-identified limitations of the prior art. In particular, it is an object to provide methods and systems for linking a common vulnerability and exposure, CVE, with at least one synthetic common platform enumeration, CPE, by using a trained NLP model.
According to a first aspect it is provided a method for linking a common vulnerability and exposure, CVE, with at least one synthetic common platform enumeration, CPE, wherein the CVE comprises a summary of a vulnerability, the method comprising: receiving the summary of the CVE from a vulnerability database, VD; extracting information from the summary of the CVE using a Natural Language Processing, NLP, model; building at least one synthetic CPE based on the extracted information; and linking the CVE with the at least one synthetic CPE.
The CVE should be interpreted as a uniquely identifiable vulnerability and information about this vulnerability.
The CPE should be interpreted as an identifier and metadata of a software. The synthetic CPE should be interpreted as an identifier in the same format as the CPE, but built by means of this invention instead of by NIST security professionals.
The vulnerability database should be interpreted as a database comprising vulnerabilities, wherein each vulnerability has been assigned a CVE. As a non-limiting example, the vulnerability database can be a National Vulnerability Database, NVD. The NVD is the database used in the discussion disclosed in the detailed description. However, the database may be any database comprising vulnerabilities assigned with CVEs.
An advantage of using the disclosed method is that it is possible to build more complete information about products, vendors and versions affected by a vulnerability in an efficient way. In addition, it is possible to link CVEs, received from the vulnerability database, to the at least one synthetic CPE automatically, instead of having to wait for the NIST security professionals to link them. This is an advantage since, during the time between when a CVE is made public and when it is linked with one or more CPEs, software may be vulnerable to attacks. The extracted information may comprise a vendor and/or product name and/or a product version affected by the vulnerability.
The step of extracting information from the summary of the CVE may comprise: adding a label for each word in the summary, wherein the label is selected from a CPE relevant group comprising vendor, product, version, first excluded version, first included version, last excluded version, last included version, or a non-CPE relevant group comprising none-labels; and extracting words with labels from the CPE relevant group.
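The labeling-and-extraction step above can be sketched as follows. This is an illustrative simplification, not the disclosure's implementation: `extract_relevant` is a hypothetical helper, and `tagged` stands in for the output of a sequence-labeling model, using label names that mirror the NVD version fields discussed in the detailed description.

```python
# Labels in the CPE relevant group; everything else (the 'O' none-label)
# belongs to the non-CPE relevant group and is discarded.
CPE_RELEVANT = {
    "vendor", "product", "version",
    "versionStartExcluding", "versionStartIncluding",
    "versionEndExcluding", "versionEndIncluding",
}

def extract_relevant(tagged_words):
    """Keep only (word, label) pairs whose label is CPE relevant."""
    return [(w, lab) for w, lab in tagged_words if lab in CPE_RELEVANT]

# Hypothetical model output for part of a CVE summary.
tagged = [("GNU", "vendor"), ("Bash", "product"),
          ("through", "O"), ("4.3", "versionEndIncluding")]
print(extract_relevant(tagged))
```

The none-labeled word "through" is dropped, while vendor, product and version-range words survive for CPE construction.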
According to one non-limiting example, the first excluded version may be a versionStartExcluding, as will be further discussed in section “Data and Labels” in the detailed description. Thus, the first excluded version may indicate all vulnerable versions after (excluding) a specific version.
According to one non-limiting example, the first included version may be a versionStartIncluding, as will be further discussed in section “Data and Labels” in the detailed description. Thus, the first included version may indicate all vulnerable versions after (including) the specific version.
According to one non-limiting example, the last excluded version may be a versionEndExcluding, as will be further discussed in section “Data and Labels” in the detailed description. Thus, the last excluded version may indicate all vulnerable versions before (excluding) this version.
According to one non-limiting example, the last included version may be a versionEndIncluding, as will be further discussed in section “Data and Labels” in the detailed description. Thus, the last included version may indicate all vulnerable versions before (including) this version.
The label selected from the CPE relevant group indicates words that may be relevant for extracting information. Thus, these words may comprise relevant information relating to the vendor and/or product and/or the version that may be affected by the vulnerability.
The none-label in the non-CPE relevant group indicates words that may not be of relevance for extracting information. Thus, these words may comprise information that does not indicate anything about the vendor and/or product and/or version that may be affected by the vulnerability.
Thus, adding a label to each word in the summary, indicating whether the word is relevant for information extraction or not, provides an efficient way of dividing the words of the summary into different groups and hence extracting the relevant words.
The labels from the CPE relevant group may further be labeled as B-label or I-label. The B-label may denote a labeled word to be the beginning of a word combination. The I-label may denote the labeled word to be placed after the beginning in the word combination.
Word combination should be understood as a combination of words, wherein the combination comprises two or more words. Alternatively, or in combination, a word combination may comprise only one word. As a non-limiting example, the word combination may be a product name, wherein the product name comprises two or more words.
The step of extracting information from the summary of the CVE may further comprise: feeding each word in the summary of the CVE into a feature engineering, wherein the feature engineering comprises Word Level Embeddings, wherein the Word Level Embeddings are configured to transform each word in the summary into a numerical vector, and Character Level Embeddings, wherein the Character Level Embeddings are configured to extract character level features for each word in the summary; forming an input by combining outputs of the Word Level Embeddings and the Character Level Embeddings; feeding the input into a neural network comprising a recurrent Bidirectional Long Short-term Memory (BLSTM) network and a Conditional Random Field (CRF) layer; and determining a set of labeled words from output of the neural network.
Each word in the summary may simultaneously be fed into the feature engineering. Alternatively, or in combination, each word in the summary may be fed one by one into the feature engineering. Alternatively, or in combination, two or more words may be fed into the feature engineering at the same time.
Each word in the summary may be fed simultaneously into the Word Level Embeddings and the Character Level Embeddings.
Outputs from the BLSTM may be the probabilities of the different labels belonging to a word. The CRF may be used for considering adjacent words and their labels in the determination of labels.
The set of labeled words may comprise one word. Alternatively, or incombination, the set of labeled words may comprise two or more words.
The feature engineering may further comprise Word Level Case Features and/or a Word Level Lexicon. The Word Level Case Features may be configured to find word-properties in the summary. The Word Level Lexicon may be configured to find features based on domain knowledge.
The step of forming the input may further comprise combining outputs of the Word Level Case Features and the Word Level Lexicon. The outputs of the Word Level Case Features and the Word Level Lexicon may be combined with the outputs of the Word Level Embeddings and the Character Level Embeddings in order to form the input.
The Word Level Lexicon may be constructed from a set of CVEs from the VD, comprising known products, vendors, and combined product-and-vendor names.
The step of building the at least one synthetic CPE based on the extracted information may further comprise combining the extracted information into a predetermined CPE format.
The predetermined CPE format should be understood as being a format where the extracted information may be added, maintaining a machine-readable format.
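As a minimal sketch of combining extracted information into a machine-readable format, the following assumes the CPE 2.3 string layout described in the detailed description, with `*` wildcards for unspecified fields; `build_synthetic_cpe` is a hypothetical helper name, not terminology from the disclosure.

```python
def build_synthetic_cpe(vendor, product, version, part="a"):
    """Assemble a CPE 2.3 formatted string from extracted information.

    Fields after `version` (update, edition, language, sw_edition,
    target_sw, target_hw, other) are left as '*' wildcards.
    """
    fields = [part, vendor.lower(), product.lower().replace(" ", "_"),
              version, "*", "*", "*", "*", "*", "*", "*"]
    return "cpe:2.3:" + ":".join(fields)

print(build_synthetic_cpe("GNU", "Bash", "4.3"))
# cpe:2.3:a:gnu:bash:4.3:*:*:*:*:*:*:*
```

Lowercasing and underscore substitution are illustrative normalization choices; real CPE naming has additional escaping rules not shown here.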
According to a second aspect, a method for building a database of a plurality of CVEs linked with at least one synthetic CPE is provided. The method comprises the steps of: linking each CVE of the plurality of CVEs to at least one synthetic CPE according to the method of the first aspect; and storing each CVE of the plurality of CVEs linked with at least one synthetic CPE in the database.
According to one non-limiting example, the database may be a synthetic CPE database as will be discussed in connection with Fig. 16.
The synthetic CPEs of the database may be compared to a file CPE in order to find vulnerabilities in software. The file CPE may comprise vendor, product and version in imported software.
According to a third aspect it is provided a method for training an NLP model, wherein the NLP model is configured to be used for linking a common vulnerability and exposure, CVE, with at least one synthetic common platform enumeration, CPE, the method comprising: forming a data set, wherein the data set comprises CVEs with linked CPEs received from a vulnerability database, VD; dividing the data set into a training set and a validation set; fitting parameters of the model by applying the model to CVEs with already linked CPEs in the training set; and optimizing, which may also be referred to as validating, the NLP model using the CVEs in the validation set.
According to a fourth aspect it is provided a server configured to link a common vulnerability and exposure, CVE, with at least one synthetic common platform enumeration, CPE, wherein the CVE comprises a summary of a vulnerability, the server comprising: a transceiver configured to: receive the summary of the CVE from a vulnerability database, VD; a control circuit configured to execute: an extracting function configured to extract information from the summary of the CVE using an NLP model; a building function configured to build at least one synthetic CPE based on the extracted information; and a linking function configured to link the CVE with the at least one synthetic CPE.
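The data-set division in the training method can be sketched as a simple shuffled hold-out split. The helper name `holdout_split` and the split fractions are illustrative assumptions, not values prescribed by the disclosure; the test portion corresponds to the hold-out strategy described later in the evaluation section.

```python
import random

def holdout_split(dataset, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle CVEs with known CPE links, then split into
    training, validation and test portions."""
    rng = random.Random(seed)  # fixed seed for reproducible splits
    data = list(dataset)
    rng.shuffle(data)
    n_train = int(len(data) * train_frac)
    n_val = int(len(data) * val_frac)
    return (data[:n_train],
            data[n_train:n_train + n_val],
            data[n_train + n_val:])

train, val, test = holdout_split(range(100))
print(len(train), len(val), len(test))  # 80 10 10
```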
The extracted information may comprise a product and/or a version and/or vendor affected by the vulnerability.
Still other objectives, features, aspects and advantages of the invention will appear from the following detailed description as well as from the drawings. The same features and advantages described with respect to one aspect are applicable to the other aspects unless explicitly stated otherwise.
Brief description of the drawings
Embodiments of the invention will now be described, by way of example, with reference to the accompanying schematic drawings, in which
Fig. 1 is an overview of the relationship between a CVE and multiple CPEs.
Fig. 2 is an example of a labeled sentence.
Fig. 3 is an overview of the model architecture and data pipeline.
Fig. 4 is a table of the different case-features with their properties.
Fig. 5 is a table of the number of entries in the security lexicon.
Fig. 6 illustrates the accumulated mentions of products over the number of mentions of a product. The X-axis denotes the number of mentions of an individual CPE-product and the Y-axis denotes the number of accumulated mentions of products with X mentions. The mean of the distribution is 4.69 mentions per product and the median is one mention per product.
Fig. 7 illustrates a Long Short-Term Memory cell. The input gate, output gate, forget gate, and cell state are marked in dotted lines.
Fig. 8 is a table of the hyperparameter search space and the parameters used for the best result.
Fig. 9 is a table of the results of the four training cases.
Fig. 10 is a table of the granular test results from the model with case-features and without lexicon. Scores are given for each possible label for the model. Label Count describes how many instances of that particular label are present in the test set, and Prediction Count describes how many predictions the model produces for a particular label.
Fig. 11 is a scatter plot over Label Count and F1-score for each class (excluding 'O'). This plot indicates that there seems to be a minimum number of examples in each class, at approximately 300, to achieve an F1-score above 0.8.
Fig. 12 illustrates Precision, F-measure, and Recall over each possible class for the model with case-features and without lexicon-features.
Fig. 13 illustrates Label and Prediction count for each class in the test dataset. Note that the 'O'-label is removed for this visualization.
Fig. 14 shows the distribution of the number of misclassifications in a sequence over all misclassifications.
Fig. 15 shows common misclassifications made by the system. This explains about 90% of the error.
Fig. 16 illustrates a system for identifying vulnerabilities in software.
Fig. 17 illustrates a server configured to link a common vulnerability and exposure, CVE, with at least one synthetic common platform enumeration, CPE.
Fig. 18 is a flowchart illustrating a method for linking a common vulnerability and exposure, CVE, with at least one synthetic common platform enumeration, CPE.
Fig. 19 is a flowchart illustrating a method for building a database of a plurality of CVEs linked with at least one synthetic CPE.
Fig. 20 is a flowchart illustrating a method for training an NLP model.
Detailed description

Vulnerability Data
A new vulnerability is often reported as a common vulnerability and exposure, CVE. A list of CVEs is maintained by Mitre and each entry may comprise a unique CVE number, a short summary, and at least one external reference [20]. The CVE summary typically includes the affected product and versions. An example of the ShellShock CVE-2014-6271 is given below.
GNU Bash through 4.3 processes trailing strings after function definitions in the values of environment variables, which allows remote attackers to execute arbitrary code via a crafted environment, as demonstrated by vectors involving the ForceCommand feature in OpenSSH sshd, the mod_cgi and mod_cgid modules in the Apache HTTP Server, scripts executed by unspecified DHCP clients, and other situations in which setting the environment occurs across a privilege boundary from Bash execution, aka “ShellShock”.
This information is then used by NVD, adding, among other things, a Common Vulnerability Scoring System, CVSS, score, and a list of common platform enumerations, CPEs. The CVSS score provided by the National Institute of Standards and Technology, NIST, is environment independent, but useful when assessing the severity of the vulnerability. The CPE may provide a standardized string for defining which product and versions are affected by the vulnerability.
The current version of CPE is 2.3. The format is specified in [14], and is given by the string
cpe:2.3:part:vendor:product:version:update:edition:language:sw_edition:target_sw:target_hw:other
The first part defines that it is a CPE and its version. Then, part can be one of h for hardware, a for application and o for operating system. The following fields are used to uniquely specify the component by, as non-limiting examples, defining the vendor, the name of the product, and the product version. It is common to use the fields up to and including version, even though, as can be seen, further details about the component can be defined. A non-limiting example can be found in CVE-2014-6271.
NVD may also provide a JSON feed with CVE data for each vulnerability. This feed supports additional fields for defining ranges of versions that are vulnerable. This feed provides a more efficient representation if there are many versions affected. The feed is further detailed in the section Data and Labels.
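The format string above can be split into its named fields; the following is a minimal sketch where `parse_cpe` is an illustrative helper name and the field names are taken directly from the format string (full CPE 2.3 parsing also handles escaped colons, which this sketch ignores).

```python
# Field names in order, as given by the CPE 2.3 format string.
CPE_FIELDS = ["part", "vendor", "product", "version", "update", "edition",
              "language", "sw_edition", "target_sw", "target_hw", "other"]

def parse_cpe(cpe_string):
    """Split a CPE 2.3 formatted string into a dict of named fields."""
    prefix, cpe_version, *rest = cpe_string.split(":")
    assert prefix == "cpe" and cpe_version == "2.3", "not a CPE 2.3 string"
    return dict(zip(CPE_FIELDS, rest))

info = parse_cpe("cpe:2.3:a:gnu:bash:4.3:*:*:*:*:*:*:*")
print(info["part"], info["vendor"], info["product"], info["version"])
# a gnu bash 4.3
```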
NVD comprises around 130 000 vulnerabilities (early 2020). The summary is given immediately when the CVE is published since it is required by Mitre, while the CPE is later added by NVD. The discrepancy differs between different CVEs, but an analysis in [4] reported that, in 2018, the median time to correctly assign CPE data was 35 days.
Natural Language Processing and Named Entity Recognition
Natural Language Processing (NLP) is the task of making computers understand linguistics, usually with the support of machine learning. Within NLP, tasks such as machine translation, document classification, question answering systems, automatic summary generation, and speech recognition are common [10]. One advantage of using machine learning for NLP is that the algorithms may gain a contextual semantic understanding of text where classifications are not dependent on a single word, but rather on a complex sequence of words that can completely alter the meaning of the document. This may be beneficial to our system, as synthetic CPEs that have not been seen before in the NVD database may be correctly classified from the CVE summary through a contextual understanding of the document.
Named Entity Recognition (NER), or sequence labeling, is the NLP task of classifying each word in a sequence. One of the most common benchmarks in NER is the CoNLL-2003 dataset [21], where the task is to label words with either person-, organization-, or location-names. NER is an important task within NLP, as a system needs to understand what category a word or sub-sequence belongs to in order to truly understand the contextual meaning of the document.
Data and Labels
To successfully create machine learning models, it is necessary to collect data to train them. The goal for the model is to learn the general underlying structure of the problem through training on that data, which acts as a representation of that problem. This data is referred to as the dataset. Our dataset consists of historical vulnerabilities with already determined CPEs. These can be retrieved using the NVD data feed. Each entry in the dataset has the following features:
- cveId: The unique identifier and name for each CVE.
- updatedAt: The date and time of the last update from NVD for this particular CVE.
- summary: A text description of the CVE, often naming the vulnerable software, including product, vendor, and version.
- cweName: The Common Weakness Enumerator.
- cpes: A list of CPEs linked to this particular CVE. Each CPE contains:
  o vendor: The vendor of the product or software.
  o product: Name of the software.
  o version: An exact statement of a single vulnerable version.
  o versionStartExcluding: All versions are vulnerable after (excluding) this version.
  o versionStartIncluding: All versions are vulnerable after (including) this version.
  o versionEndExcluding: All versions are vulnerable before (excluding) this version.
  o versionEndIncluding: All versions are vulnerable before (including) this version.
The analysis concludes that 81.9% of all CPEs from CVEs in NVD only specify one of the following fields: version, versionStartExcluding, versionStartIncluding, versionEndExcluding, and versionEndIncluding. About 14.5% have no version range specified, and 3.6% have exactly two version ranges specified. Fig. 1 illustrates how a CVE-CPE link can be structured.
As seen in Fig. 1, some of the product and vendor strings can be found in the summary. The version can also be found in the summary but is dependent on the context of the summary to determine if other versions are vulnerable (in this case all versions before version 1.16.3). In this disclosure, only the summary is regarded as input features, the CPE-list as the labels, and all other data is disregarded in the model. Naturally, all CPEs may not be possible to link to the summary through text models, as there is no occurrence of the product or vendor in the paragraph. In the analysis, about 59% of CPEs can be mapped with regex methods to their CVE summary, and for 27% of the CVEs, all corresponding CPEs can be mapped to the summary. This is shown in Fig. 1, as Oracle Solaris is not mentioned in the paragraph, but is considered vulnerable from the context that X.Org xorg-server is vulnerable.
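The regex-based mapping mentioned above can be approximated as in the sketch below. This is a deliberate simplification under stated assumptions: the disclosure does not specify its exact patterns, `cpe_in_summary` is a hypothetical helper, and only a case-insensitive substring match (with CPE underscores treated as spaces) is attempted.

```python
import re

def cpe_in_summary(summary, vendor, product):
    """Return True if the CPE's vendor or product string occurs
    in the CVE summary text."""
    text = summary.lower()
    def found(field):
        # CPE fields use '_' where summaries typically use spaces.
        return re.search(re.escape(field.replace("_", " ")), text)
    return bool(found(vendor) or found(product))

summary = ("GNU Bash through 4.3 processes trailing strings after "
           "function definitions in the values of environment variables ...")
print(cpe_in_summary(summary, "gnu", "bash"))        # True
print(cpe_in_summary(summary, "oracle", "solaris"))  # False
```

The second call mirrors the Oracle Solaris case above: a CPE that is linked to the CVE but cannot be mapped to the summary by text matching alone.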
A sequence word labeling model requires a label for each word in the sentence. There are eight labels to consider in the CPEs provided by NVD: vendor, product, version, versionStartExcluding, versionStartIncluding, versionEndExcluding, versionEndIncluding, and O (which denotes the none-label). Some vendors or products consist of multiple words, which need to be accurately predicted by the model. To denote this, labels are split into B- and I-labels, where B denotes the start of a label, and I denotes a word following the previous B- or I-labeled words. A part of an example sentence, taken from the CVE summary in Fig. 1, can be seen in Fig. 2.
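Decoding the B-/I-scheme back into multi-word entities can be sketched as follows; `decode_bio` is a hypothetical helper, and the `tagged` example is illustrative rather than taken from Fig. 2.

```python
def decode_bio(tagged_words):
    """Merge B-/I-tagged words into (entity_text, label) spans."""
    spans, current_words, current_label = [], [], None
    for word, tag in tagged_words:
        if tag.startswith("B-"):
            if current_words:  # close the previous span
                spans.append((" ".join(current_words), current_label))
            current_words, current_label = [word], tag[2:]
        elif tag.startswith("I-") and current_label == tag[2:]:
            current_words.append(word)  # continue the open span
        else:  # 'O' or an inconsistent I-tag closes the current span
            if current_words:
                spans.append((" ".join(current_words), current_label))
            current_words, current_label = [], None
    if current_words:
        spans.append((" ".join(current_words), current_label))
    return spans

tagged = [("Apache", "B-product"), ("HTTP", "I-product"),
          ("Server", "I-product"), ("before", "O"), ("2.4", "B-version")]
print(decode_bio(tagged))
# [('Apache HTTP Server', 'product'), ('2.4', 'version')]
```

A three-word product name is recovered as one entity, which is exactly why the B-/I-split is needed.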
Problem Statement and Evaluation
The high-level problem that may be solved is one of determining what software and what versions are described in a document. This could be limited to mapping each document to already existing CPEs in the available CPE-list [16]. However, this is not done in this disclosure because the available CPE-list is deficient, as it is lacking entries for many products. Analyzing all available CPEs mentioned in CVEs, about 60% of those are only mentioned once. Thus, the probability of a new CVE describing a synthetic, non-existent, CPE is high. Therefore, the system of this disclosure is allowed to create synthetic CPEs, in terms of finding software that has not been mentioned in any existing CPE list yet. A completely successful NER-predicted CVE summary from our test data will let us reconstruct all corresponding CPEs correctly, while the model may create synthetic CPEs on new CVE summaries.
To determine success, the system may be measured as a conventional NER-model as follows. Over each predicted sequence, the precision was calculated,

$$\mathrm{precision} = \frac{\sum \mathrm{true\_positive}}{\sum \mathrm{true\_positive} + \sum \mathrm{false\_positive}}, \quad (1)$$

as well as the recall,

$$\mathrm{recall} = \frac{\sum \mathrm{true\_positive}}{\sum \mathrm{true\_positive} + \sum \mathrm{false\_negative}}, \quad (2)$$

and the harmonic mean F1,

$$F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}. \quad (3)$$

Every correctly predicted O-label was removed from the measurements, as it greatly inflates the result. The overall accuracy of the model, as the number of completely correctly NER-predicted CVE summaries divided by the total number of summaries in that particular dataset, may also be measured. A hold-out strategy was implemented to measure these metrics, with a training set to train the model on, a validation set to optimize the model during development, and a testing set to test the final result.
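The precision, recall and F1 measures above, with correctly predicted O-labels excluded, can be sketched at token level as follows (an illustrative implementation; `ner_metrics` is a hypothetical helper, and sequence-level bookkeeping is omitted).

```python
def ner_metrics(true_labels, predicted_labels):
    """Token-level precision, recall and F1, excluding correct 'O' labels."""
    tp = fp = fn = 0
    for t, p in zip(true_labels, predicted_labels):
        if p == t and t != "O":
            tp += 1                 # correct non-O prediction
        elif p != "O" and p != t:
            fp += 1                 # predicted an entity label wrongly
            if t != "O":
                fn += 1             # and missed the true entity label
        elif p == "O" and t != "O":
            fn += 1                 # missed an entity label entirely
        # correct 'O' predictions are deliberately not counted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

p, r, f1 = ner_metrics(["B-product", "O", "B-version", "O"],
                       ["B-product", "O", "O", "B-version"])
print(round(p, 2), round(r, 2), round(f1, 2))  # 0.5 0.5 0.5
```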
Modeling
In this section, the feature engineering and machine learning model are described. The model is inspired by the work of [2] and [12] in the context of generic NER, where the contribution was to feed the text data into multiple input sequences to capture different levels of semantics in the text. In brief, words are converted to vector representations [13] to embed contextual semantic meaning in the vector values. Additional word level and character level features are engineered to capture variations, such as word level numerical saturation, casing, and commonly used CPE-products and -vendors. These features are fed into a recurrent Bidirectional Long Short-term Memory (BLSTM) network to perform sequence labeling. Dropout [19] is used to regularize the model after the concatenated embeddings, after the recurrent layer, and within the case-feature layer. This model was chosen as it presented a superior performance on the specific task of CPE-labeling compared to other common architectures, such as BERT [3]. The model is also suitable, as domain knowledge can easily be embedded through feature engineering. An overview of the architecture is presented in Fig. 3.
Feature Engineering
This subsection will discuss the four parallel input layers used in the feature engineering part of our model, as seen in Fig. 3. These are word level embeddings, character level embeddings, word level case-features, and a word level lexicon of known statements. The word and character level embeddings are regarded as part of the base model, and case and lexicon features are regarded as optional/experimental features to the model. The output features are concatenated into an information-rich sequential matrix that is fed into a neural network.
Word Level Embeddings
Each word is transformed into a 50-, 100-, 200-, or 300-dimensional numerical vector to represent the semantics of that word with GloVe embeddings [18]. These embeddings are pre-trained on a large set of Wikipedia articles and consist of a vocabulary of 400 000 words. This language model serves as a good starting point for our experiments, as the embeddings are well documented and tested, which enables us to look into other variables in the modeling. These embeddings are not tuned during training, and missing words from the vocabulary are assigned a default randomly generated vector.
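The lookup-with-random-fallback behavior can be sketched as follows. This is a minimal illustration, not the GloVe loading code: the tiny `vocab` dict stands in for the pre-trained vectors, and seeding the generator with the word itself is one simple way to keep the out-of-vocabulary fallback vector stable across lookups.

```python
import random

def embed(word, vocab, dim=4):
    """Return the pre-trained vector for known words; assign a default
    randomly generated vector of the same dimension to missing words."""
    if word in vocab:
        return vocab[word]
    rng = random.Random(word)  # per-word seed keeps the OOV vector stable
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]

vocab = {"bash": [0.1, 0.2, 0.3, 0.4]}  # stand-in for GloVe vectors
print(embed("bash", vocab))             # [0.1, 0.2, 0.3, 0.4]
print(len(embed("shellshock", vocab)))  # 4, OOV fallback of equal dimension
```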
Character Level Features
To extract character level features for each word, a three-stage process was employed: embedding at the character level, applying a one-dimensional convolution (CNN layer), and extracting the final word features with a max-pooling layer. The embeddings are randomly initialized and tuned during training. Dropout is applied to prevent the model from overfitting. The employed CNN layer has a filter size of 30 and a kernel size of 3. A max operation is done over each filter, so each word outputs a character-feature vector of shape (1, 30), and the whole word sequence a shape of (word-sequence-length, 30). Character level features enable the model to learn new words through decoding of character sequences, and can thereby give similar output values to insignificant variations of very similar character sequences. As our text domain (security) is quite different from the pre-trained word level embeddings (Wikipedia), the character level embeddings enable our model to learn security-related semantics.
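The three stages above can be sketched in miniature as follows. This is a toy, dependency-free illustration with tiny dimensions (4 filters instead of 30, character embedding dimension 8) and untrained random weights; in the model both the character embeddings and the filter weights are learned.

```python
import random

def char_features(word, n_filters=4, kernel=3, embed_dim=8):
    """Stage 1: embed characters; stage 2: 1-D convolution over the
    character sequence; stage 3: max-over-time pooling per filter,
    yielding one fixed-size feature vector per word."""
    def char_vec(c):
        rng = random.Random(c)  # stand-in for a learned char embedding
        return [rng.uniform(-1, 1) for _ in range(embed_dim)]

    rng = random.Random(0)      # stand-in for learned filter weights
    filters = [[[rng.uniform(-1, 1) for _ in range(embed_dim)]
                for _ in range(kernel)] for _ in range(n_filters)]

    chars = [char_vec(c) for c in word]
    while len(chars) < kernel:  # pad short words to one full window
        chars.append([0.0] * embed_dim)

    feats = []
    for f in filters:
        activations = [
            sum(f[k][d] * chars[i + k][d]
                for k in range(kernel) for d in range(embed_dim))
            for i in range(len(chars) - kernel + 1)
        ]
        feats.append(max(activations))  # max-over-time pooling
    return feats

print(len(char_features("bash")))  # one feature per filter, i.e. 4
```

With the disclosure's dimensions the same code shape would produce a (1, 30) vector per word.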
Word Level Case Features
In the task of finding versions, products, and vendors, casing and other word properties may be important to determine the label of that particular word. For instance, it is common that products' and vendors' names are capitalized. The version label contains a high concentration of character level digits, but may also contain mid-word punctuation and special characters. Fig. 4 shows the different case-features, which are fed into random-uniformly initialized trainable embeddings with the same dimension as the number of cases.
Security Lexicon

To embed domain knowledge into the system, a security lexicon is built. The labels product and vendor are included in the lexicon features. The lexicon is constructed from the complete set of CVEs from the NVD database, consisting of about 130 000 vulnerabilities describing about 50 000 different products, excluding all CVEs in the validation and test datasets. Each entry in the lexicon can describe one of three entities: product, vendor, or product and vendor. Some product/vendor names exist both as products and as vendors, which explains this separate feature. The total number of entries in the lexicon can be seen in Fig. 5. When constructing the security lexicon, only common CPE-products and -vendors are added. The cutoff was set to the top 80% of the most common products and vendors, to avoid CPEs with very few mentions. As seen in Fig. 6, the accumulated product mentions are heavily skewed towards products with very few mentions. This distribution may discourage the use of a lexicon feature and increase the importance of case features and contextual understanding in the model, as the probability of new CVE-summaries containing already existing CPEs has historically been low.
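One reading of the 80% cutoff is to keep the most-mentioned names until they cover 80% of all accumulated mentions, dropping the long tail; under that assumption, the lexicon construction could be sketched as:

```python
from collections import Counter

def build_lexicon(mentions, cutoff=0.80):
    """Keep only the most common names until they cover `cutoff` of all
    accumulated mentions (an assumed interpretation of the 80% cutoff)."""
    counts = Counter(mentions)
    total = sum(counts.values())
    lexicon, covered = set(), 0
    for name, n in counts.most_common():
        if covered / total >= cutoff:
            break
        lexicon.add(name)
        covered += n
    return lexicon
```

With the skewed distribution of Fig. 6, a rarely mentioned product never enters the lexicon, which is exactly why the lexicon feature carries limited signal.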
Neural Network

The input layer of the model consists of some or all features described in the section Feature Engineering. The outputs of these features are all considered as embeddings that can be concatenated into a high-dimensional feature map capturing multiple characteristics of the input sequence. These concatenated embeddings are then fed into a neural network for sequence classification. The network architecture is inspired by [12], where the embeddings are fed into a Bidirectional Long Short-term Memory (BLSTM) layer and labels are decoded in a Conditional Random Field (CRF).
Bi-directional Long Short-Term Memory Unit (BLSTM)

The LSTM [7] neural network unit is a type of recurrent layer that has theoretically strong capabilities to capture long-distance dependencies in sequential data. In text data, recurrent models are capable of capturing contextual semantics and of correctly modeling the sequential variations and dependencies of that data. Conventional recurrent units suffer from problems such as vanishing and exploding gradients [1, 17], which prevent these networks from being effective on some tasks. The LSTM unit handles these complications through an internal architecture consisting of an input gate, an output gate, a forget gate, and a cell state. An overview of the LSTM cell can be seen in Fig. 7.

In Fig. 7, X_t denotes the t:th embedded input word to the LSTM cell and h represents the hidden state. The variable h_{t-1} is the output from the previous LSTM cell and h_t serves as the output prediction from this LSTM cell for the t:th word in the sequence. C denotes the cell state, which passes the memories of the already processed sequence to the LSTM cell. The forget gate is a nested neural network with a sigmoid activation function that scales the previous hidden state sequence between 0 and 1, where a low output value for a particular part of the sequence denotes that that word should be forgotten. The output of the forget gate f_t is derived through

f_t = σ(W_f × concat(h_{t-1}, X_t) + b_f), (4)

where W_f and b_f are the trainable weights. The activation function σ is derived through

σ(x) = 1 / (1 + e^(-x)). (5)

The input gate values are derived similarly to Equation (4),

i_t = σ(W_i × concat(h_{t-1}, X_t) + b_i), (6)

where W_i and b_i are trainable weights as well. As in Eq. (4), the sigmoid in Eq. (6) normalizes the input values and previous hidden state between 0 and 1, which corresponds to their relative importance in this particular time step t. This layer is responsible for deciding what new data should be added to the cell state.
To calculate the cell state, the input and previous hidden state are passed through the following equation,

C̃_t = tanh(W_C × concat(h_{t-1}, X_t) + b_C), (7)

to calculate the actual information that the input at step t brings. W_C and b_C are trainable weights. The tanh function normalizes the input between -1.0 and 1.0 through

tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x)). (8)

The relative importance is calculated for X and h and applied to the output from Eq. (7), which together with the forget gate forms the cell state through

C_t = f_t × C_{t-1} + i_t × C̃_t, (9)

where C_{t-1} is the previous cell state. To calculate the output for a particular part of the sequence, which corresponds to the hidden state h_t, the input X_t and h_{t-1} are passed through an output gate. This gate decides what information should be passed to the next hidden state and output as a sequence prediction. The output gate is derived through

o_t = σ(W_o × concat(h_{t-1}, X_t) + b_o), (10)

where W_o and b_o are trainable weights, and the current hidden state is calculated through

h_t = o_t × tanh(C_t). (11)

The output is passed to the next layer of the model and is a matrix of shape [batch_size, sequence_length, weight_shape], where batch_size is the number of parallel input examples fed to the model, sequence_length is the length of the sentence, and weight_shape is a user-set parameter that decides the number of weights used in the four nested neural networks.
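Equations (4)-(11) can be checked with a direct NumPy transcription of a single LSTM step; the weight values and the sizes here are illustrative stand-ins, not trained parameters:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))          # Eq. (5)

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step following Eqs. (4)-(11); W and b hold the four
    gate weight matrices and biases keyed by gate name."""
    z = np.concatenate([h_prev, x_t])        # concat(h_{t-1}, X_t)
    f_t = sigmoid(W["f"] @ z + b["f"])       # forget gate, Eq. (4)
    i_t = sigmoid(W["i"] @ z + b["i"])       # input gate, Eq. (6)
    c_hat = np.tanh(W["c"] @ z + b["c"])     # candidate state, Eq. (7)
    c_t = f_t * c_prev + i_t * c_hat         # cell state, Eq. (9)
    o_t = sigmoid(W["o"] @ z + b["o"])       # output gate, Eq. (10)
    h_t = o_t * np.tanh(c_t)                 # hidden state, Eq. (11)
    return h_t, c_t

rng = np.random.default_rng(0)
d, n = 4, 3                                  # embedding size, number of cells
W = {k: rng.normal(size=(n, n + d)) for k in "fico"}
b = {k: np.zeros(n) for k in "fico"}
h, c = np.zeros(n), np.zeros(n)
h, c = lstm_step(rng.normal(size=d), h, c, W, b)
```

Because o_t lies in (0, 1) and tanh(C_t) in (-1, 1), every component of the hidden state is bounded by 1 in magnitude, as the gating structure intends.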
To make this LSTM layer bidirectional [6], one simply uses two separate, but identical, LSTM layers that pass over the input sequence in one direction each. The outputs are then concatenated. The output is regularized with dropout [19].
The reason for using a BLSTM is that an LSTM cell does not know anything about the future sequence t+1, t+2, ..., which may be contextually valuable. For instance, when classifying a version, a part of the sequence may be "[..] vulnerable version 1.3.4 and earlier". A BLSTM can capture the semantic meaning of "and earlier" and correctly classify this as versionEndIncluding.

As shown in the architectural overview in Fig. 3, the output from the BLSTM is fed to a Conditional Random Field (CRF) [8] layer. The benefit of a CRF layer is statistically correlated label determination when assigning a class to a word in a sequence. For instance, the probability of a word being labeled with I-product increases if the previous word has been labeled with B-product. With a CRF, labels are assigned jointly to reflect a final prediction for all entities in the sequence that make sense together. This is done through conditional probabilities and global normalization of a random field model. Consider the output sequence of the BLSTM layer h = {h_1, h_2, ..., h_N}, where h_i denotes the numerical vector output from the BLSTM layer corresponding to the i:th word of the CVE-summary word sequence of length N. The label sequence y = {y_1, y_2, ..., y_N} denotes the corresponding labels for the CVE-summary word sequence, where y_i denotes the predicted label for the i:th word. Y(h) denotes the universe of all possible label sequences for h. The conditional random field describes the conditional probability of label y_i with respect to the input h_i and the surrounding labels y_v, v ~ i, where ~ denotes v as close to i, as p(y_i | h_i, y_v, v ~ i) over all possible label sequences. To determine the probability, a layer of weights W and biases b is used as in

p(y | h; W, b) = ∏_{i=1}^{N} ψ_i(y_{i-1}, y_i, h_i) / ( Σ_{y' ∈ Y(h)} ∏_{i=1}^{N} ψ_i(y'_{i-1}, y'_i, h_i) ), (12)

where

ψ_i(y', y, h_i) = exp(W^T_{y',y} h_i + b_{y',y}). (13)

The weights are trained through gradient descent and the Adam optimizer [11], as is the rest of the model.
The output of the CRF-layer is decoded from the highest conditional probability over the full sequence and serves as the output of the model.
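Finding the label sequence with the highest conditional probability over the full sequence is typically done with the Viterbi algorithm; the sketch below works on per-word emission scores and a label-to-label transition matrix, a simplification of the CRF above that the source does not spell out as code:

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    """Return the label path maximizing the joint score, i.e. the
    highest-scoring sequence under emission + transition scores."""
    n_steps, n_labels = emissions.shape
    score = emissions[0].copy()                 # best score ending in each label
    back = np.zeros((n_steps, n_labels), dtype=int)
    for t in range(1, n_steps):
        # total[i, j]: best path ending in label i at t-1, then label j at t
        total = score[:, None] + transitions + emissions[t][None, :]
        back[t] = total.argmax(axis=0)
        score = total.max(axis=0)
    path = [int(score.argmax())]
    for t in range(n_steps - 1, 0, -1):         # follow back-pointers
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# Two words, two labels; emissions strongly favor label 0 then label 1.
path = viterbi_decode(np.array([[5.0, 0.0], [0.0, 5.0]]), np.zeros((2, 2)))
```

With a trained transition matrix, implausible pairs such as I-product after O would receive low transition scores and be avoided jointly, as described above.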
Results and Discussion

Training

To train the model, a dataset of 15190 CVEs from NVD may be used, with an evaluation set of 3798 entries and a test set of 4748 entries. The test and evaluation split was done randomly. Experiments were conducted on whether to use a time split of the dataset instead, to prevent look-ahead bias, but this resulted in an insignificant performance change. The model was optimized with Bayesian hyperparameter optimization [9] over the following hyperparameters:

- The learning rate, a parameter that scales how much each weight should be updated in each gradient descent step [11].
- The number of cells in the LSTM-layers, which determines the size of the weight matrices W_f, W_i, W_o, and W_C, and their corresponding biases.
- Whether the casing features should be used.
- Whether the lexicon features should be used.
- The dimension of word level embeddings of pre-trained vectors.
- The dimension of character level embeddings of randomly initialized vectors.
- The Dropout-rate before and after the BLSTM-layer, and inside the char-features.
- The Recurrent dropout-rate in the LSTM-cells, which determines the dropout rate of the previous hidden state h_{t-1}.
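The sweep over these hyperparameters can be pictured as sampling configurations from a search space; the ranges below are illustrative assumptions only, since the actual bounds are given in Fig. 8:

```python
import random

# Illustrative search space; every range here is an assumption,
# as the real parameter space is defined in Fig. 8.
SPACE = {
    "learning_rate": lambda: 10 ** random.uniform(-4, -2),
    "lstm_cells": lambda: random.choice([50, 100, 200]),
    "use_casing": lambda: random.choice([True, False]),
    "use_lexicon": lambda: random.choice([True, False]),
    "word_dim": lambda: random.choice([50, 100, 200, 300]),
    "char_dim": lambda: random.choice([25, 30, 50]),
    "dropout": lambda: random.uniform(0.1, 0.5),
    "recurrent_dropout": lambda: random.uniform(0.0, 0.5),
}

def sample_config():
    """Draw one candidate configuration for a training iteration."""
    return {name: draw() for name, draw in SPACE.items()}
```

A Bayesian optimizer would replace the independent random draws with a surrogate model proposing promising configurations, but the space itself has this shape.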
The training was performed on an NVIDIA TESLA K80 GPU, and it took about 4-6 h to train the model once. In total, it took about 30 h to do the full training sweep on 16 K80s for 80 training iterations with different hyperparameter settings. This amounts to about 20 GPU-days. The parameter search space can be seen in Fig. 8. The Adam optimizer [11] was used to update the trainable parameters of the model, and early stopping was used to reduce the risk of overfitting.
Main Results

In Fig. 9 the results are presented for the different model configurations. It is clear that the security lexicon did not provide any significant signal to improve the model. The word level casing feature proved beneficial to the performance, with a significant improvement over the base model. The best performance on the test set was attained without the lexicon features and with casing features, with an F-measure of 0.8604, a precision of 0.8571, and a recall of 0.8637. It is also clear that the same model had the best performance on the validation set but, as can be seen in Fig. 9, shows some indications of overfitting to the training set, as the F-measure, recall, and precision are much higher there. This may indicate that additional performance could be gained with more aggressive regularization techniques. The fully combined model had much worse performance on the training set and similar performance on the test and validation sets. This may indicate that further training and hyperparameter optimization could increase the performance of this model and enable it to surpass the other options.
Performance over CPE-product, -vendor, and -version

At a more granular level, shown in Fig. 10, the performance of each label on the test set is illustrated, as well as the number of instances of each label in the test set (Label Count) and the number of predicted instances (Prediction Count). As seen in Fig. 10, some classes perform better than others. The F-measure is high for B-vendor, B-product, and B-version, as well as I-product. It is clear that there is a correlation between Label Count and all performance scores, which makes sense for this type of model, as neural networks tend to be very data-hungry. In Fig. 11, labels with more examples in the dataset clearly have higher performance than less common labels. There seems to be a cutoff at approximately 300 examples for an F-measure above 0.8. It is further illustrated that the performance for word combinations is worse, as the scores for I-labels are lower. To further increase the performance on I-labeled entries, it may be beneficial to create n-gram features in the lexicon or collect additional data for those particular cases. Fig. 12 visualizes the results from Fig. 10. The model achieves a similar distribution over each label, which is visualized in Fig. 13.
Feature Analysis

The lexicon features did not provide any significant performance gains, together with or compared to the case features. It is possible that the case features better captured characteristics of the vendor and product labels, since those are commonly capitalized in some manner, rather than over-relying on a fairly static memory of common labels. This result is in line with the distribution of products shown in Fig. 6, as 60% of all products in NVD are mentioned only once. Other papers, such as [4] and [5], use keyword-based systems or features targeting narrow properties of the vendor and product labels. These systems do not take the context of the sequence into consideration when performing classification. Thus, by using the present disclosure, the achieved results may be significantly better compared to the systems and features disclosed in [4] and [5]. With contextually aware classification, the disclosed system is able to find CPEs that have never been seen before by NVD in any CVEs. This is highly desirable in a system for automatically extracting CPEs from CVEs, due to the distribution in Fig. 6.
Increase Performance on Rare Labels

The disclosed dataset consists of roughly 20% of all available CVEs in NVD, which may limit the results. This particular subset was chosen as over 90% of all CPE-version, -product, and -vendor strings for all CPEs paired with a CVE could be found in the summary through regular expressions. Stronger regular expressions could increase the number of training examples, and further increase the performance of the system. To increase the performance on the more challenging task of classifying multi-word labels, these cases could be overweighted in the training data, or the model could be pre-trained on a larger high-quality dataset such as CoNLL 2003 [21].
Error Analysis

The overall accuracy of the system in correctly linking every CPE of each CVE is 0.6744, measured as the full CVE-summary being correctly NER-annotated by the system. If the system only pursued vendor and product classification, the accuracy would increase to 0.7772, which is more comparable to earlier research, as earlier work does not always search for version ranges in the summary. The distribution of the number of errors in all sequences that were incorrect is visualized in Fig. 14, where the accumulated error for sequences with up to 3 errors stands for about 80% of the misclassified summaries. Looking further into what types of errors the model makes, Fig. 15 visualizes in total about 90% of all misclassifications. In the top four spots, accounting for about 40% of the errors, are bad predictions on the product label, with the I-label scoring higher than the B-label. This strengthens the hypothesis that the system needs improvement to better find word combinations. The top two mistakes contribute to a lower precision, as the system incorrectly finds a CPE-product where there is none, and errors three and four contribute to a lower recall, as products are misclassified as O-words.
Related research

Other research has tried more extensive engineering of text features to extract CPEs from the CVE-summary published by NVD. In [5] the authors mine the target product, vendor, and version from the summary by tokenization of each word, much like our case feature and lexicon feature, to discover punctuation, capitalization, and commonly used vendors/products. They also generate snippets (sequences of tokens) to cover multi-word labels through engineered rules based on the feature vector. Multiple token sequences can then be grouped into a CPE (vendor-product-version link) based on rules, such as that all version tokens within 6 tokens of a product token are assigned to that product token. The context of each version is analyzed to determine the version type (before/after, including/excluding). The authors achieve an F-measure of 0.70 (precision: 0.71, recall: 0.69), which the disclosure significantly outperforms, as the disclosure attains an F-measure of 0.86 (precision: 0.857, recall: 0.864).
A similar system for finding CPEs for "one day"-vulnerabilities was proposed in [4], where the authors use a keyword-based technique with TF-IDF to find the probability of each word being assigned to a certain sub-class within a CPE. The output of the model is an ordered list of words with a high probability of being a relevant word in a CPE. The authors' results may not be comparable to this research, as their system is not intended for automated use and needs explainability. Still, to make a fair comparison, their precision for the top predicted word in each ordered list, which is just below 0.6, can be compared to this disclosure's, which is 0.857. Their research nevertheless indicates that a TF-IDF implementation of a lexicon feature could provide additional performance to our system in terms of finding already mentioned products and vendors.
The model is largely based upon [12], which combined engineered features, a BLSTM-network and a CRF-layer to perform NER on the CoNLL 2003 [21] dataset. They achieve an F-measure score as high as 0.9121, which to our knowledge held the state of the art for some time in 2016. Results from different datasets are not comparable, as the quality and the general difficulty of each dataset may differ. Other, more recent, implementations of state-of-the-art NLP-models, such as BERT [3], were also implemented in our research, but with a significant decrease in performance compared to our model.
Conclusion

The present disclosure concludes that it is possible to automate the process of linking CVEs with CPEs by machine learning, with high precision and recall with respect to the CPEs that are actually mentioned in the CVE-summary. This model is able to find CPE-products, -vendors, and -versions with an F-measure of 0.8604 (precision: 0.8571, recall: 0.8637) through NER-tagging, and completely reconstructs all corresponding CPEs in 67.44% of CVE-summaries. This system enables DVM-tools to automatically and without time-lag get an estimate of the CPEs a particular CVE describes, and thereby reduce the risk of becoming a victim of a "one day"-vulnerability. Additionally, CPEs may also be found in incorrectly labeled CVEs or in vulnerabilities from other sources, such as forums, email threads, or RSS feeds. These results may establish a synthetic state of the art in extracting CPEs from CVE-summaries.
The system could be further developed by embedding knowledge of the available universe of CPEs into the results of the prediction, so that each estimated CPE could be paired with one or multiple existing CPEs. A TF-IDF or n-gram implementation of the security lexicon feature, as in [4], could also improve the performance of the system, possibly making better use of the security lexicon, which in our case brings no noteworthy additional performance.
System for identifying vulnerabilities in software

Fig. 16 illustrates a system 100 for identifying vulnerabilities in software 102. The system 100 comprises a vulnerability database 104. The vulnerability database 104 provides a plurality of common vulnerabilities and exposures, CVEs 106. Each CVE 106 of the plurality of CVEs may comprise a summary describing a vulnerability. The system 100 further comprises a synthetic common platform enumeration, CPE, database 108. The synthetic CPE database 108 can be constructed 110 by using a method 400 for building the database of the plurality of CVEs, as will be further discussed in connection with Fig. 19. The plurality of CVEs 106 can be retrieved from the vulnerability database 104. The synthetic CPE database 108 comprises linked CVEs 106, wherein the linked CVEs 106 are linked with synthetic CPEs 112. The linked CVEs 106 can be linked to a plurality of synthetic CPEs 112 and vice versa.

A software 102 to be investigated for vulnerabilities can be described by at least one File CPE 114. The File CPE(s) 114 can then be compared 116 to the synthetic CPEs 112 of the synthetic CPE database 108, in order to find whether there are any matching synthetic CPEs 112 that link to known CVEs 106 representing vulnerabilities 118.
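The comparison 116 between File CPEs and synthetic CPEs can be sketched as a database lookup; the exact-match key and the example entry are simplifying assumptions (real CPE matching must also handle version ranges such as versionEndIncluding):

```python
def find_vulnerabilities(file_cpes, synthetic_db):
    """Return the CVE ids whose linked synthetic CPEs match a File CPE.
    `synthetic_db` maps (vendor, product, version) -> set of CVE ids."""
    hits = set()
    for cpe in file_cpes:
        hits |= synthetic_db.get(cpe, set())
    return hits

# Hypothetical database entry for illustration.
db = {("apache", "http_server", "2.4.49"): {"CVE-2021-41773"}}
found = find_vulnerabilities([("apache", "http_server", "2.4.49")], db)
```

Each hit corresponds to a known CVE 106 representing a vulnerability 118 in the investigated software 102.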
Server configured to link a CVE with at least one synthetic CPE

Fig. 17 illustrates a schematic view of a server 200. The server 200 is configured to link a common vulnerability and exposure, CVE, with at least one synthetic common platform enumeration, CPE. The CVE comprises a summary of a vulnerability. The server 200 comprises a transceiver 202, a control circuit 204 and a memory 208.
The transceiver 202 is configured to enable the server 200 to communicate with other devices, such as a vulnerability database, VD. The transceiver 202 may be configured to receive the summary of the CVE from the vulnerability database, VD.
The control circuit 204 may be configured to perform the control of functions and operations of the server 200. The control circuit 204 may include a processor 206, such as a central processing unit (CPU). The processor 206 can be configured to execute program code stored in the memory 208, in order to perform functions and operations of the server 200.
The control circuit 204 may execute an extracting function 210. The extracting function 210 can be configured to extract information from the summary of the CVE. The information may be extracted by using a Natural Language Processing, NLP, model. The extracted information may comprise a product and/or version and/or vendor affected by the vulnerability. Extracting the information is discussed in further detail in connection with Fig. 18.
The control circuit 204 may execute a building function 212. The building function 212 can be configured to build at least one synthetic CPE. The synthetic CPE may be built based on the extracted information.
The control circuit 204 may execute a linking function 214. The linking function 214 can be configured to link the CVE with the at least one synthetic CPE.
Method for linking a CVE with at least one synthetic CPE

Fig. 18 is a flowchart illustrating steps of a method 300 for linking a common vulnerability and exposure, CVE, with at least one synthetic common platform enumeration, CPE, by way of example. The CVE may comprise a summary of a vulnerability.

In a first step S302, the summary of the CVE is received from a vulnerability database, VD.

In a second step S304, information from the summary of the CVE is extracted by using a Natural Language Processing, NLP, model. The information may comprise a vendor and/or a product name and/or a product version that may be affected by the vulnerability.

Optionally, the step of extracting information from the summary of the CVE may comprise adding a label for each word in the summary, wherein the label may be selected from a CPE relevant group or a non-CPE relevant group. The CPE relevant group may comprise vendor, product, version, first excluded version, first included version, last excluded version, last included version. The labels from the CPE relevant group may further be labeled as B-label or I-label. The B-label may denote a labeled word to be a beginning of a word combination. The I-label may denote the labeled word to be placed after the beginning in the word combination. The non-CPE relevant group may comprise none-labels. Thereafter, the words with labels from the CPE relevant group may be extracted from the summary of the CVE.

Optionally, the step of extracting information from the summary of the CVE may comprise feeding each word in the summary of the CVE into a feature engineering. The feature engineering may comprise Word Level Embeddings and Character Level Embeddings. The Word Level Embeddings may be configured to transform each word in the summary into a numerical vector. The Character Level Embeddings may be configured to extract character level features for each word in the summary.
Alternatively, or in combination, the feature engineering may further comprise Word Level Case Features. The Word Level Case Features may be configured to find word-properties in the summary. Alternatively, or in combination, the feature engineering may further comprise a Word Level Lexicon. The Word Level Lexicon may be configured to find features based on domain knowledge. The Word Level Lexicon may be constructed from a set of CVEs from the VD, comprising known products, vendors, and product and vendors. Thereafter, an input may be formed by combining outputs of the Word Level Embeddings and the Character Level Embeddings. Alternatively, or in combination, the step of forming the input may further comprise combining outputs of the Word Level Case Features and the Word Level Lexicon. Then, the input may be fed into a neural network. The neural network may comprise a recurrent Bidirectional Long Short-term Memory (BLSTM) network and a Conditional Random Field (CRF) layer. Thereafter, a set of labeled words from the output of the neural network may be determined.

In a third step S306, at least one synthetic CPE is built based on the extracted information.

In a fourth step S308, the CVE is linked with the at least one synthetic CPE.
Method for building a database of a plurality of CVEs

Fig. 19 is a flowchart illustrating steps of a method 400 for building a database of a plurality of CVEs with at least one linked synthetic CPE. In a first step S402, each CVE of the plurality of CVEs may be linked to at least one synthetic CPE according to the method 300 discussed in connection with Fig. 18. In a second step S404, each CVE of the plurality of CVEs with at least one linked synthetic CPE may be stored in the database.
Optionally, the synthetic CPE of the database may be compared to a file CPE in order to find vulnerabilities in software. The file CPE may comprise vendor, product and version of imported software.
Method for training of an NLP model

Fig. 20 is a flowchart illustrating steps of a method 500 for training an NLP model. The NLP model is configured to be used for linking a common vulnerability and exposure, CVE, with at least one common platform enumeration, CPE. In a first step S502, a dataset is formed. The dataset may comprise CVEs with linked CPEs. In a second step S504, the dataset may be divided into a training set and a validation set. In a third step S506, parameters of the model may be fitted by applying the model to CVEs with linked CPEs in the training set. Thereafter, in a fourth step S508, the NLP model may be optimized by using the CVEs in the validation set.

From the description above follows that, although various embodiments of the disclosure have been described and shown, the disclosure is not restricted thereto, but may also be embodied in other ways within the scope of the subject-matter defined in the following claims.

Claims (14)

1. A method (300) for linking a common vulnerability and exposure, CVE, (106) with at least one synthetic common platform enumeration, CPE, (112) wherein the CVE (106) comprises a summary of a vulnerability, the method comprising: receiving (S302) the summary of the CVE (106) from a vulnerability database, VD, (104); extracting (S304) information from the summary of the CVE (106) using a Natural Language Processing, NLP, model, building (S306) at least one synthetic CPE (112) based on the extracted information, and linking (S308) the CVE (106) with the at least one synthetic CPE (112).
2. The method (300) according to claim 1, wherein the extracted information comprises a vendor and/or product name and/or a product version affected by the vulnerability.
3. The method (300) according to any one of the preceding claims, wherein the step of extracting information (S304) from the summary of the CVE (106) comprises: adding a label for each word in the summary, wherein the label is selected from a CPE relevant group comprising vendor, product, version, first excluded version, first included version, last excluded version, last included version, or a non-CPE relevant group comprising none-labels, and extracting words with labels from the CPE relevant group.
4. The method according to claim 3, wherein the labels from the CPE relevant group are further labeled as B-label or I-label, wherein the B-label denotes a labeled word to be a beginning of a word combination and the I-label denotes the labeled word to be placed after the beginning in the word combination.
5. The method according to any one of the preceding claims, wherein the step of extracting information (S304) from the summary of the CVE (106) further comprises: feeding each word in the summary of the CVE (106) into a feature engineering, wherein the feature engineering comprises Word Level Embeddings, wherein the Word Level Embeddings are configured to transform each word in the summary into a numerical vector, and Character Level Embeddings, wherein the Character Level Embeddings are configured to extract character level features for each word in the summary; forming an input by combining outputs of the Word Level Embeddings and the Character Level Embeddings; feeding the input into a neural network comprising a recurrent Bidirectional Long Short-term Memory (BLSTM) network and a Conditional Random Field (CRF) layer; and determining a set of labeled words from output of the neural network.
6. The method according to claim 5, wherein the feature engineering further comprises Word Level Case Features, wherein the Word Level Case Features are configured to find word-properties in the summary, and/or a Word Level Lexicon, wherein the Word Level Lexicon is configured to find features based on domain knowledge.
7. The method according to claim 6, wherein the step of forming the input further comprises combining outputs of the Word Level Case Features and the Word Level Lexicon.
8. The method according to claim 6 or 7, wherein the Word Level Lexicon is constructed from a set of CVEs from the VD, comprising known products, vendors and product and vendors.
9. The method according to any one of the preceding claims, wherein the step of building the at least one synthetic CPE (112) based on the extracted information further comprises combining the extracted information into a predetermined CPE format.
10. A method (400) for building a database (108) of a plurality of CVEs (106) linked with at least one synthetic CPE (112), comprising the steps of: linking (S402) each CVE (106) of the plurality of CVEs to at least one synthetic CPE (112) according to the method (300) of claim 1, and storing (S404) each CVE (106) of the plurality of CVEs linked with at least one synthetic CPE (112) in the database (108).
11. The method (400) according to claim 10, for comparing (116) a file CPE (114), wherein the file CPE (114) comprises vendor, product and version of imported software, with synthetic CPEs (112) of the database to find vulnerabilities in software (102).
12. A method (500) for training of an NLP model, wherein the NLP model is configured to be used for linking a common vulnerability and exposure, CVE, (106) with at least one synthetic common platform enumeration, CPE, (112) the method (500) comprising: forming (S502) a data set, wherein the data set comprises CVEs (106) with linked CPEs (112) received from a vulnerability database, VD, (104); dividing (S504) the data set into a training set and a validation set; fitting (S506) parameters of the model by applying the model to CVEs with linked CPEs in the training set, and optimizing (S508) the NLP model using the CVEs in the validation set.
13. A server (200) configured to link a common vulnerability and exposure, CVE, (106) with at least one synthetic common platform enumeration, CPE, (112) wherein the CVE (106) comprises a summary of a vulnerability, the server (200) comprising: a transceiver (202) configured to: receive the summary of the CVE (106) from a vulnerability database, VD, (104); a control circuit (204) configured to execute: an extracting function (210) configured to extract information from the summary of the CVE (106) using a NLP model; a building function (212) configured to build at least one synthetic CPE (112) based on the extracted information; and a linking function (214) configured to link the CVE (106) with the at least one synthetic CPE (112).
14. The server (200) according to claim 13, wherein the extracted information comprises a product and/or a version and/or vendor affected by the vulnerability.
SE2050302A 2020-02-14 2020-03-19 A method for linking a cve with at least one synthetic cpe SE2050302A1 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
SE2050302A SE2050302A1 (en) 2020-03-19 2020-03-19 A method for linking a cve with at least one synthetic cpe
PCT/EP2021/053487 WO2021160822A1 (en) 2020-02-14 2021-02-12 A method for linking a cve with at least one synthetic cpe
US17/759,968 US20230075290A1 (en) 2020-02-14 2021-02-12 Method for linking a cve with at least one synthetic cpe

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
SE2050302A SE2050302A1 (en) 2020-03-19 2020-03-19 A method for linking a cve with at least one synthetic cpe

Publications (1)

Publication Number Publication Date
SE2050302A1 true SE2050302A1 (en) 2021-09-20

Family

ID=78005376

Family Applications (1)

Application Number Title Priority Date Filing Date
SE2050302A SE2050302A1 (en) 2020-02-14 2020-03-19 A method for linking a cve with at least one synthetic cpe

Country Status (1)

Country Link
SE (1) SE2050302A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014021866A1 (en) * 2012-07-31 2014-02-06 Hewlett-Packard Development Company, L.P. Vulnerability vector information analysis
US20190102564A1 (en) * 2017-10-02 2019-04-04 Board Of Trustees Of The University Of Arkansas Automated Security Patch and Vulnerability Remediation Tool for Electric Utilities
US20190147167A1 (en) * 2017-11-15 2019-05-16 Korea Internet & Security Agency Apparatus for collecting vulnerability information and method thereof

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014021866A1 (en) * 2012-07-31 2014-02-06 Hewlett-Packard Development Company, L.P. Vulnerability vector information analysis
US20190102564A1 (en) * 2017-10-02 2019-04-04 Board Of Trustees Of The University Of Arkansas Automated Security Patch and Vulnerability Remediation Tool for Electric Utilities
US20190147167A1 (en) * 2017-11-15 2019-05-16 Korea Internet & Security Agency Apparatus for collecting vulnerability information and method thereof

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
C. Elbaz, L. Rilling and C. Morin, "Automated Keyword Extraction from "One-day" Vulnerabilities at Disclosure," NOMS 2020 - 2020 IEEE/IFIP Network Operations and Management Symposium, Budapest, Hungary, 2020, pp. 1-9, doi: 10.1109/NOMS47738.2020.9110460. *
Xuezhe Ma and Eduard Hovy, "End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF," Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pages 1064-1074, Berlin, August 7-12, 2016. *
Ernesto Rosario Russo, Andrea Di Sorbo, Corrado A. Visaggio, Gerardo Canfora, "Summarizing vulnerabilities' descriptions to support experts during vulnerability assessment activities," The Journal of Systems and Software 156 (2019) 84-99. *

Similar Documents

Publication Publication Date Title
US10325020B2 (en) Contextual pharmacovigilance system
US11727243B2 (en) Knowledge-graph-embedding-based question answering
Wåreus et al. Automated CPE labeling of CVE summaries with machine learning
CN110532353B (en) Text entity matching method, system and device based on deep learning
Lawrie et al. Normalizing source code vocabulary
CN109885479B (en) Software fuzzy test method and device based on path record truncation
CN107368542B (en) Method for evaluating security-related grade of security-related data
Zhang et al. Knowing more about questions can help: Improving calibration in question answering
CN103678318B (en) Multi-word unit extraction method and equipment and artificial neural network training method and equipment
Zhang et al. Multifeature named entity recognition in information security based on adversarial learning
US20230075290A1 (en) Method for linking a cve with at least one synthetic cpe
WO2021004118A1 (en) Correlation value determination method and apparatus
CN116305158A (en) Vulnerability identification method based on slice code dependency graph semantic learning
Tariq et al. SpeCollate: Deep cross-modal similarity network for mass spectrometry data based peptide deductions
Gruppi et al. Fake it till you make it: Self-supervised semantic shifts for monolingual word embedding tasks
CN114662477A (en) Stop word list generating method and device based on traditional Chinese medicine conversation and storage medium
Wieling et al. Hierarchical spectral partitioning of bipartite graphs to cluster dialects and identify distinguishing features
Sheng et al. Semantic-preserving abstractive text summarization with siamese generative adversarial net
Zhou et al. A recurrent model for collective entity linking with adaptive features
SE2050302A1 (en) A method for linking a cve with at least one synthetic cpe
CN112395865B (en) Check method and device for customs clearance sheet
Kara et al. A SHAP-based active learning approach for creating high-quality training data
CN104424332A (en) Unambiguous Japanese name list building method and name identification method and device
Das et al. Language identification of Bengali-English code-mixed data using character & phonetic based LSTM models
Sultana et al. Fake News Detection Using Machine Learning Techniques

Legal Events

Date Code Title Description
NAV Patent application has lapsed