WO2023228188A1 - A sensitive variable identifying system and method - Google Patents

Publication number
WO2023228188A1
WO2023228188A1 PCT/IL2023/050535 IL2023050535W WO2023228188A1 WO 2023228188 A1 WO2023228188 A1 WO 2023228188A1 IL 2023050535 W IL2023050535 W IL 2023050535W WO 2023228188 A1 WO2023228188 A1 WO 2023228188A1
Authority
WO
WIPO (PCT)
Prior art keywords
variable
expressive
code
identifier
given
Prior art date
Application number
PCT/IL2023/050535
Other languages
French (fr)
Inventor
Uzy HADAD
Arthur GARMIDER
Hanan IVRY
Original Assignee
Privya Ops Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Privya Ops Ltd
Publication of WO2023228188A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/10 Protecting distributed programs or content, e.g. vending or licensing of copyrighted material; Digital rights management [DRM]
    • G06F21/12 Protecting executable software
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Definitions

  • the present invention relates to the field of systems and methods for identifying sensitive and personal variables.
  • Computer programming, also known as coding, is the process of writing code involving a set of instructions intended to facilitate specific actions to be executed by a computer.
  • the set of instructions may involve the use of various variables, at least some of which may be associated with personal and sensitive data (e.g., as defined by privacy regulations such as the General Data Protection Regulation (GDPR)).
  • a sensitive variable identifying system comprising a processing circuitry configured to: obtain: (a) a plurality of expressive variable identifiers, each given expressive variable identifier being: (i) associated with a respective machine learning model of one or more machine learning models, capable of receiving a vector associated with a variable and labeling the variable as either associated with the given expressive variable identifier or not, and (ii) indicative of sensitive content of a respective variable associated with the given expressive variable identifier, and (b) at least one code segment including a plurality of code lines containing a definition of at least one defined variable having a non-expressive variable identifier, wherein the non-expressive variable identifier is not included in the plurality of expressive variable identifiers; identify, utilizing the at least one code segment, a collection of terms associated with the at least one defined variable; generate a defined variable vector from the collection of terms; and determine the at least one defined variable as sensitive by determining whether the defined variable vector
  • each machine learning model, associated with the given expressive variable identifier is generated by: obtaining one or more code segments including a plurality of code lines, wherein the plurality of code lines includes a definition of a variable having the given expressive variable identifier; identifying, utilizing the one or more code segments, collections of terms associated with the given expressive variable identifier; generating one or more labeled vectors, each of which is based on an identified collection of terms of the collections of terms and associated with a label of the given expressive variable identifier; and training the machine learning model associated with the given expressive variable identifier, based on the one or more labeled vectors.
  • the one or more labeled vectors are generated using a word embedding method.
  • the word embedding method is one or more of: one hot encoding, Word2Vec, Fasttext, GloVe, CBOW, or any combination thereof.
  • each given expressive variable identifier is associated with a distinct machine learning model.
  • the defined variable vector is generated of a collection of partially defined vectors.
  • each given expressive variable identifier of the plurality of expressive variable identifiers is associated with additional expressive variable identifiers that are also indicative of the content of the respective variable associated with the given expressive variable identifier.
  • the collection of terms includes: one or more function identifiers, one or more variable names, or any combination thereof.
  • the vector is generated using a word embedding method.
  • the word embedding method is one or more of: one hot encoding, Word2Vec, Fasttext, GloVe, CBOW, or any combination thereof.
  • the processing circuitry is configured to identify at least one code flow of the plurality of code flows in which the defined variable having the non-expressive variable identifier is present.
  • (i) each code flow of the plurality of code flows includes a plurality of code levels, each including a plurality of code lines, and (ii) the identification of the defined variable within the plurality of code levels of each code flow is carried out up to a code level threshold.
  • a sensitive variable identifying method comprising: obtaining: (a) a plurality of expressive variable identifiers, each given expressive variable identifier being: (i) associated with a respective machine learning model of one or more machine learning models, capable of receiving a vector associated with a variable and labeling the variable as either associated with the given expressive variable identifier or not, and (ii) indicative of sensitive content of a respective variable associated with the given expressive variable identifier, and (b) at least one code segment including a plurality of code lines containing a definition of at least one defined variable having a non-expressive variable identifier, wherein the non-expressive variable identifier is not included in the plurality of expressive variable identifiers; identifying, utilizing the at least one code segment, a collection of terms associated with the at least one defined variable; generating a defined variable vector from the collection of terms; and determining the at least one defined variable as sensitive by determining whether the defined variable vector is associated with
  • each machine learning model, associated with the given expressive variable identifier is generated by: obtaining one or more code segments including a plurality of code lines, wherein the plurality of code lines includes a definition of a variable having the given expressive variable identifier; identifying, utilizing the one or more code segments, collections of terms associated with the given expressive variable identifier; generating one or more labeled vectors, each of which is based on an identified collection of terms of the collections of terms and associated with a label of the given expressive variable identifier; and training the machine learning model associated with the given expressive variable identifier, based on the one or more labeled vectors.
  • the one or more labeled vectors are generated using a word embedding method.
  • the word embedding method is one or more of: one hot encoding, Word2Vec, Fasttext, GloVe, CBOW, or any combination thereof.
  • each given expressive variable identifier is associated with a distinct machine learning model.
  • the defined variable vector is generated of a collection of partially defined vectors.
  • each given expressive variable identifier of the plurality of expressive variable identifiers is associated with additional expressive variable identifiers that are also indicative of the content of the respective variable associated with the given expressive variable identifier.
  • the collection of terms includes: one or more function identifiers, one or more variable names, or any combination thereof.
  • the vector is generated using a word embedding method.
  • the word embedding method is one or more of: one hot encoding, Word2Vec, Fasttext, GloVe, CBOW, or any combination thereof.
  • the processing circuitry is configured to identify at least one code flow of the plurality of code flows in which the defined variable having the non-expressive variable identifier is present.
  • (i) each code flow of the plurality of code flows includes a plurality of code levels, each including a plurality of code lines, and (ii) the identification of the defined variable within the plurality of code levels of each code flow is carried out up to a code level threshold.
  • a non-transitory computer readable storage medium having computer readable program code embodied therewith, the computer readable program code, executable by at least one processor to perform a sensitive variable identifying method, the method comprising: obtaining: (a) a plurality of expressive variable identifiers, each given expressive variable identifier being: (i) associated with a respective machine learning model of one or more machine learning models, capable of receiving a vector associated with a variable and labeling the variable as either associated with the given expressive variable identifier or not, and (ii) indicative of sensitive content of a respective variable associated with the given expressive variable identifier, and (b) at least one code segment including a plurality of code lines containing a definition of at least one defined variable having a non-expressive variable identifier, wherein the non-expressive variable identifier is not included in the plurality of expressive variable identifiers; identifying, utilizing the at least one code segment, a collection of terms associated with
  • Fig. 1 is a schematic illustration of a code segment on which the sensitive variable identifying system operates, in accordance with the presently disclosed subject matter;
  • Fig. 2 is a block diagram schematically illustrating one example of a sensitive variable identifying system, in accordance with the presently disclosed subject matter;
  • Fig. 3 is a flowchart illustrating an example of a sequence of operations carried out by a sensitive variable identifying system, in accordance with the presently disclosed subject matter; and,
  • Figs. 4A-4B are exemplary code segments on which the sensitive variable identifying system operates, in accordance with the presently disclosed subject matter.
  • DSP digital signal processor
  • FPGA field programmable gate array
  • ASIC application specific integrated circuit
  • non-transitory is used herein to exclude transitory, propagating signals, but to otherwise include any volatile or nonvolatile computer memory technology suitable to the application.
  • the phrase “for example,” “such as”, “for instance” and variants thereof describe non-limiting embodiments of the presently disclosed subject matter.
  • Reference in the specification to “one case”, “some cases”, “other cases” or variants thereof means that a particular feature, structure or characteristic described in connection with the embodiment(s) is included in at least one embodiment of the presently disclosed subject matter.
  • the appearance of the phrase “one case”, “some cases”, “other cases” or variants thereof does not necessarily refer to the same embodiment(s).
  • Fig. 2 illustrates a general schematic of the system architecture in accordance with an embodiment of the presently disclosed subject matter.
  • Each module in Fig. 2 can be made up of any combination of software, hardware and/or firmware that performs the functions as defined and explained herein.
  • the modules in Fig. 2 may be centralized in one location or dispersed over more than one location.
  • the system may comprise fewer, more, and/or different modules than those shown in Fig. 2.
  • Any reference in the specification to a method should be applied mutatis mutandis to a system capable of executing the method and should be applied mutatis mutandis to a non-transitory computer readable medium that stores instructions that once executed by a computer result in the execution of the method.
  • Any reference in the specification to a system should be applied mutatis mutandis to a method that may be executed by the system and should be applied mutatis mutandis to a non-transitory computer readable medium that stores instructions that may be executed by the system.
  • Any reference in the specification to a non-transitory computer readable medium should be applied mutatis mutandis to a system capable of executing the instructions stored in the non-transitory computer readable medium and should be applied mutatis mutandis to a method that may be executed by a computer that reads the instructions stored in the non-transitory computer readable medium.
  • Fig. 1 shows a schematic illustration of a code segment on which the sensitive variable identifying system operates, in accordance with the presently disclosed subject matter.
  • code segment 100 includes a plurality of code lines, denoted L1 to Ln, forming a set of instructions intended to facilitate specific actions to be executed by a computer (not shown).
  • the plurality of code lines, which can be written in any programming language, for example, Java, C#, TypeScript, Go, Kotlin, Scala, JavaScript, C++, C, Python, PHP: Hypertext Preprocessor (PHP), Structured Query Language (SQL), and the like, may be arranged in one or more code flows and include the use of one or more variables directed to store data.
  • the data stored can be sensitive, i.e., data that requires protection because its loss, misuse, modification, or unauthorized access will negatively impact the welfare, privacy, assets, or security of an organization or individual (e.g., gender, ID number, Social Security Number (SSN), password, and the like), or non-sensitive, i.e., data that does not require protection (e.g., metadata of services, products, etc.).
  • Each variable of the one or more variables within the one or more code flows of code segment 100 can be defined by type (e.g., char, integer, string, void, double, etc.) and be associated with a variable identifier.
  • the variable identifier can be an expressive variable identifier, i.e., a variable identifier that at least implies the content of the variable to which it is associated, or a non-expressive variable identifier, i.e., a variable identifier that does not even imply the content of the variable to which it is associated.
  • code segment 100 includes two defined variables, a “Long” variable and a “String” variable, associated with respective variable identifiers 102a and 102b.
  • variable identifier 102a is considered an expressive variable identifier, as it indicates that the content of the defined “Long” variable to which it is associated includes an ID number
  • variable identifier 102b is considered a non-expressive variable identifier, as it does not even imply the content of the defined “String” variable to which it is associated. From the denotation of variable identifier 102a it is evident that the content of the defined “Long” variable is associated with sensitive data, while the sensitivity of the content of the defined “String” variable associated with variable identifier 102b remains unclear.
  • the sensitive variable identifying system operates on code segment 100, as will be described hereafter in reference to Fig. 3.
  • Fig. 2 is a block diagram schematically illustrating one example of the sensitive variable identifying system 200, in accordance with the presently disclosed subject matter.
  • the sensitive variable identifying system 200 can comprise a network interface 206.
  • the network interface 206 e.g., a network card, a WiFi client, a Li-Fi client, 3G/4G client, or any other component
  • system 200 can receive, through network interface 206, a plurality of expressive variable identifiers associated with a respective machine learning model of one or more machine learning models.
  • System 200 can further comprise or be otherwise associated with a data repository 204 (e.g., a database, a storage system, a memory including Read Only Memory - ROM, Random Access Memory - RAM, or any other type of memory, etc.) configured to store data.
  • Data repository 204 can be further configured to enable retrieval and/or update and/or deletion of the stored data. It is to be noted that in some cases, data repository 204 can be distributed, while the system 200 has access to the information stored thereon, e.g., via a wired or wireless network to which system 200 is able to connect (utilizing its network interface 206).
  • System 200 further comprises processing circuitry 202.
  • Processing circuitry 202 can be one or more processing units (e.g., central processing units), microprocessors, microcontrollers (e.g., microcontroller units (MCUs)) or any other computing devices or modules, including multiple and/or parallel and/or distributed processing units, which are adapted to independently or cooperatively process data for controlling relevant system 200 resources and for enabling operations related to the resources of system 200.
  • the processing circuitry 202 comprises a sensitive variable identifying module 208, configured to perform a sensitive variable identifying process, as further detailed herein, inter alia with reference to Fig. 3.
  • system 200 can operate as a standalone system without the need for network interface 206 and/or data repository 204. Adding one or both of these elements to system 200 is optional and not mandatory, as system 200 can operate according to its intended use either way.
  • Turning to Fig. 3, there is shown a flowchart illustrating one example of operations carried out by the sensitive variable identifying system 200, in accordance with the presently disclosed subject matter.
  • the sensitive variable identifying system 200 (also interchangeably referred to hereafter as “system 200”) can be configured to perform a sensitive variable identifying process 300, e.g., using sensitive variable identifying module 208.
  • system 200 obtains a plurality of expressive variable identifiers each of which is (a) indicative of sensitive content of a respective defined variable affiliated with it, and (b) associated with a respective machine learning model of one or more machine learning models capable of receiving a vector associated with a given variable and labeling the given variable as either associated with the expressive variable identifier or not.
  • system 200 obtains at least one code segment including a plurality of code lines containing a definition of at least one defined variable having a non-expressive variable identifier.
  • the non-expressive variable identifier can be, for example, a variable identifier that is not included in the plurality of expressive variable identifiers obtained (block 302).
  • system 200 obtains a plurality of expressive variable identifiers including: ID number, Social Security Number (SSN), and password, each of which is indicative of sensitive content of a respective defined variable affiliated with it and associated with a respective machine learning model, and the code segment 100 of Fig. 1 which includes the defined “String” variable associated with the non-expressive variable identifier 102b, denoted “x1”, which is not included in the plurality of expressive variable identifiers obtained.
  • Each machine learning model associated with a given expressive variable identifier of the plurality of expressive variable identifiers can be generated and trained, for example, by: obtaining one or more code segments including a plurality of code lines containing a definition of a variable having the given expressive variable identifier; identifying, utilizing the one or more code segments, collections of terms associated with the given expressive variable identifier; generating one or more labeled vectors, each of which is based on an identified collection of terms of the collections of terms and associated with a label of the given expressive variable identifier; and training the machine learning model associated with the given expressive variable identifier, based on the one or more labeled vectors.
  • the machine learning model associated with variable identifier 102a is generated and trained by: obtaining at least one code segment containing the definition of a variable using variable identifier 102a; identifying, utilizing the one or more code segments, collections of terms associated with variable identifier 102a; generating one or more labeled vectors, each of which is based on an identified collection of terms of the collections of terms and associated with a label of the variable identifier 102a; and training the machine learning model associated with variable identifier 102a, based on the one or more labeled vectors.
  • the one or more labeled vectors are generated using a word embedding method.
  • the word embedding method can be, for example, one or more of: one hot encoding, Word2Vec, Fasttext, GloVe, CBOW, and the like.
  • system 200 identifies, utilizing the at least one code segment, a collection of terms associated with the at least one defined variable (block 304).
  • the collection of terms can include, for example, one or more function identifiers, one or more variable names, and the like.
  • system 200 identifies the function identifiers “getEmail” and “checkEmail” as terms associated with the defined “String” variable associated with variable identifier 102b, denoted “x1”.
  • From the identified collection of terms, system 200 generates a defined variable vector (block 306).
  • the defined variable vector can be generated, for example, using a word embedding method, such as One hot encoding, Word2Vec, Fasttext, Global Vectors for Word Representation (GloVe), Continuous Bag of Words Model (CBOW), and the like, and/or be assembled of a collection of partially defined vectors, each of which can be assembled, for example, by using a subset of the terms, for example terms that are associated with a given code flow within code segment 100 as further explained herein, or by using a different word embedding method of the word embedding methods mentioned hereinbefore.
  • at least one of the terms has multiple vector representations. In such cases, a corresponding machine learning model is generated in a way that it can take the multiple vector representations as input.
  • Based on the identified function identifiers “getEmail” and “checkEmail”, system 200 generates a defined variable vector associated with the defined “String” variable, using the word embedding method Word2Vec.
  • system 200 determines the at least one defined variable as sensitive by determining whether the generated defined variable vector is associated with a given expressive variable identifier of the plurality of expressive variable identifiers, utilizing at least one of the machine learning models (block 308).
  • system 200 determines whether the defined variable vector associated with the defined “String” variable is associated with any expressive variable identifier of the plurality of expressive variable identifiers: ID number, Social Security Number (SSN), and password, utilizing their respective machine learning models.
  • Each machine learning model receives the defined variable vector of the defined “String” variable and labels the defined “String” variable as either associated with the expressive variable identifier it is associated with or not.
  • each given expressive variable identifier can be associated with a distinct machine learning model that is not associated with any other expressive variable identifier of the plurality of expressive variable identifiers.
  • the plurality of expressive variable identifiers can all be associated with the same machine learning model.
  • each given expressive variable identifier of the plurality of expressive variable identifiers is associated with additional expressive variable identifiers that also indicate the content of the respective variable associated with the given expressive variable identifier.
  • the expressive variable identifier “Password” and the additional expressive variable identifiers “Password1”, “Password 1”, “Pass”, “Pass1”, “Pass 1”, which are permutations of the expressive variable identifier “Password”, are all indicative of the same content of the respective variable associated with the expressive variable identifier “Password”.
  • the at least one code segment includes a plurality of code flows, each including a plurality of code lines.
  • system 200 identifies at least one code flow of the plurality of code flows in which the defined variable having the non-expressive variable identifier is present so as to operate only on said identified at least one code flow.
  • each code flow of the plurality of code flows assembling the at least one code segment includes a plurality of code levels, and the identification of the defined variable within the plurality of code levels of each code flow is carried out up to a code level threshold.
  • a code flow can include a plurality of code lines in which a main class calls a first function, which calls a second function, which in turn calls a third function.
  • the code flow consists of four code levels, main class - level one, first function - level two, second function - level three, and third function - level four.
  • the identification of the defined variable within the four code levels of the code flow would be carried out up to the code level threshold, which, in this case, can be between 1 and 4.
  • In some cases, as illustrated in Fig. 4A, system 200 can determine whether the content of a defined variable associated with a non-expressive variable identifier is sensitive or not, based on one or more defined variables, each associated with an expressive identifier, and/or one or more function/class identifiers, which are also present within the code flow in which the defined variable is present. For example, as shown in Fig. 4A, four defined “String” variables are within the class “Payment”. Of the four defined “String” variables, three variables are associated with expressive identifiers indicating their content (“CVV”, “username”, and “SSN”), while the fourth “String” variable is associated with a non-expressive identifier (“CC”).
  • Based on the expressive identifiers and the class identifier present in the same code flow, system 200 can determine that the content of the defined variable associated with the non-expressive identifier “CC” is sensitive. In some cases, system 200 can further determine that the term “CC” represents “credit card”.
  • system 200 can determine whether the content of a defined variable associated with a non-expressive variable identifier is sensitive or not by converting its content into a regular expression and determining whether a specific letter or sign is present within said regular expression. For example, as shown in Fig. 4B, system 200 converts the content of a defined “String” variable associated with a non-expressive identifier “str” into a regular expression. System 200 then determines whether the regular expression includes an at sign (@) and, upon identifying an at sign within the regular expression, determines that the content of the defined “String” variable refers to an email address.
  • the system can be implemented, at least partly, as a suitably programmed computer.
  • the presently disclosed subject matter contemplates a computer program being readable by a computer for executing the disclosed method.
  • the presently disclosed subject matter further contemplates a machine-readable memory tangibly embodying a program of instructions executable by the machine for executing the disclosed method.
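
By way of non-limiting illustration, the code level threshold described above (main class at level one, first function at level two, and so on) can be sketched as a bounded breadth-first walk over a call graph. The dictionary-based call graph, the set of code units mentioning the variable, and the function name `levels_containing` are assumptions made for this sketch; the specification does not prescribe how code flows are represented.

```python
from collections import deque
from typing import Dict, List, Set, Tuple


def levels_containing(
    call_graph: Dict[str, List[str]],
    uses_variable: Set[str],
    entry: str,
    level_threshold: int,
) -> List[Tuple[str, int]]:
    """Return (code unit, level) pairs in which the variable is present,
    searching no deeper than level_threshold. The entry unit is level one."""
    found: List[Tuple[str, int]] = []
    seen = {entry}
    queue = deque([(entry, 1)])
    while queue:
        unit, level = queue.popleft()
        if level > level_threshold:
            continue  # beyond the code level threshold: do not inspect
        if unit in uses_variable:
            found.append((unit, level))
        for callee in call_graph.get(unit, []):
            if callee not in seen:
                seen.add(callee)
                queue.append((callee, level + 1))
    return found


# Hypothetical code flow: Main -> first -> second -> third (four code levels).
CALL_GRAPH = {"Main": ["first"], "first": ["second"], "second": ["third"], "third": []}
USES_VARIABLE = {"first", "third"}  # assumed units in which the variable appears

within_three = levels_containing(CALL_GRAPH, USES_VARIABLE, "Main", 3)
within_four = levels_containing(CALL_GRAPH, USES_VARIABLE, "Main", 4)
print(within_three, within_four)
```

With a threshold of three, the occurrence in the third function (level four) is not examined; raising the threshold to four includes it.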

Abstract

The presently disclosed subject matter is directed to a sensitive variable identifying system and method configured to determine at least one defined variable as sensitive by determining whether the variable vector of the at least one defined variable is associated with an expressive variable identifier of a plurality of expressive variable identifiers, each of which is associated with defined variables holding sensitive content, utilizing at least one machine learning model.

Description

A SENSITIVE VARIABLE IDENTIFYING SYSTEM AND METHOD
TECHNICAL FIELD
The present invention relates to the field of systems and methods for identifying sensitive and personal variables.
BACKGROUND
Computer programming, also known as coding, is the process of writing code involving a set of instructions intended to facilitate specific actions to be executed by a computer. The set of instructions may involve the use of various variables, at least some of which may be associated with personal and sensitive data (e.g., as defined by privacy regulations such as the General Data Protection Regulation (GDPR)). To protect this sensitive data from unauthorized disclosure, data protection, privacy, and security regulations must be complied with, and the variables associated with the sensitive data must be identified.
Nowadays, in order to identify variables associated with sensitive data, existing privacy and security management solutions focus on identifying meaningful or expressive identifiers (e.g., variable names) that imply the content of the variables to which they are associated. In doing so, these solutions are entirely dependent on the level of expressiveness of each variable identifier as given by the coder of the code and are unable to identify variables associated with non-expressive, meaningless or trivial identifiers as holding sensitive data.
Thus, there is a need in the art for a new sensitive variable identifying system and method.
GENERAL DESCRIPTION
In accordance with a first aspect of the presently disclosed subject matter, there is provided a sensitive variable identifying system comprising a processing circuitry configured to: obtain: (a) a plurality of expressive variable identifiers, each given expressive variable identifier being: (i) associated with a respective machine learning model of one or more machine learning models, capable of receiving a vector associated with a variable and labeling the variable as either associated with the given expressive variable identifier or not, and (ii) indicative of sensitive content of a respective variable associated with the given expressive variable identifier, and (b) at least one code segment including a plurality of code lines containing a definition of at least one defined variable having a non-expressive variable identifier, wherein the non-expressive variable identifier is not included in the plurality of expressive variable identifiers; identify, utilizing the at least one code segment, a collection of terms associated with the at least one defined variable; generate a defined variable vector from the collection of terms; and determine the at least one defined variable as sensitive by determining whether the defined variable vector is associated with a given expressive variable identifier of the plurality of expressive variable identifiers, utilizing at least one of the machine learning models.
In some cases, each machine learning model, associated with the given expressive variable identifier, is generated by: obtaining one or more code segments including a plurality of code lines, wherein the plurality of code lines includes a definition of a variable having the given expressive variable identifier; identifying, utilizing the one or more code segments, collections of terms associated with the given expressive variable identifier; generating one or more labeled vectors, each of which is based on an identified collection of terms of the collections of terms and associated with a label of the given expressive variable identifier; and training the machine learning model associated with the given expressive variable identifier, based on the one or more labeled vectors.
In some cases, the one or more labeled vectors are generated using a word embedding method.
In some cases, the word embedding method is one or more of: one hot encoding, Word2Vec, Fasttext, GloVe, CBOW, or any combination thereof.
In some cases, each given expressive variable identifier is associated with a distinct machine learning model.
In some cases, the defined variable vector is generated from a collection of partially defined vectors.
In some cases, each given expressive variable identifier of the plurality of expressive variable identifiers is associated with additional expressive variable identifiers that are also indicative of the content of the respective variable associated with the given expressive variable identifier.
In some cases, the collection of terms includes: one or more function identifiers, one or more variable names, or any combination thereof.
In some cases, the vector is generated using a word embedding method.
In some cases, the word embedding method is one or more of: one hot encoding, Word2Vec, Fasttext, GloVe, CBOW, or any combination thereof.
In some cases, (i) the at least one code segment includes a plurality of code flows, each including a plurality of code lines, and (ii) prior to the identify step, the processing circuitry is configured to identify at least one code flow of the plurality of code flows in which the defined variable having the non-expressive variable identifier is present.
In some cases, (i) each code flow of the plurality of code flows includes a plurality of code levels, each including a plurality of code lines, and (ii) the identification of the defined variable within the plurality of code levels of each code flow is carried out up to a code level threshold.
In accordance with a second aspect of the presently disclosed subject matter, there is provided a sensitive variable identifying method comprising: obtaining: (a) a plurality of expressive variable identifiers, each given expressive variable identifier being: (i) associated with a respective machine learning model of one or more machine learning models, capable of receiving a vector associated with a variable and labeling the variable as either associated with the given expressive variable identifier or not, and (ii) indicative of sensitive content of a respective variable associated with the given expressive variable identifier, and (b) at least one code segment including a plurality of code lines containing a definition of at least one defined variable having a non-expressive variable identifier, wherein the non-expressive variable identifier is not included in the plurality of expressive variable identifiers; identifying, utilizing the at least one code segment, a collection of terms associated with the at least one defined variable; generating a defined variable vector from the collection of terms; and determining the at least one defined variable as sensitive by determining whether the defined variable vector is associated with a given expressive variable identifier of the plurality of expressive variable identifiers, utilizing at least one of the machine learning models.
In some cases, each machine learning model, associated with the given expressive variable identifier, is generated by: obtaining one or more code segments including a plurality of code lines, wherein the plurality of code lines includes a definition of a variable having the given expressive variable identifier; identifying, utilizing the one or more code segments, collections of terms associated with the given expressive variable identifier; generating one or more labeled vectors, each of which is based on an identified collection of terms of the collections of terms and associated with a label of the given expressive variable identifier; and training the machine learning model associated with the given expressive variable identifier, based on the one or more labeled vectors.
In some cases, the one or more labeled vectors are generated using a word embedding method.
In some cases, the word embedding method is one or more of: one hot encoding, Word2Vec, Fasttext, GloVe, CBOW, or any combination thereof.
In some cases, each given expressive variable identifier is associated with a distinct machine learning model.
In some cases, the defined variable vector is generated from a collection of partially defined vectors.
In some cases, each given expressive variable identifier of the plurality of expressive variable identifiers is associated with additional expressive variable identifiers that are also indicative of the content of the respective variable associated with the given expressive variable identifier.
In some cases, the collection of terms includes: one or more function identifiers, one or more variable names, or any combination thereof.
In some cases, the vector is generated using a word embedding method.
In some cases, the word embedding method is one or more of: one hot encoding, Word2Vec, Fasttext, GloVe, CBOW, or any combination thereof.
In some cases, (i) the at least one code segment includes a plurality of code flows, each including a plurality of code lines, and (ii) prior to the identify step, the processing circuitry is configured to identify at least one code flow of the plurality of code flows in which the defined variable having the non-expressive variable identifier is present.
In some cases, (i) each code flow of the plurality of code flows includes a plurality of code levels, each including a plurality of code lines, and (ii) the identification of the defined variable within the plurality of code levels of each code flow is carried out up to a code level threshold.
In accordance with a third aspect of the presently disclosed subject matter, there is provided a non-transitory computer readable storage medium having computer readable program code embodied therewith, the computer readable program code being executable by at least one processor to perform a sensitive variable identifying method, the method comprising: obtaining: (a) a plurality of expressive variable identifiers, each given expressive variable identifier being: (i) associated with a respective machine learning model of one or more machine learning models, capable of receiving a vector associated with a variable and labeling the variable as either associated with the given expressive variable identifier or not, and (ii) indicative of sensitive content of a respective variable associated with the given expressive variable identifier, and (b) at least one code segment including a plurality of code lines containing a definition of at least one defined variable having a non-expressive variable identifier, wherein the non-expressive variable identifier is not included in the plurality of expressive variable identifiers; identifying, utilizing the at least one code segment, a collection of terms associated with the at least one defined variable; generating a defined variable vector from the collection of terms; and determining the at least one defined variable as sensitive by determining whether the defined variable vector is associated with a given expressive variable identifier of the plurality of expressive variable identifiers, utilizing at least one of the machine learning models.
BRIEF DESCRIPTION OF THE DRAWINGS
In order to understand the presently disclosed subject matter and to see how it may be carried out in practice, the subject matter will now be described, by way of non-limiting examples only, with reference to the accompanying drawings, in which:
Fig. 1 is a schematic illustration of a code segment on which the sensitive variable identifying system operates, in accordance with the presently disclosed subject matter;
Fig. 2 is a block diagram schematically illustrating one example of a sensitive variable identifying system, in accordance with the presently disclosed subject matter;
Fig. 3 is a flowchart illustrating an example of a sequence of operations carried out by a sensitive variable identifying system, in accordance with the presently disclosed subject matter; and,
Figs. 4A-4B are exemplary code segments on which the sensitive variable identifying system operates, in accordance with the presently disclosed subject matter.
DETAILED DESCRIPTION
In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the presently disclosed subject matter. However, it will be understood by those skilled in the art that the presently disclosed subject matter may be practiced without these specific details. In other instances, well-known methods, procedures, and components have not been described in detail so as not to obscure the presently disclosed subject matter.
In the drawings and descriptions set forth, identical reference numerals indicate those components that are common to different embodiments or configurations.
Unless specifically stated otherwise, as apparent from the following discussions, it is appreciated that throughout the specification discussions utilizing terms such as “obtaining”, “identifying”, “generating”, “determining” or the like, include actions and/or processes of a computer that manipulate and/or transform data into other data, said data represented as physical quantities, such as electronic quantities, and/or said data representing the physical objects. The terms “computer”, “processor”, “processing resource”, “processing circuitry”, and “controller” should be expansively construed to cover any kind of electronic device with data processing capabilities, including, by way of non-limiting example, a personal desktop/laptop computer, a server, a computing system, a communication device, a smartphone, a tablet computer, a smart television, a processor (e.g. digital signal processor (DSP), a microcontroller, a field programmable gate array (FPGA), an application specific integrated circuit (ASIC), etc.), a group of multiple physical machines sharing performance of various tasks, virtual servers co-residing on a single physical machine, any other electronic computing device, and/or any combination thereof.
The operations in accordance with the teachings herein may be performed by a computer specially constructed for the desired purposes or by a general-purpose computer specially configured for the desired purpose by a computer program stored in a non-transitory computer readable storage medium. The term "non-transitory" is used herein to exclude transitory, propagating signals, but to otherwise include any volatile or nonvolatile computer memory technology suitable to the application.
As used herein, the phrase "for example," "such as", "for instance" and variants thereof describe non-limiting embodiments of the presently disclosed subject matter. Reference in the specification to "one case", "some cases", "other cases" or variants thereof means that a particular feature, structure or characteristic described in connection with the embodiment(s) is included in at least one embodiment of the presently disclosed subject matter. Thus, the appearance of the phrase "one case", "some cases", "other cases" or variants thereof does not necessarily refer to the same embodiment(s).
It is appreciated that, unless specifically stated otherwise, certain features of the presently disclosed subject matter, which are, for clarity, described in the context of separate embodiments, may also be provided in combination in a single embodiment. Conversely, various features of the presently disclosed subject matter, which are, for brevity, described in the context of a single embodiment, may also be provided separately or in any suitable sub-combination.
In embodiments of the presently disclosed subject matter, fewer, more and/or different stages than those shown in Fig. 3 may be executed. In embodiments of the presently disclosed subject matter one or more stages illustrated in Fig. 3 may be executed in a different order and/or one or more groups of stages may be executed simultaneously. Fig. 2 illustrates a general schematic of the system architecture in accordance with an embodiment of the presently disclosed subject matter. Each module in Fig. 2 can be made up of any combination of software, hardware and/or firmware that performs the functions as defined and explained herein. The modules in Fig. 2 may be centralized in one location or dispersed over more than one location. In other embodiments of the presently disclosed subject matter, the system may comprise fewer, more, and/or different modules than those shown in Fig. 2.
Any reference in the specification to a method should be applied mutatis mutandis to a system capable of executing the method and should be applied mutatis mutandis to a non-transitory computer readable medium that stores instructions that once executed by a computer result in the execution of the method.
Any reference in the specification to a system should be applied mutatis mutandis to a method that may be executed by the system and should be applied mutatis mutandis to a non-transitory computer readable medium that stores instructions that may be executed by the system.
Any reference in the specification to a non-transitory computer readable medium should be applied mutatis mutandis to a system capable of executing the instructions stored in the non-transitory computer readable medium and should be applied mutatis mutandis to a method that may be executed by a computer that reads the instructions stored in the non-transitory computer readable medium.
Bearing this in mind, attention is drawn to Fig. 1, showing a schematic illustration of a code segment on which the sensitive variable identifying system operates, in accordance with the presently disclosed subject matter.
As shown in the schematic illustration, code segment 100 includes a plurality of code lines, denoted L1 to Ln, forming a set of instructions intended to facilitate specific actions to be executed by a computer (not shown). The plurality of code lines, which can be written in any programming language, for example, Java, C#, TypeScript, Go, Kotlin, Scala, JavaScript, C++, C Language, Python, PHP Hypertext Preprocessor (PHP), Structured Query Language (SQL), and the like, may be arranged in one or more code flows and include the use of one or more variables directed to store data. The data stored can be sensitive, i.e., data that requires protection because its loss, misuse, modification, or unauthorized access will negatively impact the welfare, privacy, assets, or security of an organization or individual (e.g., gender, ID number, Social Security Number (SSN), password, and the like), or non-sensitive, i.e., data that does not require protection (e.g., metadata of services, products, etc.).
Each variable of the one or more variables within the one or more code flows of code segment 100 can be defined by type (e.g., char, integer, string, void, double, etc.) and be associated with a variable identifier. The variable identifier can be an expressive variable identifier, i.e., a variable identifier that at least implies the content of the variable to which it is associated, or a non-expressive variable identifier, i.e., a variable identifier that does not even imply the content of the variable to which it is associated. For example, as illustrated in Fig. 1, code segment 100 includes two defined variables, a “Long” variable and a “String” variable, associated with respective variable identifiers 102a and 102b. Of the two variable identifiers, variable identifier 102a, denoted “id”, is considered an expressive variable identifier, as it indicates that the content of the defined “Long” variable to which it is associated includes an ID number, whereas variable identifier 102b, denoted “xl”, is considered a non-expressive variable identifier, as it does not even imply the content of the defined “String” variable to which it is associated. From the denotation of variable identifier 102a it is evident that the content of the defined “Long” variable is associated with sensitive data, while the sensitivity of the content of the defined “String” variable associated with variable identifier 102b remains unclear.
To determine whether the content of the defined “String” variable associated with variable identifier 102b contains sensitive data, the sensitive variable identifying system of the presently disclosed subject matter operates on code segment 100, as will be described hereafter in reference to Fig. 3.
Attention is now drawn to a description of the components of the sensitive variable identifying system 200.
Fig. 2 is a block diagram schematically illustrating one example of the sensitive variable identifying system 200, in accordance with the presently disclosed subject matter.
In accordance with the presently disclosed subject matter, the sensitive variable identifying system 200 (also interchangeably referred to herein as “system 200”) can comprise a network interface 206. The network interface 206 (e.g., a network card, a WiFi client, a Li-Fi client, 3G/4G client, or any other component), enables system 200 to communicate over a network with external systems and handles inbound and outbound communications from such systems. For example, system 200 can receive, through network interface 206, a plurality of expressive variable identifiers associated with a respective machine learning model of one or more machine learning models.
System 200 can further comprise or be otherwise associated with a data repository 204 (e.g., a database, a storage system, a memory including Read Only Memory - ROM, Random Access Memory - RAM, or any other type of memory, etc.) configured to store data. Some examples of data that can be stored in the data repository 204 include:
• Expressive variable identifiers associated with respective defined variables;
• One or more machine learning models associated with the expressive variable identifiers;
• One or more code segments;
• Collections of terms associated with at least one defined variable;
• One or more vectors associated with at least one defined variable;
• One or more labeled vectors; etc.
Data repository 204 can be further configured to enable retrieval and/or update and/or deletion of the stored data. It is to be noted that in some cases, data repository 204 can be distributed, while the system 200 has access to the information stored thereon, e.g., via a wired or wireless network to which system 200 is able to connect (utilizing its network interface 206).
System 200 further comprises processing circuitry 202. Processing circuitry 202 can be one or more processing units (e.g., central processing units), microprocessors, microcontrollers (e.g., microcontroller units (MCUs)) or any other computing devices or modules, including multiple and/or parallel and/or distributed processing units, which are adapted to independently or cooperatively process data for controlling relevant resources of system 200 and for enabling operations related to the resources of system 200.
The processing circuitry 202 comprises a sensitive variable identifying module 208, configured to perform a sensitive variable identifying process, as further detailed herein, inter alia with reference to Fig. 3.
It should be noted that system 200 can operate as a standalone system without the need for network interface 206 and/or data repository 204. Adding one or both of these elements to system 200 is optional and not mandatory, as system 200 can operate according to its intended use either way.
Turning to Fig. 3 there is shown a flowchart illustrating one example of operations carried out by the sensitive variable identifying system 200, in accordance with the presently disclosed subject matter.
Accordingly, the sensitive variable identifying system 200 (also interchangeably referred to hereafter as “system 200”) can be configured to perform a sensitive variable identifying process 300, e.g., using sensitive variable identifying module 208.
For this purpose, system 200 obtains a plurality of expressive variable identifiers each of which is (a) indicative of sensitive content of a respective defined variable affiliated with it, and (b) associated with a respective machine learning model of one or more machine learning models capable of receiving a vector associated with a given variable and labeling the given variable as either associated with the expressive variable identifier or not. In addition, system 200 obtains at least one code segment including a plurality of code lines containing a definition of at least one defined variable having a non-expressive variable identifier. The non-expressive variable identifier can be, for example, a variable identifier that is not included in the plurality of expressive variable identifiers obtained (block 302).
By way of example, system 200 obtains a plurality of expressive variable identifiers including: ID number, Social Security Number (SSN), and password, each of which is indicative of sensitive content of a respective defined variable affiliated with it and associated with a respective machine learning model, and the code segment 100 of Fig. 1 which includes the defined “String” variable associated with the non-expressive variable identifier 102b, denoted “xl”, which is not included in the plurality of expressive variable identifiers obtained.
Each machine learning model associated with a given expressive variable identifier of the plurality of expressive variable identifiers can be generated and trained, for example, by: obtaining one or more code segments including a plurality of code lines containing a definition of a variable having the given expressive variable identifier; identifying, utilizing the one or more code segments, collections of terms associated with the given expressive variable identifier; generating one or more labeled vectors, each of which is based on an identified collection of terms of the collections of terms and associated with a label of the given expressive variable identifier; and training the machine learning model associated with the given expressive variable identifier, based on the one or more labeled vectors. For example, the machine learning model associated with variable identifier 102a, denoted “id”, is generated and trained by: obtaining at least one code segment containing the definition of a variable using variable identifier 102a; identifying, utilizing the one or more code segments, collections of terms associated with variable identifier 102a; generating one or more labeled vectors, each of which is based on an identified collection of terms of the collections of terms and associated with a label of the variable identifier 102a; and training the machine learning model associated with variable identifier 102a, based on the one or more labeled vectors.
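By way of non-limiting illustration only, the training procedure described above can be sketched in Python as follows. The sketch assumes a one hot style vector over a small vocabulary of terms and a simple perceptron standing in for the machine learning model; neither choice is mandated by the presently disclosed subject matter. The names `vectorize`, `train_perceptron`, `predict`, the vocabulary, and the sample term collections are hypothetical and introduced solely for the illustration.

```python
from typing import List, Sequence, Tuple

Model = Tuple[List[float], float]  # (weights, bias)


def vectorize(terms: Sequence[str], vocabulary: Sequence[str]) -> List[int]:
    """Return 1 for each vocabulary entry that occurs in the collection of terms."""
    present = {term.lower() for term in terms}
    return [1 if word in present else 0 for word in vocabulary]


def train_perceptron(samples: Sequence[Tuple[List[int], int]], epochs: int = 20) -> Model:
    """Train a binary perceptron on (labeled vector, label) pairs, label being 1 or 0."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for vector, label in samples:
            activation = sum(w * x for w, x in zip(weights, vector)) + bias
            error = label - (1 if activation > 0 else 0)
            if error:
                weights = [w + error * x for w, x in zip(weights, vector)]
                bias += error
    return weights, bias


def predict(model: Model, vector: Sequence[int]) -> bool:
    """Label a vector as associated (True) or not (False) with the identifier."""
    weights, bias = model
    return sum(w * x for w, x in zip(weights, vector)) + bias > 0


# Hypothetical vocabulary and term collections; label 1 means the collection
# was identified around a variable named with the expressive identifier "id".
VOCABULARY = ["getid", "validateid", "getemail", "checkemail", "getprice"]
labeled_vectors = [
    (vectorize(["getId", "validateId"], VOCABULARY), 1),
    (vectorize(["getId"], VOCABULARY), 1),
    (vectorize(["getEmail", "checkEmail"], VOCABULARY), 0),
    (vectorize(["getPrice"], VOCABULARY), 0),
]
id_model = train_perceptron(labeled_vectors)
```

In this sketch, one such model would be trained per expressive variable identifier, each on its own set of labeled vectors.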
In some cases, the one or more labeled vectors are generated using a word embedding method. The word embedding method can be, for example, one or more of: one hot encoding, Word2Vec, Fasttext, GloVe, CBOW, and the like.
Once the plurality of expressive variable identifiers and the at least one code segment are obtained, system 200 identifies, utilizing the at least one code segment, a collection of terms associated with the at least one defined variable (block 304). The collection of terms can include, for example, one or more function identifiers, one or more variable names, and the like.
By way of example and in accordance with the example above, system 200 identifies the function identifiers “getEmail” and “checkEmail” as terms associated with the defined “String” variable associated with variable identifier 102b, denoted “xl”.
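By way of non-limiting illustration only, one possible way of identifying such a collection of terms is sketched below in Python. The sketch assumes that the relevant terms are the function identifiers appearing on code lines in which the defined variable is present, and it uses a simple pattern match rather than a full parser. The function `collect_terms`, the keyword list, and the sample code lines are hypothetical; the sample lines are not a reproduction of Fig. 1.

```python
import re
from typing import List

# A name immediately followed by "(" is treated as a function identifier.
CALL_PATTERN = re.compile(r"\b([A-Za-z_]\w*)\s*\(")
# Control-flow keywords that also precede "(" and must not be treated as terms.
KEYWORDS = {"if", "for", "while", "switch", "return", "catch"}


def collect_terms(code_lines: List[str], variable: str) -> List[str]:
    """Collect function identifiers from the lines that mention the variable."""
    variable_pattern = re.compile(rf"\b{re.escape(variable)}\b")
    terms: List[str] = []
    for line in code_lines:
        if not variable_pattern.search(line):
            continue  # the defined variable is not present on this line
        for name in CALL_PATTERN.findall(line):
            if name not in KEYWORDS and name != variable and name not in terms:
                terms.append(name)
    return terms


# Hypothetical code lines written for this illustration.
segment = [
    "Long id = getId();",
    "String xl = getEmail();",
    "boolean ok = checkEmail(xl);",
    "log(id);",
]
print(collect_terms(segment, "xl"))  # ['getEmail', 'checkEmail']
```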
From the identified collection of terms, system 200 generates a defined variable vector (block 306). The defined variable vector can be generated, for example, using a word embedding method, such as One hot encoding, Word2Vec, Fasttext, Global Vectors for Word Representation (GloVe), Continuous Bag of Words Model (CBOW), and the like, and/or be assembled of a collection of partially defined vectors, each of which can be assembled, for example, by using a subset of the terms, for example terms that are associated with a given code flow within code segment 100 as further explained herein, or by using a different word embedding method of the word embedding methods mentioned hereinbefore. In some cases, at least one of the terms has multiple vector representation. In such cases, a corresponding machine learning model is generated in a way that it can take the multiple vector representations as input.
By way of example and in accordance with the example above, based on the identified function identifiers “getEmail” and “checkEmail”, system 200 generates a defined variable vector associated with the defined “String” variable, using the word embedding method Word2Vec.
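By way of non-limiting illustration only, the generation of a defined variable vector from a collection of terms can be sketched in Python as follows. For the sake of a self-contained example, the sketch substitutes a count-based one hot encoding for Word2Vec, and it assumes that each term is first split into its constituent words. The functions `split_identifier` and `build_vector` and the vocabulary are hypothetical.

```python
import re
from typing import Dict, List, Sequence


def split_identifier(identifier: str) -> List[str]:
    """Split a camelCase or snake_case identifier into lowercase words."""
    words = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", identifier)
    return [word.lower() for word in words]


def build_vector(terms: Sequence[str], vocabulary: Sequence[str]) -> List[int]:
    """Sum the one hot vectors of every word found in the collection of terms."""
    counts: Dict[str, int] = {}
    for term in terms:
        for word in split_identifier(term):
            counts[word] = counts.get(word, 0) + 1
    return [counts.get(word, 0) for word in vocabulary]


# Hypothetical vocabulary; a practical vocabulary would be derived from training code.
VOCABULARY = ["get", "check", "email", "id", "password", "ssn"]
defined_variable_vector = build_vector(["getEmail", "checkEmail"], VOCABULARY)
print(defined_variable_vector)  # [1, 1, 2, 0, 0, 0]
```

Partially defined vectors, for example one per code flow, could be built with the same function from subsets of the terms and then concatenated or summed; the manner of combination is a design choice left open here.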
Following the generation of the defined variable vector, system 200 determines the at least one defined variable as sensitive by determining whether the generated defined variable vector is associated with a given expressive variable identifier of the plurality of expressive variable identifiers, utilizing at least one of the machine learning models (block 308).
By way of example and in accordance with the example above, system 200 determines whether the defined variable vector associated with the defined “String” variable is associated with any expressive variable identifier of the plurality of expressive variable identifiers: ID number, Social Security Number (SSN), and password, utilizing their respective machine learning models. Each machine learning model receives the defined variable vector of the defined “String” variable and labels the defined “String” variable as either associated with the expressive variable identifier it is associated with or not. In this example, following the evaluation of the defined variable vector of the defined “String” variable by each machine learning model, none of the machine learning models has labeled it as associated with its respective expressive variable identifier. As such, the content of the defined “String” variable associated with variable identifier 102b, denoted “xl”, is determined to be non-sensitive.
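By way of non-limiting illustration only, the determination of block 308 can be sketched in Python as a loop over the machine learning models, one per expressive variable identifier. The models below are hypothetical stand-ins (simple rules over an assumed six-entry vector) rather than trained models, and the function `matching_identifiers` is likewise introduced solely for the illustration.

```python
from typing import Callable, Dict, List, Sequence

# A model receives a vector and labels it as associated (True) or not (False).
Model = Callable[[Sequence[float]], bool]


def matching_identifiers(vector: Sequence[float], models: Dict[str, Model]) -> List[str]:
    """Return every expressive variable identifier whose model accepts the vector."""
    return [identifier for identifier, model in models.items() if model(vector)]


# Hypothetical stand-in models. The vector layout is assumed to be
# [get, check, email, id, password, ssn] for the purpose of this sketch.
MODELS: Dict[str, Model] = {
    "ID number": lambda v: v[3] > 0,
    "Social Security Number (SSN)": lambda v: v[5] > 0,
    "password": lambda v: v[4] > 0,
}

email_like_vector = [1, 1, 2, 0, 0, 0]
matches = matching_identifiers(email_like_vector, MODELS)
is_sensitive = bool(matches)
print(matches, is_sensitive)  # [] False
```

With these stand-in models, no model accepts the vector, which mirrors the example above in which the variable is determined to be non-sensitive with respect to the obtained identifiers.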
In some cases, each given expressive variable identifier can be associated with a distinct machine learning model that is not associated with any other expressive variable identifier of the plurality of expressive variable identifiers. In other cases, the plurality of expressive variable identifiers can all be associated with the same machine learning model.
In some cases, each given expressive variable identifier of the plurality of expressive variable identifiers is associated with additional expressive variable identifiers that also indicate the content of the respective variable associated with the given expressive variable identifier. For example, the expressive variable identifier “Password” and the additional expressive variable identifiers “Password1”, “Password 1”, “Pass”, “Pass1”, “Pass 1”, which are variations of the expressive variable identifier “Password”, are all indicative of the same content of the respective variable associated with the expressive variable identifier “Password”.
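By way of non-limiting illustration only, the association between a given expressive variable identifier and its additional expressive variable identifiers can be sketched in Python as a normalization step followed by a lookup. The normalization rule (removing spaces, underscores, and digits and ignoring case), the alias table, and the function `canonical_identifier` are assumptions made for this sketch.

```python
import re
from typing import Dict, Optional, Set

# Hypothetical alias table: canonical identifier -> normalized additional identifiers.
ALIASES: Dict[str, Set[str]] = {
    "Password": {"password", "pass"},
}


def canonical_identifier(name: str) -> Optional[str]:
    """Map a variable identifier to its canonical expressive identifier, if any."""
    normalized = re.sub(r"[\s_\d]+", "", name).lower()
    for canonical, variants in ALIASES.items():
        if normalized in variants:
            return canonical
    return None  # not a known expressive variable identifier


for candidate in ["Password 1", "Pass", "Pass 1", "xl"]:
    print(candidate, "->", canonical_identifier(candidate))
```

Identifiers for which the lookup returns `None` would be treated as non-expressive and passed on to the vector-based determination.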
In some cases, the at least one code segment includes a plurality of code flows, each including a plurality of code lines. In such cases, prior to the identifying of the collection of terms associated with the at least one defined variable, described hereinbefore, system 200 identifies at least one code flow of the plurality of code flows in which the defined variable having the non-expressive variable identifier is present so as to operate only on said identified at least one code flow.
In other cases, each code flow of the plurality of code flows assembling the at least one code segment includes a plurality of code levels, and the identification of the defined variable within the plurality of code levels of each code flow is carried out up to a code level threshold. For example, a code flow can include a plurality of code lines in which a main class calls a first function that calls a second function that calls a third function. As such, the code flow consists of four code levels: main class - level one, first function - level two, second function - level three, and third function - level four. The identification of the defined variable within the four code levels of the code flow would be carried out up to the code level threshold, which, in this case, can be between 1 and 4.
In some cases, as illustrated in Fig. 4A, system 200 can determine whether the content of a defined variable associated with a non-expressive variable identifier is sensitive or not, based on one or more defined variables, each associated with an expressive identifier, and/or one or more function/class identifiers, which are also present within the code flow in which the defined variable is present. For example, as shown in Fig. 4A, four defined "String" variables are within the class "Payment". Of the four defined "String" variables, three variables are associated with expressive identifiers indicating their content ("CVV", "username", and "SSN"), while the fourth "String" variable is associated with a non-expressive identifier ("CC"). From the expressive identifiers of the three variables and the class identifier, system 200 can determine that the content of the defined variable associated with the non-expressive identifier "CC" is sensitive. In some cases, system 200 can further determine that the term "CC" represents "credit card".
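By way of non-limiting illustration only, the code level threshold discussed above can be sketched in Python as a level-by-level traversal of a call graph that stops once the threshold is reached. The representation of a code flow as a dictionary of callers and callees, the function `units_up_to_threshold`, and the unit names are hypothetical.

```python
from typing import Dict, List


def units_up_to_threshold(call_graph: Dict[str, List[str]], entry: str, threshold: int) -> List[str]:
    """Return the code units reachable from the entry within the code level threshold.

    The entry unit is level one, the units it calls are level two, and so on.
    """
    visited: List[str] = []
    frontier = [entry]
    level = 1
    while frontier and level <= threshold:
        next_frontier: List[str] = []
        for unit in frontier:
            if unit in visited:
                continue  # guards against recursion in the call graph
            visited.append(unit)
            next_frontier.extend(call_graph.get(unit, []))
        frontier = next_frontier
        level += 1
    return visited


# Hypothetical code flow with four code levels, as in the example above.
CALL_GRAPH = {
    "Main": ["firstFunction"],
    "firstFunction": ["secondFunction"],
    "secondFunction": ["thirdFunction"],
    "thirdFunction": [],
}
print(units_up_to_threshold(CALL_GRAPH, "Main", 2))  # ['Main', 'firstFunction']
```

Only the code units returned by the traversal would then be searched for the defined variable, which bounds the amount of code examined per code flow.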
In some cases, as illustrated in Fig. 4B, system 200 can determine whether the content of a defined variable associated with a non-expressive variable identifier is sensitive or not by converting its content into a regular expression and determining whether a specific letter or sign is present within said regular expression. For example, as shown in Fig. 4B, system 200 converts the content of a defined "String" variable associated with a non-expressive identifier "str" into a regular expression. System 200 then determines whether the regular expression includes an at sign ("@") and, upon identifying an at sign within the regular expression, determines that the content of the defined "String" variable refers to an email address.
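By way of non-limiting illustration only, the conversion of content into a regular expression can be sketched in Python as follows. The sketch assumes that runs of letters and runs of digits are generalized into character classes while every other character is kept as a literal, and that the sign checked for is the "@" character found in email addresses. The functions `content_to_regex` and `refers_to_email` and the sample contents are hypothetical and are not taken from Fig. 4B.

```python
import re
from itertools import groupby

LETTERS = "[A-Za-z]+"
DIGITS = r"\d+"


def char_class(ch: str) -> str:
    """Map a character to a generalized class or to its escaped literal."""
    if ch.isascii() and ch.isalpha():
        return LETTERS
    if ch.isascii() and ch.isdigit():
        return DIGITS
    return re.escape(ch)


def content_to_regex(content: str) -> str:
    """Generalize letter and digit runs; keep all other characters as literals."""
    parts = []
    for cls, group in groupby(content, key=char_class):
        if cls in (LETTERS, DIGITS):
            parts.append(cls)  # a whole run collapses into one class
        else:
            parts.extend(cls for _ in group)  # repeat the literal per character
    return "".join(parts)


def refers_to_email(content: str) -> bool:
    """Check whether the generated regular expression contains the "@" sign."""
    return "@" in content_to_regex(content)


print(refers_to_email("john.doe42@example.com"))
print(refers_to_email("order 1234"))
```

The first call reports that the content refers to an email address and the second does not, since only the first generated regular expression retains an "@" literal.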
It is to be noted, with reference to Fig. 3, that some of the blocks can be integrated into a consolidated block or can be broken down into a few blocks, and/or other blocks may be added. It is to be further noted that some of the blocks are optional. It should also be noted that whilst the flow diagram is described also with reference to the system elements that realize them, this is by no means binding, and the blocks can be performed by elements other than those described herein.
It is to be understood that the presently disclosed subject matter is not limited in its application to the details set forth in the description contained herein or illustrated in the drawings. The presently disclosed subject matter is capable of other embodiments and of being practiced and carried out in various ways. Hence, it is to be understood that the phraseology and terminology employed herein are for the purpose of description and should not be regarded as limiting. As such, those skilled in the art will appreciate that the conception upon which this disclosure is based may readily be utilized as a basis for designing other structures, methods, and systems for carrying out the several purposes of the presently disclosed subject matter.
It will also be understood that the system according to the presently disclosed subject matter can be implemented, at least partly, as a suitably programmed computer. Likewise, the presently disclosed subject matter contemplates a computer program being readable by a computer for executing the disclosed method. The presently disclosed subject matter further contemplates a machine-readable memory tangibly embodying a program of instructions executable by the machine for executing the disclosed method.

Claims

1. A sensitive variable identifying system comprising a processing circuitry configured to: obtain: (a) a plurality of expressive variable identifiers, each given expressive variable identifier being: (i) associated with a respective machine learning model of one or more machine learning models, capable of receiving a vector associated with a variable and labeling the variable as either associated with the given expressive variable identifier or not, and (ii) indicative of sensitive content of a respective variable associated with the given expressive variable identifier, and (b) at least one code segment including a plurality of code lines containing a definition of at least one defined variable having a non-expressive variable identifier, wherein the non-expressive variable identifier is not included in the plurality of expressive variable identifiers; identify, utilizing the at least one code segment, a collection of terms associated with the at least one defined variable; generate a defined variable vector from the collection of terms; and determine the at least one defined variable as sensitive by determining whether the defined variable vector is associated with a given expressive variable identifier of the plurality of expressive variable identifiers, utilizing at least one of the machine learning models.
2. The sensitive variable identifying system of claim 1, wherein each machine learning model, associated with the given expressive variable identifier, is generated by: obtaining one or more code segments including a plurality of code lines, wherein the plurality of code lines includes a definition of a variable having the given expressive variable identifier; identifying, utilizing the one or more code segments, collections of terms associated with the given expressive variable identifier; generating one or more labeled vectors, each of which is based on an identified collection of terms of the collections of terms and associated with a label of the given expressive variable identifier; and training the machine learning model associated with the given expressive variable identifier, based on the one or more labeled vectors.
3. The system of claim 2, wherein the one or more labeled vectors are generated using a word embedding method.
4. The system of claim 3, wherein the word embedding method is one or more of: one hot encoding, Word2Vec, Fasttext, GloVe, CBOW, or any combination thereof.
5. The system of claim 1, wherein each given expressive variable identifier is associated with a distinct machine learning model.
6. The system of claim 1, wherein the defined variable vector is generated of a collection of partially defined vectors.
7. The system of claim 1, wherein each given expressive variable identifier of the plurality of expressive variable identifiers is associated with additional expressive variable identifiers that are also indicative of the content of the respective variable associated with the given expressive variable identifier.
8. The system of claim 1, wherein the collection of terms includes: one or more function identifiers, one or more variable names, or any combination thereof.
9. The system of claim 1, wherein the vector is generated using a word embedding method.
10. The system of claim 9, wherein the word embedding method is one or more of: one hot encoding, Word2Vec, Fasttext, GloVe, CBOW, or any combination thereof.
11. The system of claim 1, wherein: (i) the at least one code segment includes a plurality of code flows, each including a plurality of code lines, and (ii) prior to the identify step, the processing circuitry is configured to identify at least one code flow of the plurality of code flows in which the defined variable having the non-expressive variable identifier is present.
12. The system of claim 11, wherein: (i) each code flow of the plurality of code flows includes a plurality of code levels, each including a plurality of code lines, and (ii) the identification of the defined variable within the plurality of code levels of each code flow is carried out up to a code level threshold.
13. A sensitive variable identifying method comprising: obtaining: (a) a plurality of expressive variable identifiers, each given expressive variable identifier being: (i) associated with a respective machine learning model of one or more machine learning models, capable of receiving a vector associated with a variable and labeling the variable as either associated with the given expressive variable identifier or not, and (ii) indicative of sensitive content of a respective variable associated with the given expressive variable identifier, and (b) at least one code segment including a plurality of code lines containing a definition of at least one defined variable having a non-expressive variable identifier, wherein the non-expressive variable identifier is not included in the plurality of expressive variable identifiers; identifying, utilizing the at least one code segment, a collection of terms associated with the at least one defined variable; generating a defined variable vector from the collection of terms; and determining the at least one defined variable as sensitive by determining whether the defined variable vector is associated with a given expressive variable identifier of the plurality of expressive variable identifiers, utilizing at least one of the machine learning models.
14. The sensitive variable identifying method of claim 13, wherein each machine learning model, associated with the given expressive variable identifier, is generated by: obtaining one or more code segments including a plurality of code lines, wherein the plurality of code lines includes a definition of a variable having the given expressive variable identifier; identifying, utilizing the one or more code segments, collections of terms associated with the given expressive variable identifier; generating one or more labeled vectors, each of which is based on an identified collection of terms of the collections of terms and associated with a label of the given expressive variable identifier; and training the machine learning model associated with the given expressive variable identifier, based on the one or more labeled vectors.
15. The method of claim 14, wherein the one or more labeled vectors are generated using a word embedding method.
16. The method of claim 15, wherein the word embedding method is one or more of: one hot encoding, Word2Vec, Fasttext, GloVe, CBOW, or any combination thereof.
17. The method of claim 13, wherein each given expressive variable identifier is associated with a distinct machine learning model.
18. The method of claim 13, wherein the defined variable vector is generated of a collection of partially defined vectors.
19. The method of claim 13, wherein each given expressive variable identifier of the plurality of expressive variable identifiers is associated with additional expressive variable identifiers that are also indicative of the content of the respective variable associated with the given expressive variable identifier.
20. The method of claim 13, wherein the collection of terms includes: one or more function identifiers, one or more variable names, or any combination thereof.
21. The method of claim 13, wherein the vector is generated using a word embedding method.
22. The method of claim 21, wherein the word embedding method is one or more of: one hot encoding, Word2Vec, Fasttext, GloVe, CBOW, or any combination thereof.
23. The method of claim 13, wherein: (i) the at least one code segment includes a plurality of code flows, each including a plurality of code lines, and (ii) prior to the identify step, the processing circuitry is configured to identify at least one code flow of the plurality of code flows in which the defined variable having the non-expressive variable identifier is present.
24. The method of claim 23, wherein: (i) each code flow of the plurality of code flows includes a plurality of code levels, each including a plurality of code lines, and (ii) the identification of the defined variable within the plurality of code levels of each code flow is carried out up to a code level threshold.
25. A non-transitory computer readable storage medium having computer readable program code embodied therewith, the computer readable program code, executable by at least one processor to perform sensitive variable identifying method, the method comprising: obtaining: (a) a plurality of expressive variable identifiers, each given expressive variable identifier being: (i) associated with a respective machine learning model of one or more machine learning models, capable of receiving a vector associated with a variable and labeling the variable as either associated with the given expressive variable identifier or not, and (ii) indicative of sensitive content of a respective variable associated with the given expressive variable identifier, and (b) at least one code segment including a plurality of code lines containing a definition of at least one defined variable having a non-expressive variable identifier, wherein the non-expressive variable identifier is not included in the plurality of expressive variable identifiers; identifying, utilizing the at least one code segment, a collection of terms associated with the at least one defined variable; generating a defined variable vector from the collection of terms; and determining the at least one defined variable as sensitive by determining whether the defined variable vector is associated with a given expressive variable identifier of the plurality of expressive variable identifiers, utilizing at least one of the machine learning models.
PCT/IL2023/050535 2022-05-26 2023-05-24 A sensitive variable identifying system and method WO2023228188A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202263345952P 2022-05-26 2022-05-26
US63/345,952 2022-05-26

Publications (1)

Publication Number Publication Date
WO2023228188A1 (en) 2023-11-30

Family

ID=88918690

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IL2023/050535 WO2023228188A1 (en) 2022-05-26 2023-05-24 A sensitive variable identifying system and method

Country Status (1)

Country Link
WO (1) WO2023228188A1 (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190171846A1 (en) * 2017-12-04 2019-06-06 ShiftLeft Inc System and method for code-based protection of sensitive data

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LIU YIN; DHAR SIDDHARTH; TILEVICH ELI: "Only pay for what you need: Detecting and removing unnecessary TEE-based code", JOURNAL OF SYSTEMS & SOFTWARE, ELSEVIER NORTH HOLLAND, NEW YORK, NY, US, vol. 188, 10 February 2022 (2022-02-10), US , XP087012254, ISSN: 0164-1212, DOI: 10.1016/j.jss.2022.111253 *

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23811316

Country of ref document: EP

Kind code of ref document: A1