WO2022238851A1 - Neural network configuration method and binary file processing method - Google Patents

Neural network configuration method and binary file processing method

Info

Publication number
WO2022238851A1
Authority
WO
WIPO (PCT)
Prior art keywords
functions
software
asm
file
neural network
Application number
PCT/IB2022/054224
Other languages
French (fr)
Inventor
Daniele CANAVESE
Leonardo REGANO
Cataldo BASILE
Original Assignee
Politecnico Di Torino
Application filed by Politecnico Di Torino
Publication of WO2022238851A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F 21/10 Protecting distributed programs or content, e.g. vending or licensing of copyrighted material; Digital rights management [DRM]
    • G06F 21/12 Protecting executable software
    • G06F 21/121 Restricting unauthorised execution of programs
    • G06F 21/125 Restricting unauthorised execution of programs by manipulating the program code, e.g. source code, compiled code, interpreted code, machine code
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F 21/10 Protecting distributed programs or content, e.g. vending or licensing of copyrighted material; Digital rights management [DRM]
    • G06F 21/12 Protecting executable software
    • G06F 21/14 Protecting executable software against software analysis or reverse engineering, e.g. by obfuscation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks

Definitions

  • the present invention relates to the field of software security and of the protections applied to software.
  • mitigations are the software protection methods and technologies adopted both during the software development phase and immediately before its deployment.
  • the software protection process combines the use of cryptographic functions, transformations of the software itself, and software engineering techniques in order to mitigate risks.
  • Software protections can rely on security features available in the environment in which the software runs, but they can also be built into the software itself, using specially designed protection technologies.
  • a first step that attackers must perform to extract assets from software is to identify the protection techniques that have been applied to specific portions of the software in order to disable or eliminate them.
  • An assessment of the effectiveness of protections cannot be separated from estimating how easy it is to recognize which protections have been applied.
  • said tools are not optimized for performing protected area detection tasks within the software itself.
  • the purpose of the present invention is to improve the degree of automation in the identification phase of protection techniques applied to software.
  • - Fig.1 shows an example of a computer system usable for the purposes of the present invention;
  • - Fig.2 shows a schematic example of a binary file in which functions and the possible presence of protections are highlighted
  • - Fig.3 shows, by means of functional blocks, an example of a method of configuring a neural network capable of obtaining information about any protection techniques that may be present in a file to be analyzed;
  • - Fig.4 shows an example of a coded function
  • - Fig.5 shows, by means of function blocks, a simplified architecture of a neural network based on LSTM (Long Short Term Memory) cells;
  • - Fig.6 shows, by means of functional blocks, a simplified architecture of a BERT (Bidirectional Encoder Representations from Transformers) transformer-type neural network.
  • Fig. 1 shows an example of a computer system 10 configured to provide information about software protections contained in a file to be analyzed.
  • the system 10 includes, for example, a general-purpose computing device 20, in the form of a conventional personal computer, which includes a processing unit 21, a system memory 22, and a system bus 23 that couples the system memory 22 and other system components to the processing unit 21.
  • System bus 23 can be any of a number of different types of buses capable of providing a communication channel through which the various hardware devices exchange information.
  • System memory 22 includes, for example, a read-only memory (ROM) 24 and a random-access memory (RAM) 25.
  • a basic input/output system (BIOS) 26, stored in ROM 24, contains the basic routines that transfer information between the components of the personal computer 20. BIOS 26 also contains the system boot routines.
  • Personal computer 20 also includes a hard disk drive 27 to read from and write to at least one hard disk 29. Hard disk drive 27 is connected to system bus 23 via a hard disk drive interface 32.
  • system 10 includes hard disk 29 but may include other types of media, such as memory cards, external hard disks, RAM, ROM, and the like.
  • Program modules may be stored on hard disk 29, in ROM 24, and in RAM 25.
  • a user can enter commands and information into the personal computer 20 through one or more input devices such as, for example, a keyboard 40 and an optical pointing device 42. These and other input devices are often connected to the processing unit 21 through a specific input interface 46, coupled to the system bus 23, that depends on the type of port used, such as a serial port, parallel port, USB port, etc.
  • a monitor 47 or other display device also connects to system bus 23 through an interface such as a video adapter 48.
  • personal computers may also include other output peripherals (not shown) such as a printer.
  • Personal computer 20 can operate, in a data exchange network, using logical connections to one or more remote computers such as remote computer 49.
  • Remote computer 49 can be another personal computer, a server, a router, a network PC, or another node on the network. It typically includes many or all of the components described above in relation to personal computer 20. However, in the example in Fig. 1 only one storage device 50 is shown for simplicity. The logical connections shown in Fig. 1 may include a LAN and/ or WAN 51 type network common in offices, corporate computer networks, intranets, and the Internet.
  • When in a LAN/WAN network environment, the PC 20 connects to a network 51 through a network interface or adapter 53 that may be a wired or wireless network card.
  • program modules represented as residing within the personal computer 20 or portions thereof may be stored in a remote storage device 50.
  • the program modules may include: the operating system 35, one or more application programs 36, at least one neural network NN(Pi) (processing module 33), and a training module MOD_TRAIN 34.
  • Each of the neural networks NN(Pi) can be implemented in hardware, in software, or in a combination thereof.
  • the training module MOD_TRAIN 34 is tasked with training each neural network NN(Pi) by means of data sets, i.e., a collection of data used as samples for the purpose of "teaching" the NN(Pi) neural network how to react in the face of specific input data.
  • each neural network NN(Pi) is trained to obtain information about a specific protection technique possibly present in a file to be analyzed.
  • Figure 2 shows a schematized example of a binary file that may belong, by way of example, to an application or a software library.
  • the binary file is formed by a plurality of functions FNZ 1 - FNZ n, each of which consists of a sequence of assembly instructions, also called lines of code or more generally code.
  • One or more of said plurality of functions contained in the binary file may need software protection if their contents represent an asset, i.e., constitute value in economic and/ or know-how terms.
  • Assets that may constitute a critical area within the binary file may be, for example, but not limited to, proprietary algorithms (or other intellectual property), cryptographic secrets, or security controls such as commercial software licensing controls.
  • function FNZ 1 and function FNZ 6 are shown with the symbol of a shield to specify that these two functions are the result of enforcing software protections since they contain at least one asset. Note that particular lines of code or functions may have been protected in order to confuse attackers despite not being assets. In contrast, function FNZ 4 is instead shown with an X symbol to indicate that it is not a function protected with any kind of software protection.
  • Examples of a fingerprint present in the code as a result of the application of a software protection could be particularly complex control flows or logical conditions.
  • Each software protection has a characteristic fingerprint that might allow certain information to be inferred, such as what peculiarities the protected assets possess and which security properties it was decided to apply.
  • Examples of software protections may be: control flow flattening, opaque predicates, branch functions, encode arithmetic, converting data into functions (e.g., with Mealy machines), merging or splitting variables, recoding variables (e.g., xor masking, residue number encoding, ...), white-box cryptography, virtualization using virtual machines or JIT compilation, call stack checks, code guards, control flow tagging, anti-debugging, code mobility, client/server code splitting, anti-cloning, and software attestation.
  • Figure 3 shows, by means of a flowchart, a preferred form of realization of a neural network configuration method 100 that can be implemented, for example, using System 10.
  • Method 100 allows one or more neural networks NN(Pi) to be configured, each to be used to obtain information about a specific protection technique possibly present in a file to be analyzed.
  • method 100 provides a first step 110 in which one or more source files (i.e., files expressed in a high-level language) employed for the purpose of training a neural network are provided.
  • a source file is, initially, free of software protections.
  • one or more software protections are applied to that source file.
  • the source file provided with the software protections P1,...,Pn is then compiled, resulting in a binary file.
  • a binary file to which the protections P1,...,Pn have already been applied can be directly provided, i.e., avoiding the first step 110. Note that some protections are applied to the source file while others are applied directly to the binary file.
  • in a second step 120, disassembly of the binary file obtained, for example, from the previous compilation is carried out in order to extract the plurality of functions contained therein (i.e., its code portions).
  • the disassembly operation makes it possible to obtain the previously compiled file in the form of assembly code, replacing each machine-language operation code with a sequence of characters representing it in mnemonic form, i.e., in a way easily interpreted by an operator.
  • Data and memory addresses can also be rewritten in assembly according to a numeric base, such as hexadecimal, or in symbolic form using text strings (identifiers).
  • a third step 130 is then carried out that aims to collect in a data set, or data collection, the plurality of protected functions extracted from the assembly file, together with the indication of the protections present for each function.
  • This indication is an identifier of the type of protection applied P1,...,Pn.
  • the data set may have a matrix structure.
  • the data set can also be obtained from a library of protected functions associated with a library of protections, without performing the application of protections to the functions in the binary file and the extraction of functions from the binary file indicated in the second step 120.
  • a sample CHMP is, for example, a row of the data set in the following format: (asmj, P1,...,Pn), where asmj is the j-th protected function extracted from the assembly file and the identifiers P1,...,Pn represent the specific software protections applied to it.
  • in particular, the identifiers P1,...,Pn can be Boolean variables indicating whether the protection Pi has been applied to the function asmj.
  • for functions to which no protection has been applied, all the identifiers P1,...,Pn take the value "false".
  • Samples CHMP will be used in the subsequent steps of method 100. Note that in order to carry out a good training phase of a neural network, it is necessary for the data set to contain a sufficiently large number of samples CHMP.
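  • By way of illustration only, a sample CHMP can be sketched in Python as follows; the class and field names are illustrative assumptions, not part of the described method:

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Sample:                      # one CHMP row of the data set
        asm: List[str]                 # assembly instructions of the function asm_j
        protections: List[bool]        # identifiers P1..Pn (True = protection applied)

    # a function protected with P1 only, and a vanilla (unprotected) function:
    protected = Sample(["0x1234 add r0, r2, 5", "0x1238 bx lr"], [True, False, False])
    vanilla = Sample(["0x2000 mov r0, 0", "0x2004 bx lr"], [False, False, False])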
  • the first phase 110 (compilation phase) and the second phase 120 (disassembly phase) can be performed several times using different combinations of protections involving the application of one or more sequences of protections to each function.
  • the data set is also constructed using compiled functions to which no software protection has been applied.
  • the compiled functions without any software protection are called vanilla functions and are intended to balance the data set. The purpose of such balancing is to prevent an unbalanced data set from adversely affecting the learning process of a neural network, described later in this document, by leading it to focus on prevalent events while neglecting rare ones.
  • vanilla functions are used to make the neural network learn what unprotected functions look like, so that it can distinguish them from functions protected with the specific protection technique that the neural network is trained to identify.
  • in a fourth step 140, the encoding of each function (asmj) belonging to the plurality of functions is performed, converting them into encoded functions CHMP_COD. The purpose of this operation is to transform into a sequence of numerical values the instructions contained in the functions of each sample CHMP expressed in assembly language; in particular, instructions containing operation codes, data and addresses expressed in mnemonic format.
  • at the end of said encoding step 140, each encoded function CHMP_COD belonging to the data set will be expressed as a sequence of numeric values, thus being suitable for use in a training step of a neural network.
  • the encoding step 140 may optionally include two additional sub-steps that allow a neural network to reach convergence faster: the masking sub-step and the scaling sub-step.
  • in a fifth phase 150, training of one of the neural networks NN(Pi) is performed. For example, a first neural network NN(P1) associated with a first software protection P1 is trained.
  • the first neural network NN(P1) is trained to provide a first probability index PIi,j (here i = 1) indicating the probability that a given j-th function (asmj) is protected by the first protection P1; when this index exceeds a certain threshold, the first neural network NN(P1) can also provide a second index FAi,j,k.
  • said second index FAi,j,k indicates the probability that the instructions in a specific area (denoted generically by index k) of the function have the first protection P1 applied. This makes it possible to identify the instructions of the function to which a given protection has been applied, or which have alternatively been introduced by the application of the protection; in other words, to identify the location of a protection within a function.
  • the training of the first neural network NN(P1) is carried out using the data set that includes the encoded functions CHMP_COD related to the first protection P1 alone and the encoded functions CHMP_COD protected with each possible pair of protections including P1 (e.g., P1+P2, P1+P3, ..., P1+Pn).
  • the data set can likewise include functions protected with triples (e.g., P1+P2+P3) or quadruples (e.g., P1+P2+P3+P4) of protections, as sketched below.
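  • As a minimal illustration (the protection names are placeholders), the combinations of protections containing a given protection can be enumerated as follows:

    from itertools import combinations

    protections = ["P1", "P2", "P3", "P4"]
    # all combinations containing P1, used to build the data set for NN(P1):
    combos = [c for r in range(1, len(protections) + 1)
              for c in combinations(protections, r) if "P1" in c]
    # -> ('P1',), ('P1', 'P2'), ('P1', 'P3'), ('P1', 'P4'), ('P1', 'P2', 'P3'), ...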
  • training is carried out using the training module MOD_TRAIN 34.
  • the training can be repeated for each neural network NN(Pi) related to the other protections of interest P2,...,Pn as well.
  • the NN(Pi) neural network can be chosen from neural networks capable of handling sequences and/or neural networks having an attention mechanism.
  • One type of neural network capable of handling sequences is, for example, a recurrent-type neural network, that is, a network in which feedback connections are present. Such feedback creates a kind of "memory" of what happened in the recent past by making available at time T information processed at time T-1 or T-2, thereby making the value of the current output depend not only on the current input values but also on the previous inputs.
  • an example of a recurrent neural network is the Long Short-Term Memory (LSTM) network.
  • the idea behind the attention mechanism is to be able to define which parts of the input vector the neural network should focus on to generate the appropriate output.
  • an attention mechanism allows it to process input data while also attending to relevant information contained in other input data.
  • the attention mechanism also allows the masking of those data that do not contain relevant information.
  • Examples of neural networks that use the attention mechanism could be, for example, recurrent neural networks, such as the aforementioned LSTM, or neural networks such as BERT (Bidirectional Encoder Representations from Transformers).
  • Neural networks NN(Pi), trained as described above, can be employed in a classification method applied to a binary file to be analyzed (i.e., a file distinct from the one used for training in the configuration method 100).
  • the binary file to be analyzed is disassembled and the relevant functions (asm) that are to be analyzed are extracted from the resulting assembly file. This can be achieved using a conventional disassembler.
  • Each function (asm) is then processed by each neural network NN(Pi).
  • Each such neural network NN(Pi) will return a corresponding first probability index PIi,j associated with a specific protection Pi and also, preferably, the second index FAi,j,k for each function.
  • the set of values of the first probability index PIi,j allows a classification of the protections Pi,..., P n that may be present in the analyzed binary file.
  • the values of the second indices FAi,j,k provide additional indications that identify the location, within each function, of the instructions associated with those values of the second index.
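  • A hedged sketch of this classification loop follows; disassemble() and encode() are assumed stand-ins for a conventional disassembler and the encoding step 140, and networks maps each protection index i to the trained network NN(Pi):

    def classify(binary_path, networks, disassemble, encode):
        # return {i: [PIi,j for each function j]}: the probability that the
        # protection Pi is present in the j-th function of the analyzed binary
        functions = disassemble(binary_path)   # extract the functions (asm)
        return {i: [net(encode(fn)) for fn in functions]
                for i, net in networks.items()}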
  • the classification method will allow the evaluation of the security quality of the protections applied to the analyzed binary file because the detection of such protections by neural networks NN(Pi) indicates that the protection is quickly identifiable, so an attacker is 'delayed' less.
  • the protection team is responsible for actually protecting the software.
  • the reverse engineering team emulates the behaviour of possible attackers, attempting to identify the assets within the application and the protections used, and then removing these protections, thus compromising the security of the assets.
  • the protection team proposes an initial solution, whose effectiveness is evaluated by the reverse engineering team. These operations are then performed iteratively until a sufficient level of protection has been achieved (or the available time has run out).
  • the described classification method, based on the configuration method 100, can thus be used by companies specializing in software protection in two different ways.
  • the protection team can obtain a quick assessment of the identifiability of the chosen protections (without waiting for the results of reverse engineering activities).
  • the described classification method can also be used by the reverse engineering team to automate and speed up the identification of assets, an essential first step in their activities.
  • Encoding step 140 aims to transform the instructions contained in the functions of each sample CHMP into sequences of numerical values. This transformation aims to encode each function (asm j ) of the samples CHMP belonging to the data set as a matrix of numerical values whose rows are the encodings of the component instructions of the function itself.
  • the encoded function has two dummy pseudo-instructions (also encoded along with the rest of the function), <begin> and <end>, added at the beginning and end of the function (asmj), respectively, to make its boundaries explicit.
  • Such dummy pseudo-instructions are inserted in a preliminary phase of the encoding step 140.
  • the number of instructions constituting the function can be truncated to the first n instructions.
  • Figure 4c shows the coded function truncated to a predetermined maximum size of 4 with respect to the function in Figure 4a.
  • the truncation operation becomes necessary when the type of neural network chosen to be trained to operate as a classifier requires that the input sequence have a maximum size that should not be exceeded.
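  • A minimal sketch of this framing and truncation (the helper name is an assumption):

    def frame_and_truncate(instructions, max_len=None):
        # add the <begin>/<end> pseudo-instructions; when the chosen network
        # requires a maximum input size, keep only the first max_len items
        framed = ["<begin>"] + list(instructions) + ["<end>"]
        return framed[:max_len] if max_len is not None else framed

    # truncation to a predetermined maximum size of 4, as in Figure 4c:
    assert frame_and_truncate(["add r0, r2, 5", "bx lr", "mov r0, 1"], 4) == \
        ["<begin>", "add r0, r2, 5", "bx lr", "mov r0, 1"]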
  • Figure 4b shows the detail of an i-th instruction belonging to the function in Figure 4a after being encoded.
  • the encoded instruction consists of the elements detailed below.
  • the number of parameters may vary based on the hardware architecture chosen.
  • Figure 4b shows the generalization of an instruction based on an ARM-type architecture, where more complex instructions can have up to six operands.
  • the number of operands that can be handled by the encoding step 140 is not limited to this maximum, so that the encoding can be easily adapted to instructions of different hardware architectures.
  • - 0x1234 is the numeric value indicating the address of the instruction, which in the example considered is an integer expressed in hexadecimal, equivalent to the number 4660 in decimal base, and indicates the location of the instruction in memory;
  • - add is the type of operation (opcode) of the instruction, which in this case is a sum;
  • - r0 is the first operand of the instruction and refers to memory register r0;
  • - r2 is the second operand of the instruction and refers to memory register r2;
  • - 5 is the third operand of the instruction and refers to the integer 5.
  • since the instruction uses only three operands, the fourth, fifth and sixth operands will be absent.
  • encoding step 140 transforms each instruction line of the functions of the sample CHMP into a sequence of 1250 values as follows:
  • the instruction encoding must always be a sequence 1250 values long, even when the number of operands is smaller than the maximum, which in the example is 6. Since the ADD instruction considered has only three operands, the absence of the missing operands is also encoded so as not to alter the length of the value encoding sequence.
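  • The overall segment layout detailed in the following points can be summarized and checked with a short sketch (the segment names are mine):

    SEGMENTS = {
        "opcode embedding": 236,
        "address": 4,
        "operand 1": 273,   # 9 sections
        "operand 2": 239,   # 4 sections
        "operand 3": 239,
        "operand 4": 239,
        "operand 5": 16,    # "reg coproc" section only
        "operand 6": 4,     # "number" section only
    }
    assert sum(SEGMENTS.values()) == 1250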
  • the encoding of the opcode occurs on a sequence of 236 numeric values and represents the embedding of the opcode itself.
  • Embedding refers to a standard modelling technique whereby words or numbers are mapped into numerical sequences. Said values were pre-computed by means of a neural network using standard algorithms, such as the CBOW (Continuous Bag of Words) and Skip-gram algorithms, described in the paper by Tomas Mikolov et al., "Efficient Estimation of Word Representations in Vector Space" (2013).
  • Address encoding is done on a sequence of 4 numeric values where each of the 4 values is determined according to a specific criterion.
  • the first value in the sequence is valued at 1 when the instruction contains an address, or at 0 when missing, as in the case of some special instructions.
  • the <begin> and <end> instructions are examples of special instructions.
  • the second value of the sequence is valued at 0 when the instruction has no address, or when the address value is invalid (NaN) or infinite. In all other cases, the second value of the sequence is valued at 1.
  • the third value in the sequence contains the address value expressed in decimal base when present, otherwise it is valued at 0.
  • the fourth value in the sequence is valued at 1 when the instruction address contains the exclamation mark (!) as is the case for some special addresses, otherwise it is valued at 0.
  • the exclamation mark is used in certain (very rare) cases in ARM to indicate the write-back operation, that is, that the result of an operation must be written inside a certain address. For example, if the address "1234!" is written in an instruction, it means that the result of the instruction must be written back to memory cell 1234. If the exclamation mark is not present, then the result of the operation is not written to memory cell 1234 (which typically will only be read).
  • the address 0x1234 contained in the example instruction is encoded as 1, 1, 4660, 0.
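  • Under the four criteria just described, the address encoding can be sketched as follows (the function name is an assumption):

    import math

    def encode_address(addr=None, writeback=False):
        # 4-value address encoding: [present, valid, value, exclamation mark]
        if addr is None:               # special instructions such as <begin>/<end>
            return [0, 0, 0, 0]
        valid = 0 if (math.isnan(addr) or math.isinf(addr)) else 1
        return [1, valid, addr if valid else 0, 1 if writeback else 0]

    assert encode_address(0x1234) == [1, 1, 4660, 0]   # the example in the text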
  • the encoding of the first operand occurs on a sequence of 273 values that are divided into 9 sections in order to represent the different types of operands.
  • a first section called “string” is used to encode a string-type operand on a value in the sequence.
  • a second section called “number” is used to encode a number-type operand on a sequence of 4 values.
  • the third, fourth, and fifth sections, named "endian," "cond," and "CPU state," are used to encode internal processor states on sequences of 2, 15, and 4 values, respectively.
  • the sixth section named “registers” is used to encode memory registers on a sequence of 154 values.
  • a seventh section called “barrier” is used to encode memory barriers on a sequence of 12 values.
  • a memory barrier is a type of operation that allows the CPU to impose a constraint on the ordering of operations, preventing out-of-order execution due to performance optimizations of modern CPUs.
  • An eighth section called “address” is used to encode memory addresses over a sequence of 65 values.
  • the ninth section called “coproc” is used to encode a math coprocessor on a sequence of 16 values.
  • the first operand is a register, r0, and therefore it will be encoded in the sixth "registers" section, while the value sequences in the other sections will all be valued at 0.
  • the first 123 values are associated with registers, where the first value is associated with register r0, the second value with register r1, the third value with register r2, etc.
  • the value associated with the register to be encoded will be valued at 1 while the other values associated with the other registers will be valued at 0.
  • the remaining 31 values are used to encode special mathematical operations on registers, such as scaling. Since the first operand, r0, of our example does not require special mathematical operations, the sequence of 31 values will all be valued at 0.
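  • A minimal sketch of the 154-value "registers" section (the slot numbering is assumed from the description above):

    def encode_registers(reg_index=None):
        # 123 one-hot register slots (r0 -> slot 0, r1 -> slot 1, ...) followed
        # by 31 values for special operations such as scaling, all 0 here
        section = [0] * 154
        if reg_index is not None:
            section[reg_index] = 1
        return section

    assert encode_registers(0)[0] == 1   # first operand r0
    assert encode_registers(2)[2] == 1   # second operand r2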
  • the encoding of the second operand is done on a sequence of 239 values that are divided into 4 sections in order to represent the different types of operands.
  • a first section called “number” is used to encode a numeric type operand on a sequence of 4 values.
  • a second section called “registers” is used to encode memory registers on a sequence of 154 values.
  • a third section called “address” is used to encode memory addresses on a sequence of 65 values.
  • a fourth section called “reg coproc” is used to encode a math coprocessor register on a sequence of 16 values.
  • the second operand is again a register, r2, and therefore will be encoded in the second "registers" section, while the value sequences in the other sections will all be valued at 0.
  • the first 123 values are associated with registers, where the first value is associated with register r0, the second value with register r1, the third value with register r2, etc.
  • the value associated with the register to be encoded will be valued at 1 while the other values associated with the other registers will be valued at 0.
  • the remaining 31 values are used to encode special mathematical operations on registers such as scaling. Since the second operand, r2, of our example does not require special mathematical operations, the sequence of 31 values will all be valued at 0.
  • the encoding of the third operand is done on a sequence of 239 values that are divided into 4 sections, as for the second operand, in order to represent the different types of operands.
  • a first section called “number” is used to encode a numeric type operand on a sequence of 4 values.
  • a second section called "registers" is used to encode memory registers on a sequence of 154 values.
  • a third section called “address” is used to encode memory addresses on a sequence of 65 values.
  • a fourth section called “reg coproc” is used to encode a math coprocessor register on a sequence of 16 values.
  • the third operand is a number, 5, so it will be encoded in the first "number" section, while the sequences of values in the other sections will all be valued at 0.
  • the first value of the "number” section is valued at 1 when the operand contains a numeric value, otherwise it is valued at 0.
  • the second value of the "number” section is valued to 1 when the numeric value is not NaN or infinite, otherwise it is valued to 0.
  • the third value of the "number” section is valued with the numeric value of the operand, which in the example is equal to 5.
  • the fourth value of the "number” section is valued at 1 when notation with an exclamation mark (!) is used, otherwise it is valued at 0.
  • the third operand in the example, the number 5, is coded as 1, 1, 5, 0.
  • the encoding of the fourth operand is done on a sequence of 239 values, which are divided into 4 sections, as for the second and third operands, in order to represent different types of operands.
  • a first section called “number” is used to encode a numeric type operand on a sequence of 4 values.
  • a second section called “registers” is used to encode memory registers on a sequence of 154 values.
  • a third section called “address” is used to encode memory addresses on a sequence of 65 values.
  • a fourth section called "reg coproc" is used to encode a register of a math coprocessor on a sequence of 16 values.
  • the example instruction does not have a fourth operand, but it will still need to be encoded to keep the length of the sequence of values consistent. In this case, the values of all sections will be valued at 0.
  • the encoding of the fifth operand occurs on a sequence of 16 values included in a single section called "reg coproc", used to encode a register of a math coprocessor.
  • the example instruction again does not have a fifth operand but it will still need to be encoded to keep the length of the sequence of values consistent. As in the previous case, in the absence of the operand, the section values will all be valued at 0.
  • the encoding of the sixth operand occurs on a sequence of 4 values included in a single section called "number", which is used to encode a numeric-type operand.
  • the first 236 values represent the add opcode and are encoded with the sequence (0.5741776823997498, 0.5895169377326965, 0.44707465171813965, 0.5283305644989014, ...);
  • the next 4 values represent the address 0x1234 and are encoded with the sequence (1, 1, 4660, 0);
  • the next 273 values represent the first operand r0 and are encoded with the sequence (0, ..., 0, 1, 0, ...);
  • the next 239 values represent the second operand r2 and are encoded with the sequence (0, ..., 0, 0, 1, 0, ...);
  • the next 239 values represent the third operand, the number 5, and are encoded with the sequence (1, 1, 5, 0, 0, ...);
  • the next 239 values represent the fourth (absent) operand and are encoded with a sequence of all 0s;
  • the next 16 values represent the fifth (absent) operand and are encoded with a sequence of all 0s;
  • the next 4 values represent the sixth (absent) operand and are encoded with the sequence (0, 0, 0, 0).
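  • Assembling the whole 1250-value encoding of the example instruction can be sketched as follows; the opcode embedding is a zero placeholder for the pre-computed 236-value embedding, and all names are illustrative:

    import numpy as np

    OP1_SECTIONS = [("string", 1), ("number", 4), ("endian", 2), ("cond", 15),
                    ("CPU state", 4), ("registers", 154), ("barrier", 12),
                    ("address", 65), ("coproc", 16)]        # 273 values
    OP_SECTIONS = [("number", 4), ("registers", 154),
                   ("address", 65), ("reg coproc", 16)]     # 239 values

    def operand(sections, filled=None):
        # concatenate zeroed sections, substituting the one filled section if any
        parts = []
        for name, size in sections:
            if filled is not None and name == filled[0]:
                parts.append(np.asarray(filled[1], dtype=float))
            else:
                parts.append(np.zeros(size))
        return np.concatenate(parts)

    def register(idx):                  # 154-value "registers" section, one-hot
        sec = np.zeros(154)
        sec[idx] = 1.0
        return sec

    encoded = np.concatenate([
        np.zeros(236),                                      # opcode embedding
        [1, 1, 0x1234, 0],                                  # address
        operand(OP1_SECTIONS, ("registers", register(0))),  # operand 1: r0
        operand(OP_SECTIONS, ("registers", register(2))),   # operand 2: r2
        operand(OP_SECTIONS, ("number", [1, 1, 5, 0])),     # operand 3: 5
        operand(OP_SECTIONS),                               # operand 4: absent
        np.zeros(16),                                       # operand 5: absent
        np.zeros(4),                                        # operand 6: absent
    ])
    assert encoded.shape == (1250,)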
  • the encoding step 140 can optionally include two additional sub-steps that enable a neural network to achieve convergence faster: the masking sub-step and the scaling sub-step.
  • the masking sub-step is responsible for eliminating from the encoded samples belonging to the data set every column that has the same value in all samples. The elimination of said columns is possible because the encoding step 140 can represent more instruction variants than the processor actually uses, so some of the values never occur in practice.
  • the masking sub-step allows the length of the encoded sequences to be reduced, so that the neural network, in the training step, does not spend time dwelling on data that is not really informative. In the example case, the masking sub-step is able to reduce the length of the value sequence from 1250 to 768.
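  • A minimal sketch of the masking sub-step, column-wise over a toy matrix of encoded samples:

    import numpy as np

    def mask_constant_columns(X):
        # drop every column that is identical across all encoded samples,
        # since such columns carry no information for training
        keep = np.any(X != X[0], axis=0)   # True where a column varies
        return X[:, keep], keep

    X = np.array([[0.0, 1.0, 4660.0, 0.0],
                  [0.0, 1.0, 512.0, 1.0]])
    X_masked, keep = mask_constant_columns(X)
    assert X_masked.shape == (2, 2)        # the two constant columns are dropped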
  • the rescaling sub-step performs a rescaling of values within a range of 0 to 1.
  • While the encoded samples belonging to the data set contain the values 0 and 1 most of the time, in some cases they contain much larger values.
  • For example, the third value in the address-encoding sequence contains the address expressed in decimal base, which in the example described above was 4660.
  • Another example of values other than 0 and 1 present in the encoding is found in the "number" section of the encoding of the first, second, third, fourth and sixth operands.
  • In the example above, the third operand was the number 5, encoded as the third value of the "number" section.
  • the rescaling operation prevents larger numbers from being given greater weight in training, a weight to which greater importance does not always correspond.
  • the greater weight given to a given value in the initial training step of a neural network slows down the learning process since the network would take longer to realize that the greater weight given to such values was unfounded.
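  • A minimal sketch of the rescaling sub-step as a per-column min-max scaling into [0, 1]:

    import numpy as np

    def rescale(X):
        # scale each column into [0, 1] so that large raw values (e.g., the
        # address 4660) do not receive an unjustified weight in early training
        lo, hi = X.min(axis=0), X.max(axis=0)
        span = np.where(hi > lo, hi - lo, 1.0)   # avoid division by zero
        return (X - lo) / span

    X = np.array([[1.0, 4660.0], [0.0, 512.0]])
    assert rescale(X).max() == 1.0 and rescale(X).min() == 0.0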
  • the encoded samples CHMP_COD can be used in the training step 150 by sending them to the neural network NN(Pi).
  • any neural network capable of handling sequences and/or having an attention mechanism can be used in this phase.
  • two possible alternative training phases will be described, using two different architectures of said neural networks.
  • Figure 5 shows an initial simplified architecture based on LSTM cells, a type of recurrent neural network with a long-term memory mechanism that enables the processing of data sequences.
  • the information in said sequences is stored so that, due to the presence of loops, proceeding through the sequence the information stored in the cells assists in the processing of new data.
  • the neural network is able to interpret in order the assembly instructions contained in the encoded samples CHMP_COD.
  • a semantic encoding of the function contained in the encoded sample CHMP_COD is used as input to the neural network where a first 1S_LSTM layer of LSTM cells parses the assembly instructions contained in the encoded sample CHMP_COD.
  • Said LSTM cells are bidirectional type cells that can parse assembly instructions from first to last and in the opposite direction from last to first.
  • An attention mechanism ATT receives output data from the first layer of cells 1S_LSTM from which it is able to extract the attention levels LIV_ATT of individual instructions giving an indication of which assembly instructions contain protection.
  • This attention level LIV_ATT corresponds to the second index FAi,j,k, described above.
  • the output of the attention mechanism ATT is summed with the output of the first cell layer 1S_LSTM, so that information from the attention levels LIV_ATT is added; the result then enters a second layer 2S_LSTM of LSTM cells, which analyzes the assembly instructions contained in the encoded sample.
  • the output of the second layer of cells 2S_LSTM enters as input to a final layer TRAS_LIN where a linear transformation and a sigmoid function are used to calculate the values of the first index PIi,j for each assembly instruction in a range between 0 and 1.
  • the final value (or score) CLASS is given by the score obtained from the last instruction contained in the encoded sample CHMP_COD, i.e., the pseudo-instruction <end>. The closer said score is to the value 1, the more likely the presence of the protection. Said score CLASS corresponds to the first probability index PIi,j.
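  • A minimal PyTorch sketch of this Fig. 5 architecture follows; the layer sizes and the exact form of the attention step are assumptions, not taken from the patent:

    import torch
    import torch.nn as nn

    class LSTMProtectionClassifier(nn.Module):
        # two bidirectional LSTM layers with an attention step in between,
        # then a linear transformation + sigmoid giving a score per instruction
        def __init__(self, d_in=768, d_hid=128):
            super().__init__()
            self.lstm1 = nn.LSTM(d_in, d_hid, batch_first=True, bidirectional=True)
            self.att = nn.Linear(2 * d_hid, 1)            # attention scoring
            self.lstm2 = nn.LSTM(2 * d_hid, d_hid, batch_first=True,
                                 bidirectional=True)
            self.out = nn.Linear(2 * d_hid, 1)

        def forward(self, x):                             # x: (batch, n_instr, d_in)
            h1, _ = self.lstm1(x)                         # first layer 1S_LSTM
            liv_att = torch.softmax(self.att(h1), dim=1)  # LIV_ATT per instruction
            h2, _ = self.lstm2(h1 + liv_att * h1)         # sum, then 2S_LSTM
            scores = torch.sigmoid(self.out(h2)).squeeze(-1)  # TRAS_LIN
            return scores[:, -1], liv_att.squeeze(-1)     # CLASS = <end> score

    cls, att = LSTMProtectionClassifier()(torch.randn(2, 10, 768))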
  • a second architecture is based on a BERT-type transformer (Bidirectional Encoder Representations from Transformers), which is a non-recurrent architecture with an attention mechanism. Since this type of neural network needs the input function to have a maximum length known a priori, the previous encoding step 140 performs the function truncation operation shown in Figure 4c. Therefore, in this case the last instruction of the function may not be the pseudo-instruction <end>.
  • Figure 6 shows a simplified BERT-based architecture where a semantic encoding of the function contained in the encoded sample CHMP_COD is added to a positional encoding.
  • Positional encodings are a prerequisite of the training step: they are coefficients calculated according to standard formulas and stored within a matrix.
  • the output result of the previous step enters as input to a series of encoding layers S_ENCOD that analyze its contents, where each encoding layer possesses an attention mechanism that is used to evaluate the attention levels LIV_ATT of the assembly instructions in a manner analogous to the LSTM-type neural network architecture.
  • attention levels LIV_ATT correspond to values of the second index FAi,j,k, described above.
  • the output from the encoding layers S_ENCOD enters as input to a final layer TRAS_LIN where a linear transformation and a sigmoid function are used to compute classification scores for each assembly instruction in a range between 0 and 1.
  • the final score value CLASS is relative to the first instruction contained in the encoded sample CHMP_COD, i.e., the pseudo-instruction <begin>.
  • the final score CLASS corresponds to the first probability index PIi,j, introduced above. Note that, in addition to the two examples described, it is possible to train any architecture among those capable of handling sequences and/or having an attention mechanism, using any standard training algorithm.
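  • A minimal PyTorch sketch of this Fig. 6 architecture follows; the model width, the number of layers and heads, and the maximum length are assumptions:

    import torch
    import torch.nn as nn

    def sinusoidal_positions(max_len, d_model):
        # positional-encoding matrix computed with the standard sine/cosine formulas
        pos = torch.arange(max_len, dtype=torch.float).unsqueeze(1)
        i = torch.arange(0, d_model, 2, dtype=torch.float)
        angles = pos / torch.pow(10000.0, i / d_model)
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(angles)
        pe[:, 1::2] = torch.cos(angles)
        return pe

    class BertStyleClassifier(nn.Module):
        # transformer encoder layers (each with self-attention) followed by a
        # linear transformation + sigmoid; CLASS is read at the <begin> position
        def __init__(self, d_model=768, n_heads=8, n_layers=4, max_len=512):
            super().__init__()
            self.register_buffer("pos", sinusoidal_positions(max_len, d_model))
            layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, n_layers)
            self.out = nn.Linear(d_model, 1)

        def forward(self, x):                        # x: (batch, n_instr, d_model)
            h = self.encoder(x + self.pos[: x.size(1)])           # S_ENCOD stack
            return torch.sigmoid(self.out(h)).squeeze(-1)[:, 0]   # <begin> score

    scores = BertStyleClassifier()(torch.randn(2, 16, 768))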
  • the described solution produces a qualitative metric of the identification of protection techniques that can be used as an estimate of the quality of the chosen protection solution, introducing a high degree of automation in identifying the protection techniques applied to a file and the protected areas within the file.
  • results obtained through the application of the method of the present invention can be used as a tool for validating the risk exposure of assets, validating the invisibility of developed protection techniques, and identifying the methods by which viruses and malware are obfuscated to help update antivirus tools.
  • the described solution thus allows achieving a higher level of software protection for the same amount of time spent or a level of protection equivalent to that achievable by known techniques in a significantly shorter time interval.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Technology Law (AREA)
  • Computer Hardware Design (AREA)
  • Health & Medical Sciences (AREA)
  • Storage Device Security (AREA)
  • Image Processing (AREA)

Abstract

A method (100) of configuring neural networks is described, comprising the steps of: defining (110; 120) a plurality of functions (asm) and applying a plurality of software protections (P1,...,Pn) to said functions; constructing (130) a data set comprising a plurality of samples, each including a function (asmj) of the plurality and at least one of said software protections (P1,...,Pn) applied to the respective function; encoding (140) each function (asmj) of the data set to obtain a plurality of encoded samples (CHMP_COD), each expressed as a sequence of numerical values; and training (150) a neural network (NN(Pi)) by means of the plurality of encoded samples (CHMP_COD) so that it is capable of processing a file to be analyzed and providing information regarding software protections applied to said file.

Description

"NEURAL NETWORK CONFIGURATION METHOD AND BINARY FILE
PROCESSING METHOD
DESCRIPTION
FIELD OF THE TECHNIQUE
The present invention relates to the field of software security and of the protections applied to software.
STATE OF THE ART
Software, because of its inherent characteristics, is highly sensitive from the standpoint of security.
Consider, for example, that software often incorporates or handles confidential, private data or intellectual property of third parties. It also makes explicit the know-how that made its creation possible. Such information is available in the form of the instructions and data structures that make up or are processed by a given piece of software. It is therefore potentially available to anyone who has a copy of the software. Software can thus also be seen as a container of assets critical to the business of the company that developed it.
Software is exposed to numerous threats and risks. In a MATE (Man-At-The-End) scenario, an attacker, having a given piece of software at his disposal and controlling the environment in which it will run, can analyze the behaviour of said software, e.g., using a debugger, or can disassemble or decompile it to extract its logical structure. These reverse engineering operations can then allow some of the assets in the software to be obtained. MATE attacks are frequent and can consist of the identification and possible reuse of functionality deemed strategic, the compromise of licensing controls, the identification of flaws or contexts of use that can be exploited to compromise the functionality of the software itself or the environments in which it runs, etc.
Given the severity of the risks to which software is exposed, appropriate mitigations cannot be ignored. In this context, mitigations are the software protection methods and technologies adopted both during the software development phase and immediately before its deployment. The software protection process combines the use of cryptographic functions, transformations of the software itself, and software engineering techniques in order to mitigate risks. Software protections can rely on security features available in the environment in which the software runs, but they can also be built into the software itself, using specially designed protection technologies.
While there are no definitive solutions that can prevent attackers from obtaining the assets in the software, there are a number of protection solutions known in the art that have been adopted, aimed at delaying the attacker as much as possible and thus preserving the business model adopted by the owner of that software.
Consider, for example, the scenario of a software house (software houses represent an important part of the software protection market) at the time when it intends to market a new video game. A large portion of the sales of this type of product is concentrated in the first few days after its release. Therefore, it becomes critical to delay as much as possible the moment an attacker successfully breaches the software protections contained in the work, such as licensing controls, releasing cracks or circulating illegal copies of the video game.
The effectiveness of protection techniques applied to software is often evaluated, before a given piece of software is released, by specialized personnel inside or outside the company developing the software. However, this evaluation process is often performed manually or by means of semi-automated tools, thus requiring the expenditure of a considerable amount of time and resources. In addition, time is very often limited due to the business models adopted, which impose a stringent time-to-market in order to succeed in the market before competitors.
A first step that attackers must perform to extract assets from software is to identify the protection techniques that have been applied to specific portions of the software in order to disable or eliminate them. An assessment of the effectiveness of protections cannot be separated from estimating how easy it is to recognize which protections have been applied.
SUMMARY OF THE INVENTION
Currently, the known tools in the state of the art aimed at detecting protection techniques applied to software are not yet satisfactory because they offer, when present, a limited level of automation and consequently depend heavily on manual activity, in a context where time plays a key role in the success or failure of a software product in the market.
Furthermore, said tools are not optimized for performing protected area detection tasks within the software itself.
The purpose of the present invention is to improve the degree of automation in the identification phase of protection techniques applied to software.
It is an object of the present invention to provide a method for automating the process of identifying software protections applied to a binary file and identifying protected areas within said file as described by claim 1 and preferred embodiments thereof described by claims 2-12.
It is also an object of the present invention to provide a method of processing files as described by claim 13 and preferred embodiments thereof described by claims 14-16.
BRIEF DESCRIPTION OF THE DRAWINGS
Preferred forms of embodiment of the present invention will be described below, by way of example only, with reference to the accompanying drawings, wherein:
- Fig.1 shows an example of a computer system usable for the purposes of the present invention;
- Fig.2 shows a schematic example of a binary file in which functions and the possible presence of protections are highlighted;
- Fig.3 shows, by means of functional blocks, an example of a method of configuring a neural network capable of obtaining information about any protection techniques that may be present in a file to be analyzed;
- Fig.4 shows an example of a coded function;
- Fig.5 shows, by means of function blocks, a simplified architecture of a neural network based on LSTM (Long Short Term Memory) cells;
- Fig.6 shows, by means of functional blocks, a simplified architecture of a BERT (Bidirectional Encoder Representations from Transformers) transformer-type neural network.
DETAILED DESCRIPTION OF THE INVENTION
The following detailed description of preferred forms of embodiment refers to the accompanying drawings, which constitute a part thereof and show, by way of example, specific forms of embodiment of the present invention. The following description is therefore not intended to be limiting, and the scope of the invention is defined only by the appended claims.
Fig. 1 shows an example of a computer system 10 configured to provide information about software protections contained in a file to be analyzed.
The system 10 includes, for example, a general-purpose computing device 20, in the form of a conventional personal computer, which includes a processing unit 21, a system memory 22, and a system bus 23 that couples the system memory 22 and other system components to the processing unit 21. System bus 23 can be any of a number of different types of buses capable of providing a communication channel through which the various hardware devices exchange information. System memory 22 includes, for example, a read-only memory (ROM) 24 and a random-access memory (RAM) 25. A basic input/output system (BIOS) 26, stored in ROM 24, contains the basic routines that transfer information between the components of the personal computer 20. BIOS 26 also contains the system boot routines. Personal computer 20 also includes a hard disk drive 27 to read from and write to at least one hard disk 29. Hard disk drive 27 is connected to system bus 23 via a hard disk drive interface 32. For example, system 10 includes hard disk 29 but may include other types of media, such as memory cards, external hard disks, RAM, ROM, and the like. Program modules may be stored on hard disk 29, in ROM 24, and in RAM 25.
A user can enter commands and information into the personal computer 20 through one or more input devices such as, for example, a keyboard 40 and an optical pointing device 42. These and other input devices are often connected to the processing unit 21 through a specific input interface 46, coupled to the system bus 23, that depends on the type of port used, such as a serial port, parallel port, USB port, etc. A monitor 47 or other display device also connects to system bus 23 through an interface such as a video adapter 48. In addition to the monitor, personal computers may also include other output peripherals (not shown) such as a printer. Personal computer 20 can operate, in a data exchange network, using logical connections to one or more remote computers such as remote computer 49. Remote computer 49 can be another personal computer, a server, a router, a network PC, or another node on the network. It typically includes many or all of the components described above in relation to personal computer 20. However, in the example in Fig. 1 only one storage device 50 is shown for simplicity. The logical connections shown in Fig. 1 may include a LAN and/or WAN 51 type network common in offices, corporate computer networks, intranets, and the Internet.
When in a LAN/WAN network environment, the PC 20 connects to a network 51 through a network interface or adapter 53 that may be a wired or wireless network card. In a network environment, program modules represented as residing within the personal computer 20 or portions thereof may be stored in a remote storage device 50. The program modules may include: the operating system 35, one or more application programs 36, at least one neural network NN(Pi) (processing module 33), and a training module MOD_TRAIN 34. In particular, there is a plurality of neural networks NN(Pi), each associated with a particular software protection. Each of the neural networks NN(Pi) can be implemented in hardware, in software, or in a combination thereof. The training module MOD_TRAIN 34 is tasked with training each neural network NN(Pi) by means of data sets, i.e., collections of data used as samples for the purpose of "teaching" the neural network NN(Pi) how to react in the face of specific input data.
As will be described in more detail later, each neural network NN(Pi) is trained to obtain information about a specific protection technique possibly present in a file to be analyzed.
Figure 2 shows a schematized example of a binary file that may belong, by way of example, to an application or a software library. The binary file is formed by a plurality of functions FNZ 1 - FNZ n, each of which consists of a sequence of assembly instructions, also called lines of code or more generally code. One or more of said plurality of functions contained in the binary file may need software protection if their contents represent an asset, i.e., constitute value in economic and/ or know-how terms.
Assets that may constitute a critical area within the binary file may be, for example, but not limited to, proprietary algorithms (or other intellectual property), cryptographic secrets, or security controls such as commercial software licensing controls.
In Figure 2, function FNZ 1 and function FNZ 6 are shown with the symbol of a shield to specify that these two functions are the result of enforcing software protections, since they contain at least one asset. Note that particular lines of code or functions may have been protected in order to confuse attackers despite not being assets. In contrast, function FNZ 4 is shown with an X symbol to indicate that it is not a function protected with any kind of software protection.
Software protections, after being applied to code, leave their own fingerprint, anomalous with respect to unprotected code.
Examples of a fingerprint present in the code as a result of the application of a software protection could be particularly complex control flows or logical conditions.
Each software protection has a characteristic fingerprint that might allow certain information to be inferred, such as what peculiarities the protected assets possess and which security properties it was decided to apply. Examples of software protections may be: control flow flattening, opaque predicates, branch functions, encode arithmetic, converting data into functions (e.g., with Mealy machines), merging or splitting variables, recoding variables (e.g., xor masking, residue number encoding, ...), white-box cryptography, virtualization using virtual machines or JIT compilation, call stack checks, code guards, control flow tagging, anti-debugging, code mobility, client/server code splitting, anti-cloning, and software attestation. Figure 3 shows, by means of a flowchart, a preferred form of realization of a neural network configuration method 100 that can be implemented, for example, using System 10.
Method 100 allows one or more neural networks NN(Pi) to be configured, each to be used to obtain information about a specific protection technique possibly present in a file to be analyzed.
After a beginning phase, method 100 provides a first step 110 in which one or more source files (i.e., files expressed in a high-level language) employed for the purpose of training a neural network are provided. Consider, for the sake of brevity, the use of a single source file. According to the example described, such a source file is, initially, free of software protections.
Then, according to an example, one or more software protections are applied to that source file. In the considered example, we refer to the case where multiple software protections P1,...,Pn are applied. Specifically, such protections P1,...,Pn are applied to one or more functions of the source file.
As is also reiterated below, not all functions of the source file have protections applied to them; some of them remain unprotected. This will allow the neural networks NN(Pi) to be trained to recognize unprotected functions as well.
According to the example described here, the source file provided with the software protections P1,...,Pn is then compiled, resulting in a binary file.
Note that a binary file to which the protections P1,...,Pn have already been applied can be directly provided, i.e., avoiding the first step 110. Note that some protections are applied to the source file while others are applied directly to the binary file.
In a second step 120, disassembly of the binary file obtained, for example, from the previous compilation is carried out in order to extract the plurality of functions contained therein (i.e., its code portions).
The disassembly operation makes it possible to obtain the previously compiled file in the form of assembly code, replacing each machine-language operation code with a sequence of characters representing it in mnemonic form, i.e., in a way easily interpreted by an operator. Data and memory addresses can also be rewritten in assembly according to a numeric base, such as hexadecimal, or in symbolic form using text strings (identifiers). The program in assembly format will thus be relatively more readable than the corresponding binary.
Examples of operating codes in mnemonic format are ADD for the sum operation or MOV to indicate a copy operation. Next, a third step 130 is carried out that aims to collect in a data set, or data collection, the plurality of protected functions extracted from the assembly file, together with the indication of the protections present for each function. This indication is an identifier of the type of protection applied P1,...,Pn. The data set may have a matrix structure.
Note that the data set can also be obtained from a library of protected functions associated with a library of protections, without applying the protections to the functions in the binary file and without extracting the functions from the binary file as indicated in the second step 120.
The set of information regarding the protected function and the indication of the protections applied to the same function defines a sample CHMP. Such a sample CHMP is, for example, a row of the data set in the following format:
(asmj, P1,...,Pn)
where asmj is the j-th protected function extracted from the assembly file and the identifiers P1,...,Pn represent the specific software protections applied to the function asmj.
In particular, the identifiers P1,...,Pn can be Boolean variables indicating whether the protection Pi has been applied to the function asmj. For functions to which no protection has been applied, the identifiers P1,...,Pn all take the value "false."
Samples CHMP will be used in the subsequent steps of method 100. Note that, in order to carry out a good training phase of a neural network, the data set must contain a sufficiently large number of samples CHMP.
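For illustration only, a sample CHMP and a small data set could be represented in Python as follows (the names Sample, asm and labels are illustrative assumptions, not part of the described method):

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Sample:
        asm: List[str]       # assembly instructions of the j-th function (asmj)
        labels: List[bool]   # labels[i] is True if protection Pi was applied

    # A toy data set with n = 4 protections P1,...,P4.
    dataset = [
        Sample(asm=["push {r4, lr}", "add r0, r2, 5", "pop {r4, pc}"],
               labels=[True, False, False, False]),   # protected with P1 only
        Sample(asm=["mov r0, 0", "bx lr"],
               labels=[False, False, False, False]),  # vanilla (unprotected) function
    ]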
According to a preferred form of method 100, the first step 110 (compilation step) and the second step 120 (disassembly step) can be performed several times using different combinations of protections, involving the application of one or more sequences of protections to each function.
As mentioned above, advantageously, the data set is also constructed using compiled functions to which no software protection has been applied. The compiled functions without any software protection are called vanilla functions and are intended to balance the data set. The purpose of such balancing is to prevent an unbalanced data set from adversely affecting the learning process of a neural network, described later in this description, by leading it to focus on prevalent events while neglecting rare ones. Specifically, vanilla functions are used to make the neural network learn what unprotected functions look like, so that it can distinguish these from functions protected with the specific protection technique that the neural network is trained to identify.
In a fourth step 140, the encoding of each function (asmj) belonging to the plurality of functions is performed by converting them into encoded functions CHMP_COD. The purpose of this operation is to transform into a sequence of numerical values the instructions contained in the functions of each sample CHMP expressed in assembly language, in particular instructions containing operating codes, data and addresses expressed in mnemonic format.
At the end of said encoding step 140, each encoded function CHMP_COD belonging to the data set will be expressed as a sequence of numeric values, thus being suitable for use in a training step of a neural network. The encoding step 140 may optionally include two additional sub-steps that allow a neural network to reach convergence faster: the masking sub-step and the scaling sub-step.
In a fifth step 150, training of one of the neural networks NN(Pi) is performed. For example, a first neural network NN(P1) associated with a first software protection P1 is trained.
Particularly, the first neural network NN(P1) is trained to provide a first probability index PIi,j (in this case i = 1) indicating the probability that a given j-th function (asmj) is protected by the first protection P1.
In addition, in the case where a function has been identified to which the first protection P1 is applied with a first probability index PIi,j above a certain threshold, the first neural network NN(P1) can also provide a second index FAi,j,k. Said second index FAi,j,k represents the possibility that the first protection P1 (Pi with i = 1) has been applied to the instructions in a specific area (denoted generically by index k) of that function (asmj); the higher the value of the second index, the more "suspicious" the area is.
For example, for each instruction of a function, the first neural network NN(P1) indicates the probability that the first protection P1 has been applied to it. This makes it possible to identify the instructions of the function to which a given protection has been applied, or which have alternatively been introduced by the application of the protection; in other words, to identify the location of a protection within a function.
The training of the first neural network NN(P1) is carried out using the data set that includes the encoded functions CHMP_COD related to the first protection P1 and the encoded functions CHMP_COD protected with each possible pair of protections including P1 (e.g., P1+P2, P1+P3, ..., P1+Pn). Note that it is also possible to use longer combinations than the above, for example triples (e.g., P1+P2+P3), quadruples (e.g., P1+P2+P3+P4), etc. This could improve accuracy in cases where particular longer combinations of protections (triples, quadruples, etc.) are to be identified, for example because they are known to be used in the state of the art.
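A minimal sketch of this selection, assuming the Sample structure introduced earlier (the vanilla functions mentioned above supply the negatives; the helper name is illustrative):

    def select_for_p1(dataset):
        # positives: functions protected with P1, alone or in a pair P1+Pk
        positives = [s for s in dataset if s.labels[0] and sum(s.labels) <= 2]
        # negatives: vanilla functions, i.e., no protection applied at all
        negatives = [s for s in dataset if not any(s.labels)]
        return positives, negatives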
According to the example described, training is carried out using the training module MOD_TRAIN 34.
The training can be repeated for each neural network NN(Pi) related to the other protections of interest P2,...,Pn as well. Note that the neural network NN(Pi) can be chosen from neural networks capable of handling sequences and/or neural networks having an attention mechanism.
One type of neural network capable of handling sequences is, for example, a recurrent neural network, that is, a network in which feedback connections are present. Such feedback creates a kind of "memory" of what happened in the recent past, making available at time T information processed at time T-1 or T-2, thereby making the value of the current output depend not only on the current input values but also on the previous inputs. An example of a recurrent neural network is the Long Short-Term Memory (LSTM) network.
The idea behind the attention mechanism is to be able to define which parts of the input vector the neural network should focus on to generate the appropriate output. In other words, an attention mechanism allows the network to process input data while also attending to relevant information contained in other input data. The attention mechanism also allows the masking of data that do not contain relevant information. Examples of neural networks that use the attention mechanism are recurrent neural networks, such as the aforementioned LSTM, or neural networks such as BERT (Bidirectional Encoder Representations from Transformers).
Neural networks NN(Pi), trained as described above, can be employed in a classification method applied to a binary file to be analyzed (i.e., a file distinct from the one used for training in the configuration method 100).
In this case, the binary file to be analyzed is disassembled and the relevant functions (asm) to be analyzed are extracted from the resulting assembly file. This can be achieved using a conventional disassembler.
Each function (asm) is then processed by each neural network NN(Pi). Each such neural network NN(Pi) will return a corresponding first probability index PIi,j associated with a specific protection Pi and also, preferably, the second index FAi,j,k for each function.
The set of values of the first probability index PIi,j allows a classification of the protections P1,...,Pn that may be present in the analyzed binary file.
The values of the second indices FAi,j,k are accompanied by indications that identify the location, within each function, of the instructions having those values of the second index.
Thus, the classification method will allow the security quality of the protections applied to the analyzed binary file to be evaluated: the detection of a protection by the neural networks NN(Pi) indicates that the protection is quickly identifiable, so that an attacker is "delayed" less.
With reference to examples of practical applications, note that companies specializing in software protection typically operate using two separate teams: the first (protection team) is responsible for actually protecting the software, while the second team (reverse engineering team) emulates the behaviour of possible attackers, attempting to identify the assets within the application and the protections used, and then removing/bypassing these protections, compromising the security of the assets. The protection team proposes an initial solution, the quality of which is evaluated by the reverse engineering team. These operations are then performed iteratively until a sufficient level of protection has been achieved (or the available time has run out).
The described classification method, based on the configuration method 100, can thus be used by companies specializing in software protection in two different ways. The protection team can obtain a quick assessment of the identifiability of the chosen protections (without waiting for the results of reverse engineering activities). At the same time, the described classification method can also be used by the reverse engineering team to automate and speed up the identification of assets, an essential first step in their activities.
The configuration and classification methods described above are applicable to each type of protection listed above and, with particular effectiveness, to the following types of protection, as revealed by tests conducted by the Applicant: control flow flattening, opaque predicates, branch functions, and encode arithmetic.
Particular forms of implementation of some of the steps of method 100 are described below.
Encoding
Encoding step 140, as mentioned earlier, aims to transform the instructions contained in the functions of each sample CHMP into sequences of numerical values. This transformation encodes each function (asmj) of the samples CHMP belonging to the data set as a matrix of numerical values whose rows are the encodings of the instructions composing the function itself.
The encoded function, as shown in Figure 4a, has two dummy pseudo-instructions (also encoded along with the rest of the function), <begin> and <end>, added at the beginning and at the end of the function (asmj), respectively, to make its boundaries explicit. Such dummy pseudo-instructions are inserted in a preliminary sub-step of the encoding step 140.
Optionally, again in said preliminary sub-step, the number of instructions constituting the function can be truncated to the first n instructions. For illustrative purposes only, Figure 4c shows the encoded function of Figure 4a truncated to a predetermined maximum size of 4. The truncation operation becomes necessary when the type of neural network chosen to be trained to operate as a classifier requires that the input sequence have a maximum size that must not be exceeded.
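By way of illustration, a minimal sketch of this preliminary sub-step (the function name prepare_function is an assumption):

    def prepare_function(asm, max_len=None):
        seq = ["<begin>"] + list(asm) + ["<end>"]  # make the boundaries explicit
        if max_len is not None:
            seq = seq[:max_len]   # truncate to the first n instructions
        return seq

    prepare_function(["add r0, r2, 5", "mov r1, r0", "bx lr"], max_len=4)
    # -> ['<begin>', 'add r0, r2, 5', 'mov r1, r0', 'bx lr']; the <end> marker is cut off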
Figure 4b shows the detail of an i-th instruction belonging to the function in Figure 4a after encoding. According to the example, the encoded instruction consists of:
- an opcode representing the numeric encoding of the instruction opcode;
- an address representing the address of the instruction;
- a set of parameters op1-op6 representing the encoding of any operands of the instruction.
The number of parameters may vary based on the hardware architecture chosen.
Figure 4b shows the generalization of an instruction based on an ARM-type architecture, where more complex instructions can have up to six operands. However, the number of operands that can be handled by the encoding step 140 is not limited to this maximum, so that the encoding can easily be adapted to instructions of different hardware architectures.
An example of an instruction encoding performed by the present method is described below, keeping in mind that the numerical values given in relation to the encoding are intended for descriptive purposes only, as these values may change depending on the hardware architecture and the neural network chosen.
An example of an assembly instruction is "0x1234 add r0, r2, 5", which expresses the assignment r0 = r2 + 5 and can be divided into five different parts:
- 0x1234: is the numeric value indicating the address of the instruction; in the example considered it is an integer expressed in hexadecimal, equivalent to the number 4660 in decimal base, and indicates the location of the instruction in memory;
- add: is the type of operation (opcode) of the instruction, which in this case is a sum;
- r0: is the first operand of the instruction and refers to memory register r0;
- r2: is the second operand of the instruction and refers to memory register r2;
- 5: is the third operand of the instruction and refers to the integer 5.
In this example, the instruction uses only three operands; the fourth, fifth and sixth operands will be absent.
As an example, encoding step 140 transforms each instruction line of the functions of the sample CHMP into a sequence of 1250 values as follows:
- 236 dedicated to encoding the opcode;
- 4 dedicated to encoding the instruction address;
- 273 dedicated to encoding the first operand;
- 239 dedicated to encoding the second operand;
- 239 dedicated to encoding the third operand;
- 239 dedicated to encoding the fourth operand;
- 16 dedicated to encoding the fifth operand;
- 4 dedicated to encoding the sixth operand.
Consider that the instruction encoding must always be a sequence of 1250 values, even when the instruction has fewer than the maximum number of operands (six in this example). Since the ADD instruction considered has only three operands, the absence of the missing operands is also encoded so as not to alter the length of the encoding sequence.
The encoding of the opcode occurs on a sequence of 236 numeric values and represents the embedding of the opcode itself. Embedding refers to a standard modelling technique whereby words or numbers are mapped into numerical sequences. Said values were pre-computed by means of a neural network using standard algorithms, such as the CBOW (Continuous Bag Of Words) and Skip-gram algorithms, described in: Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado and Jeffrey Dean, "Distributed Representations of Words and Phrases and their Compositionality", in Proceedings of the 26th International Conference on Neural Information Processing Systems, Volume 2 (NIPS'13), Curran Associates Inc., Red Hook, NY, USA, pp. 3111-3119, available at: https://proceedings.neurips.cc/paper/2013/file/9aa42b31882ec039965f3c4923ce901b-Paper.pdf
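For instance, such embeddings could be pre-computed with the gensim library (the choice of library and the hyper-parameters are assumptions for illustration; the description only specifies the CBOW and Skip-gram algorithms and the 236-value size):

    from gensim.models import Word2Vec

    # Each "sentence" is the opcode sequence of one disassembled function.
    corpus = [["push", "add", "mov", "pop"], ["mov", "bx"]]
    model = Word2Vec(sentences=corpus, vector_size=236,
                     sg=0,            # sg=0 selects CBOW; sg=1 would select Skip-gram
                     window=5, min_count=1)
    add_embedding = model.wv["add"]   # 236 floats encoding the "add" opcode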
Address encoding is done on a sequence of 4 numeric values where each of the 4 values is determined according to a specific criterion.
The first value in the sequence is valued at 1 when the instruction contains an address, or at 0 when the address is missing, as in the case of some special instructions. The <begin> and <end> pseudo-instructions are examples of special instructions.
The second value of the sequence is valued at 0 when the instruction has no address or when the address value is invalid (NaN) or infinite. In all other cases, the second value of the sequence is valued at 1. The third value in the sequence contains the address value expressed in decimal base when present; otherwise it is valued at 0.
The fourth value in the sequence is valued at 1 when the instruction address contains the exclamation mark (!), as is the case for some special addresses; otherwise it is valued at 0. The exclamation mark is used in certain (very rare) cases in ARM to indicate the write-back operation, that is, that the result of an operation must be written to a certain address. For example, if the address "1234!" is written in an instruction, it means that the result of the instruction must be written back to memory cell 1234. If the exclamation mark is not present, the result of the operation is not written to memory cell 1234 (which typically will only be read).
According to said criteria, the address 0x1234 contained in the example instruction is encoded as (1, 1, 4660, 0).
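A minimal sketch of these four criteria (the helper encode_address is illustrative):

    import math

    def encode_address(addr, has_writeback=False):
        if addr is None:                    # special instructions such as <begin>/<end>
            return [0, 0, 0, 0]
        valid = not (math.isnan(addr) or math.isinf(addr))
        return [1,                          # an address is present
                1 if valid else 0,          # the address value is usable
                addr if valid else 0,       # the address in decimal base
                1 if has_writeback else 0]  # the "!" write-back notation

    encode_address(0x1234)  # -> [1, 1, 4660, 0]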
The encoding of the first operand, as mentioned earlier, occurs on a sequence of 273 values that are divided into 9 sections in order to represent the different types of operands.
A first section called "string" is used to encode a string-type operand on one value of the sequence. A second section called "number" is used to encode a number-type operand on a sequence of 4 values. Third, fourth, and fifth sections named "endian", "cond", and "CPU state" are used to encode internal processor states on sequences of 2, 15, and 4 values, respectively. A sixth section named "registers" is used to encode memory registers on a sequence of 154 values. A seventh section called "barrier" is used to encode memory barriers on a sequence of 12 values. A memory barrier is a type of operation that allows the CPU to impose a constraint on the ordering of operations, preventing the out-of-order execution due to performance optimizations of modern CPUs.
An eighth section called "address" is used to encode memory addresses over a sequence of 65 values. Finally, the ninth section called "coproc" is used to encode a math coprocessor on a sequence of 16 values.
In the example instruction, the first operand is a register, r0, and therefore it will be encoded in the sixth "registers" section, while the value sequences in the other sections will all be valued at 0.
Of the 154 values in that section, the first 123 values are associated with registers, where the first value is associated with register r0, the second value with register r1, the third value with register r2, etc. During encoding, the value associated with the register to be encoded will be valued at 1 while the values associated with the other registers will be valued at 0. The remaining 31 values are used to encode special mathematical operations on registers, such as scaling. Since the first operand, r0, of our example does not require special mathematical operations, the sequence of 31 values will all be valued at 0.
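A minimal sketch of the 154-value "registers" section (illustrative helper; register rK is assumed to map to index K):

    def encode_register(index):
        one_hot = [0] * 123          # one value per register: r0 -> 0, r1 -> 1, ...
        one_hot[index] = 1
        return one_hot + [0] * 31    # 31 values for special operations (none here)

    assert encode_register(0)[0] == 1 and len(encode_register(0)) == 154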
The encoding of the second operand, similarly to what has been described for the first operand, is done on a sequence of 239 values that are divided into 4 sections in order to represent the different types of operands.
A first section called "number" is used to encode a numeric type operand on a sequence of 4 values. A second section called "registers" is used to encode memory registers on a sequence of 154 values. A third section called "address" is used to encode memory addresses on a sequence of 65 values. A fourth section called "reg coproc" is used to encode a math coprocessor register on a sequence of 16 values.
In the example instruction, the second operand is again a register, r2, and therefore it will be encoded in the second "registers" section, while the value sequences in the other sections will all be valued at 0.
Of the 154 values in that section, the first 123 values are associated with registers, where the first value is associated with register r0, the second value with register r1, the third value with register r2, etc. During encoding, the value associated with the register to be encoded will be valued at 1 while the values associated with the other registers will be valued at 0. The remaining 31 values are used to encode special mathematical operations on registers, such as scaling. Since the second operand, r2, of our example does not require special mathematical operations, the sequence of 31 values will all be valued at 0.
The encoding of the third operand is done on a sequence of 239 values that are divided into 4 sections, as for the second operand, in order to represent the different types of operands. A first section called "number" is used to encode a numeric-type operand on a sequence of 4 values. A second section called "registers" is used to encode memory registers on a sequence of 154 values. A third section called "address" is used to encode memory addresses on a sequence of 65 values. A fourth section called "reg coproc" is used to encode a math coprocessor register on a sequence of 16 values.
In the example instruction, the third operand is a number, 5, so it will be encoded in the first "number" section, while the sequences of values in the other sections will all be valued at 0.
The first value of the "number" section is valued at 1 when the operand contains a numeric value; otherwise it is valued at 0. The second value of the "number" section is valued at 1 when the numeric value is not NaN or infinite; otherwise it is valued at 0. The third value of the "number" section is valued with the numeric value of the operand, which in the example is equal to 5. The fourth value of the "number" section is valued at 1 when the notation with an exclamation mark (!) is used; otherwise it is valued at 0.
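These four criteria mirror those of the address encoding; a minimal illustrative helper:

    import math

    def encode_number(value, has_writeback=False):
        if value is None:
            return [0, 0, 0, 0]              # no numeric operand present
        finite = not (math.isnan(value) or math.isinf(value))
        return [1, 1 if finite else 0, value if finite else 0,
                1 if has_writeback else 0]

    encode_number(5)  # -> [1, 1, 5, 0]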
According to the criteria defined for the "number" section, the third operand in the example, the number 5, is encoded as (1, 1, 5, 0). The encoding of the fourth operand is done on a sequence of 239 values, which are divided into 4 sections, as for the second and third operands, in order to represent the different types of operands. A first section called "number" is used to encode a numeric-type operand on a sequence of 4 values. A second section called "registers" is used to encode memory registers on a sequence of 154 values. A third section called "address" is used to encode memory addresses on a sequence of 65 values. A fourth section called "reg coproc" is used to encode a register of a math coprocessor on a sequence of 16 values.
The example instruction does not have a fourth operand, but it will still need to be encoded to keep the length of the sequence of values consistent. In this case, the values of all sections will be valued at 0.
The encoding of the fifth operand occurs on a sequence of 16 values included in a single section called "reg coproc", which is used to encode a register of a math coprocessor.
The example instruction again does not have a fifth operand but it will still need to be encoded to keep the length of the sequence of values consistent. As in the previous case, in the absence of the operand, the section values will all be valued at 0.
The encoding of the sixth operand occurs on a sequence of 4 values included in a single section called "number", which is used to encode a numeric-type operand.
The example instruction does not have a sixth operand either, but it will still have to be encoded to keep the length of the sequence of values consistent. As in the previous cases, in the absence of the operand, the section values will all be valued at 0. In conclusion, at the end of the encoding step the example instruction "0x1234 add r0, r2, 5" will be represented on a numeric sequence of 1250 values as follows:
- the first 236 values represent the add opcode and are encoded with the sequence (0.5741776823997498, 0.5895169377326965, 0.44707465171813965, 0.5283305644989014, ...);
- the next 4 values represent the address 0x1234 and are encoded with the sequence (1, 1, 4660, 0);
- the next 273 values represent the first operand r0 and are encoded with the sequence (0, ..., 0, 1, 0, ...);
- the next 239 values represent the second operand r2 and are encoded with the sequence (0, ..., 0, 0, 1, 0, ...);
- the next 239 values represent the third operand 5 and are encoded with the sequence (1, 1, 5, 0, ...);
- the next 239 values represent the absent fourth operand and are encoded with the sequence (0, 0, 0, ..., 0);
- the next 16 values represent the absent fifth operand and are encoded with the sequence (0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0);
- the next 4 values represent the absent sixth operand and are encoded with the sequence (0, 0, 0, 0).
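Putting the pieces together, the 1250-value layout just listed can be sketched as follows (opcode_embedding stands in for the pre-computed 236-value embedding with placeholder values; encode_address, encode_register and encode_number are the illustrative helpers above; the zero paddings follow the section sizes described earlier):

    opcode_embedding = [0.574, 0.590, 0.447, 0.528] + [0.0] * 232  # placeholder values

    vector = (
        opcode_embedding                             # 236: "add" opcode embedding
        + encode_address(0x1234)                     #   4: (1, 1, 4660, 0)
        + [0] * 26 + encode_register(0) + [0] * 93   # 273: op1 = r0, "registers" section
        + [0] * 4 + encode_register(2) + [0] * 81    # 239: op2 = r2, "registers" section
        + encode_number(5) + [0] * 235               # 239: op3 = 5, "number" section
        + [0] * 239                                  # 239: absent op4
        + [0] * 16                                   #  16: absent op5
        + [0] * 4                                    #   4: absent op6
    )
    assert len(vector) == 1250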
As mentioned earlier, the encoding step 140 can optionally include two additional sub-steps that enable a neural network to achieve convergence faster: the masking sub-step and the scaling sub-step.
The masking sub-step is responsible for eliminating from the encoded samples belonging to the data set every column that has the same value in all samples. The elimination of said columns is possible because the encoding step 140 is capable of representing more instructions than a processor would actually be able to execute, so some of the values are never used in practice. The masking sub-step allows the length of the encoded sequences to be reduced, so that the neural network does not waste training time on data that carries no real information. In the example case, the masking sub-step is able to reduce the length of the value sequence from 1250 to 768.
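A minimal sketch of the masking sub-step, assuming the encoded instructions of all samples are stacked into a NumPy matrix with one 1250-value row per instruction:

    import numpy as np

    def mask_constant_columns(encoded):
        # a column is dropped when it holds the same value in every row
        keep = ~np.all(encoded == encoded[0, :], axis=0)
        return encoded[:, keep], keep   # keep can be reused to mask new inputs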
The rescaling sub-step performs a rescaling of values within a range of 0 to 1. Although the encoded samples belonging to the data set contain the values 0 and 1 most of the time, in some cases they contain much larger values. For example, in the case of the address encoding on a sequence of length 4, the third value in the sequence contains the address expressed in decimal base, which in the example described above was worth 4660. Other values different from 0 and 1 are present, for example, in the "number" section of the encoding of the first, second, third, fourth and sixth operands. In the above example, the third operand was the number 5, encoded as the third value in the "number" section of the third operand. The rescaling operation prevents larger numbers from being given a greater weight in training that does not necessarily correspond to a greater importance. A greater weight given to a value in the initial training steps of a neural network slows down the learning process, since the network would take longer to realize that the greater weight given to such values was unfounded.
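A minimal min-max rescaling sketch, under the same matrix assumption as above:

    import numpy as np

    def rescale(encoded):
        lo, hi = encoded.min(axis=0), encoded.max(axis=0)
        span = np.where(hi > lo, hi - lo, 1)   # avoid division by zero
        return (encoded - lo) / span           # every column now lies in [0, 1]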
Training
When the encoding step 140 is completed, the encoded samples CHMP_COD can be used in the training step 150 by sending them to the neural network NN(Pi).
As mentioned above, any neural network capable of handling sequences and provided with an attention mechanism can be used in this step. In particular, two possible alternative training phases will be described, using two different architectures of said neural networks.
Figure 5 shows a first simplified architecture based on LSTM cells, a type of recurrent neural network with a long-term memory mechanism that enables the processing of data sequences. The information in said sequences is stored so that, thanks to the presence of loops, as the sequence is processed the information stored in the cells assists in the processing of new data. In this way the neural network is able to interpret in order the assembly instructions contained in the encoded samples CHMP_COD. A semantic encoding of the function contained in the encoded sample CHMP_COD is used as input to the neural network, where a first layer 1S_LSTM of LSTM cells parses the assembly instructions contained in the encoded sample CHMP_COD. Said LSTM cells are bidirectional cells that can parse the assembly instructions from first to last and in the opposite direction, from last to first.
An attention mechanism ATT receives the output data from the first layer of cells 1S_LSTM, from which it is able to extract the attention levels LIV_ATT of the individual instructions, giving an indication of which assembly instructions contain a protection. The higher the attention level, the more likely it is that the assembly instruction is part of a protection. This attention level LIV_ATT corresponds to the second index FAi,j,k described above.
The output of the attention mechanism ATT is summed with the output of the first cell layer 1S_LSTM, so that information from the attention levels LIV_ATT is added; the result then enters a second layer 2S_LSTM of LSTM cells, which analyzes the assembly instructions contained in the encoded sample.
The output of the second layer of cells 2S_LSTM enters as input to a final layer TRAS_LIN, where a linear transformation and a sigmoid function are used to calculate the values of the first index PIi,j for each assembly instruction in a range between 0 and 1. The final value (or score) CLASS is given by the score obtained from the last instruction contained in the encoded sample CHMP_COD, i.e., the pseudo-instruction <end>. The closer said score is to the value 1, the more likely the presence of the protection. Said score CLASS corresponds to the first probability index PIi,j.
A second architecture, as shown in Figure 6, is based on a transformer of the BERT type (Bidirectional Encoder Representations from Transformers), which is a non-recurrent architecture with an attention mechanism. Since this type of neural network needs the input function to have a maximum length known a priori, the previous encoding step 140 performs the function truncation operation shown in Figure 4c. Therefore, in this case the last instruction of the function may not be the pseudo-instruction <end>.
The technical expert is familiar with the particularities of this neural network architecture and with how it overcomes the limitations present in recurrent neural network architectures.
Figure 6 shows a simplified BERT-based architecture where a semantic encoding of the function contained in the encoded sample CHMP_COD is added to a positional encoding of the function, so as to add positional information for the assembly instructions contained in the encoded sample CHMP_COD. This first operation is necessary to make the position information of the instructions explicit, since the transformer-type neural network, having no recurrence, does not have the notion of the position of an element within a sequence. The positional encoding is a prerequisite of the training step and consists of coefficients calculated according to standard formulas and stored within a matrix.
The output result of the previous step enters as input to a series of encoding layers S_ENCOD that analyze its contents, where each encoding layer possesses an attention mechanism that is used to evaluate the attention levels LIV_ATT of the assembly instructions, in a manner analogous to the LSTM-type neural network architecture. These attention levels LIV_ATT correspond to values of the second index FAi,j,k described above.
The output from the encoding layers S_ENCOD enters as input to a final layer TRAS_LIN, where a linear transformation and a sigmoid function are used to compute classification scores for each assembly instruction in a range between 0 and 1. Unlike in the layer TRAS_LIN of an LSTM network, the final score value CLASS is relative to the first instruction contained in the encoded sample CHMP_COD, i.e., the pseudo-instruction <begin>. Again, the final score CLASS corresponds to the first probability index PIi,j introduced above.
Note that, in addition to the two examples described, it is possible to train any architecture among those capable of handling sequences and/or having an attention mechanism, using any standard training algorithm.
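By way of illustration only, a minimal PyTorch sketch in the spirit of the first (LSTM-based) architecture; the layer sizes, the exact attention formulation and the way the attention output is summed back are assumptions, not the described design:

    import torch
    import torch.nn as nn

    class ProtectionClassifier(nn.Module):
        def __init__(self, in_dim=768, hidden=128):
            super().__init__()
            self.lstm1 = nn.LSTM(in_dim, hidden, bidirectional=True, batch_first=True)
            self.att = nn.Linear(2 * hidden, 1)    # per-instruction attention (ATT)
            self.lstm2 = nn.LSTM(2 * hidden, hidden, bidirectional=True, batch_first=True)
            self.head = nn.Linear(2 * hidden, 1)   # final layer (TRAS_LIN)

        def forward(self, x):                      # x: (batch, instructions, in_dim)
            h1, _ = self.lstm1(x)
            liv_att = torch.softmax(self.att(h1).squeeze(-1), dim=1)  # LIV_ATT
            h = h1 + h1 * liv_att.unsqueeze(-1)    # sum attention output with 1S_LSTM
            h2, _ = self.lstm2(h)
            scores = torch.sigmoid(self.head(h2)).squeeze(-1)  # one score per instruction
            return scores, liv_att   # scores[:, -1] plays the role of the score CLASS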
The described solution produces a qualitative metric of the identifiability of protection techniques that can be used as an estimate of the quality of the chosen protection solution, introducing a high degree of automation in identifying the protection techniques applied to a file and the protected areas within the file.
Note that the results obtained through the application of the method of the present invention can be used as a tool for validating the risk exposure of assets, validating the invisibility of developed protection techniques, and identifying the methods by which viruses and malware are obfuscated, so as to help update antivirus tools. The described solution thus allows a higher level of software protection to be achieved for the same amount of time spent, or a level of protection equivalent to that achievable by known techniques in a significantly shorter time interval.

Claims

CLAIMS
1. A neural network configuration method (100), comprising the steps of:
a) defining (110; 120) a plurality of functions (asm) and applying a plurality of software protections (P1,...,Pn) to said functions;
b) constructing (130) a data set comprising a plurality of samples, each including a function (asmj) of the plurality and at least one of said software protections (P1,...,Pn) applied to said function;
c) encoding (140) each function (asmj) of the data set to obtain a plurality of encoded samples (CHMP_COD), each expressed as a sequence of numerical values;
d) training (150) a neural network (NN(Pi)) by the plurality of encoded samples (CHMP_COD) such that it is capable of processing a file to be analyzed and providing information regarding software protections applied to said file to be analyzed.
2. The method (100) of claim 1, wherein: said step of defining the plurality of functions (110, 120) further comprises defining a plurality of vanilla functions to which no software protections are applied; said data set further comprising a plurality of samples each including a function (asmj) of the plurality of vanilla functions.
3. The method (100) of claim 1, wherein the step of defining (110, 120) the plurality of functions comprises the steps of: providing a source file comprising said plurality of functions (asm); applying (110) the plurality of software protections (P1,...,Pn) to the plurality of functions (asm) of the source file and compiling the source file provided with the software protections, resulting in a compiled binary file; disassembling (120) the compiled binary file into an assembly file and extracting the plurality of functions (asm) from the assembly file.
4. The method (100) of claim 1, wherein the step of defining (110, 120) the plurality of functions comprises the steps of: providing a binary file comprising said plurality of functions (asm) to which a plurality of software protections (P1,...,Pn) are applied; disassembling (120) the binary file, resulting in an assembly-format file, and extracting the plurality of functions (asm) from the assembly-format file.
5. The method (100) of claim 1, wherein said neural network (NN(Pi)) is associated with a single type of software protection (Pi).
6. The method (100) of claim 2, wherein said neural network (NN(Pi)) is configured such that said information comprises a first probability index (PIi,j) indicative of a probability that a first function (asmj) of the plurality of functions has been protected by the first protection (Pi).
7. The method (100) of claim 3, wherein said neural network (NN(Pi)) is configured such that said information comprises a second index (FAi,j,k) indicative of a possibility that the first protection (Pi) has been applied to instructions of a specific area of the first function.
8. The method (100) of claim 3, wherein the plurality of software protections (P1,...,Pn) comprises at least one of the following protections: control flow flattening, opaque predicates, branch functions, encode arithmetic, converting data into functions, merging or splitting variables, recoding variables, white-box cryptography, virtualization using virtual machines or JIT compilation, call stack checks, code guards, control flow tagging, anti-debugging, code mobility, client/server code splitting, anti-cloning, and software attestation.
9. The method of claim 1, wherein said method is carried out to configure a plurality of neural networks (NN(Pi)), each associated with a relative software protection.
10. The method (100) of claim 1, wherein said neural network (NN(Pi)) is implemented according to at least one of: a network capable of handling sequences, a network having an attention mechanism.
11. The method (100) of claim 9, wherein said neural network (NN(Pi)) is implemented according to at least one of the following neural network types: LSTM network, BERT network, GRU network, Transformer-XL network.
12. The method (100) of claim 1, wherein the plurality of encoded samples includes a plurality of sequences of numerical values and said encoding step further comprises: a masking step wherein repeated values in each sequence of the plurality are removed from the plurality of sequences of numerical values; a rescaling step of said numerical values within a predetermined range.
13. A method of processing files, comprising the steps of:
- providing a binary file to be analyzed including a plurality of functions to be analyzed;
- disassembling the binary file to be analyzed to obtain an assembly file;
- extracting from the assembly file the plurality of functions to be analyzed (asm);
- encoding each function to be analyzed by expressing it as a relative sequence of numerical values;
- providing a plurality of neural networks (NN(Pi)), each associated with a relative software protection (P1,...,Pn), configured according to the configuration method (100) of at least one of the preceding claims;
- processing the plurality of functions to be analyzed (asm) using the plurality of neural networks (NN(Pi)) to search for information related to software protections within the plurality of functions to be analyzed.
14. The method of claim 13, wherein each of said neural networks (NN(Pi)) is associated with a respective type of software protection (Pi).
15. The method of claim 14, wherein processing the plurality of functions to be analyzed
(asm) by the plurality of neural networks (NN(Pi)) returns classification information including a plurality of probability indices (PIi,j), each indicative of a probability that a relative function (asmj) of the plurality of functions is protected by one of said protections (Pi).
16. The method of claim 14, wherein processing the plurality of functions to be analyzed (asm) by the plurality of neural networks (NN(Pi)) returns positional information including a plurality of second indices (FAi,j,k), each indicative of a possibility that a corresponding protection (Pi) has been applied to instructions of a specific area of a function.

