CN117574334B - Code obfuscation method and system combining MD5 and a sequence-to-sequence model - Google Patents

Code obfuscation method and system combining MD5 and a sequence-to-sequence model

Info

Publication number
CN117574334B
CN117574334B · Application CN202311040048.0A
Authority
CN
China
Prior art keywords
sequence
plaintext
character
constant
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311040048.0A
Other languages
Chinese (zh)
Other versions
CN117574334A (en)
Inventor
苏庆
袁梓迪
谢国波
林志毅
黄剑锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong University of Technology
Original Assignee
Guangdong University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong University of Technology filed Critical Guangdong University of Technology
Priority to CN202311040048.0A priority Critical patent/CN117574334B/en
Publication of CN117574334A publication Critical patent/CN117574334A/en
Application granted granted Critical
Publication of CN117574334B publication Critical patent/CN117574334B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/10Protecting distributed programs or content, e.g. vending or licensing of copyrighted material ; Digital rights management [DRM]
    • G06F21/12Protecting executable software
    • G06F21/14Protecting executable software against software analysis or reverse engineering, e.g. by obfuscation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60Protecting data
    • G06F21/602Providing cryptographic facilities or services
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/40Transformation of program code
    • G06F8/41Compilation
    • G06F8/44Encoding
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Hardware Design (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Computer Security & Cryptography (AREA)
  • Multimedia (AREA)
  • Technology Law (AREA)
  • Bioethics (AREA)
  • Devices For Executing Special Programs (AREA)
  • Storage Device Security (AREA)

Abstract

The invention discloses a code obfuscation method and system combining MD5 with a sequence-to-sequence model. The method comprises: constructing a constant data set, constructing an encoder dictionary and a decoder dictionary, preprocessing the data, constructing and training a sequence-to-sequence model, constructing a decryption function, constructing opaque predicates, inserting the opaque predicates and the decryption function, and compiling to generate an executable application. The invention combines the MD5 hash algorithm and a sequence-to-sequence model as the encryption and decryption algorithms, and constructs opaque predicates by encrypting and decrypting the constants in branch expressions, thereby realizing code obfuscation. The constants in an expression serve as plaintext, which is encrypted with the MD5 hash algorithm; the one-wayness of MD5 makes the resulting ciphertext hard to invert, strengthening the opaque predicates against static analysis. The ciphertext is decrypted by the sequence-to-sequence model, which stores the mapping from ciphertext to plaintext in its model weights, further improving the security of the opaque predicates. The disclosed code obfuscation method effectively protects program execution logic and strengthens a program's resistance to reverse analysis.

Description

Code obfuscation method and system combining MD5 and a sequence-to-sequence model
Technical Field
The invention belongs to the field of code obfuscation for software protection, and in particular relates to a code obfuscation method and system combining MD5 with a sequence-to-sequence model.
Background
With the rapid development of information technology, software security has attracted increasing attention. Attackers crack software with various reverse-analysis techniques and steal its important data and core algorithms, seriously infringing the intellectual property of software developers and damping the creativity and enthusiasm of the software industry. To improve software security, various software protection technologies have emerged, such as serial-number verification, software watermarking, software encryption and decryption, and code obfuscation. Code obfuscation is flexible, convenient to implement, and low-cost; it protects the core code content and operating logic of software well and resists software reverse engineering effectively, so it is now widely applied in software protection and used broadly in fields such as cloud computing, the Internet of Things, and artificial intelligence.
Code obfuscation applies semantics-preserving transformations to the code of a computer program so that the code becomes hard to read and analyze in form, thereby protecting the software. Because the transformed program is less readable and harder to modify, it is more resistant to static analysis and reverse engineering, and an attacker must pay a much higher cost to crack the software.
Code obfuscation techniques fall into four categories: layout obfuscation, data obfuscation, control-flow obfuscation, and preventive obfuscation. Control-flow obfuscation is the most direct way to protect a program's algorithmic logic, because the control-flow graph of a program clearly reflects its algorithmic logic and execution flow. The main control-flow obfuscation methods are control-flow flattening and opaque predicates; opaque-predicate obfuscation is simple in form, stealthy, and introduces little overhead, which makes it one of the hot obfuscation techniques.
An opaque predicate is a predicate Q whose output is known to the code protector but is difficult for an attacker to infer. Opaque predicates come in three forms: always-true, always-false, and sometimes-true. An always-true opaque predicate always outputs true; an always-false opaque predicate always outputs false; a sometimes-true opaque predicate outputs true at some times and false at others. Inserting opaque predicates into a program complicates its control-flow graph while leaving its functionality unchanged, which increases the difficulty of reverse-analyzing the program.
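For illustration only (this example is not from the patent), a classic number-theoretic always-true opaque predicate can be written in a few lines of Python: for any integer x, x*x + x = x*(x + 1) is a product of two consecutive integers and is therefore always even.

```python
def guarded(x: int) -> int:
    # (x*x + x) % 2 == 0 holds for every integer x, but a static analyzer
    # cannot tell this without number-theoretic reasoning.
    if (x * x + x) % 2 == 0:
        return x + 1          # the real program logic (always taken)
    else:
        return x - 1          # dead code an attacker must still analyze
```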
As the current mainstream machine-learning model, neural networks are widely applied in many fields thanks to their high fault tolerance and strong ability to approximate nonlinear functions. In machine translation, a neural network learns from data and updates its weights to convert one text into another, which resembles a traditional encryption and decryption algorithm; a neural network can therefore simulate such an algorithm to convert between plaintext and ciphertext. The non-interpretability of neural network weights is usually cited as a drawback, but in code obfuscation it becomes an important advantage: constructing opaque predicates from a neural network that simulates an encryption and decryption algorithm gives the resulting opaque predicates higher security and reliability.
Chinese patent document CN107437005A discloses a code obfuscation method and device based on chaotic opaque predicates. The method constructs chaotic opaque predicates from chaotic opaque expressions and number-theoretic expressions; different combinations of chaotic maps and quadratic maps generate different ways of constructing opaque expressions, giving the obfuscation process diversity and uncertainty as well as higher generality and security.
Chinese patent document CN112256275A discloses a code obfuscation method, apparatus, electronic device, and medium. The method obtains a source code file of a target application that contains a preset built-in function for selecting the constant strings to be obfuscated; during compilation of the source code file, the constant strings selected by the built-in function are determined as target constant strings, and the target constant strings are obfuscated.
Chinese patent document CN114357390A discloses a code obfuscation method, apparatus, electronic device, and storage medium. For a constant to be replaced in the original code, it obtains several candidate replacement methods from a preset encrypted replacement table according to the content and order of the table's encryption cells; it then selects a target replacement method from the candidates by a preset selection method, replaces the constant accordingly to obtain obfuscated code, and runs the replaced code to hide the original code.
Disclosure of Invention
The invention provides a code obfuscation method combining MD5 with a sequence-to-sequence model, aiming at the problems of current opaque predicates: their expression forms are too simple, their black-box property is insufficient, they are easy to crack by reverse analysis, and they can only obfuscate one kind of constant in a predicate, which limits their generality.
The invention also provides a code obfuscation system combining MD5 with a sequence-to-sequence model: after a source program to be obfuscated is input into the system, the system obfuscates it with the above method and outputs the obfuscated source program.
Terminology:
Source program to be obfuscated: the source program on which code obfuscation is to be performed.
Gated recurrent unit (Gate Recurrent Unit, GRU): a kind of recurrent neural network (Recurrent Neural Network, RNN) that alleviates problems of RNNs such as long-term memory and vanishing or exploding gradients during backpropagation.
One-hot encoding: N states are encoded with an N-bit state register; each state has its own unique register bit, and only one bit is valid at any time.
Intermediate state: the vector that the encoder generates from the feature information it learns from the model input; it contains the feature information of all the data the encoder has learned from the input. The decoder obtains the input's feature information mainly through the intermediate state in order to predict the model output.
The technical scheme of the invention is as follows:
A code obfuscation method combining MD5 with a sequence-to-sequence model comprises the following steps, as shown in FIG. 1:
S100: Construct a constant data set: for the expressions in the branch statements of the source program to be obfuscated, take every string constant and numerical constant in the expressions as a first plaintext, where the numerical constants comprise integer constants, single-precision floating-point constants, and double-precision floating-point constants; salt each first plaintext to obtain a second plaintext; encrypt the second plaintext with the MD5 hash algorithm and add the corresponding type identifier to generate the ciphertext; finally, construct a first ordered pair from each pair of ciphertext and first plaintext, all first ordered pairs forming the constant data set.
S200: Construct an encoder dictionary and a decoder dictionary: construct the encoder dictionary from the ciphertexts of all first ordered pairs in the constant data set; construct the decoder dictionary from the first plaintexts of all first ordered pairs in the constant data set; then define a custom start symbol and end symbol and add them to the decoder dictionary.
S300: Data preprocessing: for each first ordered pair in the constant data set, first append the end symbol of S200 to the end of its first plaintext to obtain a third plaintext; look up, in order, the number of every character of the ciphertext in the encoder dictionary to form a first number sequence, and the number of every character of the third plaintext in the decoder dictionary to form a second number sequence; then One-hot encode the numbers of the first number sequence into a first vector sequence and those of the second number sequence into a second vector sequence; finally, construct a second ordered pair from the first vector sequence and the second vector sequence.
S400: Construct and train a sequence-to-sequence model: use a GRU as the encoder and a GRU plus a fully connected layer as the decoder to construct a sequence-to-sequence model; take all second ordered pairs as the training set and obtain the trained sequence-to-sequence model.
S500: Construct a decryption function: based on the sequence-to-sequence model of S400, construct a function that decrypts a ciphertext back into its first plaintext; this function is called the decryption function.
S600: Construct opaque predicates: replace the string constants and numerical constants in the expressions of the branch statements of the source program to be obfuscated with calls to the decryption function; each expression after constant replacement is an opaque predicate.
S700: Insert the opaque predicates and the decryption function: replace the original expressions of the branch statements of the source program to be obfuscated with the opaque predicates, and insert the decryption function at any position in the source program that does not affect compilation, obtaining the obfuscated source program.
S800: Compile to generate an executable application: compile the obfuscated source program into an executable application.
Further, the specific steps of constructing the constant data set in step S100 are as follows:
S110: For the expressions in the branch statements of the source program to be obfuscated, take every string constant and numerical constant in them as a first plaintext, where the numerical constants comprise integer constants, single-precision floating-point constants, and double-precision floating-point constants.
S120: Salt the first plaintext to obtain a second plaintext: when the first plaintext is a numerical constant, choose a fixed constant as the salt and add it to the first plaintext to obtain the second plaintext; when the first plaintext is a string constant, choose a fixed punctuation mark as the salt and append it to the end of the first plaintext to obtain the second plaintext. Then encrypt the second plaintext with the MD5 hash algorithm to obtain an MD5 string. Next, define four mutually distinct characters as the type identifiers of string constants, integer constants, single-precision floating-point constants, and double-precision floating-point constants, respectively. Finally, insert the type identifier corresponding to the constant type of the first plaintext into the MD5 string; the insertion position may be anywhere in the MD5 string. The MD5 string with the type identifier inserted is called the ciphertext.
S130: Convert every ciphertext and first plaintext into a character string, then construct a first ordered pair $o_i = (c_i, p_i)$ from each pair, where the first element $c_i$ is the ciphertext and the second element $p_i$ is the corresponding first plaintext. All first ordered pairs form the constant data set $O = \{o_1, o_2, \dots, o_n\}$, where $n$ is the number of first ordered pairs in $O$.
Further, the specific steps of constructing the encoder dictionary and the decoder dictionary in step S200 are as follows:
S210: Traverse all first ordered pairs $o_i$ in $O$ and build a character sequence from all characters that appear in the ciphertexts $c_i$; if a character is duplicated, keep only one occurrence. This character sequence is the encoder dictionary $D_{enc}$.
S220: Traverse all first ordered pairs $o_i$ in $O$ and build a character sequence from all characters that appear in the first plaintexts $p_i$; if a character is duplicated, keep only one occurrence. Then define two custom characters as the start symbol and the end symbol; they must differ from each other and from every character in the character sequence. Finally, insert the start symbol and the end symbol at the front of the character sequence; the sequence after this insertion is the decoder dictionary $D_{dec}$.
Further, the specific steps of data preprocessing in step S300 are as follows, for each first ordered pair $o_i = (c_i, p_i)$ in the constant data set $O$:
S310: Append the end symbol to the end of the first plaintext $p_i$ to obtain the third plaintext $p_i'$.
S320: Split the ciphertext $c_i$ into characters from left to right and, in that order, look up the number of each character of $c_i$ in the encoder dictionary $D_{enc}$; all these numbers form the first number sequence $X_i$. Then split $p_i'$ into characters from left to right and look up the number of each character of $p_i'$ in the decoder dictionary $D_{dec}$; all these numbers form the second number sequence $Y_i$.
S330: Compute the length $a$ of $D_{enc}$ and the length $b$ of $D_{dec}$; then One-hot encode each number of $X_i$, in order, into an $a$-dimensional column vector, all of which form the first vector sequence $\tilde{X}_i$; One-hot encode each number of $Y_i$, in order, into a $b$-dimensional column vector, all of which form the second vector sequence $\tilde{Y}_i$.
S340: Construct a second ordered pair $s_i = (\tilde{X}_i, \tilde{Y}_i)$ from the first vector sequence and the second vector sequence.
Further, the specific steps of constructing and training a sequence-to-sequence model described in step S400 are as follows:
S410: taking GRU as an encoder from sequence to sequence model; the calculation formula of the GRU is as follows:
Wherein, AndRespectively a reset gate, an update gate and a candidate state at the t moment in sequence,Is the hidden state at the t-th moment,Is the hidden state at the t-1 th moment,For input to the encoder at time t, i.e. the first vector sequenceIs selected from the group consisting of the t-th column vector,AndRespectively resetting the gate weight matrix, updating the gate weight matrix and the candidate state weight matrix in turn,AndRespectively resetting a gate bias item, updating a gate bias item and a candidate state bias item in sequence;
When (when) All of (3)After all are input into the encoder, the encoding is finished, and the last moment is reachedAs intermediate state c.
S420: taking GRU and full connection layer as decoder of sequence-to-sequence model; the calculation formula of the GRU is as follows:
the calculation formula of the full connection layer is as follows:
Wherein, AndRespectively a reset gate, an update gate and a candidate state at the kth moment in sequence,Is the hidden state at the kth time,For the output of the decoder at the kth instant,Is the hidden state at the k-1 th moment,The One-hot encoding vector corresponding to the input of the kth time decoder, i.e., the output of the kth-1 time decoder,AndRespectively and sequentially resetting a gate weight matrix, updating the gate weight matrix, a candidate state weight matrix and a full connection layer weight matrix,AndRespectively resetting a gate bias item, updating a gate bias item, a candidate state bias item and a full connection layer bias item in sequence;
The decoder uses the intermediate state c as the hidden state of the initial moment The One-hot coding vector of the initiator is used as the input of the initial time decoderWhen the decoder outputs the end symbol, decoding is ended, as shown in fig. 2.
S430: first initializing the model weights of the whole sequence to the sequence model, including weight matrixBias termSetting model-related super parameters, including the number of hidden layers of the GRU, the number of hidden layer units of the GRU, the size of batch processing, the size of learning rate and the iteration number of training; then use all second puppetsAs a training set training model, whereinAs an input sequence toAs a target sequence; finally stopping training until reaching the preset training iteration times Epoch; if all the input sequences in the training set are correctly predicted to be corresponding target sequences by the model at the moment, finishing training and saving model weights; otherwise, increasing the training iteration times Epoch and retraining the model.
Further, the specific steps of constructing the decryption function in step S500 are as follows:
S510: Construct a function whose input is a ciphertext. First convert the ciphertext into a vector sequence $\tilde{X}$ according to step S300; then feed $\tilde{X}$ into the sequence-to-sequence model, which outputs a number sequence $Y$; then, for each number $g$ in $Y$, look up in the decoder dictionary $D_{dec}$ the character whose number equals $g$, forming these characters, in order, into a character sequence; concatenate all its characters in order to obtain a third plaintext $p'$; delete the end symbol from $p'$ to obtain the first plaintext $p$; finally, according to the type identifier in the ciphertext, convert $p$ into the corresponding constant type and return $p$ as the output of the decryption function.
The invention also provides a code obfuscation system combining MD5 with a sequence-to-sequence model, comprising:
A constant-data-set construction module: for the expressions in the branch statements of the source program to be obfuscated, take every string constant and numerical constant in the expressions as a first plaintext, the numerical constants comprising integer constants, single-precision floating-point constants, and double-precision floating-point constants; salt the first plaintext to obtain a second plaintext; encrypt the second plaintext with the MD5 hash algorithm and add the corresponding type identifier to generate the ciphertext; finally, construct a first ordered pair from each pair of ciphertext and first plaintext, all first ordered pairs forming the constant data set.
An encoder-dictionary and decoder-dictionary construction module: construct the encoder dictionary from the ciphertexts of all first ordered pairs in the constant data set, construct the decoder dictionary from the first plaintexts of all first ordered pairs, and then define a custom start symbol and end symbol and add them to the decoder dictionary.
A data preprocessing module: for each first ordered pair in the constant data set, first append the custom end symbol to the end of its first plaintext to obtain a third plaintext; look up, in order, the numbers of all characters of the ciphertext in the encoder dictionary to form a first number sequence, and the numbers of all characters of the third plaintext in the decoder dictionary to form a second number sequence; then One-hot encode the numbers of the first number sequence into a first vector sequence and those of the second number sequence into a second vector sequence; finally, construct a second ordered pair from the first vector sequence and the second vector sequence.
A sequence-to-sequence model construction and training module: use a GRU as the encoder and a GRU plus a fully connected layer as the decoder to construct a sequence-to-sequence model; take all second ordered pairs as the training set and obtain the trained sequence-to-sequence model.
A decryption-function construction module: based on the trained sequence-to-sequence model, construct a function that decrypts a ciphertext into its first plaintext; this function is called the decryption function.
An opaque-predicate construction module: replace the string constants and numerical constants in the expressions of the branch statements of the source program to be obfuscated with calls to the decryption function; each expression after constant replacement is an opaque predicate.
An opaque-predicate and decryption-function insertion module: replace the original expressions of the branch statements with the opaque predicates, and insert the decryption function at any position in the source program that does not affect compilation, obtaining the obfuscated source program.
An executable-application compilation module: compile the obfuscated source program into an executable application.
Compared with the prior art, the invention has the following advantages:
(1) Compared with existing opaque-predicate construction methods, the method exploits the one-wayness of the MD5 hash algorithm and the non-interpretability of sequence-to-sequence model weights, so the constructed opaque predicates resist static analysis more strongly and are more secure.
(2) Compared with existing opaque-predicate construction methods, the method combines the MD5 hash algorithm with a sequence-to-sequence model to construct opaque predicates; this combination executes efficiently and has a reliable black-box property, which improves the construction efficiency of the opaque predicates and makes the constructed opaque predicates more secure.
(3) Compared with existing opaque-predicate construction methods, the method can obfuscate numerical constants and string constants at the same time to construct opaque predicates, and therefore has higher generality.
Drawings
FIG. 1 is a flowchart of the steps of the code obfuscation method combining MD5 with a sequence-to-sequence model;
FIG. 2 is a diagram of the operation of the sequence-to-sequence model;
FIG. 3 is the program pseudocode before source-program obfuscation in Embodiment 1;
FIG. 4 is the program pseudocode after source-program obfuscation in Embodiment 1.
Detailed Description
The invention is further described, but not limited, by the following drawings and examples taken together with the specification.
Example 1
Those skilled in the art will appreciate that the following examples merely illustrate the present invention and should not be regarded as limiting its scope. Where specific techniques or conditions are not indicated in the examples, they follow the techniques or conditions described in the literature of this field or the product specifications. Materials or equipment whose manufacturers are not indicated are conventional products available from commercial sources.
An implementation flowchart of the code obfuscation method combining MD5 with a sequence-to-sequence model is shown in FIG. 1; the method comprises the following steps:
Step S100: Construct a constant data set: for the expressions in the branch statements of the source program to be obfuscated, take every string constant and numerical constant in the expressions as a first plaintext, where the numerical constants comprise integer constants, single-precision floating-point constants, and double-precision floating-point constants; salt each first plaintext to obtain a second plaintext; encrypt the second plaintext with the MD5 hash algorithm and add the corresponding type identifier to generate the ciphertext; finally, construct a first ordered pair from each pair of ciphertext and first plaintext, all first ordered pairs forming the constant data set.
Step S200: Construct an encoder dictionary and a decoder dictionary: construct the encoder dictionary from the ciphertexts of all first ordered pairs in the constant data set, construct the decoder dictionary from the first plaintexts of all first ordered pairs, and then define a custom start symbol and end symbol and add them to the decoder dictionary.
Step S300: Data preprocessing: for each first ordered pair in the constant data set, first append the end symbol of step S200 to the end of its first plaintext to obtain a third plaintext; look up, in order, the numbers of all characters of the ciphertext in the encoder dictionary to form a first number sequence, and the numbers of all characters of the third plaintext in the decoder dictionary to form a second number sequence; then One-hot encode the numbers of the first number sequence into a first vector sequence and those of the second number sequence into a second vector sequence; finally, construct a second ordered pair from the first vector sequence and the second vector sequence.
Step S400: Construct and train a sequence-to-sequence model: use a GRU as the encoder and a GRU plus a fully connected layer as the decoder to construct a sequence-to-sequence model; take all second ordered pairs as the training set and obtain the trained sequence-to-sequence model.
Step S500: Construct a decryption function: based on the sequence-to-sequence model of step S400, construct a function that decrypts a ciphertext back into its first plaintext; this function is called the decryption function.
Step S600: Construct opaque predicates: replace the string constants and numerical constants in the expressions of the branch statements of the source program to be obfuscated with calls to the decryption function; each expression after constant replacement is an opaque predicate.
Step S700: Insert the opaque predicates and the decryption function: replace the original expressions of the branch statements with the opaque predicates, and insert the decryption function at any position in the source program that does not affect compilation, obtaining the obfuscated source program.
Step S800: Compile to generate an executable application: compile the obfuscated source program into an executable application.
An example is given below in which the source program of FIG. 3 is taken as the source program to be obfuscated and is obfuscated according to the steps above, as follows:
Preferably, the specific steps of constructing the constant data set in step S100 are as follows:
Step S110: For the expressions in the branch statements of the source program to be obfuscated, take every string constant and numerical constant in them as a first plaintext, where the numerical constants comprise integer constants, single-precision floating-point constants, and double-precision floating-point constants.
Preferably, in this example, step S110 is specifically:
The expressions of the branch statements in the current source program to be obfuscated (shown in FIG. 3) contain two integer constants, "2" and "7", and one string constant; all three constants are taken as first plaintexts. (The concrete expressions and the string value appear in FIG. 3.)
Step S120: salt adding treatment is carried out on the first plaintext to obtain a second plaintext: when the first plaintext is a numerical constant, a fixed constant is selected as salt and added with the first plaintext to obtain a second plaintext, and when the first plaintext is a character string constant, a fixed punctuation mark is selected as salt and added to the rearmost of the first plaintext to obtain the second plaintext; then, encrypting the second plaintext by using an MD5 hash algorithm to obtain an MD5 character string; then, defining four mutually different characters, and respectively using the characters as type identifiers of a character string constant, an integer constant, a single-precision floating-point constant and a double-precision floating-point constant; finally, inserting the corresponding type identifier into the MD5 character string according to the constant type of the first plaintext, wherein the inserted position can be any position in the MD5 character string; the MD5 string after the insertion of the type identifier is referred to as ciphertext.
Preferably, in this example, step S120 is specifically:
First, salt the first plaintexts to obtain the second plaintexts: when a first plaintext is a numerical constant, the integer 4 is chosen as the salt and added to it; when a first plaintext is a string constant, the symbol "|" is chosen as the salt and appended to its end. The second plaintexts are then encrypted with the MD5 hash algorithm to obtain MD5 strings:
(1) The first plaintext "2" is salted to give the second plaintext "6", which is encrypted with the MD5 hash algorithm into an MD5 string.
(2) The first plaintext "7" is salted to give the second plaintext "11", which is encrypted with the MD5 hash algorithm into an MD5 string.
(3) The string-constant first plaintext is salted by appending "|", and the resulting second plaintext is encrypted with the MD5 hash algorithm into an MD5 string.
(The concrete 32-character MD5 strings appear as images in the original publication.)
The choice of the symbol "|" as the salt is only an example of this embodiment; in implementations of the invention, the obfuscator may choose any other symbol as the salt.
Next, the character "A" is defined as the type identifier of string constants, "B" of integer constants, "C" of single-precision floating-point constants, and "D" of double-precision floating-point constants. Finally, the very front of the MD5 string is designated as the insertion position, and the type identifier corresponding to the constant type of each first plaintext is inserted there, yielding the ciphertexts:
(4) The first plaintext "2" is an integer constant, so the character "B" is prepended to its MD5 string, giving its ciphertext.
(5) The first plaintext "7" is an integer constant, so the character "B" is prepended to its MD5 string, giving its ciphertext.
(6) The string-constant first plaintext has the character "A" prepended to its MD5 string, giving its ciphertext.
step S130: firstly, all ciphertext and first plaintext are converted into character strings, and then each pair of ciphertext and first plaintext is utilized to construct a first prologue Wherein the first elementRepresenting ciphertext, second elementRepresenting a first plaintext corresponding to the ciphertext; all first prologs form a constant data set o= { O 1,o2,…,on }, and n is the number of first prologs in O.
Preferably, in this example, step S130 is specifically:
the first plaintext "2", the first plaintext "7" and the first plaintext " "And ciphertextCiphertext (ciphertext)Ciphertext and method for producing sameAre converted into character strings; then construct a first prologue using each pair of ciphertext and first plaintextAll O constitute a constant dataset o= { O 1,o2,o3 }, where:
Preferably, the specific steps of constructing the encoder dictionary and the decoder dictionary in step S200 are as follows:
Step S210: Traverse all first ordered pairs $o_i$ in $O$ and build a character sequence from all characters that appear in the ciphertexts $c_i$; duplicated characters keep only one occurrence. This character sequence is the encoder dictionary $D_{enc}$.
Preferably, in this example, step S210 is specifically:
The constant data set $O$ contains three first ordered pairs, $o_1$, $o_2$, and $o_3$. First, the ciphertexts $c_1$, $c_2$, and $c_3$ are traversed and all characters appearing in them are built into one character sequence; duplicated characters are then removed so that each remains only once. The resulting character sequence is the encoder dictionary $D_{enc}$ (listed in the original publication).
Step S220: traversing all of OWill allThe character appearing in the sequence is constructed as a character sequence, and if the character is repeated, only one character is reserved; then, customizing two characters respectively serving as a start character and an end character, wherein the start character and the end character are different from each other and are different from all characters in the character sequence; finally, inserting a start character and an end character at the forefront of the character sequence, and the character sequence after the insertion operation is the dictionary of the decoder
Preferably, in this example, step S220 is specifically:
The constant data set O has three first ordinals respectively And
First go throughAndA kind of electronic deviceWill allThe character appearing in (a) is constructed as a character sequence:
And then removing repeated characters in the character sequence, wherein the repeated characters only remain one:
Then define the character As a initiator, a characterIs an ending symbol; finally, inserting a start character at the forefront of the character sequenceEnd symbolThe character sequence after the insertion operation is the dictionary of the decoder
Preferably, the specific steps of data preprocessing in step S300 are as follows, for each first ordered pair $o_i = (c_i, p_i)$ in the constant data set $O$:
Step S310: Append the end symbol to the end of the first plaintext $p_i$ to obtain the third plaintext $p_i'$.
Preferably, in this example, step S310 is specifically: the end symbol is appended to each of the three first plaintexts $p_1$, $p_2$, and $p_3$, yielding the three third plaintexts $p_1'$, $p_2'$, and $p_3'$.
Step S320: for ciphertextCharacter-by-character segmentation from left to right according toSequentially obtainEach character in (a)All numbers forming a first number sequence; Then toCharacter-by-character segmentation from left to right according toSequentially obtainEach character in (a)All numbers forming a second number sequence
Preferably, in this example, step S320 is specifically:
To be used for For example, it is obtained in step S310Is'”。
For the following
First toCharacter-by-character segmentation is performed from left to right: is cut into
()
Then according toSequentially obtainEach character in (a)The number of (3):
(、6、)
All numbers constitute a first number sequence
For the following
First toCharacter-by-character segmentation is performed from left to right: is cut into
()
Then according toSequentially obtainEach character in (a)The number of (3):
()
All numbers constitute a second number sequence
(1) For a pair ofAfter the operations in the above examples are performed, the corresponding first number sequence is obtainedAnd a second numbering sequence
(2) For a pair ofAfter the operations in the above examples are performed, the corresponding first number sequence is obtainedAnd a second numbering sequence
(3) For a pair ofAfter the operations in the above examples are performed, the corresponding first number sequence is obtainedAnd a second numbering sequence
Step S330: calculation ofLength a and of (2)Length b of (2); then sequentially addEach number One-hot is encoded as an a-dimensional column vector, all vectors constituting a first vector sequence; Will be sequentiallyEach number One-hot is encoded as a b-dimensional column vector, all vectors constituting a second vector sequence
Preferably, in this example, step S330 is specifically:
Calculation of Length a and of (2)Is defined as a is 17 and b is 12;
To be used for For example, it is the first numbered sequence obtained in step S320And a second numbering sequence
First sequentiallyEach number of the sequence is One-hot coded into a 17-dimensional column vector to obtain a first vector sequence
Then sequentially addEach number of the code is One-hot coded into a 12-dimensional column vector to obtain a second vector sequence
(1) For a pair ofAfter the operations in the above examples are performed, the corresponding first vector sequence is obtainedAnd a second vector sequence
(2) For a pair ofAfter the operations in the above examples are performed, the corresponding first vector sequence is obtainedAnd a second vector sequence
(3) For a pair ofAfter the operations in the above examples are performed, the corresponding first vector sequence is obtainedAnd a second vector sequence
Step S340: using a first vector sequenceAnd a second vector sequenceConstructing a second puppet
Preferably, in this example, step S340 is specifically:
together construct three pairs of second puppets : Wherein,
Preferably, the specific steps of constructing and training the sequence-to-sequence model in step S400 are as follows:
Step S410: Use a GRU as the encoder of the sequence-to-sequence model. The GRU is computed as:

$r_t = \sigma(W_r[h_{t-1}, x_t] + b_r)$
$z_t = \sigma(W_z[h_{t-1}, x_t] + b_z)$
$\tilde{h}_t = \tanh(W_h[r_t \odot h_{t-1}, x_t] + b_h)$
$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$

where $r_t$, $z_t$, and $\tilde{h}_t$ are, in order, the reset gate, the update gate, and the candidate state at time $t$; $h_t$ is the hidden state at time $t$ and $h_{t-1}$ the hidden state at time $t-1$; $x_t$ is the encoder input at time $t$, i.e., the $t$-th column vector of the first vector sequence $\tilde{X}_i$; $W_r$, $W_z$, and $W_h$ are, in order, the reset-gate, update-gate, and candidate-state weight matrices; and $b_r$, $b_z$, and $b_h$ are, in order, the reset-gate, update-gate, and candidate-state bias terms.
When all column vectors $x_t$ of $\tilde{X}_i$ have been fed into the encoder, encoding ends, and the hidden state at the last time step is taken as the intermediate state $c$.
Step S420: Use a GRU plus a fully connected layer as the decoder of the sequence-to-sequence model. The decoder GRU is computed as:

$r_k = \sigma(W_r[h_{k-1}, x_k] + b_r)$
$z_k = \sigma(W_z[h_{k-1}, x_k] + b_z)$
$\tilde{h}_k = \tanh(W_h[r_k \odot h_{k-1}, x_k] + b_h)$
$h_k = (1 - z_k) \odot h_{k-1} + z_k \odot \tilde{h}_k$

and the fully connected layer as:

$y_k = \mathrm{softmax}(W_y h_k + b_y)$

where $r_k$, $z_k$, and $\tilde{h}_k$ are, in order, the reset gate, the update gate, and the candidate state at time $k$; $h_k$ is the hidden state at time $k$ and $y_k$ the decoder output at time $k$; $h_{k-1}$ is the hidden state at time $k-1$; $x_k$ is the One-hot encoding vector corresponding to the decoder input at time $k$, i.e., the decoder output at time $k-1$; $W_r$, $W_z$, $W_h$, and $W_y$ are, in order, the reset-gate, update-gate, candidate-state, and fully-connected-layer weight matrices; and $b_r$, $b_z$, $b_h$, and $b_y$ are, in order, the reset-gate, update-gate, candidate-state, and fully-connected-layer bias terms.
The decoder takes the intermediate state $c$ as the initial hidden state $h_0$ and the One-hot encoding vector of the start symbol as the initial decoder input; when the decoder outputs the end symbol, decoding ends.
Step S430: First initialize the model weights of the whole sequence-to-sequence model, including the weight matrices $W_r, W_z, W_h, W_y$ and the bias terms $b_r, b_z, b_h, b_y$, and set the model hyperparameters, including the number of GRU hidden layers, the number of GRU hidden-layer units, the batch size, the learning rate, and the number of training iterations. Then train the model with all second ordered pairs $s_i = (\tilde{X}_i, \tilde{Y}_i)$ as the training set, taking $\tilde{X}_i$ as the input sequence and $\tilde{Y}_i$ as the target sequence. Stop when the preset number of training iterations Epoch is reached; if the model then correctly predicts the corresponding target sequence for every input sequence in the training set, training is finished and the model weights are saved; otherwise, increase Epoch and retrain the model.
Preferably, the specific steps of constructing the decryption function in step S500 are as follows:
Step S510: Construct a function whose input is a ciphertext. First convert the ciphertext into a vector sequence $\tilde{X}$ according to step S300; then feed $\tilde{X}$ into the sequence-to-sequence model, which outputs a number sequence $Y$; then, for each number $g$ in $Y$, look up in the decoder dictionary $D_{dec}$ the character whose number equals $g$, forming these characters, in order, into a character sequence; concatenate all its characters in order to obtain a third plaintext $p'$; delete the end symbol from $p'$ to obtain the first plaintext $p$; finally, according to the type identifier in the ciphertext, convert $p$ into the corresponding constant type and return $p$ as the output of the decryption function.
Preferably, in this example, step S510 is specifically:
A function whose input is a ciphertext is constructed and named as the decryption function. Taking $o_3$ as the example, the input of the decryption function is the ciphertext $c_3$. First, $c_3$ is converted into the vector sequence $\tilde{X}_3$; then $\tilde{X}_3$ is fed as input into the sequence-to-sequence model constructed and trained in step S400, yielding a number sequence $Y$. For each number $g$ in $Y$, the character whose number equals $g$ is looked up in the decoder dictionary $D_{dec}$, and the characters are formed, in order, into a character sequence. Concatenating all its characters in order gives the third plaintext; deleting its end symbol gives the first plaintext, i.e., the original string constant. Finally, because the type identifier in $c_3$ is the character "A", the first plaintext is converted into a character string, and this constant-typed value is returned as the output of the decryption function.
Pseudocode for the decryption function of this example is given in the original publication.
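As a stand-in for that pseudocode (not the patent's own), a greedy-decoding Python sketch of the decryption function under the earlier assumptions; type handling is reduced to the "A"/"B" identifiers of this embodiment, and torch, model, one_hot_sequence, enc_dict, and dec_dict from the earlier sketches are assumed in scope:

```python
def decrypt(ciphertext: str):
    """S510 sketch: ciphertext -> model -> characters -> typed first plaintext."""
    model.eval()
    with torch.no_grad():
        X = one_hot_sequence(list(ciphertext), enc_dict).unsqueeze(0)
        _, h = model.encoder(X)                      # intermediate state c
        token = one_hot_sequence(["<sos>"], dec_dict).unsqueeze(0)
        chars = []
        for _ in range(64):                          # safety bound on length
            out, h = model.decoder(token, h)
            idx = model.fc(out)[0, -1].argmax().item()
            if dec_dict[idx] == "<eos>":             # end symbol: stop decoding
                break
            chars.append(dec_dict[idx])
            token = one_hot_sequence([dec_dict[idx]], dec_dict).unsqueeze(0)
    plaintext = "".join(chars)
    # The type identifier (at the front of the ciphertext here) restores the type.
    return int(plaintext) if ciphertext[0] == "B" else plaintext
```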
Preferably, the specific steps of constructing the opaque predicates in step S600 are: replace the string constants and numerical constants in the expressions of the branch statements of the source program to be obfuscated with calls to the decryption function; each expression after constant replacement is an opaque predicate.
Preferably, in this example, step S600 is specifically:
The expressions of the branch statements in the current source program to be obfuscated contain the integer constants "2" and "7" and one string constant (see FIG. 3); each constant is replaced with a call to the decryption function on its corresponding ciphertext:
(1) The integer constant "2" in the first expression is replaced with a decryption-function call on its ciphertext.
(2) The integer constant "7" in the second expression is replaced with a decryption-function call on its ciphertext.
(3) The string constant is replaced with a decryption-function call on its ciphertext.
Two opaque predicates are thus obtained (their concrete form is shown in FIG. 4).
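FIG. 3 and FIG. 4 are not reproduced in this text; as a hypothetical stand-in under the sketches above, the transformation of one branch might look like the following. The ciphertext literal is what make_ciphertext(2) produces with this sketch's salt of 4, i.e. "B" followed by the MD5 of "6":

```python
def before(x):               # stand-in for the pre-obfuscation branch (FIG. 3)
    if x % 2 == 0:
        return "even"
    return "odd"

def after(x):                # stand-in for the obfuscated branch (FIG. 4):
    # the literal 2 no longer appears in the source; only its ciphertext
    # and a call to the decryption function remain.
    if x % decrypt("B1679091c5a880faf6fb5e6087eb1b2dc") == 0:
        return "even"
    return "odd"
```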
Preferably, the specific steps of inserting the opaque predicates and the decryption function in step S700 are: replace the original expressions of the branch statements of the source program to be obfuscated with the opaque predicates, and insert the decryption function at any position that does not affect compilation, obtaining the obfuscated source program.
Preferably, in this example, step S700 is specifically:
The original expressions of the two branch statements in the source program to be obfuscated are replaced with the corresponding opaque predicates, and the decryption function is inserted into the source program. The obfuscated source program is shown in FIG. 4.
Preferably, the specific step of compiling to generate an executable application in step S800 is: compile the obfuscated source program into an executable application.
Embodiment 2: The invention also provides an embodiment of a code obfuscation system combining MD5 with a sequence-to-sequence model, comprising:
A constant-data-set construction module: for the expressions in the branch statements of the source program to be obfuscated, take every string constant and numerical constant in the expressions as a first plaintext, the numerical constants comprising integer constants, single-precision floating-point constants, and double-precision floating-point constants; salt the first plaintext to obtain a second plaintext; encrypt the second plaintext with the MD5 hash algorithm and add the corresponding type identifier to generate the ciphertext; finally, construct a first ordered pair from each pair of ciphertext and first plaintext, all first ordered pairs forming the constant data set.
An encoder-dictionary and decoder-dictionary construction module: construct the encoder dictionary from the ciphertexts of all first ordered pairs in the constant data set, construct the decoder dictionary from the first plaintexts of all first ordered pairs, and then define a custom start symbol and end symbol and add them to the decoder dictionary.
A data preprocessing module: for each first ordered pair in the constant data set, first append the custom end symbol to the end of its first plaintext to obtain a third plaintext; look up, in order, the numbers of all characters of the ciphertext in the encoder dictionary to form a first number sequence, and the numbers of all characters of the third plaintext in the decoder dictionary to form a second number sequence; then One-hot encode the numbers of the first number sequence into a first vector sequence and those of the second number sequence into a second vector sequence; finally, construct a second ordered pair from the first vector sequence and the second vector sequence.
A sequence-to-sequence model construction and training module: use a GRU as the encoder and a GRU plus a fully connected layer as the decoder to construct a sequence-to-sequence model; take all second ordered pairs as the training set and obtain the trained sequence-to-sequence model.
A decryption-function construction module: based on the trained sequence-to-sequence model, construct a function that decrypts a ciphertext into its first plaintext; this function is called the decryption function.
An opaque-predicate construction module: replace the string constants and numerical constants in the expressions of the branch statements of the source program to be obfuscated with calls to the decryption function; each expression after constant replacement is an opaque predicate.
An opaque-predicate and decryption-function insertion module: replace the original expressions of the branch statements with the opaque predicates, and insert the decryption function at any position in the source program that does not affect compilation, obtaining the obfuscated source program.
An executable-application compilation module: compile the obfuscated source program into an executable application.
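A deliberately naive end-to-end sketch of how these modules chain together; extract_branch_constants and the string-replacement rewriting below are toy stand-ins for real source analysis, and the model-training step is elided to a comment:

```python
import re

def extract_branch_constants(source_code: str):
    """Hypothetical stand-in: pull integer literals out of 'if' conditions."""
    return [int(m) for m in re.findall(r"if .*?(\d+)", source_code)]

def obfuscate(source_code: str) -> str:
    """End-to-end sketch wiring the system's modules together (S100 to S700)."""
    pairs = [(make_ciphertext(v), str(v))
             for v in extract_branch_constants(source_code)]  # constant data set
    # ... build the dictionaries, preprocess, and train the model here ...
    for ciphertext, plaintext in pairs:
        # Naive rewrite: swap each constant for a decryption-function call.
        source_code = source_code.replace(
            plaintext, f'decrypt("{ciphertext}")', 1)
    return source_code
```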
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.
In the embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the system embodiments described above are merely illustrative, e.g., the division of the elements is merely a logical functional division, and there may be additional divisions when actually implemented, e.g., multiple elements or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate components may or may not be physically separate, and components displayed as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention, in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods of the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The above embodiments are intended only to illustrate the technical solution of the present invention, not to limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (7)

1. A code obfuscation method that combines MD5 with a sequence-to-sequence model, comprising:
First, constructing a constant data set: for an expression in a branch statement of the source program to be obfuscated, any string constant or numeric constant in the expression is taken as a first plaintext, where the numeric constants include integer constants, single-precision floating-point constants and double-precision floating-point constants; the first plaintext is then salted to obtain a second plaintext; the second plaintext is then encrypted with the MD5 hash algorithm and a corresponding type identifier is added to generate a ciphertext; finally, a first ordered pair is constructed for each pair of ciphertext and first plaintext, and all the first ordered pairs constitute a constant data set;
Second, constructing an encoder dictionary and a decoder dictionary: an encoder dictionary is constructed from the ciphertexts of all first ordered pairs in the constant data set, and a decoder dictionary is constructed from the first plaintexts of all first ordered pairs in the constant data set; then a start symbol and an end symbol are custom-defined and added to the decoder dictionary;
Third, data preprocessing: for each first ordered pair in the constant data set, the end symbol defined in the second step is first appended to the end of the first plaintext of the pair to obtain a third plaintext; the numbers of all characters of the ciphertext are then looked up in turn in the encoder dictionary to form a first number sequence, and the numbers of all characters of the third plaintext are looked up in turn in the decoder dictionary to form a second number sequence; the numbers in the first number sequence are then One-hot encoded to form a first vector sequence, and the numbers in the second number sequence are One-hot encoded to form a second vector sequence; finally, a second ordered pair is constructed from the first vector sequence and the second vector sequence;
Fourth, constructing and training a sequence-to-sequence model: a sequence-to-sequence model is constructed with a GRU as the encoder and a GRU plus a fully connected layer as the decoder; all the second ordered pairs are used as the training set, and the sequence-to-sequence model is obtained by training; the specific steps are as follows (an illustrative model sketch follows this claim):
(1) A GRU is taken as the encoder of the sequence-to-sequence model; the GRU is computed as:

r_t = σ(W_r [h_{t−1}, x_t] + b_r)
z_t = σ(W_z [h_{t−1}, x_t] + b_z)
h̃_t = tanh(W_h [r_t ⊙ h_{t−1}, x_t] + b_h)
h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t

where r_t, z_t and h̃_t are, in order, the reset gate, the update gate and the candidate state at time step t; h_t is the hidden state at time step t and h_{t−1} the hidden state at time step t−1; x_t is the encoder input at time step t, i.e. the t-th column vector of the first vector sequence X; W_r, W_z and W_h are, in order, the reset-gate, update-gate and candidate-state weight matrices; and b_r, b_z and b_h are, in order, the reset-gate, update-gate and candidate-state bias terms;

when all column vectors x_t of X have been input to the encoder, encoding ends, and the hidden state h_T at the last time step is taken as the intermediate state c;
(2) A GRU and a fully connected layer are taken as the decoder of the sequence-to-sequence model; the GRU is computed as:

r_k = σ(W′_r [s_{k−1}, ŷ_{k−1}] + b′_r)
z_k = σ(W′_z [s_{k−1}, ŷ_{k−1}] + b′_z)
h̃_k = tanh(W′_h [r_k ⊙ s_{k−1}, ŷ_{k−1}] + b′_h)
s_k = (1 − z_k) ⊙ s_{k−1} + z_k ⊙ h̃_k

and the fully connected layer is computed as:

y_k = softmax(W_y s_k + b_y)

where r_k, z_k and h̃_k are, in order, the reset gate, the update gate and the candidate state at decoding step k; s_k is the hidden state at step k and y_k the decoder output at step k; s_{k−1} is the hidden state at step k−1; ŷ_{k−1} is the One-hot encoded vector corresponding to the decoder input at step k, i.e. to the decoder output at step k−1; W′_r, W′_z, W′_h and W_y are, in order, the reset-gate, update-gate, candidate-state and fully-connected-layer weight matrices; and b′_r, b′_z, b′_h and b_y are, in order, the reset-gate, update-gate, candidate-state and fully-connected-layer bias terms;

the decoder takes the intermediate state c as the hidden state s_0 at the initial step, and the One-hot encoded vector of the start symbol as the decoder input ŷ_0 at the initial step; decoding ends when the decoder outputs the end symbol;
(3) First, the model weights of the entire sequence-to-sequence model are initialized, including the weight matrices W_r, W_z, W_h, W′_r, W′_z, W′_h, W_y and the bias terms b_r, b_z, b_h, b′_r, b′_z, b′_h, b_y, and the model hyperparameters are set, including the number of GRU hidden layers, the number of GRU hidden-layer units, the batch size, the learning rate and the number of training iterations; then all second ordered pairs (X, Y) are used as the training set to train the model, with each first vector sequence X as an input sequence and the corresponding second vector sequence Y as its target sequence; training stops when the preset number of training iterations Epoch is reached; if at that point the model correctly predicts the corresponding target sequence for every input sequence in the training set, training is complete and the model weights are saved; otherwise, the number of training iterations Epoch is increased and the model is retrained;
Fifth, constructing the decryption function: a function that decrypts the ciphertext into the first plaintext is constructed from the sequence-to-sequence model trained in the fourth step; this function is called the decryption function;
Sixth, constructing opaque predicates: the string constants and numeric constants in the expressions of branch statements in the source program to be obfuscated are replaced with calls to the decryption function; each expression after constant replacement is an opaque predicate;
Seventh, inserting the opaque predicates and the decryption function: the original expressions of the branch statements in the source program to be obfuscated are replaced with the opaque predicates, and the decryption function is inserted at any position in the source program that does not affect compilation, yielding the obfuscated source program;
Eighth, compiling and generating an executable application program: the obfuscated source program is compiled into an executable application program.
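As referenced in the fourth step, here is a minimal PyTorch sketch of the GRU encoder-decoder structure described above; the class name, the single hidden layer, the maximum decode length and the greedy decoding loop are illustrative assumptions, not the patent's reference implementation:

    import torch
    import torch.nn as nn

    class Seq2Seq(nn.Module):
        def __init__(self, a, b, hidden):        # a = len(D_e), b = len(D_d)
            super().__init__()
            self.encoder = nn.GRU(a, hidden)     # GRU encoder
            self.decoder = nn.GRU(b, hidden)     # GRU decoder
            self.fc = nn.Linear(hidden, b)       # fully connected layer

        def decode(self, X, start, end, max_len=64):
            # Encode the first vector sequence X (shape: T x a); the last
            # hidden state serves as the intermediate state c.
            _, c = self.encoder(X.unsqueeze(1))
            s, numbers = c, []
            y = torch.zeros(1, 1, self.fc.out_features)
            y[0, 0, start] = 1.0                 # One-hot start symbol
            for _ in range(max_len):             # greedy decoding
                o, s = self.decoder(y, s)
                g = int(self.fc(o[0, 0]).argmax())
                if g == end:                     # stop at the end symbol
                    break
                numbers.append(g)
                y = torch.zeros_like(y)
                y[0, 0, g] = 1.0                 # feed prediction back One-hot
            return numbers                       # number sequence V

Training (not shown) would minimize a cross-entropy loss between the fully connected layer's outputs and the target One-hot sequences Y, with softmax implicit in the loss.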
2. The code obfuscation method combining MD5 and sequence-to-sequence model of claim 1, wherein the specific steps of the first step include:
(1) For an expression in a branch statement of the source program to be obfuscated, any string constant or numeric constant in the expression is taken as a first plaintext, where the numeric constants include integer constants, single-precision floating-point constants and double-precision floating-point constants;
(2) The first plaintext is salted to obtain a second plaintext: when the first plaintext is a numeric constant, a fixed constant is selected as the salt and added to the first plaintext to obtain the second plaintext; when the first plaintext is a string constant, a fixed punctuation mark is selected as the salt and appended to the end of the first plaintext to obtain the second plaintext. The second plaintext is then encrypted with the MD5 hash algorithm to obtain an MD5 string; four mutually distinct characters are then defined and used as the type identifiers of string constants, integer constants, single-precision floating-point constants and double-precision floating-point constants, respectively; finally, the type identifier corresponding to the constant type of the first plaintext is inserted into the MD5 string, at any position in the MD5 string; the MD5 string with the inserted type identifier is called the ciphertext (a sketch with assumed concrete choices follows this claim);
(3) First, all ciphertexts and first plaintexts are converted into character strings; then each pair of ciphertext and first plaintext is used to construct a first ordered pair (c_i, p_i), where the first element c_i denotes the ciphertext and the second element p_i denotes the first plaintext corresponding to that ciphertext; all first ordered pairs (c_i, p_i) constitute a constant data set S = {(c_1, p_1), …, (c_n, p_n)}, where n is the number of first ordered pairs (c_i, p_i) in S.
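A Python sketch of this step; the particular salt values, identifier characters and insertion position below are assumptions, since the patent requires them to be fixed but does not prescribe concrete choices:

    import hashlib

    TYPE_ID = {'str': 'S', 'int': 'I', 'float': 'F', 'double': 'D'}  # assumed
    NUM_SALT, STR_SALT, ID_POS = 7, '#', 5                           # assumed

    def to_ciphertext(first_plaintext, kind):
        # Salting: add a fixed constant to a numeric plaintext, or append
        # a fixed punctuation mark to a string plaintext.
        if kind == 'str':
            second_plaintext = first_plaintext + STR_SALT
        else:
            second_plaintext = str(first_plaintext + NUM_SALT)
        md5 = hashlib.md5(second_plaintext.encode()).hexdigest()
        # Insert the type identifier at a fixed position in the MD5 string.
        return md5[:ID_POS] + TYPE_ID[kind] + md5[ID_POS:]

    S = [(to_ciphertext(42, 'int'), '42'),
         (to_ciphertext('key', 'str'), 'key')]   # first ordered pairs (c, p)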
3. The code obfuscation method combining MD5 and sequence-to-sequence model of claim 1, wherein the specific steps of the second step include:
(1) The ciphertexts c_i of all first ordered pairs (c_i, p_i) in the constant data set S are traversed, and every character that appears in any c_i is collected into a character sequence, keeping only one copy of each repeated character; this character sequence is the encoder dictionary D_e;
(2) The first plaintexts p_i of all first ordered pairs (c_i, p_i) in the constant data set S are traversed, and every character that appears in any p_i is collected into a character sequence, keeping only one copy of each repeated character; two characters are then custom-defined as the start symbol and the end symbol, which differ from each other and from all characters in the character sequence; finally, the start symbol and the end symbol are inserted at the very front of the character sequence, and the character sequence after this insertion is the decoder dictionary D_d (an illustrative sketch follows this claim).
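A sketch of both dictionary constructions; the toy ciphertexts, and the start and end symbols '^' and '$', are assumed choices for illustration:

    def build_dictionary(texts, prefix=()):
        # Collect characters in order of first appearance, keeping one copy
        # of each; optional custom symbols go at the very front.
        chars = list(prefix)
        for text in texts:
            for ch in text:
                if ch not in chars:
                    chars.append(ch)
        return chars  # a character's number is its index in this list

    S = [('5ab39I0f', '42'), ('77c1eS02', 'key')]    # toy (c, p) pairs
    D_e = build_dictionary(c for c, p in S)          # encoder dictionary
    D_d = build_dictionary((p for c, p in S), prefix=('^', '$'))  # decoder dictionary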
4. The code obfuscation method combining MD5 and sequence-to-sequence model of claim 1, wherein the specific steps of the third step include:
For each first ordered pair (c_i, p_i) in the constant data set S:
(1) The end symbol is appended to the end of the first plaintext p_i to obtain the third plaintext p_i′;
(2) The ciphertext c_i is split character by character from left to right, and the number of each character of c_i in the encoder dictionary D_e is looked up in turn; all of these numbers form the first number sequence U_i. The third plaintext p_i′ is then split character by character from left to right, and the number of each character of p_i′ in the decoder dictionary D_d is looked up in turn; all of these numbers form the second number sequence V_i;
(3) The length a of D_e and the length b of D_d are computed; each number in U_i is then One-hot encoded in turn as an a-dimensional column vector, all of which form the first vector sequence X_i; each number in V_i is One-hot encoded in turn as a b-dimensional column vector, all of which form the second vector sequence Y_i;
(4) The first vector sequence X_i and the second vector sequence Y_i are used to construct a second ordered pair (X_i, Y_i) (an illustrative sketch follows this claim).
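A sketch of the One-hot preprocessing; the toy decoder dictionary continues the assumptions of the previous sketches:

    import numpy as np

    def one_hot_sequence(text, dictionary):
        # Number each character via the dictionary, then encode each number
        # as a One-hot column vector of dimension len(dictionary).
        M = np.zeros((len(dictionary), len(text)))
        for t, ch in enumerate(text):
            M[dictionary.index(ch), t] = 1.0
        return M

    D_d = ['^', '$', '4', '2']               # '^' start symbol, '$' end symbol
    Y = one_hot_sequence('42' + '$', D_d)    # second vector sequence for p = '42'
    print(Y.shape)                           # (4, 3): b-dimensional column vectors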
5. The code obfuscation method combining MD5 and sequence-to-sequence model of claim 1, wherein the specific steps of the fifth step include:
A function whose input is a ciphertext c is constructed; first, c is converted into a first vector sequence X according to the third step; X is then fed into the sequence-to-sequence model, which outputs a number sequence V; next, for each number g in V, the character whose number in the decoder dictionary D_d equals g is found, and these characters are assembled in order into a character sequence; all characters of the sequence are concatenated to obtain the third plaintext p′; the end symbol in p′ is deleted to obtain the first plaintext p; finally, according to the type identifier in c, p is converted into the corresponding constant type, and the type-converted p is taken as the output of the decryption function (an illustrative sketch follows this claim).
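A sketch of the decryption function; model.predict stands in for running the trained sequence-to-sequence model (for instance the decode method sketched after claim 1), one_hot_sequence is from the claim-4 sketch, and convert_type is sketched after claim 6 — all assumed names:

    def decrypt(c, model, D_e, D_d, end_symbol='$'):
        # ciphertext -> first vector sequence -> number sequence ->
        # third plaintext -> first plaintext -> typed constant
        X = one_hot_sequence(c, D_e)            # per the third step
        V = model.predict(X)                    # assumed model interface
        third_plaintext = ''.join(D_d[g] for g in V)
        first_plaintext = third_plaintext.rstrip(end_symbol)
        return convert_type(c, first_plaintext)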
6. The code obfuscation method combining MD5 and sequence-to-sequence model of claim 5, wherein the specific steps of converting p into the corresponding constant type according to the type identifier in c include:
First, the character at the position of the type identifier in c is obtained; p is then type-converted according to that character: when the character is the type identifier of a string constant, p is converted into a character string; when the character is the type identifier of an integer constant, p is converted into an integer value; when the character is the type identifier of a single-precision floating-point constant, p is converted into a single-precision floating-point value; when the character is the type identifier of a double-precision floating-point constant, p is converted into a double-precision floating-point value (an illustrative sketch follows this claim).
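A sketch of the type dispatch, reusing the assumed identifier characters and position from the earlier sketches; numpy's float32 stands in for a single-precision value, since a plain Python float is double precision:

    import numpy as np

    def convert_type(c, p, id_pos=5):
        tag = c[id_pos]               # character at the identifier position
        if tag == 'S':
            return p                  # string constant
        if tag == 'I':
            return int(p)             # integer constant
        if tag == 'F':
            return np.float32(p)      # single-precision floating point
        if tag == 'D':
            return float(p)           # double-precision floating point
        raise ValueError('unknown type identifier: ' + tag)

    print(convert_type('5ab39I0f', '42'))   # -> 42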
7. A code obfuscation system combining MD5 and a sequence-to-sequence model, for executing the code obfuscation method combining MD5 and a sequence-to-sequence model according to any one of claims 1-6, comprising a constant data set construction module, an encoder dictionary and decoder dictionary construction module, a data preprocessing module, a sequence-to-sequence model construction and training module, a decryption function construction module, an opaque predicate construction module, an opaque predicate and decryption function insertion module, and a compile-and-generate executable application module, wherein:
The constant data set construction module: for an expression in a branch statement of the source program to be obfuscated, any string constant or numeric constant in the expression is taken as a first plaintext, where the numeric constants include integer constants, single-precision floating-point constants and double-precision floating-point constants; the first plaintext is then salted to obtain a second plaintext; the second plaintext is then encrypted with the MD5 hash algorithm and a corresponding type identifier is added to generate a ciphertext; finally, a first ordered pair is constructed for each pair of ciphertext and first plaintext, and all the first ordered pairs constitute a constant data set;
The encoder dictionary and decoder dictionary construction module: an encoder dictionary is constructed from the ciphertexts of all first ordered pairs in the constant data set, and a decoder dictionary is constructed from the first plaintexts of all first ordered pairs in the constant data set; then a start symbol and an end symbol are custom-defined and added to the decoder dictionary;
The data preprocessing module: for each first ordered pair in the constant data set, the custom end symbol is first appended to the end of the first plaintext of the pair to obtain a third plaintext; the numbers of all characters of the ciphertext are then looked up in turn in the encoder dictionary to form a first number sequence, and the numbers of all characters of the third plaintext are looked up in turn in the decoder dictionary to form a second number sequence; the numbers in the first number sequence are then One-hot encoded to form a first vector sequence, and the numbers in the second number sequence are One-hot encoded to form a second vector sequence; finally, a second ordered pair is constructed from the first vector sequence and the second vector sequence;
The sequence-to-sequence model construction and training module: a sequence-to-sequence model is constructed with a GRU as the encoder and a GRU plus a fully connected layer as the decoder; all the second ordered pairs are used as the training set, and the sequence-to-sequence model is obtained by training; the specific steps are as follows:
(1) A GRU is taken as the encoder of the sequence-to-sequence model; the GRU is computed as:

r_t = σ(W_r [h_{t−1}, x_t] + b_r)
z_t = σ(W_z [h_{t−1}, x_t] + b_z)
h̃_t = tanh(W_h [r_t ⊙ h_{t−1}, x_t] + b_h)
h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t

where r_t, z_t and h̃_t are, in order, the reset gate, the update gate and the candidate state at time step t; h_t is the hidden state at time step t and h_{t−1} the hidden state at time step t−1; x_t is the encoder input at time step t, i.e. the t-th column vector of the first vector sequence X; W_r, W_z and W_h are, in order, the reset-gate, update-gate and candidate-state weight matrices; and b_r, b_z and b_h are, in order, the reset-gate, update-gate and candidate-state bias terms;

when all column vectors x_t of X have been input to the encoder, encoding ends, and the hidden state h_T at the last time step is taken as the intermediate state c;
(2) A GRU and a fully connected layer are taken as the decoder of the sequence-to-sequence model; the GRU is computed as:

r_k = σ(W′_r [s_{k−1}, ŷ_{k−1}] + b′_r)
z_k = σ(W′_z [s_{k−1}, ŷ_{k−1}] + b′_z)
h̃_k = tanh(W′_h [r_k ⊙ s_{k−1}, ŷ_{k−1}] + b′_h)
s_k = (1 − z_k) ⊙ s_{k−1} + z_k ⊙ h̃_k

and the fully connected layer is computed as:

y_k = softmax(W_y s_k + b_y)

where r_k, z_k and h̃_k are, in order, the reset gate, the update gate and the candidate state at decoding step k; s_k is the hidden state at step k and y_k the decoder output at step k; s_{k−1} is the hidden state at step k−1; ŷ_{k−1} is the One-hot encoded vector corresponding to the decoder input at step k, i.e. to the decoder output at step k−1; W′_r, W′_z, W′_h and W_y are, in order, the reset-gate, update-gate, candidate-state and fully-connected-layer weight matrices; and b′_r, b′_z, b′_h and b_y are, in order, the reset-gate, update-gate, candidate-state and fully-connected-layer bias terms;

the decoder takes the intermediate state c as the hidden state s_0 at the initial step, and the One-hot encoded vector of the start symbol as the decoder input ŷ_0 at the initial step; decoding ends when the decoder outputs the end symbol;
(3) First, the model weights of the entire sequence-to-sequence model are initialized, including the weight matrices W_r, W_z, W_h, W′_r, W′_z, W′_h, W_y and the bias terms b_r, b_z, b_h, b′_r, b′_z, b′_h, b_y, and the model hyperparameters are set, including the number of GRU hidden layers, the number of GRU hidden-layer units, the batch size, the learning rate and the number of training iterations; then all second ordered pairs (X, Y) are used as the training set to train the model, with each first vector sequence X as an input sequence and the corresponding second vector sequence Y as its target sequence; training stops when the preset number of training iterations Epoch is reached; if at that point the model correctly predicts the corresponding target sequence for every input sequence in the training set, training is complete and the model weights are saved; otherwise, the number of training iterations Epoch is increased and the model is retrained;
The decryption function construction module: a function that decrypts the ciphertext into the first plaintext is constructed from the trained sequence-to-sequence model; this function is called the decryption function;
The opaque predicate construction module: the string constants and numeric constants in the expressions of branch statements in the source program to be obfuscated are replaced with calls to the decryption function; each expression after constant replacement is an opaque predicate;
The opaque predicate and decryption function insertion module: the original expressions of the branch statements in the source program to be obfuscated are replaced with the opaque predicates, and the decryption function is inserted at any position in the source program that does not affect compilation, yielding the obfuscated source program;
The compile-and-generate executable application module: the obfuscated source program is compiled into an executable application program.
CN202311040048.0A 2023-08-17 2023-08-17 Code confusion method and system combining MD5 and sequence-to-sequence model Active CN117574334B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311040048.0A CN117574334B (en) 2023-08-17 2023-08-17 Code confusion method and system combining MD5 and sequence-to-sequence model

Publications (2)

Publication Number Publication Date
CN117574334A (en) 2024-02-20
CN117574334B (en) 2024-05-28

Family

ID=89888677

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311040048.0A Active CN117574334B (en) 2023-08-17 2023-08-17 Code confusion method and system combining MD5 and sequence-to-sequence model

Country Status (1)

Country Link
CN (1) CN117574334B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105354449A (en) * 2015-11-04 2016-02-24 北京鼎源科技有限公司 Scrambling and obfuscating method for Lua language and decryption method
CN107341374A (en) * 2017-07-17 2017-11-10 广东工业大学 A kind of insertion method and device of opaque predicate
CN107437005A (en) * 2017-07-18 2017-12-05 广东工业大学 A kind of Code obfuscation method and device based on the opaque predicate of chaos
CN109784009A (en) * 2018-12-15 2019-05-21 深圳壹账通智能科技有限公司 Code obfuscation method, system, computer installation and computer readable storage medium
WO2021217980A1 (en) * 2020-04-30 2021-11-04 平安科技(深圳)有限公司 Java code packing method and system
CN114611074A (en) * 2022-03-09 2022-06-10 河海大学 Method, system, equipment and storage medium for obfuscating source code of solid language

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research and application of chaotic opaque predicates in code obfuscation (混沌不透明谓词在代码混淆中的研究与应用); Su Qing et al.; Computer Science (《计算机科学》); 2013-06-30; Vol. 40, No. 6; pp. 155-159 *

Also Published As

Publication number Publication date
CN117574334A (en) 2024-02-20

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant