WO2024051196A1 - Malicious code detection method and apparatus, electronic device, and storage medium

Info

Publication number: WO2024051196A1
Application number: PCT/CN2023/093383
Authority: WO
Inventors: 徐莉莎
Original assignee: 上海派拉软件股份有限公司
Priority date: 2022-09-09
Filing date: 2023-05-11
Publication date: 2024-03-14
Also published as: CN115455416A

Abstract

Disclosed in the embodiments of the present application are a malicious code detection method and apparatus, an electronic device, and a storage medium. The method comprises: acquiring text data to be tested; according to the semantic logic, dividing said text data into at least one piece of sub-text data; determining a sub-text vector corresponding to the sub-text data; and inputting the sub-text vector into an optimal neural network model, so as to determine a malicious code detection result.

Description

Malicious code detection method, device, electronic equipment and storage medium

This disclosure claims priority from Chinese patent application No. 202211104769.9, filed with the China Patent Office on September 9, 2022, the entire content of which is incorporated into this application by reference.

Technical field

This application relates to the field of computer security technology, for example, to a malicious code detection method, device, electronic equipment and storage medium.

Background

In daily life, hackers attack in various forms. Some attack methods have obvious characteristics, while others are more subtle. Hackers often attack by injecting malicious code into part of a file. When the file is viewed or a command is executed, the malicious code runs, planting viruses or Trojans, leaving backdoors, and performing other dangerous actions. Once the device is connected to the Internet, it is easily compromised and hijacked, causing significant damage.

In response to the above problems, related technologies offer two approaches for detecting malicious code in files. One is to search the file for malicious code through keyword matching; this approach often fails to detect unknown malicious code correctly. The other is to determine whether the data is malicious code through machine learning or deep learning classification methods; however, this approach cannot confirm the location and type of the malicious code at the same time.

Summary of the invention

This application provides a method, device, equipment and medium for detecting malicious codes, which can simultaneously confirm the probability, location and type corresponding to the malicious codes, and improve the accuracy of detecting malicious codes.

According to one aspect of the present application, an embodiment of the present application provides a method for detecting malicious code, which method includes:

Get the text data to be detected;

Divide the text data to be detected into at least one sub-text data according to semantic logic, and determine the sub-text vector corresponding to the sub-text data;

The sub-text vector is input into an optimal neural network model to determine the detection result of the malicious code; wherein the optimal neural network model is generated based on a text data training set containing malicious code.

According to another aspect of the present application, an embodiment of the present application also provides a malicious code detection device, which includes:

The data acquisition module is set to obtain the text data to be detected;

A sub-text vector determination module, configured to divide the text data to be detected into at least one sub-text data according to semantic logic, and determine the sub-text vector corresponding to the sub-text data;

The result determination module is configured to input the sub-text vector into an optimal neural network model to determine the detection result of the malicious code; wherein the optimal neural network model is generated based on a text data training set containing malicious code.

According to another aspect of the present application, an embodiment of the present application further provides an electronic device, where the electronic device includes:

at least one processor; and

a memory communicatively connected to the at least one processor; wherein,

The memory stores a computer program that can be executed by the at least one processor, and the computer program is executed by the at least one processor, so that the at least one processor can execute the malicious code detection method described in any embodiment of the present application.

According to another aspect of the present application, embodiments of the present application also provide a computer-readable storage medium. The computer-readable storage medium stores computer instructions, and the computer instructions, when executed, cause a processor to implement the malicious code detection method described in any embodiment of the present application.

The technical solution of the embodiment of the present application obtains the text data to be detected, divides the text data to be detected into at least one sub-text data according to semantic logic, determines the sub-text vector corresponding to the sub-text data, and inputs the sub-text vector into the optimal neural network model to determine the detection result of the malicious code, where the optimal neural network model is generated based on a text data training set containing malicious code. In the embodiment of this application, dividing the acquired text data to be detected into at least one sub-text data through semantic logic and determining the corresponding sub-text vector forms sub-text features and facilitates subsequent data processing by the optimal neural network model. Determining the detection result of malicious code in the sub-text vector through the optimal neural network model confirms the relevant detection information corresponding to the malicious code and improves the accuracy of malicious code detection.

Description of the drawings

In order to illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are introduced below. The drawings in the following description show only some embodiments of the present application. Those of ordinary skill in the art can obtain other drawings based on these drawings without any creative effort.

Figure 1 is a flow chart of a malicious code detection method provided by an embodiment of the present application;

Figure 2 is a flow chart of another malicious code detection method provided by an embodiment of the present application;

Figure 3 is a flow chart of a malicious code training process provided by an embodiment of the present application;

Figure 4 is a flow chart of a malicious code detection process provided by an embodiment of the present application;

Figure 5 is a structural block diagram of a malicious code detection device provided by an embodiment of the present application;

FIG. 6 shows a schematic structural diagram of an electronic device that can be used to implement embodiments of the present application.

Detailed description

In order to enable those skilled in the art to understand the solutions of the present application, the technical solutions in the embodiments of the present application will be described below in conjunction with the drawings in the embodiments of the present application. The described embodiments are only some, not all, of the embodiments of the present application. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art without creative efforts should fall within the scope of protection of this application.

It should be noted that the terms "first", "second", etc. in the description and claims of this application and the above-mentioned drawings are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It is to be understood that the data so used are interchangeable under appropriate circumstances so that the embodiments of the application described herein can be practiced in sequences other than those illustrated or described herein. Furthermore, the terms "include" and "having" and any variations thereof are intended to cover non-exclusive inclusions, e.g., processes, methods, systems, products or devices that comprise a series of steps or units are not necessarily limited to the steps or units listed, but may include other steps or units not listed or inherent to such processes, methods, products or devices.

In one embodiment, Figure 1 is a flow chart of a method for detecting malicious code provided by an embodiment of the present application. This embodiment can be applied to situations when malicious code is detected during the transmission of various files. The method may be executed by a malicious code detection device, which may be implemented in the form of hardware and/or software, and may be configured in an electronic device. As shown in Figure 1, the method includes:

S110. Obtain the text data to be detected.

The text data to be detected refers to text data awaiting detection that may contain malicious code. It can also be called character data, and is the text data in various types of files, for example doc files, pdf files and txt files. It can include English characters, Chinese characters, numbers, other input characters, and so on.

In this embodiment, in the process of transferring, dumping or opening various types of files, the file type of the text data to be detected, the content information of the text data, the page row where each piece of content information is located, and so on can be obtained, so that the obtained text data can be processed accordingly.

S120. Divide the text data to be detected into at least one sub-text data according to semantic logic, and determine the sub-text vector corresponding to the sub-text data.

Among them, semantic logic can be understood as the logical relationship between sentences or the character symbols between sentences on the page. Logical relationships can be, for example, juxtaposition, succession, transition, result and cause, purpose, concession, etc.

In this embodiment, sub-text data can be understood as the result of dividing the text data to be detected into modules through semantic logic, with one sub-text data per module. It should be noted that the text data to be detected can be divided into at least one module, and each module corresponds to one sub-text data. After modular division, the corresponding sub-text data can be a code file or a non-script file. The sub-text vector can be understood as the vector obtained for each module by vectorizing its sub-text data using the doc2vec model or the word2vec model.

In this embodiment, the content data in the text data to be detected can be extracted and filtered to obtain English data. Depending on its format and type, the English data may be, for example, a non-script document or programming code. The English data can be divided according to its logical content, and the doc2vec model or word2vec model can be used to vectorize the sub-text data corresponding to each module after the division. Alternatively, the function call graph corresponding to the text data to be detected can be determined, and the feature vector of the corresponding call sequence can be obtained for vectorization.

S130. Input the sub-text vector into the optimal neural network model to determine the detection result of the malicious code; wherein the optimal neural network model is generated based on the text data training set containing the malicious code.

The optimal neural network model is trained and generated from text data training sets drawn from public data sources that contain malicious code. Malicious code, also called malware, refers to any software or code that may conflict with an organization's security policy. Such code serves no useful purpose but brings danger. It may be software that is installed and run on the user's computer or other terminal without explicitly prompting the user or without the user's permission, infringing the user's legitimate rights and interests; it may also be computer code that is deliberately written or configured to pose a threat or potential threat to a network or system. In this embodiment, the type of malicious code can be SQL injection or an XSS attack, where XSS stands for cross-site scripting.

In this embodiment, the detection results of the malicious code include the location of the malicious code, the type corresponding to the malicious code, the probability corresponding to the type of malicious code, and so on.

In this embodiment, the sub-text vector corresponding to the sub-text data in each module after segmentation can be input into the optimal neural network model to determine the detection results of malicious code. For example, the sub-text vectors corresponding to the text data to be detected can be formed into a sub-text vector set, the sub-text vector set can be divided into at least one subset according to a preset number, and the at least one subset can be input into the optimal neural network model to determine the detection results of malicious code. Processing differs according to the number of subsets. If there is a single subset, the detection result of that subset is used directly as the output result of the optimal neural network model. If there are at least two subsets, the detection result is determined and output based on all of the subsets.

In one embodiment, training of the optimal neural network model includes:

Extract text data from the text data training set containing malicious code, where the text data in the text data training set contains the location and type of the malicious code;

Extract content data from text data and filter out non-English data in the content data to obtain English data;

Divide the English data into at least one sub-text data, and use the doc2vec model to determine the sub-text vector corresponding to the sub-text data;

According to the location and type of the malicious code in the text data training set, the location, type and corresponding probability label of the malicious code in each sub-text data are generated to form the text data training set;

Based on the text data training set, the parameters and weight values of the neural network model are iteratively trained and updated. When the loss function reaches its minimum, the optimal neural network model is output; otherwise, the parameters and weight values of the neural network model are adjusted and the above iterative training process is repeated.

The text data training set consists of text data containing malicious code, and the text data in the training set carries the location and type of the malicious code. For example, File 1 contains malicious code whose type is XSS attack and which is located on line 6 of the page.

The content data may include at least one of the following: text type, text information, the row and/or column where the text information is located, and the number of words corresponding to the text information.

In this embodiment, the optimal neural network model is trained as follows. Text data is extracted from the text data training set containing malicious code, where the text data in the training set carries the location and type of the malicious code. Content data is extracted from the text data, such as the text information in doc, pdf and txt files, the lines corresponding to the text, and the number of words, to generate an original text set that preserves the page line numbers of the original text data. Non-English data in the content data is filtered out to obtain English data. The English data is divided into at least one sub-text data, and the doc2vec model is used to determine the sub-text vector corresponding to the sub-text data. Based on the location and type of the malicious code in the text data training set, the location, type and corresponding probability label of the malicious code in each sub-text data are generated to form the text data training set. The parameters and weight values of the neural network model are then iteratively trained on this training set. When the loss function reaches its minimum, the optimal neural network model is output; otherwise, the parameters and weight values of the neural network model are adjusted and the above iterative training process is repeated.
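The label-generation step can be illustrated with a minimal sketch. The application does not specify a label format, so the `SubText` and `Label` records, the use of 1.0 and 0.0 as probability labels, and the function name `make_labels` are all assumptions made for illustration. The example reuses the File 1 case above (an XSS attack on line 6).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SubText:
    """Hypothetical record: page lines covered by one sub-text (1-based, inclusive)."""
    start_line: int
    end_line: int


@dataclass
class Label:
    """Hypothetical training label for one sub-text."""
    probability: float      # 1.0 if the sub-text contains the malicious code, else 0.0
    line: Optional[int]     # page line of the malicious code, if present
    kind: Optional[str]     # type of the malicious code, if present


def make_labels(sub_texts: List[SubText], malicious_line: int, malicious_type: str) -> List[Label]:
    """Project a file-level annotation (line, type) onto each sub-text."""
    labels = []
    for st in sub_texts:
        if st.start_line <= malicious_line <= st.end_line:
            labels.append(Label(1.0, malicious_line, malicious_type))
        else:
            labels.append(Label(0.0, None, None))
    return labels


if __name__ == "__main__":
    # File 1: XSS attack on line 6; the file was split into two sub-texts.
    print(make_labels([SubText(1, 4), SubText(5, 9)], 6, "xss"))
```

Only the sub-text whose line range covers the annotated line receives a positive label; every other sub-text is labelled as clean.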

It should be noted that in the neural network model, the overall loss function in the training stage is the sum of a position loss and a classification loss. The neural network model is trained over multiple iterations so that the accuracy is maximized and the error rate of the whole network is reduced, and the loss function is used to correct the deviation between the real position and the predicted position. The overall loss function in the training stage can be expressed as: LossTotal = loss(location) + loss(classify). Here loss(location) can be expressed by the Intersection over Union (IOU) family of losses. For example, when the GIOU loss function is used, with A the predicted position, B the real position (the real target bounding box), and C the smallest convex closed box enclosing A and B, then IOU = |A∩B| / |A∪B| and GIOU = IOU - |C \ (A∪B)| / |C|, where C \ (A∪B) denotes the part of C not covered by A or B. loss(classify) is the cross-entropy loss function.
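The following sketch works through these formulas numerically. Several points are assumptions rather than details from the application: positions are modelled as one-dimensional spans `(start, end)` over page lines, the position loss is taken as 1 - GIOU (a common convention), and the function names are hypothetical.

```python
import math
from typing import Sequence, Tuple

Span = Tuple[float, float]  # (start, end) with end > start; assumed 1-D position


def iou_giou(pred: Span, true: Span) -> Tuple[float, float]:
    """Return (IOU, GIOU) for a predicted span A and a real span B."""
    a0, a1 = pred
    b0, b1 = true
    inter = max(0.0, min(a1, b1) - max(a0, b0))   # |A ∩ B|
    union = (a1 - a0) + (b1 - b0) - inter          # |A ∪ B|
    hull = max(a1, b1) - min(a0, b0)               # |C|, smallest span enclosing A and B
    iou = inter / union
    # |C \ (A ∪ B)| equals |C| - |A ∪ B| because A ∪ B lies inside C.
    giou = iou - (hull - union) / hull
    return iou, giou


def cross_entropy(probs: Sequence[float], true_index: int) -> float:
    """Cross-entropy of a predicted class distribution against the true class."""
    return -math.log(probs[true_index])


def total_loss(pred: Span, true: Span, probs: Sequence[float], true_index: int) -> float:
    """LossTotal = loss(location) + loss(classify), with loss(location) assumed to be 1 - GIOU."""
    _, giou = iou_giou(pred, true)
    return (1.0 - giou) + cross_entropy(probs, true_index)


if __name__ == "__main__":
    print(iou_giou((0, 2), (1, 3)))   # overlapping spans
    print(iou_giou((0, 1), (2, 3)))   # disjoint spans: IOU is 0, GIOU is negative
```

For disjoint spans IOU is 0 regardless of how far apart they are, while GIOU becomes more negative as the gap grows, which is why it gives a useful training signal even when the prediction misses entirely.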

It should be noted that if new text data becomes available, the original model can be fine-tuned and updated, or retrained together with the old data.

In one embodiment, Figure 2 is a flow chart of another malicious code detection method provided by an embodiment of the present application. Building on the above embodiments, this embodiment further details the steps of dividing the text data to be detected into at least one sub-text data according to semantic logic, determining the sub-text vector corresponding to the sub-text data, and inputting the sub-text vector into the optimal neural network model to determine the detection result of the malicious code. As shown in Figure 2, the malicious code detection method in this embodiment may include the following steps:

S210. Obtain the text data to be detected.

S220. Extract content data in the text data to be detected.

The content data includes at least one of the following: text type, text information, the row and/or column where the text information is located, and the number of words corresponding to the text information. The text type refers to the file type corresponding to the text data to be detected, for example, it can be a doc file, pdf file, txt file, etc. Text information can be understood as text information of the text data to be detected and related attribute information of the text information.

In this embodiment, fixed text information extraction tools and/or text data parsing toolkits, libraries and the like can be used to extract the corresponding content data from the text data to be detected and generate the original text set. The original text set holds the original content data of the text data to be detected, including the original text type, the original text information, the row and/or column of the original text information, and the number of words corresponding to the original text information.
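As a minimal sketch of what an original text set might hold, the following builds one record per line of already-extracted plain text. The `ContentLine` record and `extract_content` function are hypothetical; parsing doc or pdf files would need a dedicated library, which is outside this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ContentLine:
    """Hypothetical content-data record for one line of text."""
    text_type: str    # file type, e.g. "txt"
    row: int          # 1-based row on the page
    text: str         # text information on that row
    word_count: int   # number of whitespace-separated words


def extract_content(raw_text: str, text_type: str = "txt") -> List[ContentLine]:
    """Build the original text set, keeping the original row of every line."""
    return [
        ContentLine(text_type, row, line, len(line.split()))
        for row, line in enumerate(raw_text.splitlines(), start=1)
    ]


if __name__ == "__main__":
    for record in extract_content("hello world\n\nSELECT * FROM users"):
        print(record)
```

Blank lines are kept as records with a word count of 0 so that row numbers continue to match the original page.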

S230. Filter non-English data in the content data to obtain English data in the content data.

Among them, English data can be understood as English characters.

In this embodiment, the encoding of the characters in the content data can be used to determine whether the content data in the text data to be detected is Chinese character data or English character data. Because malicious code is usually written in English characters, the Chinese character data is filtered out to obtain the English data corresponding to the content data.
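One simple way to approximate this filtering is to keep only ASCII characters. This is an assumption: the application only says the decision is based on character encoding, and the function name `filter_english` is hypothetical.

```python
from typing import List, Tuple

Row = Tuple[int, str]  # (row number on the page, text on that row)


def filter_english(rows: List[Row]) -> List[Row]:
    """Drop non-ASCII characters from each row and keep the original row numbers.

    Rows that contain nothing but whitespace after filtering are discarded.
    """
    kept = []
    for row, text in rows:
        english = "".join(ch for ch in text if ch.isascii())
        if english.strip():
            kept.append((row, english))
    return kept


if __name__ == "__main__":
    sample = [(1, "用户 <script>alert(1)</script>"), (2, "全是中文"), (3, "ok")]
    print(filter_english(sample))
```

Row numbers are carried through unchanged so that a later detection can still be reported against the original page position.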

S240. When the English data is a code, divide the English data into at least one sub-text data according to the logical structure of the code, and use the doc2vec model to determine the sub-text vector corresponding to the sub-text data.

Among them, the logical structure includes at least one of the following: sequential logic, conditional logic, loop logic, function block, and class. The doc2vec model is a related model used to generate word vectors and paragraph vectors. The doc2vec model is used for embedding vectorization for the sub-text data obtained from each modularization.

In this embodiment, the fixed format of script documents in a programming language can be used to determine whether the filtered English data is code or a non-script document. When the English data is code, it can be divided into modules according to the logical structure of the programming code, for example sequential logic, conditional logic, loop logic, function blocks and classes. Each module corresponds to one sub-text data, and the number of modules is at least one. After the division, the doc2vec model can be used to determine the sub-text vector corresponding to the sub-text data in each module. For example, if a file to be detected whose English data is code is divided into n modules, the sub-text data of the n modules is embedded with the doc2vec model to generate a corresponding m*k vector for each module, where m represents the number of characters contained in the sub-text data of each module and k represents the vocabulary size. The text data to be detected can then be represented as an n*m*k dimensional vector.
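As an illustration of dividing code by logical structure, the sketch below splits Python source into its top-level statements using the standard `ast` module. That the code under test is Python, and the function name `split_python_source`, are assumptions; the application does not name a language or a parser, and the doc2vec step is not shown.

```python
import ast
from typing import List, Tuple

Module = Tuple[str, int, int, str]  # (structure kind, first line, last line, source text)


def split_python_source(source: str) -> List[Module]:
    """Split source into modules, one per top-level statement.

    Function definitions, classes, loops and conditionals each become one
    module; remaining sequential statements become one module each.
    """
    tree = ast.parse(source)
    lines = source.splitlines()
    modules = []
    for node in tree.body:
        kind = type(node).__name__          # e.g. "FunctionDef", "ClassDef", "For", "If"
        text = "\n".join(lines[node.lineno - 1:node.end_lineno])
        modules.append((kind, node.lineno, node.end_lineno, text))
    return modules


if __name__ == "__main__":
    sample = (
        "import os\n\n"
        "def f(x):\n    if x:\n        return 1\n    return 0\n\n"
        "class A:\n    pass\n\n"
        "for i in range(3):\n    f(i)\n"
    )
    for kind, first, last, _ in split_python_source(sample):
        print(kind, first, last)
```

Each module keeps its first and last line number, so a detection inside a module can be mapped back to a page row.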

S250. When the English data is a non-script document, divide the English data into at least one sub-text data according to the punctuation marks of the characters in the English data, and use the doc2vec model to determine the sub-text vector corresponding to the sub-text data.

The non-script document can be of doc type, pdf type, txt type, or another non-script document type. The punctuation marks of characters in English data can be understood as the punctuation marks of the characters in each line of the page. For example, the characters in each line are split into sentences at each punctuation mark.

In this embodiment, if the English data is determined to be a non-script document, the English data needs to be divided into modules according to the punctuation marks of its characters. Each module corresponds to one sub-text data, and the number of modules is at least one. After the division, the doc2vec model can be used for embedding vectorization to determine the sub-text vector corresponding to the sub-text data in each module.
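A minimal sketch of punctuation-based splitting follows. The set of punctuation marks treated as boundaries, the requirement that a mark be followed by whitespace, and the function name `split_by_punctuation` are assumptions; the application does not list the marks.

```python
import re
from typing import List, Tuple

Row = Tuple[int, str]  # (row number on the page, text on that row)

# Split after an assumed set of punctuation marks when followed by whitespace.
_BOUNDARY = re.compile(r"(?<=[.!?;,:])\s+")


def split_by_punctuation(rows: List[Row]) -> List[Row]:
    """Split every row into sentence-like pieces, each tagged with its row."""
    pieces = []
    for row, text in rows:
        for piece in _BOUNDARY.split(text.strip()):
            if piece:
                pieces.append((row, piece))
    return pieces


if __name__ == "__main__":
    print(split_by_punctuation([(1, "Hello, world. Click here! "), (2, "Done")]))
```

Requiring whitespace after the mark keeps strings such as URLs or decimal numbers in one piece.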

S260. Construct sub-text vectors corresponding to the text data to be detected into a sub-text vector set.

In this embodiment, according to the sub-text vector corresponding to the text data to be detected, a corresponding sub-text vector set can be formed. It should be noted that the sub-text vector set may contain one or more sub-text vectors.

S270. Divide the sub-text vector set into at least one subset according to a preset number.

Among them, the preset number can be understood as the fixed number of sub-text vectors corresponding to the pre-set modules. The corresponding settings can be made through experience, or they can be set manually. The subset contains subtext vectors corresponding to one or more subtext data.

In this embodiment, the number of modules, and therefore the number of sub-text data, differs from one text data to be detected to another. The sub-text vector set of each text data to be detected therefore needs to be divided by a fixed number to obtain the vector corresponding to each subset. The sub-text data of the n modules is embedded with the doc2vec model to generate the corresponding m*k vectors. Dividing by the fixed number r then gives a vector of r*m*k for each subset, where m represents the number of characters contained in the sub-text data of each module and k represents the vocabulary size. It should be noted that if the final segment contains fewer than r sub-text vectors, its own sub-text data needs to be copied until the fixed number r is reached.

It should be noted that if the text data to be detected contains malicious code, then after the sub-text vectors are divided according to the preset number, the location reported for the malicious code may or may not be the same as its position in the original text data to be detected.

In one embodiment, the sub-text vector set is divided into at least one subset according to a preset number, including:

If the number of sub-text vectors in the sub-text vector set exceeds the preset number, the preset number of sub-text vectors is repeatedly selected from the sub-text vectors and divided into a subset until the number of remaining sub-text vectors is below the preset number;

If the number of sub-text vectors in the sub-text vector set does not reach the preset number, copy sub-text vectors so that the copies and the existing sub-text vectors together equal the preset number, and divide the copies and the existing sub-text vectors into one subset.

In this embodiment, when the number of sub-text vectors exceeds the preset number, the preset number of sub-text vectors is repeatedly selected from the sub-text vectors and placed in a subset until the number of remaining sub-text vectors is below the preset number. When the number of sub-text vectors is below the preset number, sub-text vectors are copied so that the copies and the existing sub-text vectors together equal the preset number, and the copies and the existing sub-text vectors are placed in one subset. It should be noted that if only 1 sub-text vector remains, that sub-text vector is copied to expand the subset to the preset number. If at least 2 sub-text vectors remain, either one of them or several of them can be copied to expand the subset to the preset number.

For example, suppose the preset number, that is, the standardized number, is 5. If the text data to be detected has 15 sub-text vectors, they are divided into 3 subsets of 5. If it has 17 sub-text vectors, they are divided into 4 subsets, and the 4th subset contains only 2 sub-text vectors. The 4th subset then needs to be expanded to the standardized number 5, which can be done by copying its 2 sub-text vectors.
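The division and padding described above can be sketched as follows. The function name `partition_with_padding` is hypothetical, and padding by cycling through the short subset's own vectors is one of the copying strategies the text allows, chosen here for simplicity.

```python
from itertools import cycle, islice
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def partition_with_padding(vectors: Sequence[T], r: int) -> List[List[T]]:
    """Divide vectors into subsets of exactly r items.

    A final subset with fewer than r items is filled by repeating its own
    items in order until it holds r items.
    """
    if r <= 0:
        raise ValueError("preset number r must be positive")
    subsets = []
    for start in range(0, len(vectors), r):
        chunk = list(vectors[start:start + r])
        if len(chunk) < r:
            chunk = list(islice(cycle(chunk), r))
        subsets.append(chunk)
    return subsets


if __name__ == "__main__":
    # Integers stand in for sub-text vectors.
    print(len(partition_with_padding(range(15), 5)))
    print(partition_with_padding(range(17), 5)[-1])
```

With 17 items and r = 5, the fourth subset starts with items 15 and 16 and is filled to five entries by repeating them.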

S280. Input at least one subset into the optimal neural network model to determine the detection result of the malicious code, where the detection result includes the location, type and corresponding probability of the malicious code.

In this embodiment, the at least one input subset is examined by the optimal neural network model to determine the detection result for malicious code in the text data to be detected. If malicious code is present, the output of the optimal neural network model contains the location, type and corresponding probability of the malicious code. It should be noted that the processing differs according to the number of subsets, so that the detection result is obtained separately for the case of a single subset and the case of at least two subsets.

S290. If the number of sub-text vectors in the sub-text vector set constitutes a single subset, then the detection result of the single subset is used as the output result of the optimal neural network model.

In this embodiment, when the number of sub-text vectors constitutes a single subset, the detection result of the single subset is directly used as the output result of the optimal neural network model. For example, if the probability corresponding to the type of a single subset is lower than the preset probability, the detection result of the malicious code is determined to be that there is no malicious code in the text data to be detected; if the probability corresponding to the type of a single subset exceeds the preset probability, Then it is determined that the detection result of the malicious code is that there is malicious code in the text data to be detected, and the corresponding location, category and corresponding probability of the malicious code in a single subset are directly output.

In one embodiment, the detection results of a single subset are used as the output results of the optimal neural network model, including:

When the probability corresponding to the type of a single subset is lower than the preset probability, it is determined that the detection result of the malicious code is that there is no malicious code in the text data to be detected;

When the probability corresponding to the type of a single subset exceeds the preset probability, it is determined that the detection result of the malicious code is that the text data to be detected contains malicious code, and the corresponding location, category and corresponding probability of the malicious code in the single subset are output.

In this embodiment, when the probability corresponding to the type of a single subset is lower than the preset probability, it is determined that the detection result of the malicious code is that there is no malicious code in the text data to be detected; when the probability corresponding to the type of a single subset exceeds the preset probability, it is determined that the detection result of the malicious code is that the text data to be detected contains malicious code, and the corresponding location, category and corresponding probability of the malicious code in the single subset are output. For example, when the preset probability is 5% and the probability corresponding to the type of a single subset exceeds 5%, it is determined that the detection result of the malicious code is that the text data to be detected contains malicious code, and the current location, category and corresponding probability of the malicious code are output through the optimal neural network model, for example the row of the page, the category of the malicious code, and the probability corresponding to the current malicious code.
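The single-subset decision amounts to a threshold test, sketched below. The `Detection` record and `decide_single` function are hypothetical, and treating a probability exactly equal to the preset probability as "no malicious code" is an assumption, since the text only covers the lower and higher cases.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Detection:
    """Hypothetical model output for one subset."""
    row: int            # row of the page where the code was found
    category: str       # e.g. "xss" or "sql_injection"
    probability: float  # probability for that category


def decide_single(detection: Detection, preset_probability: float = 0.05) -> Optional[Detection]:
    """Return the detection if it exceeds the preset probability, else None.

    None stands for "no malicious code in the text data to be detected".
    A probability equal to the threshold is treated as not exceeding it (assumption).
    """
    if detection.probability > preset_probability:
        return detection
    return None


if __name__ == "__main__":
    print(decide_single(Detection(6, "xss", 0.92)))
    print(decide_single(Detection(6, "xss", 0.01)))
```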

S2100. If the number of sub-text vectors in the sub-text vector set constitutes at least two subsets, determine the output result of the optimal neural network model based on the comprehensive detection results of at least two subsets.

In this embodiment, if the sub-text vectors in the sub-text vector set constitute at least two subsets, the output result of the optimal neural network model is determined based on the comprehensive detection results of the at least two subsets. For example, if the probabilities corresponding to the types of the at least two subsets are lower than the preset probability, it is determined that the detection result of the malicious code is that there is no malicious code in the text data to be detected; if the probabilities corresponding to the types of the at least two subsets exceed the preset probability, it is determined that the detection result of the malicious code is that there is malicious code in the text data to be detected. In that case, the content data corresponding to each subset must be merged according to the original order of the subsets to determine the corresponding locations, categories and corresponding probabilities of the malicious code in the at least two subsets.

In one embodiment, determining the output result of the optimal neural network model based on the comprehensive detection results of at least two subsets includes:

When the probabilities corresponding to the types of at least two subsets are lower than the preset probability, it is determined that the detection result of the malicious code is that there is no malicious code in the text data to be detected;

When the probabilities corresponding to the types of at least two subsets exceed the preset probability, it is determined that the detection result of the malicious code is that the text data to be detected contains malicious code, and the content data corresponding to each subset is merged according to the original order of each subset to determine the corresponding locations, categories and corresponding probabilities of the malicious code in the at least two subsets.

In this embodiment, when the probabilities corresponding to the types of at least two subsets are lower than the preset probability, it is determined that the detection result of the malicious code is that there is no malicious code in the text data to be detected; when the probabilities corresponding to the types of at least two subsets exceed the preset probability, it is determined that the detection result of the malicious code is that the text data to be detected contains malicious code, and the content data corresponding to each subset is merged according to the original order of each subset to determine the corresponding location, category and corresponding probability of the malicious code in the at least two subsets.

For example, if it is determined that there is malicious code in the text data to be detected, the location of the malicious code is split along with the subsets when the subsets are segmented. For example, suppose the malicious code is in lines 7-10 of the original data. After splitting according to a fixed number of 8, the malicious code in the first subset is in lines 7 and 8, and the malicious code in the second subset is in lines 1 and 2. The content data corresponding to each subset must therefore be merged in order of the original row numbers of each subset to determine the location, type and corresponding probability of the malicious code contained in the original data.
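The merging step in the example above can be sketched as follows. This is only an illustration under stated assumptions: the tuple layouts and the name `merge_hits` are hypothetical, line numbers are taken to be 1-based and inclusive, adjacent ranges are joined only when they share a category, and the merged probability is taken as the maximum of the parts because the text does not specify how probabilities are combined. The sketch also assumes that hits are reported only for original entries, not for copies added as padding.

```python
from typing import List, Tuple

# (subset_index, local_start, local_end, category, probability)
LocalHit = Tuple[int, int, int, str, float]
# (start, end, category, probability) in original line numbers
GlobalHit = Tuple[int, int, str, float]


def merge_hits(hits: List[LocalHit], subset_size: int) -> List[GlobalHit]:
    """Map per-subset line ranges back to original line numbers and merge them."""
    # Shift every local range by the offset of its subset in the original data.
    shifted = sorted(
        (idx * subset_size + start, idx * subset_size + end, cat, prob)
        for idx, start, end, cat, prob in hits
    )
    merged: List[GlobalHit] = []
    for start, end, cat, prob in shifted:
        # Join ranges of the same category that touch or overlap.
        if merged and merged[-1][2] == cat and start <= merged[-1][1] + 1:
            prev_start, prev_end, prev_cat, prev_prob = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end), prev_cat, max(prev_prob, prob))
        else:
            merged.append((start, end, cat, prob))
    return merged
```

For the example in the text, a hit on lines 7-8 of subset 0 and a hit on lines 1-2 of subset 1, with a subset size of 8, are mapped back to a single range covering lines 7-10 of the original data.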

In the above technical solution of the embodiment of the present application, the content data in the text data to be detected is extracted, and the non-English data in the content data is filtered out to obtain the English data in the content data. When the English data is code, the English data is divided into at least one piece of sub-text data according to the logical structure of the code, and the doc2vec model is used to determine the sub-text vector corresponding to the sub-text data. When the English data is a non-script document, the English data is divided into at least one piece of sub-text data according to the punctuation marks of the characters in the English data, and the doc2vec model is used to determine the sub-text vector corresponding to the sub-text data. This forms the corresponding sub-text features and facilitates subsequent data processing by the optimal neural network model. The sub-text vectors are divided into at least one subset and input into the optimal neural network model to determine the detection result of the malicious code, where the detection result includes the location, type and corresponding probability of the malicious code. If the sub-text vectors constitute a single subset, the detection result of the single subset is used as the output result of the optimal neural network model. If the sub-text vectors constitute at least two subsets, the output result of the optimal neural network model is determined based on the comprehensive detection results of the at least two subsets. In this way, the probability, location and type of the malicious code can be confirmed at the same time, which improves the efficiency and accuracy of malicious code detection.

In one embodiment, the detection method of malicious code can be divided into a training stage and a detection stage. To better explain the detection method, Figure 3 is a flow chart of a malicious code training process provided by an embodiment of the present application, and Figure 4 is a flow chart of a malicious code detection process provided by an embodiment of the present application. Here, the text data represents the text data to be detected in the above embodiments, the non-English characters represent the non-English data in the above embodiments, the subsample represents the subset in the above embodiments, and the preset number of modules represents the preset number in the above embodiments.

First, the steps in the training phase are as follows:

S310. Collect text data containing malicious code.

In this embodiment, data collection is performed to determine training data.

S320. Extract the text content data in the text data, generate an original text set, and maintain the original number of lines.

S330. Filter non-English characters from the text content data to obtain English characters.

S340. Carry out modular division of English characters.

In this embodiment, the text data is divided into modules based on semantic logic, and the number of modules is recorded as n. If the text data containing English characters is a code file, it is divided according to structures such as sequential logic, conditional logic, loop logic, function blocks and classes; if the text data containing English characters is a non-script document such as doc, pdf or txt, it is divided according to inline statements. In both cases n modules are generated.

The modular division of English characters in this embodiment can be understood as the division of the text data to be detected into at least one piece of sub-text data according to semantic logic in the above embodiments. Each module corresponds to one piece of sub-text data.
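A rough sketch of the filtering and modular division steps follows. It is illustrative only and rests on several assumptions not stated in the application: "English data" is approximated as printable ASCII, a non-script document is split at sentence-ending punctuation, and a code file is assumed to be Python source split into one module per top-level statement using the standard `ast` module, which is much coarser than the sequential, conditional, loop, function-block and class structures named above. All function names are hypothetical.

```python
import ast
import re
from typing import List


def filter_english(text: str) -> str:
    """Drop every character outside printable ASCII, keeping newlines and tabs."""
    return re.sub(r"[^\x20-\x7E\n\t]", "", text)


def split_document(text: str) -> List[str]:
    """Split prose into modules at '.', '!', '?' or ';' followed by whitespace."""
    parts = re.split(r"(?<=[.!?;])\s+", text.strip())
    return [part for part in parts if part]


def split_python_code(source: str) -> List[str]:
    """Split Python source into modules, one per top-level statement."""
    tree = ast.parse(source)
    return [ast.get_source_segment(source, node) for node in tree.body]
```

The number of returned modules plays the role of n. For example, `split_document("Open the file. Run it! Done?")` yields three modules, and a source file with one import, one function definition and one loop also yields three.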

S350. Use the doc2vec model to conduct embedding vectorization on the text data of each module after division, and generate an m*k dimensional vector.

In this embodiment, since each of the n modules yields an m*k dimensional vector, a file is represented as an n*m*k dimensional vector. The module vectorization in this embodiment can be understood as determining the sub-text vector corresponding to the sub-text data in the above-mentioned embodiments.

S360. Split according to the preset number of modules to generate subsamples.

In this embodiment, splitting is performed according to a fixed length. Since the number of modules contained in each file is different, each file vector is divided according to a fixed number of modules r to obtain sub-sample vectors, each of which is an r*m*k dimensional vector. A final segment containing fewer than r modules copies its own data until it contains r modules.
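The fixed-length split with copy padding can be sketched as below. The function name `split_into_subsamples` is hypothetical, and the order in which a short final segment copies its own modules (cycling from its first module) is an assumption, since the text only says that the segment copies its own data until it holds r modules. Each element of `modules` stands for one m*k module vector; any Python object works for the illustration.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_into_subsamples(modules: Sequence[T], r: int) -> List[List[T]]:
    """Split n module vectors into subsamples of exactly r modules each."""
    if r <= 0:
        raise ValueError("r must be positive")
    subsamples: List[List[T]] = []
    for start in range(0, len(modules), r):
        chunk = list(modules[start:start + r])
        # Pad a short final chunk by cycling through its own modules.
        i = 0
        while len(chunk) < r:
            chunk.append(chunk[i])
            i += 1
        subsamples.append(chunk)
    return subsamples
```

With ten modules and r = 8, the first subsample holds modules 0-7 and the second holds modules 8 and 9 repeated until it also has eight entries.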

S370. According to the location and type of the malicious code in the text data, correspondingly generate labels for the location, type and corresponding probability of the malicious code in each generated subsample, to form a text data training set.

S380. Input the sub-sample training data into the constructed neural network model for training iterations to obtain the final optimal neural network model.

In this embodiment, an end-to-end neural network model is constructed, the number of network layers is defined, and the loss function and optimizer are determined.

In this embodiment, the overall loss function in the training phase is the sum of the position loss and the classification loss: LossTotal = loss(location) + loss(classify). Here, loss(location) can be expressed by a loss from the IOU family, such as GIOU. For example, if A is the predicted location, B is the real location, and C is the minimum convex closed box of A and B, which is the real target bounding box, then
IOU = |A∩B| / |A∪B|
GIOU = IOU - |C \ (A∪B)| / |C|
where C \ (A∪B) denotes the part of C covered by neither A nor B.

Here, loss(classify) is the cross-entropy loss function.
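The loss above can be illustrated with a small numeric sketch. Several points are assumptions rather than statements of the application: locations are modeled as one-dimensional continuous intervals (start, end) instead of boxes, the position loss is taken as 1 - GIOU (a common way to turn GIOU into a loss), and the cross-entropy is computed for a single sample from already normalized class probabilities. The function names are hypothetical.

```python
import math
from typing import Sequence, Tuple

Interval = Tuple[float, float]  # (start, end) with start < end


def giou_1d(a: Interval, b: Interval) -> float:
    """GIOU of two 1-D intervals: IOU - |C \\ (A U B)| / |C|."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))   # |A n B|
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter         # |A U B|
    hull = max(a[1], b[1]) - min(a[0], b[0])              # |C|, smallest enclosing interval
    iou = inter / union
    return iou - (hull - union) / hull


def cross_entropy(probs: Sequence[float], true_index: int) -> float:
    """Cross-entropy of one sample given predicted class probabilities."""
    return -math.log(probs[true_index])


def total_loss(pred: Interval, true: Interval,
               probs: Sequence[float], true_index: int) -> float:
    """LossTotal = loss(location) + loss(classify), with loss(location) = 1 - GIOU."""
    return (1.0 - giou_1d(pred, true)) + cross_entropy(probs, true_index)
```

A perfect location prediction gives a GIOU of 1 and therefore zero position loss, while two disjoint intervals give a negative GIOU that grows more negative as the gap between them widens.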

In this embodiment, parameters of the model are continuously adjusted during the training process, and the weights and parameters corresponding to the number of network layers in the optimal neural network model are determined. For example, the network has 3 layers, the second layer has 3 neurons, and each neuron has a corresponding weight and learning rate.

As shown in Figure 4, the steps in the detection phase of malicious code are as follows:

S410. Obtain the text data to be detected, and extract the content data in the text data to be detected.

S420. Filter non-English data in the content data to obtain English data in the content data.

S430. Modularly divide the English data in the text data based on semantic logic.

In this embodiment, the text data is divided into modules based on semantic logic, and the number of modules is recorded as n. If the text data containing English characters is a code file, it is divided according to structures such as sequential logic, conditional logic, loop logic, function blocks and classes; if the text data containing English characters is a non-script document such as doc, pdf or txt, it is divided according to inline statements. In both cases n modules are generated.

S440. Use the doc2vec model to perform embedding vectorization on the text data of each module after division, and generate an m*k dimensional vector.

S450. Split according to the preset number of modules to generate subsamples.

S460. Determine whether there is only a single subsample. If so, execute S470; if not, execute S480.

S470. Determine whether the probability corresponding to the type of a single subsample is lower than the preset probability. If so, execute S471; otherwise, execute S472.

S471. There is no malicious code.

S472. There is malicious code; output the corresponding location, category and corresponding probability of the malicious code in the single subsample.

S480. Determine whether the probabilities corresponding to the types of at least two subsamples are lower than the preset probability. If so, execute S481; otherwise, execute S482.

S481. There is no malicious code.

S482. There is malicious code; merge the content data corresponding to each subsample according to the original order of the subsamples to determine the corresponding location, category and corresponding probability of the malicious code after the subsamples are merged.

In one embodiment, FIG. 5 is a structural block diagram of a malicious code detection device provided by an embodiment of the present application. The device is suitable for detecting malicious code during the transmission of various files. The device can be implemented by hardware/software and can be configured in an electronic device to implement the malicious code detection method in the embodiments of the present application. As shown in Figure 5, the device includes: a data acquisition module 510, a sub-text vector determination module 520 and a result determination module 530.

Among them, the data acquisition module 510 is configured to acquire the text data to be detected;

The sub-text vector determination module 520 is configured to divide the text data to be detected into at least one sub-text data according to semantic logic, and determine the sub-text vector corresponding to the sub-text data;

The result determination module 530 is configured to input the sub-text vector into an optimal neural network model to determine the detection result of the malicious code; wherein the optimal neural network model is generated based on a text data training set containing malicious code.

In the embodiment of the present application, the sub-text vector determination module divides the acquired text data to be detected into at least one piece of sub-text data according to semantic logic and determines the corresponding sub-text vector, which forms sample features and facilitates subsequent data processing by the optimal neural network model; the result determination module uses the optimal neural network model to determine the detection result of malicious code from the sub-text vector, which confirms the relevant detection information corresponding to the malicious code and improves the accuracy of malicious code detection.

In one embodiment, the sub-text vector determination module 520 includes:

A data extraction unit configured to extract content data in the text data to be detected, wherein the content data includes at least one of the following: text type, text information, the row and/or column where the text information is located, and the number of words corresponding to the text information;

A data filtering unit configured to filter non-English data in the content data to obtain English data in the content data;

The first sub-text vector determination unit is configured to, when the English data is a code, divide the English data into at least one sub-text data according to the logical structure of the code, and use the doc2vec model to determine the sub-text vector corresponding to the sub-text data; wherein the logical structure includes at least one of the following: sequential logic, conditional logic, loop logic, function block, class;

The second sub-text vector determination unit is configured to, when the English data is a non-script document, divide the English data into at least one sub-text data according to the punctuation marks of characters in the English data, and use the doc2vec model to determine the sub-text vector corresponding to the sub-text data.

In one embodiment, the result determination module 530 includes:

A set forming unit configured to form the sub-text vectors corresponding to the text data to be detected into a sub-text vector set;

A subset dividing unit configured to divide the sub-text vector set into at least one subset according to a preset number;

The result determination unit is configured to input the at least one subset into the optimal neural network model to determine the detection result of the malicious code, wherein the detection result includes the location, type and corresponding probability of the malicious code;

The first result output unit is configured to use the detection result of the single subset as the output result of the optimal neural network model if the number of the sub-text vectors in the sub-text vector set constitutes a single subset;

The second result output unit is configured to determine the output result of the optimal neural network model based on the comprehensive detection results of the at least two subsets if the number of the sub-text vectors in the sub-text vector set constitutes at least two subsets.

In one embodiment, the subset dividing unit includes:

The first subset dividing subunit is configured to, if the number of the sub-text vectors in the sub-text vector set exceeds the preset number, select the preset number of sub-text vectors from the sub-text vectors and divide them into one of the subsets, until the number of remaining sub-text vectors does not reach the preset number;

The second subset dividing subunit is configured to, if the number of the sub-text vectors in the sub-text vector set does not reach the preset number, copy the sub-text vectors so that the total number of the copied sub-text vectors and the sub-text vectors is equal to the preset number, and divide the copied sub-text vectors and the sub-text vectors into one of the subsets.

In one embodiment, the first result output unit includes:

The first result determination subunit is configured to determine that the detection result of the malicious code is that there is no malicious code in the text data to be detected when the probability corresponding to the type of the single subset is lower than the preset probability;

The second result determination subunit is configured to, when the probability corresponding to the type of the single subset exceeds the preset probability, determine that the detection result of the malicious code is that the text data to be detected contains malicious code, and output the corresponding location, category and corresponding probability of the malicious code in the single subset.

In one embodiment, the second result output unit includes:

The third result output subunit is configured to determine that the detection result of the malicious code is that there is no malicious code in the text data to be detected when the probabilities corresponding to the types of the at least two subsets are lower than the preset probability;

The fourth result output subunit is configured to, when the probabilities corresponding to the types of the at least two subsets exceed the preset probability, determine that the detection result of the malicious code is that the text data to be detected contains malicious code, and merge the content data corresponding to each subset according to the original order of each subset to determine the corresponding location, category and corresponding probability of the malicious code in the at least two subsets.

In one embodiment, the training of the optimal neural network model includes:

Extract text data from a text data training set containing malicious code, wherein the text data in the text data training set contains the location and type of the malicious code;

Extract content data from the text data, and filter out non-English data in the content data to obtain English data;

According to the location and type of the malicious code in the text data training set, correspondingly generate labels for the location, type and corresponding probability of the malicious code in the sub-text data to form the text data training set;

Iteratively train and update the parameters and weight values of the neural network model based on the text data training set; when the loss function reaches its minimum, output the optimal neural network model; otherwise, adjust the parameters and weight values of the neural network model and repeat the above iterative training process.

The malicious code detection device provided by the embodiments of this application can execute the malicious code detection method provided by any embodiment of this application, and has functional modules and beneficial effects corresponding to the execution method.

In an embodiment, FIG. 6 shows a schematic structural diagram of an electronic device that can be used to implement embodiments of the present application. Electronic device 10 is intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices (eg, helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions are examples only.

As shown in Figure 6, the electronic device 10 includes at least one processor 11, and a memory communicatively connected to the at least one processor 11, such as a read-only memory (Read Only Memory, ROM) 12, a random access memory (Random Access Memory, RAM) 13, etc. The memory stores a computer program that can be executed by the at least one processor, and the processor 11 can perform various appropriate actions and processes according to the computer program stored in the ROM 12 or loaded from the storage unit 18 into the RAM 13. The RAM 13 can also store various programs and data required for the operation of the electronic device 10. The processor 11, the ROM 12 and the RAM 13 are connected to each other via the bus 14. An input/output (I/O) interface 15 is also connected to the bus 14.

Multiple components in the electronic device 10 are connected to the I/O interface 15, including: an input unit 16, such as a keyboard, a mouse, etc.; an output unit 17, such as various types of displays, speakers, etc.; a storage unit 18, such as a magnetic disk, an optical disk, etc.; and a communication unit 19, such as a network card, modem, wireless communication transceiver, etc. The communication unit 19 allows the electronic device 10 to exchange information/data with other devices through computer networks such as the Internet and/or various telecommunications networks.

Processor 11 may be any of a variety of general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the processor 11 include a central processing unit (Central Processing Unit, CPU), a graphics processing unit (Graphics Processing Unit, GPU), various dedicated artificial intelligence (Artificial Intelligence, AI) computing chips, various processors that run machine learning model algorithms, a digital signal processor (Digital Signal Processor, DSP), and any appropriate processor, controller, microcontroller, etc. The processor 11 performs the various methods and processes described above, such as the detection method of malicious code.

In some embodiments, the malicious code detection method may be implemented as a computer program, which is tangibly included in a computer-readable storage medium, such as the storage unit 18 . In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM 12 and/or the communication unit 19. When the computer program is loaded into RAM 13 and executed by processor 11, one or more steps of the above-described malicious code detection method may be performed. Alternatively, in other embodiments, the processor 11 may be configured to perform the malicious code detection method through other suitable means (eg, by means of firmware).

Various implementations of the systems and techniques described above may be implemented in digital electronic circuit systems, integrated circuit systems, Field-Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Parts (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor. The programmable processor, which may be a special-purpose or general-purpose programmable processor, may receive data and instructions from a storage system, at least one input device, and at least one output device, and transmit data and instructions to the storage system, the at least one input device, and the at least one output device.

Computer programs for implementing the methods of the present application may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus, such that the computer program, when executed by the processor, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. A computer program may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine, or entirely on the remote machine or server.

In the context of this application, a computer-readable storage medium may be a tangible medium that may contain or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. Computer-readable storage media may include electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses or devices, or any suitable combination of the foregoing. Alternatively, the computer-readable storage medium may be a machine-readable signal medium. Examples of machine-readable storage media may include an electrical connection based on one or more wires, a portable computer disk, a hard drive, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (Erasable Programmable Read-Only Memory, EPROM) or flash memory, optical fiber, portable compact disc read-only memory (Compact Disc Read-Only Memory, CD-ROM), optical storage device, magnetic storage device, or a suitable combination of the above.

To provide interaction with a user, the systems and techniques described herein may be implemented on an electronic device having: a display device (e.g., a cathode ray tube (CRT), liquid crystal display (Liquid Crystal Display, LCD), or monitor) for displaying information to the user; and a keyboard and pointing device (e.g., a mouse or a trackball) through which the user can provide input to the electronic device. Other kinds of devices may also be used to provide interaction with the user; for example, the feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user may be received in any form (including acoustic input, voice input, or tactile input).

The systems and techniques described herein may be implemented in a computing system that includes back-end components (e.g., as a data server), or a computing system that includes middleware components (e.g., an application server), or a computing system that includes front-end components (e.g., a user's computer having a graphical user interface or web browser through which the user can interact with implementations of the systems and techniques described herein), or a computing system that includes any combination of such back-end components, middleware components, or front-end components. The components of the system may be interconnected by any form or medium of digital data communication (e.g., a communications network). Examples of communication networks include: Local Area Network (LAN), Wide Area Network (WAN), blockchain network, and the Internet.

Computing systems may include clients and servers. Clients and servers are generally remote from each other and typically interact over a communications network. The relationship of client and server is created by computer programs running on corresponding computers and having a client-server relationship with each other. The server can be a cloud server, also known as a cloud computing server or cloud host, which is a host product in the cloud computing service system that addresses the difficult management and weak business scalability of traditional physical hosts and VPS services.

It should be understood that various forms of the process shown above may be used, with steps reordered, added or deleted. For example, each step described in this application can be executed in parallel, sequentially, or in a different order, as long as the desired results of the technical solution of this application can be achieved.

Claims

A method for detecting malicious code, including:

Get the text data to be detected;

Divide the text data to be detected into at least one sub-text data according to semantic logic, and determine the sub-text vector corresponding to the sub-text data;

The sub-text vector is input into an optimal neural network model to determine the detection result of the malicious code; wherein the optimal neural network model is generated based on a text data training set containing malicious code.
The method according to claim 1, wherein dividing the text data to be detected into at least one sub-text data according to semantic logic and determining the sub-text vector corresponding to the sub-text data includes:

Extract content data in the text data to be detected, wherein the content data includes at least one of the following: text type, text information, the row and/or column where the text information is located, and the number of words corresponding to the text information;

Filter non-English data in the content data to obtain English data in the content data;

When the English data is a code, the English data is divided into at least one sub-text data according to the logical structure of the code, and the doc2vec model is used to determine the sub-text vector corresponding to the sub-text data; wherein the logical structure includes at least one of the following: sequential logic, conditional logic, loop logic, function block, and class;

When the English data is a non-script document, the English data is divided into at least one sub-text data according to the punctuation marks of the characters in the English data, and the doc2vec model is used to determine the sub-text vector corresponding to the sub-text data.
The method according to claim 1, wherein said inputting the sub-text vector into an optimal neural network model to determine the detection result of malicious code includes:

Construct the sub-text vectors corresponding to the text data to be detected into a sub-text vector set;

Divide the sub-text vector set into at least one subset according to a preset number;

Input the at least one subset into the optimal neural network model to determine the detection results of the malicious code, wherein the detection results include the location, type and corresponding probability of the malicious code;

If the number of the sub-text vectors in the sub-text vector set constitutes a single subset, then the detection result of the single subset is used as the output result of the optimal neural network model;

If the number of the sub-text vectors in the sub-text vector set constitutes at least two subsets, the output result of the optimal neural network model is determined based on the comprehensive detection results of the at least two subsets.
The method according to claim 3, wherein dividing the sub-text vector set into at least one subset according to a preset number includes:

If the number of the sub-text vectors in the sub-text vector set exceeds the preset number, a corresponding number of the sub-text vectors are selected from the sub-text vectors according to the preset number and divided into one of the subsets, until the number of remaining sub-text vectors does not reach the preset number;

If the number of sub-text vectors in the sub-text vector set does not reach the preset number, the sub-text vectors are copied so that the total number of the copied sub-text vectors and the sub-text vectors is equal to the preset number, and the copied sub-text vectors and the sub-text vectors are divided into one of the subsets.
The method according to claim 3, wherein using the detection result of the single subset as the output result of the optimal neural network model includes:

When the probability corresponding to the type of the single subset is lower than the preset probability, it is determined that the detection result of the malicious code is that there is no malicious code in the text data to be detected;

When the probability corresponding to the type of the single subset exceeds the preset probability, it is determined that the detection result of the malicious code is that the text data to be detected contains malicious code, and the corresponding location, category and corresponding probability of the malicious code in the single subset are output.
The method according to claim 3, wherein determining the output result of the optimal neural network model based on the comprehensive detection results of the at least two subsets includes:

When the probabilities corresponding to the types of the at least two subsets are lower than the preset probability, it is determined that the detection result of the malicious code is that there is no malicious code in the text data to be detected;

When the probabilities corresponding to the types of the at least two subsets exceed the preset probability, it is determined that the detection result of the malicious code is that the text data to be detected contains malicious code, and the content data corresponding to each subset is merged according to the original order of the subsets to determine the corresponding location, category and corresponding probability of the malicious code in the at least two subsets.
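A minimal sketch of how per-subset results might be combined is given below. The `SubsetDetection` type, the function `merge_subset_results`, and the category strings are all hypothetical. The sketch assumes each subset yields one detection whose location is an index within that subset, and that the location in the merged content is `subset_index * preset_number + location`; the claim itself only states that content is merged in the original subset order. A probability equal to the preset probability is treated as not exceeding it, which is also an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubsetDetection:
    location: int       # index of the flagged sub-text within its subset
    category: str       # predicted malicious code type (hypothetical labels)
    probability: float  # probability attached to that type


def merge_subset_results(results, preset_number, preset_probability):
    """Combine per-subset detections in original subset order.

    Returns an empty list when no subset exceeds the preset probability,
    which corresponds to "no malicious code in the text data".
    """
    merged = []
    for subset_index, detection in enumerate(results):
        if detection.probability > preset_probability:
            # Map the subset-local index back to the merged content (assumed rule).
            global_location = subset_index * preset_number + detection.location
            merged.append(
                SubsetDetection(global_location, detection.category, detection.probability)
            )
    return merged


results = [
    SubsetDetection(1, "backdoor", 0.9),
    SubsetDetection(0, "trojan", 0.2),
    SubsetDetection(2, "trojan", 0.8),
]
merged = merge_subset_results(results, preset_number=4, preset_probability=0.5)
```

With a single subset the same function reduces to the single-subset case: the list is either empty or contains that subset's detection unchanged.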
The method according to claim 1, wherein the training of the optimal neural network model includes:

Extract text data from a text data training set containing malicious code, wherein the text data in the text data training set contains the location and type of the malicious code;

Extract content data from the text data, and filter out non-English data in the content data to obtain English data;

Divide the English data into at least one sub-text data, and use the doc2vec model to determine the sub-text vector corresponding to the sub-text data;

According to the location and type of the malicious code in the text data training set, the location, type and corresponding probability label of the malicious code in the sub-text data are correspondingly generated to form the text data training set;

The neural network model is iteratively trained on the text data training set, with its parameters and weight values updated in each iteration, until the loss function reaches its minimum, at which point the optimal neural network model is output; otherwise, the parameters and weight values of the neural network model are adjusted and the above iterative training process is repeated.
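The "train until the loss reaches its minimum, otherwise adjust and repeat" loop can be illustrated with a deliberately small stand-in. The sketch below fits a one-variable linear model by gradient descent; it is not the neural network of the claims, and the function name, learning rate, tolerance and stopping rule (stop when the loss no longer improves by more than a tolerance) are all assumptions chosen for illustration.

```python
def train_until_loss_plateaus(samples, learning_rate=0.1, max_epochs=1000, tolerance=1e-9):
    """Fit y ~ w * x + b by gradient descent on mean squared error.

    Toy stand-in for the iterative training step: returns the best
    (loss, w, b) seen once the loss stops improving by more than tolerance.
    """
    w, b = 0.0, 0.0
    best = (float("inf"), w, b)
    n = len(samples)
    for _ in range(max_epochs):
        loss = sum((w * x + b - y) ** 2 for x, y in samples) / n
        if best[0] - loss < tolerance:
            # Loss no longer decreasing meaningfully: treat as the minimum.
            break
        best = (loss, w, b)
        # Otherwise adjust the parameters and repeat the iteration.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in samples) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in samples) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return best


# Points generated from y = 2x + 1, so the fit should approach w = 2, b = 1.
final_loss, w, b = train_until_loss_plateaus([(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)])
```

In practice the same control flow would wrap a neural network's forward pass, loss computation and weight update, typically with a validation set deciding when to stop.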
A malicious code detection device, including:

A data acquisition module, configured to obtain the text data to be detected;

A sub-text vector determination module, configured to divide the text data to be detected into at least one sub-text data according to semantic logic, and determine the sub-text vector corresponding to the sub-text data;

A result determination module, configured to input the sub-text vector into an optimal neural network model to determine the detection result of the malicious code; wherein the optimal neural network model is generated based on a text data training set containing malicious code.
An electronic device, the electronic device includes:

at least one processor; and

a memory communicatively connected to the at least one processor; wherein,

The memory stores a computer program executable by the at least one processor, and the computer program is executed by the at least one processor so that the at least one processor can execute the malicious code detection method according to any one of claims 1-7.
A computer-readable storage medium storing computer instructions, wherein the computer instructions, when executed by a processor, implement the malicious code detection method according to any one of claims 1-7.