WO2024051196A1 - Malicious code detection method and apparatus, electronic device, and storage medium - Google Patents

Malicious code detection method and apparatus, electronic device, and storage medium

Info

Publication number
WO2024051196A1
Authority
WO
WIPO (PCT)
Prior art keywords
sub
text
data
malicious code
text data
Prior art date
Application number
PCT/CN2023/093383
Other languages
English (en)
Chinese (zh)
Inventor
徐莉莎
Original Assignee
上海派拉软件股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 上海派拉软件股份有限公司
Publication of WO2024051196A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/562 Static detection
    • G06F21/563 Static detection by source code analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/40 Transformation of program code
    • G06F8/41 Compilation
    • G06F8/43 Checking; Contextual analysis
    • G06F8/436 Semantic checking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Definitions

  • This application relates to the field of computer security technology, for example, to a malicious code detection method, device, electronic equipment and storage medium.
  • In daily life, hackers attack in various forms. Some attack methods have obvious characteristics, while others have more subtle characteristics.
  • Hackers often attack by injecting malicious code into part of a file. When the file is viewed or a command is executed, the malicious code runs, inserting viruses or Trojans, leaving backdoors, and performing other dangerous behaviors. Once the device is connected to the Internet, it is easily hacked and hijacked, causing significant damage.
  • One method for detecting malicious code in files is to search the file for malicious code through keyword matching. This method often fails to detect unknown malicious code correctly. Another method is to determine whether the data is malicious code through machine learning or deep learning classification. However, this method cannot confirm the location and the type of the malicious code at the same time.
  • This application provides a method, device, equipment and medium for detecting malicious codes, which can simultaneously confirm the probability, location and type corresponding to the malicious codes, and improve the accuracy of detecting malicious codes.
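  • an embodiment of the present application provides a method for detecting malicious code, which method includes:
  • obtaining the text data to be detected;
  • dividing the text data to be detected into at least one sub-text data according to semantic logic, and determining the sub-text vector corresponding to the sub-text data;
  • inputting the sub-text vector into an optimal neural network model to determine the detection result of the malicious code; wherein the optimal neural network model is generated based on a text data training set containing malicious code.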
  • an embodiment of the present application also provides a malicious code detection device, which includes:
  • a data acquisition module configured to obtain the text data to be detected;
  • a sub-text vector determination module configured to divide the text data to be detected into at least one sub-text data according to semantic logic, and determine the sub-text vector corresponding to the sub-text data;
  • the result determination module is configured to input the sub-text vector into an optimal neural network model to determine the detection result of the malicious code; wherein the optimal neural network model is generated based on a text data training set containing malicious code.
  • an embodiment of the present application further provides an electronic device, where the electronic device includes:
  • the memory stores a computer program that can be executed by the at least one processor, and the computer program is executed by the at least one processor, so that the at least one processor can execute the malicious code detection method described in any embodiment of the present application.
  • embodiments of the present application also provide a computer-readable storage medium that stores computer instructions, and the computer instructions, when executed, enable a processor to implement the malicious code detection method according to any embodiment of the present application.
  • the technical solution of the embodiment of the present application obtains the text data to be detected, divides the text data to be detected into at least one sub-text data according to semantic logic, determines the sub-text vector corresponding to the sub-text data, and inputs the sub-text vector into the optimal neural network model to determine the detection result of the malicious code, where the optimal neural network model is generated based on a text data training set containing malicious code.
  • dividing the acquired text data to be detected into at least one sub-text data through semantic logic and determining the corresponding sub-text vectors forms sub-text features and facilitates subsequent data processing by the optimal neural network model; determining the detection result of malicious code from the sub-text vectors with the optimal neural network model confirms the relevant detection information corresponding to the malicious code and improves the accuracy of malicious code detection.
  • Figure 1 is a flow chart of a malicious code detection method provided by an embodiment of the present application.
  • Figure 2 is a flow chart of another malicious code detection method provided by an embodiment of the present application.
  • Figure 3 is a flow chart of a malicious code training process provided by an embodiment of the present application.
  • Figure 4 is a flow chart of a malicious code detection process provided by an embodiment of the present application.
  • Figure 5 is a structural block diagram of a malicious code detection device provided by an embodiment of the present application.
  • FIG. 6 shows a schematic structural diagram of an electronic device that can be used to implement embodiments of the present application.
  • Figure 1 is a flow chart of a method for detecting malicious code provided by an embodiment of the present application. This embodiment can be applied to situations when malicious code is detected during the transmission of various files.
  • the method may be executed by a malicious code detection device, which may be implemented in the form of hardware and/or software, and may be configured in an electronic device. As shown in Figure 1, the method includes:
  • the text data to be detected refers to relevant text data that may contain malicious code waiting to be detected.
  • the text data to be detected can also be called character data, which is the text data in various types of files. For example, it can be the text data in doc files, pdf files, and txt files. It can include English characters, Chinese characters, numbers, and other input characters.
  • the file type and content information of each type of text data to be detected can be obtained, along with the page row where each piece of content information is located, so that the obtained text data can be processed accordingly.
  • semantic logic can be understood as the logical relationship between sentences or the character symbols between sentences on the page.
  • Logical relationships can be, for example, juxtaposition, succession, transition, result and cause, purpose, concession, etc.
  • sub-text data can be understood as the data of each module obtained by dividing the text data to be detected into modules through semantic logic. It should be noted that modularization divides the text data to be detected into at least one module, and each module corresponds to one sub-text data. The text data being divided can be a code file or a non-script file.
  • the sub-text vector can be understood as the sub-text vector corresponding to each module obtained by vectorizing the sub-text data of each module using the doc2vec model or the word2vec model.
  • the content data in the text data to be detected can be extracted, and the content data can be filtered to obtain English data.
  • the English data can be divided according to its logical content, and the doc2vec model or word2vec model can be used to vectorize the sub-text data corresponding to each module after the division. Alternatively, the function call graph corresponding to the text data to be detected can be determined, and the feature vector of the corresponding call sequence can be obtained for vectorization.
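  • The vectorization step above can be illustrated with a dependency-free sketch. The text names doc2vec or word2vec; the code below deliberately swaps in a much simpler hashed bag-of-tokens vector so that it runs without a trained model. The function name `hashed_vector`, the dimension of 16, and the tokenization rule are all assumptions made for illustration and are not part of the described method.

```python
import hashlib
import re
from typing import List

def hashed_vector(sub_text: str, dim: int = 16) -> List[float]:
    """Map one sub-text (module) to a fixed-length vector.

    Stand-in for doc2vec/word2vec (assumption): each token is hashed into
    one of `dim` buckets and the bucket counts are L1-normalised.
    """
    vec = [0.0] * dim
    # Words, numbers, and single punctuation characters count as tokens.
    tokens = re.findall(r"[A-Za-z_]+|\d+|[^\sA-Za-z_\d]", sub_text)
    for tok in tokens:
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dim] += 1.0
    total = sum(vec)
    return [v / total for v in vec] if total else vec

# Two hypothetical modules, one vector per module.
modules = ["SELECT * FROM users WHERE id = 1 OR 1=1", "<script>alert(1)</script>"]
vectors = [hashed_vector(m) for m in modules]
```

  • Every module yields a vector of the same length regardless of its text length, which is the property the later subset-splitting step relies on.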
  • malicious code refers to all software or code that may conflict with an organization's security policy. Such code serves no useful purpose but brings certain dangers. It can be software that is installed and run on the user's computer or other terminal without explicitly prompting the user or without the user's permission, and that infringes upon the user's legitimate rights and interests; it can also be computer code that is deliberately prepared or set up and poses a threat or potential threat to the network or system.
  • the type of malicious code can be SQL injection or an XSS attack. XSS stands for cross-site scripting.
  • the detection results of the malicious code include the location of the malicious code, the type corresponding to the malicious code, the probability corresponding to the type of malicious code, and so on.
  • the sub-text vector corresponding to the sub-text data in each module after segmentation can be input into the optimal neural network model to determine the detection results of malicious code.
  • the sub-text vectors corresponding to the text data to be detected can be formed into a sub-text vector set, the sub-text vector set can be divided into at least one subset according to a preset number, and at least one subset can be input to the optimal neural network model.
  • processing can differ based on the number of subsets. If there is a single subset, the detection result of that subset is directly used as the output result of the optimal neural network model. If there are at least two subsets, the detection results are determined and output based on all of the subsets.
  • training of the optimal neural network model includes:
  • the location, type and corresponding probability label of the malicious code in the generated sub-text data are correspondingly generated to form a text data training set
  • the text data training set consists of text data containing malicious code.
  • the text data in the text data training set contains the location and type of the malicious code.
  • for example, File 1 contains malicious code; the type of the malicious code is an XSS attack, and it is located on line 6 of the page.
  • the content data may include at least one of the following: text type, text information, the row and/or column where the text information is located, and the number of words corresponding to the text information.
  • the training process of the optimal neural network model is as follows. Text data containing malicious code is extracted from the text data training set, where the text data in the training set contains the location and type of the malicious code. The content data of the text data is extracted, such as the text information in doc files, pdf files, and txt files, the lines corresponding to the text, and the number of words, to generate an original text set that preserves the page line numbers of the original text data. Non-conforming content in the content data is filtered out to obtain English data.
  • the English data is divided into at least one sub-text data, and the doc2vec model is used to determine the sub-text vector corresponding to the sub-text data.
  • labels for the location, type, and corresponding probability of the malicious code are generated for the corresponding sub-text data, and the neural network model is trained iteratively on the text data training set, updating its parameters and weight values.
  • the optimal neural network model is output, otherwise the parameters and weight values of the neural network model are adjusted and the above iterative training process is repeated.
  • the overall loss function in the training stage is the sum of the position loss and the classification loss. The neural network model is trained over multiple iterations; after multiple iterations, the accuracy is maximized and the error rate of the entire neural network is reduced, and the loss function can be used to correct the deviation between the real position and the predicted position.
  • IOU stands for Intersection over Union.
  • the original model can be fine-tuned and updated or retrained with old data.
  • the technical solution of the embodiment of the present application obtains the text data to be detected, divides the text data to be detected into at least one sub-text data according to semantic logic, determines the sub-text vector corresponding to the sub-text data, and inputs the sub-text vector into the optimal neural network model to determine the detection result of the malicious code, where the optimal neural network model is generated based on a text data training set containing malicious code.
  • dividing the acquired text data to be detected into at least one sub-text data through semantic logic and determining the corresponding sub-text vectors forms sub-text features and facilitates subsequent data processing by the optimal neural network model; determining the detection result of malicious code from the sub-text vectors with the optimal neural network model confirms the relevant detection information corresponding to the malicious code and improves the accuracy of malicious code detection.
  • Figure 2 is a flow chart of another malicious code detection method provided by an embodiment of the present application. Based on the above embodiments, this embodiment further refines the steps of dividing the text data to be detected into at least one sub-text data according to semantic logic, determining the sub-text vector corresponding to the sub-text data, and inputting the sub-text vector into the optimal neural network model to determine the detection result of the malicious code. As shown in Figure 2, the malicious code detection method in this embodiment may include the following steps:
  • the content data includes at least one of the following: text type, text information, the row and/or column where the text information is located, and the number of words corresponding to the text information.
  • the text type refers to the file type corresponding to the text data to be detected, for example, it can be a doc file, pdf file, txt file, etc.
  • Text information can be understood as text information of the text data to be detected and related attribute information of the text information.
  • the corresponding content data can be extracted from the text data to be detected to generate the original text set.
  • the original text collection holds the original content data of the text information to be detected, including the original text type of the text data to be detected, the original text information, the row and/or column of the original text information, and the number of words corresponding to the original text information.
  • English data can be understood as English characters.
  • the content data in the text data to be detected may be Chinese character data or English character data. Because malicious code is usually written in English characters, the Chinese character data is filtered out to obtain the English data corresponding to the content data.
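  • The filtering step can be sketched as follows. This is a minimal illustration that assumes "English data" can be approximated as printable ASCII characters, and that the original page line numbers must be kept so detections can later be reported by row. The names `NON_ENGLISH` and `keep_english` are hypothetical.

```python
import re

# Assumption: "English data" is approximated as printable ASCII plus tabs.
NON_ENGLISH = re.compile(r"[^\x20-\x7E\t]")

def keep_english(lines):
    """Return (line_number, filtered_text) pairs, preserving original numbering."""
    result = []
    for number, text in enumerate(lines, start=1):
        filtered = NON_ENGLISH.sub("", text).strip()
        if filtered:  # lines with no English content are dropped
            result.append((number, filtered))
    return result

page = ["用户登录说明", "name = input()", "注释 <script>alert(1)</script>"]
english = keep_english(page)
# english == [(2, "name = input()"), (3, "<script>alert(1)</script>")]
```

  • Line 1 disappears because it holds only Chinese characters, while lines 2 and 3 keep their original numbers.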
  • if the English data is code, it is divided according to the logical structure of the code.
  • the logical structure includes at least one of the following: sequential logic, conditional logic, loop logic, function block, and class.
  • the doc2vec model is a related model used to generate word vectors and paragraph vectors.
  • the doc2vec model is used for embedding vectorization for the sub-text data obtained from each modularization.
  • the fixed format of script documents in a programming language can be used to determine whether the filtered English data is code or a non-script document.
  • if the English data is code, it can be divided into modules according to the logical structure of the programming code, for example structures such as sequential logic, conditional logic, loop logic, function blocks, and classes. Each module corresponds to one sub-text data, and the number of divided modules is at least one. After the division, the doc2vec model can be used to determine the sub-text vector corresponding to the sub-text data in each module.
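  • One possible way to divide code by logical structure is sketched below. The text does not name a programming language or a parser, so this example assumes the file is Python source and uses the standard `ast` module; the function name `split_code_modules` and the rule of grouping consecutive simple statements into one "sequential logic" module are illustrative choices.

```python
import ast

def split_code_modules(source: str):
    """Split Python source into modules of (start_line, end_line, text).

    Function blocks, classes, loops, and conditionals each become one module;
    consecutive simple statements (sequential logic) are grouped together.
    """
    lines = source.splitlines()
    block_types = (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef,
                   ast.For, ast.While, ast.If, ast.With, ast.Try)
    modules = []
    prev_was_simple = False
    for node in ast.parse(source).body:
        start, end = node.lineno, node.end_lineno
        is_simple = not isinstance(node, block_types)
        if is_simple and prev_was_simple:
            # Extend the previous sequential module to cover this statement.
            first = modules[-1][0]
            modules[-1] = (first, end, "\n".join(lines[first - 1:end]))
        else:
            modules.append((start, end, "\n".join(lines[start - 1:end])))
        prev_was_simple = is_simple
    return modules

sample = "import os\nx = 1\nif x:\n    print(x)\ndef f():\n    return x\n"
parts = split_code_modules(sample)
# Line spans: (1, 2) sequential, (3, 4) conditional, (5, 6) function block
```

  • Each module keeps its original line span, so a detection inside a module can still be mapped back to rows of the page.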
  • n denotes the number of modules after modular segmentation of a file to be detected whose English data is code.
  • the sub-text data of the n modules is embedded using the doc2vec model to generate a corresponding m*k vector for each module, where m represents the number of characters contained in the sub-text data in each module, and k represents the number of vocabularies.
  • the vector of one text data to be detected can therefore be represented as an n*m*k-dimensional vector.
  • the non-script document can be a doc, pdf, txt, or other non-script document type.
  • the punctuation marks of characters in the English data can be understood as the punctuation marks of the characters in each line of the page. For example, each line of characters is split into sentences at its punctuation marks.
  • the English data is divided into modules accordingly. Each module corresponds to one sub-text data, and the number of divided modules is at least one. After the division, the doc2vec model can be used for embedding vectorization to determine the sub-text vector corresponding to the sub-text data in each module.
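  • For non-script documents, the punctuation-based division might look like the sketch below. The set of sentence-ending marks (`.`, `!`, `?`, `;`) and the names `SENTENCE_END` and `split_by_punctuation` are assumptions; the text only says that each line is split at its punctuation marks.

```python
import re

# Assumption: a sentence ends at one of these English punctuation marks.
SENTENCE_END = re.compile(r"[^.!?;]*[.!?;]|[^.!?;]+$")

def split_by_punctuation(numbered_lines):
    """Split each (line_number, text) pair into per-sentence modules."""
    modules = []
    for number, text in numbered_lines:
        for match in SENTENCE_END.finditer(text):
            sentence = match.group().strip()
            if sentence:
                modules.append((number, sentence))
    return modules

doc = [(1, "Open the file. Run setup!"), (4, "DROP TABLE users; done")]
sentences = split_by_punctuation(doc)
# sentences == [(1, "Open the file."), (1, "Run setup!"),
#               (4, "DROP TABLE users;"), (4, "done")]
```

  • Each sentence carries the page row it came from, which is what later allows a detection to be reported by row.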
  • a corresponding sub-text vector set can be formed. It should be noted that the sub-text vector set may contain one or more sub-text vectors.
  • the preset number can be understood as the fixed number of sub-text vectors corresponding to the pre-set modules.
  • the corresponding settings can be made through experience, or they can be set manually.
  • the subset contains subtext vectors corresponding to one or more subtext data.
  • the vector set is divided accordingly according to a fixed number to obtain the vectors corresponding to the subsets.
  • the sub-text data of the n modules is embedded using the doc2vec model to generate corresponding m*k vectors.
  • the vectors are divided according to the fixed number r, and the vector corresponding to each subset is r*m*k, where m represents the number of characters contained in the sub-text data in each module, and k represents the number of vocabularies. It should be noted that a final segment that does not reach the fixed number r during segmentation needs to copy its own sub-text data until the fixed number r is reached.
  • the location of the malicious code may be the position in the original text data to be detected, or it may not be the position in the original text data to be detected.
  • the sub-text vector set is divided into at least one subset according to a preset number, including:
  • sub-text vectors are selected from the sub-text vector set in groups of the preset number, and each group forms one subset, until the number of remaining sub-text vectors does not reach the preset number;
  • the remaining sub-text vectors are copied so that the copied sub-text vectors plus the remaining sub-text vectors total the preset number, and the copied sub-text vectors and the remaining sub-text vectors together form one subset.
  • if only 1 sub-text vector remains, that sub-text vector is copied to expand to the preset number. If at least 2 sub-text vectors remain, either one of them or several of them can be copied to expand to the preset number.
  • for example, suppose the preset number, that is, the standardized number, is 5. If the text data to be detected yields 15 sub-text vectors, they are divided into 3 subsets according to the standardized number 5. If it yields 17 sub-text vectors, they are divided into 4 subsets, and the 4th subset holds only 2 sub-text vectors. The 4th subset then needs to be expanded to the standardized number 5, which can be done by copying the 2 sub-text vectors it contains.
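  • The splitting and padding rule in the example above can be sketched as follows. The function name `split_into_subsets` is hypothetical, and cycling through the leftover vectors in order is only one of the copying strategies the text allows (it also permits copying a single vector repeatedly).

```python
from itertools import cycle, islice

def split_into_subsets(vectors, r):
    """Group vectors into subsets of exactly r items.

    A short final subset is padded by repeating its own items in order.
    """
    if r <= 0:
        raise ValueError("r must be positive")
    subsets = [list(vectors[i:i + r]) for i in range(0, len(vectors), r)]
    if subsets and len(subsets[-1]) < r:
        subsets[-1] = list(islice(cycle(subsets[-1]), r))
    return subsets

# Integers stand in for sub-text vectors to keep the example readable.
subsets_15 = split_into_subsets(list(range(15)), 5)   # 3 full subsets
subsets_17 = split_into_subsets(list(range(17)), 5)   # 4 subsets, last padded
# subsets_17[-1] == [15, 16, 15, 16, 15]
```

  • After padding, every subset has the same r*m*k shape, so all subsets can be fed to the same model input.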
  • the at least one input subset is detected by the optimal neural network model to determine the relevant detection results of the malicious code in the text data to be detected. If malicious code is contained, the corresponding output contains the location, type, and corresponding probability of the malicious code. It should be noted that processing can differ based on the number of subsets, to obtain the detection results of the malicious code for a single subset and for at least two subsets respectively.
  • if there is a single subset, the detection result of the single subset is used as the output result of the optimal neural network model.
  • in this case, the detection result of the single subset is directly used as the output result of the optimal neural network model. For example, if the probability corresponding to the type of the single subset is lower than the preset probability, the detection result of the malicious code is determined to be that there is no malicious code in the text data to be detected; if the probability corresponding to the type of the single subset exceeds the preset probability, the detection result is determined to be that there is malicious code in the text data to be detected, and the corresponding location, category, and probability of the malicious code in the single subset are directly output.
  • the detection results of a single subset are used as the output results of the optimal neural network model, including:
  • if the probability corresponding to the type of the single subset is lower than the preset probability, determining that the detection result of the malicious code is that there is no malicious code in the text data to be detected;
  • if the probability corresponding to the type of the single subset exceeds the preset probability, determining that the detection result of the malicious code is that the text data to be detected contains malicious code, and outputting the corresponding location, category, and probability of the malicious code in the single subset.
  • in this embodiment, when the probability corresponding to the type of the single subset is lower than the preset probability, it is determined that there is no malicious code in the text data to be detected; when the probability exceeds the preset probability, it is determined that the text data to be detected contains malicious code, and the corresponding location, category, and probability of the malicious code in the single subset are output.
  • for example, if the preset probability is 5% and the probability corresponding to the type of the single subset exceeds 5%, it is determined that the text data to be detected contains malicious code, and the optimal neural network model outputs the current location, category, and corresponding probability of the malicious code, for example the row of the page, the category of the malicious code, and the probability corresponding to the current malicious code.
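  • The single-subset decision rule can be written as a short sketch. The `Detection` record and the function `single_subset_result` are hypothetical names; the 5% threshold is taken from the example in the text, and the sample probabilities are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Detection:
    start_line: int      # first page row of the suspected malicious code
    end_line: int        # last page row of the suspected malicious code
    category: str        # e.g. "xss" or "sql_injection"
    probability: float   # probability corresponding to the category

def single_subset_result(detection: Detection, threshold: float) -> Optional[Detection]:
    """Return the detection if its probability exceeds the threshold, else None."""
    return detection if detection.probability > threshold else None

# Illustrative values only; None means "no malicious code in the text data".
hit = single_subset_result(Detection(6, 6, "xss", 0.91), threshold=0.05)
miss = single_subset_result(Detection(3, 3, "sql_injection", 0.01), threshold=0.05)
```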
  • if there are at least two subsets, the output result of the optimal neural network model is determined based on the combined detection results of the at least two subsets. For example, if the probabilities corresponding to the types of the at least two subsets are lower than the preset probability, it is determined that there is no malicious code in the text data to be detected; if the probabilities corresponding to the types of the at least two subsets exceed the preset probability, it is determined that there is malicious code in the text data to be detected, and the content data corresponding to each subset must be merged according to the original order of the subsets to determine the corresponding locations, categories, and probabilities of the malicious code in the at least two subsets.
  • the output result of the optimal neural network model is determined based on the comprehensive detection results of at least two subsets, including:
  • if the probabilities corresponding to the types of the at least two subsets are lower than the preset probability, determining that the detection result of the malicious code is that there is no malicious code in the text data to be detected;
  • if the probabilities corresponding to the types of the at least two subsets exceed the preset probability, determining that the detection result of the malicious code is that the text data to be detected contains malicious code, and merging the content data corresponding to each subset according to the original order of the subsets to determine the corresponding locations, categories, and probabilities of the malicious code in the at least two subsets.
  • in this embodiment, when the probabilities corresponding to the types of the at least two subsets are lower than the preset probability, it is determined that there is no malicious code in the text data to be detected; when the probabilities exceed the preset probability, it is determined that the text data to be detected contains malicious code, and the content data corresponding to each subset is merged according to the original order of the subsets to determine the corresponding location, category, and probability of the malicious code in the at least two subsets.
  • after the data is split into subsets, the location of the malicious code is split accordingly. For example, if the malicious code is in lines 7-10, then after splitting by a fixed number of 8 the malicious code is in lines 7 and 8 of the first subset and in lines 1 and 2 of the second subset. The content data corresponding to each subset must be merged in order according to the original row numbers of each subset to determine the location, type, and corresponding probability of the malicious code contained in the original data.
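  • The merge step can be sketched using the example of lines 7-10 split by a fixed number of 8. The function `merge_subset_detections` is hypothetical. It assumes each subset covers r consecutive original rows and that adjacent spans of the same category belong together. Taking the maximum probability when joining spans is an assumption, since the text does not say how probabilities are combined.

```python
def merge_subset_detections(per_subset, r):
    """Map subset-local rows back to original rows and join adjacent spans.

    per_subset: list indexed by subset order; each entry is a list of
    (local_start, local_end, category, probability) with 1-based rows.
    """
    spans = []
    for index, detections in enumerate(per_subset):
        offset = index * r  # rows covered by the preceding subsets
        for start, end, category, prob in detections:
            spans.append((start + offset, end + offset, category, prob))
    spans.sort()
    merged = []
    for start, end, category, prob in spans:
        if merged and merged[-1][2] == category and start <= merged[-1][1] + 1:
            last = merged[-1]
            # Assumption: keep the highest probability of the joined spans.
            merged[-1] = (last[0], max(last[1], end), category, max(last[3], prob))
        else:
            merged.append((start, end, category, prob))
    return merged

result = merge_subset_detections([[(7, 8, "xss", 0.90)], [(1, 2, "xss", 0.80)]], r=8)
# result == [(7, 10, "xss", 0.9)]
```

  • Rows 1-2 of the second subset become original rows 9-10, which are adjacent to rows 7-8 of the first subset, so the two spans collapse into the original range 7-10.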
  • the above technical solution of the embodiment of the present application extracts the content data in the text data to be detected and filters out the non-English data in the content data to obtain the English data. If the English data is code, the English data is divided into at least one sub-text data according to the logical structure of the code, and the doc2vec model is used to determine the sub-text vector corresponding to the sub-text data.
  • dividing the English data into at least one sub-text data and using the doc2vec model to determine the corresponding sub-text vectors forms sub-text features that facilitate subsequent data processing by the optimal neural network model. The sub-text vectors are divided into at least one subset and input into the optimal neural network model to determine the detection result of the malicious code, where the detection result includes the location, type, and corresponding probability of the malicious code. If there is a single subset, the detection result of the single subset is used as the output result of the optimal neural network model.
  • if there are at least two subsets, the output result of the optimal neural network model is determined based on the combined detection results of the subsets. This confirms the probability, location, and type of the malicious code simultaneously and improves the accuracy of malicious code detection.
  • Figure 3 is a flow chart of a malicious code training process provided by an embodiment of the present application.
  • Figure 4 is a flow chart of a malicious code detection process provided by an embodiment of the present application.
  • the text data represents the text data to be detected in the above embodiment
  • the non-English characters represent the non-English data in the above embodiment
  • the subsample represents the subset in the above embodiment
  • the preset module number represents the preset number in the above embodiment.
  • data collection is performed to determine training data.
  • S320 Extract the text content data in the text data, generate an original text set, and maintain the original number of lines.
  • the text data is divided into grids based on semantic logic, and the number of modules is recorded as n. If the text data containing English characters is a code file, it is divided according to structures such as sequential logic, conditional logic, loop logic, function blocks, and classes, and the number of divided modules is recorded as n; if the text data containing English characters is a non-script document such as doc, pdf, or txt, it is divided according to inline statements to generate n modules.
  • the modular division of English characters in this embodiment can be understood as the division of text data to be detected into at least one sub-text data according to semantic logic in the above embodiments.
  • a module is represented as a subtext data.
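The division of a code file by logical structure can be sketched for one concrete case. The example below is a minimal illustration that assumes the file to be checked is Python source and uses the standard `ast` module; the function name `split_into_modules`, the `BLOCK_TYPES` mapping and the choice to group consecutive simple statements into one "sequential" module are assumptions made for illustration, not details taken from the application.

```python
import ast

# Node types treated as separate modules; this mapping is an illustrative assumption.
BLOCK_TYPES = {
    ast.FunctionDef: "function",
    ast.AsyncFunctionDef: "function",
    ast.ClassDef: "class",
    ast.If: "conditional",
    ast.For: "loop",
    ast.While: "loop",
}


def _make(kind, start, end, lines):
    """Build one module record, keeping the original 1-based line numbers."""
    return {"kind": kind, "start": start, "end": end,
            "text": "\n".join(lines[start - 1:end])}


def split_into_modules(source: str) -> list[dict]:
    """Split Python source into top-level modules by logical structure."""
    lines = source.splitlines()
    modules = []
    run_start = run_end = None  # current run of plain sequential statements

    def flush_run():
        nonlocal run_start, run_end
        if run_start is not None:
            modules.append(_make("sequential", run_start, run_end, lines))
            run_start = run_end = None

    for node in ast.parse(source).body:
        kind = BLOCK_TYPES.get(type(node))
        if kind is None:
            # Consecutive simple statements are grouped into one sequential module.
            if run_start is None:
                run_start = node.lineno
            run_end = node.end_lineno
        else:
            flush_run()
            modules.append(_make(kind, node.lineno, node.end_lineno, lines))
    flush_run()
    return modules
```

Each returned record is one module (one sub-text data), and `len(split_into_modules(src))` plays the role of n. Keeping `start` and `end` lets a later detection result be mapped back to the original line numbers.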
  • the vector of a file is represented as an n*m*k dimensional vector.
  • the module vectorization in this embodiment can be understood as determining the sub-text vector corresponding to the sub-text data in the above-mentioned embodiments.
  • each file vector is split by the fixed number of modules r to obtain sub-sample vectors, so that each sub-sample vector is an r*m*k dimensional vector; a trailing segment containing fewer than r modules copies its own data until it contains r modules.
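The fixed-size split with padding of the trailing segment can be sketched as follows. This is a minimal illustration: the function name `split_into_subsamples` is hypothetical, and "copies its own data" is interpreted here as repeating the trailing modules cyclically, which is an assumption since the text does not say in which order the copies are taken.

```python
from itertools import cycle, islice


def split_into_subsamples(modules: list, r: int) -> list[list]:
    """Group module vectors into sub-samples of exactly r modules each."""
    if r <= 0:
        raise ValueError("r must be positive")
    subsamples = []
    for start in range(0, len(modules), r):
        chunk = modules[start:start + r]
        if len(chunk) < r:
            # Trailing chunk is short: repeat its own modules until it has r entries.
            chunk = list(islice(cycle(chunk), r))
        subsamples.append(chunk)
    return subsamples
```

With five modules and r = 3, the second sub-sample holds modules 4 and 5 followed by a copy of module 4, so every sub-sample fed to the model has the same shape.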
  • an end-to-end neural network model is constructed, the number of network layers is defined, and the loss function and optimizer are determined.
  • the overall loss function in the training phase is the sum of the position loss and the classification loss.
  • the position loss is based on GIOU (generalized intersection over union), an IOU-derived measure.
  • loss(classify) is the cross-entropy loss function.
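As an illustration of how a position loss and a classification loss can be summed, the sketch below uses the standard GIoU definition applied to one-dimensional line intervals and takes the position loss as 1 - GIOU. Both choices are assumptions: the formula in the text is not fully legible, and the function names `giou_1d`, `cross_entropy` and `total_loss` are hypothetical.

```python
import math


def giou_1d(pred, target):
    """Generalized IoU of two intervals (start, end), each with end > start."""
    (p0, p1), (t0, t1) = pred, target
    inter = max(0.0, min(p1, t1) - max(p0, t0))
    union = (p1 - p0) + (t1 - t0) - inter
    hull = max(p1, t1) - min(p0, t0)  # smallest interval enclosing both
    return inter / union - (hull - union) / hull


def cross_entropy(probs, true_index):
    """Cross-entropy for one sample given predicted class probabilities."""
    return -math.log(probs[true_index])


def total_loss(pred_span, true_span, probs, true_index):
    """Sum of the position loss (assumed to be 1 - GIOU) and the classification loss."""
    position_loss = 1.0 - giou_1d(pred_span, true_span)
    return position_loss + cross_entropy(probs, true_index)
```

GIOU equals 1 for a perfect match and becomes negative when the predicted and true intervals do not overlap, so the position loss still gives a useful signal for predictions that miss the target entirely.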
  • some parameters of the model are adjusted continually during the training process, and the weights and parameters corresponding to the number of network layers in the optimal neural network model are determined.
  • the network has 3 layers
  • the second layer has 3 neurons
  • the text data is partitioned into modules based on semantic logic, and the number of modules is recorded as n: if the text data containing English characters is a code file, it is divided according to structures such as sequential logic, conditional logic, loop logic, function blocks and classes, and the number of resulting modules is recorded as n; if the text data containing English characters is a non-script document such as doc, pdf or txt, it is divided by in-line statements to generate n modules.
  • S450 Split according to the preset number of modules to generate subsamples.
  • each file vector is split by the fixed number of modules r to obtain sub-sample vectors, so that each sub-sample vector is an r*m*k dimensional vector; a trailing segment containing fewer than r modules copies its own data until it contains r modules.
  • FIG. 5 is a structural block diagram of a malicious code detection device provided by an embodiment of the present application.
  • the device is suitable for detecting malicious codes during the transmission of various files.
  • the device can be implemented in hardware and/or software, and can be configured in an electronic device to implement the malicious code detection method of the embodiments of the present application.
  • the device includes: a data acquisition module 510, a sub-text vector determination module 520 and a result determination module 530.
  • the data acquisition module 510 is configured to acquire the text data to be detected
  • the sub-text vector determination module 520 is configured to divide the text data to be detected into at least one sub-text data according to semantic logic, and determine the sub-text vector corresponding to the sub-text data;
  • the result determination module 530 is configured to input the sub-text vector into an optimal neural network model to determine the detection result of the malicious code; wherein the optimal neural network model is generated based on a text data training set containing malicious code.
  • the sub-text vector determination module divides the acquired text data to be detected into at least one sub-text data according to semantic logic and determines the corresponding sub-text vectors, which forms sample features and facilitates subsequent data processing by the optimal neural network model; the result determination module uses the optimal neural network model to determine the detection result of malicious code from the sub-text vectors, which confirms the detection information corresponding to the malicious code and improves the accuracy of malicious code detection.
  • the sub-text vector determination module 520 includes:
  • a data extraction unit configured to extract content data in the text data to be detected, wherein the content data includes at least one of the following: text type, text information, the row and/or column where the text information is located, and the number of words corresponding to the text information;
  • a data filtering unit configured to filter non-English data in the content data to obtain English data in the content data
  • the first sub-text vector determination unit is configured to, when the English data is a code, divide the English data into at least one sub-text data according to the logical structure of the code, and use the doc2vec model to determine the sub-text data Corresponding sub-text vector; wherein, the logical structure includes at least one of the following: sequential logic, conditional logic, loop logic, function block, class;
  • the second sub-text vector determination unit is configured to, when the English data is a non-script document, divide the English data into at least one sub-text data according to the punctuation marks in the English data, and use the doc2vec model to determine the sub-text vector corresponding to the sub-text data.
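The filtering and punctuation-based division performed by these units can be sketched as follows. This is a simplified illustration: treating "English data" as printable ASCII, splitting only at `.`, `!` and `?`, and never joining sentences across line breaks are all assumptions, and the function names `keep_english` and `split_sentences` are hypothetical.

```python
import re


def keep_english(text: str) -> str:
    """Drop characters outside printable ASCII, keeping newlines so line numbers survive."""
    return re.sub(r"[^\x20-\x7E\n]", "", text)


def split_sentences(text: str) -> list[dict]:
    """Split prose into sentence-level sub-texts and record the source line of each."""
    pieces = []
    for line_no, line in enumerate(text.split("\n"), start=1):
        # A sentence is a run of non-terminator characters plus its closing punctuation.
        for match in re.finditer(r"[^.!?]+[.!?]*", line):
            sentence = match.group().strip()
            if sentence:
                pieces.append({"line": line_no, "text": sentence})
    return pieces
```

Each returned record corresponds to one sub-text data; its text would then be passed to a doc2vec model to obtain the sub-text vector, a step that is not shown here.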
  • the result determination module 530 includes:
  • a set forming unit configured to form the sub-text vectors corresponding to the text data to be detected into a sub-text vector set;
  • a subset dividing unit configured to divide the sub-text vector set into at least one subset according to a preset number
  • the result determination unit is configured to input the at least one subset into the optimal neural network model to determine the detection result of the malicious code, wherein the detection result includes the location, type and corresponding probability of the malicious code;
  • the first result output unit is configured to use the detection result of the single subset as the output result of the optimal neural network model if the number of the sub-text vectors in the sub-text vector set constitutes a single subset.
  • the second result output unit is configured to determine the output result of the optimal neural network model based on the comprehensive detection results of the at least two subsets if the number of the sub-text vectors in the sub-text vector set constitutes at least two subsets.
  • the subset dividing unit includes:
  • the first subset dividing subunit is configured to, if the number of sub-text vectors in the sub-text vector set exceeds the preset number, repeatedly select the preset number of sub-text vectors and divide them into one of the subsets, until the number of remaining sub-text vectors does not reach the preset number;
  • the second subset dividing subunit is configured to, if the number of sub-text vectors in the sub-text vector set does not reach the preset number, copy the sub-text vectors so that the copied sub-text vectors and the original sub-text vectors together total the preset number, and divide the copied and original sub-text vectors into one of the subsets.
  • the first result output unit includes:
  • the first result determination subunit is configured to determine that the detection result of the malicious code is that there is no malicious code in the text data to be detected when the probability corresponding to the type of the single subset is lower than the preset probability;
  • the second result determination subunit is configured to, when the probability corresponding to the type of the single subset exceeds the preset probability, determine that the detection result of the malicious code is that the text data to be detected contains malicious code, and output the corresponding location, category and probability of the malicious code in the single subset.
  • the second result output unit includes:
  • the third result output subunit is configured to determine that the detection result of the malicious code is that there is no malicious code in the text data to be detected when the probabilities corresponding to the types of the at least two subsets are lower than the preset probability;
  • the fourth result output subunit is configured to, when the probabilities corresponding to the types of the at least two subsets exceed the preset probability, determine that the detection result of the malicious code is that the text data to be detected contains malicious code, and merge the content data corresponding to each subset in the original order to determine the corresponding location, category and probability of the malicious code in the at least two subsets.
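The thresholding and merging performed by the result output units can be sketched as follows. This is a minimal illustration: the `SubsetResult` record, the `aggregate` function, the default threshold of 0.5 and the reading that any single subset exceeding the preset probability is enough to report malicious code are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass
class SubsetResult:
    index: int          # position of the subset in the original file order
    location: tuple     # (first_line, last_line) of the suspected code
    category: str
    probability: float


def aggregate(results: list[SubsetResult], threshold: float = 0.5) -> list[SubsetResult]:
    """Return detections whose probability exceeds the threshold, in original order.

    An empty list means no malicious code was found in the text data.
    """
    hits = [r for r in results if r.probability > threshold]
    return sorted(hits, key=lambda r: r.index)
```

The same function covers the single-subset case: a list with one result yields either an empty list (no malicious code) or that one detection with its location, category and probability.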
  • the training of the optimal neural network model includes:
  • the location, type and corresponding probability label of the malicious code in the sub-text data are correspondingly generated to form the text data training set;
  • iterative training of the neural network model is performed on the text data training set, updating its parameters and weight values, until the loss function reaches its minimum, at which point the optimal neural network model is output; otherwise, the parameters and weight values of the neural network model are adjusted and the iterative training process is repeated.
  • the malicious code detection device provided by the embodiments of this application can execute the malicious code detection method provided by any embodiment of this application, and has functional modules and beneficial effects corresponding to the execution method.
  • FIG. 6 shows a schematic structural diagram of an electronic device that can be used to implement embodiments of the present application.
  • Electronic device 10 is intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers.
  • Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices.
  • the components shown herein, their connections and relationships, and their functions are examples only.
  • the electronic device 10 includes at least one processor 11, and a memory communicatively connected to the at least one processor 11, such as a read-only memory (ROM) 12, a random access memory (RAM) 13, etc., wherein the memory stores a computer program executable by the at least one processor; the processor 11 can perform various appropriate actions and processes according to the computer program stored in the ROM 12 or loaded from the storage unit 18 into the RAM 13.
  • the RAM 13 can also store various programs and data required for the operation of the electronic device 10.
  • the processor 11, the ROM 12 and the RAM 13 are connected to each other via the bus 14.
  • An input/output (I/O) interface 15 is also connected to the bus 14 .
  • Multiple components in the electronic device 10 are connected to the I/O interface 15, including: an input unit 16, such as a keyboard, a mouse, etc.; an output unit 17, such as various types of displays, speakers, etc.; a storage unit 18, such as a magnetic disk, an optical disk, etc.; and a communication unit 19, such as a network card, a modem, a wireless communication transceiver, etc.
  • the communication unit 19 allows the electronic device 10 to exchange information/data with other devices through computer networks such as the Internet and/or various telecommunications networks.
  • Processor 11 may be any of a variety of general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the processor 11 include a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various processors that run machine learning model algorithms, a digital signal processor (DSP), and any appropriate processor, controller, microcontroller, etc.
  • the processor 11 performs various methods and processes described above, such as the detection method of malicious code.
  • the malicious code detection method may be implemented as a computer program, which is tangibly included in a computer-readable storage medium, such as the storage unit 18 .
  • part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM 12 and/or the communication unit 19.
  • the processor 11 may be configured to perform the malicious code detection method through other suitable means (e.g., by means of firmware).
  • FPGAs Field-Programmable Gate Arrays
  • ASICs Application Specific Integrated Circuits
  • ASSP Application Specific Standard Parts
  • SOC System on Chip
  • CPLD Complex Programmable Logic Device
  • These various embodiments may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that can receive data and instructions from a storage system, at least one input device, and at least one output device, and transmit data and instructions to the storage system, the at least one input device, and the at least one output device.
  • Computer programs for implementing the methods of the present application may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus, such that the computer program, when executed by the processor, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. A computer program may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
  • a computer-readable storage medium may be a tangible medium that may contain or store a computer program for use by or in connection with an instruction execution system, apparatus, or device.
  • Computer-readable storage media may include electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses or devices, or any suitable combination of the foregoing.
  • the computer-readable storage medium may be a machine-readable signal medium.
  • machine-readable storage media may include an electrical connection based on one or more wires, a portable computer disk, a hard drive, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM) or flash memory, optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • the systems and techniques described herein may be implemented on an electronic device having: a display device (e.g., a cathode ray tube (CRT) or liquid crystal display (LCD) monitor) for displaying information to the user; and a keyboard and pointing device (e.g., a mouse or a trackball) through which the user can provide input to the electronic device.
  • Other kinds of devices may also be used to provide interaction with the user; for example, the feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user may be received in any form (including acoustic input, voice input, or tactile input).
  • the systems and techniques described herein may be implemented in a computing system that includes back-end components (e.g., a data server), or a computing system that includes middleware components (e.g., an application server), or a computing system that includes front-end components (e.g., a user's computer having a graphical user interface or web browser through which the user can interact with implementations of the systems and techniques described herein), or a computing system that includes any combination of such back-end, middleware, or front-end components.
  • the components of the system may be interconnected by any form or medium of digital data communication (e.g., a communications network). Examples of communication networks include: Local Area Network (LAN), Wide Area Network (WAN), blockchain network, and the Internet.
  • Computing systems may include clients and servers.
  • Clients and servers are generally remote from each other and typically interact over a communications network.
  • the relationship of client and server is created by computer programs running on corresponding computers and having a client-server relationship with each other.
  • the server can be a cloud server, also known as a cloud computing server or cloud host, which is a host product in the cloud computing service system that addresses the difficult management and weak business scalability of traditional physical hosts and VPS services.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Security & Cryptography (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Computer Hardware Design (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Virology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the present application disclose a malicious code detection method and apparatus, an electronic device, and a storage medium. The method comprises: acquiring text data to be tested; dividing said text data into at least one piece of sub-text data according to semantic logic; determining a sub-text vector corresponding to the sub-text data; and inputting the sub-text vector into an optimal neural network model so as to determine a malicious code detection result.
PCT/CN2023/093383 2022-09-09 2023-05-11 Malicious code detection method and apparatus, electronic device, and storage medium WO2024051196A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211104769.9 2022-09-09
CN202211104769.9A CN115455416A (zh) 2022-09-09 2022-09-09 Malicious code detection method and apparatus, electronic device, and storage medium

Publications (1)

Publication Number Publication Date
WO2024051196A1 true WO2024051196A1 (fr) 2024-03-14

Family

ID=84302126

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/093383 WO2024051196A1 (fr) 2022-09-09 2023-05-11 Malicious code detection method and apparatus, electronic device, and storage medium

Country Status (2)

Country Link
CN (1) CN115455416A (fr)
WO (1) WO2024051196A1 (fr)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115455416A (zh) * 2022-09-09 2022-12-09 上海派拉软件股份有限公司 Malicious code detection method and apparatus, electronic device, and storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
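CN112685739A (zh) * 2020-12-31 2021-04-20 卓尔智联(武汉)研究院有限公司 Malicious code detection method, data interaction method, and related device
CN113239354A (zh) * 2021-04-30 2021-08-10 武汉科技大学 Malicious code detection method and system based on a recurrent neural network
CN114253866A (zh) * 2022-03-01 2022-03-29 紫光恒越技术有限公司 Malicious code detection method and apparatus, computer device, and readable storage medium
CN114357443A (zh) * 2021-12-13 2022-04-15 北京六方云信息技术有限公司 Deep-learning-based malicious code detection method, device, and storage medium
CN114579965A (zh) * 2021-12-31 2022-06-03 厦门服云信息科技有限公司 Malicious code detection method and apparatus, and computer-readable storage medium
CN114692156A (zh) * 2022-05-31 2022-07-01 山东省计算中心(国家超级计算济南中心) Memory fragment malicious code intrusion detection method, system, storage medium, and device
CN114817913A (zh) * 2021-01-19 2022-07-29 腾讯科技(深圳)有限公司 Code detection method and apparatus, computer device, and storage medium
CN115455416A (zh) * 2022-09-09 2022-12-09 上海派拉软件股份有限公司 Malicious code detection method and apparatus, electronic device, and storage medium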

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
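CN112685739A (zh) * 2020-12-31 2021-04-20 卓尔智联(武汉)研究院有限公司 Malicious code detection method, data interaction method, and related device
CN114817913A (zh) * 2021-01-19 2022-07-29 腾讯科技(深圳)有限公司 Code detection method and apparatus, computer device, and storage medium
CN113239354A (zh) * 2021-04-30 2021-08-10 武汉科技大学 Malicious code detection method and system based on a recurrent neural network
CN114357443A (zh) * 2021-12-13 2022-04-15 北京六方云信息技术有限公司 Deep-learning-based malicious code detection method, device, and storage medium
CN114579965A (zh) * 2021-12-31 2022-06-03 厦门服云信息科技有限公司 Malicious code detection method and apparatus, and computer-readable storage medium
CN114253866A (zh) * 2022-03-01 2022-03-29 紫光恒越技术有限公司 Malicious code detection method and apparatus, computer device, and readable storage medium
CN114692156A (zh) * 2022-05-31 2022-07-01 山东省计算中心(国家超级计算济南中心) Memory fragment malicious code intrusion detection method, system, storage medium, and device
CN115455416A (zh) * 2022-09-09 2022-12-09 上海派拉软件股份有限公司 Malicious code detection method and apparatus, electronic device, and storage medium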

Also Published As

Publication number Publication date
CN115455416A (zh) 2022-12-09

Similar Documents

Publication Publication Date Title
WO2021212675A1 (fr) Method and apparatus for generating an adversarial sample, electronic device, and storage medium
CN107004159B (zh) Active machine learning
WO2020108063A1 (fr) Method, apparatus, and server for determining feature words
CN113807098A (zh) Model training method and apparatus, electronic device, and storage medium
WO2023116561A1 (fr) Entity extraction method and apparatus, electronic device, and storage medium
CN113434858B (zh) Malware family classification method based on disassembled code structure and semantic features
CN112347760A (zh) Training method and apparatus for an intent recognition model, and intent recognition method and apparatus
WO2024051196A1 (fr) Malicious code detection method and apparatus, electronic device, and storage medium
US20230073994A1 (en) Method for extracting text information, electronic device and storage medium
CN113141360A (zh) Method and apparatus for detecting malicious network attacks
US20230114673A1 (en) Method for recognizing token, electronic device and storage medium
CN106569989A (zh) Deduplication method and apparatus for short text
Gupta et al. SMPOST: parts of speech tagger for code-mixed indic social media text
CN114724156B (zh) Form recognition method and apparatus, and electronic device
CN113688240A (zh) Threat element extraction method, apparatus, device, and storage medium
CN115858776B (zh) Variant text classification and recognition method, system, storage medium, and electronic device
CN113366511A (zh) Named entity recognition and extraction using genetic programming
CN114662469B (zh) Sentiment analysis method and apparatus, electronic device, and storage medium
CN114880520B (zh) Video title generation method and apparatus, electronic device, and medium
WO2023245869A1 (fr) Speech recognition model training method and apparatus, electronic device, and storage medium
US20200099718A1 (en) Fuzzy inclusion based impersonation detection
CN115909376A (zh) Text recognition method, text recognition model training method, apparatus, and storage medium
CN114117007A (zh) Method, apparatus, device, and storage medium for retrieving entities
US11349856B2 (en) Exploit kit detection
CN109214005A (zh) Clue extraction method and system based on Chinese word segmentation

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 23861900

Country of ref document: EP

Kind code of ref document: A1