WO2024051196A1 - Malicious code detection method and apparatus, electronic device, and storage medium - Google Patents

Malicious code detection method and apparatus, electronic device, and storage medium

Info

Publication number
WO2024051196A1
WO2024051196A1 PCT/CN2023/093383 CN2023093383W WO2024051196A1 WO 2024051196 A1 WO2024051196 A1 WO 2024051196A1 CN 2023093383 W CN2023093383 W CN 2023093383W WO 2024051196 A1 WO2024051196 A1 WO 2024051196A1
Authority
WO
WIPO (PCT)
Prior art keywords
sub
text
data
malicious code
text data
Prior art date
Application number
PCT/CN2023/093383
Other languages
French (fr)
Chinese (zh)
Inventor
徐莉莎
Original Assignee
上海派拉软件股份有限公司
Priority date
Filing date
Publication date
Application filed by 上海派拉软件股份有限公司
Publication of WO2024051196A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55Detecting local intrusion or implementing counter-measures
    • G06F21/56Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/562Static detection
    • G06F21/563Static detection by source code analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/40Transformation of program code
    • G06F8/41Compilation
    • G06F8/43Checking; Contextual analysis
    • G06F8/436Semantic checking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Definitions

  • This application relates to the field of computer security technology, for example, to a malicious code detection method, device, electronic equipment and storage medium.
  • In daily life, hackers attack in various forms. Some attack methods have obvious characteristics, while others have more subtle ones.
  • Hackers often attack by injecting malicious code into part of a file. When the file is viewed or a command is executed, the malicious code runs, inserting viruses and Trojans, leaving backdoors, and performing other dangerous behaviors. Once the device is connected to the Internet, it can easily be hacked and hijacked, causing significant damage.
  • One method for detecting malicious code in files is to search the file for malicious code through keyword matching. This method often fails to detect unknown malicious code correctly. The other method is to determine whether the data is malicious code through machine learning or deep learning classification. However, this method cannot confirm the location and type of the malicious code at the same time.
  • This application provides a method, device, equipment and medium for detecting malicious codes, which can simultaneously confirm the probability, location and type corresponding to the malicious codes, and improve the accuracy of detecting malicious codes.
  • an embodiment of the present application provides a method for detecting malicious code, which method includes:
  • the sub-text vector is input into an optimal neural network model to determine the detection result of the malicious code; wherein the optimal neural network model is generated based on a text data training set containing malicious code.
  • an embodiment of the present application also provides a malicious code detection method and device, which includes:
  • the data acquisition module is set to obtain the text data to be detected
  • a sub-text vector determination module configured to divide the text data to be detected into at least one sub-text data according to semantic logic, and determine the sub-text vector corresponding to the sub-text data;
  • the result determination module is configured to input the sub-text vector into an optimal neural network model to determine the detection result of the malicious code; wherein the optimal neural network model is generated based on a text data training set containing malicious code.
  • an embodiment of the present application further provides an electronic device, where the electronic device includes:
  • the memory stores a computer program that can be executed by the at least one processor, and the computer program is executed by the at least one processor, so that the at least one processor can execute the malicious code detection method described in any embodiment of the present application.
  • embodiments of the present application also provide a computer-readable storage medium that stores computer instructions, and the computer instructions, when executed, are used to enable a processor to implement the method for detecting malicious code according to any embodiment of the present application.
  • The technical solution of the embodiment of the present application obtains the text data to be detected, divides the text data to be detected into at least one sub-text data according to semantic logic, determines the sub-text vector corresponding to the sub-text data, and inputs the sub-text vector into the optimal neural network model to determine the detection result of the malicious code, wherein the optimal neural network model is generated based on a text data training set containing malicious code.
  • Dividing the acquired text data to be detected into at least one sub-text data through semantic logic and determining the corresponding sub-text vector forms sub-text features and facilitates subsequent data processing by the optimal neural network model. Determining the detection result of the malicious code in the sub-text vector through the optimal neural network model confirms the relevant detection information corresponding to the malicious code and improves the accuracy of malicious code detection.
  • Figure 1 is a flow chart of a malicious code detection method provided by an embodiment of the present application.
  • Figure 2 is a flow chart of another malicious code detection method provided by an embodiment of the present application.
  • Figure 3 is a flow chart of a malicious code training process provided by an embodiment of the present application.
  • Figure 4 is a flow chart of a malicious code detection process provided by an embodiment of the present application.
  • Figure 5 is a structural block diagram of a malicious code detection device provided by an embodiment of the present application.
  • FIG. 6 shows a schematic structural diagram of an electronic device that can be used to implement embodiments of the present application.
  • Figure 1 is a flow chart of a method for detecting malicious code provided by an embodiment of the present application. This embodiment can be applied to situations when malicious code is detected during the transmission of various files.
  • the method may be executed by a malicious code detection device, which may be implemented in the form of hardware and/or software, and may be configured in an electronic device. As shown in Figure 1, the method includes:
  • the text data to be detected refers to relevant text data that may contain malicious code waiting to be detected.
  • the text data to be detected can also be called character data, which is text data in various types of files. For example, it can be text data in doc files, pdf files, and txt files. It can include English characters, Chinese characters, numbers, and other input characters.
  • the file type and the content information of the text data in various types of text data to be detected can be obtained, along with the row of the page where each piece of content information is located, so that the obtained text data can be processed accordingly.
  • semantic logic can be understood as the logical relationship between sentences or the character symbols between sentences on the page.
  • Logical relationships can be, for example, juxtaposition, succession, transition, result and cause, purpose, concession, etc.
  • sub-text data can be understood as the data corresponding to each module after the text data to be detected is divided into modules through semantic logic. It should be noted that the text data to be detected can be divided into at least one module, and each module corresponds to one sub-text data. After modular division, the corresponding sub-text data can come from a code file or a non-script file.
  • the sub-text vector can be understood as the sub-text vector corresponding to each module obtained by vectorizing the sub-text data of each module using the doc2vec model or the word2vec model.
  • the content data in the text data to be detected can be extracted, and the content data can be filtered to obtain English data.
  • the English data can be divided according to its logical content, and the doc2vec model or word2vec model can be used to vectorize the sub-text data corresponding to each module after the division. Alternatively, the function call graph corresponding to the text data to be detected can be determined and used to obtain the feature vector of the corresponding call sequence for vectorization.
  • malicious code (malware) refers to any software or code that may conflict with an organization's security policy. Such code serves no useful purpose but brings certain dangers. It can be software that is installed and run on the user's computer or other terminal without explicitly prompting the user or without the user's permission, infringing upon the user's legitimate rights and interests. It can also be computer code that is deliberately prepared or set up and poses a threat or potential threat to the network or system.
  • the type of malicious code can be, for example, SQL injection or an XSS attack. XSS stands for cross-site scripting.
  • the detection results of the malicious code include the location of the malicious code, the type corresponding to the malicious code, the probability corresponding to the type of malicious code, and so on.
  • the sub-text vector corresponding to the sub-text data in each module after segmentation can be input into the optimal neural network model to determine the detection results of malicious code.
  • the sub-text vectors corresponding to the text data to be detected can be formed into a sub-text vector set, the sub-text vector set can be divided into at least one subset according to a preset number, and at least one subset can be input to the optimal neural network model.
  • distinction processing can be carried out based on the number of subsets. If it is a single subset, the detection result of a single subset is directly used as the output result of the optimal neural network model. If it is at least two subsets, then the corresponding detection results are determined and output based on at least two subsets.
  • training of the optimal neural network model includes:
  • the location, type and corresponding probability label of the malicious code in the generated sub-text data are correspondingly generated to form a text data training set
  • the text data training set consists of text data containing malicious code.
  • the text data in the text data training set contains the location and type of the malicious code.
  • For example, File 1 contains malicious code; the type of the malicious code is an XSS attack, and it is located on line 6 of the page.
  • the content data may include at least one of the following: text type, text information, the row and/or column where the text information is located, and the number of words corresponding to the text information.
  • the training process of the optimal neural network model is as follows. Text data containing malicious code is extracted from the text data training set, where the text data in the training set carries the location and type of the malicious code. The content data of the text data is extracted, such as the text information in doc files, pdf files, and txt files, the lines corresponding to the text, and the number of words, to generate an original text set that preserves the line numbers of the page corresponding to the original text data. Non-conforming (non-English) content in the content data is filtered out to obtain English data, the English data is divided into at least one sub-text data, and the doc2vec model is used to determine the sub-text vector corresponding to the sub-text data.
  • the location, type and corresponding probability label of the malicious code in the generated sub-text data are produced accordingly, and the neural network model is trained iteratively on the text data training set to update its parameters and weight values.
  • when training meets the stopping condition, the optimal neural network model is output; otherwise the parameters and weight values of the neural network model are adjusted and the above iterative training process is repeated.
  • the overall loss function in the training stage is the sum of the position loss and the classification loss. The neural network model is trained over multiple iterations so that the accuracy is maximized and the error rate of the entire neural network is reduced, and the loss function can be used to correct the deviation between the real position and the predicted position.
  • IOU stands for Intersection over Union.
  • the original model can be fine-tuned and updated or retrained with old data.
  • The technical solution of the embodiment of the present application obtains the text data to be detected, divides the text data to be detected into at least one sub-text data according to semantic logic, determines the sub-text vector corresponding to the sub-text data, and inputs the sub-text vector into the optimal neural network model to determine the detection result of the malicious code, wherein the optimal neural network model is generated based on a text data training set containing malicious code.
  • Dividing the acquired text data to be detected into at least one sub-text data through semantic logic and determining the corresponding sub-text vector forms sub-text features and facilitates subsequent data processing by the optimal neural network model. Determining the detection result of the malicious code in the sub-text vector through the optimal neural network model confirms the relevant detection information corresponding to the malicious code and improves the accuracy of malicious code detection.
  • Figure 2 is a flow chart of another malicious code detection method provided by an embodiment of the present application. Based on the above embodiments, this embodiment further details the steps of dividing the text data to be detected into at least one sub-text data according to semantic logic, determining the sub-text vector corresponding to the sub-text data, and inputting the sub-text vector into the optimal neural network model to determine the detection result of the malicious code. As shown in Figure 2, the malicious code detection method in this embodiment may include the following steps:
  • the content data includes at least one of the following: text type, text information, the row and/or column where the text information is located, and the number of words corresponding to the text information.
  • the text type refers to the file type corresponding to the text data to be detected, for example, it can be a doc file, pdf file, txt file, etc.
  • Text information can be understood as text information of the text data to be detected and related attribute information of the text information.
  • the corresponding content data can be extracted from the text data to be detected to generate the original text set.
  • the original text collection holds the original content data of the text information to be detected, including the original text type of the text data to be detected, the original text information, the row and/or column of the original text information, and the number of words corresponding to the original text information.
  • English data can be understood as English characters.
  • the content data in the text data to be detected may be Chinese character data or English character data. Because malicious code is usually written in English characters, the Chinese character data is filtered out to obtain the English data corresponding to the content data.
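  • The filtering step can be sketched as follows. This is a minimal illustration, assuming that "English data" means printable ASCII characters and that the input is a list of page lines; the function name `filter_english` is hypothetical and does not appear in the application. Original line numbers are kept so that later steps can still report positions.

```python
import re

# Assumption: anything outside printable ASCII (plus tab) counts as non-English data.
_NON_ENGLISH = re.compile(r"[^\x20-\x7E\t]")


def filter_english(lines):
    """Return (line_number, english_text) pairs with 1-based original line numbers."""
    result = []
    for number, text in enumerate(lines, start=1):
        kept = _NON_ENGLISH.sub("", text).strip()
        # Empty lines are kept so the original line numbering is preserved.
        result.append((number, kept))
    return result
```

  • A line that contains only Chinese characters becomes an empty string but keeps its line number, which matches the requirement elsewhere in the description to maintain the original number of lines.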
  • the English data is a code
  • the logical structure includes at least one of the following: sequential logic, conditional logic, loop logic, function block, and class.
  • the doc2vec model is a related model used to generate word vectors and paragraph vectors.
  • the doc2vec model is used for embedding vectorization for the sub-text data obtained from each modularization.
  • the fixed format of script documents in the relevant programming language can be used to determine whether the filtered English data is code or a non-script document.
  • when the English data is code, it can be divided into modules according to the logical structure of the programming code, for example sequential logic, conditional logic, loop logic, function blocks, and classes.
  • Each module corresponds to one sub-text data, and the number of divided modules is at least one. After the division, the doc2vec model can be used to determine the sub-text vector corresponding to the sub-text data in each module.
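  • One possible way to divide code by logical structure is sketched below. It assumes the code to be detected is Python source and uses the standard `ast` module to treat each top-level statement (function block, class, loop, conditional, or plain statement) as one module. The application does not name a parser, so both the choice of `ast` and the helper name `split_python_source` are assumptions.

```python
import ast


def split_python_source(source):
    """Split Python source into modules, one per top-level statement.

    Returns a list of (start_line, end_line, text) tuples with 1-based,
    inclusive line numbers taken from the original source.
    """
    tree = ast.parse(source)
    lines = source.splitlines()
    modules = []
    for node in tree.body:
        start, end = node.lineno, node.end_lineno  # end_lineno needs Python 3.8+
        modules.append((start, end, "\n".join(lines[start - 1:end])))
    return modules
```

  • Each returned tuple keeps its original line range, so a detection inside a module can be mapped back to a row of the page.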
  • For example, let n be the number of modules obtained after modular segmentation of a file to be detected whose English data is code.
  • the sub-text data of the n modules is embedded using the doc2vec model to generate a corresponding m*k vector for each module, where m represents the number of characters contained in the sub-text data of each module and k represents the size of the vocabulary.
  • the vector of one text data to be detected can then be represented as an n*m*k dimensional vector.
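  • The n*m*k shape can be illustrated with a small sketch. The application produces these vectors with doc2vec; the code below deliberately swaps in a plain one-hot character encoding, only to show how n modules, m character positions, and a vocabulary of size k fit together. The function name `one_hot_modules` and the padding of short modules with all-zero rows are assumptions.

```python
def one_hot_modules(modules, vocab, m):
    """Encode each module as an m x k one-hot matrix (stand-in for doc2vec).

    modules: list of n strings; vocab: string of k distinct characters;
    m: fixed number of character positions per module.
    Returns a nested list with shape n x m x k.
    """
    index = {ch: i for i, ch in enumerate(vocab)}
    k = len(vocab)
    tensor = []
    for text in modules:
        rows = []
        for pos in range(m):
            row = [0] * k
            # Positions past the end of the text, or unknown characters, stay all-zero.
            if pos < len(text) and text[pos] in index:
                row[index[text[pos]]] = 1
            rows.append(row)
        tensor.append(rows)
    return tensor
```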
  • the non-script document can be a doc type, pdf type, txt type and other non-script documents.
  • the punctuation marks of characters in English data can be understood as the punctuation marks of the characters in each line of the page. For example, the characters in each line are split into sentences at each punctuation mark.
  • the English data is divided into modules accordingly. Each module corresponds to one sub-text data, and the number of divided modules is at least one. After the division, the doc2vec model can be used for embedding vectorization to determine the sub-text vector corresponding to the sub-text data in each module.
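  • For non-script documents, the punctuation-based division might look like the sketch below. It assumes that the sentence-ending marks are `.`, `;`, `!` and `?` followed by whitespace; the application does not list the marks, and the name `split_inline_sentences` is hypothetical.

```python
import re

# Assumption: a sentence ends at one of . ; ! ? followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.;!?])\s+")


def split_inline_sentences(lines):
    """Split each line into sentences; return (line_number, sentence) pairs."""
    modules = []
    for number, text in enumerate(lines, start=1):
        for sentence in _SENTENCE_END.split(text.strip()):
            if sentence:  # skip empty lines
                modules.append((number, sentence))
    return modules
```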
  • a corresponding sub-text vector set can be formed. It should be noted that the sub-text vector set may contain one or more sub-text vectors.
  • the preset number can be understood as the fixed number of sub-text vectors corresponding to the pre-set modules.
  • the corresponding settings can be made through experience, or they can be set manually.
  • the subset contains subtext vectors corresponding to one or more subtext data.
  • the vector set is divided according to a fixed number to obtain the vectors corresponding to the subsets.
  • the sub-text data of n modules are embedded using the doc2vec model to generate corresponding m*k vectors.
  • the vectors are divided according to the fixed number r, and the vector corresponding to each subset is r*m*k, where m represents the number of characters contained in the sub-text data in each module and k represents the size of the vocabulary. It should be noted that a trailing segment that does not reach the fixed number r must copy its own sub-text data until the fixed number r is reached.
  • the location of the malicious code may be the position in the original text data to be detected, or it may not be the position in the original text data to be detected.
  • the sub-text vector set is divided into at least one subset according to a preset number, including:
  • a corresponding number of sub-text vectors is selected from the sub-text vectors according to the preset number and divided into a subset, until the number of remaining sub-text vectors does not reach the preset number;
  • when the number of remaining sub-text vectors does not reach the preset number, the remaining sub-text vectors are copied so that the sum of the copied sub-text vectors and the remaining sub-text vectors equals the preset number, and the copied sub-text vectors and the remaining sub-text vectors are divided into a subset.
  • if only 1 sub-text vector remains, the sub-text vector corresponding to that sub-text data is copied to expand to the preset number. If at least 2 sub-text vectors remain, the sub-text vector corresponding to one of the sub-text data can be copied to expand to the preset number, or the sub-text vectors corresponding to multiple sub-text data can be copied to expand to the preset number.
  • For example, the preset number, that is, the standardized number, is 5. If the text data to be detected has 15 sub-text vectors, they are divided into 3 parts according to the standardized number 5. If the text data to be detected has 17 sub-text vectors, they are cut into 4 parts according to the standardized number 5, and the 4th part contains only 2 sub-text vectors. The number of sub-text vectors in the 4th part then needs to be expanded to the standardized number 5, which can be done by copying the 2 sub-text vectors in the 4th part.
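  • The splitting and padding rule can be sketched as follows. The sketch assumes that the short tail is padded by cycling through its own vectors in order, which is one of the copying options the description allows; the function name `split_into_subsets` is hypothetical.

```python
from itertools import cycle, islice


def split_into_subsets(vectors, preset):
    """Split a list of sub-text vectors into subsets of exactly `preset` items.

    A short final subset is padded by repeating its own items in order.
    """
    if preset <= 0:
        raise ValueError("preset must be positive")
    subsets = [vectors[i:i + preset] for i in range(0, len(vectors), preset)]
    if subsets and len(subsets[-1]) < preset:
        # Cycle through the tail's own vectors until the preset number is reached.
        subsets[-1] = list(islice(cycle(subsets[-1]), preset))
    return subsets
```

  • With 17 vectors and a preset number of 5, this yields 4 subsets, and the last one holds the 2 remaining vectors repeated until it contains 5 items.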
  • The at least one input subset is detected by the optimal neural network model to determine the detection results of malicious code in the text data to be detected.
  • if malicious code is contained, the corresponding output includes the location, type and corresponding probability of the malicious code. It should be noted that differentiated processing can be performed based on the number of subsets to obtain the detection results of the malicious code for the case of a single subset and for the case of at least two subsets.
  • the detection result of the single subset is used as the output result of the optimal neural network model.
  • the detection result of the single subset is directly used as the output result of the optimal neural network model. For example, if the probability corresponding to the type of a single subset is lower than the preset probability, the detection result of the malicious code is determined to be that there is no malicious code in the text data to be detected; if the probability corresponding to the type of a single subset exceeds the preset probability, Then it is determined that the detection result of the malicious code is that there is malicious code in the text data to be detected, and the corresponding location, category and corresponding probability of the malicious code in a single subset are directly output.
  • the detection results of a single subset are used as the output results of the optimal neural network model, including:
  • when the probability corresponding to the type of a single subset is lower than the preset probability, the detection result is that there is no malicious code in the text data to be detected;
  • when the probability corresponding to the type of a single subset exceeds the preset probability, it is determined that the detection result of the malicious code is that the text data to be detected contains malicious code, and the corresponding location, category and corresponding probability of the malicious code in the single subset are output.
  • That is, a probability below the preset probability means the text data to be detected contains no malicious code, while a probability above the preset probability means it contains malicious code, in which case the location, category and corresponding probability of the malicious code in the single subset are output.
  • For example, if the preset probability is 5% and the probability corresponding to the type of a single subset exceeds 5%, it is determined that the detection result of the malicious code is that the text data to be detected contains malicious code, and the current location, category and corresponding probability of the malicious code are output through the optimal neural network model, for example the row of the page, the category of the malicious code and the probability corresponding to the current malicious code.
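  • The single-subset decision reduces to a threshold check, sketched below. The dictionary keys (`line`, `category`, `probability`) and the function name `decide` are hypothetical, and treating a probability exactly equal to the preset probability as "no malicious code" is an assumption, since the description only covers the lower and higher cases.

```python
def decide(detection, preset_probability):
    """Turn one model output into a detection result.

    detection: dict with hypothetical keys 'line', 'category', 'probability'.
    """
    if detection["probability"] > preset_probability:
        # Malicious code found: report its location, category and probability.
        return {"malicious": True, **detection}
    return {"malicious": False}
```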
  • the output result of the optimal neural network model is determined based on the comprehensive detection results of the at least two subsets. For example, if the probabilities corresponding to the types of the at least two subsets are lower than the preset probability, it is determined that the detection result of the malicious code is that there is no malicious code in the text data to be detected. If the probabilities corresponding to the types of the at least two subsets exceed the preset probability, it is determined that the detection result of the malicious code is that there is malicious code in the text data to be detected. In that case, the content data corresponding to each subset must be merged according to the original order of the subsets to determine the corresponding positions, categories and corresponding probabilities of the malicious code in the at least two subsets.
  • the output result of the optimal neural network model is determined based on the comprehensive detection results of at least two subsets, including:
  • when the probabilities corresponding to the types of at least two subsets are lower than the preset probability, the detection result of the malicious code is that there is no malicious code in the text data to be detected;
  • when the probabilities corresponding to the types of at least two subsets exceed the preset probability, it is determined that the detection result of the malicious code is that the text data to be detected contains malicious code, and the content data corresponding to each subset is merged according to the original order of each subset to determine the corresponding locations, categories and corresponding probabilities of the malicious code in the at least two subsets.
  • That is, probabilities below the preset probability mean the text data to be detected contains no malicious code, while probabilities above the preset probability mean it contains malicious code, in which case the content data of the subsets is merged in their original order to determine the location, category, and corresponding probability of the malicious code.
  • During splitting, the location of the malicious code is segmented accordingly.
  • For example, if the malicious code is in lines 7-10, then after splitting according to a fixed number of 8, the malicious code in the first subset is in lines 7 and 8, and the malicious code in the second subset is in lines 1 and 2.
  • The content data corresponding to each subset must be merged in order according to the original row numbers of each subset to determine the location, type and corresponding probability information of the malicious code contained in the original data.
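  • Mapping subset-local positions back to the original rows can be sketched as below, using the example of a fixed number of 8. The sketch assumes each module is one line, that subsets are indexed from 0 in their original order, and that hits falling on padded copies in the last subset have already been discarded; the name `merge_positions` is hypothetical.

```python
def merge_positions(hits, preset):
    """Merge subset-local hits into ranges of original line numbers.

    hits: iterable of (subset_index, local_line), with a 0-based subset index
    and a 1-based line within the subset. Returns a list of (start, end) ranges.
    """
    # Recover original line numbers and sort them.
    originals = sorted({index * preset + local for index, local in hits})
    ranges = []
    for line in originals:
        if ranges and line == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], line)  # extend the current range
        else:
            ranges.append((line, line))  # start a new range
    return ranges
```

  • Hits on lines 7 and 8 of the first subset and lines 1 and 2 of the second subset merge back into the single original range of lines 7 to 10.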
  • The above technical solution of the embodiment of the present application extracts the content data in the text data to be detected and filters out the non-English data to obtain the English data in the content data. The English data is divided into at least one sub-text data (according to its logical structure when the English data is code), and the doc2vec model is used to determine the sub-text vector corresponding to the sub-text data, forming sub-text features that facilitate subsequent data processing by the optimal neural network model. The sub-text vectors are divided into at least one subset and input into the optimal neural network model to determine the detection result of the malicious code, where the detection result includes the location, type and corresponding probability of the malicious code. If there is a single subset, the detection result of the single subset is used as the output result of the optimal neural network model. If there are at least two subsets, the output result of the optimal neural network model is determined based on the comprehensive detection results of the at least two subsets. This simultaneously confirms the probability, location and type of the malicious code and improves the accuracy of malicious code detection.
  • Figure 3 is a flow chart of a malicious code training process provided by an embodiment of the present application.
  • Figure 4 is a flow chart of a malicious code detection process provided by an embodiment of the present application.
  • In this embodiment, the text data represents the text data to be detected in the above embodiment, the non-English characters represent the non-English data in the above embodiment, the subsample represents the subset in the above embodiment, and the preset module number represents the preset number in the above embodiment.
  • data collection is performed to determine training data.
  • S320 Extract the text content data in the text data, generate an original text set, and maintain the original number of lines.
  • the text data is divided into grids based on semantic logic, and the number of modules is recorded as n. If the text data containing English characters is a code file, it is divided according to structures such as sequential logic, conditional logic, loop logic, function blocks, and classes, and the number of divided modules is recorded as n. If the text data containing English characters is a non-script document such as doc, pdf, or txt, it is divided according to inline statements to generate n modules.
  • the modular division of English characters in this embodiment can be understood as the division of text data to be detected into at least one sub-text data according to semantic logic in the above embodiments.
  • a module is represented as a subtext data.
  • the vector of a file is represented as an n*m*k dimensional vector.
  • the module vectorization in this embodiment can be understood as determining the sub-text vector corresponding to the sub-text data in the above-mentioned embodiments.
  • each file vector is divided according to the fixed number of modules r to obtain sub-sample vectors, so that each sub-sample vector is an r*m*k dimensional vector; a final segment that contains fewer than r modules copies its own data until it contains r modules.
  • an end-to-end neural network model is constructed, the number of network layers is defined, and the loss function and optimizer are determined.
  • the overall loss function in the training phase is the sum of the position loss and the classification loss.
  • GIOU IOU-
  • loss(classify) is the cross-entropy loss function.
  • some parameters in the model are continually adjusted during training, and the weights and parameters corresponding to the number of network layers in the optimal neural network model are determined.
  • the network has 3 layers
  • the second layer has 3 neurons
  • the text data is divided into modules based on semantic logic, and the number of modules is recorded as n: if the text data containing English characters is a code file, it is divided according to structures such as sequential logic, conditional logic, loop logic, function blocks and classes, and the number of resulting modules is recorded as n; if the text data containing English characters is a non-script document such as doc, pdf or txt, it is divided according to inline statements to generate n modules.
  • S450 Split according to the preset number of modules to generate subsamples.
  • FIG. 5 is a structural block diagram of a malicious code detection device provided by an embodiment of the present application.
  • the device is suitable for detecting malicious codes during the transmission of various files.
  • the device can be implemented in hardware and/or software, and can be configured in an electronic device to implement the malicious code detection method in the embodiments of the present application.
  • the device includes: a data acquisition module 510, a sub-text vector determination module 520 and a result determination module 530.
  • the data acquisition module 510 is configured to acquire the text data to be detected
  • the sub-text vector determination module 520 is configured to divide the text data to be detected into at least one sub-text data according to semantic logic, and determine the sub-text vector corresponding to the sub-text data;
  • the result determination module 530 is configured to input the sub-text vector into an optimal neural network model to determine the detection result of the malicious code; wherein the optimal neural network model is generated based on a text data training set containing malicious code.
  • the sub-text vector determination module divides the acquired text data to be detected into at least one sub-text data according to semantic logic and determines the corresponding sub-text vector, which forms sample features and facilitates subsequent data processing by the optimal neural network model; the result determination module uses the optimal neural network model to determine the detection result of malicious code in the sub-text vector, which confirms the detection information corresponding to the malicious code and improves the accuracy of malicious code detection.
  • the sub-text vector determination module 520 includes:
  • a data extraction unit configured to extract content data in the text data to be detected, wherein the content data includes at least one of the following: text type, text information, the row and/or column where the text information is located, and the number of words corresponding to the text information;
  • a data filtering unit configured to filter non-English data in the content data to obtain English data in the content data
  • the first sub-text vector determination unit is configured to, when the English data is code, divide the English data into at least one sub-text data according to the logical structure of the code, and use the doc2vec model to determine the sub-text vector corresponding to the sub-text data; wherein the logical structure includes at least one of the following: sequential logic, conditional logic, loop logic, function block, class;
  • the second sub-text vector determination unit is configured to, when the English data is a non-script document, divide the English data into at least one sub-text data according to the punctuation marks in the English data, and use the doc2vec model to determine the sub-text vector corresponding to the sub-text data.
  • the result determination module 530 includes:
  • a set forming unit configured to form the sub-text vectors corresponding to the text data to be detected into a sub-text vector set;
  • a subset dividing unit configured to divide the sub-text vector set into at least one subset according to a preset number
  • the result determination unit is configured to input the at least one subset into the optimal neural network model to determine the detection result of the malicious code, wherein the detection result includes the location, type and corresponding probability of the malicious code.
  • the first result output unit is configured to use the detection result of the single subset as the output result of the optimal neural network model if the number of the sub-text vectors in the sub-text vector set constitutes a single subset.
  • the second result output unit is configured to determine the output result of the optimal neural network model based on the combined detection results of the at least two subsets if the number of the sub-text vectors in the sub-text vector set constitutes at least two subsets.
  • the subset dividing unit includes:
  • the first subset dividing subunit is configured to, if the number of sub-text vectors in the sub-text vector set exceeds the preset number, repeatedly select the preset number of sub-text vectors and divide them into one subset, until the number of remaining sub-text vectors does not reach the preset number;
  • the second subset dividing subunit is configured to, if the number of sub-text vectors in the sub-text vector set does not reach the preset number, copy sub-text vectors so that the copied sub-text vectors together with the original sub-text vectors total the preset number, and divide the copied and original sub-text vectors into one subset.
  • the first result output unit includes:
  • the first result determination subunit is configured to determine that the detection result of the malicious code is that there is no malicious code in the text data to be detected when the probability corresponding to the type of the single subset is lower than the preset probability;
  • the second result determination subunit is configured to, when the probability corresponding to the type of the single subset exceeds the preset probability, determine that the detection result of the malicious code is that the text data to be detected contains malicious code, and output the location, category and corresponding probability of the malicious code in the single subset.
  • the second result output unit includes:
  • the third result output subunit is configured to determine that the detection result of the malicious code is that there is no malicious code in the text data to be detected when the probabilities corresponding to the types of the at least two subsets are lower than the preset probability;
  • the fourth result output subunit is configured to determine that the detection result of the malicious code is that the text data to be detected contains malicious code when the probabilities corresponding to the types of the at least two subsets exceed the preset probability, and to merge the content data corresponding to each subset in the original order to determine the location, category and corresponding probability of the malicious code in the at least two subsets.
  • the training of the optimal neural network model includes:
  • the location, type and corresponding probability label of the malicious code in the sub-text data are correspondingly generated to form the text data training set;
  • Iterative training is performed on the parameters and weight values of the neural network model based on the text data training set. When the loss function reaches its minimum, the optimal neural network model is output; otherwise, the parameters and weight values of the neural network model are adjusted and the iterative training process is repeated.
  • the malicious code detection device provided by the embodiments of this application can execute the malicious code detection method provided by any embodiment of this application, and has functional modules and beneficial effects corresponding to the execution method.
  • FIG. 6 shows a schematic structural diagram of an electronic device that can be used to implement embodiments of the present application.
  • Electronic device 10 is intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers.
  • Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices.
  • the components shown herein, their connections and relationships, and their functions are examples only.
  • the electronic device 10 includes at least one processor 11 and a memory communicatively connected to the at least one processor 11, such as a read-only memory (Read Only Memory, ROM) 12 and a random access memory (Random Access Memory, RAM) 13, wherein the memory stores a computer program executable by the at least one processor, and the processor 11 can perform various appropriate actions and processes according to a computer program stored in the ROM 12 or loaded from the storage unit 18 into the RAM 13.
  • the RAM 13 can also store various programs and data required for the operation of the electronic device 10.
  • the processor 11, the ROM 12 and the RAM 13 are connected to each other via the bus 14.
  • An input/output (I/O) interface 15 is also connected to the bus 14.
  • Multiple components in the electronic device 10 are connected to the I/O interface 15, including: an input unit 16, such as a keyboard, a mouse, etc.; an output unit 17, such as various types of displays, speakers, etc.; a storage unit 18, such as a magnetic disk, an optical disk, etc.; and a communication unit 19, such as a network card, a modem, a wireless communication transceiver, etc.
  • the communication unit 19 allows the electronic device 10 to exchange information/data with other devices through computer networks such as the Internet and/or various telecommunications networks.
  • Processor 11 may be any of a variety of general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the processor 11 include a central processing unit (Central Processing Unit, CPU), a graphics processing unit (Graphic Processing Unit, GPU), various dedicated artificial intelligence (Artificial Intelligence, AI) computing chips, various processors that run machine learning model algorithms, a digital signal processor (Digital Signal Processing, DSP), and any appropriate processor, controller, microcontroller, etc.
  • the processor 11 performs various methods and processes described above, such as the detection method of malicious code.
  • the malicious code detection method may be implemented as a computer program, which is tangibly included in a computer-readable storage medium, such as the storage unit 18.
  • part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM 12 and/or the communication unit 19.
  • the processor 11 may be configured to perform the malicious code detection method through other suitable means (e.g., by means of firmware).
  • FPGAs Field-Programmable Gate Arrays
  • ASICs Application Specific Integrated Circuits
  • ASSP Application Specific Standard Parts
  • SOC System on Chip
  • CPLD Complex Programmable Logic Device
  • These various embodiments may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor; the programmable processor, which may be a special-purpose or general-purpose programmable processor, may receive data and instructions from a storage system, at least one input device, and at least one output device, and transmit data and instructions to the storage system, the at least one input device, and the at least one output device.
  • Computer programs for implementing the methods of the present application may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general-purpose computer, a special-purpose computer, or another programmable data processing apparatus, such that the computer program, when executed by the processor, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. A computer program may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
  • a computer-readable storage medium may be a tangible medium that may contain or store a computer program for use by or in connection with an instruction execution system, apparatus, or device.
  • Computer-readable storage media may include electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses or devices, or any suitable combination of the foregoing.
  • the computer-readable storage medium may be a machine-readable signal medium.
  • machine-readable storage media may include an electrical connection based on one or more wires, a portable computer disk, a hard drive, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM) or flash memory, optical fiber, portable compact disc read-only memory (Compact Disc-Read Only Memory, CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • the systems and techniques described herein may be implemented on an electronic device having a display device (e.g., a cathode ray tube (CRT) or a liquid crystal display (Liquid Crystal Display, LCD) monitor) for displaying information to the user, as well as a keyboard and a pointing device (e.g., a mouse or a trackball).
  • Other kinds of devices may also be used to provide interaction with the user; for example, the feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user may be received in any form (including acoustic input, voice input or tactile input).
  • the systems and techniques described herein may be implemented in a computing system that includes back-end components (e.g., a data server), or a computing system that includes middleware components (e.g., an application server), or a computing system that includes front-end components (e.g., a user's computer having a graphical user interface or web browser through which the user can interact with implementations of the systems and techniques described herein), or a computing system that includes any combination of such back-end, middleware, or front-end components.
  • the components of the system may be interconnected by any form or medium of digital data communication (e.g., a communications network). Examples of communication networks include: Local Area Network (LAN), Wide Area Network (WAN), blockchain network, and the Internet.
  • Computing systems may include clients and servers.
  • Clients and servers are generally remote from each other and typically interact over a communications network.
  • the relationship of client and server is created by computer programs running on corresponding computers and having a client-server relationship with each other.
  • the server can be a cloud server, also known as a cloud computing server or cloud host, which is a host product in the cloud computing service system that addresses the difficult management and weak business scalability of traditional physical hosts and VPS services.
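
The fixed-size splitting and the probability thresholding described in the items above can be illustrated with a minimal sketch. It assumes that a short final group is padded by repeating its own elements until it holds r items, and that the model returns one detection per subset; the `Detection` record, the function names and the 0.5 threshold are hypothetical and are not taken from the application.

```python
from dataclasses import dataclass
from itertools import cycle, islice


def split_into_subsets(vectors: list, r: int) -> list[list]:
    """Chunk the vectors into groups of r; a short final group is padded
    by repeating its own items until it holds exactly r entries."""
    if r <= 0:
        raise ValueError("r must be positive")
    subsets = []
    for start in range(0, len(vectors), r):
        chunk = vectors[start:start + r]
        subsets.append(list(islice(cycle(chunk), r)))
    return subsets


@dataclass(frozen=True)
class Detection:
    """Hypothetical per-subset model output."""
    start_line: int
    end_line: int
    kind: str           # e.g. "sql_injection" or "xss"
    probability: float


def aggregate(per_subset: list[Detection], threshold: float = 0.5) -> list[Detection]:
    """Keep detections whose probability exceeds the threshold, ordered by
    their position in the original text; an empty list means no malicious code."""
    hits = [d for d in per_subset if d.probability > threshold]
    return sorted(hits, key=lambda d: d.start_line)
```

With five module vectors and r = 3, the second subset holds the fourth and fifth vectors followed by a copy of the fourth, so every subset fed to the model has the same shape.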

Abstract

Disclosed in the embodiments of the present application are a malicious code detection method and apparatus, an electronic device, and a storage medium. The method comprises: acquiring text data to be tested; according to the semantic logic, dividing said text data into at least one piece of sub-text data; determining a sub-text vector corresponding to the sub-text data; and inputting the sub-text vector into an optimal neural network model, so as to determine a malicious code detection result.

Description

Malicious code detection method, apparatus, electronic device, and storage medium
This disclosure claims priority to Chinese patent application No. 202211104769.9, filed with the China Patent Office on September 9, 2022, the entire content of which is incorporated herein by reference.
Technical field
This application relates to the field of computer security technology, for example to a malicious code detection method, apparatus, electronic device, and storage medium.
Background
In daily life, hacker attacks take many forms; some have obvious characteristics and others are more covert. Hackers often attack by injecting malicious code into part of a file. When the file is viewed or executed, the malicious code runs and performs dangerous actions such as planting viruses or Trojans or leaving backdoors. Once the device is connected to a network, it can easily be intruded upon and hijacked, causing significant losses.
To address this problem, the related art detects malicious code in files in two ways. One is to search the file for malicious code through keyword matching, which often fails to detect previously unknown malicious code. The other is to judge whether data is malicious code through machine learning or deep learning classification; however, this approach cannot confirm the location and the type of the malicious code at the same time.
Summary of the invention
This application provides a malicious code detection method, apparatus, device and medium that can simultaneously confirm the probability, location and type of malicious code and improve the accuracy of malicious code detection.
According to one aspect of the present application, an embodiment of the present application provides a malicious code detection method, which includes:
acquiring text data to be detected;
dividing the text data to be detected into at least one sub-text data according to semantic logic, and determining the sub-text vector corresponding to the sub-text data;
inputting the sub-text vector into an optimal neural network model to determine the detection result of the malicious code; wherein the optimal neural network model is generated based on a text data training set containing malicious code.
According to another aspect of the present application, an embodiment of the present application also provides a malicious code detection apparatus, which includes:
a data acquisition module, configured to acquire the text data to be detected;
a sub-text vector determination module, configured to divide the text data to be detected into at least one sub-text data according to semantic logic, and determine the sub-text vector corresponding to the sub-text data;
a result determination module, configured to input the sub-text vector into an optimal neural network model to determine the detection result of the malicious code; wherein the optimal neural network model is generated based on a text data training set containing malicious code.
According to another aspect of the present application, an embodiment of the present application further provides an electronic device, which includes:
at least one processor; and
a memory communicatively connected to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor, and the computer program is executed by the at least one processor so that the at least one processor can perform the malicious code detection method described in any embodiment of the present application.
According to another aspect of the present application, an embodiment of the present application also provides a computer-readable storage medium storing computer instructions which, when executed by a processor, implement the malicious code detection method described in any embodiment of the present application.
In the technical solution of the embodiments of the present application, text data to be detected is acquired and divided into at least one sub-text data according to semantic logic, the sub-text vector corresponding to the sub-text data is determined, and the sub-text vector is input into an optimal neural network model to determine the detection result of the malicious code, wherein the optimal neural network model is generated based on a text data training set containing malicious code. Dividing the acquired text data into at least one sub-text data according to semantic logic and determining the corresponding sub-text vector forms sub-text features that facilitate subsequent data processing by the optimal neural network model; using the optimal neural network model to determine the detection result of malicious code in the sub-text vector confirms the detection information corresponding to the malicious code and improves the accuracy of malicious code detection.
Brief description of the drawings
To illustrate the technical solutions in the embodiments of the present application, the drawings used in the description of the embodiments are introduced below. The drawings described below are only some embodiments of the present application; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Figure 1 is a flow chart of a malicious code detection method provided by an embodiment of the present application;
Figure 2 is a flow chart of another malicious code detection method provided by an embodiment of the present application;
Figure 3 is a flow chart of a malicious code training process provided by an embodiment of the present application;
Figure 4 is a flow chart of a malicious code detection process provided by an embodiment of the present application;
Figure 5 is a structural block diagram of a malicious code detection apparatus provided by an embodiment of the present application;
Figure 6 is a schematic structural diagram of an electronic device that can be used to implement embodiments of the present application.
Detailed description
To help those skilled in the art understand the solutions of the present application, the technical solutions in the embodiments of the present application are described below with reference to the drawings. The described embodiments are only some of the embodiments of the present application, not all of them. All other embodiments obtained by those of ordinary skill in the art based on the embodiments in this application without creative effort fall within the scope of protection of this application.
It should be noted that the terms "first", "second", etc. in the description, the claims and the above drawings of this application are used to distinguish similar objects and do not necessarily describe a specific order or sequence. It should be understood that data so used are interchangeable under appropriate circumstances, so that the embodiments described herein can be practiced in sequences other than those illustrated or described. Furthermore, the terms "include" and "have" and any variations thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product or device that comprises a series of steps or units is not necessarily limited to the steps or units listed, and may include other steps or units that are not listed or that are inherent to the process, method, product or device.
In one embodiment, Figure 1 is a flow chart of a malicious code detection method provided by an embodiment of the present application. This embodiment is applicable to detecting malicious code while various files are being transmitted. The method may be executed by a malicious code detection apparatus, which may be implemented in hardware and/or software and may be configured in an electronic device. As shown in Figure 1, the method includes:
S110. Acquire the text data to be detected.
The text data to be detected refers to text data that is awaiting detection and may contain malicious code. It may also be called character data and is the text data in various types of files, for example doc, pdf or txt files; it may contain English characters, Chinese characters, digits and other enterable characters.
In this embodiment, while files of various types are being transmitted, dumped or opened, the file type, the content information of the text data, and the page line on which each piece of content information is located can be obtained from the text data to be detected, so that the obtained text data can be processed accordingly.
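
As a minimal sketch of the acquisition step, the snippet below records a file type and keeps each non-blank line together with its original line number. The `ContentLine` record and the `extract_content` function are hypothetical names introduced for illustration; deriving the type from the file extension and working on already-decoded text are assumptions, since the application does not say how doc or pdf content is read.

```python
from dataclasses import dataclass
from pathlib import PurePath


@dataclass(frozen=True)
class ContentLine:
    file_type: str   # taken from the file extension, e.g. "txt" or "py"
    line_no: int     # 1-based line number in the original text
    text: str        # raw text of that line


def extract_content(filename: str, raw_text: str) -> list[ContentLine]:
    """Record the file type and keep every non-blank line with its original number."""
    file_type = PurePath(filename).suffix.lstrip(".").lower() or "unknown"
    return [
        ContentLine(file_type, line_no, line)
        for line_no, line in enumerate(raw_text.splitlines(), start=1)
        if line.strip()
    ]


records = extract_content("notes.txt", "hello\n\nSELECT * FROM users\n")
```

Keeping the original line numbers, rather than renumbering after blank lines are dropped, is what later allows a detection to be reported as a location in the source file.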
S120. Divide the text data to be detected into at least one sub-text data according to semantic logic, and determine the sub-text vector corresponding to the sub-text data.
Semantic logic can be understood as the logical relationship between sentences, or the character symbols between statements within a page. Logical relationships may be, for example, coordination, continuation, contrast, cause and result, purpose, or concession.
In this embodiment, sub-text data can be understood as the text corresponding to each module after the text data to be detected has been divided into modules according to semantic logic. The text data to be detected can be divided into at least one module, and each module corresponds to one sub-text data; after modular division, the sub-text data may come from a code file or from a non-script file. The sub-text vector can be understood as the vector obtained for each module by vectorizing its sub-text data with a doc2vec model or a word2vec model.
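
The modular division can be sketched as follows, assuming for illustration that the code file is Python source and that only top-level structures are separated; the application does not name a parser or a target language, so the use of Python's `ast` module and both function names are assumptions. Functions, classes, loops and conditionals each become one module, and consecutive simple statements are grouped as one sequential module. Prose is split on sentence-ending punctuation.

```python
import ast
import re


def split_code_modules(source: str) -> list[str]:
    """Split Python source into top-level modules: functions, classes, loops,
    conditionals, and runs of consecutive simple (sequential) statements."""
    block_types = (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef,
                   ast.For, ast.While, ast.If)
    modules: list[str] = []
    sequential: list[str] = []
    for node in ast.parse(source).body:
        segment = ast.get_source_segment(source, node) or ""
        if isinstance(node, block_types):
            if sequential:                      # close the pending sequential run
                modules.append("\n".join(sequential))
                sequential = []
            modules.append(segment)
        else:
            sequential.append(segment)
    if sequential:
        modules.append("\n".join(sequential))
    return modules


def split_prose_modules(text: str) -> list[str]:
    """Split a non-script document after sentence-ending punctuation."""
    parts = re.split(r"(?<=[.!?;])\s+", text.strip())
    return [p for p in parts if p]
```

Other compound statements such as `try` or `with` fall into the sequential group in this sketch; a fuller version would decide how to classify them.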
In this embodiment, the content data in the text data to be detected can be extracted and filtered to obtain English data. Depending on the format and type of the English data, for example a non-script document or programming code, the English data can be divided according to its logical content, and the sub-text data of each resulting module can be vectorized with a doc2vec model or a word2vec model. Alternatively, vectorization can use the feature vector of the call sequence obtained from the function call graph corresponding to the text data to be detected.
S130. Input the sub-text vectors into the optimal neural network model to determine the malicious code detection result, where the optimal neural network model is generated from a training set of text data containing malicious code.
Here, the optimal neural network model is trained on training sets of text data drawn from public data sources that contain malicious code. Malicious code, also called malware, refers to any software or code that may conflict with an organization's security policy. Such code serves no useful purpose but introduces risk. It may be software that is installed and run on a user's computer or other terminal without clearly prompting the user or without the user's permission, and that infringes the user's legitimate rights and interests; it may also be computer code that is deliberately written or configured and poses an actual or potential threat to a network or system. In this embodiment, the type of malicious code may be SQL injection or an XSS attack, that is, a cross-site scripting attack.
In this embodiment, the detection result for malicious code includes the location of the malicious code, the type of the malicious code, and the probability corresponding to that type.
In this embodiment, the sub-text vectors corresponding to the sub-text data of each module after segmentation can be input into the optimal neural network model to determine the malicious code detection result. For example, the sub-text vectors corresponding to the text data to be detected can be assembled into a sub-text vector set, the set can be split into at least one subset according to a preset number, and the subset or subsets can be input into the optimal neural network model. Processing then depends on the number of subsets: if there is a single subset, its detection result is used directly as the output of the optimal neural network model; if there are at least two subsets, the detection results of the subsets are combined to determine the output.
In one embodiment, training the optimal neural network model includes:
extracting text data from a training set of text data containing malicious code, where the text data in the training set includes the location and type of the malicious code;
extracting content data from the text data, and filtering out the non-English data in the content data to obtain English data;
dividing the English data into at least one piece of sub-text data, and using a doc2vec model to determine the sub-text vector corresponding to each piece of sub-text data;
generating, according to the location and type of the malicious code in the training set, the location, type, and corresponding probability label of the malicious code in the sub-text data, to form the text data training set;
iteratively training the neural network model on the text data training set, updating its parameters and weight values, until the loss function reaches its minimum, and then outputting the optimal neural network model; otherwise, adjusting the parameters and weight values of the neural network model and repeating the iterative training process.
Here, the text data training set consists of text data containing malicious code, and the text data in the training set includes the location and type of the malicious code. For example, file 1 contains malicious code whose type is an XSS attack and whose location is line 6 of the page.
The content data may include at least one of the following: the text type, the text information, the row and/or column in which the text information is located, and the word count of the text information.
In this embodiment, the optimal neural network model is trained as follows. Text data containing malicious code is extracted from the text data training set, where the text data in the training set includes the location and type of the malicious code. The content data in the text data is extracted, such as the text information in doc, pdf, or txt files, the line on which the text is located, and its word count, to generate an original text set that preserves the page line numbers of the original text data. The non-English data in the content data is filtered out to obtain English data. The English data is divided into at least one piece of sub-text data, and a doc2vec model is used to determine the sub-text vector corresponding to each piece of sub-text data. According to the location and type of the malicious code in the training set, the location, type, and corresponding probability label of the malicious code in the sub-text data are generated to form the text data training set. The neural network model is then trained iteratively on the text data training set, updating its parameters and weight values, until the loss function reaches its minimum and the optimal neural network model is output; otherwise, the parameters and weight values are adjusted and the iterative training process is repeated.
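The label-generation step in this training procedure can be illustrated with a minimal sketch. It assumes that each piece of sub-text data records the range of original page lines it covers and that a file-level annotation gives the malicious line range and type; the names `SubText`, `Label`, and `label_sub_texts` are hypothetical and do not come from the application.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SubText:
    start_line: int  # first original page line covered (1-based, inclusive)
    end_line: int    # last original page line covered (inclusive)


@dataclass
class Label:
    start_line: Optional[int]  # malicious lines inside this sub-text, if any
    end_line: Optional[int]
    kind: Optional[str]        # e.g. "xss" or "sql_injection"
    probability: float         # 1.0 if the sub-text overlaps the annotation


def label_sub_texts(sub_texts: List[SubText], mal_start: int,
                    mal_end: int, kind: str) -> List[Label]:
    """Project a file-level annotation onto each piece of sub-text data."""
    labels = []
    for st in sub_texts:
        lo = max(st.start_line, mal_start)
        hi = min(st.end_line, mal_end)
        if lo <= hi:  # the sub-text overlaps the malicious range
            labels.append(Label(lo, hi, kind, 1.0))
        else:
            labels.append(Label(None, None, None, 0.0))
    return labels


# File 1 from the example: an XSS attack on line 6 of the page.
labels = label_sub_texts([SubText(1, 5), SubText(6, 8), SubText(9, 12)], 6, 6, "xss")
print(labels[1])
```

Only the sub-text covering line 6 receives a positive label in this sketch; how the application actually encodes the probability label is not specified.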
Note that the overall loss function of the neural network model in the training stage is the sum of a location loss and a classification loss. The neural network model is trained over many iterations so that accuracy is maximized and the error rate of the whole network is reduced, and the loss function is used to correct the deviation between the true location and the predicted location. The overall loss in the training stage can be written as LossTotal = loss(location) + loss(classify). Here, loss(location) can be expressed with a loss from the Intersection over Union (IOU) family. When the GIOU loss is used, for example, A is the predicted location, B is the true location, and C is the smallest convex enclosing box of A and B; then IOU = |A∩B| / |A∪B| and GIOU = IOU - |C \ (A∪B)| / |C|, where C \ (A∪B) is the part of C covered by neither A nor B. loss(classify) is the cross-entropy loss function.
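As a worked illustration of these formulas, the sketch below computes IOU and GIOU for one-dimensional locations. It rests on several assumptions that are not stated in the application: locations are treated as continuous intervals (start, end) over page lines, the location loss is taken as 1 - GIOU (the common form of the GIOU loss), and the classification loss is the cross-entropy of the predicted probability of the true class. All function names are hypothetical.

```python
import math
from typing import Sequence, Tuple

Interval = Tuple[float, float]  # (start, end) with end > start


def interval_iou_giou(pred: Interval, true: Interval) -> Tuple[float, float]:
    """Return (IOU, GIOU) for a predicted interval A and a true interval B."""
    a0, a1 = pred
    b0, b1 = true
    inter = max(0.0, min(a1, b1) - max(a0, b0))  # |A ∩ B|
    union = (a1 - a0) + (b1 - b0) - inter        # |A ∪ B|
    c = max(a1, b1) - min(a0, b0)                # |C|, smallest enclosing interval
    iou = inter / union
    giou = iou - (c - union) / c                 # IOU - |C \ (A ∪ B)| / |C|
    return iou, giou


def cross_entropy(probs: Sequence[float], true_index: int) -> float:
    """Cross-entropy for one sample given predicted class probabilities."""
    return -math.log(probs[true_index])


def total_loss(pred: Interval, true: Interval,
               probs: Sequence[float], true_index: int) -> float:
    """LossTotal = loss(location) + loss(classify), with loss(location) = 1 - GIOU."""
    _, giou = interval_iou_giou(pred, true)
    return (1.0 - giou) + cross_entropy(probs, true_index)


print(interval_iou_giou((0, 4), (2, 6)))  # overlapping intervals
print(interval_iou_giou((0, 2), (4, 6)))  # disjoint intervals: GIOU is negative
```

For disjoint intervals IOU is zero regardless of distance, while GIOU becomes more negative as the gap grows, which is why an IOU-family loss with the enclosing-box term still gives a useful signal when the prediction misses entirely.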
Note that when new text data becomes available, the original model can be fine-tuned, or it can be retrained on the new data combined with the old data.
In the technical solution of this embodiment of the present application, text data to be detected is obtained, the text data is divided into at least one piece of sub-text data according to semantic logic, the sub-text vector corresponding to each piece of sub-text data is determined, and the sub-text vectors are input into the optimal neural network model to determine the malicious code detection result, where the optimal neural network model is generated from a training set of text data containing malicious code. Dividing the obtained text data into at least one piece of sub-text data according to semantic logic and determining the corresponding sub-text vectors produces sub-text features that simplify subsequent processing by the optimal neural network model. Using the optimal neural network model to determine the malicious code detection result for the sub-text vectors makes it possible to confirm the detection information associated with the malicious code and improves the accuracy of malicious code detection.
In one embodiment, Figure 2 is a flowchart of another malicious code detection method provided by an embodiment of the present application. Building on the embodiments above, this embodiment refines two steps: dividing the text data to be detected into at least one piece of sub-text data according to semantic logic and determining the corresponding sub-text vectors, and inputting the sub-text vectors into the optimal neural network model to determine the malicious code detection result. As shown in Figure 2, the malicious code detection method in this embodiment may include the following steps:
S210. Obtain the text data to be detected.
S220. Extract the content data from the text data to be detected.
Here, the content data includes at least one of the following: the text type, the text information, the row and/or column in which the text information is located, and the word count of the text information. The text type is the file type of the text data to be detected, for example a doc, pdf, or txt file. The text information can be understood as the textual content of the text data to be detected together with the attribute information of that content.
In this embodiment, the corresponding content data can be extracted from the text data to be detected with a fixed text-extraction tool and/or with toolkits and libraries for parsing text data, to generate an original text set. The original text set preserves the original content data of the text data to be detected, including its original text type, the original text information, the row and/or column in which the original text information is located, and the word count of the original text information.
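One possible shape for the original text set is sketched below. It assumes the text has already been extracted from the file as a plain string and that the word count is taken per line by whitespace splitting; the application does not name the extraction tool or define how words are counted, and `ContentRecord` and `extract_content` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ContentRecord:
    text_type: str   # file type of the source, e.g. "txt", "pdf", "doc"
    text: str        # text information on this line
    line: int        # 1-based line number on the page
    word_count: int  # whitespace-separated tokens on the line (an assumption)


def extract_content(raw_text: str, text_type: str) -> List[ContentRecord]:
    """Build an original text set that preserves the original line numbers."""
    records = []
    for line_no, line in enumerate(raw_text.splitlines(), start=1):
        records.append(ContentRecord(text_type, line, line_no, len(line.split())))
    return records


for record in extract_content("hello world\n\nselect * from t", "txt"):
    print(record)
```

Blank lines are kept as records so that later steps can still map a detection back to the original line number.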
S230. Filter out the non-English data in the content data to obtain the English data in the content data.
Here, English data can be understood as English characters.
In this embodiment, the encoding of the characters in the content data can be used to determine whether the content data in the text data to be detected is Chinese character data or English character data. Because malicious code is usually written in English characters, the Chinese character data is filtered out to obtain the English data corresponding to the content data.
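A minimal sketch of this filtering step follows. It assumes that "English data" can be approximated by the ASCII range, so every character with a code point of 128 or above is dropped; the application only says that the decision is based on character encoding, and `keep_english` is a hypothetical name. Line breaks are preserved so that line numbers in the original text set remain valid.

```python
def keep_english(text: str) -> str:
    """Drop non-ASCII characters line by line, keeping the line structure."""
    kept_lines = []
    for line in text.splitlines():
        kept_lines.append("".join(ch for ch in line if ord(ch) < 128))
    return "\n".join(kept_lines)


sample = "查询 <script>alert(1)</script>\n用户名"
print(repr(keep_english(sample)))
```

A line that contains only Chinese characters becomes an empty line rather than being removed, which keeps the mapping from filtered lines to original lines one to one.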
S240. When the English data is code, divide the English data into at least one piece of sub-text data according to the logical structure of the code, and use a doc2vec model to determine the sub-text vector corresponding to each piece of sub-text data.
Here, the logical structure includes at least one of the following: sequential logic, conditional logic, loop logic, function blocks, and classes. The doc2vec model is a model for producing word vectors and paragraph vectors; the sub-text data of each module is converted into an embedding vector with the doc2vec model.
In this embodiment, the fixed format of script documents in a programming language can be used to determine whether the filtered English data is code or a non-script document. When the English data is code, it is divided into modules according to the logical structure of the program code, for example sequential logic, conditional logic, loop logic, function blocks, and classes. Each module corresponds to one piece of sub-text data, and there is at least one module. After this division, the doc2vec model can be used to determine the sub-text vector corresponding to the sub-text data of each module. For example, let n be the number of modules obtained by segmenting a file to be detected whose English data is code. The sub-text data of each of the n modules is embedded with the doc2vec model to produce an m*k-dimensional vector, where m is the number of characters in the sub-text data of each module and k is the vocabulary size. The text data to be detected can then be represented as an n*m*k-dimensional vector.
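The division of code into modules by logical structure can be sketched with a parser. The example below assumes, purely for illustration, that the code under inspection is Python and that each top-level statement (an import, a function block, a class, a loop, and so on) forms one module; the application does not name a programming language or a parsing tool, and `split_python_source` is a hypothetical helper. It relies on the standard `ast` module, whose nodes carry `lineno` and `end_lineno` in Python 3.8 and later.

```python
import ast
from typing import List, Tuple


def split_python_source(source: str) -> List[Tuple[int, int, str]]:
    """Split source into (start_line, end_line, text) per top-level statement."""
    tree = ast.parse(source)
    lines = source.splitlines()
    blocks = []
    for node in tree.body:
        start, end = node.lineno, node.end_lineno
        blocks.append((start, end, "\n".join(lines[start - 1:end])))
    return blocks


example = (
    "import os\n"
    "\n"
    "def f(x):\n"
    "    if x:\n"
    "        return 1\n"
    "    return 0\n"
    "\n"
    "class A:\n"
    "    pass\n"
)
for start, end, text in split_python_source(example):
    print(start, end, repr(text))
```

Keeping the start and end lines with each module means a detection on a module can later be reported as a position in the original file.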
S250. When the English data is a non-script document, divide the English data into at least one piece of sub-text data according to the punctuation marks among its characters, and use a doc2vec model to determine the sub-text vector corresponding to each piece of sub-text data.
Here, a non-script document may be a document of the doc, pdf, or txt type, among others. The punctuation marks among the characters of the English data are the punctuation marks within each line of characters on the page. For example, the characters in each line are split into one sentence at each punctuation mark.
In this embodiment, if the English data is determined to be a non-script document, it is divided into modules according to the punctuation marks among its characters. Each module corresponds to one piece of sub-text data, and there is at least one module. After this division, the doc2vec model can be used for embedding, to determine the sub-text vector corresponding to the sub-text data of each module.
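Splitting a non-script document at punctuation marks can be sketched with a regular expression. The set of delimiters used here (period, exclamation mark, question mark, semicolon) is an assumption; the application does not list which punctuation marks count, and `split_by_punctuation` is a hypothetical name.

```python
import re
from typing import List

# One run of non-delimiter characters, optionally followed by one delimiter.
_SENTENCE = re.compile(r"[^.!?;]+[.!?;]?")


def split_by_punctuation(text: str) -> List[str]:
    """Split English text into sentence-like pieces at punctuation marks."""
    pieces = (m.group().strip() for m in _SENTENCE.finditer(text))
    return [p for p in pieces if p]


print(split_by_punctuation("Hello world. Is it ok? Yes; done"))
# ['Hello world.', 'Is it ok?', 'Yes;', 'done']
```

Each returned piece would then be one piece of sub-text data to be embedded.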
S260. Assemble the sub-text vectors corresponding to the text data to be detected into a sub-text vector set.
In this embodiment, the sub-text vectors corresponding to the text data to be detected form the corresponding sub-text vector set. Note that the set may contain one or more sub-text vectors.
S270. Split the sub-text vector set into at least one subset according to a preset number.
Here, the preset number is a fixed, preconfigured number of sub-text vectors (one per module) in each subset. It can be chosen from experience or set manually. Each subset contains the sub-text vectors corresponding to one or more pieces of sub-text data.
In this embodiment, because different text data to be detected is divided into different numbers of modules, that is, different numbers of pieces of sub-text data, the sub-text vector set of each text data to be detected must be split into subsets of a fixed size. If the sub-text data of each of the n modules is embedded with the doc2vec model to produce an m*k-dimensional vector, then splitting with a fixed size r gives each subset an r*m*k-dimensional vector, where m is the number of characters in the sub-text data of each module and k is the vocabulary size. Note that when the final subset contains fewer than r sub-text vectors, its own sub-text data must be duplicated until it contains r.
Note that if the text data to be detected contains malicious code, then after the sub-text vectors are split according to the preset number, the location of the malicious code within a subset may or may not coincide with its location in the original text data to be detected.
In one embodiment, splitting the sub-text vector set into at least one subset according to the preset number includes:
if the number of sub-text vectors in the sub-text vector set exceeds the preset number, selecting the preset number of sub-text vectors and assigning them to one subset, and repeating until the number of remaining sub-text vectors is less than the preset number;
if the number of sub-text vectors in the sub-text vector set is less than the preset number, duplicating sub-text vectors so that the duplicated and original sub-text vectors together equal the preset number, and assigning the duplicated and original sub-text vectors to one subset.
In this embodiment, when the number of sub-text vectors exceeds the preset number, the preset number of sub-text vectors is selected and assigned to one subset, and this is repeated until the number of remaining sub-text vectors is less than the preset number. When the number of sub-text vectors is less than the preset number, sub-text vectors are duplicated so that the duplicated and original sub-text vectors together equal the preset number, and all of them are assigned to one subset. Note that if exactly 1 sub-text vector remains, that sub-text vector is duplicated to pad the subset to the preset number; if at least 2 sub-text vectors remain, either one of them or several of them may be duplicated to pad the subset to the preset number.
For example, suppose the preset number, that is, the standardized count, is 5. If the text data to be detected yields 15 sub-text vectors, they are split into 3 subsets of 5. If it yields 17 sub-text vectors, they are split into 4 subsets, and the fourth subset contains only 2 sub-text vectors. The fourth subset must then be padded to the standardized count of 5, which can be done by duplicating its 2 sub-text vectors.
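The splitting and padding in this example can be sketched as follows. The sketch pads a short final subset by cycling over that subset's own members, which is one of the options the text allows; `split_into_subsets` is a hypothetical name, and plain integers stand in for sub-text vectors.

```python
from itertools import cycle, islice
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_into_subsets(vectors: Sequence[T], r: int) -> List[List[T]]:
    """Split vectors into subsets of exactly r items, padding the last one."""
    if r <= 0:
        raise ValueError("r must be positive")
    if not vectors:
        return []
    subsets = [list(vectors[i:i + r]) for i in range(0, len(vectors), r)]
    last = subsets[-1]
    if len(last) < r:
        # Duplicate the subset's own members, in order, until it holds r items.
        last.extend(islice(cycle(list(last)), r - len(last)))
    return subsets


print(len(split_into_subsets(list(range(15)), 5)))  # 3 subsets, no padding
print(split_into_subsets(list(range(17)), 5)[-1])   # [15, 16, 15, 16, 15]
```

With 17 vectors and a preset number of 5, the fourth subset starts with 2 vectors and is padded with 3 duplicates, matching the example above.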
S280. Input the at least one subset into the optimal neural network model to determine the malicious code detection result, where the detection result includes the location, type, and corresponding probability of the malicious code.
In this embodiment, the optimal neural network model examines the at least one input subset to determine the detection result for malicious code in the text data to be detected. If malicious code is present, the optimal neural network model outputs its location, type, and corresponding probability. Note that processing can differ according to the number of subsets, so that a detection result is obtained both in the single-subset case and in the case of at least two subsets.
S290. If the sub-text vectors in the sub-text vector set form a single subset, use the detection result of that subset as the output of the optimal neural network model.
In this embodiment, when the sub-text vectors form a single subset, the detection result of that subset is used directly as the output of the optimal neural network model. For example, if the probability corresponding to the type for the single subset is below the preset probability, the detection result is that the text data to be detected contains no malicious code; if that probability exceeds the preset probability, the detection result is that the text data to be detected contains malicious code, and the location, category, and probability of the malicious code in the single subset are output directly.
In one embodiment, using the detection result of the single subset as the output of the optimal neural network model includes:
when the probability corresponding to the type for the single subset is below the preset probability, determining that the detection result is that the text data to be detected contains no malicious code;
when the probability corresponding to the type for the single subset exceeds the preset probability, determining that the detection result is that the text data to be detected contains malicious code, and outputting the location, category, and probability of the malicious code in the single subset.
In this embodiment, when the probability corresponding to the type for the single subset is below the preset probability, the detection result is that the text data to be detected contains no malicious code; when that probability exceeds the preset probability, the detection result is that the text data to be detected contains malicious code, and the location, category, and probability of the malicious code in the single subset are output. For example, with a preset probability of 5%, if the probability corresponding to the type for the single subset exceeds 5%, the detection result is that the text data to be detected contains malicious code, and the optimal neural network model outputs the current location, category, and probability of the malicious code, such as the line of the page on which it appears, the category it belongs to, and the probability assigned to it.
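The single-subset decision can be sketched as a threshold check. The `Detection` record and `decide_single_subset` function are hypothetical, the default threshold of 0.05 comes from the 5% example, and treating a probability exactly equal to the threshold as "no malicious code" is an assumption, since the text only covers the below and above cases.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Detection:
    line: int           # line of the page where the code was found
    kind: str           # category, e.g. "xss" or "sql_injection"
    probability: float  # probability the model assigns to that category


def decide_single_subset(det: Detection,
                         threshold: float = 0.05) -> Optional[Detection]:
    """Return the detection if its probability exceeds the threshold, else None."""
    if det.probability > threshold:
        return det
    return None


print(decide_single_subset(Detection(6, "xss", 0.90)))  # reported
print(decide_single_subset(Detection(6, "xss", 0.01)))  # None: nothing reported
```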
S2100. If the sub-text vectors in the sub-text vector set form at least two subsets, determine the output of the optimal neural network model from the combined detection results of the at least two subsets.
In this embodiment, if the sub-text vectors in the sub-text vector set form at least two subsets, the output of the optimal neural network model is determined from the combined detection results of those subsets. For example, if the probabilities corresponding to the types for the at least two subsets are below the preset probability, the detection result is that the text data to be detected contains no malicious code; if those probabilities exceed the preset probability, the detection result is that the text data to be detected contains malicious code, and the content data corresponding to the subsets must be merged in the subsets' original order to determine the location, category, and probability of the malicious code across the at least two subsets.
In one embodiment, determining the output of the optimal neural network model from the combined detection results of the at least two subsets includes:
when the probabilities corresponding to the types for the at least two subsets are below the preset probability, determining that the detection result is that the text data to be detected contains no malicious code;
when the probabilities corresponding to the types for the at least two subsets exceed the preset probability, determining that the detection result is that the text data to be detected contains malicious code, and merging the content data corresponding to the subsets in the subsets' original order to determine the location, category, and probability of the malicious code in the at least two subsets.
In this embodiment, when the probabilities corresponding to the types for the at least two subsets are below the preset probability, the detection result is that the text data to be detected contains no malicious code; when those probabilities exceed the preset probability, the detection result is that the text data to be detected contains malicious code, and the content data corresponding to the subsets is merged in the subsets' original order to determine the location, category, and probability of the malicious code in the at least two subsets.
For example, if the detection result is that the text data to be detected contains malicious code, the location of the malicious code was also split when the subsets were formed. Suppose the malicious code occupies lines 7 to 10 of the original data. After splitting with a fixed size of 8, the malicious code is on lines 7 and 8 of the first subset and on lines 1 and 2 of the second subset. The content data of the subsets must therefore be merged in the order of their original line numbers to determine the location, type, and probability of the malicious code in the original data.
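The position merge in this example can be sketched as an offset calculation. The sketch assumes that each subset covers exactly r consecutive original lines and that the subsets are supplied in their original order; duplicated padding entries in a final subset are not handled. `merge_subset_lines` is a hypothetical name.

```python
from typing import List


def merge_subset_lines(per_subset_lines: List[List[int]], r: int) -> List[int]:
    """Map subset-local line numbers (1-based) back to original line numbers."""
    merged = []
    for subset_index, local_lines in enumerate(per_subset_lines):
        offset = subset_index * r  # lines covered by the preceding subsets
        merged.extend(offset + line for line in local_lines)
    return sorted(set(merged))


# Lines 7-8 of the first subset and lines 1-2 of the second, with r = 8.
print(merge_subset_lines([[7, 8], [1, 2]], 8))  # [7, 8, 9, 10]
```

The result recovers lines 7 to 10 of the original data, as in the example.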
In the above technical solution of the embodiments of the present application, content data is extracted from the text data to be detected, and the non-English data in the content data is filtered out to obtain the English data. When the English data is code, it is divided into at least one piece of sub-text data according to the logical structure of the code, and the doc2vec model is used to determine the sub-text vector corresponding to the sub-text data. When the English data is a non-script document, it is divided into at least one piece of sub-text data according to the punctuation marks in the English data, and the doc2vec model is likewise used to determine the corresponding sub-text vector. This forms the sub-text features and facilitates subsequent data processing by the optimal neural network model. The sub-text vectors are split into at least one subset according to the preset number, and the at least one subset is input into the optimal neural network model to determine the detection result of the malicious code, where the detection result includes the location, type, and corresponding probability of the malicious code. If the sub-text data forms a single subset, the detection result of that subset is used as the output of the optimal neural network model. If the sub-text data forms at least two subsets, the output of the optimal neural network model is determined from the combined detection results of the at least two subsets. The probability, location, and type of the malicious code can thus be confirmed at the same time, which improves the accuracy of malicious code detection.
In an embodiment, the malicious code detection method can be divided into a training stage and a detection stage. To aid understanding of the method, FIG. 3 is a flowchart of a malicious code training process provided by an embodiment of the present application, and FIG. 4 is a flowchart of a malicious code detection process provided by an embodiment of the present application. Here, "text data" corresponds to the text data to be detected in the above embodiments, "non-English characters" to the non-English data, "subsample" to the subset, and "preset module number" to the preset number.
First, the steps of the training stage are as follows:
S310. Collect text data containing malicious code.
In this embodiment, data is collected to determine the training data.
S320. Extract the text content data from the text data and generate an original text set, preserving the original line numbers.
S330. Filter the non-English characters out of the text content data to obtain the English characters.
S340. Divide the English characters into modules.
In this embodiment, the text data is segmented into modules based on semantic logic, and the number of modules is denoted n. If the text data containing English characters is a code file, it is divided by structures such as sequential logic, conditional logic, loop logic, function blocks, and classes. If the text data containing English characters is a non-script document such as a doc, pdf, or txt file, it is split by the statements within each line. In both cases n modules are produced.
The modular division of English characters in this embodiment corresponds to dividing the text data to be detected into at least one piece of sub-text data according to semantic logic in the above embodiments. One module represents one piece of sub-text data.
S350. Vectorize the text data of each module into an embedding with the doc2vec model, generating an m*k-dimensional vector.
In this embodiment, the vector of one file is therefore an n*m*k-dimensional vector. The module vectorization in this embodiment corresponds to determining the sub-text vector of the sub-text data in the above embodiments.
S360. Split according to the preset module number to generate subsamples.
In this embodiment, the splitting uses a fixed length. Because files contain different numbers of modules, each file vector is split by a fixed number of modules r to obtain subsample vectors, each of which is an r*m*k-dimensional vector. A trailing split that contains fewer than r modules replicates its own data until it contains r modules.
S370. Based on the location and type of the malicious code in the text data training set, generate the location, type, and corresponding probability labels of the malicious code in each sample, to form the text data training set.
S380. Input the subsample training data into the constructed neural network model and iterate the training to obtain the final optimal neural network model.
In this embodiment, an end-to-end neural network model is constructed, the number of network layers is defined, and the loss function and optimizer are determined.
In this embodiment, the overall loss function of the training stage is the sum of a location loss and a classification loss: LossTotal=loss(location)+loss(classify). Here, loss(location) can be expressed with a loss from the IOU family, such as GIOU. For example, if A is the predicted location, B is the real location, and C is the smallest convex closed box enclosing A and B, then
IOU=|A∩B|/|A∪B|
GIOU=IOU-|C\(A∪B)|/|C|
where C\(A∪B) denotes the part of C that is covered by neither A nor B.
Here, loss(classify) is the cross-entropy loss function.
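The IOU and GIOU expressions can be illustrated with a small sketch. It assumes, purely for illustration, that a location is a one-dimensional inclusive range of line numbers (start, end), so that C is the smallest range covering both A and B. The function names and the use of 1 - GIOU as the location loss are assumptions; the source only states that a loss from the IOU family may be used.

```python
def giou_1d(pred, true):
    """GIOU of two inclusive line ranges given as (start, end)."""
    a_start, a_end = pred
    b_start, b_end = true
    len_a = a_end - a_start + 1
    len_b = b_end - b_start + 1
    # |A ∩ B|: overlap length, zero when the ranges are disjoint
    inter = max(0, min(a_end, b_end) - max(a_start, b_start) + 1)
    # |A ∪ B|
    union = len_a + len_b - inter
    # |C|: smallest range enclosing both A and B
    len_c = max(a_end, b_end) - min(a_start, b_start) + 1
    iou = inter / union
    # |C \ (A ∪ B)| equals |C| - |A ∪ B| because A ∪ B lies inside C
    return iou - (len_c - union) / len_c


def location_loss(pred, true):
    """Assumed convention: the loss falls to 0 as GIOU rises to 1."""
    return 1.0 - giou_1d(pred, true)


print(giou_1d((7, 10), (7, 10)))       # 1.0  (perfect prediction)
print(giou_1d((1, 4), (7, 10)))        # -0.2 (disjoint ranges are penalised)
print(location_loss((7, 8), (7, 10)))  # 0.5
```

Unlike plain IOU, which is 0 for every pair of disjoint ranges, the GIOU term still distinguishes a near miss from a distant one, which is why it is a common choice for a location loss.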
In this embodiment, some parameters of the model are adjusted continuously during training. In the optimal neural network model, the weights and parameters associated with the network layers are fixed. For example, the network may have 3 layers, the second of which has 3 neurons, together with the weight of each neuron and the learning rate.
As shown in FIG. 4, the steps of the malicious code detection stage are as follows:
S410. Obtain the text data to be detected and extract the content data from it.
S420. Filter the non-English data out of the content data to obtain the English data in the content data.
S430. Divide the English data in the text data into modules based on semantic logic.
In this embodiment, the text data is segmented into modules based on semantic logic, and the number of modules is denoted n. If the text data containing English characters is a code file, it is divided by structures such as sequential logic, conditional logic, loop logic, function blocks, and classes. If the text data containing English characters is a non-script document such as a doc, pdf, or txt file, it is split by the statements within each line. In both cases n modules are produced.
S440. Vectorize the text data of each module into an embedding with the doc2vec model, generating an m*k-dimensional vector.
S450. Split according to the preset module number to generate subsamples.
In this embodiment, the splitting uses a fixed length. Because files contain different numbers of modules, each file vector is split by a fixed number of modules r to obtain subsample vectors, each of which is an r*m*k-dimensional vector. A trailing split that contains fewer than r modules replicates its own data until it contains r modules.
S460. Determine whether there is only a single subsample. If so, perform S470; if not, perform S480.
S470. Determine whether the probability corresponding to the type of the single subsample is lower than the preset probability. If so, perform S471; otherwise, perform S472.
S471. No malicious code is present.
S472. Malicious code is present; output the location, category, and probability of the malicious code in the single subsample.
S480. Determine whether the probabilities corresponding to the types of the at least two subsamples are lower than the preset probability. If so, perform S481; otherwise, perform S482.
S481. No malicious code is present.
S482. Malicious code is present; merge the content data corresponding to the subsamples in their original order to determine the location, category, and probability of the malicious code after the subsamples are merged.
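The branching in S460 to S482 can be sketched as below. This is a hedged illustration, not the patented implementation: the `SubsampleResult` type, the `decide` function, and the category label used in the example are hypothetical. The sketch also assumes that "no malicious code" is reported only when every subsample falls below the preset probability, which is one possible reading of S480, and that positions are 1-based line numbers within a subsample.

```python
from dataclasses import dataclass, field


@dataclass
class SubsampleResult:
    probability: float  # model confidence for the predicted type
    category: str       # predicted malicious code type
    lines: list = field(default_factory=list)  # 1-based lines inside the subsample


def decide(results, preset_probability, subsample_size):
    """Apply the threshold test and merge findings in original subsample order."""
    findings = []
    for index, result in enumerate(results):
        # S470/S480: "lower than the preset probability" means not malicious,
        # so a probability equal to the threshold counts as malicious here.
        if result.probability < preset_probability:
            continue
        offset = index * subsample_size  # restore original line numbers
        findings.append({
            "lines": [offset + n for n in result.lines],
            "category": result.category,
            "probability": result.probability,
        })
    return {"malicious": bool(findings), "findings": findings}


results = [
    SubsampleResult(0.9, "example-type", [7, 8]),
    SubsampleResult(0.8, "example-type", [1, 2]),
]
report = decide(results, preset_probability=0.5, subsample_size=8)
print(report["malicious"])             # True
print(report["findings"][1]["lines"])  # [9, 10]
```

The single-subsample case (S470 to S472) is simply the same function called with a list of length one.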
In an embodiment, FIG. 5 is a structural block diagram of a malicious code detection apparatus provided by an embodiment of the present application. The apparatus is suitable for detecting malicious code while files of various kinds are being transmitted, and may be implemented in hardware and/or software. It may be configured in an electronic device to implement the malicious code detection method of the embodiments of the present application. As shown in FIG. 5, the apparatus includes a data acquisition module 510, a sub-text vector determination module 520, and a result determination module 530.
The data acquisition module 510 is configured to obtain the text data to be detected;
the sub-text vector determination module 520 is configured to divide the text data to be detected into at least one piece of sub-text data according to semantic logic and to determine the sub-text vector corresponding to the sub-text data;
the result determination module 530 is configured to input the sub-text vector into an optimal neural network model to determine the detection result of the malicious code, wherein the optimal neural network model is generated based on a text data training set containing malicious code.
In the embodiments of the present application, the sub-text vector determination module divides the obtained text data to be detected into at least one piece of sub-text data according to semantic logic and determines the corresponding sub-text vector, which forms the sample features and facilitates subsequent data processing by the optimal neural network model. The result determination module uses the optimal neural network model to determine the detection result of the malicious code in the sub-text vector, which confirms the detection information associated with the malicious code and improves the accuracy of malicious code detection.
In an embodiment, the sub-text vector determination module 520 includes:
a data extraction unit, configured to extract the content data from the text data to be detected, wherein the content data includes at least one of the following: the text type, the text information, the row and/or column in which the text information is located, and the word count of the text information;
a data filtering unit, configured to filter the non-English data out of the content data to obtain the English data in the content data;
a first sub-text vector determination unit, configured to, when the English data is code, divide the English data into at least one piece of sub-text data according to the logical structure of the code and use the doc2vec model to determine the sub-text vector corresponding to the sub-text data, wherein the logical structure includes at least one of the following: sequential logic, conditional logic, loop logic, function blocks, and classes;
a second sub-text vector determination unit, configured to, when the English data is a non-script document, divide the English data into at least one piece of sub-text data according to the punctuation marks in the English data and use the doc2vec model to determine the sub-text vector corresponding to the sub-text data.
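The filtering and division performed by these units can be sketched as follows, leaving the doc2vec step aside. The sketch rests on several assumptions that the source does not state: "English data" is approximated as printable ASCII, the code being divided is Python source split at its top-level statements with the standard `ast` module, and prose is split after sentence-ending punctuation. All three function names are hypothetical.

```python
import ast
import re


def keep_english(text):
    """Drop every character outside printable ASCII, keeping line breaks."""
    return re.sub(r"[^\x20-\x7E\n]", "", text)


def split_code_by_structure(source):
    """Split Python source into its top-level statements, function blocks and classes."""
    tree = ast.parse(source)
    return [ast.get_source_segment(source, node) for node in tree.body]


def split_by_punctuation(text):
    """Split prose into sentence-like pieces after '.', '!', '?' or ';'."""
    pieces = re.split(r"(?<=[.!?;])\s+", text.strip())
    return [piece for piece in pieces if piece]


code = "import os\n\ndef f():\n    return 1\n\nfor i in range(3):\n    print(i)\n"
print(len(split_code_by_structure(code)))                   # 3
print(split_by_punctuation("Open the file. Run it! Done"))  # ['Open the file.', 'Run it!', 'Done']
print(keep_english("abc中文def"))                            # abcdef
```

Each returned piece plays the role of one piece of sub-text data; in the described apparatus each piece would then be passed to the doc2vec model to obtain its sub-text vector.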
In an embodiment, the result determination module 530 includes:
a set forming unit, configured to form a sub-text vector set from the sub-text vectors corresponding to the text data to be detected;
a subset dividing unit, configured to split the sub-text vector set into at least one subset according to a preset number;
a result determination unit, configured to input the at least one subset into the optimal neural network model to determine the detection result of the malicious code, wherein the detection result includes the location, type, and corresponding probability of the malicious code;
a first result output unit, configured to, if the sub-text vectors in the sub-text vector set form a single subset, use the detection result of the single subset as the output of the optimal neural network model;
a second result output unit, configured to, if the sub-text vectors in the sub-text vector set form at least two subsets, determine the output of the optimal neural network model from the combined detection results of the at least two subsets.
In an embodiment, the subset dividing unit includes:
a first subset dividing subunit, configured to, if the number of sub-text vectors in the sub-text vector set exceeds the preset number, repeatedly select the preset number of sub-text vectors and assign them to one subset, until the number of remaining sub-text vectors is less than the preset number;
a second subset dividing subunit, configured to, if the number of sub-text vectors in the sub-text vector set is less than the preset number, replicate the sub-text vectors so that the replicated sub-text vectors and the original sub-text vectors together equal the preset number, and assign the replicated sub-text vectors and the original sub-text vectors to one subset.
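The behaviour of the two subset dividing subunits can be sketched in a few lines. The function name `split_into_subsets` is hypothetical, and the way the short final subset is filled (cycling through its own vectors in order) is an assumption; the source only says that the vectors are replicated until the preset number is reached.

```python
def split_into_subsets(vectors, preset_number):
    """Split vectors into subsets of preset_number items, padding the last one."""
    if preset_number <= 0:
        raise ValueError("preset_number must be positive")
    if not vectors:
        return []
    # First subunit: take preset_number vectors at a time, in original order.
    subsets = [vectors[i:i + preset_number]
               for i in range(0, len(vectors), preset_number)]
    # Second subunit: pad a short subset by replicating its own vectors.
    last = subsets[-1]
    source_index = 0
    while len(last) < preset_number:
        last.append(last[source_index])
        source_index += 1
    return subsets


# Ten module vectors (stand-in integers here) with a preset number of 8.
print(split_into_subsets(list(range(10)), 8))
# [[0, 1, 2, 3, 4, 5, 6, 7], [8, 9, 8, 9, 8, 9, 8, 9]]
```

Padding by replication keeps every subset the same shape, which a fixed-size network input requires, without introducing vectors that do not come from the file itself.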
In an embodiment, the first result output unit includes:
a first result determination subunit, configured to, when the probability corresponding to the type of the single subset is lower than the preset probability, determine that the detection result is that the text data to be detected contains no malicious code;
a second result determination subunit, configured to, when the probability corresponding to the type of the single subset exceeds the preset probability, determine that the detection result is that the text data to be detected contains malicious code, and output the location, category, and probability of the malicious code in the single subset.
In an embodiment, the second result output unit includes:
a third result output subunit, configured to, when the probabilities corresponding to the types of the at least two subsets are lower than the preset probability, determine that the detection result is that the text data to be detected contains no malicious code;
a fourth result output subunit, configured to, when the probabilities corresponding to the types of the at least two subsets exceed the preset probability, determine that the detection result is that the text data to be detected contains malicious code, and merge the content data corresponding to the subsets in the original order of the subsets to determine the location, category, and probability of the malicious code in the at least two subsets.
In an embodiment, the training of the optimal neural network model includes:
extracting the text data from a text data training set containing malicious code, wherein the text data in the text data training set includes the location and type of the malicious code;
extracting content data from the text data and filtering the non-English data out of the content data to obtain English data;
dividing the English data into at least one piece of sub-text data and using the doc2vec model to determine the sub-text vector corresponding to the sub-text data;
generating, based on the location and type of the malicious code in the text data training set, the location, type, and corresponding probability labels of the malicious code in the sub-text data, to form the text data training set;
iteratively training the neural network model on the text data training set, updating its parameters and weight values, until the loss function reaches its minimum, and then outputting the optimal neural network model; otherwise, adjusting the parameters and weight values of the neural network model and repeating the iterative training process.
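The label generation step, in which the known location of the malicious code is carried over to the individual pieces, can be sketched as below. This is an assumption-laden illustration: the function name `labels_per_subset` and the label layout are hypothetical, locations are taken to be 1-based line numbers, each piece is taken to cover a fixed number of lines, and the probability label is assumed to be 1.0 where malicious code is present and 0.0 elsewhere.

```python
def labels_per_subset(malicious_lines, total_lines, subset_size, category):
    """Turn original malicious line numbers into one label per fixed-size piece."""
    labels = []
    for start in range(1, total_lines + 1, subset_size):
        end = min(start + subset_size - 1, total_lines)
        # Re-express the original line numbers relative to this piece.
        local = [n - start + 1 for n in malicious_lines if start <= n <= end]
        labels.append({
            "lines": local,
            "category": category if local else None,
            "probability": 1.0 if local else 0.0,
        })
    return labels


# Malicious code on lines 7-10 of a 16-line sample, pieces of 8 lines each.
for label in labels_per_subset([7, 8, 9, 10], 16, 8, "example-type"):
    print(label["lines"], label["probability"])
# [7, 8] 1.0
# [1, 2] 1.0
```

This is the inverse of the merging performed at detection time: training labels are expressed in piece-local coordinates, and detection results are mapped back to the original coordinates.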
The malicious code detection apparatus provided by the embodiments of the present application can perform the malicious code detection method provided by any embodiment of the present application, and has the functional modules and beneficial effects corresponding to the method.
In an embodiment, FIG. 6 is a schematic structural diagram of an electronic device that can be used to implement the embodiments of the present application. The electronic device 10 is intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smartphones, wearable devices (for example helmets, glasses, and watches), and other similar computing devices. The components shown herein, their connections and relationships, and their functions are examples only.
As shown in FIG. 6, the electronic device 10 includes at least one processor 11 and memory communicatively connected to the at least one processor 11, such as a read-only memory (ROM) 12 and a random access memory (RAM) 13. The memory stores a computer program executable by the at least one processor. The processor 11 can perform various appropriate actions and processing according to the computer program stored in the ROM 12 or the computer program loaded from the storage unit 18 into the RAM 13. The RAM 13 may also store the various programs and data required for the operation of the electronic device 10. The processor 11, the ROM 12, and the RAM 13 are connected to one another via a bus 14. An input/output (I/O) interface 15 is also connected to the bus 14.
Multiple components of the electronic device 10 are connected to the I/O interface 15, including: an input unit 16, such as a keyboard or a mouse; an output unit 17, such as various types of displays and speakers; a storage unit 18, such as a magnetic disk or an optical disc; and a communication unit 19, such as a network card, a modem, or a wireless communication transceiver. The communication unit 19 allows the electronic device 10 to exchange information/data with other devices over a computer network such as the Internet and/or various telecommunication networks.
The processor 11 may be any of various general-purpose and/or special-purpose processing components with processing and computing capabilities. Examples of the processor 11 include a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various processors that run machine learning model algorithms, a digital signal processor (DSP), and any appropriate processor, controller, or microcontroller. The processor 11 performs the methods and processing described above, such as the malicious code detection method.
In some embodiments, the malicious code detection method may be implemented as a computer program tangibly embodied in a computer-readable storage medium, such as the storage unit 18. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM 12 and/or the communication unit 19. When the computer program is loaded into the RAM 13 and executed by the processor 11, one or more steps of the malicious code detection method described above can be performed. Alternatively, in other embodiments, the processor 11 may be configured to perform the malicious code detection method in any other appropriate manner (for example, by means of firmware).
Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuitry, a field-programmable gate array (FPGA), an application-specific integrated circuit (ASIC), an application-specific standard product (ASSP), a system on chip (SoC), a complex programmable logic device (CPLD), computer hardware, firmware, software, and/or combinations thereof. These implementations may include implementation in one or more computer programs that can be executed and/or interpreted on a programmable system including at least one programmable processor. The programmable processor may be a special-purpose or general-purpose programmable processor that can receive data and instructions from a storage system, at least one input device, and at least one output device, and transmit data and instructions to the storage system, the at least one input device, and the at least one output device.
Computer programs for implementing the method of the present application may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general-purpose computer, a special-purpose computer, or another programmable data processing apparatus, so that the computer programs, when executed by the processor, cause the functions/operations specified in the flowcharts and/or block diagrams to be carried out. A computer program may execute entirely on a machine, partly on a machine, as a stand-alone software package partly on a machine and partly on a remote machine, or entirely on a remote machine or server.
In the context of the present application, a computer-readable storage medium may be a tangible medium that can contain or store a computer program for use by, or in connection with, an instruction execution system, apparatus, or device. A computer-readable storage medium may include an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer-readable storage medium may be a machine-readable signal medium. Examples of machine-readable storage media include an electrical connection based on one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM) or flash memory, an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
为了提供与用户的交互,可以在电子设备上实施此处描述的系统和技术,该电子设备具有:用于向用户显示信息的显示装置(例如,阴极射线管(Cathode Ray Tube,CRT)或者液晶显示器(Liquid Crystal Display,LCD)或者监视器);以及键盘和指向装置(例如,鼠标或者轨迹球),用户可以通过该键盘和该指向装置来将输入提供给电子设备。其它种类的装置还可以用于提供与用户的交互;例如,提供给用户的反馈可以是任何形式的传感反馈(例如,视觉反馈、听觉反馈、或者触觉反馈);并且可以用任何形式(包括声输入、语音输入或者、触觉输入)来接收来自用户的输入。To provide interaction with a user, the systems and techniques described herein may be implemented on an electronic device having a display device (e.g., a cathode ray tube (CRT) or liquid crystal) for displaying information to the user. A display (Liquid Crystal Display, LCD or monitor); and a keyboard and pointing device (e.g., a mouse or a trackball) through which a user can provide input to the electronic device. Other kinds of devices may also be used to provide interaction with the user; for example, the feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and may be provided in any form, including Acoustic input, voice input or tactile input) to receive input from the user.
可以将此处描述的系统和技术实施在包括后台部件的计算系统(例如,作为数据服务器)、或者包括中间件部件的计算系统(例如,应用服务器)、或者包括前端部件的计算系统(例如,具有图形用户界面或者网络浏览器的用户计算机,用户可以通过该图形用户界面或者该网络浏览器来与此处描述的系统和技术的实施方式交互)、或者包括这种后台部件、中间件部件、或者前端部件的任何组合的计算系统中。可以通过任何形式或者介质的数字数据通信(例如,通信网络)来将系统的部件相互连接。通信网络的示例包括:局域网(Local Area Network,LAN)、广域网(Wide Area Network,WAN)、区块链网络和互联网。The systems and techniques described herein may be implemented in a computing system that includes back-end components (e.g., as a data server), or a computing system that includes middleware components (e.g., an application server), or a computing system that includes front-end components (e.g., A user's computer having a graphical user interface or web browser through which the user can interact with implementations of the systems and technologies described herein), or including such backend components, middleware components, or any combination of front-end components in a computing system. The components of the system may be interconnected by any form or medium of digital data communication (eg, a communications network). Examples of communication networks include: Local Area Network (LAN), Wide Area Network (WAN), blockchain network, and the Internet.
计算系统可以包括客户端和服务器。客户端和服务器一般远离彼此并且通常通过通信网络进行交互。通过在相应的计算机上运行并且彼此具有客户端-服务器关系的计算机程序来产生客户端和服务器的关系。服务器可以是云服务器, 又称为云计算服务器或云主机,是云计算服务体系中的一项主机产品,以解决了传统物理主机与VPS服务中,存在的管理难度大,业务扩展性弱的缺陷。Computing systems may include clients and servers. Clients and servers are generally remote from each other and typically interact over a communications network. The relationship of client and server is created by computer programs running on corresponding computers and having a client-server relationship with each other. The server can be a cloud server, Also known as cloud computing server or cloud host, it is a host product in the cloud computing service system, which solves the shortcomings of difficult management and weak business scalability in traditional physical hosts and VPS services.
It should be understood that the various forms of processes shown above may be used with steps reordered, added, or deleted. For example, the steps described in this application may be executed in parallel, sequentially, or in a different order, as long as the result expected from the technical solution of this application can be achieved.

Claims (10)

  1. A method for detecting malicious code, comprising:
    obtaining text data to be detected;
    dividing the text data to be detected into at least one piece of sub-text data according to semantic logic, and determining a sub-text vector corresponding to the sub-text data;
    inputting the sub-text vector into an optimal neural network model to determine a detection result for malicious code; wherein the optimal neural network model is generated based on a text data training set containing malicious code.
  2. The method according to claim 1, wherein dividing the text data to be detected into at least one piece of sub-text data according to semantic logic, and determining the sub-text vector corresponding to the sub-text data, comprises:
    extracting content data from the text data to be detected, wherein the content data includes at least one of the following: a text type, text information, the row and/or column in which the text information is located, and the word count corresponding to the text information;
    filtering out non-English data from the content data to obtain the English data in the content data;
    when the English data is code, dividing the English data into at least one piece of sub-text data according to the logical structure of the code, and using a doc2vec model to determine the sub-text vector corresponding to the sub-text data; wherein the logical structure includes at least one of the following: sequential logic, conditional logic, loop logic, a function block, and a class;
    when the English data is a non-script document, dividing the English data into at least one piece of sub-text data according to the punctuation marks among the characters in the English data, and using the doc2vec model to determine the sub-text vector corresponding to the sub-text data.
  3. The method according to claim 1, wherein inputting the sub-text vector into the optimal neural network model to determine the detection result for malicious code comprises:
    forming a sub-text vector set from the sub-text vectors corresponding to the text data to be detected;
    dividing the sub-text vector set into at least one subset according to a preset number;
    inputting the at least one subset into the optimal neural network model to determine the detection result for malicious code, wherein the detection result includes the location, type, and corresponding probability of the malicious code;
    if the number of sub-text vectors in the sub-text vector set forms a single subset, using the detection result of the single subset as the output result of the optimal neural network model;
    if the number of sub-text vectors in the sub-text vector set forms at least two subsets, determining the output result of the optimal neural network model from the combined detection results of the at least two subsets.
  4. The method according to claim 3, wherein dividing the sub-text vector set into at least one subset according to the preset number comprises:
    if the number of sub-text vectors in the sub-text vector set exceeds the preset number, selecting the preset number of sub-text vectors at a time and assigning them to one subset, until the number of remaining sub-text vectors is less than the preset number;
    if the number of sub-text vectors in the sub-text vector set is less than the preset number, copying sub-text vectors so that the copied sub-text vectors and the original sub-text vectors together total the preset number, and assigning the copied sub-text vectors and the original sub-text vectors to one subset.
  5. The method according to claim 3, wherein using the detection result of the single subset as the output result of the optimal neural network model comprises:
    when the probability corresponding to the type of the single subset is lower than a preset probability, determining that the detection result is that the text data to be detected contains no malicious code;
    when the probability corresponding to the type of the single subset exceeds the preset probability, determining that the detection result is that the text data to be detected contains malicious code, and outputting the location, category, and probability of the malicious code in the single subset.
  6. The method according to claim 3, wherein determining the output result of the optimal neural network model from the combined detection results of the at least two subsets comprises:
    when the probabilities corresponding to the types of the at least two subsets are lower than a preset probability, determining that the detection result is that the text data to be detected contains no malicious code;
    when the probabilities corresponding to the types of the at least two subsets exceed the preset probability, determining that the detection result is that the text data to be detected contains malicious code, and merging the content data corresponding to the subsets in their original order to determine the location, category, and probability of the malicious code in the at least two subsets.
  7. The method according to claim 1, wherein training the optimal neural network model comprises:
    extracting text data from a text data training set containing malicious code, wherein the text data in the training set includes the location and type of the malicious code;
    extracting content data from the text data, and filtering out the non-English data in the content data to obtain English data;
    dividing the English data into at least one piece of sub-text data, and using a doc2vec model to determine the sub-text vector corresponding to the sub-text data;
    generating, according to the location and type of the malicious code in the text data training set, labels for the location, type, and corresponding probability of the malicious code in the sub-text data, to form the text data training set;
    iteratively training a neural network model on the text data training set, updating its parameters and weight values, until the loss function reaches a minimum, and then outputting the optimal neural network model; otherwise, adjusting the parameters and weight values of the neural network model and repeating the iterative training process.
  8. An apparatus for detecting malicious code, comprising:
    a data acquisition module, configured to obtain text data to be detected;
    a sub-text vector determination module, configured to divide the text data to be detected into at least one piece of sub-text data according to semantic logic, and to determine a sub-text vector corresponding to the sub-text data;
    a result determination module, configured to input the sub-text vector into an optimal neural network model to determine a detection result for malicious code; wherein the optimal neural network model is generated based on a text data training set containing malicious code.
  9. An electronic device, comprising:
    at least one processor; and
    a memory communicatively connected to the at least one processor; wherein
    the memory stores a computer program executable by the at least one processor, and the computer program is executed by the at least one processor so that the at least one processor can perform the method for detecting malicious code according to any one of claims 1-7.
  10. A computer-readable storage medium storing computer instructions, wherein the computer instructions, when executed by a processor, implement the method for detecting malicious code according to any one of claims 1-7.
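
The subset splitting of claim 4 and the thresholding of claims 5 and 6 can be illustrated with a short Python sketch. This is only an illustration under stated assumptions, not the applicant's implementation. The names `Detection`, `split_into_subsets`, and `combine_results` are hypothetical. The claims do not say how a trailing remainder is handled when the vector count exceeds the preset number, so the sketch assumes it is padded by copying, as in the second branch of claim 4. The claims also leave open whether every subset or any subset must exceed the preset probability; the sketch assumes any.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Vector = Tuple[float, ...]


@dataclass(frozen=True)
class Detection:
    """Hypothetical per-subset result: location, category, probability."""
    position: int       # index of the subset in its original order
    category: str
    probability: float


def split_into_subsets(vectors: Sequence[Vector], preset: int) -> List[List[Vector]]:
    """Split vectors into subsets of exactly `preset` items (claim 4)."""
    if preset <= 0:
        raise ValueError("preset must be positive")
    subsets: List[List[Vector]] = []
    for start in range(0, len(vectors), preset):
        original = list(vectors[start:start + preset])
        chunk = list(original)
        # Assumption: a short subset is padded by cycling through copies
        # of its own vectors until it reaches the preset number.
        i = 0
        while len(chunk) < preset:
            chunk.append(original[i % len(original)])
            i += 1
        subsets.append(chunk)
    return subsets


def combine_results(results: Sequence[Detection], threshold: float) -> List[Detection]:
    """Keep detections above the preset probability, in original order.

    An empty list means no malicious code was found (claims 5 and 6).
    Assumption: a probability equal to the threshold is not a hit.
    """
    hits = [r for r in results if r.probability > threshold]
    return sorted(hits, key=lambda r: r.position)
```

With a preset number of 2, five vectors yield three subsets, the last of which holds the fifth vector twice. Passing the per-subset results of a model through `combine_results` then gives either an empty list or the flagged locations, categories, and probabilities in their original order.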
PCT/CN2023/093383 2022-09-09 2023-05-11 Malicious code detection method and apparatus, electronic device, and storage medium WO2024051196A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211104769.9 2022-09-09
CN202211104769.9A CN115455416A (en) 2022-09-09 2022-09-09 Malicious code detection method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2024051196A1 (en) 2024-03-14

Family

ID=84302126

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/093383 WO2024051196A1 (en) 2022-09-09 2023-05-11 Malicious code detection method and apparatus, electronic device, and storage medium

Country Status (2)

Country Link
CN (1) CN115455416A (en)
WO (1) WO2024051196A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115455416A (en) * 2022-09-09 2022-12-09 上海派拉软件股份有限公司 Malicious code detection method and device, electronic equipment and storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112685739A (en) * 2020-12-31 2021-04-20 卓尔智联(武汉)研究院有限公司 Malicious code detection method, data interaction method and related equipment
CN113239354A (en) * 2021-04-30 2021-08-10 武汉科技大学 Malicious code detection method and system based on recurrent neural network
CN114253866A (en) * 2022-03-01 2022-03-29 紫光恒越技术有限公司 Malicious code detection method and device, computer equipment and readable storage medium
CN114357443A (en) * 2021-12-13 2022-04-15 北京六方云信息技术有限公司 Malicious code detection method, equipment and storage medium based on deep learning
CN114579965A (en) * 2021-12-31 2022-06-03 厦门服云信息科技有限公司 Malicious code detection method and device and computer readable storage medium
CN114692156A (en) * 2022-05-31 2022-07-01 山东省计算中心(国家超级计算济南中心) Memory segment malicious code intrusion detection method, system, storage medium and equipment
CN114817913A (en) * 2021-01-19 2022-07-29 腾讯科技(深圳)有限公司 Code detection method and device, computer equipment and storage medium
CN115455416A (en) * 2022-09-09 2022-12-09 上海派拉软件股份有限公司 Malicious code detection method and device, electronic equipment and storage medium

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112685739A (en) * 2020-12-31 2021-04-20 卓尔智联(武汉)研究院有限公司 Malicious code detection method, data interaction method and related equipment
CN114817913A (en) * 2021-01-19 2022-07-29 腾讯科技(深圳)有限公司 Code detection method and device, computer equipment and storage medium
CN113239354A (en) * 2021-04-30 2021-08-10 武汉科技大学 Malicious code detection method and system based on recurrent neural network
CN114357443A (en) * 2021-12-13 2022-04-15 北京六方云信息技术有限公司 Malicious code detection method, equipment and storage medium based on deep learning
CN114579965A (en) * 2021-12-31 2022-06-03 厦门服云信息科技有限公司 Malicious code detection method and device and computer readable storage medium
CN114253866A (en) * 2022-03-01 2022-03-29 紫光恒越技术有限公司 Malicious code detection method and device, computer equipment and readable storage medium
CN114692156A (en) * 2022-05-31 2022-07-01 山东省计算中心(国家超级计算济南中心) Memory segment malicious code intrusion detection method, system, storage medium and equipment
CN115455416A (en) * 2022-09-09 2022-12-09 上海派拉软件股份有限公司 Malicious code detection method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN115455416A (en) 2022-12-09
