CN111488422A - Incremental method and device for structured data sample, electronic equipment and medium - Google Patents

Info

Publication number
CN111488422A
Authority
CN
China
Prior art keywords
new
sample
original
webpage
structured data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910074352.4A
Other languages
Chinese (zh)
Inventor
王大伟
杨荣海
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sangfor Technologies Co Ltd
Original Assignee
Sangfor Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sangfor Technologies Co Ltd
Priority to CN201910074352.4A
Publication of CN111488422A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Storage Device Security (AREA)

Abstract

The application discloses an incremental method for structured data samples, which aims to overcome the technical defect that existing generative adversarial networks (GANs) cannot generate new samples from original samples whose data type is structured data. With the technical solution provided by the application, the field of application of GANs is extended to structured data, so that high-quality new samples can also be generated from structured data samples using a GAN. The application also discloses an incremental apparatus, an electronic device, and a computer-readable storage medium for structured data samples, which have the same beneficial effects.

Description

Incremental method and device for structured data sample, electronic equipment and medium
Technical Field
The present application relates to the field of new sample generation, and in particular, to an incremental method and apparatus for structured data samples, an electronic device, and a computer-readable storage medium.
Background
A Generative Adversarial Network (GAN) is a deep learning model and one of the most promising methods in recent years for unsupervised learning on complex distributions. Unlike other deep learning models, a GAN produces high-quality output through a game between a generative model and a discriminative model. This game is also a process of adversarial learning: the generative model generates new data from the input data and strives for that data to pass the discriminative model, while the discriminative model judges as strictly as possible, screens out new data that does not meet the requirements, and outputs only the new data that passes. Through continued adversarial learning, higher-quality data is eventually output.
Because of these characteristics, GANs are widely used to obtain high-quality new samples from original samples. However, the samples for which a GAN can generate new samples are limited to continuous data expressed as quantized values. GANs cannot be applied to structured data containing structural information (such as special formats in a file header or header information, structural relationships between different data, position information, etc.), because an existing GAN generates new samples by adjusting the quantized values of the original samples, whereas structural information is usually expressed in a non-quantized manner, which existing GANs do not support.
Taking an image as an example, a new image is generated from an original image by modifying the gray values of some pixels: if the gray value of a pixel is 200, the GAN can obtain a new sample by changing that value to 201. For the word "Apple", however, there is no word "Apple + 1". Structured data likewise contains non-quantized structural information, so new samples of this type of data cannot be obtained with an existing GAN. For structured data such as web pages, which contain a large amount of structural information, network security requires enough high-quality samples for security judgment. Simply waiting for real samples does not allow good detection of continually emerging new features. This is especially true for tampered web pages containing malicious data, whose types of malicious data change daily: only by obtaining high-quality samples more promptly and more proactively can the detection accuracy for tampered web pages truly be improved.
Therefore, how to overcome the technical defect that existing GANs cannot be used to generate new samples from original samples whose data type is structured data is a problem that those skilled in the art urgently need to solve.
Disclosure of Invention
The application provides an incremental method and apparatus for structured data samples, an electronic device, and a computer-readable storage medium, and aims to solve the problem that existing generative adversarial networks cannot be used to generate new samples from original samples whose data type is structured data.
To achieve the above object, the present application first provides an incremental method for structured data samples, the incremental method comprising:
acquiring an original sample whose data type is structured data;
converting non-quantized structured data in the original sample into quantization parameters;
inputting the original sample, represented using the quantization parameters, into a generative adversarial network;
obtaining a new sample in the generative adversarial network by modifying the quantization parameters.
Optionally, converting the non-quantized structured data in the original sample into quantization parameters includes:
converting the non-quantized structured data in the original sample into quantized feature vectors;
correspondingly, obtaining a new sample in the generative adversarial network by modifying the quantization parameters includes:
obtaining the new sample in the generative adversarial network by modifying parameters of the feature vector.
Optionally, obtaining the new sample in the generative adversarial network by modifying parameters of the feature vector includes:
generating noise according to a preset noise generation rule, and representing the noise as a noise vector;
splicing the feature vector and the noise vector in the generative adversarial network to obtain a new feature vector and a new sample corresponding to the new feature vector.
Optionally, when the original sample whose data type is structured data is specifically an original web page, before converting the non-quantized structured data in the original sample into quantization parameters, the method further includes:
representing the non-quantized elements in the original web page as a DOM tree, wherein each non-quantized element serves as a node of the DOM tree, and the parent-child relationships among the nodes are consistent with the original structural relationships among the non-quantized elements;
correspondingly, converting the non-quantized structured data in the original sample into quantization parameters includes:
converting the non-quantized elements of the original web page, expressed in DOM tree form, into quantization parameters.
Optionally, when the original sample whose data type is structured data is specifically an original web page, the method further includes:
verifying whether a new web page generated by the generative adversarial network can be parsed normally by a browser;
outputting as the new sample only new web pages verified as normally parsable by the browser.
Optionally, when the original sample whose data type is structured data is specifically an original tampered web page among the original web pages, the method further includes:
extracting new tampered-web-page discrimination features from a new tampered web page obtained in the generative adversarial network by modifying the quantization parameters;
adding the new tampered-web-page discrimination features to a tampered-web-page feature detection library, so as to improve the detection accuracy for tampered web pages.
To achieve the above object, the present application further provides an incremental apparatus for structured data samples, the incremental apparatus comprising:
an original sample acquisition unit, configured to acquire an original sample whose data type is structured data;
a non-quantized-to-quantized conversion unit, configured to convert non-quantized structured data in the original sample into quantization parameters;
a sample input unit, configured to input the original sample, represented using the quantization parameters, into a generative adversarial network;
a new sample generation unit, configured to obtain a new sample in the generative adversarial network by modifying the quantization parameters.
Optionally, the non-quantized-to-quantized conversion unit includes:
a feature vector conversion subunit, configured to convert the non-quantized structured data in the original sample into quantized feature vectors;
correspondingly, the new sample generation unit includes:
a new sample generation subunit, configured to obtain the new sample in the generative adversarial network by modifying parameters of the feature vector.
Optionally, the new sample generation subunit includes:
a noise and noise vector generation module, configured to generate noise according to a preset noise generation rule and represent the noise as a noise vector;
a multi-vector splicing new sample generation module, configured to splice the feature vector and the noise vector in the generative adversarial network to obtain a new feature vector and a new sample corresponding to the new feature vector.
Optionally, when the original sample whose data type is structured data is specifically an original web page, the apparatus further includes:
a DOM tree representation unit, configured to represent the non-quantized elements in the original web page as a DOM tree before the non-quantized structured data in the original sample is converted into quantization parameters, wherein each non-quantized element serves as a node of the DOM tree, and the parent-child relationships among the nodes are consistent with the original structural relationships among the non-quantized elements;
correspondingly, the non-quantized-to-quantized conversion unit includes:
a DOM-tree-form non-quantized element conversion subunit, configured to convert the non-quantized elements of the original web page, expressed in DOM tree form, into quantization parameters.
Optionally, when the original sample whose data type is structured data is specifically an original web page, the apparatus further includes:
a normal parsing verification unit, configured to verify whether a new web page generated by the generative adversarial network can be parsed normally by a browser;
a new sample output screening unit, configured to output as the new sample only new web pages verified as normally parsable by the browser.
Optionally, when the original sample whose data type is structured data is specifically an original tampered web page among the original web pages, the apparatus further includes:
a new tampered-web-page discrimination feature extraction unit, configured to extract new tampered-web-page discrimination features from a new tampered web page obtained in the generative adversarial network by modifying the quantization parameters;
a tampered-web-page feature library supplementing unit, configured to add the new tampered-web-page discrimination features to a tampered-web-page feature detection library, so as to improve the detection accuracy for tampered web pages.
To achieve the above object, the present application also provides an electronic device, including:
a memory for storing a computer program;
a processor for implementing the incremental method for structured data samples described above when executing the computer program.
To achieve the above object, the present application further provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the incremental method for structured data samples described above.
To overcome the technical defect that existing generative adversarial networks cannot generate new samples from original samples whose data type is structured data, the application performs a conversion before the original sample is input into the generative adversarial network: non-quantized structured data is converted into quantization parameters, that is, non-quantized structural information is expressed using quantized parameters. The original sample expressed in quantization parameters then satisfies the precondition under which a generative adversarial network generates new samples from original samples. With the technical solution provided by the application, the field of application of generative adversarial networks is extended to structured data, so that high-quality new samples can also be generated from structured data samples using a generative adversarial network.
The application also provides an incremental apparatus, an electronic device, and a computer-readable storage medium for structured data samples, which have the same beneficial effects and are not described again here.
Drawings
To illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below are only embodiments of the present application; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart of an incremental method for structured data samples according to an embodiment of the present application;
FIG. 2 is a flowchart of an incremental method for structured data samples that converts non-quantized structured data into quantized feature vectors, according to an embodiment of the present application;
FIG. 3 is a flowchart for obtaining, by multi-vector splicing, a new feature vector corresponding to a new sample, according to an embodiment of the present application;
FIG. 4 is a flowchart of an incremental method for tampered web pages according to an embodiment of the present application;
FIG. 5 is a detailed flow diagram corresponding to FIG. 4;
FIG. 6 is a block diagram of an incremental apparatus for structured data samples according to an embodiment of the present application;
FIG. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The application provides an incremental method and apparatus for structured data samples, an electronic device, and a computer-readable storage medium, and aims to solve the problem that existing generative adversarial networks cannot be used to generate new samples from original samples whose data type is structured data.
To make the objects, technical solutions, and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are some, but not all, of the embodiments of the present application. All other embodiments that a person skilled in the art can derive from the embodiments given here without creative effort fall within the protection scope of the present application.
Example one
Referring to FIG. 1, FIG. 1 is a flowchart of an incremental method for structured data samples according to an embodiment of the present application, including the following steps:
S101: acquiring an original sample whose data type is structured data;
This step acquires an original sample whose data type is structured data. Web pages, PDF files, PE files (Portable Executable files; common EXE, DLL, OCX, SYS, and COM files are PE files), ELF files (a format for binaries, executables, object code, shared libraries, and core dumps, common under the Linux operating system), and the like all contain a large amount of structured data. When the original sample is a web page, it may be a white web page (a normal web page that contains no malicious data) or a tampered web page (also called a black web page, one that contains malicious data), and the incremental result should be a web page of the corresponding type. Which type of web page to increase can be chosen flexibly according to actual needs and is not specifically limited here.
S102: converting non-quantized structured data in the original sample into quantization parameters;
Building on S101, this step converts the non-quantized structured data in the original sample into quantization parameters. That is, structural information that a GAN cannot process in its original form is expressed as quantized, continuous data that the GAN can process, so that high-quality new samples can be generated from the original sample while satisfying the conditions for using a GAN.
The key to converting structured data into quantization parameters is how to express structural information with quantization parameters. Consider a tree diagram: each node is connected to its parent node, and this connection is a kind of structural information, expressed in the diagram by the lines between nodes. In the tree example, this step therefore amounts to converting the connecting lines between nodes into quantization parameters.
This process can be implemented in various forms; all that is required is that the structural information be represented by suitable quantization parameters. A suitable conversion rule may need to be defined in advance so that the conversion is reversible, which also makes it possible in later steps to obtain new quantization parameters by modifying the existing ones and to convert them back to the original representation. For example, the non-quantized structured data may be converted into any one or a combination of quantized representations such as one or more feature vectors, sets containing multiple elements, matrices, and tables. Each of these representations may contain several specific quantization parameters. The quantization parameters converted from the structural information may exist as one or more separate parameters alongside the other quantization parameters, or may not exist as separate parameters. The choice can be made flexibly according to the constraints or special requirements of the actual application scenario and is not specifically limited here.
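The tree-diagram example above can be sketched as follows. This is only one possible encoding, not one prescribed by the application: it assumes the structural information is a set of parent-to-child connections and encodes it as a 0/1 adjacency matrix, which can be converted back without loss. The function names `tree_to_matrix` and `matrix_to_edges`, and the sample node names, are hypothetical.

```python
# Illustrative sketch only: the application does not prescribe this encoding.
# Assumption: structural information is a list of (parent, child) connections.

def tree_to_matrix(nodes, edges):
    """Encode parent->child connections as a 0/1 adjacency matrix."""
    index = {name: i for i, name in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for parent, child in edges:
        matrix[index[parent]][index[child]] = 1
    return matrix

def matrix_to_edges(nodes, matrix):
    """Invert the encoding: recover the parent->child connections."""
    return [(nodes[i], nodes[j])
            for i, row in enumerate(matrix)
            for j, bit in enumerate(row) if bit]

# Hypothetical four-node tree.
nodes = ["html", "head", "body", "p"]
edges = [("html", "head"), ("html", "body"), ("body", "p")]

matrix = tree_to_matrix(nodes, edges)
recovered = matrix_to_edges(nodes, matrix)
```

Because the matrix can be turned back into the same connections, a modified matrix can likewise be mapped back to a (possibly different) tree, which is the reversibility property the paragraph above asks of the conversion rule.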
S103: inputting the original sample, represented using the quantization parameters, into a generative adversarial network;
S104: obtaining a new sample in the generative adversarial network by modifying the quantization parameters.
S103 inputs the original sample, represented by the quantization parameters, into the GAN, so that the GAN can process the structured data sample after its data type has been converted. S104 modifies the quantization parameters within the GAN to obtain modified quantization parameters, and finally obtains the new sample corresponding to them.
In the process of obtaining a new sample, the generative model first produces a preliminary new sample from the input original sample represented by the quantization parameters. The preliminary new sample is then sent to the discriminative model, and a new sample that passes the discriminative model's judgment is taken as the final new sample.
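The generate-then-discriminate flow can be illustrated with a deliberately simplified sketch. In a real GAN both models are trained neural networks; here, purely as an assumption for illustration, the generator is a random perturbation and the discriminator is a fixed closeness rule. All names (`generate`, `discriminate`, `new_samples`) and the tolerance value are hypothetical.

```python
import random

def generate(original, rng):
    """Toy stand-in for the generative model: perturb each quantized value."""
    return [x + rng.uniform(-0.5, 0.5) for x in original]

def discriminate(candidate, original, tolerance=0.3):
    """Toy stand-in for the discriminative model: accept only candidates
    whose every value stays within `tolerance` of the original."""
    return all(abs(c - o) <= tolerance for c, o in zip(candidate, original))

def new_samples(original, n_candidates, seed=0):
    """Produce preliminary samples, then keep those that pass discrimination."""
    rng = random.Random(seed)
    candidates = [generate(original, rng) for _ in range(n_candidates)]
    return [c for c in candidates if discriminate(c, original)]

original = [1.0, 0.0, 1.0]          # original sample as quantization parameters
accepted = new_samples(original, n_candidates=50)
```

Only the candidates that pass `discriminate` are kept, mirroring how preliminary new samples become final new samples only after passing the discriminative model.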
The quantization parameters can be modified either by directly changing some parameter values in the original quantization parameters, or by combining them with other quantization parameters to obtain new ones. When the quantization parameters are expressed as feature vectors, the latter approach can take the form of splicing (concatenating) several vectors.
To output higher-quality new samples, the generative model and the discriminative model in the GAN can additionally be updated by alternating training during adversarial learning. One way, among others, is as follows: if the current discriminative model cannot tell that a sample was produced by the generative model, the discriminative model is updated by informing it which samples were produced by the generative model, until it can accurately identify them; the discriminative model is then fixed while the generative model is updated, until the generative model can again produce new samples that the discriminative model cannot identify; and so on. Normally only a few rounds of alternating updates are needed for the GAN to output high-quality new samples.
Furthermore, depending on the type of structured data sample, type-specific auxiliary operations can be added before and after the new samples are generated. For example, when the structured data sample is a web page, the page is represented as quantization parameters, so when it is restored to an ordinary web page it is necessary to verify that the new page corresponding to the modified quantization parameters can still be parsed normally by a browser; otherwise the page would lose its most essential property. Corresponding auxiliary operations for other kinds of structured data samples follow by analogy, and those skilled in the art can readily derive them under the guidance of this idea, so they are not listed here.
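A minimal sketch of such a verification step is shown below. It does not drive a real browser: as a stated simplification, it uses Python's standard `html.parser` module and treats a page as acceptable when its start and end tags are balanced. Real browsers are far more tolerant of malformed markup, so this check is stricter than the one the text describes. The class name, the `VOID_TAGS` set, and `parses_cleanly` are hypothetical.

```python
from html.parser import HTMLParser

# Elements that have no closing tag (partial list, chosen for illustration).
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}

class BalanceChecker(HTMLParser):
    """Tracks open tags and flags any end tag that does not match."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.ok = True

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if not self.stack or self.stack.pop() != tag:
            self.ok = False

def parses_cleanly(html_text):
    """Return True if every non-void start tag is closed in the right order."""
    checker = BalanceChecker()
    checker.feed(html_text)
    checker.close()
    return checker.ok and not checker.stack

good = parses_cleanly("<html><body><p>hi<br></p></body></html>")
bad = parses_cleanly("<html><body><p>hi</body></html>")
```

A generated page that fails such a check would simply be discarded rather than output as a new sample.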
With this technical solution, converting the non-quantized structured data into quantization parameters lets the original sample satisfy the precondition under which a GAN generates new samples from original samples, which overcomes the technical defect that an existing GAN cannot directly generate new samples from original samples whose data type is structured data. The field of application of GANs is thereby extended to structured data, so that high-quality new samples can also be generated from structured data samples using a GAN.
Example two
Referring to FIG. 2, FIG. 2 is a flowchart of an incremental method for structured data samples that converts non-quantized structured data into quantized feature vectors, according to an embodiment of the present application. It differs from Example one in that it provides a specific implementation in which the non-quantized structured data is converted into quantized feature vectors, including the following steps:
S201: acquiring an original sample whose data type is structured data;
S202: converting non-quantized structured data in the original sample into quantized feature vectors;
This step uses vectors as the quantized representation of the non-quantized structured data. Because a structured data sample contains many structured data items, the feature vector used here is preferably a multi-dimensional vector whose dimensionality is related to the number of types, or the number of items, of structured data. Multi-dimensional quantization also makes it easier to dedicate separate quantization parameters to the structural information, and since a vector is determined by both direction and magnitude, it is to some extent better suited as a quantized representation of structured data.
In the simplest case, each dimension of the feature vector may be binary, representing presence or absence, or connection or non-connection: 1 may denote presence or connection, and 0 absence or non-connection.
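As a small sketch of the binary case, assume each dimension corresponds to one element type from a fixed vocabulary and records whether that type is present in the sample. The vocabulary below and the name `presence_vector` are hypothetical choices made for illustration, not values taken from the application.

```python
# Hypothetical vocabulary: one vector dimension per element type.
VOCABULARY = ["script", "iframe", "a", "img", "form"]

def presence_vector(tags_in_page, vocabulary=VOCABULARY):
    """Return a 0/1 vector: 1 if the element type is present, else 0."""
    present = set(tags_in_page)
    return [1 if tag in present else 0 for tag in vocabulary]

# A page containing two links, one image, and one paragraph.
vector = presence_vector(["a", "img", "a", "p"])   # -> [0, 0, 1, 1, 0]
```

Element types outside the vocabulary (here `p`) are ignored, so the vocabulary determines what the vector can express.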
Further, when the structured data sample is a web page, the structure of the multi-dimensional feature vector can conveniently be represented in the form of a DOM tree, which preserves the original element structure. DOM stands for Document Object Model. According to its specification, the DOM is an interface independent of browser, platform, and language, through which the standard components of a page can be accessed.
The hierarchy represented by the resulting DOM tree lets developers navigate the tree to find specific information. Parsing this structure typically requires loading the entire document and constructing the hierarchy before any work can be done. Because it is based on an information hierarchy, the DOM is also described as tree-based or object-based.
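To make the tree structure concrete, the sketch below extracts parent-to-child relationships from a small HTML string using Python's standard `html.parser` module. It is an approximation of a DOM tree, not a full DOM implementation, and it assumes well-formed input in which every start tag has a matching end tag. The class `DomTreeBuilder`, the `tag#index` node labels, and the synthetic `document` root are hypothetical conventions.

```python
from html.parser import HTMLParser

class DomTreeBuilder(HTMLParser):
    """Collects (parent, child) pairs; nodes are labelled 'tag#index'.
    Assumes well-formed input with a matching end tag for every start tag."""

    def __init__(self):
        super().__init__()
        self.stack = ["document"]   # synthetic root node
        self.edges = []
        self.count = 0

    def handle_starttag(self, tag, attrs):
        node = f"{tag}#{self.count}"
        self.count += 1
        self.edges.append((self.stack[-1], node))
        self.stack.append(node)

    def handle_endtag(self, tag):
        if len(self.stack) > 1:
            self.stack.pop()

def dom_edges(html_text):
    """Return the parent->child relationships of the parsed page."""
    builder = DomTreeBuilder()
    builder.feed(html_text)
    builder.close()
    return builder.edges

edges = dom_edges("<html><body><p>hi</p><a href='x'>link</a></body></html>")
```

The resulting list of relationships is exactly the kind of structural information that then has to be converted into quantization parameters.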
S203: inputting the original sample, represented using the feature vectors, into a generative adversarial network;
S204: obtaining a new sample in the generative adversarial network by modifying parameters of the feature vector.
This embodiment also provides a specific implementation for obtaining a new sample by modifying the parameters of the feature vector. Referring to FIG. 3, which shows a method of modifying the original quantization parameters by splicing several vectors, it can be implemented by the following steps:
S301: converting non-quantized structured data in the original sample into quantized feature vectors;
S302: generating noise according to a preset noise generation rule, and representing the noise as a noise vector;
The noise serves as the parameter that differentiates and modifies the quantization parameters in the original feature vector. A corresponding noise generation rule can be formulated according to the specific characteristics desired in the new sample. For example, if a new tampered web page is to be obtained from an original tampered web page, the new page must contain malicious data different from that in the original, so the noise can serve as a parameter guiding the modification of the malicious data contained in the page. Other cases follow by analogy.
S303: splicing the feature vector and the noise vector in the generative adversarial network to obtain a new feature vector and a new sample corresponding to the new feature vector.
Splicing is a common basic vector operation for fusing two different vectors. This step modifies the original feature vector by vector splicing to obtain a new feature vector and the new sample corresponding to it. One or more noise vectors may be generated from the noise, and different splicing schemes may be chosen, as long as the original feature vector can be modified; this is not specifically limited here.
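Steps S302 and S303 can be sketched as follows, under two stated assumptions: the "preset noise generation rule" is taken to be sampling from a standard Gaussian, and splicing is taken to be end-to-end concatenation. The application leaves both choices open, and the names `noise_vector` and `splice` are hypothetical.

```python
import random

def noise_vector(length, seed=None):
    """Assumed noise rule: independent standard-Gaussian values."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(length)]

def splice(feature_vector, noise):
    """Assumed splicing scheme: concatenate the two vectors end to end."""
    return list(feature_vector) + list(noise)

features = [1, 0, 1, 1]              # quantized feature vector of the sample
z = noise_vector(3, seed=42)         # noise vector
generator_input = splice(features, z)
```

The spliced vector keeps the original features intact in its leading positions and appends the noise, so different noise vectors yield different inputs for the same original sample.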
Example three
Referring to FIG. 4, FIG. 4 is a flowchart of an incremental method for tampered web pages according to this embodiment. Building on Examples one and two, and in view of the current demand for larger numbers of tampered web pages containing malicious data, this embodiment provides a specific incremental method for tampered web pages, which includes the following steps:
S401: acquiring an original tampered web page containing original malicious data;
An original tampered web page is produced when an intruder directly attacks a normal web page and inserts various malicious data; such pages can be obtained directly by conventional technical means.
S402: representing the non-quantized elements in the original tampered web page as a DOM tree;
This step expresses the non-quantized elements in the original tampered web page as a DOM tree. The DOM tree form better preserves the structural relationships among the elements of a web page, which helps limit changes to the structure of the original tampered web page during subsequent modification and thus reduces the probability that the modified new tampered web page cannot be parsed normally by a browser.
S403: converting the non-quantized elements of the original tampered web page, expressed in DOM tree form, into quantization parameters;
The non-quantized elements contained in a web page include text, images, address links, and the like.
S404: generating an anti-network by inputting an original falsified webpage represented by using a quantization parameter;
s405: and obtaining a new tampered webpage in a mode of modifying the quantitative parameters in the generation countermeasure network.
The original tampered webpage represented by quantization parameters is input into the generative adversarial network to obtain a new tampered webpage; that is, the generative adversarial network adjusts the malicious data contained in the original tampered webpage by modifying the quantization parameters, thereby obtaining a new tampered webpage containing new malicious data. In brief, after data type conversion the tampered webpage appears to the generative adversarial network as quantized parameters, so adjusting the webpage can be completed much as in the image field, where only the gray values of pixels need to be adjusted repeatedly.
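Steps S402 and S403 can be illustrated with Python's standard `html.parser` module. The sketch below is only one possible reading: the `Node` and `DomBuilder` classes, the `SENSITIVE_WORDS` lexicon, the `VOID_TAGS` set, and the rule "1 if the element's own text or attribute values contain a sensitive word" are all assumptions made for illustration.

```python
from html.parser import HTMLParser

SENSITIVE_WORDS = {"casino", "lottery"}   # hypothetical sensitive-word lexicon
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}  # tags with no end tag

class Node:
    """One DOM tree node; parent/children links mirror the page structure."""
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""

class DomBuilder(HTMLParser):
    """Build a simple DOM tree whose nodes are the page elements (S402)."""
    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        # Attribute values (e.g. link addresses) count as element content.
        node.text = " ".join(value for _, value in attrs if value)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        if self.current.tag == tag and self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        self.current.text += " " + data

def quantize(root):
    """Depth-first walk producing one binary value per element (S403)."""
    vector = []
    stack = list(reversed(root.children))
    while stack:
        node = stack.pop()
        words = set(node.text.lower().split())
        vector.append(1 if words & SENSITIVE_WORDS else 0)
        stack.extend(reversed(node.children))
    return vector

page = "<html><body><p>Welcome</p><a href='http://x.test'>casino bonus</a></body></html>"
builder = DomBuilder()
builder.feed(page)
m = quantize(builder.root)   # one entry each for html, body, p, a
```

Because the vector is produced by a fixed traversal of the tree, each position maps back to exactly one element, which is what lets a later modification of the vector be applied to the page without disturbing its structure.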
Specifically, malicious data can be adjusted in four ways: adding, deleting, replacing, and retaining. Adding means adding some malicious data on top of the old malicious data contained in the original tampered webpage; the number and positions of the additions are not limited herein. Deleting means removing some of the malicious data contained in the original tampered webpage; the number and positions of the deletions are not limited. Replacing means using some new malicious data to replace old malicious data contained in the original tampered webpage; the number and positions of the replacements are not limited. Retaining covers not only making no change at all, but also the portion of the old malicious data that is not added to, deleted, or replaced.
To further prevent the generative network from outputting new tampered webpages that cannot be parsed normally by a browser, each new tampered webpage may be verified, after it is produced by the generative model and before it is judged by the discriminative network, as to whether it can be parsed normally by a browser. This prevents the discriminative network from making useless judgments on invalid new tampered webpages, reduces the workload of the discriminative model, and improves working efficiency.
Furthermore, modification rules may be preset in the generation process of the generative model, so that under the constraint of these rules, modifications that would make the page unparsable by a browser are reduced; the later unified verification step can then be omitted, further improving efficiency.
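The verification step described above could be approximated as follows. The application does not say how "normally parsed by a browser" is tested, so this sketch substitutes a much stricter stand-in: a tag-balance check built on Python's standard `html.parser`. The names `BalanceChecker`, `is_well_formed`, and `filter_parsable` are hypothetical, and real browsers tolerate far more than this check does.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}  # tags with no end tag

class BalanceChecker(HTMLParser):
    """Track open tags and flag any end tag that does not match the last open one."""
    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.ok = True

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if not self.open_tags or self.open_tags.pop() != tag:
            self.ok = False

def is_well_formed(html_text):
    """Assumed proxy for 'can be parsed normally': every tag is properly nested and closed."""
    checker = BalanceChecker()
    checker.feed(html_text)
    checker.close()
    return checker.ok and not checker.open_tags

def filter_parsable(candidates):
    """Keep only generated pages that pass the check, before the discriminator sees them."""
    return [page for page in candidates if is_well_formed(page)]

candidates = ["<div><p>ok</p></div>", "<div><p>bad</div>", "<p>unclosed"]
kept = filter_parsable(candidates)
```

Filtering before discrimination matches the ordering in the text: invalid pages never reach the discriminative network, so it spends no effort on them.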
Meanwhile, new tampered webpage discrimination features can be extracted from the new tampered webpage obtained by modifying the quantization parameters in the generative adversarial network, and supplemented into the tampered webpage feature detection library, so that enriching the library improves the detection accuracy for tampered webpages.
Based on the above technical solution, this embodiment uses a generative adversarial network to generate tampered webpages incrementally. Starting from the original tampered webpages, high-quality new tampered webpages can be obtained actively and promptly through the generative adversarial network, and richer malicious data can be extracted from them. This expands the feature library used to detect whether a webpage has been tampered with, improving both the detection accuracy for tampered webpages and the ability to detect novel malicious data.
To deepen understanding of the quantization parameter conversion and modification described in this application, this embodiment further provides a concrete implementation of tampered webpage increment using a generative adversarial network:
First, the original tampered webpage is represented as a feature vector $m$, where $m$ is an $M$-dimensional binary vector and $m_i$ is the component corresponding to the $i$-th element in the webpage. Its value indicates whether that element contains sensitive-word text: 1 if it does, 0 otherwise;
Noise is generated according to a preset noise generation rule and represented by a noise vector $n$, where $n$ is an $N$-dimensional binary vector;
$m$ and $n$ are spliced and used as the input of the generative model, so that the generative model, guided by the spliced input vector, modifies the malicious data contained in the original tampered webpage. When the malicious data takes the form of sensitive words, the words used for the modification can be selected from a preset sensitive-word lexicon. To improve the quality of the new tampered webpages produced by the generative network, the generative network may use a DNN (Deep Neural Network) model. When the output layer has $M$ neurons and uses the tanh activation function, the output values are confined to $[-1, 1]$. Denoting the network output as $o$, each component of $o$ lies in $[-1, 1]$;
Then, the output vector $o$ is extremized to obtain a vector $o'$. For each element $o_i$ of $o$: if $o_i \le -0.5$, let $o'_i = -1$; if $o_i \ge 0.5$, let $o'_i = 1$; otherwise, let $o'_i = 0$.
The feature vector $m'$ of the newly generated tampered webpage is computed using a newgen() function. Each element $m'_i$ of $m'$ is computed as follows:
If $m_i = 0$ and $o'_i = 1$, then $m'_i = 1$, i.e., a sensitive word is added to the $i$-th element of the original tampered webpage. If $m_i = 1$ and $o'_i = -1$, then $m'_i = 0$, i.e., the sensitive word in the $i$-th element of the original tampered webpage is deleted; further, to ensure that the label of the webpage remains unchanged after deletion (i.e., it is still a tampered webpage), the number of deletions should be limited so that not all sensitive words are deleted. In all other cases, $m'_i = m_i$, i.e., the sensitive word of the $i$-th element of the original tampered webpage remains unchanged.
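The extremization and the element-wise update rule above can be sketched in a few lines. The thresholds and the three cases follow the text; the `max_deletions` parameter and its default (leave at least one original sensitive word in place) are assumptions, since the application only says the number of deletions should be limited.

```python
def extremize(o, threshold=0.5):
    """Map each generator output in [-1, 1] to -1, 0, or 1."""
    result = []
    for value in o:
        if value <= -threshold:
            result.append(-1)
        elif value >= threshold:
            result.append(1)
        else:
            result.append(0)
    return result

def newgen(m, o_prime, max_deletions=None):
    """Compute the new feature vector m' from m and the extremized output o'."""
    if max_deletions is None:
        # Assumed cap: never delete every sensitive word already on the page.
        max_deletions = max(sum(m) - 1, 0)
    new_m = []
    deletions = 0
    for m_i, o_i in zip(m, o_prime):
        if m_i == 0 and o_i == 1:
            new_m.append(1)            # add a sensitive word to element i
        elif m_i == 1 and o_i == -1 and deletions < max_deletions:
            new_m.append(0)            # delete the sensitive word in element i
            deletions += 1
        else:
            new_m.append(m_i)          # keep element i unchanged
    return new_m

m = [1, 0, 1, 0]
o = [-0.9, 0.7, -0.6, 0.2]
o_prime = extremize(o)        # [-1, 1, -1, 0]
m_new = newgen(m, o_prime)    # [0, 1, 1, 0]
```

In this example the generator asks to delete both existing sensitive words, but the cap allows only one deletion, so the third element is retained and the page keeps its "tampered" label.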
The above process is illustrated by the schematic flowchart in fig. 5 for obtaining new tampered webpages incrementally through the generative adversarial network. Fig. 5 is a graphical depiction of the text above and is consistent with it, so it is not described again here.
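For the generator itself, the text specifies only that it may be a DNN whose output layer has M neurons with a tanh activation. The forward pass below is a minimal sketch under further assumptions: a single hidden layer, tanh on the hidden layer as well, random untrained weights, and hypothetical names (`init_layer`, `dense_tanh`, `generator_forward`). Adversarial training against a discriminator is not shown.

```python
import math
import random

def init_layer(n_in, n_out, rng):
    """Random weights and zero biases for one dense layer (untrained)."""
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def dense_tanh(inputs, layer):
    """Dense layer followed by tanh, so every output lies in [-1, 1]."""
    weights, biases = layer
    return [math.tanh(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def generator_forward(m, n, layers):
    """Splice m and n, then pass the result through each layer in turn."""
    activations = list(m) + list(n)
    for layer in layers:
        activations = dense_tanh(activations, layer)
    return activations

rng = random.Random(0)
m = [1, 0, 1, 0]     # M = 4 (hypothetical feature vector)
n = [0, 1, 1]        # N = 3 (hypothetical noise vector)
hidden = 8           # assumed hidden-layer width
layers = [init_layer(len(m) + len(n), hidden, rng),
          init_layer(hidden, len(m), rng)]
o = generator_forward(m, n, layers)   # M values, each in [-1, 1]
```

The output has the same dimension as the feature vector, so each output component can be extremized and matched to one page element.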
As described above, using a generative adversarial network to increment tampered webpages makes it possible to simulate hackers' diverse attack techniques for adversarial learning, and to mutate malicious data automatically, in bulk, and proactively, ahead of attackers. With the new malicious data as a supplement, the subsequent detection steps can respond in time to many novel attack patterns, achieving active defense. Traditional passive webpage tampering detection schemes cannot actively keep pace with the evolution of attack techniques and are therefore weak at detecting novel attacks; the present scheme markedly alleviates this problem and achieves a better detection effect.
The present application can realize sample increment for data of the same or a similar kind as webpages that contain a large amount of structured data, such as text data in the field of Natural Language Processing (NLP). The solution provided by the present application is applicable not only to the detection of tampered webpages, but also to the same or similar scenarios, including webpage classification, malicious webpage identification, and black-hat SEO (Search Engine Optimization), which are not limited herein.
Because the situations are complicated and cannot be enumerated one by one, those skilled in the art will recognize that many examples can be derived from the basic method principles provided by this application in combination with practical situations; without sufficient inventive effort, these fall within the protection scope of this application.
Example four
Referring to fig. 6, fig. 6 is a block diagram illustrating the structure of an incremental apparatus for structured data samples according to an embodiment of the present application, where the apparatus may include:
an original sample acquisition unit 100, configured to acquire an original sample whose data type is structured data;
a non-quantized-to-quantized conversion unit 200, configured to convert non-quantized structured data in the original sample into quantization parameters;
a sample input unit 300, configured to input the original sample, represented using the quantization parameters, into a generative adversarial network;
a new sample generation unit 400, configured to obtain a new sample in the generative adversarial network by modifying the quantization parameters.
The non-quantized-to-quantized conversion unit 200 may include:
a feature vector conversion subunit, configured to convert the non-quantized structured data in the original sample into quantized feature vectors;
correspondingly, the new sample generation unit 400 may include:
a new sample generation subunit, configured to obtain the new sample in the generative adversarial network by modifying parameters of the feature vector.
The new sample generation subunit may include:
a noise generation and noise vector generation module, configured to generate noise according to a preset noise generation rule and represent the noise as a noise vector;
a multi-vector splicing new sample generation module, configured to splice the feature vector and the noise vector in the generative adversarial network to obtain a new feature vector and a new sample corresponding to the new feature vector.
Further, when the original sample whose data type is structured data is specifically an original webpage, the incremental apparatus for structured data samples may further include:
a DOM tree representation unit, configured to represent each non-quantized element in the original webpage as a DOM tree before the non-quantized structured data in the original sample is converted into quantization parameters, where each non-quantized element serves as a node of the DOM tree, and the membership relationships among the nodes are consistent with the original structural relationships among the non-quantized elements;
correspondingly, the non-quantized-to-quantized conversion unit 200 may include:
a DOM-tree-form non-quantized element conversion subunit, configured to convert the non-quantized elements, expressed in DOM tree form, in the original webpage into quantization parameters.
Furthermore, when the original sample whose data type is structured data is specifically an original webpage, the incremental apparatus for structured data samples may further include:
a normal parsing verification unit, configured to verify whether a new webpage generated by the generative adversarial network can be parsed normally by a browser;
a new sample output screening unit, configured to output, as the new sample, only new webpages verified as normally parsable by the browser.
Furthermore, when the original sample of structured data is specifically an original tampered webpage among the original webpages, the incremental apparatus for structured data samples may further include:
a new tampered webpage discrimination feature extraction unit, configured to extract new tampered webpage discrimination features from the new tampered webpage obtained by modifying the quantization parameters in the generative adversarial network;
a tampered webpage discrimination feature library supplementing unit, configured to supplement the new tampered webpage discrimination features into the tampered webpage feature detection library so as to improve the detection accuracy for tampered webpages.
This embodiment is an apparatus embodiment corresponding to the above method embodiments and has the same beneficial effects as the method embodiments; details are not repeated here.
Fig. 6 is a block diagram illustrating an electronic device 500 in accordance with an example embodiment. As shown in fig. 6, the electronic device 500 may include a processor 501 and a memory 502, and may further include one or more of a multimedia component 503, an input/output (I/O) interface 504, and a communication component 505.
The processor 501 is configured to control the overall operation of the electronic device 500 so as to complete some or all of the steps of the incremental method for structured data samples. The memory 502 is used to store various types of data to support the steps that the processor 501 needs to perform; the data may include, for example, instructions for any application or method operating on the electronic device 500, as well as application-related data such as quantization parameter conversion rules, conversion scripts, real webpages, original tampered webpages, malicious data sets, and the like. The memory 502 may be implemented by any type of volatile or non-volatile memory device or a combination thereof, such as Static Random Access Memory (SRAM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Erasable Programmable Read-Only Memory (EPROM), Programmable Read-Only Memory (PROM), Read-Only Memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disk.
The multimedia component 503 may include a screen and an audio component. The screen may be, for example, a touch screen, and the audio component is used for outputting and/or inputting audio signals. For example, the audio component may include a microphone for receiving external audio signals. The received audio signal may further be stored in the memory 502 or transmitted through the communication component 505. The audio component also includes at least one speaker for outputting audio signals. The I/O interface 504 provides an interface between the processor 501 and other interface modules, such as a keyboard, a mouse, etc. The communication component 505 is used for wired or wireless communication between the electronic device 500 and other devices. The wireless communication may be, for example, Wi-Fi, Bluetooth, Near Field Communication (NFC), 2G, 3G, or 4G, or a combination of one or more of them; accordingly, the communication component 505 may include a Wi-Fi module, a Bluetooth module, and an NFC module.
In an exemplary embodiment, the electronic device 500 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components, for performing the incremental method for structured data samples given in the above embodiments.
In another exemplary embodiment, a computer-readable storage medium storing program instructions is also provided; when executed by a processor, the program instructions implement the corresponding operations. For example, the computer-readable storage medium may be the memory 502 described above, which includes program instructions, specifically those of the incremental method for structured data samples given in the above embodiments, that can be executed by the processor 501 of the electronic device 500.
The principles and implementations of the present application are described herein using specific examples. The embodiments are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and the same or similar parts among the embodiments may be referred to each other. For the apparatus disclosed in the embodiments, reference is made to the corresponding method section. The above description of the embodiments is only intended to help understand the method of the present application and its core ideas. It will be apparent to those skilled in the art that various changes and modifications can be made to the present application without departing from its principles, and these changes and modifications also fall within the scope of the claims of the present application.
It is further noted that, in the present specification, relational terms such as first and second are used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.

Claims (14)

1. An incremental method for structured data samples, comprising:
acquiring an original sample whose data type is structured data;
converting non-quantized structured data in the original sample into quantization parameters;
inputting the original sample, represented using the quantization parameters, into a generative adversarial network;
obtaining a new sample in the generative adversarial network by modifying the quantization parameters.
2. The incremental method according to claim 1, wherein converting the non-quantized structured data in the original sample into quantization parameters comprises:
converting the non-quantized structured data in the original sample into quantized feature vectors;
correspondingly, obtaining a new sample in the generative adversarial network by modifying the quantization parameters comprises:
obtaining the new sample in the generative adversarial network by modifying parameters of the feature vector.
3. The incremental method according to claim 2, wherein obtaining the new sample in the generative adversarial network by modifying parameters of the feature vector comprises:
generating noise according to a preset noise generation rule, and representing the noise as a noise vector;
splicing the feature vector and the noise vector in the generative adversarial network to obtain a new feature vector and a new sample corresponding to the new feature vector.
4. The incremental method according to any one of claims 1 to 3, wherein when the original sample whose data type is structured data is specifically an original webpage, before converting the non-quantized structured data in the original sample into quantization parameters, the method further comprises:
representing each non-quantized element in the original webpage as a DOM tree, wherein each non-quantized element serves as a node of the DOM tree, and the membership relationships among the nodes are consistent with the original structural relationships among the non-quantized elements;
correspondingly, converting the non-quantized structured data in the original sample into quantization parameters comprises:
converting the non-quantized elements, expressed in DOM tree form, in the original webpage into quantization parameters.
5. The incremental method according to claim 4, wherein when the original sample whose data type is structured data is specifically an original webpage, the method further comprises:
verifying whether a new webpage generated by the generative adversarial network can be parsed normally by a browser;
outputting, as the new sample, only new webpages verified as normally parsable by the browser.
6. The incremental method according to claim 5, wherein when the original sample of structured data is specifically an original tampered webpage, the method further comprises:
extracting new tampered webpage discrimination features from a new tampered webpage obtained by modifying the quantization parameters in the generative adversarial network;
supplementing the new tampered webpage discrimination features into a tampered webpage feature detection library so as to improve the detection accuracy for tampered webpages.
7. An incremental apparatus for structured data samples, comprising:
an original sample acquisition unit, configured to acquire an original sample whose data type is structured data;
a non-quantized-to-quantized conversion unit, configured to convert non-quantized structured data in the original sample into quantization parameters;
a sample input unit, configured to input the original sample, represented using the quantization parameters, into a generative adversarial network;
a new sample generation unit, configured to obtain a new sample in the generative adversarial network by modifying the quantization parameters.
8. The incremental apparatus according to claim 7, wherein the non-quantized-to-quantized conversion unit comprises:
a feature vector conversion subunit, configured to convert the non-quantized structured data in the original sample into quantized feature vectors;
correspondingly, the new sample generation unit comprises:
a new sample generation subunit, configured to obtain the new sample in the generative adversarial network by modifying parameters of the feature vector.
9. The incremental apparatus according to claim 8, wherein the new sample generation subunit comprises:
a noise generation and noise vector generation module, configured to generate noise according to a preset noise generation rule and represent the noise as a noise vector;
a multi-vector splicing new sample generation module, configured to splice the feature vector and the noise vector in the generative adversarial network to obtain a new feature vector and a new sample corresponding to the new feature vector.
10. The incremental apparatus according to any one of claims 7 to 9, wherein when the original sample whose data type is structured data is specifically an original webpage, the incremental apparatus further comprises:
a DOM tree representation unit, configured to represent each non-quantized element in the original webpage as a DOM tree before the non-quantized structured data in the original sample is converted into quantization parameters, wherein each non-quantized element serves as a node of the DOM tree, and the membership relationships among the nodes are consistent with the original structural relationships among the non-quantized elements;
correspondingly, the non-quantized-to-quantized conversion unit comprises:
a DOM-tree-form non-quantized element conversion subunit, configured to convert the non-quantized elements, expressed in DOM tree form, in the original webpage into quantization parameters.
11. The incremental apparatus according to claim 10, wherein when the original sample whose data type is structured data is specifically an original webpage, the incremental apparatus further comprises:
a normal parsing verification unit, configured to verify whether a new webpage generated by the generative adversarial network can be parsed normally by a browser;
a new sample output screening unit, configured to output, as the new sample, a new webpage verified as normally parsable by the browser.
12. The incremental apparatus according to claim 11, wherein when the original sample of structured data is specifically an original tampered webpage, the incremental apparatus further comprises:
a new tampered webpage discrimination feature extraction unit, configured to extract new tampered webpage discrimination features from a new tampered webpage obtained by modifying the quantization parameters in the generative adversarial network;
a tampered webpage discrimination feature library supplementing unit, configured to supplement the new tampered webpage discrimination features into a tampered webpage feature detection library so as to improve the detection accuracy for tampered webpages.
13. An electronic device, comprising:
a memory for storing a computer program;
a processor for implementing the incremental method for structured data samples according to any one of claims 1 to 6 when executing the computer program.
14. A computer-readable storage medium, wherein a computer program is stored on the computer-readable storage medium, and the computer program, when executed by a processor, implements the incremental method for structured data samples according to any one of claims 1 to 6.
CN201910074352.4A 2019-01-25 2019-01-25 Incremental method and device for structured data sample, electronic equipment and medium Pending CN111488422A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910074352.4A CN111488422A (en) 2019-01-25 2019-01-25 Incremental method and device for structured data sample, electronic equipment and medium

Publications (1)

Publication Number Publication Date
CN111488422A true CN111488422A (en) 2020-08-04

Family

ID=71812279

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910074352.4A Pending CN111488422A (en) 2019-01-25 2019-01-25 Incremental method and device for structured data sample, electronic equipment and medium

Country Status (1)

Country Link
CN (1) CN111488422A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170365038A1 (en) * 2016-06-16 2017-12-21 Facebook, Inc. Producing Higher-Quality Samples Of Natural Images
CN107123151A (en) * 2017-04-28 2017-09-01 深圳市唯特视科技有限公司 A kind of image method for transformation based on variation autocoder and generation confrontation network
CN108364029A (en) * 2018-03-19 2018-08-03 百度在线网络技术(北京)有限公司 Method and apparatus for generating model
CN108763915A (en) * 2018-05-18 2018-11-06 百度在线网络技术(北京)有限公司 Identifying code is established to generate model and generate the method, apparatus of identifying code
CN109165735A (en) * 2018-07-12 2019-01-08 杭州电子科技大学 Based on the method for generating confrontation network and adaptive ratio generation new samples
CN109086416A (en) * 2018-08-06 2018-12-25 中国传媒大学 A kind of generation method of dubbing in background music, device and storage medium based on GAN

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHENGPING CHE ET AL: "Boosting Deep Learning Risk Prediction with Generative Adversarial Networks for Electronic Health Records", 《IEEE》 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112328750A (en) * 2020-11-26 2021-02-05 上海天旦网络科技发展有限公司 Method and system for training text discrimination model
CN112291273A (en) * 2020-12-24 2021-01-29 远江盛邦(北京)网络安全科技股份有限公司 Page fuzzy matching implementation method based on multi-dimensional vector comparison
CN113360505A (en) * 2021-07-02 2021-09-07 招商局金融科技有限公司 Data processing method and device based on time sequence data, electronic equipment and readable storage medium
CN113360505B (en) * 2021-07-02 2023-09-26 招商局金融科技有限公司 Time sequence data-based data processing method and device, electronic equipment and readable storage medium
CN113780365A (en) * 2021-08-19 2021-12-10 支付宝(杭州)信息技术有限公司 Sample generation method and device
CN117196800A (en) * 2023-06-20 2023-12-08 四川仕虹腾飞信息技术有限公司 Digital management method for staff behaviors of banking outlets

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination