WO2022063133A1 - Sensitive information detection method, apparatus, device, and computer-readable storage medium - Google Patents

Sensitive information detection method, apparatus, device, and computer-readable storage medium

Info

Publication number
WO2022063133A1
Authority
WO
WIPO (PCT)
Prior art keywords
target
character
content
sensitive information
page
Prior art date
Application number
PCT/CN2021/119658
Other languages
English (en)
French (fr)
Inventor
刘宇滨
Original Assignee
深圳前海微众银行股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳前海微众银行股份有限公司
Publication of WO2022063133A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G06F16/322 Trees
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/64 Protecting data integrity, e.g. using checksums, certificates or signatures

Definitions

  • the present application relates to the technical field of financial technology (Fintech), and in particular, to a sensitive information detection method, apparatus, device, and computer-readable storage medium.
  • The current sensitive information detection method mainly detects whether sensitive information has been published on a page by performing HTML keyword detection on the page. Specifically, the HTML source code of the page is obtained, and keyword identification is then performed on the HTML source code to determine whether sensitive information exists. For example, if the keyword "Notice on the issuance of the four systems of xxx" appears in the HTML source code, the official documents of a banking institution may have been leaked.
  • This detection method only identifies keywords in the HTML source code and cannot exclude the influence of dynamic factors such as advertisements. Moreover, the HTML source code does not represent the real data: for tags containing resource requests, for example, data that is acquired only after code is executed cannot be obtained directly. The current sensitive information detection method therefore has low detection accuracy, owing to interference from dynamic factors or the inability to obtain the real data.
  • the main purpose of this application is to propose a sensitive information detection method, apparatus, device and computer-readable storage medium, aiming at improving the accuracy of sensitive information detection.
  • the sensitive information detection method comprises the following steps:
  • a target page is generated, and whether there is target sensitive information in the target page is detected to obtain a detection result.
  • the target tag includes a content tag and a style tag
  • the step of generating a target page based on the target tag includes:
  • the step of constructing a document model tree based on the first hierarchical relationship and the content tag includes:
  • the current content tag is a script tag
  • execute the execution code corresponding to the script tag and after the execution code is executed, determine the tag type of the next content tag
  • the current content label is a resource label, obtain the resource corresponding to the resource label, and generate a document node from the resource;
  • a document model tree is constructed.
  • the step of generating a rendering tree based on the document model tree and the style model tree includes:
  • a third node is generated, and based on the third node, a rendering tree is generated.
  • the step of detecting whether there is target sensitive information in the target page to obtain a detection result includes:
  • the sensitive information detection method further includes:
  • the step of detecting whether there is target sensitive information in the target page is performed to obtain the detection result, and after the detection result is obtained, the detection result and the identification information are associated and saved in a preset database;
  • the detection result corresponding to the target identification information is acquired.
  • the step of determining the target content corresponding to the target address based on the first content and the second content includes:
  • the longest common subsequence of the first sequence and the second sequence is determined, and based on the longest common subsequence, the target content corresponding to the target address is determined.
  • before the step of sending the first request and the second request to the target address to obtain the first content corresponding to the first request and the second content corresponding to the second request returned by the target address, the sensitive information detection method further includes:
  • the steps of sending the first request and the second request to the target address are performed.
  • the present application also provides a sensitive information detection device, the sensitive information detection device includes:
  • a sending module configured to send the first request and the second request to the target address, so as to obtain the first content corresponding to the first request and the second content corresponding to the second request;
  • a determining module configured to determine the target content corresponding to the target address based on the first content and the second content
  • an extraction module configured to determine the original characters corresponding to the target content and to extract the target tags from the original characters
  • a generating module for generating a target page based on the target tag
  • a detection module configured to detect whether there is target sensitive information in the target page to obtain a detection result.
  • the target tag includes a content tag and a style tag
  • the generating module is further configured to:
  • the generation module is further used for:
  • the current content tag is a script tag
  • execute the execution code corresponding to the script tag and after the execution code is executed, determine the tag type of the next content tag
  • the current content label is a resource label, obtain the resource corresponding to the resource label, and generate a document node from the resource;
  • a document model tree is constructed.
  • the generation module is further used for:
  • a third node is generated, and based on the third node, a rendering tree is generated.
  • the detection module is further used for:
  • the detection module is further used for:
  • the step of detecting whether there is target sensitive information in the target page is performed to obtain the detection result, and after the detection result is obtained, the detection result and the identification information are associated and saved in a preset database;
  • the detection result corresponding to the target identification information is acquired.
  • the determining module is further configured to:
  • the longest common subsequence of the first sequence and the second sequence is determined, and based on the longest common subsequence, the target content corresponding to the target address is determined.
  • the sending module is further used for:
  • the steps of sending the first request and the second request to the target address are performed.
  • the present application also provides a sensitive information detection device
  • the sensitive information detection device includes: a memory, a processor, and a sensitive information detection program stored on the memory and runnable on the processor; when the sensitive information detection program is executed by the processor, the steps of the above sensitive information detection method are implemented.
  • the present application also provides a computer-readable storage medium on which a sensitive information detection program is stored; when the sensitive information detection program is executed by a processor, the steps of the above sensitive information detection method are implemented.
  • a first request and a second request are sent to a target address to obtain the first content corresponding to the first request and the second content corresponding to the second request; based on the first content and the second content, Determine the target content corresponding to the target address; determine the original character corresponding to the target content, and extract the target tag in the original character; generate the target page based on the target tag, and detect whether there is target sensitive information in the target page to obtain the detection result.
  • By sending two requests to the same address, this application removes the interference of dynamic factors at that address and thereby obtains fixed content. It then generates a page containing complete data by extracting tags, so that the content of the page is fixed and complete, and performs sensitive information detection on that page, which improves the accuracy of sensitive information detection.
  • FIG. 1 is a schematic diagram of the device structure of the hardware operating environment involved in the solution of the embodiment of the present application;
  • FIG. 2 is a schematic flowchart of the first embodiment of the sensitive information detection method of the present application
  • FIG. 3 is a schematic diagram of a target matrix in the first embodiment of the sensitive information detection method of the present application.
  • FIG. 4 is a schematic diagram of a document model tree in the first embodiment of the sensitive information detection method of the present application.
  • FIG. 1 is a schematic diagram of a device structure of a hardware operating environment involved in the solution of the embodiment of the present application.
  • the device in this embodiment of the present application may be a mobile terminal or a server device.
  • the device may include: a processor 1001 , such as a CPU, a network interface 1004 , a user interface 1003 , a memory 1005 , and a communication bus 1002 .
  • the communication bus 1002 is used to realize the connection and communication between these components.
  • the user interface 1003 may include a display screen (Display), an input unit such as a keyboard (Keyboard), and the optional user interface 1003 may also include a standard wired interface and a wireless interface.
  • the network interface 1004 may include a standard wired interface and a wireless interface (eg, a WI-FI interface).
  • the memory 1005 may be high-speed RAM memory, or may be non-volatile memory, such as disk memory.
  • the memory 1005 may also be a storage device independent of the aforementioned processor 1001 .
  • the device structure shown in FIG. 1 does not constitute a limitation on the device, which may include more or fewer components than shown, combine some components, or arrange the components differently.
  • the memory 1005 as a computer storage medium may include an operating system, a network communication module, a user interface module and a sensitive information detection program.
  • the operating system is a program that manages and controls sensitive information detection equipment and software resources, and supports the operation of network communication modules, user interface modules, sensitive information detection programs, and other programs or software;
  • the network communication module is used to manage and control the network interface 1004;
  • the user interface module is used to manage and control the user interface 1003.
  • the sensitive information detection device invokes the sensitive information detection program stored in the memory 1005 through the processor 1001, and executes the operations in various embodiments of the following sensitive information detection methods.
  • FIG. 2 is a schematic flowchart of the first embodiment of the sensitive information detection method of the present application, and the method includes:
  • Step S10 sending the first request and the second request to the target address to obtain the first content corresponding to the first request and the second content corresponding to the second request;
  • Step S20 based on the first content and the second content, determine the target content corresponding to the target address
  • Step S30 determining the original character corresponding to the target content, and extracting the target label in the original character
  • Step S40 generating a target page based on the target tag, and detecting whether target sensitive information exists in the target page to obtain a detection result.
  • the sensitive information detection method in this embodiment is applied to sensitive information detection equipment of financial institutions such as wealth management institutions or banking systems.
  • the sensitive information detection equipment may be a terminal, a robot or a PC device.
  • the sensitive information detection equipment is referred to as detection equipment for short.
  • relevant personnel establish a sensitive information text database in advance according to the actual situation of financial institutions such as banks, so as to specify which information is not allowed to be leaked.
  • For example, information such as "Customer list" is set as sensitive information. The sensitive information text library can be set locally in the detection device or in a server connected to the detection device.
  • the detection device needs to monitor all sites that may leak sensitive information, and these sites are legal and accessible.
  • When the detection device accesses a site, it first determines by regular-expression matching whether the URL of the site is legal. If the URL is legal, the device sends an access request to the site and determines from the returned access result whether the site is available. Only if the site is available does the detection device detect sensitive information on it.
  • The same site is visited twice, and the interference of dynamic factors is eliminated according to the difference between the two visits, so as to obtain fixed target content. A page containing complete data is then generated from the target content by extracting tags, so that the content of the page is fixed and complete.
  • Sensitive information detection is then performed on that page, which makes the detection result more reliable.
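  • The URL legality check described above can be sketched as follows. This is a minimal illustration only: the pattern `URL_PATTERN` and the function name `is_legal_url` are assumptions, since the source does not state which regular expression is used.

```python
import re

# Hypothetical pattern (not from the source): accepts http/https URLs that
# have a host, an optional port, and an optional path.
URL_PATTERN = re.compile(r"^https?://[A-Za-z0-9.-]+(:\d+)?(/\S*)?$")


def is_legal_url(url: str) -> bool:
    """Return True when the URL matches the assumed legality pattern."""
    return URL_PATTERN.fullmatch(url) is not None
```

  • A site whose URL fails this check would simply be skipped before any request is sent.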
  • Step S10 sending the first request and the second request to the target address to obtain the first content corresponding to the first request and the second content corresponding to the second request;
  • the detection device sends the first request and the second request respectively to the same target address, so as to obtain the first content and the second content respectively.
  • If the page at the target address contains dynamic factors, the returned access results are different, that is, the first content differs from the second content.
  • If the page contains no dynamic factors, the returned access results are the same, that is, the first content is the same as the second content.
  • the sensitive information detection method further includes:
  • Step a1 sending a third request to the target address to obtain a status code corresponding to the third request
  • A third request is first sent to the target address to obtain the corresponding status code, where the third request is a HEAD request and the status code indicates whether the current request produced an error. A status code such as 401, 403, or 404 indicates an error, and a status code of 200 indicates a normal response. The status code 200 can therefore be set as the target status code in advance.
  • Step a2 if the status code is the target status code, the step of sending the first request and the second request to the target address is performed.
  • the steps of sending the first request and the second request to the target address are performed, wherein the first request and the second request are get requests.
  • A HEAD request is sent first, and whether the page corresponding to the target address is normal is determined from the status code returned by the HEAD request. This is because a GET request returns both header data and body data, whereas a HEAD request returns only header data.
  • If the status code indicates an error, such as 401, 403, or 404, detection of the current target address stops; if the status code is normal, that is, 200, two GET requests are sent to obtain the first content and the second content corresponding to the two requests.
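  • The probe-then-fetch flow of steps a1, a2 and S10 can be sketched as follows. The `Sender` callable is a hypothetical stand-in for whatever HTTP client the detection device uses; it is injected so that the sketch makes no claim about a particular library.

```python
from typing import Callable, Optional, Tuple

# Hypothetical transport: takes (method, url), returns (status_code, body).
Sender = Callable[[str, str], Tuple[int, bytes]]

TARGET_STATUS = 200  # the preset target status code named in the text


def fetch_twice(url: str, send: Sender) -> Optional[Tuple[bytes, bytes]]:
    """Send a HEAD probe, then two GET requests if the probe succeeds."""
    status, _ = send("HEAD", url)
    if status != TARGET_STATUS:
        # 401, 403, 404, etc.: stop detecting this target address.
        return None
    _, first_content = send("GET", url)
    _, second_content = send("GET", url)
    return first_content, second_content
```

  • Returning `None` for an abnormal status code is a design choice of this sketch; it lets the caller skip the address without downloading any body data.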
  • Step S20 based on the first content and the second content, determine the target content corresponding to the target address.
  • The detection device eliminates dynamic factors according to the first content and the second content, thereby determining the target content corresponding to the target address. The target content is the part common to the first content and the second content; that is, when dynamic factors are removed, the part not shared by the first content and the second content is defined as dynamic content, i.e. the dynamic factors.
  • step S20 includes:
  • Step b1 determining the first sequence corresponding to the first content and the second sequence corresponding to the second content, and generating a target matrix based on the first sequence and the second sequence;
  • Step b2 determining the longest common subsequence of the first sequence and the second sequence based on the target matrix, and determining the target content corresponding to the target address based on the longest common subsequence.
  • the longest common subsequence of the first content and the second content is determined as the common part of the first content and the second content, that is, the target content.
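  • Steps b1 and b2 can be illustrated with the standard dynamic-programming solution to the longest common subsequence problem. This is a sketch under the assumption that the "target matrix" is the usual table of subsequence lengths; the function name and the choice of list elements as sequence items are illustrative.

```python
from typing import List, Sequence


def longest_common_subsequence(first: Sequence[str], second: Sequence[str]) -> List[str]:
    """Return one longest common subsequence of the two sequences."""
    rows, cols = len(first), len(second)
    # matrix[i][j] holds the LCS length of first[:i] and second[:j].
    matrix = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            if first[i - 1] == second[j - 1]:
                matrix[i][j] = matrix[i - 1][j - 1] + 1
            else:
                matrix[i][j] = max(matrix[i - 1][j], matrix[i][j - 1])
    # Walk back from the bottom-right corner to recover the shared items.
    common: List[str] = []
    i, j = rows, cols
    while i > 0 and j > 0:
        if first[i - 1] == second[j - 1]:
            common.append(first[i - 1])
            i -= 1
            j -= 1
        elif matrix[i - 1][j] >= matrix[i][j - 1]:
            i -= 1
        else:
            j -= 1
    common.reverse()
    return common
```

  • For two responses that differ only in an advertisement fragment, the items shared by both responses survive and the differing fragment is dropped as dynamic content.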
  • Step S30 Determine the original character corresponding to the target content, and extract the target tag in the original character.
  • the returned body data includes bytes source code, such as:
  • the detection device needs to read the original bytes of the body, and then parse it into recognizable original characters according to the preset encoding, such as:
  • the detection device may perform extraction according to a preset tag structure, that is, all objects satisfying the preset tag structure are target tags.
  • the text information in the html source code also needs to be extracted.
  • the text information is placed in the corresponding position of the document model tree according to the parent-child relationship to which the text information belongs.
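  • Decoding the body and extracting tags by a preset tag structure can be sketched as follows. The regular expression `TAG_PATTERN` and the default encoding are assumptions for illustration; the source does not specify the preset structure or encoding.

```python
import re
from typing import List

# Hypothetical "preset tag structure": anything between angle brackets.
TAG_PATTERN = re.compile(r"<[^<>]+>")


def extract_tags(body: bytes, encoding: str = "utf-8") -> List[str]:
    """Decode the raw body bytes and return every object shaped like a tag."""
    original_characters = body.decode(encoding, errors="replace")
    return TAG_PATTERN.findall(original_characters)
```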
  • Step S40 generating a target page based on the target tag, and detecting whether target sensitive information exists in the target page to obtain a detection result.
  • The detection device can then detect whether sensitive information exists on the target page, so as to obtain a more accurate detection result.
  • the step of generating a target page based on the target tag includes:
  • Step c1 determine the first hierarchical relationship of the content label, and build a document model tree based on the first hierarchical relationship and the content label;
  • the target tag includes a content tag and a style tag, wherein the content tag is used to describe specific content, and the style tag is used to describe the layout of the specific content.
  • the content tag <html> is the parent layer
  • <head> and <body> are sub-layers of <html>
  • build a document model tree (DOM tree) according to the first hierarchical relationship and the nodes corresponding to the content tags.
  • the step of constructing the document model tree includes:
  • Step c11 determining the label type of the content label in turn
  • For pages rendered dynamically with JavaScript (hereinafter referred to as js), the html source code yields only the js code or the path to the js code, rather than the actual data that the detection device needs to obtain.
  • the tag types include resource tags and script tags.
  • Step c12 if the current content tag is a script tag, execute the execution code corresponding to the script tag, and after the execution code is executed, determine the tag type of the next content tag;
  • The detection device executes the execution code corresponding to the script tag and temporarily suspends construction of the DOM tree; after execution is complete, it continues by determining the tag type of the next content tag. That is, when the detection device encounters a script tag (i.e. js) while constructing the DOM tree, construction of the DOM tree is blocked and the js code is executed by the js engine of the detection device. After the js code has been executed, construction of the DOM tree continues.
  • The purpose of blocking construction of the DOM tree is to improve overall efficiency by avoiding the wasted work of creating a DOM node that the js code then deletes.
  • Step c13 if the current content tag is a resource tag, obtain a resource corresponding to the resource tag, and generate a document node from the resource;
  • Step c14 build a document model tree based on the first hierarchical relationship and the document nodes.
  • the detection device constructs a document model tree according to the generated document nodes in the order of the first hierarchical relationship.
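  • Steps c11 to c14 can be sketched with Python's standard `html.parser`. This is only an illustration: `run_script` and `load_resource` are hypothetical callbacks standing in for the js engine and the resource fetcher, the tag sets are assumptions, and malformed nesting is not handled.

```python
from html.parser import HTMLParser
from typing import Callable, List, Optional

VOID_TAGS = {"img", "link", "br", "meta", "input", "hr"}  # no closing tag
RESOURCE_TAGS = {"img", "link"}  # assumed set of resource tags


class Node:
    def __init__(self, tag: str, parent: Optional["Node"] = None):
        self.tag = tag
        self.parent = parent
        self.children: List["Node"] = []
        self.text = ""
        self.resource: Optional[bytes] = None


class DomBuilder(HTMLParser):
    def __init__(self, run_script: Callable[[str], None],
                 load_resource: Callable[[str], bytes]):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root
        self.run_script = run_script
        self.load_resource = load_resource

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag in RESOURCE_TAGS:
            attributes = dict(attrs)
            location = attributes.get("src") or attributes.get("href")
            if location:
                # Resource tag: fetch the resource and keep it on the node.
                node.resource = self.load_resource(location)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.current.tag == "script":
            # Script tag: run the code synchronously, so tree construction
            # is blocked until the callback returns.
            self.run_script(self.current.text)
        if self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        self.current.text += data
```

  • Feeding html source to `DomBuilder.feed` leaves the finished tree under `builder.root`, with children in document order.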
  • Step c2 determine the second hierarchical relationship of the style label, and build a style model tree based on the second hierarchical relationship and the style label;
  • the detection device determines the second hierarchical relationship of style tags in a similar manner, and constructs a style model tree (CSS tree) through the second hierarchical relationship and the style tags, and the specific process is similar to building a document model tree, It is not repeated here.
  • Step c3 based on the document model tree and the style model tree, generate a rendering tree
  • the detection device performs background rendering on the acquired html source code, and specifically, generates a rendering tree from a document model tree and a style model tree.
  • step c3 includes:
  • Step c31 traverse the first node in the document model tree, and sequentially determine the second node corresponding to the first node in the style model tree;
  • When the detection device traverses all nodes of the DOM tree, that is, the first nodes, it can query the CSS tree for the second node corresponding to each first node and thereby obtain its style.
  • Step c32 generating a third node based on the first node and the second node, and generating a rendering tree based on the third node.
  • a third node is generated through the first node and the second node, and then added to the rendering tree, thereby generating a rendering tree from the document model tree and the style model tree.
  • For the sake of data integrity, the existing technology does not ignore invisible nodes, which results in low rendering efficiency.
  • The detection device does not scan invisible data, so these nodes can be ignored, which improves rendering efficiency.
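  • Steps c31 and c32, together with the skipping of invisible nodes, can be sketched as follows. For brevity the style model tree is reduced to a flat mapping from tag name to style properties, and `display: none` is taken as the marker of an invisible node; both are assumptions of this sketch.

```python
from typing import Dict, Optional


def build_render_tree(dom: dict, styles: Dict[str, Dict[str, str]]) -> Optional[dict]:
    """Combine each document node with its style; drop invisible subtrees."""
    style = styles.get(dom["tag"], {})
    if style.get("display") == "none":
        # Invisible node: it and its descendants are left out of the tree.
        return None
    children = []
    for child in dom.get("children", []):
        rendered = build_render_tree(child, styles)
        if rendered is not None:
            children.append(rendered)
    # The "third node" carries both the content and the style description.
    return {"tag": dom["tag"], "text": dom.get("text", ""),
            "style": style, "children": children}
```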
  • Step c4 traverse the nodes of the rendering tree, and generate a target page based on the node and the node relationship between the nodes.
  • the detection device generates the final html page, that is, the target page, by traversing all the nodes of the rendering tree based on the relationship between the nodes.
  • the nodes in the rendering tree have both content description and style description, so a target page containing complete data can be generated.
  • The detection device identifies and matches sensitive information keywords in the target page. Specifically, the keywords of the sensitive information are matched one by one against the characters in the target page to obtain a detection result, where the target sensitive information is sensitive information in the sensitive information text library and the detection result is either leaked or not leaked.
  • the detection device in this embodiment sends the first request and the second request to the target address to obtain the first content corresponding to the first request and the second content corresponding to the second request; based on the first content and the second content, the target address is determined The corresponding target content; determine the original characters corresponding to the target content, and extract the target tags in the original characters; based on the target tags, generate a target page, and detect whether there is target sensitive information in the target page to obtain the detection result.
  • By sending two requests to the same address, this application removes the interference of dynamic factors at that address and thereby obtains fixed content. It then generates a page containing complete data by extracting tags, so that the content of the page is fixed and complete, and performs sensitive information detection on that page, which improves the accuracy of sensitive information detection.
  • Step d1, determine the first character string corresponding to the target page and the second character string corresponding to the target sensitive information, and align the first character string with the second character string based on the first page character of the first character string and the first sensitive character of the second character string;
  • Step d2, sequentially determine whether the sensitive characters of the second character string match the page characters of the first character string at the same positions;
  • Step d3, if the current page character does not match the current sensitive character, determine the next page character after the page character corresponding to the last sensitive character of the second character string as the target character, and determine whether the target character exists in the second character string;
  • Step d4, if it does not exist, align the first character string with the second character string based on the next page character after the target character and the first sensitive character of the second character string, and perform the step of sequentially determining whether the sensitive characters of the second character string match the page characters of the first character string at the same positions;
  • Step d5, if it exists, align the first character string with the second character string based on the target character, and perform the step of sequentially determining whether the sensitive characters of the second character string match the page characters of the first character string at the same positions;
  • Step d6, if there is a match, record the matching position of the second character string in the first character string, and output a detection result based on the matching position.
  • In this embodiment, because multiple batch detections generate a large volume of files, that is, there are many target pages, detection by the traditional method of matching keywords one by one would take too much time; that is, detection efficiency would be low.
  • this embodiment provides an improved matching method, the basic principle of which is to match the sensitive information as a whole, and during the matching process, characters of the same length as the sensitive information are also intercepted on the target page as matching objects, so that the matching The process is accelerated and the detection efficiency is improved.
  • Step d1, determine the first character string corresponding to the target page and the second character string corresponding to the target sensitive information, and align the first character string with the second character string based on the first page character of the first character string and the first sensitive character of the second character string;
  • the detection device first determines the first character string corresponding to the target page and the second character string corresponding to the target sensitive information, and then constructs a position axis based on the first character string, wherein the first character Each page character in the string corresponds to a position on the position axis and is fixed, and then on the position axis, the first character string is aligned with the second character string, specifically the first page character and the first sensitive character are aligned.
  • Step d2 successively determine whether the sensitive characters of the second character string match the page characters of the first character string corresponding to the same position;
  • the detection device sequentially determines whether the sensitive characters of the second character string match the page characters of the first character string corresponding to the same position, for example, the sensitive characters at the first position on the position axis and the page at the first position whether the characters match.
  • Step d3, if the current page character does not match the current sensitive character, determine the next page character after the page character corresponding to the last sensitive character of the second character string as the target character, and determine whether the target character exists in the second character string;
  • The detection device determines the next page character after the page character corresponding to the last sensitive character of the second character string as the target character. Specifically, it can first determine the character length of the second character string and then add one to that length; the corresponding position on the position axis is the position of the target character, and the page character at that position is the target character.
  • Step d4, if it does not exist, align the first character string with the second character string based on the next page character after the target character and the first sensitive character of the second character string, and perform the step of sequentially determining whether the sensitive characters of the second character string match the page characters of the first character string at the same positions;
  • If it is determined that the target character does not exist in the second character string, the target character is skipped and the next page character after the target character is used as the alignment character; that is, the next page character after the target character and the first sensitive character of the second character string are used as the alignment reference, and the second character string is moved along the position axis.
  • After the move, the second character string again corresponds to page characters of the same character length, and the detection device continues to perform the step of sequentially determining whether the sensitive characters of the second character string match the page characters of the first character string at the same positions.
  • Step d5, if it exists, align the first character string with the second character string based on the target character, and perform the step of sequentially determining whether the sensitive characters of the second character string match the page characters of the first character string at the same positions;
  • the detection device continues to perform the step of sequentially determining whether the sensitive character of the second character string matches the page character of the first character string corresponding to the same position.
  • Step d6 if there is a match, record the matching position of the second character string in the first character string, and output a detection result based on the matching position.
  • the matching position of the second character string in the first character string is recorded, and the output includes the matching position test results.
  • the first string does not match until the end of the match, it means that the first string does not contain any sensitive characters in the second string, indicating that the sensitive information of the target is not leaked on the target page, then output the detection of non-leakage. result.
  • Take the first character string as A and the second character string as B, where A = [I,A,M,J,O,H,N,I,L,I,K,E,P,L,A,Y,I,N,G,F,O,O,T,B,A,L,L] and B = [N,I,L,I];
  • First, string A and string B are aligned at their first characters.
  • At position 0, string A and string B do not match.
  • The character that follows the page character J aligned with the last character I of string B, that is, the character O at position 4, is taken out. The device then determines whether string B contains an O.
  • String B contains no O, so B[0] is aligned with the character after O (A[5]). H and N do not match either, so the page character after the one aligned with the end of string B is taken, which is I.
  • Since I exists in string B, the detection device moves string B one position to the right and aligns the I in the two strings.
  • If string B contains several I characters, each I in turn is used as the alignment character.
  • This yields the matching position of string B in string A, and the detection device outputs a detection result containing the matching position.
  • The detection device of this embodiment matches the sensitive information as a whole and, during matching, takes from the target page a run of characters of the same length as the sensitive information as the matching object, which speeds up matching and improves detection efficiency.
  • the sensitive information detection method further includes:
  • Step e1 determine the identification information of the target page, and based on the identification information, determine whether there is target identification information consistent with the identification information in the preset database;
  • Step e2 if it does not exist, then perform the step of detecting whether there is target sensitive information in the target page to obtain the detection result, and after obtaining the detection result, the detection result and the identification information are associated and stored in the preset database;
  • Step e3 if it exists, obtain the detection result corresponding to the target identification information.
  • Step e1 determine the identification information of the target page, and based on the identification information, determine whether there is target identification information consistent with the identification information in the preset database;
  • After obtaining the target page with dynamic factors removed, the detection device calculates the identification information of the target page. The identification information may be a hash value, an MD5 value or similar, and uniquely identifies the current target page.
  • The identification information of the target page is then compared with the identification information in the preset database to determine whether the database contains target identification information consistent with it.
  • Step e2: if it does not exist, perform the step of detecting whether target sensitive information exists in the target page to obtain the detection result, and after obtaining the detection result, store the detection result in association with the identification information in the preset database;
  • If the preset database contains no consistent target identification information, the target page has not been detected before, so the step of detecting whether target sensitive information exists in the target page is performed.
  • The specific process is described in the previous embodiment and is not repeated here. The resulting detection result is then bound to the identification information of the target page and stored in the preset database, so that the target page is not detected again.
  • Step e3: if it exists, obtain the detection result corresponding to the target identification information.
  • If the preset database contains consistent target identification information, the target page has been detected before, and the corresponding detection result is read directly from the preset database and output.
  • the present application also provides a sensitive information detection device.
  • the sensitive information detection device of this application includes:
  • a sending module configured to send the first request and the second request to the target address, so as to obtain the first content corresponding to the first request and the second content corresponding to the second request;
  • a determining module configured to determine the target content corresponding to the target address based on the first content and the second content
  • an extraction module for determining the original character corresponding to the target content, and extracting the target label in the original character
  • a generating module for generating a target page based on the target tag
  • a detection module configured to detect whether there is target sensitive information in the target page to obtain a detection result.
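  • In one embodiment, the target tags include content tags and style tags, and the generating module is further configured to:
  • determine a first hierarchical relationship of the content tags, and construct a document model tree based on the first hierarchical relationship and the content tags;
  • determine a second hierarchical relationship of the style tags, and construct a style model tree based on the second hierarchical relationship and the style tags;
  • generate a rendering tree based on the document model tree and the style model tree;
  • traverse the nodes of the rendering tree, and generate the target page based on the nodes and their node relationships.
  • In one embodiment, the generating module is further configured to:
  • determine in turn the tag type of each content tag;
  • if the current content tag is a script tag, execute the execution code corresponding to the script tag and, after the execution code has finished, determine the tag type of the next content tag;
  • if the current content tag is a resource tag, obtain the resource corresponding to the resource tag and generate a document node from the resource;
  • construct the document model tree based on the first hierarchical relationship and the document nodes.
  • In one embodiment, the generating module is further configured to:
  • traverse the first nodes in the document model tree, and determine in turn the second node in the style model tree corresponding to each first node;
  • generate a third node based on the first node and the second node, and generate the rendering tree based on the third nodes.
  • In one embodiment, the detection module is further configured to perform the string alignment and matching of steps d1 to d6 described above.
  • In one embodiment, the detection module is further configured to:
  • determine the identification information of the target page and, based on it, determine whether the preset database contains consistent target identification information;
  • if it does not exist, perform the step of detecting whether target sensitive information exists in the target page to obtain the detection result, and then store the detection result in association with the identification information in the preset database;
  • if it exists, obtain the detection result corresponding to the target identification information.
  • In one embodiment, the determining module is further configured to:
  • determine a first sequence corresponding to the first content and a second sequence corresponding to the second content, and generate a target matrix based on the two sequences;
  • based on the target matrix, determine the longest common subsequence of the first sequence and the second sequence and, based on the longest common subsequence, determine the target content corresponding to the target address.
  • In one embodiment, the sending module is further configured to:
  • send a third request to the target address to obtain a status code corresponding to the third request;
  • if the status code is the target status code, perform the step of sending the first request and the second request to the target address.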
  • the present application also provides a computer-readable storage medium.
  • a sensitive information detection program is stored on the computer-readable storage medium of the present application, and when the sensitive information detection program is executed by the processor, the steps of the above-mentioned sensitive information detection method are implemented.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioethics (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

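A sensitive information detection method, comprising: sending a first request and a second request to a target address to obtain first content corresponding to the first request and second content corresponding to the second request (S10); determining, based on the first content and the second content, target content corresponding to the target address (S20); determining the original characters corresponding to the target content and extracting target tags from the original characters (S30); and generating a target page based on the target tags and detecting whether target sensitive information exists in the target page, to obtain a detection result (S40). A sensitive information detection apparatus, a device and a computer-readable storage medium are also disclosed.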

Description

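Sensitive information detection method, apparatus, device and computer-readable storage medium
Priority information
This application claims priority to Chinese patent application No. 202011036671.5, filed on September 27, 2020, the entire contents of which are incorporated herein by reference.
Technical field
This application relates to the field of financial technology (Fintech), and in particular to a sensitive information detection method, apparatus, device and computer-readable storage medium.
Background
In recent years, with the continuing development of financial technology (Fintech), and of Internet finance in particular, information detection technology has been introduced into the daily services of banks and other financial institutions. To prevent sensitive information, such as an institution's quotation information, from being uploaded to external websites and becoming known to outsiders, such institutions often need to run leak detection on sensitive information so that a leak is noticed promptly and remedial measures, such as deletion, can be taken.
Current sensitive information detection mainly performs HTML keyword detection on a page to identify whether sensitive information has been published on it. Specifically, the HTML source of the page is obtained and searched for keywords to judge whether sensitive information is present. For example, if the HTML source contains the keyword "关于印发xxx四项制度的通知" ("Notice on issuing the four rules of xxx"), an official document of a banking institution may have been leaked.
This approach only examines keywords in the HTML source and cannot exclude the influence of dynamic factors such as advertisements. Moreover, the HTML source does not represent the real data: tags that carry resource requests, and data that only becomes available after code has executed, cannot be obtained directly. Interference from dynamic factors, or failure to obtain the real data, therefore leads to low detection accuracy.
Summary
The main purpose of this application is to provide a sensitive information detection method, apparatus, device and computer-readable storage medium that improve the accuracy of sensitive information detection.
To achieve this purpose, this application provides a sensitive information detection method comprising the following steps:
sending a first request and a second request to a target address to obtain first content corresponding to the first request and second content corresponding to the second request;
determining, based on the first content and the second content, target content corresponding to the target address;
determining the original characters corresponding to the target content, and extracting target tags from the original characters;
generating a target page based on the target tags, and detecting whether target sensitive information exists in the target page, to obtain a detection result.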
在一实施例中,所述目标标签包括内容标签和样式标签,所述基于所述目标标签,生成目标页面的步骤包括:
确定所述内容标签的第一层级关系,并基于第一层级关系和所述内容标签,构建文档模型树;
确定所述样式标签的第二层级关系,并基于第二层级关系和所述样式标签,构建样式模型树;
基于所述文档模型树和所述样式模型树,生成渲染树;
遍历所述渲染树的节点,并基于所述节点和所述节点的节点关系,生成目标页面。
在一实施例中,所述基于第一层级关系和所述内容标签,构建文档模型树的步骤包括:
依次确定所述内容标签的标签类型;
若当前内容标签为脚本标签,则执行所述脚本标签对应的执行代码,并在所述执行代码执行完毕之后,确定下一内容标签的标签类型;
若当前内容标签为资源标签,则获取所述资源标签对应的资源,并将所述资源生成文档节点;
基于第一层级关系和所述文档节点,构建文档模型树。
在一实施例中,所述基于所述文档模型树和所述样式模型树,生成渲染树的步骤包括:
遍历所述文档模型树中的第一节点,并依次确定第一节点在所述样式模型树中对应的第二节点;
基于第一节点和第二节点,生成第三节点,并基于第三节点,生成渲染树。
在一实施例中,所述检测所述目标页面中是否存在目标敏感信息,以获得检测结果的步骤包括:
确定所述目标页面对应的第一字符串,以及所述目标敏感信息对应的第二字符串,并基于第一字符串的首位页面字符和第二字符串的首位敏感字符,将第一字符串与第二字符串对齐;
依次确定第二字符串的敏感字符与对应同一位置的第一字符串的页面字符是否匹配;
若当前页面字符与当前敏感字符不匹配,则将第二字符串的末位敏感字符所对应的页面字符的下一页面字符确定为目标字符,并确定第二字符串中是否存在所述目标字符;
若不存在,则基于所述目标字符的下一页面字符和第二字符串的首位敏感字符,将第一字符串和第二字符串对齐,并执行依次确定第二字符串的敏感字符与对应同一位置的第一字符串的页面字符是否匹配的步骤;
若存在,则基于所述目标字符,将第一字符串与第二字符串对齐,并执行依次确定第二字符串的敏感字符与对应同一位置的第一字符串的页面字符是否匹配的步骤;
若匹配,则记录第二字符串在第一字符串的匹配位置,并基于所述匹配位置输出检测结果。
在一实施例中,在生成目标页面之后,所述敏感信息检测方法还包括:
确定所述目标页面的标识信息,并基于所述标识信息,确定预设数据库中是否存在与所述标识信息一致的目标标识信息;
若不存在,则执行检测所述目标页面中是否存在目标敏感信息,以获得检测结果的步骤,并在获得所述检测结果后,将所述检测结果和所述标识信息关联保存在预设数据库中;
若存在,则获取所述目标标识信息对应的检测结果。
在一实施例中,所述基于第一内容和第二内容,确定所述目标地址对应的目标内容的步骤包括:
确定第一内容对应的第一序列,以及第二内容对应的第二序列,并基于第一序列和第二序列,生成目标矩阵;
基于所述目标矩阵,确定第一序列与第二序列的最长公共子序列,并基于所述最长公共子序列,确定所述目标地址对应的目标内容。
在一实施例中,所述向所述目标地址发送第一请求和第二请求,以得到所述目标地址返回的第一请求对应的第一内容和第二请求对应的第二内容的步骤之前,所述敏感信息检测方法还包括:
向所述目标地址发送第三请求,以得到第三请求对应的状态码;
若所述状态码为目标状态码,则执行向目标地址发送第一请求和第二请求的步骤。
此外,为实现上述目的,本申请还提供一种敏感信息检测装置,所述敏感信息检测装置包括:
发送模块,用于向目标地址发送第一请求和第二请求,以得到第一请求对应的第一内容和第二请求对应的第二内容;
确定模块,用于基于第一内容和第二内容,确定所述目标地址对应的目标内容;
提取模块,用于确定所述目标内容对应的原始字符,并提取所述原始字符中的目标标 签;
生成模块,用于基于所述目标标签,生成目标页面;
检测模块,用于检测所述目标页面中是否存在目标敏感信息,以获得检测结果。
在一实施例中,所述目标标签包括内容标签和样式标签,所述生成模块还用于:
确定所述内容标签的第一层级关系,并基于第一层级关系和所述内容标签,构建文档模型树;
确定所述样式标签的第二层级关系,并基于第二层级关系和所述样式标签,构建样式模型树;
基于所述文档模型树和所述样式模型树,生成渲染树;
遍历所述渲染树的节点,并基于所述节点和所述节点的节点关系,生成目标页面。
在一实施例中,所述生成模块还用于:
依次确定所述内容标签的标签类型;
若当前内容标签为脚本标签,则执行所述脚本标签对应的执行代码,并在所述执行代码执行完毕之后,确定下一内容标签的标签类型;
若当前内容标签为资源标签,则获取所述资源标签对应的资源,并将所述资源生成文档节点;
基于第一层级关系和所述文档节点,构建文档模型树。
在一实施例中,所述生成模块还用于:
遍历所述文档模型树中的第一节点,并依次确定第一节点在所述样式模型树中对应的第二节点;
基于第一节点和第二节点,生成第三节点,并基于第三节点,生成渲染树。
在一实施例中,所述检测模块还用于:
确定所述目标页面对应的第一字符串,以及所述目标敏感信息对应的第二字符串,并基于第一字符串的首位页面字符和第二字符串的首位敏感字符,将第一字符串与第二字符串对齐;
依次确定第二字符串的敏感字符与对应同一位置的第一字符串的页面字符是否匹配;
若当前页面字符与当前敏感字符不匹配,则将第二字符串的末位敏感字符所对应的页面字符的下一页面字符确定为目标字符,并确定第二字符串中是否存在所述目标字符;
若不存在,则基于所述目标字符的下一页面字符和第二字符串的首位敏感字符,将第一字符串和第二字符串对齐,并执行依次确定第二字符串的敏感字符与对应同一位置的第 一字符串的页面字符是否匹配的步骤;
若存在,则基于所述目标字符,将第一字符串与第二字符串对齐,并执行依次确定第二字符串的敏感字符与对应同一位置的第一字符串的页面字符是否匹配的步骤;
若匹配,则记录第二字符串在第一字符串的匹配位置,并基于所述匹配位置输出检测结果。
在一实施例中,所述检测模块还用于:
确定所述目标页面的标识信息,并基于所述标识信息,确定预设数据库中是否存在与所述标识信息一致的目标标识信息;
若不存在,则执行检测所述目标页面中是否存在目标敏感信息,以获得检测结果的步骤,并在获得所述检测结果后,将所述检测结果和所述标识信息关联保存在预设数据库中;
若存在,则获取所述目标标识信息对应的检测结果。
在一实施例中,所述确定模块还用于:
确定第一内容对应的第一序列,以及第二内容对应的第二序列,并基于第一序列和第二序列,生成目标矩阵;
基于所述目标矩阵,确定第一序列与第二序列的最长公共子序列,并基于所述最长公共子序列,确定所述目标地址对应的目标内容。
在一实施例中,所述发送模块还用于:
向所述目标地址发送第三请求,以得到第三请求对应的状态码;
若所述状态码为目标状态码,则执行向目标地址发送第一请求和第二请求的步骤。
此外,为实现上述目的,本申请还提供一种敏感信息检测设备,所述敏感信息检测设备包括:存储器、处理器及存储在所述存储器上并可在所述处理器上运行的敏感信息检测程序,所述敏感信息检测程序被所述处理器执行时实现如上所述的敏感信息检测方法的步骤。
此外,为实现上述目的,本申请还提供一种计算机可读存储介质,所述计算机可读存储介质上存储有敏感信息检测程序,所述敏感信息检测程序被处理器执行时实现如上所述的敏感信息检测方法的步骤。
本申请提出的敏感信息检测方法,向目标地址发送第一请求和第二请求,以得到第一请求对应的第一内容和第二请求对应的第二内容;基于第一内容和第二内容,确定目标地址对应的目标内容;确定目标内容对应的原始字符,并提取原始字符中的目标标签;基于目标标签,生成目标页面,并检测目标页面中是否存在目标敏感信息,以获得检测结果。 本申请通过同一地址的两次请求,剔除地址中动态因素的干扰,从而得到固定的内容,再通过提取标签,生成包含完整数据的页面,使得页面的内容固定且完整,再在该页面中进行敏感信息的检测,提高了敏感信息检测的准确率。
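Brief description of the drawings
Fig. 1 is a schematic structural diagram of the device in the hardware operating environment involved in the embodiments of this application;
Fig. 2 is a schematic flowchart of the first embodiment of the sensitive information detection method of this application;
Fig. 3 is a schematic diagram of a target matrix in the first embodiment of the sensitive information detection method of this application;
Fig. 4 is a schematic diagram of a document model tree in the first embodiment of the sensitive information detection method of this application.
The realization of the purpose, the functional features and the advantages of this application are further described with reference to the drawings in connection with the embodiments.
Detailed description
It should be understood that the specific embodiments described here only explain this application and do not limit it.
Fig. 1 is a schematic structural diagram of the device in the hardware operating environment involved in the embodiments of this application.
The device in the embodiments of this application may be a mobile terminal or a server device.
As shown in Fig. 1, the device may include a processor 1001 (for example a CPU), a network interface 1004, a user interface 1003, a memory 1005 and a communication bus 1002. The communication bus 1002 implements connection and communication between these components. The user interface 1003 may include a display and an input unit such as a keyboard, and may optionally also include standard wired and wireless interfaces. The network interface 1004 may optionally include standard wired and wireless interfaces (such as a WI-FI interface). The memory 1005 may be high-speed RAM or non-volatile memory such as disk storage, and may optionally be a storage device independent of the processor 1001.
Those skilled in the art will understand that the device structure shown in Fig. 1 does not limit the device, which may include more or fewer components than shown, combine certain components, or arrange the components differently.
As shown in Fig. 1, the memory 1005, as a computer storage medium, may include an operating system, a network communication module, a user interface module and a sensitive information detection program.
The operating system is a program that manages and controls the sensitive information detection device and its software resources, and supports the running of the network communication module, the user interface module, the sensitive information detection program and other programs or software. The network communication module manages and controls the network interface 1004; the user interface module manages and controls the user interface 1003.
In the sensitive information detection device shown in Fig. 1, the device calls, through the processor 1001, the sensitive information detection program stored in the memory 1005 and performs the operations of the embodiments of the sensitive information detection method described below.
Based on the above hardware structure, the embodiments of the sensitive information detection method of this application are proposed.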
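Referring to Fig. 2, which is a schematic flowchart of the first embodiment of the sensitive information detection method of this application, the method includes:
Step S10: send a first request and a second request to a target address to obtain first content corresponding to the first request and second content corresponding to the second request;
Step S20: based on the first content and the second content, determine target content corresponding to the target address;
Step S30: determine the original characters corresponding to the target content, and extract target tags from the original characters;
Step S40: generate a target page based on the target tags, and detect whether target sensitive information exists in the target page, to obtain a detection result.
The method of this embodiment runs on a sensitive information detection device of a financial institution such as a wealth management institution or a banking system. The device may be a terminal, a robot or a PC and, for convenience, is referred to below as the detection device. In this embodiment, the relevant staff build a sensitive information text library in advance according to the institution's actual situation, specifying which information must not be leaked, for example by marking "Notice on xxx", "xxx quotation sheet" and "xxx customer list" as sensitive information. The library may reside locally on the detection device or on a server connected to it. In addition, to ensure accurate detection, the detection device monitors all sites where sensitive information might be leaked, provided the sites are valid and reachable. That is, when visiting a site the detection device first uses regular-expression matching to judge whether the site's URL is valid; if it is, the device sends an access request and decides from the returned result whether the site is available. Only an available site is checked for sensitive information.
To avoid interference from dynamic factors such as advertisements in the site's pages, this embodiment visits the same site twice and uses the difference between the two visits to remove the dynamic factors, obtaining fixed target content. Tags are then extracted and the target content is turned into a page containing the complete data, so that the page content is fixed and complete. Sensitive information detection is then performed on this page, which makes the detection result more reliable.
Each step is described in detail below:
Step S10: send a first request and a second request to a target address to obtain first content corresponding to the first request and second content corresponding to the second request;
In this embodiment, the detection device sends the first request and the second request separately to the same target address and obtains the first content and the second content. If the target address is affected by dynamic factors, such as advertisements, different requests to the same address return different results, so the first content differs from the second content. If there are no dynamic factors, different requests to the same address return the same result, so the first content equals the second content.
Further, in one embodiment, before step S10 the sensitive information detection method also includes:
Step a1: send a third request to the target address to obtain a status code corresponding to the third request;
In one embodiment, a third request is first sent to the target address to obtain the corresponding status code. The third request is a HEAD request, and the status code indicates whether the request produced an error. In practice, status codes such as 401, 403 and 404 indicate an error and status code 200 indicates success, so 200 can be preset as the target status code.
Step a2: if the status code is the target status code, perform the step of sending the first request and the second request to the target address.
In one embodiment, if the current status code is 200, the step of sending the first request and the second request to the target address is performed. The first request and the second request are GET requests.
That is, in one embodiment the GET request is not sent to the target address directly. A HEAD request is sent first, and the status code it returns is used to judge whether the page at the target address is healthy. A GET request returns both header data and body data, whereas a HEAD request returns only header data, and in practice the body data is mostly invalid whenever the header data is invalid. To improve detection efficiency, one HEAD request is therefore sent first. If its status code indicates an error, such as 401, 403 or 404, detection of the current target address stops. If the status code is normal, such as 200, two GET requests are sent to obtain the first content and the second content.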
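The HEAD-then-GET flow of steps a1, a2 and S10 can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the names `fetch_twice`, `urllib_send` and the `send` parameter are hypothetical, and the transport is injected so the logic can be exercised without a network. Connection failures (`URLError`) are deliberately left unhandled.

```python
import urllib.error
import urllib.request
from typing import Callable, Optional, Tuple

# Hypothetical transport signature: (method, url) -> (status code, body bytes)
Sender = Callable[[str, str], Tuple[int, bytes]]


def urllib_send(method: str, url: str) -> Tuple[int, bytes]:
    """Send one request with the standard library and return (status, body)."""
    req = urllib.request.Request(url, method=method)
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status, resp.read()
    except urllib.error.HTTPError as err:
        # 401, 403, 404 and similar arrive as exceptions; report the code only.
        return err.code, b""


def fetch_twice(url: str, send: Sender = urllib_send,
                target_status: int = 200) -> Optional[Tuple[bytes, bytes]]:
    """Step a1/a2: probe with HEAD; step S10: issue two GETs if the probe passes."""
    status, _ = send("HEAD", url)
    if status != target_status:
        return None  # stop detecting this address
    _, first_content = send("GET", url)
    _, second_content = send("GET", url)
    return first_content, second_content
```

Calling `fetch_twice(url)` with the default sender performs real requests; passing a stub sender keeps the example deterministic.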
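Step S20: based on the first content and the second content, determine the target content corresponding to the target address.
In this embodiment, the detection device removes dynamic factors using the first content and the second content and thereby determines the target content corresponding to the target address. The target content is the part shared by the first content and the second content; that is, when removing dynamic factors, the parts not shared by the two are defined as dynamic content, in other words dynamic factors.
Specifically, in one embodiment, step S20 includes:
Step b1: determine a first sequence corresponding to the first content and a second sequence corresponding to the second content, and generate a target matrix based on the first sequence and the second sequence;
In one embodiment, the longest common subsequence of the first content and the second content is taken as their shared part. Specifically, the first sequence A corresponding to the first content and the second sequence B corresponding to the second content are determined, with lengths M and N respectively, and a target matrix C of size (M+1)×(N+1) is generated with all elements initialised to 0. As shown in Fig. 3, taking the first sequence A=[A,B,C,B,D,A,B] of length 7 and the second sequence B=[B,D,C,A,B,A] of length 6 as an example, an 8×7 target matrix C is generated.
Step b2: based on the target matrix, determine the longest common subsequence of the first sequence and the second sequence and, based on the longest common subsequence, determine the target content corresponding to the target address.
Next, the longest common subsequence of the first sequence and the second sequence is found using the target matrix C.
The formula is as follows:
\[
C[i][j] =
\begin{cases}
0, & i = 0 \text{ or } j = 0 \\
C[i-1][j-1] + 1, & i, j > 0 \text{ and } Ax[i] = By[j] \\
\max\bigl(C[i][j-1],\ C[i-1][j]\bigr), & i, j > 0 \text{ and } Ax[i] \neq By[j]
\end{cases}
\]
To solve it, all entries of matrix C start at 0. Because i and j are both greater than 0, row 0 and column 0 are left untouched and the computation starts from row 1, column 1. For j=1, By[1]=B. When i=1, Ax[1]=A; the two differ, so the larger of C[i-1][j] and C[i][j-1] is taken, which is 0. When i=2, Ax[2]=B; the two are equal, so C[i-1][j-1]+1 is taken, which is 0+1=1. The same procedure continues for i=3 and beyond, and the longest common subsequence finally obtained is [B,C,B,A], of length 4.
Finally, the longest common subsequence of the first content and the second content is taken as their shared part, that is, the target content.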
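A minimal sketch of steps b1 and b2 is given below, using the textbook dynamic-programming solution for the longest common subsequence. The function name `lcs` and the tie-breaking rule during backtracking are assumptions of this sketch; when several subsequences share the maximum length, which one is returned depends on that rule, so it need not reproduce [B,C,B,A] exactly.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def lcs(a: Sequence[T], b: Sequence[T]) -> List[T]:
    """Return one longest common subsequence of a and b."""
    m, n = len(a), len(b)
    # Target matrix C of size (M+1) x (N+1); row 0 and column 0 stay 0.
    c = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                c[i][j] = c[i - 1][j - 1] + 1
            else:
                c[i][j] = max(c[i - 1][j], c[i][j - 1])
    # Walk back from C[M][N] to recover the shared elements.
    out: List[T] = []
    i, j = m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif c[i - 1][j] >= c[i][j - 1]:
            i -= 1
        else:
            j -= 1
    out.reverse()
    return out
```

For page content, the sequences could just as well be lists of lines rather than characters; elements present in only one response (for example an advertisement line) then drop out of the result.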
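Step S30: determine the original characters corresponding to the target content, and extract the target tags from the original characters.
In this embodiment, when a GET request is sent to the target address, the returned body data contains the source as bytes, for example:
\u003c\u0068\u0074\u006d\u006c\u003e\u000a\u0020\u0020\u0020\u0020\u003c\u0068\u0065\u0061\u0064\u003e\u000a\u0020\u0020\u0020\u0020\u0020\u0020\u0020\u0020\u003c\u006d\u0065\...
The detection device reads the raw bytes of the body and decodes them with a preset encoding into recognisable original characters, for example:
<html>
<head>
<meta charset="UTF-8">
<link href=""/>
</head>
<body>
<img src='x'>
<script>alert(123)</script>
<p>
Text 1
<span>
Text 2
</span>
Text 3
</p>
</body>
</html>
That is, the bytes are converted into HTML source. The detection device then extracts the target tags from the original characters, that is, from the HTML source, for example <img src='x'>.
In one embodiment, the detection device may extract the target tags according to a preset tag structure; everything that satisfies the preset tag structure is a target tag.
Note that the text in the HTML source also needs to be extracted. Later, when the document model tree is generated, each piece of text is placed at the corresponding position in the tree according to its parent-child relationship.
Step S40: generate a target page based on the target tags, and detect whether target sensitive information exists in the target page, to obtain a detection result.
In this embodiment, once the target tags are available, more complete data can be obtained from them and used to generate the target page. The detection device then checks whether sensitive information is present in the target page, which yields a more accurate detection result.
Specifically, in one embodiment, the step of generating the target page based on the target tags includes:
Step c1: determine a first hierarchical relationship of the content tags, and construct a document model tree based on the first hierarchical relationship and the content tags;
In one embodiment, the target tags include content tags and style tags. Content tags describe the actual content, and style tags describe the layout of that content.
In practice, the first hierarchical relationship of the content tags is determined first, and the content tags are then turned into nodes. In the HTML source above, for example, the content tag <html> is the parent level and <head> and <body> are children of <html>. Referring to Fig. 4, the document model tree (DOM tree) is built from the first hierarchical relationship and the nodes corresponding to the content tags.
Further, in one embodiment, the step of constructing the document model tree based on the first hierarchical relationship and the content tags includes:
Step c11: determine in turn the tag type of each content tag;
In one embodiment, for pages rendered dynamically with JavaScript (js below), the HTML source only provides the js code or the path to the js code, not the data the detection device actually needs. There are also tags that carry resource requests (such as <script src='a.com'></script>); if the detection device only reads the tag, it does not pull the real data.
In practice, therefore, the tag type of each content tag is determined in turn, where the tag types include resource tags and script tags.
Step c12: if the current content tag is a script tag, execute the execution code corresponding to the script tag and, after the execution code has finished, determine the tag type of the next content tag;
In one embodiment, if the current content tag is a script tag, the detection device executes the corresponding code, suspends construction of the DOM tree, and continues with the tag type of the next content tag once execution has finished. That is, when the detection device meets a script tag (that is, js) while building the DOM tree, it blocks DOM construction and runs that js code in its js engine. DOM construction resumes after the js code has finished.
Note that DOM construction is blocked to improve overall efficiency: it avoids the case where a DOM node is created and then deleted by the js code.
Step c13: if the current content tag is a resource tag, obtain the resource corresponding to the resource tag and generate a document node from the resource;
In one embodiment, if the current content tag is a resource tag, the detection device obtains the corresponding resource. Specifically, following the request in the resource tag, such as <img src='x'> or <a href='x'>, it sends an HTTP request to fetch the resource, saves it locally, and generates a document node from it.
Step c14: construct the document model tree based on the first hierarchical relationship and the document nodes.
In one embodiment, the detection device builds the document model tree from the generated document nodes in the order given by the first hierarchical relationship.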
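The tag extraction of step S30 and the tree construction of steps c11 to c14 can be illustrated with Python's standard `html.parser`. This is only a sketch under stated assumptions: `Node`, `DomBuilder`, the `VOID` set and the `kind` labels are hypothetical names, and the sketch merely classifies script and resource tags. It does not run JavaScript or fetch resources, which the described device does with its js engine and HTTP requests.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser
from typing import Dict, List, Optional

VOID = {"img", "link", "meta", "br", "hr", "input"}  # elements with no close tag
RESOURCE_ATTRS = ("src", "href")                     # attributes that request a resource


@dataclass
class Node:
    tag: str
    attrs: Dict[str, Optional[str]] = field(default_factory=dict)
    text: str = ""
    kind: str = "plain"  # "plain", "script", "resource" or "text"
    children: List["Node"] = field(default_factory=list)


class DomBuilder(HTMLParser):
    """Builds a small document tree and labels each tag's type."""

    def __init__(self) -> None:
        super().__init__()
        self.root = Node("#document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "script":
            kind = "script"      # step c12 would execute the code here
        elif any(attr_map.get(a) for a in RESOURCE_ATTRS):
            kind = "resource"    # step c13 would fetch the resource here
        else:
            kind = "plain"
        node = Node(tag, attr_map, kind=kind)
        self.stack[-1].children.append(node)
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the matching open element; ignore stray close tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        # Text is attached to its parent, as the description requires.
        if data.strip():
            self.stack[-1].children.append(
                Node("#text", text=data.strip(), kind="text"))
```

Feeding the decoded HTML source with `builder.feed(html)` leaves the tree under `builder.root`.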
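Step c2: determine a second hierarchical relationship of the style tags, and construct a style model tree based on the second hierarchical relationship and the style tags;
In one embodiment, the detection device determines the second hierarchical relationship of the style tags in a similar way and builds the style model tree (CSS tree) from the second hierarchical relationship and the style tags. The process is similar to building the document model tree and is not repeated here.
Step c3: generate a rendering tree based on the document model tree and the style model tree;
In one embodiment, the detection device renders the obtained HTML source in the background; specifically, it generates a rendering tree from the document model tree and the style model tree.
In one embodiment, step c3 includes:
Step c31: traverse the first nodes in the document model tree, and determine in turn the second node in the style model tree corresponding to each first node;
In one embodiment, the DOM tree and the CSS tree correspond to each other, so while traversing all nodes of the DOM tree (the first nodes) the detection device can look up the corresponding second node in the CSS tree to find each node's style.
Step c32: generate a third node based on the first node and the second node, and generate the rendering tree based on the third nodes.
In one embodiment, a third node is generated from the first node and the second node and added to the rendering tree, so that the document model tree and the style model tree together produce the rendering tree.
Note that for invisible nodes (for example those with display:none set), the prior art does not ignore them, for the sake of data completeness, which makes rendering inefficient. In this embodiment the detection device does not scan invisible data, so these nodes can be ignored, which improves rendering efficiency.
Step c4: traverse the nodes of the rendering tree, and generate the target page based on the nodes and their node relationships.
In one embodiment, the detection device traverses all nodes of the rendering tree and generates the final HTML page, that is, the target page, from the relationships between the nodes.
Because the nodes of the rendering tree carry both content descriptions and style descriptions, a target page containing the complete data can be generated.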
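Steps c31, c32 and c4 can be sketched with plain dictionaries. The data shapes here are assumptions made for illustration: a DOM node is a dict with `tag`, optional `text` and `children`, and the style tree is simplified to a mapping from tag name to style properties, whereas a real CSS tree resolves selectors. `build_render_tree` and `visible_text` are hypothetical names.

```python
from typing import Dict, Optional


def build_render_tree(dom: dict, styles: Dict[str, Dict[str, str]]) -> Optional[dict]:
    """Merge a DOM node with its style; drop nodes styled display:none."""
    style = styles.get(dom["tag"], {})       # second node for this first node
    if style.get("display") == "none":
        return None                          # invisible: not part of the rendering tree
    children = []
    for child in dom.get("children", []):
        rendered = build_render_tree(child, styles)
        if rendered is not None:
            children.append(rendered)
    # Third node: content description plus style description.
    return {"tag": dom["tag"], "text": dom.get("text", ""),
            "style": style, "children": children}


def visible_text(node: dict) -> str:
    """Traverse the rendering tree and collect the text that would be scanned."""
    parts = [node["text"]] if node["text"] else []
    parts.extend(visible_text(child) for child in node["children"])
    return " ".join(p for p in parts if p)
```

The string returned by `visible_text` stands in for the target page content that the matching step later scans.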
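Finally, the detection device identifies and matches sensitive information keywords in the target page. Specifically, the keywords of the sensitive information are matched one by one against the characters of the target page to obtain the detection result. The target sensitive information is the sensitive information in the sensitive information text library, and the detection result is either "leaked" or "not leaked".
The detection device of this embodiment sends a first request and a second request to a target address to obtain first content corresponding to the first request and second content corresponding to the second request; determines, based on the first content and the second content, the target content corresponding to the target address; determines the original characters corresponding to the target content and extracts the target tags from them; and generates a target page based on the target tags and detects whether target sensitive information exists in it, to obtain a detection result. By requesting the same address twice, this application removes the interference of dynamic factors at that address and obtains fixed content; by extracting tags, it generates a page containing the complete data, so that the page content is fixed and complete. Performing sensitive information detection on that page improves detection accuracy.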
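Further, based on the first embodiment of the sensitive information detection method of this application, a second embodiment is proposed.
The second embodiment differs from the first in that the step of detecting whether target sensitive information exists in the target page includes:
Step d1: determine the first character string corresponding to the target page and the second character string corresponding to the target sensitive information, and align the first character string with the second character string based on the first page character of the first character string and the first sensitive character of the second character string;
Step d2: determine in turn whether each sensitive character of the second character string matches the page character of the first character string at the same position;
Step d3: if the current page character does not match the current sensitive character, take the page character that follows the page character aligned with the last sensitive character of the second character string as the target character, and determine whether the target character exists in the second character string;
Step d4: if it does not exist, align the first character string and the second character string based on the page character after the target character and the first sensitive character of the second character string, and perform step d2 again;
Step d5: if it exists, align the first character string with the second character string based on the target character, and perform step d2 again;
Step d6: if they match, record the matching position of the second character string in the first character string, and output a detection result based on the matching position.
Repeated batch detection produces large files, that is, many target pages, and detecting them with the traditional approach of matching keywords one by one takes too long. This embodiment therefore provides an improved matching method. Its basic principle is to match the sensitive information as a whole and, during matching, to take from the target page a run of characters of the same length as the sensitive information as the matching object, which speeds up matching and improves detection efficiency.
Each step is described in detail below:
Step d1: determine the first character string corresponding to the target page and the second character string corresponding to the target sensitive information, and align the first character string with the second character string based on the first page character of the first character string and the first sensitive character of the second character string;
In this embodiment, the detection device first determines the first character string corresponding to the target page and the second character string corresponding to the target sensitive information, and then constructs a position axis based on the first character string. Each page character of the first character string has a fixed position on the position axis. The two character strings are then aligned on the position axis, with the first page character aligned to the first sensitive character.
Step d2: determine in turn whether each sensitive character of the second character string matches the page character of the first character string at the same position;
In this embodiment, the detection device determines in turn whether each sensitive character of the second character string matches the page character of the first character string at the same position, for example whether the sensitive character at the first position on the position axis matches the page character at the first position.
Step d3: if the current page character does not match the current sensitive character, take the page character that follows the page character aligned with the last sensitive character of the second character string as the target character, and determine whether the target character exists in the second character string;
In this embodiment, if the current page character does not match the current sensitive character, the detection device takes the page character that follows the page character aligned with the last sensitive character of the second character string as the target character. In practice, it can first determine the character length of the second character string and add one to it; the corresponding position on the position axis is the position of the target character, and the character read at that position is the target character.
Next, it determines whether the target character exists in the second character string, that is, whether the target character is a sensitive character.
Step d4: if it does not exist, align the first character string and the second character string based on the page character after the target character and the first sensitive character of the second character string, and perform step d2 again;
In this embodiment, if the target character does not exist in the second character string, the target character is skipped and the page character after it is used as the alignment character. That is, the character after the target character and the first sensitive character of the second character string serve as the alignment reference, and the second character string is moved along the position axis. The second character string again corresponds to a run of page characters of the same length, and the detection device performs step d2 again.
Step d5: if it exists, align the first character string with the second character string based on the target character, and perform step d2 again;
In this embodiment, if the target character exists in the second character string, the target character is used as the alignment character and the second character string is moved along the position axis. The second character string again corresponds to a run of page characters of the same length, at least one of which equals the sensitive character aligned with it, and the detection device performs step d2 again.
Step d6: if they match, record the matching position of the second character string in the first character string, and output a detection result based on the matching position.
In this embodiment, if the sensitive characters of the second character string match the page characters of the first character string at the same positions, the matching position of the second character string in the first character string is recorded, and a detection result containing the matching position is output.
If no match has been found by the time the first character string is exhausted, the first character string does not contain the sensitive characters of the second character string, meaning the target sensitive information has not been leaked on the target page, and a "not leaked" detection result is output.
Take the first character string as A and the second character string as B, where:
A=[I,A,M,J,O,H,N,I,L,I,K,E,P,L,A,Y,I,N,G,F,O,O,T,B,A,L,L];
B=[N,I,L,I].
In practice, string A and string B are first aligned at their first characters:
[Figure PCTCN2021119658-appb-000002: string B aligned with string A at position 0]
At position 0, string A and string B do not match. The character that follows the page character J aligned with the last character I of string B, that is, the character O at position 4, is taken out, and the device checks whether string B contains an O.
String B contains no O, so B[0] is aligned with the character after O (A[5]):
[Figure PCTCN2021119658-appb-000003: string B aligned with string A at position 5]
H and N do not match either, so the page character after the one aligned with the end of string B is taken, which here is I (A[9]).
I exists in string B, so the detection device moves string B one position to the right, aligning the I in the two strings. Note that if string B contains several I characters, each I in turn is used as the alignment character.
This yields the matching position of string B in string A:
[Figure PCTCN2021119658-appb-000004: string B matching string A at position 6]
The detection device then outputs a detection result containing this matching position.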
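The shifting rule of steps d1 to d6 corresponds closely to what is commonly called the Sunday string-matching algorithm, and a minimal sketch along those lines follows. One point is an assumption of this sketch rather than something the description fixes: when the target character occurs several times in the pattern, the sketch always aligns its rightmost occurrence, which is the standard choice and never skips a possible match. The name `sunday_search` is hypothetical.

```python
def sunday_search(text: str, pattern: str) -> int:
    """Return the first index where pattern occurs in text, or -1 if absent."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0
    # Rightmost index of every character of the pattern (the sensitive string).
    last = {ch: i for i, ch in enumerate(pattern)}
    pos = 0  # current alignment of pattern[0] on the position axis
    while pos + m <= n:
        # Step d2: compare the window of page characters with the pattern.
        if text[pos:pos + m] == pattern:
            return pos  # step d6: matching position
        nxt = pos + m   # step d3: character just past the window
        if nxt >= n:
            break
        target = text[nxt]
        if target in last:
            pos += m - last[target]  # step d5: align on the target character
        else:
            pos += m + 1             # step d4: jump past the target character
    return -1


page = "IAMJOHNILIKEPLAYINGFOOTBALL"
print(sunday_search(page, "NILI"))  # prints 6
```

On the worked example the alignments visited are positions 0, 5 and 6, matching the walkthrough above.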
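The detection device of this embodiment matches the sensitive information as a whole and, during matching, takes from the target page a run of characters of the same length as the sensitive information as the matching object, which speeds up matching and improves detection efficiency.
Further, based on the first and second embodiments of the sensitive information detection method of this application, a third embodiment is proposed.
The third embodiment differs from the first and second embodiments in that, after the target page is generated, the sensitive information detection method also includes:
Step e1: determine the identification information of the target page and, based on the identification information, determine whether the preset database contains target identification information consistent with it;
Step e2: if it does not exist, perform the step of detecting whether target sensitive information exists in the target page to obtain the detection result, and after obtaining the detection result, store the detection result in association with the identification information in the preset database;
Step e3: if it exists, obtain the detection result corresponding to the target identification information.
To avoid repeated detection, this embodiment calculates the identification information of the target page once dynamic factors have been removed and compares it with the identification information in the database. If the database contains no consistent identification information, leak detection is performed on the page and the final detection result is stored in the database in association with the identification information. If the database contains consistent identification information, the corresponding detection result is output directly and no further detection is needed, which reduces detection operations and improves detection efficiency.
Each step is described in detail below:
Step e1: determine the identification information of the target page and, based on the identification information, determine whether the preset database contains target identification information consistent with it;
In this embodiment, after obtaining the target page with dynamic factors removed, the detection device calculates the identification information of the target page. The identification information may be a hash value, an MD5 value or similar, and uniquely identifies the current target page.
The identification information of the target page is then compared with the identification information in the preset database to determine whether the database contains target identification information consistent with it.
Step e2: if it does not exist, perform the step of detecting whether target sensitive information exists in the target page to obtain the detection result, and after obtaining the detection result, store the detection result in association with the identification information in the preset database;
In this embodiment, if the preset database contains no target identification information consistent with that of the target page, the target page has not been detected before, so the step of detecting whether target sensitive information exists in the target page is performed. The specific process is described in the previous embodiment and is not repeated here. The resulting detection result is then bound to the identification information of the target page and stored in the preset database, so that the target page is not detected again.
Step e3: if it exists, obtain the detection result corresponding to the target identification information.
In this embodiment, if the preset database contains target identification information consistent with that of the target page, the target page has been detected before, and the corresponding detection result is read directly from the preset database and output.
To avoid repeated detection, this embodiment calculates the identification information of the target page once dynamic factors have been removed and uses it to decide whether the current target page has already been detected. If it has, the earlier detection result is output directly and no further detection is needed, which reduces detection operations and improves detection efficiency.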
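Steps e1 to e3 amount to a result cache keyed by a digest of the page. The sketch below is illustrative only: it uses SHA-256 from `hashlib` as the hash (the description also permits MD5), a plain dict stands in for the preset database, and `detect_with_cache` and its parameters are hypothetical names.

```python
import hashlib
from typing import Callable, Dict


def detect_with_cache(page: str, detect: Callable[[str], bool],
                      store: Dict[str, bool]) -> bool:
    """Return the detection result for page, reusing a stored result if present."""
    # Step e1: identification information of the target page.
    key = hashlib.sha256(page.encode("utf-8")).hexdigest()
    if key in store:
        return store[key]    # step e3: page was detected before
    result = detect(page)    # step e2: run the actual detection
    store[key] = result      # and bind the result to the identifier
    return result
```

Because the key is computed from the page after dynamic content has been removed, two visits that differ only in advertisements map to the same entry.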
本申请还提供一种敏感信息检测装置。本申请敏感信息检测装置包括:
发送模块,用于向目标地址发送第一请求和第二请求,以得到第一请求对应的第一内容和第二请求对应的第二内容;
确定模块,用于基于第一内容和第二内容,确定所述目标地址对应的目标内容;
提取模块,用于确定所述目标内容对应的原始字符,并提取所述原始字符中的目标标签;
生成模块,用于基于所述目标标签,生成目标页面;
检测模块,用于检测所述目标页面中是否存在目标敏感信息,以获得检测结果。
在一实施例中,所述目标标签包括内容标签和样式标签,所述生成模块还用于:
确定所述内容标签的第一层级关系,并基于第一层级关系和所述内容标签,构建文档模型树;
确定所述样式标签的第二层级关系,并基于第二层级关系和所述样式标签,构建样式模型树;
基于所述文档模型树和所述样式模型树,生成渲染树;
遍历所述渲染树的节点,并基于所述节点和所述节点的节点关系,生成目标页面。
在一实施例中,所述生成模块还用于:
依次确定所述内容标签的标签类型;
若当前内容标签为脚本标签,则执行所述脚本标签对应的执行代码,并在所述执行代码执行完毕之后,确定下一内容标签的标签类型;
若当前内容标签为资源标签,则获取所述资源标签对应的资源,并将所述资源生成文档节点;
基于第一层级关系和所述文档节点,构建文档模型树。
在一实施例中,所述生成模块还用于:
遍历所述文档模型树中的第一节点,并依次确定第一节点在所述样式模型树中对应的第二节点;
基于第一节点和第二节点,生成第三节点,并基于第三节点,生成渲染树。
在一实施例中,所述检测模块还用于:
确定所述目标页面对应的第一字符串,以及所述目标敏感信息对应的第二字符串,并基于第一字符串的首位页面字符和第二字符串的首位敏感字符,将第一字符串与第二字符串对齐;
依次确定第二字符串的敏感字符与对应同一位置的第一字符串的页面字符是否匹配;
若当前页面字符与当前敏感字符不匹配,则将第二字符串的末位敏感字符所对应的页面字符的下一页面字符确定为目标字符,并确定第二字符串中是否存在所述目标字符;
若不存在,则基于所述目标字符的下一页面字符和第二字符串的首位敏感字符,将第一字符串和第二字符串对齐,并执行依次确定第二字符串的敏感字符与对应同一位置的第一字符串的页面字符是否匹配的步骤;
若存在,则基于所述目标字符,将第一字符串与第二字符串对齐,并执行依次确定第二字符串的敏感字符与对应同一位置的第一字符串的页面字符是否匹配的步骤;
若匹配,则记录第二字符串在第一字符串的匹配位置,并基于所述匹配位置输出检测结果。
在一实施例中,所述检测模块还用于:
确定所述目标页面的标识信息,并基于所述标识信息,确定预设数据库中是否存在与所述标识信息一致的目标标识信息;
若不存在,则执行检测所述目标页面中是否存在目标敏感信息,以获得检测结果的步骤,并在获得所述检测结果后,将所述检测结果和所述标识信息关联保存在预设数据库中;
若存在,则获取所述目标标识信息对应的检测结果。
在一实施例中,所述确定模块还用于:
确定第一内容对应的第一序列,以及第二内容对应的第二序列,并基于第一序列和第二序列,生成目标矩阵;
基于所述目标矩阵,确定第一序列与第二序列的最长公共子序列,并基于所述最长公共子序列,确定所述目标地址对应的目标内容。
在一实施例中,所述发送模块还用于:
向所述目标地址发送第三请求,以得到第三请求对应的状态码;
若所述状态码为目标状态码,则执行向目标地址发送第一请求和第二请求的步骤。
本申请还提供一种计算机可读存储介质。
本申请计算机可读存储介质上存储有敏感信息检测程序,所述敏感信息检测程序被处理器执行时实现如上所述的敏感信息检测方法的步骤。
其中,在所述处理器上运行的敏感信息检测程序被执行时所实现的方法可参照本申请敏感信息检测方法各个实施例,此处不再赘述。
需要说明的是,在本文中,术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含,从而使得包括一系列要素的过程、方法、物品或者系统不仅包括那些要素,而且还包括没有明确列出的其他要素,或者是还包括为这种过程、方法、物品或者系统所固有的要素。在没有更多限制的情况下,由语句“包括一个……”限定的要素,并不排除在包括该要素的过程、方法、物品或者系统中还存在另外的相同要素。
上述本申请实施例序号仅仅为了描述,不代表实施例的优劣。
通过以上的实施方式的描述,本领域的技术人员可以清楚地了解到上述实施例方法可借助软件加必需的通用硬件平台的方式来实现,当然也可以通过硬件,但很多情况下前者是更佳的实施方式。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来,该计算机软件产品存储在如上所述的一个存储介质(如ROM/RAM、磁碟、光盘)中,包括若干指令用以使得一台终端设备(可以是手机,计算机,服务器,空调器,或者网络设备等)执行本申请各个实施例所述的方法。
以上仅为本申请的优选实施例,并非因此限制本申请的专利范围,凡是利用本申请说明书与附图内容所作的等效结构或等效流程变换,或直接或间接运用在其他相关的技术领域,均同理包括在本申请的专利保护范围内。

Claims (11)

  1. 一种敏感信息检测方法,其中,所述敏感信息检测方法包括如下步骤:
    向目标地址发送第一请求和第二请求,以得到第一请求对应的第一内容和第二请求对应的第二内容;
    基于第一内容和第二内容,确定所述目标地址对应的目标内容;
    确定所述目标内容对应的原始字符,并提取所述原始字符中的目标标签;
    基于所述目标标签,生成目标页面,并检测所述目标页面中是否存在目标敏感信息,以获得检测结果。
  2. 如权利要求1所述的敏感信息检测方法,其中,所述目标标签包括内容标签和样式标签,所述基于所述目标标签,生成目标页面的步骤包括:
    确定所述内容标签的第一层级关系,并基于第一层级关系和所述内容标签,构建文档模型树;
    确定所述样式标签的第二层级关系,并基于第二层级关系和所述样式标签,构建样式模型树;
    基于所述文档模型树和所述样式模型树,生成渲染树;
    遍历所述渲染树的节点,并基于所述节点和所述节点的节点关系,生成目标页面。
  3. 如权利要求2所述的敏感信息检测方法,其中,所述基于第一层级关系和所述内容标签,构建文档模型树的步骤包括:
    依次确定所述内容标签的标签类型;
    若当前内容标签为脚本标签,则执行所述脚本标签对应的执行代码,并在所述执行代码执行完毕之后,确定下一内容标签的标签类型;
    若当前内容标签为资源标签,则获取所述资源标签对应的资源,并将所述资源生成文档节点;
    基于第一层级关系和所述文档节点,构建文档模型树。
  4. 如权利要求2所述的敏感信息检测方法,其中,所述基于所述文档模型树和所述样式模型树,生成渲染树的步骤包括:
    遍历所述文档模型树中的第一节点,并依次确定第一节点在所述样式模型树中对应的第二节点;
    基于第一节点和第二节点,生成第三节点,并基于第三节点,生成渲染树。
  5. 如权利要求1所述的敏感信息检测方法,其中,所述检测所述目标页面中是否存在目标敏感信息,以获得检测结果的步骤包括:
    确定所述目标页面对应的第一字符串,以及所述目标敏感信息对应的第二字符串,并基于第一字符串的首位页面字符和第二字符串的首位敏感字符,将第一字符串与第二字符串对齐;
    依次确定第二字符串的敏感字符与对应同一位置的第一字符串的页面字符是否匹配;
    若当前页面字符与当前敏感字符不匹配,则将第二字符串的末位敏感字符所对应的页面字符的下一页面字符确定为目标字符,并确定第二字符串中是否存在所述目标字符;
    若不存在,则基于所述目标字符的下一页面字符和第二字符串的首位敏感字符,将第一字符串和第二字符串对齐,并执行依次确定第二字符串的敏感字符与对应同一位置的第一字符串的页面字符是否匹配的步骤;
    若存在,则基于所述目标字符,将第一字符串与第二字符串对齐,并执行依次确定第二字符串的敏感字符与对应同一位置的第一字符串的页面字符是否匹配的步骤;
    若匹配,则记录第二字符串在第一字符串的匹配位置,并基于所述匹配位置输出检测结果。
  6. 如权利要求1所述的敏感信息检测方法,其中,在生成目标页面之后,所述敏感信息检测方法还包括:
    确定所述目标页面的标识信息,并基于所述标识信息,确定预设数据库中是否存在与所述标识信息一致的目标标识信息;
    若不存在,则执行检测所述目标页面中是否存在目标敏感信息,以获得检测结果的步骤,并在获得所述检测结果后,将所述检测结果和所述标识信息关联保存在预设数据库中;
    若存在,则获取所述目标标识信息对应的检测结果。
  7. 如权利要求1所述的敏感信息检测方法,其中,所述基于第一内容和第二内容,确定所述目标地址对应的目标内容的步骤包括:
    确定第一内容对应的第一序列,以及第二内容对应的第二序列,并基于第一序列和第二序列,生成目标矩阵;
    基于所述目标矩阵,确定第一序列与第二序列的最长公共子序列,并基于所述最长公共子序列,确定所述目标地址对应的目标内容。
  8. 如权利要求1-7任一项所述的敏感信息检测方法,其中,所述向所述目标地址发送第一请求和第二请求,以得到所述目标地址返回的第一请求对应的第一内容和第二请求对应的第二内容的步骤之前,所述敏感信息检测方法还包括:
    向所述目标地址发送第三请求,以得到第三请求对应的状态码;
    若所述状态码为目标状态码,则执行向目标地址发送第一请求和第二请求的步骤。
  9. 一种敏感信息检测装置,其中,所述敏感信息检测装置包括:
    发送模块,用于向目标地址发送第一请求和第二请求,以得到第一请求对应的第一内容和第二请求对应的第二内容;
    确定模块,用于基于第一内容和第二内容,确定所述目标地址对应的目标内容;
    提取模块,用于确定所述目标内容对应的原始字符,并提取所述原始字符中的目标标签;
    生成模块,用于基于所述目标标签,生成目标页面;
    检测模块,用于检测所述目标页面中是否存在目标敏感信息,以获得检测结果。
  10. 一种敏感信息检测设备,其中,所述敏感信息检测设备包括:存储器、处理器及存储在所述存储器上并可在所述处理器上运行的敏感信息检测程序,所述敏感信息检测程序被所述处理器执行时实现如权利要求1至8中任一项所述的敏感信息检测方法的步骤。
  11. 一种计算机可读存储介质,其中,所述计算机可读存储介质上存储有敏感信息检测程序,所述敏感信息检测程序被处理器执行时实现如权利要求1至8中任一项所述的敏感信息检测方法的步骤。
PCT/CN2021/119658 2020-09-27 2021-09-22 敏感信息检测方法、装置、设备与计算机可读存储介质 WO2022063133A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011036671.5 2020-09-27
CN202011036671.5A CN112052364A (zh) 2020-09-27 2020-09-27 敏感信息检测方法、装置、设备与计算机可读存储介质

Publications (1)

Publication Number Publication Date
WO2022063133A1 true WO2022063133A1 (zh) 2022-03-31

Family

ID=73605520

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/119658 WO2022063133A1 (zh) 2020-09-27 2021-09-22 敏感信息检测方法、装置、设备与计算机可读存储介质

Country Status (2)

Country Link
CN (1) CN112052364A (zh)
WO (1) WO2022063133A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115081440A (zh) * 2022-07-22 2022-09-20 湖南湘生网络信息有限公司 文本中变种词的识别及提取原敏感词的方法、装置及设备

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112052364A (zh) * 2020-09-27 2020-12-08 深圳前海微众银行股份有限公司 敏感信息检测方法、装置、设备与计算机可读存储介质
CN113836915A (zh) * 2021-09-23 2021-12-24 平安普惠企业管理有限公司 数据处理方法、装置、设备及可读存储介质

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2019101606A (ja) * 2017-11-30 2019-06-24 ヤフー株式会社 判定装置、判定方法、及び判定プログラム
CN110020306A (zh) * 2017-11-30 2019-07-16 腾讯科技(深圳)有限公司 页面显示方法、装置、存储介质及终端
CN110309364A (zh) * 2018-03-02 2019-10-08 腾讯科技(深圳)有限公司 一种信息抽取方法及装置
CN111353116A (zh) * 2020-02-28 2020-06-30 深圳市意盛科技有限公司 内容检测方法、系统及设备、客户端设备和存储介质
CN111597490A (zh) * 2020-05-21 2020-08-28 深圳前海微众银行股份有限公司 Web指纹识别方法、装置、设备及计算机存储介质
CN112052364A (zh) * 2020-09-27 2020-12-08 深圳前海微众银行股份有限公司 敏感信息检测方法、装置、设备与计算机可读存储介质

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102662737B (zh) * 2012-03-14 2014-06-11 优视科技有限公司 扩展程序的调用方法及装置
CN103514238B (zh) * 2012-06-30 2017-12-19 重庆新媒农信科技有限公司 基于分类查找的敏感词识别处理方法
CN104133865A (zh) * 2014-07-17 2014-11-05 可牛网络技术(北京)有限公司 一种广告过滤方法以及装置
CN105471823B (zh) * 2014-09-03 2018-10-26 阿里巴巴集团控股有限公司 一种敏感信息处理方法、装置、服务器及安全判定系统
CN106326734A (zh) * 2015-06-30 2017-01-11 阿里巴巴集团控股有限公司 一种检测敏感信息的方法和设备
CN107066882B (zh) * 2017-03-17 2019-07-12 平安科技(深圳)有限公司 信息泄露检测方法及装置
CN110348239B (zh) * 2019-06-13 2023-10-27 张建军 脱敏规则配置方法以及数据脱敏方法、系统、计算机设备
CN111159329B (zh) * 2019-12-24 2023-09-08 深圳市优必选科技股份有限公司 敏感词检测方法、装置、终端设备和计算机可读存储介质

Also Published As

Publication number Publication date
CN112052364A (zh) 2020-12-08

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21871503

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21871503

Country of ref document: EP

Kind code of ref document: A1