CN115470361A - Data detection method and device - Google Patents

Publication number
CN115470361A
Authority
CN
China
Prior art keywords
data
detection
information
detected
knowledge graph
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211144823.2A
Other languages
Chinese (zh)
Inventor
鲍梦瑶
刘佳伟
章鹏
张谦
贾茜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ant Blockchain Technology Shanghai Co Ltd
Original Assignee
Ant Blockchain Technology Shanghai Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ant Blockchain Technology Shanghai Co Ltd filed Critical Ant Blockchain Technology Shanghai Co Ltd
Priority to CN202211144823.2A priority Critical patent/CN115470361A/en
Publication of CN115470361A publication Critical patent/CN115470361A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/18 Legal services

Abstract

Embodiments of the present specification provide a data detection method and apparatus. The data detection method includes: parsing a contract text to determine data description information; acquiring target data, and constructing a data knowledge graph according to the target data and the data description information; and performing data detection on the data knowledge graph according to a pre-constructed data risk strategy to obtain a detection result. By parsing the contract text that governs data transmission, the data description information for the target data is determined; a data knowledge graph is constructed for the acquired target data; and data detection is performed on the data knowledge graph according to the pre-constructed data risk strategy and the data description information to obtain the detection result.

Description

Data detection method and device
Technical Field
Embodiments of this specification relate to the technical field of data security, and in particular to a data detection method.
Background
Data, as a fundamental national strategic resource, is the core and lifeblood of the big data industry. To standardize data generation, acquisition, storage, processing, analysis, and services, to jointly maintain network security and data security, to promote the development of the big data industry, to unlock the potential of data as a factor of production, and to accelerate improvements in the quality, efficiency, and driving forces of economic and social development, enterprises should pay closer attention to how personal information can be collected, processed, and applied lawfully and in compliance with regulations.
To help enterprises achieve legal compliance across the full life cycle of their data assets and avoid the associated legal risks, and to help supervisory bodies manage violations of laws and regulations, an efficient and automated compliance detection scheme is urgently needed.
Disclosure of Invention
In view of this, the embodiments of the present specification provide a data detection method. One or more embodiments of the present disclosure also relate to a data detection apparatus, a computing device, a computer-readable storage medium, and a computer program, so as to solve the technical problems in the prior art.
According to a first aspect of embodiments herein, there is provided a data detection method, including:
analyzing the contract text to determine data description information;
constructing a data knowledge graph according to target data transmitted by using the contract text;
and performing data detection on the nodes in the data knowledge graph and the association relation between the nodes according to a pre-constructed data risk strategy and the data description information to obtain a detection result.
According to a second aspect of embodiments herein, there is provided a data detection apparatus comprising:
the protocol analysis module is configured to analyze the contract text and determine data description information;
the graph construction module is configured to construct a data knowledge graph according to target data transmitted by utilizing the contract text;
and the detection module is configured to perform data detection on the nodes in the data knowledge graph and the association relation between the nodes according to a pre-constructed data risk strategy and the data description information to obtain a detection result.
According to a third aspect of embodiments herein, there is provided a computing device comprising:
a memory and a processor;
the memory is used for storing computer-executable instructions, and the processor is used for executing the computer-executable instructions, and the computer-executable instructions realize the steps of the data detection method when being executed by the processor.
According to a fourth aspect of embodiments herein, there is provided a computer-readable storage medium storing computer-executable instructions that, when executed by a processor, implement the steps of the above-described data detection method.
According to a fifth aspect of embodiments herein, there is provided a computer program, wherein the computer program, when executed in a computer, causes the computer to perform the steps of the above-mentioned data detection method.
Embodiments of the present specification provide a data detection method and apparatus. The data detection method includes: parsing a contract text to determine data description information; acquiring target data, and constructing a data knowledge graph according to the target data and the data description information; and performing data detection on the data knowledge graph according to a pre-constructed data risk strategy to obtain a detection result. By parsing the contract text that governs data transmission, the data description information for the target data is determined; a data knowledge graph is constructed for the acquired target data; and data detection is performed on the data knowledge graph according to the pre-constructed data risk strategy and the data description information to obtain the detection result.
Drawings
FIG. 1a is a schematic view of a data detection method provided in an embodiment of the present specification;
FIG. 1b is a block diagram of a data detection method according to an embodiment of the present disclosure;
FIG. 1c is a schematic diagram of a model structure of a data detection method provided in an embodiment of the present specification;
FIG. 1d is a schematic view of a knowledge graph of a data detection method provided in one embodiment of the present disclosure;
FIG. 2a is a flow chart of a data detection method provided in one embodiment of the present description;
FIG. 2b is a schematic illustration of another knowledge graph of a data detection method provided in an embodiment of the present disclosure;
FIG. 3 is a flowchart illustrating a processing procedure of a data detection method according to an embodiment of the present disclosure;
FIG. 4 is a schematic structural diagram of a data detection apparatus according to an embodiment of the present disclosure;
FIG. 5 is a block diagram of a computing device according to an embodiment of the present disclosure.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present description. This description may be implemented in many ways other than those specifically set forth herein, and those skilled in the art will appreciate that the present description is susceptible to similar generalizations without departing from the scope of the description, and thus is not limited to the specific implementations disclosed below.
The terminology used in the description of the one or more embodiments is for the purpose of describing the particular embodiments only and is not intended to be limiting of the description of the one or more embodiments. As used in this specification and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in one or more embodiments of the present specification refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It will be understood that, although the terms first, second, etc. may be used in one or more embodiments of this specification to describe various information, the information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, without departing from the scope of one or more embodiments of this specification, a first may also be referred to as a second and, similarly, a second may also be referred to as a first. Depending on the context, the word "if" as used herein may be interpreted as "when", "upon", or "in response to a determination".
First, the terms used in one or more embodiments of this specification are explained.
Data asset: a data asset refers to a data resource, such as file data or electronic data, that is owned or controlled by an enterprise, is recorded physically or electronically, and can bring future economic benefits to the enterprise.
Full life cycle: the full life cycle of a data asset typically includes data acquisition, data storage, data processing, data transmission, data exchange, and data destruction.
Privacy compliance: enterprises and organizations must comply with the applicable national laws and regulations when using and processing data assets that involve personal privacy data.
Named entity recognition: named entity recognition (NER) is a basic task of information extraction. It identifies content such as person names, organization names, times, places, and specific numeric forms in a text and attaches corresponding label information to that content, which facilitates subsequent information extraction work.
Supervised learning: supervised learning is a machine learning method in which input data is classified or fitted on the basis of previously labeled training examples.
Deep learning: deep learning is a branch of machine learning that performs representation learning on data using artificial neural networks as its framework.
Knowledge graph: a knowledge graph presents a complex knowledge domain through data mining, information processing, knowledge measurement, and graph drawing, reveals the dynamic development patterns of that domain, and provides a practical and valuable reference for research.
Graph database: a graph database is a type of NoSQL (non-relational) database that applies graph theory to store relationship information between entities. The most common example is interpersonal relationships in a social network. Relational databases handle such "relationship" data poorly, with queries that are complex, slow, and often worse than expected; the design of graph databases addresses this shortcoming.
MD5 message-digest algorithm: a widely used cryptographic hash function that produces a 128-bit (16-byte) hash value and is used to check the integrity of transmitted messages.
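As a brief illustration of the MD5 entry above, the following sketch uses Python's standard `hashlib` module to compute a digest and to compare digests as an integrity check. The helper name `md5_hex` and the sample messages are illustrative assumptions and do not come from the specification.

```python
import hashlib


def md5_hex(message: bytes) -> str:
    """Return the MD5 digest of `message` as 32 hexadecimal characters."""
    return hashlib.md5(message).hexdigest()


# The sender publishes the digest alongside the message (hypothetical payload).
sent = b"order information"
digest_sent = md5_hex(sent)

# The receiver recomputes the digest and compares it with the published one.
received_ok = md5_hex(b"order information") == digest_sent
received_tampered = md5_hex(b"order informatiom") == digest_sent

# The raw digest is 16 bytes, i.e. 128 bits.
digest_bits = len(hashlib.md5(sent).digest()) * 8
print(digest_bits, received_ok, received_tampered)
```

A matching digest indicates that the received bytes are identical to the bytes that were hashed by the sender; a single changed character produces a different digest.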
In the present specification, a data detection method is provided, and the present specification relates to a data detection apparatus, a computing device, and a computer-readable storage medium, which are described in detail one by one in the following embodiments.
Referring to FIG. 1a, FIG. 1a is a schematic view illustrating a scenario of a data detection method provided according to an embodiment of the present specification, where the scenario includes a company A, a server, a data knowledge graph, and a data risk policy.
Company A transmits the target data to the server. The server generates a data knowledge graph from the received target data and checks the data knowledge graph against a pre-generated data risk policy to determine whether the data in the data knowledge graph violates any rules.
Specifically, referring to FIG. 1b, FIG. 1b shows an architecture diagram of a data detection method provided according to an embodiment of the present specification. The architecture includes a protocol analysis module, which mainly extracts compliance-related information from the data collection contract texts and data transmission contract texts between company A and other companies or individuals. A data collection contract text may be a contract text under which a company collects personal data; a data transmission contract text may be a contract text governing data transmission between companies; and the compliance-related information may be the auxiliary information required for risk detection together with the information to be detected. The compliance-related information can be extracted with an NER neural network model. Referring to FIG. 1c, FIG. 1c shows a schematic diagram of the model structure of a data detection method provided according to an embodiment of the present specification. The model adopts the currently mainstream NER model structure and consists of two parts: a neural-network-based encoder, which creates context-based embeddings for each character, that is, converts natural language into numerical vectors; and a CRF (conditional random field) output layer, which captures the dependencies between adjacent nodes and predicts labels. Taking the identification of privacy data types in a data collection contract text as an example, each input to the model is a sentence from the contract text, such as "we may collect your order information and browsing information for data analysis", and each output is the tag sequence for that sentence. After post-processing, the privacy data types identified in the sentence, namely "order information" and "browsing information", are obtained.
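To make the role of the conditional random field (CRF) output layer more concrete, the sketch below shows generic Viterbi decoding over per-character label scores and label-to-label transition scores. It is a minimal illustration under stated assumptions, not the model of the specification: the function name `viterbi_decode`, the label set, and all scores are hypothetical, and in a real model the emission scores would come from the encoder.

```python
from typing import Dict, List, Tuple


def viterbi_decode(
    emissions: List[Dict[str, float]],
    transitions: Dict[Tuple[str, str], float],
    labels: List[str],
) -> List[str]:
    """Return the highest-scoring label sequence.

    emissions[i][label] is the score of `label` at character i;
    transitions[(prev, cur)] is the score of moving from `prev` to `cur`
    (missing pairs count as 0.0).
    """
    # best[i][label] = best total score of any path ending in `label` at i.
    best = [{lab: emissions[0][lab] for lab in labels}]
    back: List[Dict[str, str]] = []
    for i in range(1, len(emissions)):
        scores: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for cur in labels:
            prev = max(
                labels,
                key=lambda p: best[-1][p] + transitions.get((p, cur), 0.0),
            )
            scores[cur] = (
                best[-1][prev]
                + transitions.get((prev, cur), 0.0)
                + emissions[i][cur]
            )
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final label.
    path = [max(labels, key=lambda lab: best[-1][lab])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Hypothetical scores for a three-character input.
labels = ["O", "B", "E"]
emissions = [
    {"O": 2.0, "B": 0.0, "E": 0.0},
    {"O": 1.0, "B": 0.9, "E": 0.0},
    {"O": 0.0, "B": 0.0, "E": 1.5},
]
# Penalize label pairs that cannot be adjacent, e.g. "E" directly after "O".
transitions = {("O", "E"): -10.0, ("B", "O"): -10.0,
               ("B", "B"): -10.0, ("E", "E"): -10.0}
decoded = viterbi_decode(emissions, transitions, labels)
print(decoded)
```

Choosing the best label per character independently would give O, O, E here, which ends an entity that never began; the transition scores are what let the decoder prefer a consistent sequence.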
In general, the NER model can extract compliance-related information in the contract text, including the type of collected private data, storage period, permitted use scenario, etc., which in turn can be entered into the knowledge graph module and stored in the attribute information of the nodes or edges in the graph database.
In the knowledge graph module, the data tables and data columns obtained from the data, together with company or organization information, are stored as nodes in the graph. For example, a company node contains information such as the company name, registered address, and registered capital; a data table or data column node contains information such as whether the data is personal data, the sensitive data type, the creation time, the modification time, and the responsible person. Data use/processing information is stored as edges. Referring to FIG. 1d, FIG. 1d shows a schematic view of a knowledge graph of a data detection method provided according to an embodiment of the present specification. For example, company A "owns" data 1. Here "owns" is data use/processing information and is a relationship, that is, an edge, in the graph. The attribute information of this edge includes the scenarios in which company A is permitted to use data 1, whether the ownership is disclosed in the contract text, and so on. As another example, company A "transfers" data 1 to company B as data 2, which may be understood as company A copying data 1 and giving the copy, data 2, to company B. "Transfers" is also data use/processing information and a relationship in the knowledge graph; it includes the sender and receiver of the data transfer, whether the transfer is cross-border, the scenarios of authorized use, and so on. The results produced by the protocol analysis module are likewise treated as data use/processing information and stored on relationships in the knowledge graph. For example, the "owns" relationship includes the attribute "whether it is disclosed in the contract text", whose value comes from parsing the contract text.
Source data such as data tables, column information in the data tables, and company or organization information, together with the results produced by the protocol analysis module, is cleaned and then stored in the graph database as nodes or edges of the graph.
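The node and edge layout described above can be sketched with a small in-memory structure. This is only an illustration under stated assumptions: the class names, attribute keys, and sample values are hypothetical, and a real deployment would store the same information in a graph database rather than in Python objects.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Node:
    node_id: str
    kind: str  # e.g. "company" or "data_table"
    attrs: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Edge:
    source: str
    relation: str  # data use/processing information, e.g. "owns"
    target: str
    attrs: Dict[str, Any] = field(default_factory=dict)


@dataclass
class KnowledgeGraph:
    nodes: Dict[str, Node] = field(default_factory=dict)
    edges: List[Edge] = field(default_factory=list)

    def add_node(self, node: Node) -> None:
        self.nodes[node.node_id] = node

    def add_edge(self, edge: Edge) -> None:
        # Both endpoints must already be present as nodes.
        if edge.source not in self.nodes or edge.target not in self.nodes:
            raise KeyError("edge endpoints must be added as nodes first")
        self.edges.append(edge)

    def relations(self, relation: str) -> List[Edge]:
        return [e for e in self.edges if e.relation == relation]


graph = KnowledgeGraph()
graph.add_node(Node("company_a", "company", {"name": "Company A"}))
graph.add_node(Node("company_b", "company", {"name": "Company B"}))
graph.add_node(Node("data_1", "data_table", {"is_personal_data": True}))
graph.add_node(Node("data_2", "data_table", {"is_personal_data": True}))

# "owns": the disclosure flag would come from parsing the contract text.
graph.add_edge(Edge("company_a", "owns", "data_1",
                    {"permitted_scenarios": ["data analysis"],
                     "disclosed": True}))
# "transfers": company A copies data 1 to company B as data 2.
graph.add_edge(Edge("data_1", "transfers", "data_2",
                    {"sender": "company_a", "receiver": "company_b",
                     "cross_border": True}))

print([e.attrs for e in graph.relations("transfers")])
```

Keeping the contract-derived facts as edge attributes means a later compliance check only has to read the relationship it is examining.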
In the compliance engine module, compliance risk policies are first preset. Risk policies are produced through analysis by professionals and expressed in standardized form: a subject, a condition, and a requirement together form a standardized risk policy. For example, a standardized risk policy specifies the subject that requires compliance detection, "data sender"; the conditions under which compliance detection is required, "cross-border data transfer" + "personal data"; and the compliance requirement, "disclosed in the user privacy agreement". Through this standardized description of risk policies, the relevant requirements of laws and regulations can be stored in a standardized form. Secondly, the compliance engine can perform logical-expression reasoning. Given the standardized description of a risk policy and the corresponding data in the database, the compliance engine automatically generates a corresponding Boolean expression for logical reasoning. For example, for the compliance requirement "disclosed in the user privacy agreement", it can generate, according to a rule, the Boolean expression "owns_1.disclosed == true", where "owns_1.disclosed" is the attribute "disclosed" of the "owns 1" relationship in the graph and takes the value "true" or "false". The compliance engine evaluates the resulting expression, "true == true" or "false == true", and thereby obtains a conclusion on whether the compliance requirement is met, together with corresponding compliance suggestions.
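The subject, condition, and requirement structure of a standardized risk policy, and the Boolean check derived from it, can be sketched as follows. This is a simplified illustration under stated assumptions rather than the engine of the specification: `RiskPolicy`, `check`, the attribute keys, and the sample relationships are all hypothetical names.

```python
from dataclasses import dataclass
from typing import Any, Dict, List

Edge = Dict[str, Any]  # one graph relationship, flattened to its attributes


@dataclass
class RiskPolicy:
    subject: str                 # edge attribute naming the party being checked
    conditions: Dict[str, Any]   # attribute values that make the policy apply
    requirement: Dict[str, Any]  # attribute values that must hold
    suggestion: str              # advice returned when the requirement fails


def check(policy: RiskPolicy, edges: List[Edge]) -> List[Dict[str, Any]]:
    findings = []
    for edge in edges:
        # Skip relationships that do not satisfy every condition.
        if any(edge.get(k) != v for k, v in policy.conditions.items()):
            continue
        for attr, expected in policy.requirement.items():
            compliant = edge.get(attr) == expected
            findings.append({
                "subject": edge.get(policy.subject),
                # The generated Boolean expression, kept for traceability.
                "expression": f"{edge['name']}.{attr} == {expected}",
                "compliant": compliant,
                "suggestion": "" if compliant else policy.suggestion,
            })
    return findings


policy = RiskPolicy(
    subject="sender",
    conditions={"cross_border": True, "personal_data": True},
    requirement={"disclosed": True},
    suggestion="Disclose the cross-border transfer in the user privacy agreement.",
)
edges = [
    {"name": "transfer_1", "sender": "company A", "cross_border": True,
     "personal_data": True, "disclosed": False},
    {"name": "transfer_2", "sender": "company A", "cross_border": False,
     "personal_data": True, "disclosed": False},
]
findings = check(policy, edges)
print(findings)
```

Only `transfer_1` meets both conditions, so it is the only relationship for which an expression is generated and evaluated.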
The data knowledge graph is constructed for the data transmitted through the contract text, data detection is carried out on the data knowledge graph according to a pre-constructed data risk strategy and the data description information, and a detection result is obtained.
Referring to fig. 2a, fig. 2a shows a flowchart of a data detection method provided according to an embodiment of the present specification, which specifically includes the following steps.
Step 202: Parse the contract text to determine the data description information.
The contract text may be a data collection contract text, a data transmission contract text, a data storage contract text, or the like. It can be understood as a contract text that specifies, for a data transmission process, which information is to be transmitted, how it is to be transmitted, and so on; for example, the contract text specifies that order information, browsing information, and the like are to be transmitted. The data description information may indicate whether the data has been declared, the period for which the data may be stored, and the scenarios in which the data may be used. For example, the data description information is "we may collect your order information and browsing information for data analysis".
In practical applications, the contract text may be parsed with natural language processing techniques, such as an NER model or a sentence classification model, to determine the data description information, which is then used for compliance detection. Parsing the contract text thus means extracting the compliance-related information from the data collection contract text and the data transmission contract text.
For example, the contract text is parsed with a sentence classification model based on the BERT model. The output of the sentence classification model is the information that the contract text declares will be collected and the scenarios in which that information may be used. Suppose the contract text contains "we may collect your order information and browsing information for data analysis". When the contract text is input into the trained sentence classification model, the model can determine from the context that the data description information includes the declared information "order information" and "browsing information", and can also determine the usage scenario "for data analysis".
By parsing the contract text, the embodiments of this specification can determine characteristics of the transmitted data such as its permitted usage scenarios, whether it has been declared, and its storage period, which facilitates subsequent compliance detection on the data.
Specifically, parsing the contract text to determine the data description information includes:
parsing the contract text to determine declaration data information, data period information, and usage scenario information.
The declaration data information can be understood as the declaration of data. For example, the declaration data information may be "we may collect your order information and browsing information", which declares that "order information" and "browsing information" will be obtained. The data period information may be the period for which the data may be stored; for example, the data period information may be "may be stored for one month", meaning the data is authorized to be stored for one month only. The usage scenario information may be the purpose for which the data is used, for example "for data analysis" in "we may collect your order information and browsing information for data analysis".
In practical applications, the declaration data information may declare specific data or a certain type of data. Take identifying sensitive data types in a data collection contract text as an example (the sensitive data types are the named entities in this task). The named entity recognition task for this kind of text is to extract from the contract text the privacy-sensitive information that the contract text declares will be collected, which is typically names, identity document numbers, mobile phone numbers, and the like. Although the set of common sensitive data types is limited, extraction usually also has to take the context into account. For example, where an identity document is mentioned only in a passage that defines what private data is, it does not need to be extracted. In addition to the declared privacy data itself, the position in the privacy agreement at which the collection of the sensitive data is declared is also returned.
For example, the contract text includes: "When you register, log in, and use the relevant services, you can create an account with a mobile phone number and complete the relevant network identification information (avatar, nickname, password); this information is collected to help you complete registration. You may also choose to fill in your birthday, region, and personal introduction as needed to complete your information." Here "mobile phone number", "avatar", "nickname", "password", "birthday", and "region" are the sensitive information that the privacy agreement declares it collects, that is, the content to be extracted.
The contract text also includes: "(I) Personal information: information, recorded electronically or otherwise, that can identify a specific natural person or reflect the activities of a specific natural person, either alone or in combination with other information. (II) Personal sensitive information: including identity document numbers, mobile phone numbers, personal biometric information, bank account numbers, property information, whereabouts, transaction information, and the like." Although identity document numbers, mobile phone numbers, personal biometric information, bank account numbers, property information, whereabouts, and transaction information are sensitive data, this context does not declare that they are collected, so they do not need to be extracted.
Furthermore, data extraction based on deep learning performs better and avoids problems such as inaccurate or slow recognition, so the data extraction can be carried out with a deep learning method. A specific embodiment is as follows.
Parsing the contract text to determine the declaration data information, data period information, and usage scenario information includes:
performing semantic analysis on the text content of the contract text according to a target model to obtain semantic keywords;
determining the declaration data information, data period information, and usage scenario information according to the semantic keywords.
The target model may be a deep learning network model, for example an NER network model. The semantic keywords can be understood as the named entities described above, such as person names, organization names, times, and places.
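The second step above, turning semantic keywords into the three kinds of data description information, might be sketched as a simple grouping by entity label. The label names (`DATA_TYPE`, `STORAGE_PERIOD`, `USE_SCENARIO`), the `DataDescription` class, and the sample keywords are assumptions made for illustration; the specification does not name the labels its model emits.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class DataDescription:
    declared_data: List[str] = field(default_factory=list)
    storage_periods: List[str] = field(default_factory=list)
    usage_scenarios: List[str] = field(default_factory=list)


# Hypothetical mapping from model labels to description fields.
LABEL_TO_FIELD = {
    "DATA_TYPE": "declared_data",
    "STORAGE_PERIOD": "storage_periods",
    "USE_SCENARIO": "usage_scenarios",
}


def build_description(keywords: List[Tuple[str, str]]) -> DataDescription:
    """Group (text, label) keywords into a DataDescription."""
    description = DataDescription()
    for text, label in keywords:
        field_name = LABEL_TO_FIELD.get(label)
        if field_name is None:
            continue  # labels that are not needed for compliance are ignored
        values = getattr(description, field_name)
        if text not in values:  # keep each keyword once, in first-seen order
            values.append(text)
    return description


keywords = [
    ("order information", "DATA_TYPE"),
    ("browsing information", "DATA_TYPE"),
    ("one month", "STORAGE_PERIOD"),
    ("data analysis", "USE_SCENARIO"),
    ("order information", "DATA_TYPE"),
    ("company A", "ORG"),
]
description = build_description(keywords)
print(description)
```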
In practical applications, in addition to the sensitive data types collected under a contract text, compliance detection often needs to automatically extract other information from the contract text, such as the storage period, storage region, and effective time. Extracting this information is essentially a named entity recognition task and can be accomplished by training a deep model. Specifically, an NER network model can be used to extract the relevant information. NER is a sequence labeling problem, so NER data is annotated in the manner of sequence labeling problems, mainly with the BIOE labeling scheme, in which the letters have the following meanings: B (Begin) marks the start of an entity; I (Intermediate) marks the middle of an entity; E (End) marks the end of an entity; O (Other) marks irrelevant characters. For example, annotating the sentence meaning "You may need to provide information such as your name, gender, and phone number." character by character in the original Chinese gives: [O, O, O, O, O, B-NAME, E-NAME, O, B-GENDER, E-GENDER, O, B-PHONE, I-PHONE, I-PHONE, E-PHONE, O, O, O, O], where NAME denotes a name, GENDER denotes gender, and PHONE denotes a mobile phone number.
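The labeling scheme described above can be sketched as a function that converts annotated entity spans into one tag per character. The function name `bioe_tags`, the span format, and the example positions are hypothetical; the scheme as described has no tag for single-character entities, so this sketch rejects them rather than guessing.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, entity_type)


def bioe_tags(length: int, spans: List[Span]) -> List[str]:
    """Return one B/I/E/O tag per character for a text of `length` characters."""
    tags = ["O"] * length
    for start, end, etype in spans:
        if start < 0 or end > length or end - start < 2:
            raise ValueError(f"span {(start, end)} must cover 2+ characters in range")
        if any(tag != "O" for tag in tags[start:end]):
            raise ValueError(f"span {(start, end)} overlaps another entity")
        tags[start] = f"B-{etype}"           # first character of the entity
        for pos in range(start + 1, end - 1):
            tags[pos] = f"I-{etype}"         # characters strictly inside
        tags[end - 1] = f"E-{etype}"         # last character of the entity
    return tags


# Hypothetical 9-character text: a 2-character name and a 4-character phone field.
tags = bioe_tags(9, [(2, 4, "NAME"), (5, 9, "PHONE")])
print(tags)
```

A two-character entity receives only a B tag and an E tag, which matches the NAME entity in the annotated example above.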
Suppose there are $m$ types of personal information, denoted $c_1, c_2, c_3, \ldots, c_{m-1}, c_m$. Given a data record to be recognized with character length $n$, $W = \{w_1, w_2, w_3, \ldots, w_{n-1}, w_n\}$, let $S = [w_{k-i}, w_{k-i+1}, \ldots, w_k]$ be a sequence of consecutive characters in $W$. If $S$ is personal information of type $c_j$, then the task of sensitive personal information identification based on named entity recognition (NER) is to label $w_{k-i}$ as $c_j\_B$, to label $w_{k-i+1}$ through $w_{k-1}$ as $c_j\_I$, and to label $w_k$ as $c_j\_E$.
After the identification result is obtained, the identification result can be evaluated to adjust parameters of the NER network model, so that the output result of the NER network model is more accurate.
The evaluation of the effect of the NER model is mainly measured by 3 indexes of precision (abbreviated as P), recall (abbreviated as R) and F-measure (abbreviated as F), wherein the precision is the proportion of all samples predicted to be positive, which are actually positive and also positive. The recall is the proportion of samples that are actually positive for prediction and also positive to samples that are actually positive. The F-measure is a harmonic mean of the precision and recall, with higher F-measures indicating a more robust model. The calculation formula is as follows:
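$$P = \frac{TP}{TP + FP}$$
$$R = \frac{TP}{TP + FN}$$
$$F = \frac{2 \times P \times R}{P + R}$$
where $TP$ is the number of samples predicted to be positive that are actually positive, $FP$ is the number of samples predicted to be positive that are actually negative, and $FN$ is the number of actually positive samples that are predicted to be negative.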
Through the above formulas, the F-measure of the NER network model, that is, its effect evaluation score, can be obtained. If the score does not reach the expected value, the NER network model can be tuned further, for example by selecting a different model structure, number of layers, hidden layer size (hidden size), optimizer (e.g., SGD, Momentum, RMSprop, Adam), learning rate, or batch size, until the score reaches a satisfactory value. Specifically, the model structure of the NER network model may be LSTM, RNN, Transformer, etc. If the batch size is N, N samples are selected each time and fed into the NER network model, the parameter adjustment value corresponding to each sample is calculated, and all the parameter adjustment values are averaged to give the final parameter adjustment value used to adjust the parameters of the NER network model.
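As a rough illustration of the evaluation step, the sketch below computes precision, recall, and F-measure at the entity level. It assumes that predicted and reference entities are represented as sets of `(start, end, type)` tuples and that an entity counts as correct only on an exact match; the patent does not specify the matching granularity, so both choices are assumptions.

```python
def precision_recall_f(predicted, gold):
    """Entity-level precision, recall and F-measure.

    predicted, gold: sets of (start, end, type) tuples (assumed format).
    """
    tp = len(predicted & gold)   # predicted and actually present
    fp = len(predicted - gold)   # predicted but not present
    fn = len(gold - predicted)   # present but not predicted
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0  # harmonic mean of P and R
    return p, r, f


gold = {(5, 6, "NAME"), (11, 14, "PHONE")}
predicted = {(5, 6, "NAME"), (8, 9, "NAME")}
p, r, f = precision_recall_f(predicted, gold)
print(p, r, f)
```

Here one of the two predictions is correct and one of the two reference entities is found, so all three scores are 0.5.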
For example, the contract text includes: "When you register, log in, and use the relevant services, you can create an account with a mobile phone number and complete the relevant network identification information (avatar, nickname, password); this information is collected to help you complete registration. You may also choose to fill in your birthday, region, and personal introduction to complete your information." The contract text also includes: "(I) Personal information: information recorded electronically or otherwise that can, alone or in combination with other information, identify a specific natural person or reflect the activities of a specific natural person. (II) Personal sensitive information: includes identity document numbers, mobile phone numbers, personal biometric information, bank accounts, property information, whereabouts, transaction information, etc." Inputting these two passages of contract text into the preset NER network model yields a custom field containing a plurality of named entities. Each output named entity corresponds to a parsing result, which can be delimited by a start identifier and an end identifier; the start and end identifiers may include information such as the name of the named entity, the paragraph where it is located, and its data type.
It should be noted that the format of the parsing result in the custom field may be set as required. For example, if the data type is needed, it is added to the parsing result; in some scenarios it is only necessary to judge whether personal information is collected, and the data type need not be output.
The embodiment of the specification uses the deep learning network model, so that the named entity can be generated and extracted more quickly and accurately, the whole process of identification and detection is accelerated, and the detection efficiency is improved.
Step 204: acquire target data, and construct a data knowledge graph according to the target data and the data description information.
The target data may be transmitted data or a usage log of the data; for example, the target data is data obtained by compressing a contract text. The data knowledge graph may be a knowledge graph generated from the transmitted data, or may be another graph database, such as Neo4j, Galaxybase, TigerGraph, or TuGraph.
In practical applications, information in the form of data tables or data columns, as well as company or organization information, may be obtained from stored data or from the output of the NER network model. This information is saved as nodes in the graph, where the data table or column information may be data provided by the company together with the relationships between the data. For example, a company node contains information such as the company name, registered address, and registered capital; a data table or data column node contains information such as whether the data is personal data, the sensitive data type, creation time, modification time, and responsible person; and data use and processing information is saved as edges. In a knowledge graph, an entity is something distinguishable that exists independently, such as "China", "Beijing", or "16410 square kilometers". The entity is the most basic element of the knowledge graph, and different relationships exist between different entities. A relationship is a connection between different entities, such as "population", "capital", or "area"; the nodes in the knowledge graph are connected through relationships to form a graph. A knowledge graph is mainly stored in one of two ways: storage based on RDF (Resource Description Framework), or storage based on a graph database. An important design principle of RDF is the easy distribution and sharing of data, whereas graph databases emphasize efficient graph query and search. In addition, RDF stores data as triples and does not contain attribute information, whereas a graph database generally uses the property graph as its basic representation, so entities and relationships can carry attributes, which makes real project scenarios easier to express.
For example, target data is obtained and parsed to obtain a data table. The data table includes data columns, and each data column corresponds to a unique identifier. For example, if the data table is titled "… data of company A", it can be parsed as data owned by company A; if the data columns include data 1, this indicates that company A "owns" data 1, where "owns" is a data use and processing category and is a relationship in the graph. From the description paragraph for data 1 in the data table, "data 1 is used for …", the scenarios in which company A is permitted to use data 1, the type of data 1, and the like can be obtained. As another example, if company A "transmits" data 1 to company B as data 2, "transmits" is also a data use and processing category and a relationship in the graph; the corresponding description paragraph further includes the sender and receiver of the data transmission, whether the transmission is cross-border, and the scenarios of authorized use. Meanwhile, the result produced by the agreement parsing module (NER module) is also used as the parsing result of the contract text; for example, the data use and processing relationship "owns" includes "whether the data is disclosed in the contract text".
In the embodiments of this specification, the parsing result of the NER module is used as the data use and processing information of the data knowledge graph. Because the parsing result of the NER module is highly accurate, more comprehensive data can be obtained, and the data knowledge graph constructed from this data is more complete, so that more information is available for judging the result when compliance detection is performed on the knowledge graph, which increases the accuracy of compliance detection.
In one implementable manner, the target data comprises a data relationship table;
correspondingly, the constructing a data knowledge graph according to the target data and the data description information comprises the following steps:
analyzing the target data to obtain the data relation table;
and constructing a data knowledge graph according to the data relation table and the data description information.
The data relationship table may be a list including data and relationships between data, for example, the data includes data of the user U and a phone number of the user U, and the relationships between data may be that the user U owns the phone number of the user U. It should be noted that, the data in the data relationship table may be actively acquired information, such as information of a mobile phone identification code, or may also be actively reported information of an individual, such as information of a name, a head portrait, and the like; the user information may be a name, an identification code, and the like.
In practical application, compliance detection may be performed on the contract text itself, or compliance detection may be performed on the transmitted data, and the obtained target data may be compressed or encrypted data, so the target data is also analyzed to obtain the data relation table. For example, in the process of using the application program, the application program may collect information of the user, and there is also information that the user actively uploads to the application program, and a data knowledge graph may be constructed according to both the information, and both the information need to be subjected to compliance detection.
For example, in the process of using the application program, the information actively reported by the user includes a name and an avatar, and the information actively collected by the application program includes a mobile phone identification code. The constructed data knowledge graph therefore includes a node corresponding to user U, and that node "owns" the name node, the avatar node, and the mobile phone identification code node corresponding to user U. Company A encrypts Xiao Ming's data through the MD5 algorithm to obtain the target data; after the target data is obtained, MD5 decryption is performed on it to obtain the data relationship table, from which the data knowledge graph can be constructed.
It should be noted that, when the data knowledge graph is constructed, based on the principle of protecting user privacy, only the identifier of user U's information can be obtained; the specific information in the data cannot be transmitted. For example, if the real name of user U is Xiao Ming, the information "Xiao Ming" cannot be obtained, and it is only known that user U has a name.
As another example, if the contract text is parsed and found to include a statement that the "phone number" is collected, the data knowledge graph can be constructed according to the information that collection of the "phone number" is declared.
In the embodiment of the specification, the data relation table of the target data is determined according to the relation between the acquired data and the owner of the data, so that more relations can be added in the knowledge graph, and the data information amount in the knowledge graph is improved.
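To make the construction step concrete, the following sketch turns a data relationship table into nodes and edges. The table layout (rows of subject, relation, object), the dictionary-based graph structure, and the function name `build_graph` are assumptions chosen for illustration; a real system might instead write to one of the graph databases named above. In keeping with the privacy principle described, nodes hold only identifiers of data items, not their values.

```python
def build_graph(relation_table):
    """Build a simple property graph from (subject, relation, object) rows."""
    nodes = {}   # node id -> attribute dict
    edges = []   # each edge: source, type, target, attrs
    for subject, relation, obj in relation_table:
        nodes.setdefault(subject, {})
        nodes.setdefault(obj, {})
        edges.append({"source": subject, "type": relation,
                      "target": obj, "attrs": {}})
    return {"nodes": nodes, "edges": edges}


# Hypothetical relationship table for user U; only item identifiers are stored.
table = [
    ("user U", "owns", "name"),
    ("user U", "owns", "avatar"),
    ("user U", "owns", "phone identification code"),
]
graph = build_graph(table)
print(len(graph["nodes"]), len(graph["edges"]))
```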
In one implementable manner, the target data further includes a usage log for the target data;
correspondingly, the acquiring target data and constructing a data knowledge graph according to the target data and the data description information include:
analyzing the target data to obtain the use log, and generating attribute information according to the use log;
adding the attribute information to an edge of the data knowledge-graph.
The usage log may be a log generated when data is processed and used, for example, A transmits a telephone number to B; the attribute information may be attribute information in the knowledge graph, such as attribute information of edges or attribute information of nodes.
In practical applications, compliance detection may also be performed on the use of data. For example, if data is stored, information such as where the data is stored, which party stores it, and the storage time limit may be saved in the knowledge graph. A log parsing algorithm, such as the Drain algorithm or the shell algorithm, may be applied to the usage log of the data to generate the attribute information.
For example, the parsing rule for parsing the usage log into attribute information may include: the name of a parameter in the log is used as a key and the content of the parameter is used as a value. The separator between different key-value data in the usage log may be ";". In the parsed result, the separator between different key-value data and between the key and the value within key-value data may be "-". The log information is converted into the key-value data format of the parsing result according to the parsing rule. If user A stored the data five days ago, the parameter name in the usage log is "storage time" and the corresponding value is "five days ago", so the key-value pair of the parsing result of the usage log is "storage time-five days ago", and this key-value pair is added to the attribute information of the edge in the data knowledge graph.
As another example, the log includes fixed variables: "size" represents the size of the data, "data" represents the processing date of the data, and "type" represents the processing type of the data, and each variable is followed by a corresponding value, such as size:50M, data:09:11, type: write, which represents 50 megabytes of data written at 09:11. A parsing rule can then be established according to the fixed variables to obtain a parsing result, for example, storage time: 09:11, and the parsing result is added to the attribute information of the edge in the data knowledge graph.
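The two parsing rules above can be sketched as follows. The raw log format is an assumption: the text gives the separator between key-value items (";" in one example, "," in the other) but not the separator between a key and its value in the raw log, so a colon is assumed here, split only at its first occurrence so that values such as 09:11 survive intact. The function names are hypothetical.

```python
def parse_usage_log(line, pair_sep=";"):
    """Parse 'key:value' items separated by pair_sep into a dict."""
    attrs = {}
    for item in line.split(pair_sep):
        item = item.strip()
        if not item:
            continue
        # Split on the first colon only, so '09:11' stays in one piece.
        key, _, value = item.partition(":")
        attrs[key.strip()] = value.strip()
    return attrs


def format_result(attrs):
    """Render each pair as 'key-value', the parsed-result format in the text."""
    return [f"{key}-{value}" for key, value in attrs.items()]


first = parse_usage_log("storage time:five days ago;storage party:A")
second = parse_usage_log("size:50M, data:09:11, type: write", pair_sep=",")
print(format_result(first))
print(second)
```

The resulting dictionary can then be merged into the attribute dictionary of the relevant edge.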
In the embodiment of the specification, the related information of a company or an organization and the use and disposal behavior information in the whole life cycle of the data are stored through the data knowledge graph, so that the characteristics of a graph database are effectively utilized, the data are efficiently maintained, and the subsequent storage and maintenance are facilitated; meanwhile, the method is beneficial to managing the relation between the discovered data assets, and can provide visual query results.
In one implementation, the building a data knowledge graph from the data relationship table and the data description information includes:
extracting data items in the data relation table, determining nodes of the data knowledge graph according to the data items, and determining edges of the data knowledge graph according to the relation among the data items;
and generating attribute information according to the data description information, and adding the attribute information to the edge of the data knowledge graph.
The data item may be one of all data in the data relationship table, for example, the data relationship table includes user data of company a, and the user data includes name information, mobile phone number information, and the like.
In practical application, the knowledge graph effectively processes and integrates the data of complex documents and converts it into simple, clear triples of entity, relationship, and entity, so that entities can be represented as nodes in the graph and relationships as edges. Entities are things in the real world, such as people, place names, concepts, medicines, and companies; relationships express some connection between different entities, such as a person "lives in" Beijing, Zhang San and Li Si are "friends", or logistic regression is "prerequisite knowledge" for deep learning. Meanwhile, in the real world, entities and relationships have their own attributes. Accordingly, attribute information may be generated from the data description information and added to the corresponding edge.
For example, referring to FIG. 2b, FIG. 2b shows a schematic diagram of a knowledge graph, which represents a simple property graph. Li Ming and Li Fei are in a parent-child relationship, and Li Ming has a telephone number beginning with 138 that was activated in 2018, where 2018 can be used as attribute information of the relationship. Similarly, Li Ming also carries some attribute information, such as an age of 45 and the job title of manager. The data description information may be: "Li Ming's telephone number has been collected". This data description information can be converted into the attribute information "Li Ming's telephone number was collected: yes", and the attribute information is added to the edge between the node corresponding to Li Ming and the node corresponding to Li Ming's telephone number.
According to the embodiment of the specification, the data description information is generated into the attribute information and the attribute information is added to the edge of the data knowledge graph, so that the data description information is stored in the data knowledge graph, corresponding information can be directly obtained from the attribute information of the edge during compliance detection, and the compliance detection efficiency is improved.
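A minimal sketch of attaching data description information to an edge, using the Li Ming example. The graph layout, the attribute key `phone_number_collected`, and the helper `add_edge_attribute` are hypothetical names introduced for the illustration.

```python
graph = {
    "nodes": {
        "Li Ming": {"age": 45, "job_title": "manager"},
        "Li Ming's phone number": {},
    },
    "edges": [
        {"source": "Li Ming", "type": "owns",
         "target": "Li Ming's phone number", "attrs": {"activated": 2018}},
    ],
}


def add_edge_attribute(graph, source, target, key, value):
    """Set key=value on every edge from source to target; return the count."""
    updated = 0
    for edge in graph["edges"]:
        if edge["source"] == source and edge["target"] == target:
            edge["attrs"][key] = value
            updated += 1
    return updated


# Data description information "telephone number collected: yes" as an attribute.
count = add_edge_attribute(graph, "Li Ming", "Li Ming's phone number",
                           "phone_number_collected", "yes")
print(count, graph["edges"][0]["attrs"])
```

Existing attributes on the edge, such as the activation year, are kept alongside the new one.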
In one implementation, the building a data knowledge graph from the data relationship table and the data description information includes:
extracting data items in the data relation table, determining nodes of a data knowledge graph according to the data items in the data relation table and the data description information, and determining edges of the data knowledge graph according to the relation between the data items;
and generating attribute information according to the data description information, and adding the attribute information to the nodes of the data knowledge graph.
In practical application, the contract text corresponding to the data description information can also form a separate node, the attribute information is generated through the data description information, and the attribute information is added to the node formed by the contract text corresponding to the data description information.
For example, the data knowledge graph includes: Li Ming and Li Fei are in a parent-child relationship, and Li Ming has a telephone number beginning with 138. The data description information of the contract text may be: "Li Ming's telephone number needs to be collected". This data description information can be converted into the attribute information "Li Ming's telephone number needs to be collected: yes"; a contract text node is then generated, and the attribute information is added to the node corresponding to the contract text.
In the embodiments of this specification, the contract text corresponding to the data description information forms an independent node, attribute information is generated from the data description information and added to that node, and the contract text is managed through the independent node, which makes it convenient to detect the contract text.
In one implementation, the method further comprises:
acquiring a use log aiming at the target data, and generating attribute information according to the use log;
adding the attribute information to an edge of the data knowledge-graph.
In practical application, as time increases, the use logs of the data are continuously increased, and new use logs of the data can be added to the knowledge graph so as to detect the use of the data in time.
For example, if a data knowledge graph was generated on the first day and a usage log for the corresponding data in the data knowledge graph appears on the second day, attribute information is generated from the second day's usage log and added to the edge of the data knowledge graph.
The embodiment of the specification generates the attribute information by the new use log and adds the attribute information to the edge of the data knowledge graph, so that whether the latest data use is in compliance can be continuously monitored, and the timeliness of compliance detection is improved.
Step 206: perform data detection on the data knowledge graph according to a pre-constructed data risk policy to obtain a detection result.
The data risk policy may be a code representation constructed according to laws and regulations, for example, expressed in the C language or the C++ language.
In practical applications, different scenarios also include different detection and determination methods, for example, for data authorized in a time period, it is mainly detected whether the data storage time exceeds the storage authorization time. After a pre-constructed data risk strategy and data description information are obtained and a knowledge graph is constructed, data detection can be performed on nodes in the data knowledge graph and association relations among the nodes according to the data risk strategy and the data description information, and a detection result is obtained.
In an implementation manner, the performing data detection on the data knowledge graph according to a pre-constructed data risk policy to obtain a detection result includes:
acquiring a main body identifier of a main body to be detected, and determining a corresponding main body node in the data knowledge graph according to the main body identifier;
determining a detection condition according to a pre-constructed data risk strategy, and determining a side to be detected from a side corresponding to the main body node according to the detection condition;
determining a detection requirement according to a pre-constructed data risk strategy, and carrying out data detection on the edge to be detected according to the detection requirement to obtain a detection result.
The subject identifier may be the identifier of the subject to be detected in the data knowledge graph; the subject node may be the node of the subject to be detected in the data knowledge graph; the detection condition may be a condition for determining which data needs to be acquired and detected; the detection requirement may be a requirement on the acquired attribute information, used to determine compliance.
In practical application, the data risk policy includes detection conditions, and the detection conditions may describe data that needs to be detected, so that which edges in the data knowledge graph need to be detected may be determined according to the detection conditions.
For example, the detection condition is: personal data. The data knowledge graph is matched against this condition, and a name, identity card information, and mobile phone number information are found. Because the name, identity card information, and mobile phone number information all belong to personal information, the edges between them and the corresponding subject are determined as the edges to be detected. The detection requirement is: disclosed in the user privacy agreement. Compliance can then be judged according to the detection requirement.
According to the embodiment of the specification, the collected data are detected, high-efficiency and high-quality compliance detection can be realized, and omission of use/processing information of data assets through manual checking is avoided.
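The three steps above (locate the subject node, select edges by the detection condition, check them against the detection requirement) can be sketched as below. The policy is represented as a plain dictionary with a `condition` and a `requirement`; this representation, the attribute name `disclosed`, and the function `detect` are illustrative assumptions rather than the patent's data risk policy format.

```python
graph = {
    "nodes": {
        "company A": {"kind": "subject"},
        "name": {"data_type": "personal"},
        "phone number": {"data_type": "personal"},
        "product catalog": {"data_type": "business"},
    },
    "edges": [
        {"source": "company A", "target": "name", "type": "owns",
         "attrs": {"disclosed": True}},
        {"source": "company A", "target": "phone number", "type": "owns",
         "attrs": {"disclosed": False}},
        {"source": "company A", "target": "product catalog", "type": "owns",
         "attrs": {}},
    ],
}

# Hypothetical policy: personal data must be disclosed in the privacy agreement.
policy = {
    "condition": {"data_type": "personal"},
    "requirement": {"attribute": "disclosed", "expected": True},
}


def detect(graph, subject_id, policy):
    """Return {data node: compliant?} for the subject's edges matching the policy."""
    results = {}
    requirement = policy["requirement"]
    for edge in graph["edges"]:
        if edge["source"] != subject_id:
            continue  # not an edge of the subject node
        node = graph["nodes"][edge["target"]]
        if node.get("data_type") != policy["condition"]["data_type"]:
            continue  # detection condition not met, edge is not checked
        actual = edge["attrs"].get(requirement["attribute"])
        results[edge["target"]] = actual == requirement["expected"]
    return results


report = detect(graph, "company A", policy)
print(report)
```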
In an implementable manner, the determining, according to the detection condition, an edge to be detected from edges corresponding to the main body node includes:
determining the type of the data to be detected and the type of the edge to be detected according to the detection condition;
determining an initial edge to be detected and a data node corresponding to the initial edge to be detected from the edge corresponding to the main node according to the type of the edge to be detected;
determining a data node to be detected from the data nodes corresponding to the initial edge to be detected according to the type of the data to be detected;
and determining the edges to be detected from the initial edges to be detected according to the data nodes to be detected.
The data type to be detected may be a data type corresponding to the node, for example, personal data; the type of edge to be detected may be the type of relationship represented by the edge in the data knowledge graph, e.g., the type of transmission.
In practical application, a main body to be detected may be determined first, and then the edge to be detected may be determined according to data corresponding to the main body.
For example, to determine whether the personal data owned by company A is compliant, it is determined that the subject node corresponding to company A has ten edges, of which five are ownership relationships. The corresponding data nodes are determined through these five edges, and if the data nodes corresponding to the five edges are all personal data, the five edges are all edges to be detected.
In the embodiments of this specification, the edges to be detected are determined according to the detection condition, so that the data knowledge graph can be detected automatically and detection efficiency is improved.
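The edge selection described above filters first by edge type and then by the data type of the node each edge reaches. A minimal sketch, assuming the same dictionary-based graph layout as a stand-in for a graph database and a hypothetical `data_type` node attribute:

```python
def select_edges(graph, subject_id, edge_type, data_type):
    """Select the subject's edges of edge_type that lead to nodes of data_type."""
    # Initial edges to be detected: edges of the requested type on the subject node.
    initial = [e for e in graph["edges"]
               if e["source"] == subject_id and e["type"] == edge_type]
    # Data nodes to be detected: targets of the initial edges with the right type.
    data_nodes = {e["target"] for e in initial
                  if graph["nodes"][e["target"]].get("data_type") == data_type}
    # Edges to be detected: initial edges ending at a selected data node.
    return [e for e in initial if e["target"] in data_nodes]


graph = {
    "nodes": {
        "company A": {},
        "name": {"data_type": "personal"},
        "identity card": {"data_type": "personal"},
        "product catalog": {"data_type": "business"},
        "phone number": {"data_type": "personal"},
    },
    "edges": [
        {"source": "company A", "type": "owns", "target": "name"},
        {"source": "company A", "type": "owns", "target": "identity card"},
        {"source": "company A", "type": "owns", "target": "product catalog"},
        {"source": "company A", "type": "transmits", "target": "phone number"},
    ],
}

selected = select_edges(graph, "company A", "owns", "personal")
print([e["target"] for e in selected])
```

The "transmits" edge is dropped in the first step and the business-data edge in the second, leaving the two ownership edges that reach personal data.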
In an implementation manner, the performing data detection on the edge to be detected according to the detection requirement to obtain a detection result includes:
determining attribute information to be detected of the edge to be detected according to the detection requirement;
and acquiring an attribute value of the attribute information to be detected, and determining a detection result according to the attribute value.
The attribute value may be the value assigned to the attribute information to be detected. For example, if the attribute information to be detected is "whether disclosed", an attribute value of 1 indicates disclosed and an attribute value of 0 indicates not disclosed.
In practical applications, the attribute information may take the form of a variable, and the variable may be assigned a value, so that compliance can be determined by obtaining the value of the variable.
For example, for a user's order information, only order information within one month is authorized to be stored. The storage time of the order information is therefore detected: the attribute information on the edge of the node corresponding to the order information is obtained, and if the attribute information is the storage time and its value is twenty days, the detection result is judged to be compliant.
According to the embodiment of the disclosure, the value of the attribute information is obtained, and whether the data knowledge graph is in compliance is determined according to the judgment of the value of the attribute information, so that the data knowledge graph can be automatically detected, and the detection efficiency is improved.
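The value check itself is small. The sketch below assumes edge attributes are held in a dictionary and that the detection requirement is expressed as a predicate over one attribute value; the attribute names and the reading of "one month" as 30 days are assumptions for the example.

```python
def check_attribute(edge_attrs, name, is_compliant):
    """Look up one attribute on an edge and judge it with a predicate."""
    if name not in edge_attrs:
        return "unknown"  # the attribute to be detected is not recorded
    return "compliant" if is_compliant(edge_attrs[name]) else "non-compliant"


# Disclosure flag: 1 means disclosed, 0 means not disclosed.
disclosure = check_attribute({"disclosed": 1}, "disclosed", lambda v: v == 1)

# Storage time: twenty days against an assumed 30-day authorization.
storage = check_attribute({"storage_time_days": 20}, "storage_time_days",
                          lambda days: days <= 30)
print(disclosure, storage)
```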
In an implementation manner, the performing data detection on the edge to be detected according to the detection requirement to obtain a detection result includes:
determining attribute information to be detected of the edge to be detected and related attribute information of the attribute information to be detected according to the detection requirement;
and under the condition that the related attribute information exists in the attribute information of the edge to be detected, acquiring an attribute value of the related attribute information, and determining a detection result according to the attribute value.
The related attribute information may be attribute information associated with the attribute information to be detected. For example, if the attribute information to be detected is the storage time limit, the related attribute information may be the storage time.
For example, the edge of the node corresponding to a user's order information carries the attribute information "only authorized to store order information within one month". The attribute information on the edge of the node corresponding to the order information is obtained; if attribute information for the storage time is found to exist and its attribute value is twenty days, the detection result is determined to be compliant.
The embodiment of the specification can realize automatic detection on different companies or organizations and different data assets, and remarkably reduces the compliance cost.
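In this variant the limit is itself stored on the edge, and the check only runs when the related attribute is also present. A short sketch under the assumption that both are recorded in days under the hypothetical keys `storage_limit_days` and `storage_time_days`:

```python
def check_storage_limit(edge_attrs):
    """Compare the related 'storage time' attribute with the stored limit.

    Returns None when either attribute is missing, so no verdict is forced.
    """
    limit = edge_attrs.get("storage_limit_days")
    stored = edge_attrs.get("storage_time_days")
    if limit is None or stored is None:
        return None
    return stored <= limit


# Order information stored for twenty days with a one-month (30-day) limit.
verdict = check_storage_limit({"storage_limit_days": 30, "storage_time_days": 20})
print(verdict)
```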
In an implementation manner, the performing data detection on the data knowledge graph according to a pre-constructed data risk policy to obtain a detection result includes:
acquiring a main body identifier, and determining a corresponding contract text node in the data knowledge graph according to the main body identifier;
determining a detection condition according to a pre-constructed data risk strategy, and determining attribute information to be detected from the attribute information of the contract text node according to the detection condition;
determining a detection requirement according to a pre-constructed data risk strategy, and performing data detection on the attribute information to be detected according to the detection requirement to obtain a detection result.
In practical application, for the case that the contract text forms a node separately, in the compliance detection process, the attribute information of the contract text itself needs to be acquired and detected.
For example, a data knowledge graph includes: Li Ming and Li Fei are in a parent-child relationship, and Li Ming has a telephone number beginning with 138. The data description information of the contract text may be: "Li Ming's telephone number needs to be collected". This data description information can be converted into the attribute information "Li Ming's telephone number needs to be collected: yes"; a node corresponding to the contract text is generated, and the attribute information is added to that contract text node. When compliance detection is performed on the contract text, the detection condition is: the collected data is personal information. The attribute information of the contract text node that concerns personal information is therefore obtained: "Li Ming's telephone number needs to be collected". The detection requirement is: personal information needs to be declared. Since the value of the attribute information is "yes", the detection result can be determined to be compliant.
The embodiment of the specification can also detect an independent contract text node, so that the diversification of compliance detection is increased, and because a certain contract text can be detected independently, the attribute information of a corresponding contract text on each edge does not need to be acquired, so that the detection efficiency is improved.
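Detection on a standalone contract text node can be sketched in the same style. The node layout, in which each attribute records a data type and a declared value, and the function name `detect_contract_node` are assumptions introduced for this example.

```python
graph = {
    "nodes": {
        "contract text 1": {
            "kind": "contract_text",
            "attrs": {
                "Li Ming's telephone number needs to be collected":
                    {"data_type": "personal", "value": "yes"},
                "product catalog needs to be collected":
                    {"data_type": "business", "value": "no"},
            },
        },
    },
    "edges": [],
}


def detect_contract_node(graph, node_id, data_type="personal", expected="yes"):
    """Check every attribute of the contract node that matches the condition."""
    attrs = graph["nodes"][node_id]["attrs"]
    # Detection condition: keep only attributes about the requested data type.
    selected = {k: v for k, v in attrs.items() if v["data_type"] == data_type}
    # Detection requirement: the collection must be declared (value == expected).
    return {k: v["value"] == expected for k, v in selected.items()}


result = detect_contract_node(graph, "contract text 1")
print(result)
```

Only the contract node is read; no edge attributes need to be visited, which is the efficiency point made above.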
In an implementable manner, before the data detection is performed on the data knowledge graph according to the pre-constructed data risk policy, the method further comprises:
and analyzing the security rule to obtain a code analysis result, and determining a data risk strategy according to the code analysis result.
The safety rule can be understood as a legal regulation, and the code analysis result can be understood as code representation of the legal regulation.
In practical application, the subject, the condition and the requirement constitute a standardized risk strategy. Through the standardized description of the risk strategy, the relevant requirements in the laws and regulations can be stored in a standardized form. There is also a need to enable logical expression reasoning, where a compliance engine can automatically generate corresponding boolean expressions for logical reasoning given a standardized described risk policy and the data in the corresponding graph database.
For example, if compliance detection requires that the data be "disclosed in the user privacy agreement", a Boolean expression such as "own1.disclose == true" may be generated according to the rule, where "own1.disclose" denotes the attribute "disclose" of the "own1" relationship on the graph and takes the value "true" or "false". The compliance engine reduces the Boolean expression to "true == true" or "false == true", from which a conclusion as to whether the compliance requirement is met, together with a corresponding compliance recommendation, can be obtained.
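A minimal sketch of how a compliance engine might generate and reduce such a Boolean expression is given below. The `Requirement` class, the dotted notation `own1.disclose` and the layout of the `edges` dictionary are assumptions made for illustration; the specification does not define a concrete expression syntax.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Requirement:
    relation: str    # name of the relationship (edge) on the graph, e.g. "own1"
    attribute: str   # attribute of that relationship, e.g. "disclose"
    expected: bool   # value demanded by the risk strategy


def build_expression(req):
    """Render the requirement as a textual Boolean expression."""
    return f"{req.relation}.{req.attribute} == {str(req.expected).lower()}"


def evaluate(req, edge_attributes):
    """Substitute the value stored on the graph and reduce the expression."""
    actual = edge_attributes[req.relation][req.attribute]
    reduced = f"{str(actual).lower()} == {str(req.expected).lower()}"
    return reduced, actual == req.expected


# Hypothetical graph data: the "own1" relationship has not been disclosed.
edges = {"own1": {"disclose": False}}
req = Requirement("own1", "disclose", True)

expression = build_expression(req)
reduced, compliant = evaluate(req, edges)
print(expression, "->", reduced, "->", compliant)
```

Keeping the textual expression alongside the Boolean outcome makes it possible to report which requirement failed when a compliance recommendation is produced.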
An embodiment of the present specification provides a data detection method and apparatus, wherein the data detection method includes: parsing the contract text to determine data description information; acquiring target data, and constructing a data knowledge graph according to the target data and the data description information; and performing data detection on the data knowledge graph according to a pre-constructed data risk strategy to obtain a detection result. By parsing the contract text governing the data transmission, the data description information for the target data is determined; a data knowledge graph is constructed for the acquired target data; and data detection is performed on the data knowledge graph according to the pre-constructed data risk strategy and the data description information to obtain the detection result.
An embodiment of the present specification provides a data detection system, including a server and a client;
the server receives a data detection request, analyzes a contract text according to the data detection request and determines data description information;
the client acquires target data and sends the target data to the server;
the server constructs a data knowledge graph according to the target data and the data description information, performs data detection on the data knowledge graph according to a pre-constructed data risk strategy to obtain a detection result,
and sends the detection result to the client.
The client may be a device such as a mobile phone or a personal computer; the server may be a cloud server.
In practical application, an application service for compliance detection can be deployed on the cloud server, and the user sends local data to the cloud server for compliance detection.
For example, if the data corresponding to contract text 1 of company A needs to be detected, the server first parses the content of contract text 1 to determine the data description information. Company A then transmits the target data to the server. The server generates a data knowledge graph according to the received target data and the data description information, and detects the data knowledge graph according to a pre-generated data risk strategy to determine whether the data in the data knowledge graph violates the rules.
Because the data is transmitted to the server for compliance detection, the detection can be completed using the computing capacity of the server, which reduces resource consumption on the client and improves convenience.
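The order of interactions between the client and the server can be sketched in-process as follows. No real network transport is shown, and everything concrete here is an assumption for illustration: the class and method names, the line-based `parse_contract` stand-in, and the toy risk strategy that flags any data item not declared in the contract text.

```python
def parse_contract(contract_text):
    """Toy parser: treat each non-empty line as one declared data item."""
    return {line.strip().lower() for line in contract_text.splitlines() if line.strip()}


class DetectionServer:
    def __init__(self):
        self.declared = set()

    def handle_detection_request(self, contract_text):
        # The server parses the contract text and keeps the description information.
        self.declared = parse_contract(contract_text)

    def handle_target_data(self, rows):
        # rows are (subject, data item) pairs; each pair becomes one edge.
        graph = [
            {"subject": subject, "data": item, "declared": item.lower() in self.declared}
            for subject, item in rows
        ]
        # Toy risk strategy: an undeclared data item is a violation.
        violations = [edge for edge in graph if not edge["declared"]]
        return {"compliant": not violations, "violations": violations}


class Client:
    def __init__(self, server):
        self.server = server

    def run(self, contract_text, rows):
        self.server.handle_detection_request(contract_text)
        return self.server.handle_target_data(rows)


client = Client(DetectionServer())
report = client.run(
    "telephone number\nemail address",
    [("company A", "telephone number"), ("company A", "home address")],
)
print(report["compliant"], [v["data"] for v in report["violations"]])
```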
The following description will further explain the data detection method provided in this specification by taking an application of the data detection method in cross-border transmission as an example, with reference to fig. 3. Fig. 3 shows a flowchart of a processing procedure of a data detection method provided in an embodiment of the present specification, which specifically includes the following steps.
Step 302: Perform semantic analysis on the text content of the contract text according to the target model to obtain semantic keywords.
Step 304: Determine declared data information, data cycle information and usage scenario information according to the semantic keywords.
Step 306: Acquire the target data transmitted under the contract text, and determine the source relation of the target data.
Step 308: Construct the data knowledge graph according to the target data and the source relation of the target data.
Step 310: Perform data detection on the nodes in the data knowledge graph and the association relations between the nodes according to a pre-constructed data risk strategy and the data description information to obtain a detection result.
Because the data knowledge graph is constructed on the data transmitted through the contract text, the data and the relation between the data can be automatically detected, the detection efficiency is improved, and the labor cost is reduced.
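Steps 302 to 310 can be illustrated with the following simplified sketch. It is not the method of the specification: a keyword lookup table (`KEYWORD_RULES`) and a regular expression stand in for the target model, the transfer records and their field layout are invented, and the only rule checked is that every transmitted data item was declared in the contract text.

```python
import re

# Hypothetical stand-in for the target model used in step 302.
KEYWORD_RULES = {
    "declared_data": ["telephone number", "email address"],
    "usage_scene": ["marketing", "risk control"],
}


def describe(contract_text):
    """Steps 302-304: extract keywords and derive the description information."""
    lower = contract_text.lower()
    info = {
        kind: [kw for kw in keywords if kw in lower]
        for kind, keywords in KEYWORD_RULES.items()
    }
    match = re.search(r"(\d+)\s*(day|month|year)s?", lower)
    info["data_cycle"] = (int(match.group(1)), match.group(2)) if match else None
    return info


def build_graph(transfers):
    """Steps 306-308: nodes are parties, edges carry the transmitted data item."""
    nodes, edges = set(), []
    for source, receiver, item in transfers:
        nodes.update((source, receiver))
        edges.append({"from": source, "to": receiver, "data": item})
    return nodes, edges


def detect(edges, info):
    """Step 310: flag edges whose data item was not declared in the contract."""
    return [edge for edge in edges if edge["data"] not in info["declared_data"]]


text = "The telephone number is transferred for marketing and retained for 6 months."
info = describe(text)
nodes, edges = build_graph([
    ("user", "company A", "telephone number"),
    ("company A", "overseas partner", "email address"),
])
flagged = detect(edges, info)
print(info, flagged)
```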
Corresponding to the above method embodiment, the present specification further provides an embodiment of a data detection apparatus, and fig. 4 shows a schematic structural diagram of a data detection apparatus provided in an embodiment of the present specification. As shown in fig. 4, the apparatus includes:
a protocol parsing module 402 configured to parse the contract text to determine data description information;
a graph construction module 404 configured to acquire target data and construct a data knowledge graph according to the target data and the data description information;
and a detection module 406 configured to perform data detection on the data knowledge graph according to a pre-constructed data risk strategy to obtain a detection result.
Optionally, the target data comprises a data relationship table, and the graph construction module 404 is further configured to:
parse the target data to obtain the data relationship table;
and construct the data knowledge graph according to the data relationship table and the data description information.
Optionally, the target data further comprises a usage log for the target data, and the graph construction module 404 is further configured to:
parse the target data to obtain the usage log, and generate attribute information according to the usage log;
and add the attribute information to an edge of the data knowledge graph.
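As a rough illustration of turning a usage log into edge attributes, consider the sketch below. The log fields (`subject`, `data_item`, `purpose`), the derived attributes (`access_count`, `purposes`) and the way edges are keyed are all assumptions; the specification does not fix a log format.

```python
from collections import defaultdict


def attributes_from_usage_log(log_entries):
    """Aggregate log entries into attribute information per (subject, data item)."""
    attrs = defaultdict(lambda: {"access_count": 0, "purposes": set()})
    for entry in log_entries:
        key = (entry["subject"], entry["data_item"])
        attrs[key]["access_count"] += 1
        attrs[key]["purposes"].add(entry["purpose"])
    return dict(attrs)


def attach_to_edges(edges, attrs):
    """Add the generated attribute information to edges that already exist."""
    for key, values in attrs.items():
        if key in edges:
            edges[key].update(values)
    return edges


# Existing edges of the graph, keyed by (subject, data item).
graph_edges = {("Li Ming", "telephone number"): {"type": "owns"}}
usage_log = [
    {"subject": "Li Ming", "data_item": "telephone number", "purpose": "marketing"},
    {"subject": "Li Ming", "data_item": "telephone number", "purpose": "risk control"},
    {"subject": "Li Fei", "data_item": "telephone number", "purpose": "marketing"},
]
attach_to_edges(graph_edges, attributes_from_usage_log(usage_log))
print(graph_edges)
```

In this sketch, log entries that do not correspond to an existing edge are ignored rather than creating new edges; whether that is desirable depends on the risk strategy.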
Optionally, the graph construction module 404 is further configured to:
extract data items from the data relationship table, determine nodes of the data knowledge graph according to the data items, and determine edges of the data knowledge graph according to the relations among the data items;
and generate attribute information according to the data description information, and add the attribute information to the edges of the data knowledge graph.
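The construction of nodes and edges from a data relationship table might look like the sketch below. The triple layout `(head, relation, tail)` of the table rows, the shape of `description_info` and the function name are hypothetical choices for this example.

```python
def build_knowledge_graph(relation_table, description_info):
    """Build nodes and edges from table rows, then attach description attributes.

    relation_table: iterable of (head, relation, tail) triples.
    description_info: maps a data item to attribute information derived
    from the contract text.
    """
    nodes = set()
    edges = {}
    for head, relation, tail in relation_table:
        # Every data item becomes a node; every relation becomes an edge.
        nodes.update((head, tail))
        edges[(head, relation, tail)] = {}
    for (head, relation, tail), attrs in edges.items():
        # Attribute information from the description is stored on the edge
        # that points at the described data item.
        attrs.update(description_info.get(tail, {}))
    return nodes, edges


relation_table = [
    ("Li Ming", "parent_of", "Li Fei"),
    ("Li Ming", "owns", "telephone number"),
]
description_info = {"telephone number": {"declared": True}}
kg_nodes, kg_edges = build_knowledge_graph(relation_table, description_info)
print(kg_nodes, kg_edges)
```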
Optionally, the graph construction module 404 is further configured to:
extract data items from the data relationship table, determine nodes of the data knowledge graph according to the data items in the data relationship table and the data description information, and determine edges of the data knowledge graph according to the relations among the data items;
and generate attribute information according to the data description information, and add the attribute information to the nodes of the data knowledge graph.
Optionally, the graph construction module 404 is further configured to:
acquire a usage log for the target data, and generate attribute information according to the usage log;
and add the attribute information to an edge of the data knowledge graph.
Optionally, the graph construction module 404 is further configured to:
determine a main body node and the data nodes corresponding to the main body node according to the data items in the data relationship table;
and generate a contract text node according to the contract text corresponding to the data description information.
Optionally, the detection module 406 is further configured to:
acquire a main body identifier of a main body to be detected, and determine the corresponding main body node in the data knowledge graph according to the main body identifier;
determine a detection condition according to the pre-constructed data risk strategy, and determine an edge to be detected from the edges corresponding to the main body node according to the detection condition;
and determine a detection requirement according to the pre-constructed data risk strategy, and perform data detection on the edge to be detected according to the detection requirement to obtain a detection result.
Optionally, the detection module 406 is further configured to:
determine the type of the data to be detected and the type of the edge to be detected according to the detection condition;
determine initial edges to be detected, and the data nodes corresponding to the initial edges to be detected, from the edges corresponding to the main body node according to the type of the edge to be detected;
determine the data nodes to be detected from the data nodes corresponding to the initial edges to be detected according to the type of the data to be detected;
and determine the edges to be detected from the initial edges to be detected according to the data nodes to be detected.
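The two-stage selection of edges to be detected (first by edge type, then by the type of the data node each edge reaches) can be sketched as follows. The edge dictionaries, the `node_types` mapping and the type labels are assumptions made for this example.

```python
def select_edges(graph_edges, node_types, subject, edge_type, data_type):
    """Return the edges of `subject` that match the detection condition."""
    # Initial edges to be detected: edges of the main body node with the given edge type.
    initial = [
        edge for edge in graph_edges
        if edge["from"] == subject and edge["type"] == edge_type
    ]
    # Data nodes reached by the initial edges.
    data_nodes = {edge["to"] for edge in initial}
    # Data nodes to be detected: those whose data type matches the condition.
    to_detect = {node for node in data_nodes if node_types.get(node) == data_type}
    # Edges to be detected: initial edges that end at a data node to be detected.
    return [edge for edge in initial if edge["to"] in to_detect]


graph_edges = [
    {"from": "Li Ming", "type": "owns", "to": "telephone number"},
    {"from": "Li Ming", "type": "owns", "to": "device model"},
    {"from": "Li Ming", "type": "parent_of", "to": "Li Fei"},
]
node_types = {
    "telephone number": "personal_information",
    "device model": "device_information",
}
selected = select_edges(graph_edges, node_types, "Li Ming", "owns", "personal_information")
print(selected)
```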
Optionally, the detection module 406 is further configured to:
determine the attribute information to be detected of the edge to be detected according to the detection requirement;
and acquire the attribute value of the attribute information to be detected, and determine the detection result according to the attribute value.
Optionally, the detection module 406 is further configured to:
determine, according to the detection requirement, the attribute information to be detected of the edge to be detected and the related attribute information of the attribute information to be detected;
and, in a case where the related attribute information exists in the attribute information of the edge to be detected, acquire the attribute value of the related attribute information, and determine the detection result according to the attribute value.
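A small sketch of the attribute check on an edge, including the variant with related attribute information, is shown below. The attribute names (`disclosed`, `user_consent`), the result strings and the fallback order (related attribute first, then the attribute to be detected) are assumptions for illustration only.

```python
def check_edge(edge_attrs, attribute, related_attribute=None, expected=True):
    """Determine a detection result from the attribute values of one edge."""
    if related_attribute is not None and related_attribute in edge_attrs:
        # The related attribute exists on the edge: its value decides the result.
        value = edge_attrs[related_attribute]
    elif attribute in edge_attrs:
        # Otherwise use the value of the attribute to be detected itself.
        value = edge_attrs[attribute]
    else:
        return "unknown"
    return "compliant" if value == expected else "non-compliant"


edge_attrs = {"disclosed": False, "user_consent": True}
plain_result = check_edge(edge_attrs, "disclosed")
related_result = check_edge(edge_attrs, "disclosed", related_attribute="user_consent")
print(plain_result, related_result)
```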
Optionally, the detection module 406 is further configured to:
acquire a main body identifier, and determine the corresponding contract text node in the data knowledge graph according to the main body identifier;
determine a detection condition according to the pre-constructed data risk strategy, and determine the attribute information to be detected from the attribute information of the contract text node according to the detection condition;
and determine a detection requirement according to the pre-constructed data risk strategy, and perform data detection on the attribute information to be detected according to the detection requirement to obtain a detection result.
Optionally, the detection module 406 is further configured to:
parse the security rule to obtain a code analysis result, and determine the data risk strategy according to the code analysis result.
The embodiments of this specification provide a data detection method and a data detection apparatus, wherein the data detection apparatus parses the contract text to determine data description information; acquires target data, and constructs a data knowledge graph according to the target data and the data description information; and performs data detection on the data knowledge graph according to a pre-constructed data risk strategy to obtain a detection result. By parsing the contract text governing the data transmission, the data description information for the target data is determined; the data knowledge graph is constructed for the acquired target data; and data detection is performed on the data knowledge graph according to the pre-constructed data risk strategy and the data description information to obtain the detection result.
The above is a schematic scheme of a data detection apparatus of the present embodiment. It should be noted that the technical solution of the data detection apparatus and the technical solution of the data detection method belong to the same concept, and details that are not described in detail in the technical solution of the data detection apparatus can be referred to the description of the technical solution of the data detection method.
FIG. 5 illustrates a block diagram of a computing device 500 provided in accordance with one embodiment of the present description. The components of the computing device 500 include, but are not limited to, a memory 510 and a processor 520. Processor 520 is coupled to memory 510 via bus 530, and database 550 is used to store data.
Computing device 500 also includes access device 540, which enables computing device 500 to communicate via one or more networks 560. Examples of such networks include the Public Switched Telephone Network (PSTN), a Local Area Network (LAN), a Wide Area Network (WAN), a Personal Area Network (PAN), or a combination of communication networks such as the Internet. The access device 540 may include one or more of any type of network interface, wired or wireless, e.g., a Network Interface Card (NIC), an IEEE 802.11 Wireless Local Area Network (WLAN) wireless interface, a Worldwide Interoperability for Microwave Access (WiMAX) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth interface, a Near Field Communication (NFC) interface, and so forth.
In one embodiment of the present description, the above-described components of computing device 500, as well as other components not shown in FIG. 5, may also be connected to each other, such as by a bus. It should be understood that the block diagram of the computing device architecture shown in FIG. 5 is for purposes of example only and is not limiting as to the scope of the present description. Those skilled in the art may add or replace other components as desired.
Computing device 500 may be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (e.g., tablet computer, personal digital assistant, laptop computer, notebook computer, netbook, etc.), mobile phone (e.g., smartphone), wearable computing device (e.g., smartwatch, smart glasses, etc.), or other type of mobile device, or a stationary computing device such as a desktop computer or PC. Computing device 500 may also be a mobile or stationary server.
Wherein the processor 520 is configured to execute computer-executable instructions that, when executed by the processor, implement the steps of the data detection method described above.
The above is an illustrative scheme of a computing device of the present embodiment. It should be noted that the technical solution of the computing device and the technical solution of the data detection method belong to the same concept, and details that are not described in detail in the technical solution of the computing device can be referred to the description of the technical solution of the data detection method.
An embodiment of the present specification further provides a computer-readable storage medium storing computer-executable instructions, which when executed by a processor, implement the steps of the above data detection method.
The above is an illustrative scheme of a computer-readable storage medium of the present embodiment. It should be noted that the technical solution of the storage medium belongs to the same concept as the technical solution of the data detection method, and details that are not described in detail in the technical solution of the storage medium can be referred to the description of the technical solution of the data detection method.
An embodiment of the present specification further provides a computer program, wherein when the computer program is executed in a computer, the computer is caused to execute the steps of the data detection method.
The above is a schematic scheme of a computer program of the present embodiment. It should be noted that the technical solution of the computer program and the technical solution of the data detection method belong to the same concept, and details that are not described in detail in the technical solution of the computer program can be referred to the description of the technical solution of the data detection method.
The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
The computer instructions comprise computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content contained in the computer-readable medium may be appropriately increased or decreased as required by legislation and patent practice in a jurisdiction; for example, in some jurisdictions, in accordance with legislation and patent practice, computer-readable media do not include electrical carrier signals and telecommunications signals.
It should be noted that, for simplicity of description, the foregoing method embodiments are described as a series of acts, but those skilled in the art should understand that the embodiments of this specification are not limited by the described order of acts, because according to the embodiments of this specification some steps may be performed in other orders or simultaneously. Further, those skilled in the art should also appreciate that the embodiments described in this specification are preferred embodiments, and that the acts and modules referred to are not necessarily required by the embodiments of this specification.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
The preferred embodiments of the present specification disclosed above are intended only to aid in the description of the specification. The alternative embodiments do not describe all details exhaustively, nor do they limit the invention to the specific embodiments described. Obviously, many modifications and variations are possible in light of the content of the embodiments of this specification. The embodiments were chosen and described in order to best explain their principles and practical application, and thereby to enable others skilled in the art to best understand and utilize this specification. The specification is limited only by the claims and their full scope and equivalents.

Claims (17)

1. A method of data detection, comprising:
analyzing the contract text to determine data description information;
acquiring target data, and constructing a data knowledge graph according to the target data and the data description information;
and performing data detection on the data knowledge graph according to a pre-constructed data risk strategy to obtain a detection result.
2. The method of claim 1, the target data comprising a data relationship table;
correspondingly, the constructing a data knowledge graph according to the target data and the data description information comprises the following steps:
analyzing the target data to obtain the data relation table;
and constructing a data knowledge graph according to the data relation table and the data description information.
3. The method of claim 2, the target data further comprising a usage log for the target data;
correspondingly, the acquiring target data and constructing a data knowledge graph according to the target data and the data description information includes:
analyzing the target data to obtain the use log, and generating attribute information according to the use log;
adding the attribute information to an edge of the data knowledge-graph.
4. The method of claim 2, the building a data knowledge graph from the data relational tables and the data description information, comprising:
extracting data items in the data relation table, determining nodes of the data knowledge graph according to the data items, and determining edges of the data knowledge graph according to the relation among the data items;
and generating attribute information according to the data description information, and adding the attribute information to the edge of the data knowledge graph.
5. The method of claim 2, the building a data knowledge graph from the data relational tables and the data description information, comprising:
extracting data items in the data relation table, determining nodes of a data knowledge graph according to the data items in the data relation table and the data description information, and determining edges of the data knowledge graph according to the relation between the data items;
and generating attribute information according to the data description information, and adding the attribute information to the nodes of the data knowledge graph.
6. The method of claim 4 or 5, further comprising:
acquiring a use log aiming at the target data, and generating attribute information according to the use log;
adding the attribute information to an edge of the data knowledge-graph.
7. The method of claim 5, the determining nodes of a data knowledge graph from data items in the data relationship table and the data description information, comprising:
determining a main body node and a data node corresponding to the main body node according to a data item in the data relation table;
and generating contract text nodes according to the contract text corresponding to the data description information.
8. The method according to claim 1, 2 or 3, wherein the performing data detection on the data knowledge graph according to a pre-constructed data risk strategy to obtain a detection result comprises:
acquiring a main body identifier of a main body to be detected, and determining a corresponding main body node in the data knowledge graph according to the main body identifier;
determining a detection condition according to a pre-constructed data risk strategy, and determining an edge to be detected from the edges corresponding to the main body node according to the detection condition;
determining a detection requirement according to a pre-constructed data risk strategy, and carrying out data detection on the edge to be detected according to the detection requirement to obtain a detection result.
9. The method according to claim 8, wherein the determining, according to the detection condition, the edge to be detected from the edges corresponding to the body node comprises:
determining the type of the data to be detected and the type of the edge to be detected according to the detection condition;
determining an initial edge to be detected and a data node corresponding to the initial edge to be detected from the edge corresponding to the main node according to the type of the edge to be detected;
determining a data node to be detected from the data node corresponding to the initial edge to be detected according to the type of the data to be detected;
and determining the edges to be detected from the initial edges to be detected according to the data nodes to be detected.
10. The method according to claim 8, wherein the performing data detection on the edge to be detected according to the detection requirement to obtain a detection result includes:
determining attribute information to be detected of the edge to be detected according to the detection requirement;
and acquiring an attribute value of the attribute information to be detected, and determining a detection result according to the attribute value.
11. The method according to claim 8, wherein the performing data detection on the edge to be detected according to the detection requirement to obtain a detection result includes:
determining attribute information to be detected of the edge to be detected and related attribute information of the attribute information to be detected according to the detection requirement;
and under the condition that the related attribute information exists in the attribute information of the edge to be detected, acquiring an attribute value of the related attribute information, and determining a detection result according to the attribute value.
12. The method according to claim 1, 2 or 7, wherein the performing data detection on the data knowledge graph according to a pre-constructed data risk strategy to obtain a detection result comprises:
acquiring a main body identifier, and determining a corresponding contract text node in the data knowledge graph according to the main body identifier;
determining a detection condition according to a pre-constructed data risk strategy, and determining attribute information to be detected from the attribute information of the contract text node according to the detection condition;
determining a detection requirement according to a pre-constructed data risk strategy, and performing data detection on the attribute information to be detected according to the detection requirement to obtain a detection result.
13. The method according to any one of claims 1 to 5, further comprising, before performing data detection on the data knowledge graph according to the pre-constructed data risk strategy:
and analyzing the security rule to obtain a code analysis result, and determining a data risk strategy according to the code analysis result.
14. A data detection apparatus comprising:
the protocol analysis module is configured to parse the contract text and determine data description information;
the graph construction module is configured to acquire transmitted target data and construct a data knowledge graph according to the target data and the data description information;
and the detection module is configured to perform data detection on the data knowledge graph according to a pre-constructed data risk strategy to obtain a detection result.
15. A data detection system comprises a server and a client;
the server receives a data detection request, analyzes a contract text according to the data detection request and determines data description information;
the client acquires target data and sends the target data to the server;
the server constructs a data knowledge graph according to the target data and the data description information, performs data detection on the data knowledge graph according to a pre-constructed data risk strategy to obtain a detection result,
and sends the detection result to the client.
16. A computing device, comprising:
a memory and a processor;
the memory is configured to store computer-executable instructions and the processor is configured to execute the computer-executable instructions, which when executed by the processor, perform the steps of the data detection method of any one of claims 1 to 13.
17. A computer readable storage medium storing computer executable instructions which, when executed by a processor, carry out the steps of the data detection method of any one of claims 1 to 13.
CN202211144823.2A 2022-09-20 2022-09-20 Data detection method and device Pending CN115470361A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211144823.2A CN115470361A (en) 2022-09-20 2022-09-20 Data detection method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211144823.2A CN115470361A (en) 2022-09-20 2022-09-20 Data detection method and device

Publications (1)

Publication Number Publication Date
CN115470361A true CN115470361A (en) 2022-12-13

Family

ID=84332818

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211144823.2A Pending CN115470361A (en) 2022-09-20 2022-09-20 Data detection method and device

Country Status (1)

Country Link
CN (1) CN115470361A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116192467A (en) * 2023-01-04 2023-05-30 北京夏石科技有限责任公司 Data cross-border compliance management and control method and device
CN116192467B (en) * 2023-01-04 2023-10-10 北京夏石科技有限责任公司 Data cross-border compliance management and control method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination