CN115795525A - Sensitive data identification method, apparatus, electronic device, medium, and program product


Publication number
CN115795525A
Authority
CN
China
Legal status
Pending
Application number
CN202211092172.7A
Other languages
Chinese (zh)
Inventor
吴延生
周新衡
邓观何
劳晓华
Current Assignee
Industrial and Commercial Bank of China Ltd ICBC
Original Assignee
Industrial and Commercial Bank of China Ltd ICBC
Application filed by Industrial and Commercial Bank of China Ltd ICBC
Priority to CN202211092172.7A
Publication of CN115795525A


Abstract

The present disclosure provides a method, an apparatus, an electronic device, a medium, and a computer program product for identifying sensitive data of a transaction log. The method and the device can be used in the technical field of artificial intelligence. The method comprises the following steps: constructing a knowledge graph according to the acquired transaction log information, wherein the transaction log information comprises transaction unique identification, transaction information, transaction message information and a conversion relation between the transaction information and the transaction message information, nodes of the knowledge graph are constructed according to the transaction information and the transaction message information, connecting edges of the knowledge graph are constructed according to the transaction unique identification and the conversion relation, and parts of the nodes are marked as sensitive nodes; dividing the nodes of the knowledge graph according to the association degree of the nodes to generate n communities, wherein n is an integer greater than or equal to 1; identifying a community containing a sensitive node in the n communities as a sensitive community; and identifying transaction log information corresponding to all nodes contained in the sensitive community as sensitive data.

Description

Sensitive data identification method, apparatus, electronic device, medium, and program product
Technical Field
The present disclosure relates to the field of artificial intelligence technologies, and in particular, to a method, an apparatus, an electronic device, a medium, and a computer program product for identifying sensitive data of a transaction log.
Background
In existing banking business, transaction data may involve user privacy information such as certificate numbers, card numbers, validity periods, passwords, and magnetic track or chip information. Such customer sensitive information must be protected at graded levels as required by the relevant departments. To store sensitive information in desensitized form, the sensitive information is marked in the system when the production database table structure is designed, and the data can be desensitized with a data transformation tool after it is exported from the production database.
With the development of internet finance, banks serve more and more business scenarios. In addition, new channels are continually added to improve customer experience, and transaction volume keeps growing. To monitor system and business operation, each bank system records transaction log information. The transaction log information is recorded by each application program on its own, with no uniform format requirement, and may involve some sensitive information. Existing desensitization technology either identifies the sensitive information in advance, desensitizes it, and records it in a database, or specially marks the sensitive information fields in the transaction log information so that they are uniformly desensitized after being sent to a log system.
Disclosure of Invention
In view of the above, the present disclosure provides a sensitive data identification method, apparatus, electronic device, computer-readable storage medium, and computer program product for a transaction log with high positioning accuracy and efficiency.
One aspect of the present disclosure provides a method for identifying sensitive data of a transaction log, including: constructing a knowledge graph according to the acquired transaction log information, wherein the transaction log information comprises transaction unique identification, transaction information, transaction message information and a conversion relation between the transaction information and the transaction message information, nodes of the knowledge graph are constructed according to the transaction information and the transaction message information, connecting edges of the knowledge graph are constructed according to the transaction unique identification and the conversion relation, and parts of the nodes are marked as sensitive nodes; dividing the nodes of the knowledge graph according to the association degree of the nodes to generate n communities, wherein n is an integer greater than or equal to 1; identifying a community containing the sensitive node in the n communities as a sensitive community; and identifying the transaction log information corresponding to all nodes contained in the sensitive community as sensitive data.
According to the sensitive data identification method for a transaction log of the present disclosure, a knowledge graph is constructed from the transaction log information, the nodes of the knowledge graph are divided according to their association degree to generate n communities, each community that contains a sensitive node is identified as a sensitive community, and the transaction log information corresponding to all nodes in the sensitive communities is identified as sensitive data. Even when the volume of transaction log information is huge, each application program records its own logs, and there is no uniform format requirement, all sensitive data in the transaction log can be accurately located and the sensitive data in every application system is fully covered. Manual analysis of sensitive data is avoided, and the accuracy and efficiency of sensitive data identification are improved.
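The last two steps of the method (identifying sensitive communities, then flagging every log entry in them) can be illustrated with a minimal sketch. The function name `find_sensitive_data`, the node identifiers, and the `node_to_log_entries` mapping are hypothetical and chosen only for illustration; the disclosure does not prescribe a data layout.

```python
from typing import Dict, List, Set


def find_sensitive_data(
    communities: List[Set[str]],
    sensitive_nodes: Set[str],
    node_to_log_entries: Dict[str, List[str]],
) -> List[str]:
    """Return the log entries of every node in a community that holds a sensitive node."""
    flagged: List[str] = []
    for community in communities:
        # A community counts as sensitive as soon as it contains one sensitive node.
        if community & sensitive_nodes:
            for node in sorted(community):
                flagged.extend(node_to_log_entries.get(node, []))
    return flagged


# Hypothetical communities and node identifiers.
communities = [{"card_no:147852", "card_msg:code4"}, {"event_name:open card"}]
sensitive = {"card_no:147852"}
entries = {
    "card_no:147852": ["card number=147852"],
    "card_msg:code4": ["card number message=character code 4"],
    "event_name:open card": ["event name=open card"],
}
result = find_sensitive_data(communities, sensitive, entries)
print(result)
```

In this sketch the card number message is flagged even though only the card number node was marked as sensitive, because both nodes fall in the same community.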
In some embodiments, the constructing a knowledge graph according to the acquired transaction log information includes: acquiring transaction log information of a transaction log; structuring the transaction log information to obtain field names and the field value under each field name, wherein the field values are obtained by conversion according to the transaction unique identifier, the transaction information, the transaction message information and the conversion relation; marking some of the field names as sensitive information; constructing a knowledge graph according to the field names and the field values under the field names; and marking each node constructed from a field value corresponding to the sensitive information as a sensitive node.
In some embodiments, the dividing the nodes of the knowledge-graph according to the association degree of the nodes to generate n communities includes: taking each node in the knowledge graph as a group, and calculating the modularity of the group; traversing each group, and determining the modularity change between the group and each group having an edge relationship with the group; when the modularity change meets a set threshold, merging the group and the group with the edge relation according to the modularity change; taking the merged group as a new group, calculating the modularity of the new group, repeatedly executing the traversal of each group, and determining the modularity change between the group and each group with edge relation with the group; when the modularity change does not meet the set threshold, stopping merging the group and the group with the edge relation; and when the modularity degree change between every two groups does not meet the set threshold value, taking the current n groups as n communities.
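The merging procedure described above can be sketched as a simple greedy agglomeration. This is an illustrative simplification, not the disclosed implementation: it assumes an undirected, unweighted graph, merges only the single best pair of groups per pass, and treats "meets the set threshold" as "exceeds `threshold`". The function name and the edge-list input are hypothetical. For two groups a and b with e_ab edges between them and total degrees d_a and d_b in a graph of m edges, the modularity change on merging is e_ab/m - d_a*d_b/(2*m^2).

```python
from collections import defaultdict


def greedy_modularity_communities(edges, threshold=0.0):
    """Merge groups greedily while the best modularity change exceeds threshold."""
    m = len(edges)
    degree = defaultdict(int)
    group_of = {}
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        group_of.setdefault(u, u)  # every node starts as its own group
        group_of.setdefault(v, v)
    while True:
        group_degree = defaultdict(int)
        for node, group in group_of.items():
            group_degree[group] += degree[node]
        # Count edges between each pair of distinct groups (the "edge relation").
        between = defaultdict(int)
        for u, v in edges:
            gu, gv = group_of[u], group_of[v]
            if gu != gv:
                between[tuple(sorted((gu, gv)))] += 1
        # Modularity change for merging each connected pair of groups.
        changes = [
            (e_ab / m - group_degree[a] * group_degree[b] / (2 * m * m), a, b)
            for (a, b), e_ab in between.items()
        ]
        if not changes:
            break
        best_gain, a, b = max(changes)  # largest change ranks first
        if best_gain <= threshold:
            break  # no pair meets the threshold: stop merging
        for node, group in list(group_of.items()):
            if group == b:
                group_of[node] = a
    communities = defaultdict(set)
    for node, group in group_of.items():
        communities[group].add(node)
    return list(communities.values())


# Two triangles joined by one bridge edge (c, d).
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"), ("c", "d")]
result = greedy_modularity_communities(edges)
print(sorted(sorted(c) for c in result))
```

With the default threshold of 0, the two triangles end up as separate communities because merging them across the bridge would lower modularity.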
In some embodiments, the merging the group and the group having an edge relationship with the group according to the modularity change includes: sorting the modularity changes by numerical value; and, according to the sorting result, merging the two groups whose modularity change ranks first or last in the order.
In some embodiments, the transaction unique identification comprises an event unique code; the transaction information comprises at least one of an event name, a user name, a card number, an identity card number, an address and a transaction amount; the transaction message information comprises at least one of an event unique code message converted from the event unique code, an event name message converted from the event name, a user name message converted from the user name, a card number message converted from the card number, an identity card number message converted from the identity card number, an address message converted from the address and a transaction amount message converted from the transaction amount.
In some embodiments, the method for identifying sensitive data of a transaction log further comprises: desensitizing the sensitive data and storing the transaction log information.
In some embodiments, the desensitizing the sensitive data and storing the transaction log information includes: extracting the sensitive data from the transaction log information and desensitizing it to obtain desensitized data; recording the desensitized data in the transaction log information; and storing the desensitized transaction log information in a log system; or, alternatively, specially marking the sensitive data in the transaction log information; sending the marked transaction log information to the log system; and desensitizing the transaction log information in the log system.
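The two storage paths above can be sketched as follows. The masking rule (replace all but the last four characters with asterisks), the function names, and the field names are assumptions made for illustration; the disclosure does not specify a particular desensitization algorithm.

```python
def mask_value(value: str, keep_last: int = 4) -> str:
    """Hypothetical masking rule: hide everything except the last keep_last characters."""
    if len(value) <= keep_last:
        return "*" * len(value)
    return "*" * (len(value) - keep_last) + value[-keep_last:]


def desensitize_record(record: dict, sensitive_fields: set) -> dict:
    """Path 1: desensitize before the record is stored in the log system."""
    return {
        name: mask_value(str(value)) if name in sensitive_fields else value
        for name, value in record.items()
    }


def mark_record(record: dict, sensitive_fields: set) -> dict:
    """Path 2: only mark sensitive fields; the log system desensitizes later."""
    return {
        "fields": dict(record),
        "sensitive": sorted(name for name in record if name in sensitive_fields),
    }


record = {"event_code": "1234", "card_number": "147852", "id_number": "1305632"}
sensitive_fields = {"card_number", "id_number"}

stored = desensitize_record(record, sensitive_fields)
marked = mark_record(record, sensitive_fields)
# The log system can apply the same rule to a marked record on arrival.
stored_later = desensitize_record(marked["fields"], set(marked["sensitive"]))
print(stored)
```

Both paths end with the same stored record in this sketch; they differ only in where the masking runs.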
In some embodiments, the knowledge graph has a data query function: when transaction log information is queried in the knowledge graph, the nodes and edges corresponding to the queried transaction log information are displayed.
Another aspect of the present disclosure provides a sensitive data identification apparatus for a transaction log, including: a construction module configured to construct a knowledge graph according to acquired transaction log information, wherein the transaction log information comprises transaction unique identification, transaction information, transaction message information and a conversion relation between the transaction information and the transaction message information, nodes of the knowledge graph are constructed according to the transaction information and the transaction message information, connecting edges of the knowledge graph are constructed according to the transaction unique identification and the conversion relation, and some of the nodes are marked as sensitive nodes; a generating module configured to divide the nodes of the knowledge graph according to the association degree of the nodes to generate n communities, wherein n is an integer greater than or equal to 1; a first identification module configured to identify a community of the n communities that includes the sensitive node as a sensitive community; and a second identification module configured to identify the transaction log information corresponding to all nodes contained in the sensitive community as sensitive data.
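A query of this kind might look like the following sketch, which assumes the graph is held as a plain mapping of node identifiers to text plus a list of labeled edges. The function name `query_graph` and the substring match are hypothetical choices, not part of the disclosure.

```python
def query_graph(nodes: dict, edges: list, keyword: str):
    """Return the nodes whose text contains keyword and every edge touching them."""
    matched = {node_id for node_id, text in nodes.items() if keyword in text}
    hit_edges = [
        (source, target, relation)
        for source, target, relation in edges
        if source in matched or target in matched
    ]
    return matched, hit_edges


# Hypothetical node identifiers and edge labels.
nodes = {
    "n1": "card number=147852",
    "n2": "card number message=character code 4",
    "n3": "event name=opening a credit card",
}
edges = [("n1", "n2", "converted to"), ("n3", "n1", "same transaction")]

matched, hit_edges = query_graph(nodes, edges, "147852")
print(matched, hit_edges)
```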
Another aspect of the present disclosure provides an electronic device comprising one or more processors and one or more memories, wherein the memories are configured to store executable instructions that, when executed by the processors, implement the method as described above.
Another aspect of the present disclosure provides a computer-readable storage medium storing computer-executable instructions for implementing the method as described above when executed.
Another aspect of the disclosure provides a computer program product comprising a computer program comprising computer executable instructions for implementing the method as described above when executed.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent from the following description of embodiments of the present disclosure with reference to the accompanying drawings, in which:
FIG. 1 schematically illustrates an exemplary system architecture to which the methods and apparatus according to embodiments of the disclosure may be applied;
FIG. 2 schematically illustrates a flow chart of a method of sensitive data identification of a transaction log according to an embodiment of the present disclosure;
FIG. 3 schematically shows a schematic diagram of a knowledge-graph according to an embodiment of the present disclosure;
FIG. 4 schematically illustrates a flow diagram for building a knowledge-graph from acquired transaction log information, according to an embodiment of the present disclosure;
FIG. 5 schematically illustrates a flow diagram for partitioning nodes of a knowledge-graph according to their degree of association to generate n communities, according to an embodiment of the present disclosure;
FIG. 6 schematically shows a schematic diagram of a knowledge-graph according to an embodiment of the present disclosure;
FIG. 7 schematically illustrates a flow chart for merging the group and the group having an edge relationship therewith according to modularity variations according to an embodiment of the present disclosure;
FIG. 8 schematically illustrates a flow chart of a method of sensitive data identification of a transaction log according to an embodiment of the disclosure;
FIG. 9 schematically illustrates a flow diagram for desensitizing sensitive data and storing transaction log information, according to an embodiment of the disclosure;
FIG. 10 schematically illustrates a block diagram of log centric encryption according to an embodiment of the disclosure;
FIG. 11 schematically illustrates a flow chart of a method of sensitive data identification of a transaction log according to an embodiment of the present disclosure;
FIG. 12 schematically shows a knowledge representation of a knowledge-graph and edge attributes, in accordance with an embodiment of the present disclosure;
FIG. 13 schematically illustrates a flow chart of a graph modeling method according to an embodiment of the disclosure;
FIG. 14 schematically illustrates a block diagram of a sensitive data identification apparatus for a transaction log according to an embodiment of the present disclosure;
FIG. 15 schematically illustrates a block diagram of a building block according to an embodiment of the disclosure;
FIG. 16 schematically shows a block diagram of a generating module according to an embodiment of the disclosure;
FIG. 17 schematically shows a block diagram of a merging unit according to an embodiment of the present disclosure;
FIG. 18 schematically illustrates a block diagram of a sensitive data identification apparatus for a transaction log according to an embodiment of the present disclosure;
FIG. 19 schematically illustrates a block diagram of a desensitization processing module according to an embodiment of the present disclosure;
FIG. 20 schematically shows a block diagram of an electronic device according to an embodiment of the disclosure.
Detailed Description
Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that these descriptions are illustrative only and are not intended to limit the scope of the present disclosure. In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the disclosure. It may be evident, however, that one or more embodiments may be practiced without these specific details. Moreover, in the following description, descriptions of well-known structures and techniques are omitted so as to not unnecessarily obscure the concepts of the present disclosure.
In the technical solution of the present disclosure, the acquisition, collection, storage, use, processing, transmission, provision, disclosure, and application of data, including users' personal information, all comply with relevant laws and regulations, necessary security measures are taken, and public order and good customs are not violated.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
Where a convention analogous to "at least one of A, B, or C, etc." is used, in general such a construction is intended in the sense one having skill in the art would understand the convention (e.g., "a system having at least one of A, B, or C" would include but not be limited to systems that have A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B, and C together, etc.). The terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, features defined as "first" and "second" may explicitly or implicitly include one or more of the described features.
In existing banking business, transaction data may involve user privacy information such as certificate numbers, card numbers, validity periods, passwords, and magnetic track or chip information. Such customer sensitive information must be protected at graded levels as required by the relevant departments. To store sensitive information in desensitized form, the sensitive information is marked in the system when the production database table structure is designed, and the data can be desensitized with a data transformation tool after it is exported from the production database.
With the development of internet finance, banks serve more and more business scenarios. In addition, new channels are continually added to improve customer experience, and transaction volume keeps growing. To monitor system and business operation, each bank system records transaction log information. The transaction log information is recorded by each application program on its own, with no uniform format requirement, and may involve some sensitive information. Existing desensitization technology either identifies the sensitive information in advance, desensitizes it, and records it in a database, or specially marks the sensitive information fields in the transaction log information so that they are uniformly desensitized after being sent to a log system.
However, because the volume of transaction log information is huge, each application program records its own logs, and there is no uniform format requirement, the sensitive information is difficult to identify. As a result, existing desensitization technology struggles to cover all the sensitive information in the system and cannot accurately locate all of it.
Embodiments of the present disclosure provide a method, an apparatus, an electronic device, a computer-readable storage medium, and a computer program product for identifying sensitive data of a transaction log. The sensitive data identification method of the transaction log comprises the following steps: constructing a knowledge graph according to the acquired transaction log information, wherein the transaction log information comprises transaction unique identification, transaction information, transaction message information and a conversion relation between the transaction information and the transaction message information, nodes of the knowledge graph are constructed according to the transaction information and the transaction message information, connecting edges of the knowledge graph are constructed according to the transaction unique identification and the conversion relation, and parts of the nodes are marked as sensitive nodes; dividing the nodes of the knowledge graph according to the association degree of the nodes to generate n communities, wherein n is an integer greater than or equal to 1; identifying a community containing sensitive nodes in the n communities as a sensitive community; and identifying transaction log information corresponding to all nodes contained in the sensitive community as sensitive data.
It should be noted that the method, the apparatus, the electronic device, the computer-readable storage medium, and the computer program product for identifying sensitive data of a transaction log according to the present disclosure may be used in the field of artificial intelligence technology, and may also be used in any field other than the field of artificial intelligence technology, such as the field of finance, and the field of the present disclosure is not limited herein.
Fig. 1 schematically illustrates an exemplary system architecture 100 to which the sensitive data identification method, apparatus, electronic device, computer-readable storage medium and computer program product for a transaction log may be applied, according to embodiments of the disclosure. It should be noted that fig. 1 is only an example of a system architecture to which the embodiments of the present disclosure may be applied to help those skilled in the art understand the technical content of the present disclosure, and does not mean that the embodiments of the present disclosure may not be applied to other devices, systems, environments or scenarios.
As shown in fig. 1, the system architecture 100 according to this embodiment may include terminal devices 101, 102, 103, a network 104 and a server 105. Network 104 is the medium used to provide communication links between terminal devices 101, 102, 103 and server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 may have installed thereon various communication client applications, such as shopping applications, web browser applications, search applications, instant messaging tools, mailbox clients, social platform software, etc. (by way of example only).
The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.
The server 105 may be a server providing various services, such as a back-end management server (for example only) providing support for websites browsed by users using the terminal devices 101, 102, 103. The back-end management server may analyze and process received data such as a user request, and feed back a processing result (for example, a web page, information, or data obtained or generated according to the user request) to the terminal device.
It should be noted that the method for identifying sensitive data of a transaction log provided by the embodiment of the present disclosure may be generally executed by the server 105. Accordingly, the sensitive data identification device of the transaction log provided by the embodiment of the present disclosure may be generally disposed in the server 105. The method for identifying sensitive data of a transaction log provided by the embodiment of the present disclosure may also be performed by a server or a server cluster which is different from the server 105 and can communicate with the terminal devices 101, 102, 103 and/or the server 105. Correspondingly, the sensitive data identification device of the transaction log provided by the embodiment of the present disclosure may also be disposed in a server or a server cluster different from the server 105 and capable of communicating with the terminal devices 101, 102, 103 and/or the server 105.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for an implementation.
The sensitive data identification method of the transaction log according to the embodiment of the present disclosure will be described in detail with reference to fig. 2 to 9 based on the scenario described in fig. 1.
Fig. 2 schematically shows a flow chart of a method of sensitive data identification of a transaction log according to an embodiment of the present disclosure.
As shown in fig. 2, the method for identifying sensitive data of a transaction log of this embodiment includes operations S210 to S240.
In operation S210, a knowledge graph is constructed according to the acquired transaction log information, where the transaction log information includes a transaction unique identifier, transaction information, transaction message information, and a conversion relationship between the transaction information and the transaction message information, a node of the knowledge graph is constructed according to the transaction information and the transaction message information, a connection edge of the knowledge graph is constructed according to the transaction unique identifier and the conversion relationship, and a part of the node is labeled as a sensitive node.
As an implementable manner, the transaction unique identifier may include an event unique code; the transaction information may include at least one of an event name, a username, a card number, an identification number, an address, and a transaction amount; the transaction message information may include at least one of an event unique code message converted from an event unique code, an event name message converted from an event name, a user name message converted from a user name, a card number message converted from a card number, an identity card number message converted from an identity card number, an address message converted from an address, and a transaction amount message converted from a transaction amount.
The following description takes as an example a user opening a card at an ATM: the card opening information is converted into transaction message information by the comprehensive front-end application, and the desensitized card opening information is displayed on the ATM.
For example, Zhang San opens a card at ATM No. 1, and the unique code of the card opening event is 1234. The card opening transaction information includes: event name: opening a credit card; user name: Zhang San; card number: 147852; identity card number: 1305632; address: Dongcheng District, Beijing; transaction amount: 0. The unique code of the card opening event and the card opening transaction information can be converted into transaction message information by the comprehensive front-end application, and the transaction message information is a character code.
For example, Li Si opens a card at ATM No. 1, and the unique code of the card opening event is 1235. The card opening transaction information includes: event name: opening a deposit card; user name: Li Si; card number: 147889; identity card number: 1309832; address: Haidian District, Beijing; transaction amount: 10. The unique code of the card opening event and the card opening transaction information can be converted into transaction message information by the comprehensive front-end application, and the transaction message information is a character code.
As shown in fig. 3, a knowledge graph may be constructed from ATM No. 1 and the transaction log information (that is, the transaction unique identifiers, transaction information, transaction message information, and conversion relations between transaction information and transaction message information of the card opening transactions of Zhang San and Li Si). In the knowledge graph, the part of the transaction information specified by the relevant departments may be identified as sensitive information, and a node constructed from the sensitive information may be labeled as a sensitive node.
As some specific examples, as shown in fig. 3 and 4, operation S210 constructs a knowledge graph according to the acquired transaction log information, including operations S211 to S215.
In operation S211, transaction log information of the transaction log is acquired.
In operation S212, the transaction log information is structured to obtain field names and the field value under each field name, where the field values are obtained by conversion according to the transaction unique identifier, the transaction information, the transaction message information, and the conversion relationship. Taking the above card opening by Zhang San at ATM No. 1 as an example, structuring the transaction log information yields the following field names and field values: event unique code: 1234; event name: opening a credit card; user name: Zhang San; card number: 147852; identity card number: 1305632; address: Dongcheng District, Beijing; transaction amount: 0.
It also yields the following message field names and field values: event unique code message: character code 1; event name message: character code 2; user name message: character code 3; card number message: character code 4; identity card number message: character code 5; address message: character code 6; transaction amount message: character code 7.
In operation S213, some of the field names are marked as sensitive information. For example, assuming that the relevant department specifies the user's card number and identity card number as sensitive information, the field names "card number" and "identity card number" may be marked as sensitive information.
In operation S214, a knowledge graph is constructed according to the field name and the field value under the field name.
In operation S215, a node constructed with a field value corresponding to sensitive information is marked as a sensitive node. The construction of the knowledge graph according to the acquired transaction log information may be facilitated through operations S211 to S215.
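Operations S212 to S215 can be illustrated with a minimal sketch. It assumes a hypothetical in-memory representation in which every (field name, field value) pair is a node, the transaction unique identifier links the fields of one transaction, and the conversion relationship links a field to its message encoding. The names `record`, `messages`, `SENSITIVE_FIELDS`, and `build_graph` are illustrative only and do not come from the disclosure.

```python
# Hypothetical structured output of S212 for Zhang San's card-opening log.
record = {
    "event unique code": "1234",
    "event name": "open credit card",
    "account name": "Zhang San",
    "card number": "147852",
    "identification number": "1305632",
    "transaction amount": "0",
}
# Hypothetical message encodings (the conversion relationship of S212).
messages = {
    "card number": "character code 4",
    "identification number": "character code 5",
}
# S213: field names designated as sensitive information.
SENSITIVE_FIELDS = {"card number", "identification number"}


def build_graph(record, messages, sensitive_fields,
                id_field="event unique code"):
    """S214/S215: return (nodes, edges, sensitive_nodes) for one record."""
    nodes, edges, sensitive_nodes = set(), set(), set()
    tx_node = (id_field, record[id_field])
    for name, value in record.items():
        node = (name, value)
        nodes.add(node)
        if name != id_field:
            # Edge built from the transaction unique identifier.
            edges.add((tx_node, node))
        if name in messages:
            # Edge built from the conversion relationship.
            msg_node = (name + " message", messages[name])
            nodes.add(msg_node)
            edges.add((node, msg_node))
        if name in sensitive_fields:
            # S215: the node built from a sensitive field is a sensitive node.
            sensitive_nodes.add(node)
    return nodes, edges, sensitive_nodes


nodes, edges, sensitive_nodes = build_graph(record, messages, SENSITIVE_FIELDS)
```

In this sketch only the card number and identification number nodes are labeled sensitive; the remaining nodes become sensitive data later only if community division places them in the same community as a sensitive node.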
In operation S220, the nodes of the knowledge graph are divided according to the association degree of the nodes to generate n communities, where n is an integer greater than or equal to 1.
As a possible implementation manner, as shown in fig. 5, operation S220 divides the nodes of the knowledge graph according to the association degree of the nodes, and generates n communities, including operation S221 to operation S226.
In operation S221, each node in the knowledge graph is regarded as a group, and the modularity of the group is calculated, so that the knowledge graph may be initialized.
In operation S222, each group is traversed, and the modularity change between the group and each group having an edge relationship with it is determined. To facilitate understanding of the specific implementation of operations S221 to S226, consider the knowledge graph shown in fig. 6, which has, for example, 12 nodes: a, b, c, d, e, f, g, h, i, j, k and l. When the knowledge graph is initialized, each of the nodes a to l is taken as a group, and the modularity of each group is calculated. Taking group a as an example, groups b, c and d have an edge relationship with group a, so the modularity change between a and b, the modularity change between a and c, and the modularity change between a and d can be calculated respectively.
In operation S223, when the modularity variation satisfies the set threshold, the group and the group having an edge relationship therewith are merged according to the modularity variation.
In some specific examples, as shown in fig. 7, operation S223 merges the group and the group having an edge relationship therewith according to the modularity change, including operation S2231 and operation S2232.
In operation S2231, the modularity variations are sorted according to the magnitude of the value.
In operation S2232, according to the sorting result, the two groups whose modularity change is ranked first or last are merged. The modularity changes may be sorted in ascending or descending order of value. When they are sorted in ascending order, the two groups whose modularity change is ranked last are merged; when they are sorted in descending order, the two groups whose modularity change is ranked first are merged.
For example, suppose operation S222 calculates the modularity change between a and b as K1, between a and c as K2, and between a and d as K3. When the ascending order is K1 < K2 < K3, groups a and d, which correspond to the last-ranked K3, are merged; when the descending order is K3 > K2 > K1, groups a and d, which correspond to the first-ranked K3, are merged. In either case the pair with the largest modularity change is merged. Operations S2231 and S2232 facilitate merging the group and the group having an edge relationship with it according to the modularity change.
In operation S224, the merged group is used as a new group, the modularity of the new group is calculated, and the traversal of each group is repeatedly performed to determine the modularity change between the group and each group having an edge relationship with the group.
In operation S225, when the modularity change does not satisfy the set threshold, merging of the group and the group having an edge relationship with it is stopped. For example, the set threshold may be a positive number; a negative modularity change then does not satisfy the set threshold, and merging of the group and the group having an edge relationship with it may be stopped.
In operation S226, when the change in the modularity between each two groups does not satisfy the set threshold, the current n groups are regarded as n communities. It can be understood that, when every two groups in all the groups of the knowledge graph cannot be merged, the generation of the community is completed, and at this time, the current n groups may be regarded as n communities.
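The loop formed by operations S221 to S226 can be sketched as follows. This is a minimal illustration, not the disclosed implementation: it assumes unit edge weights and uses the standard gain for merging two groups, ΔQ = w/m − d_a·d_b/(2m²), where w is the number of edges between the two groups, d_a and d_b are the sums of node degrees in each group, and m is the total number of edges. The function name `greedy_communities` and the default threshold of 0 are assumptions.

```python
def greedy_communities(edges, threshold=0.0):
    """Merge groups pairwise while the best modularity change exceeds threshold.

    edges: list of (u, v) pairs of an undirected graph with unit weights.
    Returns a list of sets, one per community.
    """
    m = len(edges)
    # S221: every node starts as its own group (group label = node name).
    group_of = {n: n for edge in edges for n in edge}
    while True:
        degree, between = {}, {}
        for u, v in edges:
            gu, gv = group_of[u], group_of[v]
            degree[gu] = degree.get(gu, 0) + 1
            degree[gv] = degree.get(gv, 0) + 1
            if gu != gv:
                key = (min(gu, gv), max(gu, gv))
                between[key] = between.get(key, 0) + 1
        # S222 + S2231: modularity change for every pair of groups that share
        # an edge, sorted in descending order.
        gains = sorted(
            ((w / m - degree[a] * degree[b] / (2 * m * m), a, b)
             for (a, b), w in between.items()),
            reverse=True,
        )
        # S225/S226: stop when no pair satisfies the threshold.
        if not gains or gains[0][0] <= threshold:
            break
        # S2232 + S223: merge the first-ranked pair; S224: loop again.
        _, keep, absorb = gains[0]
        for node, group in group_of.items():
            if group == absorb:
                group_of[node] = keep
    communities = {}
    for node, group in group_of.items():
        communities.setdefault(group, set()).add(node)
    return list(communities.values())


# Two triangles joined by a single bridge edge c-d.
example_edges = [("a", "b"), ("b", "c"), ("a", "c"),
                 ("d", "e"), ("e", "f"), ("d", "f"),
                 ("c", "d")]
example_result = greedy_communities(example_edges)
```

On the two-triangle example the merging stops with two groups, because merging them across the bridge edge would give a negative modularity change.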
The modularity Q of the group can be obtained by formula (1), where m represents the number of edges between the group being traversed and other groups, Σ_in represents the total weight of the edges within the group being traversed, and Σ_tot represents the total weight of the edges incident on the target group.
Figure BDA0003835878990000121
The modularity change ΔQ of the group can be obtained by formula (2), where m represents the number of edges between the group being traversed and the other groups, k_i,in represents the sum of the weights of the edges of the group being traversed that are incident on the target group, Σ_tot represents the total weight of the edges incident on the target group, and k_i represents the total weight of the edges of the group being traversed.
Figure BDA0003835878990000122
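For orientation, the standard Louvain forms of the two quantities, written with the symbols defined above, are sketched below. This is an assumption based on the commonly used Louvain definitions, not a transcription of the formula figures. It takes m as the total edge weight of the whole graph (as in the later Louvain step description), lets Σ_in count each internal edge once per endpoint, and lets k_i,in count each edge between group i and the target group once; the figures may differ by such conventions.

```latex
% Modularity contribution of one group C (standard Louvain form, assumed):
Q_C = \frac{\Sigma_{in}}{2m} - \left(\frac{\Sigma_{tot}}{2m}\right)^{2}

% Change in modularity when an isolated group i joins the target group C:
\Delta Q
  = \left[\frac{\Sigma_{in} + 2k_{i,in}}{2m}
          - \left(\frac{\Sigma_{tot} + k_i}{2m}\right)^{2}\right]
  - \left[\frac{\Sigma_{in}}{2m}
          - \left(\frac{\Sigma_{tot}}{2m}\right)^{2}
          - \left(\frac{k_i}{2m}\right)^{2}\right]
  = \frac{k_{i,in}}{m} - \frac{\Sigma_{tot}\,k_i}{2m^{2}}
```

Under these conventions a merge is worthwhile only when k_i,in/m exceeds Σ_tot·k_i/(2m²), that is, when the two groups share more edge weight than would be expected from their total degrees alone.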
In operation S230, a community including a sensitive node among the n communities is identified as a sensitive community. Continuing with the example of the knowledge graph in fig. 6, the knowledge graph finally generates 3 communities, namely community A, community B and community C. Assuming that, of the nodes a, b, c, d, e, f, g, h, i, j, k, and l, node a and node k are marked as sensitive nodes in operation S215, community A may be identified as a sensitive community because node a is located in community A; community C may be identified as a sensitive community because node k is located in community C; community B does not contain any sensitive node and is therefore a non-sensitive community.
In operation S240, transaction log information corresponding to all nodes included in the sensitive community is identified as sensitive data. Wherein, the community A comprises the nodes a, b, c and d, therefore, the transaction log information corresponding to the nodes a, b, c and d is identified as sensitive data; nodes h, i, j, k, and l are included in the community C, and thus transaction log information corresponding to the nodes h, i, j, k, and l is identified as sensitive data.
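Operations S230 and S240 reduce to a set intersection followed by a union, as the following sketch of the fig. 6 example shows. The membership of community B (nodes e, f and g) is inferred here from the 12 nodes not assigned to communities A and C and should be read as an assumption; the function name is illustrative.

```python
communities = {
    "A": {"a", "b", "c", "d"},
    "B": {"e", "f", "g"},            # assumed: the nodes not listed for A or C
    "C": {"h", "i", "j", "k", "l"},
}
sensitive_nodes = {"a", "k"}         # marked in operation S215


def identify_sensitive(communities, sensitive_nodes):
    """Return (sensitive community names, nodes whose log data is sensitive)."""
    # S230: a community is sensitive if it contains any sensitive node.
    sensitive_communities = {
        name for name, members in communities.items()
        if members & sensitive_nodes
    }
    # S240: every node of a sensitive community maps to sensitive data.
    flagged = set()
    for name in sensitive_communities:
        flagged |= communities[name]
    return sensitive_communities, flagged


sensitive_communities, flagged_nodes = identify_sensitive(communities, sensitive_nodes)
```

The log information of nodes b, c, d, h, i, j and l is flagged even though those nodes were never labeled directly, which is how the method reaches sensitive data that has no explicit label.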
According to the sensitive data identification method of the transaction log, a knowledge graph is constructed according to transaction log information, nodes of the knowledge graph are divided according to the association degree of the nodes, n communities are generated, the communities including the sensitive nodes in the n communities are identified as sensitive communities, and the transaction log information corresponding to all the nodes included in the sensitive communities is identified as sensitive data. Even when the volume of transaction log information is huge, each application program records its logs independently, and there is no uniform format requirement, all sensitive data in the transaction log can be accurately located and the sensitive data in every application system is fully covered. Manual analysis of sensitive data is avoided, and the accuracy and efficiency of sensitive data identification are improved.
Fig. 8 schematically shows a flow chart of a method of sensitive data identification of a transaction log according to an embodiment of the present disclosure. The sensitive data identification method of the transaction log further includes operation S250.
In operation S250, desensitization processing is performed on the sensitive data and transaction log information is stored.
As one implementable manner, as shown in fig. 9, operation S250 desensitizes the sensitive data and stores the transaction log information, including operation S251 to operation S253, or operation S254 to operation S255.
In operation S251, sensitive data in the transaction log information is extracted for desensitization, and desensitization data is obtained.
In operation S252, desensitization data is recorded to the transaction log information.
In operation S253, the desensitized transaction log information is stored to the log system. Desensitization processing of sensitive data and storage of transaction log information can be facilitated through operations S251 to S253.
In operation S254, sensitive data in the transaction log information is specially marked.
In operation S255, the marked transaction log information is transmitted to a log system, where desensitization processing is performed on the transaction log information. Desensitization processing of sensitive data and storage of transaction log information can be conveniently realized through operations S254-S255 as well. Desensitizing the sensitive data and storing the transaction log information can improve the safety of the application system and ensure the information safety of customers.
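The two alternative paths (S251 to S253, and S254 to S255) can be contrasted in a short sketch. The masking rule (keep the last four characters and replace the rest with asterisks) and all function names are assumptions for illustration; the disclosure does not specify a particular desensitization rule.

```python
def mask(value, keep=4):
    """Replace all but the last `keep` characters with '*' (assumed rule)."""
    text = str(value)
    return "*" * max(len(text) - keep, 0) + text[-keep:]


def desensitize_then_store(record, sensitive_fields, log_store):
    """S251-S253: desensitize first, then store in the log system."""
    cleaned = {k: mask(v) if k in sensitive_fields else v
               for k, v in record.items()}
    log_store.append(cleaned)
    return cleaned


def mark_for_log_system(record, sensitive_fields):
    """S254: attach a special mark listing the sensitive fields."""
    return {"fields": dict(record),
            "sensitive": [k for k in record if k in sensitive_fields]}


def log_system_receive(marked, log_store):
    """S255: the log system desensitizes the marked fields, then stores."""
    cleaned = {k: mask(v) if k in marked["sensitive"] else v
               for k, v in marked["fields"].items()}
    log_store.append(cleaned)
    return cleaned


sample = {"account name": "Zhang San", "card number": "147852"}
store_a, store_b = [], []
path_a = desensitize_then_store(sample, {"card number"}, store_a)
path_b = log_system_receive(mark_for_log_system(sample, {"card number"}), store_b)
```

Both paths store the same desensitized record; they differ in where the masking work is done, on the sending side or inside the log system.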
Specifically, a consistent hashing algorithm may be used to partition the desensitization processing servers, with the predefined service nodes divided into 2^n partitions. The partition node stores a correspondence table of the unique IDs of all desensitization processing servers and their specific IP addresses. Using the consistent hashing algorithm, the desensitization processing servers are assigned to partitions 1 to 2^n. To distribute load evenly across the desensitization processing servers, a general hash function Hash(key) % 2^n is used (key is the key value of the data and 2^n is the number of servers); the result of the modulus identifies the desensitization processing server to be accessed, so each desensitization processing server handles 1/2^n of the service. Dividing the service nodes into 2^n partitions is intended to satisfy the validity of the Byzantine algorithm, so that normal error correction can be supported subsequently and data is prevented from being tampered with by bad nodes. It also gives the desensitization processing servers elastic scaling capability, supports high concurrency, and avoids slowing down transactions through the introduction of sensitive data identification.
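The routing rule Hash(key) % 2^n can be sketched as below. The sketch follows the modulo formula quoted in the text rather than a hash ring, uses SHA-256 from the standard library as a stand-in for the unspecified Hash function, and numbers partitions from 0 instead of 1. The server IDs and the 192.0.2.x documentation addresses are hypothetical.

```python
import hashlib


def partition_for(key, n):
    """Return the partition index in [0, 2**n) for a data key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # A stable digest is used because Python's built-in hash() of strings
    # changes between interpreter runs.
    return int.from_bytes(digest[:8], "big") % (2 ** n)


def build_server_table(n):
    """Hypothetical correspondence table: partition -> (server ID, IP)."""
    return {p: ("desens-%d" % p, "192.0.2.%d" % (p + 1))
            for p in range(2 ** n)}


def route(key, n, table):
    """Look up the desensitization processing server for a data key."""
    return table[partition_for(key, n)]


server_table = build_server_table(3)          # 2**3 = 8 partitions
server_id, server_ip = route("trace-1234", 3, server_table)
```

With a plain modulus, changing n remaps most keys; a hash ring with virtual nodes is the usual way to limit remapping if servers are added or removed frequently.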
In some embodiments of the present disclosure, the knowledge graph may have a data query function. When transaction log information is searched for in the knowledge graph and is found, the nodes and connecting edges corresponding to that transaction log information are displayed. When the transaction log information is searched for in the knowledge graph and is not found, operations S210 to S240 are performed again. In this way, the processing capability and control capability of the sensitive data identification device of the transaction log over sensitive data can be improved.
A sensitive data identification method of a transaction log according to an embodiment of the present disclosure is described in detail below with reference to fig. 10 to 13. It is to be understood that the following description is illustrative only and is not intended as a specific limitation of the disclosure.
The disclosure provides a sensitive data identification method of a transaction log, which is suitable for desensitization processing of semi-structured and unstructured data. The existing association relationships among the systems of the bank are associated with the transaction information, and graph computing is used to provide methods for efficiently computing, storing and managing graph data. By describing the data completely through the relationships among them, rich, efficient and agile data analysis and data service capabilities are achieved, and the problems of identifying and desensitizing sensitive information in logs are solved.
Fig. 10 shows a structure diagram of log-center encryption. The present disclosure finds sensitive information and implements desensitization processing by building a data desensitization knowledge graph: first, data acquisition, knowledge acquisition, graph modeling and knowledge search are performed on the transaction data of the application system to construct a knowledge graph for sensitive information identification; then sensitive information in the logs is identified and desensitized.
Fig. 11 shows a flow chart of the sensitive data identification method of the transaction log. The method mainly includes three parts. The first part is the processing on the application server, which mainly implements collection of the logs. The second part is the processing on the desensitization processing server, which mainly implements analysis of the log data, discovery of sensitive data, and desensitization of the sensitive data. The third part is the processing related to log storage.
1. Application server: mainly implements collection and centralized management of application transaction data.
According to the format and content requirements defined for the transaction log, the system process handling a transaction can acquire and record key information in the transaction communication area.
The transaction log generally includes the transaction unique identifier and key transaction elements, such as the transaction tracking number (traceid), the transaction channel event number, and the card number. It also includes the transaction input and the transaction output, which are often transaction message information assembled by upstream or downstream systems. The format of these messages is defined by the upstream and downstream applications and their structure is not fixed, but the log includes at least the information in table 1.
TABLE 1
Figure BDA0003835878990000151
The transaction log can be uploaded to a message queue by a log acquisition module running as an asynchronous process. Flume can be used for the log acquisition module, and the message queue can be implemented with the Kafka framework.
2. Desensitization processing server: mainly implements sensitive information identification and desensitization processing of transaction log data.
The message queue holds the transaction log information. To identify and desensitize sensitive information, the transaction log information in the message queue is subscribed to and consumed in batch mode to obtain the transaction logs.
Fig. 12 shows the knowledge representation and edge attributes of the knowledge graph. Knowledge acquisition implements structured processing of the data: unstructured data from different sources and different interfaces is cleaned and aligned to form knowledge elements such as entities, relationships, and events/attributes. A feasible approach to knowledge acquisition is to establish the data schema of the knowledge graph, which defines the structure of the whole knowledge graph.
The knowledge representation includes knowledge combing and Schema definition. The definitions are provided by business experts based on experience or are identified from transaction data. A KG-based Schema definition routing data model is used to describe the routing knowledge graph data for the mass data of multiple data producers. It takes the Resource Description Framework (RDF) defined by the World Wide Web Consortium (W3C) as the basic protocol and uses the JSON-LD standard as the description language. The knowledge representation includes the following elements:
1) Class: the category of an entity; each Class defines one type of entity.
2) Property (attribute): describes an entity as it appears in different data sources, forming an all-round description of the entity.
3) Relationship: describes the associations between the entities into which the various kinds of data are abstractly modeled, thereby supporting association analysis.
4) Constraint: a constraint on an attribute under a particular class, which gives the attribute polymorphic and overloaded characteristics.
5) Datatype (data type): describes the data type of an attribute.
6) SubClass (superordinate and subordinate concepts): describes the superordinate and subordinate conceptual relationships between Classes.
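The elements above can be illustrated with a small schema fragment, built here as a Python dictionary and serialized as JSON-LD. The `rdf`, `rdfs` and `xsd` prefixes are the standard W3C vocabularies; the `tx` namespace and every term under it (`Card`, `Account`, `cardNumber`, `holdsCard`, `PaymentInstrument`) are hypothetical names chosen for this sketch and are not defined in the disclosure.

```python
import json

schema = {
    "@context": {
        "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
        "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
        "xsd": "http://www.w3.org/2001/XMLSchema#",
        "tx": "http://example.org/txlog#",       # hypothetical namespace
    },
    "@graph": [
        # Class and SubClass: Card is a kind of PaymentInstrument.
        {"@id": "tx:PaymentInstrument", "@type": "rdfs:Class"},
        {"@id": "tx:Card", "@type": "rdfs:Class",
         "rdfs:subClassOf": {"@id": "tx:PaymentInstrument"}},
        {"@id": "tx:Account", "@type": "rdfs:Class"},
        # Property with a Constraint (domain) and a Datatype (range).
        {"@id": "tx:cardNumber", "@type": "rdf:Property",
         "rdfs:domain": {"@id": "tx:Card"},
         "rdfs:range": {"@id": "xsd:string"}},
        # Relationship between two classes of entity.
        {"@id": "tx:holdsCard", "@type": "rdf:Property",
         "rdfs:domain": {"@id": "tx:Account"},
         "rdfs:range": {"@id": "tx:Card"}},
    ],
}

document = json.dumps(schema, indent=2)
```

Expressing the domain of `tx:cardNumber` as `tx:Card` is one way to model "a constraint on an attribute under a particular class"; other modeling choices are equally possible.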
Graph modeling is the core of the scheme and a key technical link in sensitive information identification. A sensitive information knowledge graph is constructed: after the transaction log data is processed, the knowledge graph is built from the transaction link relationships and the metadata of the transaction log, and is used to analyze the log-center data and find sensitive information. The graph modeling method is shown in fig. 13 and specifically includes S1 to S3.
S1, constructing data relationship groups: this mainly implements construction of the training model and dimension reduction. The implementation of this module supports, but is not limited to, community discovery algorithms such as Louvain, FN and GN, with the Louvain algorithm preferred. The Louvain community discovery algorithm is based on modularity. Its basic idea is that each node in the network tries the available community labels and takes the label that maximizes the modularity increment; once modularity is maximized, each community is regarded as a new node, and the steps are repeated until modularity no longer increases. The data relationship groups are constructed as follows:
Step one: initialization, defining modularity:
Figure BDA0003835878990000171
where Q represents the modularity, Σ_in represents the weight inside community C, and Σ_tot represents the weight of the edges linked to community C, including edges both inside and outside the community. In the initialization stage, each node in the graph is regarded as an independent data relationship group, so the number of groups equals the number of nodes, and the weights of all edges are taken to be 1.
Step two: nodes are transferred between groups. For each node i, an attempt is made to assign it in turn to the group of each of its neighbor nodes. Assuming there are three neighbor nodes j1, j2 and j3, node i is tentatively moved to the communities of j1, j2 and j3, the corresponding modularity change is calculated for each, and node i is moved to the community for which the change is largest. The modularity change is:
Figure BDA0003835878990000172
where m denotes the number of edges in the network, k_i,in represents the sum of the weights of the edges from node i incident on group C, and Σ_tot represents the total weight of the edges incident on group C.
Step three: iterate step two, continuing to evaluate node transfers between groups until the modularity of the groups to which all nodes belong no longer changes, that is, until node transfer between groups is finished.
Step four: all nodes in each community are reconstructed into a new node. The weights of the edges between nodes within the group become the weight of the self-loop of the new node, and the weights of the edges between groups become the weights of the edges between the new nodes.
Step five: repeat steps one, two and three until the algorithm is stable.
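The evaluation in step two can be sketched as follows, assuming unit edge weights and the standard Louvain quantities: Σ_in counts each internal edge once per endpoint, Σ_tot is the sum of node degrees in a group, and the gain for moving an isolated node i into group C is k_i,in/m − Σ_tot·k_i/(2m²). The function names are illustrative, and the sketch covers only the move of a node that is currently alone in its group.

```python
def modularity(edges, group_of):
    """Sum over groups of sigma_in/(2m) - (sigma_tot/(2m))**2, unit weights."""
    m = len(edges)
    sigma_in, sigma_tot = {}, {}
    for u, v in edges:
        gu, gv = group_of[u], group_of[v]
        sigma_tot[gu] = sigma_tot.get(gu, 0) + 1
        sigma_tot[gv] = sigma_tot.get(gv, 0) + 1
        if gu == gv:
            sigma_in[gu] = sigma_in.get(gu, 0) + 2
    return sum(sigma_in.get(g, 0) / (2 * m) - (tot / (2 * m)) ** 2
               for g, tot in sigma_tot.items())


def move_gain(edges, group_of, node, target):
    """Modularity change when the isolated `node` joins group `target`."""
    m = len(edges)
    k_i = sum((u == node) + (v == node) for u, v in edges)
    k_i_in = sum(1 for u, v in edges
                 if (u == node and group_of[v] == target)
                 or (v == node and group_of[u] == target))
    sigma_tot = sum((group_of[u] == target) + (group_of[v] == target)
                    for u, v in edges)
    return k_i_in / m - sigma_tot * k_i / (2 * m * m)


# Two triangles joined by the bridge c-d; node c is still alone in its group.
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"), ("c", "d")]
before = {"a": "A", "b": "A", "c": "c", "d": "D", "e": "D", "f": "D"}
gain_to_a = move_gain(edges, before, "c", "A")
gain_to_d = move_gain(edges, before, "c", "D")
after = dict(before, c="A")
```

Node c gains more by joining group A than group D, so step two would move it to A; the gain returned by `move_gain` equals the difference between the modularity of `after` and of `before`.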
S2, identifying suspicious relationship data: sensitive information in semi-structured or unstructured data is identified by taking the result of constructing the data relationship groups, combining it with the structured data table structures of each application system, and identifying sensitive information through metadata. The specific method for identifying suspicious relationship data is as follows:
Step one: according to the database table structure definitions of each application system, all table structure information is collected as data nodes. The knowledge identification of a node includes at least the generating application name, the field name, whether metadata exists, the metadata name, and so on. Where nodes involve sensitive information such as user names, certificate numbers and validity periods, the corresponding nodes are identified as sensitive information in the knowledge graph; that is, the corresponding records are found in the graph database and flagged as sensitive information, which facilitates data analysis.
Step two: based on the metadata nodes and the result of constructing the data relationship groups, the data nodes carrying sensitive information are found throughout the knowledge graph and the full set of sensitive information nodes is constructed. The specific implementation is as follows:
A. Initialize all nodes, treating each node as a single data node.
B. Determine whether the upstream node of the current node is a data node with a suspicious relationship: based on the sensitive information label of the downstream data node, judge whether the upstream node involves sensitive information.
C. Update the sensitive information of each new data link node or data node, and repeat step B until no new upstream data node can be found.
D. If the metadata information of an upstream node cannot be fully found, identify that node as suspicious data, to be manually screened and labeled for sensitive information.
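Steps A to D amount to walking the data links upstream from the nodes already labeled sensitive. The sketch below is one possible reading, assuming edges are given as (upstream, downstream) pairs and that a hypothetical `has_metadata` lookup says whether a node's metadata is complete. Upstream nodes with metadata are labeled sensitive, and those without are set aside as suspicious for manual review.

```python
from collections import deque


def propagate_upstream(edges, sensitive_seed, has_metadata):
    """Return (sensitive, suspicious) node sets after upstream propagation.

    edges: iterable of (upstream, downstream) pairs.
    sensitive_seed: nodes already labeled sensitive from metadata.
    has_metadata: dict mapping node -> True if its metadata is complete.
    """
    upstream_of = {}
    for up, down in edges:
        upstream_of.setdefault(down, set()).add(up)

    sensitive, suspicious = set(sensitive_seed), set()
    queue = deque(sensitive_seed)
    while queue:                                  # step C: repeat step B
        node = queue.popleft()
        for up in upstream_of.get(node, ()):      # step B: inspect upstream
            if up in sensitive or up in suspicious:
                continue
            if has_metadata.get(up, False):
                sensitive.add(up)
            else:
                suspicious.add(up)                # step D: manual review
            queue.append(up)
    return sensitive, suspicious


links = [("msg_in", "parsed"), ("parsed", "card_number"), ("other", "amount")]
found_sensitive, found_suspicious = propagate_upstream(
    links, {"card_number"}, {"parsed": True, "msg_in": False})
```

The loop ends when no new upstream node is found, matching the stopping condition of step C; nodes that are not upstream of any sensitive node are left untouched.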
S3, sensitive information identification: knowledge is extracted from the information in the transaction log and used as the query condition for the knowledge graph. If a matching record is found in the current knowledge database, whether that record is sensitive information is determined directly. If no matching record is found, a new database record is created as a new data node, and the sensitive information of the newly added data node is identified from its attributes and edges through the suspicious relationship data identification step.
Sensitive information query provides the sensitive information query service: it receives requests from transactions and returns query results. Preferably, a consistent hashing algorithm can be used to partition the desensitization processing servers, with the predefined service nodes divided into 2^n partitions. The partition node stores a correspondence table of the unique IDs of all desensitization processing servers and their specific IP addresses. Using the consistent hashing algorithm, the desensitization processing servers are assigned to partitions 1 to 2^n. To distribute load evenly across the desensitization processing servers, a general hash function Hash(key) % 2^n is used (key is the key value of the data and 2^n is the number of servers); the result of the modulus identifies the desensitization processing server to be accessed, so each desensitization processing server handles 1/2^n of the service. Dividing the service nodes into 2^n partitions is intended to satisfy the validity of the Byzantine algorithm, so that normal error correction can be supported subsequently and data is prevented from being tampered with by bad nodes. It also gives the desensitization processing servers elastic scaling capability, supports high concurrency, and avoids slowing down transactions through the introduction of sensitive data identification.
Sensitive information desensitization: to prevent private information from being leaked, sensitive information in the transaction log needs to be desensitized before it is stored in the log system. Two modes are supported: desensitizing the sensitive information before logging and then writing it to the log file, or specially marking the sensitive information fields in the transaction log and desensitizing them uniformly after they are sent to the log system. In particular, the open source tool Flink may be introduced so that log desensitization relies on Flink's strong data throughput and computing capacity: log information is submitted to Flink as a data stream, Flink processes the data stream in real time according to the configured log desensitization rules, and the processed data stream is recorded in the log file. Log desensitization is thereby switched from a synchronous waiting mode to an asynchronous mode, which reduces the resources that log desensitization occupies on the service system and improves the response time and processing capacity of the system.
The present disclosure has the following beneficial effects:
1. The disclosure provides a method for identifying sensitive information in semi-structured data, which can accurately find the sensitive information across the full transaction link through a knowledge graph and avoid leakage of the sensitive information.
2. By constructing the training model, the capability to identify sensitive information automatically is improved, and the cost of maintaining sensitive information manually is avoided.
3. A closed-loop flow covering sensitive information identification, sensitive information query and sensitive information desensitization is provided through a distributed system, which improves the system's processing capability and control capability over sensitive information.
Based on the sensitive data identification method of the transaction log, the disclosure further provides a sensitive data identification device 10 of the transaction log. The sensitive data identification device 10 for transaction logs will be described in detail below in conjunction with fig. 14-19.
Fig. 14 schematically shows a block diagram of the structure of the sensitive data identification device 10 of the transaction log according to an embodiment of the present disclosure.
The sensitive data identification device 10 of the transaction log comprises a construction module 1, a generation module 2, a first identification module 3 and a second identification module 4.
The construction module 1 is configured to perform operation S210: constructing a knowledge graph according to the acquired transaction log information, wherein the transaction log information comprises the transaction unique identifier, transaction information, transaction message information, and the conversion relationship between the transaction information and the transaction message information; the nodes of the knowledge graph are constructed according to the transaction information and the transaction message information; the connecting edges of the knowledge graph are constructed according to the transaction unique identifier and the conversion relationship; and some of the nodes are marked as sensitive nodes.
The generation module 2 is configured to perform operation S220: dividing the nodes of the knowledge graph according to the association degree of the nodes to generate n communities, where n is an integer greater than or equal to 1.
The first identification module 3 is configured to perform operation S230: identifying the communities containing sensitive nodes among the n communities as sensitive communities.
The second identification module 4 is configured to perform operation S240: identifying the transaction log information corresponding to all nodes contained in the sensitive communities as sensitive data.
Fig. 15 schematically shows a block diagram of the construction module 1 according to an embodiment of the present disclosure. The construction module 1 comprises an acquisition unit 11, a processing unit 12, a first labeling unit 13, a construction unit 14, and a second labeling unit 15.
The acquisition unit 11 is configured to acquire the transaction log information of the transaction log.
The processing unit 12 is configured to structure the transaction log information to obtain field names and the field values under the field names, where the field values are obtained by conversion according to the transaction unique identifier, the transaction information, the transaction message information, and the conversion relationship.
The first labeling unit 13 is configured to label some of the field names as sensitive information.
The construction unit 14 is configured to construct the knowledge graph according to the field names and the field values under the field names.
The second labeling unit 15 is configured to label the nodes constructed from the field values corresponding to the sensitive information as sensitive nodes.
Fig. 16 schematically shows a block diagram of the structure of the generation module 2 according to an embodiment of the present disclosure. The generation module 2 includes a calculation unit 21, a first determination unit 22, a merging unit 23, a second determination unit 24, a stop unit 25, and a third determination unit 26.
The calculation unit 21 is configured to take each node in the knowledge graph as a group and calculate the modularity of the group.
The first determination unit 22 is configured to traverse each group and determine the modularity change between the group and each group that has an edge relationship with it.
The merging unit 23 is configured to merge the group with a group that has an edge relationship with it, according to the modularity change, when the modularity change satisfies the set threshold.
The second determination unit 24 is configured to take the merged group as a new group, calculate the modularity of the new group, and repeat the traversal of each group and the determination of the modularity change between the group and each group that has an edge relationship with it.
The stop unit 25 is configured to stop merging the group with groups that have an edge relationship with it when the modularity change does not satisfy the set threshold.
The third determination unit 26 is configured to take the current n groups as the n communities when the modularity change between every two groups does not satisfy the set threshold.
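The grouping procedure carried out by units 21 to 26 can be sketched in Python as a greedy agglomerative loop. This is a minimal sketch, not the disclosed implementation: it assumes an unweighted, undirected graph, uses the standard modularity gain for merging two groups, dQ = L_ab / m - d_a * d_b / (2 * m^2) (with L_ab the number of edges between the groups, d the degree sums, and m the total edge count), and reads "satisfies the set threshold" as "gain greater than `threshold`". The function name `detect_communities` is hypothetical.

```python
from itertools import combinations


def detect_communities(edges, threshold=0.0):
    """Greedy agglomerative grouping driven by modularity gain.

    edges: iterable of (u, v) pairs of an undirected, unweighted graph.
    threshold: a merge happens only when its modularity gain exceeds this
    value (an assumed reading of the "set threshold").
    """
    edges = [tuple(e) for e in edges]
    m = len(edges)
    nodes = {n for e in edges for n in e}
    degree = {n: 0 for n in nodes}
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    # Start with every node as its own group.
    groups = [frozenset([n]) for n in sorted(nodes, key=repr)]

    def gain(a, b):
        # Number of edges running between group a and group b.
        links = sum(1 for u, v in edges if (u in a and v in b) or (u in b and v in a))
        if links == 0:
            return None  # no edge relationship, so not a merge candidate
        d_a = sum(degree[n] for n in a)
        d_b = sum(degree[n] for n in b)
        return links / m - d_a * d_b / (2 * m * m)

    while True:
        candidates = []
        for a, b in combinations(groups, 2):
            g = gain(a, b)
            if g is not None and g > threshold:
                candidates.append((g, a, b))
        if not candidates:
            # No pair satisfies the threshold: the current groups are the communities.
            return groups
        _, a, b = max(candidates, key=lambda c: c[0])
        groups = [g for g in groups if g not in (a, b)] + [a | b]


# Two triangles joined by a single edge.
demo_edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
communities = detect_communities(demo_edges)
```

On the demo graph the loop ends with the two triangles as separate communities, because merging them across the single bridging edge would lower the modularity.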
Fig. 17 schematically shows a block diagram of the merging unit 23 according to an embodiment of the present disclosure. The merging unit 23 comprises a sorting component 231 and a merging component 232.
The sorting component 231 is configured to sort the modularity changes by numerical value.
The merging component 232 is configured to merge, according to the sorting result, the two groups whose modularity change is ranked first or last.
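One way to read "ranked first or last" is that the pair with the largest modularity change is selected, which sits first in a descending sort and last in an ascending sort. The short Python sketch below illustrates that reading only; the `pick_merge` helper and the tuple layout are hypothetical.

```python
def pick_merge(changes, descending=True):
    """Select the pair of groups to merge from sorted modularity changes.

    changes: list of (group_a, group_b, delta_q) tuples.
    With a descending sort the first entry is taken; with an ascending
    sort the last entry is taken. Both are the largest modularity change.
    """
    ranked = sorted(changes, key=lambda c: c[2], reverse=descending)
    best = ranked[0] if descending else ranked[-1]
    return best[0], best[1]


changes = [("A", "B", 0.05), ("B", "C", 0.12), ("C", "D", 0.08)]
pair_desc = pick_merge(changes, descending=True)
pair_asc = pick_merge(changes, descending=False)
```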
Fig. 18 schematically shows a block diagram of the structure of the sensitive data identification device 10 of the transaction log according to an embodiment of the present disclosure. The sensitive data identification device 10 of the transaction log further comprises a desensitization processing module 5.
The desensitization processing module 5 is configured to desensitize the sensitive data and store the transaction log information.
Fig. 19 schematically shows a block diagram of the desensitization processing module 5 according to an embodiment of the present disclosure. The desensitization processing module 5 includes a first desensitization unit 51, a recording unit 52, and a storage unit 53, or the desensitization processing module 5 includes a special mark unit 54 and a second desensitization unit 55.
The first desensitization unit 51 is configured to extract the sensitive data from the transaction log information and desensitize it to obtain desensitized data.
The recording unit 52 is configured to record the desensitized data in the transaction log information.
The storage unit 53 is configured to store the desensitized transaction log information to the log system. Or
The special marking unit 54 is configured to specially mark the sensitive data in the transaction log information.
The second desensitization unit 55 is configured to send the marked transaction log information to a log system, where the transaction log information is then desensitized.
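The two alternatives handled by the desensitization processing module 5 can be contrasted with a small Python sketch. The masking rule (keep the last four characters), the `[SENSITIVE]` tag, and all function names are assumptions chosen for illustration; the disclosure does not specify a desensitization algorithm or a marking format.

```python
def mask(value, keep=4):
    """Replace all but the last `keep` characters with '*'."""
    text = str(value)
    if len(text) <= keep:
        return "*" * len(text)
    return "*" * (len(text) - keep) + text[-keep:]


def desensitize_before_storage(record, sensitive_fields):
    """Alternative 1: mask sensitive values before the record is stored."""
    return {k: mask(v) if k in sensitive_fields else v for k, v in record.items()}


def mark_for_log_system(record, sensitive_fields, tag="[SENSITIVE]"):
    """Alternative 2: tag sensitive values so the log system can mask them later."""
    return {k: f"{tag}{v}" if k in sensitive_fields else v for k, v in record.items()}


record = {"event": "transfer", "card_number": "6222000011112222"}
stored = desensitize_before_storage(record, {"card_number"})
marked = mark_for_log_system(record, {"card_number"})
```

The first alternative keeps raw sensitive values out of the log system entirely, while the second defers masking to the log system and relies on the marker surviving transport.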
According to the sensitive data identification device 10 of the transaction log in the embodiment of the disclosure, a knowledge graph is constructed according to transaction log information, the nodes of the knowledge graph are divided according to the association degree of the nodes to generate n communities, the communities among the n communities that contain sensitive nodes are identified as sensitive communities, and the transaction log information corresponding to all the nodes contained in the sensitive communities is identified as sensitive data. Even when the volume of transaction log information is huge, each application program records its transaction log information by itself, and there is no uniform format requirement, all sensitive data in the transaction log can be accurately located and the sensitive data in every application system is fully covered. Manual analysis of sensitive data is avoided, and the accuracy and efficiency of sensitive data identification are improved.
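The final identification step summarized above reduces to a set intersection, as the following Python sketch suggests. It assumes communities are given as sets of node labels and that sensitive nodes were labeled beforehand; the `identify_sensitive_data` helper and the node labels are hypothetical.

```python
def identify_sensitive_data(communities, sensitive_nodes):
    """Return the sensitive communities and the union of their nodes.

    A community is sensitive when it contains at least one sensitive node;
    every node in a sensitive community is then treated as sensitive data.
    """
    sensitive_communities = [c for c in communities if c & sensitive_nodes]
    sensitive_data = set().union(*sensitive_communities)
    return sensitive_communities, sensitive_data


communities = [
    frozenset({"card_no", "card_no_msg"}),
    frozenset({"event_name", "event_name_msg"}),
]
found, data = identify_sensitive_data(communities, {"card_no"})
```

Note that `card_no_msg` is reported as sensitive even though it was never labeled, which is the point of propagating sensitivity through the community.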
In addition, according to the embodiment of the present disclosure, any plurality of the building module 1, the generating module 2, the first identifying module 3, and the second identifying module 4 may be combined and implemented in one module, or any one of them may be split into a plurality of modules. Alternatively, at least part of the functionality of one or more of these modules may be combined with at least part of the functionality of the other modules and implemented in one module.
According to an embodiment of the present disclosure, at least one of the building module 1, the generating module 2, the first identifying module 3, and the second identifying module 4 may be implemented at least partially as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, an Application Specific Integrated Circuit (ASIC), or may be implemented in hardware or firmware by any other reasonable manner of integrating or packaging a circuit, or in any one of three implementations of software, hardware, and firmware, or in a suitable combination of any of them.
Alternatively, at least one of the building module 1, the generating module 2, the first identifying module 3 and the second identifying module 4 may be at least partially implemented as a computer program module, which when executed, may perform a corresponding function.
Fig. 20 schematically illustrates a block diagram of an electronic device adapted to implement the above-described method according to an embodiment of the present disclosure.
As shown in fig. 20, an electronic apparatus 900 according to an embodiment of the present disclosure includes a processor 901 which can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 902 or a program loaded from a storage portion 908 into a Random Access Memory (RAM) 903. Processor 901 may comprise, for example, a general purpose microprocessor (e.g., a CPU), an instruction set processor and/or associated chipset, and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), among others. The processor 901 may also include on-board memory for caching purposes. The processor 901 may comprise a single processing unit or a plurality of processing units for performing the different actions of the method flows according to embodiments of the present disclosure.
In the RAM 903, various programs and data necessary for the operation of the electronic apparatus 900 are stored. The processor 901, the ROM 902, and the RAM 903 are connected to each other through a bus 904. The processor 901 performs various operations of the method flows according to the embodiments of the present disclosure by executing programs in the ROM 902 and/or the RAM 903. Note that the programs may also be stored in one or more memories other than the ROM 902 and the RAM 903. The processor 901 may also perform various operations of the method flows according to embodiments of the present disclosure by executing programs stored in the one or more memories.
According to an embodiment of the present disclosure, the electronic device 900 may also include an input/output (I/O) interface 905, which is also connected to the bus 904. The electronic device 900 may also include one or more of the following components connected to the I/O interface 905: an input section 906 including a keyboard, a mouse, and the like; an output section 907 including components such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage section 908 including a hard disk and the like; and a communication section 909 including a network interface card such as a LAN card, a modem, or the like. The communication section 909 performs communication processing via a network such as the internet. A drive 910 is also connected to the I/O interface 905 as necessary. A removable medium 911 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 910 as necessary so that a computer program read out therefrom is installed into the storage section 908 as necessary.
The present disclosure also provides a computer-readable storage medium, which may be contained in the apparatus/device/system described in the above embodiments; or may exist separately and not be assembled into the device/apparatus/system. The computer-readable storage medium carries one or more programs which, when executed, implement the method according to an embodiment of the disclosure.
According to embodiments of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium, which may include, for example but is not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. For example, according to embodiments of the present disclosure, a computer-readable storage medium may include the ROM 902 and/or the RAM 903 described above and/or one or more memories other than the ROM 902 and the RAM 903.
Embodiments of the present disclosure also include a computer program product comprising a computer program containing program code for performing the method illustrated by the flow chart. The program code is for causing a computer system to carry out the methods of the embodiments of the disclosure when the computer program product is run on the computer system.
The computer program performs the above-described functions defined in the system/apparatus of the embodiments of the present disclosure when executed by the processor 901. The systems, apparatuses, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the present disclosure.
In one embodiment, the computer program may be hosted on a tangible storage medium such as an optical storage device, a magnetic storage device, or the like. In another embodiment, the computer program may also be transmitted in the form of a signal over a network medium, distributed, and downloaded and installed via the communication section 909 and/or installed from the removable medium 911. The computer program containing program code may be transmitted using any suitable network medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 909, and/or installed from the removable medium 911. The computer program, when executed by the processor 901, performs the above-described functions defined in the system of the embodiment of the present disclosure. The systems, devices, apparatuses, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the present disclosure.
In accordance with embodiments of the present disclosure, program code for executing computer programs provided by embodiments of the present disclosure may be written in any combination of one or more programming languages, and in particular, these computer programs may be implemented using high level procedural and/or object oriented programming languages, and/or assembly/machine languages. The programming languages include, but are not limited to, Java, C++, Python, the "C" language, or the like. The program code may execute entirely on the user computing device, partly on the user device, partly on a remote computing device, or entirely on the remote computing device or server. In situations involving remote computing devices, the remote computing devices may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to external computing devices (e.g., through the internet using an internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
It will be appreciated by those skilled in the art that the features recited in the various embodiments of the disclosure and/or the claims may be combined in various ways even if such combinations are not explicitly recited in the disclosure. In particular, the features recited in the various embodiments and/or claims of the present disclosure may be combined in various ways without departing from the spirit or teaching of the present disclosure. All such combinations are within the scope of the present disclosure.
The embodiments of the present disclosure have been described above. However, these examples are for illustrative purposes only and are not intended to limit the scope of the present disclosure. Although the embodiments are described separately above, this does not mean that the measures in the embodiments cannot be used in advantageous combination. The scope of the disclosure is defined by the appended claims and equivalents thereof. Various alternatives and modifications can be devised by those skilled in the art without departing from the scope of the disclosure, and these alternatives and modifications are intended to fall within the scope of the disclosure.

Claims (12)

1. A method for identifying sensitive data in a transaction log, comprising:
constructing a knowledge graph according to the acquired transaction log information, wherein the transaction log information comprises transaction unique identification, transaction information, transaction message information and a conversion relation between the transaction information and the transaction message information, nodes of the knowledge graph are constructed according to the transaction information and the transaction message information, connecting edges of the knowledge graph are constructed according to the transaction unique identification and the conversion relation, and parts of the nodes are marked as sensitive nodes;
dividing the nodes of the knowledge graph according to the association degree of the nodes to generate n communities, wherein n is an integer greater than or equal to 1;
identifying a community containing the sensitive node in the n communities as a sensitive community; and
identifying the transaction log information corresponding to all nodes contained in the sensitive community as sensitive data.
2. The method of claim 1, wherein constructing a knowledge graph from the obtained transaction log information comprises:
acquiring transaction log information of a transaction log;
structuring the transaction log information to obtain a field name and a field value under the field name, wherein the field value is obtained by conversion according to the transaction unique identifier, the transaction information, the transaction message information and the conversion relation;
marking parts in the field names as sensitive information;
constructing a knowledge graph according to the field names and field values under the field names; and
marking the nodes constructed from the field values corresponding to the sensitive information as sensitive nodes.
3. The method of claim 1, wherein the partitioning the nodes of the knowledge-graph according to the association degrees of the nodes to generate n communities comprises:
taking each node in the knowledge graph as a group, and calculating the modularity of the group;
traversing each group, and determining the modularity change between the group and each group having an edge relationship with the group;
when the modularity change meets a set threshold, merging the group and the group with the edge relation according to the modularity change;
taking the merged group as a new group, calculating the modularity of the new group, repeatedly executing the traversal of each group, and determining the modularity change between the group and each group with an edge relation with the group;
when the modularity change does not meet the set threshold, stopping merging the group and the group with the edge relation; and
when the modularity change between every two groups does not meet the set threshold, taking the current n groups as the n communities.
4. The method of claim 3, wherein merging the group and the group having an edge relationship with the group according to the modularity variation comprises:
sorting the modularity changes according to the magnitude of the numerical values; and
merging, according to the sorting result, the two groups whose modularity change is ranked first or last.
5. The method of claim 1,
the transaction unique identification comprises an event unique code;
the transaction information comprises at least one of an event name, a user name, a card number, an identity card number, an address and a transaction amount;
the transaction message information comprises at least one of an event unique code message converted from the event unique code, an event name message converted from the event name, a user name message converted from the user name, a card number message converted from the card number, an identity card number message converted from the identity card number, an address message converted from the address and a transaction amount message converted from the transaction amount.
6. The method of claim 1, further comprising: desensitizing the sensitive data and storing the transaction log information.
7. The method of claim 6, wherein desensitizing the sensitive data and storing the transaction log information comprises:
extracting the sensitive data from the transaction log information for desensitization to obtain desensitized data;
recording the desensitized data in the transaction log information; and
storing the desensitized transaction log information to a log system; or
specially marking the sensitive data in the transaction log information; and
sending the marked transaction log information to a log system, and desensitizing the transaction log information in the log system.
8. The method according to claim 1, wherein the knowledge graph has a data query function, and when transaction log information is queried in the knowledge graph, the nodes and edges corresponding to the queried transaction log information are displayed.
9. An apparatus for identifying sensitive data of a transaction log, comprising:
a construction module, configured to construct a knowledge graph according to acquired transaction log information, wherein the transaction log information comprises transaction unique identification, transaction information, transaction message information and a conversion relation between the transaction information and the transaction message information, nodes of the knowledge graph are constructed according to the transaction information and the transaction message information, connecting edges of the knowledge graph are constructed according to the transaction unique identification and the conversion relation, and parts of the nodes are marked as sensitive nodes;
a generating module, configured to divide the nodes of the knowledge graph according to the association degree of the nodes to generate n communities, wherein n is an integer greater than or equal to 1;
a first identification module, configured to identify a community containing the sensitive node among the n communities as a sensitive community; and
a second identification module, configured to identify the transaction log information corresponding to all nodes contained in the sensitive community as sensitive data.
10. An electronic device, comprising:
one or more processors;
one or more memories for storing executable instructions that, when executed by the processor, implement the method of any one of claims 1-8.
11. A computer-readable storage medium, characterized in that the storage medium has stored thereon executable instructions which, when executed by a processor, implement the method according to any one of claims 1 to 8.
12. A computer program product, comprising a computer program comprising one or more executable instructions which, when executed by a processor, implement the method according to any one of claims 1 to 8.
CN202211092172.7A 2022-09-07 2022-09-07 Sensitive data identification method, apparatus, electronic device, medium, and program product Pending CN115795525A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211092172.7A CN115795525A (en) 2022-09-07 2022-09-07 Sensitive data identification method, apparatus, electronic device, medium, and program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211092172.7A CN115795525A (en) 2022-09-07 2022-09-07 Sensitive data identification method, apparatus, electronic device, medium, and program product

Publications (1)

Publication Number Publication Date
CN115795525A true CN115795525A (en) 2023-03-14

Family

ID=85431746

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211092172.7A Pending CN115795525A (en) 2022-09-07 2022-09-07 Sensitive data identification method, apparatus, electronic device, medium, and program product

Country Status (1)

Country Link
CN (1) CN115795525A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116663777A (en) * 2023-06-05 2023-08-29 重庆翰海睿智大数据科技股份有限公司 Experimental training system and method based on knowledge graph

Similar Documents

Publication Publication Date Title
Balouek-Thomert et al. Towards a computing continuum: Enabling edge-to-cloud integration for data-driven workflows
US9069880B2 (en) Prediction and isolation of patterns across datasets
US10360394B2 (en) System and method for creating, tracking, and maintaining big data use cases
CN110807129B (en) Method and device for generating multi-layer user relation graph set and electronic equipment
CN111078776A (en) Data table standardization method, device, equipment and storage medium
CN115795525A (en) Sensitive data identification method, apparatus, electronic device, medium, and program product
CN114049089A (en) Method and system for constructing government affair big data platform
Salih et al. Data quality issues in big data: a review
CN111209403A (en) Data processing method, device, medium and electronic equipment
Das et al. LYRIC: Deadline and budget aware spatio-temporal query processing in cloud
WO2022111148A1 (en) Metadata indexing for information management
CN113869904B (en) Suspicious data identification method, device, electronic equipment, medium and computer program
CN115292516A (en) Block chain-based distributed knowledge graph construction method, device and system
US20220036006A1 (en) Feature vector generation for probabalistic matching
Janev Chapter 1 Ecosystem of Big Data
CN114493853A (en) Credit rating evaluation method, credit rating evaluation device, electronic device and storage medium
Zhong et al. Big data workloads drawn from real-time analytics scenarios across three deployed solutions
CN116401319B (en) Data synchronization method and device, electronic equipment and computer readable storage medium
Bellini et al. Internet 4 Things
CN115604000B (en) Override detection method, device, equipment and storage medium
CN114219053B (en) User position information processing method and device and electronic equipment
Nandhini et al. Big Data with Data Mining
CN117574268A (en) User group construction method, device, equipment, medium and program product
CN114546768A (en) Multi-source heterogeneous log data processing method, device, equipment and medium
Hohwald et al. ARBUD: A reusable architecture for building user models from massive datasets

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination