CN113553846A - Method, device, equipment and medium for processing unstructured data - Google Patents

Info

Publication number
CN113553846A
Authority
CN
China
Prior art keywords
word
privacy
sensitive
data
unstructured data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010331678.3A
Other languages
Chinese (zh)
Inventor
朱天清
朱运丽
霍正聃
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Cloud Computing Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Priority to CN202010331678.3A
Priority to PCT/CN2021/075680 (WO2021212968A1)
Publication of CN113553846A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Artificial Intelligence (AREA)
  • Bioethics (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a method for processing unstructured data. The method comprises: performing word segmentation on the unstructured data to obtain a word segmentation result; determining the weight of each sensitive word in the word segmentation result; determining the weight of each non-sensitive word in the word segmentation result according to the similarity between the non-sensitive word and a privacy data attribute; and determining the privacy degree of the unstructured data from the weights of the sensitive words and the weights of the non-sensitive words. Because non-sensitive words that are contextually related to the sensitive words are also considered, the method grades the privacy degree with high accuracy, and privacy protection processing performed on this basis achieves a good privacy protection effect.

Description

Method, device, equipment and medium for processing unstructured data
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a method, an apparatus, a device, and a computer-readable storage medium for processing unstructured data.
Background
With the advent of the information age, data is growing explosively. Data can be divided into structured data and unstructured data. Structured data is data that is logically represented and implemented using a table structure, has a particular data format, and is typically stored and managed using a relational database. Privacy protection mechanisms for structured data are quite mature. Unstructured data, however, cannot be represented with a uniform structure, which makes its privacy protection difficult.
The industry has proposed some privacy protection methods for unstructured data. For example, one method grades the privacy degree of the unstructured data according to the proportion of characters that belong to private data, or the proportion of words that are private data, and then applies a privacy protection mechanism corresponding to the privacy level, such as desensitizing all the private data in the text.
However, the above method for classifying the privacy degree based on the character number of the privacy data or the ratio of the number of the privacy data is not high in accuracy, so that it is difficult for the privacy protection mechanism based on the privacy level to achieve a good privacy protection effect.
Disclosure of Invention
The present application provides a method for processing unstructured data. The method treats the unstructured data as a whole and determines its privacy degree from both the sensitive words and the non-sensitive words in it, which gives high accuracy. A privacy protection mechanism corresponding to the privacy degree can then be applied, so that a good privacy protection effect is achieved. The application also provides a device, equipment, a computer-readable storage medium and a computer program product corresponding to the method.
In a first aspect, the present application provides a method for processing unstructured data. The method may be implemented by a processing system for unstructured data. The system may be deployed in a cloud environment, an edge environment, or an end-side device (i.e., an end device). The cloud environment indicates a central cluster of computing devices owned by a cloud service provider for providing computing, storage, and communication resources; the edge environment indicates a cluster of edge computing devices geographically closer to the end-side device for providing computing, storage, and communication resources. When the system is deployed in a cloud environment or an edge environment, the system can be provided to users in the form of a service. When the system is deployed on an end-side device, the system can be provided to users in the form of a client. In some implementations, the processing system of unstructured data includes multiple parts, which may also be deployed in a distributed manner across different environments.
Specifically, the processing system of the unstructured data performs word segmentation on the unstructured data to obtain word segmentation results, then determines the weight of a sensitive word in the word segmentation results, determines the weight of the non-sensitive word according to the similarity between the non-sensitive word and the attribute of the private data in the word segmentation results, and then determines the privacy degree of the unstructured data through the weight of the sensitive word and the weight of the non-sensitive word.
According to the method, the unstructured data are taken as a whole, not only are private data, namely sensitive words, considered, but also non-sensitive words having context relations with the sensitive words, and the privacy degree of the unstructured data is determined based on the sensitive words and the non-sensitive words, so that the method can evaluate the privacy degree more accurately and comprehensively. Furthermore, the method can more accurately adopt the privacy protection mechanism of the corresponding level to carry out privacy protection, and has better privacy protection effect.
In some implementations, considering that the similarity of words can be measured by the distance of words in a vector space, the processing system of unstructured data may further extract a word vector of the non-sensitive word and a word vector of the private data attribute, determine the similarity of the non-sensitive word and the private data attribute according to the distance between the word vector of the non-sensitive word and the word vector of the private data attribute, and then determine the weight of the non-sensitive word according to the similarity of the non-sensitive word and the private data attribute.
The method borrows the natural language processing technique of computing lexical similarity from word vectors, and uses it to determine the similarity between non-sensitive words and privacy data attributes. Because word vectors retain semantic features, a similarity determined from them is more reliable.
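The similarity computation described above can be sketched as follows. This is a minimal illustration, assuming cosine similarity as the vector-space measure (the text only requires a distance between word vectors); the three-dimensional vectors and the helper name `best_attribute` are hypothetical and are not taken from the application.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical word vectors, invented for illustration only.
WORD_VECTORS = {
    "graduated": [0.9, 0.1, 0.2],
    "weather": [0.0, 0.2, 0.9],
}
ATTRIBUTE_VECTORS = {
    "school": [0.8, 0.2, 0.1],
    "name": [0.1, 0.9, 0.1],
}

def best_attribute(word):
    """Return (attribute, similarity) for the privacy data attribute closest to the word."""
    vec = WORD_VECTORS[word]
    scored = ((attr, cosine_similarity(vec, avec)) for attr, avec in ATTRIBUTE_VECTORS.items())
    return max(scored, key=lambda pair: pair[1])
```

With these toy vectors, "graduated" lies close to the attribute "school", while "weather" is not close to any attribute. In practice the vectors would come from a trained word vector model.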
In some implementations, the processing system of unstructured data may extract the word vectors for the non-sensitive words and the word vectors for the privacy data attributes using a pre-trained word vector model. The word vector is extracted through the word vector model, and the efficiency and the accuracy are high.
In some implementations, the definition of private data may differ between application scenarios, and language usage and modes of expression also differ greatly between scenarios. The context of the same word may therefore differ greatly in the corpora of different application scenarios, and a word vector model trained from an initial model on a general corpus may have low accuracy. Based on this, the processing system of the unstructured data can also obtain a training corpus that matches the application scenario of the unstructured data, and train an initial word vector model with that training corpus to obtain the word vector model.
In some implementations, words corresponding to the same privacy data attribute often have similar contexts, but the privacy data words themselves vary widely. For example, the privacy data words corresponding to the attribute name may be "Zhang San", "Li Si", "Wang Wu", and the like, and many privacy data words may appear only a few times, so a word vector model trained directly on the training corpus is not accurate enough. In order to train a better word vector model and calculate similarity more accurately, and thus assign sensitivity weights better, the processing system of unstructured data can also preprocess the training corpus. Specifically, the sensitive words in the training corpus are identified and each sensitive word is replaced with its privacy data attribute, and the initial word vector model is then trained with the replaced training corpus to obtain the word vector model.
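The corpus preprocessing step can be sketched as follows. This is a minimal illustration, assuming an already tokenized corpus and a lookup table from known sensitive words to privacy data attributes; the table contents and the `<NAME>` placeholder format are hypothetical.

```python
from collections import Counter

# Hypothetical lookup from sensitive words to a placeholder for their privacy data attribute.
SENSITIVE_TO_ATTRIBUTE = {
    "Zhang San": "<NAME>",
    "Li Si": "<NAME>",
    "Wang Wu": "<NAME>",
}

def replace_sensitive(tokens, mapping):
    """Replace each sensitive token with the placeholder of its privacy data attribute."""
    return [mapping.get(tok, tok) for tok in tokens]

# Toy tokenized corpus; each name occurs only once.
corpus = [
    ["my", "name", "is", "Zhang San"],
    ["my", "name", "is", "Li Si"],
    ["his", "name", "is", "Wang Wu"],
]
replaced_corpus = [replace_sensitive(sentence, SENSITIVE_TO_ATTRIBUTE) for sentence in corpus]

# Token frequencies before and after replacement.
counts_before = Counter(tok for sentence in corpus for tok in sentence)
counts_after = Counter(tok for sentence in replaced_corpus for tok in sentence)
```

Before replacement each name is seen once. After replacement the single token `<NAME>` is seen three times in similar contexts, which gives a word vector model more evidence for that attribute.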
In some implementations, the processing system of the unstructured data may further determine a privacy protection mechanism of the unstructured data according to a privacy degree of the unstructured data, and perform privacy protection on the unstructured data by using the privacy protection mechanism. The method can avoid the direct disclosure of the private information caused by the private data, and also effectively prevent the indirect disclosure of the private information caused by the semantic problem, thereby better protecting the private information.
In a second aspect, the present application provides an apparatus for processing unstructured data. The device comprises:
the word segmentation module is used for segmenting words of the unstructured data to obtain word segmentation results;
the weight determining module is used for determining the weight of the sensitive words in the word segmentation result and determining the weight of the non-sensitive words according to the similarity between the non-sensitive words in the word segmentation result and the privacy data attribute;
and the privacy degree determining module is used for determining the privacy degree of the unstructured data according to the weight of the sensitive words and the weight of the non-sensitive words.
In some implementations, the weight determination module is specifically configured to:
extracting a word vector of the non-sensitive word and a word vector of the privacy data attribute;
determining the similarity between the non-sensitive words and the privacy data attributes according to the distance between the word vectors of the non-sensitive words and the word vectors of the privacy data attributes;
and determining the weight of the non-sensitive word according to the similarity of the non-sensitive word and the privacy data attribute.
In some implementations, the weight determination module is specifically configured to:
and extracting the word vector of the non-sensitive word and the word vector of the privacy data attribute by utilizing a pre-trained word vector model.
In some implementations, the apparatus further includes:
the communication module is used for acquiring training corpora matched with the application scene of the unstructured data;
and the training module is used for training an initial word vector model by using the training corpus to obtain a word vector model.
In some implementations, the apparatus further includes:
the replacing module is used for identifying the sensitive words in the training corpus and replacing the sensitive words by using the privacy data attributes;
the training module is specifically configured to:
and training the initial word vector model by using the replaced training corpus to obtain a word vector model.
In some implementations, the apparatus further includes:
and the privacy protection processing module is used for determining a privacy protection mechanism of the unstructured data according to the privacy degree of the unstructured data and carrying out privacy protection on the unstructured data by utilizing the privacy protection mechanism.
In a third aspect, the present application provides an apparatus comprising a processor and a memory. The processor and the memory are in communication with each other. The processor is configured to execute the instructions stored in the memory to cause the apparatus to perform the method of processing unstructured data as in the first aspect or any implementation manner of the first aspect.
In a fourth aspect, the present application provides a computer-readable storage medium, where instructions are stored, and the instructions instruct a device to execute the method for processing unstructured data according to the first aspect or any implementation manner of the first aspect.
In a fifth aspect, the present application provides a computer program product comprising instructions which, when run on a device, cause the device to perform the method for processing unstructured data according to the first aspect or any of the implementations of the first aspect.
The present application can further combine to provide more implementations on the basis of the implementations provided by the above aspects.
Drawings
In order to more clearly illustrate the technical method of the embodiments of the present application, the drawings used in the embodiments will be briefly described below.
FIG. 1 is an architecture diagram of an unstructured data processing system according to an embodiment of the present application;
FIG. 2 is an architecture diagram of an unstructured data processing system according to an embodiment of the present application;
FIG. 3 is a flowchart of a method for processing unstructured data according to an embodiment of the present application;
FIG. 4 is a schematic diagram illustrating a method for determining weights of non-sensitive words according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of an apparatus for processing unstructured data according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of an apparatus according to an embodiment of the present application.
Detailed Description
The terms "first" and "second" in the embodiments of the present application are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature.
Some technical terms referred to in the embodiments of the present application will be first described.
Unstructured data refers to data that is irregular or incomplete in data structure, has no predefined data model, and is not convenient to logically express and implement by using a database two-dimensional logical table. The format of unstructured data is diverse. As one example, unstructured data may include documents or text in various formats.
A word vector is also called a word embedding. A word vector is a vector formed by mapping a word into a continuous vector space of lower dimension, and may typically be represented as a sequence of real numbers. Such a representation can be understood as a neural-network-based distributed representation that preserves the semantic features of the word.
For unstructured data such as personal resumes, medical records and office documents, the industry has proposed a privacy protection method. Specifically, the private data present in the unstructured data is identified according to the definition of private data. The character ratio of the private data is then determined as the ratio of the number of characters of the private data to the total number of characters of the unstructured data, or the count ratio of the private data is determined as the ratio of the number of private data items to the total number of words in the unstructured data, and the privacy degree is graded according to the character ratio or the count ratio. Then, based on the privacy level, a corresponding privacy protection mechanism is applied, such as desensitizing all private data in the unstructured data.
However, the above method of grading the privacy degree by the character ratio or the count ratio of the private data ignores the correlation between contexts. Besides the private data itself, unstructured data such as personal resumes, medical records and office documents may include words that are highly similar to, or strongly indicative of, the private data. Even if all the private data in the unstructured data is desensitized during privacy protection, information related to the private data may still be inferred from such words, so the unstructured data is not completely desensitized and private information is leaked to some extent.
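The two ratios used by this existing approach can be sketched as follows. This is a minimal illustration, assuming the text is already segmented into words and the private data items are known; the function names are hypothetical.

```python
def character_ratio(words, private_words):
    """Share of characters that belong to private data items."""
    total_chars = sum(len(w) for w in words)
    if total_chars == 0:
        return 0.0
    private_chars = sum(len(w) for w in words if w in private_words)
    return private_chars / total_chars

def count_ratio(words, private_words):
    """Share of words that are private data items."""
    if not words:
        return 0.0
    return sum(1 for w in words if w in private_words) / len(words)

# Toy input: one private data item among four words.
words = ["my", "name", "is", "Wang"]
private_words = {"Wang"}
char_share = character_ratio(words, private_words)   # 4 of 12 characters
word_share = count_ratio(words, private_words)       # 1 of 4 words
```

Both ratios depend only on the private data items themselves, so a word such as "name" contributes nothing to the grade.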
For example, consider the sentence "My name is Wang, I graduated from the University of Finance, and I do not want to stay and work at company X." If, during privacy protection processing, the privacy degree is graded only by the count ratio or the character ratio of the private data, and only the private data is desensitized, the desensitized sentence becomes "my name is. "
The above privacy protection process considers only the private data. Although the private data is masked to some extent, the sentence is not fully desensitized. In particular, because the semantics of the sentence are not considered, the desensitized sentence remains semantically complete, and its privacy degree is not reduced to the minimum. The word "name" points strongly to the private data attribute name; "graduated" points strongly to the private data attribute school; "stay" and "work" point strongly to the private data attribute workplace and also express the person's intention. From these strongly indicative words, information related to the desensitized private data, or the intention the person expressed, can be inferred.
Therefore, the privacy degree classification method based on the number ratio of the privacy data or the number ratio of the characters of the privacy data is not high in accuracy, so that a privacy protection mechanism based on the privacy level is difficult to achieve a good privacy protection effect.
In view of this, the present application provides a method for processing unstructured data. The method may be performed by a processing system for unstructured data. Specifically, the processing system performs word segmentation on the unstructured data to obtain a word segmentation result and determines the weight of each sensitive word in the result. Because context words in unstructured data are strongly related semantically, the processing system also determines, for each non-sensitive word in the result, a weight according to the similarity between that word and a privacy data attribute. The privacy degree of the unstructured data is then determined from the weights of the sensitive words and the weights of the non-sensitive words.
According to the processing method of the unstructured data, the unstructured data are taken as a whole, not only are private data, namely sensitive words, considered, but also non-sensitive words having context relations with the sensitive words, and the privacy degree of the unstructured data is determined based on the sensitive words and the non-sensitive words, so that the method can evaluate the privacy degree more accurately and comprehensively. Furthermore, the method can more accurately adopt the privacy protection mechanism of the corresponding level to carry out privacy protection, and has better privacy protection effect.
As shown in FIG. 1, a processing system for unstructured data may be deployed in a cloud environment, specifically on one or more computing devices (e.g., central servers) in the cloud environment. The system may also be deployed in an edge environment, specifically on one or more computing devices (edge computing devices) in the edge environment, which may be servers. The system may also be deployed on an end-side device (i.e., an end device), including but not limited to a desktop computer, a laptop computer, a smartphone, and the like.
The cloud environment indicates a central cluster of computing devices owned by a cloud service provider for providing computing, storage, and communication resources; the edge environment indicates a cluster of edge computing devices geographically closer to the end-side device for providing computing, storage, and communication resources.
The end-side device may be used as a data providing device for providing unstructured data, so that a processing system of the unstructured data processes the unstructured data to determine the privacy degree thereof, and further performs privacy protection processing based on the privacy degree thereof by using a corresponding privacy protection mechanism. The end-side device may provide unstructured data generated or stored by itself for processing by a processing system of the unstructured data. In some implementations, the end-side device may be a network device, for example, a terminal device accessing a network, and thus, the end-side device may obtain unstructured data from the network and provide the unstructured data to a processing system.
When the processing system of the unstructured data is deployed in a cloud environment or an edge environment, the processing system of the unstructured data can be provided for users to use in a service form. Specifically, a user may access the cloud environment or the edge environment through a browser, create an instance of the processing system for unstructured data in the cloud environment or the edge environment, and then interact with the instance of the processing system for unstructured data through the browser, thereby implementing processing of unstructured data.
Processing systems for unstructured data may also be deployed on the end-side devices. Correspondingly, the processing system of the unstructured data can be provided for the user to use in a client form. Specifically, the user runs the client to realize the processing of the unstructured data.
In some implementations, as shown in FIG. 2, the processing system of unstructured data includes multiple parts (e.g., multiple subsystems, each of which includes multiple unit modules), and these parts may be deployed in a distributed manner across different environments. For example, the parts may be deployed separately across all three of the cloud environment, the edge environment and the end-side device, or across any two of them.
In order to make the technical solutions provided in the embodiments of the present application clearer and easier to understand, a method for processing unstructured data will be described below from the perspective of a system for processing unstructured data.
Referring to FIG. 3, a flowchart of a method for processing unstructured data is shown. The method comprises:
S302: The processing system of the unstructured data performs word segmentation on the unstructured data to obtain a word segmentation result.
In specific implementation, the processing system of the unstructured data may perform word segmentation on the unstructured data by using any one or more of a word segmentation method based on string matching, a word segmentation method based on understanding, a word segmentation method based on statistics, and the like, so as to obtain word segmentation results.
The word segmentation method based on character string matching matches a character string to be analyzed with entries in a machine dictionary according to a set strategy, and if a certain character string is found in the dictionary, the matching is successful, and a word is recognized. And then continuing to execute the matching operation, thereby realizing word segmentation of the unstructured data.
Further, string matching can proceed in different directions, so the word segmentation method based on string matching can be divided into a forward maximum matching method and a reverse maximum matching method. According to which match length is given priority, it can also be divided into a longest matching method and a shortest matching method. In addition, according to whether it is combined with part-of-speech tagging, it can be divided into a pure word segmentation method and an integrated method that combines word segmentation with part-of-speech tagging.
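Forward and reverse maximum matching can be sketched as follows. This is a minimal illustration, assuming a small in-memory dictionary and a fallback that emits a single character when nothing in the dictionary matches; the dictionary contents are hypothetical.

```python
def forward_max_match(text, dictionary, max_len=None):
    """Scan left to right, always taking the longest dictionary word that matches."""
    if max_len is None:
        max_len = max((len(w) for w in dictionary), default=1)
    result, i = [], 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                result.append(piece)
                i += size
                break
    return result

def reverse_max_match(text, dictionary, max_len=None):
    """Scan right to left, always taking the longest dictionary word that matches."""
    if max_len is None:
        max_len = max((len(w) for w in dictionary), default=1)
    result, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in dictionary:
                result.append(piece)
                j -= size
                break
    result.reverse()
    return result

# Hypothetical dictionary; the text means "study the origin of life".
DICTIONARY = {"研究", "研究生", "生命", "的", "起源"}
TEXT = "研究生命的起源"
forward = forward_max_match(TEXT, DICTIONARY)   # starts with "研究生" (graduate student)
reverse = reverse_max_match(TEXT, DICTIONARY)   # starts with "研究" (study)
```

The two directions disagree on this sentence, which shows why the matching direction is a real design choice rather than a detail.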
The word segmentation method based on understanding achieves the effect of recognizing words by simulating the understanding of sentences. Specifically, syntactic analysis and semantic analysis are carried out at the same time of word segmentation, and ambiguity is eliminated by utilizing syntactic information and semantic information, so that word segmentation is carried out on unstructured data such as texts.
The word segmentation method based on statistics uses a statistical machine learning model to learn segmentation rules from a large amount of already segmented text, and then segments unseen text. Word segmentation methods based on statistics include the maximum probability word segmentation method and the maximum entropy word segmentation method. The statistical model used may be one of an N-gram model, a hidden Markov model (HMM), a maximum entropy model (MEM) and a conditional random field (CRF) model.
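Maximum probability word segmentation can be sketched with a unigram model and dynamic programming as follows. This is a minimal illustration, assuming word probabilities estimated beforehand from segmented text; the probability table, the `unknown_prob` fallback for single unseen characters and the `max_len` limit are hypothetical.

```python
import math

def max_probability_segment(text, word_prob, max_len=4, unknown_prob=1e-8):
    """Return the segmentation that maximizes the product of unigram word probabilities."""
    n = len(text)
    best = [float("-inf")] * (n + 1)   # best[i]: best log-probability of text[:i]
    back = [0] * (n + 1)               # back[i]: start index of the last word in that best split
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in word_prob:
                prob = word_prob[piece]
            elif len(piece) == 1:
                prob = unknown_prob    # unseen single character keeps every prefix reachable
            else:
                continue
            score = best[start] + math.log(prob)
            if score > best[end]:
                best[end] = score
                back[end] = start
    # Follow the back-pointers from the end of the text to recover the words.
    words, end = [], n
    while end > 0:
        start = back[end]
        words.append(text[start:end])
        end = start
    words.reverse()
    return words

# Hypothetical unigram probabilities, invented for illustration only.
WORD_PROB = {"研究": 0.3, "研究生": 0.05, "生命": 0.3, "命": 0.01, "的": 0.3, "起源": 0.04}
segmented = max_probability_segment("研究生命的起源", WORD_PROB)
```

Unlike greedy string matching, the dynamic program compares whole segmentations, so here it prefers the split that begins with "研究" over the one that begins with the longer but less probable "研究生".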
Specifically, the processing system of the unstructured data may select a matching word segmentation method to perform word segmentation based on the language, scene, and the like of the unstructured data, so as to obtain a word segmentation result.
In some implementations, in order to save storage space and improve the efficiency of processing unstructured data, the processing system of unstructured data may also remove stop words after word segmentation, so as to obtain the final word segmentation result.
S304: The processing system of the unstructured data determines the weight of the sensitive words in the word segmentation result, and determines the weight of the non-sensitive words according to the similarity between the non-sensitive words in the word segmentation result and the privacy data attributes.
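Stop-word removal can be sketched as follows. This is a minimal illustration, assuming a small hand-written stop-word list; a real system would use a list suited to the language and scenario.

```python
# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "an", "is", "of", "to"}

def remove_stop_words(words, stop_words=STOP_WORDS):
    """Drop words whose lowercase form is in the stop-word list, keeping the original order."""
    return [w for w in words if w.lower() not in stop_words]

filtered = remove_stop_words(["My", "name", "is", "Wang"])
```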
Specifically, the processing system of the unstructured data may determine the sensitive words according to the word segmentation result, and the words other than the sensitive words in the word segmentation result are the non-sensitive words, and then the processing system of the unstructured data may determine the weights of the sensitive words and determine the weights of the non-sensitive words according to the similarity between the non-sensitive words and the privacy data attributes. The weight is specifically used for measuring the importance degree of the sensitive words or the non-sensitive words to the privacy degree of the whole unstructured data.
Wherein the privacy data attribute is used to describe the type of the privacy data. For example, for the private data "zhang san", the corresponding private data attribute is "name", and for the private data xx @ yy.com, the corresponding private data attribute is "email address".
The definition of private data may differ between application scenarios. For example, information such as a birthday or a place of birth is treated as private in some application scenarios, such as scenarios governed by the General Data Protection Regulation (GDPR), and is not treated as private in other application scenarios, such as a medical scenario. As shown in the following tables:
Table 1 Privacy data template in medical scenarios
[Table 1 is provided as an image in the original publication.]
TABLE 2 Privacy data template in the GDPR scenario
| No. | Privacy data attribute | Private? | No. | Privacy data attribute | Private? |
|---|---|---|---|---|---|
| I | Name | Yes | XI | Bank card number | Yes |
| II | E-mail address | Yes | XII | Nationality | Yes |
| III | Mobile phone number | Yes | XIII | Political affiliation | Yes |
| IV | Home telephone number | Yes | XIV | IP address | Yes |
| V | Any address | Yes | XV | GPS information | Yes |
| VI | Identity card number | Yes | XVI | DNA information | Yes |
| VII | Passport number | Yes | XVII | Fingerprint | Yes |
| VIII | License plate number | Yes | XVIII | Iris information | Yes |
| IX | Birthday | Yes | XIX | Disease diagnosis | Yes |
| X | Place of birth | Yes | | | |
Based on this, when identifying sensitive words, the processing system of unstructured data can match the attribute of each word in the word segmentation result against the privacy data attributes defined by the privacy data template for the current application scenario, thereby determining whether each word in the word segmentation result is a sensitive word or a non-sensitive word. Sensitive and non-sensitive words determined in this way are more accurate.
Then, for a sensitive word, the processing system of unstructured data may determine its weight according to a preset weight. For example, if the weight of sensitive words is preset to a standard weight of 1, each sensitive word is assigned the weight 1.
For a non-sensitive word, the weight is determined according to the similarity between the non-sensitive word and the privacy data attribute, specifically according to a correspondence between similarity and weight. The higher the similarity between the non-sensitive word and the privacy data attribute, the higher the weight of the non-sensitive word; the lower the similarity, the lower the weight.
For ease of understanding, a specific example follows. In this example, the unstructured data includes the sentence "Zhang San is my name". Based on the privacy data attributes, the processing system of unstructured data determines that "Zhang San" is a sensitive word and "name" is a non-sensitive word. It calculates the similarity between the non-sensitive word "name" and the privacy data attribute "name" as 0.9999, and determines a weight of 0.8 from the correspondence between similarity and weight. The weight of "Zhang San" is therefore 1 and the weight of "name" is 0.8.
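The example above can be reproduced with a small sketch. The step function `weight_from_similarity` is hypothetical: only the pair (similarity 0.9999, weight 0.8) comes from the example, and the remaining breakpoints are placeholders chosen to keep the mapping monotonic and within (0, 1).

```python
SENSITIVE_WEIGHT = 1.0  # standard weight for sensitive words, as in the example


def weight_from_similarity(similarity: float) -> float:
    """Hypothetical monotonic mapping from similarity to weight.

    Only the top band (0.9999 -> 0.8) is taken from the example; the other
    bands are illustrative assumptions.
    """
    if similarity >= 0.9:
        return 0.8
    if similarity >= 0.6:
        return 0.5
    if similarity >= 0.3:
        return 0.2
    return 0.1


def assign_weights(sensitive: list[str], similarities: dict[str, float]) -> dict[str, float]:
    """Give sensitive words the standard weight and map similarities to weights."""
    weights = {word: SENSITIVE_WEIGHT for word in sensitive}
    for word, similarity in similarities.items():
        weights[word] = weight_from_similarity(similarity)
    return weights


print(assign_weights(["Zhang San"], {"name": 0.9999}))
# {'Zhang San': 1.0, 'name': 0.8}
```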
S306: The processing system of unstructured data determines the privacy degree of the unstructured data from the weights of the sensitive words and the weights of the non-sensitive words.
Specifically, the processing system of the unstructured data may obtain the privacy degree of the unstructured data by performing weighted aggregation on the weights of all sensitive words and the weights of all non-sensitive words.
In one example, the formula for calculating the privacy level is specifically as follows:
Figure BDA0002465174590000072
Here, privacylevel represents the privacy level, also referred to as the sensitivity level, and n is the total number of sensitive and non-sensitive words. g_i is the sensitivity value of the i-th word in the unstructured data, as follows:
Figure BDA0002465174590000081
where I_i is, when the i-th word is a non-sensitive word, the similarity between that non-sensitive word and the privacy data attribute, and α_i is the weight of the non-sensitive word, representing the influence of the non-sensitive word on the privacy degree of the unstructured data. α_i takes values in the range (0, 1) and is determined according to the similarity between the non-sensitive word and the privacy data attribute.
In one example, the similarity of the non-sensitive word and the private data attribute has the following correspondence with the weight of the non-sensitive word:
Figure BDA0002465174590000082
The processing system of unstructured data determines the weight of each non-sensitive word based on formula (3) above, and determines the privacy degree of the unstructured data from the weights of the sensitive words and the weights of the non-sensitive words.
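The aggregation of per-word sensitivity values into a privacy degree can be sketched as follows. The formula images are not reproduced in this text, so the sketch assumes one plausible form: g_i equals the standard weight 1 for a sensitive word and α_i for a non-sensitive word, and the privacy degree is the mean of g_i over all n words. The names `Word`, `sensitivity_value` and `privacy_level`, and the choice of the mean, are assumptions of this sketch and not the formula of the application.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    sensitive: bool
    alpha: float = 0.0  # weight of a non-sensitive word, in (0, 1); unused if sensitive


def sensitivity_value(word: Word, sensitive_weight: float = 1.0) -> float:
    """g_i: the standard weight for a sensitive word, alpha_i otherwise (assumed form)."""
    return sensitive_weight if word.sensitive else word.alpha


def privacy_level(words: list[Word]) -> float:
    """Assumed aggregation: mean of the sensitivity values over all n words."""
    if not words:
        return 0.0
    return sum(sensitivity_value(w) for w in words) / len(words)


words = [Word("Zhang San", sensitive=True), Word("name", sensitive=False, alpha=0.8)]
print(privacy_level(words))  # 0.9
```

Under this assumed form, a text whose non-sensitive words are strongly related to a privacy data attribute scores higher than a text with the same sensitive words and unrelated context.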
Based on the above description, the embodiments of the present application provide a method for processing unstructured data. The method treats the unstructured data as a whole and considers the association between contexts in the unstructured data: it determines the weight of each non-sensitive word from the similarity between that non-sensitive word and the privacy data attribute, and determines the text privacy degree from the weights of the sensitive words and the weights of the contextually related non-sensitive words, thereby achieving higher accuracy.
Moreover, the method makes it possible to determine a privacy protection mechanism of the corresponding level more accurately. Performing privacy protection on the unstructured data with this mechanism not only avoids direct disclosure of private information through the private data itself, but also effectively prevents indirect disclosure of private information through semantics, so that private information is better protected.
In order to verify that the privacy degree grading method provided by the application can better evaluate the privacy degree of the unstructured data than a traditional method, the embodiment of the application also designs an attack scene for verification.
Specifically, in the attack scenario, the same masking desensitization is applied to all private data in the unstructured text data: every item of private data is uniformly replaced with spaces. An attacker then guesses the private data from its context vocabulary. The higher the probability of guessing correctly, the more private information the attacker can obtain from the text, which indicates that the level of the privacy protection mechanism currently applied is insufficient and that desensitization is incomplete. Therefore, if the text privacy degree is not graded accurately enough, text data with a high privacy level may be desensitized with a low-level privacy protection mechanism, so that desensitization is incomplete and the desensitized text data may still reveal private information.
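The attack scenario can be sketched as follows. The masking step follows the description (every private item is replaced by a space); the attacker model is not specified in the application, so `toy_guess` is a deliberately trivial, hypothetical guesser used only to show how a success rate would be measured.

```python
from typing import Callable


def mask_private_data(tokens: list[str], sensitive: set[str]) -> list[str]:
    """Uniformly replace every sensitive token with a single space."""
    return [" " if t in sensitive else t for t in tokens]


def attack_success_rate(
    tokens: list[str],
    sensitive: set[str],
    guess: Callable[[list[str], int], str],
) -> float:
    """Fraction of masked positions the attacker guesses correctly from context."""
    masked = mask_private_data(tokens, sensitive)
    positions = [i for i, t in enumerate(tokens) if t in sensitive]
    if not positions:
        return 0.0
    hits = sum(1 for i in positions if guess(masked, i) == tokens[i])
    return hits / len(positions)


def toy_guess(masked: list[str], position: int) -> str:
    """Hypothetical attacker: guesses a fixed name whenever 'name' is in the context."""
    return "Zhang San" if "name" in masked else ""


tokens = ["Zhang San", "is", "my", "name"]
print(mask_private_data(tokens, {"Zhang San"}))               # [' ', 'is', 'my', 'name']
print(attack_success_rate(tokens, {"Zhang San"}, toy_guess))  # 1.0
```

A high success rate on a masked text suggests that the surrounding context still leaks the private data, so a stronger protection level would be needed for that text.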
The privacy degree ranking of texts can be verified by predicting the private data in the designed attack scenario. Correspondingly, the text privacy degree is calculated both with the privacy degree grading method provided in the embodiments of the present application and with conventional methods, and the texts are ranked by privacy degree. The closer a ranking is to the ranking obtained from the attack-scenario verification, the more accurately the corresponding method reflects the privacy level of the text.
The measure of the closeness of the ranking may be implemented by Mean Square Error (MSE), and a calculation formula of the MSE is as follows:
Figure BDA0002465174590000091
wherein n is the number of documents; x and y represent two ranked lists of document privacy levels.
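The comparison of two rankings can be sketched as follows, assuming the standard definition of mean square error, that is, the mean of the squared differences between corresponding ranks. The rank lists in the example are hypothetical and are not the data of Table 3.

```python
def mse(x: list[float], y: list[float]) -> float:
    """Mean square error between two equally long rank lists."""
    if len(x) != len(y):
        raise ValueError("rank lists must have the same length")
    if not x:
        raise ValueError("rank lists must not be empty")
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)


# Hypothetical rankings of four documents.
reference = [1, 2, 3, 4]
print(mse(reference, [1, 2, 3, 4]))  # 0.0
print(mse(reference, [2, 1, 4, 3]))  # 1.0
```

A smaller value means the evaluated ranking is closer to the reference ranking.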
The embodiments of the present application provide the following experimental data:
TABLE 3 privacy level ranking determined by different methods
Figure BDA0002465174590000092
Computing the MSE according to the rankings in Table 3 gives:
MSE(verification ranking, ranking of the present application) = 6;
MSE(verification ranking, ranking by proportion of the number of private data items) = 12;
MSE(verification ranking, ranking by proportion of private data characters) = 34.
Therefore, compared with the traditional privacy degree grading method based on the number proportion of the privacy data or the number proportion of the characters of the privacy data, the privacy degree grading method based on the similarity is closer to the verification method in ranking, and the method provided by the embodiment of the application can be used for grading the privacy degree of the unstructured data more accurately.
In consideration of the semantic characteristic that context vocabularies have relevance, the embodiment of the application introduces a method for calculating vocabulary similarity by using word vectors in Natural Language Processing (NLP), and the method is used for calculating the similarity between non-sensitive words and private data attributes.
Specifically, as shown in fig. 4, the processing system of unstructured data may extract a word vector of non-sensitive words and a word vector of private data attributes, respectively, e.g., input the non-sensitive words and the private data attributes into a pre-trained word vector model, thereby obtaining the word vector of non-sensitive words and the word vector of private data attributes. And then, determining the similarity between the non-sensitive words and the privacy data attributes according to the distance between the word vectors of the non-sensitive words and the word vectors of the privacy data attributes. Then, according to the similarity between the non-sensitive word and the privacy data attribute, the weight of the non-sensitive word is determined based on the correspondence between the similarity and the weight (for example, the correspondence shown in formula (3)).
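The similarity step can be sketched as follows. The description only says that similarity is derived from the distance between word vectors; cosine similarity is used here as one common choice and is an assumption of this sketch, as are the three-dimensional toy vectors, which do not come from any trained model.

```python
import math


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two word vectors (0.0 if either is all zeros)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy vectors standing in for the output of a pre-trained word vector model.
vec_word_name = [0.9, 0.1, 0.0]       # non-sensitive word "name"
vec_attribute_name = [0.9, 0.1, 0.0]  # privacy data attribute "name"
vec_word_other = [0.0, 0.1, 0.9]      # an unrelated non-sensitive word

print(round(cosine_similarity(vec_word_name, vec_attribute_name), 4))   # 1.0
print(round(cosine_similarity(vec_word_other, vec_attribute_name), 4))  # 0.0122
```

The resulting similarity would then be mapped to a weight through the correspondence between similarity and weight.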
The word vector model can be obtained by training through methods such as word2vec and the like. Specifically, the processing system of unstructured data may construct an initial word vector model by word2vec, and train the initial word vector model using the training corpus, thereby obtaining a word vector model for extracting word vectors.
Because different application scenarios may define private data differently and differ greatly in language use and expression, the contexts of the same word can differ considerably between the corpora of different application scenarios. If the initial word vector model is trained on a general-purpose corpus, the accuracy of the trained word vector model may be low. Based on this, the processing system of unstructured data can obtain a training corpus that matches the application scenario of the unstructured data, and then train the initial word vector model with this scenario-specific training corpus to obtain the word vector model.
Further, even within the corpus of a fixed application scenario, words corresponding to the same privacy data attribute often have similar contexts, but the private data words themselves vary widely. For example, the private data words corresponding to the attribute "name" may be "Zhang San", "Li Xi", "Wang Wu", and so on. Many private data words appear only a few times, so a word vector model trained directly on the raw training corpus is not accurate enough. To train a better word vector model and calculate similarity more accurately, and thus assign sensitivity weights better, the processing system of unstructured data can also preprocess the training corpus. Specifically, it identifies the sensitive words in the training corpus, replaces each sensitive word with its privacy data attribute, and then trains the initial word vector model on the replaced training corpus to obtain the word vector model.
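The corpus preprocessing can be sketched as follows. The lookup table `word_to_attribute` is a hypothetical stand-in for whatever sensitive-word identification the system uses, and the two-sentence corpus is illustrative only.

```python
def replace_sensitive_words(
    sentences: list[list[str]],
    word_to_attribute: dict[str, str],
) -> list[list[str]]:
    """Replace each identified sensitive word with its privacy data attribute."""
    return [
        [word_to_attribute.get(token, token) for token in sentence]
        for sentence in sentences
    ]


# Hypothetical mapping from identified sensitive words to their attributes.
word_to_attribute = {
    "Zhang San": "name",
    "Wang Wu": "name",
    "xx@yy.com": "email address",
}

corpus = [
    ["contact", "Zhang San", "at", "xx@yy.com"],
    ["Wang Wu", "signed", "the", "form"],
]

for sentence in replace_sensitive_words(corpus, word_to_attribute):
    print(sentence)
# ['contact', 'name', 'at', 'email address']
# ['name', 'signed', 'the', 'form']
```

After replacement, rare individual names collapse into a single frequent token, so the word vector trained for the attribute sees many more contexts than any individual private data word would.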
The method for processing unstructured data provided by the embodiments of the present application has been described in detail above with reference to fig. 1 to 4. The apparatus and the device provided by the embodiments of the present application are described below with reference to the accompanying drawings.
Referring to fig. 5, a schematic structural diagram of an apparatus for processing unstructured data is shown, where the apparatus 500 includes:
a word segmentation module 502, configured to perform word segmentation on the unstructured data to obtain a word segmentation result;
a weight determining module 504, configured to determine a weight of a sensitive word in the word segmentation result, and determine a weight of a non-sensitive word according to a similarity between the non-sensitive word in the word segmentation result and a private data attribute;
a privacy level determining module 506, configured to determine a privacy level of the unstructured data according to the weight of the sensitive word and the weight of the non-sensitive word.
In some implementations, the weight determining module 504 is specifically configured to:
extracting a word vector of the non-sensitive word and a word vector of the privacy data attribute;
determining the similarity between the non-sensitive words and the privacy data attributes according to the distance between the word vectors of the non-sensitive words and the word vectors of the privacy data attributes;
and determining the weight of the non-sensitive word according to the similarity of the non-sensitive word and the privacy data attribute.
In some implementations, the weight determining module 504 is specifically configured to:
and extracting the word vector of the non-sensitive word and the word vector of the privacy data attribute by utilizing a pre-trained word vector model.
In some implementations, the apparatus 500 further includes:
the communication module is used for acquiring training corpora matched with the application scene of the unstructured data;
and the training module is used for training an initial word vector model by using the training corpus to obtain a word vector model.
In some implementations, the apparatus further includes:
the replacing module is used for identifying the sensitive words in the training corpus and replacing the sensitive words by using the privacy data attributes;
the training module is specifically configured to:
and training the initial word vector model by using the replaced training corpus to obtain a word vector model.
In some implementations, the apparatus further includes:
and the privacy protection processing module is used for determining a privacy protection mechanism of the unstructured data according to the privacy degree of the unstructured data and carrying out privacy protection on the unstructured data by utilizing the privacy protection mechanism.
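The behaviour of the privacy protection processing module can be sketched as follows. The application does not specify the levels or the mechanisms, so the thresholds 0.3 and 0.7, the assumption that the privacy degree lies in [0, 1], and the three mechanism names are all hypothetical choices made for illustration.

```python
def select_mechanism(privacy_degree: float) -> str:
    """Pick a protection mechanism from the privacy degree (hypothetical tiers)."""
    if not 0.0 <= privacy_degree <= 1.0:
        raise ValueError("privacy degree is assumed to lie in [0, 1]")
    if privacy_degree >= 0.7:
        return "mask_sensitive_and_context"
    if privacy_degree >= 0.3:
        return "mask_sensitive"
    return "none"


def apply_mechanism(
    tokens: list[str],
    sensitive: set[str],
    context: set[str],
    mechanism: str,
) -> list[str]:
    """Mask sensitive words, and also closely related context words at the top tier."""
    if mechanism == "none":
        return list(tokens)
    hidden = set(sensitive)
    if mechanism == "mask_sensitive_and_context":
        hidden |= context
    return [" " if t in hidden else t for t in tokens]


tokens = ["Zhang San", "is", "my", "name"]
mechanism = select_mechanism(0.9)
print(mechanism)  # mask_sensitive_and_context
print(apply_mechanism(tokens, {"Zhang San"}, {"name"}, mechanism))
# [' ', 'is', 'my', ' ']
```

Masking the closely related context word at the highest tier reflects the goal of preventing indirect disclosure through semantics, not only direct disclosure of the private data itself.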
The processing apparatus 500 for unstructured data according to the embodiment of the present application may correspond to performing the method described in the embodiment of the present application, and the above and other operations and/or functions of each module/unit of the processing apparatus 500 for unstructured data are respectively for implementing corresponding flows of each method in the embodiment shown in fig. 3, and are not described herein again for brevity.
An embodiment of the present application further provides a device 600. The device 600 may be an end-side device such as a laptop computer or a desktop computer, or may be a computer cluster in a cloud environment or an edge environment. The device 600 is specifically configured to implement the functions of the apparatus 500 for processing unstructured data in the embodiment shown in fig. 5.
Fig. 6 provides a schematic structural diagram of a device 600, and as shown in fig. 6, the device 600 includes a bus 601, a processor 602, a communication interface 603, and a memory 604. The processor 602, memory 604, and communication interface 603 communicate over a bus 601. The bus 601 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in FIG. 6, but this is not intended to represent only one bus or type of bus. The communication interface 603 is used for communication with the outside. For example, a corpus matching an application scenario of unstructured data is obtained, or unstructured data is obtained, etc.
The processor 602 may be a Central Processing Unit (CPU). The memory 604 may include a volatile memory (volatile memory), such as a Random Access Memory (RAM). The memory 604 may also include a non-volatile memory (non-volatile memory), such as a read-only memory (ROM), a flash memory, an HDD, or an SSD.
The memory 604 stores executable code that the processor 602 executes to perform the processing of the unstructured data described above.
Specifically, when the embodiment shown in fig. 5 is implemented and the modules of the apparatus 500 for processing unstructured data described in the embodiment of fig. 5 are implemented by software, the software or program code required to perform the functions of the word segmentation module 502, the weight determination module 504, and the privacy degree determination module 506 in fig. 5 is stored in the memory 604. The functions of the communication module are implemented by the communication interface 603. The communication interface 603 receives the unstructured data and transmits it to the processor 602 through the bus 601. The processor 602 executes the program code corresponding to the modules stored in the memory 604, such as the program code corresponding to the word segmentation module 502, the weight determination module 504, and the privacy degree determination module 506, to perform word segmentation on the unstructured data, determine the weight of the sensitive words, determine the weight of the non-sensitive words according to the similarity between the non-sensitive words and the privacy data attributes, and then determine the privacy degree of the unstructured data according to the weights of the sensitive words and the weights of the non-sensitive words.
Of course, the processor 602 may further execute program code corresponding to the privacy protection processing module to determine a privacy protection mechanism for the unstructured data according to the privacy degree of the unstructured data, and to perform privacy protection on the unstructured data by using the privacy protection mechanism.
An embodiment of the present application further provides a computer-readable storage medium, which includes instructions for instructing a computer to execute the above processing method of unstructured data applied to the processing apparatus 500 of unstructured data.
The embodiment of the application also provides a computer program product, and when the computer program product is executed by a computer, the computer executes any one of the processing methods of the unstructured data. The computer program product may be a software installation package that can be downloaded and executed on a computer in the event that any of the aforementioned methods of processing unstructured data need to be used.

Claims (14)

1. A method for processing unstructured data, the method comprising:
performing word segmentation on the unstructured data to obtain word segmentation results;
determining the weight of a sensitive word in the word segmentation result, and determining the weight of a non-sensitive word according to the similarity of the non-sensitive word and the privacy data attribute in the word segmentation result;
and determining the privacy degree of the unstructured data through the weight of the sensitive words and the weight of the non-sensitive words.
2. The method of claim 1, wherein determining the weight of the non-sensitive word according to the similarity between the non-sensitive word and the private data attribute in the word segmentation result comprises:
extracting a word vector of the non-sensitive word and a word vector of the privacy data attribute;
determining the similarity between the non-sensitive words and the privacy data attributes according to the distance between the word vectors of the non-sensitive words and the word vectors of the privacy data attributes;
and determining the weight of the non-sensitive word according to the similarity of the non-sensitive word and the privacy data attribute.
3. The method of claim 2, wherein extracting the word vector for the non-sensitive word and the word vector for the privacy data attribute comprises:
and extracting the word vector of the non-sensitive word and the word vector of the privacy data attribute by utilizing a pre-trained word vector model.
4. The method of claim 3, wherein the word vector model is trained by:
acquiring a training corpus matched with an application scene of the unstructured data;
and training an initial word vector model by using the training corpus to obtain a word vector model.
5. The method of claim 4, further comprising:
identifying sensitive words in the training corpus, and replacing the sensitive words by using privacy data attributes;
the training of the initial word vector model by using the training corpus to obtain a word vector model comprises the following steps:
and training the initial word vector model by using the replaced training corpus to obtain a word vector model.
6. The method according to any one of claims 1 to 5, further comprising:
determining a privacy protection mechanism of the unstructured data according to the privacy degree of the unstructured data;
and carrying out privacy protection on the unstructured data by utilizing the privacy protection mechanism.
7. An apparatus for processing unstructured data, the apparatus comprising:
the word segmentation module is used for segmenting words of the unstructured data to obtain word segmentation results;
the weight determining module is used for determining the weight of the sensitive words in the word segmentation result and determining the weight of the non-sensitive words according to the similarity between the non-sensitive words in the word segmentation result and the privacy data attribute;
and the privacy degree determining module is used for determining the privacy degree of the unstructured data according to the weight of the sensitive words and the weight of the non-sensitive words.
8. The apparatus of claim 7, wherein the weight determination module is specifically configured to:
extracting a word vector of the non-sensitive word and a word vector of the privacy data attribute;
determining the similarity between the non-sensitive words and the privacy data attributes according to the distance between the word vectors of the non-sensitive words and the word vectors of the privacy data attributes;
and determining the weight of the non-sensitive word according to the similarity of the non-sensitive word and the privacy data attribute.
9. The apparatus of claim 8, wherein the weight determination module is specifically configured to:
and extracting the word vector of the non-sensitive word and the word vector of the privacy data attribute by utilizing a pre-trained word vector model.
10. The apparatus of claim 9, further comprising:
the communication module is used for acquiring training corpora matched with the application scene of the unstructured data;
and the training module is used for training an initial word vector model by using the training corpus to obtain a word vector model.
11. The apparatus of claim 10, further comprising:
the replacing module is used for identifying the sensitive words in the training corpus and replacing the sensitive words by using the privacy data attributes;
the training module is specifically configured to:
and training the initial word vector model by using the replaced training corpus to obtain a word vector model.
12. The apparatus of any one of claims 7 to 11, further comprising:
and the privacy protection processing module is used for determining a privacy protection mechanism of the unstructured data according to the privacy degree of the unstructured data and carrying out privacy protection on the unstructured data by utilizing the privacy protection mechanism.
13. A device, comprising a processor and a memory;
the processor is configured to execute instructions stored in the memory to cause the device to perform the method of any of claims 1 to 6.
14. A computer-readable storage medium comprising instructions that direct a device to perform the method of any of claims 1-6.
CN202010331678.3A 2020-04-24 2020-04-24 Method, device, equipment and medium for processing unstructured data Pending CN113553846A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010331678.3A CN113553846A (en) 2020-04-24 2020-04-24 Method, device, equipment and medium for processing unstructured data
PCT/CN2021/075680 WO2021212968A1 (en) 2020-04-24 2021-02-06 Unstructured data processing method, apparatus, and device, and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010331678.3A CN113553846A (en) 2020-04-24 2020-04-24 Method, device, equipment and medium for processing unstructured data

Publications (1)

Publication Number Publication Date
CN113553846A true CN113553846A (en) 2021-10-26

Family

ID=78101221

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010331678.3A Pending CN113553846A (en) 2020-04-24 2020-04-24 Method, device, equipment and medium for processing unstructured data

Country Status (2)

Country Link
CN (1) CN113553846A (en)
WO (1) WO2021212968A1 (en)

Also Published As

Publication number Publication date
WO2021212968A1 (en) 2021-10-28

Legal Events

Date Code Title Description
PB01 Publication
TA01 Transfer of patent application right

Effective date of registration: 20220215

Address after: 550025 Huawei cloud data center, jiaoxinggong Road, Qianzhong Avenue, Gui'an New District, Guiyang City, Guizhou Province

Applicant after: Huawei Cloud Computing Technologies Co.,Ltd.

Address before: 518129 Bantian HUAWEI headquarters office building, Longgang District, Guangdong, Shenzhen

Applicant before: HUAWEI TECHNOLOGIES Co.,Ltd.

SE01 Entry into force of request for substantive examination