WO2021212968A1 - Unstructured data processing method, apparatus, and device, and medium - Google Patents

Unstructured data processing method, apparatus, and device, and medium

Info

Publication number
WO2021212968A1
Authority
WO
WIPO (PCT)
Prior art keywords
word
sensitive
unstructured data
privacy
word vector
Prior art date
Application number
PCT/CN2021/075680
Other languages
French (fr)
Chinese (zh)
Inventor
朱天清
朱运丽
霍正聃
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2021212968A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Definitions

  • This application relates to the field of artificial intelligence technology, and in particular to a processing method, apparatus, device, and computer-readable storage medium for unstructured data.
  • Structured data is data that is logically expressed and realized using a table structure, has a specific data format, and usually uses a relational database for storage and management.
  • The privacy protection mechanisms for structured data are already fairly mature.
  • For unstructured data, the inability to adopt a unified structure for representation makes privacy protection difficult.
  • The industry has proposed some privacy protection methods for unstructured data. For example, the privacy level of unstructured data is graded according to the number of private data characters or the proportion of private data in the unstructured data, and a corresponding privacy protection mechanism is then adopted based on the privacy level, such as desensitizing all private data in the text.
  • This application provides a method for processing unstructured data.
  • The method treats unstructured data as a whole and determines its degree of privacy from both the sensitive words and the non-sensitive words it contains, which gives higher accuracy. A corresponding privacy protection mechanism can then be adopted based on the degree of privacy, achieving a better privacy protection effect.
  • This application also provides devices, equipment, computer-readable storage media, and computer program products corresponding to the above methods.
  • In a first aspect, this application provides a method for processing unstructured data.
  • This method can be implemented by a processing system for unstructured data.
  • The system can be deployed in a cloud environment, an edge environment, or an end device (i.e., an end-side device).
  • The cloud environment is a central computing device cluster owned by a cloud service provider and used to provide computing, storage, and communication resources.
  • The edge environment is an edge computing device cluster that is geographically closer to the end-side device and is used to provide computing, storage, and communication resources.
  • When the system is deployed in a cloud environment or an edge environment, it can be provided to users in the form of a service.
  • When the system is deployed on an end-side device, it can be provided to users in the form of a client.
  • The unstructured data processing system includes multiple parts, and these parts can also be distributed across different environments.
  • The unstructured data processing system performs word segmentation on the unstructured data to obtain a word segmentation result, determines the weight of each sensitive word in the word segmentation result, determines the weight of each non-sensitive word according to the similarity between that non-sensitive word and the private data attributes, and then uses the weights of the sensitive words and the non-sensitive words to determine the degree of privacy of the unstructured data.
  • This method considers unstructured data as a whole: it considers not only private data, that is, sensitive words, but also non-sensitive words that have a contextual relationship with sensitive words. Determining the degree of privacy from both sensitive and non-sensitive words makes the evaluation more accurate and comprehensive. Further, the method can more accurately adopt the privacy protection mechanism of the corresponding level and achieves better privacy protection.
  • The unstructured data processing system can also extract the word vector of the non-sensitive word and the word vector of the private data attribute, determine the similarity between the non-sensitive word and the private data attribute according to the distance between the two word vectors, and then determine the weight of the non-sensitive word according to that similarity.
  • This method borrows from natural language processing the technique of computing vocabulary similarity with word vectors and uses it to determine the similarity between non-sensitive words and private data attributes. Because a word vector retains semantic features, a similarity determined from it is highly reliable.
  • The unstructured data processing system may use a pre-trained word vector model to extract the word vector of the non-sensitive word and the word vector of the private data attribute. Extracting word vectors through a word vector model is efficient and accurate.
  • The definition of private data can differ between application scenarios, and language use and expression also differ greatly between application scenarios, so the context of the same word may differ considerably between the corpora of different application scenarios. If a general training corpus is used to train the initial word vector model, the resulting word vector model may not be accurate. For this reason, the unstructured data processing system can also obtain a training corpus that matches the application scenario of the unstructured data and use it to train an initial word vector model to obtain the word vector model.
  • Words corresponding to the same private data attribute often have similar contexts, but the private data words themselves vary widely.
  • For example, the private data words corresponding to the attribute "name" can be "Zhang San", "Li Si", "Wang Wu", and so on. Many private data words may appear only a few times, so a word vector model trained directly on the training corpus is not accurate enough.
  • For this reason, the unstructured data processing system can also preprocess the training corpus: it identifies the sensitive words in the training corpus, replaces each sensitive word with its private data attribute, and then uses the replaced training corpus to train an initial word vector model to obtain the word vector model.
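  • The corpus preprocessing described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the lookup table `SENSITIVE_WORD_ATTRIBUTES` and the function name `replace_sensitive_words` are hypothetical, and a real system would recognize sensitive words with a proper detector rather than a fixed dictionary.

```python
# Hypothetical lookup from known sensitive words to private data attributes.
# The entries are illustrative assumptions, not part of the source text.
SENSITIVE_WORD_ATTRIBUTES = {
    "Zhang San": "name",
    "Li Si": "name",
    "Wang Wu": "name",
    "xx@yy.com": "email address",
}


def replace_sensitive_words(sentence):
    """Replace each sensitive token with its private data attribute."""
    return [SENSITIVE_WORD_ATTRIBUTES.get(token, token) for token in sentence]


# A toy, already segmented training corpus.
corpus = [
    ["my", "name", "is", "Zhang San"],
    ["my", "name", "is", "Li Si"],
    ["contact", "Wang Wu", "at", "xx@yy.com"],
]

# After replacement, rare names collapse into the single token "name",
# so the attribute token is seen in many more contexts during training.
prepared_corpus = [replace_sensitive_words(sentence) for sentence in corpus]
```

  • After this step the first two sentences become identical, which is the intended effect: the word vector model learns one well-supported vector per private data attribute instead of many poorly supported vectors for individual names.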
  • The unstructured data processing system may also determine the privacy protection mechanism for the unstructured data according to its degree of privacy, and use that mechanism to perform privacy protection on the unstructured data.
  • This method can not only avoid the direct leakage of private information caused by private data, but also effectively prevent the indirect leakage of private information caused by semantic problems, and thus can better protect private information.
  • This application provides an apparatus for processing unstructured data.
  • The apparatus includes:
  • a word segmentation module, configured to perform word segmentation on the unstructured data to obtain a word segmentation result;
  • a weight determination module, configured to determine the weight of the sensitive word in the word segmentation result, and to determine the weight of the non-sensitive word according to the similarity between the non-sensitive word in the word segmentation result and the private data attribute; and
  • a degree-of-privacy determination module, configured to determine the degree of privacy of the unstructured data from the weight of the sensitive word and the weight of the non-sensitive word.
  • The weight determination module is specifically configured to:
  • determine the weight of the non-sensitive word according to the similarity between the non-sensitive word and the private data attribute.
  • The weight determination module is specifically configured to:
  • use a pre-trained word vector model to extract the word vector of the non-sensitive word and the word vector of the private data attribute.
  • The apparatus further includes:
  • a communication module, configured to obtain a training corpus matching the application scenario of the unstructured data; and
  • a training module, configured to train an initial word vector model using the training corpus to obtain the word vector model.
  • The apparatus further includes:
  • a replacement module, configured to identify sensitive words in the training corpus and replace the sensitive words with private data attributes;
  • the training module is specifically used for:
  • The apparatus further includes:
  • a privacy protection processing module, configured to determine the privacy protection mechanism for the unstructured data according to the degree of privacy of the unstructured data, and to use that mechanism to protect the privacy of the unstructured data.
  • the present application provides a device including a processor and a memory.
  • the processor and the memory communicate with each other.
  • the processor is configured to execute instructions stored in the memory, so that the device executes the unstructured data processing method in the first aspect or any implementation manner of the first aspect.
  • the present application provides a computer-readable storage medium having instructions stored in the computer-readable storage medium.
  • the processing method of unstructured data is not limited to:
  • The present application provides a computer program product containing instructions that, when run on a device, enable the device to execute the unstructured data processing method described in the first aspect or any implementation of the first aspect.
  • FIG. 1 is an architecture diagram of an unstructured data processing system provided by an embodiment of this application.
  • FIG. 2 is an architecture diagram of an unstructured data processing system provided by an embodiment of this application.
  • FIG. 3 is a flowchart of a method for processing unstructured data according to an embodiment of this application.
  • FIG. 4 is a schematic diagram of determining the weight of non-sensitive words according to an embodiment of this application.
  • FIG. 5 is a schematic structural diagram of an apparatus for processing unstructured data according to an embodiment of this application.
  • FIG. 6 is a schematic structural diagram of a device provided by an embodiment of this application.
  • The terms "first" and "second" in the embodiments of the present application are used for descriptive purposes only and cannot be understood as indicating or implying relative importance or implicitly indicating the number of indicated technical features. Thus, a feature defined with "first" or "second" may explicitly or implicitly include one or more of these features.
  • Unstructured data refers to data whose structure is irregular or incomplete, that has no predefined data model, and that cannot conveniently be expressed and implemented logically with a two-dimensional database table.
  • The formats of unstructured data are diverse.
  • Unstructured data may include documents or text in various formats.
  • A word vector is also called a word embedding.
  • A word vector is a vector formed by mapping a word to a lower-dimensional continuous vector space.
  • A word vector can usually be represented by a sequence of real numbers.
  • This representation can be understood as a neural-network-based distributed representation, which retains the semantic features of words.
  • The industry has proposed a privacy protection method for unstructured data such as personal resumes, medical records, and office documents.
  • First, the private data in the unstructured data is identified. The proportion of private data characters is then determined from the ratio of the number of private data characters to the total number of characters in the unstructured data, or the proportion of private data items is determined from the ratio of the number of private data items to the total number of words in the unstructured data.
  • The degree of privacy is then graded by the proportion of private data characters or the proportion of private data items.
  • Finally, a corresponding privacy protection mechanism is adopted based on the privacy level, such as desensitizing all private data in the unstructured data.
  • However, unstructured data such as personal resumes, medical records, and office documents may also include words that are highly similar to private data or that point strongly to private data. Even if all private data in the unstructured data is desensitized, information related to some private data may still be inferred from these words, so the unstructured data is incompletely desensitized and private information leaks to a certain extent.
  • The above privacy protection processing considers only the private data. Although the private data is masked to a certain extent, the sentence is not fully desensitized. Specifically, because the semantics of the sentence are not considered, the desensitized sentence remains semantically complete and its degree of privacy is not minimized. For example, "name" points strongly to the private data attribute name; "graduated from" points strongly to the private data attribute school; and "stay" and "work" point strongly to the private data attribute workplace. From these strongly indicative words, it is possible to infer information about the desensitized private data or the wishes expressed by the person.
  • In addition, grading the privacy level by the proportion of private data items or private data characters is not very accurate, which makes it difficult for a privacy protection mechanism chosen on the basis of that level to achieve a good privacy protection effect.
  • An embodiment of the present application provides a processing method for unstructured data.
  • The method can be executed by a processing system for unstructured data.
  • The unstructured data processing system first performs word segmentation on the unstructured data to obtain a word segmentation result, and then takes into account the semantic property that context words in unstructured data are strongly related to one another.
  • In addition to determining the weight of the sensitive words, the processing system determines the weight of each non-sensitive word, that is, each word other than the sensitive words, based on the similarity between the non-sensitive word and the private data attributes. The weights of the sensitive words and the weights of the non-sensitive words are then used to determine the degree of privacy of the unstructured data.
  • The above method takes unstructured data as a whole and considers not only private data, that is, sensitive words, but also non-sensitive words that have a contextual relationship with sensitive words. Determining the degree of privacy of unstructured data from sensitive and non-sensitive words together makes the evaluation more accurate and comprehensive. Furthermore, the method can more accurately adopt the privacy protection mechanism of the corresponding level.
  • The processing system for unstructured data can be deployed in a cloud environment, specifically on one or more computing devices (for example, a central server) in the cloud environment.
  • The system may also be deployed in an edge environment, specifically on one or more computing devices (edge computing devices) in the edge environment; the edge computing devices may be servers.
  • The system can also be deployed on end-side devices (i.e., end devices), including but not limited to desktop computers, notebook computers, and smartphones.
  • The cloud environment is a central computing device cluster owned by a cloud service provider and used to provide computing, storage, and communication resources.
  • The edge environment is an edge computing device cluster that is geographically close to the end-side device and is used to provide computing, storage, and communication resources.
  • The end-side device can act as a data providing device that supplies unstructured data, so that the unstructured data processing system can process the data to determine its privacy level and then apply the corresponding privacy protection mechanism based on that level.
  • The end-side device can provide unstructured data that it generated or stored itself for processing by the unstructured data processing system.
  • The end-side device may also be a network device, for example, a terminal device that accesses the network. In this case, the end-side device may obtain unstructured data from the network and provide it to the unstructured data processing system.
  • When the unstructured data processing system is deployed in a cloud environment or an edge environment, it can be provided to users as a service. Specifically, the user can access the cloud environment or the edge environment through a browser, create an instance of the unstructured data processing system there, and then interact with that instance through the browser to process unstructured data.
  • The processing system for unstructured data can also be deployed on end-side devices.
  • In that case, the system can be provided to users in the form of a client. Specifically, the user runs the client to process unstructured data.
  • Because the processing system for unstructured data includes multiple parts (for example, multiple subsystems, each of which includes multiple unit modules), the parts of the system can also be deployed in different environments in a distributed manner.
  • For example, the parts of the processing system can be deployed across all three of the cloud environment, the edge environment, and the end-side device, or across any two of them.
  • The method includes:
  • The unstructured data processing system performs word segmentation on the unstructured data to obtain a word segmentation result.
  • The unstructured data processing system can segment the unstructured data using any one or more of a string-matching-based word segmentation method, an understanding-based word segmentation method, and a statistics-based word segmentation method to obtain the word segmentation result.
  • The string-matching-based method matches the string to be analyzed against the entries of a machine dictionary according to a set strategy. If a string is found in the dictionary, the match succeeds and a word is recognized. The matching operation is then repeated until the unstructured data has been segmented.
  • String matching can be performed in different directions, so string-matching-based methods can be divided into the forward maximum matching method and the reverse maximum matching method.
  • String matching can also give priority to different match lengths, so these methods can be divided into the longest matching method and the shortest matching method.
  • Depending on whether segmentation is combined with part-of-speech tagging, they can further be divided into the simple word segmentation method and the integrated method that combines word segmentation and part-of-speech tagging.
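  • As an illustration of string-matching-based segmentation, the following is a minimal sketch of forward maximum matching. The dictionary, the unspaced sample string, and the function name `forward_maximum_matching` are hypothetical; characters that match no dictionary entry are emitted as single-character tokens, which is one common fallback and an assumption here.

```python
def forward_maximum_matching(text, dictionary, max_len=None):
    """Segment text left to right, always taking the longest dictionary word."""
    if max_len is None:
        # Longest entry in the dictionary bounds the candidate length.
        max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a dictionary hit.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            # Fall back to a single character when nothing matches.
            if length == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += length
                break
    return tokens


# Hypothetical machine dictionary and unspaced input string.
dictionary = {"my", "name", "is", "zhang", "san", "zhangsan"}
segments = forward_maximum_matching("mynameiszhangsan", dictionary)
```

  • With this dictionary the result is `['my', 'name', 'is', 'zhangsan']`: the longest entry "zhangsan" is preferred over the shorter "zhang". A reverse maximum matching variant would scan from the end of the string instead.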
  • The understanding-based method recognizes words by simulating the comprehension of a sentence. Specifically, syntactic analysis and semantic analysis are performed together with word segmentation, and syntactic and semantic information is used to resolve ambiguity, so that unstructured data such as text can be segmented.
  • The statistics-based method uses a statistical machine learning model to learn word segmentation rules from a large amount of already segmented text, so that unseen text can be segmented.
  • Statistics-based methods include the maximum probability word segmentation method and the maximum entropy word segmentation method.
  • The statistical models used in these methods include the N-gram model, the Hidden Markov Model (HMM), the Maximum Entropy Model (MEM), and the Conditional Random Fields (CRF) model.
  • The unstructured data processing system may select a suitable word segmentation method based on the language and scenario of the unstructured data and use it to obtain the word segmentation result.
  • The processing system may also remove stop words after word segmentation to obtain the final word segmentation result.
  • S304: The unstructured data processing system determines the weight of the sensitive word in the word segmentation result, and determines the weight of the non-sensitive word according to the similarity between the non-sensitive word in the word segmentation result and the private data attribute.
  • The unstructured data processing system can identify the sensitive words in the word segmentation result; the remaining words are non-sensitive words. It can then determine the weight of the sensitive words and, according to the similarity between each non-sensitive word and the private data attribute, the weight of the non-sensitive words. A weight measures how important a sensitive or non-sensitive word is to the degree of privacy of the unstructured data as a whole.
  • A private data attribute describes the type of private data. For example, for the private data "Zhang San", the corresponding private data attribute is "name", and for the private data xx@yy.com, the corresponding private data attribute is "email address".
  • The definition of private data may differ between application scenarios. For example, information such as birthday or birthplace is considered private in some application scenarios, such as those governed by the General Data Protection Regulation (GDPR), and is not considered private in others, such as medical scenarios, as shown in the following table:
  • The unstructured data processing system can match the attribute of each word in the word segmentation result against the private data attributes defined by the privacy data template for the current application scenario, thereby determining whether each word in the word segmentation result is a sensitive word or a non-sensitive word.
  • Sensitive and non-sensitive words determined in this way have high accuracy.
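  • The template matching above can be sketched as follows. Everything named here is an assumption for illustration: the regular-expression recognizers in `ATTRIBUTE_PATTERNS`, the scenario names, and the contents of `PRIVACY_TEMPLATES` are hypothetical and only mirror the birthday example in the text.

```python
import re

# Hypothetical recognizers that map a token to a private data attribute.
ATTRIBUTE_PATTERNS = {
    "birthday": re.compile(r"^\d{4}-\d{2}-\d{2}$"),
    "email address": re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$"),
}

# Hypothetical privacy data templates: the attributes treated as private
# in each application scenario.
PRIVACY_TEMPLATES = {
    "gdpr": {"birthday", "email address"},
    "medical": {"email address"},
}


def classify(tokens, scenario):
    """Split tokens into (sensitive, non_sensitive) lists for one scenario."""
    private_attributes = PRIVACY_TEMPLATES[scenario]
    sensitive, non_sensitive = [], []
    for token in tokens:
        # Find the first attribute whose pattern matches the token, if any.
        attribute = next(
            (name for name, pattern in ATTRIBUTE_PATTERNS.items()
             if pattern.match(token)),
            None,
        )
        if attribute in private_attributes:
            sensitive.append(token)
        else:
            non_sensitive.append(token)
    return sensitive, non_sensitive


tokens = ["born", "1990-01-01", "email", "xx@yy.com"]
gdpr_result = classify(tokens, "gdpr")
medical_result = classify(tokens, "medical")
```

  • The same token list yields different sensitive words per scenario: the date is sensitive under the hypothetical "gdpr" template but not under the "medical" one, while the email address is sensitive in both.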
  • The unstructured data processing system can determine the weight of a sensitive word according to a set weight. For example, if the weight of sensitive words is set to a standard weight of 1, the weight of each sensitive word is 1.
  • The weight of a non-sensitive word is determined from the similarity between the non-sensitive word and the private data attribute, according to the correspondence between similarity and weight.
  • The higher the similarity between the non-sensitive word and the private data attribute, the greater the weight of the non-sensitive word.
  • The lower the similarity between the non-sensitive word and the private data attribute, the smaller the weight of the non-sensitive word.
  • For example, the unstructured data includes the sentence "My name is Zhang San".
  • Based on the private data attributes, the processing system determines that "Zhang San" is a sensitive word and "name" is a non-sensitive word. It determines that the similarity between the non-sensitive word "name" and the private data attribute "name" is 0.9999. According to the correspondence between similarity and weight ratio, the weight ratio is determined to be 0.8, so the weight of "Zhang San" is 1 and the weight of "name" is 0.8.
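  • The mapping from similarity to weight in this example can be sketched as a step function. The only pair taken from the text is similarity 0.9999 giving weight 0.8; the other thresholds, the floor value, and the names `SIMILARITY_TO_WEIGHT` and `non_sensitive_weight` are hypothetical placeholders, not the correspondence defined in the application.

```python
# Hypothetical correspondence between similarity and weight, ordered from
# the highest threshold down. Only "similarity 0.9999 -> weight 0.8" comes
# from the worked example; every other number is an assumption.
SIMILARITY_TO_WEIGHT = [
    (0.9, 0.8),
    (0.6, 0.5),
    (0.3, 0.2),
]
FLOOR_WEIGHT = 0.05  # assumed weight for very dissimilar words

SENSITIVE_WEIGHT = 1.0  # the set standard weight for sensitive words


def non_sensitive_weight(similarity):
    """Return the weight of a non-sensitive word for a given similarity."""
    for threshold, weight in SIMILARITY_TO_WEIGHT:
        if similarity >= threshold:
            return weight
    return FLOOR_WEIGHT


# "My name is Zhang San": one sensitive word and one non-sensitive word.
weights = {
    "Zhang San": SENSITIVE_WEIGHT,
    "name": non_sensitive_weight(0.9999),
}
```

  • The function is monotonic, which matches the stated rule that a higher similarity gives a greater weight, and it keeps every weight strictly between 0 and 1.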
  • the unstructured data processing system determines the degree of privacy of the unstructured data through the weight of the sensitive word and the weight of the non-sensitive word.
  • the unstructured data processing system can obtain the degree of privacy of the unstructured data by performing weighted aggregation on the weights of all sensitive words and the weights of all non-sensitive words.
  • the formula for calculating the degree of privacy is as follows:
  • privacylevel represents the degree of sensitivity, also known as the sensitivity level.
  • n is the total number of sensitive words and non-sensitive words.
  • g_i represents the sensitivity value of the i-th word in the unstructured data, as follows:
  • I_i is the similarity between the non-sensitive word and the private data attribute when the i-th word is a non-sensitive word.
  • ⁇ i is the weight of the non-sensitive word, which represents the influence of the non-sensitive word on the degree of privacy of unstructured data.
  • the value range of ⁇ i is (0, 1), which is specifically determined according to the similarity between the attributes of non-sensitive words and private data.
  • The similarity between a non-sensitive word and the private data attribute and the weight of that non-sensitive word have the following correspondence:
  • The unstructured data processing system determines the weight of the non-sensitive words based on formula (3) above, and determines the degree of privacy of the unstructured data based on the weight of the sensitive words and the weight of the non-sensitive words.
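  • One possible reading of the weighted aggregation is sketched below. It rests on two explicit assumptions that the text does not confirm: the sensitivity value of a sensitive word is taken to be its set weight, and the sensitivity value of a non-sensitive word is taken to be its weight multiplied by its similarity; the degree of privacy is then assumed to be the mean over all n words. The function names are hypothetical.

```python
def sensitivity_value(word):
    """Assumed sensitivity value g_i of one word."""
    if word["sensitive"]:
        # Assumption: a sensitive word contributes its set weight.
        return word["weight"]
    # Assumption: a non-sensitive word contributes weight * similarity.
    return word["weight"] * word["similarity"]


def privacy_level(words):
    """Assumed aggregation: mean sensitivity value over all n words."""
    if not words:
        return 0.0
    return sum(sensitivity_value(word) for word in words) / len(words)


# Values from the "My name is Zhang San" example, after stop word removal.
words = [
    {"text": "Zhang San", "sensitive": True, "weight": 1.0},
    {"text": "name", "sensitive": False, "weight": 0.8, "similarity": 0.9999},
]
level = privacy_level(words)
```

  • Under these assumptions the context word "name" raises the score of the sentence well above what a count of private data alone would give, which is the behavior the method is designed to capture.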
  • The embodiments of the present application provide a method for processing unstructured data.
  • The method takes unstructured data as a whole, considers the relationship between contexts in the unstructured data, and uses the similarity between non-sensitive words and private data attributes to determine the weight of the non-sensitive words.
  • The weights of the sensitive words and of the contextually related non-sensitive words are then used together to determine the degree of privacy of the text, which gives higher accuracy.
  • As a result, this method can more accurately determine the privacy protection mechanism of the corresponding level.
  • Using this privacy protection mechanism to protect unstructured data not only avoids the direct disclosure of private information through private data, but also effectively prevents the indirect disclosure of private information through semantics, and thus better protects private information.
  • The embodiment of this application also designs an attack scenario for verification.
  • In the attack scenario, the same masking desensitization is applied to all private data in the unstructured data, which is text here. Specifically, all private data is replaced with spaces, and the words in the context of the private data are then used to guess the private data. The higher the probability of guessing correctly, the more private information an attacker can obtain from the text.
  • A high probability indicates that the applied privacy protection mechanism is not strong enough and that desensitization is insufficient. Therefore, if the text privacy level is not accurate enough, text data with a high privacy level may be desensitized with a low-level privacy protection mechanism, so the desensitization is incomplete and the desensitized text data may still reveal private information.
  • Predicting private data in designed attack scenarios can therefore verify the privacy ranking of texts.
  • The privacy grading method proposed in the embodiments of this application and the traditional method are each used to calculate text privacy levels and to rank the texts. The method whose ranking is closer to the ranking obtained through attack scenario verification reflects the privacy level of the text more accurately.
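  • The masking step of the attack scenario and the measurement of guessing success can be sketched as follows. The token list, the attacker's guesses, and the function names are hypothetical; how the attacker actually derives guesses from context words is outside this sketch.

```python
def mask_private_data(tokens, sensitive_words):
    """Replace every sensitive token with a single space."""
    return [" " if token in sensitive_words else token for token in tokens]


def guess_success_rate(original, masked, guesses):
    """Fraction of masked positions that the attacker guessed correctly.

    `guesses` maps a masked position to the attacker's guess for it.
    """
    masked_positions = [i for i, token in enumerate(masked) if token == " "]
    if not masked_positions:
        return 0.0
    correct = sum(
        1 for i in masked_positions if guesses.get(i) == original[i]
    )
    return correct / len(masked_positions)


tokens = ["my", "name", "is", "Zhang San", "email", "xx@yy.com"]
masked = mask_private_data(tokens, {"Zhang San", "xx@yy.com"})

# Hypothetical attacker output: one correct guess and one wrong guess.
guesses = {3: "Zhang San", 5: "aa@bb.com"}
rate = guess_success_rate(tokens, masked, guesses)
```

  • Documents can then be ranked by this success rate to obtain the verification ranking against which the grading methods are compared.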
  • The closeness of two rankings can be measured by the mean square error (MSE): $\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(x_i - y_i)^2$,
  • where n is the number of documents, and x and y are the two ranking lists of the documents' privacy degrees that are being compared.
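The ranking comparison can be illustrated with a short Python sketch. It assumes the standard definition of mean square error over two equal-length rank lists; the function name `ranking_mse` and the rank values are hypothetical and are not taken from Table 3.

```python
def ranking_mse(x, y):
    """Mean square error between two ranking lists of equal length."""
    if len(x) != len(y):
        raise ValueError("ranking lists must have the same length")
    if not x:
        raise ValueError("ranking lists must not be empty")
    n = len(x)  # number of documents
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y)) / n


# Hypothetical ranks of four documents (illustrative values only).
reference = [1, 2, 3, 4]      # ranking obtained from the attack-scenario verification
candidate_a = [1, 3, 2, 4]    # ranking produced by one grading method
candidate_b = [4, 3, 2, 1]    # ranking produced by another grading method

mse_a = ranking_mse(reference, candidate_a)
mse_b = ranking_mse(reference, candidate_b)
```

Under this measure, the grading method with the smaller value is the one whose ranking lies closer to the verification ranking.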
  • According to the rankings in Table 3, the MSE can be calculated:
  • the similarity-based privacy grading method proposed in the embodiment of this application is closer to the ranking of the verification method.
  • the method proposed in the embodiment of the present application can more accurately classify the degree of privacy of unstructured data.
  • The embodiment of this application introduces the natural language processing (NLP) technique of calculating word similarity from word vectors, and uses it to calculate the similarity between non-sensitive words and private data attributes.
  • The unstructured data processing system can extract the word vectors of the non-sensitive words and the word vectors of the private data attributes, for example by inputting the non-sensitive words and the private data attributes into a pre-trained word vector model. The similarity between a non-sensitive word and a private data attribute is then determined from the distance between their word vectors. Finally, the weight of the non-sensitive word is determined from this similarity, based on the correspondence between similarity and weight (for example, the correspondence shown in formula (3)).
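A minimal sketch of the similarity-to-weight step follows. It makes two assumptions that are not stated in this section: cosine similarity is used as the vector-space measure, and a simple threshold rule stands in for the correspondence of formula (3), which is not reproduced here. The function names, the threshold, and the toy vectors are all hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def non_sensitive_weight(word_vec, attribute_vecs, threshold=0.5):
    """Weight of a non-sensitive word from its best similarity to any attribute.

    Placeholder rule (an assumption, not formula (3)): keep the similarity
    as the weight when it reaches the threshold, otherwise use 0.0.
    """
    best = max(
        (cosine_similarity(word_vec, v) for v in attribute_vecs.values()),
        default=0.0,
    )
    return best if best >= threshold else 0.0


# Toy two-dimensional vectors standing in for the output of a word vector model.
attribute_vecs = {"NAME": [1.0, 0.0], "SCHOOL": [0.0, 1.0]}
weight_related = non_sensitive_weight([0.0, 2.0], attribute_vecs)     # points toward SCHOOL
weight_unrelated = non_sensitive_weight([-1.0, -1.0], attribute_vecs)  # points away from both
```

In practice the vectors would come from the trained word vector model rather than being written by hand.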
  • the word vector model can be specifically obtained by training methods such as word2vec.
  • the unstructured data processing system can construct an initial word vector model through word2vec, and use the training corpus to train the initial word vector model, thereby obtaining a word vector model for extracting word vectors.
  • the unstructured data processing system can obtain a training corpus that matches the application scenario of the unstructured data, and then use the specific training corpus to train the initial word vector model to obtain the word vector model.
  • Words corresponding to the same private data attribute often have similar contexts, but the private data words themselves are highly varied.
  • For example, the private data words corresponding to a name can be "Zhang San", "Li Si", "Wang Wu", and so on. Many private data words may appear only a few times, so a word vector model trained directly on the training corpus is not accurate enough.
  • The unstructured data processing system can therefore also preprocess the training corpus. Specifically, it identifies the sensitive words in the training corpus, replaces each sensitive word with its private data attribute, and then uses the replaced training corpus to train the initial word vector model to obtain the word vector model.
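The replacement step can be sketched as follows. The sketch assumes that the corpus is already segmented into tokens and that sensitive words are recognized by a lookup table; the table `lexicon`, the attribute label `NAME`, and the function name are hypothetical stand-ins for whatever recognition step is actually used.

```python
def replace_sensitive_words(tokens, sensitive_lexicon):
    """Replace each recognized sensitive word with its private data attribute label."""
    return [sensitive_lexicon.get(token, token) for token in tokens]


# Hypothetical lexicon mapping sensitive words to their private data attribute.
lexicon = {"Zhang San": "NAME", "Li Si": "NAME", "Wang Wu": "NAME"}

corpus = [
    ["my", "name", "is", "Zhang San"],
    ["my", "name", "is", "Li Si"],
]

# After replacement, rare names collapse into one frequent token, so the
# word vector model sees many more contexts for the attribute itself.
prepared_corpus = [replace_sensitive_words(sentence, lexicon) for sentence in corpus]
```

The prepared corpus, rather than the raw one, would then be passed to the word vector training step.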
  • the device 500 includes: a word segmentation module 502, configured to segment the unstructured data to obtain a word segmentation result;
  • the weight determination module 504 is configured to determine the weight of the sensitive word in the word segmentation result, and to determine the weight of the non-sensitive word according to the similarity between the non-sensitive word in the word segmentation result and the private data attribute;
  • the degree of privacy determination module 506 is configured to determine the degree of privacy of the unstructured data through the weight of the sensitive word and the weight of the non-sensitive word.
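The three modules can be pictured as a small pipeline. The sketch below is only an illustration under stated assumptions: whitespace splitting stands in for a real word segmenter, sensitive words receive a fixed weight of 1.0, the similarity of a non-sensitive word is supplied by a caller-provided function, and the degree of privacy is taken as the average weight. None of these choices is specified in this section, and all names are hypothetical.

```python
def segment(text):
    """Word segmentation module (assumption: whitespace-delimited text)."""
    return text.split()


def determine_weights(tokens, sensitive_words, similarity, sensitive_weight=1.0):
    """Weight determination module: returns (token, weight) pairs.

    Sensitive words get a fixed weight; non-sensitive words get a weight
    derived from their similarity to the private data attributes.
    """
    weighted = []
    for token in tokens:
        if token in sensitive_words:
            weighted.append((token, sensitive_weight))
        else:
            weighted.append((token, similarity(token)))
    return weighted


def privacy_degree(weighted_tokens):
    """Privacy degree module (assumption: mean of all token weights)."""
    if not weighted_tokens:
        return 0.0
    return sum(weight for _, weight in weighted_tokens) / len(weighted_tokens)


# Hypothetical similarity table for non-sensitive words.
similarity_table = {"name": 0.8}
tokens = segment("name is Alice")
weighted = determine_weights(tokens, {"Alice"}, lambda t: similarity_table.get(t, 0.0))
degree = privacy_degree(weighted)
```

Here the context word "name" raises the score even though it is not itself private data, which is the effect the weighting is meant to capture.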
  • the weight determination module 504 is specifically configured to:
  • the weight of the non-sensitive word is determined according to the similarity between the non-sensitive word and the private data attribute.
  • the weight determination module 504 is specifically configured to:
  • a pre-trained word vector model is used to extract the word vector of the non-sensitive word and the word vector of the private data attribute.
  • the apparatus 500 further includes:
  • a communication module for obtaining training corpus matching the application scenario of the unstructured data
  • the training module is used to train the initial word vector model using the training corpus to obtain the word vector model.
  • the device further includes:
  • the replacement module is used to identify sensitive words in the training corpus, and replace the sensitive words with private data attributes;
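  • the training module is specifically used for: training the initial word vector model with the replaced training corpus to obtain the word vector model.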
  • the device further includes:
  • the privacy protection processing module is configured to determine the privacy protection mechanism of the unstructured data according to the degree of privacy of the unstructured data, and use the privacy protection mechanism to protect the privacy of the unstructured data.
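One way to picture the selection of a privacy protection mechanism is a threshold lookup on the degree of privacy. The level boundaries and mechanism names below are purely hypothetical, since this section does not define the levels or the mechanisms attached to them.

```python
import bisect

# Hypothetical level boundaries on the degree of privacy and the
# mechanism assumed for each resulting level (lowest level first).
LEVEL_BOUNDS = [0.3, 0.6]
MECHANISMS = [
    "no_masking",
    "mask_sensitive_words",
    "mask_sensitive_and_related_words",
]


def select_mechanism(degree):
    """Return the mechanism name for a given degree of privacy."""
    level = bisect.bisect_right(LEVEL_BOUNDS, degree)
    return MECHANISMS[level]


low = select_mechanism(0.1)
medium = select_mechanism(0.45)
high = select_mechanism(0.9)
```

The point of the sketch is only that a more accurate degree of privacy leads to a more appropriate choice of level.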
  • The apparatus 500 for processing unstructured data can correspond to performing the method described in the embodiment of this application, and the above and other operations and/or functions of the modules/units of the apparatus 500 are intended to implement the corresponding processes of the methods in the embodiment shown in FIG. 3. For brevity, details are not repeated here.
  • the embodiment of the present application also provides a device 600.
  • the device 600 may be an end-side device such as a notebook computer and a desktop computer, and may also be a computer cluster in a cloud environment or an edge environment.
  • the device 600 is specifically used to implement the functions of the apparatus 500 for processing unstructured data in the embodiment shown in FIG. 5.
  • FIG. 6 provides a schematic structural diagram of a device 600.
  • the device 600 includes a bus 601, a processor 602, a communication interface 603, and a memory 604.
  • the processor 602, the memory 604, and the communication interface 603 communicate through a bus 601.
  • The bus 601 may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like.
  • the bus can be divided into address bus, data bus, control bus and so on. For ease of representation, only one thick line is used in FIG. 6, but it does not mean that there is only one bus or one type of bus.
  • The communication interface 603 is used to communicate with external devices, for example to obtain a training corpus that matches the application scenario of the unstructured data, or to obtain the unstructured data.
  • the processor 602 may be a central processing unit (CPU).
  • The memory 604 may include volatile memory, such as random access memory (RAM).
  • The memory 604 may also include non-volatile memory, such as read-only memory (ROM), flash memory, a hard disk drive (HDD), or a solid state drive (SSD).
  • the memory 604 stores executable code, and the processor 602 executes the executable code to execute the aforementioned unstructured data processing method.
  • The software or program code required to execute the functions of the word segmentation module 502, the weight determination module 504, and the privacy degree determination module 506 in FIG. 5 is stored in the memory 604.
  • the function of the communication module is implemented through the communication interface 603.
  • the communication interface 603 receives unstructured data and transmits it to the processor 602 via the bus 601.
  • The processor 602 executes the program code corresponding to each module stored in the memory 604, such as the program code corresponding to the word segmentation module 502, the weight determination module 504, and the privacy degree determination module 506, to perform word segmentation on the unstructured data, determine the weights of the sensitive words, determine the weights of the non-sensitive words according to the similarity between the non-sensitive words and the private data attributes, and then determine the degree of privacy of the unstructured data according to the weights of the sensitive words and the weights of the non-sensitive words.
  • The processor 602 may also execute the program code corresponding to the privacy protection processing module, to determine the privacy protection mechanism of the unstructured data according to the degree of privacy of the unstructured data and to use that mechanism to protect the privacy of the unstructured data.
  • An embodiment of the present application also provides a computer-readable storage medium, which includes instructions that instruct a computer to execute the above-mentioned unstructured data processing method applied to the unstructured data processing apparatus 500.
  • the embodiment of the present application also provides a computer program product.
  • When the computer program product is executed by a computer, the computer executes any one of the aforementioned methods for processing unstructured data.
  • The computer program product may be a software installation package. When any of the aforementioned methods for processing unstructured data needs to be used, the computer program product can be downloaded to and executed on the computer.

Abstract

An unstructured data processing method, comprising: performing word segmentation on unstructured data to obtain a word segmentation result; determining the weight of a sensitive word in the word segmentation result, and determining the weight of a non-sensitive word according to the similarity between the non-sensitive word in the word segmentation result and a private data attribute; and determining a privacy level of the unstructured data according to the weight of the sensitive word and the weight of the non-sensitive word. As the non-sensitive word having a contextual relationship is taken into consideration, the method has high accuracy for the classification of privacy levels, and by performing privacy protection processing on this basis, has a good privacy protection effect.

Description

Unstructured data processing method, apparatus, device, and medium
Technical field
This application relates to the field of artificial intelligence technology, and in particular to a method, an apparatus, a device, and a computer-readable storage medium for processing unstructured data.
Background
With the advent of the information age, data is growing explosively. Data can be divided into structured data and unstructured data. Structured data is data that is logically expressed and implemented using a table structure; it has a specific data format and is usually stored and managed in a relational database. Privacy protection mechanisms for structured data are already quite mature. Unstructured data, however, cannot be represented with a unified structure, which makes privacy protection difficult.
The industry has proposed some privacy protection methods for unstructured data. For example, the degree of privacy of unstructured data is graded according to the proportion of private data characters or the proportion of private data items in the unstructured data, and a corresponding privacy protection mechanism is then adopted based on the privacy level, such as desensitizing all private data in the text.
However, the above methods of grading the degree of privacy based on the proportion of private data characters or the proportion of private data items are not very accurate, so the privacy protection mechanism adopted on the basis of that privacy level can hardly achieve a good privacy protection effect.
Summary of the invention
This application provides a method for processing unstructured data. The method treats the unstructured data as a whole and determines the degree of privacy of the unstructured data jointly from the sensitive words and the non-sensitive words in it, which gives higher accuracy. Based on this degree of privacy, a corresponding privacy protection mechanism can be adopted for privacy protection processing, achieving a better privacy protection effect. This application also provides an apparatus, a device, a computer-readable storage medium, and a computer program product corresponding to the method.
In a first aspect, this application provides a method for processing unstructured data. The method can be implemented by an unstructured data processing system. The system can be deployed in a cloud environment, an edge environment, or an end device (that is, an end-side device). The cloud environment refers to a cluster of central computing devices owned by a cloud service provider and used to provide computing, storage, and communication resources; the edge environment refers to a cluster of edge computing devices that is geographically close to the end-side devices and used to provide computing, storage, and communication resources. When the system is deployed in a cloud environment or an edge environment, it can be provided to users in the form of a service. When the system is deployed on an end-side device, it can be provided to users in the form of a client. In some implementations, the unstructured data processing system includes multiple parts, and these parts can also be deployed in a distributed manner across different environments.
Specifically, the unstructured data processing system performs word segmentation on the unstructured data to obtain a word segmentation result, then determines the weights of the sensitive words in the word segmentation result and determines the weights of the non-sensitive words according to the similarity between the non-sensitive words in the word segmentation result and the private data attributes, and finally determines the degree of privacy of the unstructured data from the weights of the sensitive words and the weights of the non-sensitive words.
The method treats the unstructured data as a whole. It considers not only the private data, that is, the sensitive words, but also the non-sensitive words that have a contextual relationship with the sensitive words, and determines the degree of privacy of the unstructured data jointly from both, which makes its evaluation of the degree of privacy more accurate and more comprehensive. Further, the method can apply the privacy protection mechanism of the corresponding level more accurately, and therefore has a better privacy protection effect.
In some implementations, considering that the similarity of words can be measured by their distance in a vector space, the unstructured data processing system can also extract the word vector of the non-sensitive word and the word vector of the private data attribute, determine the similarity between the non-sensitive word and the private data attribute according to the distance between the two word vectors, and then determine the weight of the non-sensitive word according to that similarity.
This method introduces the natural language processing technique of calculating word similarity from word vectors and uses it to determine the similarity between non-sensitive words and private data attributes. Because word vectors retain semantic features, the similarity determined from them is highly reliable.
In some implementations, the unstructured data processing system can use a pre-trained word vector model to extract the word vector of the non-sensitive word and the word vector of the private data attribute. Extracting word vectors with a word vector model is efficient and accurate.
In some implementations, different application scenarios may define private data differently, and language use and expression vary greatly between application scenarios. As a result, the context of the same word may differ greatly between the corpora of different application scenarios. If a general training corpus is used to train the initial word vector model, the resulting word vector model may not be accurate. For this reason, the unstructured data processing system can also obtain a training corpus that matches the application scenario of the unstructured data and use it to train the initial word vector model to obtain the word vector model.
In some implementations, words corresponding to the same private data attribute often have similar contexts, but the private data words themselves are highly varied. For example, the private data words corresponding to a name can be "Zhang San", "Li Si", "Wang Wu", and so on. Many private data words may appear only a few times, so a word vector model trained directly on the training corpus is not accurate enough. To train a better word vector model and calculate similarity more accurately, so that sensitivity weights can be assigned better, the unstructured data processing system can also preprocess the training corpus. Specifically, it identifies the sensitive words in the training corpus, replaces each sensitive word with its private data attribute, and then uses the replaced training corpus to train the initial word vector model to obtain the word vector model.
In some implementations, the unstructured data processing system can also determine the privacy protection mechanism of the unstructured data according to the degree of privacy of the unstructured data, and use that mechanism to protect the privacy of the unstructured data. This not only avoids the direct leakage of private information through private data, but also effectively prevents the indirect leakage of private information through semantics, and thus better protects private information.
In a second aspect, this application provides an apparatus for processing unstructured data. The apparatus includes:
a word segmentation module, configured to perform word segmentation on the unstructured data to obtain a word segmentation result;
a weight determination module, configured to determine the weight of the sensitive word in the word segmentation result, and to determine the weight of the non-sensitive word according to the similarity between the non-sensitive word in the word segmentation result and the private data attribute;
a privacy degree determination module, configured to determine the degree of privacy of the unstructured data from the weight of the sensitive word and the weight of the non-sensitive word.
In some implementations, the weight determination module is specifically configured to:
extract the word vector of the non-sensitive word and the word vector of the private data attribute;
determine the similarity between the non-sensitive word and the private data attribute according to the distance between the word vector of the non-sensitive word and the word vector of the private data attribute;
determine the weight of the non-sensitive word according to the similarity between the non-sensitive word and the private data attribute.
In some implementations, the weight determination module is specifically configured to:
extract the word vector of the non-sensitive word and the word vector of the private data attribute using a pre-trained word vector model.
In some implementations, the apparatus further includes:
a communication module, configured to obtain a training corpus that matches the application scenario of the unstructured data;
a training module, configured to train an initial word vector model with the training corpus to obtain the word vector model.
In some implementations, the apparatus further includes:
a replacement module, configured to identify the sensitive words in the training corpus and replace the sensitive words with private data attributes;
the training module is specifically configured to:
train the initial word vector model with the replaced training corpus to obtain the word vector model.
In some implementations, the apparatus further includes:
a privacy protection processing module, configured to determine the privacy protection mechanism of the unstructured data according to the degree of privacy of the unstructured data, and to use the privacy protection mechanism to protect the privacy of the unstructured data.
In a third aspect, this application provides a device that includes a processor and a memory. The processor and the memory communicate with each other. The processor is configured to execute instructions stored in the memory, so that the device performs the method for processing unstructured data according to the first aspect or any implementation of the first aspect.
In a fourth aspect, this application provides a computer-readable storage medium that stores instructions, where the instructions instruct a device to perform the method for processing unstructured data according to the first aspect or any implementation of the first aspect.
In a fifth aspect, this application provides a computer program product containing instructions that, when run on a device, cause the device to perform the method for processing unstructured data according to the first aspect or any implementation of the first aspect.
On the basis of the implementations provided in the above aspects, this application may further combine them to provide more implementations.
Description of the drawings
To describe the technical methods of the embodiments of this application more clearly, the drawings used in the embodiments are briefly introduced below.
FIG. 1 is an architecture diagram of an unstructured data processing system according to an embodiment of this application;
FIG. 2 is an architecture diagram of an unstructured data processing system according to an embodiment of this application;
FIG. 3 is a flowchart of a method for processing unstructured data according to an embodiment of this application;
FIG. 4 is a schematic diagram of determining the weight of a non-sensitive word according to an embodiment of this application;
FIG. 5 is a schematic structural diagram of an apparatus for processing unstructured data according to an embodiment of this application;
FIG. 6 is a schematic structural diagram of a device according to an embodiment of this application.
Detailed description
The terms "first" and "second" in the embodiments of this application are used for descriptive purposes only and shall not be understood as indicating or implying relative importance or implicitly indicating the number of the indicated technical features. Thus, a feature qualified by "first" or "second" may explicitly or implicitly include one or more such features.
Some technical terms used in the embodiments of this application are introduced first.
Unstructured data refers to data whose structure is irregular or incomplete, which has no predefined data model, and which is not convenient to express and implement logically with two-dimensional database tables. Unstructured data comes in diverse formats. As an example, unstructured data may include documents or text in various formats.
A word vector, also called a word embedding, is a vector formed by mapping a word into a lower-dimensional continuous vector space. A word vector can usually be represented by a sequence of real numbers. This representation can be understood as a neural-network-based distributed representation that retains the semantic features of words.
For unstructured data such as resumes, medical records, and office documents, the industry has proposed a privacy protection method. Specifically, the private data present in the unstructured data are identified according to the definition of private data. The proportion of private data characters is determined from the ratio of the number of bits of private data to the total number of bits of the unstructured data, or the proportion of private data items is determined from the ratio of the number of private data items to the total number of words in the unstructured data. The degree of privacy is then graded by the proportion of private data characters or the proportion of private data items. Next, a corresponding privacy protection mechanism is adopted based on the privacy level, such as desensitizing all private data in the unstructured data.
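The two ratios used by this existing approach can be sketched in a few lines of Python. The sketch assumes that the text is already segmented into tokens and that the private data items have already been identified; it counts characters rather than bits as a simplification, and the function names are hypothetical.

```python
def private_count_ratio(tokens, private_items):
    """Ratio of the number of private data items to the total number of words."""
    if not tokens:
        return 0.0
    return sum(1 for token in tokens if token in private_items) / len(tokens)


def private_char_ratio(tokens, private_items):
    """Ratio of private data characters to all characters (stand-in for the bit ratio)."""
    total = sum(len(token) for token in tokens)
    if total == 0:
        return 0.0
    private = sum(len(token) for token in tokens if token in private_items)
    return private / total


tokens = ["name", "is", "Alice"]
private_items = {"Alice"}
count_ratio = private_count_ratio(tokens, private_items)  # 1 of 3 words
char_ratio = private_char_ratio(tokens, private_items)    # 5 of 11 characters
```

Both ratios depend only on the private data items themselves, so context words such as "name" do not influence the resulting grade.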
However, the above method of grading the degree of privacy based on the proportion of private data characters or the proportion of private data items ignores the correlation between context words. In addition to private data, unstructured data such as resumes, medical records, and office documents may contain words that are highly similar to private data or that point strongly to private data. Even if all private data in the unstructured data are desensitized during privacy protection, information about the private data may still be inferred from these words, so the unstructured data is not fully desensitized and private information is leaked to some extent.
For example, consider privacy protection processing of the sentence "My name is Wang Li, I graduated from the University of Finance and Economics, and I don't want to stay and work at Company X anymore." If the degree of privacy is graded only by the proportion of private data items or the proportion of private data characters, and the private data are desensitized, the desensitized sentence becomes "My name is **, I graduated from **, and I don't want to stay and work at ** anymore."
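The masking in this example can be reproduced with a short sketch. It assumes that the private data items have already been identified and uses plain string replacement; the function name and the English rendering of the sentence are illustrative only.

```python
def mask_private_data(text, private_items, mask="**"):
    """Replace every identified private data item in the text with a mask."""
    # Replace longer items first so that an item contained in a longer
    # one does not break the longer match.
    for item in sorted(private_items, key=len, reverse=True):
        text = text.replace(item, mask)
    return text


sentence = (
    "My name is Wang Li, I graduated from the University of Finance and "
    "Economics, and I don't want to stay and work at Company X anymore."
)
private_items = [
    "Wang Li",
    "the University of Finance and Economics",
    "Company X",
]
masked = mask_private_data(sentence, private_items)
```

The masked sentence still contains "name", "graduated from", "stay", and "work", which is why masking the private data alone leaves the sentence semantically intact.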
以上隐私保护处理仅考虑了隐私数据,虽然在一定程度上对隐私数据进行了掩盖,但句子脱敏却不完整。具体地,由于没有考虑到句子的语义问题,脱敏后的句子语义还 是完整的,句子的隐私程度并没有降到最低。其中,“名字”和隐私数据姓名有很大指向性;“毕业于”和隐私数据学校有很大指向性;“待在”及“工作”和隐私数据工作地点有很大指向性,也表达了人物的意愿;由这些指向性很高的单词就可以推断出被脱敏的隐私数据的相关信息或人物表达的意愿。The above privacy protection processing only considers the privacy data. Although the privacy data is covered to a certain extent, the sentence desensitization is not complete. Specifically, since the semantic problem of the sentence is not considered, the semantics of the desensitized sentence is still complete, and the degree of privacy of the sentence is not minimized. Among them, "name" and private data names have a lot of directivity; "graduated from" has a lot of directivity with the private data school; "stay" and "work" have a lot of directivity with the private data workplace, and express From these highly directional words, it is possible to infer the relevant information of the desensitized private data or the wishes expressed by the characters.
Therefore, the above grading method based on the proportion of private-data items or of private-data characters is not accurate enough, and a privacy protection mechanism selected on the basis of the resulting privacy level can hardly achieve good privacy protection.
In view of this, an embodiment of this application provides a method for processing unstructured data. The method may be executed by an unstructured data processing system. Specifically, the unstructured data processing system first performs word segmentation on the unstructured data to obtain a word segmentation result. Then, taking into account the semantic characteristic that context words in unstructured data are strongly correlated, the system also determines, for each non-sensitive word (that is, each word other than the sensitive words), a weight according to the similarity between that non-sensitive word and a privacy data attribute. The degree of privacy of the unstructured data is determined jointly from the weights of the sensitive words and the weights of the non-sensitive words.
The above method treats the unstructured data as a whole. It considers not only the private data, that is, the sensitive words, but also the non-sensitive words that have a contextual relationship with the sensitive words, and determines the degree of privacy of the unstructured data from both. This makes the evaluation of the degree of privacy more accurate and more comprehensive. Furthermore, a privacy protection mechanism of the corresponding level can be selected more accurately, which yields a better privacy protection effect.
As shown in FIG. 1, the unstructured data processing system may be deployed in a cloud environment, specifically on one or more computing devices (for example, a central server) in the cloud environment. The system may also be deployed in an edge environment, specifically on one or more computing devices (edge computing devices) in the edge environment; an edge computing device may be a server. The system may also be deployed on an end-side device (that is, an end device), including but not limited to a desktop computer, a notebook computer, or a smartphone.
The cloud environment refers to a cluster of central computing devices that is owned by a cloud service provider and used to provide computing, storage, and communication resources. The edge environment refers to a cluster of edge computing devices that is geographically close to the end-side devices and used to provide computing, storage, and communication resources.
An end-side device may serve as a data providing device that supplies unstructured data, so that the unstructured data processing system can process the data to determine its degree of privacy and then apply a corresponding privacy protection mechanism based on that degree. The end-side device may supply unstructured data that it generates or stores itself. In some implementations, the end-side device may be a network device, for example a terminal device connected to a network, in which case it may obtain unstructured data from the network and supply it to the unstructured data processing system.
When the unstructured data processing system is deployed in a cloud environment or an edge environment, it may be provided to users as a service. Specifically, a user may access the cloud environment or the edge environment through a browser, create an instance of the unstructured data processing system there, and then interact with that instance through the browser to process unstructured data.
The unstructured data processing system may also be deployed on an end-side device. In that case, the system may be provided to users as a client. Specifically, the user runs the client to process unstructured data.
In some implementations, as shown in FIG. 2, the unstructured data processing system includes multiple parts (for example, multiple subsystems, each including multiple unit modules), so the parts of the system may also be deployed in a distributed manner across different environments. For example, one part of the system may be deployed in each of the cloud environment, the edge environment, and the end device, or in any two of these three environments.
To make the technical solutions provided by the embodiments of this application clearer and easier to understand, the method for processing unstructured data is described below from the perspective of the unstructured data processing system.
Referring to the flowchart of the method for processing unstructured data shown in FIG. 3, the method includes:
S302: The unstructured data processing system performs word segmentation on the unstructured data to obtain a word segmentation result.
In a specific implementation, the unstructured data processing system may segment the unstructured data using any one or more of a string-matching-based word segmentation method, an understanding-based word segmentation method, and a statistics-based word segmentation method, to obtain the word segmentation result.
The string-matching-based word segmentation method matches the character string to be analyzed against the entries of a machine dictionary according to a set strategy. If a string is found in the dictionary, the match succeeds and a word is recognized. The matching operation is then repeated, thereby segmenting the unstructured data.
Further, when performing string matching, the unstructured data processing system may match in different directions; that is, string-matching-based segmentation can be divided into the forward maximum matching method and the reverse maximum matching method. The system may also give priority to different match lengths; that is, string-matching-based segmentation can be divided into the longest matching method and the shortest matching method. In addition, depending on whether segmentation is combined with part-of-speech tagging, the methods can be divided into pure word segmentation methods and integrated methods that combine word segmentation with part-of-speech tagging.
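Forward maximum matching, one of the string-matching strategies named above, can be sketched as follows. This is a minimal illustration with a toy dictionary, not the segmenter of this application; the function name `forward_max_match` and the dictionary contents are assumptions.

```python
def forward_max_match(text, dictionary, max_len=None):
    """Segment text left to right, taking the longest dictionary entry each time.

    Characters that match no entry are emitted as single-character tokens.
    """
    if max_len is None:
        max_len = max((len(w) for w in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink it one character at a time.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Toy dictionary (assumption) for the sentence "我的名字是张三" ("My name is Zhang San").
toy_dictionary = {"我", "的", "名字", "是", "张三"}
tokens = forward_max_match("我的名字是张三", toy_dictionary)
print(tokens)
```

Reverse maximum matching follows the same idea but scans from the end of the string toward the beginning.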
The understanding-based word segmentation method recognizes words by simulating human understanding of a sentence. Specifically, syntactic analysis and semantic analysis are performed together with segmentation, and the syntactic and semantic information is used to resolve ambiguity, thereby segmenting unstructured data such as text.
The statistics-based word segmentation method, given a large amount of already segmented text, uses a statistical machine learning model to learn the rules of word segmentation and then segments unseen text. Statistics-based methods include the maximum probability segmentation method and the maximum entropy segmentation method. The statistical model used by these methods is one of the N-gram model, the hidden Markov model (HMM), the maximum entropy model (MEM), and the conditional random field (CRF) model.
Specifically, the unstructured data processing system may select a suitable word segmentation method based on the language, scenario, and other characteristics of the unstructured data, and perform segmentation to obtain the word segmentation result.
In some implementations, to save storage space and improve processing efficiency, the unstructured data processing system may also remove stop words after segmentation to obtain the final word segmentation result.
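Stop-word removal after segmentation might look like the following sketch. The stop-word list here is a hypothetical three-word set chosen for illustration; a real list would depend on the language and the application scenario.

```python
# Hypothetical stop-word list (assumption); not taken from this application.
STOP_WORDS = {"我", "的", "是"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop stop words while keeping the order of the remaining tokens."""
    return [t for t in tokens if t not in stop_words]


filtered = remove_stop_words(["我", "的", "名字", "是", "张三"])
print(filtered)
```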
S304: The unstructured data processing system determines the weight of each sensitive word in the word segmentation result, and determines the weight of each non-sensitive word according to the similarity between that non-sensitive word and a privacy data attribute.
Specifically, the unstructured data processing system may identify the sensitive words in the word segmentation result; all other words in the result are non-sensitive words. The system then determines the weights of the sensitive words, and determines the weights of the non-sensitive words according to their similarity to the privacy data attributes. A weight measures how important a sensitive or non-sensitive word is to the degree of privacy of the unstructured data as a whole.
A privacy data attribute describes the type of a piece of private data. For example, for the private data "Zhang San", the corresponding privacy data attribute is "name"; for the private data xx@yy.com, the corresponding privacy data attribute is "email address".
The definition of private data may differ between application scenarios. For example, information such as a birthday or place of birth is treated as private in some scenarios, such as under the General Data Protection Regulation (GDPR), but not in others, such as a medical scenario, as shown in the following tables:
Table 1 Privacy data template in a medical scenario

| I | Name | Private? | XI | Bank card number | Private? |
|---|---|---|---|---|---|
| II | Email address | Yes | XII | Ethnicity | No |
| III | Mobile phone number | Yes | XIII | Political party | Yes |
| IV | Home phone number | Yes | XIV | IP address | Yes |
| V | Any address | Yes | XV | GPS information | Yes |
| VI | ID card number | Yes | XVI | DNA information | No |
| VII | Passport number | Yes | XVII | Fingerprint | No |
| VIII | License plate number | Yes | XVIII | Iris information | No |
| IX | Birthday | No | XIX | Disease diagnosis | No |
| X | Place of birth | No | | | |

Table 2 Privacy data template in a GDPR scenario

| I | Name | Private? | XI | Bank card number | Private? |
|---|---|---|---|---|---|
| II | Email address | Yes | XII | Ethnicity | Yes |
| III | Mobile phone number | Yes | XIII | Political party | Yes |
| IV | Home phone number | Yes | XIV | IP address | Yes |
| V | Any address | Yes | XV | GPS information | Yes |
| VI | ID card number | Yes | XVI | DNA information | Yes |
| VII | Passport number | Yes | XVII | Fingerprint | Yes |
| VIII | License plate number | Yes | XVIII | Iris information | Yes |
| IX | Birthday | Yes | XIX | Disease diagnosis | Yes |
| X | Place of birth | Yes | | | |

Based on this, when determining sensitive words, the unstructured data processing system may match the attribute of each word in the word segmentation result against the privacy data attributes defined by the privacy data template of the current application scenario, thereby determining whether each word in the result is a sensitive word or a non-sensitive word. Sensitive and non-sensitive words determined in this way are highly accurate.
Next, for a sensitive word, the unstructured data processing system may determine its weight according to a preset weight. For example, if the weight of sensitive words is preset to a standard weight such as 1, the weight of the sensitive word is obtained from that preset value.
For a non-sensitive word, the weight is determined according to the similarity between the non-sensitive word and the privacy data attribute, specifically according to a correspondence between similarity and weight. The higher the similarity between the non-sensitive word and the privacy data attribute, the greater the weight of the non-sensitive word; the lower the similarity, the smaller the weight.
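The scenario-dependent matching described above can be sketched as a lookup against a per-scenario template. The sketch below copies only a few rows of Table 1 and Table 2. It assumes each token already carries an attribute label produced by some upstream recognizer, which is not specified here, and the names `TEMPLATES` and `split_sensitive` and the sample tokens are illustrative.

```python
# Scenario templates: attribute -> whether it counts as private.
# Only a few rows of Table 1 (medical) and Table 2 (GDPR) are copied here.
TEMPLATES = {
    "medical": {"email address": True, "mobile phone number": True,
                "birthday": False, "place of birth": False},
    "gdpr": {"email address": True, "mobile phone number": True,
             "birthday": True, "place of birth": True},
}


def split_sensitive(tagged_tokens, scenario):
    """Split (word, attribute) pairs into sensitive and non-sensitive words.

    A word is sensitive only if its attribute is marked private in the
    template of the given scenario; words without an attribute are not.
    """
    template = TEMPLATES[scenario]
    sensitive, non_sensitive = [], []
    for word, attribute in tagged_tokens:
        if attribute is not None and template.get(attribute, False):
            sensitive.append(word)
        else:
            non_sensitive.append(word)
    return sensitive, non_sensitive


# Illustrative tokens; the attribute labels are assumed to come from an
# upstream recognizer, and the date value is made up for the example.
tagged = [("born", None), ("1990-01-01", "birthday"), ("xx@yy.com", "email address")]
print(split_sensitive(tagged, "medical"))
print(split_sensitive(tagged, "gdpr"))
```

The same birthday token is non-sensitive under the medical template and sensitive under the GDPR template, which is the scenario dependence the tables express.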
For ease of understanding, a specific example is described below. In this example, the unstructured data includes the sentence "我的名字是张三" ("My name is Zhang San"). Based on the privacy data attributes, the unstructured data processing system determines that "张三" (Zhang San) is a sensitive word and that "名字" (name) is a non-sensitive word. The similarity between the non-sensitive word "名字" and the privacy data attribute "姓名" (name) is calculated to be 0.9999. According to the correspondence between similarity and weight ratio, the weight ratio is determined to be 0.8. The weight of "张三" is therefore 1, and the weight of "名字" is 0.8.
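The worked example can be reproduced with a small sketch. The exact correspondence between similarity and weight is not reproduced in this text, so the bands below are placeholders: only the top band (similarity 0.9999 mapping to 0.8) is tied to the example, and `WEIGHT_BANDS`, `MIN_WEIGHT`, and the function names are hypothetical.

```python
# Placeholder similarity-to-weight bands (assumption). Only the first band is
# consistent with the worked example; the other values are illustrative.
WEIGHT_BANDS = [(0.9, 0.8), (0.6, 0.5), (0.3, 0.2)]
MIN_WEIGHT = 0.05       # assumed floor so that the weight stays above zero
SENSITIVE_WEIGHT = 1.0  # preset standard weight for sensitive words


def non_sensitive_weight(similarity):
    """Map a similarity score to a weight; higher similarity gives a higher weight."""
    for threshold, weight in WEIGHT_BANDS:
        if similarity >= threshold:
            return weight
    return MIN_WEIGHT


def assign_weights(sensitive_words, non_sensitive_similarities):
    """Return a word -> weight mapping for one piece of unstructured data."""
    weights = {word: SENSITIVE_WEIGHT for word in sensitive_words}
    for word, similarity in non_sensitive_similarities.items():
        weights[word] = non_sensitive_weight(similarity)
    return weights


weights = assign_weights(["张三"], {"名字": 0.9999})
print(weights)
```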
S306: The unstructured data processing system determines the degree of privacy of the unstructured data from the weights of the sensitive words and the weights of the non-sensitive words.
Specifically, the unstructured data processing system may obtain the degree of privacy of the unstructured data by performing a weighted aggregation of the weights of all sensitive words and all non-sensitive words.
In one example, the degree of privacy is calculated as follows:
[Formula (1): Figure PCTCN2021075680-appb-000001]
Here, privacylevel denotes the degree of sensitivity, also called the sensitivity level; n is the total number of sensitive and non-sensitive words; and g_i denotes the sensitivity value of the i-th word in the unstructured data, defined as follows:
[Formula (2): Figure PCTCN2021075680-appb-000002]
Here, I_i is the similarity between the i-th word and the privacy data attribute when the i-th word is a non-sensitive word, and α_i is the weight of that non-sensitive word, characterizing how strongly it influences the degree of privacy of the unstructured data. The value of α_i lies in the range (0, 1) and is determined according to the similarity between the non-sensitive word and the privacy data attribute.
In one example, the similarity between a non-sensitive word and the privacy data attribute corresponds to the weight of that non-sensitive word as follows:
[Formula (3): Figure PCTCN2021075680-appb-000003]
The unstructured data processing system determines the weight of each non-sensitive word based on formula (3) above, and determines the degree of privacy of the unstructured data based on the weights of the sensitive words and the weights of the non-sensitive words.
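Because the formulas are available here only as image references, the following sketch does not reproduce them. It assumes, purely for illustration, that the aggregation is a plain average of per-word sensitivity values over all n words, with a sensitive word contributing 1 and a non-sensitive word contributing its weight. The function name `privacy_level` is hypothetical.

```python
def privacy_level(sensitivity_values):
    """Average the per-word sensitivity values.

    Assumption: the aggregation is a simple mean over all n words. The
    actual formula of this application is not reproduced in this text.
    """
    if not sensitivity_values:
        return 0.0
    return sum(sensitivity_values) / len(sensitivity_values)


# "张三" (sensitive, 1.0) and "名字" (non-sensitive, 0.8) from the example sentence.
level = privacy_level([1.0, 0.8])
print(level)
```

Under this assumed aggregation, a sentence whose remaining words strongly indicate the masked attributes scores higher than one in which only the private data itself is counted.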
Based on the above description, the embodiments of this application provide a method for processing unstructured data. The method treats the unstructured data as a whole, considers the relationships between contexts in the unstructured data, and uses the similarity between the non-sensitive words and the privacy data attributes to determine the weights of the non-sensitive words. The degree of privacy of the text is determined jointly from the weights of the sensitive words and the weights of the contextually related non-sensitive words, which gives higher accuracy.
Moreover, with this method the privacy protection mechanism of the corresponding level can be determined more accurately. Protecting unstructured data with that mechanism both avoids direct leakage of private information through the private data itself and effectively prevents indirect leakage caused by semantics, so private information is better protected.
To verify that the privacy grading method proposed in this application evaluates the degree of privacy of unstructured data better than traditional methods, the embodiments of this application also design an attack scenario for verification.
Specifically, in the attack scenario, all private data in a text (the unstructured data) is desensitized by the same masking: every piece of private data is replaced with spaces. The words in the context of the private data are then used to guess the private data. The higher the probability of guessing correctly, the more private information an attacker can obtain from the text, which means that the level of the privacy protection mechanism in use is too low and the desensitization is incomplete. Therefore, if the grading of text privacy is not accurate enough, text data with a high privacy level may be desensitized with a low-level privacy protection mechanism, so that the desensitization is incomplete and the desensitized text may still leak private information.
Predicting the private data in the designed attack scenario yields a reference ranking of the degree of privacy of the texts. Correspondingly, the degree of privacy of each text is calculated with the grading method proposed in the embodiments of this application and with the traditional methods, and the texts are ranked by privacy level. The method whose ranking is closer to the ranking obtained from the attack scenario reflects the privacy level of the text more accurately.
The closeness of two rankings can be measured by the mean square error (MSE), which is calculated as follows:
[MSE formula: Figure PCTCN2021075680-appb-000004]
Here, n is the number of documents, and x and y are the two lists of document privacy rankings being compared.
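The comparison of two rankings can be sketched with the textbook definition of mean square error, that is, the average of the squared differences between corresponding entries. Whether the formula of this application uses exactly this normalization is an assumption, and the rankings below are made up for illustration; they are not the experimental data of this application.

```python
def mean_squared_error(x, y):
    """Textbook MSE between two equally long ranking lists."""
    if len(x) != len(y) or not x:
        raise ValueError("rankings must be non-empty and of equal length")
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)


# Made-up rankings of four documents, for illustration only.
reference_ranking = [1, 2, 3, 4]
candidate_ranking = [2, 1, 3, 4]
mse = mean_squared_error(reference_ranking, candidate_ranking)
print(mse)
```

A smaller value means the candidate ranking is closer to the reference ranking.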
The embodiments of this application provide the following experimental data:
Table 3 Privacy rankings determined by different methods
[Table 3 content: Figures PCTCN2021075680-appb-000005 and PCTCN2021075680-appb-000006]
Calculating the MSE from the rankings in Table 3 gives:
MSE(verification ranking, ranking of this application) = 6;
MSE(verification ranking, ranking by proportion of private-data items) = 12;
MSE(verification ranking, ranking by proportion of private-data characters) = 34.
It can be seen that, compared with the traditional grading methods based on the proportion of private-data items or of private-data characters, the similarity-based grading method proposed in the embodiments of this application produces a ranking closer to the verification ranking. The proposed method therefore grades the degree of privacy of unstructured data more accurately.
Considering the semantic characteristic that context words are correlated, the embodiments of this application adopt the method from natural language processing (NLP) of computing word similarity with word vectors, and use it to compute the similarity between a non-sensitive word and a privacy data attribute.
Specifically, as shown in FIG. 4, the unstructured data processing system may extract the word vector of the non-sensitive word and the word vector of the privacy data attribute, for example by inputting the non-sensitive word and the privacy data attribute into a pre-trained word vector model. The similarity between the non-sensitive word and the privacy data attribute is then determined according to the distance between the two word vectors. Next, the weight of the non-sensitive word is determined from that similarity, based on the correspondence between similarity and weight (for example, the correspondence shown in formula (3)).
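The distance-based similarity step can be sketched with cosine similarity, a common choice for word vectors. The text above only says that the similarity is derived from the distance between the vectors, so the choice of cosine similarity is an assumption, and the toy vectors stand in for the output of a pre-trained word vector model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy 3-dimensional vectors (assumption) in place of real model output.
vec_non_sensitive_word = [1.0, 2.0, 3.0]
vec_privacy_attribute = [2.0, 4.0, 6.0]
similarity = cosine_similarity(vec_non_sensitive_word, vec_privacy_attribute)
print(similarity)
```

The resulting score would then be passed to the similarity-to-weight correspondence to obtain the weight of the non-sensitive word.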
The word vector model may be obtained by training with a method such as word2vec. Specifically, the unstructured data processing system may construct an initial word vector model with word2vec and train it on a training corpus, thereby obtaining the word vector model used to extract word vectors.
Different application scenarios may define private data differently, and language use and expression vary considerably between scenarios, so the context of the same word may differ greatly between the corpora of different scenarios. If the initial word vector model is trained on a general-purpose corpus, the resulting word vector model may not be accurate enough. For this reason, the unstructured data processing system may obtain a training corpus that matches the application scenario of the unstructured data and train the initial word vector model on that specific corpus to obtain the word vector model.
Further, even within the corpus of a fixed application scenario, words corresponding to the same privacy data attribute tend to have similar contexts, but the private-data words themselves vary endlessly. For example, the private-data words for the attribute name may be "Zhang San", "Li Si", "Wang Wu", and so on. Many private-data words occur only rarely, so a word vector model trained directly on such a corpus is not accurate enough. To train a better word vector model, and thus compute similarity more accurately and assign sensitivity weights better, the unstructured data processing system may also preprocess the training corpus. Specifically, it identifies the sensitive words in the training corpus, replaces each sensitive word with its privacy data attribute, and then trains the initial word vector model on the replaced corpus to obtain the word vector model.
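The corpus preprocessing step can be sketched as a token-level substitution. The sketch assumes the corpus is already segmented and that a lookup from sensitive word to privacy data attribute is available; the names `replace_sensitive_words` and `word_to_attribute` and the two sample sentences are illustrative.

```python
def replace_sensitive_words(tokenized_corpus, word_to_attribute):
    """Replace each sensitive word with the label of its privacy data attribute."""
    return [[word_to_attribute.get(token, token) for token in sentence]
            for sentence in tokenized_corpus]


# Two toy sentences that differ only in the name (assumption, for illustration).
corpus = [["我", "的", "名字", "是", "张三"],
          ["我", "的", "名字", "是", "李四"]]
word_to_attribute = {"张三": "姓名", "李四": "姓名"}

replaced = replace_sensitive_words(corpus, word_to_attribute)
print(replaced)
```

After the substitution, the rare individual names collapse into a single frequent token for the attribute, so a model trained on the replaced corpus sees that attribute in many contexts. Training itself is not shown here.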
The method for processing unstructured data provided by the embodiments of this application has been described in detail above with reference to FIG. 1 to FIG. 4. The apparatus and device provided by the embodiments of this application are described below with reference to the accompanying drawings.
Referring to the schematic structural diagram of the apparatus for processing unstructured data shown in FIG. 5, the apparatus 500 includes:
a word segmentation module 502, configured to perform word segmentation on the unstructured data to obtain a word segmentation result;
a weight determination module 504, configured to determine the weight of each sensitive word in the word segmentation result, and to determine the weight of each non-sensitive word in the word segmentation result according to the similarity between that non-sensitive word and a privacy data attribute; and
a privacy degree determination module 506, configured to determine the degree of privacy of the unstructured data from the weights of the sensitive words and the weights of the non-sensitive words.
In some implementations, the weight determination module 504 is specifically configured to:
extract the word vector of the non-sensitive word and the word vector of the privacy data attribute;
determine the similarity between the non-sensitive word and the privacy data attribute according to the distance between the word vector of the non-sensitive word and the word vector of the privacy data attribute; and
determine the weight of the non-sensitive word according to the similarity between the non-sensitive word and the privacy data attribute.
In some implementations, the weight determination module 504 is specifically configured to:
extract the word vector of the non-sensitive word and the word vector of the privacy data attribute using a pre-trained word vector model.
In some implementations, the apparatus 500 further includes:
a communication module, configured to obtain a training corpus that matches the application scenario of the unstructured data; and
a training module, configured to train an initial word vector model on the training corpus to obtain the word vector model.
In some implementations, the apparatus further includes:
a replacement module, configured to identify the sensitive words in the training corpus and replace them with their privacy data attributes;
and the training module is specifically configured to:
train the initial word vector model on the replaced training corpus to obtain the word vector model.
In some implementations, the apparatus further includes:
a privacy protection processing module, configured to determine a privacy protection mechanism for the unstructured data according to the degree of privacy of the unstructured data, and to protect the privacy of the unstructured data using that mechanism.
The apparatus 500 for processing unstructured data according to the embodiments of this application may correspondingly perform the method described in the embodiments of this application, and the above and other operations and/or functions of the modules/units of the apparatus 500 respectively implement the corresponding procedures of the methods in the embodiment shown in FIG. 3. For brevity, details are not repeated here.
The embodiments of this application further provide a device 600. The device 600 may be an end-side device such as a notebook computer or a desktop computer, or a computer cluster in a cloud environment or an edge environment. The device 600 is specifically configured to implement the functions of the apparatus 500 for processing unstructured data in the embodiment shown in FIG. 5.
图6提供了一种设备600的结构示意图,如图6所示,设备600包括总线601、处理器602、通信接口603和存储器604。处理器602、存储器604和通信接口603之间通过总线601通信。总线601可以是外设部件互连标准(peripheral component interconnect,PCI)总线或扩展工业标准结构(extended industry standard architecture,EISA)总线等。总线可以分为地址总线、数据总线、控制总线等。为便于表示,图6中仅用一条粗线表示,但并不表示仅有一根总线或一种类型的总线。通信接口603用于与外部通信。例如,获取与非结构化数据的应用场景匹配的训练语料,或者获取非结构化数据等。FIG. 6 provides a schematic structural diagram of a device 600. As shown in FIG. 6, the device 600 includes a bus 601, a processor 602, a communication interface 603, and a memory 604. The processor 602, the memory 604, and the communication interface 603 communicate through a bus 601. The bus 601 may be a peripheral component interconnect standard (PCI) bus or an extended industry standard architecture (EISA) bus, etc. The bus can be divided into address bus, data bus, control bus and so on. For ease of representation, only one thick line is used in FIG. 6, but it does not mean that there is only one bus or one type of bus. The communication interface 603 is used to communicate with the outside. For example, obtaining training corpus that matches the application scenario of unstructured data, or obtaining unstructured data, etc.
其中,处理器602可以为中央处理器(central processing unit,CPU)。存储器604可以包括易失性存储器(volatile memory),例如随机存取存储器(random access memory,RAM)。存储器604还可以包括非易失性存储器(non-volatile memory),例如只读存储器(read-only memory,ROM),快闪存储器,HDD或SSD。The processor 602 may be a central processing unit (CPU). The memory 604 may include a volatile memory (volatile memory), such as a random access memory (random access memory, RAM). The memory 604 may also include a non-volatile memory (non-volatile memory), such as a read-only memory (ROM), flash memory, HDD or SSD.
存储器604中存储有可执行代码,处理器602执行该可执行代码以执行前述非结构化数据的处理方法。The memory 604 stores executable code, and the processor 602 executes the executable code to execute the aforementioned unstructured data processing method.
具体地,在实现图5所示实施例的情况下,且图5实施例中所描述的非结构化数据的处理装置500的各模块为通过软件实现的情况下,执行图5中的分词模块502、权重确定模块504和隐私程度确定模块506功能所需的软件或程序代码存储在存储器604中。通信模块功能通过通信接口603实现。通信接口603接收非结构化数据,将其通过总线601传输至处理器602,处理器602执行存储器604中存储的各模块对应的程序代码,如分词模块502、权重确定模块504和隐私程度确定模块506对应的程序代码,以执行对非结构化数据进行分词,然后确定敏感词的权重,以及根据非敏感词和隐私数据属性的相似度确定非敏感词的权重,再根据敏感词的权重和非敏感词的权重确定非结构化数据的隐私程度的操作。Specifically, in the case that the embodiment shown in FIG. 5 is implemented, and the modules of the apparatus 500 for processing unstructured data described in the embodiment of FIG. 5 are realized by software, the word segmentation module in FIG. 5 is executed. 502. The software or program codes required for the functions of the weight determination module 504 and the privacy degree determination module 506 are stored in the memory 604. The function of the communication module is implemented through the communication interface 603. The communication interface 603 receives unstructured data and transmits it to the processor 602 via the bus 601. The processor 602 executes the program code corresponding to each module stored in the memory 604, such as the word segmentation module 502, the weight determination module 504, and the privacy degree determination module 506 corresponding program code to perform word segmentation of unstructured data, and then determine the weight of sensitive words, and determine the weight of non-sensitive words according to the similarity of the attributes of non-sensitive words and private data, and then according to the weight and non-sensitive words of sensitive words The weight of sensitive words determines the degree of privacy of unstructured data.
当然,处理器602还可以执行隐私保护处理模块对应的程序代码,以执行根据非结构化数据的隐私程度确定非结构化数据的隐私保护机制,利用所述隐私保护机制对所述非结构化数据进行隐私保护的操作。Of course, the processor 602 may also execute the program code corresponding to the privacy protection processing module to execute a privacy protection mechanism for determining the unstructured data based on the degree of privacy of the unstructured data, and use the privacy protection mechanism to perform the unstructured data Perform privacy protection operations.
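The flow executed by the processor 602 can be sketched as three functions that stand in for modules 502, 504, and 506. Several choices here are assumptions rather than details from the text: whitespace splitting replaces a real word segmenter, the sensitive-word weights and the non-sensitive weight function are supplied by the caller with invented values, and the degree of privacy is taken as the mean of all word weights.

```python
from typing import Callable, Dict, List


def segment(text: str) -> List[str]:
    """Stand-in for word segmentation module 502 (whitespace split only)."""
    return text.lower().split()


def determine_weights(
    tokens: List[str],
    sensitive_weights: Dict[str, float],
    non_sensitive_weight: Callable[[str], float],
) -> List[float]:
    """Stand-in for weight determination module 504.

    Sensitive words take their predefined weight; every other word is
    weighted by the supplied similarity-based function.
    """
    return [
        sensitive_weights[t] if t in sensitive_weights else non_sensitive_weight(t)
        for t in tokens
    ]


def privacy_degree(weights: List[float]) -> float:
    """Stand-in for privacy degree determination module 506.

    Uses the mean weight, which is an assumption of this sketch.
    """
    return sum(weights) / len(weights) if weights else 0.0


# Hypothetical inputs for illustration.
sensitive = {"passport": 1.0}
similarity_weights = {"travel": 0.4}

tokens = segment("Passport travel plans")
weights = determine_weights(tokens, sensitive, lambda w: similarity_weights.get(w, 0.0))
degree = privacy_degree(weights)
```

Passing the non-sensitive weight function as a parameter keeps the pipeline independent of any particular word vector model or similarity measure.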
An embodiment of this application further provides a computer-readable storage medium. The computer-readable storage medium includes instructions that instruct a computer to perform the foregoing method for processing unstructured data as applied to the apparatus 500 for processing unstructured data.
An embodiment of this application further provides a computer program product. When the computer program product is executed by a computer, the computer performs any one of the foregoing methods for processing unstructured data. The computer program product may be a software installation package. When any one of the foregoing methods for processing unstructured data needs to be used, the computer program product may be downloaded and executed on a computer.

Claims (14)

  1. A method for processing unstructured data, characterized in that the method comprises:
    performing word segmentation on the unstructured data to obtain a word segmentation result;
    determining a weight of a sensitive word in the word segmentation result, and determining a weight of a non-sensitive word in the word segmentation result according to a similarity between the non-sensitive word and a private data attribute; and
    determining a degree of privacy of the unstructured data from the weight of the sensitive word and the weight of the non-sensitive word.
  2. The method according to claim 1, characterized in that the determining a weight of a non-sensitive word in the word segmentation result according to a similarity between the non-sensitive word and a private data attribute comprises:
    extracting a word vector of the non-sensitive word and a word vector of the private data attribute;
    determining the similarity between the non-sensitive word and the private data attribute according to a distance between the word vector of the non-sensitive word and the word vector of the private data attribute; and
    determining the weight of the non-sensitive word according to the similarity between the non-sensitive word and the private data attribute.
  3. The method according to claim 2, characterized in that the extracting a word vector of the non-sensitive word and a word vector of the private data attribute comprises:
    extracting the word vector of the non-sensitive word and the word vector of the private data attribute by using a pre-trained word vector model.
  4. The method according to claim 3, characterized in that the word vector model is obtained by training in the following manner:
    obtaining a training corpus that matches the application scenario of the unstructured data; and
    training an initial word vector model using the training corpus to obtain the word vector model.
  5. The method according to claim 4, characterized in that the method further comprises:
    identifying sensitive words in the training corpus, and replacing the sensitive words with private data attributes;
    wherein the training an initial word vector model using the training corpus to obtain the word vector model comprises:
    training the initial word vector model using the training corpus after replacement, to obtain the word vector model.
  6. The method according to any one of claims 1 to 5, characterized in that the method further comprises:
    determining a privacy protection mechanism for the unstructured data according to the degree of privacy of the unstructured data; and
    applying privacy protection to the unstructured data using the privacy protection mechanism.
  7. An apparatus for processing unstructured data, characterized in that the apparatus comprises:
    a word segmentation module, configured to perform word segmentation on the unstructured data to obtain a word segmentation result;
    a weight determination module, configured to determine a weight of a sensitive word in the word segmentation result, and to determine a weight of a non-sensitive word in the word segmentation result according to a similarity between the non-sensitive word and a private data attribute; and
    a privacy degree determination module, configured to determine a degree of privacy of the unstructured data from the weight of the sensitive word and the weight of the non-sensitive word.
  8. The apparatus according to claim 7, characterized in that the weight determination module is specifically configured to:
    extract a word vector of the non-sensitive word and a word vector of the private data attribute;
    determine the similarity between the non-sensitive word and the private data attribute according to a distance between the word vector of the non-sensitive word and the word vector of the private data attribute; and
    determine the weight of the non-sensitive word according to the similarity between the non-sensitive word and the private data attribute.
  9. The apparatus according to claim 8, characterized in that the weight determination module is specifically configured to:
    extract the word vector of the non-sensitive word and the word vector of the private data attribute by using a pre-trained word vector model.
  10. The apparatus according to claim 9, characterized in that the apparatus further comprises:
    a communication module, configured to obtain a training corpus that matches the application scenario of the unstructured data; and
    a training module, configured to train an initial word vector model using the training corpus to obtain the word vector model.
  11. The apparatus according to claim 10, characterized in that the apparatus further comprises:
    a replacement module, configured to identify sensitive words in the training corpus and replace the sensitive words with private data attributes;
    wherein the training module is specifically configured to:
    train the initial word vector model using the training corpus after replacement, to obtain the word vector model.
  12. The apparatus according to any one of claims 7 to 11, characterized in that the apparatus further comprises:
    a privacy protection processing module, configured to determine a privacy protection mechanism for the unstructured data according to the degree of privacy of the unstructured data, and to apply privacy protection to the unstructured data using the privacy protection mechanism.
  13. A device, characterized in that the device comprises a processor and a memory;
    the processor is configured to execute instructions stored in the memory, so that the device performs the method according to any one of claims 1 to 6.
  14. A computer-readable storage medium, characterized by comprising instructions that instruct a device to perform the method according to any one of claims 1 to 6.
PCT/CN2021/075680 2020-04-24 2021-02-06 Unstructured data processing method, apparatus, and device, and medium WO2021212968A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010331678.3 2020-04-24
CN202010331678.3A CN113553846A (en) 2020-04-24 2020-04-24 Method, device, equipment and medium for processing unstructured data

Publications (1)

Publication Number Publication Date
WO2021212968A1 true WO2021212968A1 (en) 2021-10-28

Family

ID=78101221

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/075680 WO2021212968A1 (en) 2020-04-24 2021-02-06 Unstructured data processing method, apparatus, and device, and medium

Country Status (2)

Country Link
CN (1) CN113553846A (en)
WO (1) WO2021212968A1 (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114065287A (en) * 2021-11-18 2022-02-18 南京航空航天大学 Track difference privacy protection method and system for resisting prediction attack
CN115664799A (en) * 2022-10-25 2023-01-31 江苏海洋大学 Data exchange method and system applied to information technology security
CN115828307A (en) * 2023-01-28 2023-03-21 广州佰锐网络科技有限公司 Text recognition method and AI system applied to OCR
CN116432243A (en) * 2023-06-15 2023-07-14 恺恩泰(南京)科技有限公司 Data desensitization method, device, equipment and storage medium for online mall
CN117034356A (en) * 2023-10-09 2023-11-10 成都乐超人科技有限公司 Privacy protection method and device for multi-operation flow based on hybrid chain
CN117591643A (en) * 2023-11-10 2024-02-23 杭州市余杭区数据资源管理局 Project text duplicate checking method and system based on improved structuring processing
CN117892358A (en) * 2024-03-18 2024-04-16 北方健康医疗大数据科技有限公司 Verification method and system for limited data desensitization method
CN117912624A (en) * 2024-03-15 2024-04-19 江西曼荼罗软件有限公司 Electronic medical record sharing method and system

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115618371B (en) * 2022-07-11 2023-08-04 上海期货信息技术有限公司 Non-text data desensitization method, device and storage medium
CN115512810A (en) * 2022-11-17 2022-12-23 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) Data management method and system for medical image data

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102012985A (en) * 2010-11-19 2011-04-13 国网电力科学研究院 Sensitive data dynamic identification method based on data mining
CN102184188A (en) * 2011-04-15 2011-09-14 百度在线网络技术(北京)有限公司 Method and equipment for determining sensitivity of target text
CN102426599A (en) * 2011-11-09 2012-04-25 中国人民解放军信息工程大学 Method for detecting sensitive information based on D-S evidence theory
US20140237620A1 (en) * 2011-09-28 2014-08-21 Tata Consultancy Services Limited System and method for database privacy protection

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114065287B (en) * 2021-11-18 2024-05-07 南京航空航天大学 Track differential privacy protection method and system for resisting predictive attack
CN114065287A (en) * 2021-11-18 2022-02-18 南京航空航天大学 Track difference privacy protection method and system for resisting prediction attack
CN115664799A (en) * 2022-10-25 2023-01-31 江苏海洋大学 Data exchange method and system applied to information technology security
CN115828307A (en) * 2023-01-28 2023-03-21 广州佰锐网络科技有限公司 Text recognition method and AI system applied to OCR
CN115828307B (en) * 2023-01-28 2023-05-23 广州佰锐网络科技有限公司 Text recognition method and AI system applied to OCR
CN116432243A (en) * 2023-06-15 2023-07-14 恺恩泰(南京)科技有限公司 Data desensitization method, device, equipment and storage medium for online mall
CN116432243B (en) * 2023-06-15 2023-08-25 恺恩泰(南京)科技有限公司 Data desensitization method, device, equipment and storage medium for online mall
CN117034356A (en) * 2023-10-09 2023-11-10 成都乐超人科技有限公司 Privacy protection method and device for multi-operation flow based on hybrid chain
CN117034356B (en) * 2023-10-09 2024-01-05 成都乐超人科技有限公司 Privacy protection method and device for multi-operation flow based on hybrid chain
CN117591643A (en) * 2023-11-10 2024-02-23 杭州市余杭区数据资源管理局 Project text duplicate checking method and system based on improved structuring processing
CN117591643B (en) * 2023-11-10 2024-05-10 杭州市余杭区数据资源管理局 Project text duplicate checking method and system based on improved structuring processing
CN117912624A (en) * 2024-03-15 2024-04-19 江西曼荼罗软件有限公司 Electronic medical record sharing method and system
CN117892358A (en) * 2024-03-18 2024-04-16 北方健康医疗大数据科技有限公司 Verification method and system for limited data desensitization method

Also Published As

Publication number Publication date
CN113553846A (en) 2021-10-26

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21793195

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21793195

Country of ref document: EP

Kind code of ref document: A1