WO2021212968A1 - Method, apparatus, and device for processing unstructured data, and medium - Google Patents

Method, apparatus, and device for processing unstructured data, and medium

Info

Publication number
WO2021212968A1
Authority
WO
WIPO (PCT)
Prior art keywords
word
sensitive
unstructured data
privacy
word vector
Prior art date
Application number
PCT/CN2021/075680
Other languages
English (en)
Chinese (zh)
Inventor
朱天清
朱运丽
霍正聃
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2021212968A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Definitions

  • This application relates to the field of artificial intelligence, and in particular to a method, an apparatus, a device, and a computer-readable storage medium for processing unstructured data.
  • Structured data is data that is logically expressed and realized using a table structure, has a specific data format, and usually uses a relational database for storage and management.
  • Privacy protection mechanisms for structured data are already fairly mature.
  • For unstructured data, the inability to adopt a unified structure for representation makes privacy protection difficult.
  • The industry has proposed some privacy protection methods for unstructured data. For example, the privacy level of unstructured data is graded according to the number of private data characters or the proportion of private data in the unstructured data, and a corresponding privacy protection mechanism is then adopted based on the privacy level, such as desensitizing all private data in the text.
  • This application provides a method for processing unstructured data.
  • The method treats unstructured data as a whole and determines its degree of privacy from both the sensitive words and the non-sensitive words it contains, which gives higher accuracy. A privacy protection mechanism matched to that degree of privacy can then be applied, achieving a better privacy protection effect.
  • This application also provides devices, equipment, computer-readable storage media, and computer program products corresponding to the above methods.
  • In a first aspect, this application provides a method for processing unstructured data.
  • This method can be implemented by a processing system for unstructured data.
  • The system can be deployed in a cloud environment, an edge environment, or an end device (i.e., an end-side device).
  • The cloud environment is a central computing device cluster owned by a cloud service provider and used to provide computing, storage, and communication resources.
  • The edge environment is an edge computing device cluster that is geographically closer to the end-side devices and is used to provide computing, storage, and communication resources.
  • When the system is deployed in a cloud environment or an edge environment, it can be provided to users in the form of a service.
  • When the system is deployed on an end-side device, it can be provided to users in the form of a client.
  • the unstructured data processing system includes multiple parts, and the multiple parts can also be distributed in different environments.
  • The unstructured data processing system performs word segmentation on the unstructured data to obtain a word segmentation result. It then determines the weight of the sensitive words in the result, determines the weight of each non-sensitive word according to the similarity between that word and the private data attributes, and uses the weights of the sensitive words and the non-sensitive words to determine the degree of privacy of the unstructured data.
  • This method considers unstructured data as a whole: it takes into account not only private data, that is, sensitive words, but also non-sensitive words that have a contextual relationship with sensitive words. Because the degree of privacy is determined from both sensitive and non-sensitive words, the evaluation is more accurate and comprehensive.
  • Further, the method can more accurately select a privacy protection mechanism of the corresponding level and therefore provides better privacy protection.
  • The unstructured data processing system can also extract the word vectors of the non-sensitive words and the word vectors of the private data attributes, determine the similarity between a non-sensitive word and a private data attribute according to the distance between their word vectors, and then determine the weight of the non-sensitive word according to that similarity.
  • This method introduces the method of calculating vocabulary similarity by using word vectors in natural language processing, and uses it to determine the similarity between non-sensitive words and private data attributes. Since the word vector retains the semantic feature, the similarity determined based on the semantic feature has high reliability.
  • the unstructured data processing system may use a pre-trained word vector model to extract the word vector of the non-sensitive word and the word vector of the private data attribute. Extracting word vectors through the word vector model has high efficiency and accuracy.
  • The definition of private data can differ between application scenarios, and language use and expression also vary greatly, so the context of the same word may differ substantially between the corpora of different application scenarios. If a general training corpus is used to train the initial word vector model, the accuracy of the resulting word vector model may be low. For this reason, the unstructured data processing system can also obtain a training corpus that matches the application scenario of the unstructured data and use it to train an initial word vector model to obtain the word vector model.
  • Words corresponding to the same private data attribute often share similar contexts, but the private data words themselves vary widely.
  • For example, the private data words corresponding to the attribute "name" can be "Zhang San", "Li Si", "Wang Wu", and so on. Many private data words may appear only a few times, so a word vector model trained directly on the training corpus is not accurate enough.
  • The unstructured data processing system can therefore also preprocess the training corpus. Specifically, it identifies the sensitive words in the training corpus, replaces each sensitive word with its private data attribute, and then uses the replaced training corpus to train an initial word vector model to obtain the word vector model.
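  • The corpus preprocessing described above can be sketched as follows. This is a minimal illustration, not the application's implementation: the function name `replace_sensitive_words`, the lexicon, and the `[NAME]` placeholder token are all assumptions introduced here. A placeholder is used instead of the bare attribute word so that the attribute token stays distinct from the ordinary word "name".

```python
def replace_sensitive_words(tokens, sensitive_lexicon):
    """Replace each sensitive token with the token of its private data attribute."""
    return [sensitive_lexicon.get(tok, tok) for tok in tokens]


# Hypothetical lexicon: sensitive word -> private data attribute token.
lexicon = {"Zhang San": "[NAME]", "Li Si": "[NAME]", "Wang Wu": "[NAME]"}

# Toy, already-segmented training corpus (illustrative data only).
corpus = [
    ["my", "name", "is", "Zhang San"],
    ["my", "name", "is", "Li Si"],
]

normalized = [replace_sensitive_words(sentence, lexicon) for sentence in corpus]
```

  After replacement, rare names collapse into one frequent attribute token, so the word vector model sees many more contexts for that token than it would for any single name.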
  • The unstructured data processing system may also determine the privacy protection mechanism for the unstructured data according to its degree of privacy, and use that mechanism to protect the privacy of the unstructured data.
  • This method can not only avoid the direct leakage of private information caused by private data, but also effectively prevent the indirect leakage of private information caused by semantic problems, and thus can better protect private information.
  • this application provides an apparatus for processing unstructured data.
  • the device includes:
  • the word segmentation module is used to segment the unstructured data to obtain the word segmentation result
  • a weight determination module configured to determine the weight of the sensitive words in the word segmentation result, and determine the weight of the non-sensitive words according to the similarity between the non-sensitive words in the word segmentation result and the private data attributes;
  • the degree of privacy determination module is used to determine the degree of privacy of the unstructured data through the weights of the sensitive words and the weights of the non-sensitive words.
  • the weight determination module is specifically configured to:
  • the weight of the non-sensitive word is determined according to the similarity between the attributes of the non-sensitive word and the private data.
  • the weight determination module is specifically configured to:
  • a pre-trained word vector model is used to extract the word vector of the non-sensitive word and the word vector of the private data attribute.
  • the device further includes:
  • a communication module for obtaining training corpus matching the application scenario of the unstructured data
  • the training module is used to train the initial word vector model using the training corpus to obtain the word vector model.
  • the device further includes:
  • the replacement module is used to identify sensitive words in the training corpus, and replace the sensitive words with private data attributes;
  • the training module is specifically used for:
  • the device further includes:
  • the privacy protection processing module is configured to determine the privacy protection mechanism of the unstructured data according to the degree of privacy of the unstructured data, and use the privacy protection mechanism to protect the privacy of the unstructured data.
  • the present application provides a device including a processor and a memory.
  • the processor and the memory communicate with each other.
  • the processor is configured to execute instructions stored in the memory, so that the device executes the unstructured data processing method in the first aspect or any implementation manner of the first aspect.
  • the present application provides a computer-readable storage medium having instructions stored in the computer-readable storage medium.
  • the processing method of unstructured data is not limited to:
  • the present application provides a computer program product containing instructions that, when run on a device, cause the device to execute the unstructured data processing method described in the first aspect or any implementation of the first aspect.
  • FIG. 1 is an architecture diagram of an unstructured data processing system provided by an embodiment of this application.
  • FIG. 2 is an architecture diagram of an unstructured data processing system provided by an embodiment of the application.
  • FIG. 3 is a flowchart of a method for processing unstructured data according to an embodiment of the application.
  • FIG. 4 is a schematic diagram of determining the weight of non-sensitive words according to an embodiment of the application.
  • FIG. 5 is a schematic structural diagram of an apparatus for processing unstructured data according to an embodiment of the application.
  • FIG. 6 is a schematic structural diagram of a device provided by an embodiment of the application.
  • The terms "first" and "second" in the embodiments of the present application are used only for descriptive purposes and should not be understood as indicating or implying relative importance or implicitly indicating the number of the indicated technical features. Thus, a feature defined with "first" or "second" may explicitly or implicitly include one or more of such features.
  • Unstructured data refers to data whose data structure is irregular or incomplete, without a predefined data model, and it is not convenient to use a two-dimensional database table to logically express and implement data.
  • the format of unstructured data is diverse.
  • unstructured data may include documents or text in various formats.
  • A word vector is also called a word embedding.
  • a word vector refers to a vector formed by mapping words to a lower-dimensional continuous vector space.
  • the word vector can usually be represented by a sequence of real numbers.
  • This representation of word vectors can be understood as a distributed representation based on neural networks, which retains the semantic features of words.
  • the industry has proposed a privacy protection method for unstructured data such as personal resumes, medical records, and office documents.
  • Specifically, the method identifies the private data present in the unstructured data and determines either the proportion of private data characters (the ratio of the number of private data characters to the total number of characters in the unstructured data) or the proportion of private data items (the ratio of the number of private data items to the total number of words in the unstructured data).
  • The degree of privacy is then graded by the proportion of private data characters or the proportion of private data items.
  • A corresponding privacy protection mechanism is adopted based on the privacy level, such as desensitizing all private data in the unstructured data.
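  • For comparison with the method proposed later, the proportion-based grading described above might look like the sketch below. The function names and the sample tokens are assumptions made for illustration; the prior-art method is not specified in this level of detail.

```python
def private_word_ratio(tokens, sensitive_words):
    """Share of tokens that are private data items."""
    if not tokens:
        return 0.0
    hits = sum(1 for tok in tokens if tok in sensitive_words)
    return hits / len(tokens)


def private_char_ratio(tokens, sensitive_words):
    """Share of characters that belong to private data items."""
    total = sum(len(tok) for tok in tokens)
    if total == 0:
        return 0.0
    private = sum(len(tok) for tok in tokens if tok in sensitive_words)
    return private / total


# Illustrative data only.
tokens = ["my", "name", "is", "Zhang San"]
sensitive_words = {"Zhang San"}

word_ratio = private_word_ratio(tokens, sensitive_words)
char_ratio = private_char_ratio(tokens, sensitive_words)
```

  Both ratios look only at the private data itself, so a context word such as "name" contributes nothing to the grade.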
  • However, unstructured data such as personal resumes, medical records, and office documents may also include words that are highly similar to private data or strongly indicative of it. Even if all the private data in the unstructured data is desensitized during privacy protection, these words may allow information related to the private data to be inferred, so that desensitization of the unstructured data is incomplete and private information is leaked to some extent.
  • The above privacy protection processing considers only the private data. Although the private data is masked to a certain extent, desensitization of the sentence is incomplete: because the semantics of the sentence are not considered, the desensitized sentence remains semantically complete and its degree of privacy is not minimized. For example, "name" strongly indicates the private data attribute name; "graduated from" strongly indicates the private data attribute school; and "stay" and "work" strongly indicate the private data attribute workplace. From these highly indicative words, it is possible to infer information related to the desensitized private data or the intentions expressed by the person.
  • As a result, grading the privacy level by the proportion of private data items or private data characters is not very accurate, which makes it difficult for a privacy protection mechanism chosen on that basis to achieve good privacy protection.
  • an embodiment of the present application provides a processing method for unstructured data.
  • the method can be executed by a processing system for unstructured data.
  • The unstructured data processing system first performs word segmentation on the unstructured data to obtain the word segmentation result. It then takes into account the strong semantic relevance between context words in the unstructured data.
  • In addition to determining the weight of the sensitive words, the processing system determines the weight of each non-sensitive word, that is, each word other than a sensitive word, based on the similarity between the non-sensitive word and the private data attributes. The weights of the sensitive words and the non-sensitive words are then used to determine the degree of privacy of the unstructured data.
  • The above method treats unstructured data as a whole: it considers not only private data, that is, sensitive words, but also non-sensitive words that have a contextual relationship with sensitive words. Because sensitive and non-sensitive words jointly determine the degree of privacy, the evaluation is more accurate and comprehensive. Furthermore, the method can more accurately select a privacy protection mechanism of the corresponding level.
  • the processing system for unstructured data can be deployed in a cloud environment, specifically on one or more computing devices (for example, a central server) in the cloud environment.
  • the system may also be deployed in an edge environment, specifically on one or more computing devices (edge computing devices) in the edge environment, and the edge computing devices may be servers.
  • The system can also be deployed on end-side devices (i.e., end devices), including but not limited to desktop computers, notebook computers, and smartphones.
  • The cloud environment is a central computing device cluster owned by a cloud service provider and used to provide computing, storage, and communication resources.
  • The edge environment is an edge computing device cluster that is geographically close to the end-side devices and is used to provide computing, storage, and communication resources.
  • The end-side device can act as a data providing device that supplies unstructured data, so that the unstructured data processing system can process the data to determine its privacy level and then apply the corresponding privacy protection mechanism based on that level.
  • the end-side device can provide unstructured data generated or stored by itself for processing by the unstructured data processing system.
  • the end-side device may be a network device, for example, a terminal device that accesses the network. In this way, the end-side device may obtain unstructured data from the network and provide it to the unstructured data processing system.
  • When the unstructured data processing system is deployed in a cloud environment or an edge environment, it can be provided to users as a service. Specifically, the user can access the cloud environment or the edge environment through a browser, create an instance of the unstructured data processing system there, and then interact with that instance through the browser to process unstructured data.
  • the processing system for unstructured data can also be deployed on end-side devices.
  • the processing system for unstructured data can be provided to users in the form of a client. Specifically, the user runs the client to realize the processing of unstructured data.
  • The processing system for unstructured data includes multiple parts (for example, multiple subsystems, each of which includes multiple unit modules), so the parts of the system can also be deployed in different environments in a distributed manner.
  • For example, the parts of the processing system can be deployed across all three of the cloud environment, the edge environment, and the end device, or across any two of them.
  • the method includes:
  • the unstructured data processing system performs word segmentation on the unstructured data to obtain a word segmentation result.
  • The unstructured data processing system can use any one or more of the string-matching-based, understanding-based, and statistics-based word segmentation methods to segment the unstructured data and obtain the word segmentation result.
  • The word segmentation method based on string matching matches the string to be analyzed against the entries of a machine dictionary according to a set strategy. If a string is found in the dictionary, the match succeeds and a word is recognized. The matching operation is then repeated, which segments the unstructured data into words.
  • When performing string matching, the system can scan in different directions, which divides string-matching segmentation into the forward maximum matching method and the reverse maximum matching method.
  • It can also give priority to matches of different lengths, which divides string-matching segmentation into the longest matching method and the shortest matching method.
  • Depending on whether segmentation is combined with part-of-speech tagging, it can further be divided into a simple word segmentation method and an integrated method that combines word segmentation and part-of-speech tagging.
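  • As an illustration of the forward maximum matching method mentioned above, here is a minimal sketch. The function name, the toy dictionary, and the single-character fallback for unknown characters are assumptions made for this example; the sample sentence is the Chinese form of "My name is Zhang San".

```python
def forward_max_match(text, dictionary, max_len=None):
    """Greedy left-to-right segmentation.

    At each position the longest dictionary entry is taken; a character
    that starts no dictionary entry is emitted as a single-character token.
    """
    if max_len is None:
        max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until something matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical machine dictionary (illustrative entries only).
dictionary = {"我", "的", "名字", "叫", "张三"}
tokens = forward_max_match("我的名字叫张三", dictionary)
```

  The reverse maximum matching method would run the same loop from the end of the string toward the beginning.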
  • the word segmentation method based on comprehension achieves the effect of word recognition by simulating the comprehension of the sentence. Specifically, syntactic analysis and semantic analysis are performed at the same time for word segmentation, and syntactic information and semantic information are used to eliminate ambiguity, so as to achieve segmentation of unstructured data such as text.
  • the statistics-based word segmentation method uses statistical machine learning models to learn the rules of word segmentation under the premise of a large number of segmented texts, so as to achieve segmentation of unknown texts.
  • the word segmentation methods based on statistics include the maximum probability word segmentation method and the maximum entropy word segmentation method.
  • The statistical models used in the above methods include the N-gram model, the Hidden Markov Model (HMM), the Maximum Entropy Model (MEM), and the Conditional Random Fields (CRF) model.
  • the unstructured data processing system may select a matching word segmentation method for word segmentation based on the language and scene of the unstructured data, and obtain the word segmentation result.
  • the processing system of unstructured data may also remove stop words after word segmentation, so as to obtain the final word segmentation result.
  • S304 The unstructured data processing system determines the weight of the sensitive word in the word segmentation result, and determines the weight of the non-sensitive word according to the similarity between the non-sensitive word and the private data attribute in the word segmentation result.
  • The unstructured data processing system first identifies the sensitive words in the word segmentation result; all other words in the result are non-sensitive words. It then determines the weight of the sensitive words and, according to the similarity between each non-sensitive word and the private data attributes, the weight of the non-sensitive words. Here, a weight measures how much a sensitive or non-sensitive word contributes to the degree of privacy of the unstructured data as a whole.
  • The private data attribute describes the type of private data. For example, for the private data "Zhang San", the corresponding private data attribute is "name", and for the private data xx@yy.com, the corresponding private data attribute is "email address".
  • The definition of private data may differ between application scenarios. For example, information such as birthday or birthplace is considered private in some application scenarios, such as under the General Data Protection Regulation (GDPR), and is not considered private in others, such as medical scenarios, as shown in the following table:
  • The unstructured data processing system can match the attribute of each word in the word segmentation result against the private data attributes defined by the privacy data template of the current application scenario, thereby determining whether each word in the word segmentation result is a sensitive word or a non-sensitive word.
  • the sensitive words and non-sensitive words thus determined have high accuracy.
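  • One way to picture the template matching described above is the sketch below. The template structure (attribute name mapped to a detector function), the regular expression, and the name list are all hypothetical; the application does not specify how a privacy data template is represented.

```python
import re

# Hypothetical privacy data template: private data attribute -> detector.
PRIVACY_TEMPLATE = {
    "email address": lambda w: re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", w) is not None,
    "name": lambda w: w in {"Zhang San", "Li Si", "Wang Wu"},
}


def classify(tokens, template):
    """Split tokens into (sensitive word, attribute) pairs and non-sensitive words."""
    sensitive, non_sensitive = [], []
    for tok in tokens:
        # First attribute whose detector accepts the token, if any.
        attr = next((a for a, detect in template.items() if detect(tok)), None)
        if attr is None:
            non_sensitive.append(tok)
        else:
            sensitive.append((tok, attr))
    return sensitive, non_sensitive


sensitive, non_sensitive = classify(["name", "Zhang San", "xx@yy.com"], PRIVACY_TEMPLATE)
```

  Swapping in a different template changes which words count as sensitive, which is how scenario-specific definitions of private data would be accommodated in this sketch.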
  • The unstructured data processing system can determine the weight of a sensitive word according to a preset weight. For example, if the weight of sensitive words is set to a standard weight of 1, the weight of each sensitive word is 1.
  • The weight of a non-sensitive word is determined from the similarity between the non-sensitive word and the private data attribute, according to the correspondence between similarity and weight.
  • The higher the similarity between the non-sensitive word and the private data attribute, the greater the weight of the non-sensitive word; the lower the similarity, the smaller the weight.
  • the unstructured data includes the sentence "My name is Zhang San”.
  • The processing system determines, based on the private data attributes, that "Zhang San" is a sensitive word and "name" is a non-sensitive word. It determines the similarity between the non-sensitive word "name" and the private data attribute "name" to be 0.9999. According to the correspondence between similarity and weight ratio, the weight ratio is determined to be 0.8, so the weight of "Zhang San" is 1 and the weight of "name" is 0.8.
  • the unstructured data processing system determines the degree of privacy of the unstructured data through the weight of the sensitive word and the weight of the non-sensitive word.
  • the unstructured data processing system can obtain the degree of privacy of the unstructured data by performing weighted aggregation on the weights of all sensitive words and the weights of all non-sensitive words.
  • the formula for calculating the degree of privacy is as follows:
  • privacylevel represents the degree of sensitivity, also known as the sensitivity level.
  • n is the total number of sensitive words and non-sensitive words.
  • g_i represents the sensitive value of the i-th word in the unstructured data, as follows:
  • I_i is the similarity between the non-sensitive word and the private data attribute when the i-th word is a non-sensitive word.
  • The weight of the non-sensitive word represents the influence of the non-sensitive word on the degree of privacy of the unstructured data.
  • The value range of this weight is (0, 1); it is determined according to the similarity between the non-sensitive word and the private data attribute.
  • the similarity between the attributes of non-sensitive words and private data and the weight of the non-sensitive words have the following correspondence:
  • the unstructured data processing system determines the weight of non-sensitive words based on the above formula (3), and determines the degree of privacy of unstructured data based on the weight of sensitive words and the weight of non-sensitive words.
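  • The weighted aggregation described above can be sketched as follows, under stated assumptions. The sketch assumes that the degree of privacy is the sum of all word weights divided by the number of words n, and that the similarity-to-weight correspondence is a step table; neither the normalization nor the thresholds are given in the text, and only the pair "similarity 0.9999 gives weight 0.8" comes from the "Zhang San" example. All function and variable names are hypothetical.

```python
def non_sensitive_weight(similarity, step_table):
    """Map a similarity to a weight.

    step_table holds (threshold, weight) pairs sorted by descending
    threshold; the first threshold the similarity reaches wins.
    """
    for threshold, weight in step_table:
        if similarity >= threshold:
            return weight
    # Below every threshold: treated as contributing nothing (an assumption).
    return 0.0


def privacy_level(words, step_table, sensitive_weight=1.0):
    """Average word weight over (is_sensitive, similarity) pairs."""
    if not words:
        return 0.0
    total = 0.0
    for is_sensitive, similarity in words:
        if is_sensitive:
            total += sensitive_weight
        else:
            total += non_sensitive_weight(similarity, step_table)
    return total / len(words)


# Hypothetical step table; only 0.9999 -> 0.8 is taken from the text.
STEP_TABLE = [(0.9, 0.8), (0.5, 0.4), (0.0, 0.1)]

# "name" (non-sensitive, similarity 0.9999) and "Zhang San" (sensitive).
level = privacy_level([(False, 0.9999), (True, None)], STEP_TABLE)
```

  With these assumed values the two words contribute 0.8 and 1, whereas a grading that counts only private data would ignore "name" entirely.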
  • The embodiments of the present application provide a method for processing unstructured data.
  • The method treats unstructured data as a whole, considers the relationships between contexts in the unstructured data, and uses the similarity between non-sensitive words and private data attributes to determine the weight of the non-sensitive words.
  • The weights of the sensitive words and of the contextually related non-sensitive words are then used to determine the degree of privacy of the text, which gives higher accuracy.
  • Using this method, the privacy protection mechanism of the corresponding level can be determined more accurately.
  • Protecting unstructured data with this mechanism not only avoids the direct disclosure of private information caused by private data, but also effectively prevents the indirect disclosure of private information caused by semantics, and thus protects private information better.
  • the embodiment of this application also designs an attack scenario for verification.
  • In the attack scenario, the same masking desensitization is applied to all private data in the unstructured data, which here is text. Specifically, all private data is replaced with spaces, and the attacker then uses the context words around the private data to guess it. The higher the probability of guessing the correct information, the more private information the attacker can obtain from the text.
  • In that case, the current privacy protection mechanism is insufficient and desensitization is incomplete. Therefore, if the text privacy grading is not accurate enough, text data with a high privacy level may be desensitized with a low-level privacy protection mechanism, so that the desensitized text data may still reveal private information.
  • Predicting privacy data by designing attack scenarios can verify the privacy ranking of the text.
  • the privacy level grading method proposed in the embodiments of this application and the traditional method are used to calculate the text privacy level, and the text privacy level is ranked. Which ranking is closer to the privacy ranking obtained by using the attack scenario verification indicates that the method can more accurately reflect the privacy level of the text.
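  • The closeness of two rankings can be measured by the mean square error (MSE), $\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(x_i - y_i)^2$, where
  • n is the number of documents, and x and y are the two ranking lists of the documents' privacy degrees being compared, with $x_i$ and $y_i$ the ranks of the i-th document.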
  • According to the rankings in Table 3, the MSE can be calculated:
  • the similarity-based privacy grading method proposed in the embodiment of this application is closer to the ranking of the verification method.
  • the method proposed in the embodiment of the present application can more accurately classify the degree of privacy of unstructured data.
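  • The ranking comparison described above can be sketched in Python as follows. The function name and the three sample rankings are invented for illustration and are not the values of Table 3; the computation is the standard mean square error between two rank lists of equal length.

```python
def ranking_mse(x, y):
    """Mean square error between two rankings of the same n documents."""
    if len(x) != len(y) or not x:
        raise ValueError("rankings must be non-empty and of equal length")
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)


# Invented example: rank of each of five documents under three methods.
verification = [1, 2, 3, 4, 5]   # ranking from the attack-scenario verification
proposed = [1, 3, 2, 4, 5]       # ranking from a similarity-based grading
traditional = [3, 1, 5, 2, 4]    # ranking from a baseline grading

mse_proposed = ranking_mse(verification, proposed)
mse_traditional = ranking_mse(verification, traditional)

# The method with the smaller MSE is closer to the verification ranking.
closer = "proposed" if mse_proposed < mse_traditional else "traditional"
```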
  • The embodiment of this application introduces a method from natural language processing (NLP) that calculates vocabulary similarity using word vectors, and uses it to calculate the similarity between non-sensitive words and private data attributes.
  • The unstructured data processing system can extract the word vectors of the non-sensitive words and the word vectors of the private data attributes separately, for example by inputting the non-sensitive words and the private data attributes into a pre-trained word vector model. The similarity between a non-sensitive word and a private data attribute is then determined according to the distance between their word vectors. Finally, the weight of the non-sensitive word is determined from this similarity, based on the correspondence between similarity and weight (for example, the correspondence shown in formula (3)).
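  • A minimal Python sketch of this step is given below. It assumes cosine similarity as the vector distance measure and a simple thresholded mapping from similarity to weight; both are illustrative choices, and the mapping is a hypothetical stand-in rather than formula (3). The three-dimensional vectors are toy values, not the output of a trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two word vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def non_sensitive_weight(word_vec, attribute_vecs, threshold=0.5):
    """Hypothetical similarity-to-weight mapping (not formula (3)).

    Takes the highest similarity to any private data attribute and keeps it
    as the weight only if it reaches the threshold; otherwise the weight is 0.
    """
    best = max(cosine_similarity(word_vec, vec) for vec in attribute_vecs.values())
    return best if best >= threshold else 0.0


# Toy vectors standing in for a pre-trained word vector model.
attribute_vecs = {
    "disease": [0.9, 0.1, 0.0],
    "address": [0.0, 0.2, 0.9],
}
hospital_vec = [0.8, 0.2, 0.1]   # contextually close to "disease"
the_vec = [0.1, 0.9, 0.1]        # unrelated function word

weight_hospital = non_sensitive_weight(hospital_vec, attribute_vecs)
weight_the = non_sensitive_weight(the_vec, attribute_vecs)
```

  • Under these toy values, the word close to the "disease" attribute receives a high weight while the unrelated word receives none, which is the behavior the paragraph above describes.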
  • The word vector model can be obtained by a training method such as word2vec.
  • The unstructured data processing system can construct an initial word vector model through word2vec and train it on a training corpus, thereby obtaining a word vector model for extracting word vectors.
  • The unstructured data processing system can obtain a training corpus that matches the application scenario of the unstructured data, and then use that scenario-specific corpus to train the initial word vector model to obtain the word vector model.
  • Words corresponding to the same private data attribute often appear in similar contexts, but the private data words themselves vary widely.
  • For example, the private data words corresponding to the attribute "name" can be "Zhang San", "Li Si", "Wang Wu", and so on. Many private data words may appear only a few times, so a word vector model trained directly on the training corpus is not accurate enough.
  • The unstructured data processing system can therefore also preprocess the training corpus. Specifically, it identifies the sensitive words in the training corpus, replaces each sensitive word with its private data attribute, and then trains the initial word vector model on the replaced training corpus to obtain the word vector model.
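  • The preprocessing step can be sketched in Python as follows. The lexicon-based lookup is an assumption made for illustration (the source does not say how sensitive words are identified), and the function name and sample sentences are hypothetical. The names are the examples given in the text.

```python
def replace_sensitive_words(corpus, sensitive_lexicon):
    """Replace each sensitive token with its private data attribute.

    corpus: list of tokenized sentences (lists of strings).
    sensitive_lexicon: maps a sensitive word to its private data attribute.
    """
    return [
        [sensitive_lexicon.get(token, token) for token in sentence]
        for sentence in corpus
    ]


# Assumed lexicon; a real system might use a recognizer instead of a lookup.
sensitive_lexicon = {
    "Zhang San": "name",
    "Li Si": "name",
    "Wang Wu": "name",
}

corpus = [
    ["Zhang San", "visited", "the", "hospital"],
    ["Li Si", "visited", "the", "bank"],
    ["Wang Wu", "called", "the", "hospital"],
]

replaced_corpus = replace_sensitive_words(corpus, sensitive_lexicon)
```

  • After replacement, the three rare names collapse into a single frequent token, so a word vector trainer sees the attribute "name" in all three contexts instead of seeing each name once.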
  • The apparatus 500 includes: a word segmentation module 502, configured to perform word segmentation on the unstructured data to obtain a word segmentation result;
  • a weight determination module 504, configured to determine the weight of each sensitive word in the word segmentation result, and to determine the weight of each non-sensitive word according to the similarity between the non-sensitive word in the word segmentation result and the private data attributes;
  • a privacy degree determination module 506, configured to determine the degree of privacy of the unstructured data from the weights of the sensitive words and the weights of the non-sensitive words.
  • The weight determination module 504 is specifically configured to:
  • determine the weight of the non-sensitive word according to the similarity between the non-sensitive word and the private data attributes.
  • The weight determination module 504 is specifically configured to:
  • extract the word vector of the non-sensitive word and the word vector of the private data attribute using a pre-trained word vector model.
  • The apparatus 500 further includes:
  • a communication module, configured to obtain a training corpus matching the application scenario of the unstructured data;
  • a training module, configured to train the initial word vector model using the training corpus to obtain the word vector model.
  • The apparatus 500 further includes:
  • a replacement module, configured to identify sensitive words in the training corpus and replace each sensitive word with its private data attribute;
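  • the training module is specifically configured to train the initial word vector model using the replaced training corpus to obtain the word vector model.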
  • The apparatus 500 further includes:
  • a privacy protection processing module, configured to determine the privacy protection mechanism of the unstructured data according to the degree of privacy of the unstructured data, and to use that privacy protection mechanism to protect the privacy of the unstructured data.
  • The apparatus 500 for processing unstructured data can correspond to performing the method described in the embodiment of the present application, and the above and other operations and/or functions of the modules/units of the apparatus 500 respectively implement the corresponding processes of the methods in the embodiment shown in FIG. 3. For brevity, details are not repeated here.
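  • To show how the modules of the apparatus 500 could fit together, here is a hedged Python sketch. Everything in it is an assumption for illustration: whitespace splitting stands in for the word segmentation module 502, a caller-supplied function stands in for the similarity-based weighting of module 504, summing the weights stands in for module 506, and the thresholds and mechanism names in `choose_mechanism` are invented. The source does not give these details in this section.

```python
from typing import Callable, Dict, List, Tuple


def segment(text: str) -> List[str]:
    """Stand-in for module 502: naive whitespace segmentation."""
    return text.split()


def determine_weights(
    tokens: List[str],
    sensitive_weights: Dict[str, float],
    similarity_weight: Callable[[str], float],
) -> List[Tuple[str, float]]:
    """Stand-in for module 504.

    Sensitive words take their predefined weight; every other word is
    weighted by the supplied similarity-based function.
    """
    weighted = []
    for token in tokens:
        if token in sensitive_weights:
            weighted.append((token, sensitive_weights[token]))
        else:
            weighted.append((token, similarity_weight(token)))
    return weighted


def privacy_degree(weighted: List[Tuple[str, float]]) -> float:
    """Stand-in for module 506: assumed here to be the sum of all weights."""
    return sum(weight for _, weight in weighted)


def choose_mechanism(degree: float) -> str:
    """Hypothetical mapping from privacy degree to a protection level."""
    if degree >= 2.0:
        return "strong"
    if degree >= 1.0:
        return "moderate"
    return "light"


# Invented sample inputs.
sensitive_weights = {"diabetes": 1.0}
context_weights = {"hospital": 0.8}

weighted = determine_weights(
    segment("patient visited hospital for diabetes"),
    sensitive_weights,
    lambda token: context_weights.get(token, 0.0),
)
degree = privacy_degree(weighted)
mechanism = choose_mechanism(degree)
```

  • Passing the weighting function as a parameter keeps the sketch independent of any particular word vector model, which mirrors how the weight determination module is described separately from the training module.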
  • the embodiment of the present application also provides a device 600.
  • The device 600 may be an end-side device such as a notebook computer or a desktop computer, or may be a computer cluster in a cloud environment or an edge environment.
  • the device 600 is specifically used to implement the functions of the apparatus 500 for processing unstructured data in the embodiment shown in FIG. 5.
  • FIG. 6 provides a schematic structural diagram of a device 600.
  • the device 600 includes a bus 601, a processor 602, a communication interface 603, and a memory 604.
  • the processor 602, the memory 604, and the communication interface 603 communicate through a bus 601.
  • The bus 601 may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like.
  • The bus can be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, only one thick line is used in FIG. 6, but this does not mean that there is only one bus or one type of bus.
  • The communication interface 603 is used to communicate with the outside, for example to obtain a training corpus that matches the application scenario of the unstructured data, or to obtain the unstructured data.
  • the processor 602 may be a central processing unit (CPU).
  • The memory 604 may include volatile memory, such as random access memory (RAM).
  • The memory 604 may also include non-volatile memory, such as read-only memory (ROM), flash memory, a hard disk drive (HDD), or a solid-state drive (SSD).
  • The memory 604 stores executable code, and the processor 602 executes the executable code to perform the aforementioned unstructured data processing method.
  • The software or program code required to execute the functions of the word segmentation module 502, the weight determination module 504, and the privacy degree determination module 506 in FIG. 5 is stored in the memory 604.
  • the function of the communication module is implemented through the communication interface 603.
  • the communication interface 603 receives unstructured data and transmits it to the processor 602 via the bus 601.
  • The processor 602 executes the program code corresponding to each module stored in the memory 604, such as the program code of the word segmentation module 502, the weight determination module 504, and the privacy degree determination module 506, to perform word segmentation on the unstructured data, determine the weights of the sensitive words, determine the weights of the non-sensitive words according to the similarity between the non-sensitive words and the private data attributes, and then determine the degree of privacy of the unstructured data according to the weights of the sensitive words and the weights of the non-sensitive words.
  • The processor 602 may also execute the program code corresponding to the privacy protection processing module, to determine the privacy protection mechanism of the unstructured data according to the degree of privacy of the unstructured data and to use that mechanism to protect the privacy of the unstructured data.
  • An embodiment of the present application also provides a computer-readable storage medium, which includes instructions that instruct a computer to execute the above-mentioned unstructured data processing method applied to the unstructured data processing apparatus 500.
  • the embodiment of the present application also provides a computer program product.
  • When the computer program product is executed by a computer, the computer performs any one of the aforementioned methods for processing unstructured data.
  • The computer program product may be a software installation package. When any of the aforementioned methods for processing unstructured data needs to be used, the computer program product can be downloaded and executed on the computer.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Artificial Intelligence (AREA)
  • Bioethics (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A method for processing unstructured data, comprising: performing word segmentation on unstructured data to obtain a word segmentation result; determining the weight of a sensitive word in the word segmentation result, and determining the weight of a non-sensitive word according to the similarity between the non-sensitive word in the word segmentation result and a private data attribute; and determining a privacy level of the unstructured data according to the weight of the sensitive word and the weight of the non-sensitive word. Because non-sensitive words having a contextual relationship are taken into account, the method achieves high accuracy in grading privacy levels, and privacy protection processing performed on that basis achieves a good privacy protection effect.
PCT/CN2021/075680 2020-04-24 2021-02-06 Procédé, appareil et dispositif de traitement de données non structurées et support WO2021212968A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010331678.3 2020-04-24
CN202010331678.3A CN113553846A (zh) 2020-04-24 2020-04-24 一种非结构化数据的处理方法、装置、设备及介质

Publications (1)

Publication Number Publication Date
WO2021212968A1 true WO2021212968A1 (fr) 2021-10-28

Family

ID=78101221

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/075680 WO2021212968A1 (fr) 2020-04-24 2021-02-06 Procédé, appareil et dispositif de traitement de données non structurées et support

Country Status (2)

Country Link
CN (1) CN113553846A (fr)
WO (1) WO2021212968A1 (fr)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114065287A (zh) * 2021-11-18 2022-02-18 南京航空航天大学 一种抗预测攻击的轨迹差分隐私保护方法和系统
CN115664799A (zh) * 2022-10-25 2023-01-31 江苏海洋大学 一种应用于信息技术安全的数据交换方法和系统
CN115828307A (zh) * 2023-01-28 2023-03-21 广州佰锐网络科技有限公司 应用于ocr的文本识别方法及ai系统
CN116432243A (zh) * 2023-06-15 2023-07-14 恺恩泰(南京)科技有限公司 一种线上商城的数据脱敏方法、装置、设备及存储介质
CN117034356A (zh) * 2023-10-09 2023-11-10 成都乐超人科技有限公司 一种基于混合链的多作业流程的隐私保护方法及装置
CN117591643A (zh) * 2023-11-10 2024-02-23 杭州市余杭区数据资源管理局 一种基于改进的结构化处理的项目文本查重方法及系统
CN117892358A (zh) * 2024-03-18 2024-04-16 北方健康医疗大数据科技有限公司 一种受限数据脱敏方法验证方法及系统
CN117912624A (zh) * 2024-03-15 2024-04-19 江西曼荼罗软件有限公司 一种电子病历共享方法及系统

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115618371B (zh) * 2022-07-11 2023-08-04 上海期货信息技术有限公司 一种非文本数据的脱敏方法、装置及存储介质
CN115512810A (zh) * 2022-11-17 2022-12-23 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) 一种医学影像数据的数据治理方法及系统

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102012985A (zh) * 2010-11-19 2011-04-13 国网电力科学研究院 一种基于数据挖掘的敏感数据动态识别方法
CN102184188A (zh) * 2011-04-15 2011-09-14 百度在线网络技术(北京)有限公司 一种用于确定目标文本的敏感度的方法与设备
US20140237620A1 (en) * 2011-09-28 2014-08-21 Tata Consultancy Services Limited System and method for database privacy protection
CN102426599A (zh) * 2011-11-09 2012-04-25 中国人民解放军信息工程大学 基于d-s证据理论的敏感信息检测方法

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114065287B (zh) * 2021-11-18 2024-05-07 南京航空航天大学 一种抗预测攻击的轨迹差分隐私保护方法和系统
CN114065287A (zh) * 2021-11-18 2022-02-18 南京航空航天大学 一种抗预测攻击的轨迹差分隐私保护方法和系统
CN115664799A (zh) * 2022-10-25 2023-01-31 江苏海洋大学 一种应用于信息技术安全的数据交换方法和系统
CN115828307A (zh) * 2023-01-28 2023-03-21 广州佰锐网络科技有限公司 应用于ocr的文本识别方法及ai系统
CN115828307B (zh) * 2023-01-28 2023-05-23 广州佰锐网络科技有限公司 应用于ocr的文本识别方法及ai系统
CN116432243A (zh) * 2023-06-15 2023-07-14 恺恩泰(南京)科技有限公司 一种线上商城的数据脱敏方法、装置、设备及存储介质
CN116432243B (zh) * 2023-06-15 2023-08-25 恺恩泰(南京)科技有限公司 一种线上商城的数据脱敏方法、装置、设备及存储介质
CN117034356A (zh) * 2023-10-09 2023-11-10 成都乐超人科技有限公司 一种基于混合链的多作业流程的隐私保护方法及装置
CN117034356B (zh) * 2023-10-09 2024-01-05 成都乐超人科技有限公司 一种基于混合链的多作业流程的隐私保护方法及装置
CN117591643A (zh) * 2023-11-10 2024-02-23 杭州市余杭区数据资源管理局 一种基于改进的结构化处理的项目文本查重方法及系统
CN117591643B (zh) * 2023-11-10 2024-05-10 杭州市余杭区数据资源管理局 一种基于改进的结构化处理的项目文本查重方法及系统
CN117912624A (zh) * 2024-03-15 2024-04-19 江西曼荼罗软件有限公司 一种电子病历共享方法及系统
CN117892358A (zh) * 2024-03-18 2024-04-16 北方健康医疗大数据科技有限公司 一种受限数据脱敏方法验证方法及系统

Also Published As

Publication number Publication date
CN113553846A (zh) 2021-10-26

Similar Documents

Publication Publication Date Title
WO2021212968A1 (fr) Procédé, appareil et dispositif de traitement de données non structurées et support
US11017178B2 (en) Methods, devices, and systems for constructing intelligent knowledge base
US11727243B2 (en) Knowledge-graph-embedding-based question answering
US10657332B2 (en) Language-agnostic understanding
US11455473B2 (en) Vector representation based on context
WO2019105432A1 (fr) Procédé et appareil de recommandation de texte, et dispositif électronique
WO2020057022A1 (fr) Procédé et appareil de recommandation associative, dispositif informatique et support de stockage associés
US20170293859A1 (en) Method for training a ranker module using a training set having noisy labels
WO2021189951A1 (fr) Procédé et appareil de recherche de texte, et dispositif informatique et support de stockage
CN110162771B (zh) 事件触发词的识别方法、装置、电子设备
WO2021051517A1 (fr) Procédé de récupération d'informations basé sur un réseau neuronal convolutif, et dispositif associé
US11216701B1 (en) Unsupervised representation learning for structured records
WO2020244065A1 (fr) Procédé, appareil et dispositif de définition de vecteur de caractère basés sur l'intelligence artificielle et support de stockage
WO2021139343A1 (fr) Procédé et appareil de traitement de données de langage naturel, et dispositif informatique
WO2021175005A1 (fr) Procédé et appareil de récupération de documents basée sur un vecteur, dispositif informatique, et support de stockage
CN111931935B (zh) 基于One-shot 学习的网络安全知识抽取方法和装置
CN114417865B (zh) 灾害事件的描述文本处理方法、装置、设备及存储介质
WO2021068563A1 (fr) Procédé, dispositif et équipement informatique de traitement de date d'échantillon, et support de stockage
WO2021004124A1 (fr) Procédé et dispositif de recommandation d'informations se fondant sur une comparaison de données, et support de stockage
US20210209482A1 (en) Method and apparatus for verifying accuracy of judgment result, electronic device and medium
CN109271624B (zh) 一种目标词确定方法、装置及存储介质
CN114547257B (zh) 类案匹配方法、装置、计算机设备及存储介质
CN115730597A (zh) 多级语义意图识别方法及其相关设备
WO2022022049A1 (fr) Procédé et appareil de compression de longue phrase textuelle difficile, dispositif informatique et support de stockage
WO2022116444A1 (fr) Procédé et appareil de classification de textes, ainsi que dispositif informatique et support

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21793195

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21793195

Country of ref document: EP

Kind code of ref document: A1