WO2021212968A1

WO2021212968A1 - Unstructured data processing method, apparatus, and device, and medium

Info

Publication number: WO2021212968A1
Application number: PCT/CN2021/075680
Authority: WO
Inventors: 朱天清; 朱运丽; 霍正聃
Original assignee: 华为技术有限公司
Priority date: 2020-04-24
Filing date: 2021-02-06
Publication date: 2021-10-28
Also published as: CN113553846A

Abstract

An unstructured data processing method, comprising: performing word segmentation on unstructured data to obtain a word segmentation result; determining the weight of a sensitive word in the word segmentation result, and determining the weight of a non-sensitive word according to the similarity between the non-sensitive word in the word segmentation result and a private data attribute; and determining a privacy level of the unstructured data according to the weight of the sensitive word and the weight of the non-sensitive word. As the non-sensitive word having a contextual relationship is taken into consideration, the method has high accuracy for the classification of privacy levels, and by performing privacy protection processing on this basis, has a good privacy protection effect.

Description

Method, device, equipment and medium for processing unstructured data

Technical field

This application relates to the field of artificial intelligence technology, and in particular to a method, apparatus, device, and computer-readable storage medium for processing unstructured data.

Background

With the advent of the information age, the volume of data is growing explosively. Data can be divided into structured data and unstructured data. Structured data is data that is logically expressed and implemented using a table structure, has a specific data format, and is usually stored and managed in a relational database. Privacy protection mechanisms for structured data are already fairly mature. Unstructured data, by contrast, cannot be represented with a unified structure, which makes privacy protection difficult.

The industry has proposed some privacy protection methods for unstructured data. For example, the privacy level of unstructured data is graded according to the number of private data characters or the proportion of private data in the unstructured data, and a corresponding privacy protection mechanism is then adopted based on the privacy level, such as desensitizing all private data in the text.

However, the accuracy of the above-mentioned method for grading the degree of privacy based on the number of characters of private data or the proportion of the number of private data is not high, which makes it difficult for the privacy protection mechanism adopted based on the privacy level to achieve a better privacy protection effect.

Summary of the invention

This application provides a method for processing unstructured data. The method treats unstructured data as a whole and determines the degree of privacy of the unstructured data from both the sensitive words and the non-sensitive words in it, which yields higher accuracy. A corresponding privacy protection mechanism can then be adopted based on the degree of privacy, achieving a better privacy protection effect. This application also provides apparatuses, devices, computer-readable storage media, and computer program products corresponding to the method.

In the first aspect, this application provides a method for processing unstructured data. The method can be implemented by a processing system for unstructured data. The system can be deployed in a cloud environment, an edge environment, or an end device (i.e., an end-side device). The cloud environment refers to a central computing device cluster owned by a cloud service provider and used to provide computing, storage, and communication resources; the edge environment refers to an edge computing device cluster that is geographically close to the end-side device and is used to provide computing, storage, and communication resources. When the system is deployed in a cloud environment or an edge environment, it can be provided to users in the form of a service. When the system is deployed on an end-side device, it can be provided to users in the form of a client. In some implementations, the unstructured data processing system includes multiple parts, and these parts can also be distributed across different environments.

Specifically, the unstructured data processing system performs word segmentation on the unstructured data to obtain a word segmentation result, determines the weight of each sensitive word in the word segmentation result, and determines the weight of each non-sensitive word according to the similarity between that non-sensitive word and a private data attribute. The weights of the sensitive words and the weights of the non-sensitive words are then used to determine the degree of privacy of the unstructured data.

This method considers unstructured data as a whole. It takes into account not only private data, that is, sensitive words, but also non-sensitive words that have a contextual relationship with sensitive words. The degree of privacy of the unstructured data is determined jointly from the sensitive words and the non-sensitive words, which makes the evaluation of the degree of privacy more accurate and comprehensive. Further, the method can more accurately adopt a privacy protection mechanism of the corresponding level, and therefore achieves a better privacy protection effect.

In some implementations, considering that the similarity of words can be measured by the distance between the words in a vector space, the unstructured data processing system can also extract the word vector of the non-sensitive word and the word vector of the private data attribute, determine the similarity between the non-sensitive word and the private data attribute according to the distance between the two word vectors, and then determine the weight of the non-sensitive word according to that similarity.

This method introduces the method of calculating vocabulary similarity by using word vectors in natural language processing, and uses it to determine the similarity between non-sensitive words and private data attributes. Since the word vector retains the semantic feature, the similarity determined based on the semantic feature has high reliability.
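As a rough illustration of measuring similarity through word vectors, the sketch below uses cosine similarity. The source only speaks of the "distance" between word vectors, so the choice of cosine similarity is an assumption, and the three-dimensional vectors in `toy_vectors` are invented for illustration rather than produced by any trained model.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


# Toy vectors, invented for illustration only (assumption, not from the source).
toy_vectors = {
    "name": [0.9, 0.1, 0.0],     # stands in for a private data attribute
    "called": [0.8, 0.2, 0.1],   # a non-sensitive word with a related context
    "weather": [0.0, 0.1, 0.9],  # a non-sensitive word with an unrelated context
}

related = cosine_similarity(toy_vectors["called"], toy_vectors["name"])
unrelated = cosine_similarity(toy_vectors["weather"], toy_vectors["name"])
```

With these toy values, `related` is much larger than `unrelated`, which is the property the weighting step relies on.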

In some implementations, the unstructured data processing system may use a pre-trained word vector model to extract the word vector of the non-sensitive word and the word vector of the private data attribute. Extracting word vectors through the word vector model has high efficiency and accuracy.

In some implementations, the definition of private data can differ between application scenarios, and language use and expression also vary greatly between application scenarios, so the context of the same word may differ considerably across the corpora of different application scenarios. If a general training corpus is used to train the initial word vector model, the accuracy of the resulting word vector model may not be high. Based on this, the unstructured data processing system can also obtain a training corpus that matches the application scenario of the unstructured data, and use the training corpus to train an initial word vector model to obtain a word vector model.

In some implementations, words corresponding to the same private data attribute often have similar contexts, but the private data words themselves vary widely. For example, the private data words corresponding to the attribute "name" can be "Zhang San", "Li Si", "Wang Wu", and so on. Many private data words may appear only a few times, so a word vector model trained directly on the training corpus is not accurate enough. In order to train a better word vector model and calculate the similarity more accurately so that sensitivity weights are better assigned, the unstructured data processing system can also preprocess the training corpus. Specifically, the system identifies sensitive words in the training corpus, replaces each sensitive word with its private data attribute, and then uses the replaced training corpus to train an initial word vector model to obtain the word vector model.
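The corpus preprocessing step can be sketched as follows. This is a minimal illustration, not the source's implementation: the name dictionary `KNOWN_NAMES`, the simplified `EMAIL_PATTERN`, and the placeholder labels `<name>` and `<email_address>` are all hypothetical, and a real system would need a proper recognizer for each private data attribute.

```python
import re

# Hypothetical recognizers (assumptions): a tiny name dictionary and a
# deliberately simple e-mail pattern.
KNOWN_NAMES = {"Zhang San", "Li Si", "Wang Wu"}
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def replace_sensitive_words(sentence):
    """Replace each recognized sensitive word with its attribute label."""
    result = EMAIL_PATTERN.sub("<email_address>", sentence)
    for name in KNOWN_NAMES:
        result = result.replace(name, "<name>")
    return result


corpus = [
    "My name is Zhang San and my email is zs@example.com",
    "My name is Li Si",
]
replaced_corpus = [replace_sensitive_words(s) for s in corpus]
```

After replacement, rare names all collapse into the single token `<name>`, so the word vector model sees many more occurrences of that token in its typical context than it would of any individual name.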

In some implementations, the unstructured data processing system may also determine a privacy protection mechanism for the unstructured data according to its degree of privacy, and use the privacy protection mechanism to perform privacy protection on the unstructured data. This method can not only avoid the direct leakage of private information caused by private data, but also effectively prevent the indirect leakage of private information caused by semantic problems, and thus can better protect private information.

In the second aspect, this application provides an apparatus for processing unstructured data. The device includes:

The word segmentation module is used to segment the unstructured data to obtain the word segmentation result;

A weight determination module, configured to determine the weight of the sensitive word in the word segmentation result, and determine the weight of the non-sensitive word according to the similarity between the non-sensitive word in the word segmentation result and the private data attribute;

The degree of privacy determination module is used to determine the degree of privacy of the unstructured data through the weights of the sensitive words and the weights of the non-sensitive words.

In some implementation manners, the weight determination module is specifically configured to:

Extracting the word vector of the non-sensitive word and the word vector of the private data attribute;

Determining the similarity between the non-sensitive word and the private data attribute according to the distance between the word vector of the non-sensitive word and the word vector of the private data attribute;

Determining the weight of the non-sensitive word according to the similarity between the non-sensitive word and the private data attribute.

In some implementation manners, the weight determination module is specifically configured to use a pre-trained word vector model to extract the word vector of the non-sensitive word and the word vector of the private data attribute.

In some implementation manners, the device further includes:

A communication module for obtaining training corpus matching the application scenario of the unstructured data;

The training module is used to train the initial word vector model using the training corpus to obtain the word vector model.

In some implementation manners, the device further includes:

The replacement module is used to identify sensitive words in the training corpus, and replace the sensitive words with private data attributes;

The training module is specifically used for:

Use the replaced training corpus to train the initial word vector model to obtain the word vector model.

In some implementation manners, the device further includes:

The privacy protection processing module is configured to determine the privacy protection mechanism of the unstructured data according to the degree of privacy of the unstructured data, and use the privacy protection mechanism to protect the privacy of the unstructured data.

In a third aspect, the present application provides a device including a processor and a memory. The processor and the memory communicate with each other. The processor is configured to execute instructions stored in the memory, so that the device executes the unstructured data processing method in the first aspect or any implementation manner of the first aspect.

In a fourth aspect, the present application provides a computer-readable storage medium having instructions stored therein. When the instructions are run on a device, the device executes the unstructured data processing method in the first aspect or any implementation manner of the first aspect.

In a fifth aspect, the present application provides a computer program product containing instructions that, when run on a device, enable the device to execute the unstructured data processing method described in the first aspect or any one of the implementations of the first aspect.

On the basis of the implementation manners provided by the above aspects, these implementation manners can be further combined to provide more implementation manners.

Description of the drawings

In order to more clearly illustrate the technical methods of the embodiments of the present application, the following will briefly introduce the drawings needed in the embodiments.

FIG. 1 is an architecture diagram of an unstructured data processing system provided by an embodiment of this application;

FIG. 2 is an architecture diagram of an unstructured data processing system provided by an embodiment of the application;

FIG. 3 is a flowchart of a method for processing unstructured data according to an embodiment of the application;

FIG. 4 is a schematic diagram of determining the weight of non-sensitive words according to an embodiment of the application;

FIG. 5 is a schematic structural diagram of an apparatus for processing unstructured data according to an embodiment of the application;

FIG. 6 is a schematic structural diagram of a device provided by an embodiment of the application.

Detailed description

The terms "first" and "second" in the embodiments of the present application are only used for descriptive purposes, and cannot be understood as indicating or implying relative importance or implicitly indicating the number of indicated technical features. Thus, the features defined with "first" and "second" may explicitly or implicitly include one or more of these features.

First, some technical terms involved in the embodiments of this application are introduced.

Unstructured data refers to data whose data structure is irregular or incomplete, which has no predefined data model, and which cannot conveniently be expressed and implemented logically using a two-dimensional database table. The format of unstructured data is diverse. As an example, unstructured data may include documents or text in various formats.

A word vector is also called a word embedding. A word vector refers to a vector formed by mapping a word to a lower-dimensional continuous vector space. A word vector can usually be represented by a sequence of real numbers. This representation can be understood as a distributed representation based on neural networks, which retains the semantic features of words.

The industry has proposed a privacy protection method for unstructured data such as personal resumes, medical records, and office documents. Specifically, the private data present in the unstructured data is identified according to the definition of private data. The proportion of private data characters is then determined from the ratio of the number of private data characters to the total number of characters in the unstructured data, or the proportion of the number of private data items is determined from the ratio of the number of private data items to the total number of words in the unstructured data. The degree of privacy is graded by either of these proportions. A corresponding privacy protection mechanism is then adopted based on the privacy level, such as desensitizing all private data in the unstructured data.

However, the above-mentioned method of grading the degree of privacy based on the proportion of private data characters or the proportion of the number of private data items ignores the correlation between contexts. In addition to private data, unstructured data such as personal resumes, medical records, and office documents may also include words that are highly similar to private data or that strongly point to private data. Even if all the private data in the unstructured data is desensitized during privacy protection, information related to some of the private data may still be inferred from these words. As a result, the unstructured data is not completely desensitized, and private information is leaked to a certain extent.

For example, consider the sentence "My name is Wang Li, I graduated from the University of Finance and Economics, and I don't want to stay at Company X anymore." If, during privacy protection, the degree of privacy is graded only by the proportion of the number of private data items or the proportion of private data characters and the private data is then desensitized, the desensitized sentence becomes "My name is **, I graduated from **, I don't want to stay in ** and work."

The above privacy protection processing considers only the private data. Although the private data is masked to a certain extent, the desensitization of the sentence is incomplete. Specifically, because the semantics of the sentence are not considered, the semantics of the desensitized sentence remain complete, and the degree of privacy of the sentence is not minimized. Here, "name" strongly points to the private data attribute name; "graduated from" strongly points to the private data attribute school; and "stay" and "work" strongly point to the private data attribute workplace. From these strongly indicative words, it is possible to infer information related to the desensitized private data or the wishes expressed by the person.

Therefore, the accuracy of the above-mentioned privacy level classification method based on the proportion of the number of private data or the proportion of the number of characters of the privacy data is not high, which makes it difficult for the privacy protection mechanism adopted based on the privacy level to achieve a better privacy protection effect.

In view of this, an embodiment of the present application provides a processing method for unstructured data. The method can be executed by a processing system for unstructured data. Specifically, the unstructured data processing system first performs word segmentation on the unstructured data to obtain a word segmentation result. Then, considering the semantic characteristic that context words in unstructured data are strongly related, the processing system determines not only the weight of the sensitive words but also, for the non-sensitive words other than the sensitive words, the weight of each non-sensitive word based on the similarity between the non-sensitive word and a private data attribute. The weights of the sensitive words and the weights of the non-sensitive words are then used to determine the degree of privacy of the unstructured data.

The above-mentioned unstructured data processing method takes unstructured data as a whole. It considers not only private data, that is, sensitive words, but also non-sensitive words that have a contextual relationship with sensitive words. The degree of privacy of the unstructured data is determined jointly from the sensitive words and the non-sensitive words, which makes the evaluation of the degree of privacy more accurate and comprehensive. Furthermore, this method can more accurately adopt a privacy protection mechanism of the corresponding level, and therefore has a better privacy protection effect.

As shown in Figure 1, the processing system for unstructured data can be deployed in a cloud environment, specifically on one or more computing devices (for example, a central server) in the cloud environment. The system may also be deployed in an edge environment, specifically on one or more computing devices (edge computing devices) in the edge environment, and the edge computing devices may be servers. The system can also be deployed in end-side devices (ie end devices), including but not limited to desktop computers, notebook computers, smart phones, and so on.

The cloud environment refers to a central computing device cluster owned by a cloud service provider and used to provide computing, storage, and communication resources; the edge environment refers to an edge computing device cluster that is geographically close to the end-side device and is used to provide computing, storage, and communication resources.

The end-side device can serve as a data providing device that provides unstructured data, so that the unstructured data processing system can process the unstructured data to determine its privacy level and then adopt a corresponding privacy protection mechanism based on that privacy level. The end-side device can provide unstructured data that it has generated or stored for processing by the unstructured data processing system. In some implementation manners, the end-side device may be a network device, for example, a terminal device that accesses the network. In this case, the end-side device may obtain unstructured data from the network and provide it to the unstructured data processing system.

When the unstructured data processing system is deployed in a cloud environment or an edge environment, the unstructured data processing system can be provided to users as a service. Specifically, the user can access the cloud environment or the edge environment through a browser, create an instance of the unstructured data processing system in the cloud environment or the edge environment, and then interact with the instance through the browser, thereby processing the unstructured data.

The processing system for unstructured data can also be deployed on end-side devices. Correspondingly, the processing system for unstructured data can be provided to users in the form of a client. Specifically, the user runs the client to realize the processing of unstructured data.

In some implementations, as shown in Figure 2, the processing system for unstructured data includes multiple parts (for example, multiple subsystems, each of which includes multiple unit modules), so the parts of the unstructured data processing system can also be deployed in different environments in a distributed manner. For example, the parts of the processing system can be deployed respectively across all three of the cloud environment, the edge environment, and the end-side device, or across any two of them.

In order to make the technical solutions provided by the embodiments of the present application clearer and easier to understand, the following will introduce the processing method of unstructured data from the perspective of the processing system of unstructured data.

Referring to the flowchart of the unstructured data processing method shown in FIG. 3, the method includes:

S302: The unstructured data processing system performs word segmentation on the unstructured data to obtain a word segmentation result.

In a specific implementation, the unstructured data processing system can use any one or more of the word segmentation method based on string matching, the word segmentation method based on understanding, and the word segmentation method based on statistics to segment the unstructured data and obtain the word segmentation result.

The word segmentation method based on string matching matches the string to be analyzed against the entries in a machine dictionary according to a set strategy. If a string is found in the dictionary, the match succeeds and a word is recognized. The matching operation is then repeated on the remaining text, which completes the word segmentation of the unstructured data.

Further, when the unstructured data processing system performs string matching, it can match in different directions; that is, the word segmentation method based on string matching can be divided into a forward maximum matching method and a reverse maximum matching method. The system can also match with different length priorities; that is, the word segmentation method based on string matching can be divided into the longest matching method and the shortest matching method. In addition, according to whether it is combined with part-of-speech tagging, the method can be divided into a simple word segmentation method and an integrated method that combines word segmentation and part-of-speech tagging.
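The forward maximum matching strategy described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `forward_max_match`, the toy dictionary, and the fallback of emitting a single character when nothing matches are choices made here, not details given in the source. An unspaced English string stands in for text that has no word delimiters.

```python
def forward_max_match(text, dictionary, max_len=None):
    """Segment text left to right, always taking the longest dictionary hit."""
    if max_len is None:
        max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a match is found.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted so the scan can advance.
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Toy dictionary, invented for illustration (assumption).
toy_dictionary = {"my", "name", "is", "zhang", "san", "zhangsan"}
tokens = forward_max_match("mynameiszhangsan", toy_dictionary)
```

Because the longest candidate is tried first, "zhangsan" is kept as one token rather than being split into "zhang" and "san". A reverse maximum matching variant would scan from the end of the string instead.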

The word segmentation method based on comprehension achieves the effect of word recognition by simulating the comprehension of the sentence. Specifically, syntactic analysis and semantic analysis are performed at the same time for word segmentation, and syntactic information and semantic information are used to eliminate ambiguity, so as to achieve segmentation of unstructured data such as text.

The statistics-based word segmentation method uses a statistical machine learning model to learn the rules of word segmentation from a large amount of already segmented text, so as to segment unknown text. Statistics-based word segmentation methods include the maximum probability word segmentation method and the maximum entropy word segmentation method. The statistical models used include the N-gram model, the hidden Markov model (HMM), the maximum entropy model (MEM), and the conditional random field (CRF) model.

Specifically, the unstructured data processing system may select a matching word segmentation method for word segmentation based on the language and scene of the unstructured data, and obtain the word segmentation result.

In some implementations, in order to save storage space and improve the processing efficiency of unstructured data, the processing system of unstructured data may also remove stop words after word segmentation, so as to obtain the final word segmentation result.

S304: The unstructured data processing system determines the weight of the sensitive word in the word segmentation result, and determines the weight of the non-sensitive word according to the similarity between the non-sensitive word and the private data attribute in the word segmentation result.

Specifically, the unstructured data processing system can determine the sensitive words in the word segmentation result; the remaining words in the word segmentation result are non-sensitive words. The system can then determine the weight of the sensitive words and, according to the similarity between each non-sensitive word and a private data attribute, determine the weight of the non-sensitive words. The weight measures the importance of a sensitive word or non-sensitive word to the degree of privacy of the unstructured data as a whole.

Among them, the private data attribute is used to describe the type of private data. For example, for the private data of "Zhang San", the corresponding private data attribute is "name", and for the private data of xx@yy.com, the corresponding private data attribute is "email address".

Taking into account different application scenarios, the definition of private data may be different. For example, information such as birthday or birthplace is considered private in some application scenarios, such as scenarios governed by the General Data Protection Regulation (GDPR), and is not considered private in other application scenarios, such as medical scenarios, as shown in the following tables:

Table 1 Privacy data template in medical scenarios

I	Name	Private?	XI	Bank card number	Private?
II	Email address	Yes	XII	Ethnicity	No
III	Mobile phone number	Yes	XIII	Political party	Yes
IV	Home phone number	Yes	XIV	IP address	Yes
V	Any address	Yes	XV	GPS information	Yes
VI	ID number	Yes	XVI	DNA information	No
VII	Passport number	Yes	XVII	Fingerprint	No
VIII	License plate number	Yes	XVIII	Iris information	No
IX	Birthday	No	XIX	Disease diagnosis	No
X	Place of birth	No

Table 2 Privacy data template in GDPR scenario

I	Name	Private?	XI	Bank card number	Private?
II	Email address	Yes	XII	Ethnicity	Yes
III	Mobile phone number	Yes	XIII	Political party	Yes
IV	Home phone number	Yes	XIV	IP address	Yes
V	Any address	Yes	XV	GPS information	Yes
VI	ID number	Yes	XVI	DNA information	Yes
VII	Passport number	Yes	XVII	Fingerprint	Yes
VIII	License plate number	Yes	XVIII	Iris information	Yes
IX	Birthday	Yes	XIX	Disease diagnosis	Yes
X	Place of birth	Yes

Based on this, when determining sensitive words, the unstructured data processing system can match the attribute of each word in the word segmentation result against the private data attributes defined by the privacy data template of the current application scenario, thereby determining whether each word in the word segmentation result is a sensitive word or a non-sensitive word. The sensitive words and non-sensitive words determined in this way have high accuracy.
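The template matching step can be sketched as follows, using a few rows taken from Tables 1 and 2. The sketch assumes that each word already carries an attribute tag (or `None` when it has none); how that tag is obtained is not shown, and the names `PRIVACY_TEMPLATES` and `classify_words` are hypothetical.

```python
# A subset of Tables 1 and 2: attribute -> whether it counts as private
# in that application scenario.
PRIVACY_TEMPLATES = {
    "medical": {"email address": True, "birthday": False, "place of birth": False},
    "gdpr": {"email address": True, "birthday": True, "place of birth": True},
}


def classify_words(tagged_words, scenario):
    """Split (word, attribute) pairs into sensitive and non-sensitive words."""
    template = PRIVACY_TEMPLATES[scenario]
    sensitive, non_sensitive = [], []
    for word, attribute in tagged_words:
        # Words without an attribute, or whose attribute the template marks
        # as not private, are treated as non-sensitive.
        if attribute is not None and template.get(attribute, False):
            sensitive.append(word)
        else:
            non_sensitive.append(word)
    return sensitive, non_sensitive


# Attribute tags are assumed to be supplied by an upstream recognizer.
tagged = [("born", None), ("1990-01-01", "birthday"), ("xx@yy.com", "email address")]
medical_result = classify_words(tagged, "medical")
gdpr_result = classify_words(tagged, "gdpr")
```

The same birthday string is non-sensitive under the medical template and sensitive under the GDPR template, which is why the template is selected per application scenario.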

Then, for a sensitive word, the unstructured data processing system can determine its weight according to a preset value. For example, the weight of sensitive words is set to a standard weight such as 1, and each sensitive word is then assigned that weight.

For a non-sensitive word, the unstructured data processing system determines the similarity between the non-sensitive word and a private data attribute, and then determines the weight of the non-sensitive word according to the correspondence between similarity and weight. The higher the similarity between the non-sensitive word and the private data attribute, the greater the weight of the non-sensitive word; the lower the similarity, the smaller the weight.

For ease of understanding, a specific example is described below. In this example, the unstructured data includes the sentence "My name is Zhang San". Based on the private data attributes, the unstructured data processing system determines that "Zhang San" is a sensitive word and "name" is a non-sensitive word. It determines the similarity between the non-sensitive word "name" and the private data attribute "name" to be 0.9999. According to the correspondence between similarity and weight, the weight of the non-sensitive word can be determined to be 0.8. The weight of "Zhang San" is therefore 1, and the weight of "name" is 0.8.
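The mapping from similarity to weight in this example can be sketched as a step function. Only the pair (similarity 0.9999, weight 0.8) and the sensitive-word weight of 1 come from the text; the thresholds and the other weights in `SIMILARITY_TO_WEIGHT` are hypothetical values chosen so that a higher similarity never yields a lower weight.

```python
SENSITIVE_WEIGHT = 1.0  # standard weight for sensitive words, as in the example

# Hypothetical (lower bound, weight) steps, highest bound first (assumption).
SIMILARITY_TO_WEIGHT = [(0.9, 0.8), (0.7, 0.5), (0.5, 0.3)]
DEFAULT_WEIGHT = 0.1  # hypothetical weight for low-similarity words


def non_sensitive_weight(similarity):
    """Map a similarity score to a weight using the step table above."""
    for lower_bound, weight in SIMILARITY_TO_WEIGHT:
        if similarity >= lower_bound:
            return weight
    return DEFAULT_WEIGHT


weights = {
    "Zhang San": SENSITIVE_WEIGHT,           # sensitive word
    "name": non_sensitive_weight(0.9999),    # non-sensitive word
}
```

A step table keeps every non-sensitive weight strictly between 0 and 1, so a non-sensitive word never outweighs a sensitive one.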

S306: The unstructured data processing system determines the degree of privacy of the unstructured data through the weight of the sensitive word and the weight of the non-sensitive word.

Specifically, the unstructured data processing system can obtain the degree of privacy of the unstructured data by performing weighted aggregation on the weights of all sensitive words and the weights of all non-sensitive words.

In an example, the formula for calculating the degree of privacy is as follows:

Among them, privacylevel represents the degree of sensitivity, also known as the sensitivity level. n is the total number of sensitive words and non-sensitive words. g_i represents the sensitive value of the i-th word in the unstructured data, as follows:

Here, I_i is the similarity between the non-sensitive word and the private data attribute when the i-th word is a non-sensitive word, and α_i is the weight of the non-sensitive word, which represents the influence of the non-sensitive word on the degree of privacy of the unstructured data. The value range of α_i is (0, 1), and its value is determined according to the similarity between the non-sensitive word and the private data attribute.

In an example, the similarity between a non-sensitive word and the private data attribute and the weight of the non-sensitive word have the following correspondence:

The unstructured data processing system determines the weight of non-sensitive words based on the above formula (3), and determines the degree of privacy of unstructured data based on the weight of sensitive words and the weight of non-sensitive words.
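The weighted aggregation in S306 can be illustrated with a short sketch. It assumes, purely for illustration, that the per-word sensitivity values g_i are aggregated by an arithmetic mean over all n words; the exact aggregation used in this application is not restated here, and the function name `privacy_degree` is hypothetical.

```python
def privacy_degree(sensitivity_values):
    """Aggregate per-word sensitivity values g_i into one privacy degree.

    Assumption: the aggregation is the arithmetic mean over all n words.
    An empty input is treated as having no private content (degree 0.0).
    """
    values = list(sensitivity_values)
    if not values:
        return 0.0
    return sum(values) / len(values)


# Values from the "My name is Zhang San" example: 1 for the sensitive word
# "Zhang San" and 0.8 for the non-sensitive word "name".
print(privacy_degree([1.0, 0.8]))
```

With this choice, a text whose words are all sensitive reaches the maximum degree of 1, and non-sensitive words with low similarity pull the degree down.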

Based on the above description, the embodiments of the present application provide a method for processing unstructured data. The method treats the unstructured data as a whole, considers the contextual relationships within it, and uses the similarity between the non-sensitive words in the unstructured data and the private data attributes to determine the weight of the non-sensitive words. The weight of the sensitive words and the weight of the contextually related non-sensitive words are then used together to determine the degree of privacy of the text, which gives higher accuracy.

Moreover, this method makes it possible to determine more accurately the privacy protection mechanism of the corresponding level. Using that privacy protection mechanism to protect the unstructured data not only avoids the direct disclosure of private information through the private data itself, but also effectively prevents the indirect disclosure of private information through semantics, thus better protecting private information.
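How a privacy degree might select a protection mechanism can be sketched as follows. The thresholds 0.3 and 0.7, the mechanism names, the use of `*` as the masking character, and the functions `select_mechanism` and `protect` are all hypothetical; the sketch only illustrates that a higher degree can trigger masking of context words in addition to the sensitive words.

```python
import re


def select_mechanism(privacy_degree: float) -> str:
    """Pick a protection mechanism from the privacy degree (thresholds are assumptions)."""
    if privacy_degree >= 0.7:
        return "mask_sensitive_and_context"
    if privacy_degree >= 0.3:
        return "mask_sensitive"
    return "none"


def protect(text, sensitive_words, context_words, privacy_degree):
    """Mask words in the text according to the selected mechanism."""
    mechanism = select_mechanism(privacy_degree)
    targets = []
    if mechanism != "none":
        targets.extend(sensitive_words)
    if mechanism == "mask_sensitive_and_context":
        targets.extend(context_words)
    # Mask longer terms first so a short term never splits a longer one.
    for word in sorted(targets, key=len, reverse=True):
        text = re.sub(re.escape(word), "*" * len(word), text)
    return text


print(protect("My name is Zhang San", ["Zhang San"], ["name"], 0.9))
# My **** is *********
```

Masking the context word "name" at the highest level reflects the idea that contextual vocabulary can indirectly reveal what kind of private data was removed.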

In order to verify that the privacy level classification method proposed in this application can better evaluate the privacy level of unstructured data than traditional methods, the embodiment of this application also designs an attack scenario for verification.

Specifically, in the attack scenario, the unstructured data is text, and the same masking desensitization is applied to all private data in it: all private data are replaced with spaces, and the context vocabulary of the private data is then used to guess the private data. The higher the probability of guessing correctly, the more private information of the text an attacker can obtain, which indicates that the privacy protection mechanism in use is not strong enough and the desensitization is insufficient. Therefore, if the text privacy level is not accurate enough, high-level text data may be desensitized with a low-level privacy protection mechanism, so that the desensitization is incomplete and the desensitized text data may still reveal private information.

Predicting the private data in the designed attack scenario yields a verification ranking of text privacy. Correspondingly, the privacy grading method proposed in the embodiments of this application and the traditional methods are each used to calculate the text privacy level and to rank the texts by it. The method whose ranking is closer to the verification ranking obtained from the attack scenario reflects the privacy level of the text more accurately.
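The attack scenario can be sketched as below. The guessing model itself is not shown; the sketch only covers masking, scoring the guesses, and ranking the documents. Replacing each private term with the same number of spaces, ranking the document with the highest guess success rate first, and the names `mask_private_data`, `guess_success_rate`, and `verification_ranking` are assumptions made for illustration.

```python
import re


def mask_private_data(text, private_terms):
    """Replace every private term with spaces (length-preserving by assumption)."""
    if not private_terms:
        return text
    # Longer terms first so overlapping terms are masked as a whole.
    ordered = sorted(private_terms, key=len, reverse=True)
    pattern = "|".join(re.escape(term) for term in ordered)
    return re.sub(pattern, lambda match: " " * len(match.group(0)), text)


def guess_success_rate(guesses, truths):
    """Fraction of masked private terms that the attacker guessed correctly."""
    if not truths:
        return 0.0
    hits = sum(1 for guess, truth in zip(guesses, truths) if guess == truth)
    return hits / len(truths)


def verification_ranking(success_rates):
    """Rank documents so that rank 1 has the highest guess success rate."""
    ordered = sorted(success_rates, key=success_rates.get, reverse=True)
    return {doc: rank for rank, doc in enumerate(ordered, start=1)}


masked = mask_private_data("My name is Zhang San", ["Zhang San"])
print(repr(masked))
```

The resulting ranking plays the role of the verification ranking against which the rankings of the different grading methods are compared.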

The closeness of two rankings can be measured by the mean square error (MSE).

The calculation formula of MSE is as follows:

Here, n is the number of documents, and x and y are the two ranking lists of document privacy degrees being compared.
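As an illustration, the comparison of two rankings can be sketched with the standard definition of MSE, that is, the mean of the squared differences between the rank positions that the two lists assign to each document. The function name `ranking_mse` and the sample rankings are hypothetical and are not the experimental data of this application.

```python
def ranking_mse(x, y):
    """Mean square error between two rankings of the same n documents.

    x[i] and y[i] are the rank positions that the two methods assign to
    document i. Uses the standard definition: sum((x_i - y_i)^2) / n.
    """
    if len(x) != len(y) or not x:
        raise ValueError("rankings must be non-empty and of equal length")
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)


# Hypothetical rankings of five documents.
verification = [1, 2, 3, 4, 5]
candidate = [2, 1, 3, 5, 4]
print(ranking_mse(verification, candidate))  # 0.8
```

A smaller value means the candidate ranking is closer to the verification ranking, and identical rankings give 0.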

The examples of this application provide the following experimental data:

Table 3 Ranking of privacy levels determined by different methods

According to the ranking in Table 3, MSE can be calculated:

MSE (verification ranking, ranking of this application) = 6;

MSE (verification ranking, ranking by the number of private data) = 12;

MSE (verification ranking, ranking by the proportion of characters in private data) = 34.

It can be seen that, compared with the traditional privacy grading methods based on the proportion of the number of private data or on the proportion of the number of characters of private data, the similarity-based privacy grading method proposed in the embodiment of this application yields a ranking closer to the verification ranking. The method proposed in the embodiment of the present application can therefore grade the degree of privacy of unstructured data more accurately.

Considering the semantic features of contextual vocabulary, the embodiment of this application introduces the natural language processing (NLP) technique of calculating vocabulary similarity using word vectors, and uses it to calculate the similarity between non-sensitive words and private data attributes.

Specifically, as shown in FIG. 4, the unstructured data processing system can extract the word vectors of the non-sensitive words and the word vectors of the private data attributes respectively, for example, by inputting the non-sensitive words and the private data attributes into a pre-trained word vector model. The similarity between a non-sensitive word and a private data attribute is then determined according to the distance between the word vector of the non-sensitive word and the word vector of the private data attribute. Finally, the weight of the non-sensitive word is determined from this similarity, based on the correspondence between similarity and weight (for example, the correspondence shown in formula (3)).
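The similarity step can be sketched as follows. The text only says that the similarity is derived from the distance between the two word vectors; using cosine similarity is an assumption made here, and the three-dimensional vectors are toy values rather than the output of a trained word vector model.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two word vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


# Toy vectors standing in for model output (hypothetical values).
word_vector = [0.9, 0.1, 0.3]        # non-sensitive word, e.g. "name"
attribute_vector = [0.8, 0.2, 0.3]   # private data attribute, e.g. "name"
print(cosine_similarity(word_vector, attribute_vector))
```

The resulting similarity would then be passed to the similarity-to-weight correspondence to obtain the weight of the non-sensitive word.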

Among them, the word vector model can be specifically obtained by training methods such as word2vec. Specifically, the unstructured data processing system can construct an initial word vector model through word2vec, and use the training corpus to train the initial word vector model, thereby obtaining a word vector model for extracting word vectors.

The definition of private data can differ between application scenarios, and language usage and expression also vary greatly between application scenarios, so the context of the same word may be very different from one application scenario to another. If a general training corpus is used to train the initial word vector model, the accuracy of the resulting word vector model may therefore not be high. Based on this, the unstructured data processing system can obtain a training corpus that matches the application scenario of the unstructured data, and then use this scenario-specific training corpus to train the initial word vector model to obtain the word vector model.

Further, even in the corpus of a fixed application scenario, the vocabulary corresponding to the same private data attribute often has a similar context, but the private data vocabulary itself varies widely. For example, the private data vocabulary corresponding to the attribute "name" can be "Zhang San", "Li Si", "Wang Wu", and so on. Many private data words may appear only a few times, so a word vector model trained directly on the training corpus is not accurate enough. In order to train a better word vector model and calculate the similarity more accurately, so as to better assign sensitivity weights, the unstructured data processing system can also preprocess the training corpus. Specifically, it identifies the sensitive words in the training corpus, replaces each sensitive word with the private data attribute of that sensitive word, and then uses the replaced training corpus to train the initial word vector model to obtain the word vector model.
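The corpus preprocessing described above can be sketched as follows. The sketch assumes a corpus that is already segmented into tokens and a lookup table from sensitive words to their private data attributes; the table `SENSITIVE_TO_ATTRIBUTE`, the upper-case attribute label `NAME`, and the function `replace_sensitive_words` are hypothetical. Training the word vector model on the result (for example with word2vec) is not shown.

```python
# Hypothetical lookup from sensitive words to their private data attribute.
SENSITIVE_TO_ATTRIBUTE = {
    "Zhang San": "NAME",
    "Li Si": "NAME",
    "Wang Wu": "NAME",
}


def replace_sensitive_words(corpus, lookup):
    """Replace each sensitive token with the label of its private data attribute."""
    return [[lookup.get(token, token) for token in sentence] for sentence in corpus]


corpus = [
    ["my", "name", "is", "Zhang San"],
    ["my", "name", "is", "Li Si"],
]
replaced = replace_sensitive_words(corpus, SENSITIVE_TO_ATTRIBUTE)
print(replaced)
# [['my', 'name', 'is', 'NAME'], ['my', 'name', 'is', 'NAME']]
```

After replacement, rare names collapse into a single frequent token, so the attribute gets a well-trained vector whose context is shared by all of its instances.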

The method for processing unstructured data provided by the embodiment of the present application is described in detail above with reference to FIGS. 1 to 4, and the apparatus and equipment provided by the embodiment of the present application will be introduced below with reference to the accompanying drawings.

Referring to the schematic structural diagram of the apparatus for processing unstructured data shown in FIG. 5, the apparatus 500 includes:

A word segmentation module 502, configured to perform word segmentation on the unstructured data to obtain a word segmentation result;

A weight determination module 504, configured to determine the weight of the sensitive word in the word segmentation result, and determine the weight of the non-sensitive word in the word segmentation result according to the similarity between the non-sensitive word and the private data attribute;

A privacy degree determination module 506, configured to determine the degree of privacy of the unstructured data according to the weight of the sensitive word and the weight of the non-sensitive word.

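In some implementation manners, the weight determination module 504 is specifically configured to:

Extract the word vector of the non-sensitive word and the word vector of the private data attribute;

Determine the similarity between the non-sensitive word and the private data attribute according to the distance between the word vector of the non-sensitive word and the word vector of the private data attribute;

Determine the weight of the non-sensitive word according to the similarity between the non-sensitive word and the private data attribute.

In some implementation manners, the apparatus 500 further includes:

A communication module, configured to obtain a training corpus that matches the application scenario of the unstructured data;

A training module, configured to train an initial word vector model using the training corpus to obtain the word vector model.

In some implementation manners, the apparatus 500 further includes:

A replacement module, configured to identify sensitive words in the training corpus and replace the sensitive words with private data attributes;

The training module is specifically configured to:

Train the initial word vector model using the replaced training corpus to obtain the word vector model.

In some implementation manners, the apparatus 500 further includes:

A privacy protection processing module, configured to determine the privacy protection mechanism of the unstructured data according to the degree of privacy of the unstructured data, and use the privacy protection mechanism to protect the privacy of the unstructured data.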

The apparatus 500 for processing unstructured data according to the embodiment of the present application may correspondingly perform the method described in the embodiment of the present application, and the above and other operations and/or functions of the modules/units of the apparatus 500 are respectively intended to implement the corresponding processes of the method in the embodiment shown in FIG. 3. For brevity, details are not described herein again.
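The way the three modules of the apparatus 500 cooperate can be sketched as a small pipeline. This is only an illustrative wiring: the class `UnstructuredDataApparatus`, the toy segmenter that recognizes "Zhang San" through a fixed pattern, the fixed weight table, and the mean aggregation are all assumptions and not the implementation of this application.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class UnstructuredDataApparatus:
    segment: Callable[[str], List[str]]              # word segmentation module 502
    weigh: Callable[[List[str]], Dict[str, float]]   # weight determination module 504
    aggregate: Callable[[Dict[str, float]], float]   # privacy degree determination module 506

    def privacy_degree(self, text: str) -> float:
        """Run segmentation, weight determination, and aggregation in order."""
        return self.aggregate(self.weigh(self.segment(text)))


# Toy stand-ins for the three modules (all values hypothetical).
TOY_WEIGHTS = {"Zhang San": 1.0, "name": 0.8}


def toy_segment(text: str) -> List[str]:
    # Keep the known two-word name together, otherwise split on word characters.
    return re.findall(r"Zhang San|\w+", text)


def toy_weigh(words: List[str]) -> Dict[str, float]:
    # Words without an entry are ignored in this toy example.
    return {word: TOY_WEIGHTS[word] for word in words if word in TOY_WEIGHTS}


def toy_aggregate(weights: Dict[str, float]) -> float:
    return sum(weights.values()) / len(weights) if weights else 0.0


apparatus = UnstructuredDataApparatus(toy_segment, toy_weigh, toy_aggregate)
print(apparatus.privacy_degree("My name is Zhang San"))
```

Passing the modules in as callables keeps each of them replaceable, for example swapping the toy weight table for a word-vector-based weight determination.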

The embodiment of the present application also provides a device 600. The device 600 may be an end-side device such as a notebook computer and a desktop computer, and may also be a computer cluster in a cloud environment or an edge environment. The device 600 is specifically used to implement the functions of the apparatus 500 for processing unstructured data in the embodiment shown in FIG. 5.

FIG. 6 provides a schematic structural diagram of the device 600. As shown in FIG. 6, the device 600 includes a bus 601, a processor 602, a communication interface 603, and a memory 604. The processor 602, the memory 604, and the communication interface 603 communicate through the bus 601. The bus 601 may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like. The bus can be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, only one thick line is used in FIG. 6, but this does not mean that there is only one bus or one type of bus. The communication interface 603 is used to communicate with the outside, for example, to obtain a training corpus that matches the application scenario of the unstructured data, or to obtain the unstructured data.

The processor 602 may be a central processing unit (CPU). The memory 604 may include a volatile memory, such as a random access memory (RAM). The memory 604 may also include a non-volatile memory, such as a read-only memory (ROM), a flash memory, a hard disk drive (HDD), or a solid state drive (SSD).

The memory 604 stores executable code, and the processor 602 executes the executable code to execute the aforementioned unstructured data processing method.

Specifically, in the case that the embodiment shown in FIG. 5 is implemented and the modules of the apparatus 500 for processing unstructured data described in the embodiment of FIG. 5 are implemented by software, the software or program code required to perform the functions of the word segmentation module 502, the weight determination module 504, and the privacy degree determination module 506 in FIG. 5 is stored in the memory 604. The function of the communication module is implemented through the communication interface 603. The communication interface 603 receives the unstructured data and transmits it to the processor 602 via the bus 601. The processor 602 executes the program code corresponding to each module stored in the memory 604, such as the program code corresponding to the word segmentation module 502, the weight determination module 504, and the privacy degree determination module 506, to perform word segmentation on the unstructured data, determine the weight of the sensitive words, determine the weight of the non-sensitive words according to the similarity between the non-sensitive words and the private data attributes, and then determine the degree of privacy of the unstructured data according to the weight of the sensitive words and the weight of the non-sensitive words.

Of course, the processor 602 may also execute the program code corresponding to the privacy protection processing module, so as to determine the privacy protection mechanism of the unstructured data according to the degree of privacy of the unstructured data, and use the privacy protection mechanism to perform privacy protection on the unstructured data.

An embodiment of the present application also provides a computer-readable storage medium, which includes instructions that instruct a computer to execute the above-mentioned unstructured data processing method applied to the unstructured data processing apparatus 500.

The embodiment of the present application also provides a computer program product. When the computer program product is executed by a computer, the computer executes any one of the aforementioned methods for processing unstructured data. The computer program product may be a software installation package. In the case where any method of the aforementioned unstructured data processing method needs to be used, the computer program product may be downloaded and executed on the computer.

Claims

A method for processing unstructured data, characterized in that the method comprises:

Performing word segmentation on the unstructured data to obtain a word segmentation result;

Determining the weight of the sensitive word in the word segmentation result, and determining the weight of the non-sensitive word in the word segmentation result according to the similarity between the non-sensitive word and the private data attribute;

Determining the degree of privacy of the unstructured data according to the weight of the sensitive word and the weight of the non-sensitive word.

The method according to claim 1, wherein the determining the weight of the non-sensitive word according to the similarity between the non-sensitive word in the word segmentation result and the private data attribute comprises:

Extracting the word vector of the non-sensitive word and the word vector of the private data attribute;

Determining the similarity between the non-sensitive word and the private data attribute according to the distance between the word vector of the non-sensitive word and the word vector of the private data attribute;

Determining the weight of the non-sensitive word according to the similarity between the non-sensitive word and the private data attribute.

The method according to claim 2, wherein the extracting the word vector of the non-sensitive word and the word vector of the private data attribute comprises:

Extracting the word vector of the non-sensitive word and the word vector of the private data attribute using a pre-trained word vector model.

The method according to claim 3, wherein the word vector model is obtained by training in the following manner:

Acquiring a training corpus that matches the application scenario of the unstructured data;

Training an initial word vector model using the training corpus to obtain the word vector model.

The method according to claim 4, wherein the method further comprises:

Identifying sensitive words in the training corpus, and replacing the sensitive words with private data attributes;

The training an initial word vector model using the training corpus to obtain the word vector model comprises:

Training the initial word vector model using the replaced training corpus to obtain the word vector model.

The method according to any one of claims 1 to 5, wherein the method further comprises:

Determining the privacy protection mechanism of the unstructured data according to the degree of privacy of the unstructured data;

Protecting the privacy of the unstructured data using the privacy protection mechanism.

A processing device for unstructured data, characterized in that the device comprises:

A word segmentation module, configured to perform word segmentation on the unstructured data to obtain a word segmentation result;

A weight determination module, configured to determine the weight of the sensitive word in the word segmentation result, and determine the weight of the non-sensitive word in the word segmentation result according to the similarity between the non-sensitive word and the private data attribute;

A privacy degree determination module, configured to determine the degree of privacy of the unstructured data according to the weight of the sensitive word and the weight of the non-sensitive word.

The device according to claim 7, wherein the weight determination module is specifically configured to:

Extract the word vector of the non-sensitive word and the word vector of the private data attribute;

Determine the similarity between the non-sensitive word and the private data attribute according to the distance between the word vector of the non-sensitive word and the word vector of the private data attribute;

Determine the weight of the non-sensitive word according to the similarity between the non-sensitive word and the private data attribute.

The device according to claim 8, wherein the weight determination module is specifically configured to:

Extract the word vector of the non-sensitive word and the word vector of the private data attribute using a pre-trained word vector model.

The device according to claim 9, wherein the device further comprises:

A communication module, configured to obtain a training corpus that matches the application scenario of the unstructured data;

A training module, configured to train an initial word vector model using the training corpus to obtain the word vector model.

The device according to claim 10, wherein the device further comprises:

A replacement module, configured to identify sensitive words in the training corpus and replace the sensitive words with private data attributes;

The training module is specifically configured to:

Train the initial word vector model using the replaced training corpus to obtain the word vector model.

The device according to any one of claims 7 to 11, wherein the device further comprises:

A privacy protection processing module, configured to determine the privacy protection mechanism of the unstructured data according to the degree of privacy of the unstructured data, and use the privacy protection mechanism to protect the privacy of the unstructured data.

A device, characterized in that the device comprises a processor and a memory;

The processor is configured to execute instructions stored in the memory, so that the device executes the method according to any one of claims 1 to 6.

A computer-readable storage medium, characterized by comprising instructions that instruct a device to execute the method according to any one of claims 1 to 6.