CN113553846A

CN113553846A - Method, device, equipment and medium for processing unstructured data

Info

Publication number: CN113553846A
Application number: CN202010331678.3A
Authority: CN
Inventors: 朱天清; 朱运丽; 霍正聃
Original assignee: Huawei Technologies Co Ltd
Current assignee: Huawei Cloud Computing Technologies Co Ltd
Priority date: 2020-04-24
Filing date: 2020-04-24
Publication date: 2021-10-26
Also published as: WO2021212968A1

Abstract

The application provides a method for processing unstructured data. The method comprises: performing word segmentation on the unstructured data to obtain a word segmentation result; determining the weight of each sensitive word in the word segmentation result; determining the weight of each non-sensitive word in the word segmentation result according to the similarity between the non-sensitive word and a privacy data attribute; and determining the privacy degree of the unstructured data from the weights of the sensitive words and the weights of the non-sensitive words. Because non-sensitive words that are contextually related to the sensitive words are taken into account, the method grades the privacy degree with high accuracy, and privacy protection processing performed on this basis achieves a good privacy protection effect.

Description

Method, device, equipment and medium for processing unstructured data

Technical Field

The present application relates to the field of artificial intelligence technologies, and in particular, to a method, an apparatus, a device, and a computer-readable storage medium for processing unstructured data.

Background

With the advent of the information age, data is growing explosively. Data can be divided into structured data and unstructured data. Structured data is data that is logically represented and implemented using a table structure, has a particular data format, and is typically stored and managed using a relational database. Privacy protection mechanisms for structured data are fairly mature. Unstructured data, by contrast, cannot be represented with a uniform structure, which makes privacy protection difficult.

The industry has proposed some privacy protection methods for unstructured data. For example, the privacy degree of the unstructured data is graded according to the proportion of characters that belong to private data, or the proportion of private data items, in the unstructured data, and a corresponding privacy protection mechanism is then adopted based on the privacy level, such as desensitizing all the private data in the text.

However, grading the privacy degree by the character ratio or the count ratio of the private data is not very accurate, so a privacy protection mechanism chosen on the basis of that privacy level struggles to achieve a good privacy protection effect.

Disclosure of Invention

The present application provides a method for processing unstructured data. The method treats the unstructured data as a whole and determines its privacy degree from both the sensitive words and the non-sensitive words it contains, which yields high accuracy. A corresponding privacy protection mechanism can then be adopted based on the privacy degree to achieve a good privacy protection effect. The application also provides an apparatus, a device, a computer-readable storage medium, and a computer program product corresponding to the method.

In a first aspect, the present application provides a method for processing unstructured data. The method may be implemented by a processing system for unstructured data. The system may be deployed in a cloud environment, an edge environment, or an end-side device (i.e., an end device). The cloud environment refers to a central cluster of computing devices owned by a cloud service provider for providing computing, storage, and communication resources; the edge environment refers to a cluster of edge computing devices geographically closer to the end-side device for providing computing, storage, and communication resources. When the system is deployed in a cloud environment or an edge environment, it can be provided to users in the form of a service. When the system is deployed on an end-side device, it can be provided to users in the form of a client. In some implementations, the processing system of unstructured data includes multiple parts, which may also be deployed in a distributed manner across different environments.

Specifically, the processing system of the unstructured data performs word segmentation on the unstructured data to obtain word segmentation results, then determines the weight of a sensitive word in the word segmentation results, determines the weight of the non-sensitive word according to the similarity between the non-sensitive word and the attribute of the private data in the word segmentation results, and then determines the privacy degree of the unstructured data through the weight of the sensitive word and the weight of the non-sensitive word.

According to the method, the unstructured data are taken as a whole, not only are private data, namely sensitive words, considered, but also non-sensitive words having context relations with the sensitive words, and the privacy degree of the unstructured data is determined based on the sensitive words and the non-sensitive words, so that the method can evaluate the privacy degree more accurately and comprehensively. Furthermore, the method can more accurately adopt the privacy protection mechanism of the corresponding level to carry out privacy protection, and has better privacy protection effect.

In some implementations, considering that the similarity of words can be measured by the distance of words in a vector space, the processing system of unstructured data may further extract a word vector of the non-sensitive word and a word vector of the private data attribute, determine the similarity of the non-sensitive word and the private data attribute according to the distance between the word vector of the non-sensitive word and the word vector of the private data attribute, and then determine the weight of the non-sensitive word according to the similarity of the non-sensitive word and the private data attribute.

This approach borrows the natural language processing technique of computing lexical similarity from word vectors to determine the similarity between non-sensitive words and privacy data attributes. Because a word vector retains semantic features, a similarity determined on the basis of those features is more reliable.
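As a minimal sketch of this idea, the following Python snippet computes the cosine similarity between two word vectors. The choice of cosine similarity, the function name `cosine_similarity`, and the toy three-dimensional vectors are assumptions made for illustration; the application only states that the similarity is derived from the distance between word vectors.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


# Toy vectors, made up for illustration (not produced by a real model).
vec_graduated = [0.9, 0.1, 0.3]   # non-sensitive word "graduated"
vec_school = [0.8, 0.2, 0.4]      # privacy data attribute "school"
vec_weather = [-0.1, 0.9, -0.5]   # unrelated word

print(cosine_similarity(vec_graduated, vec_school))   # close to 1
print(cosine_similarity(vec_graduated, vec_weather))  # much lower
```

In practice the vectors would come from a trained word vector model rather than being written by hand.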

In some implementations, the processing system of unstructured data may extract the word vectors for the non-sensitive words and the word vectors for the privacy data attributes using a pre-trained word vector model. The word vector is extracted through the word vector model, and the efficiency and the accuracy are high.

In some implementations, the definitions of the private data for different application scenarios may be different, and the language application and expression modes of different application scenarios are greatly different, so that the contexts of the same word may be greatly different in the corpora of different application scenarios, and if the initial word vector model is trained using a general corpus, the accuracy of the trained word vector model may be low. Based on this, the processing system of the unstructured data can also obtain a training corpus matched with the application scene of the unstructured data, and train an initial word vector model by using the training corpus to obtain a word vector model.

In some implementations, words corresponding to the same privacy data attribute often have similar contexts, but the private data words corresponding to the same privacy data attribute vary widely. For example, the private data words corresponding to the attribute "name" may be "Zhang San", "Li Si", "Wang Wu", and so on. Many private data words may appear only a few times, so a word vector model trained directly on the raw training corpus is not accurate enough. To train a better word vector model and calculate similarity more accurately, and thereby assign sensitivity weights better, the processing system of unstructured data can also preprocess the training corpus. Specifically, the sensitive words in the training corpus are identified and replaced with their privacy data attributes, and the initial word vector model is then trained on the replaced training corpus to obtain the word vector model.
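The corpus preprocessing step can be sketched as follows. The lookup table `SENSITIVE_TO_ATTRIBUTE`, the placeholder token `<NAME>`, and the function name are hypothetical; how sensitive words are actually recognized is not specified here, so a plain dictionary lookup stands in for it.

```python
from typing import Dict, List

# Hypothetical lookup from a sensitive word to its privacy data attribute.
SENSITIVE_TO_ATTRIBUTE: Dict[str, str] = {
    "Zhang San": "<NAME>",
    "Li Si": "<NAME>",
    "Wang Wu": "<NAME>",
}


def replace_sensitive_words(tokens: List[str], lookup: Dict[str, str]) -> List[str]:
    """Replace each sensitive word with its privacy data attribute token."""
    return [lookup.get(token, token) for token in tokens]


# Toy training corpus, already segmented into words.
corpus = [
    ["my", "name", "is", "Zhang San"],
    ["my", "name", "is", "Li Si"],
]
replaced = [replace_sensitive_words(s, SENSITIVE_TO_ATTRIBUTE) for s in corpus]
print(replaced)
```

After replacement, both sentences contribute context to the same attribute token, so rare names no longer fragment the training signal. The replaced corpus would then be passed to whichever word vector trainer is in use.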

In some implementations, the processing system of the unstructured data may further determine a privacy protection mechanism of the unstructured data according to a privacy degree of the unstructured data, and perform privacy protection on the unstructured data by using the privacy protection mechanism. The method can avoid the direct disclosure of the private information caused by the private data, and also effectively prevent the indirect disclosure of the private information caused by the semantic problem, thereby better protecting the private information.

In a second aspect, the present application provides an apparatus for processing unstructured data. The device comprises:

the word segmentation module is used for segmenting words of the unstructured data to obtain word segmentation results;

the weight determining module is used for determining the weight of the sensitive words in the word segmentation result and determining the weight of the non-sensitive words according to the similarity between the non-sensitive words in the word segmentation result and the privacy data attribute;

and the privacy degree determining module is used for determining the privacy degree of the unstructured data according to the weight of the sensitive words and the weight of the non-sensitive words.

In some implementations, the weight determination module is specifically configured to:

extracting a word vector of the non-sensitive word and a word vector of the privacy data attribute;

determining the similarity between the non-sensitive words and the privacy data attributes according to the distance between the word vectors of the non-sensitive words and the word vectors of the privacy data attributes;

and determining the weight of the non-sensitive word according to the similarity of the non-sensitive word and the privacy data attribute.

In some implementations, the weight determination module is specifically configured to:

extract the word vector of the non-sensitive word and the word vector of the privacy data attribute by using a pre-trained word vector model.

In some implementations, the apparatus further includes:

the communication module is used for acquiring training corpora matched with the application scene of the unstructured data;

and the training module is used for training an initial word vector model by using the training corpus to obtain a word vector model.

In some implementations, the apparatus further includes:

the replacing module is used for identifying the sensitive words in the training corpus and replacing the sensitive words by using the privacy data attributes;

the training module is specifically configured to:

and training the initial word vector model by using the replaced training corpus to obtain a word vector model.

In some implementations, the apparatus further includes:

and the privacy protection processing module is used for determining a privacy protection mechanism of the unstructured data according to the privacy degree of the unstructured data and carrying out privacy protection on the unstructured data by utilizing the privacy protection mechanism.

In a third aspect, the present application provides an apparatus comprising a processor and a memory. The processor and the memory are in communication with each other. The processor is configured to execute the instructions stored in the memory to cause the apparatus to perform the method of processing unstructured data as in the first aspect or any implementation manner of the first aspect.

In a fourth aspect, the present application provides a computer-readable storage medium, where instructions are stored, and the instructions instruct a device to execute the method for processing unstructured data according to the first aspect or any implementation manner of the first aspect.

In a fifth aspect, the present application provides a computer program product comprising instructions which, when run on a device, cause the device to perform the method for processing unstructured data according to the first aspect or any of the implementations of the first aspect.

The present application can further combine to provide more implementations on the basis of the implementations provided by the above aspects.

Drawings

In order to more clearly illustrate the technical method of the embodiments of the present application, the drawings used in the embodiments will be briefly described below.

FIG. 1 is an architecture diagram of an unstructured data processing system according to an embodiment of the present application;

FIG. 2 is an architecture diagram of an unstructured data processing system according to an embodiment of the present application;

FIG. 3 is a flowchart of a method for processing unstructured data according to an embodiment of the present application;

FIG. 4 is a schematic diagram illustrating a method for determining weights of non-sensitive words according to an embodiment of the present application;

FIG. 5 is a schematic structural diagram of an apparatus for processing unstructured data according to an embodiment of the present application;

FIG. 6 is a schematic structural diagram of a device according to an embodiment of the present application.

Detailed Description

The terms "first" and "second" in the embodiments of the present application are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature.

Some technical terms referred to in the embodiments of the present application will be first described.

Unstructured data refers to data that is irregular or incomplete in data structure, has no predefined data model, and is not convenient to logically express and implement by using a database two-dimensional logical table. The format of unstructured data is diverse. As one example, unstructured data may include documents or text in various formats.

A word vector is also called a word embedding. It is the vector obtained by mapping a word into a continuous vector space of relatively low dimension, and it can typically be represented as a sequence of real numbers. Such a representation can be understood as a neural-network-based distributed representation that preserves the semantic features of the word.

For unstructured data such as personal resumes, medical records, and office documents, the industry has proposed a privacy protection method. Specifically, the private data present in the unstructured data is identified according to the definition of private data. The character ratio of the private data is determined as the ratio of the number of characters of the private data to the total number of characters of the unstructured data, or the count ratio of the private data is determined as the ratio of the number of private data items to the total number of words in the unstructured data, and the privacy degree is graded according to the character ratio or the count ratio. Then, based on the privacy level, a corresponding privacy protection mechanism is adopted, such as desensitizing all the private data in the unstructured data.

However, grading the privacy degree by the character ratio or the count ratio of the private data ignores the correlation between contexts. Unstructured data such as personal resumes, medical records, and office documents may include, in addition to the private data, words that are highly similar to or strongly indicative of the private data. Even if all the private data in the unstructured data is desensitized during privacy protection, some information related to the private data may still be inferred from those words, so the unstructured data is not fully desensitized and private information is leaked to some extent.

For example, consider the sentence "My name is Wang, I graduated from the University of Finance, and I do not want to stay and work at company X." In the privacy protection process, if the privacy degree is graded only by the count ratio or the character ratio of the private data, and only the private data is desensitized, the desensitized sentence becomes "My name is ___, I graduated from ___, and I do not want to stay and work at ___."

The above privacy protection process considers only the private data. Although the private data is masked, the desensitization of the sentence is incomplete. Because the semantics of the sentence are not considered, the desensitized sentence is still semantically complete, and its privacy degree is not reduced to the minimum. Specifically, "name" points strongly to the privacy data attribute "name"; "graduated" points strongly to the privacy data attribute "school"; and "stay" and "work" point strongly to the privacy data attribute "workplace" and also express the person's intention. Information related to the desensitized private data, or the intention the person wishes to express, can be inferred from these strongly indicative words.

Therefore, the privacy degree classification method based on the number ratio of the privacy data or the number ratio of the characters of the privacy data is not high in accuracy, so that a privacy protection mechanism based on the privacy level is difficult to achieve a good privacy protection effect.

In view of this, the present application provides a method for processing unstructured data. The method may be performed by a processing system for unstructured data. Specifically, the processing system performs word segmentation on the unstructured data to obtain a word segmentation result and determines the weight of the sensitive words in that result. Considering the semantic characteristic that context words in unstructured data are strongly correlated, the processing system also determines, for the non-sensitive words other than the sensitive words, a weight according to the similarity between each non-sensitive word and the privacy data attribute, and determines the privacy degree of the unstructured data from the weight of the sensitive words and the weight of the non-sensitive words.

According to the processing method of the unstructured data, the unstructured data are taken as a whole, not only are private data, namely sensitive words, considered, but also non-sensitive words having context relations with the sensitive words, and the privacy degree of the unstructured data is determined based on the sensitive words and the non-sensitive words, so that the method can evaluate the privacy degree more accurately and comprehensively. Furthermore, the method can more accurately adopt the privacy protection mechanism of the corresponding level to carry out privacy protection, and has better privacy protection effect.

As shown in FIG. 1, a processing system for unstructured data may be deployed in a cloud environment, specifically on one or more computing devices (e.g., central servers) in the cloud environment. The system may also be deployed in an edge environment, specifically on one or more computing devices (edge computing devices) in the edge environment, which may be servers. The system may also be deployed on an end-side device (i.e., an end device), including but not limited to a desktop computer, a laptop computer, a smartphone, and the like.

The cloud environment indicates a central cluster of computing devices owned by a cloud service provider for providing computing, storage, and communication resources; the edge environment indicates a cluster of edge computing devices geographically closer to the end-side device for providing computing, storage, and communication resources.

The end-side device may be used as a data providing device for providing unstructured data, so that a processing system of the unstructured data processes the unstructured data to determine the privacy degree thereof, and further performs privacy protection processing based on the privacy degree thereof by using a corresponding privacy protection mechanism. The end-side device may provide unstructured data generated or stored by itself for processing by a processing system of the unstructured data. In some implementations, the end-side device may be a network device, for example, a terminal device accessing a network, and thus, the end-side device may obtain unstructured data from the network and provide the unstructured data to a processing system.

When the processing system of the unstructured data is deployed in a cloud environment or an edge environment, the processing system of the unstructured data can be provided for users to use in a service form. Specifically, a user may access the cloud environment or the edge environment through a browser, create an instance of the processing system for unstructured data in the cloud environment or the edge environment, and then interact with the instance of the processing system for unstructured data through the browser, thereby implementing processing of unstructured data.

Processing systems for unstructured data may also be deployed on the end-side devices. Correspondingly, the processing system of the unstructured data can be provided for the user to use in a client form. Specifically, the user runs the client to realize the processing of the unstructured data.

In some implementations, as shown in FIG. 2, the processing system of unstructured data includes multiple parts (for example, multiple subsystems, each of which includes multiple unit modules), so the parts of the processing system may also be deployed in a distributed manner across different environments. For example, the parts of the processing system may be deployed across all three of the cloud environment, the edge environment, and the end-side device, or across any two of them.

In order to make the technical solutions provided in the embodiments of the present application clearer and easier to understand, a method for processing unstructured data will be described below from the perspective of a system for processing unstructured data.

Referring to fig. 3, a flow chart of a method for processing unstructured data is shown, the method comprising:

S302: The processing system of the unstructured data performs word segmentation on the unstructured data to obtain a word segmentation result.

In specific implementation, the processing system of the unstructured data may perform word segmentation on the unstructured data by using any one or more of a word segmentation method based on string matching, a word segmentation method based on understanding, a word segmentation method based on statistics, and the like, so as to obtain word segmentation results.

The word segmentation method based on character string matching matches a character string to be analyzed with entries in a machine dictionary according to a set strategy, and if a certain character string is found in the dictionary, the matching is successful, and a word is recognized. And then continuing to execute the matching operation, thereby realizing word segmentation of the unstructured data.

Further, string matching can be performed in different scanning directions, so the word segmentation method based on string matching can be divided into a forward maximum matching method and a reverse maximum matching method. Depending on which match length is given priority, it can also be divided into a longest matching method and a shortest matching method. In addition, depending on whether it is combined with part-of-speech tagging, it can be divided into a pure word segmentation method and an integrated method that combines word segmentation with part-of-speech tagging.
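Forward maximum matching can be sketched as follows. The function name, the `max_len` parameter, the toy dictionary, and the fallback of emitting a single character when nothing matches are assumptions for illustration; an unspaced ASCII string stands in for text that has no explicit word boundaries.

```python
from typing import List, Set


def forward_max_match(text: str, dictionary: Set[str], max_len: int = 8) -> List[str]:
    """Segment text left to right, always taking the longest dictionary match."""
    tokens: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a dictionary hit.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is accepted even if it is not in the dictionary.
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


toy_dictionary = {"my", "name", "is", "zhangsan"}
print(forward_max_match("mynameiszhangsan", toy_dictionary))
# → ['my', 'name', 'is', 'zhangsan']
```

A reverse maximum matching variant would scan from the end of the string toward the beginning using the same longest-first rule.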

The word segmentation method based on understanding achieves the effect of recognizing words by simulating the understanding of sentences. Specifically, syntactic analysis and semantic analysis are carried out at the same time of word segmentation, and ambiguity is eliminated by utilizing syntactic information and semantic information, so that word segmentation is carried out on unstructured data such as texts.

The word segmentation method based on statistics uses a statistical machine learning model to learn word segmentation rules from a large amount of already segmented text, and then applies them to segment unseen text. Statistics-based word segmentation methods include the maximum probability word segmentation method and the maximum entropy word segmentation method. The statistical model used in these methods may be one of an N-gram model, a Hidden Markov Model (HMM), a Maximum Entropy Model (MEM), and a Conditional Random Field (CRF) model.
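A minimal sketch of maximum probability segmentation is shown below, assuming a unigram model (each word scored independently) and dynamic programming over split points. The unigram simplification, the toy probabilities, and the `unknown_prob` fallback for unseen single characters are assumptions; a real system would estimate probabilities from a segmented corpus and might use one of the richer models listed above.

```python
import math
from typing import Dict, List


def max_prob_segment(text: str, word_prob: Dict[str, float],
                     max_len: int = 8, unknown_prob: float = 1e-8) -> List[str]:
    """Return the segmentation with the highest product of word probabilities."""
    n = len(text)
    best = [0.0] + [-math.inf] * n  # best[i]: best log-probability of text[:i]
    back = [0] * (n + 1)            # back[i]: start index of the last word
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            word = text[j:i]
            if word in word_prob:
                p = word_prob[word]
            elif len(word) == 1:
                p = unknown_prob    # unseen single character
            else:
                continue            # unseen multi-character strings are skipped
            score = best[j] + math.log(p)
            if score > best[i]:
                best[i] = score
                back[i] = j
    # Walk the back-pointers from the end to recover the words.
    tokens: List[str] = []
    i = n
    while i > 0:
        tokens.append(text[back[i]:i])
        i = back[i]
    return tokens[::-1]


toy_probs = {"the": 0.05, "table": 0.01, "tab": 0.002, "le": 0.001}
print(max_prob_segment("thetable", toy_probs))
# → ['the', 'table']
```

Here "the" + "table" wins over "the" + "tab" + "le" because the product of its word probabilities is larger.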

Specifically, the processing system of the unstructured data may select a matching word segmentation method to perform word segmentation based on the language, scene, and the like of the unstructured data, so as to obtain a word segmentation result.

In some implementations, in order to save storage space and improve the processing efficiency for unstructured data, the processing system of unstructured data may also remove stop words after word segmentation to obtain the final word segmentation result.
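Stop-word removal can be sketched in a few lines. The stop-word list below is a small made-up sample, not a list taken from the application.

```python
from typing import Iterable, List, Set

# Hypothetical stop-word list for illustration.
STOP_WORDS: Set[str] = {"a", "an", "the", "is", "of", "to"}


def remove_stop_words(tokens: Iterable[str], stop_words: Set[str] = STOP_WORDS) -> List[str]:
    """Drop tokens whose lowercase form is in the stop-word list."""
    return [t for t in tokens if t.lower() not in stop_words]


print(remove_stop_words(["Zhang San", "is", "my", "name"]))
# → ['Zhang San', 'my', 'name']
```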

S304: The processing system of the unstructured data determines the weight of each sensitive word in the word segmentation result, and determines the weight of each non-sensitive word in the word segmentation result according to the similarity between the non-sensitive word and the privacy data attribute.

Specifically, the processing system of the unstructured data may determine the sensitive words according to the word segmentation result, and the words other than the sensitive words in the word segmentation result are the non-sensitive words, and then the processing system of the unstructured data may determine the weights of the sensitive words and determine the weights of the non-sensitive words according to the similarity between the non-sensitive words and the privacy data attributes. The weight is specifically used for measuring the importance degree of the sensitive words or the non-sensitive words to the privacy degree of the whole unstructured data.

The privacy data attribute describes the type of the private data. For example, for the private data "Zhang San", the corresponding privacy data attribute is "name", and for the private data "xx@yy.com", the corresponding privacy data attribute is "email address".

The definition of private data may differ between application scenarios. For example, information such as a birthday or a place of birth is regarded as private in some application scenarios, such as those governed by the General Data Protection Regulation (GDPR), but is not regarded as private in other application scenarios, such as a medical scenario. As shown in the following tables:

Table 1 Privacy data template in medical scenarios

Table 2 Privacy data template in the GDPR scenario

No.	Attribute	Private?	No.	Attribute	Private?
I	Name	Yes	XI	Bank card number	Yes
II	E-mail address	Yes	XII	Nationality	Yes
III	Mobile phone number	Yes	XIII	Political affiliation	Yes
IV	Home telephone number	Yes	XIV	IP address	Yes
V	Any address	Yes	XV	GPS information	Yes
VI	Identity card number	Yes	XVI	DNA information	Yes
VII	Passport number	Yes	XVII	Fingerprint	Yes
VIII	License plate number	Yes	XVIII	Iris information	Yes
IX	Birthday	Yes	XIX	Disease diagnosis	Yes
X	Place of birth	Yes

Based on this, when determining the sensitive words, the processing system of the unstructured data can match the attribute of each word in the word segmentation result against the privacy data attributes defined by the privacy data template of the current application scenario, so as to determine whether each word in the word segmentation result is a sensitive word or a non-sensitive word. The sensitive words and non-sensitive words determined in this way are more accurate.
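Template-based classification can be sketched as follows. The recognizers (a toy name list and a simple e-mail regular expression), the attribute names, and the two templates are assumptions for illustration; the application does not specify how the attribute of a word is recognized.

```python
import re
from typing import Callable, Dict, List, Optional, Set, Tuple

# Hypothetical recognizers: each maps a privacy data attribute to a test on a word.
EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$")
KNOWN_NAMES = {"Zhang San", "Li Si", "Wang Wu"}  # toy name list

RECOGNIZERS: Dict[str, Callable[[str], bool]] = {
    "name": lambda w: w in KNOWN_NAMES,
    "email address": lambda w: EMAIL_RE.match(w) is not None,
}


def find_attribute(word: str, template: Set[str]) -> Optional[str]:
    """Return the first template attribute the word matches, or None."""
    for attribute, recognizer in RECOGNIZERS.items():
        if attribute in template and recognizer(word):
            return attribute
    return None


def split_words(tokens: List[str], template: Set[str]) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Split tokens into (sensitive word, attribute) pairs and non-sensitive words."""
    sensitive: List[Tuple[str, str]] = []
    non_sensitive: List[str] = []
    for token in tokens:
        attribute = find_attribute(token, template)
        if attribute is None:
            non_sensitive.append(token)
        else:
            sensitive.append((token, attribute))
    return sensitive, non_sensitive


tokens = ["Zhang San", "name", "xx@yy.com", "work"]
strict_template = {"name", "email address"}  # scenario where e-mail is private
loose_template = {"name"}                    # scenario where it is not

print(split_words(tokens, strict_template))
print(split_words(tokens, loose_template))
```

The same token list yields different sensitive words under the two templates, which is the scenario dependence described above.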

Then, for a sensitive word, the processing system of the unstructured data may determine its weight according to a preset weight. For example, the weight of sensitive words is set to a standard weight such as 1, and each sensitive word is assigned that preset weight.

For a non-sensitive word, the processing system determines the weight according to the similarity between the non-sensitive word and the privacy data attribute, specifically according to a correspondence between similarity and weight. The higher the similarity between the non-sensitive word and the privacy data attribute, the higher the weight of the non-sensitive word; the lower the similarity, the lower the weight.

For ease of understanding, a specific example follows. In this example, the unstructured data includes the sentence "Zhang San is my name". Based on the privacy data attributes, the processing system of the unstructured data determines that "Zhang San" is a sensitive word and "name" is a non-sensitive word. It calculates the similarity between the non-sensitive word "name" and the privacy data attribute "name" as 0.9999, and determines the weight as 0.8 according to the correspondence between similarity and weight. The weight of "Zhang San" can thus be determined to be 1 and the weight of "name" to be 0.8.
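The correspondence between similarity and weight can be sketched as a banded lookup. Only the pairing "similarity 0.9999 gives weight 0.8" and the preset sensitive-word weight of 1 come from the example above; the remaining thresholds, weights, and names are hypothetical and chosen only so that the mapping is monotonic and stays within (0, 1).

```python
from typing import List, Tuple

# Hypothetical (similarity threshold, weight) bands, highest threshold first.
WEIGHT_BANDS: List[Tuple[float, float]] = [
    (0.9, 0.8),
    (0.7, 0.5),
    (0.5, 0.3),
]
DEFAULT_WEIGHT = 0.1    # assumed weight for weakly related words
SENSITIVE_WEIGHT = 1.0  # preset weight for sensitive words


def weight_from_similarity(similarity: float) -> float:
    """Map a similarity score to a non-sensitive word weight in (0, 1)."""
    for threshold, weight in WEIGHT_BANDS:
        if similarity >= threshold:
            return weight
    return DEFAULT_WEIGHT


weights = {
    "Zhang San": SENSITIVE_WEIGHT,           # sensitive word
    "name": weight_from_similarity(0.9999),  # non-sensitive word
}
print(weights)  # {'Zhang San': 1.0, 'name': 0.8}
```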

S306: The processing system of the unstructured data determines the privacy degree of the unstructured data from the weight of the sensitive words and the weight of the non-sensitive words.

Specifically, the processing system of the unstructured data may obtain the privacy degree of the unstructured data by performing weighted aggregation on the weights of all sensitive words and the weights of all non-sensitive words.

In one example, the formula for calculating the privacy level is specifically as follows:

Here, privacylevel represents the privacy degree, also referred to as the sensitivity level, and n is the total number of sensitive and non-sensitive words. g_i represents the sensitivity value of the i-th word in the unstructured data, as follows:

where I_i is, when the i-th word is a non-sensitive word, the similarity between that non-sensitive word and the privacy data attribute, and α_i is the weight of the non-sensitive word, representing the influence of the non-sensitive word on the privacy degree of the unstructured data. The value of α_i lies in the range (0, 1) and is determined according to the similarity between the non-sensitive word and the privacy data attribute.

In one example, the similarity of the non-sensitive word and the private data attribute has the following correspondence with the weight of the non-sensitive word:

The processing system of the unstructured data determines the weight of the non-sensitive words based on formula (3) above, and then determines the privacy degree of the unstructured data based on the weight of the sensitive words and the weight of the non-sensitive words.
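The aggregation in S306 can be sketched as follows. This is a minimal illustration under stated assumptions, not the formulas of the embodiment: the step table `SIMILARITY_TO_WEIGHT` is hypothetical apart from the single data point taken from the example above (similarity 0.9999 gives weight 0.8), and averaging the per-word values is only one assumed form of weighted aggregation. The function names are likewise illustrative.

```python
from bisect import bisect_right

# Hypothetical step table: (lower bound of similarity, weight).
# Only the top band (similarity 0.9999 -> weight 0.8) comes from the example
# in the text; the other bands are placeholders for illustration.
SIMILARITY_TO_WEIGHT = [(0.0, 0.1), (0.5, 0.4), (0.9, 0.8)]


def non_sensitive_weight(similarity: float) -> float:
    """Map a similarity to a weight in (0, 1) using the step table."""
    bounds = [lower for lower, _ in SIMILARITY_TO_WEIGHT]
    index = bisect_right(bounds, similarity) - 1
    # Similarities below the first bound fall into the lowest band.
    return SIMILARITY_TO_WEIGHT[max(index, 0)][1]


def privacy_degree(words):
    """Average the per-word sensitivity values g_i over all n words.

    `words` is a list of (is_sensitive, similarity) pairs. A sensitive word
    contributes 1; a non-sensitive word contributes its weight alpha_i.
    Averaging is an assumption, not the formula of the embodiment.
    """
    if not words:
        return 0.0
    values = [1.0 if is_sensitive else non_sensitive_weight(similarity)
              for is_sensitive, similarity in words]
    return sum(values) / len(values)


# "three" is sensitive (weight 1); "name" is non-sensitive, similarity 0.9999.
example = [(True, 1.0), (False, 0.9999)]
print(privacy_degree(example))
```

A step table keeps the higher-similarity, higher-weight relationship described above while leaving the actual band boundaries configurable.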

Based on the above description, the embodiments of the present application provide a method for processing unstructured data. The method treats the unstructured data as a whole and considers the association between contexts in the unstructured data: it determines the weight of a non-sensitive word from the similarity between that non-sensitive word and the privacy data attribute, and determines the text privacy degree from the weight of the sensitive words together with the weight of the non-sensitive words that have a context relationship with them, and therefore achieves higher accuracy.

Moreover, because the privacy degree is graded more accurately, a privacy protection mechanism of the corresponding level can be determined more accurately. Performing privacy protection on the unstructured data with this mechanism not only avoids direct disclosure of private information through the private data itself, but also effectively prevents indirect disclosure of private information through semantic context, so that private information is better protected.

In order to verify that the privacy degree grading method provided by the application can better evaluate the privacy degree of the unstructured data than a traditional method, the embodiment of the application also designs an attack scene for verification.

Specifically, in the attack scene, the same masking desensitization is applied to all private data in the unstructured text data: every item of private data is uniformly replaced by spaces. The private data is then guessed from its context vocabulary. The higher the probability of guessing the correct information, the more private text information an attacker can obtain, which indicates that the level of the privacy protection mechanism currently adopted is insufficient and the desensitization is incomplete. Therefore, if the grading of the text privacy degree is not accurate enough, text data with a high privacy level may be desensitized with a low-level privacy protection mechanism, so that the text data is not completely desensitized and the desensitized text data may still reveal private information.
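The masking step of this attack scene can be sketched as follows. The function name `mask_private_data` is hypothetical, and the list of sensitive words is assumed to come from an earlier identification step. Replacing each sensitive token with spaces of the same length is one possible reading of "uniformly replaced by spaces".

```python
def mask_private_data(tokens, sensitive_words):
    """Replace every sensitive token with spaces, keeping other tokens."""
    sensitive = set(sensitive_words)
    return [" " * len(token) if token in sensitive else token
            for token in tokens]


tokens = ["three", "is", "my", "name"]
masked = mask_private_data(tokens, ["three"])
# The context words "is", "my", "name" remain visible to the attacker.
print(masked)
```

The surviving context word "name" is exactly what lets an attacker infer that the masked token was a name, which is why the non-sensitive context is taken into account when grading.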

A privacy degree ranking of the texts can be obtained for verification by predicting the private data in the designed attack scene. Correspondingly, the text privacy degree is calculated with both the privacy degree grading method provided by the embodiment of the present application and the traditional methods, and the texts are ranked accordingly. The closer a ranking is to the privacy degree ranking obtained from the attack scene, the more accurately the corresponding method reflects the privacy level of the text.

The measure of the closeness of the ranking may be implemented by Mean Square Error (MSE), and a calculation formula of the MSE is as follows:

wherein n is the number of documents; x and y represent two ranked lists of document privacy levels.
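As a worked illustration, the comparison of two rankings can be sketched with the standard definition of mean square error, MSE(x, y) = (1/n) * sum over i of (x_i - y_i)^2, which is assumed here to match the symbols n, x, and y defined above. The rank lists in the sketch are made-up values, not the data of Table 3, and the function name is illustrative.

```python
def mean_square_error(x, y):
    """Standard MSE between two equally long rank lists."""
    if len(x) != len(y) or not x:
        raise ValueError("rank lists must be non-empty and of equal length")
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)


# Hypothetical ranks of four documents under two grading methods.
verification_rank = [1, 2, 3, 4]
candidate_rank = [2, 1, 3, 4]
print(mean_square_error(verification_rank, candidate_rank))  # 0.5
```

A smaller value means the candidate ranking agrees more closely with the verification ranking.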

The embodiments of the present application provide the following experimental data:

TABLE 3 privacy level ranking determined by different methods

Computing MSE according to the ranking of table 3 may result in:

MSE (verification ranking, ranking of the present application) = 6;

MSE (verification ranking, ranking by proportion of the number of private data items) = 12;

MSE (verification ranking, ranking by proportion of the number of private data characters) = 34.

Therefore, compared with the traditional privacy degree grading methods based on the proportion of the number of private data items or the proportion of the number of private data characters, the ranking produced by the similarity-based privacy degree grading method is closer to the verification ranking. The method provided by the embodiment of the present application can thus grade the privacy degree of unstructured data more accurately.

In consideration of the semantic characteristic that context vocabularies have relevance, the embodiment of the application introduces a method for calculating vocabulary similarity by using word vectors in Natural Language Processing (NLP), and the method is used for calculating the similarity between non-sensitive words and private data attributes.

Specifically, as shown in fig. 4, the processing system of unstructured data may extract a word vector of non-sensitive words and a word vector of private data attributes, respectively, e.g., input the non-sensitive words and the private data attributes into a pre-trained word vector model, thereby obtaining the word vector of non-sensitive words and the word vector of private data attributes. And then, determining the similarity between the non-sensitive words and the privacy data attributes according to the distance between the word vectors of the non-sensitive words and the word vectors of the privacy data attributes. Then, according to the similarity between the non-sensitive word and the privacy data attribute, the weight of the non-sensitive word is determined based on the correspondence between the similarity and the weight (for example, the correspondence shown in formula (3)).
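The similarity step can be sketched as follows. Cosine similarity is used here as an assumed choice of vector measure; the text only states that the similarity is derived from the distance between the two word vectors. The toy vectors stand in for the output of a pre-trained word vector model and are not real embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two word vectors of equal dimension."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_u * norm_v)


# Toy 3-dimensional vectors for a non-sensitive word and a privacy attribute.
vec_non_sensitive = [0.2, 0.7, 0.1]
vec_attribute = [0.25, 0.65, 0.05]
print(cosine_similarity(vec_non_sensitive, vec_attribute))
```

The resulting value would then be mapped to a weight through the correspondence between similarity and weight.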

The word vector model can be obtained by training through methods such as word2vec and the like. Specifically, the processing system of unstructured data may construct an initial word vector model by word2vec, and train the initial word vector model using the training corpus, thereby obtaining a word vector model for extracting word vectors.

Different application scenarios may define private data differently, and their language use and modes of expression differ greatly, so the contexts of the same word may differ considerably across the corpora of different application scenarios. If the initial word vector model is trained on a general-purpose corpus, the accuracy of the trained word vector model may be low. For this reason, the processing system of the unstructured data can obtain a training corpus matching the application scenario of the unstructured data, and then train the initial word vector model on this scenario-specific training corpus to obtain the word vector model.

Further, even in the corpus of a fixed application scenario, the private data words corresponding to the same privacy data attribute often have similar contexts but vary widely; for example, the private data words corresponding to a name may be "zhang san", "li xi", "wang wu", and the like. Many private data words may appear only a few times, so a word vector model trained directly on the training corpus is not accurate enough. To train a better word vector model and calculate similarity more accurately, and thereby assign sensitivity weights better, the processing system of the unstructured data can also preprocess the training corpus. Specifically, the sensitive words in the training corpus are identified and replaced with their privacy data attributes, and the initial word vector model is then trained on the replaced training corpus to obtain the word vector model.
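The corpus preprocessing can be sketched as follows. The function name `replace_sensitive_words` and the mapping from sensitive words to attributes are hypothetical; in practice the mapping would come from the sensitive word identification step, and the corpus is assumed to be segmented already.

```python
def replace_sensitive_words(sentences, word_to_attribute):
    """Replace each sensitive token with its privacy data attribute.

    `sentences` is a list of token lists; `word_to_attribute` maps a
    sensitive word to its attribute (both are illustrative inputs).
    """
    return [[word_to_attribute.get(token, token) for token in sentence]
            for sentence in sentences]


corpus = [["zhang san", "lives", "here"],
          ["wang wu", "lives", "there"]]
attributes = {"zhang san": "name", "wang wu": "name"}
print(replace_sensitive_words(corpus, attributes))
```

After replacement, the rare individual names collapse into the single token "name", so the contexts they appeared in all contribute to one word vector during training.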

The method for processing unstructured data provided by the embodiment of the present application has been described in detail above with reference to fig. 1 to 4; the apparatus and the device provided by the embodiment of the present application are described below with reference to the accompanying drawings.

Referring to fig. 5, a schematic structural diagram of an apparatus for processing unstructured data is shown, where the apparatus 500 includes:

a word segmentation module 502, configured to perform word segmentation on the unstructured data to obtain a word segmentation result;

a weight determining module 504, configured to determine a weight of a sensitive word in the word segmentation result, and determine a weight of a non-sensitive word according to a similarity between the non-sensitive word in the word segmentation result and a private data attribute;

a privacy level determining module 506, configured to determine a privacy level of the unstructured data according to the weight of the sensitive word and the weight of the non-sensitive word.

In some implementations, the weight determining module 504 is specifically configured to:

In some implementations, the apparatus 500 further includes:

In some implementations, the apparatus further includes:

the training module is specifically configured to:

In some implementations, the apparatus further includes:

The processing apparatus 500 for unstructured data according to the embodiment of the present application may correspond to performing the method described in the embodiment of the present application, and the above and other operations and/or functions of each module/unit of the processing apparatus 500 for unstructured data are respectively for implementing corresponding flows of each method in the embodiment shown in fig. 3, and are not described herein again for brevity.

An embodiment of the present application further provides a device 600. The device 600 may be an end-side device such as a laptop computer or a desktop computer, or may be a computer cluster in a cloud environment or an edge environment. The device 600 is specifically configured to implement the functions of the processing apparatus 500 of unstructured data in the embodiment shown in fig. 5.

Fig. 6 provides a schematic structural diagram of the device 600. As shown in fig. 6, the device 600 includes a bus 601, a processor 602, a communication interface 603, and a memory 604. The processor 602, the memory 604, and the communication interface 603 communicate over the bus 601. The bus 601 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in fig. 6, but this does not mean that there is only one bus or one type of bus. The communication interface 603 is used for communication with the outside, for example, to obtain a training corpus matching the application scenario of the unstructured data, or to obtain the unstructured data.

The processor 602 may be a Central Processing Unit (CPU). The memory 604 may include a volatile memory, such as a Random Access Memory (RAM). The memory 604 may also include a non-volatile memory, such as a read-only memory (ROM), a flash memory, a hard disk drive (HDD), or a solid state drive (SSD).

The memory 604 stores executable code that the processor 602 executes to perform the processing of the unstructured data described above.

Specifically, when the embodiment shown in fig. 5 is implemented and the modules of the processing apparatus 500 of unstructured data described in the embodiment of fig. 5 are implemented by software, the software or program code required to perform the functions of the word segmentation module 502, the weight determining module 504, and the privacy level determining module 506 in fig. 5 is stored in the memory 604. The communication module functions are implemented by the communication interface 603. The communication interface 603 receives the unstructured data and transmits it to the processor 602 through the bus 601. The processor 602 executes the program code corresponding to each module stored in the memory 604, such as the program code corresponding to the word segmentation module 502, the weight determining module 504, and the privacy level determining module 506, to perform word segmentation on the unstructured data, determine the weight of the sensitive words, determine the weight of the non-sensitive words according to the similarity between the non-sensitive words and the privacy data attribute, and then determine the privacy degree of the unstructured data according to the weight of the sensitive words and the weight of the non-sensitive words.

Of course, the processor 602 may further execute program code corresponding to the privacy protection processing module, so as to determine a privacy protection mechanism for the unstructured data according to the privacy degree of the unstructured data, and perform privacy protection on the unstructured data by using the privacy protection mechanism.

An embodiment of the present application further provides a computer-readable storage medium, which includes instructions for instructing a computer to execute the above processing method of unstructured data applied to the processing apparatus 500 of unstructured data.

The embodiment of the application also provides a computer program product, and when the computer program product is executed by a computer, the computer executes any one of the processing methods of the unstructured data. The computer program product may be a software installation package that can be downloaded and executed on a computer in the event that any of the aforementioned methods of processing unstructured data need to be used.

Claims

1. A method for processing unstructured data, the method comprising:

performing word segmentation on the unstructured data to obtain word segmentation results;

determining the weight of a sensitive word in the word segmentation result, and determining the weight of a non-sensitive word according to the similarity of the non-sensitive word and the privacy data attribute in the word segmentation result;

and determining the privacy degree of the unstructured data through the weight of the sensitive words and the weight of the non-sensitive words.

2. The method of claim 1, wherein determining the weight of the non-sensitive word according to the similarity between the non-sensitive word and the private data attribute in the word segmentation result comprises:

3. The method of claim 2, wherein extracting the word vector for the non-sensitive word and the word vector for the privacy data attribute comprises:

4. The method of claim 3, wherein the word vector model is trained by:

acquiring a training corpus matched with an application scene of the unstructured data;

and training an initial word vector model by using the training corpus to obtain a word vector model.

5. The method of claim 4, further comprising:

identifying sensitive words in the training corpus, and replacing the sensitive words by using privacy data attributes;

the training of the initial word vector model by using the training corpus to obtain a word vector model comprises the following steps:

6. The method according to any one of claims 1 to 5, further comprising:

determining a privacy protection mechanism of the unstructured data according to the privacy degree of the unstructured data;

and carrying out privacy protection on the unstructured data by utilizing the privacy protection mechanism.

7. An apparatus for processing unstructured data, the apparatus comprising:

8. The apparatus of claim 7, wherein the weight determination module is specifically configured to:

9. The apparatus of claim 8, wherein the weight determination module is specifically configured to:

10. The apparatus of claim 9, further comprising:

11. The apparatus of claim 10, further comprising:

the training module is specifically configured to:

12. The apparatus of any one of claims 7 to 11, further comprising:

13. A device, comprising a processor and a memory;

the processor is configured to execute instructions stored in the memory to cause the device to perform the method of any of claims 1 to 6.

14. A computer-readable storage medium comprising instructions that direct a device to perform the method of any of claims 1-6.