CN116484864A - Data identification method and related equipment - Google Patents

Data identification method and related equipment

Info

Publication number
CN116484864A
CN116484864A (application CN202310356055.5A)
Authority
CN
China
Prior art keywords
data
model
identified
deep learning
target application
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310356055.5A
Other languages
Chinese (zh)
Inventor
鲍梦瑶
刘佳伟
章鹏
贾茜
张谦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ant Blockchain Technology Shanghai Co Ltd
Original Assignee
Ant Blockchain Technology Shanghai Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ant Blockchain Technology Shanghai Co Ltd filed Critical Ant Blockchain Technology Shanghai Co Ltd
Priority to CN202310356055.5A priority Critical patent/CN116484864A/en
Publication of CN116484864A publication Critical patent/CN116484864A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/096Transfer learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Machine Translation (AREA)

Abstract

The present specification provides a data identification method and related equipment. The method comprises the following steps: acquiring data to be identified; and inputting the data to be identified into a data identification model for key data identification. The data identification model comprises a deep learning model that serves as the student model in a knowledge distillation architecture and is trained under the guidance of a deep learning model that serves as the teacher model.

Description

Data identification method and related equipment
Technical Field
One or more embodiments of the present disclosure relate to the field of data identification technologies, and in particular, to a data identification method and related devices.
Background
Using a data identification model to identify various kinds of key data in data to be identified is a common technique. A traditional data identification model usually adopts a named entity recognition model. However, a named entity recognition model typically requires a large amount of labeled data during training, the labeling is difficult, the model is computationally complex, and the recognition results are often unsatisfactory, so practical application requirements cannot be met.
Disclosure of Invention
In view of this, one or more embodiments of the present disclosure provide a data identification method and related apparatus.
In a first aspect, the present specification provides a data identification method, the method comprising:
acquiring data to be identified;
inputting the data to be identified into a data identification model for key data identification; the data identification model comprises a deep learning model that serves as the student model in a knowledge distillation architecture and is trained under the guidance of a deep learning model that serves as the teacher model.
In a second aspect, the present specification provides an application compliance detection method based on data identification, the method comprising:
acquiring data to be identified in a target application; the data to be identified comprises data provided by the target application for a user;
inputting the data to be identified into a data identification model for key data identification; the data identification model comprises a deep learning model that serves as the student model in a knowledge distillation architecture and is trained under the guidance of a deep learning model that serves as the teacher model;
and based on the key data identified from the data to be identified, performing compliance detection on the target application.
In a third aspect, the present specification provides a data recognition apparatus, the apparatus comprising:
The acquisition unit is used for acquiring the data to be identified;
the data identification unit is used for inputting the data to be identified into the data identification model for key data identification; the data identification model comprises a deep learning model that serves as the student model in a knowledge distillation architecture and is trained under the guidance of a deep learning model that serves as the teacher model.
In a fourth aspect, the present specification provides an application compliance detection device based on data identification, the device comprising:
the acquisition unit is used for acquiring the data to be identified in the target application; the data to be identified comprises data provided by the target application for a user;
the data identification unit is used for inputting the data to be identified into the data identification model for key data identification; the data identification model comprises a deep learning model that serves as the student model in a knowledge distillation architecture and is trained under the guidance of a deep learning model that serves as the teacher model;
and the compliance detection unit is used for carrying out compliance detection on the target application based on the key data identified from the data to be identified.
Accordingly, the present specification also provides a computing device comprising: a memory and a processor; the memory has stored thereon a computer program executable by the processor; the processor executes the data identification method according to the first aspect or the application compliance detection method based on data identification according to the second aspect when executing the computer program.
Accordingly, the present specification also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the data recognition method described in the first aspect, or performs the application compliance detection method based on data recognition described in the second aspect.
In summary, first, the present application is based on a knowledge distillation architecture: a large deep learning model serving as the teacher model can be used to efficiently train a small deep learning model serving as the student model, so that the rich language knowledge and learning ability of the large model effectively improve the output quality of the small model. Then, in practical applications, the trained small deep learning model can perform data identification accurately and efficiently. Therefore, on the one hand, during model training, the large deep learning model only assists the training of the small one, so the large model's many parameters need not be adjusted; only the parameters of the small model, which are far fewer, are tuned and optimized, which greatly reduces the time and machine resources consumed by training and improves training efficiency. On the other hand, in practical applications, only the small deep learning model is deployed as the data identification model for key data identification, which greatly lowers the hardware requirements of the device performing data identification, effectively controls the implementation cost of data identification, and facilitates large-scale application of privacy data identification.
Drawings
FIG. 1 is a flow chart of a method for identifying data according to an exemplary embodiment;
FIG. 2 is a flow chart of a model training method based on knowledge distillation, provided by an exemplary embodiment;
FIG. 3 is a flow chart of another model training method based on knowledge distillation, provided by an exemplary embodiment;
FIG. 4 is a schematic diagram of data identification based on the model architecture of the student model, provided by an exemplary embodiment;
FIG. 5 is a flow chart of an application compliance detection method based on data identification according to an exemplary embodiment;
FIG. 6 is a schematic diagram of a data recognition device according to an exemplary embodiment;
FIG. 7 is a schematic diagram of an apparatus for detecting compliance with applications based on data identification according to an exemplary embodiment;
FIG. 8 is a schematic diagram of a computing device provided in an exemplary embodiment.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with one or more embodiments of the present specification. Rather, they are merely examples of apparatus and methods consistent with aspects of one or more embodiments of the present description as detailed in the accompanying claims.
It should be noted that: in other embodiments, the steps of the corresponding method are not necessarily performed in the order shown and described in this specification. In some other embodiments, the method may include more or fewer steps than described in this specification. Furthermore, a single step described in this specification may, in other embodiments, be split into multiple steps, and multiple steps described in this specification may, in other embodiments, be combined into a single step.
In addition, the user information (including but not limited to user equipment information, user personal information, etc.) and the data (including but not limited to data for analysis, stored data, presented data, etc.) referred to in this application are information and data authorized by the user or sufficiently authorized by the parties, and the collection, use and processing of relevant data requires compliance with relevant laws and regulations and standards of relevant countries and regions, and is provided with corresponding operation portals for the user to select authorization or denial.
First, some terms in the present specification are explained for the convenience of understanding by those skilled in the art.
(1) Private Data refers to confidential data, i.e., information that its owner does not want to be known by others or by unrelated parties. From the perspective of the privacy owner, private data may be divided into individual private data and common private data. Individual private data may include, for example, personal identity information (such as passports and driver's licenses) and personal biometric information (such as fingerprints and facial features). Common private data mainly concerns family privacy, such as the total assets of a household; this specification does not specifically limit this. Disclosure and abuse of private data can easily cause various personal and public security problems.
(2) Knowledge distillation refers to transferring the learned behavior of a larger model (typically serving as the teacher model) to a smaller model (typically serving as the student model). The output of the teacher model can be used as a soft label (also called a soft target) for training the student model. The main idea of knowledge distillation is to train the student model efficiently under the guidance of the teacher model, so that the student model finally reaches an ideal accuracy. Since most applications of knowledge distillation focus on compressing deep neural networks, a large deep learning model generally serves as the teacher model in a knowledge distillation architecture and a small deep learning model serves as the student model; it will be understood that the parameter count of the teacher model is larger than that of the student model.
(3) A pre-trained model (PTM) is a model first trained on a batch of corpus data and then either trained further or reused for another purpose on the basis of that initial training. In an embodiment, if the initially trained pre-trained model needs to be trained again, the process may correspond to two phases: a pre-training phase and a fine-tuning phase.
The pre-training phase generally trains the model on a very large-scale corpus in an unsupervised or weakly supervised manner, with the expectation that the model acquires language-related knowledge such as syntax and grammar. It should be appreciated that, after training on such a corpus, the pre-trained model is often a very large model: on the one hand it carries sufficiently rich language knowledge, and on the other hand it has a very large number of parameters.
The fine-tuning phase uses the pre-trained model for customized training on specific tasks, so that the model completes those tasks better. For example, continuing to train a text classification task on a pre-trained model yields more desirable classification results. It will be appreciated that, since the pre-trained model has already acquired language knowledge in the pre-training phase, learning the text classification task on top of that knowledge achieves twice the result with half the effort. It should be noted that tasks fine-tuned on a pre-trained model (such as the aforementioned text classification task) are generally called downstream tasks. However, because the pre-trained model has a huge number of parameters, the fine-tuning phase requires long training time and considerable machine resources.
In some embodiments provided herein, the deep learning model serving as the teacher model in the knowledge distillation architecture may be a pre-trained model (i.e., an initially trained pre-trained model is reused).
(4) Prompt is a technique that adds extra text to the input in order to redefine the task (task reformulation), so as to better use the knowledge of a pre-trained model, or in other words, to better mine the ability of the pre-trained model. The Prompt workflow may comprise the following 4 parts: 1. construction of a prompt template; 2. construction of a prompt answer-space mapping (verbalizer); 3. substituting the text into the template and predicting with the pre-trained model; 4. mapping the prediction back to a label. Illustratively, given the original input x = "I like this movie" and the prompt template "All in all, [x] is a [z] movie.", the redefined input is: "I like this movie. All in all, it is a [z] movie." The redefined input is then fed into a suitable pre-trained language model, which predicts z at the mask slot, and the resulting prediction is mapped back to the original label.
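The 4-part Prompt workflow above can be sketched in a few lines of Python. The template wording, the verbalizer mapping, and the stub predictor standing in for the pre-trained language model are all illustrative assumptions, not part of the described embodiments:

```python
def build_prompt(text: str) -> str:
    # 1. template construction: append the prompt so the task is redefined
    #    as filling the [z] mask slot (template wording is an assumption)
    return f"{text} All in all, it is a [z] movie."

def predict_mask(prompt: str) -> str:
    # 3. stand-in for masked-LM inference by a real pre-trained model;
    #    this keyword rule is purely a hypothetical placeholder
    return "great" if "like" in prompt else "terrible"

def prompt_classify(text: str) -> str:
    # 2. verbalizer: maps the model's answer words back to task labels
    verbalizer = {"great": "positive", "terrible": "negative"}
    answer = predict_mask(build_prompt(text))
    return verbalizer[answer]  # 4. map the prediction back to a label
```

In a real system, `predict_mask` would be replaced by a masked-language-model call on the redefined input.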
(5) Named Entity Recognition (NER) refers to automatically extracting one or more entities with specific meanings from a piece of unstructured text and determining the categories to which they belong. Named entity recognition generally consists of two parts: identifying entity boundaries, and determining entity categories (e.g., person names, place names, institution names). Take as an example an NER task that recognizes privacy data (i.e., the privacy data are the named entities of the task) in the privacy-data-collection agreements of various applications or websites: the task is to recognize (or extract), from an agreement, the user privacy data (such as name and mobile phone number) that the application or website declares it will collect, and further determine the privacy category of that data (for example, name and mobile phone number belong to personal basic data). It should be noted that, in some privacy data identification scenarios, whether an item of privacy data should be extracted often depends on its context. For example, if a mobile phone number appears only in the text of a rule description, and the current application does not declare that it collects the user's mobile phone number, the user's privacy is not endangered, and the NER task will not identify that mobile phone number as privacy data.
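The two NER sub-tasks above (boundary identification and category determination) can be illustrated with a minimal decoder for per-character BIO tags; the tag scheme and the phone-number example are illustrative assumptions:

```python
def decode_bio(chars, tags):
    """Turn per-character BIO tags into (entity, category, start, end) tuples:
    boundaries come from the B-/I- structure, categories from the tag suffix."""
    entities, start, cat = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # trailing sentinel flushes the last entity
        if tag.startswith("B-") or tag == "O" or (cat and tag != f"I-{cat}"):
            if start is not None:
                entities.append(("".join(chars[start:i]), cat, start, i))
            start, cat = (i, tag[2:]) if tag.startswith("B-") else (None, None)
    return entities

# Hypothetical example: the digits "1380" tagged as a PHONE entity
chars = list("call 1380 now")
tags = ["O"] * 5 + ["B-PHONE"] + ["I-PHONE"] * 3 + ["O"] * 4
```

A real student model would produce the per-character tags (or tag probabilities) itself; the decoder only recovers the spans and categories from them.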
In some embodiments provided herein, the deep learning model serving as the student model in the knowledge distillation architecture may be a named entity recognition model.
As described above, a traditional data identification model usually adopts a named entity recognition model; however, such a model often needs a large amount of labeled data during training, the labeling is difficult, the model is computationally complex, and the recognition results are unsatisfactory, so practical application requirements cannot be met. Furthermore, some traditional data identification schemes adopt named entity recognition based on a pre-trained model; however, because the pre-trained model has a huge number of parameters, fine-tuning them makes training time long and places extremely high hardware requirements on the device, resulting in high implementation cost, so practical application requirements again cannot be met.
Based on the above, this specification provides a technical solution that efficiently trains a deep learning model serving as the student model under the guidance of a deep learning model serving as the teacher model in a knowledge distillation architecture, and then uses the trained model for data identification, thereby reducing model training time and model running cost while ensuring recognition accuracy.
In the implementation process, the data to be identified is first acquired, and then input into the data identification model for key data identification. It should be noted that the data identification model in the present application is a deep learning model serving as the student model, trained under the guidance of a deep learning model serving as the teacher model in a knowledge distillation architecture.
In the above technical solution, first, the present application is based on a knowledge distillation architecture: a large deep learning model serving as the teacher model can be used to efficiently train a small deep learning model serving as the student model, so that the rich language knowledge and learning ability of the large model effectively improve the output quality of the small model. Then, in practical applications, the trained small deep learning model can perform data identification accurately and efficiently. Therefore, on the one hand, during model training, the large deep learning model only assists the training of the small one, so the large model's many parameters need not be adjusted; only the parameters of the small model, which are far fewer, are tuned and optimized, which greatly reduces the time and machine resources consumed by training and improves training efficiency. On the other hand, in practical applications, only the small deep learning model is deployed as the data identification model for key data identification, which greatly lowers the hardware requirements of the device performing data identification, effectively controls the implementation cost of data identification, and facilitates large-scale application of data identification.
Referring to FIG. 1, FIG. 1 is a flow chart of a data identification method according to an exemplary embodiment. As shown in FIG. 1, the method may be applied to a computing device, which may be a smart wearable device, a smart phone, a tablet, a notebook computer, a desktop computer, or a vehicle-mounted computer; it may also be a server, a server cluster composed of multiple servers, a cloud computing service center, or the like, which is not specifically limited in this specification. As shown in FIG. 1, the method may specifically include the following steps S101-S102.
Step S101, obtaining data to be identified.
In an illustrated embodiment, a computing device obtains data to be identified.
In an embodiment, the data to be identified may be data in various websites, applications or databases, which is not specifically limited in this specification.
Step S102, inputting the data to be identified into a data identification model for key data identification; the data identification model comprises a deep learning model that serves as the student model in a knowledge distillation architecture and is trained under the guidance of a deep learning model that serves as the teacher model.
In an embodiment, after obtaining the data to be identified, the computing device may input the data to be identified into a data identification model obtained by training in advance to identify key data, so as to obtain a corresponding identification result.
In an embodiment, the recognition result of the key data identification may indicate whether the data to be identified contains key data.
In an embodiment, if the data to be identified includes key data, the identification result further includes the key data identified from the data to be identified. Further, in an illustrated embodiment, the recognition result further includes one or more combinations of the following: the key data category corresponding to the key data included in the data to be identified, the position of the key data included in the data to be identified, and so on, which are not specifically limited in this specification.
In an embodiment, the data identification model may be a deep learning model trained through a knowledge distillation architecture. In an illustrated embodiment, the data identification model may be a deep learning model serving as the student model, trained under the guidance of a deep learning model serving as the teacher model in the knowledge distillation architecture.
In an embodiment, after training, the deep learning model serving as the student model may be stored locally on the computing device that performed the training, or stored in a cloud server, etc.; this specification does not specifically limit this. Correspondingly, once training is completed, the deep learning model serving as the student model can be used as the data identification model for subsequent key data identification.
In an illustrated embodiment, referring to fig. 2, fig. 2 is a flow chart of a model training method based on knowledge distillation according to an exemplary embodiment. In an embodiment, the training process of the data recognition model may be as shown in fig. 2, and specifically may include steps S201 to S204.
Step S201, a training sample set is obtained, where the training sample set includes a plurality of sample data, and at least part of sample data in the plurality of sample data includes key data corresponding to at least one key data category.
In an illustrated embodiment, a computing device may first obtain a training sample set, which may include a plurality of sample data, where each sample data may be text data used as model training. In an embodiment, at least some of the plurality of sample data may include key data corresponding to at least one key data category. By way of example, where the key data is privacy data of a user, the key data category may be a privacy data category (e.g., privacy data category such as personal identity information, personal property information, etc.).
Referring to fig. 3, fig. 3 is a flow chart illustrating another knowledge-based model training method according to an exemplary embodiment. As shown in fig. 3, the sample data may be "we may collect your order information, browse information for data analysis". The key data in the sample data may include privacy data such as "order information" and "browse information". As shown in fig. 3, the staff member may manually label the sample data, and the manually labeled information may include, for example, labeling "order information" and "browse information" as key data.
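The manually labeled sample described above can be represented, for instance, as the text plus a list of annotated key-data spans; this dict layout and the `privacy_data` category name are illustrative assumptions rather than the format used in FIG. 3:

```python
# One manually labeled training sample, mirroring the FIG. 3 example;
# the field names and the "privacy_data" category label are assumptions.
sample = {
    "text": "we may collect your order information, browse information for data analysis",
    "entities": [  # hard labels: annotated key-data spans and their categories
        {"span": "order information", "category": "privacy_data"},
        {"span": "browse information", "category": "privacy_data"},
    ],
}

# sanity check: every annotated span must actually occur in the text
assert all(e["span"] in sample["text"] for e in sample["entities"])
```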
Step S202, adding prompt information to each of the plurality of sample data, and inputting the sample data with the added prompt information into the deep learning model serving as the teacher model in the knowledge distillation architecture for encoding, to obtain encoding results corresponding to the plurality of sample data.
Further, in an illustrated embodiment, the computing device may input the plurality of sample data into the knowledge distillation architecture as a deep learning model of a teacher model to encode, resulting in encoding results corresponding to the plurality of sample data.
In an illustrated embodiment, the deep learning model in the knowledge distillation architecture as a teacher model may be a pre-training model. Referring to the foregoing explanation, since the pre-training model is trained by a large-scale corpus, a large number of parameters in the pre-training model usually contain a large amount of natural language information, i.e., the pre-training model itself has a large amount of language knowledge.
In an illustrated embodiment, the encoding results corresponding to each sample data may include probability values for the respective key data categories corresponding to the respective characters contained in each sample data.
In an embodiment, in order to better mine the information in the pre-trained model, prompt information may be added to the sample data, and the sample data with the added prompt information may be input into the deep learning model serving as the teacher model in the knowledge distillation architecture for encoding, to obtain the encoding results corresponding to the sample data.
Illustratively, as shown in FIG. 3, the prompt information may be "The key data types appearing in the sentence are: ", and the sample data with the added prompt information may be "we may collect your order information and browse information for data analysis. The key data types appearing in the sentence are: ".
In an illustrated embodiment, the prompt information may be added in the form of a template, which can be modified depending on the training task. For example, when parsing the key data categories in text data, the template may be the above "The key data types appearing in the sentence are: ". When parsing the specific key data in text data, the template may be "The key data appearing in the sentence is: ". Illustratively, when parsing text data for the storage period of key data, the template may be "The storage period appearing in the sentence is: ", and so on; this specification does not specifically limit the template.
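The task-dependent templates above can be kept, for example, in a small lookup table; the dict-based lookup and the task names are illustrative assumptions:

```python
# Prompt templates quoted in the description, keyed by hypothetical task names.
TEMPLATES = {
    "category": "The key data types appearing in the sentence are: ",
    "key_data": "The key data appearing in the sentence is: ",
    "retention": "The storage period appearing in the sentence is: ",
}

def add_prompt(sample_text: str, task: str) -> str:
    """Append the task's template so the teacher model completes the open slot."""
    return sample_text + " " + TEMPLATES[task]
```

Swapping the task name then redefines what the teacher model is asked to complete, without changing the sample text itself.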
In an illustrated embodiment, the output (logits) of the pre-trained model, i.e., the above-described encoding results, may then be used as soft labels to train the deep learning model serving as the student model in the knowledge distillation architecture.
Step S203, taking the encoding results corresponding to the plurality of sample data output by the deep learning model as the teacher model as a soft tag, and taking the manual labeling information corresponding to the plurality of sample data as a hard tag.
Further, in an illustrated embodiment, the present application may use the encoding results corresponding to the plurality of sample data output by the deep learning model serving as the teacher model as soft labels, and train the deep learning model serving as the student model in the knowledge distillation architecture based on those soft labels.
In an embodiment, the application may further use the manual labeling information corresponding to the plurality of sample data as hard labels, and the deep learning model serving as the student model may be trained based on both the soft labels and the hard labels.
It should be noted that, in a knowledge distillation architecture, the output of the teacher model often serves as the carrier of the teacher model's knowledge. Taking a typical image classification task as an example, the manually labeled hard label and the soft label output by the teacher model may be as shown in Table 1 below.

Table 1

              cat    dog    cow      car
  hard label  1      0      0        0
  soft label  0.9    0.1    10^-6    10^-9

As shown in Table 1 above, the likelihood of the cat image being misclassified as a dog is very low, but the likelihood of such a mistake is still many times higher than the likelihood of mistaking the cat for a cow or a car. The teacher model outputs floating-point numbers (0.9, 0.1, 10^-6, 10^-9, etc.) that represent classification probability values; used as soft labels, they provide the student model with more information (e.g., a cat looks more like a dog than like a cow), thereby improving the training efficiency of the student model.
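The way soft and hard labels jointly supervise the student can be sketched as a weighted loss in the style of standard distillation; the temperature, weighting factor, and toy logits below are assumptions, not values from the embodiments:

```python
import math

def softmax(logits, T=1.0):
    """Convert logits to probabilities; T > 1 softens the distribution."""
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, pred):
    return -sum(t * math.log(max(q, 1e-12)) for t, q in zip(target, pred))

def distill_loss(student_logits, teacher_probs, hard_label, T=2.0, alpha=0.5):
    """Weighted sum of the soft-label loss (teacher's floating-point outputs,
    cf. the table above) and the hard-label loss (one-hot manual annotation)."""
    soft = cross_entropy(teacher_probs, softmax(student_logits, T))
    onehot = [1.0 if i == hard_label else 0.0 for i in range(len(student_logits))]
    hard = cross_entropy(onehot, softmax(student_logits))
    return alpha * soft + (1 - alpha) * hard

# The teacher's soft label for a cat image, over classes [cat, dog, cow, car]
teacher = [0.9, 0.1, 1e-6, 1e-9]
```

A student whose logits already resemble the teacher's distribution incurs a lower loss than one that favors the wrong class, which is how the soft label's extra information steers training.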
Step S204, training a deep learning model serving as a student model in the knowledge distillation architecture based on the plurality of sample data and the soft tag and the hard tag corresponding to the plurality of sample data.
Further, in an illustrated embodiment, as shown in fig. 3, the computing device may input the plurality of sample data, and the soft tag (i.e., the encoding result output by the teacher model) and the hard tag (i.e., the manual labeling information) corresponding to the plurality of sample data, into the deep learning model of the student model in the knowledge distillation architecture to train the deep learning model of the student model.
In an illustrated embodiment, since the present application primarily performs data recognition on text data, the common model architecture of an Encoder followed by a conditional random field (Conditional Random Field, CRF) layer may be employed as the deep learning model serving as the student model. The CRF is a basic model in natural language processing and, when combined with deep learning, achieves ideal effects in various scenarios such as Chinese word segmentation, named entity recognition, and part-of-speech tagging. In an illustrated embodiment, the model structure of the deep learning model serving as the student model may also adopt any other possible architecture besides Encoder+CRF, which is not specifically limited in this specification.
In an illustrated embodiment, the encoder in the model structure of the student model may include any one of the encoders shown below: an encoder based on a recurrent neural network (Recurrent Neural Network, RNN) model, an encoder based on a long short-term memory (Long Short-Term Memory, LSTM) model, or an encoder based on a Transformer model. The Transformer is a deep learning model that uses an attention mechanism to improve model training speed; it is well suited to parallel computation, and its accuracy and performance are superior to those of RNNs. In an illustrated embodiment, the encoder may also be any other possible encoder, which is not specifically limited in this specification.
It should be noted that, in the forward process, the student model may also output a corresponding encoding result (hidden states), where the hidden states of the student model may be intermediate results of an Encoder in the student model, and the Encoder may finally output probability values of each character in the sample data corresponding to each key data class.
In an illustrated embodiment, during the training process, the student model may calculate a loss function (loss) based on the plurality of sample data, the soft tags and hard tags corresponding to the plurality of sample data, the hidden states of the student model, and certain parameters or hyper-parameters, and continuously optimize the parameters in the student model by gradient descent with the aim of minimizing the loss function, until the student model can finally output an ideal key data recognition result, at which point the training is completed.
In an illustrated embodiment, the specific calculation of the loss function L may be as follows:

L = α·L_soft + β·L_hard

L_soft = -Σ_(t=1..T) Σ_(i=1..N) p_(t,i)(τ)·log q_(t,i)(τ)

L_hard = EmissionScore + TransitionScore

where:

α: the weight of L_soft; β: the weight of L_hard. α and β are hyper-parameters and are not obtained through training.

v_t: the hidden states of the Teacher Model at position t.

z_t: the hidden states of the Student Model at position t.

N: the dimension of the hidden states.

T: the total character length of the sample data.

p_(t,i)(τ) = softmax(v_t/τ)_i: the value in the i-th dimension of the softmax (normalization function) of the Teacher Model's hidden states at position t at temperature τ.

q_(t,i)(τ) = softmax(z_t/τ)_i: the value in the i-th dimension of the softmax of the Student Model's hidden states at position t at temperature τ.
L_hard is the loss function of a typical CRF layer and consists of two parts: the Emission Score and the Transition Score. The Emission Score is the probability value, finally output by the Encoder of the student model, that each character position corresponds to each key data category; the hidden states of the student model may be intermediate results of that Encoder. The Transition Score measures the likelihood of occurrence of the entire key data sequence and contains the trainable parameters of the CRF layer.
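The combined loss above can be sketched in a minimal, framework-free form, assuming the soft term is the cross-entropy between the temperature-scaled teacher and student distributions at each character position; the hard CRF term is passed in as a precomputed placeholder value, and all hidden-state numbers below are hypothetical:

```python
import math

def softmax(logits, temperature):
    m = max(z / temperature for z in logits)
    exps = [math.exp(z / temperature - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def soft_loss(teacher_hidden, student_hidden, temperature):
    """Cross-entropy between temperature-scaled teacher and student
    distributions, summed over the T positions and N dimensions."""
    total = 0.0
    for v_t, z_t in zip(teacher_hidden, student_hidden):  # T positions
        p = softmax(v_t, temperature)                     # teacher p_(t,i)
        q = softmax(z_t, temperature)                     # student q_(t,i)
        total -= sum(p_i * math.log(q_i) for p_i, q_i in zip(p, q))
    return total

def total_loss(teacher_hidden, student_hidden, crf_loss,
               alpha, beta, temperature=2.0):
    # L = alpha * L_soft + beta * L_hard, with alpha and beta as
    # untrained hyper-parameters; crf_loss stands in for the CRF term.
    return alpha * soft_loss(teacher_hidden, student_hidden,
                             temperature) + beta * crf_loss

# Hypothetical hidden states for a 2-character sample, 3 key data categories.
teacher = [[2.0, 0.5, -1.0], [0.1, 1.8, -0.5]]
student_far  = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]    # uninformative student
student_near = [[1.9, 0.6, -0.9], [0.2, 1.7, -0.4]]  # close to the teacher

# A student matching the teacher incurs a lower soft loss.
assert soft_loss(teacher, student_near, 2.0) < soft_loss(teacher, student_far, 2.0)
```

Note that this sketch keeps the student and teacher hidden-state dimensions identical, consistent with avoiding extra conversion parameters.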
It should be noted that, in general, the hidden states of the student model need to be consistent with the dimensions of the hidden states of the teacher model to avoid introducing additional training parameters for conversion.
In an illustrated embodiment, as shown in fig. 3, the deep learning model as the student model may specifically be a named entity recognition model, that is, the data recognition model may be a named entity recognition model. In the following, a named entity recognition model will be taken as an example, and a recognition process of key data will be described.
First, it should be noted that the named entity recognition problem is a sequence labeling problem, so the data labeling manner of named entity recognition also follows that of sequence labeling problems and mainly adopts the BIOE labeling method. Each letter in BIOE carries a different meaning: B (Begin) represents the beginning of a segment of characters; I (Intermediate) represents the middle of a segment of characters; E (End) represents the end of a segment of characters; and O (Other) represents other, unrelated characters.
For example, suppose there are m key data categories, denoted c_1, c_2, c_3, ..., c_(m-1), c_m. Given a piece of data W to be identified with a character length of n, denote W = {w_1, w_2, w_3, ..., w_(n-1), w_n}, and let S = [w_(k-i), w_(k-i+1), ..., w_k] be a sequence composed of a number of consecutive characters in W. If S is key data of category c_j (e.g., a name or telephone number), the recognition result of key data recognition based on the named entity recognition model may include: in the data W to be identified, w_(k-i) is marked c_j-B, every character from w_(k-i+1) to w_(k-1) is marked c_j-I, and w_k is marked c_j-E.
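The labeling rule described above can be sketched as a small helper; the function name `bioe_tags` and the span representation are assumptions for illustration only:

```python
def bioe_tags(n, spans):
    """Convert entity spans into BIOE tags for a text of n characters.

    `spans` maps (start, end) inclusive character indices to a category
    name: the first character of a span is tagged B, interior characters
    I, the last character E, and all remaining characters O.
    """
    tags = ["O"] * n
    for (start, end), category in spans.items():
        tags[start] = f"{category}-B"
        if start == end:  # degenerate single-character span (edge case)
            continue
        for i in range(start + 1, end):
            tags[i] = f"{category}-I"
        tags[end] = f"{category}-E"
    return tags

# Hypothetical example: a 10-character text where characters 2..5 form a
# "NAME" entity and characters 7..8 form a "PHONE" entity.
tags = bioe_tags(10, {(2, 5): "NAME", (7, 8): "PHONE"})
assert tags == ["O", "O", "NAME-B", "NAME-I", "NAME-I", "NAME-E",
                "O", "PHONE-B", "PHONE-E", "O"]
```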
Referring to fig. 4, fig. 4 is a schematic diagram illustrating data recognition based on the model architecture of the student model according to an exemplary embodiment. As shown in fig. 4, taking the above text data "We may collect your order information and browse information for data analysis" as an example, after the text data is input into the named entity recognition model, it is processed in turn by the encoder and the CRF layer, and the finally output recognition result (i.e., the sequence labeling result) of the named entity recognition model for the text data may be: [O, O, O, O, O, O, O, O, B, I, I, E, O, B, I, I, E, O, O, O, O, O, O, O, O, O]. Further, the recognition result may be: [O, O, O, O, O, O, O, O, B-order information, I-order information, I-order information, E-order information, O, B-browse information, I-browse information, I-browse information, E-browse information, O, O, O, O, O, O, O, O, O]. Further, the recognition result may also include the probability values of each character corresponding to each category, and the like, which is not specifically limited in this specification.
Illustratively, taking the text data "You may need to provide your name, phone number, and other information" as an example, the recognition result (i.e., the sequence labeling result) of the named entity recognition model for the text data may be: [O, O, O, O, O, O, B-NAME, E-NAME, O, B-PHONE, I-PHONE, I-PHONE, I-PHONE, E-PHONE, O, O, O], where NAME represents a name and PHONE represents a mobile phone number.
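A recognition result in this form can in turn be decoded back into the identified key data; the helper below is a hedged sketch, with the text and tag sequence being hypothetical English stand-ins for the example above:

```python
def decode_bioe(text, tags):
    """Recover (category, substring) pairs from a BIOE tag sequence.

    Tags follow the "B-CATEGORY"/"I-CATEGORY"/"E-CATEGORY"/"O" form;
    malformed spans without a closing E are silently dropped.
    """
    entities, start, category = [], None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            start, category = i, tag[2:]
        elif tag.startswith("E-") and start is not None and tag[2:] == category:
            entities.append((category, text[start:i + 1]))
            start, category = None, None
        elif tag == "O":
            start, category = None, None
    return entities

# Hypothetical character-level rendering: "name" at positions 2..5 and
# an 8-digit phone number at positions 7..14.
text = "##name#12345678##"
tags = ["O", "O", "B-NAME", "I-NAME", "I-NAME", "E-NAME", "O",
        "B-PHONE", "I-PHONE", "I-PHONE", "I-PHONE",
        "I-PHONE", "I-PHONE", "I-PHONE", "E-PHONE", "O", "O"]
assert decode_bioe(text, tags) == [("NAME", "name"), ("PHONE", "12345678")]
```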
Referring to fig. 5, fig. 5 is a flowchart of an application compliance detection method based on data identification according to an exemplary embodiment. As shown in fig. 5, the method may be applied to a computing device, which may be a smart wearable device, a smart phone, a tablet computer, a notebook computer, a desktop computer, or an on-board computer; alternatively, the computing device may be a server, a server cluster or a cloud computing service center composed of a plurality of servers, or the like, which is not specifically limited in this specification. As shown in fig. 5, the method may specifically include the following steps S301 to S303.
Step S301, obtaining data to be identified in a target application; wherein the data to be identified comprises data provided by the target application for a user.
In an illustrated embodiment, a computing device obtains data to be identified in a target application.
It should be noted that, the type of the target application is not specifically limited in this specification, and the target application may be various published or unpublished applications. In an illustrated embodiment, the target application may be a video application, a music application, a shooting application, an instant messaging application, a shopping application, etc., which is not particularly limited in this specification.
It should be noted that, the method adopted by the computing device to obtain the data to be identified in the target application is not specifically limited in this specification.
In an illustrated embodiment, a computing device may run the target application and obtain data to be identified in the target application during the running of the target application. In an embodiment shown, the target application may be executed by another terminal device, and the computing device may establish a wired or wireless connection with a terminal device (such as a smart phone or a tablet computer) that executes the target application, and obtain, through the established connection, data to be identified in the target application from the terminal device during the process of executing the target application by the terminal device, and so on, which is not specifically limited in this specification.
In an illustrated embodiment, the computing device may also obtain the data to be identified in the target application without running the target application. The computing device may directly obtain the data to be identified in the target application provided by the development platform through the development platform of the target application. For example, the computing device may also directly obtain the data to be identified in the target application during the design process of the target application. For example, the computing device may also obtain the data to be identified in the target application after the target application is published, based on user feedback or data collection of a third party platform, and the like, which is not limited in this specification.
In an illustrated embodiment, the data to be identified in the target application may include data provided by the target application for the user.
In an illustrated embodiment, the data to be identified may specifically include text data provided by the target application to the user.
By way of example, the text data may include text data of a data collection protocol provided by the target application to the user. The data collection protocol may be used to indicate user data corresponding to a user to be collected by the target application, may also be used to indicate data that the target application will not collect, and the like, which is not specifically limited in this specification. By way of example, the target application may provide the user with the data collection protocol when the user registers the user account with the target application, and so forth, which is not specifically limited in this specification.
For example, the text data may further include opinion solicitation information provided to the user when the target application attempts to acquire the read permission for key data (e.g., the read permission for data such as the address book) in order to implement a related function, asking whether the user agrees to grant the target application that read permission; this is not specifically limited in this specification.
In an illustrated embodiment, the target application may provide the text data described above to the user in various forms.
In an embodiment, the target application may output and display the text data to the user in a pop-up window manner, or the target application may output and display the text data to the user in a message notification manner in the application, or the target application may output and display the text data to the user in a short message manner, or the like, which is not limited in this specification.
In an illustrated embodiment, in addition to the target application, the computing device may also obtain data to be identified in various websites, and perform subsequent key data identification for the data to be identified in the websites. For example, the data to be identified in various websites may also include text data provided by various websites for users, such as data collection protocols provided by various websites for users, and the like, which is not limited in this specification.
Step S302, inputting the data to be identified into a data identification model for key data identification; the data identification model comprises a deep learning model serving as a student model that is obtained through training with a deep learning model serving as a teacher model in a knowledge distillation architecture.
In an illustrated embodiment, after acquiring the data to be identified in the target application, the computing device may input the data to be identified into a data identification model obtained by training in advance to perform key data identification, so as to obtain a corresponding identification result.
In an illustrated embodiment, the computing device may input data (e.g., text data such as a data collection protocol) provided by the target application for the user into the data recognition model, where the data recognition model may first pre-process the text data (e.g., split the entire text), and then perform key data recognition on the pre-processed text data to obtain a corresponding recognition result.
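The preprocessing step mentioned above (splitting the whole text) might, under a simple punctuation-based assumption, look as follows; the function name and the splitting rule are illustrative only, not the claimed preprocessing:

```python
import re

def split_protocol_text(protocol_text):
    """Split a whole data collection protocol into sentence-sized pieces
    before key data recognition, using sentence-ending punctuation.
    A simplified stand-in for the model's preprocessing step."""
    # Split on Chinese or Western sentence-final punctuation, drop blanks.
    pieces = re.split(r"[。！？.!?]\s*", protocol_text)
    return [p.strip() for p in pieces if p.strip()]

protocol = ("We may collect your order information and browse information "
            "for data analysis. You may need to provide your name and "
            "phone number. We will not collect your address book.")
sentences = split_protocol_text(protocol)
assert len(sentences) == 3
assert sentences[0].startswith("We may collect")
```

Each resulting piece can then be fed to the data recognition model independently for key data recognition.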
Further, in an embodiment, the computing device may output, through a preset interface, a recognition result that displays the above-mentioned key data recognition to the user.
In an embodiment, the identification result of the key data identification may include whether the key data is included in the data to be identified, and if the key data is included in the data to be identified, the identification result may further include the key data identified from the data to be identified.
In an embodiment, if the data to be identified includes key data, the identification result may further include one or more of the following combinations: the key data category corresponding to the key data included in the data to be identified, the position of the key data included in the data to be identified, and so on, which are not specifically limited in this specification.
For example, taking the data to be identified as the data collection protocol as an example, the key data identified from the data to be identified may include user data corresponding to the user to be collected by the target application indicated in the data collection protocol.
For example, taking the key data to be identified at this time as the privacy data of the user, the privacy data possibly collected by the target application and the corresponding privacy data categories indicated in the privacy data collection protocol provided by the target application may be as shown in Table 2 below.
Table 2
In an illustrated embodiment, besides the personal privacy data shown in Table 2 above, the privacy data possibly collected by the target application may also include shared privacy data, such as total household assets, which is not specifically limited in this specification.
The position of the key data in the data to be identified may include the page, clause, and line of the sentence corresponding to the key data in the data to be identified, and may further include which characters of the whole sentence the key data occupies, which is not specifically limited in this specification.
Further, in an illustrated embodiment, based on different actual requirements and on the hardware conditions of the computing device, the recognition result of the above key data recognition may further include one or more of the following combinations: operation information of the target application (e.g., the name and registered address of the operating company), contact information of the privacy security responsible party provided by the target application (e.g., a mailbox address and a telephone number), the storage address used by the target application for the key data (e.g., domestic or foreign, and further, which province if domestic), the storage period (e.g., 3 years or 5 years), the processing manner after the storage period expires (e.g., how the key data is processed once it has been stored beyond the 3-year period), and the user complaint channel regarding privacy security provided by the target application (e.g., a mailbox address or customer service telephone number for user complaints), which is not specifically limited in this specification. In this way, the present application acquires as much information as possible through data identification, so as to meet the requirement of more comprehensive and detailed compliance detection for the target application and reliably guarantee the security of user data.
Step S303, performing compliance detection for the target application based on the key data identified from the data to be identified.
Further, in an illustrated embodiment, after performing key data recognition on the data to be recognized in the target application through the data recognition model and identifying the corresponding key data from the data to be recognized, the computing device performs compliance detection for the target application based on the key data identified from the data to be recognized.
In an illustrated embodiment, the computing device may obtain key data that the target application actually collected for the user during the run-time. The computing device may then compare the key data identified from the data to be identified with the key data actually collected by the target application to perform compliance detection on the target application.
By way of example, the computing device may determine, by the comparison, whether the key data actually collected by the target application exceeds the data range of the key data identified from the data to be identified; if so, the computing device may determine the compliance detection result of the target application as non-compliance, otherwise, the computing device may determine the compliance detection result of the target application as compliance.
For example, taking the to-be-identified data as the privacy data collection protocol provided by the target application for the user as an example, if the privacy data actually collected by the target application exceeds the data range of the privacy data identified from the privacy data collection protocol, the compliance detection result of the target application can be determined to be non-compliance. For example, if the privacy data actually collected by the target application includes a personal phone number, a name, and a height, but only the name is included in the privacy data identified from the data to be identified, the target application may be determined to be not compliant.
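The range comparison described above can be sketched as a simple set comparison; the function name and the category strings below are hypothetical stand-ins for the identified and actually collected key data categories:

```python
def check_compliance(declared, actually_collected):
    """Compare the key data identified from the data to be identified
    (declared) with the key data the target application actually
    collected; any collected item outside the declared range makes the
    application non-compliant."""
    excess = set(actually_collected) - set(declared)
    return {"compliant": not excess, "excess": sorted(excess)}

# Hypothetical example mirroring the text: the protocol only declares
# "name", but the application also collected a phone number and height.
result = check_compliance(declared={"name"},
                          actually_collected={"name", "phone number", "height"})
assert result["compliant"] is False
assert result["excess"] == ["height", "phone number"]

# If collection stays within the declared range, the result is compliant.
assert check_compliance({"name", "phone number"}, {"name"})["compliant"] is True
```

The same set difference can equally be applied against the range of privacy data permitted by the relevant regulations.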
For example, taking the to-be-identified data as the target application, the privacy data collection protocol provided for the user is taken as an example. The computing device may also compare the privacy data identified from the privacy data collection protocol with privacy data permitted to be collected in the relevant specification when performing compliance detection on the target application to perform compliance detection on the target application. For example, if the privacy data identified from the privacy data collection protocol is outside the privacy data collection range of the prescribed license, the target application may be determined to be non-compliant.
For example, the computing device may also compare the privacy data identified from the privacy data collection protocol with the privacy data actually collected by the target application, with the range of privacy data allowed to be collected in the relevant regulations, so as to perform compliance detection for the target application, and the disclosure is not limited in detail herein.
For example, taking the data to be identified as the opinion solicitation information provided when the target application attempts to acquire the read permission for privacy data in order to implement a related function, the computing device may compare the privacy data identified from the opinion solicitation information with the privacy data actually collected by the target application, so as to perform compliance detection on the target application. For example, if the privacy data actually collected by the target application exceeds the data range of the privacy data identified from the opinion solicitation information, the target application may be determined to be non-compliant. Alternatively, on the premise that the user has not agreed to grant the target application the read permission for the privacy data, if the data actually collected by the target application still includes the privacy data identified from the opinion solicitation information, the target application may be determined to be non-compliant.
Further, in an illustrated embodiment, if the computing device determines that the target application is not compliant, it may promptly prompt the user, for example by means of a pop-up window or a short message, to uninstall the non-compliant target application, thereby guaranteeing the privacy security of the user in a timely manner.
In addition, in an embodiment, the data to be identified may include, in addition to text data such as a data acquisition protocol, image data or video data provided by the target application for the user, for example, a picture of the data acquisition protocol, which is not specifically limited in this specification.
In an embodiment, if the data to be identified in the target application is image data, the image data may be preprocessed before the data to be identified is input into the data identification model, for example, the text portion in the image data may be extracted to obtain corresponding text data, and then the text data may be input into the data identification model to perform key data identification, so as to obtain an identification result, and perform compliance detection on the target application.
In an embodiment, if the data to be identified in the target application is video data, the video data may be preprocessed before the data to be identified is input into the data identification model, for example, including extracting a text portion in the video data to obtain corresponding text data, and then the text data may be input into the data identification model to perform key data identification to obtain an identification result, so as to perform compliance detection on the target application, which is not specifically limited in this specification.
In summary, first, the present application is based on a knowledge distillation architecture, and a large deep learning model serving as a teacher model in the knowledge distillation architecture can be utilized to perform efficient model training on a small deep learning model serving as a student model, so that the model output effect of the small deep learning model is effectively improved by using excellent language knowledge and learning ability of the large deep learning model. Then, the application can utilize the training-completed small-sized deep learning model to accurately and efficiently identify data in practical application. Therefore, on one hand, when the model is trained, the large-scale deep learning model is used for assisting the small-scale deep learning model to train, so that a large number of parameters of the large-scale deep learning model do not need to be adjusted, only the parameters of the small-scale deep learning model need to be adjusted and optimized, the parameter quantity of the small-scale deep learning model is often smaller, the time consumed by model training and occupied machine resources are greatly reduced, and the training efficiency of the model is improved. On the other hand, when in practical application, the method only adopts the small-sized deep learning model as the data recognition model to perform data recognition on the data to be recognized, so that the hardware requirement on equipment for executing the data recognition can be greatly reduced, the realization cost of the data recognition is effectively controlled, the large-scale application of the data recognition is facilitated, and further, comprehensive and reliable compliance detection can be performed on various applications based on the data recognition, so that the user data safety is protected.
Corresponding to the implementation of the method flow, the embodiment of the specification also provides a data identification device which is applied to the computing equipment. Referring to fig. 6, fig. 6 is a schematic structural diagram of a data identification device according to an exemplary embodiment. As shown in fig. 6, the apparatus 40 includes:
an acquisition unit 401 for acquiring data to be identified;
a data recognition unit 402, configured to input the data to be recognized into a data recognition model for key data recognition; the data identification model comprises a deep learning model which is obtained by training a deep learning model which is used as a teacher model in a knowledge distillation architecture and is used as a student model.
In an embodiment, the recognition result of the key data recognition includes whether the data to be recognized contains key data; if the data to be identified contains key data, the identification result further includes one or more of the following combinations: and the key data corresponds to the key data category, and the position of the key data in the data to be identified.
In an illustrated embodiment, the apparatus 40 further comprises a training unit 403 for:
acquiring a training sample set, wherein the training sample set comprises a plurality of sample data, and at least part of the sample data in the plurality of sample data comprises key data corresponding to at least one key data category;
Inputting the plurality of sample data into the knowledge distillation architecture as a deep learning model of a teacher model to encode, so as to obtain encoding results corresponding to the plurality of sample data; the coding result corresponding to each sample data comprises probability values of the characters contained in each sample data corresponding to the key data categories;
training a deep learning model serving as a student model in the knowledge distillation architecture based on the coding results corresponding to the plurality of sample data output by the deep learning model serving as a teacher model, and taking the trained deep learning model serving as the student model as the data recognition model.
In an illustrated embodiment, the training unit 403 is specifically configured to:
respectively adding prompt information to the plurality of sample data, and inputting the plurality of sample data added with the prompt information into the knowledge distillation architecture to be used as a deep learning model of a teacher model for coding, so as to obtain coding results corresponding to the plurality of sample data; the prompt information is used for prompting the deep learning model serving as a teacher model that the training task is to identify key data aiming at the key data contained in the sample data.
In an illustrated embodiment, the training unit 403 is specifically configured to:
taking the coding results corresponding to the sample data output by the deep learning model serving as the teacher model as soft labels, and taking the manual labeling information corresponding to the sample data as hard labels;
training a deep learning model serving as a student model in the knowledge distillation architecture based on the plurality of sample data and the soft tag and the hard tag corresponding to the plurality of sample data.
In an illustrated embodiment, the deep learning model as a teacher model includes a pre-training model PTM and the deep learning model as a student model includes a named entity recognition NER model.
In an embodiment, the model architecture of the deep learning model as the student model includes: encoder and conditional random field CRF layer.
In an illustrated embodiment, the encoder includes any one of the following illustrated encoders:
an encoder based on a cyclic neural network RNN model, an encoder based on a long-short-term memory LSTM model and an encoder based on a transducer model.
Corresponding to the implementation of the method flow, the embodiment of the specification also provides an application compliance detection device based on data identification, which is applied to the computing equipment. Referring to fig. 7, fig. 7 is a schematic structural diagram of an application compliance detection device based on data identification according to an exemplary embodiment. As shown in fig. 7, the apparatus 50 includes:
an obtaining unit 501, configured to obtain data to be identified in a target application; the data to be identified comprises data provided by the target application for a user;
the data identification unit 502 is configured to input the data to be identified into a data identification model for key data identification; the data identification model comprises a deep learning model which is obtained by training a deep learning model which is used as a teacher model in a knowledge distillation architecture and is used as a student model;
and the compliance detection unit 503 is configured to perform compliance detection for the target application based on the key data identified from the data to be identified.
In an embodiment, the data to be identified includes text data of a data acquisition protocol provided by the target application for a user; the key data comprise user data corresponding to the user, which are indicated in the data acquisition protocol and are to be acquired by the target application.
In an illustrated embodiment, the compliance detection unit 503 is specifically configured to:
acquiring key data actually acquired by the target application for the user in the running process;
and comparing the key data identified from the data to be identified with the key data actually collected by the target application so as to detect the compliance of the target application.
In an illustrated embodiment, the compliance detection unit 503 is specifically configured to:
comparing the key data identified from the data to be identified with the key data actually collected by the target application to judge whether the key data actually collected by the target application exceeds the data range of the key data identified from the data to be identified; and if so, determining the compliance detection result of the target application as non-compliance.
The implementation process of the functions and roles of the units in the above-mentioned device 40 and the device 50 is specifically described in the above-mentioned corresponding embodiments of fig. 1 to 5, and will not be described in detail herein. It should be understood that the above-mentioned apparatus 40 and apparatus 50 may be implemented by software, or may be implemented by hardware or a combination of hardware and software. Taking software implementation as an example, the device in a logic sense is formed by reading corresponding computer program instructions into a memory by a processor (CPU) of the device. In addition to the CPU and the memory, the device in which the above apparatus is located generally includes other hardware such as a chip for performing wireless signal transmission and reception, and/or other hardware such as a board for implementing a network communication function.
The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over a plurality of network nodes. Some or all of the units or modules may be selected according to actual needs to achieve the purposes of the solutions in this specification. Those of ordinary skill in the art can understand and implement them without undue effort.
The apparatus, units, and modules described in the above embodiments may be implemented by a computer chip or entity, or by a product having a certain function. A typical implementation device is a computer, which may take the form of a personal computer, laptop computer, cellular telephone, camera phone, smart phone, personal digital assistant, media player, navigation device, email device, game console, tablet computer, or wearable device, or a combination of any of these devices.
Corresponding to the method embodiments described above, embodiments of this specification further provide a computing device. Referring to Fig. 8, Fig. 8 is a schematic structural diagram of a computing device according to an exemplary embodiment. The computing device 1000 shown in Fig. 8 may be a smart phone, a tablet computer, a notebook computer, a desktop computer, an on-board computer, or a server; when it is a server, it may also be a server cluster or a cloud computing service center formed by a plurality of servers, which is not limited in this specification. As shown in Fig. 8, the computing device 1000 includes a processor 1001 and a memory 1002, and may further include an input device 1004 (e.g., a keyboard) and an output device 1005 (e.g., a display). The processor 1001, the memory 1002, the input device 1004, and the output device 1005 may be connected by a bus or in another manner. As shown in Fig. 8, the memory 1002 includes a computer-readable storage medium 1003, which stores a computer program executable by the processor 1001. The processor 1001 may be a CPU, a microprocessor, or an integrated circuit that controls the execution of the above method embodiments. When running the stored computer program, the processor 1001 may execute the steps of the data identification method, or the steps of the application compliance detection method based on data identification, in the embodiments of this specification; refer to the description of the corresponding embodiments of Figs. 1 to 5, which is not repeated here.
Corresponding to the above method embodiments, embodiments of this specification further provide a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, can perform the steps of the data identification method, or the steps of the application compliance detection method based on data identification, in the embodiments of this specification. Refer to the above description of the corresponding embodiments of Figs. 1 to 5; details are not repeated here.
The foregoing description of the preferred embodiments is provided for the purpose of illustration only, and is not intended to limit the scope of the disclosure, since any modifications, equivalents, improvements, etc. that fall within the spirit and principles of the disclosure are intended to be included within the scope of the disclosure.
In a typical configuration, the terminal device includes one or more CPUs, input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, Random Access Memory (RAM), and/or nonvolatile memory, such as Read-Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data.
Examples of computer storage media include, but are not limited to, Phase-change Memory (PRAM), Static Random-Access Memory (SRAM), Dynamic Random-Access Memory (DRAM), other types of Random-Access Memory (RAM), Read-Only Memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), flash memory or other memory technology, Compact Disc Read-Only Memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
It will be appreciated by those skilled in the art that embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, embodiments of the present specification may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Moreover, embodiments of the present description may take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.

Claims (16)

1. A method of data identification, the method comprising:
acquiring data to be identified;
inputting the data to be identified into a data identification model for key data identification; wherein the data identification model comprises a deep learning model that serves as a student model in a knowledge distillation architecture and is obtained by training based on a deep learning model serving as a teacher model in the knowledge distillation architecture.
2. The method of claim 1, wherein the recognition result of the key data identification comprises whether the data to be identified contains key data; and if the data to be identified contains key data, the recognition result further includes one or a combination of more of the following: a key data category corresponding to the key data, and a position of the key data in the data to be identified.
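As an illustration of the recognition result described in claim 2 (presence, category, and position of key data), the following minimal sketch recovers (category, start, end) spans from per-character BIO labels, one common way position information is represented in named entity recognition. The label scheme and function name are assumptions for illustration; the claim does not prescribe a label format.

```python
def spans_from_bio(labels):
    """Recover (category, start, end) spans from per-character BIO labels,
    e.g. ["B-PHONE", "I-PHONE", "O"] -> [("PHONE", 0, 2)].
    An empty result means the text contains no key data."""
    spans, start, cat = [], None, None
    for i, lab in enumerate(labels):
        if lab.startswith("B-"):
            if start is not None:          # close the previous span
                spans.append((cat, start, i))
            start, cat = i, lab[2:]        # open a new span
        elif lab.startswith("I-") and cat == lab[2:]:
            continue                       # extend the current span
        else:
            if start is not None:          # "O" or mismatched "I-": close span
                spans.append((cat, start, i))
            start, cat = None, None
    if start is not None:                  # close a span that runs to the end
        spans.append((cat, start, len(labels)))
    return spans
```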
3. The method of claim 2, the method further comprising:
acquiring a training sample set, wherein the training sample set comprises a plurality of sample data, and at least part of the plurality of sample data contain key data corresponding to at least one key data category;
inputting the plurality of sample data into the deep learning model serving as the teacher model in the knowledge distillation architecture for encoding, to obtain encoding results corresponding to the plurality of sample data; wherein the encoding result corresponding to each sample datum comprises probability values of each character contained in that sample datum with respect to the key data categories;
and training the deep learning model serving as the student model in the knowledge distillation architecture based on the encoding results, corresponding to the plurality of sample data, output by the deep learning model serving as the teacher model, and taking the trained student model as the data identification model.
4. The method of claim 3, wherein the inputting the plurality of sample data into the deep learning model serving as the teacher model in the knowledge distillation architecture for encoding, to obtain encoding results corresponding to the plurality of sample data, comprises:
respectively adding prompt information to the plurality of sample data, and inputting the plurality of sample data with the prompt information added into the deep learning model serving as the teacher model in the knowledge distillation architecture for encoding, to obtain the encoding results corresponding to the plurality of sample data; wherein the prompt information is used to prompt the deep learning model serving as the teacher model that the training task is to identify the key data contained in the sample data.
5. The method of claim 3, wherein the training the deep learning model serving as the student model in the knowledge distillation architecture based on the encoding results, corresponding to the plurality of sample data, output by the deep learning model serving as the teacher model comprises:
taking the encoding results corresponding to the plurality of sample data output by the deep learning model serving as the teacher model as soft labels, and taking manual annotation information corresponding to the plurality of sample data as hard labels;
and training the deep learning model serving as the student model in the knowledge distillation architecture based on the plurality of sample data and the soft labels and hard labels corresponding to the plurality of sample data.
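The soft/hard-label combination in claim 5 is typically realized as a weighted sum of two cross-entropy terms, sketched below for a single character's probability distribution over key data categories. The weighting factor `alpha` and the plain cross-entropy form are assumptions; the claim does not fix a concrete loss function.

```python
import math

def distill_loss(student_probs, teacher_probs, hard_label, alpha=0.5):
    """Per-character distillation loss: a soft-label term (cross-entropy
    against the teacher's probability distribution, the "soft label") plus
    a hard-label term (cross-entropy against the manually annotated category
    index, the "hard label"), mixed by a hypothetical weight alpha."""
    eps = 1e-12  # guard against log(0)
    # soft term: how far the student distribution is from the teacher's
    soft = -sum(t * math.log(s + eps)
                for t, s in zip(teacher_probs, student_probs))
    # hard term: negative log-likelihood of the annotated category
    hard = -math.log(student_probs[hard_label] + eps)
    return alpha * soft + (1 - alpha) * hard
```

A student that already matches both the teacher distribution and the annotation incurs (near) zero loss; the loss grows as the student's probability mass moves away from either target.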
6. The method of claim 1, wherein the deep learning model serving as the teacher model comprises a pre-trained model (PTM), and the deep learning model serving as the student model comprises a named entity recognition (NER) model.
7. The method of claim 1, wherein the model architecture of the deep learning model serving as the student model comprises an encoder and a conditional random field (CRF) layer.
8. The method of claim 7, wherein the encoder comprises any one of the following:
an encoder based on a recurrent neural network (RNN) model, an encoder based on a long short-term memory (LSTM) model, and an encoder based on a Transformer model.
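An encoder + CRF architecture as in claims 7 and 8 typically decodes the best label sequence with the Viterbi algorithm, combining per-position emission scores from the encoder with the CRF's label-transition scores. The following is a minimal pure-Python sketch under that assumption; the claims themselves do not specify a decoding procedure, and the encoder (RNN/LSTM/Transformer) is assumed to have already produced `emissions[t][y]`, the score of label `y` at position `t`.

```python
def viterbi_decode(emissions, transitions):
    """Return the highest-scoring label sequence (as label indices).

    emissions:   list of per-position score lists from the encoder
    transitions: transitions[p][y] = CRF score of moving from label p to y
    """
    n_labels = len(emissions[0])
    # score[y] = best score of any path ending in label y at the current position
    score = list(emissions[0])
    back = []  # back-pointers, one list per position after the first
    for em in emissions[1:]:
        new_score, pointers = [], []
        for y in range(n_labels):
            best_prev = max(range(n_labels),
                            key=lambda p: score[p] + transitions[p][y])
            new_score.append(score[best_prev] + transitions[best_prev][y] + em[y])
            pointers.append(best_prev)
        score = new_score
        back.append(pointers)
    # trace back from the best final label
    best = max(range(n_labels), key=lambda y: score[y])
    path = [best]
    for pointers in reversed(back):
        best = pointers[best]
        path.append(best)
    return path[::-1]
```

The transition scores are what distinguish a CRF layer from independent per-character classification: a strongly negative transition can veto a label sequence even when the emissions alone would favor it.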
9. An application compliance detection method based on data identification, the method comprising:
acquiring data to be identified in a target application; wherein the data to be identified comprise data that the target application provides to a user;
inputting the data to be identified into a data identification model for key data identification; wherein the data identification model comprises a deep learning model that serves as a student model in a knowledge distillation architecture and is obtained by training based on a deep learning model serving as a teacher model in the knowledge distillation architecture;
and performing compliance detection on the target application based on the key data identified from the data to be identified.
10. The method of claim 9, wherein the data to be identified comprise text data of a data acquisition protocol that the target application provides to a user; and the key data comprise the user data, indicated in the data acquisition protocol, that the target application is to collect from the user.
11. The method of claim 10, wherein the performing compliance detection on the target application based on the key data identified from the data to be identified comprises:
acquiring the key data actually collected from the user by the target application during running;
and comparing the key data identified from the data to be identified with the key data actually collected by the target application, so as to perform compliance detection on the target application.
12. The method of claim 11, wherein the comparing the key data identified from the data to be identified with the key data actually collected by the target application, so as to perform compliance detection on the target application, comprises:
comparing the key data identified from the data to be identified with the key data actually collected by the target application, to determine whether the key data actually collected by the target application exceed the data range of the key data identified from the data to be identified; and if so, determining that the compliance detection result of the target application is non-compliant.
13. A data identification device, the device comprising:
an acquisition unit, configured to acquire data to be identified;
and a data identification unit, configured to input the data to be identified into a data identification model for key data identification; wherein the data identification model comprises a deep learning model that serves as a student model in a knowledge distillation architecture and is obtained by training based on a deep learning model serving as a teacher model in the knowledge distillation architecture.
14. An application compliance detection device based on data identification, the device comprising:
an acquisition unit, configured to acquire data to be identified in a target application; wherein the data to be identified comprise data that the target application provides to a user;
a data identification unit, configured to input the data to be identified into a data identification model for key data identification; wherein the data identification model comprises a deep learning model that serves as a student model in a knowledge distillation architecture and is obtained by training based on a deep learning model serving as a teacher model in the knowledge distillation architecture;
and a compliance detection unit, configured to perform compliance detection on the target application based on the key data identified from the data to be identified.
15. A computing device, comprising: a memory and a processor; the memory has stored thereon a computer program executable by the processor; the processor, when running the computer program, performs the method of any one of claims 1 to 8 or performs the method of any one of claims 9 to 12.
16. A computer readable storage medium having stored thereon a computer program which when executed by a processor implements the method of any of claims 1 to 8 or implements the method of any of claims 9 to 12.
CN202310356055.5A 2023-04-03 2023-04-03 Data identification method and related equipment Pending CN116484864A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310356055.5A CN116484864A (en) 2023-04-03 2023-04-03 Data identification method and related equipment


Publications (1)

Publication Number Publication Date
CN116484864A true CN116484864A (en) 2023-07-25

Family

ID=87226081

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310356055.5A Pending CN116484864A (en) 2023-04-03 2023-04-03 Data identification method and related equipment

Country Status (1)

Country Link
CN (1) CN116484864A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination