CN112149179A

CN112149179A - Risk identification method and device based on privacy protection

Info

Publication number: CN112149179A
Application number: CN202010986615.1A
Authority: CN
Inventors: 陈永环; 侯辉超; 张正雄
Original assignee: Alipay Hangzhou Information Technology Co Ltd
Current assignee: Alipay Hangzhou Information Technology Co Ltd
Priority date: 2020-09-18
Filing date: 2020-09-18
Publication date: 2020-12-29
Anticipated expiration: 2040-09-18
Also published as: CN112149179B

Abstract

One or more embodiments of the present specification disclose a risk identification method and apparatus based on privacy protection, which address the poor risk prevention and control effect and the significant limitations of existing methods that use terminal edge computing power. The method comprises: obtaining first text data generated while a target applet is running; performing semantic recognition on the first text data by using a semantic representation model deployed in the client in advance, to obtain a semantic representation vector corresponding to the first text data, where the semantic representation model is obtained by training on a preset corpus and processing with a preset knowledge distillation method, and the preset corpus comprises a first type of sample corpus with risk content and a second type of sample corpus without risk content; and performing risk identification on the first text data according to the semantic representation vector, to obtain a risk identification result corresponding to the first text data.

Description

Risk identification method and device based on privacy protection

Technical Field

The present disclosure relates to the field of risk prevention and control technologies, and in particular, to a risk identification method and apparatus based on privacy protection.

Background

In the field of content prevention and control, achieving effective content risk prevention and control while protecting user privacy is a new challenge currently facing the industry. With the popularization of smartphones, the edge computing capability of mobile terminals has improved greatly, which makes it feasible to use that capability for content risk prevention and control. Conventional methods for content risk prevention and control that use the edge computing capability of a mobile terminal generally take one of the following two approaches:

In the first approach, the terminal perturbs the original content to protect privacy by means such as word-frequency counting, TF-IDF (Term Frequency-Inverse Document Frequency), text regularization, and text rules. Although this protects user privacy, it also damages the semantics of the original content, especially for long sentences and long texts, so the risk prevention and control effect is limited. The second approach uses a general-purpose semantic representation model, but such a model is large and cannot be deployed on the terminal side, which greatly limits content risk prevention and control that relies on terminal edge computing capability.

Disclosure of Invention

In one aspect, one or more embodiments of the present specification provide a risk identification method based on privacy protection, applied to a client, including: obtaining first text data generated while a target applet is running; performing semantic recognition on the first text data by using a semantic representation model deployed in the client in advance, to obtain a semantic representation vector corresponding to the first text data, where the semantic representation model is obtained by training on a preset corpus and processing with a preset knowledge distillation method, and the preset corpus comprises a first type of sample corpus with risk content and a second type of sample corpus without risk content; and performing risk identification on the first text data according to the semantic representation vector, to obtain a risk identification result corresponding to the first text data.

In another aspect, one or more embodiments of the present specification provide a risk identification apparatus based on privacy protection, applied to a client, including: an acquisition module, configured to acquire first text data generated while a target applet is running; a semantic recognition module, configured to perform semantic recognition on the first text data by using a semantic representation model deployed in the client in advance, to obtain a semantic representation vector corresponding to the first text data, where the semantic representation model is obtained by training on a preset corpus and processing with a preset knowledge distillation method, and the preset corpus comprises a first type of sample corpus with risk content and a second type of sample corpus without risk content; and a risk identification module, configured to perform risk identification on the first text data according to the semantic representation vector, to obtain a risk identification result corresponding to the first text data.

In yet another aspect, one or more embodiments of the present specification provide a risk identification device based on privacy protection, including a processor and a memory electrically connected to the processor, where the memory stores a computer program, and the processor is configured to call and execute the computer program from the memory to implement the risk identification method based on privacy protection.

In still another aspect, embodiments of the present application provide a storage medium for storing a computer program, where the computer program is executable by a processor to implement the above risk identification method based on privacy protection.

Drawings

In order to more clearly illustrate one or more embodiments or technical solutions in the prior art in the present specification, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments described in one or more embodiments of the present specification, and other drawings can be obtained by those skilled in the art without inventive efforts.

FIG. 1 is a schematic flow chart diagram of a method for risk identification based on privacy protection according to an embodiment of the present description;

FIG. 2 is a schematic block diagram of a privacy protection based risk identification system in accordance with an embodiment of the present description;

FIG. 3 is a schematic flow chart diagram of a method for privacy protection based risk identification in accordance with another embodiment of the present description;

FIG. 4 is a schematic block diagram of a risk identification apparatus based on privacy protection according to an embodiment of the present description;

FIG. 5 is a schematic block diagram of a risk identification device based on privacy protection according to an embodiment of the present specification.

Detailed Description

One or more embodiments of the present disclosure provide a risk identification method and apparatus based on privacy protection, so as to solve the problems of poor risk prevention and control effect and large limitation in the existing method that utilizes a terminal edge computing capability.

In order to make those skilled in the art better understand the technical solutions in one or more embodiments of the present disclosure, the technical solutions in one or more embodiments of the present disclosure will be clearly and completely described below with reference to the drawings in one or more embodiments of the present disclosure, and it is obvious that the described embodiments are only a part of the embodiments of the present disclosure, and not all embodiments. All other embodiments that can be derived by a person skilled in the art from one or more of the embodiments of the present disclosure without making any creative effort shall fall within the protection scope of one or more of the embodiments of the present disclosure.

Fig. 1 is a schematic flow chart of a risk identification method based on privacy protection according to an embodiment of the present specification, and the method is applied to a client, as shown in fig. 1, and includes the following steps:

S102, obtaining first text data generated while the target applet is running.

The first text data can be captured by monitoring the running of the target applet. The target applet is deployed in a specified host program in the client, and the client can monitor the running process of the target applet. Specifically, when the client detects a trigger operation by the user on the target applet (that is, starting the target applet), it obtains the operation page display information produced while the user uses the target applet. By performing text recognition on the operation page display information, the text data in it, namely the first text data generated while the target applet is running, can be identified.

S104, performing semantic recognition on the first text data by using a semantic representation model deployed in the client in advance, to obtain a semantic representation vector corresponding to the first text data.

The semantic representation model is obtained by training based on a preset corpus and processing by using a preset knowledge distillation method. The preset corpus comprises a first type sample corpus with risk content and a second type sample corpus with no risk content. Each sample corpus in the preset corpus can be obtained by collecting texts related to risk identification in historical text contents, for example, collecting a plurality of historical text contents from encyclopedia, news, interactive communities and other channels, and screening out texts related to risk identification from the plurality of historical text contents as sample corpora.

The semantic representation vector comprises a first semantic feature vector corresponding to the risk content and/or a second semantic feature vector corresponding to the safety content.

In this embodiment, the model type of the semantic representation model is not limited; it may be, for example, a BERT (Bidirectional Encoder Representations from Transformers) model. The preset knowledge distillation method may be ALBERT, Q8BERT, DistilBERT, TinyBERT, etc. During model training, FullTokenizer (a word segmentation tool) can be used for word segmentation; FullTokenizer has the advantage of removing more of the redundant information in the sample corpus, which reduces redundancy in the finally trained semantic representation model.

S106, performing risk identification on the first text data according to the semantic representation vector, to obtain a risk identification result corresponding to the first text data.

In this step, the client can perform risk identification on the first text data locally based on the semantic representation vector to obtain the risk identification result. Alternatively, the client can send the semantic representation vector to the cloud, and the cloud performs risk identification on the first text data and sends the risk identification result to the client.
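The specification names several compression methods but does not spell out a training objective. As a purely illustrative sketch, a generic soft-target distillation loss (a temperature-softened KL divergence between teacher and student outputs) could look as follows. The function names, the temperature value, and the choice of this particular loss are assumptions for illustration and are not stated in the specification.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax over logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(
    student_logits: Sequence[float],
    teacher_logits: Sequence[float],
    temperature: float = 2.0,  # hypothetical value, not from the specification
) -> float:
    """KL(teacher || student) on temperature-softened distributions.

    The T^2 factor is the customary scaling that keeps gradient magnitudes
    comparable across temperatures.
    """
    if len(student_logits) != len(teacher_logits):
        raise ValueError("student and teacher logits must have the same length")
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        pt * (math.log(pt) - math.log(ps))
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0.0
    )
    return temperature ** 2 * kl
```

The loss is zero when the student reproduces the teacher's logits and grows as the two distributions diverge, which is what lets a small client-side model imitate a larger one.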

By adopting the technical scheme of one or more embodiments of the specification, after the first text data generated in the running process of the target applet is obtained, the semantic representation model pre-deployed at the client is utilized to perform semantic recognition on the first text data to obtain the semantic representation vector corresponding to the first text data, and then the risk recognition is performed on the first text data according to the semantic representation vector, so that the risk recognition of the text data only needs to be based on the corresponding semantic representation vector and does not need to depend on the semantic content of the text data, the semantic content is well protected, and especially when the semantic content of the text data contains the user privacy data, the problem of user privacy disclosure can be avoided. Moreover, the semantic representation model is obtained by training based on a preset corpus and processing the semantic representation model by using a preset knowledge distillation method, and the preset corpus comprises a first type sample corpus with risk content and a second type sample corpus without risk content (namely, a sample corpus related to risk prevention and control), so that the trained semantic representation model can be compressed to a great extent and can be deployed at a client, the problem that the model cannot be supported by the client when the model is too large is solved, the semantic representation model has pertinence to risk prevention and control, and the risk prevention and control efficiency of the client for text data can be improved.

In an embodiment, after the semantic representation vector corresponding to the first text data is identified, the semantic representation vector may be sent to the cloud end, so that the cloud end performs risk identification on the first text data based on the semantic representation vector to obtain a risk identification result corresponding to the first text data, and sends the risk identification result to the client end.

In this embodiment, the client only needs to identify the semantic representation vector corresponding to the first text data, and the risk identification process for the first text data is executed by the cloud, so that the computational pressure on the client side is reduced, and the risk identification efficiency of the first text data is improved. Moreover, the cloud end relies on the semantic representation vector corresponding to the first text data when identifying the risk of the first text data, so that the semantic content of the first text data cannot be known, the semantic content is well protected, and especially when the semantic content contains user privacy data, the problem of user privacy disclosure can be avoided.
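The division of work described above can be sketched as follows: the client serializes only the semantic representation vector, and the cloud scores that vector without ever seeing the text. This is a minimal sketch under stated assumptions; the payload field name, the linear scorer, and its weights are hypothetical placeholders, since the specification does not define the cloud-side classifier.

```python
import json
from typing import Dict, Sequence


def client_build_payload(semantic_vector: Sequence[float]) -> str:
    """Client side: serialize ONLY the vector; the raw text is never included."""
    return json.dumps({"semantic_vector": list(semantic_vector)})


def cloud_identify_risk(
    payload: str,
    weights: Sequence[float],  # hypothetical cloud-side model parameters
    bias: float = 0.0,
) -> Dict[str, object]:
    """Cloud side: score the received vector with a placeholder linear model."""
    vector = json.loads(payload)["semantic_vector"]
    if len(vector) != len(weights):
        raise ValueError("vector and weights must have the same length")
    score = sum(v * w for v, w in zip(vector, weights)) + bias
    return {"is_risk": score > 0.0, "score": score}
```

For example, `cloud_identify_risk(client_build_payload([1.0, 0.0, 1.0]), [0.5, -1.0, 0.25])` operates on the vector alone; the cloud has no field from which the original text could be read.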

In an embodiment, after the semantic representation vector corresponding to the first text data is identified, the client may determine whether the semantic representation vector includes a first semantic feature vector corresponding to risk content. If it does, the client sends the semantic representation vector to the cloud, so that the cloud performs risk identification on the first text data based on the first semantic feature vector included in the semantic representation vector, obtains the risk identification result corresponding to the first text data, and sends the risk identification result to the client. If the semantic representation vector does not include a first semantic feature vector, that is, it includes only second semantic feature vectors corresponding to safe content, it may be determined that the first text data does not include risk content.

When the semantic representation model is used to perform semantic recognition on the first text data, it can not only identify the semantic representation vector corresponding to the first text data but also determine first information about the first semantic feature vectors contained in the semantic representation vector. The first information can include whether a first semantic feature vector is contained, the number of first semantic feature vectors contained, their proportion in the semantic representation vector, and the like. The identified semantic representation vector may then be labeled based on the determined first information.

For example, the client performs semantic recognition on the first text data by using a locally deployed semantic representation model to obtain the semantic representation vector corresponding to the first text data, and determines that the semantic representation vector includes a first semantic feature vector A and a first semantic feature vector B. The client may label the positions of the first semantic feature vector A and the first semantic feature vector B in the semantic representation vector, to indicate that the feature vectors at those positions are first semantic feature vectors corresponding to risk content, for example by adding identification information such as "first semantic feature vector", "risk content", or "risk feature vector". Alternatively, based on the identified first semantic feature vector A and first semantic feature vector B, the client may determine that the semantic representation vector includes 2 first semantic feature vectors, and may therefore carry the quantity information "2" in the identified semantic representation vector to indicate that it includes 2 first semantic feature vectors corresponding to risk content.
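One way to picture the labeling in this example is a small container that carries the feature vectors together with the positions and count of those marked as risk. The sketch below is illustrative only; the class name, field names, and the idea of passing per-feature risk flags are assumptions rather than structures defined in the specification.

```python
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass
class LabeledRepresentation:
    """Semantic representation vector plus labels for its risk features."""

    features: List[List[float]]
    risk_positions: List[int] = field(default_factory=list)

    @property
    def risk_count(self) -> int:
        """Quantity information, e.g. 2 when vectors A and B are risk features."""
        return len(self.risk_positions)


def label_representation(
    features: Sequence[Sequence[float]],
    risk_flags: Sequence[bool],  # hypothetical per-feature output of the model
) -> LabeledRepresentation:
    """Record which positions hold first (risk) semantic feature vectors."""
    if len(features) != len(risk_flags):
        raise ValueError("features and risk_flags must have the same length")
    positions = [i for i, flag in enumerate(risk_flags) if flag]
    return LabeledRepresentation(
        features=[list(f) for f in features],
        risk_positions=positions,
    )
```

With three feature vectors of which the second and third are flagged, `risk_positions` is `[1, 2]` and `risk_count` is `2`, matching the position-based and quantity-based labels described above.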

The client side can judge whether the semantic representation vector corresponding to the first text data contains the first semantic feature vector corresponding to the risk content according to the first information of the first semantic feature vector determined by the semantic representation model.

In this embodiment, after the client identifies the semantic representation vector corresponding to the first text data, it sends the semantic representation vector to the cloud only when the vector includes a first semantic feature vector corresponding to risk content, and the cloud then performs risk identification on the first text data based on the first semantic feature vector included in the semantic representation vector. Conversely, when the semantic representation vector includes only second semantic feature vectors corresponding to safe content, the client does not need to send it to the cloud. This reduces unnecessary data transmission, saves data transmission resources, and improves the risk identification efficiency for the first text data.

In one embodiment, after the semantic representation vector corresponding to the first text data is identified, if the semantic representation vector includes a first semantic feature vector, first information about the first semantic feature vectors included in the semantic representation vector is determined, where the first information includes the number of first semantic feature vectors and/or the proportion of first semantic feature vectors in the semantic representation vector. Then, a risk weight corresponding to the semantic representation vector is determined according to the first information, where the risk weight is positively correlated with the number and/or the proportion of first semantic feature vectors. Finally, risk identification is performed on the first text data according to the semantic representation vector and the corresponding risk weight, to obtain the risk identification result corresponding to the first text data.

That is to say, the client can not only identify the semantic representation vector corresponding to the first text data but also determine first information about the first semantic feature vectors included in the semantic representation vector, where the first information may include whether a first semantic feature vector is included, the number of first semantic feature vectors included, their proportion in the semantic representation vector, and the like.

Optionally, the client may send the semantic representation vector and the risk weight corresponding to the first text data to the cloud, and the cloud performs risk identification on the first text data according to the semantic representation vector and the corresponding risk weight. Specifically, the cloud end can judge whether the risk weight reaches a preset weight threshold value; if so, determining that the first text data belongs to risk type text data; if not, determining that the first text data does not belong to the risk type text data.

Optionally, the risk identification of the first text data may also be performed locally at the client, and specifically, the client may perform risk identification on the first text data according to the identified semantic representation vector and the corresponding risk weight. Specifically, the client can judge whether the risk weight reaches a preset weight threshold; if so, determining that the first text data belongs to risk type text data; if not, determining that the first text data does not belong to the risk type text data.

In this embodiment, when determining the risk weight corresponding to the semantic representation vector according to the first information, the client may base the determination on information such as the number of first semantic feature vectors included in the semantic representation vector and the proportion of first semantic feature vectors in the semantic representation vector.

Specifically, a functional relationship between the risk weight and the number of first semantic feature vectors, the proportion of first semantic feature vectors in the semantic representation vector, and the like may be preset, where the risk weight is positively correlated with the number of first semantic feature vectors and positively correlated with the proportion of first semantic feature vectors in the semantic representation vector.
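The following is a minimal sketch of one such preset functional relationship, combined with the threshold judgment described earlier. The linear form and the coefficients `alpha` and `beta` are hypothetical; the specification only requires that the risk weight increase with the number and the proportion of first semantic feature vectors.

```python
from typing import Sequence, Tuple


def first_information(risk_flags: Sequence[bool]) -> Tuple[int, float]:
    """Return (number, proportion) of first (risk) semantic feature vectors."""
    total = len(risk_flags)
    count = sum(1 for flag in risk_flags if flag)
    ratio = count / total if total else 0.0
    return count, ratio


def risk_weight(count: int, ratio: float, alpha: float = 1.0, beta: float = 10.0) -> float:
    """Hypothetical monotone function: increases with both count and ratio."""
    if alpha <= 0.0 or beta <= 0.0:
        raise ValueError("alpha and beta must be positive to keep the correlation positive")
    return alpha * count + beta * ratio


def reaches_threshold(weight: float, weight_threshold: float) -> bool:
    """The text is treated as risk-type when the weight reaches the threshold."""
    return weight >= weight_threshold
```

For four feature vectors of which two are risk features, `first_information` yields a count of 2 and a proportion of 0.5, and with the default coefficients the weight is 7.0. Whether that marks the text as risk-type depends entirely on the preset threshold.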

Or, the mapping relationship between the risk weight and the number of the first semantic feature vectors, the proportion of the first semantic feature vectors in the semantic representation vectors, and the like may be preset, and may be a one-to-one mapping relationship between the numerical value, the proportion and the risk weight, or may be a mapping relationship between the range in which the numerical value and the proportion are located and the risk weight. The following table 1 schematically lists the mapping relationship between the proportion range and the risk weight of the first semantic feature vector in the semantic representation vector. Wherein the higher the risk weight, the higher the risk degree of the first text data is.

TABLE 1

Proportion range	Risk weight
0	0
(0, 10%]	1
(10%, 50%]	2
(50%, 80%]	3
(80%, 100%]	4

It should be noted that, the mapping relationship between the proportion range and the risk weight of the first semantic feature vector in the semantic representation vector is only schematically listed in table 1, and the specific numerical values in the table have no limiting meaning.
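The range-based mapping of Table 1 can be sketched as a simple lookup. The function name is hypothetical, and the boundaries below merely reproduce the illustrative values of Table 1, which have no limiting meaning.

```python
def risk_weight_from_proportion(proportion: float) -> int:
    """Map the proportion of first semantic feature vectors to a risk weight.

    Intervals are open on the left and closed on the right, as in Table 1.
    """
    if not 0.0 <= proportion <= 1.0:
        raise ValueError("proportion must be between 0 and 1")
    if proportion == 0.0:
        return 0
    if proportion <= 0.10:
        return 1
    if proportion <= 0.50:
        return 2
    if proportion <= 0.80:
        return 3
    return 4
```

A proportion of exactly 10% falls in the (0, 10%] range and maps to weight 1, while any proportion just above 10% maps to weight 2.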

In the above embodiment, the first information about the first semantic feature vectors included in the semantic representation vector (including whether a first semantic feature vector is included, the number of first semantic feature vectors included, their proportion in the semantic representation vector, and the like) is identified by the semantic representation model locally deployed at the client, and risk identification is then performed on the first text data according to the first information. This simplifies the risk identification process for text data and improves its efficiency.

In one embodiment, when performing semantic recognition on first text data by using a semantic representation model locally deployed at a client, whether the character length of the first text data is greater than a preset character length threshold value or not can be judged; if so, splitting the first text data according to a preset splitting rule to obtain a plurality of sub-text data; and performing semantic identification on each sub-text data by using a semantic representation model to obtain sub-semantic representation vectors corresponding to each sub-text data, and synthesizing the sub-semantic representation vectors corresponding to each sub-text data to obtain the semantic representation vector corresponding to the first text data.

The sub-semantic representation vectors corresponding to the sub-text data can be combined in either of two ways: they can be concatenated in sequence to form the semantic representation vector corresponding to the first text data; or the average vector of the plurality of sub-semantic representation vectors can be calculated, and that average vector is taken as the semantic representation vector corresponding to the first text data.
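The splitting and combining steps can be sketched as follows. This is an illustrative sketch only: the punctuation-based splitting rule, the function names, and the `encoder` callable standing in for the semantic representation model are assumptions. Note that this naive rule would also split at a decimal point inside a number.

```python
import re
from typing import Callable, List, Sequence

Vector = List[float]

# Hypothetical splitting rule: cut after common ASCII and full-width punctuation.
_PIECE = re.compile(r"[^,.;!?，。；！？]+[,.;!?，。；！？]?")


def split_text(text: str, max_chars: int) -> List[str]:
    """Split only when the text is longer than the preset character threshold."""
    if len(text) <= max_chars:
        return [text]
    pieces = [p.strip() for p in _PIECE.findall(text) if p.strip()]
    return pieces or [text]


def concat_vectors(vectors: Sequence[Vector]) -> Vector:
    """Option 1: concatenate the sub-vectors in sequence."""
    return [x for vec in vectors for x in vec]


def average_vectors(vectors: Sequence[Vector]) -> Vector:
    """Option 2: element-wise average of the sub-vectors."""
    if not vectors or any(len(v) != len(vectors[0]) for v in vectors):
        raise ValueError("need at least one vector, all of equal length")
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def encode_long_text(
    text: str,
    encoder: Callable[[str], Vector],  # stand-in for the deployed model
    max_chars: int = 128,
    mode: str = "average",
) -> Vector:
    """Encode each sub-text, then combine the sub-semantic representation vectors."""
    sub_vectors = [encoder(piece) for piece in split_text(text, max_chars)]
    if mode == "concat":
        return concat_vectors(sub_vectors)
    return average_vectors(sub_vectors)
```

Averaging keeps the output dimension fixed regardless of how many sub-texts there are, whereas concatenation preserves per-sentence information at the cost of a variable-length result.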

In this embodiment, the character length of the first text data to be recognized can be monitored by presetting the character length threshold of the text data, and when the character length of the first text data is too long, the first text data can be split, and semantic recognition is performed on the split sub-text data, so that the difficulty of semantic recognition is reduced, and the situation that the semantic recognition result is inaccurate when the text data is too long is avoided.

In any of the above embodiments, if the risk identification result is that the first text data belongs to risk-type text data, the cloud may perform corresponding risk prevention and control operations on the first text data. If the risk identification result is obtained by cloud identification, the cloud can directly perform corresponding risk prevention and control operation on the first text data based on the risk identification result. And if the risk identification result is obtained by the client, the client sends the risk identification result corresponding to the first text data to the cloud so that the cloud performs corresponding risk prevention and control operation on the first text data.

In any of the above embodiments, after the risk identification result corresponding to the first text data is identified, the risk identification result may be displayed on an interface of the client to prompt the user that the target applet has a risk.

In one embodiment, after risk recognition is performed on the first text data according to the semantic representation vector to obtain a risk recognition result corresponding to the first text data, if the risk recognition result is that the first text data belongs to risk type text data, identification information of the target applet is determined, and the identification information of the target applet is sent to the cloud, so that the cloud performs risk prevention and control operation on the target applet based on the identification information of the target applet.

The risk prevention and control operation may be, for example, removing the target applet from the specified application store. The identification information of the target applet may include the applet ID, the applet name, etc.

In this embodiment, under the condition that the first text data belongs to risk-type text data, not only can risk prevention and control be performed on the first text data, but also risk prevention and control can be performed on a target applet (i.e., a carrier of the first text data) corresponding to the first text data, so that the purpose of risk prevention and control from the source is achieved, and a risk prevention and control effect is greatly improved.

Fig. 2 is a schematic block diagram of a risk identification system based on privacy protection according to an embodiment of the present description. As shown in fig. 2, the risk identification system based on privacy protection includes a client 21 and a cloud 22, where a semantic representation model is deployed in the client 21, and text data locally acquired by the client 21 can be converted into corresponding semantic representation vectors by using the semantic representation model, so as to protect semantic content of the text data. The cloud 22 may be a server corresponding to the target applet, and is mainly used for training the semantic representation model and deploying the trained semantic representation model to the client 21. Through the interactive process between the client 21 and the cloud 22, the purpose of performing text data risk prevention and control by using the edge computing capacity of the client 21 is achieved.

Fig. 3 is a schematic flowchart of a risk identification method based on privacy protection according to a specific embodiment of the present specification, and as shown in fig. 3, the method is applied to the risk identification system based on privacy protection as shown in fig. 2, and mainly includes the following steps:

S301, the cloud collects a first type of sample corpus with risk content and a second type of sample corpus without risk content to obtain a preset corpus.

The cloud can collect a plurality of historical text contents from different channels such as encyclopedias, news, and interactive communities, and screen out texts related to risk identification from them to serve as sample corpora. That is, each sample corpus in the preset corpus is related to risk identification, and text that is not related to risk identification is deleted as redundant text.

S302, the cloud trains a semantic representation model based on the preset corpus and by using a preset knowledge distillation method.

The preset knowledge distillation method may be ALBERT, Q8BERT, DistilBERT, TinyBERT, etc.

S303, the cloud sends the file package information corresponding to the semantic representation model to the client.

S304, the client deploys the semantic representation model locally according to the file package information corresponding to the semantic representation model issued by the cloud.

S305, when the client detects a trigger operation by the user on the target applet in the client, the client acquires first text data generated while the user uses the target applet.

After monitoring the running of the target applet, the client acquires the operation page display information of the user in the process of using the target applet. Text data in the operation page display information, namely first text data generated in the running process of the target applet can be identified by performing text identification on the operation page display information.

S306, the client performs semantic recognition on the first text data by using the locally deployed semantic representation model to obtain a semantic representation vector corresponding to the first text data.

S307, the client judges whether the semantic representation vector includes a first semantic feature vector corresponding to risk content; if yes, go to S308; if not, go to S311.

The client side can judge whether the first semantic feature vector is contained or not based on the semantic representation vector identified by the semantic representation model. Optionally, when performing semantic identification on the first text data by using the semantic representation model, not only can a semantic representation vector corresponding to the first text data be identified, but also first information of a first semantic feature vector included in the semantic representation vector can be determined, where the first information may include whether the first semantic feature vector is included, the number of the included first semantic feature vectors, a proportion in the semantic representation vector, and the like. The client side can judge whether the semantic representation vector corresponding to the first text data contains the first semantic feature vector corresponding to the risk content according to the first information of the first semantic feature vector determined by the semantic representation model.

S308, the client uploads the semantic representation vector corresponding to the first text data to the cloud.

S309, the cloud carries out risk identification on the first text data based on the first semantic feature vector included in the semantic representation vector, and issues a risk identification result to the client.

S310, the client receives the risk identification result and, when the risk identification result indicates that the first text data belongs to risk-type text data, displays the risk identification result to the user.

S311, the client determines that the first text data does not belong to risk type text data.
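The branch taken by the client in S307 through S311 can be sketched as follows. This is a minimal illustration only: the `SemanticRepresentation` container, the `identify_risk` function, and the `upload_to_cloud` callback are hypothetical names introduced here, since the specification does not fix a data layout or a cloud interface.

```python
from dataclasses import dataclass, field
from typing import Callable, List

Vector = List[int]


@dataclass
class SemanticRepresentation:
    # Hypothetical container; the specification does not define this layout.
    risk_features: List[Vector] = field(default_factory=list)  # first semantic feature vectors
    safe_features: List[Vector] = field(default_factory=list)  # second semantic feature vectors


def identify_risk(
    rep: SemanticRepresentation,
    upload_to_cloud: Callable[[SemanticRepresentation], bool],
) -> bool:
    """Return True if the first text data is judged to be risk-type text data."""
    if rep.risk_features:
        # S307 answered "yes": S308 uploads the vector, S309 and S310 return the cloud result.
        return upload_to_cloud(rep)
    # S307 answered "no": S311, the text is not risk-type and nothing is uploaded.
    return False
```

Under this sketch only vectors, never the text itself, reach the `upload_to_cloud` callback, and the upload is skipped entirely when no first semantic feature vector is present.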

In this embodiment, the semantic representation vector may be represented in binary form, that is, each element takes the value 0 or 1. The binary vector form protects the semantic content well, so that the privacy data of the user is not leaked.

For example, suppose the text content of the first text data is "According to data published by XX Bank, as of the end of June 2019 the scale of short-term consumer loans to the household sector of depository financial institutions was 9.11 trillion yuan, with a net increase of 3293.19 billion yuan in the first half of 2019; the increase in the first half of 2019 does not seem optimistic". Assume that the preset character length threshold is 128 and that the character length of the text content exceeds 128. The text content can then be divided into a plurality of sentences according to punctuation marks, a sub-semantic representation vector is identified for each sentence, and the semantic representation vector corresponding to the first text data is the average of the plurality of sub-semantic representation vectors.
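The split-and-average procedure described above can be sketched as follows. The `embed` callable stands in for the client-side semantic representation model, and the punctuation set used for splitting is an assumption; the specification only says that the text is split according to a preset splitting rule based on punctuation marks.

```python
import re
from typing import Callable, List

MAX_CHARS = 128  # preset character length threshold used in the example above


def split_into_sentences(text: str) -> List[str]:
    # Assumed splitting rule: break on sentence-ending punctuation.
    # A period only splits when followed by whitespace, so "9.11" stays intact.
    parts = re.split(r"[。！？；!?;]|\.\s+", text)
    return [p.strip() for p in parts if p.strip()]


def represent(
    text: str,
    embed: Callable[[str], List[float]],
    max_chars: int = MAX_CHARS,
) -> List[float]:
    """Return the semantic representation vector for `text`."""
    if len(text) <= max_chars:
        return embed(text)
    # Fall back to the whole text if splitting yields nothing usable.
    sub_texts = split_into_sentences(text) or [text]
    sub_vectors = [embed(s) for s in sub_texts]
    n = len(sub_vectors)
    # Element-wise mean of the sub-semantic representation vectors.
    return [sum(column) / n for column in zip(*sub_vectors)]
```

Any function that maps a string to a fixed-length list of numbers can be passed as `embed` to try the sketch out.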

TABLE 2

Text      Feature 1   Feature 2   Feature 3   Feature 4   Feature 5   ……   Feature N
Text 1    F11         F12         F13         F14         F15         ……   F1N
Text 2    F21         F22         F23         F24         F25         ……   F2N
Text 3    F31         F32         F33         F34         F35         ……   F3N
Text 4    F41         F42         F43         F44         F45         ……   F4N

The semantic representation vectors corresponding to the first text data are schematically listed in table 2, wherein values of the multidimensional feature vectors from feature 1 to feature N are all 0 or 1, so that the user privacy data is well protected.
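One way a vector whose features are all 0 or 1 could be produced is sketched below. The specification states only that the feature values are 0 or 1; thresholding a real-valued vector at zero, and the `binarize` function itself, are assumptions made for illustration.

```python
from typing import List, Sequence


def binarize(vector: Sequence[float], threshold: float = 0.0) -> List[int]:
    """Map each real-valued feature to 1 if it exceeds `threshold`, else 0."""
    # Assumed binarization rule; the specification does not say how the
    # 0/1 values are derived.
    return [1 if value > threshold else 0 for value in vector]
```

For instance, `binarize([0.3, -1.2, 0.0, 2.5])` yields `[1, 0, 0, 1]`, a row of the same shape as those in Table 2.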

It can be seen that, according to the technical scheme of this embodiment, the semantic representation model is trained in the cloud and deployed at the client. The semantic representation model is obtained by training based on the preset corpus and processing with the preset knowledge distillation method, and the preset corpus includes the first type of sample corpus with risk content and the second type of sample corpus without risk content (i.e., sample corpora related to risk prevention and control). The trained semantic representation model can therefore be compressed to a great extent and deployed at the client, which avoids the problem that the model is too large for the client to support. Because the model is targeted at risk prevention and control, it can also improve the efficiency with which the client performs risk prevention and control on text data. In addition, because risk identification of the text data only needs to be based on the corresponding semantic representation vector and does not rely on the semantic content of the text data, the semantic content is well protected; in particular, when the semantic content of the text data contains user privacy data, disclosure of user privacy can be avoided.

In summary, particular embodiments of the present subject matter have been described. Other embodiments are within the scope of the following claims. In some cases, the actions recited in the claims can be performed in a different order and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may be advantageous.

Based on the same idea as the risk identification method based on privacy protection described above, one or more embodiments of the present specification further provide a risk identification device based on privacy protection.

Fig. 4 is a schematic block diagram of a risk identification device based on privacy protection according to an embodiment of the present specification. As shown in fig. 4, the apparatus is applied to a client, and includes:

an obtaining module 410, configured to obtain first text data generated in the running process of the target applet;

the semantic recognition module 420 is configured to perform semantic recognition on the first text data by using a semantic representation model pre-deployed in the client, so as to obtain a semantic representation vector corresponding to the first text data; the semantic representation model is obtained by training based on a preset corpus and processing by using a preset knowledge distillation method; the preset corpus comprises a first type of sample corpus with risk content and a second type of sample corpus without risk content;

and the risk identification module 430 is configured to perform risk identification on the first text data according to the semantic representation vector to obtain a risk identification result corresponding to the first text data.

In one embodiment, the semantic representation vector comprises: a first semantic feature vector corresponding to risk content and/or a second semantic feature vector corresponding to safe content.

In one embodiment, the risk identification module 430 includes:

the first sending unit is used for sending the semantic representation vector to a cloud end so that the cloud end can carry out risk identification on the first text data based on the semantic representation vector to obtain a risk identification result corresponding to the first text data, and the risk identification result is sent to the client end.

In one embodiment, the risk identification module 430 includes:

the second sending unit is used for sending the semantic representation vector to a cloud end if the semantic representation vector contains the first semantic feature vector, so that the cloud end carries out risk identification on the first text data based on the first semantic feature vector contained in the semantic representation vector to obtain a risk identification result corresponding to the first text data, and the risk identification result is sent to the client end;

and the first determining unit is used for determining that the first text data does not belong to risk type text data if the semantic representation vector does not contain the first semantic feature vector.

In one embodiment, the risk identification module 430 includes:

a second determining unit, configured to determine first information of the first semantic feature vector included in the semantic representation vector if the semantic representation vector includes the first semantic feature vector; the first information includes: the number of the first semantic feature vectors and/or the proportion of the first semantic feature vectors in the semantic representation vectors;

a third determining unit, configured to determine a risk weight corresponding to the semantic representation vector according to the first information; the risk weight is positively correlated with the number of the first semantic feature vectors and/or the proportion;

and the risk identification unit is used for carrying out risk identification on the first text data according to the semantic representation vector and the corresponding risk weight to obtain a risk identification result corresponding to the first text data.

In one embodiment, the risk identification unit is configured to perform the following:

and sending the semantic representation vectors and the corresponding risk weights to a cloud end, so that the cloud end carries out risk identification on the first text data according to the semantic representation vectors and the corresponding risk weights to obtain risk identification results corresponding to the first text data, and sends the risk identification results to the client end.

In one embodiment, the risk identification unit is configured to perform the following:

judging whether the risk weight reaches a preset weight threshold value;

if yes, determining that the first text data belongs to risk text data; if not, determining that the first text data does not belong to the risk type text data.
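The risk weight and threshold comparison can be sketched as follows. The specification requires only that the risk weight be positively correlated with the number of first semantic feature vectors and/or their proportion; the particular formula, the `count_scale` parameter, and the default threshold below are hypothetical choices for illustration.

```python
def risk_weight(num_risk: int, num_total: int, count_scale: float = 0.1) -> float:
    """Compute a weight that grows with both the count and the proportion
    of first semantic feature vectors (assumed formula)."""
    if num_total <= 0:
        return 0.0
    proportion = num_risk / num_total
    # Both terms increase with num_risk, so the weight is positively
    # correlated with the number and with the proportion.
    return proportion + count_scale * num_risk


def is_risk_text(weight: float, threshold: float = 0.5) -> bool:
    """Judge whether the risk weight reaches the preset weight threshold."""
    return weight >= threshold
```

With these assumed defaults, a representation with no first semantic feature vectors has weight 0 and is judged not to be risk-type text, while a larger count or proportion raises the weight toward the threshold.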

In one embodiment, the semantic recognition module 420 includes:

the judging unit is used for judging whether the character length of the first text data is larger than a preset character length threshold value or not;

the splitting unit is used for splitting the first text data according to a preset splitting rule if the character length of the first text data is larger than the preset character length threshold value, so that a plurality of sub-text data are obtained;

the semantic identification unit is used for performing semantic identification on each sub-text data by utilizing a semantic representation model which is deployed in the client in advance to obtain sub-semantic representation vectors corresponding to each sub-text data;

and the calculating unit is used for calculating an average vector corresponding to the sub-semantic representation vectors and determining that the average vector is the semantic representation vector corresponding to the first text data.

In one embodiment, the apparatus further comprises:

a determining module, configured to, after performing risk identification on the first text data according to the semantic representation vector to obtain a risk identification result corresponding to the first text data, determine identification information of the target applet if the risk identification result indicates that the first text data belongs to risk-type text data;

and the sending module is used for sending the identification information of the target applet to a cloud end so that the cloud end carries out risk prevention and control operation on the target applet based on the identification information of the target applet.

In one embodiment, the obtaining module 410 includes:

the acquisition unit is used for acquiring operation page display information of a user in the process of using the target applet when monitoring the trigger operation of the user on the target applet in the client;

and the text recognition unit is used for performing text recognition on the display information of the operation page to obtain the first text data generated in the running process of the target applet.

In one embodiment, the semantic representation model comprises a BERT model.

It should be understood by those skilled in the art that the above risk identification device based on privacy protection can be used to implement the risk identification method based on privacy protection described above. Its detailed description is similar to that of the method and, for brevity, is not repeated here.

Based on the same idea, one or more embodiments of the present specification further provide a risk identification device based on privacy protection, as shown in fig. 5. The risk identification device based on privacy protection may vary considerably depending on configuration or performance, and may include one or more processors 501 and a memory 502; the memory 502 may store one or more applications or data. The memory 502 may be transient or persistent storage. The application program stored in the memory 502 may include one or more modules (not shown), each of which may include a series of computer-executable instructions for the risk identification device based on privacy protection. Still further, the processor 501 may be configured to communicate with the memory 502 and to execute, on the risk identification device based on privacy protection, the series of computer-executable instructions in the memory 502. The risk identification device based on privacy protection may also include one or more power supplies 503, one or more wired or wireless network interfaces 504, one or more input-output interfaces 505, and one or more keyboards 506.

In particular, in this embodiment, the risk identification device based on privacy protection includes a memory and one or more programs, where the one or more programs are stored in the memory and may include one or more modules, each of which may include a series of computer-executable instructions for the risk identification device based on privacy protection. The one or more programs are configured to be executed by the one or more processors and include computer-executable instructions for:

acquiring first text data generated in the running process of a target applet;

performing semantic recognition on the first text data by using a semantic representation model which is deployed in the client in advance to obtain a semantic representation vector corresponding to the first text data; the semantic representation model is obtained by training based on a preset corpus and processing by using a preset knowledge distillation method; the preset corpus comprises a first type of sample corpus with risk content and a second type of sample corpus without risk content;

and performing risk identification on the first text data according to the semantic representation vector to obtain a risk identification result corresponding to the first text data.

One or more embodiments of the present specification further provide a storage medium, where the storage medium stores one or more computer programs, where the one or more computer programs include instructions, and when the instructions are executed by an electronic device including multiple application programs, the electronic device can execute each process of the above-mentioned risk identification method based on privacy protection, and can achieve the same technical effect, and details are not described here to avoid repetition.


The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. One typical implementation device is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.

For convenience of description, the above devices are described as being divided into various units by function, and the units are described separately. Of course, when implementing one or more embodiments of the present description, the functions of the units may be implemented in the same one or more pieces of software and/or hardware.

One skilled in the art will recognize that one or more embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, one or more embodiments of the present description may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, one or more embodiments of the present description may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

One or more embodiments of the present specification are described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

The memory may include volatile memory in a computer-readable medium, such as Random Access Memory (RAM), and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.

Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, a computer readable medium does not include a transitory computer readable medium such as a modulated data signal and a carrier wave.

It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

One or more embodiments of the present description may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.

The embodiments in the present specification are described in a progressive manner; for the same or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments. In particular, because the system embodiment is substantially similar to the method embodiment, it is described relatively briefly, and for the relevant points reference may be made to the corresponding description of the method embodiment.

The above description is only one or more embodiments of the present disclosure, and is not intended to limit the present disclosure. Various modifications and alterations to one or more embodiments described herein will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of one or more embodiments of the present specification should be included in the scope of claims of one or more embodiments of the present specification.

Claims

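1. A risk identification method based on privacy protection, applied to a client, comprising:

acquiring first text data generated in the running process of a target applet;

performing semantic recognition on the first text data by using a semantic representation model which is deployed in the client in advance to obtain a semantic representation vector corresponding to the first text data; the semantic representation model is obtained by training based on a preset corpus and processing by using a preset knowledge distillation method; the preset corpus comprises a first type of sample corpus with risk content and a second type of sample corpus without risk content;

and performing risk identification on the first text data according to the semantic representation vector to obtain a risk identification result corresponding to the first text data.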

2. The method of claim 1, the semantic representation vector comprising: a first semantic feature vector corresponding to risk content and/or a second semantic feature vector corresponding to safe content.

3. The method according to claim 2, wherein the performing risk identification on the first text data according to the semantic representation vector to obtain a risk identification result corresponding to the first text data includes:

and sending the semantic representation vector to a cloud end so that the cloud end carries out risk identification on the first text data based on the semantic representation vector to obtain a risk identification result corresponding to the first text data, and sending the risk identification result to the client end.

4. The method according to claim 2, wherein the performing risk identification on the first text data according to the semantic representation vector to obtain a risk identification result corresponding to the first text data includes:

if the semantic representation vector contains the first semantic feature vector, sending the semantic representation vector to a cloud end, so that the cloud end carries out risk identification on the first text data based on the first semantic feature vector contained in the semantic representation vector to obtain a risk identification result corresponding to the first text data, and sending the risk identification result to the client end;

and if the semantic representation vector does not contain the first semantic feature vector, determining that the first text data does not belong to risk class text data.

5. The method according to claim 2, wherein the performing risk identification on the first text data according to the semantic representation vector to obtain a risk identification result corresponding to the first text data includes:

if the semantic representation vector comprises the first semantic feature vector, determining first information of the first semantic feature vector contained in the semantic representation vector; the first information includes: the number of the first semantic feature vectors and/or the proportion of the first semantic feature vectors in the semantic representation vectors;

determining a risk weight corresponding to the semantic representation vector according to the first information; the risk weight is positively correlated with the number of the first semantic feature vectors and/or the proportion;

and performing risk identification on the first text data according to the semantic representation vector and the corresponding risk weight to obtain a risk identification result corresponding to the first text data.

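6. The method according to claim 5, wherein the performing risk identification on the first text data according to the semantic representation vector and the corresponding risk weight to obtain a risk identification result corresponding to the first text data includes:

sending the semantic representation vector and the corresponding risk weight to a cloud end, so that the cloud end carries out risk identification on the first text data according to the semantic representation vector and the corresponding risk weight to obtain a risk identification result corresponding to the first text data, and sends the risk identification result to the client end.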

7. The method according to claim 5 or 6, wherein the performing risk identification on the first text data according to the semantic representation vector and the corresponding risk weight to obtain a risk identification result corresponding to the first text data includes:

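judging whether the risk weight reaches a preset weight threshold value;

if yes, determining that the first text data belongs to risk type text data; if not, determining that the first text data does not belong to risk type text data.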

8. The method according to claim 1, wherein the performing semantic recognition on the first text data by using a semantic representation model pre-deployed in the client to obtain a semantic representation vector corresponding to the first text data comprises:

judging whether the character length of the first text data is larger than a preset character length threshold value or not;

if so, splitting the first text data according to a preset splitting rule to obtain a plurality of sub-text data;

performing semantic identification on each sub-text data by using a semantic representation model which is deployed in the client in advance to obtain sub-semantic representation vectors corresponding to each sub-text data;

calculating an average vector corresponding to the plurality of sub-semantic representation vectors, and determining that the average vector is the semantic representation vector corresponding to the first text data.

9. The method according to claim 1, after performing risk identification on the first text data according to the semantic representation vector and obtaining a risk identification result corresponding to the first text data, further comprising:

if the risk identification result is that the first text data belongs to risk type text data, determining the identification information of the target applet;

and sending the identification information of the target applet to a cloud end so that the cloud end carries out risk prevention and control operation on the target applet based on the identification information of the target applet.

10. The method of claim 1, wherein the obtaining first text data generated during the running of the target applet comprises:

when monitoring the triggering operation of a user for the target applet in the client, acquiring the operation page display information of the user in the process of using the target applet;

and performing text recognition on the display information of the operation page to obtain the first text data generated in the running process of the target applet.

11. The method of claim 1, the semantic representation model comprising a BERT model.

12. A risk identification device based on privacy protection is applied to a client and comprises:

the acquisition module is used for acquiring first text data generated in the running process of the target applet;

the semantic identification module is used for carrying out semantic identification on the first text data by utilizing a semantic representation model which is deployed in the client in advance to obtain a semantic representation vector corresponding to the first text data; the semantic representation model is obtained by training based on a preset corpus and processing by using a preset knowledge distillation method; the preset corpus comprises a first type of sample corpus with risk content and a second type of sample corpus without risk content;

and the risk identification module is used for carrying out risk identification on the first text data according to the semantic representation vector to obtain a risk identification result corresponding to the first text data.

13. The apparatus of claim 12, the semantic representation vector comprising: a first semantic feature vector corresponding to risk content and/or a second semantic feature vector corresponding to safe content.

14. The apparatus of claim 13, the risk identification module comprising:

15. The apparatus of claim 13, the risk identification module comprising:

16. A privacy protection based risk identification device comprising a processor and a memory electrically connected to the processor, the memory storing a computer program, the processor being configured to invoke and execute the computer program from the memory to implement the privacy protection based risk identification method of any one of claims 1-11.

17. A storage medium storing a computer program executable by a processor to implement the privacy protection based risk identification method of any one of claims 1-11.