WO2021098327A1 - Method and device for identifying abnormal collection behavior based on privacy data protection - Google Patents

Method and device for identifying abnormal collection behavior based on privacy data protection

Info

Publication number
WO2021098327A1
Authority
WO
WIPO (PCT)
Prior art keywords
lightweight
data
target
application
applications
Prior art date
Application number
PCT/CN2020/111725
Other languages
English (en)
French (fr)
Inventor
徐文浩
Original Assignee
支付宝(杭州)信息技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 支付宝(杭州)信息技术有限公司
Publication of WO2021098327A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes

Definitions

  • This document relates to the field of computer software technology, in particular to a method, device and electronic equipment for identifying abnormal collection behaviors based on privacy data protection.
  • The purpose of the embodiments of this specification is to provide a method, device, and electronic equipment for identifying abnormal collection behaviors based on privacy data protection, and for training a scene classification model, so as to prevent lightweight applications such as applets from excessively collecting users' private data.
  • A method for identifying abnormal collection behaviors based on privacy data protection includes: obtaining the page content data and user behavior data of a target lightweight application, and the list of private data that the target lightweight application requests to collect; using the page content data and user behavior data of the target lightweight application as the input of a scene classification model, so as to predict the usage scenario category of the target lightweight application through the scene classification model; and determining whether the target lightweight application exhibits abnormal collection behavior, based on the list of collectible private data corresponding to the usage scenario category of the target lightweight application and the list of private data that the target lightweight application requests to collect.
  • A method for training a scene classification model includes: obtaining page content data, user behavior data, and usage scenario labels of multiple lightweight applications; extracting usage scenario features of the multiple lightweight applications from their page content data and user behavior data; and training a scene classification model based on the usage scenario features of the multiple lightweight applications and the corresponding usage scenario labels, where the scene classification model is used to predict the usage scenario category of a lightweight application.
  • A device for identifying abnormal collection behaviors based on privacy data protection includes: an acquiring unit, which obtains the page content data and user behavior data of a target lightweight application, and the list of private data that the target lightweight application requests to collect; a prediction unit, which uses the page content data and user behavior data of the target lightweight application as the input of a scene classification model, so as to predict the usage scenario category of the target lightweight application through the scene classification model; and a determination unit, which determines whether the target lightweight application exhibits abnormal collection behavior, based on the list of collectible private data corresponding to the usage scenario category of the target lightweight application and the list of private data that the target lightweight application requests to collect.
  • A training device for a scene classification model includes: a data acquisition unit, which obtains page content data, user behavior data, and usage scenario labels of multiple lightweight applications; a feature extraction unit, which extracts usage scenario features of the multiple lightweight applications from their page content data and user behavior data; and a model training unit, which trains a scene classification model based on the usage scenario features of the multiple lightweight applications and the corresponding usage scenario labels, where the scene classification model is used to predict the usage scenario category of a lightweight application.
  • An electronic device includes: a processor; and a memory arranged to store computer-executable instructions that, when executed, cause the processor to perform the following operations: obtaining the page content data and user behavior data of a target lightweight application, and the list of private data that the target lightweight application requests to collect; using the page content data and user behavior data of the target lightweight application as the input of a scene classification model, so as to predict the usage scenario category of the target lightweight application through the scene classification model; and determining whether the target lightweight application exhibits abnormal collection behavior, based on the list of collectible private data corresponding to the usage scenario category of the target lightweight application and the list of private data that the target lightweight application requests to collect.
  • A computer-readable storage medium stores one or more programs that, when executed by an electronic device including multiple application programs, cause the electronic device to perform the following operations: obtaining the page content data and user behavior data of a target lightweight application, and the list of private data that the target lightweight application requests to collect; using the page content data and user behavior data of the target lightweight application as the input of a scene classification model, so as to predict the usage scenario category of the target lightweight application through the scene classification model; and determining whether the target lightweight application exhibits abnormal collection behavior, based on the list of collectible private data corresponding to the usage scenario category of the target lightweight application and the list of private data that the target lightweight application requests to collect.
  • An electronic device includes: a processor; and a memory arranged to store computer-executable instructions that, when executed, cause the processor to perform the following operations: obtaining page content data, user behavior data, and usage scenario labels of multiple lightweight applications; extracting usage scenario features of the multiple lightweight applications from their page content data and user behavior data; and training a scene classification model based on the usage scenario features of the multiple lightweight applications and the corresponding usage scenario labels, where the scene classification model is used to predict the usage scenario category of a lightweight application.
  • A computer-readable storage medium stores one or more programs that, when executed by an electronic device including multiple application programs, cause the electronic device to perform the following operations: obtaining page content data, user behavior data, and usage scenario labels of multiple lightweight applications; extracting usage scenario features of the multiple lightweight applications from their page content data and user behavior data; and training a scene classification model based on the usage scenario features of the multiple lightweight applications and the corresponding usage scenario labels, where the scene classification model is used to predict the usage scenario category of a lightweight application.
  • One or more embodiments provided in this specification can obtain the page content data and user behavior data of a target lightweight application and the list of private data that the application requests to collect, use the page content data and user behavior data as the input of the scene classification model to predict the usage scenario category of the target lightweight application, and determine whether the target lightweight application exhibits abnormal collection behavior based on the list of collectible private data corresponding to that usage scenario category and the list of private data that the application requests to collect.
  • In this way, the identification of abnormal collection behaviors of lightweight applications such as applets changes from passive verification to active identification, and the scene classification model is used to identify the usage scenario category. On the one hand, this improves identification efficiency; on the other hand, it protects user privacy and gives users a more secure service experience.
  • One or more embodiments provided in this specification can obtain page content data, user behavior data, and usage scenario labels of multiple lightweight applications, extract usage scenario features from the page content data and user behavior data of those applications, and train a scene classification model based on the usage scenario features and the corresponding usage scenario labels.
  • The trained scene classification model is used to identify the usage scenarios of lightweight applications such as applets. On the one hand, this improves the efficiency of identifying the usage scenarios of applets; on the other hand, it saves unnecessary human resources.
  • Fig. 1 is a schematic diagram of an implementation process of a method for identifying abnormal collection behaviors based on privacy data protection provided by an embodiment of this specification.
  • Fig. 2 is a schematic diagram of an implementation process of a method for training a scene classification model provided by an embodiment of this specification.
  • Fig. 3 is a schematic flow chart of applying the method for training a scene classification model provided by an embodiment of this specification to an actual scene.
  • Fig. 4 is a schematic structural diagram of a device for identifying abnormal collection behaviors based on privacy data protection provided by an embodiment of this specification.
  • Fig. 5 is a schematic structural diagram of a training device for a scene classification model provided by an embodiment of this specification.
  • Fig. 6 is a schematic structural diagram of an electronic device provided by an embodiment of this specification.
  • FIG. 7 is a schematic structural diagram of another electronic device provided by an embodiment of this specification.
  • One or more embodiments of this specification provide a method for identifying abnormal collection behaviors based on privacy data protection. The method can obtain the page content data and user behavior data of a target lightweight application and the list of private data that the application requests to collect, use the page content data and user behavior data as the input of the scene classification model to predict the usage scenario category of the target lightweight application, and determine whether the target lightweight application exhibits abnormal collection behavior based on the collectible private data list corresponding to that usage scenario category and the private data list that the application requests to collect.
  • In this way, the identification of abnormal collection behaviors of lightweight applications such as applets changes from passive verification to active identification, and the scene classification model is used to identify the usage scenario category. On the one hand, this improves identification efficiency; on the other hand, it protects user privacy and gives users a more secure service experience.
  • The execution subject of the method for identifying abnormal collection behaviors based on privacy data protection may be, but is not limited to, at least one device, such as a server or a computer, that can be configured to execute the method provided in the embodiments of this specification. Alternatively, the execution subject of the method may be a client that is itself capable of executing the method.
  • For ease of description, the implementation of the method is introduced below by taking a server capable of executing the method as the execution subject. It can be understood that using a server as the execution subject is only an exemplary description and should not be understood as a limitation of the method.
  • Fig. 1 is a schematic diagram of an implementation process of a method for identifying abnormal collection behaviors based on privacy data protection provided by an embodiment of this specification.
  • the method of FIG. 1 may include steps S110 to S130.
  • The target lightweight application may specifically be a quick app, an applet, an H5 application, or the like, that is, a lightweight application that the user can use without installing it.
  • the page content data of the target lightweight application includes text information, entity types, and the number of corresponding entities in the page of the target lightweight application.
  • entity types can be various objects on the page, such as cats, dogs, houses, and cars.
  • the user behavior data in the target lightweight application includes the user's behavior data such as clicking, sliding, payment, forwarding, and input on the page of the target lightweight application, as well as characteristic data such as the user's city, the user's education, age, and occupation.
  • the list of private data collected by the target lightweight application may specifically be the user’s private data list actually collected when the target lightweight application is used by the user. For example, it may include the user’s ID number, the user’s mobile phone number, the user’s gender, Private data such as the user's avatar and nickname.
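  • The three inputs described above can be pictured as simple records. The following is a minimal sketch, assuming hypothetical class and field names (`PageContent`, `UserBehavior`, `LightweightAppSample`) that do not appear in the specification; it only illustrates one way to hold the data.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PageContent:
    # Text information shown on the pages of the lightweight application.
    texts: List[str]
    # Entity type name -> number of such entities on the pages (e.g. {"cat": 2}).
    entity_counts: Dict[str, int]


@dataclass
class UserBehavior:
    # Behavior name -> count (e.g. clicking, sliding, payment, forwarding, input).
    event_counts: Dict[str, int]
    # Characteristic data such as city, education, age, occupation.
    user_profile: Dict[str, str]


@dataclass
class LightweightAppSample:
    # Hypothetical container for everything the method needs about one application.
    app_id: str
    page_content: PageContent
    user_behavior: UserBehavior
    # Names of the private data items the application requests to collect.
    requested_private_data: List[str] = field(default_factory=list)


sample = LightweightAppSample(
    app_id="demo-shop",
    page_content=PageContent(texts=["pet store"], entity_counts={"cat": 2, "dog": 1}),
    user_behavior=UserBehavior(event_counts={"click": 12, "payment": 1},
                               user_profile={"occupation": "teacher"}),
    requested_private_data=["mobile_phone_number", "nickname"],
)
```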
  • S120 Use the page content data and user behavior data of the target lightweight application as the input of the scene classification model, so as to predict the usage scenario category of the target lightweight application through the scene classification model.
  • It should be understood that lightweight applications such as applets often collect the user's private data when the user opens and uses them. For example, when a shopping applet is opened inside a chat application, the user is prompted to grant permission to collect private data from the chat application, such as the user's avatar, nickname, and contact information. Users who open applets usually pay no attention to whether those applets over-collect their private data. As a result, many applets deliberately over-collect users' private data and then maliciously exploit or sell it for additional profit.
  • To address this problem, one or more embodiments of this specification can train a scene classification model in advance based on the page content data, user behavior data, and usage scenario labels of multiple lightweight applications, predict the usage scenario category of the target lightweight application through the scene classification model, and determine whether the target lightweight application exhibits abnormal collection behavior based on the collectible private data list corresponding to that usage scenario category and the private data list that the target lightweight application requests to collect.
  • S130 Determine whether the target lightweight application has an abnormal collection behavior based on a list of collectible privacy data corresponding to the usage scenario category of the target lightweight application and the list of private data collected by the target lightweight application.
  • The usage scenario categories of lightweight applications can include shopping, train ticket purchase, shared bicycle, learning tool, and other usage scenarios. Lightweight applications in different usage scenarios need to collect different private data from users. For example, shopping lightweight applications usually need to collect private data such as the user's shopping account and contact information; lightweight applications for buying train tickets need to collect private data such as the user's ID number, ticket purchase account, and contact information; shared bicycle lightweight applications need to collect private data such as the user's login account and contact information; and learning tool lightweight applications may only need to collect private data such as the user's login account.
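  • The per-category examples above amount to a lookup table from usage scenario category to collectible private data. The sketch below is illustrative only: the category keys and item names are hypothetical identifiers, the shared bicycle entry follows the order in which the categories are listed, and returning an empty set for an unknown category is a design assumption rather than something the specification states.

```python
# Hypothetical lookup table: usage scenario category -> collectible private data.
COLLECTIBLE_BY_SCENARIO = {
    "shopping": {"shopping_account", "contact_information"},
    "train_ticket": {"id_number", "ticket_purchase_account", "contact_information"},
    "shared_bicycle": {"login_account", "contact_information"},
    "learning_tool": {"login_account"},
}


def collectible_private_data(category):
    """Return the collectible private data for a category as a frozenset.

    Unknown categories yield an empty set, so nothing is treated as
    collectible until the category has been configured (an assumption).
    """
    return frozenset(COLLECTIBLE_BY_SCENARIO.get(category, ()))
```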
  • Determining whether the target lightweight application exhibits abnormal collection behavior, based on the collectible private data list corresponding to its usage scenario category (the target private data collection list) and the private data list that it requests to collect, includes: if the private data list that the target lightweight application requests to collect is consistent with the target private data collection list, determining that the target lightweight application has no abnormal collection behavior; and if the two lists are inconsistent, determining that the target lightweight application has abnormal collection behavior.
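  • A minimal sketch of this comparison follows. It assumes that "inconsistent" means the application requests at least one item outside the target private data collection list; a stricter reading that requires the two lists to be identical is equally possible, and the function names are hypothetical.

```python
def find_abnormal_items(requested, collectible):
    """Return the requested private data items that are not collectible, sorted."""
    return sorted(set(requested) - set(collectible))


def has_abnormal_collection(requested, collectible):
    """True if the application requests anything outside the collectible list.

    Assumption: requesting fewer items than allowed is not treated as abnormal.
    """
    return bool(find_abnormal_items(requested, collectible))
```

  • For example, `has_abnormal_collection(["mobile_phone_number", "id_number"], ["mobile_phone_number"])` flags the request because `id_number` is outside the collectible list.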
  • After it is determined that the target lightweight application has abnormal collection behavior, the method further includes: intercepting the private data sending request of the target lightweight application.
  • Take a shopping lightweight application as the target lightweight application. When this type of lightweight application is opened and used by the user, it usually only needs to collect private data such as the user's shopping account, contact information, and shipping address. If the shopping application additionally collects the user's ID number, it can be determined, based on the private data list that the target lightweight application requests to collect and the target private data collection list, that the target lightweight application has abnormal collection behavior. In that case, either the sending request for the additionally collected private data or the sending requests for all private data of the target lightweight application can be intercepted.
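  • The two interception options just described (blocking only the additionally collected items, or blocking everything) can be sketched as a filter over an outgoing payload. The function name, the payload shape (a dict of item name to value), and the `block_all` flag are assumptions made for illustration.

```python
def filter_send_request(payload, collectible, block_all=False):
    """Split an outgoing private data payload into forwarded and blocked parts.

    payload     -- dict mapping private data item name to its value
    collectible -- collection of item names allowed for the usage scenario
    block_all   -- if True, block the whole payload once any extra item is found

    Returns (forwarded_payload, blocked_item_names).
    """
    extra = [name for name in payload if name not in collectible]
    if not extra:
        # Nothing outside the collectible list: forward the payload unchanged.
        return dict(payload), []
    if block_all:
        # Intercept the sending request for all private data.
        return {}, list(payload)
    # Intercept only the additionally collected items.
    forwarded = {k: v for k, v in payload.items() if k in collectible}
    return forwarded, extra
```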
  • One or more embodiments provided in this specification can obtain the page content data and user behavior data of a target lightweight application and the list of private data that the application requests to collect, use the page content data and user behavior data as the input of the scene classification model to predict the usage scenario category of the target lightweight application, and determine whether the target lightweight application exhibits abnormal collection behavior based on the collectible private data list corresponding to that usage scenario category and the private data list that the application requests to collect.
  • In this way, the identification of abnormal collection behaviors of lightweight applications such as applets changes from passive verification to active identification, and the scene classification model is used to identify the usage scenario category. On the one hand, this improves identification efficiency; on the other hand, it protects user privacy and gives users a more secure service experience.
  • Fig. 2 is a schematic diagram of an implementation process of a method for training a scene classification model provided by an embodiment of this specification, including steps S210 to S230.
  • S210 Obtain page content data, user behavior data, and usage scenario labels of multiple lightweight applications.
  • the page content data of multiple lightweight applications includes text information, entity types, and corresponding entity numbers in the multiple lightweight application pages.
  • Entity types can be various objects on the page, such as cats, dogs, houses, and cars.
  • The user behavior data of the multiple lightweight applications includes behavior data of multiple users on the pages of these lightweight applications, such as clicking, sliding, paying, forwarding, and inputting, as well as characteristic data such as the users' cities, educational backgrounds, ages, and occupations.
  • S220 Extract usage scenario features of multiple lightweight applications from page content data and user behavior data of multiple lightweight applications.
  • the page content data of a lightweight application usually includes text data and image data.
  • To facilitate extracting the corresponding feature data from both text data and image data, one or more embodiments of this specification may convert image-type data into text-type data and then splice all the text-type data to obtain a text field.
  • Specifically, extracting the usage scenario features of multiple lightweight applications from their page content data and user behavior data includes: obtaining, from the page content data, the text information in the pages of the multiple lightweight applications and the entity types and entity numbers in those pages; splicing, for each lightweight application, the text information with the entity types and numbers to obtain the text fields corresponding to the multiple lightweight applications, where each text field is obtained by splicing the text information, the entity type names, and the corresponding entity numbers of the corresponding lightweight application; and extracting the usage scenario features of the multiple lightweight applications from these text fields and the user behavior data.
  • Extracting the usage scenario features of multiple lightweight applications from the text fields and user behavior data corresponding to those applications includes: performing a data preprocessing operation on each of the text fields corresponding to the multiple lightweight applications; converting the preprocessed text fields into corresponding word vectors; and extracting the usage scenario features of the multiple lightweight applications from the word vectors and the user behavior data corresponding to the multiple lightweight applications, where the data preprocessing operation includes removing stop words.
  • The spliced text fields usually contain words that have no practical meaning, such as the connectives "even" and "in order to". These words carry little value for scene classification and increase the amount of computation required. Therefore, in one or more embodiments of this specification, before the text fields corresponding to the multiple lightweight applications are converted into word vectors, data preprocessing operations such as stop-word removal can be performed on them.
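  • Stop-word removal can be sketched in a few lines. The stop-word set below is a small hypothetical sample, and whitespace tokenization is an assumption that suits English text; Chinese page text would first need a word segmenter, which the specification does not name.

```python
# Hypothetical stop-word list; a real deployment would load a fuller one.
STOP_WORDS = {"even", "in", "order", "to", "the", "and"}


def remove_stop_words(text_field, stop_words=STOP_WORDS):
    """Lower-case the text field, split on whitespace, and drop stop words."""
    return [token for token in text_field.lower().split() if token not in stop_words]


tokens = remove_stop_words("Buy the shoes and even socks")
```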
  • When the preprocessed text fields corresponding to the multiple lightweight applications are converted into word vectors, a word vector dictionary trained on a corpus, or an open-source word vector dictionary, can be used to replace the preprocessed text fields with the corresponding word vectors.
  • the word vector dictionary includes the mapping relationship between multiple words and word vectors, and a word vector corresponds to a set of feature vectors.
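  • A word vector dictionary of this kind can be modeled as a mapping from word to vector. In the sketch below the dictionary contents are made-up two-dimensional values, unknown words map to a zero vector, and averaging the word vectors into a single fixed-length vector is an assumption; the specification does not say how the vectors of one text field are pooled.

```python
# Toy word vector dictionary with made-up 2-dimensional vectors.
WORD_VECTORS = {
    "buy": [0.9, 0.1],
    "ticket": [0.2, 0.8],
    "bike": [0.1, 0.3],
}


def tokens_to_vectors(tokens, word_vectors, dim):
    """Replace each token by its vector; unknown tokens become a zero vector."""
    return [list(word_vectors.get(token, [0.0] * dim)) for token in tokens]


def average_vector(vectors, dim):
    """Pool a list of vectors into one by element-wise averaging (an assumption)."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


vectors = tokens_to_vectors(["buy", "unknown"], WORD_VECTORS, dim=2)
pooled = average_vector(vectors, dim=2)
```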
  • The behavior feature data corresponding to the user behavior data can be obtained through statistical analysis. To extract the usage scenario features of the multiple lightweight applications from the text fields and user behavior data, the word vectors corresponding to the text fields can be merged with the behavior feature data corresponding to the user behavior data to obtain the usage scenario features of the multiple lightweight applications.
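  • One simple way to merge the two kinds of data is to concatenate them into a single feature vector. The sketch below assumes the text side has already been reduced to one numeric vector and that the behavior features are numeric values looked up in a fixed key order; both the function name and the key names are hypothetical.

```python
def build_scenario_features(text_vector, behavior_features, behavior_keys):
    """Concatenate a text vector with behavior features taken in a fixed key order.

    Missing behavior features default to 0.0 so every application yields a
    vector of the same length.
    """
    behavior_part = [float(behavior_features.get(key, 0.0)) for key in behavior_keys]
    return list(text_vector) + behavior_part


features = build_scenario_features(
    [0.45, 0.05],
    {"avg_ops_per_user": 1.5, "payment_count": 3},
    behavior_keys=["avg_ops_per_user", "payment_count", "forward_count"],
)
```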
  • When obtaining the text fields, one or more embodiments of this specification may, based on the names and corresponding numbers of the entity types in the pages of the multiple lightweight applications, repeat the name of each entity type the corresponding number of times and then splice the result with the text information in the pages of the lightweight application to obtain the text field of each lightweight application.
  • Specifically, splicing the text information in the pages of the multiple lightweight applications with the entity types and numbers in those pages to obtain the text fields corresponding to the multiple lightweight applications includes: obtaining, based on the names and corresponding numbers of the entity types in the pages of the multiple lightweight applications, the text fields corresponding to the entity types in those pages, where the text field corresponding to an entity type in a page of a lightweight application contains the name of that entity type repeated the corresponding number of times; and splicing the text information in the pages of the multiple lightweight applications with the text fields corresponding to the entity types in those pages to obtain the text fields corresponding to the multiple lightweight applications.
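  • The splicing rule above (repeat each entity type name as many times as the entity appears, then join with the page text) can be sketched as follows. Joining with single spaces and placing the text information before the entity names are assumptions; the function name is hypothetical.

```python
def build_text_field(texts, entity_counts):
    """Splice page text with entity type names repeated by their counts.

    texts         -- list of text snippets from the application's pages
    entity_counts -- dict mapping entity type name to number of entities
    """
    entity_part = [name for name, count in entity_counts.items() for _ in range(count)]
    return " ".join(list(texts) + entity_part)


text_field = build_text_field(["pet store"], {"cat": 2, "dog": 1})
```

  • Repeating the name lets a page with many cats weigh the word "cat" more heavily than a page with a single one once the text field is converted into word vectors.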
  • S230 Train to obtain a scene classification model based on the usage scene features of multiple lightweight applications and corresponding usage scene labels, and the scene classification model is used to predict the usage scene category of the lightweight application.
  • training to obtain a scene classification model based on the use scene features of multiple lightweight applications and corresponding use scene labels includes: using the multi-classification model based on the use scene features of multiple lightweight applications and the corresponding use scene labels, The scene classification model is obtained by training.
  • the multi-classification model may specifically include an xgboost model, which is specifically an open source implementation of a gradient boosting tree model, which can be used for classification and regression tasks.
  • The abnormal collection behavior identification method is described in detail below, including the following steps S301 to S311.
  • S301 Obtain page content data of multiple applets, where the page content data includes text information and image data displayed on the applet page, where the image data includes the entity type and corresponding quantity displayed on the applet page;
  • S302 Obtain user behavior data of multiple applets.
  • the user behavior data includes user behavior data such as clicking, sliding, jumping, inputting, and paying on the page of the applet.
  • S303 Splice the text information in the pages of the multiple applets with the entity types and numbers in those pages to obtain the text fields corresponding to the multiple applets, perform a stop-word removal operation on these text fields to remove redundant information, and then convert the text fields into corresponding word vectors based on a preset word vector dictionary. Each text field is obtained by splicing the text information, the entity type names, and the corresponding entity numbers of the corresponding applet.
  • The word vector dictionary includes the correspondence between words and word vectors, and one word vector corresponds to one set of feature vectors.
  • S304 based on the user behavior data of multiple applets, construct corresponding multiple behavior characteristic data; specifically, based on the user behavior data of multiple applets, statistical analysis can be used to obtain characteristic data such as the average operation frequency and operation time period of the user, and Characteristic data such as the user’s city, user’s age, education and occupation.
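  • The statistical analysis in S304 could look like the sketch below. It interprets "average operation frequency" as operations per distinct user and "operation time period" as the hour of day with the most operations; both interpretations, the event tuple layout, and the feature names are assumptions made for illustration.

```python
from collections import Counter


def behavior_features(events):
    """Compute simple behavior statistics from (user_id, hour_of_day, action) events."""
    if not events:
        return {"avg_ops_per_user": 0.0, "peak_hour": None}
    users = {user_id for user_id, _, _ in events}
    per_hour = Counter(hour for _, hour, _ in events)
    # Most active hour; ties are broken in favor of the earlier hour.
    peak_hour = min(per_hour, key=lambda hour: (-per_hour[hour], hour))
    return {"avg_ops_per_user": len(events) / len(users), "peak_hour": peak_hour}


stats = behavior_features([
    ("u1", 9, "click"),
    ("u1", 9, "payment"),
    ("u2", 20, "click"),
])
```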
  • S305 Manually label the usage scenarios of the multiple applets to obtain their usage scenario labels, where a usage scenario label characterizes information related to the usage scenario category of an applet.
  • S306 Train an xgboost multi-classification model based on the word vectors and behavior feature data corresponding to the multiple applets to obtain the scene classification model.
  • S307 Use the page content data and user behavior data of the target applet as the input of the scene classification model to predict the usage scenario category of the target applet through the scene classification model.
  • S308 Determine the private data collection list corresponding to the usage scenario category of the target applet.
  • S309 Determine the private data list that the target applet requests to collect.
  • S310 Compare the private data collection list corresponding to the usage scenario category of the target applet with the private data list that it requests to collect, to determine whether the target applet has abnormal collection behavior.
  • S311 If the private data collection list corresponding to the usage scenario category of the target applet is inconsistent with the private data list that it requests to collect, determine that the target applet has abnormal collection behavior, and intercept the private data sending request of the target applet.
  • For example, if the private data list corresponding to the usage scenario category of the target applet includes sensitive information such as the user's mobile phone number, and the private data list that the target applet requests to collect additionally includes sensitive information such as the ID number, it can be determined that the target applet has abnormal collection behavior. In this case, when the target applet sends the user's private data, its private data sending request can be intercepted, thereby avoiding excessive collection of the user's private data.
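  • Steps S307 to S311 can be tied together in one function. In this sketch the trained scene classification model is represented by any callable `classify(features)` that returns a category name, so no particular library is assumed; the function name, the result keys, and the reading of "inconsistent" as "requests something outside the list" are all illustrative assumptions.

```python
def check_applet(features, requested, classify, collectible_by_category):
    """Run the prediction and comparison steps for one target applet."""
    category = classify(features)                                  # S307
    collectible = set(collectible_by_category.get(category, ()))   # S308
    requested_items = set(requested)                               # S309
    extra = sorted(requested_items - collectible)                  # S310
    abnormal = bool(extra)
    # S311: an abnormal applet has its private data sending request intercepted.
    return {"category": category, "abnormal": abnormal,
            "intercept": abnormal, "extra_items": extra}


# Stand-in classifier for illustration only; a trained model would go here.
result = check_applet(
    features=[0.45, 0.05, 1.5],
    requested=["mobile_phone_number", "id_number"],
    classify=lambda feats: "shopping",
    collectible_by_category={"shopping": {"mobile_phone_number"}},
)
```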
  • One or more embodiments provided in this specification can obtain page content data, user behavior data, and usage scenario labels of multiple lightweight applications, extract usage scenario features from the page content data and user behavior data of these lightweight applications, and train a scene classification model based on the usage scenario features and the corresponding usage scenario labels.
  • The trained scene classification model is then used to identify the usage scenarios of lightweight applications such as applets.
  • On the one hand, this improves the efficiency of identifying applet usage scenarios; on the other hand, it saves unnecessary human resources.
  • FIG. 4 is a schematic structural diagram of an abnormal collection behavior identification device 400 based on privacy data protection provided by an embodiment of this specification.
  • Referring to FIG. 4, in a software implementation, the device 400 for identifying abnormal collection behavior based on privacy data protection may include: an obtaining unit 401, which obtains page content data and user behavior data of a target lightweight application and the list of privacy data that the target lightweight application applies to collect; a prediction unit 402, which uses the page content data and user behavior data of the target lightweight application as the input of a scene classification model to predict the usage scenario category of the target lightweight application through the scene classification model; and a determining unit 403, which determines whether the target lightweight application exhibits abnormal collection behavior based on the list of collectible privacy data corresponding to the usage scenario category of the target lightweight application and the list of privacy data that the target lightweight application applies to collect.
  • Optionally, in one implementation, the determining unit 403 is configured to: if the list of privacy data that the target lightweight application applies to collect is consistent with the target privacy data collection list, determine that the target lightweight application exhibits no abnormal collection behavior; if the two lists are inconsistent, determine that the target lightweight application exhibits abnormal collection behavior.
  • Optionally, in one implementation, after the determining unit 403 determines that the target lightweight application exhibits abnormal collection behavior, the device further includes an intercepting unit, which intercepts the target lightweight application's request to send privacy data.
  • The device 400 for identifying abnormal collection behavior based on privacy data protection can implement the method of the method embodiment in FIG. 1. For details, refer to the method for identifying abnormal collection behavior based on privacy data protection in the embodiment shown in FIG. 1, which is not repeated here.
  • FIG. 5 is a schematic structural diagram of a training device 500 for a scene classification model provided by an embodiment of this specification.
  • Referring to FIG. 5, in a software implementation, the training device 500 for a scene classification model may include: a data acquisition unit 501, which obtains page content data, user behavior data, and usage scenario labels of multiple lightweight applications; a feature extraction unit 502, which extracts the usage scenario features of the multiple lightweight applications from their page content data and user behavior data; and a model training unit 503, which trains a scene classification model based on the usage scenario features of the multiple lightweight applications and the corresponding usage scenario labels, the scene classification model being used to predict the usage scenario category of a lightweight application.
  • Optionally, in one implementation, the feature extraction unit 502 is configured to: obtain, from the page content data of the multiple lightweight applications, the multiple pieces of text information in the pages of each lightweight application and the entity types and quantities in those pages; splice, for each lightweight application, the multiple pieces of text information with the entity types and quantities to obtain multiple text fields corresponding to the multiple lightweight applications, where one text field is obtained by splicing the multiple pieces of text information, the entity type names, and the corresponding entity quantities of the corresponding lightweight application; and extract the usage scenario features of the multiple lightweight applications from the multiple text fields and the user behavior data corresponding to the multiple lightweight applications.
  • Optionally, in one implementation, the feature extraction unit 502 is configured to: perform data preprocessing on the multiple text fields corresponding to the multiple lightweight applications; convert the preprocessed text fields into corresponding word vectors; and extract the usage scenario features of the multiple lightweight applications from the word vectors and the user behavior data corresponding to the multiple lightweight applications. The data preprocessing includes removing stop words.
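As a rough illustration of the preprocessing and word-vector conversion described above, the following sketch removes stop words and looks each remaining token up in a word-vector dictionary. The stop-word set, the toy dictionary, and whitespace tokenization are all assumptions made for the example; Chinese page text would need a word segmenter, and a real dictionary would be trained on a corpus or taken from an open-source release.

```python
# Hypothetical stop-word list and toy word-vector dictionary.
STOP_WORDS = {"the", "of", "so", "that", "even", "if"}
WORD_VECTORS = {
    "buy": [0.9, 0.1],
    "cart": [0.8, 0.2],
    "ticket": [0.1, 0.9],
}


def preprocess(text_field):
    """Lowercase, split on whitespace, and drop stop words."""
    tokens = text_field.lower().split()
    return [t for t in tokens if t not in STOP_WORDS]


def to_word_vectors(tokens, dictionary):
    """Map each known token to its vector; unknown tokens are skipped."""
    return [dictionary[t] for t in tokens if t in dictionary]


tokens = preprocess("Buy the cart of ticket")
vectors = to_word_vectors(tokens, WORD_VECTORS)
```

Skipping out-of-vocabulary tokens is one simple policy; mapping them to a shared "unknown" vector would be another.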
  • Optionally, in one implementation, the feature extraction unit 502 is configured to: based on the names and corresponding quantities of the entity types in the pages of the multiple lightweight applications, obtain a text field for each entity type in the pages of each lightweight application, where the text field for one entity type in the pages of one lightweight application contains the name of that entity type repeated the corresponding number of times; and splice, for each lightweight application, the multiple pieces of text information in its pages with the text fields for the entity types in its pages, to obtain the multiple text fields corresponding to the multiple lightweight applications.
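The splicing step can be sketched as follows: each entity type name is repeated as many times as the entity appears, and the result is appended to the page text. Function names, the space separator, and the sample data are assumptions for illustration only.

```python
def entity_text_field(entity_counts):
    """Repeat each entity type name by its count, e.g. {"shoe": 2} -> "shoe shoe"."""
    words = []
    for name, count in entity_counts.items():
        words.extend([name] * count)
    return " ".join(words)


def build_text_field(page_texts, entity_counts):
    """Splice the page text snippets with the repeated entity type names."""
    parts = list(page_texts)
    entity_part = entity_text_field(entity_counts)
    if entity_part:
        parts.append(entity_part)
    return " ".join(parts)


# Hypothetical applet page: two text snippets, two shoes and one bag detected.
field = build_text_field(["Add to cart", "Pay now"], {"shoe": 2, "bag": 1})
```

Repeating the name, rather than appending a number, lets a bag-of-words or averaged-embedding feature reflect how prominent each entity type is on the page.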
  • the model training unit 503 is configured to train to obtain a scene classification model based on the usage scene features of the multiple lightweight applications and the corresponding usage scene labels through a multi-classification model.
  • the device 500 for training a scene classification model can implement the methods of the method embodiments in FIGS. 2 to 3.
  • FIG. 6 is a schematic diagram of the structure of an electronic device according to an embodiment of this specification.
  • the electronic device includes a processor, and optionally an internal bus, a network interface, and a memory.
  • The memory may include internal memory, such as high-speed random-access memory (RAM), and may also include non-volatile memory, such as at least one disk storage.
  • the electronic device may also include hardware required by other services.
  • the processor, network interface, and memory can be connected to each other through an internal bus.
  • The internal bus can be an ISA (Industry Standard Architecture) bus, a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, etc.
  • the bus can be divided into an address bus, a data bus, a control bus, and so on. For ease of presentation, only one bidirectional arrow is used to indicate in FIG. 6, but it does not mean that there is only one bus or one type of bus.
  • The memory is used to store a program. Specifically, the program may include program code, and the program code includes computer operation instructions.
  • The memory may include internal memory and non-volatile memory, and provides instructions and data to the processor.
  • The processor reads the corresponding computer program from the non-volatile memory into internal memory and then runs it, forming, at the logical level, the device for identifying abnormal collection behavior based on privacy data protection.
  • The processor executes the program stored in the memory and is specifically configured to perform the following operations: obtain page content data and user behavior data of a target lightweight application and the list of privacy data that the target lightweight application applies to collect; use the page content data and user behavior data of the target lightweight application as the input of a scene classification model, so as to predict the usage scenario category of the target lightweight application through the scene classification model; and determine, based on the list of collectible privacy data corresponding to the usage scenario category of the target lightweight application and the list of privacy data that the target lightweight application applies to collect, whether the target lightweight application exhibits abnormal collection behavior.
  • the method performed by the device for identifying abnormal collection behaviors based on privacy data protection as disclosed in the embodiments shown in FIGS. 1 to 3 of this specification can be applied to or implemented by the processor.
  • the processor may be an integrated circuit chip with signal processing capabilities.
  • each step of the above method can be completed by an integrated logic circuit of hardware in the processor or instructions in the form of software.
  • The above-mentioned processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), etc.; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. It can implement or execute the methods, steps, and logical block diagrams disclosed in the embodiments of this specification.
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like.
  • the steps of the method disclosed in the embodiments of this specification can be directly embodied as being executed and completed by a hardware decoding processor, or executed and completed by a combination of hardware and software modules in the decoding processor.
  • The software module can be located in a storage medium that is well established in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or registers.
  • the storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the above method in combination with its hardware.
  • the electronic device can also execute the method in FIG. 1 and realize the functions of the device for identifying abnormal collection behaviors based on privacy data protection in the embodiment shown in FIG. 1, which will not be repeated in the embodiment of this specification.
  • The embodiments of this specification also propose a computer-readable storage medium that stores one or more programs. The one or more programs include instructions which, when executed by a portable electronic device that includes multiple application programs, cause the portable electronic device to execute the method of the embodiment shown in FIG. 1 and specifically to perform the following operations: obtain page content data and user behavior data of a target lightweight application and the list of privacy data that the target lightweight application applies to collect; use the page content data and user behavior data of the target lightweight application as the input of a scene classification model, so as to predict the usage scenario category of the target lightweight application through the scene classification model; and determine, based on the list of collectible privacy data corresponding to the usage scenario category of the target lightweight application and the list of privacy data that the target lightweight application applies to collect, whether the target lightweight application exhibits abnormal collection behavior.
  • Of course, in addition to a software implementation, the electronic device in this specification does not exclude other implementations, such as logic devices or a combination of software and hardware. That is, the execution body of the processing flow is not limited to the logical units; it can also be hardware or a logic device.
  • FIG. 7 is a schematic diagram of the structure of an electronic device according to an embodiment of this specification.
  • the electronic device includes a processor, and optionally an internal bus, a network interface, and a memory.
  • The memory may include internal memory, such as high-speed random-access memory (RAM), and may also include non-volatile memory, such as at least one disk storage.
  • the electronic device may also include hardware required by other services.
  • the processor, network interface, and memory can be connected to each other through an internal bus.
  • The internal bus can be an ISA (Industry Standard Architecture) bus, a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, etc.
  • the bus can be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, only one bidirectional arrow is used in FIG. 7, but it does not mean that there is only one bus or one type of bus.
  • The memory is used to store a program. Specifically, the program may include program code, and the program code includes computer operation instructions.
  • The memory may include internal memory and non-volatile memory, and provides instructions and data to the processor.
  • The processor reads the corresponding computer program from the non-volatile memory into internal memory and then runs it, forming, at the logical level, the training device for the scene classification model.
  • The processor executes the program stored in the memory and is specifically configured to perform the following operations: obtain page content data, user behavior data, and usage scenario labels of multiple lightweight applications; extract the usage scenario features of the multiple lightweight applications from their page content data and user behavior data; and train a scene classification model based on the usage scenario features of the multiple lightweight applications and the corresponding usage scenario labels.
  • the method performed by the apparatus for training a scene classification model disclosed in the embodiments shown in FIG. 2 and FIG. 3 of this specification can be applied to the processor or implemented by the processor.
  • the processor may be an integrated circuit chip with signal processing capabilities.
  • each step of the above method can be completed by an integrated logic circuit of hardware in the processor or instructions in the form of software.
  • The above-mentioned processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), etc.; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. It can implement or execute the methods, steps, and logical block diagrams disclosed in the embodiments of this specification.
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like.
  • the steps of the method disclosed in the embodiments of this specification can be directly embodied as being executed and completed by a hardware decoding processor, or executed and completed by a combination of hardware and software modules in the decoding processor.
  • The software module can be located in a storage medium that is well established in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or registers.
  • the storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the above method in combination with its hardware.
  • The electronic device can also execute the methods in FIGS. 2 and 3 and realize the functions of the training device for the scene classification model in the embodiments shown in FIGS. 2 and 3, which are not repeated here.
  • The embodiments of this specification also propose a computer-readable storage medium that stores one or more programs. The one or more programs include instructions which, when executed by a portable electronic device that includes multiple application programs, cause the portable electronic device to execute the method of the embodiment shown in FIG. 2 and specifically to perform the following operations: obtain page content data, user behavior data, and usage scenario labels of multiple lightweight applications; extract the usage scenario features of the multiple lightweight applications from their page content data and user behavior data; and train a scene classification model based on the usage scenario features of the multiple lightweight applications and the corresponding usage scenario labels.
  • Of course, in addition to a software implementation, the electronic device in this specification does not exclude other implementations, such as logic devices or a combination of software and hardware. That is, the execution body of the processing flow is not limited to the logical units; it can also be hardware or a logic device.
  • a typical implementation device is a computer.
  • the computer may be, for example, a personal computer, a laptop computer, a cell phone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or Any combination of these devices.
  • Computer-readable media include permanent and non-permanent, removable and non-removable media, and information storage can be realized by any method or technology.
  • the information can be computer-readable instructions, data structures, program modules, or other data.
  • Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, CD-ROM, digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media such as modulated data signals and carrier waves.


Abstract

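A method, apparatus, and electronic device for identifying abnormal collection behavior based on privacy data protection and for training a scene classification model. The method includes: obtaining page content data and user behavior data of a target lightweight application and the list of privacy data that the target lightweight application applies to collect (S110); using the page content data and user behavior data of the target lightweight application as the input of a scene classification model, so as to predict the usage scenario category of the target lightweight application through the scene classification model (S120); and determining, based on the list of collectible privacy data corresponding to the usage scenario category of the target lightweight application and the list of privacy data that the target lightweight application applies to collect, whether the target lightweight application exhibits abnormal collection behavior (S130).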

Description

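Method and apparatus for identifying abnormal collection behavior based on privacy data protection
Technical Field
This document relates to the field of computer software technology, and in particular to a method, apparatus, and electronic device for identifying abnormal collection behavior based on privacy data protection.
Background
With the rapid development of mobile Internet technology, application programs are used more and more widely. Lightweight applications such as applets can be embedded in third-party applications and used at any time without being downloaded and installed, so they too are increasingly widely used. However, existing applets often collect users' privacy data when they are opened, and some applets collect such data excessively.
At present, operations staff usually have to judge manually whether an applet over-collects user privacy data, and only after receiving a user report about that applet or after a system has flagged it for abnormal collection behavior. A method for identifying the abnormal collection behavior of lightweight applications such as applets is therefore urgently needed to address this problem in the prior art.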
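Summary of the Invention
The purpose of the embodiments of this specification is to provide a method, apparatus, and electronic device for identifying abnormal collection behavior based on privacy data protection and for training a scene classification model, so as to prevent lightweight applications such as applets from excessively collecting users' privacy data.
To solve the above technical problem, the embodiments of this specification are implemented through the following aspects.
In a first aspect, a method for identifying abnormal collection behavior based on privacy data protection is proposed, including: obtaining page content data and user behavior data of a target lightweight application and the list of privacy data that the target lightweight application applies to collect; using the page content data and user behavior data of the target lightweight application as the input of a scene classification model, so as to predict the usage scenario category of the target lightweight application through the scene classification model; and determining, based on the list of collectible privacy data corresponding to the usage scenario category of the target lightweight application and the list of privacy data that the target lightweight application applies to collect, whether the target lightweight application exhibits abnormal collection behavior.
In a second aspect, a method for training a scene classification model is proposed, including: obtaining page content data, user behavior data, and usage scenario labels of multiple lightweight applications; extracting the usage scenario features of the multiple lightweight applications from their page content data and user behavior data; and training a scene classification model based on the usage scenario features of the multiple lightweight applications and the corresponding usage scenario labels, the scene classification model being used to predict the usage scenario category of a lightweight application.
In a third aspect, an apparatus for identifying abnormal collection behavior based on privacy data protection is proposed, including: an obtaining unit, which obtains page content data and user behavior data of a target lightweight application and the list of privacy data that the target lightweight application applies to collect; a prediction unit, which uses the page content data and user behavior data of the target lightweight application as the input of a scene classification model, so as to predict the usage scenario category of the target lightweight application through the scene classification model; and a determining unit, which determines, based on the list of collectible privacy data corresponding to the usage scenario category of the target lightweight application and the list of privacy data that the target lightweight application applies to collect, whether the target lightweight application exhibits abnormal collection behavior.
In a fourth aspect, a training unit for a scene classification model is proposed, including: a data acquisition unit, which obtains page content data, user behavior data, and usage scenario labels of multiple lightweight applications; a feature extraction unit, which extracts the usage scenario features of the multiple lightweight applications from their page content data and user behavior data; and a model training unit, which trains a scene classification model based on the usage scenario features of the multiple lightweight applications and the corresponding usage scenario labels, the scene classification model being used to predict the usage scenario category of a lightweight application.
In a fifth aspect, an electronic device is proposed, including: a processor; and a memory arranged to store computer-executable instructions which, when executed, cause the processor to perform the following operations: obtaining page content data and user behavior data of a target lightweight application and the list of privacy data that the target lightweight application applies to collect; using the page content data and user behavior data of the target lightweight application as the input of a scene classification model, so as to predict the usage scenario category of the target lightweight application through the scene classification model; and determining, based on the list of collectible privacy data corresponding to the usage scenario category of the target lightweight application and the list of privacy data that the target lightweight application applies to collect, whether the target lightweight application exhibits abnormal collection behavior.
In a sixth aspect, a computer-readable storage medium is proposed. The computer-readable storage medium stores one or more programs which, when executed by an electronic device that includes multiple application programs, cause the electronic device to perform the following operations: obtaining page content data and user behavior data of a target lightweight application and the list of privacy data that the target lightweight application applies to collect; using the page content data and user behavior data of the target lightweight application as the input of a scene classification model, so as to predict the usage scenario category of the target lightweight application through the scene classification model; and determining, based on the list of collectible privacy data corresponding to the usage scenario category of the target lightweight application and the list of privacy data that the target lightweight application applies to collect, whether the target lightweight application exhibits abnormal collection behavior.
In a seventh aspect, an electronic device is proposed, including: a processor; and a memory arranged to store computer-executable instructions which, when executed, cause the processor to perform the following operations: obtaining page content data, user behavior data, and usage scenario labels of multiple lightweight applications; extracting the usage scenario features of the multiple lightweight applications from their page content data and user behavior data; and training a scene classification model based on the usage scenario features of the multiple lightweight applications and the corresponding usage scenario labels, the scene classification model being used to predict the usage scenario category of a lightweight application.
In an eighth aspect, a computer-readable storage medium is proposed. The computer-readable storage medium stores one or more programs which, when executed by an electronic device that includes multiple application programs, cause the electronic device to perform the following operations: obtaining page content data, user behavior data, and usage scenario labels of multiple lightweight applications; extracting the usage scenario features of the multiple lightweight applications from their page content data and user behavior data; and training a scene classification model based on the usage scenario features of the multiple lightweight applications and the corresponding usage scenario labels, the scene classification model being used to predict the usage scenario category of a lightweight application.
As can be seen from the technical solutions provided by the above embodiments, the solutions of the embodiments of this specification have at least the following technical effect. One or more embodiments provided in this specification can obtain page content data and user behavior data of a target lightweight application and the list of privacy data that it applies to collect, use the page content data and user behavior data as the input of a scene classification model to predict the usage scenario category of the target lightweight application, and determine, based on the list of collectible privacy data corresponding to that usage scenario category and the list of privacy data that the target lightweight application applies to collect, whether the target lightweight application exhibits abnormal collection behavior. Identification of abnormal collection behavior by lightweight applications such as applets thus changes from passive checking to active identification, and a scene classification model is used to identify the usage scenario category. On the one hand this improves identification efficiency; on the other hand it protects user privacy and gives users a more reassuring service experience.
One or more embodiments provided in this specification can obtain page content data, user behavior data, and usage scenario labels of multiple lightweight applications, extract usage scenario features from the page content data and user behavior data of these lightweight applications, and train a scene classification model based on the usage scenario features and the corresponding usage scenario labels. The trained scene classification model is then used to identify the usage scenarios of lightweight applications such as applets, which on the one hand improves the efficiency of identifying applet usage scenarios and on the other hand saves unnecessary human resources.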
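Brief Description of the Drawings
To explain the technical solutions in the embodiments of this specification or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some of the embodiments recorded in this specification; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flowchart of a method for identifying abnormal collection behavior based on privacy data protection provided by an embodiment of this specification.
FIG. 2 is a schematic flowchart of a method for training a scene classification model provided by an embodiment of this specification.
FIG. 3 is a schematic flowchart of the method for training a scene classification model provided by an embodiment of this specification, applied in a practical scenario.
FIG. 4 is a schematic structural diagram of an apparatus for identifying abnormal collection behavior based on privacy data protection provided by an embodiment of this specification.
FIG. 5 is a schematic structural diagram of a training apparatus for a scene classification model provided by an embodiment of this specification.
FIG. 6 is a schematic structural diagram of an electronic device provided by an embodiment of this specification.
FIG. 7 is a schematic structural diagram of another electronic device provided by an embodiment of this specification.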
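Detailed Description
To make the purpose, technical solutions, and advantages of this specification clearer, the technical solutions in this specification are described clearly and completely below with reference to specific embodiments and the corresponding drawings. Obviously, the described embodiments are only some of the embodiments of this document, not all of them. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in this document without creative effort fall within the scope of protection of this document.
The technical solutions provided by the embodiments of this specification are described in detail below with reference to the drawings.
To prevent lightweight applications such as applets from excessively collecting users' privacy data, one or more embodiments of this specification provide a method for identifying abnormal collection behavior based on privacy data protection. The method can obtain page content data and user behavior data of a target lightweight application and the list of privacy data that it applies to collect, use the page content data and user behavior data as the input of a scene classification model to predict the usage scenario category of the target lightweight application, and determine, based on the list of collectible privacy data corresponding to that usage scenario category and the list of privacy data that the target lightweight application applies to collect, whether the target lightweight application exhibits abnormal collection behavior.
In this way, identification of abnormal collection behavior by lightweight applications such as applets changes from passive checking to active identification, and a scene classification model is used to identify the usage scenario category. On the one hand this improves identification efficiency; on the other hand it protects user privacy and gives users a more reassuring service experience.
It should be understood that the method for identifying abnormal collection behavior based on privacy data protection provided by the embodiments of this specification may be executed by, without limitation, at least one of a server, a computer, or another user terminal that can be configured to execute the method; alternatively, the executing entity may be a client capable of executing the method.
For ease of description, the implementation of the method is introduced below taking a server capable of executing the method as the executing entity. It can be understood that the server is only an illustrative example and should not be construed as limiting the method.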
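FIG. 1 is a schematic flowchart of a method for identifying abnormal collection behavior based on privacy data protection provided by an embodiment of this specification. The method of FIG. 1 may include steps S110 to S130.
S110: obtain page content data and user behavior data of a target lightweight application and the list of privacy data that the target lightweight application applies to collect. The target lightweight application may be a quick app, an applet, an H5 application, or any other lightweight application that users can use without installing it.
The page content data of the target lightweight application includes the text information, entity types, and corresponding entity quantities in its pages. An entity type may be any object shown in a page, such as a cat, a dog, a house, or a car. The user behavior data of the target lightweight application includes behavior data such as clicks, swipes, payments, forwards, and inputs made by users in its pages, as well as feature data such as the user's city, education, age, and occupation. The list of privacy data that the target lightweight application applies to collect may specifically be the list of user privacy data it actually collects when used, which may include, for example, the user's ID number, mobile phone number, gender, avatar, and nickname.
S120: use the page content data and user behavior data of the target lightweight application as the input of the scene classification model, so as to predict the usage scenario category of the target lightweight application through the scene classification model. It should be understood that lightweight applications such as applets often collect the user's privacy data when opened. For example, when a shopping applet is opened inside a chat application, the user is prompted to grant it permission to collect privacy data such as the user's avatar, nickname, and contact details in that chat application. Users usually do not pay attention to whether the applet they open will over-collect their privacy data, which leaves room for many applets to over-collect user privacy data with the intent of maliciously exploiting or selling it for additional profit.
In this situation, to prevent user privacy data from being over-collected and exploited, one or more embodiments of this specification can train a scene classification model in advance based on the page content data, user behavior data, and usage scenario labels of multiple lightweight applications, predict the usage scenario category of the target lightweight application through the scene classification model, and determine, based on the list of collectible privacy data corresponding to that usage scenario category and the list of privacy data that the target lightweight application applies to collect, whether the target lightweight application exhibits abnormal collection behavior.
S130: determine, based on the list of collectible privacy data corresponding to the usage scenario category of the target lightweight application and the list of privacy data that the target lightweight application applies to collect, whether the target lightweight application exhibits abnormal collection behavior.
The usage scenario categories of lightweight applications may include shopping, train ticket purchasing, bike sharing, learning tools, and so on. Lightweight applications in different usage scenario categories usually need to collect different user privacy data. For example, a shopping application usually needs the user's shopping account and contact details; a train ticket application needs the user's ID number, ticketing account, and contact details; a bike-sharing application needs the user's login account and contact details; and a learning tool application may only need the user's login account.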
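One simple way to hold the per-category lists of collectible privacy data is a lookup table keyed by usage scenario category. The sketch below mirrors the four example categories in the text; the category keys, field names, and function name are hypothetical, and a real system would maintain these lists as policy rather than hard-code them.

```python
# Hypothetical mapping from usage scenario category to collectible privacy data.
ALLOWED_PRIVACY_DATA = {
    "shopping": {"shopping_account", "contact_info"},
    "train_ticket": {"id_number", "ticketing_account", "contact_info"},
    "bike_sharing": {"login_account", "contact_info"},
    "learning_tool": {"login_account"},
}


def allowed_privacy_data(category):
    """Return the collectible privacy fields for a predicted category.

    Raises KeyError for a category that has no configured list.
    """
    return frozenset(ALLOWED_PRIVACY_DATA[category])


ticket_fields = allowed_privacy_data("train_ticket")
shopping_fields = allowed_privacy_data("shopping")
```

Returning a `frozenset` keeps callers from accidentally mutating the shared policy table.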
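In other words, given the list of privacy data that a lightweight application in a particular usage scenario category actually applies to collect and the list of privacy data collectible for that category, it can be judged whether the lightweight application over-collects user privacy data.
Optionally, determining whether the target lightweight application exhibits abnormal collection behavior based on the list of privacy data that it applies to collect and the target privacy data collection list includes: if the two lists are consistent, determining that the target lightweight application exhibits no abnormal collection behavior; if the two lists are inconsistent, determining that the target lightweight application exhibits abnormal collection behavior.
Optionally, to prevent the target lightweight application from over-collecting the user's privacy data, after it is determined that the target lightweight application exhibits abnormal collection behavior, the method further includes: intercepting the target lightweight application's request to send privacy data.
Take a shopping application as the target lightweight application. When opened and used, such an application usually only needs to collect privacy data such as the user's shopping account, contact details, and shipping address. Clearly, a user does not normally need to present personal identity information, such as an ID number, when shopping. If the shopping application additionally collects the user's ID number, then, after it is determined from the list of privacy data it applies to collect and the target privacy data collection list that it exhibits abnormal collection behavior, either its request to send the additionally collected privacy data can be intercepted, or all of its requests to send privacy data can be intercepted.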
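The two interception options in the shopping example (blocking only the additionally collected data, or blocking everything) can be sketched as a filter over an outgoing payload. This is an illustrative sketch only: the payload shape, the field names, and the `block_all` switch are assumptions, and here only over-collection triggers interception.

```python
def filter_send_request(payload, allowed, block_all=False):
    """Apply interception to an outgoing privacy-data payload.

    payload: dict mapping privacy field name -> value the app wants to send.
    allowed: fields collectible for the app's usage scenario category.
    block_all: if True, drop the whole payload once any extra field is found.
    """
    extra = set(payload) - set(allowed)
    if not extra:
        return dict(payload)  # nothing abnormal, pass through unchanged
    if block_all:
        return {}  # intercept every privacy-data send request
    # Intercept only the additionally collected fields.
    return {k: v for k, v in payload.items() if k in allowed}


allowed = {"shopping_account", "contact_info", "shipping_address"}
payload = {"shopping_account": "u1", "id_number": "x"}
partial = filter_send_request(payload, allowed)
blocked = filter_send_request(payload, allowed, block_all=True)
```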
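One or more embodiments provided in this specification can obtain page content data and user behavior data of a target lightweight application and the list of privacy data that it applies to collect, use the page content data and user behavior data as the input of a scene classification model to predict the usage scenario category of the target lightweight application, and determine, based on the list of collectible privacy data corresponding to that usage scenario category and the list of privacy data that the target lightweight application applies to collect, whether the target lightweight application exhibits abnormal collection behavior. Identification of abnormal collection behavior by lightweight applications such as applets thus changes from passive checking to active identification, and a scene classification model is used to identify the usage scenario category. On the one hand this improves identification efficiency; on the other hand it protects user privacy and gives users a more reassuring service experience.
FIG. 2 is a schematic flowchart of a method for training a scene classification model provided by an embodiment of this specification, including steps S210 to S230.
S210: obtain page content data, user behavior data, and usage scenario labels of multiple lightweight applications.
The page content data of the multiple lightweight applications includes the text information, entity types, and corresponding entity quantities in their pages. An entity type may be any object shown in a page, such as a cat, a dog, a house, or a car. The user behavior data of the multiple lightweight applications includes behavior data such as clicks, swipes, payments, forwards, and inputs made by multiple users in the pages of these applications, as well as feature data such as these users' cities, education, age, and occupation.
The usage scenario labels of the multiple lightweight applications are assigned before the scene classification model is trained, by manual or machine labeling of the usage scenario of each lightweight application, for example shopping, ticket purchasing, or learning tools.
S220: extract the usage scenario features of the multiple lightweight applications from their page content data and user behavior data.
It should be understood that the page content data of a lightweight application usually includes both text data and image data. To make it easier to extract feature data from both, one or more embodiments of this specification can convert the image data into text data and then splice all the text data into one text field. Specifically, extracting the usage scenario features of the multiple lightweight applications from their page content data and user behavior data includes: obtaining, from the page content data of the multiple lightweight applications, the multiple pieces of text information in the pages of each lightweight application and the entity types and quantities in those pages; splicing, for each lightweight application, the multiple pieces of text information with the entity types and quantities to obtain multiple text fields corresponding to the multiple lightweight applications, where one text field is obtained by splicing the multiple pieces of text information, the entity type names, and the corresponding entity quantities of the corresponding lightweight application; and extracting the usage scenario features of the multiple lightweight applications from the multiple text fields and the user behavior data corresponding to the multiple lightweight applications.
Optionally, extracting the usage scenario features of the multiple lightweight applications from the corresponding text fields and user behavior data includes: performing data preprocessing on the multiple text fields corresponding to the multiple lightweight applications; converting the preprocessed text fields into corresponding word vectors; and extracting the usage scenario features of the multiple lightweight applications from the word vectors and the user behavior data corresponding to the multiple lightweight applications. The data preprocessing includes removing stop words.
The spliced text fields usually contain words and symbols with no real meaning, such as function words and connectives like "的" (a possessive particle), "即使" ("even if"), and "以便" ("so that"). These words add little value to scene classification and increase the computational cost of classification. Therefore, in one or more embodiments of this specification, data preprocessing such as stop-word removal can be performed on the text fields before they are converted into word vectors.
To convert the preprocessed text fields into word vectors, a word-vector dictionary trained on a corpus, or an open-source word-vector dictionary, can be used. The word-vector dictionary contains mappings between words and word vectors, and one word vector corresponds to one set of feature values.
The behavior feature data corresponding to the user behavior data can be obtained through statistical analysis. To extract the usage scenario features of the multiple lightweight applications from the text fields and user behavior data, the word vectors corresponding to the text fields can be merged with the behavior feature data corresponding to the user behavior data to obtain the usage scenario features of the multiple lightweight applications.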
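The merging step can be sketched as follows. The specification only says that word vectors and behavior feature data are merged; averaging the word vectors into one fixed-length vector and then concatenating the behavior features is an assumption made here so that every application yields a feature vector of the same length. All names and numbers are illustrative.

```python
def mean_vector(vectors, dim):
    """Average a list of equal-length vectors; an empty list yields zeros."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def scenario_features(word_vectors, behavior_features, dim):
    """Concatenate the averaged word vector with the behavior features."""
    return mean_vector(word_vectors, dim) + list(behavior_features)


# Two toy 2-dimensional word vectors plus two hypothetical behavior features.
features = scenario_features([[1.0, 0.0], [0.0, 1.0]], [3.5, 0.25], dim=2)
```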
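Optionally, to avoid missing features in the pages of a lightweight application, one or more embodiments of this specification can, based on the names and corresponding quantities of the entity types in the pages of the multiple lightweight applications, repeat the name of each entity type as many times as its quantity and then splice the result with the text information in the pages to obtain the text field of each lightweight application. Specifically, splicing the multiple pieces of text information with the entity types and quantities in the pages of the multiple lightweight applications to obtain the corresponding text fields includes: based on the names and corresponding quantities of the entity types in the pages of the multiple lightweight applications, obtaining a text field for each entity type in the pages of each lightweight application, where the text field for one entity type in the pages of one lightweight application contains the name of that entity type repeated the corresponding number of times; and splicing, for each lightweight application, the multiple pieces of text information in its pages with the text fields for the entity types in its pages, to obtain the multiple text fields corresponding to the multiple lightweight applications.
S230: train a scene classification model based on the usage scenario features of the multiple lightweight applications and the corresponding usage scenario labels. The scene classification model is used to predict the usage scenario category of a lightweight application.
Optionally, training the scene classification model based on the usage scenario features of the multiple lightweight applications and the corresponding usage scenario labels includes: training, through a multi-class classification model, a scene classification model based on the usage scenario features and the corresponding usage scenario labels.
The multi-class classification model may specifically be an xgboost model, which is an open-source implementation of gradient-boosted tree models and can be used for classification and regression tasks.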
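A minimal training sketch with the xgboost Python package might look like the following. It assumes xgboost and numpy are installed, uses the scikit-learn style `XGBClassifier` wrapper, and picks arbitrary hyperparameters; the helper names and the label encoding are illustrative and are not taken from the specification.

```python
def encode_labels(labels):
    """Map usage scenario labels to integer class ids 0..n-1."""
    classes = sorted(set(labels))
    index = {name: i for i, name in enumerate(classes)}
    return [index[name] for name in labels], classes


def train_scene_classifier(features, labels):
    """Fit a gradient-boosted tree classifier on usage scenario features."""
    import numpy as np  # imported lazily so the helper above has no dependencies
    from xgboost import XGBClassifier

    y, classes = encode_labels(labels)
    # Hyperparameters are placeholders; the wrapper selects a multi-class
    # objective on its own when more than two classes are present.
    model = XGBClassifier(n_estimators=100, max_depth=4)
    model.fit(np.asarray(features, dtype=float), np.asarray(y))
    return model, classes


def predict_category(model, classes, feature_vector):
    """Predict the usage scenario category for one feature vector."""
    import numpy as np

    predicted = model.predict(np.asarray([feature_vector], dtype=float))
    return classes[int(predicted[0])]
```

Keeping the `classes` list alongside the model lets the integer prediction be mapped back to a readable category such as a shopping or ticketing label.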
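Taking applets as the lightweight applications, and with reference to the flowchart in FIG. 3 of training and applying the scene classification model, the method for training a scene classification model and the method for identifying abnormal collection behavior based on privacy data protection provided by the embodiments of this specification are described in detail below, including steps S301 to S311.
S301: obtain page content data of multiple applets. The page content data includes the text information and image data displayed in the applet pages, where the image data includes the entity types displayed in the applet pages and their quantities. S302: obtain user behavior data of the multiple applets. The user behavior data includes behavior data such as users' clicks, swipes, page jumps, inputs, and payments in the applet pages.
S303: for each of the multiple applets, splice the multiple pieces of text information in its pages with the entity types and quantities in its pages to obtain multiple text fields corresponding to the multiple applets; remove stop words from these text fields to eliminate redundant information; and convert the text fields into corresponding word vectors based on a preset word-vector dictionary. One text field is obtained by splicing the multiple pieces of text information, the entity type names, and the corresponding entity quantities of the corresponding applet. The word-vector dictionary contains correspondences between multiple text fields and word vectors, and one word vector corresponds to one set of feature values.
S304: construct behavior feature data based on the user behavior data of the multiple applets. Specifically, statistical analysis of the user behavior data can yield feature data such as users' average operation frequency and operation time periods, as well as feature data such as users' cities, age, education, and occupation.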
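The statistical analysis in S304 can be illustrated with a small sketch that derives an average operation frequency and a peak operation hour from raw events. The event tuple layout, the feature names, and the tie-breaking rule are assumptions for the example; the specification does not prescribe particular statistics beyond the examples it lists.

```python
from collections import Counter


def behavior_features(events):
    """Summarize behavior events given as (user_id, hour_of_day, action) tuples."""
    if not events:
        return {"avg_ops_per_user": 0.0, "peak_hour": None}
    ops_per_user = Counter(user for user, _hour, _action in events)
    ops_per_hour = Counter(hour for _user, hour, _action in events)
    top = max(ops_per_hour.values())
    # Break ties by taking the earliest hour so the result is deterministic.
    peak_hour = min(h for h, c in ops_per_hour.items() if c == top)
    return {
        "avg_ops_per_user": len(events) / len(ops_per_user),
        "peak_hour": peak_hour,
    }


# Hypothetical event log for one applet.
events = [
    ("u1", 9, "click"),
    ("u1", 9, "pay"),
    ("u2", 20, "click"),
    ("u2", 9, "swipe"),
]
stats = behavior_features(events)
```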
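S305: manually label the usage scenario data of the multiple applets to obtain their usage scenario labels. A usage scenario label characterizes information related to the usage scenario category of an applet. S306: using an xgboost multi-class classification model, train a scene classification model based on the multiple word vectors and the behavior feature data corresponding to the multiple applets. S307: use the page content data and user behavior data of the target applet as the input of the scene classification model, so as to predict the usage scenario category of the target applet through the scene classification model.
S308: determine the privacy data collection list corresponding to the usage scenario category of the target applet. S309: determine the list of privacy data that the target applet applies to collect. S310: compare the privacy data collection list corresponding to the usage scenario category of the target applet with the list of privacy data it applies to collect, to determine whether the target applet exhibits abnormal collection behavior. S311: if the two lists are inconsistent, determine that the target applet exhibits abnormal collection behavior, and intercept the target applet's request to send privacy data.
Taking a shopping applet as an example of the target applet: the privacy data list corresponding to its usage scenario category includes the user's mobile phone number, which is sensitive information. If the list of privacy data that the target applet applies to collect also includes other sensitive information such as an ID number, it can be determined that the target applet exhibits abnormal collection behavior. In that case, when the target applet sends the user's privacy data, its request to send privacy data can be intercepted, thereby avoiding excessive collection of the user's privacy data.
One or more embodiments provided in this specification can obtain page content data, user behavior data, and usage scenario labels of multiple lightweight applications, extract usage scenario features from the page content data and user behavior data of these lightweight applications, and train a scene classification model based on the usage scenario features and the corresponding usage scenario labels. The trained scene classification model is then used to identify the usage scenarios of lightweight applications such as applets, which on the one hand improves the efficiency of identifying applet usage scenarios and on the other hand saves unnecessary human resources.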
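FIG. 4 is a schematic structural diagram of an apparatus 400 for identifying abnormal collection behavior based on privacy data protection provided by an embodiment of this specification. Referring to FIG. 4, in a software implementation, the apparatus 400 may include: an obtaining unit 401, which obtains page content data and user behavior data of a target lightweight application and the list of privacy data that the target lightweight application applies to collect; a prediction unit 402, which uses the page content data and user behavior data of the target lightweight application as the input of a scene classification model, so as to predict the usage scenario category of the target lightweight application through the scene classification model; and a determining unit 403, which determines, based on the list of collectible privacy data corresponding to the usage scenario category of the target lightweight application and the list of privacy data that the target lightweight application applies to collect, whether the target lightweight application exhibits abnormal collection behavior.
Optionally, in one implementation, the determining unit 403 is configured to: if the list of privacy data that the target lightweight application applies to collect is consistent with the target privacy data collection list, determine that the target lightweight application exhibits no abnormal collection behavior; if the two lists are inconsistent, determine that the target lightweight application exhibits abnormal collection behavior.
Optionally, in one implementation, after the determining unit 403 determines that the target lightweight application exhibits abnormal collection behavior, the apparatus further includes an intercepting unit, which intercepts the target lightweight application's request to send privacy data.
The apparatus 400 can implement the method of the method embodiment in FIG. 1. For details, refer to the method for identifying abnormal collection behavior based on privacy data protection in the embodiment shown in FIG. 1, which is not repeated here.
FIG. 5 is a schematic structural diagram of a training apparatus 500 for a scene classification model provided by an embodiment of this specification. Referring to FIG. 5, in a software implementation, the training apparatus 500 may include: a data acquisition unit 501, which obtains page content data, user behavior data, and usage scenario labels of multiple lightweight applications; a feature extraction unit 502, which extracts the usage scenario features of the multiple lightweight applications from their page content data and user behavior data; and a model training unit 503, which trains a scene classification model based on the usage scenario features of the multiple lightweight applications and the corresponding usage scenario labels, the scene classification model being used to predict the usage scenario category of a lightweight application.
Optionally, in one implementation, the feature extraction unit 502 is configured to: obtain, from the page content data of the multiple lightweight applications, the multiple pieces of text information in the pages of each lightweight application and the entity types and quantities in those pages; splice, for each lightweight application, the multiple pieces of text information with the entity types and quantities to obtain multiple text fields corresponding to the multiple lightweight applications, where one text field is obtained by splicing the multiple pieces of text information, the entity type names, and the corresponding entity quantities of the corresponding lightweight application; and extract the usage scenario features of the multiple lightweight applications from the multiple text fields and the user behavior data corresponding to the multiple lightweight applications.
Optionally, in one implementation, the feature extraction unit 502 is configured to: perform data preprocessing on the multiple text fields corresponding to the multiple lightweight applications; convert the preprocessed text fields into corresponding word vectors; and extract the usage scenario features of the multiple lightweight applications from the word vectors and the user behavior data corresponding to the multiple lightweight applications. The data preprocessing includes removing stop words.
Optionally, in one implementation, the feature extraction unit 502 is configured to: based on the names and corresponding quantities of the entity types in the pages of the multiple lightweight applications, obtain a text field for each entity type in the pages of each lightweight application, where the text field for one entity type in the pages of one lightweight application contains the name of that entity type repeated the corresponding number of times; and splice, for each lightweight application, the multiple pieces of text information in its pages with the text fields for the entity types in its pages, to obtain the multiple text fields corresponding to the multiple lightweight applications.
Optionally, in one implementation, the model training unit 503 is configured to: train, through a multi-class classification model, a scene classification model based on the usage scenario features of the multiple lightweight applications and the corresponding usage scenario labels.
The training apparatus 500 can implement the methods of the method embodiments in FIGS. 2 and 3. For details, refer to the method for training a scene classification model in the embodiments shown in FIGS. 2 and 3, which is not repeated here.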
图6是本说明书的一个实施例电子设备的结构示意图。请参考图6,在硬件层面,该电子设备包括处理器,可选地还包括内部总线、网络接口、存储器。其中,存储器可能包含内存,例如高速随机存取存储器(Random-Access Memory,RAM),也可能还包括非易失性存储器(non-volatile memory),例如至少1个磁盘存储器等。当然,该电子设备还可能包括其他业务所需要的硬件。
处理器、网络接口和存储器可以通过内部总线相互连接,该内部总线可以是ISA(Industry Standard Architecture,工业标准体系结构)总线、PCI(Peripheral Component Interconnect,外设部件互连标准)总线或EISA(Extended Industry Standard Architecture,扩展工业标准结构)总线等。所述总线可以分为地址总线、数据总线、控制总线等。为便于表示,图6中仅用一个双向箭头表示,但并不表示仅有一根总线或一种类型的总线。
存储器,用于存放程序。具体地,程序可以包括程序代码,所述程序代码包括计算机操作指令。存储器可以包括内存和非易失性存储器,并向处理器提供指令和数据。
处理器从非易失性存储器中读取对应的计算机程序到内存中然后运行,在逻辑层面上形成基于隐私数据保护的异常采集行为识别装置。处理器,执行存储器所存放的程序,并具体用于执行以下操作:获取目标轻量应用的页面内容数据、用户行为数据和所述目标轻量应用申请采集的隐私数据列表;将所述目标轻量应用的页面内容数据和用户行为数据作为场景分类模型的输入,以通过所述场景分类模型预测所述目标轻量应用的使用场景类别;基于所述目标轻量应用的使用场景类别对应的可采集的隐私数据列表和所述 目标轻量应用申请采集的隐私数据列表,确定所述目标轻量应用是否存在异常采集行为。
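The method performed by the apparatus for identifying abnormal collection behavior based on privacy data protection, as disclosed in the embodiments shown in FIG. 1 to FIG. 3 of this specification, may be applied to a processor or implemented by a processor. The processor may be an integrated circuit chip with signal processing capability. During implementation, the steps of the above method may be completed by integrated logic circuits of hardware in the processor or by instructions in the form of software. The processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. It can implement or execute the methods, steps, and logical block diagrams disclosed in the embodiments of this specification. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The steps of the method disclosed in connection with the embodiments of this specification may be directly embodied as being executed by a hardware decoding processor, or executed by a combination of hardware and software modules in a decoding processor. The software module may reside in a storage medium that is mature in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or a register. The storage medium is located in the memory; the processor reads the information in the memory and completes the steps of the above method in combination with its hardware.
The electronic device can also perform the method of FIG. 1 and implement the functions of the apparatus for identifying abnormal collection behavior based on privacy data protection in the embodiment shown in FIG. 1, which are not repeated here.
An embodiment of this specification further provides a computer-readable storage medium that stores one or more programs. The one or more programs include instructions that, when executed by a portable electronic device including a plurality of application programs, cause the portable electronic device to perform the method of the embodiment shown in FIG. 1, and specifically to perform the following operations: acquiring page content data and user behavior data of a target lightweight application and the privacy data list that the target lightweight application requests to collect; using the page content data and user behavior data of the target lightweight application as input to a scenario classification model, so as to predict the usage scenario category of the target lightweight application through the scenario classification model; and determining whether the target lightweight application exhibits abnormal collection behavior based on the collectible privacy data list corresponding to the usage scenario category of the target lightweight application and the privacy data list that the target lightweight application requests to collect.
Of course, in addition to software implementations, the electronic device of this specification does not exclude other implementations, such as logic devices or a combination of software and hardware. In other words, the executing entity of the following processing flow is not limited to the individual logic units and may also be hardware or a logic device.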
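FIG. 7 is a schematic structural diagram of an electronic device according to an embodiment of this specification. Referring to FIG. 7, at the hardware level, the electronic device includes a processor and optionally further includes an internal bus, a network interface, and a memory. The memory may include internal memory, such as high-speed random-access memory (RAM), and may also include non-volatile memory, such as at least one disk storage. The electronic device may of course also include hardware required by other services.
The processor, the network interface, and the memory may be interconnected through the internal bus, which may be an ISA (Industry Standard Architecture) bus, a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one bidirectional arrow is shown in FIG. 7, but this does not mean that there is only one bus or one type of bus.
The memory is used to store a program. Specifically, the program may include program code, and the program code includes computer operation instructions. The memory may include internal memory and non-volatile memory, and provides instructions and data to the processor.
The processor reads the corresponding computer program from the non-volatile memory into internal memory and then runs it, forming, at the logical level, the training apparatus for a scenario classification model. The processor executes the program stored in the memory and is specifically configured to perform the following operations: acquiring page content data and user behavior data of a plurality of lightweight applications, together with usage scenario labels of the plurality of lightweight applications; extracting usage scenario features of the plurality of lightweight applications from their page content data and user behavior data; and training a scenario classification model based on the usage scenario features of the plurality of lightweight applications and the corresponding usage scenario labels.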
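The method performed by the training apparatus for a scenario classification model, as disclosed in the embodiments shown in FIG. 2 and FIG. 3 of this specification, may be applied to a processor or implemented by a processor. The processor may be an integrated circuit chip with signal processing capability. During implementation, the steps of the above method may be completed by integrated logic circuits of hardware in the processor or by instructions in the form of software. The processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. It can implement or execute the methods, steps, and logical block diagrams disclosed in the embodiments of this specification. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The steps of the method disclosed in connection with the embodiments of this specification may be directly embodied as being executed by a hardware decoding processor, or executed by a combination of hardware and software modules in a decoding processor. The software module may reside in a storage medium that is mature in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or a register. The storage medium is located in the memory; the processor reads the information in the memory and completes the steps of the above method in combination with its hardware.
The electronic device can also perform the methods of FIG. 2 and FIG. 3 and implement the functions of the training apparatus for a scenario classification model in the embodiments shown in FIG. 2 and FIG. 3, which are not repeated here.
An embodiment of this specification further provides a computer-readable storage medium that stores one or more programs. The one or more programs include instructions that, when executed by a portable electronic device including a plurality of application programs, cause the portable electronic device to perform the method of the embodiment shown in FIG. 2, and specifically to perform the following operations: acquiring page content data and user behavior data of a plurality of lightweight applications, together with usage scenario labels of the plurality of lightweight applications; extracting usage scenario features of the plurality of lightweight applications from their page content data and user behavior data; and training a scenario classification model based on the usage scenario features of the plurality of lightweight applications and the corresponding usage scenario labels.
Of course, in addition to software implementations, the electronic device of this specification does not exclude other implementations, such as logic devices or a combination of software and hardware. In other words, the executing entity of the following processing flow is not limited to the individual logic units and may also be hardware or a logic device.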
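Specific embodiments of this specification have been described above. Other embodiments fall within the scope of the appended claims. In some cases, the actions or steps recited in the claims may be performed in an order different from that in the embodiments and still achieve the desired results. In addition, the processes depicted in the accompanying drawings do not necessarily require the particular order shown, or a sequential order, to achieve the desired results. In some implementations, multitasking and parallel processing are also possible or may be advantageous.
In summary, the foregoing describes only preferred embodiments of this specification and is not intended to limit its scope of protection. Any modification, equivalent replacement, or improvement made within the spirit and principles of this specification shall fall within its scope of protection.
The systems, apparatuses, modules, or units described in the foregoing embodiments may be implemented by a computer chip or entity, or by a product having a certain function. A typical implementation device is a computer. Specifically, the computer may be, for example, a personal computer, a laptop computer, a cellular phone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
Computer-readable media include permanent and non-permanent, removable and non-removable media, which may store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random-access memory (SRAM), dynamic random-access memory (DRAM), other types of random-access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
It should also be noted that the terms "include", "comprise", and any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes the element.
The embodiments in this specification are described in a progressive manner. For identical or similar parts among the embodiments, reference may be made to one another, and each embodiment focuses on its differences from the others. In particular, the system embodiments are described relatively briefly because they are substantially similar to the method embodiments; for relevant details, refer to the corresponding descriptions of the method embodiments.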

Claims (14)

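  1. A method for identifying abnormal collection behavior based on privacy data protection, comprising:
    acquiring page content data and user behavior data of a target lightweight application and a privacy data list that the target lightweight application requests to collect;
    using the page content data and user behavior data of the target lightweight application as input to a scenario classification model, so as to predict a usage scenario category of the target lightweight application through the scenario classification model; and
    determining whether the target lightweight application exhibits abnormal collection behavior based on a collectible privacy data list corresponding to the usage scenario category of the target lightweight application and the privacy data list that the target lightweight application requests to collect.
  2. The method according to claim 1, wherein determining, based on the privacy data list that the target lightweight application requests to collect and the target privacy data collection list, whether the target lightweight application exhibits abnormal collection behavior comprises:
    determining that the target lightweight application exhibits no abnormal collection behavior if the privacy data list that the target lightweight application requests to collect is consistent with the target privacy data collection list; and
    determining that the target lightweight application exhibits abnormal collection behavior if the privacy data list that the target lightweight application requests to collect is inconsistent with the target privacy data collection list.
  3. The method according to claim 2, wherein after determining that the target lightweight application exhibits abnormal collection behavior, the method further comprises:
    intercepting a privacy data sending request of the target lightweight application.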
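  4. A training method for a scenario classification model, comprising:
    acquiring page content data and user behavior data of a plurality of lightweight applications and usage scenario labels of the plurality of lightweight applications;
    extracting usage scenario features of the plurality of lightweight applications from the page content data and user behavior data of the plurality of lightweight applications; and
    training a scenario classification model based on the usage scenario features of the plurality of lightweight applications and the corresponding usage scenario labels, the scenario classification model being used to predict a usage scenario category of a lightweight application.
  5. The method according to claim 4, wherein extracting the usage scenario features of the plurality of lightweight applications from the page content data and user behavior data of the plurality of lightweight applications comprises:
    obtaining, from the page content data of the plurality of lightweight applications, a plurality of pieces of text information in the pages of each of the plurality of lightweight applications and the entity types and counts in the pages of each of the plurality of lightweight applications;
    concatenating, for each lightweight application, the plurality of pieces of text information in its pages with the entity types and counts in its pages to obtain a plurality of text fields corresponding to the plurality of lightweight applications, wherein one text field is obtained by concatenating the plurality of pieces of text information, the entity type names, and the corresponding entity counts of the corresponding lightweight application; and
    extracting the usage scenario features of the plurality of lightweight applications from the plurality of text fields corresponding to the plurality of lightweight applications and the user behavior data.
  6. The method according to claim 5, wherein extracting the usage scenario features of the plurality of lightweight applications from the plurality of text fields corresponding to the plurality of lightweight applications and the user behavior data comprises:
    performing data preprocessing on each of the plurality of text fields corresponding to the plurality of lightweight applications;
    converting the preprocessed text fields corresponding to the plurality of lightweight applications into a corresponding plurality of word vectors; and
    extracting the usage scenario features of the plurality of lightweight applications from the plurality of word vectors and the user behavior data corresponding to the plurality of lightweight applications;
    wherein the data preprocessing operation includes removing stop words.
  7. The method according to claim 5, wherein concatenating the plurality of pieces of text information in the pages of the plurality of lightweight applications with the entity types and counts in the pages of the plurality of lightweight applications to obtain the plurality of text fields corresponding to the plurality of lightweight applications comprises:
    obtaining, based on the names and corresponding counts of the entity types in the pages of the plurality of lightweight applications, text fields corresponding to the entity types in the pages of the plurality of lightweight applications, wherein the text field corresponding to one entity type in a page of one lightweight application contains the corresponding number of occurrences of that entity type's name; and
    obtaining the plurality of text fields corresponding to the plurality of lightweight applications by concatenating, for each lightweight application, the plurality of pieces of text information in its pages with the text fields corresponding to the entity types in its pages.
  8. The method according to claim 4, wherein training the scenario classification model based on the usage scenario features of the plurality of lightweight applications and the corresponding usage scenario labels comprises:
    training, by means of a multi-class classification model, the scenario classification model based on the usage scenario features of the plurality of lightweight applications and the corresponding usage scenario labels.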
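  9. An apparatus for identifying abnormal collection behavior based on privacy data protection, comprising:
    an acquisition unit that acquires page content data and user behavior data of a target lightweight application and a privacy data list that the target lightweight application requests to collect;
    a prediction unit that uses the page content data and user behavior data of the target lightweight application as input to a scenario classification model, so as to predict a usage scenario category of the target lightweight application through the scenario classification model; and
    a determining unit that determines whether the target lightweight application exhibits abnormal collection behavior based on a collectible privacy data list corresponding to the usage scenario category of the target lightweight application and the privacy data list that the target lightweight application requests to collect.
  10. A training apparatus for a scenario classification model, comprising:
    a data acquisition unit that acquires page content data and user behavior data of a plurality of lightweight applications and usage scenario labels of the plurality of lightweight applications;
    a feature extraction unit that extracts usage scenario features of the plurality of lightweight applications from the page content data and user behavior data of the plurality of lightweight applications; and
    a model training unit that trains a scenario classification model based on the usage scenario features of the plurality of lightweight applications and the corresponding usage scenario labels, the scenario classification model being used to predict a usage scenario category of a lightweight application.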
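  11. An electronic device, comprising:
    a processor; and
    a memory arranged to store computer-executable instructions that, when executed, cause the processor to perform the following operations:
    acquiring page content data and user behavior data of a target lightweight application and a privacy data list that the target lightweight application requests to collect;
    using the page content data and user behavior data of the target lightweight application as input to a scenario classification model, so as to predict a usage scenario category of the target lightweight application through the scenario classification model; and
    determining whether the target lightweight application exhibits abnormal collection behavior based on a collectible privacy data list corresponding to the usage scenario category of the target lightweight application and the privacy data list that the target lightweight application requests to collect.
  12. A computer-readable storage medium storing one or more programs that, when executed by an electronic device including a plurality of application programs, cause the electronic device to perform the following operations:
    acquiring page content data and user behavior data of a target lightweight application and a privacy data list that the target lightweight application requests to collect;
    using the page content data and user behavior data of the target lightweight application as input to a scenario classification model, so as to predict a usage scenario category of the target lightweight application through the scenario classification model; and
    determining whether the target lightweight application exhibits abnormal collection behavior based on a collectible privacy data list corresponding to the usage scenario category of the target lightweight application and the privacy data list that the target lightweight application requests to collect.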
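  13. An electronic device, comprising:
    a processor; and
    a memory arranged to store computer-executable instructions that, when executed, cause the processor to perform the following operations:
    acquiring page content data and user behavior data of a plurality of lightweight applications and usage scenario labels of the plurality of lightweight applications;
    extracting usage scenario features of the plurality of lightweight applications from the page content data and user behavior data of the plurality of lightweight applications; and
    training a scenario classification model based on the usage scenario features of the plurality of lightweight applications and the corresponding usage scenario labels, the scenario classification model being used to predict a usage scenario category of a lightweight application.
  14. A computer-readable storage medium storing one or more programs that, when executed by an electronic device including a plurality of application programs, cause the electronic device to perform the following operations:
    acquiring page content data and user behavior data of a plurality of lightweight applications and usage scenario labels of the plurality of lightweight applications;
    extracting usage scenario features of the plurality of lightweight applications from the page content data and user behavior data of the plurality of lightweight applications; and
    training a scenario classification model based on the usage scenario features of the plurality of lightweight applications and the corresponding usage scenario labels, the scenario classification model being used to predict a usage scenario category of a lightweight application.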
PCT/CN2020/111725 2019-11-22 2020-08-27 Method and apparatus for identifying abnormal collection behavior based on privacy data protection WO2021098327A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911158814.7A CN110826006B (zh) 2019-11-22 2019-11-22 Method and apparatus for identifying abnormal collection behavior based on privacy data protection
CN201911158814.7 2019-11-22

Publications (1)

Publication Number Publication Date
WO2021098327A1 true WO2021098327A1 (zh) 2021-05-27

Family

ID=69558415

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/111725 WO2021098327A1 (zh) 2019-11-22 2020-08-27 Method and apparatus for identifying abnormal collection behavior based on privacy data protection

Country Status (3)

Country Link
CN (1) CN110826006B (zh)
TW (1) TWI743773B (zh)
WO (1) WO2021098327A1 (zh)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
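CN113434847A (zh) * 2021-06-25 2021-09-24 平安国际智慧城市科技股份有限公司 Privacy module processing method and apparatus for application programs, electronic device, and medium
CN113792341A (zh) * 2021-09-15 2021-12-14 百度在线网络技术(北京)有限公司 Automated privacy compliance detection method, apparatus, device, and medium for application programs
CN114793269A (zh) * 2022-03-25 2022-07-26 岚图汽车科技有限公司 Camera control method and related device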

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
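CN110826006B (zh) * 2019-11-22 2021-03-19 支付宝(杭州)信息技术有限公司 Method and apparatus for identifying abnormal collection behavior based on privacy data protection
CN111400705B (zh) * 2020-03-04 2023-03-14 支付宝(杭州)信息技术有限公司 Application program detection method, apparatus, and device
CN112491815A (zh) * 2020-11-11 2021-03-12 恒安嘉新(北京)科技股份公司 Information monitoring method, apparatus, device, and medium
CN112765654B (zh) * 2021-01-07 2022-09-20 支付宝(杭州)信息技术有限公司 Management and control method and apparatus based on privacy data invocation
CN112835902A (zh) * 2021-02-01 2021-05-25 上海上讯信息技术股份有限公司 Method and device for identifying and using data assets
CN112948835B (zh) * 2021-03-26 2022-07-19 支付宝(杭州)信息技术有限公司 Mini program risk detection method and apparatus
CN113297609A (zh) * 2021-07-27 2021-08-24 支付宝(杭州)信息技术有限公司 Method and apparatus for monitoring privacy collection behavior of mini programs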

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130297256A1 (en) * 2012-05-04 2013-11-07 Jun Yang Method and System for Predictive and Conditional Fault Detection
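CN105550584A (zh) * 2015-12-31 2016-05-04 北京工业大学 RBAC-based malicious program interception and handling method on the Android platform
CN109344042A (zh) * 2018-08-22 2019-02-15 北京中测安华科技有限公司 Method, apparatus, device, and medium for identifying abnormal operation behavior
CN109829300A (zh) * 2019-01-02 2019-05-31 广州大学 Apparatus, method, and system for dynamic deep detection of malicious app behavior
CN110826006A (zh) * 2019-11-22 2020-02-21 支付宝(杭州)信息技术有限公司 Method and apparatus for identifying abnormal collection behavior based on privacy data protection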

Family Cites Families (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
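KR20070111603A (ko) * 2006-05-18 2007-11-22 이상규 Security system for client and server
KR101539841B1 (ko) * 2013-05-30 2015-07-28 제주대학교 산학협력단 Policy-based information protection service method and system in a smart grid power network
CN104966031B (zh) * 2015-07-01 2018-02-27 复旦大学 Method for identifying non-permission-related privacy data in Android applications
CN107958154A (zh) * 2016-10-17 2018-04-24 中国科学院深圳先进技术研究院 Malware detection apparatus and method
US11347871B2 (en) * 2018-01-16 2022-05-31 International Business Machines Corporation Dynamic cybersecurity protection mechanism for data storage devices
CN110475014A (zh) * 2018-05-11 2019-11-19 北京三星通信技术研究有限公司 User scenario recognition method and terminal device
CN109495727B (zh) * 2019-01-04 2021-12-24 京东方科技集团股份有限公司 Intelligent monitoring method, apparatus, and system, and readable storage medium
CN109766488B (zh) * 2019-01-16 2022-09-16 南京工业职业技术学院 Scrapy-based data collection method
CN109933503A (zh) * 2019-02-13 2019-06-25 平安科技(深圳)有限公司 Method and apparatus for determining a user operation risk coefficient, storage medium, and server
CN109960753B (zh) * 2019-02-13 2023-07-25 平安科技(深圳)有限公司 Method and apparatus for detecting users of Internet access devices, storage medium, and server
CN110087099B (zh) * 2019-03-11 2020-08-07 北京大学 Privacy-preserving monitoring method and system
CN110213236B (zh) * 2019-05-05 2022-09-27 深圳市腾讯计算机系统有限公司 Method for determining service security risk, electronic device, and computer storage medium
CN110428091B (zh) * 2019-07-10 2022-12-27 平安科技(深圳)有限公司 Risk identification method based on data analysis and related device
CN110457694B (zh) * 2019-07-29 2023-09-22 腾讯科技(上海)有限公司 Message reminder method and apparatus, and scenario type recognition reminder method and apparatus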

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
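CN113434847A (zh) * 2021-06-25 2021-09-24 平安国际智慧城市科技股份有限公司 Privacy module processing method and apparatus for application programs, electronic device, and medium
CN113434847B (zh) * 2021-06-25 2023-10-27 深圳赛安特技术服务有限公司 Privacy module processing method and apparatus for application programs, electronic device, and medium
CN113792341A (zh) * 2021-09-15 2021-12-14 百度在线网络技术(北京)有限公司 Automated privacy compliance detection method, apparatus, device, and medium for application programs
CN113792341B (zh) * 2021-09-15 2023-10-13 百度在线网络技术(北京)有限公司 Automated privacy compliance detection method, apparatus, device, and medium for application programs
CN114793269A (zh) * 2022-03-25 2022-07-26 岚图汽车科技有限公司 Camera control method and related device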

Also Published As

Publication number Publication date
TWI743773B (zh) 2021-10-21
TW202121215A (zh) 2021-06-01
CN110826006B (zh) 2021-03-19
CN110826006A (zh) 2020-02-21

Similar Documents

Publication Publication Date Title
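WO2021098327A1 (zh) Method and apparatus for identifying abnormal collection behavior based on privacy data protection
CN110874440B (zh) Information push method, model training method therefor, apparatus, and electronic device
WO2021103909A1 (zh) Risk prediction method, risk prediction model training method, apparatus, and electronic device
WO2019169978A1 (zh) Resource recommendation method and apparatus
CN108550046B (zh) Resource and marketing recommendation method, apparatus, and electronic device
WO2022156065A1 (zh) Text sentiment analysis method, apparatus, device, and storage medium
CN110569502A (zh) Method and apparatus for identifying prohibited advertising slogans, computer device, and storage medium
CN111768258A (zh) Method, apparatus, electronic device, and medium for identifying abnormal orders
CN111598122B (zh) Data verification method, apparatus, electronic device, and storage medium
US10762089B2 (en) Open ended question identification for investigations
CN110058992B (zh) Copywriting template effect feedback method, apparatus, and electronic device
US9442918B2 (en) Perspective data management for common features of multiple items
CN111275071B (zh) Prediction model training method, prediction method, apparatus, and electronic device
CN116401466B (zh) Graded and classified book recommendation method and system
CN110334936B (zh) Method, apparatus, and device for constructing a credit qualification scoring model
US11222143B2 (en) Certified information verification services
US11503055B2 (en) Identifying siem event types
CN115617998A (zh) Text classification method and apparatus based on intelligent marketing scenarios
CN111754245B (zh) Method, apparatus, and device for authenticating business scene photos
US20240193365A1 (en) Method and system for insightful phrase extraction from text
CN115689284A (zh) Online shopping risk identification method, apparatus, device, and storage medium
CN112001662B (zh) Risk inspection method, apparatus, and device for merchant images
CN118152811A (zh) Data processing method and apparatus, device, storage medium, and program product
CN115081006A (zh) Sensitive data processing method, apparatus, and device
CN117952097A (zh) Event extraction method, related device, and storage medium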

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20890966

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20890966

Country of ref document: EP

Kind code of ref document: A1