WO2021201980A1 - Detecting legitimacy of data collection - Google Patents

Detecting legitimacy of data collection

Info

Publication number
WO2021201980A1
Authority
WO
WIPO (PCT)
Prior art keywords
content
evaluation
data collection
service
tier
Prior art date
Application number
PCT/US2021/016989
Other languages
French (fr)
Inventor
Wenjun GONG
Shiyu Zou
Yao KE
Anqi DU
Xingyu XU
Yihan GE
Yangbin ZHANG
Walter Hoy Toh WONG
Jing Liu
Rui Ding
Shi Han
Dongmei Zhang
Wenfei Tang
Original Assignee
Microsoft Technology Licensing, Llc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Microsoft Technology Licensing, LLC
Publication of WO2021201980A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60Protecting data
    • G06F21/62Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245Protecting personal data, e.g. for financial or medical purposes
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00Network architectures or network communication protocols for network security
    • H04L63/14Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1416Event detection, e.g. attack signature detection
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00Network architectures or network communication protocols for network security
    • H04L63/14Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441Countermeasures against malicious traffic
    • H04L63/1483Countermeasures against malicious traffic service impersonation, e.g. phishing, pharming or web spoofing

Definitions

  • a data collection service may widely refer to various services, applications, software, websites, etc. capable of implementing data collection or having a function of data collection.
  • the survey form service is a type of dedicated data collection service for collecting data through forms.
  • data collection may also be performed in services not dedicated for data collection, e.g., collecting data through emails in an email service, collecting data through webpages in a browser service, collecting data through productivity tool documents in a productivity tool, etc. All these services capable of achieving data collection may be collectively referred to as data collection services.
  • Embodiments of the present disclosure propose methods and apparatuses for detecting legitimacy of data collection.
  • the data collection may be implemented through processing content related to the data collection by a user in a data collection service.
  • At least one event occurred in the data collection service and/or at least one outside service may be monitored, the event being associated with the content and/or the user.
  • state information associated with the event may be detected from the data collection service and/or the outside service.
  • a content evaluation tier and/or a creator evaluation tier may be determined based on the state information, the content evaluation tier corresponding to legitimacy of the content and the creator evaluation tier corresponding to legitimacy of a creator of the content.
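The monitor → detect → evaluate flow summarized above can be sketched in code. This is a minimal illustration only, not the claimed implementation; all names (`monitor_events`, `detect_state`, `determine_tiers`) and the two example signals are hypothetical, and only the content evaluation tier is shown (a creator tier would be determined analogously).

```python
# Hypothetical sketch of the monitor -> detect -> evaluate pipeline
# described above. All names and signals are illustrative.

def monitor_events(service_log):
    """Yield events associated with content and/or users (step 310)."""
    for entry in service_log:
        if entry["type"] in {"new_content_creation", "content_modification",
                             "content_receipt", "content_response"}:
            yield entry

def detect_state(event):
    """Detect state information (evidence) associated with an event (step 320)."""
    return {
        "asks_for_password": "password" in event.get("questions", ""),
        "logo_mismatch": event.get("logo_owner") != event.get("creator"),
    }

def determine_tiers(state):
    """Map detected state information to a content evaluation tier (step 330)."""
    suspicious = sum(1 for v in state.values() if v)
    return ("illegitimate" if suspicious >= 2
            else "suspicious" if suspicious == 1
            else "legitimate")

log = [{"type": "new_content_creation",
        "questions": "enter your password",
        "logo_owner": "BankX", "creator": "mallory"}]
for event in monitor_events(log):
    print(determine_tiers(detect_state(event)))  # both signals fire -> illegitimate
```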
  • FIG.1 illustrates an exemplary process of illegitimate data collection.
  • FIG.2 illustrates exemplary content involved in illegitimate data collection.
  • FIG.3 illustrates an exemplary process for detecting legitimacy of data collection according to an embodiment.
  • FIG.4 illustrates an exemplary deployment of a legitimacy detection service for data collection according to an embodiment.
  • FIG.5 illustrates a flowchart of an exemplary method for detecting legitimacy of data collection according to an embodiment.
  • FIG.6 illustrates an exemplary apparatus for detecting legitimacy of data collection according to an embodiment.
  • FIG.7 illustrates an exemplary apparatus for detecting legitimacy of data collection according to an embodiment.
  • Data collection services may be used by malicious users for illegitimate-purpose data collections, and thus there is a risk of the data collection services being abused.
  • the data collection services may be maliciously used for collecting personal privacy or sensitive data, collecting commercial secrets, broadcasting inappropriate content, etc., and the collected data may be used for financial crimes, reputation damage, network attacks, etc.
  • Illegitimate-purpose data collections will tremendously damage the interests of providers and legitimate users of data collection services.
  • An example is phishing, which is a common network attack.
  • a phisher will conduct illegitimate-purpose data collections, e.g., collecting privacy or sensitive data of log-in account and password, bank card number, credit card number, home address, commercial information of companies, etc.
  • the survey form service is a common data collection service used by phishers. Through the survey form service, a phisher may create and distribute illegitimate-purpose forms, and obtain information provided by responders.
  • Existing techniques for phishing detection may be divided into a proactive approach and a passive approach.
  • Commonly-used techniques for proactively detecting phishing attacks may comprise, e.g., page content examination, blacklist-based detection, etc.
  • Embodiments of the present disclosure propose effective detections of legitimacy of data collection, and accordingly timely controls or interventions to illegitimate data collections.
  • legitimacy of data collection may widely refer to whether a data collection has a legitimate purpose, is legal, reasonable, unmalicious, non-abused, conforms to ethics, does not damage individuals' or entities' interests, etc.
  • Data collection may be implemented through processing content related to the data collection by a user in a data collection service.
  • the content may comprise various digital information forms capable of performing data collection, e.g., form, email, webpage, productivity tool document, etc.
  • the data collection service may be various services supporting the processing of the content, e.g., a survey form service, an email service, a browser service, a productivity tool, etc.
  • a creator of the content may create the content for collecting data in the data collection service, and responders of the content may fill information into the content to provide data in the data collection service, wherein the responders may refer to recipients responding to the content among recipients having received the content.
  • the embodiments of the present disclosure may adopt various types of information from the data collection service and/or from outside services that are different from the data collection service, to trigger and perform detection of legitimacy of data collection.
  • the legitimacy of data collection may be evaluated through an evaluation tier mechanism.
  • a content evaluation tier indicating the legitimacy of the content may be determined, which may be also referred to as content reputation tier.
  • the legitimacy of the content may refer to whether the content itself is legitimate, e.g., whether it involves personal privacy information collection, etc.
  • a creator evaluation tier indicating legitimacy of the content creator may also be determined, which may be also referred to as creator reputation tier.
  • the legitimacy of the creator may refer to whether the creator has a legitimate purpose, e.g., being a phisher or not, etc.
  • Various types of information facilitating determination of the content evaluation tier and the creator evaluation tier may be detected or collected from the data collection service and/or the outside services.
  • Information facilitating determination of evaluation tiers may be extracted from each stage in the overall lifecycle of data collection. The determination of an evaluation tier may consider information associated with the content and/or behaviors of the user in the data collection service and/or the outside services, administrative information in the data collection service, etc.
  • Various types of information detected or collected from the data collection service and/or the outside services may be considered comprehensively to determine the content evaluation tier and the creator evaluation tier, and accordingly determine legitimacy of the content and legitimacy of the creator.
  • Various approaches are proposed for determining the content evaluation tier and the creator evaluation tier, e.g., an evaluation rule-based approach, an evaluation score-based approach, a Fast Explainable Additive Model (FXAM)-based approach, etc.
  • the FXAM proposed by the embodiments of the present disclosure is a unified model adopting numerical features, categorical features, temporal features, etc. for predicting evaluation tiers.
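The FXAM itself is the disclosure's own model and its internals are not detailed here. As a loose illustration of the general additive-model idea it builds on (a prediction formed by summing independent per-feature contributions over numerical, categorical, and temporal features), consider the following sketch; every function and number in it is an assumption for illustration only, not the FXAM of the patent.

```python
# Loose sketch of the additive-model idea: the score is a sum of
# independent per-feature shape functions. NOT the patented FXAM;
# all shape functions and constants are illustrative assumptions.

def f_num_questions(n):          # numerical feature contribution
    return 0.1 * max(0, n - 10)  # penalize unusually long forms

def f_channel(channel):          # categorical feature contribution
    return {"email": 0.2, "social": 0.4}.get(channel, 0.0)

def f_hour(hour):                # temporal feature contribution
    return 0.3 if hour < 6 else 0.0  # created in the middle of the night

def additive_score(n_questions, channel, hour):
    return f_num_questions(n_questions) + f_channel(channel) + f_hour(hour)

score = additive_score(n_questions=25, channel="social", hour=3)
print(round(score, 2))  # 0.1*15 + 0.4 + 0.3 = 2.2
```

An additive structure like this is what makes such models explainable: each feature's contribution to the final score can be read off directly.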
  • the embodiments of the present disclosure may conduct corresponding control operations in the data collection service and/or the outside services according to the determined content evaluation tier, creator evaluation tier, etc., to restrict or prevent occurrence of illegitimate data collection behaviors, protect benefits of legitimate users, etc.
  • the embodiments of the present disclosure may identify a phishing form, identify a phisher, prevent a phisher from utilizing a form as a phishing tool, prevent a phishing form from being distributed for collecting privacy or sensitive data, help a recipient to identify phishing behaviors, assist an administrator to identify phishing behaviors so as to ensure benefits of legitimate users and data security, etc.
  • the data collection service is a survey form service
  • the content is a form
  • the legitimacy of the content is associated with whether the content is a phishing form
  • the legitimacy of the creator is associated with whether the creator is a phisher.
  • FIG.1 illustrates an exemplary process 100 of illegitimate data collection.
  • Illegitimate data collection usually has a lifecycle including multiple stages.
  • the process 100 exemplarily illustrates this by taking a phisher performing phishing in a survey form service as an example.
  • a phisher may start a survey form service.
  • the phisher may want to collect sensitive data through a form, thereby entering the survey form service at 110.
  • the survey form service may be provided through different approaches, e.g., through a webpage, through a client, etc.
  • the phisher may log in the survey form service with his account.
  • the phisher may also identify a target recipient before starting the survey form service, in order to conduct targeted phishing.
  • the phisher may create a phishing form in the survey form service.
  • the form may include one or more questions intended for collecting, e.g., sensitive data.
  • the phisher may create, in the form, questions that enquire about log-in account and password, bank card number, credit card number, etc.
  • the phisher may impersonate a legal entity in the form, e.g., a company, an individual, a website, etc.
  • the phisher may add disguised logos, fake trademarks, false emails, fake URLs, etc. into the form to make the created form look more like a legitimate-purpose form.
  • the phisher may distribute the created phishing form through various approaches. For example, the phisher may send the form to specific or unspecified recipients via emails. In order to further increase credibility, the phisher may attach some additional information to the email, e.g., false descriptions, disguised logos, fake trademarks, etc.
  • At 140, the phisher may collect sensitive data. For example, some recipients of the form may fill answers to questions and requested information in the form, and, as responders, return the form to the phisher. Thus, the phisher may obtain sensitive data of the responders. Those responders who return sensitive data may also be deemed as victims.
  • the phisher may carry out various malicious behaviors with the collected sensitive data for achieving illegitimate purposes. For example, the phisher may use bank account information provided by a responder for stealing financial assets, use privacy information provided by a responder for initiating further network attacks, etc.
  • process 100 only shows several exemplary stages in a lifecycle of an exemplary illegitimate data collection. In other cases, the process of illegitimate data collection may also include more or fewer stages.
  • FIG.2 illustrates exemplary content 200 involved in illegitimate data collection.
  • the content 200 is an exemplary form adopted by phishing.
  • the form may be created by a phisher at 120 in FIG.1.
  • FIG.3 illustrates an exemplary process 300 for detecting legitimacy of data collection according to an embodiment.
  • the data collection may be implemented through processing content related to the data collection by a user in a data collection service.
  • the user may be a creator or a recipient of the content.
  • the user as a creator, may create, edit and distribute the content, and view a response result as a result of the data collection, in the data collection service.
  • the user as a recipient, may access the content, fill information into the content and return the content, in the data collection service.
  • At 310, at least one event occurring in a data collection service 302 and/or at least one outside service 304 may be monitored.
  • an event may refer to occurrence of content, occurrence of user operations or behaviors, etc. in various services.
  • the monitored event is associated with content and/or user.
  • the outside service 304 may represent one or more outside services.
  • events occurring in the data collection service 302 may be monitored at 310.
  • a creator user may create new content in the data collection service 302, thus a "new content creation event" may be monitored.
  • a creator user may be modifying questions in the content in the data collection service 302, thus a “content modification event” may be monitored.
  • a recipient user may have received content in the data collection service 302, thus a “content receipt event” may be monitored.
  • a recipient user may be filling information into the content in the data collection service 302, thus a "content response event” may be monitored.
  • an administrator of the data collection service 302 may implement a management or control operation to the processing of the content in the background, thus a "management event" may be monitored.
  • An administrator may refer to a person who conducts management or control for the operating of the data collection service in the background. In some cases, an administrator may also refer to a person who is assigned by an entity renting the data collection service and manages content. It should be appreciated that the events described above are only exemplary, and any other types of events may also be monitored in the data collection service 302.
  • events occurring in the outside service 304 may be monitored at 310.
  • An outside service may refer to any service other than the data collection service.
  • the outside service 304 may comprise: an email service, a browser service, a security detection service for an operating system, a cloud service, a social media service, etc.
  • the monitoring of events at 310 may include receiving an indication of an event from the outside service 304.
  • the outside service 304 may act as an outside signal source to provide outside signals or notifications about occurrence of events.
  • the outside service 304 being an email service as an example
  • when the email service finds out that an email is embedded with content for data collection and is to be sent, the email service may generate an indication of this event.
  • a "content distribution event” may be monitored at 310.
  • the email service detects a junk mail that includes content for data collection, the email service may generate an indication of this event.
  • a "junk content distribution event” may be monitored at 310.
  • the outside service 304 being a browser service as an example
  • when a user is accessing content for data collection, or a website that includes content for data collection, through the browser service, the browser service may generate an indication of this event.
  • a "content identification event” may be monitored at 310.
  • a security detection function is provided in some browser services, e.g., a SmartScreen function is provided in the Edge browser to detect security of websites or pages being browsed.
  • the security detection function may identify that the content includes malicious information, and the browser service may generate an indication of this event.
  • a "malicious content identification event" may be monitored at 310.
  • Taking the outside service 304 being a security detection service for an operating system as an example, e.g., a firewall in an operating system, when the security detection service identifies content for data collection, or identifies that the content includes malicious information, it may generate an indication of this event.
  • Thus, a "content identification event" or a "malicious content identification event" may be monitored at 310.
  • when the cloud service finds out that a certain online document, website, attachment, etc. includes content for data collection, the cloud service may generate an indication for this event. Thus, a "content identification event" may be monitored at 310.
  • a security detection function is provided in some cloud services to detect security of online documents, websites, attachments, etc. that use the cloud service. When the security detection function identifies content that includes malicious information, the cloud service may generate an indication for this event. Thus, a "malicious content identification event" may be monitored at 310.
  • when the social media service scans that certain social information includes content for data collection, or that the content includes malicious information, it may generate an indication for this event. Thus, a "content identification event" or a "malicious content identification event" may be monitored at 310.
  • outside services and events described above are all exemplary, and any other types of events may also be monitored in any other outside services.
  • When an event is monitored in the data collection service or the outside service at 310, the event may trigger subsequent operations in the process 300; thus, the operation of monitoring an event at 310 may also be regarded as a triggering operation.
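This triggering role might be sketched as a simple dispatcher: an indication arriving from the data collection service or an outside signal source triggers downstream state detection for known event types. All names here (`HANDLED_EVENTS`, `on_indication`) are illustrative assumptions, not identifiers from the disclosure.

```python
# Hypothetical event dispatcher: indications received from the data
# collection service or outside services trigger downstream detection.

HANDLED_EVENTS = {
    # events from the data collection service
    "new_content_creation", "content_modification", "content_receipt",
    "content_response", "management",
    # events indicated by outside services
    "content_distribution", "junk_content_distribution",
    "content_identification", "malicious_content_identification",
}

def on_indication(event_type, payload, detect):
    """Trigger state-information detection only for known event types."""
    if event_type not in HANDLED_EVENTS:
        return None
    return detect(event_type, payload)

result = on_indication("malicious_content_identification",
                       {"url": "http://example.test/form"},
                       detect=lambda t, p: ("triggered", t))
print(result)  # ('triggered', 'malicious_content_identification')
```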
  • state information may refer to various types of information that facilitate determining an evaluation tier of the content or the content creator, and further determining legitimacy of the content or the content creator.
  • the state information may also be referred to as evidence.
  • the same or different types of state information may be detected for different events. For different types of users, e.g., the creator and recipients, the same or different types of state information may also be detected.
  • detecting state information may include detecting various types of information associated with the content in the data collection service, e.g., whether a title of the content includes sensitive words, whether questions in the content are enquiring about sensitive data such as password, whether a logo in the content belongs to an entity different from the creator, the number of logos inserted in the content, the number of questions included in the content, whether links in the content are secure, distribution channel of the content, whether the content returned by responders includes a large amount of sensitive data, the total number of collected responses, etc.
  • detecting state information may include detecting various types of information associated with behaviors of a user in the data collection service.
  • the state information may include, e.g., the time point at which the content was created, the length of time it took to create the content, whether the user has a history record of abusing the data collection service, the length of time the user stayed while editing a certain question, whether the user repeatedly edited a certain question, the number of times the user refreshed the response result, etc.
  • the state information may include, e.g., the total time the user took to fill answers, the length of time the user stayed on a certain question, etc.
  • the types of the detected state information associated with user behaviors may not be restricted by whether the current user is a creator or a recipient.
  • the detected state information associated with user behaviors may broadly contain various types of state information associated with behaviors of the creator and/or the recipient.
  • detecting state information may include obtaining administrative information in the data collection service.
  • the administrative information may be information corresponding to control measures taken by an administrator for the processing of the content.
  • the administrative information may include, e.g., the number of reports by recipients flagging the content as illegitimate data collection, confirming that the content involves collecting sensitive data, confirming that questions in the content involve sensitive data, sending warning information to the creator, sending warning information to recipients, imposing or removing restrictions on accessing or editing the content, etc.
  • detecting state information may include obtaining various types of information associated with behaviors of the content creator in the outside service. When a user logs in the data collection service with his account, the account or related information may also be used for identifying the user in outside services.
  • the user may use the same account or associated accounts to log in the data collection service, an email service, a social media, etc., thus the identity of the user may be identified in different services according to the accounts used by the user.
  • the embodiments of the present disclosure are not limited to using accounts to identify the same user in different services, but may adopt any other means, e.g., through a terminal device used by the user.
  • the state information associated with the behaviors of the content creator in the outside service may include, e.g., whether the URL of the creator has been identified as a malicious URL by the outside service, whether the creator has a low evaluation tier in the outside service, whether the creator has abusive behaviors in the outside service, whether the creator is in a blacklist in the outside service, etc.
  • detecting state information may include obtaining various types of information associated with the content in the outside service, e.g., judgment by the outside service on whether the content is malicious or not.
  • an indication provided by, e.g., a browser service, a security detection service for operating system, a cloud service, a social media, etc., that the content includes malicious information may be used as the state information.
  • the embodiments of the present disclosure may also use various types of other judgment information for the content provided by the outside service as the state information.
  • All the state information described above is exemplary, and the embodiments of the present disclosure may also include any other types of state information.
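A small sketch of how a few of the content-related signals listed above might be detected from a form is given below. The structure of the `form` dictionary, the regular expression, and every field name are assumptions for illustration; real detectors would cover far more signal types.

```python
import re

# Hypothetical detector for a few of the content-related signals
# listed above (sensitive questions, logo mismatch, insecure links).

SENSITIVE_PATTERNS = re.compile(
    r"password|bank card|credit card|home address", re.IGNORECASE)

def detect_content_state(form):
    """Return a dict of evidence extracted from a form-like content dict."""
    return {
        "title_has_sensitive_words": bool(SENSITIVE_PATTERNS.search(form["title"])),
        "asks_sensitive_data": any(
            SENSITIVE_PATTERNS.search(q) for q in form["questions"]),
        # does a logo in the content belong to an entity other than the creator?
        "logo_mismatch": form.get("logo_owner", form["creator"]) != form["creator"],
        "num_questions": len(form["questions"]),
        "insecure_links": [u for u in form.get("links", [])
                           if not u.startswith("https://")],
    }

evidence = detect_content_state({
    "title": "Account verification",
    "creator": "mallory",
    "logo_owner": "Contoso Bank",
    "questions": ["What is your password?", "Credit card number?"],
    "links": ["http://contoso-bank.example/login"],
})
print(evidence["asks_sensitive_data"], evidence["logo_mismatch"])  # True True
```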
  • different types of related state information may be detected for different events or different types of users. For example, for a "new content creation event”, various types of state information related to the content and/or the creator may be detected. For example, for a "content receipt event”, various types of state information related to the content and/or a recipient may be detected.
  • Detection of state information may be performed for each stage in the entire lifecycle of data collection. For example, state information may be detected at each stage in a lifecycle of illegitimate data collection. Accordingly, legitimacy of data collection may be finally determined at each stage of data collection through the process 300.
  • At the stage of starting or logging in to the data collection service, e.g., at step 110 of starting the survey form service in FIG.1, at least state information associated with behaviors of the content creator in the outside service may be detected.
  • At the stage of creating or editing the content, e.g., at step 120 of creating the phishing form in FIG.1, state information related to content creation or editing, state information related to creator behaviors, administrative information, etc. may be detected in the data collection service.
  • At the stage of distributing the content, e.g., at step 130 of distributing the phishing form in FIG.1, at least state information associated with content distribution may be detected, e.g., information about content distribution channels, an indication, provided by an outside service for distributing the content, of whether the content includes malicious information, etc.
  • At the stage of collecting responses to the content from responders, e.g., at step 140 of collecting sensitive data in FIG.1, at least state information related to the content, state information related to responder behaviors, administrative information, etc. may be detected in the data collection service.
  • At the stage of achieving an illegitimate purpose, e.g., at step 150 of carrying out malicious behaviors in FIG.1, at least administrative information in the data collection service, information about control operations that have been taken in the data collection service or outside services, etc. may be detected. Since the process 300 may detect state information in each stage of data collection and further determine the legitimacy of the data collection, this facilitates more timely detection of illegitimate data collection.
  • a content evaluation tier and/or a creator evaluation tier may be determined based on the detected state information.
  • the content evaluation tier corresponds to the legitimacy of the content. Different content evaluation tiers may reflect different degrees of the legitimacy of the content. For example, the content evaluation tier may be divided into legitimate content, suspicious content, and illegitimate content, wherein the legitimate content indicates that the content is used for legitimate purposes, the suspicious content indicates that the content may be used for illegitimate purposes, and the illegitimate content indicates that the content is determined to be used for illegitimate purposes. Moreover, for example, the content evaluation tier may also be simply divided into legitimate content and illegitimate content.
  • the embodiments of the present disclosure may cover content evaluation tiers divided in any approaches.
  • the creator evaluation tier corresponds to the legitimacy of the creator of the content. Different creator evaluation tiers may reflect different degrees of legitimacy of the creator.
  • the creator evaluation tier may be divided into good user, normal user, suspicious user, and illegitimate user, wherein the good user indicates that the user has a good usage record of the data collection service and has a very low possibility to conduct illegitimate data collection, the normal user indicates that the user has not been found conducting illegitimate data collection, the suspicious user indicates that the user is likely to conduct illegitimate data collection, and the illegitimate user indicates that it is determined that the user has conducted illegitimate data collection.
  • the creator evaluation tier may also be simply divided into legitimate user and illegitimate user, wherein the legitimate user indicates that the user has not conducted illegitimate data collection. It should be appreciated that the embodiments of the present disclosure may cover creator evaluation tiers divided in any approaches.
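One possible encoding of the tier divisions described above is as ordered enumerations, so that tiers can be compared and the "worst" of several results selected directly. The class and member names below are illustrative assumptions; the disclosure does not prescribe any particular encoding.

```python
from enum import IntEnum

# Illustrative encoding of the tier divisions described above;
# the ordering lets tiers be compared ("worse" = higher value).

class ContentTier(IntEnum):
    LEGITIMATE = 0
    SUSPICIOUS = 1
    ILLEGITIMATE = 2

class CreatorTier(IntEnum):
    GOOD = 0
    NORMAL = 1
    SUSPICIOUS = 2
    ILLEGITIMATE = 3

# The worst of several determined tiers can be taken directly:
print(max(ContentTier.SUSPICIOUS, ContentTier.LEGITIMATE).name)  # SUSPICIOUS
```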
  • Various approaches may be adopted for determining the content evaluation tier and/or the creator evaluation tier, e.g., an evaluation rule-based approach, an evaluation score-based approach, a FXAM-based approach, etc. Moreover, any combination of these approaches may also be used for determining an evaluation tier. For example, respective evaluation tier determination results may be obtained based on each approach, and then these evaluation tier determination results may be combined or merged to obtain a final evaluation tier. For example, the final evaluation result may be obtained through performing weighted summation on multiple evaluation tier determination results based on credibility of each approach. Moreover, optionally, an evaluation tier determination result obtained by the approach with the highest credibility may also be directly selected as the final evaluation tier.
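The credibility-weighted combination of per-approach results might be sketched as follows. The numeric tier encoding (0 = legitimate … 2 = illegitimate), the approach names, and the weights are all assumptions for illustration.

```python
# Hypothetical combination of evaluation-tier results from several
# approaches via credibility-weighted summation. Tiers are encoded as
# numbers (0 = legitimate ... 2 = illegitimate); weights are assumed.

def combine_tiers(results, weights):
    """Weighted average of per-approach tier results, rounded to a tier."""
    total = sum(weights.values())
    score = sum(results[a] * w for a, w in weights.items()) / total
    return round(score)

results = {"rule": 2, "score": 1, "fxam": 2}        # per-approach tiers
weights = {"rule": 0.3, "score": 0.2, "fxam": 0.5}  # credibility of each
print(combine_tiers(results, weights))  # (0.6 + 0.2 + 1.0) / 1.0 = 1.8 -> 2
```

Selecting the result of the single most credible approach instead would simply be `results[max(weights, key=weights.get)]`.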
  • one or more evaluation rules may be defined in advance, and each evaluation rule or a combination of multiple evaluation rules may correspond to a specific evaluation tier, e.g., a content evaluation tier and/or a creator evaluation tier. It may be determined whether the detected state information matches with at least one evaluation rule, and if yes, an evaluation tier corresponding to the at least one evaluation rule may be determined.
  • each evaluation rule may correspond to a combination of existence of multiple predetermined types of state information. For example, only when all the state information required by an evaluation rule is detected, it is determined that the evaluation rule is matched. Taking the state information "whether a question in the content is enquiring about sensitive data" as an example, if it is detected that the question is enquiring about sensitive data, it may be considered that the state information is detected; otherwise, it may be considered that the state information is not detected. It is assumed that an evaluation rule R1 corresponds to a set {S1, S2, S3} including three types of state information, and corresponds to a content evaluation tier "illegitimate content".
  • the state information detected at 320 includes state information Si, & and &, it may be considered that the detected state information matches with the evaluation rule Rl, and an evaluation tier of illegitimate content may be given. If only the state information & and & are detected at 320, it may be considered that the evaluation rule Rl is not matched, thus the evaluation tier of illegitimate content cannot be given according to the evaluation rule Rl .
  • each evaluation rule may correspond to a combination of values of all possible state information.
  • all possible types of state information may be defined in advance, and only when values of the actually detected state information meet the value combination required by the evaluation rule, it may be determined that the evaluation rule is matched. Taking the state information "whether a question in the content is enquiring about sensitive data" as an example, if it is detected that the question is enquiring about sensitive data, the value of the state information may be set to 1, otherwise, the value may be set to 0.
  • an evaluation rule R2 requires that the set includes at least 3 types of state information with a value of 1, and the evaluation rule R2 corresponds to a creator evaluation tier "illegitimate user". If after the detection at 320, the set includes 4 types of state information with a value of 1, it may be considered that the evaluation rule R2 is matched, and an evaluation tier of illegitimate user may be given. If after the detection at 320, the set includes only two types of state information with a value of 1, it may be considered that the evaluation rule R2 is not matched, and thus the evaluation tier of illegitimate user cannot be given according to the evaluation rule R2.
  • some examples of evaluation rules are given above, and any other forms of evaluation rules may be defined according to specific application requirements. Moreover, in some cases, when the state information detected at 320 satisfies multiple evaluation rules simultaneously and multiple evaluation tiers are determined accordingly, the worst evaluation tier, as an example, may be used as a final evaluation tier determined by the evaluation rule-based approach.
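  • The evaluation rule-based matching described above might be sketched as follows; the rule definitions, state-information identifiers (S1, S2, S3) and the extra second rule are illustrative assumptions rather than part of the disclosed service.

```python
# Hedged sketch of rule matching: a rule matches only when all of its
# required state information is detected; when several rules match,
# the worst tier is returned, as in the evaluation rule-based approach.
TIER_SEVERITY = {"legitimate content": 0, "suspicious content": 1, "illegitimate content": 2}

RULES = [
    {"requires": {"S1", "S2", "S3"}, "tier": "illegitimate content"},  # like rule R1
    {"requires": {"S1"}, "tier": "suspicious content"},                # hypothetical extra rule
]

def evaluate(detected_state):
    """Return the worst tier among all matched rules, or a default tier."""
    matched = [r["tier"] for r in RULES if r["requires"] <= detected_state]
    if not matched:
        return "legitimate content"
    return max(matched, key=lambda t: TIER_SEVERITY[t])

print(evaluate({"S1", "S2", "S3"}))  # all state required by R1 detected
print(evaluate({"S2", "S3"}))        # R1 not matched
```

A default tier for the no-match case is one design choice; a real deployment might instead defer to the other approaches.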
  • a weighted sum of confidences of the detected state information may be calculated to obtain an evaluation score, and an evaluation tier may be determined based on the evaluation score, e.g., the content evaluation tier and/or creator evaluation tier.
  • for example, it is assumed that n pieces of state information are detected at 320, wherein the i-th piece of state information is denoted as S_i.
  • the state information S_i may be assigned a weight w_i which represents the importance of the state information in determining the evaluation tier.
  • in an implementation, the sum of the weights of all the state information is equal to 1.
  • moreover, the state information S_i has a confidence c_i which represents the degree of possibility that the state information is true, and ranges from 0% to 100%. The evaluation score may then be calculated as the weighted sum w_1·c_1 + w_2·c_2 + ... + w_n·c_n.
  • the evaluation score may be further used for determining a corresponding evaluation tier. For example, multiple evaluation score intervals may be defined in advance, and each interval corresponds to an evaluation tier. Therefore, when the calculated evaluation score falls within a certain evaluation score interval, an evaluation tier corresponding to the interval may be used as the final evaluation tier determined by the evaluation score-based approach.
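  • As a hedged sketch of the evaluation score-based approach, the weighted sum of confidences and the score intervals might look as follows; the particular weights, confidences and interval boundaries are illustrative assumptions.

```python
def evaluation_score(weights, confidences):
    """Weighted sum of confidences; weights sum to 1, confidences are in [0, 1]."""
    assert abs(sum(weights) - 1.0) < 1e-9
    return sum(w * c for w, c in zip(weights, confidences))

def tier_from_score(score):
    # Pre-defined score intervals; each interval corresponds to a tier.
    if score >= 0.7:
        return "illegitimate content"
    if score >= 0.4:
        return "suspicious content"
    return "legitimate content"

score = evaluation_score([0.5, 0.3, 0.2], [0.9, 0.8, 0.2])  # 0.45 + 0.24 + 0.04 = 0.73
print(tier_from_score(score))
```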
  • the FXAM proposed by the embodiments of the present disclosure may be adopted for predicting an evaluation tier, e.g., content evaluation tier and/or creator evaluation tier.
  • the existing Generalized Additive Model has limitations in performing predictive analysis tasks.
  • the existing GAM cannot process multi-dimensional data including numerical features, categorical features, temporal features, etc.
  • the existing training strategy of the GAM leads to low training speed on a large-scale multi-dimensional data set with a large amount of features.
  • the FXAM proposed by the embodiments of the present disclosure may adopt a feature set including numerical features, categorical features, temporal features, etc., and the FXAM is a unified model capable of processing multi-dimensional data, and may be trained for predicting evaluation tiers.
  • the FXAM may be trained through a three-stage iteration (TSI) training strategy, wherein the TSI includes three stages corresponding to numerical features, categorical features and temporal features respectively. Moreover, the training process of the FXAM may be accelerated by applying corresponding optimization for each stage in the TSI.
  • the numerical features, the categorical features, the temporal features, etc. may come from the state information detected at 320.
  • the numerical features may refer to features characterized by numerical values, e.g., the number of logos inserted in the content, the number of questions included in the content, the number of times to refresh response results, the number of times to repeatedly modify a certain question, etc.
  • the categorical features may refer to features characterized by categorical attributes.
  • for example, for the categorical feature "sensitive data enquired by questions in the content", its categorical attributes may include "password information", "bank account information", etc.
  • for the categorical feature "prompts sent to the creator", its categorical attributes may include "content is restricted from being sent", "the question involves sensitive data", etc.
  • for some categorical features, the categorical attributes may include "yes", "no", etc.
  • the temporal features may refer to features characterized by a time point or duration, e.g., the time point when the content is created, the length of stayed time during which a certain question is edited, the time point when the first response is obtained, etc.
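  • The three feature types above might be assembled into one FXAM input row as sketched below; the feature names, category list and values are hypothetical.

```python
state = {
    "num_logos": 3,                                 # numerical feature
    "num_questions": 12,                            # numerical feature
    "sensitive_data_type": "password information",  # categorical feature
    "created_at": 1_600_000_000,                    # temporal feature (epoch seconds)
}

# One-hot encode the categorical feature over an assumed category list.
CATEGORIES = ["password information", "bank account information", "other"]
one_hot = [1 if state["sensitive_data_type"] == c else 0 for c in CATEGORIES]

row = [state["num_logos"], state["num_questions"], *one_hot, state["created_at"]]
print(row)
```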
  • a special feature detection process for the feature set of the FXAM may be performed at 320.
  • the feature detection process may detect corresponding state information by referring to which features are included in the feature set of the FXAM.
  • Some exemplary details about the FXAM are further discussed below.
  • the response y may be an evaluation tier to be predicted, e.g., content evaluation tier or creator evaluation tier.
  • the Hilbert space of a measurable function f(t) over the temporal feature is denoted by H_t, wherein H_t has the same attributes as the Hilbert spaces defined over the numerical features.
  • a time period of the time section component is denoted as a positive integer d > 1, and the temporal data {t_1, ..., t_n} may be processed as discrete time points.
  • T_φ := {t_i | t_i mod d = φ} may be denoted as a phase-φ set, because its elements share the same time phase φ ∈ {0, 1, ..., d − 1} with respect to the data {t_1, ..., t_n}. It can be concluded that T_0, ..., T_{d−1} are mutually disjoint, and T_0 ∪ ... ∪ T_{d−1} = {t_1, ..., t_n}.
  • it is assumed that the data set D includes n realizations of a random variable Y at p + q + 1 design values, and is denoted as D = {(x_{i1}, ..., x_{ip}, z_{i1}, ..., z_{iq}, t_i; y_i), i = 1, ..., n}; then the model of the FXAM may be represented as: y_i = Σ_j f_j(x_{ij}) + Σ_k β_k·o_{ik} + T(t_i) + S(t_i) + ε_i, wherein o_{ik} ∈ {0, 1} is obtained through one-hot encoding from z_{i1}, ..., z_{iq}; f_j, T and S are smoothers; and β_k are parameters to be learned from D.
  • the overall equation is an additive model, which includes three parts corresponding to the modeling of numerical features, categorical features and temporal features respectively.
  • f_j models the distribution of x_j with respect to the response y as a univariate smooth function. This is similar to the standard Generalized Additive Model (GAM).
  • the part corresponding to categorical features, Σ_k β_k·o_{ik}, is a parameterized form obtained by representing the categorical values z_{i1}, ..., z_{iq} in a Q-dimensional one-hot vector, wherein a weight β_k is assigned to each item o_{ik}. Since the value of the item o_{ik} ∈ {0, 1} indicates whether there is a category value, i.e., whether a certain categorical feature adopts a specific value, and the additivity among items is considered, β_k expresses the contribution of a certain categorical feature when it takes a specific value.
  • the FXAM unifies all categorical features into a one-hot vector, which allows adopting momentum-like accelerations to improve training efficiency.
  • the part T(t) + S(t) decomposes a signal in the temporal feature t into a trend signal T and a time section signal S.
  • Such decomposition expresses multi-angle signals from a single feature, which is helpful for predictive analysis. It should be appreciated that this equation may be extended to cover multiple types of time series signals, e.g., abrupt changes, stochastic drifts, etc.
  • the equation described above does not treat a temporal feature as a numerical feature, which avoids the multi-angle signals being composed together and thus becoming less intuitive or not being conveyed appropriately.
  • the loss function to be minimized, denoted as Equation (2), may be represented as: L = ‖y − Σ_j f_j − Zb − T − S‖² + λ Σ_j ∫(f_j″(v))² dv + λ_z‖b‖² + λ_T ∫(T″(t))² dt + λ_S Σ_φ ∫(S^(φ)″(t))² dt, wherein all the bolded items are n × 1 vectors, and Z is a n × Q design matrix corresponding to the one-hot encoding of categorical features. S^(φ) is a smoothing function defined on T_φ, which is a sub-domain of {t_1, ..., t_n}, and S indicates that the time section component domain-merges all the sub-components S^(φ).
  • λ, λ_z, λ_T and λ_S are pre-specified hyper-parameters.
  • ‖y − Σ_j f_j − Zb − T − S‖² is the total square error, and λ Σ_j ∫(f_j″(v))² dv is regularization for f_j.
  • the ℓ2 regularization λ_z‖b‖² is proposed to avoid potential collinearity among one-hot variables derived from categorical features.
  • the time section component S is divided into d phase-equivalent sub-components S^(φ), and regularization is applied to each S^(φ).
  • the smoothness of S itself is omitted, because T mainly carries this type of information. Restricting smoothness within each phase-equivalent domain may better express the repeating pattern of the time section signal.
  • it is known that the optimization scheme for minimizing a square error with regularization of the form ∫(f″(v))² dv is cubic spline smoothing with knots at each x_{1j}, ..., x_{nj}. The loss function may then be rewritten in terms of f_z, f_T and f_s, which are used as alternative expressions of Zb, T and S, respectively, and are convenient for subsequent discussions.
  • K_j is a n × n input matrix, which is pre-calculated with the values x_{1j}, ..., x_{nj}.
  • K_T is calculated in the same way.
  • K_S is a n × n matrix obtained by applying a cubic spline smoother over T_φ.
  • in an implementation, the indexes are reordered with permutation matrices: K^(φ) is denoted as the matrix with respect to cubic spline smoothing over the knots in T_φ, and P_φ is a permutation matrix that maps the indexes of these knots into the original indexes in {t_1, ..., t_n}.
  • P = [P_0, ..., P_{d−1}] is the overall permutation matrix, which maps the indexes of elements from {t_1, ..., t_n} to the indexes of elements in {T_0, ..., T_{d−1}}.
  • the FXAM's normal equations satisfy stationarity conditions of L.
  • the FXAM’ s normal equations are necessary conditions for optimality of L. Solutions of the FXAM's normal equations exist and are globally optimal.
  • Three-stage iteration (TSI)
  • the TSI is an iterative training process with three stages, and the three stages are applied for the training of numerical features, categorical features and temporal features respectively.
  • the TSI fixes parameters of other stages, and performs training on parameters of interest until local convergence, as shown in lines 1.5, 1.8 and 1.10, and performs iterations over the three stages until global convergence.
  • the TSI will converge to a solution of the FXAM's normal equations.
  • the training strategy of each stage is flexible. Taking Stage 2 as an example, it is desired to solve for b; accordingly, b may be directly calculated by matrix inversion/multiplication, or b may be estimated by gradient descent. Such flexibility of the training strategy provides space for improving training efficiency. Exemplary optimizations performed for each stage will be discussed below, which are for improving training efficiency and thereby accelerating the overall FXAM training.
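  • The control flow of the TSI might be sketched as below. The per-stage smoothers are replaced by a trivial mean-based stand-in purely to illustrate the fix-the-other-stages-and-fit loop; the real FXAM stages use spline smoothing, one-hot regression, and trend/time-section decomposition.

```python
def fit_stage(residual):
    # Stand-in for a per-stage fit: absorb the mean of the residual.
    return sum(residual) / len(residual)

def tsi(y, max_iters=100, tol=1e-9):
    # One scalar component per stage: numerical, categorical, temporal.
    parts = [0.0, 0.0, 0.0]
    for _ in range(max_iters):
        prev = list(parts)
        for s in range(3):  # Stage 1, Stage 2, Stage 3
            # Fix the parameters of the other stages; fit the parameters of interest.
            residual = [yi - sum(parts) + parts[s] for yi in y]
            parts[s] = fit_stage(residual)
        # Global convergence check over the three stages.
        if max(abs(a - b) for a, b in zip(parts, prev)) < tol:
            break
    return parts

parts = tsi([1.0, 2.0, 3.0])
print(sum(parts))  # the components jointly fit the mean of y
```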
  • training efficiency may be improved by improving efficiency of back-fitting.
  • fast kernel smoothing approximation may be adopted.
  • the smoothing task here is with n input samples which are also evaluation points. Cubic spline smoothing has O(n) time complexity on this task, but is still expensive in terms of massive operations.
  • the fast kernel smoothing method may achieve O(n) complexity with small coefficients.
  • the core idea may be referred to as fast sum updating, in which, given a polynomial kernel, this method pre-calculates a cumulative sum of each item of the polynomial kernel on evaluation points, and uses these cumulative sums to perform one-shot scanning over the evaluation points to complete the task.
  • the Epanechnikov kernel may be selected, and a fast sum updating algorithm may be used for approximating the original cubic spline smoothing, so as to reduce operations with almost no loss of accuracy.
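  • For illustration, a direct (non-accelerated) Epanechnikov kernel smoother is sketched below. This naive form is O(n²); the fast sum updating described above reaches O(n) by pre-computing cumulative sums of the polynomial kernel's terms, which is omitted here for brevity.

```python
def epanechnikov(u):
    # Polynomial kernel: 0.75 * (1 - u^2) on [-1, 1], zero elsewhere.
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0

def smooth(xs, ys, bandwidth):
    """Kernel-weighted average at each evaluation point (the inputs themselves)."""
    out = []
    for x0 in xs:
        weights = [epanechnikov((x - x0) / bandwidth) for x in xs]
        out.append(sum(w * y for w, y in zip(weights, ys)) / sum(weights))
    return out

ys_hat = smooth([0.0, 1.0, 2.0, 3.0], [0.0, 1.0, 2.0, 3.0], bandwidth=1.5)
print(ys_hat)
```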
  • dynamically-adjusted feature order iteration may be performed. Instead of iterating over a fixed order of numerical features, the dynamically-adjusted feature order iteration may be adopted for further accelerating convergence speed. Intuitively, it is desired to evaluate features with higher predictive power earlier, because the loss may be reduced more and thus converges faster.
  • the time cost for evaluating the predictive power of each feature should be lightweight.
  • a lightweight estimator with theoretical guidance is proposed to estimate the predictive power of each feature, and is used for dynamically ordering the features.
  • intelligent sampling may be performed. Better initialization, rather than zero function, will reduce the number of iterations.
  • an iteration is a complete cycle over the features x_1, ..., x_p.
  • a random sampling strategy may be adopted for better initializing f_j.
  • Sample-size determination needs to be performed, i.e., to control sample variation, and the sample size should be guided by data characteristics rather than a fixed number.
  • Sample variation is the difference between a smoothing function f_n(x) over all data points and a smoothing function f_s(x) over sampled data points.
  • f_n(x) or f_s(x) may be regarded as samples, taken from a ground-truth function F(X), with sample sizes n and s respectively.
  • the upper boundary of the sample variation may be estimated, and the estimate may be used for guiding sample-size determination, i.e., deriving an appropriate sample size for each feature x_i from data characteristics of that feature.
  • in an implementation, a single sample size s* derived from the feature-specific sample sizes is used as the sample size applied for all the features, wherein γ is a hyper-parameter of the derivation.
  • the Nesterov gradient acceleration may be regarded as an improved momentum, which makes convergence significantly faster than gradient descent, especially when data includes a large amount of categorical features or cardinality is high.
  • Traditional approaches conduct histogram-type smoothing for each categorical feature iteratively, which may be regarded as gradient descent without considering momentum, thus convergence speed is much slower.
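  • The momentum-like acceleration can be illustrated with Nesterov's method on a toy quadratic; the objective and hyper-parameters below are illustrative, not FXAM's actual categorical-stage objective.

```python
def grad(b):
    # Gradient of the toy objective f(b) = (b - 3)^2.
    return 2.0 * (b - 3.0)

def nesterov(lr=0.1, momentum=0.9, steps=200):
    b, v = 0.0, 0.0
    for _ in range(steps):
        lookahead = b + momentum * v          # evaluate the gradient at the look-ahead point
        v = momentum * v - lr * grad(lookahead)
        b += v
    return b

print(round(nesterov(), 4))  # converges toward the minimizer b = 3
```

Evaluating the gradient at the look-ahead point, rather than at the current point, is what distinguishes Nesterov's scheme from plain momentum and typically yields faster convergence.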
  • training efficiency may be improved by performing time section trend decomposition. Trend and time section signals may be identified from the temporal feature t in an iterative approach. Table 3 shows an exemplary process of the time section trend decomposition.
  • the trained FXAM has explainability to features.
  • the importance of each feature in the feature set for performing prediction may be explicitly known, e.g., which features are relatively important for predicting evaluation tiers, which features have relatively small influence on predicting evaluation tiers, etc.
  • the FXAM may be used for showing how any one feature specifically affects the final predicted relationship, e.g., f_j(x_j) as discussed above, etc.
  • This relationship may be used in conjunction with knowledge in the applied domain for verifying whether the laws learned by the FXAM (e.g., f_j(x_j)) are trustworthy, whether there are redundant features, whether there is a lack of effective features, etc., thereby providing another angle of explainability. Such explainability of the FXAM may be used for selecting or updating features in the feature set.
  • features with very low importance may be removed from the feature set and features with high importance may be retained or added, based on the explainability of the FXAM. Then, the new feature set may be used for the next or more rounds of training, and features may be removed or added again according to their importance.
  • the above-mentioned process of selecting features in the feature set may be performed multiple times in this way. Finally, a good feature set that can make the most accurate predictions may be obtained.
  • the explainability of the FXAM facilitates to select types of state information to be detected at 320. For example, at 320, state information corresponding to those features explained as having high importance by the FXAM may be focused on, and state information corresponding to those features explained as having low importance by the FXAM may be ignored. Accordingly, the explainability of the FXAM also facilitates to define a combination of state information for the evaluation rules in the evaluation rule-based approach, to set appropriate weights for different state information in the evaluation score-based approach, etc.
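  • Explainability-driven updating of the feature set might be sketched as below; the feature names, importance values and threshold are illustrative assumptions.

```python
def update_feature_set(importances, keep_threshold=0.05):
    """Keep features whose importance meets the threshold; drop the rest."""
    kept = {f for f, imp in importances.items() if imp >= keep_threshold}
    dropped = set(importances) - kept
    return kept, dropped

importances = {
    "num_questions": 0.42,
    "asks_sensitive_data": 0.31,
    "num_logos": 0.18,
    "created_hour": 0.01,  # explained by the FXAM as having very low importance
}
kept, dropped = update_feature_set(importances)
print(sorted(kept), sorted(dropped))
```

In practice this would run over multiple rounds of training, re-estimating importances each time, rather than once as shown here.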
  • the content evaluation tier and/or the creator evaluation tier may be determined through the processing at 330 in the process 300, and thus the legitimacy of the content and/or the legitimacy of the creator may be determined.
  • the process 300 may further comprise: performing control operations in response to at least the content evaluation tier and/or the creator evaluation tier at 340.
  • control operations may broadly refer to various operations that facilitate to, e.g., prevent the creator of the content from achieving illegitimate purposes, assist recipients of the content to avoid attacks or losses, assist an administrator of the data collection service to protect legitimate users and service data, etc.
  • control operations may include applying various usage restrictions to the content in the data collection service, e.g., restricting or prohibiting the distribution of the content, restricting or prohibiting the editing of the content, restricting or prohibiting the accessing to the content, restricting or prohibiting the responding to the content, restricting or delaying or prohibiting the displaying of response results, etc.
  • control operations may include applying various behavior restrictions to a user in the data collection service, e.g., preventing the creator from editing or distributing the content, requiring the creator to modify a question involving sensitive data, restricting or prohibiting the creator from accessing the content, preventing a recipient from filling sensitive information, preventing a recipient from sending a response, restricting or prohibiting a recipient from accessing the content, etc.
  • the control operations may include presenting various prompts in the data collection service.
  • the prompts may be provided to the content creator or content recipients.
  • the prompts may be warnings or notifications about the legitimacy of the content or the legitimacy of the creator, e.g., a prompt that the current question involves sensitive data, a prompt that the content may have an illegitimate purpose, a prompt requiring a question to be modified, a prompt that the content is prohibited from being distributed or responded to, etc.
  • control operations may include sending a notification to an administrator of the data collection service, so that the administrator may take further management measures.
  • the notification may be information about the legitimacy of the content or the legitimacy of the creator, e.g., a notification that the content is determined as illegitimate, a notification that the creator is determined as illegitimate, a notification about the detected state information, etc.
  • control operations may include sending a notification to an outside service, so that the outside service may take further management measures.
  • the notification may be information about the legitimacy of the content or the legitimacy of the creator, e.g., a notification that the content is determined as illegitimate, a notification that the creator is determined as illegitimate, etc.
  • control operations may include storing the determined content evaluation tier and/or creator evaluation tier in a storage device.
  • control operations described above are exemplary, and the embodiments of the present disclosure are not limited to these control operations, but may include any other control operations.
  • the control operations performed at 340 may correspond to a specific content evaluation tier or creator evaluation tier. Different control operations may be performed for different content evaluation tiers or creator evaluation tiers respectively, so as to achieve, e.g., different control strategies, different user experiences, etc.
  • the content evaluation tier is divided into legitimate content, suspicious content and illegitimate content. If it is determined at 330 that the current content is legitimate content, strict control operations may be avoided to apply, e.g., only presenting prompts when needed. If it is determined at 330 that the current content is suspicious content, necessary control operations may be applied, e.g., restricting the distribution and editing of the content, requesting the creator to modify a question involving sensitive data, delaying the displaying of a response result, providing a prompt about that the current question involves sensitive data, etc.
  • if it is determined at 330 that the current content is illegitimate content, strict control operations may be applied, e.g., prohibiting the editing or distributing of the content, prohibiting the accessing to the content, preventing a recipient from sending a response, etc.
  • similarly, if it is determined at 330 that the creator is an illegitimate user, strict control operations may be applied, e.g., prohibiting the creator from editing or distributing the content, prohibiting a recipient from accessing the content, preventing a recipient from sending a response, etc.
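  • A tier-to-control mapping of the kind described above might be sketched as follows; the operation identifiers are illustrative, not an exhaustive or authoritative list of the control operations.

```python
CONTROLS_BY_TIER = {
    "legitimate content": [],  # avoid applying strict controls
    "suspicious content": [
        "restrict_distribution",
        "request_question_modification",
        "delay_response_display",
        "prompt_sensitive_data",
    ],
    "illegitimate content": [
        "prohibit_editing",
        "prohibit_distribution",
        "prohibit_access",
        "block_responses",
    ],
}

def controls_for(tier):
    """Return the control operations for a tier (none for unknown tiers)."""
    return CONTROLS_BY_TIER.get(tier, [])

print(controls_for("suspicious content"))
```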
  • control operations performed at 340 may also consider historical content evaluation tiers and/or historical creator evaluation tiers.
  • the historical evaluation tier and the currently-determined evaluation tier of the content may be comprehensively considered to determine what control operations will be taken.
  • the historical evaluation tier and the currently-determined evaluation tier of the creator may be comprehensively considered to determine what control operations will be taken.
  • the performing order among the various steps in the process 300 discussed above may be changed in any approaches.
  • a part of the steps may be performed iteratively to further improve the accuracy of detection of the legitimacy of data collection or to improve appropriateness of the control operations.
  • the process 300 may return to step 320 to detect additional state information associated with the monitored event, e.g., to further obtain other state information that may facilitate to determine the evaluation tiers. Then, the previously-determined content evaluation tier and/or creator evaluation tier may be updated at 330 based on the additional state information.
  • the process 300 may return to step 310 to monitor additional events related to the content and/or a user that occur in the data collection service and/or an outside service, which may trigger further state information detection and evaluation tier determination. For example, after an additional event was monitored, additional state information associated with the additional event may be detected from the data collection service and/or the outside service at 320, and the previously-determined content evaluation tier and/or creator evaluation tier may be updated based at least on the additional state information at 330. In one case, after the control operations were performed at 340, the process 300 may return to step 310 to monitor events that occur after the control operations were applied, and to further trigger subsequent processing.
  • the process 300 may return to step 320 to detect additional state information after the control operations were applied, and to further trigger subsequent processing. In one case, after the control operations were performed at 340, the process 300 may return to step 330 to re-determine the evaluation tiers.
  • FIG.4 illustrates an exemplary deployment 400 of a legitimacy detection service for data collection according to an embodiment.
  • the legitimacy detection service for data collection 410 in FIG.4 may refer to a service capable of implementing the process 300 shown in FIG.3.
  • the legitimacy detection service for data collection 410 may include an event monitoring module 420, a state information detecting module 430, an evaluation tier determining module 440, a control performing module 450, a data storage unit 460, etc.
  • the event monitoring module 420 may perform the operation at 310 in FIG.3, e.g., monitoring events that occur in a data collection service 470 and/or an outside service 480.
  • the events monitored by the event monitoring module 420 may be stored into an event message queue in a data storage unit 460.
  • the event message queue may include messages about one or more monitored events. These event messages may be read in sequence to trigger further processing.
  • the state information detecting module 430 may perform the operation at 320 in FIG.3, e.g., in response to an event extracted from the event message queue, detecting state information related to the event from the data collection service 470 and/or the outside service 480.
  • the evaluation tier determining module 440 may perform the operation at 330 in FIG.3, e.g., determining a content evaluation tier and/or a creator evaluation tier based on the state information.
  • the control performing module 450 may perform the operation at 340 in FIG.3, e.g., performing control operations in response to at least the content evaluation tier and/or the creator evaluation tier determined by the evaluation tier determining module 440.
  • the control operations may be divided into frontend controls 452 and backend controls 454.
  • the frontend controls 452 may include various control operations for the data collection service 470.
  • the frontend controls 452 may at least affect a user's usage, experience, etc. of the data collection service.
  • the backend controls 454 may include various control operations outside the data collection service 470, e.g., sending notifications to an administrator of the data collection service 470 through email, sending notifications to the outside service 480, storing the content evaluation tier and/or the creator evaluation tier determined by the evaluation tier determining module 440 in the data storage unit 460, etc.
  • the data collection service 470 may provide a service interface 490 having an interaction function 492.
  • a content creator 402, a content recipient 404, and an administrator 406 of the data collection service may interact with the data collection service 470 through the service interface 490, e.g., creating content, accessing content, performing management, etc.
  • all the services, modules and architectures in the deployment 400 are exemplary, and the legitimacy detection service for data collection 410 may be deployed in any other approaches.
  • Although FIG.4 shows the legitimacy detection service for data collection 410 as being deployed outside the data collection service 470, the legitimacy detection service for data collection 410 may also be deployed within the data collection service 470, or a part of the legitimacy detection service for data collection 410 may be deployed within the data collection service 470, e.g., deploying the event monitoring module 420 and the control performing module 450 within the data collection service 470, deploying the event monitoring module 420 and the part of frontend controls of the control performing module 450 within the data collection service 470, etc.
  • Although FIG.4 shows that the legitimacy detection service for data collection 410 includes the control performing module 450, the control performing module 450 may also be omitted from the legitimacy detection service for data collection 410.
  • Although FIG.4 shows that the data storage unit 460 is included in the legitimacy detection service for data collection 410, the data storage unit 460 or a part thereof may also be separated from the legitimacy detection service for data collection 410.
  • Although FIG.4 shows that the events monitored by the event monitoring module 420 are transferred to the state information detecting module 430 through the event message queue in the data storage unit 460, the data storage unit 460 may also be omitted, in which case the event monitoring module 420 may provide the monitored events to the state information detecting module 430 directly.
  • FIG.5 illustrates a flowchart of an exemplary method 500 for detecting legitimacy of data collection according to an embodiment.
  • the data collection may be implemented through processing content related to the data collection by a user in a data collection service.
  • At 510, at least one event that occurs in the data collection service and/or at least one outside service may be monitored, the event being associated with the content and/or the user.
  • At 520, state information associated with the event may be detected from the data collection service and/or the outside service.
  • At 530, a content evaluation tier and/or a creator evaluation tier may be determined based on the state information, the content evaluation tier corresponding to legitimacy of the content and the creator evaluation tier corresponding to legitimacy of a creator of the content.
  • the content may comprise at least one of: form, email, webpage, and productivity tool document.
  • the data collection service may support processing of the content.
  • the outside service is different from the data collection service, and may comprise at least one of: an email service, a browser service, a security detection service for an operating system, a cloud service, and a social media service.
  • the monitoring at least one event may comprise: receiving an indication of the event from the outside service.
  • the user may be the creator or a recipient of the content.
  • in an implementation, the data collection service is a survey form service, the content is a form, the user is a creator of the form, the legitimacy of the content is associated with whether the content is a phishing form, and the legitimacy of the creator is associated with whether the creator is a phisher.
  • the state information may comprise at least one of: information associated with the content in the data collection service; information associated with behaviors of the user in the data collection service; administrative information in the data collection service; information associated with the content in the outside service; and information associated with behaviors of the creator in the outside service.
  • the content evaluation tier may comprise at least one of: legitimate content, suspicious content, and illegitimate content.
  • the creator evaluation tier may comprise at least one of: good user, normal user, suspicious user, and illegitimate user.
  • the content evaluation tier and/or the creator evaluation tier may be determined through at least one of: an evaluation rule-based approach; an evaluation score-based approach; and a FXAM-based approach.
  • the evaluation rule-based approach may comprise: determining that the state information matches with at least one evaluation rule; and determining the content evaluation tier and/or the creator evaluation tier based on the at least one evaluation rule.
  • the evaluation score-based approach may comprise: obtaining an evaluation score through calculating a weighted sum of confidences of the state information; and determining the content evaluation tier and/or the creator evaluation tier based on the evaluation score.
  • the FXAM-based approach may comprise: obtaining at least one of numerical features, categorical features and temporal features in the state information; and predicting, through the FXAM, the content evaluation tier and/or the creator evaluation tier based on the obtained features.
  • the method 500 may further comprise: selecting or updating features in a feature set adopted by the FXAM based on the explainability of the FXAM with respect to those features.
  • the FXAM may be trained through a three-stage iteration, the three-stage iteration comprising three stages corresponding to numerical features, categorical features and temporal features respectively.
  • the training is accelerated through optimizing at least one of the three stages.
  • the method 500 may further comprise: performing at least one control operation in response to at least the content evaluation tier and/or the creator evaluation tier.
  • the control operation may comprise at least one of: applying usage restrictions to the content in the data collection service; applying behavior restrictions to the user in the data collection service; presenting prompts in the data collection service; sending a notification to an administrator of the data collection service; sending a notification to the outside service; and storing the content evaluation tier and/or the creator evaluation tier.
  • the performing at least one control operation may be further based on historical content evaluation tiers and/or historical creator evaluation tiers.
  • the method 500 may further comprise: detecting additional state information associated with the event from the data collection service and/or the outside service; and updating the content evaluation tier and/or the creator evaluation tier based at least on the additional state information.
  • the method 500 may further comprise: monitoring at least one additional event occurring in the data collection service and/or the outside service, the additional event being associated with the content and/or the user; detecting additional state information associated with the additional event from the data collection service and/or the outside service; and updating the content evaluation tier and/or the creator evaluation tier based at least on the additional state information.
  • the method 500 may further comprise any steps/processes for detecting legitimacy of data collection according to the embodiments of the present disclosure as mentioned above.
  • FIG.6 illustrates an exemplary apparatus 600 for detecting legitimacy of data collection according to an embodiment.
  • the data collection may be implemented through processing content related to the data collection by a user in a data collection service.
  • the apparatus 600 may comprise: an event monitoring module 610, for monitoring at least one event occurring in the data collection service and/or at least one outside service, the event being associated with the content and/or the user; a state information detecting module 620, for detecting, in response to the event, state information associated with the event from the data collection service and/or the outside service; and an evaluation tier determining module 630, for determining a content evaluation tier and/or a creator evaluation tier based on the state information, the content evaluation tier corresponding to legitimacy of the content and the creator evaluation tier corresponding to legitimacy of a creator of the content.
  • FIG.7 illustrates an exemplary apparatus 700 for detecting legitimacy of data collection according to an embodiment.
  • the data collection may be implemented through processing content related to the data collection by a user in a data collection service.
  • the apparatus 700 may comprise at least one processor 710 and a memory 720 storing computer-executable instructions.
  • the at least one processor 710 may: monitor at least one event occurring in the data collection service and/or at least one outside service, the event being associated with the content and/or the user; in response to the event, detect state information associated with the event from the data collection service and/or the outside service; and determine a content evaluation tier and/or a creator evaluation tier based on the state information, the content evaluation tier corresponding to legitimacy of the content and the creator evaluation tier corresponding to legitimacy of a creator of the content.
  • the at least one processor 710 may be further configured for performing any operations of the methods for detecting legitimacy of data collection according to the embodiments of the present disclosure as mentioned above.
  • the embodiments of the present disclosure may be embodied in a non-transitory computer-readable medium.
  • the non-transitory computer-readable medium may comprise instructions that, when executed, cause one or more processors to perform any operations of the methods for detecting legitimacy of data collection according to the embodiments of the present disclosure as mentioned above.
  • processors have been described in connection with various apparatuses and methods. These processors may be implemented using electronic hardware, computer software, or any combination thereof. Whether such processors are implemented as hardware or software will depend upon the particular application and overall design constraints imposed on the system.
  • a processor, any portion of a processor, or any combination of processors presented in the present disclosure may be implemented with a microprocessor, microcontroller, digital signal processor (DSP), a field-programmable gate array (FPGA), a programmable logic device (PLD), a state machine, gated logic, discrete hardware circuits, and other suitable processing components configured to perform the various functions described throughout the present disclosure.
  • the functionality of a processor, any portion of a processor, or any combination of processors presented in the present disclosure may be implemented with software being executed by a microprocessor, microcontroller, DSP, or other suitable platform.
  • Software shall be construed broadly to mean instructions, instruction sets, code, code segments, program code, programs, subprograms, software modules, applications, software applications, software packages, routines, subroutines, objects, threads of execution, procedures, functions, etc.
  • the software may reside on a computer-readable medium.
  • a computer-readable medium may include, by way of example, memory such as a magnetic storage device (e.g., hard disk, floppy disk, magnetic strip), an optical disk, a smart card, a flash memory device, random access memory (RAM), read only memory (ROM), programmable ROM (PROM), erasable PROM (EPROM), electrically erasable PROM (EEPROM), a register, or a removable disk.
  • although memory is shown separate from the processors in the various aspects presented throughout the present disclosure, the memory may be internal to the processors, e.g., as a cache or register.

Abstract

The present disclosure provides methods and apparatuses for detecting legitimacy of data collection. The data collection may be implemented through processing content related to the data collection by a user in a data collection service. At least one event occurring in the data collection service and/or at least one outside service may be monitored, the event being associated with the content and/or the user. In response to the event, state information associated with the event may be detected from the data collection service and/or the outside service. A content evaluation tier and/or a creator evaluation tier may be determined based on the state information, the content evaluation tier corresponding to legitimacy of the content and the creator evaluation tier corresponding to legitimacy of a creator of the content.

Description

DETECTING LEGITIMACY OF DATA COLLECTION
BACKGROUND
[0001] With the development of Internet technologies, people may collect data of interest over networks more conveniently. Data collection may be performed through various approaches such as forms, emails, webpages, productivity tool documents, etc. Herein, a data collection service may broadly refer to various services, applications, software, websites, etc. capable of implementing data collection or having a function of data collection. For example, the survey form service is a type of dedicated data collection service for collecting data through forms. Moreover, data collection may also be performed in services not dedicated to data collection, e.g., collecting data through emails in an email service, collecting data through webpages in a browser service, collecting data through productivity tool documents in a productivity tool, etc. All these services capable of achieving data collection may be collectively referred to as data collection services.
SUMMARY
[0002] This Summary is provided to introduce a selection of concepts that are further described below in the Detailed Description. It is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
[0003] Embodiments of the present disclosure propose methods and apparatuses for detecting legitimacy of data collection. The data collection may be implemented through processing content related to the data collection by a user in a data collection service. At least one event occurring in the data collection service and/or at least one outside service may be monitored, the event being associated with the content and/or the user. In response to the event, state information associated with the event may be detected from the data collection service and/or the outside service. A content evaluation tier and/or a creator evaluation tier may be determined based on the state information, the content evaluation tier corresponding to legitimacy of the content and the creator evaluation tier corresponding to legitimacy of a creator of the content.
[0004] It should be noted that the above one or more aspects comprise the features hereinafter fully described and particularly pointed out in the claims. The following description and the drawings set forth in detail certain illustrative features of the one or more aspects. These features are only indicative of the various ways in which the principles of various aspects may be employed, and this disclosure is intended to include all such aspects and their equivalents.
BRIEF DESCRIPTION OF THE DRAWINGS
[0005] The disclosed aspects will hereinafter be described in connection with the appended drawings that are provided to illustrate and not to limit the disclosed aspects.
[0006] FIG.1 illustrates an exemplary process of illegitimate data collection.
[0007] FIG.2 illustrates exemplary content involved in illegitimate data collection.
[0008] FIG.3 illustrates an exemplary process for detecting legitimacy of data collection according to an embodiment.
[0009] FIG.4 illustrates an exemplary deployment of a legitimacy detection service for data collection according to an embodiment.
[0010] FIG.5 illustrates a flowchart of an exemplary method for detecting legitimacy of data collection according to an embodiment.
[0011] FIG.6 illustrates an exemplary apparatus for detecting legitimacy of data collection according to an embodiment.
[0012] FIG.7 illustrates an exemplary apparatus for detecting legitimacy of data collection according to an embodiment.
DETAILED DESCRIPTION
[0013] The present disclosure will now be discussed with reference to several example implementations. It is to be understood that these implementations are discussed only for enabling those skilled in the art to better understand and thus implement the embodiments of the present disclosure, rather than suggesting any limitations on the scope of the present disclosure.
[0014] Data collection services may be used by malicious users for illegitimate-purpose data collections, thus there is a risk of abusing the data collection services. For example, the data collection services may be maliciously used for collecting personal privacy or sensitive data, collecting commercial secrets, broadcasting inappropriate content, etc., and the collected data may be used for financial crimes, reputation damage, network attacks, etc. Illegitimate-purpose data collections will tremendously damage the interests of providers and legitimate users of data collection services.
[0015] Take phishing, a common network attack, as an example. Usually, a phisher will conduct illegitimate-purpose data collections, e.g., collecting privacy or sensitive data such as log-in accounts and passwords, bank card numbers, credit card numbers, home addresses, commercial information of companies, etc. For example, the survey form service is a common data collection service used by phishers. Through the survey form service, a phisher may create and distribute illegitimate-purpose forms, and obtain information provided by responders. Existing techniques for phishing detection may be divided into a proactive approach and a passive approach. Commonly-used techniques for proactively detecting phishing attacks may comprise, e.g., page content examination, blacklist-based detection, etc. These techniques merely focus on a specific stage of phishing attack activity, e.g., the stage of creating a form, and the detection is based on static information, e.g., strings, images, etc. contained in the form. Thus, these techniques can only achieve limited detection accuracy, and face the problem of false positives. Passive phishing detection generally relies on users reporting phishing situations; however, this suffers from the limited number of user reports and cannot identify a phishing attack in a timely manner.
[0016] Embodiments of the present disclosure propose effective detection of legitimacy of data collection, and accordingly timely controls or interventions for illegitimate data collections. Herein, legitimacy of data collection may broadly refer to whether a data collection has a legitimate purpose, is legal, reasonable, unmalicious, non-abused, conforms to ethics, does not damage individuals’ or entities’ interests, etc.
[0017] Data collection may be implemented through processing content related to the data collection by a user in a data collection service. The content may comprise various digital information forms capable of performing data collection, e.g., form, email, webpage, productivity tool document, etc. Accordingly, the data collection service may be various services supporting the processing of the content, e.g., a survey form service, an email service, a browser service, a productivity tool, etc. A creator of the content may create the content for collecting data in the data collection service, and responders of the content may fill information into the content to provide data in the data collection service, wherein the responders may refer to recipients responding to the content among recipients having received the content.
[0018] The embodiments of the present disclosure may adopt various types of information from the data collection service and/or from outside services that are different from the data collection service, to trigger and perform detection of legitimacy of data collection.
[0019] The legitimacy of data collection may be evaluated through an evaluation tier mechanism. In the evaluation tier mechanism, a content evaluation tier indicating the legitimacy of the content may be determined, which may also be referred to as a content reputation tier. The legitimacy of the content may refer to whether the content itself is legitimate, e.g., whether it involves personal privacy information collection, etc. In the evaluation tier mechanism, a creator evaluation tier indicating legitimacy of the content creator may also be determined, which may also be referred to as a creator reputation tier. The legitimacy of the creator may refer to whether the creator has a legitimate purpose, e.g., being a phisher or not, etc.
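As an illustrative sketch only (the class and field names are assumptions, not part of the disclosure), the two-tier mechanism described above could be modeled with enumerations whose members follow the tier labels enumerated in this disclosure (legitimate/suspicious/illegitimate content; good/normal/suspicious/illegitimate user):

```python
from dataclasses import dataclass
from enum import Enum

class ContentEvaluationTier(Enum):
    """Legitimacy tiers for content (labels follow the disclosure)."""
    LEGITIMATE = "legitimate content"
    SUSPICIOUS = "suspicious content"
    ILLEGITIMATE = "illegitimate content"

class CreatorEvaluationTier(Enum):
    """Legitimacy tiers for a content creator (labels follow the disclosure)."""
    GOOD = "good user"
    NORMAL = "normal user"
    SUSPICIOUS = "suspicious user"
    ILLEGITIMATE = "illegitimate user"

@dataclass
class Evaluation:
    """Pairs the two tiers produced for one piece of content and its creator."""
    content_tier: ContentEvaluationTier
    creator_tier: CreatorEvaluationTier

# Example: a form judged suspicious, created by an otherwise normal user.
example = Evaluation(ContentEvaluationTier.SUSPICIOUS, CreatorEvaluationTier.NORMAL)
print(example.content_tier.value)  # → suspicious content
```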
[0020] Various types of information that facilitate determining the content evaluation tier and the creator evaluation tier may be detected or collected from the data collection service and/or the outside services. Information that facilitates determining evaluation tiers may be extracted from each stage in an overall lifecycle of data collection. The determination of an evaluation tier may consider information associated with the content and/or behaviors of the user in the data collection service and/or the outside services, administrative information in the data collection service, etc.
[0021] Various types of information detected or collected from the data collection service and/or the outside services may be considered comprehensively to determine the content evaluation tier and the creator evaluation tier, and accordingly determine legitimacy of the content and legitimacy of the creator. Various approaches are proposed for determining the content evaluation tier and the creator evaluation tier, e.g., an evaluation rule-based approach, an evaluation score-based approach, a Fast Explainable Additive Model (FXAM)-based approach, etc. The FXAM proposed by the embodiments of the present disclosure is a unified model adopting numerical features, categorical features, temporal features, etc. for predicting evaluation tiers. It is proposed to train the FXAM through a three-stage iteration, and accelerate the training process of the FXAM through a series of optimizations. Explainability of the FXAM may be effectively used for selecting features in a feature set, or continuously updating the features in the feature set. The proposed training approach facilitates rapid re-training or updating of the FXAM.
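The evaluation rule-based and evaluation score-based approaches described above can be pictured with a minimal sketch. The rule predicates, signal names, weights, and tier thresholds below are invented for illustration; they are not values given by the disclosure.

```python
# Hypothetical state information for one piece of content: each signal
# carries a confidence in [0, 1]. Keys are illustrative assumptions.
state = {
    "title_has_sensitive_words": 0.9,
    "asks_for_password": 0.8,
    "creator_blacklisted_outside": 0.0,
}

# Rule-based approach: if the state matches a rule, that rule dictates the tier.
RULES = [
    (lambda s: s["creator_blacklisted_outside"] > 0.5, "illegitimate content"),
    (lambda s: s["asks_for_password"] > 0.5, "suspicious content"),
]

def rule_based_tier(state, default="legitimate content"):
    for predicate, tier in RULES:
        if predicate(state):
            return tier
    return default

# Score-based approach: a weighted sum of the confidences of the state
# information, thresholded into tiers (weights/thresholds are assumptions).
WEIGHTS = {
    "title_has_sensitive_words": 0.3,
    "asks_for_password": 0.4,
    "creator_blacklisted_outside": 0.3,
}

def evaluation_score(state):
    return sum(WEIGHTS.get(name, 0.0) * conf for name, conf in state.items())

def score_based_tier(score, suspicious_at=0.3, illegitimate_at=0.6):
    if score >= illegitimate_at:
        return "illegitimate content"
    if score >= suspicious_at:
        return "suspicious content"
    return "legitimate content"

print(rule_based_tier(state))                      # → suspicious content
print(score_based_tier(evaluation_score(state)))   # → suspicious content
```

Both approaches consume the same detected state information; a service could run them side by side and escalate when either flags the content.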
[0022] The embodiments of the present disclosure may conduct corresponding control operations in the data collection service and/or the outside services according to the determined content evaluation tier, creator evaluation tier, etc., to restrict or prevent occurrence of illegitimate data collection behaviors, protect the interests of legitimate users, etc. Taking phishing via a survey form service as an example, the embodiments of the present disclosure may identify a phishing form, identify a phisher, prevent a phisher from utilizing a form as a phishing tool, prevent a phishing form from being distributed for collecting privacy or sensitive data, help a recipient to identify phishing behaviors, assist an administrator to identify phishing behaviors so as to ensure the interests of legitimate users and data security, etc. In this example, the data collection service is a survey form service, the content is a form, the legitimacy of the content is associated with whether the content is a phishing form, and the legitimacy of the creator is associated with whether the creator is a phisher.
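One way to picture how determined tiers could drive control operations is a small dispatch function. The tier-to-operation mapping below is a hypothetical policy, though the individual operations mirror those listed in this disclosure (usage restrictions, behavior restrictions, prompts, notifications, storing tiers):

```python
def control_operations(content_tier: str, creator_tier: str) -> list[str]:
    """Select control operations from the two evaluation tiers.

    The mapping is an illustrative assumption; a real service would tune
    it to its own policies and to historical evaluation tiers.
    """
    ops = ["store evaluation tiers"]  # always keep a record for history
    if content_tier == "illegitimate content":
        ops += ["apply usage restrictions to the content",
                "notify the administrator",
                "notify outside services"]
    elif content_tier == "suspicious content":
        ops += ["present a warning prompt to recipients"]
    if creator_tier in ("suspicious user", "illegitimate user"):
        ops += ["apply behavior restrictions to the user"]
    return ops

print(control_operations("suspicious content", "illegitimate user"))
```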
[0023] It should be appreciated that although multiple parts of the following discussion take the phishing as an example of illegitimate data collection and take the survey form service as an example of data collection service, this is merely for the sake of explaining the basic concepts of the disclosure. The disclosure is not limited to these specific examples, but may cover any other types of data collection services and detection of any other types of illegitimate data collection.
[0024] FIG.1 illustrates an exemplary process 100 of illegitimate data collection. Illegitimate data collection usually has a lifecycle including multiple stages. The process 100 exemplarily illustrates this by taking a phisher performing phishing in a survey form service as an example.
[0025] At 110, a phisher may start a survey form service. For example, the phisher may want to collect sensitive data through a form, thereby entering the survey form service at 110. It should be appreciated that the survey form service may be provided through different approaches, e.g., through a webpage, through a client, etc. The phisher may log in to the survey form service with his account. In some cases, the phisher may also identify a target recipient before starting the survey form service, in order to conduct targeted phishing.
[0026] At 120, the phisher may create a phishing form in the survey form service. The form may include one or more questions intended for collecting, e.g., sensitive data. For example, the phisher may create, in the form, questions that enquire about log-in account and password, bank card number, credit card number, etc. Moreover, in order to deceive recipients more effectively, the phisher may impersonate a legitimate entity in the form, e.g., a company, an individual, a website, etc. The phisher may add disguised logos, fake trademarks, false emails, fake URLs, etc. into the form to make the created form look more like a legitimate-purpose form.
[0027] At 130, the phisher may distribute the created phishing form through various approaches. For example, the phisher may send the form to specific or unspecified recipients via emails. In order to further increase credibility, the phisher may attach some additional information to the email, e.g., false descriptions, disguised logos, fake trademarks, etc.
[0028] At 140, the phisher may collect sensitive data. For example, some recipients of the form may fill answers to questions and requested information in the form, and, as a responder, return the form to the phisher. Thus, the phisher may obtain sensitive data of the responder. Those responders who return sensitive data may also be deemed as victims.
[0029] At 150, the phisher may carry out various malicious behaviors with the collected sensitive data for achieving illegitimate purposes. For example, the phisher may use bank account information provided by a responder for stealing financial assets, use privacy information provided by a responder for initiating further network attacks, etc.
[0030] It should be appreciated that the process 100 only shows several exemplary stages in a lifecycle of an exemplary illegitimate data collection. In other cases, the process of illegitimate data collection may also include more or less other stages.
[0031] FIG.2 illustrates exemplary content 200 involved in illegitimate data collection. As an example, the content 200 is an exemplary form adopted by phishing. For example, the form may be created by a phisher at 120 in FIG.1.
[0032] The title of the form, "Please update your password", indicates that the form is intended for assisting a recipient in changing a password for a certain business. The phisher has created, in this form, questions for collecting sensitive data, e.g., "Enter your old password", "Enter a new password", etc. If a recipient gives answers to these questions, the phisher will obtain the desired sensitive data and may further carry out malicious behaviors.
[0033] It should be appreciated that FIG.2 only gives an example of content involved in the illegitimate data collection, and various other forms of illegitimate content may exist in actual scenarios.
[0034] FIG.3 illustrates an exemplary process 300 for detecting legitimacy of data collection according to an embodiment. The data collection may be implemented through processing content related to the data collection by a user in a data collection service. The user may be a creator or a recipient of the content. For example, the user, as a creator, may create, edit and distribute the content, and view a response result as a result of the data collection, in the data collection service. The user, as a recipient, may access the content, fill information into the content and return the content, in the data collection service.
[0035] At 310, at least one event occurring in a data collection service 302 and/or at least one outside service 304 may be monitored. Herein, an event may refer to occurrence of content, occurrence of user operations or behaviors, etc. in various services. Thus, the monitored event is associated with content and/or user. The outside service 304 may represent one or more outside services.
[0036] In one aspect, events occurring in the data collection service 302 may be monitored at 310. For example, a creator user may create new content in the data collection service 302, thus a "new content creation event" may be monitored. For example, a creator user may be modifying questions in the content in the data collection service 302, thus a "content modification event" may be monitored. For example, a recipient user may have received content in the data collection service 302, thus a "content receipt event" may be monitored. For example, a recipient user may be filling information into the content in the data collection service 302, thus a "content response event" may be monitored. For example, an administrator of the data collection service 302 may implement a management or control operation to the processing of the content in the background, thus a "management event" may be monitored. An administrator may refer to a person who conducts management or control for the operating of the data collection service in the background. In some cases, an administrator may also refer to a person who is assigned by an entity renting the data collection service and manages content. It should be appreciated that the events described above are only exemplary, and any other types of events may also be monitored in the data collection service 302.
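One way to picture the monitoring at 310 is a small publish/subscribe dispatcher, where the data collection service (and, as described later, outside services) pushes event indications and registered handlers react to them. The event names follow the examples above; the class itself is a hypothetical sketch, not an implementation from the disclosure.

```python
from collections import defaultdict

class EventMonitor:
    """Collects event indications from the data collection service and/or
    outside signal sources, and dispatches them to registered handlers."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, event_type, handler):
        """Register a handler for one event type."""
        self._handlers[event_type].append(handler)

    def notify(self, event_type, payload):
        """Called when a service reports an event; in the process of FIG.3
        this would trigger downstream state-information detection."""
        for handler in self._handlers[event_type]:
            handler(payload)

# Usage: react to a form being created in the survey form service.
monitor = EventMonitor()
seen = []
monitor.subscribe("new content creation event", seen.append)
monitor.notify("new content creation event", {"content_id": "form-123"})
print(seen)  # → [{'content_id': 'form-123'}]
```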
[0037] In one aspect, events occurring in the outside service 304 may be monitored at 310. An outside service may refer to any service other than the data collection service. For example, the outside service 304 may comprise: an email service, a browser service, a security detection service for an operating system, a cloud service, a social media service, etc. In an implementation, the monitoring of events at 310 may include receiving an indication of an event from the outside service 304. In this case, the outside service 304 may act as an outside signal source to provide outside signals or notifications about occurrence of events.
[0038] Taking the outside service 304 being an email service as an example, when the email service finds out that an email is embedded with content for data collection and is to be sent, the email service may generate an indication of this event. Thus, a "content distribution event" may be monitored at 310. Moreover, when the email service detects a junk mail that includes content for data collection, the email service may generate an indication of this event. Thus, a "junk content distribution event" may be monitored at 310.
[0039] Taking the outside service 304 being a browser service as an example, when a user is accessing content for data collection or a website that includes content for data collection through the browser service, the browser service may generate an indication of this event. Thus, a "content identification event" may be monitored at 310. Moreover, a security detection function is provided in some browser services, e.g., a SmartScreen function is provided in the Edge browser to detect security of websites or pages being browsed. When a user is accessing content for data collection or a website that includes content for data collection through the browser service, the security detection function may identify that the content includes malicious information, and the browser service may generate an indication of this event. Thus, a "malicious content identification event" may be monitored at 310.
[0040] Taking the outside service 304 being a security detection service for an operating system as an example, e.g., a firewall in an operating system, when the security detection service scans that certain software includes content for data collection or the content includes malicious information, the security detection service may generate an indication for this event. Thus, a "content identification event" or a "malicious content identification event" may be monitored at 310.
[0041] Taking the outside service 304 being a cloud service as an example, when the cloud service finds out that a certain online document, website, attachment, etc. includes content for data collection, the cloud service may generate an indication for this event. Thus, a "content identification event" may be monitored at 310. Moreover, a security detection function is provided in some cloud services to detect security of online documents, websites, attachments, etc. that use the cloud service. When the security detection function identifies content that includes malicious information, the cloud service may generate an indication for this event. Thus, a "malicious content identification event" may be monitored at 310.
[0042] Taking the outside service 304 being a social media service as an example, when the social media service scans that certain social information includes content for data collection or the content includes malicious information, the social media service may generate an indication for this event. Thus, a "content identification event" or a "malicious content identification event" may be monitored at 310.
[0043] It should be appreciated that the outside services and events described above are all exemplary, and any other types of events may also be monitored in any other outside services.
[0044] If an event is monitored in the data collection service or the outside service at 310, the event may trigger subsequent operations in the process 300; thus, the operation of monitoring an event at 310 may also be regarded as a triggering operation.
[0045] At 320, in response to the monitored event, various types of state information associated with the event may be detected from the data collection service 302 and/or the outside service 304. Herein, state information may refer to various types of information that facilitate determining an evaluation tier of the content or content creator, and further determining legitimacy of the content or content creator. The state information may also be referred to as evidence. The same or different types of state information may be detected for different events. For different types of users, e.g., the creator and recipients, the same or different types of state information may also be detected.
[0046] In one aspect, detecting state information may include detecting various types of information associated with the content in the data collection service, e.g., whether a title of the content includes sensitive words, whether questions in the content are enquiring about sensitive data such as password, whether a logo in the content belongs to an entity different from the creator, the number of logos inserted in the content, the number of questions included in the content, whether links in the content are secure, distribution channel of the content, whether the content returned by responders includes a large amount of sensitive data, the total number of collected responses, etc.
[0047] In one aspect, detecting state information may include detecting various types of information associated with behaviors of a user in the data collection service. When the user is a creator, the state information may include, e.g., the time point at which the content was created, the length of time it took to create the content, whether the user has a history record of abusing the data collection service, the length of time the user stayed while editing a certain question, whether the user repeatedly edited a certain question, the number of times the user refreshed the response result, etc. When the user is a recipient, the state information may include, e.g., the total time the user took to fill answers, the length of time the user stayed on a certain question, etc. It should be appreciated that, optionally, the types of the detected state information associated with user behaviors may not be restricted by whether the current user is a creator or a recipient. For example, regardless of whether the current user is a creator or a recipient, the detected state information associated with user behaviors may broadly contain various types of state information associated with behaviors of the creator and/or the recipient.
[0048] In one aspect, detecting state information may include obtaining administrative information in the data collection service. The administrative information may be information corresponding to control measures taken by an administrator for the processing of the content. The administrative information may include, e.g., the number of reports by recipients of the content as illegitimate data collection, confirmation that the content involves collecting sensitive data, confirmation that questions in the content involve sensitive data, warning information sent to the creator, warning information sent to recipients, restrictions applied or removed on accessing or editing the content, etc.
[0049] In one aspect, detecting state information may include obtaining various types of information associated with behaviors of the content creator in the outside service. When a user logs in to the data collection service with his account, the account or related information may also be used for identifying the user in outside services. For example, the user may use the same account or associated accounts to log in to the data collection service, an email service, a social media, etc., thus the identity of the user may be identified in different services according to the accounts used by the user. It should be appreciated that the embodiments of the present disclosure are not limited to using accounts to identify the same user in different services, but may adopt any other means, e.g., through a terminal device used by the user. The state information associated with the behaviors of the content creator in the outside service may include, e.g., whether a URL of the creator has been identified as a malicious URL by the outside service, whether the creator has a low evaluation tier in the outside service, whether the creator has abusive behaviors in the outside service, whether the creator is in a blacklist in the outside service, etc.
[0050] In one aspect, detecting state information may include obtaining various types of information associated with the content in the outside service, e.g., judgment by the outside service on whether the content is malicious or not. In an implementation, an indication, provided by, e.g., a browser service, a security detection service for an operating system, a cloud service, a social media, etc., that the content includes malicious information may be used as the state information. It should be appreciated that the embodiments of the present disclosure may also use various other types of judgment information for the content provided by the outside service as the state information.
[0051] All the state information described above is exemplary, and the embodiments of the present disclosure may also include any other types of state information. Moreover, different types of related state information may be detected for different events or different types of users. For example, for a "new content creation event", various types of state information related to the content and/or the creator may be detected. For example, for a "content receipt event", various types of state information related to the content and/or a recipient may be detected.
[0052] It should be appreciated that the detection of state information may be performed at each stage in the entire lifecycle of data collection. For example, state information may be detected at each stage in a lifecycle of illegitimate data collection. Accordingly, the legitimacy of data collection may be finally determined at each stage of data collection through the process 300. At the stage of starting or logging in to the data collection service, e.g., at step 110 of starting the survey form service in FIG.1, at least state information associated with behaviors of the content creator in the outside service may be detected. At the stage of creating or editing the content, e.g., at step 120 of creating a phishing form in FIG.1, at least state information related to content creation or editing, state information related to creator behaviors, administrative information, etc. may be detected in the data collection service. At the stage of distributing the content, e.g., at step 130 of distributing the phishing form in FIG.1, at least state information associated with content distribution may be detected, e.g., information about content distribution channels, an indication, provided by an outside service used for distributing the content, of whether the content includes malicious information, etc. At the stage of collecting responses to the content from responders, e.g., at step 140 of collecting sensitive data in FIG.1, at least state information related to the content, state information related to responder behaviors, administrative information, etc. may be detected in the data collection service. At the stage of achieving an illegitimate purpose, e.g., at step 150 of carrying out malicious behaviors in FIG.1, at least administrative information in the data collection service, information about control operations that have been taken in the data collection service or outside services, etc. may be detected.
Since the process 300 may detect state information at each stage of data collection and further determine the legitimacy of the data collection, it facilitates more timely detection of illegitimate data collection.
[0053] At 330, a content evaluation tier and/or a creator evaluation tier may be determined based on the detected state information. The content evaluation tier corresponds to the legitimacy of the content. Different content evaluation tiers may reflect different degrees of the legitimacy of the content. For example, the content evaluation tier may be divided into legitimate content, suspicious content, and illegitimate content, wherein the legitimate content indicates that the content is used for legitimate purposes, the suspicious content indicates that the content may be used for illegitimate purposes, and the illegitimate content indicates that the content is determined to be used for illegitimate purposes. Moreover, for example, the content evaluation tier may also be simply divided into legitimate content and illegitimate content. It should be appreciated that the embodiments of the present disclosure may cover content evaluation tiers divided in any approaches. The creator evaluation tier corresponds to the legitimacy of the creator of the content. Different creator evaluation tiers may reflect different degrees of legitimacy of the creator. For example, the creator evaluation tier may be divided into good user, normal user, suspicious user, and illegitimate user, wherein the good user indicates that the user has a good usage record of the data collection service and has a very low possibility to conduct illegitimate data collection, the normal user indicates that the user has not been found conducting illegitimate data collection, the suspicious user indicates that the user is likely to conduct illegitimate data collection, and the illegitimate user indicates that it is determined that the user has conducted illegitimate data collection. Moreover, for example, the creator evaluation tier may also be simply divided into legitimate user and illegitimate user, wherein the legitimate user indicates that the user has not conducted illegitimate data collection. 
It should be appreciated that the embodiments of the present disclosure may cover creator evaluation tiers divided in any approaches.
[0054] Various approaches may be adopted for determining the content evaluation tier and/or the creator evaluation tier, e.g., an evaluation rule-based approach, an evaluation score-based approach, a FXAM-based approach, etc. Moreover, any combination of these approaches may also be used for determining an evaluation tier. For example, respective evaluation tier determination results may be obtained based on each approach, and then these evaluation tier determination results may be combined or merged to obtain a final evaluation tier. For example, the final evaluation result may be obtained through performing weighted summation on multiple evaluation tier determination results based on credibility of each approach. Moreover, optionally, an evaluation tier determination result obtained by the approach with the highest credibility may also be directly selected as the final evaluation tier.
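As a rough sketch of the combination described above, tier determinations from several approaches can be merged by credibility-weighted summation over an ordinal tier scale. The tier names, credibility values, and the `merge_tiers` helper below are illustrative assumptions, not part of the disclosure:

```python
TIERS = ["legitimate", "suspicious", "illegitimate"]   # ordinal tier scale

def merge_tiers(results: dict, credibility: dict) -> str:
    """Credibility-weighted average of per-approach tier indices."""
    total = sum(credibility[a] for a in results)
    score = sum(credibility[a] * TIERS.index(t) for a, t in results.items()) / total
    return TIERS[round(score)]

merged = merge_tiers(
    {"rule": "illegitimate", "score": "suspicious", "fxam": "suspicious"},
    {"rule": 0.6, "score": 0.2, "fxam": 0.2})
# the high-credibility rule-based result dominates: merged == "illegitimate"
```

Selecting the single result with the highest credibility, as also mentioned above, would simply be `results[max(credibility, key=credibility.get)]`.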
[0055] In the evaluation rule-based approach, one or more evaluation rules may be defined in advance, and each evaluation rule or a combination of multiple evaluation rules may correspond to a specific evaluation tier, e.g., a content evaluation tier and/or a creator evaluation tier. It may be determined whether the detected state information matches with at least one evaluation rule, and if yes, an evaluation tier corresponding to the at least one evaluation rule may be determined.
[0056] In an implementation, each evaluation rule may correspond to a combination of existence of multiple predetermined types of state information. For example, only when all the state information required by an evaluation rule is detected, it is determined that the evaluation rule is matched. Taking the state information "whether a question in the content is enquiring about sensitive data" as an example, if it is detected that the question is enquiring about sensitive data, it may be considered that the state information is detected; otherwise, it may be considered that the state information is not detected. It is assumed that an evaluation rule R1 corresponds to a set {S1, S2, S3} including three types of state information, and corresponds to a content evaluation tier "illegitimate content". If the state information detected at 320 includes the state information S1, S2 and S3, it may be considered that the detected state information matches with the evaluation rule R1, and an evaluation tier of illegitimate content may be given. If only the state information S2 and S3 is detected at 320, it may be considered that the evaluation rule R1 is not matched, thus the evaluation tier of illegitimate content cannot be given according to the evaluation rule R1.
[0057] In an implementation, each evaluation rule may correspond to a combination of values of all possible state information. For example, all possible types of state information may be defined in advance, and only when the values of the actually detected state information meet the value combination required by the evaluation rule, it may be determined that the evaluation rule is matched. Taking the state information "whether a question in the content is enquiring about sensitive data" as an example, if it is detected that the question is enquiring about sensitive data, the value of the state information may be set to 1, otherwise, the value may be set to 0. It is assumed that all possible n types of state information form a set {S1, ..., Sn}, an evaluation rule R2 requires that the set includes at least 3 types of state information with a value of 1, and the evaluation rule R2 corresponds to a user evaluation tier "illegitimate user". If, after the detection at 320, the set includes 4 types of state information with a value of 1, it may be considered that the evaluation rule R2 is matched, and an evaluation tier of illegitimate user may be given. If, after the detection at 320, the set includes only two types of state information with a value of 1, it may be considered that the evaluation rule R2 is not matched, and thus the evaluation tier of illegitimate user cannot be given according to the evaluation rule R2.
[0058] It should be appreciated that only examples of evaluation rules are given above, and any other forms of evaluation rules may be defined according to specific application requirements. Moreover, in some cases, when the state information detected at 320 satisfies multiple evaluation rules simultaneously and multiple evaluation tiers are determined accordingly, the worst evaluation tier, as an example, may be used as a final evaluation tier determined by the evaluation rule-based approach.
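The two rule styles in [0056]-[0057] and the worst-tier selection in [0058] can be sketched as follows; the rule contents, tier names, and helper functions are hypothetical illustrations:

```python
TIER_ORDER = ["legitimate", "suspicious", "illegitimate"]   # worst tier last

def match_presence_rule(detected: set, required: set) -> bool:
    """Style 1 ([0056]): every required type of state information was detected."""
    return required <= detected

def match_count_rule(values: dict, min_positive: int) -> bool:
    """Style 2 ([0057]): at least min_positive state-information values equal 1."""
    return sum(values.values()) >= min_positive

def evaluate(detected: set, values: dict) -> str:
    matched = []
    if match_presence_rule(detected, {"S1", "S2", "S3"}):   # hypothetical rule R1
        matched.append("illegitimate")
    if match_count_rule(values, 3):                         # hypothetical rule R2
        matched.append("illegitimate")
    if not matched:
        matched.append("legitimate")
    return max(matched, key=TIER_ORDER.index)   # worst matched tier wins ([0058])

tier = evaluate({"S2", "S3"}, {"S1": 0, "S2": 1, "S3": 1, "S4": 0})
# R1 lacks S1 and only two values equal 1, so tier == "legitimate"
```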
[0059] In the evaluation score-based approach, a weighted sum of confidences of the detected state information may be calculated to obtain an evaluation score, and an evaluation tier may be determined based on the evaluation score, e.g., the content evaluation tier and/or creator evaluation tier. It is assumed that the i-th piece of state information among the n pieces of state information detected at 320 is denoted as Si. The state information Si may be assigned a weight Wi which represents the importance of the state information in determining the evaluation tier, and the sum of the weights of all the state information is equal to 1. Moreover, it is assumed that the state information Si has a confidence Ci which represents the degree of possibility that the state information is true, and ranges from 0% to 100%. For example, for the state information "a question in the content is enquiring about sensitive data", if its confidence is 80%, it represents a probability of 80% that the question is indeed enquiring about sensitive data. A weighted sum of the confidences of all the state information may be calculated through score = Σ_{i=1..n} Ci × Wi to obtain the evaluation score "score". The evaluation score may be further used for determining a corresponding evaluation tier. For example, multiple evaluation score intervals may be defined in advance, and each interval corresponds to an evaluation tier. Therefore, when the calculated evaluation score falls within a certain evaluation score interval, an evaluation tier corresponding to the interval may be used as the final evaluation tier determined by the evaluation score-based approach.
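A minimal sketch of this evaluation-score computation, assuming illustrative weights, confidences, and score intervals (the disclosure does not fix any particular interval boundaries):

```python
def evaluation_score(confidences, weights):
    """Weighted sum of state-information confidences; weights must sum to 1."""
    assert abs(sum(weights) - 1.0) < 1e-9
    return sum(c * w for c, w in zip(confidences, weights))

def tier_from_score(score):
    # hypothetical pre-defined intervals: [0, 0.3) / [0.3, 0.7) / [0.7, 1]
    for upper, tier in ((0.3, "legitimate"), (0.7, "suspicious"), (1.01, "illegitimate")):
        if score < upper:
            return tier

score = evaluation_score([0.8, 0.2, 0.9], [0.5, 0.3, 0.2])   # 0.8*0.5 + 0.2*0.3 + 0.9*0.2
tier = tier_from_score(score)
# score == 0.64 falls in the middle interval, so tier == "suspicious"
```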
[0060] In the FXAM-based approach, the FXAM proposed by the embodiments of the present disclosure may be adopted for predicting an evaluation tier, e.g., content evaluation tier and/or creator evaluation tier.
[0061] The existing Generalized Additive Model (GAM) has limitations in performing predictive analysis tasks. For example, the existing GAM cannot process multi-dimensional data including numerical features, categorical features, temporal features, etc., and the existing training strategy of the GAM leads to low training speed on a large-scale multi-dimensional data set with a large number of features. The FXAM proposed by the embodiments of the present disclosure may adopt a feature set including numerical features, categorical features, temporal features, etc., and the FXAM is a unified model capable of processing multi-dimensional data, and may be trained for predicting evaluation tiers. The FXAM may be trained through a three-stage iteration (TSI) training strategy, wherein the TSI includes three stages corresponding to numerical features, categorical features and temporal features respectively. Moreover, the training process of the FXAM may be accelerated by applying corresponding optimization for each stage in the TSI. The numerical features, the categorical features, the temporal features, etc. may come from the state information detected at 320.
[0062] The numerical features may refer to features characterized by numerical values, e.g., the number of logos inserted in the content, the number of questions included in the content, the number of times to refresh response results, the number of times to repeatedly modify a certain question, etc.
[0063] The categorical features may refer to features characterized by categorical attributes. For example, for the categorical feature "sensitive data enquired by questions in the content", its categorical attributes may include "password information", "bank account information", etc. For example, for the categorical feature "prompts sent to the creator", its categorical attributes may include "content is restricted to be sent", "the question involves sensitive data", etc. For example, for the categorical feature "whether the creator is in a blacklist of an outside service", its categorical attribute may include "yes", "no", etc.
[0064] The temporal features may refer to features characterized by a time point or duration, e.g., the time point when the content is created, the length of time stayed while editing a certain question, the time point when the first response is obtained, etc.
[0065] It should be appreciated that, optionally, in the case of adopting the FXAM for determining an evaluation tier, a special feature detection process for the feature set of the FXAM may be performed at 320. The feature detection process may detect corresponding state information by referring to which features are included in the feature set of the FXAM.
[0066] Some exemplary details about the FXAM are further discussed below.
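The three feature types described above can be illustrated with a toy assembly step; the state fields, category domains, and encoding order below are assumptions for illustration only:

```python
from datetime import datetime, timezone

# detected state information for one piece of content (illustrative values)
state = {
    "num_logos": 2, "num_questions": 8, "num_refreshes": 15,          # numerical
    "sensitive_data_type": "password information",                     # categorical
    "creator_blacklisted": "yes",                                      # categorical
    "created_at": datetime(2021, 2, 8, 3, 15, tzinfo=timezone.utc),    # temporal
}

# assumed categorical domains with pre-specified element indexes
CATEGORIES = {
    "sensitive_data_type": ["none", "password information", "bank account information"],
    "creator_blacklisted": ["no", "yes"],
}

numerical = [state["num_logos"], state["num_questions"], state["num_refreshes"]]
one_hot = []
for name, domain in CATEGORIES.items():      # unify all categoricals into one vector
    one_hot += [1 if state[name] == v else 0 for v in domain]
temporal = [state["created_at"].timestamp()]
# one_hot == [0, 1, 0, 0, 1]
```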
[0067] Given a multi-dimensional dataset D including n instances, it has p numerical features {x_1, ..., x_p}, q categorical features {z_1, ..., z_q}, one temporal feature t, and a response y. Here, for the sake of simplification, it is assumed that there is only one temporal feature, but in fact there may be more than one temporal feature. The response y may be an evaluation tier to be predicted, e.g., the content evaluation tier or the creator evaluation tier.
[0068] Regarding the categorical features, the domain of z_i is denoted as dom(z_i), and Q = Σ_i |dom(z_i)| is the total number of different values among the categorical features. It is assumed that the one-hot vector domain is O := {0,1}^Q, thus any instance in z_1~z_q may be denoted as a unique one-hot vector (o_1, ..., o_Q) ∈ O as long as pre-specified indexes are assigned to the elements in ∪_i dom(z_i).
[0069] Regarding the numerical features, for each j ∈ {1, ..., p}, the Hilbert space of a measurable function f_j(x_j) over the numerical feature x_j is denoted by H_j, so that E[f_j] = 0, E[f_j²] < ∞, and the inner product is ⟨f_j, f_j'⟩ = E[f_j f_j'].
[0070] Regarding the temporal features, the Hilbert space of a measurable function f(t) over the temporal feature is denoted by H_t. H_t has the same attributes as H_j. In order to identify a time section component, a time period of the time section component is denoted as a positive integer d > 1. Without loss of generality, the data of t is sorted as 0 = t_1 ≤ t_2 ≤ ... ≤ t_n = t_max, and it is assumed that {t_i − t_{i−1} | i = 2, ..., n} = {0, τ}. τ is referred to as a time gap. {t_1, ..., t_n} may be processed as discrete time points. T_φ := {t_i | (t_i/τ) mod d = φ} may be denoted as a phase-φ set, because its elements share the same time phase φ ∈ {0, 1, ..., d − 1} with respect to the data {t_1, ..., t_n}. It can be concluded that T_φ and T_φ' (φ ≠ φ') are mutually disjoint, and ∪_φ T_φ = {t_1, ..., t_n}.
[0071] It is assumed that D includes n realizations of a random variable Y at the p + q + 1 design values, and is denoted as D = {(x_i1, ..., x_ip, z_i1, ..., z_iq, t_i, y_i) | i = 1, ..., n}; then the model of the FXAM may be represented as:

y_i = β_0 + Σ_{j=1..p} f_j(x_ij) + Σ_{k=1..Q} β_k o_ik + T(t_i) + S(t_i) + ε_i    Equation (1)

wherein o_ik ∈ {0,1} is obtained through one-hot encoding from z_i1, ..., z_iq, f_j ∈ H_j, T ∈ H_t and S ∈ H_t are smoothers, and β_0, β_1, ..., β_Q are parameters to be learned from D. The overall equation is an additive model, which includes three parts corresponding to the modeling of numerical features, categorical features and temporal features respectively.
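The additive form of the FXAM model above (a sum of numerical shape functions, one-hot categorical weights, and trend plus time-section components) can be illustrated numerically; every shape function, weight, and component below is a toy assumption, not a trained model:

```python
import math

def fxam_predict(x, o, t, beta0, shape_fns, beta, trend, season):
    """y_hat = beta0 + sum_j f_j(x_j) + sum_k beta_k * o_k + T(t) + S(t)."""
    return (beta0
            + sum(f(v) for f, v in zip(shape_fns, x))
            + sum(bk * ok for bk, ok in zip(beta, o))
            + trend(t) + season(t))

y_hat = fxam_predict(
    x=[3.0, 10.0],                      # numerical features
    o=[0, 1, 0, 0, 1],                  # one-hot encoded categorical features
    t=6.0,                              # temporal feature
    beta0=0.1,
    shape_fns=[lambda v: 0.02 * v, lambda v: 0.01 * v],
    beta=[0.0, 0.3, 0.0, 0.0, 0.2],
    trend=lambda u: 0.01 * u,
    season=lambda u: 0.05 * math.sin(2 * math.pi * u / 7))
```

In a trained FXAM, each `shape_fns[j]` would be a learned smoother rather than a hand-written lambda.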
[0072] Regarding the modeling of numerical features, f_j models the distribution of x_j with respect to the response y as a univariate smooth function. This is similar to the standard Generalized Additive Model (GAM).
[0073] Regarding the modeling of categorical features, Σ_{k=1..Q} β_k o_ik is a parameterized form obtained by representing the categorical values z_i1, ..., z_iq in a Q-dimensional one-hot vector, and a weight β_k is assigned to each item o_ik. Since the value of item o_ik ∈ {0,1} indicates whether there is a category value, i.e., whether a certain categorical feature adopts a specific value, and the additivity among items is considered, β_k o_ik expresses the contribution of a certain categorical feature when it takes a specific value. The FXAM unifies all categorical features into a one-hot vector, which allows adopting momentum-like accelerations to improve training efficiency.
[0074] Regarding the modeling of temporal features, T(t_i) + S(t_i) decomposes the signal in the temporal feature t into a trend signal T and a time section signal S. Such decomposition expresses multi-angle signals from a single feature, which is helpful for predictive analysis. It should be appreciated that this equation may be extended to cover multiple types of time series signals, e.g., abrupt changes, stochastic drifts, etc. The equation described above does not treat a temporal feature as a numerical feature, which avoids that the multi-angle signals are composed together and thus become less intuitive or cannot be conveyed appropriately.
[0075] The loss function L(f_1, ..., f_p, β, T, S) may be represented as:

L = ||y − Σ_j f_j − Zβ − T − S||² + λ Σ_j ∫ f_j''(x)² dx + λ_z ||β||² + λ_T ∫ T''(t)² dt + λ_S Σ_φ ∫ S^(φ)''(t)² dt    Equation (2)

wherein all the bolded items are n × 1 vectors, and Z is a n × Q design matrix corresponding to one-hot encoding of categorical features. S^(φ) is a smoothing function defined on T_φ, which is a sub-domain of {t_1, ..., t_n}. S = Σ_φ S^(φ) indicates that the time section component S domain-merges all the sub-components S^(φ). λ, λ_z, λ_T, λ_S are pre-specified hyper-parameters. ||·||² is the total square error. λ ∫ f_j''(x)² dx is a regularization for f_j. Here, λ trades off the smoothness of f_j and its closeness to fit.
[0076] The ℓ2 regularization λ_z ||β||² is proposed to avoid potential collinearity among the one-hot variables derived from categorical features.
[0077] Besides the regularization for the trend component T, the time section component S is divided into d phase-equivalent sub-components S^(φ), and regularization is applied to each S^(φ). The smoothness of S itself is omitted, because T mainly carries this type of information. Restricting smoothness within each phase-equivalent domain may better express the repeating pattern of the time section signal.
[0078] The optimization scheme for minimizing a square error with regularization of the form λ ∫ f''(x)² dx may be cubic spline smoothing with knots at each x_1j, ..., x_nj, and then the loss function may be rewritten in matrix form, with f_z, f_T and f_S used as alternative expressions of Zβ, T and S, which is convenient for subsequent discussions. K_j is a n × n input matrix, which is pre-calculated with the values x_1j, ..., x_nj. K_T is calculated in the same way. [The rewritten matrix form of the loss function could not be recovered from the source.]
[0079] K_S is a n × n matrix obtained by applying a cubic spline smoother over each phase set T_φ, with the indexes reordered by permutation matrices, i.e., K_S = Σ_φ P_φᵀ K_φ P_φ, wherein K_φ is the matrix with respect to cubic spline smoothing over the knots in T_φ, and P_φ is a permutation matrix that maps the indexes of these knots into the original indexes in {t_1, ..., t_n}.
[0080] In order to find the minimization scheme of L, the following FXAM's normal equations are proposed:

f_j = K_j (y − Σ_{k≠j} f_k − f_z − f_T − f_S), j = 1, ..., p
f_z = Z (Zᵀ Z + λ_z I)⁻¹ Zᵀ (y − Σ_j f_j − f_T − f_S)
f_T = K_T (y − Σ_j f_j − f_z − f_S)
f_S = Pᵀ diag(K_0, ..., K_{d−1}) P (y − Σ_j f_j − f_z − f_T)

wherein P = (P_0, ..., P_{d−1}) is the overall permutation matrix, which maps the indexes of elements from {t_1, ..., t_n} to the indexes of elements in {T_0, ..., T_{d−1}}.
[0081] The FXAM's normal equations satisfy the stationarity conditions of L. The FXAM's normal equations are necessary conditions for optimality of L. Solutions of the FXAM's normal equations exist and are globally optimal.
[0082] Three-stage iteration (TSI) is proposed to process the three types of features of the FXAM respectively. Table 1 below shows an exemplary process of the TSI.
[The algorithm listing of Table 1 (the exemplary TSI training process) could not be recovered from the source; its lines 1.2-1.5, 1.8 and 1.10 are referenced in the text below.]
Table 1
[0083] As shown in Table 1, the TSI is an iterative training process with three stages, and the three stages are applied for the training of numerical features, categorical features and temporal features respectively. In each stage, the TSI fixes the parameters of the other stages and performs training on the parameters of interest until local convergence, as shown in lines 1.5, 1.8 and 1.10, and iterations are performed over the three stages until global convergence. The TSI will converge to a solution of the FXAM's normal equations.
[0084] The training strategy of each stage is flexible. Taking Stage 2 as an example, it needs to solve β = (Zᵀ Z + λ_z I)⁻¹ Zᵀ (y − Σ_j f_j − T − S); accordingly, β may be directly calculated by matrix inversion/multiplication, or β may be estimated by gradient descent. Such flexibility of the training strategy provides space for improving training efficiency. Exemplary optimizations performed for each stage will be discussed below, which are for improving training efficiency and thereby accelerating the overall FXAM training.
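The two Stage-2 strategies mentioned above (direct matrix solve versus gradient descent) can be sketched on toy data; the design matrix `Z`, the residual `r`, and `lam_z` below are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, Q = 50, 4
Z = rng.integers(0, 2, size=(n, Q)).astype(float)  # one-hot-style design matrix
r = rng.normal(size=n)                             # residual y - sum_j f_j - T - S
lam_z = 0.1

# direct solve: beta = (Z^T Z + lam_z I)^(-1) Z^T r
beta = np.linalg.solve(Z.T @ Z + lam_z * np.eye(Q), Z.T @ r)

# plain gradient descent on ||r - Z b||^2 / 2 + lam_z ||b||^2 / 2 reaches the same point
b = np.zeros(Q)
step = 1.0 / np.linalg.eigvalsh(Z.T @ Z + lam_z * np.eye(Q)).max()
for _ in range(5000):
    b -= step * (Z.T @ (Z @ b - r) + lam_z * b)

assert np.allclose(b, beta, atol=1e-6)
```

The iterative route is the one that Stage 2 accelerates with Nesterov momentum, as discussed later.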
[0085] Regarding Stage 1, training efficiency may be improved by improving the efficiency of back-fitting. The process in lines 1.2-1.5 in Table 1 follows a standard back-fitting process, which includes three parts: smoothing over the value pairs {(x_ij, y_i) | i = 1, ..., n}; iterating over a fixed order of features to perform smoothing; and initializing the smoothing functions f_j = 0 before iteration. All these three parts may be modified to improve training efficiency.
[0086] In one aspect, fast kernel smoothing approximation may be adopted. The smoothing task here is with n input samples which are also evaluation points. Cubic spline smoothing is with O(n) time complexity on this task, but is still expensive in terms of massive operations. The fast kernel smoothing method may achieve O(n) complexity with small coefficients. The core idea may be referred to as fast sum updating in which, given a polynomial kernel, this method pre-calculates a cumulative sum of each item of the polynomial kernel on the evaluation points, and uses these cumulative sums to perform one-shot scanning over the evaluation points to complete the task. In one implementation, the Epanechnikov kernel may be selected, and a fast sum updating algorithm may be used for approximating the original cubic spline smoothing, so as to reduce operations with almost no loss of accuracy.
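A hedged sketch of the fast-sum-updating idea with an Epanechnikov kernel: after sorting, the kernel-weighted sums at every evaluation point reduce to differences of cumulative sums of monomial terms plus a window lookup. The bandwidth and test signal are assumptions, and the original algorithm's exact bookkeeping may differ:

```python
import numpy as np

def epanechnikov_smooth(x, y, h):
    """Kernel smoothing at all data points via cumulative ("fast") sums."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    # cumulative sums (with a leading zero) of each monomial term
    terms = {"1": np.ones_like(xs), "x": xs, "xx": xs * xs,
             "y": ys, "xy": xs * ys, "xxy": xs * xs * ys}
    P = {k: np.concatenate(([0.0], np.cumsum(v))) for k, v in terms.items()}
    lo = np.searchsorted(xs, xs - h, side="left")    # window [x0-h, x0+h]
    hi = np.searchsorted(xs, xs + h, side="right")
    win = lambda k: P[k][hi] - P[k][lo]              # windowed sums in O(1) each
    # Epanechnikov weight 1 - ((xi - x0)/h)^2, expanded in powers of xi
    a, b, c = 1 - xs * xs / h**2, 2 * xs / h**2, 1 / h**2
    num = a * win("y") + b * win("xy") - c * win("xxy")
    den = a * win("1") + b * win("x") - c * win("xx")
    out = np.empty_like(xs)
    out[order] = num / den                           # restore original order
    return out

x = np.linspace(0.0, 1.0, 200)
y = np.sin(2 * np.pi * x) + 0.1 * np.random.default_rng(1).normal(size=x.size)
s = epanechnikov_smooth(x, y, h=0.1)
```

After the initial sort, the smoothing pass itself is linear in n, which is the point of the fast-sum-updating trick.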
[0087] In one aspect, dynamically-adjusted feature order iteration may be performed. Instead of iterating over a fixed order of numerical features, the dynamically-adjusted feature order iteration may be adopted for further accelerating convergence speed. Intuitively, it is desired to evaluate features with higher predictive power earlier, because the loss may be reduced more and thus converges faster. The time cost for evaluating the predictive power of each feature should be lightweight. Here, a lightweight estimator with theoretical guidance is proposed to estimate the predictive power of each feature, and is used for dynamically ordering the features.
[0088] For a numerical feature x, the ground truth smoothing function is denoted as F(x), i.e., y = F(x) + ε, wherein ε is independent and identically distributed random error with E[ε] = 0 and Var[ε] = σ², and F satisfies |F(x) − F(x')| ≤ L·|x − x'| (Lipschitz condition), wherein L is referred to as sharpness. By using kernel smoothing with a kernel K and a bandwidth h, a smoothed curve may be given by f̂(x) = Σ_i K((x − x_i)/h) y_i / Σ_i K((x − x_i)/h).
[0089] In order to estimate σ² accurately, it is assumed that F(x) is linear, and then σ̂² = TSS·(1 − r²)/n may be obtained, wherein TSS = Σ_i (y_i − ȳ)² is a global constant and r is a Pearson correlation coefficient. Therefore, L and r may be estimated in one scan of the data, and the upper boundary of the loss may be accurately estimated before smoothing is actually applied. This estimated upper boundary of the loss is defined as an estimation of predictive power to prioritize features, which conforms to the following: the higher the Pearson correlation coefficient r and the smaller the sharpness L, the smaller the residuals tend to be.
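The feature-prioritization idea can be sketched as follows. Since the exact estimator formula did not survive extraction, the score below is only an assumption that follows the stated monotonicity (higher r and smaller sharpness L mean higher predictive power); `predictive_power` and its crude slope-based L estimate are hypothetical:

```python
import numpy as np

def predictive_power(x, y):
    """One-scan score: higher |r| and lower estimated sharpness -> higher power."""
    r = np.corrcoef(x, y)[0, 1]
    order = np.argsort(x)
    dx = np.diff(x[order])
    dy = np.diff(y[order])
    L_hat = np.abs(dy[dx > 0] / dx[dx > 0]).mean()   # crude Lipschitz estimate
    return r * r / (1.0 + L_hat)

rng = np.random.default_rng(2)
x1 = rng.uniform(size=300)
x2 = rng.uniform(size=300)
y = 3 * x1 + rng.normal(scale=0.1, size=300)   # y depends on x1 only
feats = {"x1": x1, "x2": x2}
order = sorted(feats, key=lambda name: -predictive_power(feats[name], y))
# order[0] == "x1": the informative feature is smoothed first
```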
[0090] In one aspect, intelligent sampling may be performed. A better initialization, rather than the zero function, will reduce the number of iterations. Here, an iteration is a complete cycle over the features x_1~x_p. A random sampling strategy may be adopted for better initializing f_j. Sample-size determination needs to be performed, i.e., to control sample variation, and the sample size should be guided by data characteristics rather than being a fixed number.
[0091] Sample variation is the difference between a smoothing function over all data points f_n(x) and a smoothing function over sampled data points f_s(x). f_n(x) or f_s(x) may be regarded as samples, taken from a ground-truth function F(x), with sample sizes n and s respectively. For any kernel smoother, the deviation of the smoothed curve from F(x) may be bounded in terms of the sample size, the error variance σ² and the sharpness L; therefore, the upper boundary of the sample variation may be estimated accordingly. [The closed-form bound in the original could not be recovered from the source.]
[0092] In order to control the sample variation for different features, a per-feature sample size s_i may be derived from this bound for each feature x_i. In order to control the sample variation of every feature simultaneously, s* = max_i s_i is used as the sample size applied for all the features, wherein γ is a hyper-parameter of the derivation. [The closed-form expressions for s_i and s* in the original could not be recovered from the source.]
[0093] Regarding Stage 2, training efficiency may be improved by performing Nesterov gradient acceleration. The task of smoothing over all one-hot encoded categorical features is to calculate β = (Zᵀ Z + λ_z I)⁻¹ Zᵀ (y − Σ_j f_j − T − S). Here, the Nesterov gradient acceleration is adopted for completing the task. In this case, there exists an optimal learning rate determined by L, which equals the maximum eigenvalue of the matrix Zᵀ Z + λ_z I, so it may be found efficiently by using Power iteration, wherein the Power iteration is also referred to as the Power method. Table 2 below shows an exemplary process of the Nesterov gradient acceleration with Power iteration.
[Table 2: the exemplary process of the Nesterov gradient acceleration with Power iteration; the algorithm listing could not be recovered from the source.]
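A sketch of this Stage-2 acceleration under stated assumptions: power iteration estimates the extreme eigenvalues of Zᵀ Z + λ_z I (the shifted second call that recovers the smallest eigenvalue is an addition of this sketch, not necessarily part of Table 2), and a constant-momentum Nesterov iteration then matches the closed-form ridge solution:

```python
import numpy as np

rng = np.random.default_rng(3)
n, Q = 200, 10
Z = rng.integers(0, 2, size=(n, Q)).astype(float)   # one-hot-style design matrix
r = rng.normal(size=n)                              # current residual
lam_z = 0.5
A = Z.T @ Z + lam_z * np.eye(Q)                     # Hessian of the ridge objective

def power_iteration(M, iters=300, seed=0):
    """Estimate the largest eigenvalue of a symmetric PSD matrix M."""
    v = np.random.default_rng(seed).normal(size=M.shape[0])
    for _ in range(iters):
        v = M @ v
        v /= np.linalg.norm(v)
    return v @ M @ v

L = power_iteration(A)                       # Lipschitz constant -> step size 1/L
mu = L - power_iteration(L * np.eye(Q) - A)  # smallest eigenvalue via spectral shift
m = (np.sqrt(L) - np.sqrt(mu)) / (np.sqrt(L) + np.sqrt(mu))  # momentum coefficient

beta = np.zeros(Q)
prev = beta.copy()
for _ in range(400):
    look = beta + m * (beta - prev)          # Nesterov look-ahead point
    grad = Z.T @ (Z @ look - r) + lam_z * look
    prev, beta = beta, look - grad / L

exact = np.linalg.solve(A, Z.T @ r)
assert np.allclose(beta, exact, atol=1e-8)
```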
[0094] The Nesterov gradient acceleration may be regarded as an improved momentum method, which makes convergence significantly faster than gradient descent, especially when the data includes a large number of categorical features or their cardinality is high. Traditional approaches conduct histogram-type smoothing per categorical feature iteratively, which may be regarded as gradient descent without momentum, thus the convergence speed is much slower.
[0095] Regarding Stage 3, training efficiency may be improved by performing time section trend decomposition. Trend and time section signals may be identified from the temporal feature t in an iterative approach. Table 3 shows an exemplary process of the time section trend decomposition.
[Table 3: the exemplary process of the time section trend decomposition; the algorithm listing, including lines 3.3 and 3.6 referenced below, could not be recovered from the source.]
Table 3
[0096] The trending operation in line 3.3 corresponds to a smoothing matrix M_T, the cycle-subseries smoothing in line 3.6 corresponds to a smoothing matrix M_S, and fast kernel smoothing is adopted here for improving training efficiency. Such local iteration over M_T and M_S preserves the convergence of the TSI.
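An STL-flavored sketch of this iterative decomposition: alternately smooth each phase's cycle-subseries to obtain the time-section component, then smooth the deseasonalized series to obtain the trend. The moving-average smoothers (standing in for the fast kernel smoothing of [0096]), the low-pass step, and the synthetic series are simplifying assumptions:

```python
import numpy as np

def moving_average(v, w):
    """Centered moving average of half-width w, padding edges by repetition."""
    pad = np.concatenate((np.repeat(v[0], w), v, np.repeat(v[-1], w)))
    kernel = np.ones(2 * w + 1) / (2 * w + 1)
    return np.convolve(pad, kernel, mode="same")[w:-w]

def decompose(y, d, iters=5):
    trend = np.zeros_like(y)
    season = np.zeros_like(y)
    for _ in range(iters):
        detrended = y - trend
        for phi in range(d):                         # smooth each phase-phi subseries
            season[phi::d] = moving_average(detrended[phi::d], 3)
        season = season - moving_average(season, d)  # low-pass: remove leaked trend
        trend = moving_average(y - season, 10)       # smooth the deseasonalized series
    return trend, season

t = np.arange(210.0)
d = 7                                                # time period of the section signal
y = 0.02 * t + np.sin(2 * np.pi * t / d) + 0.05 * np.random.default_rng(4).normal(size=t.size)
trend, season = decompose(y, d)
```

On this synthetic series the iteration recovers an approximately linear trend and a period-7 time-section signal.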
[0097] The trained FXAM has explainability with respect to features. By using the FXAM, the importance of each feature in the feature set for performing prediction may be explicitly known, e.g., which features are relatively important for predicting evaluation tiers, which features have relatively small influence on predicting evaluation tiers, etc. Moreover, since the FXAM has global model explainability, when combined with an appropriate visualization tool, the FXAM may be used for showing how any one feature specifically affects the final prediction, e.g., through the learned relationship f_j(x_j) as discussed above. This relationship may be used in conjunction with knowledge in the applied domain for verifying whether the laws learned by the FXAM (e.g., f_j(x_j)) are trustworthy, whether there are redundant features, whether there is a lack of effective features, etc., thereby providing another angle of explainability. Such explainability of the FXAM may be used for selecting or updating features in the feature set.
[0098] In an aspect, in the process of training the FXAM, after one or more rounds of training, features with very low importance may be removed from the feature set and features with high importance may be retained or added, based on the explainability of the FXAM. Then, the new feature set may be used for the next one or more rounds of training, and features may be removed or added again according to their importance. The above-mentioned process of selecting features in the feature set is repeated multiple times in this way. Finally, a good feature set that enables the most accurate predictions may be obtained.
[0099] In another aspect, during application of the FXAM, when, e.g., an illegitimate user changes his behaviors, the importance of certain features changes, etc., the explainability of the FXAM and the efficient training process facilitate rapidly reselecting features in the feature set. Thus, dynamic updating and optimization of the FXAM are achieved.
[00100] In yet another aspect, the explainability of the FXAM facilitates selecting the types of state information to be detected at 320. For example, at 320, state information corresponding to those features explained as having high importance by the FXAM may be focused on, and state information corresponding to those features explained as having low importance by the FXAM may be ignored. Accordingly, the explainability of the FXAM also facilitates defining a combination of state information for the evaluation rules in the evaluation rule-based approach, setting appropriate weights for different state information in the evaluation score-based approach, etc.
[00101] The content evaluation tier and/or the creator evaluation tier may be determined through the processing at 330 in the process 300, and thus the legitimacy of the content and/or the legitimacy of the creator may be determined.
[00102] Optionally, the process 300 may further comprise: performing control operations in response to at least the content evaluation tier and/or the creator evaluation tier at 340. Herein, the control operations may broadly refer to various operations that help, e.g., prevent the creator of the content from achieving illegitimate purposes, assist recipients of the content in avoiding attacks or losses, assist an administrator of the data collection service in protecting legitimate users and service data, etc.
[00103] In an implementation, the control operations may include applying various usage restrictions to the content in the data collection service, e.g., restricting or prohibiting the distribution of the content, restricting or prohibiting the editing of the content, restricting or prohibiting access to the content, restricting or prohibiting responses to the content, restricting or delaying or prohibiting the display of response results, etc.
[00104] In an implementation, the control operations may include applying various behavior restrictions to a user in the data collection service, e.g., preventing the creator from editing or distributing the content, requiring the creator to modify a question involving sensitive data, restricting or prohibiting the creator from accessing the content, preventing a recipient from filling in sensitive information, preventing a recipient from sending a response, restricting or prohibiting a recipient from accessing the content, etc.
[00105] In an implementation, the control operations may include presenting various prompts in the data collection service. The prompts may be provided to the content creator or content recipients. The prompts may be warnings or notifications about the legitimacy of the content or the legitimacy of the creator, e.g., a prompt that the current question involves sensitive data, a prompt that the content may have an illegitimate purpose, a prompt requiring a question to be modified, a prompt that the content is prohibited from being distributed or responded to, etc.
[00106] In an implementation, the control operations may include sending a notification to an administrator of the data collection service, so that the administrator may take further management measures. The notification may be information about the legitimacy of the content or the legitimacy of the creator, e.g., a notification that the content is determined to be illegitimate, a notification that the creator is determined to be illegitimate, a notification about the detected state information, etc.
[00107] In an implementation, the control operations may include sending a notification to an outside service, so that the outside service may take further management measures. The notification may be information about the legitimacy of the content or the legitimacy of the creator, e.g., a notification that the content is determined to be illegitimate, a notification that the creator is determined to be illegitimate, etc.
[00108] In an implementation, the control operations may include storing the determined content evaluation tier and/or creator evaluation tier in a storage device. Through this approach, all the previously-determined historical evaluation tiers about certain content or a certain creator may be collected and retained.
[00109] It should be appreciated that all the control operations described above are exemplary, and the embodiments of the present disclosure are not limited to these control operations, but may include any other control operations.
[00110] The control operations performed at 340 may correspond to a specific content evaluation tier or creator evaluation tier. Different control operations may be performed for different content evaluation tiers or creator evaluation tiers respectively, so as to achieve, e.g., different control strategies, different user experiences, etc.
[00111] It is assumed that the content evaluation tier is divided into legitimate content, suspicious content and illegitimate content. If it is determined at 330 that the current content is legitimate content, applying strict control operations may be avoided, e.g., only presenting prompts when needed. If it is determined at 330 that the current content is suspicious content, necessary control operations may be applied, e.g., restricting the distribution and editing of the content, requesting the creator to modify a question involving sensitive data, delaying the display of a response result, providing a prompt that the current question involves sensitive data, etc. If it is determined at 330 that the current content is illegitimate content, strict control operations may be applied, e.g., prohibiting the editing or distribution of the content, prohibiting access to the content, preventing a recipient from sending a response, etc.
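A minimal sketch of mapping content evaluation tiers to control operations, assuming the tier names above; the specific operation identifiers are hypothetical and the mapping is illustrative, not prescribed by the disclosure.

```python
# Tier names follow the description; operation names are made up for illustration.
CONTROLS_BY_CONTENT_TIER = {
    "legitimate": [],  # avoid strict controls; prompts only when needed
    "suspicious": [
        "restrict_distribution", "restrict_editing",
        "require_question_modification", "delay_response_display",
        "prompt_sensitive_data",
    ],
    "illegitimate": [
        "prohibit_editing", "prohibit_distribution",
        "prohibit_access", "block_responses",
    ],
}

def controls_for(content_tier):
    """Return the control operations to apply for a given content evaluation tier."""
    # Unknown tiers fall back to notifying the administrator for manual review.
    return CONTROLS_BY_CONTENT_TIER.get(content_tier, ["notify_administrator"])
```

A parallel table keyed by creator evaluation tier could implement the creator-side strategy in the next paragraph.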
[00112] It is assumed that the creator evaluation tier is divided into good user, suspicious user and illegitimate user. If it is determined at 330 that the creator is a good user, applying strict control operations may be avoided, in order to assist in smoothly completing the data collection, e.g., only presenting prompts when needed. If it is determined at 330 that the creator is a suspicious user, necessary control operations may be applied, e.g., restricting the distribution and editing of the content by the creator, requesting the creator to modify a question involving sensitive data, restricting a recipient from sending a response, delaying the display of a response result, providing a prompt that the current question involves sensitive data, etc. If it is determined at 330 that the creator is an illegitimate user, strict control operations may be applied, e.g., prohibiting the creator from editing or distributing the content, prohibiting a recipient from accessing the content, preventing a recipient from sending a response, etc.
[00113] Optionally, in addition to considering the content evaluation tier and/or creator evaluation tier determined at 330, the control operations performed at 340 may also consider historical content evaluation tiers and/or historical creator evaluation tiers. For example, for certain content, the historical evaluation tier and the currently-determined evaluation tier of the content may be comprehensively considered to determine what control operations will be taken. Similarly, for a creator, the historical evaluation tier and the currently-determined evaluation tier of the creator may be comprehensively considered to determine what control operations will be taken.
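One hypothetical way to comprehensively consider historical and current tiers is an escalation policy such as the following; the thresholds, rules, and tier names are assumptions, not part of the disclosure.

```python
# Severity ordering for the content tiers named in the description.
SEVERITY = {"legitimate": 0, "suspicious": 1, "illegitimate": 2}

def effective_tier(current, history):
    """Combine the current tier with stored historical tiers (hypothetical policy)."""
    # Escalate content that has repeatedly been rated suspicious.
    if current == "suspicious" and history.count("suspicious") >= 2:
        return "illegitimate"
    # Do not report a tier more than one level milder than the worst ever observed.
    worst = max((SEVERITY[t] for t in history), default=0)
    if SEVERITY[current] < worst - 1:
        return [k for k, v in SEVERITY.items() if v == worst - 1][0]
    return current
```

The historical tiers would come from the storage described in [00108], and the effective tier would then drive the control operations at 340.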
[00114] It should be appreciated that the performing order among the various steps in the process 300 discussed above may be changed in any manner. A part of the steps may be performed iteratively to further improve the accuracy of detection of the legitimacy of data collection or to improve the appropriateness of the control operations. In one case, after determining the content evaluation tier and/or creator evaluation tier at 330, the process 300 may return to step 320 to detect additional state information associated with the monitored event, e.g., to further obtain other state information that may facilitate determining the evaluation tiers. Then, the previously-determined content evaluation tier and/or creator evaluation tier may be updated at 330 based on the additional state information. In one case, after the content evaluation tier and/or creator evaluation tier is determined at 330, the process 300 may return to step 310 to monitor additional events related to the content and/or a user that occur in the data collection service and/or an outside service, which may trigger further state information detection and evaluation tier determination. For example, after an additional event is monitored, additional state information associated with the additional event may be detected from the data collection service and/or the outside service at 320, and the previously-determined content evaluation tier and/or creator evaluation tier may be updated based at least on the additional state information at 330. In one case, after the control operations are performed at 340, the process 300 may return to step 310 to monitor events that occur after the control operations were applied, and to further trigger subsequent processing. In one case, after the control operations are performed at 340, the process 300 may return to step 320 to detect additional state information after the control operations were applied, and to further trigger subsequent processing. In one case, after the control operations are performed at 340, the process 300 may return to step 330 to re-determine the evaluation tiers.
[00115] FIG.4 illustrates an exemplary deployment 400 of a legitimacy detection service for data collection according to an embodiment.
[00116] The legitimacy detection service for data collection 410 in FIG.4 may refer to a service capable of implementing the process 300 shown in FIG.3. The legitimacy detection service for data collection 410 may include an event monitoring module 420, a state information detecting module 430, an evaluation tier determining module 440, a control performing module 450, a data storage unit 460, etc.
[00117] The event monitoring module 420 may perform the operation at 310 in FIG.3, e.g., monitoring events that occur in a data collection service 470 and/or an outside service 480. The events monitored by the event monitoring module 420 may be stored in an event message queue in the data storage unit 460. The event message queue may include messages about one or more monitored events. These event messages may be read in sequence to trigger further processing.
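The decoupling of the event monitoring module from downstream processing via an event message queue can be sketched as follows; the queue API and event field names are hypothetical.

```python
from collections import deque

class EventQueue:
    """Minimal stand-in for the event message queue in the data storage unit."""
    def __init__(self):
        self._q = deque()

    def publish(self, event):
        """Called by the event monitoring module when an event is observed."""
        self._q.append(event)

    def poll(self):
        """Called by the state information detecting module; FIFO order."""
        return self._q.popleft() if self._q else None

# The monitoring module publishes; the detection module consumes in sequence.
queue = EventQueue()
queue.publish({"type": "content_created", "content_id": "c1", "user": "u1"})
queue.publish({"type": "content_distributed", "content_id": "c1", "user": "u1"})

processed = []
while (event := queue.poll()) is not None:
    processed.append(event["type"])  # here each event would trigger state detection
```

In a real deployment the deque would be replaced by a durable message queue, but the publish/consume contract between modules 420 and 430 is the same.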
[00118] The state information detecting module 430 may perform the operation at 320 in FIG.3, e.g., in response to an event extracted from the event message queue, detecting state information related to the event from the data collection service 470 and/or the outside service 480.
[00119] The evaluation tier determining module 440 may perform the operation at 330 in FIG.3, e.g., determining a content evaluation tier and/or a creator evaluation tier based on the state information.
[00120] The control performing module 450 may perform the operation at 340 in FIG.3, e.g., performing control operations in response to at least the content evaluation tier and/or the creator evaluation tier determined by the evaluation tier determining module 440. In an implementation, the control operations may be divided into frontend controls 452 and backend controls 454. The frontend controls 452 may include various control operations for the data collection service 470. For example, the frontend controls 452 may affect at least a user's usage and experience of the data collection service. The backend controls 454 may include various control operations outside the data collection service 470, e.g., sending notifications to an administrator of the data collection service 470 through email, sending notifications to the outside service 480, storing the content evaluation tier and/or the creator evaluation tier determined by the evaluation tier determining module 440 in the data storage unit 460, etc.
[00121] The data collection service 470 may provide a service interface 490 having an interaction function 492. A content creator 402, a content recipient 404, and an administrator 406 of the data collection service may interact with the data collection service 470 through the service interface 490, e.g., creating content, accessing content, performing management, etc.
[00122] It should be appreciated that all the services, modules and architectures in the deployment 400 are exemplary, and the legitimacy detection service for data collection 410 may be deployed in any other manner. For example, although FIG.4 shows the legitimacy detection service for data collection 410 as being deployed outside the data collection service 470, the legitimacy detection service for data collection 410 may also be deployed within the data collection service 470, or a part of the legitimacy detection service for data collection 410 may be deployed within the data collection service 470, e.g., deploying the event monitoring module 420 and the control performing module 450 within the data collection service 470, deploying the event monitoring module 420 and the frontend controls part of the control performing module 450 within the data collection service 470, etc. Moreover, for example, although FIG.4 shows that the legitimacy detection service for data collection 410 includes the control performing module 450, the control performing module 450 may also be omitted from the legitimacy detection service for data collection 410. Moreover, for example, although FIG.4 shows that the data storage unit 460 is included in the legitimacy detection service for data collection 410, the data storage unit 460 or a part thereof may also be separate from the legitimacy detection service for data collection 410.
Moreover, for example, although FIG.4 shows that the events monitored by the event monitoring module 420 are transferred to the state information detecting module 430 through the event message queue in the data storage unit 460, the data storage unit 460 may also be omitted, and thus the event monitoring module 420 may provide the monitored events to the state information detecting module 430 directly.
[00123] FIG.5 illustrates a flowchart of an exemplary method 500 for detecting legitimacy of data collection according to an embodiment. The data collection may be implemented through processing content related to the data collection by a user in a data collection service.
[00124] At 510, at least one event occurring in the data collection service and/or at least one outside service may be monitored, the event being associated with the content and/or the user.
[00125] At 520, in response to the event, state information associated with the event may be detected from the data collection service and/or the outside service.
[00126] At 530, a content evaluation tier and/or a creator evaluation tier may be determined based on the state information, the content evaluation tier corresponding to legitimacy of the content and the creator evaluation tier corresponding to legitimacy of a creator of the content.
[00127] In an implementation, the content may comprise at least one of: form, email, webpage, and productivity tool document. The data collection service may support processing of the content. The outside service is different from the data collection service, and may comprise at least one of: an email service, a browser service, a security detection service for operating system, a cloud service, and a social media.
[00128] In an implementation, the monitoring at least one event may comprise: receiving an indication of the event from the outside service.
[00129] In an implementation, the user may be the creator or a recipient of the content.
[00130] In an implementation, the data collection service is a survey form service, the content is a form, the user is a creator of the form, the legitimacy of the content is associated with whether the content is a phishing form, and the legitimacy of the creator is associated with whether the creator is a phisher.
[00131] In an implementation, the state information may comprise at least one of: information associated with the content in the data collection service; information associated with behaviors of the user in the data collection service; administrative information in the data collection service; information associated with the content in the outside service; and information associated with behaviors of the creator in the outside service.
[00132] In an implementation, the content evaluation tier may comprise at least one of: legitimate content, suspicious content, and illegitimate content. The creator evaluation tier may comprise at least one of: good user, normal user, suspicious user, and illegitimate user.
[00133] In an implementation, the content evaluation tier and/or the creator evaluation tier may be determined through at least one of: an evaluation rule-based approach; an evaluation score-based approach; and a FXAM-based approach.
[00134] The evaluation rule-based approach may comprise: determining that the state information matches with at least one evaluation rule; and determining the content evaluation tier and/or the creator evaluation tier based on the at least one evaluation rule.
[00135] The evaluation score-based approach may comprise: obtaining an evaluation score through calculating a weighted sum of confidences of the state information; and determining the content evaluation tier and/or the creator evaluation tier based on the evaluation score.
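A minimal sketch of the evaluation score-based approach: a weighted sum of per-signal confidences thresholded into tiers. The signal names, weights, and thresholds below are hypothetical, not values taught by the disclosure.

```python
def evaluation_score(confidences, weights):
    """Weighted sum of the confidences of the detected state information."""
    return sum(weights[name] * conf for name, conf in confidences.items())

def tier_from_score(score, suspicious_at=0.4, illegitimate_at=0.7):
    """Map the evaluation score to a content evaluation tier (assumed thresholds)."""
    if score >= illegitimate_at:
        return "illegitimate content"
    if score >= suspicious_at:
        return "suspicious content"
    return "legitimate content"

# Hypothetical signals: weights reflect how indicative each signal is.
weights = {"sensitive_question": 0.5, "flagged_url": 0.3, "abnormal_sending": 0.2}
confidences = {"sensitive_question": 0.9, "flagged_url": 0.8, "abnormal_sending": 0.1}
tier = tier_from_score(evaluation_score(confidences, weights))
```

The rule-based approach of [00134] can be seen as the limiting case in which each matched rule maps directly to a tier rather than contributing to a score.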
[00136] The FXAM-based approach may comprise: obtaining at least one of numerical features, categorical features and temporal features in the state information; and predicting, through the FXAM, the content evaluation tier and/or the creator evaluation tier based on the obtained features.
[00137] The method 500 may further comprise: selecting or updating features in a feature set adopted by the FXAM based on explainability of the FXAM to features.
[00138] The FXAM may be trained through a three-stage iteration, the three-stage iteration comprising three stages corresponding to numerical features, categorical features and temporal features respectively. The training is accelerated through optimizing at least one of the three stages.
[00139] In an implementation, the method 500 may further comprise: performing at least one control operation in response to at least the content evaluation tier and/or the creator evaluation tier.
[00140] The control operation may comprise at least one of: applying usage restrictions to the content in the data collection service; applying behavior restrictions to the user in the data collection service; presenting prompts in the data collection service; sending a notification to an administrator of the data collection service; sending a notification to the outside service; and storing the content evaluation tier and/or the creator evaluation tier.
[00141] The performing at least one control operation may be further based on historical content evaluation tiers and/or historical creator evaluation tiers.
[00142] The method 500 may further comprise: detecting additional state information associated with the event from the data collection service and/or the outside service; and updating the content evaluation tier and/or the creator evaluation tier based at least on the additional state information.
[00143] The method 500 may further comprise: monitoring at least one additional event occurring in the data collection service and/or the outside service, the additional event being associated with the content and/or the user; detecting additional state information associated with the additional event from the data collection service and/or the outside service; and updating the content evaluation tier and/or the creator evaluation tier based at least on the additional state information.
[00144] It should be appreciated that the method 500 may further comprise any steps/processes for detecting legitimacy of data collection according to the embodiments of the present disclosure as mentioned above.
[00145] FIG.6 illustrates an exemplary apparatus 600 for detecting legitimacy of data collection according to an embodiment. The data collection may be implemented through processing content related to the data collection by a user in a data collection service.
[00146] The apparatus 600 may comprise: an event monitoring module 610, for monitoring at least one event occurring in the data collection service and/or at least one outside service, the event being associated with the content and/or the user; a state information detecting module 620, for, in response to the event, detecting state information associated with the event from the data collection service and/or the outside service; and an evaluation tier determining module 630, for determining a content evaluation tier and/or a creator evaluation tier based on the state information, the content evaluation tier corresponding to legitimacy of the content and the creator evaluation tier corresponding to legitimacy of a creator of the content.
[00147] Moreover, the apparatus 600 may also comprise any other modules configured for performing any steps and operations of the methods for detecting legitimacy of data collection according to the embodiments of the present disclosure as mentioned above.
[00148] FIG.7 illustrates an exemplary apparatus 700 for detecting legitimacy of data collection according to an embodiment. The data collection may be implemented through processing content related to the data collection by a user in a data collection service.
[00149] The apparatus 700 may comprise at least one processor 710 and a memory 720 storing computer-executable instructions. When the computer-executable instructions are executed, the at least one processor 710 may: monitor at least one event occurring in the data collection service and/or at least one outside service, the event being associated with the content and/or the user; in response to the event, detect state information associated with the event from the data collection service and/or the outside service; and determine a content evaluation tier and/or a creator evaluation tier based on the state information, the content evaluation tier corresponding to legitimacy of the content and the creator evaluation tier corresponding to legitimacy of a creator of the content. The at least one processor 710 may be further configured for performing any operations of the methods for detecting legitimacy of data collection according to the embodiments of the present disclosure as mentioned above.
[00150] The embodiments of the present disclosure may be embodied in a non-transitory computer-readable medium. The non-transitory computer-readable medium may comprise instructions that, when executed, cause one or more processors to perform any operations of the methods for detecting legitimacy of data collection according to the embodiments of the present disclosure as mentioned above.
[00151] It should be appreciated that all the operations in the methods described above are merely exemplary, and the present disclosure is not limited to any operations in the methods or sequence orders of these operations, and should cover all other equivalents under the same or similar concepts.
[00152] It should also be appreciated that all the modules in the apparatuses described above may be implemented in various approaches. These modules may be implemented as hardware, software, or a combination thereof. Moreover, any of these modules may be further functionally divided into sub-modules or combined together.
[00153] Processors have been described in connection with various apparatuses and methods. These processors may be implemented using electronic hardware, computer software, or any combination thereof. Whether such processors are implemented as hardware or software will depend upon the particular application and overall design constraints imposed on the system. By way of example, a processor, any portion of a processor, or any combination of processors presented in the present disclosure may be implemented with a microprocessor, microcontroller, digital signal processor (DSP), a field-programmable gate array (FPGA), a programmable logic device (PLD), a state machine, gated logic, discrete hardware circuits, and other suitable processing components configured to perform the various functions described throughout the present disclosure. The functionality of a processor, any portion of a processor, or any combination of processors presented in the present disclosure may be implemented with software being executed by a microprocessor, microcontroller, DSP, or other suitable platform.
[00154] Software shall be construed broadly to mean instructions, instruction sets, code, code segments, program code, programs, subprograms, software modules, applications, software applications, software packages, routines, subroutines, objects, threads of execution, procedures, functions, etc. The software may reside on a computer-readable medium. A computer-readable medium may include, by way of example, memory such as a magnetic storage device (e.g., hard disk, floppy disk, magnetic strip), an optical disk, a smart card, a flash memory device, random access memory (RAM), read only memory (ROM), programmable ROM (PROM), erasable PROM (EPROM), electrically erasable PROM (EEPROM), a register, or a removable disk. Although memory is shown separate from the processors in the various aspects presented throughout the present disclosure, the memory may be internal to the processors, e.g., cache or register.
[00155] The previous description is provided to enable any person skilled in the art to practice the various aspects described herein. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects. Thus, the claims are not intended to be limited to the aspects shown herein. All structural and functional equivalents to the elements of the various aspects described throughout the present disclosure that are known or later come to be known to those of ordinary skill in the art are intended to be encompassed by the claims.

Claims

1. A method for detecting legitimacy of data collection, the data collection being implemented through processing content related to the data collection by a user in a data collection service, the method comprising: monitoring at least one event occurring in the data collection service and/or at least one outside service, the event being associated with the content and/or the user; in response to the event, detecting state information associated with the event from the data collection service and/or the outside service; and determining a content evaluation tier and/or a creator evaluation tier based on the state information, the content evaluation tier corresponding to legitimacy of the content and the creator evaluation tier corresponding to legitimacy of a creator of the content.
2. The method of claim 1, wherein the content comprises at least one of: form, email, webpage, and productivity tool document, the data collection service supports processing of the content, and the outside service is different from the data collection service, and comprises at least one of: an email service, a browser service, a security detection service for operating system, a cloud service, and a social media.
3. The method of claim 1, wherein the monitoring at least one event comprises: receiving an indication of the event from the outside service.
4. The method of claim 1, wherein the data collection service is a survey form service, the content is a form, the user is a creator of the form, the legitimacy of the content is associated with whether the content is a phishing form, and the legitimacy of the creator is associated with whether the creator is a phisher.
5. The method of claim 1, wherein the state information comprises at least one of: information associated with the content in the data collection service; information associated with behaviors of the user in the data collection service; administrative information in the data collection service; information associated with the content in the outside service; and information associated with behaviors of the creator in the outside service.
6. The method of claim 1, wherein the content evaluation tier and/or the creator evaluation tier is determined through at least one of: an evaluation rule-based approach; an evaluation score-based approach; and a Fast Explainable Additive Model (FXAM)-based approach.
7. The method of claim 6, wherein the evaluation rule-based approach comprises: determining that the state information matches with at least one evaluation rule; and determining the content evaluation tier and/or the creator evaluation tier based on the at least one evaluation rule.
8. The method of claim 6, wherein the evaluation score-based approach comprises: obtaining an evaluation score through calculating a weighted sum of confidences of the state information; and determining the content evaluation tier and/or the creator evaluation tier based on the evaluation score.
9. The method of claim 6, wherein the FXAM-based approach comprises: obtaining at least one of numerical features, categorical features and temporal features in the state information; and predicting, through the FXAM, the content evaluation tier and/or the creator evaluation tier based on the obtained features.
10. The method of claim 1, further comprising: performing at least one control operation in response to at least the content evaluation tier and/or the creator evaluation tier.
11. The method of claim 10, wherein the control operation comprises at least one of: applying usage restrictions to the content in the data collection service; applying behavior restrictions to the user in the data collection service; presenting prompts in the data collection service; sending a notification to an administrator of the data collection service; sending a notification to the outside service; and storing the content evaluation tier and/or the creator evaluation tier.
12. The method of claim 1, further comprising: detecting additional state information associated with the event from the data collection service and/or the outside service; and updating the content evaluation tier and/or the creator evaluation tier based at least on the additional state information.
13. The method of claim 1, further comprising: monitoring at least one additional event occurring in the data collection service and/or the outside service, the additional event being associated with the content and/or the user; detecting additional state information associated with the additional event from the data collection service and/or the outside service; and updating the content evaluation tier and/or the creator evaluation tier based at least on the additional state information.
14. An apparatus for detecting legitimacy of data collection, the data collection being implemented through processing content related to the data collection by a user in a data collection service, the apparatus comprising: an event monitoring module, for monitoring at least one event occurring in the data collection service and/or at least one outside service, the event being associated with the content and/or the user; a state information detecting module, for, in response to the event, detecting state information associated with the event from the data collection service and/or the outside service; and an evaluation tier determining module, for determining a content evaluation tier and/or a creator evaluation tier based on the state information, the content evaluation tier corresponding to legitimacy of the content and the creator evaluation tier corresponding to legitimacy of a creator of the content.
15. An apparatus for detecting legitimacy of data collection, the data collection being implemented through processing content related to the data collection by a user in a data collection service, the apparatus comprising: at least one processor; and a memory storing computer-executable instructions that, when executed, cause the at least one processor to: monitor at least one event that occurred in the data collection service and/or at least one outside service, the event being associated with the content and/or the user, in response to the event, detect state information associated with the event from the data collection service and/or the outside service, and determine a content evaluation tier and/or a creator evaluation tier based on the state information, the content evaluation tier corresponding to legitimacy of the content and the creator evaluation tier corresponding to legitimacy of a creator of the content.
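Read together, claims 10–15 describe a pipeline: monitor events from the data collection service and outside services, detect state information for each event, map that state to content and creator evaluation tiers, and update the tiers as further events arrive. The sketch below illustrates that flow only; the tier labels, the scoring rule, and all class and field names are hypothetical, since the claims deliberately leave the mapping from state information to tiers unspecified.

```python
from dataclasses import dataclass, field

# Hypothetical tier labels, ordered from most to least legitimate;
# the claims do not enumerate specific tiers.
TIERS = ("legitimate", "suspicious", "malicious")


@dataclass
class Event:
    source: str                      # e.g. "data_collection_service" or "outside_service"
    content_id: str                  # the content the event is associated with
    user_id: str                     # the user (content creator) the event is associated with
    state: dict = field(default_factory=dict)  # detected state information


def determine_tier(state: dict) -> str:
    """Map detected state information to an evaluation tier.

    This scoring rule is a placeholder invented for illustration.
    """
    score = 0
    if state.get("reported_as_phishing"):
        score += 2
    if state.get("external_flag"):
        score += 1
    if score >= 2:
        return "malicious"
    if score == 1:
        return "suspicious"
    return "legitimate"


class LegitimacyDetector:
    """Tracks content and creator evaluation tiers across monitored events."""

    def __init__(self) -> None:
        self.content_tier: dict[str, str] = {}
        self.creator_tier: dict[str, str] = {}

    def on_event(self, event: Event) -> str:
        """Handle one monitored event: detect state, then update both tiers.

        Per claims 12-13, additional events update the tiers; here the
        update policy (keep the worst tier seen so far) is an assumption.
        """
        tier = determine_tier(event.state)
        prior = self.content_tier.get(event.content_id, "legitimate")
        worst = max(tier, prior, key=TIERS.index)
        self.content_tier[event.content_id] = worst
        self.creator_tier[event.user_id] = worst
        return worst
```

A control operation in the sense of claim 11 (usage restrictions, prompts, notifications) would then be dispatched based on the tier returned by `on_event`, e.g. blocking submission of content whose tier is "malicious".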
PCT/US2021/016989 2020-03-30 2021-02-08 Detecting legitimacy of data collection WO2021201980A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010237495.5A CN113468589A (en) 2020-03-30 2020-03-30 Detecting data collection validity
CN202010237495.5 2020-03-30

Publications (1)

Publication Number Publication Date
WO2021201980A1 true WO2021201980A1 (en) 2021-10-07

Family

ID=74845096

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2021/016989 WO2021201980A1 (en) 2020-03-30 2021-02-08 Detecting legitimacy of data collection

Country Status (2)

Country Link
CN (1) CN113468589A (en)
WO (1) WO2021201980A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060251068A1 (en) * 2002-03-08 2006-11-09 Ciphertrust, Inc. Systems and Methods for Identifying Potentially Malicious Messages
US20150067833A1 (en) * 2013-08-30 2015-03-05 Narasimha Shashidhar Automatic phishing email detection based on natural language processing techniques
US9026507B2 (en) * 2004-05-02 2015-05-05 Thomson Reuters Global Resources Methods and systems for analyzing data related to possible online fraud

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8060569B2 (en) * 2007-09-27 2011-11-15 Microsoft Corporation Dynamic email directory harvest attack detection and mitigation
US8539029B2 (en) * 2007-10-29 2013-09-17 Microsoft Corporation Pre-send evaluation of E-mail communications
CN101667979B (en) * 2009-10-12 2012-06-06 哈尔滨工程大学 System and method for anti-phishing emails based on link domain name and user feedback
US10277628B1 (en) * 2013-09-16 2019-04-30 ZapFraud, Inc. Detecting phishing attempts
US10298602B2 (en) * 2015-04-10 2019-05-21 Cofense Inc. Suspicious message processing and incident response
CN109242250A (en) * 2018-08-03 2019-01-18 成都信息工程大学 A kind of user's behavior confidence level detection method based on Based on Entropy method and cloud model


Also Published As

Publication number Publication date
CN113468589A (en) 2021-10-01

Similar Documents

Publication Publication Date Title
US10972495B2 (en) Methods and apparatus for detecting and identifying malware by mapping feature data into a semantic space
US10834128B1 (en) System and method for identifying phishing cyber-attacks through deep machine learning via a convolutional neural network (CNN) engine
Almomani et al. A survey of phishing email filtering techniques
Dutta Detecting phishing websites using machine learning technique
US10313352B2 (en) Phishing detection with machine learning
Murugan et al. Feature extraction using LR-PCA hybridization on twitter data and classification accuracy using machine learning algorithms
CN108833186B (en) Network attack prediction method and device
Butt et al. Cloud-based email phishing attack using machine and deep learning algorithm
US11223642B2 (en) Assessing technical risk in information technology service management using visual pattern recognition
US11657601B2 (en) Methods, devices and systems for combining object detection models
US20200005170A1 (en) Digital mdr (managed detection and response) analysis
US11783201B2 (en) Neural flow attestation
Obaid et al. An adaptive approach for internet phishing detection based on log data
DR et al. Malicious URL Detection and Classification Analysis using Machine Learning Models
US11201875B2 (en) Web threat investigation using advanced web crawling
Dandıl C‐NSA: a hybrid approach based on artificial immune algorithms for anomaly detection in web traffic
Vaishnavi et al. A Comparative Analysis of Machine Learning Algorithms on Malicious URL Prediction
Patil et al. Learning to Detect Phishing Web Pages Using Lexical and String Complexity Analysis
WO2021201980A1 (en) Detecting legitimacy of data collection
Gupta et al. A comprehensive comparative study of machine learning classifiers for Spam Filtering
Jansi An Effective Model of Terminating Phishing Websites and Detection Based On Logistic Regression
Khan Detecting phishing attacks using nlp
Al-Shalabi Comparative Study of Data Mining Classification Techniques for Detection and Prediction of Phishing Websites
Azeez et al. Approach for Identifying Phishing Uniform Resource Locators (URLs)
Patil Detection of Clickjacking attacks using the extreme learning Machine algorithm

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21709273

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21709273

Country of ref document: EP

Kind code of ref document: A1