WO2022046202A1 - Detection of legitimacy of content related to data collection - Google Patents


Info

Publication number
WO2022046202A1
Authority
WO
WIPO (PCT)
Prior art keywords
identification element
content
entity
image
visual component
Prior art date
Application number
PCT/US2021/030997
Other languages
English (en)
Inventor
Anqi DU
Bin Zhu
Shiyu Zou
Jian Wang
Xingyu XU
Yao KE
Dongmei Zhang
Original Assignee
Microsoft Technology Licensing, LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Microsoft Technology Licensing, LLC
Publication of WO2022046202A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441 Countermeasures against malicious traffic
    • H04L63/1466 Active attacks involving interception, injection, modification, spoofing of data unit addresses, e.g. hijacking, packet injection or TCP sequence number attacks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/22 Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/554 Detecting local intrusion or implementing counter-measures involving event detection and direct action
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/56 Extraction of image or video features relating to colour
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2221/00 Indexing scheme relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F2221/21 Indexing scheme relating to G06F21/00 and subgroups addressing additional information or applications relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F2221/2119 Authenticating web pages, e.g. with suspicious links

Definitions

  • a data collection service may widely refer to various services, applications, software, websites, etc. capable of implementing data collection or having a function of data collection.
  • a survey form service is a type of dedicated data collection service for collecting data through forms.
  • data collection may also be performed in services not dedicated for data collection, e.g., collecting data through webpages in a browser service, collecting data through emails in an email service, collecting data through productivity tool documents in a productivity tool, etc. All these services capable of achieving data collection may be collectively referred to as data collection services.
  • Embodiments of the present disclosure provide method and apparatus for detecting legitimacy of content related to data collection.
  • Content related to data collection may be obtained.
  • At least one identification element may be detected from the content.
  • An entity corresponding to the identification element may be recognized.
  • the legitimacy of the content may be determined based at least on correlation between a creator of the content and the entity.
  • FIG.1 illustrates an exemplary process for detecting legitimacy of content related to data collection according to an embodiment of the present disclosure.
  • FIG.2 illustrates an exemplary authentic image identification element and exemplary non-authentic image identification elements obtained through transforming the authentic image identification element.
  • FIG.3 illustrates exemplary non-authentic image identification elements created for a specific entity.
  • FIG.4 illustrates an exemplary form for data collection.
  • FIG.5 illustrates an exemplary process for detecting an image identification element according to an embodiment of the present disclosure.
  • FIG.6 illustrates another exemplary process for detecting an image identification element according to an embodiment of the present disclosure.
  • FIG.7 illustrates yet another exemplary process for detecting an image identification element according to an embodiment of the present disclosure.
  • FIG.8 illustrates an exemplary process for recognizing an entity corresponding to an image identification element according to an embodiment of the present disclosure.
  • FIG.9 illustrates an exemplary process for generating a training dataset for training an entity recognizing model based on authentic identification elements according to an embodiment of the present disclosure.
  • FIG.10 illustrates an exemplary split of an authentic identification element according to an embodiment of the present disclosure.
  • FIG.11 illustrates exemplary transformations of a visual component according to an embodiment of the present disclosure.
  • FIG.12 is a flowchart of an exemplary method for detecting legitimacy of content related to data collection according to an embodiment of the present disclosure.
  • FIG.13 illustrates an exemplary apparatus for detecting legitimacy of content related to data collection according to an embodiment of the present disclosure.
  • FIG.14 illustrates an exemplary apparatus for detecting legitimacy of content related to data collection according to an embodiment of the present disclosure.
  • Data collection services may be used by malicious users for illegitimate-purpose data collections, thus there is a risk of abusing the data collection services.
  • the data collection services may be maliciously used for collecting personal privacy or sensitive data, collecting commercial secrets, broadcasting inappropriate content, etc., and the collected data may be used for financial crimes, reputation damage, network attacks, etc.
  • Illegitimate-purpose data collections will tremendously damage the interests of providers and legitimate users of data collection services. Take phishing, a common network attack, as an example.
  • a phisher will pretend to be a specific entity and conduct illegitimate-purpose data collections, e.g., collecting privacy or sensitive data of log-in account and password, bank card number, credit card number, home address, commercial information of companies, etc.
  • a survey form service is a common data collection service used by phishers. Through the survey form service, a phisher may create and distribute illegitimate-purpose forms, and obtain information provided by responders.
  • logos are widely used in various data collection services to recognize entities associated with the data collection services.
  • a logo may be an image or text composed of graphics and/or text with a sense of design, e.g., including a legally registered trademark, a widely-used icon, etc.
  • an entity may refer to a brand, a product, a service, a person, etc. targeted by a data collection service.
  • a phisher usually uses a logo of a specific entity in a data collection service to deceive users into believing that the data collection service is initiated by an owner or an authorized party of the specific entity.
  • an owner of an entity may refer to a company, an organization, an individual, etc. that legally owns, possesses, and uses the entity
  • an authorized party of an entity may refer to a company, an organization, an individual, etc.
  • a phisher usually provides a brand's logo in an email to convince users that the email was sent by an owner or authorized party of the brand, so as to further obtain user data.
  • Existing phishing detection technologies may usually detect and recognize an entity corresponding to an authentic logo in an email, and determine whether the email is a phishing email based on whether a sender of the email is correlated with an owner of the entity.
  • phishers often provide, in emails, some faked logos, e.g., logos that are different from but similar to authentic logos, newly created logos for specific entities, etc.
  • the existing phishing detection technologies may have difficulty recognizing such faked logos, and therefore cannot determine whether the emails are phishing emails. Data collection services other than emails also face similar problems.
  • Embodiments of the present disclosure propose effective detections of legitimacy of content related to data collection.
  • content may comprise various digital information forms capable of performing data collection, e.g., a form, a webpage, an email, a productivity tool document, etc.
  • data collection services may be various services supporting the processing of the content, e.g., a survey form service, a browser service, an email service, a productivity tool, etc.
  • legitimacy of content may refer to whether a creator of the content is legitimate, or whether usage of specific elements by the content is legitimate, etc.
  • a creator of the content may create the content for collecting data in a data collection service, and responders of the content may fill information into the content to provide data in the data collection service, wherein the responders may refer to those recipients responding to the content among recipients having received the content.
  • a creator and a responder may widely refer to a company, an organization, a person, etc. that uses a data collection service.
  • an identification element may refer to an element of graph, text, attribute, style, etc., that enables viewers to associate it with a specific entity.
  • the identification element may include, e.g., a logo, a representative picture, a representative text, etc.
  • the representative picture may be a picture capable of representing a specific entity, such as a picture with a slogan for a specific entity, a picture of a building with distinctive features associated with a specific entity, a portrait associated with a specific entity, etc.
  • the representative text may be a text capable of representing a specific entity, such as a name of a specific entity, a slogan of a specific entity, a description pointing to a specific entity, etc.
  • a description "the world's largest computer software provider” may point to "Microsoft”
  • a description "Microsoft Corporation’s founder” may point to "Bill Gates”, etc.
  • Identification elements usually exist in content in the form of text or images.
  • an identification element in the form of text may be referred to as a text identification element
  • an identification element in the form of image may be referred to as an image identification element.
  • the embodiments of the present disclosure propose to individually recognize each visual component in an image identification element, and determine an entity corresponding to the image identification element based on recognition results of individual visual components.
  • visual components may refer to independent and separate parts that constitute an image identification element.
  • the embodiments of the present disclosure propose to recognize a visual component through a deep learning-based model which may be trained with a training dataset obtained according to an embodiment of the present disclosure.
  • the training dataset may also include visual component samples that are different from the visual components in the authentic identification elements, so that when actually deployed, the trained model may correctly recognize various identification elements without limitation to the authentic identification elements.
  • an authentic identification element may refer to an identification element created and/or used by an owner of an entity
  • visual component samples different from visual components in the authentic identification element may include, e.g., visual component samples obtained through transforming the visual components in the authentic identification element, visual component samples created for a specific entity, etc.
  • the embodiments of the present disclosure propose to, when detecting an image identification element from content, scan an image extracted from the content with a sliding window having a predetermined size to obtain a set of sub-images, and detect the image identification element from the set of sub-images. In this way, the image identification element can be detected even if it occupies a small part of the extracted image.
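The sliding-window scan described in the bullet above can be sketched as follows. The window size, stride, and NumPy array representation are illustrative assumptions, not details taken from the disclosure:

```python
import numpy as np

def sliding_windows(image, win_h, win_w, stride):
    """Scan `image` with a fixed-size window, collecting sub-images.

    Each sub-image can then be passed to a detector, so an image
    identification element can be found even when it occupies only a
    small part of the extracted image.
    """
    h, w = image.shape[:2]
    subs = []
    for top in range(0, h - win_h + 1, stride):
        for left in range(0, w - win_w + 1, stride):
            subs.append(image[top:top + win_h, left:left + win_w])
    return subs

# A 100x100 "extracted image" scanned with a 40x40 window and stride 20
image = np.zeros((100, 100), dtype=np.uint8)
subs = sliding_windows(image, 40, 40, 20)
print(len(subs))  # 4 vertical x 4 horizontal positions = 16
```

A smaller stride trades more detector invocations for a lower chance of splitting an identification element across window boundaries.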
  • the embodiments of the present disclosure may not only determine legitimacy of content related to data collection after the content is delivered, but also may determine the legitimacy of the content before the content is delivered.
  • the content may be obtained during creation of the content, and when it is determined that the content is illegitimate, a prompt message may be sent to a creator of the content and/or a data collection service provider, wherein the prompt message may, e.g., warn or remind that the creator is illegitimately using an identification element in the content.
  • the content may be prevented from being delivered, thereby avoiding illegitimate collection of user data.
  • FIG.1 illustrates an exemplary process 100 for detecting legitimacy of content related to data collection according to an embodiment of the present disclosure.
  • an entity corresponding to an identification element contained in content related to data collection may be recognized, and legitimacy of the content may be determined based on correlation between a creator of the content and the recognized entity.
  • content related to data collection may be obtained.
  • corresponding content may be obtained based on digital information formats employed for the data collection, such as forms, web pages, emails, productivity tool documents, etc.
  • the content may be obtained after the content is published. Alternatively, the content may also be obtained before the content is published, for example during creation of the content.
  • at least one identification element may be detected from the content.
  • a text identification element may be directly detected or extracted from the content, or detected or extracted from information related to the content.
  • the information related to the content may be code information of the content, e.g., HTML code information of the content, a render tree constructed during the HTML rendering process, etc.
  • a set of text identification elements may be predefined, and the set of text identification elements may include, e.g., representative characters associated with common entities, such as a name of a well-known brand, a name of a famous person, etc.
  • the text identification element may be detected through retrieving predefined text identification elements from the content or from information related to the content. Additionally or alternatively, it may be determined whether a text element is a text identification element based on whether its position in the content is an identifying position. For example, a set of positions may be predefined, and the set of positions may include, e.g., positions where text identification elements usually appear, such as an upper left area, upper right area, lower left area, lower right area, etc. of the content. It may be determined that a text element is a text identification element based on the text element being located at one of the predefined positions.
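The two detection criteria above (a predefined element set and predefined identifying positions) can be combined in a minimal sketch. The element names and position labels below are assumptions for illustration only:

```python
# Hypothetical predefined vocabularies (assumptions, not from the patent)
KNOWN_TEXT_IDS = {"Microsoft", "Bill Gates", "AABB"}
IDENTIFYING_POSITIONS = {"upper-left", "upper-right", "lower-left", "lower-right"}

def is_text_identification_element(text, position):
    """A text element is treated as a text identification element if it
    matches a predefined element, or if it sits at one of the predefined
    identifying positions on the content."""
    return text in KNOWN_TEXT_IDS or position in IDENTIFYING_POSITIONS

print(is_text_identification_element("Microsoft", "center"))  # True
print(is_text_identification_element("Hello", "upper-left"))  # True
print(is_text_identification_element("Hello", "center"))      # False
```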
  • an image may be extracted from the content, and the image identification element may be acquired from the extracted image.
  • the image may be extracted from code information related to the content, or extracted from a rendering result of the content, for example.
  • Exemplary processes for detecting the image identification element will be illustrated later in conjunction with FIGS. 5-7. These processes may be respectively applied to content where an image identification element is located at a predetermined position and content where an image identification element is located at an unknown position.
  • an entity corresponding to the identification element may be recognized.
  • an entity corresponding to the text identification element may be recognized based on text content of the text identification element. For example, for a text identification element including a text "Microsoft", an entity corresponding to the text identification element may be determined based on the text "Microsoft" as a brand "Microsoft", and the entity is owned by Microsoft Corporation.
  • the entity corresponding to the text identification element may be recognized through predefining a mapping table including text identification elements and their corresponding entities, and searching the mapping table or using mapping rules.
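A mapping-table lookup of this kind is straightforward; the table below is a hypothetical sketch built from the "Microsoft" and "Bill Gates" examples given earlier:

```python
# Hypothetical mapping table from text identification elements to entities
TEXT_ID_TO_ENTITY = {
    "Microsoft": "Microsoft",
    "the world's largest computer software provider": "Microsoft",
    "Microsoft Corporation's founder": "Bill Gates",
}

def recognize_entity(text_id_element):
    """Look up the entity corresponding to a text identification element;
    returns None when the element is not in the mapping table."""
    return TEXT_ID_TO_ENTITY.get(text_id_element)

print(recognize_entity("Microsoft Corporation's founder"))  # Bill Gates
```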
  • the entity corresponding to the text identification element may be recognized with a machine learning model. The machine learning model may be trained with a training dataset including text identification element samples and entity labels.
  • a current machine learning model used to recognize an entity corresponding to a text identification element usually relies only on authentic identification elements for training; thus it can only recognize the authentic identification elements, and may have difficulty recognizing identification elements different from the authentic identification elements.
  • the identification elements different from the authentic identification elements may also be referred to as non-authentic identification elements, faked identification elements, etc.
  • the embodiments of the present disclosure propose to improve a training dataset used to train the machine learning model, so that when actually deployed, the trained machine learning model can correctly recognize authentic and non-authentic identification elements instead of being limited to the authentic identification elements.
  • a non-authentic identification element may be, e.g., an identification element obtained through transforming an authentic identification element.
  • a non-authentic text identification element "Micr0$oft" may be obtained through replacing the letter "o" in an authentic text identification element "Microsoft" of the brand "Microsoft" with the number "0", and replacing the letter "s" with the dollar sign "$". Since "Micr0$oft" is close to "Microsoft", it is still possible for viewers to associate "Micr0$oft" with the brand "Microsoft".
  • the training dataset used to train the machine learning model according to the embodiment of the present disclosure may include training samples related to non-authentic text identification elements in addition to training samples related to authentic identification elements.
  • a training sample related to a non-authentic text identification element may include a non-authentic identification element sample and an entity label corresponding to the non-authentic identification element sample.
  • the entity label may be consistent with an entity label of an authentic identification element sample associated with the non-authentic identification element sample.
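Non-authentic text samples like "Micr0$oft" can be generated systematically from an authentic element. The substitution table below is an illustrative assumption; each generated sample keeps the entity label of the authentic element it was derived from, as described above:

```python
from itertools import product

# Hypothetical table of visually similar character substitutions
SUBSTITUTIONS = {"o": ["o", "0"], "s": ["s", "$"]}

def non_authentic_variants(authentic):
    """Generate non-authentic text identification element samples by
    substituting visually similar characters in an authentic element."""
    choices = [SUBSTITUTIONS.get(c, [c]) for c in authentic]
    return {"".join(p) for p in product(*choices)} - {authentic}

variants = non_authentic_variants("Microsoft")
# Each variant is labeled with the entity of the authentic element
samples = [(v, "Microsoft") for v in variants]
print(len(variants))  # 3 substitutable positions -> 2**3 - 1 = 7 variants
```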
  • the image identification element may be used as a whole to recognize its corresponding entity.
  • a candidate entity corresponding to each visual component in the image identification element may be individually recognized, and the entity corresponding to the image identification element may be determined based on recognition results of candidate entities.
  • An exemplary process for recognizing an entity corresponding to an image identification element will be illustrated later in conjunction with FIG.8.
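One way to combine per-component recognition results into a single entity, as described above, is a simple majority vote. This aggregation rule is an assumption for illustration; the disclosure's own process is the one shown in FIG.8:

```python
from collections import Counter

def entity_for_image_element(component_candidates):
    """Aggregate candidate entities recognized for individual visual
    components into one entity for the whole image identification
    element; components with no recognized candidate are skipped."""
    votes = Counter(c for c in component_candidates if c is not None)
    return votes.most_common(1)[0][0] if votes else None

# E.g., a logo whose graph and text components both suggest "AABB"
print(entity_for_image_element(["AABB", "AABB", None]))  # AABB
```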
  • a visual component may include a text visual component and a non-text visual component.
  • the text visual component may refer to a visual component only presenting text
  • the non-text visual component may refer to a visual component presenting at least a graph.
  • an entity corresponding to a text visual component may be recognized through OCR
  • at least an entity corresponding to a non-text visual component may be recognized through a deep learning-based model.
  • a model for recognizing an entity corresponding to a non-text visual component may be referred to as an entity recognizing model.
  • the entity recognizing model may also recognize text visual components that are difficult to recognize through OCR, such as a text visual component containing text in artistic fonts, a text visual component containing deformed text, etc.
  • a current entity recognizing model usually relies only on authentic identification elements for training; thus it can only recognize the authentic identification elements, and may have difficulty recognizing identification elements different from the authentic identification elements.
  • a non-authentic identification element may include, e.g., an identification element obtained through transforming an authentic identification element, an identification element created for a specific entity, etc.
  • the embodiments of the present disclosure propose to improve a training dataset used to train the entity recognizing model, so that when actually deployed, the trained entity recognizing model can correctly recognize the authentic and non-authentic image identification elements instead of being limited to the authentic image identification elements.
  • An exemplary process for generating the training dataset for training the entity recognizing model will be illustrated later in conjunction with FIG.9.
  • the non-authentic image identification element may be, e.g., an identification element obtained through transforming an authentic identification element.
  • FIG.2 illustrates an exemplary authentic image identification element and exemplary non-authentic image identification elements obtained through transforming the authentic image identification element.
  • An identification element 200a may be, e.g., an authentic logo of a brand “AABB”, which may include a non-text visual component composed of a rectangle, a circle, and a triangle, and a text visual component composed of a text “AABB”, and the text visual component is located to the right of the non-text visual component.
  • Identification elements 200b-200d may be non-authentic logos obtained through, e.g., transforming authentic logos.
  • the identification element 200b also includes the same non-text visual component and text visual component as the identification element 200a, but in the identification element 200b, the text visual component is located below the non-text visual component instead of to the right.
  • the relative position between the visual components in the identification element 200c is the same as that of the identification element 200a, but the non-text visual component in the identification element 200c is rotated relative to the non-text visual component in the identification element 200a.
  • the relative position between the visual components in the identification element 200d is the same as that of the identification element 200a, but the non-text visual component in the identification element 200d is flipped with respect to the non-text visual component in the identification element 200a.
  • identification elements 200b-200d shown in FIG.2 are merely examples of non-authentic image identification elements obtained through transforming an authentic image identification element of a specific entity.
  • the non-authentic image identification elements obtained through transforming the authentic image identification element may also have any other form, such as having a different color from the authentic image identification element, having an artistic font different from the authentic image identification element, etc.
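The rotation and flip transformations illustrated by the identification elements 200b-200d can be sketched as a small augmentation routine. The NumPy representation is an assumption; color and font transformations, also mentioned above, are omitted from this sketch:

```python
import numpy as np

def augment_visual_component(component):
    """Produce transformed copies (rotations and flips) of a visual
    component; such transformed samples may serve as non-authentic
    training samples that keep the original component's entity label."""
    variants = [np.rot90(component, k) for k in (1, 2, 3)]  # 90/180/270 deg
    variants.append(np.fliplr(component))  # horizontal flip
    variants.append(np.flipud(component))  # vertical flip
    return variants

component = np.array([[1, 0],
                      [0, 2]])
augmented = augment_visual_component(component)
print(len(augmented))  # 5
```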
  • the non-authentic image identification element may also be, e.g., an image identification element created for a specific entity.
  • FIG.3 illustrates exemplary non-authentic image identification elements created for a specific entity.
  • An identification element 300a and an identification element 300b may be, e.g., created for a brand “AABB”, which may not be obtained through transforming the authentic logo of the brand “AABB” as shown in the identification element 200a in FIG.2.
  • since the identification element 300a and the identification element 300b contain the text "AABB", viewers are very likely to mistake them for the logo of the brand "AABB".
  • image identification elements shown in FIG.3 are only several examples of non-authentic image identification elements created for a specific entity.
  • the non-authentic image identification element created for the specific entity may also have any other form, such as a picture with a slogan of the specific entity, a photo with a store of the specific entity, etc.
  • legitimacy of the content may be determined based at least on correlation between a creator of the content and the recognized entity.
  • if the creator of the content is authorized by the owner of the entity to use the identification element, the creator of the content may be considered to be correlated with the entity. It may be determined whether the creator of the content has such authorization through, e.g., checking permission information of the creator of the content. Alternatively, it may be determined whether the entity authorizes the creator to use its identification element through, e.g., checking authorization information of the entity. It should be appreciated that the aforementioned criteria for evaluating the correlation between the creator of the content and the entity are only exemplary, and other criteria may also be used to evaluate the correlation.
  • the creator of the content may also be considered to be correlated with the entity in other cases. For example, if a domain name of a verified email address of the creator of the content is consistent with a domain name of an email address of the owner of the entity, it may be considered that the creator of the content is correlated with the entity.
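The domain-based correlation check described above can be sketched as follows; the function name and the email addresses are illustrative assumptions:

```python
def creator_correlated_with_entity(creator_email, entity_owner_domain):
    """Exemplary correlation check: the domain of the creator's verified
    email address matches the entity owner's email domain."""
    domain = creator_email.rsplit("@", 1)[-1].lower()
    return domain == entity_owner_domain.lower()

print(creator_correlated_with_entity("creator@microsoft.com", "microsoft.com"))  # True
print(creator_correlated_with_entity("creator@example.com", "microsoft.com"))    # False
```

A real deployment would also need to verify that the email address itself is authenticated, since a bare domain comparison can be spoofed.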
  • if it is evaluated at 108 that the creator of the content is correlated with the recognized entity, the process 100 may proceed to 110, that is, it is determined that the content is legitimate. If it is evaluated at 108 that the creator of the content is unrelated with the recognized entity, the process 100 may proceed to 112, that is, it is determined that the content is illegitimate. When the content is determined to be illegitimate, some measures may be taken to control or interfere with it, such as prohibiting the data collection, sending reminders to users, etc.
  • the process 100 in FIG.1 is only an example of the process for detecting legitimacy of content related to data collection. According to actual application requirements, the process for detecting legitimacy of content related to data collection may include any other steps, and may include more or fewer steps. Moreover, the evaluation result of the correlation between the creator of the content and the entity made at 108 is only one of the factors for determining the legitimacy of the content. The determination of the legitimacy of the content may also be based on other factors, such as the type of data collected by the content, the purpose of the data collection involved in the content, etc.
  • the embodiments of the present disclosure are not limited thereto. It is also possible to detect multiple identification elements from a given content.
  • the legitimacy detection process according to the embodiment of the present disclosure can process all the detected identification elements, and recognize an entity corresponding to each identification element.
  • the recognized entities may be the same or different. In the case where the recognized entities are different, the legitimacy of the content may be determined based on the correlation between the creator of the content and each entity.
  • an entity corresponding to a single identification element may be recognized as two or more entities, and the confidence scores associated with the individual entities, which indicate the credibility of each corresponding entity, may also be relatively close.
  • the legitimacy of the content may be determined based on the correlation between the creator of the content and each entity.
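When several entities are recognized, either from multiple identification elements or from close-confidence candidates for a single element, one plausible decision rule is to require correlation with every recognized entity. This rule, the creator name, and the correlation table below are assumptions for illustration:

```python
def content_is_legitimate(creator, entities, correlated):
    """Content is treated as legitimate only if the creator is correlated
    with every recognized entity; `correlated` is a predicate supplied by
    the caller (e.g., a permission or authorization check)."""
    return all(correlated(creator, e) for e in entities)

# Toy correlation table: "Contoso" is only authorized for the brand "AABB"
AUTHORIZED = {("Contoso", "AABB")}
correlated = lambda creator, entity: (creator, entity) in AUTHORIZED

print(content_is_legitimate("Contoso", ["AABB"], correlated))               # True
print(content_is_legitimate("Contoso", ["AABB", "Microsoft"], correlated))  # False
```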
  • the process 100 may be performed before content related to data collection is delivered.
  • the content can be obtained during creation of the content.
  • a prompt message may be sent to the creator of the content.
  • the prompt message may, e.g., warn or remind the creator that an identification element may be used illegitimately in the content.
  • the content may be prevented from being delivered, so that user data can be prevented from being illegitimately collected.
  • FIG.4 illustrates an exemplary form 400 for data collection.
  • a title "Account Manager” of the form 400 indicates that the form is intended to assist a recipient in logging into its account.
  • a logo 402 is provided in the upper left corner of the form 400, which is intended to identify an entity associated with a data collection service.
• a creator of the form created questions for collecting user data in the form, such as "Enter your username" and "Enter your password". If the recipient gives answers to these questions, the creator will have obtained the desired user data. It should be appreciated that FIG.4 is only an example of data collection, and various other forms of data collection may exist in actual scenarios.
  • a process for detecting legitimacy of content related to data collection may be used to detect legitimacy of the form.
  • the form 400 may be obtained, and an identification element such as the logo 402 may be detected from the form 400.
  • an entity corresponding to the logo 402 may be recognized, such as the brand "AABB”.
• even if the logo 402 is different from the authentic logo of the brand "AABB", such as the identification element 200a in FIG.2, the entity recognizing process according to the embodiment of the present disclosure, such as the step 106 in FIG.1, can still recognize that the entity corresponding to the logo 402 is the brand "AABB".
• it may be determined whether the creator of the form 400 is correlated with the brand "AABB". For example, if the creator of the form 400 is the same as an owner of the brand "AABB", or the creator of the form 400 is authorized by the owner of the brand "AABB" to use its logo, it may be considered that the creator of the form 400 is correlated with the brand "AABB". Otherwise, it may be considered that the creator of the form 400 is unrelated to the brand "AABB".
• when the creator of the form 400 is correlated with the brand "AABB", the form 400 may be considered legitimate; when the creator of the form 400 is unrelated to the brand "AABB", the form 400 may be considered illegitimate.
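The correlation check described above can be sketched as a simple predicate. This is an illustrative sketch only; the creator and owner identifiers and the set of authorized creators are hypothetical placeholders, not part of the disclosed method.

```python
def is_content_legitimate(creator: str, entity_owner: str, authorized_creators: set) -> bool:
    """Content is considered legitimate only if its creator owns the
    recognized entity's identification element or is authorized by the
    owner to use it."""
    return creator == entity_owner or creator in authorized_creators

# Hypothetical example: the owner of brand "AABB" authorized "partner-co".
assert is_content_legitimate("aabb-inc", "aabb-inc", set())             # creator is the owner
assert is_content_legitimate("partner-co", "aabb-inc", {"partner-co"})  # authorized creator
assert not is_content_legitimate("phisher", "aabb-inc", {"partner-co"}) # unrelated creator
```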
• when detecting an image identification element from content, an image may first be extracted from the content, and then the image identification element may be acquired from the extracted image.
  • FIGS. 5-7 illustrate exemplary processes 500-700 for detecting an image identification element. These processes may be respectively applied to content where an image identification element is located at a predetermined position and content where an image identification element is located at an unknown position.
  • FIG.5 illustrates an exemplary process 500 for detecting an image identification element according to an embodiment of the present disclosure.
  • the process 500 may be applied to content where an image identification element is located at a predetermined position.
  • an image identification element is usually located in the upper left area of the form.
  • an image may be extracted from a predetermined position of the content.
  • an image identification element is usually located in the upper left area of the form, for example. Accordingly, an image containing the image identification element is usually located in this area, so the image can be extracted from this area.
• the image identification element may be acquired from the extracted image.
  • the image identification element may be acquired, e.g., through a deep learning-based model.
  • a model used to acquire an image identification element at a predetermined position may be referred to as an identification element identifying model.
  • the identification element identifying model may acquire the image identification element without determining the position of the image identification element.
  • the identification element identifying model usually requires an input image with a specified size.
  • the image extracted from the content may be normalized to the specified size required by the identification element identifying model.
  • the image identification element may be acquired from the normalized image through the identification element identifying model.
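The normalization step can be illustrated with a minimal nearest-neighbor resize over a plain 2D pixel grid. A real implementation would typically use an image library; the target size here is an arbitrary assumption for the sketch.

```python
def normalize_image(img, target_h, target_w):
    """Nearest-neighbor resize of a 2D pixel grid to the fixed input
    size required by the identification element identifying model."""
    src_h, src_w = len(img), len(img[0])
    return [
        [img[y * src_h // target_h][x * src_w // target_w] for x in range(target_w)]
        for y in range(target_h)
    ]

img = [[1, 2, 3, 4],
       [5, 6, 7, 8],
       [9, 10, 11, 12],
       [13, 14, 15, 16]]
# Downscale the 4x4 grid to a hypothetical 2x2 model input size.
small = normalize_image(img, 2, 2)
assert small == [[1, 3], [9, 11]]
```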
• although an image containing an identification element is usually located at a predetermined position and its size is within a predetermined range, phishers may still adopt evasion behaviors that make it impossible to acquire the image identification element from the image.
  • the image identification element may be located at different positions in the image, or the image identification element may occupy a small part of the image.
  • an image sample set including a plurality of image samples may be generated as the training dataset for training the identification element identifying model.
  • Each image sample in the image sample set may contain an identification element sample.
  • the identification element sample may be located anywhere in the image sample.
  • a ratio of a size of the identification element sample to a size of the image sample where it is located may be within a predetermined range, e.g., 20% to 100%.
• the identification element identifying model trained with such an image sample set can effectively acquire identification elements from various types of images, such as images where the identification element is located at an arbitrary position, or where the ratio of the identification element's size to the image's size falls anywhere within the predetermined range.
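As a sketch of how such image samples might be laid out, the following places an identification element sample at a random position with its size ratio confined to the 20%-100% range mentioned above. The 224-pixel image size is an arbitrary assumption, not specified by the disclosure.

```python
import math
import random

def sample_placement(img_size: int, seed=None):
    """Generate one training-sample layout: the identification element
    sample is scaled to 20%-100% of the image sample's side length and
    placed at a random position that keeps it fully inside the image."""
    rng = random.Random(seed)
    ratio = rng.uniform(0.2, 1.0)             # predetermined ratio range
    elem_size = math.ceil(img_size * ratio)   # ceil keeps the ratio >= 20%
    x = rng.randint(0, img_size - elem_size)
    y = rng.randint(0, img_size - elem_size)
    return x, y, elem_size

# Every generated placement stays inside the image and within the ratio range.
for seed in range(100):
    x, y, size = sample_placement(224, seed)
    assert 0 <= x and x + size <= 224
    assert 0 <= y and y + size <= 224
    assert 0.2 <= size / 224 <= 1.0
```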
  • FIG.6 illustrates another exemplary process 600 for detecting an image identification element according to an embodiment of the present disclosure.
  • the process 600 may be applied to content where an identification element is located at an unknown position.
  • an object detection method may be used to detect the image identification element.
• an image may be extracted from the content, and the extracted image as a whole may be used to detect the image identification element from it.
  • an image may be extracted from content. Since a position of an image identification element is unknown, images at various positions in the content may be extracted.
  • An image identification element may be acquired from an extracted image.
  • the image identification element may be acquired, e.g., through a deep learning-based model.
  • a model used to acquire an image identification element at an unknown position may be referred to as an identification element detection model.
  • the identification element detection model may first determine a position of an image identification element in an image, and then acquire the image identification element from the position.
  • the identification element detection model usually requires an input image with a specified size.
  • the image extracted from the content may be normalized to the specified size required by the model.
  • a position of an image identification element in the extracted image may be determined through the identification element detection model.
  • the image identification element may be acquired from the determined position through the identification element detection model.
• the identification element detection model may be trained with an image sample set including a plurality of image samples. Each image sample in the image sample set may, e.g., contain an identification element sample and a position of the identification element sample in the image sample. When actually deployed, the identification element detection model trained with such an image sample set may determine a position of an image identification element from an input image, and acquire the image identification element from the determined position.
  • the identification element identifying model for acquiring an image identification element at a predetermined position and the identification element detection model for acquiring an image identification element at an unknown position may be collectively referred to as an identification element acquiring model.
• phishers may place the image identification element in a larger-sized image, which may be, e.g., a background image of a web page or email, so that the image identification element only occupies a small part of the image.
• a model used to acquire an image identification element from an image, such as the aforementioned identification element detection model, usually requires an input image with a specified size, so when the image containing the image identification element is normalized to the specified size, the image identification element will become too small to be acquired by the model.
  • FIG.7 illustrates yet another exemplary process 700 for detecting an image identification element according to an embodiment of the present disclosure.
  • the process 700 may be applied to content where an image identification element is located at an unknown position and the image identification element occupies a small part of the image.
  • an image may be extracted from content. Since a position of an image identification element is unknown, images at various positions in the content may be extracted.
  • a set of sub-images may be obtained through scanning the image with a sliding window.
  • a size of each sub-image may be consistent with a size of the sliding window.
• the size of the sliding window may be set to be larger than the size of the image identification element, so that it can contain the image identification element. Since the size of the image extracted from the content is known, and since the size of the image identification element should be within a reasonable range for recipients of the content to be able to notice it, the size of the image identification element may be roughly estimated based on the size of the image.
  • a size of a sub-image should be as large as possible.
  • a ratio of the size of the image identification element to the size of the sub-image should be higher than a predetermined threshold.
• the size of the sub-image, that is, the size of the sliding window, may be smaller than a predetermined multiple of the size of the image identification element. Therefore, the size of the sliding window may be larger than the size of the image identification element and smaller than the predetermined multiple of the size of the image identification element.
  • the set of sub-images may also have a predetermined overlap degree.
  • the predetermined overlap degree may be set to 50%, for example.
  • the sliding window may move 50% of its side length in the horizontal or vertical direction each time, so that the newly obtained sub-image has a 50% overlap with the previous sub-image.
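The sliding-window scan with 50% overlap can be sketched as follows. The edge handling (appending a final window flush with the image border) is an assumption for the sketch, and the image is assumed to be at least as large as the window.

```python
def sliding_windows(img_w, img_h, win, overlap=0.5):
    """Yield top-left corners of sub-images obtained by scanning the
    image with a square window of side `win`; the stride is the side
    length times (1 - overlap), i.e. adjacent sub-images overlap by 50%
    with the default setting."""
    stride = max(1, int(win * (1 - overlap)))
    xs = list(range(0, max(img_w - win, 0) + 1, stride))
    ys = list(range(0, max(img_h - win, 0) + 1, stride))
    # make sure the right and bottom borders are always covered
    if xs[-1] + win < img_w:
        xs.append(img_w - win)
    if ys[-1] + win < img_h:
        ys.append(img_h - win)
    return [(x, y) for y in ys for x in xs]

wins = sliding_windows(200, 100, 50)
assert wins[0] == (0, 0)
assert len(wins) == 21                                    # 7 columns x 3 rows
assert all(x + 50 <= 200 and y + 50 <= 100 for x, y in wins)
```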
• an image identification element may be acquired from the obtained set of sub-images. Since a ratio of a size of an image identification element to a size of a sub-image is within a predetermined range, the image identification element may be acquired from the sub-image through, e.g., an identification element identifying model.
  • the identification element identifying model may be, e.g., the identification element identifying model trained with an image sample set as described above.
  • the identification element identifying model usually requires an input image with a specified size.
  • each sub-image in the set of sub-images may be normalized to a specified size required by the identification element identifying model.
  • an image identification element may be acquired from the set of normalized sub-images through, e.g., an identification element identifying model.
• the processes for detecting an image identification element described above in conjunction with FIG.5 to FIG.7 are only exemplary. According to actual application requirements, the process for detecting the image identification element may include any other steps, and may include more or fewer steps.
• before scanning an image extracted from content with a sliding window, it may be determined whether a size of the image exceeds a predetermined threshold, and the scanning may be performed when the size exceeds the threshold; otherwise, the image may be normalized to a specified size required by the identification element detection model and the image identification element may be acquired from the normalized image, as described in steps 620-640 in FIG.6.
• multiple sliding windows with corresponding sizes may be used to scan the image to obtain multiple sets of sub-images, and a corresponding image identification element may be acquired from each set of sub-images.
• although the processes 500-700 include normalizing an image or sub-images and acquiring an identification element from the normalized image or sub-images, the processing related to normalizing may also be omitted from the processes 500-700 in some cases. For example, when a size of the image or sub-images is below a predetermined threshold, an identification element may be directly acquired from the image or sub-images without normalization.
• the identification element 200a may be an authentic identification element including a non-text visual component composed of a square, a circle, and a triangle, and a text visual component formed of the text "AABB".
• the identification element 200b includes the same non-text visual component and text visual component as the identification element 200a, but its text visual component is located below its non-text visual component instead of to the right. Phishers often use an identification element whose visual components have relative positions different from those in an authentic identification element to evade recognition.
  • FIG.8 illustrates an exemplary process 800 for recognizing an entity corresponding to an identification element according to an embodiment of the present disclosure.
  • the process 800 may be performed for an image identification element having one or more visual components.
  • Candidate entities corresponding to various visual components in the image identification element may be recognized, and an entity corresponding to the image identification element may be determined based at least on the candidate entities.
  • positions of the visual components may also be determined, and the entity corresponding to the image identification element may be determined based on both the candidate entities and the positions.
  • candidate entities corresponding to various visual components in an image identification element may be recognized respectively.
  • a text visual component may be recognized through OCR, and at least a non-text visual component may be recognized through an entity recognizing model based on deep learning.
  • the entity recognizing model may also recognize a text visual component that is difficult to recognize through OCR, such as a text visual component containing text in artistic fonts, a text visual component containing deformed text, etc.
  • a first candidate entity set corresponding to a non-text visual component in the upper part of the identification element 200b may be recognized through the entity recognizing model, and a second candidate entity set corresponding to a text visual component in the lower part of the identification element 200b may be recognized through OCR.
  • a single visual component may correspond to one or more candidate entities.
• the first candidate entity set may include a brand "AABB" and a brand "CCDD".
  • the second candidate entity set may include the brand "AABB”.
  • positions of various visual components may be determined.
• the non-text visual component in the upper part of the identification element 200b may be determined to be at a first position.
  • the text visual component in the lower part of the identification element 200b may be determined to be at a second position.
  • a distance between every two visual components may be calculated based on the determined positions of various visual components to obtain a set of distances.
  • a distance between the non-text visual component and the text visual component of the identification element 200b may be calculated based on the first position of the non-text visual component and the second position of the text visual component.
  • an entity corresponding to the identification element may be determined based at least on the candidate entities recognized at 810.
• for the identification element 200b, the first candidate entity set includes the brand "AABB" and the brand "CCDD", and the second candidate entity set includes the brand "AABB". Accordingly, the entity corresponding to the identification element 200b may be determined as the brand "AABB".
  • the entity corresponding to the identification element may be further determined based on the set of distances calculated at 830.
  • the entity corresponding to the identification element may be determined based on the two candidate entity sets corresponding to the two visual components respectively. For example, for the identification element 200b, it may be determined whether a distance between its two visual components is below a threshold. If the distance is below the threshold, the entity corresponding to the identification element 200b may be determined based on both the first candidate entity set including the brand "AABB" and the brand "CCDD" and the second candidate entity set including the brand "AABB”. This entity may be determined as, e.g., the brand "AABB”.
• if the distance is not below the threshold, the two visual components may not belong to one identification element. Therefore, the two candidate entity sets corresponding to the two visual components may be directly provided as the entity recognition result.
• for the identification element 200b, if the distance is not below the threshold, the brand "AABB" and the brand "CCDD" are provided as the recognition result of the non-text visual component, and the brand "AABB" is provided as the recognition result of the text visual component.
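The logic of steps 810-840 might be sketched as follows, with hypothetical candidate sets and component positions for the identification element 200b. Taking the intersection of the candidate sets is one plausible reading of how a single entity is determined from multiple sets; it is an assumption, not the only possibility.

```python
import math

def recognize_entity(components, distance_threshold):
    """`components` is a list of (candidate_entity_set, (x, y) position)
    pairs for the visual components.  If all pairwise distances are
    below the threshold, the components are treated as one identification
    element and the entity is the intersection of their candidate sets;
    otherwise the candidate sets are returned separately."""
    positions = [pos for _, pos in components]
    pairs = [(a, b) for i, a in enumerate(positions) for b in positions[i + 1:]]
    if all(math.dist(a, b) < distance_threshold for a, b in pairs):
        common = set.intersection(*[set(c) for c, _ in components])
        return sorted(common)
    return [sorted(c) for c, _ in components]

# Hypothetical layout of identification element 200b: the non-text
# component sits directly above the text component.
non_text = ({"AABB", "CCDD"}, (0, 0))
text = ({"AABB"}, (0, 10))
assert recognize_entity([non_text, text], distance_threshold=20) == ["AABB"]
assert recognize_entity([non_text, text], distance_threshold=5) == [["AABB", "CCDD"], ["AABB"]]
```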
• in the case where the identification element includes more than two visual components, the entity corresponding to the identification element may be determined based on candidate entities corresponding to the various visual components. If distances between one or more visual components and their closest visual components are not below the threshold, the one or more visual components may not belong to the identification element. In this case, when determining the entity corresponding to the identification element, the candidate entities corresponding to the one or more visual components may not be considered.
  • the process for recognizing the entity corresponding to the text identification element described above in conjunction with FIG.1 only considers the text identification element itself, and the process 800 for recognizing the entity corresponding to the image identification element described above in conjunction with FIG.8 also only considers the image identification element itself.
  • entity recognition may also be performed based on context of the text identification element or image identification element.
  • the context may refer to content where an identification element is located, such as a form, a web page, an email, a productivity tool document, etc.
  • an entity corresponding to an identification element may be recognized based at least on context of the identification element.
  • the text identification element and context of the text identification element may be provided to a model for recognizing an entity corresponding to the text identification element.
  • the model may recognize an entity corresponding to the text identification element based at least on the context.
  • the image identification element and context of the image identification element may be provided to an entity recognizing model.
  • the entity recognizing model may recognize candidate entities corresponding to various visual components in the image identification element based at least on the context. Subsequently, an entity corresponding to the image identification element may be determined through, e.g., steps similar to steps 820-840 in the process 800.
• the entity corresponding to the identification element may also be recognized based on only a part of the context of the text identification element or the image identification element which includes the identification element, rather than the entire context. For example, an area with a predetermined size including the identification element may be cropped from the context, and the cropped area and the identification element may be provided to a corresponding model.
  • a text visual component that is difficult to recognize through OCR and a non-text visual component may be recognized through an entity recognizing model.
  • the entity recognizing model may be trained with a training dataset according to the embodiments of the present disclosure.
  • the training dataset may be a visual component sample set including a plurality of visual component samples.
  • the visual component sample set may also include visual component samples different from the visual components in the authentic identification elements, so that when actually deployed, the trained model can correctly recognize the authentic and non-authentic identification elements instead of being limited to the authentic identification elements.
  • a visual component sample different from a visual component in an authentic identification element may be generated based on the authentic identification element, or created for a specific entity.
  • FIG.9 illustrates an exemplary process 900 for generating a training dataset for training an entity recognizing model based on authentic identification elements according to an embodiment of the present disclosure.
  • authentic identification elements may be collected.
  • the authentic identification elements may be collected from the Internet and accessible phishing content.
  • An entity corresponding to each authentic identification element may be determined, and the determined entity may be assigned to the authentic identification element as an entity label.
  • context where the authentic identification elements are located such as forms, web pages, emails, productivity tool documents, etc., may also be collected.
  • a position of each authentic identification element e.g., its position in content, may also be determined, and the determined position may be assigned to the authentic identification element as a position label.
  • FIG.10 illustrates an exemplary split of an authentic identification element according to an embodiment of the present disclosure.
• the identification element shown in FIG.10 may be, e.g., an authentic identification element of a brand "AABB", which may correspond to the identification element 200a in FIG.2.
  • the authentic identification element may be split into a visual component 1000a composed of a rectangle, a circle, and a triangle, and a visual component 1000b composed of text "AABB".
  • a visual component sample may be obtained through transforming a split visual component of the authentic identification element.
  • FIG.11 illustrates exemplary transformations of a visual component according to an embodiment of the present disclosure.
  • a visual component may be rotated.
• a visual component sample 1100a in FIG.11 may be obtained through rotating the visual component 1000a in FIG.10 by 180 degrees clockwise.
  • a visual component may be flipped.
  • a visual component sample 1100b in FIG.11 may be obtained through flipping the visual component 1000a in FIG.10.
  • a color of a visual component may be changed to other colors.
  • a color of at least a part of a visual component may be changed to some classic colors, such as red, yellow, black, etc.
  • colors of various parts of a visual component may also be rotated and symmetrically changed. Assume that in the visual component 1000a in FIG.10, the rectangle is red, the circle is yellow, and the triangle is brown. Rotating the colors of these parts may be, e.g., changing the circle to red, the triangle to yellow, and the rectangle to brown, etc.
  • a color of a visual component may be converted into grayscale. Colors are ever-changing, and phishers may easily change a color of a visual component or an identification element into a color that an entity recognition model has never seen before.
• the embodiments of the present disclosure propose to convert a color of a visual component sample used for training an entity recognizing model into grayscale, so as to make the model focus on recognizing a shape of the visual component instead of its color.
  • a visual component sample 1100c in FIG.11 may be obtained through converting the color of the visual component 1000a in FIG.10 into grayscale. Each part of the visual component sample 1100c may have different shades of gray, which may, e.g., correspond to different colors in the original visual component, that is, the visual component 1000a.
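The rotation, flipping, and grayscale transformations above can be sketched over a plain grid of RGB tuples. The ITU-R 601 luminance weights used for the grayscale conversion are a common convention, assumed here rather than specified by the disclosure.

```python
def rotate_180(img):
    """Rotate a pixel grid by 180 degrees (reverse rows, then columns)."""
    return [row[::-1] for row in img[::-1]]

def flip_horizontal(img):
    """Mirror a pixel grid left-to-right."""
    return [row[::-1] for row in img]

def to_grayscale(img):
    """Convert RGB pixels to grayscale using standard luminance weights,
    so a model trained on the result focuses on shape rather than color."""
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row] for row in img]

img = [[(255, 0, 0), (0, 255, 0)],
       [(0, 0, 255), (255, 255, 255)]]
assert rotate_180(img) == [[(255, 255, 255), (0, 0, 255)],
                           [(0, 255, 0), (255, 0, 0)]]
assert flip_horizontal(img)[0] == [(0, 255, 0), (255, 0, 0)]
assert to_grayscale(img) == [[76, 150], [29, 255]]  # distinct grays per color
```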
  • the context may also be transformed. For example, some information in the context, such as templates, colors, etc., may be changed.
  • an extended visual component sample set may be obtained.
  • the extended visual component sample set may include original visual components extracted from authentic identification elements.
  • an extended context sample set may be obtained.
  • the extended context sample set may include the original context.
• each visual component sample in the extended visual component sample set may be combined with different context samples in the extended context sample set to generate multiple training samples, each training sample including a visual component sample and a context sample.
  • one or more data augmentation operations may also be applied to visual component samples, such as rotation, adding noise, perspective transformation, translation, etc.
  • Each training sample may include an entity label, and the entity label may be consistent with an entity label of an authentic identification element associated with the visual component sample.
• each training sample may also include a position label, and the position label may correspond to a position label of an authentic identification element associated with the visual component sample.
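Combining each visual component sample with each context sample is essentially a cartesian product. The sketch below uses placeholder strings for the samples and inherits the entity label from the component sample, as described above; sample names are hypothetical.

```python
from itertools import product

def build_training_samples(component_samples, context_samples):
    """Combine every (component, entity_label) pair with every context
    sample; each resulting training sample is a
    (component, context, entity_label) tuple."""
    return [
        (component, context, entity_label)
        for (component, entity_label), context in product(component_samples, context_samples)
    ]

# Hypothetical samples: two transformed components of brand "AABB"
# combined with two context templates.
components = [("aabb_rotated", "AABB"), ("aabb_grayscale", "AABB")]
contexts = ["form_template", "email_template"]
samples = build_training_samples(components, contexts)
assert len(samples) == 4
assert samples[0] == ("aabb_rotated", "form_template", "AABB")
```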
  • the process 900 in FIG.9 is only an example of a process for generating a training dataset for training an entity recognizing model based on authentic identification elements.
  • the process for generating the training dataset may include any other steps, and may include more or fewer steps.
• although the process 900 includes splitting an authentic identification element into visual components and, accordingly, a generated training sample includes a visual component sample, it is also feasible not to split the authentic identification element.
  • An entity recognizing model trained with such training samples may recognize an entity corresponding to an image identification element by using the image identification elements as a whole when deployed. Furthermore, although the process 900 includes the collection and transformation of the context, it is also feasible to collect only the authentic identification elements and transform only the visual components in the authentic identification elements. In this case, the training sample may not include the context sample. Accordingly, an entity recognizing model trained with such a training dataset may not consider context of an identification element when performing entity recognition.
• phishers may also use newly created identification elements, such as the identification element 300a and the identification element 300b in FIG.3.
• the embodiments of the present disclosure propose to add some identification element samples or visual component samples created for specific entities to a training dataset used to train the entity recognizing model. For example, a party that provides a phishing detection service may actively create such identification element samples or visual component samples for some well-known brands and products. Alternatively, identification element samples or visual component samples created by the phishers may also be collected from reported phishing content.
  • the training dataset generated in the above manner may be used to train an entity recognizing model.
  • the training dataset may or may not include position labels.
  • the training dataset that includes position labels may be used to train an entity recognizing model for processing content where image identification elements are located at unknown positions.
  • the training dataset that does not include position labels may be used to train an entity recognizing model for processing content where image identification elements are located at predetermined positions.
• in the training dataset generated in the above manner, a visual component of an image identification element is regarded as an individual training sample. Accordingly, when an entity recognizing model trained with such training samples recognizes an identification element, it also individually recognizes each visual component in the image identification element. In this way, phishers' behaviors of evading detection by changing relative positions of visual components in an image identification element can be effectively dealt with.
• the training dataset generated by the above manner may have multiple different training samples for each entity, such as a visual component sample extracted from an authentic identification element, a visual component sample obtained through transforming a visual component in an authentic identification element, and a visual component sample created for a specific entity. These training samples enrich the training dataset, so that when actually deployed, the trained entity recognizing model can correctly recognize a variety of image identification elements instead of being limited to authentic identification elements.
  • training samples for various entities in the training dataset generated in the above manner may be randomly divided into three subsets: a training subset used to fit parameters of a model with supervised learning, a validation subset used to tune hyperparameters of the model, select a best model and determine a stopping point during training, and a test subset used to evaluate performance of the final model.
  • multiple visual component samples with different colors may be generated by changing a color of a visual component to other colors.
  • An entity recognizing model may be trained with a training dataset including multiple visual component samples with different colors.
  • the trained entity recognizing model may be referred to as a color model.
  • An entity recognizing model may be trained with a training dataset including grayscale component samples.
  • the trained entity recognizing model may be referred to as a grayscale model.
  • the color model and the grayscale model may be used to recognize the entity, respectively, and recognition results of various models may be combined into a recognition result of the visual component.
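One simple way to combine the color model's and grayscale model's outputs is a per-entity maximum over confidence scores. This merging rule is an assumption for illustration; averaging or voting schemes would be equally plausible.

```python
def combine_recognitions(color_result, grayscale_result):
    """Merge two models' outputs, each a mapping from candidate entity
    to confidence score; the combined score for each entity is the
    maximum of the scores the two models assigned to it."""
    combined = dict(color_result)
    for entity, score in grayscale_result.items():
        combined[entity] = max(combined.get(entity, 0.0), score)
    return combined

# Hypothetical model outputs for one visual component.
color = {"AABB": 0.7, "CCDD": 0.4}
gray = {"AABB": 0.9}
assert combine_recognitions(color, gray) == {"AABB": 0.9, "CCDD": 0.4}
```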
• content such as a form, a web page, an email, and a productivity tool document may also contain further elements that enable recipients of the content to associate the content with a specific entity, such as a template, a style, a color, etc. that the specific entity is accustomed to using. Phishers may use such elements in their content to convince recipients that the content comes from the specific entity corresponding to those elements.
• the embodiments of the present disclosure propose to recognize an entity corresponding to the further element.
  • the recognized entity corresponding to the further element may be referred to as a reference entity.
  • a template or style may be extracted from a given content with an image processing technology, and a reference entity corresponding to the extracted template or style may be recognized through recognition or pattern matching.
  • a final entity may be determined based on an entity corresponding to an identification element in the content and the reference entity.
  • the entity may be determined more accurately.
  • the final entity may be determined as the reference entity.
  • legitimacy of the content may be determined based at least on correlation between a creator of the content and the final entity. The legitimacy of the content may be determined, e.g., in a manner similar to steps 108-110 in FIG. 1.
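The determination of a final entity and the creator-correlation check above can be sketched as follows; preferring the reference entity when one is recognized, and modeling correlation as membership in a hypothetical per-entity allow-list, are illustrative policy choices not fixed by the disclosure:

```python
def determine_final_entity(element_entity, reference_entity):
    """Combine the entity recognized from the identification element
    with the reference entity recognized from the template/style.
    The disclosure notes the final entity may be determined as the
    reference entity; falling back to the element entity otherwise
    is an assumption."""
    return reference_entity if reference_entity is not None else element_entity

def is_legitimate(creator, final_entity, authorized_creators):
    """Judge legitimacy from correlation between the content creator
    and the final entity, modeled here as a hypothetical allow-list
    mapping entities to their authorized creators."""
    return creator in authorized_creators.get(final_entity, set())
```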
  • acquiring an image identification element and recognizing an entity corresponding to the image identification element are performed through separate models, e.g., the acquiring an image identification element is performed through an identification element acquiring model, while the recognizing an entity corresponding to the image identification element is performed through an entity recognizing model, but the embodiment of the present disclosure is not limited to this.
  • the identification element acquiring model and the entity recognizing model may be integrated into an integrated model.
  • the integrated model may detect an identification element from input content, and recognize an entity corresponding to the identification element.
  • the improved training dataset according to the embodiments of the present disclosure may be used to train some neural networks to obtain suitable integrated models.
  • neural networks such as Residual Networks (ResNet) and EfficientNet may be used for training.
  • other networks, such as EfficientDet, Single Shot MultiBox Detector (SSD), and Faster Region-based Convolutional Neural Network (Faster R-CNN), may also be used.
  • selecting a neural network for training may be based on corresponding computing resource conditions, such as CPU resources, memory resources, etc., to achieve reasonable use of computing resources while ensuring that accuracy and delay requirements are met.
  • EfficientNet has a simpler network structure and uses fewer computing resources, yet can still achieve high accuracy. Therefore, EfficientNet may be used when computing resources are limited, while ResNet may be used when computing resources are sufficient.
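The resource-based choice between backbones can be sketched as a simple selection policy; the concrete CPU-core and memory thresholds below are hypothetical, as the disclosure only states the general trade-off:

```python
def select_backbone(cpu_cores, memory_gb):
    """Pick a backbone network given available compute resources
    (threshold values are illustrative assumptions)."""
    if cpu_cores >= 8 and memory_gb >= 16:
        return "ResNet"        # sufficient resources
    return "EfficientNet"      # limited resources: simpler, cheaper, still accurate
```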
  • FIG. 12 is a flowchart of an exemplary method 1200 for detecting legitimacy of content related to data collection according to an embodiment of the present disclosure.
  • step 1210 content related to data collection may be obtained.
  • At step 1220 at least one identification element may be detected from the content.
  • an entity corresponding to the identification element may be recognized.
  • the legitimacy of the content may be determined based at least on correlation between a creator of the content and the entity.
  • the content may be obtained during creation of the content.
  • the method 1200 may further comprise in response to determining that the content is illegitimate, performing at least one of: sending a prompt message to the creator and/or a data collection service provider; and preventing the content from being delivered.
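The overall flow of method 1200, including the responses to illegitimate content, can be sketched as follows; all callables are hypothetical hooks standing in for the detection model, the recognition model, and the notification/blocking services described above:

```python
def detect_legitimacy(content, detect_elements, recognize_entity,
                      is_correlated, notify, block_delivery):
    """Obtain content, detect identification elements, recognize the
    corresponding entity, and judge legitimacy from creator/entity
    correlation (steps 1210-1240). On illegitimate content, prompt
    the creator and prevent delivery."""
    for element in detect_elements(content):
        entity = recognize_entity(element)
        if entity and not is_correlated(content["creator"], entity):
            notify(content["creator"])   # prompt the creator / service provider
            block_delivery(content)      # prevent the content from being delivered
            return False
    return True
```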
  • the identification element may be different from an authentic identification element of the entity.
  • the identification element may include a text identification element.
  • the text identification element may be detected from the content or information related to the content.
  • the identification element may include an image identification element.
  • the detecting at least one identification element may comprise: extracting an image from the content; and acquiring the image identification element from the image.
  • the extracting an image may comprise: extracting the image from a predetermined position in the content.
  • the acquiring the image identification element may comprise: determining a position of the image identification element in the image; and acquiring the image identification element from the position.
  • the acquiring the image identification element may comprise: obtaining a set of sub-images through scanning the image with a sliding window, and acquiring the image identification element from the set of sub-images.
  • a size of the sliding window may be larger than a size of the image identification element and smaller than a predetermined multiple of the size of the image identification element.
  • the size of the image identification element may be estimated based on a size of the image.
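The sliding-window scan with the size constraints above can be sketched as follows; setting the window midway between the element size and twice that size, a predetermined multiple of 2, and a half-window stride are all illustrative choices:

```python
def sliding_windows(img_w, img_h, elem_w, elem_h, max_multiple=2.0, stride_frac=0.5):
    """Enumerate sub-image rectangles (x, y, w, h) produced by scanning
    an image with a sliding window whose size is larger than the
    (estimated) identification-element size but smaller than a
    predetermined multiple of it."""
    win_w = int(elem_w * (1 + max_multiple) / 2)  # between 1x and max_multiple x
    win_h = int(elem_h * (1 + max_multiple) / 2)
    step_x = max(1, int(win_w * stride_frac))
    step_y = max(1, int(win_h * stride_frac))
    boxes = []
    for y in range(0, max(1, img_h - win_h + 1), step_y):
        for x in range(0, max(1, img_w - win_w + 1), step_x):
            boxes.append((x, y, win_w, win_h))
    return boxes
```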
  • the acquiring the image identification element may be performed through an identification element acquiring model.
  • the identification element acquiring model may be a deep learning-based model trained with an image sample set. Each image sample in the image sample set may contain an identification element sample.
  • the identification element sample may be located at an arbitrary position in the image sample.
  • a ratio of a size of the identification element sample to a size of the image sample may be within a predetermined range.
  • the identification element may include an image identification element.
  • the recognizing an entity may comprise: recognizing at least one candidate entity corresponding to at least one visual component in the image identification element; and determining the entity based at least on the at least one candidate entity.
  • the method 1200 may further comprise: determining at least one position of the at least one visual component, and wherein the entity is determined further based on the at least one position.
  • the at least one visual component may include a text visual component.
  • the at least one candidate entity may be recognized through OCR.
  • the at least one visual component may include a non-text visual component.
  • the at least one candidate entity may be recognized through an entity recognizing model.
  • the entity recognizing model may be a deep learning-based model trained with a visual component sample set. Each visual component sample in the visual component sample set may have an entity label.
  • a visual component sample in the visual component sample set may be obtained through at least one of: extracting from an authentic identification element; transforming a visual component in an authentic identification element; and creating for a specific entity.
  • the transforming a visual component may comprise at least one of: rotating the visual component; flipping the visual component; changing a color of the visual component to another color; and converting a color of the visual component into grayscale.
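The listed transformations of a visual component sample can be sketched on a small grid of RGB pixels; representing the sample as a 2-D list of tuples, the channel-swap recoloring, and the ITU-R BT.601 luma weights for grayscale are illustrative choices:

```python
def augment(pixels):
    """Generate transformed variants of a visual component sample
    given as a 2-D list of (r, g, b) pixel tuples: 90-degree rotation,
    horizontal flip, recoloring, and grayscale conversion."""
    rotated = [list(row) for row in zip(*pixels[::-1])]        # rotate 90 degrees clockwise
    flipped = [row[::-1] for row in pixels]                    # horizontal flip
    recolored = [[(g, b, r) for (r, g, b) in row] for row in pixels]  # channel swap
    gray = [[(int(0.299 * r + 0.587 * g + 0.114 * b),) * 3 for (r, g, b) in row]
            for row in pixels]                                 # BT.601 luma
    return {"rotated": rotated, "flipped": flipped,
            "recolored": recolored, "grayscale": gray}
```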
  • the entity may be recognized based at least on a context of the identification element.
  • the method 1200 may further comprise: recognizing a reference entity corresponding to a further element in the content, the further element including at least one of a template, a style, and a color; and determining a final entity based on the entity corresponding to the identification element and the reference entity, and wherein the legitimacy of the content is determined based at least on correlation between the creator and the final entity.
  • the content may include at least one of a form, a webpage, an email, and a productivity tool document.
  • the identification element may include at least one of a logo, a representative picture, and a representative text.
  • the entity may include at least one of a brand, a product, a service, and a person.
  • the method 1200 may further comprise any steps/processes for detecting legitimacy of content related to data collection according to the embodiments of the present disclosure as mentioned above.
  • FIG.13 illustrates an exemplary apparatus 1300 for detecting legitimacy of content related to data collection according to an embodiment of the present disclosure.
  • the apparatus 1300 may comprise: a content obtaining module 1310, for obtaining content related to data collection; an identification element detecting module 1320, for detecting at least one identification element from the content; an entity recognizing module 1330, for recognizing an entity corresponding to the identification element; and a legitimacy determining module 1340, for determining the legitimacy of the content based at least on correlation between a creator of the content and the entity.
  • a content obtaining module 1310 for obtaining content related to data collection
  • an identification element detecting module 1320 for detecting at least one identification element from the content
  • an entity recognizing module 1330 for recognizing an entity corresponding to the identification element
  • a legitimacy determining module 1340 for determining the legitimacy of the content based at least on correlation between a creator of the content and the entity.
  • FIG.14 illustrates an exemplary apparatus 1400 for detecting legitimacy of content related to data collection according to an embodiment of the present disclosure.
  • the apparatus 1400 may comprise at least one processor 1410.
  • the apparatus 1400 may further comprise a memory 1420 connecting with the processor 1410.
  • the memory 1420 may store computer-executable instructions that, when executed, cause the processor 1410 to perform any operations of the methods for detecting legitimacy of content related to data collection according to the embodiments of the present disclosure as mentioned above.
  • the embodiments of the present disclosure may be embodied in a non-transitory computer-readable medium
  • the non-transitory computer-readable medium may comprise instructions that, when executed, cause one or more processors to perform any operations of the methods for detecting legitimacy of content related to data collection according to the embodiments of the present disclosure as mentioned above.
  • modules in the apparatuses described above may be implemented in various approaches. These modules may be implemented as hardware, software, or a combination thereof. Moreover, any of these modules may be further functionally divided into sub-modules or combined together.
  • processors have been described in connection with various apparatuses and methods. These processors may be implemented using electronic hardware, computer software, or any combination thereof. Whether such processors are implemented as hardware or software will depend upon the particular application and overall design constraints imposed on the system.
  • a processor, any portion of a processor, or any combination of processors presented in the present disclosure may be implemented with a microprocessor, microcontroller, digital signal processor (DSP), a field- programmable gate array (FPGA), a programmable logic device (PLD), a state machine, gated logic, discrete hardware circuits, and other suitable processing components configured to perform the various functions described throughout the present disclosure.
  • the functionality of a processor, any portion of a processor, or any combination of processors presented in the present disclosure may be implemented with software being executed by a microprocessor, microcontroller, DSP, or other suitable platform.
  • Software shall be construed broadly to mean instructions, instruction sets, code, code segments, program code, programs, subprograms, software modules, applications, software applications, software packages, routines, subroutines, objects, threads of execution, procedures, functions, etc.
  • the software may reside on a computer-readable medium.
  • a computer-readable medium may include, by way of example, memory such as a magnetic storage device (e.g., hard disk, floppy disk, magnetic strip), an optical disk, a smart card, a flash memory device, random access memory (RAM), read only memory (ROM), programmable ROM (PROM), erasable PROM (EPROM), electrically erasable PROM (EEPROM), a register, or a removable disk.
  • memory is shown separate from the processors in the various aspects presented throughout the present disclosure, the memory may be internal to the processors, e.g., cache or register.
  • the previous description is provided to enable any person skilled in the art to practice the various aspects described herein. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects. Thus, the claims are not intended to be limited to the aspects shown herein. All structural and functional equivalents to the elements of the various aspects described throughout the present disclosure that are known or later come to be known to those of ordinary skilled in the art are expressly incorporated herein and are intended to be encompassed by the claims.


Abstract

The present disclosure relates to a method and apparatus for detecting legitimacy of content related to data collection. Content related to data collection may be obtained. At least one identification element may be detected from the content. An entity corresponding to the identification element may be recognized. The legitimacy of the content may be determined based at least on correlation between a creator of the content and the entity.
PCT/US2021/030997 2020-08-26 2021-05-06 Detection of legitimacy of content related to data collection WO2022046202A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010868389.7A CN112036394A (zh) 2020-08-26 2020-08-26 Detecting legitimacy of content related to data collection
CN202010868389.7 2020-08-26

Publications (1)

Publication Number Publication Date
WO2022046202A1 true WO2022046202A1 (fr) 2022-03-03

Family

ID=73581439

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2021/030997 WO2022046202A1 (fr) 2020-08-26 2021-05-06 Detection of legitimacy of content related to data collection

Country Status (2)

Country Link
CN (1) CN112036394A (fr)
WO (1) WO2022046202A1 (fr)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200067978A1 (en) * 2013-09-16 2020-02-27 ZapFraud, Inc. Detecting phishing attempts

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108228720B (zh) * 2017-12-07 2019-11-08 Beijing ByteDance Network Technology Co., Ltd. Method, system, device, terminal, and storage medium for recognizing relevance between target text content and an original image

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200067978A1 (en) * 2013-09-16 2020-02-27 ZapFraud, Inc. Detecting phishing attempts

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
BOZKIR AHMET SELMAN ET AL: "LogoSENSE: A companion HOG based logo detection scheme for phishing web page and E-mail brand recognition", COMPUTERS & SECURITY, ELSEVIER SCIENCE PUBLISHERS. AMSTERDAM, NL, vol. 95, 8 May 2020 (2020-05-08), XP086157882, ISSN: 0167-4048, [retrieved on 20200508], DOI: 10.1016/J.COSE.2020.101855 *
SIMONE BIANCO ET AL: "Deep Learning for Logo Recognition", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 10 January 2017 (2017-01-10), XP080740581, DOI: 10.1016/J.NEUCOM.2017.03.051 *

Also Published As

Publication number Publication date
CN112036394A (zh) 2020-12-04


Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21727764

Country of ref document: EP

Kind code of ref document: A1