CN112036394A - Detecting validity of content associated with data collection


Info

Publication number
CN112036394A
Authority
CN
China
Prior art keywords
content
image
identifying
entity
identifying element
Prior art date
Legal status
Pending
Application number
CN202010868389.7A
Other languages
Chinese (zh)
Inventor
杜安琪
朱斌
邹世宇
王健
徐星宇
柯尧
张冬梅
Current Assignee
Microsoft Technology Licensing LLC
Original Assignee
Microsoft Technology Licensing LLC
Priority date
Filing date
Publication date
Application filed by Microsoft Technology Licensing LLC
Priority to CN202010868389.7A
Publication of CN112036394A
Priority to PCT/US2021/030997 (published as WO2022046202A1)
Legal status: Pending

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 63/00 Network architectures or network communication protocols for network security
    • H04L 63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L 63/1441 Countermeasures against malicious traffic
    • H04L 63/1466 Active attacks involving interception, injection, modification, spoofing of data unit addresses, e.g. hijacking, packet injection or TCP sequence number attacks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/22 Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 Details of database functions independent of the retrieved data types
    • G06F 16/95 Retrieval from the web
    • G06F 16/953 Querying, e.g. by the use of web search engines
    • G06F 16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F 21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F 21/55 Detecting local intrusion or implementing counter-measures
    • G06F 21/554 Detecting local intrusion or implementing counter-measures involving event detection and direct action
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F 21/60 Protecting data
    • G06F 21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F 21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F 21/6245 Protecting personal data, e.g. for financial or medical purposes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/56 Extraction of image or video features relating to colour
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 2221/00 Indexing scheme relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F 2221/21 Indexing scheme relating to G06F 21/00 and subgroups addressing additional information or applications relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F 2221/2119 Authenticating web pages, e.g. with suspicious links
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Abstract

The present disclosure provides methods and apparatus for detecting the validity of content associated with data collection. Content related to data collection may be obtained. At least one identifying element may be detected from the content. An entity corresponding to the identifying element may be identified. The validity of the content may be determined based at least on a correlation between a creator of the content and the entity.

Description

Detecting validity of content associated with data collection
Background
With the development of internet technology, people can collect data of interest more conveniently through a network. Data collection can be done in different ways, such as through forms, web pages, emails, productivity tool documents, and so on. In this context, a data collection service may refer broadly to various services, applications, software, websites, etc. that implement or include data collection functionality. For example, a survey form service is a dedicated data collection service that collects data through forms. Furthermore, data collection may also be performed in services that are not dedicated to data collection, such as collecting data through web pages in a browser service, collecting data through emails in an email service, collecting data through productivity tool documents in a productivity tool, and so forth. All such data-collection-enabled services may be collectively referred to as data collection services.
Disclosure of Invention
This summary is provided to introduce a selection of concepts that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
Embodiments of the present disclosure provide methods and apparatus for detecting the validity of content associated with data collection. Content related to data collection may be obtained. At least one identifying element may be detected from the content. An entity corresponding to the identifying element may be identified. The validity of the content may be determined based at least on a correlation between a creator of the content and the entity.
It should be noted that one or more of the above aspects include features that are specifically pointed out in the following detailed description and claims. The following description and the annexed drawings set forth in detail certain illustrative features of the one or more aspects. These features are indicative of but a few of the various ways in which the principles of various aspects may be employed and the present disclosure is intended to include all such aspects and their equivalents.
Drawings
The disclosed aspects will hereinafter be described in conjunction with the appended drawings, which are provided to illustrate, but not to limit, the disclosed aspects.
FIG. 1 illustrates an exemplary process for detecting content relevant to data collection in accordance with an embodiment of the disclosure.
FIG. 2 shows an exemplary genuine image identifying element and exemplary non-genuine image identifying elements obtained by transforming it.
FIG. 3 illustrates an exemplary non-real image identifying element created for a particular entity.
FIG. 4 illustrates an exemplary form for data collection.
FIG. 5 illustrates an exemplary process for detecting image identifying elements according to an embodiment of the disclosure.
FIG. 6 illustrates another exemplary process for detecting image identifying elements according to an embodiment of the disclosure.
FIG. 7 illustrates yet another exemplary process for detecting image identifying elements in accordance with an embodiment of the disclosure.
FIG. 8 illustrates an exemplary process for identifying an entity corresponding to an image identifying element in accordance with an embodiment of the disclosure.
FIG. 9 illustrates an exemplary process of generating a training data set for training an entity recognition model based on genuine identifying elements, according to an embodiment of the disclosure.
FIG. 10 illustrates an exemplary splitting of a genuine identifying element according to an embodiment of the disclosure.
FIG. 11 illustrates an exemplary transformation of a visual component according to an embodiment of the disclosure.
FIG. 12 is a flow chart of an exemplary method for detecting content relevant to data collection in accordance with an embodiment of the present disclosure.
FIG. 13 illustrates an exemplary apparatus for detecting content related to data collection in accordance with an embodiment of the disclosure.
FIG. 14 illustrates an exemplary apparatus for detecting content related to data collection in accordance with an embodiment of the disclosure.
Detailed Description
The present disclosure will now be discussed with reference to several exemplary embodiments. It is to be understood that the discussion of these embodiments is merely intended to enable those skilled in the art to better understand and thereby practice the embodiments of the present disclosure, and is not intended to suggest any limitation as to the scope of the present disclosure.
A data collection service may be used by malicious users to collect data for improper purposes, so the service risks being misused. For example, data collection services may be maliciously used to collect personal private or sensitive data, collect business secrets, propagate objectionable content, etc., and the collected data may be used for financial crimes, reputation violations, network attacks, etc. Data collection for illegitimate purposes greatly harms the interests of the provider of the data collection service and of legitimate users. Phishing, for example, is a common network attack. Phishers typically impersonate a particular entity for illicit data collection purposes, such as collecting private or sensitive data including login account names and passwords, bank card numbers, credit card numbers, home addresses, company business information, and the like. Survey form services, for example, are data collection services commonly used by phishers. Through a survey form service, phishers can create and distribute forms for improper purposes and obtain information provided by respondents.
Logos are widely used in various types of data collection services to identify entities associated with the data collection service. Herein, a logo may be an image or text composed of designed graphics and/or text, including, for example, legally registered trademarks, widely used icons, and the like, and the entity may indicate a brand, product, service, person, etc. to which the data collection service is directed. Phishers typically use the logo of a particular entity in a data collection service to trick users into believing that the data collection service was initiated by the owner of, or a party authorized by, the particular entity. Herein, an owner of an entity may refer to a company, organization, individual, etc. that legally owns, possesses, and uses the entity, and an authorized party of an entity may refer to a company, organization, individual, etc. that is authorized to use the entity. For example, phishers often place a brand logo in an email to give users confidence that the email was sent by the owner or an authorized party of the brand, in order to further obtain user data. Existing phishing detection techniques can typically detect and identify the entity to which a genuine logo present in an email corresponds, and determine whether the email is a phishing email based on whether the sender of the email is related to the owner of the entity. However, to circumvent detection, phishers tend to place forged logos in the email, e.g., logos that are different from but similar to the genuine logos, logos newly created for a particular entity, etc. Existing phishing detection techniques may have difficulty identifying such counterfeit logos and thus may fail to determine whether the email is a phishing email. Data collection services other than email face similar problems.
Embodiments of the present disclosure provide for efficient detection of the validity of content associated with data collection. In this context, content may include various forms of digital information that can be used for data collection, such as forms, web pages, emails, productivity tool documents, and the like. Accordingly, the data collection service can be any of various services that support the processing of such content, such as a survey form service, a browser service, an email service, a productivity tool, and the like. Additionally, the validity of the content may refer to whether the creator of the content is legitimate, whether the content legitimately uses a particular identifying element, and so forth. A creator of content may create content for collecting data in a data collection service, and a responder may populate the content with information in the data collection service to provide data, where a responder refers to any recipient of the content who responds to it. In this context, creators and responders may refer broadly to companies, organizations, individuals, etc. that use data collection services.
In one aspect, embodiments of the present disclosure provide for detecting an identifying element from content related to data collection, identifying an entity corresponding to the identifying element, and determining the validity of the content based on a correlation between the creator of the content and the identified entity. As used herein, an identifying element may refer to an element, such as a graphic, text, attribute, or style, that a viewer would associate with a particular entity. Identifying elements may include, for example, a logo, a representative picture, representative text, and the like. A representative picture is a picture capable of representing a particular entity, such as a picture containing the particular entity's advertising slogan, a picture of a building with prominent features associated with the particular entity, a character avatar associated with the particular entity, and so forth. Representative text is text that can represent a particular entity, such as the name of the particular entity, an advertising slogan of the particular entity, a description pointing to the particular entity, and so forth. For example, the description "the largest computer software provider worldwide" may point to "Microsoft", the description "Microsoft corporation founder" may point to "Bill Gates", and so on. Identifying elements are typically present in content in the form of text or images. Herein, an identifying element in the form of text may be referred to as a text identifying element, and an identifying element in the form of an image may be referred to as an image identifying element.
In another aspect, embodiments of the present disclosure propose individually identifying each visual component in an image identifying element, and determining the entity corresponding to the image identifying element based on the identification results for the individual visual components. In this context, visual components refer to the separate, distinct portions that make up an image identifying element. Identifying the individual visual components separately makes it possible to cope effectively with phishers who attempt to evade detection by changing the relative positions of the visual components in an image identifying element.
In another aspect, embodiments of the present disclosure propose recognizing visual components with a deep-learning-based model that can be trained with a training data set obtained according to embodiments of the present disclosure. In addition to visual component samples from genuine identifying elements, the training data set may include visual component samples that differ from the visual components in genuine identifying elements, so that when actually deployed, the trained model is able to correctly recognize a wide variety of identifying elements and is not limited to genuine identifying elements only. In this context, a genuine identifying element refers to an identifying element created and/or used by the owner of an entity, and a visual component sample that differs from a visual component in a genuine identifying element may include, for example, a visual component sample obtained by transforming a visual component in a genuine identifying element, a visual component sample created for a particular entity, and so forth.
In another aspect, embodiments of the present disclosure propose, when detecting an image identifying element from content, scanning an image extracted from the content with a sliding window of predetermined size to obtain a set of sub-images, and detecting the image identifying element from the set of sub-images. In this way, even image identifying elements that occupy only a small fraction of the extracted image can be detected.
In another aspect, embodiments of the present disclosure may determine the validity of content related to data collection not only after the content is published, but also before it is published. For example, the content may be obtained during content creation and, when it is determined that the content is illegitimate, a prompt message may be sent to the creator of the content and/or the data collection service provider, for example to alert or remind the creator that an identifying element is being improperly used in the content. Additionally or alternatively, when it is determined that the content is illegitimate, the content may be prevented from being published, thereby avoiding improper collection of user data.
It should be appreciated that while the foregoing and following discussion may refer to examples of detecting or identifying a logo in content, embodiments of the present disclosure are not so limited, and any other type of identifying element may be detected or identified in content in a similar manner.
FIG. 1 illustrates an exemplary process 100 for detecting content related to data collection in accordance with an embodiment of the disclosure. For example, entities corresponding to identifying elements contained within content related to data collection may be identified, and the validity of the content determined based on the correlation between the creator of the content and the identified entities.
At 102, content related to data collection may be obtained. For example, corresponding content, such as forms, web pages, emails, productivity tool documents, and the like, may be obtained based on the form of digital information employed for data collection. The content may be obtained after the content is published. Alternatively, the content may also be obtained before it is published, for example during content creation.
At 104, at least one identifying element may be detected from the content.
A text identifying element may be detected or extracted directly from the content or from information related to the content. The information related to the content may be code information of the content, such as HTML code information of the content, a render tree constructed during HTML rendering, and the like. In one embodiment, a set of textual identifying elements may be predefined, which may include, for example, representative words associated with common entities, such as names of well-known brands, names of famous persons, and so forth. A textual identifying element may be detected by retrieving a predefined textual identifying element from the content or from information related to the content. Additionally or alternatively, whether a text element is a textual identifying element may be determined based on whether its location in the content is a typical identifying-element location. For example, a set of locations may be predefined, which may include locations where textual identifying elements commonly appear, such as the upper left, upper right, lower left, and lower right regions of the content. A text element may be determined to be a text identifying element based on the text element being located at one of the predefined locations, as sketched below.
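Purely as an illustration of this detection step, a minimal Python sketch follows; the element set, the region names, and both function names are hypothetical and not taken from the disclosed embodiments:

```python
import re

# Hypothetical predefined set of textual identifying elements (cf. step 104).
TEXT_IDENTIFYING_ELEMENTS = {"Microsoft", "AABB"}

# Hypothetical predefined locations where identifying elements commonly appear.
IDENTIFYING_REGIONS = {"top-left", "top-right", "bottom-left", "bottom-right"}

def detect_text_identifying_elements(html: str) -> list[str]:
    """Retrieve predefined textual identifying elements from raw HTML."""
    text = re.sub(r"<[^>]+>", " ", html)  # crude tag stripping, for illustration only
    lowered = text.lower()
    return [e for e in TEXT_IDENTIFYING_ELEMENTS if e.lower() in lowered]

def is_identifying_by_location(region: str) -> bool:
    """Treat a text element as identifying if it sits in a predefined region."""
    return region in IDENTIFYING_REGIONS
```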
For an image identifying element, an image may be extracted from the content, and the image identifying element may be acquired from the extracted image. The image may be extracted, for example, from code information related to the content or from a rendering result of the content. Exemplary processes of detecting image identifying elements will be described later in conjunction with figs. 5-7. These processes may be applied, respectively, to content where the image identifying element is located at a predetermined location and to content where it is located at an unknown location.
At 106, an entity corresponding to the identifying element can be identified.
For a text identifying element, the entity corresponding to it can be identified based on its textual content. For example, for a text identifying element comprising the text "Microsoft", the corresponding entity may be determined to be the brand "Microsoft", which is owned by Microsoft corporation. In one embodiment, a mapping table of text identifying elements and their corresponding entities may be predefined, and the entity corresponding to a text identifying element may be identified by looking up the mapping table or applying mapping rules, as in the sketch below. In another implementation, a machine learning model can be utilized to identify the entities corresponding to textual identifying elements. The machine learning model may be trained using a training data set that includes text identifying element samples and entity labels.
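As an illustration only, such a mapping-table lookup might look like the following sketch (the table contents and function name are hypothetical):

```python
# Hypothetical mapping table from textual identifying elements to entities.
ENTITY_MAP = {
    "microsoft": "brand: Microsoft",
    "aabb": "brand: AABB",
}

def identify_entity(text_element: str) -> str | None:
    """Look up the entity corresponding to a textual identifying element."""
    return ENTITY_MAP.get(text_element.strip().lower())
```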
Current machine learning models for recognizing entities corresponding to textual identifying elements typically rely only on authentic identifying elements for training, and thus can recognize only authentic identifying elements; identifying elements that differ from authentic ones can be difficult for them to recognize. Identifying elements that differ from authentic identifying elements may also be referred to herein as non-authentic identifying elements, counterfeit identifying elements, and the like. Embodiments of the present disclosure propose improving the training data set used to train the machine learning model so that, when actually deployed, the trained machine learning model is able to correctly recognize both authentic and non-authentic identifying elements and is not limited to authentic identifying elements only. For a textual identifying element, a non-authentic identifying element may be, for example, an identifying element obtained by transforming the authentic identifying element. For example, the non-authentic text identifying element "Micr0$oft" may be obtained by replacing the letter "o" in the authentic text identifying element "Microsoft" of the brand "Microsoft" with the digit "0" and/or replacing the letter "s" with the dollar sign "$". Since "Micr0$oft" is close to "Microsoft", a viewer may still associate "Micr0$oft" with the brand "Microsoft". The training data set for training the machine learning model according to embodiments of the present disclosure may include, in addition to training samples related to authentic identifying elements, training samples related to non-authentic text identifying elements. A training sample related to a non-authentic text identifying element may include the non-authentic identifying element sample and an entity label corresponding to it; that label is consistent with the entity label of the authentic identifying element sample the non-authentic sample is associated with, as in the sketch below.
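A minimal sketch of generating such training pairs by character substitution follows; the substitution table is a hypothetical example, and real augmentation could combine several substitutions per element (as in "Micr0$oft"):

```python
# Hypothetical homoglyph-style substitutions, e.g. "o" -> "0", "s" -> "$".
SUBSTITUTIONS = {"o": ["0"], "s": ["$", "5"], "i": ["1", "!"]}

def generate_text_samples(authentic: str, entity_label: str) -> list[tuple[str, str]]:
    """Produce (sample, label) pairs; every forged sample keeps the entity
    label of the authentic identifying element it was derived from."""
    samples = {authentic}  # the authentic sample itself is also kept
    for i, ch in enumerate(authentic.lower()):
        for sub in SUBSTITUTIONS.get(ch, []):
            samples.add(authentic[:i] + sub + authentic[i + 1:])
    return [(s, entity_label) for s in sorted(samples)]

# generate_text_samples("Microsoft", "Microsoft") yields pairs such as
# ("Micr0soft", "Microsoft") and ("Micro$oft", "Microsoft").
```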
For an image identifying element, the entity corresponding to it may be identified from the element as a whole. Alternatively, candidate entities corresponding to the respective visual components in the image identifying element may be identified individually, and the entity corresponding to the image identifying element may be determined based on the results of identifying those candidate entities. An exemplary process of identifying an entity corresponding to an image identifying element will be described later in connection with fig. 8.
The visual components may include textual visual components and non-textual visual components. In this context, a textual visual component refers to a visual component that presents only text, while a non-textual visual component refers to a visual component that presents at least graphics. For example, entities corresponding to textual visual components may be identified by OCR, and at least entities corresponding to non-textual visual components may be identified by a deep-learning-based model. A model for identifying the entity corresponding to a non-textual visual component may be referred to herein as an entity recognition model. In addition, the entity recognition model can also recognize textual visual components that are difficult to recognize by OCR, such as textual visual components containing text in artistic fonts, textual visual components containing deformed text, and so on. Current entity recognition models are typically trained solely on genuine identifying elements, and thus can recognize only genuine identifying elements; identifying elements that differ from the genuine ones may be difficult for them to recognize. For an image identifying element, a non-genuine identifying element may include, for example, an identifying element obtained by transforming a genuine identifying element, an identifying element created for a specific entity, and the like. Embodiments of the present disclosure propose improving the training data set used to train the entity recognition model so that, when actually deployed, the trained entity recognition model can correctly recognize both genuine and non-genuine image identifying elements and is not limited to genuine image identifying elements only. An exemplary process of generating a training data set for training an entity recognition model will be described later in connection with FIG. 9.
A non-genuine image identifying element may be, for example, an identifying element obtained by transforming a genuine identifying element. FIG. 2 shows an exemplary genuine image identifying element and exemplary non-genuine image identifying elements obtained by transforming it. The identifying element 200a may be, for example, a genuine logo of the brand "AABB", which may include a non-textual visual component composed of rectangles, circles, and triangles and a textual visual component composed of the text "AABB", with the textual visual component located to the right of the non-textual visual component. The identifying elements 200b-200d may be, for example, non-genuine logos obtained by transforming the genuine logo. For example, the identifying element 200b includes the same non-textual visual component and textual visual component as the identifying element 200a, but in the identifying element 200b, the textual visual component is located below, rather than to the right of, the non-textual visual component. The relative position between the visual components in the identifying element 200c is the same as in the identifying element 200a, but the non-textual visual component in the identifying element 200c is rotated relative to the non-textual visual component in the identifying element 200a. The relative position between the visual components in the identifying element 200d is the same as in the identifying element 200a, but the non-textual visual component in the identifying element 200d is flipped relative to the non-textual visual component in the identifying element 200a.
It should be understood that the identifying elements 200b-200d shown in FIG. 2 are merely a few examples of non-genuine image identifying elements that may be obtained by transforming a particular entity's genuine image identifying element. Non-genuine image identifying elements obtained by such transformations may take any other form, such as having a different color than the genuine image identifying element, using a different artistic font, and so forth. A few such transformations are sketched below.
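By way of example, rotation and flipping transformations of the kind shown for elements 200c and 200d might be produced as in this sketch (assuming the Pillow library; repositioning a textual component as in 200b would additionally require compositing the components separately):

```python
from PIL import Image

def transformed_variants(logo: Image.Image) -> list[Image.Image]:
    """Produce non-genuine variants of a genuine image identifying element."""
    return [
        logo.rotate(45, expand=True),           # rotated component, cf. 200c
        logo.transpose(Image.FLIP_LEFT_RIGHT),  # horizontally flipped, cf. 200d
        logo.transpose(Image.FLIP_TOP_BOTTOM),  # vertically flipped
    ]
```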
A non-genuine image identifying element may also be, for example, an image identifying element newly created for a particular entity. FIG. 3 illustrates exemplary non-genuine image identifying elements created for a particular entity. The identifying elements 300a and 300b may be created, for example, for the brand "AABB"; they cannot be obtained by transforming the genuine logo of the brand "AABB" shown as identifying element 200a in fig. 2. However, since the identifying elements 300a and 300b contain the text "AABB", viewers are likely to mistake them for logos of the brand "AABB".
It should be understood that the image identifying elements shown in FIG. 3 are merely a few examples of non-genuine image identifying elements created for a particular entity. Non-genuine image identifying elements created for a particular entity may take any other form, such as a picture containing the particular entity's advertising slogan, a photograph of the particular entity's store, and so forth.
After identifying the entity corresponding to the identifying element, the validity of the content may be determined based at least on the correlation between the creator of the content and the identified entity. At 108, it may be evaluated whether the creator of the content is related to the identified entity. In one embodiment, the creator of the content may be considered related to an entity if the creator is the same as the owner of the entity. For example, when a logo detected from a form corresponds to the brand "AABB", it may be determined whether the creator of the form is the same as the owner of the brand "AABB". If so, the creator of the form may be considered related to the brand "AABB". In another embodiment, the creator of the content may be considered related to an entity if the creator is authorized by the owner of the entity to use the entity's identifying element. Whether the creator has such authorization may be determined, for example, by examining the creator's rights information. Alternatively, whether the creator is authorized to use the entity's identifying element may be determined, for example, by examining the entity's authorization information. It should be understood that the above criteria for evaluating the correlation between the creator of the content and an entity are only exemplary, and the correlation may also be evaluated by other criteria. For example, if the creator of the content is an employee of the owner of the entity, the creator may be considered related to the entity. Likewise, the creator of the content may be considered related to an entity if the domain name of the creator's verified email address matches the domain name of the email address of the entity's owner. These example criteria are sketched below.
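A minimal sketch of these example criteria follows; the records are hypothetical stand-ins for whatever ownership and authorization data a real deployment would query:

```python
# Hypothetical ownership/authorization records.
ENTITY_OWNERS = {"AABB": "AABB Inc."}
AUTHORIZED_USERS = {"AABB": {"AABB Marketing Agency"}}
OWNER_EMAIL_DOMAINS = {"AABB": {"aabb.com"}}

def creator_related_to_entity(creator: str, creator_email: str, entity: str) -> bool:
    """Apply the example relevance criteria of step 108."""
    if creator == ENTITY_OWNERS.get(entity):
        return True  # creator is the owner of the entity
    if creator in AUTHORIZED_USERS.get(entity, set()):
        return True  # creator is authorized to use the identifying element
    domain = creator_email.rsplit("@", 1)[-1].lower()
    return domain in OWNER_EMAIL_DOMAINS.get(entity, set())  # matching domain
```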
If, at 108, the creator of the content is evaluated as related to the identified entity, process 100 may proceed to 110, where the content is determined to be legitimate. If, at 108, the creator of the content is evaluated as not related to the identified entity, process 100 may proceed to 112, where the content is determined to be illegitimate. When content is determined to be illegitimate, some action may be taken to control or intervene, such as disabling the data collection, sending a reminder to the user, and so forth.
It should be understood that process 100 in FIG. 1 is merely an example of a process for detecting the validity of content associated with data collection. The process for detecting the validity of content associated with data collection may include any other steps, and may include more or fewer steps, depending on the actual application requirements. Further, the evaluation of the relevance between the creator of the content and the entity made at 108 is only one of the factors for determining the validity of the content. The determination of the validity of the content may also be based on other factors, such as the type of data collected by the content, the purpose of the data collection to which the content relates, and so forth.
Furthermore, it should be understood that although the foregoing and following discussion may refer to detecting a single identifying element from given content, embodiments of the present disclosure are not so limited. Multiple identifying elements may also be detected from given content. The validity detection process according to embodiments of the present disclosure may process all detected identifying elements, identifying the entity corresponding to each identifying element. The identified entities may be the same or different. Where the identified entities differ, the validity of the content may be determined based on the correlations between the creator of the content and the respective entities.
It should also be understood that although the foregoing and following discussion may refer to identifying the entity corresponding to a single identifying element as a unique entity, embodiments of the present disclosure are not so limited. In some cases, such as where multiple similar identifying elements exist for different entities, the entity corresponding to a single identifying element may be identified as two or more entities, and the confidence scores indicating the trustworthiness of the respective entities may be close to one another. In this case, the validity of the content may be determined based on the correlations between the creator of the content and the respective entities.
According to embodiments of the present disclosure, process 100 may be performed before content related to data collection is published. For example, the content may be obtained during content creation. When the content is determined to be illegitimate, a prompt message may be sent to the creator of the content, for example to alert or remind the creator that an identifying element may be improperly used in the content. Additionally or alternatively, when the content is determined to be illegitimate, the content may be prevented from being published, so that user data is protected from illegitimate collection.
FIG. 4 illustrates an exemplary form 400 for data collection. The title "account manager" of form 400 indicates that the form is intended to assist the recipient in logging into their account. The upper left corner of form 400 is provided with a flag 402 that is intended to identify the entity associated with the data collection service. The creator of the form creates a question in the form for collecting user data, such as "enter your username", "enter your password", etc. If the recipient gives answers to these questions, the creator obtains the desired user data. It should be understood that FIG. 4 is only one example of data collection, and that various other forms of data collection may exist in an actual scenario.
A process for detecting the validity of content associated with data collection according to an embodiment of the present disclosure, such as process 100 in fig. 1, may be employed to detect the validity of the form. For example, form 400 may be obtained, and an identifying element, such as logo 402, may be detected from form 400. Subsequently, the entity corresponding to logo 402, such as the brand "AABB", may be identified. Although logo 402 is different from the genuine logo of the brand "AABB", such as the identifying element 200a in fig. 2, the entity identification process according to an embodiment of the present disclosure, such as step 106 in fig. 1, may still identify that the entity corresponding to logo 402 is the brand "AABB". Next, it can be evaluated whether the creator of form 400 is related to the brand "AABB". For example, if the creator of form 400 is the same as the owner of the brand "AABB", or the creator of form 400 is authorized by the owner of the brand "AABB" to use its logo, the creator of form 400 may be considered related to the brand "AABB". Otherwise, the creator of form 400 may be considered unrelated to the brand "AABB". When the creator of form 400 is related to the brand "AABB", form 400 may be considered legitimate; when not, form 400 may be considered illegitimate.
According to an embodiment of the present disclosure, when detecting an image identifying element from content, an image may be extracted from the content, and the image identifying element may then be acquired from the extracted image. Figs. 5-7 illustrate exemplary processes 500-700 for detecting image identifying elements. These processes may be applied, respectively, to content where the image identifying element is located at a predetermined location and to content where it is located at an unknown location.
Fig. 5 illustrates an exemplary process 500 for detecting image identifying elements in accordance with an embodiment of the disclosure. The process 500 may be applied to content where the image identifying element is located at a predetermined location. For example, for a form, the image identifying element is typically located at the top left area of the form.
At 510, an image may be extracted from a predetermined location of the content. In a form, for example, the image identifying element is typically located at the top left area, so an image containing the image identifying element is also generally located in that region, and the image can be extracted from there.
The image identifying element may be obtained from the extracted image, for example by a deep-learning-based model. The model for acquiring image identifying elements at predetermined positions may be referred to herein as an identifying element recognition model. The identifying element recognition model may obtain the image identifying element without determining its position. Identifying element recognition models typically require that the input image be of a specified size. At 520, the image extracted from the content can be normalized to the specified size required by the identifying element recognition model.
At 530, image identifying elements may be obtained from the normalized image by an identifying element recognition model.
Although in some content, such as forms, images containing identifying elements are typically located at predetermined positions and have sizes within predetermined ranges, phishers can still take evasive action to prevent image identifying elements from being retrieved from the images. For example, the image identifying element may be placed at a different location in the image, or made to occupy only a small portion of the image.
To address such evasive behavior by phishers, embodiments of the present disclosure propose improving the training data set used to train the identifying element recognition model. A set of image samples comprising a plurality of image samples may be generated as training data for the identifying element recognition model. Each image sample in the set may contain an identifying element sample. The identifying element sample may be located at any position in the image sample. In addition, the ratio of the size of the identifying element sample to the size of the image sample containing it may be within a predetermined range, for example 20% to 100%. When actually deployed, an identifying element recognition model trained with this image sample set can effectively acquire identifying elements from a wide variety of images, such as images in which the identifying element is located at an arbitrary position, or images in which the ratio of the identifying element's size to the image's size falls anywhere within the predetermined range. The generation of such samples is sketched below.
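Generation of one such image sample might be sketched as follows (assuming Pillow; the canvas size and white background are arbitrary choices, and aspect-ratio handling is omitted for brevity):

```python
import random
from PIL import Image

def make_image_sample(element: Image.Image, canvas_size: int = 224) -> Image.Image:
    """Paste an identifying-element sample at a random position, scaled so that
    the ratio of its size to the image sample's size lies in 20%-100%."""
    ratio = random.uniform(0.2, 1.0)
    side = max(1, int(canvas_size * ratio))
    scaled = element.resize((side, side))
    canvas = Image.new("RGB", (canvas_size, canvas_size), "white")
    x = random.randint(0, canvas_size - side)  # arbitrary position in the sample
    y = random.randint(0, canvas_size - side)
    canvas.paste(scaled, (x, y))
    return canvas
```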
Fig. 6 illustrates another exemplary process 600 for detecting image identifying elements in accordance with an embodiment of the disclosure. The process 600 may be applicable to content where the identifying element is located at an unknown location. For example, in content such as web pages, emails, and productivity tool documents, the location of image identifying elements is often variable or unknown. For such content, an object detection approach may be employed: an image may be extracted from the content, and the image identifying element may be detected from the extracted image as a whole.
At 610, an image can be extracted from the content. Since the position of the image identifying element is unknown, images at various positions in the content can be extracted.
The image identifying element may be obtained from the extracted image, for example by a deep-learning-based model. The model used to acquire image identifying elements at unknown locations may be referred to herein as an identifying element detection model. The identifying element detection model may first determine the location of the image identifying element in the image and then obtain the element from that location. The identifying element detection model typically requires that the input image be of a specified size. At 620, the image extracted from the content can be normalized to the specified size required by the model.
At 630, the location of the image identifying element in the extracted image may be determined by an identifying element detection model.
At 640, image identifying elements may be obtained from the determined locations by an identifying element detection model.
An identifying element detection model may be trained using a set of image samples that includes a plurality of image samples. Each image sample in the set may, for example, contain an identifying element sample together with the location of that sample in the image sample. When actually deployed, an identifying element detection model trained with this image sample set can determine the locations of image identifying elements in an input image and acquire the image identifying elements from the determined locations. In addition, the identifying element recognition model, which acquires image identifying elements at predetermined positions, and the identifying element detection model, which acquires image identifying elements at unknown positions, may be collectively referred to as identifying element acquisition models.
In addition, for content where the image identifying element is located at an unknown location, a phisher may place the image identifying element in a larger image, for example a background image of a web page or email, so that the image identifying element occupies only a small portion of the image. In this case, since a model for acquiring an image identifying element from an image, such as the aforementioned identifying element detection model, generally requires the input image to have a specified size, when the image containing the identifying element is normalized to that size, the identifying element becomes too small to be acquired by the model. To cope with such evasive behavior, embodiments of the present disclosure propose obtaining a set of sub-images by scanning the image with a sliding window, and detecting the identifying element from the set of sub-images. Fig. 7 illustrates yet another exemplary process 700 for detecting image identifying elements in accordance with an embodiment of the disclosure. The process 700 may be applicable to content where the identifying element is at an unknown location and occupies only a small portion of the image.
At 710, an image may be extracted from the content. Since the position of the image identifying element is unknown, images at various positions in the content can be extracted.
For each of the extracted images, at 720, a set of sub-images can be obtained by scanning the image with a sliding window. The size of each sub-image coincides with the size of the sliding window. The size of the sliding window may be set larger than the size of the image identifying element so that the window can contain the element. Since the size of the image extracted from the content is known, and since the size of an image identifying element must be within a reasonable range for recipients of the content to notice it, the size of the image identifying element can be roughly estimated from the size of the image. Furthermore, for an image of given size, the sub-images should be as large as possible so as to reduce their number and improve efficiency. However, for the image identifying element to be acquirable from a sub-image, the ratio of the element's size to the sub-image's size should be above a predetermined threshold; that is, the size of the sub-image, i.e., the size of the sliding window, should be less than a predetermined multiple of the size of the image identifying element. Thus, the size of the sliding window may be greater than the size of the image identifying element and less than a predetermined multiple of that size.
In addition, the sub-images may have a predetermined degree of overlap so that the entire image identifying element can be contained within a single sub-image. The predetermined overlap may, for example, be set to 50%. In this case, the sliding window is shifted by 50% of its side length in the horizontal or vertical direction at each step, so that each newly obtained sub-image overlaps the previous one by 50%, as sketched below.
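The scan of step 720 might be sketched as follows (assuming Pillow, a square window, and the 50% overlap of the example above):

```python
from PIL import Image

def sliding_window_subimages(image: Image.Image, window: int, overlap: float = 0.5):
    """Yield square sub-images; the stride is (1 - overlap) * window, so each
    new sub-image overlaps the previous one by the given fraction."""
    stride = max(1, int(window * (1.0 - overlap)))
    width, height = image.size
    for top in range(0, max(height - window, 0) + 1, stride):
        for left in range(0, max(width - window, 0) + 1, stride):
            # crop() pads with black if the window extends past the border
            yield image.crop((left, top, left + window, top + window))
```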
Image identifying elements may be obtained from the obtained set of sub-images. Since the ratio of the size of the image identifying element to the size of the sub-image is within a predetermined range, the image identifying element may be acquired from the sub-image, for example, by an identifying element recognition model. The identifying element recognition model may be, for example, the identifying element recognition model trained using the set of image samples described above. Identifying element recognition models typically require that the input image be of a specified size. At 730, each sub-image in the set of sub-images can be normalized to a specified size required by the identifying element recognition model.
At 740, image identifying elements can be obtained from the normalized set of sub-images, for example, by an identifying element recognition model.
It should be appreciated that the processes for detecting image identifying elements described above in connection with figs. 5-7 are merely exemplary. A process for detecting image identifying elements may include any other steps, and may include more or fewer steps, depending on the actual application requirements. For example, in the process of fig. 7, before scanning an image extracted from the content with the sliding window, it may be determined whether the size of the image exceeds a predetermined threshold; if so, the scanning is performed, otherwise the image may be directly normalized to the specified size required by the identifying element detection model and the image identifying element obtained from the normalized image, as described in steps 620-640 of fig. 6. In addition, when multiple image identifying elements of different sizes may be present in the image, the image may be scanned with multiple sliding windows of corresponding sizes to obtain multiple groups of sub-images, and the corresponding image identifying elements may be obtained from each group. Furthermore, although processes 500-700 include normalizing the image or sub-image and obtaining the identifying element from the normalized image or sub-image, in some cases the normalization-related steps may be omitted. For example, when the size of the image or sub-image is below a predetermined threshold, the identifying element may be obtained directly from it without normalization.
Referring back to fig. 2, this figure shows a genuine identifying element and several non-genuine identifying elements. For example, the identifying element 200a may be a genuine identifying element that includes a non-textual visual component composed of squares, circles, and triangles and a textual visual component formed of the text "AABB". The identifying element 200b includes the same non-textual visual component and textual visual component as the identifying element 200a, but its textual visual component is below the non-textual visual component, rather than to the right of it. Phishers tend to evade recognition by employing identifying elements whose visual components are in different relative positions than in the genuine identifying element.
To cope with such evasive behavior, embodiments of the present disclosure propose to individually recognize each visual component in an image identifying element, and determine an entity corresponding to the identifying element based on the recognition result of each visual component. Fig. 8 illustrates an example process 800 for identifying an entity corresponding to an identifying element in accordance with an embodiment of the disclosure. Process 800 may be performed for an image identifying element having one or more visual components. Candidate entities corresponding to respective visual components in the image identifying elements may be identified, and entities corresponding to the image identifying elements may be determined based at least on the candidate entities. Optionally, it is also possible to determine the location of the visual component and determine the entity corresponding to the image identifying element based on both the candidate entity and the location.
At 810, candidate entities corresponding to the respective visual components in the image identifying element can be identified individually. According to embodiments of the present disclosure, textual visual components may be identified by OCR, for example, and at least non-textual visual components may be identified by a deep-learning-based entity recognition model. In addition, the entity recognition model can also recognize textual visual components that are difficult to recognize by OCR, such as textual visual components containing text in artistic fonts, textual visual components containing deformed text, and so on. Taking the identifying element 200b in fig. 2 as an example, a first set of candidate entities corresponding to the non-textual visual component at the top of the identifying element 200b may be identified by the entity recognition model, and a second set of candidate entities corresponding to the textual visual component below it may be identified by OCR. A single visual component may correspond to one or more candidate entities. For example, the first set of candidate entities may include the brand "AABB" and the brand "CCDD", and the second set of candidate entities may include the brand "AABB".
Optionally, at 820, the location of the individual visual components can be determined. For example, the non-textual visual component above the identifying element 200b may be determined to be at a first location, and the textual visual component below it may be determined to be at a second location.
Optionally, at 830, the distance between each two visual components may be calculated based on the determined locations to obtain a set of distances. For example, the distance between the non-textual visual component and the textual visual component of the identifying element 200b may be calculated based on the first location of the former and the second location of the latter.
At 840, an entity corresponding to the identifying element may be determined based at least on the candidate entity identified at 810. Continuing with the example of the identifying element 200b, the first set of candidate entities includes the brand "AABB" and the brand "CCDD" and the second set of candidate entities includes the brand "AABB", and accordingly, the entity corresponding to the identifying element 200b may be determined to be the brand "AABB".
Optionally, the entity corresponding to the identifying element may also be determined further based on the set of distances calculated at 830.
In case the set of distances comprises only one distance, i.e. the identifying element comprises only two visual components, these two visual components may belong to one identifying element if the distance is below a predetermined threshold. Accordingly, the entity corresponding to the identifying element may be determined based on the two sets of candidate entities corresponding to the two visual components, respectively. For example, for the identifying element 200b, it may be determined whether the distance between its two visual components is below a threshold. If the distance is below the threshold, the entity corresponding to the identifying element 200b may be determined based on both a first set of candidate entities including the brand "AABB" and the brand "CCDD" and a second set of candidate entities including the brand "AABB". The entity may be determined, for example, as the brand "AABB". If the distance is not below a predetermined threshold, the two visual components may not belong to one identifying element. Therefore, two candidate entity sets respectively corresponding to the two visual components can be directly provided as the entity recognition result. Continuing with the example of identifying element 200b, if the distance is not below the threshold, brand "AABB" and brand "CCDD" are provided directly as recognition results for the non-textual visual component, and brand "AABB" is provided as recognition results for the textual visual component.
In case the set of distances comprises a plurality of distances, i.e. the identifying element comprises more than two visual components, all visual components may belong to one identifying element if the distance between each of them and its closest visual component is below a threshold. Accordingly, the entity corresponding to the identifying element may be determined based on the candidate entities corresponding to the respective visual components. If the distance between one or more visual components and their closest visual components is not below a threshold, then it is likely that the one or more visual components do not belong to the identifying element. In this case, candidate entities corresponding to the one or more visual components may not be considered in determining the entity corresponding to the identifying element.
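For illustration only, the following is a minimal Python sketch of the grouping logic of steps 820-840 described above. The `VisualComponent` structure, the 50-pixel threshold, and the intersection-then-union merging rule are illustrative assumptions, not details prescribed by the disclosure.

```python
from dataclasses import dataclass
from math import dist

@dataclass
class VisualComponent:
    candidates: set    # candidate entities from step 810 (OCR or model)
    center: tuple      # (x, y) location from step 820

def entity_for_element(components, threshold=50.0):
    """Steps 830-840: drop components whose nearest neighbour lies
    beyond `threshold`, then merge the remaining candidate sets."""
    if len(components) > 1:
        kept = [c for c in components
                if min(dist(c.center, o.center)
                       for o in components if o is not c) < threshold]
    else:
        kept = list(components)
    if len(kept) < 2:
        # Components too far apart: report each candidate set as-is.
        return [c.candidates for c in components]
    common = set.intersection(*(c.candidates for c in kept))
    return common or set.union(*(c.candidates for c in kept))

# Example mirroring identifying element 200b:
logo = VisualComponent({"AABB", "CCDD"}, (120.0, 40.0))
text = VisualComponent({"AABB"}, (120.0, 80.0))
print(entity_for_element([logo, text]))   # -> {'AABB'}
```

Here the entity shared by both candidate sets ("AABB") is selected when the two components lie within the threshold, matching the example above.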
In this way, even if the relative positions of the visual components in an identifying element differ from those in the real identifying element, the entity corresponding to the identifying element can still be correctly identified.
The process for identifying an entity corresponding to a text identifying element described above in connection with fig. 1 takes into account only the text identifying element itself, and the process 800 for identifying an entity corresponding to an image identifying element described above in connection with fig. 8 likewise takes into account only the image identifying element itself. To recognize the entity corresponding to a text identifying element or an image identifying element more accurately, embodiments of the present disclosure propose that entity recognition can also be performed based on the context of the text identifying element or the image identifying element. Herein, context may refer to the content, such as a form, web page, email, or productivity tool document, in which the identifying element is located. In one embodiment, the entity corresponding to an identifying element may be identified based at least on the context of the identifying element. For a text identifying element, the text identifying element and its context may be provided to a model for identifying the corresponding entity, and the model can identify the entity based at least on the context. For an image identifying element, the image identifying element and its context can be provided to the entity recognition model, which may identify candidate entities corresponding to the respective visual components in the image identifying element based at least on the context. Subsequently, the entity corresponding to the image identifying element may be determined, for example, by steps similar to steps 820-840 of process 800. Alternatively, the entity corresponding to the identifying element may also be identified based on only a portion of the context that includes the identifying element, rather than the entire context. For example, a region of a predetermined size that includes the identifying element may be cropped from the context and provided to the corresponding model along with the identifying element.
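As an illustration of the last point, the following is a minimal sketch of cropping a fixed-size region around an identifying element. The 512-pixel region size, the clamping behaviour at page borders, and the Pillow-based representation of the context are assumptions, not details prescribed by the disclosure.

```python
from PIL import Image

def crop_context(page: Image.Image, center, size=512):
    """Crop a size x size region around `center`, clamped so the
    crop window stays inside the page where possible."""
    cx, cy = center
    left = min(max(cx - size // 2, 0), max(page.width - size, 0))
    top = min(max(cy - size // 2, 0), max(page.height - size, 0))
    return page.crop((left, top, left + size, top + size))
```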
As described above, non-textual visual components, as well as textual visual components that are difficult to recognize by OCR, may be recognized by entity recognition models. An entity recognition model may be trained using a training data set according to embodiments of the present disclosure. The training data set may be a visual component sample set comprising a plurality of visual component samples. In addition to visual component samples from real identifying elements, the set may include visual component samples that differ from the visual components in real identifying elements, so that, when actually deployed, the trained model is able to correctly recognize both real and non-real identifying elements rather than being limited to real identifying elements only. A visual component sample that differs from the visual components in a real identifying element may be generated based on the real identifying element or created for a particular entity.
FIG. 9 illustrates an example process 900 for generating a training data set for training an entity recognition model based on true identifying elements, according to an embodiment of the disclosure.
At 910, true identifying elements can be collected, for example, from the internet and from accessible phishing content. Entities corresponding to the respective true identifying elements can be determined and assigned to them as entity labels. In addition, the contexts in which the true identifying elements are located can also be collected, such as forms, web pages, emails, productivity tool documents, and the like. Optionally, the position of each true identifying element, for example its position in the content, may also be determined and assigned to the true identifying element as a position label.
At 920, the collected true identifying elements can be analyzed and each true identifying element can be split into one or more visual components. FIG. 10 illustrates an exemplary splitting of a true identifying element according to an embodiment of the disclosure. The element in fig. 10 may be a true identifying element of, for example, the brand "AABB", and may correspond to identifying element 200a in fig. 2. The true identifying element can be split into a visual component 1000a composed of a rectangle, a circle, and a triangle, and a visual component 1000b composed of the text "AABB".
At 930, visual component samples can be obtained by transforming the visual components split from the true identifying elements. Fig. 11 illustrates an exemplary transformation of a visual component according to an embodiment of the disclosure.
In one embodiment, the visual component may be rotated. For example, the visual component sample 1100a in fig. 11 may be obtained by rotating the visual component 1000a in fig. 10 clockwise by 180 degrees.
In another embodiment, the visual component may be flipped. For example, the visual component sample 1100b in fig. 11 may be obtained by flipping the visual component 1000a in fig. 10.
In yet another embodiment, the color of the visual component may be changed to other colors. For example, the color of at least a portion of the visual component may be changed to certain classic colors, such as red, yellow, or black. Further, the colors of the respective portions of the visual component may be rotated, symmetrically exchanged, or the like. Assume that in visual component 1000a in fig. 10, the rectangle is red, the circle is yellow, and the triangle is brown. A rotational change of the colors of the parts may be, for example, changing the circle to red, the triangle to yellow, and the rectangle to brown.
In yet another embodiment, the color of the visual component may be converted to grayscale. Colors vary widely, and phishers can easily change the color of a visual component or identifying element to a color never seen by the entity recognition model. To cope with such color variations, embodiments of the present disclosure propose converting the colors of the visual component samples used to train the entity recognition model into grayscale, so that the model focuses on recognizing the shape rather than the color of the visual component. For example, the visual component sample 1100c in fig. 11 may be obtained by converting the colors of the visual component 1000a in fig. 10 into grayscale. The various portions of the visual component sample 1100c may have different shades of gray, which may, for example, correspond to the different colors in the original visual component, i.e., the visual component 1000a.
Additionally, the context may also be transformed. For example, some information in the context may be changed, such as templates, colors, and so forth.
It should be understood that only some exemplary transformations to visual components and contexts are listed above. Embodiments of the present disclosure are not limited thereto, and some other transformations may be made to the visual components and context according to the actual application needs.
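For illustration only, the sketch below applies a few of the transformations of step 930 using Pillow. The specific rotation angle, the horizontal flip, and the input file name are illustrative assumptions.

```python
from PIL import Image, ImageOps

def expand_visual_component(img: Image.Image) -> list:
    """Produce transformed variants of one visual component sample."""
    return [
        img,                                     # original component
        img.rotate(180),                         # rotated (cf. sample 1100a)
        ImageOps.mirror(img),                    # flipped (cf. sample 1100b)
        ImageOps.grayscale(img).convert("RGB"),  # grayscale (cf. sample 1100c)
    ]

# `visual_component_1000a.png` is a hypothetical file name.
component = Image.open("visual_component_1000a.png").convert("RGB")
extended = expand_visual_component(component)
```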
After transforming the visual components, an extended visual component sample set may be obtained. The extended visual component sample set may also include the original visual components extracted from the true identifying elements. Similarly, after transforming the contexts, an extended context sample set may be obtained, which may include the original contexts.
At 940, each visual component sample in the extended visual component sample set can be combined with different context samples in the extended context sample set to generate a plurality of training samples, each training sample including a visual component sample and a context sample. During combining, one or more data augmentation operations may also be applied to the visual component samples, such as rotation, adding noise, perspective transformation, shifting, and so forth. Each training sample may include an entity label that is consistent with the entity label associated with the true identifying element of the visual component sample. Optionally, each training sample may further include a location label that corresponds to the location label associated with the true identifying element of the visual component sample.
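For illustration, a minimal sketch of step 940 follows, pasting each visual component sample into each context sample with shift and noise augmentation. The paste-position rule, the noise amplitude, and the assumption that every component fits inside every context are illustrative.

```python
import random

import numpy as np
from PIL import Image

def make_training_samples(components, contexts, entity_labels, seed=0):
    """Combine visual component samples with context samples; assumes
    RGB images and components smaller than their contexts."""
    rng = random.Random(seed)
    samples = []
    for comp, label in zip(components, entity_labels):
        for ctx in contexts:
            canvas = ctx.convert("RGB")
            x = rng.randint(0, canvas.width - comp.width)    # shift
            y = rng.randint(0, canvas.height - comp.height)
            canvas.paste(comp, (x, y))
            arr = np.asarray(canvas, dtype=np.int16)
            arr = arr + np.random.randint(-8, 9, arr.shape, dtype=np.int16)  # noise
            noisy = Image.fromarray(arr.clip(0, 255).astype(np.uint8))
            samples.append((noisy, label, (x, y)))  # (x, y) as optional location label
    return samples
```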
It should be appreciated that process 900 in FIG. 9 is merely an example of a process for generating a training data set for training an entity recognition model based on true identifying elements. The process for training data set generation may include any other steps, and may include more or fewer steps, depending on actual application requirements. For example, although process 900 includes splitting the true identifying element into visual components, such that the generated training samples include visual component samples, it is also possible not to split the true identifying element. For example, multiple identifying element samples may be generated by directly transforming the true identifying element, e.g., with the transformations at step 930. The various identifying element samples may then be combined with context samples to generate training samples. An entity recognition model trained with such training samples, when deployed, may recognize an image identifying element as a whole and map it to its corresponding entity. Further, although process 900 includes the collection and transformation of contexts, it is also feasible to collect only true identifying elements and transform only the visual components in them. In this case, the training samples may not include context samples. Accordingly, an entity recognition model trained with such a training data set may not take the context of identifying elements into account when performing entity recognition.
The process of generating a training data set for training an entity recognition model based on true identifying elements is described above in connection with FIG. 9. A phisher may also use newly created identifying elements, such as identifying element 300a and identifying element 300b in fig. 3. To enable the entity recognition model to recognize such newly created identifying elements, embodiments of the present disclosure propose adding, to the training data set used to train the entity recognition model, identifying element samples or visual component samples created for specific entities. For example, such identifying element samples or visual component samples may be actively created for some known brands, products, etc. by the party providing a phishing detection service. Alternatively, identifying element samples or visual component samples created by phishers may also be collected from reported phishing content.
The training data set generated in the above manner may be used to train an entity recognition model. The training data set may or may not include location labels. A training data set including location labels may be used to train an entity recognition model for processing content whose image identifying elements are located at unknown locations. A training data set that does not include location labels may be used to train an entity recognition model for processing content whose image identifying elements are located at predetermined locations.
In the training data set generation process described above, the visual components of image identifying elements are treated as separate training samples. Accordingly, an entity recognition model trained with such samples also recognizes each visual component in an image identifying element individually when identifying the element. In this way, attempts by phishers to circumvent detection by changing the relative positions of visual components in image identifying elements can be effectively countered. In addition, the training data set generated in the above manner may have a plurality of different training samples for each entity, such as visual component samples extracted from real identifying elements, visual component samples obtained by transforming visual components in real identifying elements, and visual component samples created for the specific entity. These samples enrich the training data set, so that the trained entity recognition model, when actually deployed, can correctly recognize a variety of image identifying elements and is not limited to real identifying elements.
In addition, the training samples for each entity in the training data set generated in the above manner may be randomly divided into three subsets: a training subset for fitting model parameters via supervised learning; a validation subset for tuning the model's hyper-parameters during training, selecting the best model, and determining the stopping point; and a testing subset for evaluating the performance of the final model.
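A minimal sketch of such a per-entity random split is shown below; the 80/10/10 ratios and the fixed seed are assumptions, as the description does not specify them.

```python
import random

def split_samples(samples, ratios=(0.8, 0.1, 0.1), seed=42):
    """Randomly split one entity's training samples into training,
    validation, and testing subsets."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * ratios[0])
    n_val = n_train + int(len(shuffled) * ratios[1])
    return shuffled[:n_train], shuffled[n_train:n_val], shuffled[n_val:]
```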
It is described above that a plurality of visual component samples having different colors can be generated by changing the color of a visual component to other colors. An entity recognition model may be trained using a training data set comprising such differently colored visual component samples; the trained model may be referred to as a color model. In addition, grayscale visual component samples may be generated by converting the colors of visual components to grayscale. An entity recognition model may be trained using a training data set comprising grayscale visual component samples; the trained model may be referred to as a grayscale model. When recognizing the entity corresponding to a visual component, recognition may be performed by the color model and the grayscale model separately, and the results of the two models may be combined into a recognition result for the visual component.
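One plausible way to combine the two models is to average their per-entity probability vectors, as in the sketch below. The `predict` interface of the hypothetical color and grayscale models, and averaging as the fusion rule, are assumptions.

```python
import numpy as np
from PIL import ImageOps

def recognize_component(component, color_model, gray_model):
    """Run the color model on the RGB component and the grayscale model
    on its grayscale version, then average the probability vectors."""
    p_color = np.asarray(color_model.predict(component))
    p_gray = np.asarray(gray_model.predict(ImageOps.grayscale(component)))
    combined = (p_color + p_gray) / 2.0
    return int(np.argmax(combined))   # index of the recognized entity
```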
Content such as forms, web pages, emails, and productivity tool documents may also contain other elements that enable the recipient to associate the content with a particular entity, such as the templates, styles, colors, etc. that the particular entity habitually uses. Phishers may use such elements in their content to encourage the recipient to believe that the content is from the particular entity corresponding to those elements. Embodiments of the present disclosure provide for identifying the entities corresponding to these other elements; such identified entities may be referred to herein as reference entities. For example, a template or style may be extracted from given content using image processing techniques, and the reference entity corresponding to the extracted template or style may be identified using image recognition or pattern matching. A final entity may then be determined based on the reference entity and the entity corresponding to the identifying element in the content, so that the entity can be determined more accurately. For example, where the entity corresponding to the identifying element comprises two or more entities and the reference entity is one of them, the final entity may be determined to be the reference entity. The validity of the content may then be determined based at least on the correlation between the creator of the content and the final entity, for example in a manner similar to steps 108-110 in fig. 1.
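The final-entity rule described above can be expressed compactly as follows; the fall-through behaviour (returning the element's entities unchanged when the reference entity does not match) is an assumption.

```python
def final_entity(element_entities: set, reference_entity: str):
    """Prefer the reference entity when it is among the entities matched
    to the identifying element; otherwise keep the element's entities."""
    if reference_entity in element_entities:
        return {reference_entity}
    return element_entities

# Example: the identifying element matched two brands, and the template
# matched one of them, so the final entity is the reference entity.
print(final_entity({"AABB", "CCDD"}, "AABB"))   # -> {'AABB'}
```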
It should be understood that although in the above description, acquiring an image identifying element and identifying the entity corresponding to it are performed by separate models, e.g., acquisition by an identifying element acquisition model and recognition by an entity recognition model, embodiments of the present disclosure are not limited thereto. For example, the identifying element acquisition model and the entity recognition model may be integrated into a single model. The integrated model may detect an identifying element from the input content and identify the entity corresponding to the identifying element. A suitable integrated model may be obtained by training certain neural networks with a training data set constructed according to embodiments of the present disclosure. For content where the identifying element is located at a predetermined location, neural networks such as Residual Networks (ResNets) or EfficientNets may be used for training. For content where the identifying element is at an unknown location, neural networks such as EfficientDet, Single Shot MultiBox Detector (SSD), or Faster Region-based Convolutional Neural Network (Faster R-CNN) may be used for training.
According to embodiments of the present disclosure, when selecting a neural network for training, reasonable use of computing resources, such as CPU and memory resources, can be achieved while ensuring that accuracy and latency requirements are met. For example, EfficientNet has a simpler network structure and uses fewer computing resources than ResNet, yet can still achieve high accuracy. Thus, EfficientNet may be used when computing resources are limited, and ResNet may be used when computing resources are sufficient.
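For illustration, a minimal sketch of this resource-based choice using torchvision backbones as stand-ins for the networks named above; the boolean resource flag is an illustrative simplification.

```python
from torchvision import models

def pick_backbone(num_entities: int, limited_resources: bool):
    """Choose a lighter or heavier classification backbone depending
    on the available computing resources."""
    if limited_resources:
        return models.efficientnet_b0(num_classes=num_entities)
    return models.resnet50(num_classes=num_entities)
```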
FIG. 12 is a flow diagram of an exemplary method 1200 for detecting the validity of content related to data collection, in accordance with an embodiment of the present disclosure.
At step 1210, content related to data collection may be obtained.
At step 1220, at least one identifying element may be detected from the content.
At step 1230, an entity corresponding to the identifying element can be identified.
At step 1240, the validity of the content may be determined based at least on a correlation between the creator of the content and the entity.
In one implementation, the content may be obtained while the content is being created.
The method 1200 may further include, in response to determining that the content is not legitimate, performing at least one of: sending a prompt message to the creator and/or data collection service provider; and preventing the content from being published.
In one embodiment, the identifying element is different from the true identifying element of the entity.
In one embodiment, the identifying element may comprise a textual identifying element. The text-identifying element may be detected from the content or information related to the content.
In one embodiment, the identifying element may comprise an image identifying element. The detecting at least one identifying element may include: extracting an image from the content; and obtaining the image identifying element from the image.
The extracting the image may include: the image is extracted from a predetermined position of the content.
The obtaining the image identifying element may include: determining a location of the image identifying element in the image; and obtaining the image identifying element from the location.
The obtaining the image identifying element may include: obtaining a set of sub-images by scanning the image with a sliding window; and retrieving the image identifying element from the set of sub-images.
The size of the sliding window may be greater than the size of the image identifying element and less than a predetermined multiple of the size of the image identifying element. The size of the image identifying element may be estimated based on a size of the image.
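For illustration, a minimal sketch of such a sliding-window scan follows; the size-estimation factor, the 1.5x window multiple, and the half-window stride are illustrative assumptions consistent with the constraints just described.

```python
from PIL import Image

def sliding_sub_images(image: Image.Image, scale: float = 0.2):
    """Scan the image with a square sliding window and collect sub-images."""
    # Estimate the identifying-element size from the image size, then
    # choose a window larger than the estimate but below twice its size.
    est = int(min(image.size) * scale)
    win = int(est * 1.5)
    stride = max(1, win // 2)
    subs = []
    for top in range(0, image.height - win + 1, stride):
        for left in range(0, image.width - win + 1, stride):
            subs.append(image.crop((left, top, left + win, top + win)))
    return subs
```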
The acquiring the image identifying element may be performed by an identifying element acquisition model. The identifying element acquisition model may be a deep learning based model trained with a set of image samples. Each image sample of the set of image samples may contain an identifying element sample.
The identifying element sample may be located at any position in the image sample. The ratio of the size of the identifying element sample to the size of the image sample may be within a predetermined range.
In one embodiment, the identifying element may comprise an image identifying element. The identifying an entity corresponding to the identifying element may include: identifying at least one candidate entity corresponding to at least one visual component in the image identifying element; and determining the entity based at least on the at least one candidate entity.
The method 1200 may further include: determining at least one location of the at least one visual component, and wherein the entity is determined further based on the at least one location.
The at least one visual component may comprise a textual visual component. The at least one candidate entity may be identified by OCR.
The at least one visual component may comprise a non-textual visual component. The at least one candidate entity may be identified by an entity identification model.
The entity recognition model may be a deep learning based model trained with a set of visual component samples. Each visual component sample in the set of visual component samples may have an entity label.
The visual component samples of the set of visual component samples may be obtained by at least one of: extracting them from true identifying elements; transforming visual components in true identifying elements; and creating them for a particular entity.
The transforming the visual component may comprise at least one of: rotating the visual component; flipping the visual component; changing the color of the visual component to another color; and converting the color of the visual component to grayscale.
In one embodiment, the entity may be identified based at least on a context of the identifying element.
In one embodiment, the method 1200 may further include: identifying reference entities corresponding to other elements in the content, the other elements including at least one of a template, a style, and a color; and determining a final entity based on the entity corresponding to the identifying element and the reference entity, and wherein the validity of the content may be determined based at least on a correlation between the creator and the final entity.
In one embodiment, the content may include at least one of a form, a web page, an email, and a productivity tool document. The identifying element may include at least one of a logo, a representative picture, and a representative text. The entities may include at least one of brands, products, services, and people.
It should be understood that method 1200 may also include any steps/processes for detecting the validity of content associated with data collection in accordance with embodiments of the present disclosure described above.
Fig. 13 illustrates an exemplary apparatus 1300 for detecting the validity of content related to data collection, in accordance with an embodiment of the disclosure.
The apparatus 1300 may include: a content obtaining module 1310 for obtaining content related to data collection; an identifying element detecting module 1320 for detecting at least one identifying element from the content; an entity identification module 1330 configured to identify an entity corresponding to the identifying element; and a validity determination module 1340 for determining the validity of the content based at least on a correlation between a creator of the content and the entity.
It should be understood that the apparatus 1300 may also include any other module configured to detect the validity of content related to data collection according to embodiments of the present disclosure described above.
FIG. 14 illustrates an exemplary apparatus 1400 for detecting the validity of content related to data collection, in accordance with an embodiment of the disclosure.
The apparatus 1400 may include at least one processor 1410. The apparatus 1400 may also include a memory 1420 coupled to the processor 1410. The memory 1420 may store computer-executable instructions that, when executed, cause the processor 1410 to perform any of the operations of the methods for detecting the validity of content related to data collection in accordance with the embodiments of the present disclosure described above.
Embodiments of the present disclosure may be embodied in non-transitory computer readable media. The non-transitory computer-readable medium may include instructions that, when executed, cause one or more processors to perform any of the operations of the method for detecting the validity of content related to data collection according to embodiments of the present disclosure as described above.
It should be appreciated that all of the operations in the methods described above are exemplary only, and the present disclosure is not limited to any of the operations in the methods or the order of the operations, but rather should encompass all other equivalent variations under the same or similar concepts.
It should also be appreciated that all of the modules in the apparatus described above may be implemented in various ways. These modules may be implemented as hardware, software, or a combination thereof. In addition, any of these modules may be further divided functionally into sub-modules or combined together.
The processor has been described in connection with various apparatus and methods. These processors may be implemented using electronic hardware, computer software, or any combination thereof. Whether such processors are implemented as hardware or software depends upon the particular application and the overall design constraints imposed on the system. By way of example, a processor, any portion of a processor, or any combination of processors presented in this disclosure may be implemented with a microprocessor, a microcontroller, a Digital Signal Processor (DSP), a Field Programmable Gate Array (FPGA), a Programmable Logic Device (PLD), a state machine, gated logic units, discrete hardware circuits, and other suitable processing components configured to perform the various functions described in this disclosure. The functionality of a processor, any portion of a processor, or any combination of processors presented in this disclosure may be implemented using software executed by a microprocessor, microcontroller, DSP, or other suitable platform.
Software should be viewed broadly as meaning instructions, instruction sets, code segments, program code, programs, subroutines, software modules, applications, software packages, routines, objects, threads of execution, procedures, functions, and the like. The software may reside in a computer readable medium. The computer readable medium may include, for example, memory, which may be, for example, a magnetic storage device (e.g., hard disk, floppy disk, magnetic strip), an optical disk, a smart card, a flash memory device, a Random Access Memory (RAM), a Read-Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable PROM (EEPROM), a register, or a removable disk. Although the memory is shown as being separate from the processor in the aspects presented in this disclosure, the memory may also be located internal to the processor, such as a cache or registers.
The above description is provided to enable any person skilled in the art to practice the various aspects described herein. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects. Thus, the claims are not intended to be limited to the aspects shown herein. All structural and functional equivalents to the elements of the various aspects described herein that are known or later come to be known to those of ordinary skill in the art are expressly incorporated herein by reference and are intended to be encompassed by the claims.

Claims (20)

1. A method for detecting the validity of content associated with data collection, comprising:
obtaining content related to data collection;
detecting at least one identifying element from the content;
identifying an entity corresponding to the identifying element; and
determining the validity of the content based at least on a correlation between a creator of the content and the entity.
2. The method of claim 1, wherein the content is obtained during creation of the content.
3. The method of claim 2, further comprising, in response to determining that the content is not legitimate, performing at least one of:
sending a prompt message to the creator and/or data collection service provider; and
preventing the content from being published.
4. The method of claim 1, wherein the identifying element is different from a true identifying element of the entity.
5. The method of any of claims 1, 2, or 4, wherein the identifying element comprises a textual identifying element, and the textual identifying element is detected from the content or information related to the content.
6. The method of any of claims 1, 2 or 4, wherein the identifying element comprises an image identifying element, and the detecting at least one identifying element comprises:
extracting an image from the content; and
acquiring the image identifying element from the image.
7. The method of claim 6, wherein the extracting the image comprises:
extracting the image from a predetermined position of the content.
8. The method of claim 6, wherein said obtaining the image identifying element comprises:
determining a location of the image identifying element in the image; and
obtaining the image identifying element from the location.
9. The method of claim 6, wherein said obtaining the image identifying element comprises:
obtaining a set of sub-images by scanning the image with a sliding window; and
obtaining the image identifying element from the set of sub-images.
10. The method of any of claims 1, 2, or 4, wherein the identifying element comprises an image identifying element, and the identifying an entity corresponding to the identifying element comprises:
identifying at least one candidate entity corresponding to at least one visual component in the image identifying element; and
determining the entity based at least on the at least one candidate entity.
11. The method of claim 10, further comprising:
determining at least one location of the at least one visual component, and
wherein the entity is determined further based on the at least one location.
12. The method of claim 10, wherein the at least one visual component comprises a textual visual component and the at least one candidate entity is identified by Optical Character Recognition (OCR).
13. The method of claim 10, wherein the at least one visual component comprises a non-textual visual component and the at least one candidate entity is identified by an entity identification model.
14. The method of claim 13, wherein the entity recognition model is a deep learning based model trained with a set of visual component samples, each visual component sample of the set of visual component samples having an entity label.
15. The method of claim 14, wherein the visual component samples of the set of visual component samples are obtained by at least one of:
extracting them from true identifying elements;
transforming visual components in true identifying elements; and
creating them for a particular entity.
16. The method of claim 15, wherein the transforming the visual component comprises at least one of:
rotating the visual component;
flipping the visual component;
changing the color of the visual component to another color; and
converting the color of the visual component to grayscale.
17. The method of any of claims 1, 2, or 4, wherein the entity is identified based at least on a context of the identifying element.
18. The method of any of claims 1, 2, or 4, further comprising:
identifying reference entities corresponding to other elements in the content, the other elements including at least one of a template, a style, and a color; and
determining a final entity based on the entity corresponding to the identifying element and the reference entity, and
wherein the validity of the content is determined based at least on a correlation between the creator and the end entity.
19. An apparatus for detecting the validity of content associated with data collection, comprising:
a content obtaining module for obtaining content related to data collection;
an identifying element detecting module for detecting at least one identifying element from the content;
an entity identification module for identifying an entity corresponding to the identifying element; and
a validity determination module to determine validity of the content based at least on a correlation between a creator of the content and the entity.
20. An apparatus for detecting the validity of content associated with data collection, comprising:
at least one processor; and
a memory storing computer-executable instructions that, when executed, cause the at least one processor to:
content related to the collection of data is obtained,
detecting at least one identifying element from the content,
identifying an entity corresponding to the identifying element, an
Determining the validity of the content based at least on a correlation between a creator of the content and the entity.