CN109313659B - Anomaly detection for revisions to web documents - Google Patents


Info

Publication number
CN109313659B
Authority
CN
China
Prior art keywords
web document
updated
published
score
anomaly
Prior art date
Legal status
Active
Application number
CN201780038502.5A
Other languages
Chinese (zh)
Other versions
CN109313659A
Inventor
Lakshmi Narasimhan
Himanshu Jain
Current Assignee
eBay Inc
Original Assignee
eBay Inc
Priority date
Filing date
Publication date
Application filed by eBay Inc
Priority to CN202210713893.9A (published as CN115238207A)
Publication of CN109313659A
Application granted
Publication of CN109313659B
Status: Active
Anticipated expiration


Classifications

    • H04L63/1425 Traffic logging, e.g. anomaly detection
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G06F21/554 Detecting local intrusion or implementing counter-measures involving event detection and direct action
    • G06V10/462 Salient features, e.g. scale invariant feature transforms [SIFT]
    • G06V30/418 Document matching, e.g. of document images
    • H04L67/02 Protocols based on web technology, e.g. hypertext transfer protocol [HTTP]
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G06F2221/2119 Authenticating web pages, e.g. with suspicious links

Abstract

Aspects of the present disclosure include a system, a computer-implemented method, and a computer-readable storage medium storing at least one program for detecting anomalies in revisions to a web document. According to some embodiments, a method includes publishing, on a network-based content publication platform, a web document that includes a plurality of different elements generated using data received from a user's computing device. The method also includes accessing an updated web document generated based on modifications made by the user to the published web document. The method further includes generating one or more anomaly scores based on a comparison of the updated web document with the published web document, and determining whether to allow publication of the updated web document based on a comparison of the anomaly scores to a threshold anomaly score.

Description

Anomaly detection for revisions to web documents
Cross Reference to Related Applications
This international application claims priority to U.S. Patent Application Serial No. 15/188,532, entitled "ANOMALY DETECTION FOR WEB DOCUMENT REVISION," filed on June 21, 2016, the entire contents of which are incorporated herein by reference.
Technical Field
The present disclosure relates generally to special-purpose machines that facilitate digital content management, including computerized variants of such special-purpose machines and improvements to such variants, and to the technologies by which such special-purpose machines become improved compared to other special-purpose machines that facilitate digital content management. In particular, the present disclosure presents systems and methods for detecting anomalous revisions to published digital content.
Background
Many online content publication platforms allow users to generate and publish content online in the form of web documents (e.g., web pages), which other users can browse using a web browser or application. Each published web document is typically assigned a Uniform Resource Identifier (URI) at or before publication. Typically, these online content publication platforms allow users to revise content even after the content is published. While the content of a web document may be updated, its URI typically remains unchanged. In some cases, allowing revisions to a web document while maintaining the same URI can be problematic for online content publication platforms.
In one example, the content publication platform is an online marketplace that allows users to create content in the form of product listings to offer products for sale to other users. A user of the online marketplace may initially create a product listing for a first product that is in high demand, and the online marketplace may then assign a URI to that product listing. The user may later revise the product listing entirely so that it covers a second, less desirable product in lower demand, while the URI assigned to the listing remains the same and may still be associated with the inventory and historical sales of the first, high-demand product. In this way, the user may exploit the revision capabilities of the online marketplace to manipulate search rankings for undesirable products, conceal low demand, manipulate historical sales figures, or otherwise manipulate consumer demand for products. While this may prove beneficial to individual users, it leads to an overall decline in the navigation quality, information accuracy, and overall performance and reputation of the online marketplace.
Drawings
The various drawings illustrate only example embodiments of the disclosure and are not to be considered limiting of its scope.
FIG. 1 is a network diagram illustrating a content publication platform having a client-server architecture configured for exchanging data over a network, according to an example embodiment.
FIG. 2 is a block diagram illustrating various functional components of an anomaly detection system provided as part of a content publication platform, according to an example embodiment.
FIG. 3 is an interaction diagram illustrating an example exchange between a publication system and an anomaly detection system, according to an example embodiment.
FIG. 4A is an interface diagram illustrating a published web document according to an example embodiment.
FIG. 4B is an interface diagram illustrating an updated version of a published web document, according to an example embodiment.
FIGS. 5 to 8 are flowcharts illustrating example operations of the anomaly detection system in providing anomaly detection services for a content publication platform, according to example embodiments.
FIG. 9 is a flowchart illustrating example operations of a content publication system in providing a user-generated content publication service, according to an example embodiment.
FIG. 10 is a diagrammatic representation of a machine in the example form of a computer system within which a set of instructions, for causing the machine to perform any one or more of the methodologies discussed herein, may be executed.
Detailed Description
Reference will now be made in detail to specific exemplary embodiments for carrying out the inventive subject matter. Examples of these specific embodiments are illustrated in the accompanying drawings. It will be understood that these examples are not intended to limit the scope of the claims to the embodiments shown. On the contrary, they are intended to cover alternatives, modifications, and equivalents, which may be included within the scope of the disclosure. In the following description, specific details are set forth in order to provide a thorough understanding of the present subject matter. Embodiments may be practiced without some or all of these specific details.
Aspects of the present disclosure relate to systems and methods for detecting anomalies in revisions of web documents. As used herein, an "anomaly" includes a modification to a web document that alters the web document beyond an expected change threshold. In an example embodiment, the web document is a web page, e.g., an online marketplace listing for a product. The disclosed subject matter is applicable to any online content publication platform that allows users to revise published content. In one example, the online content publication platform is an online marketplace.
Example embodiments relate to web documents that include multiple different elements (such as images, text, and numerical values). In an example web document for a listing, the different elements may include an image of the product, a textual description of the product, and a price. In these example embodiments, the method may include accessing an updated version of the published web document (e.g., a user revision). The method also includes comparing portions of the published (e.g., unchanged) web document to corresponding portions of the updated document to generate various anomaly scores. For example, a published image included in a published web document may be compared to an updated image included in an updated web document to generate an image similarity score; published text included in the published web document may be compared to updated text included in the updated web document to generate a text match score; and the updated value included in the updated web document may be compared to the published value (or an average of previously published associated values) to generate a value deviation score.
The method may also include determining whether to publish the updated web document based on the comparison of the anomaly scores to a threshold anomaly score. In some embodiments, the system may prevent publication of the updated web document if any one of the anomaly scores is above a threshold score. In other embodiments, the system may prevent publication of the updated web document if a combination of the anomaly scores is above a threshold score. Other aspects of the disclosure include using machine learning techniques to refine the threshold scores based on manual inspection of revised listings in which anomalies were detected.
Referring to FIG. 1, an example embodiment of an advanced client-server based architecture 100 is shown. Although FIG. 1 illustrates a client-server based architecture 100, the subject matter of the present invention is of course not limited to such an architecture, and is equally well suited for use in, for example, event-driven, distributed, or peer-to-peer architecture systems. Moreover, various functional components that are not relevant for conveying an understanding of the inventive subject matter have been omitted from FIG. 1 in order to avoid obscuring the inventive subject matter with unnecessary detail. Furthermore, it should be understood that although the various functional components shown in FIG. 1 are discussed in the singular, multiple instances of any of the various functional components may be employed.
A content publication platform 102, in the example form of a network-based marketplace, provides server-side functionality to one or more client devices 110 via a network 104 (e.g., the internet or a Wide Area Network (WAN)). For example, fig. 1 shows a web client 112 (e.g., a browser), a client application 114, and a programming client 116 executing on a client device 110. One or more portions of network 104 may be an ad hoc network, an intranet, an extranet, a Virtual Private Network (VPN), a Local Area Network (LAN), a wireless LAN (WLAN), a Wide Area Network (WAN), a wireless WAN (WWAN), a Metropolitan Area Network (MAN), a portion of the internet, a portion of the Public Switched Telephone Network (PSTN), a cellular telephone network, a wireless network, a WiFi network, a WiMax network, another type of network, or a combination of two or more such networks.
Client devices 110 may include, but are not limited to: a mobile phone, desktop computer, laptop computer, Portable Digital Assistant (PDA), smart phone, tablet computer, ultrabook, netbook, notebook computer, multiprocessor system, microprocessor-based or programmable consumer electronics, game console, set-top box, or any other communication device that a user may use to access the content publication platform 102. In some embodiments, client device 110 may include a display module (not shown) to display information (e.g., in the form of a user interface). In other embodiments, client device 110 may include one or more of a touch screen, an accelerometer, a gyroscope, a camera, a microphone, a Global Positioning System (GPS) device, and so forth. In one embodiment, the content publication platform 102 is a network-based marketplace that publishes publications (e.g., web documents) that include listings of products available on the network-based marketplace.
One or more users 106 may be humans, machines, or other devices that interact with client device 110. In an example embodiment, the user 106 is not part of the client-server based architecture 100, but may interact with the client-server based architecture 100 via the client device 110 or another means. For example, the user 106 may provide input (e.g., touch screen input or alphanumeric input) to the client device 110 and communicate the input to the content publication platform 102 via the network 104. In this example, the content publication platform 102 transmits information to the client device 110 via the network 104 for presentation to the user 106 in response to input received from the user 106. In this manner, the user 106 may interact with the content publication platform 102 using the client device 110.
The client device 110 may include one or more client applications 114 (also referred to as "apps") such as, but not limited to, a web browser, a messaging application, an electronic mail (email) application, an electronic commerce site application (also referred to as a marketplace application), and so forth. In some embodiments, if an e-commerce website application is included in the client device 110, the application may be configured to locally provide at least some of the user interface and functionality, with the client application 114 configured to communicate with the content publication platform 102 as needed to obtain locally unavailable data or processing capabilities (e.g., access a database of items available for sale, authenticate the user 106, verify payment methods, etc.). Conversely, if the e-commerce website application is not included in the client device 110, the client device 110 may use its web browser to access the e-commerce website (or a variant thereof) hosted on the content publication platform 102.
An Application Program Interface (API) server 120 and a web server 122 are coupled to the application server 140 and provide a programmatic interface and a web interface, respectively, to the application server 140. The application server 140 may host a publication system 142 and an anomaly detection system 144, each of which may include one or more modules or applications, and each of which may be embodied as hardware, software, firmware, or any combination thereof. In turn, the application server 140 is shown coupled to a database server 124 that facilitates access to a database 126. In an example embodiment, the database 126 is a storage device that stores information (e.g., publications or listings) to be posted by the publication system 142. According to an example embodiment, the database 126 may also store digital item information.
Additionally, a third party application 132 executing on the third party server 130 is shown having programmatic access to the content publication platform 102 via the programmatic interface provided by the API server 120. For example, the third party application 132 supports one or more features or functions on a website hosted by the third party using information retrieved from the content publication platform 102.
The publication system 142 provides a number of publication functions and services to users 106 accessing the content publication platform 102. For example, publication system 142 provides an interface that allows users 106 to create and publish web documents using client devices 110 (e.g., by communicating with client devices 110). Publishing system 142 may also provide an interface that allows user 106 to modify various portions of a published web document.
Anomaly detection system 144 is configured to monitor changes made by user 106 to published web documents in order to detect anomalous updates. To this end, the anomaly detection system 144 compares the updated web document to previous or original versions of the web document to generate various anomaly scores based on the extent to which the web document was modified. The anomaly detection system 144 may mark certain updated web documents based on these anomaly scores to prevent the publication system 142 from publishing the updated web documents.
Although the publication system 142 and the anomaly detection system 144 are both shown in FIG. 1 as forming part of the content publication platform 102 (e.g., the publication system 142 and the anomaly detection system 144 are subsystems of the content publication platform 102), it should be understood that in alternative embodiments, each of the systems 142 and 144 may form part of a service or platform that is separate and distinct from the content publication platform 102. In some embodiments, anomaly detection system 144 may form a portion of publication system 142.
FIG. 2 is a block diagram illustrating various functional components of the anomaly detection system 144 according to an example embodiment. Functional components (e.g., modules, engines, and databases) that are not germane to an understanding of the inventive subject matter have been omitted from fig. 2 in order not to obscure the inventive subject matter with unnecessary detail. However, those skilled in the art will readily recognize that various additional functional components may be supported by anomaly detection system 144 to facilitate additional functions not specifically described herein.
As shown, the anomaly detection system 144 includes an anomaly detector 200, a decision module 240, and a refinement module 250. The anomaly detector 200 includes a text anomaly detector 210, an image anomaly detector 220, and a numerical anomaly detector 230. Each of the above-referenced functional components of the anomaly detection system 144 is configured to communicate with the others (e.g., via a bus, shared memory, a switch, or an Application Programming Interface (API)). Any one or more of the functional components shown in FIG. 2 and described herein may be implemented using hardware (e.g., a processor of a machine) or a combination of hardware and software. For example, any module described herein may configure a processor to perform the operations described herein for that module. Further, any two or more of these modules may be combined into a single module, and the functionality described herein for a single module may be subdivided into multiple modules. Further, according to various example embodiments, any of the functional components shown in FIG. 2 may be implemented together or separately within a single machine, database, or device, or may be distributed across multiple machines, databases, or devices.
Text anomaly detector 210 is responsible for detecting anomalies in revised web documents that occur as a result of modifications to text included in one or more previously published versions of a web document. To this end, text anomaly detector 210 is configured to compare modified text included in the modified web document to published text included in the published web document. Using this comparison as a basis, text anomaly detector 210 generates a text match score that provides a measure of similarity between the modified text and the published text. To determine whether the modified text represents an anomaly, the text anomaly detector 210 compares the text match score to a threshold text match score. If the text match score crosses the threshold text match score, text anomaly detector 210 determines that an anomaly exists with respect to the modified text.
The image anomaly detector 220 is responsible for detecting anomalies in revised web documents that occur as a result of modifications to images included in one or more previously published versions of a web document. To this end, the image anomaly detector 220 is configured to compare a modified image included in the modified web document with a published image included in the published web document. Using this comparison as a basis, the image anomaly detector 220 generates an image similarity score that provides a measure of similarity between the modified image and the published image. To determine whether the modified image represents an anomaly, the image anomaly detector 220 compares the image similarity score to a threshold image similarity score. If the image similarity score crosses the threshold image similarity score, the image anomaly detector 220 determines that an anomaly exists with respect to the modified image.
In some embodiments, the image anomaly detector 220 extracts keypoints from the original image and stores the keypoints in the database 126. The updated image is compared to the originally published image by comparing each feature from the new image with the features of the original image stored in the database 126 and finding candidate matching features based on the Euclidean distance of their feature vectors. From the full set of matches, the image anomaly detector 220 identifies the subset of keypoints in the new image that agree on the object and its position, scale, and orientation, in order to filter out good matches.
The numerical anomaly detector 230 is responsible for detecting anomalies in revised web documents that occur as a result of modifications to numerical values included in one or more previously published versions of web documents. To this end, the numerical anomaly detector 230 is configured to compare a modified numerical value included in the modified web document with one or more numerical values included in one or more published web documents. In some embodiments, the numerical anomaly detector 230 compares the modified numerical value to a published numerical value included in the published version of the same web document. In other embodiments, the numerical anomaly detector 230 compares the modified numerical value to an average of a plurality of numerical values included in associated published web documents. For example, the web document may include a product price, and the numerical anomaly detector 230 may compare the modified product price to an average price for the product calculated from multiple listings of the product.
Using the comparison as a basis (e.g., a comparison to a single previously published value or to an average of published values), the numerical anomaly detector 230 generates a numerical deviation score that indicates the deviation of the modified value from the one or more published values. To determine whether the modified value represents an anomaly, the numerical anomaly detector 230 compares the numerical deviation score to a threshold numerical deviation score. If the numerical deviation score crosses the threshold numerical deviation score, the numerical anomaly detector 230 determines that an anomaly exists with respect to the modified numerical value.
Each of the text match score, the image similarity score, and the numerical deviation score may be considered an anomaly score. Text anomaly detector 210, image anomaly detector 220, and numerical anomaly detector 230 provide, to the decision module 240, an anomaly score and an indication (e.g., a flag) of whether an anomaly exists in the respective portion (e.g., text, image, or numerical value) of the modified web document. The decision module 240 uses the information provided by the text anomaly detector 210, the image anomaly detector 220, and the numerical anomaly detector 230 to determine whether to allow publication of the modified web document. In some embodiments, the decision module 240 may prevent the modified web document from being published if there is an anomaly in any one portion (e.g., text, image, or numerical value) of the modified web document.
In some embodiments, the decision module 240 aggregates the individual anomaly scores (e.g., the text matching score, the image similarity score, and the numerical deviation score) to generate an aggregate anomaly score. For example, the decision module 240 may sum the respective anomaly scores to generate the aggregate anomaly score. The decision module 240 may further compare the aggregate anomaly score to a threshold aggregate anomaly score to determine whether to allow publication of the updated web document. If the aggregate anomaly score crosses the threshold aggregate anomaly score, the decision module 240 prevents the modified web document from being published.
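By way of a concrete illustration, the decision logic described above might be sketched in Python as follows. The threshold values, weights, and the way similarity scores are converted into anomaly contributions for the aggregate path are assumptions made for this sketch; the disclosure does not prescribe them.

from dataclasses import dataclass

@dataclass
class AnomalyScores:
    text_match: float        # similarity of updated vs. published text (higher = more similar)
    image_similarity: float  # similarity of updated vs. published image (higher = more similar)
    value_deviation: float   # deviation of updated vs. published/average value (higher = bigger change)

# Illustrative thresholds only; per the description, each threshold may be a
# minimum (for similarity scores) or a maximum (for deviation scores).
MIN_TEXT_MATCH = 50.0        # text match score on a 0-100 scale
MIN_IMAGE_SIMILARITY = 0.30  # fraction of matched image descriptors
MAX_VALUE_DEVIATION = 0.50   # relative change in the numerical value

def block_publication_any(scores: AnomalyScores) -> bool:
    """Embodiment in which an anomaly in any single element blocks publication."""
    return (scores.text_match < MIN_TEXT_MATCH
            or scores.image_similarity < MIN_IMAGE_SIMILARITY
            or scores.value_deviation > MAX_VALUE_DEVIATION)

def block_publication_aggregate(scores: AnomalyScores,
                                weights=(1.0, 1.0, 1.0),
                                threshold: float = 1.0) -> bool:
    """Embodiment in which a (weighted) combination of the scores is compared to
    a single aggregate threshold. Similarities are converted to deviations here
    so all contributions point the same way; this conversion is an assumption."""
    w_text, w_image, w_value = weights
    aggregate = (w_text * (1.0 - scores.text_match / 100.0)
                 + w_image * (1.0 - scores.image_similarity)
                 + w_value * scores.value_deviation)
    return aggregate > threshold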
The refinement module 250 is configured to refine the various threshold scores (e.g., the threshold text match score, the threshold image similarity score, the threshold numerical deviation score, and the threshold aggregate anomaly score) based on the results of manual inspection of modified web documents determined to be anomalous. For example, upon deciding to prevent publication of a modified web document, the decision module 240 may flag the modified web document for manual review by an administrator user. If, upon manually inspecting the modified web document, the administrator user determines that no anomaly exists, the refinement module 250 may revise one or more of the threshold anomaly scores, for example, by increasing the threshold.
FIG. 3 is an interaction diagram illustrating an example exchange between publication system 142 and anomaly detection system 144, according to an example embodiment. At operation 302, the publication system 142 publishes a web document generated using data received from the client device 110 based on information provided by the user 106. Upon generation, the web document is assigned a URI that uniquely identifies it. The web client 112, client application 114, or programming client 116 may be used to access and display the web document on the client device 110 or another such device. For example, the client device 110 submits a request for a particular document to the application server 140, and the application server 140 responds to the client device 110 with the web document and any other data the client device 110 needs to display it.
For example, FIG. 4A illustrates an example published web document 400 for a product listing offering a product for sale. As shown, the web document 400 includes a number of different elements, including: text 402, the title of the product; image 404, an image of the product; and numerical value 406, the price of the product. The web document 400 also includes a URI 408 that uniquely identifies the web document 400.
Returning to FIG. 3, in operation 304, the publication system 142 provides a user interface to the client device 110 for revising a published web document (e.g., the published web document 400). The user interface includes a plurality of fields (e.g., text 402, images 404, and values 406) corresponding to a plurality of different elements of the published web document that allow the user 106 to modify each portion of the web document.
At operation 306, the publication system 142 generates an updated web document based on the modifications to the published web document. By way of example, FIG. 4B illustrates an updated web document 450 generated based on modifications to the web document 400. As shown, although text 402 has changed to text 452, image 404 has changed to image 454, and value 406 has changed to value 456, updated web document 450 includes URI 408, which is the same as the URI in web document 400.
At operation 308, the anomaly detection system 144 accesses an updated web document (e.g., updated web document 450) generated by the publication system 142. At operation 310, the anomaly detection system 144 analyzes the updated web document. In analyzing the updated web document, the anomaly detection system 144 compares the various updated portions of the updated web document to the various portions of the published web document (published in operation 302) to generate an anomaly score to be used as a basis for determining whether the updated web document is anomalous.
At operation 312, the anomaly detection system 144 determines whether the updated web document is anomalous. In other words, anomaly detection system 144 determines whether the updated web document includes an anomaly due to one or more modifications made by user 106 using the user interface provided at operation 304. Determining whether the updated web document is anomalous is based on the anomaly detection system 144 determining whether one or more anomaly scores (e.g., text matching score, image similarity score, numerical deviation score, and aggregate anomaly score) cross (e.g., are greater than) a respective threshold anomaly score.
If at operation 312 the anomaly detection system 144 determines that the updated web document is not anomalous, at operation 314 the anomaly detection system 144 allows the updated web document to be published, and at operation 316, the publication system 142 publishes the updated web document.
If at operation 312 anomaly detection system 144 determines that the updated web document is anomalous, then at operation 318 anomaly detection system 144 prevents the updated web document from being published. For example, the anomaly detection system 144 can instantiate a flag that, when read by the publication system 142, causes the publication system 142 to end the publication process with respect to the updated web document.
In response to the anomaly detection system 144 preventing the updated web document from being published, at operation 320, the publication system 142 generates a message to notify the user 106 that modifications to the web document will not be allowed and that the updated web document will not be published due to the detected anomaly. At operation 322, the publication system 142 sends the message to the client device 110 of the user 106. The publication system 142 may communicate messages to the client devices 110 of the users 106 using any of a number of messaging networks and platforms. For example, the publication system 142 may deliver push notifications (e.g., via a push notification service), electronic mail (e-mail), instant message (IM), Short Message Service (SMS), text, fax, or voice (e.g., Voice over IP (VoIP)) messages via wired (e.g., the Internet), Plain Old Telephone Service (POTS), or wireless (e.g., mobile, cellular, WiFi, WiMAX) networks.
FIG. 5 is a flowchart illustrating example operations of the anomaly detection system 144 when executing a method 500 for providing anomaly detection services for the content publication platform 102, according to an example embodiment. Method 500 may be implemented in computer-readable instructions that are executed by one or more processors such that the operations of method 500 may be performed in part or in whole by the anomaly detection system 144; accordingly, the method 500 is described below by way of example with reference thereto. However, it should be understood that at least some of the operations of method 500 may also be deployed on various other hardware configurations, and method 500 is not intended to be limited to the anomaly detection system 144.
At operation 505, the anomaly detection system 144 accesses a published web document (e.g., the published web document 400). The published web document includes a number of different elements, such as text, one or more images, and one or more numerical values. The different elements may be generated by or based on information from the users 106 of the content publication platform 102. In an example, a published web document corresponds to a marketed product listing and includes a textual description of the product, one or more images of the product, and a price of the product.
At operation 510, the anomaly detection system 144 accesses an updated (modified) web document (e.g., updated web document 450). The updated web document is based on one or more modifications to the published web document (e.g., initiated by the user 106). Like published web documents, updated web documents include a number of different elements (e.g., text, one or more images, and one or more numerical values). The updated web document includes at least one user revision (e.g., modification) to a different element of the published web document, and in some cases may include at least one user revision to each portion of the published web document.
The updated web document may be or include data objects stored in the database 126 and may be generated based on user input (e.g., user revisions) received from a user interface that allows the user 106 to edit the published web document. In addition, the published web document and the updated web document are assigned the same URI 408.
At operation 515, the anomaly detector 200 generates one or more anomaly scores. Each of the one or more anomaly scores provides a measure of deviation of the updated web document from the published web document. The one or more anomaly scores may include one or more of a text matching score, an image similarity score, and a numerical deviation score. Thus, generating the one or more anomaly scores may include: calculating a text matching score based on a comparison of updated text (e.g., updated text 452) included in the updated web document and published text (e.g., published text 402) included in the published web document; calculating an image similarity score based on a comparison of an updated image (e.g., updated image 454) included in the updated web document and a published image (e.g., published image 404) included in the published web document; and calculating a numerical deviation score based on a difference between an updated numerical value (e.g., updated numerical value 456) included in the updated web document and an average of numerical values associated with the web document (e.g., an average price of the product).
In some embodiments, the one or more anomaly scores comprise an aggregate anomaly score. Thus, in these embodiments, generating one or more anomaly scores may further include aggregating the text matching score, the image similarity score, and the numerical deviation score to generate an aggregate anomaly score.
At operation 520, the anomaly detection system 144 detects whether the updated web document includes an anomaly. Detecting an anomaly in the updated web document includes comparing the one or more anomaly scores to one or more corresponding threshold anomaly scores. In some embodiments, detecting an anomaly comprises any one of: determining that the text matching score crosses a threshold text matching score; determining that the image similarity score crosses a threshold image similarity score; or determining that the numerical deviation score crosses a threshold numerical deviation score. In embodiments where the one or more anomaly scores include an aggregate anomaly score, detecting an anomaly in the updated web document may include determining that the aggregate anomaly score crosses a threshold aggregate anomaly score.
At operation 520, if the anomaly detection system 144 detects an anomaly in the updated web document, at operation 525 the anomaly detection system 144 prevents the updated web document from being published. For example, the anomaly detection system 144 may instantiate a flag associated with the updated web document that causes the publication system 142 to stop publishing the updated web document.
If, at operation 520, the anomaly detection system 144 does not detect an anomaly in the updated web document, then at operation 530 the anomaly detection system 144 allows the updated web document to be published. Continuing the example above, the anomaly detection system 144 does not instantiate the flag associated with an anomaly, and in turn the publication system 142 continues to publish the updated web document. In another example, the anomaly detection system 144 may instantiate a different flag that signals the publication system 142 to continue publishing the updated web document.
As shown in FIG. 6, method 500 includes operations 605, 610, 615, 620, 625, and 630. In some example embodiments, operations 605, 610, 615, 620, 625, and 630 included in method 500 may be performed prior to or as part of operation 515 of method 500 (e.g., as a precursor task, subroutine, or portion thereof), in which the anomaly detection system 144 generates one or more anomaly scores.
At operation 605, text anomaly detector 210 compares published text (e.g., published text 402) included in a published web document (e.g., published web document 400) with updated text (e.g., updated text 452) included in an updated web document (e.g., updated web document 450). At operation 610, the text anomaly detector 210 generates a text match score based on the comparison of the published text and the modified text. The text matching score provides a measure of similarity between the published text and the modified text.
For example, the text matching score may be or include a cosine similarity score, which provides a measure of similarity between two vectors of an inner product space by measuring the cosine of the angle between them. Thus, in some embodiments, generating the text matching score may include applying a cosine similarity algorithm to two vectors: a first vector corresponding to the published text and a second vector corresponding to the updated text.
In a first example, the published text may include "Headset work Mobile Driving Sunglass Headphone BTglass Wireless Bluetooth" and the modified text may include "Headset work Mobile Driving Sunglass Headphone BTglass Wireless Bluetooth-with extra protection". In this example, the published text has been modified to contain an additional attribute, namely "with extra protection", and applying the cosine similarity algorithm returns a cosine similarity score of 87.71.
In a second example, the published text may include "Headset work Mobile hanging glass Headset Wireless Bluetooth" and the modified text may include "Headset hanging glass Wireless Bluetooth Headset word Mobile phones". In this example, the published text is modified by rearranging the order of some words, and applying the cosine similarity algorithm returns a cosine similarity score of 99.99.
In a third example, the published text may include "Headset word Mobile Driving Sunglass Headset Wireless Bluetooth" and the modified text may include "Universal Qi Wireless Charge Generator steering Pad iphone android htc sony". In this example, the published text has changed completely, and applying the cosine similarity algorithm returns a cosine similarity score of 9.53.
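A minimal bag-of-words cosine similarity along these lines could be sketched in Python as below; the tokenization, weighting, and 0-100 scaling are assumptions of this sketch, so the exact scores quoted in the examples above depend on such choices.

import math
from collections import Counter

def text_match_score(published_text: str, updated_text: str) -> float:
    """Cosine similarity between bag-of-words vectors of the two texts, scaled to 0-100."""
    a = Counter(published_text.lower().split())
    b = Counter(updated_text.lower().split())
    dot = sum(a[token] * b[token] for token in (a.keys() & b.keys()))
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return 100.0 * dot / (norm_a * norm_b)

# Rearranging words leaves the bag-of-words vector essentially unchanged, so the
# score stays near 100; appending a short phrase lowers it slightly; a complete
# rewrite that shares few terms with the published text scores low.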
At operation 615, the image anomaly detector 220 compares a published image (e.g., the published image 404) included in the published web document with an updated image (e.g., the updated image 454) included in the updated web document. In operation 620, the image anomaly detector 220 generates an image similarity score based on the comparison of the published image and the updated image. The image similarity score provides a similarity measure between the published image and the updated image. Further details regarding operation 620 according to some example embodiments are discussed below with reference to fig. 7.
At operation 625, the numerical anomaly detector 230 generates a numerical deviation score based on an analysis of an updated (e.g., modified) numerical value (e.g., updated numerical value 456) included in the updated web document. In some embodiments, the numerical anomaly detector 230 generates a numerical deviation score by calculating a difference between the published numerical value and the updated numerical value.
In other embodiments, the numerical anomaly detector 230 generates a numerical deviation score based on a comparison of the updated numerical value to an average of the numerical values associated with the published web documents. For example, a published web document may include a market listing that provides a product for sale, and the published value may be a product price. In this example, the numerical anomaly detector 230 may determine an average price for a product based on, for example, other listings of the product published by the content publication platform 102. The numerical anomaly detector 230 calculates a difference between the updated price of the product and the calculated average price of the product.
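As a hedged sketch of this comparison in Python, one way to express the deviation is relative to the average reference price; the exact formula and any normalization are assumptions, since the description only requires a measure of how far the updated value departs from the published value or the average of associated published values.

def value_deviation_score(updated_value: float, reference_values: list[float]) -> float:
    """Relative deviation of the updated value from the average of the reference
    values (e.g., prices from other published listings of the same product)."""
    if not reference_values:
        raise ValueError("at least one reference value is required")
    average = sum(reference_values) / len(reference_values)
    if average == 0.0:
        return float("inf") if updated_value != 0.0 else 0.0
    return abs(updated_value - average) / average

# Example: a listing revised from a ~$25 product to a $180 product deviates
# strongly from the average of comparable published listings.
print(value_deviation_score(180.0, [24.0, 25.0, 26.0]))  # ≈ 6.2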
At operation 630, which in some embodiments is optional, the decision module 240 aggregates the text matching score, the image similarity score, and the numerical deviation score to generate an aggregate anomaly score. For example, the decision module 240 may compute a weighted or unweighted sum of the text matching score, the image similarity score, and the numerical deviation score to generate the aggregate anomaly score.
As shown in FIG. 7, method 500 may include additional operations 705, 710, 715, 720, and 725. In some example embodiments, operations 705, 710, 715, 720, and 725 included in method 500 may be performed prior to or as part of operation 620 of method 500 (e.g., as a precursor task, subroutine, or portion thereof), in which the image anomaly detector 220 generates the image similarity score.
At operation 705, the image anomaly detector 220 extracts a first set of feature descriptors from the published image (e.g., published image 404). The first set of feature descriptors includes keypoints of interest in the published image that provide a "feature description" of the published image. To perform reliable identification, it is important that the features extracted from the original image be detectable even under changes in image scale, noise, and illumination. Such points usually lie in high-contrast regions of the image, such as object edges. Similarly, features located in articulated or flexible objects may be unsuitable if any change in their internal geometry occurs between the two images being compared. Accordingly, the image anomaly detector 220 may extract a large number of features from the image to reduce the contribution of the errors caused by these local variations to the average error of all feature-matching errors.
In some embodiments, the image anomaly detector 220 may utilize a scale-invariant feature transform (SIFT) algorithm to extract the image feature descriptors. Using SIFT, the image anomaly detector 220 transforms the published image into a large collection of feature vectors, each of which is invariant to image translation, scaling, and rotation, partially invariant to illumination changes, and robust to local geometric distortion. The key locations may be defined as the maxima and minima of the result of a difference-of-Gaussians function applied in scale space to a series of smoothed and resampled images. The image anomaly detector 220 may discard low-contrast candidate points and edge response points along edges, and assign dominant orientations to the localized keypoints. The image anomaly detector 220 may then obtain the image feature descriptors by considering pixels within a radius around each key location and by blurring and resampling local image orientation planes.
At operation 710, the image anomaly detector 220 stores a first set of feature descriptors in a first matrix corresponding to the published image. A matrix is an array data structure that includes a set of elements, where each element is identified by at least one index or key.
At operation 715, the image anomaly detector 220 extracts a second set of image feature descriptors from the updated image (e.g., updated image 454). Similar to the first set of image feature descriptors, the second set of image feature descriptors includes keypoints of interest in the updated image that provide a "feature description" of the updated image. The image anomaly detector 220 extracts the second set of image feature descriptors from the updated image in a manner similar to that discussed above with reference to extracting the first set of image feature descriptors from the published image. At operation 720, the image anomaly detector 220 stores the second set of feature descriptors in a second matrix corresponding to the updated image.
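The disclosure does not tie these operations to any particular library, but a hedged sketch of operations 705-720 using OpenCV's SIFT implementation might look like the following; the file names are placeholder assumptions.

import cv2
import numpy as np

def extract_descriptor_matrix(image_path: str) -> np.ndarray:
    """Extract SIFT keypoint descriptors from an image. Each row of the returned
    matrix is the 128-dimensional descriptor of one keypoint; low-contrast and
    edge responses are filtered by SIFT's contrast and edge thresholds."""
    image = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if image is None:
        raise FileNotFoundError(image_path)
    sift = cv2.SIFT_create()
    _keypoints, descriptors = sift.detectAndCompute(image, None)
    return descriptors

first_matrix = extract_descriptor_matrix("published_image.jpg")  # published image
second_matrix = extract_descriptor_matrix("updated_image.jpg")   # updated image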
At operation 725, the image anomaly detector 220 compares the first matrix with the second matrix to determine the similarity between the published image and the updated image. The first matrix and the second matrix will be similar if both images show the same object; otherwise, the two matrices will not be similar. Further, if A corresponds to the first matrix and B corresponds to the second matrix, then A · B⁻¹ should be equal to the identity matrix.
Consistent with some embodiments, the operation of comparing the first matrix to the second matrix may include matching image feature descriptors between the first and second matrices. The best candidate match for each feature descriptor in the second matrix is found by identifying its nearest neighbor in the first matrix. The nearest neighbor is defined as the feature descriptor with the minimum Euclidean distance from the given descriptor vector. The image anomaly detector 220 may identify nearest neighbors using, for example, the best-bin-first algorithm, which is a variation of the k-d tree algorithm.
The image anomaly detector 220 generates the image similarity score based on the comparison of the first matrix and the second matrix. In an example, the image anomaly detector 220 generates an image similarity score equal to the number of image feature descriptors matched between the two matrices divided by the total number of image feature descriptors.
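A corresponding sketch of the matching and scoring, using OpenCV's FLANN-based k-d tree matcher as an approximation of the best-bin-first search and Lowe's ratio test to keep only good candidate matches; the 0.75 ratio and the use of FLANN are assumptions of this sketch, not requirements of the description.

import cv2

def image_similarity_score(first_matrix, second_matrix, ratio: float = 0.75) -> float:
    """Fraction of descriptors in the updated image (second matrix) that find a
    good nearest-neighbour match among the published image's descriptors (first
    matrix), using a k-d tree based search and a distance-ratio test."""
    if first_matrix is None or second_matrix is None or len(second_matrix) == 0:
        return 0.0
    index_params = dict(algorithm=1, trees=5)  # algorithm=1 selects the k-d tree index
    matcher = cv2.FlannBasedMatcher(index_params, dict(checks=50))
    matches = matcher.knnMatch(second_matrix, first_matrix, k=2)
    good = [pair[0] for pair in matches
            if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance]
    return len(good) / len(second_matrix)

# Using the descriptor matrices from the extraction sketch above:
# score = image_similarity_score(first_matrix, second_matrix)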
As shown in FIG. 8, method 500 may include additional operations 805, 810, 815, 820, and 825. In some example embodiments, operations 805, 810, 815, 820, and 825 included in method 500 may be performed prior to or as part of operation 520 of method 500 (e.g., as a precursor task, subroutine, or portion thereof), in which the anomaly detection system 144 determines whether an anomaly is detected in the updated web document.
At operation 805, the text anomaly detector 210 compares the text match score to a threshold text match score to determine whether the text match score crosses the threshold text match score. The threshold text match score may be a default value set by an administrator of the content publication platform 102, or it may be a value refined by the refinement module 250 using machine learning techniques. The threshold text match score may be a minimum or a maximum value. Thus, the text match score may be considered to cross the threshold text match score when it is above a maximum text match score or below a minimum text match score.
At operation 805, if the text anomaly detector 210 determines that the text match score crosses the threshold text match score, at operation 825, the text anomaly detector 210 determines that an anomaly exists in the updated web document relative to the updated text and the decision module 240 determines that an anomaly exists in the updated web document. At operation 805, if the text anomaly detector 210 determines that the text match score does not cross the threshold text match score, the method continues to operation 810.
At operation 810, the image anomaly detector 220 compares the image similarity score to a threshold image similarity score to determine whether the image similarity score crosses the threshold image similarity score. The threshold image similarity score may be a default value set by an administrator of the content publication platform 102, or it may be a value refined by the refinement module 250 using machine learning techniques. The threshold image similarity score may be a minimum or a maximum value. Thus, the image similarity score may be considered to cross the threshold image similarity score when it is above a maximum image similarity score or below a minimum image similarity score.
If, at operation 810, the image anomaly detector 220 determines that the image similarity score crosses the threshold image similarity score, at operation 825, the image anomaly detector 220 determines that an anomaly exists in the updated web document relative to the updated image, and the decision module 240 determines that an anomaly exists in the updated web document. At operation 810, if the image anomaly detector 220 determines that the image similarity score does not cross the threshold image similarity score, the method continues to operation 815.
At operation 815, the numerical anomaly detector 230 compares the numerical deviation score to a threshold numerical deviation score to determine whether the numerical deviation score crosses the threshold numerical deviation score. The threshold numerical deviation score may be a default value set by an administrator of the content publication platform 102, or it may be a value refined by the refinement module 250 using machine learning techniques. The threshold numerical deviation score may be a minimum or a maximum value. Thus, the numerical deviation score may be considered to cross the threshold numerical deviation score when it is above a maximum numerical deviation score or below a minimum numerical deviation score.
If the numerical anomaly detector 230 determines that the numerical deviation score crosses the threshold numerical deviation score at operation 815, the numerical anomaly detector 230 determines that an anomaly exists in the updated web document relative to the updated numerical value and the decision module 240 determines that an anomaly exists in the updated web document at operation 825. At operation 815, if the numerical anomaly detector 230 determines that the numerical deviation score does not cross the threshold numerical deviation score, the method continues to operation 820, where the decision module 240 determines that no anomalies are detected in the updated web document.
FIG. 9 is a flowchart illustrating example operations of the content publication platform 102 in performing a method 900 for providing user-generated content publication services, according to an example embodiment. The method 900 may be implemented in computer-readable instructions that are executed by one or more processors such that the operations of the method 900 may be performed in part or in whole by the content publication platform 102; accordingly, the method 900 is described below by way of example with reference thereto. However, it should be understood that at least some of the operations of method 900 may also be deployed on various other hardware configurations, and method 900 is not intended to be limited to the content publication platform 102.
At operation 905, the anomaly detection system 144 accesses a corpus of revised web documents (e.g., stored in the database 126). Each revised web document included in the corpus includes at least one revision to a portion of the web document published by the publication system 142.
At operation 910, the anomaly detection system 144 generates an anomaly score (e.g., a text matching score, an image similarity score, and a numerical deviation score) for each revised web document. The anomaly detection system 144 may generate an anomaly score according to the methods discussed above with reference to fig. 4-7.
At operation 915, the refinement module 250 uses the generated anomaly scores to build a decision tree for classifying future revisions. The refinement module 250 stores the decision tree as a training model in the database 126.
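One plausible realization of this step, sketched with scikit-learn; the feature layout (one row of anomaly scores per revised document), the labels from manual review, and the specific classifier parameters are assumptions made for illustration.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training data: one row per revised web document, with columns
# [text matching score, image similarity score, numerical deviation score], and
# a label from manual review (1 = reviewer confirmed an anomaly, 0 = acceptable).
X = np.array([
    [95.0, 0.90, 0.05],
    [88.0, 0.75, 0.10],
    [12.0, 0.05, 3.20],
    [40.0, 0.20, 1.50],
])
y = np.array([0, 0, 1, 1])

training_model = DecisionTreeClassifier(max_depth=3, random_state=0)
training_model.fit(X, y)

# Classify a new revision from its anomaly scores; the scores generated at later
# operations can be fed back in the same row format as further training data.
print(training_model.predict(np.array([[55.0, 0.40, 0.60]])))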
At operation 920, the anomaly detection system 144 receives a revision (e.g., update) to the published web document. The revisions may be based on input received from the user 106 via a user interface provided by the content publication platform 102 and displayed on the client device 110.
At operation 925, the anomaly detection system 144 generates one or more anomaly scores for the revised web document. The anomaly detection system 144 generates the anomaly scores according to the methods discussed above with reference to FIGS. 4-7. At operation 930, the one or more anomaly scores are provided to the refinement module 250 for inclusion in the training model.
At operation 935, the anomaly detection system 144 detects whether an anomaly has occurred in the revised web document based on the one or more anomaly scores. If, at operation 935, the anomaly detection system 144 does not detect an anomaly in the revised web document, then at operation 940 the anomaly detection system 144 allows the publication system 142 to publish the revised web document. At operation 940, the anomaly detection system 144 also sends the revised web document to the administrator user's computer device for manual review. The manual review process allows an administrator user (e.g., a human user) to manually evaluate whether the revised document is anomalous (e.g., includes anomalies) and provide approval of the revised web document based on the manual evaluation. In other words, if the administrator user determines that the revised web document does not contain an anomaly, the administrator user approves the revised web document.
At operation 945, the content publication platform 102 receives the results of the manual review (e.g., whether the revised web document was approved by the administrator user). If the administrator user approves the revised web document, the content publication platform 102 notifies the user 106 that the revision is allowed and that the revised web document is to be published. For example, the content publication platform 102 may send a message to the client device 110 to notify the user 106 of the situation.
If the administrator user does not approve the revised web document (e.g., because the revised web document is anomalous), the content publication platform 102 notifies the user 106 that the revision is not allowed and that the revised web document is not to be published at operation 955. For example, the content publication platform 102 may send a message to the client device 110 to notify the user 106 of the situation.
The content publication platform 102 also provides the results of the manual review to the refinement module 250 for inclusion in the decision tree, and the refinement module 250 in turn refines the training model. Depending on the particular outcome of the manual review, refining the training model may, for example, include refining one or more threshold anomaly scores, such as a threshold text matching score, a threshold image similarity score, or a threshold numerical deviation score. Refining the one or more threshold anomaly scores may include increasing or decreasing a threshold anomaly score.
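To make the threshold-refinement step concrete, the following Python sketch adjusts a threshold anomaly score in response to a manual-review outcome; the function name, the step size, and the assumption that a score crosses the threshold by exceeding it are illustrative only.

def refine_threshold(threshold, flagged_by_system, approved_by_review, step=0.05):
    # Treat the anomaly score as crossing the threshold when it exceeds it.
    # If a flagged revision is approved on manual review (a false positive),
    # raise the threshold so that similar scores no longer trigger an anomaly;
    # if an unflagged revision is rejected (a false negative), lower it.
    if flagged_by_system and approved_by_review:
        return threshold + step
    if not flagged_by_system and not approved_by_review:
        return threshold - step
    return threshold

# Example: a flagged revision was approved by the administrator, so the
# threshold anomaly score is relaxed from 0.70 to 0.75.
new_threshold = refine_threshold(0.70, flagged_by_system=True, approved_by_review=True)

This direction of adjustment mirrors example 11 below, in which approval of an updated web document as a result of manual review leads to an increased threshold anomaly score.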
At operation 935, if the anomaly detection system 144 detects an anomaly in the revised web document, at operation 960, the anomaly detection system 144 prevents the publication system 142 from publishing the revised web document. At operation 960, the anomaly detection system 144 also sends the revised web document to the computer device of the administrator user for manual review. The manual review process allows an administrator user (e.g., a human user) to manually evaluate whether the revised document is anomalous (e.g., includes anomalies) and provide approval of the revised web document based on the manual evaluation.
At operation 965, the content publication platform 102 receives the results of the manual review (e.g., whether the administrator user approved the revised web document 450). If the administrator user does not approve the revised web document (e.g., because the revised web document is anomalous), the content publication platform 102 notifies the user 106 that the revision is not allowed and that the revised web document is not to be published at operation 955. For example, the content publication platform 102 may send a message to the client device 110 to notify the user 106 of the situation.
If the administrator user approves the revised web document 450, the content publication platform 102 provides the results of the manual review to the refinement module 250 for inclusion in the decision tree, and the refinement module 250 in turn refines the training model. Depending on the particular outcome of the manual review, refining the training model may, for example, include refining one or more threshold anomaly scores, such as a threshold text matching score, a threshold image similarity score, or a threshold numerical deviation score. Refining the one or more threshold anomaly scores may include increasing or decreasing a threshold anomaly score.
Machine architecture
FIG. 10 is a block diagram illustrating components of a machine 1000 capable of reading instructions from a machine-readable medium (e.g., a machine-readable storage medium) and performing any one or more of the methodologies discussed herein, according to some example embodiments. In particular, FIG. 10 shows a schematic diagram of a machine 1000 in the example form of a computer system, where instructions 1016 (e.g., software, a program, an application, an applet, an app, or other executable code) may be executed to cause the machine 1000 to perform any one or more of the methodologies discussed herein. For example, the instructions 1016 may include executable code that causes the machine 1000 to perform any of the methods 500 or 900. These instructions convert the general-purpose, unprogrammed machine into a specific machine that is programmed to perform the functions described and illustrated by the publication system 142 and the anomaly detection system 144 in the manner described herein. The machine 1000 may operate as a standalone device or may be coupled (e.g., networked) to other machines. In a networked deployment, the machine 1000 may operate in the capacity of a server machine or a client machine in a server-client network environment, or as a peer machine in a peer-to-peer (or distributed) network environment. As non-limiting examples, the machine 1000 may include or correspond to a server computer, a client computer, a Personal Computer (PC), a tablet computer, a laptop computer, a netbook, a set-top box (STB), a Personal Digital Assistant (PDA), an entertainment media system, a cellular telephone, a smartphone, a mobile device, a wearable device (e.g., a smartwatch), a smart home device (e.g., a smart appliance), other smart devices, a network device, a network router, a network switch, a network bridge, or any machine capable of sequentially or otherwise executing the instructions 1016 that specify actions to be taken by the machine 1000. Further, while only a single machine 1000 is illustrated, the term "machine" shall also be taken to include a collection of machines 1000 that individually or jointly execute the instructions 1016 to perform any one or more of the methodologies discussed herein.
The machine 1000 may include a processor 1010, a memory/storage device 1030, and I/O components 1050 that may be configured to communicate with each other, e.g., via the bus 1002. In an example embodiment, processor 1010 (e.g., a Central Processing Unit (CPU), a Reduced Instruction Set Computing (RISC) processor, a Complex Instruction Set Computing (CISC) processor, a Graphics Processing Unit (GPU), a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Radio Frequency Integrated Circuit (RFIC), other processors, or any suitable combination thereof) may include, for example, processor 1012 and processor 1014, which may execute instructions 1016. The term "processor" is intended to include a multicore processor 1010, which may include two or more independent processors (sometimes referred to as "cores") that may execute instructions concurrently. Although fig. 10 shows multiple processors, the machine 1000 may include a single processor with a single core, a single processor with multiple cores (e.g., a multi-core processor), multiple processors with a single core, multiple processors with multiple cores, or any combination thereof.
The memory/storage 1030 may include a memory 1032 (e.g., a main memory or other storage device) and a storage unit 1036, both the memory 1032 and the storage unit 1036 being accessible by the processor 1010, e.g., via the bus 1002. The storage unit 1036 and the memory 1032 store the instructions 1016 embodying any one or more of the methodologies or functions described herein. The instructions 1016 may also reside, completely or partially, within the memory 1032, within the storage unit 1036, within at least one of the processors 1010 (e.g., within a cache memory of a processor), or within any suitable combination thereof, during execution of the instructions 1016 by the machine 1000. Thus, the memory 1032, the storage unit 1036, and the memory of the processor 1010 are examples of machine-readable media.
As used herein, a "machine-readable medium" refers to a device capable of storing or carrying instructions and data, either temporarily or permanently, and may include, but is not limited to, Random Access Memory (RAM), Read Only Memory (ROM), cache memory, flash memory, optical media, magnetic media, cache memory, other types of memory (e.g., erasable programmable read only memory (EEPROM)), and/or any suitable combination thereof. The term "machine-readable medium" shall be taken to include a single medium or multiple media (e.g., a centralized or distributed database, or associated caches and servers) that are capable of storing instructions 1016. The term "machine-readable medium" shall also be taken to include any medium, or combination of multiple media, that is capable of storing or executing instructions (e.g., instructions 1016) for execution by a machine (e.g., machine 1000), such that the instructions, when executed by one or more processors of the machine (e.g., processors 1010), cause the machine to perform any one or more of the methodologies described herein. Thus, "machine-readable storage medium" refers to a single storage apparatus or device, as well as a "cloud-based" storage system or storage network that includes multiple storage apparatuses or devices. The term "machine-readable medium" includes both machine-readable storage media and transmission media such as signals.
The I/O components 1050 may include a wide variety of components for receiving input, providing output, generating output, sending information, exchanging information, capturing measurements, and the like. The particular I/O components 1050 included in a particular machine will depend on the type of machine. For example, a portable machine such as a mobile phone would likely include a touch input device or other such input mechanism, while a headless server machine would likely not include such a touch input device. It should be understood that the I/O components 1050 may include many other components not shown in FIG. 10. The I/O components 1050 are grouped by function to simplify the following discussion, and the grouping is not limiting in any way. In various example embodiments, the I/O components 1050 may include output components 1052 and input components 1054. Output components 1052 may include visual components (e.g., a display, such as a Plasma Display Panel (PDP), a Light Emitting Diode (LED) display, a Liquid Crystal Display (LCD), a projector, or a Cathode Ray Tube (CRT)), acoustic components (e.g., speakers), haptic components (e.g., a vibrating motor, a resistive mechanism), other signal generators, and so forth. The input components 1054 may include alphanumeric input components (e.g., a keyboard, a touch screen configured to receive alphanumeric input, an electro-optical keyboard, or other alphanumeric input components), point-based input components (e.g., a mouse, a touchpad, a trackball, a joystick, a motion sensor, or another pointing instrument), tactile input components (e.g., physical buttons, a touch screen providing a location and/or force of a touch or touch gesture, or other tactile input components), audio input components (e.g., a microphone), and so forth.
In other example embodiments, the I/O components 1050 may include a biometric component 1056, a motion component 1058, an environmental component 1060, or a positioning component 1062, among many other components. For example, the biometric components 1056 may include components for detecting expressions (e.g., hand expressions, facial expressions, voice expressions, body gestures, or eye tracking), measuring bio-signals (e.g., blood pressure, heart rate, body temperature, perspiration, or brain waves), identifying a person (e.g., voice recognition, retinal recognition, facial recognition, fingerprint recognition, or electroencephalogram-based recognition), and so forth. The motion components 1058 may include acceleration sensor components (e.g., accelerometers), gravity sensor components, rotation sensor components (e.g., gyroscopes), and the like. The environmental components 1060 may include, for example, illumination sensor components (e.g., a photometer), temperature sensor components (e.g., one or more thermometers that detect ambient temperature), acoustic sensor components (e.g., one or more microphones that detect background noise), or other components that may provide an indication, measurement, or signal corresponding to the surrounding physical environment. The positioning components 1062 may include location sensor components (e.g., Global Positioning System (GPS) receiver components), altitude sensor components (e.g., altimeters or barometers that detect barometric pressure from which altitude may be derived), orientation sensor components (e.g., magnetometers), and the like.
Communication may be accomplished using a variety of technologies. The I/O components 1050 may include communication components 1064 operable to couple the machine 1000 to a network 1080 or devices 1070 via a coupling 1082 and a coupling 1072, respectively. For example, the communication components 1064 may include a network interface component or another suitable device to interface with the network 1080. In further examples, the communication components 1064 may include wired communication components, wireless communication components, cellular communication components, Near Field Communication (NFC) components, Bluetooth® components (e.g., Bluetooth® Low Energy), Wi-Fi® components, and other communication components that provide communication via other modalities. The devices 1070 may be another machine or any of a wide variety of peripheral devices (e.g., a peripheral device coupled via a Universal Serial Bus (USB)).
Further, the communication components 1064 may detect identifiers or include components operable to detect identifiers. For example, the communication components 1064 may include Radio Frequency Identification (RFID) tag reader components, NFC smart tag detection components, optical reader components (e.g., an optical sensor for detecting one-dimensional barcodes such as Universal Product Code (UPC) barcodes, multi-dimensional barcodes such as Quick Response (QR) codes, Aztec codes, Data Matrix, Dataglyph, MaxiCode, PDF417, Ultra Code, and UCC RSS-2D barcodes, and other optical codes), or acoustic detection components (e.g., microphones for identifying tagged audio signals). In addition, a variety of information may be derived via the communication components 1064, such as location via Internet Protocol (IP) geolocation, location via Wi-Fi® signal triangulation, location via detecting an NFC beacon signal that may indicate a particular location, and so forth.
Transmission medium
In various example embodiments, one or more portions of the network 1080 may be an ad hoc network, an intranet, an extranet, a Virtual Private Network (VPN), a Local Area Network (LAN), a wireless LAN (WLAN), a Wide Area Network (WAN), a wireless WAN (WWAN), a Metropolitan Area Network (MAN), the Internet, a portion of the Public Switched Telephone Network (PSTN), a Plain Old Telephone Service (POTS) network, a cellular telephone network, a wireless network, a Wi-Fi® network, another type of network, or a combination of two or more such networks. For example, the network 1080 or a portion of the network 1080 may include a wireless or cellular network, and the coupling 1082 may be a Code Division Multiple Access (CDMA) connection, a Global System for Mobile communications (GSM) connection, or another type of cellular or wireless coupling. In this example, the coupling 1082 may implement any of a variety of types of data transfer technology, such as Single Carrier Radio Transmission Technology (1xRTT), Evolution-Data Optimized (EVDO) technology, General Packet Radio Service (GPRS) technology, Enhanced Data rates for GSM Evolution (EDGE) technology, third Generation Partnership Project (3GPP) technology including 3G, fourth generation wireless (4G) networks, Universal Mobile Telecommunications System (UMTS), High Speed Packet Access (HSPA), Worldwide Interoperability for Microwave Access (WiMAX), the Long Term Evolution (LTE) standard, other standards defined by various standards-setting organizations, other long-range protocols, or other data transfer technology.
The instructions 1016 may be transmitted or received over the network 1080 using a transmission medium via a network interface device (e.g., a network interface component included in the communication components 1064) and utilizing any of a number of well-known transfer protocols (e.g., the Hypertext Transfer Protocol (HTTP)). Similarly, the instructions 1016 may be transmitted to or received from the devices 1070 via the coupling 1072 (e.g., a peer-to-peer coupling) using a transmission medium. The term "transmission medium" shall be taken to include any intangible medium that is capable of storing, encoding, or carrying the instructions 1016 for execution by the machine 1000, and includes digital or analog communications signals or other intangible media to facilitate communication of such software. Transmission media are examples of machine-readable media.
The following numbered examples are examples.
1. A system, comprising:
one or more processors;
a computer-readable medium storing instructions that, when executed by one or more processors, cause the system to perform operations comprising:
accessing a published web document, the published web document including a plurality of different elements generated using data received from a user's computing device;
accessing an updated web document based on one or more modifications made by a user to the published web document using an interface presented on a user's computing device, the updated web document including at least one user-generated modification to an element of the plurality of different elements of the published web document;
generating one or more exception scores based on a comparison of the updated web document and the published web document, the one or more exception scores providing a measure of deviation of the updated web document from the published web document; and
determining whether to allow publication of the updated web document based on a comparison of the anomaly score to a threshold anomaly score.
2. The system of example 1, wherein the plurality of different elements of the published web document include text, images, and numerical values.
3. The system of example 1 or example 2, wherein generating the one or more anomaly scores comprises:
performing a comparison of updated text included in the updated web document and published text included in the published web document; and
generating a text match score based on the comparison, the text match score providing a measure of similarity between the updated text and the published text,
wherein the one or more anomaly scores comprise the text match score.
4. The system of example 3, wherein performing the comparison includes determining a similarity between the updated text and the published text using a cosine similarity algorithm.
5. The system of any of examples 1-4, wherein generating the one or more anomaly scores comprises:
performing a comparison of a published image included in the published web document and an updated image included in the updated web document; and
generating an image similarity score based on the comparison, the image similarity score providing a measure of similarity between the published image and the updated image,
wherein the one or more anomaly scores comprise the image similarity score.
6. The system of example 5, wherein performing the comparison comprises:
extracting a first set of feature descriptors based on the published image;
storing a first set of feature descriptors in a first matrix corresponding to the published image;
extracting a second set of feature descriptors based on the updated image;
storing a second set of feature descriptors in a second matrix corresponding to the updated image; and
comparing the first matrix and the second matrix;
wherein generating the image similarity score is based on a comparison of the first matrix and the second matrix.
7. The system of any of examples 1-6, wherein:
generating the one or more exception scores comprises generating a numerical deviation score based on a difference between an updated numerical value included in the updated web document and an average of numerical values associated with the web document; and
the one or more anomaly scores include the numerical deviation score.
8. The system of any of examples 1-7, wherein generating the one or more anomaly scores comprises:
generating a text match score based on a comparison of updated text included in the updated web document and published text included in the published web document;
generating an image similarity score based on a comparison of an updated image included in the updated web document and a published image included in the published web document;
generating a numerical deviation score based on a difference between an updated numerical value included in the updated web document and an average of numerical values associated with the web document; and
aggregating the text matching score, the image similarity score, and the numerical deviation score to generate the anomaly score.
9. The system of any of examples 1-8, wherein determining whether to allow publication of the updated web document based on the comparison of the anomaly score to the threshold anomaly score comprises: responsive to the anomaly score crossing the threshold anomaly score, refraining from publishing the updated web document.
10. The system of example 9, further comprising:
sending the updated web document to an administrator's device for manual review by the administrator; and
refining the one or more threshold anomaly scores based on results of manual review by an administrator.
11. The system of any of examples 1-10, further comprising: receiving approval for the updated web document from the managing computer system as a result of the manual review,
wherein refining the threshold anomaly score comprises increasing the threshold anomaly score based on approval of the updated web document as a result of a manual review.
12. The system of any of examples 1-11, further comprising:
generating a message indicating that the updated web document is anomalous; and
sending the message to a client device of a user responsible for providing the updated web document.
13. The system of any of examples 1-12, wherein determining whether to allow publication of the updated web document based on the comparison of the anomaly score to the threshold anomaly score comprises: in response to the anomaly score not crossing the threshold anomaly score, publishing the updated web document.
14. A method, comprising:
accessing a published web document comprising a plurality of different elements generated using data received from a user's computing device;
accessing an updated web document based on one or more modifications made by a user to the published web document using an interface presented on a user's computing device, the updated web document including at least one user-generated modification to an element of the plurality of different elements of the published web document;
generating, using one or more processors, one or more exception scores based on a comparison of the updated web document and the published web document, the one or more exception scores providing a measure of deviation of the updated web document from the published web document; and determining whether to allow publication of the updated web document based on a comparison of the anomaly score to a threshold anomaly score.
15. The method of example 14, wherein generating the one or more anomaly scores comprises:
generating a text match score based on a comparison of updated text included in the updated web document and published text included in the published web document;
generating an image similarity score based on a comparison of an updated image included in the updated web document and a published image included in the published web document;
generating a deviation score based on a difference between an updated numerical value included in the updated web document and an average of numerical values associated with the web document; and
aggregating the text matching score, the image similarity score, and the deviation score to generate the anomaly score.
16. The method of example 14 or example 15, wherein generating the one or more anomaly scores comprises:
performing a comparison of updated text included in the updated web document and published text included in the published web document; and
generating a text match score based on the comparison, the text match score providing a measure of similarity between the updated text and the published text,
wherein the one or more anomaly scores comprise the text match score.
17. The method of any of examples 14 to 16, wherein generating the one or more anomaly scores comprises:
performing a comparison of a published image included in the published web document and an updated image included in the updated web document; and
generating an image similarity score based on the comparison, the image similarity score providing a measure of similarity between the published image and the updated image,
wherein the one or more anomaly scores comprise the image similarity score.
18. The method of example 17, wherein performing the comparison comprises:
extracting a first set of feature descriptors based on the published image;
storing a first set of feature descriptors in a first matrix corresponding to the published image;
extracting a second set of feature descriptors based on the updated image;
storing a second set of feature descriptors in a second matrix corresponding to the updated image; and
comparing the first matrix and the second matrix;
wherein generating the image similarity score is based on a comparison of the first matrix and the second matrix.
19. The method of any of examples 14-18, wherein:
generating the one or more exception scores comprises generating a numerical deviation score based on a difference between an updated numerical value included in the updated web document and an average of numerical values associated with the web document; and
the one or more anomaly scores include the numerical deviation score.
20. A non-transitory machine-readable storage medium embodying instructions that, when executed by a machine, cause the machine to perform operations comprising:
publishing a web document comprising a plurality of different elements generated using data received from a user's computing device;
accessing an updated web document based on one or more modifications made by a user to the published web document using an interface presented on a user's computing device, the updated web document including at least one user-generated modification to an element of the plurality of different elements of the published web document;
generating one or more exception scores based on a comparison of the updated web document and the published web document, the one or more exception scores providing a measure of deviation of the updated web document from the published web document; and
determining whether to allow publication of the updated web document based on a comparison of the anomaly score to a threshold anomaly score.
21. A machine-readable medium carrying machine-readable instructions which, when executed by at least one processor of a machine, cause the machine to carry out the method of any one of examples 14 to 19.
Module, component, and logic
Certain embodiments are described herein as comprising logic or multiple components, modules, or mechanisms. The modules may constitute software modules (e.g., code embodied on a machine-readable medium or in a transmission signal) or hardware modules. A hardware module is a tangible unit capable of performing certain operations and may be configured or arranged in a particular manner. In example embodiments, one or more computer systems (e.g., a standalone computer system, a client computer system, or a server computer system) or one or more hardware modules (e.g., processors or groups of processors) of a computer system may be configured by software (e.g., an application or application portion) as a hardware module that operates to perform certain operations described herein.
In various embodiments, the hardware modules may be implemented mechanically or electronically. For example, a hardware module may comprise dedicated circuitry or logic that is permanently configured to perform certain operations (e.g., a hardware module may be a special-purpose processor such as a Field Programmable Gate Array (FPGA) or an Application Specific Integrated Circuit (ASIC)). A hardware module may also comprise programmable logic or circuitry that is temporarily configured by software to perform certain operations (e.g., programmable logic or circuitry contained in a general-purpose processor or other programmable processor). It should be understood that the decision to implement a hardware module mechanically, in dedicated and permanently configured circuitry, or in temporarily configured circuitry (e.g., configured by software) may be driven by cost and time considerations.
Thus, the phrase "hardware module" should be understood to encompass a tangible entity, be it a physically-constructed, permanently-configured (e.g., hardwired), or temporarily-configured (e.g., programmed) entity to operate in a specific manner or to perform a specific operation described herein. Considering embodiments in which the hardware modules are temporarily configured (e.g., programmed), each of the hardware modules need not be configured or instantiated at any one time. For example, where the hardware modules include a general-purpose processor configured by software, the general-purpose processor may be configured as respective different hardware modules at different times. Thus, software may configure a processor, for example, to constitute a particular hardware module at one time and a different hardware module at another time.
A hardware module may provide information to and receive information from other hardware modules. Thus, the described hardware modules may be viewed as communicatively coupled. In the case of a plurality of such hardware modules being present simultaneously, communication may be effected by signal transmission (e.g. over appropriate circuits and buses connecting the hardware modules). In embodiments where multiple hardware modules are configured or instantiated at different times, communication between such hardware modules may be accomplished, for example, by storing and retrieving information in a memory structure accessible to the multiple hardware modules. For example, one hardware module may perform an operation and store the output of the operation in a storage device communicatively coupled thereto. Another hardware module may then later access the memory device to retrieve and process the stored output. The hardware modules may also initiate communication with input or output devices and may be capable of operating on resources (e.g., collections of information).
Various operations of the example methods described herein may be performed, at least in part, by one or more processors that are temporarily configured (e.g., via software) or permanently configured to perform the relevant operations. Whether temporarily or permanently configured, such a processor may constitute a processor-implemented module that operates to perform one or more operations or functions. In some example embodiments, "module" as used herein includes a processor-implemented module.
Similarly, the methods described herein may be at least partially processor-implemented. For example, at least some of the operations of a method may be performed by one or more processors or processor-implemented modules. The performance of certain operations may be distributed among the one or more processors, not only residing within a single machine but deployed across a number of machines. In some example embodiments, the one or more processors or processor-implemented modules may be located at a single site (e.g., in a home environment, an office environment, or a server farm), while in other embodiments the processors may be distributed across multiple sites.
The one or more processors may also be operable to support execution of related operations in a "cloud computing" environment or as "software as a service" (SaaS). For example, at least some of the operations may be performed by a group of computers (e.g., machines including processors) that are accessible via a network (e.g., the internet) and via one or more appropriate interfaces (e.g., APIs).
Electronic device and system
Example embodiments may be implemented in digital electronic circuitry, or in computer hardware, firmware, software, or in combinations of them. Example embodiments may be implemented using a computer program product, such as a computer program tangibly embodied in an information carrier, such as a machine-readable medium for execution by, or to control the operation of, data processing apparatus, such as a programmable processor, a computer, or multiple computers.
A computer program may be written in any form of programming language, including compiled or interpreted languages, and it may be deployed in any form, including as a stand-alone program or as a module, subroutine, or other unit suitable for use in a computing environment. A computer program can be deployed to be executed on one computer or on multiple computers at one site, or distributed across multiple sites and interconnected by the communication network 104.
In an example embodiment, the operations may be performed by one or more programmable processors executing a computer program to perform functions by operating on input data and generating output. Method operations may also be performed by, and apparatus of example embodiments may be implemented as, special purpose logic circuitry, e.g., an FPGA or an ASIC.
The computing system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network 104. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. In embodiments using a programmable computing system, it will be clear that both hardware and software architectures need to be considered. In particular, it will be appreciated that implementing particular functions in permanently configured hardware (e.g., an ASIC), in temporarily configured hardware (e.g., a combination of software and a programmable processor), or in a combination of permanently and temporarily configured hardware may be a design choice.
Language(s)
Although embodiments of the present disclosure have been described with reference to specific example embodiments, it will be evident that various modifications and changes may be made to these embodiments without departing from the broader scope of the disclosed subject matter. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense. The accompanying drawings that form a part hereof, and in which are shown by way of illustration, and not of limitation, specific embodiments in which the subject matter may be practiced. The embodiments illustrated are described in sufficient detail to enable those skilled in the art to practice the teachings disclosed herein. Other embodiments may be utilized and derived therefrom, such that structural and logical substitutions and changes may be made without departing from the scope of this disclosure. This detailed description, therefore, is not to be taken in a limiting sense, and the scope of various embodiments is defined only by the appended claims, along with the full range of equivalents to which such claims are entitled.
Such embodiments of the inventive subject matter are referred to, individually and/or collectively, by the term "invention" merely for convenience and without intending to voluntarily limit the scope of this application to any single invention or inventive concept if more than one is in fact disclosed. Thus, although specific embodiments have been illustrated and described herein, it should be appreciated that any arrangement calculated to achieve the same purpose may be substituted for the specific embodiments shown. This disclosure is intended to cover any and all adaptations or variations of various embodiments. Combinations of the above embodiments, and other embodiments not specifically described herein, will be apparent to those of skill in the art upon reviewing the above description.
All publications, patents, and patent documents mentioned herein are individually incorporated by reference in their entirety. In the event of inconsistent usage between this document and the documents incorporated by reference, the usage in the incorporated references should be considered supplementary to that of this document; for irreconcilable inconsistencies, the usage in this document controls.
In this document, the terms "a" or "an" are used, as is common in patent documents, to include one or more than one, independent of any other instances or usages of "at least one" or "one or more." In this document, unless otherwise indicated, the term "or" is used to refer to a non-exclusive or, such that "A or B" includes "A but not B," "B but not A," and "A and B." In the appended claims, the terms "including" and "in which" are used as the plain-English equivalents of the respective terms "comprising" and "wherein." In addition, in the following claims, the terms "including" and "comprising" are open-ended; that is, a system, device, article, or process that includes elements in addition to those listed after such a term in a claim will still be considered to fall within the scope of that claim.

Claims (19)

1. An anomaly detection system for revisions to a web document, comprising:
one or more processors;
a computer-readable medium storing instructions that, when executed by the one or more processors, cause the system to perform operations comprising:
accessing a published web document, the published web document assigned a uniform resource identifier, URI, and comprising a plurality of different elements generated using data received from a user's computing device;
accessing an updated web document having the same URI as the published web document and based on one or more modifications made by a user to the published web document using an interface presented on a user's computing device, the updated web document including at least one user-generated modification to an element of the plurality of different elements of the published web document;
generating one or more exception scores based on a comparison of the updated web document and the published web document, the one or more exception scores providing a measure of deviation of the updated web document from the published web document; and
determining whether to allow publication of the updated web document based on a comparison of the anomaly score to a threshold anomaly score, the updated web document being prevented from being published in response to the anomaly score crossing the threshold anomaly score; and in response to the anomaly score not crossing the threshold anomaly score, publishing the updated web document.
2. The system of claim 1, wherein the plurality of different elements of the published web document comprise text, images, and numerical values.
3. The system of claim 1, wherein generating the one or more anomaly scores comprises:
performing a comparison of updated text included in the updated web document with published text included in the published web document; and
generating a text match score based on the comparison, the text match score providing a measure of similarity between the updated text and the published text,
wherein the one or more anomaly scores comprise the text match score.
4. The system of claim 3, wherein performing the comparison includes using a cosine similarity algorithm to determine a similarity between the updated text and the published text.
5. The system of claim 1, wherein generating the one or more anomaly scores comprises:
performing a comparison of a published image included in the published web document and an updated image included in the updated web document; and
generating an image similarity score based on the comparison, the image similarity score providing a measure of similarity between the published image and the updated image,
wherein the one or more anomaly scores comprise the image similarity score.
6. The system of claim 5, wherein performing the comparison comprises:
extracting a first set of feature descriptors based on the published image;
storing a first set of feature descriptors in a first matrix corresponding to the published image;
extracting a second set of feature descriptors based on the updated image;
storing a second set of feature descriptors in a second matrix corresponding to the updated image; and
comparing the first matrix and the second matrix;
wherein generating the image similarity score is based on a comparison of the first matrix and the second matrix.
7. The system of claim 1, wherein generating the one or more anomaly scores comprises:
generating a numerical deviation score based on a difference between an updated numerical value included in the updated web document and an average of numerical values associated with the web document; and
the one or more anomaly scores include the numerical deviation score.
8. The system of claim 1, wherein generating the one or more anomaly scores comprises:
generating a text match score based on a comparison of updated text included in the updated web document and published text included in the published web document;
generating an image similarity score based on a comparison of an updated image included in the updated web document and a published image included in the published web document;
generating a numerical deviation score based on a difference between an updated numerical value included in the updated web document and an average of numerical values associated with the web document; and
aggregating the text matching score, the image similarity score, and the numerical deviation score to generate the anomaly score.
9. The system of claim 1, further comprising:
sending the updated web document to an administrator's device for manual review by the administrator; and
refining the one or more threshold anomaly scores based on results of manual review by an administrator.
10. The system of claim 1, further comprising: receiving approval for the updated web document from the managing computer system as a result of the manual review,
wherein improving the threshold anomaly score comprises increasing the threshold anomaly score based on approval of the updated web document as a result of a manual review.
11. The system of claim 1, further comprising:
generating a message indicating that the updated web document is anomalous; and
sending the message to a client device of a user responsible for providing the updated web document.
12. An anomaly detection method for revising a web document, comprising:
accessing a published web document, the published web document assigned a uniform resource identifier, URI, and comprising a plurality of different elements generated using data received from a user's computing device;
accessing an updated web document having the same URI as the published web document and based on one or more modifications made by a user to the published web document using an interface presented on a user's computing device, the updated web document including at least one user-generated modification to an element of the plurality of different elements of the published web document;
generating, using one or more processors, one or more exception scores based on a comparison of the updated web document and the published web document, the one or more exception scores providing a measure of deviation of the updated web document from the published web document; and
determining whether to allow publication of the updated web document based on a comparison of the anomaly score to a threshold anomaly score, the updated web document being prevented from being published in response to the anomaly score crossing the threshold anomaly score; and in response to the anomaly score not crossing the threshold anomaly score, publishing the updated web document.
13. The method of claim 12, wherein generating the one or more anomaly scores comprises:
generating a text match score based on a comparison of updated text included in the updated web document and published text included in the published web document;
generating an image similarity score based on a comparison of an updated image included in the updated web document and a published image included in the published web document;
generating a deviation score based on a difference between an updated numerical value included in the updated web document and an average of numerical values associated with the web document; and
aggregating the text matching score, the image similarity score, and the deviation score to generate the anomaly score.
14. The method of claim 12, wherein generating the one or more anomaly scores comprises:
performing a comparison of updated text included in the updated web document and published text included in the published web document; and
generating a text match score based on the comparison, the text match score providing a measure of similarity between the updated text and the published text,
wherein the one or more anomaly scores comprise the text match score.
15. The method of claim 12, wherein generating the one or more anomaly scores comprises:
performing a comparison of a published image included in the published web document and an updated image included in the updated web document; and
generating an image similarity score based on the comparison, the image similarity score providing a measure of similarity between the published image and the updated image,
wherein the one or more anomaly scores comprise the image similarity score.
16. The method of claim 15, wherein performing the comparison comprises:
extracting a first set of feature descriptors based on the published image;
storing a first set of feature descriptors in a first matrix corresponding to the published image;
extracting a second set of feature descriptors based on the updated image;
storing a second set of feature descriptors in a second matrix corresponding to the updated image; and
comparing the first matrix and the second matrix;
wherein generating the image similarity score is based on a comparison of the first matrix and the second matrix.
17. The method of claim 12, wherein:
generating the one or more exception scores comprises generating a numerical deviation score based on a difference between an updated numerical value included in the updated web document and an average of numerical values associated with the web document; and
the one or more anomaly scores include the numerical deviation score.
18. A non-transitory machine-readable storage medium embodying instructions that, when executed by a machine, cause the machine to perform operations comprising:
publishing a web document assigned with a uniform resource identifier, URI, and comprising a plurality of different elements generated using data received from a computing device of a user;
accessing an updated web document having the same URI as the published web document and based on one or more modifications made by a user to the published web document using an interface presented on the user's computing device, the updated web document including at least one user-generated modification to an element of the plurality of different elements of the published web document;
generating one or more exception scores based on a comparison of the updated web document and the published web document, the one or more exception scores providing a measure of deviation of the updated web document from the published web document; and
determining whether to allow publication of the updated web document based on a comparison of the anomaly score to a threshold anomaly score, the updated web document being prevented from being published in response to the anomaly score crossing the threshold anomaly score; and in response to the anomaly score not crossing the threshold anomaly score, publishing the updated web document.
19. A machine-readable medium carrying machine-readable instructions which, when executed by at least one processor of a machine, cause the machine to carry out the method of any one of claims 12 to 17.
CN201780038502.5A 2016-06-21 2017-06-21 Anomaly detection for revisions to web documents Active CN109313659B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210713893.9A CN115238207A (en) 2016-06-21 2017-06-21 Anomaly detection for revisions to web documents

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US15/188,532 US10218728B2 (en) 2016-06-21 2016-06-21 Anomaly detection for web document revision
US15/188,532 2016-06-21
PCT/US2017/038593 WO2017223230A1 (en) 2016-06-21 2017-06-21 Anomaly detection for web document revision

Related Child Applications (1)

Application Number Title Priority Date Filing Date
CN202210713893.9A Division CN115238207A (en) 2016-06-21 2017-06-21 Anomaly detection for revisions to web documents

Publications (2)

Publication Number Publication Date
CN109313659A CN109313659A (en) 2019-02-05
CN109313659B true CN109313659B (en) 2022-07-12

Family

ID=60660569

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202210713893.9A Pending CN115238207A (en) 2016-06-21 2017-06-21 Anomaly detection for revisions to web documents
CN201780038502.5A Active CN109313659B (en) 2016-06-21 2017-06-21 Anomaly detection for revisions to web documents

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN202210713893.9A Pending CN115238207A (en) 2016-06-21 2017-06-21 Anomaly detection for revisions to web documents

Country Status (6)

Country Link
US (2) US10218728B2 (en)
EP (1) EP3472696A4 (en)
KR (2) KR102206589B1 (en)
CN (2) CN115238207A (en)
AU (1) AU2017281628B2 (en)
WO (1) WO2017223230A1 (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
RU2642411C2 (en) * 2016-04-04 2018-01-24 Общество С Ограниченной Ответственностью "Яндекс" Method for determining trend of indicator of degree of user involvement
US10218728B2 (en) 2016-06-21 2019-02-26 Ebay Inc. Anomaly detection for web document revision
US10635557B2 (en) * 2017-02-21 2020-04-28 E.S.I. Software Ltd System and method for automated detection of anomalies in the values of configuration item parameters
US10521224B2 (en) * 2018-02-28 2019-12-31 Fujitsu Limited Automatic identification of relevant software projects for cross project learning
US20200019583A1 (en) * 2018-07-11 2020-01-16 University Of Southern California Systems and methods for automated repair of webpages
US11200313B2 (en) * 2019-03-18 2021-12-14 Visa International Service Association Defense mechanism against component wise hill climbing using synthetic face generators
US11403682B2 (en) * 2019-05-30 2022-08-02 Walmart Apollo, Llc Methods and apparatus for anomaly detections
CN114503105A (en) * 2019-09-25 2022-05-13 联邦科学和工业研究组织 Password service for browser applications
CN111444244B (en) * 2020-03-31 2022-06-03 温州大学 Big data information management system
JP2021197099A (en) * 2020-06-18 2021-12-27 富士フイルムビジネスイノベーション株式会社 Information processing apparatus and program
US11501013B1 (en) * 2021-07-09 2022-11-15 Sotero, Inc. Autonomous machine learning methods for detecting and thwarting malicious database access

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101043567A (en) * 2006-03-23 2007-09-26 佳能株式会社 Document management apparatus, document management system, control method of the apparatus and system, program, and storage medium
CN101292252A (en) * 2005-10-18 2008-10-22 松下电器产业株式会社 Information processing device, and method therefor
CN102446211A (en) * 2011-09-06 2012-05-09 中国科学院自动化研究所 Method and system for filing and verifying image
CN103106585A (en) * 2011-11-11 2013-05-15 阿里巴巴集团控股有限公司 Real-time duplication eliminating method and device of product information
CN103858164A (en) * 2011-10-04 2014-06-11 汤姆逊许可公司 Method of Automatic Management of a Collection of Images and Corresponding Device
CN104715374A (en) * 2013-12-11 2015-06-17 世纪禾光科技发展(北京)有限公司 Method and system for governing repetition products of e-commerce platform
CN105589879A (en) * 2014-10-23 2016-05-18 阿里巴巴集团控股有限公司 Method and apparatus for downloading picture by client

Family Cites Families (36)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6012087A (en) 1997-01-14 2000-01-04 Netmind Technologies, Inc. Unique-change detection of dynamic web pages using history tables of signatures
US5898836A (en) 1997-01-14 1999-04-27 Netmind Services, Inc. Change-detection tool indicating degree and location of change of internet documents by comparison of cyclic-redundancy-check(CRC) signatures
US20030079178A1 (en) * 1999-04-01 2003-04-24 James R. H. Challenger Method and system for efficiently constructing and consistently publishing web documents
US20050028080A1 (en) * 1999-04-01 2005-02-03 Challenger James R.H. Method and system for publishing dynamic Web documents
US6654743B1 (en) 2000-11-13 2003-11-25 Xerox Corporation Robust clustering of web documents
US7051328B2 (en) 2001-01-26 2006-05-23 Xerox Corporation Production server architecture and methods for automated control of production document management
US20040205454A1 (en) 2001-08-28 2004-10-14 Simon Gansky System, method and computer program product for creating a description for a document of a remote network data source for later identification of the document and identifying the document utilizing a description
US7240077B1 (en) 2002-12-30 2007-07-03 Amazon.Com, Inc. Web site content change management
US20050210008A1 (en) * 2004-03-18 2005-09-22 Bao Tran Systems and methods for analyzing documents over a network
US7559016B1 (en) 2004-10-07 2009-07-07 Google Inc. System and method for indicating web page modifications
US20080015968A1 (en) * 2005-10-14 2008-01-17 Leviathan Entertainment, Llc Fee-Based Priority Queuing for Insurance Claim Processing
US20090228777A1 (en) * 2007-08-17 2009-09-10 Accupatent, Inc. System and Method for Search
US8886660B2 (en) * 2008-02-07 2014-11-11 Siemens Enterprise Communications Gmbh & Co. Kg Method and apparatus for tracking a change in a collection of web documents
US8121989B1 (en) 2008-03-07 2012-02-21 Google Inc. Determining differences between documents
US9058378B2 (en) * 2008-04-11 2015-06-16 Ebay Inc. System and method for identification of near duplicate user-generated content
US9330191B2 (en) 2009-06-15 2016-05-03 Microsoft Technology Licensing, Llc Identifying changes for online documents
US8874555B1 (en) 2009-11-20 2014-10-28 Google Inc. Modifying scoring data based on historical changes
US20110191328A1 (en) * 2010-02-03 2011-08-04 Vernon Todd H System and method for extracting representative media content from an online document
US20110246330A1 (en) * 2010-04-01 2011-10-06 Anup Tikku System and method for searching content
US9658997B2 (en) * 2010-08-03 2017-05-23 Adobe Systems Incorporated Portable page template
US8745733B2 (en) * 2011-02-16 2014-06-03 F-Secure Corporation Web content ratings
US20120221479A1 (en) 2011-02-25 2012-08-30 Schneck Iii Philip W Web site, system and method for publishing authenticated reviews
US8381095B1 (en) 2011-11-07 2013-02-19 International Business Machines Corporation Automated document revision markup and change control
CN103530299B (en) * 2012-07-05 2017-04-12 阿里巴巴集团控股有限公司 Search result generating method and device
EP2875471B1 (en) * 2012-07-23 2021-10-27 Apple Inc. Method of providing image feature descriptors
US20140047413A1 (en) * 2012-08-09 2014-02-13 Modit, Inc. Developing, Modifying, and Using Applications
US20140195518A1 (en) * 2013-01-04 2014-07-10 Opera Solutions, Llc System and Method for Data Mining Using Domain-Level Context
US9063949B2 (en) 2013-03-13 2015-06-23 Dropbox, Inc. Inferring a sequence of editing operations to facilitate merging versions of a shared document
US10489842B2 (en) * 2013-09-30 2019-11-26 Ebay Inc. Large-scale recommendations for a dynamic inventory
US9569728B2 (en) 2014-11-14 2017-02-14 Bublup Technologies, Inc. Deriving semantic relationships based on empirical organization of content by users
US20160217157A1 (en) * 2015-01-23 2016-07-28 Ebay Inc. Recognition of items depicted in images
US10909468B2 (en) * 2015-02-27 2021-02-02 Verizon Media Inc. Large-scale anomaly detection with relative density-ratio estimation
AU2016202659A1 (en) * 2015-04-28 2016-11-17 Red Marker Pty Ltd Device, process and system for risk mitigation
US10282424B2 (en) * 2015-05-19 2019-05-07 Researchgate Gmbh Linking documents using citations
US10580064B2 (en) * 2015-12-31 2020-03-03 Ebay Inc. User interface for identifying top attributes
US10218728B2 (en) 2016-06-21 2019-02-26 Ebay Inc. Anomaly detection for web document revision

Also Published As

Publication number Publication date
EP3472696A4 (en) 2020-02-12
CN109313659A (en) 2019-02-05
US20190182282A1 (en) 2019-06-13
KR102274928B1 (en) 2021-07-09
AU2017281628A1 (en) 2018-12-20
US10218728B2 (en) 2019-02-26
WO2017223230A1 (en) 2017-12-28
EP3472696A1 (en) 2019-04-24
KR102206589B1 (en) 2021-01-22
US20170366568A1 (en) 2017-12-21
US10944774B2 (en) 2021-03-09
CN115238207A (en) 2022-10-25
KR20210009444A (en) 2021-01-26
AU2017281628B2 (en) 2019-10-03
KR20190020120A (en) 2019-02-27

Similar Documents

Publication Publication Date Title
CN109313659B (en) Anomaly detection for revisions to web documents
US11836776B2 (en) Detecting cross-lingual comparable listings
US10803345B2 (en) Determining an item that has confirmed characteristics
US11301510B2 (en) Obtaining item listings matching a distinguishing style of an image selected in a user interface
US11301511B2 (en) Projecting visual aspects into a vector space
US10360621B2 (en) Near-identical multi-faceted entity identification in search
CN110622153A (en) Method and system for query partitioning
CN112825180A (en) Validated video commentary
CN109076097B (en) System and method for delegated content processing
CN108292318B (en) System and method for generating target page

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant