US20230409649A1 - Systems and methods for categorizing domains using artificial intelligence - Google Patents
- Publication number
- US20230409649A1 US17/845,249
- Authority
- US
- United States
- Prior art keywords
- categories
- webpage
- webpages
- domain
- category
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/09—Supervised learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/958—Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/10—Network architectures or network communication protocols for network security for controlling access to devices or network resources
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
Definitions
- a user may desire to categorize domains for use by a search engine or other index.
- a user may wish to prevent workers or family members from accessing webpages associated with pornography, gambling, or other controversial categories.
- a set of labeled training data that includes indicators of webpages is received.
- Each indicated webpage is labeled with one or more categories that were determined for the webpage by a human reviewer.
- Features, such as text and scripts, are extracted from each indicated webpage, and are used along with the labels to train a classifier to predict one or more categories for a webpage based on the features of the webpage.
- the trained classifier may be used to associate one or more categories with each domain of a plurality of domains given the categories predicted for some or all of the webpages associated with the domain.
- a list of domains and associated categories may be used for a variety of purposes including search engine optimization and content filtering. When a new domain is created, the trained classifier may be used to quickly and automatically associate one or more categories with the new domain, and the new domain and categories can be added to the list of domains and associated categories.
- a method for training a classifier includes: receiving a training set of webpages by a computing device, wherein each webpage in the training set is associated with one or more categories of a first plurality of categories; for each webpage of the training set of webpages, extracting one or more features from the webpage by the computing device; and for each webpage of the training set of webpages, training a classifier using the one or more extracted features and the one or more categories associated with the webpage by the computing device.
- Embodiments may include some or all of the following features.
- the method may further include: reducing the first plurality of categories to a second plurality of categories; and associating each webpage of the set of webpages with one or more categories of the second plurality of categories based on the one or more categories of the first plurality of categories that are associated with each webpage.
- the one or more features may include text features and script features.
- the method may further include: for each domain of a plurality of domains: retrieving a set of webpages from the domain by the computing device; for each webpage of the set of webpages: extracting one or more features from the webpage of the set of webpages by the computing device; and associating one or more categories of the first plurality of categories with the webpage using the classifier and the one or more features extracted from the webpage by the computing device.
- the method may further include for each domain of the plurality of domains, associating one or more categories of the first plurality of categories with the domain based on the one or more categories associated with each webpage of the set of webpages from the domain.
- Associating one or more categories of the first plurality of categories with the domain based on the one or more categories associated with each webpage of the set of webpages from the domain may include: determining each category associated with more than a threshold percentage of webpages of the set of webpages; and associating the determined categories with the domain. Each category is associated with a different threshold percentage.
- a method for associating categories with domains includes: receiving a list of domains by a computing device; receiving a plurality of categories by the computing device; receiving a classifier by the computing device; for each domain of the list of domains: retrieving a set of webpages from the domain by the computing device; for each webpage of the set of webpages: extracting one or more features from the webpage of the set of webpages by the computing device; associating one or more categories of the plurality of categories with the webpage using the classifier and the one or more features extracted from the webpage by the computing device; and associating one or more categories of the plurality of categories with the domain based on the one or more categories associated with each webpage of the set of webpages from the domain by the computing device.
- Embodiments may have some or all of the following features.
- the one or more features may include text features and script features.
- the method may further include: receiving indications of a training set of webpages by the computing device, wherein each webpage in the training set is associated with one or more categories of the plurality of categories; for each webpage of the training set of webpages, extracting one or more features from the webpage by the computing device; and for each webpage of the training set of webpages, training the classifier using the one or more extracted features and the one or more categories associated with the webpage by the computing device.
- the method may further include using the list of domains and associated one or more categories to control user access to the set of webpages associated with each domain of the list of domains.
- a method for categorizing new domains using artificial intelligence includes: receiving a list of domains by the computing device, wherein each domain in the list of domains was associated with a category of the plurality of categories by a classifier; receiving an indication of a new domain by the computing device, wherein the new domain is not in the list of domains; in response to the indication, retrieving at least one webpage from the new domain by the computing device; extracting one or more features from the at least one webpage by the computing device; associating a category of the plurality of categories with the new domain using the classifier and the extracted one or more features by the computing device; and adding the new domain and the associated category to the list of domains by the computing device.
- Embodiments may include some or all of the following features.
- the one or more features may include text features and script features.
- the classifier may be a neural network.
- Associating the category of the plurality of categories with the new domain using the classifier and the extracted one or more features may include: determining the category associated with the at least one webpage using the extracted one or more features and the classifier; and associating the determined category with the domain.
- the method may further include: receiving indications of a training set of webpages by the computing device, wherein each webpage in the training set is associated with one or more categories of the plurality of categories; for each webpage of the training set of webpages, extracting one or more features from the webpage by the computing device; and for each webpage of the training set of webpages, training the classifier using the one or more extracted features and the one or more categories associated with the webpage by the computing device.
- the method may further include using the list of domains and associated one or more categories to control user access to webpages associated with the domains in the list of domains.
- the method may further include: receiving one or more access rules; and controlling user access to the webpages associated with the domains in the list of domains according to the received one or more access rules.
- FIG. 1 is an example computing environment for training a classifier and for assigning categories to domains using the classifier;
- FIG. 2 is an example computing environment for controlling access to webpages and domains using access rules and a list of domains and categories;
- FIG. 3 is an illustration of an example method for training a classifier to determine one or more categories for webpages;
- FIG. 4 is an illustration of an example method for associating categories with domains;
- FIG. 5 is an illustration of an example method for controlling access to webpages for a user using access rules and domain categories;
- FIG. 6 is an illustration of an example method for controlling access for groups of users to webpages using access rules and domain categories;
- FIG. 7 is an illustration of an example method for associating categories with new domains; and
- FIG. 8 shows an exemplary computing environment in which example embodiments and aspects may be implemented.
- an artificial-intelligence-based classifier is trained to quickly and efficiently categorize domains based on one or more webpages associated with a domain.
- human reviewers are used to categorize a set of webpages extracted from a variety of domains.
- Features from the webpages and their associated categories are used to train the classifier.
- When an entity wants to determine a category for an existing or new domain, some number of webpages are extracted from the domain, and the classifier is used to categorize each extracted webpage without human reviewers. Some or all of the categories determined for the extracted webpages are then associated with the domain. In this way, new and existing domains can be quickly and efficiently categorized without the cost and time associated with human reviewers.
- FIG. 1 is an example of a cloud computing environment 100 for assigning categories to domains using a classifier.
- the environment 100 includes a classifier server 110 in communication with one or more domains 180 through a network 190 .
- the network 190 may include a combination of public and private networks.
- Each of the classifier server 110 and domains 180 may be implemented using one or more general purpose computing devices such as the computing device 800 illustrated with respect to FIG. 8 .
- the classifier server 110 may be implemented in a cloud-based computing environment.
- a domain 180 may represent a group of webpages 185 reachable in part using a common domain name.
- a domain 180 “foobaz.com” may include multiple webpages 185 such as “foobaz.com/home.html”, “foobaz.com/contact.html”, and “foobaz.com/FAQ.html”.
- Each of the webpages 185 is reachable through the internet using a URL that includes the domain name “foobaz.com”.
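- The relationship between webpages and their common domain name can be illustrated with a minimal Python sketch. The helper names `domain_of` and `group_by_domain` are hypothetical, and the sketch assumes that the host name of the URL is used as the domain name, without any special handling of subdomains.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def domain_of(url: str) -> str:
    """Return the lower-cased host name of a URL, with or without a scheme."""
    if "://" not in url:
        url = "//" + url  # lets urlsplit treat the leading part as the host
    host = urlsplit(url).hostname
    if host is None:
        raise ValueError(f"no host name in {url!r}")
    return host


def group_by_domain(urls):
    """Map each domain name to the list of webpage URLs that share it."""
    groups = defaultdict(list)
    for url in urls:
        groups[domain_of(url)].append(url)
    return dict(groups)


# Illustrative URLs only; "example.org" is an assumption, not from the text.
pages = [
    "foobaz.com/home.html",
    "https://foobaz.com/contact.html",
    "example.org/index.html",
]
grouped = group_by_domain(pages)
```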
- the classifier server 110 may generate what is referred to as a domain list 165.
- the domain list 165 may be a list of domains 180 along with associated categories 127 .
- a category 127 may be a topic or subject that is commonly associated with the webpages 185 of the domain 180 .
- Example categories 127 may include controversial topics such as “pornography”, “gambling”, or “violence” and more general topics such as “news”, “sports”, and “music.”
- the categories 127 may relate to topics or subjects that an entity, such as a corporation or a family, would like to prevent or restrict associated users from viewing or accessing.
- the particular categories 127 considered by the classifier server 110 may be selected by a user or administrator.
- the classifier server 110 includes several components including, but not limited to, a category engine 120 , an extraction engine 130 , a training engine 140 , and a domain engine 160 . More or fewer components may be supported. Each of the components may be implemented together or separately using one or more general purpose computing devices such as the computing device 800 illustrated with respect to FIG. 8 .
- the classifier server 110 may receive training data 125 .
- the training data 125 may be labeled and may include identifiers of webpages 185 , and each identified webpage 185 may be labeled with one or more categories. Depending on the embodiment, each identified webpage 185 may have been labeled with a category by a human reviewer.
- the category engine 120 may receive the categories 127 that will be used in the domain list 165 and may optionally adjust or simplify the labels used in the training data 125 to conform to the received categories 127 .
- the received training data 125 may be labeled with gambling related categories such as “casino gambling” and “sports betting.”
- the categories 127 may only include a single category 127 for all gambling related categories 127 . Accordingly, the category engine 120 may replace all gambling related labels in the training data 125 with the category 127 of “gambling.”
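- The label simplification performed by the category engine 120 might be sketched as follows. The mapping table, the function name `reduce_labels`, and the sample training data are hypothetical; the sketch assumes that labels without a mapping entry are kept unchanged.

```python
# Hypothetical mapping from fine-grained reviewer labels to the categories
# used in the domain list; labels not listed are kept unchanged.
LABEL_TO_CATEGORY = {
    "casino gambling": "gambling",
    "sports betting": "gambling",
}


def reduce_labels(labels, mapping=LABEL_TO_CATEGORY):
    """Replace each label with its target category, dropping duplicates."""
    reduced = []
    for label in labels:
        category = mapping.get(label, label)
        if category not in reduced:
            reduced.append(category)
    return reduced


# Illustrative training labels keyed by webpage identifier (made-up data).
training_data = {
    "foobaz.com/home.html": ["casino gambling", "sports betting"],
    "foobaz.com/news.html": ["news", "sports betting"],
}
conformed = {url: reduce_labels(labels) for url, labels in training_data.items()}
```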
- the extraction engine 130 may extract features 135 from some or all of the webpages 185 identified in the training data 125 .
- the extracted features 135 may include text features and script features. Text features may include words and phrases, as well as certain combinations of words and phrases, that appear in a webpage 185. Script features may include all or portions of scripts, such as JavaScript scripts, that are found in a webpage 185. Other types of features 135 that may be extracted include image and video features. Any method for extracting features 135 from a webpage 185 may be used.
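- Because any extraction method may be used, the following is only one possible sketch of text and script feature extraction, written with Python's standard `html.parser` module. The class `FeatureParser`, the function `extract_features`, the choice of word pairs as "phrases", and the `src:` prefix for external scripts are all assumptions made for illustration.

```python
import re
from html.parser import HTMLParser


class FeatureParser(HTMLParser):
    """Collect visible text and script references/bodies from one HTML document."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.scripts = []
        self._in_script = False
        self._in_style = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            src = dict(attrs).get("src")
            if src:
                self.scripts.append(f"src:{src}")  # external script reference
        elif tag == "style":
            self._in_style = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False
        elif tag == "style":
            self._in_style = False

    def handle_data(self, data):
        if self._in_script:
            if data.strip():
                self.scripts.append(data.strip())  # inline script body
        elif not self._in_style:
            self.text_parts.append(data)


def extract_features(html: str) -> dict:
    """Return word, word-pair, and script features for one webpage."""
    parser = FeatureParser()
    parser.feed(html)
    parser.close()
    words = re.findall(r"[a-z0-9']+", " ".join(parser.text_parts).lower())
    phrases = [f"{a} {b}" for a, b in zip(words, words[1:])]
    return {"words": words, "phrases": phrases, "scripts": parser.scripts}


# Made-up page used only to exercise the sketch.
sample = (
    '<html><head><script src="/js/slots.js"></script>'
    "<script>var bet = 1;</script></head>"
    "<body><h1>Play Poker</h1><p>Win big today</p></body></html>"
)
features = extract_features(sample)
```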
- the training engine 140 may use some or all of the extracted features 135 for each identified webpage 185 in the training data 125 , along with the associated category labels, to train a classifier 155 .
- the classifier 155 may be an artificial intelligence classifier 155 or model that receives as an input features 135 extracted from a webpage 185 , and outputs one or more categories 127 that are likely to be associated with the webpage 185 .
- the classifier 155 may be a convolutional neural network. However, other types of classifiers and/or neural networks may be used such as shallow neural networks, deep neural networks, and recurrent neural networks.
- the training engine 140 may train the classifier 155 using a first portion of the training data 125 , and then may test the classifier 155 using a second portion of the training data 125 .
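- The train-then-test procedure can be sketched in Python as below. The text describes a convolutional neural network; to stay self-contained, this sketch swaps in a much simpler single-label multinomial Naive Bayes model over word features. The names `split`, `NaiveBayes`, and `accuracy`, the 80/20 split, and the toy examples are assumptions and are not taken from the described system.

```python
import math
import random
from collections import Counter, defaultdict


def split(examples, train_fraction=0.8, seed=0):
    """Shuffle labeled examples and split them into training and test portions."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


class NaiveBayes:
    """Single-label multinomial Naive Bayes over words (stand-in for the CNN)."""

    def fit(self, examples):
        self.word_counts = defaultdict(Counter)
        self.class_counts = Counter()
        self.vocab = set()
        for words, category in examples:
            self.class_counts[category] += 1
            self.word_counts[category].update(words)
            self.vocab.update(words)
        return self

    def predict(self, words):
        total = sum(self.class_counts.values())
        best, best_score = None, -math.inf
        for category, n in self.class_counts.items():
            counts = self.word_counts[category]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n / total)
            for w in words:
                score += math.log((counts[w] + 1) / denom)  # add-one smoothing
            if score > best_score:
                best, best_score = category, score
        return best


def accuracy(model, examples):
    """Fraction of examples whose predicted category matches the label."""
    hits = sum(model.predict(words) == category for words, category in examples)
    return hits / len(examples) if examples else 0.0


# Toy labeled examples: (word features, category). Made-up data.
examples = [
    (["poker", "casino", "bet"], "gambling"),
    (["slots", "jackpot", "bet"], "gambling"),
    (["match", "score", "league"], "sports"),
    (["team", "score", "goal"], "sports"),
    (["album", "band", "song"], "music"),
]
train, test = split(examples)          # first portion trains, second portion tests
model = NaiveBayes().fit(train)
test_accuracy = accuracy(model, test)  # value depends on the shuffle
```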
- the domain engine 160 may use the classifier 155 to generate the domain list 165 .
- the domain engine 160 may generate the domain list 165 by first receiving a set of domains 180.
- the domain engine 160 may then, for each domain 180, use a crawler or other application to retrieve some or all of the webpages 185 associated with the domain 180.
- the domain engine 160 may then use the extraction engine 130 to extract features 135 from each of the webpages 185 associated with the domain 180 and may use the classifier 155 to determine or predict one or more categories 127 for each webpage 185 associated with the domain 180 .
- the domain engine 160 may associate each domain 180 with the most frequent or top categories 127 predicted by the classifier 155 for the webpages 185 associated with the domain 180 . These domains 180 and associated categories 127 may be used by the domain engine 160 to create the domain list 165 .
- the domain engine 160 may associate a domain 180 with a category 127 when the category 127 is predicted for a threshold percentage of the webpages 185 associated with the domain 180 by the classifier 155 .
- the threshold percentage may be specified by an administrator.
- the same threshold percentage may be used for all categories. In other embodiments, different threshold percentages may be used for different categories. For example, some controversial categories 127 such as “pornography” may have a lower threshold percentage than benign categories 127 such as “art” or “music”.
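- The threshold rule can be sketched in Python as follows. The function name `categorize_domain`, the default threshold of 50%, and the 10% threshold for “pornography” are hypothetical values chosen for illustration. The sketch also assumes that reaching the threshold exactly is sufficient.

```python
from collections import Counter

DEFAULT_THRESHOLD = 0.5                     # hypothetical default fraction of pages
CATEGORY_THRESHOLDS = {"pornography": 0.1}  # hypothetical lower threshold


def categorize_domain(page_categories,
                      thresholds=CATEGORY_THRESHOLDS,
                      default=DEFAULT_THRESHOLD):
    """Return the categories predicted for at least the threshold share of pages.

    page_categories holds one collection of predicted categories per webpage.
    """
    if not page_categories:
        return set()
    counts = Counter()
    for categories in page_categories:
        counts.update(set(categories))  # count each category once per page
    total = len(page_categories)
    return {c for c, n in counts.items() if n / total >= thresholds.get(c, default)}


# Made-up predictions for ten webpages of one domain.
predictions = [
    {"news"}, {"news", "pornography"}, {"sports", "pornography"}, {"news"},
    {"music"}, {"news"}, {"news"}, {"news"}, {"news"}, {"news"},
]
domain_categories = categorize_domain(predictions)
```

- In this made-up data, “news” is predicted for 8 of 10 webpages and “pornography” for 2 of 10, so both meet their thresholds, while “sports” and “music” (1 of 10 each) do not.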
- the domain engine 160 may be configured to determine new domains 180 , determine one or more categories 127 for the new domains 180 as described above, and to add the new domains 180 and determined one or more categories 127 to the domain list 165 .
- the domain engine 160 may determine new domains 180 from the WHOIS domains database. Other sources of newly added domains 180 may be used.
- the domain engine 160 may wait to assign categories 127 to new domains 180 until some threshold number of webpages 185 are published.
- the threshold number of webpages 185 may be set by an administrator.
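- Handling of new domains might look like the following Python sketch. The function `add_new_domain`, the `categorize` callable (standing in for the classifier-based categorization described above), the default minimum of five webpages, and the sample domain names are hypothetical.

```python
def add_new_domain(domain_list, domain, pages, categorize, min_pages=5):
    """Categorize `domain` and add it to `domain_list` once enough pages exist.

    Returns True if the domain was added, False if it was skipped.
    """
    if domain in domain_list:
        return False  # already categorized
    if len(pages) < min_pages:
        return False  # wait until more webpages are published
    domain_list[domain] = categorize(pages)
    return True


def stub_categorize(pages):
    """Placeholder for classifier-based categorization (assumption)."""
    return {"gambling"}


domain_list = {"foobaz.com": {"news"}}
added_early = add_new_domain(domain_list, "newsite.example", ["p1"],
                             stub_categorize, min_pages=3)
added_later = add_new_domain(domain_list, "newsite.example", ["p1", "p2", "p3"],
                             stub_categorize, min_pages=3)
```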
- FIG. 2 is an example computing environment 200 for controlling access to webpages and domains using access rules and a domain list.
- the environment 200 includes an access server 210 in communication with one or more domains 180 and user devices 205 through the network 190 .
- Each of the access server 210 , domain 180 , and user device 205 may be implemented using one or more general purpose computing devices such as the computing device 800 illustrated with respect to FIG. 8 .
- the access server 210 may control access to one or more webpages 185 for user devices 205 based on the domain list 165 described previously with respect to FIG. 1 and one or more access rules 227. As shown, the access server 210 may include several components including, but not limited to, a rule engine 220 and a request engine 230. More or fewer components may be supported.
- the rule engine 220 may allow for the creation of one or more access rules 227 that control which webpages 185 and/or domains 180 a user is allowed to access.
- an access rule 227 lists one or more categories 127 that a user is not allowed to view or visit using a corresponding user device 205 .
- an access rule 227 that includes the category 127 “video games” may indicate that a corresponding user is not allowed to visit webpages 185 that are associated with domains 180 that are associated with the category 127 “video games.”
- an access rule 227 may list the categories 127 that the user is allowed to view or visit, and all other categories 127 may be restricted for the user.
- the access rules 227 may apply at all times, or may apply only at certain times. For example, an access rule 227 for a user may prevent the user from viewing webpages 185 that are associated with domains 180 of the category 127 “social networking” between the working hours of 9 am and 5 pm.
- the rule engine 220 may provide a user interface through which administrators may create access rules 227 that apply to users associated with a particular entity such as a corporation or a family. The administrators may select the particular categories 127 for each access rule 227, as well as the particular users that the access rule 227 will apply to. Depending on the embodiment, the access rules 227 may apply to individual users or to groups of users. For example, an administrator of a company may wish to restrict access to domains 180 associated with the category “pornography” for all users of the company. As another example, an administrator of a home or family network may wish to restrict access to certain categories 127 for child users but not for adult users.
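- One possible in-memory representation of an access rule 227 is sketched below in Python. The class `AccessRule`, its field names, and the `"deny"`/`"allow"` modes are assumptions for illustration. The sketch covers deny lists, allow lists, optional time windows, and rules that apply to named users or groups.

```python
from dataclasses import dataclass
from datetime import time
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class AccessRule:
    categories: FrozenSet[str]
    mode: str = "deny"            # "deny": listed categories are forbidden;
                                  # "allow": only listed categories are permitted
    start: Optional[time] = None  # rule applies at all times when unset
    end: Optional[time] = None
    members: FrozenSet[str] = frozenset()  # user or group identifiers

    def active_at(self, now: time) -> bool:
        """True if the rule is in force at the given time of day."""
        if self.start is None or self.end is None:
            return True
        return self.start <= now < self.end

    def applies_to(self, user: str, groups=()) -> bool:
        """True if the rule names the user or one of the user's groups."""
        return user in self.members or any(g in self.members for g in groups)

    def forbids(self, domain_categories, now: time) -> bool:
        """True if a domain with these categories is off limits right now."""
        if not self.active_at(now):
            return False
        found = set(domain_categories)
        if self.mode == "deny":
            return bool(self.categories & found)
        return not found <= self.categories


# Social networking blocked for the "staff" group between 9 am and 5 pm.
work_rule = AccessRule(frozenset({"social networking"}),
                       start=time(9), end=time(17),
                       members=frozenset({"staff"}))
# Allow-list rule for a child user (hypothetical categories).
child_rule = AccessRule(frozenset({"education", "news"}), mode="allow",
                        members=frozenset({"child"}))
```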
- the request engine 230 may receive requests 206 for webpages 185 from user devices 205 and may either allow or deny the request 206 based on the particular access rules 227 that apply to the user associated with the user device 205 .
- the request 206 may be a Domain Name System (DNS) request made by the user device 205 in response to a user entering or selecting a URL using a browser application.
- the browser application of the user device 205 must first perform a domain name lookup, in which an IP address corresponding to the domain name of the URL is determined. The IP address can then be used to request a webpage 185.
- the request engine 230 (and access server 210 ) may function together with a DNS server that receives requests 206 from user devices 205 .
- the request engine 230 may first determine any access rules 227 that apply to the user of the user device 205 (either individually or as a group) and may determine any forbidden categories 127 that the user is not permitted to access.
- the request engine 230 may then use the domain list 165 to determine if the domain 180 associated with the request 206 is associated with any of the forbidden categories 127. If the request 206 is not associated with any of the forbidden categories 127, then the request engine 230 may pass the request 206 to a DNS server for further processing.
- If the request 206 is associated with one or more of the forbidden categories 127, the request engine 230 may block the request 206 and may optionally redirect the user device 205 to a webpage explaining why the request 206 was blocked.
- the request engine 230 may receive a request 206 from a user that is not associated with an access rule 227 . In such cases the request engine 230 may pass the request 206 to a DNS server for further processing.
- the request engine 230 may receive a request 206 for a webpage 185 associated with a domain 180 that is not in the domain list 165 .
- the request engine 230 may assume that the domain 180 is “safe” and may pass the request 206 to a DNS server for further processing.
- the request engine 230 may retrieve the webpage 185 associated with the request 206 , may extract the features 135 from the webpage 185 , and may use the classifier 155 and the extracted features 135 to predict one or more categories 127 for the webpage 185 . If any of the predicted one or more categories 127 are forbidden categories 127 for the user, the request 206 may be denied as described above.
- FIG. 3 is an illustration of an example method 300 for training a classifier to determine one or more categories for webpages.
- the method 300 may be implemented by the training engine 140 of the classifier server 110 .
- training data is received.
- the training data 125 may be received by the training engine 140 of the classifier server 110 .
- the training data 125 may be labeled and may include a set of indications of webpages 185 .
- Each indicated webpage 185 in the training set may be labeled with one or more categories 127 .
- features are extracted from each webpage indicated in the training data.
- the features 135 may be extracted from each webpage 185 indicated in the training data 125 by the extraction engine 130.
- the extracted features 135 may include text features and script features. Other types of features 135 may be extracted.
- a classifier is trained using the extracted features and categories associated with each webpage.
- the classifier 155 may be trained by the training engine 140 .
- the classifier 155 may receive as an input features 135 extracted from a webpage 185 and may output one or more categories 127 .
- FIG. 4 is an illustration of an example method 400 for associating categories with domains.
- the method 400 may be implemented by the domain engine 160 of the classifier server 110 .
- a list of domains is received.
- the list of domains may be received by the domain engine 160 .
- the list of domains 180 may include some or all of the domains 180 available on the internet, for example.
- a plurality of categories is received.
- the plurality of categories 127 may be received by the domain engine 160 from the category engine 120 .
- the categories 127 may be selected topics or subjects of webpages 185 and/or domains 180 that one or more entities may desire to restrict or prevent access to for their users or employees.
- a classifier is received.
- the classifier 155 may be received by the domain engine 160 from the training engine 140 .
- the classifier 155 may be a convolutional neural network trained to predict one or more categories 127 for a webpage 185 based on features 135 extracted from the webpage 185 .
- the set of webpages 185 for a domain 180 may be webpages 185 that are part of the domain 180 and may be retrieved by the domain engine 160 .
- the domain engine 160 may use a web crawler or other software tool to retrieve some or all of the webpages 185 available on a domain 180 .
- the domain engine 160 may select a random subset of the webpages 185 that are available at a domain 180 or may select the most popular webpages 185 .
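- The two selection strategies can be sketched in Python as follows. The function `select_pages`, its parameters, and the popularity scores are hypothetical; how popularity would be measured is not specified in the text.

```python
import random


def select_pages(urls, k, popularity=None, seed=None):
    """Pick up to k URLs: the most popular if scores are given, else a random sample."""
    urls = list(urls)
    if k >= len(urls):
        return urls
    if popularity is not None:
        # Pages without a score are treated as least popular (assumption).
        return sorted(urls, key=lambda u: popularity.get(u, 0), reverse=True)[:k]
    return random.Random(seed).sample(urls, k)


# Made-up page names and visit counts.
urls = ["a.html", "b.html", "c.html", "d.html"]
visits = {"a.html": 10, "c.html": 50, "d.html": 5}
top_two = select_pages(urls, 2, popularity=visits)
random_two = select_pages(urls, 2, seed=1)
```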
- each webpage in the set of webpages is associated with one or more categories.
- Each webpage 185 may be associated with one or more categories 127 by the domain engine 160 using the classifier 155 .
- the one or more categories 127 may be associated with a webpage 185 by extracting features 135 from the webpage 185 and using the classifier 155 to predict one or more categories for the webpage 185 based on the features 135.
- the domain is associated with one or more categories based on the categories associated with the webpages of the set of webpages.
- Each domain 180 may be associated with one or more categories 127 by the domain engine 160 .
- a domain 180 may be associated with a category 127 when a threshold percentage of the webpages 185 of the set of webpages 185 associated with the domain 180 were associated with the category 127 by the classifier 155 . The percentage may be set by a user or administrator.
- the list of domains and associated categories is provided.
- the list of domains and associated categories may be provided by the domain engine 160 to the access server 210 for use in enforcing one or more access rules 227, for example.
- FIG. 5 is an illustration of an example method 500 for controlling access to webpages using access rules and domain categories.
- the method 500 may be implemented by the access server 210 .
- a list of domains is received.
- the list of domains may be the domain list 165 and may associate each domain 180 in the list with one or more categories 127 .
- the domain list 165 may be received from the classifier server 110 .
- an access rule for a user is received.
- the access rule 227 may be received by the request engine 230 from the rule engine 220 .
- the access rule 227 may include one or more categories 127 of webpages 185 that the user is forbidden from accessing.
- the access rule 227 may apply to individual users or groups of users.
- a request for a webpage is received.
- the request 206 may be received by the request engine 230 from a user device 205 associated with the user.
- the request 206 may be part of a DNS request related to the domain 180 associated with the requested webpage 185 .
- whether the domain associated with the webpage is in the list of domains is determined. The determination may be made by the request engine 230 searching the domain list 165. If the domain 180 is not in the domain list 165, the method 500 may continue at 550. Else, the method 500 may continue at 560.
- the classifier is used to determine a category for the domain associated with the request.
- the category 127 may be determined by the request engine 230 using the classifier 155.
- the request engine 230 may extract features 135 from the requested webpage 185 and may use the extracted features 135 and the classifier 155 to predict one or more categories for the requested webpage 185.
- the determined one or more categories 127 may be used for the domain 180.
- multiple webpages 185 associated with the domain 180 may be retrieved and the categories 127 predicted for these webpages 185 may be used to determine the one or more categories for the domain 180 .
- the request engine 230 may update the domain list 165 .
- whether the category of the domain is in the access rule is determined. The determination may be made by the request engine 230 . If a category 127 of the domain 180 of the requested webpage 185 is in the access rule 227 , then the method 500 may continue at 570 . Else, the method 500 may continue at 580 .
- the webpage is blocked.
- the requested webpage 185 may be blocked by the request engine 230 .
- the request engine 230 may block the requested webpage 185 by redirecting the user device 205 to a different webpage 185 that explains why the requested webpage 185 was blocked.
- the different webpage 185 may indicate the blocked categories 127 that were associated with the domain 180 and may include contact information for a user or administrator.
- the request engine 230 may redirect the request 206 by sending the user device 205 an IP address associated with the different webpage 185 in response to the DNS request.
- the user is allowed to access the requested webpage.
- the user may be allowed to access the requested webpage 185 by the request engine 230 .
- the request engine 230 may pass the request 206 to a DNS server for fulfilment.
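- The flow of method 500 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the names `AccessRule`, `handle_request`, `classify_domain`, `resolve`, and the address `BLOCK_PAGE_IP` are hypothetical, the domain list is modeled as a dictionary from domain name to a set of categories, and the classifier and DNS resolver are passed in as plain callables.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Set

# Hypothetical IP address of the page that explains why a request was blocked.
BLOCK_PAGE_IP = "203.0.113.10"


@dataclass
class AccessRule:
    """Categories that the covered user is forbidden from accessing."""
    blocked_categories: Set[str] = field(default_factory=set)


def handle_request(
    domain: str,
    rule: AccessRule,
    domain_list: Dict[str, Set[str]],
    classify_domain: Callable[[str], Set[str]],
    resolve: Callable[[str], str],
) -> str:
    """Return the IP address used to answer a DNS request for `domain`."""
    # Look the domain up in the domain list.
    categories = domain_list.get(domain)
    if categories is None:
        # Unknown domain: fall back to the classifier and update the list.
        categories = classify_domain(domain)
        domain_list[domain] = categories
    # Block when any category of the domain appears in the access rule.
    if categories & rule.blocked_categories:
        return BLOCK_PAGE_IP
    # Otherwise pass the request on for normal resolution.
    return resolve(domain)
```

- Returning the address of an explanatory page mirrors the DNS-based redirect described above; a deployment could equally return an error response, which the description does not rule out.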
- FIG. 6 is an illustration of an example method 600 for controlling access to webpages using access rules and domain categories.
- the method 600 may be implemented by the access server 210 .
- an identifier of a group of users is received.
- the identifier may be received by the rule engine 220 .
- a user or administrator may desire to create an access rule 227 for the users in the group and may connect to the rule engine 220 using a user interface provided by the rule engine 220 or access server 210 .
- a selection of one or more categories is received.
- the selection of the one or more categories may be received by the rule engine 220 from the user or administrator creating the access rule 227 .
- the one or more categories 127 may be categories of domains 180 and/or webpages 185 that the user or administrator would like to prevent users in the group from viewing or accessing.
- an access rule is generated.
- the access rule 227 may be generated by the rule engine 220 based on the identified group of users and the selected categories 127 .
- a request is received from a user.
- the request may be received from a user device 205 associated with the user by the request engine 230 .
- the request 206 may be a DNS request and may be a request to access a webpage 185 associated with a domain 180 .
- if the user is a member of the identified group of users, the method 600 may continue at 660 . Else, the method 600 may continue at 670 .
- the request is processed using the access rule.
- the request 206 may be processed by the request engine 230 using the access rule 227 as described previously.
- the request engine 230 may only permit the user to view the requested webpage 185 if the domain 180 associated with the webpage 185 is not also associated with any category 127 indicated in the access rule 227 .
- the user is allowed to access the webpage 185 .
- the user may be allowed to access the requested webpage 185 by the request engine 230 .
- the request engine 230 may return the IP address associated with the domain 180 of the requested webpage 185 or may pass the request 206 to a DNS server for fulfilment.
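- A group-based access rule of the kind used in method 600 might be modeled as shown below. This is a sketch under assumptions: `GroupAccessRule`, `RuleEngine`, and `is_allowed` are hypothetical names, group membership is supplied as a dictionary from group identifier to user identifiers, and a user who belongs to no group with a rule is simply allowed.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, List, Set


@dataclass(frozen=True)
class GroupAccessRule:
    """Blocked categories for every user in one group."""
    group_id: str
    blocked_categories: FrozenSet[str]


class RuleEngine:
    def __init__(self, memberships: Dict[str, Set[str]]) -> None:
        # Maps a group identifier to the identifiers of the users in it.
        self._memberships = memberships
        self._rules: List[GroupAccessRule] = []

    def create_rule(self, group_id: str, categories: Iterable[str]) -> GroupAccessRule:
        """Generate an access rule from a group identifier and selected categories."""
        rule = GroupAccessRule(group_id, frozenset(categories))
        self._rules.append(rule)
        return rule

    def rules_for(self, user_id: str) -> List[GroupAccessRule]:
        """Return the rules of every group that the user belongs to."""
        return [
            rule for rule in self._rules
            if user_id in self._memberships.get(rule.group_id, set())
        ]


def is_allowed(user_id: str, domain_categories: Set[str], engine: RuleEngine) -> bool:
    """Allow access unless a rule for the user's group blocks a domain category."""
    return not any(
        rule.blocked_categories & domain_categories
        for rule in engine.rules_for(user_id)
    )
```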
- FIG. 7 is an illustration of an example method 700 for associating categories with new domains.
- the method 700 may be implemented by the classifier server 110 .
- the list of domains may be the domain list 165 and may be received by the domain engine 160 .
- the domain list 165 may include some or all of the domains 180 available on the internet at a certain time.
- Each domain 180 in the list 165 may have one or more associated categories 127 .
- an indication of a new domain is received.
- the indication of a new domain 180 may be received by the domain engine 160 .
- the indication of a new domain 180 may be received from a service or publication that lists all new domains 180 created on a subsequent day.
- the new domain 180 may be a domain 180 that is not in the domain list 165 .
- one or more webpages associated with the new domain are retrieved.
- the one or more webpages 185 may be retrieved by the domain engine 160 .
- features are extracted from the one or more webpages.
- the features 135 may be extracted by the extraction engine 130 of the classifier server 110 .
- the features 135 may include text features 135 and script features 135 .
- Other features 135 may be supported.
- one or more categories for the one or more webpages are determined.
- the one or more categories 127 for each of the one or more webpages 185 may be determined by the domain engine 160 using the classifier 155 and the features extracted from each of the one or more webpages.
- one or more categories are associated with the new domain.
- the one or more categories 127 may be associated with the new domain 180 by the domain engine 160 .
- the domain engine 160 may associate categories 127 with the new domain 180 that are associated with more than a threshold percentage of the one or more webpages 185 .
- the new domain and associated one or more categories are added.
- the new domain and associated one or more categories may be added to the domain list 165 by the domain engine 160 .
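- The steps of method 700 can be illustrated with the following Python sketch. It is a minimal example under assumptions: `categories_for_domain` and `add_new_domain` are hypothetical names, page retrieval and per-page classification are supplied as callables, and the default threshold of 0.5 is an arbitrary illustrative value rather than one taken from the description.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, List, Set


def categories_for_domain(
    page_categories: Iterable[Set[str]],
    threshold: float = 0.5,
) -> Set[str]:
    """Keep categories predicted for more than `threshold` of the pages."""
    pages = list(page_categories)
    if not pages:
        return set()
    # Count, for each category, how many pages it was predicted for.
    counts = Counter(cat for cats in pages for cat in set(cats))
    return {cat for cat, n in counts.items() if n / len(pages) > threshold}


def add_new_domain(
    domain: str,
    domain_list: Dict[str, Set[str]],
    fetch_pages: Callable[[str], List[str]],
    classify_page: Callable[[str], Set[str]],
    threshold: float = 0.5,
) -> Set[str]:
    """Categorize a domain that is not yet in the list and add it."""
    if domain in domain_list:
        return domain_list[domain]
    pages = fetch_pages(domain)                           # retrieve webpages
    predicted = [classify_page(page) for page in pages]   # categories per page
    domain_list[domain] = categories_for_domain(predicted, threshold)
    return domain_list[domain]
```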
- FIG. 8 shows an exemplary computing environment in which example embodiments and aspects may be implemented.
- the computing device environment is only one example of a suitable computing environment and is not intended to suggest any limitation as to the scope of use or functionality.
- Numerous other general purpose or special purpose computing device environments or configurations may be used. Examples of well-known computing devices, environments, and/or configurations that may be suitable for use include, but are not limited to, personal computers, server computers, handheld or laptop devices, multiprocessor systems, microprocessor-based systems, network personal computers (PCs), minicomputers, mainframe computers, embedded systems, distributed computing environments that include any of the above systems or devices, and the like.
- Computer-executable instructions such as program modules, being executed by a computer may be used.
- program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types.
- Distributed computing environments may be used where tasks are performed by remote processing devices that are linked through a communications network or other data transmission medium.
- program modules and other data may be located in both local and remote computer storage media including memory storage devices.
- an exemplary system for implementing aspects described herein includes a computing device, such as computing device 800 .
- computing device 800 typically includes at least one processing unit 802 and memory 804 .
- memory 804 may be volatile (such as random access memory (RAM)), non-volatile (such as read-only memory (ROM), flash memory, etc.), or some combination of the two.
- Computing device 800 may have additional features/functionality.
- computing device 800 may include additional storage (removable and/or non-removable) including, but not limited to, magnetic or optical disks or tape.
- additional storage is illustrated in FIG. 8 by removable storage 808 and non-removable storage 810 .
- Computing device 800 typically includes a variety of computer readable media.
- Computer readable media can be any available media that can be accessed by the device 800 and includes both volatile and non-volatile media, removable and non-removable media.
- Computer storage media include volatile and non-volatile, and removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data.
- Memory 804 , removable storage 808 , and non-removable storage 810 are all examples of computer storage media.
- Computer storage media include, but are not limited to, RAM, ROM, electrically erasable program read-only memory (EEPROM), flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by computing device 800 . Any such computer storage media may be part of computing device 800 .
- Computing device 800 may contain communication connection(s) 812 that allow the device to communicate with other devices.
- Computing device 800 may also have input device(s) 814 such as a keyboard, mouse, pen, voice input device, touch input device, etc.
- Output device(s) 816 such as a display, speakers, printer, etc. may also be included. All these devices are well known in the art and need not be discussed at length here.
- FPGAs Field-programmable Gate Arrays
- ASICs Application-specific Integrated Circuits
- ASSPs Application-specific Standard Products
- SOCs System-on-a-chip systems
- CPLDs Complex Programmable Logic Devices
- the methods and apparatus of the presently disclosed subject matter may take the form of program code (i.e., instructions) embodied in tangible media, such as floppy diskettes, CD-ROMs, hard drives, or any other machine-readable storage medium where, when the program code is loaded into and executed by a machine, such as a computer, the machine becomes an apparatus for practicing the presently disclosed subject matter.
- Although exemplary implementations may refer to utilizing aspects of the presently disclosed subject matter in the context of one or more stand-alone computer systems, the subject matter is not so limited, but rather may be implemented in connection with any computing environment, such as a network or distributed computing environment. Still further, aspects of the presently disclosed subject matter may be implemented in or across a plurality of processing chips or devices, and storage may similarly be effected across a plurality of devices. Such devices might include personal computers, network servers, and handheld devices, for example.
Abstract
Description
- Accurately categorizing webpages or domains is important for a variety of applications. For example, a user may desire to categorize domains for use by a search engine or other index. As another example, a user may wish to prevent workers or family members from accessing webpages associated with pornography, gambling, or other controversial categories.
- While categorizing webpages and domains is useful, it is also extremely time consuming and labor intensive. Generally, human reviewers review each webpage of a domain, and may assign one or more categories to each domain based on their review. However, such human review is error prone and time consuming. Moreover, given the huge number of new domains created every day, manually categorizing each new domain is impractical.
- In an embodiment, a set of labeled training data that includes indicators of webpages is received. Each indicated webpage is labeled with one or more categories that were determined for the webpage by a human reviewer. Features, such as text and scripts, are extracted from each indicated webpage, and are used along with the labels to train a classifier to predict one or more categories for a webpage based on the features of the webpage. The trained classifier may be used to associate one or more categories with each domain of a plurality of domains given the categories predicted for some or all of the webpages associated with the domain. A list of domains and associated categories may be used for a variety of purposes including search engine optimization and content filtering. When a new domain is created, the trained classifier may be used to quickly and automatically associate one or more categories with the new domain, and the new domain and categories can be added to the list of domains and associated categories.
- In an embodiment, a method for training a classifier is provided. The method includes: receiving a training set of webpages by a computing device, wherein each webpage in the training set is associated with one or more categories of a first plurality of categories; for each webpage of the training set of webpages, extracting one or more features from the webpage by the computing device; and for each webpage of the training set of webpages, training a classifier using the one or more extracted features and the one or more categories associated with the webpage by the computing device.
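- The feature-extraction step of the training method could be sketched with Python's standard `html.parser` module, as below. This is only one possible approach and an assumption on our part: the names `FeatureParser` and `extract_features` are hypothetical, text features are reduced to lowercase word tokens, and script features are the inline script bodies plus the `src` attributes of external scripts. Training of the classifier itself is not shown.

```python
import re
from html.parser import HTMLParser
from typing import Dict, List


class FeatureParser(HTMLParser):
    """Collects visible text and script content from one webpage."""

    def __init__(self) -> None:
        super().__init__()
        self._current = ""                 # "script", "style", or "" for text
        self.text_parts: List[str] = []
        self.scripts: List[str] = []
        self.script_sources: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._current = tag
        if tag == "script":
            src = dict(attrs).get("src")
            if src:
                self.script_sources.append(src)

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = ""

    def handle_data(self, data):
        if self._current == "script":
            self.scripts.append(data)
        elif self._current == "":
            self.text_parts.append(data)   # style content is ignored


def extract_features(html: str) -> Dict[str, List[str]]:
    """Return text features (word tokens) and script features for a page."""
    parser = FeatureParser()
    parser.feed(html)
    parser.close()
    words = re.findall(r"[a-z0-9]+", " ".join(parser.text_parts).lower())
    return {
        "text": words,
        "scripts": [s.strip() for s in parser.scripts if s.strip()],
        "script_sources": parser.script_sources,
    }
```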
- Embodiments may include some or all of the following features. The method may further include: reducing the first plurality of categories to a second plurality of categories; and associating each webpage of the set of webpages with one or more categories of the second plurality of categories based on the one or more categories of the first plurality of categories that are associated with each webpage. The one or more features may include text features and script features. The method may further include: for each domain of a plurality of domains: retrieving a set of webpages from the domain by the computing device; for each webpage of the set of webpages: extracting one or more features from the webpage of the set of webpages by the computing device; and associating one or more categories of the first plurality of categories with the webpage using the classifier and the one or more features extracted from the webpage by the computing device. The method may further include for each domain of the plurality of domains, associating one or more categories of the first plurality of categories with the domain based on the one or more categories associated with each webpage of the set of webpages from the domain. Associating one or more categories of the first plurality of categories with the domain based on the one or more categories associated with each webpage of the set of webpages from the domain may include: determining each category associated with more than a threshold percentage of webpages of the set of webpages; and associating the determined categories with the domain. Each category is associated with a different threshold percentage.
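- Reducing a first plurality of categories to a second plurality can be pictured as a simple label mapping. The sketch below is illustrative only: `LABEL_TO_CATEGORY`, `reduce_labels`, and `relabel_training_data` are hypothetical names, the training set is modeled as (webpage identifier, labels) pairs, and labels without a mapping entry are assumed to pass through unchanged.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical mapping from fine-grained labels to the reduced category set.
LABEL_TO_CATEGORY: Dict[str, str] = {
    "casino gambling": "gambling",
    "sports betting": "gambling",
}


def reduce_labels(
    labels: Iterable[str],
    mapping: Dict[str, str] = LABEL_TO_CATEGORY,
) -> Set[str]:
    """Map each label to its reduced category; unmapped labels are kept."""
    return {mapping.get(label, label) for label in labels}


def relabel_training_data(
    training_data: Iterable[Tuple[str, Set[str]]],
    mapping: Dict[str, str] = LABEL_TO_CATEGORY,
) -> List[Tuple[str, Set[str]]]:
    """Apply the reduction to every (webpage identifier, labels) pair."""
    return [(page, reduce_labels(labels, mapping)) for page, labels in training_data]
```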
- In an embodiment, a method for associating categories with domains is provided. The method includes: receiving a list of domains by a computing device; receiving a plurality of categories by the computing device; receiving a classifier by the computing device; for each domain of the list of domains: retrieving a set of webpages from the domain by the computing device; for each webpage of the set of webpages: extracting one or more features from the webpage of the set of webpages by the computing device; associating one or more categories of the plurality of categories with the webpage using the classifier and the one or more features extracted from the webpage by the computing device; and associating one or more categories of the plurality of categories with the domain based on the one or more categories associated with each webpage of the set of webpages from the domain by the computing device.
- Embodiments may have some or all of the following features. The one or more features may include text features and script features. The classifier is a neural network. Associating one or more categories of the plurality of categories with the domain based on the one or more categories associated with each webpage of the set of webpages from the domain may include: determining each category associated with more than a threshold percentage of webpages of the set of webpages; and associating the determined categories with the domain. Each category may be associated with a different threshold percentage. The method may further include: receiving indications of a training set of webpages by the computing device, wherein each webpage in the training set is associated with one or more categories of the plurality of categories; for each webpage of the training set of webpages, extracting one or more features from the webpage by the computing device; and for each webpage of the training set of webpages, training the classifier using the one or more extracted features and the one or more categories associated with the webpage by the computing device. The method may further include using the list of domains and associated one or more categories to control user access to the set of webpages associated with each domain of the list of domains.
- A method for categorizing new domains using artificial intelligence is provided. The method includes: receiving a list of domains by the computing device, wherein each domain in the list of domains was associated with a category of the plurality of categories by a classifier; receiving an indication of a new domain by the computing device, wherein the new domain is not in the list of domains; in response to the indication, retrieving at least one webpage from the new domain by the computing device; extracting one or more features from the at least one webpage by the computing device; associating a category of the plurality of categories with the new domain using the classifier and the extracted one or more features by the computing device; and adding the new domain and the associated category to the list of domains by the computing device.
- Embodiments may include some or all of the following features. The plurality of features may include text features and script features. The classifier may be a neural network. Associating the category of the plurality of categories with the new domain using the classifier and the extracted one or more features may include: determining the category associated with the at least one webpage using the extracted one or more features and the classifier; and associating the determined category with the domain. The method may further include: receiving indications of a training set of webpages by the computing device, wherein each webpage in the training set is associated with one or more categories of the plurality of categories; for each webpage of the training set of webpages, extracting one or more features from the webpage by the computing device; and for each webpage of the training set of webpages, training the classifier using the one or more extracted features and the one or more categories associated with the webpage by the computing device. The method may further include using the list of domains and associated one or more categories to control user access to webpages associated with the domains in the list of domains. The method may further include: receiving one or more access rules; and controlling user access to the webpages associated with the domains in the list of domains according to the received one or more access rules.
- The embodiments described herein provide many benefits over the prior art. First, by categorizing domains using a trained classifier, the need for expensive human classifiers is greatly reduced. Second, because the trained classifier can quickly categorize new domains without human input, any application that relies on such categorized domains will be more current than applications that use traditional human-based methods for domain categorization.
- Additional advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The advantages of the invention will be realized and attained by means of the elements and combinations particularly pointed out in the appended claims. It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
- The accompanying figures, which are incorporated herein and form part of the specification, illustrate a domain categorization system and method. Together with the description, the figures further serve to explain the principles of the domain categorization system and method described herein and thereby enable a person skilled in the pertinent art to make and use the domain categorization system and method.
- FIG. 1 is an example computing environment for training a classifier and for assigning categories to domains using the classifier;
- FIG. 2 is an example computing environment for controlling access to webpages and domains using access rules and a list of domains and categories;
- FIG. 3 is an illustration of an example method for training a classifier to determine one or more categories for webpages;
- FIG. 4 is an illustration of an example method for associating categories with domains;
- FIG. 5 is an illustration of an example method for controlling access to webpages for a user using access rules and domain categories;
- FIG. 6 is an illustration of an example method for controlling access for groups of users to webpages using access rules and domain categories;
- FIG. 7 is an illustration of an example method for associating categories with new domains; and
- FIG. 8 shows an exemplary computing environment in which example embodiments and aspects may be implemented.
- The construction and arrangement of the systems and methods as shown in the various exemplary embodiments are illustrative only. Although only a few embodiments have been described in detail in this disclosure, many modifications are possible (e.g., variations in sizes, dimensions, structures, shapes and proportions of the various elements, values of parameters, mounting arrangements, use of materials, colors, orientations, etc.). For example, the position of elements may be reversed or otherwise varied, and the nature or number of discrete elements or positions may be altered or varied. Accordingly, all such modifications are intended to be included within the scope of the present disclosure. The order or sequence of any process or method steps may be varied or re-sequenced according to alternative embodiments. Other substitutions, modifications, changes, and omissions may be made in the design, operating conditions, and arrangement of the exemplary embodiments without departing from the scope of the present disclosure.
- As described above, many organizations categorize domains. These categorized domains may be used for a variety of purposes such as search engine creation and access control. However, most entities currently rely on human reviewers to review and categorize domains, which, given the large number of existing domains and the large number of new domains that are created every day, is difficult and time consuming.
- As will be described below in greater detail, to solve the problems noted above for categorizing domains, an artificial-intelligence-based classifier is trained to quickly and efficiently categorize domains based on one or more webpages associated with a domain. Initially, human reviewers are used to categorize a set of webpages extracted from a variety of domains. Features from the webpages and their associated categories are used to train the classifier. Later, when an entity wants to determine a category for an existing or new domain, some number of webpages are extracted from the domain and the classifier is used to categorize each extracted webpage without human reviewers. Some or all of the categories determined for the extracted webpages are then associated with the domain. In this way, new and existing domains can be quickly and efficiently categorized without the cost and time associated with human reviewers.
-
FIG. 1 is an example of acloud computing environment 100 for assigning categories to domains using a classifier. As shown, theenvironment 100 includes aclassifier server 110 in communication with one ormore domains 180 through anetwork 190. Thenetwork 190 may include a combination of public and private networks. Each of theclassifier server 110 anddomains 180 may be implemented using one or more general purpose computing devices such as thecomputing device 800 illustrated with respect toFIG. 8 . Moreover, in some embodiments, theclassifier server 110 may be implemented in a cloud-based computing environment. - A
domain 180 may represent a group ofwebpages 185 reachable in part using a common domain name. For example, adomain 180 “foobaz.com” may includemultiple webpages 185 such as “foobaz.com/home.html”, “foobaz.com/contact.html” and “foobaz.com/FAQ.com”. Each of thewebpages 185 is reachable through the internet using a URL that includes the domain name “foobaz.com”. - In order to control access to
webpages 185, theclassifier server 110 may generate what is referenced to as adomain list 165. Thedomain list 165 may be a list ofdomains 180 along with associatedcategories 127. Acategory 127 may be a topic or subject that is commonly associated with thewebpages 185 of thedomain 180.Example categories 127 may include controversial topics such as “pornography”, “gambling”, or “violence” and more general topics such as “news”, “sports”, and “music.” Generally, thecategories 127 may relate to topics or subjects that an entity, such as a corporation or a family, would like to prevent or restrict associated users from viewing or accessing. Theparticular categories 127 considered by theclassifier server 110 may be selected by a user or administrator. - As shown, to create the
domain list 165, theclassifier server 110 includes several components including, but not limited to, acategory engine 120, anextraction engine 130, atraining engine 140, and adomain engine 160. More or fewer components may be supported. Each of the components may be implemented together or separately using one or more general purpose computing devices such as thecomputing device 800 illustrated with respect toFIG. 8 . - The
classifier server 110 may receivetraining data 125. Thetraining data 125 may be labeled and may include identifiers ofwebpages 185, and each identifiedwebpage 185 may be labeled with one or more categories. Depending on the embodiment, each identifiedwebpage 185 may have been labeled with a category by a human reviewer. - The
category engine 120 may receive thecategories 127 that will be used in thedomain list 165 and may optionally adjust or simplify the labels used in thetraining data 125 to conform to the receivedcategories 127. For example, the receivedtraining data 125 may be labeled with gambling related categories such as “casino gambling” and “sports betting.” However, thecategories 127 may only include asingle category 127 for all gamblingrelated categories 127. Accordingly, thecategory engine 120 may replace all gambling related labels in thetraining data 125 with thecategory 127 of “gambling.” - The
extraction engine 130 may extractfeatures 135 from some or all of thewebpages 185 identified in thetraining data 125. The extracted features 135 may include text features and script features. With respect to text features, these features may include words and phrases, as well as certain combinations or words and phrases, which appear in awebpage 185. With regards to script features, these features may include all or portions of scripts, such as JavaScript scripts, which are found in awebpage 185. Other types offeatures 135 that may be extracted include image and video features. Any method for extractingfeatures 135 from awebpage 185 may be used. - The
training engine 140 may use some or all of the extracted features 135 for each identifiedwebpage 185 in thetraining data 125, along with the associated category labels, to train aclassifier 155. Theclassifier 155 may be anartificial intelligence classifier 155 or model that receives as an input features 135 extracted from awebpage 185, and outputs one ormore categories 127 that are likely to be associated with thewebpage 185. Theclassifier 155 may be a convolutional neural network. However, other types of classifiers and/or neural networks may be used such as shallow neural networks, deep neural networks, and recurrent neural networks. Depending on the embodiment, thetraining engine 140 may train theclassifier 155 using a first portion of thetraining data 125, and then may test theclassifier 155 using a second portion of thetraining data 125. - The
domain engine 160 may use theclassifier 155 to generate thedomain list 165. In some embodiments, thedomain engine 160 may generate thedomain list 165, by first receiving a set ofdomains 180. Thedomain engine 160 may then, for eachdomain 180, use a crawler or other application, to retrieve some or all of thewebpages 185 associated with thedomain 180. - The
domain engine 160 may then use theextraction engine 130 to extractfeatures 135 from each of thewebpages 185 associated with thedomain 180 and may use theclassifier 155 to determine or predict one ormore categories 127 for eachwebpage 185 associated with thedomain 180. Depending on the embodiment, thedomain engine 160 may associate eachdomain 180 with the most frequent ortop categories 127 predicted by theclassifier 155 for thewebpages 185 associated with thedomain 180. Thesedomains 180 and associatedcategories 127 may be used by thedomain engine 160 to create thedomain list 165. - In some embodiments, the
domain engine 160 may associate adomain 180 with acategory 127 when thecategory 127 is predicted for a threshold percentage of thewebpages 185 associated with thedomain 180 by theclassifier 155. The threshold percentage may be specified by an administrator. - In some embodiments, the same threshold percentage may be used for all categories. In other embodiments, different threshold percentages may be used for different categories. For example, some
controversial categories 127 such as “pornography” may have a lower threshold percentage thanbenign categories 127 such as “art” or “music”. - As may be appreciated,
new domains 180 are constantly being created. Accordingly, thedomain engine 160 may be configured to determinenew domains 180, determine one ormore categories 127 for thenew domains 180 as described above, and to add thenew domains 180 and determined one ormore categories 127 to thedomain list 165. Depending on the embodiment, thedomain engine 160 may determinenew domains 180 from the WHOIS domains database. Other sources of newly addeddomains 180 may be used. - Because there may be a delay in registering a
domain 180 and publishing one ormore webpages 185 under thedomain 180, in some embodiments, thedomain engine 160 may wait to assigncategories 127 tonew domains 180 until some threshold number ofwebpages 185 are published. The threshold number ofwebpages 185 may be set by an administrator. -
FIG. 2 is anexample computing environment 200 for controlling access to webpages and domains using access rules and a domain list. As shown, theenvironment 200 includes anaccess server 210 in communication with one ormore domains 180 anduser devices 205 through thenetwork 190. Each of theaccess server 210,domain 180, anduser device 205 may be implemented using one or more general purpose computing devices such as thecomputing device 800 illustrated with respect toFIG. 8 . - The
access server 210 may control access to one or more webpages 185 for user devices 205 based on the domain list 165 described previously with respect to FIG. 1 and one or more access rules 227. As shown, the access server 210 may include several components including, but not limited to, a rule engine 220 and a request engine 230. More or fewer components may be supported.

The
rule engine 220 may allow for the creation of one or more access rules 227 that control which webpages 185 and/or domains 180 a user is allowed to access. As used herein, an access rule 227 lists one or more categories 127 that a user is not allowed to view or visit using a corresponding user device 205. For example, an access rule 227 that includes the category 127 “video games” may indicate that a corresponding user is not allowed to visit webpages 185 that are associated with domains 180 that are associated with the category 127 “video games.” Alternatively, an access rule 227 may list the categories 127 that the user is allowed to view or visit, and all other categories 127 may be restricted for the user.

In some embodiments, the access rules 227 may apply at all times, or may apply only at certain times. For example, an
access rule 227 for a user may prevent the user from viewing webpages 185 that are associated with domains 180 of the category 127 “social networking” between the working hours of 9 am and 5 pm.

The
rule engine 220 may provide a user interface through which administrators may create access rules 227 that apply to users associated with a particular entity such as a corporation or a family. The administrators may select the particular categories 127 for each access rule 227, as well as the particular users that the access rule 227 will apply to. Depending on the embodiment, the access rules 227 may apply to individual users or to groups of users. For example, an administrator of a company may wish to restrict access to domains 180 associated with the category 127 “pornography” for all users of the company. As another example, an administrator of a home or family network may wish to restrict access to certain categories 127 for child users but not for adult users.

The
request engine 230 may receive requests 206 for webpages 185 from user devices 205 and may either allow or deny each request 206 based on the particular access rules 227 that apply to the user associated with the user device 205. In some embodiments, the request 206 may be a Domain Name System (DNS) request made by the user device 205 in response to a user entering or selecting a URL using a browser application. When a user enters a URL that includes a domain name, the browser application of the user device 205 must first perform a domain name lookup, in which an IP address corresponding to the domain name of the URL is determined. The IP address can then be used to request a webpage 185.

The request engine 230 (and access server 210) may function together with a DNS server that receives
requests 206 from user devices 205. When a request 206 is received from a user device 205, the request engine 230 may first determine any access rules 227 that apply to the user of the user device 205 (either individually or as a member of a group) and may determine any forbidden categories 127 that the user is not permitted to access. The request engine 230 may then use the domain list 165 to determine if the domain 180 associated with the request 206 is associated with any of the forbidden categories 127. If the request 206 is not associated with any of the forbidden categories 127, then the request engine 230 may pass the request 206 to a DNS server for further processing.

If the
request 206 is associated with any of the forbidden categories 127, then the request engine 230 may block the request 206 and may optionally redirect the user device 205 to a webpage explaining why the request 206 was blocked.

In some embodiments, the
request engine 230 may receive a request 206 from a user that is not associated with an access rule 227. In such cases, the request engine 230 may pass the request 206 to a DNS server for further processing.

As may be appreciated, because of the large number of
new domains 180 that are created every day, the request engine 230 may receive a request 206 for a webpage 185 associated with a domain 180 that is not in the domain list 165. In some embodiments, when a request 206 for a webpage 185 associated with a domain 180 that is not in the domain list 165 is received, the request engine 230 may assume that the domain 180 is “safe” and may pass the request 206 to a DNS server for further processing.

Alternatively, in some embodiments, when a
request 206 for a webpage 185 associated with a domain 180 that is not in the domain list 165 is received, the request engine 230 may retrieve the webpage 185 associated with the request 206, may extract the features 135 from the webpage 185, and may use the classifier 155 and the extracted features 135 to predict one or more categories 127 for the webpage 185. If any of the predicted one or more categories 127 are forbidden categories 127 for the user, the request 206 may be denied as described above.

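The rule semantics described above (deny lists, allow lists, time-limited rules, and the two ways of handling a domain missing from the domain list) can be sketched as follows. The `AccessRule` class, the `decide` function, and the `classify_unknown` callback are hypothetical names introduced for illustration; the source does not specify a data model.

```python
from dataclasses import dataclass
from datetime import time


@dataclass(frozen=True)
class AccessRule:
    categories: frozenset
    mode: str = "deny"      # "deny": listed categories are forbidden;
                            # "allow": only listed categories are permitted
    start: time = time.min  # the rule applies between start and end
    end: time = time.max

    def forbids(self, domain_categories, now):
        if not (self.start <= now <= self.end):
            return False  # the rule is not in force at this time of day
        if self.mode == "deny":
            return bool(self.categories & domain_categories)
        return bool(domain_categories - self.categories)


def decide(domain, domain_list, rules, now, classify_unknown=None):
    """Return "block" or "allow" for a request to the given domain."""
    categories = domain_list.get(domain)
    if categories is None:
        if classify_unknown is None:
            return "allow"  # assume a domain missing from the list is safe
        categories = classify_unknown(domain)  # classify on demand instead
    categories = frozenset(categories)
    blocked = any(rule.forbids(categories, now) for rule in rules)
    return "block" if blocked else "allow"


domain_list = {"chat.example": {"social networking"}}
work_hours = AccessRule(frozenset({"social networking"}),
                        start=time(9), end=time(17))
print(decide("chat.example", domain_list, [work_hours], time(10, 30)))
print(decide("chat.example", domain_list, [work_hours], time(20, 0)))
```

A user with no applicable rules corresponds to passing an empty `rules` list, in which case every request is allowed.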
FIG. 3 is an illustration of an example method 300 for training a classifier to determine one or more categories for webpages. The method 300 may be implemented by the training engine 140 of the classifier server 110.

At 310, training data is received. The
training data 125 may be received by the training engine 140 of the classifier server 110. The training data 125 may be labeled and may include a set of indications of webpages 185. Each indicated webpage 185 in the training data 125 may be labeled with one or more categories 127.

At 320, features are extracted from each webpage indicated in the training data. The
features 135 may be extracted from each webpage 185 indicated in the training data 125 by the extraction engine 130. The extracted features 135 may include text features and script features. Other types of features 135 may be extracted.

At 330, a classifier is trained using the extracted features and categories associated with each webpage. The
classifier 155 may be trained by the training engine 140. The classifier 155 may receive as input the features 135 extracted from a webpage 185 and may output one or more categories 127.

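The extraction step at 320 can be illustrated with a short sketch built on Python's standard `html.parser` module. Which text features and script features are actually used is not stated in this section, so the choices below (lowercased visible-text tokens, external script URLs, and the amount of inline script) are assumptions, as are the names `FeatureExtractor` and `extract_features`.

```python
from html.parser import HTMLParser


class FeatureExtractor(HTMLParser):
    """Collects simple text features and script features from one webpage."""

    def __init__(self):
        super().__init__()
        self.text_tokens = []         # text feature: visible words
        self.script_sources = []      # script feature: external script URLs
        self.inline_script_chars = 0  # script feature: amount of inline code
        self._inside = None           # "script" or "style" while inside one

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._inside = tag
        if tag == "script":
            src = dict(attrs).get("src")
            if src:
                self.script_sources.append(src)

    def handle_endtag(self, tag):
        if tag == self._inside:
            self._inside = None

    def handle_data(self, data):
        if self._inside == "script":
            self.inline_script_chars += len(data.strip())
        elif self._inside is None:
            self.text_tokens.extend(data.lower().split())


def extract_features(html):
    parser = FeatureExtractor()
    parser.feed(html)
    parser.close()
    return {
        "text_tokens": parser.text_tokens,
        "script_sources": parser.script_sources,
        "inline_script_chars": parser.inline_script_chars,
    }


page = ('<html><head><title>Poker Night</title>'
        '<script src="https://cdn.example/app.js"></script>'
        '<script>var a=1;</script></head>'
        '<body><p>Play poker online</p></body></html>')
features = extract_features(page)
print(features["text_tokens"], features["script_sources"])
```

Each labeled webpage in the training data would then contribute one pair of such a feature dictionary and its categories as a training example for the classifier.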
FIG. 4 is an illustration of an example method 400 for associating categories with domains. The method 400 may be implemented by the domain engine 160 of the classifier server 110.

At 410, a list of domains is received. The list of domains may be received by the
domain engine 160. The list of domains 180 may include some or all of the domains 180 available on the internet, for example.

At 420, a plurality of categories is received. The plurality of
categories 127 may be received by the domain engine 160 from the category engine 120. The categories 127 may be selected topics or subjects of webpages 185 and/or domains 180 to which one or more entities may desire to restrict or prevent access for their users or employees.

At 430, a classifier is received. The
classifier 155 may be received by the domain engine 160 from the training engine 140. The classifier 155 may be a convolutional neural network trained to predict one or more categories 127 for a webpage 185 based on features 135 extracted from the webpage 185.

At 440, a set of webpages is received for each domain. The set of
webpages 185 for a domain 180 may be webpages 185 that are part of the domain 180 and may be retrieved by the domain engine 160. In some embodiments, the domain engine 160 may use a web crawler or other software tool to retrieve some or all of the webpages 185 available on a domain 180. Alternatively, the domain engine 160 may select a random subset of the webpages 185 that are available at a domain 180 or may select the most popular webpages 185.

At 450, for each domain, each webpage in the set of webpages is associated with one or more categories. Each
webpage 185 may be associated with one or more categories 127 by the domain engine 160 using the classifier 155. Depending on the embodiment, the one or more categories 127 may be associated with a webpage 185 by extracting features 135 from the webpage 185 and using the classifier 155 to predict one or more categories 127 for the webpage 185 based on the features 135.

At 460, for each domain, the domain is associated with one or more categories based on the categories associated with the webpages of the set of webpages. Each
domain 180 may be associated with one or more categories 127 by the domain engine 160. In some embodiments, a domain 180 may be associated with a category 127 when at least a threshold percentage of the webpages 185 of the set of webpages 185 associated with the domain 180 were associated with the category 127 by the classifier 155. The percentage may be set by a user or administrator.

At 470, the list of domains and associated categories is provided. The list of domains and associated categories may be provided by the
domain engine 160 to the access server 210 for use in enforcing one or more access rules 227, for example.

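Steps 440 through 470 can be sketched end to end as follows. The sketch uses the "most frequent or top categories" variant mentioned earlier rather than a percentage threshold. The function `build_domain_list`, its `fetch_pages` and `classify` callbacks, and the keyword-based stand-in classifier are hypothetical and exist only to make the example runnable; they do not represent the trained classifier 155.

```python
from collections import Counter


def build_domain_list(domains, fetch_pages, classify, top_n=2):
    """Map each domain to its top_n most frequently predicted categories."""
    domain_list = {}
    for domain in domains:
        counts = Counter()
        for page in fetch_pages(domain):        # step 440: set of webpages
            counts.update(set(classify(page)))  # step 450: categories per page
        # Step 460: keep the categories predicted for the most webpages.
        domain_list[domain] = [cat for cat, _ in counts.most_common(top_n)]
    return domain_list                          # step 470: provide the list


# Stand-ins for the crawler and the classifier (assumptions for the example).
PAGES = {
    "casino.example": ["poker bonus", "poker rules", "contact us"],
    "gallery.example": ["painting exhibit", "sculpture painting"],
}
KEYWORDS = {"poker": "gambling", "painting": "art", "sculpture": "art"}


def fake_classify(page_text):
    return {cat for word, cat in KEYWORDS.items() if word in page_text}


result = build_domain_list(PAGES, PAGES.get, fake_classify)
print(result)
```

Passing the crawler and classifier as callables keeps the aggregation logic independent of how webpages are selected (full crawl, random subset, or most popular pages).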
FIG. 5 is an illustration of an example method 500 for controlling access to webpages using access rules and domain categories. The method 500 may be implemented by the access server 210.

At 510, a list of domains is received. The list of domains may be the
domain list 165 and may associate each domain 180 in the list with one or more categories 127. The domain list 165 may be received from the classifier server 110.

At 520, an access rule for a user is received. The
access rule 227 may be received by the request engine 230 from the rule engine 220. The access rule 227 may include one or more categories 127 of webpages 185 that the user is forbidden from accessing. The access rule 227 may apply to individual users or groups of users.

At 530, a request for a webpage is received. The
request 206 may be received by the request engine 230 from a user device 205 associated with the user. The request 206 may be part of a DNS request related to the domain 180 associated with the requested webpage 185.

At 540, it is determined whether the domain associated with the webpage is in the list of domains. The determination may be made by the
request engine 230 searching the domain list 165. If the domain 180 is not in the domain list 165, the method 500 may continue at 550. Otherwise, the method 500 may continue at 560.

At 550, the classifier is used to determine a category for the domain associated with the request. The
category 127 may be determined by the request engine 230 using the classifier 155. In some embodiments, the request engine 230 may extract features 135 from the requested webpage 185 and may use the extracted features 135 and the classifier 155 to predict one or more categories 127 for the requested webpage 185. The determined one or more categories 127 may be used for the domain 180. Alternatively, multiple webpages 185 associated with the domain 180 may be retrieved and the categories 127 predicted for these webpages 185 may be used to determine the one or more categories 127 for the domain 180. Depending on the embodiment, after determining the one or more categories 127 for the domain 180, the request engine 230 may update the domain list 165.

At 560, it is determined whether the category of the domain is in the access rule. The determination may be made by the
request engine 230. If a category 127 of the domain 180 of the requested webpage 185 is in the access rule 227, then the method 500 may continue at 570. Otherwise, the method 500 may continue at 580.

At 570, the webpage is blocked. The requested
webpage 185 may be blocked by the request engine 230. In some embodiments, the request engine 230 may block the requested webpage 185 by redirecting the user device 205 to a different webpage 185 that explains why the requested webpage 185 was blocked. The different webpage 185 may indicate the blocked categories 127 that were associated with the domain 180 and may include contact information for a user or administrator. The request engine 230 may redirect the request 206 by sending the user device 205 an IP address associated with the different webpage 185 in response to the DNS request.

At 580, the user is allowed to access the requested webpage. The user may be allowed to access the requested
webpage 185 by the request engine 230. The request engine 230 may pass the request 206 to a DNS server for fulfilment.

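The DNS-level flow of method 500 can be sketched as a single function that answers with either the address of an explanatory block page or the upstream resolution result. Everything named here is an assumption for illustration: `handle_dns_request`, the `resolve_upstream` and `classify_domain` callbacks, and the block page address, which is taken from the 192.0.2.0/24 range reserved for documentation.

```python
BLOCK_PAGE_IP = "192.0.2.1"  # hypothetical address of the explanatory webpage


def handle_dns_request(domain, forbidden, domain_list, resolve_upstream,
                       classify_domain=None):
    """Answer a lookup with the real IP address or the block page address."""
    categories = domain_list.get(domain)            # step 540: in the list?
    if categories is None and classify_domain is not None:
        categories = set(classify_domain(domain))   # step 550: classify now
        domain_list[domain] = categories            # optionally update the list
    blocked = sorted(set(categories or ()) & set(forbidden))  # step 560
    if blocked:
        # Step 570: redirect by answering with the block page's IP address.
        return {"ip": BLOCK_PAGE_IP, "blocked_categories": blocked}
    # Step 580: pass the request on for normal resolution.
    return {"ip": resolve_upstream(domain), "blocked_categories": []}


known = {"bets.example": {"gambling"}}


def fake_resolver(domain):
    return "198.51.100.7"  # stand-in for a real DNS server


print(handle_dns_request("bets.example", {"gambling"}, known, fake_resolver))
print(handle_dns_request("blog.example", {"gambling"}, known, fake_resolver,
                         classify_domain=lambda d: {"news"}))
```

Returning the blocked categories alongside the address lets the block page report which categories 127 caused the denial, as the description of step 570 suggests.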
FIG. 6 is an illustration of an example method 600 for controlling access to webpages using access rules and domain categories. The method 600 may be implemented by the access server 210.

At 610, an identifier of a group of users is received. The identifier may be received by the
rule engine 220. A user or administrator may desire to create an access rule 227 for the users in the group and may connect to the rule engine 220 using a user interface provided by the rule engine 220 or access server 210.

At 620, a selection of one or more categories is received. The selection of the one or more categories may be received by the
rule engine 220 from the user or administrator creating the access rule 227. The one or more categories 127 may be categories of domains 180 and/or webpages 185 that the user or administrator would like to prevent users in the group from viewing or accessing.

At 630, an access rule is generated. The
access rule 227 may be generated by the rule engine 220 based on the identified group of users and the selected categories 127.

At 640, a request is received from a user. The request may be received from a
user device 205 associated with the user by the request engine 230. The request 206 may be a DNS request and may be a request to access a webpage 185 associated with a domain 180.

At 650, it is determined whether the user associated with the request is in the identified group of users. The determination may be made by the
request engine 230. If the user is in the group of users, the method 600 may continue at 660. Otherwise, the method 600 may continue at 670.

At 660, the request is processed using the access rule. The
request 206 may be processed by the request engine 230 using the access rule 227 as described previously. In particular, the request engine 230 may permit the user to view the requested webpage 185 only if the domain 180 associated with the webpage 185 is not associated with any category 127 indicated in the access rule 227.

At 670, the user is allowed to access the
webpage 185. The user may be allowed to access the requested webpage 185 by the request engine 230. The request engine 230 may return the IP address associated with the domain 180 of the requested webpage 185 or may pass the request 206 to a DNS server for fulfilment.

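The group-based rule creation of method 600 can be sketched with a small in-memory rule store. The `RuleEngine` class, its method names, and the user and group identifiers are hypothetical; the source describes a user interface for creating rules but no particular storage or API.

```python
class RuleEngine:
    """Minimal in-memory store of group membership and group access rules."""

    def __init__(self):
        self._groups = {}  # group identifier -> set of user identifiers
        self._rules = {}   # group identifier -> forbidden categories

    def define_group(self, group_id, user_ids):
        self._groups[group_id] = set(user_ids)

    def create_rule(self, group_id, categories):
        # Steps 610-630: a group identifier and a category selection
        # together produce an access rule for that group.
        if group_id not in self._groups:
            raise KeyError(f"unknown group: {group_id}")
        self._rules[group_id] = frozenset(categories)

    def forbidden_for(self, user_id):
        # Step 650: collect the rules of every group the user belongs to.
        forbidden = set()
        for group_id, members in self._groups.items():
            if user_id in members:
                forbidden |= self._rules.get(group_id, frozenset())
        return forbidden


def may_access(engine, user_id, domain_categories):
    # Step 660 applies the rules; a user outside every group has an empty
    # forbidden set, which corresponds to step 670 (always allowed).
    return not (engine.forbidden_for(user_id) & set(domain_categories))


engine = RuleEngine()
engine.define_group("children", {"ana", "ben"})
engine.create_rule("children", {"video games"})
print(may_access(engine, "ana", {"video games"}))
print(may_access(engine, "parent", {"video games"}))
```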
FIG. 7 is an illustration of an example method 700 for associating categories with new domains. The method 700 may be implemented by the classifier server 110.

At 710, a list of domains is received. The list of domains may be the
domain list 165 and may be received by the domain engine 160. The domain list 165 may include some or all of the domains 180 available on the internet at a certain time. Each domain 180 in the list 165 may have one or more associated categories 127.

At 720, an indication of a new domain is received. The indication of a
new domain 180 may be received by the domain engine 160. The indication of a new domain 180 may be received from a service or publication that lists all new domains 180 created on a subsequent day. The new domain 180 may be a domain 180 that is not in the domain list 165.

At 730, one or more webpages associated with the new domain are retrieved. The one or
more webpages 185 may be retrieved by the domain engine 160.

At 740, features are extracted from the one or more webpages. The
features 135 may be extracted by the extraction engine 130 of the classifier server 110. The features 135 may include text features 135 and script features 135. Other features 135 may be supported.

At 750, one or more categories for the one or more webpages are determined. The one or
more categories 127 for each of the one or more webpages 185 may be determined by the domain engine 160 using the classifier 155 and the features 135 extracted from each of the one or more webpages 185.

At 760, one or more categories are associated with the new domain. The one or
more categories 127 may be associated with the new domain 180 by the domain engine 160. In some embodiments, the domain engine 160 may associate with the new domain 180 those categories 127 that are associated with more than a threshold percentage of the one or more webpages 185.

At 770, the new domain and associated one or more categories are added. The new domain and associated one or more categories may be added to the
domain list 165 by the domain engine 160.

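Method 700, combined with the earlier remark that categorization may wait until enough webpages are published, can be sketched as follows. The function `add_new_domains`, its callbacks, and the default values for `min_pages` and `threshold` are assumptions chosen for illustration, not values from the disclosure.

```python
def add_new_domains(domain_list, announced, fetch_pages, classify,
                    min_pages=3, threshold=0.5):
    """Categorize newly announced domains and add them to domain_list.

    Returns the domains that were deferred because too few webpages exist.
    """
    deferred = []
    for domain in announced:
        if domain in domain_list:        # step 720: only truly new domains
            continue
        pages = fetch_pages(domain)      # step 730: retrieve webpages
        if len(pages) < min_pages:
            deferred.append(domain)      # wait until more pages are published
            continue
        counts = {}
        for page in pages:               # steps 740-750: classify each page
            for category in set(classify(page)):
                counts[category] = counts.get(category, 0) + 1
        # Step 760: keep categories found on more than `threshold` of pages.
        domain_list[domain] = {
            c for c, n in counts.items() if n / len(pages) > threshold
        }                                # step 770: add to the domain list
    return deferred


site_pages = {
    "fresh.example": ["poker a", "poker b", "about"],
    "empty.example": ["coming soon"],
}
listing = {"old.example": {"news"}}
waiting = add_new_domains(
    listing,
    ["old.example", "fresh.example", "empty.example"],
    lambda d: site_pages.get(d, []),
    lambda page: {"gambling"} if "poker" in page else set(),
)
print(waiting, listing)
```

Deferred domains can simply be passed to the function again on a later run, once more of their webpages have been published.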
FIG. 8 shows an exemplary computing environment in which example embodiments and aspects may be implemented. The computing device environment is only one example of a suitable computing environment and is not intended to suggest any limitation as to the scope of use or functionality.

Numerous other general purpose or special purpose computing device environments or configurations may be used. Examples of well-known computing devices, environments, and/or configurations that may be suitable for use include, but are not limited to, personal computers, server computers, handheld or laptop devices, multiprocessor systems, microprocessor-based systems, network personal computers (PCs), minicomputers, mainframe computers, embedded systems, distributed computing environments that include any of the above systems or devices, and the like.

Computer-executable instructions, such as program modules, being executed by a computer may be used. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. Distributed computing environments may be used where tasks are performed by remote processing devices that are linked through a communications network or other data transmission medium. In a distributed computing environment, program modules and other data may be located in both local and remote computer storage media including memory storage devices.

With reference to FIG. 8, an exemplary system for implementing aspects described herein includes a computing device, such as computing device 800. In its most basic configuration, computing device 800 typically includes at least one processing unit 802 and memory 804. Depending on the exact configuration and type of computing device, memory 804 may be volatile (such as random access memory (RAM)), non-volatile (such as read-only memory (ROM), flash memory, etc.), or some combination of the two. This most basic configuration is illustrated in FIG. 8 by dashed line 806.

Computing device 800 may have additional features/functionality. For example, computing device 800 may include additional storage (removable and/or non-removable) including, but not limited to, magnetic or optical disks or tape. Such additional storage is illustrated in FIG. 8 by removable storage 808 and non-removable storage 810.

Computing device 800 typically includes a variety of computer readable media. Computer readable media can be any available media that can be accessed by the device 800 and includes both volatile and non-volatile media, removable and non-removable media.

Computer storage media include volatile and non-volatile, and removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data.
Memory 804, removable storage 808, and non-removable storage 810 are all examples of computer storage media. Computer storage media include, but are not limited to, RAM, ROM, electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by computing device 800. Any such computer storage media may be part of computing device 800.

Computing device 800 may contain communication connection(s) 812 that allow the device to communicate with other devices. Computing device 800 may also have input device(s) 814 such as a keyboard, mouse, pen, voice input device, touch input device, etc. Output device(s) 816 such as a display, speakers, printer, etc. may also be included. All these devices are well known in the art and need not be discussed at length here.

It should be understood that the various techniques described herein may be implemented in connection with hardware components or software components or, where appropriate, with a combination of both. Illustrative types of hardware components that can be used include Field-programmable Gate Arrays (FPGAs), Application-specific Integrated Circuits (ASICs), Application-specific Standard Products (ASSPs), System-on-a-chip systems (SOCs), Complex Programmable Logic Devices (CPLDs), etc. The methods and apparatus of the presently disclosed subject matter, or certain aspects or portions thereof, may take the form of program code (i.e., instructions) embodied in tangible media, such as floppy diskettes, CD-ROMs, hard drives, or any other machine-readable storage medium where, when the program code is loaded into and executed by a machine, such as a computer, the machine becomes an apparatus for practicing the presently disclosed subject matter.

Although exemplary implementations may refer to utilizing aspects of the presently disclosed subject matter in the context of one or more stand-alone computer systems, the subject matter is not so limited, but rather may be implemented in connection with any computing environment, such as a network or distributed computing environment. Still further, aspects of the presently disclosed subject matter may be implemented in or across a plurality of processing chips or devices, and storage may similarly be effected across a plurality of devices. Such devices might include personal computers, network servers, and handheld devices, for example.

Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.
Claims (20)
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US17/845,249 US20230409649A1 (en) | 2022-06-21 | 2022-06-21 | Systems and methods for categorizing domains using artificial intelligence |
US17/846,514 US20230409900A1 (en) | 2022-06-21 | 2022-06-22 | Systems and method for categorizing domains using artificial intelligence |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US17/846,514 Continuation US20230409900A1 (en) | 2022-06-21 | 2022-06-22 | Systems and method for categorizing domains using artificial intelligence |
Publications (1)
Publication Number | Publication Date |
---|---|
US20230409649A1 true US20230409649A1 (en) | 2023-12-21 |
Family
ID=89168900
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US17/845,249 Pending US20230409649A1 (en) | 2022-06-21 | 2022-06-21 | Systems and methods for categorizing domains using artificial intelligence |
US17/846,514 Pending US20230409900A1 (en) | 2022-06-21 | 2022-06-22 | Systems and method for categorizing domains using artificial intelligence |
Country Status (1)
Country | Link |
---|---|
US (2) | US20230409649A1 (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8886799B1 (en) * | 2012-08-29 | 2014-11-11 | Google Inc. | Identifying a similar user identifier |
US20180218241A1 (en) * | 2015-05-08 | 2018-08-02 | Guangzhou Ucweb Computer Technology Co., Ltd. | Webpage classification method and apparatus, calculation device and machine readable storage medium |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11637863B2 (en) * | 2020-04-03 | 2023-04-25 | Paypal, Inc. | Detection of user interface imitation |
US20210406255A1 (en) * | 2020-06-29 | 2021-12-30 | Forescout Technologies, Inc. | Information enhanced classification |
US11727077B2 (en) * | 2021-02-05 | 2023-08-15 | Microsoft Technology Licensing, Llc | Inferring information about a webpage based upon a uniform resource locator of the webpage |
Also Published As
Publication number | Publication date |
---|---|
US20230409900A1 (en) | 2023-12-21 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | AS | Assignment | Owner name: UAB 360 IT, LITHUANIA. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNORS: RAZINSKAS, DAINIUS; BRILIAUSKAS, MANTAS; REEL/FRAME: 063064/0387. Effective date: 20220614 |
| | STPP | Information on status: patent application and granting procedure in general | Free format text: FINAL REJECTION MAILED |
| | AS | Assignment | Owner name: UAB 360 IT, LITHUANIA. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNORS: GURINAVICIUTE, JUTA; LUMBRERAS, CARLOS ELISEO SALAS; SIGNING DATES FROM 20220614 TO 20220620; REEL/FRAME: 066537/0450 |
| | STPP | Information on status: patent application and granting procedure in general | Free format text: AMENDMENT AFTER NOTICE OF APPEAL |
| | STCV | Information on status: appeal procedure | Free format text: NOTICE OF APPEAL FILED |