WO2015196414A1

WO2015196414A1 - Batch-optimized render and fetch architecture

Info

Publication number: WO2015196414A1
Application number: PCT/CN2014/080832
Authority: WO
Inventors: Hao Fang; Erik Arjan Hendriks; Hui Xu; Cristian Tapus; Rupesh Kapoor
Original assignee: Google Inc.
Priority date: 2014-06-26
Filing date: 2014-06-26
Publication date: 2015-12-30
Also published as: EP3161668B1; US11328114B2; US9984130B2; CN106462582B; RU2659481C1; US20150379014A1; EP3161668A4; CN106462582A; US20180276220A1; EP3161668A1; JP2017526041A; JP6356273B2

Abstract

Implementations include a batch-optimized render and fetch architecture. An example method performed by the architecture includes receiving a request from a batch process to render a web page and initializing a virtual clock and a task list for rendering the web page. The virtual clock stands still when a request for an embedded item is outstanding and when a task is ready to run. The method may also include generating a rendering result for the web page when the virtual clock matches a run time for a stop task in the task list, and providing the rendering result to the batch process. Another example method includes receiving a request from a batch process to render a web page, identifying an embedded item in the web page, and determining, based on a rewrite rule, that the embedded item has content that is duplicative of content for a previously fetched embedded item.

Description

BATCH-OPTIMIZED RENDER AND FETCH ARCHITECTURE

BACKGROUND

[0001] The world-wide-web is a rich source of information. Today, there are estimated to be over one trillion unique web pages. Many of these pages are dynamically created, e.g., the home page of the New York Times, and have links to embedded content such as images and videos that can affect the content and appearance of the rendered web page. For example, when a browser executes script, such as JavaScript code, this can affect how a web page appears to a user and change the content and/or visual appearance of the page after the browser has finished rendering the web page. As another example, some web pages use style sheets that tell the browser how to change the appearance of text. A typical web page can have hundreds of such additional embedded items, some of which are specifically designed for or directed to the browser rendering engine. The additional information generated by the rendering process can be helpful to downstream systems, such as an Internet search engine. While it is relatively straightforward for a single user's web browser to render a single web page in real time, it is much more difficult to render a large number of pages, such as all of the pages on the world wide web (1 trillion pages) or even just the top 1% of pages on the world wide web (10 billion pages) in real time.

SUMMARY

[0002] Implementations include a rendering server and a fetch server optimized for batch rendering of web pages for a downstream user, such as a web page indexing system. When the downstream user identifies a web page (e.g., using its URL) with one or more embedded items, the downstream user may request that the rendering server render the URL to generate a rendering result. The rendering server can include many (e.g., tens of thousands of) rendering engines. Each rendering engine simulates a browser kernel optimized for batch rendering, including use of a virtual clock that eliminates many rendering errors. During rendering, as the rendering engine discovers embedded items, the rendering engine requests the embedded items from a fetch server. The fetch server includes a data store of embedded items, keyed by an identifier for each embedded item (e.g., its URL), and the content for that item as retrieved by a web-crawler. Before looking in the data store for the embedded item, the fetch server may rewrite the URL using rewrite rules. The rewrite rules may replace the URL with a redirect URL when content for the URL is a duplicate of another embedded item (e.g., represented by the redirect URL). If a requested embedded item is a duplicate, the fetch server may rewrite the URL to use the redirect URL, which allows already-retrieved content for the redirect URL to be used instead of fetching content for the requested URL. Such de-duplication methods can dramatically reduce the actual number of crawl requests made by the fetch server and improve response time of the rendering engine. The rewrite rules may also indicate a URL is blacklisted. In some implementations, the fetch server may store the dimensions, rather than the actual content, of embedded images. When a rendering engine requests an image, the fetch server may generate a mock image having the dimensions of the image and return the mock image to the rendering engine. When the rendering engine has finished rendering the web page, it may provide a rendering result to the downstream user, such as an indexing engine, which can use the information in the rendering result to enhance the processing of the web page.

[0003] In one aspect, a computer system includes at least one processor and memory storing a data store of content for embedded items and instructions that, when executed by the at least one processor, cause the system to perform operations. The operations include receiving a request from a batch process to render a web page and identifying an embedded item in the web page. The operations also include determining, based on a rewrite rule, that the embedded item has content that is duplicative of content for a previously fetched embedded item and, in response to the determination, providing the content for the previously fetched embedded item from the data store, generating a rendering result for the web page using the content for the previously fetched embedded item, and providing the rendering result to the batch process.

[0004] One or more of the implementations of the subject matter described herein can include one or more of the following features. For example, determining that the embedded item has content that is duplicative of content for a previously fetched embedded item can include matching the embedded item to a template of the rewrite rule, the rewrite rule including a redirect identifier. In such implementations, providing the content for the previously fetched embedded item includes using the redirect identifier to locate the content for the previously fetched embedded item, and/or the template may include a URL without a query string.

[0005] As another example, the embedded item may be a first embedded item and the operations may also include identifying a second embedded item in the web page, determining whether the second embedded item is blacklisted, returning an error when the second embedded item is blacklisted, without fetching content for the second embedded item, and generating the rendering result without the content for the second embedded item. As another example, the operations may include using a virtual clock when generating the rendering result, the virtual clock advancing independently of real time. As another example, the operations may include using a virtual clock when generating the rendering result, where the virtual clock does not advance while waiting for the provided content of the previously fetched embedded item.

[0006] As another example, the embedded item may be a first embedded item and the operations may include identifying a second embedded item in the web page, determining that the second embedded item includes an image, generating a mock image that specifies dimensions for the second embedded item using a dimension table, and using the mock image in generating the rendering result.

[0007] In another aspect, a computer implemented method includes receiving a request, from a batch process, to render a web page and initializing a virtual clock and a task list for rendering the web page, wherein the virtual clock stands still when a request for an embedded item is outstanding and when a task is ready to run. The method also includes generating a rendering result for the web page when the virtual clock matches a run time for a stop task in the task list and providing the rendering result to the batch process.

[0008] One or more of the implementations of the subject matter described herein can include one or more of the following features. For example, initializing the task list may include adding the stop task with a run time set to a predetermined time added to the virtual clock. The predetermined time may be at least 5 seconds. As another example, the method may also include advancing the virtual clock to a run time of a task in the task list when no requests for embedded items are outstanding and only tasks with run times greater than the virtual clock are in the task list. As another example, the method may also include identifying an embedded image in the web page, requesting content for the embedded image, receiving, in response to the request, a mock image that specifies dimensions for the embedded image but has empty content, and using the mock image in generating the rendering result. As another example, the batch process may be an indexing engine and the method further includes demoting a rank for the web page based on information in the rendering result and/or using the rendering result to index dynamically generated content.
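
By way of illustration only, the virtual clock and task list described above could be sketched as follows. The class and method names (`VirtualClockRenderer`, `add_task`, `request_started`, `request_finished`, `step`) are hypothetical and do not appear in this disclosure; the sketch assumes a priority queue ordered by run time and a simple counter of outstanding requests.

```python
import heapq
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass(order=True)
class Task:
    run_time: float                                   # virtual time at which the task may run
    seq: int                                          # tie-breaker that preserves insertion order
    action: Callable[[], None] = field(compare=False)
    is_stop: bool = field(default=False, compare=False)


class VirtualClockRenderer:
    """Hypothetical sketch of a task list driven by a virtual clock."""

    def __init__(self, stop_after: float = 5.0) -> None:
        self.now = 0.0            # virtual time in seconds, independent of real time
        self.outstanding = 0      # requests for embedded items still in flight
        self.finished = False     # set when the stop task runs
        self.tasks: List[Task] = []
        self._seq = 0
        # Stop task: run time is a predetermined time added to the virtual clock.
        self.add_task(stop_after, lambda: None, is_stop=True)

    def add_task(self, delay: float, action: Callable[[], None], is_stop: bool = False) -> None:
        heapq.heappush(self.tasks, Task(self.now + delay, self._seq, action, is_stop))
        self._seq += 1

    def request_started(self) -> None:
        self.outstanding += 1

    def request_finished(self) -> None:
        self.outstanding -= 1

    def step(self) -> bool:
        """Run one task if possible; return False when no progress can be made."""
        if self.finished or not self.tasks:
            return False
        head = self.tasks[0]
        if head.run_time > self.now:
            # Only future tasks remain. The clock stands still while a
            # request is outstanding; otherwise it jumps to the next run time.
            if self.outstanding > 0:
                return False
            self.now = head.run_time
        task = heapq.heappop(self.tasks)
        if task.is_stop:
            self.finished = True  # the rendering result would be generated here
        else:
            task.action()
        return True
```

In this sketch a caller would invoke `step()` repeatedly, calling `request_finished()` whenever the fetch server responds. Because the clock only jumps when nothing is outstanding, a slow fetch cannot cause a script timer to fire early.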

[0009] In another aspect, a method includes receiving a request from a batch rendering process for a Uniform Resource Locator (URL) of an embedded item in a web page and applying rewrite rules to determine a rewritten URL. The method may also include determining whether content for the rewritten URL exists in a data store and, when the content exists, providing the content to the batch rendering process. When the content does not exist, the method may include initiating a fetch of the content, wherein the batch rendering process is configured to wait without timing out during the fetch, receiving the content from a web-crawling engine, providing the content to the batch rendering process, and storing the content in the data store. The content may be used by the batch rendering process to generate a rendering result of the web page.

[0010] One or more of the implementations of the subject matter described herein can include one or more of the following features. For example, applying rewrite rules may include matching the URL to a template, the template being associated with a redirect URL, wherein when the URL matches the template, the redirect URL is determined to be the rewritten URL, and wherein when the URL fails to match a template, the URL is determined to be the rewritten URL. As another example, the method may also include determining that the content for the rewritten URL is stale based on a change rate or a type of the embedded item stored in the data store and, in response to the determination that the content for the rewritten URL is stale, receiving updated content from the web-crawling engine, updating the data store with the updated content, and providing the updated content as the content for the rewritten URL.

[0011] In another aspect, a computer system includes at least one processor and memory storing a table of dimensions stored by image identifier, and instructions that, when executed by the at least one processor, cause the system to perform operations. The operations can include identifying an embedded image in a web page, determining dimensions for the embedded image from the table of dimensions, and generating a mock image using the dimensions. The operations may also include generating a rendering result for the web page using the mock image.

[0012] In another aspect, a computer-implemented method includes receiving a request to render a web page from a batch process and identifying at least one embedded image in the web page. The method also includes receiving a mock image from a fetch server, the mock image having dimensions of the embedded image and empty content, and generating a rendering result for the web page using the mock image. In some implementations, the method may provide the rendering result to the batch process that requested the web page.

[0013] In another aspect, a non-transitory computer-readable medium may include instructions executable by at least one processor formed in a substrate that cause a computer system to perform one or more of the methods described above.

[0014] One or more of the implementations of the subject matter described herein can be implemented so as to realize one or more of the following advantages. As one example, because the batch rendering engine is not connected to input devices (e.g., keyboard, mouse) or output devices (e.g., display, touchscreen, etc.), the rendering engine can be simpler and sleeker than an actual browser renderer, for example having a machine-friendly API rather than a user-friendly API. Also, because the rendering engine does not need to display the final rendered page or interact with a user, the rendering engine can use a virtual clock that advances based on finished tasks rather than actual time, which can fast-track the rendering process and avoid common errors. For example, fetching in a batch environment can be much slower than in a personal web environment, which may lead to many time-out errors. The virtual clock hides the fetch latency, avoiding the time-out errors. The virtual clock also allows for more deterministic results. For example, in a URL that includes a date/time component, rather than replacing the date/time component with a fixed time, the system may use the value of the virtual clock. This means that not all time parameters in a web page will have the same value, but that each time a web page is rendered a particular time parameter will have the same value. This flexibility allows the system to advance time, which is important in some web pages for the correctness of the rendered result, while still ensuring the set of URLs requested remains the same across renders (which leads to fewer crawl requests). The system may also avoid fetching unnecessary items, e.g., blacklisted items. Storing dimensions for an image, rather than the actual content of the image, reduces the storage requirements for images in the fetch server, requires less data to be transferred to the rendering engine, and further improves the rendering time at the rendering engine. Rewriting the URL avoids fetching duplicative content, further speeding the batch rendering process.
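
As a purely illustrative sketch of the determinism noted above, the following shows how time parameters taken from a virtual clock can differ within one render yet repeat exactly across renders. The function names, the `example.com` URLs, the `t` query parameter, the starting clock value of 0, and the 250 ms step are all assumptions made for the example.

```python
from typing import List
from urllib.parse import urlencode


def cache_busting_url(base: str, virtual_now_ms: int) -> str:
    """Build a time-stamped URL the way a page script might, using virtual time."""
    return f"{base}?{urlencode({'t': virtual_now_ms})}"


def render_once(start_ms: int = 0) -> List[str]:
    """Hypothetical render in which two scripts read the clock at different virtual times."""
    clock_ms = start_ms
    urls = [cache_busting_url("http://example.com/a.js", clock_ms)]
    clock_ms += 250  # the virtual clock advanced to the next task's run time
    urls.append(cache_busting_url("http://example.com/b.js", clock_ms))
    return urls
```

Two calls to `render_once()` request the same set of URLs, so content fetched during the first render can be reused in the second.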

BRIEF DESCRIPTION OF THE DRAWINGS

[0015] FIG. 1 illustrates an example system in accordance with the disclosed subject matter.

[0016] FIG. 2 is a block diagram of a web page having embedded items.

[0017] FIG. 3 is a block diagram of a batch rendering engine, according to an implementation.

[0018] FIG. 4 is a flowchart illustrating an example process by which a batch rendering engine can render a web page having embedded objects, according to an implementation.

[0019] FIG. 5 is a flowchart illustrating an example process by which a batch rendering engine advances a virtual clock, according to an implementation.

[0020] FIG. 6 is a flowchart illustrating an example process by which a fetch server provides content for embedded items to a batch rendering engine, according to an implementation.

[0021] FIG. 7 is a flowchart illustrating an example process by which a fetch server provides mock images to a batch rendering engine, according to an implementation.

[0022] FIG. 8 shows an example of a computer device that can be used to implement the described techniques.

[0023] FIG. 9 shows an example of a distributed computer device that can be used to implement the described techniques.

DETAILED DESCRIPTION

[0024] To completely render a web page, the content of all of the embedded external resources in the web page must first be obtained. Such resources may include, but are not limited to, external images, JavaScript code, and style sheets. Often, the same external resource is embedded in many different web pages. While it is efficient for a single user's web browser to request an external web page resource such as the Google Analytics JavaScript code in real time (i.e., when the page in which the resource is embedded is rendered), it is neither feasible nor efficient for a batch rendering engine to do so. A batch rendering engine, for example for a web page indexing process, is designed to efficiently and quickly render a large number of web pages at a time. But fetching embedded external resources can be slow, and sometimes such resources are not important for the purposes of a batch process (e.g., without a human user to view the final rendered product). To improve processing time to render a web page in a batch environment, the rendering engine may work using a virtual clock, may work with a fetch server to avoid duplicative and unnecessary fetches, and may minimize the processing of visual or other user-oriented elements in the web page.

[0025] FIG. 1 is a block diagram of a system in accordance with an example implementation. The system 100 may be used to efficiently and quickly render web pages in a batch mode for a requesting process. The requesting process illustrated in system 100 is an indexing engine for an Internet search engine, but implementations are not limited to an indexing engine as the downstream user of the rendered web pages. For example, the requesting process may be an analysis engine to analyze a page to troubleshoot slowness or to determine if a tool, such as Google Analytics, is correctly set up, or an advertising system, or other systems that rely on automated interaction with complex web pages, e.g., filling out forms or clicking on elements. Thus, while the system 100 may be described as using batch-generated rendering results for indexing, the system 100 can be used for other batch systems where the information provided in a rendering result is useful.

[0026] The system 100 may be a computing device or devices that take the form of a number of different devices. For example, the system 100 may be a standard server, a group of such servers, a client-server system, or a rack server system. In addition, system 100 may be implemented in a personal computer. The system 100 may be an example of computer device 800, as depicted in FIG. 8, or computer device 900, as depicted in FIG. 9.

[0027] The system 100 includes a web-crawling engine 130, a requesting process, such as indexing engine 110, a render server 140, and a fetch server 150. The web-crawling engine 130, the render server 140, and the fetch server 150 work together to efficiently render a large number of web pages, such as web pages that can be found on the World Wide Web. The result of the render of a web page is a rendering result, which includes various data elements useful to and otherwise unavailable to the requesting process.

[0028] Indexing engine 110 can include one or more processors configured to execute one or more machine executable instructions or pieces of software, firmware, or a combination thereof to create index 115. For example, the indexing engine 110 may receive information from servers 190 via web-crawling engine 130. The indexing engine 110 may process the content of the received web pages to generate the index 115. Servers 190 may be any type of computing device accessible over the Internet that hosts one or more web pages or resources embedded in one or more web pages. The web pages accessed by the crawling engine 130 may include embedded items, such as style sheets, JavaScript, images, etc., some of which may alter the content and layout of a rendered web page. While indexing engine 110 can index what is provided via the crawling engine 130, the indexing engine can ask render server 140 to provide a browser-rendered rendering result of the web page, which includes layout information and dynamic content otherwise unavailable to the indexing engine 110. The indexing engine 110 can use the rendering result content to enhance the information available about the document in the index 115. For example, the indexing engine 110 may alter the rank of a text element in the web page based on the location or size of the text in the web page image. For instance, text appearing above-the-line (e.g., visible without scrolling) may be considered more important than text below-the-line. As another example, text in an advertisement may be considered less important to the web page. Furthermore, as some content is dynamically generated, e.g., not available until after the web page is rendered, the indexing engine 110 may use the rendering result to index dynamically generated content. Although not shown in FIG. 1 for the sake of brevity, in some implementations, the indexing engine 110 may be distributed over one or more separate computing devices.

[0029] Like indexing engine 110, query engine 120 may include one or more servers that use the index 115 to identify search results for queries 182, for example, using conventional or other information retrieval techniques. Query engine 120 may include one or more servers that receive queries 182 from a requester, such as client 180. The query engine 120 may identify documents responsive to the query using index 115, and provide information from the responsive documents as search results 184 to the requester. In some implementations, the query engine 120 may also use rendering results in rendering results data store 148 to provide a thumbnail as part of the search results 184. The query engine 120 may include a ranking engine that calculates scores for the documents responsive to the query, for example, using one or more ranking signals. One or more of the ranking signals can be based on content obtained from the rendering result associated with the document. The ranking engine may rank the documents found responsive to the query using the scores.

[0030] The system may also include web-crawling engine 130. The web-crawling engine 130 can include one or more processors configured to execute one or more machine executable instructions or pieces of software, firmware, or a combination thereof. The web-crawling engine 130 may be a computing device, such as a standard server, a group of such servers, a client-server system, or a rack server system. In some implementations, the web-crawling engine 130 may share components, such as memory or hardware processors, with other components of system 100, such as fetch server 150 or indexing engine 110. The web-crawling engine 130 may crawl web pages that can be found on the world-wide-web. When the web-crawling engine 130 receives the crawled web page, i.e., the contents for the crawled web page, the web-crawling engine 130 may provide the contents to the requester, which may be indexing engine 110 or fetch server 150. The web-crawling engine 130 may also store the contents in a data store (not shown) and provide the location to the requester. As used herein, the content of a web page refers to the HTML code that is provided to a web page rendering engine and used to render the web page for display in a web browser, and includes any links to external objects that are embedded in the web page, such as style sheets, JavaScript, other web pages, or image files. The web-crawling engine 130 may also be used by fetch server 150 to fetch the contents of these embedded items. The web-crawling engine 130 may provide the contents of embedded items to the fetch server 150 or may store the fetched contents in a data store, such as embedded item table 152. The web-crawling engine 130 may notify the requester when the embedded item has been crawled.

[0031] As previously mentioned, the system 100 includes a fetch server 150. The fetch server 150 can include one or more processors configured to execute one or more machine executable instructions or pieces of software, firmware, or a combination thereof. The fetch server 150 may be a computing device, such as a standard server, a group of such servers, a client-server system, or a rack server system. In some implementations, the fetch server 150 may share components, such as memory or hardware processors, with other components of system 100, such as render server 140, web-crawling engine 130, or indexing engine 110. The fetch server 150 is configured to request that the web-crawling engine 130 fetch content for a particular embedded item, e.g., by its URL, and receive the fetched content and crawl time of the requested embedded item. The fetch server 150 may receive the content and fetch time either directly from the web-crawling engine 130 or from the embedded item table 152, which the web-crawling engine 130 updates. The fetch server 150 may receive requests for the embedded items from rendering engines 142. The fetch server 150 may provide a response to the requesting rendering engine 142. The response may include the contents, either as-fetched or as-stored in the embedded item table 152, a mock image based on the image dimension table 156, or an error response. In some implementations, the fetch server 150 can provide the contents of an embedded item by sending the content and crawl time of the embedded item to the rendering engine 142 of the render server 140 that requested the embedded item. Alternatively, the fetch server 150 can provide the contents by notifying the rendering engine 142 that the content and crawl time of the embedded items are available via a specified location in the embedded item table 152, and the rendering engine 142 can retrieve the content and crawl time of the web page from that data store.

[0032] The fetch server 150 may apply URL rewrite rules 154 to the requested embedded items (e.g., the requested URLs). The URL rewrite rules 154 include rules for rewriting a URL when the URL is associated with content that is the same as another URL. This often occurs when a web-site owner wants the browser to download the content each time the resource is requested and, therefore, provides a dynamically-generated URL or a cache-busting URL. Such URLs often have a time stamp or a random string embedded as part of the URL that causes the URL to be unique each time a web page is rendered, e.g., by JavaScript that generates the cache-busting URL. However, the content for the dynamically-generated URL provided from the hosting server does not change, or does not change in a way meaningful for batch rendering purposes. The fetch server 150 may use URL rewrite rules 154 to more efficiently respond to requests for embedded items. For example, the URL rewrite rules 154 may include patterns or templates, and URLs matching the template of a rule return the same content, e.g., duplicative content. In some implementations, the templates may be determined by an offline or batch process that compares the content of various URLs using a fetch log and identifies a pattern in the URL common to the URLs with duplicative content. The fetch log may be maintained, for example, by the web-crawling engine 130 or the fetch server 150. Templates may also be user-entered. If a requested embedded item has a URL that matches one of the templates, the URL rewrite rules 154 may tell the fetch server 150 that the requested item is a duplicate and direct it to rewrite the requested URL with a redirect URL, which is associated with previously fetched content, e.g., a URL with contents in the embedded item table 152. The URL of the previously fetched embedded item may be considered a redirect URL. This allows the fetch server 150 to avoid unnecessary fetches, speeding its response to the requesting batch rendering engine 142 and eliminating stress on hosting servers caused by excessive fetch requests. Of course, if a requested URL does not match a template in the URL rewrite rules 154, rewriting the URL may result in no changes to the requested URL.
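
For illustration, template-based rewriting of a cache-busting URL might be sketched as below. The `RewriteRule` type, the `rewrite_url` function, the use of regular expressions as templates, and the `cdn.example.com` URLs are assumptions for the example; the disclosure does not prescribe a template syntax.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RewriteRule:
    template: re.Pattern   # pattern shared by URLs known to return duplicative content
    redirect_url: str      # URL whose content has already been fetched


def rewrite_url(url: str, rules: List[RewriteRule]) -> str:
    """Return the redirect URL of the first matching template, else the URL unchanged."""
    for rule in rules:
        if rule.template.fullmatch(url):
            return rule.redirect_url
    return url


# Hypothetical rule: any query string on lib.js maps to the URL without a query string.
RULES = [
    RewriteRule(
        re.compile(r"http://cdn\.example\.com/lib\.js(\?.*)?"),
        "http://cdn.example.com/lib.js",
    ),
]
```

With these rules, `rewrite_url("http://cdn.example.com/lib.js?cb=1403740800123", RULES)` yields the URL without the query string, so every time-stamped variant maps to a single key in the embedded item table.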

[0033] The URL rewrite rules 154 may also include patterns or templates for blacklisted URLs. If a requested embedded item matches a blacklisted URL pattern, the system may return a predetermined error rather than attempting to fetch content for the URL. Blacklisted URLs may be entered into the URL rewrite rules 154 by a user after determining that the content is not needed for batch rendering purposes. One example of this is the Google Analytics JavaScript code that many web pages include. This JavaScript code may not be considered important for the layout of the rendered page and does not need to be run for the purposes of the batch rendering engine. Thus, for rendering efficiency, some embedded items may be blacklisted using the URL rewrite rules 154. In some implementations, rather than returning an error for blacklisted URLs, the system may rewrite the URL, using a redirect URL as described above, to an entry in the embedded item table 152 that never expires and that has predetermined contents appropriate for the embedded item. In some implementations, the URL rewrite rules may flag a URL as blacklisted when it matches a template. The URL rewrite rules 154 can dramatically reduce the number of embedded items fetched via the web-crawling engine 130, improving the fetch server 150 response time to any request for resources and minimizing the fetch volume on any particular server 190. Minimizing the fetch volume ensures that the system does not overwhelm the server 190 with fetch requests. In some implementations, the fetch server 150 and/or the web-crawling engine 130 may be configured to limit the number of fetch requests directed at a server 190, and if requests exceed the limit, the system may begin to queue the requests. If the queue gets too large, the system may fail the fetch requests. Thus, the URL rewrite rules 154 can also minimize fetch volume.
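
The blacklist check and the per-server request limit described above could, as one hedged sketch, be combined in a single gate placed in front of the crawler. The `FetchGate` class, its status strings, and the numeric limits are hypothetical; the disclosure does not specify limit values or a queueing policy.

```python
import re
from collections import defaultdict, deque
from typing import Iterable, Optional
from urllib.parse import urlsplit


class FetchGate:
    """Hypothetical gate: blacklist check, per-host in-flight limit, bounded queue."""

    def __init__(self, blacklist: Iterable[str], max_in_flight: int = 2, max_queue: int = 3) -> None:
        self.blacklist = [re.compile(p) for p in blacklist]
        self.max_in_flight = max_in_flight
        self.max_queue = max_queue
        self.in_flight = defaultdict(int)   # host -> number of active fetches
        self.queues = defaultdict(deque)    # host -> URLs waiting for a slot

    def submit(self, url: str) -> str:
        if any(p.search(url) for p in self.blacklist):
            return "error"      # predetermined error; no fetch is attempted
        host = urlsplit(url).netloc
        if self.in_flight[host] < self.max_in_flight:
            self.in_flight[host] += 1
            return "fetching"
        if len(self.queues[host]) < self.max_queue:
            self.queues[host].append(url)
            return "queued"
        return "failed"         # the queue for this host is too large

    def complete(self, url: str) -> Optional[str]:
        """Release a slot; return the next queued URL for that host, if any."""
        host = urlsplit(url).netloc
        self.in_flight[host] -= 1
        if self.queues[host]:
            self.in_flight[host] += 1
            return self.queues[host].popleft()
        return None
```

Keeping the counters per host reflects the stated goal of not overwhelming any particular server 190, while requests to other hosts proceed unaffected.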

[0034] In some implementations, the fetch server 150 may include image dimension table 156. Image dimension table 156 may be a key-value store that associates an image URL with known dimensions for the image. The known dimensions may be determined when the image is fetched. Using the dimensions of a requested image, the fetch server 150 may generate a mock image that has the same dimensions as the requested image but empty content or simple tiles as content. The mock image is a valid image with the same dimensions as the requested image but not the same image data. Because the fetch server 150 fetches content for a batch rendering engine, the actual image may not be important to the rendering result, but the dimensions of the image may affect the layout of the rendered page. Using a mock image rather than the actual image makes the file size very small (e.g., only tens of bytes per image), which saves network bandwidth when transmitting the mock image and processor and memory resources for the batch rendering engine. In some implementations, the image dimension table 156 may be a key-value store, such as an SSTable, but the dimension table 156 may be any data structure that stores the dimensions by image identifier.
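
One possible way to produce such a mock image, offered only as a sketch, is to emit a minimal PNG whose header carries the stored dimensions and whose pixel data is all zeros. The choice of PNG, the 1-bit grayscale encoding, the function names, and the in-memory dictionary standing in for image dimension table 156 are assumptions; the disclosure does not name an image format.

```python
import struct
import zlib
from typing import Dict, Optional, Tuple

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def _chunk(kind: bytes, data: bytes) -> bytes:
    """Encode one PNG chunk: length, type, data, CRC over type and data."""
    body = kind + data
    return struct.pack(">I", len(data)) + body + struct.pack(">I", zlib.crc32(body) & 0xFFFFFFFF)


def mock_png(width: int, height: int) -> bytes:
    """Return an all-black, 1-bit grayscale PNG with the given dimensions."""
    # IHDR: width, height, bit depth 1, color type 0 (grayscale), default methods.
    ihdr = struct.pack(">IIBBBBB", width, height, 1, 0, 0, 0, 0)
    # Each scanline is a filter-type byte followed by the packed pixel bits.
    row = b"\x00" + b"\x00" * ((width + 7) // 8)
    idat = zlib.compress(row * height, 9)
    return PNG_SIGNATURE + _chunk(b"IHDR", ihdr) + _chunk(b"IDAT", idat) + _chunk(b"IEND", b"")


def mock_image_for(url: str, dimension_table: Dict[str, Tuple[int, int]]) -> Optional[bytes]:
    """Look up stored dimensions by image URL and build a mock image, if known."""
    dims = dimension_table.get(url)
    return mock_png(*dims) if dims else None
```

Because the zero-filled pixel data compresses very well, the resulting file stays small regardless of the original image size, while a layout engine reading the header still sees the correct width and height.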

[0035] The system 100 may include embedded item table 152. The embedded item table 152 may be keyed by URL and may store the fetched content for an embedded item returned from the web-crawling engine 130. In some implementations, the embedded item table 152 may also store a crawl history. For example, in some implementations the embedded item table 152 may include content fetched over a period of time, for example seven days, two weeks, etc. The embedded item table 152 may also include a change rate based on the crawling history. In some implementations, the embedded item table 152 may be implemented as a BigTable, a relational database, a Hadoop Distributed File System, etc. The fetch server 150 may use the embedded item table 152 to quickly return contents for previously fetched embedded items. Because the fetch server 150 can process requests for thousands of batch rendering engines, there is a high likelihood that a requested embedded item has been fetched before in response to an earlier fetch request. When the fetched contents are located in the embedded item table 152, the fetch server 150 may respond to the request using the contents in the embedded item table 152 rather than asking the web-crawling engine 130 to provide the contents. This eases the burden on the servers 190 that store the fetched contents and allows the fetch server 150 to respond more quickly to requests for embedded items. Fetch server 150 can further reduce crawl requests by de-duplicating URLs using URL rewrite rules 154.
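
The disclosure does not define how the change rate is computed from the crawl history. One plausible definition, shown here strictly as an assumption, is the fraction of consecutive crawls whose content differs; the `change_rate` function and the `(crawl_time, content)` tuple layout are hypothetical.

```python
import hashlib
from typing import List, Tuple


def change_rate(history: List[Tuple[float, bytes]]) -> float:
    """Fraction of consecutive crawls, ordered by crawl time, whose content changed."""
    if len(history) < 2:
        return 0.0  # a single crawl gives no evidence of change
    ordered = sorted(history, key=lambda entry: entry[0])
    digests = [hashlib.sha256(content).digest() for _, content in ordered]
    changes = sum(a != b for a, b in zip(digests, digests[1:]))
    return changes / (len(digests) - 1)
```

An item with a high change rate would be re-crawled sooner, while an item that never changed over the stored history could be served from the table for longer.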

[0036] If, at any stage of the rendering process, the content of one or more of the requested embedded items is not stored in the embedded item table 152 or is stale, the fetch server 150 may instruct the web-crawling engine 130 to schedule a crawl of the requested embedded item. Once the web-crawling engine 130 has crawled the requested embedded item, it notifies the fetch server 150. The fetch server 150 may then store the fetched content in the embedded item table 152, along with a crawl time. If the embedded item is an image, the fetch server 150 may, alternatively or additionally, store the dimensions of the fetched image in image dimension table 156, along with the crawl time. The fetch server 150 may then send the requested content, or for image files may send a mock image with the image dimensions, back to the requesting rendering engine 142.

[0037] The system 100 includes a render server 140. The render server 140 can include one or more processors configured to execute one or more machine executable instructions or pieces of software, firmware, or a combination thereof. The render server 140 may be a computing device, such as a standard server, a group of such servers, a client-server system, or a rack server system. In some implementations, the render server 140 may share components, such as memory or hardware processors, with other components of system 100, such as fetch server 150 or indexing engine 110. The render server 140 receives a request from the indexing engine 110, or other requesting process, to render a particular web page. In other words, the render server 140 may receive the URL of a requested web page. The render server 140 may include one or many rendering engines 142. In some implementations the render server 140 includes tens of thousands of rendering engines 142 and may perform load-balancing to select a rendering engine 142 to render the web page. Once a rendering engine 142 is selected, the rendering engine 142 attempts to render the web page to a rendering result. The web page may be referred to as the embedder web page because it typically includes additional embedded items.

[0038] Each rendering engine 142 is configured to emulate a renderer for a personal web browser, but with optimizations for batch rendering. Accordingly, after a rendering engine 142 receives the embedder web page it may begin to populate a task list, the tasks representing work the rendering engine 142 does to generate a rendering result. While many tasks may be scheduled to run immediately, some tasks may be scheduled in the future. One of the batch optimizations for rendering engine 142 is to use a virtual clock and add a task to the task list that indicates rendering is complete at a predetermined time. For example, in some implementations the task may indicate rendering is complete at the current time plus 20 seconds. The predetermined time may be based on a time in which a majority of web page designers will design a web page to look complete, e.g., any animations or layout changes are designed to be finished within the predetermined time. Because most users do not appreciate waiting very long for a page to load, the predetermined time can be between 5 and 20 seconds, although it may be longer for some situations. The rendering engine 142 will not take the entire 20 seconds due to the use of the virtual clock, and often a full render may occur in a few seconds if embedded items do not have to be crawled (e.g., the fetch server 150 can locate the contents in the embedded items table 152). Thus, a task that generates the final rendering result may be added to the task list, with a start time 20 seconds from the current time. The current time is based on the initialized time of the virtual clock, which can be zero or the current time of a real clock.

[0039] As part of the rendering process, e.g., one of the rendering tasks, the rendering engine 142 may determine whether the embedder web page includes any embedded items, such as stylesheets, image files, Javascript, and the like. These embedded items are referred to as primary embedded objects. If the web page does not contain any embedded objects, the rendering engine can immediately process the web page to a rendering result, and may store the rendering result in the rendering results data store 148. If, however, the web page contains embedded items, the rendering engine 142 may extract all embedded items and send a request to fetch server 150 for the content of the embedded items. The requested embedded items are each represented by a respective URL. The rendering engine 142, however, does not stop rendering or time out while waiting for the fetched resource. Rather, because the rendering engine 142 uses a virtual clock, as will be explained in more detail below, waiting for a resource to be fetched via the web-crawling engine 130 does not advance the clock and the rendering engine 142 does not time out.

[0040] When the content for a requested embedded item is received, the rendering engine 142 may add tasks to the task list to process the content. Part of processing the content may include discovering whether the requested embedded object (i.e., the primary embedded object) itself has embedded objects (i.e., secondary embedded objects). If the primary embedded object does not contain secondary embedded objects, the rendering engine 142 can continue working on rendering tasks (e.g., executing Javascript code, changing image properties). If, however, the primary embedded object contains one or more secondary embedded objects, the rendering engine 142 requests the secondary embedded objects from the fetch server 150. This process of discovering and requesting embedded objects is repeated until the rendering engine has discovered, requested, and received the content of all of the objects that are embedded in the web page to be rendered (e.g., primary, secondary, tertiary, etc.).
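The repeated discover-and-request cycle can be sketched as a breadth-first traversal. This is an illustration only: the rendering engine described here drives the cycle through tasks in its task list rather than a single loop, and `collect_embedded`, `fetch_embedded_urls`, and the `FAKE_WEB` mapping are hypothetical stand-ins for the parser and the fetch server.

```python
from collections import deque


def collect_embedded(page_url, fetch_embedded_urls):
    """Return {url: depth} for every object embedded under page_url.

    Depth 1 is a primary embedded object, 2 is secondary, and so on.
    fetch_embedded_urls(url) is a hypothetical callable that fetches url
    and returns the URLs embedded in its content.
    """
    depths = {}
    seen = {page_url}
    queue = deque([(page_url, 0)])
    while queue:
        url, depth = queue.popleft()
        for child in fetch_embedded_urls(url):
            if child in seen:  # request each object only once
                continue
            seen.add(child)
            depths[child] = depth + 1
            queue.append((child, depth + 1))
    return depths


# Hypothetical site: the page embeds a stylesheet and a script, and the
# stylesheet embeds a background image.
FAKE_WEB = {
    "http://example.com/": ["http://example.com/a.css", "http://example.com/b.js"],
    "http://example.com/a.css": ["http://example.com/bg.png"],
}

result = collect_embedded("http://example.com/", lambda u: FAKE_WEB.get(u, []))
```

Here the stylesheet and the script come back as primary objects (depth 1) and the background image as a secondary object (depth 2).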

[0041] Each embedded item request may be a task in the task list that is removed once the fetch server 150 returns the content for the item. When content is returned, the rendering engine 142 may add tasks for processing the content, which in turn may add additional tasks, such as changing the opacity on an image, running a script, etc. Each task may be associated with a run time. Some tasks may have a future run time. For example, to fade in (or out) an image, the browser may add several tasks to the task list, each changing the opacity of the image over intervals of time. As will be explained in more detail below, the rendering engine 142 may use a virtual clock rather than real-time in relation to the task list to determine when a task is ready to run.

[0042] The rendering engine 142 works on the rendering tasks in the task list until the rendering is complete, e.g., a rendering result is generated. The rendering engine 142 may then store the rendering result in the rendering results data store 148 and/or provide the rendering result to the requesting process (e.g., indexing engine 110). The requesting process, such as indexing engine 110, may then use information extracted from the rendering result in processing the web page. For example, the requesting process may use Javascript errors, layout information, style information, ad space information, a list of resources fetched, performance statistics, etc., all of which may be included in a rendering result but not otherwise available to the requesting process.

[0043] The system 100 may be in communication with the client(s) 180 and servers 190 over network 170. Network 170 may be, for example, the Internet, or the network 170 can be a wired or wireless local area network (LAN), wide area network (WAN), a combination of these, etc., implemented using, for example, gateway devices, bridges, switches, and/or so forth. Via the network 170, the query engine, the web-crawling engine 130 and/or the fetch server 150 may communicate with and transmit data to/from clients 180 and/or servers 190.

[0044] The system 100 may also include other components not illustrated for brevity. For example, one or more of the indexing engine 110, the query engine 120, the web-crawling engine 130, the render server 140, and the fetch server 150 may be distributed across one or more computing devices. Similarly, index 115, rendering results data store 148, embedded item table 152, and image dimension table 156 may also be stored across multiple computing devices. In some implementations the various components of system 100 may share hardware components of a computing device, or may be logical partitions of the same computing device.

[0045] FIG. 2 is a block diagram of a web page having embedded objects. As shown in the figure, a web page 200 can contain a plurality of embedded items. These embedded objects can include, but are not limited to, other web pages 210, style sheets 220, image files 230, so-called cache-busting URLs 240, and Javascript code 250. Additional, and different types of embedded objects, are of course possible. Moreover, each of the objects that are embedded in web page 200 may embed other objects. For example, a web page 210 that is embedded in web page 200 may embed other web pages, image files, stylesheets and the like. Likewise, a stylesheet 220 that is embedded in web page 200 may embed other objects such as a background image file. Further, each of the objects that are embedded in web page 210 or stylesheet 220 may themselves embed even more objects. To completely render such a web page to an image file, a batch rendering engine must request each of the embedded objects 210-250 (primary embedded objects), all of the objects (secondary embedded objects) that are embedded in the embedded objects 210-250, and all of the objects (tertiary embedded objects) that are embedded in the objects that are embedded in embedded objects 210-250, and so on.

[0046] As discussed above, while an individual user's web browser can efficiently request all of these embedded objects and use them to completely render and display web page 200 in real time, a batch rendering engine can be optimized so that it does not fetch duplicative or unnecessary content, so that it does not time out waiting for the content of embedded objects, and so that it finishes the rendering as quickly as possible, regardless of internal timing for tasks. Thus, to efficiently render a large number of crawled web pages to rendering results, a system such as that disclosed in FIG. 1 can be employed.

[0047] FIG. 3 is a block diagram of some components of batch rendering engine 142, according to an implementation. The batch rendering engine may include additional components not illustrated in FIG. 3. The batch rendering engine 142 includes a page task list 305, a virtual clock 310, and a rendering result 315. The virtual clock 310 may be used to warp the timeline for loading a web page and to avoid a multitude of errors that can occur due to waiting for fetched resources. The virtual clock 310 may be initialized to zero or the current clock time at the start of the rendering process and may advance only when the rendering engine is not waiting for a fetch of an embedded item and when there are no tasks in the page task list 305 that are ready to run at the current time. When the virtual clock is advanced, the rendering engine 142 advances the virtual clock 310 based on the page task list 305. In other words, the rendering engine 142 advances the virtual clock 310 to the time represented by the next-occurring task. In this sense, fetching an embedded item and running Javascript takes no virtual time, which can avoid an entire class of errors encountered by a live (or personal) browser. Furthermore, the rendering process may finish in real-time much faster than the times specified in the task list. For example, although the task "Generate Final Rendering" is set to occur at 20 seconds, the virtual clock typically advances to 20 seconds in a few actual seconds, depending on how long it takes to actually finish the tasks in the page task list 305. The "Generate Final Rendering" task in page task list 305 is an example of a stop task that tells the batch rendering engine 142 when the render is finished.

[0048] The rendering engine 142 may render a rendering result 315 of the embedder web page. The rendering result 315 may include a variety of components. For example, the rendering result 315 can include an image 316 of the rendered page. The image 316 may be the image that would be displayed to a user of a live (or personal) web browser and can be used, for example, to display a thumbnail of the rendered page to a user. The rendering result 315 can also include a Document Object Model (DOM) tree 317. The DOM tree 317 represents the HTML structure of the web page. For example, the system may determine tokens, or the text of a document, visible to a user, by processing the DOM tree. The rendering result 315 may also include layout 318. Layout 318 includes a box for each element of the web page, the box specifying the coordinates of the element in the image 316. For example, the layout can include box representations of DOM nodes in the DOM tree (although not every DOM element may have a corresponding render box). The boxes can be organized in a tree structure, also known as a render tree. Thus, for example, a table may be represented by a box in the layout, and a paragraph may be represented by another box in the layout. Thus, the layout 318 provides an indication of where on the web page an element occurs, how much space it takes on the web page, etc. Thus, the layout 318 provides information on how much of the web page is ads, how prominent a paragraph is (e.g., above-the-fold or below-the-fold), whether the element is visible, etc. The layout 318 thus provides geometric information about the elements of the web page. The rendering result 315 may also include errors 320. Errors 320 include errors encountered as a result of running script, e.g., Javascript. The rendering result 315 may also include a list of embedded resources 319 fetched during the rendering, and can include other elements generated as part of the rendering process. Thus, the rendering result 315 provides information not available to the requesting process solely via a fetch of content from the hosting server. An indexing engine, for example, can use the rendering result information to rank the element in the index, to avoid providing invisible elements as part of a snippet, and to index dynamically-generated content. Dynamically-generated content is content that exists after rendering the web page but not in the as-crawled content.
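One way to picture the components of a rendering result is as a simple record. The sketch below is an assumption-laden illustration: the class and field names are hypothetical, the layout is flattened to a list rather than a render tree, and the fold position of 600 pixels is an arbitrary example value.

```python
from dataclasses import dataclass, field


@dataclass
class Box:
    """One layout box: position and size of an element in the page image."""
    node_id: str
    x: int
    y: int
    width: int
    height: int
    visible: bool = True


@dataclass
class RenderingResult:
    image: bytes = b""                                       # image 316
    dom: dict = field(default_factory=dict)                  # DOM tree 317
    layout: list = field(default_factory=list)               # layout 318 (Box items)
    embedded_resources: list = field(default_factory=list)   # list 319
    errors: list = field(default_factory=list)               # errors 320


def above_the_fold(result, fold_y=600):
    """Return ids of visible boxes that start above an assumed fold line."""
    return [b.node_id for b in result.layout if b.visible and b.y < fold_y]


result = RenderingResult(
    layout=[
        Box("headline", 0, 40, 800, 60),
        Box("footer-ad", 0, 2400, 800, 90),
        Box("hidden-text", 0, 100, 800, 20, visible=False),
    ],
    errors=["ReferenceError: foo is not defined"],
)
prominent = above_the_fold(result)
```

A requesting process such as an indexing engine could use a query like `above_the_fold` to skip invisible elements when building a snippet.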

[0049] FIG. 4 is a flowchart illustrating an example process 400 by which a batch rendering engine can render a web page having embedded objects, according to an implementation. The process 400 may be performed by a system, such as system 100 of FIG. 1. The system may use process 400 to generate a rendering result of a web page in a batch mode at the request of a downstream process, such as an advertising system or an Internet indexing system. In some implementations, process 400 may be performed by a batch rendering engine of a rendering server and may be initiated in response to a request from a requesting process.

[0050] Process 400 may begin with receiving a request to render a web page (405). In some implementations the request may include the URL and/or the fetched content of the requested web page and associated metadata (e.g., crawl time). In some implementations, rather than receiving the content of the web page, the batch rendering engine can receive a notification that the content of the web page is available in a database and can retrieve the content and associated metadata (e.g., crawl time) from the database. The fetched content may be provided because, for example, the requesting process has already fetched the content. The batch rendering engine may begin rendering by initializing a virtual clock and adding a stop task to the task list (410). For example, the batch rendering engine may set the virtual clock to zero and add a stop task to the task list that causes the rendering engine to determine rendering is complete at a predetermined time. The run time associated with this stop task may be a time in which most web pages finish loading on an individual user's machine. For example, the time may be 15 or 20 seconds. As part of beginning the rendering, the batch rendering engine may also add other tasks to the task list, such as fetching the content for the web page (if the content was not provided), and processing the content for the web page. These tasks may be added with a virtual time of zero, so they can start immediately, for example.

[0051] The batch rendering engine may then begin working on the tasks in the task list (415). For example, as part of processing the content for the web page, the batch rendering engine may identify one or more embedded items (420). The batch rendering engine may then request the contents of the embedded items from a fetch server (425). The fetch server may be the fetch server 150 of FIG. 1. In some implementations, the batch rendering engine may keep track of which embedded items it identified and whether or not the fetch server has returned the content for the respective embedded items. In some implementations, this list of embedded items may be included in the rendering result for the web page. After the batch rendering engine has requested the embedded item, the batch rendering engine may continue working on other tasks (415) that are ready to run while waiting for the fetch server to return the contents. If there are no tasks ready to run at the current virtual time, the batch rendering engine may wait for a response from the fetch server. While a fetch is outstanding, the batch rendering engine does not advance the virtual clock and, thus, the batch rendering engine does not time out waiting for a fetch.

[0052] When a response from the fetch server is received, the batch rendering engine may process the content of the embedded item (430). For example, in response to receiving the content the batch rendering engine may add tasks, such as parsing the received content for embedded items, to the task list. These tasks may be given a start time of the current virtual clock, which indicates the task is ready to run (e.g., the current time on the virtual clock). Parsing received content, whether for the originally requested web page or for an embedded item, may cause the batch rendering process to add additional tasks to the task list. For example, parsing the content for an embedded item may discover additional embedded items (e.g., secondary embedded items), which may cause the batch rendering engine to request the embedded items and parse their content when they are returned. If the content includes script, for example Javascript, running the script may cause additional tasks to be performed, such as generating the layout or changing the appearance of one or more elements of the web page. Some of these tasks may be scheduled to start in the future. For example, changing the opacity of an image at fixed intervals makes the image appear to the user as if it is fading in. Each change in opacity is a task and the script may cause several such tasks to be added to the task list, each with a run time of the current virtual clock plus a specified amount.
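The fade-in example can be sketched with a priority queue ordered by run time. The helper names `add_task` and `schedule_fade_in`, the four steps, and the 0.25 second interval are all assumptions chosen for illustration; the description does not specify how the task list is stored.

```python
import heapq
import itertools

_counter = itertools.count()  # tie-breaker so tasks with equal run times stay FIFO


def add_task(task_list, run_time, name):
    """Insert a task keyed by its run time on the virtual clock."""
    heapq.heappush(task_list, (run_time, next(_counter), name))


def schedule_fade_in(task_list, now, steps=4, interval=0.25):
    """Queue one opacity-change task per step, each later than the last."""
    for i in range(1, steps + 1):
        add_task(task_list, now + i * interval, f"set opacity {i / steps:.2f}")


tasks = []
schedule_fade_in(tasks, now=2.0)  # the script runs at virtual time 2.0
schedule = [(run_time, name) for run_time, _, name in sorted(tasks)]
```

Each entry in `schedule` carries a run time equal to the virtual time at which the script ran plus a multiple of the interval, so none of the opacity tasks is ready to run until the virtual clock is advanced.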

[0053] As part of the rendering process, the batch rendering engine may determine whether the render is finished (435). This determination may be done, for example, each time the batch rendering engine completes a task or at predetermined time intervals, etc. The render may be finished when the virtual clock reaches the time specified in the stop task. Because the virtual clock does not advance while a fetch of an embedded item is outstanding, when the virtual clock does reach the time specified in the stop task the batch rendering engine is assured to have received a response for each fetch request. Thus, the batch rendering engine never times out waiting on a resource.

[0054] If the render is not finished (435, No), the batch rendering engine may continue working on tasks in the task list, waiting for a response for a request of one or more embedded items, etc. If the render is finished (435, Yes), the batch rendering engine may finalize a rendering result for the requested web page (440) and return the rendering result to the requesting process. Elements of the rendering result may have been previously generated as a result of tasks completed by the batch rendering engine. For example, the list of embedded items fetched and errors encountered while running script may be generated prior to the render being finished. Other elements, such as the layout, may be determined after the render is finished. In some implementations, the batch rendering engine does not determine the layout until after the render is finished unless a script run as part of the rendering process requests the location of an element. Even if the layout is generated prior to the rendering being finished, the batch rendering engine may generate the layout a final time as part of finalizing the rendering result. Thus, finalizing the rendering result may include generating new elements and collecting elements already generated. In some implementations, the batch rendering engine may store the rendering result in a memory and may provide the requesting process with the location of the rendering result. In some implementations, the system may store the rendering result with a timestamp indicating when it was generated and may store more than one version of the rendering result. Process 400 then ends, having generated a rendering result in batch mode, with optimizations for batch processing.

[0055] FIG. 5 is a flowchart illustrating an example process 500 by which a batch rendering engine advances a virtual clock, according to an implementation. Process 500 may be run as part of determining whether a render is finished (e.g., step 435 of FIG. 4), although it may be run at other times as well (e.g., periodically). Process 500 begins with determining whether the batch rendering engine is waiting for a request of an embedded item (505). For example, if the batch rendering engine requested an embedded item from the fetch server and has not yet received a response from the fetch server, the batch rendering engine is waiting. If the batch rendering engine is waiting (505, Yes), the virtual clock is not advanced and the batch rendering engine may work on tasks ready to run at the current virtual time, if they exist, or may wait (510). This step may be performed as part of step 415 of FIG. 4. If the batch rendering engine is not waiting for a fetch request (505, No), the batch rendering engine may determine whether there are tasks in the task list that are ready to run (515). For example, if a task in the task list has a run time that is equal to the virtual clock, the task is ready to run. If a task is ready to run (515, Yes), the batch rendering engine may work on the task (520). Working on the task may add other tasks to the task list, some of which may be ready to run and others of which may have a run time in the future (e.g., current virtual clock time plus some specified time). This step may also be performed as part of step 415 of FIG. 4. If there are no pending tasks ready to run (515, No), the batch rendering engine may advance the virtual clock to the next run time specified in the task list (525). In other words, the batch rendering engine warps the virtual clock forward so that the next task in line in the task list is ready to run.

[0056] If the next task in line in the task list is the stop task (530, Yes), the rendering is finished. If not, the batch rendering engine may continue to work on pending tasks (520). Process 500 demonstrates how the virtual clock is not advanced while there are pending tasks ready to run or while waiting for a fetch of an embedded item. Thus, the virtual clock "stands still" for these events, which can avoid a class of errors encountered when a rendering engine uses a real clock. Furthermore, process 500 demonstrates how the virtual clock can be warped forward, so that in some instances the rendering process can take less real-time than timing dictated by the tasks (e.g., waiting for intervals of time to fade-in an image or play an animation). This is especially true when embedded items can be returned without a crawl, as will be explained in more detail herein. Of course, it is understood that the order of checking for pending tasks (515) and fetch requests (505) can be reversed, and implementations are not limited to the order illustrated in FIG. 5.
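Process 500 can be sketched as a small simulation. The class `BatchRenderer`, its method names, and the callback style are hypothetical; the sketch checks for ready tasks before checking for outstanding fetches, which is the reversed order the description allows. Step numbers from FIG. 5 appear in the comments.

```python
import heapq
import itertools


class BatchRenderer:
    """Toy model of the task list and virtual clock of process 500."""

    def __init__(self, stop_after=20.0):
        self.clock = 0.0            # virtual clock, initialized to zero
        self.pending_fetches = 0    # fetches requested but not yet answered
        self.log = []               # (virtual time, task name) for each task run
        self._seq = itertools.count()
        self._tasks = []
        self.add_task(stop_after, "STOP", None)

    def add_task(self, delay, name, action):
        """Queue a task to run at the current virtual time plus delay."""
        entry = (self.clock + delay, next(self._seq), name, action)
        heapq.heappush(self._tasks, entry)

    def fetch_returned(self, name, action):
        """Record a fetch response and queue its processing task as ready."""
        self.pending_fetches -= 1
        self.add_task(0.0, name, action)

    def step(self):
        """Run one pass; return True once the stop task is reached."""
        run_time, _, name, action = self._tasks[0]
        if run_time <= self.clock:            # 515: a task is ready to run
            heapq.heappop(self._tasks)
            if name == "STOP":                # 530: render is finished
                return True
            self.log.append((self.clock, name))
            if action:
                action(self)                  # 520: may add more tasks
            return False
        if self.pending_fetches > 0:          # 505: a fetch is outstanding
            return False                      # 510: clock stands still
        self.clock = run_time                 # 525: warp to next run time
        return False


def request_image(renderer):
    renderer.pending_fetches += 1             # ask the fetch server for an image


def start_fade(renderer):
    renderer.add_task(5.0, "fade-in done", None)


r = BatchRenderer()
r.add_task(0.0, "parse page", request_image)
r.step()                                      # runs "parse page", issuing a fetch
r.step()
r.step()                                      # nothing ready, fetch outstanding
clock_while_waiting = r.clock
r.fetch_returned("process image", start_fade)
while not r.step():
    pass
```

In this run the clock stays at zero for as long as the fetch is outstanding, then jumps to 5.0 and to 20.0 without any real waiting, so the whole render completes in a handful of loop iterations.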

[0057] FIG. 6 is a flowchart illustrating an example process 600 by which a fetch server provides content of embedded items to a batch rendering engine, according to an implementation. Process 600 may be performed by a system, such as system 100 of FIG. 1. The system may use process 600 to respond to fetch requests for embedded items from a plurality of batch rendering engines. In some implementations, process 600 may be performed by a fetch server and may be initiated in response to a request from one of the batch rendering engines.

[0058] Process 600 may begin with the fetch server receiving a URL for an embedded item (605). The URL may be provided by a batch rendering engine and may be one of a plurality of URLs requested by the batch rendering engine. The fetch server may apply rewrite rules to the URL of the requested embedded item (610). The rewrite rules may be URL rewrite rules 154 of FIG. 1. A rewrite rule may include a template and a redirect URL. Applying a rewrite rule may include determining whether the URL matches the pattern or template for one of the rewrite rules. For example, the template may be a URL with any query strings removed and the system may remove the query string from the URL of the requested embedded item to see if it matches the template. As another example, the template may include wild-card characters, e.g., * and ?, that indicate places where any character or characters can match the wild-card characters.

[0059] If the URL does match the pattern, the rewrite rule may provide a redirect URL and the fetch server may substitute the URL of the requested embedded item with the redirect URL. One reason for applying rewrite rules is to allow the fetch server to identify URLs that return the same content and to use the redirect URL to avoid having to schedule an unnecessary fetch. Certain types of commonly embedded items have URLs that are dynamically generated. For example, the URLs of some embedded items depend upon a random number that is generated by a random number generator or on a current date and time that is returned by a date and time function. Embedded objects such as these, known as cache-busting tracking URLs, are commonly used to determine the number of unique hits or views of a web page for the purpose of determining advertising costs or revenues. While the contents of such embedded objects are usually identical, a unique URL is generated for the object each time it is discovered by a rendering engine. Thus, for web pages containing such embedded items, the rendering engine will see a new and different URL for the object each time it tries to render the web page, and without applying rewrite rules the fetch server would fetch the same content over and over. To avoid this, the rewrite rules may apply templates that allow the fetch server to identify these URLs and redirect a fetch request to previously-fetched content stored under a redirect URL.
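Template matching with query-string removal and wild cards can be sketched as follows. The rule list, the example URLs, and the function names are hypothetical, and the sketch combines both template styles by always stripping the query string before matching. It uses Python's `fnmatch` module for the `*` and `?` wild cards; that module also gives `[...]` a special meaning, which a production matcher would need to account for.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit, urlunsplit

# Hypothetical rules: (template, redirect URL). Templates are matched
# against the request URL with its query string removed.
REWRITE_RULES = [
    ("http://ads.example.com/pixel/*.gif", "http://ads.example.com/pixel/canonical.gif"),
    ("http://stats.example.com/hit", "http://stats.example.com/hit"),
]


def strip_query(url):
    """Drop the query string and fragment, keeping scheme, host, and path."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def rewrite(url, rules=REWRITE_RULES):
    """Return the redirect URL of the first matching rule, else url unchanged."""
    bare = strip_query(url)
    for template, redirect in rules:
        if fnmatchcase(bare, template):
            return redirect
    return url


a = rewrite("http://ads.example.com/pixel/839201.gif?t=1403740800")
b = rewrite("http://ads.example.com/pixel/17.gif")
c = rewrite("http://stats.example.com/hit?rnd=0.123")
d = rewrite("http://example.com/app.js?v=2")
```

The two pixel URLs collapse to one redirect URL, so content fetched for the first request can serve the second, while the unmatched script URL passes through unchanged.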

[0060] Another reason to apply rewrite rules is to identify blacklisted URLs. The rewrite rules may also include rules that identify blacklisted URLs, or a pattern or template for blacklisted URLs. For example, the rewrite rule may include a template and an associated redirect URL, error, or flag. If the URL for the requested embedded item matches a blacklisted URL or a template for a blacklisted URL, the fetch server may identify the URL as blacklisted. In some implementations, applying the rewrite rules may cause the URL to be replaced with a redirect URL. In some implementations, applying the rewrite rules may flag the URL as blacklisted, or may provide an error to return as the response to the request for the embedded item identified by the URL.

[0061] If the URL is blacklisted (615, Yes), the fetch server may return an error to the requesting batch rendering engine (620). The error may be a standard browser error indicating the resource could not be found, or a specific error that tells the rendering engine that the resource is not needed or can be skipped, etc. The error may be provided by the matching rewrite rule, from the embedded item table if the rewrite rule provided a redirect URL, selected based on a flag in the rewrite rule, hard coded, etc. The fetch request for this URL is then complete and process 600 ends.

[0062] If the URL is not blacklisted (615, No), the fetch server may look for the rewritten URL in the embedded items data store (625). The embedded items data store may be embedded item table 152 of FIG. 1. The rewritten URL may be the redirect URL provided by the rewrite rules, if the original URL matched a pattern identified in the rewrite rules. The rewritten URL may be the original URL if the URL did not match any templates in the rewrite rules. If the URL is in the embedded item data store (625, Yes), the fetch server may optionally determine whether the requested URL is for an image (630). This is optional, and in implementations that do not test for an image, step 630 can be omitted. Whether the requested embedded item is an image can be determined based on information in the request, the URL itself, or based on a field in the embedded item data store for the rewritten URL. If the embedded item is an image (630, Yes), the system may look in the dimensions table for the dimensions of the image and return a mock image having the dimensions, as explained in more detail with regard to process 700 of FIG. 7. It is also understood that in some implementations, the fetch server may perform step 630 prior to applying rewrite rules, prior to looking in the embedded item data store, or after determining if the entry is stale.

[0063] If the requested embedded item is not an image (630, No), the fetch server may determine whether the entry in the embedded items table is stale (645). Whether an entry is stale or not may depend on several factors, such as the change rate of the item, the type of embedded item (e.g., a script, stylesheet, image, etc.), the importance of the web page that the browser rendering engine is rendering, etc. In some implementations, the embedded item table may have a field or value that indicates the entry never goes stale, e.g., for a redirect URL of a blacklisted embedded item. If the entry is not stale (645, No), the fetch server may return the content in the embedded item table for the rewritten URL to the requesting batch rendering engine (650) and process 600 ends for this embedded item. In some implementations, returning the content may include the fetch server providing the location of the entry in the embedded item table as the response, and the batch rendering process accessing the content using the location.

[0064] If the entry in the embedded item table is stale (645, Yes) or if the rewritten URL is not in the embedded item data store (625, No), the fetch server may request a fetch of the URL from the web crawler, e.g., web-crawling engine 130 of FIG. 1 (635). When the fetch server receives the crawled content, it may store the received content, without massaging or further processing, as an entry in the embedded item data store (640). In some implementations, the fetch server can save the content and crawl time of the embedded item without overwriting the content and crawl time of a previous crawl of the embedded item. In some implementations, the fetch server may keep one entry in the embedded item table and may not preserve a previous crawl of the embedded item. Regardless, once saved in the embedded item table the content is cached and does not need to be fetched again until it becomes stale. The fetch server may then return the fetched content to the requesting batch rendering engine (650) and process 600 ends.
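The non-image path of process 600 can be sketched as a cache in front of a crawler. Everything named here is hypothetical: the `FetchServer` class, the injected `crawl` and `rewrite` callables, and in particular the fixed `max_age` threshold, which stands in for the staleness test that the description bases on change rate, item type, and page importance. The optional image test (630) is omitted, and only the latest crawl of each item is kept.

```python
import time


class FetchServer:
    """Toy fetch server: rewrite, blacklist check, cache lookup, crawl on miss."""

    def __init__(self, crawl, rewrite, blacklist=(), max_age=3600.0, now=time.time):
        self.crawl = crawl              # stands in for web-crawling engine 130
        self.rewrite = rewrite          # stands in for URL rewrite rules 154
        self.blacklist = set(blacklist)
        self.max_age = max_age          # assumed fixed staleness threshold (seconds)
        self.now = now
        self.table = {}                 # embedded item table: url -> (content, crawl_time)

    def fetch(self, url):
        url = self.rewrite(url)                                # 610
        if url in self.blacklist:                              # 615
            return ("error", "not needed")                     # 620
        entry = self.table.get(url)                            # 625
        if entry is not None:
            content, crawl_time = entry
            if self.now() - crawl_time <= self.max_age:        # 645: not stale
                return ("ok", content)                         # 650
        content = self.crawl(url)                              # 635
        self.table[url] = (content, self.now())                # 640
        return ("ok", content)                                 # 650


crawled = []


def fake_crawl(url):
    crawled.append(url)
    return f"<content of {url}>"


clock = [0.0]                            # controllable clock for the example
server = FetchServer(
    crawl=fake_crawl,
    rewrite=lambda u: u.split("?", 1)[0],
    blacklist={"http://tracker.example.com/beacon"},
    max_age=100.0,
    now=lambda: clock[0],
)
first = server.fetch("http://example.com/a.js?rnd=1")     # miss: crawled
second = server.fetch("http://example.com/a.js?rnd=2")    # hit after rewrite
clock[0] = 500.0
third = server.fetch("http://example.com/a.js")           # stale: crawled again
blocked = server.fetch("http://tracker.example.com/beacon?x=9")
```

Of the four requests only two reach the crawler: the second is answered from the table because both URLs rewrite to the same key, and the blacklisted URL is answered with an error without any lookup or crawl.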

[0065] FIG. 7 is a flow chart illustrating an example process 700 by which a fetch server provides mock images to a batch rendering engine, according to an implementation. Process 700 may be performed by a system, such as system 100 of FIG. 1. The system may use process 700 to respond to fetch requests for images embedded in a web page from a plurality of batch rendering engines. In some implementations, process 700 may be performed by a fetch server and may be initiated in response to a request from one of the batch rendering engines. In some implementations, the fetch server may execute process 700 independently of other embedded items (e.g., process 600 of FIG. 6). In other implementations, the fetch server may incorporate elements of process 700 into a process that includes other embedded items, e.g., process 600 of FIG. 6.

[0066] Process 700 may begin with the fetch server determining whether the requested image has an entry in an image dimensions table (705). The image dimensions table may be the image dimension table 156 of FIG. 1. The image dimensions table includes dimensions for the image, which are stored by an identifier, such as the URL, for the image. If the image is not in the dimensions table (705, No), or if the image is in the dimensions table (705, Yes) but is stale (710, Yes), the fetch server may schedule a fetch of the image (715), for example via a web-crawling engine such as web-crawling engine 130 of FIG. 1. In some implementations, the fetch server may use information in the dimensions table to determine whether the entry is stale. In some implementations, the fetch server may use information from a separate embedded items table, as described above with regard to step 645 of FIG. 6, to determine if the dimensions are stale. Thus, in some implementations, the fetch server may perform step 710 in conjunction with or as part of step 645 of FIG. 6. When the content for the image is received, the fetch server may add an entry for the image into the dimensions table, the entry including the dimensions for the fetched image (720). In some implementations, the fetch server may also store the fetched content in an embedded items table, as described above as part of step 640 of FIG. 6.

[0067] If the image is in the dimensions table (705, Yes) and is not stale (710, No), or after the image has been fetched and stored (720), the system may generate a mock image using the dimensions from the dimensions table (725). The mock image may have image file format data that specifies the same dimensions as the requested image but empty content. The system may return the mock image (730) to the requesting batch rendering engine and process 700 ends.
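
One way to picture a mock image with "the same dimensions but empty content" is a file that carries only format framing. The sketch below is an assumption for illustration: the description does not name an image format, and PNG is chosen here only because its header layout is simple. The helper names `make_mock_png` and `read_png_dimensions` are hypothetical. The output declares a width and height in its IHDR chunk but has no pixel data, so it is not a decodable image; the sketch assumes a renderer that reads only the header to compute layout.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def _chunk(kind: bytes, data: bytes) -> bytes:
    """Encode one PNG chunk: length, type, data, CRC over type and data."""
    crc = zlib.crc32(kind + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + kind + data + struct.pack(">I", crc)


def make_mock_png(width: int, height: int) -> bytes:
    """Step 725 (sketch): PNG framing that declares dimensions, no pixels."""
    # IHDR fields: width, height, bit depth 8, color type 6 (RGBA),
    # compression 0, filter 0, interlace 0.
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 6, 0, 0, 0)
    return PNG_SIGNATURE + _chunk(b"IHDR", ihdr) + _chunk(b"IEND", b"")


def read_png_dimensions(data: bytes) -> tuple:
    """Read width and height back from the IHDR chunk (bytes 16 to 23)."""
    return struct.unpack(">II", data[16:24])
```

The mock is 45 bytes regardless of the declared dimensions, which reflects the point of the technique: the batch rendering engine can lay out the page without the fetch server transferring image content.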

[0068] It is understood that in some implementations, some of the steps of process 700 may be optional or performed as part of other processing. For example, determining whether the dimensions for the image are stale may be performed as part of step 645 of FIG. 6 and may be based on information in an embedded items table. Additionally, step 715 may be performed as part of, or in conjunction with, step 635 of FIG. 6. In other words, the fetch server may combine aspects of process 700 with aspects of process 600, such as fetching content for images, determining whether cached fetched content is stale, etc. Of course, the fetch server may also perform process 700 completely independent of process 600. Thus, implementations may include variations of process 700.

[0069] FIG. 8 shows an example of a generic computer device 800, which may be operated as system 100, and/or client 170 of FIG. 1, which may be used with the techniques described here. Computing device 800 is intended to represent various example forms of computing devices, such as laptops, desktops, workstations, personal digital assistants, cellular telephones, smart phones, tablets, servers, and other computing devices, including wearable devices. The components shown here, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the inventions described and/or claimed in this document.

[0070] Computing device 800 includes a processor 802, e.g., a silicon-based hardware processor, memory 804, a storage device 806, and expansion ports 810 connected via an interface 808. In some implementations, computing device 800 may include transceiver 846, communication interface 844, and a GPS (Global Positioning System) receiver module 848, among other components, connected via interface 808. Device 800 may communicate wirelessly through communication interface 844, which may include digital signal processing circuitry where necessary. Each of the components 802, 804, 806, 808, 810, 840, 844, 846, and 848 may be mounted on a common motherboard or in other manners as appropriate.

[0071] The processor 802 can process instructions for execution within the computing device 800, including instructions stored in the memory 804 or on the storage device 806 to display graphical information for a GUI on an external input/output device, such as display 816. Display 816 may be a monitor or a flat touchscreen display. In some implementations, multiple processors and/or multiple buses may be used as appropriate, along with multiple memories and types of memory. Also, multiple computing devices 800 may be connected, with each device providing portions of the necessary operations (e.g., as a server bank, a group of blade servers, or a multi-processor system).

[0072] The memory 804 stores information within the computing device 800. In one implementation, the memory 804 is a volatile memory unit or units. In another implementation, the memory 804 is a non-volatile memory unit or units. The memory 804 may also be another form of computer-readable medium, such as a magnetic or optical disk. In some implementations, the memory 804 may include expansion memory provided through an expansion interface.

[0073] The storage device 806 is capable of providing mass storage for the computing device 800. In one implementation, the storage device 806 may be or include a computer-readable medium, such as a floppy disk device, a hard disk device, an optical disk device, or a tape device, a flash memory or other similar solid state memory device, or an array of devices, including devices in a storage area network or other configurations. A computer program product can be tangibly embodied in such a computer-readable medium. The computer program product may also include instructions that, when executed, perform one or more methods, such as those described above. The computer- or machine-readable medium is a storage device such as the memory 804, the storage device 806, or memory on processor 802.

[0074] The interface 808 may be a high speed controller that manages bandwidth-intensive operations for the computing device 800 or a low speed controller that manages lower bandwidth-intensive operations, or a combination of such controllers. An external interface 840 may be provided so as to enable near area communication of device 800 with other devices. In some implementations, controller 808 may be coupled to storage device 806 and expansion port 814. The expansion port, which may include various communication ports (e.g., USB, Bluetooth, Ethernet, wireless Ethernet) may be coupled to one or more input/output devices, such as a keyboard, a pointing device, a scanner, or a networking device such as a switch or router, e.g., through a network adapter.

[0075] The computing device 800 may be implemented in a number of different forms, as shown in the figure. For example, it may be implemented as a standard server 830, or multiple times in a group of such servers. It may also be implemented as part of a rack server system. In addition, it may be implemented in a personal computer such as laptop computer 832, desktop computer 834, or smart phone 836. An entire system may be made up of multiple computing devices 800 communicating with each other. Other configurations are possible.

[0076] FIG. 9 shows an example of a generic computer device 900, which may be system 100 of FIG. 1, which may be used with the techniques described here. Computing device 900 is intended to represent various example forms of large-scale data processing devices, such as servers, blade servers, datacenters, mainframes, and other large-scale computing devices. Computing device 900 may be a distributed system having multiple processors, possibly including network attached storage nodes, that are interconnected by one or more communication networks. The components shown here, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the inventions described and/or claimed in this document.

[0077] Distributed computing device 900 may include any number of computing devices 980. Computing devices 980 may include a server or rack servers, mainframes, etc. communicating over a local or wide-area network, dedicated optical links, modems, bridges, routers, switches, wired or wireless networks, etc.

[0078] In some implementations, each computing device may include multiple racks. For example, computing device 980a includes multiple racks 958a-958n. Each rack may include one or more processors, such as processors 952a-952n and 962a-962n. The processors may include data processors, network attached storage devices, and other computer controlled devices. In some implementations, one processor may operate as a master processor and control the scheduling and data distribution tasks. Processors may be interconnected through one or more rack switches 958, and one or more racks may be connected through switch 978. Switch 978 may handle communications between multiple connected computing devices 900.

[0079] Each rack may include memory, such as memory 954 and memory 964, and storage, such as 956 and 966. Storage 956 and 966 may provide mass storage and may include volatile or non-volatile storage, such as network-attached disks, floppy disks, hard disks, optical disks, tapes, flash memory or other similar solid state memory devices, or an array of devices, including devices in a storage area network or other configurations. Storage 956 or 966 may be shared between multiple processors, multiple racks, or multiple computing devices and may include a computer-readable medium storing instructions executable by one or more of the processors. Memory 954 and 964 may include, e.g., volatile memory unit or units, a non-volatile memory unit or units, and/or other forms of computer-readable media, such as magnetic or optical disks, flash memory, cache, Random Access Memory (RAM), Read Only Memory (ROM), and combinations thereof. Memory, such as memory 954, may also be shared between processors 952a-952n. Data structures, such as an index, may be stored, for example, across storage 956 and memory 954. Computing device 900 may include other components not shown, such as controllers, buses, input/output devices, communications modules, etc.

[0080] An entire system, such as system 100, may be made up of multiple computing devices 900 communicating with each other. For example, device 980a may communicate with devices 980b, 980c, and 980d, and these may collectively be known as system 100. As another example, system 100 of FIG. 1 may include one or more computing devices 900. Some of the computing devices may be located geographically close to each other, and others may be located geographically distant. The layout of computing device 900 is an example only and the system may take on other layouts or configurations.

[0081] Various implementations can include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor formed in a substrate, which may be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.

[0082] These computer programs (also known as programs, software, software applications or code) include machine instructions for a programmable processor, and can be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any non-transitory computer program product, apparatus and/or device (e.g., magnetic discs, optical disks, memory (including Read Access Memory), Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor.

[0083] The systems and techniques described here can be implemented in a computing system that includes a back end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front end component (e.g., a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back end, middleware, or front end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), and the Internet.

[0084] The computing system can include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.

[0085] A number of implementations have been described. Nevertheless, various modifications may be made without departing from the spirit and scope of the invention. In addition, the logic flows depicted in the figures do not require the particular order shown, or sequential order, to achieve desirable results. In addition, other steps may be provided, or steps may be eliminated, from the described flows, and other components may be added to, or removed from, the described systems. Accordingly, other implementations are within the scope of the following claims.

Claims

WHAT IS CLAIMED IS:

1. A computer system comprising:

at least one processor; and

memory storing:

a data store of content for embedded items, and

instructions that, when executed by the at least one processor, cause the system to:

receive a request, from a batch process, to render a web page,

identify an embedded item in the web page,

determine, based on a rewrite rule, that the embedded item has content that is duplicative of content for a previously fetched embedded item,

in response to the determination, provide the content for the previously fetched embedded item from the data store,

generate a rendering result for the web page using the content for the previously fetched embedded item, and

provide the rendering result to the batch process.

2. The system of claim 1, wherein as part of determining that the embedded item has content that is duplicative of content for a previously fetched embedded item, the instructions further include instructions that, when executed by the at least one processor, cause the system to:

match the embedded item to a template of the rewrite rule, the rewrite rule also including a redirect identifier,

wherein providing the content for the previously fetched embedded item includes using the redirect identifier to locate the content for the previously fetched embedded item.

3. The system of claim 2, wherein the template includes a URL without a query string.

4. The system of claim 1, wherein the embedded item is a first embedded item and the instructions further include instructions that, when executed by the at least one processor, cause the system to:

identify a second embedded item in the web page;

determine whether the second embedded item is blacklisted;

return an error when the second embedded item is blacklisted without fetching content for the second embedded item; and

generate the rendering result without the content for the second embedded item.

5. The system of claim 1, wherein the instructions further include instructions that, when executed by the at least one processor, cause the system to:

use a virtual clock when generating the rendering result,

wherein the virtual clock advances independently of real time.

6. The system of claim 1, wherein the instructions further include instructions that, when executed by the at least one processor, cause the system to:

use a virtual clock when generating the rendering result,

wherein the virtual clock does not advance while waiting for the provided content of the previously fetched embedded item.

7. The system of claim 1, wherein the embedded item is a first embedded item and the instructions further include instructions that, when executed by the at least one processor, cause the system to:

identify a second embedded item in the web page;

determine that the second embedded item includes an image;

generate a mock image that specifies dimensions for the second embedded item using a dimension table; and

use the mock image in generating the rendering result.

8. A method comprising:

receiving a request, from a batch process, to render a web page;

initializing, using at least one processor, a virtual clock and a task list for rendering the web page, wherein the virtual clock stands still when a request for an embedded item is outstanding and when a task is ready to run;

generating, using the at least one processor, a rendering result for the web page when the virtual clock matches a run time for a stop task in the task list; and

providing the rendering result to the batch process.

9. The method of claim 8, wherein initializing the task list includes adding the stop task with a run time set to a predetermined time added to the virtual clock.

10. The method of claim 8, wherein the batch process includes an indexing engine and the method further comprises using the rendering result to rank tokens in an index.

11. The method of claim 8, further comprising advancing the virtual clock to a run time of a task in the task list when no requests for embedded items are outstanding and only tasks with run times greater than the virtual clock are in the task list.

12. The method of claim 8, further comprising:

identifying an embedded image in the web page;

requesting content for the embedded image;

receiving, in response to the request, a mock image that specifies dimensions for the embedded image but has empty content; and

using the mock image in generating the rendering result.

13. The method of claim 8, wherein the batch process is an indexing engine and the method further comprises demoting a rank for the web page based on information in the rendering result.

14. The method of claim 8, wherein the batch process is an indexing engine and the method further comprises using the rendering result to index dynamically generated content.

15. A method comprising:

receiving a request from a batch rendering process for a Uniform Resource Locator (URL) of an embedded item in a web page;

applying, using at least one processor, rewrite rules to determine a rewritten URL;

determining, using the at least one processor, whether content for the rewritten URL exists in a data store;

when the content exists, providing the content to the batch rendering process; and

when the content does not exist:

initiating a fetch of the content, wherein the batch rendering process is configured to wait without timing out during the fetch,

receiving the content from a web-crawling engine,

providing the content to the batch rendering process, and

storing the content in the data store.

16. The method of claim 15, further comprising using, by the batch rendering process, the content to generate a rendering result of the web page.

17. The method of claim 16, wherein the rendering result includes layout information and dynamically-generated content.

18. The method of claim 15, wherein applying rewrite rules includes:

matching the URL to a template, the template being associated with a redirect URL,

wherein when the URL matches the template, the redirect URL is determined to be the rewritten URL, and

wherein when the URL fails to match a template, the URL is determined to be the rewritten URL.

19. The method of claim 15, the method further comprising:

determining that the content for the rewritten URL is stale based on a change rate or a type of the embedded item stored in the data store; and

in response to the determination that the content for the rewritten URL is stale:

receiving updated content from the web-crawling engine,

updating the data store with the updated content, and

providing the updated content as the content for the rewritten URL.