GB2577742A - Data processing apparatus and method - Google Patents


Info

Publication number
GB2577742A
Authority
GB
United Kingdom
Prior art keywords
text, generated, data processing, image, processing apparatus
Legal status
Withdrawn
Application number
GB1816254.5A
Inventor
Santer Mike
Senior Richard
Hall-May Martin
Aubrey-Jones Tristan
Kalkis Jurgis
Current Assignee
Blupoint Ltd
Original Assignee
Blupoint Ltd
Application filed by Blupoint Ltd
Priority to GB1816254.5A
Priority to PCT/GB2019/052765 (WO2020070483A1)
Publication of GB2577742A

Classifications

    • G06F 40/154 - Handling natural language data; text processing; tree transformation for tree-structured or markup documents, e.g. XSLT, XSL-FO or stylesheets
    • G06F 40/143 - Handling natural language data; text processing; tree-structured documents; markup, e.g. Standard Generalized Markup Language [SGML] or Document Type Definition [DTD]
    • H04N 21/234336 - Selective content distribution; reformatting of video signals by media transcoding, e.g. video is transformed into a slideshow of still pictures or audio is converted into text
    • G10L 13/00 - Speech synthesis; text to speech systems

Abstract

Information of a predetermined type (e.g. text in a web page, PDF or word-processing document) is extracted from an electronic document and used to generate images of different portions of that information (e.g. large subtitles, diagrams or figures), which are assembled into a video whose soundtrack may be text-to-speech audio synchronised with the displayed images. The information may be identified by, for example, HTML tags according to its hierarchy level (e.g. <main>, <body> etc.). The generated video may then be transmitted (e.g. in 3GP format), giving users with poor literacy, poor eyesight or very basic phones access to text documents.

Description

DATA PROCESSING APPARATUS AND METHOD
BACKGROUND
Field of the Disclosure
The present disclosure relates to a data processing apparatus and method.
Description of the Related Art
The "background" description provided herein is for the purpose of generally presenting the context of the disclosure. Work of the presently named inventors, to the extent it is described in the background section, as well as aspects of the description which may not otherwise qualify as prior art at the time of filing, are neither expressly or impliedly admitted as prior art against the present disclosure.
For many people, particularly those who struggle with literacy, have poor eyesight and/or have only very basic digital devices, content such as textual web content, electronic documents, articles or the like can be difficult or impossible to access, read and understand. This is particularly true in the developing world, where users are frequently poorly educated and have no access to tablets or PCs, but may instead have phones with small screens and very limited browser applications. The irony is that although this demographic may find such digital educational resources hard to access, they are arguably the people who most need to access and assimilate these resources in order to further their education and thereby improve their prospects.
One potential solution to this problem is to take such content and convert it into a form that is more accessible for such users. Currently, however, such a conversion must be done manually.
This is labour intensive and time consuming. It is therefore difficult for large corpuses of existing digital content to be converted into a form that is more widely accessible with minimal cost to the end user.
SUMMARY
The present disclosure is defined by the claims.
The foregoing paragraphs have been provided by way of general introduction, and are not intended to limit the scope of the following claims. The described embodiments, together with further advantages, will be best understood by reference to the following detailed description taken in conjunction with the accompanying drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
A more complete appreciation of the disclosure and many of the attendant advantages thereof will be readily obtained as the same becomes better understood by reference to the following detailed description when considered in connection with the accompanying drawings, wherein: Figure 1 schematically shows a data processing apparatus; Figures 2A to 2C schematically show an electronic document to be converted; Figure 3 schematically shows a data structure for facilitating conversion of the electronic document; Figures 4A to 4T schematically show slide images generated based on extracted information of the electronic document; Figures 5A to 5D schematically show playback of a video file generated using the generated slide images; and Figure 6 schematically shows a data processing method.
DESCRIPTION OF THE EMBODIMENTS
Referring now to the drawings, wherein like reference numerals designate identical or corresponding parts throughout the several views.
In an embodiment, a data processing apparatus of the present disclosure automatically takes suitable digital content (e.g. web-pages, electronic documents, articles and the like) and converts (transliterates) it into video and/or audio files that can be played on basic mobile devices. Such basic mobile devices may have small screens and limited web browsers / document viewers or, in the case of audio files, may lack a screen altogether. This transliteration capability may be hosted on servers in the cloud, or on portable low-power devices deployed close to end users, so that those users can access the resources without needing their own connection to the Internet. One particular application of this innovation is to transliterate web-pages on demand, either in the cloud or on such a portable device, so that large corpuses of content or even live web content can be accessed as vocalized audio and/or videos, without having to pre-transliterate and store all the content that is to be made available.
The audio files, once generated, contain a vocalized version of the source document, and so can be enjoyed even by blind or illiterate users that have effective hearing and sufficient understanding of the language in which the documents are written. The video files contain not only a similar vocalized audio track, but also any images/figures in the document, and the document text shown as subtitles which are synced with the audio vocalization. The videos can therefore not only be understood by deaf / illiterate users, but can also help improve the literacy of sighted users by allowing them to see the document text as subtitles at the same time as hearing the vocalized equivalent. The subtitles are displayed together with any relevant images, figures and/or tables. This allows sighted users to see what they are hearing in small bite sized chunks that can be easily paused and replayed, thereby reinforcing learning. Both audio and video versions of these resources can also be played on devices with only basic multimedia capabilities (i.e. no sort of web-browser), or in the case of the audio, can be played from a local hub device, either directly through a speaker, or via a local FM radio broadcast.
In an embodiment, a data processing apparatus of the present disclosure automatically generates video resources from an electronic document such as a webpage, Portable Document Format (PDF) document, Microsoft Word document, XML (Extensible Markup Language) feed, Microsoft PowerPoint® presentation or the like through the following steps:
1. Read in the source document, either directly from a file system of local storage (e.g. storage medium 103 - see below), or from a remote location via a network (e.g. the internet).
2. Identify the main content of the source document that should be transliterated.
3. Convert that content into a simple structured form comprising a hierarchy of optionally titled sections, which may contain images, videos and/or paragraphs of text.
4. Convert the structured form into a "plan" or sequence of slides, each of which includes either an embedded video, or an optional image, heading and/or small piece of associated text.
5. For each "slide" in the plan, use a text-to-speech engine (or a human) to generate an audio version of the heading and text on that slide, render the slide's text and optional image as an image, convert that image into a video section of the same length as the audio, and dub the audio onto it.
6. Concatenate these video sections together into a full video sequence, with appropriate pauses between sections, and book-ended with appropriate start and end video sections.
7. Convert the resultant video into a format such as 3GP that can be played on the target device.
8. Broadcast or send this video to a consumer device using Wi-Fi, Bluetooth, FM-radio or similar.
The audio files are generated in a similar way up to step 4, except that any graphical elements are ignored (apart from, for example, any meaningfully long image captions). Audio sections, or even the complete audio file, are then created by vocalizing the resulting plan text and converting it into an appropriate format such as MP3, which is then played or transmitted.
An optional step that could be inserted, for example, between steps 1 and 2, or after step 3, is to use a natural language translation engine to automatically translate the document text from its source language into an alternate language that might be more familiar to the target users.
Such a process could even be used to produce audio/video versions of the source document in many different languages, so that it could be understood by users in a wide range of different geographical locations.
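The sketch below illustrates, in Python, how steps 1 to 8 (and the optional translation step) might be orchestrated. It is not taken from the disclosed implementation; every helper function named here is a placeholder for the corresponding component described above.

```python
def document_to_video(source, target_format="3gp", translate_to=None):
    # Hypothetical orchestration of the pipeline; each helper is a placeholder.
    html = read_source(source)                    # 1. local file or remote URL
    content_root = select_content_root(html)      # 2. identify the main content
    ast = build_ast(content_root)                 # 3. simple structured form
    if translate_to is not None:
        ast = translate_ast(ast, translate_to)    # optional language translation
    slides = plan_slides(ast)                     # 4. slide plan
    sections = [render_section(slide) for slide in slides]  # 5. image + dubbed audio
    video = concatenate_sections(sections)        # 6. single video sequence
    deliverable = transcode(video, target_format) # 7. e.g. 3GP for the target device
    deliver(deliverable)                          # 8. Wi-Fi, Bluetooth, FM radio, ...
```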
An example implementation of the above-mentioned process is now described. This example relates to conversion of a webpage as the source document. However, it will be appreciated that an equivalent technique is applicable to documents in other formats such as PDF documents, Word documents, XML feeds and the like.
After reading or downloading the source document, the first step is to identify the part or parts of the document that are salient and warrant conversion. In the case of a webpage this may be achieved by looking for the document's <main> tag, which will normally exclude any menus / footer text that shouldn't be included in the conversion. If no such <main> tag exists, then the system can take the <body> element, search for the page's title heading, which will either be in an <h1> element or will be the largest text on the page, and use this as the starting point for the conversion. Once the section of the document to be converted has been identified, the system can start converting it into an intermediate simple abstract form (data structure) which can be used as the source for the slide plan generation step. The extraction of the part(s) of the document to be converted and the generation of an abstract data structure from these extracted parts may be referred to as pre-processing. By carrying out the pre-processing, unwanted content is discarded, thus simplifying the slide plan generation step. It also means that multiple pre-processors can be implemented for different document types, all of which can then be followed by a single slide plan generator implementation.
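As a rough illustration of this selection step, the following sketch uses the BeautifulSoup library (which the disclosure does not name); it prefers the <main> element and falls back to the <body> element.

```python
from bs4 import BeautifulSoup

def select_content_root(html: str):
    """Return the element from which conversion should start."""
    soup = BeautifulSoup(html, "html.parser")
    main = soup.find("main")
    if main is not None:
        return main  # <main> normally excludes menus and footer text
    body = soup.find("body") or soup
    # A fuller implementation would locate the page's title heading
    # (an <h1>, or the largest text on the page) and start from there.
    return body
```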
In the current example the abstract data structure output by the pre-processing is an abstract syntax tree (AST) comprising sections with titles, images (with optional captions), videos and/or blocks of text. This tree is generated by traversing the selected section of the webpage's DOM (document object model) and creating AST nodes from HTML (HyperText Markup Language) elements, while carrying a stack of sections defined by heading elements (<h1>, <h2>, <h3> etc.) that are used to define the hierarchy of the AST generated from the source document. For example, <img> tags are converted into image nodes annotated with the relevant image URL, and blocks of text are converted into text nodes. Image nodes are annotated with any caption in the <img> tag's "alt" attribute or a connected <caption> element, for example, so that these image descriptions are also displayed and vocalized in any generated video and/or audio files.
When a heading tag (e.g. <h2>) is encountered it is used to start a new section, annotated with that heading tag's title text. If the heading tag's number (e.g. 2) is greater than the depth of the current section, it begins a new child section. If it is equal to the current section's depth, it creates a sibling, and if it is less than the current section's depth, it ends the current section and creates a new section at the parent level. This process takes a complicated piece of HTML mark-up and converts it into an intermediate AST form that contains just the information necessary to generate a highly accessible video or audio file from the document. In some embodiments, the input web-page may be annotated with extra tags (e.g. by the creator of the webpage who wishes the video and/or audio file generated from the webpage to have a particular format) that act as instructions to the pre-processor. For example, the extra tags may tell the pre-processor either to ignore certain content (e.g. using a tag in the form <div data-bp-videogen-meta="ignoreElement">Text not to include</div>, where "Text not to include" will then not be included in the generated video image), or to treat certain content in a non-standard way. One example of a non-standard treatment is to provide further information indicating how to vocalize certain content (e.g. by providing a tag <span data-videogen-meta='{ "ssml": { "say-as": { "interpret-as": "number" } } }'>1234</span>, the textual content "1234" will be vocalized as the number "one thousand, two hundred and thirty-four"). This provides a highly efficient assisted automatic system, where just enough user intervention is required to guide the automatic process and thereby quickly increase the quality of the output.
In this process, HTML mark-up that cannot be sensibly converted to audio / video (such as <form> elements) is ignored. Any audio / video already embedded in the document can be included in the AST via special audio / video nodes. Ordered and unordered lists are converted into blocks of text in which the (optionally numbered) items are separated by commas.
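A minimal sketch of this pre-processing, continuing the BeautifulSoup-based example above, is shown below. It maintains a stack of sections driven by heading level and turns <img> and text elements into leaf nodes; the node representation and the set of handled tags are illustrative assumptions, and special audio / video nodes and the meta-tag annotations are omitted for brevity.

```python
from dataclasses import dataclass, field

@dataclass
class SectionNode:
    title: str = ""
    level: int = 0          # 0 = document root, 1 = <h1>, 2 = <h2>, ...
    children: list = field(default_factory=list)

def build_ast(content_root):
    """Convert a BeautifulSoup element into a simple section/image/text tree."""
    root = SectionNode()
    stack = [root]
    for el in content_root.find_all(["h1", "h2", "h3", "h4", "p", "img"]):
        if el.name.startswith("h"):
            level = int(el.name[1])
            while stack[-1].level >= level:   # close sections at or below this level
                stack.pop()
            section = SectionNode(title=el.get_text(strip=True), level=level)
            stack[-1].children.append(section)
            stack.append(section)
        elif el.name == "img":
            stack[-1].children.append({"type": "image",
                                       "url": el.get("src"),
                                       "caption": el.get("alt", "")})
        else:  # paragraph text becomes a text leaf
            text = el.get_text(" ", strip=True)
            if text:
                stack[-1].children.append({"type": "text", "text": text})
    return root
```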
Once the intermediate AST is generated, the system generates the slide plan by using a visitor pattern to do a depth-first traversal of the AST, thereby accumulating a sequence of slides to simulate how a human would naturally read the document from beginning to end.
When the visitor first enters a section node it generates a heading slide with the section's title in large text. The system then looks ahead to see if that section contains an image, and if it does, the heading slide will also show a small version of the image above the heading text, to help illustrate to the user what that section will contain.
Image and paragraph nodes are always leaves of the AST. When an image leaf is visited, that image is used to generate one or two slides in the plan. If that image node is annotated with a caption, then the visitor generates a slide that shows the caption with a scaled-down copy of the image, which will be shown while the audio version of the caption is played. A subsequent slide is then created which is shown for a fixed duration (e.g. 2 to 5 seconds) with the image being shown at a large size so that the user can see it in detail.
When a text node is visited, that text is used to create a slide that carries all of that text, with leading and trailing whitespace removed. The slide plan to video conversion then handles chopping that text up into sensible fragments that can be displayed and vocalized in a video section. One useful feature that can be included in this phase is a replacements engine that can be defined to replace certain substrings in the document's text with forms that can be more easily displayed / vocalized / understood. For example, the acronym "AST" may be pronounced incorrectly by the text-to-speech engine, so may be changed to "A.S.T." by the slide plan to video conversion.
Any video nodes in the AST are simply included as special "embedded video" slides which link to the underlying video files to be embedded.
Once the visitor has traversed the entire AST accumulating a sequence of slides (i.e. planned video sections), the plan is ready to be converted into an audio and/or video file.
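The depth-first traversal can be sketched as follows, continuing the illustrative node representation above. The exact slide fields (heading, image, text) are assumptions made for the purpose of the example, and embedded-video slides and the replacements engine are omitted for brevity.

```python
def plan_slides(node, slides=None):
    """Accumulate a slide plan by depth-first traversal of the AST."""
    if slides is None:
        slides = []
    if isinstance(node, SectionNode) and node.title:
        # Heading slide, with a small preview of the first image in the section.
        preview = next((c for c in node.children
                        if isinstance(c, dict) and c.get("type") == "image"), None)
        slides.append({"kind": "heading", "text": node.title,
                       "image": preview["url"] if preview else None})
    for child in getattr(node, "children", []):
        if isinstance(child, SectionNode):
            plan_slides(child, slides)
        elif child["type"] == "image":
            if child.get("caption"):
                slides.append({"kind": "captioned_image",
                               "image": child["url"], "text": child["caption"]})
            slides.append({"kind": "full_image", "image": child["url"]})
        else:
            # Text node: whitespace is stripped here; the text is later split
            # into displayable phrases by the slide plan to video conversion.
            slides.append({"kind": "text", "text": child["text"].strip()})
    return slides
```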
Once the slide plan has been generated, every slide in the plan is converted into a video section. Each slide can be converted into a video section in parallel if the system has sufficient computational resources. Each slide in the plan is defined as images and/or blocks of text, which the renderer then converts into one or more images (which are then further converted into respective video sections). If the subtitle text cannot fit in the space below any optional image, that text is split across multiple images, each of which displays the image (if one is defined) and a portion of the text.
Once the text has been split up such that each phrase fits at the required font size as subtitles on a given image, these phrases are all vocalized using a text-to-speech engine, which may be a standalone tool or an online web service. Alternatively, each phrase may be vocalized by a real person (this is useful for less widely spoken languages for which text-to-speech technology of a suitable quality is less readily available, for example). To improve the performance of transliteration re-runs, the audio file generated for each phrase is cached in a database (stored in the storage medium 103 of the data processing apparatus, for example - see below) against a cryptographic hash of the phrase text, so that this computationally / fiscally expensive process need only take place once for each input string.
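A minimal sketch of such a phrase cache is shown below. Here `synthesize` stands in for whichever text-to-speech engine, web service or human recording workflow is in use, and the cache is shown as a directory of files keyed by a SHA-256 hash rather than the database mentioned above.

```python
import hashlib
import os

CACHE_DIR = "tts_cache"  # illustrative location on the storage medium

def vocalize_cached(phrase: str, synthesize) -> str:
    """Return the path of the audio file for `phrase`, synthesising it at most once."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    key = hashlib.sha256(phrase.encode("utf-8")).hexdigest()
    path = os.path.join(CACHE_DIR, key + ".mp3")
    if not os.path.exists(path):
        synthesize(phrase, path)  # placeholder: write synthesized speech to `path`
    return path
```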
While the text is being vocalized to produce an audio file for each phrase, the associated slide is rendered as one or more images, with any images and heading / subtitle text. These images are rendered at a much greater resolution than the target video, so that when the video compression takes place, it yields a much clearer and easily understandable visual result.
Once the image(s) and associated audio for each slide in the plan have been created, these are converted into video sections by taking each generated image and producing a video that is the same duration as the audio file for that image and which uses that same image for every frame of the video section. Each such video section is then dubbed over with the associated audio file.
These video sections are then concatenated together with any pre- and post-video sections defined for the process to yield a single video sequence for the whole document. Slides that just link to embedded videos in the input document are simply converted into the same resolution and framerate as the other video sections, and then included in the whole video sequence.
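One common way to implement these two steps is with the ffmpeg command-line tool, as in the sketch below. The disclosure does not specify a particular tool, so this is an assumption, and the encoder settings shown here would in practice be tuned to the target device.

```python
import subprocess

def render_video_section(image_path: str, audio_path: str, out_path: str) -> None:
    # Repeat one still image for the duration of its dubbed audio.
    subprocess.run([
        "ffmpeg", "-y", "-loop", "1", "-i", image_path, "-i", audio_path,
        "-c:v", "libx264", "-tune", "stillimage", "-pix_fmt", "yuv420p",
        "-c:a", "aac", "-shortest", out_path,
    ], check=True)

def concatenate_sections(section_paths, out_path: str) -> None:
    # Concatenate pre-rendered sections (all with identical codecs/resolution).
    with open("sections.txt", "w") as f:
        for p in section_paths:
            f.write(f"file '{p}'\n")
    subprocess.run([
        "ffmpeg", "-y", "-f", "concat", "-safe", "0",
        "-i", "sections.txt", "-c", "copy", out_path,
    ], check=True)
```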
Although the above-mentioned process can be completely automated, there are various corpus or document specific settings that it is possible to define. These are defined either at a corpus level via a structured configuration file in a format like XML or JSON (JavaScript Object Notation) or may be document-specific (or even document section-specific) by annotating the input document with tags in a suitable meta-language. For example, for websites comprising webpages to be converted, it is possible to define a site level JSON file with default configuration options for all webpages of the website. However, it is also possible to tag individual webpages and webpage elements with webpage-/ section-specific overrides by using a custom attribute added to HTML elements that can also contain JSON.
Settings that can be set at a site, webpage, or webpage-section level include the following (an illustrative configuration file is sketched after the list):
* The gender and specific voice used in the text vocalization.
* The source language of the document(s), and any optional language to convert to.
* Any custom pronunciation rules that can be sent to the vocalization engine.
* Start and end audio/video sections for generated audio/video files.
* String replacements to correct misspellings and aid vocalization.
* Fonts and font sizes.
* Image/video resolutions and compression qualities.
* Audio sampling rates and compression qualities.
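A site-level configuration file of the kind described might look like the following JSON. Every key and value here is purely illustrative, since the disclosure does not prescribe a schema or name a particular text-to-speech voice.

```json
{
  "voice": { "gender": "female", "name": "en-GB-example-voice" },
  "sourceLanguage": "en",
  "translateTo": null,
  "pronunciations": { "AST": "A.S.T." },
  "replacements": { "colour": "color" },
  "introVideo": "assets/intro.mp4",
  "outroVideo": "assets/outro.mp4",
  "font": { "family": "DejaVu Sans", "headingSize": 48, "bodySize": 36 },
  "video": { "resolution": "320x240", "quality": "medium" },
  "audio": { "sampleRate": 22050, "bitrate": "64k" }
}
```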
Once generated, the transliterated documents in audio and/or video form can be delivered to end users via any suitable transit with which their device is compatible. For example, they may be delivered to smartphones as MP4 files via Wi-Fi, to feature phones as 3GP files via Bluetooth or even to the most basic phones as plain audio via FM-radio.
Figure 1 shows a data processing apparatus 102 according to an embodiment. The data processing apparatus comprises a storage medium 103, a network interface 104, a transceiver 105, a processor 106 and memory 107. The storage medium is for storing data and may take the form of a hard disk drive, solid state drive, tape drive or the like. The network interface 104 allows the data processing apparatus to receive data from and transmit data to other data processing apparatuses over a network 101 (which, in this case, is the internet). The transceiver 105 is for sending data to and receiving data from user equipment 108. The user equipment 108 is capable of receiving video and/or audio files generated by the data processing apparatus 102 from electronic documents. The user equipment 108 may be any suitable device, including a non-smart device with limited or non-existent internet browsing capability. In this example, the user equipment is a feature phone which is able to play video and/or audio files in certain formats but is not able to access electronic documents such as webpages, PDF documents, Word documents, XML documents and the like. The processor 106 executes processing on input electronic documents in order to convert them into video and/or audio files. The memory 107 temporarily stores data such as an input electronic document to be processed by the processor 106. The processor 106 controls the operation of each of the other components of the data processing apparatus 102. Each of the network interface 104, transceiver 105, processor 106 and memory 107 is implemented using appropriate circuitry.
The network 101, data processing apparatus 102 and user equipment 108 form a system 100. Electronic documents to be converted by the data processing apparatus 102 are stored in the storage medium 103. The electronic documents may be retrieved from the internet via network interface 104.
The network interface 104 is optional. In the case that there is no network interface 104, the storage medium 103 may be removable and insertable into another data processing apparatus (not shown) in order for electronic documents to be converted to be stored on the storage medium 103. The data processing apparatus 102 may then convert the electronic documents stored on the storage medium 103 when the storage medium 103 is reinserted into the data processing apparatus 102. This allows the electronic documents stored in the storage medium 103 to be updated when the data processing apparatus 102 is not connected to a network (e.g. in geographical locations with no or limited network access). The provider of the data processing apparatus may keep the electronic documents stored in the storage medium 103 updated by periodically travelling to the data processing apparatus 102 and updating the electronic documents. An employee or volunteer of the provider may carry a replacement storage medium 103 containing the updated electronic documents or may carry the other data processing apparatus in order to update the electronic documents.
The transceiver 105 is optional in the case that the data processing apparatus 102 is located at a location other than the point of use of the converted electronic documents. For example, the data processing apparatus may be for converting the electronic documents but not for transmitting them to the user equipment 108. In this case, the data processing apparatus may be part of a network and may transmit audio and/or video files generated from electronic documents (via the network interface 104) to another data processing apparatus for the other data processing apparatus to transmit the audio and/or video files to the user equipment 108. Alternatively, the storage medium 103 may again be removable such that generated audio and/or video files may be stored on the storage medium and transferred to the other data processing apparatus for transmission to the user equipment 108 by physical transfer of the storage medium 103 to the other data processing apparatus.
In another embodiment, the data processing apparatus 102 comprises the transceiver 105 and may receive electronic documents to be converted from another user equipment (not shown) via the transceiver 105. New and/or updated electronic documents to be converted may therefore be provided to the data processing apparatus 102 by an employee or volunteer travelling to the data processing apparatus 102 with the other user equipment and uploading the new and/or updated electronic documents to be converted to the data processing apparatus 102 via the transceiver 105.
It will be appreciated that the processes of storing electronic documents to be converted, the conversion itself and the transmission of the resulting audio and/or video files may be split up according to the requirements of the users of the system.
In an embodiment, an electronic document comprising one or more predetermined types of information is retrieved from the storage medium 103 and stored in the memory 107. The processor 106 extracts information of the one or more predetermined types from the electronic document. The processor 106 generates one or more images each comprising a respective portion of the extracted information and generates a video image comprising the one or more generated images.
One of the one or more predetermined types of information may be text (e.g. a paragraph, heading or image caption on a webpage). The processor 106 may generate an audio track of the generated video image by performing a text-to-speech operation on the text information and temporally aligning the display of one or more images of the generated video image comprising the text with the generated audio track.
One of the one or more predetermined types of information may be an image (e.g. an image on a webpage).
One of the one or more predetermined types of information may be an embedded video image (e.g. a video image embedded on a webpage). In this case, the generated video image comprises a portion of the embedded video image.
Each of the one or more predetermined types of information in the electronic document is identified by a respective identifier in the document (e.g. by an appropriate HTML tag of a webpage). In this case, the processor 106 extracts information from the electronic document identified by one or more of the identifiers. One or more of the identifiers may indicate a level in a hierarchy of the predetermined type of information associated with that identifier (e.g. text of a webpage defined within a heading tag <h1> is higher in the hierarchy than text within a heading tag <h2> which is higher in the hierarchy than text within a heading <h3>). The processor 106 determines the temporal order of the one or more generated images in the generated video image based on levels in the hierarchy of the information extracted from the electronic document.
In an embodiment, the transceiver 105 transmits the generated video image to another data processing apparatus capable of playing back the generated video image. This is done via Wi-Fi (Wireless Fidelity), Bluetooth or FM radio, for example.
In an embodiment, the processor 106 separates the generated audio track from the generated video image and the transceiver 105 transmits the generated audio track to another data processing apparatus capable of playing back the audio track. This is done via Wi-Fi (Wireless Fidelity), Bluetooth or FM radio, for example. This allows users who are visually impaired and/or who have user equipment 108 not capable of playing back video files to nonetheless listen to the generated audio file associated with the electronic document.
Figures 2A to 2C show an example of an electronic document to be converted. The document is a webpage and comprises a plurality of types of information. The information includes a first level header 200 (identified by HTML tag <h1>), a second level header 201 (identified by HTML tag <h2>), text 202, a third level header 203 (identified by HTML tag <h3>), text 204 (comprising bullets, each bullet being identified by HTML tag <li>), image 205 (identified by HTML tag <img> and comprising the caption 213 identified by HTML tag <caption>), text 206, second level header 207 (identified by another HTML tag <h2>), text 208, second level header 209 (identified by another HTML tag <h2>), text 210, second level header 211 (identified by another HTML tag <h2>) and text 212. The first level header 200 is a first predetermined type of information. The second level headers 201, 207, 209 and 211 are a second predetermined type of information. The third level header 203 is a third predetermined type of information. The text 202, 204, 206, 208, 210 and 212 is a fourth predetermined type of information. The image 205 (including caption 213) is a fifth predetermined type of information. Each of these types of information is comprised within the <body> HTML tag of the webpage and is extracted by the processor 106.
The processor 106 then defines a data structure using the extracted information. An example data structure 302 is shown in Figure 3. The data structure 302 is a graph comprising a plurality of nodes 300 and edges 301. Each node of the data structure corresponds to a respective one of the instances of information 200 to 212 shown in Figures 2A to 2C and the position of the nodes in the data structure depends on a hierarchy of the predetermined information types. In particular, first level header information (identified by HTML tag <h1>) is higher in the hierarchy than second level header information (identified by HTML tag <h2>) which, in turn, is higher in the hierarchy than third level header information (identified by HTML tag <h3>). Furthermore, text not appearing within a header tag (e.g. text 202, 204, 206, 208, 210 and 212) and images will always correspond to a leaf node. The reference signs 200 to 212 used to annotate each instance of information in Figures 2A to 2C are used to annotate the corresponding nodes in Figure 3. The data structure 302 provides a simplified and abstract view of the webpage shown in Figures 2A to 2C. The data structure 302 is simpler and more abstract than an HTML DOM (Document Object Model), for example, since all HTML specific tags are removed from the extracted information to be included in the generated video file and irrelevant information, sections and tags (which are determined in advance not to be included in or to affect the generated video file) are ignored when generating the data structure 302. This helps make the generation of the video file less processor intensive.
The order in which the processor 106 generates slides for generation of the video file is determined by the order of the nodes when a depth-first traversal is performed on the graph 302. Thus, the first slide(s) corresponds to the first level header 200 (with highest header level <h1>). The next slide(s) corresponds to the second level header 201 (with next highest header level <h2>). The next slide(s) corresponds to the text 202. The next slide(s) corresponds to the third level header (with lowest header level <h3>). The next slide(s) corresponds to text 204.
The next slide(s) corresponds to image 205 (with caption 213). The next slide(s) corresponds to text 206. The next slide(s) corresponds to second level header 207. This also has header level <h2> and is therefore at the same level in the graph as second level header 201. The next slide(s) corresponds to text 208. The next slide(s) corresponds to second header level 209 (again with header level <h2> and therefore at the same level as second header levels 201 and 207). The next slide(s) corresponds to text 210. The next slide(s) corresponds to second header level 211 (again with header level <h2> and therefore at the same level as second header levels 201, 207 and 209). The next slide(s) corresponds to text 212.
The processor 106 generates one or more slides for each node, depending on the content associated with that node. The slides generated for the webpage shown in Figures 2A to 2C using the data structure shown in Figure 3 are shown in Figures 4A to 4T.
Figure 4A shows a slide generated from first level header 200. The slide is an image comprising the text of first level header 200.
Figure 4B shows a slide generated from second level header 201. The slide is an image comprising the text of second level header 201.
Figures 4C and 4D are slides generated from text 202. The slide of Figure 4C is an image comprising a first portion 202A of text 202 and the slide of Figure 4D is an image comprising a second portion 202B of text 202. The processor 106 generates a plurality of slides for a single text node when there is too much text to fit on a single slide at a given font size (the font size being chosen so that the text is displayed clearly on a display of the user equipment 108). In this case, two slides are required in order to fit all of the text 202.
Figure 4E shows a slide generated from third level header 203 and image 205. The slide is an image comprising the text of third level header 203 and a preview sized version of image 205. The preview-sized image 205 is included on the slide in order to give the viewer of the video a visual clue as to the subject-matter of the third level header 203 and the content defined under the third level header 203. The image 205 is chosen for the preview because the image is under the third level header 203 in the webpage mark-up. The caption 213 of the image 205 is not included in the preview (although it may be, if desired).
Figures 4F to 4I are slides generated from text 204. The slide of Figure 4F is an image comprising a first portion 204A of text 204, the slide of Figure 4G is an image comprising a second portion 204B of text 204, the slide of Figure 4H is an image comprising a third portion 204C of text 204 and the slide of Figure 4I is an image comprising a fourth portion 204D of text 204. In this case, the first and second portions of text 204 are bullet points, each one being identified by HTML tag <li>. The processor 106 has converted the bullet points into a numbered list spread across two slides, thereby allowing the text of the list to be displayed on the slides at a predetermined font size and aiding comprehension of text-to-speech audio of the list by the end user (numbered points are better handled by some text-to-speech engines than bullet points). In another embodiment, each bulleted point may have its own respective slide (there would therefore be 8 slides for the list of text 204 rather than just two), thereby allowing each individual bulleted point to be shown separately. In another embodiment, the bulleted points may be converted to a paragraph of text with each of the points separated by commas. It will be appreciated that there is flexibility in the way in which bulleted lists and the like are handled during slide generation, depending on the preferences of the provider, the end user and the like.
Figures 4J and 4K are slides generated from image 205. The first slide of Figure 4J shows the image 205 at a first, smaller size together with caption 213. The second slide of Figure 4K shows the image 205 at a second, larger size without caption 213. The user is therefore able to initially view the image 205 together with caption 213. The user is then shown an enlarged view of the image 205 so as to appreciate the image 205 in more detail.
Figure 4L is a slide generated from text 206. The slide of Figure 4L is an image comprising text 206.
Figure 4M shows a slide generated from second level header 207. The slide is an image comprising the text of second level header 207.
Figures 4N and 4O are slides generated from text 208. The slide of Figure 4N is an image comprising a first portion 208A of text 208 and the slide of Figure 4O is an image comprising a second portion 208B of text 208.
Figure 4P shows a slide generated from second level header 209. The slide is an image comprising the text of second level header 209.
Figure 4Q is a slide generated from text 210. The slide of Figure 4Q is an image comprising text 210.
Figure 4R shows a slide generated from second level header 211. The slide is an image comprising the text of second level header 211.
Figures 4S and 4T are slides generated from text 212. The slide of Figure 4S is an image comprising a first portion 212A of text 212 and the slide of Figure 4T is an image comprising a second portion 212B of text 212.
The location of splitting of single sections of text split over multiple slides (e.g. the splitting of text 202 over the slides of Figures 4C and 4D, the splitting of text 204 over the slides of Figures 4F to 4I, the splitting of text 208 over the slides of Figures 4N and 4O and the splitting of text 212 over the slides of Figures 4S and 4T) may be determined based on one or more suitable characteristics of the text in order to facilitate comprehension of the text when it is split across multiple slides. In the example of Figures 4A to 4T, the text is split between sentences (for example, the end of each sentence being recognisable by the presence of a full stop ".") or between bullet points (for example, the end of each bullet point being recognisable by the presence of another bullet point or of a new line) so that the minimum amount of text present on one slide is one sentence or one bullet point. This helps comprehension of the text when presented on the slides. It will be appreciated that the location of splitting of single sections of text split over multiple slides in other languages (e.g. languages with non-Latin based character sets) may be determined in a way most appropriate to the language in question.
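A simple sentence-bounded splitter of the kind described might look like the sketch below. The maximum length stands in for "whatever fits at the chosen font size on the target resolution", and the sentence detection is a deliberately naive assumption.

```python
import re

def split_into_subtitle_chunks(text: str, max_chars: int = 90):
    """Split text into subtitle-sized chunks without breaking mid-sentence."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = (current + " " + sentence).strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)   # current chunk is full; start a new one
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```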
As well as the generation of the slides shown in Figures 4A to 4T, the processor 106 also performs a text-to-speech operation (using a suitable automated text-to-speech technique as known in the art) on the text shown on each of the slides. This may be carried out in parallel with the generation of the slides (the processor 106 may comprise a plurality of sub-processors (not shown) in order to carry out the slide generation and text-to-speech processes in parallel). Such parallel processing helps to increase the speed of the document to video / audio conversion.
After the slide generation and text-to-speech processes are complete, the processor 106 has generated a plurality of images (each image being a respective one of the generated slides) to be displayed in a given order (according to the data structure 302) and, for each image comprising text, a corresponding audio file of the text of that image being read out. The processor 106 then generates a video file comprising the generated images and audio files.
The video file comprises a video image and an audio track. Each frame of the video image is one of the generated slide images which is repeated for multiple consecutive frames of the video image. The number of consecutive frames for which each generated slide image is displayed depends on the length of the audio file generated for that slide image. If a slide image does not have an associated audio file (e.g. because it contains no text -see the slide of Figure 4K, for example), then the number of consecutive frames for which that slide image is displayed may be determined in advance. The audio file of each slide image forms the portion of the audio track of the video image to be played back whilst that slide image is displayed in the video image.
For example, if the frame rate of the generated video is to be 30 fps (frames per second) and the audio file associated with the slide image shown in Figure 4A lasts 2 seconds, then the slide image shown in Figure 4A will last for 30 fps x 2 seconds = 60 consecutive frames. If the audio file associated with the slide image shown in Figure 4B lasts 3 seconds, then the slide image shown in Figure 4B will last for 30 fps x 3 seconds = 90 consecutive frames. The latter 90 consecutive frames will be displayed in the video image immediately after the former 60 consecutive frames, since the slide image of Figure 4B immediately follows the slide image of Figure 4A in the order defined by the data structure 302. Similarly, the audio track accompanying the video image will comprise playback of the audio file associated with the image of Figure 4A for the 60 consecutive frames for which the image of Figure 4A is displayed immediately followed by playback of the audio file associated with the image of Figure 4B for the 90 consecutive frames for which the image of Figure 4B is displayed. Thus, the playback of each audio file forming the audio track of the generated video is temporally aligned with the display of the slide image from which that audio file was generated during playback of the video image.
In general, for a slide image with an audio file lasting for a time t, the number of consecutive frames n of the video image with a frame rate f will be n = f × t. In the case that there is no audio file associated with the slide image (e.g. if the slide image contains no text), the time t may be determined in advance. For example, t may be fixed at 2, 3, 4 or 5 seconds. In an embodiment, the vocalized audio can be omitted entirely (e.g. for documents in languages for which text-to-speech engines are not readily available). In this case, the generated file contains no audio track. In such cases, a language-specific formula may be used to estimate how long a phrase would take to be read by a typical end-user in that language. The slide image associated with that phrase is then shown for the estimated period of time.
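Expressed as code, the frame count for each section might be computed as below, rounding up to a whole number of frames. The fallback duration is an assumption standing in for the fixed or language-specific reading-time estimate mentioned above.

```python
import math
from typing import Optional

def section_frame_count(frame_rate: float, audio_seconds: Optional[float],
                        fallback_seconds: float = 3.0) -> int:
    """n = f x t, with a preset duration when the slide has no audio."""
    t = audio_seconds if audio_seconds is not None else fallback_seconds
    return math.ceil(frame_rate * t)
```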
The result is a video file which, when transmitted to and played back by the user equipment 108, successively displays the generated slide images shown in Figures 4A to 4T whilst (where appropriate) outputting audio in which the text on each slide image is read out. Such videos may be generated in (or converted to) a suitable format (e.g. 3GP) for playback on non-smart devices such as feature phones. Users with user equipment 108 which is not capable of displaying the webpage shown in Figures 2A to 2C may therefore still experience the text and image content of the webpage. The audio track which reads out the webpage text of the slides also increases accessibility for users who are partially sighted or who have low literacy levels. Content which would otherwise not be available to certain users is therefore made available to them with the present technique. Moreover, the content conversion is carried out automatically, alleviating the need for human input and therefore increasing the speed and reducing the cost for converting content.
Once the video file is generated, it is stored in the storage medium 103 for subsequent transmission to the user equipment 108 by the transceiver 105 or to another network entity via the network interface 104.
Figures 5A to 5D show an example of playback of a video file generated using the slide images shown in Figures 4A to 4T on a user equipment 108. In this case, the user equipment is a feature phone without the capability to display webpages like that shown in Figures 2A to 2C but with the ability to play back video files generated during the conversion process carried out by the data processing apparatus 102.
Figure 5A shows the user equipment 108 at the beginning of playback of the generated video file. The slide image 500 shown in Figure 4A is displayed on a screen 600 of the user equipment. A pointer 602 on progress bar 601 indicates the progress of playback of the video file. It can be seen that the playback has only just started. The user may pause the video by pressing button 603 and may forward or rewind the video using the toggle button 604 (the user presses on the left of the toggle button 604 to rewind and presses on the right of the toggle button 604 to forward). Concurrently with the display of slide image 500, the audio file comprising the read-out text of slide image 500 (that is, the text of first level header 200) will be played back (via a loudspeaker (not shown) or headphones (not shown) of the user equipment, for example).
Figure 5B shows the user equipment 108 at a later point during playback of the generated video file. The slide image 501 shown in Figure 4E is displayed on the screen 600. The position of the pointer 602 on the progress bar 601 indicates that playback of the video has progressed relative to the point of time shown in Figure 5A. It is noted that, between the times shown in Figures 5A and 5B, the slide images of Figures 4B to 4D will have been successively displayed as part of the played back video. Concurrently with the display of slide image 501, the audio file comprising the read-out text of slide image 501 (that is, the text of third level header 203) will be played back.
Figure 5C shows the user equipment 108 at a later point during playback of the generated video file. The slide image 502 shown in Figure 4I is displayed on the screen 600. The position of the pointer 602 on the progress bar 601 indicates that playback of the video has progressed relative to the point of time shown in Figure 5B. It is noted that, between the times shown in Figures 5B and 5C, the slide images of Figures 4F to 4H will have been successively displayed as part of the played back video. Concurrently with the display of slide image 502, the audio file comprising the read-out text of slide image 502 (that is, the portion 204D of text 204) will be played back.
Figure 5D shows the user equipment 108 at a later point during playback of the generated video file. The slide image 503 shown in Figure 4K is displayed on the screen 600. The position of the pointer 602 on the progress bar 601 indicates that playback of the video has progressed relative to the point of time shown in Figure 5C. It is noted that, between the times shown in Figures 5C and 5D, the slide image of Figure 4J will have been displayed as part of the played back video. The slide image 503 does not contain any text, and therefore no audio is played back during display of the slide image 503.
It will thus be appreciated that, although the user equipment 108 does not have the capability of displaying the webpage shown in Figures 2A to 2C, various types of content of the webpage including text and images may still be provided to a user of the user equipment 108 in an intuitive way.
In an embodiment, if a webpage (or other electronic document to be converted) comprises an embedded video, then this may correspond to a further leaf of the generated data structure 302 and be included in the generated video file. In this case, the embedded video itself forms a portion of the generated video file at an appropriate point in the generated video file (the appropriate point being determined based on the order imposed by the generated data structure 302). In this case, all frames of the generated video will be a corresponding slide image (like the slide images of Figures 4A to 4T) except the frames of the portion of the generated video file formed of the embedded video. The frames of the portion of the generated video formed of the embedded video will be appropriate respective frames of the embedded video (e.g. all frames of the embedded video if the embedded video is at the same frame rate as the generated video file or a portion of frames of the embedded video if the embedded video is at a higher frame rate than the generated video file). If necessary, the processor 106 converts the embedded video into the same format as that of the generated video prior to including it in the generated video (e.g. if the generated video is in the 3GP format, then the embedded video is converted to the 3GP format prior to being included in the generated video).
Similarly, if the webpage (or other electronic document to be converted) comprises embedded audio, then this may be included as part of the audio track of the generated video file. The temporal position of the embedded audio file in the audio track of the generated video file may be determined such that the embedded audio file is temporally aligned with any caption / alt text associated with the embedded audio file in the source document and included in the generated video image. Alternatively, if there is no caption / alt text associated with the embedded audio file, then the embedded audio file may be temporally aligned with a heading or image of the section of the source document in which the audio file is embedded.
Although the example embodiments of Figures 2 to 5 relate to converting a webpage, it will be appreciated that this technique may be applied to any type of electronic document (e.g. PDF, Word ®, XML feeds or the like) from which one or more predetermined types of information (e.g. text, images and video) may be extracted. Each predetermined type of information may comprise an appropriate identifier (with hierarchy, if appropriate) for detection by the processor 106 and generation of a suitable data structure and slides. In an embodiment, the processor 106 comprises multiple pre-processors (not shown) each configured to extract the predetermined type(s) of information from a different respective type of electronic document and to generate a suitable data structure. The data structure may be in the same predetermined format for all electronic document types and may comprise all information necessary for generating the necessary slide images (for example, the data defining each node of the data structure may comprise the information (e.g. images or text) associated with that node). This allows the processor 106 to perform the same slide generation process for all electronic document types, thereby simplifying the processing involved in generating videos from different document types.
In an embodiment, the audio file generated for each slide may be a recording of a human reading aloud the text of the slide instead of an audio file generated using an automated text-to-speech process. This allows improved quality control of the audio track generated for each video file. It also allows vocalization of less widely spoken languages (as previously mentioned).
In an embodiment, the most popular documents may have audio tracks which are human voice generated whereas less popular documents may have audio tracks which are text-to-speech generated. This provides a compromise between maximising the quality of video files generated for the most popular documents and allowing large numbers of documents to be converted at reduced cost.
In an embodiment, the audio track of a generated video file (which is a concatenation of the audio files generated for each slide, in the order of display of the slides in the generated video file) may be separable from the video image of the generated video file. This audio track may then be transmitted as a standalone audio file to user equipment 108 which does not have the capability of displaying video but does have the ability to play back audio. For example, the transceiver 105 may transmit the generated audio file as an FM radio broadcast (thereby allowing the audio file to be listened to by any device with an FM radio receiver, e.g. a basic mobile phone or standalone FM radio).
In an embodiment, the standalone audio file is generated by concatenating only the audio files associated with slides containing text (so that portions of the audio track associated with non-textual slides, such as those containing only images, are not included in the standalone audio file). The slides containing text are identified from the data structure 302, for example, and are used to generate a new plan defining the audio files to be concatenated and their order. This helps to avoid undesirable pauses in the standalone audio file when it is played back. Another characteristic of the text which is vocalized (e.g. punctuation) may also be amended when generating a standalone audio file in order to improve the comprehensibility of the standalone audio file when played back without the corresponding video image. A standalone audio file may therefore be shorter in duration than a video file generated from the same document when that document contains images, thereby reducing the file size of the standalone audio file.
In an embodiment, the source document is generated from a live webpage containing one or more interactive sections and/or which is partially generated using JavaScript or the like. In this case, a current state of the HTML DOM of the live webpage is taken as the source document in order to generate a video image representative of the webpage in that current state. At a later point in time (when the live webpage has been updated), the updated HTML DOM is taken as a new source document in order to generate a new video image representative of the updated webpage.
Figure 6 shows a data processing method according to an embodiment. The process starts at step 700. At step 701, the processor 106 receives an electronic document (e.g. via the network interface 104 or from storage medium 103) comprising one or more predetermined types of information. At step 702, the processor 106 extracts information of the one or more predetermined types from the electronic document. At step 703, the processor 106 generates one or more images (e.g. the slide images shown in Figures 4A to 4T) each comprising a respective portion of the extracted information. At step 704, the processor 106 generates a video image comprising the one or more generated images. The process then ends at step 705.
Numerous modifications and variations of the present disclosure are possible in light of the above teachings. It is therefore to be understood that within the scope of the appended claims, the disclosure may be practiced otherwise than as specifically described herein.
In so far as embodiments of the disclosure have been described as being implemented, at least in part, by software-controlled data processing apparatus, it will be appreciated that a non-transitory machine-readable medium carrying such software, such as an optical disk, a magnetic disk, semiconductor memory or the like, is also considered to represent an embodiment of the present disclosure.
It will be appreciated that the above description for clarity has described embodiments with reference to different functional units, circuitry and/or processors. However, it will be apparent that any suitable distribution of functionality between different functional units, circuitry and/or processors may be used without detracting from the embodiments.
Described embodiments may be implemented in any suitable form including hardware, software, firmware or any combination of these. Described embodiments may optionally be implemented at least partly as computer software running on one or more data processors and/or digital signal processors. The elements and components of any embodiment may be physically, functionally and logically implemented in any suitable way. Indeed, the functionality may be implemented in a single unit, in a plurality of units or as part of other functional units. As such, the disclosed embodiments may be implemented in a single unit or may be physically and functionally distributed between different units, circuitry and/or processors.
Although the present disclosure has been described in connection with some embodiments, it is not intended to be limited to the specific form set forth herein. Additionally, although a feature may appear to be described in connection with particular embodiments, one skilled in the art would recognize that various features of the described embodiments may be combined in any manner suitable to implement the technique.

Claims (13)

  1. A data processing apparatus comprising circuitry configured to: receive an electronic document comprising one or more predetermined types of information; extract information of the one or more predetermined types from the electronic document; generate one or more images each comprising a respective portion of the extracted information; and generate a video image comprising the one or more generated images.
  2. A data processing apparatus according to claim 1, wherein one of the one or more predetermined types of information is text.
  3. A data processing apparatus according to claim 2, wherein: the circuitry is configured to generate an audio track of the generated video image by performing a text-to-speech operation on the text and temporally aligning the display of one or more images of the generated video image comprising the text with the generated audio track.
  4. A data processing apparatus according to any preceding claim, wherein one of the one or more predetermined types of information is an image.
  5. A data processing apparatus according to any preceding claim, wherein: one of the one or more predetermined types of information is an embedded video image; and the generated video image comprises a portion of the embedded video image.
  6. A data processing apparatus according to any preceding claim, wherein: each of the one or more predetermined types of information in the electronic document is identified by a respective identifier in the document; and the circuitry is configured to extract information from the electronic document identified by one or more of the identifiers.
  7. A data processing apparatus according to claim 6, wherein: one or more of the identifiers indicates a level in a hierarchy of the predetermined type of information associated with that identifier; and the circuitry is configured to determine the temporal order of the one or more generated images in the generated video image based on levels in the hierarchy of the information extracted from the electronic document.
  8. A data processing apparatus according to claim 6 or 7, wherein the electronic document is a webpage and each of the identifiers is a Hypertext Markup Language (HTML) tag indicative of a type of information included on the webpage.
  9. A data processing apparatus according to any preceding claim, wherein the circuitry is configured to transmit the generated video image to another data processing apparatus capable of playing back the generated video image.
  10. A data processing apparatus according to claim 3, wherein the circuitry is configured to separate the generated audio track from the generated video image and transmit the generated audio track to another data processing apparatus capable of playing back the audio track.
  11. A data processing method comprising: receiving an electronic document comprising one or more predetermined types of information; extracting information of the one or more predetermined types from the electronic document; generating one or more images each comprising a respective portion of the extracted information; and generating a video image comprising the one or more generated images.
  12. A program for controlling a computer to perform a method according to claim 11.
  13. A storage medium storing a computer program according to claim 12.
GB1816254.5A 2018-10-05 2018-10-05 Data processing apparatus and method Withdrawn GB2577742A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
GB1816254.5A GB2577742A (en) 2018-10-05 2018-10-05 Data processing apparatus and method
PCT/GB2019/052765 WO2020070483A1 (en) 2018-10-05 2019-10-01 Data processing apparatus and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB1816254.5A GB2577742A (en) 2018-10-05 2018-10-05 Data processing apparatus and method

Publications (1)

Publication Number Publication Date
GB2577742A true GB2577742A (en) 2020-04-08

Family

ID=68240762

Family Applications (1)

Application Number Title Priority Date Filing Date
GB1816254.5A Withdrawn GB2577742A (en) 2018-10-05 2018-10-05 Data processing apparatus and method

Country Status (2)

Country Link
GB (1) GB2577742A (en)
WO (1) WO2020070483A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111597781B (en) * 2020-05-19 2023-06-02 浪潮软件集团有限公司 Unstructured big data generation method, system, storage medium and electronic equipment
CN113870395A (en) * 2021-09-29 2021-12-31 平安科技(深圳)有限公司 Animation video generation method, device, equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2002027710A1 (en) * 2000-09-27 2002-04-04 International Business Machines Corporation Method and system for synchronizing audio and visual presentation in a multi-modal content renderer
US9396230B2 (en) * 2014-01-22 2016-07-19 Rory Ryder Searching and content delivery system
EP3208799A1 (en) * 2016-02-16 2017-08-23 DOXEE S.p.A. System and method for the generation of digital audiovisual contents customised with speech synthesis

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090157407A1 (en) * 2007-12-12 2009-06-18 Nokia Corporation Methods, Apparatuses, and Computer Program Products for Semantic Media Conversion From Source Files to Audio/Video Files
US10664645B2 (en) * 2016-10-07 2020-05-26 Alltherooms System and method for transposing web content

Also Published As

Publication number Publication date
WO2020070483A1 (en) 2020-04-09

Similar Documents

Publication Publication Date Title
JP5674450B2 (en) Electronic comic viewer device, electronic comic browsing system, viewer program, recording medium on which the viewer program is recorded, and electronic comic display method
CN111538851B (en) Method, system, equipment and storage medium for automatically generating demonstration video
CN109348145B (en) Method and device for generating associated bullet screen based on subtitle and computer readable medium
WO2012086356A1 (en) File format, server, view device for digital comic, digital comic generation device
JP5634853B2 (en) Electronic comic viewer device, electronic comic browsing system, viewer program, and electronic comic display method
KR20040039432A (en) Multi-lingual transcription system
US20050080631A1 (en) Information processing apparatus and method therefor
EP3598769A2 (en) Using an audio stream to identify metadata associated with a currently playing television program
CN101465068A (en) Method for the determination of supplementary content in an electronic device
WO2020070483A1 (en) Data processing apparatus and method
Federico et al. An automatic caption alignment mechanism for off-the-shelf speech recognition technologies
WO2015019774A1 (en) Data generating device, data generating method, translation processing device, program, and data
WO2022182408A1 (en) Systems and methods for improved video captions
KR20060088175A (en) System and method for creating e-book that having multi-format
KR101425381B1 (en) Learning system using subtitles and method thereof
JP2010230948A (en) Content distribution system and text display method
Fels et al. Sign language online with Signlink Studio 2.0
KR20210022360A (en) A method and apparatus for automatically converting web content to video content
Spina Captions
JP7179387B1 (en) HIGHLIGHT MOVIE GENERATION SYSTEM, HIGHLIGHT MOVIE GENERATION METHOD, AND PROGRAM
KR101814431B1 (en) Translation system and method for translation syncronization
KR101137091B1 (en) Method and system for offering search result using caption information
Jiang SDW-ASL: A Dynamic System to Generate Large Scale Dataset for Continuous American Sign Language
KR101810555B1 (en) Translation system and method for translation syncronization
KR20140107170A (en) System for providing vocabulary list based on contents and method thereof

Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)