WO2023218217A1 - Text rendering on mobile devices - Google Patents

Text rendering on mobile devices

Info

Publication number
WO2023218217A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
image
width
region
line
Application number
PCT/IB2022/000272
Other languages
French (fr)
Inventor
Xiaoyu YE
Mingzhi TIAN
Qiang Qiu
Original Assignee
Innopeak Technology, Inc.
Application filed by Innopeak Technology, Inc. filed Critical Innopeak Technology, Inc.
Priority to PCT/IB2022/000272
Publication of WO2023218217A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/40Processing or translation of natural language
    • G06F40/58Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation

Definitions

  • FIG. 3 is a graphics user interface (GUI) 300 of a real time translation application 226 executed to translate textual content in a field of view of a camera, in some embodiments.
  • the real time translation application 226 is executed on a mobile device 104 to provide OCR translation or associated text rendering features.
  • the user interface 300 displays a Chinese restaurant menu that has been translated to English.
  • FIG. 4 is a block diagram of a text rendering process 400, in accordance with some embodiments.
  • the electronic system 200 receives information of one or more texts 402 to be rendered and information of one or more text regions 404, and renders text(s) 402 properly within text region(s) 404 on a screen of a mobile device 104.
  • the one or more texts 402 to be rendered include translated texts
  • the one or more text regions 404 include OCR boxes where original text is displayed and replaced with the translated text.
  • the output of the system is rendering results 406 displayed on the screen of the mobile device 104, e.g., with OpenGL.
  • the system utilizes some functions for text adjustment, such as reorganizing text, getting text width and height, and adjusting text rendering parameters.
  • the rotation about the z-axis will also be calculated and updated in the vertex shader. Since the texts can be in different languages and the total lengths of the texts are unknown, the scale and arrangement need to be calculated dynamically with respect to the size of the text regions. Besides, if the texts are compressed to a very small scale to fit in a text region, the whole sentence is reorganized into multiple lines, depending on the ratios. [0037] In some embodiments, the one or more texts 402 to be rendered by the electronic system 200 are converted (406) to a wide string, and the parameters of the one or more text regions 404 are converted (408) to the GL coordinate.
  • the electronic system 200 loads (410) the characters of the input text from font libraries and generates (412) the text hash map.
  • the electronic system 200 further re-organizes (414) the input text and generates (416) a list of split texts.
  • the electronic system 200 gets (418) the text width and height to calculate (420) the text width, height and offset.
  • the electronic system 200 also adjusts (422) text render parameters to calculate (424) the text scale. Subsequently, the electronic system 200 renders (426) the text, updates (428) vertex data, parses (430) the parameters to shader, and invokes (432) GL drawcalls to render the input text to the rendering results 406.
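  • To picture the conversion (408) of text-region parameters to the GL coordinate, the sketch below maps an OCR box from pixel coordinates into OpenGL normalized device coordinates. It is only an illustrative C++ helper under the assumption that pixel coordinates have their origin at the top-left of the screen with y pointing down; the struct and function names are hypothetical, not the application's actual interfaces.

```cpp
// Hypothetical text-region description coming out of OCR, in pixel coordinates
// (origin at the top-left of the screen, y growing downward).
struct PixelBox { float x, y, w, h, angleDeg; };

// Same box expressed in OpenGL normalized device coordinates
// (origin at the center, x and y in [-1, 1], y growing upward).
struct GlBox { float x, y, w, h, angleDeg; };

GlBox pixelToGl(const PixelBox& box, float screenW, float screenH) {
    GlBox out;
    out.x = 2.0f * box.x / screenW - 1.0f;           // left edge
    out.y = 1.0f - 2.0f * (box.y + box.h) / screenH; // bottom edge (y axis flips)
    out.w = 2.0f * box.w / screenW;
    out.h = 2.0f * box.h / screenH;
    out.angleDeg = box.angleDeg;                     // rotation is handled later in the shader
    return out;
}
```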
  • Figures 5A-5E are graphics user interfaces 500, 510, 520, 530, and 540 of a real time translation application 226 implementing a text rendering process 400, in accordance with some embodiments.
  • the real time translation application 226 is implemented in mobile devices, for example, mobile phones 104C.
  • the real time translation application 226 is configured to detect texts using OCR algorithms and get translated texts in multiple languages, such as Chinese, English, Korean, Japanese, etc.
  • the rendered texts are the translated text from a source language.
  • the real time translation application 226 achieves desirable visual effects and user experiences by way of the text rendering process 400.
  • the real time translation application 226 displays a restaurant menu 502 in Chinese in a first user interface 500.
  • the restaurant menu 502 includes a list of Chinese phrases 504 corresponding to Chinese dishes.
  • the first user interface 500 includes a start affordance 506, a stop affordance 508, and a target language affordance 514.
  • a list of target language options is displayed, and a user may select the target language to translate the list of Chinese phrases 504.
  • a first user action on the start affordance 506 initiates a text translation mode in which the text rendering process 400 is implemented to recognize and translate text in the first user interface 500 and display the translated text in the target language in real time.
  • a second user action on the stop affordance 508 follows the first user action and terminates the text translation mode, allowing the Chinese menu 502 to be redisplayed on the first user interface 500.
  • the source language of Chinese is automatically detected by the real time translation application 226 based on image data displayed in the first user interface 500.
  • the first user interface 500 further includes a source language affordance 512.
  • a list of source language options is displayed, and a user may select the source language to recognize the Chinese phrases 504 from the restaurant menu 502.
  • both the source and target languages are Chinese.
  • the source and target languages are English and Chinese selected via the source and target language affordances 512 and 514, respectively.
  • OCR is executed for recognizing the Chinese phrases 504.
  • the real time translation application 226 fails to recognize the Chinese phrases 504 based on the source language of English, and therefore, aborts OCR and text translation.
  • the original restaurant menu 502 is displayed in Chinese as is.
  • the real time translation application 226 recognizes the Chinese phrases 504 successfully, while aborting text translation in accordance with a determination that the recognized Chinese phrases 504 are not consistent with the source language as defined by the source language affordance 512 or a determination that the recognized Chinese phrases 504 are recognized in the target language as defined by the target language affordance 514.
  • the source and target languages are Chinese and Korean selected via the source and target language affordances 512 and 514, respectively.
  • OCR is executed to recognize the Chinese phrases 504 in Chinese.
  • the real time translation application 226 then translates the Chinese phrases 504 to Korean.
  • the translated Korean phrases 518 are displayed in place of the corresponding Chinese phrases 504 on the restaurant menu 502.
  • each Chinese phrase 504 corresponds to a text region 516 in which the Chinese phrase 504 is tightly enclosed, and a corresponding translated Korean phrase 518 is displayed within the same text region 516.
  • the source and target languages are Chinese and English selected via the source and target language affordances 512 and 514, respectively.
  • the real time translation application 226 translates the Chinese phrases 504 to English.
  • the translated English phrases 518 are displayed in place of the corresponding Chinese phrases 504 on the restaurant menu 502, e.g., in the same text regions 516 of the corresponding Chinese phrases 504.
  • the source and target languages are Chinese and Japanese selected via the source and target language affordances 512 and 514, respectively.
  • the real time translation application 226 translates the Chinese phrases 504 to Japanese.
  • the translated Japanese phrases 518 are displayed in place of the corresponding Chinese phrases 504 on the restaurant menu 502, e.g., in the same text regions 516 of the corresponding Chinese phrases 504.
  • each character of the translated text 518 is loaded (410) from a corresponding font library with a respective text hash map 412.
  • the translated text 518 is reorganized (414) to a list of split text 416.
  • FIG. 6 is a diagram of exemplary glyph metrics defining an individual glyph (e.g., “g”) 600, in accordance with some embodiments.
  • a library is used to load a plurality of glyphs 600 as textures and render the texture with OpenGL.
  • An example of the library is a FreeType library, and adjustment of each individual glyph 600 is based on a plurality of glyph metrics in the FreeType library.
  • FreeType loads TrueType fonts.
  • For each glyph 600, the FreeType library generates a bitmap image and calculates several metrics. The bitmap image of each character glyph 600 is extracted to generate textures, and each character glyph is positioned using the loaded glyph metrics.
  • Each of the plurality of glyphs 600 is associated with a horizontal baseline 620 (as depicted by the horizontal arrow). In some embodiments, the horizontal baseline 620 passes through the glyph 600 (e.g., “g”, “p”).
  • the glyph 600 has a first glyph portion that sits on this baseline 620 and a second glyph portion that is slightly below the baseline 620.
  • the glyph 600 (e.g., “a”) sits on the baseline 620 entirely.
  • the plurality of glyph metrics define an exact offset to properly position the glyph 600 with respect to the baseline 620, a size in which each glyph should be displayed, and a number of pixels that are needed to render the glyph.
  • the plurality of glyph metrics of each glyph 600 include one or more of: Width 602, representing a width of the corresponding bitmap image of the glyph 600 in pixels, which is accessed via a library path of face->glyph->bitmap.width; Height 604, representing a height of the corresponding bitmap image of the glyph 600 in pixels, which is accessed via a library path of face->glyph->bitmap.rows; BearingX 606, representing a horizontal bearing (e.g., a horizontal position of the corresponding bitmap image of the glyph 600 relative to an origin 640) in pixels, which is accessed via a library path of face->glyph->bitmap_left; BearingY 608, representing a vertical bearing (e.g., a vertical position of the corresponding bitmap image of the glyph 600 relative to the origin 640) in pixels, which is accessed via a library path of face->glyph->bitmap_top; and Advance 610, representing a horizontal advance (e.g., a horizontal distance from the origin 640 to the origin of the next glyph), which is accessed via a library path of face->glyph->advance.
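  • For reference, the glyph metrics above correspond to fields of the FreeType glyph slot; a minimal C++ loading routine in the spirit of the description might look as follows (error handling trimmed, and the 48-pixel size chosen purely for illustration).

```cpp
#include <ft2build.h>
#include FT_FREETYPE_H

// Metrics named in the description, read from the FreeType glyph slot.
struct GlyphMetrics {
    unsigned width, height;  // bitmap size in pixels (Width 602, Height 604)
    int bearingX, bearingY;  // offsets from the origin/baseline (BearingX 606, BearingY 608)
    long advance;            // horizontal advance in pixels (Advance 610)
};

// Loads one glyph with FreeType and extracts the metrics; the rendered
// bitmap in face->glyph->bitmap.buffer is what gets uploaded as a texture
// (e.g., one layer of the large texture array).
bool loadGlyphMetrics(FT_Face face, char32_t codepoint, GlyphMetrics* out) {
    if (FT_Load_Char(face, codepoint, FT_LOAD_RENDER) != 0)
        return false;  // glyph missing from this font
    out->width    = face->glyph->bitmap.width;
    out->height   = face->glyph->bitmap.rows;
    out->bearingX = face->glyph->bitmap_left;
    out->bearingY = face->glyph->bitmap_top;
    out->advance  = face->glyph->advance.x >> 6;  // FreeType stores the advance in 1/64 pixels
    return true;
}

// Typical setup before calling loadGlyphMetrics:
//   FT_Library ft;   FT_Init_FreeType(&ft);
//   FT_Face face;    FT_New_Face(ft, "font.ttf", 0, &face);
//   FT_Set_Pixel_Sizes(face, 0, 48);   // 48 px is an arbitrary example size
```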
  • FIG. 7 is a flow diagram of a text scale adjustment process 700, in accordance with some embodiments.
  • Text to be rendered includes a plurality of characters, and the text including the characters is adjusted and scaled to fit within a text region (e.g., region 516 in Figures 5A-5E).
  • the text region is also called a bounding box.
  • the text region is also called a rendering region or an OCR box.
  • the input for adjustment is a character string 702 including a list of split text.
  • Scaling adjustment includes getting (704) a length and a height of the string 702 in a pixel coordinate.
  • the length of the character string 702 refers to a sentence width 706, which is calculated cumulatively on a character-by-character basis.
  • the length of a single character is represented by one of two glyph metrics, i.e., width 602 (ch.size.x) and advance 610 (ch.advance).
  • a height of the character string refers to a sentence height 708, which is measured from the topmost position to the bottommost position of the characters in the character string 702.
  • the text region where the character string 702 is rendered has a box width and a box height.
  • the character string 702 is adjusted (710) to fit into the text region based on text render parameters, and the text render parameters include a width scale factor 712 and a height scale factor 714.
  • the character string 702 is displayed in one or more lines, and each line has a line width (line_width) and a line height (line_height).
  • the width scale factor 712 is a ratio of the box width and the length of the character string 702
  • the height scale factor 714 is a ratio of the box height and the height of the character string.
  • An overall text scale 716 is the smaller of the width scale factor 712 and the height scale factor 714.
  • a sentence width 706 is a sum of character widths of the string of characters 702, and a sentence height 708 is a difference between the topmost position and the bottommost position of the characters in the character string.
  • the electronic system 200 adjusts (710) the text render parameters including the width scale factor 712 and a height scale factor 714.
  • the width scale factor 712 (scale_x) is equal to the box width (box_width) divided by the sentence width 706 (line_width)
  • the height scale factor 714 (scale_y) is equal to the box height (box_height) divided by the sentence height (line_height).
  • the overall text scale 716 is the smaller of the width scale factor 712 and the height scale factor 714, i.e., min(scale_x, scale_y).
  • the text is rendered in the center of the text region.
  • a texture coordinate in a graphic coordinate takes a bottom left corner as an origin of the character string 702, and by default, is aligned with the bottom left corner of the text box.
  • the text is rendered in one or more lines within the text region.
  • the electronic system 200 adjusts (718) x and y offsets (i.e., x_offset, y_offset) of each line in the text region.
  • the character string is displayed in one line within the text region.
  • the x offset 720 (x_offset) is equal to a half of a difference between the box width and an adjusted sentence width, i.e., equal to (box_width – adjusted_sentence_width)/2.
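  • Combining the relations above, a hedged C++ sketch of the per-line scale and offset computation could read as follows; the variable names mirror the text (box_width, line_width, and so on), but the function itself is illustrative rather than the patented implementation.

```cpp
#include <algorithm>

struct LineFit {
    float scale;    // overall text scale 716
    float xOffset;  // horizontal offset 720
    float yOffset;  // vertical offset, centering the line in the box
};

// boxWidth/boxHeight  : size of the text region (OCR box)
// lineWidth/lineHeight: unscaled width/height of the character string, same units
LineFit fitLineToBox(float boxWidth, float boxHeight,
                     float lineWidth, float lineHeight) {
    float scaleX = boxWidth / lineWidth;     // width scale factor 712
    float scaleY = boxHeight / lineHeight;   // height scale factor 714
    float scale  = std::min(scaleX, scaleY); // keep the aspect ratio, never overflow the box

    float adjustedWidth  = lineWidth * scale;
    float adjustedHeight = lineHeight * scale;

    LineFit fit;
    fit.scale   = scale;
    fit.xOffset = (boxWidth  - adjustedWidth)  / 2.0f;  // center horizontally
    fit.yOffset = (boxHeight - adjustedHeight) / 2.0f;  // center vertically
    return fit;
}
```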
  • FIG. 8 is a diagram illustrating visual effects 800 of adjusted text scale 710 and offsets 718 as a character string 702 is rendered in one line, in accordance with some embodiments.
  • the character string 702 is scaled with respect to the box width 802 and height 804 of a bounding box 806 (also called text region).
  • the smaller of the width scale factor 712 and the height scale factor 714 is used as an overall text scale 716.
  • the character string 702 is enlarged when the length of the character string 702 is smaller than the box width of the bounding box.
  • An enlargement operation is optionally applied horizontally and/or vertically to keep an aspect ratio of the character string 702.
  • an adjusted text 808A fills a horizontal direction of the bounding box 806 and leaves space in a vertical direction.
  • an adjusted text 808B fills the vertical direction of the bounding box 806 and leaves space in a horizontal direction.
  • the adjusted text 808 is rendered in the center of the bounding box 806.
  • a texture coordinate in a graphic coordinate takes the bottom left corner as an origin 810.
  • the texture coordinate is aligned with the bottom left corner of the bounding box, and horizontal and vertical offsets 812 and 814 of the adjusted text 808 are calculated with respect to the origin and added in the horizontal and vertical directions of the adjusted text 808, respectively.
  • the adjusted text 808 is aligned in the horizontal and vertical directions independently.
  • the adjusted text 808 is aligned with a left edge or a right edge of the bounding box 806 or equally spaced from the left and right edges of the bounding box 806.
  • the adjusted text 808 is placed immediately under a top edge or immediately above a bottom edge of the bounding box 806, or equally spaced from the top and bottom edges of the bounding box 806.
  • the character string 702 is organized into more than one line when the length of the character string 702 (represented by a sentence width 706) is longer than a box width 802 of the bounding box 806.
  • a split sentence is optionally aligned at the center of each line or with a left or right edge of the bounding box 806.
  • the list of multiple lines is rendered immediately under the top edge or above the bottom edge of the bounding box 806, or spaced equally from the top and bottom edges of the bounding box 806.
  • Figure 9 is a flow diagram of an exemplary process 900 of reorganizing text into a plurality of lines, in accordance with some embodiments
  • Figure 10 illustrates visual effect 1000 of text 702 being rendered in a plurality of lines, in accordance with some embodiments.
  • the text corresponds to a sentence and includes a character string 702.
  • the character string 702 is too long to be rendered in a single line. That is, even if the character string 702 is scaled down to a predefined font size (e.g., a minimum font size that can be displayed), the length 706 of the character string 702 is greater than the box width 802 of the bounding box 806, and the character string 702 cannot fit within one line of the bounding box 806.
  • the predefined font size of a target language is defined based on a character ratio of a target language and a source language.
  • the text to be rendered 702 includes a character string in English and is translated from Chinese.
  • Original Chinese text defines the box width 802 and height 804 of the bounding box 806.
  • the predefined character ratio threshold is 5.5 between Chinese and English. Space of every character in the source language of Chinese is configured to accommodate at most 5.5 characters in the target language of English.
  • an electronic device obtains the text to be rendered 702, related parameters 902 (e.g., w, h, x, y) in GL coordinate, and character ratios 904.
  • a character ratio 904 of the text 702 and original text is compared (906) with a predefined character ratio threshold (e.g., equal to 5.5), and the text 702 is reorganized (908) based on a comparison result.
  • the electronic system 200 keeps (910) the text 702 in one line. Conversely, if the character ratio 904 is less than the predefined character ratio threshold, the electronic system 200 determines that the text 702 needs to be split (912) into two or more lines. To split the text 702, the electronic system 200 tracks a plurality of text adjustment parameters 914 to determine (916) whether each word can be added to a current line. Examples of the text adjustment parameters 914 include a maximum line width 914A (max_width_of_the_line), a character width 914B (char_width), a word width 914C (word_width), and a line width 914D (line_width).
  • if the current word can be added to the current line without exceeding the maximum line width 914A, the current word is added (918A) to the current line, and the line width 914D is updated (918B). Conversely, if the current word can only be added to the current line by exceeding the maximum line width 914A, the current word is used to start (920) a new line, which is set as the current line. [0058] Additionally, if the electronic system 200 determines (922) that the word width 914C of the current word is greater than the maximum line width 914A, the current word is split (924) into a plurality of chunks to be displayed in different lines. In some embodiments, to keep the sentence splitting human-readable, a word is kept intact whenever possible.
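  • The splitting flow above amounts to a greedy word-wrapping pass. The C++ sketch below is one plausible reading of it, with an overlong word cut into chunks that each fit the maximum line width; measuring a word through a per-character width callback is a simplification assumed here for brevity.

```cpp
#include <functional>
#include <string>
#include <vector>

// Greedy reorganization of a sentence into lines no wider than maxLineWidth.
// widthOf(word) is assumed to return the rendered width of a word (e.g., the
// sum of glyph advances); a single space is assumed to be charWidth wide.
std::vector<std::string> splitIntoLines(const std::vector<std::string>& words,
                                        float maxLineWidth, float charWidth,
                                        const std::function<float(const std::string&)>& widthOf) {
    std::vector<std::string> lines;
    std::string current;
    float lineWidth = 0.0f;

    auto startNewLine = [&]() {
        if (!current.empty()) lines.push_back(current);
        current.clear();
        lineWidth = 0.0f;
    };

    for (const std::string& word : words) {
        float wordWidth = widthOf(word);
        if (wordWidth > maxLineWidth) {
            // Word longer than a whole line: split it into chunks.
            startNewLine();
            std::string chunk;
            float chunkWidth = 0.0f;
            for (char c : word) {
                if (chunkWidth + charWidth > maxLineWidth) {
                    lines.push_back(chunk);
                    chunk.clear();
                    chunkWidth = 0.0f;
                }
                chunk += c;
                chunkWidth += charWidth;
            }
            current = chunk;       // remainder starts the next line
            lineWidth = chunkWidth;
            continue;
        }
        float spacing = current.empty() ? 0.0f : charWidth;  // space before the word
        if (lineWidth + spacing + wordWidth <= maxLineWidth) {
            if (!current.empty()) current += ' ';
            current += word;
            lineWidth += spacing + wordWidth;  // update line_width 914D
        } else {
            startNewLine();                    // word starts a new line
            current = word;
            lineWidth = wordWidth;
        }
    }
    startNewLine();
    return lines;
}
```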
  • FIG. 11 illustrates visual effect 1100 of text rotation, in accordance with some embodiments.
  • a bounding box has a rotational angle 1102 with respect to a horizontal direction. Text to be rendered 702 is rotated with the same rotational angle 1102 about a rotation pivot 1104.
  • the rotation pivot 1104 is optionally a center of the bounding box 806.
  • the rotation angle 1102 parsed from the pipeline is in a pixel coordinate, and the text rendering is in a graphic coordinate.
  • a projection matrix is applied to keep a text layout consistent with an aspect ratio of a screen.
  • an orthogonal projection is used to get the projection matrix.
  • a rotation matrix is optionally calculated with respect to a z-axis.
  • the rotation matrix describes a rotation with respect to any axis in a three-dimensional (3D) space.
  • the projection and rotation matrices are applied to one or more vertex shader output variables, e.g., gl_Position and gl_Position.xy, in a vertex shader, utilizing the parallel computing capability of the GPU.
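  • To make the rotation and projection step concrete, the sketch below builds an orthographic projection and a z-axis rotation about the box center and hands the combined matrix to a small GLSL vertex shader. It is an illustrative arrangement that assumes the GLM math library and OpenGL ES 3.0; the attribute layout and names are hypothetical, not the application's actual shader.

```cpp
#include <glm/glm.hpp>
#include <glm/gtc/matrix_transform.hpp>

// Builds the matrices discussed above: an orthographic projection that keeps the
// text layout consistent with the screen aspect ratio, and a rotation about the
// z-axis around the center of the text region (the rotation pivot 1104).
glm::mat4 textTransform(float screenW, float screenH,
                        glm::vec2 boxCenter, float angleRadians) {
    glm::mat4 projection = glm::ortho(0.0f, screenW, 0.0f, screenH);
    glm::mat4 rotation(1.0f);
    rotation = glm::translate(rotation, glm::vec3(boxCenter, 0.0f));
    rotation = glm::rotate(rotation, angleRadians, glm::vec3(0.0f, 0.0f, 1.0f));
    rotation = glm::translate(rotation, glm::vec3(-boxCenter, 0.0f));
    return projection * rotation;
}

// Minimal vertex shader that consumes the combined matrix; the per-vertex
// attribute (glyph corner position plus texture coordinate) is an assumption.
static const char* kVertexShader = R"(#version 300 es
layout(location = 0) in vec4 aPosUV;   // xy = position, zw = texture coordinate
uniform mat4 uTransform;
out vec2 vTexCoord;
void main() {
    gl_Position = uTransform * vec4(aPosUV.xy, 0.0, 1.0);
    vTexCoord = aPosUV.zw;
})";
```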
  • FIG. 12 is a flowchart of a method 1200 for real time rendering of text on a screen of an electronic device, in accordance with some embodiments.
  • the method 1200 is implemented by an electronic device (e.g., a client device 104, a server 102, or a combination thereof).
  • An example of the client device 104 is a mobile phone.
  • Method 1200 is, in some embodiments, governed by instructions that are stored in a non-transitory computer readable storage medium and that are executed by one or more processors of the electronic device.
  • Each of the operations shown in Figure 12 may correspond to instructions stored in a computer memory or non-transitory computer readable storage medium (e.g., memory 206 of the system 200 in Figure 2).
  • the computer readable storage medium may include a magnetic or optical disk storage device, solid state storage devices such as Flash memory, or other non- volatile memory device or devices.
  • the instructions stored on the computer readable storage medium may include one or more of: source code, assembly language code, object code, or other instruction format that is interpreted by one or more processors. Some operations in method 1200 may be combined and/or the order of some operations may be changed.
  • the electronic device obtains (1210) at least one image including a first image (e.g., a restaurant menu 502 in Figures 5A-5E).
  • Each of the at least one image includes original text 504 in a source language (e.g., Chinese).
  • the electronic device identifies (1220) a text region (e.g., region 516 in Figures 5A-5E) that contains the original text 504 in the first image, translates (1230) the original text in the source language into translated text 518 in a target language (e.g., Korean, English, Japanese), and displays (1240) the translated text within the text region of the first image.
  • the original text in the first image has a first number of lines, and the translated text is displayed in a second number of lines (e.g., two lines in Figure 10). The second number is distinct from the first number.
  • the text region fully contains the original text in the first image, and the translated text is entirely displayed within the text region of the first image.
  • the second number is larger than the first number.
  • the first and second numbers are equal to 1 and 2, respectively.
  • the original text is displayed in one line in Chinese
  • the translated text is displayed in two lines in English.
  • the second number is smaller than the first number.
  • the first and second numbers are equal to 2 and 1, respectively.
  • the original text is displayed in two lines in Chinese, and the translated text is displayed in one line in English.
  • the at least one image further includes a second image that is captured prior to the first image.
  • the electronic device further obtains the second image, identifies a distinct text region in the second image.
  • the distinct text region includes the original text.
  • the electronic device recognizes the original text from the distinct text region of the second image. For example, recognized text is reused across images. Stated another way, the original text is recognized or translated from the second image that is captured earlier than the first image, and in accordance with a determination that the same text appears in both the first and second images, the first image displays the translated text directly without waiting for text recognition and translation, thereby expediting text rendering and enabling real time visual effects in a real time translation application.
  • the translated text is displayed in the first image with a latency from when the first image is captured, and the latency is less than a threshold latency (e.g., 1/50 second) such that the translated text is displayed in the first image in real time with capturing the first image.
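  • One way to realize this cross-frame reuse is to key translations by the recognized source string, so that a region whose text already appeared in an earlier frame can be rendered immediately without re-running recognition and translation. The class below is only a sketch of that idea with hypothetical names.

```cpp
#include <string>
#include <unordered_map>

// Cache of previously translated strings, keyed by the recognized source text.
// When the same text shows up in a later frame, the cached translation is
// rendered directly instead of waiting for recognition and translation again.
class TranslationCache {
public:
    const std::string* lookup(const std::string& sourceText) const {
        auto it = cache_.find(sourceText);
        return it == cache_.end() ? nullptr : &it->second;
    }
    void store(const std::string& sourceText, const std::string& translatedText) {
        cache_[sourceText] = translatedText;
    }
private:
    std::unordered_map<std::string, std::string> cache_;
};
```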
  • the electronic device extracts a set of characters 702 from a font library of the target language to form the translated text, reorganizes the set of characters 702 forming the translated text into the second number of lines to fit the translated text entirely within the text region 806 in the first image, adjusts a scale and an orientation of the set of characters forming the translated text to fit the translated text entirely within the text region in the first image, and renders the translated text with the set of characters 702 in the first image for display on the screen of the electronic device.
  • the set of characters 702 have a predefined size and a predefined orientation.
  • the scale and the orientation of the set of characters 702 forming the translated text further are adjusted.
  • For each character, the electronic system 200 generates a bitmap image for a respective glyph 600 of the font library, and determines a predefined width and a predefined height of the respective glyph 600 of the font library based on the predefined size.
  • a line width and a line height are calculated for each line of the second number of lines within the text region 806 according to the predefined width 602 and height 604 of the respective glyph 600 of each character to be displayed in the text region 806.
  • the scale of the second number of lines is adjusted according to the line width and line height of each line of the second number of lines and according to a region width and a region height of the text region.
  • the scale and the orientation of the set of characters forming the translated text are adjusted by rotating the translated text to an orientation of the text region 806 around a rotation pivot 1104.
  • the rotation pivot 1104 is optionally located at a center of the text region 806.
  • the set of characters forming the translated text are reorganized by determining whether a translation character ratio from the source language to the target language exceeds a threshold (e.g., 5.5); and in response to a determination that the translation character ratio exceeds the threshold, splitting the translated text into the second number of lines within the text region.
  • the set of characters forming the translated text are reorganized by determining, according to a width of a current line of the second number of lines and a width of a current word within the current line, whether there is enough space to add the current word to the current line, and in response to a determination that there is enough space to add the current word to the current line, adding the current word to the current line.
  • the set of characters forming the translated text are reorganized by: in response to a determination that there is not enough space to add the current word to the current line, determining whether the current word is longer than a maximum width; in response to a determination that the current word is longer than the maximum width, splitting the current word into multiple chunks and adding a chunk of the multiple chunks of the current word to a next line after the current line; and in response to a determination that the current word is not longer than the maximum width, adding the current word to the next line after the current line.
  • the set of characters forming the translated text are further reorganized by determining a character width for a smallest font of each of the set of characters of the translated text in the target language; determining a region width of the text region in the first image; determining a maximum character number per line based on a ratio of the region width and the character width; and determining the second number based on a ratio of a total number of characters in the translated text and the maximum character number per line.
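  • Expressed as arithmetic, the estimate in the preceding item reduces to two ratios; a hypothetical C++ helper (names illustrative) is shown below.

```cpp
#include <cmath>

// Estimates how many lines the translated text needs: the maximum number of
// characters per line follows from the region width and the smallest readable
// character width, and the line count follows from the total character count.
int estimateLineCount(float regionWidth, float minCharWidth, int totalCharacters) {
    int maxCharsPerLine = static_cast<int>(regionWidth / minCharWidth);
    if (maxCharsPerLine < 1) maxCharsPerLine = 1;
    return static_cast<int>(
        std::ceil(static_cast<float>(totalCharacters) / maxCharsPerLine));
}
```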
  • the term “if” is, optionally, construed to mean “when” or “upon” or “in response to determining” or “in response to detecting” or “in accordance with a determination that,” depending on the context.
  • the phrase “if it is determined” or “if [a stated condition or event] is detected” is, optionally, construed to mean “upon determining” or “in response to determining” or “upon detecting [the stated condition or event]” or “in response to detecting [the stated condition or event]” or “in accordance with a determination that [a stated condition or event] is detected,” depending on the context.

Abstract

A method is implemented at an electronic device for real time rendering of text on a screen of an electronic device. The electronic device obtains a first image. The first image includes original text in a source language. The electronic device identifies a text region that contains the original text in the first image, translates the original text in the source language into translated text in a target language, and displays the translated text within the text region of the first image. The original text in the first image has a first number of lines, and the translated text is displayed in a second number of lines, and the second number is distinct from the first number.

Description

Text Rendering on Mobile Devices
TECHNICAL FIELD
[0001] This application relates generally to data processing technology including, but not limited to, methods, systems, and non-transitory computer-readable media for rendering texts on an existing image.
BACKGROUND
[0002] Software applications have been developed to provide optical character recognition (OCR) translation or associated text rendering features. Despite their acceptable performance, existing applications cannot render text size, position, and orientation efficiently with a seamless user experience, particularly when heavy-duty text rendering occurs. It would be beneficial to have an efficient text rendering mechanism to render texts on an existing image.
SUMMARY
[0003] Various embodiments of this application are directed to rendering textual content on an existing image from which related content is extracted. Text scaling and position adjustment is optionally applied to fit the rendered textual content into a region, e.g., where the related content is identified and replaced. Specifically, in some embodiments, scaled and adjusted text is rendered to a text location on a screen in real time. The text is scaled with respect to the texture size of initialized glyphs, and rotated to keep an orientation of the texts consistent with an orientation of a text region associated with the text location. In some situations, the text is offset in both horizontal and vertical directions to render the text at a center of the text region. Further, a long text is optionally split into multiple lines to keep a font size human-readable. Parameters for text scaling, position adjustment, text orientation, text offset, or text splitting are computed in real time while the existing image is processed to extract the related content and render the textual content. Such textual content rendering is applied to render the most common languages (e.g., Chinese, English, Japanese, Korean), after corresponding glyphs are loaded as texture from a font library.
[0004] In some embodiments, a Least Recently Used (LRU) cache is used to store and update the characters’ information, which is used to query textureID, location, and index in a hash map. Application of the LRU cache avoids an unlimited increase in memory and a redundancy in loading characters on mobile devices having constrained computational resources. For texture storage, a large texture array is used to store a large number of characters (e.g., 8192 characters) during initialization. A query of a specific texture is executed in a vertex shader using attributes (e.g., texture location and corner index).
[0005] In one aspect, a method is implemented at an electronic device for real time rendering of text on a screen of the electronic device. The method includes obtaining a first image. The first image includes original text in a source language. The method further includes identifying a text region that contains the original text in the first image, translating the original text in the source language into translated text in a target language, and displaying the translated text within the text region of the first image. The original text in the first image has a first number of lines, the translated text is displayed in a second number of lines, and the second number is distinct from the first number. In some embodiments, the second number is larger than the first number.
In some embodiments, the text region fully contains the original text in the first image, and the translated text is entirely displayed within the text region of the first image.
[0006] In some embodiments, the method further includes obtaining a second image that is captured prior to the first image, identifying a distinct text region in the second image, the distinct text region including the original text, and prior to identifying the text region of the first image, recognizing the original text from the distinct text region of the second image.
[0007] In another aspect, some implementations include an electronic device that includes one or more processors and memory having instructions stored thereon, which when executed by the one or more processors cause the one or more processors to perform any of the above methods.
[0008] In yet another aspect, some implementations include a non-transitory computer-readable medium, having instructions stored thereon, which when executed by one or more processors cause the one or more processors to perform any of the above methods.
[0009] These illustrative embodiments and implementations are mentioned not to limit or define the disclosure, but to provide examples to aid understanding thereof. Additional embodiments are discussed in the Detailed Description, and further description is provided there.
BRIEF DESCRIPTION OF THE DRAWINGS
[0010] For a better understanding of the various described implementations, reference should be made to the Detailed Description below, in conjunction with the following drawings in which like reference numerals refer to corresponding parts throughout the figures.
[0011] Figure 1 is an example data processing environment having one or more servers communicatively coupled to one or more client devices, in accordance with some embodiments.
[0012] Figure 2 is a block diagram illustrating an electronic system for data processing, in accordance with some embodiments.
[0013] Figure 3 is a graphics user interface of a real time translation application executed to translate textual content in a field of view of a camera, in some embodiments.
[0014] Figure 4 is a block diagram of a text rendering process, in accordance with some embodiments.
[0015] Figures 5A-5E are graphics user interfaces of a real time translation application implementing a text rendering process, in accordance with some embodiments.
[0016] Figure 6 is a diagram of example glyph metrics defining an individual glyph, in accordance with some embodiments.
[0017] Figure 7 is a flow diagram of a text scale adjustment process, in accordance with some embodiments.
[0018] Figure 8 is a diagram illustrating visual effects 800 of adjusted text scale and offsets as a character string is rendered in one line, in accordance with some embodiments.
[0019] Figure 9 is a flow diagram of an exemplary process of reorganizing text into a plurality of lines, in accordance with some embodiments.
[0020] Figure 10 illustrates a visual effect of text being rendered in a plurality of lines, in accordance with some embodiments.
[0021] Figure 11 illustrates a visual effect of text rotation, in accordance with some embodiments.
[0022] Figure 12 is a flowchart of a method for real time rendering of text on a screen of an electronic device, in accordance with some embodiments.
[0023] Like reference numerals refer to corresponding parts throughout the several views of the drawings.
DETAILED DESCRIPTION
[0024] Reference will now be made in detail to specific embodiments, examples of which are illustrated in the accompanying drawings. In the following detailed description, numerous non-limiting specific details are set forth in order to assist in understanding the subject matter presented herein. But it will be apparent to one of ordinary skill in the art that various alternatives may be used without departing from the scope of the claims and the subject matter may be practiced without these specific details. For example, it will be apparent to one of ordinary skill in the art that the subject matter presented herein can be implemented on many types of electronic devices with image or video processing capabilities.
[0025] Various embodiments of the present application are described in detail based on many names and terms, e.g., optical character recognition (OCR), vertex shader, glyph, OpenGL, and graphics processing unit (GPU). Specifically, a vertex shader refers to a programmable shader stage in the rendering pipeline that handles processing of individual vertices. A glyph is a hieroglyphic character or symbol. OpenGL is a cross-language, cross-platform application programming interface for rendering two-dimensional (2D) and three-dimensional (3D) vector graphics. Systems and methods disclosed herein for text size adjustment could provide a better user experience and visual effect than other systems in OCR translation or other text rendering methods, especially when the rendered text needs to be compressed or enlarged to fit within a region. A rendering rate is improved when an LRU cache and a large texture array are used to store and query glyph textures, compared with a rate of rendering text on a character-by-character basis using a smaller texture in a loop. Additionally, methods and systems disclosed herein are able to provide consistent rendering results of text in most common languages and of different fonts.
[0026] Figure 1 is an example data processing environment 100 having one or more servers 102 communicatively coupled to one or more client devices 104, in accordance with some embodiments. The one or more client devices 104 may be, for example, desktop computers 104A, tablet computers 104B, mobile phones 104C, or intelligent, multi-sensing, network-connected home devices (e.g., a camera). Each client device 104 can collect data or user inputs, execute user applications, and present outputs on its user interface. The collected data or user inputs are processed locally (e.g., for translation and/or rendering) at the client device 104 and/or remotely by the server(s) 102. The one or more servers 102 provide system data (e.g., boot files, operating system images, and user applications) to the client devices 104, and in some embodiments, process the data and user inputs received from the client device(s) 104 when the user applications are executed on the client devices 104. In some embodiments, the data processing environment 100 further includes a storage 106 for storing data related to the servers 102, client devices 104, and applications executed on the client devices 104. For example, storage 106 may store results of video content translation and text rendering, and/or video content obtained by a user to which a translation and text rendering process is applied to determine one or more actions associated with the video content.
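As an aside to paragraph [0025], the LRU cache and large texture array can be pictured with the short C++ sketch below: each character maps to an entry recording its layer in the texture array plus its glyph metrics, and the least recently used entry is evicted once the capacity (8192 in the example above) is reached. All names are illustrative assumptions, not the application's actual data structures.

```cpp
#include <cstddef>
#include <list>
#include <optional>
#include <unordered_map>
#include <utility>

// Hypothetical per-character record: which layer of the large texture array
// holds the glyph bitmap, plus the metrics needed to position it.
struct GlyphEntry {
    int   textureLayer;       // index into the texture array ("textureID"/location)
    float width, height;      // bitmap size in pixels
    float bearingX, bearingY;
    float advance;            // horizontal advance in pixels
};

class GlyphLruCache {
public:
    explicit GlyphLruCache(size_t capacity) : capacity_(capacity) {}

    // Returns the cached entry and marks it as most recently used.
    std::optional<GlyphEntry> get(char32_t codepoint) {
        auto it = index_.find(codepoint);
        if (it == index_.end()) return std::nullopt;
        order_.splice(order_.begin(), order_, it->second);  // move to front
        return it->second->second;
    }

    // Inserts (or refreshes) an entry, evicting the least recently used one
    // when the cache is full, so memory does not grow without bound.
    void put(char32_t codepoint, const GlyphEntry& entry) {
        auto it = index_.find(codepoint);
        if (it != index_.end()) {
            it->second->second = entry;
            order_.splice(order_.begin(), order_, it->second);
            return;
        }
        if (order_.size() == capacity_) {
            index_.erase(order_.back().first);  // evict the LRU character
            order_.pop_back();
        }
        order_.emplace_front(codepoint, entry);
        index_[codepoint] = order_.begin();
    }

private:
    size_t capacity_;
    std::list<std::pair<char32_t, GlyphEntry>> order_;                // MRU at the front
    std::unordered_map<char32_t, decltype(order_)::iterator> index_;  // hash map lookup
};

// Usage: GlyphLruCache cache(8192); cache.put(U'中', {layer, w, h, bx, by, adv});
```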
[0027] The one or more servers 102 can enable real-time data communication with the client devices 104 that are remote from each other or from the one or more servers 102. Further, in some embodiments, the one or more servers 102 can implement data processing tasks that cannot be or are preferably not completed locally by the client devices 104. For example, the client devices 104 include a game console that executes an interactive online gaming application. The game console receives a user instruction and sends it to a game server 102 with user data. The game server 102 generates a stream of video data based on the user instruction and user data and provides the stream of video data for display on the game console and other client devices that are engaged in the same game session with the game console. In another example, the client devices 104 include a networked surveillance camera and a mobile phone 104C. The networked surveillance camera collects video data and streams the video data to a surveillance camera server 102 in real time. While the video data is optionally pre-processed on the surveillance camera, the surveillance camera server 102 processes the video data to identify motion or audio events in the video data and share information of these events with the mobile phone 104C, thereby allowing a user of the mobile phone 104C to monitor the events occurring near the networked surveillance camera in real time and remotely.
[0028] The one or more servers 102, one or more client devices 104, and storage 106 are communicatively coupled to each other via one or more communication networks 108, which are the medium used to provide communication links between these devices and computers connected together within the data processing environment 100. The one or more communication networks 108 may include connections, such as wire, wireless communication links, or fiber optic cables. Examples of the one or more communication networks 108 include local area networks (LAN), wide area networks (WAN) such as the Internet, or a combination thereof. The one or more communication networks 108 are, optionally, implemented using any known network protocol, including various wired or wireless protocols, such as Ethernet, Universal Serial Bus (USB), FIREWIRE, Long Term Evolution (LTE), Global System for Mobile Communications (GSM), Enhanced Data GSM Environment (EDGE), code division multiple access (CDMA), time division multiple access (TDMA), Bluetooth, Wi-Fi, voice over Internet Protocol (VoIP), Wi-MAX, or any other suitable communication protocol. A connection to the one or more communication networks 108 may be established either directly (e.g., using 3G/4G connectivity to a wireless carrier), or through a network interface 110 (e.g., a router, switch, gateway, hub, or an intelligent, dedicated whole-home control node), or through any combination thereof. As such, the one or more communication networks 108 can represent the Internet, a worldwide collection of networks and gateways that use the Transmission Control Protocol/Internet Protocol (TCP/IP) suite of protocols to communicate with one another. At the heart of the Internet is a backbone of high-speed data communication lines between major nodes or host computers, consisting of thousands of commercial, governmental, educational, and other computer systems that route data and messages.
[0029] A text translation and/or text rendering process is applied in the data processing environment 100 to process content data (e.g., video data, visual data, audio data) obtained by an application executed at a client device 104 to identify information contained in the content data, match the content data with other data, categorize the content data, or synthesize related content data. [0030] Figure 2 is a block diagram illustrating an electronic system 200 for data processing, in accordance with some embodiments. The electronic system 200 includes a server 102, a client device 104, a storage 106, or a combination thereof. The electronic system 200, typically, includes one or more processing units (CPUs) 202, one or more network interfaces 204, memory 206, and one or more communication buses 208 for interconnecting these components (sometimes called a chipset). The electronic system 200 includes one or more input devices 210 that facilitate user input, such as a keyboard, a mouse, a voice-command input unit or microphone, a touch screen display, a touch-sensitive input pad, a gesture capturing camera, or other input buttons or controls. Furthermore, in some embodiments, the client device 104 of the electronic system 200 uses a microphone and voice recognition or a camera and gesture recognition to supplement or replace the keyboard. In some embodiments, the client device 104 includes one or more cameras, scanners, or photo sensor units for capturing images, for example, of graphic serial codes printed on the electronic devices. The electronic system 200 also includes one or more output devices 212 that enable presentation of user interfaces and display content, including one or more speakers and/or one or more visual displays. Optionally, the client device 104 includes a location detection device, such as a GPS (global positioning system) or other geo-location receiver, for determining the location of the client device 104. [0031] Memory 206 includes high-speed random access memory, such as DRAM, SRAM, DDR RAM, or other random access solid state memory devices; and, optionally, includes non-volatile memory, such as one or more magnetic disk storage devices, one or more optical disk storage devices, one or more flash memory devices, or one or more other non-volatile solid state storage devices. Memory 206, optionally, includes one or more storage devices remotely located from one or more processing units 202. Memory 206, or alternatively the non-volatile memory within memory 206, includes a non-transitory computer readable storage medium. In some embodiments, memory 206, or the non-transitory computer readable storage medium of memory 206, stores the following programs, modules, and data structures, or a subset or superset thereof:
- Operating system 214 including procedures for handling various basic system services and for performing hardware dependent tasks;
- Network communication module 216 for connecting each server 102 or client device 104 to other devices (e.g., server 102, client device 104, or storage 106) via one or more network interfaces 204 (wired or wireless) and one or more communication networks 108, such as the Internet, other wide area networks, local area networks, metropolitan area networks, and so on;
- User interface module 218 for enabling presentation of information (e.g., a graphical user interface for application(s) 224, widgets, websites and web pages thereof, and/or games, audio and/or video content, text, etc.) at each client device 104 via one or more output devices 212 (e.g., displays, speakers, etc.);
- Input processing module 220 for detecting one or more user inputs or interactions from one of the one or more input devices 210 and interpreting the detected input or interaction;
- Web browser module 222 for navigating, requesting (e.g., via HTTP), and displaying websites and web pages thereof, including a web interface for logging into a user account associated with a client device 104 or another electronic device, controlling the client or electronic device if associated with the user account, and editing and reviewing settings and data that are associated with the user account;
- One or more user applications 224 for execution by the electronic system 200 (e.g., games, social network applications, smart home applications, real time translation application 226, and/or other web or non-web based applications for controlling another electronic device and reviewing data captured by such devices);
- Data processing module 228 for processing content data using data processing models 240, thereby identifying information contained in the content data, matching the content data with other data, categorizing the content data, or synthesizing related content data, where in some embodiments, the data processing module 228 is associated with one of the user applications 224 to process the content data in response to a user instruction received from the user application 224;
- One or more databases 230 for storing at least data including one or more of:
  o Device settings 232 including common device settings (e.g., service tier, device model, storage capacity, processing capabilities, communication capabilities, etc.) of the one or more servers 102 or client devices 104;
  o User account information 234 for the one or more user applications 224, e.g., user names, security questions, account history data, user preferences, and predefined account settings;
  o Network parameters 236 for the one or more communication networks 108, e.g., IP address, subnet mask, default gateway, DNS server and host name;
  o Data 238 for feeding one or more data processing models 240;
  o Data processing model(s) 240 for processing content data (e.g., video data, visual data, audio data); and
  o Content data and results 242 that are obtained by and outputted to the client device 104 of the electronic system 200, respectively, where the content data is processed by the data processing models 240 locally at the client device 104 or remotely at the server 102 to provide the associated results 242 (e.g., translated text to be rendered in a region of an image) to be presented on the client device 104.
[0032] Optionally, the one or more databases 230 are stored in one of the server 102, client device 104, and storage 106 of the electronic system 200. Optionally, the one or more databases 230 are distributed in more than one of the server 102, client device 104, and storage 106 of the electronic system 200. In some embodiments, more than one copy of the above data is stored at distinct devices, e.g., two copies of the data processing models 240 are stored at the server 102 and storage 106, respectively. [0033] Each of the above identified elements may be stored in one or more of the previously mentioned memory devices, and corresponds to a set of instructions for performing a function described above.
The above identified modules or programs (i.e., sets of instructions) need not be implemented as separate software programs, procedures, modules or data structures, and thus various subsets of these modules may be combined or otherwise re-arranged in various embodiments. In some embodiments, memory 206, optionally, stores a subset of the modules and data structures identified above. Furthermore, memory 206, optionally, stores additional modules and data structures not described above. [0034] Figure 3 is a graphics user interface (GUI) 300 of a real time translation application 226 executed to translate textual content in a field of view of a camera, in accordance with some embodiments. The real time translation application 226 is executed on a mobile device 104 to provide OCR translation or associated text rendering features. The user interface 300 displays a Chinese restaurant menu that has been translated to English. Chinese content has been entirely covered by the English translation except for a missing region 302. The missing region 302 is still displayed with Chinese characters as an error. Text in a first region 304 is displayed with inconsistent font sizes, and some text 306 is too small to be recognized. Some text 308 is displayed with such an abnormal aspect ratio that the text 308 is distorted. As such, despite acceptable performance, the user interface 300 cannot render text size, position and orientation efficiently with a seamless user experience. Particularly, when heavy-duty text rendering occurs, e.g., when a large paragraph of an article is being displayed, a user of the real time translation application 226 could feel an extended latency from frame to frame due to a high computation load. [0035] Figure 4 is a block diagram of a text rendering process 400, in accordance with some embodiments. In the text rendering process 400, the electronic system 200 receives information of one or more texts 402 to be rendered and information of one or more text regions 404, and renders the text(s) 402 properly within the text region(s) 404 on a screen of a mobile device 104. In an example, the one or more texts 402 to be rendered include translated texts, and the one or more text regions 404 include OCR boxes where original text is displayed and replaced with the translated text. The output of the system is rendering results 406 displayed on the screen of the mobile device 104, e.g., with OpenGL. [0036] In order to render texts properly, the system utilizes several functions for text adjustment, such as reorganizing text, getting text width and height, and adjusting text rendering parameters. The rotation about the z-axis is also calculated and updated in the vertex shader. Since the texts can be in different languages and the total lengths of the texts are unknown, the scale and arrangement need to be calculated dynamically with respect to the size of the text regions. In addition, if the texts are being compressed to a very small scale to fit in a text region, the whole sentence is reorganized into multiple lines, depending on the ratios. [0037] In some embodiments, the one or more texts 402 to be rendered by the electronic system 200 are converted (406) to a wide string, and the parameters of the one or more text regions 404 are converted (408) to the GL coordinate. The electronic system 200 loads (410) the characters of the input text from font libraries and generates (412) the text hash map.
The electronic system 200 further re-organizes (414) the input text and generates (416) a list of split texts. The electronic system 200 gets (418) the text width and height to calculate (420) the text width, height and offset. The electronic system 200 also adjusts (422) text render parameters to calculate (424) the text scale. Subsequently, the electronic system 200 renders (426) the text, updates (428) vertex data, passes (430) the parameters to the shader, and invokes (432) GL draw calls to render the input text to the rendering results 406. [0038] Figures 5A-5E are graphics user interfaces 500, 510, 520, 530, and 540 of a real time translation application 226 implementing a text rendering process 400, in accordance with some embodiments. The real time translation application 226 is implemented in mobile devices, for example, mobile phones 104C. The real time translation application 226 is configured to detect texts using OCR algorithms and get translated texts in multiple languages, such as Chinese, English, Korean, Japanese, etc. The rendered texts are the translated texts from a source language. The real time translation application 226 achieves desirable visual effects and user experiences by way of the text rendering process 400. [0039] Referring to Figure 5A, the real time translation application 226 displays a restaurant menu 502 in Chinese in a first user interface 500. The restaurant menu 502 includes a list of Chinese phrases 504 corresponding to Chinese dishes. The first user interface 500 includes a start affordance 506, a stop affordance 508, and a target language affordance 514. In response to a user selection of the target language affordance 514, a list of target language options is displayed, and a user may select the target language to translate the list of Chinese phrases 504. In some embodiments, a first user action on the start affordance 506 initiates a text translation mode in which the text rendering process 400 is implemented to recognize and translate text in the first user interface 500 and display the translated text in the target language in real time. A second user action on the stop affordance 508 follows the first user action and terminates the text translation mode, allowing the Chinese menu 502 to be redisplayed on the first user interface 500. In some embodiments, the source language of Chinese is automatically detected by the real time translation application 226 based on image data displayed in the first user interface 500. Alternatively, in some embodiments, the first user interface 500 further includes a source language affordance 512. In response to a user selection of the source language affordance 512, a list of source language options is displayed, and a user may select the source language to recognize the Chinese phrases 504 from the restaurant menu 502. In this example, both the source and target languages are Chinese. When the first user action is applied on the start affordance 506, no OCR or text rendering is executed to translate the original restaurant menu 502. [0040] Referring to Figure 5B, in some embodiments, the source and target languages are English and Chinese selected via the source and target language affordances 512 and 514, respectively. When the first user action is applied on the start affordance 506, OCR is executed for recognizing the Chinese phrases 504.
In some embodiments, the real time translation application 226 fails to recognize the Chinese phrases 504 based on the source language of English, and therefore, aborts OCR and text translation. The original restaurant menu 502 is displayed in Chinese as is. Alternatively, in some embodiments, the real time translation application 226 recognizes the Chinese phrases 504 successfully, while aborting text translation in accordance with a determination that the recognized Chinese phrases 504 are not consistent with the source language as defined by the source language affordance 512 or a determination that the recognized Chinese phrases 504 are recognized in the target language as defined by the target language affordance 514. [0041] Referring to Figure 5C, in some embodiments, the source and target languages are Chinese and Korean selected via the source and target language affordances 512 and 514, respectively. When the first user action is applied on the start affordance 506, OCR is executed to recognize the Chinese phrases 504 in Chinese. The real time translation application 226 then translates the Chinese phrases 504 to Korean. The translated Korean phrases 518 are displayed in place of the corresponding Chinese phrases 504 on the restaurant menu 502. Specifically, each Chinese phrase 504 corresponds to a text region 516 in which the Chinese phrase 504 is tightly enclosed, and a corresponding translated Korean phrase 518 is displayed within the same text region 516. [0042] Referring to Figure 5D, in some embodiments, the source and target languages are Chinese and English selected via the source and target language affordances 512 and 514, respectively. After the first user action is applied on the start affordance 506, the real time translation application 226 translates the Chinese phrases 504 to English. The translated English phrases 518 are displayed in place of the corresponding Chinese phrases 504 on the restaurant menu 502, e.g., in the same text regions 516 of the corresponding Chinese phrases 504. [0043] Referring to Figure 5E, in some embodiments, the source and target languages are Chinese and Japanese selected via the source and target language affordances 512 and 514, respectively. After the first user action is applied on the start affordance 506, the real time translation application 226 translates the Chinese phrases 504 to Japanese. The translated Japanese phrases 518 are displayed in place of the corresponding Chinese phrases 504 on the restaurant menu 502, e.g., in the same text regions 516 of the corresponding Chinese phrases 504. [0044] Independently of the target language, each character of the translated text 518 is loaded (410) from a corresponding font library with a respective text hash map 412. The translated text 518 is reorganized (414) to a list of split text 416. Based on the text hash map 412, text width, height, and offset of each character are obtained (418), and text render parameters are adjusted (422) to identify a text scale 424 for the translated text 518. In accordance with the text scale 424, the translated text 518 is fit into the text region 516 that fully contains the original Chinese phrases 504 in the restaurant menu 502. The translated text 518 is optionally rendered using an OpenGL process. [0045] Figure 6 is a diagram of exemplary glyph metrics defining an individual glyph (e.g., "g") 600, in accordance with some embodiments. A library is used to load a plurality of glyphs 600 as textures and render the textures with OpenGL.
An example of the library is the FreeType library, and adjustment of each individual glyph 600 is based on a plurality of glyph metrics in the FreeType library. In some embodiments, FreeType loads TrueType fonts. For each glyph 600, the FreeType library generates a bitmap image and calculates several metrics. The bitmap image of each character glyph 600 is extracted to generate textures, and each character glyph is positioned using the loaded glyph metrics. [0046] Each of the plurality of glyphs 600 is associated with a horizontal baseline 620 (as depicted by the horizontal arrow). In some embodiments, the horizontal baseline 620 passes through the glyph 600 (e.g., "g", "p"). That is, the glyph 600 has a first glyph portion that sits on this baseline 620 and a second glyph portion that is slightly below the baseline 620. Alternatively, in some embodiments, the glyph 600 (e.g., "a") sits on the baseline 620 entirely. For each glyph 600, the plurality of glyph metrics define an exact offset to properly position the glyph 600 with respect to the baseline 620, a size in which each glyph should be displayed, and a number of pixels that are needed to render the glyph. [0047] In some embodiments, the plurality of glyph metrics of each glyph 600 include one or more of:
- Width 602 representing a width of the corresponding bitmap image of the glyph 600 in pixels, which is accessed via a library path of face->glyph->bitmap.width;
- Height 604 representing a height of the corresponding bitmap image of the glyph 600 in pixels, which is accessed via a library path of face->glyph->bitmap.rows;
- BearingX 606 representing a horizontal bearing (e.g., a horizontal position of the corresponding bitmap image of the glyph 600 relative to an origin 640) in pixels, which is accessed via a library path of face->glyph->bitmap_left;
- BearingY 608 representing a vertical bearing (e.g., a vertical position of the corresponding bitmap image of the glyph 600 relative to the origin 640) in pixels, which is accessed via a library path of face->glyph->bitmap_top; and
- Advance 610 representing a horizontal advance (e.g., a horizontal distance from the origin 640 to another origin of an immediately adjacent glyph), which is accessed via a library path of face->glyph->advance.x.
[0048] Figure 7 is a flow diagram of a text scale adjustment process 700, in accordance with some embodiments. Text to be rendered includes a plurality of characters, and the text including the characters is adjusted and scaled to fit within a text region (e.g., region 516 in Figures 5A-5E). The text region is also called a bounding box, a rendering region, or an OCR box. The input for adjustment is a character string 702 including a list of split text. Scaling adjustment includes getting (704) a length and a height of the string 702 in a pixel coordinate. The length of the character string 702 refers to a sentence width 706, which is calculated cumulatively on a character-by-character basis. In an example, the length of a single character is represented by one of two glyph metrics, i.e., width 602 (ch.size.x) and advance 610 (ch.advance). A height of the character string refers to a sentence height 708, which is measured from the topmost position to the bottommost position of the characters in the character string 702.
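To make the glyph loading of paragraphs [0045]-[0047] concrete, the following C++ fragment sketches how a character can be loaded with FreeType and its bitmap uploaded as a single-channel OpenGL texture, with the metrics of Figure 6 read from the loaded glyph slot. It is a minimal sketch that assumes an already-created FT_Face (sized with FT_Set_Pixel_Sizes) and a current OpenGL ES context; the GlyphInfo record and the function name are illustrative placeholders rather than part of this disclosure.

#include <ft2build.h>
#include FT_FREETYPE_H
#include <GLES3/gl3.h>   // assumed platform header; desktop GL works similarly

// Illustrative record mirroring the metrics of Figure 6.
struct GlyphInfo {
    GLuint textureId;
    int width, height;       // face->glyph->bitmap.width / .rows
    int bearingX, bearingY;  // face->glyph->bitmap_left / bitmap_top
    long advance;            // face->glyph->advance.x, in 1/64 pixel units
};

// Loads one character with FreeType and uploads its bitmap as a single-channel
// OpenGL texture. Assumes 'face' was created with FT_New_Face and sized with
// FT_Set_Pixel_Sizes, and that a GL context is current.
GlyphInfo loadGlyph(FT_Face face, char32_t codepoint) {
    if (FT_Load_Char(face, codepoint, FT_LOAD_RENDER) != 0) {
        return {};  // failed to load; a caller may substitute a fallback glyph
    }
    GLuint tex = 0;
    glGenTextures(1, &tex);
    glBindTexture(GL_TEXTURE_2D, tex);
    glPixelStorei(GL_UNPACK_ALIGNMENT, 1);  // glyph bitmaps are tightly packed
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RED,
                 face->glyph->bitmap.width, face->glyph->bitmap.rows,
                 0, GL_RED, GL_UNSIGNED_BYTE, face->glyph->bitmap.buffer);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_S, GL_CLAMP_TO_EDGE);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_T, GL_CLAMP_TO_EDGE);
    return {tex,
            (int)face->glyph->bitmap.width, (int)face->glyph->bitmap.rows,
            face->glyph->bitmap_left, face->glyph->bitmap_top,
            face->glyph->advance.x};
}

Note that face->glyph->advance.x is expressed in 1/64 pixel units, so a renderer typically shifts it right by six bits before accumulating it into the sentence width 706.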
[0049] The text region where the character string 702 is rendered has a box width and a box height. The character string 702 is adjusted (710) to fit into the text region based on text render parameters, and the text render parameters include a width scale factor 712 and a height scale factor 714. The character string 702 is displayed in one or more lines, and each line has a line width (line_width) and a line height (line_height). The width scale factor 712 is a ratio of the box width and the length of the character string 702, and the height scale factor 714 is a ratio of the box height and the height of the character string. An overall text scale 716 is the smaller one of the width scale factor 712 and the height scale factor 714. [0050] Stated another way, an electronic system 200 receives the list of split text 702, and determines (704) text width and height from the input text. A sentence width 706 is a sum of character widths of the string of characters 702, and a sentence height 708 is a difference between the topmost position and the bottommost position of the characters in the character string. The electronic system 200 adjusts (710) the text render parameters including the width scale factor 712 and the height scale factor 714. The width scale factor 712 (scale_x) is equal to the box width (box_width) divided by the sentence width 706 (line_width), and the height scale factor 714 (scale_y) is equal to the box height (box_height) divided by the sentence height (line_height). The overall text scale 716 is the smaller one of the width scale factor 712 and the height scale factor 714, i.e., min(scale_x, scale_y). [0051] Additionally, the text is rendered in the center of the text region. A texture coordinate in a graphic coordinate takes a bottom left corner as an origin of the character string 702, and by default, is aligned with the bottom left corner of the text box. The text is rendered in one or more lines within the text region. The electronic system 200 adjusts (718) x and y offsets (i.e., x_offset, y_offset) of each line in the text region. In some embodiments, the character string is displayed in one line within the text region. The x offset 720 (x_offset) is equal to a half of a difference between the box width and an adjusted sentence width, i.e., equal to (box_width - adjusted_sentence_width)/2. The y offset 722 (y_offset) is equal to (box_height - adjusted_sentence_height)/2 + (N-1)*line_height for the N-th line counted from the bottom of the text region. [0052] Figure 8 is a diagram illustrating visual effects 800 of adjusted text scale 710 and offsets 718 as a character string 702 is rendered in one line, in accordance with some embodiments. In some embodiments, after getting a length 706 and height 708 of the character string 702 in pixels, the character string 702 is scaled with respect to the box width 802 and height 804 of a bounding box 806 (also called text region). A smaller value of the width scale factor 712 and height scale factor 714 is used as an overall text scale 716. In some situations, the character string 702 is enlarged when the length of the character string 702 is smaller than the box width of the bounding box. An enlargement operation is optionally applied horizontally and/or vertically to keep an aspect ratio of the character string 702. In an example, an adjusted text 808A fills a horizontal direction of the bounding box 806 and leaves space in a vertical direction. In another example, an adjusted text 808B fills the vertical direction of the bounding box 806 and leaves space in a horizontal direction.
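The scale and offset arithmetic of paragraphs [0049]-[0051] can be condensed into a short C++ helper; the variable names mirror the description (box_width, box_height, sentence dimensions, line_height), while the function and structure names, and the treatment of the adjusted width and height as the scaled sentence dimensions, are assumptions made for illustration only.

#include <algorithm>

// Placement of a rendered line inside a text region (bounding box).
struct LinePlacement {
    float scale;     // overall text scale 716
    float x_offset;  // x offset 720
    float y_offset;  // y offset 722
};

// Computes the overall scale and the offsets that center the n-th line
// (counted from the bottom, starting at 1) within the text region,
// following the relations of paragraphs [0049]-[0051].
LinePlacement placeLine(float box_width, float box_height,
                        float sentence_width, float sentence_height,
                        float line_height, int n_from_bottom) {
    float scale_x = box_width / sentence_width;    // width scale factor 712
    float scale_y = box_height / sentence_height;  // height scale factor 714
    float scale = std::min(scale_x, scale_y);      // overall text scale 716

    float adjusted_sentence_width  = sentence_width * scale;
    float adjusted_sentence_height = sentence_height * scale;

    LinePlacement p;
    p.scale = scale;
    p.x_offset = (box_width - adjusted_sentence_width) / 2.0f;
    p.y_offset = (box_height - adjusted_sentence_height) / 2.0f
                 + (n_from_bottom - 1) * line_height * scale;
    return p;
}

Taking the minimum of the two factors guarantees that the scaled string fits both the width and the height of the region, which is why one direction of the bounding box is filled while the other leaves space, as shown for texts 808A and 808B.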
[0053] In some embodiments, the adjusted text 808 is rendered in the center of the bounding box 806. A texture coordinate in a graphic coordinate takes the bottom left corner as an origin 810. By default, the texture coordinate is aligned with the bottom left corner of the bounding box, and horizontal and vertical offsets 812 and 814 of the adjusted text 808 are calculated with respect to the origin 810 and added in horizontal and vertical directions of the adjusted text 808, respectively. Alternatively, the adjusted text 808 is aligned in the horizontal and vertical directions independently. In the horizontal direction, the adjusted text 808 is aligned with a left edge or a right edge of the bounding box 806 or equally spaced from the left and right edges of the bounding box 806. In the vertical direction, the adjusted text 808 is placed immediately under a top edge or immediately above a bottom edge of the bounding box 806, or equally spaced from the top and bottom edges of the bounding box 806. [0054] In some situations, the character string 702 is organized into more than one line when the length of the character string 702 (represented by a sentence width 706) is longer than a box width 802 of the bounding box 806. For a list of multiple lines, a split sentence is optionally aligned at the center of each line or with a left or right edge of the bounding box 806. Optionally, the list of multiple lines is rendered immediately under the top edge or above the bottom edge of the bounding box 806, or spaced equally from the top and bottom edges of the bounding box 806. [0055] Figure 9 is a flow diagram of an exemplary process 900 of reorganizing text into a plurality of lines, in accordance with some embodiments, and Figure 10 illustrates a visual effect 1000 of text 702 being rendered in a plurality of lines, in accordance with some embodiments. The text corresponds to a sentence and includes a character string 702. The character string 702 is too long to be rendered in a single line. That is, if the character string 702 is scaled down to a predefined font size (e.g., a minimum font size that can be displayed), the length 706 of the character string 702 is greater than the box width 802 of the bounding box 806, and the character string 702 cannot fit within one line of the bounding box 806. The character string 702 is thereby reorganized into two or more lines to fit within the box width 802 of the bounding box. [0056] In some embodiments, the predefined font size of a target language is defined based on a character ratio between the target language and the source language. The text to be rendered 702 includes a character string in English and is translated from Chinese. Original Chinese text defines the box width 802 and height 804 of the bounding box 806. In an example, the predefined character ratio threshold is 5.5 between Chinese and English. Space of every character in the source language of Chinese is configured to accommodate at most 5.5 characters in the target language of English. If a total number of characters of the character string 702 is more than 5.5 times a total number of characters in the original Chinese text, the character string 702 is split into multiple lines, e.g., with each line having at most 5.5 times the total number of characters in the original Chinese text. [0057] During text reorganization, an electronic device obtains the text to be rendered 702, related parameters 902 (e.g., w, h, x, y) in the GL coordinate, and character ratios 904.
A character ratio 904 of the text 702 and the original text is compared (906) with a predefined character ratio threshold (e.g., equal to 5.5), and the text 702 is reorganized (908) based on a comparison result. If the character ratio 904 does not exceed the predefined character ratio threshold, the electronic system 200 keeps (910) the text 702 in one line of sentence. Conversely, if the character ratio 904 exceeds the predefined character ratio threshold, the electronic system 200 determines that the text 702 needs to be split (912) into two or more lines. To split the text 702, the electronic system 200 tracks a plurality of text adjustment parameters 914 to determine (916) whether each word can be added to a current line. Examples of the text adjustment parameters 914 include a maximum line width 914A (max_width_of_the_line), a character width 914B (char_width), a word width 914C (word_width), and a line width 914D (line_width). If the current word can be added into the current line without exceeding the maximum line width 914A, the current word is added (918A) to the current line, and the line width 914D is updated (918B). Conversely, if the current word can only be added into the current line by exceeding the maximum line width 914A, the current word is used to start (920) a new line, which is set as a current line. [0058] Additionally, if the electronic system 200 determines (922) that the word width 914C of the current word is greater than the maximum line width 914A, the current word is split (924) into a plurality of chunks to be displayed in different lines. In some embodiments, to make the sentence splitting more human-readable, each word is kept intact. That means the algorithm searches for a space to break the sentence and moves the word to the next line, except for some very long words, such as URLs, in which the length of the word itself is longer than the bounding box's width. [0059] Figure 11 illustrates a visual effect 1100 of text rotation, in accordance with some embodiments. A bounding box has a rotational angle 1102 with respect to a horizontal direction. Text to be rendered 702 is rotated by the same rotational angle 1102 about a rotation pivot 1104. The rotation pivot 1104 is optionally a center of the bounding box 806. The rotation angle 1102 parsed from the pipeline is in a pixel coordinate, and the text rendering is in a graphic coordinate. A projection matrix is applied to keep a text layout consistent with an aspect ratio of a screen. For example, in OCR translation of a real time translation application, an orthogonal projection is used to get the projection matrix. A rotation matrix is optionally calculated with respect to a z-axis. Alternatively, the rotation matrix describes a rotation with respect to any axis in a three-dimensional (3D) space. The projection and rotation matrices are converted to one or more vertex shader output variables, e.g., gl_Position and gl_Position.xy, in a vertex shader, utilizing a parallel computing capability of the GPU. [0060] The projection matrix, rotation matrix, and two vertex shader output variables are represented as follows:
[Equation image in the original publication: the orthographic projection matrix, the rotation matrix about the z-axis, and the vertex shader outputs gl_Position and gl_Position.xy.]
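The matrices appear in the published document only as an image; the following LaTeX restates the standard forms implied by the description, i.e., an orthographic projection over the volume [l, r] x [b, t] x [n, f], a rotation by the text region's angle theta about the z-axis, and the resulting clip-space position. The exact notation of the original filing is not recoverable from this text, so the symbols below are assumptions.

\[
P_{\mathrm{ortho}} =
\begin{bmatrix}
\tfrac{2}{r-l} & 0 & 0 & -\tfrac{r+l}{r-l}\\
0 & \tfrac{2}{t-b} & 0 & -\tfrac{t+b}{t-b}\\
0 & 0 & -\tfrac{2}{f-n} & -\tfrac{f+n}{f-n}\\
0 & 0 & 0 & 1
\end{bmatrix},
\qquad
R_z(\theta) =
\begin{bmatrix}
\cos\theta & -\sin\theta & 0 & 0\\
\sin\theta & \cos\theta & 0 & 0\\
0 & 0 & 1 & 0\\
0 & 0 & 0 & 1
\end{bmatrix}
\]
\[
\texttt{gl\_Position} = P_{\mathrm{ortho}}\, R_z(\theta)
\begin{bmatrix} x\\ y\\ 0\\ 1 \end{bmatrix}
\]

In a GLSL vertex shader, this corresponds to computing gl_Position = projection * rotation * vec4(position.xy, 0.0, 1.0) with both matrices supplied as uniforms, and gl_Position.xy then carries the rotated, projected two-dimensional position of each glyph quad vertex.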
[0061] Figure 12 is a flowchart of a method 1200 for real time rendering of text on a screen of an electronic device, in accordance with some embodiments. The method 1200 is implemented by an electronic device (e.g., a client device 104, a server 102, or a combination thereof). An example of the client device 104 is a mobile phone. Method 1200 is, in some embodiments, governed by instructions that are stored in a non-transitory computer readable storage medium and that are executed by one or more processors of the electronic device. Each of the operations shown in Figure 12 may correspond to instructions stored in a computer memory or non-transitory computer readable storage medium (e.g., memory 206 of the system 200 in Figure 2). The computer readable storage medium may include a magnetic or optical disk storage device, solid state storage devices such as Flash memory, or other non-volatile memory device or devices. The instructions stored on the computer readable storage medium may include one or more of: source code, assembly language code, object code, or other instruction format that is interpreted by one or more processors. Some operations in method 1200 may be combined and/or the order of some operations may be changed. [0062] The electronic device obtains (1210) at least one image including a first image (e.g., a restaurant menu 502 in Figures 5A-5E). Each of the at least one image (e.g., the first image) includes original text 504 in a source language (e.g., Chinese). The electronic device identifies (1220) a text region (e.g., region 516 in Figures 5A-5E) that contains the original text 504 in the first image, translates (1230) the original text in the source language into translated text 518 in a target language (e.g., Korean, English, Japanese), and displays (1240) the translated text within the text region of the first image. The original text in the first image has a first number of lines, and the translated text is displayed in a second number of lines (e.g., two lines in Figure 10). The second number is distinct from the first number. In some embodiments, the text region fully contains the original text in the first image, and the translated text is entirely displayed within the text region of the first image. [0063] In some embodiments, the second number is larger than the first number. For example, the first and second numbers are equal to 1 and 2, respectively. The original text is displayed in one line in Chinese, and the translated text is displayed in two lines in English. In some embodiments, the second number is smaller than the first number. For example, the first and second numbers are equal to 2 and 1, respectively. The original text is displayed in two lines in Chinese, and the translated text is displayed in one line in English. [0064] In some embodiments, the at least one image further includes a second image that is captured prior to the first image. In some embodiments, the electronic device further obtains the second image and identifies a distinct text region in the second image. The distinct text region includes the original text. Prior to identifying the text region of the first image, the electronic device recognizes the original text from the distinct text region of the second image. For example, recognized text is reused across images.
Stated another way, the original text is recognized or translated from the second image that is captured earlier than the first image, and in accordance with a determination that the same text appears in both the first and second images, the first image displays the translated text directly without waiting for text recognition and translation, thereby expediting text rendering and enabling real time visual effects in a real time translation application. [0065] In some embodiments, the translated text is displayed in the first image with a latency from when the first image is captured, and the latency is less than a threshold latency (e.g., < 1/50 second) such that the translated text is displayed in the first image in real time with capturing the first image. [0066] Referring to Figure 9, in some embodiments, the electronic device extracts a set of characters 702 from a font library of the target language to form the translated text, reorganizes the set of characters 702 forming the translated text into the second number of lines to fit the translated text entirely within the text region 806 in the first image, adjusts a scale and an orientation of the set of characters forming the translated text to fit the translated text entirely within the text region in the first image, and renders the translated text with the set of characters 702 in the first image for display on the screen of the electronic device. In some embodiments, the set of characters 702 have a predefined size and a predefined orientation. [0067] In some embodiments, the scale and the orientation of the set of characters 702 forming the translated text are further adjusted. For each character, the electronic system 200 generates a bitmap image for a respective glyph 600 of the font library, and determines a predefined width and a predefined height of the respective glyph 600 of the font library based on the predefined size. A line width and a line height are calculated for each line of the second number of lines within the text region 806 according to the predefined width 602 and height 604 of the respective glyph 600 of each character to be displayed in the text region 806. The scale of the second number of lines is adjusted according to the line width and line height of each line of the second number of lines and according to a region width and a region height of the text region. [0068] Referring to Figure 11, in some embodiments, the scale and the orientation of the set of characters forming the translated text are adjusted by rotating the translated text to an orientation of the text region 806 around a rotation pivot 1104. The rotation pivot 1104 is optionally located at a center of the text region 806. [0069] In some embodiments, the set of characters forming the translated text are reorganized by determining whether a translation character ratio from the source language to the target language exceeds a threshold (e.g., 5.5); and in response to a determination that the translation character ratio exceeds the threshold, splitting the translated text into the second number of lines within the text region.
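To make the pivot rotation of paragraph [0068] concrete, the sketch below builds a model matrix that rotates text about the center of its text region, together with an orthographic projection that preserves the screen aspect ratio. GLM is used here only as a convenient, widely available C++ math library and is not mandated by the disclosure; the function names are illustrative.

#include <glm/glm.hpp>
#include <glm/gtc/matrix_transform.hpp>

// Model matrix T(pivot) * Rz(angle) * T(-pivot): points are shifted so the
// pivot sits at the origin, rotated about the z-axis, and shifted back,
// which rotates the rendered text about the center of the text region.
glm::mat4 textModelMatrix(const glm::vec2& pivot, float angleRadians) {
    glm::mat4 m(1.0f);
    m = glm::translate(m, glm::vec3(pivot, 0.0f));
    m = glm::rotate(m, angleRadians, glm::vec3(0.0f, 0.0f, 1.0f));
    m = glm::translate(m, glm::vec3(-pivot, 0.0f));
    return m;
}

// Orthographic projection matching the screen, so the text layout stays
// consistent with the screen aspect ratio (paragraph [0059]).
glm::mat4 textProjection(float screenWidth, float screenHeight) {
    return glm::ortho(0.0f, screenWidth, 0.0f, screenHeight);
}

Uploading these two matrices as uniforms lets the GPU apply the rotation and projection per vertex in the vertex shader, consistent with the parallel computation noted in paragraph [0059].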
[0070] In some embodiments, the set of characters forming the translated text are reorganized by determining, according to a width of a current line of the second number of lines and a width of a current word within the current line, whether there is enough space to add the current word to the current line, and in response to a determination that there is enough space to add the current word to the current line, adding the current word to the current line. [0071] Referring to Figure 9, in some embodiments, the set of characters forming the translated text are reorganized by, in response to a determination that there is not enough space to add the current word to the current line, determining if the current word is longer than a maximum width; in response to a determination that the current word is longer than the maximum width: splitting the current word into multiple chunks and adding a chunk of the multiple chunks of the current word to a next line after the current line; in response to a determination that the current word is not longer than the maximum width: adding the current word to the next line after the current line. [0072] Alternatively or additionally, in some embodiments, the set of characters forming the translated text are further reorganized by determining a character width for a smallest font of each of the set of characters of the translated text in the target language; determining a region width of the text region in the first image; determining a maximum character number per line based on a ratio of the region width and the character width; and determining the second number based on a ratio of a total number of characters in the translated text and the maximum character number per line.
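The greedy line-breaking described in paragraphs [0070]-[0071] can be sketched as follows; the whitespace tokenization and the fixed per-character width model are simplifying assumptions for illustration, and the function name is hypothetical.

#include <sstream>
#include <string>
#include <vector>

// Greedy word wrapping: add each word to the current line if it fits;
// otherwise start a new line; words wider than the maximum line width
// (e.g., URLs) are split into chunks across lines.
std::vector<std::string> wrapText(const std::string& text,
                                  float maxLineWidth,
                                  float charWidth) {
    std::vector<std::string> lines(1);
    std::istringstream words(text);
    std::string word;
    auto width = [&](const std::string& s) { return s.size() * charWidth; };

    while (words >> word) {
        std::string& line = lines.back();
        std::string candidate = line.empty() ? word : line + " " + word;
        if (width(candidate) <= maxLineWidth) {
            line = candidate;                       // enough space: append the word
        } else if (width(word) <= maxLineWidth) {
            lines.push_back(word);                  // start a new line with the word
        } else {
            // Word wider than a line (e.g., a URL): split into chunks.
            size_t chunkLen = static_cast<size_t>(maxLineWidth / charWidth);
            if (chunkLen == 0) chunkLen = 1;
            for (size_t i = 0; i < word.size(); i += chunkLen) {
                lines.push_back(word.substr(i, chunkLen));
            }
        }
    }
    if (lines.size() > 1 && lines.front().empty()) lines.erase(lines.begin());
    return lines;
}

Paragraph [0072] gives an alternative, counting-based route: with at most region_width / character_width characters per line, the number of lines is the total character count divided by that per-line maximum, rounded up.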
These terms are only used to distinguish one element from another. [0075] As used herein, the term “if” is, optionally, construed to mean “when” or “upon” or “in response to determining” or “in response to detecting” or “in accordance with a determination that,” depending on the context. Similarly, the phrase “if it is determined” or “if [a stated condition or event] is detected” is, optionally, construed to mean “upon determining” or “in response to determining” or “upon detecting [the stated condition or event]” or “in response to detecting [the stated condition or event]” or “in accordance with a determination that [a stated condition or event] is detected,” depending on the context. [0076] The foregoing description, for purpose of explanation, has been described with reference to specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to limit the claims to the precise forms disclosed. Many modifications and variations are possible in view of the above teachings. The embodiments were chosen and described in order to best explain principles of operation and practical applications, to thereby enable others skilled in the art. [0077] Although various drawings illustrate a number of logical stages in a particular order, stages that are not order dependent may be reordered and other stages may be combined or broken out. While some reordering or other groupings are specifically mentioned, others will be obvious to those of ordinary skill in the art, so the ordering and groupings presented herein are not an exhaustive list of alternatives. Moreover, it should be recognized that the stages can be implemented in hardware, firmware, software or any combination thereof.

Claims

What is claimed is: 1. A method of real time rendering of text on a screen of an electronic device, comprising: obtaining a first image, the first image including original text in a source language; identifying a text region that contains the original text in the first image; translating the original text in the source language into translated text in a target language; and displaying the translated text within the text region of the first image, wherein the original text in the first image has a first number of lines, and the translated text is displayed in a second number of lines, and wherein the second number is distinct from the first number.
2. The method of claim 1, wherein the second number is larger than the first number.
3. The method of claim 1 or 2, further comprising: obtaining a second image that is captured prior to the first image; identifying a distinct text region in the second image, the distinct text region including the original text; and prior to identifying the text region of the first image, recognizing the original text from the distinct text region of the second image.
4. The method of any of the preceding claims, wherein the translated text is displayed in the first image with a latency from when the first image is captured, and the latency is less than a threshold latency such that the translated text is displayed in the first image in real time with capturing the first image.
5. The method of any of the preceding claims, further comprising: extracting a set of characters from a font library of the target language to form the translated text, the set of characters having a predefined size and a predefined orientation; reorganizing the set of characters forming the translated text into the second number of lines to fit the translated text entirely within the text region in the first image; adjusting a scale and an orientation of the set of characters forming the translated text to fit the translated text entirely within the text region in the first image; and rendering the translated text with the set of characters in the first image for display on the screen of the electronic device.
6. The method of claim 5, wherein adjusting the scale and the orientation of the set of characters forming the translated text further comprises: for each character, generating a bitmap image for a respective glyph of the font library, and determining a predefined width and a predefined height of the respective glyph of the font library based on the predefined size; calculating a line width and a line height of each line of the second number of lines within the text region according to the predefined width and height of the respective glyph of each character to be displayed in the text region; and adjusting the scale of the second number of lines according to the line width and line height of each line of the second number of lines and according to a region width and a region height of the text region.
7. The method of claim 6, wherein adjusting the scale and the orientation of the set of characters forming the translated text comprises: rotating the translated text to an orientation of the text region around a rotation pivot, wherein the rotation pivot is located at a center of the text region.
8. The method of any of claims 5-7, wherein reorganizing the set of characters forming the translated text further comprises: determining whether a translation character ratio from the source language to the target language exceeds a threshold; and in response to a determination that the translation character ratio exceeds the threshold, splitting the translated text into the second number of lines within the text region.
9. The method of claim 8, wherein reorganizing the set of characters forming the translated text further comprises: determining according to width of a current line of the second number of lines and width of a current word within the current line whether there is enough space to add the current word to the current line; and in response to a determination that there is enough space to add the current word to the current line, adding the current word to the current line.
10. The method of claim 9, wherein reorganizing the set of characters forming the translated text further comprises: in response to a determination that there is not enough space to add the current word to the current line, determining if the current word is longer than a maximum width; in response to a determination that the current word is longer than the maximum width: splitting the current word into multiple chunks and adding a chunk of the multiple chunks of the current word to a next line after the current line; in response to a determination that the current word is not longer than the maximum width: adding the current word to the next line after the current line.
11. The method of any of claims 5-7, wherein reorganizing the set of characters forming the translated text further comprises: determining a character width for a smallest font of each of the set of characters of the translated text in the target language; determining a region width of the text region in the first image; determining a maximum character number per line based on a ratio of the region width and the character width; and determining the second number based on a ratio of a total number of characters in the translated text and the maximum character number per line.
12. An electronic device, comprising: one or more processors; and memory having instructions stored thereon, which when executed by the one or more processors cause the one or more processors to perform a method of any of claims 1-11.
13. A non-transitory computer-readable medium, having instructions stored thereon, which when executed by one or more processors cause the one or more processors to perform a method of any of claims 1-11.
PCT/IB2022/000272 2022-05-13 2022-05-13 Text rendering on mobile devices WO2023218217A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/IB2022/000272 WO2023218217A1 (en) 2022-05-13 2022-05-13 Text rendering on mobile devices

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/IB2022/000272 WO2023218217A1 (en) 2022-05-13 2022-05-13 Text rendering on mobile devices

Publications (1)

Publication Number Publication Date
WO2023218217A1 true WO2023218217A1 (en) 2023-11-16

Family

ID=88729810

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2022/000272 WO2023218217A1 (en) 2022-05-13 2022-05-13 Text rendering on mobile devices

Country Status (1)

Country Link
WO (1) WO2023218217A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020013693A1 (en) * 1997-12-15 2002-01-31 Masaru Fuji Apparatus and method for controlling the display of a translation or dictionary searching process
US9037450B2 (en) * 2012-12-14 2015-05-19 Microsoft Technology Licensing, Llc Text overlay techniques in realtime translation
US20160004692A1 (en) * 2013-03-15 2016-01-07 Translate Abroad, Inc. Systems and methods for displaying foreign character sets and their translations in real time on resource-constrained mobile devices
US20210165973A1 (en) * 2019-12-03 2021-06-03 Trint Limited Generating and Editing Media

Similar Documents

Publication Publication Date Title
US10984295B2 (en) Font recognition using text localization
US10699166B2 (en) Font attributes for font recognition and similarity
US10572574B2 (en) Dynamic font subsetting using a file size threshold for an electronic document
US20200349189A1 (en) Compositing Aware Digital Image Search
WO2019119966A1 (en) Text image processing method, device, equipment, and storage medium
US8218830B2 (en) Image editing system and method
US9824304B2 (en) Determination of font similarity
US10529106B2 (en) Optimizing image cropping
KR102222087B1 (en) Image recognition method and apparatus based on augmented reality
KR102421701B1 (en) Systems and Methods for Accelerating Content Delivery of Virtual Reality and Augmented Reality Web Pages
JP2009176287A (en) Document display method, document display system, and computer program thereof
JP5829354B2 (en) Information processing system, information processing system control method, information processing apparatus, information processing apparatus control method, information storage medium, and program
CN112163577B (en) Character recognition method and device in game picture, electronic equipment and storage medium
WO2021230863A1 (en) Image replacement inpainting
CN114513520A (en) Web three-dimensional visualization technology based on synchronous rendering of client and server
WO2019020061A1 (en) Video dialogue processing method, video client, video server, and computer readable storage medium
CN113359986B (en) Augmented reality data display method and device, electronic equipment and storage medium
US20170024359A1 (en) Techniques to provide processing enhancements for a text editor in a computing environment
CN112052050B (en) Shared picture generation method, system, storage medium and terminal equipment
KR102550305B1 (en) Video automatic editing method and syste based on machine learning
WO2023218217A1 (en) Text rendering on mobile devices
WO2023133285A1 (en) Anti-aliasing of object borders with alpha blending of multiple segmented 3d surfaces
CN115082298A (en) Image generation method, image generation device, electronic device, and storage medium
CN113591827A (en) Text image processing method and device, electronic equipment and readable storage medium
US20240087344A1 (en) Real-time scene text area detection

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22941573

Country of ref document: EP

Kind code of ref document: A1