WO2023118967A1 - Apparatus and methods for reading assistance - Google Patents

Apparatus and methods for reading assistance

Info

Publication number
WO2023118967A1
Authority
WO
WIPO (PCT)
Prior art keywords
user
word
processor
audio signal
image
Prior art date
Application number
PCT/IB2022/000784
Other languages
English (en)
Inventor
Doron WEIZMAN
Yonatan Wexler
Efrat Beery
Roi Nathan
Tal ROSENWEIN
Amnon Shashua
Nir SANCHO
Oren Tadmor
Yair DEITCHER
Michael Druker
Ian BUDMAN
Guy EYAL
Original Assignee
Orcam Technologies Ltd.
Application filed by Orcam Technologies Ltd.
Publication of WO2023118967A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/16Sound input; Sound output
    • G06F3/167Audio in a user interface, e.g. using voice commands for navigating, audio feedback
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/14Image acquisition
    • G06V30/142Image acquisition using hand-held instruments; Constructional details of the instruments
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/24Character recognition characterised by the processing or recognition method
    • G06V30/242Division of the character sequences into groups prior to recognition; Selection of dictionaries
    • G06V30/246Division of the character sequences into groups prior to recognition; Selection of dictionaries using linguistic properties, e.g. specific for English or German language
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/26Techniques for post-processing, e.g. correcting the recognition result
    • G06V30/262Techniques for post-processing, e.g. correcting the recognition result using context analysis, e.g. lexical, syntactic or semantic context
    • G06V30/268Lexical context
    • GPHYSICS
    • G09EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
    • G09BEDUCATIONAL OR DEMONSTRATION APPLIANCES; APPLIANCES FOR TEACHING, OR COMMUNICATING WITH, THE BLIND, DEAF OR MUTE; MODELS; PLANETARIA; GLOBES; MAPS; DIAGRAMS
    • G09B17/00Teaching reading
    • G09B17/003Teaching reading electrically operated apparatus or devices
    • GPHYSICS
    • G09EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
    • G09BEDUCATIONAL OR DEMONSTRATION APPLIANCES; APPLIANCES FOR TEACHING, OR COMMUNICATING WITH, THE BLIND, DEAF OR MUTE; MODELS; PLANETARIA; GLOBES; MAPS; DIAGRAMS
    • G09B19/00Teaching not covered by other main groups of this subclass
    • G09B19/04Speaking
    • GPHYSICS
    • G09EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
    • G09BEDUCATIONAL OR DEMONSTRATION APPLIANCES; APPLIANCES FOR TEACHING, OR COMMUNICATING WITH, THE BLIND, DEAF OR MUTE; MODELS; PLANETARIA; GLOBES; MAPS; DIAGRAMS
    • G09B19/00Teaching not covered by other main groups of this subclass
    • G09B19/06Foreign languages
    • GPHYSICS
    • G09EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
    • G09BEDUCATIONAL OR DEMONSTRATION APPLIANCES; APPLIANCES FOR TEACHING, OR COMMUNICATING WITH, THE BLIND, DEAF OR MUTE; MODELS; PLANETARIA; GLOBES; MAPS; DIAGRAMS
    • G09B5/00Electrically-operated educational appliances
    • G09B5/06Electrically-operated educational appliances with both visual and audible presentation of the material to be studied
    • GPHYSICS
    • G09EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
    • G09BEDUCATIONAL OR DEMONSTRATION APPLIANCES; APPLIANCES FOR TEACHING, OR COMMUNICATING WITH, THE BLIND, DEAF OR MUTE; MODELS; PLANETARIA; GLOBES; MAPS; DIAGRAMS
    • G09B7/00Electrically-operated teaching apparatus or devices working with questions and answers
    • G09B7/02Electrically-operated teaching apparatus or devices working with questions and answers of the type wherein the student is expected to construct an answer to the question which is presented or wherein the machine gives an answer to the question presented by a student
    • GPHYSICS
    • G09EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
    • G09BEDUCATIONAL OR DEMONSTRATION APPLIANCES; APPLIANCES FOR TEACHING, OR COMMUNICATING WITH, THE BLIND, DEAF OR MUTE; MODELS; PLANETARIA; GLOBES; MAPS; DIAGRAMS
    • G09B7/00Electrically-operated teaching apparatus or devices working with questions and answers
    • G09B7/06Electrically-operated teaching apparatus or devices working with questions and answers of the multiple-choice answer-type, i.e. where a given question is provided with a series of answers and a choice has to be made from the answers
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/02Methods for producing synthetic speech; Speech synthesisers
    • G10L13/027Concept to speech synthesisers; Generation of natural phrases from machine-based concepts
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/02Methods for producing synthetic speech; Speech synthesisers
    • G10L13/033Voice editing, e.g. manipulating the voice of the synthesiser
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/14Image acquisition
    • G06V30/1431Illumination control
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00Speaker identification or verification techniques
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223Execution procedure of a spoken command

Definitions

  • This disclosure generally relates to devices and methods for capturing and processing images and/or audio from an environment of a user, and using information derived from the captured images and/or audio to assist the user in reading.
  • Dyslexia is characterized by reading difficulty in individuals with otherwise unaffected intelligence. Problems may include difficulties in spelling words, reading quickly, writing words, “sounding out” words in the head, pronouncing words when reading aloud, and/or understanding what one reads. Alexia, on the other hand, relates to a situation in which a user who could previously read loses their ability to do so. Dyslexia, alexia, or other similar disorders may cause low self-esteem and decreased learning abilities.
  • Embodiments consistent with the present disclosure provide devices and methods for automatically capturing and processing images and audio from an environment of a user. Additionally, embodiments consistent with the present disclosure provide devices, systems and methods for processing information associated with the captured images and audio to provide feedback to the users, allowing for enhanced interaction between the users and their environment that may assist the users in reading and learning.
  • a reading device includes a light source configured to illuminate an object; a trigger configured to activate the light source, the trigger being operable by an index finger of a hand of a user; a camera configured to capture images from an environment of the user; an audio output device configured to output audio signals; and at least one processor.
  • the at least one processor is programmed to project light from the light source to illuminate an area of the object in response to operation of the trigger; capture at least one image of the illuminated area of the object, wherein the at least one image includes a representation of written material; analyze the at least one image to recognize text; transform the recognized text into at least one audio signal; and output the at least one audio signal using the audio output device.
  • a method of reading written material depicted on an object includes receiving, by a processor, an input representative of operation of a trigger on a reading device.
  • light is projected from a light source of the reading device to illuminate an area of the object.
  • using a camera of the reading device, at least one image of the illuminated area of the object is captured, wherein the at least one image includes a representation of written material.
  • using the processor, the at least one image is analyzed to recognize text.
  • the recognized text is transformed into at least one audio signal.
  • the at least one audio signal is output using an audio output device associated with the reading device.
  • a system includes a camera, a microphone, a visual pointer, an audio output device, and at least one processor.
  • the camera may be configured to capture one or more images from an environment of a user.
  • the microphone may be configured to capture sounds from the environment of the user.
  • the visual pointer may be configured to indicate a point of interest in the one or more images.
  • the audio output device may be configured to output audio signals.
  • the at least one processor may be programmed to receive at least one image captured by the camera, the at least one image including a representation of written material; analyze the at least one image to recognize text; identify at least one printed word within the text; generate a question based on the at least one printed word; present the question to the user; after presenting the question to the user, receive at least one indication from the user; identify at least one word in the indication; compare the at least one word to the at least one printed word; provide a positive indication when the at least one word matches the at least one printed word; and provide a negative indication when the at least one word does not match the at least one printed word.
  • a system may include a camera configured to capture images from an environment of a user.
  • the system may also include a microphone configured to capture sounds from the environment of the user. Further, the system may include an audio output device for outputting audio signals.
  • the system may include at least one processor.
  • the processor may be programmed to receive at least one image captured by the camera, the at least one image including a representation of written material.
  • the processor may also be programmed to obtain a material type of the written material.
  • the processor may be programmed to analyze the at least one image to recognize text.
  • the processor may be programmed to generate at least one audio signal representing the recognized text, the at least one audio signal being generated based on the material type.
  • the processor may also be programmed to output the at least one audio signal via the audio output device.
  • a system may include a camera configured to capture images from an environment of a user.
  • the system may also include a microphone configured to capture sounds from the environment of the user.
  • the system may include an audio output device for outputting audio signals.
  • the system may also include at least one processor.
  • the processor may be programmed to receive at least one image captured by the camera, the at least one image including a representation of written material.
  • the processor may also be programmed to obtain a linguistic level of the user.
  • the processor may be programmed to analyze the at least one image to recognize text.
  • the processor may be programmed to substitute at least one original word within the recognized text with a synonym word based on the linguistic level of the user.
  • the processor may also be programmed to generate at least one audio signal representing the recognized text, wherein the at least one audio signal represents the synonym word rather than the original word.
  • a system may include a camera configured to capture images from an environment of a user.
  • the system may also include a microphone configured to capture sounds from the environment of the user. Further, the system may include an audio output device for outputting audio signals.
  • the system may also include at least one processor.
  • the processor may be programmed to receive at least one image captured by the camera, the at least one image including a representation of written material.
  • the processor may also be programmed to analyze the at least one image to recognize text, the recognized text being in a first language. Further, the processor may be programmed to obtain a second language from the user.
  • the processor may be programmed to translate the recognized text into the second language.
  • the processor may also be programmed to generate at least one audio signal representing the recognized text, wherein the at least one audio signal represents the text translated into the second language.
  • a system may include a camera configured to capture images from an environment of a user.
  • the system may also include a microphone configured to capture sounds from the environment of the user.
  • the system may include an audio output device for outputting audio signals.
  • the system may include at least one processor.
  • the processor may be programmed to receive at least one image captured by the camera, the at least one image including a representation of written material.
  • the processor may also be programmed to analyze the at least one image to recognize text, the recognized text comprising at least one word. Further, the processor may be programmed to identify at least one first audio signal associated with the at least one word.
  • the processor may also be programmed to generate at least one second audio signal representing the text.
  • the at least one second audio signal may include the at least one first audio signal such that when the at least one second audio signal is output via the audio output device, the at least one first audio signal is played immediately before, immediately after, or instead of the at least one word.
  • a system may include a camera configured to capture images from an environment of a user.
  • the system may also include a microphone configured to capture sounds from the environment of the user.
  • the system may include an audio output device for outputting audio signals.
  • the system may include at least one processor.
  • the at least one processor may be programmed to receive a user setting associated with a reading speed.
  • the at least one processor may also be programmed to receive at least one image captured by the camera, the at least one image including a representation of written material.
  • the at least one processor may be programmed to analyze the at least one image to recognize text.
  • the at least one processor may also be programmed to generate at least one audio signal representing the recognized text.
  • the at least one audio signal may be generated based on the user setting.
  • the at least one processor may be programmed to output the at least one audio signal via the audio output device.
  • a method may include receiving, by at least one processor, at least one image captured by a camera configured to capture images from an environment of a user.
  • the at least one image may include a representation of written material.
  • the method may also include obtaining, by the at least one processor, a material type of the written material. Further, the method may include analyzing, by the at least one processor, the at least one image to recognize text.
  • the method may also include generating, by the at least one processor, at least one audio signal representing the recognized text. The at least one audio signal may be generated based on the material type.
  • the method may include outputting the at least one audio signal via an audio output device configured for outputting audio signals.
  • a method may include receiving, by at least one processor, at least one image captured by a camera configured to capture images from an environment of a user.
  • the at least one image may include a representation of written material.
  • the method may include obtaining, by the at least one processor, a linguistic level of the user. Further, the method may include analyzing, by the at least one processor, the at least one image to recognize text.
  • the method may include substituting, by the at least one processor, at least one original word within the recognized text with a synonym word based on the linguistic level of the user.
  • the method may also include generating, by the at least one processor, at least one audio signal representing the recognized text. The at least one audio signal may represent the synonym word rather than the original word.
  • a method may include receiving, by at least one processor, at least one image captured by a camera configured to capture images from an environment of a user.
  • the at least one image may include a representation of written material.
  • the method may also include analyzing, by the at least one processor, the at least one image to recognize text, the recognized text being in a first language. Further, the method may include obtaining, by the at least one processor, a second language from the user. The method may include translating, by the at least one processor, the recognized text into the second language.
  • the method may also include generating, by the at least one processor, at least one audio signal representing the recognized text. The audio signal may represent the text translated into the second language.
  • a method may include receiving, by at least one processor, at least one image captured by a camera configured to capture images from an environment of a user.
  • the at least one image may include a representation of written material.
  • the method may include analyzing, by the at least one processor, the at least one image to recognize text.
  • the recognized text may comprise at least one word.
  • the method may include identifying, by the at least one processor, at least one first audio signal associated with the at least one word.
  • the method may also include generating, by the at least one processor, at least one second audio signal representing the text.
  • the at least one second audio signal may include the at least one first audio signal such that when the at least one second audio signal is output via the audio output device, the at least one first audio signal is played immediately before, immediately after, or instead of the at least one word.
  • a method may include receiving, by at least one processor, a user setting associated with a reading speed.
  • the method may include receiving, by the at least one processor, at least one image captured by a camera configured to capture images from an environment of a user.
  • the at least one image may include a representation of written material.
  • the method may include analyzing, by the at least one processor, the at least one image to recognize text.
  • the method may also include generating, by the at least one processor, at least one audio signal representing the recognized text.
  • the at least one audio signal may be generated based on the user setting.
  • the method may also include outputting, by the at least one processor, the at least one audio signal via an audio output device configured for outputting audio signals.
  • a system may include a camera configured to capture images from an environment of a user.
  • the system may also include an audio output device for outputting audio signals.
  • the system may include at least one processor.
  • the processor may be programmed to receive at least one image captured by the camera, the at least one image including a representation of written material.
  • the processor may also be programmed to analyze the at least one image to recognize text.
  • the processor may be programmed to perform text-to-speech conversion of the recognized text to generate at least one audio signal representing the recognized text.
  • the at least one audio signal may be generated in a voice of a predetermined speaker or speaker type.
  • the processor may be programmed to output the at least one audio signal via the audio output device.
  • a system may include a camera configured to capture images from an environment of a user.
  • the system may also include an audio output device for outputting audio signals.
  • the system may include at least one processor.
  • the processor may be programmed to receive at least one image captured by the camera, the at least one image including a representation of written material.
  • the processor may also be programmed to analyze the at least one image to recognize text.
  • the processor may be programmed to perform text-to-speech conversion of the recognized text to generate at least one audio signal representing the recognized text.
  • the audio signal may be generated to convey a sentiment determined based on analysis of the recognized text.
  • the processor may also be programmed to output the at least one audio signal via the audio output device.
  • a method may include receiving at least one image captured by a camera, the at least one image including a representation of written material. The method may also include analyzing the at least one image to recognize text. Further, the method may include performing text-to-speech conversion of the recognized text to generate at least one audio signal representing the recognized text. The at least one audio signal may be generated in at least one of a voice of a predetermined speaker or to convey a sentiment determined based on analysis of the recognized text. The method may also include outputting the at least one audio signal via an audio output device.
  • a system may comprise a camera configured to capture images from an environment of a user; a microphone configured to capture sounds from the environment of the user; and at least one processor.
  • the at least one processor may be programmed to: receive at least one image captured by the camera, the at least one image including a representation of written material; analyze the at least one image to recognize text; receive at least one audio signal captured by the microphone, the at least one audio signal representing speech by the user; analyze the at least one audio signal to recognize at least one first word in the speech by the user; compare the at least one first word with at least one second word in the recognized text; determine whether the at least one first word matches the at least one second word; and provide feedback information to the user based on determining whether the at least one first word matches the at least one second word.
  • a method may process audio and image signals.
  • the method may comprise receiving at least one image captured by a camera, the at least one image including a representation of written material; analyzing the at least one image to recognize text; receiving at least one audio signal captured by the microphone, the at least one audio signal representing speech by a user; analyzing the at least one audio signal to recognize at least one first word in the speech by the user; comparing the at least one first word with at least one second word in the recognized text; determining whether the at least one first word matches the at least one second word; and providing feedback information to the user based on determining whether the at least one first word matches the at least one second word.
  • a system may comprise a camera configured to capture images from an environment of a user; and at least one processor.
  • the at least one processor may be programmed to: cause the camera to capture a first image including a first representation of written material; analyze the first image to detect a predetermined sign within the first image; determine first coordinates of the predetermined sign within the first image; cause the camera to capture a second image including a second representation of the written material; determine a transformation between the first image and the second image; apply the transformation to obtain second coordinates within the second image corresponding to the first coordinates of the predetermined sign within the first image; and analyze the second image to recognize at least one word based on the second coordinates.
  • a system may comprise a camera configured to capture images from an environment of a user; and at least one processor.
  • the at least one processor may be programmed to: cause the camera to capture a first image including a representation of written material; analyze the first image to detect a predetermined sign within the first image; determine coordinates of the predetermined sign within the first image; and recognize at least one word based on the determined coordinates.
  • a method may process audio and image signals.
  • the method may comprise capturing, using a camera, a first image including a first representation of written material; analyzing the first image to detect a predetermined sign within the first image; determining first coordinates of the predetermined sign within the first image; causing the camera to capture a second image including a second representation of the written material; determining a transformation between the first image and the second image; applying the transformation to obtain second coordinates within the second image corresponding to the first coordinates of the predetermined sign within the first image; and analyzing the second image to recognize at least one word based on the second coordinates.
  • a method may process audio and image signals.
  • the method may comprise capturing, using a camera, an image including a representation of written material; analyzing the image to detect a predetermined sign within the image; determining coordinates of the predetermined sign within the image; and recognizing at least one word based on the determined coordinates.
  • a reading device may comprise: a camera configured to capture images from an environment of a user; a microphone configured to capture sounds from the environment of the user; an audio output device for outputting audio signals; and at least one processor.
  • the at least one processor may be programmed to associate the reading device with a user by identifying at least one of the user or a user account associated with the user; receive at least one image captured by the camera, the at least one image including a representation of written material; analyze the at least one image to recognize text; receive at least one audio signal captured by the microphone, the at least one audio signal representing speech by the user; analyze the at least one audio signal to recognize at least one first word; compare the at least one first word with at least one second word, the at least one second word comprising at least one of: an expected word or a word in the recognized text; determine feedback information indicating whether the at least one first word matches the at least one second word; and provide the feedback information to the user.
  • a method for processing audio and image signals may comprise associating a reading device with a user by identifying at least one of a user or a user account associated with the user; receiving at least one image captured by a camera configured to capture images from an environment of the user, the at least one image including a representation of written material; analyzing the at least one image to recognize text; receiving at least one audio signal captured by a microphone configured to capture sounds from the environment of the user, the at least one audio signal representing speech by the user; analyzing the at least one audio signal to recognize at least one first word; comparing the at least one first word with at least one second word, the at least one second word comprising at least one of: an expected word or a word in the recognized text; determining feedback information indicating whether the at least one first word matches the at least one second word; and providing the feedback information to the user.
  • non-transitory computer-readable storage media may store program instructions which, when executed by at least one processor, cause the at least one processor to perform any of the methods described herein.
  • FIG. 1 is a schematic illustration of a user with a handheld apparatus consistent with the present disclosure.
  • FIGs. 2A, 2B, 2C, 2D, and 2E are schematic illustrations of an example of the handheld apparatus shown in Fig. 1 from various viewpoints consistent with the present disclosure.
  • FIG. 3 is a schematic illustration of an example system consistent with the present disclosure.
  • FIG. 4 is a schematic illustration of the exemplary apparatus shown in Fig. 1, being used for reading textual information consistent with the present disclosure.
  • Fig. 5 is a block diagram illustrating an example of the components of the handheld apparatus and a computing device consistent with the present disclosure.
  • Fig. 6 is a block diagram illustrating another exemplary embodiment of a handheld apparatus comprising image and voice recognition components consistent with the present disclosure.
  • FIG. 7 is a schematic illustration of a front-end view of an exemplary handheld apparatus consistent with the present disclosure.
  • FIG. 8 is a schematic illustration of the exemplary apparatus shown in Fig. 1, being used for identifying a word in textual information consistent with the present disclosure.
  • Fig. 9 is a flowchart of a method of reading written material depicted on an object consistent with the present disclosure.
  • FIGs. 10A, 10B, 10C, 10D, and 10E are schematic illustrations of an example use case of the apparatus scanning text and presenting a question to a user and receiving the user’s response consistent with the present disclosure.
  • Fig. 11 is a flowchart of a method for scanning text and presenting a question to a user and receiving the user’s response consistent with the present disclosure.
  • Fig. 12 is a flowchart of a method for scanning text and a user asking a question consistent with the present disclosure.
  • Fig. 13 is a schematic illustration of the exemplary apparatus shown in Fig. 1, being used for reading textual information and identifying a material type of the written material, consistent with the present disclosure.
  • Figs. 14A, 14B, 14C, and 14D are illustrations of exemplary graphic user interfaces on a user device for receiving an input from a user, consistent with the present disclosure.
  • Figs. 15A, 15B, 15C, and 15D illustrate exemplary methods performed by the system of Fig. 3.
  • Fig. 16A is a schematic illustration of an exemplary system for generating an audio signal in a predetermined voice, consistent with the present disclosure.
  • Fig. 16B is a schematic block diagram of an exemplary system for generating an audio signal in one or more voices, consistent with the present disclosure.
  • FIG. 16C is an illustration of a device with a display screen showing an exemplary graphical user interface, consistent with the present disclosure.
  • FIG. 17 is a schematic illustration of an environment of a user in which the exemplary apparatus shown in Fig. 1 is being used for reading textual information and identifying a sentiment, emotion, or context of the written material, consistent with the present disclosure.
  • Figs. 18A and 18B illustrate exemplary methods performed by the system of Fig. 3.
  • Fig. 19 is a schematic illustration of a user using a handheld apparatus to extract information consistent with the present disclosure.
  • Fig. 20 is a schematic illustration of extracted and derived environmental information.
  • Fig. 21 is a flowchart of a method for processing audio and image signals according to disclosed embodiments.
  • Fig. 22 is a schematic illustration of images captured by a camera consistent with the present disclosure.
  • Fig. 23 is a schematic illustration of transforming and determining coordinates associated with images, consistent with the present disclosure.
  • Fig. 24 is a flowchart of a method for processing audio and image signals according to disclosed embodiments.
  • Fig. 25 illustrates an example code for identifying a user, consistent with the disclosed embodiments.
  • Fig. 26A illustrates an example technique for identifying a user based on a physical characteristic of the user, consistent with the disclosed embodiments.
  • Fig. 26B illustrates an example technique for identifying a user based on a captured audio signal, consistent with the disclosed embodiments.
  • Fig. 27 is a flowchart showing an example process for processing audio and image signals, consistent with the disclosed embodiments.
  • terms such as “processing,” “calculating,” “computing,” “determining,” “generating,” “setting,” “configuring,” “selecting,” “defining,” “applying,” “obtaining,” “monitoring,” “providing,” “identifying,” “segmenting,” “classifying,” “analyzing,” “associating,” “extracting,” “storing,” “receiving,” “transmitting,” or the like include actions and/or processes of a computer that manipulate and/or transform data into other data, the data represented as physical quantities, and/or electronic quantities.
  • terms such as “computer” or “processor” should be expansively construed to cover any kind of electronic device, component or unit with data processing capabilities, including, by way of non-limiting example, a personal computer, a wearable computer, smart glasses, a tablet, a smartphone, a server, a computing system, a cloud computing platform, a communication device, a processor (for example, a digital signal processor (DSP), an image signal processor (ISP), a microcontroller, a field programmable gate array (FPGA), an application specific integrated circuit (ASIC), a central processing unit (CPU), a graphics processing unit (GPU), a visual processing unit (VPU), and so on), possibly with embedded memory, a single core processor, a multi core processor, a core within a processor, any other electronic computing device, or any combination of the above.
  • the phrases “for example,” “such as,” “for instance,” and variants thereof describe non-limiting embodiments of the presently disclosed subject matter.
  • Reference in the specification to features of “embodiments,” “one case,” “some cases,” “other cases” or variants thereof means that a particular feature, structure or characteristic described may be included in at least one embodiment of the presently disclosed subject matter. Thus, the appearance of such terms does not necessarily refer to the same embodiment(s).
  • the term “and/or” includes any and all combinations of one or more of the associated listed items.
  • Fig. 1 illustrates a user 100 using a handheld apparatus 110, consistent with disclosed embodiments.
  • user 100 may point apparatus 110 towards object 120.
  • object 120 may be a book. It is contemplated, however, that object 120 may additionally or alternatively include a newspaper, a magazine, a poster, a label, a notebook, a notepad, paper, or any other object that may have text or graphical material displayed on or printed on the object.
  • object 120 may include a screen or display of a secondary device (not shown).
  • Such a secondary device or secondary computing device may include, for example, a desktop computer, a laptop computer, a smartphone, a tablet, a smartwatch, an electronic reader (e.g., e-reader, Kindle™), a dedicated processing unit that may be portable (e.g., that can be carried in a pocket, a bag, and/or on a person of user 100), or any other device capable of displaying text on a display screen.
  • object 120 may include a wall, a physical screen or any other object such as the user’s wrist, onto which text may be projected using, for example, a projection device.
  • Apparatus 110 may be a small and light device generally adapted to being held in a hand of user 100, who may be a child, an adult, or a person of any age.
  • apparatus 110 may include an image sensor (not shown in Fig. 1) for capturing real-time image data of the text or graphical material displayed on object 120.
  • image data includes any form of data retrieved from optical signals in the near-infrared, infrared, visible, and ultraviolet spectrums.
  • the image data may include still images and/or video clips.
  • Apparatus 110 may also include one or more lighting or illumination devices (not shown in Fig. 1) that may be configured to illuminate the text displayed on object 120 or other objects, or to highlight parts of object 120.
  • the image sensor of apparatus 110 may be able to capture images of the text displayed on object 120 using illumination provided by object 120.
  • Apparatus 110 may also include an audio sensor (not shown in Fig. 1) for capturing sounds (e.g., voice commands) from user 100 or an environment of user 100.
  • apparatus 110 may be connected to an audio feedback device 130.
  • audio feedback device 130 may include an over-the-ear audio headphone being worn by user 100. It is contemplated, however, that audio feedback device 130 may include a built-in speaker, a stand-alone speaker, a portable speaker, a hearing aid device, a bone conduction headphone, a within-ear headphone (e.g., earbuds), a speaker associated with a computing device, or any other device capable of playing audio signals.
  • apparatus 110 may be connected to audio feedback device 130 via a wired connection.
  • apparatus 110 may be connected to audio feedback device 130 via a wireless connection based, for example, on a Bluetooth™ protocol or any other wireless communication protocol that may allow apparatus 110 to transmit audio signals to audio feedback device 130. It is appreciated that feedback device 130 may also be included in or be a part of apparatus 110.
  • Fig. 2A is a schematic illustration of an isometric view of an exemplary apparatus 110.
  • apparatus 110 may include a housing or outer case 200 that may enclose one or more components of apparatus 110.
  • Apparatus 110 may also include, for example, camera 210, illumination device 212, targeting lasers 214, power button 216, volume buttons 218, battery level indicator 220, charging port 222, and trigger button 224.
  • Apparatus 110 may include one or more replaceable and/or chargeable batteries (not shown) that may provide electrical power for the performance of one or more operations of apparatus 110.
  • Illumination device 212 may include one or more lights, for example, light-emitting diodes (LEDs).
  • illumination device 212 may include any other type of light source capable of providing a visible, infrared, or ultraviolet light. Illumination device 212 may be configured to emit light that may be used to illuminate the text displayed on object 120, allowing camera 210 of apparatus 110 to capture one or more images of the displayed text.
  • Targeting lasers 214 may include one or more laser light sources configured to emit laser light that may be used to mark a portion of the textual and/or graphical material displayed on object 120. Although three targeting lasers 214 are illustrated in Fig. 2A, apparatus 110 may include any number of targeting lasers 214. In some embodiments, one or more of the targeting lasers 214 may include a mask for projecting a particular shape onto object 120. For example, targeting lasers 214 may include a mask that projects lines defining, for example, corners of an area that the user may wish to mark on the text displayed on object 120. In other examples, targeting lasers 214 may include a mask that projects an arrow, one or more lines, or the like. By way of example, in Fig. 4, object 120 may be a display screen of a computing device.
  • the one or more targeting lasers 214 may emit laser light that marks the displayed text matter on object 120 using corners 412, 414, 416, and 418. Corners 412, 414, 416, and 418 may mark or identify an area of the displayed text that user 100 may be interested in reading using apparatus 110.
  • Although a particular shape, for example, four corners (e.g., 412, 414, 416, and 418), has been illustrated in Fig. 4, the shape projected onto object 120 by the one or more targeting lasers 214 may include a complete frame (e.g., having a square, rectangular, triangular, circular, or any other shape), a cursor, a hand, an arrow, one or more lines, or any other symbol that may be used to identify or mark the text that user 100 may be interested in reading using apparatus 110.
  • power button 216 may be configured to turn apparatus 110 ON or OFF. In some embodiments, pressing power button 216 once may turn apparatus 110 ON, pressing power button 216 a second time may cause operations of apparatus 110 to be suspended, and pressing power button 216 one more time may turn apparatus 110 OFF. Volume buttons 218 may allow a user (e.g., user 100) to increase or decrease a volume or sound level of an associated audio feedback device 130. Trigger button 224 may be configured to perform one or more functions of apparatus 110. For example, pressing trigger button 224 once may turn on targeting lasers 214 to allow a user to mark portions of the text displayed on object 120.
  • pressing trigger button 224 a second time may cause camera 210 to capture one or more images of the text marked by targeting lasers 214 or the entire text displayed on object 120. It is contemplated that operation of trigger button 224 may cause apparatus 110 to perform other functions as desired.
  • Battery level indicator 220 may be configured to display an amount of remaining power or battery capacity of apparatus 110.
  • battery level indicator 220 may include, for example, a plurality of LEDs, and the number of LEDs that turn on may indicate a remaining amount of power or battery capacity. It is contemplated, however, that in some embodiments, battery level indicator 220 may include other graphical symbols and/or numerical values that may indicate an amount of remaining power or battery capacity associated with apparatus 110.
  • Charging port 222 may be configured to receive one end of a charging cable, which in turn may be configured to be connected to an electrical power outlet or another power source for charging a battery associated with apparatus 110.
  • Fig. 2B is an exemplary schematic illustration of a front view of apparatus 110, illustrating camera 210, illumination device 212, and targeting lasers 214.
  • Fig. 2C is an exemplary schematic illustration of a right-hand side of apparatus 110.
  • apparatus 110 may include power button 216, volume buttons 218, battery level indicator 220, charging port 222, and trigger button 224.
  • apparatus 110 may include eyelet loop 226 that may be used to attach a lanyard or a chain.
  • the lanyard or chain may allow user 100 to suspend apparatus 110 from the user’s wrist, around the user’s neck, or to attach apparatus 110 to an accessory being carried by the user, or to an article of clothing (e.g., short, a belt, pants) of the user.
  • apparatus 110 may be attached to a temple of the user’s eyeglasses.
  • apparatus 110 may be attached via a connecting unit such that the user may remove the apparatus and re-attach it without having to calibrate and relocate the apparatus.
  • Fig. 2D is an exemplary schematic illustration of a left-hand side of apparatus 110, illustrating power button 216, volume buttons 218, trigger button 224, eyelet loop 226, and audio outlet 228.
  • Audio outlet 228 may be configured to receive one end of a cable (e.g., a one-pin headphone plug) that may allow audio feedback device 130 to be connected to apparatus 110 via a wired connection.
  • Fig. 2E is an exemplary schematic illustration of a bottom side of apparatus 110, illustrating one or more microphones 230 capable of capturing sounds from an environment of the user.
  • the one or more microphones 230 may be configured to capture sounds associated with voice commands issued by user 100 and convert the sounds into audio signals for further processing by apparatus 110.
  • Although microphones 230 have been illustrated as being located on a bottom side of apparatus 110, it should be noted that microphones 230 may be located on any one or more sides (e.g., a front side, a left side, a right side, a top side, and/or a bottom side) of apparatus 110. It is also contemplated that in some embodiments, one or more microphones 230 may additionally or alternatively be located on audio feedback device 130 (e.g., a headphone with a microphone) being worn by user 100. As also illustrated in Fig. 2E, one or more audio feedback units 130 (e.g., headphone 130a and/or speaker 130b) may be connected to apparatus 110 via audio outlet 228. In some embodiments, one or more speakers 130b may be integrated with and included in apparatus 110 (e.g., included in the housing or outer case 200 of apparatus 110).
  • Fig. 3 is a schematic illustration of an exemplary system 300, including apparatus 110 held by user 100, one or more audio feedback devices 130, an optional computing device 350, and/or a server 380 capable of communicating with apparatus 110 and/or with computing device 350 via network 370.
  • apparatus 110 may capture and analyze image data of the text or graphical material displayed by object 120.
  • apparatus 110 may include image sensor 310 configured to capture real-time image data of the text or graphical material displayed by object 120.
  • image sensor refers to a device capable of detecting and converting optical signals in the near-infrared, infrared, visible, and ultraviolet spectrums into electrical signals.
  • the electrical signals may be used to form an image or a video stream (i.e., image data) based on the detected signal.
  • image data includes any form of data retrieved from optical signals in the near-infrared, infrared, visible, and ultraviolet spectrums.
  • Examples of image sensors may include semiconductor charge-coupled devices (CCD), active pixel sensors in complementary metal-oxide-semiconductor (CMOS), or N-type metal-oxide-semiconductor (NMOS, Live MOS).
  • image sensor 310 may be part of camera 210 included in apparatus 110.
  • Apparatus 110 may include one or more processors 320 for controlling image sensor 310 to capture image data and for analyzing the image data according to disclosed embodiments.
  • processor 320 may also control audio feedback unit 130 to provide feedback to user 100, including information based on the analyzed image data or captured audio data and stored software instructions.
  • the term “a processor” or “at least one processor” may constitute any physical device or group of devices having electric circuitry that performs a logic operation on an input or inputs.
  • the processor or the at least one processor may include one or more integrated circuits (ICs), including application-specific integrated circuit (ASIC), microchips, microcontrollers, microprocessors, all or part of a central processing unit (CPU), a graphics processing unit (GPU), a digital signal processor (DSP), a field-programmable gate array (FPGA), a server, a virtual server, or other circuits suitable for executing instructions or performing logic operations.
  • a processor or at least one processor may include more than one processor.
  • Each processor may have a similar construction, or the processors may be of differing constructions that are electrically connected or disconnected from each other.
  • the processors may be separate circuits or integrated in a single circuit.
  • the processors may be configured to operate independently or collaboratively.
  • the processors may be coupled electrically, magnetically, optically, acoustically, mechanically or by other means that permit them to interact.
  • Apparatus 110 may include a memory 330.
  • the instructions executed by a processor or at least one processor may, for example, be pre-loaded into a memory integrated with or embedded into the controller or may be stored in a separate memory.
  • the memory may include a Random-Access Memory (RAM), a Read-Only Memory (ROM), a hard disk, an optical disk, a magnetic medium, a flash memory, other permanent, fixed, or volatile memory, or any other mechanism capable of storing instructions.
  • Apparatus 110 may include an audio sensor 340 that may be configured to receive one or more sounds, including environmental sounds and/or voices of one or more persons who are speaking (e.g., user 100) from an environment of a user and to convert the received sounds into one or more audio signals for further processing by, for example, processor 320.
  • audio sensor 340 may be part of one or more microphones 230 included in apparatus 110.
  • the one or more microphones 230 may be directional microphones, which may be more sensitive to picking up sounds in certain directions.
  • microphone 230 may comprise a unidirectional microphone designed to pick up sounds from a single direction or from a small range of directions.
  • microphone 230 may comprise a cardioid microphone, which may be sensitive to sounds from the front and sides.
  • microphone 230 may include a microphone array which may comprise additional microphones located on different sides of apparatus 110.
  • microphone 230 may be a multi-port microphone for capturing multiple audio signals.
  • system 300 may include a computing device 350.
  • the term “computing device” refers to a device including a processing unit (e.g., processor) and having computing capabilities.
  • Some examples of computing device 350 include a desktop computer or PC, laptop computer, a smartphone, a smartwatch, an e-reader, or other computing systems.
  • computing device 350 may be a smartphone having a display 360.
  • Computing device 350 may be configured to communicate directly with apparatus 110 or server 380 over network 370.
  • computing device 350 may be a computing system configured particularly for apparatus 110, and may be provided integral to apparatus 110 or tethered thereto via a wired or wireless connection.
  • Apparatus 110 may also connect to computing device 350 over network 370 via any known wireless standard (e.g., Wi-Fi, Bluetooth®, etc.), as well as near-field capacitive coupling, and other short range wireless techniques, or via a wired connection.
  • computing device 350 may have a dedicated application installed therein.
  • user 100 may view on display 360 data (e.g., images, video clips, extracted information, feedback information, etc.) that originate from or are triggered by apparatus 110.
  • user 100 may select part of the data for storage in server 380.
  • computing device 350 may include an input device or input/output interface that may allow user 100 to provide inputs, instructions, and/or feedback to processor 320.
  • Network 370 may be a shared, public, or private network, may encompass a wide area or local area, and may be implemented through any suitable combination of wired and/or wireless communication networks. Network 370 may further comprise an intranet or the Internet. In some embodiments, network 370 may include short range or near-field wireless communication systems for enabling communication between apparatus 110 and computing device 350 provided in close proximity to each other, such as on or near a user 100, for example.
  • Apparatus 110 may establish a connection to network 370 autonomously, for example, using a wireless module (e.g., Wi-Fi, cellular).
  • apparatus 110 may use the wireless module when being connected to an external power source, to prolong battery life.
  • communication between apparatus 110 and server 380 may be accomplished through any suitable communication channels, such as, for example, a telephone network, an extranet, an intranet, the Internet, satellite communications, off-line communications, wireless communications, transponder communications, a local area network (LAN), a wide area network (WAN), and a virtual private network (VPN).
  • apparatus 110 may transfer or receive data to/from server 380 via network 370.
  • the data being received from server 380 and/or computing device 350 may include numerous different types of information based on the analyzed image data, including information related to the content captured by apparatus 110, a commercial product, a person’s identity, an identified landmark, and any other information capable of being stored in or accessed by server 380.
  • data may be received and transferred via computing device 350.
  • system 300 may include one or more audio feedback devices 130 (e.g., headphone 130a and/or speaker 130b).
  • the one or more audio feedback devices 130 may be a part of apparatus 110.
  • the one or more audio feedback devices 130 may be connected to apparatus 110 and/or to computing device 350 via a wired connection or a wireless communication protocol. It is contemplated that the one or more audio feedback devices may be connected to apparatus 110 and/or computing device 350 via network 370. Additionally or alternatively, the one or more audio feedback devices 130 may be connected to apparatus 110 and/or computing device 350 via a near-field wireless communication protocol (e.g., Bluetooth®).
  • some embodiments may involve the usage of a server or cloud server (e.g., server 380).
  • the term “server” or “cloud server” refers to a computer platform that provides services via a network, such as the Internet.
  • server 380 may use virtual machines that may not correspond to individual hardware.
  • computational and/or storage capabilities may be implemented by allocating appropriate portions of desirable computation/storage power from a scalable repository, such as a data center or a distributed computing environment.
  • server 380 may implement the methods described herein using customized hard-wired logic, one or more Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs), firmware, and/or program logic which, in combination with the computer system, cause server 380 to be a special-purpose machine.
  • Fig. 4 is a schematic illustration of the exemplary apparatus 110 being used for reading textual information consistent with the present disclosure.
  • object 120 is configured to display text 410 that may include a headline “Example Headline Text” and additional text.
  • a user (e.g., user 100) may hold apparatus 110 and point apparatus 110 towards object 120.
  • Apparatus 110 may activate one or more LEDs 212 to illuminate text 410 displayed on object 120.
  • User 100 may press trigger button 224, which in turn may cause processor 320 of apparatus 110 to activate the one or more targeting lasers 214 to generate laser light and illuminate the text on object 120 using, for example, four corners 412, 414, 416, and 418.
  • User 100 may press the trigger button 224 again, which may cause processor 320 to activate camera 210, causing camera 210 to capture one or more images of the text displayed within the corners 412, 414, 416, and 418.
  • camera 210 may operate in a standby or “always on state,” and processor 320 may analyze the field of view of camera 210 at periodic intervals (e.g., every second, every two seconds, every five seconds, etc.) to determine whether text appears within the field of view.
  • Processor 320 may perform optical character recognition (OCR) on the text displayed within the corners 412, 414, 416, and 418 to recognize the text.
  • Processor 320 may also be configured to execute one or more text-to-speech algorithms to convert the text recognized by processor 320 into audio signals.
  • Processor 320 may be configured to transmit the audio signals to one or more audio feedback devices 130, which in turn may be configured to play the audio signals for user 100.
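A minimal sketch of the capture-recognize-speak flow described above. The pytesseract (OCR) and pyttsx3 (text-to-speech) libraries are illustrative choices for this sketch only; the disclosure does not name a particular OCR or TTS engine, and the file name is hypothetical.

```python
# Sketch of the flow: OCR the captured frame, then speak the recognized text.
from PIL import Image
import pytesseract
import pyttsx3

def read_image_aloud(image_path: str) -> str:
    """Recognize text in a captured image and play it as speech."""
    # Step 1: OCR the captured frame (corresponds to processor 320 running OCR).
    text = pytesseract.image_to_string(Image.open(image_path))

    # Step 2: convert the recognized text to speech and play it
    # (corresponds to transmitting audio signals to audio feedback device 130).
    engine = pyttsx3.init()
    engine.say(text)
    engine.runAndWait()
    return text

# Example usage (hypothetical file name):
# read_image_aloud("captured_frame.png")
```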
  • Fig. 5 is a block diagram illustrating the components of apparatus 110 according to an example embodiment including computing device 350.
  • apparatus 110 may include image sensor 310, processor 320, memory 330, microphone 340, LEDs 212, targeting laser 214, charging port 222, mobile power source 514, and wireless transceiver 516.
  • computing device 350 may include processor 520, feedback-outputting unit 524, memory 522, wireless transceiver 526, and a display 360.
  • Processors 320 and 520 may have the characteristics of “a processor” or “at least one processor” described earlier, and memories 330 and 522 may have the characteristics of memory described earlier.
  • Image sensor 310, microphone 340, LEDs 212, targeting lasers 214, and charging port 222 of apparatus 110 may have characteristics similar to those described earlier.
  • Mobile power source 514 of apparatus 110 may include any device capable of providing electrical power, which can be easily carried by hand (e.g., mobile power source 514 may weigh less than a pound). The mobility of the power source may enable user 100 to use apparatus 110 in a variety of situations.
  • mobile power source 514 may include one or more batteries (e.g., nickel-cadmium batteries, nickel-metal hydride batteries, and lithium-ion batteries) or any other type of electrical power supply.
  • mobile power source 514 may be rechargeable and contained within outer case 200 that holds components of apparatus 110.
  • mobile power source 514 may include one or more energy harvesting devices for converting ambient energy into electrical energy (e.g., portable solar power units, human vibration units, etc.).
  • Mobile power source 514 of apparatus 110 may be charged using power from an external power source 512.
  • external power source 512 may be an electrical power outlet that in turn may be connected to an electrical power grid or an electricity generator. Additionally or alternatively, external power source 512 may include one or more power sources similar to those described above with respect to mobile power source 514.
  • Apparatus 110 may include wireless transceiver 516 and computing device 350 may include wireless transceiver 526.
  • wireless transceiver refers to any device configured to exchange transmissions over an air interface by use of radio frequency, infrared frequency, magnetic field, or electric field.
  • Wireless transceiver 516 and/or 526 may use any known standard to transmit and/or receive data (e.g., Wi-Fi, Bluetooth®, Bluetooth Smart, 802.15.4, or ZigBee).
  • wireless transceiver 516 may transmit data (e.g., raw image data, processed image data, extracted information, audio signals, products of processing audio signals) from apparatus 110 to computing device 350 and/or server 380.
  • Wireless transceiver 516 may also receive data from computing device 350 and/or server 380. In other embodiments, wireless transceiver 516 may transmit data and instructions to an external audio feedback device 130. In some embodiments, wireless transceiver 526 may transmit data (e.g., raw image data, processed image data, extracted information, audio signals, products of processing audio signals) from computing device 350 to apparatus 110 and/or server 380. Wireless transceiver 526 may also receive data from apparatus 110 and/or server 380. In other embodiments, wireless transceiver 526 may transmit data and instructions to an external audio feedback device 130.
  • Although apparatus 110 and computing device 350 are each depicted with only one wireless transceiver in Fig. 5, apparatus 110 and/or computing device 350 may include more than one wireless transceiver (e.g., two wireless transceivers). In an arrangement with more than one wireless transceiver, each of the wireless transceivers may use a same or different standard to transmit and/or receive data.
  • a first wireless transceiver may communicate using a cellular standard (e.g., LTE or GSM), and a second wireless transceiver may communicate using a short-range standard (e.g., Wi-Fi or Bluetooth®).
  • apparatus 110 may use the first wireless transceiver when the wearable apparatus is powered by a mobile power source included in apparatus 110, and use the second wireless transceiver when apparatus 110 is powered by an external power source.
  • Feedback outputting unit 524 of computing device 350 may have characteristics similar to those of audio feedback device 130 as described earlier. Alternatively, feedback outputting unit 524 of computing device 350 may be configured to transmit audio signals via a wired connection to an external audio feedback device 130 (e.g., headphone 130a or speaker 130b).
  • Fig. 6 illustrates an exemplary embodiment of apparatus 110 comprising text and voice recognition components consistent with the present disclosure.
  • Apparatus 110 is shown in Fig. 6 in a simplified form, and apparatus 110 may contain additional elements or may have alternative configurations, for example, as shown in Figs. 2A-2E and/or 5.
  • Memory 330 may include text recognition component 610 and voice recognition component 620.
  • Memory 330 may include additional components such as an image analysis component, logic components for carrying out methods as detailed below, artificial intelligence engines (e.g., such as classifiers configured to classify images), trained networks (e.g., one or more machine learning systems, one or more neural networks, etc.), or the like.
  • Components 610 and 620 may contain software instructions for execution by at least one processing device, e.g., processor 320, included with apparatus 110.
  • Components 610 and 620 are shown within memory 330 by way of example only, and may be located in other locations within system 300. For example, components 610 and 620 may be located in computing device 350, on a remote server 380, or in another associated device.
  • apparatus 110 may be configured to exchange (e.g., send or receive) information with a data structure or database 650.
  • Data structure or database 650 may be included in memory 330 or in another memory or storage device included in apparatus 110, or may be included in a memory device of a remote server (e.g., server 380) that is accessible via one or more networks (e.g., via network 370).
  • a data structure or database consistent with the present disclosure may include any collection of data values and relationships among them.
  • the data may be stored linearly, horizontally, hierarchically, relationally, non-relationally, uni-dimensionally, multidimensionally, operationally, in an ordered manner, in an unordered manner, in an object-oriented manner, in a centralized manner, in a decentralized manner, in a distributed manner, in a custom manner, or in any manner enabling data access.
  • data structures or databases may include an array, an associative array, a linked list, a binary tree, a balanced tree, a heap, a stack, a queue, a set, a hash table, a record, a tagged union, ER model, and a graph.
  • a data structure or database may include an XML database, an RDBMS database, an SQL database or NoSQL alternatives for data storage/search such as, for example, MongoDB, Redis, Couchbase, Datastax Enterprise Graph, Elastic Search, Splunk, Solr, Cassandra, Amazon DynamoDB, Scylla, HBase, and Neo4J.
  • a data structure or database may be a component of the disclosed system or a remote computing component (e.g., a cloud-based data structure). Data in the data structure or database may be stored in contiguous or non-contiguous memory. Moreover, a data structure or database, as used herein, does not require information to be co-located. It may be distributed across multiple servers or storage devices, for example, that may be owned or operated by the same or different entities. Thus, the term “data structure” or “database” as used herein in the singular is inclusive of plural data structures.
  • camera 210 of apparatus 110 may capture one or more images of, for example, text displayed on object 120.
  • Processor 320 may receive the captured images and perform image processing on the received images.
  • processor 320 may execute one or more algorithms stored in text recognition component 610 to identify the text in the received images.
  • processor 320 may be configured to perform OCR to recognize one or more characters or words in the received images.
  • Processor 320 may also be configured to convert the detected characters or words into audio signals using one or more text-to-speech algorithms.
  • Such algorithms may include, for example, one or more of WaveNet, DeepVoice, Tacotron, natural language processing (NLP), or any other algorithm that converts a text input into an audio signal representing speech.
  • processor 320 may be configured to execute one or more text-to-speech modules stored in apparatus 110, in computing device 350, and/or on server 380 or database 650 to convert the text displayed on object 120 into audio signals representing speech.
  • Processor 320 may be configured to transmit the audio signals to audio feedback device 130 via, for example, wireless transceiver 516.
  • Processor 320 may also be configured to cause audio feedback device 130 to play the audio signals to user 100 allowing user 100 to listen to, learn from, and understand the text displayed on object 120.
  • the text or parts thereof may be read to a user in a predetermined voice.
  • the predetermined voice may be selected from a set of predetermined voices stored in apparatus 110, in computing device 350, and/or on server 380.
  • user 100 may be able to select a desired voice from among the predetermined voices available on apparatus 110.
  • apparatus 110 may be configured to perform one or more actions (e.g., capture one or more images or activate targeting lasers 214) in response to one or more voice commands.
  • Audio sensor 340 in microphone 230 of apparatus 110 may capture a voice of user 100 when user 100 issues a voice command to apparatus 110.
  • Processor 320 may analyze the audio signals captured by microphone 230 to identify a voice of user 100. This may be performed using voice recognition component 620 and may include one or more voice recognition algorithms, such as Hidden Markov Models, Dynamic Time Warping, neural networks, or other techniques.
  • Voice recognition component 620 and/or processor 320 may access a data structure (or database) 650 or another storage device, which may further include a voiceprint of user 100.
  • Voice recognition component 620 may analyze the audio signals captured by microphone 340 to determine whether a voice in those signals matches a voiceprint of user 100.
  • Having a user’s voiceprint, and a high-quality voiceprint in particular, may provide a fast and efficient way to identify the user’s voice in the audio signals.
  • a high-quality voiceprint may be collected, for example, when the user speaks alone, preferably in a quiet environment.
  • the voice signature may be generated using any engine or algorithm such as but not limited to a neural network.
  • the audio may be for example, of one second of a clean voice.
  • the output signature may be a vector representing the speaker's voice
  • processor 320 may employ one or more trained machine learning models or neural networks to identify one or more segments of the audio signal received by processor 320 as comprising speech. For example, a set of training audio signals together with corresponding labels may be provided to train a machine learning model or a neural network. One or more segments of the audio signal received by processor 320 may be presented as input to the trained machine learning model or neural network, which may output an indication regarding whether or not the one or more segments of the audio signal comprise speech.
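A minimal sketch of the speaker-verification step described above, assuming a trained speaker-embedding model is available. The embed_voice callable is a hypothetical placeholder for whatever engine produces the voice-signature vector, and the cosine-similarity threshold is an assumption of the sketch.

```python
import numpy as np
from typing import Callable

def is_user_voice(segment: np.ndarray,
                  stored_voiceprint: np.ndarray,
                  embed_voice: Callable[[np.ndarray], np.ndarray],
                  threshold: float = 0.8) -> bool:
    """Compare the segment's speaker embedding to the enrolled voiceprint."""
    emb = embed_voice(segment)                       # trained model -> signature vector
    cos = float(np.dot(emb, stored_voiceprint) /
                (np.linalg.norm(emb) * np.linalg.norm(stored_voiceprint) + 1e-9))
    return cos >= threshold                          # threshold is an assumption
```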
  • Processor 320 may be configured to transcribe the detected voice or speech in the audio signal using various speech recognition algorithms.
  • Processor 320 may be configured to execute one or more sound recognition modules in voice recognition component 620 to identify one or more words in the received audio signals.
  • the one or more sound processing modules may allow processor 320 to identify or detect a command in the received audio signals.
  • Processor 320 may also be configured to take one or more actions in response to the determined commands. For example, processor 320 may access database 650 that may store one or more commands in association with one or more actions.
  • Processor 320 may be configured to execute the one or more actions associated with a detected command.
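A minimal sketch of associating detected commands with actions, in the spirit of the command/action lookup described above. The command words and action callbacks are hypothetical examples, not commands defined by the disclosure.

```python
from typing import Callable, Dict

def capture_image() -> None:
    print("capturing image")               # placeholder for triggering camera 210

def activate_targeting_lasers() -> None:
    print("activating targeting lasers")   # placeholder for targeting lasers 214

# Hypothetical command-to-action association (e.g., stored in database 650).
COMMAND_ACTIONS: Dict[str, Callable[[], None]] = {
    "capture": capture_image,
    "point": activate_targeting_lasers,
}

def handle_command(transcribed_text: str) -> bool:
    """Run the action associated with the first recognized command word, if any."""
    for word in transcribed_text.lower().split():
        action = COMMAND_ACTIONS.get(word)
        if action is not None:
            action()
            return True
    return False
```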
  • apparatus 110 may be configured to extract text from the user’s speech.
  • the user may be reading displayed text, asking or answering questions, or the like, and apparatus 110 may analyze the text or perform other actions regarding the text, as detailed below.
  • a reading device as described herein may be useful for people with various types of learning disabilities.
  • a common disability for which the reading device may be particularly helpful is dyslexia, also known as reading disorder.
  • Dyslexia is characterized by reading difficulty in individuals with otherwise unaffected intelligence. Different people are affected to different degrees. Problems may include difficulties in spelling words, reading correctly and/or quickly, writing words, “sounding out” words in the head, pronouncing words when reading aloud, and understanding what one reads. Often these difficulties are first noticed at school.
  • the reading device may also be useful for cases of “alexia,” which relates to a situation in which someone who could previously read loses their ability to read.
  • Dyslexia is believed to be caused by the interaction of genetic and environmental factors, and some cases run in families. Dyslexia that develops due to a traumatic brain injury, stroke, or dementia is called “acquired dyslexia.”
  • Dyslexia is diagnosed through a series of tests of memory, vision, spelling, and reading skills. Dyslexia is separate from reading difficulties caused by hearing or vision problems or by insufficient teaching or opportunity to learn.
  • Dyslexia may cause low self-esteem and thus a vicious cycle of low learning abilities, resulting in an increasing gap from the expected level. It is thus important to break this cycle by helping people with dyslexia succeed in their challenges, thereby keeping them in line with the desired learning level, increasing their self-esteem, and encouraging them to push their limits further.
  • a reading device comprises a light source configured to illuminate an object.
  • the reading device is configured to be held in the hand of the user.
  • the reading device may be apparatus 110.
  • the reading device may have a shape different than apparatus 110 but still be sized and shaped such that the reading device may be held in the hand of a user.
  • the reading device may be in various sizes, to accommodate different hand sizes for different users, e.g., a version sized for a child’s hand and a version sized for an adult’s hand.
  • buttons on the reading device e.g., if there are one or more control buttons on a side of the reading device
  • the user may hold the reading device in a position similar to holding a pen.
  • the reading device may be designed such that this position enables the user to activate a trigger (as will be described below) with the user’s index finger, also referred to as the forefinger.
  • This position is convenient for pointing the reading device at a computer screen, a mobile phone, a tablet, a book, a newspaper, or the like.
  • This position is a natural position for the user to hold the reading device and causes less stress on the hand muscles and tendons, thereby allowing usage of the reading device for prolonged periods of time. This position is also natural since it is similar to how a user holds a pen or pencil.
  • the reading device may not be handheld. For example, a user of the reading device may not be physically able to hold the reading device in their hand. If the reading device is not handheld, the reading device may be connected to a stand, a tripod, a mount, or other device to support the reading device in a position in which the user may operate the reading device, as will be described below.
  • the reading device may be connected to the stand, the tripod, the mount, or other device in various ways, including, but not limited to, a bracket into which the reading device slides or is force-fit, or a connector arrangement in which a first part of the connector is associated with the stand (for example) and a second part of the connector is associated with the reading device.
  • the first part of the connector may be a threaded fastener (e.g., a screw) and the second part of the connector may be a threaded bore in housing 200 of apparatus 110 configured to receive the threaded fastener.
  • the first part of the connector may be a tab, a raised button, or similar protrusion extending from a surface of the stand (for example) and the second part of the connector may be a groove in housing 200 of apparatus 110 configured to receive the tab such that apparatus 110 connects to the stand in a “slide and lock” manner.
  • the light source may include illumination device 212 and/or targeting lasers 214.
  • the light source includes at least one of a laser light, a visible light, an infrared light, or an ultraviolet light.
  • the object may include a book, a newspaper, a magazine, a poster, a label, a notebook, a notepad, paper, a computerized display, a tablet computer, a smartphone, or any other object that may have text displayed on or printed on the object.
  • the reading device comprises a trigger configured to activate the light source, the trigger being at least partially operable by a finger of a hand of a user.
  • the trigger may be operable by an index finger or any finger of a hand of a user.
  • the trigger may include trigger button 224 of apparatus 110, volume buttons 218, and power button 216.
  • the trigger may include a touch-sensitive area on housing 200 of apparatus 110, such as a touch screen or a touch sensor.
  • the trigger may include one or more pressure sensitive areas on housing 200 of apparatus 110 such that a user may “squeeze” housing 200 to activate the trigger.
  • the trigger may be voice activated by the user speaking a keyword which may be received by a microphone (e.g., microphones 230) of apparatus 110.
  • a trigger may also be operated by any combination, in any order, of the above.
  • the reading device comprises a camera configured to capture images from an environment of the user.
  • the camera may be camera 210 of apparatus 110.
  • camera 210 may include image sensor 310, as described above.
  • the environment of the user may include an area near the user and within a visual range of the camera, such as within a field of view of a lens of the camera.
  • the reading device comprises an audio output device configured to output audio signals.
  • the audio output device may include external audio feedback device 130 (e.g., headphones 130a or speaker 130b) to output audio signals.
  • the audio output device may include any device connectable to audio outlet 228 of apparatus 110 configured to output audio signals.
  • the reading device comprises at least one processor programmed to perform various operations.
  • the at least one processor may include any component or unit with data processing capabilities, as described above.
  • the at least one processor may be programmed to, in response to operation of the trigger, project light from the light source to illuminate an area of the object. For example, if the trigger includes trigger button 224, pressing trigger button 224 may turn on targeting lasers 214 to project light to illuminate an area of object 120. As another example, pressing trigger button 224 may turn on illumination device 212 to project light onto object 120.
  • the light source is configured to illuminate one or more borders of the area of the object.
  • the light source may project light in a particular shape (e.g., a rectangle) and the user may move apparatus 110 closer to the object or farther from the object to position the shape to illuminate the borders of the area of the object.
  • the light source is configured to illuminate one or more corners of the area of the object.
  • one or more targeting lasers 214 may emit laser light that marks the displayed text matter on object 120 using corners 412, 414, 416, and 418.
  • the light source may be configured to project only two corners, such as corners 412 and 416 to bracket a left side of the area of the object, corners 414 and 418 to bracket a right side of the area of the object, or corners 412 and 418 or corners 414 and 416 to bracket opposite corners of the area of the object.
  • the light source includes a mask configured to project the light in a predetermined shape onto the area of the object.
  • the mask may include a physical object (e.g., a physical mask) placed over the light source to project the predetermined shape.
  • the mask may be changeable by the user or may be pre-installed on the reading device.
  • the predetermined shape includes one of a cursor, a hand, an arrow, a circle, or a square. Other shapes and different sizes of the shapes may also be used as the mask.
  • the mask may be a digital mask, masking undesired parts of the frame. The digital mask may be configurable.
  • Referring to Fig. 7, apparatus 110 includes camera 210, illumination device 212, and three targeting lasers 214a, 214b, and 214c.
  • Each targeting laser 214 includes a mask 702 over the laser, such that the laser projects a predetermined circle shape because mask 702 is circular.
  • targeting laser 214a includes mask 702a
  • targeting laser 214b includes mask 702b
  • targeting laser 214c includes mask 702c.
  • Although Fig. 7 shows masks 702a, 702b, and 702c all having the same shape, in some embodiments different masks may have different shapes, such as a cursor, a hand, an arrow, or the like; i.e., the shapes of all the masks do not need to be the same.
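Where the mask is digital rather than physical, the masking can be performed on the captured frame itself. A minimal sketch, assuming a configurable rectangular region of interest (any other shape could be rasterized the same way); the rectangle parameters are assumptions of the sketch.

```python
import numpy as np

def apply_digital_mask(frame: np.ndarray, top: int, left: int,
                       height: int, width: int) -> np.ndarray:
    """Keep only the configured rectangle of the frame; zero out everything else."""
    mask = np.zeros(frame.shape[:2], dtype=bool)
    mask[top:top + height, left:left + width] = True
    masked = np.zeros_like(frame)
    masked[mask] = frame[mask]      # works for grayscale (H, W) and color (H, W, C) frames
    return masked
```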
  • the at least one processor may be programmed to capture at least one image of the illuminated area of the object, wherein the at least one image includes a representation of written material. For example, pressing trigger button 224 a second time may cause camera 210 to capture one or more images of the illuminated area of the object.
  • the illuminated area of the object may include text marked by targeting lasers 214 or the entire text displayed on object 120.
  • the written material may include a word or a sequence of words (e.g., a phrase, a sentence, a paragraph, or a series of multiple paragraphs) shown on the object.
  • the at least one processor may be programmed to analyze the at least one image to recognize text.
  • the at least one processor may be programmed to recognize the text from the at least one image using an optical character recognition technique.
  • the at least one processor may include processor 320 and may perform optical character recognition (OCR) on the at least one image to recognize one or more characters or words in the at least one image.
  • one or more machine learning algorithms may be used to perform the OCR or to supplement the OCR.
  • the machine learning algorithms may include performing natural language processing with a neural network.
  • the neural network may include a Long Short-Term Memory (LSTM) neural network, a Gated Recurrent Unit (GRU) neural network, or other type of neural network.
  • the at least one processor may be programmed to transform the recognized text into at least one audio signal.
  • processor 320 may convert the detected characters or words into audio signals using one or more text-to-speech algorithms.
  • Such algorithms may include, for example, one or more of WaveNet, DeepVoice, Tacotron, natural language processing (NLP), or any other algorithm that converts a text input into an audio signal representing speech.
  • the text may be read in the order a human user would read it, for example, if there are multiple columns in the text, then in English the columns may be read left-to-right, and each column may be read top-down.
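A minimal sketch of the human reading order described above, assuming word-level bounding boxes are already available from OCR. The column-gap heuristic and the coordinate convention are assumptions of the sketch, not part of the disclosure.

```python
from typing import List, Tuple

# Each word is (text, x, y), where x and y are the word's top-left pixel coordinates.
Word = Tuple[str, int, int]

def reading_order(words: List[Word], column_gap: int = 200) -> List[str]:
    """Return word texts column by column (left-to-right), top-down within a column."""
    if not words:
        return []
    # Sort by x so columns can be split wherever a large horizontal gap appears.
    by_x = sorted(words, key=lambda w: w[1])
    columns: List[List[Word]] = [[by_x[0]]]
    for word in by_x[1:]:
        if word[1] - columns[-1][-1][1] > column_gap:
            columns.append([word])          # start a new column after a wide gap
        else:
            columns[-1].append(word)
    ordered: List[str] = []
    for column in columns:                  # columns left-to-right
        column.sort(key=lambda w: (w[2], w[1]))  # top-down, then left-to-right
        ordered.extend(w[0] for w in column)
    return ordered
```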
  • the at least one processor may be programmed to output the at least one audio signal using the audio output device.
  • processor 320 may transmit the audio signals to the audio output device, which in turn may be configured to play the audio signals for the user.
  • the audio output device includes one of a speaker disposed on the reading device, an external speaker, a headphone associated with the reading device, a hearing aid interface, or a speaker of a computing device associated with the user.
  • the headphone may include headphones 130a and the external speaker may include speaker 130b. Headphones 130a and speaker 130b may receive the audio signals by a wireless connection or a wired connection (e.g., via audio outlet 228 of apparatus 110).
  • the computing device includes one of a desktop computer, a laptop computer, a tablet, a smartphone, or a smartwatch.
  • the speaker of the computing device may include a speaker internal to the computing device or a speaker external to the computing device.
  • outputting the at least one audio signal causes generation of sounds representative of speech.
  • the recognized text may be transformed into an audio signal using a text-to-speech algorithm, such that the audio signal is representative of speech.
  • the at least one processor may be further programmed to receive at least one user setting configured to adjust at least one characteristic of the speech.
  • the user may enter the user setting by using trigger button 224, volume buttons 218, power button 216 or a combination of trigger button 224, volume buttons 218, and power button 216.
  • the user may enter the user setting via a user interface of computing device 350 associated with apparatus 110.
  • a characteristic of the speech may include any one or more of speed of the speech (e.g., how fast the speaker speaks the words), loudness (e.g., the volume at which the words are spoken), intonation (e.g., which words in the speech are emphasized), voice of the speaker (e.g., male, female, adult, or child), or accent of the speaker.
  • the at least one user setting includes at least one of a reading speed or a time interval between sentences or paragraphs.
  • the reading speed may indicate a number of words per minute in the spoken speech, e.g., 140 words per minute.
  • the number of words per minute in the spoken speech may be set by selecting an absolute numerical value (e.g., 140 words per minute), by selecting a range of values (e.g., 130-150 words per minute), by selecting a category (e.g., “slow,” “medium,” or “fast,” with a corresponding number of words per minute for each category), or other selection options to determine how fast the spoken words are presented to the user.
  • the time interval between sentences or paragraphs may indicate a length of time after each sentence or paragraph is spoken.
  • the time interval may be set by selecting an absolute numerical value (e.g., 2 seconds), by selecting a range of values (e.g., 2-5 seconds), by selecting a category (e.g., “short,” “medium,” or “long,” with a corresponding length of time for each category), or other selection options to determine the length of time after each sentence or paragraph is spoken.
  • the at least one user setting includes selection of an accent for the speech.
  • the user setting may include a list of speech accents (e.g., New York-accented English, southern United States-accented English, or California-accented English) and the user may select an accent from the list.
  • the at least one processor may be further programmed to adjust the at least one audio signal based on the user setting.
  • settings for the text-to-speech algorithm to generate the audio signal may include parameters that may be adjusted based on the user settings.
  • the reading device further includes at least one button configured to receive an input associated with the at least one user setting.
  • the at least one button may include trigger button 224, volume buttons 218, or power button 216.
  • Receiving an input from the at least one button may include the user pressing trigger button 224, volume buttons 218, power button 216, or any combination of these buttons, either by repeated presses of a single button, pressing different buttons in a particular sequence, or pressing different buttons simultaneously.
  • the at least one processor is programmed to apply the at least one user setting in response to activation of the at least one button.
  • the user settings may be applied to the text-to-speech algorithm parameters to adjust the generated audio signal.
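A minimal sketch of applying user settings to the text-to-speech stage. pyttsx3 is an illustrative engine choice, and the words-per-minute values for the speed categories are assumptions of the sketch.

```python
import pyttsx3

SPEED_WPM = {"slow": 110, "medium": 140, "fast": 180}   # assumed category mapping

def configure_tts(speed: str = "medium", voice_index: int = 0):
    """Apply reading-speed and voice settings to the TTS engine."""
    engine = pyttsx3.init()
    engine.setProperty("rate", SPEED_WPM.get(speed, 140))     # words per minute
    voices = engine.getProperty("voices")
    if 0 <= voice_index < len(voices):
        engine.setProperty("voice", voices[voice_index].id)   # predetermined voice
    return engine

# Example usage:
# engine = configure_tts("slow", voice_index=1)
# engine.say("Example Headline Text")
# engine.runAndWait()
```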
  • the reading device further comprises a microphone configured to capture sounds from the environment of the user.
  • the microphone may include microphone 230 and may be directional or unidirectional.
  • the environment of the user may include an area near the user and within an audio receiving range of the microphone.
  • receiving the at least one user setting includes analyzing one or more words spoken by the user and represented in sounds captured by the microphone.
  • the user settings may be selected by the user speaking the desired settings, such as “medium speed” or “New York-accented English.”
  • microphone 230 may capture the user speaking the settings and then the settings may be applied.
  • the words spoken by the user may be analyzed by any speech-to-text algorithm as described herein. For example, the user may precede speaking the settings by saying a keyword such as “settings.”
  • the at least one processor is programmed to receive an input from the trigger.
  • the trigger may include trigger button 224 and receiving an input from the trigger may include the user pressing trigger button 224.
  • the at least one processor responsive to the input from the trigger, is programmed to cause light to be projected from the light source to illuminate a point within a word in the written material. For example, once trigger button 224 is pressed, illumination device 212 and/or targeting lasers 214 may be activated to project light.
  • the input includes one of a single tap on the trigger or a double-tap on the trigger. For example, the user may tap trigger button 224 once (a “single tap”) or twice in rapid succession (a “double-tap”).
  • object 120 includes text 410.
  • When the user presses trigger button 224, one targeting laser 214 is activated and projects a light beam 802 from apparatus 110 to illuminate the word “necessarily,” or a point within the word “necessarily,” in text 410.
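A minimal sketch of resolving which word lies under the projected point: run word-level OCR to obtain bounding boxes, then pick the box containing the pointer's pixel coordinates. pytesseract is an illustrative OCR choice, and detecting the laser dot's coordinates in the image is assumed to happen elsewhere.

```python
from PIL import Image
import pytesseract
from pytesseract import Output

def word_at_point(image_path: str, px: int, py: int) -> str:
    """Return the OCR word whose bounding box contains the pointer coordinates."""
    data = pytesseract.image_to_data(Image.open(image_path), output_type=Output.DICT)
    for i, text in enumerate(data["text"]):
        if not text.strip():
            continue
        left, top = data["left"][i], data["top"][i]
        width, height = data["width"][i], data["height"][i]
        if left <= px <= left + width and top <= py <= top + height:
            return text
    return ""
```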
  • the recognized text includes a formula or a mathematical expression.
  • the at least one processor is further programmed to transform the recognized text into the audio signal such that when the audio signal is output through the audio output device, the resulting sound corresponds to the order in which a human reader would read the formula or the mathematical expression. For example, if the recognized text includes the mathematical expression
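A hypothetical illustration of reading a simple expression in spoken order rather than symbol by symbol (the disclosure's own example expression is not reproduced above). The operator-to-word table is an assumption of the sketch, and grouping symbols are left untouched for simplicity.

```python
OPERATOR_WORDS = {"+": "plus", "-": "minus", "*": "times", "/": "divided by", "=": "equals"}

def speak_expression(expression: str) -> str:
    """Replace operator symbols in a whitespace-separated expression with spoken words."""
    return " ".join(OPERATOR_WORDS.get(token, token) for token in expression.split())

# speak_expression("x = ( a + b ) / 2")
# -> "x equals ( a plus b ) divided by 2"
```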
  • Some embodiments describe a method of reading written material depicted on an object.
  • the method may be performed by apparatus 110 or a processor included in apparatus 110 (e.g., processor 320).
  • the method includes receiving, by a processor, an input representative of operation of a trigger on a reading device.
  • the reading device may include apparatus 110 and the processor may include processor 320.
  • the trigger may include trigger button 224.
  • the input representative of operation of the trigger may include a user pressing trigger button 224 once or multiple times; the user pressing a sequence of trigger button 224, volume buttons 218, and/or power button 216; or the user simultaneously pressing one or more of trigger button 224, volume buttons 218, and power button 216.
  • the method includes, in response to operation of the trigger, projecting light from a light source of the reading device to illuminate an area of the object.
  • For example, if the trigger includes trigger button 224, pressing trigger button 224 may turn on targeting lasers 214 to project light to illuminate an area of object 120. As another example, pressing trigger button 224 may turn on illumination device 212 to project light onto object 120.
  • the method includes capturing, using a camera of the reading device, at least one image of the illuminated area of the object, wherein the at least one image includes a representation of written material.
  • the camera may include camera 210 of apparatus 110.
  • pressing trigger button 224 a second time may cause camera 210 to capture one or more images of the illuminated area of the object.
  • the illuminated area of the object may include text marked by targeting lasers 214 or the entire text displayed on object 120.
  • the written material may include a word or a sequence of words (e.g., a phrase, a sentence, a paragraph, or a series of multiple paragraphs) shown on the object.
  • the method includes analyzing, using the processor, the at least one image to recognize text.
  • the at least one processor may include processor 320 and may perform optical character recognition (OCR) to recognize one or more characters or words in the at least one image.
  • one or more machine learning algorithms may be used to perform the OCR or to supplement the OCR.
  • the machine learning algorithms may include performing natural language processing with a neural network.
  • the neural network may include a Long Short-Term Memory (LSTM) neural network, a Gated Recurrent Unit (GRU) neural network, or other type of neural network.
  • the method includes transforming, using the processor, the recognized text into at least one audio signal.
  • processor 320 may convert the detected characters or words into audio signals using one or more text-to-speech algorithms.
  • Such algorithms may include, for example, one or more of WaveNet, DeepVoice, Tacotron, natural language processing (NLP), or any other algorithm that converts a text input into an audio signal representing speech.
  • the method includes outputting the at least one audio signal using an audio output device associated with the reading device.
  • processor 320 may transmit the audio signals to the audio output device, which in turn may be configured to play the audio signals for the user.
  • the audio output device may include a speaker disposed on the reading device, an external speaker (e.g., speaker 130b), headphones associated with the reading device (e.g., headphones 130a), or a speaker of a computing device associated with the user (e.g., computing device 350).
  • outputting the at least one audio signal causes generation of sounds representative of speech.
  • the recognized text may be transformed into an audio signal using a text-to-speech algorithm, such that the audio signal is representative of speech.
  • the method further includes receiving, by the processor, at least one user setting configured to adjust at least one characteristic of the speech.
  • the user may enter the user setting by using trigger button 224, volume buttons 218, power button 216 or a combination of trigger button 224, volume buttons 218, and power button 216.
  • the user may enter the user setting via a user interface of computing device 350 associated with apparatus 110.
  • the at least one user setting includes at least one of: a reading speed, a time interval between sentences or paragraphs, or an accent for the speech, as described above.
  • the method further includes adjusting the at least one audio signal based on the user setting. For example, parameters of the text-to-speech algorithm to generate the audio signal may be adjusted based on the user setting.
  • the method further includes receiving an input from the trigger.
  • the trigger may include any button, voice, or another method as described above and receiving an input from the trigger may include the user pressing trigger button 224.
  • the method further includes responsive to the input from the trigger, causing light to be projected from the light source to illuminate a point within a word in the written material. For example, once a trigger is operated, illumination device 212 and/or targeting lasers 214 may be activated to project light.
  • Fig. 9 is a flowchart of a method 900 of reading written material depicted on an object.
  • the method may be performed by a reading device (e.g., apparatus 110) or a processor included in the reading device (e.g., processor 320 in apparatus 110).
  • the object may include a book, a newspaper, a magazine, a poster, a label, a notebook, a notepad, paper, a computerized display, a tablet, a smartphone, or any other object that may have text (i.e., written material) displayed on or printed on the object.
  • the reading device projects light to illuminate an area of the object (step 904).
  • for example, if the trigger includes trigger button 224, pressing trigger button 224 may turn on targeting lasers 214 to project light to illuminate an area of object 120.
  • as another example, operating the trigger may turn on illumination device 212 to project light onto object 120.
  • An image of the illuminated area is captured (step 906).
  • the reading device may include a camera (e.g., camera 210 of apparatus 110).
  • Receiving another trigger activation (e.g., pressing trigger button 224 a second time) may cause camera 210 to capture one or more images of the illuminated area of the object.
  • the illuminated area of the object may include text marked by targeting lasers 214 or the entire text displayed on object 120.
  • the written material may include a word or a sequence of words (e.g., a phrase, a sentence, a paragraph, or a series of multiple paragraphs) shown on the object.
  • the image is analyzed to recognize text in the image (step 908).
  • the reading device may include a processor (e.g., processor 320 of apparatus 110).
  • the processor may perform optical character recognition (OCR) on the image to recognize one or more characters or words in the image.
  • one or more machine learning algorithms may be used to perform the OCR or to supplement the OCR.
  • the machine learning algorithms may include performing natural language processing with a neural network.
  • the neural network may include a Long Short-Term Memory (LSTM) neural network, a Gated Recurrent Unit (GRU) neural network, or other type of neural network.
  • the recognized text from the image is transformed into an audio signal (step 910).
  • the processor of the reading device (e.g., processor 320 of apparatus 110) may convert the recognized text into an audio signal using one or more text-to-speech algorithms.
  • Such algorithms may include, for example, one or more of WaveNet, DeepVoice, Tacotron, natural language processing (NLP), or any other algorithm that converts a text input into an audio signal representing speech.
  • the audio signal is output to a user of the reading device (step 912).
  • the processor of the reading device (e.g., processor 320 of apparatus 110) may transmit the audio signal to an audio output device, which in turn may play the audio signal for the user.
  • the audio output device includes a speaker disposed on the reading device, an external speaker (e.g., speaker 130b), headphones associated with the reading device (e.g., headphones 130a), or a speaker of a computing device associated with the user (e.g., computing device 350).
  • instead of the reading device being operable only by the user pressing trigger button 224, the trigger may be operable by voice activation (e.g., the user speaking a command and/or an activation keyword), the user pressing a touch sensitive area, the user pressing a pressure sensitive area, or any combination thereof.
  • the user may press the trigger button and speak a command to operate the trigger.
  • a user of apparatus 110 may use apparatus 110 to ask and answer questions relating to text 410 displayed on object 120.
  • apparatus 110 may parse the text and generate a question for the user to answer.
  • the user may ask a question to apparatus 110 based on the text and apparatus 110 may provide an answer based on the illuminated words.
  • a system includes a camera configured to capture one or more images from an environment of a user.
  • the system may include apparatus 110 and the camera may include camera 210.
  • camera 210 may include image sensor 310, as described above.
  • the environment of the user may include an area near the user and within a visual range of the camera, such as within a field of view of a lens of the camera.
  • the system includes a microphone configured to capture sounds from the environment of the user.
  • the microphone may include microphone 230 and may be directional, unidirectional, a microphone array, or the like.
  • the environment of the user may include an area near the user and within an audio range of the microphone.
  • the system includes a visual pointer configured to be aimed by a user and indicate a point of interest in the one or more images.
  • the visual pointer may include the one or more targeting lasers 214 to generate laser light and illuminate the point of interest.
  • the visual pointer may include the corners 412, 414, 416, and 418 to define an area.
  • the visual pointer may be configured to point at a single word in an image containing text.
  • the point of interest may include a word, a sequence of words (e.g., a phrase, a sentence, a paragraph, or a series of multiple paragraphs), an image (e.g., a dog, a bird, a tree, etc.), or a portion of an image (e.g., a plaque on a statue) that the user is interested in.
  • the system includes an audio output device for outputting audio signals.
  • the audio output device may include external audio feedback device 130 (e.g., headphone 130a or speaker 130b) to output audio signals.
  • the system includes at least one processor.
  • the at least one processor is programmed to perform various operations.
  • the at least one processor may be programmed to receive at least one image captured by the camera, the at least one image including a representation of written material.
  • the written material may include a word or a sequence of words (e.g., a phrase, a sentence, a paragraph, or a series of multiple paragraphs).
  • the written material may be in a book, a newspaper, a sign, a brochure, a handwritten page, an electronic display, or the like.
  • the at least one processor may be programmed to analyze the at least one image to recognize text.
  • the at least one processor may include processor 320 and may perform optical character recognition (OCR) to recognize one or more characters or words in the at least one image.
  • the at least one processor may be programmed to identify at least one printed word within the text, which may be of interest for generating a question. Additionally or alternatively, the question may be a more general question based on the text or a part thereof, such as a paragraph or a sentence. In some embodiments, the at least one printed word is a name or an identifier of a person, an object, a location, or a time indicator.
  • the at least one processor may be programmed to generate a question based on the at least one printed word.
  • the question may ask the user to speak one or more words based on the at least one printed word.
  • the question may ask the user to summarize or analyze the at least one printed word.
  • other types of questions may be generated, for which the answer is not within the text, to check other skills.
  • the question may ask the user to transform a sentence to a different tense, from first person to second person or third person, to rephrase the sentence in a higher or lower register, or the like.
  • the term “register” in this context relates to a level of formality or to the richness of the vocabulary of the speech. For example, a higher register indicates more formal speech, while a lower register indicates more casual speech. In some instances, a lower register may include colloquialisms, regional phrases or dialects, or slang.
  • the processor is further programmed to execute at least one machine learning model to generate the question.
  • the questions may be generated by an Artificial Intelligence (AI) engine, such as a Neural Network (NN), a Deep Neural Network (DNN), or the like.
  • the AI engine may be trained on a plurality of examples of text including a point of interest and a question that can be answered from the text.
  • Other machine learning models and algorithms may be used to generate the question. The choice of a specific machine learning model or algorithm does not affect the overall operation of the system described herein.
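As a stand-in for the trained question-generation model, a minimal rule-based sketch can blank out the printed word of interest so that the word itself becomes the expected answer. This is only an illustration, not the disclosed AI engine, and the example sentence is hypothetical.

```python
def make_fill_in_the_blank(sentence: str, answer_word: str) -> str:
    """Blank out the word of interest so the user can supply it as the answer."""
    blanked = sentence.replace(answer_word, "_____", 1)
    return f"Fill in the missing word: {blanked}"

# make_fill_in_the_blank("The dog chased the ball across the yard.", "ball")
# -> "Fill in the missing word: The dog chased the _____ across the yard."
```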
  • the at least one processor may be programmed to present the question to the user.
  • the processor is further programmed to present the question to the user by playing a sound representative of the question via the audio output device.
  • the question may be presented via a text-to-speech engine to generate the sounds representative of the question.
  • the processor is further programmed to present the question to the user by transmitting the question to a computing device associated with the user and causing the computing device to display the question on a display associated with the computing device.
  • the question may be transmitted to computing device 350 and displayed on display 360.
  • the question may be presented to the user as an audio question through an audio output device of computing device 350 or as a combined audiovisual question on computing device 350.
  • the computing device is one of a desktop computer, a laptop computer, a tablet, a smartphone, a smart watch, or smart glasses.
  • the at least one processor may be programmed to receive at least one indication from the user.
  • the indication may include the user directing the visual pointer to point to at least one word that includes an answer to the question. For example, the user may position the visual pointer over or around one or more words in the text to indicate an answer to the question by highlighting the one or more words in the text.
  • the indication may be generated by the user pressing one or more of power button 216, volume buttons 218, or trigger button 224 one or more times, simultaneously, or in a predetermined sequence after the at least one word is highlighted.
  • the at least one processor may be configured to perform optical character recognition (OCR) to identify the at least one word.
  • the indication may include an audio signal received from the user that includes an answer to the question spoken by the user.
  • the audio signal may be captured by the microphone.
  • the processor is further programmed to receive an audio signal from the user. For example, if the user speaks the answer to the question, the audio signal may be captured by the microphone.
  • the processor is further programmed to analyze the audio signal to extract an answer to the question. For example, the processor may perform a speech recognition algorithm to identify at least one word in the answer.
  • the at least one processor may be programmed to identify at least one word in the indication.
  • the at least one processor may be configured to identify the at least one word said in the indication using various speech recognition algorithms.
  • the at least one processor may be configured to execute one or more sound recognition modules (e.g., in voice recognition component 620) to identify one or more words in the audio signal.
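A minimal sketch of transcribing the user's spoken answer from recorded audio. The speech_recognition package and its Google Web Speech backend are illustrative choices for this sketch; any speech-to-text engine could be substituted.

```python
import speech_recognition as sr

def transcribe_answer(wav_path: str) -> str:
    """Transcribe a recorded answer; return an empty string if no speech is understood."""
    recognizer = sr.Recognizer()
    with sr.AudioFile(wav_path) as source:
        audio = recognizer.record(source)
    try:
        return recognizer.recognize_google(audio)
    except sr.UnknownValueError:
        return ""  # no intelligible speech detected
```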
  • the at least one processor may be programmed to compare the at least one word to the at least one printed word.
  • the at least one word identified in the indication may be compared to the at least one printed word using any comparison technique, such as a string comparison or other technique.
  • the at least one processor may be programmed to provide a positive indication when the at least one word matches the at least one printed word.
  • providing the positive indication includes generating a responsive audio signal and playing the responsive audio signal using the audio output device.
  • the responsive audio signal may include a “beep” or another (i.e., “happy”) sound to provide positive reinforcement to the user.
  • the positive indication may include a visual signal, such as one or more lights displayed on apparatus 110 (e.g., a certain color light or a predetermined sequence of lights) or a visual indication on display 360 of computing device 350.
  • a “match” as used herein may include a literal match between the at least one word identified in the indication and the at least one printed word or may include a syntactical match between the at least one word identified in the indication and the at least one printed word (e.g., the at least one word identified in the indication has the same meaning or a similar meaning as the at least one printed word).
  • the processor is further programmed to provide a positive indication when the at least one word is a synonym for the at least one printed word or when the at least one word partially matches the at least one printed word.
  • the at least one processor may be programmed to provide a negative indication when the at least one word does not match the at least one printed word.
  • the negative indication may include a sound played through the audio output device.
  • the sound may include a “buzz” or another (i.e., “unhappy”) sound to indicate that the at least one word identified in the indication did not match the at least one printed word.
  • the negative indication may include a visual signal, such as one or more lights displayed on apparatus 110 (e.g., a certain color light or a predetermined sequence of lights) or a visual indication on display 360 of computing device 350.
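A minimal sketch of the positive/negative decision described above, allowing literal, synonym, and partial matches. The small synonym table is a hypothetical stand-in for a real lexical resource, and the partial-match rule is an assumption of the sketch.

```python
# Hypothetical synonym table; in practice a lexical database would be used.
SYNONYMS = {
    "big": {"large", "huge"},
    "quick": {"fast", "rapid"},
}

def answer_matches(answer: str, printed_word: str) -> bool:
    a, p = answer.lower().strip(), printed_word.lower().strip()
    if a == p:
        return True                               # literal match
    if a in SYNONYMS.get(p, set()) or p in SYNONYMS.get(a, set()):
        return True                               # synonym match
    return len(a) >= 3 and (a in p or p in a)     # partial match

def indicate(answer: str, printed_word: str) -> str:
    """Return which indication to provide to the user."""
    return "positive" if answer_matches(answer, printed_word) else "negative"
```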
  • the processor is further programmed to transmit the positive indication or the negative indication to a computing device associated with the user and cause the computing device to display the positive indication or the negative indication on a display of the computing device.
  • a computing device associated with the user For example, display 360 of computing device 350 may display the positive indication or the negative indication.
  • the positive indication or the negative indication may include a text message, a graphical image, or a combination of text and graphics.
  • the processor is further programmed to transmit the positive indication or the negative indication to a computing device associated with the user and cause the computing device to output the positive indication or the negative indication using a speaker of the computing device.
  • the positive indication may include a “beep” or another (i.e., “happy”) sound to provide positive reinforcement to the user.
  • the negative indication may include a “buzz” or another (i.e., “unhappy”) sound.
  • an audio output device of computing device 350 may play the sound.
  • the positive indication or the negative indication may also be presented to a third party (i.e., a person not currently using the device, such as a parent or a teacher).
  • the positive indication or the negative indication may also be stored.
  • the indication may be stored in memory 330 of apparatus 110, in memory 522 of computing device 350, or in another external location (e.g., a cloud-based storage).
  • the at least one processor may be further configured to provide an explanation to the user of why the provided answer is incorrect.
  • the explanation may include an explanation of the correct answer to the question.
  • the explanation may be provided to the user as a spoken explanation played through the audio output device or as a visual explanation on display 360 of computing device 350.
  • the explanation may be generated by a machine learning model trained on the text, the user’s response, and the correct explanation.
  • the explanation may be generated by using a knowledge graph, indicating where in the graph the user took the wrong path. The explanation may be provided to the user or to the third party in a similar manner as the positive indication or the negative indication.
  • the processor is further programmed to determine an amount of time elapsed after presenting the question to the user, during which no indication from the user is detected and play a sound to prompt the user to answer the question when the amount of time is greater than or equal to a predetermined threshold time. For example, for a predetermined period of time, a microphone (e.g., microphone 230 or microphone 340 of apparatus 110) may be opened and audio may be captured, assuming that the user will say the answer. If the user says nothing, the user may be prompted to respond by playing a sound, for example “the answer is,” “please answer,” or the like. As another example, for the predetermined period of time, the visual pointer may be activated (i.e., project light), assuming that the user will direct the visual pointer to the answer.
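  • A minimal sketch of the time-out logic above, assuming hypothetical listen() and play_sound() helpers and an assumed 10-second threshold (none of these names or values are taken from this disclosure):

        import time

        def await_answer(listen, play_sound, threshold_s=10.0):
            # listen() is a hypothetical non-blocking check returning captured speech or None;
            # play_sound() is a hypothetical audio-output helper.
            start = time.monotonic()
            prompted = False
            while True:
                answer = listen()
                if answer is not None:
                    return answer
                if not prompted and time.monotonic() - start >= threshold_s:
                    play_sound("please answer")  # prompt the user once the threshold has elapsed
                    prompted = True
                time.sleep(0.1)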
  • the processor is further programmed to compare the determined answer to the at least one printed word.
  • the comparison may be performed using any comparison technique.
  • the match does not need to be an exact match or a synonym match in order to be considered a match.
  • the comparison may generate a numerical value for an amount of matching between the determined answer and the point of interest.
  • a match may be considered to occur when the numerical value exceeds a predetermined threshold.
  • the predetermined threshold may be set to 75% or greater.
  • Other thresholds and numerical scales may be used.
  • the numerical scale may be real numbers between 0 and 1.
  • the processor is further configured to provide the positive indication when the determined answer matches the at least one printed word.
  • the positive indication may be provided as described herein.
  • the processor is further configured to provide the negative indication when the determined answer does not match the at least one printed word.
  • the negative indication may be provided as described herein.
  • the processor is further programmed to determine whether the question is at least one of wrong, repetitive, semantically incorrect, or offensive. Such questions may be recognized by another machine learning model trained accordingly, for example, on examples of questions and labels indicating for each question whether it should be kept or discarded. In some embodiments, the processor is further programmed to discard the question when the question is one of wrong, repetitive, semantically incorrect, or offensive. For example, if the machine learning model indicates that the question falls into one of these categories, then the question may be discarded.
  • the question is a first question
  • the processor is further programmed to determine a first difficulty level associated with the first question.
  • the system may be configured to present multiple questions to the user relating to the recognized text in the written material.
  • the number of questions presented to the user by the system may be configured by the user (e.g., via a user setting) or by another person associated with the user (e.g., a parent or a teacher via a “supervisory” level setting).
  • the difficulty level associated with the question may be based on the words used in the question, the context (i.e., meaning) of the question, the words in the recognized text, or other metric.
  • the difficulty level may relate to the text readability level, and may include metrics such as word length, sentence length, and word frequency.
  • the text readability level may be determined using a known method, such as Flesch-Kincaid grade level, Dale-Chall score, Lexile Framework, or other method.
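  • For illustration, the Flesch-Kincaid grade level mentioned above can be computed from word, sentence, and syllable counts; the sketch below uses a deliberately naive vowel-group syllable counter and is only an approximation of the published formula:

        import re

        def count_syllables(word):
            # Rough heuristic: count runs of vowels; production systems use a pronunciation dictionary.
            return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

        def flesch_kincaid_grade(text):
            sentences = max(1, len(re.findall(r"[.!?]+", text)))
            words = re.findall(r"[A-Za-z']+", text)
            n_words = max(1, len(words))
            syllables = sum(count_syllables(w) for w in words)
            # Flesch-Kincaid grade level formula.
            return 0.39 * (n_words / sentences) + 11.8 * (syllables / n_words) - 15.59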
  • the text readability level may be determined by or assisted by a machine learning algorithm, such as a natural language processing (NLP) algorithm.
  • the difficulty level may include a numerical value (e.g., 1 to 10, 1 to 100, or another range), a category (e.g., “easy,” “medium,” “difficult,” or other categorization), or a numerical value translated into a category.
  • the difficulty level may be stored with the question.
  • the processor is further programmed to generate a second question having a second difficulty level greater than the first difficulty level. If the user correctly answers the first question, the second question may be generated with a higher difficulty level than the first question.
  • the second question may be related to the first question, e.g., may be a “follow-up” type question, or may be unrelated to the first question but still related to the at least one printed word in the text.
  • the second difficulty level may be greater than the first difficulty level by a predetermined amount. For example, if the first difficulty level is “easy,” then the second difficulty level may be “medium.” As another example, if the first difficulty level is 40 on a numerical scale of 1 to 100, the second difficulty level may be 65.
  • the difference between the first difficulty level and the second difficulty level may be determined by a user setting. If using a numerical scale, the user setting may permit the user to adjust the numerical difference between the first difficulty level and the second difficulty level, either in absolute terms (e.g., a 15 point difference) or in percentage terms (e.g., a 10% difference).
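  • A minimal sketch of deriving the second difficulty level from the first under an absolute or percentage user setting (the 1-to-100 scale and the default step of 15 points are illustrative assumptions):

        def next_difficulty(current, step=15, mode="absolute", scale_max=100):
            # mode "absolute" adds a fixed number of points; mode "percent" adds a fraction of the scale.
            increase = scale_max * step / 100.0 if mode == "percent" else step
            return min(scale_max, current + increase)

        # Example: a first question at level 40 with a 25-point setting yields a second question at 65.
        print(next_difficulty(40, step=25))  # 65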
  • the processor is programmed to generate the second question to have the second difficulty level greater than the first difficulty level by generating the second question based on a word that does not explicitly appear in the recognized text.
  • the second question may relate to an interpretation of the recognized text or may use one or more synonyms to words that appear in the recognized text.
  • the question is a first question
  • the processor is further programmed to determine a first difficulty level associated with the first question.
  • the first difficulty level of the first question may be determined in a similar manner as described above.
  • the processor is further programmed to generate a second question having a second difficulty level lower than the first difficulty level. If the user incorrectly answers the first question, the second question may be generated with a lower difficulty level than the first question.
  • the second question may be related to the first question, e.g., may be a “follow-up” type question, a leading question, or may rephrase the first question.
  • the second difficulty level may be lower than the first difficulty level by a predetermined amount. For example, the second question may rephrase the first question by using words with a lower readability level.
  • the processor is further programmed to determine a difficulty level of the recognized text and generate the question based on the determined difficulty level.
  • the difficulty level of the recognized text may be determined in a similar manner as determining the difficulty level of the question as described above.
  • the generated question may have the same difficulty level.
  • the difficulty level of the question may vary slightly (e.g., by a predetermined amount) from the difficulty level of the recognized text. For example, if the difficulty level of the recognized text is based on a numerical scale and the difficulty level is 40, then the difficulty level of the generated question may be within a predetermined range around 40, e.g., in a range of 35-45.
  • the processor is further programmed to determine an age of the user. For example, the age of the user may be entered when the system is first used (or during any subsequent use) and the user’s age may be stored in memory 330 of apparatus 110. As another example, the system may ask the user their age and the user may respond by speaking their age. As another example, the system may prompt the user to enter their age via computing device 350.
  • the processor is further programmed to determine a difficulty level of the recognized text.
  • the difficulty level of the recognized text may be determined in a similar manner as determining the difficulty level of the question as described above.
  • the processor is further programmed to generate the question based on the age and the determined difficulty level.
  • Some methods of determining the difficulty level of the recognized text may produce a result that corresponds to a grade level (e.g., a reading grade level, such as 4th grade). For example, if the difficulty level of the recognized text indicates that the recognized text is at a 4th grade reading level and the user is in an age range typically associated with students in the 4th grade, then the generated question may also be at the 4th grade level.
  • the generated question may be adjusted to be at a difficulty level more closely associated with the user’s age (e.g., a 3rd grade level).
  • the generated question may be adjusted to be at a difficulty level more closely associated with the user’s age (e.g., a 5th grade level).
  • Some embodiments include a method for processing audio and image signals.
  • the method may be performed by one or more processors in a device, such as apparatus 110.
  • a non-transitory computer-readable medium may be provided and include instructions which when executed by at least one processor perform the method.
  • the method includes receiving at least one image captured by a camera, the at least one image including a representation of written material.
  • the camera may include camera 210 of apparatus 110.
  • the written material may include a word or a sequence of words (e.g., a phrase, a sentence, a paragraph, or a series of multiple paragraphs).
  • the written material may be in a book, a newspaper, a sign, a brochure, a handwritten page, an electronic display, or the like.
  • the method includes analyzing the at least one image to recognize text.
  • apparatus 110 may include processor 320 which is programmed to perform optical character recognition (OCR) to recognize one or more characters or words in the at least one image.
  • the method includes identifying at least one printed word within the text, which may be of interest for generating a question.
  • the question may be a more general question based on the text or a part thereof, such as a paragraph or a sentence.
  • the at least one word may be a name, a nickname, or another description of a person, an object, a location, a time or duration indicator, or another entity.
  • the method includes generating a question based on the at least one printed word or a more general question.
  • the question may ask the user to speak one or more words based on the at least one printed word.
  • the question may ask the user to summarize or analyze the at least one printed word.
  • other types of questions may be generated, for which the answer is not within the text, to check other skills.
  • the question may ask the user to transform a sentence to a different tense, from first person to second person or third person, to rephrase the sentence in a higher or lower register, or the like.
  • the term “register” in this context relates to a level of formality or to the richness of the vocabulary of the speech. For example, a higher register indicates more formal speech, while a lower register indicates more casual speech. In some instances, a lower register may include colloquialisms, regional phrases or dialects, or slang.
  • the method includes presenting the question to a user.
  • the question may be presented to the user via a text-to-speech engine to generate the sounds representative of the question and played through an audio output device of apparatus 110, e.g., headphones 130a or speaker 130b.
  • the question may be presented to the user by transmitting the question to a computing device associated with the user and causing the computing device to display the question on a display associated with the computing device.
  • the question may be transmitted to computing device 350 and displayed on display 360.
  • the method includes receiving at least one indication from the user.
  • the indication may include an audio signal received from the user that includes the user speaking an answer to the question which may be captured by microphone 230 or microphone 340 of apparatus 110.
  • the indication may include the user directing the visual pointer to point to at least one word that includes an answer to the question.
  • the method includes identifying at least one word in the indication.
  • processor 320 in apparatus 110 may be configured to transcribe the audio signal to extract the at least one word spoken by the user. The transcription may be performed using a natural language processing (NLP) algorithm or other machine learning technique.
  • processor 320 in apparatus 110 may be configured to perform optical character recognition (OCR) or other processing on the text to identify the at least one word.
  • the method includes comparing the at least one word to the at least one printed word.
  • the at least one word may be compared to the at least one printed word using any comparison technique, such as a string comparison or other technique.
  • the method includes providing a positive indication to the user when the at least one word matches the at least one printed word.
  • the positive indication may include generating a responsive audio signal and playing the responsive audio signal using the audio output device of apparatus 110.
  • the responsive audio signal may include a “beep” or another (i.e., “happy”) sound to provide positive reinforcement to the user.
  • the positive indication may include a visual signal, such as one or more lights displayed on apparatus 110 (e.g., a certain color light or a predetermined sequence of lights) or a visual indication on display 360 of computing device 350.
  • the “match” as used herein may include a literal match between the word and the point of interest or may include a syntactical match between the word and the point of interest (e.g., the word has the same meaning or a similar meaning as the point of interest).
  • the positive indication is provided when the word is a synonym for the point of interest or when the word partially matches the point of interest.
  • the method includes providing a negative indication to the user when the at least one word does not match the at least one printed word.
  • the negative indication may include a sound played through the audio output device. The sound may include a “buzz” or another (i.e., “unhappy”) sound to indicate that the word did not match the point of interest.
  • the negative indication may include a visual signal, such as one or more lights displayed on apparatus 110 (e.g., a certain color light or a predetermined sequence of lights) or a visual indication on display 360 of computing device 350.
  • Figs. 10A-10E are schematic illustrations of an example use case of the apparatus scanning text and presenting a question to a user and receiving the user’s response consistent with the present disclosure.
  • User 100 may use apparatus 110 to capture an image with a camera, e.g., camera 210 (Fig. 10A).
  • apparatus 110 may display one or more targeting lasers 214 and corners 412, 414, 416, and 418 onto an object 120 which helps to position the camera over an area of interest in text 410 (Fig. 10B).
  • Apparatus 110 includes a processor configured to receive the image captured by the camera and to analyze the image to recognize text 410. Based on the recognized text, apparatus 110 may generate a question to be presented to user 100. In some embodiments, the question may be presented to user 100 via display 360 of computing device 350 (Fig. 10C). As shown in Fig. 10C, the question may be presented in text form (e.g., “What is the value of Pi?”). In other embodiments, the question may be presented to the user as an audio question and output through an audio output device of apparatus 110 (e.g., headphones 130a or speaker 130b).
  • user 100 may speak the answer to the question or may use targeting lasers 214 of apparatus 110 to point to the answer to the question in object 120 (Fig. 10D).
  • the processor in apparatus 110 is configured to identify at least one word in the answer, compare the at least one word in the answer with at least one word recognized in the text, and provide an indication to the user whether the user provided the correct answer (Fig. 10E). For example, as shown in Fig. 10E, if user 100 answers the question correctly (in the example shown, user 100 says “3.14” or points to the text “3.14”), then a positive indication may be provided on display 360 of computing device 350.
  • a sound indicating the positive indication may be played for user 100 via the audio output device of apparatus 110. If user 100 does not provide the correct answer to the question, then a negative indication may be provided to user 100, either on display 360 of computing device 350 or by playing a sound via the audio output device of apparatus 110.
  • Fig. 11 is a flowchart of a method 1100 for scanning text and presenting a question to a user and receiving the user’s response consistent with the present disclosure.
  • the steps of method 1100 may be performed by a processor in apparatus 110.
  • method 1100 may include receiving an image captured by a camera (e.g., camera 210), the image including a representation of written material.
  • the written material may include a word or a sequence of words (e.g., a phrase, a sentence, a paragraph, or a series of multiple paragraphs).
  • the written material may be in a book, a newspaper, a sign, a brochure, a handwritten page, an electronic display, or the like.
  • method 1100 may include analyzing the image to recognize text.
  • For example, optical character recognition (OCR) may be performed on the image to recognize one or more characters or words in the image.
  • method 1100 may include identifying at least one printed word in the text, which may be of interest for generating a question. Additionally or alternatively, the question may be a more general question based on the text or a part thereof, such as a paragraph or a sentence. For example, the at least one printed word may be a name or an identifier of a person, an object, a location, or a time indicator.
  • method 1100 may include generating a question based on the at least one printed word.
  • the question may ask the user to speak one or more words based on the at least one printed word.
  • the question may ask the user to summarize or analyze the at least one printed word.
  • other types of questions may be generated, for which the answer is not within the text, to check other skills.
  • method 1100 may include presenting the question to the user.
  • the question may be presented to the user as an audio question (e.g., sounds played via an audio output device of apparatus 110, such as headphones 130a or speaker 130b) or as a text question (e.g., via display 360 of computing device 350).
  • the question may be presented via a text-to-speech engine to generate the sounds representative of the question.
  • method 1100 may include receiving an indication of an answer to the question from the user.
  • the user may speak a response to the question which is received by apparatus 110 (e.g., via microphone 230) as the indication.
  • the user may direct apparatus 110 to point to the answer to the question in the written material.
  • the user may aim the targeting lasers 214 of apparatus 110 to highlight one or more words in the written material to indicate the answer to the question.
  • method 1100 may include identifying at least one word in the indication provided by the user. If the user speaks the answer, the at least one word may be identified using speech recognition. If the user points to the answer using apparatus 110, the at least one word may be identified using optical character recognition.
  • method 1100 may include comparing the at least one word in the indication to the at least one printed word in the recognized text to determine whether the user provided a correct answer to the question.
  • the at least one word identified in the indication may be compared to the at least one printed word using any comparison technique, such as a string comparison or other technique.
  • method 1100 may include providing a positive indication to the user if the user provided a correct answer (i.e., the at least one word in the indication matches the at least one printed word).
  • the positive indication may include playing a sound via the audio output device or displaying text and/or graphics via display 360 of computing device 350.
  • method 1100 may include providing a negative indication to the user if the user did not provide a correct answer (i.e., the at least one word in the indication does not match the at least one printed word).
  • the negative indication may include playing a sound via the audio output device or displaying text and/or graphics via display 360 of computing device 350.
  • the questions may be tested before being presented to the user. Testing may include providing the text and the question to a machine learning model trained to find answers within a text. The answer provided by the machine learning model may be compared to the original answer as based on the printed word. If the answers are the same, or similar beyond a predetermined level according to some metrics, the question may be kept. In some embodiments, the comparison may generate a numerical value for an amount of matching between the machine learning model’s answer and the answer based on the point of interest. A match may be considered to occur (i.e., the answers are considered to be the same or considered to be equivalent) when the numerical value exceeds a predetermined threshold. For example, the predetermined threshold may be set to 75% or greater. Other thresholds and numerical scales may be used. For example, the numerical scale may be between 0 and 1.
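  • The question-testing step described above might be sketched as follows, assuming the Hugging Face transformers library serves as the machine learning model that finds answers within a text, and a simple string-similarity ratio stands in for the matching metric; both are illustrative choices rather than requirements of this disclosure:

        from difflib import SequenceMatcher
        from transformers import pipeline  # assumed available; loads a default question-answering model

        qa_model = pipeline("question-answering")

        def keep_question(question, recognized_text, expected_answer, threshold=0.75):
            # Ask the model to answer the candidate question from the recognized text.
            predicted = qa_model(question=question, context=recognized_text)["answer"]
            # Keep the question only if the model's answer matches the expected answer closely enough.
            similarity = SequenceMatcher(None, predicted.lower(), expected_answer.lower()).ratio()
            return similarity >= threshold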
  • a system includes a camera configured to capture images from an environment of a user.
  • the system may include apparatus 110 and the camera may include camera 210.
  • camera 210 may include image sensor 310, as described above.
  • the environment of the user may include an area near the user and within a visual range of the camera, such as within a field of view of a lens of the camera.
  • the system includes a microphone configured to capture sounds from the environment of the user.
  • the microphone may include microphone 230 or microphone 340 and may be directional, unidirectional, a microphone array, or the like.
  • the environment of the user may include an area near the user and within an audio range of the microphone.
  • the system includes an audio output device for outputting audio signals.
  • the audio output device may include external audio feedback device 130 (e.g., headphone 130a or speaker 130b) to output audio signals.
  • the system includes at least one processor programmed to perform various operations.
  • the at least one processor may be programmed to receive at least one image captured by the camera, the at least one image including a representation of written material.
  • the written material may include a word or a sequence of words (e.g., a phrase, a sentence, a paragraph, or a series of multiple paragraphs).
  • the written material may be in a book, a newspaper, a sign, a brochure, a handwritten page, an electronic display, or the like.
  • the at least one processor is programmed to analyze the at least one image to recognize text.
  • the at least one processor may include processor 320 and may perform optical character recognition (OCR) to recognize one or more characters or words in the at least one image.
  • the at least one processor is programmed to receive at least one audio signal generated by the microphone, the at least one audio signal representing speech by the user.
  • apparatus 110 may include microphone 340 which is in communication with processor 320.
  • the at least one processor is programmed to identify at least one word said in the audio signal.
  • the processor may be configured to transcribe the audio to extract the at least one word spoken by the user.
  • the transcription may be performed using a natural language processing (NLP) algorithm or other machine learning technique.
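  • One widely available way to perform the transcription step is sketched below; it assumes the captured audio has been saved as a WAV file and that the third-party speech_recognition package (with its Google Web Speech backend) is acceptable, although an on-device model could be substituted:

        import speech_recognition as sr

        def words_in_audio(wav_path):
            recognizer = sr.Recognizer()
            with sr.AudioFile(wav_path) as source:
                audio = recognizer.record(source)      # read the entire file
            text = recognizer.recognize_google(audio)  # transcribe via the recognizer backend
            return text.split()                        # the word(s) said in the audio signal

        # words = words_in_audio("user_speech.wav")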
  • the at least one processor is programmed to analyze the at least one word to determine a question asked by the user.
  • the at least one processor is programmed to search for an answer to the question using the question and the text recognized from the at least one image.
  • the processor is further programmed to search for the answer using an Internet search engine.
  • the processor may send a query to an Internet search engine to search for the answer.
  • processor 320 may search database 650 for the answer.
  • processor 320 may first search database 650 for the answer and if database 650 does not contain the answer, then processor 320 may send the query to the Internet search engine.
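  • The database-first, Internet-second search order described above reduces to a simple fallback, sketched here with hypothetical lookup callables supplied by the caller (neither name comes from this disclosure):

        def find_answer(question, search_database, search_internet):
            # search_database and search_internet are hypothetical callables returning a string or None.
            answer = search_database(question)      # try the local database (e.g., database 650) first
            if answer is None:
                answer = search_internet(question)  # fall back to an Internet search engine query
            return answer                           # may still be None if neither source has an answer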
  • the processor is further programmed to execute a machine learning model to determine the answer based on the text recognized from the at least one image. For example, the recognized text, words identified in the user’s speech, and possibly additional words, may be provided to the machine learning model trained to find answers to questions within the text.
  • the answer(s) may be generated in a plurality of manners, which may depend upon the specific machine learning model used.
  • the machine learning model may be trained on the recognized text and configured to generate the answer based on the recognized text.
  • the machine learning model used to determine the answer to the question may be a same type of machine learning model used to generate a question as described above but trained on different data (and thus would be a different model).
  • the machine learning model used to determine the answer to the question may use a different algorithm or may be based on a different machine learning model type (e.g., may use supervised learning, unsupervised learning, semi-supervised learning, or reinforcement learning) than the machine learning model used to generate the question.
  • the processor may be further programmed to determine whether the answer determined by the machine learning model is at least one of wrong, repetitive, semantically incorrect, or offensive. Such answers may be recognized by another machine learning model trained accordingly, for example, on examples of answers and labels indicating for each answer whether it should be kept or discarded. In some embodiments, the processor may be further programmed to discard the answer determined by the machine learning model when the answer is one of wrong, repetitive, semantically incorrect, or offensive. For example, if the machine learning model indicates that the answer falls into one of these categories, then the answer may be discarded.
  • this information may be provided as feedback to the machine learning model to improve future results.
  • Another answer may be determined using the same machine learning model or algorithm. This process may repeat until a match is determined to have occurred.
  • revising the determined answer may include rephrasing the answer (e.g., selecting a synonym for one or more words in the answer, reordering the words in the answer, or discarding one or more words in the answer).
  • the processor is further programmed to search for the answer in an auto-regressive manner.
  • Generating the answer in an auto-regressive manner includes generating the answer one word at a time.
  • the answer may be returned as one or more segments of the recognized text, each segment including an answer on its own or in combination with other segments.
  • Each segment may be indicated by a beginning and an end.
  • each segment may include a beginning marker (e.g., a flag or other indicator) and an ending marker (e.g., a second flag or other indicator).
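  • The beginning and ending markers for answer segments can be represented simply as character offsets into the recognized text, as in this minimal sketch:

        def extract_segments(recognized_text, segments):
            # Each segment is a (begin, end) pair of character offsets marking part of the answer.
            return [recognized_text[begin:end] for begin, end in segments]

        text = "Marie Curie won the Nobel Prize in Physics in 1903."
        print(extract_segments(text, [(0, 11), (20, 42)]))
        # ['Marie Curie', 'Nobel Prize in Physics']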
  • the at least one processor is programmed to output the answer or an indication that there is no answer.
  • the answer may be provided to the user as spoken text, as written text, highlighted or otherwise indicated as part of the displayed text, stored within a file, or the like.
  • the processor may be further programmed to output the answer by generating an output audio signal corresponding to the answer and playing the output audio signal using the audio output device.
  • the answer may be generated by a text-to-speech algorithm and played via the audio output device.
  • the processor is further programmed to output the answer by transmitting the answer to a computing device associated with the user and causing the computing device to display the answer on a display of the computing device.
  • the answer may be transmitted to computing device 350 and displayed on display 360.
  • the answer may be presented to the user as an audio answer through an audio output device of computing device 350 or as a combined audiovisual answer on computing device 350.
  • computing device 350 may include any one or more of: a desktop computer, a laptop computer, a tablet, a smartphone, a smart watch, or smart glasses.
  • no answer may be retrieved for the question, either because no answer is available or due to a failure of the machine learning model.
  • the indication to the user may include a suggestion to the user to search for the answer in alternative sources, such as Internet search. If the user agrees, such a search may be conducted and if an answer is retrieved, it may be provided to the user in a similar manner as described above.
  • the questions that may be searched on the Internet may be filtered, such that questions to which a relevant answer is not expected to be found online (for example, if the question is highly context-dependent) are eliminated. For example, while it may make sense to look online for an answer to the question “which prize did Marie Curie win?” this is not the case for a question such as “where did the student go?” because the latter question is context-dependent.
  • the answer includes a plurality of answers.
  • the processor is further programmed to determine confidence levels associated with each of the plurality of answers and outputting the answer includes outputting one or more of the plurality of answers determined to have a confidence level greater than or equal to a confidence threshold. If there are multiple answers to the question (either generated by the machine learning model or found via an Internet search), then a confidence level associated with each answer may be determined.
  • the confidence level may include a numerical value, with higher values indicating a higher confidence level (e.g., a scale of 1 to 100, although other scales are possible).
  • the confidence level may indicate how closely the answer matches the question (i.e., how likely it is that the answer is a correct answer to the question).
  • a confidence threshold may be set (either predetermined or user-adjustable) and any answers with a confidence level below the confidence threshold may be discarded. Answers with a confidence level equal to or greater than the confidence threshold may be output to the user in a similar manner as described above.
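  • A minimal sketch of the confidence-threshold filtering described above (the 1-to-100 scale and the cutoff of 75 are illustrative assumptions):

        def filter_answers(scored_answers, confidence_threshold=75):
            # scored_answers: list of (answer_text, confidence) pairs on an assumed 1-to-100 scale.
            kept = [(a, c) for a, c in scored_answers if c >= confidence_threshold]
            # Output the surviving answers, highest confidence first; the rest are discarded.
            return [a for a, c in sorted(kept, key=lambda pair: pair[1], reverse=True)]

        print(filter_answers([("3.14", 92), ("pi is irrational", 60), ("3.1415", 80)]))
        # ['3.14', '3.1415']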
  • the processor is further programmed to determine a plurality of response options associated with the question. For example, there may be multiple answers to the question (either generated by the machine learning model or found via an Internet search). As another example, there may be multiple ways to understand the user’s question (e.g., based on variations in word choice in the question), which may result in different answers.
  • the processor may be further programmed to generate a responsive audio signal associated with at least one of the plurality of response options and play the responsive audio signal using the audio output device.
  • apparatus 110 may output spoken versions of each of the response options.
  • the processor is further programmed to receive at least one other audio signal generated by the microphone, the at least one other audio signal representing speech including a response by the user. For example, the user may be prompted whether she meant option A or option B and apparatus 110 may provide an audio output for each of the options. Apparatus 110 may then capture the user’s response.
  • the processor may be further programmed to determine the question based on the response. For example, if the user selects option A as the correct question, then the processor may search for the answer to the question of option A.
  • the user may provide a selection of the option by using one or more buttons on apparatus 110, such as trigger button 224. For example, the user may press trigger button 224 once for option A and twice for option B.
  • Some embodiments include a method for processing audio and image signals.
  • the method may be performed by one or more processors in a device, such as apparatus 110.
  • a non-transitory computer-readable medium may be provided and include instructions which when executed by at least one processor perform the method.
  • the method includes receiving at least one image captured by a camera, the at least one image including a representation of written material.
  • the camera may include camera 210 of apparatus 110.
  • the written material may include a word or a sequence of words (e.g., a phrase, a sentence, a paragraph, or a series of multiple paragraphs).
  • the written material may be in a book, a newspaper, a sign, a brochure, a handwritten page, an electronic display, or the like.
  • the method includes analyzing the at least one image to recognize text.
  • apparatus 110 may include processor 320 which is programmed to perform optical character recognition (OCR) to recognize one or more characters or words in the at least one image.
  • the method includes receiving at least one audio signal generated by a microphone, the at least one audio signal representing speech by a user.
  • apparatus 110 may include microphone 340 which is in communication with processor 320.
  • the method includes generating a transcription of the at least one audio signal.
  • processor 320 in apparatus 110 may be configured to transcribe the audio signal to extract one or more words spoken by the user.
  • the transcription may be performed using a natural language processing (NLP) algorithm or other machine learning technique.
  • the method includes analyzing the transcription to determine a question asked by the user.
  • the NLP algorithm or other machine learning algorithm used to generate the transcription may be used to analyze the transcription to determine the question.
  • the question may be preceded by a keyword or a key phrase, such as “question” or “ask question.”
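  • Detecting the “question” or “ask question” key phrase in the transcription might look like the following sketch (the key-phrase list is drawn from the examples above; everything else is an illustrative assumption):

        KEY_PHRASES = ("ask question", "question")  # longest first so the longer phrase wins

        def extract_question(transcription):
            lowered = transcription.strip().lower()
            for phrase in KEY_PHRASES:
                if lowered.startswith(phrase):
                    # Return whatever follows the key phrase as the user's question.
                    return transcription.strip()[len(phrase):].lstrip(" ,:")
            return None  # no key phrase detected

        print(extract_question("Question, which prize did Marie Curie win?"))
        # 'which prize did Marie Curie win?'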
  • the method includes searching for an answer to the question using the question and the text recognized from the at least one image.
  • processor 320 of apparatus 110 may be programmed to search for the answer using an Internet search engine.
  • processor 320 may search database 650 for the answer.
  • processor 320 may first search database 650 for the answer and if database 650 does not contain the answer, then processor 320 may send the query to the Internet search engine.
  • the method includes outputting the answer.
  • the answer may be output to the user as spoken text, as written text, stored within a file, or the like.
  • processor 320 of apparatus 110 may be programmed to output the answer by generating an output audio signal corresponding to the answer and playing the output audio signal using the audio output device of apparatus 110.
  • the answer may be generated by a text-to-speech algorithm and played via the audio output device.
  • the answer may be output by transmitting the answer to a computing device associated with the user and causing the computing device to display the answer on a display of the computing device.
  • the answer may be transmitted to computing device 350 and displayed on display 360.
  • Fig. 12 is a flowchart of a method 1200 for scanning text and responding to a question asked by a user, consistent with the present disclosure.
  • the steps of method 1200 may be performed by a processor in apparatus 110.
  • method 1200 may include receiving an image captured by a camera (e.g., camera 210), the image including a representation of written material.
  • the written material may include a word or a sequence of words (e.g., a phrase, a sentence, a paragraph, or a series of multiple paragraphs).
  • the written material may be in a book, a newspaper, a sign, a brochure, a handwritten page, an electronic display, or the like.
  • method 1200 may include analyzing the image to recognize text. For example, optical character recognition (OCR) may be performed on the image to recognize one or more characters or words in the image.
  • method 1200 may include receiving an audio signal of speech by the user, the user asking a question.
  • apparatus 110 may include microphone 340 to receive the audio signal.
  • the question may be preceded by a keyword or a key phrase, such as “question” or “ask question.”
  • method 1200 may include identifying at least one word in the audio signal.
  • the at least one word may be identified by a speech recognition algorithm.
  • the processor may be configured to transcribe the audio to extract the at least one word spoken by the user. The transcription may be performed using a natural language processing (NLP) algorithm or other machine learning technique.
  • method 1200 may include analyzing the at least one word identified in the audio signal to determine a question asked by the user.
  • step 1210 may be combined with step 1208 and the NLP algorithm may be used to determine the question asked by the user.
  • method 1200 may include searching for an answer to the question.
  • apparatus 110 may be configured to search for an answer to the question.
  • apparatus 110 may activate an artificial intelligence (AI) engine to receive as input the text and the question, and provide the answer. Additionally or alternatively, apparatus 110 may analyze the text and search database 650 or may perform an Internet search to search for the answer to the question.
  • method 1200 may include outputting the answer to the user. The answer may be output to the user as an audio signal via an audio output device of apparatus 110 (e.g., headphones 130a or speaker 130b) or as a text and/or graphical message via display 360 of computing device 350. If an answer to the question cannot be found, then an indication that the answer cannot be found may be provided to the user in a similar manner as if the answer was found.
  • Disclosed embodiments may provide devices, systems, and/or methods that may help users of all ages who may have difficulty in reading, learning, or comprehension.
  • disclosed embodiments may provide devices, systems, and/or methods for automatically capturing and processing images of textual and/or written material, and generating audio signals corresponding to the textual and/or written material. When these audio signals are played by an audio output device, the audio output device may produce sounds that may sound as if the written material is being read out aloud.
  • disclosed embodiments may provide devices, systems, and/or methods for determining a material type (e.g., restaurant menu, multiple choice test, tabular data) of the textual and/or written material, and for generating sounds representing the textual and/or written material as if being read aloud by a reader according to the determined material type.
  • Disclosed embodiments may also provide devices, systems, and/or methods for adjusting the generated sounds representing the textual and/or written material based on a linguistic level of the user, or on another user setting (e.g., reading speed).
  • disclosed embodiments may provide devices, systems, and/or methods for translating text recognized in textual and/or written material from a first language into a second language, and for generating sounds representing the textual and/or written material in the second language.
  • an exemplary system 300 for reading text may include apparatus 110 held by user 100, one or more audio feedback devices 130, an optional secondary device 350, and/or a server 380 capable of communicating with apparatus 110 and/or with secondary communications device 350 via network 370.
  • Apparatus 110 may be configured to recognize textual and/or written material and generate audio signals representing the textual and/or written material.
  • Apparatus 110 of system 300 may also be configured to play the generated audio via, for example, audio feedback device 130 to read the recognized text to a user.
  • the system includes a camera configured to capture images from an environment of a user.
  • the disclosed system may include an apparatus that includes an imaging device (e.g., camera) capable of capturing one or more images of an environment surrounding the camera and/or the user.
  • the disclosed imaging device may also be configured to capture an image of an object that may have textual or written material displayed on a surface of the object.
  • apparatus 110 may include camera 210.
  • camera 210 of apparatus 110 may include image sensor 310 configured to capture one or more images of an environment 400 (see e.g., Fig. 4) of user 100.
  • Environment 400 may include one or more objects (e.g., object 120) that may display written material and/or may have written material printed on a surface of the one or more objects.
  • Image sensor 310 of camera 210 may be configured to capture one or more images of textual or written material displayed by, for example, object 120.
  • object 120 may display text 410, such as: “The text here is provided as an example of the type of subject matter the disclosed device and systems may be able to process” and/or additional textual material.
  • Camera 210 of apparatus 110 may be configured to capture one or more images of text 410 displayed by object 120.
  • the system includes a microphone configured to capture sounds from the environment of the user.
  • the disclosed system may include an apparatus that includes a microphone that may be capable of capturing sounds from the environment surrounding the user.
  • the microphone may be configured to capture one or more voice commands and/or other speech by a user of the apparatus.
  • system 300 may include apparatus 110 that may include one or more microphones 230.
  • the one or more microphones 230 may be configured to capture sounds in an environment of user 100, voice commands issued by user 100, and/or any other speech of user 100, and convert the sounds into audio signals for further processing by apparatus 110.
  • the system includes an audio output device for outputting audio signals.
  • the disclosed system may include an audio output device (or audio feedback device), such as a headphone, a speaker, and/or any other device capable of playing audio signals.
  • apparatus 110 may include audio output device 130.
  • audio output device 130 may include an over-the-ear audio headphone being worn by user 100, a built-in speaker, a stand-alone speaker, a portable speaker, a hearing aid device, a bone conduction headphone, a within-ear headphone (e.g., earbuds), a speaker associated with the secondary device, or any other device capable of playing audio signals.
  • Apparatus 110 may be connected to audio output device 130 via a wired connection. It is contemplated, however, that apparatus 110 may be connected to audio output device 130 via a wireless connection based, for example, on a Bluetooth™ protocol or any other wireless communication protocol that may allow apparatus 110 to transmit audio signals to audio feedback device 130. Audio output device 130 may be configured to play one or more audio signals by generating sounds corresponding to the one or more audio signals. User 100 may be able to hear the sounds generated by audio output device 130.
  • the system includes at least one processor.
  • the at least one processor may be understood to be processor as defined elsewhere in this disclosure.
  • the at least one processor may include one or more integrated circuits, microchips, microcontrollers, microprocessors, all or part of a central processing unit (CPU), graphics processing unit (GPU), digital signal processor (DSP), field programmable gate array (FPGA), or other circuits suitable for executing instructions or performing logic operations.
  • system 300 may include apparatus 110 that may include one or more processors 320.
  • the one or more processors 320 may control one or more components of apparatus 110 and/or of system 300.
  • the at least one processor may include one or more of processor 320 of apparatus 110, processor 520 of secondary device 350, and/or one or more processors associated with server 380.
  • processor 320 is described as performing one or more functions associated with the system for reading text. It is contemplated, however, that one or more of the processes or functions described herein as being performed by the at least one processor may be performed by any of or a combination of processor 320 of apparatus 110, processor 520 of secondary device 350, and/or one or more processors associated with server 380.
  • the at least one processor is programmed to receive at least one image captured by the camera, the at least one image including a representation of written material.
  • apparatus 110 may comprise one or more cameras, such as camera 210, which may capture images of environment 400 of user 100.
  • Processor 320 of apparatus 110 may receive the one or more images captured by, for example, camera 210.
  • camera 210 may generate one or more image files including digital representations of the one or more images captured by camera 210.
  • These digital representations may be stored by camera 210 in a storage device associated with camera 210 (e.g., memory 330) and/or in a storage device associated with secondary device 350 and/or server 380.
  • Processor 320 may access and read the digital representations of the images stored by camera 210 from memory 330, and/or one or more storage devices associated with secondary device 350 and/or server 380.
  • processor 320 may receive and process at least one image out of the plurality of images captured by, for example, camera 210 from environment 400 of user 100.
  • environment 400 of user 100 may include one or more objects that may have one or more surfaces displaying written material.
  • environment 400 may include one or more objects such as a newspaper, a book, a paper with written material, and/or any other object displaying textual and/or written material.
  • the written material may be displayed on a display device or screen of a secondary device 350 associated with the user 100.
  • secondary device 350 may be a smartphone, tablet, smartwatch, or a desktop or laptop computer, having a display screen capable of displaying textual and/or written material. It is also contemplated that the written material may be displayed on a standalone display device or screen, for example, an electronic advertising screen, a television, or any other device capable of displaying written material.
  • Camera 210 of apparatus 110 may capture an image that includes a representation of the textual and/or written material displayed on an object. For example, the image may include a picture or depiction of the textual and/or written material displayed on the object.
  • Processor 320 of apparatus 110 may receive the one or more images, including the representation of the textual and/or written material directly from camera 210 and/or from a memory 330.
  • the at least one processor is programmed to obtain a material type of the written material.
  • a material type may refer to a descriptor or characteristic of the textual content of the written material.
  • the written material may include one or more of a paragraph of text, a list of items for sale as in a menu, a list of songs, a to-do list, a table including one or more columns of text, a question paper or test, a catalog of books or materials, a phone book, or written material in any other format.
  • the material type for a paragraph of text may be “prose.”
  • the material type for a list of items for sale may be “menu” or “price list.”
  • the material type for data shown in the form of a table with rows and columns may be “tabular data.” It is to be understood that the disclosed system is not limited to the exemplary material types described above, and that other descriptors may be used to define the material type of written material based on the subject matter, layout, or other characteristics of the written material.
  • the at least one processor is programmed to analyze the at least one image to recognize text. In some embodiments, the at least one processor is further programmed to determine the material type by analyzing the recognized text.
  • the processor (e.g., 320 or 520) associated with the disclosed system may recognize one or more words, phrases, and/or sentences based on an analysis of the one or more images of the written material captured by the camera.
  • the processor may employ techniques such as optical character recognition (OCR) to recognize the one or more words, phrases and/or sentences in the written material.
  • OCR optical character recognition
  • Such techniques may additionally or alternatively include image recognition techniques such as pattern matching, pattern recognition, or image correlation to match portions or glyph-like shapes and features in the image with known characters of a language.
  • adaptive recognition techniques may be used to recognize the shapes of letters. It is also contemplated that, in some embodiments, the techniques may include use of trained machine learning models and/or neural networks that may recognize single words, phrases, or even lines of text instead of focusing on single characters.
  • the processor may identify the material type based on the recognized one or more words, phrases and/or sentences. In some embodiments, the processor may identify the material type based also on the layout of the written material, such as identifying tables, columns, rows, cells, or the like. In some embodiments, the material type is a restaurant menu.
  • the processor (e.g., processor 320, 520) associated with the disclosed system may determine that the written material presented in the image captured by the camera is a restaurant menu based on an analysis of the image. For example, the processor may recognize one or more words and/or symbols in the written material depicted in the image, using one or more techniques discussed above. Further, the disclosed system 300 may include one or more databases (e.g., 650).
  • the one or more databases 650 may store one or more words and/or symbols in association with identifiers or characteristics of the words and/or symbols. Such characteristics may indicate, for example, whether a word and/or symbol corresponds to a food or drink item or category, or to a currency symbol.
  • the processor may compare one or more words and/or symbols recognized in the image of the written material with words and/or symbols stored in the database. When a word and/or symbol recognized in the image of the written material matches a word and/or symbol stored in the database, the processor may read or retrieve identifiers or characteristics associated with the matching word and/or symbol from the database. The processor may determine the material type based on the extracted identifiers or characteristics.
  • processor 320 of apparatus 110 may recognize the word “menu” in the image of the written material.
  • the word “menu” may be stored in association with an identifier or characteristic labeled “restaurant menu” in database 650.
  • Processor 320 may compare the word “menu” recognized in the image of the written material with words stored in database 650. After determining a match, the processor may retrieve the identifier or characteristic “restaurant menu” associated with the matching word. The processor may then identify the material type of the written material as “restaurant menu.”
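  • The database lookup described above is essentially a keyword-to-characteristic mapping; the minimal in-memory sketch below illustrates the idea (the table contents are illustrative and are not the contents of database 650):

        # Illustrative stand-in for the word/characteristic associations stored in database 650.
        WORD_CHARACTERISTICS = {
            "menu": "restaurant menu",
            "salad": "restaurant menu",
            "entree": "restaurant menu",
            "dessert": "restaurant menu",
            "table": "tabular data",
            "column": "tabular data",
        }

        def material_type(recognized_words):
            for word in recognized_words:
                characteristic = WORD_CHARACTERISTICS.get(word.lower())
                if characteristic is not None:
                    return characteristic  # first matching word determines the material type
            return "prose"                 # assumed default when no keyword matches

        print(material_type(["MENU", "Salad", "Entree"]))  # 'restaurant menu'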
  • Fig. 13 illustrates written material 1300 depicted in an image captured by, for example, camera 210 of apparatus 110. As illustrated in Fig. 13, written material 1300 may include the word “MENU” as a title 1302.
  • Processor 320 of apparatus 110 may recognize the word “MENU” in title 1302 and by comparing the word “MENU” with words stored in database 650, processor 320 may determine the material type of the written material in the image as being a “restaurant menu.”
  • the processor may employ one or more image processing techniques discussed above to recognize that the written material is arranged in one or more sections. Further, the processor may determine a title or heading associated with the sections. The processor may determine identifiers or characteristics associated with the recognized headings by comparing the recognized headings to words stored in the database as described above. The processor may determine the material type based on the extracted identifiers or characteristics.
  • Fig. 13 illustrates written material 1300 depicted in an image captured by, for example, camera 210 of apparatus 110. As illustrated in Fig. 13, written material 1300 may include three sections with headings “Salad,” “Entree,” and “Dessert.”
  • Processor 320 of apparatus 110 may recognize one or more of these headings and by comparing the headings with words stored in database 650, processor 320 may determine that an identifier or characteristic associated with the headings “Salad,” “Entree,” and/or “Dessert” is “restaurant menu.”
  • Processor 320 may then determine the material type of the written material 1300 in the image as being a “restaurant menu” based on the identifier and/or characteristics retrieved from database 650.
  • the processor may identify a word or phrase recognized in the image of the written material as the name of a food or drink item (e.g., steak, burger, wine, cocktail, lemonade, etc.). For example, processor 320 may compare a word or phrase recognized in the image of the written material with words or phrases stored in database 650. Database 650 may, for example, store words representing food or drink items in association with the characteristic “food,” “drink,” “beverage,” etc. When the processor finds a word in database 650 that matches the word or phrase recognized in the image of the written material, processor 320 may retrieve the associated characteristic from database 650. By way of example, with reference to the food and drink items listed in written material 1300 of Fig. 13, processor 320 may compare one or more of these words to words stored in database 650 and may be able to identify these words as “food.” Further, the processor may identify a word recognized in the image of the written material as including a currency symbol followed by numerals.
  • the currency symbol identified by the processor may include one or more of a dollar ($) symbol, a pound (£) symbol, a euro (€) symbol, or any other symbol representing a currency, which may be configured for each specific device.
  • the processor may identify the word as representing a price.
  • processor 320 of apparatus 110 may recognize the words, for example, “$8” or “$6” in written material 1300 as prices.
  • the processor may determine the material type of the written material as a “restaurant menu.”
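A minimal sketch of the currency-symbol heuristic described above, assuming simple whole-token matching and a configurable symbol set; the pattern and helper name are illustrative, not the disclosed logic.

```python
import re

# Treat a token as a price when it is a currency symbol followed by numerals.
# The symbol set is assumed to be configurable per device, as described above.
CURRENCY_SYMBOLS = "$£€"
PRICE_PATTERN = re.compile(rf"^[{re.escape(CURRENCY_SYMBOLS)}]\d+(\.\d{{1,2}})?$")

def looks_like_price(token: str) -> bool:
    return PRICE_PATTERN.match(token) is not None

print(looks_like_price("$8"))      # True
print(looks_like_price("£12.50"))  # True
print(looks_like_price("Steak"))   # False
```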
  • the processor may use other types of text processing and/or analysis to determine whether the written material in the image is a restaurant menu.
  • the processor may use one or more trained machine learning models or neural networks. Examples of such models may include support vector machines, Fisher’s linear discriminant, nearest neighbor, k nearest neighbors, decision trees, random forests, and so forth.
  • a set of training examples may include one or more images of restaurant menus, one or more words, phrases, and/or symbols in the restaurant menu (e.g., salad, steak, $15), and designation of the written materials in the images as restaurant menus.
  • the training examples may also include negative examples of non-menu materials which are designated as non-menu.
  • the training examples may additionally or alternatively identify one or more words in the sample images as food or drink items and/or prices.
  • a machine learning model or neural network may be trained to identify written material as a restaurant menu based on these and/or other training examples. Further, the trained machine learning model or neural network may output the material type as “restaurant menu” when presented with an image of written material depicting a restaurant menu or when presented with text including one or more words and/or symbols (e.g., Steak or $8) that may appear in a restaurant menu.
  • a trained machine learning model or neural network for identifying restaurant menus from images may be a separate and distinct machine learning model or neural network or may be an integral part of one or more other machine learning models or neural networks discussed above.
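As one illustration of the machine-learning option listed above, the sketch below trains a simple text classifier (TF-IDF features with a linear support vector machine from scikit-learn) on toy positive and negative examples. The training data, labels, and pipeline choice are assumptions for illustration only, not the models actually used by the disclosed system.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy training set: two menu-like texts and two non-menu texts.
train_texts = [
    "MENU Salad hand tossed salad $8 Entree steak $15 Dessert ice cream $6",
    "wine cocktail lemonade burger fries appetizer entree dessert",
    "Chapter 1 It was the best of times it was the worst of times",
    "Quarterly report revenue increased by 4 percent over the prior year",
]
train_labels = ["restaurant menu", "restaurant menu", "non-menu", "non-menu"]

classifier = make_pipeline(TfidfVectorizer(lowercase=True), LinearSVC())
classifier.fit(train_texts, train_labels)

# Text recognized (e.g., by OCR) in a newly captured image:
recognized_text = "Salad $8 Steak $15 Dessert $6"
print(classifier.predict([recognized_text])[0])  # likely "restaurant menu"
```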
  • the material type is tabular data.
  • the processor (e.g., processor 320, 520) associated with the disclosed system may determine that the written material presented in the image captured by the camera is tabular data based on an analysis of the image. For example, as discussed above, the processor may recognize one or more words and/or symbols in the written material depicted in the image.
  • the disclosed system may include one or more databases (e.g., 650). In some embodiments, one or more databases 650 may store one or more words and/or symbols in association with identifiers or characteristics of the words and/or symbols. Such characteristics may indicate, for example, whether a word and/or symbol corresponds to tabular data.
  • the processor may access a database or data structure (e.g., database 650) that stores a plurality of words and/or symbols in association with identifiers or characteristics of the words and/or symbols. Further, the processor may compare one or more words and/or symbols recognized in an image of the written material with the stored plurality of words and/or symbols. When the word and/or symbol recognized in the image of the written material matches with one of the words and/or symbols stored in the database, the processor may extract or retrieve identifiers or characteristics associated with the matching word or symbol from the database. The processor may determine the material type based on the extracted identifiers or characteristics. For example, the processor may recognize the word “Table” or “column” in an image of written material.
  • Fig. 13 illustrates written material 1320 depicted in an image captured by, for example, camera 210 of apparatus 110. As illustrated in Fig. 13, written material 1320 may include the words “Table A” in title 1322. Processor 320 of apparatus 110 may recognize the word “Table” in title 1322 and by comparing the word “Table” with words stored in database 650, processor 320 may determine the material type of the written material in the image as being “tabular data.”
  • the processor may determine that the words and/or phrases in the written material are arranged in one or more columns. For example, the processor may determine that the written material includes a plurality of groups of words or phrases separated by white space and/or vertical lines delineating the one or more columns. Based on detecting such an arrangement or pattern, the processor may determine that a material type of the written material is “tabular data.”
  • Fig. 13 illustrates an image of written material 1320. As illustrated in Fig. 13, written material 1320 includes two groups 1324 and 1326 of words. Group 1324 is separated from group 1326 by a vertical line.
  • group 1324 includes the words “Quality,” “Excellent,” “Good,” and “Fair” positioned one below the other.
  • group 1326 includes the words (or numbers) 100, 97, 90, 87, 84, etc. positioned one below the other.
  • the word “Quality” in group 1324 is separated from the words (or numbers) 100, 97, 90, 87, 84 in group 1326 by a vertical line.
  • Processor 320 may identify this arrangement of the words in the written material and determine that the material type of written material 1320 is “tabular data.”
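A minimal sketch of the column-detection heuristic described above, assuming the recognized text is available as lines; the delimiters and thresholds are illustrative assumptions.

```python
import re

# A page is treated as tabular data when most of its text lines split into the
# same number of columns, with columns delimited by a vertical line or by runs
# of whitespace.
COLUMN_DELIMITER = re.compile(r"\s*\|\s*|\s{2,}")

def looks_like_table(lines, min_columns=2, min_fraction=0.6):
    column_counts = []
    for line in lines:
        cells = [c for c in COLUMN_DELIMITER.split(line.strip()) if c]
        if len(cells) >= min_columns:
            column_counts.append(len(cells))
    if not column_counts:
        return False
    most_common = max(set(column_counts), key=column_counts.count)
    consistent = sum(1 for n in column_counts if n == most_common)
    return consistent / len(lines) >= min_fraction

sample = [
    "Quality    | 100",
    "Excellent  | 97",
    "Good       | 90",
    "Fair       | 87",
]
print(looks_like_table(sample))  # True -> material type may be "tabular data"
```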
  • the processor may use other types of text processing and/or analysis to determine whether the written material in the image is tabular data.
  • the processor may use one or more trained machine learning models or neural networks.
  • a set of training examples may include one or more images of tabular data containing two or more columns, adjacent columns being separated by a plurality of spaces, or by symbols such as a vertical line or other characters (e.g., * or /), and an identification of the written material in the images as being tabular data.
  • a machine learning model or neural network may be trained to identify written material as tabular data based on these and/or other training examples, including negative examples not containing tables.
  • the trained machine learning model or neural network may output the material type as “tabular data” when presented with an image of written material including text arranged in a plurality of columns or rows.
  • a trained machine learning model or neural network for identifying tabular data from images may be a separate and distinct machine learning model or neural network or may be an integral part of one or more other machine learning models or neural networks discussed above.
  • the material type is a test.
  • the processor (e.g., processor 320, 520) may determine that the written material presented in the image captured by the camera is a test based on an analysis of the image. For example, as discussed above, the processor may recognize one or more words and/or symbols in the written material depicted in the image.
  • the disclosed system may include one or more databases (e.g., 650).
  • the one or more databases 650 may store one or more words and/or symbols in association with identifiers or characteristics of the words and/or symbols. Such characteristics may indicate, for example, whether a word and/or symbol corresponds to a test.
  • the processor may access a database or data structure (e.g., database 650) that stores a plurality of words and/or symbols in association with identifiers or characteristics of the words and/or symbols. Further, the processor may compare one or more words and/or symbols recognized in an image of the written material with the stored plurality of words and/or symbols. When the word and/or symbol recognized in the image of the written material matches with one of the words and/or symbols stored in the database, the processor may retrieve identifiers or characteristics associated with the matching word or symbol from the database. The processor may determine the material type based on the extracted identifiers or characteristics. For example, the processor may recognize the words “test,” “exam,” or “examination” in an image of the written material.
  • Fig. 13 illustrates written material 1320 depicted in an image captured by, for example, camera 210 of apparatus 110. As illustrated in Fig. 13, written material 1320 may include the word “Examination” in title 1322. Processor 320 of apparatus 110 may recognize the word “Examination” in title 1322 and by comparing the word “Examination” with words stored in database 650, processor 320 may determine the material type of the written material in the image as being “test.”
  • the processor may identify a set of words or phrases, or one or more sentences followed by a question mark (?).
  • the processor may identify the material type of the written material as “test,” based on identifying sentences ending in a question mark.
  • written material 1340 may include a sentence, “Who invented the telephone?” ending in a question mark.
  • Processor 320 of apparatus 110 may recognize the sentence 1344 as a question based on the question mark (?) in the sentence and in response may determine the material type of the written material in the image as being a “test.”
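A minimal sketch of the question-mark heuristic described above; the sentence splitter and the thresholds are illustrative assumptions, not the disclosed rule set.

```python
import re

# Count sentences ending in a question mark and flag the material as a "test"
# when questions dominate the recognized text.
SENTENCE_SPLIT = re.compile(r"(?<=[.!?])\s+")

def looks_like_test(text, min_questions=1, min_question_ratio=0.5):
    sentences = [s for s in SENTENCE_SPLIT.split(text.strip()) if s]
    if not sentences:
        return False
    questions = sum(1 for s in sentences if s.endswith("?"))
    return questions >= min_questions and questions / len(sentences) >= min_question_ratio

print(looks_like_test("Who invented the telephone? Name three US presidents?"))   # True
print(looks_like_test("Hand tossed salad with French or ranch dressing."))        # False
```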
  • the processor may use other types of text processing and/or analysis to determine whether the written material in the image is a test.
  • the processor may use one or more trained machine learning models or neural networks.
  • a set of training examples may include one or more images of written material, including a series of questions, each question ending in a question mark, optionally together with an identification of the written material in the images as a “test,” a phrase like “answer the following questions,” or “answer X out of the following questions,” or the like.
  • a machine learning model or neural network may be trained to identify written material as a “test” based on these and/or other training examples.
  • the trained machine learning model or neural network may output the material type as “test” when presented with an image of written material including a plurality of questions.
  • a trained machine learning model or neural network for identifying a test from images may be a separate and distinct machine learning model or neural network or may be an integral part of one or more other machine learning models or neural networks discussed above.
  • the material type is a multiple-choice test.
  • the processor (e.g., processor 320, 520) associated with the disclosed system may determine that the written material presented in the image captured by the camera is a multiple-choice test based on an analysis of the image. For example, as discussed above, the processor may recognize one or more words and/or symbols in the written material depicted in the image.
  • the disclosed system may include one or more databases (e.g., 650). In some embodiments, one or more databases 650 may store one or more words and/or symbols in association with identifiers or characteristics of the words and/or symbols. Such characteristics may indicate, for example, whether a word and/or symbol corresponds to a multiple-choice test.
  • the processor may access a database or data structure (e.g., database 650) that stores a plurality of words and/or symbols in association with identifiers or characteristics of the words and/or symbols. Further, the processor may compare one or more words and/or symbols recognized in an image of the written material with the stored plurality of words and/or symbols. When the word and/or symbol recognized in the image of the written material matches with one of the words and/or symbols stored in the database, the processor may retrieve identifiers or characteristics associated with the matching word or symbol from the database. The processor may determine the material type based on the extracted identifiers or characteristics.
  • the processor may recognize the words “multiple-choice test,” “multiple-choice exam,” “multiple-choice examination,” “please select the answer,” or “please select the most correct answer,” in an image of the written material.
  • the words “multiple-choice test,” “multiple-choice exam,” or “multiple-choice examination” or the like may be stored in database 650 in association with the characteristic “multiple-choice test.”
  • Processor 320 may compare the one or more words recognized in the image of the written material with words in database 650 and retrieve the characteristic “multiple-choice test” when the word recognized in the image of the written material matches, for example, the words “multiple-choice test,” “multiple-choice exam,” or “multiple-choice examination” in database 650.
  • Fig. 13 illustrates written material 1340 depicted in an image captured by, for example, camera 210 of apparatus 110.
  • written material 1340 may include the phrase “Multiple Choice Test” in title 1342.
  • Processor 320 of apparatus 110 may recognize the phrase “Multiple Choice Test” in title 1342 and, by comparing the phrase with words and phrases stored in database 650, processor 320 may determine the material type of the written material in the image as being a “multiple-choice test.”
  • the processor may identify a structure or pattern that includes a question followed by one or more lines of text.
  • the question may be a sentence ending in a question mark, whereas the other lines of text may include a plurality of words, phrases, or sentences separated by, for example, commas, periods, or semicolons, or may be positioned on different lines.
  • the processor may identify the material type of the written material as a “multiple-choice test.”
  • Fig. 13 illustrates written material 1340 depicted in an image captured by, for example, camera 210 of apparatus 110.
  • written material 1340 may include question 1344 followed by text 1346 and 1348.
  • Processor 320 of apparatus 110 may recognize item 1344 as a question because of the question mark at the end of item 1344. Processor 320 may also recognize other lines of text 1346 and 1348 following question 1344. Based on this arrangement, processor 320 may determine the material type of written material 1340 as being a “multiple-choice test.”
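A minimal sketch of the question-plus-answer-lines pattern described above, assuming the recognized text is available line by line; the thresholds are assumptions made for illustration.

```python
# A question line (ending in "?") followed by two or more candidate-answer
# lines is treated as one multiple-choice item.
def find_multiple_choice_items(lines, min_answers=2):
    items = []
    i = 0
    while i < len(lines):
        line = lines[i].strip()
        if line.endswith("?"):
            answers = []
            j = i + 1
            while j < len(lines) and lines[j].strip() and not lines[j].strip().endswith("?"):
                answers.append(lines[j].strip())
                j += 1
            if len(answers) >= min_answers:
                items.append({"question": line, "answers": answers})
            i = j
        else:
            i += 1
    return items

page = [
    "Who invented the telephone?",
    "Benjamin Franklin",
    "Alexander Graham Bell",
]
items = find_multiple_choice_items(page)
print(len(items) >= 1)  # True -> material type may be "multiple-choice test"
```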
  • the processor may use other types of text processing and/or analysis to determine whether the written material in the image is a multiple-choice test.
  • the processor may use one or more trained machine learning models or neural networks.
  • a set of training examples may include one or more images of written material, including multiple choice tests, together with an identification of the written material in the images as a “multiple-choice test.”
  • the images of “multiple-choice test” may include one or more questions ending in a question mark, with each question being followed by a plurality of words, phrases, or sentences, serving as the answers.
  • a machine learning model or neural network may be trained to identify written material as a “multiple-choice test” based on these and/or other training examples.
  • the trained machine learning model or neural network may output the material type as “multiple-choice test” when presented with an image of written material depicting a similar layout or arrangement of text.
  • a trained machine learning model or neural network for identifying a multiple-choice test from images may be a separate and distinct machine learning model or neural network or may be an integral part of one or more other machine learning models or neural networks discussed above.
  • the material type is a question that includes an incomplete sentence, prompting the user to respond with the word for completing the sentence.
  • the processor (e.g., 320 or 520) may recognize one or more words forming a sentence. Based on recognizing text in the image of the written material, the processor may identify a blank (e.g., a missing word or phrase) in the identified sentence. For example, the processor may recognize a series of spaces or a series of underlined spaces (e.g., “____”) located in between other words.
  • the processor may determine the material type of the written material as a “question.” After recognizing the spaces and determining the material type as a question, the processor may generate an audio signal that may include a prompt to the user to respond with the missing word. By way of example, the processor may generate an audio signal which when played via an audio output device (e.g., audio feedback device 130) may include sounds representative of a prompt to the user to supply the missing word.
  • the audio signal when played via an audio output device may include the prompt: “Please fill in the blank” or “Please provide a word to complete the sentence.”
  • the audio signal when played via an audio output device may include the sound “blank” corresponding to the series of spaces (underlined or otherwise) recognized by the processor in the image of the written material.
  • the user (e.g., user 100) may then respond, for example by speaking, with the word needed to complete the sentence.
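A minimal sketch of the blank-detection and prompting behavior described above, assuming the blank appears in the OCR output as a run of underscores or spaces; the pattern, prompt wording, and helper name are illustrative.

```python
import re

# Detect a blank and build the text the text-to-speech stage would speak:
# the gap is read as the word "blank", followed by a prompt to the user.
BLANK = re.compile(r"_{3,}|\s{4,}")

def build_fill_in_prompt(sentence):
    if BLANK.search(sentence) is None:
        return None
    spoken = BLANK.sub(" blank ", sentence)
    return f"{spoken.strip()} Please fill in the blank."

# Prints the sentence with "blank" in place of the gap, then the prompt.
print(build_fill_in_prompt("Alexander Graham Bell invented the ____."))
```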
  • the material type is obtained from the user.
  • the processor may generate an audio signal representative of a question asking the user to identify the material type.
  • the processor may also transmit the audio signal to an audio output device (e.g., audio feedback device 130).
  • the audio output device may play the audio signal such that the user (e.g., user 100) may hear the question, requesting information regarding the material type of the written material in the text.
  • processor 320 may generate an audio signal representative of the question: “What is the material type?” or a request for information: “Please specify the material type of the text in the image.”
  • Processor 320 may transmit the audio signal to audio output device 130, which may play the audio signal.
  • user 100 may identify the material type, for example, by speaking. For example, user 100 may say “the material type is a restaurant menu” or “the material type is a test,” or “the material type is a multiple-choice test.”
  • Processor 320 may receive audio signals representative of the user’s speech generated by the one or more microphones 230. Processor 320 may perform speech-to-text processing on the audio signals received from the one or more microphones 230 and recognize words such as “restaurant menu,” “test,” or “tabular data” in the received audio signals. In response, the processor may identify the material type based on the recognized words.
  • the user may point apparatus 110 towards a word or phrase displayed on an object, where the word or phrase may be representative of the material type.
  • camera 210 of apparatus 110 may capture an image of the word or phrase.
  • processor 320 may recognize one or more words in the captured image.
  • Processor 320 may determine the material type based on the recognized word or phrase, for example, by comparing the recognized one or more words with words stored in database 650 as described above.
  • the processor is further programmed to receive an input indicative of the material type from a user interface displayed on a secondary device associated with the user.
  • a secondary device (e.g., 350) may include a smartphone, a laptop, a desktop computer, a tablet computer, a smartwatch, smart glasses, or any other type of computing device capable of displaying text or graphical material on a display screen.
  • the secondary device may be associated with or may belong to user 100 of apparatus 110.
  • secondary device 350 may be coupled via wired or wireless connection to apparatus 110.
  • a wireless connection between secondary device 350 and apparatus 110 may be based on one or more protocols, for example, Wi-Fi, Bluetooth®, Bluetooth Smart, 802.15.4, ZigBee, or any other method that may allow wireless communication of signals between secondary device 350 and apparatus 110.
  • secondary device 350 may include a display device or screen configured to display textual and/or graphical material, for example, in a graphical user interface.
  • a display screen may display a list of material types on the graphical user interface.
  • the graphical user interface may also have one or more interactive elements, for example, buttons or checkboxes, or other types of widgets that may be selectable by the user via one or more input devices associated with the secondary device.
  • the user may select a material type by touching the displayed material types with a finger, a pen, a pencil, a stylus, or any other type of pointing device.
  • secondary device 350 may have a display screen 360 that may display a graphical user interface 1400.
  • graphical user interface 1400 may display a menu or list of material types.
  • the material types 1402 may be identified by their respective labels, for example, “Multiple-Choice Test,” “Restaurant Menu,” “Table,” and “Flight Schedule.” Each of the labels may be associated with a checkbox 1404.
  • User 100 may be able to select one of the checkboxes 1404 using one or more input devices associated with secondary device 350, and/or by touching display screen 360 when secondary device 350 is equipped with a touch screen 360.
  • Processor 520 of secondary device 350 may transmit a signal indicative of the user’s selection to processor 320 of apparatus 110.
  • processor 320 may receive an input indicative of the material type from a user interface 1400 displayed on a secondary device 350 associated with the user 100.
  • the at least one processor is programmed to generate at least one audio signal representing the recognized text, the at least one audio signal being generated based on the material type.
  • the processor (e.g., 320 or 520) may be programmed to generate the at least one audio signal by executing one or more text-to-speech processing algorithms. Such algorithms may include, for example, one or more of WaveNet, DeepVoice, Tacotron, natural language processing (NLP), or any other algorithm that converts a text input into an audio signal representing speech.
  • processor 320 may execute one or more text-to-speech modules stored in apparatus 110, in secondary computing device 350, on server 380, and/or in database 650 to convert the text recognized by processor 320 into audio signals representing speech. It is contemplated that the processor may generate the audio signal based on the material type. For example, the processor may generate the audio signal such that when the audio signal is played, for example, by an audio output device (e.g., audio feedback device 130), the resulting sound represents the manner in which a reader would read that type of material. For example, written material having a material type “restaurant menu” may be read according to its sections (e.g., Appetizer, Entree, Dessert, etc.).
  • the processor may generate the audio signal to be representative of how a reader may read a restaurant menu based on the various sections of the written material. For example, for written material 1300 illustrated in Fig. 13, processor 320 of apparatus 110 may generate an audio signal that when played by an audio output device (e.g., audio feedback device 130) may generate a sound representative of the word “Salad.” This may be followed by a sound representative of the first item: “Hand tossed salad with French or ranch dressing.” As another example, when the material type is “multiple-choice test,” the processor may generate the audio signal to be representative of how a reader may read a multiple-choice test.
  • the resulting sound may reflect a person reading a question, preceded by a phrase such as “question number 1,” followed by a number of possible answers, each optionally preceded with a word such as “option”, “answer”, or the like.
  • processor 320 of apparatus 110 may generate an audio signal that when played by an audio output device (e.g., audio feedback device 130) may generate a sound representative of a question: “Question number 1, Who invented the telephone?”
  • the generated audio signal may include a short pause followed by sounds representative of the two options 1346 and 1348.
  • the audio signal when played by an audio output device may generate a sound representative of “Answer 1, Benjamin Franklin” followed by “Answer 2, Alexander Graham Bell.”
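A minimal sketch of how the reading order described above might be assembled into the text handed to a text-to-speech stage; the item structure and the “Question number 1” / “Answer 1” wording follow the description, while the function name and data layout are hypothetical.

```python
# Turn structured multiple-choice items into the lines a TTS engine would read.
def build_spoken_script(items):
    lines = []
    for q_index, item in enumerate(items, start=1):
        lines.append(f"Question number {q_index}. {item['question']}")
        for a_index, answer in enumerate(item["answers"], start=1):
            lines.append(f"Answer {a_index}. {answer}")
    return lines

items = [
    {"question": "Who invented the telephone?",
     "answers": ["Benjamin Franklin", "Alexander Graham Bell"]},
]
for line in build_spoken_script(items):
    print(line)
# Question number 1. Who invented the telephone?
# Answer 1. Benjamin Franklin
# Answer 2. Alexander Graham Bell
```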
  • processor 320 of apparatus 110 may adjust the audio signal to reflect the material type of the written material (e.g., 1300, 1320, 1340)
  • the at least one processor may be programmed to output the at least one audio signal via the audio output device.
  • processor 320 of apparatus 110 may transmit the generated audio signal to an output device (e.g., audio feedback device 130).
  • audio output device 130 may include one or more of an over-the-ear audio headphone being worn by user 100, a built-in speaker, a stand-alone speaker, a portable speaker, a hearing aid device, a bone conduction headphone, a within-ear headphone (e.g., earbuds), a speaker associated with the secondary device, or any other device capable of playing audio signals.
  • Audio output device 130 in turn may play the audio signal to generate a sound that may represent speech by a reader. That is, the audio output device 130 may play the audio signal such that user 100 may hear the text recognized in the image captured by camera 210 as if it were being read by a reader to user 100.
  • outputting the at least one audio signal includes outputting a first portion corresponding to a first question on the multiple-choice test and associated first potential answers.
  • the material type of the written material may be “multiple-choice test.”
  • the at least one processor may generate an audio signal representative of the written material. When the written material is a multiple-choice test, the processor may generate an audio signal that includes a first portion corresponding to a first question on the multiple-choice test and the plurality of answers associated with the first question.
  • the processor may generate a first portion of an audio signal to be representative of a first question followed by the answer choices to that first question.
  • processor 320 of apparatus 110 may generate a first portion of an audio signal that when played by an audio output device (e.g., audio feedback device 130) may generate a first sound representative of the first question 1344: “Who invented the telephone?” followed by second sounds representative of the two answers, namely 1346 (“Benjamin Franklin”) and 1348 (“Alexander Graham Bell”).
  • processor 320 may generate the first portion of the audio signal such that the first sound may include the phrase “Question number 1” before the question. Likewise, processor 320 may generate an audio signal such that the second sounds may include the phrase “Answer 1” before the answer 1346 and “Answer 2” before the answer 1348.
  • In some embodiments, outputting the at least one audio signal includes outputting a second portion corresponding to a second question on the multiple-choice test and associated second potential answers. In a manner similar to that discussed above, when the material type is “multiple-choice test,” the processor may generate a second portion of the audio signal corresponding to a second question on the multiple-choice test and the plurality of answers associated with the second question.
  • processor 320 of apparatus 110 may generate a second portion of the audio signal that when played by an audio output device (e.g., audio feedback device 130) may generate a third sound representative of the second question 1350: “In which country are letters to Santa addressed to?” followed by fourth sounds representative of the two answers, namely 1352 (“Finland”) and 1354 (“Canada”).
  • processor 320 may generate the second portion of the audio signal such that the third sound may include the phrase “Question number 2” before question 1350.
  • processor 320 may generate an audio signal such that the fourth sounds may include the phrase “Answer 1” before the answer 1352 and “Answer 2” before the answer 1354.
  • the at least one processor is further programmed to output the second portion when a predetermined time period has elapsed after outputting the first portion.
  • the processor (e.g., 320, 520) may play a second portion of the audio signal via an audio output device (e.g., audio feedback device 130) that may generate a third sound corresponding to a second question on the multiple-choice test, and fourth sounds corresponding to the answers to the second question.
  • the at least one processor may play the second portion of the audio signal via audio output device (e.g., audio feedback device 130) after having played the first portion of the audio signal via the audio output device.
  • the processor (e.g., processor 320) may wait for the predetermined time period after outputting the first portion before outputting the second portion. The processor may do so to allow the user to provide a response to the first question included in the first portion of the audio signal.
  • the processor may play the second audio signal after the predetermined period of time regardless of whether the user has provided a response to the first question included in the first portion of the audio signal.
  • the processor is further programmed to output the second portion after receiving a response from the user to the first portion.
  • the processor may play the second portion of the audio signal after receiving a response from the user to the first question without waiting for expiry of the predetermined time period.
  • user 100 may provide a response to the first question, for example, by speaking the response.
  • for question 1344 illustrated in Fig. 13, the user may respond by speaking the words “Answer 2” or “Alexander Graham Bell.”
  • Microphone 230 of apparatus 110 may capture the spoken words of the user and generate an audio signal that may be received by processor 320.
  • processor 320 may play the second portion of the audio signal corresponding to the second question and its associated answers, via audio output device 130.
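A minimal sketch of the pacing behavior described above: the second portion is played either once the user’s answer is detected or after the predetermined time period, whichever comes first. The play_audio helper, the timeout value, and the simulated answer are placeholders, not the disclosed implementation.

```python
import threading

def play_audio(portion_name):
    # Placeholder for sending an audio portion to audio feedback device 130.
    print(f"playing {portion_name}")

def read_two_portions(user_answered, timeout_seconds=20.0):
    play_audio("first portion: question 1 and its answers")
    # Proceed when the user answers, or after the predetermined time period.
    answered_in_time = user_answered.wait(timeout=timeout_seconds)
    play_audio("second portion: question 2 and its answers")
    return answered_in_time

user_answered = threading.Event()
# In the device, the speech-recognition path would call user_answered.set()
# when microphone 230 picks up an answer; here a timer simulates that after 1 s.
threading.Timer(1.0, user_answered.set).start()
print(read_two_portions(user_answered, timeout_seconds=20.0))  # True
```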
  • the processor is programmed to receive a request by the user (e.g., “please read the desserts section” in a menu, “please repeat question 2” in a test, or the like), and act accordingly.
  • the at least one processor is programmed to obtain a linguistic level of the user.
  • a linguistic level may refer to a language proficiency of the user.
  • linguistic levels may be defined by any number of proficiency levels representative of the reading proficiency of the user.
  • the linguistic levels may be defined as “beginner level,” “elementary level,” “intermediate level,” “upper-intermediate level,” “advanced level,” and “proficiency level.”
  • Each of these linguistic levels may be defined, for example, by the amount of vocabulary and/or the complexity of sentences or paragraphs that a user may be able to read.
  • At the beginner level, for example, the user may be able to understand and use familiar everyday expressions and basic phrases, may be able to introduce themselves, and may be able to interact with another person when the other person talks slowly and clearly.
  • At the intermediate level, for example, the user may be able to read simple connected text on topics that are familiar to the user and/or may be capable of describing experiences and events and giving reasons and explanations for the user’s opinions.
  • At the proficiency level, for example, the user may be able to understand with ease virtually everything that the user reads or hears, may be able to summarize information from different spoken and written sources, and may be able to express themselves spontaneously, very fluently and precisely.
  • the proficiency levels may include any number of levels and hierarchies and may be based on definitions for each of the hierarchical levels that may be similar to or different from the above-described exemplary definitions.
  • the linguistic level may be determined, for example based on a well-known standard such as Lexile scores for a text.
  • the processor is programmed to receive the linguistic level from the user.
  • the processor (e.g., 320, 520) may generate an audio signal representative of a question asking the user to specify the user’s linguistic level. The processor may transmit the audio signal to an audio output device that may play the audio signal such that the user may hear the question requesting information regarding the user’s linguistic level.
  • processor 320 may generate an audio signal representative of the question: “What is your linguistic level?” or a request for information: “Please specify your linguistic level.”
  • user 100 may identify his or her linguistic level, for example, by speaking.
  • Processor 320 may receive audio signals representative of the user’s speech generated by the one or more microphones 230. Processor 320 may perform speech-to-text processing and recognize, for example, words such as “Beginner” in the received audio signals. Processor 320 may determine the user’s linguistic level based on the recognized words.
  • the processor is programmed to receive an input indicative of the linguistic level from a user interface displayed on a secondary device associated with the user.
  • the secondary device may include a display device or screen configured to display textual and/or graphical material, for example, in a graphical user interface.
  • the display screen may display a list of linguistic levels on a graphical user interface displayed on a display screen of a secondary device associated with the user.
  • the graphical user interface may also have one or more interactive elements, for example, buttons or checkboxes, or other types of widgets that may be selectable by the user via one or more input devices associated with the secondary device.
  • the user may select a linguistic level by touching the displayed linguistic levels with a finger, a pen, a pencil, a stylus, or any other type of pointing device.
  • secondary device 350 may have a display screen 360 that may display a graphical user interface 1420.
  • graphical user interface 1420 may display a menu or list of linguistic levels.
  • the linguistic levels 1422 may be identified by their respective labels, for example, “Low,” “Medium,” or “High.” Each of the labels may be associated with a checkbox 1424.
  • User 100 may be able to select one of the checkboxes 1424 using one or more input devices associated with secondary device 350, and/or by touching display screen 360 when secondary device 350 is equipped with a touch screen 360.
  • Processor 520 of secondary device 350 may transmit a signal indicative of the user’s selection to processor 320 of apparatus 110.
  • processor 320 may receive an input indicative of the user’s linguistic level from a user interface 1420 displayed on a secondary device 350 associated with the user 100.
  • the at least one processor may be programmed to determine a difficulty level of the written material by analyzing one or more images of the written material. For example, as discussed above, the processor (e.g., 320, 520) may recognize one or more words, phrases, or sentences from the image of the written material. Further, the processor may access a database storing a plurality of words, phrases, and/or sentences in association with respective difficulty levels. The processor may be programmed to compare words, phrases, and/or sentences recognized in the image of the written material captured by the camera with the one or more words, phrases, and/or sentences stored in the database.
  • the processor may retrieve, from the database, a difficulty level associated with the matching word, phrase, and/or sentence.
  • the processor may use other types of text processing and/or analysis to determine the difficulty level for words, phrases, and/or sentences recognized in the image of the written material.
  • the processor may use one or more trained machine learning models or neural networks.
  • a set of training examples may include one or more images of written material or one or more words, phrases, and/or sentences included in the written material, and associated difficulty levels.
  • the machine learning model or neural network may be trained to identify a difficulty level of the text included in an image based on these and/or other training examples. Further, the trained machine learning model or neural network may output the difficulty level of individual words or phrases and/or a Lexile score for the entire text when an image including written material is presented as an input to the trained machine learning model.
  • a trained machine learning model or neural network for determining a difficulty level of individual words or phrases and/or a Lexile score for the entire text from images of written material may be a separate and distinct machine learning model or neural network or may be an integral part of one or more other machine learning models or neural networks discussed above.
  • the at least one processor is programmed to substitute at least one original word within the recognized text with the synonym word based on the linguistic level of the user.
  • the processor (e.g., 320, 520) may recognize one or more words, phrases, and/or sentences in the written material.
  • the processor may determine a difficulty level associated with the recognized one or more words, phrases and/or sentences.
  • the processor may compare the difficulty level of the recognized words, phrases, and/or sentences with a linguistic level (or difficulty level) of the user.
  • the processor may replace the recognized word with a synonym word having a difficulty level commensurate with (e.g., equal to or lower than) the linguistic level of the user.
  • a difficulty level of the synonym word is less than a difficulty level of the original word.
  • the processor may substitute the word or phrase recognized in the image of the written material with a synonymous word or phrase that may have a difficulty level equal to or lower than the linguistic level of the user. Doing so may make it easier for the user to understand the written material.
  • the disclosed system includes a storage device configured to store a thesaurus and wherein the at least one processor is further programmed to determine the synonym based on the thesaurus.
  • the disclosed system may include one or more memory devices 330 and/or databases 650 configured to store instructions and/or data.
  • the disclosed memory devices and/or databases may store a thesaurus that may include one or more words or phrases and corresponding synonymous words or phrases.
  • the disclosed system includes a storage device configured to store a plurality of words in association with respective difficulty levels.
  • the disclosed system may include one or more memory devices 330 and/or databases 650 configured to store one or more synonyms for a plurality of words or phrases. It is also contemplated that the one or more memory devices 330 and/or databases 650 may store difficulty levels in association with each of the plurality of stored words or phrases. As discussed above, the processor (e.g., processor 320) may compare a word or phrase recognized in an image of the written material with the plurality of words or phrases stored in the one or more memory devices 330 and/or databases 650.
  • the processor may select a word synonymous with the recognized word from the words/phrases stored in the one or more memory devices 330 and/or databases 650.
  • the processor may select the synonymous word stored in the one or more memory devices 330 and/or databases 650 and having a difficulty level lower than the difficulty level of the word recognized in the image of the written material.
  • the processor may also select the synonymous word such that the difficulty level of the synonymous word is lower than the linguistic level of the user.
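A minimal sketch of the substitution logic described above, with toy difficulty and thesaurus tables standing in for the stored data; the numeric scoring scale and the punctuation handling are assumptions made for illustration.

```python
# Replace words whose stored difficulty level exceeds the user's linguistic
# level with a lower-difficulty synonym.
DIFFICULTY = {"examination": 4, "test": 1, "the": 0, "math": 1, "was": 0, "easy": 1}
SYNONYMS = {"examination": ["test", "exam"]}

def simplify(text, user_level):
    words_out = []
    for word in text.split():
        bare = word.strip(".,!?").lower()
        if DIFFICULTY.get(bare, 0) > user_level:
            for candidate in SYNONYMS.get(bare, []):
                if DIFFICULTY.get(candidate, user_level + 1) <= user_level:
                    word = word.lower().replace(bare, candidate)
                    break
        words_out.append(word)
    return " ".join(words_out)

print(simplify("The math examination was easy.", user_level=2))
# -> "The math test was easy."
```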
  • the at least one processor is programmed to generate at least one audio signal representing the recognized text, wherein the at least one audio signal represents the synonym word rather than the original word.
  • the processor (e.g., 320, 520) may generate the at least one audio signal by executing one or more text-to-speech processing algorithms. Such algorithms may include, for example, one or more of WaveNet, DeepVoice, Tacotron, natural language processing (NLP), or any other algorithm that converts a text input into an audio signal representing speech.
  • processor 320 may execute one or more text-to-speech modules stored in apparatus 110, in secondary computing device 350, on server 380, and/or in database 650 to convert the text recognized by processor 320 into audio signals representing speech.
  • the processor may replace a word recognized in the image of the written material and having a difficulty level higher than the linguistic level of the user with a synonymous word having a lower difficulty level.
  • the processor may generate the audio signal corresponding to the written material with the substituted word.
  • the processor may generate the audio signal such that when the audio signal is played, for example, by an audio output device (e.g., audio feedback device 130), the resulting sounds include the synonym word having a difficulty level equal to or less than the linguistic level of the user.
  • when the processor recognizes the word “examination” in the written material (e.g., in the sentence: “The math examination was easy”), the processor may replace that word with the simpler word “test” when the user’s linguistic level is low.
  • thus, when the audio signal generated by the processor is played on an audio output device, the user may hear the sentence: “The math test was easy,” instead of the sentence: “The math examination was easy.”
  • the processor may replace the word “examination” with its simpler synonymous word “test.”
  • the processor may be programmed to replace a word recognized in the image of the written material with a word that may have a difficulty level higher than the linguistic level of the user.
  • the processor may access the one or more memory devices and/or databases to extract a synonymous word that may have a difficulty level greater than the linguistic level of the user.
  • the at least one processor is programmed to analyze the at least one image to recognize text, the recognized text being in a first language.
  • the processor is configured to determine the first language based on the recognized text.
  • the processor (e.g., 320 or 520) associated with the disclosed system may recognize one or more words, phrases, and/or sentences based on an analysis of the one or more images of the written material captured by the camera, using one or more techniques described above.
  • Such techniques may include, for example, image recognition techniques such as pattern matching, pattern recognition, or image correlation to match portions or glyph like shapes or features in the image of the written material with known characters of a language.
  • adaptive recognition techniques may be used to recognize the shapes of letters.
  • the processor may determine a language of the written material based on an analysis of the recognized text. For example, in some embodiments, the processor may compare shapes of the letters or glyphs in the image of the written material with images of characters, letters, symbols, words, or the like in different languages stored in a database (e.g., database 650). The images stored in the database may be stored in association with a corresponding language. The processor may use one or more techniques such as pattern matching, pattern recognition, or image correlation to compare the shapes of the letters or glyphs recognized in the image of the written material with the images stored in the database. When the processor determines that there is a match, the processor may retrieve the language (e.g., English, German, Italian, Hebrew, or any other language) associated with the matching letter shapes or glyphs from the database.
  • the processor may use one or more trained machine learning models and/or neural networks to recognize text in the image of the written material, as described elsewhere in this disclosure.
  • the processor may also use one or more trained machine learning models or neural networks to determine the language associated with the written material.
  • a set of training examples may include one or more images of written material, an identification of one or more characters, letters, symbols, words, or glyphs in the written material, and an identification of a language associated with the written material.
  • a machine learning model or neural network may be trained to identify the language of the written material included in the image based on these and/or other training examples.
  • the trained machine learning model or neural network may output the language associated with written material, when presented with an image of the written material and/or with text extracted from the image of the written material.
  • a trained machine learning model or neural network for determining a language of the written material from images of written material may be a separate and distinct machine learning model or neural network or may be an integral part of one or more other machine learning models or neural networks discussed herein.
  • the at least one processor is programmed to obtain a second language from the user.
  • the processor (e.g., 320 or 520) may generate an audio signal representative of a question asking the user to specify the second language. The processor may transmit the audio signal to an audio output device that may play the audio signal such that the user may hear the question requesting information regarding the second language into which the text should be translated.
  • the processor may generate an audio signal representative of the question: “What is the language?” or “Please specify the language for translation.”
  • user 100 may identify the language, for example, by speaking.
  • the user may say “English,” or “German,” or the user may say something else identifying the second language.
  • Processor 320 may receive audio signals representative of the user’s speech generated by the one or more microphones 230.
  • Processor 320 may perform speech-to-text processing and recognize words such as “English,” “German,” or other words, identifying the second language in the received audio signals.
  • user 100 may point apparatus 110 towards a word or phrase displayed on an object, where the word or phrase may be representative of the second language or be a word in the second language.
  • camera 210 of apparatus 110 may capture an image of the word or phrase.
  • processor 320 may perform image analysis and/or optical character recognition to identify and recognize the word and/or phrase. Processor 320 may also determine the second language based on the recognized word or phrase.
  • the processor may be programmed to receive an input indicative of the language from a user interface displayed on a secondary device associated with the user.
  • the secondary device may include a display device or screen configured to display textual and/or graphical material, for example, in a graphical user interface.
  • the display screen may display a list of languages on the graphical user interface.
  • the graphical user interface may also have one or more interactive elements, for example, buttons, checkboxes, or other types of widgets that may be selectable by the user via one or more input devices associated with the secondary device.
  • user 100 may be able to select one of the displayed languages by touching the display with a finger, a pen, a pencil, a stylus, or any other type of pointing device.
  • secondary device 350 may have a display screen 360 that may display graphical user interface 1440.
  • graphical user interface 1440 may display a menu or list of languages.
  • Languages 1442 may be identified by their respective labels, for example, “English,” “French,” “German,” “Spanish,” “Hebrew,” etc.
  • Each of the labels may be associated with a checkbox 1444.
  • User 100 may be able to select one of the checkboxes 1444 using one or more input devices associated with secondary device 350, and/or by touching display screen 360 when secondary device 350 is equipped with a touch screen 360.
  • Processor 520 of secondary device 350 may transmit a signal indicative of the user’s selection to processor 320 of apparatus 110.
  • processor 320 may receive an input indicative of the language of the written material from a user interface 1440 displayed on a secondary device 350 associated with the user 100.
  • the at least one processor is programmed to translate the recognized text into the second language.
  • the disclosed system comprises a storage device configured to store a utility comprising a dictionary for translating the text from the first language to the second language.
  • the processor (e.g., 320 or 520) may recognize individual words and/or phrases by analyzing an image of the written material based on one or more techniques described above.
  • one or more memory devices and/or databases associated with the disclosed system may store a dictionary that may relate words and phrases in a first language with words and phrases in a second language. The processor may use the words and/or phrases recognized in the image of the written material as an index to search the dictionary stored in the one or more memory devices and/or databases.
  • the processor may select associated words or phrases based on the second language from the database.
  • the processor may also be programmed to use the words or phrases retrieved from the database to replace the index words or phrases in the recognized text, thereby translating the text in the written material from the first language to the second language.
  • processor 320 recognizes the word “cat” in an image of written material captured by camera 210 of apparatus 110.
  • Processor 320 may use the word cat as an index to search in a dictionary stored in database 650.
  • the dictionary in database 650 may store the word “cat” with the word “gata” and the language “Spanish.”
  • the dictionary in database 650 may also store the word “cat” with the word “chatte” and the language “French.”
  • the dictionary in database 650 may store the word “cat” with the word “katze” and the language “German.”
  • processor 320 may retrieve records associated with the word cat in the Spanish, French, and German languages.
  • Processor 320 may select a record from the retrieved records based on the second language obtained from user 100. For example, when user 100 provides “German” as the second language, processor 320 may select the record associating the word “cat” with the word “katze” and the language “German.” Processor 320 may then replace the word “cat” in the written material with the word “katze.” Processor 320 may repeat this process for all the words and phrases recognized in the image of the written material to translate the text in the written material from the first language to the second language.
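A minimal sketch of the index-based dictionary lookup described above, mirroring the “cat” example; the tiny in-memory dictionary and function names are stand-ins for the dictionary stored in database 650.

```python
# Each source word is used as an index into a bilingual table keyed by target language.
DICTIONARY = {
    "cat": {"Spanish": "gata", "French": "chatte", "German": "katze"},
}

def translate_word(word, second_language):
    entry = DICTIONARY.get(word.lower())
    if entry is None:
        return word                      # leave unknown words unchanged
    return entry.get(second_language, word)

def translate_text(text, second_language):
    return " ".join(translate_word(w, second_language) for w in text.split())

print(translate_word("cat", "German"))   # -> "katze"
```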
  • the dictionary is context dependent.
  • the dictionary stored in the one or more memories and/or databases may include words in different tenses (e.g., past, present, or future) and/or according to different parts of speech (e.g., nouns, pronouns, verbs, adjectives, adverbs, conjunctions, prepositions, and/or articles).
  • the processor (e.g., 320 or 520) may extract one or more words and/or phrases from the image of the written material as described above. The processor may also identify a tense and/or part of speech associated with the extracted words and/or phrases.
  • the processor may access a set of rules stored in one or more memories (e.g., 330) and/or databases (e.g., 650) and may use one or more rules to determine the tense and/or part of speech associated with an extracted word and/or phrase.
  • the processor may retrieve a corresponding word and/or phrase in the second language by searching a dictionary stored in the one or more memories and/or databases. In selecting the corresponding word and/or phrase, the processor may select the word or phrase based on matching the tense and/or part of speech of the words and/or phrases extracted from the image of the written material.
  • processor 320 recognizes the word “play” in an image of written material captured by camera 210 of apparatus 110, for example, appearing in the sentence: “The children play outside.”
  • Processor 320 may use the word play as an index to search in a dictionary stored in database 650.
  • the dictionary in database 650 may store the word “play” with the word “spielen” and the language “German.”
  • the dictionary in database 650 may also store the word “play” with the word “das Spiel” and the language “German.”
  • the dictionary in database 650 may store the word “play” with the word “jouer” and the language “French.”
  • the dictionary in database 650 may also store the word “play” with the word “lee” and the language “German.”
  • processor 320 may retrieve records associated with the word play in the German and French languages.
  • processor 320 may identify the records corresponding to the word play from the records retrieved from the database. As discussed above, these records may include, for example, a record including the word “spielen” (context being verb) and a record including the word “das Spiel” (context being noun). Thus, the dictionary stored in database 650 may be context dependent. Processor 320 may select one of the two records based on the context of the word play in the sentence: “The children play outside.” Here the word play is a verb. Therefore, processor 320 may select the record associating the word play with the word “spielen” and the language “German.”
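A minimal sketch of a context-dependent lookup keyed by part of speech, following the verb/noun distinction in the example above. The crude article-based rule and the German/French entries (including “spielen” for the verb, substituted here for the garbled term in the source) are illustrative assumptions, not the disclosed rule set.

```python
# Dictionary keyed by (word, part of speech, target language).
CONTEXT_DICTIONARY = {
    ("play", "verb", "German"): "spielen",
    ("play", "noun", "German"): "das Spiel",
    ("play", "verb", "French"): "jouer",
}

def guess_part_of_speech(word, sentence):
    # Toy rule: the word preceded by an article is treated as a noun, else a verb.
    tokens = [t.strip(".,!?") for t in sentence.lower().split()]
    if word in tokens:
        i = tokens.index(word)
        if i > 0 and tokens[i - 1] in {"a", "an", "the"}:
            return "noun"
    return "verb"

def translate_in_context(word, sentence, second_language):
    pos = guess_part_of_speech(word, sentence)
    return CONTEXT_DICTIONARY.get((word, pos, second_language), word)

print(translate_in_context("play", "The children play outside.", "German"))  # -> "spielen"
print(translate_in_context("play", "They watched the play.", "German"))      # -> "das Spiel"
```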
  • the processor is programmed to translate the recognized text using a trained machine learning model.
  • the processor may be programmed to use one or more trained machine learning models and/or neural networks to translate the recognized text from the first language to the second language.
  • a set of training examples may include one or more images of written material in the first language and one or more images of the same written material translated into the second language together with identification of the first and second language. It is also contemplated that in some embodiments, the training examples may include a set of words and/or phrases in the first language extracted from an image of the written material and a corresponding set of words and/or phrases in the second language together with identification of associated tenses and/or parts of speech in both the first and the second language.
  • the machine learning model or neural network may be trained to translate the written material included in the image from the first language to the second language based on these and/or other training examples taking into account tense and parts of speech. Further, the trained machine learning model or neural network may output text in the second language when presented with an image of the written material and/or with a piece of text extracted from the image of the written material in the first language.
  • a trained machine learning model or neural network for translating written material depicted in an image from a first language to a second language may be a separate and distinct machine learning model or neural network or may be an integral part of one or more other machine learning models or neural networks discussed above.
  • the at least one processor is programmed to generate a second audio signal representing the recognized text, wherein the second audio signal represents the text translated into the second language.
  • the processor (e.g., 320, 520) may be programmed to generate at least one audio signal by executing one or more text-to-speech processing algorithms.
  • Such algorithms may include, for example, one or more of WaveNet, DeepVoice, Tacotron, natural language processing (NLP), or any other algorithm that converts a text input into an audio signal representing speech.
  • processor 320 may execute one or more text-to-speech modules stored in apparatus 110, in secondary computing device 350, on server 380, and/or in database 650 to convert the translation of the text recognized by processor 320 into audio signals representing speech. Furthermore, processor 320 may generate the audio signal corresponding to the written material as translated into the second language so that when the audio signal is played by audio output device 130, the user may hear the written material as being read by a reader in the second language.
  • processor 320 may generate the audio signal so that when the audio signal is played by audio output device 130, user 100 may hear the written material as if being spoken by a reader in the French language.
  • the at least one processor is programmed to identify at least one first audio signal associated with the at least one word.
  • the at least one word represents a sound emitted by an animal or a motor vehicle or a special effect.
  • the processor (e.g., 320, 520) may analyze the image of the written material to recognize the at least one word using, for example, optical character recognition (OCR).
  • the recognized word may represent a sound emitted by an animal.
  • the recognized word “bark” may represent the barking sound of a dog
  • the word “moo” may represent the mooing sound of a cow
  • the word “neigh” may represent the neighing sound of a horse.
  • the recognized word may represent the sound emitted by a motor vehicle.
  • the word “engine” may represent the sound of an engine of a car
  • the word “motor” may represent the sound made by an electric motor
  • the word “jet” may represent the sound made by a jet engine of an airplane.
  • the at least one recognized word may represent a special effect.
  • the word “gunfire” may represent the sound of rapid gunfire
  • the word “crowd” may represent the roar of a crowd at a sporting event
  • the word “applause” may represent the sound of clapping by a plurality of people.
  • the processor may be programmed to identify or select one or more sounds associated with the recognized word.
  • the disclosed system may include one or more databases (e.g., 650).
  • one or more databases 650 may store one or more words in association with sounds or audio signals associated with respective ones of the one or more words.
  • database 650 may store the sound or audio signal of a dog barking in association with the word “bark.”
  • database 650 may store the sound or audio signal of a trumpet being played in association with the word “trumpet.”
  • the processor may be programmed to compare a word recognized in an image of the written material with one or more words stored in the database.
  • the processor may retrieve the sound or audio signal associated with the matching word from the database.
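  • As a minimal sketch of the word-to-sound lookup described above (the table contents and file paths are hypothetical placeholders; database 650 could equally store audio data or references):

```python
# Hypothetical mapping from recognized words to stored sound-effect files.
SOUND_EFFECTS = {
    "bark":    "sounds/dog_bark.wav",
    "moo":     "sounds/cow_moo.wav",
    "engine":  "sounds/car_engine.wav",
    "trumpet": "sounds/trumpet.wav",
}

def find_sound_for_word(recognized_word):
    """Compare a recognized word against the stored words and return its sound, if any."""
    return SOUND_EFFECTS.get(recognized_word.lower().strip(".,!?"))

print(find_sound_for_word("bark"))   # sounds/dog_bark.wav
print(find_sound_for_word("table"))  # None (no associated sound)
```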
  • the at least one processor may be programmed to generate at least one second audio signal representing the text, the at least one second audio signal including the at least one first audio signal such that when the at least one second audio signal is output via the audio output device, the at least one first audio signal is played immediately before, immediately after, or instead of the at least one word.
  • the processor may generate at least one audio signal representing the recognized text by executing one or more text-to- speech processing algorithms.
  • Such algorithms may include, for example, one or more of WaveNet, DeepVoice, Tacotron, natural language processing (NLP), or any other algorithm that converts a text input into an audio signal representing speech.
  • processor 320 may execute one or more text-to-speech modules stored in apparatus 110, in secondary computing device 350, on server 380, and/or in database 650 to convert the text recognized by processor 320 into audio signals representing speech.
  • processor 320 may be programmed to incorporate a second audio signal (representing a sound associated with a recognized word) either before or immediately after the portion of the audio signal representing the recognized word.
  • processor 320 may be programmed to incorporate a second audio signal (e.g., of a dog barking) after the portion of the audio signal representing the word “barks” or “barking”.
  • the generated audio signal may include a portion representing the text up to and including the word “barks” or “barking,” followed by the second audio signal representing the sound of a dog barking, and potentially further followed by a portion of the audio signal representing the recognized text after the word “barks” or “barking.”
  • when the generated audio signal is played by, for example, an audio output device, the user may hear sounds corresponding to the recognized text up to and including the word “barks” or “barking,” immediately followed by the sound of a dog barking.
  • processor 320 may replace the recognized word in the generated audio signal with the second audio signal representing the recognized word.
  • the generated audio signal may include a portion representing the text up to but not including the word “barks” or “barking,” followed by the second audio signal representing the sound of a dog barking, and potentially further followed by a portion of the audio signal representing the recognized text after the word “barks” or “barking.”
  • when the generated audio signal is played by, for example, an audio output device, the user may hear sounds corresponding to the recognized text up to but not including the word “barks” or “barking,” followed by the sound of a dog barking, and then followed by sounds corresponding to the text following the word “barks” or “barking.”
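  • One way to realize the splicing described above is sketched below; the tts() callable and the effect clip are hypothetical stand-ins that return mono sample arrays at a common sample rate, and the mode names simply mirror the “before,” “after,” and “instead of” options:

```python
import numpy as np

def render_with_effect(words, effect_word, effect_clip, tts, mode="after"):
    """Concatenate per-word speech, inserting the effect clip before, after,
    or instead of the word it is associated with."""
    segments = []
    for word in words:
        if word == effect_word:
            if mode == "before":
                segments += [effect_clip, tts(word)]
            elif mode == "after":
                segments += [tts(word), effect_clip]
            else:  # "instead"
                segments.append(effect_clip)
        else:
            segments.append(tts(word))
    return np.concatenate(segments)
```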
  • At least one processor is programmed to receive a user setting associated with reading of the written material.
  • the user setting associated with the reading speed includes an indication of a word rate.
  • the user setting may include one or more of a reading speed, a word spacing, a speaker identity, and/or one or more prosodic features.
  • the one or more prosodic features may include, for example, pitch, loudness, tempo, or rhythm.
  • the processor (e.g., 320 or 520) may generate an audio signal representative of a question asking the user to provide the user setting associated with reading of the written material.
  • the processor may also transmit the audio signal to an audio output device (e.g., audio feedback device 130) that may play the audio signal such that the user may hear the question requesting information regarding a reading speed for the written material.
  • the processor may generate an audio signal representative of the question: “What is the reading speed?” or “Please specify a reading speed.”
  • user 100 may specify the reading speed or word rate by speaking. For example, in response to the question about reading speed, user 100 may say “125 words per minute,” “160 words per minute,” or some other number of words per minute.
  • processor 320 of apparatus 110 may receive audio signals representative of the user speech generated by the one or more microphones 340.
  • Processor 320 may perform speech-to-text processing and recognize the reading speed (e.g., words per minute) in the received audio signals.
  • the user setting is associated with a reading speed and includes an indication of a time duration to pause between words in the recognized text.
  • a user may specify the reading speed by providing a word rate (e.g., number of words per minute) or alternatively by providing an indication of time duration of a pause between successive words.
  • the processor (e.g., 320 or 520) may generate an audio signal representative of a question asking the user to specify the time duration to pause between words.
  • the processor may also transmit the audio signal to an audio output device (e.g., audio feedback device 130) that may play the audio signal such that the user may hear the question requesting information regarding a reading speed for the written material.
  • the processor may generate an audio signal representative of the question: “Please specify the pause between the words.” After hearing the question played by the audio output device, user 100 may specify the time duration to pause by speaking. For example, in response to the question about reading speed, user 100 may say “0.1 second,” “200 milliseconds,” or some other number indicating the time duration of the pause between words.
  • processor 320 of apparatus 110 may receive audio signals representative of the user speech generated by the one or more microphones 340. Processor 320 may perform speech-to-text processing and recognize the reading speed (e.g., time duration of the pause between words) in the received audio signals.
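  • A minimal sketch of extracting either form of the reading-speed setting (a word rate or a pause duration) from the transcribed user speech; the phrasing patterns are assumptions about how a user might answer:

```python
import re

def parse_reading_setting(transcript):
    """Return a word rate or a pause duration parsed from transcribed speech."""
    text = transcript.lower()
    match = re.search(r"(\d+)\s*words per minute", text)
    if match:
        return {"word_rate_wpm": int(match.group(1))}
    match = re.search(r"(\d+(?:\.\d+)?)\s*(second|millisecond)", text)
    if match:
        value = float(match.group(1))
        seconds = value / 1000 if match.group(2) == "millisecond" else value
        return {"pause_between_words_s": seconds}
    return None  # setting not recognized; the processor could ask again

print(parse_reading_setting("125 words per minute"))  # {'word_rate_wpm': 125}
print(parse_reading_setting("200 milliseconds"))      # {'pause_between_words_s': 0.2}
```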
  • the processor may be programmed to receive an input indicative of the user setting displayed on a secondary device associated with the user.
  • the secondary device may include a display device or screen configured to display textual and/or graphical material, for example, in a graphical user interface.
  • a display screen may display a list of user settings on the graphical user interface.
  • the graphical user interface may also have one or more interactive elements, for example, buttons, checkboxes, or other types of widgets that may be selectable by user 100.
  • user 100 may be able to select one of the displayed user settings by touching the display with a finger, a pen, a pencil, a stylus, or any other type of pointing device.
  • secondary device 350 may have a display screen 360 that may display a graphical user interface 1460.
  • graphical user interface 1460 may display a menu or list of user settings.
  • User settings 1462 may be identified by their respective labels, for example, “Reading Speed,” “Word Spacing,” “Pitch,” “Speaker Name,” etc.
  • Each of the labels may be associated with a checkbox 1464.
  • User 100 may be able to select one of the checkboxes 1464 using one or more input devices associated with secondary device 350, and/or by touching display screen 360 when secondary device 350 is equipped with a touch screen 360.
  • the display screen may change and may display another list or menu corresponding to the selected item 1462.
  • the display screen may change to display a list or menu of reading speeds, for example, 120 words per minute, 150 words per minute, 170 words per minute, etc.
  • selection of any of the items 1462 illustrated in Fig. 14D may cause the display screen to display a subsequent menu from which the user may select a user setting value specific to the setting label 1462 selected on the first screen.
  • the user may be able to select the value “150 words per minute” displayed in response to a selection of the label “Reading Speed,” as discussed above.
  • Processor 520 of secondary device 350 may transmit a signal indicative of the user’s selection to processor 320 of apparatus 110.
  • processor 320 may receive an input indicative of a user setting from a user interface 1440 displayed on a secondary device 350 associated with the user 100.
  • the secondary device may allow the user to specify any of the other user settings described above (e.g., pause between words, pitch, loudness, other prosodic features, etc.).
  • the at least one processor is programmed to generate at least one audio signal representing the recognized text, the at least one audio signal being generated based on the user setting.
  • the processor may generate at least one audio signal representing the recognized text by executing one or more text-to-speech processing algorithms.
  • Such algorithms may include, for example, one or more of WaveNet, DeepVoice, Tacotron, natural language processing (NLP), or any other algorithm that converts a text input into an audio signal representing speech.
  • processor 320 may execute one or more text-to-speech modules stored in apparatus 110, in secondary computing device 350, on server 380, and/or in database 650 to convert the text recognized by processor 320 into audio signals representing speech.
  • processor 320 may generate the audio signal such that the audio signal incorporates one or more of the user settings. For example, processor 320 may generate an audio signal that when played by audio output device 130 plays the recognized text at, for example, the user specified words per minute. As another example, processor 320 may generate an audio signal that when played by audio output device 130 plays the recognized text at, for example, the user specified pitch, tempo, or loudness.
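  • As an illustrative sketch only, the off-the-shelf pyttsx3 engine below stands in for the neural text-to-speech models named in this disclosure; the settings dictionary and its keys are assumptions:

```python
import pyttsx3  # a readily available TTS engine used here purely for illustration

def synthesize_with_settings(text, settings, out_path="reading.wav"):
    """Apply user settings (word rate, loudness) before synthesizing the text."""
    engine = pyttsx3.init()
    if "word_rate_wpm" in settings:
        engine.setProperty("rate", settings["word_rate_wpm"])  # words per minute
    if "loudness" in settings:
        engine.setProperty("volume", settings["loudness"])     # 0.0 .. 1.0
    engine.save_to_file(text, out_path)
    engine.runAndWait()
    return out_path

synthesize_with_settings("The children play outside.", {"word_rate_wpm": 150, "loudness": 0.8})
```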
  • Fig. 15A is a flowchart showing an exemplary process 1500 for reading written material to a user.
  • Process 1500 may be performed by one or more processors associated with apparatus 110, such as processor 320. Although the description below refers to processor 320, in some embodiments, some or all steps of process 1500 may be performed on processors external to apparatus 110. For example, one or more steps of process 1500 may be performed by processor 520 of secondary device 350 and/or a processor associated with server 380. It is also contemplated however, that all steps of process 1500 may be performed by any of or a combination of processor 320, processor 520, and/or a processor associated with server 380.
  • process 1500 includes receiving at least one image captured by the camera, the at least one image including a representation of written material.
  • apparatus 110 may comprise one or more cameras, such as camera 210, which may capture images of environment 400 of user 100.
  • Processor 320 of apparatus 110 may be programmed to receive the one or more images captured by, for example, camera 210.
  • camera 210 may generate one or more image files including digital representations of the one or more images captured by camera 210.
  • These digital representations may be stored in a storage device associated with camera 210 (e.g., memory 330) and/or in a storage device associated with secondary device 350 and/or server 380.
  • Processor 320 may be able to access and read the digital representations of the images stored by camera 210 from memory 330, and/or one or more storage devices associated with secondary device 350 and/or server 380.
  • process 1500 includes obtaining a material type of the written material.
  • processor 320 may recognize one or more words, a graphical shape, or an outline in the image of the written material obtained by camera 210.
  • Processor 320 may compare the one or more recognized words with words stored in, for example, database 650.
  • Processor 320 may extract an identifier or characteristic associated with word or words matching the one or more recognized words.
  • processor 320 may obtain the material type based on the retrieved identifier or characteristic.
  • Processor 320 may also determine the material type based on one or more techniques described above, including, for example, using one or more trained machine learning models or neural networks, or by receiving an input from the user.
  • process 1500 includes analyzing the at least one image to recognize text.
  • processor 320 may employ one or more techniques such as OCR, pattern matching, pattern recognition, or image correlation, and/or trained machine learning models and/or neural networks to analyze the image of the written material obtained by camera 210 to recognize the text in the image.
  • Processor 320 may recognize single words, phrases, or even lines of text instead of focusing on single characters using one or more of the techniques described in this disclosure.
  • process 1500 includes generating at least one audio signal representing the recognized text, the at least one audio signal being generated based on the material type.
  • processor 320 may generate the at least one audio signal by executing one or more text-to-speech processing algorithms. Such algorithms may include, for example, one or more of WaveNet, DeepVoice, Tacotron, natural language processing (NLP), or any other algorithm that converts a text input into an audio signal representing speech.
  • Processor 320 may generate the audio signal based on the material type such that when the audio signal is played, for example, by an audio output device (e.g., audio feedback device 130), the resulting sound represents the manner in which a reader would read that type of material.
  • process 1500 includes outputting the at least one audio signal via the audio output device.
  • processor 320 of apparatus 110 may transmit the generated audio signal to an output device (e.g., audio feedback device 130).
  • Audio output device 130 in turn may play the audio signal to generate a sound that may represent speech by a reader. That is, audio output device 130 may play the audio signal such that user 100 may hear the text recognized in the image captured by camera 210 as if it were being read by a reader to user 100.
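  • Process 1500 as a whole may be summarized by the following non-authoritative sketch; capture_image, classify_material_type, recognize_text, synthesize_speech, and play_audio are hypothetical helpers standing in for the components described above:

```python
def process_1500(camera, audio_output_device):
    image = capture_image(camera)                          # step: receive an image of written material
    material_type = classify_material_type(image)          # step: obtain the material type (e.g., poem, news article)
    text = recognize_text(image)                           # step: analyze the image to recognize text
    audio = synthesize_speech(text, style=material_type)   # step: generate audio based on the material type
    play_audio(audio_output_device, audio)                 # step: output the audio via audio output device 130
    return audio
```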
  • Fig. 15B is a flowchart showing an exemplary process 1520 for reading written material to a user.
  • Process 1520 may be performed by one or more processors associated with apparatus 110, such as processor 320. Although the description below refers to processor 320, in some embodiments, some or all steps of process 1520 may be performed on processors external to apparatus 110. For example, one or more steps of process 1520 may be performed by processor 520 of secondary device 350 and/or a processor associated with server 380. It is also contemplated however, that all steps of process 1520 may be performed by any of or a combination of processor 320, processor 520, and/or a processor associated with server 380.
  • process 1520 includes receiving at least one image captured by the camera, the at least one image including a representation of written material.
  • processor 320 may perform steps and employ techniques similar to those discussed above, for example, for step 1502 of method 1500.
  • process 1520 includes obtaining a linguistic level of the user.
  • processor 320 may generate an audio signal representative of a question asking the user to identify the user’s linguistic level.
  • processor 320 or 520 may generate an audio signal representative of the question: “What is your linguistic level?” or a request for information: “Please specify your linguistic level.”
  • user 100 may identify his or her linguistic level, for example, by speaking. For example, user 100 may say “My linguistic level is beginner” or “Beginner,” or another word or phrase to indicate the linguistic level.
  • Processor 320 may receive audio signals representative of the user speech generated by the one or more microphones 230.
  • Processor 320 may perform speech-to-text processing and recognize, for example, words such as “Beginner” in the received audio signals.
  • the processor may determine the linguistic level based on the recognized words.
  • processor 320 may receive the linguistic level of user 100 based on an input provided by user 100 on a graphical user interface of secondary device 350.
  • the linguistic level may be obtained in another manner, such as asking the user questions related to one or more words, based upon questions asked by the user, or the like.
  • the linguistic level may also be updated over time.
  • process 1520 may include analyzing the at least one image to recognize text.
  • processor 320 may perform steps and employ techniques similar to those discussed above, for example, for step 1506 of method 1500.
  • process 1520 may include substituting at least one original word within the recognized text with a synonym word based on the linguistic level of the user.
  • processor 320 may recognize one or more words, phrases and/or sentences by analyzing an image of the written material.
  • processor 320 may determine a difficulty level or Lexile score associated with the recognized one or more words, phrases and/or sentences.
  • Processor 320 may compare the difficulty level of the recognized words, phrases, and/or sentences with the linguistic level of the user.
  • upon identifying a recognized word or a phrase that may correspond to a linguistic level higher than that of the user, the processor may replace the recognized word with a synonym word having a difficulty level commensurate with (e.g., equal to or lower than) the linguistic level of the user.
  • process 1520 may include generating at least one audio signal representing the recognized text, wherein the at least one audio signal represents the synonym word rather than the original word.
  • processor 320 may employ one or more techniques described with respect to, for example, step 1508 of method 1500 to generate the at least one audio signal.
  • processor 320 may replace a word recognized in the image of the written material and having a difficulty level higher than the linguistic level of user 100 with a synonymous word having a lower difficulty level.
  • Processor 320 may generate the audio signal corresponding to the written material with the substituted word.
  • processor 320 may generate the audio signal based on the linguistic level of the user.
  • processor 320 may generate the audio signal such that, when the audio signal is played, for example, by audio output device 130, it produces sounds in which a word recognized from the image of the written material is replaced by a synonym word having a difficulty level equal to or less than the linguistic level of the user.
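  • A minimal sketch of the level-based substitution described above; the difficulty scores and the synonym table are illustrative assumptions rather than the disclosed Lexile data:

```python
# Hypothetical difficulty scores (higher = harder) and synonym table.
DIFFICULTY = {"ubiquitous": 5, "common": 1, "ameliorate": 5, "improve": 1}
SYNONYMS = {"ubiquitous": "common", "ameliorate": "improve"}

def simplify_text(words, user_level):
    """Replace words above the user's linguistic level with easier synonyms."""
    simplified = []
    for word in words:
        if DIFFICULTY.get(word.lower(), 0) > user_level:
            replacement = SYNONYMS.get(word.lower(), word)
            if DIFFICULTY.get(replacement, 0) <= user_level:
                word = replacement
        simplified.append(word)
    return simplified

print(simplify_text(["Smartphones", "are", "ubiquitous"], user_level=2))
# ['Smartphones', 'are', 'common']
```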
  • Fig. 15C is a flowchart showing an exemplary process 1540 for reading written material to a user.
  • Process 1540 may be performed by one or more processors associated with apparatus 110, such as processor 320. Although the description below refers to processor 320, in some embodiments, some or all steps of process 1540 may be performed on processors external to apparatus 110. For example, one or more steps of process 1540 may be performed by processor 520 of secondary device 350 and/or a processor associated with server 380. It is also contemplated however, that all steps of process 1540 may be performed by any of or a combination of processor 320, processor 520, and/or a processor associated with server 380.
  • process 1540 includes receiving at least one image captured by the camera, the at least one image including a representation of written material.
  • processor 320 may perform steps and employ techniques similar to those discussed above, for example, for step 1502 of method 1500.
  • process 1540 includes analyzing the at least one image to recognize text, the recognized text being in a first language.
  • processor 320 may recognize one or more words, phrases, and/or sentences based on an analysis of the one or more images of the written material captured by camera 210 using one or more techniques described above.
  • processor 320 may employ, for example, image recognition techniques such as pattern matching, pattern recognition, or image correlation to match portions or glyph-like shapes or features in the image of the written material with known characters of a language.
  • adaptive recognition techniques may be used to recognize the shapes of letters. It is contemplated that processor 320 may be programmed to determine a language of the written material based on an analysis of the recognized text.
  • processor 320 may compare shapes of the letters or glyphs in the image of the written material with images of characters, letters, symbols, words, or the like in different languages stored in a database (e.g., database 650).
  • the images stored in the database may be stored in association with a corresponding language.
  • Processor 320 may determine a language (e.g., English, German, Italian, Hebrew, or any other language) associated with the written material based on identifying a match between the shapes of the letters or glyphs extracted from the image of the written material with the images, corresponding to a particular language, stored in the database.
  • processor 320 may determine the language of the written material using one or more trained machine learning models or neural networks or by receiving an input from user 100.
  • process 1540 includes obtaining a second language from the user.
  • processor 320 may generate an audio signal representative of a question asking the user to identify the second language.
  • processor 320 may generate an audio signal representative of the question: “What is the language?” or “Please specify the language of the text in the image.”
  • user 100 may identify the language, for example, by speaking. For example, user 100 may say “English,” or “the language is German,” or the user may say something else identifying the language of the written material.
  • Processor 320 may receive audio signals representative of the user speech generated by the one or more microphones 230.
  • Processor 320 may be programmed to perform speech-to-text processing and recognize words such as “English,” “German,” or other words identifying the language in the received audio signals. As also described above, in some embodiments, processor 320 may receive the second language based on an input provided by user 100 on a graphical user interface of secondary device 350. The second language may also be set to a default language or to a previously selected second language.
  • process 1540 includes translating the recognized text into the second language.
  • processor 320 may recognize individual words and/or phrases by analyzing an image of the written material based on one or more techniques described above. The processor may use the words and/or phrases recognized in the image of the written material as an index to search a dictionary stored in, for example, database 650. After determining a match between the index words or phrases and corresponding words or phrases in the database, processor 320 may select words in the specified second language corresponding to the matching words from database 650. Processor 320 may also use the words or phrases retrieved from database 650 to replace the index words or phrases in the recognized text, thereby translating the text in the written material from the first language to the second language.
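  • A minimal sketch of this lookup-and-replace translation; the dictionary contents are illustrative stand-ins for records in database 650, and a word-by-word gloss of this kind would typically be refined (e.g., for agreement and word order) by a trained model:

```python
# Hypothetical (word, language) -> translation records.
DICTIONARY = {
    ("the", "French"): "les", ("children", "French"): "enfants",
    ("play", "French"): "jouent", ("outside", "French"): "dehors",
}

def translate_text(words, second_language):
    """Replace each recognized index word with its dictionary entry, if one exists."""
    translated = []
    for word in words:
        entry = DICTIONARY.get((word.lower(), second_language))
        translated.append(entry if entry is not None else word)  # keep unmatched words
    return " ".join(translated)

print(translate_text("The children play outside".split(), "French"))
# les enfants jouent dehors
```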
  • process 1540 includes generating at least one audio signal representing the recognized text, wherein the second audio signal represents the text translated into the second language.
  • processor 320 may be programmed to generate at least one audio signal by executing one or more text-to-speech processing algorithms, such as WaveNet, DeepVoice, Tacotron, natural language processing (NLP), or any other algorithm that converts a text input into an audio signal representing speech.
  • processor 320 may execute the one or more text-to-speech algorithms to convert the translation of the text recognized by processor 320 into audio signals representing speech.
  • processor 320 may generate the audio signal corresponding to the written material as translated into the second language so that when the audio signal is played by audio output device 130, the user may hear the written material as being read by a reader in the second language.
  • Fig. 15D is a flowchart showing an exemplary process 1560 for reading written material to a user.
  • Process 1560 may be performed by one or more processors associated with apparatus 110, such as processor 320. Although the description below refers to processor 320, in some embodiments, some or all steps of process 1560 may be performed on processors external to apparatus 110. For example, one or more steps of process 1560 may be performed by processor 520 of secondary device 350 and/or a processor associated with server 380. It is also contemplated, however, that all steps of process 1560 may be performed by any of or a combination of processor 320, processor 520, and/or a processor associated with server 380.
  • process 1560 includes receiving a user setting associated with a reading speed.
  • processor 320 may generate an audio signal representative of a question asking the user to provide a user setting associated with reading of the written material.
  • processor 320 may also transmit the audio signal to audio output device 130, which may play the audio signal such that user 100 may hear the question requesting information regarding a reading speed for the written material.
  • the processor may generate an audio signal representative of the question: “What is the reading speed?” or “Please specify a reading speed.” After hearing the question played by the audio output device, user 100 may specify the reading speed or word rate by speaking.
  • processor 320 of apparatus 110 may receive audio signals representative of the user speech generated by the one or more microphones 340.
  • Processor 320 may perform speech-to-text processing and recognize the reading speed (e.g., words per minute) in the received audio signals.
  • processor 320 may receive the user setting based on an input provided by user 100 on a graphical user interface of secondary device 350.
  • process 1560 includes receiving at least one image captured by the camera, the at least one image including a representation of written material.
  • processor 320 may perform steps and employ techniques similar to those discussed above, for example, for step 1502 of method 1500.
  • process 1560 may include analyzing the at least one image to recognize text.
  • processor 320 may perform steps and employ techniques similar to those discussed above, for example, for step 1506 of method 1500.
  • process 1560 includes generating at least one audio signal representing the recognized text, the at least one audio signal being generated based on the user setting.
  • processor 320 may generate at least one audio signal representing the recognized text by executing one or more text-to-speech processing algorithms, such as WaveNet, DeepVoice, Tacotron, natural language processing (NLP), or any other algorithm that converts a text input into an audio signal representing speech. Further, processor 320 may generate the audio signal such that the audio signal incorporates one or more of the user settings. For example, processor 320 may generate an audio signal that when played by audio output device 130 plays the recognized text at, for example, the user specified words per minute. As another example, processor 320 may generate an audio signal that when played by audio output device 130 plays the recognized text at, for example, the user specified pitch, tempo, or loudness.
  • process 1560 includes outputting the at least one audio signal via the audio output device.
  • processor 320 may perform steps and employ techniques similar to those discussed above, for example, for step 1510 of method 1500.
  • Disclosed embodiments may provide devices, systems, and/or methods that may help users of all ages who may have difficulty in reading, learning, or comprehension. For example, disclosed embodiments may provide devices, systems, and/or methods for automatically capturing and processing images of textual and/or written material, and generating audio signals corresponding to the textual and/or written material. When these audio signals are played by an audio output device, the audio output device may produce sounds that may sound as if the written material is being read out aloud. In particular, disclosed embodiments may provide devices, systems, and/or methods for generating sounds representing the textual and/or written material as if being read aloud in the voice of a particular speaker. Disclosed embodiments may also provide devices, systems, and/or methods for determining a sentiment, emotion, or context associated with the written material and adjusting the generated sounds representing the textual and/or written material based on the sentiment, emotion, or context.
  • an exemplary system 300 for reading text may include apparatus 110 held by user 100, one or more audio feedback devices 130, an optional secondary device 350, and/or a server 380 capable of communicating with apparatus 110 and/or with secondary communications device 350 via network 370.
  • Apparatus 110 may be configured to recognize textual and/or written material and generate audio signals representing the textual and/or written material.
  • Apparatus 110 of system 300 may also be configured to play the generated audio via, for example, audio feedback device 130 to read the recognized text to a user.
  • the system includes a camera configured to capture images from an environment of a user.
  • the disclosed system may include an apparatus that includes an imaging device (e.g., camera) capable of capturing one or more images of an environment surrounding the camera and/or the user.
  • the disclosed imaging device may also be configured to capture an image of an object that may have textual or written material displayed on a surface of the object.
  • apparatus 110 may include camera 210.
  • camera 210 of apparatus 110 may include image sensor 310 (see e.g., Fig. 5) configured to capture one or more images of an environment 400 (see e.g., Fig. 4) of user 100.
  • Environment 400 may include one or more objects (e.g., object 120) that may display written material and/or may have written material printed or displayed on a surface of the one or more objects.
  • Image sensor 310 of camera 210 may be configured to capture one or more images of textual or written material displayed by, for example, object 120.
  • object 120 may display text 410, such as: “The text here is provided as an example of the type of subject matter the disclosed device and systems may be able to process” and/or additional textual material.
  • Camera 210 of apparatus 110 may be configured to capture one or more images of text 410 displayed by object 120.
  • the system includes an audio output device for outputting audio signals.
  • the disclosed system may include an audio output device (or audio feedback device), such as a headphone, a speaker, and/or any other device capable of playing audio signals.
  • system 300 may include apparatus 110 that may include audio output device 130 (e.g., 130a or 130b).
  • audio output device 130 may include an over-the-ear audio headphone being worn by user 100, a built-in speaker, a stand-alone speaker, a portable speaker, a hearing aid device, a bone conduction headphone, a within-ear headphone (e.g., earbuds), a speaker associated with the secondary device, or any other device capable of playing audio signals.
  • Apparatus 110 may be connected to audio output device 130 via a wired connection. It is contemplated, however, that apparatus 110 may be connected to audio output device 130 via a wireless connection based, for example, on a Bluetooth™ protocol or any other wireless communication protocol that may allow apparatus 110 to transmit audio signals to audio feedback device 130.
  • Audio output device 130 may be configured to play one or more audio signals by generating sounds corresponding to the one or more audio signals. User 100 may be able to hear the sounds generated by audio output device 130.
  • the system includes at least one processor.
  • the at least one processor may be understood to be a processor as defined elsewhere in this disclosure.
  • the at least one processor may include one or more integrated circuits, microchips, microcontrollers, microprocessors, all or part of a central processing unit (CPU), graphics processing unit (GPU), digital signal processor (DSP), field programmable gate array (FPGA), or other circuits suitable for executing instructions or performing logic operations.
  • system 300 may include apparatus 110 that may include one or more processors 320.
  • the one or more processors 320 may control one or more components of apparatus 110 and/or of system 300.
  • the at least one processor may include one or more of processor 320 of apparatus 110, processor 520 of secondary device 350, and/or one or more processors associated with server 380.
  • processor 320 is described as performing one or more functions associated with the system for reading text. It is contemplated, however, that one or more of the processes or functions described herein as being performed by the at least one processor may be performed by any of or a combination of processor 320 of apparatus 110, processor 520 of secondary device 350, and/or one or more processors associated with server 380.
  • the at least one processor is programmed to receive at least one image captured by the camera, the at least one image including a representation of written material.
  • apparatus 110 may comprise one or more cameras, such as camera 210, which may capture images of environment 400 of user 100.
  • Processor 320 of apparatus 110 may receive the one or more images captured by, for example, camera 210.
  • camera 210 may generate one or more image files including digital representations of the one or more images captured by camera 210.
  • These digital representations may be stored by camera 210 in a storage device associated with camera 210 (e.g., memory 330), in a storage device associated with secondary device 350 and/or server 380, and/or in one or more databases 650.
  • Processor 320 may access and read the digital representations of the images stored by camera 210 from memory 330, from one or more storage devices associated with secondary device 350 and/or server 380, and/or from the one or more databases 650.
  • processor 320 of apparatus 110 may receive the one or more images, including the representation of the textual and/or written material directly from camera 210 via a wired or wireless connection.
  • processor 320 may receive and process at least one image out of the plurality of images captured by, for example, camera 210 from environment 400 of user 100. It is contemplated that environment 400 of user 100 may include one or more objects that may have one or more surfaces displaying written material.
  • environment 400 may include one or more objects such as a newspaper, a book, a paper with written material, and/or any other object displaying textual and/or written material.
  • the written material may be displayed on a display device or screen of a secondary device 350 associated with the user 100.
  • secondary device 350 may be a smartphone, tablet, smartwatch, or a desktop or laptop computer, having a display screen capable of displaying textual and/or written material.
  • the written material may be displayed on a standalone display device or screen, for example, an electronic advertising screen, a television, or any other device capable of displaying written material.
  • Camera 210 of apparatus 110 may capture an image that includes a representation of the textual and/or written material displayed on an object.
  • the image may include a picture or depiction of the textual and/or written material displayed on the object.
  • the at least one processor is programmed to analyze the at least one image to recognize text.
  • the processor (e.g., 320 or 520) associated with the disclosed system may recognize one or more words, phrases, and/or sentences based on an analysis of the one or more images of the written material captured by the camera.
  • the processor may employ techniques such as optical character recognition (OCR) to recognize the one or more words, phrases and/or sentences in the written material.
  • Such techniques may additionally or alternatively include image recognition techniques such as pattern matching, pattern recognition, or image correlation to match portions or glyph-like shapes and features in the image with known characters of a language.
  • adaptive recognition techniques may be used to recognize the shapes of letters. It is also contemplated that in some embodiments, the techniques may include use of trained machine learning models and/or neural networks that may recognize single words, phrases, or even lines of text instead of focusing on single characters.
  • the processor may use one or more trained machine learning models and/or neural networks to recognize single words, phrases, or even lines of text instead of focusing on single characters in the image of the written material.
  • trained machine learning models and/or neural networks may include support vector machines, Fisher’s linear discriminant, nearest neighbor, k nearest neighbors, decision trees, random forests, and so forth.
  • a set of training examples may include one or more images of written material, an identification of one or more characters, letters, symbols, words, glyphs, phrases, and/or sentences in the written material.
  • a machine learning model or neural network may be trained to identify single words, phrases, or lines of text in the written material included in the image based on these and/or other training examples.
  • the trained machine learning model or neural network may output one or more single words, phrases, or lines of text associated with written material, when presented with an image of the written material and/or with text extracted from the image of the written material.
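  • As one concrete, non-authoritative illustration of the text-recognition step, the sketch below uses the widely available pytesseract OCR engine; the disclosure equally contemplates trained neural networks that recognize whole words, phrases, or lines rather than single characters:

```python
from PIL import Image
import pytesseract  # an off-the-shelf OCR engine used here purely for illustration

def recognize_text_from_image(image_path, language_hint="eng"):
    """Return the text recognized in an image of written material."""
    image = Image.open(image_path)
    return pytesseract.image_to_string(image, lang=language_hint)
```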
  • a trained machine learning model or neural network for recognizing words, phrases, or lines of text from images of written material may be a separate and distinct machine learning model or neural network or may be an integral part of one or more other machine learning models or neural networks discussed above.
  • the at least one processor is programmed to perform text-to-speech conversion of the recognized text to generate at least one audio signal representing the recognized text.
  • the processor (e.g., 320 or 520) may be programmed to generate the at least one audio signal by executing one or more text-to-speech processing algorithms. Such algorithms may include, for example, one or more of WaveNet, DeepVoice, Tacotron, natural language processing (NLP), or any other algorithm that converts a text input into an audio signal representing speech.
  • processor 320 may execute one or more text-to-speech modules stored in apparatus 110, in secondary computing device 350, on server 380, and/or in database 650 to convert the text recognized by processor 320 into audio signals representing speech.
  • the audio signal is generated in a voice of a predetermined speaker or speaker type.
  • the predetermined speaker is a parent or a teacher of the user.
  • the processor (e.g., 320, 520) may generate the audio signal such that when the audio signal is played by an audio output device (e.g., audio feedback device 130), the user may hear the written material in the image being read out aloud in a predetermined voice or in the voice of a particular speaker.
  • the voice of the particular speaker may be the voice of a favorite actor or actress, the voice of a cartoon character, the voice of a parent of the user, the voice of a teacher of the user, the voice of a peer (e.g., colleague, friend, mentor, etc.), or the voice of another speaker. It is also contemplated that in some embodiments, the voice of the particular speaker may be the user’s own voice as the user would hear himself or herself. In some embodiments, the voice of the predetermined speaker may be based on a speaker type.
  • Speaker type may refer to a gender of a speaker (e.g., male, female), age of a speaker (e.g., adult, child, elder person), accent or geography of the speaker (e.g., French accent, British accent, Indian accent) or some other characteristic of the speaker.
  • the voice of the predetermined speaker is selected from a plurality of generic voices.
  • the generic voices may refer for example to a voice of a child, a male voice, a female voice, a voice having a British accent, a voice having an Indian accent, etc., where the speaker may not be identified.
  • the processor may generate the audio signal such that when the audio signal is played by an audio output device (e.g., audio feedback device 130), the user may hear the written material in the image being read out aloud in the voice of an adult, a child, a male speaker, a female speaker, or a speaker having one of a plurality of accents, where the voice may not necessarily be identifiable as that of a particular speaker.
  • the voice of the predetermined speaker or type may be selected by the user in many ways.
  • the processor (e.g., 320 or 520) may generate an audio signal representative of a question asking the user to indicate the predetermined speaker.
  • the processor may transmit the audio signal to an audio output device (e.g., audio feedback device 130) that may play the audio signal such that the user may hear the question requesting information regarding the predetermined speaker.
  • the processor may generate an audio signal representative of the question: “Which speaker’s voice should I use?” or “Should I use a male voice or a female voice?”
  • user 100 may select the predetermined speaker, for example, by speaking.
  • Processor 320 may receive audio signals representative of the user speech generated by the one or more microphones 230. Processor 320 may perform speech-to-text processing and recognize words such as “Male Voice,” “Gal Gadot’s Voice,” or other words identifying the predetermined speaker in the received audio signals.
  • the processor may generate the audio signal such that when the audio signal is played by an audio output device (e.g., audio feedback device 130), the user may hear the written material in the image being read out aloud in the voice of the selected predetermined speaker.
  • an audio output device e.g., audio feedback device 130
  • the processor may generate an audio signal indicating that the selected speaker’s voice is not available or not recognized and asking the user to select another predetermined speaker.
  • the processor may transmit the audio signal to an audio output device that may play the audio signal such that the user may hear the indication that the selected speaker’s voice is not available and the request to select another predetermined speaker.
  • the processor may generate an audio signal representative of the message: “Gal Gadot’s voice is not available. Please select another speaker.” or the question: “Gal Gadot’s voice is not available. May I use Anne Hathaway’s voice instead?”
  • the processor may then generate the audio signal based on the user’s input.
  • the processor may generate an audio signal representing the written material using a default speaker embedding.
  • the processor may be programmed to receive an input indicative of the predetermined speaker from a user interface displayed on a secondary device associated with the user.
  • the secondary device may include a display device or screen configured to display textual and/or graphical material, for example, in a graphical user interface.
  • the display screen may display a list of predetermined speakers or speaker types on the graphical user interface.
  • the graphical user interface may also have one or more interactive elements, for example, buttons, checkboxes, or other types of widgets that may be selectable by the user via one or more input devices associated with the secondary device.
  • user 100 may be able to select one of the displayed predetermined speakers by touching the display with a finger, a pen, a pencil, a stylus, or any other type of pointing device.
  • secondary device 350 may have a display screen 360 that may display graphical user interface 1680.
  • graphical user interface 1680 may display a menu or list of predetermined speakers.
  • Predetermined Speakers 1682 may be identified by their respective labels, for example, “Gal Gadot,” “James Earl Jones,” “Oded Fehr,” “Moana,” “Roger Rabbit,” etc. Each of the labels may be associated with a checkbox 1684.
  • User 100 may be able to select one of the checkboxes 1684 using one or more input devices associated with secondary device 350, and/or by touching display screen 360 when secondary device 350 is equipped with a touch screen 360.
  • Processor 520 of secondary device 350 may transmit a signal indicative of the user’s selection to processor 320 of apparatus 110.
  • processor 320 may receive an input indicative of the predetermined speaker from a user interface 1440 displayed on a secondary device 350 associated with the user 100.
  • the at least one processor is programmed to generate the audio signal in the voice of the predetermined speaker based on one or more stored voice characteristics of the predetermined speaker. Generating the audio conditioned on a predetermined voice may require a speaker embedding, which may include encoding of a speaker’s voice characteristics.
  • the speaker embedding may include a vector comprising a speaker’s voice characteristics. Such voice characteristics may include, for example, a speaker’s identity, phonetic data extracted from audio signals representing the speaker’s voice/speech, pitch and volume of the speaker’s voice, intonations, pronunciations, and/or accent.
  • speaker embeddings may include i-vector based embeddings, x-vector based embeddings, d-vector based embeddings, and/or s-vector based embeddings.
  • a speaker embedding may be produced by an artificial intelligence (AI) engine, such as but not limited to a trained machine learning model or neural network.
  • the at least one processor is programmed to receive at least one second audio signal representing speech by the predetermined speaker.
  • an audio signal represented as an audio recording of speech by a particular speaker may be stored in, for example, one or more memories 330 and/or databases 650 associated with the disclosed system.
  • the processor (e.g., 320, 520) may access the one or more memories 330 and/or databases 650 and retrieve the audio signal representative of speech by the particular speaker.
  • the processor may receive the second audio signal generated by a microphone associated with apparatus 110 of system 300.
  • the system may include microphone 230, 340 configured to capture sounds from the environment of a predetermined speaker.
  • the microphone may be configured to capture one or more voice commands and/or other speech of the predetermined speaker.
  • system 300 may include apparatus 110 that may include one or more microphones 230, 340.
  • the one or more microphones 230, 340 may be configured to capture sounds associated with speech of the predetermined speaker and convert the sounds into audio signals for further processing by apparatus 110.
  • a predetermined speaker may use apparatus 110 to provide a second audio signal representing speech by the predetermined speaker.
  • the processor (e.g., 320, 520) may be configured to access the one or more memories 330 and/or databases 650 and retrieve the audio signal representative of speech by the particular speaker.
  • the at least one processor is programmed to analyze the at least one second audio signal to extract one or more voice characteristics associated with a voice of the predetermined speaker.
  • a speaker embedding may include encoding of a speaker’s voice characteristics into a vector.
  • voice characteristics may include, for example, a speaker’s identity, phonetic data extracted from audio signals representing the speaker’s voice/speech, pitch and volume of the speaker’s voice, intonations, pronunciations, and accent.
  • the processor (e.g., 320, 520) may execute one or more algorithms to extract the voice characteristics or speaker embeddings from the second audio signal.
  • Such algorithms may include Linear Predictive Coding, Mel Frequency Cepstral Coefficient (MFCC), Power Normalized Cepstral Coefficients, Gammatone Frequency Cepstral Coefficients, and/or any other algorithm capable of extracting speaker embeddings from an audio signal.
  • the processor may employ one or more machine learning models to extract the voice characteristics or speaker embeddings from the second audio signal.
  • machine learning models may include, for example, Gaussian Mixture Model (GMM), Universal Background Model (UBM), Kaldi, or other deep learning models.
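  • As a minimal sketch of extracting voice characteristics from a recording of the predetermined speaker, the example below computes Mel Frequency Cepstral Coefficients with the librosa toolkit and averages them into a single vector; a production system would more likely use a trained embedding network (e.g., d-vectors or x-vectors):

```python
import numpy as np
import librosa  # one possible toolkit for the cepstral features listed above

def extract_voice_characteristics(audio_path, n_mfcc=20):
    """Derive a fixed-length voice-characteristics vector from a speech recording."""
    samples, sample_rate = librosa.load(audio_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=samples, sr=sample_rate, n_mfcc=n_mfcc)
    return np.mean(mfcc, axis=1)  # average over time: one vector per speaker recording
```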
  • the at least one processor is programmed to generate the at least one audio signal based on the extracted one or more voice characteristics.
  • Fig. 16A illustrates an exemplary system 1600 for generating an audio signal based on the predetermined speaker’s voice.
  • Text-to-speech (TTS) engine 1612 may include one or more algorithms or machine learning models/neural networks. TTS engine 1612 may be trained such that after training TTS engine 1612 may receive any text 1608 and generate output audio 1616 in the voice of the speaker upon which TTS engine 1612 was trained.
  • actual audio from the speaker may be used in the training process to improve the accuracy.
  • an audio signal of the speaker reading text 1608 may be provided as the ground truth audio 1604 for training.
  • Comparator 1606 may receive the ground truth audio 1604 and the output audio signal 1616 generated by TTS engine 1612. The results of the comparison may be fed by comparator 1606 back to the TTS engine 1612 to improve the accuracy of output audio signal 1616.
  • the processor (e.g., 320, 520) may use system 1600 for training TTS engine 1612.
  • the processor may provide text in the form of images of written material or text recognized from the images to TTS engine 1612, which may generate an output audio signal 1616 representative of the voice of the speaker used to train TTS engine 1612.
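  • The comparator-driven training of Fig. 16A might look roughly like the PyTorch-style loop below; tts_engine is assumed to be a torch.nn.Module, and the dataset format and the L1 loss are illustrative choices rather than the disclosed implementation:

```python
import torch

def train_tts(tts_engine, dataset, epochs=10, lr=1e-4):
    """Train a TTS model by comparing its output audio against ground-truth audio."""
    optimizer = torch.optim.Adam(tts_engine.parameters(), lr=lr)
    comparator = torch.nn.L1Loss()  # plays the role of comparator 1606
    for _ in range(epochs):
        for text_ids, ground_truth_audio in dataset:    # ground truth audio 1604
            predicted_audio = tts_engine(text_ids)       # output audio 1616
            loss = comparator(predicted_audio, ground_truth_audio)
            optimizer.zero_grad()
            loss.backward()                              # feed the difference back to the engine
            optimizer.step()
    return tts_engine
```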
  • the voice is a first voice and at least a part of the audio signal is further generated in accordance with a second voice of a respective second speaker.
  • the processor (e.g., 320, 520) may generate the audio signal representing written material such that when the audio signal is played by an audio output device, the text or parts thereof may be read to a user in more than one predetermined voice.
  • a single voice may be used for the whole of the read text, or the voice may switch.
  • the voice may switch according to the speaking character.
  • speaker embeddings for the different speakers may be generated and may be stored in the one or more memories 330 and/or databases 650 associated with system 300.
  • the processor (e.g., 320, 520) may read or retrieve the speaker embeddings for generating audio signals representing the written material according to the voices of the different speakers.
  • Fig. 16B illustrates an exemplary system 1650 for generating an audio signal in one or more voices.
  • multi-speaker TTS engine 1628 may be trained using text 1608 and speaker signature/ID/embeddings 1624, such that each specific training is associated with a particular speaker.
  • multi-speaker TTS engine 1628 may also employ ground truth audio signals 1604 from each of the multiple speakers to ensure that output audio signal 1616 accurately represents the voice of each of the multiple speakers.
  • Multi-speaker TTS engine 1628 may be trained in each of the multiple speaker’s voices by using a corresponding ground truth audio signal 1604 and using the comparator to determine differences between the ground truth audio signal 1604 and the output audio signal 1616.
  • the processor may use system 1650 for training multi-speaker TTS engine 1628.
  • the processor may provide text in the form of images of written material or text recognized from the image to multi-speaker TTS engine 1628 together with a pointer 1624 that may point to an entry of a look-up table storing a plurality of signatures for a plurality of possible speakers.
  • the processor may generate the output audio signal 1616 based on the signature (e.g., audio for signature 1626) identified by pointer 1624.
  • the processor may generate the output audio signal using the speaker embeddings for the selected speaker.
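  • A minimal sketch of conditioning synthesis on a selected speaker signature, mirroring the look-up table of Fig. 16B; the table contents, the multi_speaker_tts.synthesize() interface, and the fallback voice are assumptions:

```python
# Hypothetical look-up table of speaker signatures (placeholder embedding vectors).
SPEAKER_SIGNATURES = {
    "speaker_a": [0.12, -0.48, 0.91],
    "speaker_b": [0.05, 0.33, -0.27],
}

def render_in_voice(text, speaker_id, multi_speaker_tts, default_id="speaker_a"):
    """Synthesize text in the voice pointed to by the selected speaker signature."""
    embedding = SPEAKER_SIGNATURES.get(speaker_id, SPEAKER_SIGNATURES[default_id])
    return multi_speaker_tts.synthesize(text, speaker_embedding=embedding)
```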
  • the at least one processor is programmed to associate the first voice with a first text segment included in the recognized text based on a context of the first text segment.
  • the at least one processor is programmed to determine the context of the first text segment using natural language processing (NLP).
  • the processor (e.g., 320, 520) may access a set of rules stored in the one or more memories 330 and/or one or more databases 650.
  • the processor may use one or more words, phrases, or sentences in a first portion of text recognized in the image of the written material to determine a context associated with the first portion of text.
  • the processor may apply one or more of the rules to determine a context associated with the first portion of the text.
  • the processor may determine whether the first text segment is being spoken by, for example, a younger person (e.g., a child) or an older person (e.g., a grandfather or grandmother). By way of another example, the processor may determine based on one or more of the stored rules whether the first text segment is a news article, a joke, a story, or a poem.
  • Fig. 17 illustrates an object 120 that may display or include text 410.
  • Camera 210 of apparatus 110 may be configured to capture one or more images of text 410 displayed by object 120.
  • processor 320 may recognize one or more words, phrases, or sentences in the one or more images of text 410 captured by camera 210.
  • As illustrated in Fig. 17, processor 320 may use one or more stored rules to identify the context associated with one or more portions of text in an image of the written material obtained by camera 210. For example, processor 320 may determine the context as one of “Party,” “Clown,” “Subject,” “News,” “Fiction,” “Non-Fiction,” or “Fabrication” from the set of contexts 1710.
  • the written material may include embedded tags that may identify the context of one or more portions or segments of the text in the written material.
  • the processor may recognize the embedded tags using one or more of the techniques described above for recognizing words and phrases from an image of the written material.
  • Fig. 17 illustrates an object 120 that may display or include text 410.
  • Camera 210 of apparatus 110 may be configured to capture one or more images of text 410 displayed by object 120.
  • processor 320 may recognize one or more words, phrases, or sentences in the one or more images of text 410 captured by camera 210.
  • As illustrated in Fig. 17, processor 320 may recognize text 1702 as including the tags “{/Begin Joke}” and “{/End Joke}” and may determine the context associated with the portion of the text: “The past, the present, and the future walked into a bar. It was tense,” as being a joke. Thus, when generating an audio signal representative of this portion of the text, processor 320 may use speaker embeddings associated with, for example, the voice of a comic based on the context being “joke.”
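  • A minimal sketch of recognizing such embedded tags, assuming the tag format shown in the example above (the regular expression and the function extract_tagged_segments are hypothetical), could be written as follows.

    # Hypothetical sketch: extract text segments bounded by {/Begin X} ... {/End X} tags.
    import re

    TAG_PATTERN = re.compile(r"\{/Begin (\w+)\}(.*?)\{/End \1\}", re.DOTALL)

    def extract_tagged_segments(text: str):
        # Returns (context, segment) pairs, e.g., ("Joke", "The past, the present, ...").
        return [(m.group(1), m.group(2).strip()) for m in TAG_PATTERN.finditer(text)]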
  • the processor may use other types of text processing and/or analysis to determine a context associated with a portion of text in written material.
  • the processor may use one or more trained machine learning models or neural networks as described above.
  • a set of training examples may include one or more images of text, and/or one or more words, phrases, and/or sentences of text together with a designation of the context associated with that portion of text.
  • a machine learning model or neural network may be trained to identify the context associated with a piece of text based on these and/or other training examples.
  • the trained machine learning model or neural network may output the context when presented with an image depicting a portion of text and/or the recognized words, phrases, and/or sentences from that portion of text.
  • a trained machine learning model or neural network for identifying the context associated with a portion of text in an image may be a separate and distinct machine learning model or neural network or may be an integral part of one or more other machine learning models or neural networks discussed above.
  • the at least one processor is programmed to associate the second voice with a second text segment included in the recognized text based on a context of the second text segment, wherein the first text segment and the second text segment do not overlap.
  • the at least one processor is programmed to determine the context of the second text segment using natural language processing (NLP).
  • the first portion of the text and the second portion of the text may not overlap and may be completely distinct from each other.
  • alternatively, the first and second portions of the text may overlap and may, therefore, have one or more common words, phrases, or sentences.
  • the techniques for determining context as described above are equally applicable to both overlapping and non-overlapping portions of text.
  • a “bank” of generic voices may be used, such that the reading voice may change in accordance with the context, even if no predetermined speaker has been selected. For example, in written material representing a dialog between a grandfather and a grandson, the grandfather text may be read in a voice of an old man and the grandson in a voice of a young child.
  • the processor (e.g., 320, 520) may select the voice (e.g., the speaker embedding for generating the audio signal) using a trained machine learning model or neural network.
  • a set of training examples may include one or more contexts together with a designation of associated speaker embeddings.
  • a machine learning model or neural network may be trained to select a speaker embedding for a particular context based on these and/or other training examples. Further, the trained machine learning model or neural network may output the speaker embedding when presented with a particular context.
  • a trained machine learning model or neural network for selecting speaker embeddings for a particular context may be a separate and distinct machine learning model or neural network or may be an integral part of one or more other machine learning models or neural networks discussed above.
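  • A minimal sketch of mapping a determined context to a speaker embedding (a simple lookup standing in for the trained model described above; the table CONTEXT_TO_SPEAKER and the function select_embedding are hypothetical) is shown below.

    # Hypothetical sketch: choose a speaker embedding for a determined context.
    CONTEXT_TO_SPEAKER = {
        "joke": "comic_voice",
        "news": "anchor_voice",
        "grandfather_dialog": "old_man_voice",
        "grandson_dialog": "young_child_voice",
    }

    def select_embedding(context: str, embeddings: dict, default: str = "neutral_voice"):
        # A trained model could replace this lookup; the table stands in for its output.
        speaker_id = CONTEXT_TO_SPEAKER.get(context, default)
        return embeddings[speaker_id]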
  • the audio signal is generated to convey a sentiment determined based on analysis of the recognized text.
  • the processor may generate an audio signal representative of an image of written material such that when the audio signal is played by an audio output device, the user may hear the written material as being read in an expressionistic manner, representing the context or sentiment of the text.
  • the sentiment may be reflected in the audio signal with or without also selecting a particular speaker embedding (e.g. corresponding to a particular speaker’s voice).
  • the processor may determine the sentiment by analyzing the text recognized in the image of the written material.
  • the sentiment is at least one of sad, happy, concerned, afraid, or amused.
  • the processor may determine the sentiment based on recognizing certain words or phrases in the written material.
  • the processor may determine that the sentiment is sad when the processor recognizes the word “sad” or “mourn” in the written material, for example, in the sentence: “I am so sad.”
  • the processor may determine that the sentiment is happy when the processor recognizes the word “happy” in the written material, for example, in the sentence: “John is happy.” It is contemplated that the processor may determine sentiments other than sad or happy based on recognizing one or more words or phrases associated with other sentiments. It is also contemplated that the processor may determine one or more of the sentiments based on detecting words other than those described in the above examples.
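  • A minimal sketch of such keyword-based sentiment detection (the table SENTIMENT_KEYWORDS and the function detect_sentiment are hypothetical) could look like the following.

    # Hypothetical sketch: keyword-based sentiment detection over recognized text.
    SENTIMENT_KEYWORDS = {
        "sad": ["sad", "mourn"],
        "happy": ["happy", "joy"],
        "afraid": ["afraid", "terrified"],
    }

    def detect_sentiment(sentence: str) -> str:
        words = [word.strip(".,!?").lower() for word in sentence.split()]
        for sentiment, keywords in SENTIMENT_KEYWORDS.items():
            if any(word in keywords for word in words):
                return sentiment
        return "neutral"   # falls back to a neutral sentiment, as discussed below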
  • the processor may determine the sentiment associated with a portion of text based on one or more rules stored, for example, in one or more memories 330 and/or databases 650.
  • Fig. 17 illustrates an object 120 that may display or include text 410.
  • Camera 210 of apparatus 110 may be configured to capture one or more images of text 410 displayed by object 120.
  • processor 320 may recognize one or more words, phrases or sentences in the one or more images of text 410 captured by camera 210.
  • Processor 320 may use one or more stored rules to identify the sentiment associated with one or more portions of text in an image of the written material obtained by camera 210. For example, as illustrated in Fig. 17, processor 320 may determine the sentiment as one of “Happiness,” “Sadness,” “Sorrow,” “Anger,” “Fear,” “Laughter,” or “Love” in the set of sentiments 1720.
  • the sentiment may be determined based on a context associated with a larger picture, for example, a book or a chapter.
  • written material representative of a clown giving a candy to a child may be associated with a “happy” sentiment, but may also be associated with the sentiment “horror” when the context of the entire chapter or book is “horror” (e.g., in a scary or horror book).
  • the processor may determine the sentiment based on the author of the written material or the publication in which the written material appears.
  • written material that may be part of a paper may be associated with a “serious” sentiment.
  • written material appearing in a humor column may be associated with the sentiment “comedy” or “happy.”
  • the tone or sentiment to be selected for generating the audio signal corresponding to the written material may change in different portions of the reading material.
  • the sentiment may be neutral in some portions of the written material and may instead express happiness, grief, fear, or laughter in other portions of the written material.
  • the sentiment is a neutral sentiment.
  • the sentiment associated with a portion of text may be determined to be a neutral sentiment, when that portion of the text is not associated with any of the other predetermined set of sentiments.
  • a portion of text that is not associated with any of the sentiments “Happiness,” “Sadness,” “Sorrow,” “Anger,” “Fear,” “Laughter,” or “Love” would be determined as having a neutral sentiment.
  • the sentiment is determined using an artificial intelligence engine.
  • the processor (e.g., 320, 520) may determine the sentiment associated with the written material depicted in the image using a trained machine learning model or neural network.
  • a set of training examples may include one or more portions of text recognized by processor 320 from an image of written material captured by, for example, camera 210 together with an associated sentiment.
  • the training examples may include a plurality of passages of text, each with a designated sentiment, such as, sadness, happiness, worry, fear, amusement, laughter, family, strangers, social setting, work setting, or any other sentiment that characterizes the text.
  • a machine learning model or neural network may be trained to determine the sentiment associated with a passage of text based on these and/or other training examples. Further, the trained machine learning model or neural network may output the sentiment when presented with textual material and/or an image depicting the textual material.
  • a trained machine learning model or neural network for determining the sentiment associated with a passage of text may be a separate and distinct machine learning model or neural network or may be an integral part of one or more other machine learning models or neural networks discussed above.
  • once the sentiment has been determined, the reading with that sentiment may be performed by an AI engine trained and operating on different sentiments/behaviors (e.g., sad, happy, concerned, afraid, amused, etc.), similar to multi-speaker text-to-speech engine 1628 described above.
  • the sentiment is expressed in the text as one or more tags.
  • the text itself may contain indications of the required sentiment, for example tags such as “<start_laugh> ‘funny text’ <end_laugh>,” where “funny text” would include the text in the written material corresponding to the joke or humorous text.
  • Fig. 17 illustrates an object 120 that may display text 410.
  • Camera 210 of apparatus 110 may be configured to capture one or more images of text 410 displayed by object 120.
  • processor 320 may recognize one or more words, phrases, or sentences in the one or more images of text 410 captured by camera 210.
  • As illustrated in Fig. 17, processor 320 may recognize text 1702 as including the tags “{/Begin Joke}” and “{/End Joke}” and may determine a sentiment associated with the portion of the text: “The past, the present, and the future walked into a bar. It was tense,” as being humorous. Thus, when generating an audio signal representative of this portion of the text, processor 320 may use speaker embeddings associated with, for example, the voice of a comic based on the sentiment being “humorous.” Other tags such as “{/Begin Horror}” and “{/End Horror}” or any other tags that may indicate a sentiment associated with the portion of the text bounded by the tags may also be used by the disclosed system.
  • the at least one processor may be programmed to output the at least one audio signal via the audio output device.
  • processor 320 of apparatus 110 may transmit the generated audio signal to an output device (e.g., audio feedback device 130).
  • audio output device 130 may include one or more of an over-the-ear audio headphone being worn by user 100, a built-in speaker, a stand-alone speaker, a portable speaker, a hearing aid device, a bone conduction headphone, a within-ear headphone (e.g., earbuds), a speaker associated with the secondary device, or any other device capable of playing audio signals.
  • Audio output device 130 in turn may play the audio signal to generate a sound that may represent speech by a reader. That is, the audio output device 130 may play the audio signal such that user 100 may hear the text recognized in the image captured by camera 210 as if it were being read by a reader to user 100.
  • Fig. 18A is a flowchart showing an exemplary process 1800 for reading written material to a user.
  • Process 1800 may be performed by one or more processors associated with apparatus 110, such as processor 320. Although the description below refers to processor 320, in some embodiments, some or all steps of process 1800 may be performed on processors external to apparatus 110. For example, one or more steps of process 1800 may be performed by processor 520 of secondary device 350 and/or a processor associated with server 380. It is also contemplated however, that all steps of process 1800 may be performed by any of or a combination of processor 320, processor 520, and/or a processor associated with server 380.
  • process 1800 includes receiving at least one image captured by the camera, the at least one image including a representation of written material.
  • apparatus 110 may comprise one or more cameras, such as camera 210, which may capture images of environment 400 of user 100.
  • Processor 320 of apparatus 110 may receive the one or more images captured by, for example, camera 210.
  • camera 210 may generate one or more image files including digital representations of the one or more images captured by camera 210.
  • These digital representations may be stored in a storage device associated with camera 210 (e.g., memory 330), in a storage device associated with secondary device 350 and/or server 380, and/or in one or more databases 650.
  • Processor 320 may be able to access and read the digital representations of the images stored by camera 210 from memory 330, and/or one or more storage devices associated with secondary device 350 and/or server 380.
  • process 1800 includes analyzing the at least one image to recognize text.
  • processor 320 may employ one or more techniques such as OCR, pattern matching, pattern recognition, or image correlation, and/or using trained machine learning models and/or neural networks to analyze the image of the written material obtained by camera 210 to recognize the text in the image.
  • Processor 320 may recognize single words, phrases, or even lines of text instead of focusing on single characters using one or more of the techniques described in this disclosure.
  • process 1800 includes performing text-to-speech conversion of the recognized text to generate at least one audio signal representing the recognized text.
  • the at least one audio signal is generated in at least one of a voice of a predetermined speaker.
  • processor 320 may execute one or more text-to-speech processing algorithms such as WaveNet, DeepVoice, Tacotron, natural language processing (NLP), or any other algorithm that converts a text input into an audio signal representing speech.
  • These text-to-speech modules may be stored in apparatus 110, in secondary computing device 350, on server 380, and/or in database 650.
  • Processor 320 may generate the audio signal such that when the audio signal is played by an audio output device (e.g., audio feedback device 130), the user may hear the written material in the image being read out aloud in a predetermined voice or in the voice of a particular speaker.
  • the voice of the particular speaker may be the voice of a favorite actor or actress, the voice of a cartoon character, the voice of a parent of the user, the voice of a teacher of the user, or the voice of a peer (e.g., colleague, friend, mentor, etc.). It is also contemplated that in some embodiments, the voice of the particular speaker may be the user’s own voice as the user would hear himself or herself.
  • the voice of the predetermined speaker may be based on a speaker type.
  • Speaker type may refer to a gender of a speaker (e.g., male, female), age of a speaker (e.g., adult, child), accent or geography of the speaker (e.g., French accent, British accent, Indian accent) or some other characteristic of the speaker.
  • process 1800 includes outputting the at least one audio signal via the audio output device.
  • processor 320 of apparatus 110 may transmit the generated audio signal to an output device (e.g., audio feedback device 130).
  • Audio output device 130 in turn may play the audio signal to generate a sound that may represent speech by a reader. That is, the audio output device 130 may play the audio signal such that user 100 may hear the text recognized in the image captured by camera 210 as if it were being read by a reader to user 100.
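  • A minimal end-to-end sketch of process 1800, using off-the-shelf libraries (pytesseract for text recognition and pyttsx3 for speech synthesis) only as stand-ins for the engines described above, could look like the following Python code.

    # Hypothetical sketch of process 1800 using off-the-shelf stand-ins.
    from PIL import Image
    import pytesseract          # OCR library standing in for the disclosed text recognition
    import pyttsx3              # offline TTS library standing in for the disclosed speech engine

    def read_written_material(image_path: str) -> None:
        image = Image.open(image_path)                 # receive the captured image (step 1802)
        text = pytesseract.image_to_string(image)      # analyze the image to recognize text (step 1804)
        engine = pyttsx3.init()                        # text-to-speech conversion
        engine.say(text)
        engine.runAndWait()                            # output the audio signal (step 1808)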
  • Fig. 18B is a flowchart showing an exemplary process 1820 for reading written material to a user.
  • Process 1820 may be performed by one or more processors associated with apparatus 110, such as processor 320. Although the description below refers to processor 320, in some embodiments, some or all steps of process 1820 may be performed on processors external to apparatus 110. For example, one or more steps of process 1820 may be performed by processor 520 of secondary device 350 and/or a processor associated with server 380. It is also contemplated however, that all steps of process 1820 may be performed by any of or a combination of processor 320, processor 520, and/or a processor associated with server 380.
  • process 1820 includes receiving at least one image captured by the camera, the at least one image including a representation of written material.
  • processor 320 may perform steps and employ techniques similar to those discussed above, for example, for step 1802 of method 1800.
  • process 1820 includes analyzing the at least one image to recognize text.
  • processor 320 may perform steps and employ techniques similar to those discussed above, for example, for step 1804 of method 1800.
  • process 1820 includes analyzing text to determine the sentiment or context. As discussed above, processor 320 may determine the sentiment or context by analyzing the text recognized in the image of the written material.
  • processor 320 may determine that the sentiment is sad when the processor recognizes the word “sad” in the written material, for example, in the sentence: “I am so sad.” By way of another example, the processor may determine that the sentiment is happy when the processor recognizes the word “happy” in the written material, for example, in the sentence: “John is happy.” It is contemplated that the processor may determine sentiments other than sad or happy based on recognizing one or more words or phrases associated with other sentiments. As also discussed above, processor 320 may additionally or alternatively determine the sentiment or context associated with a portion of text using one or more rules stored in the one or more memories 330 and/or databases 650, and/or by executing one or more trained machine learning models or neural networks.
  • process 1820 includes performing text-to-speech conversion of the recognized text to generate at least one audio signal to convey a sentiment determined based on analysis of the recognized text.
  • processor 320 may execute one or more text-to-speech processing algorithms such as WaveNet, DeepVoice, Tacotron, natural language processing (NLP), or any other algorithm that converts a text input into an audio signal representing speech.
  • These text-to-speech modules may be stored in apparatus 110, in secondary computing device 350, on server 380, and/or in database 650.
  • Processor 320 may generate an audio signal representative of an image of written material such that when the audio signal is played by an audio output device, the user may hear the written material as being read in an expressionistic manner, representing the context or sentiment of the text.
  • process 1820 includes outputting the at least one audio signal via the audio output device.
  • processor 320 may perform steps and employ techniques similar to those discussed above, for example, for step 1808 of method 1800.
  • a system such as handheld apparatus 110 or other device, may capture and/or analyze auditory and/or visual information from an environment of a user.
  • handheld apparatus 110 may include at least one camera (e.g., camera 210), which may be configured to capture one or more images from an environment of a user (e.g., in response to a user input and/or execution of a program).
  • Handheld apparatus 110 may also include at least one microphone (e.g., one or more microphones 230), which may be configured to capture sounds from the environment of the user.
  • Handheld apparatus 110 may receive and use information captured with these or other components, such as by using at least one processor (e.g., one or more processors 320).
  • At least one processor may be programmed to receive at least one image captured by a camera.
  • one or more processors 320 may execute a command causing camera 210 to capture an image.
  • multiple images may be captured.
  • the at least one image may include a representation of written material.
  • Written material may include text (e.g., alphanumeric characters), a character (including, for example, a space), a diacritic, a symbol, a diagram, a chart, a drawing or image, or any other visual information that may be read by a user.
  • written material may be presented by an object 120, discussed above.
  • a representation of written material may include the written material itself or information contained in or derived from the written material.
  • a representation of written material may include digital image data (e.g., at least one image) associated with an image of the written material.
  • user 100 may use handheld apparatus 110 together with object 120, which may include written material 1910.
  • Camera 210 may capture an image 1912, which may include a representation of written material 1910 (e.g., for further use, as discussed below).
  • image 1912 may be stored, such as within memory 330 or in another memory or storage device included in apparatus 110, or within a memory device of a remote server (e.g., server 380).
  • At least one processor may be programmed to analyze the at least one image to recognize text.
  • Text may include one or more of a letter, a character, a number, a digit, a symbol, a diacritic, a word, a phrase, a sentence, a grapheme, or any other visual information associated with written material.
  • Analyzing the at least one image to recognize text may include one or more of manipulating digital image data of the at least one image, such as de-blurring the digital image data, rotating the digital image data (e.g., deskewing), adjusting a color property of the digital image data (e.g., brightness, contrast, hue, or color), applying binarization to the digital image data, performing layout analysis (e.g., identifying zones of text) on the digital image data, interpolating digital image data (e.g., interpolating digital pixel information), transforming the digital image data, performing optical character recognition (OCR) on the digital image data, performing intelligent character recognition (ICR) on the digital image data, performing matrix matching with respect to the digital image data, performing feature extraction with respect to the digital image data, or performing any operation to enhance digital image data for recognition of text or other visual information.
  • analyzing the at least one image to recognize text may include comparing digital image data from the at least one image to a defined (e.g., known) portion of text (e.g., a character or a word). Additionally or alternatively, analyzing the at least one image to recognize text may include generating digital text data corresponding to the recognized text. For example, at least one processor may recognize a word in the at least one image, may determine a combination of characters (e.g., American Standard Code for Information Interchange, or ASCII, characters) corresponding to the recognized word, and may generate the combination of characters, which may be stored as digital text data. Generated digital text data may be compared to reference digital text data (e.g., digital text data representing a known word) to identify a word.
  • At least one processor may determine that the generated digital text data corresponds to the reference digital text data, thus recognizing that the reference digital text data is represented in the at least one image.
  • analyzing the at least one image to recognize text may include determining a language associated with recognized text. Additionally or alternatively, analyzing the at least one image to recognize text may include generating a digital audio data representation of the recognized text (e.g., which may be used in a subsequent comparison step, as discussed below).
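  • A minimal sketch of a few of the image-manipulation operations listed above (grayscale conversion and binarization before OCR), using OpenCV and pytesseract only as illustrative stand-ins, is shown below.

    # Hypothetical sketch: basic preprocessing (grayscale + binarization) before OCR.
    import cv2
    import pytesseract

    def recognize_text(image_path: str) -> str:
        image = cv2.imread(image_path)
        gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
        # Otsu binarization approximates the "applying binarization" step noted above.
        _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
        return pytesseract.image_to_string(binary)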
  • the above determinations of characters, words, etc. may be performed by applying to the captured images one or more artificial intelligence engines, such as but not limited to one or more classifiers, which may be trained to recognize portions of text within an image (e.g., trained to perform the analysis of the at least one image to recognize text).
  • the at least one processor may be programmed to receive at least one audio signal captured by (e.g., transduced by) a microphone.
  • the at least one audio signal may represent speech by the user.
  • Speech by the user may include one or more of spoken phonemes, syllables, words, phrases, or sentences.
  • Speech by the user may include one or more spoken portions of written material, such as a word, a phrase, a sentence, a paragraph, a column, a page, a chapter, or any other amount of text.
  • a user may or may not speak (i.e., read aloud) the entire presented text (e.g., the written material represented in the at least one image or the text recognized in the at least one image).
  • the at least one processor may receive the at least one audio signal and convert it into a digital recording of the speech by the user (e.g., a stored digital audio file).
  • user 100 may use handheld apparatus 110, which may include a microphone capable of capturing an audio signal 1922, which may represent speech 1920 of the user.
  • audio signal 1922 may be streamed or stored as a digital audio file or other medium (e.g., for subsequent analysis, such as discussed below).
  • the at least one processor may be programmed to analyze the at least one audio signal to recognize at least one first word in the speech by the user. Analyzing the at least one audio signal may include applying a filter (e.g., filtering out certain frequencies) to at least a portion of the at least one audio signal, applying active noise control to at least a portion of the at least one audio signal, segmenting the at least one audio signal (e.g., into separate portions associated with separate identified words), computing a discrete Fourier transform (DFT) of at least a portion of the at least one audio signal, computing an inverse discrete Fourier transform (IDFT) of at least a portion of the at least one audio signal, computing a fast Fourier transform (FFT) of at least a portion of the at least one audio signal, or comparing at least a portion of the at least one audio signal to a reference audio signal (e.g., stored digital audio data representing a particular known phoneme, syllable, word, or phrase, which may be associated with a reference audio signal).
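  • A minimal sketch of one such operation, comparing a segment of the audio signal to a reference signal via FFT magnitude spectra (the function spectral_similarity is hypothetical and assumes both signals share the same sampling rate), is shown below.

    # Hypothetical sketch: compare a spoken-word segment to a reference via FFT magnitudes.
    import numpy as np

    def spectral_similarity(segment: np.ndarray, reference: np.ndarray) -> float:
        n = min(len(segment), len(reference))
        seg_mag = np.abs(np.fft.rfft(segment[:n]))
        ref_mag = np.abs(np.fft.rfft(reference[:n]))
        # Cosine similarity of magnitude spectra; 1.0 means identical spectra.
        denom = np.linalg.norm(seg_mag) * np.linalg.norm(ref_mag)
        return float(np.dot(seg_mag, ref_mag) / denom) if denom else 0.0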
  • Recognizing at least one first word may include at least one of determining that at least a portion of the at least one audio signal matches (e.g., within a threshold) a reference audio signal, associating the at least one first word with at least a portion of the at least one audio signal, or generating digital text data representing the at least one first word.
  • recognizing at least one first word may include recognizing multiple words represented in the at least one audio signal, which may involve multiple comparisons (e.g., multiple comparisons between at least one audio signal and one or more reference signals).
  • At least one processor may be programmed to re-assess a recognition of one or more words and may re-designate at least a portion of the at least one audio signal to a different word (or words) from an initially recognized word (or words).
  • Re-designation may be based on contextual syntax rules, which may be generated based on user and/or machine input (e.g., machine-learned relationships). For example, audio that may initially be interpreted to recognize words of “Please go reading” may be re-assessed and re-designated as “Please go read in” based on analyzing subsequent audio, such as “your room” (e.g., when a user speaks “Please go read in your room.”).
  • a user may initially read a word incorrectly (which may lead to a non-match determination) but then correct the mis-reading by reading a word correctly (which may lead to a match determination).
  • at least one processor may be programmed to use re-designations to adjust feedback information or other information derived from one or more words (e.g., written or spoken), which may be accomplished in real time.
  • analyzing the at least one audio signal may be performed by applying to the audio signal one or more artificial intelligence engines, which may be trained to recognize words within the audio signal (e.g., trained to perform the analysis of the at least one audio signal to recognize at least one first word in the speech by the user).
  • the at least one processor may be programmed to compare the at least one first word with at least one second word in the recognized text (e.g., at least one second word recognized in the at least one image). Comparing the at least one first word with the at least one second word may include comparing first digital text data representing the at least one first word with second digital text data representing the at least one second word (e.g., to determine if the first and second text data are exact matches or matches within a character threshold) and/or comparing first digital audio data (e.g., a waveform) representing the at least one first word with second digital audio data (e.g., a waveform) representing the at least one second word (e.g., to determine if the first and second audio data are exact matches or matches within a sound similarity threshold).
  • At least one processor may recognize at least one first word 2010 in image 1912 and determine first digital text data representing the at least one first word 2010, consistent with disclosed embodiments.
  • the at least one processor may be programmed to extract the at least one first word 2010 from image 1912 or generate or access digital text data representing the at least one first word 2010.
  • at least one processor may be programmed to access or generate a digital audio data representation 2012 of the recognized text (e.g., an audio waveform), which may be used for a comparison.
  • the at least one processor may also be programmed to recognize a word represented in audio signal 1922, consistent with disclosed embodiments.
  • the at least one processor may be programmed to determine that a portion 2020 of audio signal 1922 is associated with at least one second word 2022, which may be represented by second digital text data.
  • the at least one second word 2022 may be the same as, or different from (as shown in Fig. 20), the at least one first word 2010.
  • at least one processor may be programmed to compare first digital text data representing the at least one first word 2010 with second digital text data representing the at least one second word 2022, consistent with disclosed embodiments. Additionally or alternatively, at least one processor may be programmed to compare digital audio data representation 2012 with portion 2020, consistent with disclosed embodiments.
  • In some embodiments, at least one processor may be programmed to determine whether the at least one first word matches the at least one second word.
  • At least one processor may be programmed to compare a sequence of characters within the first digital text data to another sequence of characters within the second digital text data and determine whether a threshold number of the characters within the first digital text data match characters within the second digital text data (e.g., matching a type of character and a position in a sequence).
  • at least one processor may be programmed to perform multiple comparisons based on a stream of images and/or a stream of audio signals (e.g., captured in real time by handheld apparatus 110).
  • the at least one processor may also be programmed to perform a comparison of already received information (e.g., at least one image and/or at least one audio signal), or other analysis operation, while additional information is being transduced and/or received (e.g., at least one image and/or at least one audio signal).
  • At least one processor may be further programmed to determine whether the at least one first word matches the at least one second word based on a metric for assessing a similarity between the at least one first word and the at least one second word.
  • a metric for assessing a similarity between the at least one first word and the at least one second word may include one or more values, ranges of values, statistics, parameters, or any other quantification of overlap between two words (e.g., overlap in characters and/or sounds, such as phonemes or syllables).
  • a metric for assessing a similarity between the at least one first word and the at least one second word may relate to comparisons between characters associated with the at least one first word and the at least one second word, such as an amount (e.g., percentage) of matching characters, words, phrases, or other character combinations.
  • at least one processor may be programmed to perform character-by-character comparisons (e.g., comparing digital text data) to determine character matches between words (e.g., whether a first character in the at least one first word matches or does not match a second character at a corresponding position within the at least one second word).
  • At least one processor may also be programmed to perform character-by-character comparisons (e.g., comparing digital text data) across a sequence of characters extending beyond a single word (e.g., determining character matches between characters in corresponding positions between phrases).
  • two words that have a same character sequence may be considered a match, and two words that do not have a same character sequence may not be considered a match.
  • at least one first word of “dog” may be considered to match at least one second word of “dog.”
  • at least one first word of “dog” may not be considered to match at least one second word of “dot” (i.e., due to the character difference between the “g” and the “t”).
  • a metric for assessing a similarity between the at least one first word and the at least one second word may include a percentage (or other quantification) match of characters between words, phrases, sentences, or any other character combination. For example, “I drive” may be considered to have a 100% match with “I drive,” whereas “I drove” may be considered to have an 83.33% match of written characters with “I drive” (5 of 6 characters match). Alternatively, “I drove” may be considered to have an 85.71% match of written characters with “I drive” (6 of 7 characters match, including spaces).
  • “I drove” may be considered to have a 50% match of words with “I drive” (the words “I” match, but “drive” and “drove” do not).
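  • A minimal sketch reproducing the character- and word-level percentages in the example above (the functions char_match and word_match are hypothetical) is shown below.

    # Hypothetical sketch: character- and word-level match percentages.
    def char_match(a: str, b: str, include_spaces: bool = True) -> float:
        if not include_spaces:
            a, b = a.replace(" ", ""), b.replace(" ", "")
        matches = sum(1 for x, y in zip(a, b) if x == y)
        return 100.0 * matches / max(len(a), len(b))

    def word_match(a: str, b: str) -> float:
        matches = sum(1 for x, y in zip(a.split(), b.split()) if x == y)
        return 100.0 * matches / max(len(a.split()), len(b.split()))

    # char_match("I drive", "I drove")                        -> 85.71 (6 of 7, spaces included)
    # char_match("I drive", "I drove", include_spaces=False)  -> 83.33 (5 of 6)
    # word_match("I drive", "I drove")                        -> 50.0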
  • words that differ but that share the same or a similar meaning (e.g., “I drive” and “I drove”) may be considered to have a higher degree of similarity based on the underlying meaning of the words. That is, even though certain words may differ, in some embodiments, words that otherwise share a similar (or the same) meaning may be deemed a match.
  • a metric for assessing a similarity between the at least one first word and the at least one second word may relate to an amount of digital audio data (e.g., audio waveform amplitude values, frequency values, or volume values) associated with the at least one first word that matches, or is within a similarity threshold of, digital audio data associated with the at least one second word.
  • Digital audio data may be associated with a particular combination of one or more syllables, phonemes, tones, inflections, intonations, words, phrases, or sentences.
  • At least one processor may be programmed to perform one or more comparisons between digital audio data associated with the at least one first word and digital audio data associated with the at least one second word, to determine a metric for assessing a similarity between the at least one first word and the at least one second word.
  • for example, the at least one processor may compare a first waveform (e.g., recorded speech from the user) associated with the at least one first word to a second waveform associated with the at least one second word, such as in the time and/or frequency domain.
  • points of different waveforms may be compared to determine a difference between them (e.g., an absolute or percentage difference).
  • a waveform (e.g., recorded speech from the user) may be compared to a waveform envelope (e.g., a waveform +/- a range of values) to determine an amount of the waveform falling within the waveform envelope.
  • a waveform having 80% of its values falling within or on a waveform envelope and 20% of its values falling outside the waveform envelope may be considered 80% similar (e.g., a similarity metric) with the waveform envelope, and at least one word associated with the waveform may be considered to be 80% similar to at least one word associated with the waveform envelope.
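  • A minimal sketch of such an envelope-based similarity metric (the function envelope_similarity is hypothetical and assumes the waveform and reference are equal-length NumPy arrays) is shown below.

    # Hypothetical sketch: fraction of a waveform falling within a reference envelope.
    import numpy as np

    def envelope_similarity(waveform: np.ndarray, reference: np.ndarray, margin: float) -> float:
        # The envelope is the reference waveform +/- a fixed margin of values.
        lower, upper = reference - margin, reference + margin
        inside = np.logical_and(waveform >= lower, waveform <= upper)
        return float(np.mean(inside))   # e.g., 0.8 -> 80% similar to the envelope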
  • a metric of similarity associated with digital audio data may be different from a metric of similarity associated with digital text data.
  • the words “red” and “read” may be pronounced very similarly (when “read” has a past-tense meaning, both can be pronounced as /red/) and have a high degree of overlap (e.g., similarity) in terms of a metric of sound similarity, but may have a lower degree of overlap regarding the characters they have in common (e.g., a metric of textual similarity), as “red” does not have an “a” (compared to “read”).
  • “read” and “read” are spelled exactly the same, and thus may have a high degree of overlap (e.g., similarity) in terms of a metric of textual similarity, but may be pronounced differently, for example as /rid/ or as /red/ (when written using International Phonetic Alphabet (IPA) transcription), thus having a lower degree of overlap in terms of a metric of sound similarity.
  • one type of a metric of similarity may be weighted more heavily than, or even overrule, another type of a metric of similarity.
  • a sound-based metric of similarity may supersede a visual text-based metric of similarity.
  • comparisons and/or matching may be performed by a trained artificial intelligence engine, which may assign a certainty score to the comparison and/or match.
  • an artificial intelligence engine may be trained to perform the determination of whether the at least one first word matches the at least one second word.
  • an artificial intelligence engine may be trained to perform at least one of the analysis of the at least one image to recognize text, the analysis of the at least one audio signal to recognize at least one first word in the speech by the user, or the determination of whether the at least one first word matches the at least one second word, consistent with disclosed embodiments.
  • the at least one processor may be programmed to determine that the at least one first word matches the at least one second word based on a comparison of the metric to a predetermined threshold.
  • a predetermined threshold may be a value, range, statistic, or other quantification related to an amount of similarity, which may be determined in advance of a comparison between data associated with at least one first word (e.g., digital text data, digital audio data), which may be associated with (e.g., determined based on) speech of a user, and data associated with at least one second word (e.g., at least one reference word).
  • a predetermined threshold may include a complete character-to-character match between words.
  • a metric of “100%” may indicate that at least one first word and at least one second word match completely, character-to-character, and this value, when compared to a predetermined threshold of, for example, “100%,” may cause the at least one processor to implement a responsive action (e.g., determining and/or presenting feedback information).
  • a predetermined threshold may include a proportion of a correspondence between digital audio data. For instance, a predetermined threshold may include a value of 90% of values falling within or on a waveform envelope.
  • a waveform (e.g., digital audio data representing a word spoken by a user) having a metric of 95% of its values falling within or on the waveform envelope and 5% of its values falling outside the waveform envelope may be compared to the predetermined threshold and be considered to satisfy the exemplary 90% threshold (e.g., by the at least one processor).
  • the at least one processor may be programmed to initiate a responsive action (e.g., determine and/or provide feedback information).
  • the at least one processor may be programmed to store the at least one first word, when the at least one first word does not match the at least one second word. In some embodiments, the at least one processor may be programmed to determine the at least one first word does not match the at least one second word when a comparison of a metric to a threshold indicates that the metric does not satisfy the threshold, as discussed above. In response to this determination, the at least one processor may be programmed to store the at least one first word. For example, the at least one processor may be programmed to store digital text data representing or including the at least one first word.
  • the at least one processor may be programmed to store the at least one first word in association with information, such as at least one of a user identifier (e.g., a user’s name, a username, a user account), the at least one second word, audio information (e.g., at least one audio signal corresponding to the at least one first word or the at least one second word), the matching score, feedback information, a timestamp, a date stamp, reading session metadata, or any other information related to the user or a reading session.
  • the at least one processor may be programmed to store the at least one first word in association with the at least one second word and feedback information indicating that the at least one first word does not match the at least one second word.
  • the at least one processor may be programmed to store a portion of the at least one audio signal corresponding to the at least one first word, when the at least one first word does not match the at least one second word. In some embodiments, the at least one processor may be programmed to determine the at least one first word does not match the at least one second word when a comparison of a metric to a threshold indicates that the metric does not satisfy the threshold, as discussed above. In response to this determination, the at least one processor may be programmed to store the at least one audio signal corresponding to the at least one first word.
  • the at least one processor may be programmed to determine that a portion of a waveform, audio file, audio signal, or other audio data corresponds to the at least one first word (e.g., using a reference audio signal, as discussed above), and may store that portion.
  • the at least one processor may be programmed to store the at least one audio signal corresponding to the at least one first word in association with other information, such as at least one of a user identifier (e.g., a user’s name, a username, a user account), at least one other audio signal corresponding to the at least one second word, text information (e.g., digital text data corresponding to or including the at least one first word or the at least one second word), feedback information, a timestamp, a date stamp, reading session metadata, or any other information related to the user or a reading session.
  • the at least one processor may be programmed to store the at least one audio signal corresponding to the at least one first word in association with an audio signal corresponding to the at least one second word (e.g., indicating a correct pronunciation of a word recognized in text) and feedback information indicating that the at least one first word does not match the at least one second word.
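  • A minimal sketch of one possible stored record (the dataclass MismatchRecord and its fields are hypothetical names chosen to mirror the items listed above) is shown below.

    # Hypothetical sketch: a record stored when a spoken word does not match the text.
    from dataclasses import dataclass, field
    from datetime import datetime

    @dataclass
    class MismatchRecord:
        user_id: str
        spoken_word: str                 # the at least one first word (from the audio signal)
        expected_word: str               # the at least one second word (from the image)
        audio_clip: bytes = b""          # portion of the audio signal for the spoken word
        matching_score: float = 0.0
        timestamp: datetime = field(default_factory=datetime.now)

    record = MismatchRecord(user_id="reader-1", spoken_word="dot",
                            expected_word="dog", matching_score=0.67)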
  • the at least one processor may be programmed to provide feedback information to the user based on determining whether the at least one first word matches the at least one second word (e.g., feedback information based on a current reading session).
  • Feedback information may include analysis results related to a user’s speech (e.g., relative to written material), a statistic, a recommendation, information describing a degree of accuracy of words (or syllables or other linguistic features) in a user’s speech relative to words (or syllables or other linguistic features) associated with written material (e.g., represented in an image), a recommendation for how to improve speech accuracy (e.g., audio information that may be output to broadcast a correct pronunciation, a suggestion to read more slowly, a linguistic trait of a language, a suggestion indicating a grammatical rule, a suggestion indicating a meaning of a word) or any information associated with improving a user’s reading and/or speaking abilities.
  • Providing feedback information may include causing the presentation of visual information (e.g., on a display), such as text or a graphic, causing the output of auditory information, causing a haptic response (e.g., a vibration of handheld apparatus 110), storing feedback information (e.g., statistics, metrics, recordings, analysis information, recommendations, or other information associated with a user’s speech), or otherwise giving a user access to information associated with speech or written material.
  • providing the feedback information may comprise transmitting the feedback information to one or more of a secondary device associated with the user, a secondary device associated with a teacher of the user, or a secondary device associated with a parent of the user.
  • a secondary device may be configured to access feedback information or other information associated with a user’s reading session.
  • at least one processor may be programmed to cause feedback information to be transmitted to a secondary device.
  • at least one processor may be programmed to store feedback information (e.g., within a memory device) and permit one or more secondary devices to access the feedback information, such as by using one or more of a link, a portal, or an authentication interface (e.g., requiring correct entry of a username and password to permit access to the feedback information).
  • the at least one processor may be programmed to cause the feedback information to be displayed via a dashboard on a display of the secondary device.
  • a dashboard may include one or more visualizations, such as text, graphs, charts, icons, pictograms (e.g., a visual medal, trophy, or other indicator of an achievement status).
  • a dashboard may include a chart showing a change in a metric over time, such as a type of feedback information (e.g., a number or proportion of words accurately read) or any other information indicating performance (e.g., a degree of accuracy) of a user reading text, discussed further below.
  • feedback information provided may be based on determining whether the at least one first word matches the at least one second word. For example, when the at least one first word matches the at least one second word, at least one processor may be programmed to provide first feedback information, and when the at least one first word does not match the at least one second word, at least one processor may be programmed to provide second feedback information, which may be different from the first feedback information. By way of a further example, when the at least one first word matches the at least one second word, at least one processor may be programmed to provide feedback information that includes an indication that the user correctly read or spoke a word (e.g., a word corresponding to the at least one first word and at least one second word).
  • At least one processor may be programmed to provide feedback information that includes an indication that the user incorrectly read or spoke a word (e.g., a word corresponding to the at least one first word and at least one second word).
  • the feedback information may include information indicating performance (e.g., a degree of accuracy) of a user reading text.
  • the feedback information may include at least one of a number or a percentage of words in the written material read correctly by the user, a reading fluency of the user, a reading speed of the user, a number of self-corrections or attempted corrections made by the user, a number of substitutions made by the user, a number of insertions made by the user (e.g., insertion of filler words or incorrect words), a number of omissions made by the user, or a number of hesitations of the user.
  • At least one processor may be programmed to determine any one of these feedback aspects based on digital text data and/or digital sound data comparisons, as discussed above. For example, at least one processor may be programmed to determine a number or a percentage of words in the written material read correctly by the user based on a number of words read by a user that match and/or do not match a word recognized from text (e.g., an image of text), consistent with disclosed embodiments.
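  • A minimal sketch of computing a few of these feedback metrics from per-word match results (the function reading_feedback is hypothetical) is shown below.

    # Hypothetical sketch: simple per-session feedback statistics.
    def reading_feedback(results: list, duration_minutes: float) -> dict:
        # results is a list of booleans: True if a spoken word matched the expected word.
        total = len(results)
        correct = sum(results)
        return {
            "words_read": total,
            "words_correct": correct,
            "accuracy_percent": 100.0 * correct / total if total else 0.0,
            "reading_speed_wpm": total / duration_minutes if duration_minutes else 0.0,
        }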
  • the at least one processor may be further programmed to analyze at least one image to recognize a punctuation mark adjacent to the at least one second word.
  • a punctuation mark may include a comma, a semicolon, a period, a dash, a hyphen, a quotation mark, a question mark, an accent mark, a diaeresis, an acute, a circumflex, a cedilla, a diacritic, or any other distinct orthographic feature.
  • the at least one processor may be further programmed to analyze the at least one audio signal to recognize a pause after the recognized at least one first word.
  • a pause may include audio information below a threshold decibel level, a portion of digital audio information (e.g., recorded speech data) where no words are recognized, a portion of an audio waveform with an amplitude below a threshold amplitude level, or any other indication of a lack of vocalization from the user.
  • Analyzing the at least one audio signal to recognize a pause after the recognized at least one first word may include one or more of: detecting a time period associated with relative silence (e.g., no detectable words), comparing a time period associated with the pause to a threshold time period, comparing a waveform associated with the pause to a reference waveform, determining a period of time between the end of the first word as spoken by the user and a next word following the first word, or performing any other digital audio operation to identify auditory gaps between words, phonemes, or morphemes.
  • the at least one processor may be further programmed to compare the pause with a threshold amount of time associated with the recognized punctuation mark.
  • a threshold amount of time associated with the recognized punctuation mark may include a predetermined amount of one or more of milliseconds, centiseconds, deciseconds, seconds, minutes, or any other quantification of time.
  • Different thresholds of time may be associated (e.g., within a data structure in memory) with different punctuation marks. For example, a comma may be associated with a shorter threshold of time than a period.
  • Comparing the pause (e.g., a time period associated with the pause) with the threshold amount of time associated with the recognized punctuation mark may include determining if a time period associated with the pause (e.g., a duration of the pause) reaches the threshold amount of time associated with the recognized punctuation mark, exceeds the threshold amount of time associated with the recognized punctuation mark, does not exceed the threshold amount of time associated with the recognized punctuation mark, is within a threshold range of the threshold amount of time associated with the recognized punctuation mark, or is not within a threshold range of the threshold amount of time associated with the recognized punctuation mark.
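  • A minimal sketch of such a comparison (the threshold values and the function pause_feedback are hypothetical) is shown below.

    # Hypothetical sketch: compare a detected pause against punctuation-specific thresholds.
    PAUSE_THRESHOLDS_SECONDS = {",": 0.2, ";": 0.3, ".": 0.5, "?": 0.5}
    TOLERANCE_SECONDS = 0.15

    def pause_feedback(punctuation: str, pause_duration: float) -> str:
        threshold = PAUSE_THRESHOLDS_SECONDS.get(punctuation)
        if threshold is None:
            return "no punctuation-specific expectation"
        if abs(pause_duration - threshold) <= TOLERANCE_SECONDS:
            return "pause of appropriate duration"
        return "pause too short" if pause_duration < threshold else "pause too long"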
  • the at least one processor may implement a responsive action. For example, in some embodiments, the at least one processor may be further programmed to include additional information in the feedback information (discussed above) based on the comparison of the pause to the threshold amount of time. By way of further example, when the at least one processor determines that the time period associated with the pause is within a threshold range of the threshold amount of time associated with the recognized punctuation mark, the at least one processor may include additional information in the feedback information that indicates to the user that the pause was of appropriate duration.
  • the at least one processor may include additional information in the feedback information that indicates to the user that the pause was not of appropriate duration. While the information discussed here is referred to as “additional” information, in some embodiments, the feedback information may also not include other information, such that only information based on the comparison of the pause to the threshold amount of time is included.
  • the feedback information may include an indication of proper usage of punctuation by the user and/or an indication of improper usage of punctuation by the user.
  • Proper usage of punctuation may include a proper pause, proper tone, proper intonation, proper inflection, or other phonetic feature of a user’s speech, any of which may be associated with particular punctuation, including a comma, a dash, a hyphen, a semicolon, a period, a quotation mark, a diacritic, or any type of punctuation mark, as discussed above.
  • the at least one processor may analyze a pause, tone, intonation, inflection, or other phonetic feature represented within digital audio information, to determine the indication.
  • At least one processor may be programmed to perform one or more comparisons between digital audio data associated with at least one word spoken by the user and digital audio data associated with at least one reference word, such as by performing one or more of: comparing waveforms (e.g., as discussed above), comparing frequency information (e.g., frequency values), comparing occurrences of frequencies, comparing time periods of frequencies, comparing sequences of frequencies, comparing time periods of pauses, comparing occurrences of pauses, comparing any other portions of audio information capable of indicating phonetic features, or performing any other audio analysis operation discussed above.
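As a hedged illustration of one such comparison, the sketch below computes a peak normalized cross-correlation between a user waveform and a reference waveform using NumPy; the sample rate and synthetic signals are placeholders, not data from the disclosure.

```python
# Illustrative sketch: comparing two mono audio waveforms by normalized
# cross-correlation. The signals below are synthetic placeholders.
import numpy as np

def waveform_similarity(user: np.ndarray, reference: np.ndarray) -> float:
    """Peak normalized cross-correlation between two mono waveforms (roughly 0..1)."""
    user = (user - user.mean()) / (user.std() + 1e-9)
    reference = (reference - reference.mean()) / (reference.std() + 1e-9)
    corr = np.correlate(user, reference, mode="full")
    return float(np.max(np.abs(corr)) / len(reference))

sample_rate = 16_000                              # assumed sampling rate
t = np.arange(0, 0.5, 1 / sample_rate)
reference = np.sin(2 * np.pi * 220 * t)           # stand-in reference word audio
user = 0.8 * np.sin(2 * np.pi * 220 * t + 0.1)    # stand-in user recording
print(f"similarity: {waveform_similarity(user, reference):.2f}")
```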
  • pause information may be determined using a threshold amount of time, as discussed above (e.g., analyzing at least one image to recognize a punctuation mark adjacent to the at least one second word, analyzing the at least one audio signal to recognize a pause after the recognized at least one first word, comparing the pause with a threshold amount of time associated with the recognized punctuation mark, and including additional information in the feedback information based on the comparison of the pause to the threshold amount of time).
  • the digital audio data associated with at least one reference word may be determined based on image analysis, such as identification of a particular diacritic, identification of a particular letter-punctuation combination, identification of a particular letter-diacritic combination, or any other combination of one or more orthographic features.
  • the feedback information may include an indication of similar information, but unrelated to punctuation.
  • the feedback information may include an indication of whether a user properly palatalized a word or phrase based on analysis of digital audio information (e.g., through comparisons of digital audio data, as discussed above).
  • the threshold may be learned (e.g., using machine learning, such as a trained artificial intelligence engine) and/or personalized based on a time duration determined from one or more past interactions of a particular user with the system (e.g., a particular user’s typical or historical pause or silence between words).
  • the reading speed may be determined based on the number of words spoken or the number of words correctly spoken by the user per minute.
  • at least one processor may be programmed to determine a reading speed of the user based on a number of syllables read per time unit (e.g., per second, per minute), a number of words read per time unit, a number of phrases read per time unit, a number of sentences read per time unit, a number or proportion of correct syllables (e.g., syllables corresponding to matched data) read per time unit, a number or proportion of correct words (e.g., words corresponding to matched data) read per time unit, a number or proportion of correct phrases (e.g., phrases corresponding to matched data) read per time unit, or a number or proportion of correct sentences (e.g., sentences corresponding to matched data) read per time unit.
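A minimal sketch of such per-minute metrics, assuming the word list, match flags, and duration are supplied by earlier audio and image analysis:

```python
# Illustrative sketch: reading-speed metrics per minute. Word lists and
# durations are hypothetical.

def words_per_minute(num_words: int, duration_s: float) -> float:
    return num_words / (duration_s / 60.0)

def correct_words_per_minute(spoken: list[str], matched: list[bool],
                             duration_s: float) -> float:
    """Count only words flagged as matching the recognized text."""
    correct = sum(1 for word, ok in zip(spoken, matched) if ok)
    return words_per_minute(correct, duration_s)

spoken = ["the", "cat", "sat", "on", "the", "mat"]
matched = [True, True, False, True, True, True]          # "sat" mispronounced
print(words_per_minute(len(spoken), 12.0))               # 30.0 words per minute
print(correct_words_per_minute(spoken, matched, 12.0))   # 25.0
```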
  • At least one processor may be programmed to determine a self-correction or attempted correction made by the user when a word analyzed from the user’s speech matches a word recognized in the at least one image but the matching word occurs out of sequence in the user’s speech relative to the recognized text.
  • at least one processor may be programmed to determine an insertion from the user by recognizing a spoken word (e.g., in digital audio data) that is not recognized in a portion of text (e.g., analyzed from an image) and/or may determine an omission from the user by recognizing a word in text (e.g., analyzed from an image) that is not recognized in speech from the user (e.g., within digital audio data).
  • At least one processor may also be programmed to determine a number of hesitations of the user based on a pause (e.g., a portion of time where no words are recognized within recorded digital sound data) and/or filler words (e.g., non-matching words, “uh,” “um,” “eh,” or “er”) identified between recognized spoken words (which may be spoken correctly and/or incorrectly).
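One possible way to surface omissions, insertions, and mismatches is a sequence alignment between the recognized text and the spoken words; the sketch below uses Python's standard-library difflib, and the example word lists are hypothetical.

```python
# Illustrative sketch: detecting insertions and omissions by aligning the
# spoken word sequence with the recognized text.
from difflib import SequenceMatcher

def align_words(text_words: list[str], spoken_words: list[str]):
    events = []
    matcher = SequenceMatcher(a=text_words, b=spoken_words, autojunk=False)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "delete":
            events.append(("omission", text_words[i1:i2]))
        elif op == "insert":
            events.append(("insertion", spoken_words[j1:j2]))
        elif op == "replace":
            events.append(("mismatch", text_words[i1:i2], spoken_words[j1:j2]))
    return events

text = ["the", "quick", "brown", "fox", "jumps"]
spoken = ["the", "uh", "quick", "fox", "jumps"]   # filler word, omitted "brown"
print(align_words(text, spoken))
```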
  • At least one processor may be programmed to determine a reading fluency associated with the user, which may be based on one or more metrics associated with word comparisons, including numbers or proportions of matching words (e.g., correctly spoken words), numbers or proportions of nonmatching words (e.g., incorrectly spoken words), omitted words, inserted words, stutters, hesitations, or any other indication of how closely a user’s speech resembled recognized words (e.g., analyzed from at least one image).
  • a reading fluency may be determined based on a pause (e.g., a time period where no words are recognized) between successive words spoken by the user.
  • At least one processor may be programmed to derive trends or other information related to metrics based on word comparisons. For example, the at least one processor may be programmed to determine that one or more mispronunciations (e.g., one or more non-matching word situations) relate to one or more particular syllables and/or phonemes, and may include those one or more particular syllables and/or phonemes as part of the feedback information, such as with pre-recorded audio information corresponding to correct pronunciation of a syllable, phoneme, or word.
  • the feedback information may include information from one or more previous reading sessions.
  • a reading session may be a time period associated with a user reading a portion of text (e.g., one or more words, phrases, sentences, paragraphs, pages, chapters) and/or a contiguous time period during which audio information was recorded (e.g., by one or more microphones 230).
  • Information associated with a reading session may include recorded audio information from the reading session, one or more words recognized during the reading session (e.g., words recognized from image analysis and/or words recognized from audio signal analysis), information indicating performance of a user reading text (as discussed above), metadata associated with the reading session (e.g., metadata indicating one or more of a language of recognized words, a type of written material, a time when written material was read, a date when written material was read, a title associated with written material read during the reading session, an author associated with written material read during the reading session, or a reading difficulty associated with written material read during the reading session), or any information gathered during or derived from the reading session. Consistent with disclosed embodiments, not all feedback information need be based on determining whether the at least one first word matches the at least one second word.
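Purely as an illustrative data layout (the field names are assumptions, not terms from the disclosure), a reading-session record of the kind described above might look like:

```python
# Illustrative sketch: one possible record for information associated with a
# reading session. Field names and example values are hypothetical.
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class ReadingSession:
    started_at: datetime
    language: str
    material_title: str
    material_author: str
    reading_difficulty: str
    recognized_text_words: list[str] = field(default_factory=list)
    recognized_spoken_words: list[str] = field(default_factory=list)
    accuracy_percent: float | None = None   # performance metric, if computed

session = ReadingSession(
    started_at=datetime(2022, 12, 1, 18, 30),
    language="en",
    material_title="Example Reader",
    material_author="A. Author",
    reading_difficulty="grade 3",
)
```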
  • the at least one processor may be programmed to provide the feedback information at different times and/or in response to one or more different determinations.
  • at least one processor may be programmed to provide the feedback information while the user is speaking the written material (e.g., speaking written material captured in at least one image).
  • the at least one processor may be programmed to produce an output (e.g., an audible sound to a speaker, a haptic vibration to a haptic device, or a visual element to a display) to indicate feedback information to the user.
  • the at least one processor may be programmed to provide the feedback information after one of an end of a paragraph, an end of a page, an end of a chapter, or when an input is received from the user.
  • the at least one processor may also be programmed to provide the feedback information after one of an end of a sentence or an end of a column (e.g., on a newspaper).
  • At least one processor may be programmed to analyze at least one image to identify an end of a sentence, an end of a paragraph, an end of a page, or an end of a chapter, such as by identifying an information divider, which may include one or more of a period, an indentation, a page number (e.g., identifying a number surrounded by a degree of space near an edge of a page), an area lacking text on a page (e.g., where a chapter ends partway on a page), an edge of a page, or any other visual boundary dividing pieces of information.
  • the at least one processor may be programmed to identify the end of a text portion (e.g., the end of a sentence, a paragraph, a page, a chapter, or a column), determine when a user has read up to the end of the text portion (e.g., based on comparing identified words spoken by the user to recognized words from at least one image), and, when that determination is made, provide the feedback information to the user. Additionally or alternatively, at least one processor may be programmed to provide the feedback information when an input is received from the user (e.g., in response to an input received from the user).
  • the at least one processor may detect a press, touch, turning, shaking, movement, or other detectable human input at an input component or device, such as power button 216 or trigger button 224.
  • at least one processor may determine that the user has completed speaking the written material in response to receiving a user input.
  • at least one processor may be programmed to determine that the user has completed speaking the written material (discussed further below) in response to detecting, within an audio signal, at least one trigger word from the user (e.g., “done”).
  • the at least one processor may be programmed to provide the feedback information after the user has completed reading the written material.
  • Completion of reading written material may include reading from a beginning of a written material to its end and/or reading a particular portion of the written material (e.g., a page, a chapter, an article, or a column).
  • the at least one processor may be programmed to determine, such as based on one of the information dividers discussed above and/or user input, that the user has completed reading the written material, and in response to this determination, may provide the feedback information.
  • the at least one processor may be programmed to determine that the user has completed speaking the written material when no speech is detected in the at least one audio signal for a predetermined amount of time.
  • the at least one processor may be programmed to determine that no speech is detected in the at least one audio signal when at least a portion of the at least one audio signal indicates at least one of audio information below a threshold decibel amount, an audio waveform with an amplitude below a threshold value, or audio information from which the at least one processor was unable to recognize a threshold number of particular spoken words (including, for example, a threshold of at least one recognized word).
  • a predetermined amount of time may include a combination of one or more of milliseconds, centiseconds, deciseconds, seconds, minutes, or any other quantification of time.
  • the at least one processor may be programmed to determine that the at least one audio signal lacks audio information above a threshold decibel amount for five seconds, and in response may determine that the user has completed speaking the written material, such as an article, a paragraph, a chapter, a book, a page, or any other amount of text, consistent with disclosed embodiments.
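A rough sketch of such a completion check, assuming mono audio samples normalized to [-1, 1]; the decibel threshold, window size, and five-second requirement mirror the example above but are otherwise arbitrary assumptions.

```python
# Illustrative sketch: deciding that the user has finished speaking when the
# signal stays below a decibel threshold for a predetermined time.
import numpy as np

def speech_complete(samples: np.ndarray, sample_rate: int,
                    silence_db: float = -40.0, required_s: float = 5.0) -> bool:
    window = int(0.1 * sample_rate)            # 100 ms analysis windows
    needed_windows = int(required_s / 0.1)
    quiet_run = 0
    for start in range(0, len(samples) - window + 1, window):
        frame = samples[start:start + window]
        rms = np.sqrt(np.mean(frame ** 2)) + 1e-12
        level_db = 20 * np.log10(rms)          # dBFS for samples in [-1, 1]
        quiet_run = quiet_run + 1 if level_db < silence_db else 0
        if quiet_run >= needed_windows:
            return True
    return False

rate = 16_000
silence = np.zeros(rate * 6)                   # six seconds of silence
print(speech_complete(silence, rate))          # True
```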
  • a system or device configured to carry out operations of the disclosed embodiments may include an audio output device for outputting audio signals.
  • An audio output device may include a speaker, headphone, hearing aid interface device, voice coil, or other type of transducer (e.g., an electroacoustic transducer).
  • an audio output device may be connected to handheld apparatus 110 using audio outlet 228.
  • An audio signal may include one or more sound waves, which may be produced in response to activation of the audio output device according to stored signal information.
  • at least one processor may be programmed to store the at least one audio signal.
  • the at least one processor may be programmed to store digital information generated using a transducer (e.g., a microphone).
  • an audio signal may be captured by a microphone and/or may represent speech by the user, consistent with disclosed embodiments.
  • a stored audio signal may be a digital recording of speech by the user.
  • the at least one processor may be programmed to output the at least one audio signal via the audio output device.
  • handheld apparatus 110 may transduce a signal captured by a microphone (e.g., communicably coupled with handheld apparatus 110), such as microphone 340, into digital audio information, and may store the digital audio information (e.g., within memory 320 and/or memory 522).
  • the at least one processor may be programmed to highlight a word that is being played by the audio output device. Highlighting a word may include projecting light onto a portion of an object (e.g., object 120) corresponding to a portion of an image where at least one word has been recognized, consistent with disclosed embodiments.
  • light may be projected from LED light 212 and/or targeting laser 214 on or around at least one word (e.g., as underlining or a box outline, or directly on the at least one word), such as to draw attention to the at least one word.
  • at least one word may be highlighted on a device on which it is displayed (e.g., by the device).
  • at least one word may be displayed on a secondary device (e.g., on a display separate from handheld apparatus 110, such as on a mobile device, such as a laptop or mobile phone), such as within an application, and the application may be configured to highlight the at least one word (e.g., in response to a signal transmitted by handheld apparatus 110).
  • a secondary device may be communicably coupled with a handheld apparatus 110 and/or integrated as part of it.
  • the at least one word may be displayed on a secondary device within an image captured by a camera (e.g., an image of written material), consistent with disclosed embodiments.
  • at least one processor may be programmed to determine at least one word and may cause an audio output device to play an audio signal corresponding to the at least one word while also causing the highlighting of the at least one word on an object.
  • the audio signal played by the audio output device may correspond to audio information recorded from the user (e.g., speech), which may or may not match a particular word (or multiple words), and/or pre-recorded audio information, which may correspond to correct pronunciation of a word, such as a word recognized in at least one image, consistent with disclosed embodiments.
  • Fig. 21 is a flowchart showing an example process 2100 for processing audio and image signals, consistent with the disclosed embodiments.
  • Process 2100 may be performed by at least one processing device, such as processor 320. In some embodiments, some or all of process 2100 may be performed by a different device, such as at least one processor within secondary communications device 350 and/or server 380.
  • The term “processor” is used as shorthand for “at least one processor.” In other words, a processor may include one or more structures that perform logic operations, whether such structures are collocated, connected, or dispersed.
  • a non-transitory computer-readable medium may contain instructions that when executed by a processor cause the processor to perform process 2100.
  • process 2100 is not necessarily limited to the steps shown in Fig. 21, and any steps or processes of the various embodiments described throughout the present disclosure may also be included in process 2100. Moreover, any step of process 2100 may be omitted, duplicated, re-arranged, and/or repeated.
  • process 2100 may include receiving at least one image captured by the camera, the at least one image including a representation of written material.
  • camera 210 may capture an image 1912, which may include a representation of written material 1910.
  • the at least one image may include text (e.g., at least one word), as discussed above.
  • process 2100 may include analyzing the at least one image to recognize text. For example, this may include identifying at least one particular word represented within the at least one image, which may include performing an OCR operation to digital image data. This may also include determining digital text data corresponding to the at least one word, as described above.
  • LED light 212 and/or targeting laser 214 may be configured to highlight one or more words recognized in the image.
  • process 2100 may include receiving at least one audio signal captured by the microphone, the at least one audio signal representing speech by the user.
  • a handheld apparatus 110 may include a microphone or other component capable of transducing at least one audio signal produced by the user into storable digital information. This may include storing a digitally recorded version of speech from the user, as described above.
  • process 2100 may include analyzing the at least one audio signal to recognize at least one first word in the speech by the user.
  • at least one processor may be programmed to compare at least one audio signal including a representation of at least one first word to a reference audio signal. Recognizing at least one first word may include recognizing multiple words, as described above.
  • process 2100 may include comparing the at least one first word with at least one second word in the recognized text. If a word was highlighted within the text, the at least one first word may be compared to the highlighted word.
  • at least one processor may be programmed to compare digital text data and/or digital audio data corresponding to the at least one first word with digital text data and/or digital audio data corresponding to the at least one second word in the recognized text. This may include comparing characters between portions of digital text data and/or comparing at least one waveform of digital audio data, as described above.
  • process 2100 may include determining whether the at least one first word matches the at least one second word. For example, at least one processor may be programmed to determine, based on a comparison, such as discussed above, that a metric satisfies a threshold. In response to such a determination, at least one processor may be programmed to determine that the at least one first word matches the at least one second word, as described above.
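For example, a match decision based on a metric and a threshold could be sketched as below, using a character-level similarity ratio; the 0.8 threshold is an assumption for illustration.

```python
# Illustrative sketch: treating two words as a match when a similarity metric
# satisfies a threshold.
from difflib import SequenceMatcher

def words_match(first: str, second: str, threshold: float = 0.8) -> bool:
    """Character-level similarity ratio compared against a threshold."""
    ratio = SequenceMatcher(a=first.lower(), b=second.lower()).ratio()
    return ratio >= threshold

print(words_match("reading", "reeding"))   # True  (ratio is roughly 0.86)
print(words_match("reading", "writing"))   # False
```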
  • process 2100 may include providing feedback information to the user based on determining whether the at least one first word matches the at least one second word.
  • at least one processor may be programmed to determine whether a word read by a user was correct relative to a word recognized from text to determine feedback information to be provided. This may include causing the presentation of visual information corresponding to the feedback information on a display.
  • a system, such as handheld apparatus 110 or another device, may capture, detect, and/or analyze visual information.
  • handheld apparatus 110 may include at least one camera (e.g., camera 210), which may be configured to capture one or more images from an environment of a user (e.g., in response to a user input and/or execution of a program).
  • the system may also comprise at least one processor (e.g., one or more processors 320), which may be programmed to perform one or more operations, as discussed herein.
  • the system may include an audio output device for outputting audio signals (e.g., one or more microphones 230).
  • the system may also include a microphone, which may be configured to capture sounds from the environment of the user (e.g., detect and transduce sound waves to storable digital audio data).
  • At least one processor may be programmed to cause the camera to capture a first image including a first representation of written material. Capturing the first image may include transducing light information (e.g., light incident on a camera, which may be reflected from written material) into another form of information (e.g., storable data). Causing a camera to capture the first image may include receiving an input from a user and/or executing one or more instructions or commands that activate the camera, direct the camera to open an aperture, direct the camera to receive light, and/or direct the camera to transduce light (e.g., take a photograph).
  • a camera may receive light from an environment, which the at least one processor may use to generate an image (e.g., a digital image).
  • the at least one image may include a representation of written material (e.g., digital image information and/or pixel data captured by a camera, corresponding to the written material).
  • Written material may include text (e.g., alphanumeric characters), a character (including, for example, a space), a diacritic, a symbol, a diagram, a chart, an image, or any other aspect of written material discussed above.
  • a representation of written material may include the written material itself or information contained in or derived from the written material, as discussed above.
  • At least one processor may be programmed to analyze the first image to detect a predetermined sign within the first image.
  • a predetermined sign may include highlighting (either physical or virtual), a laser mark, a non-laser light mark, a cursor, a pointing object (e.g., a stylus, a pencil, or a pen), a part of a user (e.g., a user’s finger), or any other detectable visual characteristic within an image separate from written material.
  • a user may hold an object that may produce a laser (e.g., handheld apparatus 110 producing a laser using one or more targeting lasers 214), which may place a laser mark within an environment of the user (e.g., on written material), which may subsequently be captured within at least one image and analyzed by the at least one processor.
  • a user may point to a portion of written material (e.g., a word) or inadvertently cover a portion of the written material, such as with a finger or a pointing object (e.g., a stylus, a pencil, or a pen).
  • a user may cause a device to highlight and/or place a cursor on written material (e.g., written material presented on a display of a device, such as within an application), such as by touching a touchscreen, dragging an input object (e.g., finger or a stylus) across a touchscreen, moving a mouse, clicking a mouse, and/or dragging a mouse.
  • the predetermined sign may appear adjacent to and/or on top of the written material.
  • Analyzing the first image may include recognizing text, determining a text color, determining a text background color (e.g., a color of a page, whether physical or virtual), parsing the first image for particular pixel data (e.g., a pixel color, contrast between pixel colors meeting a threshold, a region of pixels having a particular color or a color within a range of colors), manipulating the first image, such as de-blurring the first image, rotating the first image (e.g., deskewing), adjusting a color property of the first image (e.g., brightness, contrast, hue, or color), applying binarization to the first image, performing layout analysis to the first image (e.g., identifying zones of text), interpolating the first image (e.g., interpolating digital pixel information), transforming the first image, performing optical character recognition (OCR) to the first image (e.g., to recognize text and convert it to machine-readable text), performing intelligent character recognition (ICR) to the first image, or performing any other image analysis operation discussed above.
  • Detecting the predetermined sign may include recognizing text, determining a color adjacent to pixels of text contrasting (e.g., past a contrast threshold) with a color of the text (e.g., black) and/or the color of a text background (e.g., white), determining a portion of an image having a color with one or more particular traits (e.g., a color within a range of colors, an area exceeding a predetermined size having a particular color, a color appearing in portions of an area exceeding a predetermined size), determining a shape and/or area of color to be moving between successive images (e.g., a position of a cursor moving between images or a laser mark from a laser pointer moving between images), or performing any image analysis operation (e.g., as discussed above) to identify the predetermined sign within an image.
  • the at least one processor may parse the at least one image to detect pixel data having a color contrasting (e.g., past a contrast threshold) with a color of text and/or text background.
  • the at least one processor may parse the at least one image to detect pixel data having a color (e.g., yellow) associated with a highlighter (e.g., a physical highlighter or a virtual highlighting function) or a color associated with a laser pointer (e.g., red), such as by detecting a color within a range of colors (e.g., hues of yellow or red) or by detecting a contrast between a color associated with the predetermined sign (e.g., yellow or red) and a color associated with the text and/or background (e.g., black and/or white).
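A simplified sketch of such color-based detection, returning the centroid of "red enough" pixels as the sign's location; the channel thresholds and minimum-area value are assumptions, and the synthetic image stands in for a captured frame.

```python
# Illustrative sketch: locating a red laser-like mark in an RGB image by
# thresholding color channels and taking the centroid of matching pixels.
import numpy as np

def find_laser_mark(rgb: np.ndarray):
    """Return (x, y) pixel coordinates of a red mark, or None if absent."""
    r = rgb[..., 0].astype(int)
    g = rgb[..., 1].astype(int)
    b = rgb[..., 2].astype(int)
    mask = (r > 200) & (g < 120) & (b < 120)       # "red enough" pixels
    if mask.sum() < 20:                            # require a minimum area
        return None
    ys, xs = np.nonzero(mask)
    return int(xs.mean()), int(ys.mean())          # centroid as coordinates

image = np.zeros((480, 640, 3), dtype=np.uint8)    # synthetic dark page
image[100:110, 300:310] = (255, 30, 30)            # synthetic red mark
print(find_laser_mark(image))                      # approximately (304, 104)
```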
  • the at least one processor may be further programmed to detect the predetermined sign using a trained machine learning model.
  • a machine learning model may include one or more of an algorithm, a neural network (e.g., a convolutional neural network, or CNN), a semantic segmentation algorithm or model, or a pose estimation algorithm or model.
  • a machine learning model may be trained (e.g., using input of images with and without the predetermined sign) to identify the predetermined sign (e.g., a laser mark) within an image.
  • an input 2202 may produce and/or cause a predetermined sign 2214 to appear on a first image 2210.
  • predetermined sign 2214 may appear on a word 2212.
  • predetermined sign 2214 may appear on or adjacent to any number of words shown in an image.
  • Predetermined sign 2214 may or may not cover one or more words, draw a frame around one or more words, have a certain degree of opacity, or the like.
  • the at least one processor may be programmed to determine first coordinates of the predetermined sign within the first image. Determining first coordinates of the predetermined sign may include one or more of determining a location of at least one pixel within the first image (e.g., at least one pixel associated with the predetermined sign), determining a center of the predetermined sign, determining a distance between the predetermined sign and an edge of the first image, determining a distance between the predetermined sign and an edge of the written material, determining a distance between the predetermined sign and an edge of an object (e.g., an object 120), determining a distance between the predetermined sign and a character, determining a distance between the predetermined sign and a word, determining a distance between the predetermined sign and an edge of a page (e.g., a physical page or a virtual page), or determining a distance between the predetermined sign and an edge of a screen (e.g., a screen upon which written material is displayed).
  • First coordinates of the predetermined sign within the first image may be expressed as any combination of these distances.
  • the first coordinates may include a distance of the predetermined sign from a left edge of the first image and a distance of the predetermined sign from a bottom edge of the first image.
  • a coordinate system (e.g., a grid), which may have thousands of grid points, may be overlaid onto the first image or a portion of the first image (e.g., a portion corresponding to the written material and/or an object).
  • a coordinate system may be based on a resolution of the first image and/or of a camera that captured the first image. For example, a grid with more grid points may be applied to an image or camera of higher resolution.
  • a coordinate system may be based on pixels in the image. For example, each pixel may be associated with a coordinate point (e.g., an x coordinate and a y coordinate) within the first image. Alternatively, each group of pixels (e.g., a 2x2 region of pixels) may be associated with a coordinate point.
  • the at least one processor may be programmed to cause the camera to capture a second image including a second representation of the written material.
  • a second image may include an image captured at a later point in time relative to the first image. Capturing a second image may include any aspect described above with respect to capturing the first image. For example, capturing the second image may include transducing light information (e.g., light incident on a camera) into another form of information (e.g., storable data). Also, causing a camera to capture the second image may include executing one or more instructions or commands that activate the camera, direct the camera to open an aperture, direct the camera to receive light, and/or direct the camera to transduce light (e.g., take a photograph).
  • the at least one processor may be programmed to cause the camera to capture the second image at a predetermined period of time after the first image is captured (e.g., one second after the first image is captured). Additionally or alternatively, the at least one processor may be programmed to cause the camera to capture the second image in response to a particular determination, such as determining that the predetermined sign has disappeared or moved within a user’s environment (e.g., to a different position relative to the written material, based on image analysis of images captured prior to the second image), determining that a device producing the predetermined sign has moved (e.g., based on an inertial sensor in handheld apparatus 110), or in response to an input, such as a user’s voice or a key press.
  • a second image 2220 may be captured (e.g., after the capturing of image 2210), which may include a second representation of written material.
  • Second image 2220 may include a word 2222, which may correspond to word 2212 from the first image 2210, but without having predetermined sign 2214 placed on it.
  • predetermined sign 2214 may not be present, or may have moved away from word 2222.
  • the at least one processor may be programmed to determine a transformation between the first image and the second image.
  • a transformation between the first image and the second image may include one or more operations (e.g., a transformation function) that, when performed, cause the first image to more closely resemble the second image in at least one aspect, such as to have more similarly aligned edges of written material, to have more similarity in aligned edges of an object, to have more similarity in text alignment, to have more similar pixels at similar image regions between the first and second images (e.g., regions associated with corresponding coordinate areas between the first and second images), or the like.
  • a transformation may include an amount and direction of rotation and/or translation to apply to the first or second image (e.g., a rotation matrix or a translation matrix). Additionally or alternatively, a transformation may include an amount by which to increase or decrease a brightness or contrast of the first or second image. Determining the transformation between the first image and the second image may include comparing one or more portions of the first image to one or more portions of the second image. For example, at least one processor may compare a portion of the first image that has a unique feature (e.g., includes a word, glyph, or other visualization not found elsewhere in the image) to a portion of the second image including the same unique feature and determine a difference in position and/or orientation of the unique feature between the first and second images.
  • the at least one processor may be further programmed to determine the transformation between the first and second images by using at least one of edge detection or feature detection, or by determining one or more boundaries of text within the first and second images.
  • edge detection may include one or more of determining brightness values associated with one or more areas of the first and/or second images, comparing changes in brightness values in an image to a gradient threshold (e.g., an amount of change in brightness vs. a distance change in pixel location), determining a shape present in an image, comparing a detected shape to a known shape (e.g., another shape stored in memory), or otherwise identifying a discontinuity in an image.
  • Edge or feature detection may also be performed by a trained AI engine.
  • the at least one processor may determine first boundaries of a page of text represented in the first image and second boundaries of a page of text represented in the second image, and may use the first and second boundaries to determine a transformation between the first and second images (e.g., by determining an amount of rotational difference between the first and second boundaries).
  • the at least one processor may determine a unique sequence of words in the first image, determine the same unique sequence of words in the second image, and use a difference in position between the sequence in the separate images to determine a translational difference between the images (e.g., to determine a transformation).
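One way such a transformation could be estimated, and then applied to the first coordinates, is feature matching followed by a homography. The sketch below assumes OpenCV (cv2) is available and that grayscale versions of the two images are supplied; image acquisition and error handling are simplified.

```python
# Illustrative sketch: estimating a transformation between the first and
# second images with feature detection, then mapping the first coordinates of
# the predetermined sign into the second image.
import cv2
import numpy as np

def map_sign_coordinates(first_gray: np.ndarray, second_gray: np.ndarray,
                         first_xy: tuple[float, float]):
    orb = cv2.ORB_create(nfeatures=1000)
    kp1, des1 = orb.detectAndCompute(first_gray, None)
    kp2, des2 = orb.detectAndCompute(second_gray, None)
    if des1 is None or des2 is None:
        return None
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des1, des2)
    if len(matches) < 4:                               # homography needs 4+ pairs
        return None
    src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    if H is None:
        return None
    point = np.float32([[first_xy]])                   # shape (1, 1, 2)
    mapped = cv2.perspectiveTransform(point, H)
    return float(mapped[0, 0, 0]), float(mapped[0, 0, 1])
```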
  • the at least one processor may be programmed to apply the transformation to obtain second coordinates within the second image corresponding to the first coordinates of the predetermined sign within the first image. Applying the transformation may include applying the transformation to the first coordinates within the first image to obtain the corresponding coordinates in the second image. For example, the transformation may be applied to the first coordinates to translate them to a location of the second coordinates, or vice versa.
  • the second coordinates within the second image corresponding to the first coordinates of the predetermined sign within the first image may be expressed in a similar manner to the first coordinates.
  • the second coordinates may be a combination of distances and/or values expressed relative to a coordinate system, as described above with respect to the first coordinates.
  • the second coordinates may correspond to the first coordinates of the predetermined sign in the first image by indicating the same real-world location associated with the predetermined sign in the second image. For example, if the predetermined sign was associated with a position of (2345, 2435) (e.g., an x-y coordinate) in the first image, and that position was translated to (2145, 2235) in the second image (e.g., according to the transformation), then the predetermined sign will also be associated with the position of (2145, 2235) in the second image.
  • the second coordinates correspond to a position within the written material where the predetermined sign was present in the first image.
  • the transformation may be used to account for the shift, such that a position of the predetermined sign relative to the written material may be determined in the first image and/or the second image.
  • the second image may not include the predetermined sign.
  • the predetermined sign may be overlaid on a word in the first image, but may not be overlaid on the same word in the second image.
  • the second image may include the predetermined sign, but the predetermined sign may be present at a location different from the location where it was in the first image.
  • first image 2210 may include a predetermined sign, which may be determined to have first coordinates 2302, which may include an x coordinate and a y coordinate (e.g., based on an x-y coordinate system), though of course other coordinates for other coordinate systems may be used.
  • a transformation 2304 may be determined, which, if applied to second image 2220, may result in a transformed second image 2300, which may correspond to the first image.
  • second coordinates 2306 in transformed second image 2300, which may correspond to the first coordinates 2302 in the first image 2210, may be determined.
  • the at least one processor may be programmed to analyze the second image to recognize at least one word based on the second coordinates.
  • Analyzing the second image may include accessing the second coordinates, applying a coordinate system to the second image, performing optical character recognition (OCR) to the second image (e.g., to recognize text and convert it to machine-readable text), performing intelligent character recognition (ICR) to the second image, or applying any other image analysis operation to the second image that is discussed above with respect to analyzing the first image.
  • Analyzing the second image to recognize at least one word may also include performing any operation as described above with respect to analyzing at least one image to recognize text.
  • analyzing the second image to recognize at least one word may be performed at a particular location of the second image (e.g., an area including the second coordinates), which may reduce processing demand by focusing analysis on only a portion of the image.
  • the at least one processor may perform optical character recognition (OCR) at a portion of the second image associated with the second coordinates (e.g., a portion that includes the second coordinates and/or the at least one word).
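As a sketch of such region-limited recognition, assuming Pillow and pytesseract are available (the crop dimensions are arbitrary assumptions):

```python
# Illustrative sketch: recognizing text only in a region around the second
# coordinates, rather than over the whole image.
from PIL import Image
import pytesseract

def word_near_coordinates(image_path: str, x: int, y: int,
                          half_width: int = 120, half_height: int = 30) -> str:
    image = Image.open(image_path)
    left = max(0, x - half_width)
    upper = max(0, y - half_height)
    right = min(image.width, x + half_width)
    lower = min(image.height, y + half_height)
    region = image.crop((left, upper, right, lower))   # only this region is OCR'd
    return pytesseract.image_to_string(region).strip()
```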
  • Analyzing the second image to recognize at least one word based on the second coordinates may include identifying a word (or character, diacritic, or other amount of text) present at the position corresponding to the second coordinates, identifying one or more words present within one or more predetermined distances (e.g., within a radius or within a predetermined shape), determining one or more pixels present at the location of the second coordinates, or otherwise using the second coordinates to recognize the at least one word.
  • the at least one processor may use a machine learning model, such as a neural network, a word2vec model, a bag-of-words model, or a term frequency-inverse document frequency (tf-idf) model, to recognize at least one word.
  • determining a transformation and using a second image may allow the at least one processor to recognize at least one word (or other portion of text, such as a character) when the at least one word was at least partially obscured, framed, highlighted or otherwise indicated in another (e.g., earlier) image (e.g., such as being obscured by an object, or the predetermined sign itself).
  • the at least one processor may examine pixels from increasing radiuses outward from the second coordinates to determine if they are associated with a character.
  • the at least one processor may examine a pixel at the position corresponding to the second coordinates and determine if the pixel includes a color associated with a range of black and/or grey colors (e.g., colors associated with text).
  • Identifying a space subsequent to the identified character may include recognizing the character, determining an amount of space to the left or right of the identified character, determining a next character to the left or right of the identified character, or determining that an amount of space adjacent to the identified character is larger than a predetermined threshold and is not associated with another character (e.g., includes digital image data with color values outside of a predetermined range).
  • the at least one processor may be programmed to recognize at least one word based on a single image and/or a single set of coordinates. For example, in some embodiments, the at least one processor may be programmed to determine coordinates of the predetermined sign within the first image (e.g., first coordinates, as discussed above) and recognize at least one word based on the determined coordinates (e.g., the determined first coordinates). For example, the at least one processor may determine coordinates of the predetermined sign within the first image and determine a word underlying or closest to the coordinates of the predetermined sign.
  • a predetermined sign of a laser mark may be overlaid onto text (e.g., of written material) without obscuring the text for purposes of image analysis, such as where a red laser mark is strong enough to surround one or more letters of black text without obscuring the one or more letters (e.g., without presenting a laser mark with light so bright that it hinders reflection of light associated with the one or more letters for text recognition purposes).
  • the at least one processor may be further programmed to generate at least one audio signal representing the at least one word.
  • Generating at least one audio signal representing the at least one word may include one or more of recognizing the word, searching for the word in a data repository, identifying digital audio data associated with the at least one word (e.g., associated by a data structure or other data linkage between the at least one word), or retrieving digital audio data associated with the at least one word (e.g., corresponding to the at least one word).
  • the at least one processor may generate at least one audio signal based on retrieved digital audio information corresponding to the at least one word.
  • the generated audio signal may correspond to the at least one word in a different language from the language in which the word is written (e.g., the language the word is recognized in from within an image).
  • any text-to-speech technique may be used for obtaining the at least one audio signal.
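One hedged illustration of obtaining such an audio signal is a lookup of pre-recorded pronunciations in a hypothetical repository, with text-to-speech as the fallback; the directory name and file layout are assumptions, not part of the disclosure.

```python
# Illustrative sketch: obtaining audio for a recognized word from a repository
# of pre-recorded pronunciations. Any text-to-speech technique could be
# substituted where the lookup fails.
from pathlib import Path

AUDIO_REPOSITORY = Path("pronunciations")           # hypothetical directory

def audio_for_word(word: str) -> bytes | None:
    """Return stored audio bytes for the word, or None if not available."""
    candidate = AUDIO_REPOSITORY / f"{word.lower()}.wav"
    if candidate.exists():
        return candidate.read_bytes()
    return None                                     # fall back to TTS here
```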
  • the at least one processor may be further programmed to cause the audio output device (e.g., audio feedback device 130) to play the at least one audio signal.
  • Causing the audio output device to play the at least one audio signal may include activating the audio output device and/or transmitting an audio signal (e.g., based on retrieved digital audio data, as discussed above) to the audio output device (e.g., a microphone, such as one or more microphones 230), which may be configured to transduce the digital audio data (e.g., into audible sound waves).
  • the at least one processor may be programmed to generate at least one second audio signal representing a query to the user regarding providing a definition of the at least one word.
  • the at least one second audio signal may or may not include the at least one word.
  • the at least one second audio signal representing a query to the user regarding providing a definition of the at least one word may include language asking the user for input regarding whether to provide a definition of the at least one word.
  • a second audio signal may be a computer-generated voice asking, “Would you like a definition of the selected word?” or “Would you like a definition of ‘boat’?” (e.g., where “boat” is, or is included in, the at least one recognized word).
  • the at least one processor may be programmed to receive an input from the user, and the input may indicate a request to provide a definition of the at least one word.
  • An input may include a button press, a verbal command, a touch (e.g., on a touchscreen), or any other input discussed herein, consistent with disclosed embodiments.
  • the input received from the user may include a press detected (e.g., on a touchscreen) on a graphical user interface button displaying “Yes” or “No.”
  • the request may be indicated by, for example, an input received at a “Yes” button or an affirmative verbal command (e.g., “yes”).
  • the at least one processor may be programmed to receive at least one audio signal generated by a microphone (e.g., one or more microphones 230), the at least one audio signal generated by the microphone representing a response to the query by the user.
  • the at least one audio signal generated by the microphone may include a representation of a verbal response from a user indicating a desire to receive a definition or indicating a desire not to receive a definition.
  • the at least one processor may be programmed to generate an audio signal representative of the definition of the at least one word and play the audio signal representative of the definition of the at least one word using the audio output device, as discussed above.
  • the at least one processor may be programmed to play the at least one second audio signal representing the query using the audio output device.
  • the at least one processor may cause the audio output device to output an audio signal representing the query, such as “Would you like a definition of the selected word?” or “Would you like a definition of ‘boat’?” (e.g., where “boat” is, or is included in, the at least one recognized word).
  • the at least one processor may be programmed to, in response to the input from the user: generate an audio signal representative of the definition of the at least one word and play the audio signal representative of the definition of the at least one word using the audio output device (e.g., audio feedback device 130).
  • Generating an audio signal representative of the definition of the at least one word may include one or more of recognizing the word, searching for the word in a data repository, identifying digital audio data associated with the at least one word (e.g., digital audio data corresponding to a definition of the word), or retrieving digital audio data associated with the at least one word.
  • the digital audio data may include an auditory definition of the at least one word in a different language from the language in which the word is written (e.g., the language the word is recognized in from within an image).
  • Playing the audio signal representative of the definition of the at least one word using the audio output device may include transmitting at least one audio signal (e.g., an activation signal and an audio signal) to an audio output device (e.g., a microphone, such as one or more microphones 230) or in any way causing an audio output device to produce sound corresponding to the audio signal representative of the definition of the at least one word.
  • the at least one processor may be further programmed to obtain a part of speech of the at least one word (e.g., after recognizing the at least one word).
  • Obtaining a part of speech of the at least one word may include one or more of applying a contextual rule to identify the part of speech of the at least one word (e.g., applying a syntactic rule to identify a part of speech based on adjacent or nearby words), determining a most common part of speech for the at least one word (e.g., based on semantic usage statistics, which may be obtained from a remote device), querying a database, requesting the part of speech of the at least one word from a remote device, or receiving a part of speech of the at least one word.
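As one hedged example of a contextual approach, the sketch below tags the sentence containing the word and returns the tag assigned to that word; it assumes NLTK and its tokenizer and tagger data are installed.

```python
# Illustrative sketch: obtaining a part of speech for a recognized word from
# the surrounding words in its sentence.
import nltk

def part_of_speech(sentence: str, word: str) -> str | None:
    tokens = nltk.word_tokenize(sentence)
    for token, tag in nltk.pos_tag(tokens):
        if token.lower() == word.lower():
            return tag                              # e.g., 'NN', 'VB', 'JJ'
    return None

print(part_of_speech("The small boat drifted away", "boat"))   # 'NN'
```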
  • the at least one processor may be further programmed to contact a host (e.g., a device) including or having access to a dictionary (e.g., a built-in, local, remote, or a third party digital dictionary).
  • the at least one processor may transmit a request to the host of the dictionary.
  • the request may prompt the host of the dictionary to provide a definition for the at least one word, which may (or may not) be specifically associated with the at least one word and the part of speech of the at least one word.
  • the request may include the at least one word and/or part of speech of the at least one word.
  • the at least one processor may receive a reply.
  • the reply may include a single definition, a plurality of definitions, or no definition.
  • the reply may include a plurality of definitions, which may be associated with different parts of speech, or may be associated with a same part of speech (e.g., a part of speech included in the request).
  • the at least one processor may be further programmed to prioritize a received plurality of definitions based on the part of speech.
  • Prioritizing the received plurality of definitions may include reordering (e.g., within a list) one or more definitions, discarding one or more definitions, or deemphasizing one or more definitions (e.g., generating instructions to cause de-prioritized definitions to be presented in grey text rather than black text, or to be read with an auditory output indicating de-prioritized definitions are secondary or alternative).
  • the at least one processor may prioritize definitions associated with the obtained part of speech of the at least one word over definitions associated with a different part of speech associated with the at least one word.
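A minimal sketch of such prioritization, where definitions are hypothetical (part of speech, text) pairs and definitions matching the obtained part of speech are stably sorted to the front:

```python
# Illustrative sketch: prioritizing definitions whose part of speech matches
# the one obtained for the word.

def prioritize_definitions(definitions: list[tuple[str, str]],
                           target_pos: str) -> list[tuple[str, str]]:
    """Stable sort: matching part of speech first, original order otherwise."""
    return sorted(definitions, key=lambda d: d[0] != target_pos)

definitions = [
    ("verb", "to travel in a boat"),
    ("noun", "a small vessel for travelling over water"),
]
ranked = prioritize_definitions(definitions, target_pos="noun")
print(ranked[0][1])      # highest prioritized definition provided to the user
```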
  • the at least one processor may provide a highest prioritized definition to the user.
  • the at least one processor may cause an output device to output the highest prioritized definition (e.g., as text or auditory information).
  • the highest prioritized definition may be output with no other definitions, or may be output with other definitions (which may be de-emphasized, as mentioned above), which may or may not be associated with the part of speech of the highest prioritized definition.
  • the at least one processor may be further programmed to receive at least one audio signal generated by the microphone.
  • Generating an audio signal may include one or more of detecting sound waves, transducing sound waves, or storing digital audio data representing sound waves (e.g., a digital recording of audio information).
  • the at least one processor may cause the microphone to transduce sound waves and store associated digital audio information (e.g., digital audio information representing recorded speech of the user).
  • the at least one audio signal may represent speech by the user speaking (e.g., speaking at least one second word).
  • speech by the user may include one or more spoken portions of written material, such as a word, a phrase, a sentence, a paragraph, a column, a page, a chapter, or any other amount of text.
  • the user may speak the at least one word (e.g., recognized using the second coordinates) or an attempted version of the at least one word.
  • the at least one processor may be programmed to analyze the at least one audio signal to recognize at least one second word (e.g., another word within the audio data separate from the at least one word). Analyzing the at least one audio signal may include applying a filter to at least a portion of the at least one audio signal, comparing at least a portion of the at least one audio signal to a reference audio signal (e.g., stored digital audio data representing a particular known phoneme, syllable, word, or phrase), or performing any operation to enhance the at least one audio signal for recognition of speech or other auditory information, including those discussed above.
  • Recognizing the at least one second word may include at least one of determining that at least a portion of the at least one audio signal matches (e.g., within a threshold) a reference audio signal, associating the at least one second word with at least a portion of the at least one audio signal, generating digital text data representing the at least one second word, or performing any other operation to extract or identify information from the at least one audio signal, such as those discussed above.
  • comparing the at least one second word with the at least one word may include comparing first digital text data representing the at least one second word with second digital text data representing the at least one word (e.g., to determine whether the first and second text data are exact matches or matches within a character threshold), or comparing first digital audio data (e.g., a waveform) with second digital audio data (e.g., a waveform).
  • digital audio data may be retrieved based on an association between the digital audio data and text data corresponding to the at least one second word or the at least one word.
  • comparing the at least one second word with the at least one word may include determining if the at least one second word matches the at least one word, as discussed above with respect to determining whether the at least one first word matches the at least one second word. For example, determining that the at least one second word matches the at least one word may include comparing a metric to a predetermined threshold.
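As one possible illustration of comparing a metric to a predetermined threshold, the sketch below uses a normalized character-level similarity from Python's standard library; the 0.8 threshold is an illustrative assumption rather than a value specified in the disclosure.

```python
# Minimal sketch: decide whether a spoken word matches an expected word within a
# similarity threshold. SequenceMatcher gives a normalized similarity in [0, 1].
from difflib import SequenceMatcher

def words_match(spoken: str, expected: str, threshold: float = 0.8) -> bool:
    """Return True if the spoken word matches the expected word within the threshold."""
    similarity = SequenceMatcher(None, spoken.lower(), expected.lower()).ratio()
    return similarity >= threshold

print(words_match("definately", "definitely"))  # True: within the character threshold
print(words_match("cat", "caterpillar"))        # False: below the threshold
```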
  • the at least one processor may be programmed to determine an accuracy of the speech by the user based on a comparison of the at least one second word with the at least one word.
  • An accuracy of the speech by the user may include a number or percentage of words spoken correctly (e.g., within a threshold, based on audio and/or text data comparisons), a number or percentage of syllables spoken correctly, a number or percentage of phonemes spoken correctly, whether a word was spoken correctly or was not spoken correctly, whether a word included correct tone or intonation (e.g., within a threshold, based on audio data), or any other information indicating a difference or closeness between the user’s speech and reference information (e.g., speech information associated with fluency).
  • the accuracy of the speech by the user may be based on the comparison of the at least one second word with the at least one word in that it may be influenced by, dependent on, informed by, or defined at least in part by the comparison.
  • the at least one processor may determine, based on a character-to-character comparison of the at least one second word to the at least one word, that the at least one second word matches the at least one word and, in response to this determination, may determine that the accuracy of the speech by the user (e.g., speaking the at least one second word or the at least one word) is fully accurate (e.g., 100% accuracy).
  • the at least one processor may determine, based on comparing first audio information associated with the at least one second word (e.g., recorded audio data or reference audio data) with second audio information associated with the at least one word (e.g., recorded audio data or reference audio data), that four out of five (4/5) of the syllables in the first and second audio information match and, in response to this determination, may determine that the accuracy of the speech by the user (e.g., speaking the at least one second word or the at least one word) is partially accurate (e.g., 80% accurate).
  • the accuracy of the speech may be determined based on a number or percentage of at least one of correct phonemes, correct syllables, or correct words in the at least one audio signal.
  • the accuracy may include a statistic influenced by a number or percentage of at least one of correct phonemes, correct syllables, or correct words in the at least one audio signal.
  • the accuracy may be determined using multiple metrics, such as a number or percentage of correct syllables and a number or percentage of correct words.
  • a percentage of correct syllables may be weighted equally with a percentage of correct words, such that if 90% of the syllables are correct and 80% of the words are correct, an accuracy may be determined to be 85% (e.g., weighting the percentage of correct syllables at 50% and the percentage of correct words at 50%).
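The equally weighted example above may be expressed as a short sketch; the weights and counts below are illustrative assumptions.

```python
# Minimal sketch of combining syllable-level and word-level accuracy into one percentage.
def speech_accuracy(correct_syllables: int, total_syllables: int,
                    correct_words: int, total_words: int,
                    syllable_weight: float = 0.5, word_weight: float = 0.5) -> float:
    """Weighted combination of syllable accuracy and word accuracy."""
    syllable_pct = 100.0 * correct_syllables / total_syllables
    word_pct = 100.0 * correct_words / total_words
    return syllable_weight * syllable_pct + word_weight * word_pct

# Example from the description: 90% correct syllables and 80% correct words -> 85% accuracy.
print(speech_accuracy(correct_syllables=90, total_syllables=100,
                      correct_words=80, total_words=100))  # 85.0
```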
  • an indicator of the accuracy of the speech may be transmitted to at least one of the user, a teacher of the user, or a parent of the user.
  • the indicator of the accuracy of the speech may be provided via an email, text message, audio message, or video message to an account (e.g., email address, phone number, application, etc.) associated with the user, a teacher, and/or a parent.
  • the at least one processor may be programmed to determine a fluency based on a spoken word rate in the at least one audio signal.
  • a fluency may include a rating, a score, a statistic, a metric, or any other value or combination of values indicating a closeness of a user’s speech to reference speech information (e.g., associated with fluent speech).
  • a spoken word rate may include a number or percentage of words (or other unit of speech, such as syllables, phonemes, phrases, or sentences) spoken and/or spoken correctly per an amount of time (e.g., per second or per minute), which may be based on one or more comparisons, consistent with disclosed embodiments.
  • a spoken word rate may be determined using the at least one audio signal.
  • the at least one processor may determine a number of words represented in the at least one audio signal, may determine a duration of the audio signal (e.g., in seconds and/or minutes), and may divide the number of words represented in the audio signal by the duration of the audio signal, to determine the spoken word rate.
  • the fluency may be based on the spoken word rate in that it may be influenced by, dependent on, informed by, or defined at least in part by the spoken word rate. For example, a fluency of X% may be determined when a spoken word rate is X% of words spoken correctly per minute (where “X” is a number).
  • a particular degree of fluency (e.g., “fully fluent,” “almost fluent,” “partially fluent,” “proficient,” “conversational,” “partially conversational,” or any other indicator of a user’s language speaking ability) may be determined in response to the spoken word rate reaching a threshold amount.
  • a “fully fluent” degree of fluency may be determined when a spoken word rate is 95% of words spoken correctly per minute or greater.
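A minimal sketch of deriving a spoken word rate and a coarse degree of fluency follows; the fluency labels other than “fully fluent,” and the thresholds other than the 95% example, are illustrative assumptions.

```python
# Minimal sketch: spoken word rate from a recognized transcript and audio duration,
# then a coarse degree-of-fluency label from the percentage of correctly spoken words.
def spoken_word_rate(num_correct_words: int, num_words: int, duration_seconds: float) -> tuple:
    """Return (words per minute, percentage of words spoken correctly)."""
    minutes = duration_seconds / 60.0
    wpm = num_words / minutes
    pct_correct = 100.0 * num_correct_words / num_words
    return wpm, pct_correct

def degree_of_fluency(pct_correct: float) -> str:
    """Map a percentage of correctly spoken words to a coarse fluency label."""
    if pct_correct >= 95.0:
        return "fully fluent"
    if pct_correct >= 80.0:
        return "almost fluent"
    return "partially fluent"

wpm, pct = spoken_word_rate(num_correct_words=57, num_words=60, duration_seconds=30.0)
print(wpm, pct, degree_of_fluency(pct))  # 120.0 words/min, 95.0% correct, "fully fluent"
```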
  • Fig. 24 is a flowchart showing an example process 2400 for processing audio and image signals, consistent with the disclosed embodiments. Process 2400 may be performed by at least one processing device, such as processor 320.
  • process 2400 may be performed by a different device, such as at least one processor within secondary communications device 350 and/or server 380.
  • the term “processor” is used as shorthand for “at least one processor.”
  • a processor may include one or more structures that perform logic operations whether such structures are collocated, connected, or dispersed.
  • a non-transitory computer-readable medium may contain instructions that when executed by a processor cause the processor to perform process 2400.
  • process 2400 is not necessarily limited to the steps shown in Fig. 24 and any steps or processes of the various embodiments described throughout the present disclosure may also be included in process 2400. Moreover, any step of process 2400 may be omitted, duplicated, re-arranged, and/or repeated.
  • process 2400 may include causing the camera to capture a first image including a first representation of written material, consistent with disclosed embodiments.
  • camera 210 may capture an image 2210, which may include a representation of written material.
  • the at least one image may include text (e.g., at least one word) and/or a predetermined sign 2214, as discussed above.
  • process 2400 may include analyzing the first image (e.g., image 2210) to detect a predetermined sign within the first image, consistent with disclosed embodiments. For example, this may include identifying particular digital image information within the at least one image, as discussed above. This may also include determining pixel color or contrast information associated with the predetermined sign.
  • process 2400 may include determining first coordinates (e.g., coordinates 2302) of the predetermined sign within the first image, consistent with disclosed embodiments.
  • determining first coordinates may include determining a position of the predetermined sign with respect to an edge of an image and/or written material.
  • process 2400 may include causing the camera to capture a second image (e.g., image 2220) including a second representation of the written material, consistent with disclosed embodiments.
  • the camera may capture an image of the written material in which the predetermined sign is not present or is present in a different position.
  • process 2400 may include determining a transformation (e.g., transformation 2304) between the first image and the second image, consistent with disclosed embodiments.
  • a transformation may include an amount and direction of rotation and/or translation to apply to the first or second image.
  • process 2400 may include applying the transformation to obtain second coordinates (e.g., second coordinates 2306) within the second image corresponding to the first coordinates of the predetermined sign within the first image, consistent with disclosed embodiments.
  • applying the transformation may include changing image data or translating coordinates.
  • process 2400 may include analyzing the second image to recognize at least one word based on the second coordinates, consistent with disclosed embodiments.
  • analyzing the second image to recognize at least one word may include performing optical character recognition (OCR) at a portion of the second image associated with the second coordinates (e.g., a portion that includes the second coordinates and/or the at least one word).
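One possible way to implement the transformation and OCR steps of process 2400 is sketched below using OpenCV and pytesseract; the feature-matching approach to estimating the transformation, the file names, the example coordinates, and the crop size are assumptions made for illustration and are not mandated by the disclosure.

```python
# Minimal sketch: estimate a transformation between the first and second images, map the
# predetermined sign's coordinates into the second image, and run OCR near those coordinates.
import cv2
import numpy as np
import pytesseract

first = cv2.imread("first_image.png", cv2.IMREAD_GRAYSCALE)
second = cv2.imread("second_image.png", cv2.IMREAD_GRAYSCALE)

# Estimate the transformation (rotation + translation + scale) from matched keypoints.
orb = cv2.ORB_create(1000)
kp1, des1 = orb.detectAndCompute(first, None)
kp2, des2 = orb.detectAndCompute(second, None)
matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des1, des2)
src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
transform, _ = cv2.estimateAffinePartial2D(src, dst)  # a real system would check for None

# Apply the transformation to the first coordinates of the predetermined sign.
first_coords = np.float32([[[240.0, 410.0]]])  # example coordinates of the sign
second_coords = cv2.transform(first_coords, transform)[0, 0]

# Recognize at least one word at a portion of the second image around the mapped coordinates.
x, y = int(second_coords[0]), int(second_coords[1])
h, w = second.shape
crop = second[max(0, y - 40):min(h, y + 40), max(0, x - 120):min(w, x + 120)]
word = pytesseract.image_to_string(crop).strip()
print(word)
```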
  • a reading device may be provided for assisting users with reading, learning, comprehension, or various other tasks associated with written text.
  • handheld apparatus 110 may be used to capture images of written material and may analyze the written material to provide feedback to the user. This written material may also be analyzed in association with audio signals or other inputs from a user.
  • the reading device may further be configured to associate the reading device with a particular user. For example, the reading device may identify the user (or an account associated with a user) and associate the reading device with the user. Accordingly, the reading device may provide more personalized results for the user.
  • this may include maintaining a history of events or materials associated with the user, accessing and/or maintaining preferences for the user, transmitting feedback or other information to other devices associated with the user, identifying other secondary users associated with the user, ranking or scoring abilities or progress of the user, or the like.
  • the user may be identified in various ways. In some embodiments, this may include analyzing one or more captured images to detect identification information associated with a user. In some embodiments, these images may be captured using handheld apparatus 110. For example, handheld apparatus 110 may analyze one or more images captured using camera 210 to detect identification information associated with a user. Alternatively or additionally, the images may be captured using another device. For example, handheld apparatus 110 may receive images captured using secondary communications device 350 and analyze the images to detect identification information associated with a user.
  • the identification information may include any form of visual information from which an identity of the user may be determined, extracted, or derived.
  • the identification information may be a name of a user.
  • handheld apparatus 110 may detect a text within an image and identify the name of a user within the text.
  • the identification information may be another identifier of a user, such as an identification number or string (e.g., an alphanumeric code), a username, an image of a face of the user, or the like.
  • the identification information may be represented as a visual code.
  • Fig. 25 illustrates an example code 2510 that may be used for identifying a user consistent with the disclosed embodiments.
  • handheld apparatus 110 may be used to scan or capture an image of code 2510.
  • code 2510 may be a quick response (QR) code, which may encode identification information 2520 associated with a user of handheld apparatus 110.
  • identification information 2520 may include a name of the user, an identifier of the user, or various other information.
  • Code 2510 may include various other forms of machine-readable codes, such as a barcode, a DataMatrix code, an Aztec code, a High Capacity Color Barcode, a Trillcode, a Quickmark code, a Shotcode, an mCode, a Beetagg code, a PDF417 code, a proprietary code format, or the like.
  • an image of code 2510 may not necessarily be captured using handheld apparatus 110 and may be captured using another device.
  • image of code 2510 may be captured using secondary communications device 350, as described above.
  • the captured image or identification information 2520 may then be transmitted to handheld apparatus 110, server 380, or other components of system 300.
  • Code 2510 may be presented to user 100 in various ways.
  • code 2510 may be printed on a physical object.
  • code 2510 may be printed on an identification card, a pamphlet, a bookmark, or other object accessible to user 100.
  • code 2510 may be printed on or otherwise affixed to object 120.
  • code 2510 may be displayed on a screen of a secondary device.
  • Handheld apparatus 110 may extract identification information 2520 using various methods. For example, handheld apparatus 110 may apply a Reed-Solomon error correction algorithm or other techniques to extract data from code 2510.
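As an illustration of extracting identification information 2520 from code 2510, the sketch below uses OpenCV's QR detector (which performs Reed-Solomon error correction internally); the image path and the encoded payload format are illustrative assumptions.

```python
# Minimal sketch: decode a QR code captured in an image to obtain identification information.
import cv2

image = cv2.imread("captured_code.png")
detector = cv2.QRCodeDetector()
identification_info, points, _ = detector.detectAndDecode(image)

if identification_info:
    # Hypothetical payload, e.g. "user_id=12345;name=Tammy Meadow"
    print("Decoded identification information:", identification_info)
else:
    print("No readable code detected; the user may be asked to rescan.")
```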
  • Based on identification information 2520, user 100 may be associated with handheld apparatus 110. In some embodiments, this may include accessing a database or other data structure associating identification information 2520 with user 100.
  • database 650 may include a data structure (e.g., a table, array, etc.) correlating identification information to one or more users and identification information 2520 may be used to perform a “look up” function to identify user 100 (or an account associated with user 100).
  • Various other methods for determining an association with user 100 based on identification information 2520 may be used, as will be appreciated by one of skill in the art.
  • information indicating the association between user 100 and handheld apparatus 110 may be stored.
  • database 650 may store identification information 2520 (or other information identifying user 100) in an associative manner with an identifier of handheld apparatus 110, indicating an association between user 100 and handheld apparatus 110. Alternatively or additionally, the association may be indicated on handheld apparatus 110.
  • handheld apparatus 110 may store identification information 2520 (or other information identifying user 100) to indicate that it is now associated with user 100.
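The look-up and association-storage steps described above might be sketched as follows; the in-memory dictionaries stand in for database 650, and the record fields, identifiers, and user names are illustrative assumptions.

```python
# Minimal sketch: look up a user from extracted identification information and record
# the association between the user and the reading device.
from datetime import datetime

identification_to_user = {"user_id=12345": "tammy_meadow"}  # stands in for database 650
device_associations = {}                                    # device id -> association record

def associate_device_with_user(device_id: str, identification_info: str) -> dict:
    """Resolve identification information to a user and store the device association."""
    user = identification_to_user.get(identification_info)
    if user is None:
        raise LookupError("identification information not recognized")
    record = {"user": user, "established": datetime.now().isoformat(), "session_only": False}
    device_associations[device_id] = record
    return record

print(associate_device_with_user("handheld-110", "user_id=12345"))
```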
  • the association between user 100 and handheld apparatus 110 may be a permanent or at least long-term association that is maintained until the association is specifically removed.
  • user 100 may scan code 2510 as part of an initial setup of handheld apparatus 110.
  • handheld apparatus 110 may be associated with user 100 until this association is removed or modified.
  • the association may be temporary.
  • the association between user 100 and handheld apparatus 110 may be associated with a particular session or use of handheld apparatus 110. Accordingly, the association may be removed automatically after a predetermined time period since establishing the association, a predetermined time of inactivity of handheld apparatus 110, or various other triggers.
  • handheld apparatus 110 may be associated with multiple users. For example, each time a new user uses handheld apparatus 110, an association may be established between the new user and handheld apparatus 110.
  • various other visual indicators of user 100 may be used to associate handheld apparatus 110 with user 100. For example, this may include capturing an image of user 100 and identifying user 100 based on a representation of user 100 in the captured image.
  • Fig. 26A illustrates an example technique for identifying a user based on a physical characteristic of the user, consistent with the disclosed embodiments.
  • handheld apparatus 110 may be used to capture an image of a face of user 100. As described above, an image of the face of user 100 may be captured by various other devices, such as secondary computing device 350.
  • the representation of the face of user 100 detected within the image may be used to identify user 100.
  • handheld apparatus 110 may use one or more image recognition techniques to identify facial features on the face of user 100, such as the eyes, nose, cheekbones, jaw, or other features. Handheld apparatus 110 may then analyze the relative size and/or position of these features to identify the user. Handheld apparatus 110 may use one or more algorithms for analyzing the detected features, such as principal component analysis (e.g., using eigenfaces), linear discriminant analysis, elastic bunch graph matching (e.g., using Fisherface), Local Binary Patterns Histograms (LBPH), Scale Invariant Feature Transform (SIFT), Speeded Up Robust Features (SURF), or the like. Other facial recognition techniques such as three-dimensional recognition, skin texture analysis, and/or thermal imaging may also be used to identify individuals.
  • handheld apparatus 110 may compare a captured image to stored images associated with one or more users.
  • database 650 may store reference images associated with one or more users 100.
  • these reference images may include images of user 100 captured previously using handheld apparatus 110 (or another device of system 300, such as secondary computing device 350).
  • the reference images may be accessed from an external source.
  • the images may be accessed from a social media profile, an online database, or other resources that may associate images of a face of user 100 with an identity of user 100.
  • handheld apparatus 110 may compare a representation of the face of user 100 to a reference image 2610 to determine a match.
  • this may not be an exact match, but may be determined based on a degree of confidence that the face of user 100 matches reference image 2610 by more than a threshold amount, or the like.
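A threshold-based comparison of a captured face to reference image 2610 might be sketched as follows, assuming face embeddings have already been computed by any of the recognition techniques listed above; the embedding size, the synthetic values, and the threshold are illustrative assumptions.

```python
# Minimal sketch: compare a captured face embedding to a reference embedding using
# cosine similarity and a confidence threshold.
import numpy as np

def face_matches(captured_embedding: np.ndarray, reference_embedding: np.ndarray,
                 threshold: float = 0.6) -> bool:
    """Return True when the cosine similarity exceeds the confidence threshold."""
    captured = captured_embedding / np.linalg.norm(captured_embedding)
    reference = reference_embedding / np.linalg.norm(reference_embedding)
    return float(np.dot(captured, reference)) >= threshold

captured = np.random.rand(128)                       # embedding of the face in the captured image
reference = captured + 0.05 * np.random.rand(128)    # embedding of reference image 2610
print(face_matches(captured, reference))             # True: match by more than the threshold amount
```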
  • other features besides facial features may also be used for identification, such as the height, body shape, vein patterns, behavioral patterns, or other distinguishing features of user 100.
  • user 100 may be identified using a captured audio signal.
  • Fig. 26B illustrates an example technique for identifying user 100 based on a captured audio signal, consistent with the disclosed embodiments.
  • a voice 2622 of user 100 may be captured using handheld apparatus 110 (or another device, such as secondary computing device 350).
  • audio sensor 340 in microphone 230 of apparatus 110 may capture a voice of user 100, as described above.
  • voice 2622 may include a predetermined word, phrase, or sentence spoken by user 100.
  • the word, phrase, or sentence may include identification information of user 100.
  • user 100 may speak a predetermined passphrase associated with user 100, a name of user 100, a username of user 100, an identifier of user 100, or any other word or phrase that may be recognized as an identifier of user 100.
  • user 100 may say “please activate profile for Tammy Meadow,” and handheld apparatus 110 may recognize “Tammy Meadow” as a name of user 100.
  • this identification information may be compared to a database or other data structure storing associations between identification information and one or more users or user profiles.
  • audio characteristics of voice 2622 of user 100 may be used to identify user 100.
  • handheld apparatus 110 may analyze audio signals representative of voice 2622 captured by microphone 230 to identify user 100. This may include applying one or more voice recognition algorithms, such as Hidden Markov Models, Dynamic Time Warping, neural networks, or other techniques.
  • handheld apparatus 110 may access database 650, which may further include one or more of a pre-recorded voice of one or more individuals, characteristics extracted therefrom, or a voiceprint of the one or more individuals.
  • handheld apparatus 110 may analyze the audio signal representative of voice 2622 to determine whether voice 2622 matches a voiceprint 2620 of an individual in database 650. Accordingly, database 650 may contain voiceprint data associated with a number of individuals, similar to the stored facial identification data described above. After determining a match (or a degree of matching exceeding a threshold, etc.), user 100 may be identified.
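One simplified way to compare voice 2622 against voiceprint 2620 is sketched below using averaged MFCC features and cosine similarity, assuming librosa is available; a production system would more likely use a dedicated speaker-verification model, and the file names and threshold here are illustrative assumptions.

```python
# Minimal sketch: compare a captured voice to a stored voiceprint via averaged MFCC features.
import numpy as np
import librosa

def voiceprint(path: str) -> np.ndarray:
    """Summarize an utterance as the mean of its MFCC frames."""
    audio, sample_rate = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=audio, sr=sample_rate, n_mfcc=20)
    return mfcc.mean(axis=1)

def voices_match(captured_path: str, enrolled_path: str, threshold: float = 0.9) -> bool:
    """Return True when the cosine similarity of the two summaries exceeds the threshold."""
    a, b = voiceprint(captured_path), voiceprint(enrolled_path)
    similarity = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return similarity >= threshold

print(voices_match("captured_voice_2622.wav", "voiceprint_2620.wav"))
```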
  • handheld apparatus 110 may be configured to request a confirmation that user 100 has been correctly identified.
  • the confirmation may be an audible confirmation.
  • handheld apparatus 110 may generate at least one audio signal representing a question asking user 100 to confirm identifying information, and output the audio signal through the device. For example, this may include generating an audio signal representing a request such as “Hello, is this Tammy?” or a similar request.
  • Handheld apparatus 110 may then receive at least one audio signal representing a response to the question by the user. For example, user 100 may say “yes” to confirm they have been correctly identified, or “no” to indicate they have been incorrectly identified. Alternatively or additionally, user 100 may confirm by pressing a physical button or touch sensitive area on handheld apparatus 110, shaking handheld apparatus 110, responding to a prompt on a secondary device, or the like.
  • Handheld apparatus 110 may be configured to perform various actions based on the association between handheld apparatus 110 and user 100. In some embodiments, this may include maintaining a stored collection of materials presented to user 100 in the past, or tracking progress or various statistical information associated with user 100. As described herein, handheld apparatus 110 may be configured to provide various forms of feedback, such as feedback indicating whether at least one word spoken by a user matches at least one word recognized in written material. In some embodiments, the performance of user 100 may be tracked over time to determine a score for user 100.
  • the score may indicate how often user 100 correctly identifies or pronounces a written word, a degree of similarity between a spoken word and a written word, performance on a multiple-choice test, an amount of improvement over time, or the like.
  • this score may be compared to similar scores of other users. Accordingly, a rank of user 100 compared to other users may be determined, which may be indicated to user 100.
  • Various other forms of scores or statistical information may be tracked in association with user 100 over time and may be made accessible to user 100.
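The score tracking and ranking described above might be sketched as follows; the scoring approach, stored histories, and user names are illustrative assumptions.

```python
# Minimal sketch: track per-session scores for several users and rank a user among them.
score_history = {
    "tammy": [72.0, 78.5, 84.0],   # per-session accuracy percentages (hypothetical)
    "alex":  [90.0, 88.0, 91.0],
    "sam":   [65.0, 70.0, 69.0],
}

def current_score(user: str) -> float:
    """Use the most recent session as the user's score; other statistics could be used instead."""
    return score_history[user][-1]

def rank_of(user: str) -> int:
    """1-based rank of the user among all tracked users, highest score first."""
    ordered = sorted(score_history, key=current_score, reverse=True)
    return ordered.index(user) + 1

improvement = score_history["tammy"][-1] - score_history["tammy"][0]
print(rank_of("tammy"), improvement)  # rank 2 of 3, improved by 12 points over time
```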
  • handheld apparatus 110 may identify and/or interact with various other devices associated with user 100 based on the association.
  • database 650 (or another storage location) may store data indicating other devices associated with user 100.
  • user 100 may be associated with secondary computing device 350. Accordingly, based on the determined association with user 100, handheld apparatus 110 may provide information to secondary computing device 350.
  • handheld apparatus 110 may determine feedback information indicating whether at least one first word spoken by a user matches the at least one second word.
  • the second word may be a word recognized in written material or any other word expected to be spoken by the user.
  • handheld apparatus 110 may transmit this feedback information to secondary computing device 350.
  • secondary computing device 350 may provide visual or other forms of feedback to user 100 as user 100 reads text on object 120. Alternatively or additionally, this may include transmitting a rank or other information associated with user 100, as described above.
  • handheld apparatus 110 may identify other users or individuals associated with user 100.
  • database 650 may store data associating user 100 with various other users. For example, this may include friends of user 100, a teacher or instructor of user 100, a parent or guardian of user 100, an administrator associated with user 100, an account representative of user 100, or the like.
  • handheld apparatus 110 may transmit feedback information to a secondary computing device associated with other users. For example, this may include feedback indicating whether at least one first word spoken by a user matches the at least one second word recognized in written material, as described above.
  • Fig. 27 is a flowchart showing an example process 2700 for processing audio and image signals, consistent with the disclosed embodiments.
  • Process 2700 may be performed by at least one processing device of a handheld apparatus, such as processor 310, and/or at least one processing device of an earphone case, such as processor 320.
  • the term “processor” is used as shorthand for “at least one processor.”
  • a processor may include one or more structures that perform logic operations whether such structures are collocated, connected, or dispersed.
  • a non-transitory computer readable medium may contain instructions that when executed by a processor cause the processor to perform process 2700.
  • process 2700 is not necessarily limited to the steps shown in Fig. 27, and any steps or processes of the various embodiments described throughout the present disclosure may also be included in process 2700, including those described above with respect to Figs. 25, 26A, and 26B.
  • process 2700 may include associating a reading device with a user by identifying at least one of a user or a user account associated with the user.
  • the user or user account may be identified in various ways, as described herein.
  • the user or user account may be identified based on identification information represented in a visual code.
  • step 2710 may include identifying the user or the user account by analyzing at least one captured image including a representation of a code associated with the user and extracting identification information from the code.
  • the code may be displayed on an object or on a display of a secondary device associated with the first user.
  • the code may include one of an alphanumeric code, a barcode, or a QR code, as described above.
  • Step 2710 may further include associating the reading device with the user based on the identification information, as described above with respect to Fig. 25.
  • the user or user account may be identified based on an audio signal captured by a microphone of the reading device. In some embodiments, this may include extracting identification information for associating the user with the reading device.
  • step 2710 may include analyzing at least one audio signal captured by the microphone representing speech by the user speaking a predetermined word, phrase, or sentence; and extracting identification information associated with the user based on the analysis. Step 2710 may further include associating the reading device with the user based on the identification information, as described above.
  • the user or user account may be identified based on a recognized voice of the user.
  • step 2710 may include analyzing at least one audio signal captured by the microphone representing speech by the user speaking a predetermined word, phrase, or sentence; and recognizing a voice of the user by comparing the at least one audio signal representing speech by the user with a prerecorded voiceprint of the user. Accordingly, step 2710 may further include associating the reading device with the user in response to recognizing the voice.
  • the user or user account may be identified based on a face of a user detected in an image, as described above with respect to Fig. 26A.
  • step 2710 may include comparing at least one captured image including a representation of a face of the user to one or more stored images; and associating the reading device with the user based on the comparison.
  • the at least one image may be captured by the reading device.
  • process 2700 may include causing the camera to capture the at least one image including the representation of the face of the user.
  • the image may be captured using another device.
  • process 2700 may include receiving the at least one image including the representation of the face of the user from a secondary device associated with the user, such as secondary device 350.
  • process 2700 may further include confirming the identity of the user, as described above.
  • step 2710 may include analyzing at least one captured image including a representation of a code associated with the user, and extracting identification information from the code, as described above.
  • Process 2700 may further include generating at least one audio signal representing a question asking the user to confirm the identifying information; outputting the at least one audio signal through the audio output device; and receiving at least one audio signal representing a response to the question by the user.
  • process 2700 may include associating the reading device with the user based on the response.
  • process 2700 may include receiving at least one image captured by a camera configured to capture images from an environment of a user. For example, this may include receiving at least one image captured using image sensor 310.
  • the at least one image may include a representation of written material.
  • the written material may be printed on or in object 120.
  • process 2700 may include analyzing the at least one image to recognize text. For example, this may include executing one or more algorithms stored in text recognition component 610 to identify the text in the received images, as described above. In some embodiments, this may include performing optical character recognition (OCR) or other processing techniques on the identified text.
  • process 2700 may include receiving at least one audio signal captured by a microphone configured to capture sounds from the environment of the user. For example, this may include receiving an audio signal captured using microphone 340, as described above.
  • the at least one audio signal may represent speech by the user. For example, user 100 may be reading the written material.
  • process 2700 may include analyzing the at least one audio signal to recognize at least one first word.
  • step 2750 may include executing one or more sound recognition modules in voice recognition component 620 to identify one or more words in the received audio signals, as described above.
  • process 2700 may include comparing the at least one first word with at least one second word. For example, this may include determining whether the at least one first word matches the at least one second word based on a metric for assessing a similarity between the at least one first word and the at least one second word, as described herein.
  • the second word may include any form of word that may be expected to be spoken by a user.
  • the second word may be a word identified in the recognized text.
  • the user may be reading the text aloud and step 2760 may include comparing a spoken word with a word from the text.
  • the user may be answering a question, and step 2760 may include determining whether the at least one first word matches an expected answer to the question.
  • process 2700 may include determining feedback information indicating whether the at least one first word matches the at least one second word (e.g., whether the at least one first word matches a word in the text, whether the answer is correct, etc.).
  • the feedback information may include at least one of a number or a percentage of words in the written material read correctly by the user, a reading fluency of the user, a reading speed of the user, a number of self-corrections or attempted corrections made by the user, a number of substitutions made by the user, a number of insertions made by the user, a number of omissions made by the user, a number of hesitations of the user, a degree of correctness of an answer, or various other feedback as described herein.
  • the feedback information may include an indication of proper usage of punctuation by the user.
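The feedback information described above might be derived by aligning the recognized spoken words with the words recognized in the written material, as in the following sketch; the alignment method and the example sentences are illustrative assumptions.

```python
# Minimal sketch: align recognized spoken words against expected written words and count
# correct words, substitutions, insertions, and omissions.
from difflib import SequenceMatcher

def reading_feedback(expected_words, spoken_words):
    ops = SequenceMatcher(None, expected_words, spoken_words).get_opcodes()
    feedback = {"correct": 0, "substitutions": 0, "insertions": 0, "omissions": 0}
    for tag, i1, i2, j1, j2 in ops:
        if tag == "equal":
            feedback["correct"] += i2 - i1
        elif tag == "replace":
            feedback["substitutions"] += max(i2 - i1, j2 - j1)
        elif tag == "delete":    # expected words the user skipped
            feedback["omissions"] += i2 - i1
        elif tag == "insert":    # extra words the user added
            feedback["insertions"] += j2 - j1
    feedback["percent_correct"] = round(100.0 * feedback["correct"] / len(expected_words), 1)
    return feedback

expected = "the quick brown fox jumps over the lazy dog".split()
spoken = "the quick brown fox jumped over the dog".split()
print(reading_feedback(expected, spoken))
# {'correct': 7, 'substitutions': 1, 'insertions': 0, 'omissions': 1, 'percent_correct': 77.8}
```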
  • process 2700 may include ranking a reading level of the user relative to one or more other users associated with the reading device. For example, this may include determining a score indicating a performance of the user and comparing the score to scores of other users to determine a user rank. The rank of the user may be included in the feedback information.
  • process 2700 may include providing the feedback information to the user.
  • step 2780 may include causing the feedback information to be displayed via a dashboard on a display of a secondary device. Accordingly, step 2780 may include transmitting the feedback information to a secondary device associated with the user.
  • providing the feedback information may include presenting the feedback information via an audio output device, such as headphone 130a and/or speaker 130b.
  • process 2700 may further include associating the reading device with another user different from the user. For example, the other user may be a teacher or a parent of the user.
  • Step 2780 may further include transmitting the feedback information to a secondary device associated with the other user.
  • Computer programs based on the written description and methods of this specification are within the skill of a software developer.
  • the various functions, scripts, programs, or modules may be created using a variety of programming techniques.
  • programs, scripts, functions, program sections or program modules may be designed in or by means of languages, including JAVASCRIPT, C, C++, JAVA, PHP, PYTHON, RUBY, PERL, BASH, or other programming or scripting languages.
  • One or more of such software sections or modules may be integrated into a computer system, non-transitory computer readable media, or existing communications software.
  • the programs, modules, or code may also be implemented or replicated as firmware or circuit logic.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • Multimedia (AREA)
  • Educational Technology (AREA)
  • Educational Administration (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Acoustics & Sound (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

A reading device is disclosed that may include a light source configured to illuminate an object; a trigger button configured to activate the light source, the trigger being operable by a finger of a user's hand; a camera configured to capture images from an environment of the user; an audio output device configured to output audio signals; and at least one processor. The at least one processor may be programmed to: in response to operation of the trigger, project light from the light source to illuminate an area of the object; capture at least one image of the illuminated area of the object, the at least one image including a representation of written material; analyze the at least one image to recognize text; transform the recognized text into at least one audio signal; and output the at least one audio signal using the audio output device.
PCT/IB2022/000784 2021-12-23 2022-12-22 Appareil et procédés d'aide à la lecture WO2023118967A1 (fr)

Applications Claiming Priority (14)

Application Number Priority Date Filing Date Title
US202163293096P 2021-12-23 2021-12-23
US63/293,096 2021-12-23
US202163294895P 2021-12-30 2021-12-30
US63/294,895 2021-12-30
US202263307175P 2022-02-07 2022-02-07
US63/307,175 2022-02-07
US202263309566P 2022-02-13 2022-02-13
US63/309,566 2022-02-13
US202263329433P 2022-04-10 2022-04-10
US63/329,433 2022-04-10
US202263340494P 2022-05-11 2022-05-11
US63/340,494 2022-05-11
US202263355687P 2022-06-27 2022-06-27
US63/355,687 2022-06-27

Publications (1)

Publication Number Publication Date
WO2023118967A1 true WO2023118967A1 (fr) 2023-06-29

Family

ID=85283782

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2022/000784 WO2023118967A1 (fr) 2021-12-23 2022-12-22 Appareil et procédés d'aide à la lecture

Country Status (1)

Country Link
WO (1) WO2023118967A1 (fr)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6115482A (en) * 1996-02-13 2000-09-05 Ascent Technology, Inc. Voice-output reading system with gesture-based navigation
US20020029146A1 (en) * 2000-09-05 2002-03-07 Nir Einat H. Language acquisition aide
US6509893B1 (en) * 1999-06-28 2003-01-21 C Technologies Ab Reading pen
US20140253701A1 (en) * 2013-03-10 2014-09-11 Orcam Technologies Ltd. Apparatus and method for analyzing images

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6115482A (en) * 1996-02-13 2000-09-05 Ascent Technology, Inc. Voice-output reading system with gesture-based navigation
US6509893B1 (en) * 1999-06-28 2003-01-21 C Technologies Ab Reading pen
US20020029146A1 (en) * 2000-09-05 2002-03-07 Nir Einat H. Language acquisition aide
US20140253701A1 (en) * 2013-03-10 2014-09-11 Orcam Technologies Ltd. Apparatus and method for analyzing images

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
NEWYES SHENZEN: "AS1503 scan reader User Manual (1) Shenzhen Newyes Technology", 3 November 2021 (2021-11-03), pages 1 - 7, XP093038941, Retrieved from the Internet <URL:https://fccid.io/2AUY8-AS1503/User-Manual/User-Manual-5521585> [retrieved on 20230413] *
NEWYES SHENZEN: "Scan Reader", 3 November 2021 (2021-11-03), pages 1 - 2, XP093038946, Retrieved from the Internet <URL:https://fccid.io/2AUY8-AS1503/User-Manual/User-Manual-5521585> [retrieved on 20230413] *

Similar Documents

Publication Publication Date Title
US10977452B2 (en) Multi-lingual virtual personal assistant
JP7486540B2 (ja) 複数の年齢グループおよび/または語彙レベルに対処する自動化されたアシスタント
US20210081056A1 (en) Vpa with integrated object recognition and facial expression recognition
Anagnostopoulos et al. Features and classifiers for emotion recognition from speech: a survey from 2000 to 2011
US8793118B2 (en) Adaptive multimodal communication assist system
CN111833853B (zh) 语音处理方法及装置、电子设备、计算机可读存储介质
Batliner et al. Segmenting into adequate units for automatic recognition of emotion‐related episodes: a speech‐based approach
CN101551947A (zh) 辅助口语语言学习的计算机系统
CN111858876B (zh) 一种知识库的生成方法、文本查找方法和装置
Khorrami et al. Can phones, syllables, and words emerge as side-products of cross-situational audiovisual learning?--A computational investigation
Hoque et al. Robust recognition of emotion from speech
Dai [Retracted] An Automatic Pronunciation Error Detection and Correction Mechanism in English Teaching Based on an Improved Random Forest Model
WO2023118967A1 (fr) Appareil et procédés d'aide à la lecture
JP6978543B2 (ja) スマートリーディング機器およびその制御方法
KR20210066645A (ko) 스탠드형 스마트 리딩 기기 및 그 제어 방법
Avci A Pattern Mining Approach for Improving Speech Emotion Recognition
Madhusha et al. Mobile Base Sinhala Book Reader for Visually Impaired Students
Formolo et al. Extracting interpersonal stance from vocal signals
CN110059231B (zh) 一种回复内容的生成方法及装置
Jayalakshmi et al. Augmenting Kannada Educational Video with Indian Sign Language Captions Using Synthetic Animation
Abbas Improving Arabic Sign Language to support communication between vehicle drivers and passengers from deaf people
Matsuhira et al. Computational measurement of perceived pointiness from pronunciation
Siddiqui A Multi-modal Emotion Recognition Framework Through The Fusion Of Speech With Visible And Infrared Images
WO2023059818A1 (fr) Formatage automatisé de texte par approche linguistique à base acoustique
Simeoni et al. EMPATHIC VOICE: ENABLING EMOTIONAL INTELLIGENCE IN VIRTUAL ASSISTANTS

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22859521

Country of ref document: EP

Kind code of ref document: A1