CN112328308A - Method and device for recognizing text - Google Patents

Method and device for recognizing text

Info

Publication number
CN112328308A
CN112328308A (application CN202010120050.9A)
Authority
CN
China
Prior art keywords
text
recognized
instruction
text recognition
text image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010120050.9A
Other languages
Chinese (zh)
Inventor
Inventor not disclosed
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing ByteDance Network Technology Co Ltd
Original Assignee
Beijing ByteDance Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing ByteDance Network Technology Co Ltd filed Critical Beijing ByteDance Network Technology Co Ltd
Priority to CN202010120050.9A
Publication of CN112328308A
Legal status: Pending (current)

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/4401Bootstrapping
    • G06F9/4418Suspend and resume; Hibernate and awake
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/08Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Abstract

The embodiment of the application discloses a method and a device for recognizing text. One embodiment of the method comprises: determining whether a preset mode switching instruction is received, wherein the preset mode switching instruction is used for instructing the terminal device to start an immersive mode; in response to receiving the preset mode switching instruction, starting a camera of the terminal device; acquiring a text image to be recognized; and generating a text recognition result based on the text image to be recognized. This implementation enables text recognition to be completed without a wake-up operation while the terminal device is in the immersive mode, improving the efficiency of text recognition. Moreover, in children's-teaching scenarios, no wake-up word is needed, so the disruption to a child's train of thought that wake-up words cause during learning is avoided, the immersive learning experience is improved, and the user's concentration is better maintained.

Description

Method and device for recognizing text
Technical Field
The embodiment of the application relates to the technical field of computers, in particular to a method and a device for recognizing texts.
Background
With the rapid development of computer technology, intelligent electronic devices are increasingly widely used.
In the field of computer-aided teaching, a point-and-read pen or a scanning pen is usually used to recognize text. Before use, such an electronic device must first be woken up; only then does it read the corresponding text aloud.
Disclosure of Invention
The embodiment of the application provides a method and a device for recognizing texts.
In a first aspect, an embodiment of the present application provides a method for recognizing text, the method including: determining whether a preset mode switching instruction is received, wherein the preset mode switching instruction is used for instructing the terminal device to start an immersive mode; in response to receiving the preset mode switching instruction, starting a camera of the terminal device; acquiring a text image to be recognized; and generating a text recognition result based on the text image to be recognized.
In some embodiments, generating a text recognition result based on the text image to be recognized includes: determining whether a text recognition instruction is received; and, in response to receiving the text recognition instruction, performing text recognition on the text image to be recognized to generate a text recognition result.
In some embodiments, the text image to be recognized further includes an image of a pointer, and performing text recognition on the text image to be recognized to generate a text recognition result includes: determining the position of the pointer from the text image to be recognized; extracting a text image area to be recognized from the text image to be recognized according to the position of the pointer; and performing text recognition on the text image area to be recognized to generate a text recognition result.
In some embodiments, the method further comprises: performing speech synthesis (Text To Speech, TTS) according to the text recognition result to generate synthesized speech; and transmitting the synthesized speech.
In some embodiments, the determining whether the preset mode switching instruction is received includes: receiving a target voice; determining whether the target voice is matched with an instruction representing switching of a preset mode in a local instruction set; and responding to the fact that the target voice is matched with an instruction representing the switching of the preset mode in the local instruction set, and determining that the preset mode switching instruction is received.
In a second aspect, an embodiment of the present application provides an apparatus for recognizing text, the apparatus including: a determining unit configured to determine whether a preset mode switching instruction is received, wherein the preset mode switching instruction is used for instructing the terminal device to start an immersive mode; a starting unit configured to start a camera of the terminal device in response to receiving the preset mode switching instruction; an acquisition unit configured to acquire a text image to be recognized; and a first generation unit configured to generate a text recognition result based on the text image to be recognized.
In some embodiments, the first generating unit includes: a first determination module configured to determine whether a text recognition instruction is received; the first generation module is configured to perform text recognition on the text image to be recognized in response to receiving the text recognition instruction so as to generate a text recognition result.
In some embodiments, the text image to be recognized may further include an image of a pointer, and the first generation unit includes: a second determination module configured to determine the position of the pointer from the text image to be recognized; an extraction module configured to extract a text image area to be recognized from the text image to be recognized according to the position of the pointer; and a second generation module configured to perform text recognition on the text image area to be recognized to generate a text recognition result.
In some embodiments, the apparatus further comprises: a second generation unit configured to perform speech synthesis based on the text recognition result, generating a synthesized speech; a transmitting unit configured to transmit the synthesized speech.
In some embodiments, the determining unit includes: a receiving module configured to receive a target voice; a third determination module configured to determine whether the target speech matches an instruction in the local instruction set characterizing the switching of the preset mode; a fourth determination module configured to determine that the preset mode switch instruction is received in response to determining that the target speech matches an instruction in the local instruction set characterizing the preset mode switch.
In a third aspect, an embodiment of the present application provides a terminal, where the terminal includes: one or more processors; a storage device having one or more programs stored thereon; the camera is configured to acquire a text image to be recognized; when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the method as described in any implementation of the first aspect.
In a fourth aspect, the present application provides a computer-readable medium, on which a computer program is stored, which when executed by a processor implements the method described in any implementation manner of the first aspect.
The method and the device for recognizing text provided by the embodiment of the application first determine whether a preset mode switching instruction is received. The preset mode switching instruction is used for instructing the terminal device to start an immersive mode. Then, in response to receiving the preset mode switching instruction, a camera of the terminal device is started. Next, a text image to be recognized is acquired. Finally, a text recognition result is generated based on the text image to be recognized. Thus, text recognition can be completed without a wake-up operation while the terminal device is in the immersive mode, improving the efficiency of text recognition. Moreover, in children's-teaching scenarios, no wake-up word is needed, so the disruption to a child's train of thought that wake-up words cause during learning is avoided, the immersive learning experience is improved, and the user's concentration is better maintained.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is an exemplary system architecture diagram in which one embodiment of the present application may be applied;
FIG. 2 is a flow diagram of one embodiment of a method for recognizing text according to the present application;
FIG. 3 is a schematic illustration of an application scenario of a method for recognizing text according to an embodiment of the present application;
FIG. 4 is a flow diagram of yet another embodiment of a method for recognizing text according to the present application;
FIG. 5 is a schematic diagram illustrating the structure of one embodiment of an apparatus for recognizing text according to the present application;
FIG. 6 is a schematic block diagram of an electronic device suitable for use in implementing embodiments of the present application.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
Fig. 1 shows an exemplary architecture 100 to which the method for recognizing text or the apparatus for recognizing text of the present application can be applied.
As shown in fig. 1, system architecture 100 may include terminal device 101, network 102, and server 103. Network 102 is the medium used to provide communication links between terminal devices 101 and server 103. Network 102 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The terminal device 101 may interact with the server 103 via the network 102 to receive or transmit messages and the like. The terminal device 101 may be hardware or software. When the terminal device 101 is hardware, it may be any of various terminal devices, such as a smart phone, a point-and-read pen, or a scanning pen, that integrate a chip with a character recognition function and a camera. When the terminal device 101 is software, it can be installed in the electronic devices listed above, and may be implemented as multiple pieces of software or software modules (e.g., software or software modules for providing distributed services) or as a single piece of software or software module. No specific limitation is imposed here.
The server 103 may be a server that provides various services, such as a background server that provides support for the terminal apparatus 101 to generate text recognition results. The background server may analyze and process the text image to be recognized acquired by the terminal device 101, generate a processing result, and feed back the processing result (such as a text recognition result) to the terminal device 101.
It should be noted that the terminal device 101 may also directly analyze the acquired text image to be recognized to generate a text recognition result. In this case, the network 102 and the server 103 may be absent.
The server may be hardware or software. When the server is hardware, it may be implemented as a distributed server cluster formed by multiple servers, or may be implemented as a single server. When the server is software, it may be implemented as multiple pieces of software or software modules (e.g., software or software modules used to provide distributed services), or as a single piece of software or software module. And is not particularly limited herein.
It should be noted that the method for recognizing the text provided by the embodiment of the present application is generally executed by the terminal device 101, and accordingly, the apparatus for recognizing the text is generally disposed in the terminal device 101.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.

With continued reference to fig. 2, a flow 200 of one embodiment of a method for recognizing text in accordance with the present application is shown. The method for recognizing text comprises the following steps:
step 201, determining whether a preset mode switching instruction is received.
In the present embodiment, the execution subject of the method for recognizing text (such as the terminal device 101 shown in fig. 1) may first determine whether a preset mode switching instruction is received. A mode switching instruction may include various information characterizing the mode to switch to. The modes may include, but are not limited to, an immersive mode (no wake-up is required before operation) and a regular mode (the device must be woken up before operation). The preset mode switching instruction may be used to instruct the terminal device to start the immersive mode; for example, it may be an instruction to switch the device from the regular mode to the immersive mode. The immersive mode may be used to guarantee that continuous processes (e.g., a child's learning process) are not interrupted, by minimizing wake-up words and voice interactions. As an example, an electronic device in the immersive mode may remain operational at all times without wake-up words. As yet another example, an electronic device in the immersive mode may operate according to preset voice instructions (e.g., performing text recognition as soon as a text image is detected) without waiting for interactive speech such as "how do I read this word".
Step 202, in response to receiving the mode switching instruction, starting a camera of the terminal device.
In this embodiment, the execution subject may receive the preset mode switching instruction through a wired or wireless connection. As an example, the execution subject may obtain a locally generated preset mode switching instruction; such an instruction may be generated after the execution subject receives a user's click on an "enable immersive mode" button. As another example, the execution subject may also receive a preset mode switching instruction transmitted by a communicatively connected electronic device (for example, a remote controller paired with the execution subject).
Step 203, acquiring a text image to be recognized.
In this embodiment, the execution subject may acquire the text image to be recognized through a wired or wireless connection. As an example, the execution subject may directly capture an image containing text through an integrated camera as the text image to be recognized. As yet another example, the execution subject may also acquire an image from a communicatively connected electronic device as the text image to be recognized.
In some optional implementation manners of this embodiment, the text image to be recognized may further include an image of a pointer. The pointer may include various items, such as a finger, a pen, and the like, for indicating text in the text image to be recognized.
And step 204, generating a text recognition result based on the text image to be recognized.
In the present embodiment, the execution subject may generate a text recognition result based on the text image to be recognized in various ways. As an example, the execution subject may employ various OCR (Optical Character Recognition) techniques to generate the text recognition result. The OCR technique may, for example, use a pre-trained artificial neural network for recognition, such as a Convolutional Neural Network (CNN).
In some optional implementations of the embodiment, the executing subject may generate the text recognition result according to the following steps:
first, it is determined whether a text recognition instruction is received.
In these implementations, the execution body may first determine whether a text recognition instruction is received. The text recognition instruction may include a preset sentence (e.g., "how to read this word") or an operation (e.g., clicking a pre-specified button).
And secondly, in response to the received text recognition instruction, performing text recognition on the text image to be recognized to generate a text recognition result.
In these implementations, in response to receiving the text recognition instruction, the executing entity may employ the OCR techniques described above to generate a text recognition result.
Based on this optional implementation, the execution subject can treat receipt of a text recognition instruction as a precondition for text recognition, avoiding the excessive energy consumption that frequent text recognition operations would cause on the terminal device.
Optionally, based on the optional implementation manner, the executing body may perform text recognition on the text image to be recognized according to the following steps to generate a text recognition result:
in a first step, the position of the pointing object is determined from the text image to be recognized.
In these implementations, the execution subject may determine the position of the pointer from the text image to be recognized in various ways, based on the image of the pointer contained in it. As an example, the execution subject may input the text image to be recognized into a position determination model trained in advance, thereby generating position information representing the position of the pointer. The position determination model can be used to characterize the correspondence between a text image to be recognized and position information representing the position of the pointer within it, and may include various deep neural networks for Object Detection, such as MobileNet or YOLO. As another example, the execution subject may instead perform edge detection on the text image to be recognized, generate image features based on the edge detection result, compare these features with the features of a preset pointer image, and determine the position corresponding to the matched features as the position of the pointer.
And secondly, extracting a text image area to be recognized from the text image to be recognized according to the position of the pointer.
In these implementations, the execution body may extract the text image region to be recognized in various ways according to the position of the pointer determined in the first step. The text image region to be recognized may be used to indicate a text region pointed by the pointer.
As an example, the execution subject may first determine the position of a feature point of the pointer from the pointer position obtained in the first step. The feature points may include pre-designated points, such as a fingertip or a pen tip. The execution subject may then extract a region of a preset size containing the feature point as the text image region to be recognized.
And thirdly, performing text recognition on the text image area to be recognized to generate a text recognition result.
Based on this optional implementation, the execution subject may perform text recognition on the text image region to be recognized in the manner described above to generate a text recognition result. Optionally, the execution subject may further select, from the recognition result of the text image region to be recognized, the character string (for example, a word) closest to the feature point of the pointer as the text recognition result. In this way, the execution subject can crop a specific area of the text image to be recognized for text recognition, achieving a fine-grained text recognition function that is particularly suitable for accurately recognizing single characters and words.
In some optional implementations of this embodiment, the executing body may further continue to perform the following steps:
first, speech synthesis is performed according to a text recognition result to generate a synthesized speech.
In these implementations, the execution subject may perform speech synthesis using various speech synthesis methods according to the generated text recognition result, thereby generating synthesized speech. The speech synthesis may adopt at least one of the following models: the WaveNet model, the Deep Voice model, or the Tacotron model.
Second, the synthesized speech is transmitted.
In these implementations, the execution subject may transmit the synthesized speech generated in the first step to a target electronic device. The target electronic device may be any electronic device specified in advance according to actual application requirements, or an electronic device determined according to a rule, such as a communicatively connected speaker. The speaker may be integrated with the execution subject or independent of it.
With continued reference to fig. 3, fig. 3 is a schematic diagram of an application scenario of a method for recognizing text according to an embodiment of the present application. In the application scenario of fig. 3, a camera 3011, a speaker 3012, and a processor (not shown in the figure) with a text recognition function may be installed on the intelligent desk lamp 301 for recognizing text. The intelligent desk lamp 301 receives a preset mode switching instruction input by a user and turns on the camera 3011. The user then places a finger 302 on the book 303, pointing to the text that needs to be queried, and says "how to read this word". Thereafter, the camera 3011 captures an image including the text of the book 303 pointed to by the finger 302 as the text image to be recognized. Then, the processor of the intelligent desk lamp 301 recognizes the text image to be recognized and generates a text recognition result. Optionally, the intelligent desk lamp 301 may further broadcast the generated text recognition result by voice through the speaker 3012 using speech synthesis technology.
At present, in one prior-art approach, a user must first wake up a text recognition device with a wake-up word. In learning situations where text recognition is needed many times, or where interruptions to thinking should be minimized, the user has to speak the wake-up word repeatedly and frequently, resulting in low text recognition interaction efficiency. In the method provided by the embodiment of the application, the electronic device is switched to the immersive mode through the preset mode switching instruction, so that text recognition can be performed without a wake-up word, improving the efficiency of text recognition. Moreover, in children's-teaching scenarios, no wake-up word is needed, so the disruption to a child's train of thought that wake-up words cause during learning is avoided, interruptions to the learning process are reduced, the immersive learning experience is improved, and the user's concentration is better maintained.
With further reference to FIG. 4, a flow 400 of yet another embodiment of a method for recognizing text is shown. The flow 400 of the method for recognizing text comprises the following steps:
step 401, receiving a target voice.
In the present embodiment, the execution subject of the method for recognizing text (e.g., the terminal device 101 shown in fig. 1) may receive the target voice through a wired or wireless connection. The target voice may be a voice arbitrarily selected from collected voices according to actual application requirements, or a voice determined according to a rule (for example, a voice whose length is greater than a preset threshold).
Step 402, determining whether the target voice matches an instruction in the local instruction set characterizing the switching of the preset mode.
In this embodiment, the execution subject may first perform speech recognition on the target speech received in step 401 to generate a speech recognition text. Then, the execution subject may determine whether the speech recognition text contains a keyword (e.g., "immersive mode") matching an instruction in the local instruction set that characterizes the preset mode switch; a match is found exactly when such a keyword is present. The local instruction set refers to an instruction set stored locally on the execution subject, so that matching requires no network interaction with a remote server.
Step 403, in response to determining that the target voice matches the instruction in the local instruction set that characterizes the switching of the preset mode, determining that the preset mode switching instruction is received.
In this embodiment, in response to determining that the target voice matches an instruction in the local instruction set that characterizes the preset mode switch, the execution subject may determine that the preset mode switching instruction is received.
It should be noted that the description of the preset mode switching instruction is the same as the corresponding description in the foregoing embodiment and is not repeated here.
And step 404, in response to receiving a preset mode switching instruction, starting a camera of the terminal device.
Step 405, acquiring a text image to be recognized.
And 406, generating a text recognition result based on the text image to be recognized.
Step 404, step 405, and step 406 are respectively consistent with step 202, step 203, and step 204 in the foregoing embodiment, and the above description on step 202, step 203, step 204, and their optional implementation also applies to step 404, step 405, and step 406, which is not described herein again.
As can be seen from fig. 4, the flow 400 of the method for recognizing text in the present embodiment refines the step of determining whether a preset mode switching instruction is received. Thus, the scheme described in this embodiment can control the mode switching of the electronic device through voice and improve recognition speed through the local instruction set, which not only facilitates user operation but also shortens the waiting time in user interaction and improves the user experience.
With further reference to fig. 5, as an implementation of the methods shown in the above-mentioned figures, the present application provides an embodiment of an apparatus for recognizing a text, where the embodiment of the apparatus corresponds to the embodiments of the methods shown in fig. 2 and fig. 4, and the apparatus may be specifically applied to various terminal devices.
As shown in fig. 5, the apparatus 500 for recognizing text provided by the present embodiment includes a determining unit 501, a starting unit 502, an acquisition unit 503, and a first generation unit 504. The determining unit 501 is configured to determine whether a preset mode switching instruction is received, wherein the preset mode switching instruction is used for instructing the terminal device to start an immersive mode; the starting unit 502 is configured to start a camera of the terminal device in response to receiving the preset mode switching instruction; the acquisition unit 503 is configured to acquire a text image to be recognized; and the first generation unit 504 is configured to generate a text recognition result based on the text image to be recognized.
In the present embodiment, in the apparatus 500 for recognizing text: the specific processes of the determining unit 501, the starting unit 502, the obtaining unit 503 and the first generating unit 504 and the technical effects thereof can refer to the related descriptions of step 201, step 202, step 203 and step 204 in the corresponding embodiment of fig. 2, which are not described herein again.
In some optional implementations of the present embodiment, the first generating unit 504 may include a first determining module (not shown in the figure) and a first generating module (not shown in the figure). Wherein the first determining module may be configured to determine whether a text recognition instruction is received. The first generation module may be configured to perform text recognition on the text image to be recognized in response to receiving the text recognition instruction, so as to generate a text recognition result.
In some optional implementation manners of this embodiment, the text image to be recognized may further include an image of a pointer. The first generating unit 504 may include a second determining module (not shown), an extracting module (not shown), and a second generating module (not shown). Wherein the second determination module may be configured to determine the position of the pointer from the text image to be recognized. The extraction module may be configured to extract a text image area to be recognized from the text image to be recognized according to the position of the pointer. The second generation module may be configured to perform text recognition on the text image area to be recognized to generate a text recognition result.
In some optional implementations of the present embodiment, the apparatus 500 for recognizing text may further include: a second generating unit (not shown in the figure), a sending unit (not shown in the figure). Wherein the second generating unit may be configured to perform speech synthesis based on the text recognition result, and generate a synthesized speech. The above-mentioned transmission unit may be configured to transmit the synthesized speech.
In some optional implementations of this embodiment, the determining unit 501 may include: a receiving module (not shown), a third determining module (not shown), and a fourth determining module (not shown). Wherein the receiving module may be configured to receive the target voice. The third determining module may be configured to determine whether the target speech matches an instruction in the local instruction set characterizing the preset mode switch. The fourth determining module may be configured to determine that the preset mode switch instruction is received in response to determining that the target speech matches an instruction in the local instruction set characterizing the preset mode switch.
In the apparatus provided by the above embodiment of the present application, the determining unit 501 determines whether a preset mode switching instruction is received, where the preset mode switching instruction is used to instruct the terminal device to start the immersive mode. The starting unit 502 starts a camera of the terminal device in response to receiving the mode switching instruction. After that, the acquisition unit 503 acquires a text image to be recognized. Finally, the first generating unit 504 generates a text recognition result based on the text image to be recognized. Thus, text recognition can be completed without a wake-up operation while the terminal device is in the immersive mode, improving the efficiency of text recognition. Moreover, in children's-teaching scenarios, no wake-up word is needed, so the disruption to a child's train of thought that wake-up words cause during learning is avoided, the immersive learning experience is improved, and the user's concentration is better maintained.
Referring now to fig. 6, shown is a schematic diagram of an electronic device 600 (e.g., the terminal device 101 of fig. 1) suitable for implementing embodiments of the present application. The terminal device in the embodiments of the present application may include, but is not limited to, mobile terminals such as a mobile phone, a notebook computer, a PDA (personal digital assistant), a PAD (tablet computer), and a point-and-read device, and fixed terminals such as a digital TV and a desktop computer. The terminal device shown in fig. 6 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the present application.
As shown in fig. 6, the electronic device 600 may include a Central Processing Unit (CPU) 601, a memory 602, an input unit 603, and an output unit 604, which are connected to one another through a bus 605. Here, the method according to an embodiment of the present application may be implemented as a computer program and stored in the memory 602. The central processing unit 601 in the electronic device 600 implements the text recognition function defined in the method of the embodiment of the present application by calling the computer program stored in the memory 602. In some implementations, the input unit 603 may include a camera. The output unit 604 may include a display screen that can be used to display the text recognition result, and may also include a speaker that can be used to play speech generated from the text recognition result. Thus, when calling the computer program to perform the text recognition function, the central processing unit 601 may control the input unit 603, in response to the received mode switching instruction, to start the camera of the terminal device and acquire the text image to be recognized, and may control the output unit 604 to display or broadcast the text recognition result.
It should be noted that the computer readable medium described in the embodiments of the present application may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In embodiments of the application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In embodiments of the present application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (Radio Frequency), etc., or any suitable combination of the foregoing.
The computer readable medium may be embodied in the electronic device, or may exist separately without being assembled into the terminal device. The computer readable medium carries one or more programs which, when executed by the terminal device, cause the terminal device to: determine whether a preset mode switching instruction is received, wherein the preset mode switching instruction is used for instructing the terminal device to start an immersive mode; in response to receiving the mode switching instruction, start a camera of the terminal device; acquire a text image to be recognized; and generate a text recognition result based on the text image to be recognized.
Computer program code for carrying out operations for embodiments of the present application may be written in any combination of one or more programming languages, including object oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present application may be implemented by software or hardware. The described units may also be provided in a processor, which may be described as: a processor including a determination unit, an activation unit, an acquisition unit, and a first generation unit. The names of these units do not, in some cases, constitute a limitation on the units themselves; for example, the determination unit may also be described as "a unit that determines whether a preset mode switching instruction for instructing the terminal device to turn on the immersive mode is received".
The above description is only a preferred embodiment of the application and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention in the embodiments of the present application is not limited to the specific combination of the above-mentioned features, but also encompasses other embodiments in which any combination of the above-mentioned features or their equivalents is made without departing from the inventive concept defined above. For example, a technical solution formed by interchanging the above features with (but not limited to) features having similar functions disclosed in the embodiments of the present application also falls within this scope.

Claims (12)

1. A method for recognizing text, comprising:
determining whether a preset mode switching instruction is received, wherein the preset mode switching instruction is used for instructing the terminal device to start an immersive mode;
in response to receiving the preset mode switching instruction, starting a camera of the terminal device;
acquiring a text image to be recognized;
and generating a text recognition result based on the text image to be recognized.
2. The method of claim 1, wherein the generating a text recognition result based on the text image to be recognized comprises:
determining whether a text recognition instruction is received;
and, in response to receiving the text recognition instruction, performing text recognition on the text image to be recognized to generate the text recognition result.
3. The method according to claim 2, wherein the text image to be recognized further comprises an image of a pointer; and
the text recognition of the text image to be recognized to generate the text recognition result includes:
determining the position of a pointer from the text image to be recognized;
extracting a text image area to be recognized from the text image to be recognized according to the position of the pointer;
and performing text recognition on the text image area to be recognized to generate a text recognition result.
4. The method of claim 1, wherein the method further comprises:
performing voice synthesis according to the text recognition result to generate synthetic voice;
and sending the synthesized voice.
5. The method according to one of claims 1 to 4, wherein the determining whether a preset mode switching instruction is received comprises:
receiving a target voice;
determining whether the target voice is matched with an instruction representing switching of a preset mode in a local instruction set;
and in response to determining that the target voice matches an instruction in a local instruction set characterizing preset mode switching, determining that the preset mode switching instruction is received.
6. An apparatus for recognizing text, comprising:
a determining unit configured to determine whether a preset mode switching instruction is received, wherein the preset mode switching instruction is used for instructing the terminal device to start an immersive mode;
a starting unit configured to start a camera of the terminal device in response to receiving the preset mode switching instruction;
an acquisition unit configured to acquire a text image to be recognized;
and a first generation unit configured to generate a text recognition result based on the text image to be recognized.
7. The apparatus of claim 6, wherein the first generating unit comprises:
a first determination module configured to determine whether a text recognition instruction is received;
the first generation module is configured to perform text recognition on the text image to be recognized in response to receiving the text recognition instruction so as to generate the text recognition result.
8. The apparatus of claim 7, wherein the text image to be recognized further comprises an image of a pointer; the first generation unit includes:
a second determination module configured to determine the position of the pointer from the text image to be recognized;
an extraction module configured to extract a text image area to be recognized from the text image to be recognized according to the position of the pointer;
and the second generation module is configured to perform text recognition on the text image area to be recognized so as to generate a text recognition result.
9. The apparatus of claim 6, wherein the apparatus further comprises:
a second generation unit configured to perform speech synthesis in accordance with the text recognition result to generate a synthesized speech;
a transmitting unit configured to transmit the synthesized voice.
10. The apparatus according to one of claims 6-9, wherein the determining unit comprises:
a receiving module configured to receive a target voice;
a third determination module configured to determine whether the target speech matches an instruction in a local instruction set characterizing a preset mode switch;
a fourth determination module configured to determine that the preset mode switch instruction is received in response to determining that the target speech matches an instruction in a local instruction set characterizing a preset mode switch.
11. A terminal, comprising:
one or more processors;
a storage device having one or more programs stored thereon;
the camera is configured to acquire a text image to be recognized;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-5.
12. A computer-readable medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-5.
CN202010120050.9A 2020-02-26 2020-02-26 Method and device for recognizing text Pending CN112328308A (en)

Priority Applications (1)

Application number: CN202010120050.9A (published as CN112328308A)
Priority date: 2020-02-26 · Filing date: 2020-02-26
Title: Method and device for recognizing text

Applications Claiming Priority (1)

Application number: CN202010120050.9A (published as CN112328308A)
Priority date: 2020-02-26 · Filing date: 2020-02-26
Title: Method and device for recognizing text

Publications (1)

Publication number: CN112328308A
Publication date: 2021-02-05

Family

ID: 74303524

Family Applications (1)

Application number: CN202010120050.9A (Pending; published as CN112328308A)
Title: Method and device for recognizing text
Priority date / filing date: 2020-02-26

Country Status (1)

Country: CN · Publication: CN112328308A (en)


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109189885A (en) * 2018-08-31 2019-01-11 广东小天才科技有限公司 A kind of real-time control method and smart machine based on smart machine camera
CN109817203A (en) * 2019-02-19 2019-05-28 广东小天才科技有限公司 A kind of method and system of interactive voice


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210205