WO2023231987A1 - Text recognition method and electronic device - Google Patents

Text recognition method and electronic device

Info

Publication number: WO2023231987A1 (PCT/CN2023/096921)
Authority: WO, WIPO (PCT)
Prior art keywords: text, content, electronic device, text content, area
Application number: PCT/CN2023/096921
Other languages: French (fr), Chinese (zh)
Inventors: 滕益华, 吴觊豪, 洪芳宇
Original assignee: Huawei Technologies Co., Ltd. (华为技术有限公司)
Application filed by Huawei Technologies Co., Ltd. (华为技术有限公司)
Publication of WO2023231987A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06V30/00: Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10: Character recognition
    • G06V30/14: Image acquisition
    • G06V30/146: Aligning or centring of the image pick-up or image-field
    • G06V30/19: Recognition using electronic means

Definitions

  • Embodiments of the present application relate to the field of terminal devices, and in particular, to a text recognition method and an electronic device.
  • the user can use the text recognition function of the application to identify the text in the picture or interface.
  • the text recognition function is implemented based on optical character recognition (OCR) technology.
  • the application can recognize the text in the picture based on OCR technology and output the recognition results.
  • the output results of current OCR technology after text recognition can differ considerably from the original text, which affects the user experience.
  • this application provides a text recognition method and electronic device.
  • the electronic device can output a text recognition result that meets the user's needs based on the image and text content of the text area.
  • embodiments of the present application provide a text recognition method.
  • the method includes: the electronic device performs text area detection on an object to be recognized, and obtains an image of a first text area, where the first text area includes text content.
  • the electronic device performs text content recognition on the acquired first text area to obtain the first text content.
  • the electronic device performs classification based on the image of the first text area and the first text content, and obtains a classification result.
  • the electronic device displays the text recognition result of the first text area based on the classification result.
  • the step of displaying the text recognition result may specifically include: if the classification result is the first category, the first text content is filtered out of the text recognition result; if the classification result is the second category, the text recognition result includes the text content after the first text content has been corrected; and if the classification result is the third category, the text recognition result includes the first text content.
  • in this way, the electronic device can jointly consider the image information (i.e., the image of the text area) and the text information (i.e., the text content): when the text content contained in the text area is largely missing, the recognition result of the text content (i.e., the first text content) is filtered out; when less of the text content is missing, the corrected result is output; and when the text content is not missing, the corresponding text is output directly.
  • in this way, correct and semantically smooth results can be presented in the text recognition results, while results with semantic errors are filtered out, so that a complex, anthropomorphic decision-making effect is obtained to improve the user experience.
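  • as an illustrative sketch only (the function and variable names below are hypothetical and not taken from the patent), the three-way decision described above could be expressed as:

        def display_result(category, raw_text, corrected_text):
            """Map the classification result to the displayed recognition result."""
            if category == 1:          # first category: filter the first text content
                return None            # nothing is displayed for this text area
            if category == 2:          # second category: output the corrected content
                return corrected_text
            return raw_text            # third category: output the content as-is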
  • the text recognition result is optionally displayed in the text recognition result display box 405 in FIG. 4. That is to say, if the classification result indicates the first category (i.e., filtering), the result corresponding to the first text area in the text recognition result display box 405 is empty, that is, the text content recognition result corresponding to the first text area (i.e., the first text content) is not displayed. If the classification result indicates the second category (i.e., outputting the corrected text content) or the third category (i.e., directly outputting the text content), the text recognition result display box 405 includes the corrected text content corresponding to the first text area or the text content of the first text area.
  • the text recognition result may be a result corresponding to the text area itself.
  • if the text recognition result is the result indicated by the first category (i.e., filtering), the text recognition result corresponding to the first text area displayed by the electronic device is empty (a blank may be shown, or no blank may be left).
  • otherwise, the electronic device may display, in the text recognition result display box 405, the text content corresponding to the first text area (which may be the corrected text content or the result of text content recognition).
  • the classification result is optionally a numerical value, and the numerical value is used to represent the classification item.
  • the classification result may also include three numerical values, where the classification corresponding to the largest value is the classification corresponding to the first text area.
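  • for example (a minimal sketch, assuming the classification head emits one score per category; the values are invented for illustration):

        scores = [0.1, 0.7, 0.2]  # hypothetical scores for categories 1, 2 and 3
        category = max(range(len(scores)), key=lambda i: scores[i]) + 1
        # category == 2 here, so the corrected text content would be displayed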
  • the electronic device performs classification based on the image of the first text area and the first text content, and obtains the classification result, including: the electronic device obtains intermediate representation information based on the image of the first text area and the first text content.
  • the electronic device classifies the intermediate representation information and obtains the classification result.
  • the intermediate representation information may be called multi-modal information.
  • the intermediate representation information may be used to characterize the image features of the image of the first text area and the text features of the first text content.
  • the electronic device classifies the intermediate representation information and obtains the classification result, including: the electronic device classifies the intermediate representation information through the classification model and obtains the classification result. In this way, the electronic device can classify the intermediate representation information through the pre-trained classification model to obtain the corresponding classification result.
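  • a minimal sketch of such a classification model, assuming the intermediate representation is pooled into a fixed-size vector and using PyTorch purely for illustration (the dimensions and class count are assumptions, not taken from the patent):

        import torch
        import torch.nn as nn

        class Classifier(nn.Module):
            """Hypothetical classification head over the intermediate representation."""
            def __init__(self, dim=512, num_classes=3):
                super().__init__()
                self.head = nn.Linear(dim, num_classes)

            def forward(self, fused):      # fused: (batch, dim) representation
                return self.head(fused)    # (batch, 3) scores: filter/correct/output

        logits = Classifier()(torch.randn(1, 512))
        category = logits.argmax(dim=-1).item() + 1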
  • before the electronic device displays the text recognition result of the first text area based on the classification result, the method further includes: the electronic device corrects the intermediate representation information to obtain the corrected first text content. For example, before, at the same time as, or after classifying the intermediate representation information, the electronic device corrects the intermediate representation information to obtain the corrected text content.
  • the electronic device can determine whether to output the corrected text content based on the classification result. For example, if the corrected text content does not need to be output (e.g., the classification result is the first category or the third category), the corrected text content is discarded.
  • the electronic device correcting the intermediate representation information to obtain the corrected target text content includes: the electronic device corrects the intermediate representation information through the correction model to obtain the corrected first text content.
  • in this way, the electronic device can use the pre-trained correction model to correct the intermediate representation information to obtain the corrected text content.
  • the electronic device obtaining the intermediate representation information based on the image of the first text area and the first text content includes: the electronic device performs image encoding on the image of the first text area to obtain the first image encoding information. The electronic device performs text encoding on the first text content to obtain the first text encoding information. The electronic device performs multi-modal encoding on the first image encoding information and the first text encoding information through a multi-modal encoding model to obtain the intermediate representation information. In this way, the electronic device can obtain higher-dimensional semantic information by encoding the image and text content of the text area, and can perform multi-modal encoding on the first image encoding information and the first text encoding information through a pre-trained multi-modal encoding model to obtain intermediate representation information with high-dimensional semantics.
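  • the encoding pipeline might look like the following sketch (PyTorch, with invented dimensions and architectures; the patent does not fix any of these choices):

        import torch
        import torch.nn as nn

        dim = 512
        image_encoder = nn.Sequential(                     # image encoding
            nn.Conv2d(3, dim, kernel_size=16, stride=16),  # split into 16x16 patches
            nn.Flatten(2))                                 # (B, dim, num_patches)
        text_encoder = nn.Embedding(30000, dim)            # text encoding (token ids)
        fusion = nn.TransformerEncoder(                    # multi-modal encoding model
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=2)

        image = torch.randn(1, 3, 32, 256)                 # image of the first text area
        tokens = torch.randint(0, 30000, (1, 16))          # first text content

        img_code = image_encoder(image).transpose(1, 2)    # first image encoding info
        txt_code = text_encoder(tokens)                    # first text encoding info
        intermediate = fusion(torch.cat([img_code, txt_code], dim=1))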
  • the multi-modal coding model, the classification model and the correction model form a neural network
  • the training data of the neural network includes a second text area and second text content corresponding to the second text area, as well as a third text area and third text content corresponding to the third text area; the second text area includes partially missing text content, and the text content in the third text area is complete text content.
  • the neural network can be trained cyclically, so that the neural network can complete the corresponding functions, that is, fusing, classifying and correcting image and text content.
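  • one possible (hypothetical) joint training step, assuming the fused representation feeds both a classification head and a correction decoder, with a combined loss; the model.fuse/classify/correct API is invented for illustration:

        import torch.nn as nn

        cls_loss_fn = nn.CrossEntropyLoss()   # supervises the classification result
        cor_loss_fn = nn.CrossEntropyLoss()   # supervises the corrected token sequence

        def train_step(model, optimizer, batch):
            fused = model.fuse(batch["image"], batch["tokens"])   # hypothetical API
            cls_logits = model.classify(fused)                    # (B, 3)
            cor_logits = model.correct(fused)                     # (B, T, vocab)
            loss = (cls_loss_fn(cls_logits, batch["category"]) +
                    cor_loss_fn(cor_logits.flatten(0, 1),
                                batch["target_tokens"].flatten()))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            return loss.item()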
  • the text recognition result of the first text area is displayed in a text recognition area, and the text recognition area also includes the text content corresponding to a third text area in the object to be recognized.
  • the text recognition method in this application can implement different processing methods for text content, that is, the text recognition results finally displayed are semantically coherent text content.
  • filtering or correction methods are used to avoid the impact of the semantically incoherent text content on the text recognition results.
  • optionally, when the first text area includes partially missing text content, the classification result is the first category or the second category.
  • the partially missing text content may mean that each character in the text area is missing part of its information; for example, the upper half or the lower half may be missing. Optionally, partially missing text content may also mean that at least one character in the text area is missing part of its information.
  • the semantics expressed by the first text content are different from the semantics expressed by the text content in the first text area.
  • in this way, the text content recognition results can be screened, and text content whose semantics differ from the original is filtered or corrected, thereby improving the user experience.
  • the object to be identified is a picture, a web page or a document.
  • embodiments of the present application provide a text recognition method.
  • the method includes: an electronic device detects a text area of an object to be recognized, and obtains an image of a first text area; the first text area includes text content.
  • the electronic device performs text content recognition on the first text area to obtain the first text content.
  • the electronic device displays the text recognition result of the first text area based on the image of the first text area and the first text content.
  • the electronic device displaying the text recognition result of the first text area based on the image of the first text area and the first text content includes: if the image of the first text area indicates that the first text area includes partially missing text content and the first text content is semantically coherent text content, or if the image of the first text area indicates that the first text area does not include partially missing text content, the text recognition result includes the first text content; if the image of the first text area indicates that the first text area includes partially missing text content and the first text content includes text content with semantic errors, the first text content is filtered from the text recognition result, or the text recognition result includes the text content after the first text content is corrected.
  • in this way, the electronic device can jointly consider the image information (i.e., the image of the text area) and the text information (i.e., the text content): when the text content contained in the text area is largely missing, the recognition result of the text content (i.e., the first text content) is filtered out; when less of the text content is missing, the corrected result is output; and when the text content is not missing, the corresponding text is output directly.
  • in this way, correct and semantically smooth results can be presented in the text recognition results, while results with semantic errors are filtered out, so that a complex, anthropomorphic decision-making effect is obtained to improve the user experience.
  • the electronic device can detect, based on the image of the text area, whether the text content in the text area is truncated, that is, whether the text includes missing content.
  • if the text content is not truncated, the first text content can be output directly.
  • if the text content is truncated, it is detected whether the semantics of the first text content are coherent. If the semantics of the first text content are coherent, the first text content can be output directly. If the semantics of the first text content are incoherent, it is further detected whether the first text content can be corrected. If the first text content can be corrected, the corrected text content is output. If the first text content cannot be corrected, the first text content is filtered.
  • the electronic device displaying the text recognition result of the first text area based on the image of the first text area and the first text content includes: if the image of the first text area indicates that the first text area includes partially missing text content and the first text content includes semantically incoherent text content, the electronic device detects whether the first text content can be corrected. If the first text content cannot be corrected, the first text content is filtered from the text recognition result. If the first text content can be corrected, the text recognition result includes the text content after the first text content is corrected. In this way, when the electronic device detects that the text content in the first text area is truncated and the semantics of the first text content are incoherent, it can further detect whether the first text content can be corrected.
  • if the first text content can be corrected, the electronic device corrects the first text content and outputs the corrected text content; if it cannot be corrected, the electronic device filters the first text content. That is to say, the text recognition result of the first text area displayed by the electronic device is either empty, or the corrected text content, or the original semantically coherent text content, so as to avoid the impact of incorrect text content recognition results on the user.
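  • the decision flow described above can be sketched as follows (is_truncated, is_coherent, is_correctable and correct are hypothetical stand-ins for the classification, semantic and correction models):

        def recognize_line(image, raw_text):
            if not is_truncated(image):    # image indicates no missing content
                return raw_text            # output the first text content directly
            if is_coherent(raw_text):      # truncated image but coherent text
                return raw_text
            if is_correctable(raw_text):   # incoherent but correctable
                return correct(raw_text)   # output the corrected text content
            return None                    # cannot be corrected: filter the content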
  • the method also includes: the electronic device corrects the first text content through the correction model to obtain text content after the first text content is corrected. In this way, the electronic device can correct the first text content through the pre-trained correction model to obtain semantically coherent text content.
  • the electronic device displaying the text recognition result of the first text area based on the image of the first text area and the first text content includes: the electronic device classifies the image of the first text area through a classification model to obtain a classification result; the classification result is used to indicate whether the first text area includes partially missing text content.
  • the electronic device can classify the image of the text area through the pre-trained classification model to detect whether the text content in the text area is truncated.
  • the electronic device displaying the text recognition result of the first text area based on the image of the first text area and the first text content includes: the electronic device performs semantic analysis on the first text content through a semantic model to obtain a semantic analysis result; the semantic analysis result is used to indicate whether the first text content includes text content with semantic errors.
  • the electronic device can perform semantic analysis on the text content through the pre-trained semantic model to obtain semantic analysis results.
  • the semantic analysis result can be a numerical value
  • the electronic device can preset a semantic coherence threshold, and the threshold is used to indicate the semantic coherence of the text content. If the value of the semantic analysis result is greater than or equal to the threshold, the first text content is semantically coherent. If the value of the semantic analysis result is less than the threshold, the first text content is semantically incoherent.
  • the semantic analysis result is also used to indicate whether the first text content can be corrected, and the electronic device displaying the text recognition result of the first text area based on the image of the first text area and the first text content includes: the electronic device determines, based on the semantic analysis result, whether the first text content can be corrected.
  • the electronic device may set a correction threshold that is different from the semantic coherence threshold. If the value of the semantic analysis result is greater than or equal to the correction threshold, the first text content may be corrected. If the value of the semantic analysis result is less than the correction threshold, the first text content cannot be corrected.
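  • for illustration, a sketch with two invented threshold values (the patent does not specify concrete numbers):

        COHERENCE_THRESHOLD = 0.8   # hypothetical semantic-coherence threshold
        CORRECTION_THRESHOLD = 0.5  # hypothetical correction threshold

        def interpret(score):
            if score >= COHERENCE_THRESHOLD:
                return "coherent"      # output the first text content directly
            if score >= CORRECTION_THRESHOLD:
                return "correctable"   # output the corrected text content
            return "filter"            # cannot be corrected; filter the content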
  • the correction model, the classification model, and the semantic model form a neural network
  • the training data of the neural network includes a second text area and a second text corresponding to the second text area. content, as well as a third text area and third text content corresponding to the third text area; the second text area includes partially missing text content, and the text content in the third text area is complete text content.
  • the text recognition result of the first text area is displayed in a text recognition area, and the text recognition area also includes the text content corresponding to a third text area in the object to be recognized.
  • the semantics expressed by the semantically incorrect text content are different from the semantics expressed by the corresponding text content in the first text area.
  • the object to be identified is a picture, a web page or a document.
  • embodiments of the present application provide an electronic device.
  • the electronic device includes: one or more processors; a memory; and one or more computer programs, where the one or more computer programs are stored in the memory, and when the computer programs are executed by the one or more processors, the electronic device is caused to perform the method in the first aspect or any possible implementation of the first aspect.
  • embodiments of the present application provide an electronic device.
  • the electronic device includes: one or more processors; a memory; and one or more computer programs, where the one or more computer programs are stored in the memory, and when the computer programs are executed by the one or more processors, the electronic device is caused to perform the method in the second aspect or any possible implementation of the second aspect.
  • embodiments of the present application provide a computer-readable medium for storing a computer program, where the computer program includes instructions for executing the method in the first aspect or any possible implementation of the first aspect.
  • embodiments of the present application provide a computer-readable medium for storing a computer program.
  • the computer program includes instructions for executing the method in the second aspect or any possible implementation of the second aspect.
  • embodiments of the present application provide a computer program, which includes instructions for executing the method in the first aspect or any possible implementation of the first aspect.
  • embodiments of the present application provide a computer program, which includes instructions for executing the method in the second aspect or any possible implementation of the second aspect.
  • Figure 1 is a schematic diagram of the hardware structure of an exemplary electronic device
  • Figure 2 is a schematic diagram of the software structure of an exemplary electronic device
  • Figure 3 is a schematic diagram of a text recognition scene containing truncated text
  • Figure 4 is a schematic diagram illustrating an application scenario for applying the text recognition method in the embodiment of the present application
  • Figure 5 is a schematic flow chart of an exemplary text recognition method
  • Figure 6 is a schematic diagram of exemplary text recognition
  • Figure 7 is an exemplary text image encoding schematic diagram
  • Figure 8 is an exemplary schematic diagram of image information encoding
  • Figure 9 is an exemplary schematic diagram of image information encoding
  • Figure 10 is an exemplary schematic diagram of image patch flattening
  • Figure 11 is an exemplary text content encoding schematic diagram
  • Figure 12 is a schematic diagram of an exemplary text information encoding process
  • Figure 13 is a schematic diagram of an exemplary acquisition process of intermediate representation information
  • Figure 14a is a schematic diagram of an exemplary multi-modal encoding
  • Figure 14b is a schematic diagram of the processing flow of the multi-modal encoder
  • Figure 14c is a schematic diagram of an exemplary classification process
  • Figure 15 is an exemplary text modification schematic diagram
  • Figure 16 is a schematic diagram of the processing flow of the correction module
  • Figure 17 is a schematic diagram of the processing flow of the Transformer Decoder.
  • Figure 18a is a schematic diagram of an exemplary application scenario
  • Figure 18b is a schematic diagram of another application scenario
  • Figure 18c is a schematic diagram of another exemplary application scenario
  • Figure 18d is a schematic diagram of another exemplary application scenario
  • Figure 18e is a schematic diagram of another exemplary application scenario.
  • Figure 19 is a schematic flow chart of an exemplary text recognition method
  • Figure 20 is an exemplary schematic diagram of text image processing
  • Figure 21 is an exemplary processing flow of the semantic model
  • Figure 22 is a schematic structural diagram of an exemplary device.
  • FIG. 1 shows a schematic structural diagram of an electronic device 100 .
  • the electronic device 100 shown in FIG. 1 is only an example of an electronic device; the electronic device 100 may have more or fewer components than shown in the figure, may combine two or more components, or may have a different component configuration.
  • the various components shown in Figure 1 may be implemented in hardware, software, or a combination of hardware and software including one or more signal processing and/or application specific integrated circuits.
  • the electronic device 100 may include: a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, a headphone interface 170D, a sensor module 180, a button 190, a motor 191, an indicator 192, a camera 193, a display screen 194, a subscriber identification module (SIM) card interface 195, etc.
  • the sensor module 180 may include a pressure sensor 180A, a gyro sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, and a distance sensor 180F.
  • the processor 110 may include one or more processing units.
  • the processor 110 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a memory, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU), etc.
  • different processing units can be independent devices or integrated in one or more processors.
  • the controller may be the nerve center and command center of the electronic device 100 .
  • the controller can generate operation control signals based on the instruction operation code and timing signals to complete the control of fetching and executing instructions.
  • the processor 110 may also be provided with a memory for storing instructions and data.
  • the memory in the processor 110 is a cache memory. This memory may hold instructions or data that the processor 110 has recently used or uses cyclically. If the processor 110 needs to use the instructions or data again, it can call them directly from the memory. This avoids repeated access and reduces the waiting time of the processor 110, thereby improving the efficiency of the system.
  • the charging management module 140 is used to receive charging input from the charger.
  • the charger can be a wireless charger or a wired charger.
  • the charging management module 140 may receive charging input from the wired charger through the USB interface 130 .
  • the charging management module 140 may receive wireless charging input through the wireless charging coil of the electronic device 100 . While the charging management module 140 charges the battery 142, it can also provide power to the electronic device through the power management module 141.
  • the power management module 141 is used to connect the battery 142, the charging management module 140 and the processor 110.
  • the power management module 141 receives input from the battery 142 and/or the charging management module 140, and supplies power to the processor 110, internal memory 121, external memory, display screen 194, camera 193, wireless communication module 160, etc.
  • the power management module 141 can also be used to monitor battery capacity, battery cycle times, battery health status (leakage, impedance) and other parameters.
  • the power management module 141 may also be provided in the processor 110 .
  • the power management module 141 and the charging management module 140 may also be provided in the same device.
  • the wireless communication function of the electronic device 100 can be implemented through the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, the modem processor and the baseband processor.
  • Antenna 1 and Antenna 2 are used to transmit and receive electromagnetic wave signals.
  • each antenna in the electronic device 100 may be used to cover a single communication frequency band or multiple communication frequency bands. Different antennas can also be multiplexed to improve antenna utilization. For example, antenna 1 can be multiplexed as a diversity antenna of a wireless local area network. In other embodiments, the antennas may be used in combination with tuning switches.
  • the mobile communication module 150 can provide solutions for wireless communication including 2G/3G/4G/5G applied on the electronic device 100 .
  • the mobile communication module 150 may include at least one filter, switch, power amplifier, low noise amplifier (LNA), etc.
  • the mobile communication module 150 can receive electromagnetic waves through the antenna 1, perform filtering, amplification and other processing on the received electromagnetic waves, and transmit them to the modem processor for demodulation.
  • the mobile communication module 150 can also amplify the signal modulated by the modem processor and convert it into electromagnetic waves through the antenna 1 for radiation.
  • at least part of the functional modules of the mobile communication module 150 may be disposed in the processor 110 .
  • at least part of the functional modules of the mobile communication module 150 may be provided in the same device as at least part of the modules of the processor 110.
  • a modem processor may include a modulator and a demodulator.
  • the modulator is used to modulate the low-frequency baseband signal to be sent into a medium-high frequency signal.
  • the demodulator is used to demodulate the received electromagnetic wave signal into a low-frequency baseband signal.
  • the demodulator then transmits the demodulated low-frequency baseband signal to the baseband processor for processing.
  • after being processed by the baseband processor, the low-frequency baseband signal is passed to the application processor, which outputs sound signals through audio devices (not limited to the speaker 170A, the receiver 170B, etc.), or displays images or videos through the display screen 194.
  • the modem processor may be a stand-alone device.
  • the modem processor may be independent of the processor 110 and may be provided in the same device as the mobile communication module 150 or other functional modules.
  • the wireless communication module 160 can provide wireless communication solutions applied on the electronic device 100, including wireless local area network (WLAN) (such as a wireless fidelity (Wi-Fi) network), Bluetooth (BT), global navigation satellite system (GNSS), frequency modulation (FM), near field communication (NFC), infrared (IR), etc.
  • the wireless communication module 160 may be one or more devices integrating at least one communication processing module.
  • the wireless communication module 160 receives electromagnetic waves via the antenna 2 , frequency modulates and filters the electromagnetic wave signals, and sends the processed signals to the processor 110 .
  • the wireless communication module 160 can also receive the signal to be sent from the processor 110, frequency modulate it, amplify it, and convert it into electromagnetic waves through the antenna 2 for radiation.
  • the antenna 1 of the electronic device 100 is coupled to the mobile communication module 150, and the antenna 2 is coupled to the wireless communication module 160, so that the electronic device 100 can communicate with the network and other devices through wireless communication technology.
  • the electronic device 100 implements display functions through a GPU, a display screen 194, an application processor, and the like.
  • the GPU is a microprocessor for image processing, and connects the display screen 194 and the application processor. The GPU is used to perform mathematical and geometric calculations for graphics rendering.
  • Processor 110 may include one or more GPUs that execute program instructions to generate or alter display information.
  • the display screen 194 is used to display images, videos, etc.
  • Display 194 includes a display panel.
  • the display panel may be a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a MiniLED, a MicroLED, a Micro-OLED, a quantum dot light-emitting diode (QLED), etc.
  • the electronic device 100 may include 1 or N display screens 194, where N is a positive integer greater than 1.
  • the electronic device 100 can implement the shooting function through an ISP, a camera 193, a video codec, a GPU, a display screen 194, an application processor, and the like.
  • the ISP is used to process the data fed back by the camera 193.
  • Camera 193 is used to capture still images or video.
  • the object passes through the lens to produce an optical image that is projected onto the photosensitive element.
  • the electronic device 100 may include 1 or N cameras 193, where N is a positive integer greater than 1.
  • the external memory interface 120 can be used to connect an external memory card, such as a Micro SD card, to expand the storage capacity of the electronic device 100.
  • the external memory card communicates with the processor 110 through the external memory interface 120 to implement the data storage function, for example, saving files such as music and videos on the external memory card.
  • Internal memory 121 may be used to store computer executable program code, which includes instructions.
  • the processor 110 executes instructions stored in the internal memory 121 to execute various functional applications and data processing of the electronic device 100 .
  • the internal memory 121 may include a program storage area and a data storage area. The program storage area can store the operating system and at least one application program required for a function (such as a sound playback function or an image playback function).
  • the storage data area may store data created during use of the electronic device 100 (such as audio data, phone book, etc.).
  • the internal memory 121 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, universal flash storage (UFS), etc.
  • the electronic device 100 can implement audio functions through the audio module 170, the speaker 170A, the receiver 170B, the microphone 170C, the headphone interface 170D, and the application processor. Such as music playback, recording, etc.
  • the audio module 170 is used to convert digital audio information into analog audio signal output, and is also used to convert analog audio input into digital audio signals. Audio module 170 may also be used to encode and decode audio signals. In some embodiments, the audio module 170 may be provided in the processor 110 , or some functional modules of the audio module 170 may be provided in the processor 110 .
  • the software system of the electronic device 100 may adopt a layered architecture, an event-driven architecture, a microkernel architecture, a microservice architecture, or a cloud architecture.
  • the embodiment of this application takes the Android system with a layered architecture as an example to illustrate the software structure of the electronic device 100 .
  • the embodiments of the present application can also be applied to other systems such as the Hongmeng system.
  • for the implementation, reference may be made to the technical solutions in the embodiments of the present application; this application does not give examples one by one.
  • FIG. 2 is a software structure block diagram of the electronic device 100 according to the embodiment of the present application.
  • the layered architecture of the electronic device 100 divides the software into several layers, and each layer has clear roles and division of labor.
  • the layers communicate through software interfaces.
  • the Android system is divided into four layers, from top to bottom: application layer, application framework layer, Android runtime and system libraries, and kernel layer.
  • the application layer can include a series of application packages.
  • the application package can include applications such as camera, gallery, calendar, call, map, navigation, WLAN, Bluetooth, music, video, short message, text recognition, text processing, etc.
  • the text recognition application program may also be called a text recognition module or a text recognition engine in the embodiment of the present application, which is not limited by this application.
  • the text recognition module can be used to identify the text area and text content in the image to be recognized (see below for specific concepts).
  • the text processing application program may also be called a text processing module, which is used to further process the output results of the text recognition module (for specific processing procedures, please refer to the embodiment below). It should be noted that in the embodiment of the present application, the text processing module further processes the results of the text recognition module as an example for description. In other embodiments, the text recognition module can also perform the steps performed by the text processing module. It can also be understood that the steps performed by the text recognition module and the text processing module can be performed by one module, which is not limited in this application.
  • the application framework layer provides an application programming interface (API) and programming framework for applications in the application layer.
  • API application programming interface
  • the application framework layer includes some predefined functions.
  • the application framework layer can include a window manager, content provider, view system, phone manager, resource manager, notification manager, etc.
  • a window manager is used to manage window programs.
  • the window manager can obtain the display size, determine whether there is a status bar, lock the screen, capture the screen, etc.
  • Content providers are used to store and retrieve data and make this data accessible to applications.
  • the data can include videos, images, audio, calls made and received, browsing history and bookmarks, phone books, etc.
  • the view system includes visual controls, such as controls that display text, controls that display pictures, etc.
  • a view system can be used to build applications.
  • the display interface can be composed of one or more views.
  • a display interface including a text message notification icon may include a view for displaying text and a view for displaying pictures.
  • the phone manager is used to provide communication functions of the electronic device 100 .
  • for example, call status management (including connected, hung up, etc.).
  • the resource manager provides various resources to applications, such as localized strings, icons, pictures, layout files, video files, etc.
  • the notification manager allows applications to display notification information in the status bar, which can be used to convey notification-type messages and can automatically disappear after a short stay without user interaction.
  • the notification manager is used to notify download completion, message reminders, etc.
  • the notification manager can also present notifications that appear in the status bar at the top of the system in the form of charts or scroll-bar text, such as notifications of applications running in the background, or notifications that appear on the screen in the form of dialog windows. For example, text information is prompted in the status bar, a beep sounds, the electronic device vibrates, or the indicator light flashes.
  • System libraries can include multiple functional modules. For example: surface manager (surface manager), media libraries (Media Libraries), 3D graphics processing library, 2D graphics engine (for example: SGL), etc.
  • the surface manager is used to manage the display subsystem and provides the fusion of 2D and 3D layers for multiple applications.
  • the media library supports playback and recording of a variety of commonly used audio and video formats, as well as static image files, etc.
  • the media library can support multiple audio and video encoding formats.
  • the 3D graphics processing library is used to implement 3D graphics drawing, image rendering, composition, and layer processing.
  • 2D Graphics Engine is a drawing engine for 2D drawing.
  • the kernel layer is the layer between hardware and software.
  • the kernel layer contains at least display drivers, camera drivers, audio drivers, sensors, Bluetooth drivers, Wi-Fi drivers and other drivers.
  • the components included in the system framework layer, system library and runtime layer shown in Figure 2 do not constitute specific limitations on the electronic device 100.
  • the electronic device 100 may include more or fewer components than shown in the figures, or some components may be combined, some components may be separated, or some components may be arranged differently.
  • FIG 3 is a schematic diagram illustrating a text recognition scenario containing truncated text.
  • a picture 302 is displayed in the display interface 301 of the mobile phone.
  • the display interface 301 may be an application interface, for example, it may be an interface of a system application such as a gallery application interface.
  • the interface 301 may also be an application interface of a third-party application such as a chat application. That is to say, in the embodiment of the present application, the system in the mobile phone can have its own text recognition function (ie, the text recognition module in Figure 2).
  • the gallery application can call the text recognition module of the mobile phone to perform text recognition on pictures.
  • the third-party application in the mobile phone can also have its own text recognition function.
  • the implementation process of the text recognition function of different third-party applications can be the same or different, which is not limited by this application.
  • the third-party application in the mobile phone can also call the text recognition module of the mobile phone, which is not limited in this application.
  • the picture 302 includes text and images (of course, the picture 302 may also include only text).
  • the embodiment of the present application only takes the text recognition scene of a picture as an example for explanation. In other embodiments, the method can also be applied to a text recognition scene in an application interface; for example, text recognition can be performed on a page displayed by a browser application, which is not limited by this application.
  • the picture 302 can be generated after the mobile phone performs a screenshot operation in response to a user operation; the picture 302 can also be generated by the mobile phone through the camera function; the picture 302 can also be a downloaded picture, etc., which is not limited in this application.
  • the text in the picture 302 includes multiple lines, where the first line of text and the last line of text displayed in the picture 302 are cut off by the border of the picture 302.
  • in the embodiments of the present application, this type of text may be referred to as truncated text.
  • Figure 3 only takes vertical truncation of text as an example for illustration.
  • the technical solutions in the embodiments of the present application can also be applied to recognition scenarios of horizontal truncation of text and diagonal truncation of text. Specific examples will be described below.
  • the "vertical truncation of text” described in the embodiments of this application may be truncation perpendicular to the text running direction.
  • for example, when the interface is slid up and down, text lines may be blocked by the upper or lower edge of the screen, or by fixed or floating status bars.
  • take picture 302 as a screenshot of a webpage as an example: the user slides the webpage up and down while browsing it.
  • the first line currently displayed on the webpage may be truncated by the upper edge of the webpage (which can also be understood as the upper border of the display box).
  • the user takes a screenshot of the currently displayed webpage, and the mobile phone generates picture 302 in response to the received screenshot operation.
  • the first line of text displayed in picture 302 is the "vertically truncated text".
  • transverse truncation of text refers to truncation along the text line direction.
  • text lines may be truncated laterally due to taking pictures or scanning.
  • the "oblique stage text” may be a truncation in a direction that has a certain angle with the text running direction.
  • the user can long press the picture 302.
  • the application displays the option box 303 in response to the received long press operation on the picture 302 .
  • the option box 303 includes but is not limited to: sharing options, collection options, text extraction options 304, etc.
  • the location and size of the option box 303 as well as the number and names of the options included therein are only illustrative examples and are not limited by this application.
  • the user clicks the extract text option 304 to indicate extracting the text in the picture 302 .
  • the mobile phone starts the text recognition function (as mentioned above, the text recognition function can be the text recognition function that comes with the application, or it can be the text recognition function of the calling system, which is not limited in this application).
  • the text recognition function optionally adopts OCR technology.
  • the OCR technology is mainly divided into two steps.
  • the first step is text area detection
  • the second step is text content recognition (which may also be called text recognition).
  • the text area detection step optionally includes detecting at least one text area in the image, that is, identifying the area containing text in the image.
  • the step of identifying text content may optionally include identifying the text in the acquired text area, that is, identifying the specific text content in the text area.
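  • as a sketch of the two-step pipeline (detect_text_regions and recognize_text are hypothetical placeholders, not a real OCR library API):

        def ocr(picture):
            results = []
            for region in detect_text_regions(picture):  # step 1: text area detection
                text = recognize_text(region)            # step 2: content recognition
                results.append((region, text))
            return results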
  • the display interface 301 includes but is not limited to: a reduced picture 302 and a text recognition result display box 305 .
  • the interface layout in the display interface 301 in the embodiment of this application is only a schematic example, and this application does not limit it.
  • the text recognition result display box 305 includes, but is not limited to: the "Erase selected text" option, text recognition results, and other options.
  • other options include but are not limited to: "Select All” option, "Search” option, "Copy” option, "Translate” option, etc. Each of the other options can be used to process the text recognition results accordingly.
  • the text recognition result in the text recognition result display box 305 is the result recognized through the text recognition function.
  • the results recognized by the text recognition function may not be accurate.
  • for example, the original text of the first line in the webpage is "The first round of the game, the audience cheered when the All-** and others appeared, 5", and because the text on the page was cut off by the upper border while the user browsed the webpage, the first line of text in the screenshot 302 is truncated.
  • the output result is "Ri Kong L Dai, Shi Hong Roast Shou Ba Yuan's Tu Cong Ding, 5", which is quite different from the original text.
  • for this type of recognition result, the original text cannot be restored even through semantic reasoning and other technologies, which affects the user experience.
  • the recognition result corresponding to the untruncated text line in the picture 302 (for example, the second line of text in the picture 302) is no different from the original text.
  • Embodiments of the present application provide a text recognition method that uses text images and text content as inputs to a model (which can be called a text recognition model or a text recognition network), and obtains the encoding information corresponding to each modality through respective modal encoding.
  • the text processing module fuses the modal information of the encoding information corresponding to the text image and the encoding information corresponding to the text content, and uses the fused result as the attention input of the classification decoder and the correction decoder.
  • in this way, the model is equivalent to implicitly and comprehensively considering the image information (mainly truncation) and the text information (mainly semantic coherence), and uses high-dimensional multi-modal semantic information to make more refined decisions for different input combinations, so as to achieve an anthropomorphic, complex decision-making effect.
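  • a minimal sketch of this structure (PyTorch, with assumed sizes), where the fused multi-modal encoding serves as the attention memory of a correction decoder and also feeds a classification head:

        import torch
        import torch.nn as nn

        dim = 512
        fused = torch.randn(1, 48, dim)   # fused multi-modal encoding (see above)

        decoder = nn.TransformerDecoder(  # correction decoder attends over 'fused'
            nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=2)
        queries = torch.randn(1, 16, dim)                  # target token embeddings
        corrected_states = decoder(tgt=queries, memory=fused)
        cls_logits = nn.Linear(dim, 3)(fused.mean(dim=1))  # classification decision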
  • the user can determine the obscured text through semantics.
  • users can correctly read the corresponding text content.
  • the technical solution in the embodiments of the present application can achieve an anthropomorphic reading effect: when the text is largely blocked (i.e., truncated), no result is output; when the text is slightly blocked, the corrected result is output; and when the text is not blocked, the corresponding text is output.
  • in this way, correct and semantically smooth results can be presented in the text recognition results, while results with semantic errors are filtered out, improving the user experience.
  • Figure 4 is a schematic diagram of an application scenario for applying the text recognition method in the embodiment of the present application. Please refer to (1) of Figure 4.
  • taking the gallery application as an example, after the user clicks the thumbnail of picture 402 displayed in the gallery application, the gallery application may display picture 402 in the display interface 401.
  • the display interface 401 also includes, but is not limited to, options (or controls) such as sharing options and collection options.
  • the gallery application can call the text recognition module and text processing module of the system to perform text recognition and processing on the picture 402 (which may also be called a picture to be recognized or an image to be recognized).
  • text recognition includes two parts: text area detection and text content recognition.
  • the text recognition module can perform a text area detection step to detect whether the picture 402 includes a text area.
  • the picture 402 includes pictures and text (of course, it may also include only text, which is not limited in this application). Accordingly, the text recognition module may detect at least one text area included in the picture 402.
  • the "extract text in the picture” option 403 can be displayed in the display interface 401 .
  • the user can click the "Extract text in picture” option 403 to instruct the text content in the picture 402 to be extracted.
  • in response to the received user operation, the gallery application performs text recognition on picture 402 through the text recognition module, that is, performs the text content recognition step to obtain the text content corresponding to each text area.
  • the text processing module can further process the recognition results (including text areas and text content) obtained by the text recognition module. Please refer to (2) of Figure 4.
  • the display interface 401 includes but is not limited to: a reduced picture 402 and an extracted text display box 404.
  • the extracted text display box 404 includes but is not limited to: a text recognition result display box 405 and other options.
  • Other options include, but are not limited to: "Erase selected text” option, "Read full text” option, “Select all” option, “Search” option, "Copy” option and “Translate” option, etc.
  • the text recognition result display box 405 includes the text content recognized by the text recognition module, as shown in (2) of Figure 4. In this embodiment of the present application, for truncated text (such as the first line of text), the mobile phone does not display the corresponding text in the text recognition result display box 405.
  • the text processing module can adopt a non-output (i.e., non-display) method to avoid the problem of large differences between the text recognition results and the original text.
  • the text processing module may display the corresponding text in the text recognition result display box 405 .
  • the text processing module can also modify (or correct) the text content recognized by the text recognition module to obtain the correct text (which can also be understood as text close to or the same as the original text), and output (i.e., display in the text recognition result display box 405) the corrected result. That is to say, in this embodiment of the present application, text with semantic errors is filtered or corrected, so that the text recognition results displayed in the text recognition result display box 405 are semantically correct and coherent, thereby improving the user experience.
  • the embodiments of this application only take the text recognition and processing scenario of pictures as an example for explanation. In other embodiments, the method can also be applied to text recognition and processing scenarios in application interfaces; for example, text recognition and processing may be performed on a page displayed by a browser application, which is not limited by this application.
  • the picture 402 can be generated after the mobile phone performs a screenshot operation in response to a user operation; the picture 402 can also be generated by the mobile phone through the camera function; the picture 402 can also be a downloaded picture, etc., which is not limited in this application.
  • the text recognition function that comes with the chat application can perform text recognition on the image to be recognized and obtain the corresponding text recognition results.
  • the chat application can call the text processing module of the mobile phone to further process the text recognition results.
  • the chat application may also have its own text recognition module and text processing module, and implement the steps implemented by the text recognition module and text processing module involved in the embodiments of this application.
  • the chat application can also call the text recognition module and text processing module of the mobile phone, which is not limited in this application.
  • the steps performed by the text recognition module described in the embodiments of this application are only illustrative examples.
  • the steps performed by the text recognition module in the mobile phone and the text recognition module that comes with the application may be the same or different. Specific details may refer to existing technical embodiments, and are not limited in this application.
  • the text recognition module in a mobile phone can use OCR technology to perform text recognition and obtain corresponding recognition results, including text images and text content (the concepts of text images and text content will be explained below).
  • the text recognition module in the chat application can use other technologies to perform text recognition and obtain corresponding recognition results, which also include text images and text content.
• the recognition results obtained by the text recognition module of the chat application and the text recognition module of the mobile phone may be the same or different.
  • the text recognition module in the mobile phone may recognize 5 text areas and obtain the corresponding text content.
  • the text recognition module in the chat application may recognize 6 text areas and obtain the corresponding text content, which is not limited by this application. That is to say, the text processing module in the embodiment of the present application can further process the recognition results of any text recognition module (which can be a mobile phone and/or an application) to obtain results that meet user needs.
  • text truncation by a border is used as an example for explanation.
  • text truncation may also be caused by image occlusion or other reasons, which is not limited by this application.
• the text recognition module can perform the text area detection step on each picture in the gallery application while the mobile phone is in standby or the gallery application is in the background. That is to say, the text recognition module can perform the text area detection step on the pictures in the gallery application in advance, so that after the user clicks on a picture including a text area, the "Extract text in the picture" option box can be displayed immediately, improving the overall efficiency of text recognition and processing.
  • FIG. 5 is a schematic flowchart of an exemplary text recognition method.
  • the text recognition module can obtain the results recognized based on OCR technology.
  • the results include at least one text image and text content corresponding to each text image.
  • Figure 6 is a schematic diagram of text recognition. Please refer to Figure 6 .
• the text recognition module uses OCR technology to perform text area detection on picture 601 (that is, picture 402; for a detailed description, please refer to picture 402, which will not be repeated here) to obtain at least one text area.
• text area detection can be understood as follows: after the OCR technology detects the areas containing text in the picture 601, it segments at least one text area in the picture 601 to obtain at least one text image (that is, the image corresponding to at least one text area in the picture 601).
  • the text recognition module detects a text area 602a containing text in the picture 601.
• the text recognition module can segment the text area 602a (for example, along the dotted line) to obtain the image corresponding to the text area 602a, which is referred to as the text image 602a.
  • the text recognition module can sequentially segment areas containing text in the picture 601. For example, the image of the text area 603a can be obtained, which is referred to as the text image 603a. In the embodiment of this application, only the text area 602a and the text area 603a are used as examples for description. The text recognition module can obtain more text areas in the picture 601.
• It should be noted that after the text recognition module recognizes a text area through OCR technology, the area can undergo affine or perspective transformation correction and other processing to obtain the corresponding text image.
  • the size of the single text image may be the same as the size of the actual area occupied by the text content in the text image, or may be larger than the size of the actual area occupied by the text content.
  • the size of the text image 602a is larger than the size of the area actually occupied by the text content. That is, there is a blank area between the frame of the text image and the text content (ie, the edge of the text content).
  • the text recognition module can perform text content recognition on at least one acquired text area (ie, text image) through OCR technology.
• the text recognition module performs text content recognition on the text image 602a and obtains the text content recognition result 602b (which can also be called the text content 602b). That is, it is recognized that the text content in the text image 602a is "Ri Kong L Loan, Shi Hong Roasted Shou Ba Yuan's Soil From Ding, 5" (a semantically meaningless string, since the original line of text is truncated).
  • the text recognition module continues to recognize other text images to obtain the corresponding text content recognition results.
• the text recognition module uses OCR technology to perform text content recognition on the text image 603a to obtain the corresponding text content recognition result 603b (also called the text content 603b), that is, it is recognized that the text content in the text image 603a is "The champion also showed superb strength, 107B in the first round".
  • this embodiment only takes the text image 602a and the text image 603a as an example for explanation.
• the text recognition module can perform text content recognition on each acquired text image based on OCR technology to obtain the corresponding text content; this application will not explain them one by one. It should be further noted that the text recognition module can perform text content recognition on the text images in parallel or sequentially, which is not limited by this application.
• the text processing module obtains the recognition results obtained by the text recognition module, including but not limited to: the text image 602a and the corresponding text content 602b, and the text image 603a and the corresponding text content 603b.
  • the text processing module executes the process in Figure 5 for each text image input by the text recognition module and the text content corresponding to the text image.
  • the text recognition module can output the recognition results to the text processing module for further processing after acquiring the images corresponding to all text areas of the recognized image (for example, picture 601) and the corresponding text content.
  • the text recognition module can execute the process in Figure 5 on the obtained text images and text content one by one.
  • the text recognition module can also process multiple text images and text contents in parallel, which is not limited in this application.
  • the text recognition module can also output the text content and the corresponding text image to the text processing module for processing. This application does not limit this, and the description will not be repeated below.
• the text processing module passes the text image 602a and the text content 602b through a coding model (which can also be called a coding module) to obtain the corresponding encoding information.
  • the encoding model may include, but is not limited to, an image encoding model (which may be called an image encoding module) and a text encoding model (which may also be called a text encoding module).
  • the image coding model can be used to code the text image 602a to obtain image coding information corresponding to the text image 602a.
  • the image encoding model can encode text images into machine-recognizable or understandable semantic information.
  • the text encoding module can be used to encode text content 602b to obtain text encoding information. It can also be understood that the text encoding module encodes text content into machine-recognizable or understandable semantic information.
  • the text processing module may process the text image 602a and the text content 602b sequentially or in parallel, which is not limited in this application.
  • the text processing module can first process the text image 602a to obtain the image encoding information, and then process the text content 602b to obtain the text encoding information.
  • the text processing module may first encode the text content 602b, and then encode the text image 602a.
  • the text processing module can simultaneously encode the text image 602a and the text content 602b, which is not limited in this application.
• the text processing module fuses the image encoding information corresponding to the text image 602a and the text encoding information corresponding to the text content 602b through a multi-modal model (which may also be called a multi-modal coding module, a multi-modal fusion module, etc., which is not limited in this application) to obtain multi-modal encoding information, which can also be called intermediate representation information.
• the text processing module corrects the intermediate representation information through a correction model (which can also be called a correction module), and passes the intermediate representation information through a classification model (which can also be called a classification module) to classify the intermediate representation information and obtain classification results.
• the classification results include three classification items: filtering, correcting and outputting, and direct output.
• the filtering classification item optionally means filtering the text content, that is, not displaying the corresponding text content in the text recognition result.
• the correcting-and-outputting classification item optionally means outputting the corrected text; it can also be understood that the text content is corrected first and then displayed in the text recognition result. The direct output classification item optionally means displaying the text content as recognized in the text recognition result.
  • the text processing module can directly display the text content recognized by the text recognition module through OCR technology in the text recognition results.
• In one example, if the classification result corresponding to the intermediate representation information is the filtering classification item, the text processing module filters the text content 602b, that is, the text content 602b is not displayed in the text recognition result, to avoid the impact of semantically incorrect text on the text recognition results.
• In another example, if the classification result of the intermediate representation information is the corrected output classification item, the text processing module can display the corrected result of the intermediate representation information in the text recognition result.
• In yet another example, if the classification result of the intermediate representation information is the direct output classification item, the text processing module displays the text content 602b in the text recognition result.
• FIG. 7 is an exemplary schematic diagram of text image encoding. Referring to Figure 7, the text processing module (specifically, the image encoding model, which will not be repeated below) converts the image information into two-dimensional image encoding information E_v.
• It should be noted that the structure of the encoding information (such as two-dimensional encoding information) obtained by encoding text images and text content is determined by the architecture of the encoder. The encoder architecture can be set according to actual needs; for example, in other embodiments, three-dimensional image information can also be converted into higher-dimensional or lower-dimensional image encoding information, which is not limited in this application and will not be repeated below.
• In the embodiments of this application, image information encoding of text images through Patch Embedding and Positional Encoding is used as an example for explanation. In other embodiments, encoding can also be performed through other encoding methods, which is not limited in this application.
  • the text processing module divides the text image 602a into N patches.
  • Figure 8 is an exemplary image information encoding schematic diagram.
  • the text processing module can change the height of the text image 602a (which can also be width, or width and height) to resize (adjust) the height of the text image 602a to a preset pixel value.
• the text processing module can adjust the height of the text image 602a to 32 pixels (or 64 pixels; the value can be set according to actual requirements, which is not limited in this application).
• the width of the text image 602a is adjusted in proportion to the height (i.e., preserving the aspect ratio of the image 602a). As shown in FIG. 8, the adjusted height of the text image 602a is H and the width (also called length) is W, which is taken as an example for explanation. It should be noted that in other embodiments the text image may not be resized, which is not limited in this application.
  • the text processing module divides the text image 602a into N Image Patches.
  • the values of h and w can be the same or different, for example, they can both be 16 pixels, and can be set according to actual requirements, which is not limited in this application.
• N is a positive integer; when the image size is not an integer multiple of the patch size, N can be obtained by rounding up.
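• As an informal illustration of the resizing and patch counting described above, the following sketch computes N; tiling the patch grid over both height and width is an assumption here, and the values 32 and 16 merely follow the examples in the description:

```python
import math

def count_patches(img_w: int, img_h: int, target_h: int = 32, patch: int = 16) -> int:
    """Resize the image height to target_h (keeping the aspect ratio), then
    count the h*w patches, rounding up at the borders."""
    new_w = round(img_w * target_h / img_h)  # width scaled by the same ratio
    return math.ceil(new_w / patch) * math.ceil(target_h / patch)

# e.g. a 300x48 text image resized to 200x32 yields 13 * 2 = 26 patches
print(count_patches(300, 48))
```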
  • the text processing module performs Patch Embedding on N Image Patches.
  • FIG. 9 is an exemplary schematic diagram of image information encoding. Please refer to Figure 9.
  • An exemplary Patch Embedding process includes but is not limited to the following steps:
• Step a: The text processing module flattens each Image Patch to obtain the one-dimensional vector P_i corresponding to each Image Patch.
• Exemplarily, the width of each Image Patch is w, the height is h, and the number of channels is c, so the size of each Image Patch is (h*w*c). The text processing module flattens an Image Patch to obtain a one-dimensional vector of length (h*w*c). For the i-th image patch, the one-dimensional vector is recorded as P_i, expressed as:
• P_i = [p_1, p_2, …, p_(h·w·c)] (1)
• FIG. 10 is an exemplary schematic diagram of flattening an Image Patch. Referring to Figure 10, Image Patch 801 in Figure 8 is taken as an example; the size of Image Patch 801 is (h*w*c). After the text processing module flattens Image Patch 801, the corresponding one-dimensional vector P_1 is obtained, expressed as:
• P_1 = [p_1, p_2, …, p_(h·w·c)]
• By analogy, the text processing module can flatten each Image Patch to obtain N vectors P_i, that is, P_1 … P_N as shown in Figure 9.
• Step b: The text processing module passes the N one-dimensional vectors P_i through a fully connected layer to obtain N one-dimensional tensors with a preset length.
• Exemplarily, the text processing module passes the N one-dimensional vectors P_i through a fully connected layer whose output length is embedding_size (which can be set according to actual needs, and is not limited in this application), and obtains N one-dimensional tensors E_vi of length embedding_size, expressed as:
• E_vi = FC(P_i)
• For example, the text processing module passes P_1 through the fully connected layer of output length embedding_size and obtains a one-dimensional tensor E_v1 of length embedding_size, expressed as:
• E_v1 = FC(P_1)
• The text processing module performs the same processing on the N one-dimensional vectors according to the above method to obtain E_v1 … E_vN. It should be noted that in this embodiment the preset length is embedding_size as an example for explanation; in other embodiments the preset length can be other values, which is related to the fully connected layer used, and this application does not limit it.
• Step c: The text processing module arranges the N one-dimensional tensors E_vi in order to obtain a two-dimensional tensor with dimension (N, embedding_size).
• Exemplarily, the text processing module arranges the N one-dimensional tensors E_v1 … E_vN in order to obtain the two-dimensional tensor E_v0, expressed as:
• E_v0 = [E_v1; E_v2; …; E_vN]
• The dimension of E_v0 is (N, embedding_size).
  • the image encoding method in the embodiment of the present application is only a schematic example.
• For example, the text processing module can also apply a convolution kernel with size (h*w), stride h (or w), and embedding_size output channels to the Image Patches. The specific method can be set according to actual needs; the purpose is to encode the N Image Patches to obtain machine-encoded information with higher-level semantics.
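• The following minimal PyTorch sketch illustrates Patch Embedding as described in steps a to c, together with the convolutional alternative mentioned above; all sizes are illustrative assumptions rather than the patent's actual parameters:

```python
import torch
import torch.nn as nn

h, w, c, embedding_size = 16, 16, 3, 256  # illustrative sizes

class PatchEmbed(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(h * w * c, embedding_size)  # step b: fully connected layer

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (N, c, h, w); step a: flatten each patch to length h*w*c
        p = patches.flatten(start_dim=1)                # (N, h*w*c), the vectors P_i
        return self.fc(p)                               # step c: E_v0 of shape (N, embedding_size)

# Equivalent convolutional formulation mentioned in the text:
conv_embed = nn.Conv2d(c, embedding_size, kernel_size=(h, w), stride=(h, w))

patches = torch.randn(12, c, h, w)   # N = 12 patches
E_v0 = PatchEmbed()(patches)         # two-dimensional tensor of shape (12, 256)
```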
• the text processing module can concatenate (concat) E_v0 with the classification head E_cls to obtain the two-dimensional tensor E_v1.
• The dimension of E_cls is optionally (1, embedding_size); this dimension can be set according to actual requirements and is not limited in this application.
• The classification head E_cls is a learnable parameter of the neural network.
• E_v1 = [E_cls, E_v0] (2)
• Taking E_v0 in the above embodiment as an example, the text processing module splices E_v0 and E_cls according to equation (2) to obtain E_v1.
• The dimension of E_v1 is (N+1, embedding_size).
• Next, the text processing module performs Positional Encoding on E_v1.
• Exemplarily, the text processing module adds the two-dimensional tensor E_v1 obtained above and the two-dimensional position code E_pos to obtain the image encoding information E_v.
• It should be noted that the dimension of the position code is related to the dimension of the result of the above processing; this application only takes two dimensions as an example for explanation, and does not limit it.
• E_pos is a learnable parameter of the neural network, and its dimension is (N+1, embedding_size). N+1 is recorded as N_v.
• E_v1 passes through Positional Encoding to obtain the image encoding information E_v, which is expressed as:
• E_v = E_v1 + E_pos
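• A minimal sketch of the classification-head concatenation and Positional Encoding above, assuming learnable E_cls and E_pos as stated:

```python
import torch
import torch.nn as nn

N, embedding_size = 12, 256
E_v0 = torch.randn(N, embedding_size)            # output of Patch Embedding

E_cls = nn.Parameter(torch.zeros(1, embedding_size))      # learnable classification head
E_pos = nn.Parameter(torch.zeros(N + 1, embedding_size))  # learnable position code

E_v1 = torch.cat([E_cls, E_v0], dim=0)  # equation (2): (N+1, embedding_size)
E_v = E_v1 + E_pos                      # image encoding information E_v
```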
• FIG. 11 is an exemplary schematic diagram of text content encoding. Referring to Figure 11, the text processing module (specifically, the text encoding model, which will not be repeated below) performs text information encoding on the text content 602b, including Word Embedding and Positional Encoding, thereby converting the text information into encoding information with higher-level semantic characteristics (also called text encoding information), recorded as E_t.
  • Figure 12 is a schematic diagram of an exemplary text information encoding process. Please refer to Figure 12. The process includes but is not limited to the following steps:
  • the text processing module performs word segmentation processing on the text content 602b.
  • the text processing module segments text content 602b according to a preset character length to obtain a segmentation result (which may also be called a segmentation sequence).
  • the preset character length can also be set according to actual needs, for example, it can be two characters, which is not limited in this application.
• In other embodiments, the segment lengths may also be unequal; for example, "eye shape" may be divided into one segment and "mountain" into another, which is not limited in this application.
  • the text processing module obtains the text serial number sequence corresponding to the word segmentation sequence.
  • the text processing module can be preset with a text serial number table (which can also be called text serial number information, character code table, etc., which is not limited in this application).
• the text serial number table is used to indicate the correspondence between text (words or characters) and serial numbers.
  • the corresponding serial number of "item” in the text serial number table is "12".
  • the corresponding serial number of "relationship” in the text serial number table is "52".
  • the corresponding relationship between text and serial numbers can be set according to actual needs, and is not limited in this application. It should be noted that the correspondence between text and serial numbers can be saved in a table or in other ways, which is not limited in this application.
  • the text contained in the text sequence number table can cover dictionaries or any books in professional fields, etc., which is not limited by this application.
• the text processing module can look up the serial number (also called the text serial number) corresponding to each segment (word or character) in the segmentation sequence w based on the text serial number table to obtain the text serial number sequence.
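• A toy illustration of the serial number lookup; the miniature table and the <unk> fallback are assumptions, while a real table would cover a dictionary or domain-specific corpus:

```python
# Hypothetical miniature text serial number table.
serial_table = {"item": 12, "relationship": 52, "<unk>": 0}

def to_serial_sequence(segments):
    """Map each segment of the word segmentation result to its serial number."""
    return [serial_table.get(seg, serial_table["<unk>"]) for seg in segments]

print(to_serial_sequence(["item", "relationship"]))  # [12, 52]
```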
• the text processing module passes the text serial number sequence n through word embedding to obtain the two-dimensional tensor E_t0.
• the two-dimensional tensor E_t0 can be expressed as:
• E_t0 = WordEmbedding(n)
• The dimension of the two-dimensional tensor E_t0 is (m, embedding_size), where m is the length of the serial number sequence. The specific value of E_t0 is related to the embedding layer and is not limited in this application.
• The text processing module adds E_t0 to the position code E_pos′ to obtain the text information encoding E_t.
• It should be noted that the dimension of the position code is related to the dimension of the result of the above processing; this application only takes two dimensions as an example for explanation, and does not limit it.
• E_pos′ is a learnable parameter of the neural network, and its dimension is (m, embedding_size). m is recorded as N_t.
• The text processing module adds E_t0 and E_pos′ to obtain the text information encoding E_t, which is expressed as:
• E_t = E_t0 + E_pos′
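• A minimal sketch of the text encoding path above (Word Embedding plus a learnable position code), with illustrative sizes:

```python
import torch
import torch.nn as nn

vocab_size, embedding_size, m = 1000, 256, 8  # illustrative sizes

word_embed = nn.Embedding(vocab_size, embedding_size)   # Word Embedding layer
E_pos_t = nn.Parameter(torch.zeros(m, embedding_size))  # learnable position code E_pos'

n = torch.randint(0, vocab_size, (m,))  # text serial number sequence of length m
E_t0 = word_embed(n)                    # (m, embedding_size)
E_t = E_t0 + E_pos_t                    # text encoding information E_t
```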
• It should be noted that the positional encoding in the embodiments of this application can be an embedding layer with learnable parameters similar to Bert Positional Embedding, or a positional encoding based on sine/cosine transformation similar to the native Transformer architecture, which can be set according to actual needs and is not limited in this application.
  • Figure 13 is a schematic flowchart illustrating an exemplary process for obtaining intermediate representation information. Please refer to Figure 13. Specifically, it includes but is not limited to the following steps:
• the text processing module performs feature fusion on the image encoding information E_v and the text encoding information E_t to obtain the mixed semantic encoding E_m (which can also be called mixed encoding information, which is not limited in this application).
• the mixed semantic encoding E_m can be expressed as:
• E_m = [E_v; E_t]
• The dimension of the mixed semantic encoding E_m is (N_v + N_t, embedding_size).
• It should be noted that the fusion of the image encoding information E_v and the text encoding information E_t is described here only with splicing as an example; in other embodiments, other methods can also be used, such as addition, which is not limited in this application.
  • the text processing module passes the mixed semantic encoding E m through the multi-modal encoder to obtain multi-modal encoding information (ie, intermediate representation information).
  • Figure 14a is an exemplary multi-modal coding schematic diagram. Please refer to Figure 14a.
• the text processing module passes the mixed semantic encoding E_m through the multi-modal encoder 1301 to obtain multi-modal encoding information (i.e., intermediate representation information), denoted as E_IR.
• The multi-modal encoder can also be understood as being used to extract, from the input encoding information, high-dimensional semantic information that combines image information and text information.
• the multi-modal encoder (Encoder) 1301 is composed of stacked Transformer Encoders; for example, the number of stacks is L.
• Each Transformer Encoder mainly consists of a Multi-Head Attention layer, Layer Normalization (Norm in Figure 14a), and a feed-forward neural network (Feed Forward in Figure 14a).
  • Figure 14b is a schematic diagram of the processing flow of the multi-modal encoder 1301. Please refer to Figure 14b.
• In this example, the stacking number L is 3; that is, the multi-modal encoder 1301 includes a multi-modal encoder 1301a, a multi-modal encoder 1301b, and a multi-modal encoder 1301c. It should be noted that the number of encoders described in the embodiments of this application is only a schematic example and can be set according to actual needs, which is not limited in this application. Exemplarily, the mixed semantic encoding E_m passes through the multi-modal encoder 1301a, and an output result is obtained.
  • the output result of the multi-modal encoder 1301a is used as the input of the multi-modal encoder 1301b and continues to be encoded.
  • the multi-modal encoder 1301b performs encoding based on the output result of the multi-modal encoder 1301a, and obtains the output result, which is used as the input of the multi-modal encoder 1301c.
• TE denotes a single Transformer Encoder within the multi-modal encoder 1301.
• The dimension of the multi-modal encoding information E_IR is (N_v + N_t, embedding_size); that is, E_IR = Encoder(E_m).
  • the multi-modal encoder is a Transformer Encoder as an example for explanation.
  • the multi-modal encoder can also be similar to a bidirectional recurrent neural network, or a simpler convolutional neural network encoder, which can be set according to actual needs and is not limited in this application.
  • the method by which the text processing module obtains multi-modal coding is not limited to the method of splicing image coding information and text coding information through a multi-modal encoder.
  • the text processing module can also convert images into The coded information and text coded information pass through their respective encoders and then are fused.
  • the text processing module passes the image encoding information through the image encoder to obtain high-dimensional image semantic information, and passes the text encoding information through the text encoder to obtain high-dimensional text semantic information.
  • the text processing module dimensionally aligns the high-dimensional image semantic information and the high-dimensional text semantic information and splices them together to obtain intermediate representation information.
  • the specific method can be set according to actual needs, and the purpose is to obtain high-dimensional image semantic features and text semantic features.
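• The splicing-based fusion followed by stacked Transformer Encoders can be sketched as follows; using PyTorch's built-in TransformerEncoder is an assumption standing in for the multi-modal encoder 1301:

```python
import torch
import torch.nn as nn

embedding_size, n_heads, L = 256, 8, 3  # L = number of stacked encoders

E_v = torch.randn(13, embedding_size)   # image encoding information (N_v rows)
E_t = torch.randn(8, embedding_size)    # text encoding information (N_t rows)

E_m = torch.cat([E_v, E_t], dim=0)      # splicing-based fusion: (N_v + N_t, embedding_size)

layer = nn.TransformerEncoderLayer(d_model=embedding_size, nhead=n_heads)
encoder = nn.TransformerEncoder(layer, num_layers=L)

# Add a batch dimension of 1; E_IR keeps the shape (N_v + N_t, embedding_size).
E_IR = encoder(E_m.unsqueeze(1)).squeeze(1)  # intermediate representation information
```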
  • the text processing module (specifically, it can be a classification model, which will not be repeated below) can classify the intermediate representation information to determine whether to output text content 602b based on the classification results.
  • Figure 14c is a schematic diagram of an exemplary classification process.
  • the text processing module can pass the multi-modal encoding information (ie, intermediate representation information) through the classification model to obtain the classification result.
  • the classification model may include, but is not limited to, a classification decoder, and an argmax layer (or softmax layer).
• As an example, the classification decoder is a fully connected layer, such as an MLP (Multi-Layer Perceptron).
  • the MLP may include multiple hidden layers. It should be noted that in the embodiment of the present application, only the fully connected layer (such as MLP) is used as the classification decoder as an example for explanation.
  • the classification decoder can also be other decoders, such as but not limited to decoders such as Transformer Decoder or Recurrent Neural Network (RNN) Decoder, which can be set according to actual needs.
• This application does not limit this; the purpose is to output corresponding classification results based on the input intermediate representation information.
  • argmax layer is used as an example for explanation. In other embodiments, the argmax layer and the softmax layer may also be used, and may be set according to actual needs, and are not limited in this application. Its purpose is to output the classification item corresponding to the maximum score.
• the classification results include but are not limited to three classification items: a (filtering), b (correcting and outputting), and c (direct output).
  • the classification result obtained includes the scores corresponding to the three classification items.
• the text processing module can pass the scores corresponding to the three classification items through the argmax layer or softmax layer to obtain the final decision category.
• the dimension of the multi-modal encoding information E_IR is (N_v + N_t, embedding_size).
• A slice along the first dimension of the multi-modal encoding information E_IR (the position corresponding to the classification head) can be taken to obtain a one-dimensional tensor E_IR0 of length embedding_size, expressed as:
• E_IR0 = E_IR[0]
• the text processing module passes the one-dimensional tensor E_IR0 through the fully connected layer and outputs a one-dimensional tensor T_out of length 3 (that is, equal to the number of classification items).
  • the fully connected layer can be an MLP, and the MLP can include multiple hidden layers.
• T_out includes the scores corresponding to the above three classification items a, b, and c:
• T_out = [f(a), f(b), f(c)]
• where f(a) is the score corresponding to classification item a (the filtering classification item), f(b) is the score corresponding to classification item b (the corrected output classification item), and f(c) is the score corresponding to classification item c (the direct output classification item).
• the text processing module passes T_out through the argmax layer to output the classification item corresponding to the maximum score.
  • MLP is only used as a fully connected layer as an example for explanation.
  • the fully connected layer can also be other decoders, such as but not limited to decoders such as Transformer Decoder or Recurrent Neural Network (RNN) Decoder, which can be set according to actual needs.
  • Its purpose is to output corresponding classification results based on the input intermediate representation information.
  • the argmax layer is used as an example for explanation.
  • the argmax layer and the softmax layer may also be used, and may be set according to actual needs, and are not limited in this application. Its purpose is to output the classification item corresponding to the maximum score.
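• A minimal sketch of the classification step, with an assumed hidden size; the decision indices simply mirror the three classification items described above:

```python
import torch
import torch.nn as nn

embedding_size = 256
E_IR0 = torch.randn(embedding_size)  # vector at the classification-head position

# MLP classification decoder with one hidden layer (sizes are illustrative).
mlp = nn.Sequential(
    nn.Linear(embedding_size, 128),
    nn.ReLU(),
    nn.Linear(128, 3),               # scores f(a), f(b), f(c)
)

T_out = mlp(E_IR0)
decision = T_out.argmax().item()     # 0 = filter, 1 = correct and output, 2 = direct output
```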
  • the text processing module can filter the corresponding text content, that is, the corresponding text content is not displayed in the text recognition result. For example, when the text processing module processes the text image 602a and the text content 602b, it detects that the classification result is category a, that is, the filtered classification item, then the text processing module filters the text content 602b, as shown in (2) of Figure 4 indicates that the text recognition results do not include the truncated first line of text, thereby avoiding errors in the truncated text recognition results and affecting the user experience.
  • the text processing module can display the corresponding text content in the text recognition result. For example, when the text processing module processes the text image 603a and the text content 603b, it is detected that the classification result is category c, that is, the classification result is a direct output classification item. The text processing module determines that the text content 603b can be output directly. As shown in (2) of Figure 4, the text processing module can display the text content 603b at a corresponding position in the text recognition result.
• If the output result is b, that is, the classification result is the corrected output classification item, the result recognized by OCR technology includes some errors and needs to be corrected before it can be output.
• After the text processing module detects that the classification result corresponding to a single piece of multi-modal encoding information (i.e., intermediate representation information) is the corrected output classification item, the text processing module can display the text content corrected by the correction module in the text recognition result. It should be noted that if the classification result is category a or category c, the text processing module discards (or ignores) the correction result output by the correction module.
  • FIG. 15 is an exemplary text modification schematic diagram. Please refer to Figure 15.
• In the embodiments of this application, a correction module including a Transformer Decoder is used as an example for explanation.
  • the text processing module passes multi-modal coding information (i.e. intermediate representation information) through Transformer Decoder1501, fully connected layer and argmax layer to obtain the corrected text content.
• In other embodiments, the correction module can also use other architectures, such as but not limited to: a forward decoder based on a recurrent neural network, a Bert Decoder architecture, a decoder similar to stepwise monotonic attention, etc., which can be set according to actual needs and is not limited in this application. The purpose is to correct the input intermediate representation information to obtain the corrected text.
• Transformer Decoder1501 includes Q stacked Transformer Decoders, where Q is a positive integer greater than 0.
  • a single Transformer Decoder can be represented as TD.
• a single TD includes but is not limited to: a Masked Multi-Head Attention layer, a Multi-Head Attention layer, Layer Normalization (i.e., Norm in Figure 15), and a feed-forward neural network (i.e., Feed Forward in Figure 15).
  • the K vector and V vector of the Transformer Decoder are multi-modal encoding information (that is, the output of the Encoder), and the Q vector is the output of the Masked multi-head attention layer.
  • Figure 16 is a schematic diagram of the processing flow of the correction module. Please refer to Figure 16.
• the recognition result of the OCR technology obtained by the text processing module includes text content and a text image, where the recognized text content corresponds to "volcanic eruption" but the character meaning "explosion" is recognized incorrectly as the similar character meaning "violent".
  • the text processing module obtains multi-modal coding information corresponding to text content and text images.
  • the text processing module obtains the corresponding classification results based on the multi-modal coding information, and the classification results are the corrected output classification items. Specific details can be found above and will not be repeated here. Please refer to Figure 16.
• the text processing module inputs the multi-modal encoding information into Transformer Decoder1501 as the K vector and V vector, and the start character <s> is input into Transformer Decoder1501 as the Q vector through Output Embedding and Positional Encoding.
• Figure 17 is a schematic diagram of the processing flow of the Transformer Decoder. Please refer to Figure 17, assuming that the stack number Q of Transformer Decoder1501 in the embodiment of this application is 2.
  • Output Embedding can be Word Embedding.
• For the specific implementation, refer to the method in the above embodiments, or to implementations in other prior art embodiments, which will not be described further in this application.
  • the stack number Q of Transformer Decoder 1501 in the embodiment of the present application is 2 (can be set according to actual requirements, and is not limited by this application), including Transformer Decoder 1501a and Transformer Decoder 1501b.
• the text processing module inputs the multi-modal encoding information into Transformer Decoder1501a as the K vector and V vector, and the start character <s> is input into Transformer Decoder1501a as the Q vector through Output Embedding and Positional Encoding.
• The output of Transformer Decoder1501a is input as the Q vector of Transformer Decoder1501b, and the multi-modal encoding information is input into Transformer Decoder1501b as the K vector and V vector.
• the output of Transformer Decoder1501b is recorded as E_dout1.
• E_dout1 passes through the fully connected layer to obtain E_out1, where the dimension of E_out1 is (seq_len, N_vocab).
• the text processing module slices E_out1 along its first dimension, takes the last position, and obtains a one-dimensional tensor of length N_vocab.
• the text processing module passes the one-dimensional tensor through the argmax layer (it can also be the argmax and softmax layers, which can be set according to actual needs, and is not limited in this application).
  • N vocab is optionally the number of texts included in the text sequence number table. For example, if the dictionary includes 100 words and corresponding sequence numbers, the value of N vocab is 100.
  • the value of seq_len is the number of output characters.
• the number of output characters is 5, including "fire", "mountain", "explosion", "fa" and the end character <end>.
  • the value output by the argmax layer is used to indicate the sequence number in the dictionary.
• The text processing module can determine the corresponding word or character based on the serial number. In this example, the text processing module may determine that the corresponding character is "fire". In other words, the text processing module passes the multi-modal encoding information and the start character <s> through Transformer Decoder1501 to obtain the character "fire".
• Next, the multi-modal encoding information is again input as the K vector and V vector, and the "fire" character and the start character <s> are input into Transformer Decoder1501 as the Q vector.
• Exemplarily, the "fire" character and the start character <s> are input into Transformer Decoder1501a as the Q vector through Output Embedding and Positional Encoding.
• Transformer Decoder1501 outputs E_dout2 based on the multi-modal encoding information, the "fire" character and the start character <s>.
• E_dout2 passes through the fully connected layer to obtain E_out2, and E_out2 obtains the corresponding value through the argmax layer.
• the text processing module can determine the corresponding character based on the value, such as "mountain".
• In other words, the text processing module passes the multi-modal encoding information, the "fire" character and the start character <s> through Transformer Decoder1501 to obtain the character "mountain".
• Next, the multi-modal encoding information is input as the K vector and V vector, and the "fire" character, the "mountain" character and the start character <s> are input into Transformer Decoder1501 as the Q vector.
• Exemplarily, the "fire" character, the "mountain" character and the start character <s> are input into Transformer Decoder1501a as the Q vector through Output Embedding and Positional Encoding.
• Transformer Decoder1501 outputs E_dout3 based on the multi-modal encoding information, the "fire" character, the "mountain" character and the start character <s>.
• E_dout3 passes through the fully connected layer to obtain E_out3, and E_out3 obtains the corresponding value through the argmax layer.
• the text processing module can determine the corresponding character based on the value, for example, "explosion".
• the text processing module passes the multi-modal encoding information, the "fire" character, the "mountain" character and the start character <s> through Transformer Decoder1501 to obtain the character "explosion".
• In this way, the incorrect character "violent" in the OCR recognition result is corrected to "explosion".
• Next, the multi-modal encoding information is input as the K vector and V vector, and the "fire" character, the "mountain" character, the "explosion" character and the start character <s> are input into Transformer Decoder1501 as the Q vector.
• Exemplarily, the "fire" character, the "mountain" character, the "explosion" character and the start character <s> are input into Transformer Decoder1501a as the Q vector through Output Embedding and Positional Encoding.
• Transformer Decoder1501 outputs E_dout4 based on the multi-modal encoding information, the "fire" character, the "mountain" character, the "explosion" character and the start character <s>.
• E_dout4 passes through the fully connected layer to obtain E_out4, and E_out4 obtains the corresponding value through the argmax layer.
• the text processing module can determine the corresponding character based on the value, for example, "fa". In other words, the text processing module passes the multi-modal encoding information, the "fire" character, the "mountain" character, the "explosion" character and the start character <s> through Transformer Decoder1501 to obtain the character "fa". For details not described here, please refer to the related content above for obtaining the character "fire", which will not be repeated.
• Next, the multi-modal encoding information is input as the K vector and V vector, and the "fire" character, the "mountain" character, the "explosion" character, the "fa" character and the start character <s> are input into Transformer Decoder1501 as the Q vector.
• Exemplarily, the "fire" character, the "mountain" character, the "explosion" character, the "fa" character and the start character <s> are input into Transformer Decoder1501a as the Q vector through Output Embedding and Positional Encoding.
• Transformer Decoder1501 outputs E_dout5 based on the multi-modal encoding information, the "fire" character, the "mountain" character, the "explosion" character, the "fa" character and the start character <s>.
• E_dout5 passes through the fully connected layer to obtain E_out5, and E_out5 obtains the corresponding value through the argmax layer.
• In this case, the text processing module determines that the output result is the end character <end>, which ends the loop.
  • the text processing module can obtain the correction result output by the correction module, that is, "volcanic eruption.”
  • the text processing module displays the obtained correction results in the recognition results.
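• The character-by-character correction walkthrough above amounts to greedy autoregressive decoding, sketched below; the model components, vocabulary size, and the serial numbers of <s> and <end> are all illustrative assumptions:

```python
import torch
import torch.nn as nn

embedding_size, N_vocab, max_len = 256, 100, 20
BOS, EOS = 1, 2  # assumed serial numbers of <s> and <end>

embed = nn.Embedding(N_vocab, embedding_size)          # Output Embedding
layer = nn.TransformerDecoderLayer(d_model=embedding_size, nhead=8)
decoder = nn.TransformerDecoder(layer, num_layers=2)   # Q = 2 stacked decoders
fc = nn.Linear(embedding_size, N_vocab)                # maps E_dout to E_out

E_IR = torch.randn(21, 1, embedding_size)  # multi-modal encoding information (K and V)

tokens = [BOS]
for _ in range(max_len):
    tgt = embed(torch.tensor(tokens)).unsqueeze(1)  # (seq_len, 1, embedding_size)
    E_dout = decoder(tgt, memory=E_IR)              # K/V come from the encoder output
    E_out = fc(E_dout)                              # (seq_len, 1, N_vocab)
    next_id = E_out[-1, 0].argmax().item()          # last position through argmax
    if next_id == EOS:                              # end character <end> ends the loop
        break
    tokens.append(next_id)                          # feed the new character back in

corrected_serials = tokens[1:]  # serial numbers of the corrected text
```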
• the models involved in the embodiments of this application can form a text processing model, which can also be understood as a neural network.
  • the input data of the model are mainly text images (including truncated and untruncated samples) and the corresponding text recognition content (i.e., text content).
• In the training phase, the text to be corrected is manually revised to obtain the corrected text, which serves as the supervision data for the output of the text correction decoder.
• the training process of the text processing model is supervised training. The classification decoder (i.e., the classification model) and the text correction decoder (i.e., the correction model) are trained using the teacher-forcing method at each time step. Since the two decoders share the encoder (that is, the backbone of the neural network of the text processing model), the actual training process is joint training.
• In addition to the vertical truncation described above, truncated text may also include horizontally truncated text and obliquely truncated text.
• For horizontally truncated text, the text recognition module can usually predict the text content during the OCR recognition process to obtain the correct text. That is to say, horizontal truncation generally may not cause the semantic errors described above for vertically truncated text.
• When the solution in the embodiment of the present application is applied to horizontally truncated text, the text can likewise be processed accordingly; the processed result may differ only slightly from the recognition result of the OCR technology.
• Obliquely truncated text is similar to horizontally truncated text.
• The correct text content can usually be obtained through prediction and other means within the OCR technology. That is to say, after processing through the solution in the embodiment of the present application, the difference between the output result and the OCR recognition result is small.
• It should be noted that for obliquely truncated text, OCR technology may not be able to recognize all text areas. For example, as shown in Figure 18a, the angle between the text line and the horizontal direction is assumed to be 30°.
• When OCR technology performs text area detection, the recognized text area only includes the part shown by the dotted line.
• When OCR technology recognizes the text content of the detected text area, based on its prediction function, it can output text content that is consistent with the original text. It can also be understood that for text with a large oblique angle, the corresponding recognition result may not have semantic errors.
  • FIG. 18b is a schematic diagram of an exemplary application scenario. Please refer to Figure 18b.
• In the image to be recognized, the lower part of a line of text is truncated.
  • the text processing module can also process the OCR recognition result corresponding to the text line based on the solution described in the above embodiment.
  • "partial occlusion" can optionally be the part of the entire line of text. The upper part (or the lower part, or any part) of part of the text is blocked.
• Figure 18c is a schematic diagram of another application scenario. Referring to Figure 18c, part of the text in a text line of the image to be recognized is occluded. That is, the original text is "multimodal encoding information (intermediate representation information)", and the "intermediate representation information" part is partially occluded.
• When the text recognition module performs OCR recognition on the text line, multiple text areas may be obtained.
• For example, the text recognition module may identify the text area corresponding to "multi-modal encoding information" as well as the text area corresponding to the occluded "(intermediate representation information)", together with the text content corresponding to the two text areas. Then, the text processing module can apply the processing solution in the embodiment of the present application to the images of the two text areas and the corresponding text content. Optionally, when the text recognition module performs OCR recognition on the text line, it is also possible to obtain a single text area. For example, as shown in Figure 18e, the text recognition module may divide the occluded text part and the unoccluded text part into the same text area.
• Based on the solution in the embodiment of the present application, the image and text content of this type of text area can also be processed.
  • the technical solution in the embodiment of the present application can be applied to a variety of scenes where the text is occluded, thereby meeting the needs for text recognition in different scenarios.
• Exemplarily, the embodiment of this application can effectively solve the text recognition problem for text lines with an occlusion rate of 20% to 50% (the range may also float, which is not limited by this application). It should be noted that, as mentioned above, if the occlusion rate of the text line is too high (for example, 80%), the corresponding text area may not be detected at the OCR stage. If the occlusion rate is low, the OCR recognition result may already be correct.
• In that case, the text processing module can directly output the corresponding text content, or output it after correction.
  • Figure 19 is a schematic flowchart of another text recognition method provided by an embodiment of the present application. Please refer to Figure 19. This method includes but is not limited to:
  • the text processing module passes the text image through the classification model to obtain the classification result.
  • the text processing module determines whether the text content is truncated based on the classification results.
  • the text processing module can pre-process the text image.
  • the pre-processing can be resizing the text image.
  • FIG. 20 is an exemplary schematic diagram of text image processing. Please refer to FIG. 20 .
• the text processing module inputs the text image 602a (which may also be the pre-processed text image) into the classification model.
  • the classification model can classify the text image 602a and obtain a classification result.
• the training data used by the classification model in the training phase includes, but is not limited to, text images corresponding to truncated text and text images corresponding to non-truncated text.
  • the classification model can be supervised with a cross-entropy loss function.
• the classification model may include but is not limited to mainstream classification networks based on Convolutional Neural Networks (CNN), for example VGG, ResNet, EfficientNet, etc., or classification models based on the Transformer structure, such as ViT (Vision Transformer) and its variants. Its purpose is mainly to output the probability of a binary classification problem, that is, the score corresponding to the truncated classification item or the non-truncated classification item.
• the classification model is recorded as CLS, and the text image includes parameters in three dimensions: width, height, and number of channels.
• the output score is optionally a value greater than 0 and less than 1, where the closer the value is to 1, the higher the truncation probability.
  • the text processing module can set a truncation threshold, for example, 0.5, which can be set according to actual needs and is not limited in this application. In one example, if the output result score is greater than or equal to the truncation threshold (0.5), the text content corresponding to the text image is determined to be truncated text. In another example, if the output result score is less than the truncation threshold (0.5), the text content corresponding to the text image is determined to be non-truncated text.
• If the text processing module determines that the text content corresponding to the text image is non-truncated text, the corresponding text content can be directly output, that is, displayed in the recognition result.
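• A minimal sketch of the truncation classifier and its threshold test; using torchvision's ResNet-18 as the CLS backbone is purely an assumption standing in for the mainstream classification networks listed above:

```python
import torch
from torchvision.models import resnet18

cls = resnet18(num_classes=1)  # CLS: one logit for the "truncated" class

def is_truncated(text_image: torch.Tensor, threshold: float = 0.5) -> bool:
    """text_image: (1, 3, H, W); the 0.5 threshold follows the description."""
    score = torch.sigmoid(cls(text_image)).item()  # value in (0, 1)
    return score >= threshold  # closer to 1 means higher truncation probability

print(is_truncated(torch.randn(1, 3, 32, 128)))
```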
  • the text processing module passes the text content through the semantic model to obtain semantic judgment results.
• If the text processing module determines that the text content corresponding to the text image is truncated text, the text processing module inputs the text content corresponding to the text image into the semantic model (which may also be called a semantic judgment module).
  • Figure 21 is an exemplary processing flow of the semantic model. Please refer to Figure 21.
  • the processing flow of the semantic model includes but is not limited to the following steps:
  • the text processing module segments the text content into words and obtains the word segmentation results.
• Exemplarily, the text processing module (specifically, the semantic model) performs word segmentation on the text content 602b and obtains the corresponding segmentation serial number sequence. For the specific steps of word segmentation and obtaining the serial number sequence, please refer to the relevant content in the above embodiments, which will not be repeated here.
• the text processing module passes the word segmentation results through Word Embedding and Positional Encoding to obtain E_text.
• Exemplarily, the text processing module (specifically, the semantic model) passes the obtained text serial number sequence through Word Embedding and Positional Encoding to obtain the text encoding information E_text.
• the text processing module passes E_text through the encoding module to obtain F_text.
• Exemplarily, the text processing module passes E_text through the encoding module (i.e., the encoder), and can obtain encoding information with high-dimensional semantic features, that is, F_text.
• Encoding modules include but are not limited to: a CNN encoder, an RNN encoder, a BiRNN (bidirectional recurrent neural network) encoder (such as a bidirectional LSTM (Long Short-Term Memory) network), a Transformer Encoder, etc., which is not limited in this application.
• the processing flow of the encoder can be referred to the relevant descriptions of Figure 14a and Figure 14b, which will not be repeated here; during implementation, E_text takes the place of the multi-modal encoding information in Figure 14a and Figure 14b.
• F_text = Encoder(E_text) (10)
• the text processing module passes F_text through the decoding module (i.e., the decoder) to obtain the output score score_t (which is the semantic judgment result).
• score_t = Decoder(F_text) (11)
  • the decoding module includes but is not limited to: MLP (ie fully connected layer) decoder, CNN decoder, RNN decoder and Transformer decoder, which can be set according to actual needs and is not limited in this application.
  • the decoding module please refer to the relevant contents of Figure 15, Figure 16 and Figure 17, and will not be described again here.
• score_t is the result of a binary classification problem; it can be understood that the output result is used to indicate semantic coherence or incoherence.
• It should be noted that in this example the argmax layer may not be included in the decoder. In other embodiments, an argmax layer may also be included, which is not limited in this application.
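• A minimal sketch of the semantic model, assuming a bidirectional LSTM encoder and an MLP decoder from the options listed above; all sizes are illustrative:

```python
import torch
import torch.nn as nn

vocab_size, embedding_size, hidden = 1000, 128, 128  # illustrative sizes

embed = nn.Embedding(vocab_size, embedding_size)
encoder = nn.LSTM(embedding_size, hidden, bidirectional=True)    # BiRNN encoder
decoder = nn.Sequential(nn.Linear(2 * hidden, 1), nn.Sigmoid())  # MLP decoder

def semantic_score(serial_sequence: torch.Tensor) -> float:
    E_text = embed(serial_sequence).unsqueeze(1)  # (m, 1, embedding_size)
    F_text, _ = encoder(E_text)                   # equation (10)
    return decoder(F_text[-1, 0]).item()          # score_t in (0, 1), equation (11)

print(semantic_score(torch.tensor([12, 52, 7])))
```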
• the input of the semantic model is mainly a line of text or a string, and the output is a category (i.e., a semantically coherent type or a semantically incoherent type).
• For training, the semantic model collects a corpus, and each item is manually annotated as to whether its semantics are coherent.
  • the semantic model can also obtain positive and negative training samples through data generation and other methods.
  • the score t output by the decoding module can be used to indicate semantic coherence.
• score_t can optionally be a value greater than 0 and less than 1.
  • the text processing module can set a semantic coherence threshold, such as 0.5, which can be set according to actual needs and is not limited in this application.
• In one example, if score_t is greater than or equal to the semantic coherence threshold (i.e., 0.5), the text processing module can determine that the corresponding text content is semantically coherent. In other words, the result of the OCR technology's recognition of the truncated text is correct. Correspondingly, the text processing module can directly output the text content, that is, display the corresponding text content in the text recognition result.
• In another example, if score_t is less than the semantic coherence threshold (i.e., 0.5), the text processing module may determine that the corresponding text content is semantically incoherent. That is to say, there is a semantic error in the recognition result of the truncated text by the OCR technology, and the text processing module continues to perform step (5).
  • the text processing module can also detect semantic coherence in other ways, for example, it can be based on a grammatical error checking model.
• the grammatical error checking model can output a candidate set of grammatical error positions based on the input text content, and a threshold judgment can be set based on the ratio of the size of the candidate set to the total number of tokens (minimal semantic units).
  • the text processing module can obtain the probability of each token through the forward language model, and make a judgment based on the average probability and a preset threshold. For specific details, please refer to the relevant content in the prior art embodiments and will not be described again here.
  • (5) the text processing module determines whether the text content can be corrected.
  • optionally, the text processing module can continue to determine whether the text content can be corrected based on the result output by the semantic model.
  • the text processing module can set a correction threshold, such as 0.2, which can be set according to actual needs and is not limited in this application.
  • if score_t is greater than or equal to the correction threshold (i.e., 0.2), the text processing module can determine that the corresponding text content can be corrected; the text processing module then corrects the text content and outputs it.
  • optionally, the text processing module can use the text content as the input of the correction module and perform correction through the correction module; for the processing flow of the correction module, please refer to the relevant contents of Figure 15, Figure 16 and Figure 17, which will not be described again here.
  • if score_t is less than the correction threshold (i.e., 0.2), the text processing module can determine that the corresponding text content cannot be corrected; the text processing module then filters the text content, that is, the corresponding text content is not displayed in the text recognition result. The overall decision flow is sketched below.
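  • putting the two thresholds together, the following sketch shows the resulting decision flow, using the example values 0.5 and 0.2 from above; the correct() helper is a hypothetical stand-in for the correction module.

```python
from typing import Optional

def decide(score_t: float, text: str,
           coherence_threshold: float = 0.5,
           correction_threshold: float = 0.2) -> Optional[str]:
    """Step (4)/(5) decision: output, correct, or filter the recognized text."""
    if score_t >= coherence_threshold:
        return text           # semantically coherent: output directly
    if score_t >= correction_threshold:
        return correct(text)  # incoherent but correctable: correct and output
    return None               # not correctable: filter from the result

def correct(text: str) -> str:
    # hypothetical stand-in for the correction module (see Figures 15 to 17)
    return text

print(decide(0.8, "hello world"))  # 'hello world'
print(decide(0.3, "hrllo world"))  # corrected text (here unchanged by the stub)
print(decide(0.1, "h#l%o"))        # None (filtered)
```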
  • the text processing module can also detect whether the text content can be corrected based on other detection methods. For example, as mentioned above, the text processing module can perform processing based on the grammatical error checking model in the semantic coherence judgment process; the text processing module can further determine, based on the output results of the grammatical error checking model, the number of grammatical errors or the proportion of grammatically erroneous characters in the total number of characters, and determine whether the text content can be corrected based on this proportion. For another example, as mentioned above, the semantic coherence judgment can calculate the average probability based on the forward language model, and the text processing module can determine whether the text content can be corrected based on the average probability (for example, by setting a corresponding correction threshold).
  • the text processing module can also use other correction methods; for example, the output results of the grammar-based error checking model can be used to correct the text content through confusion set recall and candidate ranking.
  • optionally, the text processing module can obtain a confusion set for the error positions based on the output results of the grammatical error checking model by calling a statistical language model, a neural language model, or a BERT-style bidirectional language model, and then recall the corrected text through candidate ranking and error screening mechanisms. A minimal sketch follows.
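  • the following sketch illustrates confusion set recall with candidate ranking, assuming the grammatical error checking model has already flagged the error positions; the confusion sets and the scoring function below are toy stand-ins for the language models named above.

```python
from itertools import product
from typing import Dict, List, Set

CONFUSION: Dict[str, Set[str]] = {"0": {"o", "0"}, "1": {"l", "1"}}  # toy confusion sets

def score(candidate: str) -> float:
    # toy stand-in for a language-model score: prefer alphabetic characters
    return sum(c.isalpha() for c in candidate) / len(candidate)

def recall_correction(text: str, error_positions: List[int]) -> str:
    options = [CONFUSION.get(text[i], {text[i]}) for i in error_positions]
    best = text
    for combo in product(*options):        # enumerate confusion-set candidates
        chars = list(text)
        for pos, ch in zip(error_positions, combo):
            chars[pos] = ch
        candidate = "".join(chars)
        if score(candidate) > score(best):  # candidate ranking
            best = candidate
    return best

print(recall_correction("he110", [2, 3, 4]))  # -> 'hello'
```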
  • the models involved in the steps in Figure 19 can form a neural network.
  • the training method of the neural network can refer to the relevant description of neural network training in the above embodiments, and will not be described again here.
  • FIG. 22 shows a schematic block diagram of a device 2200 according to an embodiment of the present application.
  • the device 2200 may include: a processor 2201 and a transceiver/transceiver pin 2202, and optionally, a memory 2203.
  • the components of the device 2200 are coupled together through a bus 2204, which includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity, the various buses are all referred to as bus 2204 in the figure.
  • optionally, the memory 2203 may be used to store the instructions in the foregoing method embodiments.
  • the processor 2201 can be used to execute the instructions in the memory 2203, control the receiving pin to receive signals, and control the transmitting pin to send signals.
  • the device 2200 may be the electronic device, or a chip of the electronic device, in the above method embodiments.
  • this embodiment also provides a computer storage medium that stores computer instructions.
  • when the computer instructions are run on an electronic device, they cause the electronic device to execute the above related method steps to implement the method in the above embodiments.
  • this embodiment also provides a computer program product.
  • when the computer program product is run on a computer, it causes the computer to perform the above related steps to implement the method in the above embodiments.
  • embodiments of the present application also provide a device.
  • this device may specifically be a chip, a component, or a module.
  • the device may include a processor and a memory that are connected, where the memory is used to store computer execution instructions.
  • when the device is running, the processor can execute the computer execution instructions stored in the memory, so that the chip executes the methods in each of the above method embodiments.
  • the electronic device, computer storage medium, computer program product, or chip provided in this embodiment are all used to execute the corresponding methods provided above; therefore, for the beneficial effects they can achieve, refer to the beneficial effects of the corresponding methods provided above, which will not be repeated here.
  • computer-readable media include computer storage media and communication media, where communication media include any medium that facilitates transfer of a computer program from one place to another.
  • a storage medium can be any available medium that can be accessed by a general-purpose or special-purpose computer.
  • "A and/or B" can represent three situations: A exists alone, A and B exist simultaneously, or B exists alone.
  • the terms "first" and "second" in the description and claims of the embodiments of this application are used to distinguish different objects, rather than to describe a specific order of objects.
  • for example, the first target object, the second target object, etc. are used to distinguish different target objects, rather than to describe a specific order of the target objects.
  • "multiple processing units" refers to two or more processing units; "multiple systems" refers to two or more systems.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

Provided in embodiments of the present application are a text recognition method and an electronic device. The method comprises: an electronic device can obtain an image and first text content in a first text region of an object to be recognized; the electronic device classifies the image and the first text content in the first text region to display a text recognition result of the first text region on the basis of a classification result, wherein the classification result comprises a first classification, a second classification, and a third classification. The text recognition result corresponding to the first classification filters the first text content. The text recognition result corresponding to the second classification comprises text content after correction of the first text content. The text recognition result corresponding to the third classification comprises the first text content. In this way, the electronic device can comprehensively consider an image and text content in a text region to avoid semantically incorrect text content in text recognition results, thereby improving user experience.

Description

Text recognition method and electronic device
This application claims priority to the Chinese patent application filed with the State Intellectual Property Office of China on May 30, 2022, with application number 202210597895.6 and titled "Text recognition method and electronic device", the entire content of which is incorporated into this application by reference.
Technical field
Embodiments of the present application relate to the field of terminal devices, and in particular, to a text recognition method and an electronic device.
Background
With the continuous development of communication technology, mobile phones and other terminals have become an indispensable part of people's daily lives. Users can use mobile phones not only to communicate with other users, but also to browse or process various types of information.
During use, if the user is interested in content displayed on the mobile phone, such as certain text in a picture or an application interface, the user can use the text recognition function of an application to recognize the text in the picture or interface. The text recognition function is usually implemented based on optical character recognition (OCR) technology. Taking a picture as an example, the application can recognize the text in the picture based on OCR technology and output the recognition result. However, in text recognition scenarios that include truncated text, the output of current OCR technology differs considerably from the original text, which affects the user experience.
Summary
In order to solve the above technical problems, this application provides a text recognition method and an electronic device. In this method, the electronic device can output a text recognition result that meets the user's needs based on the image and the text content of a text area.
In a first aspect, embodiments of the present application provide a text recognition method. The method includes: the electronic device performs text area detection on an object to be recognized to obtain an image of a first text area, where the first text area includes text content. The electronic device performs text content recognition on the acquired first text area to obtain first text content. Then, the electronic device performs classification based on the image of the first text area and the first text content to obtain a classification result. Subsequently, the electronic device displays the text recognition result of the first text area based on the classification result. The step of displaying the text recognition result may specifically include: if the classification result is the first category, the text recognition result filters out the first text content; if the classification result is the second category, the text recognition result includes the corrected text content of the first text content; if the classification result is the third category, the text recognition result includes the first text content. In this way, by comprehensively considering the image information (i.e., the image of the text area) and the text information (i.e., the text content), the electronic device can filter out the text content recognition result (i.e., the first text content) when much of the text content contained in the text area is missing, output the corrected result when little of the text content is missing, and output the corresponding text when the text content is not missing. As a result, correct and semantically smooth results can be presented in the text recognition result, while semantically erroneous results (i.e., text content) are filtered out, so that an anthropomorphic, complex decision-making effect can be obtained to improve the user experience.
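Illustratively, the dispatch on the classification result described in the first aspect can be sketched as follows. The category names and the corrected-text argument are hypothetical; in a real implementation they would come from the classification model and the correction model, respectively.

```python
from enum import Enum
from typing import Optional

class Category(Enum):
    FILTER = 1   # first category: filter out the first text content
    CORRECT = 2  # second category: display the corrected text content
    KEEP = 3     # third category: display the first text content as-is

def text_recognition_result(category: Category, first_text: str,
                            corrected_text: str) -> Optional[str]:
    if category is Category.FILTER:
        return None  # nothing is displayed for this text area
    if category is Category.CORRECT:
        return corrected_text
    return first_text

print(text_recognition_result(Category.CORRECT, "truncted txt", "truncated text"))
```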
Illustratively, the text recognition result is optionally displayed in the text recognition result display box 405 in FIG. 4. That is to say, if the text recognition result is the result indicated by the first category (i.e., filtering), the result corresponding to the first text area in the text recognition result display box 405 is empty, that is, the text content recognition result corresponding to the first text area (i.e., the first text content) is not displayed. If the text recognition result is the result indicated by the second category (i.e., outputting the corrected text content) or by the third category (i.e., directly outputting the text content), the text recognition result display box 405 includes the corrected text content corresponding to the first text area or the text content of the first text area.
Illustratively, the text recognition result may be a result corresponding to the text area itself. For example, if the text recognition result is the result indicated by the first category (i.e., filtering), the text recognition result corresponding to the first text area displayed by the electronic device is empty (a blank space may or may not be left). If the text recognition result is the result indicated by the second category (i.e., outputting the corrected text content) or by the third category (i.e., directly outputting the text content), the electronic device may display the text content corresponding to the first text area in the text recognition result display box 405 (which may be the corrected text content, or the result of text content recognition).
Illustratively, the classification result is optionally a numerical value, and the numerical value is used to represent the classification item.
Illustratively, the classification result may also include three numerical values, and the classification corresponding to the largest numerical value is the classification corresponding to the first text area.
According to the first aspect, the electronic device performing classification based on the image of the first text area and the first text content to obtain a classification result includes: the electronic device obtains intermediate representation information based on the image of the first text area and the first text content; the electronic device classifies the intermediate representation information to obtain the classification result. In this way, the electronic device uses high-dimensional multi-modal semantic information to make more refined decisions on different input combinations, thereby achieving an anthropomorphic, complex decision-making effect.
Illustratively, the intermediate representation information may be called multi-modal information.
Illustratively, the intermediate representation information may be used to characterize the image features of the image of the first text area and the text features of the first text content.
According to the first aspect, or any implementation of the first aspect above, the electronic device classifying the intermediate representation information to obtain the classification result includes: the electronic device classifies the intermediate representation information through a classification model to obtain the classification result. In this way, the electronic device can classify the intermediate representation information through a pre-trained classification model to obtain the corresponding classification result.
According to the first aspect, or any implementation of the first aspect above, before the electronic device displays the text recognition result of the first text area based on the classification result, the method further includes: the electronic device corrects the intermediate representation information to obtain the corrected text content of the first text content. Illustratively, the electronic device corrects the intermediate representation information to obtain the corrected text content before, at the same time as, or after classifying the intermediate representation information. The electronic device can determine, based on the classification result, whether to output the corrected text content. Illustratively, if the corrected text content does not need to be output, for example when the classification result is the first category or the third category, the corrected text content is discarded.
According to the first aspect, or any implementation of the first aspect above, the electronic device correcting the intermediate representation information to obtain the corrected target text content includes: the electronic device corrects the intermediate representation information through a correction model to obtain the corrected text content of the first text content. In this way, the electronic device can correct the intermediate representation information through a pre-trained correction model to obtain the corrected text content.
According to the first aspect, or any implementation of the first aspect above, the electronic device obtaining the intermediate representation information based on the image of the first text area and the first text content includes: the electronic device performs image encoding on the image of the first text area to obtain first image encoding information; the electronic device performs text encoding on the first text content to obtain first text encoding information; and the electronic device performs multi-modal encoding on the first image encoding information and the first text encoding information through a multi-modal encoding model to obtain the intermediate representation information. In this way, by encoding the image of the text area and the text content, the electronic device can obtain higher-dimensional semantic information. The electronic device can perform multi-modal encoding on the first image encoding information and the first text encoding information through a pre-trained multi-modal encoding model to obtain intermediate representation information with high-dimensional semantics.
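Illustratively, the encoding pipeline described above can be sketched as follows in PyTorch. The encoder architectures and all dimensions are simplified assumptions, since this application does not fix the specific image encoder, text encoder, or multi-modal encoding model.

```python
import torch
import torch.nn as nn

class MultimodalEncoder(nn.Module):
    def __init__(self, img_dim=64, txt_vocab=5000, emb_dim=64, out_dim=128):
        super().__init__()
        self.image_encoder = nn.Sequential(  # produces the first image encoding information
            nn.Conv2d(3, 8, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, img_dim))
        self.text_encoder = nn.EmbeddingBag(txt_vocab, emb_dim)  # first text encoding info
        self.fusion = nn.Linear(img_dim + emb_dim, out_dim)      # multi-modal encoding

    def forward(self, image: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
        e_img = self.image_encoder(image)
        e_txt = self.text_encoder(token_ids)
        # fuse both modalities into the intermediate representation information
        return self.fusion(torch.cat([e_img, e_txt], dim=-1))

model = MultimodalEncoder()
z = model(torch.randn(1, 3, 32, 128), torch.randint(0, 5000, (1, 12)))
print(z.shape)  # torch.Size([1, 128])
```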
According to the first aspect, or any implementation of the first aspect above, the multi-modal encoding model, the classification model, and the correction model form a neural network. The training data of the neural network includes a second text area and second text content corresponding to the second text area, as well as a third text area and third text content corresponding to the third text area; the second text area includes partially missing text content, and the text content in the third text area is complete text content. In this way, by inputting images and text content of different types of text areas (including text areas with and without missing text), the neural network can be trained iteratively, so that the neural network can complete the corresponding functions, that is, it can fuse, classify, and correct the image and text content of a text area.
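Illustratively, the two kinds of training samples described above could be synthesized as in the following sketch, which assumes Pillow and simulates a partially missing text area by cropping away the lower part of a rendered text-line image; the cut ratio and the labels are illustrative assumptions, not values from this application.

```python
from PIL import Image

def make_training_pair(line_image: Image.Image, text: str, cut_ratio: float = 0.4):
    w, h = line_image.size
    truncated = line_image.crop((0, 0, w, int(h * (1 - cut_ratio))))  # missing lower part
    negative = (truncated, text, "partially_missing")  # second text area sample
    positive = (line_image, text, "complete")          # third text area sample
    return positive, negative

# Usage (assumes a rendered text-line image on disk):
# pos, neg = make_training_pair(Image.open("line.png"), "hello world")
```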
According to the first aspect, or any implementation of the first aspect above, the text recognition result of the first text area is displayed in a text recognition area, and the text recognition area also includes the text content corresponding to a third text area in the object to be recognized. In this way, the text recognition method in this application can implement different processing methods for text content; that is, the text recognition results finally displayed are all semantically coherent text content. For semantically incoherent text content in the text content recognition results, filtering or correction is used to avoid the impact of semantically incoherent text content on the text recognition results.
According to the first aspect, or any implementation of the first aspect above, if the first text area includes partially missing text content, the text recognition result is the first category or the second category. Illustratively, partially missing text content may mean that every character in the text area is missing part of its information, for example the upper half or the lower half. Illustratively, partially missing text may also mean that at least one character in the text area is missing part of its information.
According to the first aspect, or any implementation of the first aspect above, the semantics expressed by the first text content are different from the semantics expressed by the text content in the first text area. In this way, in embodiments of the present application, the text content recognition results can be screened to filter or correct text content whose semantics differ from the original, thereby improving the user experience.
According to the first aspect, or any implementation of the first aspect above, the object to be recognized is a picture, a web page, or a document.
In a second aspect, embodiments of the present application provide a text recognition method. The method includes: the electronic device performs text area detection on an object to be recognized to obtain an image of a first text area, where the first text area includes text content. The electronic device performs text content recognition on the first text area to obtain first text content. The electronic device displays the text recognition result of the first text area based on the image of the first text area and the first text content. The electronic device displaying the text recognition result of the first text area based on the image of the first text area and the first text content includes: if the image of the first text area indicates that the first text area includes partially missing text content and the first text content is semantically coherent text content, or if the image of the first text area indicates that the first text area does not include partially missing text content, the text recognition result includes the first text content; if the image of the first text area indicates that the first text area includes partially missing text content and the first text content includes semantically erroneous text content, the text recognition result filters out the first text content or the text recognition result includes the corrected text content of the first text content. In this way, by comprehensively considering the image information (i.e., the image of the text area) and the text information (i.e., the text content), the electronic device can filter out the text content recognition result (i.e., the first text content) when much of the text content contained in the text area is missing, output the corrected result when little of the text content is missing, and output the corresponding text when the text content is not missing. As a result, correct and semantically smooth results can be presented in the text recognition result, while semantically erroneous results (i.e., text content) are filtered out, so that an anthropomorphic, complex decision-making effect can be obtained to improve the user experience.
Illustratively, the electronic device can detect, based on the image of the text area, whether the text content in the text area is truncated, that is, whether it includes text with missing content. In one example, if the text content is not truncated, the first text content can be output directly. In another example, if the text content is truncated, it is detected whether the semantics of the first text content are coherent. If the semantics of the first text content are coherent, the first text content can be output directly. If the semantics of the first text content are incoherent, it is further detected whether the first text content can be corrected. If the first text content can be corrected, the corrected text content is output; if the first text content cannot be corrected, the first text content is filtered out.
According to the second aspect, the electronic device displaying the text recognition result of the first text area based on the image of the first text area and the first text content includes: if the image of the first text area indicates that the first text area includes partially missing text content and the first text content includes semantically incoherent text content, the electronic device detects whether the first text content can be corrected. If the first text content cannot be corrected, the text recognition result filters out the first text content. If the first text content can be corrected, the text recognition result includes the corrected text content of the first text content. In this way, when the electronic device detects that the text content in the first text area is truncated and the semantics of the first text content are incoherent, it can further detect whether the first text content can be corrected. If it can be corrected, the electronic device can correct the first text content and output the corrected text content. If it cannot be corrected, the electronic device filters out the first text content. That is to say, the text recognition result of the first text area displayed by the electronic device is either empty, or the corrected text content, or the originally semantically coherent text content, so as to avoid the impact of erroneous text content recognition results on the user.
According to the second aspect, or any implementation of the second aspect above, if the first text content can be corrected, the method further includes: the electronic device corrects the first text content through a correction model to obtain the corrected text content of the first text content. In this way, the electronic device can correct the first text content through a pre-trained correction model to obtain semantically coherent text content.
According to the second aspect, or any implementation of the second aspect above, the electronic device displaying the text recognition result of the first text area based on the image of the first text area and the first text content includes: the electronic device classifies the image of the first text area through a classification model to obtain a classification result; the classification result is used to indicate whether the first text area includes partially missing text content. In this way, the electronic device can classify the image of the text area through a pre-trained classification model to detect whether the text content in the text area is truncated.
According to the second aspect, or any implementation of the second aspect above, if the image of the first text area indicates that the first text area includes partially missing text content, the electronic device displaying the text recognition result of the first text area based on the image of the first text area and the first text content includes: the electronic device performs semantic analysis on the first text content through a semantic model to obtain a semantic analysis result; the semantic analysis result is used to indicate whether the first text content includes semantically erroneous text content. In this way, the electronic device can perform semantic analysis on the text content through a pre-trained semantic model to obtain the semantic analysis result.
Illustratively, the semantic analysis result can be a numerical value, and the electronic device can preset a semantic coherence threshold, which is used to indicate the semantic coherence of the text content. If the value of the semantic analysis result is greater than or equal to the threshold, the first text content is semantically coherent; if the value of the semantic analysis result is less than the threshold, the first text content is semantically incoherent.
According to the second aspect, or any implementation of the second aspect above, the semantic analysis result is also used to indicate whether the first text content can be corrected, and the electronic device displaying the text recognition result of the first text area based on the image of the first text area and the first text content includes: the electronic device determines, based on the semantic analysis result, whether the first text content can be corrected. The electronic device can set a correction threshold, which is different from the semantic coherence threshold. If the value of the semantic analysis result is greater than or equal to the correction threshold, the first text content can be corrected; if the value of the semantic analysis result is less than the correction threshold, the first text content cannot be corrected.
According to the second aspect, or any implementation of the second aspect above, the correction model, the classification model, and the semantic model form a neural network. The training data of the neural network includes a second text area and second text content corresponding to the second text area, as well as a third text area and third text content corresponding to the third text area; the second text area includes partially missing text content, and the text content in the third text area is complete text content. In this way, by inputting images and text content of different types of text areas (including text areas with and without missing text), the neural network can be trained iteratively, so that the neural network can complete the corresponding functions, that is, it can perform truncation judgment, semantic analysis, and correction on the image and text content of a text area.
According to the second aspect, or any implementation of the second aspect above, the text recognition result of the first text area is displayed in a text recognition area, and the text recognition area also includes the text content corresponding to a third text area in the object to be recognized.
According to the second aspect, or any implementation of the second aspect above, the semantics expressed by the semantically erroneous text content are different from the semantics expressed by the corresponding text content in the first text area.
According to the second aspect, or any implementation of the second aspect above, the object to be recognized is a picture, a web page, or a document.
In a third aspect, embodiments of the present application provide an electronic device. The electronic device includes: one or more processors; a memory; and one or more computer programs, where the one or more computer programs are stored in the memory, and when the computer programs are executed by the one or more processors, the electronic device is caused to execute the instructions of the method in the first aspect or any possible implementation of the first aspect.
In a fourth aspect, embodiments of the present application provide an electronic device. The electronic device includes: one or more processors; a memory; and one or more computer programs, where the one or more computer programs are stored in the memory, and when the computer programs are executed by the one or more processors, the electronic device is caused to execute the instructions of the method in the second aspect or any possible implementation of the second aspect.
In a fifth aspect, embodiments of the present application provide a computer-readable medium for storing a computer program, where the computer program includes instructions for executing the method in the first aspect or any possible implementation of the first aspect.
In a sixth aspect, embodiments of the present application provide a computer-readable medium for storing a computer program, where the computer program includes instructions for executing the method in the second aspect or any possible implementation of the second aspect.
In a seventh aspect, embodiments of the present application provide a computer program, which includes instructions for executing the method in the first aspect or any possible implementation of the first aspect.
In an eighth aspect, embodiments of the present application provide a computer program, which includes instructions for executing the method in the second aspect or any possible implementation of the second aspect.
Description of the drawings
Figure 1 is a schematic diagram of the hardware structure of an exemplary electronic device;
Figure 2 is a schematic diagram of the software structure of an exemplary electronic device;
Figure 3 is a schematic diagram of an exemplary text recognition scenario containing truncated text;
Figure 4 is a schematic diagram of an exemplary application scenario applying the text recognition method in the embodiments of the present application;
Figure 5 is a schematic flowchart of an exemplary text recognition method;
Figure 6 is a schematic diagram of exemplary text recognition;
Figure 7 is a schematic diagram of exemplary text image encoding;
Figure 8 is a schematic diagram of exemplary image information encoding;
Figure 9 is a schematic diagram of exemplary image information encoding;
Figure 10 is a schematic diagram of exemplary Image Patch flattening;
Figure 11 is a schematic diagram of exemplary text content encoding;
Figure 12 is a schematic diagram of an exemplary text information encoding process;
Figure 13 is a schematic diagram of an exemplary process for obtaining intermediate representation information;
Figure 14a is a schematic diagram of exemplary multi-modal encoding;
Figure 14b is a schematic diagram of the processing flow of the multi-modal encoder;
Figure 14c is a schematic diagram of an exemplary classification process;
Figure 15 is a schematic diagram of exemplary text correction;
Figure 16 is a schematic diagram of the processing flow of the correction module;
Figure 17 is a schematic diagram of the processing flow of the Transformer Decoder;
Figure 18a is a schematic diagram of an exemplary application scenario;
Figure 18b is a schematic diagram of another exemplary application scenario;
Figure 18c is a schematic diagram of yet another exemplary application scenario;
Figure 18d is a schematic diagram of yet another exemplary application scenario;
Figure 18e is a schematic diagram of yet another exemplary application scenario;
Figure 19 is a schematic flowchart of an exemplary text recognition method;
Figure 20 is a schematic diagram of exemplary text image processing;
Figure 21 is an exemplary processing flow of the semantic model;
Figure 22 is a schematic structural diagram of an exemplary device.
Detailed description
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the accompanying drawings in the embodiments of the present application. Obviously, the described embodiments are some, rather than all, of the embodiments of the present application. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the scope of protection of this application.
Figure 1 shows a schematic structural diagram of the electronic device 100. It should be understood that the electronic device 100 shown in Figure 1 is only one example of an electronic device, and the electronic device 100 may have more or fewer components than shown in the figure, may combine two or more components, or may have a different component configuration. The various components shown in Figure 1 may be implemented in hardware, software, or a combination of hardware and software including one or more signal processing and/or application-specific integrated circuits.
The electronic device 100 may include: a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, a headphone interface 170D, a sensor module 180, a button 190, a motor 191, an indicator 192, a camera 193, a display screen 194, a subscriber identification module (SIM) card interface 195, and the like. The sensor module 180 may include a pressure sensor 180A, a gyroscope sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, and the like.
The processor 110 may include one or more processing units. For example, the processor 110 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a memory, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU), and the like. Different processing units may be independent devices or may be integrated in one or more processors.
The controller may be the nerve center and command center of the electronic device 100. The controller can generate operation control signals based on the instruction operation code and timing signals to complete the control of fetching and executing instructions.
The processor 110 may also be provided with a memory for storing instructions and data. In some embodiments, the memory in the processor 110 is a cache memory. This memory may hold instructions or data that the processor 110 has just used or uses cyclically. If the processor 110 needs to use the instructions or data again, it can call them directly from this memory. This avoids repeated access and reduces the waiting time of the processor 110, thus improving the efficiency of the system.
The charging management module 140 is used to receive charging input from a charger. The charger may be a wireless charger or a wired charger. In some wired charging embodiments, the charging management module 140 may receive the charging input of a wired charger through the USB interface 130. In some wireless charging embodiments, the charging management module 140 may receive wireless charging input through the wireless charging coil of the electronic device 100. While charging the battery 142, the charging management module 140 can also supply power to the electronic device through the power management module 141.
The power management module 141 is used to connect the battery 142 and the charging management module 140 to the processor 110. The power management module 141 receives input from the battery 142 and/or the charging management module 140, and supplies power to the processor 110, the internal memory 121, the external memory, the display screen 194, the camera 193, the wireless communication module 160, and so on. The power management module 141 can also be used to monitor parameters such as battery capacity, battery cycle count, and battery health status (leakage, impedance). In some other embodiments, the power management module 141 may also be provided in the processor 110. In other embodiments, the power management module 141 and the charging management module 140 may also be provided in the same device.
The wireless communication function of the electronic device 100 can be implemented through the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, the modem processor, the baseband processor, and so on.
Antenna 1 and antenna 2 are used to transmit and receive electromagnetic wave signals. Each antenna in the electronic device 100 may be used to cover a single communication frequency band or multiple communication frequency bands. Different antennas can also be multiplexed to improve antenna utilization. For example, antenna 1 can be multiplexed as a diversity antenna of a wireless local area network. In other embodiments, the antennas may be used in combination with a tuning switch.
The mobile communication module 150 can provide wireless communication solutions including 2G/3G/4G/5G applied to the electronic device 100. The mobile communication module 150 may include at least one filter, a switch, a power amplifier, a low noise amplifier (LNA), and the like. The mobile communication module 150 can receive electromagnetic waves through the antenna 1, perform filtering, amplification, and other processing on the received electromagnetic waves, and transmit them to the modem processor for demodulation. The mobile communication module 150 can also amplify the signal modulated by the modem processor and convert it into electromagnetic waves through the antenna 1 for radiation. In some embodiments, at least some of the functional modules of the mobile communication module 150 may be provided in the processor 110. In some embodiments, at least some of the functional modules of the mobile communication module 150 may be provided in the same device as at least some of the modules of the processor 110.
The modem processor may include a modulator and a demodulator. The modulator is used to modulate the low-frequency baseband signal to be sent into a medium- or high-frequency signal. The demodulator is used to demodulate the received electromagnetic wave signal into a low-frequency baseband signal. The demodulator then transmits the demodulated low-frequency baseband signal to the baseband processor for processing. After being processed by the baseband processor, the low-frequency baseband signal is passed to the application processor. The application processor outputs sound signals through audio devices (not limited to the speaker 170A, the receiver 170B, etc.), or displays images or videos through the display screen 194. In some embodiments, the modem processor may be an independent device. In other embodiments, the modem processor may be independent of the processor 110 and provided in the same device as the mobile communication module 150 or other functional modules.
The wireless communication module 160 can provide wireless communication solutions applied to the electronic device 100, including wireless local area networks (WLAN) (such as wireless fidelity (Wi-Fi) networks), Bluetooth (BT), global navigation satellite system (GNSS), frequency modulation (FM), near field communication (NFC), infrared (IR), and the like. The wireless communication module 160 may be one or more devices integrating at least one communication processing module. The wireless communication module 160 receives electromagnetic waves via the antenna 2, performs frequency modulation and filtering on the electromagnetic wave signals, and sends the processed signals to the processor 110. The wireless communication module 160 can also receive signals to be sent from the processor 110, perform frequency modulation and amplification on them, and convert them into electromagnetic waves through the antenna 2 for radiation.
In some embodiments, the antenna 1 of the electronic device 100 is coupled to the mobile communication module 150, and the antenna 2 is coupled to the wireless communication module 160, so that the electronic device 100 can communicate with networks and other devices through wireless communication technology. The electronic device 100 implements the display function through the GPU, the display screen 194, the application processor, and so on. The GPU is a microprocessor for image processing and connects the display screen 194 and the application processor. The GPU is used to perform mathematical and geometric calculations for graphics rendering. The processor 110 may include one or more GPUs that execute program instructions to generate or change display information.
The display screen 194 is used to display images, videos, and the like. The display screen 194 includes a display panel. The display panel may use a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a Mini LED, a Micro LED, a Micro-OLED, quantum dot light-emitting diodes (QLED), and the like. In some embodiments, the electronic device 100 may include 1 or N display screens 194, where N is a positive integer greater than 1.
The electronic device 100 can implement the shooting function through the ISP, the camera 193, the video codec, the GPU, the display screen 194, the application processor, and so on.
The ISP is used to process the data fed back by the camera 193. The camera 193 is used to capture still images or videos. An object passes through the lens to generate an optical image that is projected onto the photosensitive element. In some embodiments, the electronic device 100 may include 1 or N cameras 193, where N is a positive integer greater than 1.
The external memory interface 120 can be used to connect an external memory card, such as a Micro SD card, to expand the storage capacity of the electronic device 100. The external memory card communicates with the processor 110 through the external memory interface 120 to implement the data storage function, for example, saving music, video, and other files in the external memory card.
The internal memory 121 may be used to store computer executable program code, where the executable program code includes instructions. The processor 110 executes the instructions stored in the internal memory 121 to perform various functional applications and data processing of the electronic device 100. The internal memory 121 may include a program storage area and a data storage area. The program storage area may store the operating system and the application programs required for at least one function (such as a sound playback function, an image playback function, etc.). The data storage area may store data created during use of the electronic device 100 (such as audio data, a phone book, etc.). In addition, the internal memory 121 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, a universal flash storage (UFS), and the like.
The electronic device 100 can implement audio functions, such as music playback and recording, through the audio module 170, the speaker 170A, the receiver 170B, the microphone 170C, the headphone interface 170D, the application processor, and so on.
The audio module 170 is used to convert digital audio information into an analog audio signal output, and is also used to convert an analog audio input into a digital audio signal. The audio module 170 can also be used to encode and decode audio signals. In some embodiments, the audio module 170 may be provided in the processor 110, or some functional modules of the audio module 170 may be provided in the processor 110.
The software system of the electronic device 100 may adopt a layered architecture, an event-driven architecture, a microkernel architecture, a microservice architecture, or a cloud architecture. The embodiments of this application take an Android system with a layered architecture as an example to describe the software structure of the electronic device 100. In other embodiments, the embodiments of this application may also be applied to other systems such as the Hongmeng system (HarmonyOS); their implementations may all refer to the technical solutions in the embodiments of this application and are not illustrated one by one here.

FIG. 2 is a block diagram of the software structure of the electronic device 100 according to an embodiment of this application.

The layered architecture of the electronic device 100 divides the software into several layers, each with a clear role and division of labor. The layers communicate with each other through software interfaces. In some embodiments, the Android system is divided into four layers: from top to bottom, the application layer, the application framework layer, the Android runtime and system libraries, and the kernel layer.

The application layer may include a series of application packages.

As shown in FIG. 2, the application packages may include applications such as Camera, Gallery, Calendar, Phone, Maps, Navigation, WLAN, Bluetooth, Music, Videos, Messages, text recognition, and text processing. In the embodiments of this application, the text recognition application may also be called a text recognition module or a text recognition engine, which is not limited in this application. The text recognition module may be used to identify the text areas and the text content in a picture to be recognized (the specific concepts are described below). The text processing application may also be called a text processing module, and is used to further process the output of the text recognition module (for the specific processing flow, refer to the embodiments below). It should be noted that the embodiments of this application are described using the example in which the text processing module further processes the results of the text recognition module. In other embodiments, the steps performed by the text processing module may also be performed by the text recognition module; in other words, the steps performed by the text recognition module and the text processing module may be performed by a single module, which is not limited in this application.
The application framework layer provides an application programming interface (API) and a programming framework for the applications in the application layer. The application framework layer includes some predefined functions.

As shown in FIG. 2, the application framework layer may include a window manager, a content provider, a view system, a phone manager, a resource manager, a notification manager, and the like.

The window manager is used to manage window programs. The window manager can obtain the display size, determine whether there is a status bar, lock the screen, capture the screen, and so on.

The content provider is used to store and retrieve data and make the data accessible to applications. The data may include videos, images, audio, calls made and received, browsing history and bookmarks, a phone book, and the like.

The view system includes visual controls, such as controls for displaying text and controls for displaying pictures. The view system may be used to build applications. A display interface may consist of one or more views. For example, a display interface that includes a message notification icon may include a view for displaying text and a view for displaying pictures.

The phone manager is used to provide the communication functions of the electronic device 100, for example, management of the call state (including connected, hung up, and so on).

The resource manager provides various resources for applications, such as localized strings, icons, pictures, layout files, and video files.

The notification manager enables an application to display notification information in the status bar. It may be used to convey notification-type messages that disappear automatically after a short stay without user interaction. For example, the notification manager is used to notify download completion, message reminders, and the like. The notification manager may also present notifications in the status bar at the top of the system in the form of a chart or scroll-bar text, such as notifications of applications running in the background, or notifications that appear on the screen in the form of a dialog window. For example, text information is prompted in the status bar, an alert tone sounds, the electronic device vibrates, or an indicator light flashes.
The system libraries may include multiple functional modules, for example, a surface manager, media libraries, a three-dimensional graphics processing library, and a 2D graphics engine (for example, SGL).

The surface manager is used to manage the display subsystem and provides the fusion of 2D and 3D layers for multiple applications.

The media libraries support playback and recording in multiple commonly used audio and video formats, as well as static image files. The media libraries may support multiple audio and video encoding formats.

The three-dimensional graphics processing library is used to implement three-dimensional graphics drawing, image rendering, composition, layer processing, and the like.

The 2D graphics engine is a drawing engine for 2D drawing.

The kernel layer is the layer between the hardware and the software. The kernel layer includes at least a display driver, a camera driver, an audio driver, a sensor driver, a Bluetooth driver, and a Wi-Fi driver.

It can be understood that the components included in the system framework layer, the system libraries, and the runtime layer shown in FIG. 2 do not constitute a specific limitation on the electronic device 100. In other embodiments of this application, the electronic device 100 may include more or fewer components than shown in the figure, or combine some components, or split some components, or arrange the components differently.
FIG. 3 is a schematic diagram of an exemplary text recognition scenario containing truncated text. Referring to (1) of FIG. 3, a picture 302 is displayed in the display interface 301 of a mobile phone. For example, the display interface 301 may be an application interface, for example, the interface of a system application such as the gallery application, or the application interface of a third-party application such as a chat application. That is, in the embodiments of this application, the system of the mobile phone may provide its own text recognition function (that is, the text recognition module in FIG. 2); for example, the gallery application may invoke the text recognition module of the mobile phone to perform text recognition on a picture. Optionally, a third-party application on the mobile phone may also provide its own text recognition function; the implementations of the text recognition functions of different third-party applications may be the same or different, which is not limited in this application. Optionally, a third-party application on the mobile phone may also invoke the text recognition module of the mobile phone, which is not limited in this application.

Still referring to (1) of FIG. 3, for example, the picture 302 includes both text and images (of course, the picture 302 may also include only text). It should be noted that the embodiments of this application are described using the text recognition scenario of a picture only as an example. In other embodiments, the method may also be applied to text recognition scenarios in an application interface; for example, the scenario may be text recognition on a page displayed by a browser application, which is not limited in this application.

Optionally, the picture 302 may be generated after the mobile phone performs a screenshot operation in response to a user operation; the picture 302 may also be generated by the mobile phone through the camera function; the picture 302 may also be a downloaded picture, or the like, which is not limited in this application.
For example, the text in the picture 302 includes multiple lines, where the first line and the last line displayed in the picture 302 are cut off by the border of the picture 302. In the embodiments of this application, this type of text is referred to as "truncated text". It should be noted that FIG. 3 uses vertically truncated text as an example only; the technical solutions in the embodiments of this application can equally be applied to recognition scenarios with horizontally truncated text and obliquely truncated text, and specific examples are described below. For example, the "vertically truncated text" described in the embodiments of this application is optionally text truncated perpendicular to the direction in which the text lines run. It can be understood that text lines are blocked by the upper and lower edges of the screen, or by certain fixed or frozen status bars, as the interface is scrolled up and down. For example, taking the picture 302 as a screenshot of a web page: while browsing, the user scrolls the page up and down, so the first line currently displayed on the page may be cut off by the upper edge of the page (which can also be understood as the upper border of the display frame). The user takes a screenshot of the currently displayed page, and the mobile phone takes the screenshot in response to the received user operation to generate the picture 302. The first line of text displayed in the picture 302 is then "vertically truncated text". For example, the "horizontally truncated text" described in the embodiments of this application is text truncated along the direction in which the text lines run; for example, photographing or scanning may cause a text line to be truncated horizontally. For example, "obliquely truncated text" is optionally text truncated in a direction at an angle to the direction in which the text lines run.
Still referring to (1) of FIG. 3, the user may long-press the picture 302. Referring to (2) of FIG. 3, for example, the application displays an option box 303 in response to the received long-press operation on the picture 302. Optionally, the option box 303 includes but is not limited to: a share option, a favorites option, an extract-text option 304, and so on. The position and size of the option box 303, and the number and names of the options it contains, are only illustrative examples and are not limited in this application.

For example, the user taps the extract-text option 304 to instruct the extraction of the text in the picture 302. In response to the received user operation, the mobile phone starts the text recognition function (as described above, the text recognition function may be the application's own text recognition function, or the text recognition function of the system invoked by the application, which is not limited in this application).
In the embodiments of this application, the text recognition function optionally adopts OCR technology. OCR technology mainly consists of two steps: the first step is text area detection, and the second step is text content recognition. For example, the text area detection step optionally detects at least one text area in an image, that is, identifies the areas of the image that contain text. For example, the text content recognition step optionally recognizes the text in the obtained text areas, that is, identifies the specific text content in each text area. For the detailed steps of text area detection and text content recognition, reference may be made to the related content in prior-art embodiments, which is not repeated in this application.
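As a rough sketch of this two-step pipeline, the following Python outline separates the two steps. Here `detect_text_regions` and `recognize_text` are hypothetical stand-ins for an OCR engine's detector and recognizer, not APIs defined by this application.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class TextLine:
    image: bytes   # the cropped region ("text image")
    content: str   # the recognized string ("text content")

def detect_text_regions(picture: bytes) -> List[bytes]:
    """Step 1 (hypothetical): return one cropped image per detected text area."""
    raise NotImplementedError

def recognize_text(region: bytes) -> str:
    """Step 2 (hypothetical): read the character string out of one text area."""
    raise NotImplementedError

def run_ocr(picture: bytes) -> List[TextLine]:
    # Detection first, then content recognition on every detected area.
    return [TextLine(img, recognize_text(img)) for img in detect_text_regions(picture)]
```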
Referring to (3) of FIG. 3, for example, the display interface 301 includes but is not limited to: the reduced picture 302 and a text recognition result display box 305. It should be noted that the interface layout of the display interface 301 in the embodiments of this application is only an illustrative example and is not limited in this application. For example, the text recognition result display box 305 includes but is not limited to: a "smudge to select text" option, the text recognition result, and other options. Optionally, the other options include but are not limited to: a "select all" option, a "search" option, a "copy" option, a "translate" option, and so on. Each of the other options may be used to process the text recognition result accordingly.
Still referring to (3) of FIG. 3, for example, the text recognition result in the text recognition result display box 305 is the result recognized by the text recognition function. However, in this example, because the first line of text in the picture 302 is truncated (for example, vertically truncated as described above), the first line of text is displayed incompletely. Accordingly, the result produced by the text recognition function may be inaccurate. For example, as shown in (3) of FIG. 3, the original first line of text on the web page is "首轮比赛,全**等人亮相时全场欢呼,5" (roughly, "In the first round, the whole audience cheered when 全** and the others appeared, 5"), but because the first line was cut off by the upper border while the page was being browsed, the first line of text in the screenshot picture 302 is truncated. When the application performs text recognition on the picture 302, the output for the first line is "日孔L贷,士红烤守八元的土从叮,5", which differs greatly from the original text and contains semantic and logical errors. For this type of recognition result, even techniques such as semantic inference cannot restore the original text, which affects the user experience. For example, the recognition result corresponding to an untruncated text line in the picture 302 (for example, the second line of text in the picture 302) does not differ from the original text.
The embodiments of this application provide a text recognition method. The method takes the text image and the text content as inputs to a model (which may be called a text recognition model or a text recognition network), and obtains the encoded information of each modality through the corresponding modality encoding. The text processing module fuses the encoded information corresponding to the text image with the encoded information corresponding to the text content, and the fused modality information serves as the attention input of a classification decoder and a correction decoder. In principle, the model implicitly takes both the image information (mainly the truncation situation) and the textual information (mainly the degree of semantic coherence) into comprehensive consideration, and uses high-dimensional multimodal semantic information to make more fine-grained decisions on different input combinations, thereby achieving a human-like, complex decision-making effect. Reflected in the final result, this complex decision yields three classification outcomes: direct filtering, for the case where occlusion makes the semantics uncorrectable; corrected output, for the case where occlusion makes the semantics incoherent but correctable; and direct output without correction, for the case where there is no occlusion, or there is occlusion but it does not affect the semantics. In other words, the text recognition method provided in the embodiments of this application offers a more human-like processing scheme. Under normal circumstances, if too much of the text is occluded, a user reading it with the naked eye cannot recognize the correct information, and the user can also judge that the content read from the truncated text is incorrect. If only a little of the text is occluded, the user can infer the occluded characters from the semantics. For unoccluded text, the user can read the corresponding content correctly. The technical solution in the embodiments of this application achieves this human-like reading effect: when the text is heavily occluded (that is, truncated), no result is output; when the occlusion is light, the corrected result is output; and when there is no occlusion, the corresponding text is output. As a result, the text recognition result presents only correct, semantically fluent content, while semantically erroneous results (that is, text content) are filtered out, improving the user experience.
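A minimal PyTorch-style sketch of the structure just described is given below. All layer types and sizes are assumptions chosen for illustration; the application does not specify the encoder architecture, and the two decoders are reduced here to single linear heads.

```python
import torch
import torch.nn as nn

class TextRecognitionNet(nn.Module):
    """Sketch: encode each modality, fuse, then classify and correct."""

    def __init__(self, d_model: int = 256, vocab_size: int = 8000):
        super().__init__()
        layer = lambda: nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.image_encoder = nn.TransformerEncoder(layer(), num_layers=2)
        self.text_encoder = nn.TransformerEncoder(layer(), num_layers=2)
        self.fusion = nn.TransformerEncoder(layer(), num_layers=2)
        self.classifier = nn.Linear(d_model, 3)          # filter / correct / output as-is
        self.corrector = nn.Linear(d_model, vocab_size)  # per-position corrected characters

    def forward(self, image_emb: torch.Tensor, text_emb: torch.Tensor):
        # image_emb: (B, Nv, d_model); text_emb: (B, Nt, d_model)
        fused = self.fusion(torch.cat([self.image_encoder(image_emb),
                                       self.text_encoder(text_emb)], dim=1))
        decision = self.classifier(fused[:, 0])  # first token drives the 3-way decision
        corrected = self.corrector(fused)        # correction branch attends over fused states
        return decision, corrected
```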
FIG. 4 is a schematic diagram of an exemplary application scenario of the text recognition method in the embodiments of this application. Referring to (1) of FIG. 4, taking the gallery application as an example, after the user taps the thumbnail corresponding to a picture 402 displayed in the gallery application, the gallery application may display the picture 402 in the display interface 401. Optionally, the display interface 401 also includes, but is not limited to, options (or controls) such as a share option and a favorites option.
For example, the gallery application may invoke the system's text recognition module and text processing module to perform text recognition and processing on the picture 402 (which may also be called the picture to be recognized or the image to be recognized). As described above, in the embodiments of this application, text recognition includes two parts: text area detection and text content recognition. Optionally, after receiving the user's operation of tapping the thumbnail corresponding to the picture 402, the text recognition module may perform the text area detection step to detect whether the picture 402 includes a text area. In this example, the picture 402 includes both pictures and text (of course, it may also include only text, which is not limited in this application). Accordingly, the text recognition module may detect at least one text area included in the picture 402. After the text recognition module detects that the picture 402 includes a text area, an "extract text in picture" option 403 may be displayed in the display interface 401. The user may tap the "extract text in picture" option 403 to instruct the extraction of the text content in the picture 402. In response to the received user operation, the gallery application performs text recognition on the picture 402 through the text recognition module, that is, performs the text content recognition step, to obtain the corresponding text content in each text area. In the embodiments of this application, the text processing module may further process the recognition results (including the text areas and the text content) obtained by the text recognition module. Referring to (2) of FIG. 4, the display interface 401 includes but is not limited to: the reduced picture 402 and an extracted-text display box 404. Optionally, the extracted-text display box 404 includes but is not limited to: a text recognition result display box 405 and other options. The other options include but are not limited to: a "smudge to select text" option, a "read full text aloud" option, a "select all" option, a "search" option, a "copy" option, a "translate" option, and so on. It should be noted that the layout of the controls in the display interfaces shown in the embodiments of this application is only an illustrative example and is not limited in this application. For example, the text recognition result display box 405 includes the text content recognized by the text recognition module. As shown in (2) of FIG. 4, in the embodiments of this application, for truncated text (such as the first line of text), the mobile phone does not display the corresponding text in the text recognition result display box 405. That is, for text recognition results that may contain semantic errors or garbled characters, the text processing module may simply not output (that is, not display) them, to avoid a large difference between the text recognition result and the original text. Still referring to (2) of FIG. 4, for untruncated text, the text processing module may display the corresponding text in the text recognition result display box 405. Optionally, in the embodiments of this application, the text processing module may also correct the text content recognized by the text recognition module to obtain the correct text (which may also be understood as text close to or identical to the original text), and output (that is, display in the text recognition result display box 405) the corrected result. That is, in the embodiments of this application, by filtering or correcting semantically erroneous text, the text recognition results displayed in the text recognition result display box 405 are semantically and logically correct and coherent, improving the user experience.
It should be noted that the embodiments of this application are described using the text recognition and processing scenario of a picture only as an example. In other embodiments, the method may also be applied to text recognition and processing scenarios in an application interface; for example, the scenario may be text recognition and processing of a page displayed by a browser application, which is not limited in this application.

It should be further noted that the picture 402 may be generated after the mobile phone performs a screenshot operation in response to a user operation; the picture 402 may also be generated by the mobile phone through the camera function; the picture 402 may also be a downloaded picture, or the like, which is not limited in this application.
It should be further noted that the embodiments of this application are described using only the scenario in which the gallery application invokes the text recognition module and the text processing module as an example. The steps performed by the text recognition module and the text processing module in the embodiments of this application may also be applied to other applications. For example, a chat application's own text recognition function may perform text recognition on a picture to be recognized and obtain the corresponding text recognition results; the chat application may then invoke the mobile phone's text processing module to further process the text recognition results. As another example, the chat application may also provide its own text recognition module and text processing module, and implement the steps performed by the text recognition module and the text processing module involved in the embodiments of this application. As yet another example, the chat application may also invoke both the text recognition module and the text processing module of the mobile phone, which is not limited in this application.

It should be further noted that the steps performed by the text recognition module described in the embodiments of this application are only illustrative examples. The steps performed by the text recognition module of the mobile phone and by a text recognition module built into an application may be the same or different; for details, reference may be made to prior-art embodiments, which is not limited in this application. For example, the text recognition module of the mobile phone may use OCR technology to perform text recognition and obtain the corresponding recognition results, including the text images and the text content (the concepts of text image and text content are explained below). The text recognition module in a chat application may use other technologies to perform text recognition and obtain corresponding recognition results, likewise including text images and text content. Optionally, the recognition results obtained by the chat application's text recognition module and by the mobile phone's text recognition module may be the same or different; for example, the mobile phone's text recognition module may recognize 5 text areas and obtain the corresponding text content, while the chat application's text recognition module may recognize 6 text areas and obtain the corresponding text content, which is not limited in this application. That is, the text processing module in the embodiments of this application may further process the recognition results of any text recognition module (whether of the mobile phone and/or of an application) to obtain results that meet the user's needs.
It should be further noted that the operations that trigger the text recognition and processing functions may be the same or different across applications. The user operation involved in this application (that is, tapping the "extract text" option) is only an illustrative example and is not limited in this application.

It should be further noted that the embodiments of this application are described using only the scenario in which the first line of text is truncated as an example. In other embodiments, the text recognition method in the embodiments of this application can equally be applied to scenarios that include truncation of the last line of text.

It should be further noted that the embodiments of this application use text truncated by a border as an example. In other embodiments, the truncation may also be caused by image occlusion or other reasons, which is not limited in this application.

In a possible implementation, the text recognition module may perform the text area detection step on the pictures in the gallery application while the mobile phone is in standby or the gallery application is in the background. That is, the text recognition module may perform the text area detection step on the pictures in the gallery application in advance, so that after the user taps a picture that includes a text area, the "extract text in picture" option box can be displayed immediately, improving the overall efficiency of text recognition and processing.
The text recognition method in the embodiments of this application is described in detail below with reference to the accompanying drawings. FIG. 5 is a schematic flowchart of an exemplary text recognition method. Referring to FIG. 5, the text recognition module may obtain the results recognized based on OCR technology, which include at least one text image and the text content corresponding to each text image. For example, FIG. 6 is a schematic diagram of exemplary text recognition. Referring to FIG. 6, the text recognition module performs text area detection on a picture 601 (that is, the picture 402; for a detailed description, refer to the picture 402, which is not repeated here) through OCR technology to obtain at least one text area. Specifically, text area detection can be understood as follows: after the OCR technology detects the areas of the picture 601 that contain text, it segments at least one text area of the picture 601 to obtain at least one text image (that is, the image corresponding to at least one text area of the picture 601). For example, as shown in FIG. 6, the text recognition module detects a text area 602a containing text in the picture 601; the text recognition module may segment the text area 602a (for example, along the dotted line) to obtain the image corresponding to the text area 602a, referred to as the text image 602a for short.

For example, the text recognition module may segment the areas of the picture 601 that contain text one after another; for example, it may obtain the image of a text area 603a, referred to as the text image 603a for short. The embodiments of this application use only the text area 602a and the text area 603a as examples; the text recognition module may obtain more text areas in the picture 601.
In a possible implementation, after recognizing a text area through OCR technology, the text recognition module may obtain the corresponding text image after processing such as affine or perspective transformation correction.

In another possible implementation, the size of a single text image may be the same as the size of the actual area occupied by the text content in the text image, or may be larger than the size of the actual area occupied by the text content. For example, for the text image 602a, the size of the text image is larger than the size of the area actually occupied by the text content; that is, there is a blank area between the border of the text image and the text content (that is, the edge of the text content).
Still referring to FIG. 6, the text recognition module may perform text content recognition on the at least one obtained text area (that is, text image) through OCR technology. Still taking the text image 602a and the text image 603a as examples, the text recognition module performs text content recognition on the text image 602a and obtains a text content recognition result 602b (which may also be called the text content 602b); that is, it recognizes the text content in the text image 602a as "日孔L贷,士红烤守八元的土从叮,5". The text recognition module continues to recognize the other text images to obtain the corresponding text content recognition results. For example, the text recognition module performs text content recognition on the text image 603a through OCR technology to obtain the corresponding text content recognition result 603b (which may also be called the text content 603b); that is, it recognizes the text content in the text image 603a as "位冠军也展示了高超的实力,第一轮107B" (roughly, "the champion also demonstrated superb strength; in the first round, 107B"). It should be noted that this embodiment uses only the text image 602a and the text image 603a as examples; the text recognition module may perform text content recognition on each obtained text image based on OCR technology to obtain the corresponding text content, which is not described one by one in this application. It should be further noted that the text recognition module may perform text content recognition on the text images in parallel or sequentially, which is not limited in this application.

Still referring to FIG. 5, for example, the text processing module obtains the recognition results produced by the text recognition module, including but not limited to: the text image 602a and the corresponding text content 602b, and the text image 603a and the corresponding text content 603b. The text processing module performs the flow in FIG. 5 on each text image input by the text recognition module and the text content corresponding to that text image. It should be noted that the text recognition module may output the recognition results to the text processing module for further processing after obtaining the images corresponding to all text areas of the image to be recognized (for example, the picture 601) and the corresponding text content. The text processing module may perform the flow in FIG. 5 on the obtained text images and text content one by one; it may also process multiple text images and text contents in parallel, which is not limited in this application. Optionally, the text recognition module may also output a text content and the corresponding text image to the text processing module for processing as soon as that text content is obtained, which is not limited in this application and is not repeated below.
Continuing to refer to FIG. 5, for example, taking the text image 602a and the text content 602b as an example, the text processing module passes the text image 602a and the text content 602b through an encoding model (which may also be called an encoding module) to obtain the image encoding information corresponding to the text image 602a and the text encoding information corresponding to the text content 602b. Optionally, the encoding model may include but is not limited to an image encoding model (which may be called an image encoding module) and a text encoding model (which may also be called a text encoding module). For example, the image encoding model may be used to encode the text image 602a to obtain the image encoding information corresponding to the text image 602a; that is, the image encoding model may encode a text image into machine-recognizable or machine-understandable semantic information. For example, the text encoding module may be used to encode the text content 602b to obtain the text encoding information; it can also be understood that the text encoding module encodes text content into machine-recognizable or machine-understandable semantic information.

It should be noted that the structures of the image encoding information and the text encoding information may adopt the corresponding encoder architecture according to the encoding process. The encoders described in the embodiments of this application are only illustrative examples and may be configured according to actual needs, which is not limited in this application.

It should be further noted that the text processing module may process the text image 602a and the text content 602b sequentially or in parallel, which is not limited in this application. For example, the text processing module may first process the text image 602a to obtain the image encoding information and then process the text content 602b to obtain the text encoding information; or it may first encode the text content 602b and then encode the text image 602a; or it may encode the text image 602a and the text content 602b at the same time, which is not limited in this application.
Still referring to FIG. 5, for example, still taking the text image 602a and the text content 602b as an example, the text processing module fuses the image encoding information corresponding to the text image 602a and the text encoding information corresponding to the text content 602b through a multimodal model (which may also be called a multimodal encoding module, a multimodal fusion module, or the like, which is not limited in this application) to obtain multimodal encoding information, which may also be called intermediate representation information.
For example, the text processing module corrects the intermediate representation information through a correction model (which may also be called a correction module), and the text processing module passes the intermediate representation information through a classification model (which may also be called a classification module) to classify the intermediate representation information and obtain a classification result. In the embodiments of this application, the classification results fall into three categories: filter, correct and output, and output directly. The filter category optionally means filtering out the text content, that is, not displaying the corresponding text content in the text recognition result. The correct-and-output category optionally means outputting the corrected text; it can also be understood that the text content may be corrected before being displayed in the text recognition result. The output-directly category optionally means displaying the text content in the text recognition result; that is, the text processing module may display the text content recognized by the text recognition module through OCR technology directly in the text recognition result. Taking the intermediate representation corresponding to the text image 602a and the text content 602b as an example: in one example, if the classification result of the intermediate representation information is the filter category, the text processing module filters out the text content 602b, that is, does not display the text content 602b in the text recognition result, so that semantically erroneous text does not affect the text recognition result. In another example, if the classification result of the intermediate representation information is the correct-and-output category, the text processing module may display the corrected result of the intermediate representation information in the text recognition result. In yet another example, if the classification result of the intermediate representation information is output directly, the text processing module displays the text content 602b in the text recognition result.
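The three classification outcomes can be consumed as in the following sketch; the class indices and names are assumptions for illustration only.

```python
FILTER, CORRECT_AND_OUTPUT, OUTPUT_DIRECTLY = 0, 1, 2  # assumed class indices

def render_line(decision: int, text_content: str, corrected_text: str):
    if decision == FILTER:
        return None                # heavy truncation: drop the line from the result
    if decision == CORRECT_AND_OUTPUT:
        return corrected_text      # light truncation: show the corrected text
    return text_content            # no truncation: show the OCR output unchanged
```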
Taking the text image 602a and the text content 602b as examples, each flow in FIG. 5 is described in detail below. FIG. 7 is a schematic diagram of exemplary text image encoding. Referring to FIG. 7, in the embodiments of this application, the process in which the text processing module (specifically, the image encoding model) encodes the image information of the text image 602a includes Patch Embedding and Positional Encoding, thereby converting the three-dimensional image information into two-dimensional image encoding information Ev.

It should be noted that, as described above, the structure of the encoding information obtained by encoding the text image and the text content (for example, two-dimensional encoding information) is determined by the architecture of the encoder, and the encoder architecture may be configured according to actual needs. For example, in other embodiments, the three-dimensional image information may also be converted into higher-dimensional or lower-dimensional image encoding information, which is not limited in this application and is not repeated below.

It should be further noted that the embodiments of this application describe encoding the image information of the text image through Patch Embedding and Positional Encoding only as an example. In other embodiments, other encoding methods may also be used, which is not limited in this application.
The specific process of Patch Embedding and Positional Encoding includes but is not limited to the following:

(1) The text processing module divides the text image 602a into N patches.
FIG. 8 is a schematic diagram of exemplary image information encoding. Referring to FIG. 8, optionally, the text processing module (specifically, the image encoding model, which is not repeated below) may resize the height of the text image 602a (or the width, or both the width and the height) so that the height of the text image 602a is adjusted to a preset pixel value. For example, the text processing module may adjust the height of the text image 602a to 32 pixels (or 64 pixels; this may be configured according to actual needs and is not limited in this application). Accordingly, the width of the text image 602a is scaled with the height in proportion (that is, according to the aspect ratio of the image 602a). As shown in FIG. 8, the embodiments of this application use the example in which the resized text image 602a has height H and width (which may also be called length) W. It should be noted that in other embodiments the text image may also be left unresized, which is not limited in this application.
Still referring to FIG. 8, for example, the text processing module divides the text image 602a into N Image Patches. In the embodiments of this application, assuming that the width of an Image Patch is w and its height is h, the number of Image Patches obtained by the text processing module is:

N = (H*W)/(h*w)      (1)
Optionally, the values of h and w may be the same or different; for example, both may be 16 pixels. They may be configured according to actual needs, which is not limited in this application.

Optionally, the value of N is a positive integer; for example, N may be obtained by rounding up.
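Putting the resize step and formula (1) together, a small numeric sketch follows; the patch size and preset height are the example values above, while the original image size is an assumption.

```python
import math

def patch_count(H: int, W: int, h: int = 16, w: int = 16) -> int:
    """Formula (1), rounded up so that N is a positive integer."""
    return math.ceil((H * W) / (h * w))

H0, W0 = 48, 600              # assumed original size of the text image
H = 32                        # preset height in pixels
W = round(W0 * H / H0)        # width scaled by the aspect ratio -> 400
N = patch_count(H, W)         # ceil(32 * 400 / (16 * 16)) = 50
```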
(2) The text processing module performs Patch Embedding on the N Image Patches.

FIG. 9 is a schematic diagram of exemplary image information encoding. Referring to FIG. 9, for example, the Patch Embedding flow includes but is not limited to the following steps:

Step a. The text processing module flattens each Image Patch to obtain the one-dimensional vector Pi corresponding to each Image Patch.
Specifically, the width of each Image Patch is w, its height is h, and its number of channels is c; accordingly, the size of each Image Patch is (h*w*c). The text processing module flattens the Image Patch to obtain a one-dimensional vector of length (h*w*c). For the i-th image block, this one-dimensional vector is denoted Pi, where Pi is expressed as:

Pi = [p1, p2, ……, p(h*w*c)]
For example, FIG. 10 is a schematic diagram of exemplary Image Patch flattening. Referring to FIG. 10, take the Image Patch 801 in FIG. 8 as an example. The size of the Image Patch 801 is (h*w*c). After the text processing module flattens the Image Patch 801, the corresponding one-dimensional vector P1 is obtained, where P1 is expressed as:

P1 = [p1, p2, ……, p(h*w*c)]
That is, P1 is a one-dimensional vector of length (h*w*c). Based on the above method, the text processing module may flatten each Image Patch to obtain N vectors Pi, that is, P1……Pn as shown in FIG. 9.
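Step a corresponds to a plain reshape; a NumPy sketch with the example patch size:

```python
import numpy as np

h, w, c = 16, 16, 3                  # patch height, width, and channel count
patch = np.zeros((h, w, c))          # one Image Patch of size (h, w, c)
P_i = patch.reshape(-1)              # flattened one-dimensional vector
assert P_i.shape == (h * w * c,)     # length h*w*c = 768
```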
Step b. The text processing module passes the N one-dimensional vectors Pi through a fully connected layer to obtain N one-dimensional tensors of a preset length.
For example, still referring to FIG. 9, the text processing module passes each of the N one-dimensional vectors Pi through a fully connected layer whose output length is embedding_size (which may be configured according to actual needs and is not limited in this application) to obtain N one-dimensional tensors Evi of length embedding_size, where Evi is expressed as:

Evi = [e1, e2, ……, e(embedding_size)]
For example, as shown in FIG. 9, the text processing module passes P1 through a fully connected layer of output length embedding_size to obtain a one-dimensional tensor Ev1 of length embedding_size, where Ev1 is expressed as:

Ev1 = [e1, e2, ……, e(embedding_size)]
Following the above method, the text processing module performs the same processing on all N one-dimensional vectors to obtain Ev1……Evn.

It should be noted that the embodiments of this application use a preset length of embedding_size only as an example. In other embodiments, the preset length may be another value, which depends on the fully connected layer used and is not limited in this application.
Step c. The text processing module arranges the N one-dimensional tensors Evi in order to obtain a two-dimensional tensor of dimension N*embedding_size.
For example, still referring to FIG. 9, the text processing module arranges the N one-dimensional tensors Ev1……Evn in order to obtain a two-dimensional tensor Ev0, where Ev0 is expressed as:

Ev0 = [Ev1; Ev2; ……; Evn]
Here, the dimension of Ev0 is (N*embedding_size).
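Steps b and c together map the N flattened patches to Ev0. In the sketch below, the shared fully connected layer is applied to all patches at once, which yields the (N, embedding_size) tensor directly; all sizes are illustrative.

```python
import torch
import torch.nn as nn

N, embedding_size = 50, 128
fc = nn.Linear(16 * 16 * 3, embedding_size)  # shared fully connected layer

P = torch.randn(N, 16 * 16 * 3)   # the N flattened vectors P1……Pn, stacked row-wise
E_v0 = fc(P)                      # steps b and c at once: shape (N, embedding_size)
```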
It should be noted that the image encoding method in the embodiments of this application is only an illustrative example. For example, in other embodiments, the text processing module may instead apply, to the Image Patches, a convolution kernel whose kernel size is (h*w), whose stride is h (or w), and whose number of output channels is embedding_size. The specific method may be configured according to actual needs; the purpose is to encode the N Image Patches into machine-encoded information with higher-level semantics.
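The convolutional variant mentioned above can be sketched as follows: applying a kernel of size (h, w) with stride (h, w) to the whole image produces the same per-patch projection as flattening plus a fully connected layer. The sizes are illustrative.

```python
import torch
import torch.nn as nn

h, w, c, embedding_size = 16, 16, 3, 128
patchify = nn.Conv2d(c, embedding_size, kernel_size=(h, w), stride=(h, w))

image = torch.randn(1, c, 32, 400)          # resized text image, H=32, W=400
patches = patchify(image)                   # shape (1, embedding_size, 32/h, 400/w)
E_v0 = patches.flatten(2).transpose(1, 2)   # (1, N, embedding_size), N = 2 * 25 = 50
```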
Optionally, in the embodiments of this application, the text processing module may concatenate (concat) Ev0 with a classification head Ecls to obtain a two-dimensional tensor Ev1. Optionally, the dimension of Ecls is (1, embedding_size); this dimension may be configured according to actual needs and is not limited in this application. Optionally, the classification head Ecls is a learnable parameter of the neural network.
For example, Ev1 may be expressed as:

Ev1 = [Ecls, Ev0]         (2)
For example, assume that the classification head Ecls is expressed as:

Ecls = [c1, c2, ……, c(embedding_size)]
Taking Ev0 from the above embodiment as an example, the text processing module concatenates Ev0 with Ecls to obtain Ev1 as given in formula (2); that is, a two-dimensional tensor whose first row is Ecls and whose remaining rows are the rows of Ev0.
Here, the dimension of Ev1 is (N+1, embedding_size).

It should be noted that the embodiments of this application use concatenation of Ev0 and Ecls only as an example. In other embodiments, other methods such as addition or fusion may also be used, which is not limited in this application.
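Formula (2) is a row-wise concatenation; a sketch with Ecls as a learnable parameter (sizes illustrative):

```python
import torch
import torch.nn as nn

N, embedding_size = 50, 128
E_v0 = torch.randn(N, embedding_size)
E_cls = nn.Parameter(torch.zeros(1, embedding_size))  # learnable classification head

E_v1 = torch.cat([E_cls, E_v0], dim=0)  # formula (2): shape (N + 1, embedding_size)
```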
(3) The text processing module performs Positional Encoding on Ev1.
For example, the text processing module adds the two-dimensional tensor Ev1 obtained above to a two-dimensional positional encoding Epos to obtain the image encoding information Ev. For example, the image encoding information Ev may be expressed as:

Ev = Ev1 + Epos         (3)
需要说明的是,位置编码的维度与上文处理后的结果的维度有关,本申请仅以二维为例进行说明,本申请不做限定。It should be noted that the dimension of the position coding is related to the dimension of the result after the above processing. This application only takes two dimensions as an example for explanation, and this application does not limit it.
举例说明,假设Epos表示为:
For example, assume that E pos is expressed as:
其中,Epos的维度为(N+1,embedding_size)。可选地,Epos为神经网络可学习参数。本申请实施例中,为方便表示,记Nv=N+1。Among them, the dimension of E pos is (N+1, embedding_size). Optionally, E pos is a learnable parameter of the neural network. In the embodiment of this application, for convenience of expression, N v =N+1 is recorded.
如图9所示,以上文中的Ev1为例,相应的,Ev1通过Positional Encoding,得到图像编码信息Ev表示为:
As shown in Figure 9, taking E v1 in the above example as an example, correspondingly, E v1 obtains the image encoding information E v through Positional Encoding, which is expressed as:
需要说明的是,本申请实施例中仅以图像编码与位置编码进行结合的方式为相加(add)为例进行说明,在其他实施例中还可以是其它结合方式,本申请不做限定。It should be noted that in the embodiment of the present application, the method of combining image coding and position coding is only added as an example for explanation. In other embodiments, other combination methods are also possible, and this application does not limit it.
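A minimal sketch of formula (3) under the learnable-position-code reading; N, embedding_size, and the random initialization are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Assumed sizes for illustration only.
N, embedding_size = 18, 768

E_v1 = torch.randn(N + 1, embedding_size)                 # result of the concat step
E_pos = nn.Parameter(torch.randn(N + 1, embedding_size))  # learnable position code

# Formula (3): combine image code and position code by addition.
E_v = E_v1 + E_pos                                        # image encoding information
```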
Figure 11 is a schematic diagram of exemplary text content encoding. Referring to Figure 11, in the embodiments of this application, the text processing module (specifically, a text encoding model; this will not be repeated below) performs text information encoding (which may also be called character information encoding) on the text content 602b through Word Embedding and Positional Encoding, thereby converting the character information into text encoding information (which may also be called character encoding information) with higher-level semantic features, denoted E_t.

It should be noted that the embodiments of this application describe text information encoding of the text content only through Word Embedding and Positional Encoding as an example. In other embodiments, other encoding methods may also be used, and this application does not limit it.
Figure 12 is a schematic diagram of an exemplary text information encoding process. Referring to Figure 12, the process includes but is not limited to the following steps:

(1) The text processing module performs word segmentation on the text content 602b.

As shown in Figure 12, exemplarily, the text processing module segments the text content 602b according to a preset character length, obtaining a segmentation result (which may also be called a segmentation sequence).

In the embodiments of this application, the preset character length is one character. That is, the text processing module treats each character (including punctuation marks) as one word, obtaining m words (for example, m is 18, i.e., the content is divided into 18 words), that is, a word-segmentation sequence w of length m, which can be expressed as:

w = [w_1, w_2, …, w_m]

It should be noted that in other embodiments the preset character length can also be set according to actual needs, for example two characters, which this application does not limit. Optionally, the preset character lengths may also be unequal; for example, "目形" may be divided into one word and "山" into another, which this application does not limit.

(2) The text processing module obtains the text serial number sequence corresponding to the segmentation sequence.

In the embodiments of this application, the text processing module may be preset with a text serial number table (which may also be called text serial number information, a character code table, etc.; this application does not limit the name). The text serial number table indicates the correspondence between text (words or characters) and serial numbers. For example, the serial number corresponding to "目" in the text serial number table is "12"; for another example, the serial number corresponding to "关系" is "52". The correspondence between text and serial numbers can be set according to actual needs and is not limited in this application. It should be noted that the correspondence between text and serial numbers may be saved as a table or in other ways, which this application does not limit.

Optionally, the text contained in the text serial number table may cover a dictionary, or any book in a professional field, etc., which this application does not limit.

As shown in Figure 12, exemplarily, the text processing module may look up, based on the text serial number table, the serial number (which may also be called the text serial number) corresponding to each segment (character or word) in the segmentation sequence w, obtaining the text serial number sequence n, which can be expressed as:

n = [n_1, n_2, …, n_m]
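To make steps (1) and (2) concrete, the following minimal Python sketch segments a string character by character and looks up each character in a toy serial-number table; the table contents are invented for illustration and do not reflect any actual character code table:

```python
# A toy text serial-number table; the real table could cover a whole dictionary.
vocab = {"火": 1, "山": 2, "爆": 3, "发": 4, "。": 5}

text = "火山爆发。"
w = list(text)                     # segmentation sequence, one character per word
n = [vocab[token] for token in w]  # text serial-number sequence
print(w)                           # ['火', '山', '爆', '发', '。']
print(n)                           # [1, 2, 3, 4, 5]
```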
(3) The text processing module passes the text serial number sequence n through word embedding, obtaining the two-dimensional tensor E_t0.

Exemplarily, the text processing module passes the text serial number sequence n through an embedding layer, obtaining the two-dimensional tensor E_t0, which can be expressed as:

E_t0 = Embedding(n)         (4)

For example, in the embodiments of this application, the dimension of the two-dimensional tensor E_t0 is (m, embedding_size).

It should be noted that the dimension of E_t0 is related to the embedding layer and is not limited in this application.
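A minimal sketch of formula (4); vocab_size and embedding_size are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Assumed sizes for illustration only.
vocab_size, embedding_size = 100, 768

embedding = nn.Embedding(vocab_size, embedding_size)

# Formula (4): map the serial-number sequence n to the two-dimensional tensor E_t0.
n = torch.tensor([1, 2, 3, 4, 5])  # serial numbers from the previous step
E_t0 = embedding(n)                # (m, embedding_size), here m = 5
```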
(4) The text processing module adds E_t0 to the position code, obtaining the text information encoding E_t.

Exemplarily, as shown in Figure 12, the text processing module adds E_t0 to the position code E_pos′, obtaining the text information encoding E_t, which can be expressed as:

E_t = E_t0 + E_pos′         (5)

It should be noted that the dimension of the position code is related to the dimension of the result of the preceding processing. This application takes two dimensions only as an example for description and does not limit it.

For example, assume E_pos′ is a tensor of dimension (m, embedding_size). Optionally, E_pos′ is a learnable parameter of the neural network. In the embodiments of this application, for convenience of expression, N_t = m.

As shown in Figure 12, taking the E_t0 above as an example, the text processing module adds E_t0 and E_pos′, obtaining the text information encoding E_t as in formula (5).

It should be noted that the embodiments of this application use addition (add) as the way of combining the character code and the position code only as an example for description. In other embodiments, other combinations are also possible, and this application does not limit it.

Exemplarily, the positional encoding in the embodiments of this application may be an embedding layer with learnable parameters, similar to BERT Positional Embedding, or a positional encoding based on sine/cosine transforms, similar to the native Transformer architecture. It can be set according to actual needs and is not limited in this application.
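As a hedged illustration of the second option named above, the sketch below computes the sine/cosine position code in the style of the native Transformer and applies it as in formula (5); the sequence length and embedding_size are assumptions, and the learnable variant would simply be an nn.Parameter or nn.Embedding of the same shape:

```python
import math
import torch

def sinusoidal_position_code(m: int, embedding_size: int) -> torch.Tensor:
    """Sine/cosine positional encoding in the style of the native Transformer."""
    position = torch.arange(m, dtype=torch.float32).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, embedding_size, 2, dtype=torch.float32)
                         * (-math.log(10000.0) / embedding_size))
    E_pos = torch.zeros(m, embedding_size)
    E_pos[:, 0::2] = torch.sin(position * div_term)  # even columns
    E_pos[:, 1::2] = torch.cos(position * div_term)  # odd columns
    return E_pos

# Formula (5) with the sine/cosine variant; m = 5 and embedding_size = 768 are
# illustrative assumptions.
E_t0 = torch.randn(5, 768)
E_t = E_t0 + sinusoidal_position_code(5, 768)
```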
Still referring to Figure 5, exemplarily, after the text processing module obtains the image encoding information and the character encoding information, it can obtain intermediate representation information based on them. Figure 13 is a schematic flowchart of an exemplary process for obtaining intermediate representation information. Referring to Figure 13, the process specifically includes but is not limited to the following steps:

(1) The text processing module performs feature fusion on the image encoding information E_v and the text encoding information E_t, obtaining the mixed semantic encoding E_m (which may also be called mixed encoding information; this application does not limit the name).

Exemplarily, the text processing module (specifically, a multi-modal encoding model; this will not be repeated below) concatenates the image encoding information E_v and the text encoding information E_t, obtaining the mixed semantic encoding E_m, which can for example be expressed as:

E_m = [E_v, E_t]         (6)

For example, combining the image encoding information E_v and the text encoding information E_t described above, the dimension of the mixed semantic encoding E_m is (N_v + N_t, embedding_size).

It should be noted that the embodiments of this application use concatenation as the fusion method of the image encoding information E_v and the text encoding information E_t only as an example for description. In other embodiments, other methods such as addition may also be used, and this application does not limit it.
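A minimal sketch of the concatenation in formula (6); N_v, N_t, and embedding_size are illustrative assumptions:

```python
import torch

# Assumed sizes for illustration: N_v = N + 1 image positions, N_t = m text positions.
N_v, N_t, embedding_size = 19, 5, 768

E_v = torch.randn(N_v, embedding_size)  # image encoding information
E_t = torch.randn(N_t, embedding_size)  # text encoding information

# Formula (6): concatenate along the sequence dimension to get the mixed encoding.
E_m = torch.cat([E_v, E_t], dim=0)      # (N_v + N_t, embedding_size)
```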
(2) The text processing module passes the mixed semantic encoding E_m through a multi-modal encoder, obtaining multi-modal encoding information (i.e., intermediate representation information).

Figure 14a is a schematic diagram of exemplary multi-modal encoding. Referring to Figure 14a, the text processing module passes the mixed semantic encoding E_m through the multi-modal encoder 1301, obtaining the multi-modal encoding information (i.e., intermediate representation information), denoted E_IR. Exemplarily, the multi-modal encoder can also be understood as extracting, based on the input mixed encoding information, high-dimensional semantic information that fuses the image information and the text information.

Optionally, the multi-modal encoder (Encoder) 1301 is composed of stacked Transformer Encoders, for example, L of them. Each Transformer Encoder mainly consists of a multi-head attention layer (Multi-Head Attention layer), layer normalization (Layer Normalization) (i.e., (Norm) in Figure 14a), and a feed-forward neural network (i.e., Feed Forward in Figure 14a).

Figure 14b is a schematic diagram of the processing flow of the multi-modal encoder 1301. Referring to Figure 14b, the embodiments of this application are described with a stack count L of 3 as an example; that is, the multi-modal encoder 1301 includes multi-modal encoder 1301a, multi-modal encoder 1301b, and multi-modal encoder 1301c. It should be noted that the number of encoders described in the embodiments of this application is only an illustrative example, can be set according to actual needs, and is not limited in this application. Exemplarily, the mixed semantic encoding E_m passes through multi-modal encoder 1301a, which produces an output result. The output result of multi-modal encoder 1301a then serves as the input of multi-modal encoder 1301b, which continues the encoding. Multi-modal encoder 1301b encodes based on the output of multi-modal encoder 1301a, and its output in turn serves as the input of multi-modal encoder 1301c. Multi-modal encoder 1301c encodes based on the output of multi-modal encoder 1301b, and its output is the multi-modal encoding information E_IR, which can be expressed as:

E_IR = TE(TE(TE(E_m)))         (7)

where TE denotes a single multi-modal encoder within the multi-modal encoder 1301. The dimension of the multi-modal encoding information E_IR is (N_v + N_t, embedding_size).

It should be noted that for the internal processing flow of each layer in the multi-modal encoder 1301, reference may be made to the relevant content of prior-art embodiments, which this application does not repeat.
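A minimal sketch of formula (7) using stacked Transformer encoder layers; the head count, feed-forward width, and input shape are illustrative assumptions, with only the stack count L = 3 taken from the text:

```python
import torch
import torch.nn as nn

# Assumed hyper-parameters; the text fixes only the stack count L = 3 here.
embedding_size, n_heads = 768, 8

encoder_layer = nn.TransformerEncoderLayer(d_model=embedding_size, nhead=n_heads,
                                           batch_first=True)
multimodal_encoder = nn.TransformerEncoder(encoder_layer, num_layers=3)

# Formula (7): E_IR = TE(TE(TE(E_m))).
E_m = torch.randn(1, 24, embedding_size)  # (batch, N_v + N_t, embedding_size)
E_IR = multimodal_encoder(E_m)            # multi-modal encoding information
```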
It should further be noted that the embodiments of this application describe only a Transformer Encoder as the multi-modal encoder as an example. In other embodiments, the multi-modal encoder may also be, for example, a bidirectional recurrent neural network, or a simpler convolutional neural network encoder; it can be set according to actual needs and is not limited in this application.

It should further be noted that the way the text processing module obtains the multi-modal encoding is not limited to concatenating the image encoding information and the text encoding information and passing the result through a multi-modal encoder. In other embodiments, the text processing module may also pass the image encoding information and the character encoding information through their respective encoders first and then fuse the results. For example, the text processing module passes the image encoding information through an image encoder to obtain high-dimensional image semantic information, and passes the text encoding information through a text encoder to obtain high-dimensional text semantic information. The text processing module then aligns the dimensions of the high-dimensional image semantic information and the high-dimensional text semantic information and concatenates them, obtaining the intermediate representation information. The specific method can be set according to actual needs; its purpose is to obtain high-dimensional image semantic features and text semantic features.
Continuing to refer to Figure 5, exemplarily, the text processing module (specifically, a classification model; this will not be repeated below) may classify the intermediate representation information, so as to decide, based on the classification result, whether to output the text content 602b.

Figure 14c is a schematic diagram of an exemplary classification process. Referring to Figure 14c, exemplarily, the text processing module may pass the multi-modal encoding information (i.e., intermediate representation information) through the classification model, obtaining a classification result. Exemplarily, the classification model may include but is not limited to a classification decoder and an argmax layer (or a softmax layer). The embodiments of this application are described with the classification decoder being a fully connected layer, and the fully connected layer being an MLP (Multi-layer Perceptron), as an example. Exemplarily, the MLP may include multiple hidden layers. It should be noted that the embodiments of this application use only a fully connected layer (for example, an MLP) as the classification decoder as an example for description. In other embodiments, the classification decoder may also be another decoder, for example including but not limited to a Transformer Decoder or a recurrent neural network (Recurrent Neural Network, RNN) Decoder; it can be set according to actual needs and is not limited in this application. Its purpose is to output the corresponding classification result based on the input intermediate representation information. It should further be noted that the embodiments of this application use only an argmax layer as an example for description; in other embodiments, an argmax layer together with a softmax layer may also be used, set according to actual needs and not limited in this application. Its purpose is to output the classification item corresponding to the maximum score.

Optionally, in the embodiments of this application, the classification results include but are not limited to three classification items:

(a) filter

(b) correct and then output

(c) output directly

After the multi-modal encoding information passes through the classification decoder, the resulting classification result includes the scores corresponding to the three classification items. The text processing module may pass the scores corresponding to the three classification items through the argmax layer or the softmax layer, obtaining the final decision category.
By way of example: as stated above, the dimension of the multi-modal encoding information E_IR is (N_v + N_t, embedding_size). Optionally, in the embodiments of this application, the first row of E_IR (i.e., a slice along its first dimension) may be taken, obtaining a one-dimensional tensor E_IR0 of length embedding_size.

The text processing module passes the one-dimensional tensor E_IR0 through a fully connected layer, outputting a one-dimensional tensor T_out of length 3 (i.e., the same as the number of classification items). Optionally, the fully connected layer may be an MLP, and the MLP may include multiple hidden layers. Correspondingly, T_out can be expressed as:

T_out = MLP(E_IR0)         (8)

T_out has dimension 3, which can be understood as T_out containing the scores corresponding to the three classification items a, b, and c above; for example, it can be expressed as:

T_out = [f(a), f(b), f(c)]

where f(a) is the score corresponding to classification item a (i.e., the filter classification item), f(b) is the score corresponding to classification item b (i.e., the correct-then-output classification item), and f(c) is the score corresponding to classification item c (i.e., the direct-output classification item).

Exemplarily, the text processing module passes T_out through the argmax layer, outputting the classification item corresponding to the maximum score. It should be noted that the embodiments of this application use only an MLP as the fully connected layer as an example for description. In other embodiments, other decoders may also be used, for example including but not limited to a Transformer Decoder or an RNN Decoder, set according to actual needs and not limited in this application; their purpose is to output the corresponding classification result based on the input intermediate representation information. Likewise, an argmax layer together with a softmax layer may be used instead of the argmax layer alone, its purpose being to output the classification item corresponding to the maximum score.
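A minimal sketch of formula (8) and the argmax step; the MLP hidden width, the input shape, and the English labels are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Assumed sizes; the hidden width of the MLP is an illustrative choice.
embedding_size = 768
mlp = nn.Sequential(nn.Linear(embedding_size, 256), nn.ReLU(), nn.Linear(256, 3))

E_IR = torch.randn(24, embedding_size)    # multi-modal encoding information
E_IR0 = E_IR[0]                           # first row, length embedding_size

T_out = mlp(E_IR0)                        # formula (8): scores [f(a), f(b), f(c)]
labels = ["filter", "correct and output", "output directly"]
decision = labels[T_out.argmax().item()]  # argmax layer picks the decision category
```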
In one example, if f(a) is the maximum value, the output result is a; that is, the classification result is the filter classification item. Correspondingly, the text processing module can filter out the corresponding text content, i.e., not display it in the text recognition result. For example, if, while processing the text image 602a and the text content 602b, the text processing module detects that the classification result is category a (the filter classification item), the text processing module filters out the text content 602b. Then, as shown in (2) of Figure 4, the text recognition result does not include the truncated first line of text, thereby preventing an erroneous recognition result for the truncated text from affecting the user experience.

In another example, if f(c) is the maximum value, the output result is c; that is, the classification result is the direct-output classification item. In other words, the result recognized by the OCR technology is correct. Correspondingly, the text processing module can display the corresponding text content in the text recognition result. For example, while processing the text image 603a and the text content 603b, the text processing module detects that the classification result is category c, i.e., the direct-output classification item. The text processing module determines that the text content 603b can be output directly; as shown in (2) of Figure 4, the text processing module can display the text content 603b at the corresponding position in the text recognition result.

In yet another example, if f(b) is the maximum value, the output result is b; that is, the classification result is the correct-then-output classification item. This can be understood as the result recognized by the OCR technology containing some errors, which need to be corrected before output. As described above (i.e., in Figure 5), every piece of multi-modal encoding information (i.e., intermediate representation information) obtained by the text processing module is passed through the correction module for correction. After the text processing module detects that the classification result corresponding to a given piece of multi-modal encoding information is the correct-then-output classification item, the text processing module can display the text content corrected by the correction module in the text recognition result. It should be noted that if the classification result is category a or category c, the text processing module discards (or ignores) the correction result output by the correction module.
Figure 15 is a schematic diagram of exemplary text correction. Referring to Figure 15, the embodiments of this application are described with the correction module including a Transformer Decoder as an example. The text processing module passes the multi-modal encoding information (i.e., intermediate representation information) through the Transformer Decoder 1501, a fully connected layer, and an argmax layer, obtaining the corrected text content.

It should be noted that in other embodiments the correction module may also use other architectures, for example including but not limited to: a forward decoder based on a recurrent neural network, a BERT Decoder architecture, a decoder similar to stepwise monotonic attention, etc., set according to actual needs and not limited in this application. The purpose in each case is to correct the input intermediate representation information, obtaining the corrected text.

Referring to Figure 15, the Transformer Decoder 1501 includes Q stacked Transformer Decoders, where Q may be a positive integer greater than 0. A single Transformer Decoder can be denoted TD; a single TD includes but is not limited to: a masked multi-head attention layer, a multi-head attention layer, layer normalization (i.e., (Norm) in Figure 15), and a feed-forward neural network (i.e., Feed Forward in Figure 15). For the specific processing details of each layer, reference may be made to the relevant content of prior-art embodiments, which this application does not repeat.

Optionally, in the Transformer Decoder architecture, the K vector and V vector of the Transformer Decoder are the multi-modal encoding information (i.e., the output of the Encoder), and the Q vector is the output of the masked multi-head attention layer.
Figure 16 is a schematic diagram of the processing flow of the correction module. Referring to Figure 16, exemplarily, assume that the OCR recognition result obtained by the text processing module includes text content and a text image, where the text content is "火山暴发"; that is, the character "暴" has been recognized incorrectly (the original text is "火山爆发", "volcanic eruption"). Based on the method in the above embodiments, the text processing module obtains the multi-modal encoding information corresponding to the text content and the text image. Furthermore, based on the multi-modal encoding information, the text processing module obtains the corresponding classification result, which is the correct-then-output classification item. For specific details, refer to the description above; they are not repeated here. Referring to Figure 16, exemplarily, the text processing module inputs the multi-modal encoding information into the Transformer Decoder 1501 as the K vector and V vector, and the start symbol <s>, after Output Embedding and Positional Encoding, is input into the Transformer Decoder 1501 as the Q vector. Figure 17 is a schematic diagram of the processing flow of the Transformer Decoder.

Optionally, the Output Embedding may be Word Embedding; for its specific implementation, refer to the method in the above embodiments or to implementations in other prior-art embodiments, which this application does not repeat.

Exemplarily, assume that the stack count Q of the Transformer Decoder 1501 in the embodiments of this application is 2 (it can be set according to actual needs and is not limited in this application), comprising Transformer Decoder 1501a and Transformer Decoder 1501b. Exemplarily, the text processing module inputs the multi-modal encoding information into Transformer Decoder 1501a as the K vector and V vector, and the start symbol <s>, after Output Embedding and Positional Encoding, is input into Transformer Decoder 1501a as the Q vector. The output of Transformer Decoder 1501a serves as the Q-vector input of Transformer Decoder 1501b, and the multi-modal encoding information again serves as the K vector and V vector input to Transformer Decoder 1501b. The output of Transformer Decoder 1501b is denoted E_dout1. E_dout1 passes through the fully connected layer, obtaining E_out1, where the dimension of E_out1 is (seq_len, N_vocab). Optionally, the text processing module slices E_out1 along its first dimension and takes the final position, obtaining a one-dimensional tensor of length N_vocab. The text processing module passes this one-dimensional tensor through the argmax layer (or argmax and softmax layers, set according to actual needs and not limited in this application). Here, N_vocab is optionally the number of texts included in the text serial number table; for example, if the dictionary includes 100 words and their serial numbers, the value of N_vocab is 100. Exemplarily, the value of seq_len is the number of output characters; for example, in the embodiments of this application the number of output characters is 5, comprising "火", "山", "爆", "发", and the terminator <end>. Exemplarily, the value output by the argmax layer indicates a serial number in the dictionary, and the text processing module can determine the corresponding character or word based on the serial number. In this example, the text processing module determines that the corresponding character is "火". That is, by passing the multi-modal encoding information and the start symbol <s> through the Transformer Decoder 1501, the character "火" is obtained.

Still referring to Figure 16, the multi-modal encoding information serves as the K vector and V vector, and the character "火" together with the start symbol <s> is input into the Transformer Decoder 1501 as the Q vector. Optionally, the character "火" and the start symbol <s> pass through Output Embedding and Positional Encoding and are input into Transformer Decoder 1501a as the Q vector. Based on the multi-modal encoding information, the character "火", and the start symbol <s>, the Transformer Decoder 1501 outputs E_dout2. E_dout2 passes through the fully connected layer, obtaining E_out2, and E_out2 passes through the argmax layer, obtaining the corresponding value. Based on this value, the text processing module can determine the corresponding character, for example "山". That is, by passing the multi-modal encoding information, the character "火", and the start symbol <s> through the Transformer Decoder 1501, the character "山" is obtained. For details not described here, refer to the description above of obtaining the character "火"; they are not repeated.

Continuing to refer to Figure 16, the multi-modal encoding information serves as the K vector and V vector, and the characters "火" and "山" together with the start symbol <s>, after Output Embedding and Positional Encoding, are input into the Transformer Decoder 1501 as the Q vector. Based on these inputs, the Transformer Decoder 1501 outputs E_dout3, which passes through the fully connected layer, obtaining E_out3; E_out3 passes through the argmax layer, obtaining the corresponding value. Based on this value, the text processing module can determine the corresponding character, for example "爆". Thus the erroneous character "暴" in the OCR recognition result is corrected to "爆". For details not described here, refer to the description above of obtaining the character "火"; they are not repeated.

Continuing to refer to Figure 16, the multi-modal encoding information serves as the K vector and V vector, and the characters "火", "山", and "爆" together with the start symbol <s>, after Output Embedding and Positional Encoding, are input into the Transformer Decoder 1501 as the Q vector. Based on these inputs, the Transformer Decoder 1501 outputs E_dout4, which passes through the fully connected layer, obtaining E_out4; E_out4 passes through the argmax layer, obtaining the corresponding value. Based on this value, the text processing module can determine the corresponding character, for example "发". For details not described here, refer to the description above of obtaining the character "火"; they are not repeated.

Continuing to refer to Figure 16, the multi-modal encoding information serves as the K vector and V vector, and the characters "火", "山", "爆", and "发" together with the start symbol <s>, after Output Embedding and Positional Encoding, are input into the Transformer Decoder 1501 as the Q vector. Based on these inputs, the Transformer Decoder 1501 outputs E_dout5, which passes through the fully connected layer, obtaining E_out5; E_out5 passes through the argmax layer, obtaining the corresponding value. The text processing module determines that the output result is the terminator <end>, which ends the loop.
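As a hedged sketch of the autoregressive loop of Figures 16 and 17: at each step the tokens decoded so far (starting from <s>) form the Q-vector input, and the multi-modal encoding information supplies the K and V vectors. The vocabulary size, the <s>/<end> ids, and the hyper-parameters are illustrative assumptions, and the positional encoding of the decoded tokens is omitted to keep the sketch short:

```python
import torch
import torch.nn as nn

# Assumed sizes and token ids for illustration only.
embedding_size, n_vocab, max_len = 768, 100, 20
START, END = 0, 1                                    # <s> and <end> ids

output_embedding = nn.Embedding(n_vocab, embedding_size)
decoder_layer = nn.TransformerDecoderLayer(d_model=embedding_size, nhead=8,
                                           batch_first=True)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=2)  # Q = 2 stacked TDs
fc = nn.Linear(embedding_size, n_vocab)              # fully connected layer

E_IR = torch.randn(1, 24, embedding_size)            # K and V vectors (memory)
tokens = [START]                                     # start from the symbol <s>
for _ in range(max_len):
    tgt = output_embedding(torch.tensor([tokens]))   # Q-vector input so far
    E_dout = decoder(tgt=tgt, memory=E_IR)           # (1, len(tokens), embedding_size)
    E_out = fc(E_dout)                               # (1, len(tokens), n_vocab)
    next_id = E_out[0, -1].argmax().item()           # final position through argmax
    if next_id == END:                               # terminator <end> ends the loop
        break
    tokens.append(next_id)                           # feed back for the next step
```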
Exemplarily, after the text processing module detects that the classification result is b, i.e., the correct-then-output classification item, it can obtain the correction result output by the correction module, namely "火山爆发". The text processing module displays the obtained correction result in the recognition result.

It should be noted that the models involved in the embodiments of this application, including but not limited to the image encoding model, the text encoding model, the multi-modal encoding model, the classification model, and the correction model, may together form a text processing model, which can also be understood as a neural network. During training of the text processing model, the input data are mainly text images (containing both truncated and untruncated samples) and the corresponding recognized text content. Each text image input to the model and its corresponding text content form one pair of training samples. Each pair of training samples can be labeled manually, the labels being the three categories mentioned above; that is, the input text images and text content are classified through manual annotation. In particular, for cases that can be corrected, the text to be corrected is manually revised to obtain the corrected text, which serves as the supervision data for the output of the text correction decoder. Optionally, the training of the text processing model is supervised training: the loss function of the classification decoder (i.e., the classification model) is the categorical cross-entropy loss, while the text correction decoder (i.e., the correction model) is trained similarly to the native Transformer autoregressive decoder, using teacher forcing at each time step. Because the two decoders share the encoder (i.e., the backbone of the text processing model's neural network), the actual training process is joint training.
In one possible implementation, as described above, truncated text may also include horizontally truncated text and obliquely truncated text. It should be noted that for horizontally truncated text, for example where the first character of each line is truncated, the text recognition module can usually predict the text content during OCR-based recognition and thereby obtain the correct text. That is, horizontally truncated text generally may not suffer from the semantic errors of vertically truncated text described above. Correspondingly, when the solution of the embodiments of this application is applied to horizontally truncated text, it can likewise process it accordingly, and the processed result may differ little from the recognition result of the OCR technology. Obliquely truncated text is similar to horizontally truncated text: for text lines with a small oblique angle (for example, less than or equal to 10°), OCR technology can obtain the correct text content through prediction and similar means. That is, after processing by the solution of the embodiments of this application, the output differs little from the OCR recognition result. For text with a larger oblique angle (for example, greater than 10°), OCR technology may not be able to recognize the whole text area. For example, as shown in Figure 18a, assuming the text line is at an angle of 30°, the text area recognized during OCR text-area detection includes only the part shown within the dotted line. When the OCR technology then recognizes the text content of the detected text area, its prediction function allows it to output text content consistent with the original. It can also be understood that for text with a large oblique angle, the corresponding recognition result may not suffer from semantic errors.

It should be noted that the technical solution in the embodiments of this application can effectively solve the problem of semantic errors in the recognition results of partially occluded text. In the embodiments of this application, "partial occlusion" is optionally the occlusion of the upper part of all characters in an entire line of text, such as the scene in (1) of Figure 4 where the first line of text is occluded. In one example, "partial occlusion" is optionally the occlusion of the lower part of an entire line of text. For example, Figure 18b is a schematic diagram of one exemplary application scenario: the image to be recognized includes a text line whose lower part is cut off, and the text processing module can likewise process the OCR recognition result corresponding to that text line based on the solution described in the above embodiments. In another example, "partial occlusion" is optionally the occlusion of the upper part (or the lower part, or any part) of some of the characters in a line. For example, Figure 18c is a schematic diagram of another exemplary application scenario: part of the text of a line in the image to be recognized is occluded. That is, the original text is "多模态编码信息（中间表征信息）", and the portion "中间表征信息" is partially occluded. Optionally, when the text recognition module performs OCR recognition on this text line, it may obtain multiple text areas. For example, as shown in Figure 18d, the text recognition module may recognize the text area corresponding to "多模态编码信息" and the text area corresponding to the occluded "（中间表征信息）", together with the text content corresponding to the two text areas. The text processing module can then apply the processing solution of the embodiments of this application to the images of the two text areas and the corresponding text content. Optionally, when the text recognition module performs OCR recognition on the text line, it may also obtain a single text area; for example, as shown in Figure 18e, the text recognition module may assign the occluded part and the unoccluded part of the text to the same text area. The embodiments of this application can likewise process the image and text content of this kind of text area.

In other words, the technical solution of the embodiments of this application can be applied to a variety of scenes in which text is occluded, thereby meeting the needs of text recognition in different scenarios. Optionally, the embodiments of this application can effectively solve the text recognition problem for text lines with an occlusion rate of 20% to 50% (the range may also vary somewhat; this application does not limit it). It should be noted that, as stated above, if the occlusion rate of a text line is too high (for example, 80%), the corresponding text area may not be detected at the OCR stage; and if the occlusion rate is low, the OCR recognition result may be correct, in which case the text processing module can output the corresponding text content directly or after correction.
Figure 19 is a schematic flowchart of another text recognition method provided by an embodiment of this application. Referring to Figure 19, the method includes but is not limited to:

(1) The text processing module passes the text image through a classification model, obtaining a classification result.

(2) Based on the classification result, the text processing module determines whether the text content is truncated.

Exemplarily, the text processing module may preprocess the text image; for example, the preprocessing may be resizing the text image. For specific details, refer to the relevant content of the above embodiments, which is not repeated here.
Exemplarily, Figure 20 is a schematic diagram of exemplary text image processing. Referring to Figure 20 and still taking the text image 602a above as an example, the text processing module inputs the text image 602a (or the preprocessed text image) into the classification model. The classification model can classify the text image 602a and obtain a classification result.

Optionally, the training data used by the classification model in the training phase include but are not limited to text images corresponding to truncated text and text images corresponding to untruncated text.

Optionally, training of the classification model can be supervised with a cross-entropy loss function.

Optionally, the classification model may include but is not limited to mainstream classification networks based on convolutional neural networks (Convolutional Neural Network, CNN) (for example VGG, ResNet, EfficientNet, etc.), or the ViT (Vision Transformer) classification model based on the Transformer structure and its variants. Its main purpose is to output the probabilities of a binary classification problem, i.e., the scores corresponding to the truncated and non-truncated classification items.
Exemplarily, denoting the classification model CLS, the output result of the classification model (which may also be called the classification result) can be expressed as:

score = CLS(I)         (9)

where I denotes the text image; the text image has parameters in three dimensions: width, height, and number of channels. For the specific concepts, refer to the relevant content of Figure 10, which is not repeated here.

Optionally, the output score is a value between 0 and 1, where the closer the value is to 1, the higher the probability of truncation. The text processing module can set a truncation threshold, for example 0.5, which can be set according to actual needs and is not limited in this application. In one example, if the output score is greater than or equal to the truncation threshold (0.5), the text content corresponding to the text image is determined to be truncated text. In another example, if the output score is less than the truncation threshold (0.5), the text content corresponding to the text image is determined to be non-truncated text.
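A minimal sketch of formula (9) and the thresholding step. The tiny CNN below merely stands in for the VGG/ResNet/EfficientNet/ViT-style networks named above and is purely an illustrative assumption, as are the image dimensions:

```python
import torch
import torch.nn as nn

# A stand-in binary classifier; not any network specified by this application.
cls_model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(16, 1), nn.Sigmoid(),
)

I = torch.randn(1, 3, 32, 256)  # text image: batch, channels, height, width
score = cls_model(I).item()     # formula (9): score = CLS(I), a value in (0, 1)
is_truncated = score >= 0.5     # compare against the truncation threshold
```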
(3) Output the text.

Exemplarily, if the text processing module determines that the text content corresponding to the text image is non-truncated text, it can directly output the corresponding text content, i.e., display it in the recognition result. For parts not described here, refer to the relevant content of the above embodiments, which is not repeated.

(4) The text processing module passes the text content through a semantic model, obtaining a semantic judgment result.

Exemplarily, if the text processing module determines that the text content corresponding to the text image is truncated text, the text processing module inputs the text content corresponding to the text image into the semantic model (which may also be called a semantic judgment module).

Exemplarily, Figure 21 shows the processing flow of the semantic model. Referring to Figure 21, the processing flow of the semantic model includes but is not limited to the following steps:
a. The text processing module segments the text content into words, obtaining a segmentation result.

Exemplarily, still taking the text content 602b of the above embodiment as an example, the text processing module (specifically, the semantic model) segments the text content 602b and obtains the corresponding serial-number sequence of the segmentation. For the specific steps of segmentation and of obtaining the text serial number sequence, refer to the relevant content of the above embodiments, which is not repeated here.

b. The text processing module passes the segmentation result through Word Embedding and Positional Encoding, obtaining E_text.

Exemplarily, the text processing module (specifically, the semantic model) passes the obtained text serial number sequence through Word Embedding and Positional Encoding, obtaining the text encoding information E_text. For specific details, refer to the description of Figure 12 in the above embodiments, which is not repeated here.

c. The text processing module passes E_text through an encoding module, obtaining F_text.

Exemplarily, by passing E_text through an encoding module (i.e., an encoder (Encoder)), the text processing module can obtain encoded information with high-dimensional semantic features, i.e., F_text. The encoding module includes but is not limited to: a CNN encoder, an RNN encoder, a BiRNN (bidirectional recurrent neural network) encoder (such as a bidirectional LSTM (Long Short-Term Memory network)), a Transformer Encoder, etc., which this application does not limit. For the processing flow of the encoder, refer to the relevant descriptions of Figures 14a and 14b, which are not repeated here; in this implementation, E_text takes the place of the multi-modal encoding information in Figures 14a and 14b.
For example, denoting the encoder as Encoder, F_text can be expressed as:
F_text = Encoder(E_text)       (10)
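Continuing the sketch, step c with a bidirectional LSTM encoder (one of the options listed above) might look as follows; the hidden size is an illustrative assumption:

```python
import torch
import torch.nn as nn

class BiLSTMEncoder(nn.Module):
    """Encodes E_text into F_text, i.e. F_text = Encoder(E_text), formula (10)."""
    def __init__(self, d_model: int = 256, hidden: int = 128):
        super().__init__()
        # Bidirectional LSTM: forward and backward states are concatenated,
        # so the output size is 2 * hidden (= d_model with these defaults).
        self.lstm = nn.LSTM(d_model, hidden, batch_first=True, bidirectional=True)

    def forward(self, e_text: torch.Tensor) -> torch.Tensor:
        f_text, _ = self.lstm(e_text)  # (batch, seq_len, 2 * hidden)
        return f_text

e_text = torch.randn(1, 4, 256)   # stands in for the E_text computed above
f_text = BiLSTMEncoder()(e_text)
```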
d. The text processing module passes F_text through the decoding module to obtain the output score score_t (which is the semantic judgment result).
For example, denoting the decoding module (i.e., the decoder) as Decoder, score_t can be expressed as:
score_t = Decoder(F_text)       (11)
Optionally, the decoding module includes but is not limited to: an MLP (i.e., fully connected) decoder, a CNN decoder, an RNN decoder and a Transformer decoder, which may be set according to actual needs; this application does not limit it. For the specific processing flow of the decoding module, refer to the relevant content of Figure 15, Figure 16 and Figure 17; it is not repeated here. Optionally, since the output result in this example, score_t, is the result of a binary classification problem (it can be understood as indicating that the semantics are coherent, or that they are incoherent), the decoder may omit the argmax layer. In other embodiments an argmax layer may also be included; this application does not limit it.
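Step d with an MLP decoder might be sketched as follows; the mean-pooling step and the layer sizes are assumptions, and, as noted above, a sigmoid output is used instead of an argmax layer for the binary coherence decision:

```python
import torch
import torch.nn as nn

class MLPDecoder(nn.Module):
    """Maps F_text to score_t in (0, 1), i.e. score_t = Decoder(F_text), formula (11)."""
    def __init__(self, d_model: int = 256, hidden: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_model, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, f_text: torch.Tensor) -> torch.Tensor:
        pooled = f_text.mean(dim=1)  # mean-pool over the sequence positions
        return torch.sigmoid(self.mlp(pooled)).squeeze(-1)  # one score_t per line

f_text = torch.randn(1, 4, 256)  # stands in for the F_text computed above
score_t = MLPDecoder()(f_text)   # e.g. tensor([0.53])
```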
Illustratively, in this example the input of the semantic model is mainly a single line or string, and the output is a category (i.e., the semantically coherent type or the semantically incoherent type). During training, the semantic model collects corpus data, and each item is manually annotated to determine whether its semantics are coherent. Optionally, the semantic model can also obtain positive and negative training samples through data generation and other means.
Illustratively, similar to the classification model, the score_t output by the decoding module can be used to indicate semantic coherence. For example, score_t may be a value greater than 0 and less than 1, and the text processing module can set a semantic coherence threshold, for example 0.5, which may be set according to actual needs; this application does not limit it.
In one example, if the value of score_t is greater than or equal to the semantic coherence threshold (i.e., 0.5), the text processing module can determine that the corresponding text content is semantically coherent. That is to say, the OCR recognition result for the truncated text is correct; correspondingly, the text processing module can directly output the text content, i.e., display the corresponding text content in the text recognition result.
In another example, if the value of score_t is less than the semantic coherence threshold (i.e., 0.5), the text processing module can determine that the corresponding text content is semantically incoherent. That is to say, the OCR recognition result for the truncated text contains a semantic error, and the text processing module continues with step (5).
It should be noted that the way the semantic coherence model detects the semantic coherence of text in the embodiments of this application is only an illustrative example. In other embodiments, the text processing module can also detect semantic coherence in other ways. For example, it can be based on a grammatical error checking model: the grammatical error checking model outputs, from the input text content, a candidate set of grammatical error positions, and a threshold judgment is made on the ratio of the candidate set size to the total number of tokens (minimal semantic units). As another example, the text processing module can use a forward language model to obtain the probability of each token, and make the judgment against a preset threshold based on the average probability. For details, refer to the relevant content of prior-art embodiments; they are not repeated here.
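For instance, the forward-language-model variant mentioned above might be sketched as follows; the model name "gpt2" is a hypothetical stand-in, since the application does not specify which language model is used:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def mean_token_logprob(text: str, model_name: str = "gpt2") -> float:
    """Average per-token log-probability under a forward (causal) language model.
    Higher (closer to 0) suggests more coherent text; a preset threshold decides."""
    tok = AutoTokenizer.from_pretrained(model_name)
    lm = AutoModelForCausalLM.from_pretrained(model_name)
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = lm(ids).logits                     # (1, seq_len, vocab)
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    # Log-probability the model assigns to each actual next token.
    token_lp = logprobs.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_lp.mean().item()
```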
(5) The text processing module determines whether the text content can be corrected.
In this embodiment of the application, the text processing module can continue, based on the result output by the semantic model, to further determine whether the text content can be corrected. For example, the text processing module can set a correction threshold, for example 0.2, which may be set according to actual needs; this application does not limit it.
In one example, if the value of score_t is greater than or equal to the correction threshold (i.e., 0.2), the text processing module can determine that the corresponding text content is correctable, and the text processing module can correct the text content and then output it. For example, the text processing module can use the text content as the input of the correction module and perform the correction through the correction module; for the processing flow of the correction module, refer to the relevant content of Figure 15, Figure 16 and Figure 17, which is not repeated here.
In another example, if the value of score_t is less than the correction threshold (i.e., 0.2), the text processing module can determine that the corresponding text content is not correctable; the text processing module then filters out that text content, i.e., the corresponding text content is not displayed in the text recognition result.
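Taken together, the decision logic of steps (4) and (5) can be summarized in the following sketch; the 0.5 and 0.2 thresholds are the example values given above, and `correct` stands for the correction module:

```python
from typing import Callable, Optional

COHERENCE_THRESHOLD = 0.5   # example value from step (4)
CORRECTION_THRESHOLD = 0.2  # example value from step (5)

def handle_truncated_text(text: str, score_t: float,
                          correct: Callable[[str], str]) -> Optional[str]:
    """Returns the text to display in the recognition result, or None to filter it."""
    if score_t >= COHERENCE_THRESHOLD:
        return text            # coherent: output the OCR result as-is
    if score_t >= CORRECTION_THRESHOLD:
        return correct(text)   # incoherent but correctable: output corrected text
    return None                # not correctable: filter out of the result
```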
It should be noted that the way of determining correctability in step (5), i.e., detection based on the semantic coherence output, is only an illustrative example. In other embodiments, the text processing module can also detect whether the text content is correctable based on other detection methods. For example, as described above, the text processing module can perform the semantic coherence judgment with a grammatical error checking model; it can then further determine, based on that model's output, the number of grammatical errors or the proportion of characters involved in grammatical errors, and judge from that proportion whether the text content is correctable. As another example, as described above, the semantic coherence judgment can compute an average probability from a forward language model, and the text processing module can judge from that average probability (for example, by setting a corresponding correction threshold) whether the text content is to be corrected.
It should further be noted that, in addition to correcting the text content based on the correction method described in the embodiments of this application (which can also be understood as a neural machine translation method), the text processing module can adopt other correction methods. For example, based on the output of the grammatical error checking model, the text content can be corrected through confusion-set recall and candidate ranking. As another example, the text processing module can, based on the output of the grammatical error checking model, call a statistical language model, a neural language model, or BERT's bidirectional language model to obtain a confusion set for each error position, and then recall the corrected text through candidate ranking and error-screening mechanisms. For specific implementations, refer to the relevant content of prior-art embodiments; they are not repeated here.
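The following is a minimal sketch of the confusion-set recall and candidate ranking approach described above; the confusion sets, the error positions and the scoring function are all assumed inputs (the scoring function could be, for example, the mean_token_logprob sketch given earlier):

```python
from typing import Callable, Dict, List

def correct_by_confusion_set(
    tokens: List[str],
    error_positions: List[int],            # flagged by a grammar-error-checking model
    confusion_sets: Dict[str, List[str]],  # token -> plausible replacements
    score: Callable[[str], float],         # sentence scorer, e.g. a language model
) -> str:
    """Tries single-token substitutions at each flagged position and keeps the
    highest-scoring sentence (confusion-set recall followed by candidate ranking)."""
    best = "".join(tokens)
    best_score = score(best)
    for pos in error_positions:
        for cand in confusion_sets.get(tokens[pos], []):
            trial = "".join(tokens[:pos] + [cand] + tokens[pos + 1:])
            s = score(trial)
            if s > best_score:
                best, best_score = trial, s
    return best
```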
It should further be noted that the models involved in the steps of Figure 19 can form a neural network; for the training method of this neural network, refer to the descriptions of neural network training in the foregoing embodiments, which are not repeated here.
In one example, Figure 22 shows a schematic block diagram of an apparatus 2200 according to an embodiment of this application. The apparatus 2200 may include: a processor 2201 and a transceiver/transceiver pin 2202, and optionally a memory 2203.
The components of the apparatus 2200 are coupled together through a bus 2204, where the bus 2204 includes, in addition to a data bus, a power bus, a control bus and a status signal bus. For the sake of clarity, however, the various buses are all referred to as the bus 2204 in the figure.
Optionally, the memory 2203 may be used to store the instructions of the foregoing method embodiments. The processor 2201 may be configured to execute the instructions in the memory 2203, to control the receive pin to receive signals, and to control the transmit pin to send signals.
The apparatus 2200 may be the electronic device in the foregoing method embodiments, or a chip of that electronic device.
All relevant content of the steps involved in the foregoing method embodiments can be cited in the functional descriptions of the corresponding functional modules, and is not repeated here.
This embodiment further provides a computer storage medium storing computer instructions which, when run on an electronic device, cause the electronic device to perform the above related method steps to implement the method in the foregoing embodiments.
This embodiment further provides a computer program product which, when run on a computer, causes the computer to perform the above related steps to implement the method in the foregoing embodiments.
In addition, an embodiment of this application further provides an apparatus, which may specifically be a chip, a component or a module; the apparatus may include a processor and a memory connected to each other, where the memory is configured to store computer-executable instructions. When the apparatus runs, the processor can execute the computer-executable instructions stored in the memory, so that the chip performs the methods in the foregoing method embodiments.
The electronic device, computer storage medium, computer program product or chip provided in this embodiment is used to perform the corresponding method provided above. Therefore, for the beneficial effects it can achieve, refer to the beneficial effects of the corresponding method provided above; they are not repeated here.
Those skilled in the art should be aware that, in one or more of the above examples, the functions described in the embodiments of this application may be implemented by hardware, software, firmware, or any combination thereof. When implemented in software, these functions may be stored in a computer-readable medium or transmitted as one or more instructions or code on a computer-readable medium. Computer-readable media include computer storage media and communication media, where communication media include any medium that facilitates the transfer of a computer program from one place to another. A storage medium may be any available medium accessible to a general-purpose or special-purpose computer.
The term "and/or" in this document merely describes an association relationship between associated objects and indicates that three relationships may exist. For example, A and/or B may indicate three cases: A exists alone, both A and B exist, or B exists alone.
The terms "first" and "second" in the specification and claims of the embodiments of this application are used to distinguish different objects, not to describe a particular order of objects. For example, a first target object and a second target object are used to distinguish different target objects, not to describe a particular order of target objects.
In the embodiments of this application, words such as "exemplary" or "for example" are used to indicate an example, illustration or explanation. Any embodiment or design described as "exemplary" or "for example" in the embodiments of this application should not be construed as more preferred or advantageous than other embodiments or designs. Rather, the use of words such as "exemplary" or "for example" is intended to present the relevant concept in a concrete manner.
In the description of the embodiments of this application, unless otherwise specified, "a plurality of" means two or more. For example, a plurality of processing units means two or more processing units, and a plurality of systems means two or more systems.
The embodiments of this application have been described above with reference to the accompanying drawings, but this application is not limited to the specific implementations described above. The specific implementations described above are merely illustrative, not restrictive. Inspired by this application, a person of ordinary skill in the art can devise many other forms without departing from the purpose of this application and the scope protected by the claims, all of which fall within the protection of this application.

Claims (27)

  1. A text recognition method, characterized by comprising:
    an electronic device performs text area detection on an object to be recognized to obtain an image of a first text area, the first text area including text content;
    the electronic device performs text content recognition on the first text area to obtain first text content;
    the electronic device performs classification based on the image of the first text area and the first text content to obtain a classification result;
    the electronic device displays, based on the classification result, a text recognition result of the first text area;
    wherein the electronic device displaying, based on the classification result, the text recognition result of the first text area comprises:
    if the classification result is a first classification, the text recognition result filters out the first text content; if the classification result is a second classification, the text recognition result includes text content obtained after the first text content is corrected; and if the classification result is a third classification, the text recognition result includes the first text content.
  2. The method according to claim 1, wherein the electronic device performing classification based on the image of the first text area and the first text content to obtain the classification result comprises:
    the electronic device obtains intermediate representation information based on the image of the first text area and the first text content;
    the electronic device classifies the intermediate representation information to obtain the classification result.
  3. The method according to claim 2, wherein the electronic device classifying the intermediate representation information to obtain the classification result comprises:
    the electronic device classifies the intermediate representation information through a classification model to obtain the classification result.
  4. The method according to claim 3, wherein before the electronic device displays, based on the classification result, the text recognition result of the first text area, the method further comprises:
    the electronic device corrects the intermediate representation information to obtain the text content obtained after the first text content is corrected.
  5. The method according to claim 4, wherein the electronic device correcting the intermediate representation information to obtain the corrected target text content comprises:
    the electronic device corrects the intermediate representation information through a correction model to obtain the text content obtained after the first text content is corrected.
  6. The method according to claim 5, wherein the electronic device obtaining the intermediate representation information based on the image of the first text area and the first text content comprises:
    the electronic device performs image encoding on the image of the first text area to obtain first image encoding information;
    the electronic device performs text encoding on the first text content to obtain first text encoding information;
    the electronic device performs multi-modal encoding on the first image encoding information and the first text encoding information through a multi-modal encoding model to obtain the intermediate representation information.
  7. The method according to claim 6, wherein the multi-modal encoding model, the classification model and the correction model form a neural network, and training data of the neural network includes a second text area and second text content corresponding to the second text area, and a third text area and third text content corresponding to the third text area; the second text area includes partially missing text content, and the text content in the third text area is complete text content.
  8. The method according to claim 1, wherein the text recognition result of the first text area is displayed in a text recognition area, and the text recognition area further includes text content corresponding to a third text area in the object to be recognized.
  9. The method according to claim 1, wherein, if the first text area includes partially missing text content, the text recognition result is the first classification or the second classification.
  10. The method according to claim 9, wherein semantics expressed by the first text content are different from semantics expressed by the text content in the first text area.
  11. The method according to any one of claims 1 to 10, wherein the object to be recognized is a picture, a web page or a document.
  12. A text recognition method, characterized by comprising:
    an electronic device performs text area detection on an object to be recognized to obtain an image of a first text area, the first text area including text content;
    the electronic device performs text content recognition on the first text area to obtain first text content;
    the electronic device displays, based on the image of the first text area and the first text content, a text recognition result of the first text area;
    wherein the electronic device displaying, based on the image of the first text area and the first text content, the text recognition result of the first text area comprises:
    if the image of the first text area indicates that the first text area includes partially missing text content and the first text content is semantically coherent text content, or the image of the first text area indicates that the first text area does not include partially missing text content, the text recognition result includes the first text content; if the image of the first text area indicates that the first text area includes partially missing text content and the first text content includes semantically erroneous text content, the text recognition result filters out the first text content or the text recognition result includes text content obtained after the first text content is corrected.
  13. The method according to claim 12, wherein the electronic device displaying, based on the image of the first text area and the first text content, the text recognition result of the first text area comprises:
    if the image of the first text area indicates that the first text area includes partially missing text content, and the first text content includes semantically incoherent text content, the electronic device detects whether the first text content can be corrected;
    if the first text content cannot be corrected, the text recognition result filters out the first text content;
    if the first text content can be corrected, the text recognition result includes text content obtained after the first text content is corrected.
  14. The method according to claim 13, wherein, if the first text content can be corrected, the method further comprises:
    the electronic device corrects the first text content through a correction model to obtain the text content obtained after the first text content is corrected.
  15. The method according to claim 14, wherein the electronic device displaying, based on the image of the first text area and the first text content, the text recognition result of the first text area comprises:
    the electronic device classifies the image of the first text area through a classification model to obtain a classification result, the classification result being used to indicate whether the first text area includes partially missing text content.
  16. The method according to claim 15, wherein, if the image of the first text area indicates that the first text area includes partially missing text content, the electronic device displaying, based on the image of the first text area and the first text content, the text recognition result of the first text area comprises:
    the electronic device performs semantic analysis on the first text content through a semantic model to obtain a semantic analysis result, the semantic analysis result being used to indicate whether the first text content includes semantically erroneous text content.
  17. The method according to claim 16, wherein the semantic analysis result is further used to indicate whether the first text content can be corrected, and the electronic device displaying, based on the image of the first text area and the first text content, the text recognition result of the first text area comprises:
    the electronic device determines, based on the semantic analysis result, whether the first text content can be corrected.
  18. The method according to claim 17, wherein the correction model, the classification model and the semantic model form a neural network, and training data of the neural network includes a second text area and second text content corresponding to the second text area, and a third text area and third text content corresponding to the third text area; the second text area includes partially missing text content, and the text content in the third text area is complete text content.
  19. The method according to claim 12, wherein the text recognition result of the first text area is displayed in a text recognition area, and the text recognition area further includes text content corresponding to a third text area in the object to be recognized.
  20. The method according to claim 9, wherein semantics expressed by the semantically erroneous text content are different from semantics expressed by the corresponding text content in the first text area.
  21. The method according to any one of claims 12 to 20, wherein the object to be recognized is a picture, a web page or a document.
  22. An electronic device, characterized by comprising:
    one or more processors;
    a memory;
    and one or more computer programs, wherein the one or more computer programs are stored in the memory, and when the computer programs are executed by the one or more processors, the electronic device is caused to perform the method according to any one of claims 1 to 11.
  23. An electronic device, characterized by comprising:
    one or more processors;
    a memory;
    and one or more computer programs, wherein the one or more computer programs are stored in the memory, and when the computer programs are executed by the one or more processors, the electronic device is caused to perform the method according to any one of claims 12 to 21.
  24. A computer-readable storage medium, characterized by comprising a computer program which, when run on an electronic device, causes the electronic device to perform the method according to any one of claims 1 to 11.
  25. A computer-readable storage medium, characterized by comprising a computer program which, when run on an electronic device, causes the electronic device to perform the method according to any one of claims 12 to 21.
  26. A computer program product, characterized by comprising a computer program which, when executed by an electronic device, causes the electronic device to perform the method according to any one of claims 1 to 11.
  27. A computer program product, characterized by comprising a computer program which, when executed by an electronic device, causes the electronic device to perform the method according to any one of claims 12 to 21.
PCT/CN2023/096921 2022-05-30 2023-05-29 Text recognition method and electronic device WO2023231987A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210597895.6 2022-05-30
CN202210597895.6A CN117197811A (en) 2022-05-30 2022-05-30 Text recognition method and electronic equipment

Publications (1)

Publication Number Publication Date
WO2023231987A1 true WO2023231987A1 (en) 2023-12-07

Family

ID=88987403

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/096921 WO2023231987A1 (en) 2022-05-30 2023-05-29 Text recognition method and electronic device

Country Status (2)

Country Link
CN (1) CN117197811A (en)
WO (1) WO2023231987A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117478435A (en) * 2023-12-28 2024-01-30 中汽智联技术有限公司 Whole vehicle information security attack path generation method and system

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150269135A1 (en) * 2014-03-19 2015-09-24 Qualcomm Incorporated Language identification for text in an object image
CN110059694A (en) * 2019-04-19 2019-07-26 山东大学 The intelligent identification Method of lteral data under power industry complex scene
CN111090991A (en) * 2019-12-25 2020-05-01 北京百度网讯科技有限公司 Scene error correction method and device, electronic equipment and storage medium
CN111275038A (en) * 2020-01-17 2020-06-12 平安医疗健康管理股份有限公司 Image text recognition method and device, computer equipment and computer storage medium
CN113128494A (en) * 2019-12-30 2021-07-16 华为技术有限公司 Method, device and system for recognizing text in image
CN114140782A (en) * 2021-11-26 2022-03-04 北京奇艺世纪科技有限公司 Text recognition method and device, electronic equipment and storage medium
CN114419646A (en) * 2022-01-17 2022-04-29 马上消费金融股份有限公司 Image classification method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN117197811A (en) 2023-12-08

Similar Documents

Publication Publication Date Title
WO2022142014A1 (en) Multi-modal information fusion-based text classification method, and related device thereof
US11024058B2 (en) Encoding and decoding a stylized custom graphic
US10140549B2 (en) Scalable image matching
US9721156B2 (en) Gift card recognition using a camera
US10354199B2 (en) Transductive adaptation of classifiers without source data
US9436883B2 (en) Collaborative text detection and recognition
CN111465918B (en) Method for displaying service information in preview interface and electronic equipment
US11893767B2 (en) Text recognition method and apparatus
US20240031644A1 (en) Video playback device and control method thereof
WO2023231987A1 (en) Text recognition method and electronic device
CN111242273B (en) Neural network model training method and electronic equipment
KR102122561B1 (en) Method for recognizing characters on document images
WO2022227218A1 (en) Drug name recognition method and apparatus, and computer device and storage medium
CN111754414B (en) Image processing method and device for image processing
WO2024103775A1 (en) Answer generation method and apparatus, and storage medium
CN114154467B (en) Structure picture restoration method, device, electronic equipment, medium and program product
CN116994169A (en) Label prediction method, label prediction device, computer equipment and storage medium
US20210073458A1 (en) Comic data display system, method, and program
CN115691486A (en) Voice instruction execution method, electronic device and medium
US12124696B2 (en) Electronic device and method to provide sticker based on content input
US20220326846A1 (en) Electronic device and method to provide sticker based on content input
US20240031655A1 (en) Video Playback Method, Terminal Device, Apparatus, System, and Storage Medium
WO2024187949A1 (en) Image description generation method and electronic device
US20240046616A1 (en) Apparatus and method for classifying immoral images using deep learning technology
US20230156317A1 (en) Electronic device for obtaining image at user-intended moment and method for controlling the same

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23815166

Country of ref document: EP

Kind code of ref document: A1